don't run dup-to generated packets through pf_test in pf_route{,6}

2021-01-26 Thread David Gwynne
this was discussed as part of the big route-to issues thread. i think
it's easy to break out and handle separately now.

the diff does what the subject line says. it seems to work as expected
for me. i don't see weird state issues anymore when i dup my ssh session
out over a tunnel interface.

sasha suggested setting PF_TAG_GENERATED on the duplicated packet, but i
didn't set it in this diff. the reason is that i can't see
PF_TAG_GENERATED get cleared anywhere. this means that if you dup-to a
host over a tunnel (eg, gif, gre, etc), the encapsulated packet still
has that tag, which means pf doesn't run against the encapsulated
packet.

ok?

Index: pf.c
===================================================================
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1101
diff -u -p -r1.1101 pf.c
--- pf.c	19 Jan 2021 22:22:23 -0000	1.1101
+++ pf.c	27 Jan 2021 01:21:24 -0000
@@ -6039,7 +6041,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (r->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)
@@ -6194,7 +6195,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (r->rt != PF_DUPTO && pd->kif->pfik_ifp != ifp) {
if (pf_test(AF_INET6, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)



Re: tiny pf_route{,6} tweak

2021-01-26 Thread David Gwynne
On Wed, Jan 27, 2021 at 11:13:12AM +1000, David Gwynne wrote:
> when pf_route (and pf_route6) are supposed to handle forwarding the
> packet (ie, for route-to or reply-to rules), they take the mbuf
> away from the calling code path. this is done by clearing the mbuf
> pointer in the pf_pdesc struct. it doesn't do this for dup-to rules
> though.
> 
> at the moment pf_route clears that pointer on the way out, but it could
> take the mbuf away up front in the same place that it already checks if
> it's a dup-to rule or not.
> 
> it's a small change. i've bumped up the number of lines of context so
> it's easier to read too.
> 
> ok?

sigh. here's the diff with the extra context.

Index: pf.c
===================================================================
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1101
diff -u -p -U8 -r1.1101 pf.c
--- pf.c	19 Jan 2021 22:22:23 -0000	1.1101
+++ pf.c	27 Jan 2021 01:10:52 -0000
@@ -5983,16 +5983,17 @@ pf_route(struct pf_pdesc *pd, struct pf_
 
if (r->rt == PF_DUPTO) {
if ((m0 = m_dup_pkt(pd->m, max_linkhdr, M_NOWAIT)) == NULL)
return;
} else {
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip)) {
DPFPRINTF(LOG_ERR,
"%s: m0->m_len < sizeof(struct ip)", __func__);
goto bad;
}
 
@@ -6103,18 +6104,16 @@ pf_route(struct pf_pdesc *pd, struct pf_
else
m_freem(m0);
}
 
if (error == 0)
ipstat_inc(ips_fragmented);
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
 bad:
m_freem(m0);
goto done;
 }
 
@@ -6141,16 +6140,17 @@ pf_route6(struct pf_pdesc *pd, struct pf
 
if (r->rt == PF_DUPTO) {
if ((m0 = m_dup_pkt(pd->m, max_linkhdr, M_NOWAIT)) == NULL)
return;
} else {
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip6_hdr)) {
DPFPRINTF(LOG_ERR,
"%s: m0->m_len < sizeof(struct ip6_hdr)", __func__);
goto bad;
}
ip6 = mtod(m0, struct ip6_hdr *);
@@ -6232,18 +6232,16 @@ pf_route6(struct pf_pdesc *pd, struct pf
ip6stat_inc(ip6s_cantfrag);
if (r->rt != PF_DUPTO)
pf_send_icmp(m0, ICMP6_PACKET_TOO_BIG, 0,
ifp->if_mtu, pd->af, r, pd->rdomain);
goto bad;
}
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
 bad:
m_freem(m0);
goto done;
 }
 #endif /* INET6 */



tiny pf_route{,6} tweak

2021-01-26 Thread David Gwynne
when pf_route (and pf_route6) are supposed to handle forwarding the
packet (ie, for route-to or reply-to rules), they take the mbuf
away from the calling code path. this is done by clearing the mbuf
pointer in the pf_pdesc struct. it doesn't do this for dup-to rules
though.

at the moment pf_route clears that pointer on the way out, but it could
take the mbuf away up front in the same place that it already checks if
it's a dup-to rule or not.

it's a small change. i've bumped up the number of lines of context so
it's easier to read too.

ok?

Index: pf.c
===================================================================
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1101
diff -u -p -r1.1101 pf.c
--- pf.c	19 Jan 2021 22:22:23 -0000	1.1101
+++ pf.c	27 Jan 2021 01:05:29 -0000
@@ -5988,6 +5988,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip)) {
@@ -6108,8 +6109,6 @@ pf_route(struct pf_pdesc *pd, struct pf_
ipstat_inc(ips_fragmented);
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
@@ -6146,6 +6145,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if ((r->rt == PF_REPLYTO) == (r->direction == pd->dir))
return;
m0 = pd->m;
+   pd->m = NULL;
}
 
if (m0->m_len < sizeof(struct ip6_hdr)) {
@@ -6237,8 +6237,6 @@ pf_route6(struct pf_pdesc *pd, struct pf
}
 
 done:
-   if (r->rt != PF_DUPTO)
-   pd->m = NULL;
rtfree(rt);
return;
 
 



Re: [External] : Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 05:38:40PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> On Mon, Jan 25, 2021 at 03:21:29PM +0100, Alexander Bluhm wrote:
> > Hi,
> > 
> > Some personal thoughts.  I am happy when pf route-to gets simpler.
> > Especially I have never understood what this address@interface
> > syntax is used for.
> > 
> > I cannot estimate what configuration is used by our cutomers in
> > many installations.  Simple syntax change address@interface ->
> > address of next hob should be no problem.  Slight semantic changes
> > have to be dealt with.  Current packet flow is complicated and may
> > be inspired by old NAT behavior.  As long it becomes more sane and
> > easier to understand, we should change it.
> 
> 
> I'm not sure if the proposed scenario is real. Let's assume there
> is a PF box with three NICs running on this awkward set up
> 
>   em1 ... 192.168.1.10
> 
>   em0
> 
>   em2 ... 192.168.1.10
> 
> em0 is attached to the LAN; em1 and em2 face the internet, which is
> reachable over two different physical lines. both lines are connected via
> equipment which uses the fixed IP address 192.168.1.10, and the PF admin has
> no way to change that.

in this scenario are em1 and em2 connected to the same ethernet
segment and ip subnet?

> the 'address@interface' syntax is the only way to define rules:
> 
>   pass in on em0 from 172.16.0.0/16 route-to 192.168.1.10@em1
>   pass in on em0 from 172.17.0.0/16 route-to 192.168.1.10@em2
> 
> regardless of how real such a scenario is, I believe it can
> currently work.

this is a very awkward configuration. while i think what it's trying
to do is useful, how it is expressed relies on something i want to
break (fix?).

one of the original reasons i wanted to break this kind of config
is because pfsync hasn't got a way to exchange interface information,
and different firewalls are going to have different interface
topologies anyway. one of the reasons to only use a destination/next
hop as the argument to route-to rules was so pfsync would work.

i'm pretty sure this is broken at the moment because of bugs in the
routing code. it is possible to configure routes to 192.168.1.10
via both em1 and em2 if net.inet.ip.multipath is set to 1, but i'm
sure the llinfo (arp and rtable) part of this kind of multipath
route setup does not work reliably. i guess i should try and get
my fixes for this into the tree.

there are two alternate ways i can think of to do this. the first
is to configure an rtable for each interface:

  # route -T 1 add default 192.168.1.10 -ifp em1
  # route -T 2 add default 192.168.1.10 -ifp em2

then you could write rules like this:

  pass in on em0 from 172.16.0.0/16 rtable 1
  pass in on em0 from 172.17.0.0/16 rtable 2

the other is to add routes "beyond" 192.168.1.10 and route-to
them:

  # route add 127.0.1.10 192.168.1.10 -ifp em1
  # route add 127.0.2.10 192.168.1.10 -ifp em2

then you can write rules like this:

  pass in on em0 from 172.16.0.0/16 route-to 127.0.1.10
  pass in on em0 from 172.17.0.0/16 route-to 127.0.2.10

this will likely hit the same bugs in the rtable/arp code i referred
to above though.

also, note that i haven't tested either of these.

cheers,
dlg



Re: [External] : Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 06:17:02PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> pf_route() might leak a reference to ifp.

oh no :(

> > Index: sys/net/pf.c
> > ===================================================================
> > RCS file: /cvs/src/sys/net/pf.c,v
> > retrieving revision 1.1101
> > diff -u -p -r1.1101 pf.c
> > --- sys/net/pf.c	19 Jan 2021 22:22:23 -0000	1.1101
> > +++ sys/net/pf.c	22 Jan 2021 07:33:31 -0000
> 
> 
> 
> > @@ -5998,48 +5994,40 @@ pf_route(struct pf_pdesc *pd, struct pf_
> >  
> > ip = mtod(m0, struct ip *);
> >  
> 
> > +
> > +   ifp = if_get(rt->rt_ifidx);
> > if (ifp == NULL)
> > goto bad;
> 
> here we get a reference to ifp.

yep.

> > @@ -6082,9 +6060,9 @@ pf_route(struct pf_pdesc *pd, struct pf_
> >  */
> > if (ip->ip_off & htons(IP_DF)) {
> > ipstat_inc(ips_cantfrag);
> > -   if (r->rt != PF_DUPTO)
> > +   if (s->rt != PF_DUPTO)
> > pf_send_icmp(m0, ICMP_UNREACH, ICMP_UNREACH_NEEDFRAG,
> > -   ifp->if_mtu, pd->af, r, pd->rdomain);
> > +   ifp->if_mtu, pd->af, s->rule.ptr, pd->rdomain);
> > goto bad;
> 
>   here we do 'goto bad', which does not call if_put().

yes it does. the whole chunk with the diff applied is:

done:
if (s->rt != PF_DUPTO)
pd->m = NULL;
if_put(ifp);
rtfree(rt);
return;

bad:
m_freem(m0);
goto done;
}

bad drops the mbuf and then goes to done.



Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 04:19:11PM +0100, Alexander Bluhm wrote:
> On Fri, Jan 22, 2021 at 06:07:59PM +1000, David Gwynne wrote:
> > --- sys/conf/GENERIC	30 Sep 2020 14:51:17 -0000	1.273
> > +++ sys/conf/GENERIC	22 Jan 2021 07:33:30 -0000
> > @@ -82,6 +82,7 @@ pseudo-device msts1   # MSTS line discipl
> >  pseudo-device  endrun  1   # EndRun line discipline
> >  pseudo-device  vnd 4   # vnode disk devices
> >  pseudo-device  ksyms   1   # kernel symbols device
> > +pseudo-device  kstat
> >  #pseudo-device dt  # Dynamic Tracer
> >
> >  # clonable devices
> 
> This is an unrelated chunk.

oh yeah...

> > +pf_route(struct pf_pdesc *pd, struct pf_state *s)
> ...
> > +   if (pd->dir == PF_IN) {
> > if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
> 
> Yes, this is the correct logic.  When the packet comes in, pf
> overrides forwarding, tests the out rules, and sends it.  For
> outgoing packets on out route-to rules we have already tested the
> rules.  It also works for reply-to the other way around.

yep.

> But what about dup-to?  The packet is duplicated for both directions.
> I guess the main use case for dup-to is implementing a monitor port.
> There you have to pass packets stateless, otherwise it would not
> work anyway.  The strange semantics is not related to this diff.

are you saying i should skip pf_test for all dup-to generated packets?

> We are reaching a state where this diff can go in.  I just started
> a regress run with it.  OK bluhm@

hopefully i fixed the pfctl error messages up so the regress tests aren't
too unhappy.



Re: [External] : Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 03:21:29PM +0100, Alexander Bluhm wrote:
> Hi,
> 
> Some personal thoughts.  I am happy when pf route-to gets simpler.
> Especially I have never understood what this address@interface
> syntax is used for.

even after staring at it for so long, i still don't get it. i do think
it was a reimplementation of an ipfilter thing, but i don't think it
makes a lot of sense in ipfilter either.

> I cannot estimate what configuration is used by our customers in
> many installations.  Simple syntax change address@interface ->
> address of next hop should be no problem.  Slight semantic changes
> have to be dealt with.  Current packet flow is complicated and may
> be inspired by old NAT behavior.  As long it becomes more sane and
> easier to understand, we should change it.

route-to $destination, not $next_hop...

the biggest change we have to agree on at the moment is whether we're
changing the semantics from "pf runs when a packet goes over an interface"
to "pf runs when a packet goes in or out of the stack". this affects
whether pf_test runs again when route-to changes the interface.

> But I don't like artificial restrictions.  We don't know all use
> cases.  reply-to and route-to could be used for both in and out
> rules.  I have used them for strange divert-to on bridge setups.
> It should stay that way.

i don't think it's complicated to support route-to and reply-to on both
in and out rules. we've already found that there's use cases for
reply-to on inbound rules, doing things on bridges just adds to that.
it could be used on tpmr(4) too...

> It would be nice to keep state-less route-to.  I have found a special
> case with that in the code of our product.  But it looks like dead
> code, so I would not object to remove state-less route-to for now.

ok. thank you.

> bluhm



Re: [External] : Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 02:30:46PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
...
> > 
> > i dont understand the kif lifetimes though. can we just point a
> > pdesc at an arbitrary kif or do we need to reference count?
> > 
> 
> as long as we don't release NET_LOCK() (or PF_LOCK() in near future),
> the reference to kif should remain valid.
> 
> kif is PF's very own abstraction of network interfaces. there are
> two ways how one can create a kif:
> 
>   there is a rule loaded which refers to an interface which
>   is not plumbed yet.
> 
>   the new interface gets plumbed and PF must have a kif for it.
> 
> the reference count currently used makes sure we destroy the kif when
> either the interface is gone or the rule which refers to the kif is
> gone.
> 
> hope I remember details above right.

ok. sounds simple enough...

> > > I think this might be way to go.
> > > 
> > > My only concern now is that such change is subtle. I mean the
> > > 
> > >   pass out ... route-to 
> > > 
> > > will change behavior in the sense that the current code will dispatch
> > > the packet to the new interface and run pf_test() again. Once your diff
> > > is in, the same rule will be accepted, but will bring entirely
> > > different behaviour: just dispatching the packet to the new interface.
> > 
> > yeah.
> > 
> > the counter example is that i was extremely surprised when i
> > discovered that pf_test gets run again when the outgoing interface
> > is changed with a route-to rule.
> 
> surprised because you forgot about the current model? which is:
> run pf_test() whenever a packet crosses an interface?

forgot isn't the right word. i was in the room when henning and mcbride
reworked pf and came up with the stack and wire side key terminology and
semantics, and i've spent a lot of time looking at the ip input and
output code where pf_test is called close to the stack. it just wasn't
obvious to me that pf filtered over an interface rather than filtered
in and out of the stack. apart from route-to in pf, i'm not sure it is a
meaningful difference either.

> > there's subtlety either way, we're just figuring out which one we're
> > going for.
> 
> yes exactly. there are trade offs either way.
> 
> 
> 
> > > I think this is acceptable. If this will cause a friction we can 
> > > always
> > > adjust the code in follow up commit to allow state-less 
> > > route-to/reply-to
> > > with no support from pfsync(4).
> > 
> > if we're going to support route-to on match rules i think this will be
> > easy to implement. 
> > 
> 
> I think there must be some broader consensus on the model change
> from the current one, which is run pf_test() for every NIC crossing,
> to the new way, which is run pf_test() at most two times.

agreed.

> > > > > > lastly, the "argument" or address specified with route-to (and
> > > > > > reply-to and dup-to) is a destination address, not a next-hop. this
> > > > > > has been discussed on the lists a couple of times before, so i won't
> > > > > > go over it again, except to reiterate that it allows pf to force
> > > > > > "sticky" path selection while opening up the possibility for ecmp
> > > > > > and failover for where that path traverses.
> > > > > 
> > > > > I keep forgetting about it as I still stick to current 
> > > > > interpretation.
> > > > > 
> > > > > 
> > > > > I've seen changes to pfctl. Diff below still allows rule:
> > > > > 
> > > > > pass in on net0 from 192.168.1.0/24 to any route-to 
> > > > > 10.10.10.10@em0
> > > > 
> > Is there a use case for the @interface syntax apart from the current
> > > > route-to rules? If not, we can just delete it.
> > > 
> > > perhaps I'm still not quite on the same page as you then. I also
> > > had no time to fully test your diff. The way I understand your
> > > effort is to change route-to behavior such that it will use
> > > a destination instead of next-hop@interface. Or are you planning
> > > to keep current form ('route-to next-hop@interface') working?
> > 
> > if we ignore route-to, what's the use case for the interface part of
> > address@interface? it doesnt seem to be accepted as part of an address
> > in other parts of the grammar:
> > 
> > dlg@kbuild ~$ echo pass in from 192.168.0.0@vmx0 | sudo pfctl -nf -
> > stdin:1: @if syntax not permitted in from or to
> > stdin:1: skipping rule due to errors
> > stdin:1: rule expands to no valid combination
> > dlg@kbuild ~$ echo pass from 192.168.0.0@vmx0 | sudo pfctl -nf -
> > stdin:1: @if syntax not permitted in from or to
> > stdin:1: skipping rule due to errors
> > stdin:1: rule expands to no valid combination
> > dlg@kbuild ~$ echo pass nat-to 192.168.0.0@vmx0 | sudo pfctl -nf -
> > stdin:1: @if not permitted
> > stdin:1: nat-to and rdr-to require a direction
> > stdin:1: skipping rule due to errors
> > stdin:1: rule expands to no valid combination
> > dlg@kbuild ~$ echo pass nat-to 192.168.0.0@vmx0 | sudo 

Re: [External] : Re: pf route-to issues

2021-01-25 Thread David Gwynne
On Mon, Jan 25, 2021 at 01:11:35PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > > 
> > > I understand that simple is better here, so I won't object
> > > if we will lean towards simplified model above. However I still
> > > would like to share my view on current PF.
> > > 
> > > the way I understand how things (should) work currently is fairly 
> > > simple:
> > > 
> > >   we always run pf_test() as packet crosses interface.
> > >   packet can cross interface either in outbound or
> > >   inbound direction.
> > 
> > That's how I understand the current code. I'm proposing that we change
> > the semantics so they are:
> > 
> > - we always run pf_test as a packet enters or leaves the network stack.
> > - pf is able to filter or apply policy based on various attributes
> >   of the packet such as addresses and ports, but also metadata about
> >   the packet such as the current prio, or the interface it came
> >   from or is going to.
> > - changing a packet or its metadata does not cause a rerun of pf_test.
> > - route-to on an incoming packet basically bypasses the default
> >   stack processing with a "fast route" out of the stack.
> > 
> > > this way we can always create a complex route-to loops,
> > > however it can also solve some route-to vs. NAT issues.
> > > consider those fairly innocent rules:
> > > 
> > > 8<---8<---8<--8<
> > > table  { 10.10.10.10, 172.16.1.1 }
> > > 
> > > pass out on em0 from 192.168.1.0/24 to any route-to 
> > > pass out on em1 from 192.168.1.0 to any nat-to (em1)
> > > pass out on em2 all 
> > > 8<---8<---8<--8<
> > > 
> > > Rules above should currently work, but will stop if we will
> > > go with simplified model.
> > 
> > The entries in  make the packet go out em1 and em2?
> 
> yes they do. let's say 10.10.10.10 is reached over em1, 172.16.1.1 is
> reached over em2. sorry I have not specified that in my earlier email.

npz.

> 
> > 
> > I'm ok with breaking configs like that. We don't run pf_test again for
> > other changes to the packet, so if we do want to support something like
> > that I think we should make the following work:
> > 
> >   # pf_pdesc kif is em0
> >   match out on em0 from 192.168.1.0/24 to any route-to 
> >   # pf_pdesc kif is now em1
> >   pass out on em1 from 192.168.1.0 to any nat-to (em1)
> >   pass out on em2 all
> > 
> > This is more in line with how NAT rules operate.
> 
> If I understand the idea right, then basically 'match out on em0'
> figures out the new 'outbound interface' so either 'pass out on em1...' or
> 'pass out on em2...' will kick in. In other words:
> 
>   depending on the destination picked up from  table,
>   the route-to action will override the em0 interface to
>   either em1 or em2.

yes.

i dont understand the kif lifetimes though. can we just point a
pdesc at an arbitrary kif or do we need to reference count?

> I think this might be way to go.
> 
> My only concern now is that such change is subtle. I mean the
> 
>   pass out ... route-to 
> 
> will change behavior in the sense that the current code will dispatch
> the packet to the new interface and run pf_test() again. Once your diff
> is in, the same rule will be accepted, but will bring entirely
> different behaviour: just dispatching the packet to the new interface.

yeah.

the counter example is that i was extremely surprised when i
discovered that pf_test gets run again when the outgoing interface
is changed with a route-to rule.

there's subtlety either way, we're just figuring out which one we're
going for.

> > > I'll be OK with your simplified model if it will make things
> > > more explicit:
> > > 
> > >   route-to option should be applied on inbound rules
> > >   only
> > 
> > This would restrict how we currently write rules. See below about how we
> > would be using it.
> > 
> > >   reply-to option should be applied on outbound rule
> > >   only
> > 
> > I'm using reply-to on inbound rules. On these boxes I have a service
> > (it's a dns resolver running unbound) that is accessible only via
> > gre(4) tunnels, and I need the replies to those connections to go
> > out the same interface they came in on. I'm running an older version of
> > my diff, so I can have rules like this to make it work:
> > 
> >   pass in quick on gre0 reply-to gre0:peer
> >   pass in quick on gre1 reply-to gre1:peer
> > 
> > The DNS traffic isn't going through this box, the replies that
> > unbound is generating match the state created by the inbound rule.
> > 
> > If I'm remembering correctly, sthen@ had a similar use case.
> 
> you are right, I did not think much about locally bound traffic.
> in this case reply-to needs to be kept as-is.
> 
> > 
> > >   dup-to option can go either way (in/out)
> > 
> > Yep.
> > 
> > > does it make sense? IMO yes, because doing route-to
> > > 

Re: [External] : Re: pf route-to issues

2021-01-24 Thread David Gwynne
On Mon, Jan 25, 2021 at 02:50:12AM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> > 
> > ok. i don't know how to split up the rest of the change though.
> > 
> > here's an updated diff that includes the rest of the kernel changes and
> > the pfctl and pf.conf tweaks.
> > 
> > it's probably useful for me to try and explain at a high level what
> > i think the semantics should be, otherwise we might end up arguing about
> > which bits of the current config i broke.
> > 
> > so, from an extremely high level point of view, and apologies if
> > this is condescending, pf sits between the network stack and an
> > interface that a packet travels on. for connections handled by the
> > local box, this means packets come from the stack and get an output
> > interface selected by a route lookup, then pf checks it, and then
> > it goes out the selected interface. replies come into an interface,
> > get checked by pf, and then enter the stack. when forwarding, a
> > packet comes into an interface, pf checks it, the stack does a route
> > lookup to pick an interface, pf checks it again, and then it goes
> > out the interface.
> > 
> > so what does it mean when route-to (or reply-to) gets involved? i'm
> > saying that when route-to is applied to a packet, pf takes the packet
> > away from the stack and immediately forwards it toward the specified
> > destination address. for a packet entering the system, ie, when the
> > packet is going from the interface into the stack, route-to should
> > pretend that it is forwarding the packet and basically push it
> > straight out an interface. however, like normal forwarding via the
> > stack, there might be some policy on packets leaving that interface that
> > you want to apply, so pf should run pf_test in that situation so the
> > policy can be applied. this is especially useful if you need to apply
> > nat-to when packets leave a particular interface.
> > 
> > however, if you route-to when a packet is on the way out of the
> > stack, i'm arguing that pf should not run again against that packet.
> > currently route-to rules run pf_test again if the interface the packet
> > is routed out of changes, which means pf runs multiple times against a
> > packet if rules keep changing which interface it goes out. this means
> > there's loop prevention in pf to mitigate against this, and weird
> > potentials for multiple states to be created when nat gets involved.
> > 
> > for simplicity, both in terms of reasoning and code i think pf should
> > only be run once when a packet enters the system, and only once when it
> > leaves the system. the only reason i can come up with for running
> > pf_test multiple times when route-to changes the outgoing interface is
> > so you can check the packet with "pass out on $new_if" type rules. we
> > don't rerun pf again when nat/rdr changes addresses, so this feels
> > inconsistent to me.
> 
> I understand that simple is better here, so I won't object
> if we will lean towards simplified model above. However I still
> would like to share my view on current PF.
> 
> the way I understand how things (should) work currently is fairly simple:
> 
>   we always run pf_test() as packet crosses interface.
>   packet can cross interface either in outbound or
>   inbound direction.

That's how I understand the current code. I'm proposing that we change
the semantics so they are:

- we always run pf_test as a packet enters or leaves the network stack.
- pf is able to filter or apply policy based on various attributes
  of the packet such as addresses and ports, but also metadata about
  the packet such as the current prio, or the interface it came
  from or is going to.
- changing a packet or its metadata does not cause a rerun of pf_test.
- route-to on an incoming packet basically bypasses the default
  stack processing with a "fast route" out of the stack.

> this way we can always create a complex route-to loops,
> however it can also solve some route-to vs. NAT issues.
> consider those fairly innocent rules:
> 
> 8<---8<---8<--8<
> table  { 10.10.10.10, 172.16.1.1 }
> 
> pass out on em0 from 192.168.1.0/24 to any route-to 
> pass out on em1 from 192.168.1.0 to any nat-to (em1)
> pass out on em2 all 
> 8<---8<---8<--8<
> 
> Rules above should currently work, but will stop if we will
> go with simplified model.

The entries in  make the packet go out em1 and em2?

I'm ok with breaking configs like that. We don't run pf_test again for
other changes to the packet, so if we do want to support something like
that I think we should make the following work:

  # pf_pdesc kif is em0
  match out on em0 from 192.168.1.0/24 to any route-to 
  # pf_pdesc kif is now em1
  pass out on em1 from 192.168.1.0 to any nat-to (em1)
  pass out on em2 all

This is more in line with how NAT rules operate.

> I'll be OK 

Re: [External] : Re: tell pfctl(8) route-to and reply-to accept next-hop only

2021-01-24 Thread David Gwynne



> On 25 Jan 2021, at 10:43, Alexandr Nedvedicky 
>  wrote:
> 
> hello,
> 
> On Fri, Jan 22, 2021 at 05:32:47PM +1000, David Gwynne wrote:
>> I tried this diff, and it broke the ability to use dynamic addresses.
>> ie, the following rules should work:
>> 
>> pass in on gre52 inet proto icmp route-to (gre49:peer)
>> pass in on vmx0 inet proto icmp route-to (gre:peer)
> 
>I see, I did not know those should work.

We are suffering a bit from not having a high level design :/

>> 
>> however, other forms of dynamic interface addresses should fail. or do
>> we want to support route-to if0:broadcast too?
> 
>I can't think of any valid reason why 'ifp0:broadcast' should work.  this
>seems to be a poor hack to work around some awkward glitch.  I would prefer
>to disable this option now. We can always add it later, when we
>understand the true purpose.

Agreed.

Cheers,
dlg

> 
> 
> thanks and
> regards
> sashan



Re: pf route-to issues

2021-01-22 Thread David Gwynne
On Fri, Jan 08, 2021 at 04:43:39PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > 
> > revision 1.294
> > date: 2003/01/02 01:56:56;  author: dhartmei;  state: Exp;  lines: +27 -49;
> > When route-to/reply-to is used in combination with address translation,
> > pf_test() may be called twice for the same packet. In this case, make
> > sure the translation is only applied in the second call. This solves
> > the problem with state insert failures where the second pf_test() call
> > tried to insert another state entry after the first call's translation.
> > ok henning@, mcbride@, thanks to Joe Nall for additional testing.
> > 
> > 
> > I have tested your diffs in my setup, they all pass.  I have not
> > tested the scenario mentioned in the commit message.  Note that the
> > address translation implementation in 2003 was different from what
> > we have now.  And sasha@'s analysis shows that the current code is
> > wrong in other use cases.
> > 
> 
> I'd completely forgotten there was a change in NAT. Therefore I could
> not understand the commit message.
> 
> 
> 
> > 
> > The only way to find out is to commit it.  It reduces complexity that
> > no one understands.
> > 
> > OK bluhm@ to remove the check
> > 
> > Please leave the "if (pd->kif->pfik_ifp != ifp)" around pf_test()
> > in pf_route() as it is for now.
> 
> I agree with bluhm@ here. we should proceed with small steps in such
> case and let things to settle down before making next move.

ok. i don't know how to split up the rest of the change though.

here's an updated diff that includes the rest of the kernel changes and
the pfctl and pf.conf tweaks.

it's probably useful for me to try and explain at a high level what
i think the semantics should be, otherwise we might end up arguing about
which bits of the current config i broke.

so, from an extremely high level point of view, and apologies if
this is condescending, pf sits between the network stack and an
interface that a packet travels on. for connections handled by the
local box, this means packets come from the stack and get an output
interface selected by a route lookup, then pf checks it, and then
it goes out the selected interface. replies come into an interface,
get checked by pf, and then enter the stack. when forwarding, a
packet comes into an interface, pf checks it, the stack does a route
lookup to pick an interface, pf checks it again, and then it goes
out the interface.

so what does it mean when route-to (or reply-to) gets involved? i'm
saying that when route-to is applied to a packet, pf takes the packet
away from the stack and immediately forwards it toward the specified
destination address. for a packet entering the system, ie, when the
packet is going from the interface into the stack, route-to should
pretend that it is forwarding the packet and basically push it
straight out an interface. however, like normal forwarding via the
stack, there might be some policy on packets leaving that interface that
you want to apply, so pf should run pf_test in that situation so the
policy can be applied. this is especially useful if you need to apply
nat-to when packets leave a particular interface.
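
to make that concrete, here's a minimal ruleset in the semantics
described above. the addresses and interface names are made up for
illustration: a packet arriving on em0 is taken over by route-to and
pushed toward the gateway, and because pf_test still runs as the packet
leaves em1, the nat-to there gets a chance to apply.

```pf.conf
# hypothetical addresses/interfaces, for illustration only
pass in on em0 inet proto tcp to port 443 route-to 192.0.2.1

# route-to forwards the packet immediately, but egress policy on the
# selected interface is still evaluated, so this nat-to takes effect
match out on em1 inet nat-to (em1)
pass out on em1
```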

however, if you route-to when a packet is on the way out of the
stack, i'm arguing that pf should not run again against that packet.
currently route-to rules run pf_test again if the interface the packet
is routed out of changes, which means pf runs multiple times against a
packet if rules keep changing which interface it goes out. this means
there's loop prevention in pf to mitigate against this, and weird
potentials for multiple states to be created when nat gets involved.

for simplicity, both in terms of reasoning and code i think pf should
only be run once when a packet enters the system, and only once when it
leaves the system. the only reason i can come up with for running
pf_test multiple times when route-to changes the outgoing interface is
so you can check the packet with "pass out on $new_if" type rules. we
don't rerun pf again when nat/rdr changes addresses, so this feels
inconsistent to me.

i also don't think route-to is used much. getting basic functionality
working is surprisingly hard, so the complicated possibilities in the
current code are almost certainly not taken advantage of. we're going to
break existing configurations anyway, so if we can agree that pf only
runs twice even if route-to gets involved, then i'm not going to feel
bad about breaking something this complicated anyway.

this also breaks the ability to do route-to without states. is there a
reason to do that apart from the DSR type things? did we agree that
those use cases could be handled by sloppy states instead?

lastly, the "argument" or address specified with route-to (and
reply-to and dup-to) is a destination address, not a next-hop. this
has been discussed on the lists a couple of times before, so i won't
go over it again, except to 

Re: tell pfctl(8) route-to and reply-to accept next-hop only

2021-01-21 Thread David Gwynne
I tried this diff, and it broke the ability to use dynamic addresses.
ie, the following rules should work:

pass in on gre52 inet proto icmp route-to (gre49:peer)
pass in on vmx0 inet proto icmp route-to (gre:peer)

however, other forms of dynamic interface addresses should fail. or do
we want to support route-to if0:broadcast too?

i have an updated diff to send out that includes both kernel and
pfctl changes. i think the only thing i'm worried about in it is that it
reuses the redirspec pool_opts stuff, and some of those might not make
sense for route-to address selection

dlg

On Tue, Jan 12, 2021 at 08:45:22PM +0100, Alexandr Nedvedicky wrote:
> Hello,
> 
> proposed diff follows stuff discussed here [1] (pf route-to issues).  I think
> we've reached a consensus to change route-to/reply-to such that the only
> option will be next-hop (and list and table of next-hop addresses).
> 
> I think bluhm@ and dlg@ have committed part of that change already.
> the proposed diff updates pfctl(8) so parser will do 'a right thing',
> namely:
> 
> specifying a host using a form such as 1.2.3.4@em0 is not supported
> anymore
> 
> diff introduces a next_hop() function, which is a clone of
> existing host(). unlike host(), the next_hop() does not accept
> a name of local network interface.
> 
> the diff also breaks existing regression tests. We can update
> them once we agree on the proposed diff.
> 
> thanks and
> regards
> sashan
> 
> [1] https://marc.info/?l=openbsd-tech&m=160308583701259&w=2
> 
> 8<---8<---8<--8<
> diff --git a/sbin/pfctl/parse.y b/sbin/pfctl/parse.y
> index 2b3e62b1a7e..536aec3286b 100644
> --- a/sbin/pfctl/parse.y
> +++ b/sbin/pfctl/parse.y
> @@ -3745,23 +3745,13 @@ pool_opt  : BITMASK   {
>   ;
>  
>  route_host   : STRING{
> - /* try to find @if0 address specs */
> - if (strrchr($1, '@') != NULL) {
> - if (($$ = host($1, pf->opts)) == NULL)  {
> - yyerror("invalid host for route spec");
> - YYERROR;
> - }
> + if (($$ = next_hop($1, pf->opts)) == NULL)  {
> + /* error. "any" is handled elsewhere */
>   free($1);
> - } else {
> - $$ = calloc(1, sizeof(struct node_host));
> - if ($$ == NULL)
> - err(1, "route_host: calloc");
> - $$->ifname = $1;
> - $$->addr.type = PF_ADDR_NONE;
> - set_ipmask($$, 128);
> - $$->next = NULL;
> - $$->tail = $$;
> + yyerror("could not parse host specification");
> + YYERROR;
>   }
> + free($1);
>   }
>   | STRING '/' STRING {
>   char*buf;
> @@ -3769,7 +3759,7 @@ route_host  : STRING{
> if (asprintf(&buf, "%s/%s", $1, $3) == -1)
>   err(1, "host: asprintf");
>   free($1);
> - if (($$ = host(buf, pf->opts)) == NULL) {
> + if (($$ = next_hop(buf, pf->opts)) == NULL) {
>   /* error. "any" is handled elsewhere */
>   free(buf);
>   yyerror("could not parse host specification");
> @@ -3795,33 +3785,6 @@ route_host : STRING{
>   $$->next = NULL;
>   $$->tail = $$;
>   }
> - | dynaddr '/' NUMBER{
> - struct node_host*n;
> -
> - if ($3 < 0 || $3 > 128) {
> - yyerror("bit number too big");
> - YYERROR;
> - }
> - $$ = $1;
> - for (n = $1; n != NULL; n = n->next)
> - set_ipmask(n, $3);
> - }
> - | '(' STRING host ')'   {
> - struct node_host*n;
> -
> - $$ = $3;
> - /* XXX check masks, only full mask should be allowed */
> - for (n = $3; n != NULL; n = n->next) {
> - if ($$->ifname) {
> - yyerror("cannot specify interface twice "
> - "in route spec");
> - YYERROR;
> - }
> - if (($$->ifname = strdup($2)) 

bpf(4) doesn't have to keep track of nonblocking state itself

2021-01-18 Thread David Gwynne
vfs does it for us.

ok?

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.202
diff -u -p -r1.202 bpf.c
--- bpf.c   17 Jan 2021 02:27:29 -  1.202
+++ bpf.c   19 Jan 2021 00:10:22 -
@@ -379,7 +379,6 @@ bpfopen(dev_t dev, int flag, int mode, s
sigio_init(&bd->bd_sigio);
 
bd->bd_rtout = 0;   /* no timeout by default */
-   bd->bd_rnonblock = ISSET(flag, FNONBLOCK);
 
bpf_get(bd);
LIST_INSERT_HEAD(_d_list, bd, bd_list);
@@ -497,7 +496,7 @@ bpfread(dev_t dev, struct uio *uio, int 
ROTATE_BUFFERS(d);
break;
}
-   if (d->bd_rnonblock) {
+   if (ISSET(ioflag, IO_NDELAY)) {
/* User requested non-blocking I/O */
error = EWOULDBLOCK;
} else if (d->bd_rtout == 0) {
@@ -982,10 +981,7 @@ bpfioctl(dev_t dev, u_long cmd, caddr_t 
break;
 
case FIONBIO:   /* Non-blocking I/O */
-   if (*(int *)addr)
-   d->bd_rnonblock = 1;
-   else
-   d->bd_rnonblock = 0;
+   /* let vfs keep track of this */
break;
 
case FIOASYNC:  /* Send signal on receive packets */
Index: bpfdesc.h
===
RCS file: /cvs/src/sys/net/bpfdesc.h,v
retrieving revision 1.44
diff -u -p -r1.44 bpfdesc.h
--- bpfdesc.h   2 Jan 2021 02:46:06 -   1.44
+++ bpfdesc.h   19 Jan 2021 00:10:22 -
@@ -80,7 +80,6 @@ struct bpf_d {
struct bpf_if  *bd_bif; /* interface descriptor */
uint64_tbd_rtout;   /* [m] Read timeout in nanoseconds */
u_long  bd_nreaders;/* [m] # threads asleep in bpfread() */
-   int bd_rnonblock;   /* true if nonblocking reads are set */
struct bpf_program_smr
   *bd_rfilter; /* read filter code */
struct bpf_program_smr



bpf_mtap_ether doesnt need to encode packet priority

2021-01-14 Thread David Gwynne
bpf should be showing what will be or has been on the wire, which is
what the ether_vtag in the mbuf has. the prio is either about to be
decoded from the tag on the way into the stack, or has been encoded by
vlan(4) on the way out of the stack.

ok?

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.201
diff -u -p -r1.201 bpf.c
--- bpf.c   2 Jan 2021 07:25:42 -   1.201
+++ bpf.c   15 Jan 2021 01:12:31 -
@@ -1444,7 +1444,6 @@ bpf_mtap_ether(caddr_t arg, const struct
 #if NVLAN > 0
struct ether_vlan_header evh;
struct m_hdr mh, md;
-   uint8_t prio;
 
if ((m->m_flags & M_VLANTAG) == 0)
 #endif
@@ -1455,15 +1454,10 @@ bpf_mtap_ether(caddr_t arg, const struct
 #if NVLAN > 0
KASSERT(m->m_len >= ETHER_HDR_LEN);
 
-   prio = m->m_pkthdr.pf.prio;
-   if (prio <= 1)
-   prio = !prio;
-
memcpy(, mtod(m, char *), ETHER_HDR_LEN);
evh.evl_proto = evh.evl_encap_proto;
evh.evl_encap_proto = htons(ETHERTYPE_VLAN);
-   evh.evl_tag = htons(m->m_pkthdr.ether_vtag |
-   (prio << EVL_PRIO_BITS));
+   evh.evl_tag = htons(m->m_pkthdr.ether_vtag);
 
mh.mh_flags = 0;
mh.mh_data = (caddr_t)



Re: pf route-to issues

2021-01-05 Thread David Gwynne
On Mon, Jan 04, 2021 at 06:37:54PM +0100, Alexander Bluhm wrote:
> On Mon, Jan 04, 2021 at 11:21:50PM +1000, David Gwynne wrote:
> > this chunk pops out as a standalone change.
> >
> > having pf_find_state() return PF_PASS here means the callers short
> > circuit and let the packet go through without running it through a
> > lot of the state handling, which includes things like protocol state
> > updates, nat, scrubbing, some pflog handling, and most importantly,
> > later calls to pf_route().
> 
> pf_route() calls pf_test() again with a different interface.
> 
> The idea of this code is, that the interface which is passed to
> pf_test() from ip_output() is wrong.  The call to pf_set_rt_ifp()
> changes it in the state.
> 
> In the pf_test() call from ip_output() we skip the tests.  We know
> they will happen in pf_test() called from pf_route().  Without this
> chunk we would do state handling twice with different interfaces.
> 
> Is that analysis correct?

I think so, but I didn't get as much time to poke at this today as I was
hoping.

If the idea is to avoid running most of pf_test again if route-to is
applied during ip_output, I think this tweaked diff is simpler. Is there
a valid use case for running some of pf_test again after route-to is
applied?

The pf_set_rt_ifp() stuff could be cleaned up if we can get away with
this.

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1097
diff -u -p -r1.1097 pf.c
--- pf.c4 Jan 2021 12:48:27 -   1.1097
+++ pf.c5 Jan 2021 11:18:14 -
@@ -1122,12 +1122,6 @@ pf_find_state(struct pf_pdesc *pd, struc
}
 
*state = s;
-   if (pd->dir == PF_OUT && s->rt_kif != NULL && s->rt_kif != pd->kif &&
-   ((s->rule.ptr->rt == PF_ROUTETO &&
-   s->rule.ptr->direction == PF_OUT) ||
-   (s->rule.ptr->rt == PF_REPLYTO &&
-   s->rule.ptr->direction == PF_IN)))
-   return (PF_PASS);
 
return (PF_MATCH);
 }
@@ -6049,7 +6043,7 @@ pf_route(struct pf_pdesc *pd, struct pf_
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (pd->dir == PF_IN) {
if (pf_test(AF_INET, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)
@@ -6204,7 +6198,7 @@ pf_route6(struct pf_pdesc *pd, struct pf
if (ifp == NULL)
goto bad;
 
-   if (pd->kif->pfik_ifp != ifp) {
+   if (pd->dir == PF_IN) {
if (pf_test(AF_INET6, PF_OUT, ifp, &m0) != PF_PASS)
goto bad;
else if (m0 == NULL)



Re: pf route-to issues

2021-01-04 Thread David Gwynne
On Mon, Jan 04, 2021 at 01:57:24PM +0100, Alexander Bluhm wrote:
> On Mon, Jan 04, 2021 at 11:46:16AM +0100, Alexandr Nedvedicky wrote:
> > > let's put this in and then i'll have a look. ok by me.
> > bluhm's diff is fine with me.
> 
> Refactoring is commited, here is the remaining kernel diff after merge.

this chunk pops out as a standalone change.

having pf_find_state() return PF_PASS here means the callers short
circuit and let the packet go through without running it through a
lot of the state handling, which includes things like protocol state
updates, nat, scrubbing, some pflog handling, and most importantly,
later calls to pf_route().

ok?

Index: pf.c
===
RCS file: /cvs/src/sys/net/pf.c,v
retrieving revision 1.1097
diff -u -p -r1.1097 pf.c
--- pf.c4 Jan 2021 12:48:27 -   1.1097
+++ pf.c4 Jan 2021 13:08:26 -
@@ -1122,12 +1122,6 @@ pf_find_state(struct pf_pdesc *pd, struc
}
 
*state = s;
-   if (pd->dir == PF_OUT && s->rt_kif != NULL && s->rt_kif != pd->kif &&
-   ((s->rule.ptr->rt == PF_ROUTETO &&
-   s->rule.ptr->direction == PF_OUT) ||
-   (s->rule.ptr->rt == PF_REPLYTO &&
-   s->rule.ptr->direction == PF_IN)))
-   return (PF_PASS);
 
return (PF_MATCH);
 }



Re: pf route-to issues

2021-01-04 Thread David Gwynne



> On 4 Jan 2021, at 9:27 pm, Alexandr Nedvedicky 
>  wrote:
> 
> Hello,
> 
> there is one more thing, which just came up on my mind.
> 
> 
>> 
>> so i want to change route-to in pfctl so it takes a nexthop instead
>> of an interface. you could argue that pf already lets you do this,
>> because there's some bs nexthop@interface syntax. my counter argument
>> is that the interface the nexthop is reachable over is redundant, and it
>> makes fixing some of the other problems harder if we keep it.
>> 
> 
>what is your plan for dup-to then? if my understanding of dup-to is
> correct, then it allows the administrator to copy matching packets and
> send them out a dedicated interface so another physical box (running
>snort) can intercept them and process them.
> 
>I remember we had to do some assumptions about this, when porting PF to
>Solaris. So Solaris interpretation of option
> 
>   'dup-to net12'
> 
>is to send out copy of matching packet via net12 interface.  because there
>is no next-hop specified, we just use link broadcast when pushing out the
>packet to network. I agree this is a hack. If route-to will be changed
>to accept next-hop instead of interface, then we will be able to kill
>such hack.

route-to, reply-to, and dup-to take an address as an argument. In dup-to's 
case, the packet is duplicated and then the copy is routed toward the 
destination.

As discussed previously, the address argument is a destination, it's not a link 
local address per se. The destination address could be a directly connected or 
link local address, but it could also be via a gateway (including a those 
learnt by a dynamic routing protocol), and/or via multiple gateways. This 
allows for failover or ECMP like we get with packets where the path is selected 
by the rtable lookup.

dlg

> 
> 
> 
>> 
>> if we limit the information needed for pf_route to a nexthop address,
>> and which direction the address is used, this is doable. both the
>> pf_state and pfsync_state structs already contain an address to store a
>> nexthop in, i just had to move the route-to direction from the rule into
>> the state. this is easy with pf_state, but i used a spare pad field in
>> pfsync_state for this.
>> 
> 
>this should be fine, because route-to et.al. don't work with 'block' rules.
> 
> 
> thanks and
> regards
> sashan



Re: pf route-to issues

2021-01-03 Thread David Gwynne
On Mon, Jan 04, 2021 at 12:58:17AM +0100, Alexander Bluhm wrote:
> On Sun, Jan 03, 2021 at 06:56:20PM +0100, Alexander Bluhm wrote:
> > I am currently running a full regress to find more fallout.
> 
> These regress tests fail:
> 
> sys/net/pf_forward
> sys/net/pf_fragment
> sbin/pfctl
> 
> The first two are easy to fix.  That means my tests using route-to
> work fine with your diff.  Just remove the @interface as below.

pretty much, yes.

> pfctl tests pfail8 and pf13 use very strange routespec syntax.  You
> might want to take a look at what that meant before and what should
> be valid now.

this is another syntax which we seem to agree is confusing. this makes
me more convinced that it needs to be changed.

pfail8.in and pf13.in should be modified to route-to an IP address
instead of an interface. these regress tests are a bit confusing
because they're just testing the parser and the addresses that
they're using aren't configured anywhere.

pfail8.ok shows that pfctl should generate some more specific error
messages, which is easily fixed.

> 
> bluhm
> 
> Index: regress/sys/net/pf_forward/pf.conf
> ===
> RCS file: /mount/openbsd/cvs/src/regress/sys/net/pf_forward/pf.conf,v
> retrieving revision 1.5
> diff -u -p -r1.5 pf.conf
> --- regress/sys/net/pf_forward/pf.conf11 Jan 2018 03:23:16 -  
> 1.5
> +++ regress/sys/net/pf_forward/pf.conf3 Jan 2021 23:26:54 -
> @@ -17,22 +17,22 @@ pass out inet6
>  pass in  to $AF_IN6/64 af-to inet  from $PF_OUT  to $ECO_IN/24   tag af
>  pass out inettagged af
> 
> -pass in  to $RTT_IN/24  route-to $RT_IN@$PF_IFOUT  tag rttin
> -pass out   tagged rttin
> -pass in  to $RTT_IN6/64 route-to $RT_IN6@$PF_IFOUT tag rttin
> -pass out   tagged rttin
> +pass in  to $RTT_IN/24  route-to $RT_IN  tag rttin
> +pass out tagged rttin
> +pass in  to $RTT_IN6/64 route-to $RT_IN6 tag rttin
> +pass out tagged rttin
> 
> -pass in  to $RTT_OUT/24 tag rttout
> -pass out route-to $RT_IN@$PF_IFOUT  tagged rttout
> -pass in  to $RTT_OUT6/64tag rttout
> -pass out route-to $RT_IN6@$PF_IFOUT tagged rttout
> +pass in  to $RTT_OUT/24   tag rttout
> +pass out route-to $RT_IN  tagged rttout
> +pass in  to $RTT_OUT6/64  tag rttout
> +pass out route-to $RT_IN6 tagged rttout
> 
> -pass in  from $RPT_IN/24  reply-to $SRC_OUT@$PF_IFIN  tag rptin
> -pass out  tagged rptin
> -pass in  from $RPT_IN6/64 reply-to $SRC_OUT6@$PF_IFIN tag rptin
> -pass out  tagged rptin
> +pass in  from $RPT_IN/24  reply-to $SRC_OUT  tag rptin
> +pass out tagged rptin
> +pass in  from $RPT_IN6/64 reply-to $SRC_OUT6 tag rptin
> +pass out tagged rptin
> 
> -pass in  from $RPT_OUT/24  tag rptout
> -pass out   reply-to $SRC_OUT@$PF_IFIN  tagged rptout
> -pass in  from $RPT_OUT6/64 tag rptout
> -pass out   reply-to $SRC_OUT6@$PF_IFIN tagged rptout
> +pass in  from $RPT_OUT/24 tag rptout
> +pass out   reply-to $SRC_OUT  tagged rptout
> +pass in  from $RPT_OUT6/64tag rptout
> +pass out   reply-to $SRC_OUT6 tagged rptout
> Index: regress/sys/net/pf_fragment/pf.conf
> ===
> RCS file: /mount/openbsd/cvs/src/regress/sys/net/pf_fragment/pf.conf,v
> retrieving revision 1.5
> diff -u -p -r1.5 pf.conf
> --- regress/sys/net/pf_fragment/pf.conf   7 Jun 2017 20:09:07 -   
> 1.5
> +++ regress/sys/net/pf_fragment/pf.conf   3 Jan 2021 23:28:07 -
> @@ -10,7 +10,7 @@ pass outnat-to $PF_OUT
>  pass in  to $RDR_IN6/64 rdr-to $ECO_IN6 allow-opts tag rdr
>  pass outnat-to $PF_OUT6 allow-opts tagged rdr
> 
> -pass in  to $RTT_IN/24 allow-opts tag rtt
> -pass outroute-to $RT_IN@$PF_IFOUT  allow-opts tagged rtt
> -pass in  to $RTT_IN6/64allow-opts tag rtt
> -pass outroute-to $RT_IN6@$PF_IFOUT allow-opts tagged rtt
> +pass in  to $RTT_IN/24   allow-opts tag rtt
> +pass outroute-to $RT_IN  allow-opts tagged rtt
> +pass in  to $RTT_IN6/64  allow-opts tag rtt
> +pass outroute-to $RT_IN6 allow-opts tagged rtt
> 



Re: pf route-to issues

2021-01-03 Thread David Gwynne
On Sun, Jan 03, 2021 at 06:56:20PM +0100, Alexander Bluhm wrote:
> On Sun, Jan 03, 2021 at 02:00:00PM +1000, David Gwynne wrote:
> > On Tue, Oct 20, 2020 at 09:27:09AM +1000, David Gwynne wrote:
> > We've been running this diff in production for the last couple of
> > months, and it's been solid for us so far. Ignoring the fixes for
> > crashes, I personally find it a lot more usable than the current
> > route-to rules too.
> >
> > Can I commit it?
> 
> The diff is quite large and does multiple things at a time.

In hindsight I agree. It was hard to see while being so close to it.

> In general I also did not understand why I have to say em0@10.0.0.1
> for routing and it took me a while to figure out what to put into
> pf.conf.  I use this syntax in /usr/src/regress/sys/net/pf_forward/pf.conf.
> This has to be fixed after this goes in.  I will care about regress
> as this test is quite complex and needs several machines to set up.
> I am currently running a full regress to find more fallout.
> 
> I do not use pfsync, so I cannot say what the consequences of the
> change are in this area.  Also I don't know why pf-route interfaces
> were designed in such a strange way.

we do use pfsync, and not being able to use it with route-to has been a
point of friction for us for years.

as for the design, i think it was copied (imperfectly) from ipfilter.
look for "Policy Based Routing" in
https://www.freebsd.org/cgi/man.cgi?query=ipf&sektion=5.

> From a user perspective it is not clear, why route-to should not
> work together with no-state.  So we should either fix it or document
> it and add a check in the parser.  Is fixing hard?

pf_route only takes a state now, so i'd say it is non-trivial. for now
i'll go with documentation and a check in the parser.

> Are we losing any other features apart from this strange arp reuse
> you described in your mail?

i wouldn't say the arp reuse is a feature.

> There is some refactoring in your diff.  I splitted it to make
> review easier.  I think this should go in first.  Note that the
> pf_state variable is called st in if_pfsync.c.  Can we be consistent
> here?  Is the pfsync_state properly aligned?  During import it comes
> from an mbuf.

the stack should provide it on a 4 byte boundary, but it has uint64_t
members. however, it is also __packed, so the compiler makes no
assumptions about alignment.

> Is there anything else that can be split out easily?

let's put this in and then i'll have a look. ok by me.

> 
> bluhm
> 
> Index: net/if_pfsync.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/net/if_pfsync.c,v
> retrieving revision 1.279
> diff -u -p -r1.279 if_pfsync.c
> --- net/if_pfsync.c   12 Dec 2020 11:49:02 -  1.279
> +++ net/if_pfsync.c   3 Jan 2021 17:16:55 -
> @@ -612,7 +612,7 @@ pfsync_state_import(struct pfsync_state
>   st->rtableid[PF_SK_STACK] = ntohl(sp->rtableid[PF_SK_STACK]);
> 
>   /* copy to state */
> - bcopy(&sp->rt_addr, &st->rt_addr, sizeof(st->rt_addr));
> + st->rt_addr = sp->rt_addr;
>   st->creation = getuptime() - ntohl(sp->creation);
>   st->expire = getuptime();
>   if (ntohl(sp->expire)) {
> @@ -1843,6 +1843,7 @@ pfsync_undefer(struct pfsync_deferral *p
>  {
>   struct pfsync_softc *sc = pfsyncif;
>   struct pf_pdesc pdesc;
> + struct pf_state *st = pd->pd_st;
> 
>   NET_ASSERT_LOCKED();
> 
> @@ -1852,35 +1853,32 @@ pfsync_undefer(struct pfsync_deferral *p
>   TAILQ_REMOVE(&sc->sc_deferrals, pd, pd_entry);
>   sc->sc_deferred--;
> 
> - CLR(pd->pd_st->state_flags, PFSTATE_ACK);
> + CLR(st->state_flags, PFSTATE_ACK);
>   if (drop)
>   m_freem(pd->pd_m);
>   else {
> - if (pd->pd_st->rule.ptr->rt == PF_ROUTETO) {
> - if (pf_setup_pdesc(&pdesc,
> - pd->pd_st->key[PF_SK_WIRE]->af,
> - pd->pd_st->direction, pd->pd_st->rt_kif,
> - pd->pd_m, NULL) != PF_PASS) {
> + if (st->rule.ptr->rt == PF_ROUTETO) {
> + if (pf_setup_pdesc(&pdesc, st->key[PF_SK_WIRE]->af,
> + st->direction, st->kif, pd->pd_m, NULL) !=
> + PF_PASS) {
>   m_freem(pd->pd_m);
>   goto out;
>   }
> - switch (pd->pd_st->key[PF_SK_WIRE]->af) {
> + switch (st->key[PF_SK_WIRE]->af) {
>   case AF_INET:
> - pf_

Re: pf route-to issues

2021-01-02 Thread David Gwynne
On Tue, Oct 20, 2020 at 09:27:09AM +1000, David Gwynne wrote:
> 
> i am feeling very warm and fuzzy about this diff at the moment.

We've been running this diff in production for the last couple of
months, and it's been solid for us so far. Ignoring the fixes for
crashes, I personally find it a lot more usable than the current
route-to rules too.

Can I commit it?

Index: sbin/pfctl/parse.y
===
RCS file: /cvs/src/sbin/pfctl/parse.y,v
retrieving revision 1.707
diff -u -p -r1.707 parse.y
--- sbin/pfctl/parse.y  16 Dec 2020 18:01:16 -  1.707
+++ sbin/pfctl/parse.y  3 Jan 2021 03:53:02 -
@@ -276,6 +276,7 @@ struct filter_opts {
struct redirspec nat;
struct redirspec rdr;
struct redirspec rroute;
+   u_int8_t rt;
 
/* scrub opts */
int  nodf;
@@ -284,15 +285,6 @@ struct filter_opts {
int  randomid;
int  max_mss;
 
-   /* route opts */
-   struct {
-   struct node_host*host;
-   u_int8_t rt;
-   u_int8_t pool_opts;
-   sa_family_t  af;
-   struct pf_poolhashkey   *key;
-   }route;
-
struct {
u_int32_t   limit;
u_int32_t   seconds;
@@ -518,7 +510,6 @@ int parseport(char *, struct range *r, i
%type	<v.host>	ipspec xhost host dynaddr host_list
%type	<v.host>	table_host_list tablespec
%type	<v.host>	redir_host_list redirspec
-%type	<v.host>	route_host route_host_list routespec
%type	<v.os>		os xos os_list
%type	<v.port>	portspec port_list port_item
%type	<v.uid>		uids uid_list uid_item
@@ -975,7 +966,7 @@ anchorrule  : ANCHOR anchorname dir quick
YYERROR;
}
 
-   if ($9.route.rt) {
+   if ($9.rt) {
yyerror("cannot specify route handling "
"on anchors");
YYERROR;
@@ -1843,37 +1834,13 @@ pfrule  : action dir logquick interface 
decide_address_family($7.src.host, );
decide_address_family($7.dst.host, );
 
-   if ($8.route.rt) {
-   if (!r.direction) {
+   if ($8.rt) {
+   if ($8.rt != PF_DUPTO && !r.direction) {
yyerror("direction must be explicit "
"with rules that specify routing");
YYERROR;
}
-   r.rt = $8.route.rt;
-   r.route.opts = $8.route.pool_opts;
-   if ($8.route.key != NULL)
-   memcpy(, $8.route.key,
-   sizeof(struct pf_poolhashkey));
-   }
-   if (r.rt) {
-   decide_address_family($8.route.host, );
-   if ((r.route.opts & PF_POOL_TYPEMASK) ==
-   PF_POOL_NONE && ($8.route.host->next != 
NULL ||
-   $8.route.host->addr.type == PF_ADDR_TABLE ||
-   DYNIF_MULTIADDR($8.route.host->addr)))
-   r.route.opts |= PF_POOL_ROUNDROBIN;
-   if ($8.route.host->next != NULL) {
-   if (!PF_POOL_DYNTYPE(r.route.opts)) {
-   yyerror("address pool option "
-   "not supported by type");
-   YYERROR;
-   }
-   }
-   /* fake redirspec */
-   if (($8.rroute.rdr = calloc(1,
-   sizeof(*$8.rroute.rdr))) == NULL)
-   err(1, "$8.rroute.rdr");
-   $8.rroute.rdr->host = $8.route.host;
+   r.rt = $8.rt;
}
 
if (expand_divertspec(, &$8.divert))
@@ -2137,30 +2104,14 @@ filter_opt  : USER uids {
sizeof(filter_opts.nat.pool_opts));
filter_opts.nat.pool_opts.staticport = 1;
}
-   | ROUTETO rout

use stoeplitz to set flowids on tcp connections

2021-01-02 Thread David Gwynne
if stoeplitz is enabled by a driver (eg, ix, mcx, etc), this uses it in
the tcp code to set the flowid on packets. this encourages both the tx
and rx side of a tcp connection to get processed in the same places.

ok?

Index: netinet/in_pcb.c
===
RCS file: /cvs/src/sys/netinet/in_pcb.c,v
retrieving revision 1.252
diff -u -p -r1.252 in_pcb.c
--- netinet/in_pcb.c7 Nov 2020 09:51:40 -   1.252
+++ netinet/in_pcb.c3 Jan 2021 02:12:45 -
@@ -95,6 +95,11 @@
 #include 
 #endif /* IPSEC */
 
+#include "stoeplitz.h"
+#if NSTOEPLITZ > 0
+#include <net/toeplitz.h>
+#endif
+
 const struct in_addr zeroin_addr;
 
 union {
@@ -516,6 +521,10 @@ in_pcbconnect(struct inpcb *inp, struct 
inp->inp_faddr = sin->sin_addr;
inp->inp_fport = sin->sin_port;
in_pcbrehash(inp);
+#if NSTOEPLITZ > 0
+   inp->inp_flowid = stoeplitz_ip4port(inp->inp_laddr.s_addr,
+   inp->inp_faddr.s_addr, inp->inp_lport, inp->inp_fport);
+#endif
 #ifdef IPSEC
{
/* Cause an IPsec SA to be established. */
@@ -549,6 +558,7 @@ in_pcbdisconnect(struct inpcb *inp)
}
 
inp->inp_fport = 0;
+   inp->inp_flowid = 0;
in_pcbrehash(inp);
if (inp->inp_socket->so_state & SS_NOFDREF)
in_pcbdetach(inp);
Index: netinet/in_pcb.h
===
RCS file: /cvs/src/sys/netinet/in_pcb.h,v
retrieving revision 1.120
diff -u -p -r1.120 in_pcb.h
--- netinet/in_pcb.h21 Jun 2020 05:14:04 -  1.120
+++ netinet/in_pcb.h3 Jan 2021 02:12:45 -
@@ -148,6 +148,7 @@ struct inpcb {
void*inp_upcall_arg;
u_int   inp_rtableid;
int inp_pipex;  /* pipex indication */
+   uint16_t inp_flowid;
 };
 
 LIST_HEAD(inpcbhead, inpcb);
Index: netinet/tcp_output.c
===
RCS file: /cvs/src/sys/netinet/tcp_output.c,v
retrieving revision 1.128
diff -u -p -r1.128 tcp_output.c
--- netinet/tcp_output.c10 Nov 2018 18:40:34 -  1.128
+++ netinet/tcp_output.c3 Jan 2021 02:12:45 -
@@ -69,6 +69,7 @@
  */
 
 #include "pf.h"
+#include "stoeplitz.h"
 
 #include 
 #include 
@@ -1037,6 +1038,10 @@ send:
ip->ip_tos |= IPTOS_ECN_ECT0;
 #endif
}
+#if NSTOEPLITZ > 0
+   m->m_pkthdr.ph_flowid = tp->t_inpcb->inp_flowid;
+   SET(m->m_pkthdr.csum_flags, M_FLOWID);
+#endif
error = ip_output(m, tp->t_inpcb->inp_options,
&tp->t_inpcb->inp_route,
(ip_mtudisc ? IP_MTUDISC : 0), NULL, tp->t_inpcb, 0);
Index: netinet6/in6_pcb.c
===
RCS file: /cvs/src/sys/netinet6/in6_pcb.c,v
retrieving revision 1.110
diff -u -p -r1.110 in6_pcb.c
--- netinet6/in6_pcb.c  29 Nov 2019 16:41:01 -  1.110
+++ netinet6/in6_pcb.c  3 Jan 2021 02:12:45 -
@@ -100,6 +100,7 @@
  */
 
 #include "pf.h"
+#include "stoeplitz.h"
 
 #include 
 #include 
@@ -119,6 +120,10 @@
 
 #include 
 
+#if NSTOEPLITZ > 0
+#include <net/toeplitz.h>
+#endif
+
 const struct in6_addr zeroin6_addr;
 
 struct inpcbhead *
@@ -297,6 +302,10 @@ in6_pcbconnect(struct inpcb *inp, struct
if (ip6_auto_flowlabel)
inp->inp_flowinfo |=
(htonl(ip6_randomflowlabel()) & IPV6_FLOWLABEL_MASK);
+#if NSTOEPLITZ > 0
+   inp->inp_flowid = stoeplitz_ip6port(&inp->inp_laddr6,
+   &inp->inp_faddr6, inp->inp_lport, inp->inp_fport);
+#endif
in_pcbrehash(inp);
return (0);
 }



Re: bpf(4): remove ticks

2020-12-28 Thread David Gwynne
On Mon, Dec 28, 2020 at 06:56:08PM -0600, Scott Cheloha wrote:
> On Mon, Dec 28, 2020 at 10:49:59AM +1000, David Gwynne wrote:
> > On Sat, Dec 26, 2020 at 04:48:23PM -0600, Scott Cheloha wrote:
> > > Now that we've removed bd_rdStart from the bpf_d struct, removing
> > > ticks from bpf(4) itself is straightforward.
> > > 
> > > - bd_rtout becomes a timespec; update bpfioctl() accordingly.
> > >   Cap it at MAXTSLP nanoseconds to avoid arithmetic overflow
> > >   in bpfread().
> > > 
> > > - At the start of bpfread(), if a timeout is set, find the end
> > >   of the read as an absolute uptime.  This is the point where
> > >   we want to avoid overflow: if bd_rtout is only MAXTSLP
> > >   nanoseconds the timespecadd(3) will effectively never overflow.
> > > 
> > > - Before going to sleep, if we have a timeout set, compute how
> > >   much longer to sleep in nanoseconds.
> > > 
> > >   Here's a spot where an absolute timeout sleep would save a
> > >   little code, but we don't have such an interface yet.  Worth
> > >   keeping in mind for the future, though.
> > 
> > Are there any other places that would be useful though? bpf is pretty
> > special.
> 
> kqueue_scan() in kern_event.c can have a spurious wakeup, so an
> absolute sleep would be useful there.  doppoll() in sys_generic.c is
> the same, though I think mpi@/visa@ intend to refactor it to use
> kqueue_scan().
> 
> In general, if you have a thread that wants to do something on a
> strict period you need to use an absolute sleep to avoid drift.

True, something to keep in mind then.

> This code drifts:
> 
>   for (;;) {
>   do_work();
>   tsleep_nsec(, PPAUSE, "worker", SEC_TO_NSEC(1));
>   }
> 
> While this code will not:
> 
>   uint64_t deadline;
> 
>   deadline = nsecuptime();
>   for (;;) {
>   do_work();
>   deadline = nsec_advance(deadline, SEC_TO_NSEC(1));
>   tsleep_abs_nsec(, PPAUSE, "worker", deadline);
>   }
> 
> (Some of those interfaces don't actually exist, but they are easy to
> write and you can infer how they work.)
> 
> Most developers probably do not care about maintaining a strict period
> for periodic workloads, but I have a suspicion that it would keep
> system performance more deterministic because you don't have various
> periodic workloads drifting into one another, overlapping, and
> momentarily causing utilization spikes.
> 
> I know that probably sounds far-fetched... it's an idea I've been
> fussing with.
> 
> > > dlg@: You said you wanted to simplify this loop.  Unsure what shape
> > > that cleanup would take.  Should we clean it up before this change?
> > 
> > I wanted to have a single msleep_nsec call and pass INFSLP when it
> > should sleep forever.. You saw my first attempt at that. It had
> > issues.
> > 
> > > Thoughts?  ok?
> > 
> > How would this look if you used a uint64_t for nsecs for bd_rtout,
> > and the nsec uptime thing from your pool diff instead of timespecs
> > and nanouptime?
> 
> See the attached patch.  It is shorter because we can do more inline
> stuff with a uint64_t than with a timespec.

I like it.

> 
> > What's the thinking behind nanouptime instead of getnanouptime?
> 
> In general we should prefer high resolution time unless there is a
> compelling performance reason to use low-res time.
> 
> In particular, we should prefer high res time whenever userspace
> timeouts are involved as userspace can only use high-res time.
> 
> For instance, when tsleep_nsec/etc. are reimplemented with kclock
> timeouts (soon!) I will remove the use of low-res time from
> nanosleep(2), select(2)/pselect(2), poll(2)/ppoll(2), and kevent(2).
> The use of low-res time in these interfaces can cause undersleep right
> now.  They're buggy.
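The undersleep can be modelled in a few lines (all names and numbers below are invented for illustration): when the deadline is computed from a low-res timestamp that lags the true uptime, the resulting sleep comes up short by exactly that lag.

```c
#include <stdint.h>

/*
 * Toy model of the undersleep hazard: `coarse_now' is a low-res,
 * getnanouptime(9)-style timestamp that lags the true uptime. A
 * deadline computed from it expires early, so the caller sleeps less
 * than the requested timeout by the lag of the low-res clock.
 */
uint64_t
undersleep_ns(uint64_t true_now, uint64_t coarse_now, uint64_t timeout)
{
	uint64_t deadline = coarse_now + timeout;	/* from the stale clock */

	return (timeout - (deadline - true_now));	/* nanoseconds lost */
}
```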
> 
> > More generally, what will getnanouptime do when ticks go away?
> 
> Ticks will probably never "go away" entirely.
> 
> In general, some CPU is always going to need to call tc_windup()
> regularly to keep time moving forward.  When tc_windup() is called we
> update the low-res timestamps.  So getnanouptime(9)/etc. will continue
> to work as they do today.  I also imagine that the CPU responsible for
> calling tc_windup() will continue to increment `ticks' and `jiffies'.

Only one CPU needs to do that though, not all CPUs need to tick.

> In the near-ish future I want to add support for dynamic clock
> interrupts.  This would permit a CPU to stay idle, or drop into a
> deeper power-saving state

Re: bpf_catchpacket and bpf_wakeup optimisations

2020-12-28 Thread David Gwynne
On Mon, Dec 28, 2020 at 02:45:06PM +1000, David Gwynne wrote:
> now that bpf read timeouts are only handled on the bpfread() side,
> there's a simplification that can be made in bpf_catchpacket. the chunk
> in bpf_catchpacket that rotates the buffers when one gets full already
> does a wakeup, so we don't have to check if we have any waiting readers
> and wake them up when a buffer gets full.
> 
> we can use bd_nreaders to omgoptimise bpf_wakeup though. wakeup(9) is
> mpsafe, so we don't have to defer the call to a task. however, we can
> avoid calling wakeup() and therefore trying to take the sched lock and
> all that stuff when we know there's nothing sleeping.
> 
> this also avoids scheduling the task if there's no async stuff set up.
> it's a bit magical because it knows what's inside selwakeup.
> 
> tests? ok?

visa pointed out that i missed a bit relating to kq.

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.199
diff -u -p -r1.199 bpf.c
--- bpf.c   26 Dec 2020 16:30:58 -  1.199
+++ bpf.c   29 Dec 2020 06:04:58 -
@@ -554,7 +554,6 @@ out:
return (error);
 }
 
-
 /*
  * If there are processes sleeping on this descriptor, wake them up.
  */
@@ -563,14 +562,20 @@ bpf_wakeup(struct bpf_d *d)
 {
	MUTEX_ASSERT_LOCKED(&d->bd_mtx);
 
+   if (d->bd_nreaders)
+   wakeup(d);
+
/*
 * As long as pgsigio() and selwakeup() need to be protected
 * by the KERNEL_LOCK() we have to delay the wakeup to
 * another context to keep the hot path KERNEL_LOCK()-free.
 */
-   bpf_get(d);
-   if (!task_add(systq, >bd_wake_task))
-   bpf_put(d);
+   if ((d->bd_async && d->bd_sig) ||
+   (!klist_empty(&d->bd_sel.si_note) || d->bd_sel.si_seltid != 0)) {
+   bpf_get(d);
+   if (!task_add(systq, &d->bd_wake_task))
+   bpf_put(d);
+   }
 }
 
 void
@@ -578,7 +583,6 @@ bpf_wakeup_cb(void *xd)
 {
struct bpf_d *d = xd;
 
-   wakeup(d);
if (d->bd_async && d->bd_sig)
	pgsigio(&d->bd_sigio, d->bd_sig, 0);
 
@@ -1542,17 +1546,6 @@ bpf_catchpacket(struct bpf_d *d, u_char 
 * reads should be woken up.
 */
do_wakeup = 1;
-   }
-
-   if (d->bd_nreaders > 0) {
-   /*
-* We have one or more threads sleeping in bpfread().
-* We got a packet, so wake up all readers.
-*/
-   if (d->bd_fbuf != NULL) {
-   ROTATE_BUFFERS(d);
-   do_wakeup = 1;
-   }
}
 
if (do_wakeup)


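The shape of that fast path can be sketched in isolation (a userland model, not the real bpf code; plain ints stand in for bd_nreaders and the async/select consumers):

```c
/*
 * Userland model of the bpf_wakeup() fast path described above:
 * wakeup(9) is only worth calling when a reader is actually asleep,
 * and the deferred task is only worth scheduling when an async (SIGIO)
 * or select/kqueue consumer exists. Returns a bitmask:
 * 0x1 = would call wakeup(9), 0x2 = would task_add(9).
 */
int
model_wakeup(int nreaders, int consumers)
{
	int acts = 0;

	if (nreaders)
		acts |= 0x1;	/* wakeup(9) is mpsafe, no need to defer */
	if (consumers)
		acts |= 0x2;	/* pgsigio/selwakeup still need the task */

	return (acts);
}
```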

Re: Fwd: gre(4): mgre

2020-12-28 Thread David Gwynne
On Sun, Nov 29, 2020 at 08:30:23PM +0100, Pierre Emeriaud wrote:
> On Sat, 28 Nov 2020 at 21:46, Jason McIntyre  wrote:
> > > > > +.Bd -literal
> > > > add "-offset indent" to match the other examples
> > > Done, although I copied this block from gre example, so there's
> > > another occurrence here which I didn't touch.
> > >
> >
> > yes, sorry, that's my mistake. i think the width of the gre example
> > probably caused that. so i think you should keep your original text
> > (i.e. no indent for the artwork; indent for commands).
> 
> There you are. David, does this look sound to you?

yes, ok by me.

> Index: share/man/man4/gre.4
> ===
> RCS file: /cvs/src/share/man/man4/gre.4,v
> retrieving revision 1.79
> diff -u -p -u -r1.79 gre.4
> --- share/man/man4/gre.418 Nov 2020 16:19:54 -  1.79
> +++ share/man/man4/gre.429 Nov 2020 19:26:28 -
> @@ -455,6 +455,67 @@ In most cases the following should work:
>  .Bd -literal -offset indent
>  pass quick on gre proto gre no state
>  .Ed
> +.Ss Point-to-Multipoint Layer 3 GRE tunnel interfaces (mgre) example
> +.Nm mgre
> +can be used to build a point-to-multipoint tunnel network to several
> +hosts using a single
> +.Nm mgre
> +interface.
> +.Pp
> +In this example the host A has an outer IP of 198.51.100.12, host
> +B has 203.0.113.27, and host C has 203.0.113.254.
> +.Pp
> +Addressing within the tunnel is done using 192.0.2.0/24:
> +.Bd -literal
> ++--- Host B
> +   /
> +  /
> +Host A --- tunnel ---+
> +  \e
> +   \e
> ++--- Host C
> +.Ed
> +.Pp
> +On Host A:
> +.Bd -literal -offset indent
> +# ifconfig mgreN create
> +# ifconfig mgreN tunneladdr 198.51.100.12
> +# ifconfig mgreN inet 192.0.2.1 netmask 0xff00 up
> +.Ed
> +.Pp
> +On Host B:
> +.Bd -literal -offset indent
> +# ifconfig mgreN create
> +# ifconfig mgreN tunneladdr 203.0.113.27
> +# ifconfig mgreN inet 192.0.2.2 netmask 0xff00 up
> +.Ed
> +.Pp
> +On Host C:
> +.Bd -literal -offset indent
> +# ifconfig mgreN create
> +# ifconfig mgreN tunneladdr 203.0.113.254
> +# ifconfig mgreN inet 192.0.2.3 netmask 0xff00 up
> +.Ed
> +.Pp
> +To reach Host B over the tunnel (from Host A), there has to be a
> +route on Host A specifying the next-hop:
> +.Pp
> +.Dl # route add -host 192.0.2.2 203.0.113.27 -iface -ifp mgreN
> +.Pp
> +Similarly, to reach Host A over the tunnel from Host B, a route must
> +be present on B with A's outer IP as next-hop:
> +.Pp
> +.Dl # route add -host 192.0.2.1 198.51.100.12 -iface -ifp mgreN
> +.Pp
> +The same tunnel interface can then be used between host B and C by
> +adding the appropriate routes, making the network any-to-any instead
> +of hub-and-spoke:
> +.Pp
> +On Host B:
> +.Dl # route add -host 192.0.2.3 203.0.113.254 -iface -ifp mgreN
> +.Pp
> +On Host C:
> +.Dl # route add -host 192.0.2.2 203.0.113.27 -iface -ifp mgreN
>  .Ss Point-to-Point Ethernet over GRE tunnel interfaces (egre) example
>  .Nm egre
>  can be used to carry Ethernet traffic between two endpoints over



bpf_catchpacket and bpf_wakeup optimisations

2020-12-27 Thread David Gwynne
now that bpf read timeouts are only handled on the bpfread() side,
there's a simplification that can be made in bpf_catchpacket. the chunk
in bpf_catchpacket that rotates the buffers when one gets full already
does a wakeup, so we don't have to check if we have any waiting readers
and wake them up when a buffer gets full.

we can use bd_nreaders to omgoptimise bpf_wakeup though. wakeup(9) is
mpsafe, so we don't have to defer the call to a task. however, we can
avoid calling wakeup() and therefore trying to take the sched lock and
all that stuff when we know there's nothing sleeping.

this also avoids scheduling the task if there's no async stuff set up.
it's a bit magical because it knows what's inside selwakeup.

tests? ok?

Index: bpf.c
===
RCS file: /cvs/src/sys/net/bpf.c,v
retrieving revision 1.199
diff -u -p -r1.199 bpf.c
--- bpf.c   26 Dec 2020 16:30:58 -  1.199
+++ bpf.c   28 Dec 2020 03:05:36 -
@@ -563,14 +563,19 @@ bpf_wakeup(struct bpf_d *d)
 {
	MUTEX_ASSERT_LOCKED(&d->bd_mtx);
 
+   if (d->bd_nreaders)
+   wakeup(d);
+
/*
 * As long as pgsigio() and selwakeup() need to be protected
 * by the KERNEL_LOCK() we have to delay the wakeup to
 * another context to keep the hot path KERNEL_LOCK()-free.
 */
-   bpf_get(d);
-   if (!task_add(systq, &d->bd_wake_task))
-   bpf_put(d);
+   if ((d->bd_async && d->bd_sig) || (d->bd_sel.si_seltid != 0)) {
+   bpf_get(d);
+   if (!task_add(systq, &d->bd_wake_task))
+   bpf_put(d);
+   }
 }
 
 void
@@ -578,7 +583,6 @@ bpf_wakeup_cb(void *xd)
 {
struct bpf_d *d = xd;
 
-   wakeup(d);
if (d->bd_async && d->bd_sig)
	pgsigio(&d->bd_sigio, d->bd_sig, 0);
 
@@ -1542,17 +1546,6 @@ bpf_catchpacket(struct bpf_d *d, u_char 
 * reads should be woken up.
 */
do_wakeup = 1;
-   }
-
-   if (d->bd_nreaders > 0) {
-   /*
-* We have one or more threads sleeping in bpfread().
-* We got a packet, so wake up all readers.
-*/
-   if (d->bd_fbuf != NULL) {
-   ROTATE_BUFFERS(d);
-   do_wakeup = 1;
-   }
}
 
if (do_wakeup)



Re: bpf(4): remove ticks

2020-12-27 Thread David Gwynne
On Sat, Dec 26, 2020 at 04:48:23PM -0600, Scott Cheloha wrote:
> Now that we've removed bd_rdStart from the bpf_d struct, removing
> ticks from bpf(4) itself is straightforward.
> 
> - bd_rtout becomes a timespec; update bpfioctl() accordingly.
>   Cap it at MAXTSLP nanoseconds to avoid arithmetic overflow
>   in bpfread().
> 
> - At the start of bpfread(), if a timeout is set, find the end
>   of the read as an absolute uptime.  This is the point where
>   we want to avoid overflow: if bd_rtout is only MAXTSLP
>   nanoseconds the timespecadd(3) will effectively never overflow.
> 
> - Before going to sleep, if we have a timeout set, compute how
>   much longer to sleep in nanoseconds.
> 
>   Here's a spot where an absolute timeout sleep would save a
>   little code, but we don't have such an interface yet.  Worth
>   keeping in mind for the future, though.

Are there any other places that would be useful though? bpf is pretty
special.

> dlg@: You said you wanted to simplify this loop.  Unsure what shape
> that cleanup would take.  Should we clean it up before this change?

I wanted to have a single msleep_nsec call and pass INFSLP when it
should sleep forever.. You saw my first attempt at that. It had
issues.

> Thoughts?  ok?

How would this look if you used a uint64_t for nsecs for bd_rtout,
and the nsec uptime thing from your pool diff instead of timespecs
and nanouptime?

What's the thinking behind nanouptime instead of getnanouptime? More
generally, what will getnanouptime do when ticks go away?

dlg

> 
> Index: bpf.c
> ===
> RCS file: /cvs/src/sys/net/bpf.c,v
> retrieving revision 1.199
> diff -u -p -r1.199 bpf.c
> --- bpf.c 26 Dec 2020 16:30:58 -  1.199
> +++ bpf.c 26 Dec 2020 22:05:04 -
> @@ -60,6 +60,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -77,9 +78,6 @@
>  
>  #define PRINET  26   /* interruptible */
>  
> -/* from kern/kern_clock.c; incremented each clock tick. */
> -extern int ticks;
> -
>  /*
>   * The default read buffer size is patchable.
>   */
> @@ -380,7 +378,7 @@ bpfopen(dev_t dev, int flag, int mode, s
>   smr_init(>bd_smr);
>   sigio_init(>bd_sigio);
>  
> - bd->bd_rtout = 0;   /* no timeout by default */
> + timespecclear(&bd->bd_rtout);   /* no timeout by default */
>   bd->bd_rnonblock = ISSET(flag, FNONBLOCK);
>  
>   bpf_get(bd);
> @@ -428,9 +426,11 @@ bpfclose(dev_t dev, int flag, int mode, 
>  int
>  bpfread(dev_t dev, struct uio *uio, int ioflag)
>  {
> + struct timespec diff, end, now;
> + uint64_t nsecs;
>   struct bpf_d *d;
>   caddr_t hbuf;
> - int end, error, hlen, nticks;
> + int error, hlen, timeout;
>  
>   KERNEL_ASSERT_LOCKED();
>  
> @@ -453,8 +453,11 @@ bpfread(dev_t dev, struct uio *uio, int 
>   /*
>* If there's a timeout, mark when the read should end.
>*/
> - if (d->bd_rtout)
> - end = ticks + (int)d->bd_rtout;
> + timeout = timespecisset(&d->bd_rtout);
> + if (timeout) {
> + nanouptime(&now);
> + timespecadd(&now, &d->bd_rtout, &end);
> + }
>  
>   /*
>* If the hold buffer is empty, then do a timed sleep, which
> @@ -483,21 +486,26 @@ bpfread(dev_t dev, struct uio *uio, int 
>   if (d->bd_rnonblock) {
>   /* User requested non-blocking I/O */
>   error = EWOULDBLOCK;
> - } else if (d->bd_rtout == 0) {
> + } else if (timeout == 0) {
>   /* No read timeout set. */
>   d->bd_nreaders++;
>   error = msleep_nsec(d, &d->bd_mtx, PRINET|PCATCH,
>   "bpf", INFSLP);
>   d->bd_nreaders--;
> - } else if ((nticks = end - ticks) > 0) {
> - /* Read timeout has not expired yet. */
> - d->bd_nreaders++;
> - error = msleep(d, >bd_mtx, PRINET|PCATCH, "bpf",
> - nticks);
> - d->bd_nreaders--;
>   } else {
> - /* Read timeout has expired. */
> - error = EWOULDBLOCK;
> + nanouptime(&now);
> + if (timespeccmp(&now, &end, <)) {
> + /* Read timeout has not expired yet. */
> + timespecsub(&end, &now, &diff);
> + nsecs = TIMESPEC_TO_NSEC(&diff);
> + d->bd_nreaders++;
> + error = msleep_nsec(d, &d->bd_mtx,
> + PRINET|PCATCH, "bpf", nsecs);
> + d->bd_nreaders--;
> + } else {
> + /* Read timeout has expired. */
> + error = EWOULDBLOCK;
> + }
>   }
>   if (error == 

Re: tht(4): more tsleep(9) -> tsleep_nsec(9) conversions

2020-12-16 Thread David Gwynne
ok

> On 17 Dec 2020, at 04:22, Scott Cheloha  wrote:
> 
> On Thu, Dec 03, 2020 at 09:59:11PM -0600, Scott Cheloha wrote:
>> Hi,
>> 
>> tht(4) is another driver still using tsleep(9).
>> 
>> It uses it to spin while it waits for the card to load the firmware.
>> Then it uses it to spin for up to 2 seconds while waiting for
>> THT_REG_INIT_STATUS.
>> 
>> In the firmware case we can sleep for 10 milliseconds each iteration.
>> 
>> In the THT_REG_INIT_STATUS loop we can sleep for 10 milliseconds each
>> iteration again, but instead of using a timeout to set a flag after 2
>> seconds we can just count how many milliseconds we've slept.  This is
>> less precise than using the timeout but it is much simpler.  Obviously
>> we then need to remove all the timeout-related stuff from the function
>> and the file.
>> 
>> Thoughts?  ok?
> 
> Two week bump.
> 
> Index: if_tht.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_tht.c,v
> retrieving revision 1.142
> diff -u -p -r1.142 if_tht.c
> --- if_tht.c  10 Jul 2020 13:26:38 -  1.142
> +++ if_tht.c  4 Dec 2020 03:57:21 -
> @@ -582,7 +582,6 @@ void  tht_lladdr_read(struct 
> tht_softc 
> void  tht_lladdr_write(struct tht_softc *);
> int   tht_sw_reset(struct tht_softc *);
> int   tht_fw_load(struct tht_softc *);
> -void tht_fw_tick(void *arg);
> void  tht_link_state(struct tht_softc *);
> 
> /* interface operations */
> @@ -1667,11 +1666,9 @@ tht_sw_reset(struct tht_softc *sc)
> int
> tht_fw_load(struct tht_softc *sc)
> {
> - struct timeout  ticker;
> - volatile intok = 1;
>   u_int8_t*fw, *buf;
>   size_t  fwlen, wrlen;
> - int error = 1;
> + int error = 1, msecs, ret;
> 
>   if (loadfirmware("tht", , ) != 0)
>   return (1);
> @@ -1682,7 +1679,9 @@ tht_fw_load(struct tht_softc *sc)
>   buf = fw;
>   while (fwlen > 0) {
>   while (tht_fifo_writable(sc, &sc->sc_txt) <= THT_FIFO_GAP) {
> - if (tsleep(sc, PCATCH, "thtfw", 1) == EINTR)
> + ret = tsleep_nsec(sc, PCATCH, "thtfw",
> + MSEC_TO_NSEC(10));
> + if (ret == EINTR)
>   goto err;
>   }
> 
> @@ -1695,32 +1694,21 @@ tht_fw_load(struct tht_softc *sc)
>   buf += wrlen;
>   }
> 
> - timeout_set(&ticker, tht_fw_tick, (void *)&ok);
> - timeout_add_sec(&ticker, 2);
> - while (ok) {
> + for (msecs = 0; msecs < 2000; msecs += 10) {
>   if (tht_read(sc, THT_REG_INIT_STATUS) != 0) {
>   error = 0;
>   break;
>   }
> -
> - if (tsleep(sc, PCATCH, "thtinit", 1) == EINTR)
> + ret = tsleep_nsec(sc, PCATCH, "thtinit", MSEC_TO_NSEC(10));
> + if (ret == EINTR)
>   goto err;
>   }
> - timeout_del(&ticker);
> 
>   tht_write(sc, THT_REG_INIT_SEMAPHORE, 0x1);
> 
> err:
>   free(fw, M_DEVBUF, fwlen);
>   return (error);
> -}
> -
> -void
> -tht_fw_tick(void *arg)
> -{
> - volatile int*ok = arg;
> -
> - *ok = 0;
> }
> 
> void

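The structure of that change can be sketched outside the driver (an illustrative model; the real loop sleeps with tsleep_nsec(9) between polls):

```c
/*
 * Model of the tht(4) change: rather than arming a timeout(9) that
 * flips a flag after 2 seconds, count the milliseconds slept and give
 * up when the budget runs out. ready() stands in for reading
 * THT_REG_INIT_STATUS; the tsleep_nsec(9) call is elided.
 */
int
poll_msec_budget(int (*ready)(void *), void *arg, int budget_ms, int step_ms)
{
	int msecs;

	for (msecs = 0; msecs < budget_ms; msecs += step_ms) {
		if (ready(arg))
			return (0);	/* device came up */
		/* the driver sleeps here for MSEC_TO_NSEC(step_ms) */
	}

	return (1);	/* timed out */
}

/* demo predicate: becomes ready after a given number of polls */
int
ready_after(void *arg)
{
	int *countdown = arg;

	return (--(*countdown) <= 0);
}
```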


Re: Lock operations for knote lists

2020-12-11 Thread David Gwynne
On Fri, Dec 11, 2020 at 05:37:57PM +, Visa Hankala wrote:
> This patch extends struct klist with a callback descriptor and
> an argument. The main purpose of this is to let the kqueue subsystem
> assert when a klist should be locked, and operate the klist lock
> in klist_invalidate().

i've always felt that the klist has always had an obvious need for
locking, but it's excellent that you've actually gone and done more than
feel things about it.

> Access to a knote list of a kqueue-monitored object has to be
> serialized somehow. Because the object often has a lock for protecting
> its state, and because the object often acquires this lock at the latest
> in its f_event callback functions, I would like to use the same lock
> also for the knote lists. Uses of NOTE_SUBMIT already show a pattern
> arising.

That makes sense. If it helps, even if it just helps justify it,
this is the same approach used in kstat, for much the same reason.

> There could be an embedded lock in klist. However, such a lock would be
> redundant in many cases. The code could not rely on a single lock type
> (mutex, rwlock, something else) because the needs of monitored objects
> vary. In addition, an embedded lock would introduce new lock order
> constraints. Note that this patch does not rule out use of dedicated
> klist locks.
> 
> The patch introduces a way to associate lock operations with a klist.
> The caller can provide a custom implementation, or use a ready-made
> interface with a mutex or rwlock.
> 
> For compatibility with old code, the new code falls back to using the
> kernel lock if no specific klist initialization has been done. The
> existing code already relies on implicit initialization of klist.

I was going to ask if you could provide a struct klistops around
KERNEL_LOCK as the default, but that would involve a lot more churn to
explicitly init all the klist structs.

> Unfortunately, the size of struct klist will grow threefold.

I think we'll live.

> As the patch gives the code the ability to operate the klist lock,
> the klist API could provide variants of insert and remove actions that
> handle locking internally, for convenience. However, that I would leave
> for another patch because I would prefer to rename the current
> klist_insert() to klist_insert_locked(), and klist_remove() to
> klist_remove_locked().
> 
> The patch additionally provides three examples of usage: audio, pipes,
> and sockets. Each of these examples is logically a separate changeset.
> 
> 
> Please test and review.

I'll try to have a closer look soon.

> Index: dev/audio.c
> ===
> RCS file: src/sys/dev/audio.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 audio.c
> --- dev/audio.c   19 May 2020 06:32:24 -  1.191
> +++ dev/audio.c   11 Dec 2020 17:05:09 -
> @@ -305,11 +305,12 @@ audio_buf_wakeup(void *addr)
>  int
>  audio_buf_init(struct audio_softc *sc, struct audio_buf *buf, int dir)
>  {
> + klist_init_mutex(&buf->sel.si_note, &audio_lock);
>   buf->softintr = softintr_establish(IPL_SOFTAUDIO,
>   audio_buf_wakeup, buf);
>   if (buf->softintr == NULL) {
>   printf("%s: can't establish softintr\n", DEVNAME(sc));
> - return ENOMEM;
> + goto bad;
>   }
>   if (sc->ops->round_buffersize) {
>   buf->datalen = sc->ops->round_buffersize(sc->arg,
> @@ -323,9 +324,12 @@ audio_buf_init(struct audio_softc *sc, s
>   buf->data = malloc(buf->datalen, M_DEVBUF, M_WAITOK);
>   if (buf->data == NULL) {
>   softintr_disestablish(buf->softintr);
> - return ENOMEM;
> + goto bad;
>   }
>   return 0;
> +bad:
> + klist_free(&buf->sel.si_note);
> + return ENOMEM;
>  }
>  
>  void
> @@ -336,6 +340,7 @@ audio_buf_done(struct audio_softc *sc, s
>   else
>   free(buf->data, M_DEVBUF, buf->datalen);
>   softintr_disestablish(buf->softintr);
> + klist_free(&buf->sel.si_note);
>  }
>  
>  /*
> @@ -1256,6 +1261,7 @@ audio_attach(struct device *parent, stru
>   return;
>   }
>  
> + klist_init_mutex(&sc->mix_sel.si_note, &audio_lock);
>   sc->mix_softintr = softintr_establish(IPL_SOFTAUDIO,
>   audio_mixer_wakeup, sc);
>   if (sc->mix_softintr == NULL) {
> @@ -1451,6 +1457,7 @@ audio_detach(struct device *self, int fl
>  
>   /* free resources */
>   softintr_disestablish(sc->mix_softintr);
> + klist_free(&sc->mix_sel.si_note);
>   free(sc->mix_evbuf, M_DEVBUF, sc->mix_nent * sizeof(struct mixer_ev));
>   free(sc->mix_ents, M_DEVBUF, sc->mix_nent * sizeof(struct mixer_ctrl));
>   audio_buf_done(sc, &sc->play);
> Index: kern/kern_event.c
> ===
> RCS file: src/sys/kern/kern_event.c,v
> retrieving revision 1.147
> diff -u -p -r1.147 kern_event.c
> --- kern/kern_event.c 9 Dec 2020 18:58:19 -   1.147
> +++ 

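The callback-descriptor idea from this thread can be sketched like so (all names here are invented; the point is only the shape, and a plain int models the object's mutex):

```c
#include <assert.h>

/*
 * Sketch of the callback-descriptor idea: a klist carries a pointer to
 * a set of lock operations plus an opaque argument, so generic code
 * can assert (and operate) whatever lock the owning object uses.
 */
struct klistops_demo {
	void	(*klo_lock)(void *);
	void	(*klo_unlock)(void *);
	int	(*klo_locked)(void *);
};

struct klist_demo {
	const struct klistops_demo	*kl_ops;
	void				*kl_arg;
	int				 kl_count;	/* stand-in for the notes */
};

/* a plain int models a mutex well enough for the demo */
static void demo_lock(void *a)   { *(int *)a = 1; }
static void demo_unlock(void *a) { *(int *)a = 0; }
static int  demo_locked(void *a) { return (*(int *)a); }

static const struct klistops_demo demo_ops = {
	demo_lock, demo_unlock, demo_locked
};

void
klist_demo_insert(struct klist_demo *kl)
{
	/* the assertion the callback descriptor makes possible */
	assert(kl->kl_ops->klo_locked(kl->kl_arg));
	kl->kl_count++;
}
```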
Re: diff: refactor MCLGETI() macro

2020-12-11 Thread David Gwynne
On Wed, Oct 07, 2020 at 10:44:15PM +0200, Jan Klemkow wrote:
> Hi,
> 
> The name of the macro MCLGETI is obsolete.  It was made to use a network
> interface pointer inside.  But, now it is just used to define a special
> length and the interface pointer is discarded.
> 
> Thus, the following diff renames the macro to MCLGETL and removes the
> dead parameter ifp.
> 
> OK?

ok.

just a heads up, the rge chunk didn't apply for some reason, but should
be easy to keep an eye on before committing.

cheers,
dlg

> Bye,
> Jan
> 
> Index: share/man/man9/mbuf.9
> ===
> RCS file: /cvs/src/share/man/man9/mbuf.9,v
> retrieving revision 1.119
> diff -u -p -r1.119 mbuf.9
> --- share/man/man9/mbuf.9 8 Aug 2020 07:42:31 -   1.119
> +++ share/man/man9/mbuf.9 7 Oct 2020 19:24:48 -
> @@ -58,7 +58,7 @@
>  .Nm m_devget ,
>  .Nm m_apply ,
>  .Nm MCLGET ,
> -.Nm MCLGETI ,
> +.Nm MCLGETL ,
>  .Nm MEXTADD ,
>  .Nm m_align ,
>  .Nm M_READONLY ,
> @@ -126,7 +126,7 @@
>  "int (*func)(caddr_t, caddr_t, unsigned int)" "caddr_t fstate"
>  .Fn MCLGET "struct mbuf *m" "int how"
>  .Ft struct mbuf *
> -.Fn MCLGETI "struct mbuf *m" "int how" "struct ifnet *ifp" "int len"
> +.Fn MCLGETL "struct mbuf *m" "int how" "int len"
>  .Fn MEXTADD "struct mbuf *m" "caddr_t buf" "u_int size" "int flags" \
>  "void (*free)(caddr_t, u_int, void *)" "void *arg"
>  .Ft void
> @@ -721,7 +721,7 @@ See
>  .Fn m_get
>  for a description of
>  .Fa how .
> -.It Fn MCLGETI "struct mbuf *m" "int how" "struct ifnet *ifp" "int len"
> +.It Fn MCLGETL "struct mbuf *m" "int how" "int len"
>  If
>  .Fa m
>  is NULL, allocate it.
> Index: sys/arch/octeon/dev/if_cnmac.c
> ===
> RCS file: /cvs/src/sys/arch/octeon/dev/if_cnmac.c,v
> retrieving revision 1.79
> diff -u -p -r1.79 if_cnmac.c
> --- sys/arch/octeon/dev/if_cnmac.c4 Sep 2020 15:18:05 -   1.79
> +++ sys/arch/octeon/dev/if_cnmac.c7 Oct 2020 19:27:08 -
> @@ -1106,7 +1106,7 @@ cnmac_mbuf_alloc(int n)
>   paddr_t pktbuf;
>  
>   while (n > 0) {
> - m = MCLGETI(NULL, M_NOWAIT, NULL,
> + m = MCLGETL(NULL, M_NOWAIT,
>   OCTEON_POOL_SIZE_PKT + CACHELINESIZE);
>   if (m == NULL || !ISSET(m->m_flags, M_EXT)) {
>   m_freem(m);
> Index: sys/arch/octeon/dev/if_ogx.c
> ===
> RCS file: /cvs/src/sys/arch/octeon/dev/if_ogx.c,v
> retrieving revision 1.2
> diff -u -p -r1.2 if_ogx.c
> --- sys/arch/octeon/dev/if_ogx.c  9 Sep 2020 15:53:25 -   1.2
> +++ sys/arch/octeon/dev/if_ogx.c  7 Oct 2020 19:27:25 -
> @@ -1147,7 +1147,7 @@ ogx_load_mbufs(struct ogx_softc *sc, uns
>   paddr_t pktbuf;
>  
>   for ( ; n > 0; n--) {
> - m = MCLGETI(NULL, M_NOWAIT, NULL, MCLBYTES);
> + m = MCLGETL(NULL, M_NOWAIT, MCLBYTES);
>   if (m == NULL)
>   break;
>  
> Index: sys/arch/sparc64/dev/vnet.c
> ===
> RCS file: /cvs/src/sys/arch/sparc64/dev/vnet.c,v
> retrieving revision 1.62
> diff -u -p -r1.62 vnet.c
> --- sys/arch/sparc64/dev/vnet.c   10 Jul 2020 13:26:36 -  1.62
> +++ sys/arch/sparc64/dev/vnet.c   7 Oct 2020 19:27:41 -
> @@ -834,7 +834,7 @@ vnet_rx_vio_dring_data(struct vnet_softc
>   goto skip;
>   }
>  
> - m = MCLGETI(NULL, M_DONTWAIT, NULL, desc.nbytes);
> + m = MCLGETL(NULL, M_DONTWAIT, desc.nbytes);
>   if (!m)
>   break;
>   m->m_len = m->m_pkthdr.len = desc.nbytes;
> Index: sys/dev/fdt/if_dwge.c
> ===
> RCS file: /cvs/src/sys/dev/fdt/if_dwge.c,v
> retrieving revision 1.6
> diff -u -p -r1.6 if_dwge.c
> --- sys/dev/fdt/if_dwge.c 13 Sep 2020 01:54:05 -  1.6
> +++ sys/dev/fdt/if_dwge.c 7 Oct 2020 19:28:00 -
> @@ -1283,7 +1283,7 @@ dwge_alloc_mbuf(struct dwge_softc *sc, b
>  {
>   struct mbuf *m = NULL;
>  
> - m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
> + m = MCLGETL(NULL, M_DONTWAIT, MCLBYTES);
>   if (!m)
>   return (NULL);
>   m->m_len = m->m_pkthdr.len = MCLBYTES;
> Index: sys/dev/fdt/if_dwxe.c
> ===
> RCS file: /cvs/src/sys/dev/fdt/if_dwxe.c,v
> retrieving revision 1.17
> diff -u -p -r1.17 if_dwxe.c
> --- sys/dev/fdt/if_dwxe.c 10 Jul 2020 13:26:36 -  1.17
> +++ sys/dev/fdt/if_dwxe.c 7 Oct 2020 19:28:28 -
> @@ -1342,7 +1342,7 @@ dwxe_alloc_mbuf(struct dwxe_softc *sc, b
>  {
>   struct mbuf *m = NULL;
>  
> - m = MCLGETI(NULL, M_DONTWAIT, NULL, MCLBYTES);
> + m = MCLGETL(NULL, M_DONTWAIT, MCLBYTES);
>   if 

clear mbuf timestamp when it leaves the stack

2020-12-11 Thread David Gwynne
an mbuf timestamp is set by hw when a packet is rxed, and is then used
by the socket layer and things like ntpd, but is also used by bpf when
it provides packet timestamps.

the timestamp is only valid on rxed packets though. when they leave the
stack they should not be used anymore. on the way out of the stack it is
more correct to read the clock for the time. this clears the timestamp
bit on the way out so it can't be used anymore.

ok?

Index: if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.620
diff -u -p -r1.620 if.c
--- if.c3 Oct 2020 00:23:55 -   1.620
+++ if.c12 Dec 2020 00:24:28 -
@@ -682,6 +682,8 @@ if_qstart_compat(struct ifqueue *ifq)
 int
 if_enqueue(struct ifnet *ifp, struct mbuf *m)
 {
+   CLR(m->m_pkthdr.csum_flags, M_TIMESTAMP);
+
 #if NPF > 0
if (m->m_pkthdr.pf.delay > 0)
return (pf_delay_pkt(m, ifp->if_index));


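The semantics are small enough to model directly (the flag value below is illustrative, not the real M_TIMESTAMP):

```c
#include <stdint.h>

/*
 * Toy model of the if_enqueue() change: the timestamp-valid bit in
 * m_pkthdr.csum_flags is cleared as the packet leaves the stack, so a
 * stale rx timestamp can never be reported for a transmitted packet.
 */
#define DEMO_M_TIMESTAMP	0x4000u
#define CLR(var, flags)		((var) &= ~(flags))

uint32_t
demo_enqueue_flags(uint32_t csum_flags)
{
	CLR(csum_flags, DEMO_M_TIMESTAMP);	/* as in if_enqueue() */
	return (csum_flags);
}
```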

Re: bgpd pftable change

2020-11-14 Thread David Gwynne
I've been using this for a week or so now and it's been very boring, which is 
an improvement in my experience.

It has my ok if that has any value.

dlg

> On 9 Nov 2020, at 8:16 pm, Claudio Jeker  wrote:
> 
> Hi bgpd and esp. bgpd-spamd users,
> 
> Currently the pftable code does not keep track how often a prefix was
> added to a pftable. Because of this using the same pftable for multiple
> neighbor tables does not work well. If one neighbor withdraws a route the
> pftable entry is removed from the table no matter if the prefix is still
> available from the other neighbor.
> 
> This diff changes this behaviour and introduces proper reference counting.
> A pftable entry will now remain in the table until the last neighbor
> withdraws the prefix. This makes much more sense and should not break
> working setups. It will fix configs where more than one neighbor feeds
> into the bgpd pftable.
> 
> As a side-effect bgpd will now commit pftable updates on a more regular
> basis and not wait until some operations (full table walk) finished. This
> should result in better responsiveness of updates.
> 
> Please test :)
> -- 
> :wq Claudio
> 
> Index: rde.c
> ===
> RCS file: /cvs/src/usr.sbin/bgpd/rde.c,v
> retrieving revision 1.506
> diff -u -p -r1.506 rde.c
> --- rde.c 5 Nov 2020 14:44:59 -   1.506
> +++ rde.c 9 Nov 2020 10:06:51 -
> @@ -69,6 +69,7 @@ void rde_dump_ctx_terminate(pid_t);
> void   rde_dump_mrt_new(struct mrt *, pid_t, int);
> 
> intrde_l3vpn_import(struct rde_community *, struct l3vpn *);
> +static void   rde_commit_pftable(void);
> void   rde_reload_done(void);
> static voidrde_softreconfig_in_done(void *, u_int8_t);
> static voidrde_softreconfig_out_done(void *, u_int8_t);
> @@ -296,6 +297,8 @@ rde_main(int debug, int verbose)
>   for (aid = AID_INET6; aid < AID_MAX; aid++)
>   rde_update6_queue_runner(aid);
>   }
> + /* commit pftable once per poll loop */
> + rde_commit_pftable();
>   }
> 
>   /* do not clean up on shutdown on production, it takes ages. */
> @@ -497,8 +500,6 @@ badnetdel:
>   RDE_RUNNER_ROUNDS, peerself, network_flush_upcall,
>   NULL, NULL) == -1)
>   log_warn("rde_dispatch: IMSG_NETWORK_FLUSH");
> - /* Deletions were performed in network_flush_upcall */
> - rde_send_pftable_commit();
>   break;
>   case IMSG_FILTER_SET:
>   if (imsg.hdr.len - IMSG_HEADER_SIZE !=
> @@ -1389,7 +1390,6 @@ rde_update_dispatch(struct rde_peer *pee
> 
> done:
> - rde_filterstate_clean(&state);
> - rde_send_pftable_commit();
> }
> 
> int
> @@ -2950,10 +2950,31 @@ rde_update6_queue_runner(u_int8_t aid)
> /*
>  * pf table specific functions
>  */
> +struct rde_pftable_node {
> + RB_ENTRY(rde_pftable_node)   entry;
> + struct pt_entry *prefix;
> + int  refcnt;
> + u_int16_tid;
> +};
> +RB_HEAD(rde_pftable_tree, rde_pftable_node);
> +
> +static inline int
> +rde_pftable_cmp(struct rde_pftable_node *a, struct rde_pftable_node *b)
> +{
> + if (a->prefix > b->prefix)
> + return 1;
> + if (a->prefix < b->prefix)
> + return -1;
> + return (a->id - b->id);
> +}
> +
> +RB_GENERATE_STATIC(rde_pftable_tree, rde_pftable_node, entry, 
> rde_pftable_cmp);
> +
> struct rde_pftable_tree pftable_tree = RB_INITIALIZER(&pftable_tree);
> int need_commit;
> -void
> -rde_send_pftable(u_int16_t id, struct bgpd_addr *addr,
> -u_int8_t len, int del)
> +
> +static void
> +rde_pftable_send(u_int16_t id, struct pt_entry *pt, int del)
> {
>   struct pftable_msg pfm;
> 
> @@ -2966,8 +2987,8 @@ rde_send_pftable(u_int16_t id, struct bg
> 
>   bzero(&pfm, sizeof(pfm));
>   strlcpy(pfm.pftable, pftable_id2name(id), sizeof(pfm.pftable));
> - memcpy(&pfm.addr, addr, sizeof(pfm.addr));
> - pfm.len = len;
> + pt_getaddr(pt, &pfm.addr);
> + pfm.len = pt->prefixlen;
> 
>   if (imsg_compose(ibuf_main,
>   del ? IMSG_PFTABLE_REMOVE : IMSG_PFTABLE_ADD,
> @@ -2978,7 +2999,55 @@ rde_send_pftable(u_int16_t id, struct bg
> }
> 
> void
> -rde_send_pftable_commit(void)
> +rde_pftable_add(u_int16_t id, struct prefix *p)
> +{
> + struct rde_pftable_node *pfn, node;
> +
> + memset(, 0, sizeof(node));
> + node.prefix = p->pt;
> + node.id = id;
> +
> + pfn = RB_FIND(rde_pftable_tree, &pftable_tree, &node);
> + if (pfn == NULL) {
> + if ((pfn = calloc(1, sizeof(*pfn))) == NULL)
> + fatal("%s", __func__);
> + pfn->prefix = pt_ref(p->pt);
> + pfn->id = id;
> +
> + if (RB_INSERT(rde_pftable_tree, &pftable_tree, pfn) != NULL)
> + fatalx("%s: 

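The reference-counting semantics described in that thread can be modelled in a few lines (illustrative only; the real code keys an RB tree on (prefix, table id) and sends imsgs to the parent process):

```c
/*
 * Minimal model of the reference-counted pftable: a prefix is only
 * inserted into the kernel table on the 0->1 refcount transition and
 * only removed again on the 1->0 transition. `present' stands in for
 * the prefix being in the actual pf(4) table.
 */
struct pft_entry {
	int	refcnt;
	int	present;
};

void
pft_add(struct pft_entry *e)
{
	if (e->refcnt++ == 0)
		e->present = 1;		/* IMSG_PFTABLE_ADD would be sent */
}

void
pft_del(struct pft_entry *e)
{
	if (--e->refcnt == 0)
		e->present = 0;		/* IMSG_PFTABLE_REMOVE would be sent */
}
```

With two neighbors announcing the same prefix, one withdrawal no longer removes the entry.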
Re: net.inet.ip.forwarding=0 vs lo(4)

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 07:16:46PM +1000, David Gwynne wrote:
> On Mon, Oct 19, 2020 at 10:03:29AM +0100, Stuart Henderson wrote:
> > On 2020/10/19 11:47, David Gwynne wrote:
> > > On Sun, Oct 18, 2020 at 08:57:34PM +0100, Stuart Henderson wrote:
> > > > On 2020/10/18 14:04, David Gwynne wrote:
> > > > > the problem i'm hitting is that i have a multihomed box where the
> > > > > service it provides listens on an IP address that's assigned to lo1.
> > > > > it's a host running a service, it's not a router, so the
> > > > > net.inet.ip.forwarding sysctl is not set to 1.
> > > > 
> > > > I ran into this, I just turned on the forwarding sysctl to avoid the
> > > > problem.
> > > > 
> > > > > i came up with this diff, which adds even more special casing for
> > > > > loopback interfaces. it says addresses on loopbacks are globally
> > > > > reachable, even if ip forwarding is disabled.
> > > > 
> > > > I don't see why loopbacks should be special. Another place this
> > > > might show up is services running on carp addresses (I haven't updated
> > > > those machines yet but there's a fair chance they'll be affected too).
> > > > I would prefer an explicit sysctl to disable "strong host model".
> > > 
> > > loopback is already special. if a packet comes from an loopback
> > > interface, we allow it to talk to any IP on the local machine. i think
> > > this is mostly to cope with the semantic we've had where local traffic
> > > gets tied to a loopback interface instead of going anywhere near the
> > > physical ones.
> > > 
> > > carp is also special.
> > > 
> > > let me paste the ip_laddr function instead of the diff to it, it's a bit
> > > more obvious what's going on:
> > 
> > Thanks, that will already work for the machines I was thinking of then.
> > 
> > > back to loopback and receiving packets. loopback is special because it
> > > is not connected to the outside world. it is impossible to send a packet
> > > via a loopback interface from another host, so configuring a globally
> > > (externally) routable IP on it is currently pointless unless you enable
> > > forwarding. i think making loopback more special and allowing it
> > > to be globally reachable makes sense. i can't think of any downsides to
> > > this at the moment, except that the behaviour would be subtle/not
> > > obvious
> > 
> > ok, so it makes sense for this to be independent of any possible
> > separate lever.
> > 
> > > is there a need to configure a globally reachable IP on a non-loopback
> > > interface on a host (not router)? if so, then i'd be more convinced that
> > > we need a separate lever to pull.
> > 
> > I'm not using it this way, but here's a scenario.
> > 
> > Say there are a couple of webservers with addresses from a carp on
> > ethernet/vlan, with a link to their upstream router on some separate
> > interface. They announce the carp prefix into ospf.
> 
> so carp is just being used to elect a webserver as a master, and then
> the result of that election is fed upstream.
> 
> > They aren't routing themselves so the only reason to have forwarding=1
> > is to have them use "weak host model".
> > 
> > With forwarding=0 I think they'll have to use "stub router no" otherwise
> > everything will be announced high metric (rather than being dependent on
> > carp state), but ospfd explicitly handles this; it's marked in parse.y
> > with "/* allow to force non stub mode */".
> 
> so is a Big Global Lever what you want here? if you enable weak host
> mode, all IPs on the host will be addressable from all legs of the
> host. would it make more sense to configure specific interfaces as
> holding globally addressable IPs?
> 
> if my understanding of your scenario is right, you could configure
> the carp interface with the weak or globally accessible flag. in
> my situation i could configure that on lo1.

such a diff looks like this. it adds a "global" flag that you can set on
interfaces.

Index: sbin/ifconfig/ifconfig.c
===
RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
retrieving revision 1.429
diff -u -p -r1.429 ifconfig.c
--- sbin/ifconfig/ifconfig.c7 Oct 2020 14:38:54 -   1.429
+++ sbin/ifconfig/ifconfig.c20 Oct 2020 00:12:06 -
@@ -468,6 +468,8 @@ const struct	cmd {
{ "-autoconfprivacy",  

Re: pf route-to issues

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 12:33:25PM +0100, Stuart Henderson wrote:
> On 2020/10/19 19:53, David Gwynne wrote:
> > On Mon, Oct 19, 2020 at 09:34:31AM +0100, Stuart Henderson wrote:
> > > On 2020/10/19 15:35, David Gwynne wrote:
> > > > every few years i try and use route-to in pf, and every time it
> > > > goes badly. i tried it again last week in a slightly different
> > > > setting, and actually tried to understand the sharp edges i hit
> > > > this time instead of giving up. it turns out there are 2 or 3
> > > > different things together that have caused me trouble, which is why
> > > > the diff below is so big.
> > > 
> > > I used to route-to/reply-to quite a lot at places with poor internet
> > > connections to split traffic between lines (mostly those have better
> > > connections now so I don't need it as often). It worked as I expected -
> > > but I only ever used it with the interface specified.
> > 
> > cool. did it work beyond the first packet in a connection?
> 
> It must have done. The webcams would have utterly broken the rest of
> traffic if it hadn't :)


> 
> > > I mostly used it with pppoe interfaces so the peer address was unknown
> > > at ruleset load time. (I was lucky and had static IPs my side, but the
> > > ISP side was variable). I relied on the fact that once packets are
> > > directed at a point-point interface there's only one place for them to
> > > go. I didn't notice that ":peer" might be useful here (and the syntax
> > > 'route-to pppoe1:peer@pppoe1' is pretty awkward so I probably wouldn't
> > > have come up with it), I had 0.0.0.1@pppoe1, 0.0.0.2@pppoe2 etc
> > > (though actually I think it works with $any_random_address@pppoeX).
> > 
> > yes. i was trying to use it with peers over ethernet, and always
> > struggled with the syntax.
> > 
> > > > the first and i would argue most fundamental problem is a semantic
> > > > problem. if you ask a random person who has some clue about networks
> > > > and routing what they would expect the "argument" to route-to or
> > > > reply-to to be, they would say "a nexthop address" or "a gateway
> > > > address". eg, say i want to force packets to a specific backend
> > > > server without using NAT, i would write a rule like this:
> > > > 
> > > >   n_servers="192.0.2.128/27"
> > > >   pass out on $if_internal to $n_servers route-to 192.168.0.1
> > > > 
> > > > pfctl will happily parse this, shove it into the kernel, let you read
> > > > the rules back out again with pfctl -sr, and it all looks plausible, but
> > > > it turns out that it's using the argument to route-to as an interface
> > > > name. because rulesets can refer to interfaces that don't exist yet, pf
> > > > just passes the IP address around as a string, hoping i'll plug in an
> > > > interface with a driver name that looks like an ip address. i spent
> > > > literally a day trying to figure out why a rule like this wasn't
> > > > working.
> > > 
> > > I don't think I tried this, but the pf.conf(5) BNF syntax suggests it's
> > > supposed to work. So either doc or implementation bug there.
> > 
> > i'm leaning toward implementation bug.
> > 
> > >  route  = ( "route-to" | "reply-to" | "dup-to" )
> > >   ( routehost | "{" routehost-list "}" )
> > >   [ pooltype ]
> > > 
> > >  routehost-list = routehost [ [ "," ] routehost-list ]
> > > 
> > >  routehost  = host | host "@" interface-name |
> > >   "(" interface-name [ address [ "/" mask-bits ] ] ")"
> > > 
> > > > the second problem is that the pf_route calls from pfsync don't
> > > > have all the information it is supposed to have. more specifically,
> > > > an ifp pointer isn't set which leads to a segfault. the ifp pointer
> > > isn't set because pfsync doesn't track which interface a packet is
> > > > going out, it assumes the ip layer will get it right again later, or a
> > > > rule provided something usable.
> > > > 
> > > > the third problem is that pf_route relies on information from rules to
> > > > work correctly. this is a problem in a pfsync environment because you
> > > > cannot ha

Re: pf route-to issues

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 12:28:19PM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> 
> > > 
> > > it seems to me 'route-to vs. pfsync' still needs more thought.  the
> > > next-hop IP address in route-to may be different for each PF box
> > > linked by pfsync(4). To be honest I have no answer to address this
> > > issue at the moment.
> > 
> > i have thought about that a little bit. we could play with what the
> > argument to route-to means. rather than requiring it to be a directly
> > connected host/gateway address, we could interpret it as a destination
> > address, and use the gateway for that destination as the nexthop.
> > 
> > eg, if i have the following routing table on frontend a:
> > 
> > Internet:
> > Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
> > default            192.168.96.33      UGS        6   176171     -     8 vmx0
> > 224/4              127.0.0.1          URS        0        0 32768     8 lo0
> > 10.0.0.0/30        192.168.0.1        UGS        0        0     -     8 gre0
> 
> > 
> > and this routing table on frontend b:
> > 
> > Internet:
> > Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
> > default            192.168.96.33      UGS        9    87548     -     8 aggr0
> > 224/4              127.0.0.1          URS        0        0 32768     8 lo0
> > 10.0.0.0/30        192.168.0.3        UGS        0        0     -     8 gre0
> 
> > 
> > if gre0 on both frontends pointed at different legs on the same backend
> > server, i could write a pf rule like this:
> > 
> > pass out to port 80 route-to 10.0.0.1
> > 
> > 10.0.0.1 would then end up as the rt_addr field in pf_state and
> > pfsync_state.
> > 
> > both frontend a and b would lookup the route to 10.0.0.1, and then
> > use 192.168.0.1 and 192.168.0.3 as the gateway address respectively.
> > both would end up pushing the packet over their gre link to the
> > same backend. the same semantic would work if the link to the backend
> > was over ethernet instead of a tunnel.
> > 
> > > > thoughts?
> > > > 
> > > 
> > > What you've said makes sense. However I still feel pfsync(4)
> > > does not play well with route-to.
> > 
> > maybe your opinion is different if the above makes sense?
> > 
> 
> Thanks for detailed explanation. This is good enough to make me happy.
> The remaining questions on this are sort of 'homework for me' to poke
> to PF source code, for example:
>   are we doing route look up for every packet? or route look up
>   is performed when state is created/imported? (and we cache
>   outbound interface + next-hop along the state)

the "destination" address is determined when the state is created and
stored in rt_addr if pf_state. the route lookup using that address is
done per packet in pf_route.

>   also what happens when route does not exist on pfsync peer, which
>   receives state? How admin will discover state failed to import?

the route lookup or interface lookup fails and the packet is dropped.
there are no counters to show this is happening though :(

> Anyway, your plan above looks solid to me now. It's certainly more
> flexible (?reliable?) to select a route to a particular destination than
> using a pair of interface,next-hop.

cool, i'll keep working on it then.

> 
> thanks and
> regards
> sashan



Re: pf route-to issues

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 09:34:31AM +0100, Stuart Henderson wrote:
> On 2020/10/19 15:35, David Gwynne wrote:
> > every few years i try and use route-to in pf, and every time it
> > goes badly. i tried it again last week in a slightly different
> > setting, and actually tried to understand the sharp edges i hit
> > this time instead of giving up. it turns out there are 2 or 3
> > different things together that have caused me trouble, which is why
> > the diff below is so big.
> 
> I used to route-to/reply-to quite a lot at places with poor internet
> connections to split traffic between lines (mostly those have better
> connections now so I don't need it as often). It worked as I expected -
> but I only ever used it with the interface specified.

cool. did it work beyond the first packet in a connection?

> I mostly used it with pppoe interfaces so the peer address was unknown
> at ruleset load time. (I was lucky and had static IPs my side, but the
> ISP side was variable). I relied on the fact that once packets are
> directed at a point-point interface there's only one place for them to
> go. I didn't notice that ":peer" might be useful here (and the syntax
> 'route-to pppoe1:peer@pppoe1' is pretty awkward so I probably wouldn't
> have come up with it), I had 0.0.0.1@pppoe1, 0.0.0.2@pppoe2 etc
> (though actually I think it works with $any_random_address@pppoeX).

yes. i was trying to use it with peers over ethernet, and always
struggled with the syntax.

> > the first and i would argue most fundamental problem is a semantic
> > problem. if you ask a random person who has some clue about networks
> > and routing what they would expect the "argument" to route-to or
> > reply-to to be, they would say "a nexthop address" or "a gateway
> > address". eg, say i want to force packets to a specific backend
> > server without using NAT, i would write a rule like this:
> > 
> >   n_servers="192.0.2.128/27"
> >   pass out on $if_internal to $n_servers route-to 192.168.0.1
> > 
> > pfctl will happily parse this, shove it into the kernel, let you read
> > the rules back out again with pfctl -sr, and it all looks plausible, but
> > it turns out that it's using the argument to route-to as an interface
> > name. because rulesets can refer to interfaces that don't exist yet, pf
> > just passes the IP address around as a string, hoping i'll plug in an
> > interface with a driver name that looks like an ip address. i spent
> > literally a day trying to figure out why a rule like this wasn't
> > working.
> 
> I don't think I tried this, but the pf.conf(5) BNF syntax suggests it's
> supposed to work. So either doc or implementation bug there.

i'm leaning toward implementation bug.

>  route  = ( "route-to" | "reply-to" | "dup-to" )
>   ( routehost | "{" routehost-list "}" )
>   [ pooltype ]
> 
>  routehost-list = routehost [ [ "," ] routehost-list ]
> 
>  routehost  = host | host "@" interface-name |
>   "(" interface-name [ address [ "/" mask-bits ] ] ")"
> 
> > the second problem is that the pf_route calls from pfsync don't
> > have all the information it is supposed to have. more specifically,
> > an ifp pointer isn't set which leads to a segfault. the ifp pointer
> > isn't set because pfsync doesn't track which interface a packet is
> > going out, it assumes the ip layer will get it right again later, or a
> > rule provided something usable.
> > 
> > the third problem is that pf_route relies on information from rules to
> > work correctly. this is a problem in a pfsync environment because you
> > cannot have the same ruleset on both firewalls 100% of the time, which
> > means you cannot have route-to/reply-to behave consistently on a pair of
> > firewalls 100% of the time.
> 
> I didn't run into this because pppoe(4) and pfsync/carp don't really
> go well together, but ouch!
> 
> > all of this together makes things work pretty obviously and smoothly.
> > in my opinion anyway. route-to now works more like rdr-to, it just
> > feels like it changes the address used for the route lookup rather
> > than changing the actual IP address in the packet. it also works
> > predictably in a pfsync pair, which is great from the point of view of
> > high availability.
> > 
> > the main caveat is that it's not backward compatible. if you're already
> > using route-to, you will need to tweak your rules to have them parse.
> > however, i doubt anyone is

Re: pf route-to issues

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 09:46:19AM +0200, Alexandr Nedvedicky wrote:
> Hello,
> 
> disclaimer: I have no chance to run pfsync on production, I'm very
> inexperienced with pfsync(4).

i wrote the defer code in pfsync, and i think i wrote the code in
pfsync that calls pf_route badly, so no one's perfect :)

> 
> > 
> > the third problem is that pf_route relies on information from rules to
> > work correctly. this is a problem in a pfsync environment because you
> > cannot have the same ruleset on both firewalls 100% of the time, which
> > means you cannot have route-to/reply-to behave consistently on a pair of
> > firewalls 100% of the time.
> > 
> > my solution to both these problems is reduce the amount of information
> > pf_route needs to work with, to make sure that the info it does need
> > is in the pf state structure, and that pfsync handles it properly.
> > 
> > if we limit the information needed for pf_route to a nexthop address,
> > and which direction the address is used, this is doable. both the
> > pf_state and pfsync_state structs already contain an address to store a
> > nexthop in, i just had to move the route-to direction from the rule into
> > the state. this is easy with pf_state, but i used a spare pad field in
> > pfsync_state for this.
> > 
> 
> it seems to me 'route-to vs. pfsync' still needs more thought.  the
> next-hop IP address in route-to may be different for each PF box
> linked by pfsync(4). To be honest I have no answer to address this
> issue at the moment.

i have thought about that a little bit. we could play with what the
argument to route-to means. rather than requiring it to be a directly
connected host/gateway address, we could interpret it as a destination
address, and use the gateway for that destination as the nexthop.

eg, if i have the following routing table on frontend a:

Internet:
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
default            192.168.96.33      UGS        6   176171     -     8 vmx0
224/4              127.0.0.1          URS        0        0 32768     8 lo0
10.0.0.0/30        192.168.0.1        UGS        0        0     -     8 gre0
127/8              127.0.0.1          UGRS       0        0 32768     8 lo0
127.0.0.1          127.0.0.1          UHhl       1      142 32768     1 lo0
192.168.0.0        192.168.0.0        UHl        0        0     -     1 gre0
192.168.0.1        192.168.0.0        UHh        1        1     -     8 gre0
192.168.96.32/27   192.168.96.34      UCn        2   122849     -     4 vmx0
192.168.96.33      00:00:5e:00:01:47  UHLch      1    14611     -     3 vmx0
192.168.96.34      00:50:56:a1:73:91  UHLl       0   362231     -     1 vmx0
192.168.96.60      fe:e1:ba:d0:74:ef  UHLc       0    59690     -     3 vmx0
192.168.96.63      192.168.96.34      UHb        0        0     -     1 vmx0

and this routing table on frontend b:

Internet:
Destination        Gateway            Flags   Refs      Use   Mtu  Prio Iface
default            192.168.96.33      UGS        9    87548     -     8 aggr0
224/4              127.0.0.1          URS        0        0 32768     8 lo0
10.0.0.0/30        192.168.0.3        UGS        0        0     -     8 gre0
127/8              127.0.0.1          UGRS       0        0 32768     8 lo0
127.0.0.1          127.0.0.1          UHhl       1       62 32768     1 lo0
192.168.0.2        192.168.0.2        UHl        0        0     -     1 gre0
192.168.0.3        192.168.0.2        UHh        1        1     -     8 gre0
192.168.96.32/27   192.168.96.55      UCn        3    62186     -     4 aggr0
192.168.96.33      00:00:5e:00:01:47  UHLch      1     7442     -     3 aggr0
192.168.96.35      00:23:42:d0:56:8e  UHLc       0     1905     -     3 aggr0
192.168.96.55      fe:e1:ba:d0:e0:83  UHLl       0   178119     -     1 aggr0
192.168.96.60      fe:e1:ba:d0:74:ef  UHLc       0    31021     -     3 aggr0
192.168.96.63      192.168.96.55      UHb        0        0     -     1 aggr0

if gre0 on both frontends pointed at different legs on the same backend
server, i could write a pf rule like this:

pass out to port 80 route-to 10.0.0.1

10.0.0.1 would then end up as the rt_addr field in pf_state and
pfsync_state.

both frontend a and b would lookup the route to 10.0.0.1, and then
use 192.168.0.1 and 192.168.0.3 as the gateway address respectively.
both would end up pushing the packet over their gre link to the
same backend. the same semantic would work if the link to the backend
was over ethernet instead of a tunnel.
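to make that concrete, here's a toy userland sketch of the proposed
semantic (hypothetical code, not pf internals: a longest-prefix match
resolves the same synced rt_addr to each box's own local gateway):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

struct toy_route {
	uint32_t dest;		/* network address, host byte order */
	uint32_t mask;		/* netmask */
	uint32_t gateway;	/* nexthop for this prefix */
};

/* longest-prefix match over a toy routing table */
static uint32_t
toy_rt_gateway(const struct toy_route *tab, size_t n, uint32_t dst)
{
	uint32_t best_mask = 0, gw = 0;
	int found = 0;

	for (size_t i = 0; i < n; i++) {
		if ((dst & tab[i].mask) != tab[i].dest)
			continue;
		if (!found || tab[i].mask > best_mask) {
			found = 1;
			best_mask = tab[i].mask;
			gw = tab[i].gateway;
		}
	}
	return (gw);
}

/* the two routing tables above, trimmed to the relevant entries */
static const struct toy_route frontend_a[] = {
	{ IP(0, 0, 0, 0),  0x00000000, IP(192, 168, 96, 33) },	/* default */
	{ IP(10, 0, 0, 0), 0xfffffffc, IP(192, 168, 0, 1) },	/* 10.0.0.0/30 */
};
static const struct toy_route frontend_b[] = {
	{ IP(0, 0, 0, 0),  0x00000000, IP(192, 168, 96, 33) },	/* default */
	{ IP(10, 0, 0, 0), 0xfffffffc, IP(192, 168, 0, 3) },	/* 10.0.0.0/30 */
};
```

both tables get the same rt_addr (10.0.0.1), but each resolves it to its
own gre peer, which is exactly the behaviour described above.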

> > thoughts?
> > 
> 
> What you've said makes sense. However I still feel pfsync(4)
> does not play well with route-to.

maybe your opinion is different if the above makes sense?

> thanks and
> regards
> sashan

no, thank you for reading my long email.

cheers,
dlg



Re: net.inet.ip.forwarding=0 vs lo(4)

2020-10-19 Thread David Gwynne
On Mon, Oct 19, 2020 at 10:03:29AM +0100, Stuart Henderson wrote:
> On 2020/10/19 11:47, David Gwynne wrote:
> > On Sun, Oct 18, 2020 at 08:57:34PM +0100, Stuart Henderson wrote:
> > > On 2020/10/18 14:04, David Gwynne wrote:
> > > > the problem i'm hitting is that i have a multihomed box where the
> > > > service it provides listens on an IP address that's assigned to lo1.
> > > > it's a host running a service, it's not a router, so the
> > > > net.inet.ip.forwarding sysctl is not set to 1.
> > > 
> > > I ran into this, I just turned on the forwarding sysctl to avoid the
> > > problem.
> > > 
> > > > i came up with this diff, which adds even more special casing for
> > > > loopback interfaces. it says addresses on loopbacks are globally
> > > > reachable, even if ip forwarding is disabled.
> > > 
> > > I don't see why loopbacks should be special. Another place this
> > > might show up is services running on carp addresses (I haven't updated
> > > those machines yet but there's a fair chance they'll be affected too).
> > > I would prefer an explicit sysctl to disable "strong host model".
> > 
> > loopback is already special. if a packet comes from a loopback
> > interface, we allow it to talk to any IP on the local machine. i think
> > this is mostly to cope with the semantic we've had where local traffic
> > get's tied to a loopback interface instead of going anywhere near the
> > physical ones.
> > 
> > carp is also special.
> > 
> > let me paste the ip_laddr function instead of the diff to it, it's a bit
> > more obvious what's going on:
> 
> Thanks, that will already work for the machines I was thinking of then.
> 
> > back to loopback and receiving packets. loopback is special because it
> > is not connected to the outside world. it is impossible to send a packet
> > via a loopback interface from another host, so configuring a globally
> > (externally) routable IP on it is currently pointless unless you enable
> > forwarding. i think making loopback more special and allowing it
> > to be globally reachable makes sense. i can't think of any downsides to
> > this at the moment, except that the behaviour would be subtle/not
> > obvious
> 
> ok, so it makes sense for this to be independent of any possible
> separate lever.
> 
> > is there a need to configure a globally reachable IP on a non-loopback
> > interface on a host (not router)? if so, then i'd be more convinced that
> > we need a separate lever to pull.
> 
> I'm not using it this way, but here's a scenario.
> 
> Say there are a couple of webservers with addresses from a carp on
> ethernet/vlan, with a link to their upstream router on some separate
> interface. They announce the carp prefix into ospf.

so carp is just being used to elect a webserver as a master, and then
the result of that election is fed upstream.

> They aren't routing themselves so the only reason to have forwarding=1
> is to have them use "weak host model".
> 
> With forwarding=0 I think they'll have to use "stub router no" otherwise
> everything will be announced high metric (rather than being dependent on
> carp state), but ospfd explicitly handles this; it's marked in parse.y
> with "/* allow to force non stub mode */".

so is a Big Global Lever what you want here? if you enable weak host
mode, all IPs on the host will be addressable from all legs of the
host. would it make more sense to configure specific interfaces as
holding globally addressable IPs?

if my understanding of your scenario is right, you could configure
the carp interface with the weak or globally accessible flag. in
my situation i could configure that on lo1.

dlg



pf route-to issues

2020-10-18 Thread David Gwynne
every few years i try and use route-to in pf, and every time it
goes badly. i tried it again last week in a slightly different
setting, and actually tried to understand the sharp edges i hit
this time instead of giving up. it turns out there are 2 or 3
different things together that have caused me trouble, which is why
the diff below is so big.

the first and i would argue most fundamental problem is a semantic
problem. if you ask a random person who has some clue about networks
and routing what they would expect the "argument" to route-to or
reply-to to be, they would say "a nexthop address" or "a gateway
address". eg, say i want to force packets to a specific backend
server without using NAT, i would write a rule like this:

  n_servers="192.0.2.128/27"
  pass out on $if_internal to $n_servers route-to 192.168.0.1

pfctl will happily parse this, shove it into the kernel, let you read
the rules back out again with pfctl -sr, and it all looks plausible, but
it turns out that it's using the argument to route-to as an interface
name. because rulesets can refer to interfaces that don't exist yet, pf
just passes the IP address around as a string, hoping i'll plug in an
interface with a driver name that looks like an ip address. i spent
literally a day trying to figure out why a rule like this wasn't
working.
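a parse-time check along these lines would catch that mistake
(hypothetical sketch only, not the actual pfctl parser; inet_pton(3) is
enough to decide whether the argument is an address or has to be an
interface name):

```c
#include <assert.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* hypothetical: classify a route-to argument instead of silently
 * treating "192.168.0.1" as a driver name */
enum routeto_arg { ROUTETO_ADDR, ROUTETO_IFNAME };

static enum routeto_arg
classify_routeto_arg(const char *arg)
{
	struct in_addr in4;
	struct in6_addr in6;

	/* if it parses as an IPv4 or IPv6 address, it's a nexthop */
	if (inet_pton(AF_INET, arg, &in4) == 1 ||
	    inet_pton(AF_INET6, arg, &in6) == 1)
		return (ROUTETO_ADDR);
	/* otherwise assume it names an interface */
	return (ROUTETO_IFNAME);
}
```

with a check like this, the "192.168.0.1" in the rule above would be
treated as an address up front rather than stored as a string and
matched against interface names at runtime.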

i happened to be talking to pascoe@ at the time, and his vague memory
was that the idea was to try and switch the interface a packet was going
to travel over, but to try and reuse the arp lookup from the parent one.
neither of us could figure out why that would be a good idea though.

the best i can say about this is that it only really makes some
kind of sense if you're moving a packet into a tunnel. tunnels don't
really care about nexthops and will happily route anything you give
them. if you were trying to add a route to the routing table to do this,
you'd be specifying the peer address on a tunnel interface as the
gateway. pf has a if0:peer syntax that makes this convenient to write.

so i want to change route-to in pfctl so it takes a nexthop instead
of an interface. you could argue that pf already lets you do this,
because there's some bs nexthop@interface syntax. my counter argument
is that the interface the nexthop is reachable over is redundant, and it
makes fixing some of the other problems harder if we keep it.

the second and third problems i hit are when route-to is used on
a pair of boxes that have pfsync and pfsync defer set up. when defer
is enabled, pfsync takes the packet away from the forwarding path,
and when it has some confidence that the peer is aware of the state,
then it tries to push the packet back out.

to understand the following, be aware that route-to, reply-to, and
dup-to are implemented in pf in a pair of functions called pf_route
and pf_route6. if i say pf_route, just assume i'm talking about
both of these functions.

the second problem is that the pf_route calls from pfsync don't
have all the information it is supposed to have. more specifically,
an ifp pointer isn't set which leads to a segfault. the ifp pointer
isn't set because pfsync doesn't track which interface a packet is
going out, it assumes the ip layer will get it right again later, or a
rule provided something usable.

the third problem is that pf_route relies on information from rules to
work correctly. this is a problem in a pfsync environment because you
cannot have the same ruleset on both firewalls 100% of the time, which
means you cannot have route-to/reply-to behave consistently on a pair of
firewalls 100% of the time.

my solution to both these problems is reduce the amount of information
pf_route needs to work with, to make sure that the info it does need
is in the pf state structure, and that pfsync handles it properly.

if we limit the information needed for pf_route to a nexthop address,
and which direction the address is used, this is doable. both the
pf_state and pfsync_state structs already contain an address to store a
nexthop in, i just had to move the route-to direction from the rule into
the state. this is easy with pf_state, but i used a spare pad field in
pfsync_state for this.
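the pad-field trick can be illustrated with toy structs (field names and
sizes here are made up, not the real struct pfsync_state layout; the
point is just that reusing an existing pad byte keeps the wire format
size and offsets unchanged, so old and new peers stay compatible):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* before: a pad field reserved in the wire format */
struct toy_pfsync_state_old {
	uint8_t	addrs[16];	/* stand-in for the address fields */
	uint8_t	direction;
	uint8_t	pad[3];
};

/* after: the route-to direction takes over the first pad byte */
struct toy_pfsync_state_new {
	uint8_t	addrs[16];
	uint8_t	direction;
	uint8_t	rt;		/* route-to direction, was pad[0] */
	uint8_t	pad[2];
};

/* the change must not move anything on the wire */
_Static_assert(sizeof(struct toy_pfsync_state_old) ==
    sizeof(struct toy_pfsync_state_new), "wire size must not change");
_Static_assert(offsetof(struct toy_pfsync_state_new, rt) ==
    offsetof(struct toy_pfsync_state_old, pad), "rt reuses first pad byte");
```

an old peer just sees the new field as pad bytes it ignores, which is
why stealing pad is the compatible way to extend the protocol.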

the pf_state struct has had which interface the route is using
removed. there's no simple way to sync interface information between
pfsync peers on the wire, and the need for them is marginal at best.
things are much simpler if we can get away with not having this info.

a bonus problem i hit is that there's code in pf_match that
appears to try and short circuit some processing of states when
route-to/reply-to is in effect. this has two consequences. first,
if you're using route-to with tcp states, half the tcp state machine
is skipped. when you look at these states with pfctl -vvss,
one half of the TCP state never moves forward. secondly, because
the processing is short circuited, it never falls through to the
end of pf_test where the actual call to pf_route is done. so the
first packet is properly handled by 

Re: net.inet.ip.forwarding=0 vs lo(4)

2020-10-18 Thread David Gwynne
On Sun, Oct 18, 2020 at 08:57:34PM +0100, Stuart Henderson wrote:
> On 2020/10/18 14:04, David Gwynne wrote:
> > the problem i'm hitting is that i have a multihomed box where the
> > service it provides listens on an IP address that's assigned to lo1.
> > it's a host running a service, it's not a router, so the
> > net.inet.ip.forwarding sysctl is not set to 1.
> 
> I ran into this, I just turned on the forwarding sysctl to avoid the
> problem.
> 
> > i came up with this diff, which adds even more special casing for
> > loopback interfaces. it says addresses on loopbacks are globally
> > reachable, even if ip forwarding is disabled.
> 
> I don't see why loopbacks should be special. Another place this
> might show up is services running on carp addresses (I haven't updated
> those machines yet but there's a fair chance they'll be affected too).
> I would prefer an explicit sysctl to disable "strong host model".

loopback is already special. if a packet comes from a loopback
interface, we allow it to talk to any IP on the local machine. i think
this is mostly to cope with the semantic we've had where local traffic
gets tied to a loopback interface instead of going anywhere near the
physical ones.

carp is also special.

let me paste the ip_laddr function instead of the diff to it, it's a bit
more obvious what's going on:

int
ip_laddr(struct ifnet *ifp, struct mbuf *m, struct rtentry *rt)
{
	struct ifnet *rtifp;
	int match = 0;

	if (rt->rt_ifidx == ifp->if_index ||
	    ifp->if_type == IFT_ENC ||
	    ISSET(ifp->if_flags, IFF_LOOPBACK) ||
	    ISSET(m->m_pkthdr.pf.flags, PF_TAG_TRANSLATE_LOCALHOST))
		return (1);

	/* received on a different interface. */
	rtifp = if_get(rt->rt_ifidx);
	if (rtifp != NULL) {
		if (ISSET(rtifp->if_flags, IFF_LOOPBACK))
			match = 1;
#if NCARP > 0
		/*
		 * Virtual IPs on carp interfaces need to be checked also
		 * against the parent interface and other carp interfaces
		 * sharing the same parent.
		 */
		else if (carp_strict_addr_chk(rtifp, ifp))
			match = 1;
#endif
	}
	if_put(rtifp);

	return (match);
}

the only thing i've added above is the
ISSET(rtifp->if_flags, IFF_LOOPBACK) check. everything else is already
in place.

here's the code for carp_strict_addr_chk:

/*
 * If two carp interfaces share same physical interface, then we pretend all IP
 * addresses belong to single interface.
 */
static inline int
carp_strict_addr_chk(struct ifnet *ifp_a, struct ifnet *ifp_b)
{
	return ((ifp_a->if_type == IFT_CARP &&
	    ifp_b->if_index == ifp_a->if_carpdevidx) ||
	    (ifp_b->if_type == IFT_CARP &&
	    ifp_a->if_index == ifp_b->if_carpdevidx) ||
	    (ifp_a->if_type == IFT_CARP && ifp_b->if_type == IFT_CARP &&
	    ifp_a->if_carpdevidx == ifp_b->if_carpdevidx));
}

back to loopback and receiving packets. loopback is special because it
is not connected to the outside world. it is impossible to send a packet
via a loopback interface from another host, so configuring a globally
(externally) routable IP on it is currently pointless unless you enable
forwarding. i think making loopback more special and allowing it
to be globally reachable makes sense. i can't think of any downsides to
this at the moment, except that the behaviour would be subtle/not
obvious

is there a need to configure a globally reachable IP on a non-loopback
interface on a host (not router)? if so, then i'd be more convinced that
we need a separate lever to pull.
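for illustration, the decision above reduces to something like this
userland sketch (toy types, and it leaves out the enc(4), carp, and
PF_TAG_TRANSLATE_LOCALHOST cases handled by the real ip_laddr):

```c
#include <assert.h>
#include <stdbool.h>

struct toy_if {
	int	index;		/* interface index */
	bool	loopback;	/* IFF_LOOPBACK set */
};

/* rx_ifp: interface the packet arrived on.
 * rt_ifp: interface the local route for the destination points at. */
static bool
toy_ip_laddr(const struct toy_if *rx_ifp, const struct toy_if *rt_ifp)
{
	/* arrived on the interface that owns the address */
	if (rt_ifp->index == rx_ifp->index)
		return (true);
	/* arrived on a loopback interface, ie, local traffic */
	if (rx_ifp->loopback)
		return (true);
	/* the new case: the address lives on a loopback interface */
	if (rt_ifp->loopback)
		return (true);
	/* wrong interface: strong host model rejects the packet */
	return (false);
}
```

only the third check is new; the first two are the behaviour the kernel
already has.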



net.inet.ip.forwarding=0 vs lo(4)

2020-10-17 Thread David Gwynne
this is mostly about discussion at the moment, i'm not tied to this diff
at all.

the problem i'm hitting is that i have a multihomed box where the
service it provides listens on an IP address that's assigned to lo1.
it's a host running a service, it's not a router, so the
net.inet.ip.forwarding sysctl is not set to 1.

because of the checks introduced in src/sys/netinet/ip_input.c
r1.345, this doesn't work because a packet has to be received on the
interface the IP is assigned to. however, you cannot receive traffic on
an lo(4) interface (unless you're connecting from the local host), so
the addresses on lo1 are not globally accessible.

i came up with this diff, which adds even more special casing for
loopback interfaces. it says addresses on loopbacks are globally
reachable, even if ip forwarding is disabled.

is this reasonable? or is there a way i can do this without a diff
already? or is there a better diff to support this usecase? eg, would we
need a flag on IPs to specify if they're globally accessible or not?

also, i don't like the name "ip_laddr" for the function, but couldn't
come up with something better.

thoughts?

for those that are interested, the multihoming is via a bunch of gre(4)
interfaces so a set of frontend load balancers can route packets to
these backends, even if those backends are not directly connected to the
frontends because they're at different sites.

Index: netinet/ip_input.c
===
RCS file: /cvs/src/sys/netinet/ip_input.c,v
retrieving revision 1.351
diff -u -p -r1.351 ip_input.c
--- netinet/ip_input.c  22 Aug 2020 17:55:30 -  1.351
+++ netinet/ip_input.c  16 Oct 2020 04:27:43 -
@@ -753,29 +753,42 @@ in_ouraddr(struct mbuf *m, struct ifnet 
break;
}
}
-   } else if (ipforwarding == 0 && rt->rt_ifidx != ifp->if_index &&
-   !((ifp->if_flags & IFF_LOOPBACK) || (ifp->if_type == IFT_ENC) ||
-   (m->m_pkthdr.pf.flags & PF_TAG_TRANSLATE_LOCALHOST))) {
-   /* received on wrong interface. */
-#if NCARP > 0
-   struct ifnet *out_if;
+   } else if (ipforwarding == 0 && !ip_laddr(ifp, m, rt)) {
+   ipstat_inc(ips_wrongif);
+   match = 2;
+   }
+
+   return (match);
+}
 
+int
+ip_laddr(struct ifnet *ifp, struct mbuf *m, struct rtentry *rt)
+{
+   struct ifnet *rtifp;
+   int match = 0;
+
+   if (rt->rt_ifidx == ifp->if_index ||
+   ifp->if_type == IFT_ENC ||
+   ISSET(ifp->if_flags, IFF_LOOPBACK) ||
+   ISSET(m->m_pkthdr.pf.flags, PF_TAG_TRANSLATE_LOCALHOST))
+   return (1);
+
+   /* received on a different interface. */
+   rtifp = if_get(rt->rt_ifidx);
+   if (rtifp != NULL) {
+   if (ISSET(rtifp->if_flags, IFF_LOOPBACK))
+   match = 1;
+#if NCARP > 0
/*
 * Virtual IPs on carp interfaces need to be checked also
 * against the parent interface and other carp interfaces
 * sharing the same parent.
 */
-   out_if = if_get(rt->rt_ifidx);
-   if (!(out_if && carp_strict_addr_chk(out_if, ifp))) {
-   ipstat_inc(ips_wrongif);
-   match = 2;
-   }
-   if_put(out_if);
-#else
-   ipstat_inc(ips_wrongif);
-   match = 2;
+   else if (carp_strict_addr_chk(rtifp, ifp))
+   match = 1;
 #endif
}
+   if_put(rtifp);
 
return (match);
 }
Index: netinet/ip_var.h
===
RCS file: /cvs/src/sys/netinet/ip_var.h,v
retrieving revision 1.86
diff -u -p -r1.86 ip_var.h
--- netinet/ip_var.h8 Dec 2019 11:08:22 -   1.86
+++ netinet/ip_var.h16 Oct 2020 04:27:43 -
@@ -244,6 +244,7 @@ void ip_savecontrol(struct inpcb *, str
 voidipintr(void);
 int ip_input_if(struct mbuf **, int *, int, int, struct ifnet *);
 int ip_deliver(struct mbuf **, int *, int, int);
+int ip_laddr(struct ifnet *, struct mbuf *, struct rtentry *);
 voidip_forward(struct mbuf *, struct ifnet *, struct rtentry *, int);
 int rip_ctloutput(int, struct socket *, int, int, struct mbuf *);
 voidrip_init(void);
Index: netinet6/ip6_input.c
===
RCS file: /cvs/src/sys/netinet6/ip6_input.c,v
retrieving revision 1.229
diff -u -p -r1.229 ip6_input.c
--- netinet6/ip6_input.c24 Aug 2020 16:40:07 -  1.229
+++ netinet6/ip6_input.c16 Oct 2020 04:27:43 -
@@ -425,30 +425,9 @@ ip6_input_if(struct mbuf **mp, int *offp
if (rtisvalid(rt) && ISSET(rt->rt_flags, RTF_LOCAL)) {
struct in6_ifaddr *ia6 = ifatoia6(rt->rt_ifa);
 
-   if (ip6_forwarding == 0 && rt->rt_ifidx != 

Re: ifconfig: print tpmr(4) members

2020-08-04 Thread David Gwynne



> On 31 Jul 2020, at 17:17, Klemens Nanni  wrote:
> 
> This diff is to be applied on top of my other diff on tech@ with subject
> "ifconfig: merge switch_status() into bridge_status()".
> 
> It completes the output of tpmr interfaces in what I think is the
> simplest and least intrusive way.
> 
> tpmr is a trivial bridge and has no specific ioctls, so to distinguish
> it from the rest we must rely on the interface name;  assuming that it
> is tpmr because neither is_bridge() nor is_switch() return success is
> not possible due to the way ifconfig is designed: it runs all *_status()
> commands for all interface types.
> 
> An alternative approach would be to make ifconfig try all the various
> bridge related ioctls on all bridge-like interfaces and quiet down all
> failures so that the output stays clean, but I dislike this shotgun approach
> and prefer testing for different drivers where possible.
> 
> With this last piece in, I could finally document tpmr under ifconfig(8)
> (and move on the next drivers in need of love).
> 
> Feedback? OK?

let's do it.

> 
> 
> --- brconfig.c.orig   Fri Jul 31 08:58:03 2020
> +++ brconfig.cFri Jul 31 09:16:59 2020
> @@ -775,15 +775,28 @@
>   return (1);
> }
> 
> +/* no tpmr(4) specific ioctls, name is enough if ifconfig.c:printif() passed 
> */
> +int
> +is_tpmr(void)
> +{
> + return (strncmp(ifname, "tpmr", sizeof("tpmr") - 1) == 0);
> +}
> +
> void
> bridge_status(void)
> {
>   struct ifbrparam bp1, bp2;
> - int isswitch = is_switch();
> + int isswitch;
> 
> + if (is_tpmr()) {
> + bridge_list("\t");
> + return;
> + }
> +
>   if (!is_bridge())
>   return;
> 
> + isswitch = is_switch();
>   if (isswitch)
>   switch_cfg("\t");
>   else



Re: switch: allow datapath_id and maxflow ioctls for non-root

2020-08-04 Thread David Gwynne



> On 31 Jul 2020, at 14:28, Klemens Nanni  wrote:
> 
> ifconfig(8) detects switch(4) through its unique SIOCSWSDPID ioctl and
> further does another switch specific ioctl for the default output
> regardless of configuration and/or members:
> 
>   SIOCSWSDPID struct ifbrparam
>   Set the datapath_id in the OpenFlow protocol of the switch named
>   in ifbrp_name to the value in the ifbrpu_datapath field.
>   
>   SIOCSWGMAXFLOW struct ifbrparam
>   Retrieve the maximum number of flows in the OpenFlow protocol of
>   the switch named in ifbrp_name into the ifbrp_maxflow field.
> 
> This is how it should look like:
> 
>   # ifconfig switch0 create
>   # ifconfig switch0
>   switch0: flags=0<>
>   index 29 llprio 3
>   groups: switch
>   datapath 0x5bea2b5b8e2456cf maxflow 1 maxgroup 1000
> 
> But using ifconfig as unprivileged user makes it fail switch(4)
> interfaces as such and thus interprets them as bridge(4) instead:
> 
>   $ ifconfig switch0
>   switch0: flags=0<>
>   index 29 llprio 3
>   groups: switch
>   priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 
> proto rstp
>   designated: id 00:00:00:00:00:00 priority 0
> 
> This is because the above mentioned ioctls are listed together with all
> other bridge and switch related ioctls that set or write things.
> Getting datapath_id and maxflow values however is read-only and crucial
> for ifconfig as demonstrated above, so I'd like to move them out of the
> root check to fix ifconfig.
> 
> Feedback? OK?

can't they be caught by the default case now?

dlg

> 
> 
> Index: sys/net/if.c
> ===
> RCS file: /cvs/src/sys/net/if.c,v
> retrieving revision 1.616
> diff -u -p -r1.616 if.c
> --- sys/net/if.c  24 Jul 2020 18:17:14 -  1.616
> +++ sys/net/if.c  31 Jul 2020 04:13:40 -
> @@ -2170,13 +2170,15 @@ ifioctl(struct socket *so, u_long cmd, c
>   case SIOCBRDGSIFCOST:
>   case SIOCBRDGSTXHC:
>   case SIOCBRDGSPROTO:
> - case SIOCSWGDPID:
>   case SIOCSWSPORTNO:
> - case SIOCSWGMAXFLOW:
> #endif
>   if ((error = suser(p)) != 0)
>   break;
>   /* FALLTHROUGH */
> +#if NBRIDGE > 0
> + case SIOCSWGDPID:
> + case SIOCSWGMAXFLOW:
> +#endif
>   default:
>   error = ((*so->so_proto->pr_usrreq)(so, PRU_CONTROL,
>   (struct mbuf *) cmd, (struct mbuf *) data,
> 



Re: ifconfig: merge switch_status() into bridge_status()

2020-08-04 Thread David Gwynne



> On 31 Jul 2020, at 16:36, Klemens Nanni  wrote:
> 
> On Wed, Jul 29, 2020 at 02:21:42PM +0200, Klemens Nanni wrote:
>> This is to reduce duplicate code and pave the way for a single
>> bridge_status() that covers all bridge like interfaces: bridge(4),
>> switch(4) and tpmr(4).
> A duplicate bridge_cfg() call snuck in, fixed diff below.
> 
> Feedback? OK?

ok.

> 
>> +if (isswitch)
>> +switch_cfg("\t");
>> +else
>> +bridge_cfg("\t");
>> +
>>  bridge_cfg("\t");
>> 
>>  bridge_list("\t");
> 
> 
> Index: brconfig.c
> ===
> RCS file: /cvs/src/sbin/ifconfig/brconfig.c,v
> retrieving revision 1.26
> diff -u -p -r1.26 brconfig.c
> --- brconfig.c29 Jul 2020 12:13:28 -  1.26
> +++ brconfig.c31 Jul 2020 06:32:54 -
> @@ -54,6 +54,7 @@ void bridge_ifclrflag(const char *, u_in
> 
> void bridge_list(char *);
> void bridge_cfg(const char *);
> +void switch_cfg(const char *);
> void bridge_badrule(int, char **, int);
> void bridge_showrule(struct ifbrlreq *);
> int is_switch(void);
> @@ -778,17 +779,24 @@ void
> bridge_status(void)
> {
>   struct ifbrparam bp1, bp2;
> + int isswitch = is_switch();
> 
> - if (!is_bridge() || is_switch())
> + if (!is_bridge())
>   return;
> 
> - bridge_cfg("\t");
> + if (isswitch)
> + switch_cfg("\t");
> + else
> + bridge_cfg("\t");
> 
>   bridge_list("\t");
> 
>   if (aflag && !ifaliases)
>   return;
> 
> + if (isswitch)
> + return;
> +
>   strlcpy(bp1.ifbrp_name, ifname, sizeof(bp1.ifbrp_name));
>   if (ioctl(sock, SIOCBRDGGCACHE, (caddr_t)) == -1)
>   return;
> @@ -1146,8 +1154,8 @@ is_switch()
>   return (1);
> }
> 
> -static void
> -switch_cfg(char *delim)
> +void
> +switch_cfg(const char *delim)
> {
>   struct ifbrparam bp;
> 
> @@ -1168,20 +1176,6 @@ switch_cfg(char *delim)
>   err(1, "%s", ifname);
> 
>   printf(" maxgroup %d\n", bp.ifbrp_maxgroup);
> -}
> -
> -void
> -switch_status(void)
> -{
> - if (!is_switch())
> - return;
> -
> - switch_cfg("\t");
> -
> - bridge_list("\t");
> -
> - if (aflag && !ifaliases)
> - return;
> }
> 
> void
> Index: ifconfig.c
> ===
> RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
> retrieving revision 1.424
> diff -u -p -r1.424 ifconfig.c
> --- ifconfig.c3 Jul 2020 17:42:50 -   1.424
> +++ ifconfig.c31 Jul 2020 06:24:17 -
> @@ -3507,7 +3507,6 @@ status(int link, struct sockaddr_dl *sdl
>   phys_status(0);
> #ifndef SMALL
>   bridge_status();
> - switch_status();
> #endif
> }
> 
> Index: ifconfig.h
> ===
> RCS file: /cvs/src/sbin/ifconfig/ifconfig.h,v
> retrieving revision 1.2
> diff -u -p -r1.2 ifconfig.h
> --- ifconfig.h24 Oct 2019 18:54:10 -  1.2
> +++ ifconfig.h31 Jul 2020 06:24:18 -
> @@ -69,7 +69,6 @@ void bridge_flushrule(const char *, int)
> int is_bridge(void);
> void bridge_status(void);
> int bridge_rule(int, char **, int);
> -void switch_status(void);
> void switch_datapathid(const char *, int);
> void switch_portno(const char *, const char *);
> 



Re: ifconfig: remove redundant bridge checks

2020-07-28 Thread David Gwynne
ok.

> On 29 Jul 2020, at 11:38, Klemens Nanni  wrote:
> 
> On Tue, Jul 28, 2020 at 07:09:17PM +0200, Klemens Nanni wrote:
>> bridge_status() and switch_status() do the regular sanity check with
>> SIOCGIFFLAGS, but both functions also call is_switch(), bridge_status()
>> also calls is_bridge().
>> 
>> Those is_*() helpers do the same SIOCGIFFLAGS sanity check, making those
>> in *_status() entirely redundant, so I'd like to remove them.
> Small correction: is_bridge() does SIOCGIFFLAGS, is_switch() does not.
> 
> Below is a new diff that removes the SIOCGIFFLAGS check form is_bridge()
> as well, leaving the two is_*() helpers to their driver specific ioctls
> alone.
> 
>> I'm here since the tpmr(4) ioctl interface transition from trunk to
>> bridge semantics is now complete, so ifconfig(8) now requires tpmr bits
>> to show its members in bridge fashion.
>> 
>> One way would be duplicate code into is_tpmr() and tpmr_status() which
>> I've already done, but another approach is to unify all bridge like
>> interfaces under bridge_status().
> With this in, merging switch_status() into bridge_status() is a trivial
> diff, adding tpmr awareness to the mix would then be another diff after
> that.
> 
>> Either ways, diff below cleans up and makes for simpler code.
> So this effectively just removes SIOCGIFFLAGS in brconfig.c which are of
> no use, imho.  ifconfig.c:getinfo() already checks interfaces flags,
> even more than once, for all interfaces.
> 
>> Feedback? OK?
> 
> 
> Index: brconfig.c
> ===
> RCS file: /cvs/src/sbin/ifconfig/brconfig.c,v
> retrieving revision 1.25
> diff -u -p -r1.25 brconfig.c
> --- brconfig.c22 Jan 2020 06:24:07 -  1.25
> +++ brconfig.c29 Jul 2020 00:58:40 -
> @@ -762,14 +762,8 @@ bridge_holdcnt(const char *value, int d)
> int
> is_bridge()
> {
> - struct ifreq ifr;
>   struct ifbaconf ifbac;
> 
> - strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
> -
> - if (ioctl(sock, SIOCGIFFLAGS, (caddr_t)) == -1)
> - return (0);
> -
>   ifbac.ifbac_len = 0;
>   strlcpy(ifbac.ifbac_name, ifname, sizeof(ifbac.ifbac_name));
>   if (ioctl(sock, SIOCBRDGRTS, (caddr_t)) == -1) {
> @@ -783,16 +777,11 @@ is_bridge()
> void
> bridge_status(void)
> {
> - struct ifreq ifr;
>   struct ifbrparam bp1, bp2;
> 
>   if (!is_bridge() || is_switch())
>   return;
> 
> - strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
> - if (ioctl(sock, SIOCGIFFLAGS, (caddr_t)) == -1)
> - return;
> -
>   bridge_cfg("\t");
> 
>   bridge_list("\t");
> @@ -1184,13 +1173,7 @@ switch_cfg(char *delim)
> void
> switch_status(void)
> {
> - struct ifreq ifr;
> -
>   if (!is_switch())
> - return;
> -
> - strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
> - if (ioctl(sock, SIOCGIFFLAGS, (caddr_t)) == -1)
>   return;
> 
>   switch_cfg("\t");



Re: random toeplitz seeds

2020-07-17 Thread David Gwynne
On Fri, Jun 26, 2020 at 07:55:43AM +0200, Theo Buehler wrote:
> This adds an stoeplitz_random_seed() function that generates a random
> Toeplitz key seed with an invertible matrix T. This is necessary and
> sufficient for the hash to spread out over all 65536 possible values.
> 
> While it is clear from T * (-1) == 0 that seeds with parity 0 are bad,
> I don't have a neat and clean proof for the fact that a seed with
> parity 1 always generates an invertible Toeplitz matrix. It's not hard
> to check, but rather tedious.
> 
> I'm unsure how to hook it up. I enabled random seeds by using the
> function in stoeplitz_init(), but that's just for illustration.

sorry, i didn't see this when you sent it out.

i can't say whether the maths is right or not, but i'm happy to trust
you on it. it's hooked up fine though, so ok by me.

dlg

> Index: sys/net/toeplitz.c
> ===
> RCS file: /var/cvs/src/sys/net/toeplitz.c,v
> retrieving revision 1.7
> diff -u -p -r1.7 toeplitz.c
> --- sys/net/toeplitz.c19 Jun 2020 08:48:15 -  1.7
> +++ sys/net/toeplitz.c25 Jun 2020 18:43:02 -
> @@ -69,9 +69,38 @@ static struct stoeplitz_cache  stoeplitz_
>  const struct stoeplitz_cache *const
>   stoeplitz_cache = _syskey_cache; 
>  
> +/* parity of n16: count (mod 2) of ones in the binary representation. */
> +int
> +parity(uint16_t n16)
> +{
> + n16 = ((n16 & 0xaaaa) >> 1) ^ (n16 & 0x5555);
> + n16 = ((n16 & 0xcccc) >> 2) ^ (n16 & 0x3333);
> + n16 = ((n16 & 0xf0f0) >> 4) ^ (n16 & 0x0f0f);
> + n16 = ((n16 & 0xff00) >> 8) ^ (n16 & 0x00ff);
> +
> + return (n16);
> +}
> +
> +/*
> + * The Toeplitz matrix obtained from a seed is invertible if and only if the
> + * parity of the seed is 1. Generate such a seed uniformly at random.
> + */
> +stoeplitz_key
> +stoeplitz_random_seed(void)
> +{
> + stoeplitz_key seed;
> +   
> + seed = arc4random() & UINT16_MAX;
> + if (parity(seed) == 0)
> + seed ^= 1;
> +
> + return (seed);
> +}
> +
>  void
>  stoeplitz_init(void)
>  {
> + stoeplitz_keyseed = stoeplitz_random_seed();
>   stoeplitz_cache_init(_syskey_cache, stoeplitz_keyseed);
>  }
>  
> 
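the bit-folding in parity() above can be cross-checked against a naive
popcount; a quick python sketch (not part of the diff):

```python
def parity(n16):
    # XOR-fold the halves of each bit group; bit 0 ends up holding the
    # parity of the whole 16-bit value
    n16 = ((n16 & 0xaaaa) >> 1) ^ (n16 & 0x5555)
    n16 = ((n16 & 0xcccc) >> 2) ^ (n16 & 0x3333)
    n16 = ((n16 & 0xf0f0) >> 4) ^ (n16 & 0x0f0f)
    n16 = ((n16 & 0xff00) >> 8) ^ (n16 & 0x00ff)
    return n16

# every 16-bit value agrees with the naive definition
assert all(parity(n) == bin(n).count("1") % 2 for n in range(1 << 16))

# forcing odd parity the way stoeplitz_random_seed() does always yields
# a seed with parity 1
seed = 0x1234
if parity(seed) == 0:
    seed ^= 1
assert parity(seed) == 1
```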



Re: mcx(4) RSS

2020-07-14 Thread David Gwynne



> On 14 Jul 2020, at 4:40 pm, Jonathan Matthew  wrote:
> 
> mcx(4) is almost ready to enable RSS, except arm64 doesn't yet support
> mapping interrupts to cpus.  Until that's in place, here's a diff with the
> missing pieces from the driver in case anyone wants to test.  This will
> enable up to 8 rx/tx queues, depending on the number of cpus available.

seems to work fine on sparc64.

dlg

> Index: if_mcx.c
> ===
> RCS file: /cvs/src/sys/dev/pci/if_mcx.c,v
> retrieving revision 1.64
> diff -u -p -r1.64 if_mcx.c
> --- if_mcx.c  14 Jul 2020 04:10:18 -  1.64
> +++ if_mcx.c  14 Jul 2020 04:49:36 -
> @@ -33,6 +33,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #include 
> #include 
> @@ -83,7 +84,7 @@
> #define MCX_LOG_RQ_SIZE   10
> #define MCX_LOG_SQ_SIZE   11
> 
> -#define MCX_MAX_QUEUES   1
> +#define MCX_MAX_QUEUES   8
> 
> /* completion event moderation - about 10khz, or 90% of the cq */
> #define MCX_CQ_MOD_PERIOD 50
> @@ -2331,6 +2332,7 @@ struct mcx_softc {
>   unsigned int sc_calibration_gen;
>   struct timeout   sc_calibrate;
> 
> + struct intrmap  *sc_intrmap;
>   struct mcx_queuessc_queues[MCX_MAX_QUEUES];
>   unsigned int sc_nqueues;
> 
> @@ -2716,7 +2718,11 @@ mcx_attach(struct device *parent, struct
>   ether_sprintf(sc->sc_ac.ac_enaddr));
> 
>   msix = pci_intr_msix_count(pa->pa_pc, pa->pa_tag);
> - sc->sc_nqueues = 1;
> + sc->sc_intrmap = intrmap_create(>sc_dev, msix, MCX_MAX_QUEUES,
> + INTRMAP_POWEROF2);
> + sc->sc_nqueues = intrmap_count(sc->sc_intrmap);
> + KASSERT(sc->sc_nqueues > 0);
> + KASSERT(powerof2(sc->sc_nqueues));
> 
>   strlcpy(ifp->if_xname, DEVNAME(sc), IFNAMSIZ);
>   ifp->if_softc = sc;
> @@ -2786,8 +2792,9 @@ mcx_attach(struct device *parent, struct
>   }
>   snprintf(q->q_name, sizeof(q->q_name), "%s:%d",
>   DEVNAME(sc), i);
> - q->q_ihc = pci_intr_establish(sc->sc_pc, ih,
> - IPL_NET | IPL_MPSAFE, mcx_cq_intr, q, q->q_name);
> + q->q_ihc = pci_intr_establish_cpu(sc->sc_pc, ih,
> + IPL_NET | IPL_MPSAFE, intrmap_cpu(sc->sc_intrmap, i),
> + mcx_cq_intr, q, q->q_name);
>   }
> 
>   timeout_set(>sc_calibrate, mcx_calibrate, sc);
> Index: files.pci
> ===
> RCS file: /cvs/src/sys/dev/pci/files.pci,v
> retrieving revision 1.350
> diff -u -p -r1.350 files.pci
> --- files.pci 14 Jul 2020 04:10:18 -  1.350
> +++ files.pci 14 Jul 2020 04:49:36 -
> @@ -831,7 +831,7 @@ attachbnxt at pci
> file  dev/pci/if_bnxt.c   bnxt
> 
> # Mellanox ConnectX-4 and later
> -device  mcx: ether, ifnet, ifmedia, stoeplitz
> +device  mcx: ether, ifnet, ifmedia, stoeplitz, intrmap
> attach  mcx at pci
> filedev/pci/if_mcx.cmcx
> 
> 



deprecate interface input handler lists, just use a single function pointer

2020-07-09 Thread David Gwynne
this diff is about fixing some semantic issues with the current
interface input handler list processing. i was a bit worried it would
cause a small performance hit, but it seems it has the opposite effect
and is actually slightly faster. so in my opinion it is more correct,
but also improves performance.

originally the network stack only really dealt with layer 3 protocols
like ipv4, arp, ipv6, and so on. this meant ethernet drivers (ie, 90% of
network drivers) had to pull the ethernet header apart in their hardware
interrupt handlers before they could throw IP packets over the wall
(read queue packets for softnet to process).

ethernet got complicated though. there's a whole bunch of pseudo/virtual
interfaces that do interesting things at the ethernet level, and when we
were doing the initial mpsafe network stack work, none of them were
mpsafe. this meant that we couldn't pull ethernet headers apart without
taking the big lock, which meant that mpsafe hardware interrupt handlers
for nics wouldn't be able to run completely big lock free in moderately
interesting network configs.

so to help out in the initial mpsafe work, we decided to throw ethernet
packets straight off the ring over to softnet and pull ethernet apart
there. this let the interrupts run without biglock, but the virtual
ethernet interface drivers weren't mpsafe yet. to mitigate against that,
we made processing for them optional. since then we have made pretty
much all the pseudo interfaces mpsafe though.

so the currently situation is that by default, the softnet interface
input handler for ethernet interfaces simply looks at the protocol and
splits it up into ip/arp/etc. when you enable something like vlan(4) on
an interface, it prepends an input handler to normal ethernet one. this
handler acts like a filter, so the vlan one takes vlan encapsulated
packets away and lets the rest fall through to the normal ethernet one.

the bridge input handler looks at the mac address on the packet and
forwards it based on that address.

the semantic problems referred to above kick in if you enable bridge and
vlan at the same time. depending on the order in which you attach them
to the physical interface, they will filter in different orders. if you
enable bridge after vlan, packets will be sent to other ports regardless
of the vlan tags on the packet. this sucks if you want to land a vlan on
the current box, but bridge everything else.

i'm arguing that the interaction between vlan interfaces and bridges
should be deterministic. and all the virtual interfaces actually.

the semantics i'm suggesting are documented as comments in ether_input in
the diff below, but i'll list them here too:

 * Ethernet input has several "phases" of filtering packets to
 * support virtual/pseudo interfaces before actual layer 3 protocol
 * handling.
 *
 * First phase:
 *
 * The first phase supports drivers that aggregate multiple Ethernet
 * ports into a single logical interface, ie, aggr(4) and trunk(4).
 * These drivers intercept packets by swapping out the if_input handler
 * on the "port" interfaces to steal the packets before they get here
 * to ether_input().

 * Second phase: service delimited packet filtering.
 *
 * Let vlan(4) and svlan(4) look at "service delimited"
 * packets. If a virtual interface does not exist to take
 * those packets, they're returned to ether_input() so a
 * bridge can have a go at forwarding them.

 * Third phase: bridge processing.
 *
 * Give the packet to a bridge interface, ie, bridge(4),
 * switch(4), or tpmr(4), if it is configured. A bridge
 * may take the packet and forward it to another port, or it
 * may return it here to ether_input() to support local
 * delivery to this port.

 * Fourth phase: drop service delimited packets.
 *
 * If the packet has a tag, and a bridge didn't want it,
 * it's not for this port.

 * Fifth phase: destination address check.
 *
 * Is the packet specifically addressed to this port?
 * If it's not for this port, it could be for carp(4).
 * If not, it must be multicast or broadcast to go further.

 * Sixth phase: protocol demux.
 *
 * At this point it is known that the packet is destined
 * for layer 3 protocol handling on the local port.

another motivation for doing this is that jmatthew@ tried to convert the
srpl holding the interface input handlers to an smrl, and it blew up cos
it's possible to do a context switch in the network stack. you're not
supposed to do that with srps either, but they're a bit more forgiving.
getting rid of the list means another step closer to deprecating srps.
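as a toy model of the ordering argument (hypothetical python, nothing
like the real ether_input()): the phases run in a fixed sequence, so a
configured vlan always filters before the bridge sees the packet,
whatever order the interfaces were attached in.

```python
def ether_input(pkt, phases):
    # each phase may consume the packet (returning None) or hand it back
    for phase in phases:
        pkt = phase(pkt)
        if pkt is None:
            return None
    return pkt  # fell through to protocol demux

def vlan_phase(pkt):
    # second phase: a configured vlan(4) takes service-delimited packets
    return None if pkt.get("vlan") == 10 else pkt

def bridge_phase(pkt):
    # third phase: a bridge could forward here; this one keeps everything
    # for local delivery
    return pkt

assert ether_input({"vlan": 10}, [vlan_phase, bridge_phase]) is None
assert ether_input({"dst": "local"}, [vlan_phase, bridge_phase]) is not None
```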

Index: net/if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.611
diff -u -p -r1.611 if.c
--- net/if.c30 Jun 2020 09:31:38 -  1.611
+++ net/if.c9 Jul 2020 10:15:36 -
@@ -631,8 +631,6 @@ if_attach_common(struct ifnet *ifp)
if (ifp->if_enqueue == NULL)
ifp->if_enqueue = if_enqueue_ifq;

Re: bridge(4) shouldn't try to create new interfaces when i make a typo

2020-07-09 Thread David Gwynne



> On 10 Jul 2020, at 12:45 am, sven falempin  wrote:
> 
> On Thu, Jul 9, 2020 at 3:31 AM Klemens Nanni  wrote:
>> 
>> On Thu, Jul 09, 2020 at 05:08:01PM +1000, David Gwynne wrote:
>>> if i accidentally `ifconfig bridge add gre0` instead of egre0, having
>>> bridge create gre0 and then not like it is not what i expect to happen.
>>> especially when it leaves me with an extra gre0 interface lying around
>>> afterwards.
>>> 
>>> i can appreciate that this was trying to be helpful when you wanted
>>> to add virtual interfaces to a bridge on boot, but that was before
>>> netstart(8) created all the interfaces with config files up front, before
>>> it then goes through and runs the config for them.
>> I agree.
>> 
>> OK kn
>> 
> 
> this will force the use of create beforehand ?

the interface you want to add as a bridge port will have to exist before you 
can add it to a bridge.

> or ifconfig bridge0 up will still work ?

ifconfig does that, so this diff does not break that.

> because script in the wild may not do the create first.

scripts may have to be fixed as things change.

> 
> -- 
> --
> -
> Knowing is not enough; we must apply. Willing is not enough; we must do
> 



Re: silicom X710 ixl, unable to query phy types, no sff

2020-07-09 Thread David Gwynne
ok.

so the problem is the older api doesn't support the "get phy types" command, or 
the sff commands. should we silence the "get phy types" error output? is there 
a better errno to use when the sff command isn't supported? should we add 
something to the manpage? should that something be "i'm not angry, just 
disappointed"?

> On 9 Jul 2020, at 10:36 pm, Stuart Henderson  wrote:
> 
> Update on this: Silicom support responded promptly, asked sensible
> questions, didn't immediately bail on the fact that I'm not running a
> supported OS, and prepared an update (to run under Linux but the
> Debian live image worked fine for this).
> 
> ixl0 at pci6 dev 0 function 0 "Intel X710 SFP+" rev 0x02: port 1, FW 
> 7.0.50775 API 1.8, msix, 4 queues, address 00:e0:ed:75:a5:5c
> ixl1 at pci6 dev 0 function 1 "Intel X710 SFP+" rev 0x02: port 0, FW 
> 7.0.50775 API 1.8, msix, 4 queues, address 00:e0:ed:75:a5:5d
> ixl2 at pci6 dev 0 function 2 "Intel X710 SFP+" rev 0x02: port 2, FW 
> 7.0.50775 API 1.8, msix, 4 queues, address 00:e0:ed:75:a5:5e
> ixl3 at pci6 dev 0 function 3 "Intel X710 SFP+" rev 0x02: port 3, FW 
> 7.0.50775 API 1.8, msix, 4 queues, address 00:e0:ed:75:a5:5f
> 
> ixl3: flags=8802 mtu 1500
>lladdr 00:e0:ed:75:a5:5f
>index 6 priority 0 llprio 3
>media: Ethernet autoselect (10GbaseLR full-duplex)
>status: active
>transceiver: SFP LC, 1310 nm, 10.0km SMF
>model: FLEXOPTIX P.1396.10 rev A
>serial: F78R21S, date: 2018-07-09
>voltage: 3.30 V, bias current: 32.07 mA
>temp: 29.43 C (low -25.00 C, high 90.00 C)
>tx: -2.63 dBm (low -7.00 dBm, high 2.50 dBm)
>rx: -4.75 dBm (low -16.00 dBm, high 1.00 dBm)
> 
> 
> 
> On 2020/07/08 22:59, Stuart Henderson wrote:
>> I have some ixl cards which show "unable to query phy types" at
>> attach time, and return either EIO or ENODEV if I try fetching sff
>> pages.
>> 
>> I booted with SFP+ in all ixl ports and have this:
>> 
>> ixl0 at pci6 dev 0 function 0 "Intel X710 SFP+" rev 0x02: port 1, FW 
>> 5.0.40043 API 1.5, msix, 4 queues, address 00:e0:ed:75:a5:5c
>> ixl0: unable to query phy types
>> ixl1 at pci6 dev 0 function 1 "Intel X710 SFP+" rev 0x02: port 0, FW 
>> 5.0.40043 API 1.5, msix, 4 queues, address 00:e0:ed:75:a5:5d
>> ixl1: unable to query phy types
>> ixl2 at pci6 dev 0 function 2 "Intel X710 SFP+" rev 0x02: port 2, FW 
>> 5.0.40043 API 1.5, msix, 4 queues, address 00:e0:ed:75:a5:5e
>> ixl2: unable to query phy types
>> ixl3 at pci6 dev 0 function 3 "Intel X710 SFP+" rev 0x02: port 3, FW 
>> 5.0.40043 API 1.5, msix, 4 queues, address 00:e0:ed:75:a5:5f
>> ixl3: unable to query phy types
>> 
>> # ifconfig ixl sff
>> ixl0: flags=8802 mtu 1500
>>lladdr 00:e0:ed:75:a5:5c
>>index 3 priority 0 llprio 3
>>media: Ethernet autoselect
>>status: no carrier
>> ifconfig: ixl0 transceiver: Input/output error
>> ixl1: flags=8802 mtu 1500
>>lladdr 00:e0:ed:75:a5:5d
>>index 4 priority 0 llprio 3
>>media: Ethernet autoselect
>>status: no carrier
>> ifconfig: ixl1 transceiver: Input/output error
>> ixl2: flags=8802 mtu 1500
>>lladdr 00:e0:ed:75:a5:5e
>>index 5 priority 0 llprio 3
>>media: Ethernet autoselect (10GbaseLR full-duplex)
>>status: active
>> ifconfig: ixl2 transceiver: Operation not supported by device
>> ixl3: flags=8802 mtu 1500
>>lladdr 00:e0:ed:75:a5:5f
>>index 6 priority 0 llprio 3
>>media: Ethernet autoselect
>>status: no carrier
>> ifconfig: ixl3 transceiver: Input/output error
>> 
>> With "ifconfig ixlX debug" set, I get this on the interface
>> 
>> ixl2: ixl_sff_get_byte(dev 0xa0, reg 0x7f) -> 0003
>> 
>> Firmware on these are a bit older than the Intel cards that I've seen
>> so my first thought is to try updating, I've mailed Silicom to ask them
>> if they can provide anything newer (Intel's own downloads say not to
>> use them for non Intel-branded cards and I don't really want to
>> brick a card..), does anyone have other ideas while I'm waiting to
>> hear back from them?
>> 
>> 
>> OpenBSD 6.7-current (GENERIC.MP) #337: Wed Jul  8 10:37:10 MDT 2020
>>dera...@amd64.openbsd.org:/usr/src/sys/arch/amd64/compile/GENERIC.MP
>> real mem = 8464842752 (8072MB)
>> avail mem = 8193245184 (7813MB)
>> random: good seed from bootblocks
>> mpath0 at root
>> scsibus0 at mpath0: 256 targets
>> mainbus0 at root
>> bios0 at mainbus0: SMBIOS rev. 2.8 @ 0xed9b0 (46 entries)
>> bios0: vendor American Megatrends Inc. version "1.3" date 03/19/2018
>> bios0: Supermicro Super Server
>> acpi0 at bios0: ACPI 5.0
>> acpi0: sleep states S0 S4 S5
>> acpi0: tables DSDT FACP APIC FPDT FIDT SPMI MCFG UEFI DBG2 HPET WDDT SSDT 
>> SSDT SSDT PRAD DMAR HEST BERT ERST EINJ
>> acpi0: wakeup devices IP2P(S4) EHC1(S4) EHC2(S4) RP07(S4) RP08(S4) BR1A(S4) 
>> BR1B(S4) BR2A(S4) BR2B(S4) BR2C(S4) BR2D(S4) BR3A(S4) BR3B(S4) BR3C(S4) 
>> BR3D(S4) RP01(S4) [...]
>> acpitimer0 at acpi0: 

bridge(4) shouldn't try to create new interfaces when i make a typo

2020-07-09 Thread David Gwynne
if i accidentally `ifconfig bridge add gre0` instead of egre0, having
bridge create gre0 and then not like it is not what i expect to happen.
especially when it leaves me with an extra gre0 interface lying around
afterwards.

i can appreciate that this was trying to be helpful when you wanted
to add virtual interfaces to a bridge on boot, but that was before
netstart(8) created all the interfaces with config files up front, before
it then goes through and runs the config for them.

ok?

Index: if_bridge.c
===
RCS file: /cvs/src/sys/net/if_bridge.c,v
retrieving revision 1.340
diff -u -p -r1.340 if_bridge.c
--- if_bridge.c 24 Jun 2020 22:03:42 -  1.340
+++ if_bridge.c 9 Jul 2020 07:02:24 -
@@ -273,14 +273,6 @@ bridge_ioctl(struct ifnet *ifp, u_long c
break;
 
ifs = ifunit(req->ifbr_ifsname);
-
-   /* try to create the interface if it does't exist */
-   if (ifs == NULL) {
-   error = if_clone_create(req->ifbr_ifsname, 0);
-   if (error == 0)
-   ifs = ifunit(req->ifbr_ifsname);
-   }
-
if (ifs == NULL) {  /* no such interface */
error = ENOENT;
break;



let's not pretend switch(4) works with anything except ethernet interfaces.

2020-07-09 Thread David Gwynne
the code pretty obviously assumes that it only handles Ethernet packets,
so we should restrict it to IFT_ETHER type interfaces.

ok?

Index: if_switch.c
===
RCS file: /cvs/src/sys/net/if_switch.c,v
retrieving revision 1.30
diff -u -p -r1.30 if_switch.c
--- if_switch.c 6 Nov 2019 03:51:26 -   1.30
+++ if_switch.c 9 Jul 2020 05:59:51 -
@@ -506,6 +506,9 @@ switch_port_add(struct switch_softc *sc,
if ((ifs = ifunit(req->ifbr_ifsname)) == NULL)
return (ENOENT);
 
+   if (ifs->if_type != IFT_ETHER)
+   return (EPROTONOSUPPORT);
+
if (ifs->if_bridgeidx != 0)
return (EBUSY);
 
@@ -517,15 +520,12 @@ switch_port_add(struct switch_softc *sc,
return (EBUSY);
}
 
-   if (ifs->if_type == IFT_ETHER) {
-   if ((error = ifpromisc(ifs, 1)) != 0)
-   return (error);
-   }
+   if ((error = ifpromisc(ifs, 1)) != 0)
+   return (error);
 
swpo = malloc(sizeof(*swpo), M_DEVBUF, M_NOWAIT|M_ZERO);
if (swpo == NULL) {
-   if (ifs->if_type == IFT_ETHER)
-   ifpromisc(ifs, 0);
+   ifpromisc(ifs, 0);
return (ENOMEM);
}
swpo->swpo_switch = sc;



kstats for em(4)

2020-07-07 Thread David Gwynne
this is a first pass at converting the stats gathering in em(4) to
kstat instead of a disabled printf.

there are some semantic differences. the most obvious is that hardware
counters are not fed into the stacks counters on struct ifnet. i also
don't collect ring based counters (yet).

they look like this:

em0:0:em-stats:0
 rx crc errs: 0 packets
   rx align errs: 0 packets
 rx errs: 0 packets
   rx missed: 0 packets
  tx single coll: 0 packets
  tx excess coll: 0 packets
   tx multi coll: 0 packets
tx late coll: 0 packets
 tx coll: 0
   tx defers: 0
   tx no CRS: 0 packets
seq errs: 0
   carr ext errs: 0 packets
 rx len errs: 0 packets
  rx xon: 0 packets
  tx xon: 0 packets
 rx xoff: 0 packets
 tx xoff: 0 packets
  FC unsupported: 0 packets
  rx 64B: 6 packets
  rx 65-127B: 223 packets
 rx 128-255B: 15 packets
 rx 256-511B: 7 packets
rx 512-1023B: 2 packets
rx 1024-maxB: 1234584 packets
 rx good: 1234837 packets
rx bcast: 19 packets
rx mcast: 0 packets
 tx good: 630873 packets
 rx good: 1874122399 bytes
 tx good: 44180108 bytes
   rx no buffers: 0 packets
rx undersize: 0 packets
rx fragments: 0 packets
 rx oversize: 0 packets
  rx jabbers: 0 packets
 rx mgmt: 0 packets
   rx mgmt drops: 0 packets
 tx mgmt: 0 packets
rx total: 1874146525 bytes
tx total: 44180108 bytes
rx total: 1235017 packets
tx total: 630873 packets
  tx 64B: 113 packets
  tx 65-127B: 630681 packets
 tx 128-255B: 47 packets
 tx 256-511B: 29 packets
tx 512-1023B: 2 packets
tx 1024-maxB: 1 packets
tx mcast: 0 packets
tx bcast: 10 packets

unfortunately em(4) covers a lot of chips of different vintages, so if
anyone has a super old one they can try this diff on with kstat enabled
in their kernel config, that would be appreciated.

Index: if_em.c
===
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.354
diff -u -p -r1.354 if_em.c
--- if_em.c 22 Jun 2020 02:31:32 -  1.354
+++ if_em.c 7 Jul 2020 08:48:37 -
@@ -270,9 +270,6 @@ void em_receive_checksum(struct em_softc
 u_int  em_transmit_checksum_setup(struct em_queue *, struct mbuf *, u_int,
u_int32_t *, u_int32_t *);
 void em_iff(struct em_softc *);
-#ifdef EM_DEBUG
-void em_print_hw_stats(struct em_softc *);
-#endif
 void em_update_link_status(struct em_softc *);
 int  em_get_buf(struct em_queue *, int);
 void em_enable_hw_vlans(struct em_softc *);
@@ -302,6 +299,12 @@ void em_enable_queue_intr_msix(struct em
 #define em_allocate_msix(_sc)  (-1)
 #endif
 
+#if NKSTAT > 0
+void   em_kstat_attach(struct em_softc *);
+intem_kstat_read(struct kstat *);
+void   em_tbi_adjust_stats(struct em_softc *, uint32_t, uint8_t *);
+#endif
+
 /*
  *  OpenBSD Device Interface Entry Points
  */
@@ -561,8 +564,8 @@ em_attach(struct device *parent, struct 
 
/* Initialize statistics */
	em_clear_hw_cntrs(&sc->hw);
-#ifndef SMALL_KERNEL
-   em_update_stats_counters(sc);
+#if NKSTAT > 0
+   em_kstat_attach(sc);
 #endif
sc->hw.get_link_status = 1;
if (!defer)
@@ -1467,26 +1470,21 @@ em_iff(struct em_softc *sc)
 void
 em_local_timer(void *arg)
 {
-   struct ifnet   *ifp;
struct em_softc *sc = arg;
int s;
 
-	ifp = &sc->sc_ac.ac_if;
+	timeout_add_sec(&sc->timer_handle, 1);
 
s = splnet();
-
-#ifndef SMALL_KERNEL
-   em_update_stats_counters(sc);
-#ifdef EM_DEBUG
-   if (ifp->if_flags & IFF_DEBUG && ifp->if_flags & IFF_RUNNING)
-   em_print_hw_stats(sc);
-#endif
-#endif
em_smartspeed(sc);
-
-	timeout_add_sec(&sc->timer_handle, 1);
-
splx(s);
+
+#if NKSTAT > 0
+	if (sc->kstat != NULL && mtx_enter_try(&sc->kstat_mtx)) {
+		em_kstat_read(sc->kstat);
+		mtx_leave(&sc->kstat_mtx);
+	}
+#endif
 }
 
 void
@@ -2934,8 +2932,8 @@ em_rxeof(struct em_queue *que)
last_byte = *(mtod(m, caddr_t) + desc_len - 1);
		if (TBI_ACCEPT(&sc->hw, status, desc->errors,
pkt_len, last_byte)) {
-#ifndef SMALL_KERNEL
-			em_tbi_adjust_stats(&sc->hw, &sc->stats,
+#if NKSTAT > 0
+   em_tbi_adjust_stats(sc,
pkt_len, sc->hw.mac_addr);
 #endif
if (len > 0)
@@ -3396,228 +3394,365 @@ em_allocate_legacy(struct em_softc *sc)
return (0);
 }
 
+#if NKSTAT > 0
+/* this is used to look up the array of kstats quickly */
+enum em_stat {
+   em_stat_crcerrs,
+   em_stat_algnerrc,
+   

Re: [patch] dhclient(8) crashes with vm.malloc_conf=J

2020-07-06 Thread David Gwynne



> On 7 Jul 2020, at 7:57 am, Jesper Wallin  wrote:
> 
> Hi all,
> 
> I received a segmentation fault from dhclient(8) upon boot and decided
> to investigate...  My system is running with vm.malloc_conf=CFGJUR and
> figured one of those options was the cause of the crash.  I noticed that
> the buffer which holds my config options contained a lot of junk at the
> end and learned that 'J' is to blame together with a missing \0.
> 
> 
> How to reproduce:
> # sysctl vm.malloc_conf=J
> # cp /etc/dhclient.conf /etc/dhclient.conf.backup
> # echo 'supersede domain-name "ifconfig.se";' > /etc/dhclient.conf
> 
> Then run 'dhclient if0' a lot of times until it crashes, sometimes it
> takes more than 100 attempts.  Using vm.malloc_conf=CFGJUR might trigger
> it faster.
> 
> 
> In clparse.c:916, malloc(3) is used to get a buffer of the same length
> as the option in the config file.  But with 'J' in vm.malloc_conf, the
> buffer is bigger and contains junk.  I wouldn't say that my fix is the
> prettiest, but I get an extra byte and zero out the buffer.  Maybe
> someone has a more elegant fix for this.
> 
> 
> Yours,
> Jesper Wallin

you might want to put the memset after the check to see if the malloc failed...

> Index: clparse.c
> ===
> RCS file: /cvs/src/sbin/dhclient/clparse.c,v
> retrieving revision 1.199
> diff -u -p -r1.199 clparse.c
> --- clparse.c 13 May 2020 20:55:41 -  1.199
> +++ clparse.c 6 Jul 2020 21:25:54 -
> @@ -913,7 +913,8 @@ parse_option(FILE *cfile, int *code, str
>   } while (*fmt == 'A' && token == ',');
> 
>   free(options[i].data);
> - options[i].data = malloc(hunkix);
> + options[i].data = malloc(hunkix+1);
> + memset(options[i].data, 0, hunkix+1);
>   if (options[i].data == NULL)
>   fatal("option data");
>   memcpy(options[i].data, hunkbuf, hunkix);
> 



Re: use libc base64 code instead of libcrypt for ifconfig wg key handling

2020-06-21 Thread David Gwynne
On Sun, Jun 21, 2020 at 07:15:15PM -0600, Theo de Raadt wrote:
> In that case you can also delete:
> 
> ifconfig.c:#include 

indeed i can.

Index: Makefile
===
RCS file: /cvs/src/sbin/ifconfig/Makefile,v
retrieving revision 1.16
diff -u -p -r1.16 Makefile
--- Makefile21 Jun 2020 12:20:06 -  1.16
+++ Makefile22 Jun 2020 01:22:20 -
@@ -4,7 +4,7 @@ PROG=   ifconfig
 SRCS=  ifconfig.c brconfig.c sff.c
 MAN=   ifconfig.8
 
-LDADD= -lutil -lm -lcrypto
+LDADD= -lutil -lm
 DPADD= ${LIBUTIL}
 
.include <bsd.prog.mk>
Index: ifconfig.c
===
RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
retrieving revision 1.422
diff -u -p -r1.422 ifconfig.c
--- ifconfig.c  21 Jun 2020 12:20:06 -  1.422
+++ ifconfig.c  22 Jun 2020 01:22:20 -
@@ -94,7 +94,6 @@
 #include 
 
 #include 
-#include 
 
 #include 
 #include 
@@ -5673,14 +5672,12 @@ setifpriority(const char *id, int param)
  * space.
  */
 #define WG_BASE64_KEY_LEN (4 * ((WG_KEY_LEN + 2) / 3))
-#define WG_TMP_KEY_LEN (WG_BASE64_KEY_LEN / 4 * 3)
 #define WG_LOAD_KEY(dst, src, fn_name) do {\
-   uint8_t _tmp[WG_TMP_KEY_LEN];   \
+   uint8_t _tmp[WG_KEY_LEN]; int _r;   \
if (strlen(src) != WG_BASE64_KEY_LEN)   \
errx(1, fn_name " (key): invalid length");  \
-   if (EVP_DecodeBlock(_tmp, src,  \
-   WG_BASE64_KEY_LEN) != WG_TMP_KEY_LEN)   \
-   errx(1, fn_name " (key): invalid base64");  \
+	if ((_r = b64_pton(src, _tmp, sizeof(_tmp))) != sizeof(_tmp))	\
+		errx(1, fn_name " (key): invalid base64 %d/%zu", _r, sizeof(_tmp)); \
memcpy(dst, _tmp, WG_KEY_LEN);  \
 } while (0)
 
@@ -5899,13 +5896,15 @@ wg_status(void)
if (wg_interface->i_flags & WG_INTERFACE_HAS_RTABLE)
printf("\twgrtable %d\n", wg_interface->i_rtable);
if (wg_interface->i_flags & WG_INTERFACE_HAS_PUBLIC) {
-   EVP_EncodeBlock(key, wg_interface->i_public, WG_KEY_LEN);
+   b64_ntop(wg_interface->i_public, WG_KEY_LEN,
+   key, sizeof(key));
printf("\twgpubkey %s\n", key);
}
 
	wg_peer = &wg_interface->i_peers[0];
for (i = 0; i < wg_interface->i_peers_count; i++) {
-   EVP_EncodeBlock(key, wg_peer->p_public, WG_KEY_LEN);
+   b64_ntop(wg_peer->p_public, WG_KEY_LEN,
+   key, sizeof(key));
printf("\twgpeer %s\n", key);
 
if (wg_peer->p_flags & WG_PEER_HAS_PSK)



use libc base64 code instead of libcrypt for ifconfig wg key handling

2020-06-21 Thread David Gwynne
libc has undocumented base64 encoding and decoding functionality. this
cuts ifconfig over to using it instead of the code in libcrypto.

whether the libc functionality should be "blessed" and documented is a
separate issue.

ok?

Index: Makefile
===
RCS file: /cvs/src/sbin/ifconfig/Makefile,v
retrieving revision 1.16
diff -u -p -r1.16 Makefile
--- Makefile21 Jun 2020 12:20:06 -  1.16
+++ Makefile21 Jun 2020 23:15:34 -
@@ -4,7 +4,7 @@ PROG=   ifconfig
 SRCS=  ifconfig.c brconfig.c sff.c
 MAN=   ifconfig.8
 
-LDADD= -lutil -lm -lcrypto
+LDADD= -lutil -lm
 DPADD= ${LIBUTIL}
 
.include <bsd.prog.mk>
Index: ifconfig.c
===
RCS file: /cvs/src/sbin/ifconfig/ifconfig.c,v
retrieving revision 1.422
diff -u -p -r1.422 ifconfig.c
--- ifconfig.c  21 Jun 2020 12:20:06 -  1.422
+++ ifconfig.c  21 Jun 2020 23:15:35 -
@@ -5673,14 +5673,12 @@ setifpriority(const char *id, int param)
  * space.
  */
 #define WG_BASE64_KEY_LEN (4 * ((WG_KEY_LEN + 2) / 3))
-#define WG_TMP_KEY_LEN (WG_BASE64_KEY_LEN / 4 * 3)
 #define WG_LOAD_KEY(dst, src, fn_name) do {\
-   uint8_t _tmp[WG_TMP_KEY_LEN];   \
+   uint8_t _tmp[WG_KEY_LEN]; int _r;   \
if (strlen(src) != WG_BASE64_KEY_LEN)   \
errx(1, fn_name " (key): invalid length");  \
-   if (EVP_DecodeBlock(_tmp, src,  \
-   WG_BASE64_KEY_LEN) != WG_TMP_KEY_LEN)   \
-   errx(1, fn_name " (key): invalid base64");  \
+	if ((_r = b64_pton(src, _tmp, sizeof(_tmp))) != sizeof(_tmp))	\
+		errx(1, fn_name " (key): invalid base64 %d/%zu", _r, sizeof(_tmp)); \
memcpy(dst, _tmp, WG_KEY_LEN);  \
 } while (0)
 
@@ -5899,13 +5897,15 @@ wg_status(void)
if (wg_interface->i_flags & WG_INTERFACE_HAS_RTABLE)
printf("\twgrtable %d\n", wg_interface->i_rtable);
if (wg_interface->i_flags & WG_INTERFACE_HAS_PUBLIC) {
-   EVP_EncodeBlock(key, wg_interface->i_public, WG_KEY_LEN);
+   b64_ntop(wg_interface->i_public, WG_KEY_LEN,
+   key, sizeof(key));
printf("\twgpubkey %s\n", key);
}
 
	wg_peer = &wg_interface->i_peers[0];
for (i = 0; i < wg_interface->i_peers_count; i++) {
-   EVP_EncodeBlock(key, wg_peer->p_public, WG_KEY_LEN);
+   b64_ntop(wg_peer->p_public, WG_KEY_LEN,
+   key, sizeof(key));
printf("\twgpeer %s\n", key);
 
if (wg_peer->p_flags & WG_PEER_HAS_PSK)



Re: WireGuard patchset for OpenBSD, rev. 3

2020-06-21 Thread David Gwynne
On Sun, Jun 21, 2020 at 12:52:53PM +0200, Matthieu Herrb wrote:
> On Fri, Jun 19, 2020 at 06:46:00PM +1000, Matt Dunwoodie wrote:
> > Hi all,
> > 
> > After the previous submission of WireGuard, we've again been through a
> > number of improvements. Thank you everyone for your feedback.
> 
> Hi,
> 
> While giving wireguard a try, I found that this patch is needed to fix
> ifconfig(8) documentation :

Oh yeah, I hit that too.

OK by me.

> 
> diff --git sbin/ifconfig/ifconfig.8 sbin/ifconfig/ifconfig.8
> index 29edeb60793..93429b4c103 100644
> --- sbin/ifconfig/ifconfig.8
> +++ sbin/ifconfig/ifconfig.8
> @@ -2056,7 +2056,7 @@ Packets on a VLAN interface without a tag set will use 
> a value of
>  .Op Cm wgpsk Ar presharedkey
>  .Op Fl wgpsk
>  .Op Cm wgpka Ar persistent-keepalive
> -.Op Cm wgpip Ar ip port
> +.Op Cm wgendpoint Ar ip port
>  .Op Cm wgaip Ar allowed-ip/prefix
>  .Oc
>  .Op Fl wgpeerall
> @@ -2137,7 +2137,7 @@ By default this functionality is disabled, equivalent 
> to a value of 0.
>  This is often used to ensure a peer will be accessible when protected by
>  a firewall, as is when behind a NAT address.
>  A value of 25 is commonly used.
> -.It Cm wgpip Ar ip port
> +.It Cm wgendpoint Ar ip port
>  Set the IP address and port to send the encapsulated packets to.
>  If the peer changes address, the local interface will update the address
>  after receiving a correctly authenticated packet.
> 
> -- 
> Matthieu Herrb
> 



deprecate softclock based network livelock detection

2020-06-19 Thread David Gwynne
the network stack doesn't really block timeouts from firing anymore. this
is especially true on MP systems, because timeouts fire on cpu0 and the
nettq thread could be somewhere else entirely. this means network
activity doesn't make the softclock lose ticks, which means we aren't
scaling rx ring activity like we think we are. 

the alternative way to detect livelock is when a driver queues packets
for the stack to process, if there's too many packets built up then the
input routine return value tells the driver to slow down. this enables
finer grained livelock detection too. the rx ring accounting is done per
rx ring, and each rx ring is tied to a specific nettq. if one of
them is going too fast it shouldn't affect the others. the tick
based detection was done system wide and punished all the drivers.

the diff below converts all the drivers to the new mechanism, and
removes the old one.

i really need tests for this one. can someone try an affected nic
on armv7? other than that i think im mostly interested in em and bge
tests. i've been kicking bge a bit here on a sparc64, but the more the
merrier.

Index: dev/fdt/if_dwge.c
===
RCS file: /cvs/src/sys/dev/fdt/if_dwge.c,v
retrieving revision 1.2
diff -u -p -r1.2 if_dwge.c
--- dev/fdt/if_dwge.c   7 Oct 2019 00:40:04 -   1.2
+++ dev/fdt/if_dwge.c   19 Jun 2020 03:57:17 -
@@ -907,13 +907,15 @@ dwge_rx_proc(struct dwge_softc *sc)
sc->sc_rx_cons++;
}
 
+	if (ifiq_input(&ifp->if_rcv, &ml))
+		if_rxr_livelocked(&sc->sc_rx_ring);
+
dwge_fill_rx_ring(sc);
 
bus_dmamap_sync(sc->sc_dmat, DWGE_DMA_MAP(sc->sc_rxring), 0,
DWGE_DMA_LEN(sc->sc_rxring),
BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
 
-	if_input(ifp, &ml);
 }
 
 void
Index: dev/fdt/if_dwxe.c
===
RCS file: /cvs/src/sys/dev/fdt/if_dwxe.c,v
retrieving revision 1.15
diff -u -p -r1.15 if_dwxe.c
--- dev/fdt/if_dwxe.c   7 Oct 2019 00:40:04 -   1.15
+++ dev/fdt/if_dwxe.c   19 Jun 2020 03:57:17 -
@@ -966,13 +966,14 @@ dwxe_rx_proc(struct dwxe_softc *sc)
sc->sc_rx_cons++;
}
 
+	if (ifiq_input(&ifp->if_rcv, &ml))
+		if_rxr_livelocked(&sc->sc_rx_ring);
+
dwxe_fill_rx_ring(sc);
 
bus_dmamap_sync(sc->sc_dmat, DWXE_DMA_MAP(sc->sc_rxring), 0,
DWXE_DMA_LEN(sc->sc_rxring),
BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
-
-	if_input(ifp, &ml);
 }
 
 void
Index: dev/fdt/if_fec.c
===
RCS file: /cvs/src/sys/dev/fdt/if_fec.c,v
retrieving revision 1.8
diff -u -p -r1.8 if_fec.c
--- dev/fdt/if_fec.c6 Feb 2019 22:59:06 -   1.8
+++ dev/fdt/if_fec.c19 Jun 2020 03:57:17 -
@@ -1123,6 +1123,9 @@ fec_rx_proc(struct fec_softc *sc)
sc->sc_rx_cons++;
}
 
+	if (ifiq_input(&ifp->if_rcv, &ml))
+		if_rxr_livelocked(&sc->sc_rx_ring);
+
fec_fill_rx_ring(sc);
 
bus_dmamap_sync(sc->sc_dmat, ENET_DMA_MAP(sc->sc_rxring), 0,
@@ -1131,8 +1134,6 @@ fec_rx_proc(struct fec_softc *sc)
 
/* rx descriptors are ready */
HWRITE4(sc, ENET_RDAR, ENET_RDAR_RDAR);
-
-	if_input(ifp, &ml);
 }
 
 void
Index: dev/fdt/if_mvneta.c
===
RCS file: /cvs/src/sys/dev/fdt/if_mvneta.c,v
retrieving revision 1.10
diff -u -p -r1.10 if_mvneta.c
--- dev/fdt/if_mvneta.c 22 May 2020 10:02:30 -  1.10
+++ dev/fdt/if_mvneta.c 19 Jun 2020 03:57:17 -
@@ -1363,9 +1363,10 @@ mvneta_rx_proc(struct mvneta_softc *sc)
sc->sc_rx_cons = MVNETA_RX_RING_NEXT(idx);
}
 
-   mvneta_fill_rx_ring(sc);
+	if (ifiq_input(&ifp->if_rcv, &ml))
+		if_rxr_livelocked(&sc->sc_rx_ring);
 
-	if_input(ifp, &ml);
+   mvneta_fill_rx_ring(sc);
 }
 
 void
Index: dev/ic/bcmgenet.c
===
RCS file: /cvs/src/sys/dev/ic/bcmgenet.c,v
retrieving revision 1.1
diff -u -p -r1.1 bcmgenet.c
--- dev/ic/bcmgenet.c   14 Apr 2020 21:02:39 -  1.1
+++ dev/ic/bcmgenet.c   19 Jun 2020 03:57:17 -
@@ -729,8 +729,10 @@ genet_rxintr(struct genet_softc *sc, int
sc->sc_rx.next = index;
sc->sc_rx.pidx = pidx;
 
+		if (ifiq_input(&ifp->if_rcv, &ml))
+			if_rxr_livelocked(&sc->sc_rx_ring);
+
genet_fill_rx_ring(sc, qid);
-		if_input(ifp, &ml);
}
 }
 
Index: dev/ic/gem.c
===
RCS file: /cvs/src/sys/dev/ic/gem.c,v
retrieving revision 1.123
diff -u -p -r1.123 gem.c
--- dev/ic/gem.c7 Feb 2018 22:35:14 -   1.123
+++ dev/ic/gem.c19 Jun 2020 03:57:17 -
@@ -1020,6 +1020,9 @@ gem_rint(struct gem_softc *sc)
		ml_enqueue(&ml, m);

Re: multiple rings and cpus for ix(4)

2020-06-19 Thread David Gwynne
On Wed, Jun 17, 2020 at 01:17:37PM +0200, Hrvoje Popovski wrote:
> On 17.6.2020. 13:13, Jonathan Matthew wrote:
> > On Wed, Jun 17, 2020 at 12:50:46PM +0200, Hrvoje Popovski wrote:
> >> On 17.6.2020. 12:45, Hrvoje Popovski wrote:
> >>> On 17.6.2020. 11:27, Hrvoje Popovski wrote:
> >>>> On 17.6.2020. 10:36, David Gwynne wrote:
> >>>>> this is an updated version of a diff from christiano haesbaert by way of
> >>>>> mpi@ to enable the use of multiple tx and rx rings with msi-x.
> >>>>>
> >>>>> the high level description is that that driver checks to see if msix is
> >>>>> available, and if so how many vectors it has. it then gets an intrmap
> >>>>> based on that information, and bumps the number of queues to the number
> >>>>> of cpus that intrmap says are available.
> >>>>>
> >>>>> once the queues are allocated, it then iterates over them and wires up
> >>>>> interrupts to the cpus provided by the intrmap.
> >>>>>
> >>>>> im happy for people to try this out, but i can't commit it until all the
> >>>>> architectures that ix(4) is enabled on support the APIs that it's using.
> >>>>> this basically means it'll work on amd64 (and a little bit on i386), but
> >>>>> not much else. please hold back your tears and cries of anguish.
> >>>>>
> >>>>> thanks to christiano and mpi for doing most of the work leading up to
> >>>>> this diff :)
> >>>>
> >>>> Hi,
> >>>>
> >>>> first, thank you all for mq work :)
> >>>>
> >>>> with this diff, if i'm sending traffic over ix and at the same time
> >>>> execute ifconfig ix down/up, forwarding stops until i stop generator,
> >>>> wait for few seconds and execute ifconfig ix down/up few times and than
> >>>> forwarding start normally

i'll have to wire up a topology i can test this properly with when i get
back to the office.

> >> in vmstat i should see ix0:0-5 and ix1:0-5 ?
> > 
> > vmstat -i only shows interrupts that have actually fired. Use -zi to show
> > all interrupts.
> > 
> > This diff doesn't set up RSS, so received packets will only go to the first
> > vector, which is why only one of the ix1 interrupts has fired. Outgoing
> > packets are scattered across the tx queues, so all the ix0 interrupts have
> > fired.
> 
> yes, thank you ..

i had a look at setting up RSS today, and it turns out it's already
there. it's using a randomly generated toeplitz key (which is ok), but
it only hashes ipv4, ipv6, and tcp. if your test traffic is udp
between the same 2 IPs, it'll land on the same rx ring.

anyway, here's an updated diff with a couple of tweaks. firstly, it uses
intr_barrier when the interface is going down to make sure the rings
aren't in use on another cpu, and secondly it wires ix up with the
stoeplitz code.

Index: if_ix.c
===
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.166
diff -u -p -r1.166 if_ix.c
--- if_ix.c 7 Jun 2020 23:52:05 -   1.166
+++ if_ix.c 19 Jun 2020 05:02:31 -
@@ -115,7 +115,7 @@ void ixgbe_identify_hardware(struct ix_s
 intixgbe_allocate_pci_resources(struct ix_softc *);
 intixgbe_allocate_legacy(struct ix_softc *);
 intixgbe_allocate_msix(struct ix_softc *);
-intixgbe_setup_msix(struct ix_softc *);
+void   ixgbe_setup_msix(struct ix_softc *);
 intixgbe_allocate_queues(struct ix_softc *);
 void   ixgbe_free_pci_resources(struct ix_softc *);
 void   ixgbe_local_timer(void *);
@@ -199,7 +199,7 @@ struct cfattach ix_ca = {
 };
 
 int ixgbe_smart_speed = ixgbe_smart_speed_on;
-int ixgbe_enable_msix = 0;
+int ixgbe_enable_msix = 1;
 
 /*
  *  Device identification routine
@@ -301,7 +301,7 @@ ixgbe_attach(struct device *parent, stru
bcopy(sc->hw.mac.addr, sc->arpcom.ac_enaddr,
IXGBE_ETH_LENGTH_OF_ADDRESS);
 
-   if (sc->msix > 1)
+   if (sc->sc_intrmap)
error = ixgbe_allocate_msix(sc);
else
error = ixgbe_allocate_legacy(sc);
@@ -798,7 +798,7 @@ ixgbe_init(void *arg)
	timeout_add_sec(&sc->timer, 1);
 
/* Set up MSI/X routing */
-   if (sc->msix > 1) {
+   if (sc->sc_intrmap) {
ixgbe_configure_ivars(sc);
/* Set up auto-mask */
if (sc->hw.mac.type == ixgbe_mac_82598EB)
@@ -829,7 +829,7 @@ ixgbe_init(void *arg)

Re: stoeplitz_hash_ip*: rename lo & simplify further

2020-06-18 Thread David Gwynne



> On 18 Jun 2020, at 7:49 pm, Theo Buehler  wrote:
> 
> The same trick as in the previous diff can be used a second time:
> widen the type, accumulate before folding.
> 
> I've also shuffled things into an order where the introduction of
> a stoeplitz_hash_n32(scache, n32) suggests itself as a next step.
> It would just call stoeplitz_hash_n16(scache, n32 ^ (n32 >> 16));
> This trivializes the ip4 versions to one line:
> 
>   return (stoeplitz_hash_n32(faddr ^ laddr));
> 
> and
> 
>   return (stoeplitz_hash_n32(faddr ^ laddr ^ fport ^ lport));
> 
> respectively. At that point I will definitely be done since there
> will be no more code to remove... The ip6 versions are only slightly
> simplified by this.
> 
> While I think this is rather neat, I don't want to push things too far.
> Should I do this?

i want a hash_n32 anyway (for feeding things like gre keys in). i think i would 
like it more than the folding of the 32-bits into 16-bits at the end of these 
functions too.

this diff is ok.

dlg

> 
> Index: toeplitz.c
> ===
> RCS file: /var/cvs/src/sys/net/toeplitz.c,v
> retrieving revision 1.4
> diff -u -p -r1.4 toeplitz.c
> --- toeplitz.c18 Jun 2020 05:33:17 -  1.4
> +++ toeplitz.c18 Jun 2020 09:35:12 -
> @@ -116,30 +116,25 @@ uint16_t
> stoeplitz_hash_ip4(const struct stoeplitz_cache *scache,
> in_addr_t faddr, in_addr_t laddr)
> {
> - uint16_t lo;
> + uint32_t n32;
> 
> - lo  = faddr >> 0;
> - lo ^= faddr >> 16;
> - lo ^= laddr >> 0;
> - lo ^= laddr >> 16;
> + n32  = faddr ^ laddr;
> + n32 ^= n32 >> 16;
> 
> - return (stoeplitz_hash_n16(scache, lo));
> + return (stoeplitz_hash_n16(scache, n32));
> }
> 
> uint16_t
> stoeplitz_hash_ip4port(const struct stoeplitz_cache *scache,
> in_addr_t faddr, in_addr_t laddr, in_port_t fport, in_port_t lport)
> {
> - uint16_t lo;
> + uint32_t n32;
> 
> - lo  = faddr >> 0;
> - lo ^= faddr >> 16;
> - lo ^= laddr >> 0;
> - lo ^= laddr >> 16;
> - lo ^= fport >> 0;
> - lo ^= lport >> 0;
> + n32  = faddr ^ laddr;
> + n32 ^= fport ^ lport;
> + n32 ^= n32 >> 16;
> 
> - return (stoeplitz_hash_n16(scache, lo));
> + return (stoeplitz_hash_n16(scache, n32));
> }
> 
> #ifdef INET6
> @@ -147,44 +142,32 @@ uint16_t
> stoeplitz_hash_ip6(const struct stoeplitz_cache *scache,
> const struct in6_addr *faddr6, const struct in6_addr *laddr6)
> {
> - uint16_t lo = 0;
> + uint32_t n32 = 0;
>   size_t i;
> 
> - for (i = 0; i < nitems(faddr6->s6_addr32); i++) {
> - uint32_t faddr = faddr6->s6_addr32[i];
> - uint32_t laddr = laddr6->s6_addr32[i];
> -
> - lo ^= faddr >> 0;
> - lo ^= faddr >> 16;
> - lo ^= laddr >> 0;
> - lo ^= laddr >> 16;
> - }
> + for (i = 0; i < nitems(faddr6->s6_addr32); i++)
> + n32 ^= faddr6->s6_addr32[i] ^ laddr6->s6_addr32[i];
> +
> + n32 ^= n32 >> 16;
> 
> - return (stoeplitz_hash_n16(scache, lo));
> + return (stoeplitz_hash_n16(scache, n32));
> }
> 
> uint16_t
> stoeplitz_hash_ip6port(const struct stoeplitz_cache *scache,
> -const struct in6_addr *faddr6, const struct in6_addr * laddr6,
> +const struct in6_addr *faddr6, const struct in6_addr *laddr6,
> in_port_t fport, in_port_t lport)
> {
> - uint16_t lo = 0;
> + uint32_t n32 = 0;
>   size_t i;
> 
> - for (i = 0; i < nitems(faddr6->s6_addr32); i++) {
> - uint32_t faddr = faddr6->s6_addr32[i];
> - uint32_t laddr = laddr6->s6_addr32[i];
> -
> - lo ^= faddr >> 0;
> - lo ^= faddr >> 16;
> - lo ^= laddr >> 0;
> - lo ^= laddr >> 16;
> - }
> + for (i = 0; i < nitems(faddr6->s6_addr32); i++)
> + n32 ^= faddr6->s6_addr32[i] ^ laddr6->s6_addr32[i];
> 
> - lo ^= fport >> 0;
> - lo ^= lport >> 0;
> + n32 ^= fport ^ lport;
> + n32 ^= n32 >> 16;
> 
> - return (stoeplitz_hash_n16(scache, lo));
> + return (stoeplitz_hash_n16(scache, n32));
> }
> #endif /* INET6 */
> 
> 



Re: stoeplitz_hash_ip*: avoid early split into hi and lo

2020-06-17 Thread David Gwynne



> On 18 Jun 2020, at 2:34 pm, Theo Buehler  wrote:
> 
> Now that the calls to stoeplitz_cache_entry() are out of the way,
> we can avoid half of the calculations by merging the computation of
> hi and lo, only spliting at the end.  This allows us to leverage
> stoeplitz_hash_n16().
> 
> The name lo is now wrong. I kept it in order to avoid noise. I'm
> going clean this up in the next step.

ok on this, and on the next step where lo is renamed.

please keep __unused though.

> 
> Index: toeplitz.c
> ===
> RCS file: /cvs/src/sys/net/toeplitz.c,v
> retrieving revision 1.3
> diff -u -p -U5 -r1.3 toeplitz.c
> --- toeplitz.c18 Jun 2020 03:53:38 -  1.3
> +++ toeplitz.c18 Jun 2020 03:57:43 -
> @@ -114,108 +114,79 @@ stoeplitz_cache_init(struct stoeplitz_ca
> 
> uint16_t
> stoeplitz_hash_ip4(const struct stoeplitz_cache *scache,
> in_addr_t faddr, in_addr_t laddr)
> {
> - uint16_t lo, hi;
> + uint16_t lo;
> 
>   lo  = faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
> 
> - hi  = faddr >> 8;
> - hi ^= faddr >> 24;
> - hi ^= laddr >> 8;
> - hi ^= laddr >> 24;
> -
> - return (swap16(stoeplitz_cache_entry(scache, lo))
> - ^ stoeplitz_cache_entry(scache, hi));
> + return (stoeplitz_hash_n16(scache, lo));
> }
> 
> uint16_t
> stoeplitz_hash_ip4port(const struct stoeplitz_cache *scache,
> in_addr_t faddr, in_addr_t laddr, in_port_t fport, in_port_t lport)
> {
> - uint16_t hi, lo;
> + uint16_t lo;
> 
>   lo  = faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
>   lo ^= fport >> 0;
>   lo ^= lport >> 0;
> 
> - hi  = faddr >> 8;
> - hi ^= faddr >> 24;
> - hi ^= laddr >> 8;
> - hi ^= laddr >> 24;
> - hi ^= fport >> 8;
> - hi ^= lport >> 8;
> -
> - return (swap16(stoeplitz_cache_entry(scache, lo))
> - ^ stoeplitz_cache_entry(scache, hi));
> + return (stoeplitz_hash_n16(scache, lo));
> }
> 
> #ifdef INET6
> uint16_t
> stoeplitz_hash_ip6(const struct stoeplitz_cache *scache,
> const struct in6_addr *faddr6, const struct in6_addr *laddr6)
> {
> - uint16_t hi = 0, lo = 0;
> + uint16_t lo = 0;
>   size_t i;
> 
>   for (i = 0; i < nitems(faddr6->s6_addr32); i++) {
>   uint32_t faddr = faddr6->s6_addr32[i];
>   uint32_t laddr = laddr6->s6_addr32[i];
> 
>   lo ^= faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
> -
> - hi ^= faddr >> 8;
> - hi ^= faddr >> 24;
> - hi ^= laddr >> 8;
> - hi ^= laddr >> 24;
>   }
> 
> - return (swap16(stoeplitz_cache_entry(scache, lo))
> - ^ stoeplitz_cache_entry(scache, hi));
> + return (stoeplitz_hash_n16(scache, lo));
> }
> 
> uint16_t
> stoeplitz_hash_ip6port(const struct stoeplitz_cache *scache,
> const struct in6_addr *faddr6, const struct in6_addr * laddr6,
> in_port_t fport, in_port_t lport)
> {
> - uint16_t hi = 0, lo = 0;
> + uint16_t lo = 0;
>   size_t i;
> 
>   for (i = 0; i < nitems(faddr6->s6_addr32); i++) {
>   uint32_t faddr = faddr6->s6_addr32[i];
>   uint32_t laddr = laddr6->s6_addr32[i];
> 
>   lo ^= faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
> -
> - hi ^= faddr >> 8;
> - hi ^= faddr >> 24;
> - hi ^= laddr >> 8;
> - hi ^= laddr >> 24;
>   }
> 
>   lo ^= fport >> 0;
>   lo ^= lport >> 0;
> 
> - hi ^= fport >> 8;
> - hi ^= lport >> 8;
> -
> - return (swap16(stoeplitz_cache_entry(scache, lo))
> - ^ stoeplitz_cache_entry(scache, hi));
> + return (stoeplitz_hash_n16(scache, lo));
> }
> #endif /* INET6 */
> 
> void
> stoeplitz_to_key(uint8_t *k, size_t klen)
> Index: toeplitz.h
> ===
> RCS file: /cvs/src/sys/net/toeplitz.h,v
> retrieving revision 1.1
> diff -u -p -U5 -r1.1 toeplitz.h
> --- toeplitz.h16 Jun 2020 04:46:49 -  1.1
> +++ toeplitz.h18 Jun 2020 03:57:43 -
> @@ -52,11 +52,11 @@ uint16_t  stoeplitz_hash_ip6port(const st
>   const struct in6_addr *, const struct in6_addr *,
>   uint16_t, uint16_t);
> #endif
> 
> /* hash a uint16_t in network byte order */
> -static __unused inline uint16_t
> +static inline uint16_t
> stoeplitz_hash_n16(const struct stoeplitz_cache *scache, uint16_t n16)
> {
>   uint16_t hi, lo;
> 
>   hi = stoeplitz_cache_entry(scache, n16 >> 8);
> 



Re: simplify stoeplitz_hash_ip*

2020-06-17 Thread David Gwynne



> On 18 Jun 2020, at 1:34 am, Theo Buehler  wrote:
> 
> The next step is to use that we have cached the result of the matrix
> multiplication H * val in stoeplitz_cache_entry(scache, val), so the
> identity (H * x) ^ (H * y) == H * (x ^ y) allows us to push the calls to
> the cache function down to the end of stoeplitz_hash_ip{4,6}{,port}().
> 
> The result is the mechanical diff below. I have at least one follow-up,
> so it's intentionally minimalistic.
> 
> The identity in question was again confirmed by brute force on amd64,
> sparc64 and powerpc for all possible values of skey, x and y.

ok dlg@

> 
> Index: toeplitz.c
> ===
> RCS file: /cvs/src/sys/net/toeplitz.c,v
> retrieving revision 1.2
> diff -u -p -r1.2 toeplitz.c
> --- toeplitz.c17 Jun 2020 06:36:56 -  1.2
> +++ toeplitz.c17 Jun 2020 06:56:11 -
> @@ -118,17 +118,18 @@ stoeplitz_hash_ip4(const struct stoeplit
> {
>   uint16_t lo, hi;
> 
> - lo  = stoeplitz_cache_entry(scache, faddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 16);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 16);
> -
> - hi  = stoeplitz_cache_entry(scache, faddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 24);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 24);
> + lo  = faddr >> 0;
> + lo ^= faddr >> 16;
> + lo ^= laddr >> 0;
> + lo ^= laddr >> 16;
> +
> + hi  = faddr >> 8;
> + hi ^= faddr >> 24;
> + hi ^= laddr >> 8;
> + hi ^= laddr >> 24;
> 
> - return (swap16(lo) ^ hi);
> + return (swap16(stoeplitz_cache_entry(scache, lo))
> + ^ stoeplitz_cache_entry(scache, hi));
> }
> 
> uint16_t
> @@ -137,21 +138,22 @@ stoeplitz_hash_ip4port(const struct stoe
> {
>   uint16_t hi, lo;
> 
> - lo  = stoeplitz_cache_entry(scache, faddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 16);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 16);
> - lo ^= stoeplitz_cache_entry(scache, fport >> 0);
> - lo ^= stoeplitz_cache_entry(scache, lport >> 0);
> -
> - hi  = stoeplitz_cache_entry(scache, faddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 24);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 24);
> - hi ^= stoeplitz_cache_entry(scache, fport >> 8);
> - hi ^= stoeplitz_cache_entry(scache, lport >> 8);
> + lo  = faddr >> 0;
> + lo ^= faddr >> 16;
> + lo ^= laddr >> 0;
> + lo ^= laddr >> 16;
> + lo ^= fport >> 0;
> + lo ^= lport >> 0;
> +
> + hi  = faddr >> 8;
> + hi ^= faddr >> 24;
> + hi ^= laddr >> 8;
> + hi ^= laddr >> 24;
> + hi ^= fport >> 8;
> + hi ^= lport >> 8;
> 
> - return (swap16(lo) ^ hi);
> + return (swap16(stoeplitz_cache_entry(scache, lo))
> + ^ stoeplitz_cache_entry(scache, hi));
> }
> 
> #ifdef INET6
> @@ -166,18 +168,19 @@ stoeplitz_hash_ip6(const struct stoeplit
>   uint32_t faddr = faddr6->s6_addr32[i];
>   uint32_t laddr = laddr6->s6_addr32[i];
> 
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 16);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 16);
> -
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 24);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 24);
> + lo ^= faddr >> 0;
> + lo ^= faddr >> 16;
> + lo ^= laddr >> 0;
> + lo ^= laddr >> 16;
> +
> + hi ^= faddr >> 8;
> + hi ^= faddr >> 24;
> + hi ^= laddr >> 8;
> + hi ^= laddr >> 24;
>   }
> 
> - return (swap16(lo) ^ hi);
> + return (swap16(stoeplitz_cache_entry(scache, lo))
> + ^ stoeplitz_cache_entry(scache, hi));
> }
> 
> uint16_t
> @@ -192,24 +195,25 @@ stoeplitz_hash_ip6port(const struct stoe
>   uint32_t faddr = faddr6->s6_addr32[i];
>   uint32_t laddr = laddr6->s6_addr32[i];
> 
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, faddr >> 16);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 0);
> - lo ^= stoeplitz_cache_entry(scache, laddr >> 16);
> -
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, faddr >> 24);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 8);
> - hi ^= stoeplitz_cache_entry(scache, laddr >> 24);
> +  

multiple rings and cpus for ix(4)

2020-06-17 Thread David Gwynne
this is an updated version of a diff from christiano haesbaert by way of
mpi@ to enable the use of multiple tx and rx rings with msi-x.

the high level description is that that driver checks to see if msix is
available, and if so how many vectors it has. it then gets an intrmap
based on that information, and bumps the number of queues to the number
of cpus that intrmap says are available.

once the queues are allocated, it then iterates over them and wires up
interrupts to the cpus provided by the intrmap.

im happy for people to try this out, but i can't commit it until all the
architectures that ix(4) is enabled on support the APIs that it's using.
this basically means it'll work on amd64 (and a little bit on i386), but
not much else. please hold back your tears and cries of anguish.

thanks to christiano and mpi for doing most of the work leading up to
this diff :)

Index: if_ix.c
===
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.166
diff -u -p -r1.166 if_ix.c
--- if_ix.c 7 Jun 2020 23:52:05 -   1.166
+++ if_ix.c 17 Jun 2020 08:27:55 -
@@ -115,7 +115,7 @@ void ixgbe_identify_hardware(struct ix_s
 intixgbe_allocate_pci_resources(struct ix_softc *);
 intixgbe_allocate_legacy(struct ix_softc *);
 intixgbe_allocate_msix(struct ix_softc *);
-intixgbe_setup_msix(struct ix_softc *);
+void   ixgbe_setup_msix(struct ix_softc *);
 intixgbe_allocate_queues(struct ix_softc *);
 void   ixgbe_free_pci_resources(struct ix_softc *);
 void   ixgbe_local_timer(void *);
@@ -199,7 +199,7 @@ struct cfattach ix_ca = {
 };
 
 int ixgbe_smart_speed = ixgbe_smart_speed_on;
-int ixgbe_enable_msix = 0;
+int ixgbe_enable_msix = 1;
 
 /*
  *  Device identification routine
@@ -301,7 +301,7 @@ ixgbe_attach(struct device *parent, stru
bcopy(sc->hw.mac.addr, sc->arpcom.ac_enaddr,
IXGBE_ETH_LENGTH_OF_ADDRESS);
 
-   if (sc->msix > 1)
+   if (sc->sc_intrmap)
error = ixgbe_allocate_msix(sc);
else
error = ixgbe_allocate_legacy(sc);
@@ -798,7 +798,7 @@ ixgbe_init(void *arg)
timeout_add_sec(>timer, 1);
 
/* Set up MSI/X routing */
-   if (sc->msix > 1) {
+   if (sc->sc_intrmap) {
ixgbe_configure_ivars(sc);
/* Set up auto-mask */
if (sc->hw.mac.type == ixgbe_mac_82598EB)
@@ -829,7 +829,7 @@ ixgbe_init(void *arg)
itr |= IXGBE_EITR_LLI_MOD | IXGBE_EITR_CNT_WDIS;
IXGBE_WRITE_REG(>hw, IXGBE_EITR(0), itr);
 
-   if (sc->msix > 1) {
+   if (sc->sc_intrmap) {
/* Set moderation on the Link interrupt */
IXGBE_WRITE_REG(>hw, IXGBE_EITR(sc->linkvec),
IXGBE_LINK_ITR);
@@ -903,7 +903,7 @@ ixgbe_config_gpie(struct ix_softc *sc)
gpie |= 0xf << IXGBE_GPIE_LLI_DELAY_SHIFT;
}
 
-   if (sc->msix > 1) {
+   if (sc->sc_intrmap) {
/* Enable Enhanced MSIX mode */
gpie |= IXGBE_GPIE_MSIX_MODE;
gpie |= IXGBE_GPIE_EIAME | IXGBE_GPIE_PBA_SUPPORT |
@@ -1717,80 +1717,86 @@ ixgbe_allocate_msix(struct ix_softc *sc)
 {
	struct ixgbe_osdep	*os = &sc->osdep;
	struct pci_attach_args	*pa  = &os->os_pa;
-   int  vec, error = 0;
-   struct ix_queue *que = sc->queues;
-   const char  *intrstr = NULL;
-   pci_chipset_tag_t   pc = pa->pa_pc;
+   int  i = 0, error = 0;
+   struct ix_queue *que;
pci_intr_handle_t   ih;
 
-   vec = 0;
-	if (pci_intr_map_msix(pa, vec, &ih)) {
-   printf(": couldn't map interrupt\n");
-   return (ENXIO);
-   }
+   for (i = 0, que = sc->queues; i < sc->num_queues; i++, que++) {
+		if (pci_intr_map_msix(pa, i, &ih)) {
+   printf("ixgbe_allocate_msix: "
+   "pci_intr_map_msix vec %d failed\n", i);
+   error = ENOMEM;
+   goto fail;
+   }
 
-   que->msix = vec;
-   snprintf(que->name, sizeof(que->name), "%s:%d", sc->dev.dv_xname, vec);
+   que->tag = pci_intr_establish_cpu(pa->pa_pc, ih,
+   IPL_NET | IPL_MPSAFE, intrmap_cpu(sc->sc_intrmap, i),
+   ixgbe_queue_intr, que, que->name);
+   if (que->tag == NULL) {
+   printf("ixgbe_allocate_msix: "
+   "pci_intr_establish vec %d failed\n", i);
+   error = ENOMEM;
+   goto fail;
+   }
 
-   intrstr = pci_intr_string(pc, ih);
-   que->tag = pci_intr_establish(pc, ih, IPL_NET | IPL_MPSAFE,
-   ixgbe_queue_intr, que, que->name);
-   if (que->tag == NULL) {
-   printf(": 

Re: simplify Toeplitz cache computation

2020-06-16 Thread David Gwynne



> On 17 Jun 2020, at 01:57, Theo Buehler  wrote:
> 
> The diff below removes some of the unnecessary complications in the
> calculation of the stoeplitz_cache and brings them into a form more
> suitable for mathematical reasoning. I added a somewhat dense comment
> which explains the full construction and which will help justifying
> upcoming diffs.
> 
> The observations for the code changes are quite simple:

Sure ;)

Actually, reading your new comments in the code makes things a lot clearer to 
me. Thank you.

> First, scache->bytes[val] is a uint16_t, and it's easy to see that we
> only need the lower 16 bits of res in the second nested pair of for
> loops.  The values of key[b] are only xored together, to compute res,
> so we only need the lower 16 bits of those, too.
> 
> Next, looking at the first nested for loop, we see that the values
> 0..15 of j only touch the top 16 bits of key[b], so we can skip them.
> For b = 0, the inner loop for j in 16..31 scans backwards through skey
> and sets the corresponding bits of key[b], so we can see key[0] = skey.
> A bit of pondering then leads to the general expression:
> 
>   key[b] = skey << b | skey >> (NBSK - b);
> 
> I renamed the key array into toeplitz_column since it stores columns of
> the Toeplitz matrix.  If key is considered better, I won't insist.

I'd call it column instead of toeplitz_column or key, but I also don't insist.

> It's not very expensive to brute-force verify that scache->bytes[val]
> remains the same for all values of val and all values of skey. I did
> this on amd64, sparc64 and powerpc.

OK by me.

> 
> Index: sys/net/toeplitz.c
> ===
> RCS file: /var/cvs/src/sys/net/toeplitz.c,v
> retrieving revision 1.1
> diff -u -p -r1.1 toeplitz.c
> --- sys/net/toeplitz.c16 Jun 2020 04:46:49 -  1.1
> +++ sys/net/toeplitz.c16 Jun 2020 15:08:29 -
> @@ -76,40 +76,37 @@ stoeplitz_init(void)
> 
> #define NBSK (NBBY * sizeof(stoeplitz_key))
> 
> +/*
> + * The Toeplitz hash of a 16-bit number considered as a column vector over
> + * the field with two elements is calculated as a matrix multiplication with
> + * a 16x16 circulant Toeplitz matrix T generated by skey.
> + *
> + * The first eight columns H of T generate the remaining eight columns using
> + * the byteswap operation J = swap16:  T = [H JH].  Thus, the Toeplitz hash 
> of
> + * n = [hi lo] is computed via the formula T * n = (H * hi) ^ swap16(H * lo).
> + *
> + * Therefore the results H * val for all values of a byte are cached in 
> scache.
> + */
> void
> stoeplitz_cache_init(struct stoeplitz_cache *scache, stoeplitz_key skey)
> {
> - uint32_t key[NBBY];
> - unsigned int j, b, shift, val;
> + uint16_t toeplitz_column[NBBY];
> + unsigned int b, shift, val;
> 
> - bzero(key, sizeof(key));
> + bzero(toeplitz_column, sizeof(toeplitz_column));
> 
> - /*
> -  * Calculate 32bit keys for one byte; one key for each bit.
> -  */
> - for (b = 0; b < NBBY; ++b) {
> - for (j = 0; j < 32; ++j) {
> - unsigned int bit;
> + /* Calculate the first eight columns H of the Toeplitz matrix T. */
> + for (b = 0; b < NBBY; ++b)
> + toeplitz_column[b] = skey << b | skey >> (NBSK - b);
> 
> - bit = b + j;
> -
> - shift = NBSK - (bit % NBSK) - 1;
> - if (skey & (1 << shift))
> - key[b] |= 1 << (31 - j);
> - }
> - }
> -
> - /*
> -  * Cache the results of all possible bit combination of
> -  * one byte.
> -  */
> + /* Cache the results of H * val for all possible values of a byte. */
>   for (val = 0; val < 256; ++val) {
> - uint32_t res = 0;
> + uint16_t res = 0;
> 
>   for (b = 0; b < NBBY; ++b) {
>   shift = NBBY - b - 1;
>   if (val & (1 << shift))
> - res ^= key[b];
> + res ^= toeplitz_column[b];
>   }
>   scache->bytes[val] = res;
>   }
> 
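Both observations above are cheap to brute-force in userland: the closed-form column expression matches the old per-bit construction, and the cache entries are xor-linear. A standalone sketch (function names like cache_orig()/cache_col() are invented for the example, not the in-tree API):

```c
#include <assert.h>
#include <stdint.h>

#define NBBY	8
#define NBSK	(NBBY * sizeof(uint16_t))	/* 16 key bits */

/* the old construction: 32-bit per-bit keys, result truncated to 16 bits */
static uint16_t
cache_orig(uint16_t skey, unsigned int val)
{
	uint32_t key[NBBY] = { 0 };
	uint32_t res = 0;
	unsigned int j, b, shift, bit;

	for (b = 0; b < NBBY; ++b) {
		for (j = 0; j < 32; ++j) {
			bit = b + j;
			shift = NBSK - (bit % NBSK) - 1;
			if (skey & (1U << shift))
				key[b] |= 1U << (31 - j);
		}
	}

	for (b = 0; b < NBBY; ++b) {
		shift = NBBY - b - 1;
		if (val & (1U << shift))
			res ^= key[b];
	}

	return ((uint16_t)res);
}

/* the simplified construction: the columns are rotations of skey */
static uint16_t
cache_col(uint16_t skey, unsigned int val)
{
	uint16_t res = 0;
	unsigned int b, shift;

	for (b = 0; b < NBBY; ++b) {
		uint16_t col = skey << b | skey >> (NBSK - b);

		shift = NBBY - b - 1;
		if (val & (1U << shift))
			res ^= col;
	}

	return (res);
}

/* 1 if both claims hold for this key, 0 otherwise */
static int
stoeplitz_check(uint16_t skey)
{
	unsigned int a, b;

	/* the closed form matches the old per-bit construction... */
	for (a = 0; a < 256; ++a) {
		if (cache_orig(skey, a) != cache_col(skey, a))
			return (0);
	}

	/* ...and the cache entries are linear over xor */
	for (a = 0; a < 256; ++a) {
		for (b = 0; b < 256; ++b) {
			if (cache_col(skey, a ^ b) !=
			    (uint16_t)(cache_col(skey, a) ^
			    cache_col(skey, b)))
				return (0);
		}
	}

	return (1);
}
```

Looping stoeplitz_check() over all 2^16 seed values reproduces the brute-force verification described in the mail.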



interrupt to cpu mapping API

2020-06-15 Thread David Gwynne
should the api provide struct cpu_info pointers instead
of numeric cpu ids?

our experience so far is that pci_intr_establish_cpuid() immediately
maps the id to a pointer anyway, and intrmap iterates over struct
cpu_info pointers to build the list of ids, so we could just remove the
numbers in the middle. pci_intr_establish_cpu() could take a cpu_info
pointer, and intrmap could provide cpu_info pointers.

the only caveat to this i can think of is if we need to establish
interrupts before cpus are attached, which might be useful on arm
archs. we can also change this in the tree.

if it's not obvious, im kind of sick of talking about this stuff,
so i'd rather shut up and hack on multiq support in the tree as
much as possible.

ok?
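To illustrate the sizing behaviour the API documents (clamp the requested count to the driver limit and the number of usable CPUs, optionally round down to a power of two, then spread the vectors over those CPUs), here is a userland toy model; toy_intrmap_count() and toy_intrmap_cpu() are invented names and this is not the kernel implementation:

```c
#include <assert.h>

/*
 * toy model: how many interrupts a driver ends up with, given the
 * hardware count (nintr), the driver limit (maxintr), the number of
 * usable cpus, and an INTRMAP_POWEROF2-style flag.
 */
static unsigned int
toy_intrmap_count(unsigned int nintr, unsigned int maxintr,
    unsigned int ncpus, int powerof2)
{
	unsigned int n = (nintr == 0) ? maxintr : nintr;

	if (n > maxintr)
		n = maxintr;
	if (n > ncpus)
		n = ncpus;
	if (powerof2) {
		/* round down to a power of two */
		while (n & (n - 1))
			n &= (n - 1);
	}

	return (n);
}

/* toy model: spread interrupt i over the usable cpus */
static unsigned int
toy_intrmap_cpu(unsigned int i, unsigned int ncpus)
{
	return (i % ncpus);
}
```

For example, a device with 16 vectors and a driver limit of 16 on a 6-CPU box ends up with 4 queues under the power-of-two constraint.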

Index: share/man/man9/intrmap_create.9
===
RCS file: share/man/man9/intrmap_create.9
diff -N share/man/man9/intrmap_create.9
--- /dev/null   1 Jan 1970 00:00:00 -
+++ share/man/man9/intrmap_create.9 16 Jun 2020 00:13:50 -
@@ -0,0 +1,125 @@
+.\" $OpenBSD$
+.\"
+.\" Copyright (c) 2020 David Gwynne 
+.\"
+.\" Permission to use, copy, modify, and distribute this software for any
+.\" purpose with or without fee is hereby granted, provided that the above
+.\" copyright notice and this permission notice appear in all copies.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+.\"
+.Dd $Mdocdate: June 16 2020 $
+.Dt INTRMAP_CREATE 9
+.Os
+.Sh NAME
+.Nm intrmap_create ,
+.Nm intrmap_destroy ,
+.Nm intrmap_count ,
+.Nm intrmap_cpu
+.Nd interrupt to CPU mapping API
+.Sh SYNOPSIS
+.In sys/intrmap.h
+.Ft struct intrmap *
+.Fo intrmap_create
+.Fa "const struct device *dv"
+.Fa "unsigned int nintr"
+.Fa "unsigned int maxintr"
+.Fa "unsigned int flags"
+.Fc
+.Ft void
+.Fn intrmap_destroy "struct intrmap *im"
+.Ft unsigned int
+.Fn intrmap_count "struct intrmap *im"
+.Ft unsigned int
+.Fn intrmap_cpu "struct intrmap *im" "unsigned int index"
+.Sh DESCRIPTION
+The interrupt to CPU mapping API supports the use of multiple CPUs
+by hardware drivers.
+Drivers that can use multiple interrupts use the API to request a
+set of CPUs that they can establish those interrupts on.
+The API limits the requested number of interrupts to what is available
+on the system, and attempts to distribute the requested interrupts
+over those CPUs.
+On some platforms the API will filter the set of available CPUs.
+.\" to avoid hyperthreads, basically.
+.Pp
+.Fn intrmap_create
+allocates an interrupt map data structure for use by the driver
+identified by
+.Fa dv .
+The number of interrupts the hardware supports is specified via the
+.Fa nintr
+argument.
+The driver supplies the maximum number of interrupts it can support
+via
+.Fa maxintr ,
+which, along with the number of available CPUs at the time the
+function is called, is used as a constraint on the number of requested
+interrupts.
+.Fa nintr
+may be zero to use the driver limit as the number of requested
+interrupts.
+The
+.Fa flags
+argument may have the following defines OR'ed together:
+.Bl -tag -width xxx -offset indent
+.It Dv INTRMAP_POWEROF2
+The hardware only supports a power of 2 number of interrupts, so
+constrain the number of supplied interrupts after the system and
+driver limits are applied.
+.El
+.Pp
+.Fn intrmap_destroy
+frees the memory associated with the interrupt map data structure
+passed via
+.Fa im .
+.Pp
+.Fn intrmap_count
+returns the number of interrupts that the driver can establish
+according to the
+.Fa im
+interrupt map.
+.Pp
+.Fn intrmap_cpu
+returns which CPU the interrupt specified in
+.Fa index
+should be established on according to the
+.Fa im
+interrupt map.
+Interrupts are identified as a number from 0 to one less than the value returned by
+.Fn intrmap_count .
+.Sh CONTEXT
+.Fn intrmap_create ,
+.Fn intrmap_destroy ,
+.Fn intrmap_count ,
+and
+.Fn intrmap_cpu
+can be called during autoconf, or from process context.
+.Sh RETURN VALUES
+.Fn intrmap_create
+returns a pointer to an interrupt mapping structure on success, or
+.Dv NULL
+on failure.
+.Pp
+.Fn intrmap_count
+returns the number of interrupts that were allocated for the driver
+to use.
+.Pp
+.Fn intrmap_cpu
+returns an identifier for the CPU that the interrupt should be
+established on.
+.\" .Sh SEE ALSO
+.\" .Xr pci_int

Re: symmetric toeplitz hashing

2020-06-15 Thread David Gwynne



> On 13 Jun 2020, at 3:20 pm, Theo Buehler  wrote:
> 
> On Sat, Jun 13, 2020 at 11:35:42AM +1000, David Gwynne wrote:
>> On Fri, Jun 12, 2020 at 03:37:59PM +0200, Theo Buehler wrote:
>>> I finally found the time to think about the mathematics of this some
>>> more and I'm now convinced that it's a sound construction. I hope that
>>> one or the other observation below will be useful for you.
>> 
>> Yes, I read everything below and it sounds great and useful. My only
>> issue is that I'd like to show the application of those changes as
>> commits in the tree. I already feel the diff I originally posted has
>> diverged too far from the dfly code and a lot of why I made the changes
>> has been lost.
> 
> I'm fine with tweaking things in tree. I have reviewed the original code
> and it looks good. The API looks sane. You have my ok for the original
> diff (modulo the typos I pointed out), but it's really not my part of
> the tree...
> 
> Once it's landed, I can provide diffs for the tweaks for the internals I
> suggested.  We can also wait with those until some consumers are wired
> up, so we can actually test them.

Alright. I'm going to start putting it into the tree tomorrow unless someone 
with a good reason objects.


Re: symmetric toeplitz hashing

2020-06-14 Thread David Gwynne



> On 14 Jun 2020, at 10:59 pm, Miod Vallat  wrote:
> 
> 
>>> Others have pointed out off-list that one can use __builtin_popcount(),
>>> but __builtin_parity() is exactly what I want. Is it available on all
>>> architectures?
>> 
>> I don't think it is available on gcc 3.x for m88k but someone with
>> an m88k should confirm.
> 
> __builtin_popcount() does not exist in gcc 3.

Also not sure it's necessary to omg-optimise the initialisation of the cache, 
which hopefully happens once early in boot.


Re: symmetric toeplitz hashing

2020-06-12 Thread David Gwynne
On Fri, Jun 12, 2020 at 03:37:59PM +0200, Theo Buehler wrote:
> I finally found the time to think about the mathematics of this some
> more and I'm now convinced that it's a sound construction. I hope that
> one or the other observation below will be useful for you.

Yes, I read everything below and it sounds great and useful. My only
issue is that I'd like to show the application of those changes as
commits in the tree. I already feel the diff I originally posted has
diverged too far from the dfly code and a lot of why I made the changes
has been lost.

> The hash as it is now can be proved to produce values in the full range
> of uint16_t, so that's good.
> 
> As we discussed already, you can simplify the construction further.

Yep, I'm keen.

> One trick is that stoeplitz_cache_entry() is linear in the second
> argument -- you can think of the xor operation ^ as addition of vectors
> (uint8_t and uint16_t) over the field with two elements {0, 1}.  That's
> just the mathematician's way of expressing this relation:
> 
>   stoeplitz_cache_entry(scache, a ^ b)
>   == stoeplitz_cache_entry(scache, a) ^ stoeplitz_cache_entry(scache, 
> b);
> 
> Using this, the stoeplitz hash functions can be rewritten to this.
> I would expect it to be a bit cheaper.

Yes, that's very cool.

> uint16_t
> stoeplitz_hash_ip4(const struct stoeplitz_cache *scache,
> in_addr_t faddr, in_addr_t laddr)
> {
>   uint16_t lo, hi;
> 
>   lo  = faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
> 
>   hi  = faddr >> 8;
>   hi ^= faddr >> 24;
>   hi ^= laddr >> 8;
>   hi ^= laddr >> 24;
> 
>   return swap16(stoeplitz_cache_entry(scache, lo))
>   ^ stoeplitz_cache_entry(scache, hi);
> }
> 
> or another example:
> 
> uint16_t
> stoeplitz_hash_ip6port(const struct stoeplitz_cache *scache,
> const struct in6_addr *faddr6, const struct in6_addr * laddr6,
> in_port_t fport, in_port_t lport)
> {
>   uint16_t hi = 0, lo = 0;
>   size_t i;
> 
>   for (i = 0; i < nitems(faddr6->s6_addr32); i++) {
>   uint32_t faddr = faddr6->s6_addr32[i];
>   uint32_t laddr = laddr6->s6_addr32[i];
> 
>   lo ^= faddr >> 0;
>   lo ^= faddr >> 16;
>   lo ^= laddr >> 0;
>   lo ^= laddr >> 16;
> 
>   hi ^= faddr >> 8;
>   hi ^= faddr >> 24;
>   hi ^= laddr >> 8;
>   hi ^= laddr >> 24;
>   }
> 
>   lo ^= fport >> 0;
>   lo ^= lport >> 0;
> 
>   hi ^= fport >> 8;
>   hi ^= lport >> 8;
> 
>   return swap16(stoeplitz_cache_entry(scache, lo))
>   ^ stoeplitz_cache_entry(scache, hi);
> }
> 
> and so on.

Very cool.

> Next, I don't particularly like this magic STOEPLITZ_KEYSEED number, but
> I guess we can live with it.

My understanding is that it comes from the vector MS provided or used. I
can't easily find their old or original documentation, but you can see
it's the first two bytes in the following:

https://docs.microsoft.com/en-us/windows-hardware/drivers/network/verifying-the-rss-hash-calculation

> Another option would be to generate the key seed randomly. You will get
> a "good" hash that spreads out over all 16 bit values if and only if the
> random value has an odd number of binary digits.

sephe and I have talked about replacing it with a random number, but I
was going to wait until such a change could have an interesting commit
message in a tree. However, my idea of a good seed was "making sure it's
not zero", so I'm super happy there's a much more rigorous definition
we can use.

> I haven't thought hard about this, but I don't immediately see a better
> way of generating such numbers than:
> 
> int
> stoeplitz_good_seed(uint16_t seed)
> {
>   int ones = 0;
> 
>   while (seed > 0) {
>   ones += seed % 2;
>   seed /= 2;
>   }
> 
>   return ones % 2;
> }
> 
> uint16_t
> stoeplitz_seed_init(void)
> {
>   uint16_tseed;
> 
>   do {
>   seed = arc4random() & UINT16_MAX;
>   } while (!stoeplitz_good_seed(seed));
> 
>   return seed;
> }
> 
> This will loop as long as it needs to get a good toeplitz key seed.
> Each time there is a 50% chance that it will find one, so this will
> need to loop n times with probability 1 / 2**n. This is basically the
> same situation as for arc4random_uniform() with an upper bound that is
> not a power of two.
> 
> I don't know if something like that is acceptable early in init_main.c.
> If it is, you can use a seed generated by this init function in place
> of STOEPLITZ_KEYSEED.

I think deraadt@ would be fine with us permuting the random number
generator by taking some bytes out of it early.
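That iff condition is itself cheap to brute-force in userland. The sketch below assumes the cached-column construction documented in toeplitz.c (the hash of a 16-bit n = [hi lo] is H*hi ^ swap16(H*lo)); stoeplitz_hash16() and stoeplitz_covers_all() are names invented for the example:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NBBY	8

/* Toeplitz hash of a 16-bit value n = [hi lo]: H*hi ^ swap16(H*lo) */
static uint16_t
stoeplitz_hash16(uint16_t skey, uint16_t n)
{
	uint16_t col, hi = 0, lo = 0;
	unsigned int b;

	for (b = 0; b < NBBY; b++) {
		col = skey << b | skey >> (16 - b);
		if ((n >> 8) & (1U << (NBBY - b - 1)))
			hi ^= col;
		if ((n & 0xff) & (1U << (NBBY - b - 1)))
			lo ^= col;
	}

	return (hi ^ (uint16_t)(lo << 8 | lo >> 8));
}

/* does this seed hit every 16-bit output? */
static int
stoeplitz_covers_all(uint16_t skey)
{
	static uint8_t seen[65536];
	unsigned int n, hits = 0;

	memset(seen, 0, sizeof(seen));
	for (n = 0; n < 65536; n++) {
		uint16_t h = stoeplitz_hash16(skey, n);

		if (!seen[h]) {
			seen[h] = 1;
			hits++;
		}
	}

	return (hits == 65536);
}
```

The even-parity direction is easy to see directly: every column is a rotation of the seed, so with an even number of one bits every output has even parity and the image can never cover all of uint16_t.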

> Finally, I don't think it would be all bad to eliminate the cache
> altogether and simply do this (and similarly for all the other
> stoeplitz_hash_*() variants). This would be another way of 

Re: npppd(8) man pages fix

2020-06-12 Thread David Gwynne



> On 13 Jun 2020, at 6:47 am, Jason McIntyre  wrote:
> 
> On Fri, Jun 12, 2020 at 09:53:33PM +0300, Vitaliy Makkoveev wrote:
>> Since the 6.7 release npppd(8) uses pppac(4) instead of tun(4), but the
>> manual page still has a reference to tun(4).
>> 
>> The diff below replaces `tun' with `pppac' in the npppd(8) man page, and
>> adds `pppac' to npppd.conf(5).
>> 
> 
> hi!
> 
> this isn't warranted, because pppx is already listed and it's the same
> page. if a page has multiple names, we'd only list it once.
> 
> i'm already asking about the removal of the tun reference from npppd.
> i'll do that after confirmation...

I think you can still use tun(4) with npppd, you just can't use it with pipex 
cos that code was ripped out and moved to pppac. I'm not convinced this is 
worth mentioning though, cos you can also use pppac without pipex.

> 
> jmc
> 
>> Index: usr.sbin/npppd/npppd/npppd.8
>> ===
>> RCS file: /cvs/src/usr.sbin/npppd/npppd/npppd.8,v
>> retrieving revision 1.6
>> diff -u -p -r1.6 npppd.8
>> --- usr.sbin/npppd/npppd/npppd.8 9 Nov 2015 01:14:22 -   1.6
>> +++ usr.sbin/npppd/npppd/npppd.8 12 Jun 2020 18:48:00 -
>> @@ -78,7 +78,7 @@ configuration file.
>> .Xr gre 4 ,
>> .Xr pipex 4 ,
>> .Xr pppx 4 ,
>> -.Xr tun 4 ,
>> +.Xr pppac 4 ,
>> .Xr npppd.conf 5 ,
>> .Xr npppctl 8 ,
>> .Xr sysctl 8
>> Index: usr.sbin/npppd/npppd/npppd.conf.5
>> ===
>> RCS file: /cvs/src/usr.sbin/npppd/npppd/npppd.conf.5,v
>> retrieving revision 1.27
>> diff -u -p -r1.27 npppd.conf.5
>> --- usr.sbin/npppd/npppd/npppd.conf.523 Apr 2020 21:10:54 -  
>> 1.27
>> +++ usr.sbin/npppd/npppd/npppd.conf.512 Jun 2020 18:48:00 -
>> @@ -691,6 +691,7 @@ bind tunnel from L2TP authenticated by L
>> .Sh SEE ALSO
>> .Xr pipex 4 ,
>> .Xr pppx 4 ,
>> +.Xr pppac 4 ,
>> .Xr npppctl 8 ,
>> .Xr npppd 8 ,
>> .Xr sysctl 8
>> 
> 



Re: powerpc64: ldbrx/stdbrx for endian.h?

2020-06-08 Thread David Gwynne



> On 8 Jun 2020, at 9:28 pm, Christian Weisgerber  wrote:
> 
> powerpc has byte-swapping 16 and 32-bit load/stores and we use those
> in .
> 
> Starting with POWER7 (Power ISA v.2.06), there are also corresponding
> 64-bit instructions.  Do we want to use those on powerpc64?  Or do
> we want to keep compatibility with older processors?

I'm ok with using the instructions. I can't think of what benefit compat in 
this space would actually provide.

Did they also happen to add opcodes for doing swaps in registers?

dlg

> Index: arch/powerpc64/include/endian.h
> ===
> RCS file: /cvs/src/sys/arch/powerpc64/include/endian.h,v
> retrieving revision 1.1
> diff -u -p -r1.1 endian.h
> --- arch/powerpc64/include/endian.h   16 May 2020 17:11:14 -  1.1
> +++ arch/powerpc64/include/endian.h   8 Jun 2020 11:16:33 -
> @@ -36,7 +36,7 @@ __mswap16(volatile const __uint16_t *m)
> 
>   __asm("lhbrx %0, 0, %1"
>   : "=r" (v)
> -: "r" (m), "m" (*m));
> + : "r" (m), "m" (*m));
> 
>   return (v);
> }
> @@ -48,7 +48,7 @@ __mswap32(volatile const __uint32_t *m)
> 
>   __asm("lwbrx %0, 0, %1"
>   : "=r" (v)
> -: "r" (m), "m" (*m));
> + : "r" (m), "m" (*m));
> 
>   return (v);
> }
> @@ -56,11 +56,11 @@ __mswap32(volatile const __uint32_t *m)
> static inline __uint64_t
> __mswap64(volatile const __uint64_t *m)
> {
> - __uint32_t *a = (__uint32_t *)m;
>   __uint64_t v;
> 
> - v = (__uint64_t)__mswap32(a + 1) << 32 |
> - (__uint64_t)__mswap32(a);
> + __asm("ldbrx %0, 0, %1"
> + : "=r" (v)
> + : "r" (m), "m" (*m));
> 
>   return (v);
> }
> @@ -84,10 +84,9 @@ __swapm32(volatile __uint32_t *m, __uint
> static inline void
> __swapm64(volatile __uint64_t *m, __uint64_t v)
> {
> - __uint32_t *a = (__uint32_t *)m;
> -
> - __swapm32(a + 1, v >> 32);
> - __swapm32(a, v);
> + __asm("stdbrx %1, 0, %2"
> + : "=m" (*m)
> + : "r" (v), "r" (m));
> }
> 
> #define __HAVE_MD_SWAPIO
> -- 
> Christian "naddy" Weisgerber  na...@mips.inka.de
> 
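For what it's worth, the semantics the byte-reversed load/store instructions provide (and that the old two-lwbrx composition emulated for 64 bits on big-endian powerpc) is plain reversal of the in-memory byte order. A portable-C sketch of that, not the MD code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * portable equivalent of what ldbrx does: load eight bytes and
 * return them in the reverse of the host's byte order.
 */
static uint64_t
mswap64_c(const uint64_t *m)
{
	unsigned char in[8], out[8];
	uint64_t v;
	int i;

	memcpy(in, m, sizeof(in));
	for (i = 0; i < 8; i++)
		out[i] = in[7 - i];
	memcpy(&v, out, sizeof(v));

	return (v);
}

/* ...and stdbrx: store the value with its bytes reversed */
static void
swapm64_c(uint64_t *m, uint64_t v)
{
	uint64_t r = mswap64_c(&v);

	memcpy(m, &r, sizeof(r));
}
```

Numerically this is just a 64-bit byteswap, whatever the host endianness.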



unfinished diff to let pcidump print VPD info

2020-06-07 Thread David Gwynne
I was looking for something in a device's VPD info and got this far
before finding out it wasn't there. I'm a bit over it now though.

PCI VPD stuff is weird because the capability doesn't contain the data
itself; it's a register that lets you communicate with whatever is
storing the data on the device. Rather than let userland do the
necessary writes to the register for that communication, this adds
another PCI ioctl so the kernel can do it on userland's behalf. This also
prevents userland from updating or writing to the VPD info on the
device.

It looks a bit like this:

dlg@ix pcidump$ sudo ./obj/pcidump -v 0:3:0  
 0:3:0: Intel 82599
0x: Vendor ID: 8086, Product ID: 10fb
0x0004: Command: 0107, Status: 0010
0x0008: Class: 02 Network, Subclass: 00 Ethernet,
Interface: 00, Revision: 01
0x000c: BIST: 00, Header Type: 00, Latency Timer: 00,
Cache Line Size: 10
0x0010: BAR mem 64bit addr: 0xfea8/0x0008
0x0018: BAR io addr: 0xc0a0/0x0020
0x001c: BAR empty ()
0x0020: BAR mem 64bit addr: 0xfebd/0x4000
0x0028: Cardbus CIS: 
0x002c: Subsystem Vendor ID: 1028 Product ID: 1f72
0x0030: Expansion ROM Base Address: 
0x0038: 
0x003c: Interrupt Pin: 02 Line: 0a Min Gnt: 00 Max Lat: 00
0x0040: Capability 0x01: Power Management
State: D0
0x0050: Capability 0x05: Message Signalled Interrupts (MSI)
Enabled: yes
0x0070: Capability 0x11: Extended Message Signalled Interrupts (MSI-X)
Enabled: no; table size 64 (BAR 4:0)
0x00a0: Capability 0x10: PCI Express
0x0100: Enhanced Capability 0x01: Advanced Error Reporting
0x0140: Enhanced Capability 0x03: Device Serial Number
Serial Number: 02deffad
0x00e0: Capability 0x03: Vital Product Data (VPD)
Product Name: X520 10GbE Controller
PN: G61346
MN: 1028
V0: FFV14.5.8
V1: DSV1028VPDR.VER1.0
V3: DTINIC
V4: DCM10010081D521010081D5
V5: NPY2
V6: PMT12345678
V7: NMVIntel Corp
RV: \M-z

If anyone thinks there's value in doing more work on it, please feel
free.

dlg
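For anyone picking this up: the keywords in the output above (PN, V0, ..., RV) live in the VPD-R resource as a flat list of two keyword bytes, a one-byte length, then the data. A minimal userland sketch of walking such a list (vpd_find() is invented for the example, ignores the surrounding resource-tag framing, and assumes outlen is at least 1):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * walk a VPD-R style keyword list: two keyword bytes, a one-byte
 * length, then the data.  copies the value for kw into out as a
 * C string.  returns 0 on success, -1 if the keyword isn't found.
 */
static int
vpd_find(const uint8_t *buf, size_t len, const char kw[2],
    char *out, size_t outlen)
{
	size_t i = 0, n;
	uint8_t dlen;

	while (i + 3 <= len) {
		dlen = buf[i + 2];
		if (i + 3 + dlen > len)
			break;
		if (buf[i] == kw[0] && buf[i + 1] == kw[1]) {
			n = dlen < outlen - 1 ? dlen : outlen - 1;
			memcpy(out, buf + i + 3, n);
			out[n] = '\0';
			return (0);
		}
		i += 3 + dlen;
	}

	return (-1);
}
```

A pcidump built on the proposed ioctl would do something like this over the bytes PCIOCGETVPD hands back.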

Index: sys/dev/pci/pci.c
===
RCS file: /cvs/src/sys/dev/pci/pci.c,v
retrieving revision 1.115
diff -u -p -r1.115 pci.c
--- sys/dev/pci/pci.c   15 Jan 2020 14:01:19 -  1.115
+++ sys/dev/pci/pci.c   8 Jun 2020 03:13:05 -
@@ -1031,10 +1031,11 @@ pci_vpd_read(pci_chipset_tag_t pc, pcita
int ofs, i, j;
 
KASSERT(data != NULL);
-   KASSERT((offset + count) < 0x7fff);
+	if ((offset + count) >= PCI_VPD_ADDRESS_MASK)
+   return (EINVAL);
 
	if (pci_get_capability(pc, tag, PCI_CAP_VPD, &ofs, &reg) == 0)
-   return (1);
+   return (ENXIO);
 
for (i = 0; i < count; offset += sizeof(*data), i++) {
reg &= 0x;
@@ -1049,7 +1050,7 @@ pci_vpd_read(pci_chipset_tag_t pc, pcita
j = 0;
do {
if (j++ == 20)
-   return (1);
+   return (EIO);
delay(4);
reg = pci_conf_read(pc, tag, ofs);
} while ((reg & PCI_VPD_OPFLAG) == 0);
@@ -1203,6 +1204,7 @@ pciioctl(dev_t dev, u_long cmd, caddr_t 
break;
case PCIOCGETROMLEN:
case PCIOCGETROM:
+   case PCIOCGETVPD:
break;
case PCIOCGETVGA:
case PCIOCSETVGA:
@@ -1366,6 +1368,35 @@ pciioctl(dev_t dev, u_long cmd, caddr_t 
 
fail:
rom->pr_romlen = PCI_ROM_SIZE(mask);
+   break;
+   }
+
+   case PCIOCGETVPD: {
+   struct pci_vpd_req *pv = (struct pci_vpd_req *)data;
+   pcireg_t *data;
+   size_t len;
+   int s;
+
+   CTASSERT(sizeof(*data) == sizeof(*pv->pv_data));
+
+   data = mallocarray(pv->pv_count, sizeof(*data), M_TEMP,
+   M_WAITOK|M_CANFAIL);
+   if (data == NULL) {
+   error = ENOMEM;
+   break;
+   }
+
+   s = splhigh();
+   error = pci_vpd_read(pc, tag, pv->pv_offset, pv->pv_count,
+   data);
+   splx(s);
+
+   len = pv->pv_count * sizeof(*pv->pv_data);
+
+   if (error == 0)
+   error = copyout(data, pv->pv_data, len);
+
+   free(data, M_TEMP, len);
break;
}
 
Index: sys/sys/pciio.h

shuffle ix(4) code a bit more for multiq operation

2020-06-04 Thread David Gwynne
this builds on the work mpi has been doing to prepare ix for multiq
operation. the main change is to call if_attach_queues and
if_attach_iqueues to allocate an ifq and ifiq per tx ring and rx
ring respectivly, and then tie them together. ix rx rings deliver
packets into ifiqs, and ifqs push packets onto ix tx rings.

another change is to make the rx refill timeout per rx ring instead of
per interface, and to make it only schedule the timeout when the ring is
completely empty to avoid races with normal rx ring operation.

the return value from ifiq_input is also used to tell the if_rxr stuff to
back off. apart from that, there's just a bit of deck chair shuffling.

there should be no functional change from this because we still only set
up one rx and tx ring per interface at the moment.

Index: if_ix.c
===
RCS file: /cvs/src/sys/dev/pci/if_ix.c,v
retrieving revision 1.165
diff -u -p -r1.165 if_ix.c
--- if_ix.c 24 Apr 2020 08:50:23 -  1.165
+++ if_ix.c 5 Jun 2020 03:11:57 -
@@ -147,7 +147,7 @@ voidixgbe_enable_intr(struct ix_softc *
 void   ixgbe_disable_intr(struct ix_softc *);
 void   ixgbe_update_stats_counters(struct ix_softc *);
 intixgbe_txeof(struct tx_ring *);
-intixgbe_rxeof(struct ix_queue *);
+intixgbe_rxeof(struct rx_ring *);
 void   ixgbe_rx_checksum(uint32_t, struct mbuf *, uint32_t);
 void   ixgbe_iff(struct ix_softc *);
 #ifdef IX_DEBUG
@@ -248,7 +248,6 @@ ixgbe_attach(struct device *parent, stru
 
/* Set up the timer callout */
	timeout_set(&sc->timer, ixgbe_local_timer, sc);
-	timeout_set(&sc->rx_refill, ixgbe_rxrefill, sc);
 
/* Determine hardware revision */
ixgbe_identify_hardware(sc);
@@ -378,7 +377,6 @@ ixgbe_detach(struct device *self, int fl
if_detach(ifp);
 
	timeout_del(&sc->timer);
-	timeout_del(&sc->rx_refill);
ixgbe_free_pci_resources(sc);
 
ixgbe_free_transmit_structures(sc);
@@ -404,13 +402,11 @@ ixgbe_start(struct ifqueue *ifq)
 {
struct ifnet*ifp = ifq->ifq_if;
struct ix_softc *sc = ifp->if_softc;
-   struct tx_ring  *txr = sc->tx_rings;
+   struct tx_ring  *txr = ifq->ifq_softc;
struct mbuf *m_head;
unsigned int head, free, used;
int  post = 0;
 
-   if (!(ifp->if_flags & IFF_RUNNING) || ifq_is_oactive(ifq))
-   return;
if (!sc->link_up)
return;
 
@@ -858,7 +854,8 @@ ixgbe_init(void *arg)
 
/* Now inform the stack we're ready */
ifp->if_flags |= IFF_RUNNING;
-	ifq_clr_oactive(&ifp->if_snd);
+   for (i = 0; i < sc->num_queues; i++)
+   ifq_clr_oactive(ifp->if_ifqs[i]);
 
splx(s);
 }
@@ -1030,17 +1027,13 @@ ixgbe_queue_intr(void *vque)
struct ix_queue *que = vque;
struct ix_softc *sc = que->sc;
	struct ifnet	*ifp = &sc->arpcom.ac_if;
-   struct tx_ring  *txr = sc->tx_rings;
+   struct rx_ring  *rxr = que->rxr;
+   struct tx_ring  *txr = que->txr;
 
if (ISSET(ifp->if_flags, IFF_RUNNING)) {
-   ixgbe_rxeof(que);
+   ixgbe_rxeof(rxr);
ixgbe_txeof(txr);
-   if (ixgbe_rxfill(que->rxr)) {
-   /* Advance the Rx Queue "Tail Pointer" */
-		IXGBE_WRITE_REG(&sc->hw, IXGBE_RDT(que->rxr->me),
-   que->rxr->last_desc_filled);
-   } else
-		timeout_add(&sc->rx_refill, 1);
+   ixgbe_rxrefill(rxr);
}
 
ixgbe_enable_queue(sc, que->msix);
@@ -1059,6 +1052,7 @@ ixgbe_legacy_intr(void *arg)
 {
struct ix_softc *sc = (struct ix_softc *)arg;
	struct ifnet	*ifp = &sc->arpcom.ac_if;
+   struct rx_ring  *rxr = sc->rx_rings;
struct tx_ring  *txr = sc->tx_rings;
int rv;
 
@@ -1069,16 +1063,9 @@ ixgbe_legacy_intr(void *arg)
}
 
if (ISSET(ifp->if_flags, IFF_RUNNING)) {
-   struct ix_queue *que = sc->queues;
-
-   ixgbe_rxeof(que);
+   ixgbe_rxeof(rxr);
ixgbe_txeof(txr);
-   if (ixgbe_rxfill(que->rxr)) {
-   /* Advance the Rx Queue "Tail Pointer" */
-		IXGBE_WRITE_REG(&sc->hw, IXGBE_RDT(que->rxr->me),
-   que->rxr->last_desc_filled);
-   } else
-		timeout_add(&sc->rx_refill, 1);
+   ixgbe_rxrefill(rxr);
}
 
ixgbe_enable_queues(sc);
@@ -1103,7 +1090,6 @@ ixgbe_intr(struct ix_softc *sc)
KERNEL_LOCK();
ixgbe_update_link_status(sc);
KERNEL_UNLOCK();
-	ifq_start(&ifp->if_snd);
}
 
if (hw->mac.type != ixgbe_mac_82598EB) {
@@ -1604,6 +1590,7 @@ ixgbe_stop(void *arg)
 {
struct ix_softc *sc = arg;
struct ifnet   *ifp = 

Re: cpu_rnd_messybits() for alpha, powerpc, sparc64

2020-06-04 Thread David Gwynne



> On 5 Jun 2020, at 4:33 am, Christian Weisgerber  wrote:
> 
> Here's a proposal for implementing cpu_rnd_messybits() as a read of
> the cycle counter on alpha, powerpc, and sparc64.  Since I don't have
> those archs, the diff is not even compile-tested.
> 
> * alpha: RPCC is a 32-bit counter (in a 64-bit register)
> * powerpc: TB is a 64-bit counter split into two registers
> * sparc64: TICK is a(n implementation-defined, up to) 63-bit counter

ok by me.

> 
> 
> Index: sys/arch/alpha/alpha/machdep.c
> ===
> RCS file: /cvs/src/sys/arch/alpha/alpha/machdep.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 machdep.c
> --- sys/arch/alpha/alpha/machdep.c31 May 2020 06:23:56 -  1.191
> +++ sys/arch/alpha/alpha/machdep.c4 Jun 2020 17:57:45 -
> @@ -1854,12 +1854,3 @@ alpha_XXX_dmamap(v)
> /* XXX */
>   return (vtophys(v) | alpha_XXX_dmamap_or);  /* XXX */
> } /* XXX */
> /* XXX XXX END XXX XXX */
> -
> -unsigned int
> -cpu_rnd_messybits(void)
> -{
> - struct timespec ts;
> -
> -	nanotime(&ts);
> - return (ts.tv_nsec ^ (ts.tv_sec << 20));
> -}
> Index: sys/arch/alpha/include/cpu.h
> ===
> RCS file: /cvs/src/sys/arch/alpha/include/cpu.h,v
> retrieving revision 1.62
> diff -u -p -r1.62 cpu.h
> --- sys/arch/alpha/include/cpu.h  31 May 2020 06:23:56 -  1.62
> +++ sys/arch/alpha/include/cpu.h  4 Jun 2020 17:59:25 -
> @@ -288,7 +288,11 @@ do { 
> \
>  */
> #define   cpu_number()alpha_pal_whami()
> 
> -unsigned int cpu_rnd_messybits(void);
> +static inline unsigned int
> +cpu_rnd_messybits(void)
> +{
> + return alpha_rpcc();
> +}
> 
> /*
>  * Arguments to hardclock and gatherstats encapsulate the previous
> Index: sys/arch/macppc/macppc/machdep.c
> ===
> RCS file: /cvs/src/sys/arch/macppc/macppc/machdep.c,v
> retrieving revision 1.191
> diff -u -p -r1.191 machdep.c
> --- sys/arch/macppc/macppc/machdep.c  31 May 2020 06:23:57 -  1.191
> +++ sys/arch/macppc/macppc/machdep.c  4 Jun 2020 18:07:31 -
> @@ -913,12 +913,3 @@ cpu_switchto(struct proc *oldproc, struc
> 
>   cpu_switchto_asm(oldproc, newproc);
> }
> -
> -unsigned int
> -cpu_rnd_messybits(void)
> -{
> - struct timespec ts;
> -
> -	nanotime(&ts);
> - return (ts.tv_nsec ^ (ts.tv_sec << 20));
> -}
> Index: sys/arch/powerpc/include/cpu.h
> ===
> RCS file: /cvs/src/sys/arch/powerpc/include/cpu.h,v
> retrieving revision 1.67
> diff -u -p -r1.67 cpu.h
> --- sys/arch/powerpc/include/cpu.h31 May 2020 06:23:58 -  1.67
> +++ sys/arch/powerpc/include/cpu.h4 Jun 2020 18:13:07 -
> @@ -161,7 +161,15 @@ extern int ppc_nobat;
> 
> void  cpu_bootstrap(void);
> 
> -unsigned int cpu_rnd_messybits(void);
> +static inline unsigned int
> +cpu_rnd_messybits(void)
> +{
> + unsigned int hi, lo;
> +
> + __asm volatile("mftbu %0; mftb %1" : "=r" (hi), "=r" (lo));
> +
> + return (hi ^ lo);
> +}
> 
> /*
>  * This is used during profiling to integrate system time.
> Index: sys/arch/sparc64/include/cpu.h
> ===
> RCS file: /cvs/src/sys/arch/sparc64/include/cpu.h,v
> retrieving revision 1.94
> diff -u -p -r1.94 cpu.h
> --- sys/arch/sparc64/include/cpu.h31 May 2020 06:23:58 -  1.94
> +++ sys/arch/sparc64/include/cpu.h4 Jun 2020 18:05:18 -
> @@ -211,7 +211,15 @@ void cpu_unidle(struct cpu_info *);
> #define curpcb__curcpu->ci_cpcb
> #define fpproc__curcpu->ci_fpproc
> 
> -unsigned int cpu_rnd_messybits(void);
> +static inline unsigned int
> +cpu_rnd_messybits(void)
> +{
> + u_int64_t tick;
> +
> + __asm volatile("rd %%tick, %0" : "=r" (tick) :);
> +
> + return ((tick >> 32) ^ tick);
> +}
> 
> /*
>  * On processors with multiple threads we force a thread switch.
> Index: sys/arch/sparc64/sparc64/machdep.c
> ===
> RCS file: /cvs/src/sys/arch/sparc64/sparc64/machdep.c,v
> retrieving revision 1.196
> diff -u -p -r1.196 machdep.c
> --- sys/arch/sparc64/sparc64/machdep.c31 May 2020 06:23:58 -  
> 1.196
> +++ sys/arch/sparc64/sparc64/machdep.c4 Jun 2020 18:01:16 -
> @@ -2114,12 +2114,3 @@ blink_led_timeout(void *vsc)
>   t = (((averunnable.ldavg[0] + FSCALE) * hz) >> (FSHIFT + 1));
>   timeout_add(&sc->bls_to, t);
> }
> -
> -unsigned int
> -cpu_rnd_messybits(void)
> -{
> - struct timespec ts;
> -
> - nanotime(&ts);
> - return (ts.tv_nsec ^ (ts.tv_sec << 20));
> -}
> -- 
> Christian "naddy" Weisgerber  

Re: fix pppx(4) with net/ifq.c rev 1.38

2020-05-30 Thread David Gwynne



> On 30 May 2020, at 9:43 pm, Vitaliy Makkoveev  
> wrote:
> 
> 
>> On 30 May 2020, at 09:40, David Gwynne  wrote:
>> 
>> On Mon, May 25, 2020 at 09:44:22AM +0200, Martin Pieuchot wrote:
>>> On 23/05/20(Sat) 15:38, Vitaliy Makkoveev wrote:
>>>>> On 23 May 2020, at 12:54, Martin Pieuchot  wrote:
>>>>> On 22/05/20(Fri) 13:25, Vitaliy Makkoveev wrote:
>>>>>> On Fri, May 22, 2020 at 07:57:13AM +1000, David Gwynne wrote:
>>>>>>> [...] 
>>>>>>> can you try the following diff?
>>>>>>> 
>>>>>> 
>>>>>> I tested this diff and it works for me. But the problem I pointed is
>>>>>> about pipex(4) locking.
>>>>>> 
>>>>>> pipex(4) requires NET_LOCK() be grabbed not only for underlying
>>>>>> ip{,6}_output() but for itself too. But since pppac_start() has
>>>>>> unpredictable behavior I suggested to make it predictable [1].
>>>>> 
>>>>> What needs the NET_LOCK() in their?  We're talking about
>>>>> pipex_ppp_output(), right?  Does it really need the NET_LOCK() or
>>>>> the KERNEL_LOCK() is what protects those data structures?
>>>> 
>>>> Yes, about pipex_ppp_output() and pipex_output(). Except
>>>> ip{,6}_output() nothing requires NET_LOCK(). As David Gwynne pointed,
>>>> they can be replaced by ip{,6}_send().
>>> 
>>> Locks protect data structures, you're talking about functions, which
>>> data structures are serialized by this lock?  I'm questioning whether
>>> there is one.
>>> 
>>>> [...]
>>>>> In case of pipex(4) it isn't clear that the NET_LOCK() is necessary.
>>>> 
>>>> I guess, pipex(4) was wrapped by NET_LOCK() to protect it while it's
>>>> accessed through `pr_input'. Is NET_LOCK() required for this case?
>>> 
>>> pipex(4) like all the network stack has been wrapped in the NET_LOCK()
>>> because it was easy to do.  That means it isn't a conscious decision or
>>> design.  The fact that pipex(4) code runs under the NET_LOCK() is a side
>>> effect of how the rest of the stack evolved.  I'm questioning whether
>>> this lock is required there.  In theory it shouldn't.  What is the
>>> reality?
>> 
>> pipex and pppx pre-date the NET_LOCK, which means you can assume
>> that any implicit locking was and is done by the KERNEL_LOCK. mpi is
>> asking the right questions here.
>> 
>> As for the ifq maxlen difference between pppx and pppac, that's more
>> about when and how quickly they were written more than anything else.
>> The IFQ_SET_MAXLEN(&ifp->if_snd, 1) in pppx is because that's a way to
>> bypass transmit mitigation for pseudo/virtual interfaces. That was the
>> only way to do it historically. It is not an elegant hack to keep
>> hold of the NET_LOCK over a call to a start routine.
>> 
>> As a rule of thumb, network interface drivers should not (maybe
>> cannot) rely on the NET_LOCK in their if_start handlers. To be
>> clear, they should not rely on it being held by the network stack
>> when if_start is called because sometimes the stack calls it without
>> holding NET_LOCK, and they should not take it because they might
>> be called by the stack when it is being held.
>> 
>> Also, be aware that the ifq machinery makes sure that the start
>> routine is not called concurrently or recursively. You can queue
>> packets for transmission on an ifq from anywhere in the kernel at
>> any time, but only one cpu will run the start routine. Other cpus
>> can queue packets while another one is running if_start, but the
>> first one ends up responsible for trying to transmit it.
>> 
>> ifqs also take the KERNEL_LOCK before calling if_start if the interface
>> is not marked as IFXF_MPSAFE.
>> 
>> The summary is that pppx and pppac are not marked as mpsafe so their
>> start routines are called with KERNEL_LOCK held. Currently pppx
>> accidentally gets NET_LOCK because of the IFQ_SET_MAXLEN, but shouldn't
>> rely on it.
>> 
>> Cheers,
>> dlg
>> 
> 
> Thanks for explanation.
> Will you commit diff you posted in this thread?

Yes, I'm doing that now.

Thanks for testing it btw.

dlg


Re: fix pppx(4) with net/ifq.c rev 1.38

2020-05-30 Thread David Gwynne
On Mon, May 25, 2020 at 09:44:22AM +0200, Martin Pieuchot wrote:
> On 23/05/20(Sat) 15:38, Vitaliy Makkoveev wrote:
> > > On 23 May 2020, at 12:54, Martin Pieuchot  wrote:
> > > On 22/05/20(Fri) 13:25, Vitaliy Makkoveev wrote:
> > >> On Fri, May 22, 2020 at 07:57:13AM +1000, David Gwynne wrote:
> > >>> [...] 
> > >>> can you try the following diff?
> > >>> 
> > >> 
> > >> I tested this diff and it works for me. But the problem I pointed is
> > >> about pipex(4) locking.
> > >> 
> > >> pipex(4) requires NET_LOCK() be grabbed not only for underlying
> > >> ip{,6}_output() but for itself too. But since pppac_start() has
> > >> unpredictable behavior I suggested to make it predictable [1].
> > > 
> > > What needs the NET_LOCK() in their?  We're talking about
> > > pipex_ppp_output(), right?  Does it really need the NET_LOCK() or
> > > the KERNEL_LOCK() is what protects those data structures?
> > 
> > Yes, about pipex_ppp_output() and pipex_output(). Except
> > ip{,6}_output() nothing requires NET_LOCK(). As David Gwynne pointed,
> > they can be replaced by ip{,6}_send().
> 
> Locks protect data structures, you're talking about functions, which
> data structures are serialized by this lock?  I'm questioning whether
> there is one.
> 
> > [...]
> > > In case of pipex(4) it isn't clear that the NET_LOCK() is necessary.
> > 
> > I guess, pipex(4) was wrapped by NET_LOCK() to protect it while it's
> > accessed through `pr_input'. Is NET_LOCK() required for this case?
> 
> pipex(4) like all the network stack has been wrapped in the NET_LOCK()
> because it was easy to do.  That means it isn't a conscious decision or
> design.  The fact that pipex(4) code runs under the NET_LOCK() is a side
> effect of how the rest of the stack evolved.  I'm questioning whether
> this lock is required there.  In theory it shouldn't.  What is the
> reality?

pipex and pppx pre-date the NET_LOCK, which means you can assume
that any implicit locking was and is done by the KERNEL_LOCK. mpi is
asking the right questions here.

As for the ifq maxlen difference between pppx and pppac, that's more
about when and how quickly they were written more than anything else.
The IFQ_SET_MAXLEN(&ifp->if_snd, 1) in pppx is because that's a way to
bypass transmit mitigation for pseudo/virtual interfaces. That was the
only way to do it historically. It is not an elegant hack to keep
hold of the NET_LOCK over a call to a start routine.

As a rule of thumb, network interface drivers should not (maybe
cannot) rely on the NET_LOCK in their if_start handlers. To be
clear, they should not rely on it being held by the network stack
when if_start is called because sometimes the stack calls it without
holding NET_LOCK, and they should not take it because they might
be called by the stack when it is being held.

Also, be aware that the ifq machinery makes sure that the start
routine is not called concurrently or recursively. You can queue
packets for transmission on an ifq from anywhere in the kernel at
any time, but only one cpu will run the start routine. Other cpus
can queue packets while another one is running if_start, but the
first one ends up responsible for trying to transmit it.
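
The serialization described above can be modelled in userland. This is a toy sketch with made-up names (toyq, toyq_enqueue), not the actual net/ifq.c code: any thread may enqueue, but only the thread that claims the busy flag runs the "start routine", and it drains entries other threads added while it was transmitting.

```c
#include <pthread.h>

#define TOYQ_LEN 64

struct toyq {
	pthread_mutex_t	 mtx;
	int		 busy;		/* the "serializer" flag */
	int		 q[TOYQ_LEN];
	unsigned int	 head, tail;
	long		 sum;		/* stand-in for transmitted work */
};

/*
 * Enqueue an entry and try to become the serializer.  If another
 * thread is already in the "start routine", leave the entry on the
 * queue; the running thread will pick it up before clearing busy.
 */
static void
toyq_enqueue(struct toyq *tq, int v)
{
	pthread_mutex_lock(&tq->mtx);
	tq->q[tq->tail++ % TOYQ_LEN] = v;
	if (tq->busy) {
		/* someone else is in "if_start"; it will see our entry */
		pthread_mutex_unlock(&tq->mtx);
		return;
	}
	tq->busy = 1;
	while (tq->head != tq->tail) {
		int w = tq->q[tq->head++ % TOYQ_LEN];

		/* "transmit" outside the lock, like a start routine */
		pthread_mutex_unlock(&tq->mtx);
		tq->sum += w;
		pthread_mutex_lock(&tq->mtx);
	}
	tq->busy = 0;
	pthread_mutex_unlock(&tq->mtx);
}
```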

ifqs also take the KERNEL_LOCK before calling if_start if the interface
is not marked as IFXF_MPSAFE.

The summary is that pppx and pppac are not marked as mpsafe so their
start routines are called with KERNEL_LOCK held. Currently pppx
accidentally gets NET_LOCK because of the IFQ_SET_MAXLEN, but shouldn't
rely on it.

Cheers,
dlg



symmetric toeplitz hashing

2020-05-28 Thread David Gwynne
This is another bit of the puzzle for supporting multiple rx rings
and receive side scaling (RSS) on nics. It borrows heavily from
DragonflyBSD, but I've made some tweaks on the way.

For background on the dfly side, I recommend having a look at
https://leaf.dragonflybsd.org/~sephe/AsiaBSDCon%20-%20Dfly.pdf.

From my point of view, the interesting thing is that they came up
with a way to use Toeplitz hashing so the kernel AND network
interfaces hash packets in both directions onto the same bucket.
The other interesting thing is that they optimised the hash
calculation by building a cache of all the intermediate results
possible for each input byte. Their hash calculation is simply
xoring these intermediate results together.

I've made some tweaks compared to dfly for how the caching is
calculated and used, so it's not an exactly 1:1 port of the dfly
code. If anyone is interested in the tweaks, let me know.

So this diff adds an API for the kernel to use for calculating a
hash for ip addresses and ports, and adds a function for network
drivers to call that gives them a key to use with RSS. If all drivers
use the same key, then the same flows should be steered to the same
place when they enter the network stack regardless of which hardware
they came in on.

I've tested it with vmx(4) and some quick and dirty hacks to the
network stack (and with some magical observability), and can see
things like tcpbench push packets onto the same numbered ifq/txring
that the "nic" picks for the rxring and therefore ifiq into the
stack. We're going to try it on some more drivers soon.

The way this is set up now, if a nic driver wants to do RSS, you
add stoeplitz as a dependency in the kernel config file, which
causes this code to be included in the build.

There's some discussion to be had about the best way to integrate
this on the IP stack side, but that is about where this API is
called from, not the internals of it per se.

Thoughts? ok?
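
A minimal, uncached version of the hash makes the symmetry easy to see. This is a sketch with made-up names and an arbitrary 16-bit pattern (0x6d5a); it is not the real stoeplitz code or the key stoeplitz_to_key() hands out, and the cached version in the diff precomputes the inner loop per input byte:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Plain Toeplitz producing a 16-bit hash.  The key is a single
 * 16-bit pattern repeated over the whole input, so the key window
 * for a bit depends only on its position mod 16.
 */
static uint16_t
toeplitz16(const uint8_t *buf, size_t len)
{
	const uint16_t key = 0x6d5a;	/* example pattern only */
	const uint32_t win = ((uint32_t)key << 16) | key;
	uint16_t hash = 0;
	size_t i;
	unsigned int j;

	for (i = 0; i < len; i++) {
		for (j = 0; j < 8; j++) {
			if (buf[i] & (0x80 >> j)) {
				/* 16-bit key window at bit 8*i+j */
				unsigned int off = (8 * i + j) % 16;

				hash ^= (uint16_t)(win >> (16 - off));
			}
		}
	}

	return (hash);
}
```

Because the key repeats every 16 bits, any two fields at 16-bit aligned offsets (addresses, ports) can be swapped without changing the result, which is the property that lets the stack and the nic agree on a flow's bucket in both directions.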

Index: share/man/man9/stoeplitz_to_key.9
===
RCS file: share/man/man9/stoeplitz_to_key.9
diff -N share/man/man9/stoeplitz_to_key.9
--- /dev/null   1 Jan 1970 00:00:00 -
+++ share/man/man9/stoeplitz_to_key.9   29 May 2020 04:01:26 -
@@ -0,0 +1,126 @@
+.\" $OpenBSD$
+.\"
+.\" Copyright (c) 2020 David Gwynne 
+.\"
+.\" Permission to use, copy, modify, and distribute this software for any
+.\" purpose with or without fee is hereby granted, provided that the above
+.\" copyright notice and this permission notice appear in all copies.
+.\"
+.\" THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
+.\" WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
+.\" MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
+.\" ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+.\" WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+.\" ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
+.\" OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
+.\"
+.Dd $Mdocdate: May 29 2020 $
+.Dt STOEPLITZ_TO_KEY 9
+.Os
+.Sh NAME
+.Nm stoeplitz_to_key ,
+.Nm stoeplitz_hash_ip4 ,
+.Nm stoeplitz_hash_ip4port ,
+.Nm stoeplitz_hash_ip6 ,
+.Nm stoeplitz_hash_ip6port
+.Nd Symmetric Toeplitz Hash API
+.Sh SYNOPSIS
+.In net/toeplitz.h
+.Ft void
+.Fn stoeplitz_to_key "uint8_t *key" "size_t keylen"
+.Ft uint16_t
+.Fo stoeplitz_hash_ip4
+.Fa "uint32_t srcaddr"
+.Fa "uint32_t dstaddr"
+.Fc
+.Ft uint16_t
+.Fo stoeplitz_hash_ip4port
+.Fa "uint32_t srcaddr"
+.Fa "uint32_t dstaddr"
+.Fa "uint16_t srcport"
+.Fa "uint16_t dstport"
+.Fc
+.Ft uint16_t
+.Fo stoeplitz_hash_ip6
+.Fa "const struct in6_addr *srcaddr"
+.Fa "const struct in6_addr *dstaddr"
+.Fc
+.Ft uint16_t
+.Fo stoeplitz_hash_ip6port
+.Fa "const struct in6_addr *srcaddr"
+.Fa "const struct in6_addr *dstaddr"
+.Fa "uint16_t srcport"
+.Fa "uint16_t dstport"
+.Fc
+.Sh DESCRIPTION
+The Toeplitz hash algorithm is commonly used by network interface
+controllers to generate a short hash based on the value of fields
+in network packet headers.
+.\" mention RSS?
+The resulting hash value can be used as a flow identifier, which
+in turn can be used to consistently select a context for processing
+packets using those fields.
+Traditionally, the Toeplitz hash produces different results depending
+on the order of inputs, ie, adding port 80 then 1234 as inputs would
+produce a different result to hashing port 1234 then 80.
+.Pp
+The symmetric Toeplitz API uses a key selected to generate the same
+hash result regardless of the order the inputs were added.
+The API also supports producing Toeplitz hash keys for use by
+network interface

Re: WireGuard patchset for OpenBSD, rev. 2

2020-05-27 Thread David Gwynne
On Tue, May 26, 2020 at 05:42:13PM -0600, Jason A. Donenfeld wrote:
> On Tue, May 26, 2020 at 4:52 PM Jason A. Donenfeld  wrote:
> > With regards to your crash, though, that's a bit more puzzling, and
> > I'd be interested to learn more details. Because these structs are
> > already naturally aligned, the __packed attribute, even with the odd
> > nesting Matt had prior, should have produced all entirely aligned
> > accesses. That makes me think your kaboom was coming from someplace
> > else. One possibility is that you were running the git tree on the two
> > days that I was playing with uint128_t, only to find out that some of
> > openbsd's patches to clang miscalculate stack sizes when they're in
> > use, so that work was shelved for another day and the commits removed;
> > perhaps you were just unlucky? Or you hit some other bug that's
> > lurking. Either way, output from ddb's `bt` would at least be useful.

When you say clang patches miscalculate the stack size with uint128_t,
do you know which one(s) specifically? Was it -msave-args?

> Do you know off hand if we're able to assume any type of alignment
> with mbuf->m_data? mtod just casts without any address fixup, which
> means if mbuf->m_data isn't aligned by some other mechanism, we're in
> trouble. But I would assume there _is_ some alignment imposed, since
> the rest of the stack appears to parse tcp headers and such directly
> without byte-by-byte copies being made.

It's probably more correct to say that payload alignment is required by
the network stack rather than imposed. However, you could argue
that's a useless distinction to make in practice.

Anyway, tl;dr, my guess is you're reading a 64bit value out of a packet,
but I'm pretty sure the stack probably only provides 32bit alignment.

The long version of the relevant things are:

- OpenBSD runs on some architectures that require strict alignment
for accessing words, and the fault handlers for unaligned accesses
in the kernel drop you into ddb. ie, you don't want to do that.

- The network stack accesses bits of packets directly. I'm pretty
sure so far this is limited to 32bit word accesses for things like
addresses in IPv4 headers, GRE protocol fields, etc. I'm not sure
there are (currently) any 64bit accesses.

- mbufs and mbuf clusters (big buffers for packet data) are allocated
out of pools which provide 64 byte alignment. The data portion of
mbufs starts after some header structures, but should remain long
aligned. At least on amd64 and sparc64 it happens to be 16 byte
aligned. Figuring it out on a 32bit arch is too much maths for me atm.

- The mbuf API tries to keep the m_data pointer in newly allocated mbufs
long word aligned.

- However, when you're building a packet and you put a header onto it,
you usually do that with m_prepend. m_prepend will put the new header
up against the existing data, unless there is no space to do that with
the current mbuf. In that latter situation it will allocate a new mbuf
and add the new header there. Because it is a new mbuf, and as per the
previous point, the new header will start on a long word boundary.

This is mostly a problem for Ethernet related stuff (14 byte headers,
yay), and has a few consequences.

One of the consequences is that the receive side of Ethernet drivers
in OpenBSD try hard to align Ethernet payloads to 4 byte boundaries.
This means they allocate an mbuf with at least their mtu + 2 bytes,
and then chop 2 bytes off the front and give that to the chip to
rx into.

So if you're on a sparc64 running one of those, let's say the driver does
this and it gets an mbuf cluster to rx into. That cluster will be 64
byte aligned, which is good. Let's also say it gets a wg packet. The
layout of the packet will be:

- 2 byte ETHER_ALIGN pad at offset 0
- 14 byte Ethernet header at offset 2
- 20 byte IP header at offset 16 (hooray)
- 8 byte UDP header at offset 36
- 4 byte wg message type + pad at offset 44
- 4 byte wg/noise receiver index at offset 48
- 8 byte wg/noise counter/none at offset 52

52 % 8 = explosion.

However, in your diff struct noise_data is __packed, so the compiler
shouldn't make assumptions about the alignment and should do a byte
by byte load. It's hard to say what's happening without a backtrace
like you suggest.

For reasons I would suggest making struct noise_data look like this:

struct noise_data {
uint32_tr_idx;
uint32_tnonce_lo;
uint32_tnonce_hi;
};

And making the load of the counter look like this:

uint64_t ctr;

ctr = lemtoh32(&nd->nonce_lo) |
((uint64_t)lemtoh32(&nd->nonce_hi) << 32);

Sorry for the rambling. Hopefully you know a bit more about mbuf
and network stack alignment now though.



Re: libpcap: allow breaking out of loop when using savefile

2020-05-26 Thread David Gwynne
I just committed this, thank you :)

dlg

> On 16 May 2020, at 05:14, Caspar Schutijser  wrote:
> 
> Hi,
> 
> Below is a patch that makes breaking out of the loop work when using
> a savefile.
> 
> The pcap_breakloop() function was backported from tcpdump.org libpcap
> to OpenBSD libpcap by djm@ on Nov 18, 2005. The bits to make
> pcap_breakloop() work were backported to pcap-bpf.c [1] but not to
> savefile.c even though tcpdump.org implemented support there too [2].
> 
> The diff below backports this piece of code to savefile.c after all.
> 
> Thanks,
> Caspar Schutijser
> 
> [1] 
> https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/lib/libpcap/pcap-bpf.c.diff?r1=1.16=1.17
> [2] 
> https://github.com/the-tcpdump-group/libpcap/commit/991d444f7116bef16893826b46f3950f62281507#diff-95d4d29d0f11145ff40b850860667a97R745
> 
> 
> Index: savefile.c
> ===
> RCS file: /cvs/src/lib/libpcap/savefile.c,v
> retrieving revision 1.16
> diff -u -p -r1.16 savefile.c
> --- savefile.c22 Dec 2015 19:51:04 -  1.16
> +++ savefile.c15 May 2020 19:03:44 -
> @@ -307,6 +307,23 @@ pcap_offline_read(pcap_t *p, int cnt, pc
>   while (status == 0) {
>   struct pcap_pkthdr h;
> 
> + /*
> +  * Has "pcap_breakloop()" been called?
> +  * If so, return immediately - if we haven't read any
> +  * packets, clear the flag and return -2 to indicate
> +  * that we were told to break out of the loop, otherwise
> +  * leave the flag set, so that the *next* call will break
> +  * out of the loop without having read any packets, and
> +  * return the number of packets we've processed so far.
> +  */
> + if (p->break_loop) {
> + if (n == 0) {
> + p->break_loop = 0;
> + return (PCAP_ERROR_BREAK);
> + } else
> + return (n);
> + }
> +
>   status = sf_next_packet(p, &h, p->buffer, p->bufsize);
>   if (status) {
>   if (status == 1)
> 



Re: ADMtec aue(4) interface supporting VLAN_MTU ?

2020-05-22 Thread David Gwynne



> On 23 May 2020, at 7:54 am, Christopher Zimmermann  wrote:
> 
> On Tue, Apr 21, 2020 at 04:12:16PM -0700, Chris Cappuccio wrote:
>> Tom Smyth [tom.sm...@wirelessconnect.eu] wrote:
>>> Hi Chrisz,
>>> 
>>> 4 bytes for the vlan header .. have you tried increasing the parent
>>> intetface mtu by 4bytes
>>> 
>> 
>> IFCAP_VLAN_MTU is a direct bypass for this. "hardmtu" on the parent interface
>> is perhaps more interesting as it will limit everything including these
>> encapsulations
> 
> IFAP_VLAN_MTU will affect how the hardmtu is passed on to vlan child 
> interfaces. The vlan interfaces won't care for the parent's "soft" mtu AFAICS:

That is correct.

> hardmtu = ifp0->if_hardmtu;
> if (!ISSET(ifp0->if_capabilities, IFCAP_VLAN_MTU))
>   hardmtu -= EVL_ENCAPLEN;
> 
> Linux uses a MTU of 1536 for Pegasus chips. We always default to a hardmtu of 
> 1500. So the hardmtu can't be the cause for my interface not managing 
> full-size vlan.

We should set the hardmtu of the Pegasus chips higher then. Does Linux include 
the Ethernet header in that MTU there?

dlg

> 
> Christopher
> 
> 
> -- 
> http://gmerlin.de
> OpenPGP: http://gmerlin.de/christopher.pub
> CB07 DA40 B0B6 571D 35E2  0DEF 87E2 92A7 13E5 DEE1
> 



Re: carp: send only IPv4 carp packets on dual stack interface

2020-05-22 Thread David Gwynne



> On 23 May 2020, at 8:44 am, Christopher Zimmermann  wrote:
> 
> On Sun, Jan 19, 2020 at 01:32:17PM +, Stuart Henderson wrote:
>> On 2020/01/19 00:11, Sebastian Benoit wrote:
>>> chr...@openbsd.org(chr...@openbsd.org) on 2020.01.18 06:18:21 +0100:
>>> > On Wed, Jan 15, 2020 at 12:47:28PM +0100, Sebastian Benoit wrote:
>>> > >Christopher Zimmermann(chr...@openbsd.org) on 2020.01.15 11:55:43 +0100:
>>> > >>Hi,
>>> > >>
>>> > >>as far as I can see a dual stack carp interface does not care whether it
>>> > >>receives advertisements addressed to IPv4 or IPv6. Any one will do.
>>> > >>So I propose to send IPv6 advertisements only when IPv4 is not possible.
>>> > >>
>>> > >>Why?
>>> > >>
>>> > >>- Noise can be reduced by using unicast advertisements.
>>> > >>  This is only possible for IPv4 by ``ifconfig carppeer``.
>>> > >>  I don't like flooding the whole network with carp advertisements when
>>> > >>  I may also unicast them.
>>> > >
>>> > >Maybe i'm getting confused, but in the problem description you were 
>>> > >talking
>>> > >about v6 vs v4, and here you argue about unicast (vs multicast?) being
>>> > >better. Thats orthogonal, isnt it?
>>> >
>>> > Yes, kind of. The point is we support ``carppeer`` for IPv4, but not for
>>> > IPv6.
>>> >
>>> > >>- breaking IPv6 connectivity (for example by running iked without -6)
>>> > >>  will start a preempt-war, because failing ip6_output will cause the
>>> > >>  demote counter to be increased. That's what hit me.
>>> > >
>>> > >But the whole point of carp is to notice broken connectivity. If you run 
>>> > >v6
>>> > >on an interface, you want to know if its working, no?
>>> >
>>> > I grant you that much. But what kind of failures do you hope to detect
>>> > on the _sending_ carp master, that would not also affect the backup?
>>> 
>>> sure: misconfigured pf. Missing routes. Buggy switch.
>> 
>> misconfigured mac address filter on switch.
> 
> I'm afraid you guys haven't yet got the point I'm trying to make.
> 
> Current behaviour is that in a dual-stack carp setup failover only happens 
> when advertisements on _both_ AFs fail to reach the backup.
> A node in backup state will stay in backup state as long as it receives _any_ 
> advertisements.
> In my mind this is the only sensible way for a backup node to react.
> 
> If a backup node that fails to receive advertisements of only one AF would 
> transition to master it would in most cases start a preempt war. 
> So why do we even send dual-stack advertisements?
> The only effect those dual-stack ipv6 advertisements currently have is that 
> they prevent failover when ipv4 connectivity breaks.
> 
> I would propose to choose one "sentinel" AF (in this case ipv4) and failover 
> whenever advertisements of this AF fail to reach the backup.
> 
> Monitoring multiple AFs is not helpful, because there is no good way in which 
> to react to a failure that affects only one AF.

I don't know if this helps, but at work we use separate carp interfaces for v4 
and v6. It ends up looking a bit like this:

# cat /etc/hostname.vlan871:
parent aggr0 vnetid 871
inet alias 192.0.2.2/24
inet6 alias 2001:db8:871::2/64
up

# cat /etc/hostname.carp40871
carpdev vlan871 vhid 47
-inet6
-group carp
group ipv4g
inet alias 192.0.2.1/24
up

# cat /etc/hostname.carp60871
carpdev vlan871 vhid 61
-group carp
group ipv6g
inet6 alias 2001:db8:871::1/64
up

This lets us run a pair of firewalls, one active for v4 and the other for v6. 
We don't do any af-to in PF, so it works pretty well. But yeah, it means v4 and 
v6 fail separately.

> 
>>> > >At the very least, this needs some more thought and testing in all > 
>>> > >>the ways
>>> > >carp can be configured.
>>> >
>>> > Anyway, my main concern indeed is the broadcast noise generated by carp
>>> > and I would be equally happy if we had a ``carppeer6`` option. Would
>>> > that be considered?
>>> 
>>> of course carppeer should work with v6, and as claudio says without an extra
>>> keyword in ifconfig, but thats a trivial detail.
>>> 
>> 
>> Currently carp only handles one address per af, setting carppeer twice
>> changes the current peer address rather than adding another. A trivial
>> implementation that sets the v4 peer address if a v4 address is passed
>> in, and sets the v6 peer address if a v6 address is passed in, that
>> would mean things work differently with
>> 
>> ifconfig carp1 carppeer $foo
>> ifconfig carp1 carppeer $bar
>> 
>> depending on whether foo/bar are v4 or v6. Also removing a configured
>> carppeer address to reset to multicast is just done with -carppeer
>> with no way to indicate the af.
>> 
>> It would work pretty nicely if you could set multiple carppeer addresses
>> (of whatever af) and remove them individually. That's a more complex
>> change (carp would need to keep a list of peers per af rather than a
>> single address) but without something like that they can't really be
>> equals and it feels like shoehorning both afs into the same keyword
>> will just be 

Re: fix pppx(4) with net/ifq.c rev 1.38

2020-05-21 Thread David Gwynne
On Wed, May 20, 2020 at 05:42:35PM +0300, Vitaliy Makkoveev wrote:
> I got splassert with pppx(4) and net/ifq.c rev 1.38 raised by
> NET_ASSERT_LOCKED() in netinet/ip_output.c:113 and underlying routines.
> 
> net/ifq.c rev 1.38 is not in snapshot yet so you need to checkout and
> build kernel to reproduce.
> 
>  dmesg begin 
> 
> splassert: ip_output: want 2 have 0
> Starting stack trace...
> ip_output(fd801f8c8800,0,0,0,0,0) at ip_output+0x8f
> pipex_l2tp_output(fd801f8c8800,8e787808) at
> pipex_l2tp_output+0x21d
> pipex_ppp_output(fd801f8c8800,8e787808,21) at
> pipex_ppp_output+0xda
> pppx_if_start(8e787268) at pppx_if_start+0x83
> if_qstart_compat(8e7874e0) at if_qstart_compat+0x2e
> ifq_serialize(8e7874e0,8e7875b0) at ifq_serialize+0x103
> taskq_thread(81f3df30) at taskq_thread+0x4d
> end trace frame: 0x0, count: 250
> End of stack trace.
> splassert: ipsp_spd_lookup: want 2 have 0
> Starting stack trace...
> ipsp_spd_lookup(fd801f8c8800,2,14,8e726c9c,2,0) at
> ipsp_spd_lookup+0x80
> ip_output_ipsec_lookup(fd801f8c8800,14,8e726c9c,0,1) at
> ip_output_ipsec_lookup+0x4d
> ip_output(fd801f8c8800,0,0,0,0,0) at ip_output+0x4fa
> pipex_l2tp_output(fd801f8c8800,8e787808) at
> pipex_l2tp_output+0x21d
> pipex_ppp_output(fd801f8c8800,8e787808,21) at
> pipex_ppp_output+0xda
> pppx_if_start(8e787268) at pppx_if_start+0x83
> if_qstart_compat(8e7874e0) at if_qstart_compat+0x2e
> ifq_serialize(8e7874e0,8e7875b0) at ifq_serialize+0x103
> taskq_thread(81f3df30) at taskq_thread+0x4d
> end trace frame: 0x0, count: 248
> End of stack trace.
> splassert: spd_table_get: want 2 have 0
> 
>  dmesg end 
> 
> 1. `pxi_if' owned by struct pppx_if has IFXF_MPSAFE flag unset
>   1.1 pppx(4) sets IFQ_SET_MAXLEN(&ifp->if_snd, 1) at net/if_pppx.c:866
> 2. pppx_if_output() is called under NET_LOCK()
> 3. pppx_if_output() calls if_enqueue() at net/if_pppx.c:1123
> 4. pppx(4) doesn't set `ifp->if_enqueue' so if_enqueue() calls
> if_enqueue_ifq() at net/if.c:709 (which is set in net/if.c:639)
> 5. if_enqueue_ifq() calls ifq_start() at net/if.c:734
> 6. ifq_start() we a still under NET_LOCK() here
> 
> 6.a. in net/ifq.c rev 1.37 ifq_start() checks "ifq_len(ifq) >=
> min(ifp->if_txmit, ifq->ifq_maxlen)" and this was always true because
> (1.1) so we always call ifq_run_start() which calls ifq_serialize().
> 
> ifq_serialize() will call if_qstart_compat() which calls
> pppx_if_start() which calls pipex_ppp_output() etc while we still
> holding NET_LOCK() so the assertions I reported above are not raised.
> 
> 6.b. net/ifq.c rev 1.38 introduces checks of the IFXF_MPSAFE flag, so we are
> always going to net/ifq.c:132 where we add our task to `systq'
> referenced by `ifq_softnet' (ifq_softnet set to `systq' at
> net/ifq.c:199).
> 
> taskq_thread() doesn't grab NET_LOCK() so after net/ifq.c rev 1.38
> ifq_serialize() and the underlying pppx_if_start() are called without
> NET_LOCK() and the corresponding asserts are raised.
> 
> The problem I pointed out is not in net/ifq.c rev 1.38 but in pppx(4).
> `if_start' routines should grab NET_LOCK() by themselves if it is required,
> but pppx_if_start() and pppac_start() didn't do that. pppac_start() has
> no underlying NET_ASSERT_LOCKED() so pppx(4) is the only case where
> the problem is shown.
> 
> Since NET_LOCK() is required by pipex(4), diff below adds it to
> pppx_if_start() and pppac_start().
> 
> After net/ifq.c rev 1.38 pppx_if_start() will never be called from
> pppx_if_output() but from `systq' only so I don't add lock/unlock
> dances around if_enqueue() at net/if_pppx.c:1123.
> 
> Diff tested for both pppx(4) and pppac(4) cases.

thanks for the detailed analysis. i wondered how the ifq change
triggered this exactly, and your mail makes it clear.

however, pppx/pppac/pipex are not the first or only drivers in the tree
that encapsulate a packet in IP from their if_start routine and send it
out with the network stack. the way this has been solved in every other
driver has been to call ip{,6}_send to transmit the packet instead
of ip{,6}_output.

can you try the following diff?

Index: pipex.c
===
RCS file: /cvs/src/sys/net/pipex.c,v
retrieving revision 1.113
diff -u -p -r1.113 pipex.c
--- pipex.c 7 Apr 2020 07:11:22 -   1.113
+++ pipex.c 21 May 2020 21:49:50 -
@@ -1453,10 +1453,7 @@ pipex_pptp_output(struct mbuf *m0, struc
gre->flags = htons(gre->flags);
 
m0->m_pkthdr.ph_ifidx = session->pipex_iface->ifnet_this->if_index;
-   if (ip_output(m0, NULL, NULL, 0, NULL, NULL, 0) != 0) {
-   PIPEX_DBG((session, LOG_DEBUG, "ip_output failed."));
-   goto drop;
-   }
+   ip_send(m0);
if (len > 0) {  /* network layer only */
/* countup statistics */
session->stat.opackets++;

omgoptimise carp transmit

2020-05-19 Thread David Gwynne
Generally packets are not transmitted by carp interfaces, but we have a
couple of things at work that mean we do send packets out on them.

Firstly, we have a dhcp relay implementation we run on carp interfaces,
and we sent dhcp replies fast enough to fill up the one slot on the
transmit queue and our relay got ENOBUFS unexpectedly. Letting carp use
the default ifq len would be enough to fix that, but we also have routes
to some networks that prefer carp interfaces (for failover reasons)
which I wanted to go fast, so I copied the vlan transmit semantics over.

carp can basically push packets onto its parent without going through
the transmit queue or tx mitigation now.

I've been running this in production for most of a month, and it's been
nice and boring. Even more boring than before cos I see fewer ENOBUFS
complaints in our logs.

ok?

Index: ip_carp.c
===
RCS file: /cvs/src/sys/netinet/ip_carp.c,v
retrieving revision 1.343
diff -u -p -r1.343 ip_carp.c
--- ip_carp.c   29 Apr 2020 07:04:32 -  1.343
+++ ip_carp.c   29 Apr 2020 09:49:57 -
@@ -233,6 +233,8 @@ int carp_check_dup_vhids(struct carp_sof
 void   carp_ifgroup_ioctl(struct ifnet *, u_long, caddr_t);
 void   carp_ifgattr_ioctl(struct ifnet *, u_long, caddr_t);
 void   carp_start(struct ifnet *);
+intcarp_enqueue(struct ifnet *, struct mbuf *);
+void   carp_transmit(struct carp_softc *, struct ifnet *, struct mbuf *);
 void   carp_setrun_all(struct carp_softc *, sa_family_t);
 void   carp_setrun(struct carp_vhost_entry *, sa_family_t);
 void   carp_set_state_all(struct carp_softc *, int);
@@ -830,8 +809,8 @@ carp_clone_create(struct if_clone *ifc, 
ifp->if_flags = IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST;
ifp->if_ioctl = carp_ioctl;
ifp->if_start = carp_start;
+   ifp->if_enqueue = carp_enqueue;
ifp->if_xflags = IFXF_CLONED;
-   IFQ_SET_MAXLEN(>if_snd, 1);
if_counters_alloc(ifp);
if_attach(ifp);
ether_ifattach(ifp);
@@ -2263,65 +2226,87 @@ void
 carp_start(struct ifnet *ifp)
 {
struct carp_softc *sc = ifp->if_softc;
+   struct ifnet *ifp0 = sc->sc_carpdev;
struct mbuf *m;
 
-   for (;;) {
-   IFQ_DEQUEUE(>if_snd, m);
-   if (m == NULL)
-   break;
+   if (ifp0 == NULL) {
+   ifq_purge(>if_snd);
+   return;
+   }
 
-#if NBPFILTER > 0
-   if (ifp->if_bpf)
-   bpf_mtap_ether(ifp->if_bpf, m, BPF_DIRECTION_OUT);
-#endif /* NBPFILTER > 0 */
+   while ((m = ifq_dequeue(>if_snd)) != NULL)
+   carp_transmit(sc, ifp0, m);
+}
 
-   if ((ifp->if_carpdev->if_flags & (IFF_UP|IFF_RUNNING)) !=
-   (IFF_UP|IFF_RUNNING)) {
-   ifp->if_oerrors++;
-   m_freem(m);
-   continue;
-   }
+void
+carp_transmit(struct carp_softc *sc, struct ifnet *ifp0, struct mbuf *m)
+{
+   struct ifnet *ifp = >sc_if;
 
-   /*
-* Do not leak the multicast address when sending
-* advertisements in 'ip' and 'ip-stealth' balacing
-* modes.
-*/
-   if (sc->sc_balancing == CARP_BAL_IP ||
-   sc->sc_balancing == CARP_BAL_IPSTEALTH) {
-   struct ether_header *eh;
-   uint8_t *esrc;
-
-   eh = mtod(m, struct ether_header *);
-   esrc = ((struct arpcom*)ifp->if_carpdev)->ac_enaddr;
-   memcpy(eh->ether_shost, esrc, sizeof(eh->ether_shost));
+#if NBPFILTER > 0
+   {
+   caddr_t if_bpf = ifp->if_bpf;
+   if (if_bpf) {
+   if (bpf_mtap_ether(if_bpf, m, BPF_DIRECTION_OUT))
+   m_freem(m);
}
+   }
+#endif /* NBPFILTER > 0 */
 
-   if (if_enqueue(ifp->if_carpdev, m)) {
-   ifp->if_oerrors++;
-   continue;
-   }
-   ifp->if_opackets++;
+   if (!ISSET(ifp0->if_flags, IFF_RUNNING)) {
+   counters_inc(ifp->if_counters, ifc_oerrors);
+   m_freem(m);
+   return;
}
+
+   /*
+* Do not leak the multicast address when sending
+* advertisements in 'ip' and 'ip-stealth' balacing
+* modes.
+*/
+   if (sc->sc_balancing == CARP_BAL_IP ||
+   sc->sc_balancing == CARP_BAL_IPSTEALTH) {
+   struct ether_header *eh = mtod(m, struct ether_header *);
+   memcpy(eh->ether_shost, sc->sc_ac.ac_enaddr,
+   sizeof(eh->ether_shost));
+   }
+
+   if (if_enqueue(ifp0, m))
+   counters_inc(ifp->if_counters, ifc_oerrors);
 }
 
 int
-carp_output(struct ifnet *ifp, struct mbuf *m, struct sockaddr *sa,
-struct rtentry *rt)

Re: tcpdump gtp bugfix

2020-05-19 Thread David Gwynne



> On 20 May 2020, at 1:31 am, Miod Vallat  wrote:
> 
> There seems to be a logic error in tcpdump's print-gtp.c.
> 
> The code is printing some values by passing a pointer to the array of
> strings, and the index within the array, and the routine uses
> sizeof(array) / sizeof(array[0]) to figure out the bound.
> 
> But since the caller is passing a pointer, sizeof returns the size of
> the pointer and not of the array itself (and clang will rightfully warn
> about this).
> 
> The right fix is to have the caller pass the upper bound.
> 
> Suggested fix below.

I'll have a look at this.

> 
> Index: print-gtp.c
> ===
> RCS file: /OpenBSD/src/usr.sbin/tcpdump/print-gtp.c,v
> retrieving revision 1.11
> diff -u -p -r1.11 print-gtp.c
> --- print-gtp.c   22 Oct 2018 16:12:45 -  1.11
> +++ print-gtp.c   1 May 2020 09:37:58 -
> @@ -57,12 +57,16 @@
> #include "interface.h"
> #include "gtp.h"
> 
> +#ifndef nitems
> +#define nitems(_a)  (sizeof((_a)) / sizeof((_a)[0]))
> +#endif
> +
> void  gtp_print(const u_char *, u_int, u_short, u_short);
> void  gtp_decode_ie(const u_char *, u_short, int);
> void  gtp_print_tbcd(const u_char *, u_int);
> void  gtp_print_user_address(const u_char *, u_int);
> void  gtp_print_apn(const u_char *, u_int);
> -void gtp_print_str(const char **, u_int);
> +void gtp_print_str(const char **, u_int, u_int);
> 
> void  gtp_v0_print(const u_char *, u_int, u_short, u_short);
> void  gtp_v0_print_prime(const u_char *);
> @@ -466,10 +470,9 @@ gtp_print_apn(const u_char *cp, u_int le
> 
> /* Print string from array. */
> void
> -gtp_print_str(const char **strs, u_int index)
> +gtp_print_str(const char **strs, u_int bound, u_int index)
> {
> -
> - if (index >= (sizeof(*strs) / sizeof(*strs[0])))
> + if (index >= bound)
>   printf(": %u", index);
>   else if (strs[index] != NULL)
>   printf(": %s", strs[index]);
> @@ -727,7 +730,8 @@ gtp_v0_print_tv(const u_char *cp, u_int 
>   /* 12.15 7.3.4.5.3 - Packet Transfer Command. */
>   TCHECK2(cp[0], GTPV0_TV_PACKET_XFER_CMD_LENGTH - 1);
>   printf("Packet Transfer Command");
> - gtp_print_str(gtp_packet_xfer_cmd, cp[0]);
> + gtp_print_str(gtp_packet_xfer_cmd, nitems(gtp_packet_xfer_cmd),
> + cp[0]);
>   ielen = GTPV0_TV_PACKET_XFER_CMD_LENGTH;
>   break;
>   
> @@ -1315,7 +1319,8 @@ gtp_v1_print_tv(const u_char *cp, u_int 
>   /* 32.295 6.2.4.5.2 - Packet Transfer Command. */
>   TCHECK2(cp[0], GTPV1_TV_PACKET_XFER_CMD_LENGTH - 1);
>   printf("Packet Transfer Command");
> - gtp_print_str(gtp_packet_xfer_cmd, cp[0]);
> + gtp_print_str(gtp_packet_xfer_cmd, nitems(gtp_packet_xfer_cmd),
> + cp[0]);
>   ielen = GTPV1_TV_PACKET_XFER_CMD_LENGTH;
>   break;
> 
> @@ -1515,7 +1520,7 @@ gtp_v1_print_tlv(const u_char *cp, u_int
> 
>   /* 29.060 7.7.50 - RAT Type. */
>   printf("RAT");
> - gtp_print_str(gtp_rat_type, cp[0]);
> + gtp_print_str(gtp_rat_type, nitems(gtp_rat_type), cp[0]);
>   break;
> 
>   case GTPV1_TLV_USER_LOCATION_INFO:
> @@ -1607,7 +1612,8 @@ gtp_v1_print_tlv(const u_char *cp, u_int
> 
>   /* 29.060 7.7.66 - MBMS 2G/3G Indicator. */
>   printf("MBMS 2G/3G Indicator");
> - gtp_print_str(mbms_2g3g_indicator, cp[0]);
> + gtp_print_str(mbms_2g3g_indicator, nitems(mbms_2g3g_indicator),
> + cp[0]);
>   break;
> 
>   case GTPV1_TLV_ENHANCED_NSAPI:
> @@ -1697,7 +1703,8 @@ gtp_v1_print_tlv(const u_char *cp, u_int
> 
>   /* 29.060 7.7.80 - MS Info Change Reporting. */
>   printf("MS Info Change Reporting");
> - gtp_print_str(ms_info_change_rpt, cp[0]);
> + gtp_print_str(ms_info_change_rpt, nitems(ms_info_change_rpt),
> + cp[0]);
>   break;
> 
>   case GTPV1_TLV_DIRECT_TUNNEL_FLAGS:
> 



Re: vi: add expandtab option

2020-04-27 Thread David Gwynne



> On 28 Apr 2020, at 13:31, Todd C. Miller  wrote:
> 
> On Thu, 02 Apr 2020 16:21:47 -0600, "Todd C. Miller" wrote:
> 
>> In vim, the expandtab option expands tabs to spaces in insert mode
>> as well as when shifting and indenting/outdenting.  This is very
>> useful when working on a code-base where the style dictates using
>> spaces instead of tabs for indentation.
>> 
>> NetBSD added an implementation of expandtab to their vi some time
>> ago, but theirs doesn't convert tabs to spaces in insert mode.  I've
>> adapted the NetBSD patch and added support for expanding tabs in
>> insert mode, unless escaped via ^V.
>> 
>> The option is off by default (of course).
>> 
>> Comments?  Please, no tabs vs spaces flame wars.
> 
> Ping?  It would be nice for this to make 6.7.

im ok with it.

i've only read it, but it makes sense. i mostly like the idea that i wouldn't
have to use (or install) another editor if i need this.

dlg

> 
> - todd
> 
> Index: usr.bin/vi/common/options.c
> ===
> RCS file: /cvs/src/usr.bin/vi/common/options.c,v
> retrieving revision 1.27
> diff -u -p -u -r1.27 options.c
> --- usr.bin/vi/common/options.c   21 May 2019 09:24:58 -  1.27
> +++ usr.bin/vi/common/options.c   2 Apr 2020 20:43:14 -
> @@ -69,6 +69,8 @@ OPTLIST const optlist[] = {
>   {"escapetime",  NULL,   OPT_NUM,0},
> /* O_ERRORBELLS   4BSD */
>   {"errorbells",  NULL,   OPT_0BOOL,  0},
> +/* O_EXPANDTAB   NetBSD 5.0 */
> + {"expandtab",   NULL,   OPT_0BOOL,  0},
> /* O_EXRC System V (undocumented) */
>   {"exrc",NULL,   OPT_0BOOL,  0},
> /* O_EXTENDED   4.4BSD */
> @@ -207,6 +209,7 @@ static OABBREV const abbrev[] = {
>   {"co",  O_COLUMNS}, /*   4.4BSD */
>   {"eb",  O_ERRORBELLS},  /* 4BSD */
>   {"ed",  O_EDCOMPATIBLE},/* 4BSD */
> + {"et",  O_EXPANDTAB},   /* NetBSD 5.0 */
>   {"ex",  O_EXRC},/* System V (undocumented) */
>   {"ht",  O_HARDTABS},/* 4BSD */
>   {"ic",  O_IGNORECASE},  /* 4BSD */
> Index: usr.bin/vi/docs/USD.doc/vi.man/vi.1
> ===
> RCS file: /cvs/src/usr.bin/vi/docs/USD.doc/vi.man/vi.1,v
> retrieving revision 1.77
> diff -u -p -u -r1.77 vi.1
> --- usr.bin/vi/docs/USD.doc/vi.man/vi.1   4 Oct 2019 20:12:01 -   
> 1.77
> +++ usr.bin/vi/docs/USD.doc/vi.man/vi.1   2 Apr 2020 22:05:31 -
> @@ -1606,6 +1606,11 @@ and
> characters to move forward to the next
> .Ar shiftwidth
> column boundary.
> +If the
> +.Cm expandtab
> +option is set, only insert
> +.Aq space
> +characters.
> .Pp
> .It Aq Cm erase
> .It Aq Cm control-H
> @@ -2343,6 +2348,16 @@ key mapping.
> .Nm ex
> only.
> Announce error messages with a bell.
> +.It Cm expandtab , et Bq off
> +Expand
> +.Aq tab
> +characters to
> +.Aq space
> +when inserting, replacing or shifting text, autoindenting,
> +indenting with
> +.Aq Ic control-T ,
> +or outdenting with
> +.Aq Ic control-D .
> .It Cm exrc , ex Bq off
> Read the startup files in the local directory.
> .It Cm extended Bq off
> Index: usr.bin/vi/docs/USD.doc/vi.ref/set.opt.roff
> ===
> RCS file: /cvs/src/usr.bin/vi/docs/USD.doc/vi.ref/set.opt.roff,v
> retrieving revision 1.12
> diff -u -p -u -r1.12 set.opt.roff
> --- usr.bin/vi/docs/USD.doc/vi.ref/set.opt.roff   8 Aug 2016 15:09:33 
> -   1.12
> +++ usr.bin/vi/docs/USD.doc/vi.ref/set.opt.roff   2 Apr 2020 22:05:27 
> -
> @@ -96,7 +96,9 @@ the first nonblank character of the line
> Lines are indented using tab characters to the extent possible (based on
> the value of the
> .OP tabstop
> -option) and then using space characters as necessary.
> +option, and if
> +.OP expandtab
> +is not set) and then using space characters as necessary.
> For commands inserting text into the middle of a line, any blank characters
> to the right of the cursor are discarded, and the first nonblank character
> to the right of the cursor is aligned as described above.
> @@ -400,6 +402,17 @@ only.
> error messages are normally presented in inverse video.
> If that is not possible for the terminal, setting this option causes
> error messages to be announced by ringing the terminal bell.
> +.KY expandtab
> +.IP "expandtab, et [off]"
> +Expand
> +.LI 
> +characters to
> +.LI 
> +when inserting, replacing or shifting text, autoindenting,
> +indenting with
> +.CO ,
> +or outdenting with
> +.CO .
> .KY exrc
> .IP "exrc, ex [off]"
> If this option is turned on in the EXINIT environment variables,
> Index: usr.bin/vi/ex/ex_shift.c
> ===
> RCS file: /cvs/src/usr.bin/vi/ex/ex_shift.c,v
> retrieving revision 1.8

Re: tcpdump: print nhrp packets

2020-04-15 Thread David Gwynne



> On 14 Apr 2020, at 8:15 pm, Remi Locherer  wrote:
> 
> On Tue, Apr 14, 2020 at 01:49:32PM +1000, David Gwynne wrote:
>> 
>> 
>>> On 13 Apr 2020, at 19:03, Remi Locherer  wrote:
>>> 
>>> Hi,
>>> 
>>> I recently looked into NHRP (RFC 2332) and noticed that our tcpdump does
>>> not have a printer for it. So I added support for NHRP to tcpdump.
>>> 
>>> Initially I was surprised: I expected a simpler protocol! But it is from
>>> the 90's with all the protocols from then in mind (frame relay, ATM, ...).
>>> 
>>> I tested with public available pcap files and compared the output with
>>> wirshark.
>>> https://packetlife.net/captures/protocol/nhrp/
>>> https://www.networkingwithfish.com/fun-in-the-lab-sniffer-tracing-a-dmvpn-tunnel-startup/
>>> 
>>> The output looks like this:
>>> 
>>> 08:34:45.647483 172.16.25.2 > 172.16.15.2: gre NHRP: reg request, id 7 [tos 
>>> 0xc0]
>>> 08:34:45.671422 172.16.15.2 > 172.16.25.2: gre NHRP: reg reply, id 7 [tos 
>>> 0xc0]
>>> 
>>> 08:47:16.138679 172.16.15.2 > 172.16.25.2: gre NHRP: res request, id 6 [tos 
>>> 0xc0]
>>> 08:47:16.148863 172.16.25.2 > 172.16.15.2: gre NHRP: res reply, id 6 [tos 
>>> 0xc0]
>>> 
>>> With -v set:
>>> 
>>> 08:34:45.647483 172.16.25.2 > 172.16.15.2: gre [] 2001 NHRP: reg request, 
>>> id 7, hopcnt 255, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.1 (code 0, 
>>> pl 255, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 254, id 22, len 116)
>>> 08:34:45.671422 172.16.15.2 > 172.16.25.2: gre [] 2001 NHRP: reg reply, id 
>>> 7, hopcnt 255, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.1 (code 0, pl 
>>> 255, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 255, id 7, len 136)
>>> 
>>> 08:47:16.138679 172.16.15.2 > 172.16.25.2: gre [] 2001 NHRP: res request, 
>>> id 6, hopcnt 254, src nbma 172.16.45.2, 192.168.0.4 -> 192.168.0.2 (code 0, 
>>> pl 0, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 254, id 20, len 116)
>>> 08:47:16.148863 172.16.25.2 > 172.16.15.2: gre [] 2001 NHRP: res reply, id 
>>> 6, hopcnt 255, src nbma 172.16.45.2, 192.168.0.4 -> 192.168.0.2 (code 0, pl 
>>> 32, mtu 1514, htime 7199, pref 0, nbma 172.16.25.2, proto 192.168.0.2) [tos 
>>> 0xc0] (ttl 255, id 31, len 144)
>>> 
>>> Extensions are not parsed and printed.
>>> 
> It would be nice to get pcaps with examples that use address or protocol
> combinations other than GRE and IPv4.
>>> 
>>> Comments, OKs?
>> 
>> Can you print the addresses when -v is not set too?
>> 
>> Otherwise I'm keen.
>> 
> 
> Like this?

yes. ok by me.

> tcpdump -n:
> 08:47:16.068855 172.16.25.2 > 172.16.15.2: gre NHRP: res request, id 8, src 
> nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.4 (code 0) [tos 0xc0]
> 08:47:16.150679 172.16.15.2 > 172.16.25.2: gre NHRP: res reply, id 8, src 
> nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.4 (code 0, nbma 172.16.45.2, proto 
> 192.168.0.4) [tos 0xc0]
> 
> tcpdump -nv:
> 08:47:16.068855 172.16.25.2 > 172.16.15.2: gre [] 2001 NHRP: res request, id 
> 8, hopcnt 255, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.4 (code 0, pl 
> 0, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 255, id 29, len 96)
> 08:47:16.150679 172.16.15.2 > 172.16.25.2: gre [] 2001 NHRP: res reply, id 8, 
> hopcnt 254, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.4 (code 0, pl 32, 
> mtu 1514, htime 7199, pref 0, nbma 172.16.45.2, proto 192.168.0.4) [tos 0xc0] 
> (ttl 254, id 21, len 164)
> 
> 
> 
> Index: Makefile
> ===
> RCS file: /cvs/src/usr.sbin/tcpdump/Makefile,v
> retrieving revision 1.64
> diff -u -p -r1.64 Makefile
> --- Makefile  3 Dec 2019 01:43:33 -   1.64
> +++ Makefile  28 Mar 2020 17:07:22 -
> @@ -48,7 +48,7 @@ SRCS=   tcpdump.c addrtoname.c privsep.c p
>   print-bgp.c print-ospf6.c print-ripng.c print-rt6.c print-stp.c \
>   print-etherip.c print-lwres.c print-lldp.c print-cdp.c print-pflog.c \
>   print-pfsync.c pf_print_state.c print-ofp.c ofp_map.c \
> - print-udpencap.c print-carp.c \
> + print-udpencap.c print-carp.c print-nhrp.c \
>   print-802_11.c print-iapp.c print-mpls.c print-slow.c print-usbpcap.c \
>   gmt2local.c savestr.c setsignal.c in_cksum.c
> 
> Index: interface.h
> =

Re: tcpdump: print nhrp packets

2020-04-13 Thread David Gwynne



> On 13 Apr 2020, at 19:03, Remi Locherer  wrote:
> 
> Hi,
> 
> I recently looked into NHRP (RFC 2332) and noticed that our tcpdump does
> not have a printer for it. So I added support for NHRP to tcpdump.
> 
> Initially I was surprised: I expected a simpler protocol! But it is from
> the 90's with all the protocols from then in mind (frame relay, ATM, ...).
> 
> I tested with public available pcap files and compared the output with
> wirshark.
> https://packetlife.net/captures/protocol/nhrp/
> https://www.networkingwithfish.com/fun-in-the-lab-sniffer-tracing-a-dmvpn-tunnel-startup/
> 
> The output looks like this:
> 
> 08:34:45.647483 172.16.25.2 > 172.16.15.2: gre NHRP: reg request, id 7 [tos 
> 0xc0]
> 08:34:45.671422 172.16.15.2 > 172.16.25.2: gre NHRP: reg reply, id 7 [tos 
> 0xc0]
> 
> 08:47:16.138679 172.16.15.2 > 172.16.25.2: gre NHRP: res request, id 6 [tos 
> 0xc0]
> 08:47:16.148863 172.16.25.2 > 172.16.15.2: gre NHRP: res reply, id 6 [tos 
> 0xc0]
> 
> With -v set:
> 
> 08:34:45.647483 172.16.25.2 > 172.16.15.2: gre [] 2001 NHRP: reg request, id 
> 7, hopcnt 255, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.1 (code 0, pl 
> 255, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 254, id 22, len 116)
> 08:34:45.671422 172.16.15.2 > 172.16.25.2: gre [] 2001 NHRP: reg reply, id 7, 
> hopcnt 255, src nbma 172.16.25.2, 192.168.0.2 -> 192.168.0.1 (code 0, pl 255, 
> mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 255, id 7, len 136)
> 
> 08:47:16.138679 172.16.15.2 > 172.16.25.2: gre [] 2001 NHRP: res request, id 
> 6, hopcnt 254, src nbma 172.16.45.2, 192.168.0.4 -> 192.168.0.2 (code 0, pl 
> 0, mtu 1514, htime 7200, pref 0) [tos 0xc0] (ttl 254, id 20, len 116)
> 08:47:16.148863 172.16.25.2 > 172.16.15.2: gre [] 2001 NHRP: res reply, id 6, 
> hopcnt 255, src nbma 172.16.45.2, 192.168.0.4 -> 192.168.0.2 (code 0, pl 32, 
> mtu 1514, htime 7199, pref 0, nbma 172.16.25.2, proto 192.168.0.2) [tos 0xc0] 
> (ttl 255, id 31, len 144)
> 
> Extensions are not parsed and printed.
> 
> It would be nice to get pcaps with examples that use address or protocol
> combinations other than GRE and IPv4.
> 
> Comments, OKs?

Can you print the addresses when -v is not set too?

Otherwise I'm keen.

> 
> Remi
> 
> 
> Index: Makefile
> ===
> RCS file: /cvs/src/usr.sbin/tcpdump/Makefile,v
> retrieving revision 1.64
> diff -u -p -r1.64 Makefile
> --- Makefile  3 Dec 2019 01:43:33 -   1.64
> +++ Makefile  28 Mar 2020 17:07:22 -
> @@ -48,7 +48,7 @@ SRCS=   tcpdump.c addrtoname.c privsep.c p
>   print-bgp.c print-ospf6.c print-ripng.c print-rt6.c print-stp.c \
>   print-etherip.c print-lwres.c print-lldp.c print-cdp.c print-pflog.c \
>   print-pfsync.c pf_print_state.c print-ofp.c ofp_map.c \
> - print-udpencap.c print-carp.c \
> + print-udpencap.c print-carp.c print-nhrp.c \
>   print-802_11.c print-iapp.c print-mpls.c print-slow.c print-usbpcap.c \
>   gmt2local.c savestr.c setsignal.c in_cksum.c
> 
> Index: interface.h
> ===
> RCS file: /cvs/src/usr.sbin/tcpdump/interface.h,v
> retrieving revision 1.83
> diff -u -p -r1.83 interface.h
> --- interface.h   3 Dec 2019 01:43:33 -   1.83
> +++ interface.h   28 Mar 2020 17:07:22 -
> @@ -217,6 +217,7 @@ extern void ppp_ether_if_print(u_char *,
> extern void gre_print(const u_char *, u_int);
> extern void vxlan_print(const u_char *, u_int);
> extern void nsh_print(const u_char *, u_int);
> +extern void nhrp_print(const u_char *, u_int);
> extern void icmp_print(const u_char *, u_int, const u_char *);
> extern void ieee802_11_if_print(u_char *, const struct pcap_pkthdr *,
> const u_char *);
> Index: print-ether.c
> ===
> RCS file: /cvs/src/usr.sbin/tcpdump/print-ether.c,v
> retrieving revision 1.37
> diff -u -p -r1.37 print-ether.c
> --- print-ether.c 24 Jan 2020 22:46:36 -  1.37
> +++ print-ether.c 28 Mar 2020 17:07:22 -
> @@ -303,6 +303,13 @@ recurse:
>   ether_pbb_print(p, length, caplen);
>   return (1);
> 
> +#ifndef ETHERTYPE_NHRP
> +#define ETHERTYPE_NHRP 0x2001
> +#endif
> + case ETHERTYPE_NHRP:
> + nhrp_print(p, length);
> + return (1);
> +
> #ifdef PPP
>   case ETHERTYPE_PPPOEDISC:
>   case ETHERTYPE_PPPOE:
> Index: print-gre.c
> ===
> RCS file: /cvs/src/usr.sbin/tcpdump/print-gre.c,v
> retrieving revision 1.30
> diff -u -p -r1.30 print-gre.c
> --- print-gre.c   24 Jan 2020 22:46:36 -  1.30
> +++ print-gre.c   28 Mar 2020 17:07:22 -
> @@ -289,6 +289,12 @@ gre_print_0(const u_char *p, u_int lengt
>   case 0x2000:
>   cdp_print(p, length, l, 0);
>   break;
> +#ifndef ETHERTYPE_NHRP
> +#define ETHERTYPE_NHRP 0x2001
> +#endif
> + 

Re: tweak how amd64 (not intel) cpu topology is calculated

2020-03-11 Thread David Gwynne



> On 10 Mar 2020, at 00:04, Stuart Henderson  wrote:
> 
> On 2020/03/09 22:50, David Gwynne wrote:
>> this works better on his epyc 2 box, and works right on my epyc 1, esxi
>> on epyc 1, and on an apu1.
> 
> Fine on apu2 (GX-412TC) and the old HP microserver (Turion N40L) also.
> Diff makes sense and I'm happy you found an alternative to my dodgy
> CPU_INFO_FOREACH :)

i prefer to think of ESXi as the dodgy bit in this situation.

does anyone want to ok this?

dlg



Re: ifq: ifq_dec_sleep may return garbage

2020-03-09 Thread David Gwynne
On Tue, Mar 10, 2020 at 12:03:24AM +0100, Tobias Heider wrote:
> If 'm = ifq->ifq_ops->ifqop_deq_begin(ifq, )' is not NULL
> the loop is exited and an uninitialized 'int error' is returned.
> Several lines below error is checked for '!= 0', so i assume it
> was meant to be initialized to '0'. 
> 
> ok?

ok

> 
> Index: ifq.c
> ===
> RCS file: /mount/openbsd/cvs/src/sys/net/ifq.c,v
> retrieving revision 1.36
> diff -u -p -r1.36 ifq.c
> --- ifq.c 25 Jan 2020 06:31:32 -  1.36
> +++ ifq.c 9 Mar 2020 22:57:58 -
> @@ -395,7 +395,7 @@ ifq_deq_sleep(struct ifqueue *ifq, struc
>  {
>   struct mbuf *m;
>   void *cookie;
> - int error;
> + int error = 0;
>  
>   ifq_deq_enter(ifq);
>   if (ifq->ifq_len == 0 && nbio)



Re: net/if.c: nullptr deref in if_hooks_run

2020-03-09 Thread David Gwynne
On Mon, Mar 09, 2020 at 11:56:09PM +0100, Klemens Nanni wrote:
> On Mon, Mar 09, 2020 at 10:33:17PM +0100, Tobias Heider wrote:
> > there seems to be a nullptr dereference in if_hooks_run.
> Did your kernel crash here or did you find it by reading alone?
> 
> > When the inner while loop is exited because 't == NULL' the next
> > line is an access to 't->t_func'.
> Yes, reads obviously wrong.

:'(

> > Because 't==NULL' means the TAILQ is fully traversed I think we
> > should break and exit instead.
> Make sense, OK kn

how about this? this has the cursor handling move the traversal and
NULL check back to the for loop instead of looping on its own.

Index: if.c
===
RCS file: /cvs/src/sys/net/if.c,v
retrieving revision 1.600
diff -u -p -r1.600 if.c
--- if.c24 Jan 2020 05:14:51 -  1.600
+++ if.c10 Mar 2020 00:24:00 -
@@ -1050,10 +1050,9 @@ if_hooks_run(struct task_list *hooks)
 
mtx_enter(_hooks_mtx);
for (t = TAILQ_FIRST(hooks); t != NULL; t = nt) {
-   while (t->t_func == NULL) { /* skip cursors */
-   t = TAILQ_NEXT(t, t_entry);
-   if (t == NULL)
-   break;
+   if (t->t_func == NULL) { /* skip cursors */
+   nt = TAILQ_NEXT(t, t_entry);
+   continue;
}
func = t->t_func;
arg = t->t_arg;



Re: tweak how amd64 (not intel) cpu topology is calculated

2020-03-09 Thread David Gwynne
On Mon, Mar 09, 2020 at 05:00:37PM +1000, David Gwynne wrote:
> ive been running multi-cpu/core openbsd VMs on esxi on top of amd
> epyc cpus, and have noticed for a long time now that for some reason
> openbsd decides that all the virtual cpus are threads on the one core.
> this is annoying, cos setting hw.smt=1 feels dirty and wrong.
> 
> i spent the weekend figuring this out, and came up with the following.
> 
> our current code assumes that some cpuid fields on recent CPUs contain
> information about package and core topologies which we base the package
> and core ids on. turns out we're pretty much alone in this assumption. if
> you're running on real hardware, they do tend to contain useful info,
> but virtual machines don't fill them in properly.
> 
> every other operating system seems to rely on the one mechanism across
> all families. specifically, the initial local apic id provides a
> globally unique identifier that has bits representing at least the
> package (socket) the logical thread is on, plus the core, and if
> supported, the smt id. multi socket cpus provide information about
> how many bits there are before you get to the ones identifying the
> package, so we use that everywhere.
> 
> only the most recent generation of cpu (zen) supports SMT, but the cpuid
> that provides that information is available on at least the previous
> gen. fortunately they made it so reading the number of threads per
> core falls back gracefully. this means we can default to there being
> one thread per core, and increase that if the cpuid is available and
> appropriately set.
> 
> this seems to fix my vmware problem, but also still seems to work on my
> physical epyc boxes. unfortunately i do not have any other amd systems
> up and running at the moment, so i would appreciate testing on any and
> every amd based amd64 system.
> 
> tests please. ok?

of course hrvoje found a bug. i cleaned up my diff too hard before
sending it out, and ended up reading nthreads from the wrong bits.

this works better on his epyc 2 box, and works right on my epyc 1, esxi
on epyc 1, and on an apu1.

Index: identcpu.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/identcpu.c,v
retrieving revision 1.113
diff -u -p -r1.113 identcpu.c
--- identcpu.c  14 Jun 2019 18:13:55 -  1.113
+++ identcpu.c  9 Mar 2020 12:38:59 -
@@ -824,37 +824,31 @@ cpu_topology(struct cpu_info *ci)
apicid = (ebx >> 24) & 0xff;
 
if (strcmp(cpu_vendor, "AuthenticAMD") == 0) {
+   uint32_t nthreads = 1; /* per core */
+   uint32_t thread_id; /* within a package */
+
/* We need at least apicid at CPUID 0x8008 */
if (ci->ci_pnfeatset < 0x8008)
goto no_topology;
 
-   if (ci->ci_pnfeatset >= 0x801e) {
-   struct cpu_info *ci_other;
-   CPU_INFO_ITERATOR cii;
+   CPUID(0x8008, eax, ebx, ecx, edx);
+   core_bits = (ecx >> 12) & 0xf;
 
+   if (ci->ci_pnfeatset >= 0x801e) {
CPUID(0x801e, eax, ebx, ecx, edx);
-   ci->ci_core_id = ebx & 0xff;
-   ci->ci_pkg_id = ecx & 0xff;
-   ci->ci_smt_id = 0;
-   CPU_INFO_FOREACH(cii, ci_other) {
-   if (ci != ci_other &&
-   ci_other->ci_core_id == ci->ci_core_id &&
-   ci_other->ci_pkg_id == ci->ci_pkg_id)
-   ci->ci_smt_id++;
-   }
-   } else {
-   CPUID(0x8008, eax, ebx, ecx, edx);
-   core_bits = (ecx >> 12) & 0xf;
-   if (core_bits == 0)
-   goto no_topology;
-   /* So coreidsize 2 gives 3, 3 gives 7... */
-   core_mask = (1 << core_bits) - 1;
-   /* Core id is the least significant considering mask */
-   ci->ci_core_id = apicid & core_mask;
-   /* Pkg id is the upper remaining bits */
-   ci->ci_pkg_id = apicid & ~core_mask;
-   ci->ci_pkg_id >>= core_bits;
+   nthreads = ((ebx >> 8) & 0xf) + 1;
}
+
+   /* Shift the core_bits off to get at the pkg bits */
+   ci->ci_pkg_id = apicid >> core_bits;
+
+   /* Get rid of the package bits */
+   core_mask = (1 << core_bits) - 1;
+   thread_id = apicid

tweak how amd64 (not intel) cpu topology is calculated

2020-03-09 Thread David Gwynne
ive been running multi-cpu/core openbsd VMs on esxi on top of amd
epyc cpus, and have noticed for a long time now that for some reason
openbsd decides that all the virtual cpus are threads on the one core.
this is annoying, cos setting hw.smt=1 feels dirty and wrong.

i spent the weekend figuring this out, and came up with the following.

our current code assumes that some cpuid fields on recent CPUs contain
information about package and core topologies which we base the package
and core ids on. turns out we're pretty much alone in this assumption. if
you're running on real hardware, they do tend to contain useful info,
but virtual machines don't fill them in properly.

every other operating system seems to rely on the one mechanism across
all families. specifically, the initial local apic id provides a
globally unique identifier that has bits representing at least the
package (socket) the logical thread is on, plus the core, and if
supported, the smt id. multi socket cpus provide information about
how many bits there are before you get to the ones identifying the
package, so we use that everywhere.

only the most recent generation of cpu (zen) supports SMT, but the cpuid
that provides that information is available on at least the previous
gen. fortunately they made it so reading the number of threads per
core falls back gracefully. this means we can default to there being
one thread per core, and increase that if the cpuid is available and
appropriately set.

this seems to fix my vmware problem, but also still seems to work on my
physical epyc boxes. unfortunately i do not have any other amd systems
up and running at the moment, so i would appreciate testing on any and
every amd based amd64 system.

tests please. ok?

Index: identcpu.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/identcpu.c,v
retrieving revision 1.113
diff -u -p -r1.113 identcpu.c
--- identcpu.c  14 Jun 2019 18:13:55 -  1.113
+++ identcpu.c  9 Mar 2020 06:44:41 -
@@ -824,37 +824,31 @@ cpu_topology(struct cpu_info *ci)
apicid = (ebx >> 24) & 0xff;
 
if (strcmp(cpu_vendor, "AuthenticAMD") == 0) {
+   uint32_t nthreads = 1; /* per core */
+   uint32_t thread_id; /* within a package */
+
/* We need at least apicid at CPUID 0x8008 */
if (ci->ci_pnfeatset < 0x8008)
goto no_topology;
 
-   if (ci->ci_pnfeatset >= 0x801e) {
-   struct cpu_info *ci_other;
-   CPU_INFO_ITERATOR cii;
+   CPUID(0x8008, eax, ebx, ecx, edx);
+   core_bits = (ecx >> 12) & 0xf;
 
+   if (ci->ci_pnfeatset >= 0x801e) {
CPUID(0x801e, eax, ebx, ecx, edx);
-   ci->ci_core_id = ebx & 0xff;
-   ci->ci_pkg_id = ecx & 0xff;
-   ci->ci_smt_id = 0;
-   CPU_INFO_FOREACH(cii, ci_other) {
-   if (ci != ci_other &&
-   ci_other->ci_core_id == ci->ci_core_id &&
-   ci_other->ci_pkg_id == ci->ci_pkg_id)
-   ci->ci_smt_id++;
-   }
-   } else {
-   CPUID(0x8008, eax, ebx, ecx, edx);
-   core_bits = (ecx >> 12) & 0xf;
-   if (core_bits == 0)
-   goto no_topology;
-   /* So coreidsize 2 gives 3, 3 gives 7... */
-   core_mask = (1 << core_bits) - 1;
-   /* Core id is the least significant considering mask */
-   ci->ci_core_id = apicid & core_mask;
-   /* Pkg id is the upper remaining bits */
-   ci->ci_pkg_id = apicid & ~core_mask;
-   ci->ci_pkg_id >>= core_bits;
+   nthreads = ((ebx >> 12) & 0xf) + 1;
}
+
+   /* Shift the core_bits off to get at the pkg bits */
+   ci->ci_pkg_id = apicid >> core_bits;
+
+   /* Get rid of the package bits */
+   core_mask = (1 << core_bits) - 1;
+   thread_id = apicid & core_mask;
+
+   /* Cut logical thread_id into core id, and smt id in a core */
+   ci->ci_core_id = thread_id / nthreads;
+   ci->ci_smt_id = thread_id % nthreads;
} else if (strcmp(cpu_vendor, "GenuineIntel") == 0) {
/* We only support leaf 1/4 detection */
if (cpuid_level < 4)



retries and timeouts for radiusctl(8) test

2020-02-20 Thread David Gwynne
we (work) use radiusctl as part of a check script with relayd so
we can try and keep a radius service available with some magical
routing and redirect configs.

radiusctl is currently pretty simple and sends a radius request,
and then waits for a reply. however, it does not implement retries
and timeouts, so if either the reply or request are lost, radiusctl
ends up waiting in recv forever for a packet that will never turn
up.

combined with this, we seem to tickle something in relayd where it
loses track of its children. this results in us "leaking" radiusctl
processes. historically we've coped with this by running pkill -x
radiusctl out of cron, and while it made us a sad on the inside,
we've been busy recently and had to ignore this for now.

unfortunately one of the radius servers failed this morning. this
meant that instead of a few radius packets being lost over a long
period of time causing a slow leak of radiusctl processes, we never
got a reply to any radius packet and started accumulating a ton of 
radius processes. in fact, we hit maxproc, which made recovering very
annoying.

we run a bunch of different checks out of relayd, but the radiusctl
one is the only one that doesn't implement timeouts and retries. so
i'm fixing it first, and then we'll try and figure out what's wrong
with relayd.

the specific changes in this diff is the introduction of a transmission
retry counter, a time interval between the tries, and a maximum
wait time that the process has before it gives up waiting for a
reply. each of these is configurable, but i think the defaults are
reasonable for a test.

it introduces libevent, because that makes it easy to manage the
timeouts.
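the way the three knobs interact can be sketched without libevent. this is only an illustrative model of the tries/interval/maxwait schedule, not the event loop in the diff:

```c
#include <assert.h>

/* model of the retry schedule: a request is (re)sent up to "tries"
 * times, "interval" seconds apart, and the whole test gives up once
 * "maxwait" seconds have elapsed.  returns how many transmissions
 * happen if no reply ever arrives. */
static int
tries_until_giveup(int tries, int interval, int maxwait)
{
	int sends = 0, elapsed = 0;

	while (sends < tries && elapsed < maxwait) {
		sends++;		/* transmit (or retransmit) */
		elapsed += interval;	/* wait this long for a reply */
	}
	return sends;
}
```

whichever knob runs out first ends the test, so a short maxwait caps the retries even if the counter hasn't been exhausted yet.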

ok?

Index: Makefile
===
RCS file: /cvs/src/usr.sbin/radiusctl/Makefile,v
retrieving revision 1.2
diff -u -p -r1.2 Makefile
--- Makefile3 Aug 2015 04:10:21 -   1.2
+++ Makefile21 Feb 2020 05:15:14 -
@@ -3,7 +3,7 @@ PROG=   radiusctl
 SRCS=  radiusctl.c parser.c chap_ms.c
 MAN=   radiusctl.8
 CFLAGS+=   -Wall -Wextra -Wno-unused-parameter
-LDADD+=-lradius -lcrypto
-DPADD+=${LIBRADIUS} ${LIBCRYPTO}
+LDADD+=-lradius -lcrypto -levent
+DPADD+=${LIBRADIUS} ${LIBCRYPTO} ${LIBEVENT}
 
.include <bsd.prog.mk>
Index: parser.c
===
RCS file: /cvs/src/usr.sbin/radiusctl/parser.c,v
retrieving revision 1.1
diff -u -p -r1.1 parser.c
--- parser.c21 Jul 2015 04:06:04 -  1.1
+++ parser.c21 Feb 2020 05:15:14 -
@@ -18,6 +18,8 @@
  * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
  */
 
+#include 
+
 #include 
 #include 
 #include 
@@ -35,6 +37,9 @@ enum token_type {
PORT,
METHOD,
NAS_PORT,
+   TRIES,
+   INTERVAL,
+   MAXWAIT,
ENDTOKEN
 };
 
@@ -45,7 +50,11 @@ struct token {
const struct token  *next;
 };
 
-static struct parse_result res;
+static struct parse_result res = {
+   .tries  = TEST_TRIES_DEFAULT,
+   .interval   = { TEST_INTERVAL_DEFAULT, 0 },
+   .maxwait= { TEST_MAXWAIT_DEFAULT, 0 },
+};
 
 static const struct token t_test[];
 static const struct token t_secret[];
@@ -55,6 +64,9 @@ static const struct token t_password[];
 static const struct token t_port[];
 static const struct token t_method[];
 static const struct token t_nas_port[];
+static const struct token t_tries[];
+static const struct token t_interval[];
+static const struct token t_maxwait[];
 
 static const struct token t_main[] = {
{ KEYWORD,  "test", TEST,   t_test },
@@ -82,6 +94,9 @@ static const struct token t_test_opts[] 
{ KEYWORD,  "port", NONE,   t_port },
{ KEYWORD,  "method",   NONE,   t_method },
{ KEYWORD,  "nas-port", NONE,   t_nas_port },
+   { KEYWORD,  "interval", NONE,   t_interval },
+   { KEYWORD,  "tries",NONE,   t_tries },
+   { KEYWORD,  "maxwait",  NONE,   t_maxwait },
{ ENDTOKEN, "", NONE,   NULL }
 };
 
@@ -105,6 +120,21 @@ static const struct token t_nas_port[] =
{ ENDTOKEN, "", NONE,   NULL }
 };
 
+static const struct token t_tries[] = {
+   { TRIES,"", NONE,   t_test_opts },
+   { ENDTOKEN, "", NONE,   NULL }
+};
+
+static const struct token t_interval[] = {
+   { INTERVAL, "", NONE,   t_test_opts },
+   { ENDTOKEN, "", NONE,   NULL }
+};
+
+static const struct token t_maxwait[] = {
+   { MAXWAIT,  "", NONE,   t_test_opts },
+   { ENDTOKEN, "", NONE,   NULL }
+};
+
 
 static const struct token  *match_token(char *, const 

Re: MSI-X & Interrupting CPU > 0

2020-01-26 Thread David Gwynne



> On 25 Jan 2020, at 10:57 am, Mark Kettenis  wrote:
> 
> David Gwynne schreef op 2020-01-25 01:28:
>>> On 23 Jan 2020, at 10:38 pm, Mark Kettenis  wrote:
>>> Martin Pieuchot schreef op 2020-01-23 11:28:
>>>> I'd like to make progress towards interrupting multiple CPUs in order to
>>>> one day make use of multiple queues in some network drivers.  The road
>>>> towards that goal is consequent and I'd like to proceed in steps to make
>>>> it easier to squash bugs.  I'm currently thinking of the following steps:
>>>> 1. Is my interrupt handler safe to be executed on CPU != CPU0?
>>> Except for things that are inherently tied to a specific CPU (clock 
>>> interrupts,
>>> performance counters, etc) I think the answer here should always be "yes".
>> Agreed.
>>> It probably only makes sense for mpsafe handlers to run on secondary CPUs 
>>> though.
>> Only because keeping !mpsafe handlers on one CPU means they're less
>> likely to need to spin against other !mpsafe interrupts on other CPUs
>> waiting for the kernel lock before they can execute. Otherwise this
>> shouldn't matter.
>>>> 2. Is it safe to execute this handler on two or more CPUs at the same
>>>>   time?
>>> I think that is never safe.  Unless you execute the handler on
>>> different "data".
>>> Running multiple rx interrupt handlers on different CPUs should be fine.
>> Agreed.
>>>> 3. How does interrupting multiple CPUs influence packet processing in
>>>>   the softnet thread?  Is any knowledge required (CPU affinity?) to
>>>>   have an optimum processing when multiple softnet threads are used?
>> I think this is my question to answer.
>> Packet sources (ie, rx rings) are supposed to be tied to a specific
>> nettq. Part of this is to avoid packet reordering where multiple
>> nettqs for one ring could overlap processing of packets for a single
>> TCP stream. The other part is so a busy nettq can apply backpressure
>> when it is overloaded to the rings that are feeding it.
>> Experience from other systems is that affinity does matter, but
>> running stuff in parallel matters more. Affinity between rings and
>> nettqs is something that can be worked on later.
>>>> 4. How to split traffic in one incoming NIC between multiple processing
>>>>   units?
>>> You'll need to have some sort of hardware filter that uses a hash of the
>>> packet header to assign an rx queue such that all packets from a single 
>>> "flow"
>>> end up on the same queue and therefore will be processed by the same 
>>> interrupt
>>> handler.
>> Yep.
>>>> This new journey comes with the requirement of being able to interrupt
>>>> an arbitrary CPU.  For that we need a new API.  Patrick gave me the
>>>> diff below during u2k20 and I'd like to use it to start a discussion.
>>>> We currently have 6 drivers using pci_intr_map_msix().  Since we want to
>>>> be able to specify a CPU should we introduce a new function like in the
>>>> diff below or do we prefer to add a new argument (cpuid_t?) to this one?
>>>> This change in itself should already allow us to proceed with the first
>>>> item of the above list.
>>> I'm not sure you want to have the driver pick the CPU to which to assign the
>>> interrupt.  In fact I think that doesn't make sense at all.  The CPU
>>> should be picked by more generic code instead.  But perhaps we do need to
>>> pass a hint from the driver to that code.
>> Letting the driver pick the CPU is Good Enough(tm) today. It may limit
>> us to 70 or 80 percent of some theoretical maximum, but we don't have
>> the machinery to make a better decision on behalf of the driver at
>> this point. It is much better to start with something simple today
>> (ie, letting the driver pick the CPU) and improve on it after we hit
>> the limits with the simple thing.
>> I also look at how far dfly has got, and from what I can tell their
>> MSI-X stuff lets the driver pick the CPU. So it can't be too bad.
>>>> Then we need a way to read the MSI-X control table size using the define
>>>> PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
>>>> want to print that information in dmesg, some maybe cache it in pci(4)?
>>> There are already defines for MSIX in pcireg.h, some of which are duplicated
>>> by the defines in this diff.  Don't think caching makes all that much sense.
>>> Don't think we need to print the t

Re: MSI-X & Interrupting CPU > 0

2020-01-24 Thread David Gwynne



> On 23 Jan 2020, at 10:38 pm, Mark Kettenis  wrote:
> 
> Martin Pieuchot schreef op 2020-01-23 11:28:
>> I'd like to make progress towards interrupting multiple CPUs in order to
>> one day make use of multiple queues in some network drivers.  The road
>> towards that goal is consequent and I'd like to proceed in steps to make
>> it easier to squash bugs.  I'm currently thinking of the following steps:
>> 1. Is my interrupt handler safe to be executed on CPU != CPU0?
> 
> Except for things that are inherently tied to a specific CPU (clock 
> interrupts,
> performance counters, etc) I think the answer here should always be "yes".

Agreed.

> It probably only makes sense for mpsafe handlers to run on secondary CPUs 
> though.

Only because keeping !mpsafe handlers on one CPU means they're less likely to 
need to spin against other !mpsafe interrupts on other CPUs waiting for the 
kernel lock before they can execute. Otherwise this shouldn't matter.

> 
>> 2. Is it safe to execute this handler on two or more CPUs at the same
>>time?
> 
> I think that is never safe.  Unless you execute the handler on different 
> "data".
> Running multiple rx interrupt handlers on different CPUs should be fine.

Agreed.

> 
>> 3. How does interrupting multiple CPUs influence packet processing in
>>the softnet thread?  Is any knowledge required (CPU affinity?) to
>>have an optimum processing when multiple softnet threads are used?

I think this is my question to answer.

Packet sources (ie, rx rings) are supposed to be tied to a specific nettq. Part 
of this is to avoid packet reordering where multiple nettqs for one ring could 
overlap processing of packets for a single TCP stream. The other part is so a 
busy nettq can apply backpressure when it is overloaded to the rings that are 
feeding it.

Experience from other systems is that affinity does matter, but running stuff 
in parallel matters more. Affinity between rings and nettqs is something that 
can be worked on later.
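as a sketch, the static pinning is no more than an index mapping (illustrative only, not the actual stack code):

```c
#include <assert.h>

/* each rx ring is statically pinned to one softnet taskq, so all
 * packets of a flow (which the NIC hashes onto one ring) are always
 * processed by the same thread -- no reordering within a TCP stream,
 * and a busy nettq can push back on exactly the rings feeding it. */
static unsigned
ring_to_nettq(unsigned ring, unsigned nettqs)
{
	return ring % nettqs;	/* static assignment; affinity comes later */
}
```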

>> 4. How to split traffic in one incoming NIC between multiple processing
>>units?
> 
> You'll need to have some sort of hardware filter that uses a hash of the
> packet header to assign an rx queue such that all packets from a single "flow"
> end up on the same queue and therefore will be processed by the same interrupt
> handler.

Yep.
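for illustration, a toy version of that mapping. real NICs use e.g. the Toeplitz hash over these same fields; this only shows the property that matters, which is that every packet of one flow lands on the same rx ring:

```c
#include <assert.h>
#include <stdint.h>

/* toy flow hash: any deterministic function of the 4-tuple gives
 * flow-stickiness, since the same tuple always yields the same ring */
static unsigned
flow_to_ring(uint32_t saddr, uint32_t daddr, uint16_t sport, uint16_t dport,
    unsigned nrings)
{
	uint32_t h = saddr ^ daddr ^ ((uint32_t)sport << 16 | dport);

	h ^= h >> 16;
	h *= 0x9e3779b1;	/* cheap mixing, illustrative only */
	h ^= h >> 16;
	return h % nrings;
}
```

note this toy hash is not symmetric, so unlike a symmetric Toeplitz key it would not map both directions of a flow to the same ring.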

> 
>> This new journey comes with the requirement of being able to interrupt
>> an arbitrary CPU.  For that we need a new API.  Patrick gave me the
>> diff below during u2k20 and I'd like to use it to start a discussion.
>> We currently have 6 drivers using pci_intr_map_msix().  Since we want to
>> be able to specify a CPU should we introduce a new function like in the
>> diff below or do we prefer to add a new argument (cpuid_t?) to this one?
>> This change in itself should already allow us to proceed with the first
>> item of the above list.
> 
> I'm not sure you want to have the driver pick the CPU to which to assign the
> interrupt.  In fact I think that doesn't make sense at all.  The CPU
> should be picked by more generic code instead.  But perhaps we do need to
> pass a hint from the driver to that code.

Letting the driver pick the CPU is Good Enough(tm) today. It may limit us to 70 
or 80 percent of some theoretical maximum, but we don't have the machinery to 
make a better decision on behalf of the driver at this point. It is much better 
to start with something simple today (ie, letting the driver pick the CPU) and 
improve on it after we hit the limits with the simple thing.

I also look at how far dfly has got, and from what I can tell their MSI-X stuff 
lets the driver pick the CPU. So it can't be too bad.

> 
>> Then we need a way to read the MSI-X control table size using the define
>> PCI_MSIX_CTL_TBLSIZE() below.  This can be done in MI, we might also
>> want to print that information in dmesg, some maybe cache it in pci(4)?
> 
> There are already defines for MSIX in pcireg.h, some of which are duplicated
> by the defines in this diff.  Don't think caching makes all that much sense.
> Don't think we need to print the table size in dmesg; pcidump(8) already
> prints it.  Might make sense to print the vector number though.

I'm ok with using pcidump(8) to see what a particular device offers rather 
than having it in dmesg. I'd avoid putting vectors in dmesg output, cos if you 
have a lot of rings there's going to be a lot of dmesg output. Probably better to 
make vmstat -i more useful, or systat mb.

> 
>> Does somebody has a better/stronger/magic way to achieve this goal?
> 
> I playes a little bit with assigning interrupts to different CPUs in the
> past, but at that point this didn't really result in a performance boost.
> That was quite a while ago though.  I don't think there are fundamental 
> problems
> in getting this going.

Well, packet processing still goes through a single nettq, and that's the limit 
I hit on my firewalls. I have a lot of CARP, LACP and VLAN stuff 

Re: GRE datagram socket support

2020-01-21 Thread David Gwynne



> On 22 Jan 2020, at 8:54 am, Damien Miller  wrote:
> 
> On Wed, 22 Jan 2020, David Gwynne wrote:
> 
>>> Index: sys/kern/kern_pledge.c
>>> ===
>>> RCS file: /cvs/src/sys/kern/kern_pledge.c,v
>>> retrieving revision 1.255
>>> diff -u -p -r1.255 kern_pledge.c
>>> --- sys/kern/kern_pledge.c  25 Aug 2019 18:46:40 -  1.255
>>> +++ sys/kern/kern_pledge.c  29 Oct 2019 07:57:58 -
>>> @@ -666,7 +666,7 @@ pledge_namei(struct proc *p, struct name
>>> }
>>> }
>>> 
>>> -   /* DNS needs /etc/{resolv.conf,hosts,services}. */
>>> +   /* DNS needs /etc/{resolv.conf,hosts,services,protocols}. */
>>> if ((ni->ni_pledge == PLEDGE_RPATH) &&
>>> (p->p_p->ps_pledge & PLEDGE_DNS)) {
>>> if (strcmp(path, "/etc/resolv.conf") == 0) {
>>> @@ -678,6 +678,10 @@ pledge_namei(struct proc *p, struct name
>>> return (0);
>>> }
>>> if (strcmp(path, "/etc/services") == 0) {
>>> +   ni->ni_cnd.cn_flags |= BYPASSUNVEIL;
>>> +   return (0);
>>> +   }
>>> +   if (strcmp(path, "/etc/protocols") == 0) {
>>> ni->ni_cnd.cn_flags |= BYPASSUNVEIL;
>>> return (0);
> 
> This looks like it is fixing a real, separate bug in pledge vs
> getaddrinfo, no? (specifically: that lookups for named ports will fail
> currently).

no, our getaddrinfo currently hardcodes SOCK_STREAM, SOCK_DGRAM, 
IPPROTO_TCP, and IPPROTO_UDP, mapping them to "udp" and "tcp" for use when 
looking up /etc/services via getservbyname_r. this is fine because they are by 
far the most common case and worth optimising for.

the problem is if (when) i want to use getnameinfo to look up entries for 
IPPROTO_GRE. i either hardcode IPPROTO_GRE in getnameinfo guts to "gre" for it 
to pass to getservbyname_r, or i look up /etc/protocols via getprotobynumber_r 
to get a name. i opted for the latter.

dlg


Re: GRE datagram socket support

2020-01-21 Thread David Gwynne
Has anyone got an opinion on this? I am still interested in doing more
packet capture things on OpenBSD using GRE as a transport, and the idea
of maintaining this out of tree just makes me feel tired.

On Tue, Oct 29, 2019 at 06:34:50PM +1000, David Gwynne wrote:
> i've been toying with this idea of implementing GRE as a datagram
> protocol that userland can use just like UDP. the idea is to make it
> easy to support the implementation of NHRP in userland for mgre(4),
> and also for ERSPAN* support without going down the path linux took**.
> 
> so this is the result of having a go at implementing the idea. the diff
> includes several independent parts, but they all work together to make
> GRE as comfortable to use as UDP. the two main parts are the actual
> protocol implementation in src/sys/netinet/ip_gre.c, and the tweaks to
> getaddrinfo to allow the resolution of gre services. the /etc/services
> chunk gets used by the getaddrinfo bits.
> 
> so, the first chunk lets you do this (as root in userland):
> 
>   int s = socket(AF_INET, SOCK_DGRAM, IPPROTO_GRE);
> 
> that gives you a file descriptor you can then use with bind(),
> connect(), sendto(), recvfrom(), etc. you write a message to the
> kernel and it prepends the GRE and IP headers and pushes it out.
> it is set up so the GRE protocol is handed to the kernel via the
> sin_port or sin6_port member of struct sockaddr_in and sockaddr_in6
> respectively. there are no source and destination protocol fields, just
> one that both ends agree on, so if you connect then bind, your
> sockaddrs have to agree on the proto. unfortunately there's no such
> thing as a wildcard or reserved protocol in GRE, so 0 can't be used
> as a wildcard like it can in udp and tcp.
> 
> the sockets support the configuration of optional GRE headers, as
> defined in RFC 2890, using setsockopt. importantly you can enable
> the key and sequence number headers, which again, the kernel offloads
> for you.
> 
> the second chunk tweaks getaddrinfo so it lets you specify things other
> than IPPROTO_UDP and IPPROTO_TCP. protocols other than those are now
> looked up in /etc/protocols to get their name, which in turn is used to
> look up entries in /etc/services. while i was there and reading rfcs, i
> noted different behaviour for wildcarded socktypes and protocols, which
> i've tried to implement. eric@ seems generally ok with this stuff, and
> suggested the tweak to pledge to allow access to /etc/protocols using
> the dns pledge. tcp and udp are still special though, and are still
> omgoptimised.
> 
> all this together lets the program at
> https://mild.embarrassm.net/~dlg/diff/egred.c work. it is a userland
> reimplementation of a simplified egre(4) using tap(4) and a gre socket.
> the io path is literally reading from one fd and writing it to the other,
> everything else is boilerplate.
> 
> i suspect the kernel stuff is a bit rough as i haven't had to test every
> path, but it supports common functionality.
> 
> thoughts? i am pretty pleased with how this has turned out, and would be
> keen to put it in the tree and work on it some more.
> 
> * https://tools.ietf.org/html/draft-foschiano-erspan-03
> ** http://vger.kernel.org/lpc_net2018_talks/erspan-linux-presentation.pdf
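the addressing convention described above can be made concrete like this; only the sockaddr setup is shown, since a SOCK_DGRAM/IPPROTO_GRE socket only exists with the proposed kernel diff applied:

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>

#ifndef IPPROTO_GRE
#define IPPROTO_GRE 47	/* the socket itself would be opened with
			 * socket(AF_INET, SOCK_DGRAM, IPPROTO_GRE) */
#endif

/* fill in a sockaddr for the proposed gre sockets: the GRE protocol
 * type (an ethertype, e.g. 0x6558 for transparent ethernet bridging)
 * rides in sin_port.  both ends must agree on it, and there is no 0
 * wildcard like udp and tcp have. */
static void
gre_sockaddr(struct sockaddr_in *sin, in_addr_t peer, uint16_t gre_proto)
{
	memset(sin, 0, sizeof(*sin));
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = peer;
	sin->sin_port = htons(gre_proto);
}
```

the same sockaddr then works with bind(), connect(), sendto() and recvfrom(), just like udp.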
> 
> Index: etc/services
> ===
> RCS file: /cvs/src/etc/services,v
> retrieving revision 1.96
> diff -u -p -r1.96 services
> --- etc/services  27 Jan 2019 20:35:06 -  1.96
> +++ etc/services  29 Oct 2019 07:57:44 -
> @@ -332,6 +332,21 @@ spamd-cfg8026/tcp# 
> spamd(8) configur
>  dhcpd-sync   8067/udp# dhcpd(8) synchronisation
>  hunt 26740/udp   # hunt(6)
>  #
> +# GRE Protocol Types
> +#
> +keepalive0/gre   # 0x: IP tunnel keepalive
> +ipv4 2048/gre# 0x0800: IPv4
> +nhrp 8193/gre# 0x2001: Next Hop Resolution 
> Protocol
> +erspan3  8939/gre# 0x22eb: ERSPAN III
> +transether   25944/gre   ethernet# 0x6558: Trans Ether Bridging
> +ipv6 34525/gre   # 0x86dd: IPv6
> +wccp 34878/gre   # 0x883e: Web Content Cache 
> Protocol
> +mpls 34887/gre   # 0x8847: MPLS
> +#mpls34888/gre   # 0x8848: MPLS Multicast
> +erspan   35006/gre   erspan2 # 0x88be: ERSPAN I/II
> +nsh  35151/gre   # 0x894f: Network Service Header
> +control  47082/gre   # 0xb7ea: RFC 8157
>

move PIPEX from tun(4) into pppac(4), a dedicated PPP access concentrator

2020-01-21 Thread David Gwynne
claudio and i have been looking at some tun(4) semantics, and want to
move stuff around, but feel constrained because PIPEX has been very
carefully added to tun(4). the PIPEX bits make it hard to rearrange
tun(4), so this moves that functionality out into a separate driver
called pppac(4).

there's not much to say about it except that it does exactly enough for
npppd to work with. the major missing bits are a manpage (which will be
boring), and all the MAKEDEV bits.

so would anyone object if we move forward with this at the hackathon
so we can then pull PIPEX out of tun(4), and in turn start making tun(4)
less complicated?

Index: sys/arch/amd64/amd64/conf.c
===
RCS file: /cvs/src/sys/arch/amd64/amd64/conf.c,v
retrieving revision 1.65
diff -u -p -r1.65 conf.c
--- sys/arch/amd64/amd64/conf.c 17 Dec 2019 13:08:54 -  1.65
+++ sys/arch/amd64/amd64/conf.c 21 Jan 2020 08:38:59 -
@@ -299,6 +299,7 @@ struct cdevsw   cdevsw[] =
cdev_ipmi_init(NIPMI,ipmi), /* 96: ipmi */
cdev_switch_init(NSWITCH,switch), /* 97: switch(4) control interface */
cdev_fido_init(NFIDO,fido), /* 98: FIDO/U2F security keys */
+   cdev_pppx_init(NPPPX,pppac),/* 99: PPP Access Concentrator */
 };
 intnchrdev = nitems(cdevsw);
 
Index: sys/net/if_pppx.c
===
RCS file: /cvs/src/sys/net/if_pppx.c,v
retrieving revision 1.70
diff -u -p -r1.70 if_pppx.c
--- sys/net/if_pppx.c   31 Dec 2019 13:48:32 -  1.70
+++ sys/net/if_pppx.c   21 Jan 2020 08:38:59 -
@@ -1134,3 +1134,508 @@ pppx_if_ioctl(struct ifnet *ifp, u_long 
 }
 
 RBT_GENERATE(pppx_ifs, pppx_if, pxi_entry, pppx_if_cmp);
+
+/*
+ * pppac(4) - PPP Access Concentrator interface
+ */
+
+#include 
+
+struct pppac_softc {
+   struct ifnetsc_if;
+   unsigned intsc_dead;
+   dev_t   sc_dev;
+   LIST_ENTRY(pppac_softc)
+   sc_entry;
+
+   struct mutexsc_rsel_mtx;
+   struct selinfo  sc_rsel;
+   struct mutexsc_wsel_mtx;
+   struct selinfo  sc_wsel;
+
+   struct pipex_iface_context
+   sc_pipex_iface;
+
+   struct mbuf_queue
+   sc_mq;
+};
+
+LIST_HEAD(pppac_list, pppac_softc);
+
+static voidfilt_pppac_rdetach(struct knote *);
+static int filt_pppac_read(struct knote *, long);
+
+static const struct filterops pppac_rd_filtops = {
+   1,
+   NULL,
+   filt_pppac_rdetach,
+   filt_pppac_read
+};
+
+static voidfilt_pppac_wdetach(struct knote *);
+static int filt_pppac_write(struct knote *, long);
+
+static const struct filterops pppac_wr_filtops = {
+   1,
+   NULL,
+   filt_pppac_wdetach,
+   filt_pppac_write
+};
+
+static struct pppac_list pppac_devs = LIST_HEAD_INITIALIZER(pppac_devs);
+
+static int pppac_ioctl(struct ifnet *, u_long, caddr_t);
+
+static int pppac_output(struct ifnet *, struct mbuf *, struct sockaddr *,
+   struct rtentry *);
+static voidpppac_start(struct ifnet *);
+
+static inline struct pppac_softc *
+pppac_lookup(dev_t dev)
+{
+   struct pppac_softc *sc;
+
+   LIST_FOREACH(sc, _devs, sc_entry) {
+   if (sc->sc_dev == dev)
+   return (sc);
+   }
+
+   return (NULL);
+}
+
+void
+pppacattach(int n)
+{
+   pipex_init(); /* to be sure, to be sure */
+}
+
+int
+pppacopen(dev_t dev, int flags, int mode, struct proc *p)
+{
+   struct pppac_softc *sc;
+   struct ifnet *ifp;
+
+   sc = pppac_lookup(dev);
+   if (sc != NULL)
+   return (EBUSY);
+
+   sc = malloc(sizeof(*sc), M_DEVBUF, M_WAITOK|M_ZERO);
+   sc->sc_dev = dev;
+
+   mtx_init(>sc_rsel_mtx, IPL_SOFTNET);
+   mtx_init(>sc_wsel_mtx, IPL_SOFTNET);
+   mq_init(>sc_mq, IFQ_MAXLEN, IPL_SOFTNET);
+
+   LIST_INSERT_HEAD(_devs, sc, sc_entry);
+
+   ifp = >sc_if;
+   snprintf(ifp->if_xname, sizeof(ifp->if_xname), "pppac%u", minor(dev));
+
+   ifp->if_softc = sc;
+   ifp->if_type = IFT_L3IPVLAN;
+   ifp->if_hdrlen = sizeof(uint32_t); /* for BPF */;
+   ifp->if_mtu = MAXMCLBYTES - sizeof(uint32_t);
+   ifp->if_flags = IFF_SIMPLEX | IFF_BROADCAST;
+   ifp->if_xflags = IFXF_CLONED;
+   ifp->if_rtrequest = p2p_rtrequest; /* XXX */
+   ifp->if_output = pppac_output;
+   ifp->if_start = pppac_start;
+   ifp->if_ioctl = pppac_ioctl;
+
+   if_counters_alloc(ifp);
+   if_attach(ifp);
+   if_alloc_sadl(ifp);
+
+#if NBPFILTER > 0
+   bpfattach(>if_bpf, ifp, DLT_LOOP, sizeof(uint32_t));
+#endif
+
+   pipex_iface_init(>sc_pipex_iface, ifp);
+
+   return (0);
+}
+
+int
+pppacread(dev_t dev, struct uio *uio, int ioflag)
+{
+   struct pppac_softc *sc = pppac_lookup(dev);
+   struct ifnet *ifp = >sc_if;
+   struct mbuf *m0, *m;
+   int error = 0;
+   size_t len;
+
+   

Re: massage tcpdump ip and encapsulation output

2019-12-12 Thread David Gwynne
so this should go in and i can keep improving things in the tree?

dlg

> On 6 Dec 2019, at 9:45 pm, Claudio Jeker  wrote:
> 
> On Fri, Dec 06, 2019 at 12:16:09PM +0100, Sebastian Benoit wrote:
>> David Gwynne(da...@gwynne.id.au) on 2019.12.06 15:14:42 +1000:
>>> 
>>> 
>>>> On 5 Dec 2019, at 21:14, Sebastian Benoit  wrote:
>>>> 
>>>> Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.12.05 09:53:49 +0100:
>>>>> I would suggest to just pack most of the headers into one group of ().
>>>>> 
>>>>> IPv4 ttl 1 [tos 0x20] 10.0.127.15 > 10.0.127.1
>>>>> would become
>>>>> IPv4 (ttl 1 tos 0x20) 10.0.127.15 > 10.0.127.1
>>>>> and
>>>>> IPv4 ttl 1 [tos 0x20] (id 39958, len 84) 10.0.127.15 > 10.0.127.1
>>>>> would become
>>>>> IPv4 (ttl 1 tos 0x20 id 39958 len 84) 10.0.127.15 > 10.0.127.1
>>>>> 
>>>>> Maybe add the commas if that is easy to do.
>>>> 
>>>> its more readable with commas, i think
>>> 
>>> do you want me to come up with something in this space as part of the
>>> large diff, or is the large change generally ok and we can tinker with
>>> this stuff afterward?
>> 
>> It was just a comment on the readability of lists like that.
>> I like your idea, please proceed whichever way you like.
>> 
>>> 
>>> there's some concern that what i'm proposing is too radical and will break
>>> people's muscle memory.
> 
> The output of tcpdump depends on the version and OS it is used on.
> IMO the important bits that people normally scan for are the IPs, port
> numbers, some of the TCP seq numbers or similar protocol specific data.
> To make this scanning easier I suggested to reduce the line noise of the
> IP header by reducing the amount of different () and [] sequences giving
> the eye a way to skip over that chunk quickly.
> 
> I think the new format is better and people need to retrain a bit but
> again we should not make it harder than necessary.
> 
> For me your work can go in as long as the further improvements as
> discussed here follow.
> -- 
> :wq Claudio



Re: massage tcpdump ip and encapsulation output

2019-12-05 Thread David Gwynne
On Fri, Dec 06, 2019 at 03:14:42PM +1000, David Gwynne wrote:
> 
> 
> > On 5 Dec 2019, at 21:14, Sebastian Benoit  wrote:
> > 
> > Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.12.05 09:53:49 +0100:
> >> I would suggest to just pack most of the headers into one group of ().
> >> 
> >> IPv4 ttl 1 [tos 0x20] 10.0.127.15 > 10.0.127.1
> >> would become
> >> IPv4 (ttl 1 tos 0x20) 10.0.127.15 > 10.0.127.1
> >> and
> >> IPv4 ttl 1 [tos 0x20] (id 39958, len 84) 10.0.127.15 > 10.0.127.1
> >> would become
> >> IPv4 (ttl 1 tos 0x20 id 39958 len 84) 10.0.127.15 > 10.0.127.1
> >> 
> >> Maybe add the commas if that is easy to do.
> > 
> > its more readable with commas, i think
> 
> do you want me to come up with something in this space as part of the large 
> diff, or is the large change generally ok and we can tinker with this stuff 
> afterward?
> 
> there's some concern that what i'm proposing is too radical and will break 
> people's muscle memory.

fyi, here's what stock (or apple tweaked) tcpdump looks like for a
similar set of packets:

dlg@fatmac Temp$ tcpdump -V
tcpdump: option requires an argument -- V
tcpdump version tcpdump version 4.9.2 -- Apple version 83.200.2
libpcap version 1.8.1 -- Apple version 79.250.1
LibreSSL 2.2.7
Usage: tcpdump [-aAbdDefhHIJKlLnNOpqStuUvxX#] [ -B size ] [ -c count ]
[ -C file_size ] [ -E algo:secret ] [ -F file ] [ -G seconds ]
[ -i interface ] [ -j tstamptype ] [ -M secret ] [ --number ]
[ -Q in|out|inout ]
[ -r file ] [ -s snaplen ] [ --time-stamp-precision precision ]
[ --immediate-mode ] [ -T type ] [ --version ] [ -V file ]
[ -w file ] [ -W filecount ] [ -y datalinktype ] [ -z 
postrotate-command ]
[ -g ] [ -k ] [ -o ] [ -P ] [ -Q met[ --time-zone-offset offset ]
[ -Z user ] [ expression ]

dlg@fatmac Temp$ tcpdump -nr ping.pcap
reading from file ping.pcap, link-type EN10MB (Ethernet)
16:31:18.836620 IP 10.0.127.15 > 10.0.127.1: ICMP echo request, id 46495, seq 
0, length 64
16:31:18.837074 IP 10.0.127.1 > 10.0.127.15: ICMP echo reply, id 46495, seq 0, 
length 64
dlg@fatmac Temp$ tcpdump -nr ping.pcap -v
reading from file ping.pcap, link-type EN10MB (Ethernet)
16:31:18.836620 IP (tos 0x20, ttl 1, id 39958, offset 0, flags [none], proto 
ICMP (1), length 84)
10.0.127.15 > 10.0.127.1: ICMP echo request, id 46495, seq 0, length 64
16:31:18.837074 IP (tos 0x20, ttl 255, id 36919, offset 0, flags [none], proto 
ICMP (1), length 84)
10.0.127.1 > 10.0.127.15: ICMP echo reply, id 46495, seq 0, length 64

dlg@fatmac Temp$ tcpdump -nr ipv6-udp-fragmented.pcap
reading from file ipv6-udp-fragmented.pcap, link-type EN10MB (Ethernet)
05:35:13.312348 IP6 2607:f010:3f9::11:0.6363 > 2607:f010:3f9::1001.6363: UDP, 
length 118
05:35:13.549553 IP6 2607:f010:3f9::11:0.6363 > 2607:f010:3f9::1001.6363: UDP, 
length 31
05:35:13.569339 IP6 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag (0|1448) 
6363 > 6363: UDP, bad length 5379 > 1440
05:35:13.569345 IP6 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag (1448|1448)
05:35:13.569346 IP6 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag (2896|1448)
05:35:13.569349 IP6 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag (4344|1043)
dlg@fatmac Temp$ tcpdump -nr ipv6-udp-fragmented.pcap -v
reading from file ipv6-udp-fragmented.pcap, link-type EN10MB (Ethernet)
05:35:13.312348 IP6 (hlim 64, next-header UDP (17) payload length: 126) 
2607:f010:3f9::11:0.6363 > 2607:f010:3f9::1001.6363: [udp sum ok] UDP, length 
118
05:35:13.549553 IP6 (hlim 64, next-header UDP (17) payload length: 39) 
2607:f010:3f9::11:0.6363 > 2607:f010:3f9::1001.6363: [udp sum ok] UDP, length 31
05:35:13.569339 IP6 (flowlabel 0x21289, hlim 64, next-header Fragment (44) 
payload length: 1456) 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag 
(0xf88eb466:0|1448) 6363 > 6363: UDP, bad length 5379 > 1440
05:35:13.569345 IP6 (flowlabel 0x21289, hlim 64, next-header Fragment (44) 
payload length: 1456) 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag 
(0xf88eb466:1448|1448)
05:35:13.569346 IP6 (flowlabel 0x21289, hlim 64, next-header Fragment (44) 
payload length: 1456) 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag 
(0xf88eb466:2896|1448)
05:35:13.569349 IP6 (flowlabel 0x21289, hlim 64, next-header Fragment (44) 
payload length: 1051) 2607:f010:3f9::1001 > 2607:f010:3f9::11:0: frag 
(0xf88eb466:4344|1043)


dlg@fatmac Temp$ tcpdump -nr udp-frag.pcap
reading from file udp-frag.pcap, link-type EN10MB (Ethernet)
20:34:42.184788 IP 10.0.127.15.20550 > 10.0.127.1.6363: UDP, bad length 6000 > 
1472
20:34:42.184789 IP 10.0.127.15 > 10.0.127.1: ip-proto-17
20:34:42.184790 IP 10.0.127.15 > 10.0.127.1: ip-proto-17
20:34:42.184791 IP 10.0.127.15 > 

Re: massage tcpdump ip and encapsulation output

2019-12-05 Thread David Gwynne



> On 5 Dec 2019, at 21:14, Sebastian Benoit  wrote:
> 
> Claudio Jeker(cje...@diehard.n-r-g.com) on 2019.12.05 09:53:49 +0100:
>> I would suggest to just pack most of the headers into one group of ().
>> 
>> IPv4 ttl 1 [tos 0x20] 10.0.127.15 > 10.0.127.1
>> would become
>> IPv4 (ttl 1 tos 0x20) 10.0.127.15 > 10.0.127.1
>> and
>> IPv4 ttl 1 [tos 0x20] (id 39958, len 84) 10.0.127.15 > 10.0.127.1
>> would become
>> IPv4 (ttl 1 tos 0x20 id 39958 len 84) 10.0.127.15 > 10.0.127.1
>> 
>> Maybe add the commas if that is easy to do.
> 
> its more readable with commas, i think

do you want me to come up with something in this space as part of the large 
diff, or is the large change generally ok and we can tinker with this stuff 
afterward?

there's some concern that what i'm proposing is too radical and will break 
people's muscle memory.
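for concreteness, the grouped format being discussed could be produced like this (a toy sketch of the proposed style, not tcpdump's actual printer):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* collect the IP header fields into one comma-separated "(...)" group
 * before the addresses, per the proposal in this thread */
static void
fmt_ipv4(char *buf, size_t sz, int ttl, int tos, int id, int len,
    const char *src, const char *dst)
{
	snprintf(buf, sz, "IPv4 (ttl %d, tos 0x%x, id %d, len %d) %s > %s",
	    ttl, tos, id, len, src, dst);
}
```

which renders the earlier example as "IPv4 (ttl 1, tos 0x20, id 39958, len 84) 10.0.127.15 > 10.0.127.1".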

