Re: rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Claudio Jeker
On Tue, Mar 22, 2022 at 06:35:47PM +0100, Alexander Bluhm wrote:
> On Tue, Mar 22, 2022 at 04:42:45PM +0100, Claudio Jeker wrote:
> > No but you push this layer into a specifc direction and by that make it
> > harder to fix the PCB tables in a different way. I just see people
> > changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
> > the exclusive netlock use.
> 
> Of course MP for PCB could be implemented differently.  SRP, SMR,
> whatever.  But mutex is easy and we should start with that.  As we
> currently run only on one CPU, it does not matter.

Mutexes are easy but they also tend to cause deadlocks and lock order
issues when used in more complex situation. It makes sense to use them for
PCBs but maybe something else is needed for the PCB lookup tables. The
locking dance done in FreeBSD in that area of code is rather complicated.
 
> There is much more to be done like ref counting and protecting the
> PCB fields.  But I want to go in small steps.  This NET_ASSERT_WLOCKED()
> beside SIMPLEQ_INIT() makes it quite obvious where the next unlocking
> problem is.  Look in netinet/ip_ipsp.c tdb_walk(), there is another
> one.  When they will be only left in the slow path a lot is gained.
> And if not, we have to fix them step by step.

Agreed. There is a lot of work to be done and not enough hands to actually
work on them.
 
> > Looking at the pcb hash problem, I have to wonder if this reinserting of
> > PCBs is actually resulting in a measurable performance difference. The
> > hash table should be large enough to keep the number of PCB per bucket low. 
> 
> The reinsertion is done for PCB notify, UDP multicast, Raw IPv4 and
> IPv6.  I don't have benchmarks for these cases and I doubt that
> others will feel much difference there.
> 
> There is one thing that might make things slower.  The in_pcbref(inp)
> in_pcbunref(inp) is not strictly necessary, we have exclusive net
> lock.  But I put it there so we will not forget it when unlocking.
> Maybe it costs a bit of performance, but who cares about multicast
> and rip.

Oh, I was talking about the mail you referenced where in_pcbhashlookup()
does this:
if (inp != LIST_FIRST(head)) {
LIST_REMOVE(inp, inp_hash);
LIST_INSERT_HEAD(head, inp, inp_hash);
}

I highly doubt that this improves throughput in most cases (even with e.g.
100k active connections).
 
> I have measured the PCB mutex diff.  But as usual benchmarks have
> to be explained.
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/perform.html
> 
> Left column is baseline, middle column is as mistake, where I only
> applied the IPv4 part.  Right columns is the full PCB diff, but
> without UDP Multicast and Raw IP queuing, which I missed before.
> 
> In TCP Perfomance graph (IPv4 only), the right column looks slow.
> But the code difference between middle and right only affects IPv6.
> How can that be?  The answer is in this row:
> 
> kernel name list  +53 -52 +40054 -40054
> 
> Mapping the kernel object files to different pages affects throughput
> a lot, more than most diffs.  So I sort and align them and compare
> the nm /bsd output.  When you click on it you see how 4 symbol
> addresses move around.
> 
> So the more or less correct numbers are in the middle column but
> only for IPv4.
> 
> Look at the kstack output of one TCP benchmark.
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/iperf3_-c10.3.45.35_-w1m_-t10_-R-btrace-kstack.0.svg
> 
> Search for mtx in the right top field.  Then you see mutex contension.
> They are not in PCB lockup as the pf state links to the socket.
> 
> When receiving short UDP packets you can see the affected code:
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
> 
> We are 4.3% in PCB lookup.  And in that part you find 10% in mutex.
> 
> Compare it to the orignal code, this mutex is not there:
> 
> http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/2022-03-10T00%3A00%3A00Z/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg
> 
> But UDP thoughput numbers do not change.
 
I do not expect this diff to make a big difference for performance and
your analysis shows this. To get forward we need to accept that some
benchmarks will suffer before they get better again. A lot of code needs
to be changed and not all of it can happen in one go.

> > One comment below.
> > At least the one bit that can fail can be moved outside of this loop:
> 
> This is a very good idea and can be done upfront.
> 
> ok?

OK.
 
> bluhm
> 
> Index: netinet6/raw_ip6.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
> retrieving revision 1.145
> diff 

Re: rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Alexander Bluhm
On Tue, Mar 22, 2022 at 04:42:45PM +0100, Claudio Jeker wrote:
> No but you push this layer into a specifc direction and by that make it
> harder to fix the PCB tables in a different way. I just see people
> changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
> the exclusive netlock use.

Of course MP for PCB could be implemented differently.  SRP, SMR,
whatever.  But mutex is easy and we should start with that.  As we
currently run only on one CPU, it does not matter.

There is much more to be done like ref counting and protecting the
PCB fields.  But I want to go in small steps.  This NET_ASSERT_WLOCKED()
beside SIMPLEQ_INIT() makes it quite obvious where the next unlocking
problem is.  Look in netinet/ip_ipsp.c tdb_walk(), there is another
one.  When they will be only left in the slow path a lot is gained.
And if not, we have to fix them step by step.

> Looking at the pcb hash problem, I have to wonder if this reinserting of
> PCBs is actually resulting in a measurable performance difference. The
> hash table should be large enough to keep the number of PCB per bucket low. 

The reinsertion is done for PCB notify, UDP multicast, Raw IPv4 and
IPv6.  I don't have benchmarks for these cases and I doubt that
others will feel much difference there.

There is one thing that might make things slower.  The in_pcbref(inp)
in_pcbunref(inp) is not strictly necessary, we have exclusive net
lock.  But I put it there so we will not forget it when unlocking.
Maybe it costs a bit of performance, but who cares about multicast
and rip.

I have measured the PCB mutex diff.  But as usual benchmarks have
to be explained.

http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/perform.html

Left column is baseline, middle column is as mistake, where I only
applied the IPv4 part.  Right columns is the full PCB diff, but
without UDP Multicast and Raw IP queuing, which I missed before.

In TCP Perfomance graph (IPv4 only), the right column looks slow.
But the code difference between middle and right only affects IPv6.
How can that be?  The answer is in this row:

kernel name list+53 -52 +40054 -40054

Mapping the kernel object files to different pages affects throughput
a lot, more than most diffs.  So I sort and align them and compare
the nm /bsd output.  When you click on it you see how 4 symbol
addresses move around.

So the more or less correct numbers are in the middle column but
only for IPv4.

Look at the kstack output of one TCP benchmark.

http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/iperf3_-c10.3.45.35_-w1m_-t10_-R-btrace-kstack.0.svg

Search for mtx in the right top field.  Then you see mutex contension.
They are not in PCB lockup as the pf state links to the socket.

When receiving short UDP packets you can see the affected code:

http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/patch-sys-pcbtable-mtx.0/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg

We are 4.3% in PCB lookup.  And in that part you find 10% in mutex.

Compare it to the orignal code, this mutex is not there:

http://bluhm.genua.de/perform/results/2022-03-10T17%3A19%3A00Z/2022-03-10T00%3A00%3A00Z/btrace/udpbench_-l36_-t10_-r_ot15_recv_10.3.45.34-btrace-kstack.0.svg

But UDP thoughput numbers do not change.


> One comment below.
> At least the one bit that can fail can be moved outside of this loop:

This is a very good idea and can be done upfront.

ok?

bluhm

Index: netinet6/raw_ip6.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.145
diff -u -p -r1.145 raw_ip6.c
--- netinet6/raw_ip6.c  21 Mar 2022 09:12:34 -  1.145
+++ netinet6/raw_ip6.c  22 Mar 2022 16:22:43 -
@@ -125,10 +125,19 @@ rip6_input(struct mbuf **mp, int *offp, 
struct in6_addr *key;
struct sockaddr_in6 rip6src;
struct mbuf *opts = NULL;
+   uint8_t type;
 
KASSERT(af == AF_INET6);
 
-   if (proto != IPPROTO_ICMPV6)
+   if (proto == IPPROTO_ICMPV6) {
+   struct icmp6_hdr *icmp6;
+
+   IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
+   sizeof(*icmp6));
+   if (icmp6 == NULL)
+   return IPPROTO_DONE;
+   type = icmp6->icmp6_type;
+   } else
rip6stat_inc(rip6s_ipackets);
 
bzero(, sizeof(rip6src));
@@ -177,16 +186,7 @@ rip6_input(struct mbuf **mp, int *offp, 
!IN6_ARE_ADDR_EQUAL(>inp_faddr6, >ip6_src))
continue;
if (proto == IPPROTO_ICMPV6 && in6p->inp_icmp6filt) {
-   struct icmp6_hdr *icmp6;
-
-   IP6_EXTHDR_GET(icmp6, struct icmp6_hdr *, m, *offp,
-   sizeof(*icmp6));
-   if (icmp6 == NULL) {
-   

Re: rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Claudio Jeker
On Tue, Mar 22, 2022 at 02:56:43PM +0100, Alexander Bluhm wrote:
> On Tue, Mar 22, 2022 at 02:25:08PM +0100, Claudio Jeker wrote:
> > On Tue, Mar 22, 2022 at 02:09:51PM +0100, Alexander Bluhm wrote:
> > > Hi,
> > > 
> > > syzkaller and witness found the same bug I introduced in UDP also
> > > for Raw IP.  Fix it the same was for rip and rip6.
> > > 
> > > https://syzkaller.appspot.com/bug?extid=9bac6356a881dc644265
> > > https://syzkaller.appspot.com/bug?extid=5b2679ee9be0895d26f9
> > > 
> > > ok?
> > 
> > Absolutly not a fan of this "fix". It just moves the landmine that is
> > about to explode a bit further to the left for the next person to step on.
> > The moment someone tries to run these input handlers in parallel all of
> > this will blow up. It is a workaround for now but how will we get out of
> > this in the future when the code runs in parallel up to the socket
> > layer?
> 
> Moving the problem around is the only way to make any progress.
> 
> The bug with MP forwarding I try to solve is this one.
> https://marc.info/?l=openbsd-tech=163857624429253=2
> 
> After 4 months of ideas that were denied by different people, I
> came to this solution.  Put a mutex around PCB tables.  I think
> this is necessary anyway if we want to reach parallel protocol
> processing.  Unfortunately I missed 3 of 4 places where I hold the
> mutex too long.  I am trying to fix the last 2 of them.
> 
> I do not want to delay parallel forwaring until parallel protocol
> layer is finished.  Then neither will happen.  If someone is working
> on parallel protocols, this code will blow up due to NET_ASSERT_WLOCKED().
> It has to be fixed then.  My change is delaying work to make progress
> elsewhere.  We cannot solve everything in a big commit.
> 
> Do you have a better idea?
 
No but you push this layer into a specifc direction and by that make it
harder to fix the PCB tables in a different way. I just see people
changing the NET_ASSERT_WLOCKED() without realizing the actual reason for
the exclusive netlock use.

Looking at the pcb hash problem, I have to wonder if this reinserting of
PCBs is actually resulting in a measurable performance difference. The
hash table should be large enough to keep the number of PCB per bucket low. 

One comment below.
-- 
:wq Claudio

> > > Index: netinet/raw_ip.c
> > > ===
> > > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/raw_ip.c,v
> > > retrieving revision 1.125
> > > diff -u -p -r1.125 raw_ip.c
> > > --- netinet/raw_ip.c  21 Mar 2022 09:12:34 -  1.125
> > > +++ netinet/raw_ip.c  22 Mar 2022 12:59:05 -
> > > @@ -122,9 +122,9 @@ rip_input(struct mbuf **mp, int *offp, i
> > >  {
> > >   struct mbuf *m = *mp;
> > >   struct ip *ip = mtod(m, struct ip *);
> > > - struct inpcb *inp, *last = NULL;
> > > + struct inpcb *inp;
> > > + SIMPLEQ_HEAD(, inpcb) inpcblist;
> > >   struct in_addr *key;
> > > - struct mbuf *opts = NULL;
> > >   struct counters_ref ref;
> > >   uint64_t *counters;
> > >  
> > > @@ -150,7 +150,8 @@ rip_input(struct mbuf **mp, int *offp, i
> > >   }
> > >   }
> > >  #endif
> > > - NET_ASSERT_LOCKED();
> > > + NET_ASSERT_WLOCKED();
> > > + SIMPLEQ_INIT();
> > >   mtx_enter(_mtx);
> > >   TAILQ_FOREACH(inp, _queue, inp_queue) {
> > >   if (inp->inp_socket->so_state & SS_CANTRCVMORE)
> > > @@ -171,41 +172,16 @@ rip_input(struct mbuf **mp, int *offp, i
> > >   if (inp->inp_faddr.s_addr &&
> > >   inp->inp_faddr.s_addr != ip->ip_src.s_addr)
> > >   continue;
> > > - if (last) {
> > > - struct mbuf *n;
> > >  
> > > - if ((n = m_copym(m, 0, M_COPYALL, M_NOWAIT)) != NULL) {
> > > - if (last->inp_flags & INP_CONTROLOPTS ||
> > > - last->inp_socket->so_options & SO_TIMESTAMP)
> > > - ip_savecontrol(last, , ip, n);
> > > - if (sbappendaddr(last->inp_socket,
> > > - >inp_socket->so_rcv,
> > > - sintosa(), n, opts) == 0) {
> > > - /* should notify about lost packet */
> > > - m_freem(n);
> > > - m_freem(opts);
> > > - } else
> > > - sorwakeup(last->inp_socket);
> > > - opts = NULL;
> > > - }
> > > - }
> > > - last = inp;
> > > + in_pcbref(inp);
> > > + SIMPLEQ_INSERT_TAIL(, inp, inp_notify);
> > >   }
> > >   mtx_leave(_mtx);
> > >  
> > > - if (last) {
> > > - if (last->inp_flags & INP_CONTROLOPTS ||
> > > - last->inp_socket->so_options & SO_TIMESTAMP)
> > > - ip_savecontrol(last, , ip, m);
> > > - if (sbappendaddr(last->inp_socket, >inp_socket->so_rcv,
> > > - sintosa(), 

Re: rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Alexander Bluhm
On Tue, Mar 22, 2022 at 02:25:08PM +0100, Claudio Jeker wrote:
> On Tue, Mar 22, 2022 at 02:09:51PM +0100, Alexander Bluhm wrote:
> > Hi,
> > 
> > syzkaller and witness found the same bug I introduced in UDP also
> > for Raw IP.  Fix it the same was for rip and rip6.
> > 
> > https://syzkaller.appspot.com/bug?extid=9bac6356a881dc644265
> > https://syzkaller.appspot.com/bug?extid=5b2679ee9be0895d26f9
> > 
> > ok?
> 
> Absolutly not a fan of this "fix". It just moves the landmine that is
> about to explode a bit further to the left for the next person to step on.
> The moment someone tries to run these input handlers in parallel all of
> this will blow up. It is a workaround for now but how will we get out of
> this in the future when the code runs in parallel up to the socket
> layer?

Moving the problem around is the only way to make any progress.

The bug with MP forwarding I try to solve is this one.
https://marc.info/?l=openbsd-tech=163857624429253=2

After 4 months of ideas that were denied by different people, I
came to this solution.  Put a mutex around PCB tables.  I think
this is necessary anyway if we want to reach parallel protocol
processing.  Unfortunately I missed 3 of 4 places where I hold the
mutex too long.  I am trying to fix the last 2 of them.

I do not want to delay parallel forwaring until parallel protocol
layer is finished.  Then neither will happen.  If someone is working
on parallel protocols, this code will blow up due to NET_ASSERT_WLOCKED().
It has to be fixed then.  My change is delaying work to make progress
elsewhere.  We cannot solve everything in a big commit.

Do you have a better idea?

bluhm

> > Index: netinet/raw_ip.c
> > ===
> > RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/raw_ip.c,v
> > retrieving revision 1.125
> > diff -u -p -r1.125 raw_ip.c
> > --- netinet/raw_ip.c21 Mar 2022 09:12:34 -  1.125
> > +++ netinet/raw_ip.c22 Mar 2022 12:59:05 -
> > @@ -122,9 +122,9 @@ rip_input(struct mbuf **mp, int *offp, i
> >  {
> > struct mbuf *m = *mp;
> > struct ip *ip = mtod(m, struct ip *);
> > -   struct inpcb *inp, *last = NULL;
> > +   struct inpcb *inp;
> > +   SIMPLEQ_HEAD(, inpcb) inpcblist;
> > struct in_addr *key;
> > -   struct mbuf *opts = NULL;
> > struct counters_ref ref;
> > uint64_t *counters;
> >  
> > @@ -150,7 +150,8 @@ rip_input(struct mbuf **mp, int *offp, i
> > }
> > }
> >  #endif
> > -   NET_ASSERT_LOCKED();
> > +   NET_ASSERT_WLOCKED();
> > +   SIMPLEQ_INIT();
> > mtx_enter(_mtx);
> > TAILQ_FOREACH(inp, _queue, inp_queue) {
> > if (inp->inp_socket->so_state & SS_CANTRCVMORE)
> > @@ -171,41 +172,16 @@ rip_input(struct mbuf **mp, int *offp, i
> > if (inp->inp_faddr.s_addr &&
> > inp->inp_faddr.s_addr != ip->ip_src.s_addr)
> > continue;
> > -   if (last) {
> > -   struct mbuf *n;
> >  
> > -   if ((n = m_copym(m, 0, M_COPYALL, M_NOWAIT)) != NULL) {
> > -   if (last->inp_flags & INP_CONTROLOPTS ||
> > -   last->inp_socket->so_options & SO_TIMESTAMP)
> > -   ip_savecontrol(last, , ip, n);
> > -   if (sbappendaddr(last->inp_socket,
> > -   >inp_socket->so_rcv,
> > -   sintosa(), n, opts) == 0) {
> > -   /* should notify about lost packet */
> > -   m_freem(n);
> > -   m_freem(opts);
> > -   } else
> > -   sorwakeup(last->inp_socket);
> > -   opts = NULL;
> > -   }
> > -   }
> > -   last = inp;
> > +   in_pcbref(inp);
> > +   SIMPLEQ_INSERT_TAIL(, inp, inp_notify);
> > }
> > mtx_leave(_mtx);
> >  
> > -   if (last) {
> > -   if (last->inp_flags & INP_CONTROLOPTS ||
> > -   last->inp_socket->so_options & SO_TIMESTAMP)
> > -   ip_savecontrol(last, , ip, m);
> > -   if (sbappendaddr(last->inp_socket, >inp_socket->so_rcv,
> > -   sintosa(), m, opts) == 0) {
> > -   m_freem(m);
> > -   m_freem(opts);
> > -   } else
> > -   sorwakeup(last->inp_socket);
> > -   } else {
> > +   if (SIMPLEQ_EMPTY()) {
> > if (ip->ip_p != IPPROTO_ICMP)
> > -   icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL, 0, 
> > 0);
> > +   icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL,
> > +   0, 0);
> > else
> > m_freem(m);
> >  
> > @@ -213,6 +189,30 @@ rip_input(struct mbuf **mp, int *offp, i
> > counters[ips_noproto]++;
> > 

Re: rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Claudio Jeker
On Tue, Mar 22, 2022 at 02:09:51PM +0100, Alexander Bluhm wrote:
> Hi,
> 
> syzkaller and witness found the same bug I introduced in UDP also
> for Raw IP.  Fix it the same was for rip and rip6.
> 
> https://syzkaller.appspot.com/bug?extid=9bac6356a881dc644265
> https://syzkaller.appspot.com/bug?extid=5b2679ee9be0895d26f9
> 
> ok?

Absolutly not a fan of this "fix". It just moves the landmine that is
about to explode a bit further to the left for the next person to step on.
The moment someone tries to run these input handlers in parallel all of
this will blow up. It is a workaround for now but how will we get out of
this in the future when the code runs in parallel up to the socket
layer?
 
> bluhm
> 
> Index: netinet/raw_ip.c
> ===
> RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/raw_ip.c,v
> retrieving revision 1.125
> diff -u -p -r1.125 raw_ip.c
> --- netinet/raw_ip.c  21 Mar 2022 09:12:34 -  1.125
> +++ netinet/raw_ip.c  22 Mar 2022 12:59:05 -
> @@ -122,9 +122,9 @@ rip_input(struct mbuf **mp, int *offp, i
>  {
>   struct mbuf *m = *mp;
>   struct ip *ip = mtod(m, struct ip *);
> - struct inpcb *inp, *last = NULL;
> + struct inpcb *inp;
> + SIMPLEQ_HEAD(, inpcb) inpcblist;
>   struct in_addr *key;
> - struct mbuf *opts = NULL;
>   struct counters_ref ref;
>   uint64_t *counters;
>  
> @@ -150,7 +150,8 @@ rip_input(struct mbuf **mp, int *offp, i
>   }
>   }
>  #endif
> - NET_ASSERT_LOCKED();
> + NET_ASSERT_WLOCKED();
> + SIMPLEQ_INIT();
>   mtx_enter(_mtx);
>   TAILQ_FOREACH(inp, _queue, inp_queue) {
>   if (inp->inp_socket->so_state & SS_CANTRCVMORE)
> @@ -171,41 +172,16 @@ rip_input(struct mbuf **mp, int *offp, i
>   if (inp->inp_faddr.s_addr &&
>   inp->inp_faddr.s_addr != ip->ip_src.s_addr)
>   continue;
> - if (last) {
> - struct mbuf *n;
>  
> - if ((n = m_copym(m, 0, M_COPYALL, M_NOWAIT)) != NULL) {
> - if (last->inp_flags & INP_CONTROLOPTS ||
> - last->inp_socket->so_options & SO_TIMESTAMP)
> - ip_savecontrol(last, , ip, n);
> - if (sbappendaddr(last->inp_socket,
> - >inp_socket->so_rcv,
> - sintosa(), n, opts) == 0) {
> - /* should notify about lost packet */
> - m_freem(n);
> - m_freem(opts);
> - } else
> - sorwakeup(last->inp_socket);
> - opts = NULL;
> - }
> - }
> - last = inp;
> + in_pcbref(inp);
> + SIMPLEQ_INSERT_TAIL(, inp, inp_notify);
>   }
>   mtx_leave(_mtx);
>  
> - if (last) {
> - if (last->inp_flags & INP_CONTROLOPTS ||
> - last->inp_socket->so_options & SO_TIMESTAMP)
> - ip_savecontrol(last, , ip, m);
> - if (sbappendaddr(last->inp_socket, >inp_socket->so_rcv,
> - sintosa(), m, opts) == 0) {
> - m_freem(m);
> - m_freem(opts);
> - } else
> - sorwakeup(last->inp_socket);
> - } else {
> + if (SIMPLEQ_EMPTY()) {
>   if (ip->ip_p != IPPROTO_ICMP)
> - icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL, 0, 
> 0);
> + icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL,
> + 0, 0);
>   else
>   m_freem(m);
>  
> @@ -213,6 +189,30 @@ rip_input(struct mbuf **mp, int *offp, i
>   counters[ips_noproto]++;
>   counters[ips_delivered]--;
>   counters_leave(, ipcounters);
> + }
> +
> + while ((inp = SIMPLEQ_FIRST()) != NULL) {
> + struct mbuf *n, *opts = NULL;
> +
> + SIMPLEQ_REMOVE_HEAD(, inp_notify);
> + if (SIMPLEQ_EMPTY())
> + n = m;
> + else
> + n = m_copym(m, 0, M_COPYALL, M_NOWAIT);
> + if (n != NULL) {
> + if (inp->inp_flags & INP_CONTROLOPTS ||
> + inp->inp_socket->so_options & SO_TIMESTAMP)
> + ip_savecontrol(inp, , ip, n);
> + if (sbappendaddr(inp->inp_socket,
> + >inp_socket->so_rcv,
> + sintosa(), n, opts) == 0) {
> + /* should notify about lost packet */
> + m_freem(n);
> + m_freem(opts);
> + } else
> +  

rip sbappendaddr() with inpcb table mutex

2022-03-22 Thread Alexander Bluhm
Hi,

syzkaller and witness found the same bug I introduced in UDP also
for Raw IP.  Fix it the same was for rip and rip6.

https://syzkaller.appspot.com/bug?extid=9bac6356a881dc644265
https://syzkaller.appspot.com/bug?extid=5b2679ee9be0895d26f9

ok?

bluhm

Index: netinet/raw_ip.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet/raw_ip.c,v
retrieving revision 1.125
diff -u -p -r1.125 raw_ip.c
--- netinet/raw_ip.c21 Mar 2022 09:12:34 -  1.125
+++ netinet/raw_ip.c22 Mar 2022 12:59:05 -
@@ -122,9 +122,9 @@ rip_input(struct mbuf **mp, int *offp, i
 {
struct mbuf *m = *mp;
struct ip *ip = mtod(m, struct ip *);
-   struct inpcb *inp, *last = NULL;
+   struct inpcb *inp;
+   SIMPLEQ_HEAD(, inpcb) inpcblist;
struct in_addr *key;
-   struct mbuf *opts = NULL;
struct counters_ref ref;
uint64_t *counters;
 
@@ -150,7 +150,8 @@ rip_input(struct mbuf **mp, int *offp, i
}
}
 #endif
-   NET_ASSERT_LOCKED();
+   NET_ASSERT_WLOCKED();
+   SIMPLEQ_INIT();
mtx_enter(_mtx);
TAILQ_FOREACH(inp, _queue, inp_queue) {
if (inp->inp_socket->so_state & SS_CANTRCVMORE)
@@ -171,41 +172,16 @@ rip_input(struct mbuf **mp, int *offp, i
if (inp->inp_faddr.s_addr &&
inp->inp_faddr.s_addr != ip->ip_src.s_addr)
continue;
-   if (last) {
-   struct mbuf *n;
 
-   if ((n = m_copym(m, 0, M_COPYALL, M_NOWAIT)) != NULL) {
-   if (last->inp_flags & INP_CONTROLOPTS ||
-   last->inp_socket->so_options & SO_TIMESTAMP)
-   ip_savecontrol(last, , ip, n);
-   if (sbappendaddr(last->inp_socket,
-   >inp_socket->so_rcv,
-   sintosa(), n, opts) == 0) {
-   /* should notify about lost packet */
-   m_freem(n);
-   m_freem(opts);
-   } else
-   sorwakeup(last->inp_socket);
-   opts = NULL;
-   }
-   }
-   last = inp;
+   in_pcbref(inp);
+   SIMPLEQ_INSERT_TAIL(, inp, inp_notify);
}
mtx_leave(_mtx);
 
-   if (last) {
-   if (last->inp_flags & INP_CONTROLOPTS ||
-   last->inp_socket->so_options & SO_TIMESTAMP)
-   ip_savecontrol(last, , ip, m);
-   if (sbappendaddr(last->inp_socket, >inp_socket->so_rcv,
-   sintosa(), m, opts) == 0) {
-   m_freem(m);
-   m_freem(opts);
-   } else
-   sorwakeup(last->inp_socket);
-   } else {
+   if (SIMPLEQ_EMPTY()) {
if (ip->ip_p != IPPROTO_ICMP)
-   icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL, 0, 
0);
+   icmp_error(m, ICMP_UNREACH, ICMP_UNREACH_PROTOCOL,
+   0, 0);
else
m_freem(m);
 
@@ -213,6 +189,30 @@ rip_input(struct mbuf **mp, int *offp, i
counters[ips_noproto]++;
counters[ips_delivered]--;
counters_leave(, ipcounters);
+   }
+
+   while ((inp = SIMPLEQ_FIRST()) != NULL) {
+   struct mbuf *n, *opts = NULL;
+
+   SIMPLEQ_REMOVE_HEAD(, inp_notify);
+   if (SIMPLEQ_EMPTY())
+   n = m;
+   else
+   n = m_copym(m, 0, M_COPYALL, M_NOWAIT);
+   if (n != NULL) {
+   if (inp->inp_flags & INP_CONTROLOPTS ||
+   inp->inp_socket->so_options & SO_TIMESTAMP)
+   ip_savecontrol(inp, , ip, n);
+   if (sbappendaddr(inp->inp_socket,
+   >inp_socket->so_rcv,
+   sintosa(), n, opts) == 0) {
+   /* should notify about lost packet */
+   m_freem(n);
+   m_freem(opts);
+   } else
+   sorwakeup(inp->inp_socket);
+   }
+   in_pcbunref(inp);
}
return IPPROTO_DONE;
 }
Index: netinet6/raw_ip6.c
===
RCS file: /data/mirror/openbsd/cvs/src/sys/netinet6/raw_ip6.c,v
retrieving revision 1.145
diff -u -p -r1.145 raw_ip6.c
--- netinet6/raw_ip6.c  21 Mar 2022 09:12:34 -  1.145
+++ netinet6/raw_ip6.c  22 Mar 2022 12:59:05 -
@@ -121,10 +121,9 @@