Re: MCLADDREFERENCE() incrementing the wrong ext_refcnt?

2024-03-28 Thread Ryota Ozaki
Hi,

On Sat, Mar 23, 2024 at 12:44 AM Edgar Fuß  wrote:
>
> Hello.
>
> I'm under the impression that MCLADDREFERENCE() may increment the wrong
> ext_refcnt.
>
> In case it's permitted (I can't find anything to the contrary) to
> call MCLADDREFERENCE(m1, m2) and then MCLADDREFERENCE(m2, m3), then the
> second call will increment m2's ext_refcnt where it should be incrementing
> m1's (i.e. the one that all of m1's, m2's and m3's m_ext_ref point to), no?
>
> So I think
> atomic_inc_uint(&(o)->m_ext.ext_refcnt);\
> should really be
> atomic_inc_uint(&(o)->m_ext_ref->m_ext.ext_refcnt); \
> which, of course, is the same thing if MEXT_ISEMBEDDED(o) is true.
>
> Am I getting something wrong?

I think you're right.

IIUC a use-after-free can occur in some cases. In your example,
if the mbufs are freed in the order m1, m3, m2, the freed buffer of m1
can be accessed via m2 after m3 is freed.
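To make the ordering concrete, here is an illustrative sketch (the
refcount values in the comments follow the description above, assuming
the cluster and its ext_refcnt live in m1; these are not the macro
bodies):

/* m1 owns the cluster; the shared counter is m1's m_ext.ext_refcnt. */
MCLADDREFERENCE(m1, m2);  /* m2->m_ext_ref = m1; counter: 1 -> 2 */
MCLADDREFERENCE(m2, m3);  /* bug: bumps m2's unused embedded counter;
                           * m1's counter stays at 2 although three
                           * mbufs now share the cluster
                           * (fixed: m1's counter: 2 -> 3) */
m_free(m1);               /* counter: 2 -> 1 */
m_free(m3);               /* counter: 1 -> 0, cluster freed too early */
m_free(m2);               /* m2 still points at the freed cluster:
                           * use-after-free */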

I'll commit your fix.

Thanks,
  ozaki-r


Re: Enable to send packets on if_loop via bpf

2022-11-28 Thread Ryota Ozaki
On Tue, Nov 22, 2022 at 8:25 PM Ryota Ozaki  wrote:
>
> On Tue, Nov 22, 2022 at 8:00 PM Ryota Ozaki  wrote:
> >
> > On Tue, Nov 22, 2022 at 12:49 AM Greg Troxel  wrote:
> > >
> > >
> > > Ryota Ozaki  writes:
> > >
> > > > In the specification, DLT_NULL assumes a protocol family in host
> > > > byte order followed by the payload.  Interfaces of DLT_NULL use
> > > > bpf_mtap_af to pass an mbuf with a prepended protocol family.  All
> > > > interfaces follow the spec and work well.
> > > >
> > > > OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
> > > > Data written to an interface of DLT_NULL is treated as raw data
> > > > (I don't know why); the data is passed to the interface's output routine
> > > > as is with dst (sa_family=AF_UNSPEC).  tun seems to be able to handle
> > > > such raw data but the others can't (probably the data will be dropped,
> > > > as in if_loop).
> > >
> > > Summarizing and commenting to make sure I'm not confused
> > >
> > >   on receive/read, DLT_NULL prepends AF in host byte order
> > >   on transmit/write, it just sends with AF_UNSPEC
> > >
> > >   This seems broken as it is asymmetric, and is bad because it throws
> > >   away information that is hard to reliably recreate.  On the other hand
> > >   this is for link-layer formats, and it seems that some interfaces have
> > >   an AF that is not really part of what is transmitted, even though
> > >   really it is.  For example tun is using an IP proto byte to specify AF
> > >   and really this is part of the link protocol.  Except we pretend it
> > >   isn't.
> >
> > I found the following sentence in bpf.4:
> >
> >  A packet can be sent out on the network by writing to a bpf file
> >  descriptor.  The writes are unbuffered, meaning only one packet can be
> >  processed per write.  Currently, only writes to Ethernets and SLIP
> >  links are supported.
> >
> > So bpf_write to interfaces of DLT_NULL may be simply unsupported on
> > NetBSD...
> >
> > >
> > > > Correcting bpf_write to assume a prepended protocol family would
> > > > save some interfaces like gif and gre but won't save others like stf
> > > > and wg.  Even worse, the change may break existing users of tun
> > > > that want to treat the data as is (though I don't know if such users exist).
> > > >
> > > > BTW, prepending a protocol family on tun is a different protocol from
> > > > DLT_NULL of bpf.  tun has three protocol modes and doesn't always
> > > > prepend a protocol family.  (Also, tun uses network byte order,
> > > > as gert says, while DLT_NULL assumes host byte order.)
> > >
> > > wow.
> > >
> > > > So my fix will:
> > > > - keep DLT_NULL of if_loop so as not to break bpf_mtap_af, and
> > > > - leave DLT_NULL handling in bpf_write unchanged except for if_loop,
> > > > so as not to bother existing users.
> > > > The patch looks like this:
> > > >
> > > > @@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
> > > > uint64_t mtu, struct mbuf **mp,
> > > > m0->m_len -= hlen;
> > > > }
> > > >
> > > > +   if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
> > > > +   uint32_t af;
> > > > +   memcpy(&af, mtod(m0, void *), sizeof(af));
> > > > +   sockp->sa_family = af;
> > > > +   m0->m_data += sizeof(af);
> > > > +   m0->m_len -= sizeof(af);
> > > > +   }
> > > > +
> > > > *mp = m0;
> > > > return (0);
> > >
> > > That seems ok to me.
> >
> > Thanks.
> >
> > >
> > >
> > > I think the long-term right fix is to define DLT_AF which has an AF word
> > > in host order on receive and transmit always, and to modify interfaces
> > > to use it whenever they are AF aware at all.   In this case tun would
> > > fill in the AF word from the IP proto field, and you'd get a
> > > transformed/regularized AF word when really the "link layer packet" had
> > > the IP proto field.  But that's ok as it's just cleanup and reversible.
> >
> > I think introducing DLT_AF is a bit of a tough task because DLT_* 
> > definitions
> > are managed by us.
>^ are NOT managed, I meant to say...
>
>   ozaki-r

https://www.netbsd.org/~ozaki-r/loop-bpf2.patch

Anyway, this is the latest patch.  It is adjusted to ensure that input
validation is applied (as pointed out by ryo@).

  ozaki-r


Re: Enable to send packets on if_loop via bpf

2022-11-22 Thread Ryota Ozaki
On Tue, Nov 22, 2022 at 8:00 PM Ryota Ozaki  wrote:
>
> On Tue, Nov 22, 2022 at 12:49 AM Greg Troxel  wrote:
> >
> >
> > Ryota Ozaki  writes:
> >
> > > In the specification, DLT_NULL assumes a protocol family in host
> > > byte order followed by the payload.  Interfaces of DLT_NULL use
> > > bpf_mtap_af to pass an mbuf with a prepended protocol family.  All
> > > interfaces follow the spec and work well.
> > >
> > > OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
> > > Data written to an interface of DLT_NULL is treated as raw data
> > > (I don't know why); the data is passed to the interface's output routine
> > > as is with dst (sa_family=AF_UNSPEC).  tun seems to be able to handle
> > > such raw data but the others can't (probably the data will be dropped,
> > > as in if_loop).
> >
> > Summarizing and commenting to make sure I'm not confused
> >
> >   on receive/read, DLT_NULL prepends AF in host byte order
> >   on transmit/write, it just sends with AF_UNSPEC
> >
> >   This seems broken as it is asymmetric, and is bad because it throws
> >   away information that is hard to reliably recreate.  On the other hand
> >   this is for link-layer formats, and it seems that some interfaces have
> >   an AF that is not really part of what is transmitted, even though
> >   really it is.  For example tun is using an IP proto byte to specify AF
> >   and really this is part of the link protocol.  Except we pretend it
> >   isn't.
>
> I found the following sentence in bpf.4:
>
>  A packet can be sent out on the network by writing to a bpf file
>  descriptor.  The writes are unbuffered, meaning only one packet can be
>  processed per write.  Currently, only writes to Ethernets and SLIP links
>  are supported.
>
> So bpf_write to interfaces of DLT_NULL may be simply unsupported on
> NetBSD...
>
> >
> > > Correcting bpf_write to assume a prepended protocol family would
> > > save some interfaces like gif and gre but won't save others like stf
> > > and wg.  Even worse, the change may break existing users of tun
> > > that want to treat the data as is (though I don't know if such users exist).
> > >
> > > BTW, prepending a protocol family on tun is a different protocol from
> > > DLT_NULL of bpf.  tun has three protocol modes and doesn't always prepend
> > > a protocol family.  (Also, tun uses network byte order,
> > > as gert says, while DLT_NULL assumes host byte order.)
> >
> > wow.
> >
> > > So my fix will:
> > > - keep DLT_NULL of if_loop so as not to break bpf_mtap_af, and
> > > - leave DLT_NULL handling in bpf_write unchanged except for if_loop,
> > > so as not to bother existing users.
> > > The patch looks like this:
> > >
> > > @@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
> > > uint64_t mtu, struct mbuf **mp,
> > > m0->m_len -= hlen;
> > > }
> > >
> > > +   if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
> > > +   uint32_t af;
> > > +   memcpy(&af, mtod(m0, void *), sizeof(af));
> > > +   sockp->sa_family = af;
> > > +   m0->m_data += sizeof(af);
> > > +   m0->m_len -= sizeof(af);
> > > +   }
> > > +
> > > *mp = m0;
> > > return (0);
> >
> > That seems ok to me.
>
> Thanks.
>
> >
> >
> > I think the long-term right fix is to define DLT_AF which has an AF word
> > in host order on receive and transmit always, and to modify interfaces
> > to use it whenever they are AF aware at all.   In this case tun would
> > fill in the AF word from the IP proto field, and you'd get a
> > transformed/regularized AF word when really the "link layer packet" had
> > the IP proto field.  But that's ok as it's just cleanup and reversible.
>
> I think introducing DLT_AF is a bit of a tough task because DLT_* definitions
> are managed by us.
   ^ are NOT managed, I meant to say...

  ozaki-r


Re: Enable to send packets on if_loop via bpf

2022-11-22 Thread Ryota Ozaki
On Tue, Nov 22, 2022 at 12:49 AM Greg Troxel  wrote:
>
>
> Ryota Ozaki  writes:
>
> > In the specification, DLT_NULL assumes a protocol family in host
> > byte order followed by the payload.  Interfaces of DLT_NULL use
> > bpf_mtap_af to pass an mbuf with a prepended protocol family.  All
> > interfaces follow the spec and work well.
> >
> > OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
> > Data written to an interface of DLT_NULL is treated as raw data
> > (I don't know why); the data is passed to the interface's output routine
> > as is with dst (sa_family=AF_UNSPEC).  tun seems to be able to handle
> > such raw data but the others can't (probably the data will be dropped,
> > as in if_loop).
>
> Summarizing and commenting to make sure I'm not confused
>
>   on receive/read, DLT_NULL prepends AF in host byte order
>   on transmit/write, it just sends with AF_UNSPEC
>
>   This seems broken as it is asymmetric, and is bad because it throws
>   away information that is hard to reliably recreate.  On the other hand
>   this is for link-layer formats, and it seems that some interfaces have
>   an AF that is not really part of what is transmitted, even though
>   really it is.  For example tun is using an IP proto byte to specify AF
>   and really this is part of the link protocol.  Except we pretend it
>   isn't.

I found the following sentence in bpf.4:

 A packet can be sent out on the network by writing to a bpf file
 descriptor.  The writes are unbuffered, meaning only one packet can be
 processed per write.  Currently, only writes to Ethernets and SLIP links
 are supported.

So bpf_write to interfaces of DLT_NULL may be simply unsupported on
NetBSD...

>
> > Correcting bpf_write to assume a prepended protocol family would
> > save some interfaces like gif and gre but won't save others like stf
> > and wg.  Even worse, the change may break existing users of tun
> > that want to treat the data as is (though I don't know if such users exist).
> >
> > BTW, prepending a protocol family on tun is a different protocol from
> > DLT_NULL of bpf.  tun has three protocol modes and doesn't always prepend
> > a protocol family.  (Also, tun uses network byte order,
> > as gert says, while DLT_NULL assumes host byte order.)
>
> wow.
>
> > So my fix will:
> > - keep DLT_NULL of if_loop so as not to break bpf_mtap_af, and
> > - leave DLT_NULL handling in bpf_write unchanged except for if_loop,
> > so as not to bother existing users.
> > The patch looks like this:
> >
> > @@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
> > uint64_t mtu, struct mbuf **mp,
> > m0->m_len -= hlen;
> > }
> >
> > +   if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
> > +   uint32_t af;
> > +   memcpy(&af, mtod(m0, void *), sizeof(af));
> > +   sockp->sa_family = af;
> > +   m0->m_data += sizeof(af);
> > +   m0->m_len -= sizeof(af);
> > +   }
> > +
> > *mp = m0;
> > return (0);
>
> That seems ok to me.

Thanks.

>
>
> I think the long-term right fix is to define DLT_AF which has an AF word
> in host order on receive and transmit always, and to modify interfaces
> to use it whenever they are AF aware at all.   In this case tun would
> fill in the AF word from the IP proto field, and you'd get a
> transformed/regularized AF word when really the "link layer packet" had
> the IP proto field.  But that's ok as it's just cleanup and reversible.

I think introducing DLT_AF is a bit of a tough task because DLT_* definitions
are managed by us.

  ozaki-r


Re: Enable to send packets on if_loop via bpf

2022-11-16 Thread Ryota Ozaki
On Wed, Nov 9, 2022 at 9:21 PM Greg Troxel  wrote:
>
>
> Ryota Ozaki  writes:
>
> > NetBSD can't do this because a loopback interface
> > registers itself to bpf as DLT_NULL and bpf treats
> > packets being sent over the interface as AF_UNSPEC.
> > Packets of AF_UNSPEC are just dropped by loopback
> > interfaces.
> >
> > FreeBSD and OpenBSD make this possible by letting users
> > prepend a protocol family to the data being sent.  bpf (or if_loop)
> > extracts it and handles the packet as the extracted protocol
> > family.  The following patch follows them (the implementation
> > is inspired by OpenBSD).
> >
> > http://www.netbsd.org/~ozaki-r/loop-bpf.patch
> >
> > The patch changes if_loop to register itself to bpf
> > as DLT_LOOP, and bpf to handle a prepended protocol
> > family on bpf_write if the sending interface is DLT_LOOP.
>
> I am surprised that there is not already a DLT_foo that has this
> concept, an AF word followed by data.  But I guess every interface
> already has a more-specific format.
>
> Looking at if_tun.c, I see DLT_NULL.  This should have the same ability
> to write.   I have forgotten the details of how tun encodes AF when
> transmitting, but I know you can have v4 or v6 inside, and tcpdump works
> now, so obviously I must be missing something.
>
> My suggestion is to look at the rest of the drivers that register
> DLT_NULL and see if they are amenable to the same fix, and choose a new
> DLT_SOMETHING that accommodates the broader situation.
>
> I am not demanding that you add features to the rest of the drivers.  I
> am only asking that you think about the architectural issue of how the
> rest of them would be updated, so we don't end up with DLT_LOOP,
> DLT_TUN, and so on, where they all do almost the same thing, when they
> could be the same.
>
> I don't really have an opinion on host vs network for AF, but I think
> your choice of aligning with FreeBSD is reasonable.

Thank you for your suggestion and I'm sorry for my late reply.

I've investigated the DLT specification (*), the DLT_NULL users including
tun, and the implementations of bpf and others.

(*) https://www.tcpdump.org/linktypes.html

First of all, my patch was wrong because DLT_LOOP assumes a protocol family
in network byte order.  So prepending a protocol family in
host byte order was wrong, and changing to DLT_LOOP also broke mbuf
tapping on if_loop (i.e., tcpdump).


In the specification, DLT_NULL assumes a protocol family in host
byte order followed by the payload.  Interfaces of DLT_NULL use
bpf_mtap_af to pass an mbuf with a prepended protocol family.  All
interfaces follow the spec and work well.

OTOH, bpf_write to interfaces of DLT_NULL is a bit of a sad situation.
Data written to an interface of DLT_NULL is treated as raw data
(I don't know why); the data is passed to the interface's output routine
as is with dst (sa_family=AF_UNSPEC).  tun seems to be able to handle
such raw data but the others can't (probably the data will be dropped,
as in if_loop).


Correcting bpf_write to assume a prepended protocol family would
save some interfaces like gif and gre but won't save others like stf
and wg.  Even worse, the change may break existing users of tun
that want to treat the data as is (though I don't know if such users exist).

BTW, prepending a protocol family on tun is a different protocol from
DLT_NULL of bpf.  tun has three protocol modes and doesn't always prepend
a protocol family.  (Also, tun uses network byte order, as gert says,
while DLT_NULL assumes host byte order.)


So my fix will:
- keep DLT_NULL of if_loop so as not to break bpf_mtap_af, and
- leave DLT_NULL handling in bpf_write unchanged except for if_loop,
so as not to bother existing users.
The patch looks like this:

@@ -447,6 +448,14 @@ bpf_movein(struct uio *uio, int linktype,
uint64_t mtu, struct mbuf **mp,
m0->m_len -= hlen;
}

+   if (linktype == DLT_NULL && ifp->if_type == IFT_LOOP) {
+   uint32_t af;
+   memcpy(&af, mtod(m0, void *), sizeof(af));
+   sockp->sa_family = af;
+   m0->m_data += sizeof(af);
+   m0->m_len -= sizeof(af);
+   }
+
*mp = m0;
return (0);


If we want to support another interface, we can add it to the condition.
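For illustration, here is a minimal userland sketch of what the fix
enables, assuming the patch above is applied: it writes one packet to
lo0 through bpf with the 4-byte protocol family prepended in host byte
order, as DLT_NULL requires.  Building the IP packet itself (pkt and
pktlen) is left out, and loop_bpf_write is just an illustrative name.

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/if.h>
#include <net/bpf.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

int
loop_bpf_write(const void *pkt, size_t pktlen)
{
	struct ifreq ifr;
	uint32_t af = AF_INET;		/* host byte order, per DLT_NULL */
	char buf[2048];
	int fd;

	if (pktlen > sizeof(buf) - sizeof(af))
		return -1;
	if ((fd = open("/dev/bpf", O_WRONLY)) == -1)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, "lo0", sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) == -1) {
		close(fd);
		return -1;
	}
	/* DLT_NULL framing: protocol family first, then the payload. */
	memcpy(buf, &af, sizeof(af));
	memcpy(buf + sizeof(af), pkt, pktlen);
	if (write(fd, buf, sizeof(af) + pktlen) == -1) {
		close(fd);
		return -1;
	}
	close(fd);
	return 0;
}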

Any comments?
  ozaki-r


Re: Restructuring inpcb/in6pcb

2022-10-31 Thread Ryota Ozaki
On Fri, Oct 28, 2022 at 2:53 PM Ryota Ozaki  wrote:
>
> On Wed, Oct 26, 2022 at 5:36 PM Ryota Ozaki  wrote:
> >
> > On Wed, Oct 26, 2022 at 12:00 AM Christos Zoulas  wrote:
> > >
> > > In article 
> > > ,
> > > Ryota Ozaki   wrote:
> > > >Hi,
> > > >
> > > >Data structures of network protocol control blocks (PCBs)
> > > >(struct inpcb, in6pcb and inpcb_hdr) are not organized well.
> > > >Users of the data structures have to handle them separately
> > > >and thus the code becomes duplicated and cluttered.
> > > >
> > > >The proposal restructures the data structures, which eliminates
> > > >duplicate and cluttered code and makes further changes easy.
> > > >
> > > >A typical cleanup by the change looks like this:
> > > >
> > > >-   if (tp->t_inpcb)
> > > >-   so = tp->t_inpcb->inp_socket;
> > > >-#ifdef INET6
> > > >-   else if (tp->t_in6pcb)
> > > >-   so = tp->t_in6pcb->in6p_socket;
> > > >-#endif
> > > >+   so = tp->t_inpcb->inp_socket;
> > > >
> > > >Also some duplicated functions for IPv4 and IPv6 are integrated:
> > > >tcp6_notify, tcp6_quench, etc.
> > > >
> > > >The change consists of two parts.  The first half of the series of
> > > >commits tries to integrate all the data structures into one structure
> > > >(struct inpcb).  The commits clean up ugly code like the above.
> > > >However, because the structure includes both IPv4 and IPv6 stuff,
> > > >the data size for IPv4 increases by 40 bytes (from 248 to 288).
> > > >
> > > >The second half of the series of commits addresses the increase
> > > >of the data size by separating the data structure again while keeping
> > > >the code simple.  By the change, struct inpcb has only common data
> > > >and newly introduced data structures, struct in4pcb and struct in6pcb
> > > >inherited from struct inpcb, have dedicated data for each protocol.
> > > >Even after the separation, users don't need to recognize the separation
> > > >and just need to use some macros to access dedicated data.
> > > >For example, inp->inp_laddr, which represents the local IPv4 address,
> > > >is now accessed through in4p_laddr(inp).  Using these macros adds
> > > >some code complexity, but this is a trade-off between data size and
> > > >code complexity.
> > > >
> > > >The diffs are here:
> > > >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-1.patch
> > > >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-2.patch
> > > >
> > > >Also, you can see individual commits at github:
> > > >  https://github.com/ozaki-r/netbsd-src/commits/restructure-inpcb
> > > >
> > > >
> > > >We can adopt either the whole series or only the first half.
> > > >Which should we choose? Smaller data size or simpler code?
> > > >
> > > >By the way, I think the changes should be committed (if permitted)
> > > >after branching netbsd-10.  When will the netbsd-10 branch come?
> > > >If it doesn't come soon, is it ok to commit the changes before branching?
> > >
> > > Thanks! I wanted to do this for a long time... I don't see why it
> > > should wait for the branch...
> >
> > Just because it's relatively big, but if nobody cares I'm not going to
> > care too :)
>
> I've committed the changes anyway.  Let me know if something goes wrong.

It seems that all known issues are resolved now.

Thank you for your help!

  ozaki-r


Re: Restructuring inpcb/in6pcb

2022-10-27 Thread Ryota Ozaki
On Wed, Oct 26, 2022 at 5:36 PM Ryota Ozaki  wrote:
>
> On Wed, Oct 26, 2022 at 12:00 AM Christos Zoulas  wrote:
> >
> > In article 
> > ,
> > Ryota Ozaki   wrote:
> > >Hi,
> > >
> > >Data structures of network protocol control blocks (PCBs)
> > >(struct inpcb, in6pcb and inpcb_hdr) are not organized well.
> > >Users of the data structures have to handle them separately
> > >and thus the code becomes duplicated and cluttered.
> > >
> > >The proposal restructures the data structures, which eliminates
> > >duplicate and cluttered code and makes further changes easy.
> > >
> > >A typical cleanup by the change looks like this:
> > >
> > >-   if (tp->t_inpcb)
> > >-   so = tp->t_inpcb->inp_socket;
> > >-#ifdef INET6
> > >-   else if (tp->t_in6pcb)
> > >-   so = tp->t_in6pcb->in6p_socket;
> > >-#endif
> > >+   so = tp->t_inpcb->inp_socket;
> > >
> > >Also some duplicated functions for IPv4 and IPv6 are integrated:
> > >tcp6_notify, tcp6_quench, etc.
> > >
> > >The change consists of two parts.  The first half of the series of
> > >commits tries to integrate all the data structures into one structure
> > >(struct inpcb).  The commits clean up ugly code like the above.
> > >However, because the structure includes both IPv4 and IPv6 stuff,
> > >the data size for IPv4 increases by 40 bytes (from 248 to 288).
> > >
> > >The second half of the series of commits addresses the increase
> > >of the data size by separating the data structure again while keeping
> > >the code simple.  By the change, struct inpcb has only common data
> > >and newly introduced data structures, struct in4pcb and struct in6pcb
> > >inherited from struct inpcb, have dedicated data for each protocol.
> > >Even after the separation, users don't need to recognize the separation
> > >and just need to use some macros to access dedicated data.
> > >For example, inp->inp_laddr, which represents the local IPv4 address,
> > >is now accessed through in4p_laddr(inp).  Using these macros adds
> > >some code complexity, but this is a trade-off between data size and
> > >code complexity.
> > >
> > >The diffs are here:
> > >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-1.patch
> > >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-2.patch
> > >
> > >Also, you can see individual commits at github:
> > >  https://github.com/ozaki-r/netbsd-src/commits/restructure-inpcb
> > >
> > >
> > >We can adopt either the whole series or only the first half.
> > >Which should we choose? Smaller data size or simpler code?
> > >
> > >By the way, I think the changes should be committed (if permitted)
> > >after branching netbsd-10.  When will the netbsd-10 branch come?
> > >If it doesn't come soon, is it ok to commit the changes before branching?
> >
> > Thanks! I wanted to do this for a long time... I don't see why it
> > should wait for the branch...
>
> Just because it's relatively big, but if nobody cares I'm not going to
> care too :)

I've committed the changes anyway.  Let me know if something goes wrong.

  ozaki-r


Re: Restructuring inpcb/in6pcb

2022-10-26 Thread Ryota Ozaki
On Wed, Oct 26, 2022 at 12:00 AM Christos Zoulas  wrote:
>
> In article 
> ,
> Ryota Ozaki   wrote:
> >Hi,
> >
> >Data structures of network protocol control blocks (PCBs)
> >(struct inpcb, in6pcb and inpcb_hdr) are not organized well.
> >Users of the data structures have to handle them separately
> >and thus the code becomes duplicated and cluttered.
> >
> >The proposal restructures the data structures, which eliminates
> >duplicate and cluttered code and makes further changes easy.
> >
> >A typical cleanup by the change looks like this:
> >
> >-   if (tp->t_inpcb)
> >-   so = tp->t_inpcb->inp_socket;
> >-#ifdef INET6
> >-   else if (tp->t_in6pcb)
> >-   so = tp->t_in6pcb->in6p_socket;
> >-#endif
> >+   so = tp->t_inpcb->inp_socket;
> >
> >Also some duplicated functions for IPv4 and IPv6 are integrated:
> >tcp6_notify, tcp6_quench, etc.
> >
> >The change consists of two parts.  The first half of the series of
> >commits tries to integrate all the data structures into one structure
> >(struct inpcb).  The commits clean up ugly code like the above.
> >However, because the structure includes both IPv4 and IPv6 stuff,
> >the data size for IPv4 increases by 40 bytes (from 248 to 288).
> >
> >The second half of the series of commits addresses the increase
> >of the data size by separating the data structure again while keeping
> >the code simple.  By the change, struct inpcb has only common data
> >and newly introduced data structures, struct in4pcb and struct in6pcb
> >inherited from struct inpcb, have dedicated data for each protocol.
> >Even after the separation, users don't need to recognize the separation
> >and just need to use some macros to access dedicated data.
> >For example, inp->inp_laddr, which represents the local IPv4 address,
> >is now accessed through in4p_laddr(inp).  Using these macros adds
> >some code complexity, but this is a trade-off between data size and
> >code complexity.
> >
> >The diffs are here:
> >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-1.patch
> >- https://www.netbsd.org/~ozaki-r/restructure-inpcb-2.patch
> >
> >Also, you can see individual commits at github:
> >  https://github.com/ozaki-r/netbsd-src/commits/restructure-inpcb
> >
> >
> >We can adopt either the whole series or only the first half.
> >Which should we choose? Smaller data size or simpler code?
> >
> >By the way, I think the changes should be committed (if permitted)
> >after branching netbsd-10.  When will the netbsd-10 branch come?
> >If it doesn't come soon, is it ok to commit the changes before branching?
>
> Thanks! I wanted to do this for a long time... I don't see why it should wait
> for the branch...

Just because it's relatively big, but if nobody cares I'm not going to
care too :)

  ozaki-r


Restructuring inpcb/in6pcb

2022-10-25 Thread Ryota Ozaki
Hi,

Data structures of network protocol control blocks (PCBs)
(struct inpcb, in6pcb and inpcb_hdr) are not organized well.
Users of the data structures have to handle them separately
and thus the code becomes duplicated and cluttered.

The proposal restructures the data structures, which eliminates
duplicate and cluttered code and makes further changes easy.

A typical cleanup by the change looks like this:

-   if (tp->t_inpcb)
-   so = tp->t_inpcb->inp_socket;
-#ifdef INET6
-   else if (tp->t_in6pcb)
-   so = tp->t_in6pcb->in6p_socket;
-#endif
+   so = tp->t_inpcb->inp_socket;

Also some duplicated functions for IPv4 and IPv6 are integrated:
tcp6_notify, tcp6_quench, etc.

The change consists of two parts.  The first half of the series of
commits tries to integrate all the data structures into one structure
(struct inpcb).  The commits clean up ugly code like the above.
However, because the structure includes both IPv4 and IPv6 stuff,
the data size for IPv4 increases by 40 bytes (from 248 to 288).

The second half of the series of commits addresses the increase in
data size by separating the data structure again while keeping
the code simple.  With the change, struct inpcb has only common data,
and newly introduced data structures, struct in4pcb and struct in6pcb,
inherited from struct inpcb, have dedicated data for each protocol.
Even after the separation, users don't need to recognize the separation
and just need to use some macros to access the dedicated data.
For example, inp->inp_laddr, which represents the local IPv4 address,
is now accessed through in4p_laddr(inp).  Using these macros adds
some code complexity, but this is a trade-off between data size and
code complexity.
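A rough sketch of the layout being described follows; the field names
and the second macro are illustrative, not the actual committed ones:

#include <netinet/in.h>

struct socket;			/* opaque here */

struct inpcb {			/* common data only */
	struct socket *inp_socket;
	int inp_af;		/* AF_INET or AF_INET6 */
};

struct in4pcb {			/* "inherits" from struct inpcb */
	struct inpcb inp;	/* must be the first member */
	struct in_addr in4p_laddr_data;		/* IPv4-only data */
};

struct in6pcb {
	struct inpcb inp;
	struct in6_addr in6p_laddr_data;	/* IPv6-only data */
};

/* Users keep passing struct inpcb * around; dedicated data is reached
 * through accessor macros that cast to the protocol-specific type: */
#define in4p_laddr(inp)	(((struct in4pcb *)(inp))->in4p_laddr_data)
#define in6p_laddr(inp)	(((struct in6pcb *)(inp))->in6p_laddr_data)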

The diffs are here:
- https://www.netbsd.org/~ozaki-r/restructure-inpcb-1.patch
- https://www.netbsd.org/~ozaki-r/restructure-inpcb-2.patch

Also, you can see individual commits at github:
  https://github.com/ozaki-r/netbsd-src/commits/restructure-inpcb


We can adopt either the whole series or only the first half.
Which should we choose? Smaller data size or simpler code?

By the way, I think the changes should be committed (if permitted)
after branching netbsd-10.  When will the netbsd-10 branch come?
If it doesn't come soon, is it ok to commit the changes before branching?

Thanks,
   ozaki-r


Re: if_ethersubr.c: if_ierrors -> if_iqdrops?

2021-11-09 Thread Ryota Ozaki
On Wed, Nov 10, 2021 at 6:03 AM Christos Zoulas  wrote:
>
> In article <3360841c-e7fc-147d-1347-fe427d41a...@sdf.org>,
> RVP   wrote:
> >On Mon, 8 Nov 2021, Jason Thorpe wrote:
> >
> >>> On Nov 7, 2021, at 11:07 PM, RVP  wrote:
> >>>
> >>> So, I hacked up a small patch to put most of these into the
> >>> "if_iqdrops" bucket. The rest (following FreeBSD) remain as errors.
> >>
> >> LGTM!
> >>
> >
> >On Mon, 8 Nov 2021, Christos Zoulas wrote:
> >
> >>> As far as I'm concerned, you can silently chuck most of these errors
> >>> away--like FreeBSD does, or classify them as "dropped" like Linux does.
> >>> Either way is fine with me--just not "if_ierrors": that's too confusing.
> >>
> >> I think that the patch makes sense...
> >>
> >
> >OK, so discarded packets are fine as `if_iqdrops'. Otherwise, a kernel w/o
> >vlan (or netatalk, carp, ...) compiled in will show:
>
> Yes, I think that we need the distinction between discard and drop...
> Or we should just not count discard...

Another option may be if_noproto.

  ozaki-r


Re: psref leak detector (aggressive version)

2019-05-20 Thread Ryota Ozaki
On Fri, May 10, 2019 at 7:01 PM Ryota Ozaki  wrote:
>
> Hi,
>
> https://www.netbsd.org/~ozaki-r/psref-debug-aggressive.diff
>
> This patch implements yet another psref leak detector, one that can
> tell where a leak occurs, while the simpler version already committed
> only reports that a leak occurred.
>
> Investigating psref leaks is hard because once a leak occurs, the
> percpu list of psrefs that tracks references can be corrupted.
> A reference to a tracked object is recorded in the list
> via an intermediate object (struct psref) that is normally allocated
> on a thread's stack.  Thus, the intermediate object can be
> overwritten on a leak, resulting in corruption of the list.
>
> The tracker makes a shadow entry for each intermediate object and
> stores some hints in it (currently the caller address of
> psref_acquire).  We can detect a leak by checking the entries at
> certain points where all references should have been released, such as
> the return point of syscalls and the end of each softint handler.
>
> Note that the current version is based on -current.  Once the
> simpler version is revised as using lwp_specificdata [*], the
> patch will be tweaked too.
>
> [*] http://mail-index.netbsd.org/source-changes-d/2019/05/08/msg011297.html
>
> Any comments?

FYI, I committed the patch last week.

Note that the patch was committed without the tweaks mentioned above
because it was decided to keep the simpler version as is.

  ozaki-r


psref leak detector (aggressive version)

2019-05-10 Thread Ryota Ozaki
Hi,

https://www.netbsd.org/~ozaki-r/psref-debug-aggressive.diff

This patch implements yet another psref leak detector, one that can
tell where a leak occurs, while the simpler version already committed
only reports that a leak occurred.

Investigating psref leaks is hard because once a leak occurs, the
percpu list of psrefs that tracks references can be corrupted.
A reference to a tracked object is recorded in the list
via an intermediate object (struct psref) that is normally allocated
on a thread's stack.  Thus, the intermediate object can be
overwritten on a leak, resulting in corruption of the list.

The tracker makes a shadow entry for each intermediate object and
stores some hints in it (currently the caller address of
psref_acquire).  We can detect a leak by checking the entries at
certain points where all references should have been released, such as
the return point of syscalls and the end of each softint handler.
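As a user-space analogy of the mechanism (nothing here is the actual
kernel code; all names are made up for illustration), the idea boils
down to a shadow table keyed by the intermediate object, recording who
acquired it:

#include <stdio.h>

#define MAXREFS 8

struct shadow {
	void *obj;	/* stands in for the on-stack struct psref */
	void *caller;	/* who acquired it */
};

static struct shadow table[MAXREFS];

static void
toy_acquire(void *obj)
{
	for (int i = 0; i < MAXREFS; i++)
		if (table[i].obj == NULL) {
			table[i].obj = obj;
			table[i].caller = __builtin_return_address(0);
			return;
		}
}

static void
toy_release(void *obj)
{
	for (int i = 0; i < MAXREFS; i++)
		if (table[i].obj == obj) {
			table[i].obj = NULL;
			return;
		}
}

static void
toy_checkpoint(void)	/* e.g. at syscall return */
{
	for (int i = 0; i < MAXREFS; i++)
		if (table[i].obj != NULL)
			printf("leak: %p acquired at %p\n",
			    table[i].obj, table[i].caller);
}

int
main(void)
{
	int a, b;

	toy_acquire(&a);
	toy_acquire(&b);
	toy_release(&a);	/* &b is never released */
	toy_checkpoint();	/* reports the leaked reference */
	return 0;
}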

Note that the current version is based on -current.  Once the
simpler version is revised as using lwp_specificdata [*], the
patch will be tweaked too.

[*] http://mail-index.netbsd.org/source-changes-d/2019/05/08/msg011297.html

Any comments?

Thanks,
  ozaki-r


Re: Importing libraries for the kernel

2018-12-13 Thread Ryota Ozaki
On Thu, Dec 13, 2018 at 6:30 AM Joerg Sonnenberger  wrote:
>
> On Thu, Dec 13, 2018 at 12:58:21AM +0900, Ryota Ozaki wrote:
> > Before that, I want to ask about how to import cryptography
> > libraries needed for the implementation.  The libraries are
> > libb2[1] and libsodium[2]: the former is for blake2s and
> > the latter is for curve25519 and [x]chacha20-poly1305.
>
> I don't really have a problem with Blake2s, but I have serious concerns
> for doing asymmetric cryptography in the kernel. In fact, it is one of
> the IMHO very questionable design decisions behind WireGuard and
> something I don't want to see repeated in NetBSD.

Can you clarify the concerns?

  ozaki-r


Re: Importing libraries for the kernel

2018-12-12 Thread Ryota Ozaki
On Thu, Dec 13, 2018 at 2:12 AM Greg Troxel  wrote:
>
> m...@netbsd.org writes:
>
> > I don't expect there to be any problems with the ISC license. It's the
> > preferred license for OpenBSD and we use a lot of their code (it's
> > everywhere in sys/dev/)
>
> Agreed that the license is ok.
>
> > external, as I understand it, means "please upstream your changes, and
> > avoid unnecessary local changes".
>
> Agreed.
>
> And also that we have a plan/expectation of tracking changes and
> improvements that upstream makes.  Code that is not in external more or
> less implies that we are the maintainer.  For these libraries, my
> expectation is that they are being actively maintained and that we will
> want to update to newer upstream versions from time to time.

I think libsodium matches that case.  The project is active and we should
track its updates.

OTOH, for libb2 (blake2s) we need just two files (a .h and a .c), and
those seem to be a reference implementation of the algorithm, unlikely
to be updated frequently, if at all.  So just copying the files may work.

Note that this is not to say I don't want to do the formal process
(I've already done that work in my local changes).  I just don't
want to add unnecessary files to the repository if possible (and
if preferable).

Anyway thank you for suggestions!
  ozaki-r


Importing libraries for the kernel

2018-12-12 Thread Ryota Ozaki
Hi,

As you may know, I'm working on implementing WireGuard
in the kernel.  The implementation works and is buildable.
Some code cleanups are still needed, but I'll propose it on
the list in the near future.

Before that, I want to ask about how to import cryptography
libraries needed for the implementation.  The libraries are
libb2[1] and libsodium[2]: the former is for blake2s and
the latter is for curve25519 and [x]chacha20-poly1305.

[1] https://github.com/BLAKE2/libb2
[2] https://github.com/jedisct1/libsodium

Neither library is BSD-licensed (they are CC0 and ISC), so I imported
them into sys/external/, added some glue for the build, and
copied some header files into sys/crypto/ so that they can be included
from kernel code.

I can build normal kernels and rump kernels with the libraries,
but the way the libraries are imported is a bit awkward.  It would
be cleaner to import them directly into sys/crypto/, but I'm not sure
if that's acceptable.  blake2s might be okay because it's CC0,
which is a kind of public domain, IIUC.

The diff includes the glue code for the library import in question:
  http://www.netbsd.org/~ozaki-r/import-libs.diff

Any suggestions?

Thanks,
  ozaki-r


Re: Things not referenced in kernel configs, but mentioned in files.*

2018-11-12 Thread Ryota Ozaki
On Tue, Nov 13, 2018 at 1:26 AM Christos Zoulas  wrote:
>
> In article ,
> Greg Troxel   wrote:
> >
> >co...@sdf.org writes:
> >
> >> This is an automatically generated list with some hand touchups, feel
> >> free to do whatever with it. I only generated the output.
> >>
> >> ac100ic
> >> acemidi
> >> acpipmtr
> >> [snip]
> >
> >I wonder if these are candidates to add to an ALL kernel, and if it will
> >turn out that they are mostly not x86 things.
> >
> >I see we only have ALL for i386/amd64.  I wonder if it makes sense to
> >have one in evbarm.
>
> We should.

And we should build kernels from the ALL configs in daily builds...

  ozaki-r


Re: Let callout_reset return if it reschedules a pending callout

2018-02-28 Thread Ryota Ozaki
On Thu, Mar 1, 2018 at 5:45 AM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Thu, Mar 01, 2018 at 01:58:29AM +0900, Ryota Ozaki wrote:
>> On Wed, Feb 28, 2018 at 10:11 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
>> > On Wed, Feb 28, 2018 at 05:47:13PM +0900, Ryota Ozaki wrote:
>> >> The feature is useful when you have a reference to an object that is
>> >> passed to a callout. In this case you need to take care of a
>> >> reference leak on callout_reset (and other APIs); it silently
>> >> reschedules (IOW cancels) a pending callout and leaks a reference.
>> >> Unfortunately, callout_reset doesn't tell us about the reschedule.
>> >
>> > Can we look at this part first before discussing any API changes?
>> > Why can't you store the reference next to callout instance itself?
>>
>> Oh, my explanation was quite confusing... I meant that the object
>> has a reference counter and the counter is incremented to indicate
>> that it is referred by a callout. obj_ref and obj_unref can be
>> thought as obj->refcnt++ and obj->refcnt-- respectively.
>
> Just consider the object reference owned by the callout struct. You
> change the reference count when the callout is destroyed or if the
> callout is reassigned, not because it is executed.

Oh, there is a limitation I didn't mention yet; we can't use callout_halt
to stop the callout when we try to destroy an object, because of the issue
of softnet_lock vs. callout_halt (*).

(*) http://mail-index.netbsd.org/tech-net/2018/01/12/msg006612.html

We can use only callout_stop, which doesn't guarantee that a callout
has stopped.  That requires us to track the activity of the callout,
which is what a refcount does.

  ozaki-r


Re: Let callout_reset return if it reschedules a pending callout

2018-02-28 Thread Ryota Ozaki
On Wed, Feb 28, 2018 at 10:11 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Wed, Feb 28, 2018 at 05:47:13PM +0900, Ryota Ozaki wrote:
>> The feature is useful when you have a reference to an object that is
>> passed to a callout. In this case you need to take care of a
>> reference leak on callout_reset (and other APIs); it silently
>> reschedules (IOW cancels) a pending callout and leaks a reference.
>> Unfortunately, callout_reset doesn't tell us about the reschedule.
>
> Can we look at this part first before discussing any API changes?
> Why can't you store the reference next to callout instance itself?

Oh, my explanation was quite confusing... I meant that the object
has a reference counter and the counter is incremented to indicate
that it is referred by a callout. obj_ref and obj_unref can be
thought as obj->refcnt++ and obj->refcnt-- respectively.

  ozaki-r


Let callout_reset return if it reschedules a pending callout

2018-02-28 Thread Ryota Ozaki
Hi,

This proposal lets callout_reset and callout_schedule return true if
they reschedule a pending callout and false otherwise.

The feature is useful when you have a reference to an object that is
passed to a callout. In this case you need to take care of a
reference leak on callout_reset (and other APIs); it silently
reschedules (IOW cancels) a pending callout and leaks a reference.
Unfortunately, callout_reset doesn't tell us about the reschedule.

To avoid the leak with the current APIs of callout, we have to do
something like this:

  if (callout_pending(&obj->ch)) {
    expired = callout_stop(&obj->ch);
    if (!expired)
      obj_unref(obj);
  }
  obj_ref(obj);
  callout_reset(&obj->ch);

callout_pending is required to exclude the case that the callout has
never been scheduled because callout_stop just checks CALLOUT_FIRED.

The proposal makes the code simple like this:

  obj_ref(obj);
  canceled = callout_reset(&obj->ch);
  if (canceled)
    obj_unref(obj);

This is a patch (quite simple):
  http://www.netbsd.org/~ozaki-r/callout_reset.diff

Note that this proposal changes the API/ABI; however, we don't need to
tweak existing callers of callout_reset/callout_schedule because
it doesn't break the original usage.

Any comments?

Thanks,
  ozaki-r


Re: Race condition between an LWP migration and curlwp_bind

2018-02-15 Thread Ryota Ozaki
On Thu, Feb 15, 2018 at 5:30 AM, Mateusz Guzik <mjgu...@gmail.com> wrote:
> On Tue, Feb 13, 2018 at 05:14:38PM +0900, Ryota Ozaki wrote:
>>   panic: kernel diagnostic assertion "(psref->psref_cpu == curcpu())"
>> failed: file "/(hidden)/sys/kern/subr_psref.c", line 317 passive
>> reference transferred from CPU 0 to CPU 3
>>
>> I first thought that something went wrong in an ioctl handler
>> for example curlwp_bindx was called doubly and LP_BOUND was unset
>> then the LWP was migrated to another CPU. However, this kind of
>> assumptions was denied by KASSERTs in psref_release. So I doubted
>> of LP_BOUND and found there is a race condition on LWP migrations.
>>
>> curlwp_bind sets LP_BOUND to l_pflags of curlwp and that prevents
>> curlwp from migrating to another CPU until curlwp_bindx is called.
>
> The bug you found (and I trimmed) looks like the culprit, but there is
> an extra problem which probably happens to not manifest itself in terms
> of code generation: the bind/unbind inlines lack compiler barriers. See
> KPREEMPT_* inlines for comparison. The diff is definitely trivial:
>
> diff --git a/sys/sys/lwp.h b/sys/sys/lwp.h
> index 47d162271f9c..f18b76b984e4 100644
> --- a/sys/sys/lwp.h
> +++ b/sys/sys/lwp.h
> @@ -536,6 +536,7 @@ curlwp_bind(void)
>
> bound = curlwp->l_pflag & LP_BOUND;
> curlwp->l_pflag |= LP_BOUND;
> +   __insn_barrier();
>
> return bound;
>  }
> @@ -545,6 +546,7 @@ curlwp_bindx(int bound)
>  {
>
> KASSERT(curlwp->l_pflag & LP_BOUND);
> +   __insn_barrier();
> curlwp->l_pflag ^= bound ^ LP_BOUND;
>  }
>
> --
> Mateusz Guzik 

Right, the barriers are basically required as you say.

Note that for the current usage of curlwp_bind, i.e., preventing LWP
migrations between psref_acquire and psref_release, the barriers are
not a must, IIUC, because psref_acquire and psref_release are functions
and no reordering should happen between psref_* and curlwp_*. Even
if a compiler optimization tried to reorder, curlwp_* wouldn't cross
psref_* because psref_* uses percpu, which uses the kpreempt APIs that
call __insn_barrier.

Anyway I'll add the barriers.

Thanks,
  ozaki-r


Re: Race condition between an LWP migration and curlwp_bind

2018-02-13 Thread Ryota Ozaki
On Wed, Feb 14, 2018 at 8:05 AM, Christos Zoulas <chris...@astron.com> wrote:
> In article 
> <CAKrYomgZYn-YUR8yhipp-7co+DioyofrZT-hsu6LmkBWKWyp=g...@mail.gmail.com>,
> Ryota Ozaki  <ozak...@netbsd.org> wrote:
>>
>>Is the fix appropriate?
>
> Looks right to me, but it is above my pay grade :-)

Thanks, me too :)

I hope that postponing a migration is not a big deal.

  ozaki-r


Race condition between an LWP migration and curlwp_bind

2018-02-13 Thread Ryota Ozaki
Hi,

I have been encountering the following panic infrequently
during torture tests.

  panic: kernel diagnostic assertion "(psref->psref_cpu == curcpu())"
failed: file "/(hidden)/sys/kern/subr_psref.c", line 317 passive
reference transferred from CPU 0 to CPU 3
  fatal breakpoint trap in supervisor mode
  trap type 1 code 0 rip 0x8021ce28 cs 0x8 rflags 0x206 cr2
0x7f7ff7b0e020 ilevel 0 rsp 0x80003335fb60
  curlwp 0xe40025abc480 pid 15571.1 lowest kstack 0x80003335c2c0
  Stopped in pid 15571.1 (tcpdump) at netbsd:breakpoint+0x10: leave
  db{3}> bt
  breakpoint() at netbsd:breakpoint+0x10
  vpanic() at netbsd:vpanic+0x145
  kern_assert() at netbsd:kern_assert+0x4d
  psref_release() at netbsd:psref_release+0x25c
  doifioctl() at netbsd:doifioctl+0x8ac
  sys_ioctl() at netbsd:sys_ioctl+0x106
  syscall() at netbsd:syscall+0x1f2
  --- syscall (number 54) ---
  7f7ff6b1756a:
  db{3}>

The panic indicates that an LWP that holds a psref is unexpectedly
migrated to another CPU. However, the migration should be prevented
by curlwp_bind in doifioctl.

I first thought that something went wrong in an ioctl handler,
for example that curlwp_bindx was called twice so LP_BOUND was unset
and the LWP was then migrated to another CPU. However, this kind of
assumption was ruled out by KASSERTs in psref_release. So I suspected
LP_BOUND and found there is a race condition in LWP migrations.

curlwp_bind sets LP_BOUND to l_pflags of curlwp and that prevents
curlwp from migrating to another CPU until curlwp_bindx is called.
Meanwhile, there are several ways an LWP can be migrated to
another CPU, and in any case the scheduler postpones a migration
if the target LWP is running. For example, the scheduler
periodically looks for CPU-hogging LWPs and schedules them to migrate
(see sched_lwp_stats). At that point the scheduler checks the LP_BOUND
flag, and if it's set on an LWP, the scheduler doesn't schedule that LWP.
A scheduled LWP is migrated when it is leaving a running CPU (*),
i.e., in mi_switch. And mi_switch does *NOT* check the LP_BOUND flag.

(*) To be exact, sched_lwp_stats sets the CPU to migrate to on
the target LWP, and mi_switch schedules the LWP to be migrated
if a CPU is set on the LWP. Then sched_idle actually migrates
the LWP.

So here is a race condition:
- [CPU#A] An LWP is dispatched and running
- [CPU#B] The scheduler schedules the LWP to be migrated
- [CPU#A] The LWP calls curlwp_bind and sets LP_BOUND
- [CPU#A] The LWP is going to sleep and call mi_switch
- [CPU#A] The LWP is migrated to another CPU regardless
  of LP_BOUND

So this is a fix:
  http://www.netbsd.org/~ozaki-r/fix-LP_BOUND.diff

It checks the LP_BOUND flag of curlwp in mi_switch and
skips a migration if it's set.
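In shape, the check is something like the following sketch; l_target_cpu
stands for the scheduled destination CPU described in (*), and this is
illustrative only, not the literal diff:

/* In mi_switch(): act on a migration scheduled by the scheduler
 * only if curlwp_bind() is not in effect. */
if (l->l_target_cpu != NULL) {
	if ((l->l_pflag & LP_BOUND) == 0) {
		/* migrate the LWP as scheduled */
	} else {
		/* curlwp_bind() is in effect: skip the migration;
		 * the scheduler may schedule it again after
		 * curlwp_bindx() clears LP_BOUND. */
	}
}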

Is the fix appropriate?

Thanks,
  ozaki-r


Re: rw_lock_held

2018-02-06 Thread Ryota Ozaki
On Wed, Feb 7, 2018 at 2:29 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Wed, Feb 7, 2018 at 2:21 PM, Taylor R Campbell
> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> Date: Wed, 7 Feb 2018 14:14:35 +0900
>>> From: Ryota Ozaki <ozak...@netbsd.org>
>>>
>>> On Wed, Feb 7, 2018 at 2:55 AM, Taylor R Campbell
>>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> >> Date: Tue, 6 Feb 2018 19:06:33 +0900
>>> >> From: Ryota Ozaki <ozak...@netbsd.org>
>>> >>
>>> >> So I think we should do either:
>>> >> (1) fix rw_lock_held, or
>>> >> - probably it would be rw_read_held() || rw_write_held()
>>> >> (2) fix users of rw_lock_held.
>>> >> - it also would replace rw_lock_held with
>>> >>   rw_read_held() || rw_write_held()
>>> >>
>>> >> I prefer (1) because I think there is no user of the current
>>> >> behavior that is difficult to use for users (it may be useful
>>> >> as an internal utility function though).
>>> >
>>> > I like (1) too, but check for Solaris compatibility?  It is hard to
>>> > imagine that there are correct uses of the current semantics that are
>>> > not also correct uses of (1).
>>>
>>> https://docs.oracle.com/cd/E36784_01/html/E36886/rwlock-9f.html
>>>
>>> They seem to have only rw_read_locked that is compatible to our
>>> rw_read_held.
>>
>> Looks to me like rw_read_locked is actually more like:
>>
>> KASSERT(rw_read_held(l) || rw_write_held(l));
>> return rw_read_held(l);
>>
>> or
>>
>> KASSERT(rw_read_held(l) || rw_write_held(l));
>> return !rw_write_held(l);
>
> Oops. Yes, I misunderstood rw_read_held.

Well, so the proposed rw_lock_held makes the situation better but
it's not yet what I expected.

Anyway a revised rwlock.9 would be:

   rw_write_held() returns non-zero if the lock is held by the current
   LWP for write.  rw_read_held() returns non-zero if the lock is held
   (potentially by the current LWP) for read.  rw_lock_held() returns
   non-zero if the lock is held by the current LWP for write or held
   (potentially by the current LWP) for read.  Otherwise, these
   functions return zero.

It may be better to write just "rw_lock_held is equivalent to
rw_write_held() || rw_read_held()"...

  ozaki-r


Re: rw_lock_held

2018-02-06 Thread Ryota Ozaki
On Wed, Feb 7, 2018 at 2:21 PM, Paul Goyette <p...@whooppee.com> wrote:
> On Wed, 7 Feb 2018, Ryota Ozaki wrote:
>
>>> a better wording for the manpage could be:
>>>
>>> Test the lock's condition and return non-zero if the lock is
>>> potentially held by the current LWP and matches the specified
>>> condition.  Otherwise, return zero.
>>>
>>> thus if we see that lock is held (in write mode) by some other LWP then
>>> we know that is not even potentially held by the current LWP.
>>
>>
>> I think we don't need "potentially" anymore. And "matches the specified
>> condition" is unclear to me (I know it's the original wording). So I
>> prefer
>> the following one though it's a bit redundant:
>>
>>   rw_write_held() returns non-zero if the lock is held by the current
>>   LWP for write.  rw_read_held() returns non-zero if the lock is held
>>   by the current LWP for read.  rw_lock_held() returns non-zero if
>>   the lock is held by the current LWP for write or read.  Otherwise,
>>   these functions return zero.
>
>
> I like this version much better.
>
> It is also important to retain the restriction against using these functions
> for making run-time locking decisions...  In particular, it is not
> acceptable to do
>
> if (!rw_write_held(&some_lock))
> rw_enter(&some_lock, RW_WRITER);

rwlock.9 already has the sentence :) :

   These functions must never be used to make locking decisions at run
   time: they are provided only for diagnostic purposes.

  ozaki-r


Re: rw_lock_held

2018-02-06 Thread Ryota Ozaki
On Wed, Feb 7, 2018 at 2:21 PM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Wed, 7 Feb 2018 14:14:35 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> On Wed, Feb 7, 2018 at 2:55 AM, Taylor R Campbell
>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>> >> Date: Tue, 6 Feb 2018 19:06:33 +0900
>> >> From: Ryota Ozaki <ozak...@netbsd.org>
>> >>
>> >> So I think we should do either:
>> >> (1) fix rw_lock_held, or
>> >> - probably it would be rw_read_held() || rw_write_held()
>> >> (2) fix users of rw_lock_held.
>> >> - it also would replace rw_lock_held with
>> >>   rw_read_held() || rw_write_held()
>> >>
>> >> I prefer (1) because I think there is no user of the current
>> >> behavior that is difficult to use for users (it may be useful
>> >> as an internal utility function though).
>> >
>> > I like (1) too, but check for Solaris compatibility?  It is hard to
>> > imagine that there are correct uses of the current semantics that are
>> > not also correct uses of (1).
>>
>> https://docs.oracle.com/cd/E36784_01/html/E36886/rwlock-9f.html
>>
>> They seem to have only rw_read_locked that is compatible to our
>> rw_read_held.
>
> Looks to me like rw_read_locked is actually more like:
>
> KASSERT(rw_read_held(l) || rw_write_held(l));
> return rw_read_held(l);
>
> or
>
> KASSERT(rw_read_held(l) || rw_write_held(l));
> return !rw_write_held(l);

Oops. Yes, I misunderstood rw_read_held.

  ozaki-r


Re: rw_lock_held

2018-02-06 Thread Ryota Ozaki
On Wed, Feb 7, 2018 at 2:55 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Tue, 6 Feb 2018 19:06:33 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> Is there any reason that rw_lock_held checks if it's held
>> by *someone*, not by curlwp? It's confusing and most (or all?)
>> users of the API seem to expect the latter (held by curlwp).
>
> Wild guesses:
>
> (a) because it seemed fast to just test (owner & RW_THREAD) != 0, or
> (b) because Solaris did it.
>
>> So I think we should do either:
>> (1) fix rw_lock_held, or
>> - probably it would be rw_read_held() || rw_write_held()
>> (2) fix users of rw_lock_held.
>> - it also would replace rw_lock_held with
>>   rw_read_held() || rw_write_held()
>>
>> I prefer (1) because I think there is no user of the current
>> behavior that is difficult to use for users (it may be useful
>> as an internal utility function though).
>
> I like (1) too, but check for Solaris compatibility?  It is hard to
> imagine that there are correct uses of the current semantics that are
> not also correct uses of (1).

https://docs.oracle.com/cd/E36784_01/html/E36886/rwlock-9f.html

They seem to have only rw_read_locked that is compatible to our
rw_read_held.

  ozaki-r


Re: rw_lock_held

2018-02-06 Thread Ryota Ozaki
On Wed, Feb 7, 2018 at 2:01 AM, Chuck Silvers <c...@chuq.com> wrote:
> On Tue, Feb 06, 2018 at 07:06:33PM +0900, Ryota Ozaki wrote:
>> Hi,
>>
>> Is there any reason that rw_lock_held checks if it's held
>> by *someone*, not by curlwp? It's confusing and most (or all?)
>> users of the API seem to expect the latter (held by curlwp).
>>
>> Well, rwlock.9 says that rw_*_held behave so, but that's perhaps
>> just out-of-date because rw_read_held and rw_write_held check
>> if it's held by curlwp.
>>
>> So I think we should do either:
>> (1) fix rw_lock_held, or
>> - probably it would be rw_read_held() || rw_write_held()
>> (2) fix users of rw_lock_held.
>> - it also would replace rw_lock_held with
>>   rw_read_held() || rw_write_held()
>>
>> I prefer (1) because I think there is no user of the current
>> behavior that is difficult to use for users (it may be useful
>> as an internal utility function though).
>>
>> Any thoughts?
>
>
> your option (1) sounds fine to me.
> a better wording for the manpage could be:
>
> Test the lock's condition and return non-zero if the lock is
> potentially held by the current LWP and matches the specified
> condition.  Otherwise, return zero.
>
> thus if we see that lock is held (in write mode) by some other LWP then
> we know that is not even potentially held by the current LWP.

I think we don't need "potentially" anymore. And "matches the specified
condition" is unclear to me (I know it's the original wording). So I prefer
the following one though it's a bit redundant:

   rw_write_held() returns non-zero if the lock is held by the current
   LWP for write.  rw_read_held() returns non-zero if the lock is held
   by the current LWP for read.  rw_lock_held() returns non-zero if
   the lock is held by the current LWP for write or read.  Otherwise,
   these functions return zero.

  ozaki-r


rw_lock_held

2018-02-06 Thread Ryota Ozaki
Hi,

Is there any reason that rw_lock_held checks if it's held
by *someone*, not by curlwp? It's confusing and most (or all?)
users of the API seem to expect the latter (held by curlwp).

Well, rwlock.9 says that rw_*_held behave so, but that's perhaps
just out-of-date because rw_read_held and rw_write_held check
if it's held by curlwp.

So I think we should do either:
(1) fix rw_lock_held, or
- probably it would be rw_read_held() || rw_write_held()
(2) fix users of rw_lock_held.
- it also would replace rw_lock_held with
  rw_read_held() || rw_write_held()

I prefer (1) because I think there is no user of the current
behavior that is difficult to use for users (it may be useful
as an internal utility function though).
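A minimal sketch of option (1), built directly on the two existing
predicates as suggested above (a sketch of the proposed semantics, not
the current implementation):

#include <sys/rwlock.h>

/* Proposed: non-zero if the lock is held by curlwp for write, or
 * held for read (where the holder cannot be distinguished). */
int
rw_lock_held(krwlock_t *rw)
{
	return rw_read_held(rw) || rw_write_held(rw);
}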

Any thoughts?

  ozaki-r


Re: High priority xcall with an arbitrary IPL_SOFT*

2018-01-29 Thread Ryota Ozaki
On Mon, Jan 29, 2018 at 1:08 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> Hi,
>
> This proposal extends xcall to allow specifying the IPL of the high
> priority xcall. Currently the high priority xcall uses only a softint
> of IPL_SOFTSERIAL. It doesn't work if the caller depends on
> a different softint IPL.
>
> I'm now working on stability improvements of the network stack
> under load. psref_target_destroy can take longer time than we desire
> under load at worst, say 10 or 20 seconds. The cause of the delay
> is xcall and the delay can be mitigated by changing xcall to use
> the high priority. However it breaks atomicity of psref because
> psrefs used in the network stack normally run with IPL_SOFTNET.
> One workaround is to use IPL_SOFTSERIAL for such psrefs but we should
> avoid using abnormal IPL as much as possible.
>
> So I propose the above enhancement to solve the issue.
>
> This is a patch that implements the feature:
>   http://www.netbsd.org/~ozaki-r/xcall-multi-ipl.diff
> The next patch applies it to psref:
>   http://www.netbsd.org/~ozaki-r/xcall-multi-ipl-psref.diff
> The patch updates xcall.9
>   http://www.netbsd.org/~ozaki-r/xcall-multi-ipl-man.diff

I forgot to mention the API.  With the change, there will be three
usages of xcall.

Low priority:
  xc_broadcast(0, ...);
High priority:
  xc_broadcast(XC_HIGHPRI, ...);  // Use a softint of IPL_SOFTSERIAL
High priority with IPL:
  xc_broadcast(XC_HIGHPRI_IPL(ipl), ...);  // Use a softint of ipl

The change doesn't break the kernel API/ABI.
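
For reference, one way such an encoding could look is sketched below; this
is only an illustration, the actual patch may encode the IPL differently:

    /* Hypothetical encoding: pack the requested softint IPL into the
     * flags word accepted by xc_broadcast/xc_unicast. */
    #define XC_HIGHPRI              0x01
    #define XC_HIGHPRI_IPL(ipl)     (XC_HIGHPRI | ((ipl) << 1))

    /* Run func at a softint of IPL_SOFTNET on all CPUs and wait. */
    xc_wait(xc_broadcast(XC_HIGHPRI_IPL(IPL_SOFTNET), func, arg1, arg2));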

  ozaki-r


High priority xcall with an arbitrary IPL_SOFT*

2018-01-28 Thread Ryota Ozaki
Hi,

This proposal extends xcall to allow specifying the IPL of the high
priority xcall. Currently the high priority xcall uses only a softint
of IPL_SOFTSERIAL. It doesn't work if the caller depends on
a different softint IPL.

I'm now working on stability improvements of the network stack
under load. psref_target_destroy can take longer than we desire
under load, at worst 10 or 20 seconds. The cause of the delay
is xcall, and the delay can be mitigated by changing xcall to use
the high priority. However, it breaks the atomicity of psref because
psrefs used in the network stack normally run at IPL_SOFTNET.
One workaround is to use IPL_SOFTSERIAL for such psrefs, but we should
avoid using an unusual IPL as much as possible.

So I propose the above enhancement to solve the issue.

This is a patch that implements the feature:
  http://www.netbsd.org/~ozaki-r/xcall-multi-ipl.diff
The next patch applies it to psref:
  http://www.netbsd.org/~ozaki-r/xcall-multi-ipl-psref.diff
This patch updates xcall.9:
  http://www.netbsd.org/~ozaki-r/xcall-multi-ipl-man.diff

Any comments or suggestions?

Thanks,
  ozaki-r


Re: MP-safe DAD timer destruction with callout_stop

2018-01-12 Thread Ryota Ozaki
On Fri, Jan 12, 2018 at 8:27 PM, Taylor R Campbell
 wrote:
>> Date: Fri, 12 Jan 2018 04:33:06 + (UTC)
>> From: chris...@astron.com (Christos Zoulas)
>>
>> Even then (with callout_halt) nothing prevents something from
>> calling callout_schedule() or callout_reset() again between
>> that time and callout_destroy()... I have code that adds another
>> flag that stops further callouts from being reschuled. Does this
>> help you?
>
> I don't understand.  With Ozaki-san's patch, nd6_dad_stop prevents new
> references to the callout, by removing dp from the dadq.  The only
> remaining references should be in the scheduled callout itself, which
> will run at most twice more:
>
> (a) once if it was already scheduled and in the process of firing, and
> (b) once more immediately after because of the callout_reset(0).
>
> The dp is then freed by (b).  (Ozaki-san: You should also call
> callout_destroy before kmem_intr_free.)

Sure. I added it: http://www.netbsd.org/~ozaki-r/dad-callout_stop.v3.diff

  ozaki-r


MP-safe DAD timer destruction with callout_stop

2018-01-11 Thread Ryota Ozaki
Hi,

For a certain reason(*), it is hard to use callout_halt
to destroy DAD timer objects. So I was going
to fall back to callout_stop (as in NetBSD 7), which is
almost safe but still has a possibility of going wrong(**).

(*) See the thread starting at
http://mail-index.netbsd.org/source-changes-d/2017/12/26/msg009744.html
(**) http://mail-index.netbsd.org/source-changes-d/2018/01/11/msg009802.html

Discussing with @riastradh, we found a possible way to
make DAD timer destructions MP-safe with callout_stop.
This is a PoC patch:
  http://www.netbsd.org/~ozaki-r/dad-callout_stop.v2.diff

The issue with using callout_stop instead of callout_halt is
that callout_stop doesn't actually stop a running callout.
So we cannot safely free related data after calling
callout_stop.

The new approach copes with the issue by delegating the
destruction of a callout to the callout itself, which allows us
not to wait for the callout to finish. This is possible because
DAD objects are separated from other data such as ifa.
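
To illustrate the idiom, the timer handler would end up looking roughly
like the sketch below; the lock, field and flag names here are
hypothetical, see the PoC patch for the real code:

    static void
    nd6_dad_timer(void *arg)
    {
            struct dadq *dp = arg;

            mutex_enter(&nd6_dad_lock);
            if (dp->dad_destroying) {
                    /* The stopper delegated the destruction to us. */
                    mutex_exit(&nd6_dad_lock);
                    callout_destroy(&dp->dad_timer_ch);
                    kmem_intr_free(dp, sizeof(*dp));
                    return;
            }
            /* ... normal DAD timer processing ... */
            mutex_exit(&nd6_dad_lock);
    }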

Any comments or suggestions?

Thanks,
  ozaki-r


Re: workqueue_drain

2017-12-27 Thread Ryota Ozaki
On Fri, Dec 22, 2017 at 4:12 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Thu, Dec 21, 2017 at 5:56 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
>> On Thu, Dec 21, 2017 at 5:58 AM, Michael van Elst <mlel...@serpens.de> wrote:
>>> ozak...@netbsd.org (Ryota Ozaki) writes:
>>>
>>>>On Wed, Dec 20, 2017 at 8:06 PM, Michael van Elst <mlel...@serpens.de> 
>>>>wrote:
>>>>> What about a method that waits for a specific work to complete?
>>>
>>>>It might be useful but how it works in this case?
>>>
>>> You could remember the last work enqueued and wait for it to finish
>>> before destroying the queue. For this the wait function must handle
>>> the case where the work already has finished.
>>>
>>> I would consider such a function more versatile than just a drain function.
>>
>> Oh I got it. Well, that should work for my cases (bridge, pppoe) because
>> there is just one work. I'm not sure the API suits other use cases.
>>
>> Anyway I'm trying to implement the API (workqueue_wait?).
>
> Here it is: http://www.netbsd.org/~ozaki-r/workqueue_wait.diff
>
> I think a workqueue_drain based on workqueue_wait would work like this:
> workqueue_drain prepares a local work, enqueues it to the workqueue
> and waits on it with workqueue_wait. (A workqueue with WQ_PERCPU needs
> to repeat this on every CPU.) IMHO if I wanted workqueue_drain, I
> would implement it without workqueue_wait; that would be simpler.

I wrote a test for workqueue_wait:
  http://www.netbsd.org/~ozaki-r/workqueue_wait.test.diff

I'm not strongly attached to workqueue_drain, and
workqueue_wait is OK with me. I'm going to commit the workqueue_wait
patches if no objections are raised. Any other comments?
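
For the bridge-like single-work case, the intended usage pattern is
roughly the following (the softc member names are hypothetical):

    /* Stop path: first guarantee that nothing enqueues sc->sc_work
     * anymore, then wait for any in-flight instance of it. */
    workqueue_wait(sc->sc_wq, &sc->sc_work); /* returns even if done */
    workqueue_destroy(sc->sc_wq);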

Regards,
  ozaki-r


Re: workqueue_drain

2017-12-21 Thread Ryota Ozaki
On Thu, Dec 21, 2017 at 5:56 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Thu, Dec 21, 2017 at 5:58 AM, Michael van Elst <mlel...@serpens.de> wrote:
>> ozak...@netbsd.org (Ryota Ozaki) writes:
>>
>>>On Wed, Dec 20, 2017 at 8:06 PM, Michael van Elst <mlel...@serpens.de> wrote:
>>>> What about a method that waits for a specific work to complete?
>>
>>>It might be useful, but how would it work in this case?
>>
>> You could remember the last work enqueued and wait for it to finish
>> before destroying the queue. For this the wait function must handle
>> the case where the work already has finished.
>>
>> I would consider such a function more versatile than just a drain function.
>
> Oh I got it. Well, that should work for my cases (bridge, pppoe) because
> there is just one work. I'm not sure the API suits other use cases.
>
> Anyway I'm trying to implement the API (workqueue_wait?).

Here it is: http://www.netbsd.org/~ozaki-r/workqueue_wait.diff

I think a workqueue_drain based on workqueue_wait would work like this:
workqueue_drain prepares a local work, enqueues it to the workqueue
and waits on it with workqueue_wait. (A workqueue with WQ_PERCPU needs
to repeat this on every CPU.) IMHO if I wanted workqueue_drain, I
would implement it without workqueue_wait; that would be simpler.

  ozaki-r


Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-21 Thread Ryota Ozaki
On Fri, Dec 22, 2017 at 11:33 AM, Ryota Ozaki <ozaki.ry...@gmail.com> wrote:
> On Thu, Dec 21, 2017 at 7:45 PM, Frank Kardel <kar...@netbsd.org> wrote:
>> Hi,
>>
>> does this panic (with -current/amd64 as of 20171221) ring a bell with
>> anyone?
>
> Yes. Could you try the patch below?

Oops. The patch was wrong... Please use the new one below.

  ozaki-r

diff --git a/sys/net/if.c b/sys/net/if.c
index 05ac3b2273c..6c867cdce0a 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -3598,7 +3598,7 @@ if_mcast_op(ifnet_t *ifp, const unsigned long
cmd, const struct sockaddr *sa)
struct ifreq ifr;

/* CARP still doesn't deal with the lock yet */
-#if !defined(NCARP) || (NCARP == 0)
+#if (!defined(NCARP) || (NCARP == 0)) && !defined(MROUTER)
KASSERT(IFNET_LOCKED(ifp));
 #endif
if (ifp->if_mcastop != NULL)


Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-21 Thread Ryota Ozaki
On Thu, Dec 21, 2017 at 7:58 PM, Nick Hudson <sk...@netbsd.org> wrote:
> On 12/19/17 08:21, Ryota Ozaki wrote:
>>
>> On Tue, Dec 19, 2017 at 4:52 PM, Nick Hudson <nick.hud...@gmx.co.uk>
>> wrote:
>>>
>>> On 19/12/2017 03:43, Ryota Ozaki wrote:
>>>
>>>
>>>> BTW I committed a change that disables IFEF_MPSAFE by default on
>>>> all interfaces because it seems that IFEF_MPSAFE requires additional
>>>> changes to work safely with it. We should enable it by default if an
>>>> interface is guaranteed to be safe.
>>>
>>> What additional changes?
>>
>> For example, avoid updating if_flags (and reading it and depending on
>> the result) in Tx/Rx paths, and avoid using if_watchdog.
>
>
> Ah, I see.
>
> Do we have a model driver yet?

Not yet, but ixg and wm are close.

  ozaki-r


Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-21 Thread Ryota Ozaki
On Thu, Dec 21, 2017 at 7:45 PM, Frank Kardel  wrote:
> Hi,
>
> does this panic (with -current/amd64 as of 20171221) ring a bell with
> anyone?

Yes. Could you try the patch below?

Thanks,
  ozaki-r


diff --git a/sys/net/if.c b/sys/net/if.c
index 05ac3b2273c..fc3e144373f 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -3598,7 +3598,7 @@ if_mcast_op(ifnet_t *ifp, const unsigned long
cmd, const struct sockaddr *sa)
struct ifreq ifr;

/* CARP still doesn't deal with the lock yet */
-#if !defined(NCARP) || (NCARP == 0)
+#if !defined(NCARP) || (NCARP == 0) || !defined(MROUTER)
KASSERT(IFNET_LOCKED(ifp));
 #endif
if (ifp->if_mcastop != NULL)


Re: workqueue_drain

2017-12-21 Thread Ryota Ozaki
On Wed, Dec 20, 2017 at 5:41 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Wed, Dec 20, 2017 at 4:57 PM, Paul Goyette <p...@whooppee.com> wrote:
>> On Wed, 20 Dec 2017, Ryota Ozaki wrote:
>>
>>> Hi,
>>>
>>> workqueue_destroy requires that its queue doesn't have
>>> any pending work (and that no new work will be enqueued anymore)
>>> before being called. Currently users of workqueue have
>>> to ensure the contract by themselves. It's possible,
>>> but it may require adding extra synchronization mechanisms
>>> such as a mutex and condvar only for that purpose.
>>>
>>> workqueue itself can provide an API to help with this.
>>> The API, say workqueue_drain, waits for all pending works to
>>> finish. With it, the caller only needs to ensure that no
>>> new work will be enqueued.
>>>
>>> Here is a patch:
>>>  http://www.netbsd.org/~ozaki-r/workqueue_drain.diff
>>> (I will update workqueue.4 once the proposal is accepted.)
>>>
>>> And this is a patch that applies workqueue_drain to bridge(4):
>>>  http://www.netbsd.org/~ozaki-r/workqueue_drain-example.diff
>>>
>>> Without workqueue_drain, we would have to add an extra
>>> condvar to wait for the workqueue worker of bridge to finish.
>>>
>>> Any comments or suggestions?
>>
>>
>> You could add a new flag bit for WQ_DRAINING when you start to
>> drain.  Then workqueue_enqueue() could enforce the "no new work"
>> with
>>
>> KASSERT_MSG((wq->wq_flags & WQ_DRAINING) == 0,
>> "adding work to a draining workqueue")
>
> Good idea!
>
> I updated the patch:
>   http://www.netbsd.org/~ozaki-r/workqueue_drain.v2.diff

The implementation was not enough because workqueue_drain just waits
for works to start, not to finish. So to wait for completion, we need
more changes like this:
  http://www.netbsd.org/~ozaki-r/workqueue_drain.v3.diff
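
Roughly speaking, the drain has to wait until the worker has finished the
last work, not merely dequeued it; a sketch of the wait condition (all
names are hypothetical):

    mutex_enter(&q->q_mutex);
    while (!SIMPLEQ_EMPTY(&q->q_queue) || q->q_worker_busy)
            cv_wait(&q->q_cv, &q->q_mutex);
    mutex_exit(&q->q_mutex);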

  ozaki-r


Re: workqueue_drain

2017-12-21 Thread Ryota Ozaki
On Thu, Dec 21, 2017 at 5:58 AM, Michael van Elst <mlel...@serpens.de> wrote:
> ozak...@netbsd.org (Ryota Ozaki) writes:
>
>>On Wed, Dec 20, 2017 at 8:06 PM, Michael van Elst <mlel...@serpens.de> wrote:
>>> What about a method that waits for a specific work to complete?
>
>>It might be useful, but how would it work in this case?
>
> You could remember the last work enqueued and wait for it to finish
> before destroying the queue. For this the wait function must handle
> the case where the work already has finished.
>
> I would consider such a function more versatile than just a drain function.

Oh I got it. Well, that should work for my cases (bridge, pppoe) because
there is just one work. I'm not sure the API suits other use cases.

Anyway I'm trying to implement the API (workqueue_wait?).

Thanks,
  ozaki-r


Re: workqueue_drain

2017-12-20 Thread Ryota Ozaki
On Wed, Dec 20, 2017 at 8:06 PM, Michael van Elst <mlel...@serpens.de> wrote:
> ozak...@netbsd.org (Ryota Ozaki) writes:
>
>>Here is a patch:
>>  http://www.netbsd.org/~ozaki-r/workqueue_drain.diff
>>(I will update workqueue.4 once the proposal is accepted.)
>
> What about a method that waits for a specific work to complete?

It might be useful, but how would it work in this case?

  ozaki-r


Re: workqueue_drain

2017-12-20 Thread Ryota Ozaki
On Wed, Dec 20, 2017 at 5:41 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Wed, Dec 20, 2017 at 4:57 PM, Paul Goyette <p...@whooppee.com> wrote:
>> On Wed, 20 Dec 2017, Ryota Ozaki wrote:
>>
>>> Hi,
>>>
>>> workqueue_destroy requires that its queue doesn't have
>>> any pending work (and that no new work will be enqueued anymore)
>>> before being called. Currently users of workqueue have
>>> to ensure the contract by themselves. It's possible,
>>> but it may require adding extra synchronization mechanisms
>>> such as a mutex and condvar only for that purpose.
>>>
>>> workqueue itself can provide an API to help with this.
>>> The API, say workqueue_drain, waits for all pending works to
>>> finish. With it, the caller only needs to ensure that no
>>> new work will be enqueued.
>>>
>>> Here is a patch:
>>>  http://www.netbsd.org/~ozaki-r/workqueue_drain.diff
>>> (I will update workqueue.4 once the proposal is accepted.)
>>>
>>> And this is a patch that applies workqueue_drain to bridge(4):
>>>  http://www.netbsd.org/~ozaki-r/workqueue_drain-example.diff
>>>
>>> Without workqueue_drain, we would have to add an extra
>>> condvar to wait for the workqueue worker of bridge to finish.
>>>
>>> Any comments or suggestions?
>>
>>
>> You could add a new flag bit for WQ_DRAINING when you start to
>> drain.  Then workqueue_enqueue() could enforce the "no new work"
>> with
>>
>> KASSERT_MSG((wq->wq_flags & WQ_DRAINING) == 0,
>> "adding work to a draining workqueue")
>
> Good idea!
>
> I updated the patch:
>   http://www.netbsd.org/~ozaki-r/workqueue_drain.v2.diff

I noticed an issue. After workqueue_drain we can't enqueue a work
to the workqueue; however, we may want to reuse the workqueue.
For bridge(4), workqueue_drain is called in bridge_stop, which is
if_stop, so the workqueue is just stopped temporarily and it
can be reenabled again in bridge_init (if_init). We perhaps need
an API to clear the WQ_DRAINING flag (workqueue_activate?).

  ozaki-r


Re: workqueue_drain

2017-12-20 Thread Ryota Ozaki
On Wed, Dec 20, 2017 at 4:57 PM, Paul Goyette <p...@whooppee.com> wrote:
> On Wed, 20 Dec 2017, Ryota Ozaki wrote:
>
>> Hi,
>>
>> workqueue_destroy requires that its queue doesn't have
>> any pending work (and that no new work will be enqueued anymore)
>> before being called. Currently users of workqueue have
>> to ensure the contract by themselves. It's possible,
>> but it may require adding extra synchronization mechanisms
>> such as a mutex and condvar only for that purpose.
>>
>> workqueue itself can provide an API to help with this.
>> The API, say workqueue_drain, waits for all pending works to
>> finish. With it, the caller only needs to ensure that no
>> new work will be enqueued.
>>
>> Here is a patch:
>>  http://www.netbsd.org/~ozaki-r/workqueue_drain.diff
>> (I will update workqueue.4 once the proposal is accepted.)
>>
>> And this is a patch that applies workqueue_drain to bridge(4):
>>  http://www.netbsd.org/~ozaki-r/workqueue_drain-example.diff
>>
>> Without workqueue_drain, we would have to add an extra
>> condvar to wait for the workqueue worker of bridge to finish.
>>
>> Any comments or suggestions?
>
>
> You could add a new flag bit for WQ_DRAINING when you start to
> drain.  Then workqueue_enqueue() could enforce the "no new work"
> with
>
> KASSERT_MSG((wq->wq_flags & WQ_DRAINING) == 0,
> "adding work to a draining workqueue")

Good idea!

I updated the patch:
  http://www.netbsd.org/~ozaki-r/workqueue_drain.v2.diff

Thanks,
  ozaki-r


workqueue_drain

2017-12-19 Thread Ryota Ozaki
Hi,

workqueue_destroy requires that its queue doesn't have
any pending work (and that no new work will be enqueued anymore)
before being called. Currently users of workqueue have
to ensure the contract by themselves. It's possible,
but it may require adding extra synchronization mechanisms
such as a mutex and condvar only for that purpose.

workqueue itself can provide an API to help with this.
The API, say workqueue_drain, waits for all pending works to
finish. With it, the caller only needs to ensure that no
new work will be enqueued.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/workqueue_drain.diff
(I will update workqueue.4 once the proposal is accepted.)

And this is a patch that applies workqueue_drain to bridge(4):
  http://www.netbsd.org/~ozaki-r/workqueue_drain-example.diff

Without workqueue_drain, we would have to add an extra
condvar to wait for the workqueue worker of bridge to finish.

Any comments or suggestions?

Thanks,
  ozaki-r


Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-19 Thread Ryota Ozaki
On Thu, Dec 14, 2017 at 10:41 PM, Nick Hudson <nick.hud...@gmx.co.uk> wrote:
>
>
> On 14/12/2017 10:48, Ryota Ozaki wrote:
>>
>> On Fri, Dec 8, 2017 at 7:53 PM, Nick Hudson <sk...@netbsd.org> wrote:
>>
>
>>> Not sure I follow this. I think we agree that the driver should not use
>>> if_flags in the rx/tx path (anymore).
>>
>> Yes. Drivers should provide their own method.
>
>
> Great.
>
>>
>> Anyway I updated the document that reflects recent changes:
>>http://www.netbsd.org/~ozaki-r/ifnet-locks.v2.diff
>
>
> Some wording improvement suggestions...
>
> @@ -391,11 +429,33 @@ typedef struct ifnet {
>  #defineIFEF_NO_LINK_STATE_CHANGE   __BIT(1)/* doesn't
> use link state interrupts */
>   /*
> - * The following if_XXX() handlers don't take KERNEL_LOCK if the interface
> - * is set IFEF_MPSAFE:
> - *   - if_start
> - *   - if_output
> - *   - if_ioctl
> + * The guidline to enable IFEF_MPSAFE on an interface
>
> The guidelines for converting an interface to IFEF_MPSAFE are as
> follows
>
> + * Enabling IFEF_MPSAFE on an interface suppress taking KERNEL_LOCK
> + * on the following handlers:
>
> Enabling IFEF_MPSAFE on an interface suppresses taking KERNEL_LOCK
> when
> calling the following handlers:

Thanks. I incorporated the suggestions and committed the updated document.

BTW I committed a change that disables IFEF_MPSAFE by default on
all interfaces, because it seems that interfaces require additional
changes to work safely with IFEF_MPSAFE. We should enable it by default
only if an interface is guaranteed to be safe.

  ozaki-r



Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-19 Thread Ryota Ozaki
On Tue, Dec 19, 2017 at 4:52 PM, Nick Hudson <nick.hud...@gmx.co.uk> wrote:
> On 19/12/2017 03:43, Ryota Ozaki wrote:
>
>
>> BTW I committed a change that disables IFEF_MPSAFE by default on
>> all interfaces because it seems that IFEF_MPSAFE requires additional
>> changes to work safely with it. We should enable it by default if an
>> interface is guaranteed to be safe.
>
> What additional changes?

For example, avoid updating if_flags (and reading it and depending on
the result) in Tx/Rx paths, and avoid using if_watchdog.

  ozaki-r



Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-14 Thread Ryota Ozaki
On Fri, Dec 8, 2017 at 7:53 PM, Nick Hudson <sk...@netbsd.org> wrote:
> On 12/06/17 11:11, Ryota Ozaki wrote:
>>
>> On Tue, Nov 28, 2017 at 6:40 PM, Ryota Ozaki <ozaki.ry...@gmail.com>
>> wrote:
>>>
>>> On Mon, Nov 27, 2017 at 12:24 AM, Nick Hudson <sk...@netbsd.org> wrote:
>>>>
>>>> On 11/17/17 07:44, Ryota Ozaki wrote:
>>>>>
>>>>>
> [snip]
>>>>
>>>> Hi,
>>>
>>> (Sorry for the late reply. I was sick in bed for days...)
>>>
>>>> Can you document the protection requirements of the struct ifnet
>>>> members,
>>>> e.g. if_flags.
>>>>
>>>> Ideally by annotating it like struct proc
>>>>
>>>> http://src.illumos.org/source/xref/netbsd-src/sys/sys/proc.h#230
>>>
>>> OK. I'm trying to add annotations like that.
>
>
> It's my turn to say sorry for a late reply... Sorry and thanks for working
> on this.

NP. I tend to be late...

>
>
>> Though some more changes are still needed, I fixed some race conditions
>> on ifnet and wrote up the lock requirements of ifnet:
>> http://www.netbsd.org/~ozaki-r/ifnet-locks.diff
>>
>> There remain some unsafe members. Those for carp, agr and pf should be
>> fixed
>> when making them MP-safe. Another remainder is a set of statistic
>> counters,
>> which will be MP-safe (by making them per-CPU counters) sooner or later.
>>
>> if_flags should be modified while holding if_ioctl_lock. If an interface
>> enables IFEF_MPSAFE, the interface must meet the contract, which requires
>> the interface to update if_flags only in XXX_ioctl; the interface also
>> has to give up using the IFF_OACTIVE flag, which doesn't suit multi-core
>> systems and hardware multi-queue NICs.
>
>
> I'd pretty much come to the same conclusion, but wanted to make sure we
> shared the same understanding.
>
>> I'm not sure yet whether we have to synchronize
>> updates and reads of if_flags. (Anyway, we should
>> NOT do that because the synchronization would hurt performance.)
>
>
> Not sure I follow this. I think we agree that the driver should not use
> if_flags in the rx/tx path (anymore).

Yes. Drivers should provide their own method.

Anyway I updated the document that reflects recent changes:
  http://www.netbsd.org/~ozaki-r/ifnet-locks.v2.diff

  ozaki-r



Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-12-11 Thread Ryota Ozaki
On Fri, Dec 8, 2017 at 6:51 PM, Kengo NAKAHARA  wrote:
> Hi,
>
> On 2017/12/07 14:21, Taylor R Campbell wrote:
>> I dropped this thread on the floor a while ago and I forget what the
>> status was.  I've had a patch sitting in my tree for a while which I
>> brushed off to put the list update logic in separate functions, as I
>> think chuq requested a while ago, but still keep it all isolated to
>> subr_psref.c and avoid defining any new macros.
>>
>> If you've measured that SLIST works better -- which would make sense
>> because the typical bracketed psref_acquire/release nesting makes a
>> nice LIFO stack of the things so that SLIST_REMOVE should usually be
>> done in the first iteration -- then I'm happy to replace it by SLIST.
>> We should just make sure to fix this bug before netbsd-8 goes out!
>>
>> Thoughts?
>
> I measured IPv4 forwarding performance of your patch (PSREF_LIST) version
> and the SLIST version. At first, the result was quite affected by the
> optimization option "-falign-functions"... Based on this, it seems
> there is almost no difference between PSREF_LIST and SLIST.
>
> However, your patch has a large diff... From the point of view of code
> stability, the smaller-diff SLIST version would be better for the netbsd-8
> branch to fix the bug, because your patch causes some new ATF failures
> such as ldp_regen and route_change_ifp (reported by ozaki-r@n.o). We can
> probably fix them at once, but guaranteeing stability would take more time.

Hmm, the same failures also happen with the SLIST version. IIRC there
was no regression with it two months ago, so I didn't test it thoroughly
this time. Anyway, I guess the failures are not caused by your patch.
I'm sorry for that.

I have no idea why the failures happen. The patches may dig out a
latent bug in -current.

  ozaki-r


Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-12-06 Thread Ryota Ozaki
On Tue, Nov 28, 2017 at 6:40 PM, Ryota Ozaki <ozaki.ry...@gmail.com> wrote:
> On Mon, Nov 27, 2017 at 12:24 AM, Nick Hudson <sk...@netbsd.org> wrote:
>> On 11/17/17 07:44, Ryota Ozaki wrote:
>>>
>>> On Wed, Nov 15, 2017 at 1:53 PM, Ryota Ozaki <ozaki.ry...@gmail.com>
>>> wrote:
>>>>
>>>> On Fri, Nov 10, 2017 at 6:35 PM, Ryota Ozaki <ozaki.ry...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> http://www.netbsd.org/~ozaki-r/IFEF_MPSAFE.diff
>>>>>
>>>>> I'm going to commit the above change that integrates
>>>>> IFEF_OUTPUT_MPSAFE and IFEF_START_MPSAFE flags into
>>>>> IFEF_MPSAFE.
>>>>>
>>>>> The motivation is to not waste if_extflags bits. I'm now
>>>>> trying to make if_ioctl() hold KERNEL_LOCK selectively
>>>>> for some reasons as well as if_start() and if_output().
>>>>
>>>> BTW this is a patch for this plan:
>>>>http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.diff
>>>>
>>>> It removes KERNEL_LOCK for if_ioctl from soo_ioctl and
>>>> selectively takes it in doifioctl. To this end, some fine-grain
>>>> KERNEL_LOCKs have to be added where calling components/functions
>>>> that aren't MP-safe.
>>>
>>> Revised and rebased on the latest NET_MPSAFE macros. The new patch also provides
>>> similar macros:
>>>http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.revised.diff
>>>
>>>ozaki-r
>>>
>> Hi,
>
> (Sorry for the late reply. I was sick in bed for days...)
>
>>
>> Can you document the protection requirements of the struct ifnet members,
>> e.g. if_flags.
>>
>> Ideally by annotating it like struct proc
>>
>> http://src.illumos.org/source/xref/netbsd-src/sys/sys/proc.h#230
>
> OK. I'm trying to add annotations like that.

Though some more changes are still needed, I fixed some race conditions
on ifnet and wrote up the lock requirements of ifnet:
   http://www.netbsd.org/~ozaki-r/ifnet-locks.diff

There remain some unsafe members. Those for carp, agr and pf should be fixed
when making them MP-safe. Another remainder is a set of statistic counters,
which will be MP-safe (by making them per-CPU counters) sooner or later.

if_flags should be modified while holding if_ioctl_lock. If an interface
enables IFEF_MPSAFE, the interface must meet the contract, which requires
the interface to update if_flags only in XXX_ioctl; the interface also has
to give up using the IFF_OACTIVE flag, which doesn't suit multi-core
systems and hardware multi-queue NICs. I'm not sure yet whether we have to
synchronize updates and reads of if_flags. (Anyway, we should
NOT do that because the synchronization would hurt performance.)

Thanks,
  ozaki-r



Re: struct ifnet locking [was Re: IFEF_MPSAFE]

2017-11-28 Thread Ryota Ozaki
On Mon, Nov 27, 2017 at 12:24 AM, Nick Hudson <sk...@netbsd.org> wrote:
> On 11/17/17 07:44, Ryota Ozaki wrote:
>>
>> On Wed, Nov 15, 2017 at 1:53 PM, Ryota Ozaki <ozaki.ry...@gmail.com>
>> wrote:
>>>
>>> On Fri, Nov 10, 2017 at 6:35 PM, Ryota Ozaki <ozaki.ry...@gmail.com>
>>> wrote:
>>>>
>>>> Hi,
>>>>
>>>> http://www.netbsd.org/~ozaki-r/IFEF_MPSAFE.diff
>>>>
>>>> I'm going to commit the above change that integrates
>>>> IFEF_OUTPUT_MPSAFE and IFEF_START_MPSAFE flags into
>>>> IFEF_MPSAFE.
>>>>
>>>> The motivation is to not waste if_extflags bits. I'm now
>>>> trying to make if_ioctl() hold KERNEL_LOCK selectively
>>>> for some reasons as well as if_start() and if_output().
>>>
>>> BTW this is a patch for this plan:
>>>http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.diff
>>>
>>> It removes KERNEL_LOCK for if_ioctl from soo_ioctl and
>>> selectively takes it in doifioctl. To this end, some fine-grain
>>> KERNEL_LOCKs have to be added where calling components/functions
>>> that aren't MP-safe.
>>
>> Revised and rebased on the latest NET_MPSAFE macros. The new patch also provides
>> similar macros:
>>http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.revised.diff
>>
>>ozaki-r
>>
> Hi,

(Sorry for the late reply. I was sick in bed for days...)

>
> Can you document the protection requirements of the struct ifnet members,
> e.g. if_flags.
>
> Ideally by annotating it like struct proc
>
> http://src.illumos.org/source/xref/netbsd-src/sys/sys/proc.h#230

OK. I'm trying to add annotations like that.

  ozaki-r



Re: Debugging features for pserialize(9)

2017-11-19 Thread Ryota Ozaki
On Mon, Nov 13, 2017 at 12:44 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Sat, Nov 11, 2017 at 1:14 AM, Taylor R Campbell
> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> Date: Fri, 10 Nov 2017 18:54:58 +0900
>>> From: Ryota Ozaki <ozak...@netbsd.org>
>>>
>>> Hi,
>>>
>>> I implemented two debugging features for pserialize(9):
>>>   http://www.netbsd.org/~ozaki-r/debugging-pserialize.diff
>>>
>>> It detects:
>>> - if a context switch happens in a read section, and
>>> - if a sleepable function is called in a read section.
>>>
>>> Any comments or suggestions? Is the implementation correct?
>>>
>>> Anyway the features have found out two new bugs :)
>>
>> Looks great!  I'm not surprised it found bugs.  We should have done
>> something like this a long time ago.
>>
>> One bug: LOCKDEBUG is not supposed to change the kernel ABI.  So you
>> should unconditionally define two different functions:
>>
>> pserialize_in_read_section()
>>
>>True unless we can _guarantee_ the caller is not in a pserialize
>>read section.  To be used only for diagnostic assertions like:
>>
>>KASSERT(pserialize_in_read_section());
>>
>> pserialize_not_in_read_section()
>>
>>True unless we can _guarantee_ the caller is in a pserialize read
>>section.  To be used only for diagnostic assertions like:
>>
>>KASSERT(pserialize_not_in_read_section());
>>
>> That way, modules will continue to work with or without LOCKDEBUG.
>>
>> assert_sleepable would then use !pserialize_not_in_read_section(),
>> which is a slightly confusing double-negative, but it would be as if
>> you had written KASSERT(pserialize_not_in_read_section()) -- the
>> KASSERT internally produces the same `if (!...)'.
>> pserialize_switchpoint can then also just
>> KASSERT(pserialize_not_in_read_section()).
>
> That really makes sense! I revised the patch:
>   http://www.netbsd.org/~ozaki-r/debugging-pserialize.revised.diff
>
>>
>>
>> One nit: Any need to use signed int for the nread?  Why not unsigned?
>
> I just want to detect the counter going negative naturally, though
> using a signed int is not a must.

So I've changed the patch to use uint32_t:
  http://www.netbsd.org/~ozaki-r/debugging-pserialize.uint32_t.diff

  ozaki-r


Re: IFEF_MPSAFE

2017-11-16 Thread Ryota Ozaki
On Wed, Nov 15, 2017 at 1:53 PM, Ryota Ozaki <ozaki.ry...@gmail.com> wrote:
> On Fri, Nov 10, 2017 at 6:35 PM, Ryota Ozaki <ozaki.ry...@gmail.com> wrote:
>> Hi,
>>
>> http://www.netbsd.org/~ozaki-r/IFEF_MPSAFE.diff
>>
>> I'm going to commit the above change that integrates
>> IFEF_OUTPUT_MPSAFE and IFEF_START_MPSAFE flags into
>> IFEF_MPSAFE.
>>
>> The motivation is to not waste if_extflags bits. I'm now
>> trying to make if_ioctl() hold KERNEL_LOCK selectively
>> for some reasons as well as if_start() and if_output().
>
> BTW this is a patch for this plan:
>   http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.diff
>
> It removes KERNEL_LOCK for if_ioctl from soo_ioctl and
> selectively takes it in doifioctl. To this end, some fine-grain
> KERNEL_LOCKs have to be added where calling components/functions
> that aren't MP-safe.

Revised and rebased on the latest NET_MPSAFE macros. The new patch also provides
similar macros:
  http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.revised.diff

  ozaki-r



Re: IFEF_MPSAFE

2017-11-14 Thread Ryota Ozaki
On Fri, Nov 10, 2017 at 6:35 PM, Ryota Ozaki <ozaki.ry...@gmail.com> wrote:
> Hi,
>
> http://www.netbsd.org/~ozaki-r/IFEF_MPSAFE.diff
>
> I'm going to commit the above change that integrates
> IFEF_OUTPUT_MPSAFE and IFEF_START_MPSAFE flags into
> IFEF_MPSAFE.
>
> The motivation is to not waste if_extflags bits. I'm now
> trying to make if_ioctl() hold KERNEL_LOCK selectively
> for some reasons as well as if_start() and if_output().

BTW this is a patch for this plan:
  http://www.netbsd.org/~ozaki-r/if_ioctl-no-KERNEL_LOCK.diff

It removes KERNEL_LOCK for if_ioctl from soo_ioctl and
selectively takes it in doifioctl. To this end, some fine-grained
KERNEL_LOCKs have to be added around calls to components/functions
that aren't MP-safe.

  ozaki-r

> But adding one more flag for if_ioctl() is I think
> wasteful. Also there are other functions such as if_init()
> and if_slowtimo() that would also need a flag.
>
> So I propose to have just one flag for indications of
> MP-safe. If an interface has both MP-safe and non-MP-safe
> operations at a time, we have to set the IFEF_MPSAFE flag
> and let callees of non-MP-safe operations take KERNEL_LOCK.
>
> This change breaks the ABI and needs a kernel version bump,
> however, IFEF_*_MPSAFE flags are new to netbsd-8 so it
> doesn't break backward compatibility.
>
> Any comments or objections?
>
> Thanks,
>   ozaki-r



Re: Debugging features for pserialize(9)

2017-11-12 Thread Ryota Ozaki
On Sat, Nov 11, 2017 at 1:14 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Fri, 10 Nov 2017 18:54:58 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> Hi,
>>
>> I implemented two debugging features for pserialize(9):
>>   http://www.netbsd.org/~ozaki-r/debugging-pserialize.diff
>>
>> It detects:
>> - if a context switch happens in a read section, and
>> - if a sleepable function is called in a read section.
>>
>> Any comments or suggestions? Is the implementation correct?
>>
>> Anyway the features have found out two new bugs :)
>
> Looks great!  I'm not surprised it found bugs.  We should have done
> something like this a long time ago.
>
> One bug: LOCKDEBUG is not supposed to change the kernel ABI.  So you
> should unconditionally define two different functions:
>
> pserialize_in_read_section()
>
>True unless we can _guarantee_ the caller is not in a pserialize
>read section.  To be used only for diagnostic assertions like:
>
>KASSERT(pserialize_in_read_section());
>
> pserialize_not_in_read_section()
>
>True unless we can _guarantee_ the caller is in a pserialize read
>section.  To be used only for diagnostic assertions like:
>
>KASSERT(pserialize_not_in_read_section());
>
> That way, modules will continue to work with or without LOCKDEBUG.
>
> assert_sleepable would then use !pserialize_not_in_read_section(),
> which is a slightly confusing double-negative, but it would be as if
> you had written KASSERT(pserialize_not_in_read_section()) -- the
> KASSERT internally produces the same `if (!...)'.
> pserialize_switchpoint can then also just
> KASSERT(pserialize_not_in_read_section()).

That really makes sense! I revised the patch:
  http://www.netbsd.org/~ozaki-r/debugging-pserialize.revised.diff

>
>
> One nit: Any need to use signed int for the nread?  Why not unsigned?

I just want to detect the counter going negative naturally, though
using a signed int is not a must.
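
For reference, a minimal sketch of how the predicate pair could be backed
by such an nread counter under LOCKDEBUG (the per-CPU field name is an
assumption):

    bool
    pserialize_in_read_section(void)
    {
    #ifdef LOCKDEBUG
            return kpreempt_disabled() && curcpu()->ci_psz_nread > 0;
    #else
            return true;    /* can't prove the caller is outside one */
    #endif
    }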

Thanks,
  ozaki-r


IFEF_MPSAFE

2017-11-10 Thread Ryota Ozaki
Hi,

http://www.netbsd.org/~ozaki-r/IFEF_MPSAFE.diff

I'm going to commit the above change that integrates
IFEF_OUTPUT_MPSAFE and IFEF_START_MPSAFE flags into
IFEF_MPSAFE.

The motivation is to not waste if_extflags bits. I'm now
trying to make if_ioctl() hold KERNEL_LOCK selectively,
as is already done for if_start() and if_output().
But adding one more flag just for if_ioctl() is, I think,
wasteful. Also, there are other functions such as if_init()
and if_slowtimo() that would also need a flag.

So I propose having just one flag to indicate
MP-safety. If an interface has both MP-safe and non-MP-safe
operations at the same time, we have to set the IFEF_MPSAFE flag
and let the callees of non-MP-safe operations take KERNEL_LOCK.
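
Concretely, callers of such handlers would then do something like the
following sketch (simplified, not the actual patch):

    /* Take KERNEL_LOCK only for interfaces that don't claim MP-safety. */
    if ((ifp->if_extflags & IFEF_MPSAFE) == 0)
            KERNEL_LOCK(1, NULL);
    error = (*ifp->if_ioctl)(ifp, cmd, data);
    if ((ifp->if_extflags & IFEF_MPSAFE) == 0)
            KERNEL_UNLOCK_ONE(NULL);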

This change breaks the ABI and needs a kernel version bump;
however, the IFEF_*_MPSAFE flags are new to netbsd-8, so it
doesn't break backward compatibility.

Any comments or objections?

Thanks,
  ozaki-r



Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-10-06 Thread Ryota Ozaki
On Fri, Oct 6, 2017 at 4:24 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Fri, Oct 6, 2017 at 1:14 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
>> On Fri, Oct 6, 2017 at 11:58 AM, Taylor R Campbell
>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>>> Date: Fri, 6 Oct 2017 11:26:40 +0900
>>>> From: Ryota Ozaki <ozak...@netbsd.org>
>>>>
>>>> On Mon, Oct 2, 2017 at 11:13 PM, Taylor R Campbell
>>>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>>> > Quick summary of the problem:
>>>> >
>>>> > Possible solutions.  I'm leaning toward (6), to open-code the linked
>>>> > list operations for this special purpose, with compile-tested patch
>>>> > attached.  This changes the text of psref.h, but shouldn't change the
>>>> > ABI.  Comments?
>>>>
>>>> How about using SLIST instead of open-coding? The instructions of them
>>>> are very similar, but the SLIST version is slightly simpler.
>>>
>>> I avoided that because n psref_release operations takes expected and
>>> worst-case O(n^2) time and there's no constant bound on the latency of
>>> a single psref_release operation.  But maybe n is always small enough
>>> that it doesn't matter -- perhaps enough that the concrete cost of
>>> maintaining a doubly-linked list is higher.
>>
>> I also suppose that a target being released is at the head of the list
>> because targets of psref are typically manipulated in last-in-first-out
>> manner. But yes psref_release of SLIST version can be O(n) in worst
>> cases theoretically.
>>
>> Well, when I think this kind of problems I tend to think of our usages,
>> i.e., network processing as a router. In such usages, psref is used in
>> softint and the last-in-first-out manner would keep in many cases. But
>> in other usages such as uses in threads the assumption is perhaps not
>> really.
>>
>>>
>>> (My desire to avoid thinking about bounds on n is also what motivated
>>> me to use a linked list instead of an array in the first place.)
>>>
>>> Note that your patch changes the ABI of struct psref!
>>
>> Let's bump the kernel version!
>>
>>>
>>> I wonder whether the open-coded version would do better if it
>>> unconditionally loaded the percpu:
>>>
>>> pcpu = percpu_getref(class->prc_percpu);
>>> KASSERTMSG(psref->psref_prevp == NULL || *psref->psref_prevp == 
>>> psref,
>>> "psref %p prevp %p points at %p",
>>> psref, psref->psref_prevp, *psref->psref_prevp);
>>> KASSERTMSG(psref->psref_prevp != NULL || pcpu->pcpu_first == psref,
>>> "psref %p marked as first but psref_cpu %p on %d first is %p",
>>> psref, pcpu, cpu_index(curcpu()), pcpu->pcpu_first);
>>> *(psref->psref_prevp ? psref->psref_prevp : &pcpu->pcpu_first) =
>>> psref->psref_next;
>>> percpu_putref(class->prc_percpu);
>>>
>>> With DIAGNOSTIC disabled, I get a conditional move instruction instead
>>> of branches this way:
>>>
>>>  4f9:   e8 00 00 00 00  callq  4fe <psref_release+0x93>
>>> 4fa: R_X86_64_PC32 
>>> percpu_getref+0xfffc
>>>  4fe:   48 8b 53 08 mov0x8(%rbx),%rdx
>>>  502:   48 85 d2test   %rdx,%rdx
>>>  505:   48 0f 44 d0 cmove  %rax,%rdx
>>>  509:   48 8b 03mov(%rbx),%rax
>>>  50c:   48 89 02mov%rax,(%rdx)
>>>  50f:   49 8b 7c 24 20  mov0x20(%r12),%rdi
>>>  514:   e8 00 00 00 00  callq  519 <psref_release+0xae>
>>> 515: R_X86_64_PC32 
>>> percpu_putref+0xfffc
>>>
>>> Also, my original patch was missing a percpu_putref.  I hope you put
>>> it back before you ran your test!
>>
>> I'll test with the patch in the other mail later.
>
> The above results were actually measured by knakahara@ and
> this time I've measured by myself (the measurement hardware
> is the same and the kernel configs of his and mine should
> be almost the same though).

I noticed a difference between his setup and mine: the check-out date of
the source code. Mine was checked out today, while his was probably from
some days ago.

>
> ...so the results show a slightly different trend:
>   - original:  137.9 138.0 (144.7) Mbps
>   - open-code: 135.2 134.6 (138.5) Mbps
>   - SLIST: 140.7 141.4 (140.1) Mbps

...though these results are still puzzling.

  ozaki-r

>
> I think knakahara@ (or someone else) needs to re-evaluate :-/
>
>   ozaki-r


Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-10-06 Thread Ryota Ozaki
On Fri, Oct 6, 2017 at 1:14 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Fri, Oct 6, 2017 at 11:58 AM, Taylor R Campbell
> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> Date: Fri, 6 Oct 2017 11:26:40 +0900
>>> From: Ryota Ozaki <ozak...@netbsd.org>
>>>
>>> On Mon, Oct 2, 2017 at 11:13 PM, Taylor R Campbell
>>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>>> > Quick summary of the problem:
>>> >
>>> > Possible solutions.  I'm leaning toward (6), to open-code the linked
>>> > list operations for this special purpose, with compile-tested patch
>>> > attached.  This changes the text of psref.h, but shouldn't change the
>>> > ABI.  Comments?
>>>
>>> How about using SLIST instead of open-coding? The instructions of them
>>> are very similar, but the SLIST version is slightly simpler.
>>
>> I avoided that because n psref_release operations takes expected and
>> worst-case O(n^2) time and there's no constant bound on the latency of
>> a single psref_release operation.  But maybe n is always small enough
>> that it doesn't matter -- perhaps enough that the concrete cost of
>> maintaining a doubly-linked list is higher.
>
> I also suppose that a target being released is at the head of the list
> because targets of psref are typically manipulated in last-in-first-out
> manner. But yes psref_release of SLIST version can be O(n) in worst
> cases theoretically.
>
> Well, when I think this kind of problems I tend to think of our usages,
> i.e., network processing as a router. In such usages, psref is used in
> softint and the last-in-first-out manner would keep in many cases. But
> in other usages such as uses in threads the assumption is perhaps not
> really.
>
>>
>> (My desire to avoid thinking about bounds on n is also what motivated
>> me to use a linked list instead of an array in the first place.)
>>
>> Note that your patch changes the ABI of struct psref!
>
> Let's bump the kernel version!
>
>>
>> I wonder whether the open-coded version would do better if it
>> unconditionally loaded the percpu:
>>
>> pcpu = percpu_getref(class->prc_percpu);
>> KASSERTMSG(psref->psref_prevp == NULL || *psref->psref_prevp == 
>> psref,
>> "psref %p prevp %p points at %p",
>> psref, psref->psref_prevp, *psref->psref_prevp);
>> KASSERTMSG(psref->psref_prevp != NULL || pcpu->pcpu_first == psref,
>> "psref %p marked as first but psref_cpu %p on %d first is %p",
>> psref, pcpu, cpu_index(curcpu()), pcpu->pcpu_first);
>> *(psref->psref_prevp ? psref->psref_prevp : &pcpu->pcpu_first) =
>> psref->psref_next;
>> percpu_putref(class->prc_percpu);
>>
>> With DIAGNOSTIC disabled, I get a conditional move instruction instead
>> of branches this way:
>>
>>  4f9:   e8 00 00 00 00  callq  4fe <psref_release+0x93>
>> 4fa: R_X86_64_PC32 
>> percpu_getref+0xfffc
>>  4fe:   48 8b 53 08 mov0x8(%rbx),%rdx
>>  502:   48 85 d2test   %rdx,%rdx
>>  505:   48 0f 44 d0 cmove  %rax,%rdx
>>  509:   48 8b 03mov(%rbx),%rax
>>  50c:   48 89 02mov%rax,(%rdx)
>>  50f:   49 8b 7c 24 20  mov0x20(%r12),%rdi
>>  514:   e8 00 00 00 00  callq  519 <psref_release+0xae>
>> 515: R_X86_64_PC32 
>> percpu_putref+0xfffc
>>
>> Also, my original patch was missing a percpu_putref.  I hope you put
>> it back before you ran your test!
>
> I'll test with the patch in the other mail later.

The above results were actually measured by knakahara@ and
this time I've measured them myself (the measurement hardware
is the same, and the kernel configs of his and mine should
be almost the same though).

...so the results show a slightly different trend:
  - original:  137.9 138.0 (144.7) Mbps
  - open-code: 135.2 134.6 (138.5) Mbps
  - SLIST: 140.7 141.4 (140.1) Mbps

I think knakahara@ (or someone else) needs to re-evaluate :-/

  ozaki-r


Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-10-05 Thread Ryota Ozaki
On Fri, Oct 6, 2017 at 1:21 PM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Fri, 6 Oct 2017 13:14:14 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> On Fri, Oct 6, 2017 at 11:58 AM, Taylor R Campbell
>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>> > Note that your patch changes the ABI of struct psref!
>>
>> Let's bump the kernel version!
>
> If we want to pull the fix up to netbsd-8 (which we almost certainly
> do), then we should really avoid changing the ABI.  I forget what the
> rules are for pullups to pre-release branches, but changing the ABI is
> a big no-no post-release.

Hmm, I thought we could change the ABI until the release (or until an RC
is tagged). I mean, my past pullups to netbsd-8 sometimes included ABI
changes :-/

  ozaki-r


Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-10-05 Thread Ryota Ozaki
On Fri, Oct 6, 2017 at 11:58 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Fri, 6 Oct 2017 11:26:40 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> On Mon, Oct 2, 2017 at 11:13 PM, Taylor R Campbell
>> <campbell+netbsd-tech-k...@mumble.net> wrote:
>> > Quick summary of the problem:
>> >
>> > Possible solutions.  I'm leaning toward (6), to open-code the linked
>> > list operations for this special purpose, with compile-tested patch
>> > attached.  This changes the text of psref.h, but shouldn't change the
>> > ABI.  Comments?
>>
>> How about using SLIST instead of open-coding? The instructions of them
>> are very similar, but the SLIST version is slightly simpler.
>
> I avoided that because n psref_release operations takes expected and
> worst-case O(n^2) time and there's no constant bound on the latency of
> a single psref_release operation.  But maybe n is always small enough
> that it doesn't matter -- perhaps enough that the concrete cost of
> maintaining a doubly-linked list is higher.

I also suppose that a target being released is usually at the head of the
list because psref targets are typically manipulated in a last-in-first-out
manner. But yes, psref_release in the SLIST version can theoretically be
O(n) in the worst case.

Well, when I think about this kind of problem I tend to think of our use
cases, i.e., network processing as a router. In such use cases, psref is
used in softint context and the last-in-first-out pattern would hold in
many cases. But in other use cases, such as uses in threads, the assumption
perhaps does not hold.

>
> (My desire to avoid thinking about bounds on n is also what motivated
> me to use a linked list instead of an array in the first place.)
>
> Note that your patch changes the ABI of struct psref!

Let's bump the kernel version!

>
> I wonder whether the open-coded version would do better if it
> unconditionally loaded the percpu:
>
> pcpu = percpu_getref(class->prc_percpu);
> KASSERTMSG(psref->psref_prevp == NULL || *psref->psref_prevp == psref,
> "psref %p prevp %p points at %p",
> psref, psref->psref_prevp, *psref->psref_prevp);
> KASSERTMSG(psref->psref_prevp != NULL || pcpu->pcpu_first == psref,
> "psref %p marked as first but psref_cpu %p on %d first is %p",
> psref, pcpu, cpu_index(curcpu()), pcpu->pcpu_first);
> *(psref->psref_prevp ? psref->psref_prevp : &pcpu->pcpu_first) =
> psref->psref_next;
> percpu_putref(class->prc_percpu);
>
> With DIAGNOSTIC disabled, I get a conditional move instruction instead
> of branches this way:
>
>  4f9:   e8 00 00 00 00  callq  4fe <psref_release+0x93>
> 4fa: R_X86_64_PC32 
> percpu_getref+0xfffc
>  4fe:   48 8b 53 08 mov0x8(%rbx),%rdx
>  502:   48 85 d2test   %rdx,%rdx
>  505:   48 0f 44 d0 cmove  %rax,%rdx
>  509:   48 8b 03mov(%rbx),%rax
>  50c:   48 89 02mov%rax,(%rdx)
>  50f:   49 8b 7c 24 20  mov0x20(%r12),%rdi
>  514:   e8 00 00 00 00  callq  519 <psref_release+0xae>
> 515: R_X86_64_PC32 
> percpu_putref+0xfffc
>
> Also, my original patch was missing a percpu_putref.  I hope you put
> it back before you ran your test!

I'll test with the patch in the other mail later.

Thanks,
  ozaki-r


Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-10-05 Thread Ryota Ozaki
On Mon, Oct 2, 2017 at 11:13 PM, Taylor R Campbell
 wrote:
> Quick summary of the problem:
>
>percpu_alloc(9) may move existing per-CPU objects to different
>locations in memory.  This means you can't reliably store pointers
>to the per-CPU objects anywhere.
>
>psref(9) currently uses queue.h LIST, with a per-CPU list head --
>but LIST stores back-pointers into the list head, which breaks when
>percpu_alloc(9) moves the list head.
>
> Possible solutions.  I'm leaning toward (6), to open-code the linked
> list operations for this special purpose, with compile-tested patch
> attached.  This changes the text of psref.h, but shouldn't change the
> ABI.  Comments?

How about using SLIST instead of open-coding? The generated instructions
are very similar, but the SLIST version is slightly simpler.

This is a patch: http://www.netbsd.org/~ozaki-r/psref-SLIST.diff

One worry about the implementation is that percpu_{getref,putref} always
need to be called in psref_release, while the open-coded one doesn't need
that. However, the results of the performance evaluations show that the
overhead is not so big; moreover, the SLIST version slightly overtakes the
open-coded one.
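
The core of the SLIST version of psref_release is roughly the following
sketch (assuming pcpu_head becomes an SLIST_HEAD and the psref link an
SLIST_ENTRY named psref_entry):

    pcpu = percpu_getref(class->prc_percpu);
    SLIST_REMOVE(&pcpu->pcpu_head, psref, psref, psref_entry);
    percpu_putref(class->prc_percpu);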

These are the results of IP forwarding performance evaluations
with 64 byte frames:
  - original:  144.7 Mbps
  - open-code: 138.5 Mbps
  - SLIST: 140.1 Mbps

  ozaki-r

>
>
> 1. Make psref(9) store a pointer to a separately kmem_alloc-allocated
>list head.  (This is what Nakahara-san considered but rejected.)
>
> struct psref_cpu {
> -   struct psref_head   pcpu_head;
> +   struct psref_head   *pcpu_head;
> };
>
>   Problem: We need to initialize these before psref_acquire/release.
>We could initialize it in psref_class_create, but for various
>applications of psref, we may not know how many CPUs there are at
>that time.
>
>To fix this properly we need to teach percpu(9) how to initialize
>and finalize per-CPU objects as CPUs come online, which as a bonus
>would make it work with dynamic CPU attachment and detachment.  And
>that's a lot of work.
>
> 2. Make percpu(9) always store pointers into separately kmem_alloc-
>allocated memory, like userland TLS and pthread keys do.
>
>   Problem: It's a bit of work to basically rewrite subr_percpu.c, and
>may have performance implications.  I started last night, spent an
>hour rewriting it, and didn't finish before bedtime came.
>
> 3. Invent a new variant of LIST called PERCPU_LIST that doesn't
>require back-pointers into the head.  (This is what Nakahara-san
>suggested before I vanished into a EuroBSDcon vortex.)
>
>   Problem: This is almost the same as LIST -- supports essentially all
>the same operations with the same performance characteristics --
>and has very limited utility.  So unless we find other applications
>that need exactly this too, I'm reluctant to expose it as a
>general-purpose kernel API right now.
>
>It can't be a drop-in replacement for LIST, because we have
>LIST_REMOVE(obj, field) but we need MHLIST_REMOVE(head, obj, field)
>in order to allow the head to move.
>
>And we can't just add LIST_REMOVE_MAYBEHEAD(head, obj, field) or
>something to LIST, because there's no way to distinguish with
>standard LIST whether an object is at the head of the list or not,
>without knowing where the head *was* in memory when the object (or
>one of its predecessors) was inserted.
>
> 4. Teach the percpu(9) API to move objects with an operation other
>than memcpy.  This would look like
>
> percpu_t *percpu_alloc(size_t);
> +   percpu_t *percpu_alloc_movable(size_t,
> +   void *(*)(void *, const void *, size_t));
>
>with percpu_alloc(sz) := percpu_alloc_movable(sz, memcpy).  Then
>for psref(9), we could use LIST_MOVE to migrate the old list head
>to the new list head and fix the one back-pointer that would have
>been invalidated.
>
>   Problem: This is again an operation of limited utility, since most
>per-CPU objects (like counters) don't need special operations to
>move in memory.  And it would compete with extending the API for
>dynamic CPU initialization/finalization operations, so I'm not sure
>I want to preemptively step on the toes of that API design space at
>the moment.
>
> 5. Violate the abstraction of LIST in psref(9), as I proposed as a
>stop-gap measure in
>.
>
>   Problem: It's gross, makes maintenance of the LIST implementation
>more difficult, makes understanding the psref(9) code more
>difficult, breaks QUEUEDEBUG, 
>
> 6. Just open-code it in subr_psref.c.  If we ever find more
>applications of this idiom, we can always upgrade it without
>trouble to an explicit PERCPU_LIST abstraction.
>
>   Problem: We don't get to share any common LIST code.  This 

Re: RFC: PERCPU_LIST to fix PR kern/52515

2017-09-19 Thread Ryota Ozaki
On Tue, Sep 19, 2017 at 1:38 PM, Kengo NAKAHARA  wrote:
> Hi,
>
> To fix PR kern/52515(*), in particular psref(9), I implemented PERCPU_LIST,
> which is dedicated to percpu_alloc'ed structs. Here is the patch, which
> consists of the PERCPU_LIST implementation and its application to
> subr_psref.c.
> http://www.NetBSD.org/~knakahara/percpu-list/percpu-list.patch
>
> (*) http://gnats.netbsd.org/cgi-bin/query-pr-single.pl?number=52515
>
> The details are following.
>
> As written in PR/52515, the root of the problem is the back-pointer to a
> percpu_alloc'ed struct. At first, I tried a generic solution that lets a
> percpu_alloc'ed struct use a pointer instead of itself. Here is a part of
> the PoC patch.
> == bad patch ==
> @@ -103,9 +103,18 @@ struct psref_class {
>   * Not exposed by the API.
>   */
>  struct psref_cpu {
> -   struct psref_head   pcpu_head;
> +   struct psref_head   *pcpu_head;
>  };
>
> +static void
> +psref_class_pcpu_init_p(void *p, void *cookie, struct cpu_info *ci)
> +{
> +   struct psref_cpu *pcpu = p;
> +
> +   pcpu->pcpu_head = kmem_alloc(sizeof(*(pcpu->pcpu_head)), KM_SLEEP);
> +   LIST_INIT(pcpu->pcpu_head);
> +}
> +
>  /*
>   * psref_class_create(name, ipl)
>   *
> @@ -121,6 +130,7 @@ psref_class_create(const char *name, int ipl)
>
> class = kmem_alloc(sizeof(*class), KM_SLEEP);
> class->prc_percpu = percpu_alloc(sizeof(struct psref_cpu));
> +   percpu_foreach(class->prc_percpu, psref_class_pcpu_init_p, NULL);
> mutex_init(&class->prc_lock, MUTEX_DEFAULT, ipl);
> cv_init(&class->prc_cv, name);
> class->prc_iplcookie = makeiplcookie(ipl);
> == bad patch ==
>
> However, applying this modification to psref(9) causes initialization
> dependency problems. They result from the fact that percpu_foreach() must
> be called after ncpu is fixed, that is,
> (A) psref_class_create() for "ifnet_psref_class" must be called after ncpu
> is fixed
> - as psref_class_create() uses percpu_foreach() in this
> modification
> (B) "ifnet_psref_class" must be initialized before any (real and pseudo)
> network interfaces are attached
>
> To satisfy both (A) and (B), an MI hook called immediately after ncpu is
> fixed is required; however, that hook is not implemented yet. Furthermore,
> the modification would be too large even if that hook were already
> implemented. So I gave up on this generic solution.
>
> After that, I decided on a list-specific solution: PERCPU_LIST,
> implemented for percpu_alloc'ed structs. The key design point is the
> omission of the back-pointer to the list head in the first list element;
> that is, when the code wants to access the list head, it always calls
> percpu_getref() and then uses the returned pointer.
> # There are some PR kern/52515 problems which cannot be fixed by the
> # percpu list, such as ipforward_rt_percpu. They can be fixed by using a
> # pointer instead of the struct itself because they don't have complicated
> # initialization dependencies.

For rtcaches, I recommend that we stop tracking rtcaches on global lists
(dom->dom_rtcache) for cache invalidation and instead use a global
generation counter (inspired by the IPsec SP cache). By doing so we
can remove the LIST_ENTRY from struct route and avoid the issue.
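
A sketch of the generation-counter scheme; all names here are hypothetical
stand-ins, see the patch for the real code:

    static volatile uint64_t rtcache_generation; /* bumped on route changes */

    struct route_cache {                    /* stand-in for struct route */
            struct rtentry  *rc_rt;
            uint64_t         rc_generation; /* generation at fill time */
    };

    static struct rtentry *
    rtcache_check(struct route_cache *rc)
    {

            if (rc->rc_generation != rtcache_generation)
                    rc->rc_rt = NULL;       /* stale: treat as a miss */
            return rc->rc_rt;
    }

Invalidation then becomes a single atomic_inc_64(&rtcache_generation)
instead of walking a global list of caches.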

This is a patch:
  http://www.netbsd.org/~ozaki-r/rtcache-gen.diff

Thanks,
  ozaki-r


Re: Debugging function of localcount

2017-08-06 Thread Ryota Ozaki
On Thu, Jul 27, 2017 at 1:49 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Wed, 26 Jul 2017 18:41:36 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> Oops. I thought we could use percpu_foreach in softint, but
>> I was wrong. So localcount_debug_refcnt can be used only
>> in thread contexts. And the above assertion shouldn't be
>> there. The proposal isn't so much worthwhile ;-<
>
> You also can't access one CPU's count from another CPU -- there's no
> memory ordering, so you can easily get wildly inconsistent answers.
>
> Maybe if you did a cross-call to count it up, that would work, but it
> sounds like for the cases you had in mind, you can't do a cross-call,
> and I'm not sure the view is guaranteed to be stable enough: maybe one
> thread decrements the count on CPU 0, then you count 'em up, then the
> thread migrates to CPU 1 and increments the count there.  (It is not a
> theorem that something like this might violate your assertion, just a
> conjecture!)

Yes, so I expected it to be used only in very limited environments like
ATF tests using rump kernels.

Anyway, I've come up with another (naive) approach that uses an atomic
counter:
  http://www.netbsd.org/~ozaki-r/localcount_debug_atomic.diff

It counts up a counter with atomic instructions alongside localcount
if LOCKDEBUG is enabled. Of course this defeats the merit of localcount,
but it can provide an accurate view of the counter.
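
The idea, as a sketch (member names are invented; see the diff for
the real code):

  struct localcount_sketch {
          percpu_t *lc_percpu;    /* per-CPU int64_t counters */
  #ifdef LOCKDEBUG
          volatile unsigned int lc_debug_refcnt;  /* accurate total */
  #endif
  };

  static void
  localcount_acquire_sketch(struct localcount_sketch *lc)
  {
          int64_t *p;

          p = percpu_getref(lc->lc_percpu);
          (*p)++;                 /* the cheap per-CPU count */
          percpu_putref(lc->lc_percpu);
  #ifdef LOCKDEBUG
          /* a cross-CPU-accurate total; costs an atomic op, which
           * defeats localcount's merit, but only under LOCKDEBUG */
          atomic_inc_uint(&lc->lc_debug_refcnt);
  #endif
  }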

Do you think that it works and is useful?

  ozaki-r


Re: MP-safe IPsec SPD

2017-08-06 Thread Ryota Ozaki
Hi,

Thank you for your reviewing! I think I handled all your suggestions.
Let me know if I missed or misunderstood something.

BTW can we have an API to assert that we are in a pserialize critical
section? Sometimes I want to write something like:
  KASSERT(pserialize_in_read_critical_section());
rather than writing the constraint in a comment.
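
(A sketch of one possible implementation of such a predicate: a
pserialize read section runs with kpreemption disabled, so a cheap,
conservative check seems doable. The name is just the one wished for
above.)

  static inline bool
  pserialize_in_read_critical_section(void)
  {
          /* conservative: also true in any interrupt context */
          return kpreempt_disabled() || cpu_intr_p() ||
              cpu_softintr_p();
  }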

Thanks,
  ozaki-r


On Fri, Aug 4, 2017 at 12:02 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
>> Date: Fri, 28 Jul 2017 16:47:15 +0900
>> From: Ryota Ozaki <ozak...@netbsd.org>
>>
>> This is a patch to make IPsec SPD MP-safe:
>>   http://www.netbsd.org/~ozaki-r/mpsafe-ipsec-spd.diff
>>
>> Please check the locking notes in the patch to know
>> how I MP-ify SPD.
>>
>> This is the first time for me to use localcount(9)
>> so please comment/suggest me on it.
>
> Sorry, haven't had time for a detailed review!  Some brief comments:
>
> General approach looks reasonable.  Haven't had time to digest it all,
> but nothing leaps out at me as wrong about your use of localcount.
>
> Glad to see the detailed locking notes.  That will be helpful for
> proper review.
> - One typo: `pstree' should be `sptree'.  (Aside: Why `tree', if list?)
> - Can you say how key_mtx fits into it too?
> - Can you add to that comment that you must KEY_SP_REF during psz
>   read, and KEY_SP_UNREF after use?  (Rather than saying `incrementing
>   reference count', just to make it easier to cross-reference comment
>   with API.)
> - Can you put a comment over key_sp_ref saying what the caller's
>   responsibilities are?  (Must be in psz read section, must release
>   with key_sp_unref outside psz read section later.)
>
> /*
>  * Protect regtree, acqtree and items stored in the lists.
>  */
> static kmutex_t key_mtx __cacheline_aligned;
>+static pserialize_t key_psz;
>+static kmutex_t key_sp_mtx __cacheline_aligned;
>+static kcondvar_t key_sp_cv __cacheline_aligned;
>
> Consider gathering these into a cacheline-aligned struct along with
> the lists they cover?  And/or split the two locks up with a comment
> saying what each one is for?
>
>+void
>+key_sp_unref(struct secpolicy *sp, const char* where, int tag)
>+{
>+
>+   KASSERT(mutex_ownable(&key_sp_mtx));
>
> I suggest using KDASSERT for mutex_ownable, so that it happens only
> when you've enabled DEBUG *and* LOCKDEBUG -- it is a very expensive
> check, moreso than LOCKDEBUG normally is, since it acquires and
> releases a lock, and it is normally used in a path where we expect
> *not* to acquire and release a lock in a fast path.
>
>@@ -279,7 +295,6 @@ ipsec_fillpcbcache(struct inpcbpolicy *pcbsp, struct mbuf *m,
>}
>pcbsp->sp_cache[dir].cachesp = sp;
>if (pcbsp->sp_cache[dir].cachesp) {
>-   KEY_SP_REF(pcbsp->sp_cache[dir].cachesp);
>KEYDEBUG_PRINTF(KEYDEBUG_IPSEC_STAMP,
>"DP cause refcnt++:%d SP:%p\n",
>key_sp_refcnt(pcbsp->sp_cache[dir].cachesp),
>
> It looks like this debug message doesn't make sense any more?  (Aside
> from the stub definition of key_sp_refcnt.)


Re: Debugging function of localcount

2017-07-26 Thread Ryota Ozaki
On Wed, Jul 26, 2017 at 1:34 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> (resending, please ignore if duplicated)
>
> Hi,
>
> I propose a debugging function of localcount,
> localcount_debug_refcnt, which returns a total
> reference count of a specified localcount.
>
> A result of localcount_debug_refcnt isn't guaranteed
> to be accurate, however it's still useful for
> debugging purposes, for example debugging on
> ATF tests using rump kernels.
>
> Here is a patch:
>   http://www.netbsd.org/~ozaki-r/localcount_debug_refcnt.diff
>
> I added the following assertion in localcount_release:
>   KDASSERT(localcount_debug_refcnt(lc) >= 0);
>
> IIUC it's a proper constraint of localcount even if
> a result of localcount_debug_refcnt isn't accurate.
>
> Any comments or suggestions?

Oops. I thought we could use percpu_foreach in softint, but
I was wrong. So localcount_debug_refcnt can be used only
in thread contexts. And the above assertion shouldn't be
there. The proposal isn't so much worthwhile ;-<

  ozaki-r


Debugging function of localcount

2017-07-26 Thread Ryota Ozaki
Hi,

I propose a debugging function of localcount,
localcount_debug_refcnt, which returns a total
reference count of a specified localcount.

A result of localcount_debug_refcnt isn't guaranteed
to be accurate, however it's still useful for
debugging purposes, for example debugging on
ATF tests using rump kernels.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/localcount_debug_refcnt.diff

I added the following assertion in localcount_release:
  KDASSERT(localcount_debug_refcnt(lc) >= 0);

IIUC it's a proper constraint of localcount even if
a result of localcount_debug_refcnt isn't accurate.

Any comments or suggestions?

Thanks,
  ozaki-r



Debugging function of localcount

2017-07-25 Thread Ryota Ozaki
(resending, please ignore if duplicated)

Hi,

I propose a debugging function of localcount,
localcount_debug_refcnt, which returns a total
reference count of a specified localcount.

A result of localcount_debug_refcnt isn't guaranteed
to be accurate, however it's still useful for
debugging purposes, for example debugging on
ATF tests using rump kernels.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/localcount_debug_refcnt.diff

I added the following assertion in localcount_release:
  KDASSERT(localcount_debug_refcnt(lc) >= 0);

IIUC it's a proper constraint of localcount even if
a result of localcount_debug_refcnt isn't accurate.

Any comments or suggestions?

Thanks,
  ozaki-r


Re: Keep local symbols of rump libraries

2017-04-13 Thread Ryota Ozaki
On Fri, Apr 14, 2017 at 1:31 PM, Paul Goyette <p...@whooppee.com> wrote:
> On Fri, 14 Apr 2017, Paul Goyette wrote:
>
>> On Fri, 14 Apr 2017, Ryota Ozaki wrote:
>>
>>>>> If it's useful for other than rump libraries, having a new global
>>>>> option would be adequate. I don't have an idea of the name though.
>>>>
>>>>
>>>> http://www.netbsd.org/~ozaki-r/libs-keep-symbols.diff
>>>>
>>>> So this is a new patch that provides a global option named MKSTRIPSYM.
>>>> (the name is suggested by paulg, thanks!)
>>>
>>>
>>> The patch is updated; bsd.lua.mk also applies MKSTRIPSYM. Let me know
>>> if the change is wrong.
>>
>>
>> The README still refers to RUMP_KEEPSYMBOLS
>>
>> :)
>
>
> Also, the RUMP_KEEPSYMBOLS stuff is still included in Makefile.rump ???

Oh sorry, I changed the filename of the patch from rumplibs-keep-symbols.diff
to libs-keep-symbols.diff because it's no longer rump-specific.

  ozaki-r


Re: Keep local symbols of rump libraries

2017-04-12 Thread Ryota Ozaki
On Wed, Apr 12, 2017 at 6:58 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Wed, Apr 12, 2017 at 9:27 AM, Chuck Silvers <c...@chuq.com> wrote:
>> On Mon, Apr 10, 2017 at 07:22:45PM +0900, Ryota Ozaki wrote:
>>> Hi,
>>>
>>> I'm using ATF tests running rump kernels (rump_server)
>>> for development and debugging. When a rump kernel gets
>>> panic, it dumps a core file from which we can obtain
>>> a backtrace by using gdb.
>>>
>>> Unfortunately local symbols (i.e., static functions)
>>> in a backtrace are unresolved because they are stripped
>>> by the linker (and objdump). That makes debugging a bit
>>> difficult.
>>>
>>> The patch introduces a compile option for rump kernels
>>> called RUMP_KEEPSYMBOLS to keep such symbols:
>>>   http://www.netbsd.org/~ozaki-r/rumplibs-keep-symbols.diff
>>>
>>> The option is disabled by default to not increase
>>> the size of all rump libraries.
>>>
>>> I'm not so familiar with the NetBSD build system and
>>> Makefiles, so the patch may be wrong or clumsy.
>>>
>>> Any comments?
>>
>> I've been meaning to ask about something similar to this
>> but for everything in the build (libraries and commands)
>> for the benefit of dtrace.  dtrace "ustack()" stack traces
>> are similarly less useful without the static symbols.
>>
>> my patch for that was similar to yours, but I changed the
>> "-Wl,-x" to "-Wl,-X" rather than removing the option entirely.
>> for OBJCOPYLIBFLAGS I did what you did, though I suppose using "-X"
>> there too would have been more consistent.
>
> Sure. Changing -x to -X works for me.
>
>>
>> if we had an option for preserving static symbols for everything,
>> would you still want a rump-library-specific option?
>
> If it's useful for other than rump libraries, having a new global
> option would be adequate. I don't have an idea of the name though.

http://www.netbsd.org/~ozaki-r/libs-keep-symbols.diff

So this is a new patch that provides a global option named MKSTRIPSYM.
(the name is suggested by paulg, thanks!)

  ozaki-r


Re: Keep local symbols of rump libraries

2017-04-12 Thread Ryota Ozaki
On Wed, Apr 12, 2017 at 9:27 AM, Chuck Silvers <c...@chuq.com> wrote:
> On Mon, Apr 10, 2017 at 07:22:45PM +0900, Ryota Ozaki wrote:
>> Hi,
>>
>> I'm using ATF tests running rump kernels (rump_server)
>> for development and debugging. When a rump kernel gets
>> panic, it dumps a core file from which we can obtain
>> a backtrace by using gdb.
>>
>> Unfortunately local symbols (i.e., static functions)
>> in a backtrace are unresolved because they are stripped
>> by the linker (and objdump). That makes debugging a bit
>> difficult.
>>
>> The patch introduces a compile option for rump kernels
>> called RUMP_KEEPSYMBOLS to keep such symbols:
>>   http://www.netbsd.org/~ozaki-r/rumplibs-keep-symbols.diff
>>
>> The option is disabled by default to not increase
>> the size of all rump libraries.
>>
>> I'm not so familiar with the NetBSD build system and
>> Makefiles, so the patch may be wrong or clumsy.
>>
>> Any comments?
>
> I've been meaning to ask about something similar to this
> but for everything in the build (libraries and commands)
> for the benefit of dtrace.  dtrace "ustack()" stack traces
> are similarly less useful without the static symbols.
>
> my patch for that was similar to yours, but I changed the
> "-Wl,-x" to "-Wl,-X" rather than removing the option entirely.
> for OBJCOPYLIBFLAGS I did what you did, though I suppose using "-X"
> there too would have been more consistent.

Sure. Changing -x to -X works for me.

>
> if we had an option for preserving static symbols for everything,
> would you still want a rump-library-specific option?

If it's useful for things other than rump libraries, having a new global
option would be adequate. I don't have an idea for the name though.

  ozaki-r


Re: Keep local symbols of rump libraries

2017-04-12 Thread Ryota Ozaki
On Mon, Apr 10, 2017 at 11:15 PM, Kamil Rytarowski <n...@gmx.com> wrote:
> On 10.04.2017 12:22, Ryota Ozaki wrote:
>> Hi,
>>
>> I'm using ATF tests running rump kernels (rump_server)
>> for development and debugging. When a rump kernel gets
>> panic, it dumps a core file from which we can obtain
>> a backtrace by using gdb.
>>
>> Unfortunately local symbols (i.e., static functions)
>> in a backtrace are unresolved because they are stripped
>> by the linker (and objdump). That makes debugging a bit
>> difficult.
>>
>> The patch introduces a compile option for rump kernels
>> called RUMP_KEEPSYMBOLS to keep such symbols:
>>   http://www.netbsd.org/~ozaki-r/rumplibs-keep-symbols.diff
>>
>> The option is disabled by default to not increase
>> the size of all rump libraries.
>>
>> I'm not so familiar with the NetBSD build system and
>> Makefiles, so the patch may be wrong or clumsy.
>>
>> Any comments?
>>
>> Thanks,
>>   ozaki-r
>>
>
> There are already available the following options:
> MKDEBUG
> MKDEBUGLIB
> MKKDEBUG
>
> How about MKDEBUGRUMP or MKRUMPDEBUG?

I intended to provide the option as a rump-specific option
(like the options described in sys/rump/README.compileopts), not
a global build option, because the need seemed rump-specific.

But as Chuck suggested in another mail, if the option is useful
for more than rump libraries, making the option global would
be good. Though I'm not sure the word DEBUG suits the
purpose (keeping local symbols), since DEBUG suggests adding
debugging symbols.

  ozaki-r


Move if_opackets++ to where if_obytes are

2017-03-07 Thread Ryota Ozaki
Hi,

I propose to move if_opackets++ out of the device drivers to where
if_obytes is incremented (which is not just one place), for example
in if_transmit.

The aims of the change are:
- Unify how it's incremented
  - Usages of the counter are different between device
drivers
- There are roughly two places to count it up;
  (1) when a packet is dequeued from if_snd (in
  XXX_start function) and (2) when a packet is
  successfully transmitted (in Tx interrupt)
  - Users normally expect the counter to be incremented in the
same manner across device drivers
  - It is hard to unify the behavior while having the counter
in each device driver
- Synchronize with if_obytes
  - Users normally expect if_opackets and if_obytes to be
synchronized, but they aren't now
  - It is hard to do so while having the two counters in
different places
- Simplify the code
  - We can reduce the number of places doing if_opackets++
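
Concretely, the unified accounting could look roughly like the sketch
below (simplified; the real if_transmit in the patch may differ):

  static int
  if_transmit_sketch(struct ifnet *ifp, struct mbuf *m)
  {
          int error;
          size_t pktlen = m->m_pkthdr.len;
          bool mcast = (m->m_flags & M_MCAST) != 0;

          IFQ_ENQUEUE(&ifp->if_snd, m, error);
          if (error != 0)
                  return error;   /* m has been freed by the queue */

          /* both counters bumped in one place, always in sync */
          ifp->if_opackets++;
          ifp->if_obytes += pktlen;
          if (mcast)
                  ifp->if_omcasts++;

          if ((ifp->if_flags & IFF_OACTIVE) == 0)
                  (*ifp->if_start)(ifp);
          return 0;
  }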

Some concerns on the change:
- The meaning/behavior of the counter is changed
  - From "the number of packets sent on the interface" to
"the number of packets passed to the interface to send"
    - But if_obytes has always behaved like that
- if_oerrors isn't moved (at least in this change)
  - It is hard to count errors happening in a device
driver outside the device driver
  - We can count the number of dropped packets at the if_snd
queue but that's perhaps not what we want to know
by if_oerrors
  - That aside, IMO errors happening in a device driver should
be counted by the driver's own statistics counters and be
reported via sysctl, evcnt or the like
- So we can know what kind of errors happened, not just
  that some errors happened

This is a patch of the change:
  http://www.netbsd.org/~ozaki-r/move-if_opackets.diff

The patch includes only changes for sys/net*, wm(4) and
vioif(4). Changes for other drivers will be made once the
proposal is accepted.

Any comments or suggestions?

Thanks,
  ozaki-r



Re: workqueue for pr_input

2017-01-27 Thread Ryota Ozaki
On Thu, Jan 26, 2017 at 10:29 AM, Darren Reed  wrote:
> Apologies all for the blob of text :(
>
> I resent with better formatting but that's likely in a spam queue
> somewhere...
>
> So I'll try and summarise in a readable fashion:
>
> (1) patch introduces new locks for per-packet processing when the
> architectural goal is lock-less. What would this patch look like if done
> to be lock-less?
> (2) queuing has a memory cost. There are DoS attacks introduced that
> need to be mitigated.
> (4) does workqueue and this patch suggest that <sys/queue.h> needs
> enhancing for optimal processing of lists?

http://www.netbsd.org/~ozaki-r/workqueue-pr_input-percpu.diff

As per your comment, I changed the patch: (i) introduced percpu queues
and removed the lock, and (ii) introduced a hard limit on each queue
to mitigate DoS, the same as other input queues such as pktqueue and
the if_input percpuq.

As a side effect of the change, the overhead is reduced to 10 usec
for a single-packet ping-pong, and for flooding the overhead is now
negligible, I guess thanks to processing packets in bulk per
workqueue activation.
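
In outline, the enqueue side now does something like the sketch below
(field and type names are invented and differ from wqinput.c):

  struct wqinput_queue {
          MBUFQ_HEAD(wqinput_mbufq) q_mbufs;  /* MBUFQ_INIT at init */
          unsigned int    q_len, q_drops;
          bool            q_enqueued;
          struct work     q_work;
  };

  static void
  wqinput_input_sketch(struct wqinput *wqi, struct mbuf *m)
  {
          struct wqinput_queue *q;

          /* per-CPU queue: no lock, touched only by this CPU */
          q = percpu_getref(wqi->wqi_percpu);
          if (q->q_len >= WQINPUT_MAXLEN) {
                  q->q_drops++;           /* hard limit: mitigate DoS */
                  m_freem(m);
          } else {
                  MBUFQ_ENQUEUE(&q->q_mbufs, m);
                  q->q_len++;
                  if (!q->q_enqueued) {
                          /* one activation drains the whole queue */
                          q->q_enqueued = true;
                          workqueue_enqueue(wqi->wqi_wq, &q->q_work,
                              curcpu());
                  }
          }
          percpu_putref(wqi->wqi_percpu);
  }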

> (3) do all packets need this workqueue approach or only some? Can
> selective queuing help with (2)?

Workqueue is used only for ICMP/ICMPv6/CARP, and packets on fast paths
aren't affected by the change. Even for the affected packets, the
influence of the change is now small thanks to the above improvement.

We might be able to use workqueue selectively per packet type, say
per ICMP message type, but there is a concern: it would introduce
reordering between packets (e.g., ICMP messages). I think we should
avoid that behavior, though I'm not sure of the impact of the
reordering. And I don't much like introducing such exceptional
behavior in the middle of packet handling (pr_input is a modestly
feasible place to do it, I think).

Thanks,
  ozaki-r


Re: workqueue for pr_input

2017-01-25 Thread Ryota Ozaki
On Tue, Jan 24, 2017 at 3:47 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon <t...@panix.com> wrote:
>> On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
>>>
>>> The demerit is that that mechanism adds non-trivial
>>> overhead; RTT of ping increases by 30 usec.
>>
>> I don't see why overhead for control protocols like ICMP matters.  I
>> think if you look at longstanding commercial designs for networking
>> equipment -- software and hardware alike -- for literally decades they
>> have taken the approach of optimizing the fast path (forwarding, data
>> protocol performance) while allowing the exception/error paths (ICMP etc.)
>> to be very slow.
>
> Yes, so I proposed this approach regardless of the overhead.
>
> If no one objects, I'm going to implement a complete version of my proposal
> in protosw.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/workqueue-pr_input.diff

I eventually gave up implementing the mechanism in protosw; it just made
protosw complex. Instead I implemented it as helper functions (see wqinput.c).

Any comments?

  ozaki-r


Re: workqueue for pr_input

2017-01-23 Thread Ryota Ozaki
On Mon, Jan 23, 2017 at 10:01 PM, Thor Lancelot Simon <t...@panix.com> wrote:
> On Mon, Jan 23, 2017 at 05:58:20PM +0900, Ryota Ozaki wrote:
>>
>> The demerit is that that mechanism adds non-trivial
>> overhead; RTT of ping increases by 30 usec.
>
> I don't see why overhead for control protocols like ICMP matters.  I
> think if you look at longstanding commercial designs for networking
> equipment -- software and hardware alike -- for literally decades they
> have taken the approach of optimizing the fast path (forwarding, data
> protocol performance) while allowing the exception/error paths (ICMP etc.)
> to be very slow.

Yes, so I proposed this approach regardless of the overhead.

If no one objects, I'm going to implement a complete version of my proposal
in protosw.

  ozaki-r


Re: bpf_mtap_softint

2017-01-18 Thread Ryota Ozaki
On Wed, Jan 18, 2017 at 6:25 PM, Martin Husemann <mar...@duskware.de> wrote:
> On Wed, Jan 18, 2017 at 04:48:32PM +0900, Ryota Ozaki wrote:
>> Note that it of course consumes more CPU time for softint
>> and more memory for m_dup. That said, I think that this is
>> a suitable workaround for now until someone makes them
>> softint-based.
>
> Could you provide some TODO like document (e.g. on the wiki) where
> the poor soul wading into ISDN drivers (i.e.) could find out what
> exactly needs to be done to the drivers? (Even if this is just two
> or three sentences, it should be spelled out explicitly.)

I added doc/TODO.smpnet for it. It will include TODOs for the MP-safe
network stack.

Updated the patch: http://www.netbsd.org/~ozaki-r/bpf_mtap_softint.diff

>
> Did I understand correctly that you will add the deferred mtap to all
> drivers for now, and after a driver has been fixed it will return to
> the old bpf_mtap?

Yes.

> So if we ask nxr for users of the deferred tap and find none,
> we know that the task is done?

Yes, we can know it by searching for bpf_mtap_softint.

  ozaki-r


bpf_mtap_softint

2017-01-17 Thread Ryota Ozaki
Hi,

This proposal is a part of the task of making bpf MP-safe. To make
the task easier, we want to prevent bpf_mtap* from running
in hardware interrupt context.

By the commit "Move bpf_mtap and if_ipackets++ on Rx of each
driver to percpuq if_input" (*), most bpf_mtap on Rx run in
softint now. However, there remain some bpf_mtap that still
run in hardware interrupt context, which are marked as
"XXX not in softint" (**).

(*) 
http://www.nerv.org/netbsd/?q=id:20161215T092807Z.77df0ccf164687a2911f2ff5cc2dafe4fcbb08c5
(**) wireless drivers and net80211 are also remaining ones.
 nonaka@ is working on them (iwm(4) is already softint-based).

Ideally they (some drivers and the ISDN protocol processing) would be
converted to softint-based, but that is difficult because we
don't have actual devices for testing, time to work on it, (and
interest/love). So instead, I provide a means to defer only
bpf_mtap to a softint handler.

This is a patch:
  http://www.netbsd.org/~ozaki-r/bpf_mtap_softint.diff

We only need to add bpf_mtap_softint_init to the driver
initialization and replace bpf_mtap with bpf_mtap_softint.
I tested it with wm(4) and it works for me. If someone
can test the patch, please do (ISDN users?).
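
The per-driver change is essentially this (a sketch; xx(4) is a
made-up driver):

  @@ xx(4) attach @@
  +       bpf_mtap_softint_init(ifp);

  @@ xx(4) Rx path running in hardware interrupt context @@
  -       bpf_mtap(ifp, m);
  +       bpf_mtap_softint(ifp, m);       /* m_dup + tap in softint */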

Note that it of course consumes more CPU time for softint
and more memory for m_dup. That said, I think that this is
a suitable workaround for now until someone makes them
softint-based.

Comments?

Thanks,
  ozaki-r


Unify bpf_mtap & if_ipackets++ on Rx

2016-12-06 Thread Ryota Ozaki
Hi,

This is another proposal that unifies bpf_mtap and
if_ipackets++, which are executed in each driver now,
into the percpuq if_input framework.

The benefits of the change are:
- We can reduce code
- We can provide the same behavior between drivers
  - How if_ipackets is counted up
  - Well, actually some drivers periodically update
packet statistics though
- We can run bpf_mtap in softint
  - This makes it easy to MP-ify bpf
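
In other words, the Rx path changes roughly like this (a sketch):

  @@ each driver's Rx interrupt/softint @@
  -       ifp->if_ipackets++;
  -       bpf_mtap(ifp, m);
          if_percpuq_enqueue(ifp->if_percpuq, m);

  @@ the common percpuq if_input softint then does, once @@
  +       ifp->if_ipackets++;     /* unified counting */
  +       bpf_mtap(ifp, m);       /* now always in softint */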

Here is a patch:
  http://www.netbsd.org/~ozaki-r/unify-bpf_mtap-if_ipackets.diff

The patch is almost mechanically generated by spatch.
Let me know if there are mistakes.


We can move most bpf_mtap calls into softint with the patch,
but there remain some bpf_mtap calls that still run in
hardware interrupt context.

Two of them are executed from Tx interrupts and can be
moved to if_start, which is the usual place for bpf_mtap on Tx.

Here is a patch for the change:
  http://www.netbsd.org/~ozaki-r/move-bpf_mtap-txintr.diff

Is this kind of change correct?


So there are five remaining bpf_mtap calls that cannot be moved
without big changes. I annotated them:
  http://www.netbsd.org/~ozaki-r/comment-bpf_mtap-hardintr.diff

One solution is to provide a deferred bpf_mtap mechanism that
duplicates the mbuf and runs bpf_mtap in softint (or a workqueue).
It should work but it's perhaps a last resort.

Let me know if there is a better approach.


Any comments or suggestions?

Thanks,
  ozaki-r


Re: Deferred if_start

2016-12-05 Thread Ryota Ozaki
On Mon, Dec 5, 2016 at 6:54 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Mon, Dec 05, 2016 at 12:16:42PM +0900, Ryota Ozaki wrote:
>> On Fri, Dec 2, 2016 at 7:55 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
>> > Especially on older systems, adding non-trivial overhead to RX
>> > processing can be costly, especially if the technical reasons are
>> > questionable. So let's look at those reasons again.
>>
>> I think it's trivial because deferred if_start only runs if it's
>> requested and the check if it's requested is executed per multiple
>> packets, not per packet.
>
> Please remember, I'm not arguing about if_start, but changing RX
> processing to include another softint layer.

I think we can unify the Rx softint and Tx softint as the pq3etsec
driver does, but okay, I won't include that design in this proposal
and will add a softint for deferred if_start.

This is a new patch including complete changes:
  http://www.netbsd.org/~ozaki-r/if_deferred_start.complete.diff

Only drivers that call if_deferred_start_init have a new softint for
deferred if_start.

>
>> BTW, how about moving bpf_mtap (and ifp->if_ipackets++) from each driver
>> to the common percpuq if_input? That's also a requirement for MP-safe bpf.
>
> That would work at least for Ethernet drivers. It would actually be nice
> to recheck the function pointer at this point, e.g. the way VLAN
> processing is hooked up is somewhat gross, but that's unrelated.

bridge(4) already changes if_input though, hmm... that's future work :-|

> You
> should keep in mind though that some drivers (e.g. WLAN) have multiple
> bpf_mtap hooks to include the signal statistics etc.

Yes, I know. nonakap@ will tackle the ieee80211 stack and wifi drivers.

Thanks,
  ozaki-r


Re: Deferred if_start

2016-12-05 Thread Ryota Ozaki
On Fri, Dec 2, 2016 at 7:55 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Fri, Dec 02, 2016 at 06:46:12PM +0900, Ryota Ozaki wrote:
>> On Fri, Dec 2, 2016 at 6:32 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
>> > On Fri, Dec 02, 2016 at 04:07:24PM +0900, Ryota Ozaki wrote:
>> >> We need it for MP-ification of bpf because some drivers call
>> >> if_start (or equivalent, say drvX_start) from hardware
>> >> interrupt, which means that bpf_mtap is also called from
>> >> hardware interrupt. That makes MP-ification hard.
>> >
>> > What is the reason for bpf_mtap to fail in hardware interrupt context?
>>
>> Because we can use only spin mutex.
>>
>> > I really don't like this requirement change.
>>
>> Why?
>
> Especially on older systems, adding non-trivial overhead to RX
> processing can be costly, especially if the technical reasons are
> questionable. So let's look at those reasons again.

I think it's trivial because deferred if_start only runs if it's
requested, and the check for whether it's requested is executed once
per batch of packets, not per packet.

Actually the performance drop in forwarding throughput is 2 or 3%.
The overhead mainly comes from preempt_{enable,disable} in percpu(9),
which aren't needed in hardint/softint context; if we get rid of them
(i.e., use percpu_getptr_remote directly), the overhead becomes
negligible.

>
> For the BPF hooks to work, two things are necessary: read-only access to
> the filter program and read-write engine to the BPF buffer space. The
> latter is fine with a spin-lock, so that should not be a problem. That
> leaves the filter program. Hardware interrupt context is not allowed to
> sleep at all, so BPF access from this context can only see a filter
> program while it is currently running. So requirements for safe access
> should therefore be:
> (1) Atomic loading the new BPF program.
> (2) Delayed freeing of the old BPF program.
> (3) Access to BPF program must block some synchronisation cross call.
> (4) After the cross call got ACKed by all CPUs, the old program can be
> removed.
>
> I'm not sure if passive references already provides this, but this
> approach should work without any further changes to network drivers.

psz/psref + mutex can do that. I guess using psz/psref + mutex also
doesn't require any changes to network drivers.
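
i.e., something along these lines (a sketch with invented names, not
the actual bpf code; the numbers refer to your list above):

  static pserialize_t bpf_psz;    /* pserialize_create() at init */

  /* updater: swap in the new filter, wait out readers, then free */
  static void
  bpf_setfilter_sketch(struct bpf_d *d, struct bpf_filter *newf)
  {
          struct bpf_filter *oldf;

          mutex_enter(&d->bd_mtx);        /* serialize updaters */
          oldf = d->bd_filter;
          d->bd_filter = newf;            /* (1) atomic pointer store */
          pserialize_perform(bpf_psz);    /* (3)(4) drain all readers */
          mutex_exit(&d->bd_mtx);
          if (oldf != NULL)
                  bpf_filter_free(oldf);  /* (2) delayed freeing */
  }

  /* reader side (bpf_mtap), guaranteed softint/thread context */
  static void
  bpf_tap_sketch(struct bpf_d *d, const struct mbuf *m)
  {
          struct bpf_filter *f;
          int s;

          s = pserialize_read_enter();
          f = d->bd_filter;
          if (f != NULL)
                  bpf_run_filter(f, m);   /* hypothetical */
          pserialize_read_exit(s);
  }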

>
> Side note: Please don't understand this as objection against moving the
> if_start processing to softint context. That's a different topic and
> decoupling if_start from the queue processing by deferring it into a
> softint / workqueue / whatever seems perfectly reasonable to me.

BTW, how about moving bpf_mtap (and ifp->if_ipackets++) from each driver
to the common percpuq if_input? That's also a requirement for MP-safe bpf.

Thanks,
 ozaki-r



Re: Deferred if_start

2016-12-02 Thread Ryota Ozaki
On Fri, Dec 2, 2016 at 6:32 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Fri, Dec 02, 2016 at 04:07:24PM +0900, Ryota Ozaki wrote:
>> We need it for MP-ification of bpf because some drivers call
>> if_start (or equivalent, say drvX_start) from hardware
>> interrupt, which means that bpf_mtap is also called from
>> hardware interrupt. That makes MP-ification hard.
>
> What is the reason for bpf_mtap to fail in hardware interrupt context?

Because we can use only spin mutexes.

> I really don't like this requirement change.

Why?

  ozaki-r


Deferred if_start

2016-12-01 Thread Ryota Ozaki
Hi,

This proposal introduces the deferred if_start framework
for network device drivers. It provides a means to schedule
if_start so that it is executed later in softint.

We need it for MP-ification of bpf because some drivers call
if_start (or equivalent, say drvX_start) from hardware
interrupt, which means that bpf_mtap is also called from
hardware interrupt. That makes MP-ification hard.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/if_deferred_start.diff

The implementation reuses softint of percpuq if_input to
avoid adding yet another softint per interface. Deferred
if_start is called before if_input in the softint handler.

The patch applies the framework only to vioif and wm.
For vioif, we just need to replace vioif_start with
if_schedule_deferred_start. For wm, which supports
H/W multiqueue, we need a few more changes. Once the
proposal is accepted, we will apply the framework to the
remaining tens of drivers.
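
For example, the whole vioif change on the Tx-interrupt side amounts
to (sketch):

  @@ vioif Tx completion, hardware interrupt context @@
  -       vioif_start(ifp);                /* ran bpf_mtap in hardint */
  +       if_schedule_deferred_start(ifp); /* start runs in softint */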

Any comments?

BTW, the design of running if_start in softint in conjunction
with if_input recalls the design of the pq3etsec driver, which
always handles both Tx and Rx in one softint. We would
eventually provide a common framework for that design in
the future. (It would also include multiqueue and poll mode.)

  ozaki-r


Re: Tidy up in6_select*

2016-11-08 Thread Ryota Ozaki
On Mon, Nov 7, 2016 at 9:55 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> On Mon, Nov 7, 2016 at 7:12 PM, Roy Marples <r...@marples.name> wrote:
>> On 07/11/2016 07:30, Ryota Ozaki wrote:
>>> I propose tidy-up of in6_select* functions, esp.
>>> selectroute.
>>
>> ...
>>
>>> The patch is here:
>>>   http://www.netbsd.org/~ozaki-r/tidyup-in6_select.diff
>>>
>>> Tests added recently ensure there is no regression
>>> in basic cases, but I cannot say that it breaks
>>> nothing we should keep.
>>>
>>> Any comments or suggestions?
>>
>> Looks good to me!
>
> Thanks :)

I found a bug (ifp leak in ip6_output) and fixed it.

This is an updated patch:
  http://www.netbsd.org/~ozaki-r/tidyup-in6_select.v2.diff

  ozaki-r


Re: Tidy up in6_select*

2016-11-07 Thread Ryota Ozaki
On Mon, Nov 7, 2016 at 7:12 PM, Roy Marples <r...@marples.name> wrote:
> On 07/11/2016 07:30, Ryota Ozaki wrote:
>> I propose tidy-up of in6_select* functions, esp.
>> selectroute.
>
> ...
>
>> The patch is here:
>>   http://www.netbsd.org/~ozaki-r/tidyup-in6_select.diff
>>
>> Tests added recently ensure there is no regression
>> in basic cases, but I cannot say that it breaks
>> nothing we should keep.
>>
>> Any comments or suggestions?
>
> Looks good to me!

Thanks :)

  ozaki-r


Tidy up in6_select*

2016-11-06 Thread Ryota Ozaki
Hi,

I propose a tidy-up of the in6_select* functions, esp.
selectroute.

selectroute is annoying because:
- It returns a rtentry, an ifp, or both
  - Yes, it may return only an ifp!
- That case is valid, but selectroute shouldn't be handling it
  - Such conditional behavior makes it difficult
to apply the locking/psref thingy
- It may return a rtentry even on error
- It may use the opt->ip6po_nextroute rtcache implicitly
  - The caller can know if it is used
by rtcache_validate(&opt->ip6po_nextroute),
but that's racy in an MP-safe world
  - Even if it uses opt->ip6po_nextroute, it may
return a rtentry that isn't derived from the rtcache

My changes (inspired by OpenBSD) include:
- Rename selectroute to in6_selectroute
  - Let a remaining caller of selectroute, in6_selectif,
use in6_selectroute instead
- Let in6_selectroute return only an rtentry
  - If error, it doesn't return an rtentry
  - A caller gets an ifp from a returned rtentry
- Allow in6_selectroute to modify a passed rtcache
  so that a caller can know via the rtcache whether
  opt->ip6po_nextroute was used
- Let callers (ip6_output and in6_selectif) handle
  the case that only an ifp is required

The patch is here:
  http://www.netbsd.org/~ozaki-r/tidyup-in6_select.diff

Tests added recently ensure there is no regression
in basic cases, but I cannot say that it breaks
nothing we should keep.

Any comments or suggestions?
  ozaki-r


Re: if_tun module doesn't autoload if opening /dev/tunN

2016-09-08 Thread Ryota Ozaki
On Wed, Sep 7, 2016 at 2:50 PM, Paul Goyette  wrote:
> According to the tun(4) man page, you can create an interface "by using the
> ifconfig(8) create command" and this works just fine.  The man page goes on
> to say "An open() call on /dev/tunN, will also create a network
> interface..." but this does not work.
>
> I think this is because the code in sys/miscfs/specfs/spec_vnops.c assumes
> that the module name is the same as the device name, ie it expects that
> /dev/tunN will be handled by module tun.kmod  However, the module is
> actually named if_tun.kmod so it never gets loaded, and the open() fails.
>
> I can see a couple of ways to "fix" this:
>
> 1. Rename the module (with all appropriate updates to the
>sets lists in src/distrib/sets/lists), or
> 2. Add code to spec_vnops.c to try loading module if_xxx if the
>autoload of module xxx fails.
>
> Comments?  Preferences?  Any other options I haven't yet seen?

Can't we have an alias name for a device to use for module_autoload?
I'm thinking of cdevsw#d_mod_filename, for example. It's simpler and
more explicit.
  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-09-08 Thread Ryota Ozaki
On Tue, Sep 6, 2016 at 3:14 AM, matthew green  wrote:
>> dut1# cpuctl identify 0
>> cpu0: highest basic info 000b
>> cpu0: highest extended info 8008
>> cpu0: "Intel(R) Atom(TM) CPU  C2558  @ 2.40GHz"
>> cpu0: Intel Atom C2000 (686-class), 2400.27 MHz
>> cpu0: family 0x6 model 0x4d stepping 0x8 (id 0x406d8)
>> cpu0: features 
>> 0xbfebfbff
>> cpu0: features 
>> 0xbfebfbff
>> cpu0: features 0xbfebfbff
>> cpu0: features1 0x43d8e3bf
>> cpu0: features1 0x43d8e3bf
>> cpu0: features1 0x43d8e3bf
>> cpu0: features2 0x28100800
>> cpu0: features3 0x101
>> cpu0: I-cache 32KB 64B/line 8-way, D-cache 24KB 64B/line 6-way
>> cpu0: L2 cache 1MB 64B/line 16-way
>> cpu0: ITLB 48 4KB entries fully associative
>> cpu0: DTLB 128 4KB entries 4-way, 4K/2M: 16 entries
>> cpu0: Initial APIC ID 0
>> cpu0: Cluster/Package ID 0
>> cpu0: Core ID 0
>> cpu0: SMT ID 0
>> cpu0: DSPM-eax 0x5
>> cpu0: DSPM-ecx 0x9
>> cpu0: SEF highest subleaf 
>> cpu0: SEF-main 0x2282
>> cpu0: microcode version 0x127, platform ID 0
>>
>>
>> > can you run performance tests on systems with small cache?
>>
>> Never tested it. It'll take a bit of time because I don't
>> have a suitable one. BTW what size do you consider small?
>
> ah, heh.  the above is a small system. :)

Oh, okay :)

>
> can you run it on a big system?  like a xeon with a large cpu
> cache?

Hmm, we don't have a big system like that which we can test on
conveniently. I hope someone who has one tries it out.

  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-09-01 Thread Ryota Ozaki
On Fri, Sep 2, 2016 at 4:40 AM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Thu, Sep 01, 2016 at 10:00:52PM +0900, Ryota Ozaki wrote:
>> On Thu, Sep 1, 2016 at 6:45 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
>> > On Thu, Sep 01, 2016 at 03:46:15PM +0900, Ryota Ozaki wrote:
>> >> http://www.netbsd.org/~ozaki-r/align-functions-16.diff
>> >>
>> >> The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
>> >> Is it a feasible place to add?
>> >
>> > There are two small issues I have with this patch:
>> > (1) I think it should be restricted to GCC with an appropiate comment of
>> > what this is a workaround for. Clang seems to behave a lot more sensible
>> > out of the box. If there are CPU models with a different base alignment
>> > and the user asked for one of them as optimisation target, it should be
>> > honored IMO.
>> > (2) This should not touch CFLAGS, but COPTS.
>>
>> Okay, I see. How about the following patch?
>> (nonaka@ helped improving Makefile options.)
>>
>> http://www.netbsd.org/~ozaki-r/align-functions-16.v2.diff
>
> Almost. You shouldn't need the whole if-block.

The if-block intends to add the option even when a kernel config
(like GENERIC) has its own COPTS.

Well, should we directly add the option to COPTS in GENERIC instead?
Or we may be able to get rid of (or comment out) COPTS in GENERIC
because it's the same as DEFCOPTS.

>  I don't understand the
> bsd.own.mk reference, it doesn't contain -mtune=nocona here?

I meant the GCC_CONFIG_TUNE.x86_64=nocona line, which makes gcc use
nocona as the default value of -mtune.

Okay, revised the comment like this:

-# By default, our gcc uses -mtune=nocona for compiling the kernels
-# (see share/mk/bsd.own.mk). With -mtune=nocona, gcc doesn't align
+# Our gcc is built to use nocona as the default value of -mtune
+# (see GCC_CONFIG_TUNE in share/mk/bsd.own.mk). With -mtune=nocona,
+# gcc doesn't align (...)

Thanks,
  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-09-01 Thread Ryota Ozaki
On Thu, Sep 1, 2016 at 6:45 PM, Joerg Sonnenberger <jo...@bec.de> wrote:
> On Thu, Sep 01, 2016 at 03:46:15PM +0900, Ryota Ozaki wrote:
>> http://www.netbsd.org/~ozaki-r/align-functions-16.diff
>>
>> The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
>> Is it a feasible place to add?
>
> There are two small issues I have with this patch:
> (1) I think it should be restricted to GCC with an appropiate comment of
> what this is a workaround for. Clang seems to behave a lot more sensible
> out of the box. If there are CPU models with a different base alignment
> and the user asked for one of them as optimisation target, it should be
> honored IMO.
> (2) This should not touch CFLAGS, but COPTS.

Okay, I see. How about the following patch?
(nonaka@ helped improving Makefile options.)

http://www.netbsd.org/~ozaki-r/align-functions-16.v2.diff

Thanks,
  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-09-01 Thread Ryota Ozaki
On Thu, Sep 1, 2016 at 4:04 PM, matthew green  wrote:
> have you tested other values than 1 and 16?  what about 4 or 8?

4 and 8 are not so good; their performance fluctuations are
similar to the unaligned case in my experiments.

>
> can you post the size difference of kernels?  particularly the
> kernel without DIAGNOSTIC or DEBUG (since those are the ones
> where performance matters most.)

I measured the sizes of GENERIC kernels, i.e., DIAGNOSTIC on
and DEBUG off.

The sizes of kernel binaries don't change in most cases because
the alignment of __rodata_start that begins just after kernel text
hides the changes due to -falign-functions.

The size of the actual kernel text (from kernel_text to _etext)
changes slightly. The difference between GENERIC kernels
w/ and w/o -falign-functions=16 is 200kB. That is 1% of the total
kernel text size.

BTW, as I noted, I'm not exploring an alignment size that provides
the best performance; I just want to reduce performance fluctuations.

Thanks,
  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-09-01 Thread Ryota Ozaki
On Mon, Aug 29, 2016 at 4:43 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> Hi,
>
> I propose to set -falign-functions=16 to kernels
> of i386/amd64 to reduce performance fluctuations
> by small, unrelated changes.
>
> [Background]
>
> I noticed that performance of IP forwarding had
> been degraded by 10% between Aug. 1 and Aug. 16.
> Bisecting commits between them points out that
> performance degradations happened by several
> commits and unfortunately the commits aren't
> related to performance of IP forwarding; for
> example a change to ip6flow.
>
> I and knakahara investigated how these
> degradations happened and concluded that they
> are because of changes of the start of functions
> (alignment of function codes), which probably
> affects CPU cache hits. (Actually this is just
> our guess because we don't have a way to know
> cache hit/miss ratios for now...)
>
> [How -falign-functions=16 helps?]
>
> Currently the start of functions of kernels of
> i386/amd64 is unaligned, i.e., functions can
> start at any byte depending on the leading objects
> linked to the kernel. If the size of leading
> objects has been changed, starts of all following
> functions also change.
>
> You can see how function alignments are organized
> by nm -n netbsd or just seeing symbol files
> generated in releasedir.
>
> If you specify -falign-functions=16 to COPTS in
> your kernel config, you can align functions by
> 16 bytes. By doing so, addresses of the start of
> all functions always become 0xXXX0 for i386
> and 0xXXX0 for amd64. The alignment makes
> sure that functions aren't affected by other
> unrelated code changes.
>
> [Why not aligned in the first place?]
>
> It seems because of -mtune=nocona that is specified
> in bsd.own.mk. -mtune=generic provides functions
> aligned by 16 bytes, but provides poorer performance
> than -mtune=nocona, so I don't propose this kind of
> changes.
>
> [-falign-functions=16 solves the issue completely?]
>
> No. It seems there remains some other cause(s) that
> provide performance fluctuations. Nonetheless,
> setting -falign-functions=16 reduces fluctuations.
>
> [The point of the proposal]
>
> The aim of the proposal isn't to provide good
> performance by aligning functions of a kernel,
> but to reduce performance fluctuations by small,
> unrelated changes. Such behavior makes it
> difficult to measure small overhead of a change
> because we cannot distinguish a given performance
> change comes from either the real change or
> function alignment changes.
>
>
> Any suggestions or comments?
>
> Adding -falign-functions=16 is one solution and
> there may be a better way to the goal. And also
> I'm not sure where we should add such option.

http://www.netbsd.org/~ozaki-r/align-functions-16.diff

The patch adds the option to sys/arch/amd64/conf/Makefile.amd64.
Is it a feasible place to add?

  ozaki-r


Re: -falign-functions=16 for i386/amd64

2016-08-29 Thread Ryota Ozaki
On Mon, Aug 29, 2016 at 4:43 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> Hi,
>
> I propose to set -falign-functions=16 to kernels
> of i386/amd64 to reduce performance fluctuations
> by small, unrelated changes.
>
> [Background]
>
> I noticed that performance of IP forwarding had
> been degraded by 10% between Aug. 1 and Aug. 16.
> Bisecting commits between them points out that
> performance degradations happened by several
> commits and unfortunately the commits aren't
> related to performance of IP forwarding; for
> example a change to ip6flow.
>
> I and knakahara investigated how these
> degradations happened and concluded that they
> are because of changes of the start of functions
> (alignment of function codes), which probably
> affects CPU cache hits. (Actually this is just
> our guess because we don't have a way to know
> cache hit/miss ratios for now...)
>
> [How -falign-functions=16 helps?]
>
> Currently the start of functions of kernels of
> i386/amd64 is unaligned, i.e., functions can
> start at any byte depending on the leading objects
> linked to the kernel. If the size of leading
> objects has been changed, starts of all following
> functions also change.
>
> You can see how function alignments are organized
> by nm -n netbsd or just seeing symbol files
> generated in releasedir.
>
> If you specify -falign-functions=16 to COPTS in
> your kernel config, you can align functions by
> 16 bytes. By doing so, addresses of the start of
> all functions always become 0xXXX0 for i386
> and 0xXXX0 for amd64. The alignment makes
> sure that functions aren't affected by other
> unrelated code changes.
>
> [Why not aligned in the first place?]
>
> It seems because of -mtune=nocona that is specified
> in bsd.own.mk. -mtune=generic provides functions
> aligned by 16 bytes, but provides poorer performance
> than -mtune=nocona, so I don't propose this kind of
> changes.
>
> [-falign-functions=16 solves the issue completely?]
>
> No. It seems there remains some other cause(s) that
> provide performance fluctuations. Nonetheless,
> setting -falign-functions=16 reduces fluctuations.
>
> [The point of the proposal]
>
> The aim of the proposal isn't to provide good
> performance by aligning functions of a kernel,
> but to reduce performance fluctuations by small,
> unrelated changes. Such behavior makes it
> difficult to measure small overhead of a change
> because we cannot distinguish a given performance
> change comes from either the real change or
> function alignment changes.

[Where 16 comes from?]

From the old Intel Optimization Manual (for Pentium II and III).
For recent processors 32 may be better, but for stock
kernels (such as GENERIC) 16 is probably better (for old
machines). (And if we really want to optimize, we should
use -march or -mtune instead.)

Another reason is that the stock kernels of other OSes
(FreeBSD, OpenBSD and Linux) appear to employ 16-byte
alignment.

  ozaki-r

>
>
> Any suggestions or comments?
>
> Adding -falign-functions=16 is one solution and
> there may be a better way to the goal. And also
> I'm not sure where we should add such option.
>
> Thanks,
>   ozaki-r


Re: Apply psz/psref to ifaddr and in_ifaddr

2016-07-27 Thread Ryota Ozaki
On Tue, Jul 26, 2016 at 1:19 PM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> Hi,
>
> This change makes struct ifaddr and struct in_ifaddr
> (and lists for them) MP-safe using pserialize(9) and
> psref(9). Changes to in6_ifaddr aren't included.

http://www.netbsd.org/~ozaki-r/psref-ifa-ia6.diff

A patch for in6_ifaddr. There is no special change in the patch.

  ozaki-r

>
> There are two patches:
>   http://www.netbsd.org/~ozaki-r/revert-revert-ifaddr-change.diff
>   http://www.netbsd.org/~ozaki-r/psref-ifa-ia4.diff
>
> The first patch reverts the reverted commit that added
> a member variable to struct ifaddr which broke netstat -ia
> using kvm(3).
>
> netstat is already modified to use sysctl instead of kvm(3)
> and we don't need a dirty workaround to keep kvm(3)
> backcompat (see PR kern/51325 for the discussion about it).
>
> The second patch (tl;dr) applies pserialize and psref
> to the data structures. Most changes are usual changes
> for pserialize and psref while some changes require
> restructuring, e.g., rt_getifa.
>
> One note is that pserialize_perform and psref_target_destroy
> are commented out for now. This is because:
> - in6_ifaddr change is not yet
> - They aren't needed now because of softnet_lock
> - They cause a deadlock because of softnet_lock
>
> So we should enable them when we remove softnet_lock.
>
> Any comments or suggestions?
>
> Thanks,
>   ozaki-r


Apply psz/psref to ifaddr and in_ifaddr

2016-07-25 Thread Ryota Ozaki
Hi,

This change makes struct ifaddr and struct in_ifaddr
(and lists for them) MP-safe using pserialize(9) and
psref(9). Changes to in6_ifaddr aren't included.

There are two patches:
  http://www.netbsd.org/~ozaki-r/revert-revert-ifaddr-change.diff
  http://www.netbsd.org/~ozaki-r/psref-ifa-ia4.diff

The first patch reverts the reverted commit that added
a member variable to struct ifaddr which broke netstat -ia
using kvm(3).

netstat is already modified to use sysctl instead of kvm(3)
and we don't need a dirty workaround to keep kvm(3)
backcompat (see PR kern/51325 for the discussion about it).

The second patch (tl;dr) applies pserialize and psref
to the data structures. Most changes are the usual ones
for pserialize and psref, while some changes require
restructuring, e.g., rt_getifa.
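
The "usual changes" on the reader side follow this pattern (a sketch
using helpers along the lines of those added by the patch, e.g.
ifa_acquire/ifa_release; exact names may differ):

  int s, bound;
  struct ifaddr *ifa;
  struct psref psref;

  bound = curlwp_bind();
  s = pserialize_read_enter();
  IFADDR_READER_FOREACH(ifa, ifp) {
          if (!match(ifa))        /* some hypothetical predicate */
                  continue;
          ifa_acquire(ifa, &psref);
          break;
  }
  pserialize_read_exit(s);

  if (ifa != NULL) {
          /* safe to use ifa here; sleeping is allowed */
          ifa_release(ifa, &psref);
  }
  curlwp_bindx(bound);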

One note is that pserialize_perform and psref_target_destroy
are commented out for now. This is because:
- the in6_ifaddr change is not done yet
- They aren't needed now because of softnet_lock
- They cause a deadlock because of softnet_lock

So we should enable them when we remove softnet_lock.

Any comments or suggestions?

Thanks,
  ozaki-r


Re: CVS commit: src/sys

2016-07-19 Thread Ryota Ozaki
On Fri, Jul 15, 2016 at 6:18 PM, Martin Husemann <mar...@duskware.de> wrote:
> On Fri, Jul 15, 2016 at 09:51:09AM +0900, Ryota Ozaki wrote:
>> Does NFS root matter?
>
> Not sure, Nick says his ERLITE still boots (where NFS root would be the
> only obvious difference), but his Cobalt is not using NFS root.
>
> I do not see the issue on ARM machines with NFS root.

I'm sorry, I cannot debug mips machines now.

If we cannot fix the issue easily, we can disable the workqueue
unless NET_MPSAFE is set, for now.

  ozaki-r


Re: CVS commit: src/sys

2016-07-14 Thread Ryota Ozaki
On Fri, Jul 15, 2016 at 4:20 AM, Martin Husemann  wrote:
>> > >> Modified Files:
>> > >>  src/sys/net: route.c
>> > >>  src/sys/netinet: ip_flow.c
>> > >>  src/sys/netinet6: ip6_flow.c nd6.c
>
> It is specifically the route.c change, backing that one out avoids the
> problem for me.
>
> I would suggest this patch:
>
> Index: route.c
> ===
> RCS file: /cvsroot/src/sys/net/route.c,v
> retrieving revision 1.171
> diff -u -p -r1.171 route.c
> --- route.c 13 Jul 2016 09:56:20 -  1.171
> +++ route.c 14 Jul 2016 19:12:41 -
> @@ -1160,12 +1160,12 @@ rt_timer_init(void)
> assert(rt_init_done == 0);
>
> LIST_INIT(&rttimer_queue_head);
> -   callout_init(&rt_timer_ch, 0);
> -   callout_reset(&rt_timer_ch, hz, rt_timer_timer, NULL);
> +   callout_init(&rt_timer_ch, CALLOUT_MPSAFE);
> error = workqueue_create(&rt_timer_wq, "rt_timer",
> rt_timer_work, NULL, PRI_SOFTNET, IPL_SOFTNET, WQ_MPSAFE);
> if (error)
> panic("%s: workqueue_create failed (%d)\n", __func__, error);
> +   callout_reset(&rt_timer_ch, hz, rt_timer_timer, NULL);
> rt_init_done = 1;
>  }

Thanks. LGTM.

>
>
> but it does not fix the issue. I don't see anything obviously wrong here
> nor understand why it seems to be mips specific.

Does NFS root matter?

  ozaki-r


Re: Run timers in workqueue

2016-07-07 Thread Ryota Ozaki
On Thu, Jul 7, 2016 at 10:34 AM, Ryota Ozaki <ozak...@netbsd.org> wrote:
> Hi,
>
> Timers (such as nd6_timer) typically free/destroy
> some data in callout (softint). If we apply psz/psref
> for such data, we cannot do free/destroy process in
> there because synchronization of psz/psref cannot
> be used in softint.
>
> So I propose to run such timers in workqueue. The
> change is straightforward; a timer's callout just
> calls workqueue_enqueue and workqueue's work does
> the exactly same as the original callout does.
>
> Here is a patch:
>   http://www.netbsd.org/~ozaki-r/timer-workqueue.diff
>
> The patch applies the proposal to nd6_timer,
> rt_timer_work, ipflow_slowtimo and ip6flow_slowtimo,
> which are needed by my local changes for MP-ification.
> Other timers would be migrated to workqueue if necessary.
>
> Any comments or suggestions?

I learned that enqueueing a work twice (i.e., calling
workqueue_enqueue again before the previously enqueued work
has run) isn't allowed. For nd6_timer and rt_timer_timer, this
doesn't happen because callout_reset is called only
from the workqueue's work. OTOH, ip{,6}flow_slowtimo's callout
can fire before its work starts and completes,
because the callout is called periodically regardless of
the completion of the work.

http://www.netbsd.org/~ozaki-r/timer-workqueue-fix.diff

The patch fixes the issue by simply adding a flag that
indicates a work is enqueued but not finished.
workqueue_enqueue is only called if the flag indicates
that no work is enqueued.
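
i.e., roughly (a sketch; names are invented and the flag would need
proper synchronization in real code):

  static struct workqueue *slowtimo_wq;   /* from workqueue_create() */
  static struct work slowtimo_wk;
  static callout_t slowtimo_ch;
  static bool slowtimo_enqueued;  /* set in callout, cleared by work */

  static void
  slowtimo_callout(void *arg)
  {
          if (!slowtimo_enqueued) {
                  slowtimo_enqueued = true;
                  workqueue_enqueue(slowtimo_wq, &slowtimo_wk, NULL);
          }
          /* periodic: re-arm regardless of the work's progress */
          callout_schedule(&slowtimo_ch, hz);
  }

  static void
  slowtimo_work(struct work *wk, void *arg)
  {
          /* thread context: psz/psref synchronization is usable */
          slowtimo_doit();        /* hypothetical timer body */
          slowtimo_enqueued = false;
  }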

Is the fix correct?

Thanks,
  ozaki-r


Run timers in workqueue

2016-07-06 Thread Ryota Ozaki
Hi,

Timers (such as nd6_timer) typically free/destroy
some data in a callout (softint). If we apply psz/psref
to such data, we cannot do the free/destroy processing
there, because the synchronization of psz/psref cannot
be used in softint.

So I propose to run such timers in a workqueue. The
change is straightforward; a timer's callout just
calls workqueue_enqueue, and the workqueue's work does
exactly the same as the original callout did.

Here is a patch:
  http://www.netbsd.org/~ozaki-r/timer-workqueue.diff

The patch applies the proposal to nd6_timer,
rt_timer_work, ipflow_slowtimo and ip6flow_slowtimo,
which are needed by my local changes for MP-ification.
Other timers would be migrated to workqueue if necessary.

Any comments or suggestions?

Thanks,
  ozaki-r


Re: Introduce curlwp_bind and curlwp_unbind for psref(9)

2016-06-30 Thread Ryota Ozaki
On Fri, Jul 1, 2016 at 9:51 AM, Masao Uebayashi <uebay...@gmail.com> wrote:
> On Tue, Jun 28, 2016 at 10:08 AM, Ryota Ozaki <ozak...@netbsd.org> wrote:
>> On Sat, Jun 25, 2016 at 11:56 AM, matthew green <m...@eterna.com.au> wrote:
>>>> > Since we already use preempt_disable() to force an lwp to stick to a cpu,
>>>> > doesn't that solve the problem?  If need be, we can enforce 
>>>> > nonpreemptable
>>>> > lwp's don't migrate.
>>>
>>> why would we want to disable preemption in code that merely wants
>>> to run on a particular cpu.
>>>
>>> i dno't understand why using the side effect of preempt_disable()
>>> is better than  explicitly stating what is wanted.
>>
>> Yes. That's why the API is introduced.
>
> And by introducing such a primitive function, you're doubling the
> number of combinations of primitives...
>
> I hope (some of) you surely understand all the implications of those
> combinations ... but I'm not.  I'm even not sure if kpreempt_disable()
> really prevents LWPs from migrating between CPUs.  It'd be really
> helpful if restrictions are expressed by *strict* assertions.
>
> Could you at least put KASSERT((l->l_pflags & LP_BOUND) == 0) in 
> curlwp_bind()?

The assertion is invalid because curlwp_bind can be used regardless
of whether LP_BOUND is set; it just makes sure LP_BOUND is set and
restores the original state in curlwp_bindx, just as spl*/splx do.
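
Usage mirrors spl*()/splx() (a sketch):

  int b1, b2;

  b1 = curlwp_bind();     /* sets LP_BOUND (it may already be set) */
  b2 = curlwp_bind();     /* fine: nesting is allowed */
  /* ... psref_acquire/psref_release on this CPU ... */
  curlwp_bindx(b2);       /* restores the state saved by b2 */
  curlwp_bindx(b1);       /* restores the original state */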

  ozaki-r


Re: Make sure to insert ifa/ia after initialization

2016-06-29 Thread Ryota Ozaki
On Wed, Jun 29, 2016 at 5:26 PM, Roy Marples <r...@marples.name> wrote:
> On 29/06/2016 04:36, Ryota Ozaki wrote:
>> On Wed, Jun 29, 2016 at 3:22 AM, Roy Marples <r...@marples.name> wrote:
>>> On 2016-06-28 02:03, Ryota Ozaki wrote:
>>>>
>>>> Hi,
>>>>
>>>> Basically we should insert an item to a collection
>>>> (say a list) after completed item's initialization to
>>>> avoid accessing an item that is initialized halfway.
>>>>
>>>> ifaddr and in{,6}_ifaddr aren't processed like so.
>>>> The patch below fixes the issue.
>>>>
>>>>   http://www.netbsd.org/~ozaki-r/insert_ifa_after_init.diff
>>>>
>>>> To this end, I need to tweak {arp,nd6}_rtrequest that
>>>> depend on that an ifaddr (in{,6}_ifaddr) is inserted
>>>> during its initialization; they explore interface's
>>>> address list to determine that rt_getkey(rt) of a given
>>>> rtentry is in the list (i.e., whether the route's
>>>> interface should be a loopback), which doesn't work
>>>> after the change. To make it work I use RTF_LOCAL flag
>>>> instead.
>>>>
>>>> Any comments or suggestions?
>>>
>>>
>>> +   if (rt->rt_flags & RTF_LOCAL) {
>>> +   rt->rt_expire = 0;
>>> +   if (useloopback) {
>>> +   rt->rt_ifp = lo0ifp;
>>> +   rt->rt_rmx.rmx_mtu = 0;
>>> +   }
>>> }
>>> -   rt->rt_flags |= RTF_LOCAL;
>>> -   /*
>>>
>>> This has reverted r1.181 which had this comment:
>>> If, for whatever reason, a local interface route is removed and then
>>> re-added, mark it as a local route.
>>
>> Thank you for pointing it out! I think we can restore the behavior
>> by restoring the removed code, which should work for that case
>> because the address already exists in the list. (Updated the patch.)
>
> Looks good now!

:)

>
>>>
>>> You've tested manually removing and re-adding the local route and noticing
>>> the flag is preserved?
>>
>> I'm trying, but I don't know how to re-add. Could you tell me how
>> to do that?
>>
>>   kvm# ifconfig vioif0 10.0.0.1/24
>>   kvm# netstat -nr -f inet |grep 10.0.0.1
>>   10.0.0.1   link#1 UHLl--  -  lo0
>>   kvm# route delete 10.0.0.1
>>   delete host 10.0.0.1
>>   kvm# netstat -nr -f inet |grep 10.0.0.1
>>   kvm# route add -host 10.0.0.1 -iface vioif0
>>   route: vioif0: bad value
>>   kvm# route add -host 10.0.0.1 -iface lo0
>>   route: lo0: bad value
>
> route add -host 10.0.0.1 -link vioif0 -ifp lo0
>
> On my -current (a few days old and without your patch) it claims the
> route is added, but no route actually appears so it's hard for me to
> verify this, but I'm pretty sure the command is correct.

Thanks. It seems that the entry is created as an ARP cache entry.
I guess something is broken due to the changes separating the
nexthop cache for backcompat. I'll investigate the regression.

  ozaki-r


Make sure to free all interface addresses in if_detach

2016-06-29 Thread Ryota Ozaki
Hi,

Addresses of an interface (struct ifaddr) have
a (reverse) pointer to the interface object
(ifa->ifa_ifp). If the addresses are reliably
freed when their interface is destroyed, the
pointer is always valid and we don't need the
tweak of replacing the pointer with if_index as
mbuf does.

Is the assumption correct? Not for now.

The following patches ensure the assumption.

  http://www.netbsd.org/~ozaki-r/debug_ifaref.diff
  http://www.netbsd.org/~ozaki-r/free_all_ifa.diff

The 1st patch adds debugging functions to check
if all addresses are freed at the end of if_detach.

The 2nd patch tweaks some destruction functions
to satisfy the assumption.

- Deactivate the interface at the very beginning of
  if_detach. This prevents in6_unlink_ifa from
  saving multicast addresses
- Invalidate rtcache(s) and clear a rtentry
  referencing an address on RTM_DELETE. rtcache(s)
  may delay freeing an address
- Replace callout_stop with callout_halt for DAD
  timers to ensure such timers are stopped in
  if_detach

Any comments or suggestions? Am I missing some
cases?

Thanks,
  ozaki-r

