Re: svn commit: r194672 - in head/sys: kern netinet sys

Andre Oppermann Tue, 23 Jun 2009 01:30:20 -0700

Robert Watson wrote:


On Mon, 22 Jun 2009, Andre Oppermann wrote:

 Add soreceive_stream(), an optimized version of soreceive() for
 stream (TCP) sockets.


While this sounds like very interesting work, I was struck by the lack of:

  Reviewed by:    ?
  Tested by:    ?

Have you tested this change with some of our TCP conumer edge cases,especially in-kernel consumers:


- Accept filters
- NFS client
- NFS server
- smbfs
- iscsi initiator


No, yes, yes, no, no.  Plus a large number of SSH copying (it will trip
over every corrupted byte) and fetching half of the ports collection
programs source code and comparing the checksum.

I also assume that the plan is that this will not be enabled by defaultin 8.0? With soreceive_dgram, it took three months to shake out thenits, and the datagram case is far simpler than the stream case. Giventhat experience, I'd guess it will take longer for the new stream codeto shake out, and probably involve painfully tracking subtle bugs inblocking code, data corruption in NFS, etc. I've identified one suchpossible bug for iscsi.


No, it not supposed to be enable by default in 8.0.

To provide easier access for early adopters, we shipped withsoreceive_dgram available for UDP, but optionally enabled by a loadertunable, for at least one minor rev. I would recommend doing somethingsimilar here.


A good hint, thank you.

A few inline comments below, including one serious bug (assuming myreading is right). This doesn't consistute a full review because it'stoo early in the morning to reason fully about the socket buffer code.


OK. Thanks. I will look into the specifics later today.

--
Andre

Robert N M Watson
Computer Laboratory
University of Cambridge
 It is functionally identical to generic soreceive() but has a
 number stream specific optimizations:
 o does only one sockbuf unlock/lock per receive independent of
   the length of data to be moved into the uio compared to
   soreceive() which unlocks/locks per *mbuf*.
Hmm. I count four locks and unlocks per receive, assuming no blockingand a I/O to user memory:
- sblock/sbunlock to stabilize sb_mb and prevent I/O interlacing.
- start/stop socket buffer mutex lock/unlock at beginning/end of function
- unlock/relock around m_mbuftouio()
- unlock/relock around pru_rcvd
One idea I've futzed with a little is whether or not we could dropsblock() in certain cases for receive and transmit. As far as I'maware, it is used for at least four purposes in receive:
- Prevent I/O interlacing from concurrent system calls
- Provide consistency for blocking logic during concurrent system calls
- Stabilize non-NULL sb_mb while the socket buffer mutex is dropped
- Minimize contention on the socket buffer mutex during concurrent system
  calls
Some of these are more important than others -- in particular, thefunction of preventing I/O interlacing for stream system calls seemssomething we might drop, as it's not a guarantee provided by other OS's,so no portable application should depend on it. Have you looked at allat this?
 o uses m_mbuftouio() instead of its own copy(out) variant.
 o much more compact code flow as a large number of special
   cases is removed.
 o much improved reability.

 It offers significantly reduced CPU usage and lock contention
 when receiving fast TCP streams.  Additional gains are obtained
 when the receiving application is using SO_RCVLOWAT to batch up
 some data before a read (and wakeup) is done.

 This function was written by "reverse engineering" and is not
 just a stripped down variant of soreceive().

 It is not yet enabled by default on TCP sockets.  Instead it is
 commented out in the protocol initialization in tcp_usrreq.c
 until more widespread testing has been done.
Have you looked at using this with SCTP and UNIX domain sockets? UNIXdomain socket performance is quite important for database workloads, andSCTP isgoing to be increasingly important for Internet applications. In theUNIX domain socket case, not supporting control mbuf will be a potentialproblem (see comments below about either implementing it or fallingback, and if falling back m_mbuftouio needs to know how to properly freethem).
 Testers, especially with 10GigE gear, are welcome.

 MFP4:    r164817 //depot/user/andre/soreceive_stream/

Modified:
 head/sys/kern/uipc_socket.c
 head/sys/netinet/tcp_usrreq.c
 head/sys/sys/socketvar.h

Modified: head/sys/kern/uipc_socket.c
==============================================================================
--- head/sys/kern/uipc_socket.c    Mon Jun 22 22:54:44 2009    (r194671)
+++ head/sys/kern/uipc_socket.c    Mon Jun 22 23:08:05 2009    (r194672)
@@ -1857,6 +1857,202 @@ release:
}

/*
+ * Optimized version of soreceive() for stream (TCP) sockets.
+ */
+int
+soreceive_stream(struct socket *so, struct sockaddr **psa, struct uio*uio,
+    struct mbuf **mp0, struct mbuf **controlp, int *flagsp)
+{
+    int len = 0, error = 0, flags, oresid;
+    struct sockbuf *sb;
+    struct mbuf *m, *n = NULL;
+
+    /* We only do stream sockets. */
+    if (so->so_type != SOCK_STREAM)
+        return (EINVAL);
This should be a KASSERT(). You should also assert !PR_ADDR and!PR_ATOMIC, and possibly a few other things to ensure thatsoreceive_stream is not used with protocols that require features itdoes not implement.
Much better to discover on sanity-checking the kernel than throughsubtle application bugs later when people start throwing switcheswithout giving things full consideration.
+    if (psa != NULL)
+        *psa = NULL;
+    if (controlp != NULL)
+        return (EINVAL);
This would seem to preclude easy future use of ancillary data with TCP-- one type I've been considering adding is TCP_INFO updates as part ofthe stream. How would you feel about making this call intosoreceive_generic() instead?
My reading of the control code was that it actually was fairly simpleand I did include it in soreceive_dgram() (especially given that it'spopular for UDP so you can get the receive IP address per-datagram).
+    if (flagsp != NULL)
+        flags = *flagsp &~ MSG_EOR;
+    else
+        flags = 0;
+    if (flags & MSG_OOB)
+        return (soreceive_rcvoob(so, uio, flags));
+    if (mp0 != NULL)
+        *mp0 = NULL;
+
+    sb = &so->so_rcv;
+
+    /* Prevent other readers from entering the socket. */
+    error = sblock(sb, SBLOCKWAIT(flags));
+    if (error)
+        goto out;
+    SOCKBUF_LOCK(sb);
+
+    /* Easy one, no space to copyout anything. */
+    if (uio->uio_resid == 0) {
+        error = EINVAL;
+        goto out;
+    }
Didn't we previously allow a resid of 0 to be used to probe for socketerrors without receiving data? I've never used this semantic but itseems to be supported, which undoubtably means someone is using it :-).I notice that previously we didn't allow receiving EWOULDBLOCK using a0-byte buffer, but perhaps we should have.
+    oresid = uio->uio_resid;
+
+    /* We will never ever get anything unless we are connected. */
+    if (!(so->so_state & (SS_ISCONNECTED|SS_ISDISCONNECTED))) {
+        /* When disconnecting there may be still some data left. */
+        if (sb->sb_cc > 0)
+            goto deliver;
+        if (!(so->so_state & SS_ISDISCONNECTED))
+            error = ENOTCONN;
+        goto out;
+    }
+
+    /* Socket buffer is empty and we shall not block. */
+    if (sb->sb_cc == 0 &&
+ ((sb->sb_flags & SS_NBIO) || (flags &(MSG_DONTWAIT|MSG_NBIO)))) {
+        error = EAGAIN;
+        goto out;
+    }
I think you've changed the error number reporting here in subtle ways,but it's early in the morning. It appears we now prefer ENOTCONN toso_error once the socket has entered SS_ISDISCONNECTED. We need toreport and clear so_error if non-0 so that ECONNRESET, EPERM, etc, canbe reported properly.
The new so_state logic is markedly different from what came before, butI'm not sure what the impact of that is. Certainly the comment needs tochange to "... are or have been connected". The previous logic appearsto have handled SS_ISCONNECTING differently, and possiblySS_ISDISCONNECTING.
+restart:
+    SOCKBUF_LOCK_ASSERT(&so->so_rcv);
+
+    /* Abort if socket has reported problems. */
+    if (so->so_error) {
+        if (sb->sb_cc > 0)
+            goto deliver;
+        if (oresid > uio->uio_resid)
+            goto out;
+        error = so->so_error;
+        if (!(flags & MSG_PEEK))
+            so->so_error = 0;
+        goto out;
+    }
+
+    /* Door is closed.  Deliver what is left, if any. */
+    if (sb->sb_state & SBS_CANTRCVMORE) {
+        if (sb->sb_cc > 0)
+            goto deliver;
+        else
+            goto out;
+    }
+
+    /* Socket buffer got some data that we shall deliver now. */
+    if (sb->sb_cc > 0 && !(flags & MSG_WAITALL) &&
+        ((sb->sb_flags & SS_NBIO) ||
+         (flags & (MSG_DONTWAIT|MSG_NBIO)) ||
+         sb->sb_cc >= sb->sb_lowat ||
+         sb->sb_cc >= uio->uio_resid ||
+         sb->sb_cc >= sb->sb_hiwat) ) {
+        goto deliver;
+    }
+
+    /* On MSG_WAITALL we must wait until all data or error arrives. */
+    if ((flags & MSG_WAITALL) &&
+        (sb->sb_cc >= uio->uio_resid || sb->sb_cc >= sb->sb_lowat))
+        goto deliver;
+
+    /*
+     * Wait and block until (more) data comes in.
+     * NB: Drops the sockbuf lock during wait.
+     */
+    error = sbwait(sb);
+    if (error)
+        goto out;
+    goto restart;
+
+deliver:
+    SOCKBUF_LOCK_ASSERT(&so->so_rcv);
+    KASSERT(sb->sb_cc > 0, ("%s: sockbuf empty", __func__));
+    KASSERT(sb->sb_mb != NULL, ("%s: sb_mb == NULL", __func__));
+
+    /* Statistics. */
+    if (uio->uio_td)
+        uio->uio_td->td_ru.ru_msgrcv++;
+
+ /* Fill uio until full or current end of socket buffer isreached. */
+    len = min(uio->uio_resid, sb->sb_cc);
+    if (mp0 != NULL) {
+        /* Dequeue as many mbufs as possible. */
+        if (!(flags & MSG_PEEK) && len >= sb->sb_mb->m_len) {
+            for (*mp0 = m = sb->sb_mb;
+                 m != NULL && m->m_len <= len;
+                 m = m->m_next) {
+                len -= m->m_len;
+                uio->uio_resid -= m->m_len;
+                sbfree(sb, m);
+                n = m;
+            }
Since control mbufs aren't supported, you should assert !MT_CONTROL. Oradd control mbuf support per above.
Again, early in the morning, but: does (mp0 != NULL) still work with(flags & MSG_WAITALL)? In soreceive_generic(), mp0 is updated to pointto &m->m_next as the loop iterates, so that as you go back to 'repeat:',the chain pointed to by mp0 is appended to rather than replaced, whereasyour code appears to replace it in each loop (which seems wrong).Perhaps it's just too early in the morning.
I think the only consumer of (mp0 != NULL) with MSG_WAITALL is the iscsiinitiator, but breaking that would be bad. smbfs probably *should* useit... :-)
+            sb->sb_mb = m;
+            if (sb->sb_mb == NULL)
+                SB_EMPTY_FIXUP(sb);
+            n->m_next = NULL;
+        }
+        /* Copy the remainder. */
+        if (len > 0) {
+            KASSERT(sb->sb_mb != NULL,
+                ("%s: len > 0 && sb->sb_mb empty", __func__));
+
+            m = m_copym(sb->sb_mb, 0, len, M_DONTWAIT);
+            if (m == NULL)
+                len = 0;    /* Don't flush data from sockbuf. */
+            else
+                uio->uio_resid -= m->m_len;
+            if (*mp0 != NULL)
+                n->m_next = m;
+            else
+                *mp0 = m;
+            if (*mp0 == NULL) {
+                error = ENOBUFS;
+                goto out;
+            }
+        }
+    } else {
+        /* NB: Must unlock socket buffer as uiomove may sleep. */
+        SOCKBUF_UNLOCK(sb);
+        error = m_mbuftouio(uio, sb->sb_mb, len);
m_mbuftouio() should assert !MT_CONTROL on each mbuf if it doesn'tsupport control mbufs. See also above. :-)
+        SOCKBUF_LOCK(sb);
+        if (error)
+            goto out;
+    }
+    SBLASTRECORDCHK(sb);
+    SBLASTMBUFCHK(sb);
+
+    /*
+     * Remove the delivered data from the socket buffer unless we
+     * were only peeking.
+     */
+    if (!(flags & MSG_PEEK)) {
+        if (len > 0)
+            sbdrop_locked(sb, len);
+
+        /* Notify protocol that we drained some data. */
+        if ((so->so_proto->pr_flags & PR_WANTRCVD) &&
+            (((flags & MSG_WAITALL) && uio->uio_resid > 0) ||
+             !(flags & MSG_SOCALLBCK))) {
+            SOCKBUF_UNLOCK(sb);
+            (*so->so_proto->pr_usrreqs->pru_rcvd)(so, flags);
+            SOCKBUF_LOCK(sb);
+        }
+    }
+
+    /*
+     * For MSG_WAITALL we may have to loop again and wait for
+     * more data to come in.
+     */
+    if ((flags & MSG_WAITALL) && uio->uio_resid > 0)
+        goto restart;
+out:
+    SOCKBUF_LOCK_ASSERT(sb);
+    SBLASTRECORDCHK(sb);
+    SBLASTMBUFCHK(sb);
+    SOCKBUF_UNLOCK(sb);
+    sbunlock(sb);
+    return (error);
+}
+
+/*
* Optimized version of soreceive() for simple datagram cases fromuserspace.
 * Unlike in the stream case, we're able to drop a datagram if copyout()
* fails, and because we handle datagrams atomically, we don't need touse a
Modified: head/sys/netinet/tcp_usrreq.c
==============================================================================--- head/sys/netinet/tcp_usrreq.c Mon Jun 22 22:54:44 2009(r194671)+++ head/sys/netinet/tcp_usrreq.c Mon Jun 22 23:08:05 2009(r194672)
@@ -1032,6 +1032,9 @@ struct pr_usrreqs tcp_usrreqs = {
    .pru_send =        tcp_usr_send,
    .pru_shutdown =        tcp_usr_shutdown,
    .pru_sockaddr =        in_getsockaddr,
+#if 0
+    .pru_soreceive =    soreceive_stream,
+#endif
    .pru_sosetlabel =    in_pcbsosetlabel,
    .pru_close =        tcp_usr_close,
};
@@ -1053,6 +1056,9 @@ struct pr_usrreqs tcp6_usrreqs = {
    .pru_send =        tcp_usr_send,
    .pru_shutdown =        tcp_usr_shutdown,
    .pru_sockaddr =        in6_mapped_sockaddr,
+#if 0
+    .pru_soreceive =    soreceive_stream,
+#endif
     .pru_sosetlabel =    in_pcbsosetlabel,
    .pru_close =        tcp_usr_close,
};

Modified: head/sys/sys/socketvar.h
==============================================================================
--- head/sys/sys/socketvar.h    Mon Jun 22 22:54:44 2009    (r194671)
+++ head/sys/sys/socketvar.h    Mon Jun 22 23:08:05 2009    (r194672)
@@ -345,6 +345,9 @@ int    sopoll_generic(struct socket *so, in
        struct ucred *active_cred, struct thread *td);
int soreceive(struct socket *so, struct sockaddr **paddr, structuio *uio,
        struct mbuf **mp0, struct mbuf **controlp, int *flagsp);
+int    soreceive_stream(struct socket *so, struct sockaddr **paddr,
+        struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
+        int *flagsp);
int    soreceive_dgram(struct socket *so, struct sockaddr **paddr,
        struct uio *uio, struct mbuf **mp0, struct mbuf **controlp,
        int *flagsp);


_______________________________________________
svn-src-all@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-all
To unsubscribe, send any mail to "svn-src-all-unsubscr...@freebsd.org"

Re: svn commit: r194672 - in head/sys: kern netinet sys

Reply via email to