On Mon, Aug 15, 2022 at 02:36:34PM +0200, Alexander Bluhm wrote:
> On Mon, Aug 15, 2022 at 02:29:02PM +0300, Vitaliy Makkoveev wrote:
> > Nothing special, just get assign `inp' and `tp' from passed socket. I
> > want to do this before (*pru_usrreq)() split to avoid huge code
> > duplication.
> >
> > We have "tp might get 0 when using socket splicing" commentary. I
> > checked where we set `inp_ppcb' to NULL and found, tcp_close() is the
> > only place, and the following in_pcbdetach() sets `so_pcb' to NULL. So
> > this commentary looks wrong, because we can't have TCP sockets without
> > TCP pcb attached.
>
> I added the comment and check after a panic in production.
>
> ----------------------------
> revision 1.109
> date: 2012/01/03 21:50:12; author: bluhm; state: Exp; lines: +6 -2;
> When used with socket splicing, tcp_usrreq() might get called with
> a socket that has an inp but tp is NULL. The call stack for that
> is tcp_input() tcp_close() soisdisconnected() sorwakeup() somove()
> tcp_usrreq(PRU_RCVD). To avoid a NULL dereference, just return in
> that case.
> ok henning@
> ----------------------------
>
> tcp_input() calls tcp_close() after a reset packet.
tcp_close() calls in_pcbdetach() after it sets `inp_ppcb' to NULL:
tcp_close(struct tcpcb *tp)
{
/* ... */
inp->inp_ppcb = NULL;
soisdisconnected(so);
in_pcbdetach(inp);
return (NULL);
}
soisdisconnected() doesn't release netlock, so concurrent thread can't
access this `inp'.
soisdisconnected(struct socket *so)
{
soassertlocked(so);
so->so_state &= ~(SS_ISCONNECTING|SS_ISCONNECTED|SS_ISDISCONNECTING);
so->so_state |= (SS_CANTRCVMORE|SS_CANTSENDMORE|SS_ISDISCONNECTED);
wakeup(&so->so_timeo);
sowwakeup(so);
sorwakeup(so);
}
in_pcbdetach() sets `so_pcb' to NULL just at the beginning.
in_pcbdetach(struct inpcb *inp)
{
struct socket *so = inp->inp_socket;
struct inpcbtable *table = inp->inp_table;
NET_ASSERT_LOCKED();
so->so_pcb = NULL;
/* ... */
}
So according the current code, we can't have the `inp' with NULL `tp'
after tcp_close() call. Looks like we had missing netlock somewhere.
>
> > This diff works fine with sys/kern/sosplice test.
>
> I am not sure that regress covers the tp == NULL case.
> These reset conditions are hard to trigger.
>
> Maybe it does not happen anymore as I have decoupled somove() with
> a soplice thread 4 years later.
>
> ----------------------------
> revision 1.155
> date: 2016/08/25 14:13:19; author: bluhm; state: Exp; lines: +48 -9;
> commitid: PnBATF8qpqSyZflF;
> Spliced TCP sockets become faster when the output part is running
> as its own task thread. This is inspired by userland copy where a
> process also has to go through the scheduler. This gives the socket
> buffer a chance to be filled up and tcp_output() is called less
> often and with bigger chunks.
> When two kernel tasks share all the workload, the current scheduler
> implementation will hang userland processes on single cpu machines.
> As a workaround put a yield() into the splicing thread after each
> task execution. This reduces the number of calls of tcp_output()
> even more.
> OK tedu@ mpi@
> ----------------------------
>
> /*
> * TCP has a sendbuffer that can handle multiple packets
> * at once. So queue the stream a bit to accumulate data.
> * The sosplice thread will call somove() later and send
> * the packets calling tcp_output() only once.
> * In the UDP case, send out the packets immediately.
> * Using a thread would make things slower.
> */
> if (so->so_proto->pr_flags & PR_WANTRCVD)
> task_add(sosplice_taskq, &so->so_splicetask);
>
> My original plan was to keep sosplice thread optional and have the
> code work without it. On the other hand we use it in TCP for 6
> years now.
>
> Should we move if (tp == NULL) to PRU_RCVD? That is the relevant
> case. I think if ((so->so_state & (SS_ISCONNECTED|SS_ISCONNECTING))
> != 0) prevents us from calling tcp_output(tp) with NULL.
>
> So the only place to crash by dereferencing tp is
> if (so->so_options & SO_DEBUG) {
> otp = tp;
> ostate = tp->t_state;
> }
>
> When I added the check in rev 1.109, tp was always dereferenced.
> if (inp) {
> tp = intotcpcb(inp);
> /* tp might get 0 when using socket splicing */
> if (tp == NULL) {
> splx(s);
> return (0);
> }
> #ifdef KPROF
> tcp_acounts[tp->t_state][req]++;
> #endif
> ostate = tp->t_state;
> } else
> ostate = 0;
>
> We seem to be safe unless someone removes the sosplice thread and
> runs with SO_DEBUG. Is this enough history to justify to remove a
> check while just moving code around?
>
> If we use if (tp != NULL && (so->so_options & SO_DEBUG)) everything
> should be fine.
>
> > + KASSERT(*rtp != NULL);
>
> Remember deraadt@ yelling about too many KASSERT().
>
What about to return EINVAL in the case, when `tp' is missing? In the
somove() we don't check the PRU_RCVD returned value, so this path
doesn't affected. In the other cases we stops to lie to the userland.
The `inp' without `tp' should not exist, right? This check exists only
to silently prevent panics.
Index: sys/netinet/tcp_usrreq.c
===================================================================
RCS file: /cvs/src/sys/netinet/tcp_usrreq.c,v
retrieving revision 1.187
diff -u -p -r1.187 tcp_usrreq.c
--- sys/netinet/tcp_usrreq.c 15 Aug 2022 09:11:39 -0000 1.187
+++ sys/netinet/tcp_usrreq.c 15 Aug 2022 13:16:39 -0000
@@ -142,6 +142,36 @@ struct inpcbtable tcbtable;
int tcp_fill_info(struct tcpcb *, struct socket *, struct mbuf *);
int tcp_ident(void *, size_t *, void *, size_t, int);
+static inline int tcp_sogetpcb(struct socket *, struct inpcb **,
+ struct tcpcb **);
+
+static inline int
+tcp_sogetpcb(struct socket *so, struct inpcb **rinp, struct tcpcb **rtp)
+{
+ struct inpcb *inp;
+ struct tcpcb *tp;
+
+ /*
+ * When a TCP is attached to a socket, then there will be
+ * a (struct inpcb) pointed at by the socket, and this
+ * structure will point at a subsidiary (struct tcpcb).
+ */
+ if ((inp = sotoinpcb(so)) == NULL) {
+ if (so->so_error)
+ return so->so_error;
+ return EINVAL;
+ }
+
+ /* tp might get 0 when using socket splicing */
+ if ((tp = intotcpcb(inp)) == NULL)
+ return EINVAL;
+
+ *rinp = inp;
+ *rtp = tp;
+
+ return 0;
+}
+
/*
* Process a TCP user request for TCP tb. If this is a send request
* then m is the mbuf chain of send data. If this is a timer expiration
@@ -153,7 +183,7 @@ tcp_usrreq(struct socket *so, int req, s
struct mbuf *control, struct proc *p)
{
struct inpcb *inp;
- struct tcpcb *otp = NULL, *tp = NULL;
+ struct tcpcb *otp = NULL, *tp;
int error = 0;
short ostate;
@@ -175,22 +205,9 @@ tcp_usrreq(struct socket *so, int req, s
goto release;
}
- inp = sotoinpcb(so);
- /*
- * When a TCP is attached to a socket, then there will be
- * a (struct inpcb) pointed at by the socket, and this
- * structure will point at a subsidiary (struct tcpcb).
- */
- if (inp == NULL) {
- error = so->so_error;
- if (error == 0)
- error = EINVAL;
- goto release;
- }
- tp = intotcpcb(inp);
- /* tp might get 0 when using socket splicing */
- if (tp == NULL)
+ if ((error = tcp_sogetpcb(so, &inp, &tp)))
goto release;
+
if (so->so_options & SO_DEBUG) {
otp = tp;
ostate = tp->t_state;
@@ -739,28 +756,15 @@ int
tcp_detach(struct socket *so)
{
struct inpcb *inp;
- struct tcpcb *otp = NULL, *tp = NULL;
+ struct tcpcb *otp = NULL, *tp;
int error = 0;
short ostate;
soassertlocked(so);
- inp = sotoinpcb(so);
- /*
- * When a TCP is attached to a socket, then there will be
- * a (struct inpcb) pointed at by the socket, and this
- * structure will point at a subsidiary (struct tcpcb).
- */
- if (inp == NULL) {
- error = so->so_error;
- if (error == 0)
- error = EINVAL;
+ if ((error = tcp_sogetpcb(so, &inp, &tp)))
return (error);
- }
- tp = intotcpcb(inp);
- /* tp might get 0 when using socket splicing */
- if (tp == NULL)
- return (0);
+
if (so->so_options & SO_DEBUG) {
otp = tp;
ostate = tp->t_state;