Re: [PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-24 Thread Lawrence Brakmo
Note that without this fix the 99% latencies when doing 10KB RPCs in a 
congested network using DCTCP are 40ms vs. 190us with the patch. Also note that 
these 40ms high tail latencies started after commit 
3759824da87b30ce7a35b4873b62b0ba38905ef5 in Jul 2015, which triggered the 
bugs/features we are fixing/adding. I agree it is debatable whether this is a 
bug fix or a feature improvement, and I am fine either way.

Lawrence

On 7/24/18, 10:13 AM, "Neal Cardwell"  wrote:

On Tue, Jul 24, 2018 at 1:07 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 23, 2018 at 7:23 PM, Daniel Borkmann  
wrote:
> > Should this go to net tree instead where all the other fixes went?
> I am neutral but this feels more like a feature improvement

I agree this feels like a feature improvement rather than a bug fix.

neal




[PATCH net-next] tcp: ack immediately when a cwr packet arrives

2018-07-23 Thread Lawrence Brakmo
We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed even though it
is CWR marked, adding up to 40ms to the RPC latency.

This patch ensures that arriving CWR-marked data packets will be acked
immediately.

Packetdrill script to reproduce the problem:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect01] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
// Previously the ACK sequence below would be 4501, causing a long RTO
+0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
+0 > [ect01] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257
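
(Aside: the script above can be replayed with packetdrill, for example

    packetdrill dctcp_cwr_ack.pkt

where the file name is arbitrary; any extra command-line options depend on the
local packetdrill build and test setup.)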

Modified based on comments by Neal Cardwell 

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 91dbb9afb950..2370fd79c5c5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -246,8 +246,15 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct tcp_sock *tp, const struct sk_buff *skb)
 {
-   if (tcp_hdr(skb)->cwr)
+   if (tcp_hdr(skb)->cwr) {
tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+
+   /* If the sender is telling us it has entered CWR, then its
+* cwnd may be very low (even just 1 packet), so we should ACK
+* immediately.
+*/
+   tcp_enter_quickack_mode((struct sock *)tp, 2);
+   }
 }
 
 static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
-- 
2.17.1



Re: [PATCH bpf-next 0/6] TCP-BPF callback for listening sockets

2018-07-12 Thread Lawrence Brakmo
LGTM. Thank you for adding the listen callback and cleaning up the test.

Acked-by: Lawrence Brakmo 


On 7/11/18, 8:34 PM, "Andrey Ignatov"  wrote:

This patchset adds a TCP-BPF callback for listening sockets.

Patch 0001 provides more details and is the main patch in the set.

Patch 0006 adds selftest for the new callback.

Other patches are bug fixes and improvements in the TCP-BPF selftest to make it
easier to extend in 0006.


Andrey Ignatov (6):
  bpf: Add BPF_SOCK_OPS_TCP_LISTEN_CB
  bpf: Sync bpf.h to tools/
  selftests/bpf: Fix const'ness in cgroup_helpers
  selftests/bpf: Switch test_tcpbpf_user to cgroup_helpers
  selftests/bpf: Better verification in test_tcpbpf
  selftests/bpf: Test case for BPF_SOCK_OPS_TCP_LISTEN_CB

 include/uapi/linux/bpf.h  |   3 +
 net/ipv4/af_inet.c|   1 +
 tools/include/uapi/linux/bpf.h|   3 +
 tools/testing/selftests/bpf/Makefile  |   1 +
 tools/testing/selftests/bpf/cgroup_helpers.c  |   6 +-
 tools/testing/selftests/bpf/cgroup_helpers.h  |   6 +-
 tools/testing/selftests/bpf/test_tcpbpf.h |   1 +
 .../testing/selftests/bpf/test_tcpbpf_kern.c  |  17 ++-
 .../testing/selftests/bpf/test_tcpbpf_user.c  | 119 +-
 9 files changed, 88 insertions(+), 69 deletions(-)

-- 
2.17.1
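
For orientation, below is a minimal sketch (illustrative only, not taken from
this patchset; the program name and the chosen action are made up) of how a
sock_ops BPF program might react to the new BPF_SOCK_OPS_TCP_LISTEN_CB
callback, which fires when a socket transitions to TCP_LISTEN:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int listen_cb_example(struct bpf_sock_ops *skops)
{
        switch (skops->op) {
        case BPF_SOCK_OPS_TCP_LISTEN_CB:
                /* The socket just entered TCP_LISTEN; as an example policy,
                 * ask for further state-change callbacks on it.
                 */
                bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG);
                break;
        default:
                break;
        }
        return 1;
}

char _license[] SEC("license") = "GPL";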





Re: [PATCH net 1/2] tcp: fix dctcp delayed ACK schedule

2018-07-12 Thread Lawrence Brakmo
On 7/12/18, 9:05 AM, "netdev-ow...@vger.kernel.org on behalf of Yuchung Cheng" 
 wrote:

Previously, when a data segment was sent an ACK was piggybacked
on the data segment without generating a CA_EVENT_NON_DELAYED_ACK
event to notify congestion control modules. So the DCTCP
ca->delayed_ack_reserved flag could incorrectly stay set when
in fact there were no delayed ACKs being reserved. This could result
in sending a special ECN notification ACK that carries an older
ACK sequence, when in fact there was no need for such an ACK.
DCTCP keeps track of the delayed ACK status with its own separate
state ca->delayed_ack_reserved. Previously it could accidentally cancel
the delayed ACK without updating this field upon sending a special
ACK that carries an older ACK sequence. This inconsistency would, in
some cases, lead to the DCTCP receiver never acknowledging the latest
data until the sender timed out and retransmitted.

Packetdrill script (provided by Larry Brakmo)

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect01] . 1:1(0) ack 1001

0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect01] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect01] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect01] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
// Previously the ACK sequence below would be 4501, causing a long RTO
+0.040~+0.045 > [ect01] . 4:4(0) ack 5501   // delayed ack

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // More data
+0 > [ect01] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257

Reported-by: Larry Brakmo 
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 net/ipv4/tcp_dctcp.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..89f88b0d8167 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -134,7 +134,8 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)
/* State has changed from CE=0 to CE=1 and delayed
 * ACK has not sent yet.
 */
-   if (!ca->ce_state && ca->delayed_ack_reserved) {
+   if (!ca->ce_state &&
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
u32 tmp_rcv_nxt;
 
/* Save current rcv_nxt. */
@@ -164,7 +165,8 @@ static void dctcp_ce_state_1_to_0(struct sock *sk)
/* State has changed from CE=1 to CE=0 and delayed
 * ACK has not sent yet.
 */
-   if (ca->ce_state && ca->delayed_ack_reserved) {
+   if (ca->ce_state &&
+   inet_csk(sk)->icsk_ack.pending & ICSK_ACK_TIMER) {
u32 tmp_rcv_nxt;
 
/* Save current rcv_nxt. */
-- 
2.18.0.203.gfac676dfb9-goog

LGTM. Thanks for the patch.

Acked-by: Lawrence Brakmo 




Re: [PATCH net 2/2] tcp: remove DELAYED ACK events in DCTCP

2018-07-12 Thread Lawrence Brakmo
LGTM. Thanks for the patch.

Acked-by: Lawrence Brakmo 

On 7/12/18, 9:05 AM, "Yuchung Cheng"  wrote:

After fixing the way DCTCP tracks delayed ACKs, the delayed-ACK
related callbacks are no longer needed.

Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Acked-by: Neal Cardwell 
---
 include/net/tcp.h |  2 --
 net/ipv4/tcp_dctcp.c  | 25 -------------------------
 net/ipv4/tcp_output.c |  4 ----
 3 files changed, 31 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index af3ec72d5d41..3482d13d655b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -912,8 +912,6 @@ enum tcp_ca_event {
CA_EVENT_LOSS,  /* loss timeout */
CA_EVENT_ECN_NO_CE, /* ECT set, but not CE marked */
CA_EVENT_ECN_IS_CE, /* received CE marked IP packet */
-   CA_EVENT_DELAYED_ACK,   /* Delayed ack is sent */
-   CA_EVENT_NON_DELAYED_ACK,
 };
 
 /* Information about inbound ACK, passed to cong_ops->in_ack_event() */
diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 89f88b0d8167..5869f89ca656 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -55,7 +55,6 @@ struct dctcp {
u32 dctcp_alpha;
u32 next_seq;
u32 ce_state;
-   u32 delayed_ack_reserved;
u32 loss_cwnd;
 };
 
@@ -96,7 +95,6 @@ static void dctcp_init(struct sock *sk)
 
ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
 
-   ca->delayed_ack_reserved = 0;
ca->loss_cwnd = 0;
ca->ce_state = 0;
 
@@ -250,25 +248,6 @@ static void dctcp_state(struct sock *sk, u8 new_state)
}
 }
 
-static void dctcp_update_ack_reserved(struct sock *sk, enum tcp_ca_event ev)
-{
-   struct dctcp *ca = inet_csk_ca(sk);
-
-   switch (ev) {
-   case CA_EVENT_DELAYED_ACK:
-   if (!ca->delayed_ack_reserved)
-   ca->delayed_ack_reserved = 1;
-   break;
-   case CA_EVENT_NON_DELAYED_ACK:
-   if (ca->delayed_ack_reserved)
-   ca->delayed_ack_reserved = 0;
-   break;
-   default:
-   /* Don't care for the rest. */
-   break;
-   }
-}
-
 static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
 {
switch (ev) {
@@ -278,10 +257,6 @@ static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
case CA_EVENT_ECN_NO_CE:
dctcp_ce_state_1_to_0(sk);
break;
-   case CA_EVENT_DELAYED_ACK:
-   case CA_EVENT_NON_DELAYED_ACK:
-   dctcp_update_ack_reserved(sk, ev);
-   break;
default:
/* Don't care for the rest. */
break;
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8e08b409c71e..00e5a300ddb9 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3523,8 +3523,6 @@ void tcp_send_delayed_ack(struct sock *sk)
int ato = icsk->icsk_ack.ato;
unsigned long timeout;
 
-   tcp_ca_event(sk, CA_EVENT_DELAYED_ACK);
-
if (ato > TCP_DELACK_MIN) {
const struct tcp_sock *tp = tcp_sk(sk);
int max_ato = HZ / 2;
@@ -3581,8 +3579,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
 
-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.
-- 
2.18.0.203.gfac676dfb9-goog
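
For readability, the resulting dctcp_cwnd_event() after this removal looks
roughly as follows (reconstructed from the hunks above, so a sketch rather
than a verbatim copy of the final file):

static void dctcp_cwnd_event(struct sock *sk, enum tcp_ca_event ev)
{
        switch (ev) {
        case CA_EVENT_ECN_IS_CE:
                dctcp_ce_state_0_to_1(sk);
                break;
        case CA_EVENT_ECN_NO_CE:
                dctcp_ce_state_1_to_0(sk);
                break;
        default:
                /* Don't care for the rest. */
                break;
        }
}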





Re: [PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP

2018-07-09 Thread Lawrence Brakmo
On 7/9/18, 12:32 PM, "Yuchung Cheng"  wrote:

On Sat, Jul 7, 2018 at 7:07 AM, Neal Cardwell  wrote:
> On Sat, Jul 7, 2018 at 7:15 AM David Miller  wrote:
>>
    >> From: Lawrence Brakmo 
>> Date: Tue, 3 Jul 2018 09:26:13 -0700
>>
>> > We have observed high tail latencies when using DCTCP for RPCs as
>> > compared to using Cubic. For example, in one setup there are 2 hosts
>> > sending to a 3rd one, with each sender having 3 flows (1 stream,
>> > 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
>> > table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>> >
>> >              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>> >  1MB RPCs    2.6ms      5.5ms        43ms       208ms
>> > 10KB RPCs    1.1ms      1.3ms        53ms       212ms
>>  ...
>> > v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in
>> > tcp_event_ack_sent. Based on Neal Cardwell 
>> > feedback.
>> > Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead 
of modifying
>> > tcp_ack_send_check to insure an ACK when cwr is received.
>> > v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check.
>> >
>> > [PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent
>> > [PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet
>>
>> Neal and co., what are your thoughts right now about this patch series?
>>
>> Thank you.
>
> IMHO these patches are a definite improvement over what we have now.
>
> That said, in chatting with Yuchung before the July 4th break, I think
> Yuchung and I agreed that we would ideally like to see something like
> the following:
>
> (1) refactor the DCTCP code to check for pending delayed ACKs directly
> using existing state (inet_csk(sk)->icsk_ack.pending &
> ICSK_ACK_TIMER), and remove the ca->delayed_ack_reserved DCTCP field
> and the CA_EVENT_DELAYED_ACK and CA_EVENT_NON_DELAYED_ACK callbacks
> added for DCTCP (which Larry determined had at least one bug).

I agree that getting rid of the callbacks would be an improvement, but that is 
more about optimizing the code. This could be done after we fix the current 
bugs. My concern is that it may be more complicated than we think and the 
current bug would continue to exist. Yes, I realize that it has been there for 
a while; not because no one found it before, but because it was hard to 
pinpoint. 

> (2) fix the bug with the DCTCP call to tcp_send_ack(sk) causing
> delayed ACKs to be incorrectly dropped/forgotten (not yet addressed by
> this patch series)

Good idea, but as I mentioned earlier, I would rather fix the really bad bugs 
first and then deal with this one. As far as I can see from my testing of 
DC-TCP, I have not seen any bad consequences from this bug so far.

> (3) then with fixes (1) and (2) in place, re-run tests and see if we
> still need Larry's heuristic (in patch 2) to fire an ACK immediately
> if a receiver receives a CWR packet (I suspect this is still very
> useful, but I think Yuchung is reluctant to add this complexity unless
> we have verified it's still needed after (1) and (2))

I fail to understand how (1) and (2) have anything to do with ACKing 
immediately when we receive a CWR packet. It has nothing to do with a current 
delayed ACK; it has to do with the cwnd closing to 1 when a TCP ECE-marked 
packet is received at the end of an RPC and the current TCP delayed ACK logic 
choosing to delay the ACK. The issue happens right after the receiver has sent 
its reply to the RPC, so at that stage there are no active delayed ACKs (the 
first patch fixed the issue where DC-TCP thought there was an active delayed 
ACK). 
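
(For context on where the 40ms figure comes from, assuming a common HZ setting
such as HZ=1000: the kernel's delayed-ACK timer is bounded by these constants
from include/net/tcp.h:

        #define TCP_DELACK_MAX  ((unsigned)(HZ/5))   /* maximal time to delay before sending an ACK */
        #define TCP_DELACK_MIN  ((unsigned)(HZ/25))  /* minimal time to delay before sending an ACK */

With the sender's cwnd at 1, the lone first packet of the next RPC can sit
behind a delayed ACK for roughly TCP_DELACK_MIN, i.e. about 40ms, which matches
the tail latency increase reported above.)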

>
> Our team may be able to help out with some proposed patches for (1) and 
(2).
>
> In any case, I would love to have Yuchung and Eric weigh in (perhaps
> Monday) before we merge this patch series.
Thanks Neal. Sorry for not raising these points sooner, before I took off
for the July 4 holidays. I was going to post the same comment - Larry: I
could provide draft patches if that helps.

Yuchung: go ahead and send me the drafts. But as I already mentioned, I would 
like to fix the bad bug first and then make it pretty.

>
> Thanks,
> neal
   



Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
On 7/3/18, 6:15 AM, "Neal Cardwell"  wrote:

On Mon, Jul 2, 2018 at 7:49 PM Yuchung Cheng  wrote:
>
> On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
> >
> > DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> > notifications to keep track if it needs to send an ACK for packets that
> > were received with a particular ECN state but whose ACK was delayed.
> >
> > Under some circumstances, for example when a delayed ACK is sent with a
> > data packet, DCTCP state was not being updated due to a lack of
> > notification that the previously delayed ACK was sent. As a result, it
> > would sometimes send a duplicate ACK when a new data packet arrived.
> >
> > This patch insures that DCTCP's state is correctly updated so it will
> > not send the duplicate ACK.
> Sorry to chime-in late here (lame excuse: IETF deadline)
>
> IIRC this issue would exist prior to 4.11 kernel. While it'd be good
> to fix that, it's not clear which patch introduced the regression
> between 4.11 and 4.16? I assume you tested Eric's most recent quickack
> fix.

I don't think Larry is saying there's a regression between 4.11 and
4.16. His recent "tcp: force cwnd at least 2 in tcp_cwnd_reduction"
patch here:

  
https://patchwork.ozlabs.org/patch/935233/

said that 4.11 was bad (high tail latency and lots of RTOs) and 4.16
was still bad but with different netstat counters (no RTOs but still
high tail latency):

"""
On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there.
"""

I suspect the RTOs disappeared but latencies remained too high because
between 4.11 and 4.16 we introduced:
  tcp: allow TLP in ECN CWR
  
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=b4f70c3d4ec32a2ff4c62e1e2da0da5f55fe12bd

So the RTOs probably disappeared because this commit turned them into
TLPs. But the latencies remained high because the fundamental bug
remained throughout 4.11 and 4.16 and today: the DCTCP use of
tcp_send_ack() with an old rcv_nxt caused delayed ACKs to be cancelled
when they should not have been.
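
For reference, the pre-fix CE-state transition that performs this rewind looks
roughly like the following (a reconstruction of net/ipv4/tcp_dctcp.c from that
period, so treat it as a sketch rather than a verbatim quote):

static void dctcp_ce_state_0_to_1(struct sock *sk)
{
        struct dctcp *ca = inet_csk_ca(sk);
        struct tcp_sock *tp = tcp_sk(sk);

        /* State has changed from CE=0 to CE=1 and delayed
         * ACK has not sent yet.
         */
        if (!ca->ce_state && ca->delayed_ack_reserved) {
                u32 tmp_rcv_nxt;

                /* Save current rcv_nxt. */
                tmp_rcv_nxt = tp->rcv_nxt;

                /* Generate the "historical" ACK with CE=0, i.e. for old data. */
                tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
                tp->rcv_nxt = ca->prior_rcv_nxt;

                tcp_send_ack(sk);  /* this path ends up clearing the delayed-ACK timer */

                /* Restore current rcv_nxt. */
                tp->rcv_nxt = tmp_rcv_nxt;
        }

        ca->prior_rcv_nxt = tp->rcv_nxt;
        ca->ce_state = 1;

        tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
}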

> In terms of the fix itself, it seems odd the tcp_send_ack() call in
> DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
> delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the
> "prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
> cancelled, because that prior-ACK really is a supplementary old ACK.

This patch is fixing an issue that's orthogonal to the one you are
talking about. Using the taxonomy from our team's internal discussion
yesterday, the issue you mention where the DCTCP "prior" ACK is
cancelling delayed ACKs is issue (4); the issue that this particular
"tcp: notify when a delayed ack is sent" patch from Larry fixes is
issue (3). It's a bit tricky because both issues appeared in Larry's
trace summary and packetdrill script to reproduce the issue.

neal

I was able to track down the patch that introduced the problem:

commit 3759824da87b30ce7a35b4873b62b0ba38905ef5
Author: Yuchung Cheng 
Date:   Wed Jul 1 14:11:15 2015 -0700

tcp: PRR uses CRB mode by default and SS mode conditionally

I tested a kernel which reverted the relevant change (see diff below) and the 
high tail latencies of more than 40ms disappeared. However, the 10KB high 
percentile latencies are 4ms vs. less than 200us with my patches. It looks like 
the patch above ended up reducing the cwnd to 1 in the scenarios that were 
triggering the high tail latencies. That is, it increased the likelihood of 
triggering actual bugs in the network stack code that my patch set fixes.


diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..50fabb07d739 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2468,13 +2468,14 @@ void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
u64 dividend = (u64)tp->snd_ssthresh * tp->prr_delivered +
   tp->prior_cwnd - 1;
sndcnt = div_u64(dividend, tp->prior_cwnd) - tp->prr_out;
-   } else if ((flag & FLAG_RETRANS_DATA_ACKED)

[PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
notifications to keep track if it needs to send an ACK for packets that
were received with a particular ECN state but whose ACK was delayed.

Under some circumstances, for example when a delayed ACK is sent with a
data packet, DCTCP state was not being updated due to a lack of
notification that the previously delayed ACK was sent. As a result, it
would sometimes send a duplicate ACK when a new data packet arrived.

This patch ensures that DCTCP's state is correctly updated so it will
not send the duplicate ACK.

Improved based on comments from Neal Cardwell .

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8f6129160dd..acefb64e8280 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
@@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
 
-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.
-- 
2.17.1



[PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet arrives

2018-07-03 Thread Lawrence Brakmo
We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed even though it
is CWR marked, adding up to 40ms to the RPC latency.

This patch ensures that arriving CWR-marked data packets will be acked
immediately.

Modified based on comments by Neal Cardwell 

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2c5d70bc294e..23c2a43de8a4 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -246,8 +246,15 @@ static void tcp_ecn_queue_cwr(struct tcp_sock *tp)
 
 static void tcp_ecn_accept_cwr(struct tcp_sock *tp, const struct sk_buff *skb)
 {
-   if (tcp_hdr(skb)->cwr)
+   if (tcp_hdr(skb)->cwr) {
tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
+
+   /* If the sender is telling us it has entered CWR, then its
+* cwnd may be very low (even just 1 packet), so we should ACK
+* immediately.
+*/
+   tcp_enter_quickack_mode((struct sock *)tp, 2);
+   }
 }
 
 static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
-- 
2.17.1



[PATCH net-next v3 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Lawrence Brakmo
We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
 1MB RPCs     2.6ms      5.5ms        43ms       208ms
10KB RPCs     1.1ms      1.3ms        53ms       212ms

Looking at tcpdump traces showed that there are two causes for the
latency.  

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
 the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
 pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs, not updating DCTCP
state when an ACK that was initially delayed has been sent with a
data packet.

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.

With the patches the latencies for DCTCP now look like:

              dctcp 99%  dctcp 99.9%
 1MB RPCs     5.8ms      6.9ms
10KB RPCs     146us      203us

Note that while the 1MB RPCs tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by these
patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
  congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
  and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A to send the next packet
   Host A receives an ACK for everything it has sent (i.e. Host B
  did receive 1st packet of request)
   Host A sends more packets…

v2: Removed call to tcp_ca_event from tcp_send_ack since I added one in
tcp_event_ack_sent. Based on Neal Cardwell 
feedback.
Modified tcp_ecn_check_ce (and renamed it tcp_ecn_check) instead of modifying
tcp_ack_send_check to ensure an ACK when CWR is received.
v3: Handling cwr in tcp_ecn_accept_cwr instead of in tcp_ecn_check.

[PATCH net-next v3 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next v3 2/2] tcp: ack immediately when a cwr packet

 net/ipv4/tcp_input.c  | 9 ++++++++-
 net/ipv4/tcp_output.c | 4 ++--
 2 files changed, 10 insertions(+), 3 deletions(-)




Re: [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-03 Thread Lawrence Brakmo
On 7/2/18, 4:50 PM, "Yuchung Cheng"  wrote:

On Mon, Jul 2, 2018 at 2:39 PM, Lawrence Brakmo  wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch insures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
Sorry to chime-in late here (lame excuse: IETF deadline)

IIRC this issue would exist prior to 4.11 kernel. While it'd be good
to fix that, it's not clear which patch introduced the regression
between 4.11 and 4.16? I assume you tested Eric's most recent quickack
fix.

In terms of the fix itself, it seems odd the tcp_send_ack() call in
DCTCP generates NON_DELAYED_ACK event to toggle DCTCP's
delayed_ack_reserved bit. Shouldn't the fix to have DCTCP send the
"prior" ACK w/o cancelling delayed-ACK and mis-recording that it's
cancelled, because that prior-ACK really is a supplementary old ACK.

But it's still unclear how this bug introduces the regression 4.11 - 4.16
   
Feedback is always appreciated! This issue is also present in 4.11 (that is 
where I discovered it). I think the bug was introduced much earlier.

Yes, I tested with Eric's quickack fix; it did not fix either of the two issues 
that are fixed with this patch set.

As I mentioned earlier, the bug was introduced before 4.11. I am not sure I 
understand your comments. Yes, at some level it would make sense to change the 
delayed_ack_reserved bit directly, but we would still need to do it whenever we 
send the ACK, so I do not think it can be helped. Please clarify if I 
misunderstood your comment.

>
> Improved based on comments from Neal Cardwell .
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_output.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..acefb64e8280 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock 
*sk, unsigned int pkts)
> __sock_put(sk);
> }
> tcp_dec_quickack_mode(sk, pkts);
> +   if (inet_csk_ack_scheduled(sk))
> +   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }
>
> @@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
> if (sk->sk_state == TCP_CLOSE)
> return;
>
> -   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> -
> /* We are not putting this on the write queue, so
>  * tcp_transmit_skb() will set the ownership to this
>  * sock.
> --
> 2.17.1
>




Re: [PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-03 Thread Lawrence Brakmo
On 7/2/18, 5:52 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Mon, Jul 2, 2018 at 5:39 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>  1MB RPCs    2.6ms      5.5ms        43ms       208ms
> 10KB RPCs    1.1ms      1.3ms        53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch insures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>  1MB RPCs    5.8ms      6.9ms
> 10KB RPCs    146us      203us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these
patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion 
(CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A send more packets…
>
> [PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet
>
>  net/ipv4/tcp_input.c  | 16 +++-
>  net/ipv4/tcp_output.c |  4 ++--
>  2 files changed, 13 insertions(+), 7 deletions(-)

Thanks, Larry. Just for context, can you please let us know whether
your tests included zero, one, or both of Eric's recent commits
(listed below) that tuned the number of ACKs after ECN events? (Or
maybe the tests were literally using a net-next kernel?) Just wanted
to get a better handle on any possible interactions there.

Thanks!

neal

Yes, my test kernel includes both patches listed below. BTW, I will send a new 
patch where I move the call to tcp_incr_quickack from tcp_ecn_check_ce to 
tcp_ecn_accept_cwr.

---

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
commit 522040ea5fdd1c33bbf75e1d7c7c0422b96a94ef
Author: Eric Dumazet 
Date:   Mon May 21 15:08:57 2018 -0700

tcp: do not aggressively quick ack after ECN events

ECN signals currently forces TCP to enter quickack mode for
up to 16 (TCP_MAX_QUICKACKS) following incoming packets.

We believe this is not needed, and only sending one immediate ack
for the current packet should be enough.

This should reduce the extra load noticed in DCTCP environments,
after congestion events.

This is part 2 of our effort to reduce pure ACK packets.

Signed-off-by: Eric Dumazet 
Acked-by: Soheil Hassas Yeganeh 
Acked-by: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
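
The functional core of that change, roughly (reconstructed for context from
the commit message above, not quoted from the commit itself), is in
__tcp_ecn_check_ce():

        if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
                /* Better not delay acks, sender can have a very low cwnd */
-               tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
+               tcp_enter_quickack_mode(sk, 1);
                tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
        }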


https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id

[PATCH net-next v2 0/2] tcp: fix high tail latencies in DCTCP

2018-07-02 Thread Lawrence Brakmo
We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
 1MB RPCs     2.6ms      5.5ms        43ms       208ms
10KB RPCs     1.1ms      1.3ms        53ms       212ms

Looking at tcpdump traces showed that there are two causes for the
latency.  

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
 the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
 pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs, not updating DCTCP
state when an ACK that was initially delayed has been sent with a
data packet.

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.

With the patches the latencies for DCTCP now look like:

              dctcp 99%  dctcp 99.9%
 1MB RPCs     5.8ms      6.9ms
10KB RPCs     146us      203us

Note that while the 1MB RPCs tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by these
patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
  congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
  and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A to send the next packet
   Host A receives an ACK for everything it has sent (i.e. Host B
  did receive 1st packet of request)
   Host A sends more packets…

[PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet

 net/ipv4/tcp_input.c  | 16 +++++++++++-----
 net/ipv4/tcp_output.c |  4 ++--
 2 files changed, 13 insertions(+), 7 deletions(-)




[PATCH net-next v2 2/2] tcp: ack immediately when a cwr packet arrives

2018-07-02 Thread Lawrence Brakmo
We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed even though it
is CWR marked, adding up to 40ms to the RPC latency.

This patch ensures that arriving CWR-marked data packets will be acked
immediately.

Modified based on comments by Neal Cardwell 

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..6fd1f2378f6c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -254,10 +254,16 @@ static void tcp_ecn_withdraw_cwr(struct tcp_sock *tp)
tp->ecn_flags &= ~TCP_ECN_DEMAND_CWR;
 }
 
-static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void __tcp_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);
 
+   /* If the sender is telling us it has entered CWR, then its cwnd may be
+* very low (even just 1 packet), so we should ACK immediately.
+*/
+   if (tcp_hdr(skb)->cwr)
+   tcp_enter_quickack_mode(sk, 2);
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Funny extension: if ECT is not set on a segment,
@@ -286,10 +292,10 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
}
 }
 
-static void tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
+static void tcp_ecn_check(struct sock *sk, const struct sk_buff *skb)
 {
if (tcp_sk(sk)->ecn_flags & TCP_ECN_OK)
-   __tcp_ecn_check_ce(sk, skb);
+   __tcp_ecn_check(sk, skb);
 }
 
 static void tcp_ecn_rcv_synack(struct tcp_sock *tp, const struct tcphdr *th)
@@ -715,7 +721,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
}
icsk->icsk_ack.lrcvtime = now;
 
-   tcp_ecn_check_ce(sk, skb);
+   tcp_ecn_check(sk, skb);
 
if (skb->len >= 128)
tcp_grow_window(sk, skb);
@@ -4439,7 +4445,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
u32 seq, end_seq;
bool fragstolen;
 
-   tcp_ecn_check_ce(sk, skb);
+   tcp_ecn_check(sk, skb);
 
if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFODROP);
-- 
2.17.1



[PATCH net-next v2 1/2] tcp: notify when a delayed ack is sent

2018-07-02 Thread Lawrence Brakmo
DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
notifications to keep track if it needs to send an ACK for packets that
were received with a particular ECN state but whose ACK was delayed.

Under some circumstances, for example when a delayed ACK is sent with a
data packet, DCTCP state was not being updated due to a lack of
notification that the previously delayed ACK was sent. As a result, it
would sometimes send a duplicate ACK when a new data packet arrived.

This patch ensures that DCTCP's state is correctly updated so it will
not send the duplicate ACK.

Improved based on comments from Neal Cardwell .

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_output.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8f6129160dd..acefb64e8280 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
@@ -3567,8 +3569,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;
 
-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.
-- 
2.17.1



Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-07-02 Thread Lawrence Brakmo
On 7/2/18, 7:57 AM, "Neal Cardwell"  wrote:

On Sat, Jun 30, 2018 at 9:47 PM Lawrence Brakmo  wrote:
> I see two issues, one is that entering quickack mode as you
> mentioned does not insure that it will still be on when the CWR
> arrives. The second issue is that the problem occurs right after the
> receiver sends a small reply which results in entering pingpong mode
> right before the sender starts the new request by sending just one
> packet (i.e. forces delayed ack).
>
> I compiled and tested your patch. Both 99 and 99.9 percentile
> latencies are around 40ms. Looking at the packet traces shows that
> some CWR marked packets are not being ack immediately (delayed by
> 40ms).

Thanks, Larry! So your tests provide nice, specific evidence that it
is good to force an immediate ACK when a receiver receives a packet
with CWR marked. Given that, I am wondering what the simplest way is
to achieve that goal.

What if, rather than plumbing a new specific signal into
__tcp_ack_snd_check(), we use the existing general quick-ack
mechanism, where various parts of the TCP stack (like
__tcp_ecn_check_ce())  are already using the quick-ack mechanism to
"remotely" signal to __tcp_ack_snd_check() that they want an immediate
ACK.

For example, would it work to do something like:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c53ae5fc834a5..8168d1938b376 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -262,6 +262,12 @@ static void __tcp_ecn_check_ce(struct sock *sk, const struct sk_buff *skb)
 {
struct tcp_sock *tp = tcp_sk(sk);

+   /* If the sender is telling us it has entered CWR, then its cwnd may be
+* very low (even just 1 packet), so we should ACK immediately.
+*/
+   if (tcp_hdr(skb)->cwr)
+   tcp_enter_quickack_mode(sk, 2);
+
switch (TCP_SKB_CB(skb)->ip_dsfield & INET_ECN_MASK) {
case INET_ECN_NOT_ECT:
/* Insure that GCN will not continue to mark packets. */

And then since that broadens the mission of this function beyond
checking just the ECT/CE bits, I supposed we could rename the
__tcp_ecn_check_ce() and tcp_ecn_check_ce() functions to
__tcp_ecn_check() and tcp_ecn_check(), or something like that.

Would that work for this particular issue?

Neal

Thanks Neal, it does work and is cleaner than what I was doing. I will submit a 
revised patch set. 




Re: [PATCH net-next 1/2] tcp: notify when a delayed ack is sent

2018-07-02 Thread Lawrence Brakmo
On 7/2/18, 8:18 AM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
> notifications to keep track if it needs to send an ACK for packets that
> were received with a particular ECN state but whose ACK was delayed.
>
> Under some circumstances, for example when a delayed ACK is sent with a
> data packet, DCTCP state was not being updated due to a lack of
> notification that the previously delayed ACK was sent. As a result, it
> would sometimes send a duplicate ACK when a new data packet arrived.
>
> This patch insures that DCTCP's state is correctly updated so it will
> not send the duplicate ACK.
>
> Signed-off-by: Lawrence Brakmo 
> ---
>  net/ipv4/tcp_output.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index f8f6129160dd..41f6ad7a21e4 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock 
*sk, unsigned int pkts)
> __sock_put(sk);
> }
> tcp_dec_quickack_mode(sk, pkts);
> +   if (inet_csk_ack_scheduled(sk))
> +   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
> inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
>  }

Thanks for this fix! Seems like this would work, but if I am reading
this correctly then it seems like this would cause a duplicate call to
tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) when we are sending a pure
ACK (delayed or non-delayed):

(1) once from tcp_send_ack() before we send the ACK:

tcp_send_ack(struct sock *sk)
 ->tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

(2) then again from tcp_event_ack_sent() after we have sent the ACK:

tcp_event_ack_sent()
->   if (inet_csk_ack_scheduled(sk))
 tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);

What if we remove the original CA_EVENT_NON_DELAYED_ACK call and just
replace it with your new one? (not compiled, not tested):

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 3889dcd4868d4..bddb49617d9be 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -184,6 +184,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }

@@ -3836,8 +3838,6 @@ void tcp_send_ack(struct sock *sk)
if (sk->sk_state == TCP_CLOSE)
return;

-   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
-
/* We are not putting this on the write queue, so
 * tcp_transmit_skb() will set the ownership to this
 * sock.

Aside from lower CPU overhead, one nice benefit of that is that we
then only call tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK) in one
place, which might be a little easier to reason about.

Does that work?

neal

Thanks Neal, good catch!  I will resubmit the patch.




Re: [PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-30 Thread Lawrence Brakmo
On 6/30/18, 5:26 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We have observed high tail latencies when using DCTCP for RPCs as
> compared to using Cubic. For example, in one setup there are 2 hosts
> sending to a 3rd one, with each sender having 3 flows (1 stream,
> 1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
> table shows the 99% and 99.9% latencies for both Cubic and dctcp:
>
>              Cubic 99%  Cubic 99.9%  dctcp 99%  dctcp 99.9%
>   1MB RPCs    2.6ms      5.5ms        43ms       208ms
>  10KB RPCs    1.1ms      1.3ms        53ms       212ms
>
> Looking at tcpdump traces showed that there are two causes for the
> latency.
>
>   1) RTOs caused by the receiver sending a dup ACK and not ACKing
>  the last (and only) packet sent.
>   2) Delaying ACKs when the sender has a cwnd of 1, so everything
>  pauses for the duration of the delayed ACK.
>
> The first patch fixes the cause of the dup ACKs, not updating DCTCP
> state when an ACK that was initially delayed has been sent with a
> data packet.
>
> The second patch insures that an ACK is sent immediately when a
> CWR marked packet arrives.
>
> With the patches the latencies for DCTCP now look like:
>
>              dctcp 99%  dctcp 99.9%
>   1MB RPCs    4.8ms      6.5ms
>  10KB RPCs    143us      184us
>
> Note that while the 1MB RPCs tail latencies are higher than Cubic's,
> the 10KB latencies are much smaller than Cubic's. These patches fix
> issues on the receiver, but tcpdump traces indicate there is an
> opportunity to also fix an issue at the sender that adds about 3ms
> to the tail latencies.
>
> The following trace shows the issue that triggers an RTO (fixed by these
patches):
>
>Host A sends the last packets of the request
>Host B receives them, and the last packet is marked with congestion 
(CE)
>Host B sends ACKs for packets not marked with congestion
>Host B sends data packet with reply and ACK for packet marked with
>   congestion (TCP flag ECE)
>Host A receives ACKs with no ECE flag
>Host A receives data packet with ACK for the last packet of request
>   and which has TCP ECE bit set
>Host A sends 1st data packet of the next request with TCP flag CWR
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>Host B sends a dup ACK that also has the TCP ECE flag
>Host A RTO timer fires!
>Host A to send the next packet
>Host A receives an ACK for everything it has sent (i.e. Host B
>   did receive 1st packet of request)
>Host A send more packets…
>
> [PATCH net-next 1/2] tcp: notify when a delayed ack is sent
> [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

Thanks, Larry!

I suspect there is a broader problem with "DCTCP-style precise ECE
ACKs" that this patch series does not address.

AFAICT the DCTCP "historical" ACKs for ECE precision, generated in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0(), violate the
assumptions of the pre-existing delayed ACK state machine. They
violate those assumptions by rewinding tp->rcv_nxt backwards and then
calling tcp_send_ack(sk). But the existing code path to send an ACK
assumes that any ACK transmission always sends an ACK that accounts
for all received data, so that after sending an ACK we can cancel the
delayed ACK timer. So it seems with DCTCP we can have buggy sequences
where the DCTCP historical ACK causes us to forget that we need to
schedule a delayed ACK (which will force the sender to RTO, as shown
in the trace):

tcp_event_data_recv()
 inet_csk_schedule_ack()  // remember that we need an ACK
 tcp_ecn_check_ce()
  tcp_ca_event() // with CA_EVENT_ECN_IS_CE or CA_EVENT_ECN_NO_CE
dctcp_cwnd_event()
  dctcp_ce_state_0_to_1() or dctcp_ce_state_1_to_0()
   if (... && ca->delayed_ack_reserved)
  tp->rcv_nxt = ca->prior_rcv_nxt;
  tcp_send_ack(sk); // send an ACK, but for old data!
tcp_transmit_skb()
  tcp_event_ack_sent()
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
// forget that we need a delayed ACK!

AFAICT the first patch in this series, to force an immediate ACK on
any packet with a CWR, papers over this issue of the forgotten delayed
ACK by forcing an immedi

Re: [PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-06-30 Thread Lawrence Brakmo
On 6/30/18, 11:23 AM, "Neal Cardwell"  wrote:

On Fri, Jun 29, 2018 at 9:48 PM Lawrence Brakmo  wrote:
>
> We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
> problem is triggered when the last packet of a request arrives CE
> marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
> to 1 (because there are no packets in flight). When the 1st packet of
> the next request arrives it was sometimes delayed adding up to 40ms to
> the latency.
>
> This patch insures that CWR makred data packets arriving will be acked
> immediately.
    >
> Signed-off-by: Lawrence Brakmo 
> ---

Thanks, Larry. Ensuring that CWR-marked data packets arriving will be
acked immediately sounds like a good goal to me.

I am wondering if we can have a simpler fix.

The dctcp_ce_state_0_to_1() code is setting the TCP_ECN_DEMAND_CWR
bit in ecn_flags, which disables the code in __tcp_ecn_check_ce() that
would have otherwise used the tcp_enter_quickack_mode() mechanism to
force an ACK:

__tcp_ecn_check_ce():
...
case INET_ECN_CE:
  if (tcp_ca_needs_ecn(sk))
tcp_ca_event(sk, CA_EVENT_ECN_IS_CE);
   // -> dctcp_ce_state_0_to_1()
   // ->  tp->ecn_flags |= TCP_ECN_DEMAND_CWR;

  if (!(tp->ecn_flags & TCP_ECN_DEMAND_CWR)) {
/* Better not delay acks, sender can have a very low cwnd */
tcp_enter_quickack_mode(sk, 1);
tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
  }
  tp->ecn_flags |= TCP_ECN_SEEN;
  break;

So it seems like the bug here may be that dctcp_ce_state_0_to_1()  is
setting the TCP_ECN_DEMAND_CWR  bit in ecn_flags, when really it
should let its caller, __tcp_ecn_check_ce() set TCP_ECN_DEMAND_CWR, in
which case the code would already properly force an immediate ACK.

So, what if we use this fix instead (not compiled, not tested):

diff --git a/net/ipv4/tcp_dctcp.c b/net/ipv4/tcp_dctcp.c
index 5f5e5936760e..4fecd2824edb 100644
--- a/net/ipv4/tcp_dctcp.c
+++ b/net/ipv4/tcp_dctcp.c
@@ -152,8 +152,6 @@ static void dctcp_ce_state_0_to_1(struct sock *sk)

ca->prior_rcv_nxt = tp->rcv_nxt;
ca->ce_state = 1;
-
-   tp->ecn_flags |= TCP_ECN_DEMAND_CWR;
 }

 static void dctcp_ce_state_1_to_0(struct sock *sk)

What do you think?

neal

I see two issues, one is that entering quickack mode as you mentioned does not 
insure that it will still be on when the CWR arrives. The second issue is that 
the problem occurs right after the receiver sends a small reply which results 
in entering pingpong mode right before the sender starts the new request by 
sending just one packet (i.e. forces delayed ack).

I compiled and tested your patch. Both 99 and 99.9 percentile latencies are 
around 40ms. Looking at the packet traces shows that some CWR-marked packets 
are not being acked immediately (delayed by 40ms).

Larry





[PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

2018-06-29 Thread Lawrence Brakmo
We observed high 99 and 99.9% latencies when doing RPCs with DCTCP. The
problem is triggered when the last packet of a request arrives CE
marked. The reply will carry the ECE mark causing TCP to shrink its cwnd
to 1 (because there are no packets in flight). When the 1st packet of
the next request arrives, the ACK was sometimes delayed, adding up to 40ms to
the latency.

This patch ensures that arriving CWR-marked data packets will be acked
immediately.

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..b024d36f0d56 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -98,6 +98,8 @@ int sysctl_tcp_max_orphans __read_mostly = NR_FILE;
 #define FLAG_UPDATE_TS_RECENT  0x4000 /* tcp_replace_ts_recent() */
#define FLAG_NO_CHALLENGE_ACK  0x8000 /* do not call tcp_send_challenge_ack() */
 #define FLAG_ACK_MAYBE_DELAYED 0x1 /* Likely a delayed ACK */
+#define FLAG_OFO_POSSIBLE  0x2 /* Possible OFO */
+#define FLAG_CWR   0x4 /* CWR in this ACK */
 
 #define FLAG_ACKED (FLAG_DATA_ACKED|FLAG_SYN_ACKED)
 #define FLAG_NOT_DUP   (FLAG_DATA|FLAG_WIN_UPDATE|FLAG_ACKED)
@@ -5087,7 +5089,7 @@ static inline void tcp_data_snd_check(struct sock *sk)
 /*
  * Check if sending an ack is needed.
  */
-static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
+static void __tcp_ack_snd_check(struct sock *sk, int flags)
 {
struct tcp_sock *tp = tcp_sk(sk);
unsigned long rtt, delay;
@@ -5102,13 +5104,16 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
(tp->rcv_nxt - tp->copied_seq < sk->sk_rcvlowat ||
 __tcp_select_window(sk) >= tp->rcv_wnd)) ||
/* We ACK each frame or... */
-   tcp_in_quickack_mode(sk)) {
+   tcp_in_quickack_mode(sk) ||
+   /* We received CWR */
+   flags & FLAG_CWR) {
 send_now:
tcp_send_ack(sk);
return;
}
 
-   if (!ofo_possible || RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
+   if (!(flags & FLAG_OFO_POSSIBLE) ||
+   RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
tcp_send_delayed_ack(sk);
return;
}
@@ -5134,13 +5139,13 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
  HRTIMER_MODE_REL_PINNED_SOFT);
 }
 
-static inline void tcp_ack_snd_check(struct sock *sk)
+static inline void tcp_ack_snd_check(struct sock *sk, int flags)
 {
if (!inet_csk_ack_scheduled(sk)) {
/* We sent a data segment already. */
return;
}
-   __tcp_ack_snd_check(sk, 1);
+   __tcp_ack_snd_check(sk, flags | FLAG_OFO_POSSIBLE);
 }
 
 /*
@@ -5489,6 +5494,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
goto discard;
}
} else {
+   int flags;
int eaten = 0;
bool fragstolen = false;
 
@@ -5525,7 +5531,9 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
goto no_ack;
}
 
-   __tcp_ack_snd_check(sk, 0);
+   flags = (tcp_flag_word(th) & TCP_FLAG_CWR) ?
+   FLAG_CWR : 0;
+   __tcp_ack_snd_check(sk, flags);
 no_ack:
if (eaten)
kfree_skb_partial(skb, fragstolen);
@@ -5561,7 +5569,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
tcp_data_queue(sk, skb);
 
tcp_data_snd_check(sk);
-   tcp_ack_snd_check(sk);
+   tcp_ack_snd_check(sk, (tcp_flag_word(th) & TCP_FLAG_CWR) ?
+ FLAG_CWR : 0);
return;
 
 csum_error:
@@ -6150,7 +6159,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
/* tcp_data could move socket to TIME-WAIT */
if (sk->sk_state != TCP_CLOSE) {
tcp_data_snd_check(sk);
-   tcp_ack_snd_check(sk);
+   tcp_ack_snd_check(sk, 0);
}
 
if (!queued) {
-- 
2.17.1



[PATCH net-next 1/2] tcp: notify when a delayed ack is sent

2018-06-29 Thread Lawrence Brakmo
DCTCP depends on the CA_EVENT_NON_DELAYED_ACK and CA_EVENT_DELAYED_ACK
notifications to keep track if it needs to send an ACK for packets that
were received with a particular ECN state but whose ACK was delayed.

Under some circumstances, for example when a delayed ACK is sent with a
data packet, DCTCP state was not being updated due to a lack of
notification that the previously delayed ACK was sent. As a result, it
would sometimes send a duplicate ACK when a new data packet arrived.

This patch ensures that DCTCP's state is correctly updated so it will
not send the duplicate ACK.
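
For reference, DCTCP consumes these events in its cwnd_event hook. The sketch
below is an abbreviated paraphrase of net/ipv4/tcp_dctcp.c from this era
(struct dctcp and the other event cases are omitted), just to show what the
notification drives:

static void dctcp_update_ack_reserved(struct sock *sk, enum tcp_ca_event ev)
{
	struct dctcp *ca = inet_csk_ca(sk);

	switch (ev) {
	case CA_EVENT_DELAYED_ACK:
		/* An ACK for already-received data is being held back. */
		if (!ca->delayed_ack_reserved)
			ca->delayed_ack_reserved = 1;
		break;
	case CA_EVENT_NON_DELAYED_ACK:
		/* The held-back ACK went out (now also when it is piggybacked
		 * on a data packet, thanks to this patch).
		 */
		if (ca->delayed_ack_reserved)
			ca->delayed_ack_reserved = 0;
		break;
	default:
		break;
	}
}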

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f8f6129160dd..41f6ad7a21e4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -172,6 +172,8 @@ static inline void tcp_event_ack_sent(struct sock *sk, 
unsigned int pkts)
__sock_put(sk);
}
tcp_dec_quickack_mode(sk, pkts);
+   if (inet_csk_ack_scheduled(sk))
+   tcp_ca_event(sk, CA_EVENT_NON_DELAYED_ACK);
inet_csk_clear_xmit_timer(sk, ICSK_TIME_DACK);
 }
 
-- 
2.17.1



[PATCH net-next 0/2] tcp: fix high tail latencies in DCTCP

2018-06-29 Thread Lawrence Brakmo
We have observed high tail latencies when using DCTCP for RPCs as
compared to using Cubic. For example, in one setup there are 2 hosts
sending to a 3rd one, with each sender having 3 flows (1 stream,
1 1MB back-to-back RPCs and 1 10KB back-to-back RPCs). The following
table shows the 99% and 99.9% latencies for both Cubic and dctcp:

             Cubic 99%   Cubic 99.9%   dctcp 99%   dctcp 99.9%
  1MB RPCs     2.6ms       5.5ms         43ms        208ms
 10KB RPCs     1.1ms       1.3ms         53ms        212ms

Looking at tcpdump traces showed that there are two causes for the
latency.  

  1) RTOs caused by the receiver sending a dup ACK and not ACKing
 the last (and only) packet sent.
  2) Delaying ACKs when the sender has a cwnd of 1, so everything
 pauses for the duration of the delayed ACK.

The first patch fixes the cause of the dup ACKs, not updating DCTCP
state when an ACK that was initially delayed has been sent with a
data packet.

The second patch ensures that an ACK is sent immediately when a
CWR marked packet arrives.

With the patches the latencies for DCTCP now look like:

             dctcp 99%   dctcp 99.9%
  1MB RPCs     4.8ms       6.5ms
 10KB RPCs     143us       184us

Note that while the 1MB RPCs tail latencies are higher than Cubic's,
the 10KB latencies are much smaller than Cubic's. These patches fix
issues on the receiver, but tcpdump traces indicate there is an
opportunity to also fix an issue at the sender that adds about 3ms
to the tail latencies.

The following trace shows the issue that triggers an RTO (fixed by these
patches):

   Host A sends the last packets of the request
   Host B receives them, and the last packet is marked with congestion (CE)
   Host B sends ACKs for packets not marked with congestion
   Host B sends data packet with reply and ACK for packet marked with
  congestion (TCP flag ECE)
   Host A receives ACKs with no ECE flag
   Host A receives data packet with ACK for the last packet of request
  and which has TCP ECE bit set
   Host A sends 1st data packet of the next request with TCP flag CWR
   Host B receives the packet (as seen in tcpdump at B), no CE flag
   Host B sends a dup ACK that also has the TCP ECE flag
   Host A RTO timer fires!
   Host A to send the next packet
   Host A receives an ACK for everything it has sent (i.e. Host B
  did receive 1st packet of request)
   Host A send more packets…

[PATCH net-next 1/2] tcp: notify when a delayed ack is sent
[PATCH net-next 2/2] tcp: ack immediately when a cwr packet arrives

 net/ipv4/tcp_input.c  | 25 +
 net/ipv4/tcp_output.c |  2 ++
 2 files changed, 19 insertions(+), 8 deletions(-)




Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-29 Thread Lawrence Brakmo
On 6/28/18, 1:48 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo  wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>Host A sends the last packets of the request
>
>Host B receives them, and the last packet is marked with congestion 
(CE)
>
>Host B sends ACKs for packets not marked with congestion
>
>Host B sends data packet with reply and ACK for packet marked with 
congestion (TCP flag ECE)
>
>Host A receives ACKs with no ECE flag
>
>Host A receives data packet with ACK for the last packet of request 
and has TCP ECE bit set
>
>Host A sends 1st data packet of the next request with TCP flag CWR
>
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>Host B sends a dup ACK that also has the TCP ECE flag
>
>Host A RTO timer fires!
>
>Host A to send the next packet
>
>Host A receives an ACK for everything it has sent (i.e. Host B did 
receive 1st packet of request)
>
>Host A send more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.

neal

You were right Neal, one of the bugs is related to this and is caused by a lack 
of state update to DCTCP. DCTCP is first informed that the ACK was delayed but 
it is not updated when the ACK is sent with a data packet.

I am working on a patch to fix this which hopefully should be out today. Thanks 
everyone for the great feedback! 



Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-28 Thread Lawrence Brakmo


On 6/28/18, 1:48 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo  wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>Host A sends the last packets of the request
>
>Host B receives them, and the last packet is marked with congestion 
(CE)
>
>Host B sends ACKs for packets not marked with congestion
>
>Host B sends data packet with reply and ACK for packet marked with 
congestion (TCP flag ECE)
>
>Host A receives ACKs with no ECE flag
>
>Host A receives data packet with ACK for the last packet of request 
and has TCP ECE bit set
>
>Host A sends 1st data packet of the next request with TCP flag CWR
>
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>Host B sends a dup ACK that also has the TCP ECE flag
>
>Host A RTO timer fires!
>
>Host A to send the next packet
>
>Host A receives an ACK for everything it has sent (i.e. Host B did 
receive 1st packet of request)
>
>Host A send more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.

neal

Here is a packetdrill script that reproduces the problem:

// Repro bug that does not ack data, not even with delayed-ack

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 
0.100 > SE. 0:0(0) ack 1 
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect0] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect0] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect0] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
+0  > [ect0] E. 4:4(0) ack 4501   // dup ack sent

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // Long RTO
+0 > [ect0] . 4:4(0) ack 6501 // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257




Re: [PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-28 Thread Lawrence Brakmo



On 6/28/18, 1:48 PM, "netdev-ow...@vger.kernel.org on behalf of Neal Cardwell" 
 wrote:

On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo  wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>Host A sends the last packets of the request
>
>Host B receives them, and the last packet is marked with congestion 
(CE)
>
>Host B sends ACKs for packets not marked with congestion
>
>Host B sends data packet with reply and ACK for packet marked with 
congestion (TCP flag ECE)
>
>Host A receives ACKs with no ECE flag
>
>Host A receives data packet with ACK for the last packet of request 
and has TCP ECE bit set
>
>Host A sends 1st data packet of the next request with TCP flag CWR
>
>Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>Host B sends a dup ACK that also has the TCP ECE flag
>
>Host A RTO timer fires!
>
>Host A to send the next packet
>
>Host A receives an ACK for everything it has sent (i.e. Host B did 
receive 1st packet of request)
>
>Host A send more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.

neal

Just to reiterate that there are 2 distinct issues:

1) When the last data packet of a request is received with the CE flag, it 
causes a dup ack with ECE flag to be sent after the first data packet of the 
next request is received.

2) When one of the last (but not the last) data packets of a request is 
received with the CE flag, it causes a delayed ack to be sent after the first 
data packet of the next request is received. This delayed ack does acknowledge 
the data in the 1st packet of the reply.



[PATCH net-next v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-26 Thread Lawrence Brakmo
When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 1MB back-to-back RPCs and 1 10KB
back-to-back RPCs). The following table shows the 99% and 99.9%
latencies for both Cubic and dctcp

             Cubic 99%   Cubic 99.9%   dctcp 99%   dctcp 99.9%
  1MB RPCs     3.5ms       6.0ms         43ms        208ms
 10KB RPCs     1.0ms       2.5ms         53ms        212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

 dctcp 99%   dctcp 99.9%
 1MB RPCs  3.8ms 4.4ms
10KB RPCs  168us 211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only open question is whether it is safe to always use 2, or whether it is
better to use min(2, snd_ssthresh) (which could still trigger the problem).

v2: fixed compiler warning in max function arguments

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..282bd85322b0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@ void tcp_cwnd_reduction(struct sock *sk, int 
newly_acked_sacked, int flag)
}
/* Force a fast retransmit upon entering fast recovery */
sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-   tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+   tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
 }
 
 static inline void tcp_end_cwnd_reduction(struct sock *sk)
-- 
2.17.1



[PATCH net-next] tcp: force cwnd at least 2 in tcp_cwnd_reduction

2018-06-26 Thread Lawrence Brakmo
When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 1MB back-to-back RPCs and 1 10KB
back-to-back RPCs). The following table shows the 99% and 99.9%
latencies for both Cubic and dctcp

             Cubic 99%   Cubic 99.9%   dctcp 99%   dctcp 99.9%
  1MB RPCs     3.5ms       6.0ms         43ms        208ms
 10KB RPCs     1.0ms       2.5ms         53ms        212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

 dctcp 99%   dctcp 99.9%
 1MB RPCs  3.8ms 4.4ms
10KB RPCs  168us 211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only open question is whether it is safe to always use 2, or whether it is
better to use min(2, snd_ssthresh) (which could still trigger the problem).

Signed-off-by: Lawrence Brakmo 
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..a9255c424761 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@ void tcp_cwnd_reduction(struct sock *sk, int 
newly_acked_sacked, int flag)
}
/* Force a fast retransmit upon entering fast recovery */
sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-   tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+   tp->snd_cwnd = max(tcp_packets_in_flight(tp) + sndcnt, 2);
 }
 
 static inline void tcp_end_cwnd_reduction(struct sock *sk)
-- 
2.17.1



[PATCH bpf-next] bpf: clean up from test_tcpbpf_kern.c

2018-01-26 Thread Lawrence Brakmo
Removed commented lines from test_tcpbpf_kern.c

Fixes: d6d4f60c3a09 bpf: add selftest for tcpbpf
Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c 
b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
index 66bf715..57119ad 100644
--- a/tools/testing/selftests/bpf/test_tcpbpf_kern.c
+++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
@@ -79,9 +79,6 @@ int bpf_testcb(struct bpf_sock_ops *skops)
}
break;
case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
-   /* Set callback */
-// good_call_rv = bpf_sock_ops_cb_flags_set(skops,
-//  BPF_SOCK_OPS_STATE_CB_FLAG);
skops->sk_txhash = 0x12345f;
v = 0xff;
rv = bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v,
-- 
2.9.5



[PATCH bpf-next v10 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-25 Thread Lawrence Brakmo
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
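
As a rough illustration, a sock_ops program using the new level could look
like the sketch below (the SOL_IPV6/IPV6_TCLASS values are spelled out only to
keep the snippet self-contained; a real program would take them from the uapi
headers, and bpf_helpers.h is the selftests wrapper header):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

#ifndef SOL_IPV6
#define SOL_IPV6	41
#endif
#ifndef IPV6_TCLASS
#define IPV6_TCLASS	67
#endif

SEC("sockops")
int bpf_set_tclass(struct bpf_sock_ops *skops)
{
	int v;

	/* Read the current traffic class; this only works on AF_INET6
	 * sockets, the helper returns an error otherwise.
	 */
	if (!bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))) {
		/* ... and rewrite it, e.g. to mark the flow low priority. */
		v = 0x22;
		bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v));
	}
	return 1;
}

char _license[] SEC("license") = "GPL";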

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index a858ebc..fe2c793 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4690,7 +4731,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4701,6 +4743,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v10 00/12] bpf: More sock_ops callbacks

2018-01-25 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per  Yuchung's comment
Added BPF enum for TCP states as per  Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.  
v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" are not writable
based on comments from Eric Dumazet and Alexei Starovoitov.
Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
comments from Daniel Borkmann.
Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
on comments from Daniel Borkmann.
v8: Add commit message 00/12
Add Acked-by as appropriate
v9: Moved the bug fix to the front of the patchset
Changed RETRANS_CB so it is always called (before it was only called if
the retransmit succeeded). It is now called with an extra argument, the
return value of tcp_transmit_skb (0 => success). Based on comments
from Yuchung Cheng.
Added support for reading 2 new fields, sacked_out and lost_out, based on
comments from Yuchung Cheng.
v10: Moved the callback flags from include/uapi/linux/tcp.h to
 include/uapi/linux/bpf.h
 Cleaned up the test in selftest. Added a timeout so it always completes,
 even if the client is not communicating with the server. Made it faster
 by removing the sleeps. Made sure it works even when called back-to-back
 20 times.

Consists of the following patches:
[PATCH bpf-next v10 01/12] bpf: Only reply field should be writeable
[PATCH bpf-next v10 02/12] bpf: Make SOCK_OPS_GET_TCP size
[PATCH bpf-next v10 03/12] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v10 04/12] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v10 05/12] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v10 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v10 07/12] bpf: Add sock_ops RTO callback
[PATCH bpf-next v10 08/12] bpf: Add support for reading sk_state and
[PATCH bpf-next v10 09/12] bpf: Add sock_ops R/W access to tclass
[PATCH bpf-next v10 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v10 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v10 12/12] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  42 -
 include/uapi/linux/bpf.h   |  84 -
 net/core/filter.c  | 290 
---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   6 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  86 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  51 ++
 tools/testing/selftests/bpf/tcp_server.py  |  83 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 118 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 17 files changed, 925 insertions(+), 39 deletions(-)



[PATCH bpf-next v10 04/12] bpf: Add write access to tcp_sock and sock fields

2018-01-25 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index dbb6d2f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4491,6 +4491,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+  

[PATCH bpf-next v10 02/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-25 Thread Lawrence Brakmo
Make the SOCK_OPS_GET_TCP helper macro size independent (before it only worked
with 4-byte fields).

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bf9bb75..62e7874 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,9 +4470,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4484,16 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v10 03/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-25 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.x. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 62e7874..dbb6d2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4469,11 +4469,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,18 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v10 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-25 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states that prepends BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
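
A minimal (hypothetical) consumer, assuming BPF_SOCK_OPS_STATE_CB_FLAG was
already enabled for the connection with bpf_sock_ops_cb_flags_set() (patch
06/12):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_track_state(struct bpf_sock_ops *skops)
{
	if (skops->op != BPF_SOCK_OPS_STATE_CB)
		return 1;

	/* args[0] = old state, args[1] = new state, as BPF_TCP_* values */
	if (skops->args[1] == BPF_TCP_CLOSE) {
		/* Connection is going away; a real program might flush
		 * per-flow stats to a map here.
		 */
	}
	return 1;
}

char _license[] SEC("license") = "GPL";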

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 29 -
 net/ipv4/tcp.c   | 24 
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 31c93a0b..db6bdc3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1006,7 +1006,8 @@ struct bpf_sock_ops {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
@@ -1054,6 +1055,32 @@ enum {
 * Arg3: return value of
 *   tcp_transmit_skb (0 => success)
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v10 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-25 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structure is not
increased in size.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);

[PATCH bpf-next v10 12/12] bpf: add selftest for tcpbpf

2018-01-25 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Run with command: ./test_tcpbpf_user
Adding the flag "-d" will show why it did not pass.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 tools/include/uapi/linux/bpf.h |  86 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  51 ++
 tools/testing/selftests/bpf/tcp_server.py  |  83 
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 118 +++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 +
 8 files changed, 480 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..db6bdc3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,8 +978,39 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
+* supported cb flags
+*/
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
@@ -1003,6 +1044,43 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+   

[PATCH bpf-next v10 01/12] bpf: Only reply field should be writeable

2018-01-25 Thread Lawrence Brakmo
Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.
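
For illustration, the reply field is how a sock_ops program hands a value back
to the kernel; a minimal sketch answering BPF_SOCK_OPS_RWND_INIT (not taken
from this patch) looks like:

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_rwnd(struct bpf_sock_ops *skops)
{
	int rv = -1;	/* -1 means "no opinion", kernel uses its default */

	if (skops->op == BPF_SOCK_OPS_RWND_INIT)
		rv = 40;	/* suggested initial receive window, in MSS */

	skops->reply = rv;	/* the only writable field after this patch */
	return 1;
}

char _license[] SEC("license") = "GPL";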

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Yuchung Cheng <ych...@google.com>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..bf9bb75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5



[PATCH bpf-next v10 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-25 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
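
A hedged sketch of a consumer that counts only the retransmissions whose
tcp_transmit_skb() call failed (the map and flag setup are assumptions for
illustration, not part of this patch):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") retrans_fail_cnt = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 1,
};

SEC("sockops")
int bpf_count_retrans(struct bpf_sock_ops *skops)
{
	__u32 key = 0;
	__u64 *cnt;

	/* Assumes BPF_SOCK_OPS_RETRANS_CB_FLAG was enabled for this flow.
	 * args[0] = seq of 1st byte, args[1] = # segments,
	 * args[2] = return value of tcp_transmit_skb() (0 => success).
	 */
	if (skops->op == BPF_SOCK_OPS_RETRANS_CB && skops->args[2] != 0) {
		cnt = bpf_map_lookup_elem(&retrans_fail_cnt, &key);
		if (cnt)
			__sync_fetch_and_add(cnt, 1);
	}
	return 1;
}

char _license[] SEC("license") = "GPL";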

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 9 -
 net/ipv4/tcp_output.c| 4 
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 46520ea..31c93a0b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1005,7 +1005,8 @@ struct bpf_sock_ops {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
@@ -1047,6 +1048,12 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit_skb (0 => success)
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..e9f985e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2905,6 +2905,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}
 
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs, err);
+
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
-- 
2.9.5



[PATCH bpf-next v10 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-25 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.
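
In practice a program sets the flags once the connection is established and
can look at the return value to detect an older kernel. A minimal sketch
(the individual flag names are only introduced by later patches in this
series):

#include <uapi/linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int bpf_enable_cbs(struct bpf_sock_ops *skops)
{
	int rv = -1;

	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
	    skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) {
		/* 0: all bits accepted; >0: the bits this kernel does not
		 * support; -EINVAL: not a full TCP socket.
		 */
		rv = bpf_sock_ops_cb_flags_set(skops,
					       BPF_SOCK_OPS_RTO_CB_FLAG |
					       BPF_SOCK_OPS_RETRANS_CB_FLAG);
	}
	skops->reply = rv;
	return 1;
}

char _license[] SEC("license") = "GPL";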

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 17 -
 net/core/filter.c| 34 ++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..aa12840 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,8 +978,14 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 /* List of known BPF sock_ops operators.
  * New entries can only be added at the end
  */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_flags_set,
+   .gpl_only   = false,
+   .ret_type   = RET_INTEGER,
+   .arg1_type  = ARG_PTR_TO_CTX,
+   .arg2_type  = ARG_ANYTHING,
+};
+
 static const struct bpf_func_proto *
 bpf_base_func_proto(enum bpf_func_id func_id)
 {
@@

[PATCH bpf-next v10 08/12] bpf: Add support for reading sk_state and more

2018-01-25 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state,same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  lost_out
  sacked_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked(__u64)

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  22 
 net/core/filter.c| 143 +++
 2 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c8cecf9..46520ea 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,28 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..a858ebc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+  

[PATCH bpf-next v10 07/12] bpf: Add sock_ops RTO callback

2018-01-25 Thread Lawrence Brakmo
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 2 arguments: icsk_retransmits and whether the
RTO has expired.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 8 +++-
 net/ipv4/tcp_timer.c | 7 +++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index aa12840..c8cecf9 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -982,7 +982,8 @@ struct bpf_sock_ops {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
@@ -1019,6 +1020,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



[PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-24 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states, prepending BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. A set of compile-time checks is added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
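
For illustration only (not part of this patch), a program could enable and
consume the new callback roughly as below; the program name is made up, and
the sketch assumes the selftests' bpf_helpers.h plus the cb_flags and
field-reading support from the earlier patches in this series.

  #include <linux/bpf.h>
  #include <linux/tcp.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int state_watch(struct bpf_sock_ops *skops)
  {
          switch (skops->op) {
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
          case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_STATE_CB_FLAG);
                  break;
          case BPF_SOCK_OPS_STATE_CB:
                  /* Arg1 = old state, Arg2 = new state (BPF_TCP_* values) */
                  if (skops->args[1] == BPF_TCP_CLOSE)
                          skops->reply = skops->total_retrans;
                  break;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";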

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 26 ++
 include/uapi/linux/tcp.h |  3 ++-
 net/ipv4/tcp.c   | 24 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 59fa771..ff7758d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1047,6 +1047,32 @@ enum {
 * Arg3: return value of
 *   tcp_transmit_skb (0 => success)
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index ec03a2b..cf0b861 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -271,7 +271,8 @@ struct tcp_diag_md5sig {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable

2018-01-24 Thread Lawrence Brakmo
Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.
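
For reference, a minimal sketch of what the verifier accepts and rejects
after this change (illustrative only; the program name is made up):

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int reply_only(struct bpf_sock_ops *skops)
  {
          /* still allowed: reply (== replylong[0]) */
          skops->reply = 42;

          /* now rejected at load time:
           *     skops->op = 0;
           *     skops->replylong[1] = 0;
           */
          return 1;
  }

  char _license[] SEC("license") = "GPL";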

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Yuchung Cheng <ych...@google.com>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..bf9bb75 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5



[PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback

2018-01-24 Thread Lawrence Brakmo
Adds an optional call to the sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
The BPF program is passed three arguments: icsk_retransmits, icsk_rto,
and whether the RTO has expired.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_timer.c | 7 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7573f5b..2a8c40a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index d1df2f6..129032ca 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -269,7 +269,8 @@ struct tcp_diag_md5sig {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



[PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-24 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket (see the usage sketch after the list below).

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
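
A usage sketch (illustrative only; the callback flag names come from later
patches in this series, and the program name is made up):

  #include <linux/bpf.h>
  #include <linux/tcp.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int enable_cbs(struct bpf_sock_ops *skops)
  {
          int unsupported;

          if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB) {
                  unsupported = bpf_sock_ops_cb_flags_set(skops,
                                  BPF_SOCK_OPS_RTO_CB_FLAG |
                                  BPF_SOCK_OPS_RETRANS_CB_FLAG);
                  /* 0: all requested bits set; > 0: bits this kernel does
                   * not support; -EINVAL: not a full TCP socket
                   */
                  if (unsupported > 0)
                          skops->reply = unsupported;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";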

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 12 +++-
 include/uapi/linux/tcp.h |  5 +
 net/core/filter.c| 34 ++
 4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..7573f5b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,6 +978,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..d1df2f6 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -268,4 +268,9 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_fla

[PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and more

2018-01-24 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state, same as sk->sk_state
  rtt_min, same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  lost_out
  sacked_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked(__u64)
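
For illustration only (not part of this patch): the map and program names
below are made up, and the sketch assumes the selftests' bpf_helpers.h. Note
that the two __u64 fields must be read with 8-byte loads, as enforced by
sock_ops_is_valid_access() below.

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") estab_info = {
          .type = BPF_MAP_TYPE_ARRAY,
          .key_size = sizeof(__u32),
          .value_size = sizeof(__u64),
          .max_entries = 1,
  };

  SEC("sockops")
  int read_fields(struct bpf_sock_ops *skops)
  {
          __u32 key = 0;
          __u64 rcvd;

          if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB ||
              skops->op == BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB) {
                  /* 4-byte fields are read with 4-byte loads ... */
                  __u32 mss = skops->mss_cache;

                  /* ... while bytes_received/bytes_acked need 8-byte loads */
                  rcvd = skops->bytes_received;
                  bpf_map_update_elem(&estab_info, &key, &rcvd, BPF_ANY);
                  skops->reply = mss;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";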

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  22 
 net/core/filter.c| 143 +++
 2 files changed, 154 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a8c40a..5f08420 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,28 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..a858ebc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+  

[PATCH bpf-next v9 00/12] bpf: More sock_ops callbacks

2018-01-24 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per Yuchung's comment
Added BPF enum for TCP states as per Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.  
v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" are not writable,
based on comments from Eric Dumazet and Alexei Starovoitov.
Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
comments from Daniel Borkmann.
Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
on comments from Daniel Borkmann.
v8: Add commit message 00/12
Add Acked-by as appropriate
v9: Moved the bug fix to the front of the patchset
Changed RETRANS_CB so it is always called (before it was only called if
the retransmit succeeded). It is now called with an extra argument, the
return value of tcp_transmit_skb (0 => success). Based on comments
from Yuchung Cheng.
Added support for reading 2 new fields, sacked_out and lost_out, based on
comments from Yuchung Cheng.

Consists of the following patches:
[PATCH bpf-next v9 01/12] bpf: Only reply field should be writeable
[PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v9 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v9 07/12] bpf: Add sock_ops RTO callback
[PATCH bpf-next v9 08/12] bpf: Add support for reading sk_state and
[PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass
[PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v9 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  42 -
 include/uapi/linux/bpf.h   |  76 +++-
 include/uapi/linux/tcp.h   |   8 +
 net/core/filter.c  | 290 
---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   6 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  78 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 ++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 18 files changed, 927 insertions(+), 39 deletions(-)



[PATCH bpf-next v9 04/12] bpf: Add write access to tcp_sock and sock fields

2018-01-24 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index dbb6d2f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4491,6 +4491,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other registers. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+  

[PATCH bpf-next v9 03/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-24 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.g. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 62e7874..dbb6d2f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4469,11 +4469,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,18 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v9 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-24 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.
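
For illustration only (not part of this patch): on the kernel side the new
wrappers pass the values, and on the BPF side they show up in args[]. The
program below is hypothetical and simply echoes the first two arguments back
through reply.

  /* kernel side (wrapper added by this patch):
   *     tcp_call_bpf_2arg(sk, op, arg1, arg2);
   */

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int echo_args(struct bpf_sock_ops *skops)
  {
          /* Arg1..Arg4 arrive as args[0]..args[3]; the same union is
           * reused for reply/replylong on the way back
           */
          skops->reply = skops->args[0] + skops->args[1];
          return 1;
  }

  char _license[] SEC("license") = "GPL";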

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);

[PATCH bpf-next v9 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-24 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Three arguments are used; one for the sequence number,
another for the number of segments retransmitted, and the last one for
the return value of tcp_transmit_skb (0 => success).
Does not include syn-ack retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
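
For illustration only (not part of this patch); the program name is made up
and the sketch assumes the selftests' bpf_helpers.h:

  #include <linux/bpf.h>
  #include <linux/tcp.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int retrans_watch(struct bpf_sock_ops *skops)
  {
          switch (skops->op) {
          case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
          case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
                  bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RETRANS_CB_FLAG);
                  break;
          case BPF_SOCK_OPS_RETRANS_CB:
                  /* args[0] = seq of 1st byte, args[1] = # segments,
                   * args[2] = tcp_transmit_skb() return value (0 => success)
                   */
                  if (skops->args[2])
                          skops->reply = skops->args[2];
                  break;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";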

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 6 ++
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_output.c| 4 
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 5f08420..59fa771 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1041,6 +1041,12 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit_skb (0 => success)
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 129032ca..ec03a2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..e9f985e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2905,6 +2905,10 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
err = tcp_transmit_skb(sk, skb, 1, GFP_ATOMIC);
}
 
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs, err);
+
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
-- 
2.9.5



[PATCH bpf-next v9 12/12] bpf: add selftest for tcpbpf

2018-01-24 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 tools/include/uapi/linux/bpf.h |  78 ++-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +++
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 
 8 files changed, 482 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..ff7758d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,6 +978,29 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 lost_out;
+   __u32 sacked_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
@@ -1003,6 +1036,43 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+* Arg3: return value of
+*   tcp_transmit

[PATCH bpf-next v9 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-24 Thread Lawrence Brakmo
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
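
A fuller sketch (illustrative only; the program name and the chosen tclass
value are made up, and the constants come from <netinet/in.h> as in the
selftest):

  #include <linux/bpf.h>
  #include <netinet/in.h>
  #include "bpf_helpers.h"

  SEC("sockops")
  int set_class(struct bpf_sock_ops *skops)
  {
          int v = 0x10;   /* example traffic class */

          if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
              skops->family == AF_INET6) {
                  bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v));

                  /* a direct write to sk_txhash is now also allowed */
                  skops->sk_txhash = 0x12345678;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";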

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index a858ebc..fe2c793 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4690,7 +4731,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4701,6 +4743,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v9 02/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-24 Thread Lawrence Brakmo
Make SOCK_OPS_GET_TCP helper macro size independent (before it only worked
with 4-byte fields).

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index bf9bb75..62e7874 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,9 +4470,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4484,16 +4485,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



Re: [PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and more

2018-01-24 Thread Lawrence Brakmo
On 1/24/18, 12:07 PM, "netdev-ow...@vger.kernel.org on behalf of Yuchung Cheng" 
<netdev-ow...@vger.kernel.org on behalf of ych...@google.com> wrote:

On Tue, Jan 23, 2018 at 11:57 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> Add support for reading many more tcp_sock fields
>
>   state,same as sk->sk_state
>   rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out
Might as well get ca_state, sacked_out and lost_out to estimate CA
states and the packets in flight?

Will try to add in updated patchset. If not, I will add as a new patch.

>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   sk_txhash
    >   bytes_received (__u64)
>   bytes_acked(__u64)
>
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/uapi/linux/bpf.h |  20 +++
>  net/core/filter.c| 135 
+++
>  2 files changed, 144 insertions(+), 11 deletions(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a8c40a..6998032 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -979,6 +979,26 @@ struct bpf_sock_ops {
> __u32 snd_cwnd;
> __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
> __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h 
*/
> +   __u32 state;
> +   __u32 rtt_min;
> +   __u32 snd_ssthresh;
> +   __u32 rcv_nxt;
> +   __u32 snd_nxt;
> +   __u32 snd_una;
> +   __u32 mss_cache;
> +   __u32 ecn_flags;
> +   __u32 rate_delivered;
> +   __u32 rate_interval_us;
> +   __u32 packets_out;
> +   __u32 retrans_out;
> +   __u32 total_retrans;
> +   __u32 segs_in;
> +   __u32 data_segs_in;
> +   __u32 segs_out;
> +   __u32 data_segs_out;
> +   __u32 sk_txhash;
> +   __u64 bytes_received;
> +   __u64 bytes_acked;
>  };
>
>  /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 6936d19..ffe9b60 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>
> -static bool __is_valid_sock_ops_access(int off, int size)
> +static bool sock_ops_is_valid_access(int off, int size,
> +enum bpf_access_type type,
> +struct bpf_insn_access_aux *info)
>  {
> +   const int size_default = sizeof(__u32);
> +
> if (off < 0 || off >= sizeof(struct bpf_sock_ops))
> return false;
> +
> /* The verifier guarantees that size > 0. */
> if (off % size != 0)
> return false;
> -   if (size != sizeof(__u32))
> -   return false;
> -
> -   return true;
> -}
>
> -static bool sock_ops_is_valid_access(int off, int size,
> -enum bpf_access_type type,
> -struct bpf_insn_access_aux *info)
> -{
> if (type == BPF_WRITE) {
> switch (off) {
> case offsetof(struct bpf_sock_ops, reply):
> +   if (size != size_default)
> +   return false;
> break;
> default:
> return false;
> }
> +   } else {
> +   switch (off) {
> +   case bpf_ctx_range_till(struct bpf_sock_ops, 
bytes_received,
> +   bytes_acked):
> +   if (size != sizeof(__u64))
> +   return false;
> +   break;
> +   default:
> +   if (size != size_default)
> +   return false;
> +   break;
> +   }
> }
>
> -   return __is_valid_sock_ops_

Re: [PATCH bpf-next v8 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-24 Thread Lawrence Brakmo
On 1/24/18, 12:02 PM, "netdev-ow...@vger.kernel.org on behalf of Yuchung Cheng" 
<netdev-ow...@vger.kernel.org on behalf of ych...@google.com> wrote:

On Tue, Jan 23, 2018 at 11:58 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> Adds support for calling sock_ops BPF program when there is a
> retransmission. Two arguments are used; one for the sequence number and
> other for the number of segments retransmitted. Does not include syn-ack
> retransmissions.
>
> New op: BPF_SOCK_OPS_RETRANS_CB.
    >
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/uapi/linux/bpf.h | 4 
>  include/uapi/linux/tcp.h | 3 ++-
>  net/ipv4/tcp_output.c| 3 +++
>  3 files changed, 9 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 6998032..eb26cdb 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -1039,6 +1039,10 @@ enum {
>  * Arg2: value of icsk_rto
>  * Arg3: whether RTO has expired
>  */
> +   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is 
retransmitted.
> +* Arg1: sequence number of 1st 
byte
> +* Arg2: # segments
> +*/
>  };
>
>  #define TCP_BPF_IW 1001/* Set TCP initial congestion 
window */
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index 129032ca..ec03a2b 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
>
>  /* Definitions for bpf_sock_ops_cb_flags */
>  #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
> -#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all 
currently
> +#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
> +#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all 
currently
>  * supported cb 
flags
>  */
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d12f7f7..f7d34f01 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2908,6 +2908,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct 
sk_buff *skb, int segs)
> if (likely(!err)) {
> TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
> trace_tcp_retransmit_skb(sk, skb);
> +   if (BPF_SOCK_OPS_TEST_FLAG(tp, 
BPF_SOCK_OPS_RETRANS_CB_FLAG))
> +   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
> + TCP_SKB_CB(skb)->seq, segs);
Any reason to skip failed retransmission? I would think that's helpful as 
well.

Good point, thanks Yuchung. I will do a new patch shortly that will also pass 
the err value to the BPF program.

> } else if (err != -EBUSY) {
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
> }
> --
> 2.9.5
>




[PATCH bpf-next v8 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-23 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 12 +++-
 include/uapi/linux/tcp.h |  5 +
 net/core/filter.c| 34 ++
 4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..7573f5b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,6 +978,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..d1df2f6 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -268,4 +268,9 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_fla

[PATCH bpf-next v8 01/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-23 Thread Lawrence Brakmo
Make SOCK_OPS_GET_TCP helper macro size independent (before it only worked
with 4-byte fields).

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..31c10a4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4471,9 +4471,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,16 +4486,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v8 04/12] bpf: Only reply field should be writeable

2018-01-23 Thread Lawrence Brakmo
Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 0cf170f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5



[PATCH bpf-next v8 12/12] bpf: add selftest for tcpbpf

2018-01-23 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that their
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 tools/include/uapi/linux/bpf.h |  74 +-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +++
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 
 8 files changed, 478 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..4d4701a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,6 +978,27 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
@@ -1003,6 +1034,41 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+   

[PATCH bpf-next v8 03/12] bpf: Add write access to tcp_sock and sock fields

2018-01-23 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.
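
The SET macro itself appears in the diff below; as a rough sketch based on
this description, the combined wrapper presumably just dispatches on the
access type:

#define SOCK_OPS_GET_OR_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ, TYPE)    \
    do {                                                              \
        if (TYPE == BPF_WRITE)                                        \
            SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);            \
        else                                                          \
            SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ);            \
    } while (0)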

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
Acked-by: Alexei Starovoitov <a...@kernel.org>
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index 22fac64..0cf170f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4492,6 +4492,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other registers. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+  

[PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and more

2018-01-23 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state, same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked (__u64)
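
As a usage sketch (illustrative only, not part of the patch), a sock_ops
program can now read these straight from the context; the two __u64 fields
must be read with full 8-byte loads:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int read_stats(struct bpf_sock_ops *skops)
{
    __u64 acked = skops->bytes_acked;    /* 8-byte load */

    /* 1 == TCP_ESTABLISHED; a later patch in this series adds BPF_TCP_* names */
    if (skops->state == 1 && skops->total_retrans > 0)
        skops->reply = (__u32)(acked >> 10);    /* e.g. KB acked so far */

    return 1;
}

char _license[] SEC("license") = "GPL";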

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  20 +++
 net/core/filter.c| 135 +++
 2 files changed, 144 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a8c40a..6998032 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,26 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..ffe9b60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn

[PATCH bpf-next v8 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-23 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states, prepending BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
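
A sketch of the receiving side (illustrative; BPF_SOCK_OPS_STATE_CB_FLAG must
have been requested earlier via bpf_sock_ops_cb_flags_set()):

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int on_state(struct bpf_sock_ops *skops)
{
    if (skops->op == BPF_SOCK_OPS_STATE_CB) {
        __u32 old_state = skops->args[0];
        __u32 new_state = skops->args[1];

        if (new_state == BPF_TCP_CLOSE)
            skops->reply = old_state;    /* e.g. record which state we closed from */
    }
    return 1;
}

char _license[] SEC("license") = "GPL";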

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 26 ++
 include/uapi/linux/tcp.h |  3 ++-
 net/ipv4/tcp.c   | 24 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index eb26cdb..4d4701a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1043,6 +1043,32 @@ enum {
 * Arg1: sequence number of 1st byte
 * Arg2: # segments
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index ec03a2b..cf0b861 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -271,7 +271,8 @@ struct tcp_diag_md5sig {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v8 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-23 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Two arguments are used; one for the sequence number and
other for the number of segments retransmitted. Does not include syn-ack
retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
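
A sketch of the receiving side (illustrative; BPF_SOCK_OPS_RETRANS_CB_FLAG
must have been set earlier with bpf_sock_ops_cb_flags_set()):

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int on_retrans(struct bpf_sock_ops *skops)
{
    if (skops->op == BPF_SOCK_OPS_RETRANS_CB) {
        /* args[0] = sequence number of the 1st retransmitted byte */
        skops->reply = skops->args[1];    /* args[1] = # of segments */
    }
    return 1;
}

char _license[] SEC("license") = "GPL";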

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 4 
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_output.c| 3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6998032..eb26cdb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1039,6 +1039,10 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 129032ca..ec03a2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..f7d34f01 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2908,6 +2908,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs);
} else if (err != -EBUSY) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
-- 
2.9.5



[PATCH bpf-next v8 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-23 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structure is not increased
in size.
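
On the BPF side the values show up in the new args[] member of the context.
A sketch (names illustrative); unused slots remain zero because the kernel
clears the context up to the temp field:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int use_args(struct bpf_sock_ops *skops)
{
    /* For ops invoked via tcp_call_bpf_2arg()/_3arg(), the per-op
     * arguments are in args[0..3].
     */
    skops->reply = skops->args[0] + skops->args[1];
    return 1;
}

char _license[] SEC("license") = "GPL";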

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);

[PATCH bpf-next v8 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-23 Thread Lawrence Brakmo
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
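
The write direction works the same way (a sketch with an illustrative value);
only SOL_IPV6/IPV6_TCLASS is accepted here, and only on AF_INET6 sockets:

  int v = 0x10;    /* illustrative tclass value */

  bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))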

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ffe9b60..763b2f3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4682,7 +4723,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4693,6 +4735,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v8 00/12] bpf: More sock_ops callbacks

2018-01-23 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per Yuchung's comment
Added BPF enum for TCP states as per Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64-bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.
v7: Enforced that fields "op" and "replylong[1] .. replylong[3]" are not
writable, based on comments from Eric Dumazet and Alexei Starovoitov.
Filled 32 bit hole in bpf_sock_ops struct with sk_txhash based on
comments from Daniel Borkmann.
Removed unused functions (tcp_call_bpf_1arg, tcp_call_bpf_4arg) based
on comments from Daniel Borkmann.
v8: Add commit message 00/12
Add Acked-by as appropriate

Signed-off-by: Lawrence Brakmo <bra...@fb.com>

Consists of the following patches:
[PATCH bpf-next v8 01/12] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf-next v8 02/12] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v8 03/12] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v8 04/12] bpf: Only reply field should be writeable
[PATCH bpf-next v8 05/12] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v8 06/12] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v8 07/12] bpf: Add sock_ops RTO callback
[PATCH bpf-next v8 08/12] bpf: Add support for reading sk_state and
[PATCH bpf-next v8 09/12] bpf: Add sock_ops R/W access to tclass
[PATCH bpf-next v8 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v8 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v8 12/12] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  42 -
 include/uapi/linux/bpf.h   |  72 +++-
 include/uapi/linux/tcp.h   |   8 +
 net/core/filter.c  | 282 
---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   5 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  74 +++-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 ++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 18 files changed, 910 insertions(+), 39 deletions(-)



[PATCH bpf-next v8 07/12] bpf: Add sock_ops RTO callback

2018-01-23 Thread Lawrence Brakmo
Adds an optional call to the sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
The BPF program is passed 3 arguments: icsk_retransmits, icsk_rto, and
whether the RTO has expired.
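
A sketch (in selftest style, not from this patch) of a program that opts in
per connection and then consumes the callback:

#include <linux/bpf.h>
#include <linux/tcp.h>
#include "bpf_helpers.h"

SEC("sockops")
int on_rto(struct bpf_sock_ops *skops)
{
    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        /* opt in to the RTO callback for this connection */
        bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_RTO_CB_FLAG);
        break;
    case BPF_SOCK_OPS_RTO_CB:
        /* args[0] = icsk_retransmits, args[1] = icsk_rto,
         * args[2] = whether the write timeout has expired
         */
        skops->reply = skops->args[0];
        break;
    }
    return 1;
}

char _license[] SEC("license") = "GPL";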

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_timer.c | 7 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7573f5b..2a8c40a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index d1df2f6..129032ca 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -269,7 +269,8 @@ struct tcp_diag_md5sig {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



[PATCH bpf-next v8 02/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-23 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.
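
For instance, a later patch in this series uses the same macro for a
struct sock field:

    case offsetof(struct bpf_sock_ops, sk_txhash):
        SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
        break;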

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.g. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 31c10a4..22fac64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,11 +4470,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4486,18 +4486,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v7 12/12] bpf: add selftest for tcpbpf callbacks

2018-01-23 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that tcp_sock fields can be accessed, and that their
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 tools/include/uapi/linux/bpf.h |  74 +-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +++
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 
 8 files changed, 478 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..4d4701a 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,6 +978,27 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
@@ -1003,6 +1034,41 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+}

[PATCH bpf-next v7 07/12] bpf: Add sock_ops RTO callback

2018-01-23 Thread Lawrence Brakmo
Adds an optional call to the sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_cb_flags.
The BPF program is passed 3 arguments: icsk_retransmits, icsk_rto, and
whether the RTO has expired.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_timer.c | 7 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7573f5b..2a8c40a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index d1df2f6..129032ca 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -269,7 +269,8 @@ struct tcp_diag_md5sig {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



[PATCH bpf-next v7 09/12] bpf: Add sock_ops R/W access to tclass

2018-01-23 Thread Lawrence Brakmo
Adds direct write access to sk_txhash and access to tclass for ipv6
flows through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index ffe9b60..763b2f3 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3232,6 +3232,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3241,7 +3264,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3307,6 +3331,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3871,6 +3911,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   case offsetof(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4682,7 +4723,8 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
case offsetof(struct bpf_sock_ops, sk_txhash):
-   SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
break;
 
case offsetof(struct bpf_sock_ops, bytes_received):
@@ -4693,6 +4735,7 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v7 02/12] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-23 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.g. 0 or 0x), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 31c10a4..22fac64 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4470,11 +4470,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4486,18 +4486,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v7 01/12] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-23 Thread Lawrence Brakmo
Make SOCK_OPS_GET_TCP helper macro size independent (before it only
worked with 4-byte fields).
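
The load size now comes from the field itself rather than being hard-coded to
BPF_W; conceptually the mapping from field size to BPF load size works roughly
like this (a sketch, not the kernel's exact helper; assumes <linux/bpf.h> and
<linux/errno.h>):

/* Rough mapping from a C field size in bytes to a BPF load/store size */
static inline int bytes_to_bpf_size_sketch(int bytes)
{
    switch (bytes) {
    case 8: return BPF_DW;
    case 4: return BPF_W;
    case 2: return BPF_H;
    case 1: return BPF_B;
    default: return -EINVAL;    /* unsupported field size */
    }
}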

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 18da42a..31c10a4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4471,9 +4471,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4485,16 +4486,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v7 03/12] bpf: Add write access to tcp_sock and sock fields

2018-01-23 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite the dst_reg because it
contains the pointer to ctx, nor the src_reg since it contains the value
we want to store, we need an extra register to contain the address
of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5a1d26a..6092eaf 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2011,7 +2011,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index 22fac64..0cf170f 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4492,6 +4492,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other registers. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, sk),\
+  

[PATCH bpf-next v7 10/12] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-23 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Two arguments are used; one for the sequence number and
the other for the number of segments retransmitted. Does not include syn-ack
retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 4 
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_output.c| 3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 6998032..eb26cdb 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1039,6 +1039,10 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 129032ca..ec03a2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..f7d34f01 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2908,6 +2908,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs);
} else if (err != -EBUSY) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
-- 
2.9.5



[PATCH bpf-next v7 08/12] bpf: Add support for reading sk_state and more

2018-01-23 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state, same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  sk_txhash
  bytes_received (__u64)
  bytes_acked (__u64)

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  20 +++
 net/core/filter.c| 135 +++
 2 files changed, 144 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a8c40a..6998032 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,26 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u32 sk_txhash;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 6936d19..ffe9b60 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3855,33 +3855,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
case offsetof(struct bpf_sock_ops, reply):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4498,6 +4508,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn

[PATCH bpf-next v7 06/12] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-23 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received
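
A sketch (selftest style, names illustrative) of a program that requests
callbacks once the connection is established and checks the helper's return
value:

#include <linux/bpf.h>
#include <linux/tcp.h>
#include "bpf_helpers.h"

SEC("sockops")
int enable_cbs(struct bpf_sock_ops *skops)
{
    int rv;

    switch (skops->op) {
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        /* rv is 0 on success, the unsupported bits if the kernel is
         * older, or -EINVAL if this is not a full TCP socket.
         */
        rv = bpf_sock_ops_cb_flags_set(skops, BPF_SOCK_OPS_ALL_CB_FLAGS);
        skops->reply = rv;
        break;
    }
    return 1;
}

char _license[] SEC("license") = "GPL";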

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 12 +++-
 include/uapi/linux/tcp.h |  5 +
 net/core/filter.c| 34 ++
 4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..7573f5b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,6 +978,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..d1df2f6 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -268,4 +268,9 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index c356ec0..6936d19 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3328,6 +3328,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_flags_set,
+   .gpl_only   = false,
+   

[PATCH bpf-next v7 11/12] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-23 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states, prepending BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.
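
A sketch (not part of this patch) of a sock_ops program consuming the new
callback; it assumes the program has already enabled
BPF_SOCK_OPS_STATE_CB_FLAG via bpf_sock_ops_cb_flags_set():

	if (skops->op == BPF_SOCK_OPS_STATE_CB) {
		__u32 old_state = skops->args[0];
		__u32 new_state = skops->args[1];

		if (new_state == BPF_TCP_CLOSE) {
			/* e.g. emit final per-connection stats to a map */
		}
	}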

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 26 ++
 include/uapi/linux/tcp.h |  3 ++-
 net/ipv4/tcp.c   | 24 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index eb26cdb..4d4701a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1043,6 +1043,32 @@ enum {
 * Arg1: sequence number of 1st byte
 * Arg2: # segments
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index ec03a2b..cf0b861 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -271,7 +271,8 @@ struct tcp_diag_md5sig {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v7 05/12] bpf: Support passing args to sock_ops bpf function

2018-01-23 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.
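
As a sketch of the calling convention (the op and flag shown here are only
added by later patches in this series), a kernel call site passes the values
with one of the new helpers:

	if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
		tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
				  TCP_SKB_CB(skb)->seq, segs);

and the BPF program sees the same values in skops->args[0] and skops->args[1].
Because args[] shares the union with reply/replylong, a program should read
its arguments before writing its reply.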

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 40 +++-
 include/uapi/linux/bpf.h |  5 +++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 42 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6092eaf..093e967 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2006,7 +2006,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2019,6 +2019,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs * sizeof(*args));
 
	ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2027,18 +2029,46 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2049,7 +2079,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2058,7 +2088,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops->rebuild_header(sk);
tcp_init_metrics(sk);
-   tcp_call_bpf(sk, bpf_op);
+   tcp_call_bpf(sk, bpf_op, 0, NULL);
tcp_init_congestion_control(sk);
tcp_init_buffer_space(sk);
 }
diff --git a/net/ipv4/tcp_nv.c b/net/ipv4/tcp_nv.c
index 0b5a05b..ddbce73 100644
--- a/net/ipv4/tcp_nv.c
+++ b/net/ipv4/tcp_nv.c
@@ -146,7 +146,7 @@ static void tcpnv_init(struct sock *sk)
 * within a datacenter, where we have reasonable estimates of
 * RTTs
 */
-   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT);
+   base_rtt = tcp_call_bpf(sk, BPF_SOCK_OPS_BASE_RTT, 0, NULL);

[PATCH bpf-next v7 04/12] bpf: Only reply field should be writeable

2018-01-23 Thread Lawrence Brakmo
Currently, a sock_ops BPF program can write the op field and all the
reply fields (reply and replylong). This is a bug. The op field should
not have been writeable and there is currently no way to use replylong
field for indices >= 1. This patch enforces that only the reply field
(which equals replylong[0]) is writeable.
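
With this change, a sock_ops program can still return a value via reply, but
writes to the other union members are rejected at load time, e.g. (sketch):

	skops->reply = rv;		/* still allowed (replylong[0]) */
	skops->op = 0;			/* now rejected by the verifier  */
	skops->replylong[2] = rv;	/* now rejected by the verifier  */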

Fixes: 40304b2a1567 ("bpf: BPF support for sock_ops")
Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 0cf170f..c356ec0 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3845,8 +3845,7 @@ static bool sock_ops_is_valid_access(int off, int size,
 {
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case offsetof(struct bpf_sock_ops, reply):
break;
default:
return false;
-- 
2.9.5



Re: [PATCH bpf-next v6 04/11] bpf: Support passing args to sock_ops bpf function

2018-01-23 Thread Lawrence Brakmo
On 1/23/18, 5:11 PM, "Daniel Borkmann" <dan...@iogearbox.net> wrote:

On 01/20/2018 02:45 AM, Lawrence Brakmo wrote:
> Adds support for passing up to 4 arguments to sock_ops bpf functions. It
> reuses the reply union, so the bpf_sock_ops structures are not
> increased in size.
    > 
    > Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/linux/filter.h   |  1 +
>  include/net/tcp.h| 64 

>  include/uapi/linux/bpf.h |  5 ++--
>  net/ipv4/tcp.c   |  2 +-
>  net/ipv4/tcp_nv.c|  2 +-
>  net/ipv4/tcp_output.c|  2 +-
>  6 files changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/filter.h b/include/linux/filter.h
> index daa5a67..20384c4 100644
> --- a/include/linux/filter.h
> +++ b/include/linux/filter.h
> @@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
>   struct  sock *sk;
>   u32 op;
>   union {
> + u32 args[4];
>   u32 reply;
>   u32 replylong[4];
>   };
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 108d16a..8e9111f 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -2005,7 +2005,7 @@ void tcp_cleanup_ulp(struct sock *sk);
>   * program loaded).
>   */
>  #ifdef CONFIG_BPF
> -static inline int tcp_call_bpf(struct sock *sk, int op)
> +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 
*args)
>  {
>   struct bpf_sock_ops_kern sock_ops;
>   int ret;
> @@ -2018,6 +2018,8 @@ static inline int tcp_call_bpf(struct sock *sk, int 
op)
>  
>   sock_ops.sk = sk;
>   sock_ops.op = op;
> + if (nargs > 0)
> + memcpy(sock_ops.args, args, nargs*sizeof(u32));

Small nit given respin: nargs * sizeof(*args)

Thanks, will fix.

>   ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
>   if (ret == 0)
> @@ -2026,18 +2028,70 @@ static inline int tcp_call_bpf(struct sock *sk, 
int op)
>   ret = -1;
>   return ret;
>  }
> +
> +static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
> +{
> + return tcp_call_bpf(sk, op, 1, &arg);
> +}
> +
> +static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, 
u32 arg2)
> +{
> + u32 args[2] = {arg1, arg2};
> +
> + return tcp_call_bpf(sk, op, 2, args);
> +}
> +
> +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, 
u32 arg2,
> + u32 arg3)
> +{
> + u32 args[3] = {arg1, arg2, arg3};
> +
> + return tcp_call_bpf(sk, op, 3, args);
> +}
> +
> +static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, 
u32 arg2,
> + u32 arg3, u32 arg4)
> +{
> + u32 args[4] = {arg1, arg2, arg3, arg4};
> +
> + return tcp_call_bpf(sk, op, 4, args);
> +}
> +
>  #else
> -static inline int tcp_call_bpf(struct sock *sk, int op)
> +static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 
*args)
>  {
>   return -EPERM;
>  }
> +
> +static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
> +{
> + return -EPERM;
> +}
> +
> +static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, 
u32 arg2)
> +{
> + return -EPERM;
> +}
> +
> +static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, 
u32 arg2,
> + u32 arg3)

indent: arg3

OK

> +{
> + return -EPERM;
> +}
> +
> +static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, 
u32 arg2,
> + u32 arg3, u32 arg4)
> +{
> + return -EPERM;
> +}
> +
>  #endif

tcp_call_bpf_1arg() and tcp_call_bpf_4arg() unused for the time being?

Yes, I just thought I should add them for completeness. Should I remove them 
until
they are actually used?




Re: [PATCH bpf-next v6 07/11] bpf: Add support for reading sk_state and more

2018-01-23 Thread Lawrence Brakmo
On 1/23/18, 5:05 PM, "Daniel Borkmann" <dan...@iogearbox.net> wrote:

On 01/20/2018 02:45 AM, Lawrence Brakmo wrote:
> Add support for reading many more tcp_sock fields
> 
>   state,  same as sk->sk_state
>   rtt_min same as sk->rtt_min.s[0].v (current rtt_min)
>   snd_ssthresh
>   rcv_nxt
>   snd_nxt
>   snd_una
>   mss_cache
>   ecn_flags
>   rate_delivered
>   rate_interval_us
>   packets_out
>   retrans_out
>   total_retrans
>   segs_in
>   data_segs_in
>   segs_out
>   data_segs_out
>   bytes_received (__u64)
>   bytes_acked(__u64)
> 
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/uapi/linux/bpf.h |  19 +++
>  net/core/filter.c| 134 
++-
>  2 files changed, 140 insertions(+), 13 deletions(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2a8c40a..ff34f3c 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -979,6 +979,25 @@ struct bpf_sock_ops {
>   __u32 snd_cwnd;
>   __u32 srtt_us;  /* Averaged RTT << 3 in usecs */
>   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
> + __u32 state;
> + __u32 rtt_min;
> + __u32 snd_ssthresh;
> + __u32 rcv_nxt;
> + __u32 snd_nxt;
> + __u32 snd_una;
> + __u32 mss_cache;
> + __u32 ecn_flags;
> + __u32 rate_delivered;
> + __u32 rate_interval_us;
> + __u32 packets_out;
> + __u32 retrans_out;
> + __u32 total_retrans;
> + __u32 segs_in;
> + __u32 data_segs_in;
> + __u32 segs_out;
> + __u32 data_segs_out;

Btw, this will have a 4 bytes hole in here which the user can otherwise
address out of the prog. Could you add the sk_txhash from the next patch
in between here instead?

Good point. Will fix in new patch. Thanks Daniel.

> + __u64 bytes_received;
> + __u64 bytes_acked;
>  };
>  
>  /* List of known BPF sock_ops operators.
> diff --git a/net/core/filter.c b/net/core/filter.c
> index c9411dc..98665ba 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3849,34 +3849,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
>  }
>  EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
>  
> -static bool __is_valid_sock_ops_access(int off, int size)
> +static bool sock_ops_is_valid_access(int off, int size,
> +  enum bpf_access_type type,
> +  struct bpf_insn_access_aux *info)
>  {
> + const int size_default = sizeof(__u32);
> +
>   if (off < 0 || off >= sizeof(struct bpf_sock_ops))
>   return false;
> +
>   /* The verifier guarantees that size > 0. */
>   if (off % size != 0)
>   return false;
> - if (size != sizeof(__u32))
> - return false;
> -
> - return true;
> -}
>  
> -static bool sock_ops_is_valid_access(int off, int size,
> -  enum bpf_access_type type,
> -  struct bpf_insn_access_aux *info)
> -{
>   if (type == BPF_WRITE) {
>   switch (off) {
> - case offsetof(struct bpf_sock_ops, op) ...
> -  offsetof(struct bpf_sock_ops, replylong[3]):
> + case bpf_ctx_range_till(struct bpf_sock_ops, op, replylong[3]):
> + if (size != size_default)
> + return false;
>   break;
>   default:
>   return false;
>   }
> + } else {
> + switch (off) {
> + case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> + bytes_acked):
> + if (size != sizeof(__u64))
> + return false;
> + break;
> + default:
> + if (size != size_default)
> + return false;
> + break;
> + }
>   }
>  
> - return __is_valid_sock_ops_access(off, size);
> + return true;
>  }




Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Lawrence Brakmo


On 1/23/18, 3:26 PM, "Alexei Starovoitov" <alexei.starovoi...@gmail.com> wrote:

On Tue, Jan 23, 2018 at 08:19:54PM +, Lawrence Brakmo wrote:
> On 1/23/18, 11:50 AM, "Eric Dumazet" <eric.duma...@gmail.com> wrote:
> 
> On Tue, 2018-01-23 at 14:39 -0500, Neal Cardwell wrote:
> > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> 
wrote:
> > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > > 
> > > The original patch that changes TCP's congestion control via 
eBPF only
> > > re-initializes the new congestion control, if the BPF op is 
set to an
> > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently 
TCP will
> > > 
> > > What do you mean by “(invalid) value”?
> > > 
> > > run the new congestion control from random states.
> > > 
> > > This has always been possible with setsockopt, no?
> > > 
> > >This patch fixes
> > > the issue by always re-init the congestion control like other 
means
> > > such as setsockopt and sysctl changes.
> > > 
> > > The current code re-inits the congestion control when calling
> > > tcp_set_congestion_control except when it is called early on 
(i.e. op <=
> > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
re-initialize
> > > since it will be initialized later by TCP when the connection is 
established.
> > > 
    >     > > Otherwise, if we always call tcp_reinit_congestion_control we 
would call
> > > tcp_cleanup_congestion_control when the congestion control has 
not been
> > > initialized yet.
> > 
> > On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> 
wrote:
> > > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > > 
> > > The original patch that changes TCP's congestion control via 
eBPF only
> > > re-initializes the new congestion control, if the BPF op is 
set to an
> > > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently 
TCP will
> > > 
> > > What do you mean by “(invalid) value”?
> > > 
> > > run the new congestion control from random states.
> > > 
> > > This has always been possible with setsockopt, no?
> > > 
> > >This patch fixes
> > > the issue by always re-init the congestion control like other 
means
> > > such as setsockopt and sysctl changes.
> > > 
> > > The current code re-inits the congestion control when calling
> > > tcp_set_congestion_control except when it is called early on 
(i.e. op <=
> > > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to 
re-initialize
> > > since it will be initialized later by TCP when the connection is 
established.
> > > 
> > > Otherwise, if we always call tcp_reinit_congestion_control we 
would call
> > > tcp_cleanup_congestion_control when the congestion control has 
not been
> > > initialized yet.
> > 
> > Interesting. So I wonder if the symptoms we were seeing were due to
> > kernels that did not yet have this fix:
> > 
> >   27204aaa9dc6 ("tcp: uniform the set up of sockets after successful
> > connection):
> >   
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=27204aaa9dc67b833b77179fdac556288ec3a4bf
> > 
> > Before that fix, there could be TFO passive connections that at SYN 
time called:
> >   tcp_init_congestion_control(child);
> > and then:
> >   tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
> > 
> > So that if the CC was switched in the
> > BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB handler then there would be no
> > init for the new module?
> 
> 
> Note that bpf_sock->op can be written by a malicious BPF filter.
> 
> So, a malicious filter can switch from Cubic to BBR without re-init,
> and bad things can happen.
> 
> I do not believe we should trust BPF here.

Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Lawrence Brakmo
On 1/23/18, 11:50 AM, "Eric Dumazet" <eric.duma...@gmail.com> wrote:

On Tue, 2018-01-23 at 14:39 -0500, Neal Cardwell wrote:
> On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > 
> > The original patch that changes TCP's congestion control via eBPF 
only
> > re-initializes the new congestion control, if the BPF op is set to 
an
> > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP will
> > 
> > What do you mean by “(invalid) value”?
> > 
> > run the new congestion control from random states.
> > 
> > This has always been possible with setsockopt, no?
> > 
> >This patch fixes
> > the issue by always re-init the congestion control like other means
> > such as setsockopt and sysctl changes.
> > 
> > The current code re-inits the congestion control when calling
> > tcp_set_congestion_control except when it is called early on (i.e. op <=
> > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to re-initialize
> > since it will be initialized later by TCP when the connection is 
established.
> > 
> > Otherwise, if we always call tcp_reinit_congestion_control we would call
> > tcp_cleanup_congestion_control when the congestion control has not been
> > initialized yet.
> 
> On Tue, Jan 23, 2018 at 2:20 PM, Lawrence Brakmo <bra...@fb.com> wrote:
> > On 1/23/18, 9:30 AM, "Yuchung Cheng" <ych...@google.com> wrote:
> > 
> > The original patch that changes TCP's congestion control via eBPF 
only
> > re-initializes the new congestion control, if the BPF op is set to 
an
> > (invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP will
> > 
> > What do you mean by “(invalid) value”?
> > 
> > run the new congestion control from random states.
> > 
> > This has always been possible with setsockopt, no?
> > 
> >This patch fixes
> > the issue by always re-init the congestion control like other means
> > such as setsockopt and sysctl changes.
> > 
> > The current code re-inits the congestion control when calling
> > tcp_set_congestion_control except when it is called early on (i.e. op <=
> > BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to re-initialize
> > since it will be initialized later by TCP when the connection is 
established.
> > 
> > Otherwise, if we always call tcp_reinit_congestion_control we would call
> > tcp_cleanup_congestion_control when the congestion control has not been
> > initialized yet.
> 
> Interesting. So I wonder if the symptoms we were seeing were due to
> kernels that did not yet have this fix:
> 
>   27204aaa9dc6 ("tcp: uniform the set up of sockets after successful
> connection):
>   
https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=27204aaa9dc67b833b77179fdac556288ec3a4bf
> 
> Before that fix, there could be TFO passive connections that at SYN time 
called:
>   tcp_init_congestion_control(child);
> and then:
>   tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
> 
> So that if the CC was switched in the
> BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB handler then there would be no
> init for the new module?


Note that bpf_sock->op can be written by a malicious BPF filter.

So, a malicious filter can switch from Cubic to BBR without re-init,
and bad things can happen.

I do not believe we should trust BPF here.

Very good point Eric. One solution would be to make bpf_sock->op not writeable
by the BPF program.

Neal, you are correct that would have been a problem. I leave it up to you guys
whether making bpf_sock->op not writeable by the BPF program is enough or if it
is safer to always re-init (as long as there is no problem calling
tcp_cleanup_congestion_control when it has not been initialized).





Re: [PATCH net] bpf: always re-init the congestion control after switching to it

2018-01-23 Thread Lawrence Brakmo
On 1/23/18, 9:30 AM, "Yuchung Cheng"  wrote:

The original patch that changes TCP's congestion control via eBPF only
re-initializes the new congestion control, if the BPF op is set to an
(invalid) value beyond BPF_SOCK_OPS_NEEDS_ECN. Consequently TCP will

What do you mean by “(invalid) value”?

run the new congestion control from random states. 

This has always been possible with setsockopt, no?

   This patch fixes
the issue by always re-init the congestion control like other means
such as setsockopt and sysctl changes.

The current code re-inits the congestion control when calling 
tcp_set_congestion_control except when it is called early on (i.e. op <= 
BPF_SOCK_OPS_NEEDS_ECN). In that case there is no need to re-initialize
since it will be initialized later by TCP when the connection is established.

Otherwise, if we always call tcp_reinit_congestion_control we would call 
tcp_cleanup_congestion_control when the congestion control has not been
initialized yet.


Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
Signed-off-by: Yuchung Cheng 
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Soheil Hassas Yeganeh 
---
 include/net/tcp.h   |  2 +-
 net/core/filter.c   |  3 +--
 net/ipv4/tcp.c  |  2 +-
 net/ipv4/tcp_cong.c | 11 ++-
 4 files changed, 5 insertions(+), 13 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6da880d2f022..f94a71b62ba5 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1006,7 +1006,7 @@ void tcp_get_default_congestion_control(struct net 
*net, char *name);
 void tcp_get_available_congestion_control(char *buf, size_t len);
 void tcp_get_allowed_congestion_control(char *buf, size_t len);
 int tcp_set_allowed_congestion_control(char *allowed);
-int tcp_set_congestion_control(struct sock *sk, const char *name, bool 
load, bool reinit);
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool 
load);
 u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
 void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
 
diff --git a/net/core/filter.c b/net/core/filter.c
index 6a85e67fafce..757d52adccfc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3233,12 +3233,11 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern 
*, bpf_sock,
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
char name[TCP_CA_NAME_MAX];
-   bool reinit = bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN;
 
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f08eebe60446..21e2a07e857e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2550,7 +2550,7 @@ static int do_tcp_setsockopt(struct sock *sk, int 
level,
name[val] = 0;
 
lock_sock(sk);
-   err = tcp_set_congestion_control(sk, name, true, true);
+   err = tcp_set_congestion_control(sk, name, true);
release_sock(sk);
return err;
}
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index bc6c02f16243..70895bee3026 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -332,7 +332,7 @@ int tcp_set_allowed_congestion_control(char *val)
  * tcp_reinit_congestion_control (if the current congestion control was
  * already initialized.
  */
-int tcp_set_congestion_control(struct sock *sk, const char *name, bool 
load, bool reinit)
+int tcp_set_congestion_control(struct sock *sk, const char *name, bool 
load)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
const struct tcp_congestion_ops *ca;
@@ -356,15 +356,8 @@ int tcp_set_congestion_control(struct sock *sk, const 
char *name, bool load, boo
if (!ca) {
err = -ENOENT;
} else if (!load) {
-   const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
-
if (try_module_get(ca->owner)) {
-   if (reinit) {
-   tcp_reinit_congestion_control(sk, ca);
-   } else {
-   icsk->icsk_ca_ops = ca;
-   module_put(old_ca->owner);
-   }
+  

Re: [PATCH bpf-next v6 05/11] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-19 Thread Lawrence Brakmo
On 1/19/18, 7:52 PM, "Alexei Starovoitov" <alexei.starovoi...@gmail.com> wrote:

On Fri, Jan 19, 2018 at 05:45:42PM -0800, Lawrence Brakmo wrote:
> Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
> use is to determine if there should be calls to sock_ops bpf program at
> various points in the TCP code. The field is initialized to zero,
> disabling the calls. A sock_ops BPF program can set it, per connection and
> as necessary, when the connection is established.
> 
> It also adds support for reading and writing the field within a
> sock_ops BPF program. Reading is done by accessing the field directly.
> However, writing is done through the helper function
> bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
> is trying to set a callback that is not supported in the current kernel
> (i.e. running an older kernel). The helper function returns 0 if it was
> able to set all of the bits set in the argument, a positive number
> containing the bits that could not be set, or -EINVAL if the socket is
> not a full TCP socket.
...
> +/* Sock_ops bpf program related variables */
> +#ifdef CONFIG_BPF
> + u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
> +  * values defined in uapi/linux/tcp.h

I guess we can extend u8 into u16 or more if necessary in the future.

Yes, that was my thought.

> + * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
> + * Set callback flags for sock_ops
> + * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
> + * @flags: flags value
> + * Return: 0 for no error
> + * -EINVAL if there is no full tcp socket
> + * bits in flags that are not supported by current kernel
...
> +BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, 
bpf_sock,
> +int, argval)
> +{
> + struct sock *sk = bpf_sock->sk;
> + int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> +
> + if (!sk_fullsock(sk))
> + return -EINVAL;
> +
> +#ifdef CONFIG_INET
> + if (val)
> + tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
> +
> + return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);

interesting idea! took me some time to realize the potential
of such semantics, but now I like it a lot.
It blends 'set good flag' with 'which flags are supported' logic
into single helper. Nice.
Thanks for adding a test for both ways.
Acked-by: Alexei Starovoitov <a...@kernel.org>

Eric, does this approach address your concerns?





[PATCH bpf-next v6 08/11] bpf: Add sock_ops R/W access to tclass & sk_txhash

2018-01-19 Thread Lawrence Brakmo
Adds direct R/W access to sk_txhash and access to tclass for ipv6 flows
through getsockopt and setsockopt. Sample usage for tclass:

  bpf_getsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v))

where skops is a pointer to the ctx (struct bpf_sock_ops).
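
A slightly fuller sketch (not part of this patch) of setting the traffic class
on established IPv6 connections from a sock_ops program fragment:

	int v = 0x28;	/* example tclass value */

	if (skops->op == BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB &&
	    skops->family == AF_INET6)
		bpf_setsockopt(skops, SOL_IPV6, IPV6_TCLASS, &v, sizeof(v));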

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  1 +
 net/core/filter.c| 48 +++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ff34f3c..1c80ff4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -998,6 +998,7 @@ struct bpf_sock_ops {
__u32 data_segs_out;
__u64 bytes_received;
__u64 bytes_acked;
+   __u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index 98665ba..e136796 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3228,6 +3228,29 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
ret = -EINVAL;
}
 #ifdef CONFIG_INET
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   return -EINVAL;
+
+   val = *((int *)optval);
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   if (val < -1 || val > 0xff) {
+   ret = -EINVAL;
+   } else {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (val == -1)
+   val = 0;
+   np->tclass = val;
+   }
+   break;
+   default:
+   ret = -EINVAL;
+   }
+#endif
} else if (level == SOL_TCP &&
   sk->sk_prot->setsockopt == tcp_setsockopt) {
if (optname == TCP_CONGESTION) {
@@ -3237,7 +3260,8 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
strncpy(name, optval, min_t(long, optlen,
TCP_CA_NAME_MAX-1));
name[TCP_CA_NAME_MAX-1] = 0;
-   ret = tcp_set_congestion_control(sk, name, false, 
reinit);
+   ret = tcp_set_congestion_control(sk, name, false,
+reinit);
} else {
struct tcp_sock *tp = tcp_sk(sk);
 
@@ -3303,6 +3327,22 @@ BPF_CALL_5(bpf_getsockopt, struct bpf_sock_ops_kern *, 
bpf_sock,
} else {
goto err_clear;
}
+#if IS_ENABLED(CONFIG_IPV6)
+   } else if (level == SOL_IPV6) {
+   struct ipv6_pinfo *np = inet6_sk(sk);
+
+   if (optlen != sizeof(int) || sk->sk_family != AF_INET6)
+   goto err_clear;
+
+   /* Only some options are supported */
+   switch (optname) {
+   case IPV6_TCLASS:
+   *((int *)optval) = (int)np->tclass;
+   break;
+   default:
+   goto err_clear;
+   }
+#endif
} else {
goto err_clear;
}
@@ -3865,6 +3905,7 @@ static bool sock_ops_is_valid_access(int off, int size,
if (type == BPF_WRITE) {
switch (off) {
case bpf_ctx_range_till(struct bpf_sock_ops, op, replylong[3]):
+   case bpf_ctx_range(struct bpf_sock_ops, sk_txhash):
if (size != size_default)
return false;
break;
@@ -4683,6 +4724,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
case offsetof(struct bpf_sock_ops, bytes_acked):
SOCK_OPS_GET_FIELD(bytes_acked, bytes_acked, struct tcp_sock);
break;
+
+   case offsetof(struct bpf_sock_ops, sk_txhash):
+   SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
+ struct sock, type);
+   break;
}
return insn - insn_buf;
 }
-- 
2.9.5



[PATCH bpf-next v6 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB

2018-01-19 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a
retransmission. Two arguments are used; one for the sequence number and
other for the number of segments retransmitted. Does not include syn-ack
retransmissions.

New op: BPF_SOCK_OPS_RETRANS_CB.
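
A sketch of consuming the callback in a sock_ops program (assumes the program
enabled BPF_SOCK_OPS_RETRANS_CB_FLAG via bpf_sock_ops_cb_flags_set()):

	if (skops->op == BPF_SOCK_OPS_RETRANS_CB) {
		__u32 seq  = skops->args[0];	/* sequence number of 1st byte */
		__u32 segs = skops->args[1];	/* number of segments */
		/* e.g. update a per-flow retransmit counter in a map */
	}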

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 4 
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_output.c| 3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1c80ff4..2f752f3 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1039,6 +1039,10 @@ enum {
 * Arg2: value of icsk_rto
 * Arg3: whether RTO has expired
 */
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 129032ca..ec03a2b 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -270,7 +270,8 @@ struct tcp_diag_md5sig {
 
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
+#define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d12f7f7..f7d34f01 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2908,6 +2908,9 @@ int __tcp_retransmit_skb(struct sock *sk, struct sk_buff 
*skb, int segs)
if (likely(!err)) {
TCP_SKB_CB(skb)->sacked |= TCPCB_EVER_RETRANS;
trace_tcp_retransmit_skb(sk, skb);
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RETRANS_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_RETRANS_CB,
+ TCP_SKB_CB(skb)->seq, segs);
} else if (err != -EBUSY) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
}
-- 
2.9.5



[PATCH bpf-next v6 11/11] bpf: add selftest for tcpbpf

2018-01-19 Thread Lawrence Brakmo
Added a selftest for tcpbpf (sock_ops) that checks that the appropriate
callbacks occurred, that it can access tcp_sock fields, and that the fields'
values are correct.

Run with command: ./test_tcpbpf_user

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 tools/include/uapi/linux/bpf.h |  74 +-
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +++
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 +++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 
 8 files changed, 478 insertions(+), 6 deletions(-)
 create mode 100755 tools/testing/selftests/bpf/tcp_client.py
 create mode 100755 tools/testing/selftests/bpf/tcp_server.py
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf.h
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_kern.c
 create mode 100644 tools/testing/selftests/bpf/test_tcpbpf_user.c

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index af1f49a..0e336cf8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -17,7 +17,7 @@
 #define BPF_ALU64  0x07/* alu mode in double word width */
 
 /* ld/ldx fields */
-#define BPF_DW 0x18/* double word */
+#define BPF_DW 0x18/* double word (64-bit) */
 #define BPF_XADD   0xc0/* exclusive add */
 
 /* alu/jmp fields */
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -952,8 +961,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
@@ -968,6 +978,27 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u64 bytes_received;
+   __u64 bytes_acked;
+   __u32 sk_txhash;
 };
 
 /* List of known BPF sock_ops operators.
@@ -1003,6 +1034,41 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
+   BPF_SOCK_OPS_RETRANS_CB,/* Called when skb is retransmitted.
+* Arg1: sequence number of 1st byte
+* Arg2: # segments
+*/
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+}

[PATCH bpf-next v6 06/11] bpf: Add sock_ops RTO callback

2018-01-19 Thread Lawrence Brakmo
Adds an optional call to sock_ops BPF program based on whether the
BPF_SOCK_OPS_RTO_CB_FLAG is set in bpf_sock_ops_flags.
The BPF program is passed 3 arguments: icsk_retransmits, icsk_rto, and
whether the RTO has expired.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 5 +
 include/uapi/linux/tcp.h | 3 ++-
 net/ipv4/tcp_timer.c | 7 +++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 7573f5b..2a8c40a 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1014,6 +1014,11 @@ enum {
 * a congestion threshold. RTTs above
 * this indicate congestion
 */
+   BPF_SOCK_OPS_RTO_CB,/* Called when an RTO has triggered.
+* Arg1: value of icsk_retransmits
+* Arg2: value of icsk_rto
+* Arg3: whether RTO has expired
+*/
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index d1df2f6..129032ca 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -269,7 +269,8 @@ struct tcp_diag_md5sig {
 };
 
 /* Definitions for bpf_sock_ops_cb_flags */
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+#define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x1/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 6db3124..257abdd 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -213,11 +213,18 @@ static int tcp_write_timeout(struct sock *sk)
icsk->icsk_user_timeout);
}
tcp_fastopen_active_detect_blackhole(sk, expired);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tp, BPF_SOCK_OPS_RTO_CB_FLAG))
+   tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
+ icsk->icsk_retransmits,
+ icsk->icsk_rto, (int)expired);
+
if (expired) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
}
+
return 0;
 }
 
-- 
2.9.5



[PATCH bpf-next v6 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB

2018-01-19 Thread Lawrence Brakmo
Adds support for calling sock_ops BPF program when there is a TCP state
change. Two arguments are used; one for the old state and another for
the new state.

There is a new enum in include/uapi/linux/bpf.h that exports the TCP
states, prepending BPF_ to the current TCP state names. If it is ever
necessary to change the internal TCP state values (other than adding
more to the end), then it will become necessary to convert from the
internal TCP state value to the BPF value before calling the BPF
sock_ops function. There are a set of compile checks added in tcp.c
to detect if the internal and BPF values differ so we can make the
necessary fixes.

New op: BPF_SOCK_OPS_STATE_CB.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h | 26 ++
 include/uapi/linux/tcp.h |  3 ++-
 net/ipv4/tcp.c   | 24 
 3 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2f752f3..0e336cf8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1043,6 +1043,32 @@ enum {
 * Arg1: sequence number of 1st byte
 * Arg2: # segments
 */
+   BPF_SOCK_OPS_STATE_CB,  /* Called when TCP changes state.
+* Arg1: old_state
+* Arg2: new_state
+*/
+};
+
+/* List of TCP states. There is a build check in net/ipv4/tcp.c to detect
+ * changes between the TCP and BPF versions. Ideally this should never happen.
+ * If it does, we need to add code to convert them before calling
+ * the BPF sock_ops function.
+ */
+enum {
+   BPF_TCP_ESTABLISHED = 1,
+   BPF_TCP_SYN_SENT,
+   BPF_TCP_SYN_RECV,
+   BPF_TCP_FIN_WAIT1,
+   BPF_TCP_FIN_WAIT2,
+   BPF_TCP_TIME_WAIT,
+   BPF_TCP_CLOSE,
+   BPF_TCP_CLOSE_WAIT,
+   BPF_TCP_LAST_ACK,
+   BPF_TCP_LISTEN,
+   BPF_TCP_CLOSING,/* Now a valid state */
+   BPF_TCP_NEW_SYN_RECV,
+
+   BPF_TCP_MAX_STATES  /* Leave at the end! */
 };
 
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index ec03a2b..cf0b861 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -271,7 +271,8 @@ struct tcp_diag_md5sig {
 /* Definitions for bpf_sock_ops_cb_flags */
 #define BPF_SOCK_OPS_RTO_CB_FLAG   (1<<0)
 #define BPF_SOCK_OPS_RETRANS_CB_FLAG   (1<<1)
-#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x3/* Mask of all currently
+#define BPF_SOCK_OPS_STATE_CB_FLAG (1<<2)
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0x7/* Mask of all currently
 * supported cb flags
 */
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 88b6244..f013ddc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2042,6 +2042,30 @@ void tcp_set_state(struct sock *sk, int state)
 {
int oldstate = sk->sk_state;
 
+   /* We defined a new enum for TCP states that are exported in BPF
+* so as not to force the internal TCP states to be frozen. The
+* following checks will detect if an internal state value ever
+* differs from the BPF value. If this ever happens, then we will
+* need to remap the internal value to the BPF value before calling
+* tcp_call_bpf_2arg.
+*/
+   BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
+   BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
+   BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
+   BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
+   BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
+   BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
+   BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
+   BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
+   BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
+
+   if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
+   tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
+
switch (state) {
case TCP_ESTABLISHED:
if (oldstate != TCP_ESTABLISHED)
-- 
2.9.5



[PATCH bpf-next v6 07/11] bpf: Add support for reading sk_state and more

2018-01-19 Thread Lawrence Brakmo
Add support for reading many more tcp_sock fields

  state,same as sk->sk_state
  rtt_min   same as sk->rtt_min.s[0].v (current rtt_min)
  snd_ssthresh
  rcv_nxt
  snd_nxt
  snd_una
  mss_cache
  ecn_flags
  rate_delivered
  rate_interval_us
  packets_out
  retrans_out
  total_retrans
  segs_in
  data_segs_in
  segs_out
  data_segs_out
  bytes_received (__u64)
  bytes_acked(__u64)
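
For example, after this patch a sock_ops program can read these fields
directly from the context (sketch; note the two 64-bit fields require 8-byte
loads per the new sock_ops_is_valid_access() checks):

	__u32 ssthresh = skops->snd_ssthresh;
	__u32 mss      = skops->mss_cache;
	__u64 acked    = skops->bytes_acked;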

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/uapi/linux/bpf.h |  19 +++
 net/core/filter.c| 134 ++-
 2 files changed, 140 insertions(+), 13 deletions(-)

diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 2a8c40a..ff34f3c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -979,6 +979,25 @@ struct bpf_sock_ops {
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
__u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
+   __u32 state;
+   __u32 rtt_min;
+   __u32 snd_ssthresh;
+   __u32 rcv_nxt;
+   __u32 snd_nxt;
+   __u32 snd_una;
+   __u32 mss_cache;
+   __u32 ecn_flags;
+   __u32 rate_delivered;
+   __u32 rate_interval_us;
+   __u32 packets_out;
+   __u32 retrans_out;
+   __u32 total_retrans;
+   __u32 segs_in;
+   __u32 data_segs_in;
+   __u32 segs_out;
+   __u32 data_segs_out;
+   __u64 bytes_received;
+   __u64 bytes_acked;
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/net/core/filter.c b/net/core/filter.c
index c9411dc..98665ba 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3849,34 +3849,43 @@ void bpf_warn_invalid_xdp_action(u32 act)
 }
 EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
 
-static bool __is_valid_sock_ops_access(int off, int size)
+static bool sock_ops_is_valid_access(int off, int size,
+enum bpf_access_type type,
+struct bpf_insn_access_aux *info)
 {
+   const int size_default = sizeof(__u32);
+
if (off < 0 || off >= sizeof(struct bpf_sock_ops))
return false;
+
/* The verifier guarantees that size > 0. */
if (off % size != 0)
return false;
-   if (size != sizeof(__u32))
-   return false;
-
-   return true;
-}
 
-static bool sock_ops_is_valid_access(int off, int size,
-enum bpf_access_type type,
-struct bpf_insn_access_aux *info)
-{
if (type == BPF_WRITE) {
switch (off) {
-   case offsetof(struct bpf_sock_ops, op) ...
-offsetof(struct bpf_sock_ops, replylong[3]):
+   case bpf_ctx_range_till(struct bpf_sock_ops, op, replylong[3]):
+   if (size != size_default)
+   return false;
break;
default:
return false;
}
+   } else {
+   switch (off) {
+   case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
+   bytes_acked):
+   if (size != sizeof(__u64))
+   return false;
+   break;
+   default:
+   if (size != size_default)
+   return false;
+   break;
+   }
}
 
-   return __is_valid_sock_ops_access(off, size);
+   return true;
 }
 
 static int sk_skb_prologue(struct bpf_insn *insn_buf, bool direct_write,
@@ -4493,6 +4502,32 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
+   case offsetof(struct bpf_sock_ops, state):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct sock_common, skc_state) != 1);
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+ si->dst_reg, si->src_reg,
+ offsetof(struct bpf_sock_ops_kern, sk));
+   *insn++ = BPF_LDX_MEM(BPF_B, si->dst_reg, si->dst_reg,
+ offsetof(struct sock_common, skc_state));
+   break;
+
+   case offsetof(struct bpf_sock_ops, rtt_min):
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, rtt_min) !=
+sizeof(struct minmax));
+   BUILD_BUG_ON(sizeof(struct minmax) <
+sizeof(struct minmax_sample));
+
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(
+   struct bpf_sock_ops_kern, sk),
+   

[PATCH bpf-next v6 02/11] bpf: Make SOCK_OPS_GET_TCP struct independent

2018-01-19 Thread Lawrence Brakmo
Changed SOCK_OPS_GET_TCP to SOCK_OPS_GET_FIELD and added 2
arguments so now it can also work with struct sock fields.
The first argument is the name of the field in the bpf_sock_ops
struct, the 2nd argument is the name of the field in the OBJ struct.

Previous: SOCK_OPS_GET_TCP(FIELD_NAME)
New:  SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)

Where OBJ is either "struct tcp_sock" or "struct sock" (without
quotation). BPF_FIELD is the name of the field in the bpf_sock_ops
struct and OBJ_FIELD is the name of the field in the OBJ struct.

Although the field names are currently the same, the kernel struct names
could change in the future and this change makes it easier to support
that.
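
For example, with the new form a struct sock field can be handled by the same
macro (the sk_txhash case here is only illustrative; that field is actually
wired up by a later patch in this series):

	case offsetof(struct bpf_sock_ops, sk_txhash):
		SOCK_OPS_GET_FIELD(sk_txhash, sk_txhash, struct sock);
		break;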

Note that adding access to tcp_sock fields in sock_ops programs does
not preclude the tcp_sock fields from being removed as long as we are
willing to do one of the following:

  1) Return a fixed value (e.g. 0 or 0xffffffff), or
  2) Make the verifier fail if that field is accessed (i.e. program
fails to load) so the user will know that field is no longer
supported.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 5d6f121..292bda8 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4464,11 +4464,11 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
   is_fullsock));
break;
 
-/* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP(FIELD_NAME) \
+/* Helper macro for adding read access to tcp_sock or sock fields. */
+#define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
-FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4480,18 +4480,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
-  FIELD_NAME), si->dst_reg,  \
- si->dst_reg,\
- offsetof(struct tcp_sock, FIELD_NAME)); \
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(OBJ,   \
+  OBJ_FIELD),\
+ si->dst_reg, si->dst_reg,   \
+ offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP(snd_cwnd);
+   SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP(srtt_us);
+   SOCK_OPS_GET_FIELD(srtt_us, srtt_us, struct tcp_sock);
break;
}
return insn - insn_buf;
-- 
2.9.5



[PATCH bpf-next v6 03/11] bpf: Add write access to tcp_sock and sock fields

2018-01-19 Thread Lawrence Brakmo
This patch adds a macro, SOCK_OPS_SET_FIELD, for writing to
struct tcp_sock or struct sock fields. This required adding a new
field "temp" to struct bpf_sock_ops_kern for temporary storage that
is used by sock_ops_convert_ctx_access. It is used to store and recover
the contents of a register, so the register can be used to store the
address of the sk. Since we cannot overwrite dst_reg (it contains the pointer
to ctx) or src_reg (it holds the value we want to store), we need an extra
register to hold the address of the sk.

Also adds the macro SOCK_OPS_GET_OR_SET_FIELD that calls one of the
GET or SET macros depending on the value of the TYPE field.
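
Conceptually, the instruction sequence emitted by SOCK_OPS_SET_FIELD behaves
like the following C (a sketch of the generated BPF, not actual code):

	kern->temp = reg;			/* save the borrowed register */
	if (kern->is_fullsock) {
		reg = (unsigned long)kern->sk;
		((OBJ *)reg)->OBJ_FIELD = src;	/* value from the program */
	}
	reg = kern->temp;			/* restore the borrowed register */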

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h |  9 +
 include/net/tcp.h  |  2 +-
 net/core/filter.c  | 48 
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index 425056c..daa5a67 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1007,6 +1007,15 @@ struct bpf_sock_ops_kern {
u32 replylong[4];
};
u32 is_fullsock;
+   u64 temp;   /* temp and everything after is not
+* initialized to 0 before calling
+* the BPF program. New fields that
+* should be initialized to 0 should
+* be inserted before temp.
+* temp is scratch storage used by
+* sock_ops_convert_ctx_access
+* as temporary storage of a register.
+*/
 };
 
 #endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 6939e69..108d16a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2010,7 +2010,7 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
struct bpf_sock_ops_kern sock_ops;
int ret;
 
-   memset(&sock_ops, 0, sizeof(sock_ops));
+   memset(&sock_ops, 0, offsetof(struct bpf_sock_ops_kern, temp));
if (sk_fullsock(sk)) {
sock_ops.is_fullsock = 1;
sock_owned_by_me(sk);
diff --git a/net/core/filter.c b/net/core/filter.c
index 292bda8..1ff36ca 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4486,6 +4486,54 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
  offsetof(OBJ, OBJ_FIELD));  \
} while (0)
 
+/* Helper macro for adding write access to tcp_sock or sock fields.
+ * The macro is called with two registers, dst_reg which contains a pointer
+ * to ctx (context) and src_reg which contains the value that should be
+ * stored. However, we need an additional register since we cannot overwrite
+ * dst_reg because it may be used later in the program.
+ * Instead we "borrow" one of the other register. We first save its value
+ * into a new (temp) field in bpf_sock_ops_kern, use it, and then restore
+ * it at the end of the macro.
+ */
+#define SOCK_OPS_SET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)\
+   do {  \
+   int reg = BPF_REG_9;  \
+   BUILD_BUG_ON(FIELD_SIZEOF(OBJ, OBJ_FIELD) >   \
+FIELD_SIZEOF(struct bpf_sock_ops, BPF_FIELD));   \
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   if (si->dst_reg == reg || si->src_reg == reg) \
+   reg--;\
+   *insn++ = BPF_STX_MEM(BPF_DW, si->dst_reg, reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  temp));\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, \
+   is_fullsock), \
+ reg, si->dst_reg,   \
+ offsetof(struct bpf_sock_ops_kern,  \
+  is_fullsock)); \
+   *insn++ = BPF_JMP_IMM(BPF_JEQ, reg, 0, 2);\
+   *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
+   struct bpf_sock_ops_kern, sk),\
+  
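
The diff above is truncated in this archive. As a rough sketch (not the literal
hunk), the dispatch the commit message describes for SOCK_OPS_GET_OR_SET_FIELD
amounts to choosing between the two macros based on the access type passed to
sock_ops_convert_ctx_access:

	/* Illustrative only: read/write field handling, assuming the
	 * converter's access type is available as "type".
	 */
	case offsetof(struct bpf_sock_ops, snd_cwnd):
		if (type == BPF_WRITE)
			SOCK_OPS_SET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
		else
			SOCK_OPS_GET_FIELD(snd_cwnd, snd_cwnd, struct tcp_sock);
		break;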

[PATCH bpf-next v6 05/11] bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock

2018-01-19 Thread Lawrence Brakmo
Adds field bpf_sock_ops_cb_flags to tcp_sock and bpf_sock_ops. Its primary
use is to determine if there should be calls to sock_ops bpf program at
various points in the TCP code. The field is initialized to zero,
disabling the calls. A sock_ops BPF program can set it, per connection and
as necessary, when the connection is established.

It also adds support for reading and writing the field within a
sock_ops BPF program. Reading is done by accessing the field directly.
However, writing is done through the helper function
bpf_sock_ops_cb_flags_set, in order to return an error if a BPF program
is trying to set a callback that is not supported in the current kernel
(i.e. running an older kernel). The helper function returns 0 if it was
able to set all of the bits set in the argument, a positive number
containing the bits that could not be set, or -EINVAL if the socket is
not a full TCP socket.

Examples of where one could call the bpf program:

1) When RTO fires
2) When a packet is retransmitted
3) When the connection terminates
4) When a packet is sent
5) When a packet is received

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/tcp.h  | 11 +++
 include/uapi/linux/bpf.h | 12 +++-
 include/uapi/linux/tcp.h |  5 +
 net/core/filter.c| 34 ++
 4 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 4f93f095..8f4c549 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -335,6 +335,17 @@ struct tcp_sock {
 
int linger2;
 
+
+/* Sock_ops bpf program related variables */
+#ifdef CONFIG_BPF
+   u8  bpf_sock_ops_cb_flags;  /* Control calling BPF programs
+* values defined in uapi/linux/tcp.h
+*/
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_cb_flags & ARG)
+#else
+#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
+#endif
+
 /* Receiver side RTT estimation */
struct {
u32 rtt_us;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 8d5874c..7573f5b 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -642,6 +642,14 @@ union bpf_attr {
  * @optlen: length of optval in bytes
  * Return: 0 or negative error
  *
+ * int bpf_sock_ops_cb_flags_set(bpf_sock_ops, flags)
+ * Set callback flags for sock_ops
+ * @bpf_sock_ops: pointer to bpf_sock_ops_kern struct
+ * @flags: flags value
+ * Return: 0 for no error
+ * -EINVAL if there is no full tcp socket
+ * bits in flags that are not supported by current kernel
+ *
  * int bpf_skb_adjust_room(skb, len_diff, mode, flags)
  * Grow or shrink room in sk_buff.
  * @skb: pointer to skb
@@ -748,7 +756,8 @@ union bpf_attr {
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
FN(getsockopt), \
-   FN(override_return),
+   FN(override_return),\
+   FN(sock_ops_cb_flags_set),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -969,6 +978,7 @@ struct bpf_sock_ops {
 */
__u32 snd_cwnd;
__u32 srtt_us;  /* Averaged RTT << 3 in usecs */
+   __u32 bpf_sock_ops_cb_flags; /* flags defined in uapi/linux/tcp.h */
 };
 
 /* List of known BPF sock_ops operators.
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index b4a4f64..d1df2f6 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -268,4 +268,9 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* Definitions for bpf_sock_ops_cb_flags */
+#define BPF_SOCK_OPS_ALL_CB_FLAGS   0  /* Mask of all currently
+* supported cb flags
+*/
+
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 1ff36ca..c9411dc 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3324,6 +3324,33 @@ static const struct bpf_func_proto bpf_getsockopt_proto 
= {
.arg5_type  = ARG_CONST_SIZE,
 };
 
+BPF_CALL_2(bpf_sock_ops_cb_flags_set, struct bpf_sock_ops_kern *, bpf_sock,
+  int, argval)
+{
+   struct sock *sk = bpf_sock->sk;
+   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
+
+   if (!sk_fullsock(sk))
+   return -EINVAL;
+
+#ifdef CONFIG_INET
+   if (val)
+   tcp_sk(sk)->bpf_sock_ops_cb_flags = val;
+
+   return argval & (~BPF_SOCK_OPS_ALL_CB_FLAGS);
+#else
+   return -EINVAL;
+#endif
+}
+
+static const struct bpf_func_proto bpf_sock_ops_cb_flags_set_proto = {
+   .func   = bpf_sock_ops_cb_flags_set,
+   .gpl_only   = false,
+   
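
The diff is cut off here, but the intended usage from the program side follows
from the commit message: a sock_ops program calls the new helper once per
connection to opt in to callbacks. A minimal sketch, assuming the *_CB_FLAG and
*_ESTABLISHED_CB names used by later patches in the series and the
bpf_helpers.h wrapper added by the selftest in patch 11/11:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("sockops")
int enable_tcp_callbacks(struct bpf_sock_ops *skops)
{
	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
		/* Request RTO and retransmit callbacks for this connection.
		 * A positive return value reports flag bits the running
		 * kernel does not support; -EINVAL means not a full socket.
		 */
		bpf_sock_ops_cb_flags_set(skops,
					  BPF_SOCK_OPS_RTO_CB_FLAG |
					  BPF_SOCK_OPS_RETRANS_CB_FLAG);
		break;
	}
	return 1;
}

char _license[] SEC("license") = "GPL";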

[PATCH bpf-next v6 01/11] bpf: Make SOCK_OPS_GET_TCP size independent

2018-01-19 Thread Lawrence Brakmo
Make the SOCK_OPS_GET_TCP helper macro size independent (before it only worked
with 4-byte fields).

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 net/core/filter.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 30fafaa..5d6f121 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4465,9 +4465,10 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
break;
 
 /* Helper macro for adding read access to tcp_sock fields. */
-#define SOCK_OPS_GET_TCP32(FIELD_NAME)   \
+#define SOCK_OPS_GET_TCP(FIELD_NAME) \
do {  \
-   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) != 4); \
+   BUILD_BUG_ON(FIELD_SIZEOF(struct tcp_sock, FIELD_NAME) >  \
+FIELD_SIZEOF(struct bpf_sock_ops, FIELD_NAME));  \
*insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(   \
struct bpf_sock_ops_kern, \
is_fullsock), \
@@ -4479,16 +4480,18 @@ static u32 sock_ops_convert_ctx_access(enum 
bpf_access_type type,
struct bpf_sock_ops_kern, sk),\
  si->dst_reg, si->src_reg,   \
  offsetof(struct bpf_sock_ops_kern, sk));\
-   *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,\
+   *insn++ = BPF_LDX_MEM(FIELD_SIZEOF(struct tcp_sock,   \
+  FIELD_NAME), si->dst_reg,  \
+ si->dst_reg,\
  offsetof(struct tcp_sock, FIELD_NAME)); \
} while (0)
 
case offsetof(struct bpf_sock_ops, snd_cwnd):
-   SOCK_OPS_GET_TCP32(snd_cwnd);
+   SOCK_OPS_GET_TCP(snd_cwnd);
break;
 
case offsetof(struct bpf_sock_ops, srtt_us):
-   SOCK_OPS_GET_TCP32(srtt_us);
+   SOCK_OPS_GET_TCP(srtt_us);
break;
}
return insn - insn_buf;
-- 
2.9.5
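
The change works because BPF_FIELD_SIZEOF maps the C field size to the matching
BPF load size instead of hard-coding BPF_W. For context, the mapping in
include/linux/filter.h of this era looks roughly like the following (quoted from
memory, not part of the patch):

static inline int bytes_to_bpf_size(int bytes)
{
	int bpf_size = -EINVAL;

	if (bytes == sizeof(u8))
		bpf_size = BPF_B;
	else if (bytes == sizeof(u16))
		bpf_size = BPF_H;
	else if (bytes == sizeof(u32))
		bpf_size = BPF_W;
	else if (bytes == sizeof(u64))
		bpf_size = BPF_DW;

	return bpf_size;
}

#define BPF_FIELD_SIZEOF(type, field) \
	bytes_to_bpf_size(FIELD_SIZEOF(type, field))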



[PATCH bpf-next v6 04/11] bpf: Support passing args to sock_ops bpf function

2018-01-19 Thread Lawrence Brakmo
Adds support for passing up to 4 arguments to sock_ops bpf functions. It
reuses the reply union, so the bpf_sock_ops structures are not
increased in size.

Signed-off-by: Lawrence Brakmo <bra...@fb.com>
---
 include/linux/filter.h   |  1 +
 include/net/tcp.h| 64 
 include/uapi/linux/bpf.h |  5 ++--
 net/ipv4/tcp.c   |  2 +-
 net/ipv4/tcp_nv.c|  2 +-
 net/ipv4/tcp_output.c|  2 +-
 6 files changed, 66 insertions(+), 10 deletions(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index daa5a67..20384c4 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1003,6 +1003,7 @@ struct bpf_sock_ops_kern {
struct  sock *sk;
u32 op;
union {
+   u32 args[4];
u32 reply;
u32 replylong[4];
};
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 108d16a..8e9111f 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2005,7 +2005,7 @@ void tcp_cleanup_ulp(struct sock *sk);
  * program loaded).
  */
 #ifdef CONFIG_BPF
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
struct bpf_sock_ops_kern sock_ops;
int ret;
@@ -2018,6 +2018,8 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
 
sock_ops.sk = sk;
sock_ops.op = op;
+   if (nargs > 0)
+   memcpy(sock_ops.args, args, nargs*sizeof(u32));
 
ret = BPF_CGROUP_RUN_PROG_SOCK_OPS(&sock_ops);
if (ret == 0)
@@ -2026,18 +2028,70 @@ static inline int tcp_call_bpf(struct sock *sk, int op)
ret = -1;
return ret;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+   return tcp_call_bpf(sk, op, 1, &arg);
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   u32 args[2] = {arg1, arg2};
+
+   return tcp_call_bpf(sk, op, 2, args);
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   u32 args[3] = {arg1, arg2, arg3};
+
+   return tcp_call_bpf(sk, op, 3, args);
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3, u32 arg4)
+{
+   u32 args[4] = {arg1, arg2, arg3, arg4};
+
+   return tcp_call_bpf(sk, op, 4, args);
+}
+
 #else
-static inline int tcp_call_bpf(struct sock *sk, int op)
+static inline int tcp_call_bpf(struct sock *sk, int op, u32 nargs, u32 *args)
 {
return -EPERM;
 }
+
+static inline int tcp_call_bpf_1arg(struct sock *sk, int op, u32 arg)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_2arg(struct sock *sk, int op, u32 arg1, u32 
arg2)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_3arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3)
+{
+   return -EPERM;
+}
+
+static inline int tcp_call_bpf_4arg(struct sock *sk, int op, u32 arg1, u32 
arg2,
+   u32 arg3, u32 arg4)
+{
+   return -EPERM;
+}
+
 #endif
 
 static inline u32 tcp_timeout_init(struct sock *sk)
 {
int timeout;
 
-   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT);
+   timeout = tcp_call_bpf(sk, BPF_SOCK_OPS_TIMEOUT_INIT, 0, NULL);
 
if (timeout <= 0)
timeout = TCP_TIMEOUT_INIT;
@@ -2048,7 +2102,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 {
int rwnd;
 
-   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT);
+   rwnd = tcp_call_bpf(sk, BPF_SOCK_OPS_RWND_INIT, 0, NULL);
 
if (rwnd < 0)
rwnd = 0;
@@ -2057,7 +2111,7 @@ static inline u32 tcp_rwnd_init_bpf(struct sock *sk)
 
 static inline bool tcp_bpf_ca_needs_ecn(struct sock *sk)
 {
-   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN) == 1);
+   return (tcp_call_bpf(sk, BPF_SOCK_OPS_NEEDS_ECN, 0, NULL) == 1);
 }
 
 #if IS_ENABLED(CONFIG_SMC)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 406c19d..8d5874c 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -952,8 +952,9 @@ struct bpf_map_info {
 struct bpf_sock_ops {
__u32 op;
union {
-   __u32 reply;
-   __u32 replylong[4];
+   __u32 args[4];  /* Optionally passed to bpf program */
+   __u32 reply;/* Returned by bpf program  */
+   __u32 replylong[4]; /* Optionally returned by bpf prog  */
};
__u32 family;
__u32 remote_ip4;   /* Stored in network byte order */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index d7cf861..88b6244 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -463,7 +463,7 @@ void tcp_init_transfer(struct sock *sk, int bpf_op)
tcp_mtup_init(sk);
icsk->icsk_af_ops-
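
The hunk above is cut off in the archive. For a feel of how the new wrappers get
used, the RTO callback added later in the series (patch 06/11) would invoke the
3-argument variant from the TCP timer code; a hedged sketch, with the flag and op
names assumed from patches 05/11 and 06/11:

	/* In tcp_write_timeout(), roughly: */
	if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_RTO_CB_FLAG))
		tcp_call_bpf_3arg(sk, BPF_SOCK_OPS_RTO_CB,
				  icsk->icsk_retransmits,
				  icsk->icsk_rto, (int)expired);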

[PATCH bpf-next v6 00/11] bpf: More sock_ops callbacks

2018-01-19 Thread Lawrence Brakmo
This patchset adds support for:

- direct R or R/W access to many tcp_sock fields
- passing up to 4 arguments to sock_ops BPF functions
- tcp_sock field bpf_sock_ops_cb_flags for controlling callbacks
- optionally calling sock_ops BPF program when RTO fires
- optionally calling sock_ops BPF program when packet is retransmitted
- optionally calling sock_ops BPF program when TCP state changes
- access to tclass and sk_txhash
- new selftest

v2: Fixed commit message 0/11. The commit is to "bpf-next" but the patch
below used "bpf" and Patchwork didn't work correctly.
v3: Cleaned RTO callback as per Yuchung's comment
Added BPF enum for TCP states as per Alexei's comment
v4: Fixed compile warnings related to detecting changes between TCP
internal states and the BPF defined states.
v5: Fixed comment issues in some selftest files
Fixed access issue with u64 fields in bpf_sock_ops struct
v6: Made fixes based on comments from Eric Dumazet:
The field bpf_sock_ops_cb_flags was added in a hole on 64-bit kernels
Field bpf_sock_ops_cb_flags is now set through a helper function
which returns an error when a BPF program tries to set bits for
callbacks that are not supported in the current kernel.
Added a comment indicating that when adding fields to bpf_sock_ops_kern
they should be added before the field named "temp" if they need to be
cleared before calling the BPF function.  

Signed-off-by: Lawrence Brakmo <bra...@fb.com>

Consists of the following patches:
[PATCH bpf-next v6 01/11] bpf: Make SOCK_OPS_GET_TCP size independent
[PATCH bpf-next v6 02/11] bpf: Make SOCK_OPS_GET_TCP struct
[PATCH bpf-next v6 03/11] bpf: Add write access to tcp_sock and sock
[PATCH bpf-next v6 04/11] bpf: Support passing args to sock_ops bpf
[PATCH bpf-next v6 05/11] bpf: Adds field bpf_sock_ops_cb_flags to
[PATCH bpf-next v6 06/11] bpf: Add sock_ops RTO callback
[PATCH bpf-next v6 07/11] bpf: Add support for reading sk_state and
[PATCH bpf-next v6 08/11] bpf: Add sock_ops R/W access to tclass &
[PATCH bpf-next v6 09/11] bpf: Add BPF_SOCK_OPS_RETRANS_CB
[PATCH bpf-next v6 10/11] bpf: Add BPF_SOCK_OPS_STATE_CB
[PATCH bpf-next v6 11/11] bpf: add selftest for tcpbpf

 include/linux/filter.h |  10 ++
 include/linux/tcp.h|  11 ++
 include/net/tcp.h  |  66 +++-
 include/uapi/linux/bpf.h   |  72 +++-
 include/uapi/linux/tcp.h   |   8 +
 net/core/filter.c  | 281 
+---
 net/ipv4/tcp.c |  26 ++-
 net/ipv4/tcp_nv.c  |   2 +-
 net/ipv4/tcp_output.c  |   5 +-
 net/ipv4/tcp_timer.c   |   7 +
 tools/include/uapi/linux/bpf.h |  74 -
 tools/testing/selftests/bpf/Makefile   |   4 +-
 tools/testing/selftests/bpf/bpf_helpers.h  |   2 +
 tools/testing/selftests/bpf/tcp_client.py  |  52 ++
 tools/testing/selftests/bpf/tcp_server.py  |  79 +
 tools/testing/selftests/bpf/test_tcpbpf.h  |  16 ++
 tools/testing/selftests/bpf/test_tcpbpf_kern.c | 131 +++
 tools/testing/selftests/bpf/test_tcpbpf_user.c | 126 ++
 18 files changed, 933 insertions(+), 39 deletions(-)



Re: [PATCH bpf-next v5 05/11] bpf: Adds field bpf_sock_ops_flags to tcp_sock

2018-01-09 Thread Lawrence Brakmo
On 1/9/18, 3:31 PM, "Eric Dumazet" <eric.duma...@gmail.com> wrote:

On Tue, 2018-01-09 at 13:06 -0800, Lawrence Brakmo wrote:
> Adds field bpf_sock_ops_flags to tcp_sock and bpf_sock_ops. Its primary
> use is to determine if there should be calls to sock_ops bpf program at
> various points in the TCP code. The field is initialized to zero,
> disabling the calls. A sock_ops BPF program can set it, per connection and
> as necessary, when the connection is established.
> 
> It also adds support for reading and writing the field within a
> sock_ops BPF program.
> 
> Examples of where to call the bpf program:
> 
> 1) When RTO fires
> 2) When a packet is retransmitted
> 3) When the connection terminates
> 4) When a packet is sent
    > 5) When a packet is received
> 
> Signed-off-by: Lawrence Brakmo <bra...@fb.com>
> ---
>  include/linux/tcp.h  | 8 
>  include/uapi/linux/bpf.h | 1 +
>  net/core/filter.c| 7 +++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 4f93f095..62f4388 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -373,6 +373,14 @@ struct tcp_sock {
>*/
>   struct request_sock *fastopen_rsk;
>   u32 *saved_syn;
> +
> +/* Sock_ops bpf program related variables */
> +#ifdef CONFIG_BPF
> + u32 bpf_sock_ops_flags; /* values defined in uapi/linux/tcp.h */
> +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) (TP->bpf_sock_ops_flags & ARG)
> +#else
> +#define BPF_SOCK_OPS_TEST_FLAG(TP, ARG) 0
> +#endif
>  };
> 

It looks like we add yet another TCP socket field for some feature that
 is only used by you or FB :/

I am certainly hoping it will be used by others as well, but it is not possible
until it is available (many people expressed interest at Netdev 2.2). 

At least please try to fill a hole (on 64bit kernels), instead of
adding one more.

Makes sense.

Also, should not we reject attempts to use bits that are not yet
supported by the kernel ?

If a BPF filter expects to be called for every retransmit, but kernel
is too old, this won't work.

Thanks, good point. I will add a helper function that returns which
callbacks are supported by the current kernel so the bpf program can
take appropriate action.
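
With the v6 helper shown earlier in this thread, that fallback is straightforward
on the program side; a hypothetical sketch, using the helper's documented return
semantics from patch 05/11 and flag names assumed from later patches:

	int wanted = BPF_SOCK_OPS_RTO_CB_FLAG | BPF_SOCK_OPS_RETRANS_CB_FLAG;
	int unsupported = bpf_sock_ops_cb_flags_set(skops, wanted);

	if (unsupported > 0)            /* kernel too old for some callbacks */
		wanted &= ~unsupported; /* track what is actually enabled */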

Eric, thank you for the feedback.





