Re: [PATCH] [0/11] pf checksum modification / refactoring

2016-08-16 Thread Richard Procter


On Tue, 9 Aug 2016, Richard Procter wrote:

> 
> Here is the first step in the pf checksum modification / refactoring 
> series.
> 
> The complete series is available at http://203.79.107.124/. It differs 
> from what I presented at the hackathon only by a small optimisation [0].

Change of plan. Here is the complete phase one (checksum modification).
It's easier to test, and easier to commit. The easier-to-review
steps remain up at http://203.79.107.124/ 

ok?

best, 
Richard. 

Index: net/pf.c
===
--- net.orig/pf.c
+++ net/pf.c
@@ -150,21 +150,28 @@ void   pf_init_threshold(struct 
pf_thre
u_int32_t);
 voidpf_add_threshold(struct pf_threshold *);
 int pf_check_threshold(struct pf_threshold *);
-
-voidpf_change_ap(struct pf_pdesc *, struct pf_addr *,
+int pf_check_tcp_cksum(struct mbuf *, int, int,
+   sa_family_t);
+static __inline voidpf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
+   u_int8_t);
+voidpf_translate_ap(struct pf_pdesc *, struct pf_addr *,
u_int16_t *, struct pf_addr *, u_int16_t);
+voidpf_cksum_fixup_a(u_int16_t *, const struct pf_addr *,
+   const struct pf_addr *, sa_family_t, u_int8_t);
 int pf_modulate_sack(struct pf_pdesc *,
struct pf_state_peer *);
 int pf_icmp_mapping(struct pf_pdesc *, u_int8_t, int *,
u_int16_t *, u_int16_t *);
-voidpf_change_icmp(struct pf_pdesc *, struct pf_addr *,
-   u_int16_t *, struct pf_addr *, struct pf_addr *,
-   u_int16_t);
 int pf_change_icmp_af(struct mbuf *, int,
struct pf_pdesc *, struct pf_pdesc *,
struct pf_addr *, struct pf_addr *, sa_family_t,
sa_family_t);
-int pf_translate_icmp_af(int, void *);
+voidpf_translate_a(struct pf_pdesc *, struct pf_addr *,
+   struct pf_addr *);
+voidpf_translate_icmp(struct pf_pdesc *, struct pf_addr *,
+   u_int16_t *, struct pf_addr *, struct pf_addr *,
+   u_int16_t);
+int pf_translate_icmp_af(struct pf_pdesc*, int, void *);
 voidpf_send_tcp(const struct pf_rule *, sa_family_t,
const struct pf_addr *, const struct pf_addr *,
u_int16_t, u_int16_t, u_int32_t, u_int32_t,
@@ -293,6 +300,8 @@ static __inline int pf_state_compare_key
struct pf_state_key *);
 static __inline int pf_state_compare_id(struct pf_state *,
struct pf_state *);
+static __inline void pf_cksum_uncover(u_int16_t *, u_int16_t, u_int8_t);
+static __inline void pf_cksum_cover(u_int16_t *, u_int16_t, u_int8_t);
 
 struct pf_src_tree tree_src_tracking;
 
@@ -1684,27 +1693,247 @@ pf_addr_wrap_neq(struct pf_addr_wrap *aw
}
 }
 
+/* This algorithm computes 'a + b - c' in ones-complement using a trick to
+ * emulate at most one ones-complement subtraction. This thereby limits net
+ * carries/borrows to at most one, eliminating a reduction step and saving one
+ * each of +, >>, & and ~.
+ *
+ * def. x mod y = x - (x//y)*y for integer x,y
+ * def. sum = x mod 2^16
+ * def. accumulator = (x >> 16) mod 2^16
+ *
+ * The trick works as follows: subtracting exactly one u_int16_t from the
+ * u_int32_t x incurs at most one underflow, wrapping its upper 16-bits, the
+ * accumulator, to 2^16 - 1. Adding this to the 16-bit sum preserves the
+ * ones-complement borrow:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { assume underflow: accumulator := 2^16 - 1 }
+ *  (sum + 2^16 - 1) mod 2^16
+ * =   { mod }
+ *  (sum - 1) mod 2^16
+ *
+ * Although this breaks for sum = 0, giving 0x, which is ones-complement's
+ * other zero, not -1, that cannot occur: the 16-bit sum cannot be underflown
+ * to zero as that requires subtraction of at least 2^16, which exceeds a
+ * single u_int16_t's range.
+ *
+ * We use the following theorem to derive the implementation:
+ *
+ * th. (x + (y mod z)) mod z  =  (x + y) mod z   (0)
+ * proof.
+ * (x + (y mod z)) mod z
+ *=  { def mod }
+ * (x + y - (y//z)*z) mod z
+ *=  { (a + b*c) mod c = a mod c }
+ * (x + y) mod z   [end of proof]
+ *
+ * ... and thereby obtain:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { def. accumulator, def. sum }
+ *  (x mod 2^16 + (x >> 16) mod 2^16) mod 2^16
+ * =   { (0), twice }
+ *  (x + (x >> 16)) mod 2^16
+ * =   { x mod 2^n = x & (2^n - 1) }
+ *  (x + (x >> 16)) & 0x
+ *
+ * Note: this serves also as a reduction 

[PATCH] [0/11] pf checksum modification / refactoring

2016-08-10 Thread Richard Procter

Here is the first step in the pf checksum modification / refactoring 
series.

The complete series is available at http://203.79.107.124/. It differs 
from what I presented at the hackathon only by a small optimisation [0].

Overview


The series is broken into two phases. 

Phase1 is a minimal patch, reintroducing the 5.3 checksum fixup algorithm 
that preserves end-to-end checksums when fiddling with packets, but 
without the mess that motivated Henning to remove it (this is my own 
motivation; there are others). The patch only affects this one aspect of 
Henning's checksum work, without which this patch would be far far uglier.

(Note: checksum modification will not be fully 'live' until regeneration 
is fully removed at the end of phase one.)

Phase2 builds on Phase1 and includes the refactorings I posted last year.

The complete patch shows no apparent performance regression, even slight 
improvement under small-packet DDOS over both hw offload and software 
regeneration. IIRC throughput for 10G ix(4), which lacks offload, 
increases by ~5%. It also shaves off 26 non-comment lines of code (and 
adds 376 bytes of object code on amd64; on i386 it prunes ~500 bytes).

See http://203.79.107.124/UNITTEST.tar for userspace unittests of the new 
fixup algorithm. 

I've been running the complete patch for some time without issue, as has 
sthen@, and possibly others.

Patch Mechanics 
---

The complete series patch totals 50kB; to ease review it is split into 35 
diffs with commentary. These diffs are grouped into 12 patches to be 
committed (each patch shares a three digit prefix, e.g. 000*.diff forms 
one patch). Each aims to leave the code in a consistent state. 

I post for each patch the diffs that comprise it. Following these is their 
sum, the complete patch to be committed. (You can check they're equivalent 
by feeding the entire post to patch(1) and, when it comes to the complete 
patch, agreeing to apply that in reverse. This should leave the source 
unaltered.)

Thanks
--

Thanks to everyone who has supported this work in one way or another, and 
special mention to reannz.co.nz for providing me with a 10Gb test harness. 
Any faults remain of course my own.

[0] small optimisation: see 
http://203.79.107.124/017b_pf_change_32_remove_udp.diff 


ok? 

- BEGIN DIFFS FOR PATCH 000 -
-
* Re-introduce pf_cksum_fixup()

- Same algorithm as in 5.3 but for one well-tested tweak, detailed 
  below.

Unlike 5.3, the checksum is passed by reference for concision, replacing

*pd->pcksum = pf_cksum_fixup(*pd->pcksum, ...)
with
pf_cksum_fixup(pd->pcksum, ...)

Note: although this precludes the compiler optimisations for nested calls of
pure functions, which 5.3 took advantage of for its nested (unreadable)
pf_cksum_fixup() chains, these optimisations are no longer relevant: with the
introduction of pf_cksum_fixup_a() below, at most two consecutive
pf_cksum_fixup() calls are needed. (EON)

Regards the fixup algorithm tweak: The OpenBSD 5.3 fixup was (in essence)

x = ((x & 0x) + (x >> 16)) & 0x;

the new, equivalent, line is:

x = (x + (x >> 16)) & 0x;

For justification, see source comments. 

* Introduce pf_patch_{8,16,16_unaligned,32}() interface

- modification of checksum-covered data is 
  assignment-with-side-effects. 
- all new functions will be used by later patches

 +ve provides type-appropriate checksum modification
 +ve will replace existing 'altered value' guards, 
 reducing code length
 -ve five new functions in total

C assignment hides behind one assignment operator the nitty gritty of
differing l-value widths. As we cannot change the language to suit our needs,
we are obliged to expose these differences in our interface.

An added wrinkle is that our side-effect, namely, modifying the checksum,
depends on the alignment of the l-value within the packet (the checksum's
summands are 16-bit aligned with respect to the packet). So the interface
provides _unaligned() versions parameterised by the l-value's packet
alignment, either 'hi' or 'lo'. Thankfully, these are for most protocol fields
unnecessary.  

Later patches will augment these functions with 'altered value' guards,
allowing us to replace, e.g.

if (icmpid != pd->hdr.icmp->icmp_id) {
if (pd->csum_status == PF_CSUM_UNKNOWN)
pf_check_proto_cksum(pd, pd->off,
pd->tot_len - pd->off, pd->proto,
pd->af);
pd->hdr.icmp->icmp_id = icmpid;
rewrite = 1;
}

with
rewrite += pf_patch_16(pd, >hdr.icmp->icmp_id, icmpid);

Lastly, thanks to mikeb@ for the name 'pf_patch_*'.

* Convert miscellaneous packet alteration to pf_patch_*() interface and
checksum modification.