Here is the first step in the pf checksum modification / refactoring 
series.

The complete series is available at http://203.79.107.124/. It differs 
from what I presented at the hackathon only by a small optimisation [0].

Overview
--------

The series is broken into two phases. 

Phase 1 is a minimal patch. It reintroduces the 5.3 checksum fixup
algorithm, which preserves end-to-end checksums when packets are altered
(this is my own motivation; there are others), but without the mess that
motivated Henning to remove it. The patch affects only this one aspect of
Henning's checksum work, without which it would be far, far uglier.

(Note: checksum modification will not be fully 'live' until regeneration
is fully removed at the end of Phase 1.)

Phase 2 builds on Phase 1 and includes the refactorings I posted last year.

The complete patch shows no apparent performance regression, and even a
slight improvement under small-packet DDoS, compared with both hardware
offload and software regeneration. IIRC throughput for 10G ix(4), which
lacks offload, increases by ~5%. It also shaves off 26 non-comment lines of
code (and adds 376 bytes of object code on amd64; on i386 it prunes ~500
bytes).

See http://203.79.107.124/UNITTEST.tar for userspace unittests of the new 
fixup algorithm. 

I've been running the complete patch for some time without issue, as has 
sthen@, and possibly others.

Patch Mechanics 
---------------

The complete series patch totals 50kB; to ease review it is split into 35
diffs with commentary. These diffs are grouped into 12 patches to be
committed (the diffs of each patch share a three-digit prefix, e.g.
000*.diff forms one patch). Each patch aims to leave the code in a
consistent state.

I post for each patch the diffs that comprise it. Following these is their 
sum, the complete patch to be committed. (You can check they're equivalent 
by feeding the entire post to patch(1) and, when it comes to the complete 
patch, agreeing to apply that in reverse. This should leave the source 
unaltered.)

Thanks
------

Thanks to everyone who has supported this work in one way or another, and 
special mention to reannz.co.nz for providing me with a 10Gb test harness. 
Any faults remain of course my own.

[0] small optimisation: see 
http://203.79.107.124/017b_pf_change_32_remove_udp.diff 


ok? 

------------- BEGIN DIFFS FOR PATCH 000 ---------------------
-------------------------------------------------------------
* Re-introduce pf_cksum_fixup()
        
        - Same algorithm as in 5.3 but for one well-tested tweak, detailed 
          below.
        
Unlike 5.3, the checksum is passed by reference for concision, replacing

        *pd->pcksum = pf_cksum_fixup(*pd->pcksum, ...)
with
        pf_cksum_fixup(pd->pcksum, ...)

Note: although this precludes the compiler optimisations for nested calls of
pure functions, which 5.3 took advantage of for its nested (unreadable)
pf_cksum_fixup() chains, these optimisations are no longer relevant: with the
introduction of pf_cksum_fixup_a() below, at most two consecutive
pf_cksum_fixup() calls are needed. (EON)

Regarding the fixup algorithm tweak: The OpenBSD 5.3 fixup was (in essence)

        x = ((x & 0xffff) + (x >> 16)) & 0xffff;

the new, equivalent, line is:

        x = (x + (x >> 16)) & 0xffff;

For justification, see source comments. 
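
As a quick userspace illustration (not part of the patch), the sketch below
compares the two reduction lines. The names reduce_old(), reduce_new() and
count_mismatches() are made up for this example. The two agree because
(a + b) mod 2^16 = ((a mod 2^16) + b) mod 2^16, so the inner mask is
redundant; the loop merely spot-checks that over a grid of inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Reduction as in the 5.3 fixup (in essence). */
static uint16_t
reduce_old(uint32_t x)
{
	return (uint16_t)(((x & 0xffff) + (x >> 16)) & 0xffff);
}

/* Reduction as in this patch. */
static uint16_t
reduce_new(uint32_t x)
{
	return (uint16_t)((x + (x >> 16)) & 0xffff);
}

/*
 * Count disagreements for x = cksum + was - now over a grid of triples.
 * The stride 0x101 divides 0xffff, so both 0x0000 and 0xffff are visited.
 */
static unsigned long
count_mismatches(void)
{
	unsigned long bad = 0;
	uint32_t c, w, n, x;

	for (c = 0; c <= 0xffff; c += 0x101)
		for (w = 0; w <= 0xffff; w += 0x101)
			for (n = 0; n <= 0xffff; n += 0x101) {
				x = c + w - n;	/* wraps on underflow */
				if (reduce_old(x) != reduce_new(x))
					bad++;
			}
	return bad;
}

/* Usage: both forms agree, including on an underflowed x. */
static void
check_reduce(void)
{
	assert(count_mismatches() == 0);
	/* 0x000f - 0x0020 underflows; the borrow is folded back in */
	assert(reduce_new(0xffffffefU) == 0xffee);
	assert(reduce_old(0xffffffefU) == 0xffee);
}
```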

* Introduce pf_patch_{8,16,16_unaligned,32}() interface
        
        - modification of checksum-covered data is 
          assignment-with-side-effects. 
        - all new functions will be used by later patches

     +ve provides type-appropriate checksum modification
     +ve will replace existing 'altered value' guards, 
         reducing code length
     -ve five new functions in total

C assignment hides behind one assignment operator the nitty gritty of
differing l-value widths. As we cannot change the language to suit our needs,
we are obliged to expose these differences in our interface.

An added wrinkle is that our side-effect, namely, modifying the checksum,
depends on the alignment of the l-value within the packet (the checksum's
summands are 16-bit aligned with respect to the packet). So the interface
provides _unaligned() versions parameterised by the l-value's packet
alignment, either 'hi' or 'lo'. Thankfully, these are for most protocol fields
unnecessary.  
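
To make the 'hi'/'lo' distinction concrete, here is a rough userspace sketch
(again, not the patch itself). It assumes a plain byte buffer summed as
big-endian 16-bit words rather than a struct pf_pdesc, and the helper names
cksum_full(), fixup() and patch_8() are invented for the example. A byte at
an even packet offset is the high half of its summand, a byte at an odd
offset the low half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full Internet checksum over len bytes (len even), big-endian words. */
static uint16_t
cksum_full(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)(buf[i] << 8 | buf[i + 1]);
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)~sum;
}

/* Incremental update: cksum + was - now in ones-complement. */
static uint16_t
fixup(uint16_t cksum, uint16_t was, uint16_t now)
{
	uint32_t x = (uint32_t)cksum + was - now;

	x = (x + (x >> 16)) & 0xffff;
	return (uint16_t)x;
}

/* Overwrite buf[off] with v and return the adjusted checksum. */
static uint16_t
patch_8(uint8_t *buf, size_t off, uint8_t v, uint16_t cksum)
{
	int hi = (off % 2) == 0;	/* even offset: high byte of summand */
	uint16_t was = hi ? (uint16_t)(buf[off] << 8) : buf[off];
	uint16_t now = hi ? (uint16_t)(v << 8) : v;

	buf[off] = v;
	return fixup(cksum, was, now);
}

/* Usage: incremental updates match a full recomputation. */
static void
check_patch_8(void)
{
	uint8_t buf[6] = { 0x45, 0x00, 0x12, 0x34, 0xab, 0xcd };
	uint16_t c = cksum_full(buf, sizeof(buf));

	c = patch_8(buf, 2, 0x99, c);		/* 'hi' byte */
	assert(c == cksum_full(buf, sizeof(buf)));
	c = patch_8(buf, 5, 0x01, c);		/* 'lo' byte */
	assert(c == cksum_full(buf, sizeof(buf)));
}
```

Passing the wrong alignment would add or subtract the byte in the wrong half
of the summand, so the incremental result would no longer match.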

Later patches will augment these functions with 'altered value' guards,
allowing us to replace, e.g.

        if (icmpid != pd->hdr.icmp->icmp_id) {
                if (pd->csum_status == PF_CSUM_UNKNOWN)
                        pf_check_proto_cksum(pd, pd->off,
                            pd->tot_len - pd->off, pd->proto,
                            pd->af);
                pd->hdr.icmp->icmp_id = icmpid;
                rewrite = 1;
        }

with
        rewrite += pf_patch_16(pd, &pd->hdr.icmp->icmp_id, icmpid);
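
For illustration only, a guarded variant might look roughly like the
userspace sketch below. It is a guess at the shape, not the code of the
later patches: patch_16_guarded() and fixup() are hypothetical names, the
checksum is passed directly instead of via pd, and the UDP special case is
left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the checksum fixup (no UDP handling). */
static void
fixup(uint16_t *cksum, uint16_t was, uint16_t now)
{
	uint32_t x = (uint32_t)*cksum + was - now;

	x = (x + (x >> 16)) & 0xffff;
	*cksum = (uint16_t)x;
}

/*
 * Hypothetical guarded patch: assign v to *f and adjust *cksum only if
 * the value differs; return 1 if the packet was altered, else 0.
 */
static int
patch_16_guarded(uint16_t *cksum, uint16_t *f, uint16_t v)
{
	if (*f == v)
		return (0);
	fixup(cksum, *f, v);
	*f = v;
	return (1);
}

/* Usage: callers accumulate the return value into 'rewrite'. */
static void
check_guarded(void)
{
	uint16_t cksum = 0x1234, id = 0x0007;
	int rewrite = 0;

	rewrite += patch_16_guarded(&cksum, &id, 0x0007);	/* no change */
	assert(rewrite == 0 && cksum == 0x1234);

	rewrite += patch_16_guarded(&cksum, &id, 0x0009);
	assert(rewrite == 1 && id == 0x0009 && cksum == 0x1232);
}
```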

Lastly, thanks to mikeb@ for the name 'pf_patch_*'.

* Convert miscellaneous packet alteration to pf_patch_*() interface and
checksum modification.

As these now modify the checksum, they no longer need to call
pf_check_proto_cksum(), which is used when regenerating checksums. (Other
parts of the code do not appear to depend on these removed calls.)

* Initialise pd->pcksum for icmp6 

        - ensures pcksum is set for all known checksummed protocols   
        - ICMP is not performance critical
        - pf_patch_*() relies on this for icmp6 packets

Index: net/pf.c
===================================================================
--- net.orig/pf.c
+++ net/pf.c
@@ -150,7 +150,8 @@ void                         pf_init_threshold(struct pf_thre
                            u_int32_t);
 void                    pf_add_threshold(struct pf_threshold *);
 int                     pf_check_threshold(struct pf_threshold *);
-
+void                    pf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
+                           u_int8_t);
 void                    pf_change_ap(struct pf_pdesc *, struct pf_addr *,
                            u_int16_t *, struct pf_addr *, u_int16_t);
 int                     pf_modulate_sack(struct pf_pdesc *,
@@ -1684,6 +1685,124 @@ pf_addr_wrap_neq(struct pf_addr_wrap *aw
        }
 }
 
+/* This algorithm is an optimised special case of a method for emulating
+ * 16-bit ones-complement sums on a twos-complement machine. That, more
+ * general, method conserves ones-complement's carries, which twos-complement
+ * otherwise discards, in the upper bits of x and these accumulated carries
+ * when added to the lower 16-bits over at least zero 'reduction' steps then
+ * complete the ones-complement sum.
+ *
+ * This algorithm computes 'a + b - c' in ones-complement using a trick to
+ * emulate at most one ones-complement subtraction. This thereby limits net
+ * carries/borrows to at most one, eliminating a reduction step and saving one
+ * each of +, >>, & and ~.
+ *
+ * def. x mod y = x - (x//y)*y for integer x,y
+ * def. sum = x mod 2^16
+ * def. accumulator = (x >> 16) mod 2^16
+ *
+ * The trick works as follows: subtracting exactly one u_int16_t from the
+ * u_int32_t x incurs at most one underflow, wrapping its upper 16-bits, the
+ * accumulator, to 2^16 - 1. Adding this to the 16-bit sum preserves the
+ * ones-complement borrow:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { assume underflow: accumulator := 2^16 - 1 }
+ *  (sum + 2^16 - 1) mod 2^16
+ * =   { mod }
+ *  (sum - 1) mod 2^16
+ *
+ * Although this breaks for sum = 0, giving 0xffff, which is ones-complement's
+ * other zero, not -1, that cannot occur: the 16-bit sum cannot be underflown
+ * to zero as that requires subtraction of at least 2^16, which exceeds a
+ * single u_int16_t's range.
+ *
+ * We use the following theorem to derive the implementation:
+ *
+ * th. (x + (y mod z)) mod z  =  (x + y) mod z   (0)
+ * proof.
+ *     (x + (y mod z)) mod z
+ *    =  { def mod }
+ *     (x + y - (y//z)*z) mod z
+ *    =  { (a + b*c) mod c = a mod c }
+ *     (x + y) mod z                   [end of proof]
+ *
+ * ... and thereby obtain:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { def. accumulator, def. sum }
+ *  (x mod 2^16 + (x >> 16) mod 2^16) mod 2^16
+ * =   { (0), twice }
+ *  (x + (x >> 16)) mod 2^16
+ * =   { x mod 2^n = x & (2^n - 1) }
+ *  (x + (x >> 16)) & 0xffff
+ *
+ * Note: this serves also as a reduction step for at most one add (as the
+ * trailing mod 2^16 prevents further reductions by destroying carries).
+ */
+void
+pf_cksum_fixup(u_int16_t *cksum, u_int16_t was, u_int16_t now,
+    u_int8_t proto)
+{
+       u_int32_t x;
+       const int udp = proto == IPPROTO_UDP;
+
+       x = *cksum + was - now;
+       x = (x + (x >> 16)) & 0xffff;
+
+       /* optimise: eliminate a branch when not udp */
+       if (udp && *cksum == 0x0000)
+               return;
+       if (udp && x == 0x0000)
+               x = 0xffff;
+
+        *cksum = (u_int16_t)(x);
+}
+
+void
+pf_patch_8(struct pf_pdesc *pd, u_int8_t *f, u_int8_t v, bool hi)
+{
+       u_int16_t new = htons(hi ? ( v << 8) :  v);
+       u_int16_t old = htons(hi ? (*f << 8) : *f);
+
+       pf_cksum_fixup(pd->pcksum, old, new, pd->proto);
+       *f = v;
+}
+
+/* pre: *f is 16-bit aligned within its packet */
+void
+pf_patch_16(struct pf_pdesc *pd, u_int16_t *f, u_int16_t v)
+{
+       pf_cksum_fixup(pd->pcksum, *f, v, pd->proto);
+       *f = v;
+}
+
+void
+pf_patch_16_unaligned(struct pf_pdesc *pd, void *f, u_int16_t v, bool hi)
+{
+       u_int8_t *fb = (u_int8_t*)f;
+       u_int8_t *vb = (u_int8_t*)&v;
+
+       if (hi && ALIGNED_POINTER(f, u_int16_t)) {
+               pf_patch_16(pd, f, v); /* optimise */
+               return;
+       }
+
+       pf_patch_8(pd, fb++, *vb++, hi);
+       pf_patch_8(pd, fb++, *vb++,!hi);
+}
+
+/* pre: *f is 16-bit aligned within its packet */
+void
+pf_patch_32(struct pf_pdesc *pd, u_int32_t *f, u_int32_t v)
+{
+       u_int16_t *pc = pd->pcksum;
+
+       pf_cksum_fixup(pc, *f / (1 << 16), v / (1 << 16), pd->proto);
+       pf_cksum_fixup(pc, *f % (1 << 16), v % (1 << 16), pd->proto);
+       *f = v;
+}
+
 void
 pf_change_ap(struct pf_pdesc *pd, struct pf_addr *a, u_int16_t *p,
     struct pf_addr *an, u_int16_t pn)
@@ -3750,11 +3869,8 @@ pf_translate(struct pf_pdesc *pd, struct
                        u_int16_t icmpid = (icmp_dir == PF_IN) ? sport : dport;
 
                        if (icmpid != pd->hdr.icmp->icmp_id) {
-                               if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                       pf_check_proto_cksum(pd, pd->off,
-                                           pd->tot_len - pd->off, pd->proto,
-                                           pd->af);
-                               pd->hdr.icmp->icmp_id = icmpid;
+                               pf_patch_16(pd,
+                                   &pd->hdr.icmp->icmp_id, icmpid);
                                rewrite = 1;
                        }
                }
@@ -3786,11 +3902,8 @@ pf_translate(struct pf_pdesc *pd, struct
                        u_int16_t icmpid = (icmp_dir == PF_IN) ? sport : dport;
 
                        if (icmpid != pd->hdr.icmp6->icmp6_id) {
-                               if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                       pf_check_proto_cksum(pd, pd->off,
-                                           pd->tot_len - pd->off, pd->proto,
-                                           pd->af);
-                               pd->hdr.icmp6->icmp6_id = icmpid;
+                               pf_patch_16(pd,
+                                   &pd->hdr.icmp6->icmp6_id, icmpid);
                                rewrite = 1;
                        }
                }
@@ -4599,11 +4712,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                }
 
                                if (nk->port[iidx] !=  pd->hdr.icmp->icmp_id) {
-                                       if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                               pf_check_proto_cksum(pd,
-                                                   pd->off, pd->tot_len -
-                                                   pd->off, pd->proto, pd->af);
-                                       pd->hdr.icmp->icmp_id = nk->port[iidx];
+                                       pf_patch_16(pd, &pd->hdr.icmp->icmp_id,
+                                           nk->port[iidx]);
                                }
 
                                m_copyback(pd->m, pd->off, ICMP_MINLEN,
@@ -4631,12 +4741,9 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                }
 
                                if (nk->port[iidx] != pd->hdr.icmp6->icmp6_id) {
-                                       if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                               pf_check_proto_cksum(pd,
-                                                   pd->off, pd->tot_len -
-                                                   pd->off, pd->proto, pd->af);
-                                       pd->hdr.icmp6->icmp6_id =
-                                           nk->port[iidx];
+                                       pf_patch_16(pd,
+                                           &pd->hdr.icmp6->icmp6_id,
+                                           nk->port[iidx]);
                                }
 
                                m_copyback(pd->m, pd->off,
@@ -6274,6 +6381,7 @@ pf_setup_pdesc(struct pf_pdesc *pd, void
                        REASON_SET(reason, PFRES_SHORT);
                        return (PF_DROP);
                }
+               pd->pcksum = &pd->hdr.icmp6->icmp6_cksum;
                break;
        }
 #endif /* INET6 */
Index: net/pfvar.h
===================================================================
--- net.orig/pfvar.h
+++ net/pfvar.h
@@ -1715,6 +1715,13 @@ void     pf_addr_inc(struct pf_addr *, sa_fa
 
 void   *pf_pull_hdr(struct mbuf *, int, void *, int, u_short *, u_short *,
            sa_family_t);
+#define PF_HI (true)
+#define PF_LO (!PF_HI)
+#define PF_ALGNMNT(off) (((off) % 2) == 0 ? PF_HI : PF_LO)
+void   pf_patch_8(struct pf_pdesc *, u_int8_t *, u_int8_t, bool);
+void   pf_patch_16(struct pf_pdesc *, u_int16_t *, u_int16_t);
+void   pf_patch_16_unaligned(struct pf_pdesc *, void *, u_int16_t, bool);
+void   pf_patch_32(struct pf_pdesc *, u_int32_t *, u_int32_t);
 void   pf_change_a(struct pf_pdesc *, void *, u_int32_t);
 int    pf_check_proto_cksum(struct pf_pdesc *, int, int, u_int8_t,
            sa_family_t);
Index: net/pf_norm.c
===================================================================
--- net.orig/pf_norm.c
+++ net/pf_norm.c
@@ -855,10 +855,6 @@ pf_normalize_tcp(struct pf_pdesc *pd)
        u_int8_t         flags;
        u_int            rewrite = 0;
 
-       if (pd->csum_status == PF_CSUM_UNKNOWN)
-               pf_check_proto_cksum(pd, pd->off, pd->tot_len - pd->off,
-                   pd->proto, pd->af);
-
        flags = th->th_flags;
        if (flags & TH_SYN) {
                /* Illegal packet */
@@ -880,15 +876,18 @@ pf_normalize_tcp(struct pf_pdesc *pd)
        }
 
        /* If flags changed, or reserved data set, then adjust */
-       if (flags != th->th_flags || th->th_x2 != 0) {
-               th->th_flags = flags;
-               th->th_x2 = 0;
-               rewrite = 1;
-       }
+       if (flags != th->th_flags || th->th_x2 != 0) {
+               /* hack: set 4-bit th_x2 = 0 */
+               u_int8_t *th_off = (u_int8_t*)(&th->th_ack+1);
+               pf_patch_8(pd, th_off, th->th_off << 4, PF_HI);
+
+               pf_patch_8(pd, &th->th_flags, flags, PF_LO);
+               rewrite = 1;
+       }
 
        /* Remove urgent pointer, if TH_URG is not set */
        if (!(flags & TH_URG) && th->th_urp) {
-               th->th_urp = 0;
+               pf_patch_16(pd, &th->th_urp, 0);
                rewrite = 1;
        }
 
@@ -1391,12 +1390,8 @@ pf_normalize_mss(struct pf_pdesc *pd, u_
        u_int16_t        mss;
        int              thoff;
        int              opt, cnt, optlen = 0;
-       u_char           opts[MAX_TCPOPTLEN];
-       u_char          *optp = opts;
-
-       if (pd->csum_status == PF_CSUM_UNKNOWN)
-               pf_check_proto_cksum(pd, pd->off, pd->tot_len - pd->off,
-                   pd->proto, pd->af);
+       u_int8_t         opts[MAX_TCPOPTLEN];
+       u_int8_t        *optp = opts;
 
        thoff = th->th_off << 2;
        cnt = thoff - sizeof(struct tcphdr);
@@ -1419,12 +1414,15 @@ pf_normalize_mss(struct pf_pdesc *pd, u_
                                break;
                }
                if (opt == TCPOPT_MAXSEG) {
-                       memcpy(&mss, (optp + 2), 2);
+                       u_int8_t *mssp = optp + 2;
+                       memcpy(&mss, mssp, sizeof(mss));
                        if (ntohs(mss) > maxmss) {
-                               mss = htons(maxmss);
+                               size_t mssoffopts = mssp - opts;
+                               pf_patch_16_unaligned(pd, &mss,
+                                   htons(maxmss), PF_ALGNMNT(mssoffopts));
                                m_copyback(pd->m,
-                                   pd->off + sizeof(*th) + optp + 2 - opts,
-                                   2, &mss, M_NOWAIT);
+                                   pd->off + sizeof(*th) + mssoffopts,
+                                   sizeof(mss), &mss, M_NOWAIT);
                                pf_cksum(pd, pd->m);
                                m_copyback(pd->m, pd->off, sizeof(*th), th,
                                    M_NOWAIT);
-------------------------------------------------------------
* Inline pf_cksum_fixup() 

For justification, see efficiency testing results (particularly, phase2-inline).

Index: net/pf.c
===================================================================
--- net.orig/pf.c
+++ net/pf.c
@@ -150,7 +150,7 @@ void                         pf_init_threshold(struct pf_thre
                            u_int32_t);
 void                    pf_add_threshold(struct pf_threshold *);
 int                     pf_check_threshold(struct pf_threshold *);
-void                    pf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
+static __inline void    pf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
                            u_int8_t);
 void                    pf_change_ap(struct pf_pdesc *, struct pf_addr *,
                            u_int16_t *, struct pf_addr *, u_int16_t);
@@ -1740,7 +1740,7 @@ pf_addr_wrap_neq(struct pf_addr_wrap *aw
  * Note: this serves also as a reduction step for at most one add (as the
  * trailing mod 2^16 prevents further reductions by destroying carries).
  */
-void
+static __inline void
 pf_cksum_fixup(u_int16_t *cksum, u_int16_t was, u_int16_t now,
     u_int8_t proto)
 {
-------------------------------------------------------------
* Convert three packet modifications not obviously guarded by the necessary

        if (pd->csum_status == PF_CSUM_UNKNOWN)
                pf_check_proto_cksum()

  to pf_patch_*()

I suppose, but haven't checked, that these were originally correct.

I audited pf_test_state_icmp() for other unguarded instances, found none, and
I am reasonably confident no others exist in PF as these would have been
exposed in testing.

Index: net/pf.c
===================================================================
--- net.orig/pf.c
+++ net/pf.c
@@ -5132,7 +5132,10 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                        break;
 #endif /* INET6 */
                                }
-                               uh.uh_sum = 0;
+                               /* Avoid recomputing quoted UDP checksum.
+                                * note: udp6 0 csum invalid per rfc2460 p27.
+                                * but presumed nothing cares in this context */
+                               pf_patch_16(pd, &uh.uh_sum, 0);
                                m_copyback(pd2.m, pd2.off, sizeof(uh), &uh,
                                    M_NOWAIT);
                                copyback = 1;
@@ -5198,7 +5201,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                                return (PF_DROP);
                                        if (virtual_type == htons(ICMP_ECHO) &&
                                            nk->port[iidx] != iih.icmp_id)
-                                               iih.icmp_id = nk->port[iidx];
+                                               pf_patch_16(pd, &iih.icmp_id,
+                                                   nk->port[iidx]);
                                        m_copyback(pd2.m, pd2.off, ICMP_MINLEN,
                                            &iih, M_NOWAIT);
                                        pd->m->m_pkthdr.ph_rtableid =
@@ -5309,7 +5313,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                        if (virtual_type ==
                                            htons(ICMP6_ECHO_REQUEST) &&
                                            nk->port[iidx] != iih.icmp6_id)
-                                               iih.icmp6_id = nk->port[iidx];
+                                               pf_patch_16(pd, &iih.icmp6_id,
+                                                   nk->port[iidx]);
                                        m_copyback(pd2.m, pd2.off,
                                            sizeof(struct icmp6_hdr), &iih,
                                            M_NOWAIT);

------------- END OF DIFFS FOR PATCH 000 -------------------
----- EQUIVALENT COMBINED PATCH FOR COMMIT FOLLOWS ---------

Index: net/pf.c
===================================================================
--- net.orig/pf.c
+++ net/pf.c
@@ -150,7 +150,8 @@ void                         pf_init_threshold(struct pf_thre
                            u_int32_t);
 void                    pf_add_threshold(struct pf_threshold *);
 int                     pf_check_threshold(struct pf_threshold *);
-
+static __inline void    pf_cksum_fixup(u_int16_t *, u_int16_t, u_int16_t,
+                           u_int8_t);
 void                    pf_change_ap(struct pf_pdesc *, struct pf_addr *,
                            u_int16_t *, struct pf_addr *, u_int16_t);
 int                     pf_modulate_sack(struct pf_pdesc *,
@@ -1684,6 +1685,124 @@ pf_addr_wrap_neq(struct pf_addr_wrap *aw
        }
 }
 
+/* This algorithm is an optimised special case of a method for emulating
+ * 16-bit ones-complement sums on a twos-complement machine. That, more
+ * general, method conserves ones-complement's carries, which twos-complement
+ * otherwise discards, in the upper bits of x and these accumulated carries
+ * when added to the lower 16-bits over at least zero 'reduction' steps then
+ * complete the ones-complement sum.
+ *
+ * This algorithm computes 'a + b - c' in ones-complement using a trick to
+ * emulate at most one ones-complement subtraction. This thereby limits net
+ * carries/borrows to at most one, eliminating a reduction step and saving one
+ * each of +, >>, & and ~.
+ *
+ * def. x mod y = x - (x//y)*y for integer x,y
+ * def. sum = x mod 2^16
+ * def. accumulator = (x >> 16) mod 2^16
+ *
+ * The trick works as follows: subtracting exactly one u_int16_t from the
+ * u_int32_t x incurs at most one underflow, wrapping its upper 16-bits, the
+ * accumulator, to 2^16 - 1. Adding this to the 16-bit sum preserves the
+ * ones-complement borrow:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { assume underflow: accumulator := 2^16 - 1 }
+ *  (sum + 2^16 - 1) mod 2^16
+ * =   { mod }
+ *  (sum - 1) mod 2^16
+ *
+ * Although this breaks for sum = 0, giving 0xffff, which is ones-complement's
+ * other zero, not -1, that cannot occur: the 16-bit sum cannot be underflown
+ * to zero as that requires subtraction of at least 2^16, which exceeds a
+ * single u_int16_t's range.
+ *
+ * We use the following theorem to derive the implementation:
+ *
+ * th. (x + (y mod z)) mod z  =  (x + y) mod z   (0)
+ * proof.
+ *     (x + (y mod z)) mod z
+ *    =  { def mod }
+ *     (x + y - (y//z)*z) mod z
+ *    =  { (a + b*c) mod c = a mod c }
+ *     (x + y) mod z                   [end of proof]
+ *
+ * ... and thereby obtain:
+ *
+ *  (sum + accumulator) mod 2^16
+ * =   { def. accumulator, def. sum }
+ *  (x mod 2^16 + (x >> 16) mod 2^16) mod 2^16
+ * =   { (0), twice }
+ *  (x + (x >> 16)) mod 2^16
+ * =   { x mod 2^n = x & (2^n - 1) }
+ *  (x + (x >> 16)) & 0xffff
+ *
+ * Note: this serves also as a reduction step for at most one add (as the
+ * trailing mod 2^16 prevents further reductions by destroying carries).
+ */
+static __inline void
+pf_cksum_fixup(u_int16_t *cksum, u_int16_t was, u_int16_t now,
+    u_int8_t proto)
+{
+       u_int32_t x;
+       const int udp = proto == IPPROTO_UDP;
+
+       x = *cksum + was - now;
+       x = (x + (x >> 16)) & 0xffff;
+
+       /* optimise: eliminate a branch when not udp */
+       if (udp && *cksum == 0x0000)
+               return;
+       if (udp && x == 0x0000)
+               x = 0xffff;
+
+        *cksum = (u_int16_t)(x);
+}
+
+void
+pf_patch_8(struct pf_pdesc *pd, u_int8_t *f, u_int8_t v, bool hi)
+{
+       u_int16_t new = htons(hi ? ( v << 8) :  v);
+       u_int16_t old = htons(hi ? (*f << 8) : *f);
+
+       pf_cksum_fixup(pd->pcksum, old, new, pd->proto);
+       *f = v;
+}
+
+/* pre: *f is 16-bit aligned within its packet */
+void
+pf_patch_16(struct pf_pdesc *pd, u_int16_t *f, u_int16_t v)
+{
+       pf_cksum_fixup(pd->pcksum, *f, v, pd->proto);
+       *f = v;
+}
+
+void
+pf_patch_16_unaligned(struct pf_pdesc *pd, void *f, u_int16_t v, bool hi)
+{
+       u_int8_t *fb = (u_int8_t*)f;
+       u_int8_t *vb = (u_int8_t*)&v;
+
+       if (hi && ALIGNED_POINTER(f, u_int16_t)) {
+               pf_patch_16(pd, f, v); /* optimise */
+               return;
+       }
+
+       pf_patch_8(pd, fb++, *vb++, hi);
+       pf_patch_8(pd, fb++, *vb++,!hi);
+}
+
+/* pre: *f is 16-bit aligned within its packet */
+void
+pf_patch_32(struct pf_pdesc *pd, u_int32_t *f, u_int32_t v)
+{
+       u_int16_t *pc = pd->pcksum;
+
+       pf_cksum_fixup(pc, *f / (1 << 16), v / (1 << 16), pd->proto);
+       pf_cksum_fixup(pc, *f % (1 << 16), v % (1 << 16), pd->proto);
+       *f = v;
+}
+
 void
 pf_change_ap(struct pf_pdesc *pd, struct pf_addr *a, u_int16_t *p,
     struct pf_addr *an, u_int16_t pn)
@@ -3750,11 +3869,8 @@ pf_translate(struct pf_pdesc *pd, struct
                        u_int16_t icmpid = (icmp_dir == PF_IN) ? sport : dport;
 
                        if (icmpid != pd->hdr.icmp->icmp_id) {
-                               if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                       pf_check_proto_cksum(pd, pd->off,
-                                           pd->tot_len - pd->off, pd->proto,
-                                           pd->af);
-                               pd->hdr.icmp->icmp_id = icmpid;
+                               pf_patch_16(pd,
+                                   &pd->hdr.icmp->icmp_id, icmpid);
                                rewrite = 1;
                        }
                }
@@ -3786,11 +3902,8 @@ pf_translate(struct pf_pdesc *pd, struct
                        u_int16_t icmpid = (icmp_dir == PF_IN) ? sport : dport;
 
                        if (icmpid != pd->hdr.icmp6->icmp6_id) {
-                               if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                       pf_check_proto_cksum(pd, pd->off,
-                                           pd->tot_len - pd->off, pd->proto,
-                                           pd->af);
-                               pd->hdr.icmp6->icmp6_id = icmpid;
+                               pf_patch_16(pd,
+                                   &pd->hdr.icmp6->icmp6_id, icmpid);
                                rewrite = 1;
                        }
                }
@@ -4599,11 +4712,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                }
 
                                if (nk->port[iidx] !=  pd->hdr.icmp->icmp_id) {
-                                       if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                               pf_check_proto_cksum(pd,
-                                                   pd->off, pd->tot_len -
-                                                   pd->off, pd->proto, pd->af);
-                                       pd->hdr.icmp->icmp_id = nk->port[iidx];
+                                       pf_patch_16(pd, &pd->hdr.icmp->icmp_id,
+                                           nk->port[iidx]);
                                }
 
                                m_copyback(pd->m, pd->off, ICMP_MINLEN,
@@ -4631,12 +4741,9 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                }
 
                                if (nk->port[iidx] != pd->hdr.icmp6->icmp6_id) {
-                                       if (pd->csum_status == PF_CSUM_UNKNOWN)
-                                               pf_check_proto_cksum(pd,
-                                                   pd->off, pd->tot_len -
-                                                   pd->off, pd->proto, pd->af);
-                                       pd->hdr.icmp6->icmp6_id =
-                                           nk->port[iidx];
+                                       pf_patch_16(pd,
+                                           &pd->hdr.icmp6->icmp6_id,
+                                           nk->port[iidx]);
                                }
 
                                m_copyback(pd->m, pd->off,
@@ -5025,7 +5132,10 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                        break;
 #endif /* INET6 */
                                }
-                               uh.uh_sum = 0;
+                               /* Avoid recomputing quoted UDP checksum.
+                                * note: udp6 0 csum invalid per rfc2460 p27.
+                                * but presumed nothing cares in this context */
+                               pf_patch_16(pd, &uh.uh_sum, 0);
                                m_copyback(pd2.m, pd2.off, sizeof(uh), &uh,
                                    M_NOWAIT);
                                copyback = 1;
@@ -5091,7 +5201,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                                return (PF_DROP);
                                        if (virtual_type == htons(ICMP_ECHO) &&
                                            nk->port[iidx] != iih.icmp_id)
-                                               iih.icmp_id = nk->port[iidx];
+                                               pf_patch_16(pd, &iih.icmp_id,
+                                                   nk->port[iidx]);
                                        m_copyback(pd2.m, pd2.off, ICMP_MINLEN,
                                            &iih, M_NOWAIT);
                                        pd->m->m_pkthdr.ph_rtableid =
@@ -5202,7 +5313,8 @@ pf_test_state_icmp(struct pf_pdesc *pd,
                                        if (virtual_type ==
                                            htons(ICMP6_ECHO_REQUEST) &&
                                            nk->port[iidx] != iih.icmp6_id)
-                                               iih.icmp6_id = nk->port[iidx];
+                                               pf_patch_16(pd, &iih.icmp6_id,
+                                                   nk->port[iidx]);
                                        m_copyback(pd2.m, pd2.off,
                                            sizeof(struct icmp6_hdr), &iih,
                                            M_NOWAIT);
@@ -6274,6 +6386,7 @@ pf_setup_pdesc(struct pf_pdesc *pd, void
                        REASON_SET(reason, PFRES_SHORT);
                        return (PF_DROP);
                }
+               pd->pcksum = &pd->hdr.icmp6->icmp6_cksum;
                break;
        }
 #endif /* INET6 */
Index: net/pfvar.h
===================================================================
--- net.orig/pfvar.h
+++ net/pfvar.h
@@ -1715,6 +1715,13 @@ void     pf_addr_inc(struct pf_addr *, sa_fa
 
 void   *pf_pull_hdr(struct mbuf *, int, void *, int, u_short *, u_short *,
            sa_family_t);
+#define PF_HI (true)
+#define PF_LO (!PF_HI)
+#define PF_ALGNMNT(off) (((off) % 2) == 0 ? PF_HI : PF_LO)
+void   pf_patch_8(struct pf_pdesc *, u_int8_t *, u_int8_t, bool);
+void   pf_patch_16(struct pf_pdesc *, u_int16_t *, u_int16_t);
+void   pf_patch_16_unaligned(struct pf_pdesc *, void *, u_int16_t, bool);
+void   pf_patch_32(struct pf_pdesc *, u_int32_t *, u_int32_t);
 void   pf_change_a(struct pf_pdesc *, void *, u_int32_t);
 int    pf_check_proto_cksum(struct pf_pdesc *, int, int, u_int8_t,
            sa_family_t);
Index: net/pf_norm.c
===================================================================
--- net.orig/pf_norm.c
+++ net/pf_norm.c
@@ -855,10 +855,6 @@ pf_normalize_tcp(struct pf_pdesc *pd)
        u_int8_t         flags;
        u_int            rewrite = 0;
 
-       if (pd->csum_status == PF_CSUM_UNKNOWN)
-               pf_check_proto_cksum(pd, pd->off, pd->tot_len - pd->off,
-                   pd->proto, pd->af);
-
        flags = th->th_flags;
        if (flags & TH_SYN) {
                /* Illegal packet */
@@ -880,15 +876,18 @@ pf_normalize_tcp(struct pf_pdesc *pd)
        }
 
        /* If flags changed, or reserved data set, then adjust */
-       if (flags != th->th_flags || th->th_x2 != 0) {
-               th->th_flags = flags;
-               th->th_x2 = 0;
-               rewrite = 1;
-       }
+       if (flags != th->th_flags || th->th_x2 != 0) {
+               /* hack: set 4-bit th_x2 = 0 */
+               u_int8_t *th_off = (u_int8_t*)(&th->th_ack+1);
+               pf_patch_8(pd, th_off, th->th_off << 4, PF_HI);
+
+               pf_patch_8(pd, &th->th_flags, flags, PF_LO);
+               rewrite = 1;
+       }
 
        /* Remove urgent pointer, if TH_URG is not set */
        if (!(flags & TH_URG) && th->th_urp) {
-               th->th_urp = 0;
+               pf_patch_16(pd, &th->th_urp, 0);
                rewrite = 1;
        }
 
@@ -1391,12 +1390,8 @@ pf_normalize_mss(struct pf_pdesc *pd, u_
        u_int16_t        mss;
        int              thoff;
        int              opt, cnt, optlen = 0;
-       u_char           opts[MAX_TCPOPTLEN];
-       u_char          *optp = opts;
-
-       if (pd->csum_status == PF_CSUM_UNKNOWN)
-               pf_check_proto_cksum(pd, pd->off, pd->tot_len - pd->off,
-                   pd->proto, pd->af);
+       u_int8_t         opts[MAX_TCPOPTLEN];
+       u_int8_t        *optp = opts;
 
        thoff = th->th_off << 2;
        cnt = thoff - sizeof(struct tcphdr);
@@ -1419,12 +1414,15 @@ pf_normalize_mss(struct pf_pdesc *pd, u_
                                break;
                }
                if (opt == TCPOPT_MAXSEG) {
-                       memcpy(&mss, (optp + 2), 2);
+                       u_int8_t *mssp = optp + 2;
+                       memcpy(&mss, mssp, sizeof(mss));
                        if (ntohs(mss) > maxmss) {
-                               mss = htons(maxmss);
+                               size_t mssoffopts = mssp - opts;
+                               pf_patch_16_unaligned(pd, &mss,
+                                   htons(maxmss), PF_ALGNMNT(mssoffopts));
                                m_copyback(pd->m,
-                                   pd->off + sizeof(*th) + optp + 2 - opts,
-                                   2, &mss, M_NOWAIT);
+                                   pd->off + sizeof(*th) + mssoffopts,
+                                   sizeof(mss), &mss, M_NOWAIT);
                                pf_cksum(pd, pd->m);
                                m_copyback(pd->m, pd->off, sizeof(*th), th,
                                    M_NOWAIT);
