checksumming on non-local forward path

2018-10-26 Thread Jason A. Donenfeld
Hey netdev,

In a protocol like wireguard, if a packet arrives on the other end of
the tunnel, and the crypto checks out, then we're absolutely certain
that the packet hasn't been modified in transit. In this case, there's
no useful information that validating the inner checksums of the
various headers would give us. Every byte of the packet has arrived
intact, and we're certain of it.

Therefore, you might think in a situation like this, before calling
netif_receive_skb or the like, we can just set ip_summed to
CHECKSUM_UNNECESSARY, csum_level to ~0, and feel glad that we're not
wasting cycles unnecessarily validating the checksum.

But what if the receiving computer forwards the packet on across a
real physical network? In that case, it's probably important that the
kernel that originally produced the packet in the first place ensures
it has a valid checksum before encrypting it and sending it through
the wireguard tunnel. That's fine, but it does mean that in the case
of the packet being locally received on the other end, and not
forwarded, the checksumming by the original sender was not needed and
was therefore wasteful.

What would you think of a flag on the receiving end like,
"CHECKSUM_INVALID_BUT_UNNECESSARY"? It would be treated as
CHECKSUM_UNNECESSARY in the case that the packet is locally
received. But if the packet is going to be forwarded instead, then
skb_checksum_help is called on it before forwarding onward.
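
To make the proposed semantics concrete, here's a minimal sketch -- the
flag and helper names are hypothetical, since none of this exists in the
tree today:

/* Hypothetical: CHECKSUM_INVALID_BUT_UNNECESSARY is a proposed value,
 * not something that exists in the kernel today. */
static int csum_fixup_before_forward(struct sk_buff *skb)
{
	/* Local delivery would treat this exactly like
	 * CHECKSUM_UNNECESSARY and skip validation entirely. */
	if (skb->ip_summed != CHECKSUM_INVALID_BUT_UNNECESSARY)
		return 0;

	/* Forwarding onto a real network: compute the checksum that
	 * the original sender skipped. */
	return skb_checksum_help(skb);
}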

AFAICT, wireguard isn't the only thing that could benefit from this:
virtio is another case where it's not always necessary for the sender
to call skb_checksum_help, when the receiver could just do it
conditionally based on whether it's being forwarded.

Thoughts?

Regards,
Jason


Re: [PATCH net-next v6 23/23] net: WireGuard secure network tunnel

2018-09-27 Thread Jason A. Donenfeld
On Fri, Sep 28, 2018 at 12:37 AM Jason A. Donenfeld  wrote:
> Will do. v7 will include the wg_ prefix.

$ nm *.o | while read a b c; do [[ $b == T ]] && echo $c; done | grep -v ^wg_
cleanup_module
init_module

Success.


Re: [PATCH net-next v6 23/23] net: WireGuard secure network tunnel

2018-09-27 Thread Jason A. Donenfeld
Hi Andrew,

Thanks for following up with this.

On Thu, Sep 27, 2018 at 3:15 AM Andrew Lunn  wrote:
> I know you have been concentrating on the crypto code, so i'm not
> expecting too many changes at the moment in the network code.

I should be addressing things in parallel, actually, so I'm happy to
work on this.

> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather 
> than BUG() or BUG_ON()
> #2984: FILE: drivers/net/wireguard/noise.c:293:
> +   BUG_ON(first_len > BLAKE2S_HASH_SIZE || second_len > BLAKE2S_HASH_SIZE ||

I was actually going to ask you about this, because it applies
similarly in another context too that I'm trying to refine. The above
function you quote has the following properties:

- Only ever accepts fixed length parameters, so the compiler can
constant fold invocations of it fantastically. Those parameters are
fixed length in the sense that they're enum/macro constants. They
never come from the user or from a packet or something.
- Never produces an incorrect result. For said constants, all inputs
are valid, and so it succeeds in producing an output every time.
- Is a "pure" function, just knocking bytes around, without needing to
interact with fancy kernel-y things; could be implemented on some sort
of WWII-era rotor machine provided you had the patience.

Because of the above, there's never any error to return to the user of
the function. Also, because it only ever takes constant sized inputs,
in theory I should be able to change that BUG_ON() to BUILD_BUG_ON(),
but in practice the optimizer/inliner isn't actually that aggressive.

But what I would like is some way of signaling to the developer using
this function that they've passed it an illegal value, and their code
should not ship until that's fixed, under any circumstances at all --
that their usage of the function is completely unacceptable and wrong.
Bla bla strong statements.

For this, I figured the notion would come across with the aberrant
behavior of "crash the developer's [in this case, my] QEMU instance"
when "#ifdef DEBUG is true". This is the same kind of place where I'd
have an "assert()" statement in userland. It sounds like what you're
saying is that a WARN_ON is equally as effective instead? Or given the
above, do you think the BUG_ON is actually sensible? Or neither and I
should do something else?
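
For the time being, the pattern I have is essentially a debug-only
assertion, along the lines of this sketch (the macro name is just
illustrative):

#ifdef DEBUG
/* Crash the developer's QEMU instance on misuse. */
#define debug_assert(cond) BUG_ON(!(cond))
#else
/* Compiles away entirely in production builds. */
#define debug_assert(cond) do {} while (0)
#endif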

> WARNING: Macros with flow control statements should be avoided
> #5471: FILE: drivers/net/wireguard/selftest/allowedips.h:456:
> +#define init_peer(name) do {   \
> +   name = kzalloc(sizeof(*name), GFP_KERNEL); \
> +   if (unlikely(!name)) { \
> +   pr_info("allowedips self-test: out of memory\n");  \
> +   goto free; \
> +   }  \
> +   kref_init(&name->refcount); \
> +   } while (0)

This is part of a debugging selftest, where I'm initing a bunch of
peers one after another, and this macro helps keep the test clear
while offloading the actual irrelevant coding part to this macro. The
test itself then just has code like:

init_peer(a);
init_peer(b);
init_peer(c);
init_peer(d);
init_peer(e);
init_peer(f);
init_peer(g);
init_peer(h);

insert(4, a, 192, 168, 4, 0, 24);
insert(4, b, 192, 168, 4, 4, 32);
insert(4, c, 192, 168, 0, 0, 16);
insert(4, d, 192, 95, 5, 64, 27);
/* replaces previous entry, and maskself is required */
insert(4, c, 192, 95, 5, 65, 27);
insert(6, d, 0x26075300, 0x60006b00, 0, 0xc05f0543, 128);
insert(6, c, 0x26075300, 0x60006b00, 0, 0, 64);
...

And so forth. I can probably figure out a different way to code this
if you really want, but I thought this would be clear.

> The namespace pollution also needs to be addresses. You have some
> pretty generic named global symbols. I picked out a few examples from
> objdump
>
> 2a94 g F .text  0060 peer_put
> 3484 g F .text  004c timers_stop
> 3520 g F .text  0114 packet_queue_init
> 2640 g F .text  0034 device_uninit
> 26bc g F .text  0288 peer_create
> 90d4 g F .text  01bc ratelimiter_init
>
> Please make use of a prefix for global symbols, e.g. wg_.

Will do. v7 will include the wg_ prefix.

On a slightly related note, out of curiosity, any idea what's up with
the future of LTO in the kernel? It sounds like that'd be nice to have
on a module-by-module basis. IOW, I'd love to LTO all of my .c files
in wireguard together, and then only ever expose mod_init/exit and
whatever I explicitly EXPORT_SYMBOL, and then have the compiler and
linker treat the rest of everything as essentially in one .c file and
optimize the heck out of it, and then strip all the 

Re: [PATCH net-next v6 00/23] WireGuard: Secure Network Tunnel

2018-09-27 Thread Jason A. Donenfeld
Hi Eric,

On Thu, Sep 27, 2018 at 8:29 PM Eric Biggers  wrote:
> Why is Herbert Xu's existing crypto tree being circumvented, especially for
> future patches (the initial merge isn't quite as important as that's a
> one-time event)?  I like being able to check out cryptodev to test upcoming
> crypto patches.  And currently, changes to APIs, algorithms, tests, and
> implementations all go through cryptodev, which is convenient for crypto
> developers.
>
> Apparently, you're proposing that someone adding a new algorithm will now
> have to submit the API portion to one maintainer (Herbert Xu) and the
> implementation portion to another maintainer (you), and they'll go through
> separate git trees.  That's inconvenient for developers, and it seems that
> in practice you and Herbert will be stepping on each other's toes a lot.
>
> Can you please reach some kind of sane agreement with Herbert so that the
> development process isn't fractured into two?  Perhaps you could review
> patches, but Herbert could still apply them?

I think you're overthinking it a bit. Zinc will have a few software
implementations of primitives that are useful in cases where it's nice to call
the primitive directly. Think: various usages of sha2, siphash, the wireguard
suite (what this patchset includes), other things in lib/, etc. In so much as
this winds up duplicating things within the crypto API, I'll work with Herbert
to build one on top of the other -- as I've done in the two commits in this
series. But beyond that, think of the two initiatives as orthogonal. I'm
working on curating a few primitives that are maximally useful throughout
the kernel for various uses, and doing so in a way that I think brings
about a certain quality. Meanwhile the crypto API is amassing a huge
collection of primitives for some things, and that will continue to exist,
and Herbert will continue to maintain that. I expect the crossover
to be fairly isolated and manageable, without many foreseeable tree
conflicts and such. Therefore, Samuel Neves and I plan to maintain the
codebase we've spent quite some time writing, and maintain our own tree for
it, which we'll be submitting through Greg. In other words, this is not
a matter of "circumvention" or "stepping on toes", but rather separate
efforts. I'm quite certain that, to the extent they overlap, we'll be able
to work it out fairly easily.

Either way, I'll take your suggestion and reach out to Herbert, since at
least a discussion between the two of us sounds like it could be productive.

> I'm also wondering about the criteria for making additions and changes to
> "Zinc".  You mentioned before that one of the "advantages" of Zinc is that it
> doesn't include "cipher modes from 90s cryptographers" -- what does that mean
> exactly?  You've also indicated before that you don't want people modifying the
> Poly1305 implementations as they are too error-prone.  Useful contributions
> could be blocked or discouraged in the future. Can you please elaborate on
> your criteria for contributions to Zinc?
>
> Also, will you allow algorithms that aren't up to modern security standards but
> are needed for compatibility reasons, e.g. MD5, SHA-1, and DES?  There are
> existing standards, APIs, and data formats that use these "legacy" algorithms;
> so implementations of them are often still needed, whether we like it or not.
>
> And does it matter who designed the algorithms, e.g. do algorithms from Daniel
> Bernstein get effectively a free pass, while algorithms from certain
> countries, governments, or organizations are not allowed?  E.g. wireless
> driver developers may need the SM4 block cipher (which is now supported by
> the crypto API) as it's specified in a Chinese wireless standard.  Will you
> allow SM4 in Zinc?  Or will
> people have to submit some algorithms to Herbert and some to you due to
> disagreements about what algorithms should be included?

Similarly here, I think you're over-politicizing everything. Stable address
generation for IPv6 uses SHA1 -- see net/ipv6/addrconf.c:3203 -- do you think
that this should use, say, the SM3 chinese hash function instead? No, of
course not, for a variety of interesting reasons. Rather, it should use some
simple hash function that's fast in software that we have available in Zinc.
On the other hand, it seems like parts of the kernel that have pretty high
levels of cipher agility -- such as dmcrypt, ipsec, wifi apparently, and
so on -- will continue to use a dynamic-dispatch system like the crypto API,
since that's what it was made to do and is effective at doing. And so, your
example of SM4 seems to fit perfectly into what the crypto API is well-suited
for, and it would fit naturally in there.

In other words, the "political criteria" for what we add to lib/zinc/ will
mostly be the same as for the rest of lib/: are there things using it that
benefit from it being there in a direct and obvious way, and does the
implementation meet certain 

Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
On Thu, Sep 27, 2018 at 6:27 PM Andy Lutomirski  wrote:
> I would add another consideration: if you can get better latency with 
> negligible overhead (0.1%? 0.05%), then that might make sense too. For 
> example, it seems plausible that checking need_resched() every few blocks 
> adds basically no overhead, and the SIMD helpers could do this themselves or 
> perhaps only ever do a block at a time.
>
> need_resched() costs a cacheline access, but it’s usually a hot cacheline, 
> and the actual check is just whether a certain bit in memory is set.

Yes you're right, I do plan to check quite often, rather than seldom,
for this reason. I've been toying with the idea of instead processing
65k (maximum size of a UDP packet) at a time before checking
need_resched(), but armed with the 20µs figure, this isn't remotely
possible on most hardware. So I'll stick with the original
conservative plan of checking very often, and not making things
different from the aspects worked out by the present crypto API in
this regard.
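
For reference, the loop shape I have in mind is roughly the following
sketch -- the block function and the chunk size are illustrative
assumptions, not final values:

	/* Process bounded chunks with a preemption point in between,
	 * since kernel_fpu_end() allows the scheduler to run. */
	while (len > 0) {
		size_t todo = min_t(size_t, len, SZ_4K);

		kernel_fpu_begin();
		chacha20_blocks_simd(state, dst, src, todo); /* hypothetical */
		kernel_fpu_end();

		dst += todo;
		src += todo;
		len -= todo;
	}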


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
Hey again Thomas,

On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld  wrote:
>
> Hi Thomas,
>
> I'm trying to optimize this for crypto performance while still taking
> into account preemption concerns. I'm having a bit of trouble figuring
> out a way to determine numerically what the upper bounds for this
> stuff looks like. I'm sure I could pick a pretty sane number that's
> arguably okay -- and way under the limit -- but I still am interested
> in determining what that limit actually is. I was hoping there'd be a
> debugging option called, "warn if preemption is disabled for too
> long", or something, but I couldn't find anything like that. I'm also
> not quite sure what the latency limits are, to just compute this with
> a formula. Essentially what I'm trying to determine is:
>
> preempt_disable();
> asm volatile(".fill N, 1, 0x90;");
> preempt_enable();
>
> What is the maximum value of N for which the above is okay? What
> technique would you generally use in measuring this?
>
> Thanks,
> Jason

From talking to Peter (now CC'd) on IRC, it sounds like what you're
mostly interested in is clocktime latency on reasonable hardware, with
a goal of around ~20µs as a maximum upper bound? I don't expect to get
anywhere near this value at all, but if you can confirm that's a
decent ballpark, it would make for some interesting calculations.

Regards,
Jason


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
Hi Thomas,

I'm trying to optimize this for crypto performance while still taking
into account preemption concerns. I'm having a bit of trouble figuring
out a way to determine numerically what the upper bounds for this
stuff looks like. I'm sure I could pick a pretty sane number that's
arguably okay -- and way under the limit -- but I still am interested
in determining what that limit actually is. I was hoping there'd be a
debugging option called, "warn if preemption is disabled for too
long", or something, but I couldn't find anything like that. I'm also
not quite sure what the latency limits are, to just compute this with
a formula. Essentially what I'm trying to determine is:

preempt_disable();
asm volatile(".fill N, 1, 0x90;");
preempt_enable();

What is the maximum value of N for which the above is okay? What
technique would you generally use in measuring this?

Thanks,
Jason


[PATCH net-next v6 17/23] zinc: Curve25519 generic C implementations and selftest

2018-09-25 Thread Jason A. Donenfeld
This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: 
https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto

Information: https://cr.yp.to/ecdh.html

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Karthikeyan Bhargavan 
---
 include/zinc/curve25519.h   |   22 +
 lib/zinc/Kconfig|4 +
 lib/zinc/Makefile   |3 +
 lib/zinc/curve25519/curve25519-fiat32.h |  860 +++
 lib/zinc/curve25519/curve25519-hacl64.h |  784 ++
 lib/zinc/curve25519/curve25519.c|  109 ++
 lib/zinc/selftest/curve25519.h  | 1321 +++
 7 files changed, 3103 insertions(+)
 create mode 100644 include/zinc/curve25519.h
 create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
 create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
 create mode 100644 lib/zinc/curve25519/curve25519.c
 create mode 100644 lib/zinc/selftest/curve25519.h

diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index ..def173e736fc
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#ifndef _ZINC_CURVE25519_H
+#define _ZINC_CURVE25519_H
+
+#include 
+
+enum curve25519_lengths {
+   CURVE25519_KEY_SIZE = 32
+};
+
+bool __must_check curve25519(u8 mypublic[CURVE25519_KEY_SIZE],
+const u8 secret[CURVE25519_KEY_SIZE],
+const u8 basepoint[CURVE25519_KEY_SIZE]);
+void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE]);
+bool __must_check curve25519_generate_public(
+   u8 pub[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE]);
+
+#endif /* _ZINC_CURVE25519_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 7bf4bc88f81f..0d8a94dd73a8 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -14,6 +14,10 @@ config ZINC_CHACHA20POLY1305
 config ZINC_BLAKE2S
tristate
 
+config ZINC_CURVE25519
+   tristate
+   select CONFIG_CRYPTO
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 67ad837c822c..65440438c6e5 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -25,3 +25,6 @@ obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
 zinc_blake2s-y := blake2s/blake2s.o
 zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
 obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
+
+zinc_curve25519-y := curve25519/curve25519.o
+obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o
diff --git a/lib/zinc/curve25519/curve25519-fiat32.h b/lib/zinc/curve25519/curve25519-fiat32.h
new file mode 100644
index ..32b5ec7aa040
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-fiat32.h
@@ -0,0 +1,860 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018 Jason A. Donenfeld . All Rights Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: <https://github.com/mit-plv/fiat-crypto>. Though originally
+ * machine generated, it has been tweaked to be suitable for use in the kernel.
+ * It is optimized for 32-bit machines and machines that cannot work efficiently
+ * with 128-bit integer types.
+ */
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+   /* Ignores top bit of s. */
+   u32 a0 = get_unaligned_le32(s);
+   u32 a1 = get_unaligned_le32(s+4);
+   u32 a2 = get_unaligned_le32(s+8);
+   u32 a3 = get_unaligned_le32(s+12);
+   u32 a4 = get_una

[PATCH net-next v5 13/20] zinc: BLAKE2s x86_64 implementation

2018-09-18 Thread Jason A. Donenfeld
These implementations from Samuel Neves support AVX and AVX-512VL.
Originally this used AVX-512F, but Skylake thermal throttling made
AVX-512VL more attractive and possible to do with negligible difference.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
---
 lib/zinc/Makefile  |   1 +
 lib/zinc/blake2s/blake2s-x86_64-glue.h |  59 +++
 lib/zinc/blake2s/blake2s-x86_64.S  | 685 +
 lib/zinc/blake2s/blake2s.c |   4 +-
 4 files changed, 748 insertions(+), 1 deletion(-)
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64-glue.h
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 0e0020f58d27..47607486c39d 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -22,4 +22,5 @@ zinc_chacha20poly1305-y := chacha20poly1305.o
 obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
 
 zinc_blake2s-y := blake2s/blake2s.o
+zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
 obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
diff --git a/lib/zinc/blake2s/blake2s-x86_64-glue.h b/lib/zinc/blake2s/blake2s-x86_64-glue.h
new file mode 100644
index ..22900ef8e7fe
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64-glue.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_AS_AVX
+asmlinkage void blake2s_compress_avx(struct blake2s_state *state,
+const u8 *block, const size_t nblocks,
+const u32 inc);
+#endif
+#ifdef CONFIG_AS_AVX512
+asmlinkage void blake2s_compress_avx512(struct blake2s_state *state,
+   const u8 *block, const size_t nblocks,
+   const u32 inc);
+#endif
+
+static bool blake2s_use_avx __ro_after_init;
+static bool blake2s_use_avx512 __ro_after_init;
+
+static void __init blake2s_fpu_init(void)
+{
+   blake2s_use_avx =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+   blake2s_use_avx512 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   boot_cpu_has(X86_FEATURE_AVX512F) &&
+   boot_cpu_has(X86_FEATURE_AVX512VL) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL);
+}
+
+static inline bool blake2s_arch(struct blake2s_state *state, const u8 *block,
+   size_t nblocks, const u32 inc)
+{
+#ifdef CONFIG_AS_AVX512
+   if (blake2s_use_avx512 && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   blake2s_compress_avx512(state, block, nblocks, inc);
+   kernel_fpu_end();
+   return true;
+   }
+#endif
+#ifdef CONFIG_AS_AVX
+   if (blake2s_use_avx && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   blake2s_compress_avx(state, block, nblocks, inc);
+   kernel_fpu_end();
+   return true;
+   }
+#endif
+   return false;
+}
diff --git a/lib/zinc/blake2s/blake2s-x86_64.S b/lib/zinc/blake2s/blake2s-x86_64.S
new file mode 100644
index ..360be4818d06
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64.S
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ * Copyright (C) 2017 Samuel Neves . All Rights Reserved.
+ */
+
+#include 
+
+.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.align 32
+IV:.octa 0xA54FF53A3C6EF372BB67AE856A09E667
+   .octa 0x5BE0CD191F83D9AB9B05688C510E527F
+.section .rodata.cst16.ROT16, "aM", @progbits, 16
+.align 16
+ROT16: .octa 0x0D0C0F0E09080B0A0504070601000302
+.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.align 16
+ROR328:.octa 0x0C0F0E0D080B0A090407060500030201
+#ifdef CONFIG_AS_AVX512
+.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 640
+.align 64
+SIGMA:
+.long 0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15
+.long 11, 2, 12, 14, 9, 8, 15, 3, 4, 0, 13, 6, 10, 1, 7, 5
+.long 10, 12, 11, 6, 5, 9, 13, 3, 4, 15, 14, 2, 0, 7, 8, 1
+.long 10, 9, 7, 0, 11, 14, 1, 12, 6, 2, 15, 3, 13, 8, 5, 4
+.long 4, 9, 8, 13, 14, 0, 10, 11, 7, 3, 12, 1, 5, 6, 15, 2
+.long 2, 10, 4, 14, 13, 3, 9, 11, 6, 5, 7, 12, 15, 1, 8, 0
+.long 4, 11, 14, 8, 13, 10, 12, 5, 2, 1, 15, 3, 9, 7, 0, 6
+.long 6, 12, 0, 13, 15, 2, 1, 10, 4, 5, 11, 14, 8, 3, 9, 7
+.long 14, 5, 4, 12, 9, 7, 3, 10, 2, 0, 6, 15, 11, 1, 13, 8
+.long 11, 7, 13, 10, 12, 14, 0,

Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc

2018-09-16 Thread Jason A. Donenfeld
Hey Martin,

Thanks for running these and pointing this out. I've replicated the
results with tcrypt and fixed some issues, and the next patch series
should be a lot closer to what you'd expect, instead of the regression
you noticed. Most of the slowdown happened as a result of over-eager
XSAVEs, which I've now rectified. I'm still working on a few other
facets of it, but I believe v5 will be more satisfactory when posted.

Regards,
Jason


Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-12 Thread Jason A. Donenfeld
Hi Eric,

On Wed, Sep 12, 2018 at 12:08 AM Eric Biggers  wrote:
> I'd strongly prefer the assembly to be readable too.  Jason, I'm not sure if
> you've actually read through the asm from the OpenSSL implementations, but the
> generated .S files actually do lose a lot of semantic information that was in
> the original .pl scripts.

The thing to keep in mind is that the .S was not directly and blindly
generated from the .pl. We started with the output of the .pl, and
then, particularly in the case of x86_64, worked with it a lot, and
now it's something a bit different. We've definitely spent a lot of
time reading that assembly.

I'll see if I can improve the readability with some register name
remapping on ARM. No guarantees, but I'll play a bit and see if I can
make it a bit better.

Jason


Re: [PATCH v1 3/3] net: WireGuard secure network tunnel

2018-08-21 Thread Jason A. Donenfeld
On Tue, Aug 21, 2018 at 4:54 PM David Miller  wrote:
>
> From: "Jason A. Donenfeld" 
> Date: Tue, 21 Aug 2018 16:41:50 -0700
>
> > Is 100 in fact acceptable for new code? 120? 180?  What's the
> > generally accepted limit these days?
>
> Please keep it as close to 80 columns as possible.
>
> Line breaks are not ugly, please embrace them :)

Okay! Will do. Thanks for the response.

Jason


Re: [PATCH v1 3/3] net: WireGuard secure network tunnel

2018-08-21 Thread Jason A. Donenfeld
On Tue, Jul 31, 2018 at 1:22 PM Stephen Hemminger
 wrote:
>
> On Tue, 31 Jul 2018 21:11:02 +0200
> "Jason A. Donenfeld"  wrote:
>
> > +static int walk_by_peer(struct allowedips_node __rcu *top, u8 bits, struct 
> > allowedips_cursor *cursor, struct wireguard_peer *peer, int (*func)(void 
> > *ctx, const u8 *ip, u8 cidr, int family), void *ctx, struct mutex *lock)
> > +{
>
> Please break lines at something reasonable like 100 characters.

As I mentioned earlier, things will certainly be that way for v2, and
I'm in the process of that. I do care quite a bit about having "pretty
code", and so while I'm wrapping, I want to do it well and
consistently. Indeed 100 characters is far more reasonable than 80 --
things wind up much less annoying. And 120 is even easier. I don't
want to have to start renaming nice descriptive struct names and nice
descriptive struct member names just to avoid the line breaks in
function signatures, or something silly like that. In any case, before
I take out my word wrapping paint brush and make the prettiest
wrapping I can, I wanted to double check and confirm what the norm is
for the net tree. Is 100 in fact acceptable for new code? 120? 180?
What's the generally accepted limit these days?

(And, folks, please don't let this inquiry spiral out of control into
a bikeshed thread. I'm narrowly interested in knowing what the facts
are on what the norm is and what will be accepted here, rather than a
cacophony of random opinions.)


[PATCH v2] fib_rules: match rules based on suppress_* properties too

2018-06-25 Thread Jason A. Donenfeld
Two rules with different values of suppress_prefix or suppress_ifgroup
are not the same. This fixes an -EEXIST when running:

   $ ip -4 rule add table main suppress_prefixlength 0

Signed-off-by: Jason A. Donenfeld 
Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
---
This adds the new condition you mentioned. I'm not sure what you make of
DaveM's remark about this not being in the original code, but here is
nonetheless the requested change.

 net/core/fib_rules.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5bc630..bc8425d81022 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
if (rule->mark && r->mark != rule->mark)
continue;
 
+   if (rule->suppress_ifgroup != -1 &&
+   r->suppress_ifgroup != rule->suppress_ifgroup)
+   continue;
+
+   if (rule->suppress_prefixlen != -1 &&
+   r->suppress_prefixlen != rule->suppress_prefixlen)
+   continue;
+
if (rule->mark_mask && r->mark_mask != rule->mark_mask)
continue;
 
-- 
2.17.1



[PATCH] fib_rules: match rules based on suppress_* properties too

2018-06-23 Thread Jason A. Donenfeld
Two rules with different values of suppress_prefix or suppress_ifgroup
are not the same. This fixes an -EEXIST when running:

   $ ip -4 rule add table main suppress_prefixlength 0

Signed-off-by: Jason A. Donenfeld 
Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule")
---
 net/core/fib_rules.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 126ffc5bc630..665799311b98 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -416,6 +416,12 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops,
if (rule->mark && r->mark != rule->mark)
continue;
 
+   if (r->suppress_ifgroup != rule->suppress_ifgroup)
+   continue;
+
+   if (r->suppress_prefixlen != rule->suppress_prefixlen)
+   continue;
+
if (rule->mark_mask && r->mark_mask != rule->mark_mask)
continue;
 
-- 
2.17.1



[net regression] "fib_rules: move common handling of newrule delrule msgs into fib_nl2rule" breaks suppress_prefixlength

2018-06-23 Thread Jason A. Donenfeld
Hey Roopa,

On a kernel with a minimal networking config,
CONFIG_IP_MULTIPLE_TABLES appears to be broken for certain rules after
f9d4b0c1e9695e3de7af3768205bacc27312320c.

Try, for example, running:

$ ip -4 rule add table main suppress_prefixlength 0

It returns with EEXIST.

Perhaps the reason is that the new rule_find function does not match
on suppress_prefixlength? However, rule_exist from before didn't do
that either. I'll keep playing and see if I can track it down myself,
but thought I should let you know first.

A relevant .config can be found at https://א.cc/iq5HoUY0

Jason


Re: safe skb resetting after decapsulation and encapsulation

2018-05-13 Thread Jason A. Donenfeld
On Sat, May 12, 2018 at 4:07 AM, Md. Islam  wrote:
> I'm not an expert on this, but it looks about right.

Really? Even zeroing between headers_start and headers_end? With the
latest RHEL 7.5 kernel's i40e driver, doing this results in a crash in
kfree. It's possible redhat is putting something silly within
header_start and header_end, and so zeroing it is bad, but I suspect
that instead blanket zeroing it like that might actually be incorrect.

> look at build_skb() or __build_skb(). It shows the fields that needs to be set

These just kmalloc a new skb, with most fields set to zero. The ones
it modifies are the ones I'm modifying anyway when messing with the
data the skb contains. Doesn't look like there's much to help there.


I wrote the original post wondering precisely this: which of 1-14 are
incorrect, and is there anything specific missing from the list?


safe skb resetting after decapsulation and encapsulation

2018-05-11 Thread Jason A. Donenfeld
Hey Netdev,

A UDP skb comes in via the encap_rcv interface. I do a lot of wild
things to the bytes in the skb -- change where the head starts, modify
a few fragments, decrypt some stuff, trim off some things at the end,
etc. In other words, I'm decapsulating the skb in a pretty intense
way. I benefit from reusing the same skb, performance wise, but after
I'm done processing it, it's really a totally new skb. Eventually it's
time to pass off my skb to netif_receive_skb/netif_rx, but before I do
that, I need to "reinitialize" the skb. (The same goes for when
sending out an skb -- I get it from userspace via ndo_start_xmit, do
crazy things to it, and eventually pass it off to the udp_tunnel send
functions, but first "reinitializing" it.)

At the moment I'm using a function that looks like this:

static void jasons_wild_and_crazy_skb_reset(struct sk_buff *skb)
{
skb_scrub_packet(skb, true); //1
memset(&skb->headers_start, 0, offsetof(struct sk_buff,
headers_end) - offsetof(struct sk_buff, headers_start)); //2
skb->queue_mapping = 0; //3
skb->nohdr = 0; //4
skb->peeked = 0; //5
skb->mac_len = 0; //6
skb->dev = NULL; //7
#ifdef CONFIG_NET_SCHED
skb->tc_index = 0; //8
skb_reset_tc(skb); //9
#endif
skb->hdr_len = skb_headroom(skb); //10
skb_reset_mac_header(skb); //11
skb_reset_network_header(skb); //12
skb_probe_transport_header(skb, 0); //13
skb_reset_inner_headers(skb); //14
}

I'm sure that some of this is wrong. Most of it is based on part of an
Octeon ethernet driver I read a few years ago. I numbered each
statement above, hoping to go through it with you all in detail here,
and see what we can cut away and see what we can approve.

1. Obviously correct and required.
2. This is probably wrong. At least it causes crashes when receiving
packets from RHEL 7.5's latest i40e driver in their vendor
frankenkernel, because those flags there have some critical bits
related to allocation. But there are a lot of flags in there that I might
consider going through one by one and zeroing out.
3-5. Fields that should be zero, I assume, after
decapsulating/decrypting (and encapsulating/encrypting).
6. WireGuard is layer 3, so there's no mac.
7. We're later going to change the dev this came in on.
8-9: Same flakey rationale as 2,3-5.
10: Since the headroom has changed during the various modifications, I
need to let the packet field know about it.
11-14: The beginning of the headers has changed, and so resetting and
probing is necessary for this to work at all.

So I'm wondering - how much of this is necessary? How much am I
unnecessarily reinventing things that exist elsewhere? I'm pretty sure
in most cases the driver would work with only 1,10-14, but I worry
that bad things would happen in more unusual configurations. I've
tried to systematically go through the entire stack and see where
these might be used or not used, but it seems really inconsistent.

So, I'm writing wondering if somebody has an easy simplification or
rule for handling this kind of intense decapsulation/decryption (and
encapsulation/encryption operation on the other way) operation. I'd
like to make sure I get this down solid.

Thanks,
Jason


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-22 Thread Jason A. Donenfeld
Thanks!

Jason


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Jason A. Donenfeld
Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")


Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Jason A. Donenfeld
On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet  wrote:
>> This probably should be queued up for stable.
>
> When was the bug added ? This would help a lot stable teams ...

This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.


[PATCH] netlink: put module reference if dump start fails

2018-02-20 Thread Jason A. Donenfeld
Before, if cb->start() failed, the module reference would never be put,
because cb->cb_running is intentionally false at this point. Users are
generally annoyed by this because they can no longer unload modules that
leak references. Also, it may be possible to tediously wrap a reference
counter back to zero, especially since module.c still uses atomic_inc
instead of refcount_inc.

This patch expands the error path to simply call module_put if
cb->start() fails.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
This probably should be queued up for stable.

 net/netlink/af_netlink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2ad445c1d27c..07e8478068f0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2308,7 +2308,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
if (cb->start) {
ret = cb->start(cb);
if (ret)
-   goto error_unlock;
+   goto error_put;
}
 
nlk->cb_running = true;
@@ -2328,6 +2328,8 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 */
return -EINTR;
 
+error_put:
+   module_put(control->module);
 error_unlock:
sock_put(sk);
mutex_unlock(nlk->cb_mutex);
-- 
2.16.1



Re: WireGuard Upstreaming Roadmap (November 2017)

2017-12-08 Thread Jason A. Donenfeld
Hi Dave,

On Fri, Dec 8, 2017 at 4:38 PM, David Miller  wrote:
> Sorry, you cannot force the discussion of a feature which will be submitted
> upstream to occur on a private mailing list.
>
> It is _ABSOLUTELY_ appropriate to discss this on netdev since it is the
> netdev community which must consider issues like this when looking at
> whether to accept WireGuard upstream.
>
> Jason, this action and response was entirely inappropriate, and please
> I'd like you to reply properly to questions about your feature here.

Whoops, okay! Very sorry. I'm actually kind of happy to hear that. I
had assumed that you'd be annoyed if WireGuard crypto discussion
spewed over into netdev adding even more message volume there for
something perhaps not directly relevant. But in fact, you're
interested and find it important to discuss there. So, good news. And
sorry for trying to shoo it away before. I'll copy and paste the
response I had made on the other list:

> This is covered in the paper:
> https://www.wireguard.com/papers/wireguard.pdf
>
> The basic answer is that WireGuard has message type identifiers, and
> the handshake also hashes into it an identifier of the primitives
> used. If there's ever a problem with those primitives chosen, it will
> be possible to introduce new message type identifiers, if that kind of
> "support everything even the broken garbage" approach is desired by
> misguided people. However, a better approach, of course, is to keep
> your secure network separate from your insecure network, and to not
> allow insecure nodes on secure segments; when you mix the two,
> disaster tends to strike. So, in other words, both approaches to "upgrading"
> are possible, in this fantasy wireguardalypse. Take note, though, that
> neither one of these approaches (support new and retain old protocol
> too for old nodes, or only support new) are "agile" or are anything at
> all like the 90s "cipher agility" -- the user still is not permitted
> to "choose" ciphers.

Regards,
Jason


Re: WireGuard Upstreaming Roadmap (November 2017)

2017-12-07 Thread Jason A. Donenfeld
On Thu, Dec 7, 2017 at 11:22 AM, Stefan Tatschner
 wrote:
> I have a question which is related to the involved crypto. As far as I
> have understood the protocol and the concept of wireguard [...]
> What's your opinion on this?

This thread has been picked up on the WireGuard mailing list, not here.

Since this concerns the interworkings of the protocol and cryptography
as a whole, as opposed to implementation details of Linux, please do
not send these inquiries to LKML. Additionally, please start new
threads for new topics in the future, rather than hijacking a roadmap
thread.

Look for my answer on the other mailing list. I'll CC you too.

Regards,
Jason


Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-11 Thread Jason A. Donenfeld
On Sat, Nov 11, 2017 at 11:18 PM, Johannes Berg
 wrote:
>
>> > If you're handling this by forcing another read() to procude the
>> > NLMSG_DONE, then you have no reason to WARN_ON() here.
>> >
>> > In fact you are adding a WARN_ON() which is trivially triggerable by
>> > any user.
>>
>> I added this in my suggestion for how this could work, but I don't
>> think you're right, since we previously check if there's enough space.
>
> Or perhaps I should say this differently:
>
> Forcing another read happens through the
>
> skb_tailroom(skb) < nlmsg_total_size(...)
>
> check, so the nlmsg_put_answer() can't really fail.
>
>
> Handling nlmsg_put_answer() failures by forcing another read would have
> required jumping to the existing if code with a goto, or restructuring
> the whole thing completely somehow, and I didn't see how to do that.

Exactly. And if something _does_ go wrong in our logic, and we can't
add NLMSG_DONE, we really do want people to report this to us, since
dumps must always end that way. We'd probably have caught this a
number of years ago when userspace developers were twiddling with
their receive buffers if we had had the WARN_ON. Nice suggestion from
Johannes.


WireGuard Upstreaming Roadmap (November 2017)

2017-11-10 Thread Jason A. Donenfeld
Hi folks,

This relates to WireGuard [0].

Following a very nice conference with the Linux kernel networking subsystem
community [1,2], I thought it might be a good idea to spell out the roadmap
for the coming months and the trajectory into upstream, based on my
discussions with several developers and maintainers. There are several threads
of this, the biggest of which surrounds the kernel code base, but there are
some other ends of the WireGuard project as a whole that are also relevant.

The current biggest blocker is issues with the crypto API. Before WireGuard
can go upstream, I intend to embark on a multi-pronged effort to overhaul the
crypto API. I very much need to sync up with Herbert regarding my plans for
this, and start spec'ing things out a bit more formally, so I can begin
concrete discussions with him. I intend to base my work both on feedback
from linux-crypto/Herbert and from the cryptographic research community. I
hope to go to RWC2018 [3] and the subsequent HACS workshop for the academic
engagement side, but of course like all the work I do on the kernel, things
will be highly based in engineering, rather than purely academic, practices.

Dave has strongly encouraged me to post patches sooner rather than later.
So I think before the crypto API is ready to go, I'll likely post a [RFG] --
request for grumbles -- patch set to netdev, in order to have some code
review, so as to gauge where we're at. This patch set will use my current
crypto API, not the kernel's crypto API, with it mentioned in the opening that
I intend to switch to the kernel's crypto API when it looks like the one used
here. This way we'll get early feedback so that the later real [PATCH] series
will go more smoothly.

There are a few WireGuard features that some of you have been waiting for. At
the urging of some folks at the conference, I intend to submit a "core"
WireGuard to upstream, and then use future releases to build on top of that,
to make the initial upstreaming process go as easily as possible. Therefore,
there are three big TODO items that may _or_ may not go in after upstreaming:

 - In-band control messages [possibly not]
 - Netlink multicast event notification
 - GRO integration

Of these, it's most likely I'll implement GRO and leave the other two until
after, but this may change. Since WireGuard already does GSO, it would make
sense to implement the other side of things too. There's also a longer more
ambitious roadmap [4], filled with both easy coding-related things and longer
term design things, but that's out of scope for this email, though many of
which will likely be completed before submission time.

There are also six other threads of development that are ongoing, which I
intend to put a bit more focus on too in the near future:

  - The userspace implementation. I'd like to bring this up to deployment
quality, which naturally fits into the next area.

  - Mobile apps. It's fairly easy to integrate the userspace implementation
with existing APIs. The current Android app already works well with the
kernel module, but of course people want this more easily deployed.

  - Mac and Windows support for the userspace implementation. These are
already mostly done, but the APIs used may in fact change, so there may
still be a bit of work to do here before we're satisfied.

  - Bindings and libraries. Now that we have a stable Netlink API, we can
start making nice wrappers for the various languages people like to use.
It remains to be seen whether or not a C "libwireguard" is needed, since in
that domain, talking Netlink directly is often a better choice, but I do see
some potential for a pywireguard and the like. This will also be essential
when the already mentioned plans for event notification and the possibly
excluded control messages materialize.

  - More formal verification. While we have the cryptographic protocol
verified, there are still more places where formalism is quite helpful,
proving more state machines, and even proving C implementations to be
correct. Work and research is ongoing in this domain.

  - Integration into network managers and routing daemons (mesh and
traditional). Work has already begun here on systemd-networkd, and others
are looking into daemons like babel and bird.

So that's where we're at. I hope to have a RFG submitted in the next several
months, and hopefully we gather some nice momentum and get the work
upstreamed and the project completed soon, for some definition of "complete".

If you'd like to work on WireGuard, or simply have any questions, don't
hesitate to email me.

Regards,
Jason


[0] https://www.wireguard.com/
[1] https://www.netdevconf.org/2.2/
[2] https://www.wireguard.com/presentations/
[3] https://rwc.iacr.org
[4] https://www.wireguard.com/todo/


Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-10 Thread Jason A. Donenfeld
On Sat, Nov 11, 2017 at 11:37 AM, David Miller  wrote:
> Jason I'm already pushing my luck as-is with the pull request I made
> yesterday.
>
> I've seen your original request to get this in, you don't have to say
> it multiple times.
>
> We can get this into the merge window and submit it for -stable, so
> please relax.

Whoops, sorry! Okay, no problem.

Jason


Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-10 Thread Jason A. Donenfeld
IIRC, 4.14 comes tomorrow-ish? If possible, it would be nice to get
this in 4.14 before then, so it doesn't have to take time to trickle
down through stable.

Jason

On Thu, Nov 9, 2017 at 1:04 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> The way people generally use netlink_dump is that they fill in the skb
> as much as possible, breaking when nla_put returns an error. Then, they
> get called again and start filling out the next skb, and again, and so
> forth. The mechanism at work here is the ability for the iterative
> dumping function to detect when the skb is filled up and not fill it
> past the brim, waiting for a fresh skb for the rest of the data.
>
> However, if the attributes are small and nicely packed, it is possible
> that a dump callback function successfully fills in attributes until the
> skb is of size 4080 (libmnl's default page-sized receive buffer size).
> The dump function completes, satisfied, and then, if it happens to be
> that this is actually the last skb, and no further ones are to be sent,
> then netlink_dump will add on the NLMSG_DONE part:
>
>   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
>
> It is very important that netlink_dump does this, of course. However, in
> this example, that call to nlmsg_put_answer will fail, because the
> previous filling by the dump function did not leave it enough room. And
> how could it possibly have done so? All of the nla_put variety of
> functions simply check to see if the skb has enough tailroom,
> independent of the context it is in.
>
> In order to keep the important assumptions of all netlink dump users, it
> is therefore important to give them an skb that has this end part of the
> tail already reserved, so that the call to nlmsg_put_answer does not
> fail. Otherwise, library authors are forced to find some bizarre sized
> receive buffer that has a large modulo relative to the common sizes of
> messages received, which is ugly and buggy.
>
> This patch thus saves the NLMSG_DONE for an additional message, for the
> case that things are dangerously close to the brim. This requires
> keeping track of the errno from ->dump() across calls.
>
> Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
> ---
> Can we get this into 4.14? Is there still time? It should also be queued
> up for stable.
>
> Changes v3->v4:
>  - I like long lines. The kernel does not. Wrapped at 80 now.
>
>  net/netlink/af_netlink.c | 17 +++--
>  net/netlink/af_netlink.h |  1 +
>  2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index b93148e8e9fb..15c99dfa3d72 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
> struct sk_buff *skb = NULL;
> struct nlmsghdr *nlh;
> struct module *module;
> -   int len, err = -ENOBUFS;
> +   int err = -ENOBUFS;
> int alloc_min_size;
> int alloc_size;
>
> @@ -2183,9 +2183,11 @@ static int netlink_dump(struct sock *sk)
> skb_reserve(skb, skb_tailroom(skb) - alloc_size);
> netlink_skb_set_owner_r(skb, sk);
>
> -   len = cb->dump(skb, cb);
> +   if (nlk->dump_done_errno > 0)
> +   nlk->dump_done_errno = cb->dump(skb, cb);
>
> -   if (len > 0) {
> +   if (nlk->dump_done_errno > 0 ||
> +   skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
> mutex_unlock(nlk->cb_mutex);
>
> if (sk_filter(sk, skb))
> @@ -2195,13 +2197,15 @@ static int netlink_dump(struct sock *sk)
> return 0;
> }
>
> -   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
> -   if (!nlh)
> +   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
> +  sizeof(nlk->dump_done_errno), NLM_F_MULTI);
> +   if (WARN_ON(!nlh))
> goto errout_skb;
>
> nl_dump_check_consistent(cb, nlh);
>
> -   memcpy(nlmsg_data(nlh), &len, sizeof(len));
> +   memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
> +  sizeof(nlk->dump_done_errno));
>
> if (sk_filter(sk, skb))
> kfree_skb(skb);
> @@ -2273,6 +2277,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
> }
>
> nlk->cb_running = true;
> +   nlk->dump_done_errno = INT_MAX;
>
> mutex_unlock(nlk->cb_mutex);
>
> diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
> index 028188597eaa..962de7b3c023 100644
> --- a/net/netlink/af_netlink.

[PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-08 Thread Jason A. Donenfeld
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

  nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.

This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
Can we get this into 4.14? Is there still time? It should also be queued
up for stable.

Changes v3->v4:
 - I like long lines. The kernel does not. Wrapped at 80 now.

 net/netlink/af_netlink.c | 17 +++--
 net/netlink/af_netlink.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..15c99dfa3d72 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
struct module *module;
-   int len, err = -ENOBUFS;
+   int err = -ENOBUFS;
int alloc_min_size;
int alloc_size;
 
@@ -2183,9 +2183,11 @@ static int netlink_dump(struct sock *sk)
skb_reserve(skb, skb_tailroom(skb) - alloc_size);
netlink_skb_set_owner_r(skb, sk);
 
-   len = cb->dump(skb, cb);
+   if (nlk->dump_done_errno > 0)
+   nlk->dump_done_errno = cb->dump(skb, cb);
 
-   if (len > 0) {
+   if (nlk->dump_done_errno > 0 ||
+   skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
mutex_unlock(nlk->cb_mutex);
 
if (sk_filter(sk, skb))
@@ -2195,13 +2197,15 @@ static int netlink_dump(struct sock *sk)
return 0;
}
 
-   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-   if (!nlh)
+   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
+  sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+   if (WARN_ON(!nlh))
goto errout_skb;
 
nl_dump_check_consistent(cb, nlh);
 
-   memcpy(nlmsg_data(nlh), &len, sizeof(len));
+   memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
+  sizeof(nlk->dump_done_errno));
 
if (sk_filter(sk, skb))
kfree_skb(skb);
@@ -2273,6 +2277,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
}
 
nlk->cb_running = true;
+   nlk->dump_done_errno = INT_MAX;
 
mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
	wait_queue_head_t	wait;
	bool			bound;
	bool			cb_running;
+	int			dump_done_errno;
	struct netlink_callback	cb;
	struct mutex		*cb_mutex;
	struct mutex		cb_def_mutex;
-- 
2.15.0



Re: [PATCH v3] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-08 Thread Jason A. Donenfeld
On Thu, Nov 9, 2017 at 11:02 AM, Johannes Berg
 wrote:
> nit: I think your line got a little long here :)

Ack. For v4.

> and here

Ack. For v4.

>
>> + nlk->dump_done_errno = INT_MAX;
>
> I guess positive values aren't really returned from dump?

When a positive value is returned, the API uses this to signify that
the dump is not done, and it should be called again. When 0 or
negative is returned, it means the dump is complete and the return
value should be returned in DONE. We therefore initialize
dump_done_errno to some positive value, so that we can make calling
->dump() conditional on it being positive. When it becomes zero or
negative, it's time to return. This mirrors the semantics of the
actual ->dump() function precisely.
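
To illustrate that contract with a sketch -- everything here is made up
for illustration, not from this patch -- a dump callback looks like:

static int example_dump(struct sk_buff *skb, struct netlink_callback *cb)
{
	long idx = cb->args[0];

	while (idx < n_items) {
		if (nla_put_u32(skb, EXAMPLE_ATTR, items[idx]))
			break; /* skb full; ask to be called again */
		idx++;
	}
	cb->args[0] = idx;
	/* Positive: call me again with a fresh skb. Zero: done. */
	return idx < n_items ? skb->len : 0;
}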


[PATCH v3] af_netlink: ensure that NLMSG_DONE never fails in dumps

2017-11-08 Thread Jason A. Donenfeld
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

  nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.
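
To make that dependency concrete, here is roughly the userspace receive
loop that trusts NLMSG_DONE to arrive intact; handle_msg() is a
hypothetical per-message callback and error handling is elided:

char buf[4096];
for (;;) {
        ssize_t len = recv(fd, buf, sizeof(buf), 0);
        if (len <= 0)
                break;
        for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
             NLMSG_OK(nlh, len); nlh = NLMSG_NEXT(nlh, len)) {
                if (nlh->nlmsg_type == NLMSG_DONE)
                        return *(int *)NLMSG_DATA(nlh); /* dump complete */
                handle_msg(nlh);
        }
}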

This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
Can we get this into 4.14? Is there still time? It should also be queued
up for stable.

Changes v2->v3:
  - Johannes didn't like the subject line of the patch, so the only
thing that's changed in this version is the new subject line.

 net/netlink/af_netlink.c | 14 --
 net/netlink/af_netlink.h |  1 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..7020689e643e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
struct module *module;
-   int len, err = -ENOBUFS;
+   int err = -ENOBUFS;
int alloc_min_size;
int alloc_size;
 
@@ -2183,9 +2183,10 @@ static int netlink_dump(struct sock *sk)
skb_reserve(skb, skb_tailroom(skb) - alloc_size);
netlink_skb_set_owner_r(skb, sk);
 
-   len = cb->dump(skb, cb);
+   if (nlk->dump_done_errno > 0)
+   nlk->dump_done_errno = cb->dump(skb, cb);
 
-   if (len > 0) {
+   if (nlk->dump_done_errno > 0 || skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
mutex_unlock(nlk->cb_mutex);
 
if (sk_filter(sk, skb))
@@ -2195,13 +2196,13 @@ static int netlink_dump(struct sock *sk)
return 0;
}
 
-   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-   if (!nlh)
+   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+   if (WARN_ON(!nlh))
goto errout_skb;
 
nl_dump_check_consistent(cb, nlh);
 
-   memcpy(nlmsg_data(nlh), &len, sizeof(len));
+   memcpy(nlmsg_data(nlh), &nlk->dump_done_errno, sizeof(nlk->dump_done_errno));
 
if (sk_filter(sk, skb))
kfree_skb(skb);
@@ -2273,6 +2274,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
}
 
nlk->cb_running = true;
+   nlk->dump_done_errno = INT_MAX;
 
mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
wait_queue_head_t   wait;
boolbound;
boolcb_running;
+   int dump_done_errno;
struct netlink_callback cb;
struct mutex*cb_mutex;
struct mutexcb_def_mutex;
-- 
2.15.0



[PATCH v2] af_netlink: give correct bounds to dump skb for NLMSG_DONE

2017-11-07 Thread Jason A. Donenfeld
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

  nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.

This patch thus saves the NLMSG_DONE for an additional message, for the
case that things are dangerously close to the brim. This requires
keeping track of the errno from ->dump() across calls.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
Changes v1->v2:
  - This implements things without having to twiddle with the skb, at
the expense of potentially having an extra back-and-forth and quite
a bit of added complexity.


 net/netlink/af_netlink.c | 14 --
 net/netlink/af_netlink.h |  1 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..7020689e643e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
struct sk_buff *skb = NULL;
struct nlmsghdr *nlh;
struct module *module;
-   int len, err = -ENOBUFS;
+   int err = -ENOBUFS;
int alloc_min_size;
int alloc_size;
 
@@ -2183,9 +2183,10 @@ static int netlink_dump(struct sock *sk)
skb_reserve(skb, skb_tailroom(skb) - alloc_size);
netlink_skb_set_owner_r(skb, sk);
 
-   len = cb->dump(skb, cb);
+   if (nlk->dump_done_errno > 0)
+   nlk->dump_done_errno = cb->dump(skb, cb);
 
-   if (len > 0) {
+   if (nlk->dump_done_errno > 0 || skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
mutex_unlock(nlk->cb_mutex);
 
if (sk_filter(sk, skb))
@@ -2195,13 +2196,13 @@ static int netlink_dump(struct sock *sk)
return 0;
}
 
-   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-   if (!nlh)
+   nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+   if (WARN_ON(!nlh))
goto errout_skb;
 
nl_dump_check_consistent(cb, nlh);
 
-   memcpy(nlmsg_data(nlh), &len, sizeof(len));
+   memcpy(nlmsg_data(nlh), &nlk->dump_done_errno, sizeof(nlk->dump_done_errno));
 
if (sk_filter(sk, skb))
kfree_skb(skb);
@@ -2273,6 +2274,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
}
 
nlk->cb_running = true;
+   nlk->dump_done_errno = INT_MAX;
 
mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
wait_queue_head_t   wait;
boolbound;
boolcb_running;
+   int dump_done_errno;
struct netlink_callback cb;
struct mutex*cb_mutex;
struct mutexcb_def_mutex;
-- 
2.15.0



Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE

2017-11-07 Thread Jason A. Donenfeld
Erf, your patch doesn't handle what happens if len comes back
negative, but I'll fix it up and send a v2 using this approach.

I think I really prefer v1 though.

Jason


Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE

2017-11-07 Thread Jason A. Donenfeld
Hi Johannes,

Yes indeed. It sacrifices 24 bytes for making things much less
complex. However, if you prefer increasing the complexity of the state
machine a bit instead, I suppose we could roll with this approach
instead...

Jason


Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE

2017-11-07 Thread Jason A. Donenfeld
By the way, in case you're curious, here's the {up,down,cross}stream
WireGuard commit that works around it via its compat layer (a rat's
nest of hideous backports for all the weird kernels people want
WireGuard to run on, which I cannot wait to remove):

https://git.zx2c4.com/WireGuard/commit/?id=f689ea7acc23dc8e0968699d964ee382b04fbbe4

Particularly relevant here is the last chunk of that, which is part of
the automated test suite, which reproduces the issue by finding the
tightest possible packing.


[PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE

2017-11-07 Thread Jason A. Donenfeld
The way people generally use netlink_dump is that they fill in the skb
as much as possible, breaking when nla_put returns an error. Then, they
get called again and start filling out the next skb, and again, and so
forth. The mechanism at work here is the ability for the iterative
dumping function to detect when the skb is filled up and not fill it
past the brim, waiting for a fresh skb for the rest of the data.

However, if the attributes are small and nicely packed, it is possible
that a dump callback function successfully fills in attributes until the
skb is of size 4080 (libmnl's default page-sized receive buffer size).
The dump function completes, satisfied, and then, if it happens to be
that this is actually the last skb, and no further ones are to be sent,
then netlink_dump will add on the NLMSG_DONE part:

  nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);

It is very important that netlink_dump does this, of course. However, in
this example, that call to nlmsg_put_answer will fail, because the
previous filling by the dump function did not leave it enough room. And
how could it possibly have done so? All of the nla_put variety of
functions simply check to see if the skb has enough tailroom,
independent of the context it is in.

In order to keep the important assumptions of all netlink dump users, it
is therefore important to give them an skb that has this end part of the
tail already reserved, so that the call to nlmsg_put_answer does not
fail. Otherwise, library authors are forced to find some bizarre sized
receive buffer that has a large modulo relative to the common sizes of
messages received, which is ugly and buggy.

This patch thus reserves and restores the required length for
NLMSG_DONE during the call to the dump function.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
See you all at netdevconf tomorrow!

 net/netlink/af_netlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..b2d0a2fb1939 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2183,7 +2183,9 @@ static int netlink_dump(struct sock *sk)
skb_reserve(skb, skb_tailroom(skb) - alloc_size);
netlink_skb_set_owner_r(skb, sk);
 
+   skb->end -= nlmsg_total_size(sizeof(len));
len = cb->dump(skb, cb);
+   skb->end += nlmsg_total_size(sizeof(len));
 
if (len > 0) {
mutex_unlock(nlk->cb_mutex);
-- 
2.15.0



Re: [PATCH net] tun/tap: sanitize TUNSETSNDBUF input

2017-11-01 Thread Jason A. Donenfeld
Here's a simple reproducer, in case syzkaller's case was overcomplicated:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(int argc, char *argv[])
{
  struct ifreq ifr;
  int fd, sock;

  fd = open("/dev/net/tun", O_RDWR);
  if (fd < 0) {
perror("open(/dev/net/tun)");
return 1;
  }

  memset(&ifr, 0, sizeof(ifr));

  ifr.ifr_flags = IFF_TUN;
  strncpy(ifr.ifr_name, "yikes", IFNAMSIZ);

  if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
perror("TUNSETIFF");
return 1;
  }

  sock = socket(AF_INET, SOCK_DGRAM, 0);
  if (sock < 0) {
perror("socket");
return 1;
  }
  ifr.ifr_flags = IFF_UP;
  if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0) {
perror("SIOCSIFFLAGS");
return 1;
  }
  close(sock);

  sock = -1;
  if (ioctl(fd, TUNSETSNDBUF, &sock)) {
perror("TUNSETSNDBUF");
return 1;
  }

  if (write(fd, &fd, sizeof(fd)) < 0) {
perror("write");
return 1;
  }

  return 0;
}


[PATCH] mac80211: use constant time comparison with keys

2017-10-17 Thread Jason A. Donenfeld
Otherwise we risk leaking information via timing side channel.

Fixes: fdf7cb4185b6 ("mac80211: accept key reinstall without changing anything")
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
 net/mac80211/key.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index ae995c8480db..035d16fe926e 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -19,6 +19,7 @@
 #include <linux/slab.h>
 #include <linux/export.h>
 #include <net/mac80211.h>
+#include <crypto/algapi.h>
 #include <asm/unaligned.h>
 #include "ieee80211_i.h"
 #include "driver-ops.h"
@@ -635,7 +636,7 @@ int ieee80211_key_link(struct ieee80211_key *key,
 * new version of the key to avoid nonce reuse or replay issues.
 */
if (old_key && key->conf.keylen == old_key->conf.keylen &&
-   !memcmp(key->conf.key, old_key->conf.key, key->conf.keylen)) {
+   !crypto_memneq(key->conf.key, old_key->conf.key, key->conf.keylen)) {
ieee80211_key_free_unused(key);
ret = 0;
goto out;
-- 
2.14.2



Re: pull-request: mac80211 2017-10-16

2017-10-17 Thread Jason A. Donenfeld
On Tue, Oct 17, 2017 at 7:46 AM, Johannes Berg wrote:
> If it's not equal, you execute so much code
> beneath, going to the driver etc., that I'd think this particular time
> is in the noise.

Usually presumptions like this get you in trouble when some crafty
academic has a smart idea about that noise. I'll send a patch.


Re: pull-request: mac80211 2017-10-16

2017-10-16 Thread Jason A. Donenfeld
Mobile phone right now, so not able to write patch, but you probably
should be using crypto_memneq for comparing those two keys, not
memcmp.

Jason


Re: [PATCH v2] netlink: do not set cb_running if dump's start() errs

2017-10-09 Thread Jason A. Donenfeld
On Mon, Oct 9, 2017 at 2:31 PM, Johannes Berg wrote:
>
> Reviewed-by: Johannes Berg 

Thanks for the review. Hopefully this can make it into 4.13.6 and 4.14-rc5.


[PATCH v2] netlink: do not set cb_running if dump's start() errs

2017-10-09 Thread Jason A. Donenfeld
It turns out that multiple places can call netlink_dump(), which means
it's still possible to dereference partially initialized values in
dump() that were the result of a faulty returned start().

This fixes the issue by calling start() _before_ setting cb_running to
true, so that there's no chance at all of hitting the dump() function
through any indirect paths.

It also moves the call to start() to be when the mutex is held. This has
the nice side effect of serializing invocations to start(), which is
likely desirable anyway. It also prevents any possible other races that
might come out of this logic.

In testing this with several different pieces of tricky code to trigger
these issues, this commit fixes all avenues that I'm aware of.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94c11cf0459d..f34750691c5c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2266,16 +2266,17 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
 
+   if (cb->start) {
+   ret = cb->start(cb);
+   if (ret)
+   goto error_unlock;
+   }
+
nlk->cb_running = true;
 
mutex_unlock(nlk->cb_mutex);
 
-   ret = 0;
-   if (cb->start)
-   ret = cb->start(cb);
-
-   if (!ret)
-   ret = netlink_dump(sk);
+   ret = netlink_dump(sk);
 
sock_put(sk);
 
-- 
2.14.2



Re: [PATCH] netlink: do not set cb_running if dump's start() errs

2017-10-09 Thread Jason A. Donenfeld
Hi Johannes,

Yes, indeed, and I think that's actually a good thing. It means that
the starts aren't ever called in parallel, which could be useful if
drivers have any ordering requirements. The change doesn't negatively
effect any existing drivers. I'll resubmit with a larger commit
message explaining this.

Jason


[PATCH] netlink: do not set cb_running if dump's start() errs

2017-10-09 Thread Jason A. Donenfeld
It turns out that multiple places can call netlink_dump(), which means
it's still possible to dereference partially initialized values in
dump() that were the result of a faulty returned start().

This fixes the issue by calling start() _before_ setting cb_running to
true, so that there's no chance at all of hitting the dump() function
through any indirect paths.

In testing this with several different pieces of tricky code to trigger
these issues, this commit fixes all avenues that I'm aware of.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94c11cf0459d..f34750691c5c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2266,16 +2266,17 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
cb->min_dump_alloc = control->min_dump_alloc;
cb->skb = skb;
 
+   if (cb->start) {
+   ret = cb->start(cb);
+   if (ret)
+   goto error_unlock;
+   }
+
nlk->cb_running = true;
 
mutex_unlock(nlk->cb_mutex);
 
-   ret = 0;
-   if (cb->start)
-   ret = cb->start(cb);
-
-   if (!ret)
-   ret = netlink_dump(sk);
+   ret = netlink_dump(sk);
 
sock_put(sk);
 
-- 
2.14.2



Re: [PATCH] netlink: do not set cb_running if dump's start() errs

2017-10-09 Thread Jason A. Donenfeld
Dave - this very likely should be queued up for stable.

Jason


Re: netlink backwards compatibility in userspace tools

2017-10-09 Thread Jason A. Donenfeld
Hi Dave,

That seems wise. Thanks for the advice. A more sophisticated way of
approaching this would be for the kernel to also send a bitmap of
which attributes are "critical" and only warn (or even error) of
_those_ are not understood. But that seems needlessly complex, and so
I think I'll go with your suggestion, and simply warn once with a
shorter message.

Jason


multi producer / multi consumer lock-free queue with ptr_ring

2017-10-04 Thread Jason A. Donenfeld
Hey Michael,

Thanks for your work on ptr_ring.h. I'm interested in using it, but in
a multi-producer, multi-consumer context. I realize it's been designed
for a single-producer, single-consumer context, and thus uses a
spinlock. I'm wondering if you'd be happy to receive patches that
implement things in a lock-free way, in order to make the data
structure more broadly usable.

In case you're curious, this would be used for the multi-core
algorithms in WireGuard.

Thanks,
Jason


Re: cross namespace interface notification for tun devices

2017-10-02 Thread Jason A. Donenfeld
On Mon, Oct 2, 2017 at 11:32 AM, Nicolas Dichtel wrote:
> 1. Move the process to netns B, open the netlink socket and move back the
> process to netns A. The socket will remain in netns B and you will receive all
> netlink messages related to netns B.
>
> 2. Assign a nsid to netns B in netns A and use NETLINK_LISTEN_ALL_NSID on your
> netlink socket (see iproute2).

Both of these seem to rely on the process knowing where the device is
being moved and having access to that namespace. I don't think these
two things are a given though. Unless I'm missing something?

Jason


netlink backwards compatibility in userspace tools

2017-09-29 Thread Jason A. Donenfeld
Hi guys,

One handy aspect of Netlink is that it's backwards compatible. This
means that you can run old userspace utilities on new kernels, even if
the new kernel supports new features and netlink attributes. The wire
format is stable enough that the data marshaled can be extended
without breaking compat. Neat.

I was wondering, though, what you think the best stance is toward
these old userspace utilities. What should they do if the kernel sends
it netlink attributes that it does not recognize? At the moment, I'm
doing something like this:

static void warn_unrecognized(void)
{
static bool once = false;
if (once)
return;
once = true;
fprintf(stderr,
"Warning: this program received from your kernel one or more\n"
"attributes that it did not recognize. It is possible that\n"
"this version of wg(8) is older than your kernel. You may\n"
"want to update this program.\n");
}
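
(For context, the check that triggers warn_unrecognized() can be a
single bound test in the attribute parser. A libmnl-style sketch, where
MY_ATTR_MAX is a hypothetical constant naming the highest attribute
type this build of the tool understands:

static int parse_attr_cb(const struct nlattr *attr, void *data)
{
        if (mnl_attr_get_type(attr) > MY_ATTR_MAX)
                warn_unrecognized();
        return MNL_CB_OK;
}
)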

This seems like a somewhat sensible warning, but then I wonder about
distributions like Debian, which has a long stable life cycle, so it
frequently has very old tools (ancient iproute2 for example). Then,
VPS providers have these Debian images run on top of newer kernels.
People in this situation would undoubtedly see the above warning a lot
and not be able to do anything about it. Not horrible, but a bit
annoying. Is this an okay annoyance? Or is it advised to just have no
warning at all? One idea would be to put it behind an environment
variable flag, but I don't like too many knobs.

I'm generally wondering about attitudes toward this kind of userspace
program behavior in response to newer kernels.

Thanks,
Jason


Re: [PATCH v2] netlink: do not proceed if dump's start() errs

2017-09-28 Thread Jason A. Donenfeld
On Thu, Sep 28, 2017 at 12:41:44AM +0200, Jason A. Donenfeld wrote:
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 327807731b44..94c11cf0459d 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2270,10 +2270,13 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
>  
>   mutex_unlock(nlk->cb_mutex);
>  
> + ret = 0;
>   if (cb->start)
> - cb->start(cb);
> + ret = cb->start(cb);
> +
> + if (!ret)
> + ret = netlink_dump(sk);
>  
> - ret = netlink_dump(sk);
>   sock_put(sk);
>  
>   if (ret)


FYI, using this horrific hack to currently work around bug:

#define KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX KERNEL_VERSION(4, 14, 0) /* Hopefully! */

static int get(struct sk_buff *skb, struct netlink_callback *cb)
{
int ret;
#if LINUX_VERSION_CODE < KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX
/* https://marc.info/?l=linux-netdev&m=150655213004221&w=2 */
if (!cb->args[0]) {
ret = get_start(cb);
if (ret)
return ret;
}
#endif
...
}

static const struct genl_ops genl_ops[] = {
{
.cmd = CMD_GET,
#if LINUX_VERSION_CODE >= KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX
/* https://marc.info/?l=linux-netdev&m=150655213004221&w=2 */
.start = get_start,
#endif
.dumpit = get,
...
}
}

Gross.


[PATCH v2] netlink: do not proceed if dump's start() errs

2017-09-27 Thread Jason A. Donenfeld
Drivers that use the start method for netlink dumping rely on dumpit not
being called if start fails. For example, ila_xlat.c allocates memory
and assigns it to cb->args[0] in its start() function. It might fail to
do that and return -ENOMEM instead. However, even when returning an
error, dumpit will be called, which, in the example above, quickly
dereferences the memory in cb->args[0], which will OOPS the kernel. This
is but one example of how this goes wrong.

Since start() has always been a function with an int return type, it
therefore makes sense to use it properly, rather than ignoring it. This
patch thus returns early and does not call dumpit() when start() fails.
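
For illustration, a hypothetical driver pair showing the contract this
patch enforces (all names invented, modeled loosely on the ila_xlat
pattern):

struct my_iter { int pos; };  /* invented iterator state */

static int my_start(struct netlink_callback *cb)
{
        cb->args[0] = (unsigned long)kzalloc(sizeof(struct my_iter), GFP_KERNEL);
        return cb->args[0] ? 0 : -ENOMEM;
}

static int my_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
{
        struct my_iter *iter = (struct my_iter *)cb->args[0];

        /* Before this fix, a failed my_start() still landed here with
         * iter == NULL, and dereferencing it OOPSed the kernel. */
        return dump_from(skb, iter);  /* dump_from() is a stand-in */
}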

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 327807731b44..94c11cf0459d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2270,10 +2270,13 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
mutex_unlock(nlk->cb_mutex);
 
+   ret = 0;
if (cb->start)
-   cb->start(cb);
+   ret = cb->start(cb);
+
+   if (!ret)
+   ret = netlink_dump(sk);
 
-   ret = netlink_dump(sk);
sock_put(sk);
 
if (ret)
-- 
2.14.1



Re: [PATCH] netlink: do not proceed if dump's start() errs

2017-09-27 Thread Jason A. Donenfeld
On Wed, Sep 27, 2017 at 3:05 PM, Johannes Berg wrote:
> I guess you could change it to
>
> if (cb->start)
> ret = cb->start(cb);
> if (!ret)
> ret = netlink_dump(sk);

Very clean. I'll do it like that. I'll wait a bit longer before
submitting v2, but beyond that, seems sane to you?

Jason


Re: [PATCH] netlink: do not proceed if dump's start() errs

2017-09-27 Thread Jason A. Donenfeld
On Wed, Sep 27, 2017 at 2:39 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> -   if (cb->start)
> -   cb->start(cb);
> +   if (cb->start) {
> +   ret = cb->start(cb);
> +   if (ret)

I need to sock_put(sk); before returning. I'll fix this for v2, but
will wait for additional comments in case anybody has some.

> +   return ret;
> +   }
>
> ret = netlink_dump(sk);
> sock_put(sk);


[PATCH] netlink: do not proceed if dump's start() errs

2017-09-27 Thread Jason A. Donenfeld
Drivers that use the start method for netlink dumping rely on dumpit not
being called if start fails. For example, ila_xlat.c allocates memory
and assigns it to cb->args[0] in its start() function. It might fail to
do that and return -ENOMEM instead. However, even when returning an
error, dumpit will be called, which, in the example above, quickly
dereferences the memory in cb->args[0], which will OOPS the kernel. This
is but one example of how this goes wrong.

Since start() has always been a function with an int return type, it
therefore makes sense to use it properly, rather than ignoring it. This
patch thus returns early and does not call dumpit() when start() fails.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: sta...@vger.kernel.org
---
 net/netlink/af_netlink.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 327807731b44..be179876227d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2270,8 +2270,11 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
mutex_unlock(nlk->cb_mutex);
 
-   if (cb->start)
-   cb->start(cb);
+   if (cb->start) {
+   ret = cb->start(cb);
+   if (ret)
+   return ret;
+   }
 
ret = netlink_dump(sk);
sock_put(sk);
-- 
2.14.1



Re: cross namespace interface notification for tun devices

2017-09-20 Thread Jason A. Donenfeld
On Wed, Sep 20, 2017 at 8:29 PM, Cong Wang wrote:
> Sounds like we should set NETIF_F_NETNS_LOCAL for tun
> device.

Absolutely do not do this under any circumstances. This would be a
regression and would break API compatibility. As I wrote in my first
email, it's already possible to sleep-loop for that information using
the tun device's fd; I'm just looking for a better event-based
approach.

> What is your legitimate use case of send/receive packet to/from
> a tun device in a different netns?

Because sometimes it's very nice to be able to move network interfaces
that use tun devices into different namespaces, for some cross-namespace
proxying.

What Dan described in the email he just sent is exactly this use case.

In WireGuard (a kernel thing), I have facilities for this --
https://www.wireguard.com/netns/ . Now I'm working on the userspace
version and would like to expose the same utility.

Anyway, the purpose of me sending this message to the list was not to
question the "legitimacy" of my application usage, but rather to
elicit feedback on two specific things:

1. to determine if there's already a mechanism in place for this that
I've overlooked; and
2. to determine particularities of me implementing a mechanism, if
it's not already there.

I'm slightly more convinced that there isn't currently a mechanism for
this. It seems like the easiest way, therefore, would be some kind of
control message that could be poll'd for, using the existing
per-process fd. That way there wouldn't be any violations of the
current namespace situation, yet processes could still get event
notifications as needed.


Re: cross namespace interface notification for tun devices

2017-09-19 Thread Jason A. Donenfeld
On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang wrote:
> By "notification" I assume you mean netlink notification.

Yes, netlink notification.

> The question is why does the process in A still care about
> the device sitting in B?
>
> Also, the process should be able to receive a last notification
> on IFF_UP|IFF_RUNNING before device is finally moved to B.
> After this point, it should not have any relation to netns A
> any more, like the device were completely gone.

That's very clearly not the case with a tun device. Tun devices work
by letting a userspace process control the inputs (ndo_start_xmit) and
outputs (netif_rx) of the actual network device. This controlling
userspace process needs to know when its own interface that it
controls goes up and down. In the kernel, we can do this by just
checking dev->flags & IFF_UP, and by receiving notifications via ndo_open
and ndo_stop. In userspace, the controlling process loses the ability to
receive notifications like ndo_open/ndo_stop when the interface is
moved to a new namespace. After the interface is moved to a namespace,
the process will still control inputs and ouputs (ndo_start_xmit and
netif_rx), but it will no longer receive netlink notifications for the
equivalent of ndo_open and ndo_stop. This is problematic.
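
For reference, the notification channel that breaks here is just the
standard rtnetlink link-group subscription; a minimal userspace sketch,
with error handling omitted:

struct sockaddr_nl snl = {
        .nl_family = AF_NETLINK,
        .nl_groups = RTMGRP_LINK,
};
int nl = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
bind(nl, (struct sockaddr *)&snl, sizeof(snl));
/* recv() now yields RTM_NEWLINK messages carrying IFF_UP changes, but
 * only for interfaces that live in this socket's namespace. */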


Re: cross namespace interface notification for tun devices

2017-09-18 Thread Jason A. Donenfeld
On Mon, Sep 18, 2017 at 8:47 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> The best I've come up with is, in a sleep loop, writing to the tun
> device's fd something with a NULL or invalid payload. If the interface
> is down, the kernel returns -EIO. If the interface is up, the kernel
> returns -EINVAL. This seems to be a reliable distinguisher, but is a
> pretty insane way of doing it. And sleep loops are somewhat different
> from events too.

Specifically, I'm referring to the horrific hack exemplified in the
attached .c file, in case anybody is curious about the details of what
I'd rather not use.
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(int argc, char *argv[])
{
	/* If IFF_NO_PI is specified, this still sort of works but it
	 * bumps the device error counters, which we don't want, so
	 * it's best not to use this trick with IFF_NO_PI. */
	struct ifreq ifr = { .ifr_flags = IFF_TUN };
	int tun, sock, ret;

	tun = open("/dev/net/tun", O_RDWR);
	if (tun < 0) {
		perror("[-] open(/dev/net/tun)");
		return 1;
	}

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("[-] socket(AF_INET, SOCK_DGRAM)");
		return 1;
	}

	ret = ioctl(tun, TUNSETIFF, &ifr);
	if (ret < 0) {
		perror("[-] ioctl(TUNSETIFF)");
		return 1;
	}

	if (write(tun, NULL, 0) >= 0 || errno != EIO)
		perror("[-] write(if:down, NULL, 0) did not return -EIO");
	else
		fprintf(stderr, "[+] write(if:down, NULL, 0) returned -EIO: test successful\n");

	ifr.ifr_flags = IFF_UP;
	ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
	if (ret < 0) {
		perror("[-] ioctl(SIOCSIFFLAGS)");
		return 1;
	}

	if (write(tun, NULL, 0) >= 0 || errno != EINVAL)
		perror("[-] write(if:up, NULL, 0) did not return -EINVAL");
	else
		fprintf(stderr, "[+] write(if:up, NULL, 0) returned -EINVAL: test successful\n");
	
	return 0;
}


cross namespace interface notification for tun devices

2017-09-18 Thread Jason A. Donenfeld
Hey guys,

It's possible to create a tun device in a process in namespace A and
then move that interface to namespace B. The controlling process in A
needs to receive notifications on when the interface is brought up or
down. It can receive these notifications via netlink while the
interface lives in A but not when it moves to B.

Any tricks or APIs to get around this?

The best I've come up with is, in a sleep loop, writing to the tun
device's fd something with a NULL or invalid payload. If the interface
is down, the kernel returns -EIO. If the interface is up, the kernel
returns -EFAULT. This seems to be a reliable distinguisher, but is a
pretty insane way of doing it. And sleep loops are somewhat different
from events too.

If there aren't any current APIs for receiving events directly from
the fd of a tun interface, would this list be happy with a patch that
adds one?

Thanks,
Jason


[PATCH] ioc3-eth: store pointer to net_device for private area

2017-07-10 Thread Jason A. Donenfeld
Computing the alignment manually for going from priv to pub is probably
not such a good idea, and in general the assumption that going from priv
to pub is possible trivially could change, so rather than relying on
that, we change things to just store a pointer to pub. This was suggested
by DaveM in [1].

[1] http://www.spinics.net/lists/netdev/msg443992.html

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
Ralf - I don't have the platform to test this out, so you might want to
briefly put it through the paces before giving it your sign-off.

 drivers/net/ethernet/sgi/ioc3-eth.c | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
index b607936e1b3e..9c0488e0f08e 100644
--- a/drivers/net/ethernet/sgi/ioc3-eth.c
+++ b/drivers/net/ethernet/sgi/ioc3-eth.c
@@ -90,17 +90,13 @@ struct ioc3_private {
spinlock_t ioc3_lock;
struct mii_if_info mii;
 
+   struct net_device *dev;
struct pci_dev *pdev;
 
/* Members used by autonegotiation  */
struct timer_list ioc3_timer;
 };
 
-static inline struct net_device *priv_netdev(struct ioc3_private *dev)
-{
-   return (void *)dev - ((sizeof(struct net_device) + 31) & ~31);
-}
-
 static int ioc3_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
 static void ioc3_set_multicast_list(struct net_device *dev);
 static int ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev);
@@ -427,7 +423,7 @@ static void ioc3_get_eaddr_nic(struct ioc3_private *ip)
nic[i] = nic_read_byte(ioc3);
 
for (i = 2; i < 8; i++)
-   priv_netdev(ip)->dev_addr[i - 2] = nic[i];
+   ip->dev->dev_addr[i - 2] = nic[i];
 }
 
 /*
@@ -439,7 +435,7 @@ static void ioc3_get_eaddr(struct ioc3_private *ip)
 {
ioc3_get_eaddr_nic(ip);
 
-   printk("Ethernet address is %pM.\n", priv_netdev(ip)->dev_addr);
+   printk("Ethernet address is %pM.\n", ip->dev->dev_addr);
 }
 
 static void __ioc3_set_mac_address(struct net_device *dev)
@@ -790,13 +786,12 @@ static void ioc3_timer(unsigned long data)
  */
 static int ioc3_mii_init(struct ioc3_private *ip)
 {
-   struct net_device *dev = priv_netdev(ip);
int i, found = 0, res = 0;
int ioc3_phy_workaround = 1;
u16 word;
 
for (i = 0; i < 32; i++) {
-   word = ioc3_mdio_read(dev, i, MII_PHYSID1);
+   word = ioc3_mdio_read(ip->dev, i, MII_PHYSID1);
 
if (word != 0xffff && word != 0x0000) {
found = 1;
@@ -1276,6 +1271,7 @@ static int ioc3_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
SET_NETDEV_DEV(dev, >dev);
 
ip = netdev_priv(dev);
+   ip->dev = dev;
 
dev->irq = pdev->irq;
 
-- 
2.13.2



Re: [PATCH 1/2] netdevice: add netdev_pub helper function

2017-07-10 Thread Jason A. Donenfeld
On Mon, Jul 10, 2017 at 10:04 AM, David Miller wrote:
> I disagree.  Assuming one can go from the driver private to the netdev
> object trivially is a worse assumption than the other way around, and
> locks us into the current implementation of how the netdev and driver
> private memory is allocated.
>
> If you want to style your driver such that the private is passed
> around instead of the netdev, put a pointer back to the netdev object
> in your private data structure.

I'm surprised you're okay with the memory waste of that, but you bring
up the ability to change the interface later, which is a great point.
I'll submit a patch for that random driver, and I'll also refactor
WireGuard to do the same. Thanks for the guidance.

Jason


[PATCH 2/2] ioc3-eth: use netdev_pub instead of handrolling alignment

2017-07-09 Thread Jason A. Donenfeld
It's safer to use the generic library function for this, rather than
reinventing it here with hard-coded alignment values.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
 drivers/net/ethernet/sgi/ioc3-eth.c | 11 +++
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c
index b607936e1b3e..514eca163ea5 100644
--- a/drivers/net/ethernet/sgi/ioc3-eth.c
+++ b/drivers/net/ethernet/sgi/ioc3-eth.c
@@ -86,31 +86,26 @@ struct ioc3_private {
int tx_ci;  /* TX consumer index */
int tx_pi;  /* TX producer index */
int txqlen;
u32 emcr, ehar_h, ehar_l;
spinlock_t ioc3_lock;
struct mii_if_info mii;
 
struct pci_dev *pdev;
 
/* Members used by autonegotiation  */
struct timer_list ioc3_timer;
 };
 
-static inline struct net_device *priv_netdev(struct ioc3_private *dev)
-{
-   return (void *)dev - ((sizeof(struct net_device) + 31) & ~31);
-}
-
 static int ioc3_ioctl(struct net_device *dev, struct ifreq *rq, int cmd);
 static void ioc3_set_multicast_list(struct net_device *dev);
 static int ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev);
 static void ioc3_timeout(struct net_device *dev);
 static inline unsigned int ioc3_hash(const unsigned char *addr);
 static inline void ioc3_stop(struct ioc3_private *ip);
 static void ioc3_init(struct net_device *dev);
 
 static const char ioc3_str[] = "IOC3 Ethernet";
 static const struct ethtool_ops ioc3_ethtool_ops;
 
 /* We use this to acquire receive skb's that we can DMA directly into. */
 
@@ -417,39 +412,39 @@ static void ioc3_get_eaddr_nic(struct ioc3_private *ip)
printk("Failed to read MAC address\n");
return;
}
 
/* Read Memory.  */
nic_write_byte(ioc3, 0xf0);
nic_write_byte(ioc3, 0x00);
nic_write_byte(ioc3, 0x00);
 
for (i = 13; i >= 0; i--)
nic[i] = nic_read_byte(ioc3);
 
for (i = 2; i < 8; i++)
-   priv_netdev(ip)->dev_addr[i - 2] = nic[i];
+   netdev_pub(ip)->dev_addr[i - 2] = nic[i];
 }
 
 /*
  * Ok, this is hosed by design.  It's necessary to know what machine the
  * NIC is in in order to know how to read the NIC address.  We also have
  * to know if it's a PCI card or a NIC in on the node board ...
  */
 static void ioc3_get_eaddr(struct ioc3_private *ip)
 {
ioc3_get_eaddr_nic(ip);
 
-   printk("Ethernet address is %pM.\n", priv_netdev(ip)->dev_addr);
+   printk("Ethernet address is %pM.\n", netdev_pub(ip)->dev_addr);
 }
 
 static void __ioc3_set_mac_address(struct net_device *dev)
 {
struct ioc3_private *ip = netdev_priv(dev);
struct ioc3 *ioc3 = ip->regs;
 
ioc3_w_emar_h((dev->dev_addr[5] <<  8) | dev->dev_addr[4]);
ioc3_w_emar_l((dev->dev_addr[3] << 24) | (dev->dev_addr[2] << 16) |
  (dev->dev_addr[1] <<  8) | dev->dev_addr[0]);
 }
 
 static int ioc3_set_mac_address(struct net_device *dev, void *addr)
@@ -780,27 +775,27 @@ static void ioc3_timer(unsigned long data)
add_timer(>ioc3_timer);
 }
 
 /*
  * Try to find a PHY.  There is no apparent relation between the MII addresses
  * in the SGI documentation and what we find in reality, so we simply probe
  * for the PHY.  It seems IOC3 PHYs usually live on address 31.  One of my
  * onboard IOC3s has the special oddity that probing doesn't seem to find it
  * yet the interface seems to work fine, so if probing fails we for now will
  * simply default to PHY 31 instead of bailing out.
  */
 static int ioc3_mii_init(struct ioc3_private *ip)
 {
-   struct net_device *dev = priv_netdev(ip);
+   struct net_device *dev = netdev_pub(ip);
int i, found = 0, res = 0;
int ioc3_phy_workaround = 1;
u16 word;
 
for (i = 0; i < 32; i++) {
word = ioc3_mdio_read(dev, i, MII_PHYSID1);
 
if (word != 0xffff && word != 0x0000) {
found = 1;
break;  /* Found a PHY  */
}
}
 
-- 
2.13.2



[PATCH 1/2] netdevice: add netdev_pub helper function

2017-07-09 Thread Jason A. Donenfeld
Being able to utilize this makes code a lot simpler and cleaner. It's
easier in many cases for drivers to pass around their private data
structure, while occasionally needing to dip into net_device, rather
than the other way around, which results in tons of calls to netdev_priv
in the top of every single function, which makes everything confusing
and less clear. Additionally, this enables a "correct" way of doing such
a thing, instead of having drivers attempt to reinvent the wheel and
screw it up.
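
For illustration, a hypothetical caller (my_priv and my_link_down are
invented names):

static void my_link_down(struct my_priv *priv)
{
        netif_carrier_off(netdev_pub(priv));
}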

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
 include/linux/netdevice.h | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 779b23595596..83d58504e5c4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2030,26 +2030,37 @@ void dev_net_set(struct net_device *dev, struct net *net)
 }
 
 /**
  * netdev_priv - access network device private data
  * @dev: network device
  *
  * Get network device private data
  */
 static inline void *netdev_priv(const struct net_device *dev)
 {
return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN);
 }
 
+/**
+ * netdev_pub - access network device from private pointer
+ * @priv: private data pointer of network device
+ *
+ * Get network device from a network device private data pointer
+ */
+static inline struct net_device *netdev_pub(void *priv)
+{
+   return (struct net_device *)((char *)priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN));
+}
+
 /* Set the sysfs physical device reference for the network logical device
  * if set prior to registration will cause a symlink during initialization.
  */
 #define SET_NETDEV_DEV(net, pdev)  ((net)->dev.parent = (pdev))
 
 /* Set the sysfs device type for the network logical device to allow
  * fine-grained identification of different network device types. For
  * example Ethernet, Wireless LAN, Bluetooth, WiMAX etc.
  */
 #define SET_NETDEV_DEVTYPE(net, devtype)   ((net)->dev.type = (devtype))
 
 /* Default NAPI poll() weight
  * Device drivers are strongly advised to not use bigger value
-- 
2.13.2



Re: net: Fix inconsistent teardown and release of private netdev state.

2017-07-09 Thread Jason A. Donenfeld
On Sat, Jul 8, 2017 at 12:39 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 7:24 AM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
>> list_add(&priv->list, &list_of_things);
>>
>> ret = register_netdevice(); // if ret is < 0, then destruct above is automatically called
>>
>> // RACE WITH LIST_ADD/LIST_DEL!! It's impossible to call list_add only after
>> // things are brought up successfully. This is problematic.
>>
>> if (!ret)
>> pr_info("Yay it worked!\n");
>
> I fail to understand what you mean by RACE here.
>
> Here you should already have RTNL lock, so it can't race with any other
> newlink() calls. In fact you can't acquire RTNL lock in your destructor
> since register_netdevice() already gets it. Perhaps you mean
> netdev_run_todo() calls it without RTNL lock?
>
> I don't know why you reorder the above list_add(), you can order it
> as it was before, aka, call it after register_netdevice(), but you have to
> init the priv->list now for the list_del() on error path.

The race is that there's a state in which priv->list is part of
list_of_things before the interface is actually successfully set up and
ready to go.

And no, it's not possible to order it _after_ register_netdevice,
since register_netdevice might call priv_destructor, and
priv_destructor calls list_del, so if it's not already on the list,
we'll OOPS. In other words, API problem.

To work around this shortcoming, I'm actually just assigning
dev->priv_destructor *after* a successful call to register_netdevice. This
seems to be working well and allows me to retain ± the old behavior.
Hopefully this is an okay way to go about things?
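
Concretely, the workaround looks roughly like this; a sketch reusing the
pseudocode names above, not the final WireGuard code:

ret = register_netdevice(dev);
if (ret < 0)
        goto error_4;  /* manual unwind; priv_destructor not yet set */
dev->priv_destructor = destruct;
list_add(&priv->list, &list_of_things);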

Jason


Re: net: Fix inconsistent teardown and release of private netdev state.

2017-07-06 Thread Jason A. Donenfeld
On Thu, Jul 6, 2017 at 4:24 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
>
> if (!ret)
> pr_info("Yay it worked!\n");
>
> return 0;

This is supposed to be `return ret;`. See, even in pseudocode it's
hard to get right.


Re: net: Fix inconsistent teardown and release of private netdev state.

2017-07-06 Thread Jason A. Donenfeld
Hey guys,

I see why this priv_destructor patch is an acceptable bandaid for
certain drivers, but I'm not exactly sure on the cleanest way of using
it in new drivers. Check out the following pseudocode.

Here's how error handling in newlink used to work before this patch:

static void destruct(struct net_device *dev)
{
struct private *priv = netdev_priv(dev);

rtnl_lock();
list_del(&priv->list);
rtnl_unlock();
stop_anotherweirdthing(&priv->wow);
put_can_in_garbage(priv->burrito);
kfree(priv->cheese);
stop_something(>something);
put_whatever(priv->whatever);
}

static int newlink(struct net *src_net, struct net_device *dev, ...)
{
int ret;
struct private *priv = netdev_priv(dev);

priv->whatever = get_whatever(src_net); // never fails

ret = do_something();
if (ret < 0)
goto error_1;

ret = -ENOMEM;
priv->cheese = kmalloc(128, GFP_KERNEL);
if (!priv->cheese)
goto error_2;

priv->burrito = allocate_bean_can(&src_net);
if (IS_ERR(priv->burrito)) {
ret = PTR_ERR(priv->burrito);
goto error_3;
}

do_anotherweirdthing(&priv->wow); // never fails

ret = register_netdevice();
if (ret < 0)
goto error_4;

list_add(&priv->list, &list_of_things);
pr_info("Yay it worked!\n");

return ret;

error_4:
stop_anotherweirdthing(&priv->wow);
put_can_in_garbage(priv->burrito);
error_3:
kfree(priv->cheese);
error_2:
stop_something(&priv->something);
error_1:
put_whatever(priv->whatever);

return ret;
}

Here's what it must look like now, which introduces a weird race with
list_add, and doesn't really have a consistent clean-up section. Having
to remember that register_netdevice will call destruct() when it fails
is quite confusing. Observe:

static int newlink(struct net *src_net, struct net_device *dev, ...)
{
int ret;
struct private *priv = netdev_priv(dev);

priv->whatever = get_whatever(src_net); // never fails

ret = do_something();
if (ret < 0)
goto error_1;

ret = -ENOMEM;
priv->cheese = kmalloc(128, GFP_KERNEL);
if (!priv->cheese)
goto error_2;

priv->burrito = allocate_bean_can(&src_net);
if (IS_ERR(priv->burrito)) {
ret = PTR_ERR(priv->burrito);
goto error_3;
}

do_anotherweirdthing(&priv->wow); // never fails

list_add(&priv->list, &list_of_things);

ret = register_netdevice(); // if ret is < 0, then destruct above is automatically called

// RACE WITH LIST_ADD/LIST_DEL!! It's impossible to call list_add only after
// things are brought up successfully. This is problematic.

if (!ret)
pr_info("Yay it worked!\n");

return 0;

error_3:
kfree(priv->cheese);
error_2:
stop_something(&priv->something);
error_1:
put_whatever(priv->whatever);

return ret;
}

Something we just have to live with? Any cleaner way of approaching
this?

Regards,
Jason


Re: [PATCH] net/icmp: restore source address if packet is NATed

2017-06-25 Thread Jason A. Donenfeld
Hi David,

On Sun, Jun 25, 2017 at 5:49 PM, David Miller wrote:
> This violates things on so many levels.

Yes, indeed.

> I think this kind of thing need to be hidden inside of netfilter,
> it can do the rate limiting and stuff like that in the spot
> where it makes the transformation and knows all of the original
> addressing and ports.

Indeed I'd prefer that, and I'll look again into trying to make that
work. But when I tried last, it seemed like there were some
insurmountable challenges. With the ratelimiting, the limit has
already been applied to one IP -- the masqueraded one -- before
netfilter even has a chance to act -- so that IP will already hit the
ratelimits well before any additional one inside netfilter would. Then
the issue of transformation: last I looked it seemed like icmp_send
threw away a bit too much information to do the transformation
entirely correctly, but I could be wrong, so I'll give it another
look. Hopefully it winds up being as easy as just reverse-transforming
ICMP's payload IP header.

>
> You definitely can't just rewrite header fields here either.  The
> SKB could be shared, for example.

I was afraid of that. It's easy to rework this particular patch,
though, if you're still interested in the crufty bolt on approach...
But I think I should investigate the netfilter-only approach instead,
as you suggested.

Jason


[PATCH] net/icmp: restore source address if packet is NATed

2017-06-23 Thread Jason A. Donenfeld
The ICMP routines use the source address for two reasons:

1. Rate-limiting ICMP transmissions based on source address, so
   that one source address cannot provoke a flood of replies. If
   the source address is wrong, the rate limiting will be
   incorrectly applied.

2. Choosing the interface and hence new source address of the
   generated ICMP packet. If the original packet source address
   is wrong, ICMP replies will be sent from the wrong source
   address, resulting in either a misdelivery, infoleak, or just
   general network admin confusion.

Most of the time, the icmp_send and icmpv6_send routines can just reach
down into the skb's IP header to determine the saddr. However, if
icmp_send or icmpv6_send is being called from a network device driver --
there are a few in the tree -- then it's possible that by the time
icmp_send or icmpv6_send looks at the packet, the packet's source
address has already been transformed by SNAT or MASQUERADE or some other
transformation that CONNTRACK knows about. In this case, the packet's
source address is most certainly the *wrong* source address to be used
for the purpose of ICMP replies.

Rather, the source address we want to use for ICMP replies is the
original one, from before the transformation occurred.

Fortunately, it's very easy to just ask CONNTRACK if it knows about this
packet, and if so, how to fix it up. The saddr is the only field in the
header we need to fix up, for the purposes of the subsequent processing
in the icmp_send and icmpv6_send functions, so we do the lookup very
early on, so that the rest of the ICMP machinery can progress as usual.
In my tests, this setup works very well.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
 net/ipv4/icmp.c | 21 +
 net/ipv6/icmp.c | 21 +
 2 files changed, 42 insertions(+)

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index c2be26b98b5f..30aa6aa79fd2 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -97,6 +97,10 @@
 #include <net/inet_common.h>
 #include <net/ip_fib.h>
 #include <net/l3mdev.h>
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#endif
 
 /*
  * Build xmit assembly blocks
@@ -586,6 +590,10 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
u32 mark;
struct net *net;
struct sock *sk;
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   enum ip_conntrack_info ctinfo;
+   struct nf_conn *ct;
+#endif
 
if (!rt)
goto out;
@@ -604,6 +612,19 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info)
goto out;
 
/*
+*  If this function is called after the skb has already been
+*  NAT transformed, the ratelimiting will apply to the wrong
+*  saddr, and the reply will be marked as coming from the
+*  wrong host. So, we fix it up here in case connection tracking
+*  enables that.
+*/
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   ct = nf_ct_get(skb_in, &ctinfo);
+   if (ct)
+   iph->saddr = ct->tuplehash[0].tuple.src.u3.ip;
+#endif
+
+   /*
 *  No replies to physical multicast/broadcast
 */
if (skb_in->pkt_type != PACKET_HOST)
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 8d7b113958b1..ee8a2853121e 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -69,6 +69,10 @@
 #include <net/inet_common.h>
 #include <net/dsfield.h>
 #include <net/l3mdev.h>
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#endif
 
 #include <linux/uaccess.h>
 
@@ -422,12 +426,29 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
int len;
int err = 0;
u32 mark = IP6_REPLY_MARK(net, skb->mark);
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   enum ip_conntrack_info ctinfo;
+   struct nf_conn *ct;
+#endif
 
if ((u8 *)hdr < skb->head ||
(skb_network_header(skb) + sizeof(*hdr)) > skb_tail_pointer(skb))
return;
 
/*
+*  If this function is called after the skb has already been
+*  NAT transformed, the ratelimiting will apply to the wrong
+*  saddr, and the reply will be marked as coming from the
+*  wrong host. So, we fix it up here in case connection tracking
+*  enables that.
+*/
+#if IS_ENABLED(CONFIG_NF_CONNTRACK)
+   ct = nf_ct_get(skb, &ctinfo);
+   if (ct)
+   hdr->saddr = ct->tuplehash[0].tuple.src.u3.in6;
+#endif
+
+   /*
 *  Make sure we respect the rules
 *  i.e. RFC 1885 2.4(e)
 *  Rule (e.1) is enforced by not using icmp6_send
-- 
2.13.1



Re: [PATCH 0/6] Constant Time Memory Comparisons Are Important

2017-06-11 Thread Jason A. Donenfeld
Hi Stephan,

On Sun, Jun 11, 2017 at 11:06 PM, Stephan Müller wrote:
> Are you planning to send an update to your patch set? If yes, there is another
> one which should be converted too: crypto/rsa-pkcs1pad.c.

I just sent an update to this thread patching that, per your
suggestion. Since these issues are expected to be cherry picked by
their respective committer, I figure we can just pile on the patches
here, listing the 0/6 intro email as each patch's parent.

Jason


[PATCH 0/6] Constant Time Memory Comparisons Are Important

2017-06-09 Thread Jason A. Donenfeld
Whenever you're comparing two MACs, it's important to do this using
crypto_memneq instead of memcmp. With memcmp, you leak timing information,
which could then be used to iteratively forge a MAC. This is far too basic
of a mistake for us to have so pervasively in the year 2017, so let's begin
cleaning this stuff up. The following 6 locations were found with some
simple regex greps, but I'm sure more lurk below the surface. If you
maintain some code or know somebody who maintains some code that deals
with MACs, tell them to double check which comparison function they're
using.

Jason A. Donenfeld (6):
  sunrpc: use constant time memory comparison for mac
  net/ipv6: use constant time memory comparison for mac
  ccree: use constant time memory comparison for macs and tags
  security/keys: use constant time memory comparison for macs
  bluetooth/smp: use constant time memory comparison for secret values
  mac80211/wpa: use constant time memory comparison for MACs

Cc: Anna Schumaker <anna.schuma...@netapp.com>
Cc: David Howells <dhowe...@redhat.com>
Cc: David Safford <saff...@us.ibm.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: Gilad Ben-Yossef <gi...@benyossef.com>
Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org>
Cc: Gustavo Padovan <gust...@padovan.org>
Cc: "J. Bruce Fields" <bfie...@fieldses.org>
Cc: Jeff Layton <jlay...@poochiereds.net>
Cc: Johan Hedberg <johan.hedb...@gmail.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
Cc: Marcel Holtmann <mar...@holtmann.org>
Cc: Mimi Zohar <zo...@linux.vnet.ibm.com>
Cc: Trond Myklebust <trond.mykleb...@primarydata.com>
Cc: keyri...@vger.kernel.org
Cc: linux-blueto...@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-wirel...@vger.kernel.org
Cc: netdev@vger.kernel.org

 drivers/staging/ccree/ssi_fips_ll.c   | 17 ---
 net/bluetooth/smp.c   | 39 ++-
 net/ipv6/seg6_hmac.c  |  3 ++-
 net/mac80211/wpa.c|  9 
 net/sunrpc/auth_gss/gss_krb5_crypto.c |  3 ++-
 security/keys/trusted.c   |  7 ---
 6 files changed, 42 insertions(+), 36 deletions(-)

-- 
2.13.1


[PATCH 2/6] net/ipv6: use constant time memory comparison for mac

2017-06-09 Thread Jason A. Donenfeld
Otherwise, we enable a MAC forgery via timing attack.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: sta...@vger.kernel.org
---
 net/ipv6/seg6_hmac.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/seg6_hmac.c b/net/ipv6/seg6_hmac.c
index f950cb53d5e3..54213c83b44e 100644
--- a/net/ipv6/seg6_hmac.c
+++ b/net/ipv6/seg6_hmac.c
@@ -38,6 +38,7 @@
 #include <net/xfrm.h>
 
 #include <linux/cryptohash.h>
+#include <crypto/algapi.h>
 #include <crypto/hash.h>
 #include <crypto/sha.h>
 #include <net/seg6.h>
@@ -274,7 +275,7 @@ bool seg6_hmac_validate_skb(struct sk_buff *skb)
if (seg6_hmac_compute(hinfo, srh, &ipv6_hdr(skb)->saddr, hmac_output))
return false;
 
-   if (memcmp(hmac_output, tlv->hmac, SEG6_HMAC_FIELD_LEN) != 0)
+   if (crypto_memneq(hmac_output, tlv->hmac, SEG6_HMAC_FIELD_LEN))
return false;
 
return true;
-- 
2.13.1



[PATCH net-next v10 0/5] Avoiding stack overflow in skb_to_sgvec

2017-06-03 Thread Jason A. Donenfeld
Changes v9->v10:
   - Spaces to tabs on one line.
   - Added some acked-by, reviewed-by lines.

Since we're down to only cleaning up things like spaces-to-tabs, I
believe we can merge this patch series. David - would you put this in
net-next, please?


The recent bug with macsec and the historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist
without checking the length is probably a bad idea. And it's not
necessary either: an sglist already explicitly marks its last
item, and the initialization functions are diligent in doing so.
Thus there's a clear way of avoiding future overflows.
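
For reference, the end-marker property being relied on, sketched as a
standalone helper (illustrative only, not part of the series):

#include <linux/errno.h>
#include <linux/scatterlist.h>

/* sg_init_table() zeroes the table and sets the termination marker on
 * the final entry; sg_is_last() reads that marker back. A filler can
 * therefore notice that it is about to run off the end of the caller's
 * sglist instead of scribbling past it. */
static int sg_fill_one_checked(struct scatterlist *sg, int elt)
{
        if (elt && sg_is_last(&sg[elt - 1]))
                return -EMSGSIZE;       /* caller's sglist is too small */
        /* ... safe to fill sg[elt] here ... */
        return 0;
}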

So, this patchset, from a high level, makes skb_to_sgvec return
a potential error code, and then adjusts all callers to check
for the error code. There are two situations in which skb_to_sgvec
might return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.
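
The conversion at each call site follows the same shape, sketched
generically here rather than copied from any one file:

        /* before */
        sg_init_table(sg, nsg);
        skb_to_sgvec(skb, sg, 0, len);

        /* after */
        sg_init_table(sg, nsg);
        err = skb_to_sgvec(skb, sg, 0, len);
        if (unlikely(err < 0))
                goto out;       /* or undo whatever this caller set up */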

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c | 13 --
 drivers/net/virtio_net.c |  9 +--
 include/linux/skbuff.h   |  8 +++---
 net/core/skbuff.c| 65 +++-
 net/ipv4/ah4.c   |  8 --
 net/ipv4/esp4.c  | 20 +--
 net/ipv6/ah6.c   |  8 --
 net/ipv6/esp6.c  | 20 +--
 net/rxrpc/rxkad.c| 19 ++
 9 files changed, 116 insertions(+), 54 deletions(-)

-- 
2.13.0



[PATCH net-next v10 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-06-03 Thread Jason A. Donenfeld
This is a defense-in-depth measure in response to bugs like
4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of housekeeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.
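
(As a rough sanity check on that limit, using illustrative numbers
rather than the measurements above: if each __skb_to_sgvec() frame
costs on the order of 100 bytes, then 24 levels is about 24 * 100 =
2.4 KB, which fits comfortably inside even an 8 KB kernel stack with
room left for the rest of the call chain.)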

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: David Howells <dhowe...@redhat.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 include/linux/skbuff.h |  8 +++
 net/core/skbuff.c  | 65 --
 2 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 45a59c1e0cc7..d460a4cbda1c 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -953,10 +953,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
int newtailroom, gfp_t priority);
-int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
-   int offset, int len);
-int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset,
-int len);
+int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist 
*sg,
+int offset, int len);
+int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg,
+ int offset, int len);
 int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)   consume_skb(a)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 780b7c1563d0..bba33cf4f7cd 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3508,24 +3508,18 @@ void __init skb_init(void)
NULL);
 }
 
-/**
- * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- * @skb: Socket buffer containing the buffers to be mapped
- * @sg: The scatter-gather list to map into
- * @offset: The offset into the buffer's contents to start mapping
- * @len: Length of buffer space to be mapped
- *
- * Fill the specified scatter-gather list with mappings/pointers into a
- * region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len,
+  unsigned int recursion_level)
 {
int start = skb_headlen(skb);
int i, copy = start - offset;
struct sk_buff *frag_iter;
int elt = 0;
 
+   if (unlikely(recursion_level >= 24))
+   return -EMSGSIZE;
+
if (copy > 0) {
if (copy > len)
copy = len;
@@ -3544,6 +3538,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
if ((copy = end - offset) > 0) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+   return -EMSGSIZE;
 
if (copy > len)
copy = len;
@@ -3558,16 +3554,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
}
 
skb_walk_frags(skb, frag_iter) {
-   int end;
+   int end, ret;
 
WARN_ON(start > offset + len);
 
end = start + frag_iter->len;
if ((copy = end - offset) 

[PATCH net-next v10 3/5] rxrpc: check return value of skb_to_sgvec always

2017-06-03 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Acked-by: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 1bb9b2ccc267..29fe20ad04aa 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct 
rxrpc_call *call,
len &= ~(call->conn->size_align - 1);
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, 0, len);
+   err = skb_to_sgvec(skb, sg, 0, len);
+   if (unlikely(err < 0))
+   goto out;
skcipher_request_set_crypt(req, sg, sg, len, iv.x);
crypto_skcipher_encrypt(req);
 
@@ -324,7 +326,7 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
bool aborted;
u32 data_size, buf;
u16 check;
-   int nsg;
+   int nsg, ret;
 
_enter("");
 
@@ -342,7 +344,9 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
goto nomem;
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, 8);
+   ret = skb_to_sgvec(skb, sg, offset, 8);
+   if (unlikely(ret < 0))
+   return ret;
 
/* start the decryption afresh */
memset(&iv, 0, sizeof(iv));
@@ -409,7 +413,7 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
bool aborted;
u32 data_size, buf;
u16 check;
-   int nsg;
+   int nsg, ret;
 
_enter(",{%d}", skb->len);
 
@@ -434,7 +438,12 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
}
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, len);
+   ret = skb_to_sgvec(skb, sg, offset, len);
+   if (unlikely(ret < 0)) {
+   if (sg != _sg)
+   kfree(sg);
+   return ret;
+   }
 
/* decrypt from the session key */
token = call->conn->params.key->payload.data[0];
-- 
2.13.0



[PATCH net-next v10 2/5] ipsec: check return value of skb_to_sgvec always

2017-06-03 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
---
 net/ipv4/ah4.c  |  8 ++--
 net/ipv4/esp4.c | 20 +---
 net/ipv6/ah6.c  |  8 ++--
 net/ipv6/esp6.c | 20 +---
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 22377c8ff14b..e8f862358518 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff 
*skb)
skb_push(skb, ihl);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 93322f895eab..d815d1755473 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -377,9 +377,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
esp->esph = esph;
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, sg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
 
if (!esp->inplace) {
int allocsize;
@@ -403,9 +405,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
spin_unlock_bh(&x->lock);
 
sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
-   skb_to_sgvec(skb, dsg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, dsg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
}
 
if ((x->props.flags & XFRM_STATE_ESN))
@@ -690,7 +694,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff 
*skb)
esp_input_set_header(skb, seqhi);
 
sg_init_table(sg, nfrags);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   err = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out;
 
skb->ip_summed = CHECKSUM_NONE;
 
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index dda6035e3b84..755f38271dd5 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -423,7 +423,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff 
*skb)
ip6h->hop_limit   = 0;
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 1fe99ba8066c..2ede4e459c4e 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info
esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi);
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(s

[PATCH net-next v10 4/5] macsec: check return value of skb_to_sgvec always

2017-06-03 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/macsec.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index 91642fd87cd1..b79513b8322f 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -740,7 +740,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
macsec_fill_iv(iv, secy->sci, pn);
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   macsec_txsa_put(tx_sa);
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (tx_sc->encrypt) {
int len = skb->len - macsec_hdr_len(sci_present) -
@@ -947,7 +952,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (hdr->tci_an & MACSEC_TCI_E) {
/* confidentiality: ethernet + macsec header
-- 
2.13.0



[PATCH net-next v10 5/5] virtio_net: check return value of skb_to_sgvec always

2017-06-03 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Reviewed-by: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 3e9246cc49c3..57763d30cabb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff 
*skb)
struct virtio_net_hdr_mrg_rxbuf *hdr;
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
struct virtnet_info *vi = sq->vq->vdev->priv;
-   unsigned num_sg;
+   int num_sg;
unsigned hdr_len = vi->hdr_len;
bool can_push;
 
@@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
if (can_push) {
__skb_push(skb, hdr_len);
num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+   if (unlikely(num_sg < 0))
+   return num_sg;
/* Pull header back to avoid skew in tx bytes calculations. */
__skb_pull(skb, hdr_len);
} else {
sg_set_buf(sq->sg, hdr, hdr_len);
-   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   if (unlikely(num_sg < 0))
+   return num_sg;
+   num_sg++;
}
return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
2.13.0



Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always

2017-05-24 Thread Jason A. Donenfeld
On Wed, May 24, 2017 at 6:41 PM, Sergei Shtylyov wrote:
> I've only looked on the last 2 patches. You can add my:
>
> Reviewed-by: Sergei Shtylyov 
>
> if you want. :-)

Will do. For the series, or just for 5/5?


Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always

2017-05-24 Thread Jason A. Donenfeld
I'm shocked this somehow made it into the commit. I wonder how that happened?

Anyway, fixed in my git repo, and will be part of the next series.
(Unless DaveM wants to fix it up trivially when/if he merges this v9,
which would be faster.)

Barring that, does this look good to you? Could I have your signed-off-by?

Regards,
Jason


Re: [PATCH net-next v9 0/5] skb_to_sgvec hardening

2017-05-23 Thread Jason A. Donenfeld
Hi List,

Could somebody do a holistic review of the series, or at least of the
individual commits that seem fine, and sign off, so that this can
actually be merged? We're now at v9. I hope we can get this merged
now, but if not, I'd like v10 to finally land these changes.

Regards,
Jason


[PATCH net-next v9 2/5] ipsec: check return value of skb_to_sgvec always

2017-05-23 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
---
 net/ipv4/ah4.c  |  8 ++--
 net/ipv4/esp4.c | 20 +---
 net/ipv6/ah6.c  |  8 ++--
 net/ipv6/esp6.c | 20 +---
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 22377c8ff14b..e8f862358518 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff 
*skb)
skb_push(skb, ihl);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 65cc02bd82bc..392432860bb9 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -374,9 +374,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
esp->esph = esph;
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, sg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
 
if (!esp->inplace) {
int allocsize;
@@ -400,9 +402,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
spin_unlock_bh(&x->lock);
 
sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
-   skb_to_sgvec(skb, dsg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, dsg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
}
 
if ((x->props.flags & XFRM_STATE_ESN))
@@ -687,7 +691,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff 
*skb)
esp_input_set_header(skb, seqhi);
 
sg_init_table(sg, nfrags);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   err = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out;
 
skb->ip_summed = CHECKSUM_NONE;
 
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index dda6035e3b84..755f38271dd5 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -423,7 +423,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff 
*skb)
ip6h->hop_limit   = 0;
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 1fe99ba8066c..2ede4e459c4e 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info
esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi);
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(s

[PATCH net-next v9 4/5] macsec: check return value of skb_to_sgvec always

2017-05-23 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/macsec.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index cdc347be68f2..dfcb1e9d2ab2 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -742,7 +742,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
macsec_fill_iv(iv, secy->sci, pn);
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   macsec_txsa_put(tx_sa);
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (tx_sc->encrypt) {
int len = skb->len - macsec_hdr_len(sci_present) -
@@ -952,7 +957,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (hdr->tci_an & MACSEC_TCI_E) {
/* confidentiality: ethernet + macsec header
-- 
2.13.0



[PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always

2017-05-23 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9320d96a1632..13fbe4b349c2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff 
*skb)
struct virtio_net_hdr_mrg_rxbuf *hdr;
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
struct virtnet_info *vi = sq->vq->vdev->priv;
-   unsigned num_sg;
+   int num_sg;
unsigned hdr_len = vi->hdr_len;
bool can_push;
 
@@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
if (can_push) {
__skb_push(skb, hdr_len);
num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+if (unlikely(num_sg < 0))
+   return num_sg;
/* Pull header back to avoid skew in tx bytes calculations. */
__skb_pull(skb, hdr_len);
} else {
sg_set_buf(sq->sg, hdr, hdr_len);
-   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   if (unlikely(num_sg < 0))
+   return num_sg;
+   num_sg++;
}
return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
2.13.0



[PATCH net-next v9 3/5] rxrpc: check return value of skb_to_sgvec always

2017-05-23 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 1bb9b2ccc267..29fe20ad04aa 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct 
rxrpc_call *call,
len &= ~(call->conn->size_align - 1);
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, 0, len);
+   err = skb_to_sgvec(skb, sg, 0, len);
+   if (unlikely(err < 0))
+   goto out;
skcipher_request_set_crypt(req, sg, sg, len, iv.x);
crypto_skcipher_encrypt(req);
 
@@ -324,7 +326,7 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
bool aborted;
u32 data_size, buf;
u16 check;
-   int nsg;
+   int nsg, ret;
 
_enter("");
 
@@ -342,7 +344,9 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
goto nomem;
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, 8);
+   ret = skb_to_sgvec(skb, sg, offset, 8);
+   if (unlikely(ret < 0))
+   return ret;
 
/* start the decryption afresh */
memset(&iv, 0, sizeof(iv));
@@ -409,7 +413,7 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
bool aborted;
u32 data_size, buf;
u16 check;
-   int nsg;
+   int nsg, ret;
 
_enter(",{%d}", skb->len);
 
@@ -434,7 +438,12 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
}
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, len);
+   ret = skb_to_sgvec(skb, sg, offset, len);
+   if (unlikely(ret < 0)) {
+   if (sg != _sg)
+   kfree(sg);
+   return ret;
+   }
 
/* decrypt from the session key */
token = call->conn->params.key->payload.data[0];
-- 
2.13.0



[PATCH net-next v9 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-05-23 Thread Jason A. Donenfeld
This is a defense-in-depth measure in response to bugs like
4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of housekeeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: David Howells <dhowe...@redhat.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 include/linux/skbuff.h |  8 +++
 net/core/skbuff.c  | 65 --
 2 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95b3d84..f351df28942d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1001,10 +1001,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff 
*skb,
 unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
int newtailroom, gfp_t priority);
-int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
-   int offset, int len);
-int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset,
-int len);
+int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist 
*sg,
+int offset, int len);
+int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg,
+ int offset, int len);
 int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)   consume_skb(a)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..ec33811559e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3482,24 +3482,18 @@ void __init skb_init(void)
NULL);
 }
 
-/**
- * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- * @skb: Socket buffer containing the buffers to be mapped
- * @sg: The scatter-gather list to map into
- * @offset: The offset into the buffer's contents to start mapping
- * @len: Length of buffer space to be mapped
- *
- * Fill the specified scatter-gather list with mappings/pointers into a
- * region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len,
+  unsigned int recursion_level)
 {
int start = skb_headlen(skb);
int i, copy = start - offset;
struct sk_buff *frag_iter;
int elt = 0;
 
+   if (unlikely(recursion_level >= 24))
+   return -EMSGSIZE;
+
if (copy > 0) {
if (copy > len)
copy = len;
@@ -3518,6 +3512,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
if ((copy = end - offset) > 0) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+   return -EMSGSIZE;
 
if (copy > len)
copy = len;
@@ -3532,16 +3528,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
}
 
skb_walk_frags(skb, frag_iter) {
-   int end;
+   int end, ret;
 
WARN_ON(start > offset + len);
 
end = start + frag_iter->len;
if ((copy = end - offset) 

[PATCH net-next v9 0/5] skb_to_sgvec hardening

2017-05-23 Thread Jason A. Donenfeld
The recent bug with macsec and the historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist
without checking the length is probably a bad idea. And it's not
necessary either: an sglist already explicitly marks its last
item, and the initialization functions are diligent in doing so.
Thus there's a clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return
a potential error code, and then adjusts all callers to check
for the error code. There are two situations in which skb_to_sgvec
might return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Changes v8->v9:
   - Return correct errno in rxrpc, thanks to feedback from Dave Howells.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c | 13 --
 drivers/net/virtio_net.c |  9 +--
 include/linux/skbuff.h   |  8 +++---
 net/core/skbuff.c| 65 +++-
 net/ipv4/ah4.c   |  8 --
 net/ipv4/esp4.c  | 20 +--
 net/ipv6/ah6.c   |  8 --
 net/ipv6/esp6.c  | 20 +--
 net/rxrpc/rxkad.c| 19 ++
 9 files changed, 116 insertions(+), 54 deletions(-)

-- 
2.13.0



Re: [PATCH v8 3/5] rxrpc: check return value of skb_to_sgvec always

2017-05-16 Thread Jason A. Donenfeld
On Mon, May 15, 2017 at 3:11 PM, David Howells  wrote:
> skb_to_sgvec() can return -EMSGSIZE in some circumstances.  You shouldn't
> return -ENOMEM here in such a case.

Noted. I'll fix this up for the next round.


Re: [PATCH v8 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-05-16 Thread Jason A. Donenfeld
On Mon, May 15, 2017 at 3:12 PM, David Howells  wrote:
> Is there a reason you moved skb_to_sgvec() in the file rather than just moving
> the comment to it (since you moved the comment anyway)?

1) Because it's easier to understand skb_to_sgvec_nomark as a variant
of skb_to_sgvec, so I'd rather skb_to_sgvec be first when reading.
2) Because skb_to_sgvec relies on the return value of __skb_to_sgvec,
so when assessing it, it's sometimes nice to be able to see why it
will return different things. In that case, it's easier to have both
functions within the same view without scrolling.

It's the little things that make life easier sometimes.


Re: Implementing Dynamic Rerouting in Kernel

2017-05-12 Thread Jason A. Donenfeld
On Thu, May 11, 2017 at 6:22 PM, Florian Fainelli  wrote:
> What you are looking for can be done using ipset-dns from Jason:
>
> https://git.zx2c4.com/ipset-dns/about/

Funny to see this project coming up. I actually ported this
functionality into dnsmasq directly a few weeks after writing
ipset-dns, since that's a much more robust place to put this sort of
feature. It's the --ipset option.
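
For reference, the dnsmasq side of that looks like the following in
dnsmasq.conf (the set name "vpn" is hypothetical, and the set must
already exist, e.g. created beforehand with ipset(8)):

        # add every address that *.example.com resolves to into the
        # "vpn" ipset as answers pass through dnsmasq
        ipset=/example.com/vpn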


[PATCH v8 0/5] skb_to_sgvec hardening

2017-05-11 Thread Jason A. Donenfeld
The recent bug with macsec and the historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist
without checking the length is probably a bad idea. And it's not
necessary either: an sglist already explicitly marks its last
item, and the initialization functions are diligent in doing so.
Thus there's a clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return
a potential error code, and then adjusts all callers to check
for the error code. There are two situations in which skb_to_sgvec
might return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Changes v7->v8:
   - Added __must_check annotation.
   - Rebased against latest upstream ipsec changes.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c | 13 --
 drivers/net/virtio_net.c |  9 +--
 include/linux/skbuff.h   |  8 +++---
 net/core/skbuff.c| 65 +++-
 net/ipv4/ah4.c   |  8 --
 net/ipv4/esp4.c  | 20 +--
 net/ipv6/ah6.c   |  8 --
 net/ipv6/esp6.c  | 20 +--
 net/rxrpc/rxkad.c| 13 +++---
 9 files changed, 112 insertions(+), 52 deletions(-)

-- 
2.13.0



[PATCH v8 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-05-11 Thread Jason A. Donenfeld
This is a defense-in-depth measure in response to bugs like
4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of housekeeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: David Howells <dhowe...@redhat.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 include/linux/skbuff.h |  8 +++
 net/core/skbuff.c  | 65 --
 2 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95b3d84..f351df28942d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1001,10 +1001,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff 
*skb,
 unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
int newtailroom, gfp_t priority);
-int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
-   int offset, int len);
-int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset,
-int len);
+int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist 
*sg,
+int offset, int len);
+int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg,
+ int offset, int len);
 int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)   consume_skb(a)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..ec33811559e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3482,24 +3482,18 @@ void __init skb_init(void)
NULL);
 }
 
-/**
- * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- * @skb: Socket buffer containing the buffers to be mapped
- * @sg: The scatter-gather list to map into
- * @offset: The offset into the buffer's contents to start mapping
- * @len: Length of buffer space to be mapped
- *
- * Fill the specified scatter-gather list with mappings/pointers into a
- * region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len,
+  unsigned int recursion_level)
 {
int start = skb_headlen(skb);
int i, copy = start - offset;
struct sk_buff *frag_iter;
int elt = 0;
 
+   if (unlikely(recursion_level >= 24))
+   return -EMSGSIZE;
+
if (copy > 0) {
if (copy > len)
copy = len;
@@ -3518,6 +3512,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
if ((copy = end - offset) > 0) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+   return -EMSGSIZE;
 
if (copy > len)
copy = len;
@@ -3532,16 +3528,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
}
 
skb_walk_frags(skb, frag_iter) {
-   int end;
+   int end, ret;
 
WARN_ON(start > offset + len);
 
end = start + frag_iter->len;
if ((copy = end - offset) 

[PATCH v8 5/5] virtio_net: check return value of skb_to_sgvec always

2017-05-11 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9320d96a1632..13fbe4b349c2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff 
*skb)
struct virtio_net_hdr_mrg_rxbuf *hdr;
const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
struct virtnet_info *vi = sq->vq->vdev->priv;
-   unsigned num_sg;
+   int num_sg;
unsigned hdr_len = vi->hdr_len;
bool can_push;
 
@@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct 
sk_buff *skb)
if (can_push) {
__skb_push(skb, hdr_len);
num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+if (unlikely(num_sg < 0))
+   return num_sg;
/* Pull header back to avoid skew in tx bytes calculations. */
__skb_pull(skb, hdr_len);
} else {
sg_set_buf(sq->sg, hdr, hdr_len);
-   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+   num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+   if (unlikely(num_sg < 0))
+   return num_sg;
+   num_sg++;
}
return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
2.13.0



[PATCH v8 2/5] ipsec: check return value of skb_to_sgvec always

2017-05-11 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
---
 net/ipv4/ah4.c  |  8 ++--
 net/ipv4/esp4.c | 20 +---
 net/ipv6/ah6.c  |  8 ++--
 net/ipv6/esp6.c | 20 +---
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 22377c8ff14b..e8f862358518 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff 
*skb)
skb_push(skb, ihl);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 65cc02bd82bc..392432860bb9 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -374,9 +374,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
esp->esph = esph;
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, sg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
 
if (!esp->inplace) {
int allocsize;
@@ -400,9 +402,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info *
spin_unlock_bh(&x->lock);
 
sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
-   skb_to_sgvec(skb, dsg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(skb, dsg,
+  (unsigned char *)esph - skb->data,
+  assoclen + ivlen + esp->clen + alen);
+   if (unlikely(err < 0))
+   goto error;
}
 
if ((x->props.flags & XFRM_STATE_ESN))
@@ -687,7 +691,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff 
*skb)
esp_input_set_header(skb, seqhi);
 
sg_init_table(sg, nfrags);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   err = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out;
 
skb->ip_summed = CHECKSUM_NONE;
 
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index dda6035e3b84..755f38271dd5 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -423,7 +423,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff 
*skb)
ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
@@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff 
*skb)
ip6h->hop_limit   = 0;
 
sg_init_table(sg, nfrags + sglists);
-   skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+   if (unlikely(err < 0))
+   goto out_free;
 
if (x->props.flags & XFRM_STATE_ESN) {
/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 1fe99ba8066c..2ede4e459c4e 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff 
*skb, struct esp_info
esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi);
 
sg_init_table(sg, esp->nfrags);
-   skb_to_sgvec(skb, sg,
-(unsigned char *)esph - skb->data,
-assoclen + ivlen + esp->clen + alen);
+   err = skb_to_sgvec(s

[PATCH v8 3/5] rxrpc: check return value of skb_to_sgvec always

2017-05-11 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 1bb9b2ccc267..ecab9334e3c1 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct 
rxrpc_call *call,
len &= ~(call->conn->size_align - 1);
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, 0, len);
+   err = skb_to_sgvec(skb, sg, 0, len);
+   if (unlikely(err < 0))
+   goto out;
skcipher_request_set_crypt(req, sg, sg, len, iv.x);
crypto_skcipher_encrypt(req);
 
@@ -342,7 +344,8 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
goto nomem;
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, 8);
+   if (unlikely(skb_to_sgvec(skb, sg, offset, 8) < 0))
+   goto nomem;
 
/* start the decryption afresh */
memset(&iv, 0, sizeof(iv));
@@ -434,7 +437,11 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
}
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, len);
+   if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0)) {
+   if (sg != _sg)
+   kfree(sg);
+   goto nomem;
+   }
 
/* decrypt from the session key */
token = call->conn->params.key->payload.data[0];
-- 
2.13.0



[PATCH v8 4/5] macsec: check return value of skb_to_sgvec always

2017-05-11 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/macsec.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index cdc347be68f2..dfcb1e9d2ab2 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -742,7 +742,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
macsec_fill_iv(iv, secy->sci, pn);
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   macsec_txsa_put(tx_sa);
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (tx_sc->encrypt) {
int len = skb->len - macsec_hdr_len(sci_present) -
@@ -952,7 +957,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
sg_init_table(sg, ret);
-   skb_to_sgvec(skb, sg, 0, skb->len);
+   ret = skb_to_sgvec(skb, sg, 0, skb->len);
+   if (unlikely(ret < 0)) {
+   kfree_skb(skb);
+   return ERR_PTR(ret);
+   }
 
if (hdr->tci_an & MACSEC_TCI_E) {
/* confidentiality: ethernet + macsec header
-- 
2.13.0



Re: [PATCH v7 5/5] virtio_net: check return value of skb_to_sgvec always

2017-05-09 Thread Jason A. Donenfeld
On Tue, May 9, 2017 at 3:50 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;

(The next submission of this will take into account this + 1 here.
https://git.zx2c4.com/linux-dev/log/?h=jd/safe-skb-vec )


Re: [PATCH v7 2/5] ipsec: check return value of skb_to_sgvec always

2017-05-09 Thread Jason A. Donenfeld
(The next submission of this ipsec patch will have this rebased over
the latest upstream tree.
https://git.zx2c4.com/linux-dev/log/?h=jd/safe-skb-vec )


Re: [PATCH v7 0/5] skb_to_sgvec hardening

2017-05-09 Thread Jason A. Donenfeld
On Tue, May 9, 2017 at 4:03 PM, Johannes Berg  wrote:
> Perhaps you should add __must_check annotation to the function
> prototype(s)?

Great idea. I've started doing this in my own code. Wasn't sure how
popular it was outside of there, but I'm glad to hear a suggestion of
it now. I'll have this in v8.
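
(For anyone following along, a sketch of what the annotation buys; the
helper is hypothetical and the exact diagnostic text varies by
compiler version:)

#include <linux/compiler.h>     /* __must_check, i.e.
                                 * __attribute__((warn_unused_result)) */
#include <linux/errno.h>

static int __must_check fill_sgvec(void)
{
        return -EMSGSIZE;
}

static void careless_caller(void)
{
        fill_sgvec();   /* gcc: warning: ignoring return value of
                         * 'fill_sgvec', declared with attribute
                         * warn_unused_result */
}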


[PATCH v7 0/5] skb_to_sgvec hardening

2017-05-09 Thread Jason A. Donenfeld
The recent bug with macsec and the historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist
without checking the length is probably a bad idea. And it's not
necessary either: an sglist already explicitly marks its last
item, and the initialization functions are diligent in doing so.
Thus there's a clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return
a potential error code, and then adjusts all callers to check
for the error code. There are two situations in which skb_to_sgvec
might return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c | 13 --
 drivers/net/virtio_net.c |  4 ++-
 net/core/skbuff.c| 65 +++-
 net/ipv4/ah4.c   |  8 --
 net/ipv4/esp4.c  | 30 ++
 net/ipv6/ah6.c   |  8 --
 net/ipv6/esp6.c  | 31 +++
 net/rxrpc/rxkad.c| 13 +++---
 8 files changed, 119 insertions(+), 53 deletions(-)

-- 
2.12.2



[PATCH v7 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2017-05-09 Thread Jason A. Donenfeld
This is a defense-in-depth measure in response to bugs like
4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of housekeeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
Cc: David Howells <dhowe...@redhat.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 net/core/skbuff.c | 65 +++
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f86bf69cfb8d..ab51b97d5600 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3481,24 +3481,18 @@ void __init skb_init(void)
NULL);
 }
 
-/**
- * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- * @skb: Socket buffer containing the buffers to be mapped
- * @sg: The scatter-gather list to map into
- * @offset: The offset into the buffer's contents to start mapping
- * @len: Length of buffer space to be mapped
- *
- * Fill the specified scatter-gather list with mappings/pointers into a
- * region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int 
len,
+  unsigned int recursion_level)
 {
int start = skb_headlen(skb);
int i, copy = start - offset;
struct sk_buff *frag_iter;
int elt = 0;
 
+   if (unlikely(recursion_level >= 24))
+   return -EMSGSIZE;
+
if (copy > 0) {
if (copy > len)
copy = len;
@@ -3517,6 +3511,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
if ((copy = end - offset) > 0) {
skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+   if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+   return -EMSGSIZE;
 
if (copy > len)
copy = len;
@@ -3531,16 +3527,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
}
 
skb_walk_frags(skb, frag_iter) {
-   int end;
+   int end, ret;
 
WARN_ON(start > offset + len);
 
end = start + frag_iter->len;
if ((copy = end - offset) > 0) {
+   if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+   return -EMSGSIZE;
+
if (copy > len)
copy = len;
-   elt += __skb_to_sgvec(frag_iter, sg+elt, offset - start,
- copy);
+   ret = __skb_to_sgvec(frag_iter, sg+elt, offset - start,
+ copy, recursion_level + 1);
+   if (unlikely(ret < 0))
+   return ret;
+   elt += ret;
if ((len -= copy) == 0)
return elt;
offset += copy;
@@ -3551,6 +3553,31 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist 
*sg, int offset, int len)
return elt;
 }
 
+/**
+ * skb_to_sgvec - Fill a scatter-gather list from a socket buffer
+ * @skb: Socket buffer containing the buffers to be mapped
+ * @sg: The scatter-gather list to map into
+ * @offset: The offset into the buffer's contents t

[PATCH v7 3/5] rxrpc: check return value of skb_to_sgvec always

2017-05-09 Thread Jason A. Donenfeld
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 4374e7b9c7bf..486026689448 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -229,7 +229,9 @@ static int rxkad_secure_packet_encrypt(const struct 
rxrpc_call *call,
len &= ~(call->conn->size_align - 1);
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, 0, len);
+   err = skb_to_sgvec(skb, sg, 0, len);
+   if (unlikely(err < 0))
+   goto out;
skcipher_request_set_crypt(req, sg, sg, len, iv.x);
crypto_skcipher_encrypt(req);
 
@@ -342,7 +344,8 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, 
struct sk_buff *skb,
goto nomem;
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, 8);
+   if (unlikely(skb_to_sgvec(skb, sg, offset, 8) < 0))
+   goto nomem;
 
/* start the decryption afresh */
memset(&iv, 0, sizeof(iv));
@@ -429,7 +432,11 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, 
struct sk_buff *skb,
}
 
sg_init_table(sg, nsg);
-   skb_to_sgvec(skb, sg, offset, len);
+   if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0)) {
+   if (sg != _sg)
+   kfree(sg);
+   goto nomem;
+   }
 
/* decrypt from the session key */
token = call->conn->params.key->payload.data[0];
-- 
2.12.2


