checksumming on non-local forward path
Hey netdev,

In a protocol like wireguard, if a packet arrives on the other end of the tunnel and the crypto checks out, then we're absolutely certain that the packet hasn't been modified in transit. In this case, there's no useful information that validating the inner checksums of the various headers would give us. Every byte of the packet has arrived intact, and we're certain of it. Therefore, you might think that in a situation like this, before calling netif_receive_skb or the like, we can just set ip_summed to CHECKSUM_UNNECESSARY, csum_level to ~0, and feel glad that we're not wasting cycles unnecessarily validating the checksum.

But what if the receiving computer forwards the packet on across a real physical network? In that case, it's probably important that the kernel that originally produced the packet ensures it has a valid checksum before encrypting it and sending it through the wireguard tunnel. That's fine, but it does mean that in the case of the packet being locally received on the other end, and not forwarded, the checksumming by the original sender was not needed and was therefore wasteful.

What would you think of a flag on the receiving end like "CHECKSUM_INVALID_BUT_UNNECESSARY"? It would be treated as CHECKSUM_UNNECESSARY in the case that the packet is locally received. But if the packet is going to be forwarded instead, then skb_checksum_help is called on it before forwarding onward.

AFAICT, wireguard isn't the only thing that could benefit from this: virtio is another case where it's not always necessary for the sender to call skb_checksum_help, when the receiver could just do it conditionally based on whether the packet is being forwarded.

Thoughts?

Regards,
Jason
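To make the proposal concrete, here is a userspace model of the dispatch being described -- a sketch only, since CHECKSUM_INVALID_BUT_UNNECESSARY doesn't exist and checksum_work_needed() is a made-up helper name; in a real kernel the decision would live in the forwarding path, next to where skb_checksum_help() is already invoked:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only -- not kernel code. MODEL_CHECKSUM_NONE and
 * MODEL_CHECKSUM_UNNECESSARY stand in for the real ip_summed values;
 * the *_INVALID_BUT_UNNECESSARY entry is the hypothetical flag proposed
 * above, and checksum_work_needed() is a made-up helper.
 */
enum summed_model {
	MODEL_CHECKSUM_NONE,                    /* must be validated on receive */
	MODEL_CHECKSUM_UNNECESSARY,             /* trusted unconditionally */
	MODEL_CHECKSUM_INVALID_BUT_UNNECESSARY, /* proposed: trusted locally,
	                                         * fixed up only if forwarded */
};

/* Does the stack still have checksum work to do for this packet? */
static bool checksum_work_needed(enum summed_model summed, bool forwarding)
{
	switch (summed) {
	case MODEL_CHECKSUM_NONE:
		return true;            /* validate (or help) as usual */
	case MODEL_CHECKSUM_UNNECESSARY:
		return false;           /* nothing to do either way */
	case MODEL_CHECKSUM_INVALID_BUT_UNNECESSARY:
		/* local delivery: the crypto already proved integrity;
		 * forwarding: run the skb_checksum_help() equivalent first */
		return forwarding;
	}
	return true;
}
```

The win is the local-receive row: the sender could skip skb_checksum_help() before encryption, and the receiver would only pay for it on the (presumably rarer) forwarding path.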
Re: [PATCH net-next v6 23/23] net: WireGuard secure network tunnel
On Fri, Sep 28, 2018 at 12:37 AM Jason A. Donenfeld wrote:
> Will do. v7 will include the wg_ prefix.

$ nm *.o | while read a b c; do [[ $b == T ]] && echo $c; done | grep -v ^wg_
cleanup_module
init_module

Success.
Re: [PATCH net-next v6 23/23] net: WireGuard secure network tunnel
Hi Andrew,

Thanks for following up with this.

On Thu, Sep 27, 2018 at 3:15 AM Andrew Lunn wrote:
> I know you have been concentrating on the crypto code, so i'm not
> expecting too many changes at the moment in the network code.

I should be addressing things in parallel, actually, so I'm happy to work on this.

> WARNING: Avoid crashing the kernel - try using WARN_ON & recovery code rather
> than BUG() or BUG_ON()
> #2984: FILE: drivers/net/wireguard/noise.c:293:
> + BUG_ON(first_len > BLAKE2S_HASH_SIZE || second_len >
> BLAKE2S_HASH_SIZE ||

I was actually going to ask you about this, because it applies similarly in another context too that I'm trying to refine. The function you quote has the following properties:

- It only ever accepts fixed-length parameters, so the compiler can constant-fold invocations of it fantastically. Those parameters are fixed-length in the sense that they're enum/macro constants. They never come from the user or from a packet or something.
- It never produces an incorrect result. For said constants, all inputs are valid, and so it succeeds in producing an output every time.
- It is a "pure" function, just knocking bytes around, without needing to interact with fancy kernel-y things; it could be implemented on some sort of WWII-era rotor machine, provided you had the patience.

Because of the above, there's never any error to return to the user of the function. Also, because it only ever takes constant-sized inputs, in theory I should be able to change that BUG_ON() to BUILD_BUG_ON(), but in practice the optimizer/inliner isn't actually that aggressive. But what I would like is some way of signaling to the developer using this function that they've passed it an illegal value, and their code should not ship until that's fixed, under any circumstances at all -- that their usage of the function is completely unacceptable and wrong. Bla bla strong statements.
For this, I figured the notion would come across with the aberrant behavior of "crash the developer's [in this case, my] QEMU instance" when "#ifdef DEBUG is true". This is the same kind of place where I'd have an "assert()" statement in userland. It sounds like what you're saying is that a WARN_ON is equally as effective instead? Or given the above, do you think the BUG_ON is actually sensible? Or neither and I should do something else?

> WARNING: Macros with flow control statements should be avoided
> #5471: FILE: drivers/net/wireguard/selftest/allowedips.h:456:
> +#define init_peer(name) do { \
> +		name = kzalloc(sizeof(*name), GFP_KERNEL); \
> +		if (unlikely(!name)) { \
> +			pr_info("allowedips self-test: out of memory\n"); \
> +			goto free; \
> +		} \
> +		kref_init(&name->refcount); \
> +	} while (0)

This is part of a debugging selftest, where I'm initing a bunch of peers one after another, and this macro helps keep the test clear while offloading the actual irrelevant coding part to this macro. The test itself then just has code like:

init_peer(a); init_peer(b); init_peer(c); init_peer(d);
init_peer(e); init_peer(f); init_peer(g); init_peer(h);
insert(4, a, 192, 168, 4, 0, 24);
insert(4, b, 192, 168, 4, 4, 32);
insert(4, c, 192, 168, 0, 0, 16);
insert(4, d, 192, 95, 5, 64, 27);
/* replaces previous entry, and maskself is required */
insert(4, c, 192, 95, 5, 65, 27);
insert(6, d, 0x26075300, 0x60006b00, 0, 0xc05f0543, 128);
insert(6, c, 0x26075300, 0x60006b00, 0, 0, 64);
...

And so forth. I can probably figure out a different way to code this if you really want, but I thought this would be clear.

> The namespace pollution also needs to be addresses. You have some
> pretty generic named global symbols.
> I picked out a few examples from objdump:
>
> 2a94 g F .text 0060 peer_put
> 3484 g F .text 004c timers_stop
> 3520 g F .text 0114 packet_queue_init
> 2640 g F .text 0034 device_uninit
> 26bc g F .text 0288 peer_create
> 90d4 g F .text 01bc ratelimiter_init
>
> Please make use of a prefix for global symbols, e.g. wg_.

Will do. v7 will include the wg_ prefix.

On a slightly related note, out of curiosity, any idea what's up with the future of LTO in the kernel? It sounds like that'd be nice to have on a module-by-module basis. IOW, I'd love to LTO all of my .c files in wireguard together, and then only ever expose mod_init/exit and whatever I explicitly EXPORT_SYMBOL, and then have the compiler and linker treat the rest of everything as essentially in one .c file and optimize the heck out of it, and then strip all the
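For what it's worth, the constant-arguments property discussed above means the check can sometimes be pushed to compile time without relying on the inliner at all, by wrapping the implementation in a macro. A userspace sketch of that idea -- HASH_SIZE, kdf, and kdf_impl are hypothetical names, not the actual wireguard functions:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HASH_SIZE 32 /* stands in for BLAKE2S_HASH_SIZE */

/* The pure worker: cannot fail for any length the macro below admits. */
static void kdf_impl(unsigned char *out, size_t first_len, size_t second_len)
{
	memset(out, 0xab, first_len + second_len); /* placeholder for real KDF */
}

/*
 * Because every caller passes enum/macro constants, the length check can
 * be a C11 _Static_assert rather than a runtime BUG_ON: an illegal length
 * then breaks the developer's build instead of crashing a running kernel.
 */
#define kdf(out, first_len, second_len) do {                        \
	_Static_assert((first_len) <= HASH_SIZE &&                  \
		       (second_len) <= HASH_SIZE,                   \
		       "kdf: constant length exceeds HASH_SIZE");   \
	kdf_impl((out), (first_len), (second_len));                 \
} while (0)
```

kdf(buf, 32, 16) compiles; kdf(buf, 64, 16) refuses to. The kernel's BUILD_BUG_ON() achieves the same effect when the optimizer cooperates; the macro form just doesn't depend on that cooperation.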
Re: [PATCH net-next v6 00/23] WireGuard: Secure Network Tunnel
Hi Eric,

On Thu, Sep 27, 2018 at 8:29 PM Eric Biggers wrote:
> Why is Herbert Xu's existing crypto tree being circumvented, especially for
> future patches (the initial merge isn't quite as important as that's a
> one-time event)? I like being able to check out cryptodev to test upcoming
> crypto patches. And currently, changes to APIs, algorithms, tests, and
> implementations all go through cryptodev, which is convenient for crypto
> developers.
>
> Apparently, you're proposing that someone adding a new algorithm will now
> have to submit the API portion to one maintainer (Herbert Xu) and the
> implementation portion to another maintainer (you), and they'll go through
> separate git trees. That's inconvenient for developers, and it seems that
> in practice you and Herbert will be stepping on each other's toes a lot.
>
> Can you please reach some kind of sane agreement with Herbert so that the
> development process isn't fractured into two? Perhaps you could review
> patches, but Herbert could still apply them?

I think you're overthinking it a bit. Zinc will have a few software implementations of primitives that are useful in cases where it's nice to call the primitive directly. Think: various usages of sha2, siphash, the wireguard suite (what this patchset includes), other things in lib/, etc. In so much as this winds up duplicating things within the crypto API, I'll work with Herbert to build one on top of the other -- as I've done in the two commits in this series. But beyond that, think of the two initiatives as orthogonal. I'm working on curating a few primitives that are maximally useful throughout the kernel for various uses, and doing so in a way that I think brings about a certain quality. Meanwhile the crypto API is amassing a huge collection of primitives for some things, and that will continue to exist, and Herbert will continue to maintain that.
I expect the crossover to be fairly isolated and manageable, without too many foreseeable tree conflicts and such. Therefore, Samuel Neves and I plan to maintain the codebase we've spent quite some time writing, and maintain our own tree for it, which we'll be submitting through Greg. In other words, this is not a matter of "circumvention" or "stepping on toes", but rather separate efforts. I'm quite certain that, to the extent they overlap, we'll be able to work things out fairly easily. Either way, I'll take your suggestion and reach out to Herbert, since at least a discussion between the two of us sounds like it could be productive.

> I'm also wondering about the criteria for making additions and changes to
> "Zinc". You mentioned before that one of the "advantages" of Zinc is that
> it doesn't include "cipher modes from 90s cryptographers" -- what does that
> mean exactly? You've also indicated before that you don't want people
> modifying the Poly1305 implementations as they are too error-prone. Useful
> contributions could be blocked or discouraged in the future. Can you please
> elaborate on your criteria for contributions to Zinc?
>
> Also, will you allow algorithms that aren't up to modern security standards
> but are needed for compatibility reasons, e.g. MD5, SHA-1, and DES? There
> are existing standards, APIs, and data formats that use these "legacy"
> algorithms; so implementations of them are often still needed, whether we
> like it or not.
>
> And does it matter who designed the algorithms, e.g. do algorithms from
> Daniel Bernstein get effectively a free pass, while algorithms from certain
> countries, governments, or organizations are not allowed? E.g. wireless
> driver developers may need the SM4 block cipher (which is now supported by
> the crypto API) as it's specified in a Chinese wireless standard. Will you
> allow SM4 in Zinc?
> Or will people have to submit some algorithms to Herbert and some to you
> due to disagreements about what algorithms should be included?

Similarly here, I think you're over-politicizing everything. Stable address generation for IPv6 uses SHA1 -- see net/ipv6/addrconf.c:3203 -- do you think that this should use, say, the SM3 Chinese hash function instead? No, of course not, for a variety of interesting reasons. Rather, it should use some simple hash function that's fast in software that we have available in Zinc. On the other hand, it seems like parts of the kernel that have pretty high levels of cipher agility -- such as dmcrypt, ipsec, wifi apparently, and so on -- will continue to use a dynamic-dispatch system like the crypto API, since that's what it was made to do and is effective at doing. And so, your example of SM4 seems to fit perfectly into what the crypto API is well-suited for, and it would fit naturally in there. In other words, the "political criteria" for what we add to lib/zinc/ will mostly be the same as for the rest of lib/: are there things using it that benefit from it being there in a direct and obvious way, and does the implementation meet certain
Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations
On Thu, Sep 27, 2018 at 6:27 PM Andy Lutomirski wrote:
> I would add another consideration: if you can get better latency with
> negligible overhead (0.1%? 0.05%), then that might make sense too. For
> example, it seems plausible that checking need_resched() every few blocks
> adds basically no overhead, and the SIMD helpers could do this themselves
> or perhaps only ever do a block at a time.
>
> need_resched() costs a cacheline access, but it’s usually a hot cacheline,
> and the actual check is just whether a certain bit in memory is set.

Yes, you're right; I do plan to check quite often, rather than seldom, for this reason. I've been toying with the idea of instead processing 65k (the maximum size of a UDP packet) at a time before checking need_resched(), but armed with the 20µs figure, this isn't remotely possible on most hardware. So I'll stick with the original conservative plan of checking very often, and not making things different from the aspects worked out by the present crypto API in this regard.
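The "check very often" plan looks roughly like this in a userspace sketch. CHUNK, crypt_in_chunks, and need_resched_model are all assumed names, and real kernel code would bracket the yield point with kernel_fpu_end()/cond_resched()/kernel_fpu_begin():

```c
#include <assert.h>
#include <stddef.h>

#define CHUNK 4096 /* assumed per-batch bound; the real limit would be tuned */

static int resched_polls; /* how many times we offered to yield */

/* Stand-in for the kernel's need_resched(): a single flag test. */
static int need_resched_model(void)
{
	resched_polls++;
	return 0; /* this model is never actually asked to yield */
}

/* "Encrypt" len bytes, offering to yield between bounded batches. */
static void crypt_in_chunks(unsigned char *buf, size_t len)
{
	while (len) {
		size_t n = len < CHUNK ? len : CHUNK;
		size_t i;

		for (i = 0; i < n; i++) /* placeholder for one SIMD batch */
			buf[i] ^= 0x55;
		buf += n;
		len -= n;
		if (need_resched_model()) {
			/* kernel code would kernel_fpu_end();
			 * cond_resched(); kernel_fpu_begin(); here */
		}
	}
}
```

With a 64 KiB buffer this polls sixteen times; the abandoned idea above of polling only once per 65,536 bytes is exactly what risks blowing a ~20µs latency budget on slow hardware.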
Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations
Hey again Thomas,

On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld wrote:
> Hi Thomas,
>
> I'm trying to optimize this for crypto performance while still taking
> into account preemption concerns. I'm having a bit of trouble figuring
> out a way to determine numerically what the upper bounds for this
> stuff looks like. I'm sure I could pick a pretty sane number that's
> arguably okay -- and way under the limit -- but I still am interested
> in determining what that limit actually is. I was hoping there'd be a
> debugging option called, "warn if preemption is disabled for too
> long", or something, but I couldn't find anything like that. I'm also
> not quite sure what the latency limits are, to just compute this with
> a formula. Essentially what I'm trying to determine is:
>
> preempt_disable();
> asm volatile(".fill N, 1, 0x90;");
> preempt_enable();
>
> What is the maximum value of N for which the above is okay? What
> technique would you generally use in measuring this?
>
> Thanks,
> Jason

From talking to Peter (now CC'd) on IRC, it sounds like what you're mostly interested in is clocktime latency on reasonable hardware, with a goal of around ~20µs as a maximum upper bound? I don't expect to get anywhere near this value at all, but if you can confirm that's a decent ballpark, it would make for some interesting calculations.

Regards,
Jason
Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations
Hi Thomas,

I'm trying to optimize this for crypto performance while still taking into account preemption concerns. I'm having a bit of trouble figuring out a way to determine numerically what the upper bound for this stuff looks like. I'm sure I could pick a pretty sane number that's arguably okay -- and way under the limit -- but I still am interested in determining what that limit actually is. I was hoping there'd be a debugging option called "warn if preemption is disabled for too long", or something, but I couldn't find anything like that. I'm also not quite sure what the latency limits are, to just compute this with a formula. Essentially what I'm trying to determine is:

preempt_disable();
asm volatile(".fill N, 1, 0x90;");
preempt_enable();

What is the maximum value of N for which the above is okay? What technique would you generally use in measuring this?

Thanks,
Jason
[PATCH net-next v6 17/23] zinc: Curve25519 generic C implementations and selftest
This contains two formally verified C implementations of the Curve25519 scalar multiplication function, one for 32-bit systems, and one for 64-bit systems whose compiler supports efficient 128-bit integer types. Not only are these implementations formally verified, but they are also the fastest available C implementations. They have been modified to be friendly to kernel space and to be generally less horrendous looking, but still an effort has been made to retain their formally verified characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto
Information: https://cr.yp.to/ecdh.html

Signed-off-by: Jason A. Donenfeld
Cc: Samuel Neves
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Jean-Philippe Aumasson
Cc: Karthikeyan Bhargavan
---
 include/zinc/curve25519.h               |   22 +
 lib/zinc/Kconfig                        |    4 +
 lib/zinc/Makefile                       |    3 +
 lib/zinc/curve25519/curve25519-fiat32.h |  860 +++
 lib/zinc/curve25519/curve25519-hacl64.h |  784 ++
 lib/zinc/curve25519/curve25519.c        |  109 ++
 lib/zinc/selftest/curve25519.h          | 1321 +++
 7 files changed, 3103 insertions(+)
 create mode 100644 include/zinc/curve25519.h
 create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
 create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
 create mode 100644 lib/zinc/curve25519/curve25519.c
 create mode 100644 lib/zinc/selftest/curve25519.h

diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index ..def173e736fc
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */ + +#ifndef _ZINC_CURVE25519_H +#define _ZINC_CURVE25519_H + +#include + +enum curve25519_lengths { + CURVE25519_KEY_SIZE = 32 +}; + +bool __must_check curve25519(u8 mypublic[CURVE25519_KEY_SIZE], +const u8 secret[CURVE25519_KEY_SIZE], +const u8 basepoint[CURVE25519_KEY_SIZE]); +void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE]); +bool __must_check curve25519_generate_public( + u8 pub[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE]); + +#endif /* _ZINC_CURVE25519_H */ diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig index 7bf4bc88f81f..0d8a94dd73a8 100644 --- a/lib/zinc/Kconfig +++ b/lib/zinc/Kconfig @@ -14,6 +14,10 @@ config ZINC_CHACHA20POLY1305 config ZINC_BLAKE2S tristate +config ZINC_CURVE25519 + tristate + select CONFIG_CRYPTO + config ZINC_DEBUG bool "Zinc cryptography library debugging and self-tests" help diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile index 67ad837c822c..65440438c6e5 100644 --- a/lib/zinc/Makefile +++ b/lib/zinc/Makefile @@ -25,3 +25,6 @@ obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o zinc_blake2s-y := blake2s/blake2s.o zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o + +zinc_curve25519-y := curve25519/curve25519.o +obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o diff --git a/lib/zinc/curve25519/curve25519-fiat32.h b/lib/zinc/curve25519/curve25519-fiat32.h new file mode 100644 index ..32b5ec7aa040 --- /dev/null +++ b/lib/zinc/curve25519/curve25519-fiat32.h @@ -0,0 +1,860 @@ +/* SPDX-License-Identifier: GPL-2.0 OR MIT */ +/* + * Copyright (C) 2015-2016 The fiat-crypto Authors. + * Copyright (C) 2018 Jason A. Donenfeld . All Rights Reserved. + * + * This is a machine-generated formally verified implementation of Curve25519 + * ECDH from: <https://github.com/mit-plv/fiat-crypto>. Though originally + * machine generated, it has been tweaked to be suitable for use in the kernel. 
+ * It is optimized for 32-bit machines and machines that cannot work efficiently + * with 128-bit integer types. + */ + +/* fe means field element. Here the field is \Z/(2^255-19). An element t, + * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77 + * t[3]+2^102 t[4]+...+2^230 t[9]. + * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc. + * Multiplication and carrying produce fe from fe_loose. + */ +typedef struct fe { u32 v[10]; } fe; + +/* fe_loose limbs are bounded by 3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc + * Addition and subtraction produce fe_loose from (fe, fe). + */ +typedef struct fe_loose { u32 v[10]; } fe_loose; + +static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s) +{ + /* Ignores top bit of s. */ + u32 a0 = get_unaligned_le32(s); + u32 a1 = get_unaligned_le32(s+4); + u32 a2 = get_unaligned_le32(s+8); + u32 a3 = get_unaligned_le32(s+12); + u32 a4 = get_una
[PATCH net-next v5 13/20] zinc: BLAKE2s x86_64 implementation
These implementations from Samuel Neves support AVX and AVX-512VL. Originally this used AVX-512F, but Skylake thermal throttling made AVX-512VL more attractive and possible to do with negligible difference.

Signed-off-by: Jason A. Donenfeld
Signed-off-by: Samuel Neves
Cc: Andy Lutomirski
Cc: Greg KH
Cc: Jean-Philippe Aumasson
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: x...@kernel.org
---
 lib/zinc/Makefile                      |   1 +
 lib/zinc/blake2s/blake2s-x86_64-glue.h |  59 +++
 lib/zinc/blake2s/blake2s-x86_64.S      | 685 +
 lib/zinc/blake2s/blake2s.c             |   4 +-
 4 files changed, 748 insertions(+), 1 deletion(-)
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64-glue.h
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 0e0020f58d27..47607486c39d 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -22,4 +22,5 @@ zinc_chacha20poly1305-y := chacha20poly1305.o
 obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
 zinc_blake2s-y := blake2s/blake2s.o
+zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
 obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o

diff --git a/lib/zinc/blake2s/blake2s-x86_64-glue.h b/lib/zinc/blake2s/blake2s-x86_64-glue.h
new file mode 100644
index ..22900ef8e7fe
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64-glue.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: MIT
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */ + +#include +#include +#include +#include + +#ifdef CONFIG_AS_AVX +asmlinkage void blake2s_compress_avx(struct blake2s_state *state, +const u8 *block, const size_t nblocks, +const u32 inc); +#endif +#ifdef CONFIG_AS_AVX512 +asmlinkage void blake2s_compress_avx512(struct blake2s_state *state, + const u8 *block, const size_t nblocks, + const u32 inc); +#endif + +static bool blake2s_use_avx __ro_after_init; +static bool blake2s_use_avx512 __ro_after_init; + +static void __init blake2s_fpu_init(void) +{ + blake2s_use_avx = + boot_cpu_has(X86_FEATURE_AVX) && + cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL); + blake2s_use_avx512 = + boot_cpu_has(X86_FEATURE_AVX) && + boot_cpu_has(X86_FEATURE_AVX2) && + boot_cpu_has(X86_FEATURE_AVX512F) && + boot_cpu_has(X86_FEATURE_AVX512VL) && + cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM | + XFEATURE_MASK_AVX512, NULL); +} + +static inline bool blake2s_arch(struct blake2s_state *state, const u8 *block, + size_t nblocks, const u32 inc) +{ +#ifdef CONFIG_AS_AVX512 + if (blake2s_use_avx512 && irq_fpu_usable()) { + kernel_fpu_begin(); + blake2s_compress_avx512(state, block, nblocks, inc); + kernel_fpu_end(); + return true; + } +#endif +#ifdef CONFIG_AS_AVX + if (blake2s_use_avx && irq_fpu_usable()) { + kernel_fpu_begin(); + blake2s_compress_avx(state, block, nblocks, inc); + kernel_fpu_end(); + return true; + } +#endif + return false; +} diff --git a/lib/zinc/blake2s/blake2s-x86_64.S b/lib/zinc/blake2s/blake2s-x86_64.S new file mode 100644 index ..360be4818d06 --- /dev/null +++ b/lib/zinc/blake2s/blake2s-x86_64.S @@ -0,0 +1,685 @@ +/* SPDX-License-Identifier: MIT + * + * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved. + * Copyright (C) 2017 Samuel Neves . All Rights Reserved. 
+ */ + +#include + +.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32 +.align 32 +IV:.octa 0xA54FF53A3C6EF372BB67AE856A09E667 + .octa 0x5BE0CD191F83D9AB9B05688C510E527F +.section .rodata.cst16.ROT16, "aM", @progbits, 16 +.align 16 +ROT16: .octa 0x0D0C0F0E09080B0A0504070601000302 +.section .rodata.cst16.ROR328, "aM", @progbits, 16 +.align 16 +ROR328:.octa 0x0C0F0E0D080B0A090407060500030201 +#ifdef CONFIG_AS_AVX512 +.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 640 +.align 64 +SIGMA: +.long 0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15 +.long 11, 2, 12, 14, 9, 8, 15, 3, 4, 0, 13, 6, 10, 1, 7, 5 +.long 10, 12, 11, 6, 5, 9, 13, 3, 4, 15, 14, 2, 0, 7, 8, 1 +.long 10, 9, 7, 0, 11, 14, 1, 12, 6, 2, 15, 3, 13, 8, 5, 4 +.long 4, 9, 8, 13, 14, 0, 10, 11, 7, 3, 12, 1, 5, 6, 15, 2 +.long 2, 10, 4, 14, 13, 3, 9, 11, 6, 5, 7, 12, 15, 1, 8, 0 +.long 4, 11, 14, 8, 13, 10, 12, 5, 2, 1, 15, 3, 9, 7, 0, 6 +.long 6, 12, 0, 13, 15, 2, 1, 10, 4, 5, 11, 14, 8, 3, 9, 7 +.long 14, 5, 4, 12, 9, 7, 3, 10, 2, 0, 6, 15, 11, 1, 13, 8 +.long 11, 7, 13, 10, 12, 14, 0,
Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc
Hey Martin, Thanks for running these and pointing this out. I've replicated the results with tcrypt and fixed some issues, and the next patch series should be a lot closer to what you'd expect, instead of the regression you noticed. Most of the slowdown happened as a result of over-eager XSAVEs, which I've now rectified. I'm still working on a few other facets of it, but I believe v5 will be more satisfactory when posted. Regards, Jason
Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library
Hi Eric,

On Wed, Sep 12, 2018 at 12:08 AM Eric Biggers wrote:
> I'd strongly prefer the assembly to be readable too. Jason, I'm not sure if
> you've actually read through the asm from the OpenSSL implementations, but
> the generated .S files actually do lose a lot of semantic information that
> was in the original .pl scripts.

The thing to keep in mind is that the .S was not directly and blindly generated from the .pl. We started with the output of the .pl, and then, particularly in the case of x86_64, worked with it a lot, and now it's something a bit different. We've definitely spent a lot of time reading that assembly. I'll see if I can improve the readability with some register name remapping on ARM. No guarantees, but I'll play a bit and see if I can make it a bit better.

Jason
Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
On Tue, Aug 21, 2018 at 4:54 PM David Miller wrote: > > From: "Jason A. Donenfeld" > Date: Tue, 21 Aug 2018 16:41:50 -0700 > > > Is 100 in fact acceptable for new code? 120? 180? What's the > > generally accepted limit these days? > > Please keep it as close to 80 columns as possible. > > Line breaks are not ugly, please embrace them :) Okay! Will do. Thanks for the response. Jason
Re: [PATCH v1 3/3] net: WireGuard secure network tunnel
On Tue, Jul 31, 2018 at 1:22 PM Stephen Hemminger wrote: > > On Tue, 31 Jul 2018 21:11:02 +0200 > "Jason A. Donenfeld" wrote: > > > +static int walk_by_peer(struct allowedips_node __rcu *top, u8 bits, struct > > allowedips_cursor *cursor, struct wireguard_peer *peer, int (*func)(void > > *ctx, const u8 *ip, u8 cidr, int family), void *ctx, struct mutex *lock) > > +{ > > Please break lines at something reasonable like 100 characters. As I mentioned earlier, things will certainly be that way for v2, and I'm in the process of that. I do care quite a bit about having "pretty code", and so while I'm wrapping, I want to do it well and consistently. Indeed 100 characters is far more reasonable than 80 -- things wind up much less annoying. And 120 is even easier. I don't want to have to start renaming nice descriptive struct names and nice descriptive struct member names just to avoid the line breaks in function signatures, or something silly like that. In any case, before I take out my word wrapping paint brush and make the prettiest wrapping I can, I wanted to double check and confirm what the norm is for the net tree. Is 100 in fact acceptable for new code? 120? 180? What's the generally accepted limit these days? (And, folks, please don't let this inquiry spiral out of control into a bikeshed thread. I'm narrowly interested in knowing what the facts are on what the norm is and what will be accepted here, rather than a cacophony of random opinions.)
[PATCH v2] fib_rules: match rules based on suppress_* properties too
Two rules with different values of suppress_prefix or suppress_ifgroup are not the same. This fixes an -EEXIST when running: $ ip -4 rule add table main suppress_prefixlength 0 Signed-off-by: Jason A. Donenfeld Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule") --- This adds the new condition you mentioned. I'm not sure what you make of DaveM's remark about this not being in the original code, but here is nonetheless the requested change. net/core/fib_rules.c | 8 1 file changed, 8 insertions(+) diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 126ffc5bc630..bc8425d81022 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -416,6 +416,14 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops, if (rule->mark && r->mark != rule->mark) continue; + if (rule->suppress_ifgroup != -1 && + r->suppress_ifgroup != rule->suppress_ifgroup) + continue; + + if (rule->suppress_prefixlen != -1 && + r->suppress_prefixlen != rule->suppress_prefixlen) + continue; + if (rule->mark_mask && r->mark_mask != rule->mark_mask) continue; -- 2.17.1
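The -1 guards are the important difference from the first version of this patch: -1 is the "attribute not specified in the request" sentinel, so an unconditional comparison would prevent requests that omit these attributes from ever matching. A userspace model of just this slice of rule_find -- struct rule_model and rule_matches are made-up names, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of the two fields the patch compares; -1 means unset. */
struct rule_model {
	int suppress_prefixlen;
	int suppress_ifgroup;
};

/* Mirrors the v2 logic: only compare fields the request specified. */
static bool rule_matches(const struct rule_model *req,
			 const struct rule_model *r)
{
	if (req->suppress_ifgroup != -1 &&
	    r->suppress_ifgroup != req->suppress_ifgroup)
		return false;
	if (req->suppress_prefixlen != -1 &&
	    r->suppress_prefixlen != req->suppress_prefixlen)
		return false;
	return true;
}
```

With the unconditional comparisons from the first version, a request that leaves suppress_prefixlength unset could never match an installed rule that sets it, so deletes of such rules would break in the other direction.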
[PATCH] fib_rules: match rules based on suppress_* properties too
Two rules with different values of suppress_prefix or suppress_ifgroup are not the same. This fixes an -EEXIST when running: $ ip -4 rule add table main suppress_prefixlength 0 Signed-off-by: Jason A. Donenfeld Fixes: f9d4b0c1e969 ("fib_rules: move common handling of newrule delrule msgs into fib_nl2rule") --- net/core/fib_rules.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c index 126ffc5bc630..665799311b98 100644 --- a/net/core/fib_rules.c +++ b/net/core/fib_rules.c @@ -416,6 +416,12 @@ static struct fib_rule *rule_find(struct fib_rules_ops *ops, if (rule->mark && r->mark != rule->mark) continue; + if (r->suppress_ifgroup != rule->suppress_ifgroup) + continue; + + if (r->suppress_prefixlen != rule->suppress_prefixlen) + continue; + if (rule->mark_mask && r->mark_mask != rule->mark_mask) continue; -- 2.17.1
[net regression] "fib_rules: move common handling of newrule delrule msgs into fib_nl2rule" breaks suppress_prefixlength
Hey Roopa, On a kernel with a minimal networking config, CONFIG_IP_MULTIPLE_TABLES appears to be broken for certain rules after f9d4b0c1e9695e3de7af3768205bacc27312320c. Try, for example, running: $ ip -4 rule add table main suppress_prefixlength 0 It returns with EEXIST. Perhaps the reason is that the new rule_find function does not match on suppress_prefixlength? However, rule_exist from before didn't do that either. I'll keep playing and see if I can track it down myself, but thought I should let you know first. A relevant .config can be found at https://א.cc/iq5HoUY0 Jason
Re: safe skb resetting after decapsulation and encapsulation
On Sat, May 12, 2018 at 4:07 AM, Md. Islam wrote:
> I'm not an expert on this, but it looks about right.

Really? Even zeroing between headers_start and headers_end? With the latest RHEL 7.5 kernel's i40e driver, doing this results in a crash in kfree. It's possible redhat is putting something silly within headers_start and headers_end, and so zeroing it is bad, but I suspect that instead blanket zeroing it like that might actually be incorrect.

> look at build_skb() or __build_skb(). It shows the fields that needs to be set

These just kmalloc a new skb, with most fields set to zero. The ones it modifies are the ones I'm modifying anyway when messing with the data the skb contains. Doesn't look like there's much help there. I wrote the original post wondering precisely which of 1-14 are incorrect, and whether anything specific is missing from there.
safe skb resetting after decapsulation and encapsulation
Hey Netdev,

A UDP skb comes in via the encap_rcv interface. I do a lot of wild things to the bytes in the skb -- change where the head starts, modify a few fragments, decrypt some stuff, trim off some things at the end, etc. In other words, I'm decapsulating the skb in a pretty intense way. I benefit from reusing the same skb, performance-wise, but after I'm done processing it, it's really a totally new skb. Eventually it's time to pass off my skb to netif_receive_skb/netif_rx, but before I do that, I need to "reinitialize" the skb. (The same goes for when sending out an skb -- I get it from userspace via ndo_start_xmit, do crazy things to it, and eventually pass it off to the udp_tunnel send functions, but first "reinitializing" it.) At the moment I'm using a function that looks like this:

static void jasons_wild_and_crazy_skb_reset(struct sk_buff *skb)
{
	skb_scrub_packet(skb, true); //1
	memset(&skb->headers_start, 0, offsetof(struct sk_buff, headers_end) -
	       offsetof(struct sk_buff, headers_start)); //2
	skb->queue_mapping = 0; //3
	skb->nohdr = 0; //4
	skb->peeked = 0; //5
	skb->mac_len = 0; //6
	skb->dev = NULL; //7
#ifdef CONFIG_NET_SCHED
	skb->tc_index = 0; //8
	skb_reset_tc(skb); //9
#endif
	skb->hdr_len = skb_headroom(skb); //10
	skb_reset_mac_header(skb); //11
	skb_reset_network_header(skb); //12
	skb_probe_transport_header(skb, 0); //13
	skb_reset_inner_headers(skb); //14
}

I'm sure that some of this is wrong. Most of it is based on part of an Octeon ethernet driver I read a few years ago. I numbered each statement above, hoping to go through it with you all in detail here, and see what we can cut away and what we can approve.

1. Obviously correct and required.
2. This is probably wrong. At least it causes crashes when receiving packets from RHEL 7.5's latest i40e driver in their vendor frankenkernel, because those flags there have some critical bits related to allocation. But there are a lot of flags in there that I might consider going through one by one and zeroing out.
3-5. Fields that should be zero, I assume, after decapsulating/decrypting (and encapsulating/encrypting).
6. WireGuard is layer 3, so there's no mac.
7. We're later going to change the dev this came in on.
8-9. Same flaky rationale as 2,3-5.
10. Since the headroom has changed during the various modifications, I need to let that field know about it.
11-14. The beginning of the headers has changed, and so resetting and probing is necessary for this to work at all.
So I'm wondering - how much of this is necessary? How much am I unnecessarily reinventing things that exist elsewhere? I'm pretty sure in most cases the driver would work with only 1,10-14, but I worry that bad things would happen in more unusual configurations. I've tried to systematically go through the entire stack and see where these might be used or not used, but it seems really inconsistent. So, I'm writing to ask whether somebody has an easy simplification or rule for handling this kind of intense decapsulation/decryption (and, in the other direction, encapsulation/encryption) operation. I'd like to make sure I get this down solid. Thanks, Jason
Re: [PATCH] netlink: put module reference if dump start fails
Thanks! Jason
Re: [PATCH] netlink: put module reference if dump start fails
Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")
Re: [PATCH] netlink: put module reference if dump start fails
On Wed, Feb 21, 2018 at 6:47 AM, Eric Dumazet wrote: >> This probably should be queued up for stable. > > When was the bug added ? This would help a lot stable teams ... This needs to be backported to 4.16-rc0+, 4.15+, 4.14+, 4.13.14+, and 4.9.63+.
[PATCH] netlink: put module reference if dump start fails
Before, if cb->start() failed, the module reference would never be put, because cb->cb_running is intentionally false at this point. Users are generally annoyed by this because they can no longer unload modules that leak references. Also, it may be possible to tediously wrap a reference counter back to zero, especially since module.c still uses atomic_inc instead of refcount_inc. This patch expands the error path to simply call module_put if cb->start() fails.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
This probably should be queued up for stable.

 net/netlink/af_netlink.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 2ad445c1d27c..07e8478068f0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2308,7 +2308,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	if (cb->start) {
 		ret = cb->start(cb);
 		if (ret)
-			goto error_unlock;
+			goto error_put;
 	}
 
 	nlk->cb_running = true;
@@ -2328,6 +2328,8 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	 */
 	return -EINTR;
 
+error_put:
+	module_put(control->module);
 error_unlock:
 	sock_put(sk);
 	mutex_unlock(nlk->cb_mutex);
-- 
2.16.1
Re: WireGuard Upstreaming Roadmap (November 2017)
Hi Dave, On Fri, Dec 8, 2017 at 4:38 PM, David Miller wrote: > Sorry, you cannot force the discussion of a feature which will be submitted > upstream to occur on a private mailing list. > > It is _ABSOLUTELY_ appropriate to discuss this on netdev since it is the > netdev community which must consider issues like this when looking at > whether to accept WireGuard upstream. > > Jason, this action and response was entirely inappropriate, and please > I'd like you to reply properly to questions about your feature here. Whoops, okay! Very sorry. I'm actually kind of happy to hear that. I had assumed that you'd be annoyed if WireGuard crypto discussion spewed over into netdev adding even more message volume there for something perhaps not directly relevant. But in fact, you're interested and find it important to discuss there. So, good news. And sorry for trying to shoo it away before. I'll copy and paste the response I had made on the other list: > This is covered in the paper: > https://www.wireguard.com/papers/wireguard.pdf > > The basic answer is that WireGuard has message type identifiers, and > the handshake also hashes into it an identifier of the primitives > used. If there's ever a problem with those primitives chosen, it will > be possible to introduce new message type identifiers, if that kind of > "support everything even the broken garbage" approach is desired by > misguided people. However, a better approach, of course, is to keep > your secure network separate from your insecure network, and to not > allow insecure nodes on secure segments; when you mix the two, > disaster tends to strike. So, in other words, both approaches to "upgrading" > are possible, in this fantasy wireguardalypse. Take note, though, that > neither one of these approaches (support new and retain old protocol > too for old nodes, or only support new) are "agile" or are anything at > all like the 90s "cipher agility" -- the user still is not permitted > to "choose" ciphers. Regards, Jason
Re: WireGuard Upstreaming Roadmap (November 2017)
On Thu, Dec 7, 2017 at 11:22 AM, Stefan Tatschner wrote: > I have a question which is related to the involved crypto. As far as I > have understood the protocol and the concept of wireguard > What's your opinion on this? This thread has been picked up on the WireGuard mailing list, not here. Since this concerns the inner workings of the protocol and cryptography as a whole, as opposed to implementation details of Linux, please do not send these inquiries to LKML. Additionally, please start new threads for new topics in the future, rather than hijacking a roadmap thread. Look for my answer on the other mailing list. I'll CC you too. Regards, Jason
Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps
On Sat, Nov 11, 2017 at 11:18 PM, Johannes Berg wrote: > >> > If you're handling this by forcing another read() to produce the >> > NLMSG_DONE, then you have no reason to WARN_ON() here. >> > >> > In fact you are adding a WARN_ON() which is trivially triggerable by >> > any user. >> >> I added this in my suggestion for how this could work, but I don't >> think you're right, since we previously check if there's enough space. > > Or perhaps I should say this differently: > > Forcing another read happens through the > > skb_tailroom(skb) < nlmsg_total_size(...) > > check, so the nlmsg_put_answer() can't really fail. > > > Handling nlmsg_put_answer() failures by forcing another read would have > required jumping to the existing if code with a goto, or restructuring > the whole thing completely somehow, and I didn't see how to do that. Exactly. And if something _does_ go wrong in our logic, and we can't add NLMSG_DONE, we really do want people to report this to us, since dumps must always end that way. We'd probably have caught this a number of years ago when userspace developers were twiddling with their receive buffers if we had had the WARN_ON. Nice suggestion from Johannes.
WireGuard Upstreaming Roadmap (November 2017)
Hi folks, This relates to WireGuard [0]. Following a very nice conference with the Linux kernel networking subsystem community [1,2], I thought it might be a good idea to spell out the roadmap for the coming months and the trajectory into upstream, based on my discussions with several developers and maintainers. There are several threads of this, the biggest of which surrounds the kernel code base, but there are some other ends of the WireGuard project as a whole that are also relevant. The current biggest blocker is issues with the crypto API. Before WireGuard can go upstream, I intend to embark on a multi-pronged effort to overhaul the crypto API. I very much need to sync up with Herbert regarding my plans for this, and start spec'ing things out a bit more formally, so I can begin concrete discussions with him. I intend to base my work both on feedback from linux-crypto/Herbert and from the cryptographic research community. I hope to go to RWC2018 [3] and the subsequent HACS workshop for the academic engagement side, but of course like all the work I do on the kernel, things will be highly based in engineering, rather than purely academic, practices. Dave has strongly encouraged me to post patches sooner rather than later. So I think before the crypto API is ready to go, I'll likely post a [RFG] -- request for grumbles -- patch set to netdev, in order to have some code review, so as to gauge where we're at. This patch set will use my current crypto API, not the kernel's crypto API, with it mentioned in the opening that I intend to switch to the kernel's crypto API when it looks like the one used here. This way we'll get early feedback so that the later real [PATCH] series will go more smoothly. There are a few WireGuard features that some of you have been waiting for. 
At the urging of some folks at the conference, I intend to submit a "core" WireGuard to upstream, and then use future releases to build on top of that, to make the initial upstreaming process go as smoothly as possible. Therefore, there are three big TODO items that may _or_ may not go in after upstreaming:
- In-band control messages [possibly not]
- Netlink multicast event notification
- GRO integration
Of these, it's most likely I'll implement GRO and leave the other two until after, but this may change. Since WireGuard already does GSO, it would make sense to implement the other side of things too. There's also a longer more ambitious roadmap [4], filled with both easy coding-related things and longer term design things, but that's out of scope for this email, though many of those items will likely be completed before submission time. There are also six other threads of development that are ongoing, which I intend to put a bit more focus on too in the near future:
- The userspace implementation. I'd like to bring this up to deployment quality, which naturally fits into the next area.
- Mobile apps. It's fairly easy to integrate the userspace implementation with existing APIs. The current Android app already works well with the kernel module, but of course people want this more easily deployed.
- Mac and Windows support for the userspace implementation. These are already mostly done, but the APIs used may in fact change, so there may still be a bit of work to do here before we're satisfied.
- Bindings and libraries. Now that we have a stable Netlink API, we can start making nice wrappers for the various languages people like to use. It remains to be seen whether or not a C "libwireguard" is needed, since in that domain, talking Netlink directly is often a better choice, but I do see some potential for a pywireguard and the like. This will also be essential when the already mentioned plans for event notification and the possibly excluded control messages materialize.
- More formal verification. While we have the cryptographic protocol verified, there are still more places where formalism is quite helpful, proving more state machines, and even proving C implementations to be correct. Work and research is ongoing in this domain.
- Integration into network managers and routing daemons (mesh and traditional). Work has already begun here on systemd-networkd, and others are looking into daemons like babel and bird.
So that's where we're at. I hope to have a RFG submitted in the next several months, and hopefully we gather some nice momentum and get the work upstreamed and the project completed soon, for some definition of "complete". If you'd like to work on WireGuard, or simply have any questions, don't hesitate to email me. Regards, Jason [0] https://www.wireguard.com/ [1] https://www.netdevconf.org/2.2/ [2] https://www.wireguard.com/presentations/ [3] https://rwc.iacr.org [4] https://www.wireguard.com/todo/
Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps
On Sat, Nov 11, 2017 at 11:37 AM, David Miller wrote: > Jason I'm already pushing my luck as-is with the pull request I made > yesterday. > > I've seen your original request to get this in, you don't have to say > it multiple times. > > We can get this into the merge window and submit it for -stable, so > please relax. Whoops, sorry! Okay, no problem. Jason
Re: [PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps
IIRC, 4.14 comes tomorrow-ish? If possible, it would be nice to get this in 4.14 before then, so it doesn't have to take time to trickle down through stable. Jason On Thu, Nov 9, 2017 at 1:04 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote: > The way people generally use netlink_dump is that they fill in the skb > as much as possible, breaking when nla_put returns an error. Then, they > get called again and start filling out the next skb, and again, and so > forth. The mechanism at work here is the ability for the iterative > dumping function to detect when the skb is filled up and not fill it > past the brim, waiting for a fresh skb for the rest of the data. > > However, if the attributes are small and nicely packed, it is possible > that a dump callback function successfully fills in attributes until the > skb is of size 4080 (libmnl's default page-sized receive buffer size). > The dump function completes, satisfied, and then, if it happens to be > that this is actually the last skb, and no further ones are to be sent, > then netlink_dump will add on the NLMSG_DONE part: > > nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); > > It is very important that netlink_dump does this, of course. However, in > this example, that call to nlmsg_put_answer will fail, because the > previous filling by the dump function did not leave it enough room. And > how could it possibly have done so? All of the nla_put variety of > functions simply check to see if the skb has enough tailroom, > independent of the context it is in. > > In order to keep the important assumptions of all netlink dump users, it > is therefore important to give them an skb that has this end part of the > tail already reserved, so that the call to nlmsg_put_answer does not > fail. Otherwise, library authors are forced to find some bizarre sized > receive buffer that has a large modulo relative to the common sizes of > messages received, which is ugly and buggy. 
>
> This patch thus saves the NLMSG_DONE for an additional message, for the
> case that things are dangerously close to the brim. This requires
> keeping track of the errno from ->dump() across calls.
>
> Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
> ---
> Can we get this into 4.14? Is there still time? It should also be queued
> up for stable.
>
> Changes v3->v4:
> - I like long lines. The kernel does not. Wrapped at 80 now.
>
>  net/netlink/af_netlink.c | 17 +++++++++++------
>  net/netlink/af_netlink.h |  1 +
>  2 files changed, 12 insertions(+), 6 deletions(-)
>
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index b93148e8e9fb..15c99dfa3d72 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
>  	struct sk_buff *skb = NULL;
>  	struct nlmsghdr *nlh;
>  	struct module *module;
> -	int len, err = -ENOBUFS;
> +	int err = -ENOBUFS;
>  	int alloc_min_size;
>  	int alloc_size;
>
> @@ -2183,9 +2183,11 @@ static int netlink_dump(struct sock *sk)
>  	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
>  	netlink_skb_set_owner_r(skb, sk);
>
> -	len = cb->dump(skb, cb);
> +	if (nlk->dump_done_errno > 0)
> +		nlk->dump_done_errno = cb->dump(skb, cb);
>
> -	if (len > 0) {
> +	if (nlk->dump_done_errno > 0 ||
> +	    skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
>  		mutex_unlock(nlk->cb_mutex);
>
>  		if (sk_filter(sk, skb))
> @@ -2195,13 +2197,15 @@ static int netlink_dump(struct sock *sk)
>  		return 0;
>  	}
>
> -	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
> -	if (!nlh)
> +	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
> +			       sizeof(nlk->dump_done_errno), NLM_F_MULTI);
> +	if (WARN_ON(!nlh))
>  		goto errout_skb;
>
>  	nl_dump_check_consistent(cb, nlh);
>
> -	memcpy(nlmsg_data(nlh), &len, sizeof(len));
> +	memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
> +	       sizeof(nlk->dump_done_errno));
>
>  	if (sk_filter(sk, skb))
>  		kfree_skb(skb);
> @@ -2273,6 +2277,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
>  	}
>
>  	nlk->cb_running = true;
> +	nlk->dump_done_errno = INT_MAX;
>
>  	mutex_unlock(nlk->cb_mutex);
>
> diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
> index 028188597eaa..962de7b3c023 100644
> --- a/net/netlink/af_netlink.
[PATCH v4] af_netlink: ensure that NLMSG_DONE never fails in dumps
The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus saves the NLMSG_DONE for an additional message, for the case that things are dangerously close to the brim. This requires keeping track of the errno from ->dump() across calls. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- Can we get this into 4.14? Is there still time? 
It should also be queued up for stable.

Changes v3->v4:
- I like long lines. The kernel does not. Wrapped at 80 now.

 net/netlink/af_netlink.c | 17 +++++++++++------
 net/netlink/af_netlink.h |  1 +
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..15c99dfa3d72 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
 	struct module *module;
-	int len, err = -ENOBUFS;
+	int err = -ENOBUFS;
 	int alloc_min_size;
 	int alloc_size;
 
@@ -2183,9 +2183,11 @@ static int netlink_dump(struct sock *sk)
 	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
-	len = cb->dump(skb, cb);
+	if (nlk->dump_done_errno > 0)
+		nlk->dump_done_errno = cb->dump(skb, cb);
 
-	if (len > 0) {
+	if (nlk->dump_done_errno > 0 ||
+	    skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
 		mutex_unlock(nlk->cb_mutex);
 
 		if (sk_filter(sk, skb))
@@ -2195,13 +2197,15 @@ static int netlink_dump(struct sock *sk)
 		return 0;
 	}
 
-	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-	if (!nlh)
+	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
+			       sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+	if (WARN_ON(!nlh))
 		goto errout_skb;
 
 	nl_dump_check_consistent(cb, nlh);
 
-	memcpy(nlmsg_data(nlh), &len, sizeof(len));
+	memcpy(nlmsg_data(nlh), &nlk->dump_done_errno,
+	       sizeof(nlk->dump_done_errno));
 
 	if (sk_filter(sk, skb))
 		kfree_skb(skb);
@@ -2273,6 +2277,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	}
 
 	nlk->cb_running = true;
+	nlk->dump_done_errno = INT_MAX;
 
 	mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
 	wait_queue_head_t	wait;
 	bool			bound;
 	bool			cb_running;
+	int			dump_done_errno;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
 	struct mutex		cb_def_mutex;
-- 
2.15.0
Re: [PATCH v3] af_netlink: ensure that NLMSG_DONE never fails in dumps
On Thu, Nov 9, 2017 at 11:02 AM, Johannes Berg wrote: > nit: I think your line got a little long here :) Ack. For v4. > and here Ack. For v4. > >> + nlk->dump_done_errno = INT_MAX; > > I guess positive values aren't really returned from dump? When a positive value is returned, the API uses this to signify that the dump is not done, and it should be called again. When 0 or negative is returned, it means the dump is complete and the return value should be returned in DONE. We therefore initialize dump_done_errno to some positive value, so that we can make calling ->dump() conditional on it being positive. When it becomes zero or negative, it's time to return. This mirrors the semantics of the actual ->dump() function precisely.
[PATCH v3] af_netlink: ensure that NLMSG_DONE never fails in dumps
The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus saves the NLMSG_DONE for an additional message, for the case that things are dangerously close to the brim. This requires keeping track of the errno from ->dump() across calls. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- Can we get this into 4.14? Is there still time? 
It should also be queued up for stable.

Changes v2->v3:
- Johannes didn't like the subject line of the patch, so the only thing
  that's changed in this version is the new subject line.

 net/netlink/af_netlink.c | 14 ++++++++------
 net/netlink/af_netlink.h |  1 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..7020689e643e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
 	struct module *module;
-	int len, err = -ENOBUFS;
+	int err = -ENOBUFS;
 	int alloc_min_size;
 	int alloc_size;
 
@@ -2183,9 +2183,10 @@ static int netlink_dump(struct sock *sk)
 	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
-	len = cb->dump(skb, cb);
+	if (nlk->dump_done_errno > 0)
+		nlk->dump_done_errno = cb->dump(skb, cb);
 
-	if (len > 0) {
+	if (nlk->dump_done_errno > 0 || skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
 		mutex_unlock(nlk->cb_mutex);
 
 		if (sk_filter(sk, skb))
@@ -2195,13 +2196,13 @@ static int netlink_dump(struct sock *sk)
 		return 0;
 	}
 
-	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-	if (!nlh)
+	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+	if (WARN_ON(!nlh))
 		goto errout_skb;
 
 	nl_dump_check_consistent(cb, nlh);
 
-	memcpy(nlmsg_data(nlh), &len, sizeof(len));
+	memcpy(nlmsg_data(nlh), &nlk->dump_done_errno, sizeof(nlk->dump_done_errno));
 
 	if (sk_filter(sk, skb))
 		kfree_skb(skb);
@@ -2273,6 +2274,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	}
 
 	nlk->cb_running = true;
+	nlk->dump_done_errno = INT_MAX;
 
 	mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
 	wait_queue_head_t	wait;
 	bool			bound;
 	bool			cb_running;
+	int			dump_done_errno;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
 	struct mutex		cb_def_mutex;
-- 
2.15.0
[PATCH v2] af_netlink: give correct bounds to dump skb for NLMSG_DONE
The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus saves the NLMSG_DONE for an additional message, for the case that things are dangerously close to the brim. This requires keeping track of the errno from ->dump() across calls. Signed-off-by: Jason A. 
Donenfeld <ja...@zx2c4.com>
---
Changes v1->v2:
- This implements things without having to twiddle with the skb, at the
  expense of potentially having an extra back-and-forth and quite a bit
  of added complexity.

 net/netlink/af_netlink.c | 14 ++++++++------
 net/netlink/af_netlink.h |  1 +
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..7020689e643e 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2136,7 +2136,7 @@ static int netlink_dump(struct sock *sk)
 	struct sk_buff *skb = NULL;
 	struct nlmsghdr *nlh;
 	struct module *module;
-	int len, err = -ENOBUFS;
+	int err = -ENOBUFS;
 	int alloc_min_size;
 	int alloc_size;
 
@@ -2183,9 +2183,10 @@ static int netlink_dump(struct sock *sk)
 	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
-	len = cb->dump(skb, cb);
+	if (nlk->dump_done_errno > 0)
+		nlk->dump_done_errno = cb->dump(skb, cb);
 
-	if (len > 0) {
+	if (nlk->dump_done_errno > 0 || skb_tailroom(skb) < nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
 		mutex_unlock(nlk->cb_mutex);
 
 		if (sk_filter(sk, skb))
@@ -2195,13 +2196,13 @@ static int netlink_dump(struct sock *sk)
 		return 0;
 	}
 
-	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI);
-	if (!nlh)
+	nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+	if (WARN_ON(!nlh))
 		goto errout_skb;
 
 	nl_dump_check_consistent(cb, nlh);
 
-	memcpy(nlmsg_data(nlh), &len, sizeof(len));
+	memcpy(nlmsg_data(nlh), &nlk->dump_done_errno, sizeof(nlk->dump_done_errno));
 
 	if (sk_filter(sk, skb))
 		kfree_skb(skb);
@@ -2273,6 +2274,7 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	}
 
 	nlk->cb_running = true;
+	nlk->dump_done_errno = INT_MAX;
 
 	mutex_unlock(nlk->cb_mutex);
 
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 028188597eaa..962de7b3c023 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -34,6 +34,7 @@ struct netlink_sock {
 	wait_queue_head_t	wait;
 	bool			bound;
 	bool			cb_running;
+	int			dump_done_errno;
 	struct netlink_callback	cb;
 	struct mutex		*cb_mutex;
 	struct mutex		cb_def_mutex;
-- 
2.15.0
Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE
Erf, your patch doesn't handle what happens if len comes back negative, but I'll fix it up and send a v2 using this approach. I think I really prefer v1 though. Jason
Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE
Hi Johannes, Yes indeed. It sacrifices 24 bytes for making things much less complex. However, if you prefer increasing the complexity of the state machine a bit instead, I suppose we could roll with this approach instead... Jason
Re: [PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE
By the way, in case you're curious, here's the {up,down,cross}stream WireGuard commit that works around it via its compat layer (a rat's nest of hideous backports for all the weird kernels people want WireGuard to run on, which I cannot wait to remove): https://git.zx2c4.com/WireGuard/commit/?id=f689ea7acc23dc8e0968699d964ee382b04fbbe4 Particularly relevant here is the last chunk of that, which is part of the automated test suite, which reproduces the issue by finding the tightest possible packing.
[PATCH] af_netlink: give correct bounds to dump skb for NLMSG_DONE
The way people generally use netlink_dump is that they fill in the skb as much as possible, breaking when nla_put returns an error. Then, they get called again and start filling out the next skb, and again, and so forth. The mechanism at work here is the ability for the iterative dumping function to detect when the skb is filled up and not fill it past the brim, waiting for a fresh skb for the rest of the data. However, if the attributes are small and nicely packed, it is possible that a dump callback function successfully fills in attributes until the skb is of size 4080 (libmnl's default page-sized receive buffer size). The dump function completes, satisfied, and then, if it happens to be that this is actually the last skb, and no further ones are to be sent, then netlink_dump will add on the NLMSG_DONE part: nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, sizeof(len), NLM_F_MULTI); It is very important that netlink_dump does this, of course. However, in this example, that call to nlmsg_put_answer will fail, because the previous filling by the dump function did not leave it enough room. And how could it possibly have done so? All of the nla_put variety of functions simply check to see if the skb has enough tailroom, independent of the context it is in. In order to keep the important assumptions of all netlink dump users, it is therefore important to give them an skb that has this end part of the tail already reserved, so that the call to nlmsg_put_answer does not fail. Otherwise, library authors are forced to find some bizarre sized receive buffer that has a large modulo relative to the common sizes of messages received, which is ugly and buggy. This patch thus reserves and restores the required length for NLMSG_DONE during the call to the dump function. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- See you all at netdevconf tomorrow! 
 net/netlink/af_netlink.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index b93148e8e9fb..b2d0a2fb1939 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2183,7 +2183,9 @@ static int netlink_dump(struct sock *sk)
 	skb_reserve(skb, skb_tailroom(skb) - alloc_size);
 	netlink_skb_set_owner_r(skb, sk);
 
+	skb->end -= nlmsg_total_size(sizeof(len));
 	len = cb->dump(skb, cb);
+	skb->end += nlmsg_total_size(sizeof(len));
 
 	if (len > 0) {
 		mutex_unlock(nlk->cb_mutex);
-- 
2.15.0
Re: [PATCH net] tun/tap: sanitize TUNSETSNDBUF input
Here's a simple reproducer, in case syzkaller's case was overcomplicated:

#include <linux/if.h>
#include <linux/if_tun.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	struct ifreq ifr;
	int fd, sock;

	fd = open("/dev/net/tun", O_RDWR);
	if (fd < 0) {
		perror("open(/dev/net/tun)");
		return 1;
	}

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TUN;
	strncpy(ifr.ifr_name, "yikes", IFNAMSIZ);
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		perror("TUNSETIFF");
		return 1;
	}

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("socket");
		return 1;
	}

	ifr.ifr_flags = IFF_UP;
	if (ioctl(sock, SIOCSIFFLAGS, &ifr) < 0) {
		perror("SIOCSIFFLAGS");
		return 1;
	}
	close(sock);
	sock = -1;

	if (ioctl(fd, TUNSETSNDBUF, &sock)) {
		perror("TUNSETSNDBUF");
		return 1;
	}
	if (write(fd, &fd, sizeof(fd)) < 0) {
		perror("write");
		return 1;
	}
	return 0;
}
[PATCH] mac80211: use constant time comparison with keys
Otherwise we risk leaking information via a timing side channel.

Fixes: fdf7cb4185b6 ("mac80211: accept key reinstall without changing anything")
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
---
 net/mac80211/key.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mac80211/key.c b/net/mac80211/key.c
index ae995c8480db..035d16fe926e 100644
--- a/net/mac80211/key.c
+++ b/net/mac80211/key.c
@@ -19,6 +19,7 @@
 #include
 #include
 #include
+#include <crypto/algapi.h>
 #include
 #include "ieee80211_i.h"
 #include "driver-ops.h"
@@ -635,7 +636,7 @@ int ieee80211_key_link(struct ieee80211_key *key,
 	 * new version of the key to avoid nonce reuse or replay issues.
 	 */
 	if (old_key && key->conf.keylen == old_key->conf.keylen &&
-	    !memcmp(key->conf.key, old_key->conf.key, key->conf.keylen)) {
+	    !crypto_memneq(key->conf.key, old_key->conf.key, key->conf.keylen)) {
 		ieee80211_key_free_unused(key);
 		ret = 0;
 		goto out;
-- 
2.14.2
Re: pull-request: mac80211 2017-10-16
On Tue, Oct 17, 2017 at 7:46 AM, Johannes Berg wrote: > If it's not equal, you execute so much code > beneath, going to the driver etc., that I'd think this particular time > is in the noise. Usually presumptions like this get you in trouble when some crafty academic has a smart idea about that noise. I'll send a patch.
Re: pull-request: mac80211 2017-10-16
I'm on a mobile phone right now, so not able to write a patch, but you should probably be using crypto_memneq for comparing those two keys, not memcmp. Jason
Re: [PATCH v2] netlink: do not set cb_running if dump's start() errs
On Mon, Oct 9, 2017 at 2:31 PM, Johannes Berg wrote: > > Reviewed-by: Johannes Berg Thanks for the review. Hopefully this can make it into 4.13.6 and 4.14-rc5.
[PATCH v2] netlink: do not set cb_running if dump's start() errs
It turns out that multiple places can call netlink_dump(), which means it's still possible to dereference partially initialized values in dump() that were the result of a faulty returned start(). This fixes the issue by calling start() _before_ setting cb_running to true, so that there's no chance at all of hitting the dump() function through any indirect paths. It also moves the call to start() to be when the mutex is held. This has the nice side effect of serializing invocations to start(), which is likely desirable anyway. It also prevents any possible other races that might come out of this logic. In testing this with several different pieces of tricky code to trigger these issues, this commit fixes all avenues that I'm aware of.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94c11cf0459d..f34750691c5c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2266,16 +2266,17 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	cb->min_dump_alloc = control->min_dump_alloc;
 	cb->skb = skb;
 
+	if (cb->start) {
+		ret = cb->start(cb);
+		if (ret)
+			goto error_unlock;
+	}
+
 	nlk->cb_running = true;
 
 	mutex_unlock(nlk->cb_mutex);
 
-	ret = 0;
-	if (cb->start)
-		ret = cb->start(cb);
-
-	if (!ret)
-		ret = netlink_dump(sk);
+	ret = netlink_dump(sk);
 
 	sock_put(sk);
-- 
2.14.2
Re: [PATCH] netlink: do not set cb_running if dump's start() errs
Hi Johannes, Yes, indeed, and I think that's actually a good thing. It means that the starts aren't ever called in parallel, which could be useful if drivers have any ordering requirements. The change doesn't negatively affect any existing drivers. I'll resubmit with a larger commit message explaining this. Jason
[PATCH] netlink: do not set cb_running if dump's start() errs
It turns out that multiple places can call netlink_dump(), which means it's still possible to dereference partially initialized values in dump() that were the result of a faulty returned start(). This fixes the issue by calling start() _before_ setting cb_running to true, so that there's no chance at all of hitting the dump() function through any indirect paths. In testing this with several different pieces of tricky code to trigger these issues, this commit fixes all avenues that I'm aware of.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 94c11cf0459d..f34750691c5c 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2266,16 +2266,17 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 	cb->min_dump_alloc = control->min_dump_alloc;
 	cb->skb = skb;
 
+	if (cb->start) {
+		ret = cb->start(cb);
+		if (ret)
+			goto error_unlock;
+	}
+
 	nlk->cb_running = true;
 
 	mutex_unlock(nlk->cb_mutex);
 
-	ret = 0;
-	if (cb->start)
-		ret = cb->start(cb);
-
-	if (!ret)
-		ret = netlink_dump(sk);
+	ret = netlink_dump(sk);
 
 	sock_put(sk);
-- 
2.14.2
Re: [PATCH] netlink: do not set cb_running if dump's start() errs
Dave - this very likely should be queued up for stable. Jason
Re: netlink backwards compatibility in userspace tools
Hi Dave, That seems wise. Thanks for the advice. A more sophisticated way of approaching this would be for the kernel to also send a bitmap of which attributes are "critical" and only warn (or even error) if _those_ are not understood. But that seems needlessly complex, and so I think I'll go with your suggestion, and simply warn once with a shorter message. Jason
multi producer / multi consumer lock-free queue with ptr_ring
Hey Michael, Thanks for your work on ptr_ring.h. I'm interested in using it, but in a multi-producer, multi-consumer context. I realize it's been designed for a single-producer, single-consumer context, and thus uses a spinlock. I'm wondering if you'd be happy to receive patches that implement things in a lock-free way, in order to make the data structure more broadly usable. In case you're curious, this would be used for the multi-core algorithms in WireGuard. Thanks, Jason
Re: cross namespace interface notification for tun devices
On Mon, Oct 2, 2017 at 11:32 AM, Nicolas Dichtel wrote: > 1. Move the process to netns B, open the netlink socket and move back the > process to netns A. The socket will remain in netns B and you will receive all > netlink messages related to netns B. > > 2. Assign a nsid to netns B in netns A and use NETLINK_LISTEN_ALL_NSID on your > netlink socket (see iproute2). Both of these seem to rely on the process knowing where the device is being moved and having access to that namespace. I don't think these two things are a given though. Unless I'm missing something? Jason
netlink backwards compatibility in userspace tools
Hi guys, One handy aspect of Netlink is that it's backwards compatible. This means that you can run old userspace utilities on new kernels, even if the new kernel supports new features and netlink attributes. The wire format is stable enough that the data marshaled can be extended without breaking compat. Neat. I was wondering, though, what you think the best stance is toward these old userspace utilities. What should they do if the kernel sends them netlink attributes that they do not recognize? At the moment, I'm doing something like this:

static void warn_unrecognized(void)
{
	static bool once = false;

	if (once)
		return;
	once = true;
	fprintf(stderr,
		"Warning: this program received from your kernel one or more\n"
		"attributes that it did not recognize. It is possible that\n"
		"this version of wg(8) is older than your kernel. You may\n"
		"want to update this program.\n");
}

This seems like a somewhat sensible warning, but then I wonder about distributions like Debian, which has a long stable life cycle, so it frequently has very old tools (ancient iproute2 for example). Then, VPS providers have these Debian images run on top of newer kernels. People in this situation would undoubtedly see the above warning a lot and not be able to do anything about it. Not horrible, but a bit annoying. Is this an okay annoyance? Or is it advised to just have no warning at all? One idea would be to put it behind an environment variable flag, but I don't like too many knobs. I'm generally wondering about attitudes toward this kind of userspace program behavior in response to newer kernels. Thanks, Jason
Re: [PATCH v2] netlink: do not proceed if dump's start() errs
On Thu, Sep 28, 2017 at 12:41:44AM +0200, Jason A. Donenfeld wrote:
> diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
> index 327807731b44..94c11cf0459d 100644
> --- a/net/netlink/af_netlink.c
> +++ b/net/netlink/af_netlink.c
> @@ -2270,10 +2270,13 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
>
> 	mutex_unlock(nlk->cb_mutex);
>
> +	ret = 0;
> 	if (cb->start)
> -		cb->start(cb);
> +		ret = cb->start(cb);
> +
> +	if (!ret)
> +		ret = netlink_dump(sk);
>
> -	ret = netlink_dump(sk);
> 	sock_put(sk);
>
> 	if (ret)

FYI, I'm currently using this horrific hack to work around the bug:

#define KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX KERNEL_VERSION(4, 14, 0) /* Hopefully! */

static int get(struct sk_buff *skb, struct netlink_callback *cb)
{
#if LINUX_VERSION_CODE < KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX
	/* https://marc.info/?l=linux-netdev&m=150655213004221&w=2 */
	if (!cb->args[0]) {
		ret = get_start(cb);
		if (ret)
			return ret;
	}
#endif
	...
}

static const struct genl_ops genl_ops[] = {
	{
		.cmd = CMD_GET,
#if LINUX_VERSION_CODE >= KERNEL_VERSION_THAT_HAS_NETLINK_START_FIX
		/* https://marc.info/?l=linux-netdev&m=150655213004221&w=2 */
		.start = get_start,
#endif
		.dumpit = get,
		...
	}
}

Gross.
[PATCH v2] netlink: do not proceed if dump's start() errs
Drivers that use the start method for netlink dumping rely on dumpit not being called if start fails. For example, ila_xlat.c allocates memory and assigns it to cb->args[0] in its start() function. It might fail to do that and return -ENOMEM instead. However, even when returning an error, dumpit will be called, which, in the example above, quickly dereferences the memory in cb->args[0], which will OOPS the kernel. This is but one example of how this goes wrong. Since start() has always been a function with an int return type, it therefore makes sense to use it properly, rather than ignoring it. This patch thus returns early and does not call dumpit() when start() fails.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Johannes Berg <johan...@sipsolutions.net>
---
 net/netlink/af_netlink.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 327807731b44..94c11cf0459d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2270,10 +2270,13 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
 	mutex_unlock(nlk->cb_mutex);
 
+	ret = 0;
 	if (cb->start)
-		cb->start(cb);
+		ret = cb->start(cb);
+
+	if (!ret)
+		ret = netlink_dump(sk);
 
-	ret = netlink_dump(sk);
 	sock_put(sk);
 
 	if (ret)
-- 
2.14.1
Re: [PATCH] netlink: do not proceed if dump's start() errs
On Wed, Sep 27, 2017 at 3:05 PM, Johannes Berg wrote:
> I guess you could change it to
>
> 	if (cb->start)
> 		ret = cb->start(cb);
> 	if (!ret)
> 		ret = netlink_dump(sk);

Very clean. I'll do it like that. I'll wait a bit longer before submitting v2, but beyond that, seems sane to you? Jason
Re: [PATCH] netlink: do not proceed if dump's start() errs
On Wed, Sep 27, 2017 at 2:39 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> -	if (cb->start)
> -		cb->start(cb);
> +	if (cb->start) {
> +		ret = cb->start(cb);
> +		if (ret)

I need to sock_put(sk); before returning. I'll fix this for v2, but will wait for additional comments in case anybody has some.

> +			return ret;
> +	}
>
> 	ret = netlink_dump(sk);
> 	sock_put(sk);
[PATCH] netlink: do not proceed if dump's start() errs
Drivers that use the start method for netlink dumping rely on dumpit not being called if start fails. For example, ila_xlat.c allocates memory and assigns it to cb->args[0] in its start() function. It might fail to do that and return -ENOMEM instead. However, even when returning an error, dumpit will be called, which, in the example above, quickly dereferences the memory in cb->args[0], which will OOPS the kernel. This is but one example of how this goes wrong. Since start() has always been a function with an int return type, it therefore makes sense to use it properly, rather than ignoring it. This patch thus returns early and does not call dumpit() when start() fails.

Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: sta...@vger.kernel.org
---
 net/netlink/af_netlink.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 327807731b44..be179876227d 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2270,8 +2270,11 @@ int __netlink_dump_start(struct sock *ssk, struct sk_buff *skb,
 
 	mutex_unlock(nlk->cb_mutex);
 
-	if (cb->start)
-		cb->start(cb);
+	if (cb->start) {
+		ret = cb->start(cb);
+		if (ret)
+			return ret;
+	}
 
 	ret = netlink_dump(sk);
 	sock_put(sk);
-- 
2.14.1
Re: cross namespace interface notification for tun devices
On Wed, Sep 20, 2017 at 8:29 PM, Cong Wang wrote: > Sounds like we should set NETIF_F_NETNS_LOCAL for tun > device. Absolutely do not do this under any circumstances. This would be a regression and would break API compatibility. As I wrote in my first email, it's already possible to sleep-loop for that information using the tun device's fd; I'm just looking for a better event-based approach. > What is your legitimate use case of send/receive packet to/from > a tun device in a different netns? Because sometimes it's very nice to be able to move network interfaces that use tun devices into different namespaces, for some cross-namespace proxying. What Dan described in the email he just sent is exactly this use case. In WireGuard (a kernel thing), I have facilities for this -- https://www.wireguard.com/netns/ . Now I'm working on the userspace version and would like to expose the same utility. Anyway, the purpose of me sending this message to the list was not to question the "legitimacy" of my application usage, but rather to elicit feedback on two specific things: 1. to determine if there's already a mechanism in place for this that I've overlooked; and 2. to determine particularities of me implementing a mechanism, if it's not already there. I'm slightly more convinced that there isn't currently a mechanism for this. It seems like the easiest way, therefore, would be some kind of control message that could be poll'd for, using the existing per-process fd. That way there wouldn't be any violations of the current namespace situation, yet processes could still get event notifications as needed.
Re: cross namespace interface notification for tun devices
On Tue, Sep 19, 2017 at 10:40 PM, Cong Wang wrote: > By "notification" I assume you mean netlink notification. Yes, netlink notification. > The question is why does the process in A still care about > the device sitting in B? > > Also, the process should be able to receive a last notification > on IFF_UP|IFF_RUNNING before device is finally moved to B. > After this point, it should not have any relation to netns A > any more, like the device were completely gone. That's very clearly not the case with a tun device. Tun devices work by letting a userspace process control the inputs (ndo_start_xmit) and outputs (netif_rx) of the actual network device. This controlling userspace process needs to know when the interface it controls goes up and down. In the kernel, we can do this by just checking dev->flags & IFF_UP, and receive notifications on ndo_open and ndo_stop. In userspace, the controlling process loses the ability to receive notifications like ndo_open/ndo_stop when the interface is moved to a new namespace. After the interface is moved to a namespace, the process will still control inputs and outputs (ndo_start_xmit and netif_rx), but it will no longer receive netlink notifications for the equivalent of ndo_open and ndo_stop. This is problematic.
Re: cross namespace interface notification for tun devices
On Mon, Sep 18, 2017 at 8:47 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> The best I've come up with is, in a sleep loop, writing to the tun
> device's fd something with a NULL or invalid payload. If the interface
> is down, the kernel returns -EIO. If the interface is up, the kernel
> returns -EINVAL. This seems to be a reliable distinguisher, but is a
> pretty insane way of doing it. And sleep loops are somewhat different
> from events too.

Specifically, I'm referring to the horrific hack exemplified in the attached .c file, in case anybody is curious about the details of what I'd rather not use.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int main(int argc, char *argv[])
{
	/* If IFF_NO_PI is specified, this still sort of works but it
	 * bumps the device error counters, which we don't want, so
	 * it's best not to use this trick with IFF_NO_PI. */
	struct ifreq ifr = { .ifr_flags = IFF_TUN };
	int tun, sock, ret;

	tun = open("/dev/net/tun", O_RDWR);
	if (tun < 0) {
		perror("[-] open(/dev/net/tun)");
		return 1;
	}
	sock = socket(AF_INET, SOCK_DGRAM, 0);
	if (sock < 0) {
		perror("[-] socket(AF_INET, SOCK_DGRAM)");
		return 1;
	}
	ret = ioctl(tun, TUNSETIFF, &ifr);
	if (ret < 0) {
		perror("[-] ioctl(TUNSETIFF)");
		return 1;
	}
	if (write(tun, NULL, 0) >= 0 || errno != EIO)
		perror("[-] write(if:down, NULL, 0) did not return -EIO");
	else
		fprintf(stderr, "[+] write(if:down, NULL, 0) returned -EIO: test successful\n");
	ifr.ifr_flags = IFF_UP;
	ret = ioctl(sock, SIOCSIFFLAGS, &ifr);
	if (ret < 0) {
		perror("[-] ioctl(SIOCSIFFLAGS)");
		return 1;
	}
	if (write(tun, NULL, 0) >= 0 || errno != EINVAL)
		perror("[-] write(if:up, NULL, 0) did not return -EINVAL");
	else
		fprintf(stderr, "[+] write(if:up, NULL, 0) returned -EINVAL: test successful\n");
	return 0;
}
cross namespace interface notification for tun devices
Hey guys, It's possible to create a tun device in a process in namespace A and then move that interface to namespace B. The controlling process in A needs to receive notifications when the interface is brought up or down. It can receive these notifications via netlink while the interface lives in A, but not when it moves to B. Any tricks or APIs to get around this? The best I've come up with is, in a sleep loop, writing to the tun device's fd something with a NULL or invalid payload. If the interface is down, the kernel returns -EIO. If the interface is up, the kernel returns -EINVAL. This seems to be a reliable distinguisher, but is a pretty insane way of doing it. And sleep loops are somewhat different from events too. If there aren't any current APIs for receiving events directly from the fd of a tun interface, would this list be happy with a patch that adds one? Thanks, Jason
[PATCH] ioc3-eth: store pointer to net_device for private area
Computing the alignment manually for going from priv to pub is probably not such a good idea, and in general the assumption that going from priv to pub is trivially possible could change, so rather than relying on that, we change things to just store a pointer to pub. This was suggested by DaveM in [1]. [1] http://www.spinics.net/lists/netdev/msg443992.html Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- Ralf - I don't have the platform to test this out, so you might want to briefly put it through the paces before giving it your sign-off. drivers/net/ethernet/sgi/ioc3-eth.c | 14 +++++--------- 1 file changed, 5 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c index b607936e1b3e..9c0488e0f08e 100644 --- a/drivers/net/ethernet/sgi/ioc3-eth.c +++ b/drivers/net/ethernet/sgi/ioc3-eth.c @@ -90,17 +90,13 @@ struct ioc3_private { spinlock_t ioc3_lock; struct mii_if_info mii; + struct net_device *dev; struct pci_dev *pdev; /* Members used by autonegotiation */ struct timer_list ioc3_timer; }; -static inline struct net_device *priv_netdev(struct ioc3_private *dev) -{ - return (void *)dev - ((sizeof(struct net_device) + 31) & ~31); -} - static int ioc3_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); static void ioc3_set_multicast_list(struct net_device *dev); static int ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev); @@ -427,7 +423,7 @@ static void ioc3_get_eaddr_nic(struct ioc3_private *ip) nic[i] = nic_read_byte(ioc3); for (i = 2; i < 8; i++) - priv_netdev(ip)->dev_addr[i - 2] = nic[i]; + ip->dev->dev_addr[i - 2] = nic[i]; } /* @@ -439,7 +435,7 @@ static void ioc3_get_eaddr(struct ioc3_private *ip) { ioc3_get_eaddr_nic(ip); - printk("Ethernet address is %pM.\n", priv_netdev(ip)->dev_addr); + printk("Ethernet address is %pM.\n", ip->dev->dev_addr); } static void __ioc3_set_mac_address(struct net_device *dev) @@ -790,13 +786,12 @@ static void ioc3_timer(unsigned long data) */ static int 
ioc3_mii_init(struct ioc3_private *ip) { - struct net_device *dev = priv_netdev(ip); int i, found = 0, res = 0; int ioc3_phy_workaround = 1; u16 word; for (i = 0; i < 32; i++) { - word = ioc3_mdio_read(dev, i, MII_PHYSID1); + word = ioc3_mdio_read(ip->dev, i, MII_PHYSID1); if (word != 0xffff && word != 0x0000) { found = 1; @@ -1276,6 +1271,7 @@ static int ioc3_probe(struct pci_dev *pdev, const struct pci_device_id *ent) SET_NETDEV_DEV(dev, &pdev->dev); ip = netdev_priv(dev); + ip->dev = dev; dev->irq = pdev->irq; -- 2.13.2
Re: [PATCH 1/2] netdevice: add netdev_pub helper function
On Mon, Jul 10, 2017 at 10:04 AM, David Miller wrote: > I disagree. Assuming one can go from the driver private to the netdev > object trivially is a worse assumption than the other way around, and > locks us into the current implementation of how the netdev and driver > private memory is allocated. > > If you want to style your driver such that the private is passed > around instead of the netdev, put a pointer back to the netdev object > in your private data structure. I'm surprised you're okay with the memory waste of that, but you bring up the ability to change the interface later, which is a great point. I'll submit a patch for that random driver, and I'll also refactor WireGuard to do the same. Thanks for the guidance. Jason
[PATCH 2/2] ioc3-eth: use netdev_pub instead of handrolling alignment
It's safer to use the generic library function for this, rather than reinventing it here with hard-coded alignment values. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- drivers/net/ethernet/sgi/ioc3-eth.c | 11 +++ 1 file changed, 3 insertions(+), 8 deletions(-) diff --git a/drivers/net/ethernet/sgi/ioc3-eth.c b/drivers/net/ethernet/sgi/ioc3-eth.c index b607936e1b3e..514eca163ea5 100644 --- a/drivers/net/ethernet/sgi/ioc3-eth.c +++ b/drivers/net/ethernet/sgi/ioc3-eth.c @@ -86,31 +86,26 @@ struct ioc3_private { int tx_ci; /* TX consumer index */ int tx_pi; /* TX producer index */ int txqlen; u32 emcr, ehar_h, ehar_l; spinlock_t ioc3_lock; struct mii_if_info mii; struct pci_dev *pdev; /* Members used by autonegotiation */ struct timer_list ioc3_timer; }; -static inline struct net_device *priv_netdev(struct ioc3_private *dev) -{ - return (void *)dev - ((sizeof(struct net_device) + 31) & ~31); -} - static int ioc3_ioctl(struct net_device *dev, struct ifreq *rq, int cmd); static void ioc3_set_multicast_list(struct net_device *dev); static int ioc3_start_xmit(struct sk_buff *skb, struct net_device *dev); static void ioc3_timeout(struct net_device *dev); static inline unsigned int ioc3_hash(const unsigned char *addr); static inline void ioc3_stop(struct ioc3_private *ip); static void ioc3_init(struct net_device *dev); static const char ioc3_str[] = "IOC3 Ethernet"; static const struct ethtool_ops ioc3_ethtool_ops; /* We use this to acquire receive skb's that we can DMA directly into. */ @@ -417,39 +412,39 @@ static void ioc3_get_eaddr_nic(struct ioc3_private *ip) printk("Failed to read MAC address\n"); return; } /* Read Memory. */ nic_write_byte(ioc3, 0xf0); nic_write_byte(ioc3, 0x00); nic_write_byte(ioc3, 0x00); for (i = 13; i >= 0; i--) nic[i] = nic_read_byte(ioc3); for (i = 2; i < 8; i++) - priv_netdev(ip)->dev_addr[i - 2] = nic[i]; + netdev_pub(ip)->dev_addr[i - 2] = nic[i]; } /* * Ok, this is hosed by design. 
It's necessary to know what machine the * NIC is in in order to know how to read the NIC address. We also have * to know if it's a PCI card or a NIC in on the node board ... */ static void ioc3_get_eaddr(struct ioc3_private *ip) { ioc3_get_eaddr_nic(ip); - printk("Ethernet address is %pM.\n", priv_netdev(ip)->dev_addr); + printk("Ethernet address is %pM.\n", netdev_pub(ip)->dev_addr); } static void __ioc3_set_mac_address(struct net_device *dev) { struct ioc3_private *ip = netdev_priv(dev); struct ioc3 *ioc3 = ip->regs; ioc3_w_emar_h((dev->dev_addr[5] << 8) | dev->dev_addr[4]); ioc3_w_emar_l((dev->dev_addr[3] << 24) | (dev->dev_addr[2] << 16) | (dev->dev_addr[1] << 8) | dev->dev_addr[0]); } static int ioc3_set_mac_address(struct net_device *dev, void *addr) @@ -780,27 +775,27 @@ static void ioc3_timer(unsigned long data) add_timer(&ip->ioc3_timer); } /* * Try to find a PHY. There is no apparent relation between the MII addresses * in the SGI documentation and what we find in reality, so we simply probe * for the PHY. It seems IOC3 PHYs usually live on address 31. One of my * onboard IOC3s has the special oddity that probing doesn't seem to find it * yet the interface seems to work fine, so if probing fails we for now will * simply default to PHY 31 instead of bailing out. */ static int ioc3_mii_init(struct ioc3_private *ip) { - struct net_device *dev = priv_netdev(ip); + struct net_device *dev = netdev_pub(ip); int i, found = 0, res = 0; int ioc3_phy_workaround = 1; u16 word; for (i = 0; i < 32; i++) { word = ioc3_mdio_read(dev, i, MII_PHYSID1); if (word != 0xffff && word != 0x0000) { found = 1; break; /* Found a PHY */ } } -- 2.13.2
[PATCH 1/2] netdevice: add netdev_pub helper function
Being able to utilize this makes code a lot simpler and cleaner. It's easier in many cases for drivers to pass around their private data structure, while occasionally needing to dip into net_device, rather than the other way around, which results in tons of calls to netdev_priv at the top of every single function, which makes everything confusing and less clear. Additionally, this enables a "correct" way of doing such a thing, instead of having drivers attempt to reinvent the wheel and screw it up. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> --- include/linux/netdevice.h | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 779b23595596..83d58504e5c4 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2030,26 +2030,37 @@ void dev_net_set(struct net_device *dev, struct net *net) } /** * netdev_priv - access network device private data * @dev: network device * * Get network device private data */ static inline void *netdev_priv(const struct net_device *dev) { return (char *)dev + ALIGN(sizeof(struct net_device), NETDEV_ALIGN); } +/** + * netdev_pub - access network device from private pointer + * @priv: private data pointer of network device + * + * Get network device from a network device private data pointer + */ +static inline struct net_device *netdev_pub(void *priv) +{ + return (struct net_device *)((char *)priv - ALIGN(sizeof(struct net_device), NETDEV_ALIGN)); +} + /* Set the sysfs physical device reference for the network logical device * if set prior to registration will cause a symlink during initialization. */ #define SET_NETDEV_DEV(net, pdev) ((net)->dev.parent = (pdev)) /* Set the sysfs device type for the network logical device to allow * fine-grained identification of different network device types. For * example Ethernet, Wireless LAN, Bluetooth, WiMAX etc. 
*/ #define SET_NETDEV_DEVTYPE(net, devtype) ((net)->dev.type = (devtype)) /* Default NAPI poll() weight * Device drivers are strongly advised to not use bigger value -- 2.13.2
Re: net: Fix inconsistent teardown and release of private netdev state.
On Sat, Jul 8, 2017 at 12:39 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Thu, Jul 6, 2017 at 7:24 AM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
>> list_add(&priv->list, &list_of_things);
>>
>> ret = register_netdevice(dev); // if ret is < 0, then destruct above is
>> automatically called
>>
>> // RACE WITH LIST_ADD/LIST_DEL!! It's impossible to call list_add only after
>> // things are brought up successfully. This is problematic.
>>
>> if (!ret)
>> 	pr_info("Yay it worked!\n");
>
> I fail to understand what you mean by RACE here.
>
> Here you should already have RTNL lock, so it can't race with any other
> newlink() calls. In fact you can't acquire RTNL lock in your destructor
> since register_netdevice() already gets it. Perhaps you mean
> netdev_run_todo() calls it without RTNL lock?
>
> I don't know why you reorder the above list_add(), you can order it
> as it was before, aka, call it after register_netdevice(), but you have to
> init the priv->list now for the list_del() on error path.

The race is that there's a state in which priv->list is part of list_of_things before the interface is actually successfully set up and ready to go. And no, it's not possible to order it _after_ register_netdevice, since register_netdevice might call priv_destructor, and priv_destructor calls list_del, so if it's not already on the list, we'll OOPS. In other words, an API problem.

To work around this shortcoming, I'm actually just assigning dev->priv_destructor *after* a successful call to register_netdevice. This seems to be working well and allows me to retain ± the old behavior. Hopefully this is an okay way to go about things?

Jason
Re: net: Fix inconsistent teardown and release of private netdev state.
On Thu, Jul 6, 2017 at 4:24 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote: > > if (!ret) > pr_info("Yay it worked!\n"); > > return 0; This is supposed to be `return ret;`. See, even in pseudocode it's hard to get right.
Re: net: Fix inconsistent teardown and release of private netdev state.
Hey guys, I see why this priv_destructor patch is an acceptable bandaid for certain drivers, but I'm not exactly sure on the cleanest way of using it in new drivers. Check out the following psuedocode. Here's how error handling in newlink used to work before this patch: static void destruct(struct net_device *dev) { struct private *priv = netdev_priv(dev); rtnl_lock(); list_del(>list); rtnl_unlock(); stop_anotherweirdthing(>wow); put_can_in_garbage(priv->burrito); kfree(priv->cheese); stop_something(>something); put_whatever(priv->whatever); } static int newlink(struct net *src_net, struct net_device *dev, ...) { int ret; struct private *priv = netdev_priv(dev); priv->whatever = get_whatever(src_net); // never fails ret = do_something(); if (ret < 0) goto error_1; ret = -ENOMEM; priv->cheese = kmalloc(128, GFP_KERNEL); if (!priv->cheese) goto error_2; priv->burrito = allocate_bean_can(_net); if (IS_ERR(priv->burrito)) { ret = PTR_ERR(priv->burrito); goto error_3; } do_anotherweirdthing(>wow); // never fails ret = register_netdevice(); if (ret < 0) goto error_4; list_add(>list, _of_things); pr_info("Yay it worked!\n"); return ret; error_4: stop_anotherweirdthing(>wow); put_can_in_garbage(priv->burrito); error_3: kfree(priv->cheese); error_2: stop_something(>something); error_1: put_whatever(priv->whatever); return ret; } Here's what it must look like now, which introduces a weird race with list_add, and doesn't really have a consistent clean-up section. Having to remember that register_netdevice will call destruct() when it fails is quite confusing. Observe: static int newlink(struct net *src_net, struct net_device *dev, ...) 
{ int ret; struct private *priv = netdev_priv(dev); priv->whatever = get_whatever(src_net); // never fails ret = do_something(&priv->something); if (ret < 0) goto error_1; ret = -ENOMEM; priv->cheese = kmalloc(128, GFP_KERNEL); if (!priv->cheese) goto error_2; priv->burrito = allocate_bean_can(src_net); if (IS_ERR(priv->burrito)) { ret = PTR_ERR(priv->burrito); goto error_3; } do_anotherweirdthing(&priv->wow); // never fails list_add(&priv->list, &list_of_things); ret = register_netdevice(dev); // if ret is < 0, then destruct above is automatically called // RACE WITH LIST_ADD/LIST_DEL!! It's impossible to call list_add only after // things are brought up successfully. This is problematic. if (!ret) pr_info("Yay it worked!\n"); return 0; error_3: kfree(priv->cheese); error_2: stop_something(&priv->something); error_1: put_whatever(priv->whatever); return ret; } Something we just have to live with? Any cleaner way of approaching this? Regards, Jason
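One way drivers commonly cope with a destructor that may run against a partially-initialized device is to make every teardown step idempotent, guarding each resource with a NULL check. That doesn't resolve the list_add ordering race above, but it makes the "register_netdevice calls destruct() on failure" behavior survivable. A minimal userspace sketch of the pattern, with all names hypothetical and modeled on the pseudocode above:

```c
#include <stdlib.h>

/* Hypothetical private state, modeled on the pseudocode above. */
struct priv_state {
	char *cheese;   /* allocated early in newlink() */
	char *burrito;  /* allocated later; may never exist */
};

/* Idempotent teardown: every step is guarded, so it is safe whether
 * newlink() completed, failed halfway, or destruct() already ran. */
static void destruct(struct priv_state *priv)
{
	free(priv->burrito);          /* free(NULL) is a no-op */
	priv->burrito = NULL;
	free(priv->cheese);
	priv->cheese = NULL;
}

/* Model of newlink(): when the "register" step fails, the framework
 * (register_netdevice in the real code) calls destruct() on our
 * behalf, and the guards above make that call harmless. */
static int newlink(struct priv_state *priv, int fail_register)
{
	priv->cheese = calloc(1, 128);
	if (!priv->cheese)
		return -1;
	priv->burrito = calloc(1, 128);
	if (!priv->burrito) {
		destruct(priv);
		return -1;
	}
	if (fail_register) {          /* register_netdevice() < 0 */
		destruct(priv);       /* what priv_destructor triggers */
		return -1;
	}
	return 0;
}
```

The cost is that teardown order is no longer the strict mirror of the setup order that the explicit error_1..error_4 labels gave you.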
Re: [PATCH] net/icmp: restore source address if packet is NATed
Hi David, On Sun, Jun 25, 2017 at 5:49 PM, David Miller wrote: > This violates things on so many levels. Yes, indeed. > I think this kind of thing need to be hidden inside of netfilter, > it can do the rate limiting and stuff like that in the spot > where it makes the transformation and knows all of the original > addressing and ports. Indeed I'd prefer that, and I'll look again into trying to make that work. But when I tried last, it seemed like there were some insurmountable challenges. With the ratelimiting, the limit has already been applied to one IP -- the masqueraded one -- before netfilter even has a chance to act -- so that IP will already hit the ratelimits well before any additional one inside netfilter would. Then the issue of transformation: last I looked it seemed like icmp_send threw away a bit too much information to do the transformation entirely correctly, but I could be wrong, so I'll give it another look. Hopefully it winds up being as easy as just reverse-transforming ICMP's payload IP header. > > You definitely can't just rewrite header fields here either. The > SKB could be shared, for example. I was afraid of that. It's easy to rework this particular patch, though, if you're still interested in the crufty bolt-on approach... But I think I should investigate the netfilter-only approach instead, as you suggested. Jason
[PATCH] net/icmp: restore source address if packet is NATed
The ICMP routines use the source address for two reasons: 1. Rate-limiting ICMP transmissions based on source address, so that one source address cannot provoke a flood of replies. If the source address is wrong, the rate limiting will be incorrectly applied. 2. Choosing the interface and hence new source address of the generated ICMP packet. If the original packet source address is wrong, ICMP replies will be sent from the wrong source address, resulting in either a misdelivery, infoleak, or just general network admin confusion. Most of the time, the icmp_send and icmpv6_send routines can just reach down into the skb's IP header to determine the saddr. However, if icmp_send or icmpv6_send is being called from a network device driver -- there are a few in the tree -- then it's possible that by the time icmp_send or icmpv6_send looks at the packet, the packet's source address has already been transformed by SNAT or MASQUERADE or some other transformation that CONNTRACK knows about. In this case, the packet's source address is most certainly the *wrong* source address to be used for the purpose of ICMP replies. Rather, the source address we want to use for ICMP replies is the original one, from before the transformation occurred. Fortunately, it's very easy to just ask CONNTRACK if it knows about this packet, and if so, how to fix it up. The saddr is the only field in the header we need to fix up, for the purposes of the subsequent processing in the icmp_send and icmpv6_send functions, so we do the lookup very early on, so that the rest of the ICMP machinery can progress as usual. In my tests, this setup works very well. Signed-off-by: Jason A. 
Donenfeld <ja...@zx2c4.com> --- net/ipv4/icmp.c | 21 + net/ipv6/icmp.c | 21 + 2 files changed, 42 insertions(+) diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c index c2be26b98b5f..30aa6aa79fd2 100644 --- a/net/ipv4/icmp.c +++ b/net/ipv4/icmp.c @@ -97,6 +97,10 @@ #include #include #include +#if IS_ENABLED(CONFIG_NF_CONNTRACK) +#include +#include +#endif /* * Build xmit assembly blocks @@ -586,6 +590,10 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info) u32 mark; struct net *net; struct sock *sk; +#if IS_ENABLED(CONFIG_NF_CONNTRACK) + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; +#endif if (!rt) goto out; @@ -604,6 +612,19 @@ void icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info) goto out; /* +* If this function is called after the skb has already been +* NAT transformed, the ratelimiting will apply to the wrong +* saddr, and the reply will be marked as coming from the +* wrong host. So, we fix it up here in case connection tracking +* enables that. 
+*/ +#if IS_ENABLED(CONFIG_NF_CONNTRACK) + ct = nf_ct_get(skb_in, &ctinfo); + if (ct) + iph->saddr = ct->tuplehash[0].tuple.src.u3.ip; +#endif + + /* * No replies to physical multicast/broadcast */ if (skb_in->pkt_type != PACKET_HOST) diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c index 8d7b113958b1..ee8a2853121e 100644 --- a/net/ipv6/icmp.c +++ b/net/ipv6/icmp.c @@ -69,6 +69,10 @@ #include #include #include +#if IS_ENABLED(CONFIG_NF_CONNTRACK) +#include +#include +#endif #include @@ -422,12 +426,29 @@ static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info, int len; int err = 0; u32 mark = IP6_REPLY_MARK(net, skb->mark); +#if IS_ENABLED(CONFIG_NF_CONNTRACK) + enum ip_conntrack_info ctinfo; + struct nf_conn *ct; +#endif if ((u8 *)hdr < skb->head || (skb_network_header(skb) + sizeof(*hdr)) > skb_tail_pointer(skb)) return; /* +* If this function is called after the skb has already been +* NAT transformed, the ratelimiting will apply to the wrong +* saddr, and the reply will be marked as coming from the +* wrong host. So, we fix it up here in case connection tracking +* enables that. +*/ +#if IS_ENABLED(CONFIG_NF_CONNTRACK) + ct = nf_ct_get(skb, &ctinfo); + if (ct) + hdr->saddr = ct->tuplehash[0].tuple.src.u3.in6; +#endif + + /* * Make sure we respect the rules * i.e. RFC 1885 2.4(e) * Rule (e.1) is enforced by not using icmp6_send -- 2.13.1
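The shape of the fixup is small enough to model in isolation. A userspace sketch (types and field names hypothetical; the real code reads the original address out of ct->tuplehash[0].tuple.src.u3):

```c
#include <stdint.h>

/* Toy model: a NATed packet carries the post-SNAT source address,
 * while the conntrack entry remembers the original, pre-NAT one. */
struct pkt { uint32_t saddr; };
struct ct_entry { uint32_t orig_saddr; };

/* Mirror of what the patch does early in icmp_send(): if conntrack
 * knows this packet, restore the pre-NAT source so that both the
 * rate limiter and the reply's routing key off the real host. */
static uint32_t icmp_reply_saddr(struct pkt *p, const struct ct_entry *ct)
{
	if (ct)
		p->saddr = ct->orig_saddr;
	return p->saddr;
}
```

Everything downstream in icmp_send then proceeds unchanged, which is why the lookup is done so early.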
Re: [PATCH 0/6] Constant Time Memory Comparisons Are Important
Hi Stephan, On Sun, Jun 11, 2017 at 11:06 PM, Stephan Müller wrote: > Are you planning to send an update to your patch set? If yes, there is another > one which should be converted too: crypto/rsa-pkcs1pad.c. I just sent an update to this thread patching that, per your suggestion. Since these issues are expected to be cherry picked by their respective committer, I figure we can just pile on the patches here, listing the 0/6 intro email as each patch's parent. Jason
[PATCH 0/6] Constant Time Memory Comparisons Are Important
Whenever you're comparing two MACs, it's important to do this using crypto_memneq instead of memcmp. With memcmp, you leak timing information, which could then be used to iteratively forge a MAC. This is far too basic of a mistake for us to have so pervasively in the year 2017, so let's begin cleaning this stuff up. The following 6 locations were found with some simple regex greps, but I'm sure more lurk below the surface. If you maintain some code or know somebody who maintains some code that deals with MACs, tell them to double check which comparison function they're using. Jason A. Donenfeld (6): sunrpc: use constant time memory comparison for mac net/ipv6: use constant time memory comparison for mac ccree: use constant time memory comparison for macs and tags security/keys: use constant time memory comparison for macs bluetooth/smp: use constant time memory comparison for secret values mac80211/wpa: use constant time memory comparison for MACs Cc: Anna Schumaker <anna.schuma...@netapp.com> Cc: David Howells <dhowe...@redhat.com> Cc: David Safford <saff...@us.ibm.com> Cc: "David S. Miller" <da...@davemloft.net> Cc: Gilad Ben-Yossef <gi...@benyossef.com> Cc: Greg Kroah-Hartman <gre...@linuxfoundation.org> Cc: Gustavo Padovan <gust...@padovan.org> Cc: "J. 
Bruce Fields" <bfie...@fieldses.org> Cc: Jeff Layton <jlay...@poochiereds.net> Cc: Johan Hedberg <johan.hedb...@gmail.com> Cc: Johannes Berg <johan...@sipsolutions.net> Cc: Marcel Holtmann <mar...@holtmann.org> Cc: Mimi Zohar <zo...@linux.vnet.ibm.com> Cc: Trond Myklebust <trond.mykleb...@primarydata.com> Cc: keyri...@vger.kernel.org Cc: linux-blueto...@vger.kernel.org Cc: linux-...@vger.kernel.org Cc: linux-wirel...@vger.kernel.org Cc: netdev@vger.kernel.org drivers/staging/ccree/ssi_fips_ll.c | 17 --- net/bluetooth/smp.c | 39 ++- net/ipv6/seg6_hmac.c | 3 ++- net/mac80211/wpa.c| 9 net/sunrpc/auth_gss/gss_krb5_crypto.c | 3 ++- security/keys/trusted.c | 7 --- 6 files changed, 42 insertions(+), 36 deletions(-) -- 2.13.1
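The difference between the two comparison styles is easy to state in code. Below is a userspace sketch of the idea behind crypto_memneq, not the kernel's actual implementation (which takes additional care to defeat compiler optimizations that could reintroduce early exits):

```c
#include <stddef.h>

/* Constant-time inequality test: XOR every byte pair and accumulate,
 * so the loop always runs to the end and the running time does not
 * depend on where the first mismatch sits. memcmp(), by contrast,
 * may return at the first differing byte, letting an attacker who
 * can time the check forge a MAC one byte position at a time. */
static int ct_memneq(const void *a, const void *b, size_t n)
{
	const unsigned char *pa = a, *pb = b;
	unsigned char diff = 0;
	size_t i;

	for (i = 0; i < n; i++)
		diff |= pa[i] ^ pb[i];
	return diff != 0;  /* nonzero means "not equal", like crypto_memneq */
}
```

Note the return-value convention is inverted relative to memcmp-style equality checks, which is exactly the `!= 0` vs bare-call swap the patches below make.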
[PATCH 2/6] net/ipv6: use constant time memory comparison for mac
Otherwise, we enable a MAC forgery via timing attack. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: "David S. Miller" <da...@davemloft.net> Cc: netdev@vger.kernel.org Cc: sta...@vger.kernel.org --- net/ipv6/seg6_hmac.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/net/ipv6/seg6_hmac.c b/net/ipv6/seg6_hmac.c index f950cb53d5e3..54213c83b44e 100644 --- a/net/ipv6/seg6_hmac.c +++ b/net/ipv6/seg6_hmac.c @@ -38,6 +38,7 @@ #include #include +#include #include #include #include @@ -274,7 +275,7 @@ bool seg6_hmac_validate_skb(struct sk_buff *skb) if (seg6_hmac_compute(hinfo, srh, &ipv6_hdr(skb)->saddr, hmac_output)) return false; - if (memcmp(hmac_output, tlv->hmac, SEG6_HMAC_FIELD_LEN) != 0) + if (crypto_memneq(hmac_output, tlv->hmac, SEG6_HMAC_FIELD_LEN)) return false; return true; -- 2.13.1
[PATCH net-next v10 0/5] Avoiding stack overflow in skb_to_sgvec
Changes v9->v10: - Spaces to tabs on one line. - Added some acked-by, reviewed-by lines. Since we're down to only cleaning up things like spaces-to-tabs, I believe we can merge this patch series. David - would you put this in net-next, please? The recent bug with macsec and historical one with virtio have indicated that letting skb_to_sgvec trounce all over an sglist without checking the length is probably a bad idea. And it's not necessary either: an sglist already explicitly marks its last item, and the initialization functions are diligent in doing so. Thus there's a clear way of avoiding future overflows. So, this patchset, from a high level, makes skb_to_sgvec return a potential error code, and then adjusts all callers to check for the error code. There are two situations in which skb_to_sgvec might return such an error: 1) When the passed in sglist is too small; and 2) When the passed in skbuff is too deeply nested. So, the first patch in this series handles the issues with skb_to_sgvec directly, and the remaining ones then handle the call sites. Jason A. Donenfeld (5): skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow ipsec: check return value of skb_to_sgvec always rxrpc: check return value of skb_to_sgvec always macsec: check return value of skb_to_sgvec always virtio_net: check return value of skb_to_sgvec always drivers/net/macsec.c | 13 -- drivers/net/virtio_net.c | 9 +-- include/linux/skbuff.h | 8 +++--- net/core/skbuff.c| 65 +++- net/ipv4/ah4.c | 8 -- net/ipv4/esp4.c | 20 +-- net/ipv6/ah6.c | 8 -- net/ipv6/esp6.c | 20 +-- net/rxrpc/rxkad.c| 19 ++ 9 files changed, 116 insertions(+), 54 deletions(-) -- 2.13.0
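The shape of the fix in patch 1 can be sketched outside the kernel: the recursive walker threads a depth counter through, and once the cap is hit it bails out with -EMSGSIZE rather than chewing further into the stack. The nested-buffer structure and names here are illustrative only:

```c
#include <stddef.h>

#define EMSGSIZE 90          /* matches the errno value used by the series */
#define MAX_RECURSION 24     /* the depth cap chosen in patch 1/5 */

/* Hypothetical nested buffer, standing in for an skb's frag_list. */
struct buf {
	struct buf *child;
};

/* Recursive walk in the style of __skb_to_sgvec(): each nesting
 * level increments recursion_level, and the walk refuses chains
 * deeper than MAX_RECURSION instead of overrunning the stack. */
static int count_entries(const struct buf *b, unsigned int recursion_level)
{
	int ret;

	if (!b)
		return 0;
	if (recursion_level >= MAX_RECURSION)
		return -EMSGSIZE;
	ret = count_entries(b->child, recursion_level + 1);
	if (ret < 0)
		return ret;       /* propagate the error to the caller */
	return ret + 1;           /* one sg entry for this buffer */
}
```

Since the error travels up through the return value, every call site has to check it, which is what patches 2 through 5 do.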
[PATCH net-next v10 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of housekeeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> Cc: David Howells <dhowe...@redhat.com> Cc: Sabrina Dubroca <s...@queasysnail.net> Cc: "Michael S. 
Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> --- include/linux/skbuff.h | 8 +++ net/core/skbuff.c | 65 -- 2 files changed, 46 insertions(+), 27 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 45a59c1e0cc7..d460a4cbda1c 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -953,10 +953,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb, unsigned int headroom); struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom, int newtailroom, gfp_t priority); -int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg, - int offset, int len); -int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, -int len); +int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg, +int offset, int len); +int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, + int offset, int len); int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer); int skb_pad(struct sk_buff *skb, int pad); #define dev_kfree_skb(a) consume_skb(a) diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 780b7c1563d0..bba33cf4f7cd 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -3508,24 +3508,18 @@ void __init skb_init(void) NULL); } -/** - * skb_to_sgvec - Fill a scatter-gather list from a socket buffer - * @skb: Socket buffer containing the buffers to be mapped - * @sg: The scatter-gather list to map into - * @offset: The offset into the buffer's contents to start mapping - * @len: Length of buffer space to be mapped - * - * Fill the specified scatter-gather list with mappings/pointers into a - * region of the buffer space attached to a socket buffer. 
- */ static int -__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len) +__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len, + unsigned int recursion_level) { int start = skb_headlen(skb); int i, copy = start - offset; struct sk_buff *frag_iter; int elt = 0; + if (unlikely(recursion_level >= 24)) + return -EMSGSIZE; + if (copy > 0) { if (copy > len) copy = len; @@ -3544,6 +3538,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len) end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]); if ((copy = end - offset) > 0) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + if (unlikely(elt && sg_is_last(&sg[elt - 1]))) + return -EMSGSIZE; if (copy > len) copy = len; @@ -3558,16 +3554,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len) } skb_walk_frags(skb, frag_iter) { - int end; + int end, ret; WARN_ON(start > offset + len); end = start + frag_iter->len; if ((copy = end - offset)
[PATCH net-next v10 3/5] rxrpc: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Acked-by: David Howells <dhowe...@redhat.com> --- net/rxrpc/rxkad.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c index 1bb9b2ccc267..29fe20ad04aa 100644 --- a/net/rxrpc/rxkad.c +++ b/net/rxrpc/rxkad.c @@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call, len &= ~(call->conn->size_align - 1); sg_init_table(sg, nsg); - skb_to_sgvec(skb, sg, 0, len); + err = skb_to_sgvec(skb, sg, 0, len); + if (unlikely(err < 0)) + goto out; skcipher_request_set_crypt(req, sg, sg, len, iv.x); crypto_skcipher_encrypt(req); @@ -324,7 +326,7 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb, bool aborted; u32 data_size, buf; u16 check; - int nsg; + int nsg, ret; _enter(""); @@ -342,7 +344,9 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb, goto nomem; sg_init_table(sg, nsg); - skb_to_sgvec(skb, sg, offset, 8); + ret = skb_to_sgvec(skb, sg, offset, 8); + if (unlikely(ret < 0)) + return ret; /* start the decryption afresh */ memset(&iv, 0, sizeof(iv)); @@ -409,7 +413,7 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb, bool aborted; u32 data_size, buf; u16 check; - int nsg; + int nsg, ret; _enter(",{%d}", skb->len); @@ -434,7 +438,12 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb, } sg_init_table(sg, nsg); - skb_to_sgvec(skb, sg, offset, len); + ret = skb_to_sgvec(skb, sg, offset, len); + if (unlikely(ret < 0)) { + if (sg != _sg) + kfree(sg); + return ret; + } /* decrypt from the session key */ token = call->conn->params.key->payload.data[0]; -- 2.13.0
[PATCH net-next v10 2/5] ipsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> --- net/ipv4/ah4.c | 8 ++-- net/ipv4/esp4.c | 20 +--- net/ipv6/ah6.c | 8 ++-- net/ipv6/esp6.c | 20 +--- 4 files changed, 38 insertions(+), 18 deletions(-) diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c index 22377c8ff14b..e8f862358518 100644 --- a/net/ipv4/ah4.c +++ b/net/ipv4/ah4.c @@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb) ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ @@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb) skb_push(skb, ihl); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 93322f895eab..d815d1755473 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -377,9 +377,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info * esp->esph = esph; sg_init_table(sg, esp->nfrags); - skb_to_sgvec(skb, sg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(skb, sg, + (unsigned char *)esph - skb->data, + assoclen + ivlen + esp->clen + alen); + if (unlikely(err < 0)) + goto error; if (!esp->inplace) { int allocsize; @@ -403,9 +405,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info * spin_unlock_bh(&x->lock); sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1); - 
skb_to_sgvec(skb, dsg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(skb, dsg, + (unsigned char *)esph - skb->data, + assoclen + ivlen + esp->clen + alen); + if (unlikely(err < 0)) + goto error; } if ((x->props.flags & XFRM_STATE_ESN)) @@ -690,7 +694,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb) esp_input_set_header(skb, seqhi); sg_init_table(sg, nfrags); - skb_to_sgvec(skb, sg, 0, skb->len); + err = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out; skb->ip_summed = CHECKSUM_NONE; diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index dda6035e3b84..755f38271dd5 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -423,7 +423,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff *skb) ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ @@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff *skb) ip6h->hop_limit = 0; sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 1fe99ba8066c..2ede4e459c4e 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi); sg_init_table(sg, esp->nfrags); - skb_to_sgvec(skb, sg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(s
[PATCH net-next v10 4/5] macsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Sabrina Dubroca <s...@queasysnail.net> --- drivers/net/macsec.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index 91642fd87cd1..b79513b8322f 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -740,7 +740,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb, macsec_fill_iv(iv, secy->sci, pn); sg_init_table(sg, ret); - skb_to_sgvec(skb, sg, 0, skb->len); + ret = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(ret < 0)) { + macsec_txsa_put(tx_sa); + kfree_skb(skb); + return ERR_PTR(ret); + } if (tx_sc->encrypt) { int len = skb->len - macsec_hdr_len(sci_present) - @@ -947,7 +952,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb, macsec_fill_iv(iv, sci, ntohl(hdr->packet_number)); sg_init_table(sg, ret); - skb_to_sgvec(skb, sg, 0, skb->len); + ret = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(ret < 0)) { + kfree_skb(skb); + return ERR_PTR(ret); + } if (hdr->tci_an & MACSEC_TCI_E) { /* confidentiality: ethernet + macsec header -- 2.13.0
[PATCH net-next v10 5/5] virtio_net: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Reviewed-by: Sergei Shtylyov <sergei.shtyl...@cogentembedded.com> Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> --- drivers/net/virtio_net.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 3e9246cc49c3..57763d30cabb 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) struct virtio_net_hdr_mrg_rxbuf *hdr; const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; struct virtnet_info *vi = sq->vq->vdev->priv; - unsigned num_sg; + int num_sg; unsigned hdr_len = vi->hdr_len; bool can_push; @@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) if (can_push) { __skb_push(skb, hdr_len); num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len); + if (unlikely(num_sg < 0)) + return num_sg; /* Pull header back to avoid skew in tx bytes calculations. */ __skb_pull(skb, hdr_len); } else { sg_set_buf(sq->sg, hdr, hdr_len); - num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1; + num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len); + if (unlikely(num_sg < 0)) + return num_sg; + num_sg++; } return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC); } -- 2.13.0
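The quiet `unsigned num_sg` to `int num_sg` type change in this patch is what makes the error checks meaningful at all: storing a negative return in an unsigned variable silently converts it into a huge positive count. A small userspace demonstration of that conversion (the helper name is made up):

```c
/* Stand-in for skb_to_sgvec() failing. */
static int fake_skb_to_sgvec(void)
{
	return -90; /* like returning -EMSGSIZE */
}

/* The pre-patch bug in miniature: the negative return lands in an
 * unsigned variable and becomes a huge positive count, so no
 * "was it negative?" test on num_sg could ever fire, and the bogus
 * count would be handed straight to virtqueue_add_outbuf(). */
static unsigned int buggy_num_sg(void)
{
	unsigned int num_sg = fake_skb_to_sgvec();
	return num_sg; /* 2^32 - 90, not -90, on 32-bit unsigned int */
}

/* With the patch's signed local, the error value survives intact
 * and the unlikely(num_sg < 0) check can catch and propagate it. */
static int fixed_num_sg(void)
{
	int num_sg = fake_skb_to_sgvec();
	if (num_sg < 0)
		return num_sg; /* propagate the error */
	return num_sg + 1;     /* e.g. account for the header sg entry */
}
```

This is also why the `num_sg++` for the header entry moves after the error check in the non-can_push branch.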
Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always
On Wed, May 24, 2017 at 6:41 PM, Sergei Shtylyov wrote: > I've only looked on the last 2 patches. You can add my: > > Reviewed-by: Sergei Shtylyov > > if you want. :-) Will do. For the series, or just for 5/5?
Re: [PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always
I'm shocked this somehow made it into the commit. I wonder how that happened? Anyway, fixed in my git repo, and will be part of the next series. (Unless DaveM wants to fix it up trivially when/if he merges this v9, which would be faster.) Barring that, does this look good to you? Could I have your signed-off-by? Regards, Jason
Re: [PATCH net-next v9 0/5] skb_to_sgvec hardening
Hi List, Could somebody do a holistic review of the series, or at least on individual commits that seem fine, and sign off on it, so that this can actually be merged? We're now at v9. I hope we can get this merged now, but if not, I'd like for v10 to finally land these changes. Regards, Jason
[PATCH net-next v9 2/5] ipsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> --- net/ipv4/ah4.c | 8 ++-- net/ipv4/esp4.c | 20 +--- net/ipv6/ah6.c | 8 ++-- net/ipv6/esp6.c | 20 +--- 4 files changed, 38 insertions(+), 18 deletions(-) diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c index 22377c8ff14b..e8f862358518 100644 --- a/net/ipv4/ah4.c +++ b/net/ipv4/ah4.c @@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb) ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ @@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb) skb_push(skb, ihl); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 65cc02bd82bc..392432860bb9 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -374,9 +374,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info * esp->esph = esph; sg_init_table(sg, esp->nfrags); - skb_to_sgvec(skb, sg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(skb, sg, + (unsigned char *)esph - skb->data, + assoclen + ivlen + esp->clen + alen); + if (unlikely(err < 0)) + goto error; if (!esp->inplace) { int allocsize; @@ -400,9 +402,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info * spin_unlock_bh(&x->lock); sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1); - 
skb_to_sgvec(skb, dsg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(skb, dsg, + (unsigned char *)esph - skb->data, + assoclen + ivlen + esp->clen + alen); + if (unlikely(err < 0)) + goto error; } if ((x->props.flags & XFRM_STATE_ESN)) @@ -687,7 +691,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb) esp_input_set_header(skb, seqhi); sg_init_table(sg, nfrags); - skb_to_sgvec(skb, sg, 0, skb->len); + err = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out; skb->ip_summed = CHECKSUM_NONE; diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c index dda6035e3b84..755f38271dd5 100644 --- a/net/ipv6/ah6.c +++ b/net/ipv6/ah6.c @@ -423,7 +423,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff *skb) ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low); sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ @@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff *skb) ip6h->hop_limit = 0; sg_init_table(sg, nfrags + sglists); - skb_to_sgvec_nomark(skb, sg, 0, skb->len); + err = skb_to_sgvec_nomark(skb, sg, 0, skb->len); + if (unlikely(err < 0)) + goto out_free; if (x->props.flags & XFRM_STATE_ESN) { /* Attach seqhi sg right after packet payload */ diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 1fe99ba8066c..2ede4e459c4e 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi); sg_init_table(sg, esp->nfrags); - skb_to_sgvec(skb, sg, -(unsigned char *)esph - skb->data, -assoclen + ivlen + esp->clen + alen); + err = skb_to_sgvec(s
[PATCH net-next v9 4/5] macsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Sabrina Dubroca <s...@queasysnail.net> --- drivers/net/macsec.c | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c index cdc347be68f2..dfcb1e9d2ab2 100644 --- a/drivers/net/macsec.c +++ b/drivers/net/macsec.c @@ -742,7 +742,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb, macsec_fill_iv(iv, secy->sci, pn); sg_init_table(sg, ret); - skb_to_sgvec(skb, sg, 0, skb->len); + ret = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(ret < 0)) { + macsec_txsa_put(tx_sa); + kfree_skb(skb); + return ERR_PTR(ret); + } if (tx_sc->encrypt) { int len = skb->len - macsec_hdr_len(sci_present) - @@ -952,7 +957,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb, macsec_fill_iv(iv, sci, ntohl(hdr->packet_number)); sg_init_table(sg, ret); - skb_to_sgvec(skb, sg, 0, skb->len); + ret = skb_to_sgvec(skb, sg, 0, skb->len); + if (unlikely(ret < 0)) { + kfree_skb(skb); + return ERR_PTR(ret); + } if (hdr->tci_an & MACSEC_TCI_E) { /* confidentiality: ethernet + macsec header -- 2.13.0
[PATCH net-next v9 5/5] virtio_net: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> --- drivers/net/virtio_net.c | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 9320d96a1632..13fbe4b349c2 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) struct virtio_net_hdr_mrg_rxbuf *hdr; const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest; struct virtnet_info *vi = sq->vq->vdev->priv; - unsigned num_sg; + int num_sg; unsigned hdr_len = vi->hdr_len; bool can_push; @@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb) if (can_push) { __skb_push(skb, hdr_len); num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len); +if (unlikely(num_sg < 0)) + return num_sg; /* Pull header back to avoid skew in tx bytes calculations. */ __skb_pull(skb, hdr_len); } else { sg_set_buf(sq->sg, hdr, hdr_len); - num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1; + num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len); + if (unlikely(num_sg < 0)) + return num_sg; + num_sg++; } return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC); } -- 2.13.0
[PATCH net-next v9 3/5] rxrpc: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 1bb9b2ccc267..29fe20ad04aa 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	len &= ~(call->conn->size_align - 1);
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, 0, len);
+	err = skb_to_sgvec(skb, sg, 0, len);
+	if (unlikely(err < 0))
+		goto out;
 	skcipher_request_set_crypt(req, sg, sg, len, iv.x);
 	crypto_skcipher_encrypt(req);
 
@@ -324,7 +326,7 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 	bool aborted;
 	u32 data_size, buf;
 	u16 check;
-	int nsg;
+	int nsg, ret;
 
 	_enter("");
 
@@ -342,7 +344,9 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 		goto nomem;
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, 8);
+	ret = skb_to_sgvec(skb, sg, offset, 8);
+	if (unlikely(ret < 0))
+		return ret;
 
 	/* start the decryption afresh */
 	memset(&iv, 0, sizeof(iv));
@@ -409,7 +413,7 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	bool aborted;
 	u32 data_size, buf;
 	u16 check;
-	int nsg;
+	int nsg, ret;
 
 	_enter(",{%d}", skb->len);
 
@@ -434,7 +438,12 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	}
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, len);
+	ret = skb_to_sgvec(skb, sg, offset, len);
+	if (unlikely(ret < 0)) {
+		if (sg != _sg)
+			kfree(sg);
+		return ret;
+	}
 
 	/* decrypt from the session key */
 	token = call->conn->params.key->payload.data[0];
-- 
2.13.0
[PATCH net-next v9 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of house keeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> Cc: David Howells <dhowe...@redhat.com> Cc: Sabrina Dubroca <s...@queasysnail.net> Cc: "Michael S. 
Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 include/linux/skbuff.h |  8 +++---
 net/core/skbuff.c      | 65 ++++++++++++++++++++++++++++----------------------
 2 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95b3d84..f351df28942d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1001,10 +1001,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
-int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
-			int offset, int len);
-int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset,
-		 int len);
+int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
+				     int offset, int len);
+int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg,
+			      int offset, int len);
 int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)	consume_skb(a)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..ec33811559e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3482,24 +3482,18 @@ void __init skb_init(void)
 						NULL);
 }
 
-/**
- *	skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- *	@skb: Socket buffer containing the buffers to be mapped
- *	@sg: The scatter-gather list to map into
- *	@offset: The offset into the buffer's contents to start mapping
- *	@len: Length of buffer space to be mapped
- *
- *	Fill the specified scatter-gather list with mappings/pointers into a
- *	region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len,
+	       unsigned int recursion_level)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset;
 	struct sk_buff *frag_iter;
 	int elt = 0;
 
+	if (unlikely(recursion_level >= 24))
+		return -EMSGSIZE;
+
 	if (copy > 0) {
 		if (copy > len)
 			copy = len;
@@ -3518,6 +3512,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 		end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
 		if ((copy = end - offset) > 0) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+				return -EMSGSIZE;
 
 			if (copy > len)
 				copy = len;
@@ -3532,16 +3528,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 	}
 
 	skb_walk_frags(skb, frag_iter) {
-		int end;
+		int end, ret;
 
 		WARN_ON(start > offset + len);
 
 		end = start + frag_iter->len;
 		if ((copy = end - offset)
[PATCH net-next v9 0/5] skb_to_sgvec hardening
The recent bug with macsec and historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist without
checking the length is probably a bad idea. And it's not necessary
either: an sglist already explicitly marks its last item, and the
initialization functions are diligent in doing so. Thus there's a
clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return a
potential error code, and then adjusts all callers to check for the
error code. There are two situations in which skb_to_sgvec might
return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Changes v8->v9:
  - Return correct errno in rxrpc, thanks to feedback from Dave Howells.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c     | 13 ++++--
 drivers/net/virtio_net.c |  9 ++--
 include/linux/skbuff.h   |  8 ++--
 net/core/skbuff.c        | 65 +++++++++++++++++++++++++++---------------
 net/ipv4/ah4.c           |  8 ++--
 net/ipv4/esp4.c          | 20 ++++++++---
 net/ipv6/ah6.c           |  8 ++--
 net/ipv6/esp6.c          | 20 ++++++++---
 net/rxrpc/rxkad.c        | 19 ++++++-----
 9 files changed, 116 insertions(+), 54 deletions(-)

-- 
2.13.0
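The caller-side contract this series establishes — treat a negative return from skb_to_sgvec as an error and propagate it, rather than handing a possibly-truncated sglist to the next layer — can be modeled outside the kernel. The following is a standalone userspace sketch; `struct toy_sg`, `toy_to_sgvec`, and `toy_xmit` are invented stand-ins for illustration, not kernel APIs:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-in for a scatterlist slot; the kernel uses struct scatterlist. */
struct toy_sg { const void *buf; size_t len; };

/* Illustrative model of the patched contract: fill up to 'cap' slots from
 * 'nbufs' buffers, returning the number of slots used, or -EMSGSIZE when
 * the destination list is too small, instead of writing past its end. */
static int toy_to_sgvec(struct toy_sg *sg, size_t cap,
			const void **bufs, const size_t *lens, size_t nbufs)
{
	size_t i;

	if (nbufs > cap)
		return -EMSGSIZE;	/* would overflow: refuse, don't trample */
	for (i = 0; i < nbufs; i++) {
		sg[i].buf = bufs[i];
		sg[i].len = lens[i];
	}
	return (int)i;			/* number of entries actually filled */
}

/* Caller pattern after this series: always check for a negative return. */
static int toy_xmit(struct toy_sg *sg, size_t cap,
		    const void **bufs, const size_t *lens, size_t nbufs)
{
	int num_sg = toy_to_sgvec(sg, cap, bufs, lens, nbufs);

	if (num_sg < 0)
		return num_sg;		/* propagate -EMSGSIZE to the caller */
	/* ... hand num_sg entries on to the device/crypto layer ... */
	return 0;
}
```

The point of the model is the shape of the call site: the fill function reports failure through its return value, and every caller either propagates it or unwinds its own state first (as the macsec and rxrpc patches do).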
Re: [PATCH v8 3/5] rxrpc: check return value of skb_to_sgvec always
On Mon, May 15, 2017 at 3:11 PM, David Howells wrote:
> skb_to_sgvec() can return -EMSGSIZE in some circumstances. You shouldn't
> return -ENOMEM here in such a case.

Noted. I'll fix this up for the next round.
Re: [PATCH v8 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
On Mon, May 15, 2017 at 3:12 PM, David Howells wrote:
> Is there a reason you moved skb_to_sgvec() in the file rather than just moving
> the comment to it (since you moved the comment anyway)?

1) Because it's easier to understand skb_to_sgvec_nomark as a variant of
   skb_to_sgvec, so I'd rather skb_to_sgvec be first when reading.

2) Because skb_to_sgvec relies on the return value of __skb_to_sgvec, and
   so when assessing it, it's sometimes nice to be able to look at why it
   will return different things. In that case, it's easier to have both
   functions within the same view without scrolling.

It's the little things that make life easier sometimes.
Re: Implementing Dynamic Rerouting in Kernel
On Thu, May 11, 2017 at 6:22 PM, Florian Fainelli wrote:
> What you are looking for can be done using ipset-dns from Jason:
>
> https://git.zx2c4.com/ipset-dns/about/

Funny to see this project coming up. I actually ported this functionality
into dnsmasq directly a few weeks after writing ipset-dns, since that's a
much more robust place to put this sort of feature. It's the --ipset
option.
[PATCH v8 0/5] skb_to_sgvec hardening
The recent bug with macsec and historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist without
checking the length is probably a bad idea. And it's not necessary
either: an sglist already explicitly marks its last item, and the
initialization functions are diligent in doing so. Thus there's a
clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return a
potential error code, and then adjusts all callers to check for the
error code. There are two situations in which skb_to_sgvec might
return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Changes v7->v8:
  - Added __must_check annotation.
  - Rebased against latest upstream ipsec changes.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c     | 13 ++++--
 drivers/net/virtio_net.c |  9 ++--
 include/linux/skbuff.h   |  8 ++--
 net/core/skbuff.c        | 65 +++++++++++++++++++++++++++---------------
 net/ipv4/ah4.c           |  8 ++--
 net/ipv4/esp4.c          | 20 ++++++++---
 net/ipv6/ah6.c           |  8 ++--
 net/ipv6/esp6.c          | 20 ++++++++---
 net/rxrpc/rxkad.c        | 13 ++++---
 9 files changed, 112 insertions(+), 52 deletions(-)

-- 
2.13.0
[PATCH v8 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of house keeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> Cc: David Howells <dhowe...@redhat.com> Cc: Sabrina Dubroca <s...@queasysnail.net> Cc: "Michael S. 
Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 include/linux/skbuff.h |  8 +++---
 net/core/skbuff.c      | 65 ++++++++++++++++++++++++++++----------------------
 2 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a098d95b3d84..f351df28942d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1001,10 +1001,10 @@ struct sk_buff *skb_realloc_headroom(struct sk_buff *skb,
 				     unsigned int headroom);
 struct sk_buff *skb_copy_expand(const struct sk_buff *skb, int newheadroom,
 				int newtailroom, gfp_t priority);
-int skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
-			int offset, int len);
-int skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset,
-		 int len);
+int __must_check skb_to_sgvec_nomark(struct sk_buff *skb, struct scatterlist *sg,
+				     int offset, int len);
+int __must_check skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg,
+			      int offset, int len);
 int skb_cow_data(struct sk_buff *skb, int tailbits, struct sk_buff **trailer);
 int skb_pad(struct sk_buff *skb, int pad);
 #define dev_kfree_skb(a)	consume_skb(a)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 346d3e85dfbc..ec33811559e3 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3482,24 +3482,18 @@ void __init skb_init(void)
 						NULL);
 }
 
-/**
- *	skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- *	@skb: Socket buffer containing the buffers to be mapped
- *	@sg: The scatter-gather list to map into
- *	@offset: The offset into the buffer's contents to start mapping
- *	@len: Length of buffer space to be mapped
- *
- *	Fill the specified scatter-gather list with mappings/pointers into a
- *	region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len,
+	       unsigned int recursion_level)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset;
 	struct sk_buff *frag_iter;
 	int elt = 0;
 
+	if (unlikely(recursion_level >= 24))
+		return -EMSGSIZE;
+
 	if (copy > 0) {
 		if (copy > len)
 			copy = len;
@@ -3518,6 +3512,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 		end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
 		if ((copy = end - offset) > 0) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+				return -EMSGSIZE;
 
 			if (copy > len)
 				copy = len;
@@ -3532,16 +3528,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 	}
 
 	skb_walk_frags(skb, frag_iter) {
-		int end;
+		int end, ret;
 
 		WARN_ON(start > offset + len);
 
 		end = start + frag_iter->len;
		if ((copy = end - offset)
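The overflow check in this patch leans on the fact that sg_init_table() marks the final entry, so __skb_to_sgvec can detect exhaustion before writing. Below is a userspace model of that guard; `toy_sg`, `toy_sg_init_table`, and `toy_fill` are invented for illustration (the real `struct scatterlist` encodes its end marker in the low bits of `page_link` rather than in a boolean):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy scatterlist slot: a plain flag stands in for sg_is_last(). */
struct toy_sg { int data; bool is_last; };

/* Model of sg_init_table(): zero the slots and mark the last one. */
static void toy_sg_init_table(struct toy_sg *sg, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sg[i] = (struct toy_sg){ .data = 0, .is_last = (i == n - 1) };
}

/* Mirrors the patch's guard: before filling slot 'elt', bail out with
 * -EMSGSIZE if the previous slot already carries the end marker, so a
 * caller-supplied list that is too short can never be overrun. */
static int toy_fill(struct toy_sg *sg, const int *vals, size_t nvals)
{
	int elt = 0;

	for (size_t i = 0; i < nvals; i++) {
		if (elt && sg[elt - 1].is_last)
			return -EMSGSIZE;	/* sglist exhausted: refuse */
		sg[elt++].data = vals[i];
	}
	return elt;				/* entries consumed */
}
```

Note the `elt &&` part of the condition, taken from the patch itself: the check only fires once at least one slot has been written, so a single-entry list (whose only slot is also the last) can still be filled.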
[PATCH v8 5/5] virtio_net: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 drivers/net/virtio_net.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 9320d96a1632..13fbe4b349c2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1150,7 +1150,7 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
 	const unsigned char *dest = ((struct ethhdr *)skb->data)->h_dest;
 	struct virtnet_info *vi = sq->vq->vdev->priv;
-	unsigned num_sg;
+	int num_sg;
 	unsigned hdr_len = vi->hdr_len;
 	bool can_push;
 
@@ -1177,11 +1177,16 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
 	if (can_push) {
 		__skb_push(skb, hdr_len);
 		num_sg = skb_to_sgvec(skb, sq->sg, 0, skb->len);
+		if (unlikely(num_sg < 0))
+			return num_sg;
 		/* Pull header back to avoid skew in tx bytes calculations. */
 		__skb_pull(skb, hdr_len);
 	} else {
 		sg_set_buf(sq->sg, hdr, hdr_len);
-		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;
+		num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len);
+		if (unlikely(num_sg < 0))
+			return num_sg;
+		num_sg++;
 	}
 	return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
 }
-- 
2.13.0
[PATCH v8 2/5] ipsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Steffen Klassert <steffen.klass...@secunet.com>
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: "David S. Miller" <da...@davemloft.net>
---
 net/ipv4/ah4.c  |  8 ++++++--
 net/ipv4/esp4.c | 20 +++++++++++++++-----
 net/ipv6/ah6.c  |  8 ++++++--
 net/ipv6/esp6.c | 20 +++++++++++++++-----
 4 files changed, 38 insertions(+), 18 deletions(-)

diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 22377c8ff14b..e8f862358518 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -220,7 +220,9 @@ static int ah_output(struct xfrm_state *x, struct sk_buff *skb)
 	ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
 	sg_init_table(sg, nfrags + sglists);
-	skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	if (unlikely(err < 0))
+		goto out_free;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
 		/* Attach seqhi sg right after packet payload */
@@ -393,7 +395,9 @@ static int ah_input(struct xfrm_state *x, struct sk_buff *skb)
 	skb_push(skb, ihl);
 
 	sg_init_table(sg, nfrags + sglists);
-	skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	if (unlikely(err < 0))
+		goto out_free;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
 		/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 65cc02bd82bc..392432860bb9 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -374,9 +374,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 	esp->esph = esph;
 
 	sg_init_table(sg, esp->nfrags);
-	skb_to_sgvec(skb, sg,
-		     (unsigned char *)esph - skb->data,
-		     assoclen + ivlen + esp->clen + alen);
+	err = skb_to_sgvec(skb, sg,
+			   (unsigned char *)esph - skb->data,
+			   assoclen + ivlen + esp->clen + alen);
+	if (unlikely(err < 0))
+		goto error;
 
 	if (!esp->inplace) {
 		int allocsize;
@@ -400,9 +402,11 @@ int esp_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info *
 		spin_unlock_bh(&x->lock);
 
 		sg_init_table(dsg, skb_shinfo(skb)->nr_frags + 1);
-		skb_to_sgvec(skb, dsg,
-			     (unsigned char *)esph - skb->data,
-			     assoclen + ivlen + esp->clen + alen);
+		err = skb_to_sgvec(skb, dsg,
+				   (unsigned char *)esph - skb->data,
+				   assoclen + ivlen + esp->clen + alen);
+		if (unlikely(err < 0))
+			goto error;
 	}
 
 	if ((x->props.flags & XFRM_STATE_ESN))
@@ -687,7 +691,9 @@ static int esp_input(struct xfrm_state *x, struct sk_buff *skb)
 	esp_input_set_header(skb, seqhi);
 
 	sg_init_table(sg, nfrags);
-	skb_to_sgvec(skb, sg, 0, skb->len);
+	err = skb_to_sgvec(skb, sg, 0, skb->len);
+	if (unlikely(err < 0))
+		goto out;
 
 	skb->ip_summed = CHECKSUM_NONE;
 
diff --git a/net/ipv6/ah6.c b/net/ipv6/ah6.c
index dda6035e3b84..755f38271dd5 100644
--- a/net/ipv6/ah6.c
+++ b/net/ipv6/ah6.c
@@ -423,7 +425,9 @@ static int ah6_output(struct xfrm_state *x, struct sk_buff *skb)
 	ah->seq_no = htonl(XFRM_SKB_CB(skb)->seq.output.low);
 
 	sg_init_table(sg, nfrags + sglists);
-	skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	if (unlikely(err < 0))
+		goto out_free;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
 		/* Attach seqhi sg right after packet payload */
@@ -606,7 +608,9 @@ static int ah6_input(struct xfrm_state *x, struct sk_buff *skb)
 	ip6h->hop_limit = 0;
 
 	sg_init_table(sg, nfrags + sglists);
-	skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	err = skb_to_sgvec_nomark(skb, sg, 0, skb->len);
+	if (unlikely(err < 0))
+		goto out_free;
 
 	if (x->props.flags & XFRM_STATE_ESN) {
 		/* Attach seqhi sg right after packet payload */
diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c
index 1fe99ba8066c..2ede4e459c4e 100644
--- a/net/ipv6/esp6.c
+++ b/net/ipv6/esp6.c
@@ -346,9 +346,11 @@ int esp6_output_tail(struct xfrm_state *x, struct sk_buff *skb, struct esp_info
 	esph = esp_output_set_esn(skb, x, ip_esp_hdr(skb), seqhi);
 
 	sg_init_table(sg, esp->nfrags);
-	skb_to_sgvec(skb, sg,
-		     (unsigned char *)esph - skb->data,
-		     assoclen + ivlen + esp->clen + alen);
+	err = skb_to_sgvec(s
[PATCH v8 3/5] rxrpc: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 1bb9b2ccc267..ecab9334e3c1 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -227,7 +227,9 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	len &= ~(call->conn->size_align - 1);
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, 0, len);
+	err = skb_to_sgvec(skb, sg, 0, len);
+	if (unlikely(err < 0))
+		goto out;
 	skcipher_request_set_crypt(req, sg, sg, len, iv.x);
 	crypto_skcipher_encrypt(req);
 
@@ -342,7 +344,8 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 		goto nomem;
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, 8);
+	if (unlikely(skb_to_sgvec(skb, sg, offset, 8) < 0))
+		goto nomem;
 
 	/* start the decryption afresh */
 	memset(&iv, 0, sizeof(iv));
@@ -434,7 +437,11 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	}
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, len);
+	if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0)) {
+		if (sg != _sg)
+			kfree(sg);
+		goto nomem;
+	}
 
 	/* decrypt from the session key */
 	token = call->conn->params.key->payload.data[0];
-- 
2.13.0
[PATCH v8 4/5] macsec: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: Sabrina Dubroca <s...@queasysnail.net>
---
 drivers/net/macsec.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index cdc347be68f2..dfcb1e9d2ab2 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -742,7 +742,12 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 	macsec_fill_iv(iv, secy->sci, pn);
 
 	sg_init_table(sg, ret);
-	skb_to_sgvec(skb, sg, 0, skb->len);
+	ret = skb_to_sgvec(skb, sg, 0, skb->len);
+	if (unlikely(ret < 0)) {
+		macsec_txsa_put(tx_sa);
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
 
 	if (tx_sc->encrypt) {
 		int len = skb->len - macsec_hdr_len(sci_present) -
@@ -952,7 +957,11 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
 	sg_init_table(sg, ret);
-	skb_to_sgvec(skb, sg, 0, skb->len);
+	ret = skb_to_sgvec(skb, sg, 0, skb->len);
+	if (unlikely(ret < 0)) {
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
 
 	if (hdr->tci_an & MACSEC_TCI_E) {
 		/* confidentiality: ethernet + macsec header
-- 
2.13.0
Re: [PATCH v7 5/5] virtio_net: check return value of skb_to_sgvec always
On Tue, May 9, 2017 at 3:50 PM, Jason A. Donenfeld <ja...@zx2c4.com> wrote:
> num_sg = skb_to_sgvec(skb, sq->sg + 1, 0, skb->len) + 1;

(The next submission of this will take into account this + 1 here.
https://git.zx2c4.com/linux-dev/log/?h=jd/safe-skb-vec )
Re: [PATCH v7 2/5] ipsec: check return value of skb_to_sgvec always
(The next submission of this ipsec patch will have this rebased over the latest upstream tree. https://git.zx2c4.com/linux-dev/log/?h=jd/safe-skb-vec )
Re: [PATCH v7 0/5] skb_to_sgvec hardening
On Tue, May 9, 2017 at 4:03 PM, Johannes Berg wrote:
> Perhaps you should add __must_check annotation to the function
> prototype(s)?

Great idea. I've started doing this in my own code. Wasn't sure how
popular it was outside of there, but I'm glad to hear a suggestion of it
now. I'll have this in v8.
[PATCH v7 0/5] skb_to_sgvec hardening
The recent bug with macsec and historical one with virtio have
indicated that letting skb_to_sgvec trounce all over an sglist without
checking the length is probably a bad idea. And it's not necessary
either: an sglist already explicitly marks its last item, and the
initialization functions are diligent in doing so. Thus there's a
clear way of avoiding future overflows.

So, this patchset, from a high level, makes skb_to_sgvec return a
potential error code, and then adjusts all callers to check for the
error code. There are two situations in which skb_to_sgvec might
return such an error:

   1) When the passed in sglist is too small; and
   2) When the passed in skbuff is too deeply nested.

So, the first patch in this series handles the issues with
skb_to_sgvec directly, and the remaining ones then handle the call
sites.

Jason A. Donenfeld (5):
  skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
  ipsec: check return value of skb_to_sgvec always
  rxrpc: check return value of skb_to_sgvec always
  macsec: check return value of skb_to_sgvec always
  virtio_net: check return value of skb_to_sgvec always

 drivers/net/macsec.c     | 13 ++++--
 drivers/net/virtio_net.c |  4 ++-
 net/core/skbuff.c        | 65 +++++++++++++++++++++++++++---------------
 net/ipv4/ah4.c           |  8 ++--
 net/ipv4/esp4.c          | 30 ++++++++++++------
 net/ipv6/ah6.c           |  8 ++--
 net/ipv6/esp6.c          | 31 +++++++++++++------
 net/rxrpc/rxkad.c        | 13 ++++---
 8 files changed, 119 insertions(+), 53 deletions(-)

-- 
2.12.2
[PATCH v7 1/5] skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow
This is a defense-in-depth measure in response to bugs like 4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's not only a potential overflow of sglist items, but also a stack overflow potential, so we fix this by limiting the amount of recursion this function is allowed to do. Not actually providing a bounded base case is a future disaster that we can easily avoid here. As a small matter of house keeping, we take this opportunity to move the documentation comment over the actual function the documentation is for. While this could be implemented by using an explicit stack of skbuffs, when implementing this, the function complexity increased considerably, and I don't think such complexity and bloat is actually worth it. So, instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS, and measured the stack usage there. I also reverted the recent MIPS changes that give it a separate IRQ stack, so that I could experience some worst-case situations. I found that limiting it to 24 layers deep yielded a good stack usage with room for safety, as well as being much deeper than any driver actually ever creates. Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com> Cc: Steffen Klassert <steffen.klass...@secunet.com> Cc: Herbert Xu <herb...@gondor.apana.org.au> Cc: "David S. Miller" <da...@davemloft.net> Cc: David Howells <dhowe...@redhat.com> Cc: Sabrina Dubroca <s...@queasysnail.net> Cc: "Michael S. 
Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
---
 net/core/skbuff.c | 65 +++++++++++++++++++++++++++++++++----------------------
 1 file changed, 42 insertions(+), 23 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f86bf69cfb8d..ab51b97d5600 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3481,24 +3481,18 @@ void __init skb_init(void)
 						NULL);
 }
 
-/**
- *	skb_to_sgvec - Fill a scatter-gather list from a socket buffer
- *	@skb: Socket buffer containing the buffers to be mapped
- *	@sg: The scatter-gather list to map into
- *	@offset: The offset into the buffer's contents to start mapping
- *	@len: Length of buffer space to be mapped
- *
- *	Fill the specified scatter-gather list with mappings/pointers into a
- *	region of the buffer space attached to a socket buffer.
- */
 static int
-__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
+__skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len,
+	       unsigned int recursion_level)
 {
 	int start = skb_headlen(skb);
 	int i, copy = start - offset;
 	struct sk_buff *frag_iter;
 	int elt = 0;
 
+	if (unlikely(recursion_level >= 24))
+		return -EMSGSIZE;
+
 	if (copy > 0) {
 		if (copy > len)
 			copy = len;
@@ -3517,6 +3511,8 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 		end = start + skb_frag_size(&skb_shinfo(skb)->frags[i]);
 		if ((copy = end - offset) > 0) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+				return -EMSGSIZE;
 
 			if (copy > len)
 				copy = len;
@@ -3531,16 +3527,22 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 	}
 
 	skb_walk_frags(skb, frag_iter) {
-		int end;
+		int end, ret;
 
 		WARN_ON(start > offset + len);
 
 		end = start + frag_iter->len;
 		if ((copy = end - offset) > 0) {
+			if (unlikely(elt && sg_is_last(&sg[elt - 1])))
+				return -EMSGSIZE;
+
 			if (copy > len)
 				copy = len;
-			elt += __skb_to_sgvec(frag_iter, sg+elt, offset - start,
-					      copy);
+			ret = __skb_to_sgvec(frag_iter, sg+elt, offset - start,
+					     copy, recursion_level + 1);
+			if (unlikely(ret < 0))
+				return ret;
+			elt += ret;
 			if ((len -= copy) == 0)
 				return elt;
 			offset += copy;
@@ -3551,6 +3553,31 @@ __skb_to_sgvec(struct sk_buff *skb, struct scatterlist *sg, int offset, int len)
 	return elt;
 }
 
+/**
+ *	skb_to_sgvec - Fill a scatter-gather list from a socket buffer
+ *	@skb: Socket buffer containing the buffers to be mapped
+ *	@sg: The scatter-gather list to map into
+ *	@offset: The offset into the buffer's contents t
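The 24-level cap above turns unbounded recursion into a bounded walk that fails loudly instead of exhausting the stack. The idea can be approximated in userspace; `toy_buf` and `toy_total_len` are invented stand-ins for an skb frag_list chain, not kernel code:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy nested buffer: the 'frag' pointer models an skb's frag_list. */
struct toy_buf {
	int len;
	struct toy_buf *frag;	/* nested fragment, or NULL */
};

/* Mirrors the patch's bounded base case: refuse to recurse past 24
 * levels, so a pathologically deep chain returns -EMSGSIZE instead of
 * overflowing the stack. Errors propagate outward unchanged. */
static int toy_total_len(const struct toy_buf *b, unsigned int recursion_level)
{
	int ret;

	if (!b)
		return 0;
	if (recursion_level >= 24)
		return -EMSGSIZE;	/* too deeply nested: bail out */
	ret = toy_total_len(b->frag, recursion_level + 1);
	if (ret < 0)
		return ret;		/* propagate the error to the caller */
	return b->len + ret;
}
```

With this ordering a chain of exactly 24 buffers still succeeds (the NULL terminator is reached at level 24 before any buffer is rejected), while the 25th buffer trips the limit, matching the commit message's claim that 24 is deeper than anything drivers actually create.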
[PATCH v7 3/5] rxrpc: check return value of skb_to_sgvec always
Signed-off-by: Jason A. Donenfeld <ja...@zx2c4.com>
Cc: David Howells <dhowe...@redhat.com>
---
 net/rxrpc/rxkad.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/rxrpc/rxkad.c b/net/rxrpc/rxkad.c
index 4374e7b9c7bf..486026689448 100644
--- a/net/rxrpc/rxkad.c
+++ b/net/rxrpc/rxkad.c
@@ -229,7 +229,9 @@ static int rxkad_secure_packet_encrypt(const struct rxrpc_call *call,
 	len &= ~(call->conn->size_align - 1);
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, 0, len);
+	err = skb_to_sgvec(skb, sg, 0, len);
+	if (unlikely(err < 0))
+		goto out;
 	skcipher_request_set_crypt(req, sg, sg, len, iv.x);
 	crypto_skcipher_encrypt(req);
 
@@ -342,7 +344,8 @@ static int rxkad_verify_packet_1(struct rxrpc_call *call, struct sk_buff *skb,
 		goto nomem;
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, 8);
+	if (unlikely(skb_to_sgvec(skb, sg, offset, 8) < 0))
+		goto nomem;
 
 	/* start the decryption afresh */
 	memset(&iv, 0, sizeof(iv));
@@ -429,7 +432,11 @@ static int rxkad_verify_packet_2(struct rxrpc_call *call, struct sk_buff *skb,
 	}
 
 	sg_init_table(sg, nsg);
-	skb_to_sgvec(skb, sg, offset, len);
+	if (unlikely(skb_to_sgvec(skb, sg, offset, len) < 0)) {
+		if (sg != _sg)
+			kfree(sg);
+		goto nomem;
+	}
 
 	/* decrypt from the session key */
 	token = call->conn->params.key->payload.data[0];
-- 
2.12.2