Re: [PATCH 1/4] bpf: account for freed JIT allocations in arch code

2018-11-19 Thread Ard Biesheuvel
On Mon, 19 Nov 2018 at 02:37, Daniel Borkmann  wrote:
>
> On 11/17/2018 07:57 PM, Ard Biesheuvel wrote:
> > Commit ede95a63b5e84 ("bpf: add bpf_jit_limit knob to restrict unpriv
> > allocations") added a call to bpf_jit_uncharge_modmem() to the routine
> > bpf_jit_binary_free() which is called from the __weak bpf_jit_free().
> > This function is overridden by arches, some of which do not call
> > bpf_jit_binary_free() to release the memory, and so the released
> > memory is not accounted for, potentially leading to spurious allocation
> > failures.
> >
> > So replace the direct calls to module_memfree() in the arch code with
> > calls to bpf_jit_binary_free().
>
> Sorry but this patch is completely buggy, and above description on the
> accounting incorrect as well. Looks like this patch was not tested at all.
>

My apologies. I went off into the weeds a bit looking at different
versions for 32-bit and 64-bit on different architectures. So indeed,
this patch should be dropped.

> The below cBPF JITs that use module_memfree() which you replace with
> bpf_jit_binary_free() are using module_alloc() internally to get the JIT
> image buffer ...
>

Indeed. So would you prefer for arm64 to override bpf_jit_free() in
its entirety, and not call bpf_jit_binary_free() but simply call
bpf_jit_uncharge_modmem() and vfree() directly? It's either that, or
we'd have to untangle this a bit, to avoid having one __weak function
on top of the other just so other arches can replace the
module_memfree() call in bpf_jit_binary_free() with vfree() (which
amounts to the same thing on arm64 anyway).


Re: [PATCH net v2 0/3] Bugfix for the netsec driver

2018-10-23 Thread Ard Biesheuvel
(+ Florian)

On 23 October 2018 at 08:24,   wrote:
> From: Masahisa Kojima 
>
> This patch series include bugfix for the netsec ethernet
> controller driver, fix the problem in interface down/up.
>
> changes in v2:
>  - change the place to perform the PHY power down
>  - use the MACROs defined in include/uapi/linux/mii.h
>  - update commit comment
>
> Masahisa Kojima (3):
>   net: socionext: Stop PHY before resetting netsec
>   net: socionext: Add dummy PHY register read in phy_write()
>   net: socionext: Reset tx queue in ndo_stop
>
>  drivers/net/ethernet/socionext/netsec.c | 40 ++++++++++++++++++++++++++++++++++------
>  1 file changed, 34 insertions(+), 6 deletions(-)
>

Hello Masahisa,

As a courtesy, please cc people that have commented on your patches on
subsequent revisions of the series.


Re: [PATCH 1/1] net: socionext: clear rx irq correctly

2018-10-11 Thread Ard Biesheuvel
On 11 October 2018 at 14:28, Ilias Apalodimas
 wrote:
> commit 63ae7949e94a ("net: socionext: Use descriptor info instead of MMIO reads on Rx")
> removed constant mmio reads from the driver and started using a descriptor
> field to check if a packet should be processed.
> This led to the napi rx handler being constantly called while no packets
> needed processing and ksoftirqd getting 100% cpu usage. Issue one mmio read
> to clear the irq correctly after processing packets.
>
> Signed-off-by: Ilias Apalodimas 
> Reported-by: Ard Biesheuvel 

Tested-by: Ard Biesheuvel 
Acked-by: Ard Biesheuvel 

> ---
>  drivers/net/ethernet/socionext/netsec.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
> index 7aa5ebb..4289ccb 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -735,8 +735,11 @@ static int netsec_process_rx(struct netsec_priv *priv, int budget)
> u16 idx = dring->tail;
> struct netsec_de *de = dring->vaddr + (DESC_SZ * idx);
>
> -   if (de->attr & (1U << NETSEC_RX_PKT_OWN_FIELD))
> +   if (de->attr & (1U << NETSEC_RX_PKT_OWN_FIELD)) {
> +   /* reading the register clears the irq */
> +   netsec_read(priv, NETSEC_REG_NRM_RX_PKTCNT);
> break;
> +   }
>
> /* This  barrier is needed to keep us from reading
>  * any other fields out of the netsec_de until we have
> --
> 2.7.4
>


Re: [RFC PATCH] skb: Define NET_IP_ALIGN based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

2018-10-04 Thread Ard Biesheuvel
(+ Arnd but really)

On 4 October 2018 at 19:43, Ard Biesheuvel  wrote:
> (+ Arnd, Russell, Catalin, Will)
>
> On 4 October 2018 at 19:36, Ben Hutchings  
> wrote:
>> NET_IP_ALIGN is supposed to be defined as 0 if DMA writes to an
>> unaligned buffer would be more expensive than CPU access to unaligned
>> header fields, and otherwise defined as 2.
>>
>> Currently only ppc64 and x86 configurations define it to be 0.
>> However several other architectures (conditionally) define
>> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, which seems to imply that
>> NET_IP_ALIGN should be 0.
>>
>> Remove the overriding definitions for ppc64 and x86 and define
>> NET_IP_ALIGN solely based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>>
>> Signed-off-by: Ben Hutchings 
>
> While this makes sense for arm64, I don't think it is appropriate for
> ARM per se.
>
> The unusual thing about ARM is that some instructions require 32-bit
> alignment even when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
> (i.e., load/store multiple, load/store double), and we rely on
> alignment fixups done by the kernel to deal with the fallout if such
> instructions happen to be used on unaligned quantities (Russell,
> please correct me if this is inaccurate).
>
>
>> ---
>>  arch/powerpc/include/asm/processor.h | 11 ---
>>  arch/x86/include/asm/processor.h |  8 
>>  include/linux/skbuff.h   |  7 +++
>>  3 files changed, 3 insertions(+), 23 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
>> index 52fadded5c1e..65c8210d2787 100644
>> --- a/arch/powerpc/include/asm/processor.h
>> +++ b/arch/powerpc/include/asm/processor.h
>> @@ -525,17 +525,6 @@ extern void cvt_fd(float *from, double *to);
>>  extern void cvt_df(double *from, float *to);
>>  extern void _nmask_and_or_msr(unsigned long nmask, unsigned long or_val);
>>
>> -#ifdef CONFIG_PPC64
>> -/*
>> - * We handle most unaligned accesses in hardware. On the other hand
>> - * unaligned DMA can be very expensive on some ppc64 IO chips (it does
>> - * powers of 2 writes until it reaches sufficient alignment).
>> - *
>> - * Based on this we disable the IP header alignment in network drivers.
>> - */
>> -#define NET_IP_ALIGN   0
>> -#endif
>> -
>>  #endif /* __KERNEL__ */
>>  #endif /* __ASSEMBLY__ */
>>  #endif /* _ASM_POWERPC_PROCESSOR_H */
>> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
>> index d53c54b842da..0108efc9726e 100644
>> --- a/arch/x86/include/asm/processor.h
>> +++ b/arch/x86/include/asm/processor.h
>> @@ -33,14 +33,6 @@ struct vm86;
>>  #include 
>>  #include 
>>
>> -/*
>> - * We handle most unaligned accesses in hardware.  On the other hand
>> - * unaligned DMA can be quite expensive on some Nehalem processors.
>> - *
>> - * Based on this we disable the IP header alignment in network drivers.
>> - */
>> -#define NET_IP_ALIGN   0
>> -
>>  #define HBP_NUM 4
>>  /*
>>   * Default implementation of macro that returns current
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 17a13e4785fc..42467be8021f 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -2435,11 +2435,10 @@ static inline int pskb_network_may_pull(struct sk_buff *skb, unsigned int len)
>>   * The downside to this alignment of the IP header is that the DMA is now
>>   * unaligned. On some architectures the cost of an unaligned DMA is high
>>   * and this cost outweighs the gains made by aligning the IP header.
>> - *
>> - * Since this trade off varies between architectures, we allow NET_IP_ALIGN
>> - * to be overridden.
>>   */
>> -#ifndef NET_IP_ALIGN
>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> +#define NET_IP_ALIGN   0
>> +#else
>>  #define NET_IP_ALIGN   2
>>  #endif
>>
>> --
>> Ben Hutchings, Software Developer Codethink Ltd
>> https://www.codethink.co.uk/ Dale House, 35 Dale Street
>>  Manchester, M1 2HF, United Kingdom
>>
>>
>> ___
>> linux-arm-kernel mailing list
>> linux-arm-ker...@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [RFC PATCH] skb: Define NET_IP_ALIGN based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS

2018-10-04 Thread Ard Biesheuvel
(+ Arnd, Russell, Catalin, Will)

On 4 October 2018 at 19:36, Ben Hutchings  wrote:
> NET_IP_ALIGN is supposed to be defined as 0 if DMA writes to an
> unaligned buffer would be more expensive than CPU access to unaligned
> header fields, and otherwise defined as 2.
>
> Currently only ppc64 and x86 configurations define it to be 0.
> However several other architectures (conditionally) define
> CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS, which seems to imply that
> NET_IP_ALIGN should be 0.
>
> Remove the overriding definitions for ppc64 and x86 and define
> NET_IP_ALIGN solely based on CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> Signed-off-by: Ben Hutchings 

While this makes sense for arm64, I don't think it is appropriate for
ARM per se.

The unusual thing about ARM is that some instructions require 32-bit
alignment even when CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is set
(i.e., load/store multiple, load/store double), and we rely on
alignment fixups done by the kernel to deal with the fallout if such
instructions happen to be used on unaligned quantities (Russell,
please correct me if this is inaccurate).


> ---
>  arch/powerpc/include/asm/processor.h | 11 ---
>  arch/x86/include/asm/processor.h |  8 
>  include/linux/skbuff.h   |  7 +++
>  3 files changed, 3 insertions(+), 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/processor.h b/arch/powerpc/include/asm/processor.h
> index 52fadded5c1e..65c8210d2787 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -525,17 +525,6 @@ extern void cvt_fd(float *from, double *to);
>  extern void cvt_df(double *from, float *to);
>  extern void _nmask_and_or_msr(unsigned long nmask, unsigned long or_val);
>
> -#ifdef CONFIG_PPC64
> -/*
> - * We handle most unaligned accesses in hardware. On the other hand
> - * unaligned DMA can be very expensive on some ppc64 IO chips (it does
> - * powers of 2 writes until it reaches sufficient alignment).
> - *
> - * Based on this we disable the IP header alignment in network drivers.
> - */
> -#define NET_IP_ALIGN   0
> -#endif
> -
>  #endif /* __KERNEL__ */
>  #endif /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_PROCESSOR_H */
> diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
> index d53c54b842da..0108efc9726e 100644
> --- a/arch/x86/include/asm/processor.h
> +++ b/arch/x86/include/asm/processor.h
> @@ -33,14 +33,6 @@ struct vm86;
>  #include 
>  #include 
>
> -/*
> - * We handle most unaligned accesses in hardware.  On the other hand
> - * unaligned DMA can be quite expensive on some Nehalem processors.
> - *
> - * Based on this we disable the IP header alignment in network drivers.
> - */
> -#define NET_IP_ALIGN   0
> -
>  #define HBP_NUM 4
>  /*
>   * Default implementation of macro that returns current
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 17a13e4785fc..42467be8021f 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2435,11 +2435,10 @@ static inline int pskb_network_may_pull(struct sk_buff *skb, unsigned int len)
>   * The downside to this alignment of the IP header is that the DMA is now
>   * unaligned. On some architectures the cost of an unaligned DMA is high
>   * and this cost outweighs the gains made by aligning the IP header.
> - *
> - * Since this trade off varies between architectures, we allow NET_IP_ALIGN
> - * to be overridden.
>   */
> -#ifndef NET_IP_ALIGN
> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> +#define NET_IP_ALIGN   0
> +#else
>  #define NET_IP_ALIGN   2
>  #endif
>
> --
> Ben Hutchings, Software Developer Codethink Ltd
> https://www.codethink.co.uk/ Dale House, 35 Dale Street
>  Manchester, M1 2HF, United Kingdom
>
>
> ___
> linux-arm-kernel mailing list
> linux-arm-ker...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-26 Thread Ard Biesheuvel
On Wed, 26 Sep 2018 at 17:50, Jason A. Donenfeld  wrote:
>
> On Wed, Sep 26, 2018 at 5:45 PM Jason A. Donenfeld  wrote:
> > So what you have in mind is something like calling simd_relax() every
> > 4096 bytes or so?
>
> That was actually pretty easy, putting together both of your suggestions:
>
> static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,
>  u8 *src, size_t len,
>  simd_context_t *simd_context)
> {
> while (len > PAGE_SIZE) {
> chacha20_arch(state, dst, src, PAGE_SIZE, simd_context);
> len -= PAGE_SIZE;
> src += PAGE_SIZE;
> dst += PAGE_SIZE;
> simd_relax(simd_context);
> }
> if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
> len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context))
> chacha20_neon(dst, src, len, state->key, state->counter);
> else
> chacha20_arm(dst, src, len, state->key, state->counter);
>
> state->counter[0] += (len + 63) / 64;
> return true;
> }

Nice one :-)

This works for me (but perhaps add a comment as well)


Re: Oops running iptables -F OUTPUT

2018-08-28 Thread Ard Biesheuvel
On 28 August 2018 at 15:56, Ard Biesheuvel  wrote:
> Hello Andreas, Nick,
>
> On 28 August 2018 at 06:06, Nicholas Piggin  wrote:
>> On Mon, 27 Aug 2018 19:11:01 +0200
>> Andreas Schwab  wrote:
>>
>>> I'm getting this Oops when running iptables -F OUTPUT:
>>>
>>> [   91.139409] Unable to handle kernel paging request for data at address 
>>> 0xd001fff12f34
>>> [   91.139414] Faulting instruction address: 0xd16a5718
>>> [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
>>> [   91.139426] BE SMP NR_CPUS=2 PowerMac
>>> [   91.139434] Modules linked in: iptable_filter ip_tables x_tables 
>>> bpfilter nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet 
>>> snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus 
>>> snd_aoa_soundbus snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device 
>>> snd_mixer_oss snd sungem sr_mod firewire_ohci cdrom sungem_phy soundcore 
>>> firewire_core pata_macio crc_itu_t sg hid_generic usbhid linear md_mod 
>>> ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore usb_common dm_snapshot dm_bufio 
>>> dm_mirror dm_region_hash dm_log dm_mod sata_svw
>>> [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
>>> [   91.139526] NIP:  d16a5718 LR: d16a569c CTR: 
>>> c06f560c
>>> [   91.139531] REGS: c001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
>>> [   91.139534] MSR:  9200b032   CR: 
>>> 84002484  XER: 2000
>>> [   91.139553] DAR: d001fff12f34 DSISR: 4000 IRQMASK: 0
>>> GPR00: d16a569c c001fa5778f0 d16b0400 
>>> GPR04: 0002  8001fa46418e c001fa0d05c8
>>> GPR08: d16b0400 d00037f13000 0001ff3e7000 d16a6fb8
>>> GPR12: c06f560c c780  
>>> GPR16: 11635010 3fffa1b7aa68  
>>> GPR20: 0003 10013918 116350c0 c0b88990
>>> GPR24: c0b88ba4  d001fff12f34 
>>> GPR28: d16b8000 c001fa20f400 c001fa20f440 
>>> [   91.139627] NIP [d16a5718] .alloc_counters.isra.10+0xbc/0x140 
>>> [ip_tables]
>>> [   91.139634] LR [d16a569c] .alloc_counters.isra.10+0x40/0x140 
>>> [ip_tables]
>>> [   91.139638] Call Trace:
>>> [   91.139645] [c001fa5778f0] [d16a569c] 
>>> .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
>>> [   91.139655] [c001fa5779b0] [d16a5b54] 
>>> .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
>>> [   91.139666] [c001fa577aa0] [c06233e0] 
>>> .nf_getsockopt+0x68/0x88
>>> [   91.139674] [c001fa577b40] [c0631608] 
>>> .ip_getsockopt+0xbc/0x128
>>> [   91.139682] [c001fa577bf0] [c065adf4] 
>>> .raw_getsockopt+0x18/0x5c
>>> [   91.139690] [c001fa577c60] [c05b5f60] 
>>> .sock_common_getsockopt+0x2c/0x40
>>> [   91.139697] [c001fa577cd0] [c05b3394] 
>>> .__sys_getsockopt+0xa4/0xd0
>>> [   91.139704] [c001fa577d80] [c05b5ab0] 
>>> .__se_sys_socketcall+0x238/0x2b4
>>> [   91.139712] [c001fa577e30] [c000a31c] system_call+0x5c/0x70
>>> [   91.139716] Instruction dump:
>>> [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 8138 2b890001 
>>> 419d000c 393e0060
>>> [   91.139736] 4810 7d57c82a e93e0060 7d295214 <815a> 794807e1 
>>> 41e20010 7c210b78
>>> [   91.139752] ---[ end trace f5d1d5431651845d ]---
>>
>> This is due to 7290d58095 ("module: use relative references for
>> __ksymtab entries"). This part of kernel/module.c -
>>
>>	/* Divert to percpu allocation if a percpu var. */
>>	if (sym[i].st_shndx == info->index.pcpu)
>>		secbase = (unsigned long)mod_percpu(mod);
>>	else
>>		secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
>>	sym[i].st_value += secbase;
>>
>> Causes the distance to the target to exceed 32-bits on powerpc, so
>> it doesn't fit in a rel32 reloc. Not sure how other archs cope.
>>
>
> Apologies for the breakage. It does indeed appear to affect all
> architectures, and I'm a bit puzzled why you are the first one to spot
> it.
>
> I will try to find a clean way to special case the per-CPU variable
> __ksymtab references in the generic module code, and if that is too
> cumbersome, we can switch to 64-bit relative references (or rather,
> native word size relative references) instead. Or revert the whole
> thing ...

Re: Oops running iptables -F OUTPUT

2018-08-28 Thread Ard Biesheuvel
Hello Andreas, Nick,

On 28 August 2018 at 06:06, Nicholas Piggin  wrote:
> On Mon, 27 Aug 2018 19:11:01 +0200
> Andreas Schwab  wrote:
>
>> I'm getting this Oops when running iptables -F OUTPUT:
>>
>> [   91.139409] Unable to handle kernel paging request for data at address 
>> 0xd001fff12f34
>> [   91.139414] Faulting instruction address: 0xd16a5718
>> [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
>> [   91.139426] BE SMP NR_CPUS=2 PowerMac
>> [   91.139434] Modules linked in: iptable_filter ip_tables x_tables bpfilter 
>> nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet snd_aoa_codec_tas 
>> snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss 
>> snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss snd sungem sr_mod 
>> firewire_ohci cdrom sungem_phy soundcore firewire_core pata_macio crc_itu_t 
>> sg hid_generic usbhid linear md_mod ohci_pci ohci_hcd ehci_pci ehci_hcd 
>> usbcore usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log 
>> dm_mod sata_svw
>> [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
>> [   91.139526] NIP:  d16a5718 LR: d16a569c CTR: 
>> c06f560c
>> [   91.139531] REGS: c001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
>> [   91.139534] MSR:  9200b032   CR: 
>> 84002484  XER: 2000
>> [   91.139553] DAR: d001fff12f34 DSISR: 4000 IRQMASK: 0
>> GPR00: d16a569c c001fa5778f0 d16b0400 
>> GPR04: 0002  8001fa46418e c001fa0d05c8
>> GPR08: d16b0400 d00037f13000 0001ff3e7000 d16a6fb8
>> GPR12: c06f560c c780  
>> GPR16: 11635010 3fffa1b7aa68  
>> GPR20: 0003 10013918 116350c0 c0b88990
>> GPR24: c0b88ba4  d001fff12f34 
>> GPR28: d16b8000 c001fa20f400 c001fa20f440 
>> [   91.139627] NIP [d16a5718] .alloc_counters.isra.10+0xbc/0x140 
>> [ip_tables]
>> [   91.139634] LR [d16a569c] .alloc_counters.isra.10+0x40/0x140 
>> [ip_tables]
>> [   91.139638] Call Trace:
>> [   91.139645] [c001fa5778f0] [d16a569c] 
>> .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
>> [   91.139655] [c001fa5779b0] [d16a5b54] 
>> .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
>> [   91.139666] [c001fa577aa0] [c06233e0] .nf_getsockopt+0x68/0x88
>> [   91.139674] [c001fa577b40] [c0631608] 
>> .ip_getsockopt+0xbc/0x128
>> [   91.139682] [c001fa577bf0] [c065adf4] 
>> .raw_getsockopt+0x18/0x5c
>> [   91.139690] [c001fa577c60] [c05b5f60] 
>> .sock_common_getsockopt+0x2c/0x40
>> [   91.139697] [c001fa577cd0] [c05b3394] 
>> .__sys_getsockopt+0xa4/0xd0
>> [   91.139704] [c001fa577d80] [c05b5ab0] 
>> .__se_sys_socketcall+0x238/0x2b4
>> [   91.139712] [c001fa577e30] [c000a31c] system_call+0x5c/0x70
>> [   91.139716] Instruction dump:
>> [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 8138 2b890001 
>> 419d000c 393e0060
>> [   91.139736] 4810 7d57c82a e93e0060 7d295214 <815a> 794807e1 
>> 41e20010 7c210b78
>> [   91.139752] ---[ end trace f5d1d5431651845d ]---
>
> This is due to 7290d58095 ("module: use relative references for
> __ksymtab entries"). This part of kernel/module.c -
>
>	/* Divert to percpu allocation if a percpu var. */
>	if (sym[i].st_shndx == info->index.pcpu)
>		secbase = (unsigned long)mod_percpu(mod);
>	else
>		secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
>	sym[i].st_value += secbase;
>
> Causes the distance to the target to exceed 32-bits on powerpc, so
> it doesn't fit in a rel32 reloc. Not sure how other archs cope.
>

Apologies for the breakage. It does indeed appear to affect all
architectures, and I'm a bit puzzled why you are the first one to spot
it.

I will try to find a clean way to special case the per-CPU variable
__ksymtab references in the generic module code, and if that is too
cumbersome, we can switch to 64-bit relative references (or rather,
native word size relative references) instead. Or revert the whole
thing ...


Re: [PATCH] net: netsec: reduce DMA mask to 40 bits

2018-05-26 Thread Ard Biesheuvel
On 26 May 2018 at 05:44, Jassi Brar <jaswinder.si...@linaro.org> wrote:
> On 26 May 2018 at 08:56, Jassi Brar <jaswinder.si...@linaro.org> wrote:
>> On 26 May 2018 at 01:07, Robin Murphy <robin.mur...@arm.com> wrote:
>>> On Sat, 26 May 2018 00:33:05 +0530
>>> Jassi Brar <jaswinder.si...@linaro.org> wrote:
>>>
>>>> On 25 May 2018 at 18:20, Ard Biesheuvel <ard.biesheu...@linaro.org>
>>>> wrote:
>>>> > The netsec network controller IP can drive 64 address bits for DMA,
>>>> > and the DMA mask is set accordingly in the driver. However, the
>>>> > SynQuacer SoC, which is the only silicon incorporating this IP at
>>>> > the moment, integrates this IP in a manner that leaves address bits
>>>> > [63:40] unconnected.
>>>> >
>>>> > Up until now, this has not resulted in any problems, given that the
>>>> > DDR controller doesn't decode those bits to begin with. However,
>>>> > recent firmware updates for platforms incorporating this SoC allow
>>>> > the IOMMU to be enabled, which does decode address bits [47:40],
>>>> > and allocates top down from the IOVA space, producing DMA addresses
>>>> > that have bits set that have been left unconnected.
>>>> >
>>>> > Both the DT and ACPI (IORT) descriptions of the platform take this
>>>> > into account, and only describe a DMA address space of 40 bits
>>>> > (using either dma-ranges DT properties, or DMA address limits in
>>>> > IORT named component nodes). However, even though our IOMMU and bus
>>>> > layers may take such limitations into account by setting a narrower
>>>> > DMA mask when creating the platform device, the netsec probe()
>>>> > entrypoint follows the common practice of setting the DMA mask
>>>> > unconditionally, according to the capabilities of the IP block itself
>>>> > rather than to its integration into the chip.
>>>> >
>>>> > It is currently unclear what the correct fix is here. We could hack
>>>> > around it by only setting the DMA mask if it deviates from its
>>>> > default value of DMA_BIT_MASK(32). However, this makes it
>>>> > impossible for the bus layer to use DMA_BIT_MASK(32) as the bus
>>>> > limit, and so it appears that a more comprehensive approach is
>>>> > required to take DMA limits imposed by the SoC as a whole into
>>>> > account.
>>>> >
>>>> > In the mean time, let's limit the DMA mask to 40 bits. Given that
>>>> > there is currently only one SoC that incorporates this IP, this is
>>>> > a reasonable approach that can be backported to -stable and buys us
>>>> > some time to come up with a proper fix going forward.
>>>> >
>>>> I am sure you already thought about it, but why not let the platform
>>>> specify the bit mask for the driver (via some "bus-width" property),
>>>> to override the default 64 bit mask?
>>>
>>> Because lack of a property to describe the integration is not the
>>> problem. There are already at least two ways: the general DT/IORT
>>> properties for describing DMA addressing - which it would be a bit
>>> ungainly for a driver to parse for this reason, but not impossible -
>> 
>>
>>
>>> and inferring it from a SoC-specific compatible - which is more
>>> appropriate, and what we happen to be able to do here.
>>>
>> Sorry, I am not sure I follow. This patch changes from 64-bits default
>> to 40-bits capability without checking for the parent SoC. If the next
>> generation implements the full 64-bit or just 32-bit bus, we'll be
>> back in the pit again. No?
>>
> Probably you meant we'll change the ethernet compatible string for
> differently capable SoC. OK, but here it is more of integration issue
> than controller version.
>
> Which makes me realise the extant compatible property for netsec is
> not so correct (it embeds the platform name). So I am ok either way.
>

The platform in question has a dma-ranges DT property at the root
level that only describes 40 bits' worth of DMA. Also, the ACPI
description in the IORT table of the IOMMU integration of the netsec
controller limits DMA to 40 bits. In the latter case, we actually
enter netsec_probe() with the correct value already assigned to the
DMA mask fields. (In the former case, the DMA limit is ignored
entirely.)

In other words, we can already describe these SoC limitations and
distinguish them from device limitations. The problem is that drivers
ignore the existing values of DMA mask.

Robin has volunteered to look into fixing this, but this cannot be
done in a way that is suitable for -stable. In the mean time, we have
a single platform using this network IP in the field that cannot
upgrade its firmware to a version that describes the IOMMU, because
the existing DMA layer code will start driving address bits that are
correctly described as unconnected by the DT/ACPI tables.

So as a workaround, until Robin fixes things properly, let's reduce
the DMA mask to 40 bits.


[PATCH] net: netsec: reduce DMA mask to 40 bits

2018-05-25 Thread Ard Biesheuvel
The netsec network controller IP can drive 64 address bits for DMA, and
the DMA mask is set accordingly in the driver. However, the SynQuacer
SoC, which is the only silicon incorporating this IP at the moment,
integrates this IP in a manner that leaves address bits [63:40]
unconnected.

Up until now, this has not resulted in any problems, given that the DDR
controller doesn't decode those bits to begin with. However, recent
firmware updates for platforms incorporating this SoC allow the IOMMU
to be enabled, which does decode address bits [47:40], and allocates
top down from the IOVA space, producing DMA addresses that have bits
set that have been left unconnected.

Both the DT and ACPI (IORT) descriptions of the platform take this into
account, and only describe a DMA address space of 40 bits (using either
dma-ranges DT properties, or DMA address limits in IORT named component
nodes). However, even though our IOMMU and bus layers may take such
limitations into account by setting a narrower DMA mask when creating
the platform device, the netsec probe() entrypoint follows the common
practice of setting the DMA mask unconditionally, according to the
capabilities of the IP block itself rather than to its integration into
the chip.

It is currently unclear what the correct fix is here. We could hack around
it by only setting the DMA mask if it deviates from its default value of
DMA_BIT_MASK(32). However, this makes it impossible for the bus layer to
use DMA_BIT_MASK(32) as the bus limit, and so it appears that a more
comprehensive approach is required to take DMA limits imposed by the
SoC as a whole into account.

In the mean time, let's limit the DMA mask to 40 bits. Given that there
is currently only one SoC that incorporates this IP, this is a reasonable
approach that can be backported to -stable and buys us some time to come
up with a proper fix going forward.

Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
Cc: Robin Murphy <robin.mur...@arm.com>
Cc: Jassi Brar <jaswinder.si...@linaro.org>
Cc: Masahisa Kojima <masahisa.koj...@linaro.org>
Cc: Ilias Apalodimas <ilias.apalodi...@linaro.org>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
Please cc to -stable (v4.16+)

 drivers/net/ethernet/socionext/netsec.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/socionext/netsec.c b/drivers/net/ethernet/socionext/netsec.c
index f4c0b02ddad8..59fbf74dcada 100644
--- a/drivers/net/ethernet/socionext/netsec.c
+++ b/drivers/net/ethernet/socionext/netsec.c
@@ -1674,8 +1674,8 @@ static int netsec_probe(struct platform_device *pdev)
if (ret)
goto unreg_napi;
 
-   if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64)))
-           dev_warn(&pdev->dev, "Failed to enable 64-bit DMA\n");
+   if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(40)))
+           dev_warn(&pdev->dev, "Failed to set DMA mask\n");
 
ret = register_netdev(ndev);
if (ret) {
-- 
2.17.0



Re: [net-next: PATCH 0/8] Armada 7k/8k PP2 ACPI support

2018-01-21 Thread Ard Biesheuvel
On 21 January 2018 at 16:13, Andrew Lunn  wrote:
>> Right. So if you need to have some additional "parameters" with the
>> connection, then I suppose you may want to go with the GenericSerialBus
>> route. However, looking at the sample device tree description:
>>
>> davinci_mdio: ethernet@5c03 {
>> compatible = "ti,davinci_mdio";
>> reg = <0x5c03 0x1000>;
>> #address-cells = <1>;
>> #size-cells = <0>;
>>
>> reset-gpios = < 5 GPIO_ACTIVE_LOW>;
>> reset-delay-us = <2>;
>>
>> ethphy0: ethernet-phy@1 {
>> reg = <1>;
>> };
>>
>> ethphy1: ethernet-phy@3 {
>> reg = <3>;
>> };
>> };
>>
>> would pretty much translate directly to this in ACPI if you don't need
>> any additional attributes:
>>
>>   Device (ETH0) {
>>   Name (_ADR, /* PCI address of the NIC */)
>>
>>   Device (PHY0) {
>>   Name (_ADR, 1)
>>   ...
>>   }
>>
>>   Device (PHY1) {
>>   Name (_ADR, 3)
>>   ...
>>   }
>>   }
>>
>> which looks pretty simple to me. You can also use _DSM and _DSD here to
>> pass information (like the protocol number) for the PHY devices to Linux.
>
> I'm not particularly worried about that simple case. Other than, i
> don't want people to think that is all that is required.
>
> For a more full example, take a look at vf610-zii-dev-rev-b.dts. The
> Freescale FEC Ethernet controller provides the base MDIO device,
> mdio1. On top of this is an MDIO mux, using a few GPIO lines to
> enable/disable 3 child MDIO busses. Each of these busses has an
> Ethernet Switch. The Ethernet switch exports up to two MDIO busses,
> and on these busses are Ethernet PHYs which are embedded inside the
> switch. The Ethernet switches are also interrupt controllers, with the
> PHYs having interrupt properties which point back to the interrupt
> controller in the switch.
>
> So i'm interested in an ACPI proposal which supports this board.
>

However interesting as an example, I'm not convinced this is what we
should aim for.

ACPI is not a replacement for DT, and it is unlikely that people would
be interested in running ACPI-only distros such as RHEL on their
Ethernet switch. DT is excellent at describing this, and there is no
need to replace it.

ACPI is about firmware abstractions: you don't need to describe every
stacked interrupt controller in minute detail to the OS if the
firmware configures it sufficiently. That way, the OS does not need to
know all these details, and vendors can update their hardware without
having to update the software as well. (Or that is the idea at least,
how that works out in practice on arm64 systems remains to be seen)

Taking the Marvell 8040 as an example: the ACPI description does not
expose the ICU interrupt controllers to the OS, but the firmware
configures them and describes their configured state as ordinary GIC
interrupts.

Also, please bear in mind that the ACPI spec is owned by the UEFI/ACPI
forum, and only members (who are all under a contract regarding
reasonable and non-discriminatory (RAND) licensing terms to IP they
own that is covered by the spec) can contribute. Individuals can
become members for free AFAIR, but that doesn't mean individuals are
typically interested in getting involved in a process that is rather
tedious and bureaucratic.

So my suggestion would be to find out to what extent the latest
version of the spec can describe non-trivial MDIO topologies, and
hopefully that includes the mvpp2 on the Marvell 8040.


Re: [PATCH] [net-next] net: netsec: use dma_addr_t for storing dma address

2018-01-13 Thread Ard Biesheuvel
On 13 January 2018 at 21:13, Arnd Bergmann <a...@arndb.de> wrote:
> On targets that have different sizes for phys_addr_t and dma_addr_t,
> we get a type mismatch error:
>
> drivers/net/ethernet/socionext/netsec.c: In function 'netsec_alloc_dring':
> drivers/net/ethernet/socionext/netsec.c:970:9: error: passing argument 3 of 
> 'dma_zalloc_coherent' from incompatible pointer type 
> [-Werror=incompatible-pointer-types]
>
> The code is otherwise correct, as the address is never actually used as a
> physical address but only passed into a DMA register.  For consistently,

consistency

> I'm changing the variable name as well, to clarify that this is a DMA
> address.
>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  drivers/net/ethernet/socionext/netsec.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index 6c263af86b8a..f4c0b02ddad8 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -252,7 +252,7 @@ struct netsec_desc {
>  };
>
>  struct netsec_desc_ring {
> -   phys_addr_t desc_phys;
> +   dma_addr_t desc_dma;
> struct netsec_desc *desc;
> void *vaddr;
> u16 pkt_cnt;
> @@ -953,7 +953,7 @@ static void netsec_free_dring(struct netsec_priv *priv, 
> int id)
>
> if (dring->vaddr) {
> dma_free_coherent(priv->dev, DESC_SZ * DESC_NUM,
> - dring->vaddr, dring->desc_phys);
> + dring->vaddr, dring->desc_dma);
> dring->vaddr = NULL;
> }
>
> @@ -967,7 +967,7 @@ static int netsec_alloc_dring(struct netsec_priv *priv, 
> enum ring_id id)
> int ret = 0;
>
> dring->vaddr = dma_zalloc_coherent(priv->dev, DESC_SZ * DESC_NUM,
> -  &dring->desc_phys, GFP_KERNEL);
> +  &dring->desc_dma, GFP_KERNEL);
> if (!dring->vaddr) {
> ret = -ENOMEM;
> goto err;
> @@ -1087,14 +1087,14 @@ static int netsec_reset_hardware(struct netsec_priv 
> *priv)
>
> /* set desc_start addr */
> netsec_write(priv, NETSEC_REG_NRM_RX_DESC_START_UP,
> -upper_32_bits(priv->desc_ring[NETSEC_RING_RX].desc_phys));
> +upper_32_bits(priv->desc_ring[NETSEC_RING_RX].desc_dma));
> netsec_write(priv, NETSEC_REG_NRM_RX_DESC_START_LW,
> -lower_32_bits(priv->desc_ring[NETSEC_RING_RX].desc_phys));
> +lower_32_bits(priv->desc_ring[NETSEC_RING_RX].desc_dma));
>
> netsec_write(priv, NETSEC_REG_NRM_TX_DESC_START_UP,
> -upper_32_bits(priv->desc_ring[NETSEC_RING_TX].desc_phys));
> +upper_32_bits(priv->desc_ring[NETSEC_RING_TX].desc_dma));
> netsec_write(priv, NETSEC_REG_NRM_TX_DESC_START_LW,
> -lower_32_bits(priv->desc_ring[NETSEC_RING_TX].desc_phys));
> +lower_32_bits(priv->desc_ring[NETSEC_RING_TX].desc_dma));
>
> /* set normal tx dring ring config */
> netsec_write(priv, NETSEC_REG_NRM_TX_CONFIG,
> --
> 2.9.0
>
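(Aside for readers unfamiliar with the helpers in the hunk above:
upper_32_bits() and lower_32_bits() split a 64-bit value, such as a
dma_addr_t, into two 32-bit register halves. A standalone userspace
sketch of the same idea -- not the kernel implementation:)

```c
#include <stdint.h>

/* Userspace stand-ins for the kernel's upper_32_bits()/lower_32_bits() */
static uint32_t upper_32(uint64_t v)
{
	return (uint32_t)(v >> 32);
}

static uint32_t lower_32(uint64_t v)
{
	return (uint32_t)v;
}
```

Writing the two halves to the _UP and _LW registers is how the full
64-bit descriptor ring address is handed to the hardware.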


Re: [PATCH] [net-next] net: socionext: include linux/io.h to fix build

2018-01-11 Thread Ard Biesheuvel
On 11 January 2018 at 10:36, Arnd Bergmann <a...@arndb.de> wrote:
> I ran into a randconfig build failure:
>
> drivers/net/ethernet/socionext/netsec.c: In function 'netsec_probe':
> drivers/net/ethernet/socionext/netsec.c:1583:17: error: implicit declaration 
> of function 'devm_ioremap'; did you mean 'ioremap'? 
> [-Werror=implicit-function-declaration]
>
> Including linux/io.h directly fixes this.
>
> Fixes: 533dd11a12f6 ("net: socionext: Add Synquacer NetSec driver")
> Signed-off-by: Arnd Bergmann <a...@arndb.de>

Thanks for fixing this. This is the same issue spotted by kbuild test robot.

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  drivers/net/ethernet/socionext/netsec.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> index a8edcf387bba..af47147dd656 100644
> --- a/drivers/net/ethernet/socionext/netsec.c
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include <linux/io.h>
>
>  #include 
>  #include 
> --
> 2.9.0
>


Re: [PATCHv5 1/3] dt-bindings: net: Add DT bindings for Socionext Netsec

2018-01-03 Thread Ard Biesheuvel
Hi Jassi,

On 1 January 2018 at 05:14,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jassisinghb...@gmail.com>
>
> This patch adds documentation for Device-Tree bindings for the
> Socionext NetSec Controller driver.
>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> ---
>  .../devicetree/bindings/net/socionext-netsec.txt   | 53 
> ++
>  1 file changed, 53 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/socionext-netsec.txt
>
> diff --git a/Documentation/devicetree/bindings/net/socionext-netsec.txt 
> b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> new file mode 100644
> index 000..b70e35b
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> @@ -0,0 +1,53 @@
> +* Socionext NetSec Ethernet Controller IP
> +
> +Required properties:
> +- compatible: Should be "socionext,synquacer-netsec"
> +- reg: Address and length of the control register area, followed by the
> +   address and length of the EEPROM holding the MAC address and
> +   microengine firmware
> +- interrupts: Should contain ethernet controller interrupt
> +- clocks: phandle to the PHY reference clock
> +- clock-names: Should be "phy_ref_clk"

If clock-names is mandatory now even when only a single clock is
specified, you should add it to the example as well. However, please
be aware that hardware has shipped with DT images that specify only
a single clock and no clock-names property: those will require a
firmware upgrade before they can use this version of the driver.
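Concretely, the example further down would then carry both properties;
a minimal sketch (the clk_netsec phandle is taken from that example and
is purely illustrative):

```dts
clocks = <&clk_netsec>;
clock-names = "phy_ref_clk";
```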

> +- phy-mode: See ethernet.txt file in the same directory
> +- phy-handle: See ethernet.txt in the same directory.
> +
> +- mdio device tree subnode: When the Netsec has a phy connected to its local
> +   mdio, there must be device tree subnode with the following
> +   required properties:
> +
> +   - #address-cells: Must be <1>.
> +   - #size-cells: Must be <0>.
> +
> +   For each phy on the mdio bus, there must be a node with the following
> +   fields:
> +   - compatible: Refer to phy.txt
> +   - reg: phy id used to communicate to phy.
> +
> +Optional properties: (See ethernet.txt file in the same directory)
> +- dma-coherent: Boolean property, must only be present if memory
> +   accesses performed by the device are cache coherent.
> +- local-mac-address: See ethernet.txt in the same directory.
> +- mac-address: See ethernet.txt in the same directory.
> +- max-speed: See ethernet.txt in the same directory.
> +- max-frame-size: See ethernet.txt in the same directory.
> +
> +Example:
> +   eth0: ethernet@522d {
> +   compatible = "socionext,synquacer-netsec";
> +   reg = <0 0x522d 0x0 0x1>, <0 0x1000 0x0 0x1>;
> +   interrupts = ;
> +   clocks = <&clk_netsec>;
> +   phy-mode = "rgmii";
> +   max-speed = <1000>;
> +   max-frame-size = <9000>;
> +   phy-handle = <&phy1>;
> +
> +   mdio {
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +   phy1: ethernet-phy@1 {
> +   compatible = "ethernet-phy-ieee802.3-c22";
> +   reg = <1>;
> +   };
> +   };
> +   };
> --
> 2.7.4
>


Re: [PATCHv4 2/3] net: socionext: Add Synquacer NetSec driver

2017-12-23 Thread Ard Biesheuvel
On 23 December 2017 at 15:01, Jassi Brar <jassisinghb...@gmail.com> wrote:
> On Sat, Dec 23, 2017 at 4:09 PM, Ard Biesheuvel
> <ard.biesheu...@linaro.org> wrote:
>> On 23 December 2017 at 05:45,  <jassisinghb...@gmail.com> wrote:
>>> From: Jassi Brar <jaswinder.si...@linaro.org>
>>>
>>> This driver adds support for Socionext "netsec" IP Gigabit
>>> Ethernet + PHY IP used in the Synquacer SC2A11 SoC.
>>>
>>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
>>> ---
>>>  drivers/net/ethernet/Kconfig|1 +
>>>  drivers/net/ethernet/Makefile   |1 +
>>>  drivers/net/ethernet/socionext/Kconfig  |   29 +
>>>  drivers/net/ethernet/socionext/Makefile |1 +
>>>  drivers/net/ethernet/socionext/netsec.c | 1844 
>>> +++
>>>  5 files changed, 1876 insertions(+)
>>>  create mode 100644 drivers/net/ethernet/socionext/Kconfig
>>>  create mode 100644 drivers/net/ethernet/socionext/Makefile
>>>  create mode 100644 drivers/net/ethernet/socionext/netsec.c
>>>
>> ...
>>> diff --git a/drivers/net/ethernet/socionext/netsec.c 
>>> b/drivers/net/ethernet/socionext/netsec.c
>>> new file mode 100644
>>> index 000..6af047b
>>> --- /dev/null
>>> +++ b/drivers/net/ethernet/socionext/netsec.c
>> ...
>>> +static int netsec_netdev_load_ucode_region(struct netsec_priv *priv, u32 
>>> reg,
>>> +  u32 addr_h, u32 addr_l, u32 size)
>>> +{
>>> +   u64 base = (u64)addr_h << 32 | addr_l;
>>> +   void __iomem *ucode;
>>> +   u32 i;
>>> +
>>> +   ucode = ioremap(base, size * sizeof(u32));
>>> +   if (!ucode)
>>> +   return -ENOMEM;
>>> +
>>> +   for (i = 0; i < size; i++)
>>> +   netsec_write(priv, reg, readl(ucode + i));
>>> +
>>
>> This is incorrect. The microcode is written one u32 word at a time,
>> and indexing ucode like this results in byte indexing, not u32
>> indexing.
>>
> Ouch! careless mistake. I was too eager to get done with netsec before
> I leave for holidays.
>
>> I changed the ucode declaration locally to
>>
>> u32 __iomem *ucode;
>>
>> and now everything works fine again.
>>
> Or we keep the void pointer but do readl(ucode + i * 4) ?
>

Whichever you prefer.

>
>>
>>> +   iounmap(ucode);
>>> +   return 0;
>>> +}
>>> +
>> ...
>>> +static int netsec_register_mdio(struct netsec_priv *priv, u32 phy_addr)
>>> +{
>>> +   struct mii_bus *bus;
>>> +   int ret;
>>> +
>>> +   bus = devm_mdiobus_alloc(priv->dev);
>>> +   if (!bus)
>>> +   return -ENOMEM;
>>> +
>>> +   snprintf(bus->id, MII_BUS_ID_SIZE, "%s", dev_name(priv->dev));
>>> +   bus->priv = priv;
>>> +   bus->name = "SNI NETSEC MDIO";
>>> +   bus->read = netsec_phy_read;
>>> +   bus->write = netsec_phy_write;
>>> +   bus->parent = priv->dev;
>>> +   priv->mii_bus = bus;
>>> +
>>> +   if (dev_of_node(priv->dev)) {
>>> +   struct device_node *mdio_node, *parent = 
>>> dev_of_node(priv->dev);
>>> +
>>> +   mdio_node = of_get_child_by_name(parent, "mdio");
>>> +   if (mdio_node) {
>>> +   parent = mdio_node;
>>> +   } else {
>>> +   /* older f/w doesn't populate the mdio subnode,
>>> +* allow relaxed upgrade of f/w in due time.
>>> +*/
>>> +   dev_err(priv->dev, "Upgrade f/w for mdio 
>>> subnode!\n");
>>
>> I wouldn't mind if you dropped this fallback altogether, and would
>> simply stick with the new binding only. However, if you prefer to keep
>> it, could you change this to dev_info()? It is not really an error
>> condition, and dev_err/dev_warns have the annoying tendency to pierce
>> through 'quiet' boot splashes.
>>
> Yes, it should have been dev_info. But I would like to keep it,
> at least for a couple of months. For example, my board needs JTAG to
> upgrade f/w.
>

Fair enough.

> Thanks.

Likewise! And happy holidays.


Re: [PATCHv4 2/3] net: socionext: Add Synquacer NetSec driver

2017-12-23 Thread Ard Biesheuvel
On 23 December 2017 at 05:45,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jaswinder.si...@linaro.org>
>
> This driver adds support for Socionext "netsec" IP Gigabit
> Ethernet + PHY IP used in the Synquacer SC2A11 SoC.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
> ---
>  drivers/net/ethernet/Kconfig|1 +
>  drivers/net/ethernet/Makefile   |1 +
>  drivers/net/ethernet/socionext/Kconfig  |   29 +
>  drivers/net/ethernet/socionext/Makefile |1 +
>  drivers/net/ethernet/socionext/netsec.c | 1844 
> +++
>  5 files changed, 1876 insertions(+)
>  create mode 100644 drivers/net/ethernet/socionext/Kconfig
>  create mode 100644 drivers/net/ethernet/socionext/Makefile
>  create mode 100644 drivers/net/ethernet/socionext/netsec.c
>
...
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> new file mode 100644
> index 000..6af047b
> --- /dev/null
> +++ b/drivers/net/ethernet/socionext/netsec.c
...
> +static int netsec_netdev_load_ucode_region(struct netsec_priv *priv, u32 reg,
> +  u32 addr_h, u32 addr_l, u32 size)
> +{
> +   u64 base = (u64)addr_h << 32 | addr_l;
> +   void __iomem *ucode;
> +   u32 i;
> +
> +   ucode = ioremap(base, size * sizeof(u32));
> +   if (!ucode)
> +   return -ENOMEM;
> +
> +   for (i = 0; i < size; i++)
> +   netsec_write(priv, reg, readl(ucode + i));
> +

This is incorrect. The microcode is written one u32 word at a time,
and indexing ucode like this results in byte indexing, not u32
indexing.

I changed the ucode declaration locally to

u32 __iomem *ucode;

and now everything works fine again.
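(To spell out the arithmetic: with a void * base, "+ i" advances i
bytes -- a GCC extension -- while with a u32 * base it advances i
words. A standalone sketch of the two strides, not the driver code
itself:)

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset reached by "base + i" for the two pointer types discussed */
static size_t u32_stride(size_t i)
{
	return i * sizeof(uint32_t);	/* (u32 __iomem *)ucode + i */
}

static size_t byte_stride(size_t i)
{
	return i;			/* (void __iomem *)ucode + i, GCC extension */
}
```

So readl(ucode + i) on a void * walks the microcode image one byte at
a time instead of one word at a time, which is exactly the symptom
above; readl(ucode + i * 4) on the void * would be equivalent to the
u32 * fix.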


> +   iounmap(ucode);
> +   return 0;
> +}
> +
...
> +static int netsec_register_mdio(struct netsec_priv *priv, u32 phy_addr)
> +{
> +   struct mii_bus *bus;
> +   int ret;
> +
> +   bus = devm_mdiobus_alloc(priv->dev);
> +   if (!bus)
> +   return -ENOMEM;
> +
> +   snprintf(bus->id, MII_BUS_ID_SIZE, "%s", dev_name(priv->dev));
> +   bus->priv = priv;
> +   bus->name = "SNI NETSEC MDIO";
> +   bus->read = netsec_phy_read;
> +   bus->write = netsec_phy_write;
> +   bus->parent = priv->dev;
> +   priv->mii_bus = bus;
> +
> +   if (dev_of_node(priv->dev)) {
> +   struct device_node *mdio_node, *parent = 
> dev_of_node(priv->dev);
> +
> +   mdio_node = of_get_child_by_name(parent, "mdio");
> +   if (mdio_node) {
> +   parent = mdio_node;
> +   } else {
> +   /* older f/w doesn't populate the mdio subnode,
> +* allow relaxed upgrade of f/w in due time.
> +*/
> +   dev_err(priv->dev, "Upgrade f/w for mdio subnode!\n");

I wouldn't mind if you dropped this fallback altogether, and would
simply stick with the new binding only. However, if you prefer to keep
it, could you change this to dev_info()? It is not really an error
condition, and dev_err/dev_warns have the annoying tendency to pierce
through 'quiet' boot splashes.

> +   }
> +
> +   ret = of_mdiobus_register(bus, parent);
> +   of_node_put(mdio_node);
> +
> +   if (ret) {
> +   dev_err(priv->dev, "mdiobus register err(%d)\n", ret);
> +   return ret;
> +   }
> +   } else {
> +   /* Mask out all PHYs from auto probing. */
> +   bus->phy_mask = ~0;
> +   ret = mdiobus_register(bus);
> +   if (ret) {
> +   dev_err(priv->dev, "mdiobus register err(%d)\n", ret);
> +   return ret;
> +   }
> +
> +   priv->phydev = get_phy_device(bus, phy_addr, false);
> +   if (IS_ERR(priv->phydev)) {
> +   ret = PTR_ERR(priv->phydev);
> +   dev_err(priv->dev, "get_phy_device err(%d)\n", ret);
> +   priv->phydev = NULL;
> +   return -ENODEV;
> +   }
> +
> +   ret = phy_device_register(priv->phydev);
> +   if (ret) {
> +   mdiobus_unregister(bus);
> +   dev_err(priv->dev,
> +   "phy_device_register err(%d)\n", ret);
> +   }
> +   }
> +
> +   return ret;
> +}
> +
...

Thanks,
Ard.


Re: [PATCHv3 2/3] net: socionext: Add Synquacer NetSec driver

2017-12-21 Thread Ard Biesheuvel
Hi Jassi,

On 21 December 2017 at 12:11,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jaswinder.si...@linaro.org>
>
> This driver adds support for Socionext "netsec" IP Gigabit
> Ethernet + PHY IP used in the Synquacer SC2A11 SoC.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
> ---
>  drivers/net/ethernet/Kconfig|1 +
>  drivers/net/ethernet/Makefile   |1 +
>  drivers/net/ethernet/socionext/Kconfig  |   29 +
>  drivers/net/ethernet/socionext/Makefile |1 +
>  drivers/net/ethernet/socionext/netsec.c | 1849 
> +++
>  5 files changed, 1881 insertions(+)
>  create mode 100644 drivers/net/ethernet/socionext/Kconfig
>  create mode 100644 drivers/net/ethernet/socionext/Makefile
>  create mode 100644 drivers/net/ethernet/socionext/netsec.c
>
> diff --git a/drivers/net/ethernet/Kconfig b/drivers/net/ethernet/Kconfig
> index c604213..d50519e 100644
> --- a/drivers/net/ethernet/Kconfig
> +++ b/drivers/net/ethernet/Kconfig
> @@ -170,6 +170,7 @@ source "drivers/net/ethernet/sis/Kconfig"
>  source "drivers/net/ethernet/sfc/Kconfig"
>  source "drivers/net/ethernet/sgi/Kconfig"
>  source "drivers/net/ethernet/smsc/Kconfig"
> +source "drivers/net/ethernet/socionext/Kconfig"
>  source "drivers/net/ethernet/stmicro/Kconfig"
>  source "drivers/net/ethernet/sun/Kconfig"
>  source "drivers/net/ethernet/tehuti/Kconfig"
> diff --git a/drivers/net/ethernet/Makefile b/drivers/net/ethernet/Makefile
> index 39f62733..6cf5ade 100644
> --- a/drivers/net/ethernet/Makefile
> +++ b/drivers/net/ethernet/Makefile
> @@ -82,6 +82,7 @@ obj-$(CONFIG_SFC) += sfc/
>  obj-$(CONFIG_SFC_FALCON) += sfc/falcon/
>  obj-$(CONFIG_NET_VENDOR_SGI) += sgi/
>  obj-$(CONFIG_NET_VENDOR_SMSC) += smsc/
> +obj-$(CONFIG_NET_VENDOR_SOCIONEXT) += socionext/
>  obj-$(CONFIG_NET_VENDOR_STMICRO) += stmicro/
>  obj-$(CONFIG_NET_VENDOR_SUN) += sun/
>  obj-$(CONFIG_NET_VENDOR_TEHUTI) += tehuti/
> diff --git a/drivers/net/ethernet/socionext/Kconfig 
> b/drivers/net/ethernet/socionext/Kconfig
> new file mode 100644
> index 000..4601c2f
> --- /dev/null
> +++ b/drivers/net/ethernet/socionext/Kconfig
> @@ -0,0 +1,29 @@
> +#
> +# Socionext Network device configuration
> +#
> +
> +config NET_VENDOR_SOCIONEXT
> +   bool "Socionext devices"
> +   default y
> +   ---help---
> + If you have a network (Ethernet) card belonging to this class, say 
> Y.
> +
> + Note that the answer to this question doesn't directly affect the
> + questions about Socionext cards. If you say Y, you will be asked
> + for your specific card in the following questions.
> +
> +if NET_VENDOR_SOCIONEXT
> +
> +config SNI_NETSEC
> +   tristate "NETSEC Driver Support"
> +   depends on ARCH_SYNQUACER && OF
> +   select PHYLIB
> +   select MII
> +help
> + Enable to add support for the SocioNext NetSec Gigabit Ethernet
> + controller + PHY, as found on the Synquacer SC2A11 SoC
> +
> + To compile this driver as a module, choose M here: the module will 
> be
> + called netsec.  If unsure, say N.
> +
> +endif # NET_VENDOR_SOCIONEXT
> diff --git a/drivers/net/ethernet/socionext/Makefile 
> b/drivers/net/ethernet/socionext/Makefile
> new file mode 100644
> index 000..9505923
> --- /dev/null
> +++ b/drivers/net/ethernet/socionext/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_SNI_NETSEC) += netsec.o
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> new file mode 100644
> index 000..9a9b699
> --- /dev/null
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -0,0 +1,1849 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#define NETSEC_REG_SOFT_RST 0x104
> +#define NETSEC_REG_COM_INIT 0x120
> +
> +#define NETSEC_REG_TOP_STATUS  0x200
> +#define NETSEC_IRQ_RX  BIT(1)
> +#define NETSEC_IRQ_TX  BIT(0)
> +
> +#define NETSEC_REG_TOP_INTEN   0x204
> +#define NETSEC_REG_INTEN_SET   0x234
> +#define NETSEC_REG_INTEN_CLR   0x238
> +
> +#define NETSEC_REG_NRM_TX_STATUS   0x400
> +#define NETSEC_REG_NRM_TX_INTEN 0x404
>

Re: [net-next: PATCH 0/8] Armada 7k/8k PP2 ACPI support

2017-12-18 Thread Ard Biesheuvel
On 18 December 2017 at 10:17, Marcin Wojtas  wrote:
> Hi,
>
> This patchset introduces ACPI support in mvpp2 and mvmdio drivers.
> First three patches introduce fwnode helpers for obtaining PHY
> information from nodes and also MDIO fwnode API for registering
> the bus with its PHY/devices.
>
> Following patches update code of the mvmdio and mvpp2 drivers
> to support ACPI tables handling. The latter is done in 4 steps,
> as can be seen in the commits. For the details, please
> refer to the commit messages.
>
> Drivers operation was tested on top of the net-next branch
> with both DT and ACPI. Although for compatibility reasons
> with older platforms, the mvmdio driver keeps using
> of_ MDIO registering, new fwnode_ one proved to fully work
> with DT as well (tested on MacchiatoBin board).
>
> mvpp2/mvmdio driver can work with the ACPI representation, as exposed
> on a public branch:
> https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/commits/marvell-armada-wip
> It was compiled together with the most recent Tianocore EDK2 revision.
> Please refer to the firmware build instruction on MacchiatoBin board:
> http://wiki.macchiatobin.net/tiki-index.php?page=Build+from+source+-+UEFI+EDK+II
>
> Above support configures 1G to use its PHY normally. 10G can work now
> only with the link interrupt mode. Somehow reading of the
> string property in fwnode_mdiobus_child_is_phy works only with
> DT and cannot cope with 10G PHY nodes as in:
> https://pastebin.com/3JnYpU0A
>
> Above root cause will be further checked. In the meantime I will
> appreciate any comments or remarks for the kernel patches.
>

Hi Marcin,

I have added linux-acpi and Graeme to cc. I think it makes sense to
discuss the way you describe the device topology before looking at the
patches in more detail.

In particular, I would like to request feedback on the use of
[redundant] 'reg' properties and the use of _DSD + compatible to
describe PHYs. Usually, we try to avoid this, given that it is
essentially a ACPI encapsulated DT dialect that is difficult to
support in drivers unless they are based on DT to begin with. Also,
non-standard _DSD properties require a vendor prefix, it is not
freeform.

For reference, the ACPI description is here (starting at line 175)
https://github.com/MarvellEmbeddedProcessors/edk2-open-platform/blob/72d5ac23b20dd74d479daa5e40ba443264b31261/Platforms/Marvell/Armada/AcpiTables/Armada80x0McBin/Dsdt.asl

-- 
Ard.


Re: [PATCHv2 2/3] net: socionext: Add Synquacer NetSec driver

2017-12-12 Thread Ard Biesheuvel
Hi Jassi,


On 12 December 2017 at 17:15,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jaswinder.si...@linaro.org>
>
> This driver adds support for Socionext "netsec" IP Gigabit
> Ethernet + PHY IP used in the Synquacer SC2A11 SoC.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
> ---
>  drivers/net/ethernet/Kconfig|1 +
>  drivers/net/ethernet/Makefile   |1 +
>  drivers/net/ethernet/socionext/Kconfig  |   29 +
>  drivers/net/ethernet/socionext/Makefile |1 +
>  drivers/net/ethernet/socionext/netsec.c | 1826 
> +++
>  5 files changed, 1858 insertions(+)
>  create mode 100644 drivers/net/ethernet/socionext/Kconfig
>  create mode 100644 drivers/net/ethernet/socionext/Makefile
>  create mode 100644 drivers/net/ethernet/socionext/netsec.c
>
[...]
> diff --git a/drivers/net/ethernet/socionext/netsec.c 
> b/drivers/net/ethernet/socionext/netsec.c
> new file mode 100644
> index 000..4472303a
> --- /dev/null
> +++ b/drivers/net/ethernet/socionext/netsec.c
> @@ -0,0 +1,1826 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +
> +#define NETSEC_REG_SOFT_RST 0x104
> +#define NETSEC_REG_COM_INIT 0x120
> +
> +#define NETSEC_REG_TOP_STATUS  0x200
> +#define NETSEC_IRQ_RX  BIT(1)
> +#define NETSEC_IRQ_TX  BIT(0)
> +
> +#define NETSEC_REG_TOP_INTEN   0x204
> +#define NETSEC_REG_INTEN_SET   0x234
> +#define NETSEC_REG_INTEN_CLR   0x238
> +
> +#define NETSEC_REG_NRM_TX_STATUS   0x400
> +#define NETSEC_REG_NRM_TX_INTEN 0x404
> +#define NETSEC_REG_NRM_TX_INTEN_SET 0x428
> +#define NETSEC_REG_NRM_TX_INTEN_CLR 0x42c
> +#define NRM_TX_ST_NTOWNR   BIT(17)
> +#define NRM_TX_ST_TR_ERR   BIT(16)
> +#define NRM_TX_ST_TXDONE   BIT(15)
> +#define NRM_TX_ST_TMREXP   BIT(14)
> +
> +#define NETSEC_REG_NRM_RX_STATUS   0x440
> +#define NETSEC_REG_NRM_RX_INTEN 0x444
> +#define NETSEC_REG_NRM_RX_INTEN_SET 0x468
> +#define NETSEC_REG_NRM_RX_INTEN_CLR 0x46c
> +#define NRM_RX_ST_RC_ERR   BIT(16)
> +#define NRM_RX_ST_PKTCNT   BIT(15)
> +#define NRM_RX_ST_TMREXP   BIT(14)
> +
> +#define NETSEC_REG_PKT_CMD_BUF 0xd0
> +
> +#define NETSEC_REG_CLK_EN  0x100
> +
> +#define NETSEC_REG_PKT_CTRL 0x140
> +
> +#define NETSEC_REG_DMA_TMR_CTRL 0x20c
> +#define NETSEC_REG_F_TAIKI_MC_VER  0x22c
> +#define NETSEC_REG_F_TAIKI_VER 0x230
> +#define NETSEC_REG_DMA_HM_CTRL 0x214
> +#define NETSEC_REG_DMA_MH_CTRL 0x220
> +#define NETSEC_REG_ADDR_DIS_CORE   0x218
> +#define NETSEC_REG_DMAC_HM_CMD_BUF 0x210
> +#define NETSEC_REG_DMAC_MH_CMD_BUF 0x21c
> +
> +#define NETSEC_REG_NRM_TX_PKTCNT   0x410
> +
> +#define NETSEC_REG_NRM_TX_DONE_PKTCNT  0x414
> +#define NETSEC_REG_NRM_TX_DONE_TXINT_PKTCNT 0x418
> +
> +#define NETSEC_REG_NRM_TX_TMR  0x41c
> +
> +#define NETSEC_REG_NRM_RX_PKTCNT   0x454
> +#define NETSEC_REG_NRM_RX_RXINT_PKTCNT 0x458
> +#define NETSEC_REG_NRM_TX_TXINT_TMR 0x420
> +#define NETSEC_REG_NRM_RX_RXINT_TMR 0x460
> +
> +#define NETSEC_REG_NRM_RX_TMR  0x45c
> +
> +#define NETSEC_REG_NRM_TX_DESC_START_UP 0x434
> +#define NETSEC_REG_NRM_TX_DESC_START_LW 0x408
> +#define NETSEC_REG_NRM_RX_DESC_START_UP 0x474
> +#define NETSEC_REG_NRM_RX_DESC_START_LW 0x448
> +
> +#define NETSEC_REG_NRM_TX_CONFIG   0x430
> +#define NETSEC_REG_NRM_RX_CONFIG   0x470
> +
> +#define MAC_REG_STATUS 0x1024
> +#define MAC_REG_DATA   0x11c0
> +#define MAC_REG_CMD 0x11c4
> +#define MAC_REG_FLOW_TH 0x11cc
> +#define MAC_REG_INTF_SEL   0x11d4
> +#define MAC_REG_DESC_INIT  0x11fc
> +#define MAC_REG_DESC_SOFT_RST  0x1204
> +#define NETSEC_REG_MODE_TRANS_COMP_STATUS  0x500
> +
> +#define GMAC_REG_MCR   0x
> +#define GMAC_

Re: [PATCH net-next resubmit 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-12-04 Thread Ard Biesheuvel
On 4 December 2017 at 15:50, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Mon, 4 Dec 2017 15:46:55 +
>
>> On 4 December 2017 at 15:24, David Miller <da...@davemloft.net> wrote:
>>> From: Heiner Kallweit <hkallwe...@gmail.com>
>>> Date: Thu, 30 Nov 2017 23:55:15 +0100
>>>
>>>> After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
>>>> sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
>>>> some simplification" all relevant code pieces run in process context
>>>> anyway and I don't think we need the disabling of interrupts any longer.
>>>>
>>>> Interestingly enough, latter commit already removed the comment
>>>> explaining why interrupts need to be temporarily disabled.
>>>>
>>>> On my system phy interrupt mode works fine with this patch.
>>>> However I may miss something, especially in the context of shared phy
>>>> interrupts, therefore I'd appreciate if more people could test this.
>>>>
>>>> Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com>
>>>> Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>>
>>> Ok, applied.
>>>
>>> But if this causes regressions I'm reverting.
>>
>> Thanks. But please note that the code in question does seem to use the
>> interrupt API incorrectly, and tbh, I was expecting some more
>> discussion first. For reference, here's the commit log for the mostly
>> equivalent patch [0] I sent out almost at the same time:
>
> Yes, it seemed to me that when the code was converted to threaded IRQS
> this {enable,disable}_irq() stuff was not considered.
>
> Again, I read these patches over and I'm willing to own up to the
> changes for now.  And if they cause regressions or someone screams
> loudly enough we can revert and talk about it some more.

Excellent, thanks.


Re: [PATCH net-next resubmit 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-12-04 Thread Ard Biesheuvel
On 4 December 2017 at 15:24, David Miller <da...@davemloft.net> wrote:
> From: Heiner Kallweit <hkallwe...@gmail.com>
> Date: Thu, 30 Nov 2017 23:55:15 +0100
>
>> After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
>> sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
>> some simplification" all relevant code pieces run in process context
>> anyway and I don't think we need the disabling of interrupts any longer.
>>
>> Interestingly enough, latter commit already removed the comment
>> explaining why interrupts need to be temporarily disabled.
>>
>> On my system phy interrupt mode works fine with this patch.
>> However I may miss something, especially in the context of shared phy
>> interrupts, therefore I'd appreciate if more people could test this.
>>
>> Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com>
>> Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>
> Ok, applied.
>
> But if this causes regressions I'm reverting.

Thanks. But please note that the code in question does seem to use the
interrupt API incorrectly, and tbh, I was expecting some more
discussion first. For reference, here's the commit log for the mostly
equivalent patch [0] I sent out almost at the same time:

"""
The PHY interrupt handling code registers its interrupt as a oneshot
threaded interrupt, which guarantees that the interrupt will be masked
throughout both the primary and secondary handling stages.

However, the handling code still disables the interrupt by calling
disable_irq(), and re-enables it by calling enable_irq() after having
acked the interrupt in the PHY hardware.

This causes problems with hierarchical irqchip implementations built
on top of the GIC, because the core threaded interrupt code will only
EOI the interrupt if it is still masked after the secondary handler
completes. If this is not the case, the EOI is not emitted, and the
interrupt remains active, blocking further interrupts from the same
source. Disabling and enabling the interrupt will result in the secondary
handler completing with the interrupt unmasked, resulting in the above
behavior.

So remove the disable_irq/enable_irq, and rely on the fact that the
interrupt remains masked already.
"""

[0] https://marc.info/?l=linux-netdev&m=151016412011661&w=2


Re: [PATCH 3/3] MAINTAINERS: Add entry for Socionext ethernet driver

2017-12-01 Thread Ard Biesheuvel
On 30 November 2017 at 16:13,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jaswinder.si...@linaro.org>
>
> Add entry for the Socionext Netsec controller driver and DT bindings.
>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

(with Joe's comment addressed)

> ---
>  MAINTAINERS | 8 
>  1 file changed, 8 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index aa71ab52f..aed9d32 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -12617,6 +12617,14 @@ F: drivers/md/raid*
>  F: include/linux/raid/
>  F: include/uapi/linux/raid/
>
> +SOCIONEXT (SNI) NETSEC NETWORK DRIVER
> +M: Jassi Brar <jaswinder.si...@linaro.org>
> +L: netdev@vger.kernel.org
> +S: Supported
> +S: Maintained
> +F: drivers/net/ethernet/socionext/netsec.c
> +F: Documentation/devicetree/bindings/net/socionext-netsec.txt
> +
>  SONIC NETWORK DRIVER
>  M: Thomas Bogendoerfer <tsbog...@alpha.franken.de>
>  L: netdev@vger.kernel.org
> --
> 2.7.4
>


Re: [PATCH 1/3] dt-bindings: net: Add DT bindings for Socionext Netsec

2017-12-01 Thread Ard Biesheuvel
Hi Jassi,

On 30 November 2017 at 16:12,  <jassisinghb...@gmail.com> wrote:
> From: Jassi Brar <jassisinghb...@gmail.com>
>
> This patch adds documentation for Device-Tree bindings for the
> Socionext NetSec Controller driver.
>
> Signed-off-by: Jassi Brar <jaswinder.si...@linaro.org>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

No need to keep my signoff here: if you are posting the patch, your
signoff should come last. (Authorship is no factor here, only the path
taken by the patch from the author/copyright holder to the sender of
the email)

> ---
>  .../devicetree/bindings/net/socionext-netsec.txt   | 43 
> ++
>  1 file changed, 43 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/socionext-netsec.txt
>
> diff --git a/Documentation/devicetree/bindings/net/socionext-netsec.txt 
> b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> new file mode 100644
> index 000..4695969
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/socionext-netsec.txt
> @@ -0,0 +1,43 @@
> +* Socionext NetSec Ethernet Controller IP
> +
> +Required properties:
> +- compatible: Should be "socionext,synquacer-netsec"
> +- reg: Address and length of the control register area, followed by the
> +   address and length of the EEPROM holding the MAC address and
> +   microengine firmware
> +- interrupts: Should contain ethernet controller interrupt
> +- clocks: phandle to the PHY reference clock, and any other clocks to be
> +  switched by runtime_pm
> +- clock-names: Required only if more than a single clock is listed in 
> 'clocks'.
> +   The PHY reference clock must be named 'phy_refclk'
> +- phy-mode: See ethernet.txt file in the same directory
> +- phy-handle: phandle to select child phy
> +

We should add the following property here:

- dma-coherent: Boolean property, must only be present if memory
accesses performed by the device are cache coherent

(I only added support for this in our platform earlier this week)
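
For illustration only (the placement sketched here follows the usual convention for coherent DMA masters; the node is the example from the patch below, abbreviated):

```dts
eth0: netsec@522D {
	compatible = "socionext,synquacer-netsec";
	/* present only when the interconnect provides coherent DMA */
	dma-coherent;
	/* ... remaining properties as in the example below ... */
};
```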

> +Optional properties: (See ethernet.txt file in the same directory)
> +- local-mac-address
> +- mac-address
> +- max-speed
> +- max-frame-size
> +
> +Required properties for the child phy:
> +- reg: phy address
> +
> +Example:
> +   eth0: netsec@522D {
> +   compatible = "socionext,synquacer-netsec";
> +   reg = <0 0x522D 0x0 0x1>, <0 0x1000 0x0 0x1>;
> +   interrupts = ;
> +   clocks = <&clk_netsec>;
> +   phy-mode = "rgmii";
> +   max-speed = <1000>;
> +   max-frame-size = <9000>;
> +   phy-handle = <&ethphy0>;
> +
> +   #address-cells = <1>;
> +   #size-cells = <0>;
> +
> +   ethphy0: ethernet-phy@1 {
> +   compatible = "ethernet-phy-ieee802.3-c22";
> +   reg = <1>;
> +   };
> +   };
> --
> 2.7.4
>


Re: [PATCH 2/3] net: socionext: Add Synquacer NetSec driver

2017-11-30 Thread Ard Biesheuvel
On 30 November 2017 at 17:58, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Thu, 30 Nov 2017 17:48:44 +
>
>> On 30 November 2017 at 17:42, David Miller <da...@davemloft.net> wrote:
>>> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>> Date: Thu, 30 Nov 2017 17:37:27 +
>>>
>>>> Well, the whole point of using memremap() instead of ioremap() is that
>>>> the region has memory semantics, i.e., we read the MAC address and the
>>>> DMA engine microcode from it. If memremap() itself is flawed in this
>>>> regard, I agree we should fix it. But as I understand it, this is
>>>> really an implementation problem in memremap() [the fact that it falls
>>>> back to ioremap()] and not a problem in this driver.
>>>>
>>>> So what alternative would you propose in this case?
>>>>
>>>> BTW, this should be IOREMAP_WC not IOREMAP_WT, because the EEPROM on
>>>> the platform in question does not tolerate cached mappings (or rather,
>>>> shareable mappings). ioremap_wt() happens to result in device mappings
>>>> rather than writethrough cacheable mappings, but this is another
>>>> problem in itself. Once arm64 fixes ioremap_wt(), this code will no
>>>> longer work on the SynQuacer SoC.
>>>
>>> It doesn't "fall back", it directly uses ioremap_wt() for non-RAM
>>> mappings.
>>>
>>> If you look, most architectures do a "#define ioremap_wt ioremap"
>>
>> OK, but that still means the implementation of memremap() is flawed,
>> not its use in this driver.
>>
>> memremap() exists to allow non-DRAM regions to be mapped with memory
>> semantics, and if we currently implement that incorrectly, we should
>> fix it. But that still does not mean we should have __iomem
>> annotations and special accessors in this case, precisely because it
>> has memory semantics, and so it is a virtual pointer that may be
>> dereferenced normally.
>
> Memory semantics and how to access that memory are two different things.
>
> On sparc64 the cacheability etc. can be specified in the load and
> store instructions.
>
> That's why I have all of the ioremap*() routines return a physical
> address.  The value returned from all of the ioremap*() interfaces is
> completely opaque.  That's why the return type is __iomem.
>
> The iomem accessors on sparc64 use the appropriate access bits in the
> load and store instructions as needed.
>
> So in order for this to work properly, new sets of acessors need to be
> implemented to let me give you the access semantics you want yet at
> the same time allowing architectures to have the flexibility of
> still returning an opaque "pointer".
>
> There is no reason for me to make a virtual mapping on sparc64 because
> it is completely unnecessary overhead since I can give you whatever
> cacheability/IO/etc. attributes needed in the load and store
> instructions themselves.
>

But the whole point of memremap() is obtaining a virtual mapping that
does not require accessors, but can be used like ordinary memory. The
fact that Sparc64 can do nifty things using special accessors but
without a virtual mapping is interesting, but not really relevant
IMHO.

So I will work with Jassi to reimplement this using ioremap(), given
that it is not performance critical, and given the fact that your
criticism appears to be directed to memremap() in general, and not the
way it is being used here.

Thanks,
Ard.


Re: [PATCH 2/3] net: socionext: Add Synquacer NetSec driver

2017-11-30 Thread Ard Biesheuvel
On 30 November 2017 at 17:42, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Thu, 30 Nov 2017 17:37:27 +
>
>> Well, the whole point of using memremap() instead of ioremap() is that
>> the region has memory semantics, i.e., we read the MAC address and the
>> DMA engine microcode from it. If memremap() itself is flawed in this
>> regard, I agree we should fix it. But as I understand it, this is
>> really an implementation problem in memremap() [the fact that it falls
>> back to ioremap()] and not a problem in this driver.
>>
>> So what alternative would you propose in this case?
>>
>> BTW, this should be IOREMAP_WC not IOREMAP_WT, because the EEPROM on
>> the platform in question does not tolerate cached mappings (or rather,
>> shareable mappings). ioremap_wt() happens to result in device mappings
>> rather than writethrough cacheable mappings, but this is another
>> problem in itself. Once arm64 fixes ioremap_wt(), this code will no
>> longer work on the SynQuacer SoC.
>
> It doesn't "fall back", it directly uses ioremap_wt() for non-RAM
> mappings.
>
> If you look, most architectures do a "#define ioremap_wt ioremap"

OK, but that still means the implementation of memremap() is flawed,
not its use in this driver.

memremap() exists to allow non-DRAM regions to be mapped with memory
semantics, and if we currently implement that incorrectly, we should
fix it. But that still does not mean we should have __iomem
annotations and special accessors in this case, precisely because it
has memory semantics, and so it is a virtual pointer that may be
dereferenced normally.

So if memremap() is unsuitable, how should we map memory that is not DRAM?


Re: [PATCH 2/3] net: socionext: Add Synquacer NetSec driver

2017-11-30 Thread Ard Biesheuvel
(+ Dan, Will)

On 30 November 2017 at 17:29, David Miller  wrote:
> From: jassisinghb...@gmail.com
> Date: Thu, 30 Nov 2017 21:43:16 +0530
>
>> + priv->eeprom_base = devm_memremap(&pdev->dev, eeprom_res->start,
>> +   resource_size(eeprom_res),
>> +   MEMREMAP_WT);
>> + if (!priv->eeprom_base) {
>> + dev_err(&pdev->dev, "devm_memremap() failed for EEPROM\n");
>> + ret = -ENXIO;
>> + goto free_ndev;
>> + }
>
> dev_memremap() is implemented via memremap() which for MEMREMAP_WT is
> in turn implemented using ioremap_wt() which returns an "__iomem"
> pointer.
>
> The memremap() function talks about __iomem being about side effects.
> That's not really the full story.  The __iomem annotation is also
> about whether a special accessor is necessary to "dereference" the
> pointer.  On sparc64, for example, the ioremap_*() functions return a
> physical address not a virtual one.  So you cannot directly
> dereference pointers that are returned from the ioremap*() interfaces.
>
> You'll also need to mark priv->eeprom_base as "__iomem".
>
> devm_memremap() returns a straight "void *" without the __iomem
> annotation, and this is wrong when ioremap_*() is used.

Well, the whole point of using memremap() instead of ioremap() is that
the region has memory semantics, i.e., we read the MAC address and the
DMA engine microcode from it. If memremap() itself is flawed in this
regard, I agree we should fix it. But as I understand it, this is
really an implementation problem in memremap() [the fact that it falls
back to ioremap()] and not a problem in this driver.

So what alternative would you propose in this case?

BTW, this should be IOREMAP_WC not IOREMAP_WT, because the EEPROM on
the platform in question does not tolerate cached mappings (or rather,
shareable mappings). ioremap_wt() happens to result in device mappings
rather than writethrough cacheable mappings, but this is another
problem in itself. Once arm64 fixes ioremap_wt(), this code will no
longer work on the SynQuacer SoC.


Re: [PATCH RfC 1/2] net: phy: core: remove now uneeded disabling of interrupts

2017-11-16 Thread Ard Biesheuvel
On 15 November 2017 at 22:19, Heiner Kallweit <hkallwe...@gmail.com> wrote:
> On 15.11.2017 at 23:04, Florian Fainelli wrote:
>> On 11/12/2017 01:08 PM, Heiner Kallweit wrote:
>>> After commits c974bdbc3e "net: phy: Use threaded IRQ, to allow IRQ from
>>> sleeping devices" and 664fcf123a30 "net: phy: Threaded interrupts allow
>>> some simplification" all relevant code pieces run in process context
>>> anyway and I don't think we need the disabling of interrupts any longer.
>>>
>>> Interestingly enough, latter commit already removed the comment
>>> explaining why interrupts need to be temporarily disabled.
>>>
>>> On my system phy interrupt mode works fine with this patch.
>>> However I may miss something, especially in the context of shared phy
>>> interrupts, therefore I'd appreciate if more people could test this.
>>
>> I am not currently in a position to test this, but this looks very
>> similar, if not identical to what Ard submitted a few days earlier:
>>
>
> Thanks for the hint. Indeed it's exactly the same patch, so the one
> sent by me can be disregarded.
>

Well, it does appear your patch is more complete. Another difference
is that I actually need this change to fix an issue with a
hierarchical irqchip stacked on top of the GIC.


>> https://patchwork.kernel.org/patch/10048901/
>>
>> Since net-next is closed at the moment, that should allow us to give
>> this some good amount of testing.
>>
>> Thanks
>>
>>>
>>> Signed-off-by: Heiner Kallweit <hkallwe...@gmail.com>

For the record

Acked-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

Dear maintainers,

Please take whichever of these patches looks more correct to you.

Thanks,
Ard.

>>> ---
>>>  drivers/net/phy/phy.c | 26 ++
>>>  include/linux/phy.h   |  1 -
>>>  2 files changed, 2 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>>> index 2b1e67bc1..b3784c9a2 100644
>>> --- a/drivers/net/phy/phy.c
>>> +++ b/drivers/net/phy/phy.c
>>> @@ -629,9 +629,6 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
>>>  if (PHY_HALTED == phydev->state)
>>>  return IRQ_NONE;/* It can't be ours.  */
>>>
>>> -disable_irq_nosync(irq);
>>> -atomic_inc(&phydev->irq_disable);
>>> -
>>>  phy_change(phydev);
>>>
>>>  return IRQ_HANDLED;
>>> @@ -689,7 +686,6 @@ static int phy_disable_interrupts(struct phy_device 
>>> *phydev)
>>>   */
>>>  int phy_start_interrupts(struct phy_device *phydev)
>>>  {
>>> -atomic_set(&phydev->irq_disable, 0);
>>>  if (request_threaded_irq(phydev->irq, NULL, phy_interrupt,
>>>   IRQF_ONESHOT | IRQF_SHARED,
>>>   phydev_name(phydev), phydev) < 0) {
>>> @@ -716,13 +712,6 @@ int phy_stop_interrupts(struct phy_device *phydev)
>>>
>>>  free_irq(phydev->irq, phydev);
>>>
>>> -/* If work indeed has been cancelled, disable_irq() will have
>>> - * been left unbalanced from phy_interrupt() and enable_irq()
>>> - * has to be called so that other devices on the line work.
>>> - */
>>> -while (atomic_dec_return(&phydev->irq_disable) >= 0)
>>> -enable_irq(phydev->irq);
>>> -
>>>  return err;
>>>  }
>>>  EXPORT_SYMBOL(phy_stop_interrupts);
>>> @@ -736,7 +725,7 @@ void phy_change(struct phy_device *phydev)
>>>  if (phy_interrupt_is_valid(phydev)) {
>>>  if (phydev->drv->did_interrupt &&
>>>  !phydev->drv->did_interrupt(phydev))
>>> -goto ignore;
>>> +return;
>>>
>>>  if (phy_disable_interrupts(phydev))
>>>  goto phy_err;
>>> @@ -748,27 +737,16 @@ void phy_change(struct phy_device *phydev)
>>>  mutex_unlock(&phydev->lock);
>>>
>>>  if (phy_interrupt_is_valid(phydev)) {
>>> -atomic_dec(&phydev->irq_disable);
>>> -enable_irq(phydev->irq);
>>> -
>>>  /* Reenable interrupts */
>>>  if (PHY_HALTED != phydev->state &&
>>>  phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
>>> -goto irq_enable_err;
>>> +goto phy_err;
>>>  }
>>>
>>>  /* reschedule state queue work to run as soon as possible */
>>>  phy_trigger_machine(phydev, true);
>>>  return;
>>>
>>> -ignore:
>>> -atomic_dec(&phydev->irq_disable);
>>> -enable_irq(phydev->irq);
>>> -return;
>>> -
>>> -irq_enable_err:
>>> -disable_irq(phydev->irq);
>>> -atomic_inc(&phydev->irq_disable);
>>>  phy_err:
>>>  phy_error(phydev);
>>>  }
>>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>>> index dc82a07cb..8a87e441f 100644
>>> --- a/include/linux/phy.h
>>> +++ b/include/linux/phy.h
>>> @@ -468,7 +468,6 @@ struct phy_device {
>>>  /* Interrupt and Polling infrastructure */
>>>  struct work_struct phy_queue;
>>>  struct delayed_work state_queue;
>>> -atomic_t irq_disable;
>>>
>>>  struct mutex lock;
>>>
>>>
>>
>>
>


[RFC PATCH] phy: don't disable and re-enable interrupts in oneshot threaded handler

2017-11-08 Thread Ard Biesheuvel
The PHY interrupt handling code registers its interrupt as a oneshot
threaded interrupt, which guarantees that the interrupt will be masked
throughout both the primary and secondary handling stages.

However, the handling code still disables the interrupt by calling
disable_irq(), and re-enables it by calling enable_irq() after having
acked the interrupt in the PHY hardware.

This causes problems with hierarchical irqchip implementations built
on top of the GIC, because the core threaded interrupt code will only
EOI the interrupt if it is still masked after the secondary handler
completes. If this is not the case, the EOI is not emitted, and the
interrupt remains active, blocking further interrupts from the same
source. Disabling and enabling the interrupt will result in the secondary
handler completing with the interrupt unmasked, resulting in the above
behavior.

So remove the disable_irq/enable_irq, and rely on the fact that the
interrupt remains masked already.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 drivers/net/phy/phy.c | 20 
 1 file changed, 4 insertions(+), 16 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index d0626bf5c540..ce8bba0c1072 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -774,8 +774,8 @@ static void phy_error(struct phy_device *phydev)
  * @irq: interrupt line
  * @phy_dat: phy_device pointer
  *
- * Description: When a PHY interrupt occurs, the handler disables
- * interrupts, and uses phy_change to handle the interrupt.
+ * Description: When a PHY interrupt occurs, the handler invokes
+ *  phy_change to handle the interrupt.
  */
 static irqreturn_t phy_interrupt(int irq, void *phy_dat)
 {
@@ -784,9 +784,6 @@ static irqreturn_t phy_interrupt(int irq, void *phy_dat)
if (PHY_HALTED == phydev->state)
return IRQ_NONE;/* It can't be ours.  */
 
-   disable_irq_nosync(irq);
-   atomic_inc(&phydev->irq_disable);
-
phy_change(phydev);
 
return IRQ_HANDLED;
@@ -891,10 +888,10 @@ void phy_change(struct phy_device *phydev)
if (phy_interrupt_is_valid(phydev)) {
if (phydev->drv->did_interrupt &&
!phydev->drv->did_interrupt(phydev))
-   goto ignore;
+   return;
 
if (phy_disable_interrupts(phydev))
-   goto phy_err;
+   goto irq_enable_err;
}
 
mutex_lock(&phydev->lock);
@@ -903,9 +900,6 @@ void phy_change(struct phy_device *phydev)
mutex_unlock(&phydev->lock);
 
if (phy_interrupt_is_valid(phydev)) {
-   atomic_dec(&phydev->irq_disable);
-   enable_irq(phydev->irq);
-
/* Reenable interrupts */
if (PHY_HALTED != phydev->state &&
phy_config_interrupt(phydev, PHY_INTERRUPT_ENABLED))
@@ -916,15 +910,9 @@ void phy_change(struct phy_device *phydev)
phy_trigger_machine(phydev, true);
return;
 
-ignore:
-   atomic_dec(&phydev->irq_disable);
-   enable_irq(phydev->irq);
-   return;
-
 irq_enable_err:
disable_irq(phydev->irq);
atomic_inc(&phydev->irq_disable);
-phy_err:
phy_error(phydev);
 }
 
-- 
2.11.0



Re: [QUESTION] poor TX performance on new GbE driver

2017-10-22 Thread Ard Biesheuvel
On 22 October 2017 at 20:38, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Sun, 2017-10-22 at 20:14 +0100, Ard Biesheuvel wrote:
>> Hello all,
>>
>> I am working on upstreaming a network driver for a Socionext SoC, and
>> I am having some trouble figuring out why my TX performance is
>> horrible when booting a Debian Stretch rootfs, while booting a Ubuntu
>> 17.04 rootfs works absolutely fine. Note that this is using the exact
>> same kernel image, booted off the network.
>>
>> Under Ubuntu, I get the following iperf results from the box to my AMD
>> Seattle based devbox with a 1 Gbit switch in between. (The NIC in
>> question is also 1 Gbit)
>>
>>
>> $ sudo iperf -c dogfood.local -r
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> 
>> Client connecting to dogfood.local, TCP port 5001
>> TCP window size:  748 KByte (default)
>> 
>> [  5] local 192.168.1.112 port 51666 connected with 192.168.1.106 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  5]  0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
>> [  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33048
>> [  4]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
>>
>> Booting the *exact* same kernel into a Debian based rootfs results in
>> the following numbers
>> $ sudo iperf -c dogfood.local -r
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> 
>> Client connecting to dogfood.local, TCP port 5001
>> TCP window size: 85.0 KByte (default)
>> 
>> [  5] local 192.168.1.112 port 40132 connected with 192.168.1.106 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  5]  0.0-10.1 sec  4.12 MBytes  3.43 Mbits/sec
>> [  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33068
>> [  4]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>
>> The ifconfig stats look perfectly fine to me (TX errors 0  dropped 0
>> overruns 0  carrier 0  collisions 0). During the TX test, the CPUs are
>> almost completely idle. (This system has 24 cores, but not
>> particularly powerful ones.)
>>
>> This test is based on v4.14-rc4, but v4.13 gives the same results.
>>
>> Could anyone please shed a light on this? What tuning parameters
>> and/or stats should I be looking at? I am a seasoned kernel developer
>> but a newbie when it comes to networking, so hopefully I just need a
>> nudge to go looking in the right place.
>
>
> This description smells a problem with TX completions being deferred.
>
> TX interrupts being lost or deferred too much.
>

Right. Do you have any idea if there are any userland-accessible tunables
that may affect this? I'm still rather stumped that the issue only
appears when running the Debian Stretch rootfs ...

Thanks,
Ard.


Re: [QUESTION] poor TX performance on new GbE driver

2017-10-22 Thread Ard Biesheuvel
On 22 October 2017 at 20:27, Florian Fainelli <f.faine...@gmail.com> wrote:
> On 10/22/2017 12:14 PM, Ard Biesheuvel wrote:
>> Hello all,
>>
>> I am working on upstreaming a network driver for a Socionext SoC, and
>> I am having some trouble figuring out why my TX performance is
>> horrible when booting a Debian Stretch rootfs, while booting a Ubuntu
>> 17.04 rootfs works absolutely fine. Note that this is using the exact
>> same kernel image, booted off the network.
>>
>> Under Ubuntu, I get the following iperf results from the box to my AMD
>> Seattle based devbox with a 1 Gbit switch in between. (The NIC in
>> question is also 1 Gbit)
>>
>>
>> $ sudo iperf -c dogfood.local -r
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> 
>> Client connecting to dogfood.local, TCP port 5001
>> TCP window size:  748 KByte (default)
>> 
>> [  5] local 192.168.1.112 port 51666 connected with 192.168.1.106 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  5]  0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
>> [  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33048
>> [  4]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
>>
>> Booting the *exact* same kernel into a Debian based rootfs results in
>> the following numbers
>> $ sudo iperf -c dogfood.local -r
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> 
>> Client connecting to dogfood.local, TCP port 5001
>> TCP window size: 85.0 KByte (default)
>> 
>> [  5] local 192.168.1.112 port 40132 connected with 192.168.1.106 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  5]  0.0-10.1 sec  4.12 MBytes  3.43 Mbits/sec
>> [  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33068
>> [  4]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec
>>
>> The ifconfig stats look perfectly fine to me (TX errors 0  dropped 0
>> overruns 0  carrier 0  collisions 0). During the TX test, the CPUs are
>> almost completely idle. (This system has 24 cores, but not
>> particularly powerful ones.)
>>
>> This test is based on v4.14-rc4, but v4.13 gives the same results.
>>
>> Could anyone please shed a light on this? What tuning parameters
>> and/or stats should I be looking at? I am a seasoned kernel developer
>> but a newbie when it comes to networking, so hopefully I just need a
>> nudge to go looking in the right place.

Hi Florian,

Thanks for your response.

>
> You could look at /proc/net/snmp and see if you get higher level TCP/IP
> drops. The second run appears to be fine, is it possible that somehow
> your TX ring starts in a invalid state of some sort, TX activity cleans
> it up during the first run and the second run operates under normal
> condition?

The 'second' run is the opposite direction, due to the '-r' parameter.
I only included it for reference, since it works fine in both cases.

> At first glance I can't think of any sensible difference
> between the two rootfs that would explain what happens but it might be
> worth comparing /proc/sys/net between the two and spot possible TCP
> parameters differences.
>

Right, I can check that.

> How is UDP doing in your test cases? Once your image is loaded
> everything should be in the page cache already so there should not be
> any heavy NFS activity while you run your tests right?

I don't use NFS, only the kernel image is booted off the network, but
the rootfses are actually on SSDs

> You might also
> want to try to take a perf capture of the first run and see where and
> how packets may be dropped: perf record -g -e skb:kfree_skb iperf -c ..
> may help here.

OK, that looks interesting - let me try that.

Thanks a lot!


[QUESTION] poor TX performance on new GbE driver

2017-10-22 Thread Ard Biesheuvel
Hello all,

I am working on upstreaming a network driver for a Socionext SoC, and
I am having some trouble figuring out why my TX performance is
horrible when booting a Debian Stretch rootfs, while booting a Ubuntu
17.04 rootfs works absolutely fine. Note that this is using the exact
same kernel image, booted off the network.

Under Ubuntu, I get the following iperf results from the box to my AMD
Seattle based devbox with a 1 Gbit switch in between. (The NIC in
question is also 1 Gbit)


$ sudo iperf -c dogfood.local -r

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)


Client connecting to dogfood.local, TCP port 5001
TCP window size:  748 KByte (default)

[  5] local 192.168.1.112 port 51666 connected with 192.168.1.106 port 5001
[ ID] Interval   Transfer Bandwidth
[  5]  0.0-10.0 sec  1.07 GBytes   920 Mbits/sec
[  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33048
[  4]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec

Booting the *exact* same kernel into a Debian based rootfs results in
the following numbers
$ sudo iperf -c dogfood.local -r

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)


Client connecting to dogfood.local, TCP port 5001
TCP window size: 85.0 KByte (default)

[  5] local 192.168.1.112 port 40132 connected with 192.168.1.106 port 5001
[ ID] Interval   Transfer Bandwidth
[  5]  0.0-10.1 sec  4.12 MBytes  3.43 Mbits/sec
[  4] local 192.168.1.112 port 5001 connected with 192.168.1.106 port 33068
[  4]  0.0-10.0 sec  1.10 GBytes   939 Mbits/sec

The ifconfig stats look perfectly fine to me (TX errors 0  dropped 0
overruns 0  carrier 0  collisions 0). During the TX test, the CPUs are
almost completely idle. (This system has 24 cores, but not
particularly powerful ones.)

This test is based on v4.14-rc4, but v4.13 gives the same results.

Could anyone please shed a light on this? What tuning parameters
and/or stats should I be looking at? I am a seasoned kernel developer
but a newbie when it comes to networking, so hopefully I just need a
nudge to go looking in the right place.

Thanks,
Ard.


Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-19 Thread Ard Biesheuvel
On 19 October 2017 at 11:57, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Wed, 18 Oct 2017 16:45:15 +0100
>
>> Even though calling dql_completed() with a count that exceeds the
>> queued count is a serious error, it still does not justify bringing
>> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
>> instead.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>
> This is bogus.
>
> Unless you are going to do all of the work necessary to fix
> the out-of-bounds condition here, you cannot safely continue
> into the rest of this function.
>
> Things are going to explode in many places if you don't, at
> a minimum, fix the 'count' value to be in range.
>
> But like others I don't like this, the driver needs to be fixed
> urgently if this condition triggers.
>
> Sorry I'm not applying this.

Fair enough.


Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Ard Biesheuvel
On 18 October 2017 at 19:45, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2017-10-18 at 18:57 +0100, Ard Biesheuvel wrote:
>> On 18 October 2017 at 17:29, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> > On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
>> >> Even though calling dql_completed() with a count that exceeds the
>> >> queued count is a serious error, it still does not justify bringing
>> >> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
>> >> instead.
>> >>
>> >> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>> >> ---
>> >>  lib/dynamic_queue_limits.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >>
>> >> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
>> >> index f346715e2255..24ce495d78f3 100644
>> >> --- a/lib/dynamic_queue_limits.c
>> >> +++ b/lib/dynamic_queue_limits.c
>> >> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>> >>   num_queued = ACCESS_ONCE(dql->num_queued);
>> >>
>> >>   /* Can't complete more than what's in queue */
>> >> - BUG_ON(count > num_queued - dql->num_completed);
>> >> + WARN_ON(count > num_queued - dql->num_completed);
>> >>
>> >>   completed = dql->num_completed + count;
>> >>   limit = dql->limit;
>> >
>> > So instead fixing the faulty driver, you'll have strange lockups, and
>> > force your users to reboot anyway, after annoying periods where
>> > "Internet does not work"
>> >
>> > These kinds of errors should be found when testing a new device driver
>> > or new kernel.
>> >
>> > Have you found the root cause ?
>> >
>>
>> Not yet, and I don't intend to send out any patches for this
>> particular hardware until this is fixed.
>>
>> But that still doesn't mean you should crash hard. As Linus puts it,
>> it is better to 'limp on' if you can (unless we're likely to corrupt
>> any non-volatile data, e.g., files on disk etc)
>
> How many BUG() do you plan to change to WARN() exactly ?
>

How is that relevant?

> If you want to comply to Linus wish, just compile your kernel
> with appropriate option.
>
> CONFIG_BUG=n
>

If it is essential that we crash hard in this location, without *any*
opportunity whatsoever to shutdown cleanly or perform any diagnosis on
the system while it is still up, then please disregard this patch.


Re: [PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Ard Biesheuvel
On 18 October 2017 at 17:29, Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Wed, 2017-10-18 at 16:45 +0100, Ard Biesheuvel wrote:
>> Even though calling dql_completed() with a count that exceeds the
>> queued count is a serious error, it still does not justify bringing
>> down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
>> instead.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>> ---
>>  lib/dynamic_queue_limits.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
>> index f346715e2255..24ce495d78f3 100644
>> --- a/lib/dynamic_queue_limits.c
>> +++ b/lib/dynamic_queue_limits.c
>> @@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
>>   num_queued = ACCESS_ONCE(dql->num_queued);
>>
>>   /* Can't complete more than what's in queue */
>> - BUG_ON(count > num_queued - dql->num_completed);
>> + WARN_ON(count > num_queued - dql->num_completed);
>>
>>   completed = dql->num_completed + count;
>>   limit = dql->limit;
>
> So instead fixing the faulty driver, you'll have strange lockups, and
> force your users to reboot anyway, after annoying periods where
> "Internet does not work"
>
> These kinds of errors should be found when testing a new device driver
> or new kernel.
>
> Have you found the root cause ?
>

Not yet, and I don't intend to send out any patches for this
particular hardware until this is fixed.

But that still doesn't mean you should crash hard. As Linus puts it,
it is better to 'limp on' if you can (unless we're likely to corrupt
any non-volatile data, e.g., files on disk etc)


[PATCH] lib/dynamic_queue_limits.c: relax BUG_ON to WARN_ON in dql_complete()

2017-10-18 Thread Ard Biesheuvel
Even though calling dql_completed() with a count that exceeds the
queued count is a serious error, it still does not justify bringing
down the entire kernel with a BUG_ON(). So relax it to a WARN_ON()
instead.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 lib/dynamic_queue_limits.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index f346715e2255..24ce495d78f3 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -23,7 +23,7 @@ void dql_completed(struct dql *dql, unsigned int count)
num_queued = ACCESS_ONCE(dql->num_queued);
 
/* Can't complete more than what's in queue */
-   BUG_ON(count > num_queued - dql->num_completed);
+   WARN_ON(count > num_queued - dql->num_completed);
 
completed = dql->num_completed + count;
limit = dql->limit;
-- 
2.11.0
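For reference, the semantics of the check being relaxed can be modeled in
userspace. The sketch below is a hypothetical Python model, not kernel code:
dql's counters are unsigned and free-running, so the C expression
'num_queued - dql->num_completed' relies on modular arithmetic and stays
valid even after the counters wrap.

```python
MASK32 = 0xFFFFFFFF  # model C 'unsigned int' arithmetic

def completion_is_sane(num_queued, num_completed, count):
    # Models 'count > num_queued - dql->num_completed' with unsigned
    # 32-bit subtraction: a driver may only complete what is queued.
    outstanding = (num_queued - num_completed) & MASK32
    return count <= outstanding

# Normal case: 10 queued, 4 already completed, 6 outstanding.
assert completion_is_sane(10, 4, 6)
# Faulty driver: completing 7 exceeds what is in the queue.
assert not completion_is_sane(10, 4, 7)
# Counters may wrap past 2^32; modular subtraction still works.
assert completion_is_sane(5, MASK32 - 2, 8)
```

A driver that trips this check has already corrupted its accounting; the
question in the thread is only whether that deserves a WARN or a full stop.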



Re: [PATCH 20/24] bpf: Restrict kernel image access functions when the kernel is locked down

2017-04-06 Thread Ard Biesheuvel
On 6 April 2017 at 13:29, Alexei Starovoitov
 wrote:
> On Wed, Apr 05, 2017 at 09:17:25PM +0100, David Howells wrote:
>> From: Chun-Yi Lee 
>>
>> There are some bpf functions can be used to read kernel memory:
>> bpf_probe_read, bpf_probe_write_user and bpf_trace_printk.  These allow
>> private keys in kernel memory (e.g. the hibernation image signing key) to
>> be read by an eBPF program.  Prohibit those functions when the kernel is
>> locked down.
>>
>> Signed-off-by: Chun-Yi Lee 
>> Signed-off-by: David Howells 
>> cc: netdev@vger.kernel.org
>> ---
>>
>>  kernel/trace/bpf_trace.c |   11 +++
>>  1 file changed, 11 insertions(+)
>>
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index cee9802cf3e0..7fde851f207b 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -65,6 +65,11 @@ BPF_CALL_3(bpf_probe_read, void *, dst, u32, size, const 
>> void *, unsafe_ptr)
>>  {
>>   int ret;
>>
>> + if (kernel_is_locked_down()) {
>> + memset(dst, 0, size);
>> + return -EPERM;
>> + }
>
> this will obviously break the program. How about disabling loading tracing
> programs during the lockdown completely?
>
> Also is there a description of what this lockdown trying to accomplish?
> The cover letter is scarce in details.
>

This is a very good point, and this is actually feedback that was
given (by Alan Cox, iirc) the last time this series was circulated.

This series is a mixed bag of patches that all look like they improve
'security' in one way or the other. But what is lacking is a coherent
view on the threat model, and to what extent all these patches reduce
the vulnerability to such threats. Without that, these patches do very
little beyond giving a false sense of security, imo.


Re: [PATCH v3 1/2] mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher

2017-02-07 Thread Ard Biesheuvel
On 8 February 2017 at 07:00, Johannes Berg  wrote:
> This looks strange to me:
>
>> +static int aes_s2v(struct crypto_shash *tfm,
>>  size_t num_elem, const u8 *addr[], size_t len[],
>> u8 *v)
>>  {
>> - u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE];
>> + u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE] = {};
>> + SHASH_DESC_ON_STACK(desc, tfm);
>
> desc declared
>
>>
>> + crypto_shash_digest(desc, tmp, AES_BLOCK_SIZE, d);
>
> used here
>

Each digest() call combines a init()/update()/final() sequence

>> + crypto_shash_init(desc);
>
> but initialized now?
>

... for the 6th time, or so. The final vector may require two
update()s, so we cannot use digest() here. But we can use finup() for
the last one, which combines update() and final().

Hence,

init()/finup()

or

init()/update()/finup()

depending on the length of the last vector.
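The calling pattern described above is the generic streaming-hash idiom:
digest() is shorthand for init() + update() + final(), and finup() fuses the
last update() with final(). A rough userspace analogue, using Python's
hashlib with SHA-256 standing in for cmac(aes) (which the stdlib does not
provide), purely to illustrate the call sequence:

```python
import hashlib

data = [b"vector-1", b"vector-2-tail"]

# One-shot: digest() == init() + update() + final()
oneshot = hashlib.sha256(data[0] + data[1]).digest()

# Streaming over multiple vectors:
h = hashlib.sha256()    # crypto_shash_init()
h.update(data[0])       # crypto_shash_update()
h.update(data[1])       # crypto_shash_finup() would fuse this update()...
streaming = h.digest()  # ...with this final()

assert streaming == oneshot
```

Whichever split is used, the result is the same as hashing the
concatenation of all vectors in order.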


[PATCH v3 2/2] mac80211: aes-cmac: switch to shash CMAC driver

2017-02-06 Thread Ard Biesheuvel
Instead of open coding the CMAC algorithm in the mac80211 driver using
byte wide xors and calls into the crypto layer for each block of data,
instantiate a cmac(aes) synchronous hash and pass all the data into it
directly. This not only simplifies the code, it also allows the use
of more efficient and more secure implementations, especially on
platforms where SIMD ciphers have a considerable setup cost.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_cmac.c | 126 
 net/mac80211/aes_cmac.h |  11 +-
 net/mac80211/key.h  |   2 +-
 3 files changed, 32 insertions(+), 107 deletions(-)

diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index d0bd5fff5f0a..2fb65588490c 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -22,126 +22,50 @@
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
 #define AAD_LEN 20
 
+static const u8 zero[CMAC_TLEN_256];
 
-void gf_mulx(u8 *pad)
-{
-   int i, carry;
-
-   carry = pad[0] & 0x80;
-   for (i = 0; i < AES_BLOCK_SIZE - 1; i++)
-   pad[i] = (pad[i] << 1) | (pad[i + 1] >> 7);
-   pad[AES_BLOCK_SIZE - 1] <<= 1;
-   if (carry)
-   pad[AES_BLOCK_SIZE - 1] ^= 0x87;
-}
-
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len)
-{
-   u8 cbc[AES_BLOCK_SIZE], pad[AES_BLOCK_SIZE];
-   const u8 *pos, *end;
-   size_t i, e, left, total_len;
-
-   memset(cbc, 0, AES_BLOCK_SIZE);
-
-   total_len = 0;
-   for (e = 0; e < num_elem; e++)
-   total_len += len[e];
-   left = total_len;
-
-   e = 0;
-   pos = addr[0];
-   end = pos + len[0];
-
-   while (left >= AES_BLOCK_SIZE) {
-   for (i = 0; i < AES_BLOCK_SIZE; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   if (left > AES_BLOCK_SIZE)
-   crypto_cipher_encrypt_one(tfm, cbc, cbc);
-   left -= AES_BLOCK_SIZE;
-   }
-
-   memset(pad, 0, AES_BLOCK_SIZE);
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   gf_mulx(pad);
-
-   if (left || total_len == 0) {
-   for (i = 0; i < left; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   cbc[left] ^= 0x80;
-   gf_mulx(pad);
-   }
-
-   for (i = 0; i < AES_BLOCK_SIZE; i++)
-   pad[i] ^= cbc[i];
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   memcpy(mac, pad, mac_len);
-}
-
-
-void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN];
+   SHASH_DESC_ON_STACK(desc, tfm);
+   u8 out[AES_BLOCK_SIZE];
 
-   memset(zero, 0, CMAC_TLEN);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN);
+   crypto_shash_finup(desc, zero, CMAC_TLEN, out);
+
+   memcpy(mic, out, CMAC_TLEN);
 }
 
-void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac_256(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN_256];
+   SHASH_DESC_ON_STACK(desc, tfm);
 
-   memset(zero, 0, CMAC_TLEN_256);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN_256;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN_256;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN_256);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN_256);
+   crypto_shash_finup(desc, zero, CMAC_TLEN_256, mic);
 }
 
-struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
-  size_t key_len)
+s

[PATCH v3 0/2] mac80211: use crypto shash for AES cmac

2017-02-06 Thread Ard Biesheuvel
This is something I spotted while working on AES in various modes for
ARM and arm64.

The mac80211 aes_cmac code reimplements the CMAC algorithm based on the
core AES cipher, which is rather restrictive in how platforms can satisfy
the dependency on this algorithm. For instance, SIMD implementations may
have a considerable setup time, which cannot be amortized over the entire
input when calling into the crypto API one block at a time. Also, it prevents
the use of more secure fixed time implementations, since not all AES drivers
expose the cipher interface.

So switch aes_cmac to use a cmac(aes) shash. Before updating the aes_cmac code
in patch #2, the FILS AEAD code is moved to using a cmac(aes) shash supplied by
the crypto API so that we can remove the open coded version entirely in the
second patch.

v3: - use more idiomatic SHASH_DESC_ON_STACK to allocate the shash descriptors
- replace compound literal zero vectors with explicitly defined ones
- drop a redundant memcpy() in #2

Ard Biesheuvel (2):
  mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher
  mac80211: aes-cmac: switch to shash CMAC driver

 net/mac80211/Kconfig |   1 +
 net/mac80211/aes_cmac.c  | 126 
 net/mac80211/aes_cmac.h  |  15 +--
 net/mac80211/fils_aead.c |  74 +---
 net/mac80211/key.h   |   2 +-
 5 files changed, 66 insertions(+), 152 deletions(-)

-- 
2.7.4



[PATCH v3 1/2] mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher

2017-02-06 Thread Ard Biesheuvel
Switch the FILS AEAD code to use a cmac(aes) shash instantiated by the
crypto API rather than reusing the open coded implementation in
aes_cmac_vector(). This makes the code more understandable, and allows
platforms to implement cmac(aes) in a more secure (*) and efficient way
than is typically possible when using the AES cipher directly.

So replace the crypto_cipher by a crypto_shash, and update the aes_s2v()
routine to call the shash interface directly.

* In particular, the generic table based AES implementation is sensitive
  to known-plaintext timing attacks on the key, to which AES based MAC
  algorithms are especially vulnerable, given that their plaintext is not
  usually secret. Time invariant alternatives are available (e.g., based
  on SIMD algorithms), but may incur a setup cost that is prohibitive when
  operating on a single block at a time, which is why they don't usually
  expose the cipher API.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/Kconfig |  1 +
 net/mac80211/aes_cmac.h  |  4 --
 net/mac80211/fils_aead.c | 74 +---
 3 files changed, 34 insertions(+), 45 deletions(-)

diff --git a/net/mac80211/Kconfig b/net/mac80211/Kconfig
index 3891cbd2adea..76e30f4797fb 100644
--- a/net/mac80211/Kconfig
+++ b/net/mac80211/Kconfig
@@ -6,6 +6,7 @@ config MAC80211
select CRYPTO_AES
select CRYPTO_CCM
select CRYPTO_GCM
+   select CRYPTO_CMAC
select CRC32
---help---
  This option enables the hardware independent IEEE 802.11
diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index c827e1d5de8b..3702041f44fd 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,10 +11,6 @@
 
 #include 
 
-void gf_mulx(u8 *pad);
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len);
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index 5c3af5eb4052..3cfb1e2ab7ac 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -9,66 +9,58 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include "ieee80211_i.h"
 #include "aes_cmac.h"
 #include "fils_aead.h"
 
-static int aes_s2v(struct crypto_cipher *tfm,
+static void gf_mulx(u8 *pad)
+{
+   u64 a = get_unaligned_be64(pad);
+   u64 b = get_unaligned_be64(pad + 8);
+
+   put_unaligned_be64((a << 1) | (b >> 63), pad);
+   put_unaligned_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0), pad + 8);
+}
+
+static int aes_s2v(struct crypto_shash *tfm,
   size_t num_elem, const u8 *addr[], size_t len[], u8 *v)
 {
-   u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE];
+   u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE] = {};
+   SHASH_DESC_ON_STACK(desc, tfm);
size_t i;
-   const u8 *data[2];
-   size_t data_len[2], data_elems;
+
+   desc->tfm = tfm;
 
/* D = AES-CMAC(K, <zero>) */
-   memset(tmp, 0, AES_BLOCK_SIZE);
-   data[0] = tmp;
-   data_len[0] = AES_BLOCK_SIZE;
-   aes_cmac_vector(tfm, 1, data, data_len, d, AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, tmp, AES_BLOCK_SIZE, d);
 
for (i = 0; i < num_elem - 1; i++) {
/* D = dbl(D) xor AES_CMAC(K, Si) */
gf_mulx(d); /* dbl */
-   aes_cmac_vector(tfm, 1, &addr[i], &len[i], tmp,
-   AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, addr[i], len[i], tmp);
crypto_xor(d, tmp, AES_BLOCK_SIZE);
}
 
+   crypto_shash_init(desc);
+
if (len[i] >= AES_BLOCK_SIZE) {
/* len(Sn) >= 128 */
-   size_t j;
-   const u8 *pos;
-
/* T = Sn xorend D */
-
-   /* Use a temporary buffer to perform xorend on Sn (addr[i]) to
-* avoid modifying the const input argument.
-*/
-   data[0] = addr[i];
-   data_len[0] = len[i] - AES_BLOCK_SIZE;
-   pos = addr[i] + data_len[0];
-   for (j = 0; j < AES_BLOCK_SIZE; j++)
-   tmp[j] = pos[j] ^ d[j];
-   data[1] = tmp;
-   data_len[1] = AES_BLOCK_SIZE;
-   data_elems = 2;
+   crypto_shash_update(desc, addr[i], len[i] - AES_BLOCK_SIZE);
+   crypto_xor(d, addr[i] + len[i] - AES_BLOCK_SIZE,
+  AES_BLOCK_SIZE);
} else {
/* len(Sn) < 128 */
/* T = dbl(D) xor pad(Sn) */
gf_mulx(d); /* dbl */
-   memset(tmp, 0, AES_BLOCK_SIZE);
-   memcpy(tmp, addr[i], len[i]);
-  
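The gf_mulx() rewrite above turns the byte-wise doubling in GF(2^128) into
two big-endian 64-bit operations. The two forms can be cross-checked in
userspace; the Python sketch below is a model of both versions (not the
kernel code itself) and verifies that they agree on random blocks:

```python
import os

def gf_mulx_bytewise(pad: bytearray) -> None:
    # The old open-coded mac80211 version: shift the 16-byte block left
    # by one bit, then conditionally xor in 0x87 (the reduction of the
    # polynomial x^128 + x^7 + x^2 + x + 1).
    carry = pad[0] & 0x80
    for i in range(15):
        pad[i] = ((pad[i] << 1) | (pad[i + 1] >> 7)) & 0xFF
    pad[15] = (pad[15] << 1) & 0xFF
    if carry:
        pad[15] ^= 0x87

def gf_mulx_u64(pad: bytearray) -> None:
    # The patched version: two big-endian 64-bit halves, as read/written
    # by get_unaligned_be64()/put_unaligned_be64().
    a = int.from_bytes(pad[:8], "big")
    b = int.from_bytes(pad[8:], "big")
    pad[:8] = (((a << 1) | (b >> 63)) & (2**64 - 1)).to_bytes(8, "big")
    pad[8:] = (((b << 1) & (2**64 - 1)) ^
               (0x87 if a >> 63 else 0)).to_bytes(8, "big")

for _ in range(1000):
    block = bytearray(os.urandom(16))
    x, y = bytearray(block), bytearray(block)
    gf_mulx_bytewise(x)
    gf_mulx_u64(y)
    assert x == y
```

In C the `& (2**64 - 1)` masks are unnecessary, since u64 shifts drop the
carry bit naturally; the model makes them explicit.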

Re: [PATCH v2 0/2] mac80211: use crypto shash for AES cmac

2017-02-06 Thread Ard Biesheuvel
On 6 February 2017 at 10:01, Malinen, Jouni <jo...@qca.qualcomm.com> wrote:
> On Sun, Feb 05, 2017 at 03:23:26PM +0000, Ard Biesheuvel wrote:
>> NOTE: Jouni has been so kind to test patch #2, and confirmed that it is 
>> working.
>>   I have not tested patch #1 myself, mainly because the test methodology
>>   requires downloading Ubuntu installer images, and I am currently on a
>>   metered 3G connection (and will be for another couple of weeks)
>
> These v2 patches pass the test cases as well.
>

Thanks!

> (And you don't really need Ubuntu to run the hwsim test cases; any
> reasonably recent distribution that is capable of running kvm should
> work.)
>

Well, now that you have done my testing for me, I am not sure I will
get around to trying the VM.

Thanks,
Ard.


Re: [PATCH v2 1/2] mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher

2017-02-06 Thread Ard Biesheuvel
On 6 February 2017 at 08:47, Johannes Berg  wrote:
>
>>  {
>>   u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE];
>> + struct shash_desc *desc;
>> + u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)]
>> CRYPTO_MINALIGN_ATTR;

I realised we have a more idiomatic SHASH_DESC_ON_STACK for this.

>>   size_t i;
>> - const u8 *data[2];
>> - size_t data_len[2], data_elems;
>> +
>> + desc = (struct shash_desc *)buf;
>> + desc->tfm = tfm;
>>
>> + crypto_shash_digest(desc, (u8[AES_BLOCK_SIZE]){},
>> AES_BLOCK_SIZE, d);
>
> That's an interesting expression in there. Can we name it into a real
> variable? :)
>

Sure, if you prefer.

> I'm also slightly worried about stack usage now - do we know none of
> this goes into an sg list eventually?
>

Shashes do not usually use scatterlists: the shash API does not use
them, but uses u8[] arrays and lengths everywhere, and shashes are
explicitly synchronous, which means they are unsuitable for being
exposed on top of a high latency peripheral that uses DMA.


[PATCH v2 2/2] mac80211: aes-cmac: switch to shash CMAC driver

2017-02-05 Thread Ard Biesheuvel
Instead of open coding the CMAC algorithm in the mac80211 driver using
byte wide xors and calls into the crypto layer for each block of data,
instantiate a cmac(aes) synchronous hash and pass all the data into it
directly. This not only simplifies the code, it also allows the use
of more efficient and more secure implementations, especially on
platforms where SIMD ciphers have considerable setup time.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_cmac.c | 130 +---
 net/mac80211/aes_cmac.h |  11 +-
 net/mac80211/key.h  |   2 +-
 3 files changed, 35 insertions(+), 108 deletions(-)

diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index d0bd5fff5f0a..0d4d2af52a56 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -22,126 +22,52 @@
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
 #define AAD_LEN 20
 
-
-void gf_mulx(u8 *pad)
-{
-   int i, carry;
-
-   carry = pad[0] & 0x80;
-   for (i = 0; i < AES_BLOCK_SIZE - 1; i++)
-   pad[i] = (pad[i] << 1) | (pad[i + 1] >> 7);
-   pad[AES_BLOCK_SIZE - 1] <<= 1;
-   if (carry)
-   pad[AES_BLOCK_SIZE - 1] ^= 0x87;
-}
-
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len)
-{
-   u8 cbc[AES_BLOCK_SIZE], pad[AES_BLOCK_SIZE];
-   const u8 *pos, *end;
-   size_t i, e, left, total_len;
-
-   memset(cbc, 0, AES_BLOCK_SIZE);
-
-   total_len = 0;
-   for (e = 0; e < num_elem; e++)
-   total_len += len[e];
-   left = total_len;
-
-   e = 0;
-   pos = addr[0];
-   end = pos + len[0];
-
-   while (left >= AES_BLOCK_SIZE) {
-   for (i = 0; i < AES_BLOCK_SIZE; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   if (left > AES_BLOCK_SIZE)
-   crypto_cipher_encrypt_one(tfm, cbc, cbc);
-   left -= AES_BLOCK_SIZE;
-   }
-
-   memset(pad, 0, AES_BLOCK_SIZE);
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   gf_mulx(pad);
-
-   if (left || total_len == 0) {
-   for (i = 0; i < left; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   cbc[left] ^= 0x80;
-   gf_mulx(pad);
-   }
-
-   for (i = 0; i < AES_BLOCK_SIZE; i++)
-   pad[i] ^= cbc[i];
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   memcpy(mac, pad, mac_len);
-}
-
-
-void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
+   u8 out[crypto_shash_digestsize(tfm)];
 
-   memset(zero, 0, CMAC_TLEN);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN;
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN);
+   crypto_shash_finup(desc, (u8[CMAC_TLEN]){}, CMAC_TLEN, out);
+
+   memcpy(mic, out, CMAC_TLEN);
 }
 
-void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac_256(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN_256];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
 
-   memset(zero, 0, CMAC_TLEN_256);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN_256;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN_256;
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN_256);
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN_256);
+   crypt

[PATCH v2 0/2] mac80211: use crypto shash for AES cmac

2017-02-05 Thread Ard Biesheuvel
This is something I spotted while working on AES in various modes for
ARM and arm64.

The mac80211 aes_cmac code reimplements the CMAC algorithm based on the
core AES cipher, which is rather restrictive in how platforms can satisfy
the dependency on this algorithm. For instance, SIMD implementations may
have a considerable setup time, which cannot be amortized over the entire
input when calling into the crypto API one block at a time. Also, it prevents
the use of more secure fixed time implementations, since not all AES drivers
expose the cipher interface.

So switch aes_cmac to use a cmac(aes) shash. Before updating the aes_cmac code
in patch #2, the FILS AEAD code is moved to using a cmac(aes) shash supplied by
the crypto API so that we can remove the open coded version entirely in the
second patch.

NOTE: Jouni has been so kind to test patch #2, and confirmed that it is working.
  I have not tested patch #1 myself, mainly because the test methodology
  requires downloading Ubuntu installer images, and I am currently on a
  metered 3G connection (and will be for another couple of weeks)

Ard Biesheuvel (2):
  mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher
  mac80211: aes-cmac: switch to shash CMAC driver

 net/mac80211/Kconfig |   1 +
 net/mac80211/aes_cmac.c  | 130 +---
 net/mac80211/aes_cmac.h  |  15 +--
 net/mac80211/fils_aead.c |  74 +--
 net/mac80211/key.h   |   2 +-
 5 files changed, 70 insertions(+), 152 deletions(-)

-- 
2.7.4



[PATCH v2 1/2] mac80211: fils_aead: Use crypto api CMAC shash rather than bare cipher

2017-02-05 Thread Ard Biesheuvel
Switch the FILS AEAD code to use a cmac(aes) shash instantiated by the
crypto API rather than reusing the open coded implementation in
aes_cmac_vector(). This makes the code more understandable, and allows
platforms to implement cmac(aes) in a more secure (*) and efficient way
than is typically possible when using the AES cipher directly.

So replace the crypto_cipher by a crypto_shash, and update the aes_s2v()
routine to call the shash interface directly.

* In particular, the generic table based AES implementation is sensitive
  to known-plaintext timing attacks on the key, to which AES based MAC
  algorithms are especially vulnerable, given that their plaintext is not
  usually secret. Time invariant alternatives are available (e.g., based
  on SIMD algorithms), but may incur a setup cost that is prohibitive when
  operating on a single block at a time, which is why they don't usually
  expose the cipher API.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/Kconfig |  1 +
 net/mac80211/aes_cmac.h  |  4 --
 net/mac80211/fils_aead.c | 74 +---
 3 files changed, 35 insertions(+), 44 deletions(-)

diff --git a/net/mac80211/Kconfig b/net/mac80211/Kconfig
index 3891cbd2adea..76e30f4797fb 100644
--- a/net/mac80211/Kconfig
+++ b/net/mac80211/Kconfig
@@ -6,6 +6,7 @@ config MAC80211
select CRYPTO_AES
select CRYPTO_CCM
select CRYPTO_GCM
+   select CRYPTO_CMAC
select CRC32
---help---
  This option enables the hardware independent IEEE 802.11
diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index c827e1d5de8b..3702041f44fd 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,10 +11,6 @@
 
 #include 
 
-void gf_mulx(u8 *pad);
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len);
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index ecfdd97758a3..a294a57e856d 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -9,66 +9,60 @@
 
 #include 
 #include 
+#include 
 #include 
 
 #include "ieee80211_i.h"
 #include "aes_cmac.h"
 #include "fils_aead.h"
 
-static int aes_s2v(struct crypto_cipher *tfm,
+static void gf_mulx(u8 *pad)
+{
+   u64 a = get_unaligned_be64(pad);
+   u64 b = get_unaligned_be64(pad + 8);
+
+   put_unaligned_be64((a << 1) | (b >> 63), pad);
+   put_unaligned_be64((b << 1) ^ ((a >> 63) ? 0x87 : 0), pad + 8);
+}
+
+static int aes_s2v(struct crypto_shash *tfm,
   size_t num_elem, const u8 *addr[], size_t len[], u8 *v)
 {
u8 d[AES_BLOCK_SIZE], tmp[AES_BLOCK_SIZE];
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
size_t i;
-   const u8 *data[2];
-   size_t data_len[2], data_elems;
+
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
/* D = AES-CMAC(K, <zero>) */
-   memset(tmp, 0, AES_BLOCK_SIZE);
-   data[0] = tmp;
-   data_len[0] = AES_BLOCK_SIZE;
-   aes_cmac_vector(tfm, 1, data, data_len, d, AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, (u8[AES_BLOCK_SIZE]){}, AES_BLOCK_SIZE, d);
 
for (i = 0; i < num_elem - 1; i++) {
/* D = dbl(D) xor AES_CMAC(K, Si) */
gf_mulx(d); /* dbl */
-   aes_cmac_vector(tfm, 1, &addr[i], &len[i], tmp,
-   AES_BLOCK_SIZE);
+   crypto_shash_digest(desc, addr[i], len[i], tmp);
crypto_xor(d, tmp, AES_BLOCK_SIZE);
}
 
+   crypto_shash_init(desc);
+
if (len[i] >= AES_BLOCK_SIZE) {
/* len(Sn) >= 128 */
-   size_t j;
-   const u8 *pos;
-
/* T = Sn xorend D */
-
-   /* Use a temporary buffer to perform xorend on Sn (addr[i]) to
-* avoid modifying the const input argument.
-*/
-   data[0] = addr[i];
-   data_len[0] = len[i] - AES_BLOCK_SIZE;
-   pos = addr[i] + data_len[0];
-   for (j = 0; j < AES_BLOCK_SIZE; j++)
-   tmp[j] = pos[j] ^ d[j];
-   data[1] = tmp;
-   data_len[1] = AES_BLOCK_SIZE;
-   data_elems = 2;
+   crypto_shash_update(desc, addr[i], len[i] - AES_BLOCK_SIZE);
+   crypto_xor(d, addr[i] + len[i] - AES_BLOCK_SIZE,
+  AES_BLOCK_SIZE);
} else {
/* len(Sn) < 128 */
/* T = dbl(D) xor pad(Sn) */
gf_mulx(d); /* dbl */
-   
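The aes_s2v() routine above implements the S2V construction from RFC 5297.
Stripped of the kernel crypto API, the algorithm is short; the Python
sketch below models it with the PRF left abstract (blake2s stands in for
AES-CMAC purely for illustration, so the outputs are not RFC 5297
test-vector compatible):

```python
import hashlib

BLOCK = 16

def dbl(d: int) -> int:
    # Doubling in GF(2^128), i.e. gf_mulx() on an integer.
    d <<= 1
    if d >> 128:
        d = (d & ((1 << 128) - 1)) ^ 0x87
    return d

def prf(msg: bytes) -> int:
    # Stand-in PRF; the kernel uses cmac(aes) keyed with K1.
    return int.from_bytes(hashlib.blake2s(msg, digest_size=BLOCK).digest(),
                          "big")

def s2v(elems) -> bytes:
    # D = PRF(<zero>)
    d = prf(b"\x00" * BLOCK)
    # D = dbl(D) xor PRF(Si) for all but the last vector
    for s in elems[:-1]:
        d = dbl(d) ^ prf(s)
    sn = elems[-1]
    if len(sn) >= BLOCK:
        # T = Sn xorend D: xor D into the last 16 bytes of Sn
        tail = int.from_bytes(sn[-BLOCK:], "big") ^ d
        t = sn[:-BLOCK] + tail.to_bytes(BLOCK, "big")
    else:
        # T = dbl(D) xor pad(Sn): append 0x80 then zero-pad to a block
        pad = sn + b"\x80" + b"\x00" * (BLOCK - 1 - len(sn))
        t = (dbl(d) ^ int.from_bytes(pad, "big")).to_bytes(BLOCK, "big")
    return prf(t).to_bytes(BLOCK, "big")

v_long = s2v([b"associated data", b"a final vector of at least 16 bytes"])
v_short = s2v([b"associated data", b"short"])
assert len(v_long) == len(v_short) == BLOCK
```

The two branches at the end correspond exactly to the len(Sn) >= 128 and
len(Sn) < 128 cases in the patch; the "xorend" case is what the new code
expresses with crypto_shash_update() plus crypto_xor().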

Re: [RFC PATCH 0/2] mac80211: use crypto shash for AES cmac

2017-02-04 Thread Ard Biesheuvel
On 4 February 2017 at 14:39, Malinen, Jouni <jo...@qca.qualcomm.com> wrote:
> On Sat, Feb 04, 2017 at 02:24:36PM +0000, Ard Biesheuvel wrote:
>> There is another issue I spotted: the skcipher you allocate may be of
>> the async variant, which may return from skcipher_encrypt() with
>> -EINPROGRESS as soon as it has queued the request. Since you don't
>> deal with that result, you should allocate a sync cipher explicitly:
>
>> diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
>> -   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, 0);
>> +   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, CRYPTO_ALG_ASYNC);
>
>> -   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, 0);
>> +   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, CRYPTO_ALG_ASYNC);
>
> Thanks! Can you send this as a full contribution or do you want me to
> do that?

Please go ahead if you don't mind doing it

> I did run this through all the FILS test cases with
> mac80211_hwsim.
>

Well, even async ciphers can act synchronously: the SIMD based async
ciphers will only enqueue the request for deferred processing when
called in interrupt context (on most architectures) but if you happen
to run on a platform that has a h/w accelerator for ctr(aes), you are
quite likely to see failures here.

>> Thanks for the instructions and thanks for testing. If I manage to run
this locally, I will follow up with an alternative patch #1 that
>> switches FILS to use cmac(aes) as well (which looks reasonably
>> feasible now that I understand the code)
>
> If you have any issues in getting the hwsim test setup working, please
> let me know. I'm trying to make it easy for anyone to run those tests in
> hopes of improving quality of Linux WLAN contributions and what gets
> applied into cfg80211 or mac80211 (or hostap.git for that matter). In
> particular the latest step-by-step guide I added for the VM version (*)
> was hoping to show how that can be done in 15 minutes from scratch..
>
>
> (*) http://w1.fi/cgit/hostap/plain/tests/hwsim/vm/example-vm-setup.txt
>

I will take a look on Monday


Re: [RFC PATCH 0/2] mac80211: use crypto shash for AES cmac

2017-02-04 Thread Ard Biesheuvel
On 4 February 2017 at 11:35, Malinen, Jouni <jo...@qca.qualcomm.com> wrote:
> On Fri, Feb 03, 2017 at 09:55:36PM +0000, Ard Biesheuvel wrote:
>> OK, that looks like something I could figure out how to use. But are
>> you saying the CMAC code is never called in practice?
>
> It will get called if there is a frame that were to need to use BIP.
> There are some APs that do send broadcast addressed Deauthentication and
> Disassociation frames when terminating the network and those would get
> here. In theory, someone could come up with a new Action frame use case
> as well with a group addressed Robust Action frame, but no such thing is
> defined today. Anyway, this will be needed for certification purposes
> and to block DoS with broadcast addressed Deauthentication and
> Disassociation frames even if there is not really a common use case
> using BIP frames today.
>
>> I did spot something peculiar when looking at the code: if I am
>> reading the following sequence correctly (from
>> fils_encrypt_assoc_req())
>
>> addr[0] = mgmt->sa;
> ...
>> addr[4] = capab;
>> len[4] = encr - capab;
>>
>> crypt_len = skb->data + skb->len - encr;
>> skb_put(skb, AES_BLOCK_SIZE);
>> return aes_siv_encrypt(assoc_data->fils_kek, assoc_data->fils_kek_len,
>>   encr, crypt_len, 1, addr, len, encr);
>>
>> the addr[]/len[] arrays are populated with 5 (addr, len) pairs, but
>> only one is actually passed into aes_siv_encrypt()? This is actually
>> the main reason I stopped looking into whether I could convert it to
>> CMAC, because I couldn't figure it out.
>
> Argh.. Thanks for finding this!
>
> That is indeed supposed to have num_elems == 5. This "works" since the
> other end of the exchange is actually in user space (hostapd) and a
> similar error is there..
>

Interesting.

> This comes from the development history of FILS support.. The draft
> standard changed this AES-SIV AAD design from P802.11ai/D7.0 to D8.0
> from a single AAD vector with concatenated fields to separate vectors.
> Clearly I added the vectors here in addr[] and len[] arrays, but forgot
> to update the num_elems parameter in this direction (STA->AP); the other
> direction for Association Response frame is actually using correct
> number of AAD vectors.
>
> I'll fix this in hostapd and mac80211.
>

There is another issue I spotted: the skcipher you allocate may be of
the async variant, which may return from skcipher_encrypt() with
-EINPROGRESS as soon as it has queued the request. Since you don't
deal with that result, you should allocate a sync cipher explicitly:

diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index ecfdd97758a3..a31be5262283 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -124,7 +124,7 @@ static int aes_siv_encrypt(const u8 *key, size_t key_len,

/* CTR */

-   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, 0);
+   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm2)) {
kfree(tmp);
return PTR_ERR(tfm2);
@@ -183,7 +183,7 @@ static int aes_siv_decrypt(const u8 *key, size_t key_len,

/* CTR */

-   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, 0);
+   tfm2 = crypto_alloc_skcipher("ctr(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm2))
return PTR_ERR(tfm2);
/* K2 for CTR */

> I did not actually remember that the AP side was still in hostapd for
> FILS AEAD in case of mac80211, but with that in mind, the answer to your
> earlier question about testing these becomes much simpler.. All you need
> to do is this with hostap.git hwsim tests:
>
> hostap/tests/hwsim/vm$ ./vm-run.sh ap_cipher_bip fils_sk_erp
> Starting test run in a virtual machine
> ./run-all.sh: passing the following args to run-tests.py: ap_cipher_bip 
> fils_sk_erp
> START ap_cipher_bip 1/2
> PASS ap_cipher_bip 0.703048 2017-02-04 11:21:35.341174
> START fils_sk_erp 2/2
> PASS fils_sk_erp 0.667927 2017-02-04 11:21:36.029205
> passed all 2 test case(s)
> ALL-PASSED
>
>
> ap_cipher_bip uses wlantest to verify the BIP (AES-128-CMAC) protection
> in the frames and fils_sk_erp goes through FILS AEAD exchange with one
> side of the operation in mac80211 and the other one in hostapd. This
> should be able to discover if something is broken in the AES related
> changes in crypto.
>
> And this particular example run here was with your two patches applied,
> to the proposed net/mac80211/aes_cmac.c change seems to work fine.

Thanks for the instructions and thanks for testing. If I manage to run
this locally, I will follow up with an alternative patch #1 that
switches FILS to use cmac(aes) as well (which looks reasonably
feasible now that I understand the code)

Regards,
Ard.


Re: [RFC PATCH 0/2] mac80211: use crypto shash for AES cmac

2017-02-03 Thread Ard Biesheuvel
On 3 February 2017 at 21:47, Malinen, Jouni <jo...@qca.qualcomm.com> wrote:
> On Fri, Feb 03, 2017 at 07:25:53PM +0000, Ard Biesheuvel wrote:
>> The mac80211 aes_cmac code reimplements the CMAC algorithm based on the
>> core AES cipher, which is rather restrictive in how platforms can satisfy
>> the dependency on this algorithm. For instance, SIMD implementations may
>> have a considerable setup time, which cannot be amortized over the entire
>> input when calling into the crypto API one block at a time. Also, it prevents
>> the use of more secure fixed time implementations, since not all AES drivers
>> expose the cipher interface.
>>
>> So switch aes_cmac to use a cmac(aes) shash. This requires a preparatory
>> patch so that we can remove the open coded implementation, which it shares
>> with the fils aead driver. That driver could receive the same treatment, in
>> which case we could replace patch #1 with one that carries it over first.
>>
>> Note that this is an RFC. I have no idea how I would go about testing this
>> code, but I am on a mission to remove as many dependencies on the generic
>> AES cipher as I can.
>
> Neither the BIP nor FILS cases have any real speed requirements taken
> into account how rarely they end up being used in practice (there is
> really no use case for BIP today and FILS is used only once per
> association). That said, there should be no issues with moving these to
> a more generic mechanism assuming one is available now (I don't think
> that was the case when I was working on BIP and I was too lazy to figure
> out how to convert it or the newer FILS implementation)..
>
> mac80211_hwsim show allow some of the testing to be done with wlantest
> confirming the results in user space (*). I think that would cover all
> of BIP (net/mac80211/aes_cmac.c), but not FILS.

OK, that looks like something I could figure out how to use. But are
you saying the CMAC code is never called in practice?

> For FILS, we do not
> currently have a convenient mechanism for running two different
> instances of kernel or even just mac80211 in the setup, so that would
> likely need testing with real WLAN hardware. I don't currently have a
> good setup for testing this (was using Backports-based solution in the
> past instead of full kernel build and Backports is a bit behind the
> current state..), but I guess I'll need to build something functional
> for this eventually.. Once that's in working condition on two devices,
> it would be straightforward to run a test (snapshot of hostap.git build
> to enable FILS functionality and go through one FILS authentication
> round)..
>
> Another alternative would be to extend wlantest to decrypt/validate FILS
> AEAD use case based on keys exposed from hostapd or wpa_supplicant.
> There has not been sufficient use case for that so far and I have not
> bothered working on it yet.
>
>
> By the way, FILS AEAD uses SIV mode and I'm not sure it is supported in
> the current crypto code, so that would be one additional piece to take
> care of when considering net/mac80211/fils_aead.c conversion.
>

I did spot something peculiar when looking at the code: if I am
reading the following sequence correctly (from
fils_encrypt_assoc_req())

addr[0] = mgmt->sa;
len[0] = ETH_ALEN;
/* The AP's BSSID */
addr[1] = mgmt->da;
len[1] = ETH_ALEN;
/* The STA's nonce */
addr[2] = assoc_data->fils_nonces;
len[2] = FILS_NONCE_LEN;
/* The AP's nonce */
addr[3] = &assoc_data->fils_nonces[FILS_NONCE_LEN];
len[3] = FILS_NONCE_LEN;
/* The (Re)Association Request frame from the Capability Information
* field to the FILS Session element (both inclusive).
*/
addr[4] = capab;
len[4] = encr - capab;

crypt_len = skb->data + skb->len - encr;
skb_put(skb, AES_BLOCK_SIZE);
return aes_siv_encrypt(assoc_data->fils_kek, assoc_data->fils_kek_len,
  encr, crypt_len, 1, addr, len, encr);

the addr[]/len[] arrays are populated with 5 (addr, len) pairs, but
only one is actually passed into aes_siv_encrypt()? This is actually
the main reason I stopped looking into whether I could convert it to
CMAC, because I couldn't figure it out.


[RFC PATCH 2/2] mac80211: aes-cmac: switch to shash CMAC driver

2017-02-03 Thread Ard Biesheuvel
Instead of open coding the CMAC algorithm in the mac80211 driver using
byte wide xors and calls into the crypto layer for each block of data,
instantiate a cmac(aes) synchronous hash and pass all the data into it
directly. This does not only simplify the code, it also allows the use
of more efficient and more secure implementations, especially on
platforms where SIMD ciphers have considerable setup time.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/Kconfig|   1 +
 net/mac80211/aes_cmac.c | 137 +---
 net/mac80211/aes_cmac.h |  11 +-
 net/mac80211/key.h  |   2 +-
 4 files changed, 42 insertions(+), 109 deletions(-)

diff --git a/net/mac80211/Kconfig b/net/mac80211/Kconfig
index 3891cbd2adea..76e30f4797fb 100644
--- a/net/mac80211/Kconfig
+++ b/net/mac80211/Kconfig
@@ -6,6 +6,7 @@ config MAC80211
select CRYPTO_AES
select CRYPTO_CCM
select CRYPTO_GCM
+   select CRYPTO_CMAC
select CRC32
---help---
  This option enables the hardware independent IEEE 802.11
diff --git a/net/mac80211/aes_cmac.c b/net/mac80211/aes_cmac.c
index d0bd5fff5f0a..fc004720e77c 100644
--- a/net/mac80211/aes_cmac.c
+++ b/net/mac80211/aes_cmac.c
@@ -22,126 +22,57 @@
 #define CMAC_TLEN_256 16 /* CMAC TLen = 128 bits (16 octets) */
 #define AAD_LEN 20
 
+static const u8 zero[CMAC_TLEN_256];
 
-void gf_mulx(u8 *pad)
+void ieee80211_aes_cmac(struct crypto_shash *tfm, const u8 *aad,
+   const u8 *data, size_t data_len, u8 *mic)
 {
-   int i, carry;
-
-   carry = pad[0] & 0x80;
-   for (i = 0; i < AES_BLOCK_SIZE - 1; i++)
-   pad[i] = (pad[i] << 1) | (pad[i + 1] >> 7);
-   pad[AES_BLOCK_SIZE - 1] <<= 1;
-   if (carry)
-   pad[AES_BLOCK_SIZE - 1] ^= 0x87;
-}
+   struct shash_desc *desc;
+   u8 buf[sizeof(*desc) + crypto_shash_descsize(tfm)] CRYPTO_MINALIGN_ATTR;
+   u8 out[crypto_shash_digestsize(tfm)];
 
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len)
-{
-   u8 cbc[AES_BLOCK_SIZE], pad[AES_BLOCK_SIZE];
-   const u8 *pos, *end;
-   size_t i, e, left, total_len;
-
-   memset(cbc, 0, AES_BLOCK_SIZE);
-
-   total_len = 0;
-   for (e = 0; e < num_elem; e++)
-   total_len += len[e];
-   left = total_len;
-
-   e = 0;
-   pos = addr[0];
-   end = pos + len[0];
-
-   while (left >= AES_BLOCK_SIZE) {
-   for (i = 0; i < AES_BLOCK_SIZE; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   if (left > AES_BLOCK_SIZE)
-   crypto_cipher_encrypt_one(tfm, cbc, cbc);
-   left -= AES_BLOCK_SIZE;
-   }
-
-   memset(pad, 0, AES_BLOCK_SIZE);
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   gf_mulx(pad);
-
-   if (left || total_len == 0) {
-   for (i = 0; i < left; i++) {
-   cbc[i] ^= *pos++;
-   if (pos >= end) {
-   e++;
-   pos = addr[e];
-   end = pos + len[e];
-   }
-   }
-   cbc[left] ^= 0x80;
-   gf_mulx(pad);
-   }
-
-   for (i = 0; i < AES_BLOCK_SIZE; i++)
-   pad[i] ^= cbc[i];
-   crypto_cipher_encrypt_one(tfm, pad, pad);
-   memcpy(mac, pad, mac_len);
-}
+   desc = (struct shash_desc *)buf;
+   desc->tfm = tfm;
 
+   crypto_shash_init(desc);
+   crypto_shash_update(desc, aad, AAD_LEN);
+   crypto_shash_update(desc, data, data_len - CMAC_TLEN);
+   crypto_shash_finup(desc, zero, CMAC_TLEN, out);
 
-void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
-   const u8 *data, size_t data_len, u8 *mic)
-{
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN];
-
-   memset(zero, 0, CMAC_TLEN);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;
-   len[1] = data_len - CMAC_TLEN;
-   addr[2] = zero;
-   len[2] = CMAC_TLEN;
-
-   aes_cmac_vector(tfm, 3, addr, len, mic, CMAC_TLEN);
+   memcpy(mic, out, CMAC_TLEN);
 }
 
-void ieee80211_aes_cmac_256(struct crypto_cipher *tfm, const u8 *aad,
+void ieee80211_aes_cmac_256(struct crypto_shash *tfm, const u8 *aad,
const u8 *data, size_t data_len, u8 *mic)
 {
-   const u8 *addr[3];
-   size_t len[3];
-   u8 zero[CMAC_TLEN_256];
-
-   memset(zero, 0, CMAC_TLEN_256);
-   addr[0] = aad;
-   len[0] = AAD_LEN;
-   addr[1] = data;

[RFC PATCH 0/2] mac80211: use crypto shash for AES cmac

2017-02-03 Thread Ard Biesheuvel
This is something I spotted while working on AES in various modes for
ARM and arm64.

The mac80211 aes_cmac code reimplements the CMAC algorithm based on the
core AES cipher, which is rather restrictive in how platforms can satisfy
the dependency on this algorithm. For instance, SIMD implementations may
have a considerable setup time, which cannot be amortized over the entire
input when calling into the crypto API one block at a time. Also, it prevents
the use of more secure fixed time implementations, since not all AES drivers
expose the cipher interface.

So switch aes_cmac to use a cmac(aes) shash. This requires a preparatory
patch so that we can remove the open coded implementation, which it shares
with the fils aead driver. That driver could receive the same treatment, in
which case we could replace patch #1 with one that carries it over first.

Note that this is an RFC. I have no idea how I would go about testing this
code, but I am on a mission to remove as many dependencies on the generic
AES cipher as I can.

Ard Biesheuvel (2):
  mac80211: fils_aead: clone shared CMAC functions into private version
  mac80211: aes-cmac: switch to shash CMAC driver

 net/mac80211/Kconfig |   1 +
 net/mac80211/aes_cmac.c  | 137 +---
 net/mac80211/aes_cmac.h  |  15 +--
 net/mac80211/fils_aead.c |  68 ++
 net/mac80211/key.h   |   2 +-
 5 files changed, 110 insertions(+), 113 deletions(-)

-- 
2.7.4



[RFC PATCH 1/2] mac80211: fils_aead: clone shared CMAC functions into private version

2017-02-03 Thread Ard Biesheuvel
Before reworking the AES CMAC mac80211 code, clone the routines that it
shares with the FILS AEAD driver into its own source file, and remove the
external declaration from aes_cmac.h. This will allow us to carry over one
user at a time from the open coded CMAC code to the crypto API.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_cmac.h  |  4 --
 net/mac80211/fils_aead.c | 68 
 2 files changed, 68 insertions(+), 4 deletions(-)

diff --git a/net/mac80211/aes_cmac.h b/net/mac80211/aes_cmac.h
index c827e1d5de8b..3702041f44fd 100644
--- a/net/mac80211/aes_cmac.h
+++ b/net/mac80211/aes_cmac.h
@@ -11,10 +11,6 @@
 
 #include 
 
-void gf_mulx(u8 *pad);
-void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
-const u8 *addr[], const size_t *len, u8 *mac,
-size_t mac_len);
 struct crypto_cipher *ieee80211_aes_cmac_key_setup(const u8 key[],
   size_t key_len);
 void ieee80211_aes_cmac(struct crypto_cipher *tfm, const u8 *aad,
diff --git a/net/mac80211/fils_aead.c b/net/mac80211/fils_aead.c
index ecfdd97758a3..ec493e68957c 100644
--- a/net/mac80211/fils_aead.c
+++ b/net/mac80211/fils_aead.c
@@ -15,6 +15,74 @@
 #include "aes_cmac.h"
 #include "fils_aead.h"
 
+static void gf_mulx(u8 *pad)
+{
+   int i, carry;
+
+   carry = pad[0] & 0x80;
+   for (i = 0; i < AES_BLOCK_SIZE - 1; i++)
+   pad[i] = (pad[i] << 1) | (pad[i + 1] >> 7);
+   pad[AES_BLOCK_SIZE - 1] <<= 1;
+   if (carry)
+   pad[AES_BLOCK_SIZE - 1] ^= 0x87;
+}
+
+static void aes_cmac_vector(struct crypto_cipher *tfm, size_t num_elem,
+   const u8 *addr[], const size_t *len, u8 *mac,
+   size_t mac_len)
+{
+   u8 cbc[AES_BLOCK_SIZE], pad[AES_BLOCK_SIZE];
+   const u8 *pos, *end;
+   size_t i, e, left, total_len;
+
+   memset(cbc, 0, AES_BLOCK_SIZE);
+
+   total_len = 0;
+   for (e = 0; e < num_elem; e++)
+   total_len += len[e];
+   left = total_len;
+
+   e = 0;
+   pos = addr[0];
+   end = pos + len[0];
+
+   while (left >= AES_BLOCK_SIZE) {
+   for (i = 0; i < AES_BLOCK_SIZE; i++) {
+   cbc[i] ^= *pos++;
+   if (pos >= end) {
+   e++;
+   pos = addr[e];
+   end = pos + len[e];
+   }
+   }
+   if (left > AES_BLOCK_SIZE)
+   crypto_cipher_encrypt_one(tfm, cbc, cbc);
+   left -= AES_BLOCK_SIZE;
+   }
+
+   memset(pad, 0, AES_BLOCK_SIZE);
+   crypto_cipher_encrypt_one(tfm, pad, pad);
+   gf_mulx(pad);
+
+   if (left || total_len == 0) {
+   for (i = 0; i < left; i++) {
+   cbc[i] ^= *pos++;
+   if (pos >= end) {
+   e++;
+   pos = addr[e];
+   end = pos + len[e];
+   }
+   }
+   cbc[left] ^= 0x80;
+   gf_mulx(pad);
+   }
+
+   for (i = 0; i < AES_BLOCK_SIZE; i++)
+   pad[i] ^= cbc[i];
+   crypto_cipher_encrypt_one(tfm, pad, pad);
+   memcpy(mac, pad, mac_len);
+}
+
 static int aes_s2v(struct crypto_cipher *tfm,
   size_t num_elem, const u8 *addr[], size_t len[], u8 *v)
 {
-- 
2.7.4



Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-26 Thread Ard Biesheuvel
On 26 December 2016 at 07:57, Herbert Xu  wrote:
> On Sat, Dec 24, 2016 at 09:57:53AM -0800, Andy Lutomirski wrote:
>>
>> I actually do use incremental hashing later on.   BPF currently
>> vmallocs() a big temporary buffer just so it can fill it and hash it.
>> I change it to hash as it goes.
>
> How much data is this supposed to hash on average? If it's a large
> amount then perhaps using the existing crypto API would be a better
> option than adding this.
>

This is a good point actually: you didn't explain *why* BPF shouldn't
depend on the crypto API.


Re: [RFC PATCH 4.10 1/6] crypto/sha256: Refactor the API so it can be used without shash

2016-12-24 Thread Ard Biesheuvel
Hi Andy,

On 24 December 2016 at 02:22, Andy Lutomirski <l...@kernel.org> wrote:
> There are some pieces of kernel code that want to compute SHA256
> directly without going through the crypto core.  Adjust the exported
> API to decouple it from the crypto core.
>

There are a bunch of things happening at the same time in this patch,
i.e., unnecessary renames of functions with static linkage, return
type changes to the base prototypes (int (*)(...) to void (*)(...))
and the change for the base functions to take a struct sha256_state
ctx rather than a shash_desc. I suppose you are mainly after the
latter, so could we please drop the other changes?

For the name clashes, could we simply use the crypto_ prefix for the
globally visible functions rather than using names that are already in
use? (and having to go around and clean up the conflicts)
As for the return type changes, the base functions intentionally
return int to allow tail calls from the functions exposed by the
crypto API (whose prototypes cannot be changed). Unlikely to matter in
the grand scheme of things (especially now that the base layer
consists of static inline functions primarily), but it is equally
pointless to go around and change them to return void IMO.

So what remains is the struct shash_desc to struct sha256_state
change, which makes sense given that you are after a sha256_digest()
function that does not require the crypto API. But it seems your use
case does not rely on incremental hashing, and so there is no reason
for the state to be exposed outside of the implementation, and we
could simply expose a crypto_sha256_digest() routine from the
sha256_generic.c implementation instead.

Also, I strongly feel that crypto and other security related patches
should be tested before being posted, even if they are only RFC,
especially when they are posted by high profile kernel devs like
yourself. (Your code incorrectly calls crypto_sha2_final() in a couple
of places, resulting in the finalization being performed twice, once
with the accelerated block transform and again with the generic
transform)

> I suspect this will very slightly speed up the SHA256 shash operations
> as well by reducing the amount of indirection involved.
>

I think you have a valid point when it comes to the complexity of the
crypto API in general. But the struct sha256_state is embedded in the
shash_desc rather than referred to via a pointer, so the level of
indirection does not appear to change. And given how 99.9% of the
SHA256 execution time is spent in the block transform routine anyway,
I expect the performance delta to be in the noise tbh.

Finally, another thing to keep in mind is that the base layers of
SHA-1, SHA-256 and SHA-512 are intentionally structured in the same
way. If there is a need for a digest() entry point, I'd prefer to add
them for all flavours.

E.g, something like

"""
@@ -126,3 +128,23 @@ static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
*sctx = (struct sha256_state){};
return 0;
 }
+
+static inline int sha256_base_finish(struct shash_desc *desc, u8 *out)
+{
+   unsigned int digest_size = crypto_shash_digestsize(desc->tfm);
+   struct sha256_state *sctx = shash_desc_ctx(desc);
+
+   return __sha256_base_finish(sctx, out, digest_size);
+}
+
+static inline int sha256_base_digest(const u8 *data, unsigned int len, u8 *out,
+sha256_block_fn *block_fn)
+{
+   struct sha256_state sctx;
+
+   __sha256_base_init(&sctx);
+   sha256_base_do_update(&sctx, data, len, block_fn);
+   sha256_base_do_finalize(&sctx, block_fn);
+
+   return __sha256_base_finish(&sctx, out, SHA256_DIGEST_SIZE);
+}
"""

(where __sha256_base_init() and __sha256_base_finish() are the
existing functions modified to take a struct sha256_state rather than
a shash_desc) should be sufficient to allow a generic
crypto_sha256_digest() to be composed that does not rely on the crypto
API.

Whether this still belongs under crypto or under lib/sha256.c as a
library function (allowing archs to override it) is open for debate.
If hashing BPF programs becomes a hot spot, we probably have bigger
problems.

Regards,
Ard.

P.S. I do take your point regarding the arch_sha256_block_transform()
proposed in your follow up email, but there are some details (SIMD,
availability of the instructions etc) that would make it only suitable
for the generic implementation anyway, and the base layer was already
a huge improvement compared to the open coded implementations of the
SHA boilerplate.


> Cc: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Cc: Herbert Xu <herb...@gondor.apana.org.au>
> Signed-off-by: Andy Lutomirski <l...@kernel.org>
> ---
>  arch/arm/crypto/sha2-ce-glue.c  | 10 ---
>  arch/arm/crypto/sha256_glue.c   | 23 ++-
>  arch/arm/crypto/sha256_neon_glue.c  | 34 +++

Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-19 Thread Ard Biesheuvel
On 19 October 2016 at 08:43, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Wed, 2016-10-19 at 11:31 +0800, Herbert Xu wrote:
>> On Mon, Oct 17, 2016 at 06:21:14PM +0100, Ard Biesheuvel wrote:
>> >
>> >
>> > Annoyingly, all this complication with scatterlists etc is for
>> > doing
>> > asynchronous crypto via DMA capable crypto accelerators, and the
>> > networking code (ipsec as well as mac80211, afaik) only allow
>> > synchronous in the first place, given that they execute in softirq
>> > context.
>>
>> I'm still thinking about the issue (in particular, whether we
>> should continue to rely on the request context being SG-capable
>> or allow it to be on the stack for AEAD).
>
> :)
>
>> But IPsec definitely supports async crypto.  In fact it was the
>> very first user of async crypto.
>
> Yeah.
>

Ah yes, my bad.

>> mac80211 on the other hand is currently sync-only.
>
> We could probably make mac80211 do that too, but can we guarantee in-
> order processing? Anyway, it's pretty low priority, maybe never
> happening, since hardly anyone really uses "software" crypto, the wifi
> devices mostly have it built in anyway.
>

Indeed. The code is now correct in terms of API requirements, so let's
just wait for someone to complain about any performance regressions.


Re: [RFC PATCH 2/2] mac80211: aes_ccm: cache AEAD request structures per CPU

2016-10-18 Thread Ard Biesheuvel
On 18 October 2016 at 15:24, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Tue, 2016-10-18 at 15:18 +0100, Ard Biesheuvel wrote:
>>
>> > Hmm. Is it really worth having a per-CPU variable for each possible
>> > key? You could have a large number of those (typically three when
>> > you're a client on an AP, and 1 + 1 for each client when you're the
>> > AP).
>
> 2 + 1 for each client, actually, since you have 2 GTKs present in the
> "steady state"; not a big difference though.
>
>> > Would it be so bad to have to set the TFM every time (if that's
>> > even possible), and just have a single per-CPU cache?
>
>> That would be preferred, yes. The only snag here is that
>> crypto_alloc_aead() is not guaranteed to return the same algo every
>> time, which means the request size is not guaranteed to be the same
>> either. This is a rare corner case, of course, but it needs to be
>> dealt with regardless
>
> Ah, good point. Well I guess you could allocate a bigger one it if it's
> too small, but then we'd have to recalculate the size all the time
> (which we already did anyway, but saving something else would be good).
> Then we'd be close to just having a per-CPU memory block cache though.
>

Well, ideally we'd allocate the ccm(aes) crypto_alg a single time and
'spawn' the transforms for each key. This is how the crypto API
implements templates internally, but I don't think this functionality
is publicly accessible. Herbert?


[RFC PATCH 0/2] mac80211: aes_ccm: cache AEAD request allocations per CPU

2016-10-18 Thread Ard Biesheuvel
This RFC implements per CPU caching of AEAD request structures, which allows
us to get rid of the per-packet kzalloc/kzfree calls we were forced to
introduce to deal with SG API violations, both in the mac80211 and in the
core crypto API code.

Since mac80211 only executes the AEAD transforms in softirq context, only one
AEAD request can be in flight at the same time on any given CPU, and so, instead
of free the request, we can stash its address in a per CPU variable, and reuse
it for the next packet.

This RFC only addressess CCMP, but GCM and GMAC could be fixed in the same way
(and CMAC did not suffer from the API violation issue in the first place)

Ard Biesheuvel (2):
  mac80211: aes_ccm: prepare key struct for storing context data
  mac80211: aes_ccm: cache AEAD request structures per CPU

 net/mac80211/aes_ccm.c | 80 +---
 net/mac80211/aes_ccm.h | 16 ++--
 net/mac80211/key.c | 16 ++--
 net/mac80211/key.h |  3 +-
 net/mac80211/wpa.c |  4 +-
 5 files changed, 71 insertions(+), 48 deletions(-)

-- 
2.7.4



Re: [RFC PATCH 2/2] mac80211: aes_ccm: cache AEAD request structures per CPU

2016-10-18 Thread Ard Biesheuvel
On 18 October 2016 at 15:16, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Tue, 2016-10-18 at 15:08 +0100, Ard Biesheuvel wrote:
>>
>> + aead_req = *this_cpu_ptr(ccmp->reqs);
>> + if (!aead_req) {
>> + aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
>> + if (!aead_req)
>> + return -ENOMEM;
>> + *this_cpu_ptr(ccmp->reqs) = aead_req;
>> + aead_request_set_tfm(aead_req, ccmp->tfm);
>> + }
>
> Hmm. Is it really worth having a per-CPU variable for each possible
> key? You could have a large number of those (typically three when
> you're a client on an AP, and 1 + 1 for each client when you're the
> AP).
>
> Would it be so bad to have to set the TFM every time (if that's even
> possible), and just have a single per-CPU cache?
>

That would be preferred, yes. The only snag here is that
crypto_alloc_aead() is not guaranteed to return the same algo every
time, which means the request size is not guaranteed to be the same
either. This is a rare corner case, of course, but it needs to be
dealt with regardless


[RFC PATCH 1/2] mac80211: aes_ccm: prepare key struct for storing context data

2016-10-18 Thread Ard Biesheuvel
As a preparatory change to allow per CPU caching of request structures,
refactor the key allocation routine so we can access per key data
beyond the core AEAD transform easily.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_ccm.c | 35 +++-
 net/mac80211/aes_ccm.h | 16 -
 net/mac80211/key.c | 16 -
 net/mac80211/key.h |  2 +-
 net/mac80211/wpa.c |  4 +--
 5 files changed, 37 insertions(+), 36 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index a4e0d59a40dd..58e0338a2c34 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -10,6 +10,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -18,13 +19,13 @@
 #include "key.h"
 #include "aes_ccm.h"
 
-int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
- u8 *data, size_t data_len, u8 *mic,
+int ieee80211_aes_ccm_encrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
+ u8 *aad, u8 *data, size_t data_len, u8 *mic,
  size_t mic_len)
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
-   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm);
+   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(ccmp->tfm);
u8 *__aad;
 
aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
@@ -39,7 +40,7 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
sg_set_buf(&sg[1], data, data_len);
sg_set_buf(&sg[2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
+   aead_request_set_tfm(aead_req, ccmp->tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
@@ -49,13 +50,13 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
return 0;
 }
 
-int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
- u8 *data, size_t data_len, u8 *mic,
+int ieee80211_aes_ccm_decrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
+ u8 *aad, u8 *data, size_t data_len, u8 *mic,
  size_t mic_len)
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
-   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm);
+   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(ccmp->tfm);
u8 *__aad;
int err;
 
@@ -74,7 +75,7 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
sg_set_buf(&sg[1], data, data_len);
sg_set_buf(&sg[2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, tfm);
+   aead_request_set_tfm(aead_req, ccmp->tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
@@ -84,16 +85,17 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
return err;
 }
 
-struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
-   size_t key_len,
-   size_t mic_len)
+int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
+   const u8 key[],
+   size_t key_len,
+   size_t mic_len)
 {
struct crypto_aead *tfm;
int err;
 
tfm = crypto_alloc_aead("ccm(aes)", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(tfm))
-   return tfm;
+   return PTR_ERR(tfm);
 
err = crypto_aead_setkey(tfm, key, key_len);
if (err)
@@ -102,14 +104,15 @@ struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
if (err)
goto free_aead;
 
-   return tfm;
+   ccmp->tfm = tfm;
+   return 0;
 
 free_aead:
crypto_free_aead(tfm);
-   return ERR_PTR(err);
+   return err;
 }
 
-void ieee80211_aes_key_free(struct crypto_aead *tfm)
+void ieee80211_aes_key_free(struct ieee80211_ccmp_aead *ccmp)
 {
-   crypto_free_aead(tfm);
+   crypto_free_aead(ccmp->tfm);
 }
diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
index fcd3254c5cf0..82e91c6ec41f 100644
--- a/net/mac80211/aes_ccm.h
+++ b/net/mac80211/aes_ccm.h
@@ -14,15 +14,15 @@
 
#define CCM_AAD_LEN 32
 
-struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
-   size_t key_len,
-   size_t mic_len);
-int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
- u8 *data, size_t data_len, u8 *mic,
+int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
+   const u8 key[], size_t key_l

[RFC PATCH 2/2] mac80211: aes_ccm: cache AEAD request structures per CPU

2016-10-18 Thread Ard Biesheuvel
Now that we can no longer invoke AEAD transforms with the aead_request
structure allocated on the stack, we perform a kmalloc/kfree for every
packet, which is expensive.

Since the CCMP routines execute in softirq context, we know there can
never be more than one request in flight on each CPU, and so we can
simply keep a cached aead_request structure per CPU, and deallocate all
of them when deallocating the AEAD transform.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_ccm.c | 49 ++--
 net/mac80211/key.h |  1 +
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 58e0338a2c34..2cb5ee4868ea 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -28,9 +28,14 @@ int ieee80211_aes_ccm_encrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(ccmp->tfm);
u8 *__aad;
 
-   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
-   if (!aead_req)
-   return -ENOMEM;
+   aead_req = *this_cpu_ptr(ccmp->reqs);
+   if (!aead_req) {
+   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
+   *this_cpu_ptr(ccmp->reqs) = aead_req;
+   aead_request_set_tfm(aead_req, ccmp->tfm);
+   }
 
__aad = (u8 *)aead_req + reqsize;
memcpy(__aad, aad, CCM_AAD_LEN);
@@ -40,12 +45,10 @@ int ieee80211_aes_ccm_encrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
sg_set_buf(&sg[1], data, data_len);
sg_set_buf(&sg[2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, ccmp->tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
-   kzfree(aead_req);
 
return 0;
 }
@@ -58,14 +61,18 @@ int ieee80211_aes_ccm_decrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
struct aead_request *aead_req;
int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(ccmp->tfm);
u8 *__aad;
-   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
-   if (!aead_req)
-   return -ENOMEM;
+   aead_req = *this_cpu_ptr(ccmp->reqs);
+   if (!aead_req) {
+   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
+   *this_cpu_ptr(ccmp->reqs) = aead_req;
+   aead_request_set_tfm(aead_req, ccmp->tfm);
+   }
 
__aad = (u8 *)aead_req + reqsize;
memcpy(__aad, aad, CCM_AAD_LEN);
@@ -75,14 +82,10 @@ int ieee80211_aes_ccm_decrypt(struct ieee80211_ccmp_aead *ccmp, u8 *b_0,
sg_set_buf(&sg[1], data, data_len);
sg_set_buf(&sg[2], mic, mic_len);
 
-   aead_request_set_tfm(aead_req, ccmp->tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   err = crypto_aead_decrypt(aead_req);
-   kzfree(aead_req);
-
-   return err;
+   return crypto_aead_decrypt(aead_req);
 }
 
 int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
@@ -91,6 +94,7 @@ int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
size_t mic_len)
 {
struct crypto_aead *tfm;
+   struct aead_request **r;
int err;
 
tfm = crypto_alloc_aead("ccm(aes)", 0, CRYPTO_ALG_ASYNC);
@@ -104,6 +108,14 @@ int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
if (err)
goto free_aead;
 
+   /* allow a struct aead_request to be cached per cpu */
+   r = alloc_percpu(struct aead_request *);
+   if (!r) {
+   err = -ENOMEM;
+   goto free_aead;
+   }
+
+   ccmp->reqs = r;
ccmp->tfm = tfm;
return 0;
 
@@ -114,5 +126,14 @@ int ieee80211_aes_key_setup_encrypt(struct ieee80211_ccmp_aead *ccmp,
 
 void ieee80211_aes_key_free(struct ieee80211_ccmp_aead *ccmp)
 {
+   struct aead_request *req;
+   int cpu;
+
+   for_each_possible_cpu(cpu) {
+   req = *per_cpu_ptr(ccmp->reqs, cpu);
+   if (req)
+   kzfree(req);
+   }
+   free_percpu(ccmp->reqs);
crypto_free_aead(ccmp->tfm);
 }
diff --git a/net/mac80211/key.h b/net/mac80211/key.h
index 1ec7a737ab79..f99ec46dc08f 100644
--- a/net/mac80211/key.h
+++ b/net/mac80211/key.h
@@ -89,6 +89,7 @@ struct ieee80211_key {
 */
u8 rx_pn[IEEE80211_NUM_TIDS + 1][IEEE80211_CCMP_PN_LEN];
struct crypto_aead *tfm;
+   struct aead_request * __percpu *reqs;

Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 18:08, Andy Lutomirski <l...@amacapital.net> wrote:
> On Mon, Oct 17, 2016 at 12:37 AM, Ard Biesheuvel
> <ard.biesheu...@linaro.org> wrote:
>> On 17 October 2016 at 08:28, Johannes Berg <johan...@sipsolutions.net> wrote:
>>> On Sat, 2016-10-15 at 18:16 +0100, Ard Biesheuvel wrote:
>>>> The CCM code goes out of its way to perform the CTR encryption of the
>>>> MAC using the subordinate CTR driver. To this end, it tweaks the
>>>> input and output scatterlists so the aead_req 'odata' and/or
>>>> 'auth_tag' fields [which may live on the stack] are prepended to the
>>>> CTR payload. This involves calling sg_set_buf() on addresses which
>>>> are not direct mapped, which is not supported.
>>>
>>>> Since the calculation of the MAC keystream involves a single call
>>>> into the cipher, to which we have a handle already given that the
>>>> CBC-MAC calculation uses it as well, just calculate the MAC keystream
>>>> directly, and record it in the aead_req private context so we can
>>>> apply it to the MAC in cypto_ccm_auth_mac(). This greatly simplifies
>>>> the scatterlist manipulation, and no longer requires scatterlists to
>>>> refer to buffers that may live on the stack.
>>>
>>> No objection from me, Herbert?
>>>
>>> I'm getting a bit nervous though - I'd rather have any fix first so
>>> people get things working again - so maybe I'll apply your other patch
>>> and mine first, and then we can replace yours by this later.
>>>
>>
>> Could we get a statement first whether it is supported to allocate
>> aead_req (and other crypto req structures) on the stack? If not, then
>> we have our work cut out for us. But if it is, I'd rather we didn't
>> apply the kzalloc/kfree patch, since it is just a workaround for the
>> broken generic CCM driver, for which a fix is already available.
>
> I'm not a crypto person, but I don't see why not.  There's even a
> helper called SKCIPHER_REQUEST_ON_STACK. :)  The only problem I know
> of is pointing a scatterlist at the stack, which is bad for much the
> same reason as doing real DMA from the stack.

Excellent point!

So the CCM code was an easy fix, although the RFC4309 part is still broken:

It does

"""
scatterwalk_map_and_copy(iv + 16, req->src, 0, req->assoclen - 8, 0);
...
sg_set_buf(rctx->src, iv + 16, req->assoclen - 8);
sg = scatterwalk_ffwd(rctx->src + 1, req->src, req->assoclen);
"""

which essentially just hides the last 8 bytes of associated data from
the inner CCM transform. So we'd need to rewrite this part to create a
new scatterlist that omits those 8 bytes instead of just replacing the
first sg entry and point it to another buffer entirely (and copy the
data into it)

The GCM code is much more complicated, and does not easily allow the
offending sg_set_buf() calls to be turned into something that only
involves direct mapped memory references. I haven't looked at anything
else.

Annoyingly, all this complication with scatterlists etc is for doing
asynchronous crypto via DMA capable crypto accelerators, and the
networking code (ipsec as well as mac80211, afaik) only allow
synchronous in the first place, given that they execute in softirq
context.
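The pattern the patches in this thread converge on — heap-allocate the request with spare room past it, copy the stack-resident data into that trailing space, and point the scatterlist at the copy — is plain C pointer arithmetic. A minimal user-space sketch of the layout trick (struct and function names here are illustrative stand-ins, not the kernel's):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for struct aead_request plus the implementation-private
 * context that crypto_aead_reqsize() accounts for. */
struct fake_req {
	int flags;
	unsigned char ctx[24];
};

#define AAD_LEN 32

/*
 * Allocate the request with AAD_LEN spare bytes at the end and copy the
 * caller's (possibly stack-resident) AAD into that heap copy, so a
 * scatterlist can safely reference *aad_copy instead of the stack.
 */
static struct fake_req *req_alloc_with_aad(const unsigned char *aad,
					   unsigned char **aad_copy)
{
	struct fake_req *req = calloc(1, sizeof(*req) + AAD_LEN);

	if (!req)
		return NULL;
	*aad_copy = (unsigned char *)req + sizeof(*req);
	memcpy(*aad_copy, aad, AAD_LEN);
	return req;
}
```

One allocation serves both the request and the DMA-safe buffer, which is why the mac80211 fix costs a memcpy() rather than a second alloc.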


[PATCH v4] mac80211: move struct aead_req off the stack

2016-10-17 Thread Ard Biesheuvel
From: Johannes Berg <johannes.b...@intel.com>

Some crypto implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of private data in their struct aead_req.
This means these data structures cannot live in the vmalloc area, which
means that they cannot live on the stack (with CONFIG_VMAP_STACK.)

This currently occurs only with the generic software implementation, but
the private data and usage is implementation specific, so move the whole
data structures off the stack into heap by allocating every time we need
to use them.

In addition, take care not to put any of our own stack allocations into
scatterlists. This involves reserving some extra room when allocating the
aead_request structures, and referring to those allocations in the scatter-
lists (while copying the data from/to the stack before/after the crypto
operation, as appropriate)

Signed-off-by: Johannes Berg <johannes.b...@intel.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 net/mac80211/aes_ccm.c  | 46 +---
 net/mac80211/aes_ccm.h  |  8 ++--
 net/mac80211/aes_cmac.c |  5 +--
 net/mac80211/aes_cmac.h |  2 +
 net/mac80211/aes_gcm.c  | 43 +++---
 net/mac80211/aes_gcm.h  |  6 ++-
 net/mac80211/aes_gmac.c | 26 +--
 net/mac80211/aes_gmac.h |  4 ++
 net/mac80211/wpa.c  | 22 --
 9 files changed, 97 insertions(+), 65 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 7663c28ba353..a4e0d59a40dd 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -18,21 +18,24 @@
 #include "key.h"
 #include "aes_ccm.h"
 
-void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
-  u8 *data, size_t data_len, u8 *mic,
-  size_t mic_len)
+int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
+ u8 *data, size_t data_len, u8 *mic,
+ size_t mic_len)
 {
struct scatterlist sg[3];
+   struct aead_request *aead_req;
+   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm);
+   u8 *__aad;
 
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   __aad = (u8 *)aead_req + reqsize;
+   memcpy(__aad, aad, CCM_AAD_LEN);
 
sg_init_table(sg, 3);
-   sg_set_buf([0], [2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf([0], &__aad[2], be16_to_cpup((__be16 *)__aad));
sg_set_buf([1], data, data_len);
sg_set_buf([2], mic, mic_len);
 
@@ -41,6 +44,9 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
+   kzfree(aead_req);
+
+   return 0;
 }
 
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
@@ -48,18 +54,23 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
  size_t mic_len)
 {
struct scatterlist sg[3];
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   struct aead_request *aead_req;
+   int reqsize = sizeof(*aead_req) + crypto_aead_reqsize(tfm);
+   u8 *__aad;
+   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = kzalloc(reqsize + CCM_AAD_LEN, GFP_ATOMIC);
+   if (!aead_req)
+   return -ENOMEM;
+
+   __aad = (u8 *)aead_req + reqsize;
+   memcpy(__aad, aad, CCM_AAD_LEN);
 
sg_init_table(sg, 3);
-   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad));
	sg_set_buf(&sg[1], data, data_len);
	sg_set_buf(&sg[2], mic, mic_len);
 
@@ -67,7 +78,10 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   return crypto_aead_decrypt(aead_req);
+   err = crypto_aead_decrypt(aead_req);
+   kzfree(aead_req);
+
+   return err;
 }
 
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
index 6a73d1e4d186..fcd3254c5cf0 100644
--- a/net/mac80211/aes_ccm.h
+++ b/net/mac80211/aes_ccm.h
@@ -12,12 +12,14 @@
 
 #i

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 14:16, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Mon, 2016-10-17 at 14:06 +0100, Ard Biesheuvel wrote:
>>
>> Actually, while I think it will be worthwhile going forward to
>> implement such an 'auxiliary data' feature in a generic way, I still
>> think we should address the issue at hand without too much
>> complication.
>>
>> If we pedal back to the version of 'mac80211: move struct aead_req
>> off the stack' that uses kzalloc() instead of aead_request_alloc(),
>> we can simply add some space for aad[] and/or zero[], and get rid of
>> the kmem cache entirely.
>>
>> If you're past this point already, i won't bother but otherwise I can
>> rework 'mac80211: move struct aead_req off the stack' so that the
>> other patch is no longer required (and IIRC, this is actually
>> something you proposed yourself a couple of iterations ago?)
>
> Yes, I did consider that.
>
> It makes some sense, and I guess the extra memcpy() would be cheaper
> than the extra alloc?
>
> I'd happily use that instead of the combination of my two patches. The
> aead_request_alloc() is just a simple inline anyway, so no real problem
> not using it.
>

Indeed. And it keeps the clutter inside the aes_xxx.c files, which
could easily be updated in the future to use some auxdata feature if
it ever materializes.

I think it would help this code, but also the ESP code you pointed
out, to have some kind of 'ordered synchronous' CRYPTO_xxx flag, where
the crypto API could manage the kmem cache and percpu pointers to
allocations. This goes well beyond what we can do as a fix, though, so
we need an intermediate solution in any case.

Shall I propose the patch?


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 11:02, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
>
>
>> On 17 Oct 2016, at 10:54, Johannes Berg <johan...@sipsolutions.net> wrote:
>>
>>
>>>> Well, if your other patch to make it OK to be on-stack would be
>>>> applied instead, this wouldn't make much sense either :)
>>>>
>>>
>>> Yes but that one only fixes ccm not gcm
>>
>> Yes, but we can do the same for GCM, no?
>>
>
> No, not really. ccm happens to use aes with the same key for the mac and the 
> encryption. gcm uses another algo entirely for the mac
>
>>>> In this particular patch, we could reduce the size of the struct,
>>>> but I
>>>> don't actually think it'll make a difference to go from 48 to 36
>>>> bytes,
>>>> given alignment etc., so I think I'll just leave it as is.
>>>>
>>>
>>> I understand you are in a hurry, but this is unlikely to be the only
>>> such issue. I will propose an 'auxdata' feature for the crypto api
>>> that can be used here, but also for any other occurrence where client
>>> data assoiciated with the request can no longer be allocated on the
>>> stack
>>
>> No objections. I'll merge this anyway today I think, reverting is easy
>> later.
>>
>
> ok fair enough

Actually, while I think it will be worthwhile going forward to
implement such an 'auxiliary data' feature in a generic way, I still
think we should address the issue at hand without too much
complication.

If we pedal back to the version of 'mac80211: move struct aead_req off
the stack' that uses kzalloc() instead of aead_request_alloc(), we can
simply add some space for aad[] and/or zero[], and get rid of the kmem
cache entirely.

If you're past this point already, i won't bother but otherwise I can
rework 'mac80211: move struct aead_req off the stack' so that the
other patch is no longer required (and IIRC, this is actually
something you proposed yourself a couple of iterations ago?)


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:54, Johannes Berg  wrote:
> 
> 
>>> Well, if your other patch to make it OK to be on-stack would be
>>> applied instead, this wouldn't make much sense either :)
>>> 
>> 
>> Yes but that one only fixes ccm not gcm
> 
> Yes, but we can do the same for GCM, no?
> 

No, not really. ccm happens to use aes with the same key for the mac and the 
encryption. gcm uses another algo entirely for the mac

>>> In this particular patch, we could reduce the size of the struct,
>>> but I
>>> don't actually think it'll make a difference to go from 48 to 36
>>> bytes,
>>> given alignment etc., so I think I'll just leave it as is.
>>> 
>> 
>> I understand you are in a hurry, but this is unlikely to be the only
>> such issue. I will propose an 'auxdata' feature for the crypto api
>> that can be used here, but also for any other occurrence where client
>> data associated with the request can no longer be allocated on the
>> stack
> 
> No objections. I'll merge this anyway today I think, reverting is easy
> later.
> 

ok fair enough

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:35, Johannes Berg <johan...@sipsolutions.net> wrote:
> 
>> On Mon, 2016-10-17 at 10:30 +0100, Ard Biesheuvel wrote:
>> 
>> Yes. But as I replied, setting the req size is not supported atm,
>> although it is reasonable to demand a way to allocate additional data
>> in the request specifically for this issue. So let's proceed with the
>> aead_request_alloc/free patch, but I would like to propose something
>> on the API side to address this particular issue
> 
> Well, if your other patch to make it OK to be on-stack would be applied
> instead, this wouldn't make much sense either :)
> 

Yes but that one only fixes ccm not gcm

> In this particular patch, we could reduce the size of the struct, but I
> don't actually think it'll make a difference to go from 48 to 36 bytes,
> given alignment etc., so I think I'll just leave it as is.
> 

I understand you are in a hurry, but this is unlikely to be the only such 
issue. I will propose an 'auxdata' feature for the crypto api that can be used 
here, but also for any other occurrence where client data associated with the 
request can no longer be allocated on the stack

Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel


> On 17 Oct 2016, at 10:35, Johannes Berg <johan...@sipsolutions.net> wrote:
> 
>> On Mon, 2016-10-17 at 10:30 +0100, Ard Biesheuvel wrote:
>> 
>> Yes. But as I replied, setting the req size is not supported atm,
>> although it is reasonable to demand a way to allocate additional data
>> in the request specifically for this issue. So let's proceed with the
>> aead_request_alloc/free patch, but I would like to propose something
>> on the API side to address this particular issue
> 
> Well, if your other patch to make it OK to be on-stack would be applied
> instead, this wouldn't make much sense either :)
> 
> In this particular patch, we could reduce the size of the struct, but I
> don't actually think it'll make a difference to go from 48 to 36 bytes,
> given alignment etc., so I think I'll just leave it as is.
> 
> johannes


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 10:23, Johannes Berg  wrote:
>
>> Apologies for going back and forth on this, but it appears there may
>> be another way to deal with this.
>>
>> First of all, we only need this handling for the authenticated data,
>
> Are you sure b_0/j_0 aren't needed? We pass those
> to aead_request_set_crypt(), and I wasn't sure what that really did
> internally, perhaps like the internal data.
>

They are the IV[], which is a fixed length parameter of the algorithm.
In contrast, the AAD[] could be of arbitrary length (from the POV of
the crypto API) so it uses scatterlists.

> Testing with that on the stack does seem to work, in fact.
>
> Surely we need zero for GMAC though, since we also put that into the sg
> list. Thus for GMAC we definitely need 20+16 bytes, and since I round
> up to a cacheline (at least on SMP) it doesn't really matter that we
> could get 36 instead of the 48 I have now.
>
>> and only for CCM and GCM, not CMAC (which does not use scatterlists
>> at all, it simply calls the AES cipher directly)
>
> I didn't modify CMAC, I think, only GMAC, which also uses scatterlists.
>

Ah ok, I misread the patch.

>> So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
>
> and 36 for GMAC :)

Yes. But as I replied, setting the req size is not supported atm,
although it is reasonable to demand a way to allocate additional data
in the request specifically for this issue. So let's proceed with the
aead_request_alloc/free patch, but I would like to propose something
on the API side to address this particular issue


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 10:14, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> On 17 October 2016 at 09:33, Johannes Berg <johan...@sipsolutions.net> wrote:
>> From: Johannes Berg <johannes.b...@intel.com>
>>
>> As the stack can (on x86-64) now be virtually mapped rather than
>> using "normal" kernel memory, Sergey noticed mac80211 isn't using
>> the SG APIs correctly by putting on-stack buffers into SG tables.
>> This leads to kernel crashes.
>>
>> Fix this by allocating the extra fields dynamically on the fly as
>> needed, using a kmem cache.
>>
>> I used per-CPU memory in a previous iteration of this patch, but
>> Ard Biesheuvel pointed out that was also vmalloc'ed on some
>> architectures.
>>
>> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>
>> Signed-off-by: Johannes Berg <johannes.b...@intel.com>
>
> Apologies for going back and forth on this, but it appears there may
> be another way to deal with this.
>
> First of all, we only need this handling for the authenticated data,
> and only for CCM and GCM, not CMAC (which does not use scatterlists at
> all, it simply calls the AES cipher directly)
>
> So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
> which we could allocate along with the AEAD request, e.g.,
>
> """
> diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
> index 8e898a6e8de8..c0c33e6ad94e 100644
> --- a/net/mac80211/aes_ccm.c
> +++ b/net/mac80211/aes_ccm.c
> @@ -24,13 +24,17 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
>  {
> struct scatterlist sg[3];
> struct aead_request *aead_req;
> +   u8 *__aad;
>
> aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> if (!aead_req)
> return -ENOMEM;
>
> +   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
> +   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
> +
> sg_init_table(sg, 3);
> -   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
> +   sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad));
> sg_set_buf(&sg[1], data, data_len);
> sg_set_buf(&sg[2], mic, mic_len);
>
> @@ -49,6 +53,7 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
>  {
> struct scatterlist sg[3];
> struct aead_request *aead_req;
> +   u8 *__aad;
> int err;
>
> if (data_len == 0)
> @@ -58,8 +63,11 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> if (!aead_req)
> return -ENOMEM;
>
> +   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
> +   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
> +
> sg_init_table(sg, 3);
> -   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
> +   sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad));
> sg_set_buf(&sg[1], data, data_len);
> sg_set_buf(&sg[2], mic, mic_len);
>
> @@ -90,6 +98,8 @@ struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
> if (err)
> goto free_aead;
>
> +   crypto_aead_set_reqsize(tfm,
> +   crypto_aead_reqsize(tfm) + 2 * AES_BLOCK_SIZE);
> return tfm;
>

Darn, it seems crypto_aead_set_reqsize() is internal to the crypto API ... :-(


Re: [PATCH v4] mac80211: move extra crypto data off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 09:33, Johannes Berg <johan...@sipsolutions.net> wrote:
> From: Johannes Berg <johannes.b...@intel.com>
>
> As the stack can (on x86-64) now be virtually mapped rather than
> using "normal" kernel memory, Sergey noticed mac80211 isn't using
> the SG APIs correctly by putting on-stack buffers into SG tables.
> This leads to kernel crashes.
>
> Fix this by allocating the extra fields dynamically on the fly as
> needed, using a kmem cache.
>
> I used per-CPU memory in a previous iteration of this patch, but
> Ard Biesheuvel pointed out that was also vmalloc'ed on some
> architectures.
>
> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.w...@gmail.com>
> Signed-off-by: Johannes Berg <johannes.b...@intel.com>

Apologies for going back and forth on this, but it appears there may
be another way to deal with this.

First of all, we only need this handling for the authenticated data,
and only for CCM and GCM, not CMAC (which does not use scatterlists at
all, it simply calls the AES cipher directly)

So that leaves a fixed 20 bytes for GCM and fixed 32 bytes for CCM,
which we could allocate along with the AEAD request, e.g.,

"""
diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 8e898a6e8de8..c0c33e6ad94e 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -24,13 +24,17 @@ int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
+   u8 *__aad;

aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
if (!aead_req)
return -ENOMEM;

+   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
+   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
+
sg_init_table(sg, 3);
-   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad));
	sg_set_buf(&sg[1], data, data_len);
	sg_set_buf(&sg[2], mic, mic_len);

@@ -49,6 +53,7 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
 {
struct scatterlist sg[3];
struct aead_request *aead_req;
+   u8 *__aad;
int err;

if (data_len == 0)
@@ -58,8 +63,11 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
if (!aead_req)
return -ENOMEM;

+   __aad = (u8 *)aead_req + crypto_aead_reqsize(tfm);
+   memcpy(__aad, aad, 2 * AES_BLOCK_SIZE);
+
sg_init_table(sg, 3);
-   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
+   sg_set_buf(&sg[0], &__aad[2], be16_to_cpup((__be16 *)__aad));
	sg_set_buf(&sg[1], data, data_len);
	sg_set_buf(&sg[2], mic, mic_len);

@@ -90,6 +98,8 @@ struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
if (err)
goto free_aead;

+   crypto_aead_set_reqsize(tfm,
+   crypto_aead_reqsize(tfm) + 2 * AES_BLOCK_SIZE);
return tfm;

 free_aead:
"""


Re: [PATCH v3] mac80211: move struct aead_req off the stack

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 09:03, Johannes Berg <johan...@sipsolutions.net> wrote:
> From: Johannes Berg <johannes.b...@intel.com>
>
> Some crypto implementations (such as the generic CCM wrapper in crypto/)
> use scatterlists to map fields of private data in their struct aead_req.
> This means these data structures cannot live in the vmalloc area, which
> means that they cannot live on the stack (with CONFIG_VMAP_STACK.)
>
> This currently occurs only with the generic software implementation, but
> the private data and usage is implementation specific, so move the whole
> data structures off the stack into heap by allocating every time we need
> to use them.
>
> This pattern already exists in the IPsec ESP driver, but in the future,
> we may want/need to improve upon this, e.g. by using a slab cache.
>
> (Based on Ard's patch, but that was missing error handling and GCM/GMAC)
>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Signed-off-by: Johannes Berg <johannes.b...@intel.com>
> ---
> v3: remove superfluous aead_request_set_tfm() calls

Reviewed-by: Ard Biesheuvel <ard.biesheu...@linaro.org>

> ---
>  net/mac80211/aes_ccm.c  | 36 +++-
>  net/mac80211/aes_ccm.h  |  6 +++---
>  net/mac80211/aes_gcm.c  | 33 +
>  net/mac80211/aes_gcm.h  |  4 ++--
>  net/mac80211/aes_gmac.c | 11 +--
>  net/mac80211/wpa.c  | 12 
>  6 files changed, 50 insertions(+), 52 deletions(-)
>
> diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
> index 7663c28ba353..8e898a6e8de8 100644
> --- a/net/mac80211/aes_ccm.c
> +++ b/net/mac80211/aes_ccm.c
> @@ -18,29 +18,29 @@
>  #include "key.h"
>  #include "aes_ccm.h"
>
> -void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> -  u8 *data, size_t data_len, u8 *mic,
> -  size_t mic_len)
> +int ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> + u8 *data, size_t data_len, u8 *mic,
> + size_t mic_len)
>  {
> struct scatterlist sg[3];
> +   struct aead_request *aead_req;
>
> -   char aead_req_data[sizeof(struct aead_request) +
> -  crypto_aead_reqsize(tfm)]
> -   __aligned(__alignof__(struct aead_request));
> -   struct aead_request *aead_req = (void *) aead_req_data;
> -
> -   memset(aead_req, 0, sizeof(aead_req_data));
> +   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> +   if (!aead_req)
> +   return -ENOMEM;
>
> sg_init_table(sg, 3);
> sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
> sg_set_buf(&sg[1], data, data_len);
> sg_set_buf(&sg[2], mic, mic_len);
>
> -   aead_request_set_tfm(aead_req, tfm);
> aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
> aead_request_set_ad(aead_req, sg[0].length);
>
> crypto_aead_encrypt(aead_req);
> +   aead_request_free(aead_req);
> +
> +   return 0;
>  }
>
>  int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
> @@ -48,26 +48,28 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
>   size_t mic_len)
>  {
> struct scatterlist sg[3];
> -   char aead_req_data[sizeof(struct aead_request) +
> -  crypto_aead_reqsize(tfm)]
> -   __aligned(__alignof__(struct aead_request));
> -   struct aead_request *aead_req = (void *) aead_req_data;
> +   struct aead_request *aead_req;
> +   int err;
>
> if (data_len == 0)
> return -EINVAL;
>
> -   memset(aead_req, 0, sizeof(aead_req_data));
> +   aead_req = aead_request_alloc(tfm, GFP_ATOMIC);
> +   if (!aead_req)
> +   return -ENOMEM;
>
> sg_init_table(sg, 3);
> sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
> sg_set_buf(&sg[1], data, data_len);
> sg_set_buf(&sg[2], mic, mic_len);
>
> -   aead_request_set_tfm(aead_req, tfm);
> aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
> aead_request_set_ad(aead_req, sg[0].length);
>
> -   return crypto_aead_decrypt(aead_req);
> +   err = crypto_aead_decrypt(aead_req);
> +   aead_request_free(aead_req);
> +
> +   return err;
>  }
>
>  struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
> diff --git a/net/mac80211/aes_ccm.h b/net/mac80211/aes_ccm.h
> index 6a73d1e4d186..03e21b0995e3 100644
> --- a/net/mac80211

Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 08:47, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Mon, 2016-10-17 at 08:37 +0100, Ard Biesheuvel wrote:
>>
>> Could we get a statement first whether it is supported to allocate
>> aead_req (and other crypto req structures) on the stack?
>
> Well, we haven't heard from Herbert :)
>
>> If not, then
>> we have our work cut out for us. But if it is, I'd rather we didn't
>> apply the kzalloc/kfree patch, since it is just a workaround for the
>> broken generic CCM driver, for which a fix is already available.
>
> Yeah but I can't apply it. I just fixed up your kzalloc patch to also
> handle GCM and GMAC, and to have error checking. Will send it in a
> minute.
>

I just realised that patch should probably use
aead_request_alloc/aead_request_free [and drop the memset]. That also
fixes the latent bug where the alignment of the req ctx is not take
into account.

>> Also, regarding your __percpu patch: those are located in the vmalloc
>> area as well, at least on arm64, and likely other architectures too.
>
> Crap. Any other bright ideas?
>

kmem_cache_create() and kmem_cache_alloc()
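For reference, the slab suggestion amounts to recycling fixed-size request objects instead of paying for a fresh allocation on every packet. A toy user-space analogue of that recycling behaviour (kmem_cache itself is kernel-only, so everything below is a stand-in; a real slab keeps per-CPU magazines of many objects, not a single slot):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One-slot object cache: keep the last freed object around and hand it
 * back on the next allocation instead of calling the allocator again. */
struct obj_cache {
	size_t obj_size;
	void *cached;
};

static void *cache_alloc(struct obj_cache *c)
{
	void *obj = c->cached;

	if (obj) {
		c->cached = NULL;
		memset(obj, 0, c->obj_size);	/* hand back a zeroed object */
		return obj;
	}
	return calloc(1, c->obj_size);
}

static void cache_free(struct obj_cache *c, void *obj)
{
	if (!c->cached)
		c->cached = obj;	/* recycle for the next caller */
	else
		free(obj);
}
```

In the hot path this turns most alloc/free pairs into pointer swaps, which is the property that makes a slab cache attractive for per-packet crypto requests.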


Re: [PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-17 Thread Ard Biesheuvel
On 17 October 2016 at 08:28, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Sat, 2016-10-15 at 18:16 +0100, Ard Biesheuvel wrote:
>> The CCM code goes out of its way to perform the CTR encryption of the
>> MAC using the subordinate CTR driver. To this end, it tweaks the
>> input and output scatterlists so the aead_req 'odata' and/or
>> 'auth_tag' fields [which may live on the stack] are prepended to the
>> CTR payload. This involves calling sg_set_buf() on addresses which
>> are not direct mapped, which is not supported.
>
>> Since the calculation of the MAC keystream involves a single call
>> into the cipher, to which we have a handle already given that the
>> CBC-MAC calculation uses it as well, just calculate the MAC keystream
>> directly, and record it in the aead_req private context so we can
>> apply it to the MAC in cypto_ccm_auth_mac(). This greatly simplifies
>> the scatterlist manipulation, and no longer requires scatterlists to
>> refer to buffers that may live on the stack.
>
> No objection from me, Herbert?
>
> I'm getting a bit nervous though - I'd rather have any fix first so
> people get things working again - so maybe I'll apply your other patch
> and mine first, and then we can replace yours by this later.
>

Could we get a statement first whether it is supported to allocate
aead_req (and other crypto req structures) on the stack? If not, then
we have our work cut out for us. But if it is, I'd rather we didn't
apply the kzalloc/kfree patch, since it is just a workaround for the
broken generic CCM driver, for which a fix is already available.

Also, regarding your __percpu patch: those are located in the vmalloc
area as well, at least on arm64, and likely other architectures too.


[PATCH] crypto: ccm - avoid scatterlist for MAC encryption

2016-10-15 Thread Ard Biesheuvel
The CCM code goes out of its way to perform the CTR encryption of the MAC
using the subordinate CTR driver. To this end, it tweaks the input and
output scatterlists so the aead_req 'odata' and/or 'auth_tag' fields [which
may live on the stack] are prepended to the CTR payload. This involves
calling sg_set_buf() on addresses which are not direct mapped, which is
not supported.

Since the calculation of the MAC keystream involves a single call into
the cipher, to which we have a handle already given that the CBC-MAC
calculation uses it as well, just calculate the MAC keystream directly,
and record it in the aead_req private context so we can apply it to the
MAC in cypto_ccm_auth_mac(). This greatly simplifies the scatterlist
manipulation, and no longer requires scatterlists to refer to buffers
that may live on the stack.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---

This is an alternative for the patch 'mac80211: aes_ccm: move struct
aead_req off the stack' that I sent out yesterday. IMO, this is a more
correct approach, since it addresses the problem directly in crypto/ccm.c,
which is the only CCM-AES driver that suffers from this issue.

 crypto/ccm.c | 55 +++-
 1 file changed, 29 insertions(+), 26 deletions(-)

diff --git a/crypto/ccm.c b/crypto/ccm.c
index 006d8575ef5c..faa5efcf59e2 100644
--- a/crypto/ccm.c
+++ b/crypto/ccm.c
@@ -46,10 +46,13 @@ struct crypto_ccm_req_priv_ctx {
u8 odata[16];
u8 idata[16];
u8 auth_tag[16];
+   u8 cmac[16];
u32 ilen;
u32 flags;
-   struct scatterlist src[3];
-   struct scatterlist dst[3];
+   struct scatterlist *src;
+   struct scatterlist *dst;
+   struct scatterlist srcbuf[2];
+   struct scatterlist dstbuf[2];
struct skcipher_request skreq;
 };
 
@@ -280,6 +283,8 @@ static int crypto_ccm_auth(struct aead_request *req, struct scatterlist *plain,
if (cryptlen)
get_data_to_compute(cipher, pctx, plain, cryptlen);
 
+   crypto_xor(odata, pctx->cmac, 16);
+
 out:
return err;
 }
@@ -307,10 +312,12 @@ static inline int crypto_ccm_check_iv(const u8 *iv)
return 0;
 }
 
-static int crypto_ccm_init_crypt(struct aead_request *req, u8 *tag)
+static int crypto_ccm_init_crypt(struct aead_request *req)
 {
+   struct crypto_aead *aead = crypto_aead_reqtfm(req);
+   struct crypto_ccm_ctx *ctx = crypto_aead_ctx(aead);
struct crypto_ccm_req_priv_ctx *pctx = crypto_ccm_reqctx(req);
-   struct scatterlist *sg;
+   struct crypto_cipher *cipher = ctx->cipher;
u8 *iv = req->iv;
int err;
 
@@ -325,19 +332,16 @@ static int crypto_ccm_init_crypt(struct aead_request 
*req, u8 *tag)
 */
memset(iv + 15 - iv[0], 0, iv[0] + 1);
 
-   sg_init_table(pctx->src, 3);
-   sg_set_buf(pctx->src, tag, 16);
-   sg = scatterwalk_ffwd(pctx->src + 1, req->src, req->assoclen);
-   if (sg != pctx->src + 1)
-   sg_chain(pctx->src, 2, sg);
+   /* prepare the key stream for the auth tag  */
+   crypto_cipher_encrypt_one(cipher, pctx->cmac, iv);
 
-   if (req->src != req->dst) {
-   sg_init_table(pctx->dst, 3);
-   sg_set_buf(pctx->dst, tag, 16);
-   sg = scatterwalk_ffwd(pctx->dst + 1, req->dst, req->assoclen);
-   if (sg != pctx->dst + 1)
-   sg_chain(pctx->dst, 2, sg);
-   }
+   /* increment BE counter in IV[] for the actual payload */
+   iv[15] = 1;
+
+   pctx->src = scatterwalk_ffwd(pctx->srcbuf, req->src, req->assoclen);
+   if (req->src != req->dst)
+   pctx->dst = scatterwalk_ffwd(pctx->dstbuf, req->dst,
+req->assoclen);
 
return 0;
 }
@@ -354,11 +358,11 @@ static int crypto_ccm_encrypt(struct aead_request *req)
u8 *iv = req->iv;
int err;
 
-   err = crypto_ccm_init_crypt(req, odata);
+   err = crypto_ccm_init_crypt(req);
if (err)
return err;
 
-   err = crypto_ccm_auth(req, sg_next(pctx->src), cryptlen);
+   err = crypto_ccm_auth(req, pctx->src, cryptlen);
if (err)
return err;
 
@@ -369,13 +373,13 @@ static int crypto_ccm_encrypt(struct aead_request *req)
skcipher_request_set_tfm(skreq, ctx->ctr);
skcipher_request_set_callback(skreq, pctx->flags,
  crypto_ccm_encrypt_done, req);
-   skcipher_request_set_crypt(skreq, pctx->src, dst, cryptlen + 16, iv);
+   skcipher_request_set_crypt(skreq, pctx->src, dst, cryptlen, iv);
err = crypto_skcipher_encrypt(skreq);
if (err)
return err;
 
/* copy authtag to end of dst */
-   scatterwalk_map_and_copy(odata, sg_next(dst), cryptlen,
+   scatterwalk_map_and_copy(odata, ds

Re: [PATCH] mac80211: aes_ccm: move struct aead_req off the stack

2016-10-14 Thread Ard Biesheuvel

> On 14 Oct 2016, at 14:46, Johannes Berg <johan...@sipsolutions.net> wrote:
> 
> 
>> 
>> Is the aad[] actually reused? I would assume it only affects the mac
>> on encryption, and the verification on decryption but I don't think
>> we actually need it back from the crypto routines.
> 
> I don't think it's reused.
> 
>> Exactly what you said above :-) My patch only touches CCM but as you
>> said,
>> 
>> """
>> 'Also there's B_0/J_0 for CCM/GCM, and the 'zero' thing that GMAC
>> has.
>> """
> 
> Ah, but we can/should do the same for the others, no?
> 

Yes, but then we end up kmalloc/kfreeing chunks of 16 bytes, which is actually 
another problem.

I still think we are not violating the api by putting aead_req on the stack 
(but herbert should confirm). The aad[] issue does violate the api, so it 
deserves a separate fix imo

Re: [PATCH v2] r8169: set coherent DMA mask as well as streaming DMA mask

2016-10-14 Thread Ard Biesheuvel

> On 14 Oct 2016, at 14:42, David Laight <david.lai...@aculab.com> wrote:
> 
> From: Of Ard Biesheuvel
>> Sent: 14 October 2016 14:41
>> PCI devices that are 64-bit DMA capable should set the coherent
>> DMA mask as well as the streaming DMA mask. On some architectures,
>> these are managed separately, and so the coherent DMA mask will be
>> left at its default value of 32 if it is not set explicitly. This
>> results in errors such as
>> 
>> r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
>> hwdev DMA mask = 0x, dev_addr = 0x0080fbfff000
>> swiotlb: coherent allocation failed for device :02:00.0 size=4096
>> CPU: 0 PID: 1062 Comm: systemd-udevd Not tainted 4.8.0+ #35
>> Hardware name: AMD Seattle/Seattle, BIOS 10:53:24 Oct 13 2016
>> 
>> on systems without memory that is 32-bit addressable by PCI devices.
>> 
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>> ---
>> v2: dropped the hunk that sets the coherent DMA mask to DMA_BIT_MASK(32),
>>which is unnecessary given that it is the default
>> 
>> drivers/net/ethernet/realtek/r8169.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/ethernet/realtek/r8169.c 
>> b/drivers/net/ethernet/realtek/r8169.c
>> index e55638c7505a..bf000d819a21 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
>> @@ -8273,7 +8273,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
>> struct pci_device_id *ent)
>>if ((sizeof(dma_addr_t) > 4) &&
>>(use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
>>  tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
>> -!pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
>> +!pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
>> +!pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
> 
> Isn't there a dma_set_mask_and_coherent() function ?
> 

Not of the pci_xxx variety afaik

[PATCH v2] r8169: set coherent DMA mask as well as streaming DMA mask

2016-10-14 Thread Ard Biesheuvel
PCI devices that are 64-bit DMA capable should set the coherent
DMA mask as well as the streaming DMA mask. On some architectures,
these are managed separately, and so the coherent DMA mask will be
left at its default value of 32 if it is not set explicitly. This
results in errors such as

 r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
 hwdev DMA mask = 0x, dev_addr = 0x0080fbfff000
 swiotlb: coherent allocation failed for device :02:00.0 size=4096
 CPU: 0 PID: 1062 Comm: systemd-udevd Not tainted 4.8.0+ #35
 Hardware name: AMD Seattle/Seattle, BIOS 10:53:24 Oct 13 2016

on systems without memory that is 32-bit addressable by PCI devices.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
v2: dropped the hunk that sets the coherent DMA mask to DMA_BIT_MASK(32),
which is unnecessary given that it is the default
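The failure mode described above follows from the two masks being tracked independently, with both defaulting to 32 bits. A toy model shows why setting only the streaming mask leaves coherent allocations confined below 4 GB (the class and buffer address below are illustrative only, not the kernel API; dma_bit_mask() mirrors the kernel's DMA_BIT_MASK macro):

```python
def dma_bit_mask(n: int) -> int:
    # Mirrors the kernel's DMA_BIT_MASK(n); n == 64 is special-cased in C
    # to avoid an undefined full-width shift.
    return (1 << n) - 1 if n < 64 else 0xFFFFFFFFFFFFFFFF

class PciDevModel:
    """Toy model of the two independent DMA masks (not the kernel API)."""
    def __init__(self):
        self.dma_mask = dma_bit_mask(32)           # streaming, default 32-bit
        self.coherent_dma_mask = dma_bit_mask(32)  # coherent, default 32-bit

    def set_dma_mask(self, mask):
        self.dma_mask = mask

    def set_consistent_dma_mask(self, mask):
        self.coherent_dma_mask = mask

dev = PciDevModel()
dev.set_dma_mask(dma_bit_mask(64))     # what the driver did before this fix

buf_addr = 0x0080_FBFF_F000            # a DMA buffer above 4 GB, as in the log
assert buf_addr <= dev.dma_mask        # streaming DMA would be fine
assert buf_addr > dev.coherent_dma_mask  # but the coherent allocation fails

dev.set_consistent_dma_mask(dma_bit_mask(64))   # the fix
assert buf_addr <= dev.coherent_dma_mask
print("coherent allocations above 4 GB now permitted")
```

On a machine with no RAM below 4 GB, every coherent buffer lands above the default 32-bit mask, which is exactly the swiotlb failure shown in the commit message.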

 drivers/net/ethernet/realtek/r8169.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index e55638c7505a..bf000d819a21 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8273,7 +8273,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if ((sizeof(dma_addr_t) > 4) &&
(use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
  tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
-   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
+   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
+   !pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
 
/* CPlusCmd Dual Access Cycle is only needed for non-PCIe */
if (!pci_is_pcie(pdev))
-- 
2.7.4



Re: [PATCH] r8169: set coherent DMA mask as well as streaming DMA mask

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 14:34, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Fri, 14 Oct 2016 14:32:24 +0100
>
>> On 14 October 2016 at 14:31, David Miller <da...@davemloft.net> wrote:
>>> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>> Date: Fri, 14 Oct 2016 12:39:30 +0100
>>>
>>>> PCI devices that are 64-bit DMA capable should set the coherent
>>>> DMA mask as well as the streaming DMA mask. On some architectures,
>>>> these are managed separately, and so the coherent DMA mask will be
>>>> left at its default value of 32 if it is not set explicitly. This
>>>> results in errors such as
>>>>
>>>>  r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
>>>>  hwdev DMA mask = 0x, dev_addr = 0x0080fbfff000
>>>>  swiotlb: coherent allocation failed for device :02:00.0 size=4096
>>>>  CPU: 0 PID: 1062 Comm: systemd-udevd Not tainted 4.8.0+ #35
>>>>  Hardware name: AMD Seattle/Seattle, BIOS 10:53:24 Oct 13 2016
>>>>
>>>> on systems without memory that is 32-bit addressable by PCI devices.
>>>>
>>>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>>>  ...
>>>> @@ -8281,6 +8282,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
>>>> struct pci_device_id *ent)
>>>>   dev->features |= NETIF_F_HIGHDMA;
>>>>   } else {
>>>>   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
>>>> + if (!rc)
>>>> + rc = pci_set_consistent_dma_mask(pdev, 
>>>> DMA_BIT_MASK(32));
>>>
>>> As you state 32-bit is the default, therefore this part of your patch is 
>>> unnecessary.
>>
>> Perhaps, but the original code did not assume that either. Should I
>> remove the other call in a subsequent patch as well?
>
> I simply want you to respin this with the above hunk removed.
>
> Your code changes and your commit message must be consistent.

OK, fair enough


Re: [PATCH] r8169: set coherent DMA mask as well as streaming DMA mask

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 14:31, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Fri, 14 Oct 2016 12:39:30 +0100
>
>> PCI devices that are 64-bit DMA capable should set the coherent
>> DMA mask as well as the streaming DMA mask. On some architectures,
>> these are managed separately, and so the coherent DMA mask will be
>> left at its default value of 32 if it is not set explicitly. This
>> results in errors such as
>>
>>  r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
>>  hwdev DMA mask = 0x, dev_addr = 0x0080fbfff000
>>  swiotlb: coherent allocation failed for device :02:00.0 size=4096
>>  CPU: 0 PID: 1062 Comm: systemd-udevd Not tainted 4.8.0+ #35
>>  Hardware name: AMD Seattle/Seattle, BIOS 10:53:24 Oct 13 2016
>>
>> on systems without memory that is 32-bit addressable by PCI devices.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>  ...
>> @@ -8281,6 +8282,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
>> struct pci_device_id *ent)
>>   dev->features |= NETIF_F_HIGHDMA;
>>   } else {
>>   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
>> + if (!rc)
>> + rc = pci_set_consistent_dma_mask(pdev, 
>> DMA_BIT_MASK(32));
>
> As you state 32-bit is the default, therefore this part of your patch is 
> unnecessary.

Perhaps, but the original code did not assume that either. Should I
remove the other call in a subsequent patch as well?


Re: [PATCH] mac80211: aes_ccm: move struct aead_req off the stack

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 14:15, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Fri, 2016-10-14 at 14:13 +0100, Ard Biesheuvel wrote:
>>
>> > But if we allocate things anyway, is it worth expending per-CPU
>> > buffers on these?
>>
>> Ehmm, maybe not. I could spin a v2 that allocates a bigger buffer,
>> and copies aad[] into it as well
>
> Copies in/out, I guess. Also there's B_0/J_0 for CCM/GCM, and the
> 'zero' thing that GMAC has.
>

Is the aad[] actually reused? I would assume it only affects the mac
on encryption, and the verification on decryption but I don't think we
actually need it back from the crypto routines.

>> That does not help the other algos though
>
> What do you mean?
>

Exactly what you said above :-) My patch only touches CCM but as you said,

"""
'Also there's B_0/J_0 for CCM/GCM, and the 'zero' thing that GMAC has.
"""


Re: [PATCH] mac80211: aes_ccm: move struct aead_req off the stack

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 14:10, Johannes Berg <johan...@sipsolutions.net> wrote:
>
>> So use kzalloc
>
> Do we really need kzalloc()? We have things on the stack right now, and
> don't initialize, so surely we don't really need to zero things?
>
>> This only addresses one half of the problem. The other problem, i.e.,
>> the fact that the aad[] array lives on the stack of the caller, is
>> handled adequately imo by the change proposed by Johannes.
>
> But if we allocate things anyway, is it worth expending per-CPU buffers
> on these?
>

Ehmm, maybe not. I could spin a v2 that allocates a bigger buffer, and
copies aad[] into it as well
That does not help the other algos though


[PATCH] mac80211: aes_ccm: move struct aead_req off the stack

2016-10-14 Thread Ard Biesheuvel
Some CCM implementations (such as the generic CCM wrapper in crypto/)
use scatterlists to map fields of struct aead_req. This means these
data structures cannot live in the vmalloc area, which means that in
the near future, they can no longer live on the stack either.

Given that these data structures have implementation-specific context fields,
it really depends on the particular driver whether this issue is likely to
occur or not, and so it seems best to simply move the entire data structure
into the direct-mapped kernel heap.

So use kzalloc/kfree to allocate and free the data structures. This pattern
already exists in the IPsec ESP driver, but in the future, we may need to
improve upon this by either moving the request into the SKB, or using a
slab cache to allocate/free the data structures.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---

This only addresses one half of the problem. The other problem, i.e., the
fact that the aad[] array lives on the stack of the caller, is handled
adequately imo by the change proposed by Johannes.
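For reference, the allocation in the patch is a single heap chunk sized for the generic request header plus the transform-specific context returned by crypto_aead_reqsize(). A userspace sketch of that sizing (the 64-byte header is a hypothetical stand-in for sizeof(struct aead_request); 456 bytes is the crypto/ccm context size quoted elsewhere in this thread):

```python
AEAD_REQUEST_HDR = 64      # hypothetical stand-in for sizeof(struct aead_request)

def alloc_aead_req(reqsize: int) -> bytearray:
    # One heap chunk: generic header followed by `reqsize` bytes of
    # implementation-private context (crypto_aead_reqsize(tfm) in the kernel).
    return bytearray(AEAD_REQUEST_HDR + reqsize)

req = alloc_aead_req(456)  # ~456 bytes of private context quoted for crypto/ccm
assert len(req) == AEAD_REQUEST_HDR + 456

# The implementation's private context (pctx in crypto/ccm.c) lives in the tail.
ctx = memoryview(req)[AEAD_REQUEST_HDR:]
assert len(ctx) == 456
print("request + private context allocated as one chunk")
```

Because the whole chunk comes from the directly mapped heap, scatterlist entries pointing into the private context are valid, unlike stack or vmalloc addresses.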

 net/mac80211/aes_ccm.c | 24 ++--
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/net/mac80211/aes_ccm.c b/net/mac80211/aes_ccm.c
index 7663c28ba353..a0ae8cebbe4e 100644
--- a/net/mac80211/aes_ccm.c
+++ b/net/mac80211/aes_ccm.c
@@ -23,13 +23,10 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
   size_t mic_len)
 {
struct scatterlist sg[3];
+   struct aead_request *aead_req;
 
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
-
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = kzalloc(sizeof(struct aead_request) +
+  crypto_aead_reqsize(tfm), GFP_ATOMIC);
 
sg_init_table(sg, 3);
   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
@@ -41,6 +38,7 @@ void ieee80211_aes_ccm_encrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_ad(aead_req, sg[0].length);
 
crypto_aead_encrypt(aead_req);
+   kfree(aead_req);
 }
 
 int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 *b_0, u8 *aad,
@@ -48,15 +46,14 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
  size_t mic_len)
 {
struct scatterlist sg[3];
-   char aead_req_data[sizeof(struct aead_request) +
-  crypto_aead_reqsize(tfm)]
-   __aligned(__alignof__(struct aead_request));
-   struct aead_request *aead_req = (void *) aead_req_data;
+   struct aead_request *aead_req;
+   int err;
 
if (data_len == 0)
return -EINVAL;
 
-   memset(aead_req, 0, sizeof(aead_req_data));
+   aead_req = kzalloc(sizeof(struct aead_request) +
+  crypto_aead_reqsize(tfm), GFP_ATOMIC);
 
sg_init_table(sg, 3);
   sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
@@ -67,7 +64,10 @@ int ieee80211_aes_ccm_decrypt(struct crypto_aead *tfm, u8 
*b_0, u8 *aad,
aead_request_set_crypt(aead_req, sg, sg, data_len + mic_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
 
-   return crypto_aead_decrypt(aead_req);
+   err = crypto_aead_decrypt(aead_req);
+   kfree(aead_req);
+
+   return err;
 }
 
 struct crypto_aead *ieee80211_aes_key_setup_encrypt(const u8 key[],
-- 
2.7.4



[PATCH] r8169: set coherent DMA mask as well as streaming DMA mask

2016-10-14 Thread Ard Biesheuvel
PCI devices that are 64-bit DMA capable should set the coherent
DMA mask as well as the streaming DMA mask. On some architectures,
these are managed separately, and so the coherent DMA mask will be
left at its default value of 32 if it is not set explicitly. This
results in errors such as

 r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
 hwdev DMA mask = 0x, dev_addr = 0x0080fbfff000
 swiotlb: coherent allocation failed for device :02:00.0 size=4096
 CPU: 0 PID: 1062 Comm: systemd-udevd Not tainted 4.8.0+ #35
 Hardware name: AMD Seattle/Seattle, BIOS 10:53:24 Oct 13 2016

on systems without memory that is 32-bit addressable by PCI devices.

Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 drivers/net/ethernet/realtek/r8169.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c 
b/drivers/net/ethernet/realtek/r8169.c
index e55638c7505a..04957a36b11f 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -8273,7 +8273,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if ((sizeof(dma_addr_t) > 4) &&
(use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
  tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
-   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
+   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) &&
+   !pci_set_consistent_dma_mask(pdev, DMA_BIT_MASK(64))) {
 
/* CPlusCmd Dual Access Cycle is only needed for non-PCIe */
if (!pci_is_pcie(pdev))
@@ -8281,6 +8282,8 @@ static int rtl_init_one(struct pci_dev *pdev, const 
struct pci_device_id *ent)
dev->features |= NETIF_F_HIGHDMA;
} else {
rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+   if (!rc)
+   rc = pci_set_consistent_dma_mask(pdev, 
DMA_BIT_MASK(32));
if (rc < 0) {
netif_err(tp, probe, dev, "DMA configuration failed\n");
goto err_out_unmap_4;
-- 
2.7.4



Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 11:00, Johannes Berg <johan...@sipsolutions.net> wrote:
>
>> So why is the performance hit acceptable for ESP but not for WPA? We
>> could easily implement the same thing, i.e.,
>> kmalloc(GFP_ATOMIC)/kfree the aead_req struct rather than allocate it
>> on the stack
>
> Yeah, maybe we should. It's likely a much bigger allocation, but I
> don't actually know if that affects speed.
>
> In most cases where you want high performance we never hit this anyway
> since we'll have hardware crypto. I know for our (Intel's) devices we
> normally never hit these code paths.
>
> But on the other hand, you also did your changes for a reason, and the
> only reason I can see of that is performance. So you'd be the one with
> most "skin in the game", I guess?
>

Well, what sucks here is that the accelerated driver I implemented for
arm64 does not actually need this, as long as we take aad[] off the
stack. And note that the API was changed since my patch, to add aad[]
to the scatterlist: prior to this change, it used
aead_request_set_assoc() to set the associated data separately.


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 10:25, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Fri, 2016-10-14 at 10:21 +0100, Ard Biesheuvel wrote:
>
>> It is annotated with a TODO, though :-)
>>
>> 38320c70d282b (Herbert Xu   2008-01-28 19:35:05
>> -0800  41)
>>  * TODO: Use spare space in skb for this where possible.
>
> I saw that, but I don't think generally there will be spare space for
> it - the stuff there is likely far too big. Anyway ... same problem
> that we have.
>
> I'm not inclined to allocate ~500 bytes temporarily for every frame
> either though.
>
> Maybe we could try to manage it in mac80211, we'd "only" need 5 AEAD
> structs (which are today on the stack) in parallel for each key (4 TX,
> 1 RX), but in a typical case of having 3 keys that's already 7.5K worth
> of memory that we almost never use. Again, with more complexity, we
> could know that the TX will not be used if the driver does the TX, but
> the single RX one we'd need unconditionally... decisions decisions...
>

So why is the performance hit acceptable for ESP but not for WPA? We
could easily implement the same thing, i.e., kmalloc(GFP_ATOMIC)/kfree
the aead_req struct rather than allocate it on the stack
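The 7.5K figure quoted above is straightforward arithmetic; preallocating per-key request buffers would need roughly:

```python
reqs_per_key = 4 + 1   # 4 TX + 1 RX aead_req slots per key, per the thread
keys = 3               # typical number of installed keys
req_size = 500         # ~456 bytes measured for crypto/ccm, rounded up

total = reqs_per_key * keys * req_size
assert total == 7500   # ~7.5K of memory that would sit unused with HW crypto
print(total)
```

That is the memory cost being weighed against a kmalloc/kfree on every software-encrypted frame.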


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 10:10, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Fri, 2016-10-14 at 10:05 +0100, Ard Biesheuvel wrote:
>>
>> Indeed. And the decrypt path does the same for auth_tag[].
>
> Hadn't gotten that far, due to the BUG_ON() in CONFIG_DEBUG_SG in the
> encrypt path :)
>
>> But that still means there are two separate problems here, one which
>> affects the WPA code, and one that only affects the generic CCM
>> chaining mode (but not the accelerated arm64 implementation)
>
> Yes. The generic CCM chaining still doesn't typically have a request on
> the stack though. In fact, ESP (net/ipv4/esp4.c) for example will do
> temporary allocations with kmalloc for every frame, it seems.
>

It is annotated with a TODO, though :-)

38320c70d282b (Herbert Xu   2008-01-28 19:35:05 -0800  41)
 * TODO: Use spare space in skb for this where possible.

>> Unsurprisingly, I would strongly prefer those to be fixed properly
>> rather than backing out my patch, but I'm happy to help out whichever
>> solution we reach consensus on.
>
> Yeah, obviously, it would be good to use the accelerated versions after
> all.
>
>> I will check whether this removes the issue when not using
>> crypto/ccm.ko
>
> Ok. I think we can probably live with having those 48 bytes in per-CPU
> buffers, but I suppose we don't really want to have ~500.
>

Agreed.


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 09:55, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Fri, 2016-10-14 at 09:47 +0100, Ard Biesheuvel wrote:
>>
>> Do you have a reference for the sg_set_buf() call on odata?
>> crypto/ccm.c does not seem to have it (afaict),
>
> It's indirect - crypto_ccm_encrypt() calls crypto_ccm_init_crypt()
> which does it.
>

Indeed. And the decrypt path does the same for auth_tag[].

But that still means there are two separate problems here, one which
affects the WPA code, and one that only affects the generic CCM
chaining mode (but not the accelerated arm64 implementation)

Unsurprisingly, I would strongly prefer those to be fixed properly
rather than backing out my patch, but I'm happy to help out whichever
solution we reach consensus on.

>> and the same problem
>> does not exist in the accelerated arm64 implementation. In the mean
>> time, I will try and see if we can move aad[] off the stack in the
>> WPA code.
>
> I had that with per-CPU buffers, just sent the patch upthread.
>

I will check whether this removes the issue when not using crypto/ccm.ko


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 09:42, Johannes Berg <johan...@sipsolutions.net> wrote:
> On Fri, 2016-10-14 at 09:41 +0100, Ard Biesheuvel wrote:
>
>> > I assume the stack buffer itself is not the problem here, but aad,
>> > which is allocated on the stack one frame up.
>> > Do we really need to revert the whole patch to fix that?
>>
>> Ah never mind, this is about 'odata'. Apologies, should have read
>> first
>
> Right, odata also goes into an sg list and further on.
>
> I think we should wait for Herbert to chime in before we do any further
> work though, perhaps he has any better ideas.
>

Do you have a reference for the sg_set_buf() call on odata?
crypto/ccm.c does not seem to have it (afaict), and the same problem
does not exist in the accelerated arm64 implementation. In the mean
time, I will try and see if we can move aad[] off the stack in the WPA
code.


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 09:39, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> On 14 October 2016 at 09:28, Johannes Berg <johan...@sipsolutions.net> wrote:
>>
>>>1. revert that patch (doing so would need some major adjustments now,
>>>   since it's pretty old and a number of new things were added in the
>>>   meantime)
>>
>> This it will have to be, I guess.
>>
>>>2. allocate a per-CPU buffer for all the things that we put on the
>>>   stack and use in SG lists, those are:
>>>* CCM/GCM: AAD (32B), B_0/J_0 (16B)
>>>* GMAC: AAD (20B), zero (16B)
>>>* (not sure why CMAC isn't using this API, but it would be like GMAC)
>>
>> This doesn't work - I tried to move the mac80211 buffers, but because
>> we also put the struct aead_request on the stack, and crypto_ccm has
>> the "odata" in there, and we can't separate the odata from that struct,
>> we'd have to also put that into a per-CPU buffer, but it's very big -
>> 456 bytes for CCM, didn't measure the others but I'd expect them to be
>> larger, if different.
>>
>> I don't think we can allocate half a kb for each CPU just to be able to
>> possibly use the acceleration here. We can't even make that conditional
>> on not having hardware crypto in the wifi NIC because drivers are
>> always allowed to pass undecrypted frames, regardless of whether or not
>> HW crypto was attempted, so we don't know upfront if we'll have to
>> decrypt anything in software...
>>
>> Given that, I think we have had a bug in here basically since Ard's
>> patch, we never should've put these structs on the stack. Herbert, you
>> also touched this later and converted the API usage, did you see the
>> way the stack is used here and think it should be OK, or did you simply
>> not realize that?
>>
>> Ard, are you able to help out working on a revert of your patch? That
>> would require also reverting a number of other patches (various fixes,
>> API adjustments, etc. to the AEAD usage), but the more complicated part
>> is that in the meantime Jouni introduced GCMP and CCMP-256, both of
>> which we of course need to retain.
>>
>
> I am missing some context here, but could you explain what exactly is
> the problem here?
>
> Look at this code
>
> """
> struct scatterlist sg[3];
>
> char aead_req_data[sizeof(struct aead_request) +
> crypto_aead_reqsize(tfm)]
> __aligned(__alignof__(struct aead_request));
> struct aead_request *aead_req = (void *) aead_req_data;
>
> memset(aead_req, 0, sizeof(aead_req_data));
>
> sg_init_table(sg, 3);
> sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
> sg_set_buf(&sg[1], data, data_len);
> sg_set_buf(&sg[2], mic, mic_len);
>
> aead_request_set_tfm(aead_req, tfm);
> aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
> aead_request_set_ad(aead_req, sg[0].length);
> """
>
> I assume the stack buffer itself is not the problem here, but aad,
> which is allocated on the stack one frame up.
> Do we really need to revert the whole patch to fix that?

Ah never mind, this is about 'odata'. Apologies, should have read first


Re: [mac80211] BUG_ON with current -git (4.8.0-11417-g24532f7)

2016-10-14 Thread Ard Biesheuvel
On 14 October 2016 at 09:28, Johannes Berg <johan...@sipsolutions.net> wrote:
>
>>1. revert that patch (doing so would need some major adjustments now,
>>   since it's pretty old and a number of new things were added in the
>>   meantime)
>
> This it will have to be, I guess.
>
>>2. allocate a per-CPU buffer for all the things that we put on the
>>   stack and use in SG lists, those are:
>>* CCM/GCM: AAD (32B), B_0/J_0 (16B)
>>* GMAC: AAD (20B), zero (16B)
>>* (not sure why CMAC isn't using this API, but it would be like GMAC)
>
> This doesn't work - I tried to move the mac80211 buffers, but because
> we also put the struct aead_request on the stack, and crypto_ccm has
> the "odata" in there, and we can't separate the odata from that struct,
> we'd have to also put that into a per-CPU buffer, but it's very big -
> 456 bytes for CCM, didn't measure the others but I'd expect them to be
> larger, if different.
>
> I don't think we can allocate half a kb for each CPU just to be able to
> possibly use the acceleration here. We can't even make that conditional
> on not having hardware crypto in the wifi NIC because drivers are
> always allowed to pass undecrypted frames, regardless of whether or not
> HW crypto was attempted, so we don't know upfront if we'll have to
> decrypt anything in software...
>
> Given that, I think we have had a bug in here basically since Ard's
> patch, we never should've put these structs on the stack. Herbert, you
> also touched this later and converted the API usage, did you see the
> way the stack is used here and think it should be OK, or did you simply
> not realize that?
>
> Ard, are you able to help out working on a revert of your patch? That
> would require also reverting a number of other patches (various fixes,
> API adjustments, etc. to the AEAD usage), but the more complicated part
> is that in the meantime Jouni introduced GCMP and CCMP-256, both of
> which we of course need to retain.
>

I am missing some context here, but could you explain what exactly is
the problem here?

Look at this code

"""
struct scatterlist sg[3];

char aead_req_data[sizeof(struct aead_request) +
crypto_aead_reqsize(tfm)]
__aligned(__alignof__(struct aead_request));
struct aead_request *aead_req = (void *) aead_req_data;

memset(aead_req, 0, sizeof(aead_req_data));

sg_init_table(sg, 3);
sg_set_buf(&sg[0], &aad[2], be16_to_cpup((__be16 *)aad));
sg_set_buf(&sg[1], data, data_len);
sg_set_buf(&sg[2], mic, mic_len);

aead_request_set_tfm(aead_req, tfm);
aead_request_set_crypt(aead_req, sg, sg, data_len, b_0);
aead_request_set_ad(aead_req, sg[0].length);
"""

I assume the stack buffer itself is not the problem here, but aad,
which is allocated on the stack one frame up.
Do we really need to revert the whole patch to fix that?


[PATCH v2] r8169: default to 64-bit DMA on recent PCIe chips

2016-05-14 Thread Ard Biesheuvel
The current logic around the 'use_dac' module parameter prevents the
r8169 driver from being loadable on 64-bit systems without any RAM
below 4 GB when the parameter is left at its default value.

So introduce a new default value -1 which indicates that 64-bit DMA
should be enabled on sufficiently recent PCIe chips, i.e., versions
RTL_GIGA_MAC_VER_18 or later. Explicit param values of 0 or 1 retain
the existing behavior of unconditionally enabling/disabling 64-bit DMA
on 64-bit architectures (i.e., regardless of the type and version of the
chip).

Since PCIe chips do not need the CPlusCmd Dual Address Cycle to be set,
make that conditional on the device type as well.

Cc: Realtek linux nic maintainers <nic_s...@realtek.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
This is a followup to 'r8169: default to 64-bit DMA on systems without memory
below 4 GB' [1]. At the request of Francois, this version bases the decision
whether to use 64-bit DMA by default on whether the device is PCIe and
sufficiently recent, rather than whether the platform requires 64-bit DMA
because it does not have any memory below 4 GB to begin with. This is safer,
since it will prevent the use of such problematic cards on these platforms.

v2: drop unnecessary reordering of rtl8169_get_mac_version() call with pcie
check

[1] http://article.gmane.org/gmane.linux.network/412246

 drivers/net/ethernet/realtek/r8169.c | 44 +++-
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 94f08f1e841c..0e62d74b09b3 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -345,7 +345,7 @@ static const struct pci_device_id rtl8169_pci_tbl[] = {
 MODULE_DEVICE_TABLE(pci, rtl8169_pci_tbl);
 
 static int rx_buf_sz = 16383;
-static int use_dac;
+static int use_dac = -1;
 static struct {
u32 msg_enable;
 } debug = { -1 };
@@ -8224,20 +8224,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_mwi_2;
}
 
-   tp->cp_cmd = 0;
-
-   if ((sizeof(dma_addr_t) > 4) &&
-   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) && use_dac) {
-   tp->cp_cmd |= PCIDAC;
-   dev->features |= NETIF_F_HIGHDMA;
-   } else {
-   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
-   if (rc < 0) {
-   netif_err(tp, probe, dev, "DMA configuration failed\n");
-   goto err_out_free_res_3;
-   }
-   }
-
/* ioremap MMIO region */
ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
if (!ioaddr) {
@@ -8253,6 +8239,25 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
/* Identify chip attached to board */
rtl8169_get_mac_version(tp, dev, cfg->default_ver);
 
+   tp->cp_cmd = 0;
+
+   if ((sizeof(dma_addr_t) > 4) &&
+   (use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
+ tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
+   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
+
+   /* CPlusCmd Dual Address Cycle is only needed for non-PCIe */
+   if (!pci_is_pcie(pdev))
+   tp->cp_cmd |= PCIDAC;
+   dev->features |= NETIF_F_HIGHDMA;
+   } else {
+   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+   if (rc < 0) {
+   netif_err(tp, probe, dev, "DMA configuration failed\n");
+   goto err_out_unmap_4;
+   }
+   }
+
rtl_init_rxcfg(tp);
 
rtl_irq_disable(tp);
@@ -8412,12 +8417,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
  &tp->counters_phys_addr, GFP_KERNEL);
if (!tp->counters) {
rc = -ENOMEM;
-   goto err_out_msi_4;
+   goto err_out_msi_5;
}
 
rc = register_netdev(dev);
if (rc < 0)
-   goto err_out_cnt_5;
+   goto err_out_cnt_6;
 
pci_set_drvdata(pdev, dev);
 
@@ -8451,12 +8456,13 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 out:
return rc;
 
-err_out_cnt_5:
+err_out_cnt_6:
dma_free_coherent(&pdev->dev, sizeof(*tp->counters), tp->counters,
  tp->counters_phys_addr);
-err_out_msi_4:
+err_out_msi_5:
netif_napi_del(&tp->napi);
rtl_disable_msi(pdev, tp);
+err_out_unmap_4:
iounmap(ioaddr);
 err_out_free_res_3:
pci_release_regions(pdev);
-- 
2.7.4



Re: [PATCH] r8169: default to 64-bit DMA on recent PCIe chips

2016-05-14 Thread Ard Biesheuvel
On 14 May 2016 at 12:41, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
> The current logic around the 'use_dac' module parameter prevents the
> r8169 driver from being loadable on 64-bit systems without any RAM
> below 4 GB when the parameter is left at its default value.
>
> So introduce a new default value -1 which indicates that 64-bit DMA
> should be enabled on sufficiently recent PCIe chips, i.e., versions
> RTL_GIGA_MAC_VER_18 or later. Explicit param values of 0 or 1 retain
> the existing behavior of unconditionally enabling/disabling 64-bit DMA
> on 64-bit architectures (i.e., regardless of the type and version of the
> chip).
>
> Since PCIe chips do not need the CPlusCmd Dual Address Cycle bit to be set,
> make that conditional on the device type as well.
>
> Cc: Realtek linux nic maintainers <nic_s...@realtek.com>
> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
> ---
>
> This is a followup to 'r8169: default to 64-bit DMA on systems without memory
> below 4 GB' [1]. At the request of Francois, this version bases the decision
> whether to use 64-bit DMA by default on whether the device is PCIe and
> sufficiently recent, rather than whether the platform requires 64-bit DMA
> because it does not have any memory below 4 GB to begin with. This is safer,
> since it will prevent the use of such problematic cards on these platforms.
>
> [1] http://article.gmane.org/gmane.linux.network/412246
>
>  drivers/net/ethernet/realtek/r8169.c | 48 +++-
>  1 file changed, 27 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 94f08f1e841c..80bb8ea265ad 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -345,7 +345,7 @@ static const struct pci_device_id rtl8169_pci_tbl[] = {
>  MODULE_DEVICE_TABLE(pci, rtl8169_pci_tbl);
>
>  static int rx_buf_sz = 16383;
> -static int use_dac;
> +static int use_dac = -1;
>  static struct {
> u32 msg_enable;
>  } debug = { -1 };
> @@ -8224,20 +8224,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> goto err_out_mwi_2;
> }
>
> -   tp->cp_cmd = 0;
> -
> -   if ((sizeof(dma_addr_t) > 4) &&
> -   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) && use_dac) {
> -   tp->cp_cmd |= PCIDAC;
> -   dev->features |= NETIF_F_HIGHDMA;
> -   } else {
> -   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> -   if (rc < 0) {
> -   netif_err(tp, probe, dev, "DMA configuration failed\n");
> -   goto err_out_free_res_3;
> -   }
> -   }
> -
> /* ioremap MMIO region */
> ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
> if (!ioaddr) {
> @@ -8247,11 +8233,30 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
> }
> tp->mmio_addr = ioaddr;
>
> +   /* Identify chip attached to board */
> +   rtl8169_get_mac_version(tp, dev, cfg->default_ver);
> +
> if (!pci_is_pcie(pdev))
> netif_info(tp, probe, dev, "not PCI Express\n");
>
> -   /* Identify chip attached to board */
> -   rtl8169_get_mac_version(tp, dev, cfg->default_ver);

The reordering above is actually unnecessary, it crept in inadvertently.

> +   tp->cp_cmd = 0;
> +
> +   if ((sizeof(dma_addr_t) > 4) &&
> +   (use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
> + tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
> +   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
> +
> +   /* CPlusCmd Dual Address Cycle is only needed for non-PCIe */
> +   if (!pci_is_pcie(pdev))
> +   tp->cp_cmd |= PCIDAC;
> +   dev->features |= NETIF_F_HIGHDMA;
> +   } else {
> +   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
> +   if (rc < 0) {
> +   netif_err(tp, probe, dev, "DMA configuration failed\n");
> +   goto err_out_unmap_4;
> +   }
> +   }
>
> rtl_init_rxcfg(tp);
>
> @@ -8412,12 +8417,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
>    &tp->counters_phys_addr, GFP_KERNEL);
> if (!tp->counters) {
> rc = -ENOMEM;
> - 

[PATCH] r8169: default to 64-bit DMA on recent PCIe chips

2016-05-14 Thread Ard Biesheuvel
The current logic around the 'use_dac' module parameter prevents the
r8169 driver from being loadable on 64-bit systems without any RAM
below 4 GB when the parameter is left at its default value.

So introduce a new default value -1 which indicates that 64-bit DMA
should be enabled on sufficiently recent PCIe chips, i.e., versions
RTL_GIGA_MAC_VER_18 or later. Explicit param values of 0 or 1 retain
the existing behavior of unconditionally enabling/disabling 64-bit DMA
on 64-bit architectures (i.e., regardless of the type and version of the
chip).

Since PCIe chips do not need the CPlusCmd Dual Address Cycle bit to be set,
make that conditional on the device type as well.

Cc: Realtek linux nic maintainers <nic_s...@realtek.com>
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---

This is a followup to 'r8169: default to 64-bit DMA on systems without memory
below 4 GB' [1]. At the request of Francois, this version bases the decision
whether to use 64-bit DMA by default on whether the device is PCIe and
sufficiently recent, rather than whether the platform requires 64-bit DMA
because it does not have any memory below 4 GB to begin with. This is safer,
since it will prevent the use of such problematic cards on these platforms.

[1] http://article.gmane.org/gmane.linux.network/412246

 drivers/net/ethernet/realtek/r8169.c | 48 +++-
 1 file changed, 27 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 94f08f1e841c..80bb8ea265ad 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -345,7 +345,7 @@ static const struct pci_device_id rtl8169_pci_tbl[] = {
 MODULE_DEVICE_TABLE(pci, rtl8169_pci_tbl);
 
 static int rx_buf_sz = 16383;
-static int use_dac;
+static int use_dac = -1;
 static struct {
u32 msg_enable;
 } debug = { -1 };
@@ -8224,20 +8224,6 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
goto err_out_mwi_2;
}
 
-   tp->cp_cmd = 0;
-
-   if ((sizeof(dma_addr_t) > 4) &&
-   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) && use_dac) {
-   tp->cp_cmd |= PCIDAC;
-   dev->features |= NETIF_F_HIGHDMA;
-   } else {
-   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
-   if (rc < 0) {
-   netif_err(tp, probe, dev, "DMA configuration failed\n");
-   goto err_out_free_res_3;
-   }
-   }
-
/* ioremap MMIO region */
ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
if (!ioaddr) {
@@ -8247,11 +8233,30 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
}
tp->mmio_addr = ioaddr;
 
+   /* Identify chip attached to board */
+   rtl8169_get_mac_version(tp, dev, cfg->default_ver);
+
if (!pci_is_pcie(pdev))
netif_info(tp, probe, dev, "not PCI Express\n");
 
-   /* Identify chip attached to board */
-   rtl8169_get_mac_version(tp, dev, cfg->default_ver);
+   tp->cp_cmd = 0;
+
+   if ((sizeof(dma_addr_t) > 4) &&
+   (use_dac == 1 || (use_dac == -1 && pci_is_pcie(pdev) &&
+ tp->mac_version >= RTL_GIGA_MAC_VER_18)) &&
+   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
+
+   /* CPlusCmd Dual Address Cycle is only needed for non-PCIe */
+   if (!pci_is_pcie(pdev))
+   tp->cp_cmd |= PCIDAC;
+   dev->features |= NETIF_F_HIGHDMA;
+   } else {
+   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
+   if (rc < 0) {
+   netif_err(tp, probe, dev, "DMA configuration failed\n");
+   goto err_out_unmap_4;
+   }
+   }
 
rtl_init_rxcfg(tp);
 
@@ -8412,12 +8417,12 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
  &tp->counters_phys_addr, GFP_KERNEL);
if (!tp->counters) {
rc = -ENOMEM;
-   goto err_out_msi_4;
+   goto err_out_msi_5;
}
 
rc = register_netdev(dev);
if (rc < 0)
-   goto err_out_cnt_5;
+   goto err_out_cnt_6;
 
pci_set_drvdata(pdev, dev);
 
@@ -8451,12 +8456,13 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 out:
return rc;
 
-err_out_cnt_5:
+err_out_cnt_6:
dma_free_coherent(&pdev->dev, sizeof(*tp->counters), tp->counters,
  tp->counters_phys_addr);
-err_out_msi_4:
+err_out_msi_5:
netif_napi_del(&tp->napi);
rtl_disable_msi(pdev, tp);
+err_out_unmap_4:
iounmap(ioaddr);
 err_out_free_res_3:
pci_release_regions(pdev);
-- 
2.7.4



Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-12 Thread Ard Biesheuvel
On 12 May 2016 at 16:09, Francois Romieu <rom...@fr.zoreil.com> wrote:
>> On 12 May 2016 at 01:58, David Miller <da...@davemloft.net> wrote:
>> > From: Ard Biesheuvel <ard.biesheu...@linaro.org>
>> > Date: Wed, 11 May 2016 09:47:49 +0200
> [...]
>> > I think we should just seriously consider changing the default, it's
>> > a really outdated reasoning behind the current default setting.  Maybe
>> > relevant a decade ago, but probably not now.
>> >
> > And if the card is completely dysfunctional in said configuration, the
>> > default is definitely wrong.
>>
>> The card is indeed completely dysfunctional. So we could try to
>> resurrect 353176888386 ("r8169: enable 64-bit DMA by default for PCI
>> Express devices"), and instead of backing it out again if regressions
>> are reported, blacklist the particular chips. This is a much better
>> approach, since then we can also print some kind of diagnostic on
>> those arm64 systems why such a blacklisted NIC is not supported.
>
> I doubt there will be much *reporting* from broken systems that
> include plain old PCI realtek chipsets (r8169.c::RTL_CFG_0). Changing
> the default for those is imnsho asking for troubles without clear
> benefit (experimental evidence suggests that smsc etherpower II grows
> older more easily than plain pci 8169 :o/ ).
>

By resurrecting 353176888386, I mean the patch that changes the
default for PCI express devices, so I think we are in agreement here?

> I'd rather leave these alone and change the default for the PCI Express
> chipsets. Btw, while it does not seem to hurt, they should not need any
> CPlusCmd Dual Access Cycle tweak either. Realtek may establish it (Lin ?)
>
> A few news from the "pathologically better safe than sorry" squad:
> I have switched the default on a couple of non-critical production
> servers that include 8168c (RTL_GIGA_MAC_VER_22). It should give an hint
> for hardware from 2008 ~ 2009. I'll do some basic sanity testing with
> different chipsets.
>

Thanks for testing that. In the mean time, I will dust off the patch
and rebase it to today's -next.


Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-12 Thread Ard Biesheuvel
On 12 May 2016 at 01:58, David Miller <da...@davemloft.net> wrote:
> From: Ard Biesheuvel <ard.biesheu...@linaro.org>
> Date: Wed, 11 May 2016 09:47:49 +0200
>
>> The current logic around the 'use_dac' module parameter prevents the
>> r8169 driver from being loadable on 64-bit systems without any RAM
>> below 4 GB when the parameter is left at its default value.
>> So introduce a new default value -1 which indicates that 64-bit DMA
>> should be enabled implicitly, but only if setting a 32-bit DMA mask
>> has failed earlier. This should prevent any regressions like the ones
>> caused by previous attempts to change this code.
>>
>> Cc: Realtek linux nic maintainers <nic_s...@realtek.com>
>> Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
>
> I think we should just seriously consider changing the default, it's
> a really outdated reasoning behind the current default setting.  Maybe
> relevant a decade ago, but probably not now.
>
> And if the card is completely dysfunctional in said configuration, the
> default is definitely wrong.

The card is indeed completely dysfunctional. So we could try to
resurrect 353176888386 ("r8169: enable 64-bit DMA by default for PCI
Express devices"), and instead of backing it out again if regressions
are reported, blacklist the particular chips. This is a much better
approach, since then we can also print some kind of diagnostic on
those arm64 systems why such a blacklisted NIC is not supported.


Re: [PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread Ard Biesheuvel
On 11 May 2016 at 22:31, Francois Romieu <rom...@fr.zoreil.com> wrote:
> Ard Biesheuvel <ard.biesheu...@linaro.org> :
>> The current logic around the 'use_dac' module parameter prevents the
>> r8169 driver from being loadable on 64-bit systems without any RAM
>> below 4 GB when the parameter is left at its default value.
>> So introduce a new default value -1 which indicates that 64-bit DMA
>> should be enabled implicitly, but only if setting a 32-bit DMA mask
>> has failed earlier. This should prevent any regressions like the ones
>> caused by previous attempts to change this code.
>
> I am not a huge fan but if you really need it...
>
> Which current kernel arches do exhibit the interesting
> f*ck-legacy-32-bits-only-devices property you just described ?
>

It has little to do with f*cking legacy 32-bits-only-devices if DRAM
simply starts at 0x80__. This is on an AMD arm64 chip.

> [...]
>> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
>> index 94f08f1e841c..a49e8a58e539 100644
>> --- a/drivers/net/ethernet/realtek/r8169.c
>> +++ b/drivers/net/ethernet/realtek/r8169.c
> [...]
>> @@ -859,7 +859,8 @@ struct rtl8169_private {
>>  MODULE_AUTHOR("Realtek and the Linux r8169 crew <netdev@vger.kernel.org>");
>>  MODULE_DESCRIPTION("RealTek RTL-8169 Gigabit Ethernet driver");
>>  module_param(use_dac, int, 0);
>> -MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
>> +MODULE_PARM_DESC(use_dac,
>> + "Enable PCI DAC. Unsafe on 32 bit PCI slot (default -1: enable on 64-bit archs only if needed)");
>
> Nit: the parameter is bizarre enough that you could leave the original
> description.
>

OK, if you prefer. Should I send a v2?


[PATCH] r8169: default to 64-bit DMA on systems without memory below 4 GB

2016-05-11 Thread Ard Biesheuvel
The current logic around the 'use_dac' module parameter prevents the
r8169 driver from being loadable on 64-bit systems without any RAM
below 4 GB when the parameter is left at its default value.
So introduce a new default value -1 which indicates that 64-bit DMA
should be enabled implicitly, but only if setting a 32-bit DMA mask
has failed earlier. This should prevent any regressions like the ones
caused by previous attempts to change this code.

Cc: Realtek linux nic maintainers <nic_s...@realtek.com> 
Signed-off-by: Ard Biesheuvel <ard.biesheu...@linaro.org>
---
 drivers/net/ethernet/realtek/r8169.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 94f08f1e841c..a49e8a58e539 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -345,7 +345,7 @@ static const struct pci_device_id rtl8169_pci_tbl[] = {
 MODULE_DEVICE_TABLE(pci, rtl8169_pci_tbl);
 
 static int rx_buf_sz = 16383;
-static int use_dac;
+static int use_dac = -1;
 static struct {
u32 msg_enable;
 } debug = { -1 };
@@ -859,7 +859,8 @@ struct rtl8169_private {
 MODULE_AUTHOR("Realtek and the Linux r8169 crew <netdev@vger.kernel.org>");
 MODULE_DESCRIPTION("RealTek RTL-8169 Gigabit Ethernet driver");
 module_param(use_dac, int, 0);
-MODULE_PARM_DESC(use_dac, "Enable PCI DAC. Unsafe on 32 bit PCI slot.");
+MODULE_PARM_DESC(use_dac,
+   "Enable PCI DAC. Unsafe on 32 bit PCI slot (default -1: enable on 64-bit archs only if needed)");
 module_param_named(debug, debug.msg_enable, int, 0);
 MODULE_PARM_DESC(debug, "Debug verbosity level (0=none, ..., 16=all)");
 MODULE_LICENSE("GPL");
@@ -8226,12 +8227,16 @@ static int rtl_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
 
tp->cp_cmd = 0;
 
+   /* Tentatively set the DMA mask to 32 bits. This may fail
+* on 64-bit architectures without any RAM below 4 GB.
+*/
+   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
if ((sizeof(dma_addr_t) > 4) &&
-   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64)) && use_dac) {
+   (use_dac == 1 || (use_dac == -1 && rc < 0)) &&
+   !pci_set_dma_mask(pdev, DMA_BIT_MASK(64))) {
tp->cp_cmd |= PCIDAC;
dev->features |= NETIF_F_HIGHDMA;
} else {
-   rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
if (rc < 0) {
netif_err(tp, probe, dev, "DMA configuration failed\n");
goto err_out_free_res_3;
-- 
2.7.4