Re: SSE instructions for fast packet copy?

2017-05-08 Thread Benjamin Poirier
On 2017/05/04 22:50, Tom Herbert wrote:
> Hi,
> 
> I am thinking about the possibility of using SSE in kernel for
> speeding up the kernel memcpy particularly for copy to userspace
> memory, and maybe even using the string instructions (like if we
> supported regex in something like eBPF). AFAIK we don't use SSE in
> kernel because of xmm register state needing to be saved across
> context switch. However, if we start busy-polling a CPU in kernel on
> network queues then there might not be any context switches to worry
> about. In this model we'd want to enable SSE per CPU.
> 
> Has this ever been tried before? Is this at all feasible? :-) Is it
> possible to enable SSE for kernel for just one CPU? (I found CPUID
> will return SSE supported, but don't see how to enable other than
> -msse for compiling).

This reminds me of what you tried in
c6e1a0d12ca7 net: Allow no-cache copy from user on transmit
(v3.0-rc1)
and that I reverted in
cdb3f4a31b64 net: Do not enable tx-nocache-copy by default
(v3.14-rc1)

Sure, it's not exactly the same thing...


RE: SSE instructions for fast packet copy?

2017-05-05 Thread David Laight
From: Tom Herbert
> Sent: 05 May 2017 06:51
> To: Linux Kernel Network Developers
> Subject: SSE instructions for fast packet copy?
> 
> Hi,
> 
> I am thinking about the possibility of using SSE in kernel for
> speeding up the kernel memcpy particularly for copy to userspace
> memory, and maybe even using the string instructions (like if we
> supported regex in something like eBPF). AFAIK we don't use SSE in
> kernel because of xmm register state needing to be saved across
> context switch. However, if we start busy-polling a CPU in kernel on
> network queues then there might not be any context switches to worry
> about. In this model we'd want to enable SSE per CPU.
> 
> Has this ever been tried before? Is this at all feasible? :-) Is it
> possible to enable SSE for kernel for just one CPU? (I found CPUID
> will return SSE supported, but don't see how to enable other than
> -msse for compiling).

Not even worth thinking about.
With recent intel cpus 'rep movsb' is optimised in the hardware
(for cached memory) and will run as fast as any other copy.

(There is a related fubar: memcpy_toio() is implemented as
memcpy(), which in turn becomes 'rep movsb', so it generates
repeated byte accesses to I/O memory.)

I'm pretty sure the FP registers are 'lazy saved'.
The cpu's sse registers (the entire FP register set) might
contain live values for a process that is running on a different cpu.
If that process executes an FP instruction it will fault and an IPI
is issued to get the registers written to the process's FP save area,
from where they can be loaded.
Any use of the sse registers would have to interact correctly
with that IPI code.

David
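
For comparison, here is a rough userspace sketch of the two copy styles
discussed above: a plain 'rep movsb' and a 16-byte-at-a-time loop through an
xmm register. It assumes x86-64 with GCC or Clang; the function names
copy_rep_movsb() and copy_sse2() are made up for illustration and are not
kernel APIs. It does not measure anything, it only shows what each approach
looks like.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy with the x86 string instruction.
 * The CPU decides internally how to move the bytes. */
void copy_rep_movsb(void *dst, const void *src, size_t len)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(len)
                     :
                     : "memory");
}

/* Hypothetical helper: copy 16 bytes per iteration through an xmm
 * register using unaligned loads and stores, then finish bytewise. */
void copy_sse2(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    while (len >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)s);
        _mm_storeu_si128((__m128i *)d, v);
        s += 16;
        d += 16;
        len -= 16;
    }
    while (len--)
        *d++ = *s++;
}
```

In userspace the xmm registers are free to use. In the kernel, the second
variant would first have to deal with the FP register ownership problem
described above, which the 'rep movsb' variant avoids entirely.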



SSE instructions for fast packet copy?

2017-05-04 Thread Tom Herbert
Hi,

I am thinking about the possibility of using SSE in kernel for
speeding up the kernel memcpy particularly for copy to userspace
memory, and maybe even using the string instructions (like if we
supported regex in something like eBPF). AFAIK we don't use SSE in
kernel because of xmm register state needing to be saved across
context switch. However, if we start busy-polling a CPU in kernel on
network queues then there might not be any context switches to worry
about. In this model we'd want to enable SSE per CPU.

Has this ever been tried before? Is this at all feasible? :-) Is it
possible to enable SSE for kernel for just one CPU? (I found CPUID
will return SSE supported, but don't see how to enable other than
-msse for compiling).

Thanks,
Tom
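
On the CPUID point in the message above, a minimal userspace sketch of the
feature check, assuming x86-64 and the <cpuid.h> helpers shipped with GCC and
Clang. The struct and function names are hypothetical. It reads the SSE and
SSE2 bits from leaf 1 and the "enhanced rep movsb" (ERMS) bit from leaf 7.
Note that it only reports what the CPU supports; it does not enable anything.

```c
#include <cpuid.h>    /* GCC/Clang wrappers around the CPUID instruction */
#include <stdbool.h>

/* Hypothetical container for the bits relevant to this thread. */
struct copy_features {
    bool sse;   /* CPUID.1:EDX bit 25 */
    bool sse2;  /* CPUID.1:EDX bit 26 */
    bool erms;  /* CPUID.(EAX=7,ECX=0):EBX bit 9, enhanced rep movsb */
};

struct copy_features detect_copy_features(void)
{
    struct copy_features f = { false, false, false };
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1: basic feature flags. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        f.sse  = (edx >> 25) & 1;
        f.sse2 = (edx >> 26) & 1;
    }

    /* Leaf 7, subleaf 0: structured extended feature flags. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        f.erms = (ebx >> 9) & 1;

    return f;
}
```

Whether kernel code may actually touch the xmm registers is a separate
question from what CPUID reports. In the kernel, that is gated by
kernel_fpu_begin()/kernel_fpu_end() rather than by a per-CPU enable switch.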