Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
On Tue, Apr 17, 2018 at 09:07:01AM +0200, Jesper Dangaard Brouer wrote:
> > > number should improve more).
> >
> > What is the number for the otherwise comparable setup without retpolines?
>
> Approx 12 Mpps.
>
> You forgot to handle the dma_direct_mapping_error() case, which still
> used the
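For illustration, a minimal sketch of the missed case (not the actual patch; in that kernel generation the direct-ops error check was a static helper, so its name and visibility here are assumptions): map/unmap were bypassed, but the error check still went through the retpolined ->mapping_error pointer.

#include <linux/dma-mapping.h>

/* Assumed helper: the direct-ops error check was static in that era,
 * so a real patch would have to expose something like this. */
int dma_direct_mapping_error(struct device *dev, dma_addr_t dma_addr);

/* Sketch: give dma_mapping_error() the same direct-ops bypass that
 * map/unmap got, so the hot path has no retpolined indirect call. */
static inline int sketch_dma_mapping_error(struct device *dev, dma_addr_t addr)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);

        if (ops == &dma_direct_ops)             /* fast path: direct call */
                return dma_direct_mapping_error(dev, addr);

        if (ops->mapping_error)                 /* slow path: retpoline */
                return ops->mapping_error(dev, addr);
        return 0;
}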

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Jesper Dangaard Brouer
On Mon, 16 Apr 2018 23:15:50 -0700 Christoph Hellwig wrote:
> On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> > On X86, swiotlb falls back (via get_dma_ops -> get_arch_dma_ops) to using
> > x86_swiotlb_dma_ops instead of swiotlb_dma_ops. I also included
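For context, a simplified sketch of the lookup under discussion, modeled on get_dma_ops() of that kernel generation (treat the details as assumptions): with no per-device ops, the architecture default is used, which on x86 at the time could resolve to x86_swiotlb_dma_ops rather than the generic swiotlb_dma_ops, so a bypass check against a single ops pointer can miss.

#include <linux/dma-mapping.h>

/* Sketch of the fallback: per-device ops win, otherwise the arch
 * default is used (on x86 possibly x86_swiotlb_dma_ops). */
static inline const struct dma_map_ops *sketch_get_dma_ops(struct device *dev)
{
        if (dev && dev->dma_ops)
                return dev->dma_ops;            /* per-device override */

        return get_arch_dma_ops(dev ? dev->bus : NULL);
}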

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
> I'm not sure if I am really a fan of trying to solve this in this way.
> It seems like this is going to be optimizing the paths for one case to
> the detriment of others. Historically, mapping and unmapping have always
> been expensive, especially in IOMMU-enabled environments.
> I

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-17 Thread Christoph Hellwig
On Mon, Apr 16, 2018 at 11:07:04PM +0200, Jesper Dangaard Brouer wrote:
> On X86, swiotlb falls back (via get_dma_ops -> get_arch_dma_ops) to using
> x86_swiotlb_dma_ops instead of swiotlb_dma_ops. I also included that
> in the fix patch below.

x86_swiotlb_dma_ops should not exist any more, and x86 now

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Jesper Dangaard Brouer
On Mon, 16 Apr 2018 05:27:06 -0700 Christoph Hellwig wrote:
> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
>
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Alexander Duyck
On Mon, Apr 16, 2018 at 5:27 AM, Christoph Hellwig wrote:
> Can you try the following hack which avoids indirect calls entirely
> for the fast path direct mapping case?
>
> ---
> From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Christoph Hellwig
Can you try the following hack which avoids indirect calls entirely
for the fast path direct mapping case?

---
From b256a008c1b305e6a1c2afe7c004c54ad2e96d4b Mon Sep 17 00:00:00 2001
From: Christoph Hellwig
Date: Mon, 16 Apr 2018 14:18:14 +0200
Subject: dma-mapping: bypass dma_ops
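A minimal sketch of the idea in the hack (simplified; the real patch touched more entry points, and names like dma_direct_ops and dma_direct_map_page() follow the kernel of that era): compare the resolved ops against the direct-mapping ops and take a direct call on the fast path, leaving the retpolined indirect call for everything else.

#include <linux/dma-mapping.h>
#include <linux/dma-direct.h>

/* Sketch: bypass the indirect ->map_page call when the device uses
 * the generic direct mapping, avoiding one retpoline per packet. */
static inline dma_addr_t sketch_dma_map_page(struct device *dev,
                                             struct page *page,
                                             unsigned long offset, size_t size,
                                             enum dma_data_direction dir,
                                             unsigned long attrs)
{
        const struct dma_map_ops *ops = get_dma_ops(dev);

        if (ops == &dma_direct_ops)             /* fast path: no retpoline */
                return dma_direct_map_page(dev, page, offset, size, dir, attrs);

        return ops->map_page(dev, page, offset, size, dir, attrs);
}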

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-16 Thread Jesper Dangaard Brouer
On Sat, 14 Apr 2018 21:29:26 +0200 David Woodhouse wrote:
> On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> > On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> > > I guess there is nothing we need to do!
> > >
> > > On x86, in case of no intel

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-14 Thread David Woodhouse
On Fri, 2018-04-13 at 19:26 +0200, Christoph Hellwig wrote:
> On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> > I guess there is nothing we need to do!
> >
> > On x86, when there is no Intel IOMMU or the IOMMU is disabled, you end up
> > in swiotlb for DMA API calls when the system has 4G

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Fri, Apr 13, 2018 at 10:12:41AM -0700, Tushar Dave wrote:
> I guess there is nothing we need to do!
>
> On x86, when there is no Intel IOMMU or the IOMMU is disabled, you end up
> in swiotlb for DMA API calls when the system has 4G memory.
> However, AFAICT, for 64-bit DMA capable devices the swiotlb DMA APIs
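To make the claim concrete, a sketch of why a 64-bit DMA capable device should not bounce under swiotlb, simplified from the swiotlb mapping path of that era (sketch_bounce_map() is a hypothetical stand-in for the real bounce-buffer slow path): bouncing only happens when the address fails the device's DMA mask.

#include <linux/dma-direct.h>

/* Hypothetical stand-in for the real map_single()/bounce slow path. */
dma_addr_t sketch_bounce_map(struct device *dev, phys_addr_t phys, size_t size);

/* Sketch: a device with a 64-bit DMA mask passes dma_capable() and
 * never reaches the bounce buffer, even though swiotlb ops are used. */
static dma_addr_t sketch_swiotlb_map(struct device *dev, phys_addr_t phys,
                                     size_t size)
{
        dma_addr_t dev_addr = phys_to_dma(dev, phys);

        if (dma_capable(dev, dev_addr, size))
                return dev_addr;                /* fast path: no bounce */

        return sketch_bounce_map(dev, phys, size);      /* slow path */
}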

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Tushar Dave
On 04/12/2018 07:56 AM, Christoph Hellwig wrote:
On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
---
Implement support for keeping the DMA mapping through the XDP return
call, to remove RX

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-13 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 05:31:31PM +0200, Jesper Dangaard Brouer wrote:
> > I guess that is because x86 selects it as the default as soon as
> > we have more than 4G memory.
>
> I was also confused about why I ended up using SWIOTLB (SoftWare IO-TLB);
> that might explain it.

And I'm not hitting the

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
On Thu, 12 Apr 2018 16:56:53 +0200 Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> > On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > > ---
> > > Implement support for keeping the DMA mapping

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 04:51:23PM +0200, Christoph Hellwig wrote:
> On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> > ---
> > Implement support for keeping the DMA mapping through the XDP return
> > call, to remove RX map/unmap calls. Implement bulking for

Re: XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Christoph Hellwig
On Thu, Apr 12, 2018 at 03:50:29PM +0200, Jesper Dangaard Brouer wrote:
> ---
> Implement support for keeping the DMA mapping through the XDP return
> call, to remove RX map/unmap calls. Implement bulking for the XDP
> ndo_xdp_xmit and XDP return frame APIs. Bulking allows performing DMA
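As an illustration of the bulking idea (a hypothetical driver-side sketch, not the actual patchset; sketch_xmit_one() and sketch_kick_tx() are stand-ins): push an array of frames in one call so per-packet costs, retpolined indirect calls included, are amortized, and ring the TX doorbell once per bulk.

#include <linux/netdevice.h>
#include <net/xdp.h>

/* Hypothetical per-frame helpers a driver would provide. */
int sketch_xmit_one(struct net_device *dev, struct xdp_frame *frame);
void sketch_kick_tx(struct net_device *dev);

/* Sketch of a bulked ndo_xdp_xmit: one call, many frames, one doorbell. */
static int sketch_ndo_xdp_xmit_bulk(struct net_device *dev,
                                    struct xdp_frame **frames, int n)
{
        int i, sent = 0;

        for (i = 0; i < n; i++) {
                /* DMA mapping kept from RX, so no per-packet map/unmap */
                if (sketch_xmit_one(dev, frames[i]) == 0)
                        sent++;
        }
        sketch_kick_tx(dev);    /* single tail/doorbell write per bulk */
        return sent;
}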

XDP performance regression due to CONFIG_RETPOLINE Spectre V2

2018-04-12 Thread Jesper Dangaard Brouer
Heads-up, XDP performance nerds! I got an unpleasant surprise when I updated my GCC compiler (to support the option -mindirect-branch=thunk-extern). My XDP redirect performance numbers were cut in half, from approx 13 Mpps to 6 Mpps (single CPU core). I've identified the issue, which is caused by
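For readers wondering where the cycles go, an illustrative sketch (not from the report): with CONFIG_RETPOLINE, i.e. GCC's -mindirect-branch=thunk-extern, every indirect call like the one below is emitted as "call __x86_indirect_thunk_<reg>" instead of "call *%reg", and the thunk parks speculation in a pause/lfence loop. At 13 Mpps the whole per-packet budget is only about 77 ns, so a handful of retpolined calls per packet (DMA ops, net_device ops) is enough to halve throughput.

/* Illustrative only: the shape of a hot per-packet indirect call. */
struct pkt_ops {
        int (*xmit)(void *pkt);         /* hypothetical per-packet op */
};

static int send_one(const struct pkt_ops *ops, void *pkt)
{
        /*
         * Compiles to "call *%rax" normally; under retpolines it becomes
         * "call __x86_indirect_thunk_rax", costing extra cycles per packet.
         */
        return ops->xmit(pkt);
}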