Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-16 Thread Benjamin Serebrin
On Mon, Nov 16, 2015 at 12:42 AM, David Woodhouse  wrote:
>
> On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> > We looked into Intel IOMMU performance a while ago and learned a few
> > useful things.  We generally did a parallel 200 thread TCP_RR test,
> > as this churns through mappings quickly and uses all available cores.
> >
> > First, the main bottleneck was software performance[1].
>
> For the Intel IOMMU, *all* we need to do is put a PTE in place. For
> real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
> need to do an IOTLB flush. It's a single 64-bit write of the PTE.
>
> All else is software overhead.
>

Agreed!

>
> (I'm deliberately ignoring the stupid chipsets where DMA page tables
> aren't cache coherent and we need a clflush too. They make me too sad.)



How much does Linux need to care about such chipsets?  Can we argue
that they get very sad performance and so be it?

>
>
>
> >   This study preceded the recent patch to break the locks into pools
> > ("Break up monolithic iommu table/lock into finer graularity pools
> > and lock").  There were several points of lock contention:
> > - the RB tree ...
> > - the RB tree ...
> > - the RB tree ...
> >
> > Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> > c15-paper-peleg.pdf) has some promising approaches.  The magazine
> > avoids the RB tree issue.
>
> I'm thinking of ditching the RB tree altogether and switching to the
> allocator in lib/iommu-common.c (and thus getting the benefit of the
> finer granularity pools).


Sounds promising!  Is 4 parallel arenas enough?  We'll try to play
with that here.
I think lazy_flush leaves dangling references.

>
>
> > I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> > page table cleanup algorithm could do well.
>
> When you say 'dynamic 1:1 mapping', is that the same thing that's been
> suggested elsewhere — avoiding the IOVA allocator completely by using a
> virtual address which *matches* the physical address, if that virtual
> address is available?


Yes.  We munge in 2 upper address bits in the IOVA to encode read and
write permissions as well.
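
A minimal sketch of that munging, assuming the two bits sit at positions 62
and 61 (the real positions depend on the IOMMU's usable address width; all
names here are illustrative, not the actual code):

#include <stdint.h>
#include <stdio.h>

#define IOVA_READ_BIT  (1ULL << 62)   /* assumed position of the read bit */
#define IOVA_WRITE_BIT (1ULL << 61)   /* assumed position of the write bit */
#define IOVA_ADDR_MASK (IOVA_WRITE_BIT - 1)

static uint64_t iova_encode(uint64_t pa, int read, int write)
{
        uint64_t iova = pa & IOVA_ADDR_MASK;

        if (read)
                iova |= IOVA_READ_BIT;
        if (write)
                iova |= IOVA_WRITE_BIT;
        return iova;
}

int main(void)
{
        uint64_t iova = iova_encode(0x12345000ULL, 1, 0);  /* read-only map */

        printf("iova=%#llx addr=%#llx r=%d w=%d\n",
               (unsigned long long)iova,
               (unsigned long long)(iova & IOVA_ADDR_MASK),
               !!(iova & IOVA_READ_BIT), !!(iova & IOVA_WRITE_BIT));
        return 0;
}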

>
> Simply cmpxchg on the PTE itself, and if it was
> already set *then* we fall back to the allocator, obviously configured
> to allocate from a range *higher* than the available physical memory.
>
> Jörg has been looking at this too, and was even trying to find space in
> the PTE for a use count so a given page could be in more than one
> mapping before we call back to the IOVA allocator.


Aren't bits 63:52 available at all levels?

>
>
>
> > There are correctness fixes and optimizations left in the
> > invalidation path: I want strict-ish semantics (a page doesn't go
> > back into the freelist until the last IOTLB/IOMMU TLB entry is
> > invalidated) with good performance, and that seems to imply that an
> > additional page reference should be gotten at dma_map time and put
> > back at the completion of the IOMMU flush routine.  (This is worthy
> > of much discussion.)
>
> We already do something like this for page table pages which are freed
> by an unmap, FWIW.


As I understood the code when I last looked, this was true only if a
single unmap operation covered an entire table's worth (2MByte, or
1GByte, etc) of mappings.  The caffeine hasn't hit yet, though, so I
can't even begin to dive into the new calls into mm.c.


>
>
> > Additionally, we can find ways to optimize the flush routine by
> > realizing that if we have frequent maps and unmaps, it may be because
> > the device creates and destroys buffers a lot; these kinds of
> > workloads use an IOVA for one event and then never come back.  Maybe
> > TLBs don't do much good and we could just flush the entire IOMMU TLB
> > [and ATS cache] for that BDF.
>
> That would be a very interesting thing to look at. Although it would be
> nice if we had a better way to measure the performance impact of IOTLB
> misses — currently we don't have a lot of visibility at all.


All benchmarks are lies.  But we intend to run internal workloads as
well as well-agreed loads and see how things go.

>
>
> > 1: We verified that the IOMMU costs are almost entirely software
> > overheads by forcing software 1:1 mode, where we create page tables
> > for all physical addresses.  We tested using leaf nodes of size 4KB,
> > of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> > of the page tables, and no IOMMU invalidations.  We did piles of DMA
> > maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> > addresses.  At 4KB page size, we could see some bandwidth slowdown,
> > but at 2MB and 1GB, there was < 1% performance loss as compared with
> > IOMMU off.
>
> Was this with ATS on or off? With ATS, the cost of the page walk can be
> amortised in some cases — you can look up the physical address *before*
> you are ready to actually start the DMA to it, and don't take that
> latency at the time you're actually moving the data.

Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-16 Thread David Woodhouse
On Sun, 2015-11-15 at 22:54 -0800, Benjamin Serebrin wrote:
> We looked into Intel IOMMU performance a while ago and learned a few
> useful things.  We generally did a parallel 200 thread TCP_RR test,
> as this churns through mappings quickly and uses all available cores.
> 
> First, the main bottleneck was software performance[1].

For the Intel IOMMU, *all* we need to do is put a PTE in place. For
real hardware (i.e. not an IOMMU emulated by qemu for a VM), we don't
need to do an IOTLB flush. It's a single 64-bit write of the PTE.

All else is software overhead.

(I'm deliberately ignoring the stupid chipsets where DMA page tables
aren't cache coherent and we need a clflush too. They make me too sad.)


>   This study preceded the recent patch to break the locks into pools
> ("Break up monolithic iommu table/lock into finer graularity pools
> and lock").  There were several points of lock contention:
> - the RB tree ...
> - the RB tree ...
> - the RB tree ...
> 
> Omer's paper (https://www.usenix.org/system/files/conference/atc15/at
> c15-paper-peleg.pdf) has some promising approaches.  The magazine
> avoids the RB tree issue.  

I'm thinking of ditching the RB tree altogether and switching to the
allocator in lib/iommu-common.c (and thus getting the benefit of the
finer granularity pools).

> I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
> page table cleanup algorithm could do well.

When you say 'dynamic 1:1 mapping', is that the same thing that's been
suggested elsewhere — avoiding the IOVA allocator completely by using a
virtual address which *matches* the physical address, if that virtual
address is available? Simply cmpxchg on the PTE itself, and if it was
already set *then* we fall back to the allocator, obviously configured
to allocate from a range *higher* than the available physical memory.

Jörg has been looking at this too, and was even trying to find space in
the PTE for a use count so a given page could be in more than one
mapping before we call back to the IOVA allocator.
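
As a user-space model of that combination (cmpxchg install of the 1:1 PTE, a
use count in the high software bits, allocator fallback) - the bit layout and
the high-IOVA allocator stand-in are invented for illustration; this is not
the intel-iommu code:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT   1ULL
#define PTE_CNT_SHIFT 52     /* assumes bits 63:52 are software-available */
#define PTE_CNT_ONE   (1ULL << PTE_CNT_SHIFT)
#define PTE_ADDR_MASK (~((~0ULL << PTE_CNT_SHIFT) | PTE_PRESENT))

/* Stand-in for the fallback allocator, handing out IOVAs above RAM. */
static uint64_t next_high = 1ULL << 46;
static uint64_t alloc_high_iova(void)
{
        uint64_t iova = next_high;
        next_high += 4096;
        return iova;
}

static uint64_t map_dynamic_1to1(_Atomic uint64_t *pte, uint64_t pa)
{
        uint64_t expected = 0;
        uint64_t newval = pa | PTE_PRESENT | PTE_CNT_ONE;

        /* Empty slot: claim it with one atomic write; IOVA == PA. */
        if (atomic_compare_exchange_strong(pte, &expected, newval))
                return pa;

        /* Slot already maps the same page: just bump the use count. */
        while ((expected & PTE_ADDR_MASK) == pa)
                if (atomic_compare_exchange_strong(pte, &expected,
                                                   expected + PTE_CNT_ONE))
                        return pa;

        /* Slot taken by a different page: fall back to the allocator. */
        return alloc_high_iova();
}

int main(void)
{
        _Atomic uint64_t pte = 0;

        printf("%#llx\n", (unsigned long long)map_dynamic_1to1(&pte, 0x1000));
        printf("%#llx\n", (unsigned long long)map_dynamic_1to1(&pte, 0x1000));
        printf("%#llx\n", (unsigned long long)map_dynamic_1to1(&pte, 0x2000));
        return 0;
}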


> There are correctness fixes and optimizations left in the
> invalidation path: I want strict-ish semantics (a page doesn't go
> back into the freelist until the last IOTLB/IOMMU TLB entry is
> invalidated) with good performance, and that seems to imply that an
> additional page reference should be gotten at dma_map time and put
> back at the completion of the IOMMU flush routine.  (This is worthy
> of much discussion.)  

We already do something like this for page table pages which are freed
by an unmap, FWIW.

> Additionally, we can find ways to optimize the flush routine by
> realizing that if we have frequent maps and unmaps, it may be because
> the device creates and destroys buffers a lot; these kinds of
> workloads use an IOVA for one event and then never come back.  Maybe
> TLBs don't do much good and we could just flush the entire IOMMU TLB
> [and ATS cache] for that BDF.

That would be a very interesting thing to look at. Although it would be
nice if we had a better way to measure the performance impact of IOTLB
misses — currently we don't have a lot of visibility at all.

> 1: We verified that the IOMMU costs are almost entirely software
> overheads by forcing software 1:1 mode, where we create page tables
> for all physical addresses.  We tested using leaf nodes of size 4KB,
> of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
> of the page tables, and no IOMMU invalidations.  We did piles of DMA
> maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
> addresses.  At 4KB page size, we could see some bandwidth slowdown,
> but at 2MB and 1GB, there was < 1% performance loss as compared with
> IOMMU off.

Was this with ATS on or off? With ATS, the cost of the page walk can be
amortised in some cases — you can look up the physical address *before*
you are ready to actually start the DMA to it, and don't take that
latency at the time you're actually moving the data.

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                  Intel Corporation





Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-15 Thread Benjamin Serebrin
We looked into Intel IOMMU performance a while ago and learned a few
useful things.  We generally did a parallel 200 thread TCP_RR test, as
this churns through mappings quickly and uses all available cores.

First, the main bottleneck was software performance[1].  This study
preceded the recent patch to break the locks into pools ("Break up
monolithic iommu table/lock into finer graularity pools and lock").
There were several points of lock contention:
- the RB tree is per device (and in the network test, there's one
device).  Every dma_map and unmap holds the lock.
- the RB tree lock is held during invalidations as well.  There's a
250-entry queue for invalidations that doesn't do any batching
intelligence (for example, promote to larger-range invalidations,
flush entire device, etc).  RB tree locks may be held while waiting
for invalidation drains.  Invalidations have even worse behavior with
ATS enabled for a given device.
- the RB tree has one entry per dma_map call (that entry is deleted by
the corresponding dma_unmap).  If we had merged all adjacent entries
when we could, we would have not lost any information that's actually
used by the code today.  (There could be a check that a dma_unmap
actually covers the entire region that was mapped, but there isn't.)
At boot (without network traffic), two common NICs' drivers show tens
of thousands of static dma_maps that never go away; this means the RB
tree is ~14-16 levels deep.  An rbtree walk (holding that big lock) has
a 14-16 level pointer chase through mostly cache-cold entries.  I
wrote a modification to the RB tree handling that merges nodes that
represent abutting IOVA ranges (and unmerges them on dma_unmap), and
the same drivers created around 7 unique entries.  Steady state grew
to a few hundreds and maybe a thousand, but the fragmentation didn't
get worse than that.  This optimization got about a third of the
performance back.
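
A toy model of that merge optimization, with a sorted array standing in for
the RB tree (no locking, no bounds checks; it only shows the bookkeeping):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_RANGES 64
struct range { uint64_t start, end; };   /* [start, end) IOVA ranges */
static struct range r[MAX_RANGES];
static int nr;

static void map_range(uint64_t start, uint64_t end)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (r[i].end == start) {          /* abuts on the left */
                        r[i].end = end;
                        if (i + 1 < nr && r[i + 1].start == end) {
                                r[i].end = r[i + 1].end;  /* bridges two */
                                memmove(&r[i + 1], &r[i + 2],
                                        (nr - i - 2) * sizeof(*r));
                                nr--;
                        }
                        return;
                }
                if (r[i].start == end) {          /* abuts on the right */
                        r[i].start = start;
                        return;
                }
                if (r[i].start > end)
                        break;
        }
        memmove(&r[i + 1], &r[i], (nr - i) * sizeof(*r));  /* new node */
        r[i].start = start;
        r[i].end = end;
        nr++;
}

static void unmap_range(uint64_t start, uint64_t end)
{
        for (int i = 0; i < nr; i++) {
                if (start < r[i].start || end > r[i].end)
                        continue;
                if (r[i].start == start && r[i].end == end) {
                        memmove(&r[i], &r[i + 1], (nr - i - 1) * sizeof(*r));
                        nr--;                     /* whole node goes away */
                } else if (r[i].start == start) {
                        r[i].start = end;         /* trim the front */
                } else if (r[i].end == end) {
                        r[i].end = start;         /* trim the back */
                } else {                          /* unmerge: split in two */
                        memmove(&r[i + 2], &r[i + 1],
                                (nr - i - 1) * sizeof(*r));
                        r[i + 1].start = end;
                        r[i + 1].end = r[i].end;
                        r[i].end = start;
                        nr++;
                }
                return;
        }
}

int main(void)
{
        map_range(0x1000, 0x2000);
        map_range(0x2000, 0x3000);   /* merged: one node, not two */
        map_range(0x5000, 0x6000);
        unmap_range(0x1000, 0x2000);
        for (int i = 0; i < nr; i++)
                printf("[%#llx, %#llx)\n", (unsigned long long)r[i].start,
                       (unsigned long long)r[i].end);
        return 0;
}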

Omer's paper 
(https://www.usenix.org/system/files/conference/atc15/atc15-paper-peleg.pdf)
has some promising approaches.  The magazine avoids the RB tree issue.

I'm interested in seeing if the dynamic 1:1 with a mostly-lock-free
page table cleanup algorithm could do well.

There are correctness fixes and optimizations left in the invalidation
path: I want strict-ish semantics (a page doesn't go back into the
freelist until the last IOTLB/IOMMU TLB entry is invalidated) with
good performance, and that seems to imply that an additional page
reference should be gotten at dma_map time and put back at the
completion of the IOMMU flush routine.  (This is worthy of much
discussion.)

Additionally, we can find ways to optimize the flush routine by
realizing that if we have frequent maps and unmaps, it may be because
the device creates and destroys buffers a lot; these kinds of workloads
use an IOVA for one event and then never come back.  Maybe TLBs don't
do much good and we could just flush the entire IOMMU TLB [and ATS
cache] for that BDF.
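
A sketch of what that batching intelligence could look like, reusing the
250-entry invalidation queue mentioned earlier; the promotion threshold is an
invented tuning knob:

#include <stdint.h>
#include <stdio.h>

#define QUEUE_LEN  250   /* matches the queue depth mentioned above */
#define PROMOTE_AT 64    /* assumed: past this, one big flush is cheaper */

struct inv_queue { uint64_t iova[QUEUE_LEN]; int n; };

static void flush_one_page(uint64_t iova)
{
        printf("page invalidate %#llx\n", (unsigned long long)iova);
}

static void flush_whole_bdf(void)
{
        printf("full IOTLB [and ATS] flush for this BDF\n");
}

static void drain(struct inv_queue *q)
{
        if (q->n >= PROMOTE_AT) {
                flush_whole_bdf();   /* promoted: covers every queued IOVA */
        } else {
                for (int i = 0; i < q->n; i++)
                        flush_one_page(q->iova[i]);
        }
        q->n = 0;
}

static void queue_invalidate(struct inv_queue *q, uint64_t iova)
{
        q->iova[q->n++] = iova;
        if (q->n == QUEUE_LEN)       /* queue full: forced drain */
                drain(q);
}

int main(void)
{
        struct inv_queue q = { .n = 0 };

        for (int i = 0; i < 100; i++)     /* a map-once, never-reuse load */
                queue_invalidate(&q, 0x1000ULL * i);
        drain(&q);                        /* drains as one full flush */
        return 0;
}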

We'll try to get free time to do some of these things soon.

Ben


1: We verified that the IOMMU costs are almost entirely software
overheads by forcing software 1:1 mode, where we create page tables
for all physical addresses.  We tested using leaf nodes of size 4KB,
of 2MB, and of 1GB.  In all cases, there is zero runtime maintenance
of the page tables, and no IOMMU invalidations.  We did piles of DMA
maximizing x16 PCIe bandwidth on multiple lanes, to random DRAM
addresses.  At 4KB page size, we could see some bandwidth slowdown,
but at 2MB and 1GB, there was < 1% performance loss as compared with
IOMMU off.
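
For scale, a back-of-envelope on the leaf-table footprint of that static 1:1
mode, assuming 8-byte PTEs and (for the example) 1 TiB of physical memory;
only the leaf level is counted:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        const uint64_t ram = 1ULL << 40;   /* assumed: 1 TiB of RAM */
        const uint64_t leaf[] = { 1ULL << 12, 2ULL << 20, 1ULL << 30 };
        const char *name[] = { "4KB", "2MB", "1GB" };

        for (int i = 0; i < 3; i++) {
                uint64_t ptes = ram / leaf[i];   /* one 8-byte PTE per leaf */

                printf("%s leaves: %llu PTEs, %llu KiB of leaf tables\n",
                       name[i], (unsigned long long)ptes,
                       (unsigned long long)(ptes * 8 >> 10));
        }
        /* 4KB -> 2 GiB of leaf tables, 2MB -> 4 MiB, 1GB -> 8 KiB; the
         * bigger leaves also cover far more memory per IOTLB entry, which
         * matches the bandwidth results above. */
        return 0;
}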


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-07 Thread Shamir Rabinovitch
On Thu, Nov 05, 2015 at 04:11:21PM -0500, David Miller wrote:
> 
> And for the record Sowmini fixed a lot of the lock contention:
> 
> commit ff7d37a502022149655c18035b99a53391be0383
> Author: Sowmini Varadhan 
> Date:   Thu Apr 9 15:33:30 2015 -0400
> 
> Break up monolithic iommu table/lock into finer graularity pools and lock
> 

The poor rds-stress results without IOMMU bypass that I sent in an earlier
post were taken from a kernel that has the above patch and all the needed
changes in arch/sparc to use this new feature.

It seems that it worked well for the 10G Ethernet IOMMU lock contention, but
it still does not solve the rds-stress issue.

The difference can be from:

1. Lock contention still left with this enhancement <-- zero in bypass
2. Overhead to set up the IOMMU mapping <-- almost zero in bypass (requires 1 HV call)
3. Overhead to use the IOMMU mapping <-- not sure how to measure this
4. Overhead to tear down the IOMMU mapping <-- zero in bypass




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-05 Thread David Miller
From: Joerg Roedel 
Date: Thu, 5 Nov 2015 14:42:06 +0100

> Contended IOMMU locks are not only a problem on SPARC, but on x86 and
> various other IOMMU drivers too. But I have some ideas on how to improve
> the situation there.

And for the record Sowmini fixed a lot of the lock contention:

commit ff7d37a502022149655c18035b99a53391be0383
Author: Sowmini Varadhan 
Date:   Thu Apr 9 15:33:30 2015 -0400

Break up monolithic iommu table/lock into finer graularity pools and lock

Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu->lock as the hottest lock identified by
lockstat, with something of the order of  21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan 
Acked-by: Benjamin Herrenschmidt 
Signed-off-by: David S. Miller 
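
The shape of that pooled design, as a user-space toy; the 4 pools echo the
"4 parallel arenas" question earlier in the thread, while the real
lib/iommu-common.c uses a proper bitmap allocator plus a separate pool for
large allocations, both simplified away here:

#include <pthread.h>
#include <stdio.h>

#define NPOOLS    4
#define MAP_SIZE  (1 << 20)              /* total entries in the map */
#define POOL_SIZE (MAP_SIZE / NPOOLS)    /* each pool owns one segment */

struct pool {
        pthread_mutex_t lock;            /* per-pool, not per-table */
        unsigned int hint;               /* next-fit cursor in the segment */
        unsigned int used;
};
static struct pool pools[NPOOLS];

static long alloc_entry(unsigned int cpu)
{
        for (unsigned int attempt = 0; attempt < NPOOLS; attempt++) {
                unsigned int idx = (cpu + attempt) % NPOOLS;
                struct pool *p = &pools[idx];

                pthread_mutex_lock(&p->lock);
                if (p->used < POOL_SIZE) {
                        long entry = (long)idx * POOL_SIZE + p->hint;

                        p->hint = (p->hint + 1) % POOL_SIZE;
                        p->used++;
                        pthread_mutex_unlock(&p->lock);
                        return entry;
                }
                pthread_mutex_unlock(&p->lock);  /* full: try next pool */
        }
        return -1;                               /* whole map exhausted */
}

int main(void)
{
        for (int i = 0; i < NPOOLS; i++)
                pthread_mutex_init(&pools[i].lock, NULL);
        printf("cpu0 -> %ld, cpu1 -> %ld, cpu5 -> %ld\n",
               alloc_entry(0), alloc_entry(1), alloc_entry(5));
        return 0;
}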


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-05 Thread David Miller
From: David Woodhouse 
Date: Thu, 29 Oct 2015 22:35:25 +

> For the receive side, it shouldn't be beyond the wit of man to
> introduce an API which allocates *and* DMA-maps a skb. Pass it to
> netif_rx() still mapped, with a destructor that just shoves it back in
> a pool for re-use.

For forwarding, the SKB is going to another device to be DMA'd,
perhaps via another IOMMU.

For local connections, it's going to sit for an unpredictable and
unbounded amount of time in the socket queue.

We've been through this thought process before, believe me :)


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-05 Thread Joerg Roedel
On Mon, Nov 02, 2015 at 07:32:19PM +0200, Shamir Rabinovitch wrote:
> Correct. This issue is one of the concerns raised in the previous replies.
> I will take a different approach which will not require IOMMU bypass
> per mapping, and will try to shift to the x86 'iommu=pt' approach.

Yeah, it doesn't really make sense to have an extra remappable area when
the device can access all physical memory anyway.

> We had a bunch of issues around the SPARC IOMMU. Not all of them relate to
> performance. The first issue was that on SPARC, currently, we only have a
> limited address space for the IOMMU, so we had issues doing large DMA
> mappings for Infiniband. The second issue was that we identified high
> contention on the IOMMU locks even in the Ethernet driver.

Contended IOMMU locks are not only a problem on SPARC, but on x86 and
various other IOMMU drivers too. But I have some ideas on how to improve
the situation there.

> I do not want to put too much information here but you can see some results:
> 
> rds-stress test from sparc t5-2 -> x86:
> 
> with iommu bypass:
> -
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks   tx/s   rx/s   tx+rx K/s   mbi K/s   mbo K/s   tx us/c   rtt us   cpu %
>    3 141278      0  1165565.81      0.00      0.00      8.93   376.60   -1.00  (average)
> 
> without iommu bypass:
> -
> sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
> tsks   tx/s   rx/s   tx+rx K/s   mbi K/s   mbo K/s   tx us/c   rtt us   cpu %
>    3  78558      0   648101.41      0.00      0.00     15.05   876.72   -1.00  (average)
> 
> Also, RDMA tests are totally not working (might be due to a failure to DMA
> map all the memory).
> 
> So IOMMU bypass gives an ~80% performance boost.

Interesting. Have you looked more closely at what causes the performance
degradation? Is it the lock contention or something else?


Joerg



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-03 Thread Christoph Hellwig
On Tue, Nov 03, 2015 at 10:08:13AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > > Then I would argue for naming this differently. Make it an optional
> > > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > > achieved via using a bypass or other means in the backend not the
> > > business of the driver.
> > > 
> > 
> > With a name like that, who wouldn't pass that flag? ;-)
> 
> xHCI for example, vs. something like 10G ethernet... but yes I agree it
> sucks. I don't like that sort of policy anywhere in drivers. On the
> other hand the platform doesn't have much information to make that sort
> of decision either.

Maybe because it should simply use what's optimal?  E.g. passthrough
whenever possible, where the arguments against "possible" are: dma_mask, vfio
requirements, or a kernel command line option.  This is what a lot of
architectures already do; I remember the SGI Origin / Altix code had the
same behavior as well.  Those IOMMUs already had the 64-bit passthrough
and 32-bit sliding window in addition to the real IOMMU 10 years ago.
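
That policy, written out as a sketch - the struct and helper names are
invented, but the three tests are exactly the arguments listed above:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct policy_in {
        uint64_t dma_mask;          /* highest address the device can emit */
        uint64_t max_phys_addr;     /* top of physical memory */
        bool     assigned_to_vfio;  /* handed to userspace or a guest */
        bool     force_iommu;       /* an iommu=force-style option */
};

static bool use_passthrough(const struct policy_in *p)
{
        if (p->force_iommu)
                return false;    /* user explicitly wants translation */
        if (p->assigned_to_vfio)
                return false;    /* isolation is the point of vfio */
        if (p->dma_mask < p->max_phys_addr)
                return false;    /* device can't reach all RAM 1:1 */
        return true;             /* otherwise default to the fast path */
}

int main(void)
{
        struct policy_in nic = { ~0ULL, 1ULL << 40, false, false };
        struct policy_in old = { 0xffffffffULL, 1ULL << 40, false, false };

        printf("64-bit NIC: %d, 32-bit device: %d\n",
               use_passthrough(&nic), use_passthrough(&old));
        return 0;
}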


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-03 Thread Benjamin Herrenschmidt
On Tue, 2015-11-03 at 14:11 +0100, Christoph Hellwig wrote:
> > xHCI for example, vs. something like 10G ethernet... but yes I agree it
> > sucks. I don't like that sort of policy anywhere in drivers. On the
> > other hand the platform doesn't have much information to make that sort
> > of decision either.
> 
> Maybe because it should simply use what's optimal?  E.g. passthrough
> whenever possible, where the arguments against "possible" are: dma_mask, vfio
> requirements, or a kernel command line option. 

Right, this is what I do today on powerpc, with the exception of
the command line option.

>  This is what a lot of
> architectures already do; I remember the SGI Origin / Altix code had the
> same behavior as well.  Those IOMMUs already had the 64-bit passthrough
> and 32-bit sliding window in addition to the real IOMMU 10 years ago.


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Joerg Roedel
On Thu, Oct 29, 2015 at 09:32:32AM +0200, Shamir Rabinovitch wrote:
> For the IB case, setting up and tearing down DMA mappings for the driver's
> data buffers is expensive. But we could, for example, consider mapping all
> the HW control objects that validate the HW access to the driver's data
> buffers as IOMMU protected, and so have IOMMU protection on those critical
> objects while having fast set-up/tear-down of driver data buffers. The HW
> control objects have stable mappings for the lifetime of the system. So the
> case of using both types of DMA mappings is still valid.

How do you envision this per-mapping by-pass to work? For the DMA-API
mappings you have only one device address space. For by-pass to work,
you need to map all physical memory of the machine into this space,
which leaves the question of where you want to put the window for
remapping.

You surely have to put it after the physical mappings, but any
protection will be gone, as the device can access all physical memory.

So instead of working around the shortcomings of DMA-API
implementations, can you present us some numbers and analysis of how bad
the performance impact with an IOMMU is and what causes it?

I know that we have lock-contention issues in our IOMMU drivers, which
can be fixed. Maybe the performance problems you are seeing can be fixed
too, when you give us more details about them.


Joerg


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> wrote:
> > 
> > Choosing on a per-mapping basis *in the back end* might still make
> > some
> 
In my case, choosing the mapping based on the hardware that will use these
mappings makes more sense. Most hardware is not as performance sensitive
as Infiniband hardware.

 ...

The driver knows for what hardware it is mapping the memory, so it knows
if the memory will be used by performance-sensitive hardware or not.

Then I would argue for naming this differently. Make it an optional
hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
achieved via using a bypass or other means in the backend not the
business of the driver.

> In your case, what will give the better performance - 1:1 mapping or IOMMU
> mapping? When you say 'relaxing the protection', do you refer to the 1:1
> mapping?

> Also, how does this 1:1 window address the security concerns raised by
> others here?

It will, but only partially; it's just an example of another way the
backend could provide some improved performance without a bypass.

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 03:44:27PM +0100, Joerg Roedel wrote:
> 
> How do you envision this per-mapping by-pass to work? For the DMA-API
> mappings you have only one device address space. For by-pass to work,
> you need to map all physical memory of the machine into this space,
> which leaves the question of where you want to put the window for
> remapping.
> 
> You surely have to put it after the physical mappings, but any
> protection will be gone, as the device can access all physical memory.

Correct. This issue is one of the concerns raised in the previous replies.
I will take a different approach which will not require IOMMU bypass
per mapping, and will try to shift to the x86 'iommu=pt' approach.

> 
> So instead of working around the shortcomings of DMA-API
> implementations, can you present us some numbers and analysis of how bad
> the performance impact with an IOMMU is and what causes it?

We had a bunch of issues around the SPARC IOMMU. Not all of them relate to
performance. The first issue was that on SPARC, currently, we only have a
limited address space for the IOMMU, so we had issues doing large DMA
mappings for Infiniband. The second issue was that we identified high
contention on the IOMMU locks even in the Ethernet driver.

> 
> I know that we have lock-contention issues in our IOMMU drivers, which
> can be fixed. Maybe the performance problems you are seeing can be fixed
> too, when you give us more details about them.
> 

I do not want to put too much information here but you can see some results:

rds-stress test from sparc t5-2 -> x86:

with iommu bypass:
-
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s   tx+rx K/s   mbi K/s   mbo K/s   tx us/c   rtt us   cpu %
   3 141278      0  1165565.81      0.00      0.00      8.93   376.60   -1.00  (average)

without iommu bypass:
-
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s   rx/s   tx+rx K/s   mbi K/s   mbo K/s   tx us/c   rtt us   cpu %
   3  78558      0   648101.41      0.00      0.00     15.05   876.72   -1.00  (average)

Also, RDMA tests are totally not working (might be due to a failure to DMA
map all the memory).

So IOMMU bypass gives an ~80% performance boost.

> 
>   Joerg


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Tue, Nov 03, 2015 at 07:13:28AM +1100, Benjamin Herrenschmidt wrote:
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via using a bypass or other means in the backend not the
> business of the driver.
> 

Correct. This comment was also raised in internal review :-) Although
currently there is strong opposition to such a new attribute.

> 
> It will partially only but it's just an example of another way the
> bakcend could provide some improved performances without a bypass.

Just curious...

In the Infiniband case, where a user application requests the driver to DMA
map some random pages it allocated - will your suggestion still function?

I mean, can you use this limited 1:1 mapping window to map any page the
user application chooses? How?

With the bypass we just use the physical address of the page (almost as is).
We use the fact that the bypass address space covers the whole of memory,
and so we have a mapping for any address.

In the IOMMU case we use the IOMMU translation tables, and so can map a
64-bit address to some 32-bit address (or less).

In your case it is not clear to me what we can do with such a limited 1:1
mapping.

> 
> Cheers,
> Ben.
> 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Arnd Bergmann
On Tuesday 03 November 2015 07:13:28 Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-02 at 14:07 +0200, Shamir Rabinovitch wrote:
> > On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt
> > wrote:
> > > 
> > > Choosing on a per-mapping basis *in the back end* might still make
> > > some
> > 
> > In my case, choosing the mapping based on the hardware that will use these
> > mappings makes more sense. Most hardware is not as performance
> > sensitive as Infiniband hardware.
> 
>  ...
> 
> > The driver knows for what hardware it is mapping the memory, so it knows
> > if the memory will be used by performance-sensitive hardware or not.
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via using a bypass or other means in the backend not the
> business of the driver.
> 

With a name like that, who wouldn't pass that flag? ;-)

Arnd


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread David Woodhouse
On Mon, 2015-11-02 at 08:10 +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> > Not sure this use case is possible for Infiniband, where the application
> > holds the data buffers and there is no way to force the application to
> > reuse the buffer as suggested.
> > 
> > This is why I think there will be no easy way to bypass the DMA mapping
> > cost for all use cases unless we allow applications to request
> > bypass/pass-through DMA mapping (implicitly or explicitly).
> 
> But but but ...
> 
> What I don't understand is how that brings you any safety.
> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).
> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.
> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.

In the Intel case, the mapping setup is entirely per-device (except for
crap devices and devices behind a PCIe-PCI bridge, etc.).

So you can happily have a passthrough mapping for *one* device, without
making that same mapping available to another device. You can make that
trade-off of speed vs. protection for each device in the system.

Currently we have the 'iommu=pt' option that makes us use passthrough
for *all* devices (except those whose dma_mask can't reach all of
physical memory). But I could live with something like Shamir is
proposing, which lets us do the bypass only for performance-sensitive
devices which actually *ask* for it (and of course we'd have a system-
wide mode which declines that request and does normal mappings anyway).

-- 
dwmw2





Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 22:48 +, David Woodhouse wrote:
> 
> In the Intel case, the mapping setup is entirely per-device (except for
> crap devices and devices behind a PCIe-PCI bridge, etc.).
> 
> So you can happily have a passthrough mapping for *one* device, without
> making that same mapping available to another device. You can make that
> trade-off of speed vs. protection for each device in the system.

Same for me. I should have written "in the context of a device ...";
the problem remains that it doesn't buy you much to do it *per-mapping*,
which is what this API seems to be about.

> Currently we have the 'iommu=pt' option that makes us use passthrough
> for *all* devices (except those whose dma_mask can't reach all of
> physical memory). But I could live with something like Shamir is
> proposing, which lets us do the bypass only for performance-sensitive
> devices which actually *ask* for it (and of course we'd have a system-
> wide mode which declines that request and does normal mappings anyway).
> 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 09:23 +0200, Shamir Rabinovitch wrote:
> To summarize -
> 
> 1. The whole point of the IOMMU pass-through was to get a bigger address
> space and faster map/unmap operations for performance-critical hardware
> 2. The SPARC IOMMU in particular has the ability to do DVMA, which
> addresses all the protection concerns raised above. Not sure what the
> performance impact will be, though. This still needs a lot of work
> before we can test it.
> 3. On x86 we use the IOMMU in pass-through mode, so all the above
> concerns are valid
> 
> The questions are -
> 
> 1. Does partial use of the IOMMU while the pass-through window is
> enabled add some protection?
> 2. Would we rather have the x86 way of doing this, which is to
> enable/disable IOMMU translations at the kernel level?
> 
> I think that I can live with option (2) till I have DVMA, if there is
> strong disagreement on the need for per-allocation IOMMU bypass.

Choosing on a per-mapping basis *in the back end* might still make some
amount of sense. What I don't completely grasp is what it gives you to
expose that choice to the *driver* all the way up the chain. Why does
the driver know better whether something should use the bypass or not?

I can imagine some in-between setups. For example, on POWER (and
probably x86), I could set up a window that is TCE-mapped (TCEs are our
iommu PTEs) but used to create a 1:1 mapping, i.e. a given TCE always
maps to the same physical page. I could then use map/unmap to adjust the
protection, the idea being that only "relaxing" the protection requires
flushing the IO TLB, i.e. we could delay most flushes.
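
A toy model of that in-between setup: entries in the window are permanently
1:1 (entry i always points at page i) and map/unmap only flip the R/W bits,
flushing at once when permission is granted and lazily, batched, when it is
revoked. Names and layout are invented:

#include <stdint.h>
#include <stdio.h>

#define TCE_READ  0x1ULL
#define TCE_WRITE 0x2ULL
#define NTCE      1024

static uint64_t tce[NTCE];   /* permission bits only; the PFN is implicit */
static int flush_pending;

static void iotlb_flush(void)
{
        flush_pending = 0;
        printf("IO TLB flush\n");
}

static void tce_set_perms(unsigned long pfn, uint64_t perms)
{
        uint64_t old = tce[pfn];

        tce[pfn] = perms;
        if (perms & ~old)            /* relaxing: flush now so it works */
                iotlb_flush();
        else if (old & ~perms)       /* revoking: can be deferred */
                flush_pending = 1;
}

int main(void)
{
        tce_set_perms(5, TCE_READ | TCE_WRITE);  /* map: immediate flush */
        tce_set_perms(5, 0);                     /* unmap: deferred */
        if (flush_pending)
                iotlb_flush();                   /* batched drain point */
        return 0;
}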

But that sort of optimization only makes sense in the back-end.

So what was your original idea where you thought the driver was the
right one to decide whether to use the bypass or not for a given map
operation? That's what I don't grasp... you might have a valid case
that I just fail to see.

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Benjamin Herrenschmidt
On Mon, 2015-11-02 at 22:45 +0100, Arnd Bergmann wrote:
> > Then I would argue for naming this differently. Make it an optional
> > hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> > achieved via using a bypass or other means in the backend not the
> > business of the driver.
> > 
> 
> With a name like that, who wouldn't pass that flag? ;-)

xHCI for example, vs. something like 10G ethernet... but yes I agree it
sucks. I don't like that sort of policy anywhere in drivers. On the
other hand the platform doesn't have much information to make that sort
of decision either.

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Joerg Roedel
On Fri, Oct 30, 2015 at 11:32:06AM +0100, Arnd Bergmann wrote:
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

Yeah, definitely. Currently we only have iommu=pt to enable pass-through
mode for _all_ devices. I think it makes sense to introduce a per-device
opt-in for pass-through, but have it configured by the user and not by
the device driver.

If the user enables the IOMMU in his system, he expects to be secure
against DMA attacks. If drivers could opt-out, every protection would be
voided.


Joerg



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> 
> Choosing on a per-mapping basis *in the back end* might still make some

In my case, choosing the mapping based on the hardware that will use these
mappings makes more sense. Most hardware is not as performance sensitive
as Infiniband hardware.

> amount of sense. What I don't completely grasp is what does it give
> you to expose that choice to the *driver* all the way up the chain. Why
> does the driver knows better whether something should use the bypass or
> not ?

The driver knows for what hardware it is mapping the memory, so it knows if
the memory will be used by performance-sensitive hardware or not.

> 
> I can imagine some in-between setups, for example, on POWER (and
> probably x86), I could setup a window that is TCE-mapped (TCEs are our
> iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
> map to the same physical page. I could then use map/unmap to adjust the
> protection, the idea being that only "relaxing" the protection requires
> flushing the IO TLB, ie, we could delay most flushes.

In your case, what will give the better performance - 1:1 mapping or IOMMU
mapping? When you say 'relaxing the protection', do you refer to the 1:1
mapping?

Also, how does this 1:1 window address the security concerns raised by
others here?

> 
> But that sort of optimization only makes sense in the back-end.
> 
> So what was your original idea where you thought the driver was the
> right one to decide whether to use the bypass or not for a given map
> operation? That's what I don't grasp... you might have a valid case
> that I just fail to see.

Please see above.

> 
> Cheers,
> Ben.
> 

I think that, given that IOMMU bypass on a per-allocation basis raises some
concerns, the only path for SPARC is this:

1. Support 'iommu=pt' as on x86 for total IOMMU bypass, as an intermediate
step. Systems that use Infiniband will be set to pass-through.

2. Add support for DVMA, which allows less contention on the IOMMU resources
while doing the map/unmap, a bigger address range, and full protection.
Still, it is not clear what the performance cost of using DVMA will be.


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Shamir Rabinovitch
On Thu, Oct 29, 2015 at 10:35:25PM +, David Woodhouse wrote:
> > 
> > Could this be mitigated using pools?  I don't know if the net code
> > would play along easily.
> 
> For the receive side, it shouldn't be beyond the wit of man to
> introduce an API which allocates *and* DMA-maps a skb. Pass it to
> netif_rx() still mapped, with a destructor that just shoves it back in
> a pool for re-use.
> 
> Doing it for transmit might be a little more complex, but perhaps still
> possible.

Not sure this use case is possible for Infiniband, where the application
holds the data buffers and there is no way to force the application to
reuse the buffer as suggested.

This is why I think there will be no easy way to bypass the DMA mapping cost
for all use cases unless we allow applications to request bypass/pass-through
DMA mapping (implicitly or explicitly).

> 
> -- 
> dwmw2
> 




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Benjamin Herrenschmidt
On Sun, 2015-11-01 at 09:45 +0200, Shamir Rabinovitch wrote:
> Not sure this use case is possible for Infiniband, where the application
> holds the data buffers and there is no way to force the application to
> reuse the buffer as suggested.
> 
> This is why I think there will be no easy way to bypass the DMA mapping cost
> for all use cases unless we allow applications to request bypass/pass-through
> DMA mapping (implicitly or explicitly).

But but but ...

What I don't understand is how that brings you any safety.

Basically, either your bridge has a bypass window, or it doesn't. (Or
it has one and it's enabled or not, same thing).

If you request the bypass on a per-mapping basis, you basically have to
keep the window always enabled, unless you do some nasty refcounting of
how many people have a bypass mapping in flight, but that doesn't seem
useful.

Thus you have already lost all protection from the device, since your
entire memory is accessible via the bypass mapping.

Which means what is the point of then having non-bypass mappings for
other things ? Just to make things slower ?

I really don't see what this whole "bypass on a per-mapping basis" buys
you.

Note that we implicitly already do that on powerpc, but not for those
reasons; we do it based on the DMA mask, so that if your coherent mask
is 32-bit but your dma mask is 64-bit (which is not an uncommon
combination), we service the "coherent" requests (basically the long-
lived dma allocs) from the remapped region and the "stream" requests
(i.e. map_page/map_sg) from the bypass.
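
That dispatch, as a sketch with invented names (not the actual powerpc
code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct dev_masks { uint64_t dma_mask, coherent_mask; };

/* Streaming map_page/map_sg: take the bypass if the DMA mask covers RAM. */
static bool stream_uses_bypass(const struct dev_masks *d, uint64_t ram_top)
{
        return d->dma_mask >= ram_top;
}

/* Coherent (long-lived) allocations: judged by the coherent mask instead. */
static bool coherent_uses_bypass(const struct dev_masks *d, uint64_t ram_top)
{
        return d->coherent_mask >= ram_top;
}

int main(void)
{
        /* The combination mentioned above: 32-bit coherent, 64-bit dma. */
        struct dev_masks d = { .dma_mask = ~0ULL,
                               .coherent_mask = 0xffffffffULL };
        uint64_t ram_top = 1ULL << 40;   /* assumed: 1 TiB of RAM */

        printf("stream: %s, coherent: %s\n",
               stream_uses_bypass(&d, ram_top) ? "bypass" : "remapped",
               coherent_uses_bypass(&d, ram_top) ? "bypass" : "remapped");
        return 0;
}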

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 08:10:49AM +1100, Benjamin Herrenschmidt wrote:
> But but but ...
> 
> What I don't understand is how that brings you any safety.

Limited safety, maybe? If some device DMA mappings are via the IOMMU,
and these fall in some address range that is far from the bypass /
pass-through range, then small drifts in address might still be caught,
provided we do not bypass / pass through the IOMMU - right?

The device can of course use the bypass address and just reach the memory
without IOMMU protection. See some comments about that below.

> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).

Agree.

> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.

Correct, see my above comment. 

> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.
> 
> Note that we implicitly already do that on powerpc, but not for those
> reasons, we do it based on the DMA mask, so that if your coherent mask
> is 32-bit but your dma mask is 64-bit (which is not an uncommon
> combination), we service the "coherent" requests (basically the long
> lifed dma allocs) from the remapped region and the "stream" requests
> (ie, map_page/map_sg) from the bypass.



To summarize -

1. The whole point of the IOMMU pass-through was to get a bigger address space
and faster map/unmap operations for performance-critical hardware
2. The SPARC IOMMU in particular has the ability to do DVMA, which addresses
all the protection concerns raised above. Not sure what the performance
impact will be, though. This still needs a lot of work before we can test it.
3. On x86 we use the IOMMU in pass-through mode, so all the above concerns
are valid

The questions are -

1. Does partial use of the IOMMU while the pass-through window is enabled
add some protection?
2. Would we rather have the x86 way of doing this, which is to enable/disable
IOMMU translations at the kernel level?

I think that I can live with option (2) till I have DVMA, if there is strong
disagreement on the need for per-allocation IOMMU bypass.


> 
> Cheers,
> Ben.
> 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-30 Thread Arnd Bergmann
On Saturday 31 October 2015 10:17:22 Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> > On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> > > 
> > > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes
> > > > across
> > > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have
> > > > a
> > > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > > -paranoid mode where it's not honoured).
> > > 
> > > Possibly, though ideally that would be a user policy but of course
> > > by
> > > the time you get to userspace it's generally too late.
> > 
> > IIRC, we have an 'iommu=force' command line switch for this, to
> > ensure
> > that no device can use a linear mapping and everything goes through
> > the page tables. This is often useful for both debugging and as a
> > security measure when dealing with unprivileged DMA access (virtual
> > machines, vfio, ...).
> 
> That was used to force-enable the iommu on platforms like G5s where we
> would otherwise only do so if the memory was larger than 32-bit but we
> never implemented using it to prevent the bypass region.

Ah, I see. Thanks for the clarification.

> > If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly
> > document
> > which of the two we expect to take
> > priority in cases where we have a
> > choice.
> >
> > I wonder if the 'iommu=force' attribute is too coarse-grained though,
> > and if we should perhaps allow a per-device setting on architectures
> > that allow this.
> 
> The interesting thing, if we can make it work, is the bypass attribute
> being per mapping... 

I would say we want both: for the device driver it can make sense to
choose per mapping what it can do, but for the iommu driver, it
can also make sense to ensure we never provide a linear mapping,
because otherwise the additional security aspect is moot.

In particular for the unprivileged VM guest or vfio access, the
code that gives access to the device to something else should
have a way to tell the IOMMU that the linear mapping can no longer
be used.

Arnd


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-30 Thread Benjamin Herrenschmidt
On Fri, 2015-10-30 at 11:32 +0100, Arnd Bergmann wrote:
> On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> > 
> > > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes
> > > across
> > > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have
> > > a
> > > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > > -paranoid mode where it's not honoured).
> > 
> > Possibly, though ideally that would be a user policy but of course
> > by
> > the time you get to userspace it's generally too late.
> 
> IIRC, we have an 'iommu=force' command line switch for this, to
> ensure
> that no device can use a linear mapping and everything goes through
> the page tables. This is often useful for both debugging and as a
> security measure when dealing with unprivileged DMA access (virtual
> machines, vfio, ...).

That was used to force-enable the iommu on platforms like G5s where we
would otherwise only do so if the memory was larger than 32-bit but we
never implemented using it to prevent the bypass region.

> If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly
> document
> which of the two we expect to take
> priority in cases where we have a
> choice.
>
> I wonder if the 'iommu=force' attribute is too coarse-grained though,
> and if we should perhaps allow a per-device setting on architectures
> that allow this.

The interesting thing, if we can make it work, is the bypass attribute
being per mapping... 

Ben. 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-30 Thread Arnd Bergmann
On Thursday 29 October 2015 10:10:46 Benjamin Herrenschmidt wrote:
> 
> > Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> > architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> > sane meaning in the paranoid mode (and perhaps we'd want an ultra
> > -paranoid mode where it's not honoured).
> 
> Possibly, though ideally that would be a user policy but of course by
> the time you get to userspace it's generally too late.

IIRC, we have an 'iommu=force' command line switch for this, to ensure
that no device can use a linear mapping and everything goes through
the page tables. This is often useful for both debugging and as a
security measure when dealing with unprivileged DMA access (virtual
machines, vfio, ...).

If we add a DMA_ATTR_IOMMU_BYPASS attribute, we should clearly document
which of the two we expect to take priority in cases where we have a
choice.

I wonder if the 'iommu=force' attribute is too coarse-grained though,
and if we should perhaps allow a per-device setting on architectures
that allow this.

Arnd


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Andy Lutomirski
On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
 wrote:
>
> On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> >
> > > On Power, I generally have 2 IOMMU windows for a device, one at the
> > > bottom is remapped, and is generally used for 32-bit devices and the
> > > one at the top is set up as a bypass
> >
> > So in the normal case of decent 64-bit devices (and not in a VM),
> > they'll *already* be using the bypass region and have full access to
> > all of memory, all of the time? And you have no protection against
> > driver and firmware bugs causing stray DMA?
>
> Correct, we chose to do that for performance reasons.

Could this be mitigated using pools?  I don't know if the net code
would play along easily.

--Andy


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread David Woodhouse
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
>  wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top is set up as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

For the receive side, it shouldn't be beyond the wit of man to
introduce an API which allocates *and* DMA-maps a skb. Pass it to
netif_rx() still mapped, with a destructor that just shoves it back in
a pool for re-use.

Doing it for transmit might be a little more complex, but perhaps still
possible.

-- 
dwmw2





Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Shamir Rabinovitch
On Thu, Oct 29, 2015 at 09:42:12AM +0900, David Woodhouse wrote:

> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

We do bypass the IOMMU but not the DMA API, and given that we pin the page
before we call the DMA API, we do not need to pin all the pages - just the
ones we use for the DMA.

For me this flag looks orthogonal to the page pinning issue you brought up.
It is just a hint to the DMA API that we want to use a simple & fast mapping
while knowing we lose IOMMU protection for this memory.

For the IB case, setting up and tearing down DMA mappings for the driver's
data buffers is expensive. But we could, for example, consider mapping all
the HW control objects that validate the HW access to the driver's data
buffers as IOMMU protected, and so have IOMMU protection on those critical
objects while having fast set-up/tear-down of driver data buffers. The HW
control objects have stable mappings for the lifetime of the system. So the
case of using both types of DMA mappings is still valid.
 
> 
> -- 
> dwmw2
> 
> 




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Benjamin Herrenschmidt
On Thu, 2015-10-29 at 11:31 -0700, Andy Lutomirski wrote:
> On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt"
>  wrote:
> > 
> > On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> > > 
> > > > On Power, I generally have 2 IOMMU windows for a device, one at
> > > > the
> > > > bottom is remapped, and is generally used for 32-bit devices
> > > > and the
> > > > one at the top is set up as a bypass
> > > 
> > > So in the normal case of decent 64-bit devices (and not in a VM),
> > > they'll *already* be using the bypass region and have full access
> > > to
> > > all of memory, all of the time? And you have no protection
> > > against
> > > driver and firmware bugs causing stray DMA?
> > 
> > Correct, we chose to do that for performance reasons.
> 
> Could this be mitigated using pools?  I don't know if the net code
> would play along easily.

Possibly. The pools we already have limit the lock contention, but we
still have the map/unmap overhead, which under a hypervisor can be
quite high. I'm not necessarily against changing the way we do things,
but it would have to be backed up with numbers.

Cheers,
Ben.


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > > +For systems with IOMMU it is assumed all DMA translations use the 
> > > > IOMMU.
> > 
> > Not entirely true. We have per-device dma_ops on most architectures
> > already, and we were just talking about the need to add them to
> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> > map virtio devices too.
> 
> SPARC has its implementation of dma_ops under arch/sparc (sun4v_dma_ops).
> 
> Some drivers use the IOMMU under SPARC, for example ixgbe (Intel 10G ETH).
> Some, like IB, suffer from IOMMU map setup/tear-down cost & a limited
> address range. On SPARC, IOMMU bypass is not a total bypass of the IOMMU
> but rather a much simpler translation that does not require any complex
> translation tables.

We have an option in the Intel IOMMU for pass-through mode too, which
basically *is* a total bypass. In practice, what's the difference
between that and a "simple translation that does not require any
[translation]"? We set up a full 1:1 mapping of all memory, and then
the map/unmap methods become no-ops.

Currently we have no way to request that mode on a per-device basis; we
only have 'iommu=pt' on the command line to set it for *all* devices.
But performance-sensitive devices might want it, while we keep doing
proper translation for others.
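
For illustration, pass-through map/unmap degenerates to roughly this
(a sketch of the idea, not the actual intel-iommu code paths):

#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/io.h>

static dma_addr_t pt_map_page(struct device *dev, struct page *page,
			      unsigned long offset, size_t size,
			      enum dma_data_direction dir)
{
	/* 1:1 mapping is already in place: no IOVA allocation, no PTE
	 * write, no IOTLB flush. Just phys-to-bus arithmetic. */
	return page_to_phys(page) + offset;
}

static void pt_unmap_page(struct device *dev, dma_addr_t addr,
			  size_t size, enum dma_data_direction dir)
{
	/* Nothing to tear down. */
}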

> > 
> > As we look at that (and make the per-device dma_ops a generic thing
> > rather than per-arch), we should probably look at folding your case in
> > too.
> 
> Whether to use the IOMMU or not for DMA is up to the driver. The above
> example shows a real situation where one driver can use the IOMMU and
> another can't. The device itself cannot know when to use IOMMU bypass
> and when not. Even within one driver we may decide to DMA-map some
> buffers with IOMMU translation and some as IOMMU bypass.

In practice there are multiple possibilities — there are cases where
you *must* use an IOMMU and do full translation, and there is no option
of a bypass. There are cases where there just isn't an IOMMU (and
sometimes that's a per-device fact, like with virtio). And there are
cases where you *can* use the IOMMU, but if you ask nicely you can get
away without it.

My point in linking up these two threads is that we should contemplate
all of those use cases and come up with something that addresses them
all.

> Do you agree that we need this attribute in the generic DMA API? 

Yeah, I think it can be useful.

-- 
David WoodhouseOpen Source Technology Centre
david.woodho...@intel.com  Intel Corporation





Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Miller
From: David Woodhouse 
Date: Wed, 28 Oct 2015 22:31:50 +0900

> On Wed, 2015-10-28 at 13:10 +0200, Shamir Rabinovitch wrote:
>> On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
>> > > > +For systems with IOMMU it is assumed all DMA translations use the 
>> > > > IOMMU.
>> > 
>> > Not entirely true. We have per-device dma_ops on most architectures
>> > already, and we were just talking about the need to add them to
>> > POWER/SPARC too, because we need to avoid trying to use the IOMMU to
>> > map virtio devices too.
>> 
>> SPARC has its implementation of dma_ops under arch/sparc (sun4v_dma_ops).
>> 
>> Some drivers use the IOMMU under SPARC, for example ixgbe (Intel 10G ETH).
>> Some, like IB, suffer from IOMMU map setup/tear-down cost & a limited
>> address range. On SPARC, IOMMU bypass is not a total bypass of the IOMMU
>> but rather a much simpler translation that does not require any complex
>> translation tables.
> 
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis; we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

In the sparc64 case, the 64-bit DMA address space is divided into
IOMMU translated and non-IOMMU translated.

You just set the high bits differently depending upon what you want.
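
To illustrate, with a made-up selector bit (the real bit position is
hardware-specific; this is not the actual sun4v constant):

#include <linux/types.h>

#define DMA_BYPASS_BIT	(1ULL << 63)	/* hypothetical selector bit */

/* Bypass: the device hits physical memory directly. */
static inline u64 make_bypass_dma_addr(u64 paddr)
{
	return DMA_BYPASS_BIT | paddr;
}

/* Translated: the address is looked up in the IOMMU tables. */
static inline u64 make_translated_dma_addr(u64 iova)
{
	return iova & ~DMA_BYPASS_BIT;
}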

So a device could use both IOMMU translated and bypass accesses at the
same time.  While seemingly interesting, I do not recommend we provide
this kind of flexibility in our DMA interfaces.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Shamir Rabinovitch
On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
> 
> Not entirely true. We have per-device dma_ops on most architectures
> already, and we were just talking about the need to add them to
> POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> map virtio devices too.

SPARC has its implementation of dma_ops under arch/sparc (sun4v_dma_ops).

Some drivers use the IOMMU under SPARC, for example ixgbe (Intel 10G ETH).
Some, like IB, suffer from IOMMU map setup/tear-down cost & a limited
address range. On SPARC, IOMMU bypass is not a total bypass of the IOMMU
but rather a much simpler translation that does not require any complex
translation tables.

> 
> As we look at that (and make the per-device dma_ops a generic thing
> rather than per-arch), we should probably look at folding your case in
> too.

Whether to use the IOMMU or not for DMA is up to the driver. The above
example shows a real situation where one driver can use the IOMMU and
another can't. The device itself cannot know when to use IOMMU bypass
and when not. Even within one driver we may decide to DMA-map some
buffers with IOMMU translation and some as IOMMU bypass.

Do you agree that we need this attribute in the generic DMA API? 

> 
> -- 
> dwmw2
> 
> 




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote:
> In the sparc64 case, the 64-bit DMA address space is divided into
> IOMMU translated and non-IOMMU translated.
> 
> You just set the high bits differently depending upon what you want.

Wait, does that mean a (rogue) device could *always* get full access to
physical memory just by setting the high bits appropriately? That
mapping is *always* available?

> So a device could use both IOMMU translated and bypass accesses at 
> the same time.  While seemingly interesting, I do not recommend we 
> provide this kind of flexibility in our DMA interfaces.

Now I could understand this if the answer to my question above was
'no'. We absolutely want the *security* all the time, and we don't want
the device to be able to do stupid stuff. But if the answer was 'yes'
then we take the map/unmap performance hit for... *what* benefit?

On Intel we have the passthrough as an *option* and I have the same
initial reaction — "Hell no, we want the security". But I concede the
performance motivation for it, and I'm not *dead* set against
permitting it.

If I tolerate a per-device request for passthrough mode, that might
prevent people from disabling the IOMMU or putting it into passthrough
mode *entirely*. So actually, I'm *improving* security...

I think it makes sense to allow performance-sensitive device drivers to
*request* a passthrough mode. The platform can reserve the right to
refuse, if either the IOMMU hardware doesn't support it, or we're in a
paranoid mode (with iommu=always or something on the command line).
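
Shape-wise, something like this (entirely hypothetical helpers;
nothing of the sort exists today):

#include <linux/device.h>
#include <linux/errno.h>

/* All three helpers below are hypothetical stand-ins. */
extern bool iommu_paranoid_mode(void);
extern bool iommu_hw_supports_passthrough(struct device *dev);
extern void set_passthrough_dma_ops(struct device *dev);

int dma_request_passthrough(struct device *dev)
{
	if (iommu_paranoid_mode())
		return -EPERM;		/* paranoid mode: never honoured */
	if (!iommu_hw_supports_passthrough(dev))
		return -ENODEV;		/* hardware can't do it */

	set_passthrough_dma_ops(dev);	/* switch to identity-mapped ops */
	return 0;
}

A driver that gets an error back simply keeps using the translated path.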


-- 
dwmw2






Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Sun, 2015-10-25 at 09:07 -0700, Shamir Rabinovitch wrote:
> 
> +
> +DMA_ATTR_IOMMU_BYPASS
> +-
> +
> > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.

Not entirely true. We have per-device dma_ops on most architectures
already, and we were just talking about the need to add them to
POWER/SPARC too, because we need to avoid trying to use the IOMMU to
map virtio devices too.

As we look at that (and make the per-device dma_ops a generic thing
rather than per-arch), we should probably look at folding your case in
too.

-- 
dwmw2






Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Benjamin Herrenschmidt
On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> 
> > On Power, I generally have 2 IOMMU windows for a device, one at the
> > bottom is remapped, and is generally used for 32-bit devices and the
> > one at the top is set up as a bypass 
> 
> So in the normal case of decent 64-bit devices (and not in a VM),
> they'll *already* be using the bypass region and have full access to
> all of memory, all of the time? And you have no protection against
> driver and firmware bugs causing stray DMA?

Correct, we chose to do that for performance reasons.

> > I don't see how that attribute would work for us.
> 
> Because you're already doing it anyway without being asked :)
> 
> If SPARC and POWER are both doing that, perhaps we should change the
> default for Intel too?
> 
> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

Correct, the problem is that the cost of doing map/unmap from a guest
is really a huge hit on things like network devices.

Another problem is that the failure mode isn't great if you don't pin. 

I.e., you have to pin pages as they get mapped into the IOMMU by the
guest, but you don't know in advance how many, and you may hit the
process ulimit on pinned pages halfway through.

We tried to address that in various ways but it always ended up horrid.

> Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
> architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
> sane meaning in the paranoid mode (and perhaps we'd want an
> ultra-paranoid mode where it's not honoured).

Possibly, though ideally that would be a user policy; but of course by
the time you get to userspace it's generally too late.

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Miller
From: David Woodhouse 
Date: Wed, 28 Oct 2015 22:57:12 +0900

> On Wed, 2015-10-28 at 07:07 -0700, David Miller wrote:
>> In the sparc64 case, the 64-bit DMA address space is divided into
>> IOMMU translated and non-IOMMU translated.
>> 
>> You just set the high bits differently depending upon what you want.
> 
> Wait, does that mean a (rogue) device could *always* get full access to
> physical memory just by setting the high bits appropriately? That
> mapping is *always* available?

The IOMMU supports several modes, one of which disallows passthrough
and this is what you would use in a virtual guest.


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Benjamin Herrenschmidt
On Wed, 2015-10-28 at 22:31 +0900, David Woodhouse wrote:
> We have an option in the Intel IOMMU for pass-through mode too, which
> basically *is* a total bypass. In practice, what's the difference
> between that and a "simple translation that does not require any
> [translation]"? We set up a full 1:1 mapping of all memory, and then
> the map/unmap methods become no-ops.
> 
> Currently we have no way to request that mode on a per-device basis;
> we
> only have 'iommu=pt' on the command line to set it for *all* devices.
> But performance-sensitive devices might want it, while we keep doing
> proper translation for others.

On Power, I generally have 2 IOMMU windows for a device: the one at the
bottom is remapped and is generally used for 32-bit devices, and the one
at the top is set up as a bypass (or, in the case of KVM, as a linear
mapping of guest memory which looks the same as a bypass to the guest).

The DMA ops will automatically hit the appropriate one based on the
DMA mask.
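
Simplified, the window selection amounts to something like this (a
sketch of the idea, not the literal powerpc code; iommu_window_map()
is a hypothetical stand-in for the translated path):

#include <linux/device.h>
#include <linux/dma-mapping.h>

extern dma_addr_t iommu_window_map(struct device *dev, phys_addr_t paddr);

static bool use_bypass_window(struct device *dev, u64 bypass_top)
{
	/* A device with a full 64-bit DMA mask can address the high
	 * bypass window; 32-bit devices can only reach the remapped
	 * window at the bottom. */
	return dev->dma_mask && *dev->dma_mask >= bypass_top;
}

static dma_addr_t power_map(struct device *dev, phys_addr_t paddr,
			    u64 bypass_base, u64 bypass_top)
{
	if (use_bypass_window(dev, bypass_top))
		return bypass_base + paddr;	/* linear, no table walk */

	return iommu_window_map(dev, paddr);	/* translated window */
}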

I don't see how that attribute would work for us.

Cheers,
Ben.



Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread David Woodhouse
On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:

> On Power, I generally have 2 IOMMU windows for a device, one at the
> bottom is remapped, and is generally used for 32-bit devices and the
> one at the top is set up as a bypass 

So in the normal case of decent 64-bit devices (and not in a VM),
they'll *already* be using the bypass region and have full access to
all of memory, all of the time? And you have no protection against
driver and firmware bugs causing stray DMA?

> I don't see how that attribute would work for us.

Because you're already doing it anyway without being asked :)

If SPARC and POWER are both doing that, perhaps we should change the
default for Intel too?

Aside from the lack of security, the other disadvantage of that is that
you have to pin *all* pages of a guest in case DMA happens; you don't
get to pin *only* those pages which are referenced by that guest's
IOMMU page tables...

Maybe we should at least coordinate IOMMU 'paranoid/fast' modes across
architectures, and then the DMA_ATTR_IOMMU_BYPASS flag would have a
sane meaning in the paranoid mode (and perhaps we'd want an
ultra-paranoid mode where it's not honoured).

-- 
dwmw2



