Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-17 Thread Marek Szyprowski

Hello,

I'm really sorry for joining this discussion so late, but I was terribly 
busy with other things.


On 2015-10-30 02:17, Daniel Kurtz wrote:

+linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
v4l2-contig's usage of the DMA API.

Hi Robin,

On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy  wrote:

On 26/10/15 13:44, Yong Wu wrote:

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
[...]

+/*
+ * The DMA API client is passing in a scatterlist which could describe
+ * any old buffer layout, but the IOMMU API requires everything to be
+ * aligned to IOMMU pages. Hence the need for this complicated bit of
+ * impedance-matching, to be able to hand off a suitably-aligned list,
+ * but still preserve the original offsets and sizes for the caller.
+ */
+int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
+   int nents, int prot)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iova_domain *iovad = domain->iova_cookie;
+   struct iova *iova;
+   struct scatterlist *s, *prev = NULL;
+   dma_addr_t dma_addr;
+   size_t iova_len = 0;
+   int i;
+
+   /*
+* Work out how much IOVA space we need, and align the segments to
+* IOVA granules for the IOMMU driver to handle. With some clever
+* trickery we can modify the list in-place, but reversibly, by
+* hiding the original data in the as-yet-unused DMA fields.
+*/
+   for_each_sg(sg, s, nents, i) {
+   size_t s_offset = iova_offset(iovad, s->offset);
+   size_t s_length = s->length;
+
+   sg_dma_address(s) = s->offset;
+   sg_dma_len(s) = s_length;
+   s->offset -= s_offset;
+   s_length = iova_align(iovad, s_length + s_offset);
+   s->length = s_length;
+
+   /*
+* The simple way to avoid the rare case of a segment
+* crossing the boundary mask is to pad the previous one
+* to end at a naturally-aligned IOVA for this one's size,
+* at the cost of potentially over-allocating a little.
+*/
+   if (prev) {
+   size_t pad_len = roundup_pow_of_two(s_length);
+
+   pad_len = (pad_len - iova_len) & (pad_len - 1);
+   prev->length += pad_len;


Hi Robin,
During our v4l2 testing, it seems that we met a problem here.
Here we update prev->length again; do we need to update
sg_dma_len(prev) again too?

Some functions, like vb2_dc_get_contiguous_size[1], always read
sg_dma_len(s) for comparison instead of s->length, so they may break
unexpectedly when sg_dma_len(s) is not the same as s->length.


This is just tweaking the faked-up length that we hand off to iommu_map_sg()
(see also the iova_align() above), to trick it into bumping this segment up
to a suitable starting IOVA. The real length at this point is stashed in
sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
both will hold the same true length once we return to the caller.

Yes, it does mean that if you have a list where the segment lengths are page
aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
still end up with a gap between the second and third segments, but that's
fine because the DMA API offers no guarantees about what the resulting DMA
addresses will be (consider the no-IOMMU case where they would each just be
"mapped" to their physical address). If that breaks v4l, then it's probably
v4l's DMA API use that needs looking at (again).
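To make the arithmetic concrete, here is a minimal stand-alone sketch (plain
userspace C, with roundup_pow_of_two() reimplemented for illustration) that
runs the quoted padding loop over the {64k, 16k, 64k} example:

#include <stdio.h>
#include <stddef.h>

/* simplified stand-in for the kernel's roundup_pow_of_two() */
static size_t roundup_pow_of_two(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int main(void)
{
	size_t lengths[] = { 64 << 10, 16 << 10, 64 << 10 };
	size_t iova_len = 0;
	size_t i;

	for (i = 0; i < 3; i++) {
		size_t s_length = lengths[i];

		if (i > 0) {
			/* pad the previous segment so this one starts at a
			 * naturally-aligned IOVA for its size */
			size_t pad_len = roundup_pow_of_two(s_length);

			pad_len = (pad_len - iova_len) & (pad_len - 1);
			iova_len += pad_len;
		}
		printf("segment %zu: IOVA offset 0x%zx, real length 0x%zx\n",
		       i, iova_len, s_length);
		iova_len += s_length;
	}
	printf("total IOVA span: 0x%zx\n", iova_len);
	return 0;
}

This prints offsets 0x0, 0x10000 and 0x20000: the 16k segment gets padded out
to 64k, which is exactly the gap between the second and third segments
described above.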

Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.
IIUC, the change above breaks this model by inserting gaps in how the
buffer is mapped to device memory, such that the buffer is no longer
contiguous in dma address space.

Here is the code in question from
drivers/media/v4l2-core/videobuf2-dma-contig.c :

static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
{
 struct scatterlist *s;
 dma_addr_t expected = sg_dma_address(sgt->sgl);
 unsigned int i;
 unsigned long size = 0;

 for_each_sg(sgt->sgl, s, sgt->nents, i) {
 if (sg_dma_address(s) != expected)
 break;
 expected = sg_dma_address(s) + sg_dma_len(s);
 size += sg_dma_len(s);
 }
 return size;
}


static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
 unsigned long size, enum dma_data_direction dma_dir)
{
 struct vb2_dc_conf *conf = alloc_ctx;
 struct vb2_dc_buf *buf;
 struct frame_vector *vec;
 unsigned long offset;
 int n_pages, i;
 int ret = 0;
 

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-09 Thread Robin Murphy

On 04/11/15 05:12, Tomasz Figa wrote:

On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy  wrote:

Hi Tomasz,

On 02/11/15 13:43, Tomasz Figa wrote:


I'd like to know what the boundary mask is and what hardware imposes
requirements like this. The cost here is not only over-allocating a
little, but making many, many buffers that are contiguously mappable
by the CPU impossible to map contiguously in the IOMMU, which just
defeats the purpose of having an IOMMU; I believe it should be there
precisely so that simple IP blocks taking one DMA address can view the
buffer the same way as the CPU.



The expectation with dma_map_sg() is that you're either going to be
iterating over the buffer segments, handing off each address to the device
to process one by one;


My understanding of a scatterlist was that it represents a buffer as a
whole, by joining together its physically discontinuous segments.


It can, but there are also cases where a single scatterlist is used to 
batch up multiple I/O requests - see the stuff in block/blk-merge.c as 
described in section 2.2 of Documentation/biodoc.txt, and AFAICS anyone 
could quite happily use the dmaengine API, and possibly others, in the 
same way. Ultimately a scatterlist is no more specific than "a list of 
blocks of physical memory that each want giving a DMA address".



I don't see how single segments (layout of which is completely up to
the allocator; often just single pages) would be usable for hardware
that needs to do some work more serious than just writing a byte
stream continuously to subsequent buffers. In case of such simple
devices you don't even need an IOMMU (for means other than protection
and/or getting over address space limitations).

However, IMHO the most important use case of an IOMMU is to make
buffers, which are contiguous in CPU virtual address space (VA),
contiguous in device's address space (IOVA). Your implementation of
dma_map_sg() effectively breaks this ability, so I'm not really
following why it's located under drivers/iommu and supposed to be used
with IOMMU-enabled platforms...


or you have a scatter-gather-capable device, in which
case you hand off the whole list at once.


No need for mapping ability of the IOMMU here as well (except for
working around address space issues, as I mentioned above).


Ok, now I'm starting to wonder if you're wilfully choosing to miss the 
point. Look at 64-bit systems of any architecture, and those address 
space issues are pretty much the primary consideration for including an 
IOMMU in the first place (behind virtualisation, which we can forget 
about here). Take the Juno board on my desk - most of the peripherals 
cannot address 75% of the RAM, and CPU bounce buffers are both not 
overly efficient and a limited resource (try using dmatest with 
sufficiently large buffers to stress/measure memory bandwidth and watch 
it take down the kernel, and that's without any other SWIOTLB 
contention). The only one that really cares at all about contiguous 
buffers is the HDLCD, but that's perfectly happy when it calls 
dma_alloc_coherent() via drm_fb_cma_helper and pulls a contiguous 8MB 
framebuffer out of thin air, without even knowing that CMA itself is 
disabled and it couldn't natively address 75% of the memory that might 
be backing that buffer.


That last point also illustrates that the thing for providing 
DMA-contiguous buffers is indeed very good at providing DMA-contiguous 
buffers when backed by an IOMMU.



It's in the latter case where you
have to make sure the list doesn't exceed the hardware limitations of that
device. I believe the original concern was disk controllers (the
introduction of dma_parms seems to originate from the linux-scsi list), but
most scatter-gather engines are going to have some limit on how much they
can handle per entry (IMO the dmaengine drivers are the easiest example to
look at).

Segment boundaries are a little more arcane, but my assumption is that they
relate to the kind of devices whose addressing is not flat but relative to
some separate segment register (The "64-bit" mode of USB EHCI is one
concrete example I can think of) - since you cannot realistically change the
segment register while the device is in the middle of accessing a single
buffer entry, that entry must not fall across a segment boundary or at some
point the device's accesses are going to overflow the offset address bits
and wrap around to bogus addresses at the bottom of the segment.


The two requirements above sound like something really specific to
scatter-gather-capable hardware, which, as I pointed out above, barely
needs an IOMMU (at least its mapping capabilities). We are talking here
about very IOMMU-specific code, though...

Now, while I see that on some systems an IOMMU might be used for
improving protection and working around addressing issues with
SG-capable hardware, the code shouldn't be breaking the majority of
systems, where the IOMMU is the only possible way to make physically
discontinuous memory appear (IO-virtually) contiguous to devices
incapable of scatter-gather.
Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-04 Thread Russell King - ARM Linux
On Wed, Nov 04, 2015 at 02:15:41PM +0900, Tomasz Figa wrote:
> On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux
>  wrote:
> > On Tue, Nov 03, 2015 at 05:41:24PM +, Robin Murphy wrote:
> >> Hi Tomasz,
> >>
> >> On 02/11/15 13:43, Tomasz Figa wrote:
> >> >Agreed. The dma_map_*() API is not guaranteed to return a single
> >> >contiguous part of virtual address space for any given SG list.
> >> >However it was understood to be able to map buffers contiguously
> >> >mappable by the CPU into a single segment and users,
> >> >videobuf2-dma-contig in particular, relied on this.
> >>
> >> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> >> mappable contiguously by the CPU; it's clearly impossible for the streaming
> >> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> >> to the presence or otherwise of an IOMMU.
> >
> > Tomasz's use of "virtual address space" above in combination with the
> > DMA API is really confusing.
> 
> I suppose I must have mistakenly used "virtual address space" somewhere
> instead of "IO virtual address space". I'm sorry for causing
> confusion.
> 
> The thing being discussed here is mapping of buffers described by
> scatterlists into IO virtual address space, i.e. the operation
> happening when dma_map_sg() is called for an IOMMU-enabled device.

... and there, it's perfectly legal for an IOMMU to merge all entries
in a scatterlist into one mapping - so dma_map_sg() would return 1.

What that means is that the scatterlist still contains the original
number of entries, describing the CPU view of the buffer list, while
the DMA device view of the same buffer uses just the first entry.

In other words, if you're walking a scatterlist, and doing a mixture of
DMA and PIO, you can't assume that if you're at scatterlist entry N for
DMA, you can switch to PIO for entry N and you'll write to the same
memory.  (I know that there are badly written drivers in the kernel which
unfortunately do make this assumption, and if they're used in the
presence of an IOMMU, they _will_ be silently data corrupting.)
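A minimal sketch of the contract being described here, assuming a
scatter-gather-capable device (hw_submit() is a hypothetical hardware
helper; the rest is the standard API):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static int submit_buffer(struct device *dev, struct scatterlist *sgl,
			 int nents, enum dma_data_direction dir)
{
	struct scatterlist *s;
	int i, count;

	count = dma_map_sg(dev, sgl, nents, dir);
	if (!count)
		return -ENOMEM;		/* mapping is allowed to fail */

	/*
	 * count may be smaller than nents if the IOMMU merged entries;
	 * only the first count entries carry valid DMA addresses, so
	 * iterate over those and use only the sg_dma_*() accessors.
	 */
	for_each_sg(sgl, s, count, i)
		hw_submit(dev, sg_dma_address(s), sg_dma_len(s));

	return 0;
}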

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-04 Thread Tomasz Figa
On Wed, Nov 4, 2015 at 6:27 PM, Russell King - ARM Linux
 wrote:
> On Wed, Nov 04, 2015 at 02:12:03PM +0900, Tomasz Figa wrote:
>> My understanding of a scatterlist was that it represents a buffer as a
>> whole, by joining together its physically discontinuous segments.
>
> Correct, and it may also be scattered in CPU virtual space as well.
>
>> I don't see how single segments (layout of which is completely up to
>> the allocator; often just single pages) would be usable for hardware
>> that needs to do some work more serious than just writing a byte
>> stream continuously to subsequent buffers. In case of such simple
>> devices you don't even need an IOMMU (for means other than protection
>> and/or getting over address space limitations).
>
> All that's required is that the addresses described in the scatterlist
> are accessed as an apparently contiguous series of bytes.  They don't
> have to be contiguous in any address view, provided the device access
> appears to be contiguous.  How that is done is really neither here nor
> there.
>
> IOMMUs are normally there as an address translator - for example, the
> underlying device may not have the capability to address a scatterlist
> (eg, because it makes effectively random access) and in order to be
> accessible to the device, it needs to be made contiguous in device
> address space.
>
> Another scenario is that you have more bits of physical address than
> a device can generate itself for DMA purposes, and you need an IOMMU
> to create a (possibly scattered) mapping in device address space
> within the ability of the device to address.
>
> The requirements here depend on the device behind the IOMMU.

I fully agree with you.

The problem is that the code being discussed here breaks the case of
devices that don't have the capability of addressing a scatterlist,
supposedly for the sake of devices that have such capability (but as I
suggested, they both could be happily supported, by distinguishing
special values of DMA max segment size and boundary mask).

>> However, IMHO the most important use case of an IOMMU is to make
>> buffers, which are contiguous in CPU virtual address space (VA),
>> contiguous in device's address space (IOVA).
>
> No - there is no requirement for CPU virtual contiguous buffers to also
> be contiguous in the device address space.

There is no requirement, but shouldn't it be desired for the mapping
code to map them as such? Otherwise, how could the IOMMU use case you
described above (address translator for devices which don't have the
capability to address a scatterlist) be handled properly?

Is the general conclusion now that dma_map_sg() should not be used to
create IOMMU mappings and we should make a step backwards making all
drivers (or frameworks, such as videobuf2) do that manually? That
would be really backwards, because code not aware of IOMMU existence
at all would have to become aware of it.

Best regards,
Tomasz


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-04 Thread Russell King - ARM Linux
On Wed, Nov 04, 2015 at 06:48:50PM +0900, Tomasz Figa wrote:
> There is no requirement, but shouldn't it be desired for the mapping
> code to map them as such? Otherwise, how could the IOMMU use case you
> described above (address translator for devices which don't have the
> capability to address a scatterlist) be handled properly?

It's up to the IOMMU code to respect the parameters that the device has
supplied to it via the device_dma_parameters.  This doesn't currently
allow a device to say "I want this scatterlist to be mapped as a
contiguous device address", so really if a device has such a requirement,
at the moment the device driver _must_ check the dma_map_sg() return
value and act accordingly.

While it's possible to say "an IOMMU should map as a single contiguous
address" what happens when the IOMMU's device address space becomes
fragmented?
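For reference, the parameters in question are advertised like this (a
sketch; it assumes dev->dma_parms has already been allocated, e.g. by the
PCI core or by the driver itself, and the limits shown are made up):

#include <linux/dma-mapping.h>
#include <linux/sizes.h>

static int foo_set_dma_limits(struct device *dev)
{
	int ret;

	/* no scatterlist entry may be larger than 64K ... */
	ret = dma_set_max_seg_size(dev, SZ_64K);
	if (ret)
		return ret;

	/* ... and no entry may cross a 4GB boundary */
	return dma_set_seg_boundary(dev, DMA_BIT_MASK(32));
}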

> Is the general conclusion now that dma_map_sg() should not be used to
> create IOMMU mappings and we should make a step backwards making all
> drivers (or frameworks, such as videobuf2) do that manually? That
> would be really backwards, because code not aware of IOMMU existence
> at all would have to become aware of it.

No.  The DMA API has always had the responsibility for managing the
IOMMU device, which may well be shared between multiple different
devices.

However, if the IOMMU is part of a device IP block (such as a GPU)
then the decision on whether the DMA API should be used or not is up
to the driver author.  If it has special management requirements,
then it's probably appropriate for the device driver to manage it by
itself.

For example, a GPU's MMU may need something inserted into the GPU's
command stream to flush the MMU TLBs.  Such cases are inappropriate
to be using the DMA API for IOMMU management.

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-03 Thread Robin Murphy

Hi Tomasz,

On 02/11/15 13:43, Tomasz Figa wrote:

I'd like to know what the boundary mask is and what hardware imposes
requirements like this. The cost here is not only over-allocating a
little, but making many, many buffers that are contiguously mappable
by the CPU impossible to map contiguously in the IOMMU, which just
defeats the purpose of having an IOMMU; I believe it should be there
precisely so that simple IP blocks taking one DMA address can view the
buffer the same way as the CPU.


The expectation with dma_map_sg() is that you're either going to be 
iterating over the buffer segments, handing off each address to the 
device to process one by one; or you have a scatter-gather-capable 
device, in which case you hand off the whole list at once. It's in the 
latter case where you have to make sure the list doesn't exceed the 
hardware limitations of that device. I believe the original concern was 
disk controllers (the introduction of dma_parms seems to originate from 
the linux-scsi list), but most scatter-gather engines are going to have 
some limit on how much they can handle per entry (IMO the dmaengine 
drivers are the easiest example to look at).


Segment boundaries are a little more arcane, but my assumption is that 
they relate to the kind of devices whose addressing is not flat but 
relative to some separate segment register (The "64-bit" mode of USB 
EHCI is one concrete example I can think of) - since you cannot 
realistically change the segment register while the device is in the 
middle of accessing a single buffer entry, that entry must not fall 
across a segment boundary or at some point the device's accesses are 
going to overflow the offset address bits and wrap around to bogus 
addresses at the bottom of the segment.
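Stated precisely, a segment [addr, addr + len) must not span a
(boundary_mask + 1)-sized window, i.e. its first and last bytes must carry
the same high-order "segment number" bits. A stand-alone sketch of the
usual form of that check:

#include <stdio.h>
#include <stdint.h>

/* true if [addr, addr + len) crosses a (boundary_mask + 1) boundary */
static int crosses_boundary(uint64_t addr, uint64_t len,
			    uint64_t boundary_mask)
{
	return ((addr ^ (addr + len - 1)) & ~boundary_mask) != 0;
}

int main(void)
{
	uint64_t mask = 0xffffffffULL;	/* 4GB windows, as in "64-bit" EHCI */

	/* starts just below a 4GB line and spills across it: would wrap */
	printf("%d\n", crosses_boundary(0xfffff000ULL, 0x2000, mask));	/* 1 */
	/* starts exactly on a 4GB line: fits within one window */
	printf("%d\n", crosses_boundary(0x100000000ULL, 0x2000, mask));	/* 0 */
	return 0;
}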


Now yes, it will be possible under _most_ circumstances to use an IOMMU 
to lay out a list of segments with page-aligned lengths within a single 
IOVA allocation whilst still meeting all the necessary constraints. It 
just needs some unavoidably complicated calculations - quite likely 
significantly more complex than my v5 version of map_sg() that tried to 
do that and merge segments but failed to take the initial alignment into 
account properly - since there are much simpler ways to enforce just the 
_necessary_ behaviour for the DMA API, I put the complicated stuff to 
one side for now to prevent it holding up getting the basic functional 
support in place.



Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.



Not at all. The streaming DMA API (dma_map_* and friends) has two 
responsibilities: performing any necessary cache maintenance to ensure the 
device will correctly see data from the CPU, and the CPU will correctly see 
data from the device; and working out an address for that buffer from the 
device's point of view to actually hand off to the hardware (which is perfectly 
well allowed to fail).


Agreed. The dma_map_*() API is not guaranteed to return a single
contiguous part of virtual address space for any given SG list.
However it was understood to be able to map buffers contiguously
mappable by the CPU into a single segment and users,
videobuf2-dma-contig in particular, relied on this.


I don't follow that - _any_ buffer made of page-sized chunks is going to 
be mappable contiguously by the CPU; it's clearly impossible for the 
streaming DMA API itself to offer such a guarantee, because it's 
entirely orthogonal to the presence or otherwise of an IOMMU.


Furthermore, I can't see any existing dma_map_sg implementation (between 
arm/64 and x86, at least), that _won't_ break that expectation under 
certain conditions (ranging from "relatively pathological" to "always"), 
so it still seems questionable to have a dependency on it.



Consider SWIOTLB's implementation - segments which already lie at physical 
addresses within the device's DMA mask just get passed through, while those 
that lie outside it get mapped into the bounce buffer, but still as individual 
allocations (arch code just handles cache maintenance on the resulting physical 
addresses and can apply any hard-wired DMA offset for the device concerned).


And this is fine for vb2-dma-contig, which was made for devices that
require buffers contiguous in their address space. Without an IOMMU it
will allow only physically contiguous buffers and fail otherwise, which
is fine, because that is a hardware requirement.


If it depends on having contiguous-from-the-device's-view DMA buffers 
either way, that's a sign it should perhaps be using the coherent DMA 
API instead, which _does_ give such a guarantee. I'm well aware of the 
"but the noncacheable mappings make userspace access unacceptably slow!" 
issue many folks have with that, though, and don't particularly fancy 
going off on that tangent here.
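In sketch form, the coherent-API alternative being referred to (standard
calls; the point is that one CPU pointer maps to one device-contiguous DMA
address by contract):

#include <linux/dma-mapping.h>

static void *alloc_device_contiguous(struct device *dev, size_t size,
				     dma_addr_t *dma)
{
	return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
}

static void free_device_contiguous(struct device *dev, size_t size,
				   void *cpu_addr, dma_addr_t dma)
{
	dma_free_coherent(dev, size, cpu_addr, dma);
}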



IIUC, the change above breaks this model by inserting gaps in how the
buffer is mapped to device memory, such that the buffer is no longer
contiguous in dma address space.

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-03 Thread Russell King - ARM Linux
On Tue, Nov 03, 2015 at 05:41:24PM +, Robin Murphy wrote:
> Hi Tomasz,
> 
> On 02/11/15 13:43, Tomasz Figa wrote:
> >Agreed. The dma_map_*() API is not guaranteed to return a single
> >contiguous part of virtual address space for any given SG list.
> >However it was understood to be able to map buffers contiguously
> >mappable by the CPU into a single segment and users,
> >videobuf2-dma-contig in particular, relied on this.
> 
> I don't follow that - _any_ buffer made of page-sized chunks is going to be
> mappable contiguously by the CPU; it's clearly impossible for the streaming
> DMA API itself to offer such a guarantee, because it's entirely orthogonal
> to the presence or otherwise of an IOMMU.

Tomasz's use of "virtual address space" above in combination with the
DMA API is really confusing.

dma_map_sg() does *not* construct a CPU view of the passed scatterlist.
The only thing dma_map_sg() might do with virtual addresses is to use
them as a way to achieve cache coherence for one particular view of
that memory, that being the kernel's own lowmem mapping and any kmaps.
It doesn't extend to vmalloc() or userspace mappings of the memory.

If the scatterlist is converted to an array of struct page pointers,
it's possible to map it with vmap(), but it's implementation defined
whether such a mapping will receive cache maintenance as part of the
DMA API or not.  (If you have PIPT caches, it will; if they're VIPT
caches, maybe not.)

There is a separate set of calls to deal with the flushing issues for
vmap()'d memory in this case - see flush_kernel_vmap_range() and
invalidate_kernel_vmap_range().
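A sketch of that separate maintenance around a vmap() alias (the pages
themselves still go through the usual dma_map_*() calls; only the alias
handling is shown):

#include <linux/highmem.h>

static void cpu_wrote_via_vmap(void *vaddr, int len)
{
	/* push dirty CPU lines in the alias before the device reads */
	flush_kernel_vmap_range(vaddr, len);
}

static void device_wrote_via_dma(void *vaddr, int len)
{
	/* discard stale CPU lines in the alias before the CPU reads */
	invalidate_kernel_vmap_range(vaddr, len);
}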

-- 
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-03 Thread Tomasz Figa
On Wed, Nov 4, 2015 at 2:41 AM, Robin Murphy  wrote:
> Hi Tomasz,
>
> On 02/11/15 13:43, Tomasz Figa wrote:
>>
>> I'd like to know what the boundary mask is and what hardware imposes
>> requirements like this. The cost here is not only over-allocating a
>> little, but making many, many buffers that are contiguously mappable
>> by the CPU impossible to map contiguously in the IOMMU, which just
>> defeats the purpose of having an IOMMU; I believe it should be there
>> precisely so that simple IP blocks taking one DMA address can view the
>> buffer the same way as the CPU.
>
>
> The expectation with dma_map_sg() is that you're either going to be
> iterating over the buffer segments, handing off each address to the device
> to process one by one;

My understanding of a scatterlist was that it represents a buffer as a
whole, by joining together its physically discontinuous segments.

I don't see how single segments (layout of which is completely up to
the allocator; often just single pages) would be usable for hardware
that needs to do some work more serious than just writing a byte
stream continuously to subsequent buffers. In case of such simple
devices you don't even need an IOMMU (for means other than protection
and/or getting over address space limitations).

However, IMHO the most important use case of an IOMMU is to make
buffers, which are contiguous in CPU virtual address space (VA),
contiguous in device's address space (IOVA). Your implementation of
dma_map_sg() effectively breaks this ability, so I'm not really
following why it's located under drivers/iommu and supposed to be used
with IOMMU-enabled platforms...

> or you have a scatter-gather-capable device, in which
> case you hand off the whole list at once.

No need for mapping ability of the IOMMU here as well (except for
working around address space issues, as I mentioned above).

> It's in the latter case where you
> have to make sure the list doesn't exceed the hardware limitations of that
> device. I believe the original concern was disk controllers (the
> introduction of dma_parms seems to originate from the linux-scsi list), but
> most scatter-gather engines are going to have some limit on how much they
> can handle per entry (IMO the dmaengine drivers are the easiest example to
> look at).
>
> Segment boundaries are a little more arcane, but my assumption is that they
> relate to the kind of devices whose addressing is not flat but relative to
> some separate segment register (The "64-bit" mode of USB EHCI is one
> concrete example I can think of) - since you cannot realistically change the
> segment register while the device is in the middle of accessing a single
> buffer entry, that entry must not fall across a segment boundary or at some
> point the device's accesses are going to overflow the offset address bits
> and wrap around to bogus addresses at the bottom of the segment.

The two requirements above sound like something really specific to
scatter-gather-capable hardware, which, as I pointed out above, barely
needs an IOMMU (at least its mapping capabilities). We are talking here
about very IOMMU-specific code, though...

Now, while I see that on some systems an IOMMU might be used for
improving protection and working around addressing issues with
SG-capable hardware, the code shouldn't be breaking the majority of
systems, where the IOMMU is the only possible way to make physically
discontinuous memory appear (IO-virtually) contiguous to devices
incapable of scatter-gather.

>
> Now yes, it will be possible under _most_ circumstances to use an IOMMU to
> lay out a list of segments with page-aligned lengths within a single IOVA
> allocation whilst still meeting all the necessary constraints. It just needs
> some unavoidably complicated calculations - quite likely significantly more
> complex than my v5 version of map_sg() that tried to do that and merge
> segments but failed to take the initial alignment into account properly -
> since there are much simpler ways to enforce just the _necessary_ behaviour
> for the DMA API, I put the complicated stuff to one side for now to prevent
> it holding up getting the basic functional support in place.

Somehow, whatever is currently done in arch/arm/mm/dma-mapping.c was
sufficient and not overly complicated.

See http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1547 .

I can see that the code there at least tries to comply with the maximum
segment size constraint. The segment boundary seems to be ignored, though.
However, I'm convinced that in most (if not all) cases where IOMMU
IOVA-contiguous mapping is needed, those two requirements don't exist.
Do we really have to break the good hardware only because the
bad^Wlimited one is broken?

Couldn't we preserve the ARM-like behavior whenever
dma_parms->segment_boundary_mask is set to all 1s and
dma_parms->max_segment_size to UINT_MAX (which is what drivers
currently set) or 0 (which sounds more logical as the meaning of "no
maximum given")?

> 

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-03 Thread Tomasz Figa
On Wed, Nov 4, 2015 at 3:40 AM, Russell King - ARM Linux
 wrote:
> On Tue, Nov 03, 2015 at 05:41:24PM +, Robin Murphy wrote:
>> Hi Tomasz,
>>
>> On 02/11/15 13:43, Tomasz Figa wrote:
>> >Agreed. The dma_map_*() API is not guaranteed to return a single
>> >contiguous part of virtual address space for any given SG list.
>> >However it was understood to be able to map buffers contiguously
>> >mappable by the CPU into a single segment and users,
>> >videobuf2-dma-contig in particular, relied on this.
>>
>> I don't follow that - _any_ buffer made of page-sized chunks is going to be
>> mappable contiguously by the CPU; it's clearly impossible for the streaming
>> DMA API itself to offer such a guarantee, because it's entirely orthogonal
>> to the presence or otherwise of an IOMMU.
>
> Tomasz's use of "virtual address space" above in combination with the
> DMA API is really confusing.

I suppose I must have mistakenly used "virtual address space" somewhere
instead of "IO virtual address space". I'm sorry for causing
confusion.

The thing being discussed here is mapping of buffers described by
scatterlists into IO virtual address space, i.e. the operation
happening when dma_map_sg() is called for an IOMMU-enabled device.

Best regards,
Tomasz


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-02 Thread Tomasz Figa
On Mon, Nov 2, 2015 at 10:11 PM, Daniel Kurtz  wrote:
>
> +Tomasz, so he can reply to the thread
> +Marek and Russell as recommended by Tomasz
>
> On Oct 30, 2015 22:27, "Robin Murphy"  wrote:
> >
> > Hi Dan,
> >
> > On 30/10/15 01:17, Daniel Kurtz wrote:
> >>
> >> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
> >> v4l2-contig's usage of the DMA API.
> >>
> >> Hi Robin,
> >>
> >> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy  
> >> wrote:
> >>>
> >>> On 26/10/15 13:44, Yong Wu wrote:
> 
> 
>  On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>  [...]
> >
> >
> > +/*
> > + * The DMA API client is passing in a scatterlist which could describe
> > + * any old buffer layout, but the IOMMU API requires everything to be
> > + * aligned to IOMMU pages. Hence the need for this complicated bit of
> > + * impedance-matching, to be able to hand off a suitably-aligned list,
> > + * but still preserve the original offsets and sizes for the caller.
> > + */
> > +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> > +   int nents, int prot)
> > +{
> > +   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> > +   struct iova_domain *iovad = domain->iova_cookie;
> > +   struct iova *iova;
> > +   struct scatterlist *s, *prev = NULL;
> > +   dma_addr_t dma_addr;
> > +   size_t iova_len = 0;
> > +   int i;
> > +
> > +   /*
> > +* Work out how much IOVA space we need, and align the segments to
> > +* IOVA granules for the IOMMU driver to handle. With some clever
> > +* trickery we can modify the list in-place, but reversibly, by
> > +* hiding the original data in the as-yet-unused DMA fields.
> > +*/
> > +   for_each_sg(sg, s, nents, i) {
> > +   size_t s_offset = iova_offset(iovad, s->offset);
> > +   size_t s_length = s->length;
> > +
> > +   sg_dma_address(s) = s->offset;
> > +   sg_dma_len(s) = s_length;
> > +   s->offset -= s_offset;
> > +   s_length = iova_align(iovad, s_length + s_offset);
> > +   s->length = s_length;
> > +
> > +   /*
> > +* The simple way to avoid the rare case of a segment
> > +* crossing the boundary mask is to pad the previous one
> > +* to end at a naturally-aligned IOVA for this one's size,
> > +* at the cost of potentially over-allocating a little.

I'd like to know what the boundary mask is and what hardware imposes
requirements like this. The cost here is not only over-allocating a
little, but making many, many buffers that are contiguously mappable
by the CPU impossible to map contiguously in the IOMMU, which just
defeats the purpose of having an IOMMU; I believe it should be there
precisely so that simple IP blocks taking one DMA address can view the
buffer the same way as the CPU.

> > +*/
> > +   if (prev) {
> > +   size_t pad_len = roundup_pow_of_two(s_length);
> > +
> > +   pad_len = (pad_len - iova_len) & (pad_len - 1);
> > +   prev->length += pad_len;
> 
> 
> 
>  Hi Robin,
>  During our v4l2 testing, it seems that we met a problem here.
>  Here we update prev->length again; do we need to update
>  sg_dma_len(prev) again too?
> 
>  Some functions, like vb2_dc_get_contiguous_size[1], always read
>  sg_dma_len(s) for comparison instead of s->length, so they may break
>  unexpectedly when sg_dma_len(s) is not the same as s->length.
> >>>
> >>>
> >>>
> >>> This is just tweaking the faked-up length that we hand off to 
> >>> iommu_map_sg()
> >>> (see also the iova_align() above), to trick it into bumping this segment 
> >>> up
> >>> to a suitable starting IOVA. The real length at this point is stashed in
> >>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), 
> >>> so
> >>> both will hold the same true length once we return to the caller.
> >>>
> >>> Yes, it does mean that if you have a list where the segment lengths are 
> >>> page
> >>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then 
> >>> you'll
> >>> still end up with a gap between the second and third segments, but that's
> >>> fine because the DMA API offers no guarantees about what the resulting DMA
> >>> addresses will be (consider the no-IOMMU case where they would each just 
> >>> be
> >>> "mapped" to their physical address). If that breaks v4l, then it's 
> >>> probably
> >>> v4l's DMA API use that needs looking at (again).
> >>
> >>
> >> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> >> memory pages into a contiguous block in device memory address space.

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-11-02 Thread Daniel Kurtz
+Tomasz, so he can reply to the thread
+Marek and Russell as recommended by Tomasz

On Oct 30, 2015 22:27, "Robin Murphy"  wrote:
>
> Hi Dan,
>
> On 30/10/15 01:17, Daniel Kurtz wrote:
>>
>> +linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
>> v4l2-contig's usage of the DMA API.
>>
>> Hi Robin,
>>
>> On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy  wrote:
>>>
>>> On 26/10/15 13:44, Yong Wu wrote:


 On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
 [...]
>
>
> +/*
> + * The DMA API client is passing in a scatterlist which could describe
> + * any old buffer layout, but the IOMMU API requires everything to be
> + * aligned to IOMMU pages. Hence the need for this complicated bit of
> + * impedance-matching, to be able to hand off a suitably-aligned list,
> + * but still preserve the original offsets and sizes for the caller.
> + */
> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> +   int nents, int prot)
> +{
> +   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> +   struct iova_domain *iovad = domain->iova_cookie;
> +   struct iova *iova;
> +   struct scatterlist *s, *prev = NULL;
> +   dma_addr_t dma_addr;
> +   size_t iova_len = 0;
> +   int i;
> +
> +   /*
> +* Work out how much IOVA space we need, and align the segments to
> +* IOVA granules for the IOMMU driver to handle. With some clever
> +* trickery we can modify the list in-place, but reversibly, by
> +* hiding the original data in the as-yet-unused DMA fields.
> +*/
> +   for_each_sg(sg, s, nents, i) {
> +   size_t s_offset = iova_offset(iovad, s->offset);
> +   size_t s_length = s->length;
> +
> +   sg_dma_address(s) = s->offset;
> +   sg_dma_len(s) = s_length;
> +   s->offset -= s_offset;
> +   s_length = iova_align(iovad, s_length + s_offset);
> +   s->length = s_length;
> +
> +   /*
> +* The simple way to avoid the rare case of a segment
> +* crossing the boundary mask is to pad the previous one
> +* to end at a naturally-aligned IOVA for this one's
> size,
> +* at the cost of potentially over-allocating a little.
> +*/
> +   if (prev) {
> +   size_t pad_len = roundup_pow_of_two(s_length);
> +
> +   pad_len = (pad_len - iova_len) & (pad_len - 1);
> +   prev->length += pad_len;



 Hi Robin,
 During our v4l2 testing, it seems that we met a problem here.
 Here we update prev->length again; do we need to update
 sg_dma_len(prev) again too?

 Some functions, like vb2_dc_get_contiguous_size[1], always read
 sg_dma_len(s) for comparison instead of s->length, so they may break
 unexpectedly when sg_dma_len(s) is not the same as s->length.
>>>
>>>
>>>
>>> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
>>> (see also the iova_align() above), to trick it into bumping this segment up
>>> to a suitable starting IOVA. The real length at this point is stashed in
>>> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
>>> both will hold the same true length once we return to the caller.
>>>
>>> Yes, it does mean that if you have a list where the segment lengths are page
>>> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
>>> still end up with a gap between the second and third segments, but that's
>>> fine because the DMA API offers no guarantees about what the resulting DMA
>>> addresses will be (consider the no-IOMMU case where they would each just be
>>> "mapped" to their physical address). If that breaks v4l, then it's probably
>>> v4l's DMA API use that needs looking at (again).
>>
>>
>> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
>> memory pages into a contiguous block in device memory address space.
>> This would allow passing a dma mapped buffer to device dma using just
>> a device address and length.
>
>
> Not at all. The streaming DMA API (dma_map_* and friends) has two 
> responsibilities: performing any necessary cache maintenance to ensure the 
> device will correctly see data from the CPU, and the CPU will correctly see 
> data from the device; and working out an address for that buffer from the 
> device's point of view to actually hand off to the hardware (which is 
> perfectly well allowed to fail).
>
> Consider SWIOTLB's implementation - segments which already lie at physical
> addresses within the device's DMA mask just get passed through, while those
> that lie outside it get mapped into the bounce buffer, but still as
> individual allocations.

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-30 Thread Joerg Roedel
On Fri, Oct 30, 2015 at 09:17:52AM +0800, Daniel Kurtz wrote:
> Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
> memory pages into a contiguous block in device memory address space.
> This would allow passing a dma mapped buffer to device dma using just
> a device address and length.

If you are speaking of the dma_map_sg interface, then there is absolutely
no guarantee from the API side that the buffers you pass in will end up
mapped contiguously.
IOMMU drivers handle this differently, and when there is no IOMMU at all
there is also no way to map these buffers together.

> So, is the videobuf2-dma-contig.c based on an incorrect assumption
> about how the DMA API is supposed to work?

If it makes the above assumption, then yes.



Joerg



Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-30 Thread Robin Murphy

Hi Dan,

On 30/10/15 01:17, Daniel Kurtz wrote:

+linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
v4l2-contig's usage of the DMA API.

Hi Robin,

On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy  wrote:

On 26/10/15 13:44, Yong Wu wrote:


On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
[...]


+/*
+ * The DMA API client is passing in a scatterlist which could describe
+ * any old buffer layout, but the IOMMU API requires everything to be
+ * aligned to IOMMU pages. Hence the need for this complicated bit of
+ * impedance-matching, to be able to hand off a suitably-aligned list,
+ * but still preserve the original offsets and sizes for the caller.
+ */
+int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
+   int nents, int prot)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iova_domain *iovad = domain->iova_cookie;
+   struct iova *iova;
+   struct scatterlist *s, *prev = NULL;
+   dma_addr_t dma_addr;
+   size_t iova_len = 0;
+   int i;
+
+   /*
+* Work out how much IOVA space we need, and align the segments to
+* IOVA granules for the IOMMU driver to handle. With some clever
+* trickery we can modify the list in-place, but reversibly, by
+* hiding the original data in the as-yet-unused DMA fields.
+*/
+   for_each_sg(sg, s, nents, i) {
+   size_t s_offset = iova_offset(iovad, s->offset);
+   size_t s_length = s->length;
+
+   sg_dma_address(s) = s->offset;
+   sg_dma_len(s) = s_length;
+   s->offset -= s_offset;
+   s_length = iova_align(iovad, s_length + s_offset);
+   s->length = s_length;
+
+   /*
+* The simple way to avoid the rare case of a segment
+* crossing the boundary mask is to pad the previous one
+* to end at a naturally-aligned IOVA for this one's size,
+* at the cost of potentially over-allocating a little.
+*/
+   if (prev) {
+   size_t pad_len = roundup_pow_of_two(s_length);
+
+   pad_len = (pad_len - iova_len) & (pad_len - 1);
+   prev->length += pad_len;



Hi Robin,
During our v4l2 testing, it seems that we met a problem here.
Here we update prev->length again; do we need to update
sg_dma_len(prev) again too?

Some functions, like vb2_dc_get_contiguous_size[1], always read
sg_dma_len(s) for comparison instead of s->length, so they may break
unexpectedly when sg_dma_len(s) is not the same as s->length.



This is just tweaking the faked-up length that we hand off to iommu_map_sg()
(see also the iova_align() above), to trick it into bumping this segment up
to a suitable starting IOVA. The real length at this point is stashed in
sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
both will hold the same true length once we return to the caller.

Yes, it does mean that if you have a list where the segment lengths are page
aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
still end up with a gap between the second and third segments, but that's
fine because the DMA API offers no guarantees about what the resulting DMA
addresses will be (consider the no-IOMMU case where they would each just be
"mapped" to their physical address). If that breaks v4l, then it's probably
v4l's DMA API use that needs looking at (again).


Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.


Not at all. The streaming DMA API (dma_map_* and friends) has two 
responsibilities: performing any necessary cache maintenance to ensure 
the device will correctly see data from the CPU, and the CPU will 
correctly see data from the device; and working out an address for that 
buffer from the device's point of view to actually hand off to the 
hardware (which is perfectly well allowed to fail).


Consider SWIOTLB's implementation - segments which already lie at 
physical addresses within the device's DMA mask just get passed through, 
while those that lie outside it get mapped into the bounce buffer, but 
still as individual allocations (arch code just handles cache 
maintenance on the resulting physical addresses and can apply any 
hard-wired DMA offset for the device concerned).



IIUC, the change above breaks this model by inserting gaps in how the
buffer is mapped to device memory, such that the buffer is no longer
contiguous in dma address space.


Even the existing arch/arm IOMMU DMA code which I guess this implicitly 
relies on doesn't guarantee that behaviour - if the mapping happens to 
reach one of the segment length/boundary limits it won't just 

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-30 Thread Mark Hounschell

On 10/30/2015 10:09 AM, Joerg Roedel wrote:

On Fri, Oct 30, 2015 at 09:17:52AM +0800, Daniel Kurtz wrote:

Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.


If you are speaking of the dma_map_sg interface, then there is absolutely
no guarantee from the API side that the buffers you pass in will end up
mapped contiguously.
IOMMU drivers handle this differently, and when there is no IOMMU at all
there is also no way to map these buffers together.



That is what CMA is for ya know. It makes it physically contiguous.

Mark


Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-29 Thread Daniel Kurtz
+linux-media & VIDEOBUF2 FRAMEWORK maintainers since this is about the
v4l2-contig's usage of the DMA API.

Hi Robin,

On Tue, Oct 27, 2015 at 12:55 AM, Robin Murphy  wrote:
> On 26/10/15 13:44, Yong Wu wrote:
>>
>> On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
>> [...]
>>>
>>> +/*
>>> + * The DMA API client is passing in a scatterlist which could describe
>>> + * any old buffer layout, but the IOMMU API requires everything to be
>>> + * aligned to IOMMU pages. Hence the need for this complicated bit of
>>> + * impedance-matching, to be able to hand off a suitably-aligned list,
>>> + * but still preserve the original offsets and sizes for the caller.
>>> + */
>>> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
>>> +   int nents, int prot)
>>> +{
>>> +   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
>>> +   struct iova_domain *iovad = domain->iova_cookie;
>>> +   struct iova *iova;
>>> +   struct scatterlist *s, *prev = NULL;
>>> +   dma_addr_t dma_addr;
>>> +   size_t iova_len = 0;
>>> +   int i;
>>> +
>>> +   /*
>>> +* Work out how much IOVA space we need, and align the segments to
>>> +* IOVA granules for the IOMMU driver to handle. With some clever
>>> +* trickery we can modify the list in-place, but reversibly, by
>>> +* hiding the original data in the as-yet-unused DMA fields.
>>> +*/
>>> +   for_each_sg(sg, s, nents, i) {
>>> +   size_t s_offset = iova_offset(iovad, s->offset);
>>> +   size_t s_length = s->length;
>>> +
>>> +   sg_dma_address(s) = s->offset;
>>> +   sg_dma_len(s) = s_length;
>>> +   s->offset -= s_offset;
>>> +   s_length = iova_align(iovad, s_length + s_offset);
>>> +   s->length = s_length;
>>> +
>>> +   /*
>>> +* The simple way to avoid the rare case of a segment
>>> +* crossing the boundary mask is to pad the previous one
>>> +* to end at a naturally-aligned IOVA for this one's size,
>>> +* at the cost of potentially over-allocating a little.
>>> +*/
>>> +   if (prev) {
>>> +   size_t pad_len = roundup_pow_of_two(s_length);
>>> +
>>> +   pad_len = (pad_len - iova_len) & (pad_len - 1);
>>> +   prev->length += pad_len;
>>
>>
>> Hi Robin,
>> During our v4l2 testing, it seems that we met a problem here.
>> Here we update prev->length again; do we need to update
>> sg_dma_len(prev) again too?
>>
>> Some functions, like vb2_dc_get_contiguous_size[1], always read
>> sg_dma_len(s) for comparison instead of s->length, so they may break
>> unexpectedly when sg_dma_len(s) is not the same as s->length.
>
>
> This is just tweaking the faked-up length that we hand off to iommu_map_sg()
> (see also the iova_align() above), to trick it into bumping this segment up
> to a suitable starting IOVA. The real length at this point is stashed in
> sg_dma_len(s), and will be copied back into s->length in __finalise_sg(), so
> both will hold the same true length once we return to the caller.
>
> Yes, it does mean that if you have a list where the segment lengths are page
> aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, then you'll
> still end up with a gap between the second and third segments, but that's
> fine because the DMA API offers no guarantees about what the resulting DMA
> addresses will be (consider the no-IOMMU case where they would each just be
> "mapped" to their physical address). If that breaks v4l, then it's probably
> v4l's DMA API use that needs looking at (again).

Hmm, I thought the DMA API maps a (possibly) non-contiguous set of
memory pages into a contiguous block in device memory address space.
This would allow passing a dma mapped buffer to device dma using just
a device address and length.
IIUC, the change above breaks this model by inserting gaps in how the
buffer is mapped to device memory, such that the buffer is no longer
contiguous in dma address space.

Here is the code in question from
drivers/media/v4l2-core/videobuf2-dma-contig.c :

static unsigned long vb2_dc_get_contiguous_size(struct sg_table *sgt)
{
struct scatterlist *s;
dma_addr_t expected = sg_dma_address(sgt->sgl);
unsigned int i;
unsigned long size = 0;

for_each_sg(sgt->sgl, s, sgt->nents, i) {
if (sg_dma_address(s) != expected)
break;
expected = sg_dma_address(s) + sg_dma_len(s);
size += sg_dma_len(s);
}
return size;
}


static void *vb2_dc_get_userptr(void *alloc_ctx, unsigned long vaddr,
unsigned long size, enum dma_data_direction dma_dir)
{
struct vb2_dc_conf *conf = alloc_ctx;
struct vb2_dc_buf *buf;
struct frame_vector *vec;

Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-26 Thread Yong Wu
On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
[...]
> +/*
> + * The DMA API client is passing in a scatterlist which could describe
> + * any old buffer layout, but the IOMMU API requires everything to be
> + * aligned to IOMMU pages. Hence the need for this complicated bit of
> + * impedance-matching, to be able to hand off a suitably-aligned list,
> + * but still preserve the original offsets and sizes for the caller.
> + */
> +int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
> + int nents, int prot)
> +{
> + struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
> + struct iova_domain *iovad = domain->iova_cookie;
> + struct iova *iova;
> + struct scatterlist *s, *prev = NULL;
> + dma_addr_t dma_addr;
> + size_t iova_len = 0;
> + int i;
> +
> + /*
> +  * Work out how much IOVA space we need, and align the segments to
> +  * IOVA granules for the IOMMU driver to handle. With some clever
> +  * trickery we can modify the list in-place, but reversibly, by
> +  * hiding the original data in the as-yet-unused DMA fields.
> +  */
> + for_each_sg(sg, s, nents, i) {
> + size_t s_offset = iova_offset(iovad, s->offset);
> + size_t s_length = s->length;
> +
> + sg_dma_address(s) = s->offset;
> + sg_dma_len(s) = s_length;
> + s->offset -= s_offset;
> + s_length = iova_align(iovad, s_length + s_offset);
> + s->length = s_length;
> +
> + /*
> +  * The simple way to avoid the rare case of a segment
> +  * crossing the boundary mask is to pad the previous one
> +  * to end at a naturally-aligned IOVA for this one's size,
> +  * at the cost of potentially over-allocating a little.
> +  */
> + if (prev) {
> + size_t pad_len = roundup_pow_of_two(s_length);
> +
> + pad_len = (pad_len - iova_len) & (pad_len - 1);
> + prev->length += pad_len;

Hi Robin,
  During our v4l2 testing, it seems that we met a problem here.
  Here we update prev->length again; do we need to update
sg_dma_len(prev) again too?

  Some functions, like vb2_dc_get_contiguous_size[1], always read
sg_dma_len(s) for comparison instead of s->length, so they may break
unexpectedly when sg_dma_len(s) is not the same as s->length.

[1]:
http://lxr.free-electrons.com/source/drivers/media/v4l2-core/videobuf2-dma-contig.c#L70


> + iova_len += pad_len;
> + }
> +
> + iova_len += s_length;
> + prev = s;
> + }
> +
> + iova = __alloc_iova(iovad, iova_len, dma_get_mask(dev));
> + if (!iova)
> + goto out_restore_sg;
> +
> + /*
> +  * We'll leave any physical concatenation to the IOMMU driver's
> +  * implementation - it knows better than we do.
> +  */
> + dma_addr = iova_dma_addr(iovad, iova);
> + if (iommu_map_sg(domain, dma_addr, sg, nents, prot) < iova_len)
> + goto out_free_iova;
> +
> + return __finalise_sg(dev, sg, nents, dma_addr);
> +
> +out_free_iova:
> + __free_iova(iovad, iova);
> +out_restore_sg:
> + __invalidate_sg(sg, nents);
> + return 0;
> +}
> +





Re: [PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-26 Thread Robin Murphy

On 26/10/15 13:44, Yong Wu wrote:

On Thu, 2015-10-01 at 20:13 +0100, Robin Murphy wrote:
[...]

+/*
+ * The DMA API client is passing in a scatterlist which could describe
+ * any old buffer layout, but the IOMMU API requires everything to be
+ * aligned to IOMMU pages. Hence the need for this complicated bit of
+ * impedance-matching, to be able to hand off a suitably-aligned list,
+ * but still preserve the original offsets and sizes for the caller.
+ */
+int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg,
+   int nents, int prot)
+{
+   struct iommu_domain *domain = iommu_get_domain_for_dev(dev);
+   struct iova_domain *iovad = domain->iova_cookie;
+   struct iova *iova;
+   struct scatterlist *s, *prev = NULL;
+   dma_addr_t dma_addr;
+   size_t iova_len = 0;
+   int i;
+
+   /*
+* Work out how much IOVA space we need, and align the segments to
+* IOVA granules for the IOMMU driver to handle. With some clever
+* trickery we can modify the list in-place, but reversibly, by
+* hiding the original data in the as-yet-unused DMA fields.
+*/
+   for_each_sg(sg, s, nents, i) {
+   size_t s_offset = iova_offset(iovad, s->offset);
+   size_t s_length = s->length;
+
+   sg_dma_address(s) = s->offset;
+   sg_dma_len(s) = s_length;
+   s->offset -= s_offset;
+   s_length = iova_align(iovad, s_length + s_offset);
+   s->length = s_length;
+
+   /*
+* The simple way to avoid the rare case of a segment
+* crossing the boundary mask is to pad the previous one
+* to end at a naturally-aligned IOVA for this one's size,
+* at the cost of potentially over-allocating a little.
+*/
+   if (prev) {
+   size_t pad_len = roundup_pow_of_two(s_length);
+
+   pad_len = (pad_len - iova_len) & (pad_len - 1);
+   prev->length += pad_len;


Hi Robin,
   During our v4l2 testing, it seems that we met a problem here.
   Here we update prev->length again; do we need to update
sg_dma_len(prev) again too?

   Some functions, like vb2_dc_get_contiguous_size[1], always read
sg_dma_len(s) for comparison instead of s->length, so they may break
unexpectedly when sg_dma_len(s) is not the same as s->length.


This is just tweaking the faked-up length that we hand off to 
iommu_map_sg() (see also the iova_align() above), to trick it into 
bumping this segment up to a suitable starting IOVA. The real length at 
this point is stashed in sg_dma_len(s), and will be copied back into 
s->length in __finalise_sg(), so both will hold the same true length 
once we return to the caller.


Yes, it does mean that if you have a list where the segment lengths are 
page aligned but not monotonically decreasing, e.g. {64k, 16k, 64k}, 
then you'll still end up with a gap between the second and third 
segments, but that's fine because the DMA API offers no guarantees about 
what the resulting DMA addresses will be (consider the no-IOMMU case 
where they would each just be "mapped" to their physical address). If 
that breaks v4l, then it's probably v4l's DMA API use that needs looking 
at (again).


Robin.


[1]:
http://lxr.free-electrons.com/source/drivers/media/v4l2-core/videobuf2-dma-contig.c#L70




[PATCH v6 1/3] iommu: Implement common IOMMU ops for DMA mapping

2015-10-01 Thread Robin Murphy
Taking inspiration from the existing arch/arm code, break out some
generic functions to interface the DMA-API to the IOMMU-API. This will
do the bulk of the heavy lifting for IOMMU-backed dma-mapping.

Since associating an IOVA allocator with an IOMMU domain is a fairly
common need, rather than introduce yet another private structure just to
do this for ourselves, extend the top-level struct iommu_domain with the
notion. A simple opaque cookie allows reuse by other IOMMU API users
with their various different incompatible allocator types.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/Kconfig |   7 +
 drivers/iommu/Makefile|   1 +
 drivers/iommu/dma-iommu.c | 524 ++
 include/linux/dma-iommu.h |  85 
 include/linux/iommu.h |   1 +
 5 files changed, 618 insertions(+)
 create mode 100644 drivers/iommu/dma-iommu.c
 create mode 100644 include/linux/dma-iommu.h

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 3dc1bcb..27d4d4b 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -48,6 +48,13 @@ config OF_IOMMU
def_bool y
depends on OF && IOMMU_API
 
+# IOMMU-agnostic DMA-mapping layer
+config IOMMU_DMA
+   bool
+   depends on NEED_SG_DMA_LENGTH
+   select IOMMU_API
+   select IOMMU_IOVA
+
 config FSL_PAMU
bool "Freescale IOMMU support"
depends on PPC32
diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile
index c6dcc51..f465cfb 100644
--- a/drivers/iommu/Makefile
+++ b/drivers/iommu/Makefile
@@ -1,6 +1,7 @@
 obj-$(CONFIG_IOMMU_API) += iommu.o
 obj-$(CONFIG_IOMMU_API) += iommu-traces.o
 obj-$(CONFIG_IOMMU_API) += iommu-sysfs.o
+obj-$(CONFIG_IOMMU_DMA) += dma-iommu.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE) += io-pgtable.o
 obj-$(CONFIG_IOMMU_IO_PGTABLE_LPAE) += io-pgtable-arm.o
 obj-$(CONFIG_IOMMU_IOVA) += iova.o
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
new file mode 100644
index 000..3a20db4
--- /dev/null
+++ b/drivers/iommu/dma-iommu.c
@@ -0,0 +1,524 @@
+/*
+ * A fairly generic DMA-API to IOMMU-API glue layer.
+ *
+ * Copyright (C) 2014-2015 ARM Ltd.
+ *
+ * based in part on arch/arm/mm/dma-mapping.c:
+ * Copyright (C) 2000-2004 Russell King
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program.  If not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+int iommu_dma_init(void)
+{
+   return iova_cache_get();
+}
+
+/**
+ * iommu_get_dma_cookie - Acquire DMA-API resources for a domain
+ * @domain: IOMMU domain to prepare for DMA-API usage
+ *
+ * IOMMU drivers should normally call this from their domain_alloc
+ * callback when domain->type == IOMMU_DOMAIN_DMA.
+ */
+int iommu_get_dma_cookie(struct iommu_domain *domain)
+{
+   struct iova_domain *iovad;
+
+   if (domain->iova_cookie)
+   return -EEXIST;
+
+   iovad = kzalloc(sizeof(*iovad), GFP_KERNEL);
+   domain->iova_cookie = iovad;
+
+   return iovad ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL(iommu_get_dma_cookie);
+
+/**
+ * iommu_put_dma_cookie - Release a domain's DMA mapping resources
+ * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
+ *
+ * IOMMU drivers should normally call this from their domain_free callback.
+ */
+void iommu_put_dma_cookie(struct iommu_domain *domain)
+{
+   struct iova_domain *iovad = domain->iova_cookie;
+
+   if (!iovad)
+   return;
+
+   put_iova_domain(iovad);
+   kfree(iovad);
+   domain->iova_cookie = NULL;
+}
+EXPORT_SYMBOL(iommu_put_dma_cookie);
+
+/**
+ * iommu_dma_init_domain - Initialise a DMA mapping domain
+ * @domain: IOMMU domain previously prepared by iommu_get_dma_cookie()
+ * @base: IOVA at which the mappable address space starts
+ * @size: Size of IOVA space
+ *
+ * @base and @size should be exact multiples of IOMMU page granularity to
+ * avoid rounding surprises. If necessary, we reserve the page at address 0
+ * to ensure it is an invalid IOVA. It is safe to reinitialise a domain, but
+ * any change which could make prior IOVAs invalid will fail.
+ */
+int iommu_dma_init_domain(struct iommu_domain *domain, dma_addr_t base, u64 size)
+{
+   struct iova_domain *iovad = domain->iova_cookie;
+   unsigned long order, base_pfn, end_pfn;
+
+   if (!iovad)
+   return -ENODEV;
+
+   /* Use the smallest supported page size for IOVA granularity */
+   order =