Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 11:27 PM, Arnd Bergmann  wrote:
> On Thu, Jul 6, 2017 at 4:06 PM, Tomasz Figa  wrote:
>> On Thu, Jul 6, 2017 at 11:02 PM, Arnd Bergmann  wrote:
>>> On Thu, Jul 6, 2017 at 3:49 PM, Tomasz Figa  wrote:
>>>> On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa  wrote:
>>>
>>>>> On the other hand, if it's strictly about base/dma-mapping, we might
>>>>> not need it indeed. The driver could call iommu-dma helpers directly,
>>>>> without the need to provide its own DMA ops. One caveat, though: we
>>>>> are not able to obtain coherent (i.e. uncached) memory with this
>>>>> approach, which might have some performance effects and complicates
>>>>> the code, which would now need to flush caches even for some small
>>>>> internal buffers.
>>>>
>>>> I think I should add a bit of explanation here:
>>>>  1) the device is non-coherent with CPU caches, even on x86,
>>>>  2) it looks like x86 does not have non-coherent DMA ops (but it
>>>> might be something that could be fixed)
>>>
>>> I don't understand what this means here. The PCI on x86 is always
>>> cache-coherent, so why is the device not?
>>>
>>> Do you mean that the device has its own caches that may need
>>> flushing to make the device cache coherent with the CPU cache,
>>> rather than flushing the CPU caches?
>>
>> Sakari might be able to explain this with more technical details, but
>> generally the device is not a standard PCI device one might find on
>> existing x86 systems.
>>
>> It is some kind of embedded subsystem that behaves mostly like a PCI
>> device, with certain exceptions, one being the lack of coherency with
>> CPU caches, at least for certain parts of the subsystem. The reference
>> vendor code disables the coherency completely, for reasons not known
>> to me, but AFAICT this is the preferred operating mode, possibly due
>> to performance effects (this is a memory-heavy image processing
>> subsystem).
>
> Ok, got it. I think something similar happens on integrated GPUs for
> a certain CPU family. The DRM code has its own ways of dealing with
> this kind of device. If you find the hardware to be closely
> related (either the implementation, or the location on the internal
> buses) to the GPU on this machine, I'd recommend having a look
> in drivers/gpu/drm to see how it's handled there, and if that code could
> be shared.

I think it's not closely related, but might be a very similar case.

Still, DRM is very liberal in terms of not using common code for doing
things, while V4L2 tries to make things as generic as possible.
There is already the vb2_dma_contig backend, which allocates coherent
memory (in case of V4L2-allocated buffers), manages caches (in case of
userptr or DMA-buf buffers) and so on for you. If we can't have the
DMA ops do the right thing, the code there is essentially useless and
you are left with vb2_dma_sg that uses a page allocator and gives the
driver sg tables (it actually can also do cache management for you,
but since dma_sync_sg_*() is essentially a no-op on x86, the driver
would have to do it on its own).
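
To illustrate that last point, a rough sketch (not actual driver code;
all my_* names below are invented) of what "doing it on its own" could
look like with vb2_dma_sg on x86:

#include <linux/scatterlist.h>
#include <media/videobuf2-core.h>
#include <media/videobuf2-dma-sg.h>
#include <asm/cacheflush.h>     /* clflush_cache_range(), x86 only */

/* Called from the driver's .buf_prepare: dma_sync_sg_for_device() would
 * normally do this, but it is a no-op on x86, so flush explicitly. */
static int my_buf_prepare(struct vb2_buffer *vb)
{
        struct sg_table *sgt = vb2_dma_sg_plane_desc(vb, 0);
        struct scatterlist *sg;
        int i;

        for_each_sg(sgt->sgl, sg, sgt->orig_nents, i)
                clflush_cache_range(sg_virt(sg), sg->length);

        return 0;
}

/* At queue setup the driver would pick the dma-sg backend:
 *      q->mem_ops = &vb2_dma_sg_memops;
 *      q->ops = &my_vb2_ops;   (with .buf_prepare = my_buf_prepare)
 */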

Best regards,
Tomasz


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Arnd Bergmann
On Thu, Jul 6, 2017 at 4:06 PM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 11:02 PM, Arnd Bergmann  wrote:
>> On Thu, Jul 6, 2017 at 3:49 PM, Tomasz Figa  wrote:
>>> On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa  wrote:
>>
>>>> On the other hand, if it's strictly about base/dma-mapping, we might
>>>> not need it indeed. The driver could call iommu-dma helpers directly,
>>>> without the need to provide its own DMA ops. One caveat, though: we
>>>> are not able to obtain coherent (i.e. uncached) memory with this
>>>> approach, which might have some performance effects and complicates
>>>> the code, which would now need to flush caches even for some small
>>>> internal buffers.
>>>
>>> I think I should add a bit of explanation here:
>>>  1) the device is non-coherent with CPU caches, even on x86,
>>>  2) it looks like x86 does not have non-coherent DMA ops (but it
>>> might be something that could be fixed)
>>
>> I don't understand what this means here. The PCI on x86 is always
>> cache-coherent, so why is the device not?
>>
>> Do you mean that the device has its own caches that may need
>> flushing to make the device cache coherent with the CPU cache,
>> rather than flushing the CPU caches?
>
> Sakari might be able to explain this with more technical details, but
> generally the device is not a standard PCI device one might find on
> existing x86 systems.
>
> It is some kind of embedded subsystem that behaves mostly like a PCI
> device, with certain exceptions, one being the lack of coherency with
> CPU caches, at least for certain parts of the subsystem. The reference
> vendor code disables the coherency completely, for reasons not known
> to me, but AFAICT this is the preferred operating mode, possibly due
> to performance effects (this is a memory-heavy image processing
> subsystem).

Ok, got it. I think something similar happens on integrated GPUs for
a certain CPU family. The DRM code has its own ways of dealing with
this kind of device. If you find the hardware to be closely
related (either the implementation, or the location on the internal
buses) to the GPU on this machine, I'd recommend having a look
in drivers/gpu/drm to see how it's handled there, and if that code could
be shared.

Arnd


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 11:02 PM, Arnd Bergmann  wrote:
> On Thu, Jul 6, 2017 at 3:49 PM, Tomasz Figa  wrote:
>> On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa  wrote:
>
>>> On the other hand, if it's strictly about base/dma-mapping, we might
>>> not need it indeed. The driver could call iommu-dma helpers directly,
>>> without the need to provide its own DMA ops. One caveat, though: we
>>> are not able to obtain coherent (i.e. uncached) memory with this
>>> approach, which might have some performance effects and complicates
>>> the code, which would now need to flush caches even for some small
>>> internal buffers.
>>
>> I think I should add a bit of explanation here:
>>  1) the device is non-coherent with CPU caches, even on x86,
>>  2) it looks like x86 does not have non-coherent DMA ops (but it
>> might be something that could be fixed)
>
> I don't understand what this means here. The PCI on x86 is always
> cache-coherent, so why is the device not?
>
> Do you mean that the device has its own caches that may need
> flushing to make the device cache coherent with the CPU cache,
> rather than flushing the CPU caches?

Sakari might be able to explain this with more technical details, but
generally the device is not a standard PCI device one might find on
existing x86 systems.

It is some kind of embedded subsystem that behaves mostly like a PCI
device, with certain exceptions, one being the lack of coherency with
CPU caches, at least for certain parts of the subsystem. The reference
vendor code disables the coherency completely, for reasons not known
to me, but AFAICT this is the preferred operating mode, possibly due
to performance effects (this is a memory-heavy image processing
subsystem).

Best regards,
Tomasz


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Arnd Bergmann
On Thu, Jul 6, 2017 at 3:49 PM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa  wrote:

>> On the other hand, if it's strictly about base/dma-mapping, we might
>> not need it indeed. The driver could call iommu-dma helpers directly,
>> without the need to provide its own DMA ops. One caveat, though: we
>> are not able to obtain coherent (i.e. uncached) memory with this
>> approach, which might have some performance effects and complicates
>> the code, which would now need to flush caches even for some small
>> internal buffers.
>
> I think I should add a bit of explanation here:
>  1) the device is non-coherent with CPU caches, even on x86,
>  2) it looks like x86 does not have non-coherent DMA ops (but it
> might be something that could be fixed)

I don't understand what this means here. The PCI on x86 is always
cache-coherent, so why is the device not?

Do you mean that the device has its own caches that may need
flushing to make the device cache coherent with the CPU cache,
rather than flushing the CPU caches?

  Arnd


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Arnd Bergmann
On Thu, Jul 6, 2017 at 3:31 PM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 9:23 PM, Arnd Bergmann  wrote:
>> On Thu, Jul 6, 2017 at 10:36 AM, Tomasz Figa  wrote:
>>> On Thu, Jul 6, 2017 at 5:34 PM, Tomasz Figa  wrote:
>>>
>>> Sorry, I intended to mean:
>>>
>>> IMHO re-implementing the code that's already there in videobuf2 again
>>> in the driver, only because, for no good reason mentioned so far,
>>> having a loadable module providing DMA ops was disliked, would make no
>>> sense.
>>
>> Why would we need to duplicate that code? I would expect that the videobuf2
>> core can simply call the regular dma_mapping interfaces, and you handle the
>> IOPTE generation at the point when the buffer is handed off from the core
>> code to the device driver. Am I missing something?
>
> Well, for example, the iommu-dma helpers already implement all the
> IOVA management, SG iterations, IOMMU API calls, sanity checks and so
> on. There is a significant amount of common code.
>
> On the other hand, if it's strictly about base/dma-mapping, we might
> not need it indeed. The driver could call iommu-dma helpers directly,
> without the need to provide its own DMA ops.

Yes, that's what I meant: if using the IOMMU interface helps, I don't
see anything wrong with that, but using the iommu-based
dma_map_ops seems like it may introduce more problems than
it solves.

Arnd


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 9:23 PM, Arnd Bergmann  wrote:
>> On Thu, Jul 6, 2017 at 10:36 AM, Tomasz Figa  wrote:
>>> On Thu, Jul 6, 2017 at 5:34 PM, Tomasz Figa  wrote:
>>>> On Thu, Jul 6, 2017 at 5:26 PM, Arnd Bergmann  wrote:
>>>>> On Thu, Jul 6, 2017 at 3:44 AM, Tomasz Figa  wrote:
>>
>>>>
>>>> I'd say that this is something V4L2 has consistently tried to avoid,
>>>> and that's why it's so tightly integrated with DMA mapping. IMHO
>>>> re-implementing the code that's already there in videobuf2 again in
>>>> the driver, only because, for no good reason mentioned so far, having
>>>> a loadable module providing DMA ops was disliked.
>>>
>>> Sorry, I intended to mean:
>>>
>>> IMHO re-implementing the code that's already there in videobuf2 again
>>> in the driver, only because, for no good reason mentioned so far,
>>> having a loadable module providing DMA ops was disliked, would make no
>>> sense.
>>
>> Why would we need to duplicate that code? I would expect that the videobuf2
>> core can simply call the regular dma_mapping interfaces, and you handle the
>> IOPTE generation at the point when the buffer is handed off from the core
>> code to the device driver. Am I missing something?
>
> Well, for example, the iommu-dma helpers already implement all the
> IOVA management, SG iterations, IOMMU API calls, sanity checks and so
> on. There is a significant amount of common code.
>
> On the other hand, if it's strictly about base/dma-mapping, we might
> not need it indeed. The driver could call iommu-dma helpers directly,
> without the need to provide its own DMA ops. One caveat, though: we
> are not able to obtain coherent (i.e. uncached) memory with this
> approach, which might have some performance effects and complicates
> the code, which would now need to flush caches even for some small
> internal buffers.

I think I should add a bit of explanation here:
 1) the device is non-coherent with CPU caches, even on x86,
 2) it looks like x86 does not have non-coherent DMA ops (but it
might be something that could be fixed)
 3) one technically could still use __get_vm_area() and map_vm_area(),
which _are_ exported, to create an uncached mapping. I'll leave it to
you to judge if it would be better than using the already available
generic helpers.
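
A minimal sketch of the idea in 3) (my helper name; a real driver would
additionally have to deal with the cached direct-map alias of the pages,
e.g. via set_memory_uc(), and with tearing the mapping down again, both
omitted here):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/* Map pre-allocated pages uncached through the exported vmalloc helpers. */
static void *my_vmap_uncached(struct page **pages, unsigned int npages)
{
        struct vm_struct *area;

        area = __get_vm_area((size_t)npages * PAGE_SIZE, VM_MAP,
                             VMALLOC_START, VMALLOC_END);
        if (!area)
                return NULL;

        /* pgprot_noncached() gives an uncached (UC) mapping on x86. */
        if (map_vm_area(area, pgprot_noncached(PAGE_KERNEL), pages)) {
                free_vm_area(area);
                return NULL;
        }

        return area->addr;
}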

Best regards,
Tomasz


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 9:23 PM, Arnd Bergmann  wrote:
> On Thu, Jul 6, 2017 at 10:36 AM, Tomasz Figa  wrote:
>> On Thu, Jul 6, 2017 at 5:34 PM, Tomasz Figa  wrote:
>>> On Thu, Jul 6, 2017 at 5:26 PM, Arnd Bergmann  wrote:
>>>> On Thu, Jul 6, 2017 at 3:44 AM, Tomasz Figa  wrote:
>
>>>
>>> I'd say that this is something V4L2 has consistently tried to avoid,
>>> and that's why it's so tightly integrated with DMA mapping. IMHO
>>> re-implementing the code that's already there in videobuf2 again in
>>> the driver, only because, for no good reason mentioned so far, having
>>> a loadable module providing DMA ops was disliked.
>>
>> Sorry, I intended to mean:
>>
>> IMHO re-implementing the code that's already there in videobuf2 again
>> in the driver, only because, for no good reason mentioned so far,
>> having a loadable module providing DMA ops was disliked, would make no
>> sense.
>
> Why would we need to duplicate that code? I would expect that the videobuf2
> core can simply call the regular dma_mapping interfaces, and you handle the
> IOPTE generation at the point when the buffer is handed off from the core
> code to the device driver. Am I missing something?

Well, for example, the iommu-dma helpers already implement all the
IOVA management, SG iterations, IOMMU API calls, sanity checks and so
on. There is a significant amount of common code.

On the other hand, if it's strictly about base/dma-mapping, we might
not need it indeed. The driver could call iommu-dma helpers directly,
without the need to provide its own DMA ops. One caveat, though: we
are not able to obtain coherent (i.e. uncached) memory with this
approach, which might have some performance effects and complicates
the code, which would now need to flush caches even for some small
internal buffers.
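
Just to make the amount of duplication concrete: without those helpers
(and assuming the internal MMU is registered with the IOMMU core), the
driver ends up hand-rolling plumbing along these lines (naive sketch, my
names, toy IOVA handling, no error paths), which is exactly what
iommu-dma already does properly:

#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/scatterlist.h>
#include <linux/sizes.h>

struct my_mmu {
        struct iommu_domain *domain;
        unsigned long next_iova;        /* toy allocator: never reused */
};

static int my_mmu_init(struct my_mmu *mmu, struct device *dev)
{
        mmu->domain = iommu_domain_alloc(dev->bus);
        if (!mmu->domain)
                return -ENOMEM;
        mmu->next_iova = SZ_1M;
        return iommu_attach_device(mmu->domain, dev);
}

/* Map an sg_table into the device address space; returns the IOVA to use. */
static dma_addr_t my_mmu_map_sgt(struct my_mmu *mmu, struct sg_table *sgt)
{
        unsigned long iova = mmu->next_iova;
        size_t mapped;

        mapped = iommu_map_sg(mmu->domain, iova, sgt->sgl, sgt->orig_nents,
                              IOMMU_READ | IOMMU_WRITE);
        if (!mapped)
                return 0;

        mmu->next_iova += ALIGN(mapped, PAGE_SIZE);
        return iova;
}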

Best regards,
Tomasz


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Arnd Bergmann
On Thu, Jul 6, 2017 at 10:36 AM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 5:34 PM, Tomasz Figa  wrote:
>> On Thu, Jul 6, 2017 at 5:26 PM, Arnd Bergmann  wrote:
>>> On Thu, Jul 6, 2017 at 3:44 AM, Tomasz Figa  wrote:

>>
>> I'd say that this is something V4L2 has consistently tried to avoid,
>> and that's why it's so tightly integrated with DMA mapping. IMHO
>> re-implementing the code that's already there in videobuf2 again in
>> the driver, only because, for no good reason mentioned so far, having
>> a loadable module providing DMA ops was disliked.
>
> Sorry, I intended to mean:
>
> IMHO re-implementing the code that's already there in videobuf2 again
> in the driver, only because, for no good reason mentioned so far,
> having a loadable module providing DMA ops was disliked, would make no
> sense.

Why would we need to duplicate that code? I would expect that the videobuf2
core can simply call the regular dma_mapping interfaces, and you handle the
IOPTE generation at the point when the buffer is handed off from the core
code to the device driver. Am I missing something?
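
Concretely, I'd picture something along these lines (hand-wavy sketch;
every my_* name below is invented):

#include <linux/scatterlist.h>
#include <media/videobuf2-core.h>
#include <media/videobuf2-dma-sg.h>

struct my_device;       /* driver-private state, details omitted */
unsigned long my_iova_alloc(struct my_device *mydev, unsigned long size);
void my_mmu_write_pte(struct my_device *mydev, unsigned long iova,
                      dma_addr_t daddr, unsigned int len);

/* Buffer hand-off: vb2 (through the normal x86 dma_map_ops) has already
 * dma-mapped the scatterlist; walk it here and generate the IOPTEs. */
static void my_buf_queue(struct vb2_buffer *vb)
{
        struct my_device *mydev = vb2_get_drv_priv(vb->vb2_queue);
        struct sg_table *sgt = vb2_dma_sg_plane_desc(vb, 0);
        struct scatterlist *sg;
        unsigned long iova, cur;
        int i;

        iova = cur = my_iova_alloc(mydev, vb2_plane_size(vb, 0));

        for_each_sg(sgt->sgl, sg, sgt->nents, i) {
                /* sg_dma_address() is whatever the system DMA ops returned. */
                my_mmu_write_pte(mydev, cur, sg_dma_address(sg),
                                 sg_dma_len(sg));
                cur += sg_dma_len(sg);
        }

        /* 'iova' is what gets programmed into the hardware. */
}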

   Arnd


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 5:34 PM, Tomasz Figa  wrote:
> On Thu, Jul 6, 2017 at 5:26 PM, Arnd Bergmann  wrote:
>> On Thu, Jul 6, 2017 at 3:44 AM, Tomasz Figa  wrote:
>>> On Thu, Jul 6, 2017 at 2:20 AM, Christoph Hellwig  wrote:
>>>> On Thu, Jul 06, 2017 at 12:22:35AM +0900, Tomasz Figa wrote:
>>
>>>> In general I think moving dma
>>>> ops and iommu implementations into modules is a bad idea
>>>
>>> Could you elaborate on this? I'd be interested in seeing the reasoning
>>> behind this.
>>>
>>>> but I
>>>> don't want to reject the idea before seeing the code.  Or maybe
>>>> by looking at the user we can come up with an even better idea
>>>> to solve the original issue you're trying to solve, so please also
>>>> explain your rationale.
>>
>> I had pretty much the same thoughts here.
>>
>>> Basically we have an x86 platform with a camera subsystem that is a
>>> PCI device, has its own MMU and needs cache maintenance. Moreover, the
>>> V4L2 subsystem, which is the right place for camera drivers, heavily
>>> relies on DMA mapping as a way to abstract memory allocation, mapping
>>> and cache maintenance. So it feels natural to me to hide the hardware
>>> details (additional cache maintenance, mapping into the built-in
>>> IOMMU) in the DMA mapping ops for this camera subsystem and simply
>>> make V4L2 just work without knowing those details.
>>
>> I can understand your reasoning here, but I'm also not convinced
>> that this is the best approach. There may be a middle ground somewhere
>> though.
>>
>> Generally speaking I don't want to have to deal with the horrors of
>> deciding whether an IOMMU is going to be there eventually or not
>> at probe() time. At some point, we had decided that IOMMUs need to
>> be initialized (almost) as early as irqchips and clocksources so we can
>> rely on them to be there at device discovery time. That got pushed
>> back already, and now we may have to deal with -EPROBE_DEFER
>> when an IOMMU has not been fully initialized at device probe time,
>> but at least we can reliably see if one is there or not. Making IOMMUs
>> modular will add further uncertainty here. Obviously we cannot attach
>> an IOMMU to a device once we have started using DMA mapping
>> calls on it.
>
> The hardware can only work with the IOMMU, so the main module is tightly
> tied to the IOMMU module and initializes it directly. There is no
> separate struct driver or device associated with the IOMMU, as it's
> part of the one and only PCI device (as visible from the system
> PCI bus point of view) and technically handled by one pci_driver.
>
>>
>> For your particular use case, I would instead leave the knowledge
>> about the IOMMU in the driver itself, like we do for the IOMMUs
>> that are integrated in desktop GPUs, and have the code use the
>> DMA mapping API with the system-provided dma_map_ops to
>> get dma_addr_t tokens which you then program into the device
>> IOMMU.
>>
>> An open question however would be whether to use the IOMMU
>> API without the DMA mapping API here, or whether to completely
>> leave the knowledge of the IOMMU inside of the driver itself.
>> I don't have a strong opinion on that part, and I guess this mostly
>> depends on what the hardware looks like.
>
> + linux-media and some media folks
>
> I'd say that this is something V4L2 has consistently tried to avoid,
> and that's why it's so tightly integrated with DMA mapping. IMHO
> re-implementing the code that's already there in videobuf2 again in
> the driver, only because, for no good reason mentioned so far, having
> a loadable module providing DMA ops was disliked.

Sorry, I intended to mean:

IMHO re-implementing the code that's already there in videobuf2 again
in the driver, only because, for no good reason mentioned so far,
having a loadable module providing DMA ops was disliked, would make no
sense.

>
> Similarly with the IOMMU API: it provides a lot of help in managing
> the mappings, and re-implementing this would IMHO be backwards.
>
> Best regards,
> Tomasz


Re: [RFC PATCH 1/5] base: dma-mapping: Export commonly used symbols

2017-07-06 Thread Tomasz Figa
On Thu, Jul 6, 2017 at 5:26 PM, Arnd Bergmann  wrote:
> On Thu, Jul 6, 2017 at 3:44 AM, Tomasz Figa  wrote:
>> On Thu, Jul 6, 2017 at 2:20 AM, Christoph Hellwig  wrote:
>>> On Thu, Jul 06, 2017 at 12:22:35AM +0900, Tomasz Figa wrote:
>
>>> In general I think moving dma
>>> ops and iommu implementations into modules is a bad idea
>>
>> Could you elaborate on this? I'd be interested in seeing the reasoning
>> behind this.
>>
>>> but I
>>> don't want to reject the idea before seeing the code.  Or maybe
>>> by looking at the user we can come up with an even better idea
>>> to solve the original issue you're trying to solve, so please also
>>> explain your rationale.
>
> I had pretty much the same thoughts here.
>
>> Basically we have an x86 platform with a camera subsystem that is a
>> PCI device, has its own MMU and needs cache maintenance. Moreover, the
>> V4L2 subsystem, which is the right place for camera drivers, heavily
>> relies on DMA mapping as a way to abstract memory allocation, mapping
>> and cache maintenance. So it feels natural to me to hide the hardware
>> details (additional cache maintenance, mapping into the built-in
>> IOMMU) in the DMA mapping ops for this camera subsystem and simply
>> make V4L2 just work without knowing those details.
>
> I can understand your reasoning here, but I'm also not convinced
> that this is the best approach. There may be a middle ground somewhere
> though.
>
> Generally speaking I don't want to have to deal with the horrors of
> deciding whether an IOMMU is going to be there eventually or not
> at probe() time. At some point, we had decided that IOMMUs need to
> be initialized (almost) as early as irqchips and clocksources so we can
> rely on them to be there at device discovery time. That got pushed
> back already, and now we may have to deal with -EPROBE_DEFER
> when an IOMMU has not been fully initialized at device probe time,
> but at least we can reliably see if one is there or not. Making IOMMUs
> modular will add further uncertainty here. Obviously we cannot attach
> an IOMMU to a device once we have started using DMA mapping
> calls on it.

The hardware can only work with the IOMMU, so the main module is tightly
tied to the IOMMU module and initializes it directly. There is no
separate struct driver or device associated with the IOMMU, as it's
part of the one and only PCI device (as visible from the system
PCI bus point of view) and technically handled by one pci_driver.
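
To make the shape of this concrete, a rough sketch (all names invented,
not the actual driver) of the single-pci_driver structure described
above:

#include <linux/module.h>
#include <linux/pci.h>

struct my_ipu {
        struct pci_dev *pdev;
        /* internal MMU state, V4L2 devices, etc. omitted */
};

/* Placeholders for driver-private code: */
int my_ipu_mmu_init(struct my_ipu *ipu);
int my_ipu_register_v4l2(struct my_ipu *ipu);

static int my_ipu_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
        struct my_ipu *ipu;
        int ret;

        ret = pcim_enable_device(pdev);
        if (ret)
                return ret;

        ipu = devm_kzalloc(&pdev->dev, sizeof(*ipu), GFP_KERNEL);
        if (!ipu)
                return -ENOMEM;
        ipu->pdev = pdev;
        pci_set_drvdata(pdev, ipu);

        /* The MMU is just a block of this same PCI device, so it is set
         * up directly from here; with this RFC, this is also where the
         * module would install its own DMA ops for the device. */
        ret = my_ipu_mmu_init(ipu);
        if (ret)
                return ret;

        return my_ipu_register_v4l2(ipu);
}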

>
> For your particular use case, I would instead leave the knowledge
> about the IOMMU in the driver itself, like we do for the IOMMUs
> that are integrated in desktop GPUs, and have the code use the
> DMA mapping API with the system-provided dma_map_ops to
> get dma_addr_t tokens which you then program into the device
> IOMMU.
>
> An open question however would be whether to use the IOMMU
> API without the DMA mapping API here, or whether to completely
> leave the knowledge of the IOMMU inside of the driver itself.
> I don't have a strong opinion on that part, and I guess this mostly
> depends on what the hardware looks like.

+ linux-media and some media folks

I'd say that this is something V4L2 has consistently tried to avoid,
and that's why it's so tightly integrated with DMA mapping. IMHO
re-implementing the code that's already there in videobuf2 again in
the driver, only because, for no good reason mentioned so far, having
a loadable module providing DMA ops was disliked.

Similarly with the IOMMU API: it provides a lot of help in managing
the mappings, and re-implementing this would IMHO be backwards.
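
For reference, this is the level of abstraction the IOMMU API already
gives us (sketch; 'domain' is assumed to be the domain of the device's
internal MMU, registered with the IOMMU core):

#include <linux/iommu.h>

static int my_map_range(struct iommu_domain *domain, unsigned long iova,
                        phys_addr_t paddr, size_t size)
{
        /* The IOMMU core builds the page-table entries in whatever
         * format the underlying iommu_ops implement. */
        return iommu_map(domain, iova, paddr, size,
                         IOMMU_READ | IOMMU_WRITE);
}

static void my_unmap_range(struct iommu_domain *domain, unsigned long iova,
                           size_t size)
{
        /* Tears the page-table entries down again. */
        iommu_unmap(domain, iova, size);
}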

Best regards,
Tomasz