Re: [PATCH 21/21] dma-mapping: replace custom code with generic implementation

2023-03-27 Thread Christoph Hellwig
> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
>  {
> + dma_cache_wback(paddr, size);
> +}
>  
> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
> +{
> + dma_cache_inv(paddr, size);
>  }

> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
>  {
> + dma_cache_wback_inv(paddr, size);
> +}

These are the only calls to the three functions for each of the
involved architectures.  So I'd rather rename the low-level symbols
(and drop the pointless exports for two of them) than add these
wrappers.

The same is probably true for many other architectures.

> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
> +{
> + return false;
> +}
>  
> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
> +{
> + return true;
>  }

Is there a way to cut down on this boilerplate code by just having
sane defaults, and Kconfig options to override them if they are not
runtime decisions?
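
A minimal sketch of what such a Kconfig-driven default could look like
(illustration only, not from the patch; the CONFIG_ symbol name is made
up, only the arch_sync_dma_clean_before_fromdevice() name is taken from
the series):

/*
 * Illustrative only: the generic code supplies a sane default and an
 * architecture overrides it via a (hypothetical) Kconfig symbol when
 * the answer is a compile-time rather than a runtime decision.
 */
#ifdef CONFIG_ARCH_DMA_CLEAN_BEFORE_FROMDEVICE  /* hypothetical symbol */
static inline bool arch_sync_dma_clean_before_fromdevice(void)
{
        return true;    /* write back before DMA_FROM_DEVICE */
}
#else
static inline bool arch_sync_dma_clean_before_fromdevice(void)
{
        return false;   /* invalidate before DMA_FROM_DEVICE */
}
#endif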

> +#include <linux/dma-sync.h>

I can't really say I like the #include version here despite your
rationale in the commit log.  I can probably live with it if you
think it is absolutely worth it, but I'm really not in favor of it.

> +config ARCH_DMA_MARK_DCACHE_CLEAN
> + def_bool y

What do we need this symbol for?  Unless I'm missing something it is
always enabled for arm32, and only used in arm32 code.



Re: [PATCH 02/21] xtensa: dma-mapping: use normal cache invalidation rules

2023-03-27 Thread Max Filippov
On Mon, Mar 27, 2023 at 5:14 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> xtensa is one of the platforms that has both write-back and write-through
> caches, and needs to account for both in its DMA mapping operations.
>
> It does this through a set of operations that is different from any
> architecture. This is not a problem by itself, but it makes it rather
> hard to figure out whether this is correct or not, and to unify this
> implementation with the others.
>
> Change the semantics to the usual ones for non-speculating CPUs:
>
>  - On DMA_TO_DEVICE, call __flush_dcache_range() to perform the
>writeback even on writethrough caches, where this is a nop.
>
>  - On DMA_FROM_DEVICE, invalidate the mapping before the DMA rather
>than afterwards.
>
>  - On DMA_BIDIRECTIONAL, combine the pre-writeback with the
>post-invalidate into a call to __flush_invalidate_dcache_range()
>that turns into a simple invalidate on writeback caches.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/xtensa/Kconfig  |  1 -
>  arch/xtensa/include/asm/cacheflush.h |  6 +++---
>  arch/xtensa/kernel/pci-dma.c | 29 +---
>  3 files changed, 8 insertions(+), 28 deletions(-)

Reviewed-by: Max Filippov 

-- 
Thanks.
-- Max



Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-27 Thread Russell King (Oracle)
On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> PG_dcache_clean after a DMA, but no other architecture does this here.

... because this is an arm32 specific feature. Generically, it's
PG_arch_1, which is a page flag free for architecture use. On arm32
we decided to use this to mark whether we can skip dcache writebacks
when establishing a PTE - and thus it was decided to call it
PG_dcache_clean to reflect how arm32 decided to use that bit.

This isn't just a DMA thing, there are other places that we update
the bit, such as flush_dcache_page() and copy_user_highpage().

So thinking that the arm32 PG_dcache_clean is something for DMA is
actually wrong.

Other architectures are free to do their own other optimisations
using that bit, and their implementations may be DMA-centric.
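
As a rough illustration of the lazy-flush pattern described above (the
helper names here are made up and this is not the actual arm32 code;
only the PG_dcache_clean bit and the test_and_set_bit() pattern are the
point):

static void example_sync_dcache_before_mapping(struct page *page)
{
        /*
         * Write the D-cache back only the first time the page is mapped;
         * once PG_dcache_clean is set, later mappings skip the writeback.
         */
        if (!test_and_set_bit(PG_dcache_clean, &page->flags))
                example_flush_dcache_page(page);        /* hypothetical helper */
}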

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!



Re: [PATCH 10/21] csky: dma-mapping: skip invalidating before DMA from device

2023-03-27 Thread Guo Ren
On Mon, Mar 27, 2023 at 8:15 PM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> csky is the only architecture that does a full flush for the
> dma_sync_*_for_device(..., DMA_FROM_DEVICE) operation. The requirement
> is only to make sure there are no dirty cache lines for the buffer,
> which can be either done through an invalidate operation (as on most
> architectures including arm32, mips and arc), or a writeback (as on
> arm64 and riscv). The cache also has to be invalidated eventually but
> csky already does that after the transfer.
>
> Use a 'clean' operation here for consistency with arm64 and riscv.
>
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/csky/mm/dma-mapping.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
> index 82447029feb4..c90f912e2822 100644
> --- a/arch/csky/mm/dma-mapping.c
> +++ b/arch/csky/mm/dma-mapping.c
> @@ -60,11 +60,9 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t 
> size,
>  {
> switch (dir) {
> case DMA_TO_DEVICE:
> -   cache_op(paddr, size, dma_wb_range);
> -   break;
> case DMA_FROM_DEVICE:
> case DMA_BIDIRECTIONAL:
> -   cache_op(paddr, size, dma_wbinv_range);
> +   cache_op(paddr, size, dma_wb_range);
Reviewed-by: Guo Ren 


> break;
> default:
> BUG();
> --
> 2.39.2
>


-- 
Best Regards
 Guo Ren



Re: [PATCH 16/21] ARM: dma-mapping: bring back dmac_{clean,inv}_range

2023-03-27 Thread Russell King (Oracle)
On Mon, Mar 27, 2023 at 02:13:12PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> These were removed ages ago in commit 702b94bff3c5 ("ARM: dma-mapping:
> remove dmac_clean_range and dmac_inv_range") in an effort to sanitize
> the dma-mapping API.

Really no, please no. Let's not go back to this, let's keep the
buffer ownership model that came at around that time.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!



Re: [PATCH 06/21] powerpc: dma-mapping: minimize for_cpu flushing

2023-03-27 Thread Arnd Bergmann
On Mon, Mar 27, 2023, at 14:56, Christophe Leroy wrote:
> Le 27/03/2023 à 14:13, Arnd Bergmann a écrit :
>> From: Arnd Bergmann 
>> 
>> The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
>> architectures. Reduce it to what everyone else does:
>> 
>>   - No flush is needed after data has been sent to a device
>> 
>>   - When data has been received from a device, the cache only needs to
>> be invalidated to clear out cache lines that were speculatively
>> prefetched.
>> 
>> In particular, the second flushing of partial cache lines of bidirectional
>> buffers is actively harmful -- if a single cache line is written by both
>> the CPU and the device, flushing it again does not maintain coherency
>> but instead overwrites the data that was just received from the device.
>
> Hum... Who is right ?
>
> That behaviour was introduced by commit 03d70617b8a7 ("powerpc: Prevent 
> memory corruption due to cache invalidation of unaligned DMA buffer")
>
> I think your commit log should explain why that commit was wrong, and 
> maybe say that your patch is a revert of that commit ?

Ok, I'll try to explain this better. To clarify here: the __dma_sync()
function in commit 03d70617b8a7 is used both before and after a DMA,
but my patch 05/21 splits this in two, and patch 06/21 only changes
the part that gets called after the DMA-from-device but leaves the
part before DMA-from-device unchanged, which Andrew's patch
addressed.

As I mentioned in the cover letter, it is still unclear whether
we want to consider this the expected behavior as the documentation
seems unclear, but my series does not attempt to answer that
question.

 Arnd



Re: [PATCH 06/21] powerpc: dma-mapping: minimize for_cpu flushing

2023-03-27 Thread Christophe Leroy


Le 27/03/2023 à 14:13, Arnd Bergmann a écrit :
> From: Arnd Bergmann 
> 
> The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
> architectures. Reduce it to what everyone else does:
> 
>   - No flush is needed after data has been sent to a device
> 
>   - When data has been received from a device, the cache only needs to
> be invalidated to clear out cache lines that were speculatively
> prefetched.
> 
> In particular, the second flushing of partial cache lines of bidirectional
> buffers is actively harmful -- if a single cache line is written by both
> the CPU and the device, flushing it again does not maintain coherency
> but instead overwrites the data that was just received from the device.

Hum... Who is right ?

That behaviour was introduced by commit 03d70617b8a7 ("powerpc: Prevent 
memory corruption due to cache invalidation of unaligned DMA buffer")

I think your commit log should explain why that commit was wrong, and 
maybe say that your patch is a revert of that commit ?

Christophe


> 
> Signed-off-by: Arnd Bergmann 
> ---
>   arch/powerpc/mm/dma-noncoherent.c | 18 --
>   1 file changed, 4 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/mm/dma-noncoherent.c 
> b/arch/powerpc/mm/dma-noncoherent.c
> index f10869d27de5..e108cacf877f 100644
> --- a/arch/powerpc/mm/dma-noncoherent.c
> +++ b/arch/powerpc/mm/dma-noncoherent.c
> @@ -132,21 +132,11 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t 
> size,
>   switch (direction) {
>   case DMA_NONE:
>   BUG();
> - case DMA_FROM_DEVICE:
> - /*
> -  * invalidate only when cache-line aligned otherwise there is
> -  * the potential for discarding uncommitted data from the cache
> -  */
> - if ((start | end) & (L1_CACHE_BYTES - 1))
> - __dma_phys_op(start, end, DMA_CACHE_FLUSH);
> - else
> - __dma_phys_op(start, end, DMA_CACHE_INVAL);
> - break;
> - case DMA_TO_DEVICE: /* writeback only */
> - __dma_phys_op(start, end, DMA_CACHE_CLEAN);
> + case DMA_TO_DEVICE:
>   break;
> - case DMA_BIDIRECTIONAL: /* writeback and invalidate */
> - __dma_phys_op(start, end, DMA_CACHE_FLUSH);
> + case DMA_FROM_DEVICE:
> + case DMA_BIDIRECTIONAL:
> + __dma_phys_op(start, end, DMA_CACHE_INVAL);
>   break;
>   }
>   }


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-27 Thread Robin Murphy

On 2023-03-27 13:13, Arnd Bergmann wrote:

From: Arnd Bergmann 

The arm version of the arch_sync_dma_for_cpu() function annotates pages as
PG_dcache_clean after a DMA, but no other architecture does this here. On
ia64, the same thing is done in arch_dma_mark_clean(), so it makes sense
to use the same hook in order to have identical arch_sync_dma_for_cpu()
semantics as all other architectures.

Splitting this out has multiple effects:

  - for dma-direct, this now gets called after arch_sync_dma_for_cpu()
for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
it would not be harmful to keep doing it for bidirectional mappings,
those are apparently not used in any callers that care about the flag.

  - Since arm has its own dma-iommu abstraction, this now also needs to
call the same function, so the calls are added there to mirror the
dma-direct version.

  - Like dma-direct, the dma-iommu version now marks the dcache clean
for both coherent and noncoherent devices after a DMA, but it only
does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.

[ HELP NEEDED: can anyone confirm that it is a correct assumption
   on arm that a cache-coherent device writing to a page always results
   in it being in a PG_dcache_clean state like on ia64, or can a device
   write directly into the dcache?]


In AMBA at least, if a snooping write hits in a cache then the data is 
most likely going to get routed directly into that cache. If it has 
write-back write-allocate attributes it could also land in any cache 
along its normal path to RAM; it wouldn't have to go all the way.


Hence all the fun we have where treating a coherent device as 
non-coherent can still be almost as broken as the other way round :)


Cheers,
Robin.


Signed-off-by: Arnd Bergmann 
---
  arch/arm/Kconfig  |  1 +
  arch/arm/mm/dma-mapping.c | 71 +++
  2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..125d58c54ab1 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
+   select ARCH_HAS_DMA_MARK_CLEAN if MMU
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc702cb27ae7..b703cb83d27e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
  }
  
+/*

+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
+{
+   unsigned long pfn = PFN_UP(paddr);
+   unsigned long off = paddr & (PAGE_SIZE - 1);
+   size_t left = size;
+
+   if (size < PAGE_SIZE)
+   return;
+
+   if (off)
+   left -= PAGE_SIZE - off;
+
+   while (left >= PAGE_SIZE) {
+   struct page *page = pfn_to_page(pfn++);
+   set_bit(PG_dcache_clean, &page->flags);
+   left -= PAGE_SIZE;
+   }
+}
+
  static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
  {
if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
outer_inv_range(paddr, paddr + size);
dma_cache_maint(paddr, size, dmac_inv_range);
}
-
-   /*
-* Mark the D-cache clean for these pages to avoid extra flushing.
-*/
-   if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-   unsigned long pfn = PFN_UP(paddr);
-   unsigned long off = paddr & (PAGE_SIZE - 1);
-   size_t left = size;
-
-   if (off)
-   left -= PAGE_SIZE - off;
-
-   while (left >= PAGE_SIZE) {
-   struct page *page = pfn_to_page(pfn++);
-   set_bit(PG_dcache_clean, &page->flags);
-   left -= PAGE_SIZE;
-   }
-   }
  }
  
  #ifdef CONFIG_ARM_DMA_USE_IOMMU

@@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct 
scatterlist *sg,
return -EINVAL;
  }
  
+static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,

+  enum dma_data_direction dir,
+  bool dma_coherent)
+{
+   if (!dma_coherent)
+   arch_sync_dma_for_cpu(phys, len, dir);
+
+   if (dir == DMA_FROM_DEVICE)
+   arch_dma_mark_clean(phys, len);
+}
+
  /**
   * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
   * @dev: valid struct device pointer
@@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
if 

[PATCH 21/21] dma-mapping: replace custom code with generic implementation

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

Now that all of these have consistent behavior, replace them with
a single shared implementation of arch_sync_dma_for_device() and
arch_sync_dma_for_cpu() and three parameters to pick how they should
operate:

 - If the CPU has speculative prefetching, then the cache
   has to be invalidated after a transfer from the device.
   On the rarer CPUs without prefetching, this can be skipped,
   with all cache management happening before the transfer.
   This flag can be runtime detected, but is usually fixed
   per architecture.

 - Some architectures currently clean the caches before DMA
   from a device, while others invalidate it. There has not
   been a conclusion regarding whether we should change all
   architectures to use clean instead, so this adds an
   architecture specific flag that we can change later on.

 - On 32-bit Arm, the arch_sync_dma_for_cpu() function keeps
   track of pages that are marked clean in the page cache, to
   avoid flushing them again. The implementation for this is
   generic enough to work on all architectures that use the
   PG_dcache_clean page flag, but a Kconfig symbol is used
   to only enable it on Arm to preserve the existing behavior.

For the function naming, I picked 'wback' over 'clean', and 'wback_inv'
over 'flush', to avoid any ambiguity of what the helper functions are
supposed to do.

Moving the global functions into a header file is usually a bad idea
as it prevents the header from being included more than once, but it
helps keep the behavior as close as possible to the previous state,
including the possibility of inlining most of it into these functions
where that was done before. This also helps keep the global namespace
clean, by hiding the new arch_dma_cache{_wback,_inv,_wback_inv} from
device drivers that might use them incorrectly.

It would be possible to do this one architecture at a time, but
as the change is the same everywhere, the combined patch helps
explain it better all at once.

Signed-off-by: Arnd Bergmann 
---
 arch/arc/mm/dma.c |  66 +-
 arch/arm/Kconfig  |   3 +
 arch/arm/mm/dma-mapping-nommu.c   |  39 ++-
 arch/arm/mm/dma-mapping.c |  64 +++---
 arch/arm64/mm/dma-mapping.c   |  28 +---
 arch/csky/mm/dma-mapping.c|  44 ++--
 arch/hexagon/kernel/dma.c |  44 ++--
 arch/m68k/kernel/dma.c|  43 +++-
 arch/microblaze/kernel/dma.c  |  48 +++---
 arch/mips/mm/dma-noncoherent.c|  60 +++--
 arch/nios2/mm/dma-mapping.c   |  57 +++-
 arch/openrisc/kernel/dma.c|  63 +++---
 arch/parisc/kernel/pci-dma.c  |  46 ++---
 arch/powerpc/mm/dma-noncoherent.c |  34 ++
 arch/riscv/mm/dma-noncoherent.c   |  51 +++---
 arch/sh/kernel/dma-coherent.c |  43 +++-
 arch/sparc/kernel/ioport.c|  38 ---
 arch/xtensa/kernel/pci-dma.c  |  40 ++-
 include/linux/dma-sync.h  | 107 ++
 19 files changed, 527 insertions(+), 391 deletions(-)
 create mode 100644 include/linux/dma-sync.h
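
Before the per-architecture diffs, here is a condensed sketch of how the
three knobs described above combine in the shared helpers. This is only
an approximation written for this summary, not the verbatim contents of
the new include/linux/dma-sync.h; the arch_dma_cache_*() and
arch_sync_dma_*() hook names are taken from the patch, the function
names wrapped around them are not:

static inline void generic_sync_dma_for_device(phys_addr_t paddr,
                size_t size, enum dma_data_direction dir)
{
        switch (dir) {
        case DMA_TO_DEVICE:
                arch_dma_cache_wback(paddr, size);
                break;
        case DMA_FROM_DEVICE:
                /* only dirty lines must go; either clean or invalidate */
                if (arch_sync_dma_clean_before_fromdevice())
                        arch_dma_cache_wback(paddr, size);
                else
                        arch_dma_cache_inv(paddr, size);
                break;
        case DMA_BIDIRECTIONAL:
                /* a post-DMA invalidate makes a plain writeback sufficient */
                if (arch_sync_dma_cpu_needs_post_dma_flush())
                        arch_dma_cache_wback(paddr, size);
                else
                        arch_dma_cache_wback_inv(paddr, size);
                break;
        default:
                break;
        }
}

static inline void generic_sync_dma_for_cpu(phys_addr_t paddr,
                size_t size, enum dma_data_direction dir)
{
        /* drop lines that speculative prefetches pulled in during the DMA */
        if (dir != DMA_TO_DEVICE && arch_sync_dma_cpu_needs_post_dma_flush())
                arch_dma_cache_inv(paddr, size);
}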

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index ddb96786f765..61cd01646222 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -30,63 +30,33 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
dma_cache_wback_inv(page_to_phys(page), size);
 }
 
-/*
- * Cache operations depending on function and direction argument, inspired by
- * https://lore.kernel.org/lkml/20180518175004.gf17...@n2100.armlinux.org.uk
- * "dma_sync_*_for_cpu and direction=TO_DEVICE (was Re: [PATCH 02/20]
- * dma-mapping: provide a generic dma-noncoherent implementation)"
- *
- *  |   map  ==  for_device |   unmap ==  for_cpu
- *  |
- * TO_DEV   |   writebackwriteback  |   none  none
- * FROM_DEV |   invalidate   invalidate |   invalidate*   invalidate*
- * BIDIR|   writebackwriteback  |   invalidateinvalidate
- *
- * [*] needed for CPU speculative prefetches
- *
- * NOTE: we don't check the validity of direction argument as it is done in
- * upper layer functions (in include/linux/dma-mapping.h)
- */
-
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
-   enum dma_data_direction dir)
+static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
 {
-   switch (dir) {
-   case DMA_TO_DEVICE:
-   dma_cache_wback(paddr, size);
-   break;
-
-   case DMA_FROM_DEVICE:
-   dma_cache_inv(paddr, size);
-   break;
-
-   case DMA_BIDIRECTIONAL:
-   dma_cache_wback(paddr, size);
-   break;
+   dma_cache_wback(paddr, size);
+}
 
-   default:
-   break;
-   }
+static inline void arch_dma_cache_inv(phys_addr_t 

[PATCH 19/21] ARM: dma-mapping: use generic form of arch_sync_dma_* helpers

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

As the final step of the conversion to generic arch_sync_dma_*
helpers, change the Arm implementation to look the same as the
new generic version, by calling the dmac_{clean,inv,flush}_range
low-level functions instead of the abstracted dmac_{map,unmap}_area
version.

On ARMv6/v7, this invalidates the caches after a DMA transfer from
a device because of speculative prefetching, while on earlier versions
it only needs to do this before the transfer.

This should not change any of the current behavior.

FIXME: address CONFIG_DMA_CACHE_RWFO properly.

Signed-off-by: Arnd Bergmann 
---
 arch/arm/mm/dma-mapping-nommu.c | 11 +++
 arch/arm/mm/dma-mapping.c   | 53 +++--
 2 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/arch/arm/mm/dma-mapping-nommu.c b/arch/arm/mm/dma-mapping-nommu.c
index cfd9c933d2f0..12b5c6ae93fc 100644
--- a/arch/arm/mm/dma-mapping-nommu.c
+++ b/arch/arm/mm/dma-mapping-nommu.c
@@ -16,12 +16,13 @@
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   dmac_map_area(__va(paddr), size, dir);
-
-   if (dir == DMA_FROM_DEVICE)
+   if (dir == DMA_FROM_DEVICE) {
+   dmac_inv_range(__va(paddr), __va(paddr + size));
outer_inv_range(paddr, paddr + size);
-   else
+   } else {
+   dmac_clean_range(__va(paddr), __va(paddr + size));
outer_clean_range(paddr, paddr + size);
+   }
 }
 
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
@@ -29,7 +30,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
 {
if (dir != DMA_TO_DEVICE) {
outer_inv_range(paddr, paddr + size);
-   dmac_unmap_area(__va(paddr), size, dir);
+   dmac_inv_range(__va(paddr), __va(paddr + size));
}
 }
 
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index ce4b74f34a58..cc702cb27ae7 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -623,8 +623,7 @@ static void __arm_dma_free(struct device *dev, size_t size, 
void *cpu_addr,
 }
 
 static void dma_cache_maint(phys_addr_t paddr,
-   size_t size, enum dma_data_direction dir,
-   void (*op)(const void *, size_t, int))
+   size_t size, void (*op)(const void *, const void *))
 {
unsigned long pfn = PFN_DOWN(paddr);
unsigned long offset = paddr % PAGE_SIZE;
@@ -647,18 +646,18 @@ static void dma_cache_maint(phys_addr_t paddr,
 
if (cache_is_vipt_nonaliasing()) {
vaddr = kmap_atomic(page);
-   op(vaddr + offset, len, dir);
+   op(vaddr + offset, vaddr + offset + len);
kunmap_atomic(vaddr);
} else {
vaddr = kmap_high_get(page);
if (vaddr) {
-   op(vaddr + offset, len, dir);
+   op(vaddr + offset, vaddr + offset + 
len);
kunmap_high(page);
}
}
} else {
vaddr = page_address(page) + offset;
-   op(vaddr, len, dir);
+   op(vaddr, vaddr + len);
}
offset = 0;
pfn++;
@@ -666,6 +665,18 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
 }
 
+static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+   if (IS_ENABLED(CONFIG_CPU_V6) ||
+   IS_ENABLED(CONFIG_CPU_V6K) ||
+   IS_ENABLED(CONFIG_CPU_V7) ||
+   IS_ENABLED(CONFIG_CPU_V7M))
+   return true;
+
+   /* FIXME: runtime detection */
+   return false;
+}
+
 /*
  * Make an area consistent for devices.
  * Note: Drivers should NOT use this function directly.
@@ -674,25 +685,35 @@ static void dma_cache_maint(phys_addr_t paddr,
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   dma_cache_maint(paddr, size, dir, dmac_map_area);
-
-   if (dir == DMA_FROM_DEVICE) {
-   outer_inv_range(paddr, paddr + size);
-   } else {
+   switch (dir) {
+   case DMA_TO_DEVICE:
+   dma_cache_maint(paddr, size, dmac_clean_range);
outer_clean_range(paddr, paddr + size);
+   break;
+   case DMA_FROM_DEVICE:
+   dma_cache_maint(paddr, size, dmac_inv_range);
+   outer_inv_range(paddr, paddr + size);
+   break;
+   case DMA_BIDIRECTIONAL:
+   if (arch_sync_dma_cpu_needs_post_dma_flush()) {
+   dma_cache_maint(paddr, size, dmac_clean_range);
+   outer_clean_range(paddr, paddr + size);
+   } else {
+

[PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The arm version of the arch_sync_dma_for_cpu() function annotates pages as
PG_dcache_clean after a DMA, but no other architecture does this here. On
ia64, the same thing is done in arch_dma_mark_clean(), so it makes sense
to use the same hook in order to have identical arch_sync_dma_for_cpu()
semantics as all other architectures.

Splitting this out has multiple effects:

 - for dma-direct, this now gets called after arch_sync_dma_for_cpu()
   for DMA_FROM_DEVICE mappings, but not for DMA_BIDIRECTIONAL. While
   it would not be harmful to keep doing it for bidirectional mappings,
   those are apparently not used in any callers that care about the flag.

 - Since arm has its own dma-iommu abstraction, this now also needs to
   call the same function, so the calls are added there to mirror the
   dma-direct version.

 - Like dma-direct, the dma-iommu version now marks the dcache clean
   for both coherent and noncoherent devices after a DMA, but it only
   does this for DMA_FROM_DEVICE, not DMA_BIDIRECTIONAL.

[ HELP NEEDED: can anyone confirm that it is a correct assumption
  on arm that a cache-coherent device writing to a page always results
  in it being in a PG_dcache_clean state like on ia64, or can a device
  write directly into the dcache?]

Signed-off-by: Arnd Bergmann 
---
 arch/arm/Kconfig  |  1 +
 arch/arm/mm/dma-mapping.c | 71 +++
 2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index e24a9820e12f..125d58c54ab1 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -7,6 +7,7 @@ config ARM
select ARCH_HAS_BINFMT_FLAT
select ARCH_HAS_CURRENT_STACK_POINTER
select ARCH_HAS_DEBUG_VIRTUAL if MMU
+   select ARCH_HAS_DMA_MARK_CLEAN if MMU
select ARCH_HAS_DMA_WRITE_COMBINE if !ARM_DMA_MEM_BUFFERABLE
select ARCH_HAS_ELF_RANDOMIZE
select ARCH_HAS_FORTIFY_SOURCE
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index cc702cb27ae7..b703cb83d27e 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -665,6 +665,28 @@ static void dma_cache_maint(phys_addr_t paddr,
} while (left);
 }
 
+/*
+ * Mark the D-cache clean for these pages to avoid extra flushing.
+ */
+void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
+{
+   unsigned long pfn = PFN_UP(paddr);
+   unsigned long off = paddr & (PAGE_SIZE - 1);
+   size_t left = size;
+
+   if (size < PAGE_SIZE)
+   return;
+
+   if (off)
+   left -= PAGE_SIZE - off;
+
+   while (left >= PAGE_SIZE) {
+   struct page *page = pfn_to_page(pfn++);
+   set_bit(PG_dcache_clean, &page->flags);
+   left -= PAGE_SIZE;
+   }
+}
+
 static bool arch_sync_dma_cpu_needs_post_dma_flush(void)
 {
if (IS_ENABLED(CONFIG_CPU_V6) ||
@@ -715,24 +737,6 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
outer_inv_range(paddr, paddr + size);
dma_cache_maint(paddr, size, dmac_inv_range);
}
-
-   /*
-* Mark the D-cache clean for these pages to avoid extra flushing.
-*/
-   if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-   unsigned long pfn = PFN_UP(paddr);
-   unsigned long off = paddr & (PAGE_SIZE - 1);
-   size_t left = size;
-
-   if (off)
-   left -= PAGE_SIZE - off;
-
-   while (left >= PAGE_SIZE) {
-   struct page *page = pfn_to_page(pfn++);
-   set_bit(PG_dcache_clean, &page->flags);
-   left -= PAGE_SIZE;
-   }
-   }
 }
 
 #ifdef CONFIG_ARM_DMA_USE_IOMMU
@@ -1294,6 +1298,17 @@ static int arm_iommu_map_sg(struct device *dev, struct 
scatterlist *sg,
return -EINVAL;
 }
 
+static void arm_iommu_sync_dma_for_cpu(phys_addr_t phys, size_t len,
+  enum dma_data_direction dir,
+  bool dma_coherent)
+{
+   if (!dma_coherent)
+   arch_sync_dma_for_cpu(phys, len, dir);
+
+   if (dir == DMA_FROM_DEVICE)
+   arch_dma_mark_clean(phys, len);
+}
+
 /**
  * arm_iommu_unmap_sg - unmap a set of SG buffers mapped by dma_map_sg
  * @dev: valid struct device pointer
@@ -1316,8 +1331,9 @@ static void arm_iommu_unmap_sg(struct device *dev,
if (sg_dma_len(s))
__iommu_remove_mapping(dev, sg_dma_address(s),
   sg_dma_len(s));
-   if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-   arch_sync_dma_for_cpu(sg_phys(s), s->length, dir);
+   if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
+   arm_iommu_sync_dma_for_cpu(sg_phys(s), s->length, dir,
+  dev->dma_coherent);

[PATCH 18/21] ARM: drop SMP support for ARM11MPCore

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The cache management operations for noncoherent DMA on ARMv6 work
in two different ways:

 * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
   DMA buffers lead to data corruption when the prefetched data is written
   back on top of data from the device.

 * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
   is not seen by the other core(s), leading to inconsistent contents
   across the system.

As a consequence, neither configuration is actually safe to use in a
general-purpose kernel that is used on both MPCore systems and ARM1176
with prefetching enabled.

We could add further workarounds to make the behavior more dynamic based
on the system, but realistically, there are close to zero remaining
users on any ARM11MPCore anyway, and nobody seems too interested in it,
compared to the more popular ARM1176 used in BCM2835 and AST2500.

The Oxnas platform has some minimal support in OpenWRT, but most of the
drivers and dts files never made it into the mainline kernel, while the
Arm Versatile/Realview platform mainly serves as a reference system but
does not need to be kept working once all other ARM11MPCore users are gone.

Take the easy way out here and drop support for multiprocessing on
ARMv6, along with the CONFIG_DMA_CACHE_RWFO option and the cache
management implementation for it. This also helps with other ARMv6
issues, but for the moment leaves the ability to build a kernel that
can run on both ARMv7 SMP and single-processor ARMv6, which we probably
want to stop supporting as well, but not as part of this series.

Cc: Neil Armstrong 
Cc: Daniel Golle 
Cc: Linus Walleij 
Cc: linux-ox...@groups.io
Signed-off-by: Arnd Bergmann 
---
I could use some help clarifying the above changelog text to describe
the exact problem, and how the CONFIG_DMA_CACHE_RWFO actually works on
MPCore. The TRMs for both 1176 and 11MPCore only describe prefetching
into the instruction cache, not the data cache, but this can end up in
the outercache as a result. The 1176 has some extra control bits to
control prefetching, but I found no reference that explains why an
MPCore does not run into the problem.
---
 arch/arm/mach-oxnas/Kconfig|  4 -
 arch/arm/mach-oxnas/Makefile   |  1 -
 arch/arm/mach-oxnas/headsmp.S  | 23 --
 arch/arm/mach-oxnas/platsmp.c  | 96 --
 arch/arm/mach-versatile/platsmp-realview.c |  4 -
 arch/arm/mm/Kconfig| 19 -
 arch/arm/mm/cache-v6.S | 31 ---
 7 files changed, 178 deletions(-)
 delete mode 100644 arch/arm/mach-oxnas/headsmp.S
 delete mode 100644 arch/arm/mach-oxnas/platsmp.c

diff --git a/arch/arm/mach-oxnas/Kconfig b/arch/arm/mach-oxnas/Kconfig
index a9ded7079268..a054235c3d6c 100644
--- a/arch/arm/mach-oxnas/Kconfig
+++ b/arch/arm/mach-oxnas/Kconfig
@@ -28,10 +28,6 @@ config MACH_OX820
bool "Support OX820 Based Products"
depends on ARCH_MULTI_V6
select ARM_GIC
-   select DMA_CACHE_RWFO if SMP
-   select HAVE_SMP
-   select HAVE_ARM_SCU if SMP
-   select HAVE_ARM_TWD if SMP
help
  Include Support for the Oxford Semiconductor OX820 SoC Based Products.
 
diff --git a/arch/arm/mach-oxnas/Makefile b/arch/arm/mach-oxnas/Makefile
index 0e78ecfe6c49..a4e40e534e6a 100644
--- a/arch/arm/mach-oxnas/Makefile
+++ b/arch/arm/mach-oxnas/Makefile
@@ -1,2 +1 @@
 # SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_SMP)  += platsmp.o headsmp.o
diff --git a/arch/arm/mach-oxnas/headsmp.S b/arch/arm/mach-oxnas/headsmp.S
deleted file mode 100644
index 9c0f1479f33a..
--- a/arch/arm/mach-oxnas/headsmp.S
+++ /dev/null
@@ -1,23 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-only */
-/*
- * Copyright (C) 2013 Ma Haijun 
- * Copyright (c) 2003 ARM Limited
- * All Rights Reserved
- */
-#include 
-#include 
-
-   __INIT
-
-/*
- * OX820 specific entry point for secondary CPUs.
- */
-ENTRY(ox820_secondary_startup)
-   mov r4, #0
-   /* invalidate both caches and branch target cache */
-   mcr p15, 0, r4, c7, c7, 0
-   /*
-* we've been released from the holding pen: secondary_stack
-* should now contain the SVC stack for this core
-*/
-   b   secondary_startup
diff --git a/arch/arm/mach-oxnas/platsmp.c b/arch/arm/mach-oxnas/platsmp.c
deleted file mode 100644
index f0a50b9e61df..
--- a/arch/arm/mach-oxnas/platsmp.c
+++ /dev/null
@@ -1,96 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * Copyright (C) 2016 Neil Armstrong 
- * Copyright (C) 2013 Ma Haijun 
- * Copyright (C) 2002 ARM Ltd.
- * All Rights Reserved
- */
-#include 
-#include 
-#include 
-#include 
-
-#include 
-#include 
-#include 
-#include 
-
-extern void ox820_secondary_startup(void);
-
-static void __iomem *cpu_ctrl;
-static void __iomem *gic_cpu_ctrl;
-
-#define HOLDINGPEN_CPU_OFFSET  0xc8
-#define HOLDINGPEN_LOCATION_OFFSET 

[PATCH 17/21] ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The arm specific iommu code in dma-mapping.c uses the page+offset based
__dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
wrappers around them.

In order to be able to move the latter set of functions into
common code, change the iommu implementation to use them directly
and remove the internal ones as a separate interface.

As the page+offset and phys_addr_t forms are equivalent but used in
different parts of the code here, this removes some of the
conversions while adding them elsewhere.
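
As a short illustration of that equivalence (not part of the patch, just
spelling out the relation the text relies on):

        /* the page+offset pair and the phys_addr_t describe the same byte */
        phys_addr_t paddr = page_to_phys(page) + offset;
        struct page *pg = pfn_to_page(PFN_DOWN(paddr)); /* == page while offset < PAGE_SIZE */
        unsigned long off = paddr % PAGE_SIZE;          /* == offset for the same reason */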

Signed-off-by: Arnd Bergmann 
---
 arch/arm/mm/dma-mapping.c | 93 ++-
 1 file changed, 33 insertions(+), 60 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 8bc01071474a..ce4b74f34a58 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -622,16 +622,14 @@ static void __arm_dma_free(struct device *dev, size_t 
size, void *cpu_addr,
kfree(buf);
 }
 
-static void dma_cache_maint_page(struct page *page, unsigned long offset,
+static void dma_cache_maint(phys_addr_t paddr,
size_t size, enum dma_data_direction dir,
void (*op)(const void *, size_t, int))
 {
-   unsigned long pfn;
+   unsigned long pfn = PFN_DOWN(paddr);
+   unsigned long offset = paddr % PAGE_SIZE;
size_t left = size;
 
-   pfn = page_to_pfn(page) + offset / PAGE_SIZE;
-   offset %= PAGE_SIZE;
-
/*
 * A single sg entry may refer to multiple physically contiguous
 * pages.  But we still need to process highmem pages individually.
@@ -641,8 +639,7 @@ static void dma_cache_maint_page(struct page *page, 
unsigned long offset,
do {
size_t len = left;
void *vaddr;
-
-   page = pfn_to_page(pfn);
+   struct page *page = pfn_to_page(pfn);
 
if (PageHighMem(page)) {
if (len + offset > PAGE_SIZE)
@@ -674,14 +671,11 @@ static void dma_cache_maint_page(struct page *page, 
unsigned long offset,
  * Note: Drivers should NOT use this function directly.
  * Use the driver DMA support - see dma-mapping.h (dma_sync_*)
  */
-static void __dma_page_cpu_to_dev(struct page *page, unsigned long off,
-   size_t size, enum dma_data_direction dir)
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+   enum dma_data_direction dir)
 {
-   phys_addr_t paddr;
+   dma_cache_maint(paddr, size, dir, dmac_map_area);
 
-   dma_cache_maint_page(page, off, size, dir, dmac_map_area);
-
-   paddr = page_to_phys(page) + off;
if (dir == DMA_FROM_DEVICE) {
outer_inv_range(paddr, paddr + size);
} else {
@@ -690,34 +684,30 @@ static void __dma_page_cpu_to_dev(struct page *page, 
unsigned long off,
/* FIXME: non-speculating: flush on bidirectional mappings? */
 }
 
-static void __dma_page_dev_to_cpu(struct page *page, unsigned long off,
-   size_t size, enum dma_data_direction dir)
+void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+   enum dma_data_direction dir)
 {
-   phys_addr_t paddr = page_to_phys(page) + off;
-
/* FIXME: non-speculating: not required */
/* in any case, don't bother invalidating if DMA to device */
if (dir != DMA_TO_DEVICE) {
outer_inv_range(paddr, paddr + size);
 
-   dma_cache_maint_page(page, off, size, dir, dmac_unmap_area);
+   dma_cache_maint(paddr, size, dir, dmac_unmap_area);
}
 
/*
 * Mark the D-cache clean for these pages to avoid extra flushing.
 */
if (dir != DMA_TO_DEVICE && size >= PAGE_SIZE) {
-   unsigned long pfn;
+   unsigned long pfn = PFN_UP(paddr);
+   unsigned long off = paddr & (PAGE_SIZE - 1);
size_t left = size;
 
-   pfn = page_to_pfn(page) + off / PAGE_SIZE;
-   off %= PAGE_SIZE;
-   if (off) {
-   pfn++;
+   if (off)
left -= PAGE_SIZE - off;
-   }
+
while (left >= PAGE_SIZE) {
-   page = pfn_to_page(pfn++);
+   struct page *page = pfn_to_page(pfn++);
 set_bit(PG_dcache_clean, &page->flags);
left -= PAGE_SIZE;
}
@@ -1204,7 +1194,7 @@ static int __map_sg_chunk(struct device *dev, struct 
scatterlist *sg,
unsigned int len = PAGE_ALIGN(s->offset + s->length);
 
if (!dev->dma_coherent && !(attrs & DMA_ATTR_SKIP_CPU_SYNC))
-   __dma_page_cpu_to_dev(sg_page(s), s->offset, s->length, 
dir);
+   arch_sync_dma_for_device(phys + s->offset, s->length, 
dir);
 
prot = __dma_info_to_prot(dir, attrs);
 
@@ -1306,8 +1296,7 

[PATCH 16/21] ARM: dma-mapping: bring back dmac_{clean,inv}_range

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

These were removed ages ago in commit 702b94bff3c5 ("ARM: dma-mapping:
remove dmac_clean_range and dmac_inv_range") in an effort to sanitize
the dma-mapping API.

Now this logic is getting moved into the generic dma-mapping
implementation in order to give architectures less control over
it, which requires reverting that earlier work.

Signed-off-by: Arnd Bergmann 
---
 arch/arm/include/asm/cacheflush.h | 21 +
 arch/arm/include/asm/glue-cache.h |  4 
 arch/arm/mm/cache-fa.S|  4 ++--
 arch/arm/mm/cache-nop.S   |  6 ++
 arch/arm/mm/cache-v4.S|  5 +
 arch/arm/mm/cache-v4wb.S  |  4 ++--
 arch/arm/mm/cache-v4wt.S  | 14 +-
 arch/arm/mm/cache-v6.S|  4 ++--
 arch/arm/mm/cache-v7.S|  6 --
 arch/arm/mm/cache-v7m.S   |  4 ++--
 arch/arm/mm/proc-arm1020.S|  4 ++--
 arch/arm/mm/proc-arm1020e.S   |  4 ++--
 arch/arm/mm/proc-arm1022.S|  4 ++--
 arch/arm/mm/proc-arm1026.S|  4 ++--
 arch/arm/mm/proc-arm920.S |  4 ++--
 arch/arm/mm/proc-arm922.S |  4 ++--
 arch/arm/mm/proc-arm925.S |  4 ++--
 arch/arm/mm/proc-arm926.S |  4 ++--
 arch/arm/mm/proc-arm940.S |  4 ++--
 arch/arm/mm/proc-arm946.S |  4 ++--
 arch/arm/mm/proc-feroceon.S   |  8 
 arch/arm/mm/proc-macros.S |  2 ++
 arch/arm/mm/proc-mohawk.S |  4 ++--
 arch/arm/mm/proc-xsc3.S   |  4 ++--
 arch/arm/mm/proc-xscale.S |  6 --
 25 files changed, 95 insertions(+), 41 deletions(-)

diff --git a/arch/arm/include/asm/cacheflush.h 
b/arch/arm/include/asm/cacheflush.h
index a094f964c869..04462bfe9130 100644
--- a/arch/arm/include/asm/cacheflush.h
+++ b/arch/arm/include/asm/cacheflush.h
@@ -91,6 +91,21 @@
  * DMA Cache Coherency
  * ===
  *
+ * dma_inv_range(start, end)
+ *
+ * Invalidate (discard) the specified virtual address range.
+ * May not write back any entries.  If 'start' or 'end'
+ * are not cache line aligned, those lines must be written
+ * back.
+ * - start  - virtual start address
+ * - end- virtual end address
+ *
+ * dma_clean_range(start, end)
+ *
+ * Clean (write back) the specified virtual address range.
+ * - start  - virtual start address
+ * - end- virtual end address
+ *
  * dma_flush_range(start, end)
  *
  * Clean and invalidate the specified virtual address range.
@@ -112,6 +127,8 @@ struct cpu_cache_fns {
void (*dma_map_area)(const void *, size_t, int);
void (*dma_unmap_area)(const void *, size_t, int);
 
+   void (*dma_clean_range)(const void *, const void *);
+   void (*dma_inv_range)(const void *, const void *);
void (*dma_flush_range)(const void *, const void *);
 } __no_randomize_layout;
 
@@ -137,6 +154,8 @@ extern struct cpu_cache_fns cpu_cache;
  * is visible to DMA, or data written by DMA to system memory is
  * visible to the CPU.
  */
+#define dmac_clean_range   cpu_cache.dma_clean_range
+#define dmac_inv_range cpu_cache.dma_inv_range
 #define dmac_flush_range   cpu_cache.dma_flush_range
 
 #else
@@ -156,6 +175,8 @@ extern void __cpuc_flush_dcache_area(void *, size_t);
  * is visible to DMA, or data written by DMA to system memory is
  * visible to the CPU.
  */
+extern void dmac_clean_range(const void *, const void *);
+extern void dmac_inv_range(const void *, const void *);
 extern void dmac_flush_range(const void *, const void *);
 
 #endif
diff --git a/arch/arm/include/asm/glue-cache.h 
b/arch/arm/include/asm/glue-cache.h
index 724f8dac1e5b..d8c93b483adf 100644
--- a/arch/arm/include/asm/glue-cache.h
+++ b/arch/arm/include/asm/glue-cache.h
@@ -139,6 +139,8 @@ static inline int nop_coherent_user_range(unsigned long a,
unsigned long b) { return 0; }
 static inline void nop_flush_kern_dcache_area(void *a, size_t s) { }
 
+static inline void nop_dma_clean_range(const void *a, const void *b) { }
+static inline void nop_dma_inv_range(const void *a, const void *b) { }
 static inline void nop_dma_flush_range(const void *a, const void *b) { }
 
 static inline void nop_dma_map_area(const void *s, size_t l, int f) { }
@@ -155,6 +157,8 @@ static inline void nop_dma_unmap_area(const void *s, size_t 
l, int f) { }
 #define __cpuc_coherent_user_range __glue(_CACHE,_coherent_user_range)
 #define __cpuc_flush_dcache_area   __glue(_CACHE,_flush_kern_dcache_area)
 
+#define dmac_clean_range   __glue(_CACHE,_dma_clean_range)
+#define dmac_inv_range __glue(_CACHE,_dma_inv_range)
 #define dmac_flush_range   __glue(_CACHE,_dma_flush_range)
 #endif
 
diff --git a/arch/arm/mm/cache-fa.S b/arch/arm/mm/cache-fa.S
index 3a464d1649b4..abc3d58948dd 100644
--- a/arch/arm/mm/cache-fa.S
+++ 

[PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

Most ARM CPUs can have write-back caches, which requires
cache management to be done in the dma_sync_*_for_device()
operation. This is typically done in both writeback and
writethrough mode.

The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
(arm920t, arm940t) implementations are the exception here,
and only do the cache management after the DMA is complete,
in the dma_sync_*_for_cpu() operation.

Change this for consistency with the other platforms. This
should have no user visible effect.

Signed-off-by: Arnd Bergmann 
---
 arch/arm/mm/cache-v4.S   | 8 
 arch/arm/mm/cache-v4wt.S | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/arm/mm/cache-v4.S b/arch/arm/mm/cache-v4.S
index 7787057e4990..e2b104876340 100644
--- a/arch/arm/mm/cache-v4.S
+++ b/arch/arm/mm/cache-v4.S
@@ -117,23 +117,23 @@ ENTRY(v4_dma_flush_range)
ret lr
 
 /*
- * dma_unmap_area(start, size, dir)
+ * dma_map_area(start, size, dir)
  * - start - kernel virtual start address
  * - size  - size of region
  * - dir   - DMA direction
  */
-ENTRY(v4_dma_unmap_area)
+ENTRY(v4_dma_map_area)
teq r2, #DMA_TO_DEVICE
bne v4_dma_flush_range
/* FALLTHROUGH */
 
 /*
- * dma_map_area(start, size, dir)
+ * dma_unmap_area(start, size, dir)
  * - start - kernel virtual start address
  * - size  - size of region
  * - dir   - DMA direction
  */
-ENTRY(v4_dma_map_area)
+ENTRY(v4_dma_unmap_area)
ret lr
 ENDPROC(v4_dma_unmap_area)
 ENDPROC(v4_dma_map_area)
diff --git a/arch/arm/mm/cache-v4wt.S b/arch/arm/mm/cache-v4wt.S
index 0b290c25a99d..652218752f88 100644
--- a/arch/arm/mm/cache-v4wt.S
+++ b/arch/arm/mm/cache-v4wt.S
@@ -172,24 +172,24 @@ v4wt_dma_inv_range:
 .equ   v4wt_dma_flush_range, v4wt_dma_inv_range
 
 /*
- * dma_unmap_area(start, size, dir)
+ * dma_map_area(start, size, dir)
  * - start - kernel virtual start address
  * - size  - size of region
  * - dir   - DMA direction
  */
-ENTRY(v4wt_dma_unmap_area)
+ENTRY(v4wt_dma_map_area)
add r1, r1, r0
teq r2, #DMA_TO_DEVICE
bne v4wt_dma_inv_range
/* FALLTHROUGH */
 
 /*
- * dma_map_area(start, size, dir)
+ * dma_unmap_area(start, size, dir)
  * - start - kernel virtual start address
  * - size  - size of region
  * - dir   - DMA direction
  */
-ENTRY(v4wt_dma_map_area)
+ENTRY(v4wt_dma_unmap_area)
ret lr
 ENDPROC(v4wt_dma_unmap_area)
 ENDPROC(v4wt_dma_map_area)
-- 
2.39.2




[PATCH 14/21] parisc: dma-mapping: use regular flush/invalidate ops

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

non-coherent devices on parisc traditionally use a full flush+invalidate
before and after each DMA, which is more expensive than what we do on
other architectures.

Before transfers to a device, the cache only has to be written back,
but apparently there is no operation for this on parisc. There is no
need to flush it again after the transfer though.

After transfers from a device, the second writeback can be skipped because
the CPU was not allowed to write to the buffer anyway, instead a purge
(invalidate without flush) can be used.

The DMA_FROM_DEVICE is handled differently across architectures,
most use only an invalidate (purge) operation, but some have moved
to flush in order to preserve dirty data when the device does not
write to the buffer, see the link below. As parisc already did the
full flush here, keep that behavior.

Link: https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Signed-off-by: Arnd Bergmann 
---
I'm not really sure I understand the semantics of the 'flush'
and 'purge' operations on parisc correctly, please double-check that
this makes sense in the context of this architecture.
---
 arch/parisc/include/asm/cacheflush.h |  6 +-
 arch/parisc/kernel/pci-dma.c | 25 +++--
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/arch/parisc/include/asm/cacheflush.h 
b/arch/parisc/include/asm/cacheflush.h
index 0bdee6724132..a4c5042f1821 100644
--- a/arch/parisc/include/asm/cacheflush.h
+++ b/arch/parisc/include/asm/cacheflush.h
@@ -33,8 +33,12 @@ void flush_cache_mm(struct mm_struct *mm);
 
 void flush_kernel_dcache_page_addr(const void *addr);
 
+#define clean_kernel_dcache_range(start,size) \
+   flush_kernel_dcache_range((start), (size))
 #define flush_kernel_dcache_range(start,size) \
-   flush_kernel_dcache_range_asm((start), (start)+(size));
+   flush_kernel_dcache_range_asm((start), (start)+(size))
+#define purge_kernel_dcache_range(start,size) \
+   purge_kernel_dcache_range_asm((start), (start)+(size))
 
 #define ARCH_IMPLEMENTS_FLUSH_KERNEL_VMAP_RANGE 1
 void flush_kernel_vmap_range(void *vaddr, int size);
diff --git a/arch/parisc/kernel/pci-dma.c b/arch/parisc/kernel/pci-dma.c
index ba87f791323b..6d3d3cffb316 100644
--- a/arch/parisc/kernel/pci-dma.c
+++ b/arch/parisc/kernel/pci-dma.c
@@ -446,11 +446,32 @@ void arch_dma_free(struct device *dev, size_t size, void 
*vaddr,
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   flush_kernel_dcache_range((unsigned long)phys_to_virt(paddr), size);
+   unsigned long virt = (unsigned long)phys_to_virt(paddr);
+
+   switch (dir) {
+   case DMA_TO_DEVICE:
+   clean_kernel_dcache_range(virt, size);
+   break;
+   case DMA_FROM_DEVICE:
+   clean_kernel_dcache_range(virt, size);
+   break;
+   case DMA_BIDIRECTIONAL:
+   flush_kernel_dcache_range(virt, size);
+   break;
+   }
 }
 
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   flush_kernel_dcache_range((unsigned long)phys_to_virt(paddr), size);
+   unsigned long virt = (unsigned long)phys_to_virt(paddr);
+
+   switch (dir) {
+   case DMA_TO_DEVICE:
+   break;
+   case DMA_FROM_DEVICE:
+   case DMA_BIDIRECTIONAL:
+   purge_kernel_dcache_range(virt, size);
+   break;
+   }
 }
-- 
2.39.2




[PATCH 13/21] arc: dma-mapping: skip invalidating before bidirectional DMA

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

Some architectures that need to invalidate buffers after bidirectional
DMA because of speculative prefetching only do a simpler writeback
before that DMA, while architectures that don't need to do the second
invalidate tend to have a combined writeback+invalidate before the
DMA.

arc is one of the architectures that does both, which seems unnecessary.

Change it to behave like arm/arm64/xtensa instead, and use just a
writeback before the DMA when we do the invalidate afterwards.

Signed-off-by: Arnd Bergmann 
---
 arch/arc/mm/dma.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arc/mm/dma.c b/arch/arc/mm/dma.c
index 2a7fbbb83b70..ddb96786f765 100644
--- a/arch/arc/mm/dma.c
+++ b/arch/arc/mm/dma.c
@@ -40,7 +40,7 @@ void arch_dma_prep_coherent(struct page *page, size_t size)
  *  |
  * TO_DEV   |   writebackwriteback  |   none  none
  * FROM_DEV |   invalidate   invalidate |   invalidate*   invalidate*
- * BIDIR|   writeback+invwriteback+inv  |   invalidateinvalidate
+ * BIDIR|   writebackwriteback  |   invalidateinvalidate
  *
  * [*] needed for CPU speculative prefetches
  *
@@ -61,7 +61,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
break;
 
case DMA_BIDIRECTIONAL:
-   dma_cache_wback_inv(paddr, size);
+   dma_cache_wback(paddr, size);
break;
 
default:
-- 
2.39.2




[PATCH 12/21] mips: dma-mapping: split out cache operation logic

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The mips arch_sync_dma_for_device()/arch_sync_dma_for_cpu() functions
behave the same way as on other architectures, but in order to unify
the implementations, the code needs to be rearranged to pick the type
of cache operation in the outermost function.

Signed-off-by: Arnd Bergmann 
---
 arch/mips/mm/dma-noncoherent.c | 75 ++
 1 file changed, 30 insertions(+), 45 deletions(-)

diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index b4350faf4f1e..b9d68bcc5d53 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -54,50 +54,13 @@ void *arch_dma_set_uncached(void *addr, size_t size)
return (void *)(__pa(addr) + UNCAC_BASE);
 }
 
-static inline void dma_sync_virt_for_device(void *addr, size_t size,
-   enum dma_data_direction dir)
-{
-   switch (dir) {
-   case DMA_TO_DEVICE:
-   dma_cache_wback((unsigned long)addr, size);
-   break;
-   case DMA_FROM_DEVICE:
-   dma_cache_inv((unsigned long)addr, size);
-   break;
-   case DMA_BIDIRECTIONAL:
-   if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
-   cpu_needs_post_dma_flush())
-   dma_cache_wback((unsigned long)addr, size);
-   else
-   dma_cache_wback_inv((unsigned long)addr, size);
-   break;
-   default:
-   BUG();
-   }
-}
-
-static inline void dma_sync_virt_for_cpu(void *addr, size_t size,
-   enum dma_data_direction dir)
-{
-   switch (dir) {
-   case DMA_TO_DEVICE:
-   break;
-   case DMA_FROM_DEVICE:
-   case DMA_BIDIRECTIONAL:
-   dma_cache_inv((unsigned long)addr, size);
-   break;
-   default:
-   BUG();
-   }
-}
-
 /*
  * A single sg entry may refer to multiple physically contiguous pages.  But
  * we still need to process highmem pages individually.  If highmem is not
  * configured then the bulk of this loop gets optimized out.
  */
 static inline void dma_sync_phys(phys_addr_t paddr, size_t size,
-   enum dma_data_direction dir, bool for_device)
+   void(*cache_op)(unsigned long start, unsigned long size))
 {
struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
unsigned long offset = paddr & ~PAGE_MASK;
@@ -113,10 +76,7 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t 
size,
}
 
addr = kmap_atomic(page);
-   if (for_device)
-   dma_sync_virt_for_device(addr + offset, len, dir);
-   else
-   dma_sync_virt_for_cpu(addr + offset, len, dir);
+   cache_op((unsigned long)addr + offset, len);
kunmap_atomic(addr);
 
offset = 0;
@@ -128,15 +88,40 @@ static inline void dma_sync_phys(phys_addr_t paddr, size_t 
size,
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   dma_sync_phys(paddr, size, dir, true);
+   switch (dir) {
+   case DMA_TO_DEVICE:
+   dma_sync_phys(paddr, size, _dma_cache_wback);
+   break;
+   case DMA_FROM_DEVICE:
+   dma_sync_phys(paddr, size, _dma_cache_inv);
+   break;
+   case DMA_BIDIRECTIONAL:
+   if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+   cpu_needs_post_dma_flush())
+   dma_sync_phys(paddr, size, _dma_cache_wback);
+   else
+   dma_sync_phys(paddr, size, _dma_cache_wback_inv);
+   break;
+   default:
+   break;
+   }
 }
 
 #ifdef CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   if (cpu_needs_post_dma_flush())
-   dma_sync_phys(paddr, size, dir, false);
+   switch (dir) {
+   case DMA_TO_DEVICE:
+   break;
+   case DMA_FROM_DEVICE:
+   case DMA_BIDIRECTIONAL:
+   if (cpu_needs_post_dma_flush())
+   dma_sync_phys(paddr, size, _dma_cache_inv);
+   break;
+   default:
+   break;
+   }
 }
 #endif
 
-- 
2.39.2




[PATCH 11/21] mips: dma-mapping: skip invalidating before bidirectional DMA

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

Some architectures that need to invalidate buffers after bidirectional
DMA because of speculative prefetching only do a simpler writeback
before that DMA, while architectures that don't need to do the second
invalidate tend to have a combined writeback+invalidate before the
DMA.

The behavior on mips is slightly inconsistent, as it always
does the invalidation before bidirectional DMA and conditionally
does it a second time.

In order to make the behavior the same as the rest, change it
so that there is exactly one invalidation here, either before
or after the DMA.

Signed-off-by: Arnd Bergmann 
---
 arch/mips/mm/dma-noncoherent.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/mips/mm/dma-noncoherent.c b/arch/mips/mm/dma-noncoherent.c
index 3c4fc97b9f39..b4350faf4f1e 100644
--- a/arch/mips/mm/dma-noncoherent.c
+++ b/arch/mips/mm/dma-noncoherent.c
@@ -65,7 +65,11 @@ static inline void dma_sync_virt_for_device(void *addr, 
size_t size,
dma_cache_inv((unsigned long)addr, size);
break;
case DMA_BIDIRECTIONAL:
-   dma_cache_wback_inv((unsigned long)addr, size);
+   if (IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU) &&
+   cpu_needs_post_dma_flush())
+   dma_cache_wback((unsigned long)addr, size);
+   else
+   dma_cache_wback_inv((unsigned long)addr, size);
break;
default:
BUG();
-- 
2.39.2




[PATCH 10/21] csky: dma-mapping: skip invalidating before DMA from device

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

csky is the only architecture that does a full flush for the
dma_sync_*_for_device(..., DMA_FROM_DEVICE) operation. The requirement
is only to make sure there are no dirty cache lines for the buffer,
which can be either done through an invalidate operation (as on most
architectures including arm32, mips and arc), or a writeback (as on
arm64 and riscv). The cache also has to be invalidated eventually but
csky already does that after the transfer.

Use a 'clean' operation here for consistency with arm64 and riscv.

Signed-off-by: Arnd Bergmann 
---
 arch/csky/mm/dma-mapping.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/csky/mm/dma-mapping.c b/arch/csky/mm/dma-mapping.c
index 82447029feb4..c90f912e2822 100644
--- a/arch/csky/mm/dma-mapping.c
+++ b/arch/csky/mm/dma-mapping.c
@@ -60,11 +60,9 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
 {
switch (dir) {
case DMA_TO_DEVICE:
-   cache_op(paddr, size, dma_wb_range);
-   break;
case DMA_FROM_DEVICE:
case DMA_BIDIRECTIONAL:
-   cache_op(paddr, size, dma_wbinv_range);
+   cache_op(paddr, size, dma_wb_range);
break;
default:
BUG();
-- 
2.39.2




[PATCH 09/21] riscv: dma-mapping: skip invalidation before bidirectional DMA

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

For a DMA_BIDIRECTIONAL transfer, the caches have to be cleaned
first to let the device see data written by the CPU, and invalidated
after the transfer to let the CPU see data written by the device.

riscv also invalidates the caches before the transfer, which does
not appear to serve any purpose.
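
Combined with the previous patch in the series, the net result for a
bidirectional buffer is then (sketch, not a literal quote of the code):

	/* arch_sync_dma_for_device(paddr, size, DMA_BIDIRECTIONAL) */
	ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
	/* ... device reads and writes the buffer ... */
	/* arch_sync_dma_for_cpu(paddr, size, DMA_BIDIRECTIONAL) */
	ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);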

Signed-off-by: Arnd Bergmann 
---
 arch/riscv/mm/dma-noncoherent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index 640f4c496d26..69c80b2155a1 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -25,7 +25,7 @@ void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
case DMA_BIDIRECTIONAL:
-   ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+   ALT_CMO_OP(clean, vaddr, size, riscv_cbom_block_size);
break;
default:
break;
-- 
2.39.2




[PATCH 08/21] riscv: dma-mapping: only invalidate after DMA, not flush

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

No other architecture intentionally writes back dirty cache lines into
a buffer that a device has just finished writing into. If the cache is
clean, this has no effect at all, but if a cache line in the buffer has
actually been written by the CPU, there is a driver bug that is likely
made worse by overwriting that buffer.

Signed-off-by: Arnd Bergmann 
---
 arch/riscv/mm/dma-noncoherent.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/mm/dma-noncoherent.c b/arch/riscv/mm/dma-noncoherent.c
index d919efab6eba..640f4c496d26 100644
--- a/arch/riscv/mm/dma-noncoherent.c
+++ b/arch/riscv/mm/dma-noncoherent.c
@@ -42,7 +42,7 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
break;
case DMA_FROM_DEVICE:
case DMA_BIDIRECTIONAL:
-   ALT_CMO_OP(flush, vaddr, size, riscv_cbom_block_size);
+   ALT_CMO_OP(inval, vaddr, size, riscv_cbom_block_size);
break;
default:
break;
-- 
2.39.2




[PATCH 07/21] powerpc: dma-mapping: always clean cache in _for_device() op

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The powerpc implementation of arch_sync_dma_for_device() is unique in that
it sometimes performs a full flush for the arch_sync_dma_for_device(paddr,
size, DMA_FROM_DEVICE) operation when the address is unaligned, but
otherwise invalidates the caches.

Since the _for_cpu() counterpart has to invalidate the cache already
in order to avoid stale data from prefetching, this operation only really
needs to ensure that there are no dirty cache lines, which can be done
using either invalidation or cleaning the cache, but not necessarily
both.

Most architectures traditionally go for invalidation here, but as
Will Deacon points out, this can leak old data to user space if
a DMA is started but the device ends up not actually filling the
entire buffer, see the link below.

The same argument applies to DMA_BIDIRECTIONAL transfers. Using
a cache-clean operation is the safe choice here, followed by
invalidating the cache after the DMA to get rid of stale data
that was prefetched before the completion of the DMA.
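
For DMA_FROM_DEVICE the pair of operations around a transfer then ends
up as (sketch using the helpers introduced earlier in the series, not a
literal quote of the code):

	/* before the DMA: write back dirty lines so nothing stale can leak */
	__dma_phys_op(paddr, size, DMA_CACHE_CLEAN);
	/* ... device writes the buffer ... */
	/* after the DMA: drop whatever the CPU prefetched in the meantime */
	__dma_phys_op(paddr, size, DMA_CACHE_INVAL);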

Link: https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/
Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/mm/dma-noncoherent.c | 21 +
 1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/arch/powerpc/mm/dma-noncoherent.c 
b/arch/powerpc/mm/dma-noncoherent.c
index e108cacf877f..00e59a4faa2b 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -104,26 +104,7 @@ static void __dma_phys_op(phys_addr_t paddr, size_t size, 
enum dma_cache_op op)
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   switch (direction) {
-   case DMA_NONE:
-   BUG();
-   case DMA_FROM_DEVICE:
-   /*
-* invalidate only when cache-line aligned otherwise there is
-* the potential for discarding uncommitted data from the cache
-*/
-   if ((start | end) & (L1_CACHE_BYTES - 1))
-   __dma_phys_op(start, end, DMA_CACHE_FLUSH);
-   else
-   __dma_phys_op(start, end, DMA_CACHE_INVAL);
-   break;
-   case DMA_TO_DEVICE: /* writeback only */
-   __dma_phys_op(start, end, DMA_CACHE_CLEAN);
-   break;
-   case DMA_BIDIRECTIONAL: /* writeback and invalidate */
-   __dma_phys_op(start, end, DMA_CACHE_FLUSH);
-   break;
-   }
+   __dma_phys_op(start, end, DMA_CACHE_CLEAN);
 }
 
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
-- 
2.39.2




[PATCH 06/21] powerpc: dma-mapping: minimize for_cpu flushing

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The powerpc dma_sync_*_for_cpu() variants do more flushes than on other
architectures. Reduce it to what everyone else does:

 - No flush is needed after data has been sent to a device

 - When data has been received from a device, the cache only needs to
   be invalidated to clear out cache lines that were speculatively
   prefetched.

In particular, the second flushing of partial cache lines of bidirectional
buffers is actively harmful -- if a single cache line is written by both
the CPU and the device, flushing it again does not maintain coherency
but instead overwrites the data that was just received from the device.
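
Spelled out with the cache operation names introduced earlier in the
series (illustration only, not the actual diff), the difference for such
a shared line after the DMA is:

	/* harmful: the flush writes the line back before invalidating it,
	 * so the CPU's stale copy of the buffer bytes can overwrite what
	 * the device just placed in memory */
	__dma_phys_op(paddr, size, DMA_CACHE_FLUSH);

	/* safe: invalidate only, stale cached copies are discarded */
	__dma_phys_op(paddr, size, DMA_CACHE_INVAL);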

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/mm/dma-noncoherent.c | 18 --
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/mm/dma-noncoherent.c 
b/arch/powerpc/mm/dma-noncoherent.c
index f10869d27de5..e108cacf877f 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -132,21 +132,11 @@ void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
switch (direction) {
case DMA_NONE:
BUG();
-   case DMA_FROM_DEVICE:
-   /*
-* invalidate only when cache-line aligned otherwise there is
-* the potential for discarding uncommitted data from the cache
-*/
-   if ((start | end) & (L1_CACHE_BYTES - 1))
-   __dma_phys_op(start, end, DMA_CACHE_FLUSH);
-   else
-   __dma_phys_op(start, end, DMA_CACHE_INVAL);
-   break;
-   case DMA_TO_DEVICE: /* writeback only */
-   __dma_phys_op(start, end, DMA_CACHE_CLEAN);
+   case DMA_TO_DEVICE:
break;
-   case DMA_BIDIRECTIONAL: /* writeback and invalidate */
-   __dma_phys_op(start, end, DMA_CACHE_FLUSH);
+   case DMA_FROM_DEVICE:
+   case DMA_BIDIRECTIONAL:
+   __dma_phys_op(start, end, DMA_CACHE_INVAL);
break;
}
 }
-- 
2.39.2




[PATCH 05/21] powerpc: dma-mapping: split out cache operation logic

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The powerpc arch_sync_dma_for_device()/arch_sync_dma_for_cpu() functions
behave differently from all other architectures, at least for some of
the operations.

As a preparation for making the behavior more consistent, reorder the
logic in which they decide whether to flush, invalidate or clean the buffer.
No change in behavior is intended.

Signed-off-by: Arnd Bergmann 
---
 arch/powerpc/mm/dma-noncoherent.c | 91 +--
 1 file changed, 63 insertions(+), 28 deletions(-)

diff --git a/arch/powerpc/mm/dma-noncoherent.c 
b/arch/powerpc/mm/dma-noncoherent.c
index 30260b5d146d..f10869d27de5 100644
--- a/arch/powerpc/mm/dma-noncoherent.c
+++ b/arch/powerpc/mm/dma-noncoherent.c
@@ -16,31 +16,28 @@
 #include 
 #include 
 
+enum dma_cache_op {
+   DMA_CACHE_CLEAN,
+   DMA_CACHE_INVAL,
+   DMA_CACHE_FLUSH,
+};
+
 /*
  * make an area consistent.
  */
-static void __dma_sync(void *vaddr, size_t size, int direction)
+static void __dma_op(void *vaddr, size_t size, enum dma_cache_op op)
 {
unsigned long start = (unsigned long)vaddr;
unsigned long end   = start + size;
 
-   switch (direction) {
-   case DMA_NONE:
-   BUG();
-   case DMA_FROM_DEVICE:
-   /*
-* invalidate only when cache-line aligned otherwise there is
-* the potential for discarding uncommitted data from the cache
-*/
-   if ((start | end) & (L1_CACHE_BYTES - 1))
-   flush_dcache_range(start, end);
-   else
-   invalidate_dcache_range(start, end);
-   break;
-   case DMA_TO_DEVICE: /* writeback only */
+   switch (op) {
+   case DMA_CACHE_CLEAN:
clean_dcache_range(start, end);
break;
-   case DMA_BIDIRECTIONAL: /* writeback and invalidate */
+   case DMA_CACHE_INVAL:
+   invalidate_dcache_range(start, end);
+   break;
+   case DMA_CACHE_FLUSH:
flush_dcache_range(start, end);
break;
}
@@ -48,16 +45,16 @@ static void __dma_sync(void *vaddr, size_t size, int 
direction)
 
 #ifdef CONFIG_HIGHMEM
 /*
- * __dma_sync_page() implementation for systems using highmem.
+ * __dma_highmem_op() implementation for systems using highmem.
  * In this case, each page of a buffer must be kmapped/kunmapped
- * in order to have a virtual address for __dma_sync(). This must
+ * in order to have a virtual address for __dma_op(). This must
  * not sleep so kmap_atomic()/kunmap_atomic() are used.
  *
  * Note: yes, it is possible and correct to have a buffer extend
  * beyond the first page.
  */
-static inline void __dma_sync_page_highmem(struct page *page,
-   unsigned long offset, size_t size, int direction)
+static inline void __dma_highmem_op(struct page *page,
+   unsigned long offset, size_t size, enum dma_cache_op op)
 {
size_t seg_size = min((size_t)(PAGE_SIZE - offset), size);
size_t cur_size = seg_size;
@@ -71,7 +68,7 @@ static inline void __dma_sync_page_highmem(struct page *page,
start = (unsigned long)kmap_atomic(page + seg_nr) + seg_offset;
 
/* Sync this buffer segment */
-   __dma_sync((void *)start, seg_size, direction);
+   __dma_op((void *)start, seg_size, op);
kunmap_atomic((void *)start);
seg_nr++;
 
@@ -88,32 +85,70 @@ static inline void __dma_sync_page_highmem(struct page 
*page,
 #endif /* CONFIG_HIGHMEM */
 
 /*
- * __dma_sync_page makes memory consistent. identical to __dma_sync, but
- * takes a struct page instead of a virtual address
+ * __dma_phys_op makes memory consistent. identical to __dma_op, but
+ * takes a phys_addr_t instead of a virtual address
  */
-static void __dma_sync_page(phys_addr_t paddr, size_t size, int dir)
+static void __dma_phys_op(phys_addr_t paddr, size_t size, enum dma_cache_op op)
 {
struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
unsigned offset = paddr & ~PAGE_MASK;
 
 #ifdef CONFIG_HIGHMEM
-   __dma_sync_page_highmem(page, offset, size, dir);
+   __dma_highmem_op(page, offset, size, op);
 #else
unsigned long start = (unsigned long)page_address(page) + offset;
-   __dma_sync((void *)start, size, dir);
+   __dma_op((void *)start, size, op);
 #endif
 }
 
 void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   __dma_sync_page(paddr, size, dir);
+   switch (direction) {
+   case DMA_NONE:
+   BUG();
+   case DMA_FROM_DEVICE:
+   /*
+* invalidate only when cache-line aligned otherwise there is
+* the potential for discarding uncommitted data from the cache
+*/
+   if ((start | end) & (L1_CACHE_BYTES - 1))
+   

[PATCH 04/21] microblaze: dma-mapping: skip extra DMA flushes

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The microblaze dma_sync_* implementation uses the same function
for both _for_cpu() and _for_device(), which is inconsistent
with other architectures and slightly more expensive.

Split it up into separate functions and skip the parts that
are not needed:

 - on dma_sync_*_for_cpu(..., DMA_TO_DEVICE), skip the second
   writeback, which does nothing.

 - on dma_sync_*_for_cpu(..., DMA_BIDIRECTIONAL), only invalidate
   the cache to clear out cache lines that got loaded speculatively,
   but skip the extraneous writeback.

Signed-off-by: Arnd Bergmann 
---
 arch/microblaze/kernel/dma.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/arch/microblaze/kernel/dma.c b/arch/microblaze/kernel/dma.c
index 04d091ade417..b4c4e45fd45e 100644
--- a/arch/microblaze/kernel/dma.c
+++ b/arch/microblaze/kernel/dma.c
@@ -14,8 +14,8 @@
 #include 
 #include 
 
-static void __dma_sync(phys_addr_t paddr, size_t size,
-   enum dma_data_direction direction)
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
+   enum dma_data_direction dir)
 {
switch (direction) {
case DMA_TO_DEVICE:
@@ -30,14 +30,16 @@ static void __dma_sync(phys_addr_t paddr, size_t size,
}
 }
 
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
-   enum dma_data_direction dir)
-{
-   __dma_sync(paddr, size, dir);
-}
-
 void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
-   __dma_sync(paddr, size, dir);
-}
+   switch (direction) {
+   case DMA_TO_DEVICE:
+   break;
+   case DMA_BIDIRECTIONAL:
+   case DMA_FROM_DEVICE:
+   invalidate_dcache_range(paddr, paddr + size);
+   break;
+   default:
+   BUG();
+   }
+}
-- 
2.39.2




[PATCH 03/21] sparc32: flush caches in dma_sync_*for_device

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

Leon has a very minimalistic cache that has no range operations
and requires being flushed entirely to deal with noncoherent
DMA. Most in-order architectures do their cache management in
the dma_sync_*for_device() operations rather than dma_sync_*for_cpu.

Since the cache is write-through only, both should have the same
effect, so change it for consistency with the other architectures.

Signed-off-by: Arnd Bergmann 
---
 arch/sparc/Kconfig | 2 +-
 arch/sparc/kernel/ioport.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index 84437a4c6545..637da50e236c 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -51,7 +51,7 @@ config SPARC
 config SPARC32
def_bool !64BIT
select ARCH_32BIT_OFF_T
-   select ARCH_HAS_SYNC_DMA_FOR_CPU
+   select ARCH_HAS_SYNC_DMA_FOR_DEVICE
select CLZ_TAB
select DMA_DIRECT_REMAP
select GENERIC_ATOMIC64
diff --git a/arch/sparc/kernel/ioport.c b/arch/sparc/kernel/ioport.c
index 4e4f3d3263e4..4f3d26066ec2 100644
--- a/arch/sparc/kernel/ioport.c
+++ b/arch/sparc/kernel/ioport.c
@@ -306,7 +306,7 @@ arch_initcall(sparc_register_ioport);
  * On LEON systems without cache snooping, the entire D-CACHE must be flushed 
to
  * make DMA to cacheable memory coherent.
  */
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
if (dir != DMA_TO_DEVICE &&
-- 
2.39.2




[PATCH 02/21] xtensa: dma-mapping: use normal cache invalidation rules

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

xtensa is one of the platforms that has both write-back and write-through
caches, and needs to account for both in its DMA mapping operations.

It does this through a set of operations that is different from any
architecture. This is not a problem by itself, but it makes it rather
hard to figure out whether this is correct or not, and to unify this
implementation with the others.

Change the semantics to the usual ones for non-speculating CPUs:

 - On DMA_TO_DEVICE, call __flush_dcache_range() to perform the
   writeback even on writethrough caches, where this is a nop.

 - On DMA_FROM_DEVICE, invalidate the mapping before the DMA rather
   than afterwards.

 - On DMA_BIDIRECTIONAL, combine the pre-writeback with the
   post-invalidate into a call to __flush_invalidate_dcache_range()
   that turns into a simple invalidate on writethrough caches.

Signed-off-by: Arnd Bergmann 
---
 arch/xtensa/Kconfig  |  1 -
 arch/xtensa/include/asm/cacheflush.h |  6 +++---
 arch/xtensa/kernel/pci-dma.c | 29 +---
 3 files changed, 8 insertions(+), 28 deletions(-)

diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index bcb0c5d2abc2..b938bacbb9af 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -8,7 +8,6 @@ config XTENSA
select ARCH_HAS_DMA_PREP_COHERENT if MMU
select ARCH_HAS_GCOV_PROFILE_ALL
select ARCH_HAS_KCOV
-   select ARCH_HAS_SYNC_DMA_FOR_CPU if MMU
select ARCH_HAS_SYNC_DMA_FOR_DEVICE if MMU
select ARCH_HAS_DMA_SET_UNCACHED if MMU
select ARCH_HAS_STRNCPY_FROM_USER if !KASAN
diff --git a/arch/xtensa/include/asm/cacheflush.h 
b/arch/xtensa/include/asm/cacheflush.h
index 7b4359312c25..2f645d25565a 100644
--- a/arch/xtensa/include/asm/cacheflush.h
+++ b/arch/xtensa/include/asm/cacheflush.h
@@ -61,9 +61,9 @@ static inline void __flush_dcache_page(unsigned long va)
 static inline void __flush_dcache_range(unsigned long va, unsigned long sz)
 {
 }
-# define __flush_invalidate_dcache_all()   __invalidate_dcache_all()
-# define __flush_invalidate_dcache_page(p) __invalidate_dcache_page(p)
-# define __flush_invalidate_dcache_range(p,s)  __invalidate_dcache_range(p,s)
+# define __flush_invalidate_dcache_all __invalidate_dcache_all
+# define __flush_invalidate_dcache_page__invalidate_dcache_page
+# define __flush_invalidate_dcache_range   __invalidate_dcache_range
 #endif
 
 #if defined(CONFIG_MMU) && (DCACHE_WAY_SIZE > PAGE_SIZE)
diff --git a/arch/xtensa/kernel/pci-dma.c b/arch/xtensa/kernel/pci-dma.c
index 94955caa4488..ff3bf015eca4 100644
--- a/arch/xtensa/kernel/pci-dma.c
+++ b/arch/xtensa/kernel/pci-dma.c
@@ -43,38 +43,19 @@ static void do_cache_op(phys_addr_t paddr, size_t size,
}
 }
 
-void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
+void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
enum dma_data_direction dir)
 {
switch (dir) {
-   case DMA_BIDIRECTIONAL:
+   case DMA_TO_DEVICE:
+   do_cache_op(paddr, size, __flush_dcache_range);
+   break;
case DMA_FROM_DEVICE:
do_cache_op(paddr, size, __invalidate_dcache_range);
break;
-
-   case DMA_NONE:
-   BUG();
-   break;
-
-   default:
-   break;
-   }
-}
-
-void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
-   enum dma_data_direction dir)
-{
-   switch (dir) {
case DMA_BIDIRECTIONAL:
-   case DMA_TO_DEVICE:
-   if (XCHAL_DCACHE_IS_WRITEBACK)
-   do_cache_op(paddr, size, __flush_dcache_range);
+   do_cache_op(paddr, size, __flush_invalidate_dcache_range);
break;
-
-   case DMA_NONE:
-   BUG();
-   break;
-
default:
break;
}
-- 
2.39.2




[PATCH 01/21] openrisc: dma-mapping: flush bidirectional mappings

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

The cache management operations on DMA are different from the
other architectures:

 - on DMA_TO_DEVICE, Openrisc currently invalidates the cache
   after the writeback, where a simple writeback without
   invalidation should be sufficient.

 - on DMA_BIDIRECTIONAL, Openrisc does nothing, while most
   architectures either flush before DMA, or writeback before
   and invalidate after DMA. The separate invalidation for
   DMA_BIDIRECTIONAL/DMA_FROM_DEVICE is only required on CPUs
   that can do speculative prefetches.

Change both to have the normal set of operations.

Signed-off-by: Arnd Bergmann 
---
 arch/openrisc/kernel/dma.c | 15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

diff --git a/arch/openrisc/kernel/dma.c b/arch/openrisc/kernel/dma.c
index b3edbb33b621..91a00d09ffad 100644
--- a/arch/openrisc/kernel/dma.c
+++ b/arch/openrisc/kernel/dma.c
@@ -103,10 +103,10 @@ void arch_sync_dma_for_device(phys_addr_t addr, size_t 
size,
 
switch (dir) {
case DMA_TO_DEVICE:
-   /* Flush the dcache for the requested range */
+   /* Write back the dcache for the requested range */
for (cl = addr; cl < addr + size;
 cl += cpuinfo->dcache_block_size)
-   mtspr(SPR_DCBFR, cl);
+   mtspr(SPR_DCBWR, cl);
break;
case DMA_FROM_DEVICE:
/* Invalidate the dcache for the requested range */
@@ -114,12 +114,13 @@ void arch_sync_dma_for_device(phys_addr_t addr, size_t 
size,
 cl += cpuinfo->dcache_block_size)
mtspr(SPR_DCBIR, cl);
break;
+   case DMA_BIDIRECTIONAL:
+   /* Flush the dcache for the requested range */
+   for (cl = addr; cl < addr + size;
+cl += cpuinfo->dcache_block_size)
+   mtspr(SPR_DCBFR, cl);
+   break;
default:
-   /*
-* NOTE: If dir == DMA_BIDIRECTIONAL then there's no need to
-* flush nor invalidate the cache here as the area will need
-* to be manually synced anyway.
-*/
break;
}
 }
-- 
2.39.2




[PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-27 Thread Arnd Bergmann
From: Arnd Bergmann 

After a long discussion about adding SoC specific semantics for when
to flush caches in drivers/soc/ drivers, an approach we determined to be
fundamentally flawed[1], I volunteered to try to move that logic into
architecture-independent code and make all existing architectures do
the same thing.

As we had determined earlier, the behavior is wildly different across
architectures, but most of the differences come down to either bugs
(when required flushes are missing) or extra flushes that are harmless
but might hurt performance.

I finally found the time to come up with an implementation of this, which
starts by replacing every outlier with one of the three common options:

 1. architectures without speculative prefetching (hexagon, m68k,
openrisc, sh, sparc, and certain armv4 and xtensa implementations)
only flush their caches before a DMA, by cleaning write-back caches
(if any) before a DMA to the device, and by invalidating the caches
before a DMA from a device

 2. arc, microblaze, mips, nios2, sh and later xtensa now follow the
normal 32-bit arm model and invalidate their writeback caches
again after a DMA from the device, to remove stale cache lines
that got prefetched during the DMA. arc, csky and mips used to
invalidate buffers also before the bidirectional DMA, but this
is now skipped whenever we know it gets invalidated again
after the DMA.

 3. parisc, powerpc and riscv already flushed buffers before
a DMA_FROM_DEVICE, and these get moved to the arm64 behavior
that does the writeback before and invalidate after both
DMA_FROM_DEVICE and DMA_BIDIRECTIONAL in order to avoid the
problem of accidentally leaking stale data if the DMA does
not actually happen[2].

The last patch in the series replaces the architecture specific code
with a shared version that implements all three based on architecture
specific parameters that are almost always determined at compile time.

The difference between cases 1. and 2. is hardware specific, while between
2. and 3. we need to decide which semantics we want, but I explicitly
avoid this question in my series and leave it to be decided later.
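
To make the three variants concrete, the shared logic amounts to roughly
the following (sketch with made-up helper names, not the actual interface
added by the last patch):

	void dma_sync_for_device(phys_addr_t paddr, size_t size,
				 enum dma_data_direction dir)
	{
		switch (dir) {
		case DMA_TO_DEVICE:
			cache_clean(paddr, size);		/* 1, 2 and 3 */
			break;
		case DMA_FROM_DEVICE:
			if (clean_before_fromdevice())		/* 3 */
				cache_clean(paddr, size);
			else					/* 1 and 2 */
				cache_inv(paddr, size);
			break;
		case DMA_BIDIRECTIONAL:
			if (needs_post_dma_invalidate())	/* 2 and 3 */
				cache_clean(paddr, size);
			else					/* 1 */
				cache_wback_inv(paddr, size);
			break;
		default:
			break;
		}
	}

	void dma_sync_for_cpu(phys_addr_t paddr, size_t size,
			      enum dma_data_direction dir)
	{
		if (dir != DMA_TO_DEVICE && needs_post_dma_invalidate())
			cache_inv(paddr, size);			/* 2 and 3 */
	}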

Another difference that I do not address here is what cache invalidation
does for partial cache lines. On arm32, arm64 and powerpc, a partial
cache line always gets written back before invalidation in order to
ensure that data before or after the buffer is not discarded. On all
other architectures, the assumption is that cache lines are never shared
between a DMA buffer and data that is accessed by the CPU. If we end up
always writing back dirty cache lines before a DMA (option 3 above),
then this point becomes moot, otherwise we should probably address this
in a follow-up series to document one behavior or the other and implement
it consistently.
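
For reference, the partial-line-safe behavior amounts to something like
this (sketch with made-up helpers, assuming a line size of L1_CACHE_BYTES):

	void cache_inv_range_unaligned(unsigned long start, unsigned long end)
	{
		/* write back the partially covered lines at either end so
		 * that unrelated data sharing them is not thrown away */
		if (start & (L1_CACHE_BYTES - 1))
			cache_clean_line(start & ~(L1_CACHE_BYTES - 1));
		if (end & (L1_CACHE_BYTES - 1))
			cache_clean_line(end & ~(L1_CACHE_BYTES - 1));
		/* then invalidate the requested range */
		cache_inv_range(start, end);
	}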

Please review!

  Arnd

[1] 
https://lore.kernel.org/all/20221212115505.36770-1-prabhakar.mahadev-lad...@bp.renesas.com/
[2] https://lore.kernel.org/all/20220606152150.GA31568@willie-the-truck/

Arnd Bergmann (21):
  openrisc: dma-mapping: flush bidirectional mappings
  xtensa: dma-mapping: use normal cache invalidation rules
  sparc32: flush caches in dma_sync_*for_device
  microblaze: dma-mapping: skip extra DMA flushes
  powerpc: dma-mapping: split out cache operation logic
  powerpc: dma-mapping: minimize for_cpu flushing
  powerpc: dma-mapping: always clean cache in _for_device() op
  riscv: dma-mapping: only invalidate after DMA, not flush
  riscv: dma-mapping: skip invalidation before bidirectional DMA
  csky: dma-mapping: skip invalidating before DMA from device
  mips: dma-mapping: skip invalidating before bidirectional DMA
  mips: dma-mapping: split out cache operation logic
  arc: dma-mapping: skip invalidating before bidirectional DMA
  parisc: dma-mapping: use regular flush/invalidate ops
  ARM: dma-mapping: always invalidate WT caches before DMA
  ARM: dma-mapping: bring back dmac_{clean,inv}_range
  ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally
  ARM: drop SMP support for ARM11MPCore
  ARM: dma-mapping: use generic form of arch_sync_dma_* helpers
  ARM: dma-mapping: split out arch_dma_mark_clean() helper
  dma-mapping: replace custom code with generic implementation

 arch/arc/mm/dma.c  |  66 ++--
 arch/arm/Kconfig   |   4 +
 arch/arm/include/asm/cacheflush.h  |  21 +++
 arch/arm/include/asm/glue-cache.h  |   4 +
 arch/arm/mach-oxnas/Kconfig|   4 -
 arch/arm/mach-oxnas/Makefile   |   1 -
 arch/arm/mach-oxnas/headsmp.S  |  23 ---
 arch/arm/mach-oxnas/platsmp.c  |  96 ---
 arch/arm/mach-versatile/platsmp-realview.c |   4 -
 arch/arm/mm/Kconfig|  19 ---
 arch/arm/mm/cache-fa.S |   4 +-
 arch/arm/mm/cache-nop.S|   6 +
 arch/arm/mm/cache-v4.S |  13 +-