Re: [PATCH] serial: Use of_property_read_bool() for boolean properties

2023-03-31 Thread Timur Tabi
On Fri, Mar 10, 2023 at 8:48 AM Rob Herring  wrote:
>
> It is preferred to use typed property access functions (i.e.
> of_property_read_ functions) rather than low-level
> of_get_property/of_find_property functions for reading properties.
> Convert reading boolean properties to of_property_read_bool().
>
> Signed-off-by: Rob Herring 
> ---
>  drivers/tty/serial/imx.c   | 14 +-
>  drivers/tty/serial/mxs-auart.c |  4 ++--
>  drivers/tty/serial/ucc_uart.c  |  2 +-

ucc_uart.c portion:

Acked-by: Timur Tabi 


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-31 Thread Arnd Bergmann
On Fri, Mar 31, 2023, at 18:53, Catalin Marinas wrote:
> On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
>> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
>> cache line always gets written back before invalidation in order to
>> ensure that data before or after the buffer is not discarded. On all
>> other architectures, the assumption is cache lines are never shared
>> between DMA buffer and data that is accessed by the CPU.
>
> I don't think sharing the DMA buffer with other data is safe even with
> this clean+invalidate on the unaligned cache. Mapping the DMA buffer as
> FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be
> evicted and overwrite the device-written data. This sharing only works if
> the CPU guarantees not to dirty the corresponding cache line.
>
> I'm fine with removing this partial cache line hack from arm64 as it's
> not safe anyway. We'll see if any driver stops working. If there's some
> benign sharing (I wouldn't trust it), the cache cleaning prior to
> mapping and invalidate on unmap would not lose any data.

Ok, I'll add a patch to remove that bit from dcache_inval_poc
then. Do you know if any of the other callers of this function
rely on the writeback behavior, or is it safe to remove it for
all of them?

Note that before c50f11c6196f ("arm64: mm: Don't invalidate
FROM_DEVICE buffers at start of DMA transfer"), it made some
sense to write back partial cache lines before a DMA_FROM_DEVICE,
in order to allow sharing read-only data in them the same way as
on arm32 and powerpc. Doing the writeback in the sync_for_cpu
bit is of course always pointless.
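
To make the hazard concrete, this is the kind of layout that only works
under that guarantee (a hypothetical structure, not taken from any driver):

struct rx_desc {
	u8  dma_buf[32];	/* written by the device, DMA_FROM_DEVICE */
	u32 cpu_state;		/* updated by the CPU while DMA is in flight */
};				/* both fields can share one 64-byte cache line */

If the CPU dirties cpu_state while the device writes dma_buf, either an
eviction of the dirty line overwrites the device data, or the invalidate
on unmap discards the CPU's update; no ordering of the maintenance
operations makes both outcomes safe.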

   Arnd


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Arnd Bergmann
On Fri, Mar 31, 2023, at 17:12, Robin Murphy wrote:
> On 31/03/2023 3:00 pm, Arnd Bergmann wrote:
>> On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
>> 
>> To be on the safe side, I'd have to pass a flag into
>> arch_dma_mark_clean() about coherency, to let the arm
>> implementation still require the extra dcache flush
>> for coherent DMA, while ia64 can ignore that flag.
>
> Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA 
> write should be pretty much equivalent to a coherent write by another 
> CPU (or indeed the local CPU itself) - nothing says that it *couldn't* 
> dirty a line in a data cache above the level of unification, so in 
> general the assumption must be that, yes, if coherent DMA is writing 
> data intended to be executable, then it's going to want a Dcache clean 
> to PoU and an Icache invalidate to PoU before trying to execute it. By 
> comparison, a non-coherent DMA transfer will inherently have to 
> invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot 
> leave dirty data above the PoU, so only the Icache maintenance is 
> required in the executable case.

Ok, makes sense. I've already started reworking my patch for it.
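
Spelled out, the rule described above is roughly the following (the helper
names are made up for illustration; they are not the kernel's actual
primitives):

/* making DMA-written code executable on 32-bit Arm */
static void make_code_visible(void *buf, size_t len, bool coherent_dma)
{
	if (coherent_dma)
		clean_dcache_to_pou(buf, len);	/* data may sit dirty above
						   the PoU, push it down */
	/* noncoherent DMA already invalidated the Dcache to PoC on unmap,
	 * so it cannot have left dirty lines above the PoU */
	invalidate_icache_to_pou(buf, len);	/* refetch instructions */
}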

> (FWIW I believe the Armv8 IDC/DIC features can safely be considered 
> irrelevant to 32-bit kernels)
>
> I don't know a great deal about IA-64, but it appears to be using its 
> PG_arch_1 flag in a subtly different manner to Arm, namely to optimise 
> out the *Icache* maintenance. So if anything, it seems IA-64 is the 
> weirdo here (who'd have guessed?) where DMA manages to be *more* 
> coherent than the CPUs themselves :)

I checked this in the ia64 manual, and as far as I can tell, it originally
only had one cacheflush instruction that flushes the dcache and invalidates
the icache at the same time. So flush_icache_range() actually does
both and flush_dcache_page() instead just marks the page as dirty to
ensure flush_icache_range() does not get skipped after writing a
page from the kernel.

On later Itaniums, there is apparently a separate icache flush
instruction that gets used in flush_icache_range(), but that
still works for the DMA case that is allowed to skip the flush.
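
As a rough sketch of that ia64 scheme (simplified; PG_arch_1 set means
"icache coherent with dcache", error handling omitted):

void flush_dcache_page(struct page *page)
{
	clear_bit(PG_arch_1, &page->flags);	/* kernel wrote the page,
						   icache may now be stale */
}

void __sync_icache_dcache(pte_t pte)	/* on first executable mapping */
{
	struct page *page = pte_page(pte);
	unsigned long addr = (unsigned long)page_address(page);

	if (test_bit(PG_arch_1, &page->flags))
		return;			/* already coherent, skip the flush */
	flush_icache_range(addr, addr + PAGE_SIZE);	/* fc/fc.i loop */
	set_bit(PG_arch_1, &page->flags);
}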

> This is all now making me think we need some careful consideration of 
> whether the benefits of consolidating code outweigh the confusion of 
> conflating multiple different meanings of "clean" together...

The difference in usage of PG_dcache_clean/PG_dcache_dirty/PG_arch_1
across architectures is certainly big enough that we can't just
define a common arch_dma_mark_clean() across architectures, but
I think the idea of having a common entry point for
arch_dma_mark_clean() to be called from the dma-mapping code
to do something architecture specific after a DMA is clean still
makes sense.
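
Something with the shape of the existing ia64-only hook, i.e. roughly:

/* sketch: one call site in kernel/dma, a no-op unless the architecture
 * opts in and provides its own definition */
#ifdef CONFIG_ARCH_HAS_DMA_MARK_CLEAN
void arch_dma_mark_clean(phys_addr_t paddr, size_t size);
#else
static inline void arch_dma_mark_clean(phys_addr_t paddr, size_t size)
{
}
#endif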

 Arnd


Re: [PATCH 18/21] ARM: drop SMP support for ARM11MPCore

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:13:14PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The cache management operations for noncoherent DMA on ARMv6 work
> in two different ways:
> 
>  * When CONFIG_DMA_CACHE_RWFO is set, speculative prefetches on in-flight
>DMA buffers lead to data corruption when the prefetched data is written
>back on top of data from the device.
> 
>  * When CONFIG_DMA_CACHE_RWFO is disabled, a cache flush on one CPU
>is not seen by the other core(s), leading to inconsistent contents
>across the system.
> 
> As a consequence, neither configuration is actually safe to use in a
> general-purpose kernel that is used on both MPCore systems and ARM1176
> with prefetching enabled.

As the author of this terrible hack (created under duress ;))

Acked-by: Catalin Marinas 

IIRC, RWFO is working in combination with the cache operations. Because
the cache maintenance broadcast did not happen, we forced the cache
lines to migrate to a CPU via a write (for ownership) and doing the
cache maintenance on that CPU (that was the FROM_DEVICE case). For the
TO_DEVICE case, reading on a CPU would cause dirty lines on another CPU
to be evicted (or migrated as dirty to the current CPU IIRC) then the
cache maintenance to clean them to PoC on the local CPU.

But there's always a small window between read/write for ownership and
the actual cache maintenance which can cause a cache line to migrate to
other CPUs if they do speculative prefetches. At the time ARM11MPCore
was deemed safe-ish but I haven't followed what later implementations
actually did (luckily we fixed the architecture in ARMv7).
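
For the record, the RWFO trick was conceptually something like this
(illustrative C, not the original assembly; the *_line() helpers stand in
for the local-CPU cache operations):

static void rwfo_clean_line(unsigned long addr)		/* TO_DEVICE */
{
	(void)*(volatile u32 *)addr;	/* read for ownership: migrate any
					   dirty copy to this CPU */
	clean_dcache_line_to_poc(addr);	/* local clean now covers it */
}

static void rwfo_invalidate_line(unsigned long addr)	/* FROM_DEVICE */
{
	u32 v = *(volatile u32 *)addr;
	*(volatile u32 *)addr = v;	/* write for ownership */
	invalidate_dcache_line(addr);	/* local invalidate */
}

The window Catalin mentions sits between the ownership access and the
maintenance instruction: a speculative prefetch on another CPU in that gap
can re-migrate the line and defeat the local operation.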

-- 
Catalin


Re: [PATCH 00/21] dma-mapping: unify support for cache flushes

2023-03-31 Thread Catalin Marinas
On Mon, Mar 27, 2023 at 02:12:56PM +0200, Arnd Bergmann wrote:
> Another difference that I do not address here is what cache invalidation
> does for partial cache lines. On arm32, arm64 and powerpc, a partial
> cache line always gets written back before invalidation in order to
> ensure that data before or after the buffer is not discarded. On all
> other architectures, the assumption is cache lines are never shared
> between DMA buffer and data that is accessed by the CPU.

I don't think sharing the DMA buffer with other data is safe even with
this clean+invalidate on the unaligned cache. Mapping the DMA buffer as
FROM_DEVICE or BIDIRECTIONAL can cause the shared cache line to be
evicted and overwrite the device-written data. This sharing only works if
the CPU guarantees not to dirty the corresponding cache line.

I'm fine with removing this partial cache line hack from arm64 as it's
not safe anyway. We'll see if any driver stops working. If there's some
benign sharing (I wouldn't trust it), the cache cleaning prior to
mapping and invalidate on unmap would not lose any data.

-- 
Catalin


Re: [PATCH 1/2] pseries/smp: export the smt level in the SYS FS.

2023-03-31 Thread Michal Suchánek
Hello,

On Fri, Mar 31, 2023 at 05:39:04PM +0200, Laurent Dufour wrote:
> There is no SMT level recorded in the kernel nor in user space.
> Indeed there is no real constraint about that; mixed SMT levels are
> allowed and the system works fine this way.
> 
> However, when new CPUs are added, the kernel onlines all their threads,
> which leads to mixed SMT levels and confuses end users a bit.
> 
> To prevent this, export an SMT level from the kernel so user space
> applications, like the energy daemon, can read it to adjust their settings.
> Nothing is done beyond recording the value when an SMT value is written
> into the new sysfs entry. User space applications like ppc64_cpu should
> update the sysfs entry when changing the SMT level to keep the system
> consistent.
> 
> Suggested-by: Srikar Dronamraju 
> Signed-off-by: Laurent Dufour 
> ---
>  arch/powerpc/platforms/pseries/pseries.h |  3 ++
>  arch/powerpc/platforms/pseries/smp.c | 39 
>  2 files changed, 42 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/pseries/pseries.h 
> b/arch/powerpc/platforms/pseries/pseries.h
> index f8bce40ebd0c..af0a145af98f 100644
> --- a/arch/powerpc/platforms/pseries/pseries.h
> +++ b/arch/powerpc/platforms/pseries/pseries.h
> @@ -23,7 +23,9 @@ extern int pSeries_machine_check_exception(struct pt_regs 
> *regs);
>  extern long pseries_machine_check_realmode(struct pt_regs *regs);
>  void pSeries_machine_check_log_err(void);
>  
> +
>  #ifdef CONFIG_SMP
> +extern int pseries_smt;
>  extern void smp_init_pseries(void);
>  
>  /* Get state of physical CPU from query_cpu_stopped */
> @@ -34,6 +36,7 @@ int smp_query_cpu_stopped(unsigned int pcpu);
>  #define QCSS_HARDWARE_ERROR -1
>  #define QCSS_HARDWARE_BUSY -2
>  #else
> +#define pseries_smt 1

Is this really needed for anything?

The code using pseries_smt would not compile with a define, and would
be compiled only with SMP enabled anyway, so we should not need this.

Thanks

Michal

>  static inline void smp_init_pseries(void) { }
>  #endif
>  
> diff --git a/arch/powerpc/platforms/pseries/smp.c 
> b/arch/powerpc/platforms/pseries/smp.c
> index c597711ef20a..6c382922f8f3 100644
> --- a/arch/powerpc/platforms/pseries/smp.c
> +++ b/arch/powerpc/platforms/pseries/smp.c
> @@ -21,6 +21,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -45,6 +46,8 @@
>  
>  #include "pseries.h"
>  
> +int pseries_smt;
> +
>  /*
>   * The Primary thread of each non-boot processor was started from the OF 
> client
>   * interface by prom_hold_cpus and is spinning on secondary_hold_spinloop.
> @@ -280,3 +283,39 @@ void __init smp_init_pseries(void)
>  
>   pr_debug(" <- smp_init_pSeries()\n");
>  }
> +
> +static ssize_t pseries_smt_store(struct class *class,
> +  struct class_attribute *attr,
> +  const char *buf, size_t count)
> +{
> + int smt;
> +
> + if (kstrtou32(buf, 0, &smt) || !smt || smt > (u32) threads_per_core) {
> + pr_err("Invalid pseries_smt specified.\n");
> + return -EINVAL;
> + }
> +
> + pseries_smt = smt;
> +
> + return count;
> +}
> +
> +static ssize_t pseries_smt_show(struct class *class, struct class_attribute 
> *attr,
> +   char *buf)
> +{
> + return sysfs_emit(buf, "%d\n", pseries_smt);
> +}
> +
> +static CLASS_ATTR_RW(pseries_smt);
> +
> +static int __init pseries_smt_init(void)
> +{
> + int rc;
> +
> + pseries_smt = smt_enabled_at_boot;
> + rc = sysfs_create_file(kernel_kobj, &class_attr_pseries_smt.attr);
> + if (rc)
> + pr_err("Can't create pseries_smt sysfs/kernel entry.\n");
> + return rc;
> +}
> +machine_device_initcall(pseries, pseries_smt_init);
> -- 
> 2.40.0
> 


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Russell King (Oracle)
On Fri, Mar 31, 2023 at 04:06:37PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote:
> > On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
> >> From: Arnd Bergmann 
> >> 
> >> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
> >> PG_dcache_clean after a DMA, but no other architecture does this here.
> >
> > ... because this is an arm32 specific feature. Generically, it's
> > PG_arch_1, which is a page flag free for architecture use. On arm32
> > we decided to use this to mark whether we can skip dcache writebacks
> > when establishing a PTE - and thus it was decided to call it
> > PG_dcache_clean to reflect how arm32 decided to use that bit.
> >
> > This isn't just a DMA thing, there are other places that we update
> > the bit, such as flush_dcache_page() and copy_user_highpage().
> >
> > So thinking that the arm32 PG_dcache_clean is something for DMA is
> > actually wrong.
> >
> > Other architectures are free to do their own other optimisations
> > using that bit, and their implementations may be DMA-centric.
> 
> The flag is used the same way on most architectures, though some
> use the opposite polarity and call it PG_dcache_dirty. The only
> other architecture that uses it for DMA is ia64, with the difference
> being that this also marks the page as clean even for coherent
> DMA, not just when doing a flush as part of noncoherent DMA.
> 
> Based on Robin's reply it sounds that this is not a valid assumption
> on Arm, if a coherent DMA can target a dirty dcache line without
> cleaning it.

The other thing to note here is that PG_dcache_clean doesn't have
much meaning on modern CPUs with PIPT caches. For these,
cache_is_vipt_nonaliasing() will be true, and
cache_ops_need_broadcast() will be false.

Firstly, if we're using coherent DMA, then PG_dcache_clean is
intentionally not touched, because the data cache isn't cleaned
in any way by DMA operations.

flush_dcache_page() turns into a no-op apart from clearing
PG_dcache_clean if it was set.

__sync_icache_dcache() will do nothing for non-executable pages,
but will write-back a page that isn't marked PG_dcache_clean to
ensure that it is visible to the instruction stream. This is only
used to ensure that the instructions are visible to a newly
established executable mapping when e.g. the page has been DMA'd
in. The default state of PG_dcache_clean is zero on any new
allocation, so this has the effect of causing any executable page
to be flushed such that the instruction stream can see the
instructions, but only for the first establishment of the mapping.
That means that e.g. libc text pages don't keep getting flushed on
the start of every program.
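
Condensed to the PIPT/non-aliasing case being discussed, the logic is
roughly this (a simplified sketch of the arm32 behaviour, not the verbatim
function):

void __sync_icache_dcache(pte_t pteval)
{
	struct page *page;

	if (!pte_exec(pteval))
		return;				/* non-executable: no-op */

	page = pfn_to_page(pte_pfn(pteval));
	if (!test_and_set_bit(PG_dcache_clean, &page->flags))
		__flush_dcache_page(NULL, page);  /* first exec mapping only */
	__flush_icache_all();			/* expose the instructions */
}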

update_mmu_cache() isn't compiled, so its use of PG_dcache_clean
is irrelevant.

v6_copy_user_highpage_aliasing() won't be called because we're not
using an aliasing cache.

So, for modern ARM systems with coherent DMA, PG_dcache_clean only
serves the __sync_icache_dcache() optimisation.

ARM's use of this remains valid in this circumstance.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


[PATCH 0/2] Online new threads according to the current SMT level

2023-03-31 Thread Laurent Dufour
When a new CPU is added, the kernel activates all its threads. This
leads to a weird, but functional, result when adding a CPU on an SMT 4
system, for instance.

Here the newly added CPU 1 has 8 threads while the other one has 4 threads
active (system has been booted with the 'smt-enabled=4' kernel option):

ltcden3-lp12:~ # ppc64_cpu --info
Core   0:0*1*2*3*4 5 6 7
Core   1:8*9*   10*   11*   12*   13*   14*   15*

This mixed SMT level confuses end users, and some applications like
lparstat report wrong values.

There is no SMT level recorded in the kernel, nor in user space. Such a
level could be helpful when adding new CPUs or when optimizing energy
efficiency. This series introduces a new sysfs entry named 'pseries_smt' to
store the current SMT level.

The SMT level is provided on a best-effort basis; writing a new value into
that entry only records it in the kernel. This way, it can be used when
new CPUs are onlined, for instance. There is no real change to the CPU
setup when a value is written; no CPUs are onlined or offlined.

At boot time `pseries_smt` is loaded with smt_enabled_at_boot, which
contains the SMT level set at boot time, even if no kernel option is
specified.

The change is specific to pseries since CPU hot-plug is only provided for
this platform.

The second patch of this series implements the change to online only
the right number of threads when a new CPU is added.

Laurent Dufour (2):
  pseries/smp: export the smt level in the SYS FS.
  powerpc/pseries/cpuhp: respect current SMT when adding new CPU

 arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 ++---
 arch/powerpc/platforms/pseries/pseries.h |  3 ++
 arch/powerpc/platforms/pseries/smp.c | 39 
 3 files changed, 55 insertions(+), 5 deletions(-)

-- 
2.40.0



[PATCH 2/2] powerpc/pseries/cpuhp: respect current SMT when adding new CPU

2023-03-31 Thread Laurent Dufour
When a new CPU is added, the kernel activates all its threads. This
leads to a weird, but functional, result when adding a CPU on an SMT 4
system, for instance.

Here the newly added CPU 1 has 8 threads while the other one has 4 threads
active (system has been booted with the 'smt-enabled=4' kernel option):

ltcden3-lp12:~ # ppc64_cpu --info
Core   0:0*1*2*3*4 5 6 7
Core   1:8*9*   10*   11*   12*   13*   14*   15*

We rely on the newly introduced pseries_smt value, which should be updated
when changing the SMT level via ppc64_cpu --smt=x and at boot time using
the smt-enabled kernel option.

This way, on an LPAR running in SMT=4, a newly added CPU will run 4
threads, which is what an end user would expect.

Cc: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/hotplug-cpu.c | 18 +-
 arch/powerpc/platforms/pseries/smp.c |  2 +-
 2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index 1a3cb313976a..e623ed8649b3 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -382,7 +382,7 @@ static int dlpar_online_cpu(struct device_node *dn)
 {
int rc = 0;
unsigned int cpu;
-   int len, nthreads, i;
+   int len, nthreads, i, smt;
const __be32 *intserv;
u32 thread;
 
@@ -392,6 +392,11 @@ static int dlpar_online_cpu(struct device_node *dn)
 
nthreads = len / sizeof(u32);
 
+   smt = READ_ONCE(pseries_smt);
+   /* We should online at least one thread */
+   if (smt < 1)
+   smt = 1;
+
cpu_maps_update_begin();
for (i = 0; i < nthreads; i++) {
thread = be32_to_cpu(intserv[i]);
@@ -400,10 +405,13 @@ static int dlpar_online_cpu(struct device_node *dn)
continue;
cpu_maps_update_done();
find_and_update_cpu_nid(cpu);
-   rc = device_online(get_cpu_device(cpu));
-   if (rc) {
-   dlpar_offline_cpu(dn);
-   goto out;
+   /* Don't activate CPUs over the current SMT setting */
+   if (smt-- > 0) {
+   rc = device_online(get_cpu_device(cpu));
+   if (rc) {
+   dlpar_offline_cpu(dn);
+   goto out;
+   }
}
cpu_maps_update_begin();
 
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index 6c382922f8f3..ef8070651846 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -295,7 +295,7 @@ static ssize_t pseries_smt_store(struct class *class,
return -EINVAL;
}
 
-   pseries_smt = smt;
+   WRITE_ONCE(pseries_smt, smt);
 
return count;
 }
-- 
2.40.0



[PATCH 1/2] pseries/smp: export the smt level in the SYS FS.

2023-03-31 Thread Laurent Dufour
There is no SMT level recorded in the kernel nor in user space.
Indeed there is no real constraint about that; mixed SMT levels are
allowed and the system works fine this way.

However, when new CPUs are added, the kernel onlines all their threads,
which leads to mixed SMT levels and confuses end users a bit.

To prevent this, export an SMT level from the kernel so user space
applications, like the energy daemon, can read it to adjust their settings.
Nothing is done beyond recording the value when an SMT value is written
into the new sysfs entry. User space applications like ppc64_cpu should
update the sysfs entry when changing the SMT level to keep the system
consistent.

Suggested-by: Srikar Dronamraju 
Signed-off-by: Laurent Dufour 
---
 arch/powerpc/platforms/pseries/pseries.h |  3 ++
 arch/powerpc/platforms/pseries/smp.c | 39 
 2 files changed, 42 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/pseries.h 
b/arch/powerpc/platforms/pseries/pseries.h
index f8bce40ebd0c..af0a145af98f 100644
--- a/arch/powerpc/platforms/pseries/pseries.h
+++ b/arch/powerpc/platforms/pseries/pseries.h
@@ -23,7 +23,9 @@ extern int pSeries_machine_check_exception(struct pt_regs 
*regs);
 extern long pseries_machine_check_realmode(struct pt_regs *regs);
 void pSeries_machine_check_log_err(void);
 
+
 #ifdef CONFIG_SMP
+extern int pseries_smt;
 extern void smp_init_pseries(void);
 
 /* Get state of physical CPU from query_cpu_stopped */
@@ -34,6 +36,7 @@ int smp_query_cpu_stopped(unsigned int pcpu);
 #define QCSS_HARDWARE_ERROR -1
 #define QCSS_HARDWARE_BUSY -2
 #else
+#define pseries_smt 1
 static inline void smp_init_pseries(void) { }
 #endif
 
diff --git a/arch/powerpc/platforms/pseries/smp.c 
b/arch/powerpc/platforms/pseries/smp.c
index c597711ef20a..6c382922f8f3 100644
--- a/arch/powerpc/platforms/pseries/smp.c
+++ b/arch/powerpc/platforms/pseries/smp.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -45,6 +46,8 @@
 
 #include "pseries.h"
 
+int pseries_smt;
+
 /*
  * The Primary thread of each non-boot processor was started from the OF client
  * interface by prom_hold_cpus and is spinning on secondary_hold_spinloop.
@@ -280,3 +283,39 @@ void __init smp_init_pseries(void)
 
pr_debug(" <- smp_init_pSeries()\n");
 }
+
+static ssize_t pseries_smt_store(struct class *class,
+struct class_attribute *attr,
+const char *buf, size_t count)
+{
+   int smt;
+
+   if (kstrtou32(buf, 0, &smt) || !smt || smt > (u32) threads_per_core) {
+   pr_err("Invalid pseries_smt specified.\n");
+   return -EINVAL;
+   }
+
+   pseries_smt = smt;
+
+   return count;
+}
+
+static ssize_t pseries_smt_show(struct class *class, struct class_attribute 
*attr,
+ char *buf)
+{
+   return sysfs_emit(buf, "%d\n", pseries_smt);
+}
+
+static CLASS_ATTR_RW(pseries_smt);
+
+static int __init pseries_smt_init(void)
+{
+   int rc;
+
+   pseries_smt = smt_enabled_at_boot;
+   rc = sysfs_create_file(kernel_kobj, &class_attr_pseries_smt.attr);
+   if (rc)
+   pr_err("Can't create pseries_smt sysfs/kernel entry.\n");
+   return rc;
+}
+machine_device_initcall(pseries, pseries_smt_init);
-- 
2.40.0



Re: [PATCH kernel] KVM: PPC: Make KVM_CAP_IRQFD_RESAMPLE platform dependent

2023-03-31 Thread Paolo Bonzini
Queued, thanks.

Paolo




[PATCH v2 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2023-03-31 Thread Sean Anderson
smp_call_function_single disables IRQs when executing the callback. To
prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
This is already done by qman_update_cgr and qman_delete_cgr; fix the
other lockers.
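
For concreteness, the deadlock this prevents looks like this (hypothetical
interleaving):

/*
 * CPU0                                  CPU1
 * ----                                  ----
 * qm_congestion_task()                  qman_update_cgr_safe()
 *   spin_lock(&p->cgr_lock)               smp_call_function_single(cpu0, ...)
 *   <IPI arrives; IRQs were left on>
 *   callback runs in IRQ context:
 *     qman_update_cgr()
 *       spin_lock_irqsave(&p->cgr_lock)  <- already held by the
 *                                           interrupted context: deadlock
 */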

Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
Signed-off-by: Sean Anderson 
---

Changes in v2:
- Fix one additional call to spin_unlock

 drivers/soc/fsl/qbman/qman.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 739e4eee6b75..1bf1f1ea67f0 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;
 
-   spin_lock(&p->cgr_lock);
+   spin_lock_irq(&p->cgr_lock);
qm_mc_start(&p->p);
qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(&p->p, &mcr)) {
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, &p->cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }
 
@@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();
 
cgr->chan = p->config->channel;
-   spin_lock(&p->cgr_lock);
+   spin_lock_irq(&p->cgr_lock);
 
if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock(&p->cgr_lock);
+   spin_unlock_irq(&p->cgr_lock);
put_affine_portal();
return ret;
 }
-- 
2.35.1.1320.gc452695387.dirty



[PATCH v2 2/2] soc: fsl: qbman: Use raw spinlock for cgr_lock

2023-03-31 Thread Sean Anderson
cgr_lock may be locked with interrupts already disabled by
smp_call_function_single. As such, we must use a raw spinlock to avoid
problems on PREEMPT_RT kernels. Although this bug has existed for a
while, it was not apparent until commit ef2a8d5478b9 ("net: dpaa: Adjust
queue depth on rate change") which invokes smp_call_function_single via
qman_update_cgr_safe every time a link goes up or down.
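
Background for the PREEMPT_RT part: spinlock_t turns into a sleeping,
rtmutex-based lock there, and sleeping locks cannot be taken with IRQs
hard-disabled; raw_spinlock_t keeps the spinning semantics. A minimal
sketch of the pattern (names are hypothetical):

#include <linux/smp.h>
#include <linux/spinlock.h>

static DEFINE_RAW_SPINLOCK(example_lock);	/* still spins on PREEMPT_RT */

/* runs with IRQs hard-disabled as an smp_call_function_single callback,
 * so only a raw spinlock may be taken here on PREEMPT_RT */
static void example_ipi_callback(void *arg)
{
	raw_spin_lock(&example_lock);
	/* ... update shared state ... */
	raw_spin_unlock(&example_lock);
}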

Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
Reported-by: Vladimir Oltean 
Link: https://lore.kernel.org/all/20230323153935.nofnjucqjqnz34ej@skbuf/
Signed-off-by: Sean Anderson 
---

(no changes since v1)

 drivers/soc/fsl/qbman/qman.c | 22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
index 1bf1f1ea67f0..7a1558aba523 100644
--- a/drivers/soc/fsl/qbman/qman.c
+++ b/drivers/soc/fsl/qbman/qman.c
@@ -991,7 +991,7 @@ struct qman_portal {
/* linked-list of CSCN handlers. */
struct list_head cgr_cbs;
/* list lock */
-   spinlock_t cgr_lock;
+   raw_spinlock_t cgr_lock;
struct work_struct congestion_work;
struct work_struct mr_work;
char irqname[MAX_IRQNAME];
@@ -1281,7 +1281,7 @@ static int qman_create_portal(struct qman_portal *portal,
/* if the given mask is NULL, assume all CGRs can be seen */
qman_cgrs_fill(&portal->cgrs[0]);
INIT_LIST_HEAD(&portal->cgr_cbs);
-   spin_lock_init(&portal->cgr_lock);
+   raw_spin_lock_init(&portal->cgr_lock);
INIT_WORK(&portal->congestion_work, qm_congestion_task);
INIT_WORK(&portal->mr_work, qm_mr_process_task);
portal->bits = 0;
@@ -1456,11 +1456,11 @@ static void qm_congestion_task(struct work_struct *work)
union qm_mc_result *mcr;
struct qman_cgr *cgr;
 
-   spin_lock_irq(&p->cgr_lock);
+   raw_spin_lock_irq(&p->cgr_lock);
qm_mc_start(&p->p);
qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
if (!qm_mc_result_timeout(&p->p, &mcr)) {
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
return;
@@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
list_for_each_entry(cgr, &p->cgr_cbs, node)
if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
qman_p_irqsource_add(p, QM_PIRQ_CSCI);
 }
 
@@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
preempt_enable();
 
cgr->chan = p->config->channel;
-   spin_lock_irq(&p->cgr_lock);
+   raw_spin_lock_irq(&p->cgr_lock);
 
if (opts) {
struct qm_mcc_initcgr local_opts = *opts;
@@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
cgr->cb(p, cgr, 1);
 out:
-   spin_unlock_irq(&p->cgr_lock);
+   raw_spin_unlock_irq(&p->cgr_lock);
put_affine_portal();
return ret;
 }
@@ -2512,7 +2512,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
return -EINVAL;
 
memset(&local_opts, 0, sizeof(struct qm_mcc_initcgr));
-   spin_lock_irqsave(&p->cgr_lock, irqflags);
+   raw_spin_lock_irqsave(&p->cgr_lock, irqflags);
list_del(&cgr->node);
/*
 * If there are no other CGR objects for this CGRID in the list,
@@ -2537,7 +2537,7 @@ int qman_delete_cgr(struct qman_cgr *cgr)
/* add back to the list */
list_add(&cgr->node, &p->cgr_cbs);
 release_lock:
-   spin_unlock_irqrestore(&p->cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(&p->cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
@@ -2577,9 +2577,9 @@ static int qman_update_cgr(struct qman_cgr *cgr, struct 
qm_mcc_initcgr *opts)
if (!p)
return -EINVAL;
 
-   spin_lock_irqsave(&p->cgr_lock, irqflags);
+   raw_spin_lock_irqsave(&p->cgr_lock, irqflags);
ret = qm_modify_cgr(cgr, 0, opts);
-   spin_unlock_irqrestore(&p->cgr_lock, irqflags);
+   raw_spin_unlock_irqrestore(&p->cgr_lock, irqflags);
put_affine_portal();
return ret;
 }
-- 
2.35.1.1320.gc452695387.dirty



Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Robin Murphy

On 31/03/2023 3:00 pm, Arnd Bergmann wrote:
> On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
>> On 2023-03-27 13:13, Arnd Bergmann wrote:
>>>
>>> [ HELP NEEDED: can anyone confirm that it is a correct assumption
>>> on arm that a cache-coherent device writing to a page always results
>>> in it being in a PG_dcache_clean state like on ia64, or can a device
>>> write directly into the dcache?]
>>
>> In AMBA at least, if a snooping write hits in a cache then the data is
>> most likely going to get routed directly into that cache. If it has
>> write-back write-allocate attributes it could also land in any cache
>> along its normal path to RAM; it wouldn't have to go all the way.
>>
>> Hence all the fun we have where treating a coherent device as
>> non-coherent can still be almost as broken as the other way round :)
>
> Ok, thanks for the information. I'm still not sure whether this can
> result in the situation where PG_dcache_clean is wrong though.
>
> Specifically, the question is whether a DMA to a coherent buffer
> can end up in a dirty L1 dcache of one core and require to write
> back the dcache before invalidating the icache for that page.
>
> On ia64, this is not the case, the optimization here is to
> only flush the icache after a coherent DMA into an executable
> user page, while Arm only does this for noncoherent DMA but not
> coherent DMA.
>
> From your explanation it sounds like this might happen,
> even though that would mean that "coherent" DMA is slightly
> less coherent than it is elsewhere.
>
> To be on the safe side, I'd have to pass a flag into
> arch_dma_mark_clean() about coherency, to let the arm
> implementation still require the extra dcache flush
> for coherent DMA, while ia64 can ignore that flag.

Coherent DMA on Arm is assumed to be inner-shareable, so a coherent DMA
write should be pretty much equivalent to a coherent write by another
CPU (or indeed the local CPU itself) - nothing says that it *couldn't*
dirty a line in a data cache above the level of unification, so in
general the assumption must be that, yes, if coherent DMA is writing
data intended to be executable, then it's going to want a Dcache clean
to PoU and an Icache invalidate to PoU before trying to execute it. By
comparison, a non-coherent DMA transfer will inherently have to
invalidate the Dcache all the way to PoC in its dma_unmap, thus cannot
leave dirty data above the PoU, so only the Icache maintenance is
required in the executable case.

(FWIW I believe the Armv8 IDC/DIC features can safely be considered
irrelevant to 32-bit kernels)

I don't know a great deal about IA-64, but it appears to be using its
PG_arch_1 flag in a subtly different manner to Arm, namely to optimise
out the *Icache* maintenance. So if anything, it seems IA-64 is the
weirdo here (who'd have guessed?) where DMA manages to be *more*
coherent than the CPUs themselves :)

This is all now making me think we need some careful consideration of
whether the benefits of consolidating code outweigh the confusion of
conflating multiple different meanings of "clean" together...

Thanks,
Robin.


Re: [PATCH] powerpc/pseries/cpuhp: respect current SMT when adding new CPU

2023-03-31 Thread Laurent Dufour
On 30/03/2023 18:19:38, Michal Suchánek wrote:
> On Thu, Mar 30, 2023 at 05:51:57PM +0200, Laurent Dufour wrote:
>> On 13/02/2023 16:40:50, Nathan Lynch wrote:
>>> Michal Suchánek  writes:
 On Mon, Feb 13, 2023 at 08:46:50AM -0600, Nathan Lynch wrote:
> Laurent Dufour  writes:
>> When a new CPU is added, the kernel is activating all its threads. This
>> leads to weird, but functional, result when adding CPU on a SMT 4 system
>> for instance.
>>
>> Here the newly added CPU 1 has 8 threads while the other one has 4 
>> threads
>> active (system has been booted with the 'smt-enabled=4' kernel option):
>>
>> ltcden3-lp12:~ # ppc64_cpu --info
>> Core   0:0*1*2*3*4 5 6 7
>> Core   1:8*9*   10*   11*   12*   13*   14*   15*
>>
>> There is no SMT value in the kernel. It is possible to run an unbalanced
>> LPAR with 2 threads on one CPU, 4 on another, and 5 on the last one.
> 
>> Indeed, that's not so easy. There are multiple ways for the SMT level to be
>> impacted:
>>  - smt-enabled kernel option
>>  - smtstate systemctl service (if activated), saving SMT level at shutdown
>> time to restore it a boot time
>>  - pseries-energyd daemon (if activated) could turn off threads
>>  - ppc64_cpu --smt=x user command
>>  - sysfs direct writing to turn off/on specific threads.
>>
There is no SMT level saved, on "disk" or in the kernel, and any of these
options can interact in parallel. So from the user space point of view, the
best we could do is look at the current SMT values; there could be
multiple values in the case of a mixed SMT state, so pick one value and
apply it.
>>
>> Extending the drmgr's hook is still valid, and I sent a patch series on the
>> powerpc-utils mailing list to achieve that. However, changing the SMT level
in that hook means that newly added CPUs will first be turned on, and there
is a window where these threads could be seen as active. Not a big deal, but
not turning on these extra threads looks better to me.
> 
> Which means
> 
> 1) add an option to not online hotplugged CPUs by default

After discussing this with Srikar, it turns out that exposing the smt-enabled
value set at boot time (or not) in sysfs, and updating it when the SMT level
is changed using ppc64_cpu, will be better. This will also allow the energy
daemon to take this value into account.

> 2) when a tool that wants to manage CPU onlining is active it can set
> the option so that no threads are onlined automatically, and online the
> desired threads

When new CPUs are added, the kernel will automatically online the right
number of threads based on that recorded SMT level.

> 
> 3) when no such tool is active the default should be to online all
> threads to preserve compatibility with existing behavior

I don't think we should keep the existing behavior: customers are confused,
and some user space tools like lparstat have difficulty handling mixed
SMT levels.

I'll submit a new series exposing the SMT level and propose a patch for
ppc64_cpu too.

> 
>> That being said, I can't see any benefit of a user space implementation
>> compared to the option I'm proposing in that patch.
> 
> The userspace implementation can implement arbitrarily complex policy,
> that's not something that belongs into the kernel.
> 
> Thanks
> 
> Michal



Re: [PATCH 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2023-03-31 Thread Sean Anderson
On 3/31/23 06:58, Camelia Alexandra Groza wrote:
>> -Original Message-
>> From: Sean Anderson 
>> Sent: Monday, March 27, 2023 22:29
>> To: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm-
>> ker...@lists.infradead.org
>> Cc: Scott Wood ; linux-ker...@vger.kernel.org; David S .
>> Miller ; Claudiu Manoil ;
>> Roy Pledge ; Vladimir Oltean
>> ; Camelia Alexandra Groza
>> ; Sean Anderson 
>> Subject: [PATCH 1/2] soc: fsl: qbman: Always disable interrupts when taking
>> cgr_lock
>> 
>> smp_call_function_single disables IRQs when executing the callback. To
>> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
>> This is already done by qman_update_cgr and qman_delete_cgr; fix the
>> other lockers.
>> 
>> Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
>> Signed-off-by: Sean Anderson 
>> ---
>> 
>>  drivers/soc/fsl/qbman/qman.c | 8 
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>> 
>> diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
>> index 739e4eee6b75..ff870ca07596 100644
>> --- a/drivers/soc/fsl/qbman/qman.c
>> +++ b/drivers/soc/fsl/qbman/qman.c
>> @@ -1456,7 +1456,7 @@ static void qm_congestion_task(struct work_struct
>> *work)
>>  union qm_mc_result *mcr;
>>  struct qman_cgr *cgr;
>> 
>> -spin_lock(&p->cgr_lock);
>> +spin_lock_irq(&p->cgr_lock);
>>  qm_mc_start(&p->p);
>>  qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
>>  if (!qm_mc_result_timeout(&p->p, &mcr)) {
>> @@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct
>> *work)
>>  list_for_each_entry(cgr, &p->cgr_cbs, node)
>>  if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
>>  cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
>> -spin_unlock(&p->cgr_lock);
>> +spin_unlock_irq(&p->cgr_lock);
>>  qman_p_irqsource_add(p, QM_PIRQ_CSCI);
>>  }
> 
> There is one more spin_unlock call in qm_congestion_task on the error path 
> that needs updating:
> 
> if (!qm_mc_result_timeout(&p->p, &mcr)) {
>   spin_unlock(&p->cgr_lock);

Will fix. Thanks.

--Sean

> 
>> @@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32
>> flags,
>>  preempt_enable();
>> 
>>  cgr->chan = p->config->channel;
>> -spin_lock(&p->cgr_lock);
>> +spin_lock_irq(&p->cgr_lock);
>> 
>>  if (opts) {
>>  struct qm_mcc_initcgr local_opts = *opts;
>> @@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32
>> flags,
>>  qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
>>  cgr->cb(p, cgr, 1);
>>  out:
>> -spin_unlock(&p->cgr_lock);
>> +spin_unlock_irq(&p->cgr_lock);
>>  put_affine_portal();
>>  return ret;
>>  }
>> --
>> 2.35.1.1320.gc452695387.dirty
> 



Re: [PATCH] ASoC: fsl_sai: Use physical format width

2023-03-31 Thread Mark Brown
On Fri, Mar 31, 2023 at 02:26:33PM +, Emil Abildgaard Svendsen wrote:
> On 3/31/23 04:55, Shengjiu Wang wrote:

> > There are different requirements for this slot width. Some need physical
> > width,
> > Some need format width. We need to be careful about change here.

> I might be wrong, but shouldn't physical width always correspond to what
> can be physically measured? If you need the slot width to be the same as
> the format width, you use a format with matching widths like
> SNDRV_PCM_FORMAT_S24_3LE?

The point is more that things are likely to be relying on the
current behaviour, for example CODECs that don't actually support
24 bit audio due to a power of two requirement but cope fine when
you play 24 bit audio in a 32 bit timeslot.  This creates issues
changing things even if the new behaviour is in some sense
better.  This is one of the unfortunate consequences of DT.




Re: [PATCH] ASoC: fsl_sai: Use physical format width

2023-03-31 Thread Emil Abildgaard Svendsen
On 3/31/23 04:55, Shengjiu Wang wrote:
> On Thu, Mar 30, 2023 at 4:30 PM Emil Abildgaard Svendsen <
> e...@bang-olufsen.dk> wrote:
> 
>> Slot width should follow the physical width of the format instead of the
>> data width.
>>
>> This is needed for formats like SNDRV_PCM_FMTBIT_S24_LE where physical
>> width is 32 and data width is 24. By using the physical width, data
>> won't get misaligned.
> 
> 
> There are different requirements for this slot width. Some need physical
> width,
> Some need format width. We need to be careful about change here.

I might be wrong, but shouldn't physical width always correspond to what
can be physically measured? If you need the slot width to be the same as
the format width, you use a format with matching widths like
SNDRV_PCM_FORMAT_S24_3LE?

> 
> Actually there is .set_tdm_slot API for slot specific setting, please use
> this API.

But if you're using the generic sound cards, you need to add it to the
DTS. Thus, if you want a static TDM slot width, all is fine. But if you
want to change the slot width at runtime, then it isn't enough.
Unless I missed something.
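
For completeness, the runtime variant in a non-generic machine driver would
be something like this (a sketch; the 2-slot/32-bit geometry and the
function names here are just an example):

static int example_hw_params(struct snd_pcm_substream *substream,
			     struct snd_pcm_hw_params *params)
{
	struct snd_soc_pcm_runtime *rtd = substream->private_data;
	struct snd_soc_dai *cpu_dai = asoc_rtd_to_cpu(rtd, 0);

	/* tx_mask, rx_mask, slots, slot_width: pin the TDM geometry
	 * explicitly instead of deriving it from the stream format */
	return snd_soc_dai_set_tdm_slot(cpu_dai, 0x3, 0x3, 2, 32);
}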

Kind regards,
Emil Svendsen 

> 
> best regards
> wang shengjiu
> 
>>
>> Signed-off-by: Emil Svendsen 
>> ---
>>  sound/soc/fsl/fsl_sai.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/sound/soc/fsl/fsl_sai.c b/sound/soc/fsl/fsl_sai.c
>> index 939c6bdd22c4..213e2d462076 100644
>> --- a/sound/soc/fsl/fsl_sai.c
>> +++ b/sound/soc/fsl/fsl_sai.c
>> @@ -519,13 +519,13 @@ static int fsl_sai_hw_params(struct
>> snd_pcm_substream *substream,
>> unsigned int channels = params_channels(params);
>> struct snd_dmaengine_dai_dma_data *dma_params;
>> struct fsl_sai_dl_cfg *dl_cfg = sai->dl_cfg;
>> +   u32 slot_width = params_physical_width(params);
>> u32 word_width = params_width(params);
>> int trce_mask = 0, dl_cfg_idx = 0;
>> int dl_cfg_cnt = sai->dl_cfg_cnt;
>> u32 dl_type = FSL_SAI_DL_I2S;
>> u32 val_cr4 = 0, val_cr5 = 0;
>> u32 slots = (channels == 1) ? 2 : channels;
>> -   u32 slot_width = word_width;
>> int adir = tx ? RX : TX;
>> u32 pins, bclk;
>> u32 watermark;
>> --
>> 2.34.1
>>
> 



Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Arnd Bergmann
On Mon, Mar 27, 2023, at 17:01, Russell King (Oracle) wrote:
> On Mon, Mar 27, 2023 at 02:13:16PM +0200, Arnd Bergmann wrote:
>> From: Arnd Bergmann 
>> 
>> The arm version of the arch_sync_dma_for_cpu() function annotates pages as
>> PG_dcache_clean after a DMA, but no other architecture does this here.
>
> ... because this is an arm32 specific feature. Generically, it's
> PG_arch_1, which is a page flag free for architecture use. On arm32
> we decided to use this to mark whether we can skip dcache writebacks
> when establishing a PTE - and thus it was decided to call it
> PG_dcache_clean to reflect how arm32 decided to use that bit.
>
> This isn't just a DMA thing, there are other places that we update
> the bit, such as flush_dcache_page() and copy_user_highpage().
>
> So thinking that the arm32 PG_dcache_clean is something for DMA is
> actually wrong.
>
> Other architectures are free to do their own other optimisations
> using that bit, and their implementations may be DMA-centric.

The flag is used the same way on most architectures, though some
use the opposite polarity and call it PG_dcache_dirty. The only
other architecture that uses it for DMA is ia64, with the difference
being that this also marks the page as clean even for coherent
DMA, not just when doing a flush as part of noncoherent DMA.

Based on Robin's reply it sounds that this is not a valid assumption
on Arm, if a coherent DMA can target a dirty dcache line without
cleaning it.

 Arnd


Re: [PATCH 20/21] ARM: dma-mapping: split out arch_dma_mark_clean() helper

2023-03-31 Thread Arnd Bergmann
On Mon, Mar 27, 2023, at 14:48, Robin Murphy wrote:
> On 2023-03-27 13:13, Arnd Bergmann wrote:
>> 
>> [ HELP NEEDED: can anyone confirm that it is a correct assumption
>>on arm that a cache-coherent device writing to a page always results
>>in it being in a PG_dcache_clean state like on ia64, or can a device
>>write directly into the dcache?]
>
> In AMBA at least, if a snooping write hits in a cache then the data is 
> most likely going to get routed directly into that cache. If it has 
> write-back write-allocate attributes it could also land in any cache 
> along its normal path to RAM; it wouldn't have to go all the way.
>
> Hence all the fun we have where treating a coherent device as 
> non-coherent can still be almost as broken as the other way round :)

Ok, thanks for the information. I'm still not sure whether this can
result in the situation where PG_dcache_clean is wrong though.

Specifically, the question is whether a DMA to a coherent buffer
can end up in a dirty L1 dcache of one core and require to write
back the dcache before invalidating the icache for that page.

On ia64, this is not the case, the optimization here is to
only flush the icache after a coherent DMA into an executable
user page, while Arm only does this for noncoherent DMA but not
coherent DMA.

From your explanation it sounds like this might happen,
even though that would mean that "coherent" DMA is slightly
less coherent than it is elsewhere.

To be on the safe side, I'd have to pass a flag into
arch_dma_mark_clean() about coherency, to let the arm
implementation still require the extra dcache flush
for coherent DMA, while ia64 can ignore that flag.

Arnd


Re: [PATCH 21/21] dma-mapping: replace custom code with generic implementation

2023-03-31 Thread Arnd Bergmann
On Tue, Mar 28, 2023, at 00:25, Christoph Hellwig wrote:
>> +static inline void arch_dma_cache_wback(phys_addr_t paddr, size_t size)
>>  {
>> +dma_cache_wback(paddr, size);
>> +}
>>  
>> +static inline void arch_dma_cache_inv(phys_addr_t paddr, size_t size)
>> +{
>> +dma_cache_inv(paddr, size);
>>  }
>
>> +static inline void arch_dma_cache_wback_inv(phys_addr_t paddr, size_t size)
>>  {
>> +dma_cache_wback_inv(paddr, size);
>> +}
>
> These are the only calls for the three functions for each of the
> involved functions.  So I'd rather rename the low-level symbols
> (and drop the pointless exports for two of them) rather than adding
> these wrappers.
>
> The same is probably true for many other architectures.

Ok, done that now.

>> +static inline bool arch_sync_dma_clean_before_fromdevice(void)
>> +{
>> +return false;
>> +}
>>  
>> +static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
>> +{
>> +return true;
>>  }
>
> Is there a way to cut down on this boilerplate code by just having
> sane default, and Kconfig options to override them if they are not
> runtime decisions?

I've changed arch_sync_dma_clean_before_fromdevice() to a
Kconfig symbol now, as this is never a runtime decision.

For arch_sync_dma_cpu_needs_post_dma_flush(), I have this
version now in common code, which lets mips and arm have
their own logic and has the same effect elsewhere:

+#ifndef arch_sync_dma_cpu_needs_post_dma_flush
+static inline bool arch_sync_dma_cpu_needs_post_dma_flush(void)
+{
+   return IS_ENABLED(CONFIG_ARCH_HAS_SYNC_DMA_FOR_CPU);
+}
+#endif

>> +#include 
>
> I can't really say I like the #include version here despite your
> rationale in the commit log.  I can probably live with it if you
> think it is absolutely worth it, but I'm really not in favor of it.
>
>> +config ARCH_DMA_MARK_DCACHE_CLEAN
>> +def_bool y
>
> What do we need this symbol for?  Unless I'm missing something it is
> always enable for arm32, and only used in arm32 code.

This was left over from an earlier draft and accidentally duplicates
the thing that I have in the Arm version for the existing
ARCH_HAS_DMA_MARK_CLEAN. I dropped this one and the
generic copy of the arch_dma_mark_dcache_clean() function
now, but still need to revisit the arm version, as it sounds
like it has slightly different semantics from the ia64 version.

 Arnd


Re: [PATCH 17/21] ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally

2023-03-31 Thread Arnd Bergmann
On Fri, Mar 31, 2023, at 11:10, Linus Walleij wrote:
> On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann  wrote:
>
>> From: Arnd Bergmann 
>>
>> The arm specific iommu code in dma-mapping.c uses the page+offset based
>> __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
>> phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
>> wrappers around the.
>
> Broken sentence?

I've changed s/the/them/ now, at least I think that's what I meant to
write in the first place.

>> In order to be able to move the latter part set of functions into
>> common code, change the iommu implementation to use them directly
>> and remove the internal ones as a separate interface.
>>
>> As page+offset and phys_address are equivalent, but are used in
>> different parts of the code here, this allows removing some of
>> the conversion but adds them elsewhere.
>>
>> Signed-off-by: Arnd Bergmann 
>
> Looks good to me, took me some time to verify and understand
> the open-coded version of PFN_UP() and this refactoring alone
> makes the patch highly valuable.
> Reviewed-by: Linus Walleij 

Thanks!

Arnd


Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Arnd Bergmann
On Fri, Mar 31, 2023, at 13:08, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
>> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
>> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> >> > From: Arnd Bergmann 
>> >> > 
>> >> > Most ARM CPUs can have write-back caches and that require
>> >> > cache management to be done in the dma_sync_*_for_device()
>> >> > operation. This is typically done in both writeback and
>> >> > writethrough mode.
>> >> > 
>> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> >> > (arm920t, arm940t) implementations are the exception here,
>> >> > and only do the cache management after the DMA is complete,
>> >> > in the dma_sync_*_for_cpu() operation.
>> >> > 
>> >> > Change this for consistency with the other platforms. This
>> >> > should have no user visible effect.
>> >> 
>> >> NAK...
>> >> 
>> >> The reason we do cache management _after_ is to ensure that there
>> >> is no stale data. The kernel _has_ (at the very least in the past)
>> >> performed DMA to data structures that are embedded within other
>> >> data structures, resulting in cache lines being shared. If one of
>> >> those cache lines is touched while DMA is progressing, then we
>> >> must to cache management _after_ the DMA operation has completed.
>> >> Doing it before is no good.
>> 
>> What I'm trying to address here is the inconsistency between
>> implementations. If we decide that we always want to invalidate
>> after FROM_DEVICE, I can do that as part of the series, but then
>> I have to change most of the other arm implementations.
>
> Why?
>
> First thing to say is that DMA to buffers where the cache lines are
> shared with data the CPU may be accessing need to be outlawed - they
> are a recipe for data corruption - always have been. Sadly, some folk
> don't see it that way because of a past "x86 just works and we demand
> that all architectures behave like x86!" attitude. The SCSI sense
> buffer has historically been a big culprit for that.

I think that part is pretty much agreed by everyone; the difference
between architectures is to what extent they try to work around
drivers that get it wrong.

> For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
> because we want to ensure that the DMA'd data is properly readable upon
> completion of the DMA. If overlapping cache lines have been touched
> while DMA is progressing, and we invalidate before DMA, then the cache
> will contain stale data that will remain in the cache after DMA has
> completed. Invalidating a WT cache does not destroy any data, so is
> safe to do. So the safest approach is to invalidate after DMA has
> completed in this instance.

Does that mean you take back your NAK on this patch then?

> For WB, FROM_DEVICE, we have the problem of dirty cache lines which
> we have to get rid of. For the overlapping cache lines, we have to
> clean those before DMA begins to ensure that data written to the
> non-DMA-buffer part is preserved. All other cache lines need to be
> invalidated before DMA begins to ensure that writebacks do not
> corrupt data from the device. Hence why it's different.

I don't see how WB and WT caches being different implies that we
should give extra guarantees to (broken) drivers on WT caches that
other architectures do not give. Always doing it first in the absence
of prefetching avoids a special case in the generic implementation
and makes the driver interface on Arm/sparc32/xtensa WT caches
no different from what everything else provides.

The writeback before DMA_FROM_DEVICE is another issue that we
have to address at some point, as there are clearly incompatible
expectations here. It makes no sense that a device driver can
rely on the entire buffer to be written back on a 64-bit arm kernel
but not on a 32-bit kernel.

> And hence why the ARM implementation is based around buffer ownership.
> And hence why they're called dma_map_area()/dma_unmap_area() rather
> than the cache operations themselves. This is an intentional change,
> one that was done when ARMv6 came along.

The bit that has changed in the meantime though is that the buffer
ownership interfaces has moved up in the stack and is now handled
mostly in the common kernel/dma/*.c that multiplexes between the
direct/iommu/swiotlb dma_map_ops, except for the bit about
noncoherent devices. Right now, we have 37 implementations that
are mostly identical, and all the differences are either bugs
or disagreements about the API guarantees but not related to
architecture specific requirements.

>> OTOH, most machines that are actually in use today (armv6+,
>> powerpc, later mips, microblaze, riscv, nios2) also have to
>> deal with speculative accesses, so they end up having to
>> invalidate or flush both before and after a DMA_FROM_DEVICE
>> and DMA_BIDIRECTIONAL.
>
> Again, these are implementation details of the cache, and this is
> precisely why having the map/unmap interface is so much better than
> having generic code explicitly call "clean" and "invalidate"
> interfaces into arch code.

Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Russell King (Oracle)
On Fri, Mar 31, 2023 at 12:38:45PM +0200, Arnd Bergmann wrote:
> On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> > On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> >> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> >> > From: Arnd Bergmann 
> >> > 
> >> > Most ARM CPUs can have write-back caches and that require
> >> > cache management to be done in the dma_sync_*_for_device()
> >> > operation. This is typically done in both writeback and
> >> > writethrough mode.
> >> > 
> >> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> >> > (arm920t, arm940t) implementations are the exception here,
> >> > and only do the cache management after the DMA is complete,
> >> > in the dma_sync_*_for_cpu() operation.
> >> > 
> >> > Change this for consistency with the other platforms. This
> >> > should have no user visible effect.
> >> 
> >> NAK...
> >> 
> >> The reason we do cache management _after_ is to ensure that there
> >> is no stale data. The kernel _has_ (at the very least in the past)
> >> performed DMA to data structures that are embedded within other
> >> data structures, resulting in cache lines being shared. If one of
> >> those cache lines is touched while DMA is progressing, then we
> >> must to cache management _after_ the DMA operation has completed.
> >> Doing it before is no good.
> 
> What I'm trying to address here is the inconsistency between
> implementations. If we decide that we always want to invalidate
> after FROM_DEVICE, I can do that as part of the series, but then
> I have to change most of the other arm implementations.

Why?

First thing to say is that DMA to buffers where the cache lines are
shared with data the CPU may be accessing need to be outlawed - they
are a recipe for data corruption - always have been. Sadly, some folk
don't see it that way because of a past "x86 just works and we demand
that all architectures behave like x86!" attitude. The SCSI sense
buffer has historically been a big culprit for that.


For WT, FROM_DEVICE, invalidating after DMA is the right thing to do,
because we want to ensure that the DMA'd data is properly readable upon
completion of the DMA. If overlapping cache lines have been touched
while DMA is progressing, and we invalidate before DMA, then the cache
will contain stale data that will remain in the cache after DMA has
completed. Invalidating a WT cache does not destroy any data, so is
safe to do. So the safest approach is to invalidate after DMA has
completed in this instance.


For WB, FROM_DEVICE, we have the problem of dirty cache lines which
we have to get rid of. For the overlapping cache lines, we have to
clean those before DMA begins to ensure that data written to the
non-DMA-buffer part is preserved. All other cache lines need to be
invalidated before DMA begins to ensure that writebacks do not
corrupt data from the device. Hence why it's different.
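
Side by side, the two DMA_FROM_DEVICE policies described above look
roughly like this (illustrative helpers, not the actual arm
implementation):

/* write-through: invalidation destroys no data, so after DMA is safe */
static void wt_unmap_from_device(void *buf, size_t len)
{
	dcache_invalidate_range(buf, len);  /* discard stale clean copies */
}

/* write-back: dirty lines must be dealt with before the device writes,
 * or a later writeback corrupts the DMA'd data */
static void wb_map_from_device(void *buf, size_t len)
{
	dcache_clean_range(buf, len);       /* preserve overlapping data */
	dcache_invalidate_range(buf, len);  /* no writebacks during DMA */
}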


And hence why the ARM implementation is based around buffer ownership.
And hence why the functions are called dma_map_area()/dma_unmap_area()
rather than being named after the cache operations they perform. This
is an intentional change, one that was done when ARMv6 came along.

> OTOH, most machines that are actually in use today (armv6+,
> powerpc, later mips, microblaze, riscv, nios2) also have to
> deal with speculative accesses, so they end up having to
> invalidate or flush both before and after a DMA_FROM_DEVICE
> and DMA_BIDIRECTIONAL.

Again, these are implementation details of the cache, and this is
precisely why having the map/unmap interface is so much better than
having generic code explicitly call "clean" and "invalidate"
interfaces into arch code.

If we treat everything as a speculative cache, then we're doing
needless extra work for those caches that aren't speculative. So,
ARM would have to step through every cache line for every DMA
buffer at 32-byte intervals performing cache maintenance whether
the cache is speculative or not. That is expensive, and hurts
performance.
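
For the record, that per-line walk is essentially this (a sketch; the
MCR below is the ARM "invalidate D-cache line by MVA" coprocessor op
that the existing cache functions use):

	/* Sketch: invalidate a DMA buffer one 32-byte line at a time. */
	static void dma_inv_range_sketch(unsigned long start, unsigned long end)
	{
		unsigned long addr;

		for (addr = start & ~31UL; addr < end; addr += 32)
			asm volatile("mcr p15, 0, %0, c7, c6, 1"	/* DCIMVAC */
				     : : "r" (addr) : "memory");
	}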

I put a lot of thought into this when I updated the ARM DMA
implementation when we started seeing these different cache types
particularly when ARMv6 came along. I really don't want that work
wrecked.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


RE: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread David Laight
From: Arnd Bergmann
> Sent: 31 March 2023 11:39
...
> Most architectures that have write-through caches (m68k,
> microblaze) or write-back caches but no speculation (all other
> armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa)
> only invalidate before DMA but not after.
> 
> OTOH, most machines that are actually in use today (armv6+,
> powerpc, later mips, microblaze, riscv, nios2) also have to
> deal with speculative accesses, so they end up having to
> invalidate or flush both before and after a DMA_FROM_DEVICE
> and DMA_BIDIRECTIONAL.

nios2 is a simple in-order cpu with a short pipeline
(it is a soft-cpu made from normal fpga logic elements).
Definitely doesn't do speculative accesses.
OTOH anyone trying to run Linux on it needs their head examined.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)



RE: [PATCH 1/2] soc: fsl: qbman: Always disable interrupts when taking cgr_lock

2023-03-31 Thread Camelia Alexandra Groza
> -Original Message-
> From: Sean Anderson 
> Sent: Monday, March 27, 2023 22:29
> To: Leo Li ; linuxppc-dev@lists.ozlabs.org; linux-arm-
> ker...@lists.infradead.org
> Cc: Scott Wood ; linux-ker...@vger.kernel.org; David S .
> Miller ; Claudiu Manoil ;
> Roy Pledge ; Vladimir Oltean
> ; Camelia Alexandra Groza
> ; Sean Anderson 
> Subject: [PATCH 1/2] soc: fsl: qbman: Always disable interrupts when taking
> cgr_lock
> 
> smp_call_function_single disables IRQs when executing the callback. To
> prevent deadlocks, we must disable IRQs when taking cgr_lock elsewhere.
> This is already done by qman_update_cgr and qman_delete_cgr; fix the
> other lockers.
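
(To make the deadlock concrete - an illustration, not part of the patch:

	/*
	 * CPU0, IRQs enabled:             CPU1:
	 *   spin_lock(&p->cgr_lock);        smp_call_function_single(0, cb, ...);
	 *   <IPI interrupts CPU0 here>
	 *   cb() runs with IRQs disabled:
	 *     spin_lock(&p->cgr_lock);  <- spins forever, since the holder
	 *                                  is the task the IPI interrupted
	 */

Disabling IRQs while cgr_lock is held closes this window.)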
> 
> Fixes: c535e923bb97 ("soc/fsl: Introduce DPAA 1.x QMan device driver")
> Signed-off-by: Sean Anderson 
> ---
> 
>  drivers/soc/fsl/qbman/qman.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/soc/fsl/qbman/qman.c b/drivers/soc/fsl/qbman/qman.c
> index 739e4eee6b75..ff870ca07596 100644
> --- a/drivers/soc/fsl/qbman/qman.c
> +++ b/drivers/soc/fsl/qbman/qman.c
> @@ -1456,7 +1456,7 @@ static void qm_congestion_task(struct work_struct *work)
>   union qm_mc_result *mcr;
>   struct qman_cgr *cgr;
> 
> - spin_lock(&p->cgr_lock);
> + spin_lock_irq(&p->cgr_lock);
>   qm_mc_start(&p->p);
>   qm_mc_commit(&p->p, QM_MCC_VERB_QUERYCONGESTION);
>   if (!qm_mc_result_timeout(&p->p, &mcr)) {
> @@ -1476,7 +1476,7 @@ static void qm_congestion_task(struct work_struct *work)
>   list_for_each_entry(cgr, &p->cgr_cbs, node)
>   if (cgr->cb && qman_cgrs_get(&c, cgr->cgrid))
>   cgr->cb(p, cgr, qman_cgrs_get(&c, cgr->cgrid));
> - spin_unlock(&p->cgr_lock);
> + spin_unlock_irq(&p->cgr_lock);
>   qman_p_irqsource_add(p, QM_PIRQ_CSCI);
>  }

There is one more spin_unlock call in qm_congestion_task on the error path that 
needs updating:

	if (!qm_mc_result_timeout(&p->p, &mcr)) {
		spin_unlock(&p->cgr_lock);
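
With that one converted as well, the error path would read roughly as
follows (a sketch; the surrounding lines are as in current qman.c):

	if (!qm_mc_result_timeout(&p->p, &mcr)) {
		spin_unlock_irq(&p->cgr_lock);
		dev_crit(p->config->dev, "QUERYCONGESTION timeout\n");
		qman_p_irqsource_add(p, QM_PIRQ_CSCI);
		return;
	}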

Regards,
Camelia

> @@ -2440,7 +2440,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
>   preempt_enable();
> 
>   cgr->chan = p->config->channel;
> - spin_lock(&p->cgr_lock);
> + spin_lock_irq(&p->cgr_lock);
> 
>   if (opts) {
>   struct qm_mcc_initcgr local_opts = *opts;
> @@ -2477,7 +2477,7 @@ int qman_create_cgr(struct qman_cgr *cgr, u32 flags,
>   qman_cgrs_get(&p->cgrs[1], cgr->cgrid))
>   cgr->cb(p, cgr, 1);
>  out:
> - spin_unlock(&p->cgr_lock);
> + spin_unlock_irq(&p->cgr_lock);
>   put_affine_portal();
>   return ret;
>  }
> --
> 2.35.1.1320.gc452695387.dirty



Re: [PATCH] drm/amdgpu: drop the long-double-128 powerpc check/hack

2023-03-31 Thread Michael Ellerman
"Daniel Kolesa"  writes:
> Commit c653c591789b ("drm/amdgpu: Re-enable DCN for 64-bit powerpc")
> introduced this check as a workaround for the driver not building
> with toolchains that default to 64-bit long double.
...
> In mainline, this work is now fully done, so this check is fully
> redundant and does not do anything except preventing AMDGPU DC
> from being built on systems such as those using musl libc. The
> last piece of work to enable this was commit c92b7fe0d92a
> ("drm/amd/display: move remaining FPU code to dml folder")
> and this has since been backported to 6.1 stable (in 6.1.7).
>
> Relevant issue: https://gitlab.freedesktop.org/drm/amd/-/issues/2288

I looked to pick this up for 6.3 but was still seeing build errors with
some compilers. I assumed that was due to some fixes coming in
linux-next that I didn't have.

But applying the patch on v6.3-rc4 I still see build errors. This is
building allyesconfig with the kernel.org GCC 12.2.0 / binutils 2.39
toolchain:

  powerpc64le-linux-gnu-ld: drivers/gpu/drm/amd/display/dc/dml/display_mode_lib.o uses hard float, arch/powerpc/lib/test_emulate_step.o uses soft float
  powerpc64le-linux-gnu-ld: failed to merge target specific data of file drivers/gpu/drm/amd/display/dc/dml/display_mode_lib.o

etc.

All the conflicts are between test_emulate_step.o and some file in 
drivers/gpu/drm/amd/display/dc/dml.

So even with all the hard-float code isolated in the dml folder, we
still hit build errors, because allyesconfig wants to link those
hard-float using objects with soft-float objects from elsewhere in the
kernel.

It seems like the only workable fix is to force the kernel build to use
128-bit long double. I'll send a patch doing that.

cheers


Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Arnd Bergmann
On Fri, Mar 31, 2023, at 11:35, Russell King (Oracle) wrote:
> On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
>> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
>> > From: Arnd Bergmann 
>> > 
> >> > Most ARM CPUs can have write-back caches, and that requires
>> > cache management to be done in the dma_sync_*_for_device()
>> > operation. This is typically done in both writeback and
>> > writethrough mode.
>> > 
>> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
>> > (arm920t, arm940t) implementations are the exception here,
>> > and only do the cache management after the DMA is complete,
>> > in the dma_sync_*_for_cpu() operation.
>> > 
>> > Change this for consistency with the other platforms. This
>> > should have no user visible effect.
>> 
>> NAK...
>> 
>> The reason we do cache management _after_ is to ensure that there
>> is no stale data. The kernel _has_ (at the very least in the past)
>> performed DMA to data structures that are embedded within other
>> data structures, resulting in cache lines being shared. If one of
>> those cache lines is touched while DMA is progressing, then we
>> must do cache management _after_ the DMA operation has completed.
>> Doing it before is no good.

What I'm trying to address here is the inconsistency between
implementations. If we decide that we always want to invalidate
after FROM_DEVICE, I can do that as part of the series, but then
I have to change most of the other arm implementations.

Right now, the only WT cache implementations that do the
invalidation after the DMA are cache-v4.S (arm720 integrator and
clps711x), cache-v4wt.S (arm920/arm922 at91rm9200, clps711x,
ep93xx, omap15xx, imx1 and integrator), some sparc32 leon3 and
early xtensa.

Most architectures that have write-through caches (m68k,
microblaze) or write-back caches but no speculation (all other
armv4/armv5, hexagon, openrisc, sh, most mips, later xtensa)
only invalidate before DMA but not after.

OTOH, most machines that are actually in use today (armv6+,
powerpc, later mips, microblaze, riscv, nios2) also have to
deal with speculative accesses, so they end up having to
invalidate or flush both before and after a DMA_FROM_DEVICE
and DMA_BIDIRECTIONAL.
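
For those, both sides of the transfer end up doing maintenance -
roughly like this (a sketch; inval_dcache_range() is a hypothetical
stand-in for the architecture's invalidate-to-PoC primitive, and the
direction handling is trimmed):

	void arch_sync_dma_for_device(phys_addr_t paddr, size_t size,
				      enum dma_data_direction dir)
	{
		/* get rid of anything cached before the device writes */
		inval_dcache_range(paddr, paddr + size);
	}

	void arch_sync_dma_for_cpu(phys_addr_t paddr, size_t size,
				   enum dma_data_direction dir)
	{
		/* lines may have been speculatively refetched during DMA */
		if (dir != DMA_TO_DEVICE)
			inval_dcache_range(paddr, paddr + size);
	}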

> It looks like the main offender of "touching cache lines shared
> with DMA" has now been resolved - that was the SCSI sense buffer,
> and was fixed some time ago:
>
> commit de25deb18016f66dcdede165d07654559bb332bc
> Author: FUJITA Tomonori 
> Date:   Wed Jan 16 13:32:17 2008 +0900
>
> /if/ that is the one and only case, then we're probably fine, but
> having been through an era where this kind of thing was the norm
> and requests to fix it did not get great responses from subsystem
> maintainers, I just don't trust the kernel not to want to DMA to
> overlapping cache lines.

Thanks for digging that out, that is very useful. It looks like this
was around the same time as 03d70617b8a7 ("powerpc: Prevent memory
corruption due to cache invalidation of unaligned DMA buffer"), so
it may well have been related. I know we also had more recent 
problems with USB drivers trying to DMA to the stack, which would 
also cause problems on non-coherent machines, but some of these were
only found after we introduced VMAP_STACK.

It would be nice to use KASAN to prevent reads on cache lines that
have in-flight DMA.
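
Something along these lines, say - purely hypothetical hooks, there is
no such KASAN API today:

	/* Hypothetical: poison the buffer while the device owns it. */
	static void dma_kasan_device_owns(void *vaddr, size_t size,
					  enum dma_data_direction dir)
	{
		if (dir != DMA_TO_DEVICE)
			kasan_poison_dma(vaddr, size);	/* hypothetical call */
	}

	static void dma_kasan_cpu_owns(void *vaddr, size_t size)
	{
		kasan_unpoison_dma(vaddr, size);	/* hypothetical call */
	}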

 Arnd


Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Russell King (Oracle)
On Fri, Mar 31, 2023 at 10:07:28AM +0100, Russell King (Oracle) wrote:
> On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> > From: Arnd Bergmann 
> > 
> > Most ARM CPUs can have write-back caches, and that requires
> > cache management to be done in the dma_sync_*_for_device()
> > operation. This is typically done in both writeback and
> > writethrough mode.
> > 
> > The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> > (arm920t, arm940t) implementations are the exception here,
> > and only do the cache management after the DMA is complete,
> > in the dma_sync_*_for_cpu() operation.
> > 
> > Change this for consistency with the other platforms. This
> > should have no user visible effect.
> 
> NAK...
> 
> The reason we do cache management _after_ is to ensure that there
> is no stale data. The kernel _has_ (at the very least in the past)
> performed DMA to data structures that are embedded within other
> data structures, resulting in cache lines being shared. If one of
> those cache lines is touched while DMA is progressing, then we
> must do cache management _after_ the DMA operation has completed.
> Doing it before is no good.

It looks like the main offender of "touching cache lines shared
with DMA" has now been resolved - that was the SCSI sense buffer,
and was fixed some time ago:

commit de25deb18016f66dcdede165d07654559bb332bc
Author: FUJITA Tomonori 
Date:   Wed Jan 16 13:32:17 2008 +0900

/if/ that is the one and only case, then we're probably fine, but
having been through an era where this kind of thing was the norm
and requests to fix it did not get great responses from subsystem
maintainers, I just don't trust the kernel not to want to DMA to
overlapping cache lines.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH 17/21] ARM: dma-mapping: use arch_sync_dma_for_{device,cpu}() internally

2023-03-31 Thread Linus Walleij
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann  wrote:

> From: Arnd Bergmann 
>
> The arm specific iommu code in dma-mapping.c uses the page+offset based
> __dma_page_cpu_to_dev()/__dma_page_dev_to_cpu() helpers in place of the
> phys_addr_t based arch_sync_dma_for_device()/arch_sync_dma_for_cpu()
> wrappers around the.

Broken sentence?

> In order to be able to move the latter set of functions into
> common code, change the iommu implementation to use them directly
> and remove the internal ones as a separate interface.
>
> As page+offset and phys_address are equivalent, but are used in
> different parts of the code here, this allows removing some of
> the conversion but adds them elsewhere.
>
> Signed-off-by: Arnd Bergmann 

Looks good to me. It took me some time to verify and understand
the open-coded version of PFN_UP(), and this refactoring alone
makes the patch highly valuable.
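
(For reference, PFN_UP() in include/linux/pfn.h is

	#define PFN_UP(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

so an open-coded

	nr_pages = (size + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;

is exactly PFN_UP(size + offset) - variable names illustrative, not
necessarily the ones in the patch.)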
Reviewed-by: Linus Walleij 

Yours,
Linus Walleij


Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Russell King (Oracle)
On Mon, Mar 27, 2023 at 02:13:11PM +0200, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> Most ARM CPUs can have write-back caches, and that requires
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
> 
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
> 
> Change this for consistency with the other platforms. This
> should have no user visible effect.

NAK...

The reason we do cache management _after_ is to ensure that there
is no stale data. The kernel _has_ (at the very least in the past)
performed DMA to data structures that are embedded within other
data structures, resulting in cache lines being shared. If one of
those cache lines is touched while DMA is progressing, then we
must do cache management _after_ the DMA operation has completed.
Doing it before is no good.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


Re: [PATCH 15/21] ARM: dma-mapping: always invalidate WT caches before DMA

2023-03-31 Thread Linus Walleij
On Mon, Mar 27, 2023 at 2:16 PM Arnd Bergmann  wrote:

> From: Arnd Bergmann 
>
> Most ARM CPUs can have write-back caches, and that requires
> cache management to be done in the dma_sync_*_for_device()
> operation. This is typically done in both writeback and
> writethrough mode.
>
> The cache-v4.S (arm720/740/7tdmi/9tdmi) and cache-v4wt.S
> (arm920t, arm940t) implementations are the exception here,
> and only do the cache management after the DMA is complete,
> in the dma_sync_*_for_cpu() operation.
>
> Change this for consistency with the other platforms. This
> should have no user visible effect.
>
> Signed-off-by: Arnd Bergmann 

Looks good to me.
Reviewed-by: Linus Walleij 

Yours,
Linus Walleij