Re: [PATCH 3/3] dt-bindings: remoteproc: Add Arm remoteproc

2024-03-13 Thread Robin Murphy

On 2024-03-01 4:42 pm, abdellatif.elkhl...@arm.com wrote:

From: Abdellatif El Khlifi 

introduce the bindings for Arm remoteproc support.

Signed-off-by: Abdellatif El Khlifi 
---
  .../bindings/remoteproc/arm,rproc.yaml| 69 +++
  MAINTAINERS   |  1 +
  2 files changed, 70 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml

diff --git a/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml 
b/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
new file mode 100644
index ..322197158059
--- /dev/null
+++ b/Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
@@ -0,0 +1,69 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/remoteproc/arm,rproc.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Arm Remoteproc Devices
+
+maintainers:
+  - Abdellatif El Khlifi 
+
+description: |
+  Some Arm heterogeneous System-On-Chips feature remote processors that can
+  be controlled with a reset control register and a reset status register to
+  start or stop the processor.
+
+  This document defines the bindings for these remote processors.
+
+properties:
+  compatible:
+    enum:
+      - arm,corstone1000-extsys
+
+  reg:
+    minItems: 2
+    maxItems: 2
+    description: |
+      Address and size in bytes of the reset control register
+      and the reset status register.
+      Expects the registers to be in the order as above.
+      Should contain an entry for each value in 'reg-names'.
+
+  reg-names:
+    description: |
+      Required names for each of the reset registers defined in
+      the 'reg' property. Expects the names from the following
+      list, in the specified order, each representing the corresponding
+      reset register.
+    items:
+      - const: reset-control
+      - const: reset-status
+
+  firmware-name:
+    description: |
+      Default name of the firmware to load to the remote processor.


So... is loading the firmware image achieved by somehow bitbanging it 
through the one reset register, maybe? I find it hard to believe this is 
a complete and functional binding.


Frankly at the moment I'd be inclined to say it isn't even a remoteproc 
binding (or driver) at all, it's a reset controller. Bindings are a 
contract for describing the hardware, not the current state of Linux 
driver support - if this thing still needs mailboxes, shared memory, a 
reset vector register, or whatever else to actually be useful, those 
should be in the binding from day 1 so that a) people can write and 
deploy correct DTs now, such that functionality becomes available on 
their systems as soon as driver support catches up, and b) the community 
has any hope of being able to review whether the binding is 
appropriately designed and specified for the purpose it intends to serve.


For instance right now it seems somewhat tenuous to describe two 
consecutive 32-bit registers as separate "reg" entries, but *maybe* it's 
OK if that's all there ever is. However if it's actually going to end up 
needing several more additional MMIO and/or memory regions for other 
functionality, then describing each register and location individually 
is liable to get unmanageable really fast, and a higher-level functional 
grouping (e.g. these reset-related registers together as a single 8-byte 
region) would likely be a better design.


Thanks,
Robin.


+
+required:
+  - compatible
+  - reg
+  - reg-names
+  - firmware-name
+
+additionalProperties: false
+
+examples:
+  - |
+    extsys0: remoteproc@1a010310 {
+        compatible = "arm,corstone1000-extsys";
+        reg = <0x1a010310 0x4>, <0x1a010314 0x4>;
+        reg-names = "reset-control", "reset-status";
+        firmware-name = "es0_flashfw.elf";
+    };
+
+    extsys1: remoteproc@1a010318 {
+        compatible = "arm,corstone1000-extsys";
+        reg = <0x1a010318 0x4>, <0x1a01031c 0x4>;
+        reg-names = "reset-control", "reset-status";
+        firmware-name = "es1_flashfw.elf";
+    };
diff --git a/MAINTAINERS b/MAINTAINERS
index 54d6a40feea5..eddaa3841a65 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1768,6 +1768,7 @@ ARM REMOTEPROC DRIVER
  M:Abdellatif El Khlifi 
  L:linux-remotep...@vger.kernel.org
  S:Maintained
+F: Documentation/devicetree/bindings/remoteproc/arm,rproc.yaml
  F:drivers/remoteproc/arm_rproc.c
  
  ARM SMC WATCHDOG DRIVER




Re: [PATCH v3] drivers: introduce and use WANT_DMA_CMA for soft dependencies on DMA_CMA

2021-04-12 Thread Robin Murphy

On 2021-04-09 14:39, David Hildenbrand wrote:

On 09.04.21 15:35, Arnd Bergmann wrote:
On Fri, Apr 9, 2021 at 1:21 PM David Hildenbrand  
wrote:


Random drivers should not override a user configuration of core knobs
(e.g., CONFIG_DMA_CMA=n). Applicable drivers would like to use DMA_CMA,
which depends on CMA, if possible; however, these drivers also have to
tolerate if DMA_CMA is not available/functioning, for example, if no CMA
area for DMA_CMA use has been setup via "cma=X". In the worst case, the
driver cannot do it's job properly in some configurations.

For example, commit 63f5677544b3 ("drm/etnaviv: select CMA and DMA_CMA if
available") documents
    While this is no build dependency, etnaviv will only work correctly
    on most systems if CMA and DMA_CMA are enabled. Select both options
    if available to avoid users ending up with a non-working GPU due to
    a lacking kernel config.
So etnaviv really wants to have DMA_CMA, however, can deal with some cases
where it is not available.

Let's introduce WANT_DMA_CMA and use it in most cases where drivers
select CMA/DMA_CMA, or depend on DMA_CMA (in a wrong way via CMA because
of recursive dependency issues).

We'll assume that any driver that selects DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER would like to use DMA_CMA if possible.

With this change, distributions can disable CONFIG_CMA or
CONFIG_DMA_CMA, without it silently getting enabled again by random
drivers. Also, we'll now automatically try to enabled both, CONFIG_CMA
and CONFIG_DMA_CMA if they are unspecified and any driver is around that
selects WANT_DMA_CMA -- also implicitly via DRM_GEM_CMA_HELPER or
DRM_KMS_CMA_HELPER.

For example, if any driver selects WANT_DMA_CMA and we do a
"make olddefconfig":

1. With "# CONFIG_CMA is not set" and no specification of
    "CONFIG_DMA_CMA"

-> CONFIG_DMA_CMA won't be part of .config

2. With no specification of CONFIG_CMA or CONFIG_DMA_CMA

Contiguous Memory Allocator (CMA) [Y/n/?] (NEW)
DMA Contiguous Memory Allocator (DMA_CMA) [Y/n/?] (NEW)

3. With "# CONFIG_CMA is not set" and "# CONFIG_DMA_CMA is not set"

-> CONFIG_DMA_CMA will be removed from .config

Note: drivers/remoteproc seems to be special; commit c51e882cd711
("remoteproc/davinci: Update Kconfig to depend on DMA_CMA") explains that
there is a real dependency to DMA_CMA for it to work; leave that dependency
in place and don't convert it to a soft dependency.


I don't think this dependency is fundamentally different from the others,
though davinci machines tend to have less memory than a lot of the
other machines, so it's more likely to fail without CMA.



I was also unsure - and Lucas had similar thoughts. If you want, I can 
send a v4 also taking care of this.


TBH I think it should all just be removed. DMA_CMA is effectively an 
internal feature of the DMA API, and drivers which simply use the DMA 
API shouldn't really be trying to assume *how* things might be allocated 
at runtime - CMA is hardly the only way. Platform-level assumptions 
about the presence or not of IOMMUs, memory carveouts, etc., and whether 
it even matters - e.g. a device with a tiny LCD may only need display 
buffers which still fit in a regular MAX_ORDER allocation - could go in 
platform-specific configs, but I really don't think they belong at the 
generic subsystem level.


We already have various examples like I2S drivers that won't even probe 
without a dmaengine provider being present, or host controller drivers 
which are useless without their corresponding phy driver (and I'm 
guessing you can probably also do higher-level things like include the 
block layer but omit all filesystem drivers). I don't believe it's 
Kconfig's job to try to guess whether a given configuration is *useful*, 
only to enforce that it's valid to build.


Robin.


[PATCH v2 2/2] iommu: Streamline registration interface

2021-04-01 Thread Robin Murphy
Rather than have separate opaque setter functions that are easy to
overlook and lead to repetitive boilerplate in drivers, let's pass the
relevant initialisation parameters directly to iommu_device_register().

Acked-by: Will Deacon 
Signed-off-by: Robin Murphy 

---

v2: Add some kerneldoc and an attempt to sanity-check modules

FWIW I'm hoping to fold the sysfs registration in here too, such that
@hwdev can also represent the sysfs parent, but I haven't yet figured
out why virtio-iommu is the odd one out in that regard.
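
For drivers, the net effect is roughly the following - a sketch against a
made-up "foo" driver rather than any real one, just to show the shape of
the change:

	/* before: ops and fwnode set via separate opaque setters */
	iommu_device_set_ops(&foo->iommu, &foo_iommu_ops);
	iommu_device_set_fwnode(&foo->iommu, dev->fwnode);
	ret = iommu_device_register(&foo->iommu);

	/* after: everything is passed at registration time; the new @hwdev
	 * argument (dev here) supplies the fwnode */
	ret = iommu_device_register(&foo->iommu, &foo_iommu_ops, dev);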
---
 drivers/iommu/amd/init.c|  3 +--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  5 +---
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  5 +---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |  5 +---
 drivers/iommu/exynos-iommu.c|  5 +---
 drivers/iommu/fsl_pamu_domain.c |  4 +--
 drivers/iommu/intel/dmar.c  |  4 +--
 drivers/iommu/intel/iommu.c |  3 +--
 drivers/iommu/iommu.c   | 19 -
 drivers/iommu/ipmmu-vmsa.c  |  6 +
 drivers/iommu/msm_iommu.c   |  5 +---
 drivers/iommu/mtk_iommu.c   |  5 +---
 drivers/iommu/mtk_iommu_v1.c|  4 +--
 drivers/iommu/omap-iommu.c  |  5 +---
 drivers/iommu/rockchip-iommu.c  |  5 +---
 drivers/iommu/s390-iommu.c  |  4 +--
 drivers/iommu/sprd-iommu.c  |  5 +---
 drivers/iommu/sun50i-iommu.c|  5 +---
 drivers/iommu/tegra-gart.c  |  5 +---
 drivers/iommu/tegra-smmu.c  |  5 +---
 drivers/iommu/virtio-iommu.c|  5 +---
 include/linux/iommu.h   | 30 +
 22 files changed, 44 insertions(+), 98 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 321f5906e6ed..e1ef922d9f8f 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1935,8 +1935,7 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
 
	iommu_device_sysfs_add(&iommu->iommu, &iommu->dev->dev,
			       amd_iommu_groups, "ivhd%d", iommu->index);
-   iommu_device_set_ops(&iommu->iommu, &amd_iommu_ops);
-   iommu_device_register(&iommu->iommu);
+   iommu_device_register(&iommu->iommu, &amd_iommu_ops, NULL);
 
return pci_enable_device(iommu->dev);
 }
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b82000519af6..ecc6cfe3ae90 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3621,10 +3621,7 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
if (ret)
return ret;
 
-   iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
-   iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
-
-   ret = iommu_device_register(&smmu->iommu);
+   ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
return ret;
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 11ca963c4b93..0a697cb0d2f8 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -,10 +,7 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
return err;
}
 
-   iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
-   iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
-
-   err = iommu_device_register(&smmu->iommu);
+   err = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (err) {
dev_err(dev, "Failed to register iommu\n");
return err;
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 7f280c8d5c53..4294abe389b2 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -847,10 +847,7 @@ static int qcom_iommu_device_probe(struct platform_device 
*pdev)
return ret;
}
 
-   iommu_device_set_ops(&qcom_iommu->iommu, &qcom_iommu_ops);
-   iommu_device_set_fwnode(&qcom_iommu->iommu, dev->fwnode);
-
-   ret = iommu_device_register(&qcom_iommu->iommu);
+   ret = iommu_device_register(&qcom_iommu->iommu, &qcom_iommu_ops, dev);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
return ret;
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index de324b4eedfe..f887c3e111c1 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -630,10 +630,7 @@ static int exynos_sysmmu_probe(struct platform_device 
*pdev)
if (ret)
return ret;
 
-   iommu_device_set_ops(&data->iommu, &exynos_iommu_ops);
-   iommu_device_set_fwnode(&data->iommu, &dev->of_node->fwnode);
-
-

[PATCH v2 1/2] iommu: Statically set module owner

2021-04-01 Thread Robin Murphy
It happens that the 3 drivers which first supported being modular are
also ones which play games with their pgsize_bitmap, so have non-const
iommu_ops where dynamically setting the owner manages to work out OK.
However, it's less than ideal to force that upon all drivers which want
to be modular - like the new sprd-iommu driver which now has a potential
bug in that regard - so let's just statically set the module owner and
let ops remain const wherever possible.

Reviewed-by: Christoph Hellwig 
Acked-by: Will Deacon 
Signed-off-by: Robin Murphy 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 +
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 +
 drivers/iommu/sprd-iommu.c  | 1 +
 drivers/iommu/virtio-iommu.c| 1 +
 include/linux/iommu.h   | 9 +
 5 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..b82000519af6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2632,6 +2632,7 @@ static struct iommu_ops arm_smmu_ops = {
.sva_unbind = arm_smmu_sva_unbind,
.sva_get_pasid  = arm_smmu_sva_get_pasid,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
+   .owner  = THIS_MODULE,
 };
 
 /* Probing and initialisation functions */
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..11ca963c4b93 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1638,6 +1638,7 @@ static struct iommu_ops arm_smmu_ops = {
.put_resv_regions   = generic_iommu_put_resv_regions,
.def_domain_type= arm_smmu_def_domain_type,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
+   .owner  = THIS_MODULE,
 };
 
 static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
diff --git a/drivers/iommu/sprd-iommu.c b/drivers/iommu/sprd-iommu.c
index e1dc2f7d5639..b5edf0e82176 100644
--- a/drivers/iommu/sprd-iommu.c
+++ b/drivers/iommu/sprd-iommu.c
@@ -436,6 +436,7 @@ static const struct iommu_ops sprd_iommu_ops = {
.device_group   = sprd_iommu_device_group,
.of_xlate   = sprd_iommu_of_xlate,
.pgsize_bitmap  = ~0UL << SPRD_IOMMU_PAGE_SHIFT,
+   .owner  = THIS_MODULE,
 };
 
 static const struct of_device_id sprd_iommu_of_match[] = {
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 2bfdd5734844..594ed827e944 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -945,6 +945,7 @@ static struct iommu_ops viommu_ops = {
.get_resv_regions   = viommu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
.of_xlate   = viommu_of_xlate,
+   .owner  = THIS_MODULE,
 };
 
 static int viommu_init_vqs(struct viommu_dev *viommu)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5e7fe519430a..dce8c5e12ea0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -379,19 +379,12 @@ int  iommu_device_link(struct iommu_device   *iommu, 
struct device *link);
 void iommu_device_unlink(struct iommu_device *iommu, struct device *link);
 int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain);
 
-static inline void __iommu_device_set_ops(struct iommu_device *iommu,
+static inline void iommu_device_set_ops(struct iommu_device *iommu,
  const struct iommu_ops *ops)
 {
iommu->ops = ops;
 }
 
-#define iommu_device_set_ops(iommu, ops)   \
-do {   \
-   struct iommu_ops *__ops = (struct iommu_ops *)(ops);\
-   __ops->owner = THIS_MODULE; \
-   __iommu_device_set_ops(iommu, __ops);   \
-} while (0)
-
 static inline void iommu_device_set_fwnode(struct iommu_device *iommu,
   struct fwnode_handle *fwnode)
 {
-- 
2.21.0.dirty



Re: [PATCH 2/3] tracing: Use pr_crit() instead of long fancy messages

2021-04-01 Thread Robin Murphy

On 2021-04-01 10:39, Geert Uytterhoeven wrote:

Hi Steven,

On Wed, Mar 31, 2021 at 3:40 PM Steven Rostedt  wrote:

On Wed, 31 Mar 2021 11:31:03 +0200
Geert Uytterhoeven  wrote:


This reduces kernel size by ca. 0.5 KiB.


If you are worried about size, disable tracing and it will go away
entirely. 0.5KiB is a drop in the bucket compared to what tracing adds in
size overhead.


Fair enough for this particular case, as tracing can be disabled.


I think the same argument can be applied to patch #1 - it's hard to 
imagine anyone debugging an IOMMU driver on a system where a few hundred 
bytes makes the slightest bit of difference, and for people not 
debugging IOMMU drivers it should be moot (per the message itself).


Robin.


Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable

2021-03-31 Thread Robin Murphy

On 2021-03-19 17:26, John Garry wrote:
[...]

@@ -25,7 +25,8 @@ struct iova {
  struct iova_magazine;
  struct iova_cpu_rcache;
-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */
+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10	/* log of max cached IOVA range size (in pages) */


No.

And why? If we're going to allocate massive caches anyway, whatever is 
the point of *not* using all of them?




I wanted to keep the same effective threshold for devices today, unless 
set otherwise.


The reason is that I didn't know if a blanket increase would cause 
regressions, and I was taking the super-safe road. Specifically some 
systems may be very IOVA space limited, and just work today by not 
caching large IOVAs.


alloc_iova_fast() will already clear out the caches if space is running 
low, so the caching itself shouldn't be an issue.


And in the precursor thread you wrote "We can't arbitrarily *increase* 
the scope of caching once a domain is active due to the size-rounding-up 
requirement, which would be prohibitive to larger allocations if applied 
universally" (sorry for quoting)


I took the last part to mean that we shouldn't apply this increase in 
threshold globally.


I meant we can't increase the caching threshold as-is once the domain is 
in use, because that could result in odd-sized IOVAs previously 
allocated above the old threshold being later freed back into caches, 
then causing havoc the next time they get allocated (because they're not 
as big as the actual size being asked for). However, trying to address 
that by just size-aligning everything even above the caching threshold 
is liable to waste too much space on IOVA-constrained systems (e.g. a 
single 4K video frame may be ~35MB - rounding that up to 64MB each time 
would be hard to justify).


It follows from that that there's really no point in decoupling the 
rounding-up threshold from the actual caching threshold - you get all 
the wastage (both IOVA space and actual memory for the cache arrays) for 
no obvious benefit.


It only makes sense for a configuration knob to affect the actual 
rcache and depot allocations - that way, big high-throughput systems 
with plenty of memory can spend it on better performance, while small 
systems - that often need IOMMU scatter-gather precisely *because* 
memory is tight and thus easily fragmented - don't have to pay the 
(not insignificant) cost for caches they don't need.


So do you suggest to just make IOVA_RANGE_CACHE_MAX_SIZE a kconfig option?


Again, I'm less convinced by Kconfig since I imagine many people tuning 
server-class systems for their own particular workloads will be running 
standard enterprise distros, so I think end-user-accessible knobs will 
be the most valuable. That's not to say that a Kconfig option to set the 
default state of a command-line option (as we do elsewhere) won't be 
useful for embedded users, cloud providers, etc., just that I'm not sure 
it's worth it being the *only* option.


Robin.


Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-31 Thread Robin Murphy

On 2021-03-22 15:01, John Garry wrote:

On 19/03/2021 19:20, Robin Murphy wrote:

Hi Robin,


So then we have the issue of how to dynamically increase this rcache
threshold. The problem is that we may have many devices associated with
the same domain. So, in theory, we can't assume that when we increase
the threshold that some other device will try to fast free an IOVA which
was allocated prior to the increase and was not rounded up.

I'm very open to better (or less bad) suggestions on how to do this ...

...but yes, regardless of exactly where it happens, rounding up or not
is the problem for rcaches in general. I've said several times that my
preferred approach is to not change it that dynamically at all, but
instead treat it more like we treat the default domain type.



Can you remind me of that idea? I don't remember you mentioning using 
default domain handling as a reference in any context.


Sorry if the phrasing was unclear there - the allusion to default 
domains is new, it just occurred to me that what we do there is in fact 
fairly close to what I've suggested previously for this. In that case, 
we have a global policy set by the command line, which *can* be 
overridden per-domain via sysfs at runtime, provided the user is willing 
to tear the whole thing down. Using a similar approach here would give a 
fair degree of flexibility but still mean that changes never have to be 
made dynamically to a live domain.


Robin.


Re: [PATCH 24/30] Kconfig: Change Synopsys to Synopsis

2021-03-30 Thread Robin Murphy

On 2021-03-29 00:53, Bhaskar Chowdhury wrote:

s/Synopsys/Synopsis/  .two different places.


Erm, that is definitely not a typo... :/


..and for some unknown reason it introduce a empty line deleted and added
back.


Presumably your editor is configured to trim trailing whitespace on save.

Furthermore, there are several instances in the other patches where your 
"corrections" are grammatically incorrect, I'm not sure what the deal is 
with patch #14, and you've also used the wrong subsystem name (it should 
be "dmaengine"). It's great to want to clean things up, but please pay a 
bit of care and attention to what you're actually doing.


Robin.


Signed-off-by: Bhaskar Chowdhury 
---
  drivers/dma/Kconfig | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/dma/Kconfig b/drivers/dma/Kconfig
index 0c2827fd8c19..30e8cc26f43b 100644
--- a/drivers/dma/Kconfig
+++ b/drivers/dma/Kconfig
@@ -170,15 +170,15 @@ config DMA_SUN6I
  Support for the DMA engine first found in Allwinner A31 SoCs.

  config DW_AXI_DMAC
-   tristate "Synopsys DesignWare AXI DMA support"
+   tristate "Synopsis DesignWare AXI DMA support"
depends on OF || COMPILE_TEST
depends on HAS_IOMEM
select DMA_ENGINE
select DMA_VIRTUAL_CHANNELS
help
- Enable support for Synopsys DesignWare AXI DMA controller.
+ Enable support for Synopsis DesignWare AXI DMA controller.
  NOTE: This driver wasn't tested on 64 bit platform because
- of lack 64 bit platform with Synopsys DW AXI DMAC.
+ of lack 64 bit platform with Synopsis DW AXI DMAC.

  config EP93XX_DMA
bool "Cirrus Logic EP93xx DMA support"
@@ -394,7 +394,7 @@ config MOXART_DMA
select DMA_VIRTUAL_CHANNELS
help
  Enable support for the MOXA ART SoC DMA controller.
-
+
  Say Y here if you enabled MMP ADMA, otherwise say N.

  config MPC512X_DMA
--
2.26.3

___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-26 Thread Robin Murphy

On 2021-03-26 17:26, Mark Kettenis wrote:

From: Arnd Bergmann 
Date: Fri, 26 Mar 2021 17:38:24 +0100

On Fri, Mar 26, 2021 at 5:10 PM Sven Peter  wrote:

On Fri, Mar 26, 2021, at 16:59, Mark Kettenis wrote:

Some of the DARTs provide a bypass facility.  That code make using the
standard "dma-ranges" property tricky.  That property would need to
contain the bypass address range.  But that would mean that if the
DART driver needs to look at that property to figure out the address
range that supports translation it will need to be able to distinguish
between the translatable address range and the bypass address range.


Do we understand if and why we even need to bypass certain streams?


My guess is that this is a performance optimization.

There are generally three reasons to want an iommu in the first place:
  - Pass a device down to a guest or user process without giving
access to all of memory
  - Avoid problems with limitations in the device, typically when it
only supports
32-bit bus addressing, but the installed memory is larger than 4GB
  - Protect kernel memory from broken drivers

If you care about none of the above, but you do care about data transfer
speed, you are better off just leaving the IOMMU in bypass mode.
I don't think we have to support it if the IOMMU works reliably, but it's
something that users might want.


Another reason might be that a device needs access to large amounts of
physical memory at the same time and the 32-bit address space that the
DART provides is too tight.

In U-Boot I might want to use bypass where it works since there is no
proper IOMMU support in U-Boot.  Generally speaking, the goal is to
keep the code size down in U-Boot.  In OpenBSD I'll probably avoid
bypass mode if I can.

I haven't figured out how the bypass stuff really works.  Corellium
added support for it in their codebase when they added support for
Thunderbolt, and some of the DARTs that seem to be related to
Thunderbolt do indeed have a "bypass" property.  But it is unclear to
me how the different puzzle pieces fit together for Thunderbolt.

Anyway, from my viewpoint having the information about the IOVA
address space sit on the devices makes little sense.  This information
is needed by the DART driver, and there is no direct connection from
the DART to the individual devices in the devicetree.  The "iommus"
property makes a connection in the opposite direction.


What still seems unclear is whether these addressing limitations are a 
property of the DART input interface, the device output interface, or 
the interconnect between them. Although the observable end result 
appears more or less the same either way, they are conceptually 
different things which we have different abstractions to deal with.


Robin.


Re: Marvell: hw perfevents: unable to count PMU IRQs

2021-03-26 Thread Robin Murphy

On 2021-03-25 21:39, Paul Menzel wrote:

Dear Linux folks,


On the Marvell Prestera switch, Linux 5.10.4 prints the error (with an 
additional info level message) below.


     [    0.00] Linux version 5.10.4 (robimarko@onlbuilder9) 
(aarch64-linux-gnu-gcc (Debian 6.3.0-18) 6.3.0 20170516, GNU ld (GNU 
Binutils for Debian) 2.28) #1 SMP PREEMPT Thu Mar 11 10:22:09 UTC 2021

     […]
     [    1.996658] hw perfevents: unable to count PMU IRQs
     [    2.001825] hw perfevents: /ap806/config-space@f000/pmu: 
failed to register PMU devices!


```
# lscpu
Architecture:  aarch64
Byte Order:    Little Endian
CPU(s):    4
On-line CPU(s) list:   0-3
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s): 1
NUMA node(s):  1
Model: 1
BogoMIPS:  50.00
L1d cache: 32K
L1i cache: 48K
L2 cache:  512K
NUMA node0 CPU(s): 0-3
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
# cat /proc/cpuinfo
processor   : 0
BogoMIPS    : 50.00
Features    : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part    : 0xd08
CPU revision    : 1
[…]
```

Please find the output of `dmesg` attached.

How can the IRQs be counted?


Well, that message simply means we got an error back from 
platform_irq_count(), which in turn implies that 
platform_get_irq_optional() failed. Most likely we got -EPROBE_DEFER 
back from of_irq_get() because the relevant interrupt controller wasn't 
ready by that point - especially since that's the only error code that 
platform_irq_count() will actually pass. It looks like that should end up 
getting propagated all the way out appropriately, so the PMU driver 
should defer and be able to probe OK once the mvebu-pic driver has 
turned up to provide its IRQ. We could of course do a better job of not 
shouting error messages for a non-fatal condition.
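
(For reference, the deferral itself is just the usual pattern - a sketch
with a made-up driver name, not the actual arm_pmu code:

	static int foo_pmu_probe(struct platform_device *pdev)
	{
		int nr_irqs = platform_irq_count(pdev);

		if (nr_irqs < 0)	/* typically -EPROBE_DEFER here */
			return nr_irqs;	/* driver core retries the probe later */

		/* ...carry on setting up nr_irqs interrupts... */
		return 0;
	}

so as long as the error makes it back out of probe, things should sort 
themselves out once the interrupt provider shows up.)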


As for why the PMU doesn't eventually show up, my best guess would be 
either an issue with the mvebu-pic driver itself probing, and/or perhaps 
something in fw_devlink going awry - inspecting sysfs should shed a bit 
more light on those.


Robin.


Re: [PATCH] usb: gadget: aspeed: set port_dev dma mask

2021-03-26 Thread Robin Murphy

On 2021-03-26 07:02, rentao.b...@gmail.com wrote:

From: Tao Ren 

Set aspeed-usb vhub port_dev's dma mask to pass the dma_mask test in
"dma_map_page_attrs" function, and the dma_mask test was added in
'commit f959dcd6ddfd ("dma-direct: Fix potential NULL pointer
dereference")'.

Below is the backtrace without the patch:
[<80106550>] show_stack+0x20/0x24
[<80106868>] dump_stack+0x28/0x30
[<80823540>] __warn+0xfc/0x110
[<8011ac30>] warn_slowpath_fmt+0xb0/0xc0
[<8011ad44>] dma_map_page_attrs+0x24c/0x314
[<8016a27c>] usb_gadget_map_request_by_dev+0x100/0x1e4
[<805cedd8>] usb_gadget_map_request+0x1c/0x20
[<805cefbc>] ast_vhub_epn_queue+0xa0/0x1d8
[<7f02f710>] usb_ep_queue+0x48/0xc4
[<805cd3e8>] ecm_do_notify+0xf8/0x248
[<7f145920>] ecm_set_alt+0xc8/0x1d0
[<7f145c34>] composite_setup+0x680/0x1d30
[<7f00deb8>] ast_vhub_ep0_handle_setup+0xa4/0x1bc
[<7f02ee94>] ast_vhub_dev_irq+0x58/0x84
[<7f0309e0>] ast_vhub_irq+0xb0/0x1c8
[<7f02e118>] __handle_irq_event_percpu+0x50/0x19c
[<8015e5bc>] handle_irq_event_percpu+0x38/0x8c
[<8015e758>] handle_irq_event+0x38/0x4c

Signed-off-by: Tao Ren 
---
  drivers/usb/gadget/udc/aspeed-vhub/dev.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/usb/gadget/udc/aspeed-vhub/dev.c 
b/drivers/usb/gadget/udc/aspeed-vhub/dev.c
index d268306a7bfe..9eb3904a6ff9 100644
--- a/drivers/usb/gadget/udc/aspeed-vhub/dev.c
+++ b/drivers/usb/gadget/udc/aspeed-vhub/dev.c
@@ -569,6 +569,7 @@ int ast_vhub_init_dev(struct ast_vhub *vhub, unsigned int 
idx)
device_initialize(d->port_dev);
d->port_dev->release = ast_vhub_dev_release;
d->port_dev->parent = parent;
+   d->port_dev->dma_mask = parent->dma_mask;


This might happen to work out, but is far from correct. Just wait until 
you try it on a platform where the USB controller is behind an IOMMU...


It looks like something is more fundamentally wrong here - the device 
passed to DMA API calls must be the actual hardware device performing 
the DMA, which in USB-land I believe means the controller's sysdev.
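
As a rough sketch of the sort of thing I mean - assuming the vhub driver 
can get at its platform device via something like ep->vhub->pdev - the 
mapping would be done against the physical controller instead:

	/* sketch: map against the device that really masters the DMA */
	rc = usb_gadget_map_request_by_dev(&ep->vhub->pdev->dev, req,
					   ep->epn.is_in);

rather than giving each synthetic per-port device a copied dma_mask.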


Robin.


dev_set_name(d->port_dev, "%s:p%d", dev_name(parent), idx + 1);
rc = device_add(d->port_dev);
if (rc)



Re: [PATCH 1/2] iommu/mediatek-v1: Alloc building as module

2021-03-25 Thread Robin Murphy

^^Nit: presumably you meant "Allow" in the subject.

On 2021-03-25 17:16, Will Deacon wrote:

On Tue, Mar 23, 2021 at 01:58:00PM +0800, Yong Wu wrote:

This patch only adds support for building the IOMMU-v1 driver as module.
Correspondingly switch the config to tristate.

Signed-off-by: Yong Wu 
---
rebase on v5.12-rc2.
---
  drivers/iommu/Kconfig| 2 +-
  drivers/iommu/mtk_iommu_v1.c | 9 -
  2 files changed, 5 insertions(+), 6 deletions(-)


Both of these patches look fine to me, but you probably need to check
the setting of MODULE_OWNER after:

https://lore.kernel.org/r/f4de29d8330981301c1935e667b507254a2691ae.1616157612.git.robin.mur...@arm.com


Right, furthermore I would rather expect these patches on their own to 
hit the problem that my patch tries to avoid - where since mtk_iommu_ops 
is const, the current version of iommu_device_set_ops() is liable to 
blow up trying to write to rodata.


In fact I do wonder a little why that wasn't happening already - maybe 
the compiler is clever enough to tell that the assignment is redundant 
when THIS_MODULE == 0, and elides it :/


Robin.


Re: [PATCH v2 4/4] iommu: Stop exporting free_iova_fast()

2021-03-25 Thread Robin Murphy

On 2021-03-25 12:30, John Garry wrote:

Function free_iova_fast() is only referenced by dma-iommu.c, which can
only be in-built, so stop exporting it.

This was missed in an earlier tidy-up patch.


Reviewed-by: Robin Murphy 


Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 8a493ee92c79..e31e79a9f5c3 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -493,7 +493,6 @@ free_iova_fast(struct iova_domain *iovad, unsigned long 
pfn, unsigned long size)
  
  	free_iova(iovad, pfn);

  }
-EXPORT_SYMBOL_GPL(free_iova_fast);
  
  #define fq_ring_for_each(i, fq) \

for ((i) = (fq)->head; (i) != (fq)->tail; (i) = ((i) + 1) % 
IOVA_FQ_SIZE)



Re: [PATCH v2 3/4] iommu: Delete iommu_dma_free_cpu_cached_iovas()

2021-03-25 Thread Robin Murphy

On 2021-03-25 12:30, John Garry wrote:

Function iommu_dma_free_cpu_cached_iovas() no longer has any caller, so
delete it.

With that, function free_cpu_cached_iovas() may be made static.


Reviewed-by: Robin Murphy 


Signed-off-by: John Garry 
---
  drivers/iommu/dma-iommu.c | 9 -
  drivers/iommu/iova.c  | 3 ++-
  include/linux/dma-iommu.h | 8 
  include/linux/iova.h  | 5 -
  4 files changed, 2 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..9da7e9901bec 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -53,15 +53,6 @@ struct iommu_dma_cookie {
  
  static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
  
-void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,

-   struct iommu_domain *domain)
-{
-   struct iommu_dma_cookie *cookie = domain->iova_cookie;
-   struct iova_domain *iovad = &cookie->iovad;
-
-   free_cpu_cached_iovas(cpu, iovad);
-}
-
  static void iommu_dma_entry_dtor(unsigned long data)
  {
struct page *freelist = (struct page *)data;
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c78312560425..8a493ee92c79 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -22,6 +22,7 @@ static unsigned long iova_rcache_get(struct iova_domain 
*iovad,
 unsigned long size,
 unsigned long limit_pfn);
  static void init_iova_rcaches(struct iova_domain *iovad);
+static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
  static void free_iova_rcaches(struct iova_domain *iovad);
  static void fq_destroy_all_entries(struct iova_domain *iovad);
  static void fq_flush_timeout(struct timer_list *t);
@@ -998,7 +999,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
  /*
   * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
   */
-void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
+static void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
  {
struct iova_cpu_rcache *cpu_rcache;
struct iova_rcache *rcache;
diff --git a/include/linux/dma-iommu.h b/include/linux/dma-iommu.h
index 706b68d1359b..2112f21f73d8 100644
--- a/include/linux/dma-iommu.h
+++ b/include/linux/dma-iommu.h
@@ -37,9 +37,6 @@ void iommu_dma_compose_msi_msg(struct msi_desc *desc,
  
  void iommu_dma_get_resv_regions(struct device *dev, struct list_head *list);
  
-void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,

-   struct iommu_domain *domain);
-
  #else /* CONFIG_IOMMU_DMA */
  
  struct iommu_domain;

@@ -81,10 +78,5 @@ static inline void iommu_dma_get_resv_regions(struct device 
*dev, struct list_he
  {
  }
  
-static inline void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,

-   struct iommu_domain *domain)
-{
-}
-
  #endif/* CONFIG_IOMMU_DMA */
  #endif/* __DMA_IOMMU_H */
diff --git a/include/linux/iova.h b/include/linux/iova.h
index 4be6c0ab4997..71d8a2de6635 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -157,7 +157,6 @@ int init_iova_flush_queue(struct iova_domain *iovad,
  iova_flush_cb flush_cb, iova_entry_dtor entry_dtor);
  struct iova *find_iova(struct iova_domain *iovad, unsigned long pfn);
  void put_iova_domain(struct iova_domain *iovad);
-void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad);
  #else
  static inline int iova_cache_get(void)
  {
@@ -234,10 +233,6 @@ static inline void put_iova_domain(struct iova_domain 
*iovad)
  {
  }
  
-static inline void free_cpu_cached_iovas(unsigned int cpu,

-struct iova_domain *iovad)
-{
-}
  #endif
  
  #endif




Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-25 Thread Robin Murphy

On 2021-03-25 07:53, Sven Peter wrote:



On Tue, Mar 23, 2021, at 21:53, Rob Herring wrote:

On Sun, Mar 21, 2021 at 05:00:50PM +0100, Mark Kettenis wrote:

Date: Sat, 20 Mar 2021 15:19:33 +
From: Sven Peter 
I have just noticed today though that at least the USB DWC3 controller in host
mode uses *two* darts at the same time. I'm not sure yet which parts seem to
require which DART instance.

This means that we might need to support devices attached to two iommus
simultaneously and just create the same iova mappings. Currently this only
seems to be required for USB according to Apple's Device Tree.

I see two options for this and would like to get feedback before
I implement either one:

 1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
to identify the DART and the second one to identify the master.
The DART DT node would then also take two register ranges that would
correspond to the two DARTs. Both instances use the same IRQ and the
same clocks according to Apple's device tree and my experiments.
This would keep a single device node and the DART driver would then
simply map iovas in both DARTs if required.

 2) Keep #iommu-cells as-is but support
 iommus = <_dart1a 1>, <_dart1b 0>;
instead.
This would then require two devices nodes for the two DART instances and
some housekeeping in the DART driver to support mapping iovas in both
DARTs.
I believe omap-iommu.c supports this setup but I will have to read
more code to understand the details there and figure out how to 
implement
this in a sane way.

I currently prefer the first option but I don't understand enough details of
the iommu system to actually make an informed decision.


Please don't mix what does the h/w look like and what's easy to
implement in Linux's IOMMU subsytem. It's pretty clear (at least
from the description here) that option 2 reflects the h/w.



Good point, I'll keep that in mind and give option 2 a try.



As I mentioned before, not all DARTs support the full 32-bit aperture.
In particular the PCIe DARTs support a smaller address-space.  It is
not clear whether this is a restriction of the PCIe host controller or
the DART, but the Apple Device Tree has "vm-base" and "vm-size"
properties that encode the base address and size of the aperture.
These single-cell properties which is probably why for the USB DARTs
only "vm-base" is given; since "vm-base" is 0, a 32-bit number
wouldn't be able to encode the full aperture size.  We could make them
64-bit numbers in the Linux device tree though and always be explicit
about the size.  Older Sun SPARC machines used a single "virtual-dma"
property to encode the aperture.  We could do someting similar.  You
would use this property to initialize domain->geometry.aperture_start
and domain->geometry.aperture_end in diff 3/3 of this series.


'dma-ranges' is what should be used here.



The iommu binding documentation [1] mentions that

 The device tree node of the IOMMU device's parent bus must contain a valid
 "dma-ranges" property that describes how the physical address space of the
 IOMMU maps to memory. An empty "dma-ranges" property means that there is a
 1:1 mapping from IOMMU to memory.

which, if I understand this correctly, means that the 'dma-ranges' for the
parent bus of the iommu should be empty since the DART hardware can see the
full physical address space with a 1:1 mapping.


The documentation also mentions that

  When an "iommus" property is specified in a device tree node, the IOMMU
  will be used for address translation. If a "dma-ranges" property exists
  in the device's parent node it will be ignored.

which means that specifying a 'dma-ranges' in the parent bus of any devices
that use the iommu will just be ignored.


I think that's just wrong and wants updating (or at least clarifying). 
The high-level view now is that we use "dma-ranges" to describe 
limitations imposed by a bridge or interconnect segment, and that can 
certainly happen upstream of an IOMMU. As it happens, I've just recently 
sent a patch for precisely that case[1].


I guess what it might have been trying to say is that "dma-ranges" 
*does* become irrelevant in terms of constraining what physical memory 
is usable for DMA, but that shouldn't imply that its meaning doesn't 
just shift to a different purpose.



As a concrete example, the PCIe DART IOMMU only allows translations from iovas
within 0x0010...0x3ff0 to the entire physical address space (though
realistically it will only map to 16GB RAM starting at 0x8 on the M1).

I'm probably just confused or maybe the documentation is outdated but I don't
see how I could specify "this device can only use DMA addresses from
0x0010...0x3ff0 but can map these via the iommu to any physical
address" using 'dma-ranges'.

Could you maybe point me to the right 

Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

2021-03-24 Thread Robin Murphy

On 2021-03-24 16:38, David Laight wrote:

From: Robin Murphy

Sent: 23 March 2021 12:09

On 2021-03-23 07:34, Yang Yingliang wrote:

When copy over 128 bytes, src/dst is added after
each ldp/stp instruction, it will cost more time.
To improve this, we only add src/dst after load
or store 64 bytes.


This breaks the required behaviour for copy_*_user(), since the fault
handler expects the base address to be up-to-date at all times. Say
you're copying 128 bytes and fault on the 4th store, it should return 80
bytes not copied; the code below would return 128 bytes not copied, even
though 48 bytes have actually been written to the destination.


Are there any non-superscaler amd64 cpu (that anyone cares about)?

If the cpu can execute multiple instructions in one clock
then it is usually possible to get the loop control (almost) free.

You might need to unroll once to interleave read/write
but any more may be pointless.


Nah, the whole point is that using post-increment addressing is crap in 
the first place because it introduces register dependencies between each 
access that could be avoided entirely if we could use offset addressing 
(and especially crap when we don't even *have* a post-index addressing 
mode for the unprivileged load/store instructions used in copy_*_user() 
and have to simulate it with extra instructions that throw off the code 
alignment).


We already have code that's tuned to work well across our 
microarchitectures[1], the issue is that butchering it to satisfy the 
additional requirements of copy_*_user() with a common template has 
hobbled regular memcpy() performance. I intend to have a crack at fixing 
that properly tomorrow ;)


Robin.

[1] https://github.com/ARM-software/optimized-routines


So something like:
a = *src++
do {
b = *src++;
*dst++ = a;
a = *src++;
*dst++ = b;
} while (src != lim);
*dst++ = b;

 David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH 1/3] iommu: io-pgtable: add DART pagetable format

2021-03-24 Thread Robin Murphy

On 2021-03-20 15:19, Sven Peter wrote:

Apple's DART iommu uses a pagetable format that's very similar to the ones
already implemented by io-pgtable.c.
Add a new format variant to support the required differences.


TBH there look to be more differences than similarities, but I guess we 
already opened that door with the Mali format, and nobody likes writing 
pagetable code :)



Signed-off-by: Sven Peter 
---
  drivers/iommu/Kconfig  | 13 +++
  drivers/iommu/io-pgtable-arm.c | 70 ++
  drivers/iommu/io-pgtable.c |  3 ++
  include/linux/io-pgtable.h |  6 +++
  4 files changed, 92 insertions(+)

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 192ef8f61310..3c95c8524abe 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -39,6 +39,19 @@ config IOMMU_IO_PGTABLE_LPAE
  sizes at both stage-1 and stage-2, as well as address spaces
  up to 48-bits in size.

+config IOMMU_IO_PGTABLE_APPLE_DART


Does this really need to be configurable? I don't think there's an 
appreciable code saving to be had, and it's not like we do it for any of 
the other sub-formats.



+   bool "Apple DART Descriptor Format"
+   select IOMMU_IO_PGTABLE
+   select IOMMU_IO_PGTABLE_LPAE
+   depends on ARM64 || (COMPILE_TEST && !GENERIC_ATOMIC64)
+   help
+ Enable support for the Apple DART iommu pagetable format.
+ This format is a variant of the ARMv7/v8 Long Descriptor
+ Format specific to Apple's iommu found in their SoCs.
+
+ Say Y here if you have a Apple SoC like the M1 which
+ contains DART iommus.
+
  config IOMMU_IO_PGTABLE_LPAE_SELFTEST
bool "LPAE selftests"
depends on IOMMU_IO_PGTABLE_LPAE
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 87def58e79b5..18674469313d 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -127,6 +127,10 @@
  #define ARM_MALI_LPAE_MEMATTR_IMP_DEF 0x88ULL
  #define ARM_MALI_LPAE_MEMATTR_WRITE_ALLOC 0x8DULL

+/* APPLE_DART_PTE_PROT_NO_WRITE actually maps to ARM_LPAE_PTE_AP_RDONLY  */
+#define APPLE_DART_PTE_PROT_NO_WRITE (1<<7)
Given that there's apparently zero similarity with any of the other 
attributes/permissions, this seems more like a coincidence that probably 
doesn't need to be called out.



+#define APPLE_DART_PTE_PROT_NO_READ (1<<8)
+


Do you have XN permission? How about memory type attributes?


  /* IOPTE accessors */
  #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))

@@ -381,6 +385,17 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
  {
arm_lpae_iopte pte;

+#ifdef CONFIG_IOMMU_IO_PGTABLE_APPLE_DART


As a general tip, prefer IS_ENABLED() to inline #ifdefs.
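
e.g. something like this (untested, but the same logic as the hunk below):

	if (IS_ENABLED(CONFIG_IOMMU_IO_PGTABLE_APPLE_DART) &&
	    data->iop.fmt == ARM_APPLE_DART) {
		pte = 0;
		if (!(prot & IOMMU_WRITE))
			pte |= APPLE_DART_PTE_PROT_NO_WRITE;
		if (!(prot & IOMMU_READ))
			pte |= APPLE_DART_PTE_PROT_NO_READ;
		return pte;
	}

That way the compiler still sees and type-checks the code even when the 
option is disabled, and dead-code elimination drops it.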


+   if (data->iop.fmt == ARM_APPLE_DART) {
+       pte = 0;
+   if (!(prot & IOMMU_WRITE))
+   pte |= APPLE_DART_PTE_PROT_NO_WRITE;
+   if (!(prot & IOMMU_READ))
+   pte |= APPLE_DART_PTE_PROT_NO_READ;
+   return pte;
+   }
+#endif
+
if (data->iop.fmt == ARM_64_LPAE_S1 ||
data->iop.fmt == ARM_32_LPAE_S1) {
pte = ARM_LPAE_PTE_nG;
@@ -1043,6 +1058,54 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, 
void *cookie)
return NULL;
  }

+#ifdef CONFIG_IOMMU_IO_PGTABLE_APPLE_DART


Similarly, prefer __maybe_unused rather than #ifdefing functions if they 
don't contain any config-dependent references.
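
e.g. just

	static __maybe_unused struct io_pgtable *
	apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)

on the definition, with the #ifdef/#endif pair around it dropped.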



+static struct io_pgtable *
+apple_dart_alloc_pgtable(struct io_pgtable_cfg *cfg, void *cookie)
+{
+   struct arm_lpae_io_pgtable *data;
+
+   if (cfg->ias > 38)
+   return NULL;
+   if (cfg->oas > 36)
+   return NULL;
+
+   if (!cfg->coherent_walk)
+   return NULL;


For completeness you should probably also reject any quirks, since none 
of them are applicable either.



+
+   cfg->pgsize_bitmap &= SZ_16K;


No block mappings?


+   if (!cfg->pgsize_bitmap)
+   return NULL;
+
+   data = arm_lpae_alloc_pgtable(cfg);
+   if (!data)
+   return NULL;
+
+   /*
+* the hardware only supports this specific three level pagetable 
layout with
+* the first level being encoded into four hardware registers
+*/
+   data->start_level = ARM_LPAE_MAX_LEVELS - 2;


The comment implies that walks should start at level 1 (for a 3-level 
table), but the code here says (in an unnecessarily roundabout manner) 
level 2 :/


Which is it?


+   data->pgd_bits = 13;


What if ias < 38? Couldn't we get away with only allocating as much as 
we actually need?



+   data->bits_per_level = 11;
+
+   data->pgd = __arm_lpae_alloc_pages(ARM_LPAE_PGD_SIZE(data), GFP_KERNEL,
+  cfg);
+   if (!data->pgd)
+   goto out_free_data;
+
+   

Re: [PATCH 0/3] Apple M1 DART IOMMU driver

2021-03-24 Thread Robin Murphy

On 2021-03-20 15:19, Sven Peter wrote:

Hi,

After Hector's initial work [1] to bring up Linux on Apple's M1 it's time to
bring up more devices. Most peripherals connected to the SoC are behind a iommu
which Apple calls "Device Address Resolution Table", or DART for short [2].
Unfortunately, it only shares the name with PowerPC's DART.
Configuring this iommu is mandatory if these peripherals require DMA access.

This patchset implements initial support for this iommu. The hardware itself
uses a pagetable format that's very similar to the one already implement in
io-pgtable.c. There are some minor modifications, namely some details of the
PTE format and that there are always three pagetable levels, which I've
implement as a new format variant.

I have mainly tested this with the USB controller in device mode which is
compatible with Linux's dwc3 driver. Some custom PHY initialization (which is
not yet ready or fully understood) is required though to bring up the ports,
see e.g. my patches to our m1n1 bootloader [3,4]. If you want to test the same
setup you will probably need that branch for now and add the nodes from
the DT binding specification example to your device tree.

Even though each DART instances could support up to 16 devices usually only
a single device is actually connected. Different devices generally just use
an entirely separate DART instance with a seperate MMIO range, IRQ, etc.

I have just noticed today though that at least the USB DWC3 controller in host
mode uses *two* darts at the same time. I'm not sure yet which parts seem to
require which DART instance.

This means that we might need to support devices attached to two iommus
simultaneously and just create the same iova mappings. Currently this only
seems to be required for USB according to Apple's Device Tree.

I see two options for this and would like to get feedback before
I implement either one:

 1) Change #iommu-cells = <1>; to #iommu-cells = <2>; and use the first cell
to identify the DART and the second one to identify the master.
The DART DT node would then also take two register ranges that would
correspond to the two DARTs. Both instances use the same IRQ and the
same clocks according to Apple's device tree and my experiments.
This would keep a single device node and the DART driver would then
simply map iovas in both DARTs if required.


This is broadly similar to the approach used by rockchip-iommu and the 
special arm-smmu-nvidia implementation, where there are multiple 
instances which require programming identically, that are abstracted 
behind a single "device". Your case is a little different since you're 
not programming both *entirely* identically, although maybe that's a 
possibility if each respective ID isn't used by anything else on the 
"other" DART?


Overall I tend to view this approach as a bit of a hack because it's not 
really describing the hardware truthfully - just because two distinct 
functional blocks have their IRQ lines wired together doesn't suddenly 
make them a single monolithic block with multiple interfaces - and tends 
to be done for the sake of making the driver implementation easier in 
terms of the Linux IOMMU API (which, note, hasn't evolved all that far 
from its PCI-centric origins and isn't exactly great for arbitrary SoC 
topologies).



 2) Keep #iommu-cells as-is but support
 iommus = <_dart1a 1>, <_dart1b 0>;
instead.
This would then require two devices nodes for the two DART instances and
some housekeeping in the DART driver to support mapping iovas in both
DARTs.
I believe omap-iommu.c supports this setup but I will have to read
more code to understand the details there and figure out how to 
implement
this in a sane way.


This approach is arguably the most honest, and more robust in terms of 
making fewer assumptions, and is used by at least exynos-iommu and 
omap-iommu. In Linux it currently takes a little bit more housekeeping 
to keep track of linked instances within the driver since the IOMMU API 
holds the notion that any given client device is associated with "an 
IOMMU", but that's always free to change at any time, unlike the design 
of a DT binding.


There's also the funky "root" and "leaf" devices thing that ipmmu-vmsa 
does, although I believe that represents genuine hardware differences 
where the leaves are more like extra TLBs rather than fully-functional 
IOMMU blocks in their own right, so that may not be a relevant model here.


Robin.


I currently prefer the first option but I don't understand enough details of
the iommu system to actually make an informed decision.
I'm obviously also open to more options :-)


Best regards,


Sven

[1] https://lore.kernel.org/linux-arch/20210304213902.83903-1-mar...@marcan.st/
[2] 

Re: [PATCH v2 12/15] PCI/MSI: Let PCI host bridges declare their reliance on MSI domains

2021-03-23 Thread Robin Murphy

On 2021-03-23 18:09, Marc Zyngier wrote:

Hi Robin,

On Tue, 23 Mar 2021 11:45:02 +,
Robin Murphy  wrote:


On 2021-03-22 18:46, Marc Zyngier wrote:

The new 'no_msi' attribute solves the problem of advertising the lack
of MSI capability for host bridges that know for sure that there will
be no MSI for their end-points.

However, there is a whole class of host bridges that cannot know
whether MSIs will be provided or not, as they rely on other blocks
to provide the MSI functionnality, using MSI domains.  This is
the case for example on systems that use the ARM GIC architecture.

Introduce a new attribute ('msi_domain') indicating that implicit
dependency, and use this property to set the NO_MSI flag when
no MSI domain is found at probe time.

Acked-by: Bjorn Helgaas 
Signed-off-by: Marc Zyngier 
---
   drivers/pci/probe.c | 2 +-
   include/linux/pci.h | 1 +
   2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 146bd85c037e..bac9f69a06a8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -925,7 +925,7 @@ static int pci_register_host_bridge(struct pci_host_bridge 
*bridge)
device_enable_async_suspend(bus->bridge);
pci_set_bus_of_node(bus);
pci_set_bus_msi_domain(bus);
-   if (bridge->no_msi)
+   if (bridge->no_msi || (bridge->msi_domain && !bus->dev.msi_domain))
bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
if (!parent)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 48605cca82ae..d322d00db432 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -551,6 +551,7 @@ struct pci_host_bridge {
unsigned intpreserve_config:1;  /* Preserve FW resource setup */
unsigned intsize_windows:1; /* Enable root bus sizing */
unsigned intno_msi:1;   /* Bridge has no MSI support */
+   unsigned intmsi_domain:1;   /* Bridge wants MSI domain */


Aren't these really the same thing? Either way we're saying the bridge
itself doesn't handle MSIs, it's just in one case we're effectively
encoding a platform-specific assumption that an external domain won't
be provided. I can't help wondering whether that distinction is really
necessary...


There is a subtle difference: no_msi indicates that there is no way
*any* MSI can be dealt with whatsoever (maybe because the RC doesn't
forward the corresponding TLPs?). msi_domain says "no MSI unless...".


PCI says that MSIs are simply memory writes at the transaction level, so 
AFAIK unless the host bridge can't support DMA at all, it shouldn't be 
in a position to make any assumptions about what transactions mean what.


I suppose there could in theory be an issue in the other direction, 
where config space somehow didn't allow access to the MSI capability in 
the first place, but then we'd presumably just never detect any device 
as being MSI-capable in the first place, and it wouldn't matter either way.



We could implement the former with the latter, but I have the feeling
that's not totally bullet proof. Happy to revisit this if you think it
really matters.


I would expect it to be a fairly safe assumption that a host bridge 
which "doesn't support MSIs" wouldn't have an msi-parent set, but I 
don't have a strong opinion either way - I just figured we could 
probably save a little bit of complexity here.


Cheers,
Robin.


Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

2021-03-23 Thread Robin Murphy

On 2021-03-23 13:32, Will Deacon wrote:

On Tue, Mar 23, 2021 at 12:08:56PM +, Robin Murphy wrote:

On 2021-03-23 07:34, Yang Yingliang wrote:

When copying over 128 bytes, src/dst are incremented after
each ldp/stp instruction, which costs more time.
To improve this, we only increment src/dst after loading
or storing 64 bytes.


This breaks the required behaviour for copy_*_user(), since the fault
handler expects the base address to be up-to-date at all times. Say you're
copying 128 bytes and fault on the 4th store, it should return 80 bytes not
copied; the code below would return 128 bytes not copied, even though 48
bytes have actually been written to the destination.
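
To spell the arithmetic out with a minimal sketch - the helper below is purely
illustrative, not the actual arm64 fixup code, which lives in assembly:

    /* hypothetical model of the usercopy fixup: progress is derived from
     * how far dst has advanced past its original value at fault time */
    static size_t bytes_not_copied(char *dst_start, char *dst_now, size_t total)
    {
            return total - (size_t)(dst_now - dst_start);
    }

With the current code dst is bumped after every 16-byte stp, so a fault on the
4th store sees dst_now - dst_start == 48 and correctly reports 128 - 48 = 80;
with the patched code dst is only bumped every 64 bytes, so it still equals
dst_start and 128 gets reported instead.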

We've had a couple of tries at updating this code (because the whole
template is frankly a bit terrible, and a long way from the well-optimised
code it was derived from), but getting the fault-handling behaviour right
without making the handler itself ludicrously complex has proven tricky. And
then it got bumped down the priority list while the uaccess behaviour in
general was in flux - now that the dust has largely settled on that I should
probably try to find time to pick this up again...


I think the v5 from Oli was pretty close, but it didn't get any review:

https://lore.kernel.org/r/20200914151800.2270-1-oli.sw...@arm.com

he also included tests:

https://lore.kernel.org/r/20200916104636.19172-1-oli.sw...@arm.com

It would be great if you or somebody else has time to revive those!


Yeah, we still have a ticket open for it. Since the uaccess overhaul has 
pretty much killed off any remaining value in the template idea, I might 
have a quick go at spinning a series to just update memcpy() and the 
other library routines to their shiny new versions, then come back and 
work on some dedicated usercopy routines built around LDTR/STTR (and the 
desired fault behaviour) as a follow-up.


(I was also holding out hope for copy_in_user() to disappear if we wait 
long enough, but apparently not yet...)


Robin.


Re: [PATCH 3/3] iova: Correct comment for free_cpu_cached_iovas()

2021-03-23 Thread Robin Murphy

On 2021-03-01 12:12, John Garry wrote:

Function free_cpu_cached_iovas() is not only called when a CPU is
hotplugged, so remove that part of the code comment.


FWIW I read it as clarifying why this is broken out into a separate 
function vs. a monolithic "free all cached IOVAs" routine that handles 
both the per-cpu and global caches - it never said "*only* used..."


As such I'd hesitate to call it incorrect, but it's certainly arguable 
whether it needs to be stated or not, especially once the hotplug 
callsite is now obvious in the same file - on which note the function 
itself also shouldn't need to be public any more, no?


Robin.


Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c78312560425..465b3b0eeeb0 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -996,7 +996,7 @@ static void free_iova_rcaches(struct iova_domain *iovad)
  }
  
  /*

- * free all the IOVA ranges cached by a cpu (used when cpu is unplugged)
+ * free all the IOVA ranges cached by a cpu
   */
  void free_cpu_cached_iovas(unsigned int cpu, struct iova_domain *iovad)
  {



Re: [PATCH 1/3] iova: Add CPU hotplug handler to flush rcaches

2021-03-23 Thread Robin Murphy

On 2021-03-01 12:12, John Garry wrote:

Like the intel IOMMU driver already does, flush the per-IOVA domain
CPU rcache when a CPU goes offline - there's no point in keeping it.


Thanks John!

Reviewed-by: Robin Murphy 


Signed-off-by: John Garry 
---
  drivers/iommu/iova.c   | 30 +-
  include/linux/cpuhotplug.h |  1 +
  include/linux/iova.h   |  1 +
  3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..c78312560425 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -25,6 +25,17 @@ static void init_iova_rcaches(struct iova_domain *iovad);
  static void free_iova_rcaches(struct iova_domain *iovad);
  static void fq_destroy_all_entries(struct iova_domain *iovad);
  static void fq_flush_timeout(struct timer_list *t);
+
+static int iova_cpuhp_dead(unsigned int cpu, struct hlist_node *node)
+{
+   struct iova_domain *iovad;
+
+   iovad = hlist_entry_safe(node, struct iova_domain, cpuhp_dead);
+
+   free_cpu_cached_iovas(cpu, iovad);
+   return 0;
+}
+
  static void free_global_cached_iovas(struct iova_domain *iovad);
  
  void

@@ -51,6 +62,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
+   cpuhp_state_add_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD, 
&iovad->cpuhp_dead);
init_iova_rcaches(iovad);
  }
  EXPORT_SYMBOL_GPL(init_iova_domain);
@@ -257,10 +269,21 @@ int iova_cache_get(void)
  {
	mutex_lock(&iova_cache_mutex);
if (!iova_cache_users) {
+   int ret;
+
+   ret = cpuhp_setup_state_multi(CPUHP_IOMMU_IOVA_DEAD, 
"iommu/iova:dead", NULL,
+   iova_cpuhp_dead);
+   if (ret) {
+   mutex_unlock(&iova_cache_mutex);
+   pr_err("Couldn't register cpuhp handler\n");
+   return ret;
+   }
+
iova_cache = kmem_cache_create(
"iommu_iova", sizeof(struct iova), 0,
SLAB_HWCACHE_ALIGN, NULL);
if (!iova_cache) {
+   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
	mutex_unlock(&iova_cache_mutex);
pr_err("Couldn't create iova cache\n");
return -ENOMEM;
@@ -282,8 +305,10 @@ void iova_cache_put(void)
return;
}
iova_cache_users--;
-   if (!iova_cache_users)
+   if (!iova_cache_users) {
+   cpuhp_remove_multi_state(CPUHP_IOMMU_IOVA_DEAD);
kmem_cache_destroy(iova_cache);
+   }
	mutex_unlock(&iova_cache_mutex);
  }
  EXPORT_SYMBOL_GPL(iova_cache_put);
@@ -606,6 +631,9 @@ void put_iova_domain(struct iova_domain *iovad)
  {
struct iova *iova, *tmp;
  
+	cpuhp_state_remove_instance_nocalls(CPUHP_IOMMU_IOVA_DEAD,

+   &iovad->cpuhp_dead);
+
free_iova_flush_queue(iovad);
free_iova_rcaches(iovad);
	rbtree_postorder_for_each_entry_safe(iova, tmp, &iovad->rbroot, node)
diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index f14adb882338..cedac9986557 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -58,6 +58,7 @@ enum cpuhp_state {
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
CPUHP_IOMMU_INTEL_DEAD,
+   CPUHP_IOMMU_IOVA_DEAD,
CPUHP_LUSTRE_CFS_DEAD,
CPUHP_AP_ARM_CACHE_B15_RAC_DEAD,
CPUHP_PADATA_DEAD,
diff --git a/include/linux/iova.h b/include/linux/iova.h
index c834c01c0a5b..4be6c0ab4997 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -95,6 +95,7 @@ struct iova_domain {
   flush-queues */
atomic_t fq_timer_on;   /* 1 when timer is active, 0
   when not */
+   struct hlist_node   cpuhp_dead;
  };
  
  static inline unsigned long iova_size(struct iova *iova)




Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

2021-03-23 Thread Robin Murphy

On 2021-03-23 07:34, Yang Yingliang wrote:

When copying over 128 bytes, src/dst are incremented after
each ldp/stp instruction, which costs more time.
To improve this, we only increment src/dst after loading
or storing 64 bytes.


This breaks the required behaviour for copy_*_user(), since the fault 
handler expects the base address to be up-to-date at all times. Say 
you're copying 128 bytes and fault on the 4th store, it should return 80 
bytes not copied; the code below would return 128 bytes not copied, even 
though 48 bytes have actually been written to the destination.


We've had a couple of tries at updating this code (because the whole 
template is frankly a bit terrible, and a long way from the 
well-optimised code it was derived from), but getting the fault-handling 
behaviour right without making the handler itself ludicrously complex 
has proven tricky. And then it got bumped down the priority list while 
the uaccess behaviour in general was in flux - now that the dust has 
largely settled on that I should probably try to find time to pick this 
up again...


Robin.


Copy 4096 bytes cost on Kunpeng920 (ms):
Without this patch:
memcpy: 143.85 copy_from_user: 172.69 copy_to_user: 199.23

With this patch:
memcpy: 107.12 copy_from_user: 157.50 copy_to_user: 198.85

It's about 25% improvement in memcpy().

Signed-off-by: Yang Yingliang 
---
  arch/arm64/lib/copy_template.S | 36 +++---
  1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/arch/arm64/lib/copy_template.S b/arch/arm64/lib/copy_template.S
index 488df234c49a..c3cd6f84c9c0 100644
--- a/arch/arm64/lib/copy_template.S
+++ b/arch/arm64/lib/copy_template.S
@@ -152,29 +152,33 @@ D_h	.req	x14
	.p2align	L1_CACHE_SHIFT
  .Lcpy_body_large:
/* pre-get 64 bytes data. */
-   ldp1	A_l, A_h, src, #16
-   ldp1	B_l, B_h, src, #16
-   ldp1	C_l, C_h, src, #16
-   ldp1	D_l, D_h, src, #16
+   ldp2	A_l, A_h, src, #0,  #8
+   ldp2	B_l, B_h, src, #16, #24
+   ldp2	C_l, C_h, src, #32, #40
+   ldp2	D_l, D_h, src, #48, #56
+   add src, src, #64
  1:
/*
* interlace the load of next 64 bytes data block with store of the last
* loaded 64 bytes data.
*/
-   stp1	A_l, A_h, dst, #16
-   ldp1	A_l, A_h, src, #16
-   stp1	B_l, B_h, dst, #16
-   ldp1	B_l, B_h, src, #16
-   stp1	C_l, C_h, dst, #16
-   ldp1	C_l, C_h, src, #16
-   stp1	D_l, D_h, dst, #16
-   ldp1	D_l, D_h, src, #16
+   stp2	A_l, A_h, dst, #0,  #8
+   ldp2	A_l, A_h, src, #0,  #8
+   stp2	B_l, B_h, dst, #16, #24
+   ldp2	B_l, B_h, src, #16, #24
+   stp2	C_l, C_h, dst, #32, #40
+   ldp2	C_l, C_h, src, #32, #40
+   stp2	D_l, D_h, dst, #48, #56
+   ldp2	D_l, D_h, src, #48, #56
+   add src, src, #64
+   add dst, dst, #64
	subs	count, count, #64
	b.ge	1b
-   stp1	A_l, A_h, dst, #16
-   stp1	B_l, B_h, dst, #16
-   stp1	C_l, C_h, dst, #16
-   stp1	D_l, D_h, dst, #16
+   stp2	A_l, A_h, dst, #0,  #8
+   stp2	B_l, B_h, dst, #16, #24
+   stp2	C_l, C_h, dst, #32, #40
+   stp2	D_l, D_h, dst, #48, #56
+   add dst, dst, #64
  
  	tst	count, #0x3f

	b.ne	.Ltail63



Re: [PATCH v2 12/15] PCI/MSI: Let PCI host bridges declare their reliance on MSI domains

2021-03-23 Thread Robin Murphy

On 2021-03-22 18:46, Marc Zyngier wrote:

The new 'no_msi' attribute solves the problem of advertising the lack
of MSI capability for host bridges that know for sure that there will
be no MSI for their end-points.

However, there is a whole class of host bridges that cannot know
whether MSIs will be provided or not, as they rely on other blocks
to provide the MSI functionality, using MSI domains.  This is
the case for example on systems that use the ARM GIC architecture.

Introduce a new attribute ('msi_domain') indicating that implicit
dependency, and use this property to set the NO_MSI flag when
no MSI domain is found at probe time.

Acked-by: Bjorn Helgaas 
Signed-off-by: Marc Zyngier 
---
  drivers/pci/probe.c | 2 +-
  include/linux/pci.h | 1 +
  2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 146bd85c037e..bac9f69a06a8 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -925,7 +925,7 @@ static int pci_register_host_bridge(struct pci_host_bridge 
*bridge)
device_enable_async_suspend(bus->bridge);
pci_set_bus_of_node(bus);
pci_set_bus_msi_domain(bus);
-   if (bridge->no_msi)
+   if (bridge->no_msi || (bridge->msi_domain && !bus->dev.msi_domain))
bus->bus_flags |= PCI_BUS_FLAGS_NO_MSI;
  
  	if (!parent)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 48605cca82ae..d322d00db432 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -551,6 +551,7 @@ struct pci_host_bridge {
unsigned intpreserve_config:1;  /* Preserve FW resource setup */
unsigned intsize_windows:1; /* Enable root bus sizing */
unsigned intno_msi:1;   /* Bridge has no MSI support */
+   unsigned intmsi_domain:1;   /* Bridge wants MSI domain */


Aren't these really the same thing? Either way we're saying the bridge 
itself doesn't handle MSIs, it's just in one case we're effectively 
encoding a platform-specific assumption that an external domain won't be 
provided. I can't help wondering whether that distinction is really 
necessary...


Robin.

  
  	/* Resource alignment requirements */

resource_size_t (*align_resource)(struct pci_dev *dev,



Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-19 Thread Robin Murphy

On 2021-03-19 16:58, John Garry wrote:

On 19/03/2021 16:13, Robin Murphy wrote:

On 2021-03-19 13:25, John Garry wrote:

Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.

This is to eventually make it possible to be able to configure the upper
limit of the IOVA rcache range.

Signed-off-by: John Garry 
---
  drivers/iommu/dma-iommu.c |  8 --
  drivers/iommu/iova.c  | 51 ++-
  2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..15b7270a5c2a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,

  shift = iova_shift(iovad);
  iova_len = size >> shift;
-    /*
- * Freeing non-power-of-two-sized allocations back into the IOVA 
caches
- * will come back to bite us badly, so we have to waste a bit of 
space
- * rounding up anything cacheable to make sure that can't 
happen. The

- * order of the unadjusted size will still match upon freeing.
- */
-    if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
-    iova_len = roundup_pow_of_two(iova_len);
  dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..e62e9e30b30c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct 
iova *iova,

  static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
  unsigned long size, unsigned long limit_pfn,
-    struct iova *new, bool size_aligned)
+    struct iova *new, bool size_aligned, bool fast)
  {
  struct rb_node *curr, *prev;
  struct iova *curr_iova;
@@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  unsigned long align_mask = ~0UL;
  unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
+    /*
+ * Freeing non-power-of-two-sized allocations back into the IOVA 
caches
+ * will come back to bite us badly, so we have to waste a bit of 
space
+ * rounding up anything cacheable to make sure that can't 
happen. The

+ * order of the unadjusted size will still match upon freeing.
+ */
+    if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+    size = roundup_pow_of_two(size);


If this transformation is only relevant to alloc_iova_fast(), and we 
have to add a special parameter here to tell whether we were called 
from alloc_iova_fast(), doesn't it seem more sensible to just do it in 
alloc_iova_fast() rather than here?


We have the restriction that anything we put in the rcache needs be a 
power-of-2.


I was really only talking about the apparently silly structure of:

void foo(bool in_bar) {
if (in_bar)
//do thing
...
}
void bar() {
foo(true);
}

vs.:

void foo() {
...
}
void bar() {
//do thing
foo();
}

So then we have the issue of how to dynamically increase this rcache 
threshold. The problem is that we may have many devices associated with 
the same domain. So, in theory, we can't rule out that, after we increase 
the threshold, some other device will try to fast-free an IOVA which 
was allocated prior to the increase and was therefore not rounded up.


I'm very open to better (or less bad) suggestions on how to do this ...


...but yes, regardless of exactly where it happens, rounding up or not 
is the problem for rcaches in general. I've said several times that my 
preferred approach is to not change it that dynamically at all, but 
instead treat it more like we treat the default domain type.


I could say that we only allow this for a group with a single device, so 
these sort of things don't have to be worried about, but even then the 
iommu_group internals are not readily accessible here.




But then the API itself has no strict requirement that a pfn passed to 
free_iova_fast() wasn't originally allocated with alloc_iova(), so 
arguably hiding the adjustment away makes it less clear that the 
responsibility is really on any caller of free_iova_fast() to make 
sure they don't get things wrong.




alloc_iova() doesn't roundup to pow-of-2, so wouldn't it be broken to do 
that?


Well, right now neither call rounds up, which is why iommu-dma takes 
care to avoid any issues by explicitly rounding up for itself 
beforehand. I'm just concerned that giving the impression that the API 
takes care of everything for itself will make it easier to write broken 
code in future, if that impression is in fact not entirely true.


I don't even think it's very likely that someone would manage to hit 
that rather wacky alloc/free pattern either way, I just know that 
getting wrong-sized things into the rcaches is an absolute sod to debug, 
so...
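
For the record, the failure mode I mean looks roughly like this - a purely
hypothetical caller mixing the two APIs, using the existing iova.c entry points:

    /* 5 pages requested; alloc_iova() does not round up */
    iova = alloc_iova(iovad, 5, limit_pfn, true);
    ...
    /* order_base_2(5) == 3, so the pfn lands in the 8-page rcache bucket */
    free_iova_fast(iovad, iova->pfn_lo, 5);
    ...
    /* a later 8-page request may be handed back the cached 5-page range,
     * overlapping whatever gets allocated immediately above it */
    pfn = alloc_iova_fast(iovad, 8, limit_pfn, true);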


Robin.


Re: [PATCH] dt: rockchip: rk3399: Add dynamic power coefficient for GPU

2021-03-19 Thread Robin Murphy

On 2021-03-19 14:35, Daniel Lezcano wrote:


Hi Robin,

On 19/03/2021 13:17, Robin Murphy wrote:

On 2021-03-19 11:05, Daniel Lezcano wrote:

The DTPM framework is looking for upstream SoC candidates to share the
power numbers.

We can see different numbers around, but the one which seems to be
consistent with the initial post for the values on the CPUs can be
found in the patch https://lore.kernel.org/patchwork/patch/810159/


The kernel hacker in me would be more inclined to trust the BSP that the
vendor actively supports than a 5-year-old patch that was never pursued
upstream. Apparently that was last updated more recently:

https://github.com/rockchip-linux/kernel/commit/98d4505e1bd62ff028bd79fbd8284d64b6f468f8


Yes, I've seen this value also.


The ex-mathematician in me can't even comment either way without
evidence that whatever model expects to consume this value is even
comparable to whatever "arm,mali-simple-power-model" is.
The way the
latter apparently needs an explicit "static" coefficient as well as a
"dynamic" one, and the value here being nearly 3 times that of a
similarly-named one in active use downstream (ChromeOS appears to still
be using the values from before the above commit), certainly incline me
to think they may not be...


Sorry, I'm missing the point :/

We dropped in the kernel any static power computation because as there
was no value, the resulting code was considered dead. So we rely on the
dynamic power only.


Right, so a 2-factor model is clearly not identical to a 1-factor model, 
so how do we know that a value for one is valid for the other, even if 
it happens to have a similar name? I'm not saying that it is or isn't; I 
don't know. If someone can point to the downstream coefficient 
definition being identical to the upstream one then great, let's use 
that as justification. If not, then the justification of one arbitrary 
meaningless number over any other is a bit misleading.



I don't know the precision of this value but it is better than
nothing.


But is it? If it leads to some throttling mechanism kicking in and
crippling GPU performance because it's massively overestimating power
consumption, that would be objectively worse for most users, no?


No because there is no sustainable power specified for the thermal zones
related to the GPU.

OK, that's some reassurance at least. Does the exact value have any 
material effect? If not, what's to stop us from using an obviously 
made-up value like 1, and saying so?


Robin.


Re: [PATCH 5/6] dma-mapping/iommu: Add dma_set_max_opt_size()

2021-03-19 Thread Robin Murphy

On 2021-03-19 13:25, John Garry wrote:

Add a function to allow the max size which we want to optimise DMA mappings
for.


It seems neat in theory - particularly for packet-based interfaces that 
might have a known fixed size of data unit that they're working on at 
any given time - but aren't there going to be many cases where the 
driver has no idea because it depends on whatever size(s) of request 
userspace happens to throw at it? Even if it does know the absolute 
maximum size of thing it could ever transfer, that could be 
impractically large in areas like video/AI/etc., so it could still be 
hard to make a reasonable decision.
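
For the packet-based case, I assume the intended usage is something like the
following hypothetical snippet (the driver and size are made up;
dma_set_max_opt_size() is the helper added later in this series):

    static int foo_probe(struct device *dev)
    {
            /* all data units on this interface are a fixed 1 MiB */
            dma_set_max_opt_size(dev, SZ_1M);
            ...
    }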


Being largely workload-dependent is why I still think this should be a 
command-line or sysfs tuneable - we could set the default based on how 
much total memory is available, but ultimately it's the end user who 
knows what the workload is going to be and what they care about 
optimising for.


Another thought (which I'm almost reluctant to share) is that I would 
*love* to try implementing a self-tuning strategy that can detect high 
contention on particular allocation sizes and adjust the caches on the 
fly, but I can easily imagine that having enough inherent overhead to 
end up being an impractical (but fun) waste of time.


Robin.


Signed-off-by: John Garry 
---
  drivers/iommu/dma-iommu.c   |  2 +-
  include/linux/dma-map-ops.h |  1 +
  include/linux/dma-mapping.h |  5 +
  kernel/dma/mapping.c| 11 +++
  4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index a5dfbd6c0496..d35881fcfb9c 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -447,7 +447,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
return (dma_addr_t)iova << shift;
  }
  
-__maybe_unused

  static void iommu_dma_set_opt_size(struct device *dev, size_t size)
  {
struct iommu_domain *domain = iommu_get_dma_domain(dev);
@@ -1278,6 +1277,7 @@ static const struct dma_map_ops iommu_dma_ops = {
.map_resource   = iommu_dma_map_resource,
.unmap_resource = iommu_dma_unmap_resource,
.get_merge_boundary = iommu_dma_get_merge_boundary,
+   .set_max_opt_size   = iommu_dma_set_opt_size,
  };
  
  /*

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 51872e736e7b..fed7a183b3b9 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -64,6 +64,7 @@ struct dma_map_ops {
u64 (*get_required_mask)(struct device *dev);
size_t (*max_mapping_size)(struct device *dev);
unsigned long (*get_merge_boundary)(struct device *dev);
+   void (*set_max_opt_size)(struct device *dev, size_t size);
  };
  
  #ifdef CONFIG_DMA_OPS

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2a984cb4d1e0..91fe770145d4 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -144,6 +144,7 @@ u64 dma_get_required_mask(struct device *dev);
  size_t dma_max_mapping_size(struct device *dev);
  bool dma_need_sync(struct device *dev, dma_addr_t dma_addr);
  unsigned long dma_get_merge_boundary(struct device *dev);
+void dma_set_max_opt_size(struct device *dev, size_t size);
  #else /* CONFIG_HAS_DMA */
  static inline dma_addr_t dma_map_page_attrs(struct device *dev,
struct page *page, size_t offset, size_t size,
@@ -257,6 +258,10 @@ static inline unsigned long dma_get_merge_boundary(struct 
device *dev)
  {
return 0;
  }
+static inline void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+}
+
  #endif /* CONFIG_HAS_DMA */
  
  struct page *dma_alloc_pages(struct device *dev, size_t size,

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..59e6acb1c471 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -608,3 +608,14 @@ unsigned long dma_get_merge_boundary(struct device *dev)
return ops->get_merge_boundary(dev);
  }
  EXPORT_SYMBOL_GPL(dma_get_merge_boundary);
+
+void dma_set_max_opt_size(struct device *dev, size_t size)
+{
+   const struct dma_map_ops *ops = get_dma_ops(dev);
+
+   if (!ops || !ops->set_max_opt_size)
+   return;
+
+   ops->set_max_opt_size(dev, size);
+}
+EXPORT_SYMBOL_GPL(dma_set_max_opt_size);



Re: [PATCH 3/6] iova: Allow rcache range upper limit to be configurable

2021-03-19 Thread Robin Murphy

On 2021-03-19 13:25, John Garry wrote:

Some LLDs may request DMA mappings whose IOVA length exceeds that of the
current rcache upper limit.

This means that allocations for those IOVAs will never be cached, and
always must be allocated and freed from the RB tree per DMA mapping cycle.
This has a significant effect on performance, more so since commit
4e89dce72521 ("iommu/iova: Retry from last rb tree node if iova search
fails"), as discussed at [0].

Allow the range of cached IOVAs to be increased, by providing an API to set
the upper limit, which is capped at IOVA_RANGE_CACHE_MAX_SIZE.

For simplicity, the full range of IOVA rcaches is allocated and initialized
at IOVA domain init time.

Setting the range upper limit will fail if the domain is already live
(before the tree contains IOVA nodes). This must be done to ensure any
IOVAs cached comply with rule of size being a power-of-2.

[0] 
https://lore.kernel.org/linux-iommu/20210129092120.1482-1-thunder.leiz...@huawei.com/

Signed-off-by: John Garry 
---
  drivers/iommu/iova.c | 37 +++--
  include/linux/iova.h | 11 ++-
  2 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index cecc74fb8663..d4f2f7fbbd84 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -49,6 +49,7 @@ init_iova_domain(struct iova_domain *iovad, unsigned long 
granule,
iovad->flush_cb = NULL;
iovad->fq = NULL;
iovad->anchor.pfn_lo = iovad->anchor.pfn_hi = IOVA_ANCHOR;
+   iovad->rcache_max_size = IOVA_RANGE_CACHE_DEFAULT_SIZE;
	rb_link_node(&iovad->anchor.node, NULL, &iovad->rbroot.rb_node);
	rb_insert_color(&iovad->anchor.node, &iovad->rbroot);
init_iova_rcaches(iovad);
@@ -194,7 +195,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
 * rounding up anything cacheable to make sure that can't happen. The
 * order of the unadjusted size will still match upon freeing.
 */
-   if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+   if (fast && size < (1 << (iovad->rcache_max_size - 1)))
size = roundup_pow_of_two(size);
  
  	if (size_aligned)

@@ -901,7 +902,7 @@ static bool iova_rcache_insert(struct iova_domain *iovad, 
unsigned long pfn,
  {
unsigned int log_size = order_base_2(size);
  
-	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)

+   if (log_size >= iovad->rcache_max_size)
return false;
  
	return __iova_rcache_insert(iovad, &iovad->rcaches[log_size], pfn);

@@ -946,6 +947,38 @@ static unsigned long __iova_rcache_get(struct iova_rcache 
*rcache,
return iova_pfn;
  }
  
+void iova_rcache_set_upper_limit(struct iova_domain *iovad,

+unsigned long iova_len)
+{
+   unsigned int rcache_index = order_base_2(iova_len) + 1;
+   struct rb_node *rb_node = &iovad->anchor.node;
+   unsigned long flags;
+   int count = 0;
+
+   spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
+   if (rcache_index <= iovad->rcache_max_size)
+   goto out;
+
+   while (1) {
+   rb_node = rb_prev(rb_node);
+   if (!rb_node)
+   break;
+   count++;
+   }
+
+   /*
+* If there are already IOVA nodes present in the tree, then don't
+* allow range upper limit to be set.
+*/
+   if (count != iovad->reserved_node_count)
+   goto out;
+
+   iovad->rcache_max_size = min_t(unsigned long, rcache_index,
+  IOVA_RANGE_CACHE_MAX_SIZE);
+out:
+   spin_unlock_irqrestore(&iovad->iova_rbtree_lock, flags);
+}
+
  /*
   * Try to satisfy IOVA allocation range from rcache.  Fail if requested
   * size is too big or the DMA limit we are given isn't satisfied by the
diff --git a/include/linux/iova.h b/include/linux/iova.h
index fd3217a605b2..952b81b54ef7 100644
--- a/include/linux/iova.h
+++ b/include/linux/iova.h
@@ -25,7 +25,8 @@ struct iova {
  struct iova_magazine;
  struct iova_cpu_rcache;
  
-#define IOVA_RANGE_CACHE_MAX_SIZE 6	/* log of max cached IOVA range size (in pages) */

+#define IOVA_RANGE_CACHE_DEFAULT_SIZE 6
+#define IOVA_RANGE_CACHE_MAX_SIZE 10 /* log of max cached IOVA range size (in 
pages) */


No.

And why? If we're going to allocate massive caches anyway, whatever is 
the point of *not* using all of them?


It only makes sense for a configuration knob to affect the actual rcache 
and depot allocations - that way, big high-throughput systems with 
plenty of memory can spend it on better performance, while small systems 
- that often need IOMMU scatter-gather precisely *because* memory is 
tight and thus easily fragmented - don't have to pay the (not 
insignificant) cost for caches they don't need.


Robin.


  #define MAX_GLOBAL_MAGS 32	/* magazines per bin */
  
  struct iova_rcache {

@@ -74,6 +75,7 @@ struct iova_domain {
unsigned long   start_pfn;  /* Lower limit for 

Re: [PATCH 1/6] iommu: Move IOVA power-of-2 roundup into allocator

2021-03-19 Thread Robin Murphy

On 2021-03-19 13:25, John Garry wrote:

Move the IOVA size power-of-2 rcache roundup into the IOVA allocator.

This is to eventually make it possible to be able to configure the upper
limit of the IOVA rcache range.

Signed-off-by: John Garry 
---
  drivers/iommu/dma-iommu.c |  8 --
  drivers/iommu/iova.c  | 51 ++-
  2 files changed, 34 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..15b7270a5c2a 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -429,14 +429,6 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
  
  	shift = iova_shift(iovad);

iova_len = size >> shift;
-   /*
-* Freeing non-power-of-two-sized allocations back into the IOVA caches
-* will come back to bite us badly, so we have to waste a bit of space
-* rounding up anything cacheable to make sure that can't happen. The
-* order of the unadjusted size will still match upon freeing.
-*/
-   if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
-   iova_len = roundup_pow_of_two(iova_len);
  
  	dma_limit = min_not_zero(dma_limit, dev->bus_dma_limit);
  
diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c

index e6e2fa85271c..e62e9e30b30c 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -179,7 +179,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
  
  static int __alloc_and_insert_iova_range(struct iova_domain *iovad,

unsigned long size, unsigned long limit_pfn,
-   struct iova *new, bool size_aligned)
+   struct iova *new, bool size_aligned, bool fast)
  {
struct rb_node *curr, *prev;
struct iova *curr_iova;
@@ -188,6 +188,15 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,
unsigned long align_mask = ~0UL;
unsigned long high_pfn = limit_pfn, low_pfn = iovad->start_pfn;
  
+	/*

+* Freeing non-power-of-two-sized allocations back into the IOVA caches
+* will come back to bite us badly, so we have to waste a bit of space
+* rounding up anything cacheable to make sure that can't happen. The
+* order of the unadjusted size will still match upon freeing.
+*/
+   if (fast && size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
+   size = roundup_pow_of_two(size);


If this transformation is only relevant to alloc_iova_fast(), and we 
have to add a special parameter here to tell whether we were called from 
alloc_iova_fast(), doesn't it seem more sensible to just do it in 
alloc_iova_fast() rather than here?


But then the API itself has no strict requirement that a pfn passed to 
free_iova_fast() wasn't originally allocated with alloc_iova(), so 
arguably hiding the adjustment away makes it less clear that the 
responsibility is really on any caller of free_iova_fast() to make sure 
they don't get things wrong.


Robin.


+
if (size_aligned)
align_mask <<= fls_long(size - 1);
  
@@ -288,21 +297,10 @@ void iova_cache_put(void)

  }
  EXPORT_SYMBOL_GPL(iova_cache_put);
  
-/**

- * alloc_iova - allocates an iova
- * @iovad: - iova domain in question
- * @size: - size of page frames to allocate
- * @limit_pfn: - max limit address
- * @size_aligned: - set if size_aligned address range is required
- * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
- * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
- * flag is set then the allocated address iova->pfn_lo will be naturally
- * aligned on roundup_power_of_two(size).
- */
-struct iova *
-alloc_iova(struct iova_domain *iovad, unsigned long size,
+static struct iova *
+__alloc_iova(struct iova_domain *iovad, unsigned long size,
unsigned long limit_pfn,
-   bool size_aligned)
+   bool size_aligned, bool fast)
  {
struct iova *new_iova;
int ret;
@@ -312,7 +310,7 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
return NULL;
  
  	ret = __alloc_and_insert_iova_range(iovad, size, limit_pfn + 1,

-   new_iova, size_aligned);
+   new_iova, size_aligned, fast);
  
  	if (ret) {

free_iova_mem(new_iova);
@@ -321,6 +319,25 @@ alloc_iova(struct iova_domain *iovad, unsigned long size,
  
  	return new_iova;

  }
+
+/**
+ * alloc_iova - allocates an iova
+ * @iovad: - iova domain in question
+ * @size: - size of page frames to allocate
+ * @limit_pfn: - max limit address
+ * @size_aligned: - set if size_aligned address range is required
+ * This function allocates an iova in the range iovad->start_pfn to limit_pfn,
+ * searching top-down from limit_pfn to iovad->start_pfn. If the size_aligned
+ * flag is set then the allocated address iova->pfn_lo will be naturally
+ * aligned on roundup_power_of_two(size).
+ 

[PATCH 2/2] iommu: Streamline registration interface

2021-03-19 Thread Robin Murphy
Rather than have separate opaque setter functions that are easy to
overlook and lead to repetitive boilerplate in drivers, let's pass the
relevant initialisation parameters directly to iommu_device_register().

Signed-off-by: Robin Murphy 
---
 drivers/iommu/amd/init.c|  3 +--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  5 +---
 drivers/iommu/arm/arm-smmu/arm-smmu.c   |  5 +---
 drivers/iommu/arm/arm-smmu/qcom_iommu.c |  5 +---
 drivers/iommu/exynos-iommu.c|  5 +---
 drivers/iommu/fsl_pamu_domain.c |  4 +--
 drivers/iommu/intel/dmar.c  |  4 +--
 drivers/iommu/intel/iommu.c |  3 +--
 drivers/iommu/iommu.c   |  7 -
 drivers/iommu/ipmmu-vmsa.c  |  6 +
 drivers/iommu/msm_iommu.c   |  5 +---
 drivers/iommu/mtk_iommu.c   |  5 +---
 drivers/iommu/mtk_iommu_v1.c|  4 +--
 drivers/iommu/omap-iommu.c  |  5 +---
 drivers/iommu/rockchip-iommu.c  |  5 +---
 drivers/iommu/s390-iommu.c  |  4 +--
 drivers/iommu/sprd-iommu.c  |  5 +---
 drivers/iommu/sun50i-iommu.c|  5 +---
 drivers/iommu/tegra-gart.c  |  5 +---
 drivers/iommu/tegra-smmu.c  |  5 +---
 drivers/iommu/virtio-iommu.c|  5 +---
 include/linux/iommu.h   | 29 -
 22 files changed, 31 insertions(+), 98 deletions(-)

diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c
index 321f5906e6ed..e1ef922d9f8f 100644
--- a/drivers/iommu/amd/init.c
+++ b/drivers/iommu/amd/init.c
@@ -1935,8 +1935,7 @@ static int __init iommu_init_pci(struct amd_iommu *iommu)
 
	iommu_device_sysfs_add(&iommu->iommu, &iommu->dev->dev,
	   amd_iommu_groups, "ivhd%d", iommu->index);
-   iommu_device_set_ops(&iommu->iommu, &amd_iommu_ops);
-   iommu_device_register(&iommu->iommu);
+   iommu_device_register(&iommu->iommu, &amd_iommu_ops, NULL);
 
return pci_enable_device(iommu->dev);
 }
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b82000519af6..ecc6cfe3ae90 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3621,10 +3621,7 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
if (ret)
return ret;
 
-   iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
-   iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
-
-   ret = iommu_device_register(&smmu->iommu);
+   ret = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
return ret;
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 11ca963c4b93..0a697cb0d2f8 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -,10 +,7 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
return err;
}
 
-   iommu_device_set_ops(&smmu->iommu, &arm_smmu_ops);
-   iommu_device_set_fwnode(&smmu->iommu, dev->fwnode);
-
-   err = iommu_device_register(&smmu->iommu);
+   err = iommu_device_register(&smmu->iommu, &arm_smmu_ops, dev);
if (err) {
dev_err(dev, "Failed to register iommu\n");
return err;
diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c 
b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
index 7f280c8d5c53..4294abe389b2 100644
--- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c
+++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c
@@ -847,10 +847,7 @@ static int qcom_iommu_device_probe(struct platform_device 
*pdev)
return ret;
}
 
-   iommu_device_set_ops(&qcom_iommu->iommu, &qcom_iommu_ops);
-   iommu_device_set_fwnode(&qcom_iommu->iommu, dev->fwnode);
-
-   ret = iommu_device_register(&qcom_iommu->iommu);
+   ret = iommu_device_register(&qcom_iommu->iommu, &qcom_iommu_ops, dev);
if (ret) {
dev_err(dev, "Failed to register iommu\n");
return ret;
diff --git a/drivers/iommu/exynos-iommu.c b/drivers/iommu/exynos-iommu.c
index de324b4eedfe..f887c3e111c1 100644
--- a/drivers/iommu/exynos-iommu.c
+++ b/drivers/iommu/exynos-iommu.c
@@ -630,10 +630,7 @@ static int exynos_sysmmu_probe(struct platform_device 
*pdev)
if (ret)
return ret;
 
-   iommu_device_set_ops(&data->iommu, &exynos_iommu_ops);
-   iommu_device_set_fwnode(&data->iommu, &dev->of_node->fwnode);
-
-   ret = iommu_device_register(&data->iommu);
+   ret = iommu_device_register(&data->iommu, &exynos_iommu_ops, dev);
if (ret)
return ret;
 
diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index b2110767caf4..1a15bd5da358 100644
---

[PATCH 1/2] iommu: Statically set module owner

2021-03-19 Thread Robin Murphy
It happens that the 3 drivers which first supported being modular are
also ones which play games with their pgsize_bitmap, so have non-const
iommu_ops where dynamically setting the owner manages to work out OK.
However, it's less than ideal to force that upon all drivers which want
to be modular - like the new sprd-iommu driver which now has a potential
bug in that regard - so let's just statically set the module owner and
let ops remain const wherever possible.

Signed-off-by: Robin Murphy 
---

This is something I hadn't got round to sending earlier, so now rebased
onto iommu/next to accommodate the new driver :)

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 1 +
 drivers/iommu/arm/arm-smmu/arm-smmu.c   | 1 +
 drivers/iommu/sprd-iommu.c  | 1 +
 drivers/iommu/virtio-iommu.c| 1 +
 include/linux/iommu.h   | 9 +
 5 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8594b4a83043..b82000519af6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2632,6 +2632,7 @@ static struct iommu_ops arm_smmu_ops = {
.sva_unbind = arm_smmu_sva_unbind,
.sva_get_pasid  = arm_smmu_sva_get_pasid,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
+   .owner  = THIS_MODULE,
 };
 
 /* Probing and initialisation functions */
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..11ca963c4b93 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1638,6 +1638,7 @@ static struct iommu_ops arm_smmu_ops = {
.put_resv_regions   = generic_iommu_put_resv_regions,
.def_domain_type= arm_smmu_def_domain_type,
.pgsize_bitmap  = -1UL, /* Restricted during device attach */
+   .owner  = THIS_MODULE,
 };
 
 static void arm_smmu_device_reset(struct arm_smmu_device *smmu)
diff --git a/drivers/iommu/sprd-iommu.c b/drivers/iommu/sprd-iommu.c
index 7100ed17dcce..024a0cdd26a6 100644
--- a/drivers/iommu/sprd-iommu.c
+++ b/drivers/iommu/sprd-iommu.c
@@ -436,6 +436,7 @@ static const struct iommu_ops sprd_iommu_ops = {
.device_group   = sprd_iommu_device_group,
.of_xlate   = sprd_iommu_of_xlate,
.pgsize_bitmap  = ~0UL << SPRD_IOMMU_PAGE_SHIFT,
+   .owner  = THIS_MODULE,
 };
 
 static const struct of_device_id sprd_iommu_of_match[] = {
diff --git a/drivers/iommu/virtio-iommu.c b/drivers/iommu/virtio-iommu.c
index 2bfdd5734844..594ed827e944 100644
--- a/drivers/iommu/virtio-iommu.c
+++ b/drivers/iommu/virtio-iommu.c
@@ -945,6 +945,7 @@ static struct iommu_ops viommu_ops = {
.get_resv_regions   = viommu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
.of_xlate   = viommu_of_xlate,
+   .owner  = THIS_MODULE,
 };
 
 static int viommu_init_vqs(struct viommu_dev *viommu)
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5e7fe519430a..dce8c5e12ea0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -379,19 +379,12 @@ int  iommu_device_link(struct iommu_device   *iommu, 
struct device *link);
 void iommu_device_unlink(struct iommu_device *iommu, struct device *link);
 int iommu_deferred_attach(struct device *dev, struct iommu_domain *domain);
 
-static inline void __iommu_device_set_ops(struct iommu_device *iommu,
+static inline void iommu_device_set_ops(struct iommu_device *iommu,
  const struct iommu_ops *ops)
 {
iommu->ops = ops;
 }
 
-#define iommu_device_set_ops(iommu, ops)   \
-do {   \
-   struct iommu_ops *__ops = (struct iommu_ops *)(ops);\
-   __ops->owner = THIS_MODULE; \
-   __iommu_device_set_ops(iommu, __ops);   \
-} while (0)
-
 static inline void iommu_device_set_fwnode(struct iommu_device *iommu,
   struct fwnode_handle *fwnode)
 {
-- 
2.21.0.dirty



Re: [PATCH] dt: rockchip: rk3399: Add dynamic power coefficient for GPU

2021-03-19 Thread Robin Murphy

On 2021-03-19 11:05, Daniel Lezcano wrote:

The DTPM framework is looking for upstream SoC candidates to share the
power numbers.

We can see different numbers around, but the one which seems to be
consistent with the initial post for the values on the CPUs can be
found in the patch https://lore.kernel.org/patchwork/patch/810159/


The kernel hacker in me would be more inclined to trust the BSP that the 
vendor actively supports than a 5-year-old patch that was never pursued 
upstream. Apparently that was last updated more recently:


https://github.com/rockchip-linux/kernel/commit/98d4505e1bd62ff028bd79fbd8284d64b6f468f8

The ex-mathematician in me can't even comment either way without 
evidence that whatever model expects to consume this value is even 
comparable to whatever "arm,mali-simple-power-model" is. The way the 
latter apparently needs an explicit "static" coefficient as well as a 
"dynamic" one, and the value here being nearly 3 times that of a 
similarly-named one in active use downstream (ChromeOS appears to still 
be using the values from before the above commit), certainly incline me 
to think they may not be...



I don't know the precision of this value but it is better than
nothing.


But is it? If it leads to some throttling mechanism kicking in and 
crippling GPU performance because it's massively overestimating power 
consumption, that would be objectively worse for most users, no?


Robin.


Hopefully, one day SoC vendors will be more generous with the power
numbers, at least for the SoCs which are from the previous generation,
and give the community the opportunity to develop power-based
frameworks.
---
  arch/arm64/boot/dts/rockchip/rk3399.dtsi | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/arm64/boot/dts/rockchip/rk3399.dtsi 
b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
index edbbf35fe19e..1ab1d293d2e9 100644
--- a/arch/arm64/boot/dts/rockchip/rk3399.dtsi
+++ b/arch/arm64/boot/dts/rockchip/rk3399.dtsi
@@ -1933,6 +1933,7 @@
interrupt-names = "job", "mmu", "gpu";
	clocks = <&cru ACLK_GPU>;
	#cooling-cells = <2>;
+   dynamic-power-coefficient = <977>;
	power-domains = <&power RK3399_PD_GPU>;
status = "disabled";
};



Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 21:31, Florian Fainelli wrote:



On 3/18/2021 12:53 PM, Robin Murphy wrote:

On 2021-03-18 19:43, Florian Fainelli wrote:



On 3/18/2021 12:34 PM, Robin Murphy wrote:

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or
when a
platform is known not to have any DRAM addressing limitations
whatsoever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've
really ironed out *all* the awkward corners that used to blow up if
various internal bits were left uninitialised, then it would make sense
to just tweak the implementation of what we already have.


swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
swiotlb, however what I am also after is reclaiming these 64MB of
default SWIOTLB bounce buffering memory because my systems run with
large amounts of reserved memory into ZONE_MOVABLE and everything in
ZONE_NORMAL is precious at that point.


It also forces io_tlb_nslabs to the minimum, so it should be claiming
considerably less than 64MB. IIRC the original proposal *did* skip
initialisation completely, but that turned up the aforementioned issues.


AFAICT in that case we will have io_tlb_nslabs set to 1, which will
still make us allocate io_tlb_nslabs << IO_TLB_SHIFT bytes in
swiotlb_init(), which still gives us 64MB.


Eh? When did 2KB become 64MB? IO_TLB_SHIFT is 11, so that's at most one 
page in anyone's money...
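
Spelling it out, with io_tlb_nslabs forced to 1 by "swiotlb=noforce":

    bytes = io_tlb_nslabs << IO_TLB_SHIFT;  /* 1 << 11 = 2048 bytes */

i.e. a single 2KB slab, rounded up to no more than a page when it's actually
allocated - nowhere near the 64MB default.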



I wouldn't necessarily disagree with adding "off" as an additional alias
for "noforce", though, since it does come across as a bit wacky for
general use.


Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
  swiotlb_init(1)


Modulo "swiotlb=force", of course ;)


Indeed, we would need to handle that case as well. Does it sound
reasonable to do that to you as well?


I wouldn't like it done to me personally, but for arm64, observe what
mem_init() in arch/arm64/mm/init.c already does.


In fact I should have looked more closely at that myself - checking 
debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and 
indeed we are bypassing initialisation completely and (ab)using 
SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now 
for the noforce option to do the same for itself and save even that one 
page.
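
For reference, a rough paraphrase of what arm64's mem_init() does (simplified;
the exact limit variable and extra conditions may differ by kernel version):

    if (swiotlb_force == SWIOTLB_FORCE ||
        max_pfn > PFN_DOWN(arm64_dma_phys_limit))
            swiotlb_init(1);
    else
            swiotlb_force = SWIOTLB_NO_FORCE;   /* skip the bounce buffer */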


Robin.


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 19:43, Florian Fainelli wrote:



On 3/18/2021 12:34 PM, Robin Murphy wrote:

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or when a
platform is known not to have any DRAM addressing limitations
whatsoever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've
really ironed out *all* the awkward corners that used to blow up if
various internal bits were left uninitialised, then it would make sense
to just tweak the implementation of what we already have.


swiotlb=noforce does prevent dma_direct_map_page() from resorting to the
swiotlb, however what I am also after is reclaiming these 64MB of
default SWIOTLB bounce buffering memory because my systems run with
large amounts of reserved memory into ZONE_MOVABLE and everything in
ZONE_NORMAL is precious at that point.


It also forces io_tlb_nslabs to the minimum, so it should be claiming 
considerably less than 64MB. IIRC the original proposal *did* skip 
initialisation completely, but that turned up the aforementioned issues.



I wouldn't necessarily disagree with adding "off" as an additional alias
for "noforce", though, since it does come across as a bit wacky for
general use.


Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
 swiotlb_init(1)


Modulo "swiotlb=force", of course ;)


Indeed, we would need to handle that case as well. Does it sound
reasonable to do that to you as well?


I wouldn't like it done to me personally, but for arm64, observe what 
mem_init() in arch/arm64/mm/init.c already does.


Robin.


Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB

2021-03-18 Thread Robin Murphy

On 2021-03-18 19:22, Florian Fainelli wrote:



On 3/18/2021 12:18 PM, Florian Fainelli wrote:

It may be useful to disable the SWIOTLB completely for testing or when a
platform is known not to have any DRAM addressing limitations
whatsoever.


Isn't that what "swiotlb=noforce" is for? If you're confident that we've 
really ironed out *all* the awkward corners that used to blow up if 
various internal bits were left uninitialised, then it would make sense 
to just tweak the implementation of what we already have.


I wouldn't necessarily disagree with adding "off" as an additional alias 
for "noforce", though, since it does come across as a bit wacky for 
general use.



Signed-off-by: Florian Fainelli 


Christoph, in addition to this change, how would you feel if we
qualified the swiotlb_init() in arch/arm/mm/init.c with a:


if (memblock_end_of_DRAM() >= SZ_4G)
swiotlb_init(1)


Modulo "swiotlb=force", of course ;)

Robin.


right now this is made unconditional whenever ARM_LPAE is enabled, which
is the case for the platforms I maintain (ARCH_BRCMSTB); however, we do
not really need a SWIOTLB so long as the largest DRAM physical address
does not exceed 4GB, AFAICT.

Thanks!


---
  Documentation/admin-guide/kernel-parameters.txt | 1 +
  include/linux/swiotlb.h | 1 +
  kernel/dma/swiotlb.c| 9 +
  3 files changed, 11 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..b0223e48921e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5278,6 +5278,7 @@
force -- force using of bounce buffers even if they
 wouldn't be automatically used by the kernel
noforce -- Never use bounce buffers (for debugging)
+   off -- Completely disable SWIOTLB
  
  	switches=	[HW,M68k]
  
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h

index 5857a937c637..23f86243defe 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -15,6 +15,7 @@ enum swiotlb_force {
SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */
SWIOTLB_FORCE,  /* swiotlb=force */
SWIOTLB_NO_FORCE,   /* swiotlb=noforce */
+   SWIOTLB_OFF,/* swiotlb=off */
  };
  
  /*

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c10e855a03bc..d7a4a789c7d3 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str)
} else if (!strcmp(str, "noforce")) {
swiotlb_force = SWIOTLB_NO_FORCE;
io_tlb_nslabs = 1;
+   } else if (!strcmp(str, "off")) {
+   swiotlb_force = SWIOTLB_OFF;
}
  
  	return 0;

@@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long 
nslabs, int verbose)
unsigned long i, bytes;
size_t alloc_size;
  
+	if (swiotlb_force == SWIOTLB_OFF)

+   return 0;
+
bytes = nslabs << IO_TLB_SHIFT;
  
  	io_tlb_nslabs = nslabs;

@@ -284,6 +289,9 @@ swiotlb_init(int verbose)
unsigned char *vstart;
unsigned long bytes;
  
+	if (swiotlb_force == SWIOTLB_OFF)

+   goto out;
+
if (!io_tlb_nslabs) {
io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
@@ -302,6 +310,7 @@ swiotlb_init(int verbose)
io_tlb_start = 0;
}
pr_warn("Cannot allocate buffer");
+out:
no_iotlb_memory = true;
  }
  





Re: [PATCH 2/2] iommu/iova: Improve restart logic

2021-03-18 Thread Robin Murphy

On 2021-03-18 11:38, John Garry wrote:

On 10/03/2021 17:47, John Garry wrote:

On 09/03/2021 15:55, John Garry wrote:

On 05/03/2021 16:35, Robin Murphy wrote:

Hi Robin,


When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of 
the

opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. 


Right


We can improve on that by implementing
a proper tree traversal to find the first node above the relevant 
limit,

and set the exact start point.


Thanks for this. However, as mentioned in the other thread, this does 
not help our performance regression Re: commit 4e89dce72521.


And mentioning this "retry" approach again, in case it was missed, 
from my experiment on the affected HW, it also has generally a 
dreadfully low success rate - less than 1% for the 32b range retry. 
Retry rate is about 20%. So I am saying from this 20%, less than 1% 
of those succeed.


Well yeah, in your particular case you're allocating from a heavily 
over-contended address space, so much of the time it is genuinely full. 
Plus you're primarily churning one or two sizes of IOVA, so there's a 
high chance that you will either allocate immediately from the cached 
node (after a previous free), or search the whole space and fail. In 
case it was missed, searching only some arbitrary subset of the space 
before giving up is not a good behaviour for an allocator to have in 
general.


So since the retry means that we search through the complete pfn range 
most of the time (due to poor success rate), we should be able to do a 
better job at maintaining an accurate max alloc size, by calculating 
it from the range search, and not relying on max alloc failed or 
resetting it frequently. Hopefully that would mean that we're smarter 
about not trying the allocation.


So I tried that out and we seem to be able to scrape back an appreciable 
amount of performance. Maybe 80% of original, with another change, 
below.


TBH if you really want to make allocation more efficient I think there 
are more radical changes that would be worth experimenting with, like 
using some form of augmented rbtree to also encode the amount of free 
space under each branch, or representing the free space in its own 
parallel tree, or whether some other structure entirely might be a 
better bet these days.
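
As a purely illustrative sketch of the first idea (nothing like this exists in
iova.c today), each node would additionally track the largest free gap in its
subtree, so a search can prune subtrees that cannot possibly fit the request:

    struct iova_aug {
            struct rb_node  node;
            unsigned long   pfn_lo;
            unsigned long   pfn_hi;
            unsigned long   subtree_max_gap;    /* maintained on insert/erase */
    };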


And if you just want to make your thing acceptably fast, now I'm going 
to say stick a quirk somewhere to force the "forcedac" option on your 
platform ;)


[...]
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct 
iova_domain *iovad,

  if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
  high_pfn = limit_pfn;
  low_pfn = retry_pfn;
-    curr = &iovad->anchor.node;
+    curr = iova_find_limit(iovad, limit_pfn);



I see that it is now applied. However, alternatively could we just add a 
zero-length 32b boundary marker node for the 32b pfn restart point?


That would need special cases all over the place to prevent the marker 
getting merged into reservations or hit by lookups, and at worst break 
the ordering of the tree if a legitimate node straddles the boundary. I 
did consider having the insert/delete routines keep track of yet another 
cached node for whatever's currently the first thing above the 32-bit 
boundary, but I was worried that might be a bit too invasive.


FWIW I'm currently planning to come back to this again when I have a bit 
more time, since the optimum thing to do (modulo replacing the entire 
algorithm...) is actually to make the second part of the search 
*upwards* from the cached node to the limit. Furthermore, to revive my 
arch/arm conversion I think we're realistically going to need a 
compatibility option for bottom-up allocation to avoid too many nasty 
surprises, so I'd like to generalise things to tackle both concerns at once.


Thanks,
Robin.


Re: [PATCH v3 2/2] rockchip: rk3399: Add support for FriendlyARM NanoPi R4S

2021-03-15 Thread Robin Murphy

On 2021-03-13 13:22, CN_SZTL wrote:

Robin Murphy  wrote on Saturday, 13 March 2021 at 19:55:


On 2021-03-13 03:25, Tianling Shen wrote:

This adds support for the NanoPi R4S from FriendlyArm.

Rockchip RK3399 SoC
1GB DDR3 or 4GB LPDDR4 RAM
Gigabit Ethernet (WAN)
Gigabit Ethernet (PCIe) (LAN)
USB 3.0 Port x 2
MicroSD slot
Reset button
WAN - LAN - SYS LED

[initial DTS file]
Co-developed-by: Jensen Huang 
Signed-off-by: Jensen Huang 
[minor adjustments]
Co-developed-by: Marty Jones 
Signed-off-by: Marty Jones 
[fixed format issues]
Signed-off-by: Tianling Shen 

Reported-by: kernel test robot 
---
   arch/arm64/boot/dts/rockchip/Makefile |   1 +
   .../boot/dts/rockchip/rk3399-nanopi-r4s.dts   | 179 ++
   2 files changed, 180 insertions(+)
   create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts

diff --git a/arch/arm64/boot/dts/rockchip/Makefile 
b/arch/arm64/boot/dts/rockchip/Makefile
index 62d3abc17a24..c3e00c0e2db7 100644
--- a/arch/arm64/boot/dts/rockchip/Makefile
+++ b/arch/arm64/boot/dts/rockchip/Makefile
@@ -36,6 +36,7 @@ dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopc-t4.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-m4.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-m4b.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-neo4.dtb
+dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-r4s.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-orangepi.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-pinebook-pro.dtb
   dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-puma-haikou.dtb
diff --git a/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts 
b/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts
new file mode 100644
index ..41b3d5c5043c
--- /dev/null
+++ b/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+/*
+ * FriendlyElec NanoPC-T4 board device tree source
+ *
+ * Copyright (c) 2020 FriendlyElec Computer Tech. Co., Ltd.
+ * (http://www.friendlyarm.com)
+ *
+ * Copyright (c) 2018 Collabora Ltd.
+ *
+ * Copyright (c) 2020 Jensen Huang 
+ * Copyright (c) 2020 Marty Jones 
+ * Copyright (c) 2021 Tianling Shen 
+ */
+
+/dts-v1/;
+#include "rk3399-nanopi4.dtsi"
+
+/ {
+ model = "FriendlyElec NanoPi R4S";
+ compatible = "friendlyarm,nanopi-r4s", "rockchip,rk3399";
+
+ /delete-node/ gpio-leds;


Why? You could justify deleting &status_led, but redefining the whole
node from scratch seems unnecessary.


First of all, thank you for reviewing, and sorry for my poor English.

I need to redefine `pinctrl-0`, but if I use `/delete-property/
pinctrl-0;`, it will throw an error,
so maybe I made a mistake? And I will try again...


You don't need to delete the property itself though - simply specifying 
it replaces whatever previous value was inherited from the DTSI. Think 
about how all those "status = ..." lines work, for example.


Similarly, given that you're redefining the led-0 node anyway you 
wouldn't really *need* to delete that either; doing so just avoids the 
extra &status_led label hanging around if the DTB is built with symbols, 
and saves having to explicitly override/delete the default trigger 
property if necessary.
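
To illustrate (a sketch only - the &leds and &status_led labels and the
*_led_pin pinctrl names are assumed from the dtsi and the patch context,
so treat them as placeholders): re-specifying a property by label in the
board .dts simply replaces the value inherited from the include, with no
/delete-node/ required.

/* rk3399-nanopi-r4s.dts - hypothetical alternative to /delete-node/ */
&leds {
	/* Overrides the pinctrl-0 value set in rk3399-nanopi4.dtsi */
	pinctrl-0 = <&lan_led_pin>, <&sys_led_pin>, <&wan_led_pin>;
};

&status_led {
	label = "nanopi-r4s:red:sys";
	default-state = "on";
	/delete-property/ linux,default-trigger;
};

The lan and wan LEDs would still be added as new child nodes of &leds;
only what actually changes needs to be written out.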



+ gpio-leds {
+ compatible = "gpio-leds";
+ pinctrl-0 = <_led_pin>, <_led_pin>, <_led_pin>;
+ pinctrl-names = "default";
+
+ lan_led: led-0 {
+ gpios = < RK_PA1 GPIO_ACTIVE_HIGH>;
+ label = "nanopi-r4s:green:lan";
+ };
+
+ sys_led: led-1 {
+ gpios = < RK_PB5 GPIO_ACTIVE_HIGH>;
+ label = "nanopi-r4s:red:sys";
+ default-state = "on";
+ };
+
+ wan_led: led-2 {
+ gpios = < RK_PA0 GPIO_ACTIVE_HIGH>;
+ label = "nanopi-r4s:green:wan";
+ };


Nit: (apologies for overlooking it before) there isn't an obvious 
definitive order for the LEDs, but the order here is certainly not 
consistent with anything. The most logical would probably be sys, wan, 
lan since that's both in order of GPIO number and how they are 
physically positioned relative to each other on the board/case (although 
you could also argue for wan, lan, sys in that regard, depending on how 
you look at it).



+ };
+
+ /delete-node/ gpio-keys;


Ditto - just removing the power key node itself should suffice.


Just like gpio-leds.



+ gpio-keys {
+ compatible = "gpio-keys";
+ pinctrl-names = "default";
+ pinctrl-0 = <_button_pin>;
+
+ reset {
+ debounce-interval = <50>;
+ gpios = < RK_PC6 GPIO_ACTIVE_LOW>;
+ label = "reset";
+ linux,code = ;
+ };
+ };
+
+   

Re: [PATCH] arm64: csum: cast to the proper type

2021-03-15 Thread Robin Murphy

On 2021-03-15 01:26, Alex Elder wrote:

The last line of ip_fast_csum() calls csum_fold(), forcing the
type of the argument passed to be u32.  But csum_fold() takes a
__wsum argument (which is __u32 __bitwise for arm64).  As long
as we're forcing the cast, cast it to the right type.


Oddly, the commit adding the cast does specifically speak about 
converting to __wsum, so I'm not sure what happened there... :/
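
(For background, __wsum is one of sparse's __bitwise-checked types, which
is why the __force cast has to target it rather than a plain u32:)

/* include/uapi/linux/types.h */
typedef __u16 __bitwise __sum16;
typedef __u32 __bitwise __wsum;

/* hence the corrected final line of ip_fast_csum(): */
return csum_fold((__force __wsum)(sum >> 32));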


Anyway, this seems to make sense.

Acked-by: Robin Murphy 


Signed-off-by: Alex Elder 
---

With this patch in place, quite a few "different base types" sparse
warnings go away on a full arm64 kernel build.  More specifically:
   warning: incorrect type in argument 1 (different base types)
  expected restricted __wsum [usertype] csum
  got unsigned int [usertype]

-Alex

  arch/arm64/include/asm/checksum.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/checksum.h 
b/arch/arm64/include/asm/checksum.h
index 93a161b3bf3fe..dc52b733675db 100644
--- a/arch/arm64/include/asm/checksum.h
+++ b/arch/arm64/include/asm/checksum.h
@@ -37,7 +37,7 @@ static inline __sum16 ip_fast_csum(const void *iph, unsigned 
int ihl)
} while (--n > 0);
  
  	sum += ((sum >> 32) | (sum << 32));

-   return csum_fold((__force u32)(sum >> 32));
+   return csum_fold((__force __wsum)(sum >> 32));
  }
  #define ip_fast_csum ip_fast_csum
  



Re: [PATCH v3 2/2] rockchip: rk3399: Add support for FriendlyARM NanoPi R4S

2021-03-13 Thread Robin Murphy

On 2021-03-13 03:25, Tianling Shen wrote:

This adds support for the NanoPi R4S from FriendlyArm.

Rockchip RK3399 SoC
1GB DDR3 or 4GB LPDDR4 RAM
Gigabit Ethernet (WAN)
Gigabit Ethernet (PCIe) (LAN)
USB 3.0 Port x 2
MicroSD slot
Reset button
WAN - LAN - SYS LED

[initial DTS file]
Co-developed-by: Jensen Huang 
Signed-off-by: Jensen Huang 
[minor adjustments]
Co-developed-by: Marty Jones 
Signed-off-by: Marty Jones 
[fixed format issues]
Signed-off-by: Tianling Shen 

Reported-by: kernel test robot 
---
  arch/arm64/boot/dts/rockchip/Makefile |   1 +
  .../boot/dts/rockchip/rk3399-nanopi-r4s.dts   | 179 ++
  2 files changed, 180 insertions(+)
  create mode 100644 arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts

diff --git a/arch/arm64/boot/dts/rockchip/Makefile 
b/arch/arm64/boot/dts/rockchip/Makefile
index 62d3abc17a24..c3e00c0e2db7 100644
--- a/arch/arm64/boot/dts/rockchip/Makefile
+++ b/arch/arm64/boot/dts/rockchip/Makefile
@@ -36,6 +36,7 @@ dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopc-t4.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-m4.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-m4b.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-neo4.dtb
+dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-nanopi-r4s.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-orangepi.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-pinebook-pro.dtb
  dtb-$(CONFIG_ARCH_ROCKCHIP) += rk3399-puma-haikou.dtb
diff --git a/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts 
b/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts
new file mode 100644
index ..41b3d5c5043c
--- /dev/null
+++ b/arch/arm64/boot/dts/rockchip/rk3399-nanopi-r4s.dts
@@ -0,0 +1,179 @@
+// SPDX-License-Identifier: (GPL-2.0+ OR MIT)
+/*
+ * FriendlyElec NanoPC-T4 board device tree source
+ *
+ * Copyright (c) 2020 FriendlyElec Computer Tech. Co., Ltd.
+ * (http://www.friendlyarm.com)
+ *
+ * Copyright (c) 2018 Collabora Ltd.
+ *
+ * Copyright (c) 2020 Jensen Huang 
+ * Copyright (c) 2020 Marty Jones 
+ * Copyright (c) 2021 Tianling Shen 
+ */
+
+/dts-v1/;
+#include "rk3399-nanopi4.dtsi"
+
+/ {
+   model = "FriendlyElec NanoPi R4S";
+   compatible = "friendlyarm,nanopi-r4s", "rockchip,rk3399";
+
+   /delete-node/ gpio-leds;


Why? You could justify deleting &status_led, but redefining the whole 
node from scratch seems unnecessary.



+   gpio-leds {
+   compatible = "gpio-leds";
+   pinctrl-0 = <_led_pin>, <_led_pin>, <_led_pin>;
+   pinctrl-names = "default";
+
+   lan_led: led-0 {
+   gpios = < RK_PA1 GPIO_ACTIVE_HIGH>;
+   label = "nanopi-r4s:green:lan";
+   };
+
+   sys_led: led-1 {
+   gpios = < RK_PB5 GPIO_ACTIVE_HIGH>;
+   label = "nanopi-r4s:red:sys";
+   default-state = "on";
+   };
+
+   wan_led: led-2 {
+   gpios = < RK_PA0 GPIO_ACTIVE_HIGH>;
+   label = "nanopi-r4s:green:wan";
+   };
+   };
+
+   /delete-node/ gpio-keys;


Ditto - just removing the power key node itself should suffice.


+   gpio-keys {
+   compatible = "gpio-keys";
+   pinctrl-names = "default";
+   pinctrl-0 = <_button_pin>;
+
+   reset {
+   debounce-interval = <50>;
+   gpios = < RK_PC6 GPIO_ACTIVE_LOW>;
+   label = "reset";
+   linux,code = ;
+   };
+   };
+
+   vdd_5v: vdd-5v {
+   compatible = "regulator-fixed";
+   regulator-name = "vdd_5v";
+   regulator-always-on;
+   regulator-boot-on;
+   };
+
+   fan: pwm-fan {
+   compatible = "pwm-fan";
+   /*
+* With 20KHz PWM and an EVERCOOL EC4007H12SA fan, these levels
+* work out to 0, ~1200, ~3000, and 5000RPM respectively.
+*/
+   cooling-levels = <0 12 18 255>;


This is clearly not true - those numbers refer to a 12V fan on my 
NanoPC-T4's 12V PWM circuit, while the output circuit here is 5V. If you 
really want a placeholder here maybe just use <0 255>, or figure out 
some empirical values with a suitable 5V fan that are actually meaningful.



+   #cooling-cells = <2>;
+   fan-supply = <_5v>;
+   pwms = < 0 5 0>;
+   };
+};
+
+_thermal {
+   trips {
+   cpu_warm: cpu_warm {
+   temperature = <55000>;
+   hysteresis = <2000>;
+   type = "active";
+   };
+
+   cpu_hot: cpu_hot {
+   temperature = <65000>;
+   hysteresis = <2000>;
+   type = "active";
+   };
+   };
+
+   cooling-maps {
+   map2 {
+   

Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-03-12 Thread Robin Murphy

On 2021-03-12 17:03, Logan Gunthorpe wrote:



On 2021-03-12 8:52 a.m., Robin Murphy wrote:

On 2021-03-11 23:31, Logan Gunthorpe wrote:

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.


This misled me at first, but I see the implementation does actually
appear to accommodate the case of working ACS where P2P *would* still 
need to be mapped at the IOMMU.


Yes, that's correct.

   static int __finalise_sg(struct device *dev, struct scatterlist *sg,
int nents,
-    dma_addr_t dma_addr)
+    dma_addr_t dma_addr, unsigned long attrs)
   {
   struct scatterlist *s, *cur = sg;
   unsigned long seg_mask = dma_get_seg_boundary(dev);
@@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev,
struct scatterlist *sg, int nents,
   sg_dma_address(s) = DMA_MAPPING_ERROR;
   sg_dma_len(s) = 0;
   +    if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {
+    if (i > 0)
+    cur = sg_next(cur);
+
+    sg_dma_address(cur) = sg_phys(s) + s->offset -


Are you sure about that? ;)


Do you see a bug? I don't follow you...


sg_phys() already accounts for the offset, so you're adding it twice.
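
(For reference, sg_phys() in include/linux/scatterlist.h is essentially:)

static inline dma_addr_t sg_phys(struct scatterlist *sg)
{
	return page_to_phys(sg_page(sg)) + sg->offset;
}

so the extra "+ s->offset" in the hunk above counts the offset a second
time.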


+    pci_p2pdma_bus_offset(sg_page(s));


Can the bus offset make P2P addresses overlap with regions of mem space
that we might use for regular IOVA allocation? That would be very bad...


No. IOMMU drivers already disallow all PCI addresses from being used as
IOVA addresses. See, for example,  dmar_init_reserved_ranges(). It would
be a huge problem for a whole lot of other reasons if it didn't.


I know we reserve the outbound windows (largely *because* some host 
bridges will consider those addresses as attempts at unsupported P2P and 
prevent them working), I just wanted to confirm that this bus offset is 
always something small that stays within the relevant window, rather 
than something that might make a BAR appear in a completely different 
place for P2P purposes. If so, that's good.



@@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev,
struct scatterlist *sg,
   struct iommu_dma_cookie *cookie = domain->iova_cookie;
   struct iova_domain *iovad = &cookie->iovad;
   struct scatterlist *s, *prev = NULL;
+    struct dev_pagemap *pgmap = NULL;
   int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
   dma_addr_t iova;
   size_t iova_len = 0;
   unsigned long mask = dma_get_seg_boundary(dev);
-    int i;
+    int i, map = -1, ret = 0;
     if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
   iommu_deferred_attach(dev, domain))
@@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev,
struct scatterlist *sg,
   s_length = iova_align(iovad, s_length + s_iova_off);
   s->length = s_length;
   +    if (is_pci_p2pdma_page(sg_page(s))) {
+    if (sg_page(s)->pgmap != pgmap) {
+    pgmap = sg_page(s)->pgmap;
+    map = pci_p2pdma_dma_map_type(dev, pgmap);
+    }
+
+    if (map < 0) {


It rather feels like it should be the job of whoever creates the list in
the first place not to put unusable pages in it, especially since the
p2pdma_map_type looks to be a fairly coarse-grained and static thing.
The DMA API isn't responsible for validating normal memory pages, so
what makes P2P special?


Yes, that would be ideal, but there's some difficulties there. For the
driver to check the pages, it would need to loop through the entire SG
one more time on every transaction, regardless of whether there are
P2PDMA pages, or not. So that will have a performance impact even when
the feature isn't being used. I don't think that'll be acceptable for
many drivers.

The other possibility is for GUP to do it when it gets the pages from
userspace. But GUP doesn't have all the information to do this at the
moment. We'd have to pass the struct device that will eventually map the
pages through all the nested functions in the GUP to do that test at
that time. This might not be a bad option (that I half looked into), but
I'm not sure how acceptable it would be to the GUP developers.


Urgh, yes, if a page may or may not be valid for p2p depending on which 
device is trying to map it, then it probably is most reasonable to 
figure that out at this point. It's a little unfortunate having to cope 
with failure so late, but oh well.



But even if we do verify the pages ahead of time, we still need the same
infrastructure in dma_map_sg(); it could only now be a BUG if the driver
sent invalid pages instead of an error return.


The hope was that we could save doing even that - e.g. if you pass a 
dodgy page into dma_map_page(), maybe page_to_phys() will crash, maybe 
you'll just e

Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-03-12 Thread Robin Murphy

On 2021-03-12 16:24, Logan Gunthorpe wrote:



On 2021-03-12 8:52 a.m., Robin Murphy wrote:

+
   sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
   sg->offset, sg->length, dir, attrs);
   if (sg->dma_address == DMA_MAPPING_ERROR)
@@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct
scatterlist *sgl, int nents,
     out_unmap:
   dma_direct_unmap_sg(dev, sgl, i, dir, attrs |
DMA_ATTR_SKIP_CPU_SYNC);
-    return 0;
+    return ret;
   }
     dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t
paddr,
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..adc1a83950be 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev,
dma_addr_t addr, size_t size,
   EXPORT_SYMBOL(dma_unmap_page_attrs);
     /*
- * dma_maps_sg_attrs returns 0 on error and > 0 on success.
- * It should never return a value < 0.
+ * dma_maps_sg_attrs returns 0 on any resource error and > 0 on success.
+ *
+ * If 0 is returned, the mapping can be retried and will succeed once
+ * sufficient resources are available.


That's not a guarantee we can uphold. Retrying forever in the vain hope
that a device might evolve some extra address bits, or a bounce buffer
might magically grow big enough for a gigantic mapping, isn't
necessarily the best idea.


Perhaps this is just poorly worded. Returning 0 is the normal case and
nothing has changed there. The block layer, for example, will retry if
zero is returned as this only happens if it failed to allocate resources
for the mapping. The reason we have to return -1 is to tell the block
layer not to retry these requests as they will never succeed in the future.


+ *
+ * If there are P2PDMA pages in the scatterlist then this function may
+ * return -EREMOTEIO to indicate that the pages are not mappable by the
+ * device. In this case, an error should be returned for the IO as it
+ * will never be successfully retried.
    */
   int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int
nents,
   enum dma_data_direction dir, unsigned long attrs)
@@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct
scatterlist *sg, int nents,
   ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
   else
   ents = ops->map_sg(dev, sg, nents, dir, attrs);
-    BUG_ON(ents < 0);
+


This scares me - I hesitate to imagine the amount of driver/subsystem
code out there that will see nonzero and merrily set off iterating a
negative number of segments, if we open the floodgates of allowing
implementations to return error codes here.


Yes, but it will never happen on existing drivers/subsystems. The only
way it can return a negative number is if the driver passes in P2PDMA
pages which can't happen without changes in the driver. We are careful
about where P2PDMA pages can get into so we don't have to worry about
all the existing driver code out there.


Sure, that's how things stand immediately after this patch. But then 
someone comes along with the perfectly reasonable argument for returning 
more expressive error information for regular mapping failures as well 
(because sometimes those can be terminal too, as above), we start to get 
divergent behaviour across architectures and random bits of old code 
subtly breaking down the line. *That* is what makes me wary of making a 
fundamental change to a long-standing "nonzero means success" interface...


Robin.


Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA

2021-03-12 Thread Robin Murphy

On 2021-03-12 16:18, Logan Gunthorpe wrote:



On 2021-03-12 8:51 a.m., Robin Murphy wrote:

On 2021-03-11 23:31, Logan Gunthorpe wrote:

Hi,

This is a rework of the first half of my RFC for doing P2PDMA in
userspace
with O_DIRECT[1].

The largest issue with that series was the gross way of flagging P2PDMA
SGL segments. This RFC proposes a different approach, (suggested by
Dan Williams[2]) which uses the third bit in the page_link field of the
SGL.

This approach is a lot less hacky but comes at the cost of adding a
CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
scarce bit in the page_link. For our purposes, a 64BIT restriction is
acceptable but it's not clear if this is ok for all usecases hoping
to make use of P2PDMA.

Matthew Wilcox has already suggested (off-list) that this is the wrong
approach, preferring a new dma mapping operation and an SGL
replacement. I
don't disagree that something along those lines would be a better long
term solution, but it involves overcoming a lot of challenges to get
there. Creating a new mapping operation still means adding support to
more
than 25 dma_map_ops implementations (many of which are on obscure
architectures) or creating a redundant path to fallback with dma_map_sg()
for every driver that uses the new operation. This RFC is an approach
that doesn't require overcoming these blocks.


I don't really follow that argument - you're only adding support to two
implementations with the awkward flag, so why would using a dedicated
operation instead be any different? Whatever callers need to do if
dma_pci_p2pdma_supported() says no, they could equally do if
dma_map_p2p_sg() (or whatever) returns -ENXIO, no?


The thing is if the dma_map_sg doesn't support P2PDMA then P2PDMA
transactions cannot be done, but regular transactions can still go
through as they always did.

But replacing dma_map_sg() with dma_map_new() affects all operations,
P2PDMA or otherwise. If dma_map_new() isn't supported it can't simply
not support P2PDMA; it has to maintain a fallback path to dma_map_sg().


But AFAICS the equivalent fallback path still has to exist either way. 
My impression so far is that callers would end up looking something like 
this:


if (dma_pci_p2pdma_supported()) {
	if (dma_map_sg(...) < 0)
		//do non-p2p fallback due to p2p failure
} else {
	//do non-p2p fallback due to lack of support
}

at which point, simply:

if (dma_map_sg_p2p(...) < 0)
	//do non-p2p fallback either way

seems entirely reasonable. What am I missing?

Let's not pretend that overloading an existing API means we can start 
feeding P2P pages into any old subsystem/driver without further changes 
- there already *are* at least some that retry ad infinitum if DMA 
mapping fails (the USB layer springs to mind...) and thus wouldn't 
handle the PCI_P2PDMA_MAP_NOT_SUPPORTED case acceptably.



Given that the inputs and outputs for dma_map_new() will be completely
different data structures this will be quite a lot of similar paths
required in the driver. (ie mapping a bvec to the input struct and the
output struct to hardware requirements) If a bug crops up in the old
dma_map_sg(), developers might not notice it for some time seeing it
won't be used on the most popular architectures.


Huh? I'm specifically suggesting a new interface that takes the *same* 
data structure (at least to begin with), but just gives us more 
flexibility in terms of introducing p2p-aware behaviour somewhat more 
safely. Yes, we already know that we ultimately want something better 
than scatterlists for representing things like this and dma-buf imports, 
but that hardly has to happen overnight.


Robin.


Re: [RFC PATCH v2 06/11] dma-direct: Support PCI P2PDMA pages in dma-direct map_sg

2021-03-12 Thread Robin Murphy

On 2021-03-11 23:31, Logan Gunthorpe wrote:

Add PCI P2PDMA support for dma_direct_map_sg() so that it can map
PCI P2PDMA pages directly without a hack in the callers. This allows
for heterogeneous SGLs that contain both P2PDMA and regular pages.

SGL segments that contain PCI bus addresses are marked with
sg_mark_pci_p2pdma() and are ignored when unmapped.

Signed-off-by: Logan Gunthorpe 
---
  kernel/dma/direct.c  | 35 ---
  kernel/dma/mapping.c | 13 ++---
  2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index 002268262c9a..f326d32062dd 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -13,6 +13,7 @@
  #include 
  #include 
  #include 
+#include <linux/pci-p2pdma.h>
  #include "direct.h"
  
  /*

@@ -387,19 +388,47 @@ void dma_direct_unmap_sg(struct device *dev, struct 
scatterlist *sgl,
struct scatterlist *sg;
int i;
  
-	for_each_sg(sgl, sg, nents, i)

+   for_each_sg(sgl, sg, nents, i) {
+   if (sg_is_pci_p2pdma(sg))
+   continue;
+
dma_direct_unmap_page(dev, sg->dma_address, sg_dma_len(sg), dir,
 attrs);
+   }
  }
  #endif
  
  int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,

enum dma_data_direction dir, unsigned long attrs)
  {
-   int i;
+   struct dev_pagemap *pgmap = NULL;
+   int i, map = -1, ret = 0;
struct scatterlist *sg;
+   u64 bus_off;
  
  	for_each_sg(sgl, sg, nents, i) {

+   if (is_pci_p2pdma_page(sg_page(sg))) {
+   if (sg_page(sg)->pgmap != pgmap) {
+   pgmap = sg_page(sg)->pgmap;
+   map = pci_p2pdma_dma_map_type(dev, pgmap);
+   bus_off = pci_p2pdma_bus_offset(sg_page(sg));
+   }
+
+   if (map < 0) {
+   sg->dma_address = DMA_MAPPING_ERROR;
+   ret = -EREMOTEIO;
+   goto out_unmap;
+   }
+
+   if (map) {
+   sg->dma_address = sg_phys(sg) + sg->offset -
+   bus_off;
+   sg_dma_len(sg) = sg->length;
+   sg_mark_pci_p2pdma(sg);
+   continue;
+   }
+   }
+
sg->dma_address = dma_direct_map_page(dev, sg_page(sg),
sg->offset, sg->length, dir, attrs);
if (sg->dma_address == DMA_MAPPING_ERROR)
@@ -411,7 +440,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
  
  out_unmap:

dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return 0;
+   return ret;
  }
  
  dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,

diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index b6a633679933..adc1a83950be 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -178,8 +178,15 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
  EXPORT_SYMBOL(dma_unmap_page_attrs);
  
  /*

- * dma_maps_sg_attrs returns 0 on error and > 0 on success.
- * It should never return a value < 0.
+ * dma_maps_sg_attrs returns 0 on any resource error and > 0 on success.
+ *
+ * If 0 is returned, the mapping can be retried and will succeed once
+ * sufficient resources are available.


That's not a guarantee we can uphold. Retrying forever in the vain hope 
that a device might evolve some extra address bits, or a bounce buffer 
might magically grow big enough for a gigantic mapping, isn't 
necessarily the best idea.



+ *
+ * If there are P2PDMA pages in the scatterlist then this function may
+ * return -EREMOTEIO to indicate that the pages are not mappable by the
+ * device. In this case, an error should be returned for the IO as it
+ * will never be successfully retried.
   */
  int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs)
@@ -197,7 +204,7 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist 
*sg, int nents,
ents = dma_direct_map_sg(dev, sg, nents, dir, attrs);
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
-   BUG_ON(ents < 0);
+


This scares me - I hesitate to imagine the amount of driver/subsystem 
code out there that will see nonzero and merrily set off iterating a 
negative number of segments, if we open the floodgates of allowing 
implementations to return error codes here.


Robin.


debug_dma_map_sg(dev, sg, nents, ents, dir);
  
  	return ents;




Re: [RFC PATCH v2 08/11] iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg

2021-03-12 Thread Robin Murphy

On 2021-03-11 23:31, Logan Gunthorpe wrote:

When a PCI P2PDMA page is seen, set the IOVA length of the segment
to zero so that it is not mapped into the IOVA. Then, in finalise_sg(),
apply the appropriate bus address to the segment. The IOVA is not
created if the scatterlist only consists of P2PDMA pages.


This misled me at first, but I see the implementation does actually 
appear to accommodate the case of working ACS where P2P *would* still 
need to be mapped at the IOMMU.



Similar to dma-direct, the sg_mark_pci_p2pdma() flag is used to
indicate bus address segments. On unmap, P2PDMA segments are skipped
over when determining the start and end IOVA addresses.

With this change, the flags variable in the dma_map_ops is
set to DMA_F_PCI_P2PDMA_SUPPORTED to indicate support for
P2PDMA pages.

Signed-off-by: Logan Gunthorpe 
---
  drivers/iommu/dma-iommu.c | 63 ---
  1 file changed, 53 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index af765c813cc8..c0821e9051a9 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -20,6 +20,7 @@
  #include 
  #include 
  #include 
+#include <linux/pci-p2pdma.h>
  #include 
  #include 
  #include 
@@ -846,7 +847,7 @@ static void iommu_dma_unmap_page(struct device *dev, 
dma_addr_t dma_handle,
   * segment's start address to avoid concatenating across one.
   */
  static int __finalise_sg(struct device *dev, struct scatterlist *sg, int 
nents,
-   dma_addr_t dma_addr)
+   dma_addr_t dma_addr, unsigned long attrs)
  {
struct scatterlist *s, *cur = sg;
unsigned long seg_mask = dma_get_seg_boundary(dev);
@@ -864,6 +865,20 @@ static int __finalise_sg(struct device *dev, struct 
scatterlist *sg, int nents,
sg_dma_address(s) = DMA_MAPPING_ERROR;
sg_dma_len(s) = 0;
  
+		if (is_pci_p2pdma_page(sg_page(s)) && !s_iova_len) {

+   if (i > 0)
+   cur = sg_next(cur);
+
+   sg_dma_address(cur) = sg_phys(s) + s->offset -


Are you sure about that? ;)


+   pci_p2pdma_bus_offset(sg_page(s));


Can the bus offset make P2P addresses overlap with regions of mem space 
that we might use for regular IOVA allocation? That would be very bad...



+   sg_dma_len(cur) = s->length;
+   sg_mark_pci_p2pdma(cur);
+
+   count++;
+   cur_len = 0;
+   continue;
+   }
+
/*
 * Now fill in the real DMA data. If...
 * - there is a valid output segment to append to
@@ -960,11 +975,12 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
struct iommu_dma_cookie *cookie = domain->iova_cookie;
	struct iova_domain *iovad = &cookie->iovad;
struct scatterlist *s, *prev = NULL;
+   struct dev_pagemap *pgmap = NULL;
int prot = dma_info_to_prot(dir, dev_is_dma_coherent(dev), attrs);
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
-   int i;
+   int i, map = -1, ret = 0;
  
  	if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&

iommu_deferred_attach(dev, domain))
@@ -993,6 +1009,23 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s_length = iova_align(iovad, s_length + s_iova_off);
s->length = s_length;
  
+		if (is_pci_p2pdma_page(sg_page(s))) {

+   if (sg_page(s)->pgmap != pgmap) {
+   pgmap = sg_page(s)->pgmap;
+   map = pci_p2pdma_dma_map_type(dev, pgmap);
+   }
+
+   if (map < 0) {


It rather feels like it should be the job of whoever creates the list in 
the first place not to put unusable pages in it, especially since the 
p2pdma_map_type looks to be a fairly coarse-grained and static thing. 
The DMA API isn't responsible for validating normal memory pages, so 
what makes P2P special?



+   ret = -EREMOTEIO;
+   goto out_restore_sg;
+   }
+
+   if (map) {
+   s->length = 0;


I'm not really thrilled about the idea of passing zero-length segments 
to iommu_map_sg(). Yes, it happens to trick the concatenation logic in 
the current implementation into doing what you want, but it feels fragile.



+   continue;
+   }
+   }
+
/*
 * Due to the alignment of our single IOVA allocation, we can
 * depend on these assumptions about the segment boundary mask:
@@ -1015,6 +1048,9 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
prev = 

Re: [RFC PATCH v2 00/11] Add support to dma_map_sg for P2PDMA

2021-03-12 Thread Robin Murphy

On 2021-03-11 23:31, Logan Gunthorpe wrote:

Hi,

This is a rework of the first half of my RFC for doing P2PDMA in userspace
with O_DIRECT[1].

The largest issue with that series was the gross way of flagging P2PDMA
SGL segments. This RFC proposes a different approach, (suggested by
Dan Williams[2]) which uses the third bit in the page_link field of the
SGL.

This approach is a lot less hacky but comes at the cost of adding a
CONFIG_64BIT dependency to CONFIG_PCI_P2PDMA and using up the last
scarce bit in the page_link. For our purposes, a 64BIT restriction is
acceptable but it's not clear if this is ok for all usecases hoping
to make use of P2PDMA.

Matthew Wilcox has already suggested (off-list) that this is the wrong
approach, preferring a new dma mapping operation and an SGL replacement. I
don't disagree that something along those lines would be a better long
term solution, but it involves overcoming a lot of challenges to get
there. Creating a new mapping operation still means adding support to more
than 25 dma_map_ops implementations (many of which are on obscure
architectures) or creating a redundant path to fallback with dma_map_sg()
for every driver that uses the new operation. This RFC is an approach
that doesn't require overcoming these blocks.


I don't really follow that argument - you're only adding support to two 
implementations with the awkward flag, so why would using a dedicated 
operation instead be any different? Whatever callers need to do if 
dma_pci_p2pdma_supported() says no, they could equally do if 
dma_map_p2p_sg() (or whatever) returns -ENXIO, no?


We don't try to multiplex .map_resource through .map_page, so there 
doesn't seem to be any good reason to force that complexity on .map_sg 
either. And having a distinct API from the outset should make it a lot 
easier to transition to better "list of P2P memory regions" data 
structures in future without rewriting the whole world. As it is, there 
are potential benefits in a P2P interface which can define its own 
behaviour - for instance if it threw out the notion of segment merging it 
could save a load of bother by just maintaining the direct correlation 
between pages and DMA addresses.


Robin.


Any alternative ideas or feedback is welcome.

These patches are based on v5.12-rc2 and a git branch is available here:

   https://github.com/sbates130272/linux-p2pmem/  p2pdma_dma_map_ops_rfc

A branch with the patches from the previous RFC that add userspace
O_DIRECT support is available at the same URL with the name
"p2pdma_dma_map_ops_rfc+user" (however, none of the issues with those
extra patches from the feedback of the last posting have been fixed).

Thanks,

Logan

[1] 
https://lore.kernel.org/linux-block/20201106170036.18713-1-log...@deltatee.com/
[2] 
https://lore.kernel.org/linux-block/capcyv4ifgcrdotut8qr7pmfhmecghqgvre9g0rorgczcgve...@mail.gmail.com/

--

Logan Gunthorpe (11):
   PCI/P2PDMA: Pass gfp_mask flags to upstream_bridge_distance_warn()
   PCI/P2PDMA: Avoid pci_get_slot() which sleeps
   PCI/P2PDMA: Attempt to set map_type if it has not been set
   PCI/P2PDMA: Introduce pci_p2pdma_should_map_bus() and
 pci_p2pdma_bus_offset()
   lib/scatterlist: Add flag for indicating P2PDMA segments in an SGL
   dma-direct: Support PCI P2PDMA pages in dma-direct map_sg
   dma-mapping: Add flags to dma_map_ops to indicate PCI P2PDMA support
   iommu/dma: Support PCI P2PDMA pages in dma-iommu map_sg
   block: Add BLK_STS_P2PDMA
   nvme-pci: Check DMA ops when indicating support for PCI P2PDMA
   nvme-pci: Convert to using dma_map_sg for p2pdma pages

  block/blk-core.c|  2 +
  drivers/iommu/dma-iommu.c   | 63 +-
  drivers/nvme/host/core.c|  3 +-
  drivers/nvme/host/nvme.h|  2 +-
  drivers/nvme/host/pci.c | 38 +++-
  drivers/pci/Kconfig |  2 +-
  drivers/pci/p2pdma.c| 89 +++--
  include/linux/blk_types.h   |  7 +++
  include/linux/dma-map-ops.h |  3 ++
  include/linux/dma-mapping.h |  5 +++
  include/linux/pci-p2pdma.h  | 11 +
  include/linux/scatterlist.h | 49 ++--
  kernel/dma/direct.c | 35 +--
  kernel/dma/mapping.c| 21 +++--
  14 files changed, 271 insertions(+), 59 deletions(-)


base-commit: a38fd8748464831584a19438cbb3082b5a2dab15
--
2.20.1
___
iommu mailing list
io...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu



Re: [PATCH v3] iommu: Check dev->iommu in iommu_dev_xxx functions

2021-03-09 Thread Robin Murphy

On 2021-03-03 17:36, Shameer Kolothum wrote:

The device iommu probe/attach might have failed leaving dev->iommu
to NULL and device drivers may still invoke these functions resulting
in a crash in iommu vendor driver code.

Hence make sure we check that.


Reviewed-by: Robin Murphy 


Fixes: a3a195929d40 ("iommu: Add APIs for multiple domains per device")
Signed-off-by: Shameer Kolothum 
---
v2 --> v3
  -Removed iommu_ops from struct dev_iommu.
v1 --> v2:
  -Added iommu_ops to struct dev_iommu based on the discussion with Robin.
  -Rebased against iommu-tree core branch.
---
  drivers/iommu/iommu.c | 24 +++-
  1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index d0b0a15dba84..e10cfa99057c 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2878,10 +2878,12 @@ EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
   */
  int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
+   if (dev->iommu && dev->iommu->iommu_dev) {
+   const struct iommu_ops *ops = dev->iommu->iommu_dev->ops;
  
-	if (ops && ops->dev_enable_feat)

-   return ops->dev_enable_feat(dev, feat);
+   if (ops->dev_enable_feat)
+   return ops->dev_enable_feat(dev, feat);
+   }
  
  	return -ENODEV;

  }
@@ -2894,10 +2896,12 @@ EXPORT_SYMBOL_GPL(iommu_dev_enable_feature);
   */
  int iommu_dev_disable_feature(struct device *dev, enum iommu_dev_features 
feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
+   if (dev->iommu && dev->iommu->iommu_dev) {
+   const struct iommu_ops *ops = dev->iommu->iommu_dev->ops;
  
-	if (ops && ops->dev_disable_feat)

-   return ops->dev_disable_feat(dev, feat);
+   if (ops->dev_disable_feat)
+   return ops->dev_disable_feat(dev, feat);
+   }
  
  	return -EBUSY;

  }
@@ -2905,10 +2909,12 @@ EXPORT_SYMBOL_GPL(iommu_dev_disable_feature);
  
  bool iommu_dev_feature_enabled(struct device *dev, enum iommu_dev_features feat)

  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
+   if (dev->iommu && dev->iommu->iommu_dev) {
+   const struct iommu_ops *ops = dev->iommu->iommu_dev->ops;
  
-	if (ops && ops->dev_feat_enabled)

-   return ops->dev_feat_enabled(dev, feat);
+   if (ops->dev_feat_enabled)
+   return ops->dev_feat_enabled(dev, feat);
+   }
  
  	return false;

  }



Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"

2021-03-08 Thread Robin Murphy

On 2021-03-01 15:48, John Garry wrote:

On 01/03/2021 13:20, Robin Murphy wrote:

FWIW, I'm 99% sure that what you really want is [1], but then you get
to battle against an unknown quantity of dodgy firmware instead.


Something which has not been said before is that this only happens for
strict mode.

I think that makes sense - once you*have*  actually failed to allocate
from the 32-bit space, max32_alloc_size will make subsequent attempts
fail immediately. In non-strict mode you're most likely freeing 32-bit
IOVAs back to the tree - and thus reset max32_alloc_size - much less
often, and you'll make more total space available each time, both of
which will amortise the cost of getting back into that failed state
again. Conversely, the worst case in strict mode is to have multiple
threads getting into this pathological cycle:

1: allocate, get last available IOVA
2: allocate, fail and set max32_alloc_size
3: free one IOVA, reset max32_alloc_size, goto 1

Now, given the broken behaviour where the cached PFN can get stuck near
the bottom of the address space, step 2 might well have been faster and
more premature than it should have, but I hope you can appreciate that
relying on an allocator being broken at its fundamental purpose of
allocating is not a good or sustainable thing to do.


I figure that you're talking about 4e89dce72521 now. I would have liked 
to know which real-life problem it solved in practice.


From what I remember, the problem reported was basically the one 
illustrated in that commit and the one I alluded to above - namely that 
certain allocation patterns with a broad mix of sizes and relative 
lifetimes end up pushing the cached PFN down to the bottom of the 
address space such that allocations start failing despite there still 
being sufficient free space overall, which was breaking some media 
workload. What was originally proposed was an overcomplicated palaver 
with DMA attributes and a whole extra allocation algorithm rather than 
just fixing the clearly unintended and broken behaviour.


While max32_alloc_size indirectly tracks the largest*contiguous* 
available space, one of the ideas from which it grew was to simply keep

count of the total number of free PFNs. If you're really spending
significant time determining that the tree is full, as opposed to just
taking longer to eventually succeed, then it might be relatively
innocuous to tack on that semi-redundant extra accounting as a
self-contained quick fix for that worst case.


Anyway, we see ~50% throughput regression, which is intolerable. As seen
in [0], I put this down to the fact that we have so many IOVA requests
which exceed the rcache size limit, which means many RB tree accesses
for non-cacheable IOVAs, which are now slower.


I will attempt to prove this by increasing RCACHE RANGE, such that all 
IOVA sizes may be cached.




On another point, as for longterm IOVA aging issue, it seems that there
is no conclusion there. However I did mention the issue of IOVA sizes
exceeding rcache size for that issue, so maybe we can find a common
solution. Similar to a fixed rcache depot size, it seems that having a
fixed rcache max size range value (at 6) doesn't scale either.

Well, I'd say that's more of a workload tuning thing than a scalability
one -


ok


a massive system with hundreds of CPUs that spends all day
flinging 1500-byte network packets around as fast as it can might be
happy with an even smaller value and using the saved memory for
something else. IIRC the value of 6 is a fairly arbitrary choice for a
tradeoff between expected utility and memory consumption, so making it a
Kconfig or command-line tuneable does seem like a sensible thing to 
explore.


Even if it is were configurable, wouldn't it make sense to have it 
configurable per IOVA domain?


Perhaps, but I don't see that being at all easy to implement. We can't 
arbitrarily *increase* the scope of caching once a domain is active due 
to the size-rounding-up requirement, which would be prohibitive to 
larger allocations if applied universally.
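
(For context, the size-rounding in question happens at allocation time -
roughly the following, paraphrased from iommu_dma_alloc_iova() in
drivers/iommu/dma-iommu.c. Presumably the scope can't grow on a live
domain because IOVAs already allocated under the smaller limit were never
rounded for the wider range, so the caches couldn't safely recycle them:)

#include <linux/iova.h>
#include <linux/log2.h>

static unsigned long iova_len_for_alloc(struct iova_domain *iovad, size_t size)
{
	unsigned long iova_len = size >> iova_shift(iovad);

	/*
	 * Freeing non-power-of-two-sized allocations back into the IOVA
	 * caches behaves badly, so anything small enough to be cached
	 * (below 2^(IOVA_RANGE_CACHE_MAX_SIZE - 1) granules, i.e. 32 with
	 * the default of 6) is rounded up to a power of two.
	 */
	if (iova_len < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		iova_len = roundup_pow_of_two(iova_len);

	return iova_len;
}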


Furthermore, as mentioned above, I still want to solve this IOVA aging 
issue, and this fixed RCACHE RANGE size seems to be the at the center of 
that problem.





As for 4e89dce72521, so even if it's proper to retry for a failed alloc,
it is not always necessary. I mean, if we're limiting ourselves to 32b
subspace for this SAC trick and we fail the alloc, then we can try the
space above 32b first (if usable). If that fails, then retry there. I
don't see a need to retry the 32b subspace if we're not limited to it.
How about it? We tried that idea and it looks to just about restore
performance.

The thing is, if you do have an actual PCI device where DAC might mean a
33% throughput loss and you're mapping a long-lived buffer, or you're on
one of these systems where firmware fails to document address limits and
using the full IOMMU address width quietly breaks things, then you
almost certainly*do*  want

Re: [PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-08 Thread Robin Murphy

On 2021-03-05 17:41, John Garry wrote:

On 05/03/2021 16:32, Robin Murphy wrote:

In converting intel-iommu over to the common IOMMU DMA ops, it quietly
lost the functionality of its "forcedac" option. Since this is a handy
thing both for testing and for performance optimisation on certain
platforms, reimplement it under the common IOMMU parameter namespace.

For the sake of fixing the inadvertent breakage of the Intel-specific
parameter, remove the dmar_forcedac remnants and hook it up as an alias
while documenting the transition to the new common parameter.



Do you think that having a kconfig option to control the default for 
this can help identify the broken platforms which rely on forcedac=0? 
But seems a bit trivial for that, though.


I think it's still a sizeable can of worms - unlike, say, 
ARM_SMMU_DISABLE_BYPASS_BY_DEFAULT, we can't actually tell when things 
have gone awry and explicitly call it out. While I was getting the 
dma-ranges right on my Juno, everything broke differently - the SATA 
controller fails gracefully; the ethernet controller got the kernel tied 
up somewhere (to the point that the USB keyboard died) once it tried to 
brink up the link, but was at least spewing regular timeout backtraces 
that implicated the networking layer; having an (unused) NVMe plugged in 
simply wedged the boot process early on with no hint whatsoever of why.


TBH I'm not really sure what the best way forward is in terms of trying 
to weed out platforms that would need quirking. Our discussion just 
reminded me of this option and that it had gone AWOL, so bringing it 
back to be potentially *some* use to everyone seems justifiable on its own.


Thanks,
Robin.



Or are we bothered (finding them)?

Thanks,
john

Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the 
iommu ops")

Signed-off-by: Robin Murphy 
---
  Documentation/admin-guide/kernel-parameters.txt | 15 ---
  drivers/iommu/dma-iommu.c   | 13 -
  drivers/iommu/intel/iommu.c |  5 ++---
  include/linux/dma-iommu.h   |  2 ++
  4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt

index 04545725f187..835f810f2f26 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1869,13 +1869,6 @@
  bypassed by not enabling DMAR with this option. In
  this case, gfx device will use physical address for
  DMA.
-    forcedac [X86-64]
-    With this option iommu will not optimize to look
-    for io virtual address below 32-bit forcing dual
-    address cycle on pci bus for cards supporting greater
-    than 32-bit addressing. The default is to look
-    for translation below 32-bit and if not available
-    then look in the higher range.
  strict [Default Off]
  With this option on every unmap_single operation will
  result in a hardware IOTLB flush operation as opposed
@@ -1964,6 +1957,14 @@
  nobypass    [PPC/POWERNV]
  Disable IOMMU bypass, using IOMMU for PCI devices.
+    iommu.forcedac=    [ARM64, X86] Control IOVA allocation for PCI 
devices.

+    Format: { "0" | "1" }
+    0 - Try to allocate a 32-bit DMA address first, before
+  falling back to the full range if needed.
+    1 - Allocate directly from the full usable range,
+  forcing Dual Address Cycle for PCI cards supporting
+  greater than 32-bit addressing.
+
  iommu.strict=    [ARM64] Configure TLB invalidation behaviour
  Format: { "0" | "1" }
  0 - Lazy mode.
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9ab6ee22c110..260bf3de1992 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -52,6 +52,17 @@ struct iommu_dma_cookie {
  };
  static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
+bool iommu_dma_forcedac __read_mostly;
+
+static int __init iommu_dma_forcedac_setup(char *str)
+{
+    int ret = kstrtobool(str, &iommu_dma_forcedac);
+
+    if (!ret && iommu_dma_forcedac)
+    pr_info("Forcing DAC for PCI devices\n");
+    return ret;
+}
+early_param("iommu.forcedac", iommu_dma_forcedac_setup);
  void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
  struct iommu_domain *domain)
@@ -438,7 +449,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct 
iommu_domain *domain,

  dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
  /* Try to get PCI devices a SAC address */
-    if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
+    if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && 
dev_

[PATCH 1/2] iommu/iova: Add rbtree entry helper

2021-03-05 Thread Robin Murphy
Repeating the rb_entry() boilerplate all over the place gets old fast.
Before adding yet more instances, add a little helper to tidy it up.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/iova.c | 23 ++-
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index e6e2fa85271c..c28003e1d2ee 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -27,6 +27,11 @@ static void fq_destroy_all_entries(struct iova_domain 
*iovad);
 static void fq_flush_timeout(struct timer_list *t);
 static void free_global_cached_iovas(struct iova_domain *iovad);
 
+static struct iova *to_iova(struct rb_node *node)
+{
+   return rb_entry(node, struct iova, node);
+}
+
 void
 init_iova_domain(struct iova_domain *iovad, unsigned long granule,
unsigned long start_pfn)
@@ -136,7 +141,7 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
 {
struct iova *cached_iova;
 
-   cached_iova = rb_entry(iovad->cached32_node, struct iova, node);
+   cached_iova = to_iova(iovad->cached32_node);
if (free == cached_iova ||
(free->pfn_hi < iovad->dma_32bit_pfn &&
 free->pfn_lo >= cached_iova->pfn_lo)) {
@@ -144,7 +149,7 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
iovad->max32_alloc_size = iovad->dma_32bit_pfn;
}
 
-   cached_iova = rb_entry(iovad->cached_node, struct iova, node);
+   cached_iova = to_iova(iovad->cached_node);
if (free->pfn_lo >= cached_iova->pfn_lo)
		iovad->cached_node = rb_next(&free->node);
 }
@@ -159,7 +164,7 @@ iova_insert_rbtree(struct rb_root *root, struct iova *iova,
	new = (start) ? &start : &(root->rb_node);
/* Figure out where to put new node */
while (*new) {
-   struct iova *this = rb_entry(*new, struct iova, node);
+   struct iova *this = to_iova(*new);
 
parent = *new;
 
@@ -198,7 +203,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
goto iova32_full;
 
curr = __get_cached_rbnode(iovad, limit_pfn);
-   curr_iova = rb_entry(curr, struct iova, node);
+   curr_iova = to_iova(curr);
retry_pfn = curr_iova->pfn_hi + 1;
 
 retry:
@@ -207,7 +212,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
new_pfn = (high_pfn - size) & align_mask;
prev = curr;
curr = rb_prev(curr);
-   curr_iova = rb_entry(curr, struct iova, node);
+   curr_iova = to_iova(curr);
} while (curr && new_pfn <= curr_iova->pfn_hi && new_pfn >= low_pfn);
 
if (high_pfn < size || new_pfn < low_pfn) {
@@ -215,7 +220,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
high_pfn = limit_pfn;
low_pfn = retry_pfn;
		curr = &iovad->anchor.node;
-   curr_iova = rb_entry(curr, struct iova, node);
+   curr_iova = to_iova(curr);
goto retry;
}
iovad->max32_alloc_size = size;
@@ -331,7 +336,7 @@ private_find_iova(struct iova_domain *iovad, unsigned long 
pfn)
assert_spin_locked(>iova_rbtree_lock);
 
while (node) {
-   struct iova *iova = rb_entry(node, struct iova, node);
+   struct iova *iova = to_iova(node);
 
if (pfn < iova->pfn_lo)
node = node->rb_left;
@@ -617,7 +622,7 @@ static int
 __is_range_overlap(struct rb_node *node,
unsigned long pfn_lo, unsigned long pfn_hi)
 {
-   struct iova *iova = rb_entry(node, struct iova, node);
+   struct iova *iova = to_iova(node);
 
if ((pfn_lo <= iova->pfn_hi) && (pfn_hi >= iova->pfn_lo))
return 1;
@@ -685,7 +690,7 @@ reserve_iova(struct iova_domain *iovad,
	spin_lock_irqsave(&iovad->iova_rbtree_lock, flags);
	for (node = rb_first(&iovad->rbroot); node; node = rb_next(node)) {
if (__is_range_overlap(node, pfn_lo, pfn_hi)) {
-   iova = rb_entry(node, struct iova, node);
+   iova = to_iova(node);
			__adjust_overlap_range(iova, &pfn_lo, &pfn_hi);
if ((pfn_lo >= iova->pfn_lo) &&
(pfn_hi <= iova->pfn_hi))
-- 
2.17.1



[PATCH 2/2] iommu/iova: Improve restart logic

2021-03-05 Thread Robin Murphy
When restarting after searching below the cached node fails, resetting
the start point to the anchor node is often overly pessimistic. If
allocations are made with mixed limits - particularly in the case of the
opportunistic 32-bit allocation for PCI devices - this could mean
significant time wasted walking through the whole populated upper range
just to reach the initial limit. We can improve on that by implementing
a proper tree traversal to find the first node above the relevant limit,
and set the exact start point.

Signed-off-by: Robin Murphy 
---
 drivers/iommu/iova.c | 39 ++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index c28003e1d2ee..471c48dd71e7 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, 
struct iova *free)
	iovad->cached_node = rb_next(&free->node);
 }
 
+static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned 
long limit_pfn)
+{
+   struct rb_node *node, *next;
+   /*
+* Ideally what we'd like to judge here is whether limit_pfn is close
+* enough to the highest-allocated IOVA that starting the allocation
+* walk from the anchor node will be quicker than this initial work to
+* find an exact starting point (especially if that ends up being the
+* anchor node anyway). This is an incredibly crude approximation which
+* only really helps the most likely case, but is at least trivially 
easy.
+*/
+   if (limit_pfn > iovad->dma_32bit_pfn)
+   return &iovad->anchor.node;
+
+   node = iovad->rbroot.rb_node;
+   while (to_iova(node)->pfn_hi < limit_pfn)
+   node = node->rb_right;
+
+search_left:
+   while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn)
+   node = node->rb_left;
+
+   if (!node->rb_left)
+   return node;
+
+   next = node->rb_left;
+   while (next->rb_right) {
+   next = next->rb_right;
+   if (to_iova(next)->pfn_lo >= limit_pfn) {
+   node = next;
+   goto search_left;
+   }
+   }
+
+   return node;
+}
+
 /* Insert the iova into domain rbtree by holding writer lock */
 static void
 iova_insert_rbtree(struct rb_root *root, struct iova *iova,
@@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain 
*iovad,
if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) {
high_pfn = limit_pfn;
low_pfn = retry_pfn;
-   curr = &iovad->anchor.node;
+   curr = iova_find_limit(iovad, limit_pfn);
curr_iova = to_iova(curr);
goto retry;
}
-- 
2.17.1



[PATCH] iommu/dma: Resurrect the "forcedac" option

2021-03-05 Thread Robin Murphy
In converting intel-iommu over to the common IOMMU DMA ops, it quietly
lost the functionality of its "forcedac" option. Since this is a handy
thing both for testing and for performance optimisation on certain
platforms, reimplement it under the common IOMMU parameter namespace.

For the sake of fixing the inadvertent breakage of the Intel-specific
parameter, remove the dmar_forcedac remnants and hook it up as an alias
while documenting the transition to the new common parameter.

Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu ops")
Signed-off-by: Robin Murphy 
---
 Documentation/admin-guide/kernel-parameters.txt | 15 ---
 drivers/iommu/dma-iommu.c   | 13 -
 drivers/iommu/intel/iommu.c |  5 ++---
 include/linux/dma-iommu.h   |  2 ++
 4 files changed, 24 insertions(+), 11 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..835f810f2f26 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1869,13 +1869,6 @@
bypassed by not enabling DMAR with this option. In
this case, gfx device will use physical address for
DMA.
-   forcedac [X86-64]
-   With this option iommu will not optimize to look
-   for io virtual address below 32-bit forcing dual
-   address cycle on pci bus for cards supporting greater
-   than 32-bit addressing. The default is to look
-   for translation below 32-bit and if not available
-   then look in the higher range.
strict [Default Off]
With this option on every unmap_single operation will
result in a hardware IOTLB flush operation as opposed
@@ -1964,6 +1957,14 @@
nobypass[PPC/POWERNV]
Disable IOMMU bypass, using IOMMU for PCI devices.
 
+   iommu.forcedac= [ARM64, X86] Control IOVA allocation for PCI devices.
+   Format: { "0" | "1" }
+   0 - Try to allocate a 32-bit DMA address first, before
+ falling back to the full range if needed.
+   1 - Allocate directly from the full usable range,
+ forcing Dual Address Cycle for PCI cards supporting
+ greater than 32-bit addressing.
+
iommu.strict=   [ARM64] Configure TLB invalidation behaviour
Format: { "0" | "1" }
0 - Lazy mode.
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 9ab6ee22c110..260bf3de1992 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -52,6 +52,17 @@ struct iommu_dma_cookie {
 };
 
 static DEFINE_STATIC_KEY_FALSE(iommu_deferred_attach_enabled);
+bool iommu_dma_forcedac __read_mostly;
+
+static int __init iommu_dma_forcedac_setup(char *str)
+{
+   int ret = kstrtobool(str, &iommu_dma_forcedac);
+
+   if (!ret && iommu_dma_forcedac)
+   pr_info("Forcing DAC for PCI devices\n");
+   return ret;
+}
+early_param("iommu.forcedac", iommu_dma_forcedac_setup);
 
 void iommu_dma_free_cpu_cached_iovas(unsigned int cpu,
struct iommu_domain *domain)
@@ -438,7 +449,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain 
*domain,
dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end);
 
/* Try to get PCI devices a SAC address */
-   if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
+   if (dma_limit > DMA_BIT_MASK(32) && !iommu_dma_forcedac && 
dev_is_pci(dev))
iova = alloc_iova_fast(iovad, iova_len,
   DMA_BIT_MASK(32) >> shift, false);
 
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index ee0932307d64..1c3250bc 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -360,7 +360,6 @@ int intel_iommu_enabled = 0;
 EXPORT_SYMBOL_GPL(intel_iommu_enabled);
 
 static int dmar_map_gfx = 1;
-static int dmar_forcedac;
 static int intel_iommu_strict;
 static int intel_iommu_superpage = 1;
 static int iommu_identity_mapping;
@@ -451,8 +450,8 @@ static int __init intel_iommu_setup(char *str)
dmar_map_gfx = 0;
pr_info("Disable GFX device mapping\n");
} else if (!strncmp(str, "forcedac", 8)) {
-   pr_info("Forcing DAC for PCI devices\n");
-   dmar_forcedac = 1;
+   pr_warn("int

Re: [PATCH 5.10 337/663] iommu: Move iotlb_sync_map out from __iommu_map

2021-03-04 Thread Robin Murphy

On 2021-03-04 08:43, Pavel Machek wrote:

Hi!


[ Upstream commit d8c1df02ac7f2c802a9b2afc0f5c888c4217f1d5 ]

At the end of __iommu_map(), it always calls iotlb_sync_map().

This patch moves iotlb_sync_map() out of __iommu_map(), since it is
unnecessary to call it for each sg segment, especially as iotlb_sync_map()
currently flushes the whole TLB. Add a little helper, _iommu_map(), for this.



Signed-off-by: Sasha Levin 


AFAICT this is a slight performance optimization, not a bugfix. It
actually introduces a bug, fixed by the next patch. I'd prefer not to
have it in stable.


Right, the whole original series was a set of performance/efficiency 
improvements, so I'm not sure there's much justification for backporting 
individual patches out of context. Plus as you say this one has the 
unfortunate gfp bug.
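
(For reference, the gfp bug is visible in the hunk quoted below: the new 
_iommu_map() helper hard-codes GFP_KERNEL instead of passing its gfp 
argument through. Purely as a sketch based on the quoted diff, not a 
tested patch, the helper ought to look more like:

static int _iommu_map(struct iommu_domain *domain, unsigned long iova,
		      phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
{
	const struct iommu_ops *ops = domain->ops;
	int ret;

	/* Pass the caller's gfp through rather than hard-coding GFP_KERNEL */
	ret = __iommu_map(domain, iova, paddr, size, prot, gfp);
	if (ret == 0 && ops->iotlb_sync_map)
		ops->iotlb_sync_map(domain);

	return ret;
}

so that atomic callers aren't handed a sleeping allocation.)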


Thanks for spotting this!

Robin.



Best regards,
Pavel


@@ -2421,18 +2418,31 @@ static int __iommu_map(struct iommu_domain *domain, 
unsigned long iova,
  
+static int _iommu_map(struct iommu_domain *domain, unsigned long iova,

+ phys_addr_t paddr, size_t size, int prot, gfp_t gfp)
+{
+   const struct iommu_ops *ops = domain->ops;
+   int ret;
+
+   ret = __iommu_map(domain, iova, paddr, size, prot, GFP_KERNEL);
+   if (ret == 0 && ops->iotlb_sync_map)
+   ops->iotlb_sync_map(domain);
+
+   return ret;
+}




Re: [PATCH v3 1/3] iommu/arm-smmu: Add support for driver IOMMU fault handlers

2021-03-02 Thread Robin Murphy

On 2021-02-25 17:51, Jordan Crouse wrote:

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse 
---

  drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++--
  1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..0f3a9b5f3284 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void 
*dev)
struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
struct arm_smmu_device *smmu = smmu_domain->smmu;
int idx = smmu_domain->cfg.cbndx;
+   int ret;
  
  	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);

if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void 
*dev)
iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
  
-	dev_err_ratelimited(smmu->dev,

-   "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, 
cbfrsynra=0x%x, cb=%d\n",
+   ret = report_iommu_fault(domain, dev, iova,


Beware that "dev" here is not a struct device, so this isn't right. I'm 
not entirely sure what we *should* be passing here, since we can't 
easily attribute a context fault to a specific client device, and 
passing the IOMMU device seems a bit dubious too, so maybe just NULL?
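
(i.e. purely as an illustration of what I mean - untested:

	ret = report_iommu_fault(domain, NULL, iova,
				 fsynr & ARM_SMMU_FSYNR0_WNR ?
				 IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);

with the rest of the hunk unchanged.)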


Robin.


+   fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : 
IOMMU_FAULT_READ);
+
+   if (ret == -ENOSYS)
+   dev_err_ratelimited(smmu->dev,
+   "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, 
cbfrsynra=0x%x, cb=%d\n",
fsr, iova, fsynr, cbfrsynra, idx);
  
  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);




Re: [RFC 10/13] iommu/arm-smmu-impl: Get rid of Marvell's implementation details

2021-03-02 Thread Robin Murphy

On 2021-02-26 14:03, Nicolas Saenz Julienne wrote:

arm-smmu can now deal with integrations on buses that don't support
64bit MMIO accesses. No need to create a special case for that on
Marvell's integration.


This breaks compatibility with existing DTs.

Robin.


Signed-off-by: Nicolas Saenz Julienne 
---
  drivers/iommu/arm/arm-smmu/arm-smmu-impl.c | 21 -
  1 file changed, 21 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
index 136872e77195..55d40e37e144 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-impl.c
@@ -145,25 +145,6 @@ static const struct arm_smmu_impl arm_mmu500_impl = {
.reset = arm_mmu500_reset,
  };
  
-static u64 mrvl_mmu500_readq(struct arm_smmu_device *smmu, int page, int off)

-{
-   /*
-* Marvell Armada-AP806 erratum #582743.
-* Split all the readq to double readl
-*/
-   return hi_lo_readq_relaxed(arm_smmu_page(smmu, page) + off);
-}
-
-static void mrvl_mmu500_writeq(struct arm_smmu_device *smmu, int page, int off,
-  u64 val)
-{
-   /*
-* Marvell Armada-AP806 erratum #582743.
-* Split all the writeq to double writel
-*/
-   hi_lo_writeq_relaxed(val, arm_smmu_page(smmu, page) + off);
-}
-
  static int mrvl_mmu500_cfg_probe(struct arm_smmu_device *smmu)
  {
  
@@ -181,8 +162,6 @@ static int mrvl_mmu500_cfg_probe(struct arm_smmu_device *smmu)

  }
  
  static const struct arm_smmu_impl mrvl_mmu500_impl = {

-   .read_reg64 = mrvl_mmu500_readq,
-   .write_reg64 = mrvl_mmu500_writeq,
.cfg_probe = mrvl_mmu500_cfg_probe,
.reset = arm_mmu500_reset,
  };



Re: [RFC 02/13] driver core: Introduce MMIO configuration

2021-03-02 Thread Robin Murphy

On 2021-02-26 14:02, Nicolas Saenz Julienne wrote:

Some devices might inadvertently sit on buses that don't support 64bit
MMIO access, and need a mechanism to query these limitations without
prejudice to other buses in the system (i.e. defaulting to 32bit access
system wide isn't an option).

Introduce a new bus callback, 'mmio_configure(),' which will take care
of populating the relevant device properties based on the bus'
limitations.


Devil's advocate: there already exist workarounds for 8-bit and/or 
16-bit accesses not working in various places, so does it make sense for 
a 64-bit workaround to be so wildly different and disjoint?



Signed-off-by: Nicolas Saenz Julienne 
---
  arch/Kconfig   | 8 
  drivers/base/dd.c  | 6 ++
  include/linux/device.h | 3 +++
  include/linux/device/bus.h | 3 +++
  4 files changed, 20 insertions(+)

diff --git a/arch/Kconfig b/arch/Kconfig
index 2bb30673d8e6..ba7f246b6b9d 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1191,6 +1191,14 @@ config ARCH_SPLIT_ARG64
  config ARCH_HAS_ELFCORE_COMPAT
bool
  
+config ARCH_HAS_64BIT_MMIO_BROKEN

+   bool
+   depends on 64BIT


As mentioned previously, 32-bit systems may not need the overrides for 
kernel I/O accessors, but they could still need the same workarounds for 
the memory-mapping implications (if this is to be a proper generic 
mechanism).



+   default n


Tip: it is always redundant to state that.

Robin.


+   help
+  Arch might contain buses unable to perform 64bit MMIO accesses on
+  an otherwise 64bit system.
+
  source "kernel/gcov/Kconfig"
  
  source "scripts/gcc-plugins/Kconfig"

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 9179825ff646..8086ce8f17a6 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -538,6 +538,12 @@ static int really_probe(struct device *dev, struct 
device_driver *drv)
goto probe_failed;
}
  
+	if (dev->bus->mmio_configure) {

+   ret = dev->bus->mmio_configure(dev);
+   if (ret)
+   goto probe_failed;
+   }
+
if (driver_sysfs_add(dev)) {
pr_err("%s: driver_sysfs_add(%s) failed\n",
   __func__, dev_name(dev));
diff --git a/include/linux/device.h b/include/linux/device.h
index ba660731bd25..bd94aa0cbd72 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -553,6 +553,9 @@ struct device {
  #ifdef CONFIG_DMA_OPS_BYPASS
booldma_ops_bypass : 1;
  #endif
+#if defined(CONFIG_ARCH_HAS_64BIT_MMIO_BROKEN)
+   boolmmio_64bit_broken:1;
+#endif
  };
  
  /**

diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
index 1ea5e1d1545b..680dfc3b4744 100644
--- a/include/linux/device/bus.h
+++ b/include/linux/device/bus.h
@@ -59,6 +59,8 @@ struct fwnode_handle;
   *bus supports.
   * @dma_configure:Called to setup DMA configuration on a device on
   *this bus.
+ * @mmio_configure:Called to setup MMIO configuration on a device on
+ * this bus.
   * @pm:   Power management operations of this bus, callback the 
specific
   *device driver's pm-ops.
   * @iommu_ops:  IOMMU specific operations for this bus, used to attach IOMMU
@@ -103,6 +105,7 @@ struct bus_type {
int (*num_vf)(struct device *dev);
  
  	int (*dma_configure)(struct device *dev);

+   int (*mmio_configure)(struct device *dev);
  
  	const struct dev_pm_ops *pm;
  



Re: [RFC 09/13] iommu/arm-smmu: Make use of dev_64bit_mmio_supported()

2021-03-02 Thread Robin Murphy

On 2021-02-26 14:03, Nicolas Saenz Julienne wrote:

Some arm SMMU implementations might sit on a bus that doesn't support
64bit memory accesses. In that case default to using hi_lo_{readq,
writeq}() and BUG if such a platform tries to use AArch64 formats as they
rely on writeq()'s atomicity.

Signed-off-by: Nicolas Saenz Julienne 
---
  drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +
  drivers/iommu/arm/arm-smmu/arm-smmu.h | 9 +++--
  2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c 
b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index d8c6bfde6a61..239ff42b20c3 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -1889,6 +1889,15 @@ static int arm_smmu_device_cfg_probe(struct 
arm_smmu_device *smmu)
smmu->features |= ARM_SMMU_FEAT_FMT_AARCH64_64K;
}
  
+	/*

+* 64bit accesses not possible through the interconnect, AArch64
+* formats depend on it.
+*/
+   BUG_ON(!dev_64bit_mmio_supported(smmu->dev) &&
+  smmu->features & (ARM_SMMU_FEAT_FMT_AARCH64_4K |
+   ARM_SMMU_FEAT_FMT_AARCH64_16K |
+   ARM_SMMU_FEAT_FMT_AARCH64_64K));


No. Crashing the kernel in a probe routine which is free to fail is 
unacceptable either way, but guaranteeing failure in the case that the 
workaround *would* be required is doubly so.


Basically, this logic is backwards - if you really wanted to handle it 
generically, this would be the point at which you'd need to actively 
suppress all the detected hardware features which depend on 64-bit 
atomicity, not complain about them.
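
(Concretely, something along these lines in arm_smmu_device_cfg_probe() - 
a rough sketch only, reusing the dev_64bit_mmio_supported() helper this 
series proposes:

	/*
	 * Without 64-bit single-copy atomicity for MMIO we can't safely
	 * drive the AArch64 pagetable formats, so hide them rather than BUG.
	 */
	if (!dev_64bit_mmio_supported(smmu->dev)) {
		dev_warn(smmu->dev,
			 "no 64-bit MMIO, disabling AArch64 table formats\n");
		smmu->features &= ~(ARM_SMMU_FEAT_FMT_AARCH64_4K |
				    ARM_SMMU_FEAT_FMT_AARCH64_16K |
				    ARM_SMMU_FEAT_FMT_AARCH64_64K);
	}

...rather than the BUG_ON() above.)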



+
if (smmu->impl && smmu->impl->cfg_probe) {
ret = smmu->impl->cfg_probe(smmu);
if (ret)
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h 
b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index d2a2d1bc58ba..997d13a21717 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -477,15 +477,20 @@ static inline void arm_smmu_writel(struct arm_smmu_device 
*smmu, int page,
  {
if (smmu->impl && unlikely(smmu->impl->write_reg))
smmu->impl->write_reg(smmu, page, offset, val);
-   else
+   else if (dev_64bit_mmio_supported(smmu->dev))
writel_relaxed(val, arm_smmu_page(smmu, page) + offset);
+   else
+   hi_lo_writeq_relaxed(val, arm_smmu_page(smmu, page) + offset);


As Arnd pointed out, this is in completely the wrong place. Also, in 
general it doesn't work if the implementation already needs a hook to 
filter or override register accesses for any other reason. TBH I'm not 
convinced that this isn't *more* of a mess than handling it on a 
SoC-specific basis...


Robin.


  }
  
  static inline u64 arm_smmu_readq(struct arm_smmu_device *smmu, int page, int offset)

  {
if (smmu->impl && unlikely(smmu->impl->read_reg64))
return smmu->impl->read_reg64(smmu, page, offset);
-   return readq_relaxed(arm_smmu_page(smmu, page) + offset);
+   else if (dev_64bit_mmio_supported(smmu->dev))
+   return readq_relaxed(arm_smmu_page(smmu, page) + offset);
+   else
+   return hi_lo_readq_relaxed(arm_smmu_page(smmu, page) + offset);
  }
  
  static inline void arm_smmu_writeq(struct arm_smmu_device *smmu, int page,




Re: Aw: Re: Re: [PATCH 09/13] PCI: mediatek: Advertise lack of MSI handling

2021-03-02 Thread Robin Murphy

On 2021-03-01 14:06, Frank Wunderlich wrote:

Sent: Monday, 01 March 2021 at 14:31
From: "Marc Zyngier" 

Frank,


i guess it's a bug in ath10k driver or my r64 board (it is a v1.1
which has missing capacitors on tx lines).


No, this definitely looks like a bug in the MTK PCIe driver,
where the mutex is either not properly initialised, corrupted,
or the wrong pointer is passed.


but why does it happen only with the ath10k-card and not the mt7612 in
same slot?


Does mt7612 use MSI? What we have here is a bogus mutex in the
MTK PCIe driver, and the only way not to get there would be
to avoid using MSIs.


i guess this card/its driver does not use MSI. Did not found anything in 
"datasheet" [1] or driver [2] about msi


FWIW, no need to guess - `lspci -v` (as root) should tell you whether 
the card has MSI (and/or MSI-X) capability, and whether it is enabled if so.


Robin.




This r64 machine is supposed to have working MSIs, right?


imho mt7622 have working MSI


Do you get the same issue without this series?


tested 5.11.0 [1] without this series (but with your/thomas' patch
from discussion about my old patch) and got same trace. so this series
does not break anything here.


Can you retest without any additional patch on top of 5.11?
These two patches only affect platforms that do *not* have MSIs at all.


i can revert these 2, but still need patches for mt7622 pcie-support [3]...btw. 
i see that i miss these in 5.11-main...do not see traceback with them (have 
firmware not installed...)

root@bpi-r64:~# dmesg | grep ath
[6.450765] ath10k_pci :01:00.0: assign IRQ: got 146
[6.661752] ath10k_pci :01:00.0: enabling device ( -> 0002)
[6.697811] ath10k_pci :01:00.0: enabling bus mastering
[6.721293] ath10k_pci :01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 r
eset_mode 0
[6.921030] ath10k_pci :01:00.0: Failed to find firmware-N.bin (N between
  2 and 6) from ath10k/QCA988X/hw2.0: -2
[6.931698] ath10k_pci :01:00.0: could not fetch firmware files (-2)
[6.940417] ath10k_pci :01:00.0: could not probe fw (-2)

so traceback was caused by missing changes in mtk pcie-driver not yet upstream, 
added Chuanjia Liu




Tried with an mt7612e, this seems to work without any errors.

so for mt7622/mt7623

Tested-by: Frank Wunderlich 


We definitely need to understand the above.


there is a hardware-bug which may cause this...afair i saw this with
the card in r64 with earlier Kernel-versions where other cards work
(like the mt7612e).


I don't think a HW bug affecting PCI would cause what we are seeing
here, unless it results in memory corruption.



[1] 
https://www.asiarf.com/shop/wifi-wlan/wifi_mini_pcie/ws2433-wifi-11ac-mini-pcie-module-manufacturer/
[2] grep -Rni 'msi' drivers/net/wireless/mediatek/mt76/mt76x2/
[3] https://patchwork.kernel.org/project/linux-mediatek/list/?series=372885




Re: [PATCH 1/1] Revert "iommu/iova: Retry from last rb tree node if iova search fails"

2021-03-01 Thread Robin Murphy

On 2021-02-25 13:54, John Garry wrote:

On 29/01/2021 12:03, Robin Murphy wrote:

On 2021-01-29 09:48, Leizhen (ThunderTown) wrote:


Currently, we are thinking about the solution to the problem. 
However, because the end time of v5.11 is approaching, this patch is 
sent first.


However, that commit was made for a reason - how do we justify that 
one thing being slow is more important than another thing being 
completely broken? It's not practical to just keep doing the patch 
hokey-cokey based on whoever shouts loudest :(



On 2021/1/29 17:21, Zhen Lei wrote:

This reverts commit 4e89dce725213d3d0b0475211b500eda4ef4bf2f.

We find that this patch has a great impact on performance. According to
our test, the IOPS decrease from 1655.6K to 893.5K, about half.

Hardware: 1 SAS expander with 12 SAS SSD
Command:  Only the main parameters are listed.
   fio bs=4k rw=read iodepth=128 cpus_allowed=0-127


FWIW, I'm 99% sure that what you really want is [1], but then you get 
to battle against an unknown quantity of dodgy firmware instead.




Something which has not been said before is that this only happens for 
strict mode.


I think that makes sense - once you *have* actually failed to allocate 
from the 32-bit space, max32_alloc_size will make subsequent attempts 
fail immediately. In non-strict mode you're most likely freeing 32-bit 
IOVAs back to the tree - and thus reset max32_alloc_size - much less 
often, and you'll make more total space available each time, both of 
which will amortise the cost of getting back into that failed state 
again. Conversely, the worst case in strict mode is to have multiple 
threads getting into this pathological cycle:


1: allocate, get last available IOVA
2: allocate, fail and set max32_alloc_size
3: free one IOVA, reset max32_alloc_size, goto 1

Now, given the broken behaviour where the cached PFN can get stuck near 
the bottom of the address space, step 2 might well have been faster and 
more premature than it should have, but I hope you can appreciate that 
relying on an allocator being broken at its fundamental purpose of 
allocating is not a good or sustainable thing to do.


While max32_alloc_size indirectly tracks the largest *contiguous* 
available space, one of the ideas from which it grew was to simply keep 
count of the total number of free PFNs. If you're really spending 
significant time determining that the tree is full, as opposed to just 
taking longer to eventually succeed, then it might be relatively 
innocuous to tack on that semi-redundant extra accounting as a 
self-contained quick fix for that worst case.
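
(To make that concrete, the kind of thing I have in mind is no more than 
this - a hand-wavy sketch, with the field and function names invented for 
the sake of the example:

	/* hypothetical extra field in struct iova_domain */
	unsigned long	free_32bit_pfns;	/* total free PFNs below 4GB */

/* cheap "definitely full" check before touching the rbtree at all */
static bool iova_32bit_space_full(struct iova_domain *iovad, unsigned long size)
{
	return iovad->free_32bit_pfns < size;
}

with the counter bumped in __free_iova() and debited on allocation, so the 
worst case costs one comparison instead of a tree walk.)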


Anyway, we see ~50% throughput regression, which is intolerable. As seen 
in [0], I put this down to the fact that we have so many IOVA requests 
which exceed the rcache size limit, which means many RB tree accesses 
for non-cacheable IOVAs, which are now slower.


On another point, as for longterm IOVA aging issue, it seems that there 
is no conclusion there. However I did mention the issue of IOVA sizes 
exceeding rcache size for that issue, so maybe we can find a common 
solution. Similar to a fixed rcache depot size, it seems that having a 
fixed rcache max size range value (at 6) doesn't scale either.


Well, I'd say that's more of a workload tuning thing than a scalability 
one - a massive system with hundreds of CPUs that spends all day 
flinging 1500-byte network packets around as fast as it can might be 
happy with an even smaller value and using the saved memory for 
something else. IIRC the value of 6 is a fairly arbitrary choice for a 
tradeoff between expected utility and memory consumption, so making it a 
Kconfig or command-line tuneable does seem like a sensible thing to explore.


As for 4e89dce72521, so even if it's proper to retry for a failed alloc, 
it is not always necessary. I mean, if we're limiting ourselves to 32b 
subspace for this SAC trick and we fail the alloc, then we can try the 
space above 32b first (if usable). If that fails, then retry there. I 
don't see a need to retry the 32b subspace if we're not limited to it. 
How about it? We tried that idea and it looks to just about restore 
performance.


The thing is, if you do have an actual PCI device where DAC might mean a 
33% throughput loss and you're mapping a long-lived buffer, or you're on 
one of these systems where firmware fails to document address limits and 
using the full IOMMU address width quietly breaks things, then you 
almost certainly *do* want the allocator to actually do a proper job of 
trying to satisfy the given request.


Furthermore, what you propose is still fragile for your own use-case 
anyway. If someone makes internal changes to the allocator - converts it 
to a different tree structure, implements split locking for concurrency, 
that sort of thing - and it fundamentally loses the dodgy cached32_node 
behaviour which makes the initial failure unintentionally fast for your 
workload's allocation pattern, that extra complexity

Re: next/master bisection: baseline.login on r8a77960-ulcb

2021-02-25 Thread Robin Murphy

On 2021-02-25 11:09, Thierry Reding wrote:

On Wed, Feb 24, 2021 at 10:39:42PM +0100, Heiko Thiery wrote:

Hi Christoph and all,

On 23.02.21 10:56, Guillaume Tucker wrote:

Hi Christoph,

Please see the bisection report below about a boot failure on
r8a77960-ulcb on next-20210222.

Reports aren't automatically sent to the public while we're
trialing new bisection features on kernelci.org but this one
looks valid.

The log shows a kernel panic, more details can be found here:

https://kernelci.org/test/case/id/6034bde034504edc9faddd2c/

Please let us know if you need any help to debug the issue or try
a fix on this platform.


I am also seeing this problem on an iMX8MQ board and can help test if you
have a fix.


This is also causing boot failures on Jetson AGX Xavier. The origin is
slightly different from the above kernelci.org report, but the BUG_ON is
the same:

 [2.650447] [ cut here ]
 [2.650588] kernel BUG at include/linux/iommu-helper.h:23!
 [2.650729] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
 [2.654330] Modules linked in:
 [2.657474] CPU: 2 PID: 67 Comm: kworker/2:1 Not tainted 
5.11.0-next-20210225-00025-gfd15609b3a81-dirty #120
 [2.667367] Hardware name: NVIDIA Jetson AGX Xavier Developer Kit (DT)
 [2.674096] Workqueue: events deferred_probe_work_func
 [2.679169] pstate: 40400089 (nZcv daIf +PAN -UAO -TCO BTYPE=--)
 [2.684949] pc : find_slots.isra.0+0x118/0x2f0
 [2.689494] lr : find_slots.isra.0+0x88/0x2f0
 [2.693696] sp : 800011faf950
 [2.697281] x29: 800011faf950 x28: 0001
 [2.702537] x27: 0001 x26: 
 [2.708131] x25: 0001 x24: 000105f03148
 [2.713556] x23: 0001 x22: 800011559000
 [2.718835] x21: 800011559a80 x20: edc0
 [2.724493] x19:  x18: 0020
 [2.729770] x17: 0003ffd7d160 x16: 0068
 [2.735173] x15: 80b43150 x14: 
 [2.740944] x13: 82b5d791 x12: 0040
 [2.746113] x11: a248 x10: 
 [2.751882] x9 :  x8 : 0003fed3
 [2.757139] x7 :  x6 : 
 [2.762818] x5 :  x4 : 
 [2.767984] x3 : 0001e8303148 x2 : 8000
 [2.773580] x1 :  x0 : 001db800
 [2.778662] Call trace:
 [2.781136]  find_slots.isra.0+0x118/0x2f0
 [2.785137]  swiotlb_tbl_map_single+0x80/0x1b4
 [2.789858]  swiotlb_map+0x58/0x200
 [2.793355]  dma_direct_map_page+0x148/0x1c0
 [2.797386]  dma_map_page_attrs+0x2c/0x54
 [2.801411]  dw_pcie_host_init+0x40c/0x4c0
 [2.805633]  tegra_pcie_config_rp+0x7c/0x1f4
 [2.810155]  tegra_pcie_dw_probe+0x3d0/0x60c
 [2.814185]  platform_probe+0x68/0xe0
 [2.817688]  really_probe+0xe4/0x4c0
 [2.821362]  driver_probe_device+0x58/0xc0
 [2.825386]  __device_attach_driver+0xa8/0x104
 [2.829953]  bus_for_each_drv+0x78/0xd0
 [2.833434]  __device_attach+0xdc/0x17c
 [2.837631]  device_initial_probe+0x14/0x20
 [2.841680]  bus_probe_device+0x9c/0xa4
 [2.845160]  deferred_probe_work_func+0x74/0xb0
 [2.849734]  process_one_work+0x1cc/0x350
 [2.853822]  worker_thread+0x20c/0x3ac
 [2.858018]  kthread+0x128/0x134
 [2.860997]  ret_from_fork+0x10/0x34
 [2.864508] Code: ca180063 ea06007f 54fffee1 b50001e7 (d421)
 [2.870547] ---[ end trace e5c50bdcf12b316e ]---
 [2.875087] note: kworker/2:1[67] exited with preempt_count 2
 [2.880836] [ cut here ]

I've confirmed that reverting the following commits makes the system
boot again:

 47cfc5be1934 ("swiotlb: Validate bounce size in the sync/unmap path")
 c6f50c7719e7 ("swiotlb: respect min_align_mask")
 e952d9a1bc20 ("swiotlb: don't modify orig_addr in swiotlb_tbl_sync_single")
 567d877f9a7d ("swiotlb: refactor swiotlb_tbl_map_single")

Let me know if I can help test any fixes for this.


FWIW, this sounds like it's probably the same thing for which a fix 
should be pending:


https://lore.kernel.org/linux-iommu/20210223072514.ga18...@lst.de/T/#u

Robin.


Re: [PATCH 4/7] dma-mapping: add a dma_alloc_noncontiguous API

2021-02-16 Thread Robin Murphy

On 2021-02-02 09:51, Christoph Hellwig wrote:

Add a new API that returns a potentially virtually non-contiguous sg_table
and a DMA address.  This API is only properly implemented for dma-iommu
and will simply return a contiguous chunk as a fallback.

The intent is that media drivers can use this API if either:

  - no kernel mapping or only temporary kernel mappings are required.
That is as a better replacement for DMA_ATTR_NO_KERNEL_MAPPING
  - a kernel mapping is required for cached and DMA mapped pages, but
the driver also needs the pages to e.g. map them to userspace.
In that sense it is a replacement for some aspects of the recently
removed and never fully implemented DMA_ATTR_NON_CONSISTENT

Signed-off-by: Christoph Hellwig 
---
  Documentation/core-api/dma-api.rst |  74 +
  include/linux/dma-map-ops.h|  18 +
  include/linux/dma-mapping.h|  31 +
  kernel/dma/mapping.c   | 103 +
  4 files changed, 226 insertions(+)

diff --git a/Documentation/core-api/dma-api.rst 
b/Documentation/core-api/dma-api.rst
index 157a474ae54416..e24b2447f4bfe6 100644
--- a/Documentation/core-api/dma-api.rst
+++ b/Documentation/core-api/dma-api.rst
@@ -594,6 +594,80 @@ dev, size, dma_handle and dir must all be the same as 
those passed into
  dma_alloc_noncoherent().  cpu_addr must be the virtual address returned by
  dma_alloc_noncoherent().
  
+::

+
+   struct sg_table *
+   dma_alloc_noncontiguous(struct device *dev, size_t size,
+   enum dma_data_direction dir, gfp_t gfp)
+
+This routine allocates ``size`` bytes of non-coherent and possibly
+non-contiguous memory.  It returns a pointer to a struct sg_table that
+describes the allocated and DMA mapped memory, or NULL if the allocation
+failed.  The resulting memory is suitable for uses that need struct pages
+mapped into a scatterlist.
+
+The returned sg_table is guaranteed to have a single DMA mapped segment as
+indicated by sgt->nents, but it might have multiple CPU side segments as
+indicated by sgt->orig_nents.
+
+The dir parameter specifies whether data is read and/or written by the device,
+see dma_map_single() for details.
+
+The gfp parameter allows the caller to specify the ``GFP_`` flags (see
+kmalloc()) for the allocation, but rejects flags used to specify a memory
+zone such as GFP_DMA or GFP_HIGHMEM.
+
+Before giving the memory to the device, dma_sync_sgtable_for_device() needs
+to be called, and before reading memory written by the device,
+dma_sync_sgtable_for_cpu(), just like for streaming DMA mappings that are
+reused.
+
+::
+
+   void
+   dma_free_noncontiguous(struct device *dev, size_t size,
+  struct sg_table *sgt,
+  enum dma_data_direction dir)
+
+Free memory previously allocated using dma_alloc_noncontiguous().  dev, size,
+and dir must all be the same as those passed into dma_alloc_noncontiguous().
+sgt must be the pointer returned by dma_alloc_noncontiguous().
+
+::
+
+   void *
+   dma_vmap_noncontiguous(struct device *dev, size_t size,
+   struct sg_table *sgt)
+
+Return a contiguous kernel mapping for an allocation returned from
+dma_alloc_noncontiguous().  dev and size must be the same as those passed into
+dma_alloc_noncontiguous().  sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
+Once a non-contiguous allocation is mapped using this function, the
+flush_kernel_vmap_range() and invalidate_kernel_vmap_range() APIs must be used
+to manage the coherency of the kernel mapping.


Maybe say something like "coherency between the kernel mapping and any 
userspace mappings"? Otherwise people like me may be easily confused and 
think it's referring to coherency between the kernel mapping and the 
device, where in most cases those APIs won't help at all :)
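
(For the avoidance of doubt, the usage pattern I picture from this document 
is roughly the below - illustrative only, with made-up helpers and no error 
handling, for a device receiving data:

	struct sg_table *sgt;
	void *vaddr;

	sgt = dma_alloc_noncontiguous(dev, size, DMA_FROM_DEVICE, GFP_KERNEL);
	vaddr = dma_vmap_noncontiguous(dev, size, sgt);

	/* hand the single DMA segment to the hardware */
	start_dma(dev, sg_dma_address(sgt->sgl), size);	/* made-up helper */

	/* ...and once it completes, claim the data back for the CPU */
	dma_sync_sgtable_for_cpu(dev, sgt, DMA_FROM_DEVICE);
	invalidate_kernel_vmap_range(vaddr, size);
	consume(vaddr, size);				/* made-up consumer */

	dma_vunmap_noncontiguous(dev, vaddr);
	dma_free_noncontiguous(dev, size, sgt, DMA_FROM_DEVICE);

so it might help to spell out which mappings those vmap-range calls are 
keeping coherent with each other.)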



+
+::
+
+   void
+   dma_vunmap_noncontiguous(struct device *dev, void *vaddr)
+
+Unmap a kernel mapping returned by dma_vmap_noncontiguous().  dev must be the
+same the one passed into dma_alloc_noncontiguous().  vaddr must be the pointer
+returned by dma_vmap_noncontiguous().
+
+
+::
+
+   int
+   dma_mmap_noncontiguous(struct device *dev, struct vm_area_struct *vma,
+  size_t size, struct sg_table *sgt)
+
+Map an allocation returned from dma_alloc_noncontiguous() into a user address
+space.  dev and size must be the same as those passed into
+dma_alloc_noncontiguous().  sgt must be the pointer returned by
+dma_alloc_noncontiguous().
+
  ::
  
  	int

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 11e02537b9e01b..fe46a41130e662 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -22,6 +22,10 @@ struct dma_map_ops {
gfp_t gfp);
void (*free_pages)(struct device *dev, size_t size, struct page *vaddr,
dma_addr_t 

Re: [PATCH v3 1/2] dt-bindings: iommu: add bindings for sprd iommu

2021-02-16 Thread Robin Murphy

On 2021-02-10 19:21, Rob Herring wrote:

On Fri, Feb 5, 2021 at 1:21 AM Chunyan Zhang  wrote:


Hi Rob,

On Fri, 5 Feb 2021 at 07:25, Rob Herring  wrote:


On Wed, Feb 03, 2021 at 05:07:26PM +0800, Chunyan Zhang wrote:

From: Chunyan Zhang 

This iommu module can be used by Unisoc's multimedia devices, such as
display, Image codec(jpeg) and a few signal processors, including
VSP(video), GSP(graphic), ISP(image), and CPP(camera pixel processor), etc.

Signed-off-by: Chunyan Zhang 
---
  .../devicetree/bindings/iommu/sprd,iommu.yaml | 72 +++
  1 file changed, 72 insertions(+)
  create mode 100644 Documentation/devicetree/bindings/iommu/sprd,iommu.yaml

diff --git a/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml 
b/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
new file mode 100644
index ..4fc99e81fa66
--- /dev/null
+++ b/Documentation/devicetree/bindings/iommu/sprd,iommu.yaml
@@ -0,0 +1,72 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+# Copyright 2020 Unisoc Inc.
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/iommu/sprd,iommu.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Unisoc IOMMU and Multi-media MMU
+
+maintainers:
+  - Chunyan Zhang 
+
+properties:
+  compatible:
+enum:
+  - sprd,iommu-v1
+
+  "#iommu-cells":
+const: 0
+description:
+  Unisoc IOMMUs are all single-master IOMMU devices, therefore no
+  additional information needs to associate with its master device.
+  Please refer to the generic bindings document for more details,
+  Documentation/devicetree/bindings/iommu/iommu.txt
+
+  reg:
+maxItems: 1
+description:
+  Not required if 'sprd,iommu-regs' is defined.
+
+  clocks:
+description:
+  Reference to a gate clock phandle, since access to some of IOMMUs are
+  controlled by gate clock, but this is not required.
+
+  sprd,iommu-regs:
+$ref: /schemas/types.yaml#/definitions/phandle-array
+description:
+  Reference to a syscon phandle plus 1 cell, the syscon defines the
+  register range used by the iommu and the media device, the cell
+  defines the offset for iommu registers. Since iommu module shares
+  the same register range with the media device which uses it.
+
+required:
+  - compatible
+  - "#iommu-cells"


OK, so apparently the hardware is not quite as trivial as my initial 
impression, and you should have interrupts as well.



+
+oneOf:
+  - required:
+  - reg
+  - required:
+  - sprd,iommu-regs
+
+additionalProperties: false
+
+examples:
+  - |
+iommu_disp: iommu-disp {
+  compatible = "sprd,iommu-v1";
+  sprd,iommu-regs = <_regs 0x800>;


If the IOMMU is contained within another device, then it should just be
a child node of that device.


Yes, actually IOMMU can be seen as a child of multimedia devices, I
considered moving IOMMU under into multimedia device node, but
multimedia devices need IOMMU when probe[1], so I dropped that idea.


Don't design your binding around working-around linux issues.


Having stumbled across the DRM driver patches the other day, I now see 
where this is coming from, and it's even worse than that - this whole 
binding seems to be largely working around bad driver design.



And they share the same register base, e.g.

+   mm {
+   compatible = "simple-bus";
+   #address-cells = <2>;
+   #size-cells = <2>;
+   ranges;
+
+   dpu_regs: syscon@6300 {


Drop this node.


+   compatible = "sprd,sc9863a-dpuregs", "syscon";
+   reg = <0 0x6300 0 0x1000>;
+   };
+
+   dpu: dpu@6300 {
+   compatible = "sprd,sharkl3-dpu";
+   sprd,disp-regs = <_regs>;


reg = <0 0x6300 0 0x800>;


In fact judging by the other driver it looks like the length only needs 
to be 0x200 here (but maybe there's more to come in future).



+   iommus = <_dispc>;
+   };
+
+   iommu_dispc: iommu@6300 {
+   compatible = "sprd,iommu-v1";
+   sprd,iommu-regs = <_regs 0x800>;


reg = <0 0x63000800 0 0x800>;


...and this one looks to need less than 0x80, even :)




+   #iommu-cells = <0>;


Though given it seems there is only 1 client and this might really be
just 1 h/w block, you don't really need to use the iommu binding at
all. The DPU should be able to instantiate it's own IOMMU device.
There's other examples of this such as mali GPU though that is all one
driver, but that's a Linux implementation detail.


FWIW that's really a very different situation - the MMUs in a Mali GPU 
are fundamental parts of its internal pipelines and would never make 
sense to handle as 

Re: [PATCH v3 4/6] drm/sprd: add Unisoc's drm display controller driver

2021-02-16 Thread Robin Murphy

On 2021-01-05 13:46, Kevin Tang wrote:

Adds DPU(Display Processor Unit) support for the Unisoc's display subsystem.
It's support multi planes, scaler, rotation, PQ(Picture Quality) and more.

Cc: Orson Zhai 
Cc: Chunyan Zhang 
Signed-off-by: Kevin Tang 

v2:
   - Use drm_xxx to replace all DRM_XXX.
   - Use kzalloc to replace devm_kzalloc for sprd_dpu structure init.

v3:
   - Remove dpu_layer stuff layer and commit layers by aotmic_update
---
  drivers/gpu/drm/sprd/Kconfig|   1 +
  drivers/gpu/drm/sprd/Makefile   |   4 +-
  drivers/gpu/drm/sprd/sprd_dpu.c | 985 
  drivers/gpu/drm/sprd/sprd_dpu.h | 120 +
  drivers/gpu/drm/sprd/sprd_drm.c |   1 +
  drivers/gpu/drm/sprd/sprd_drm.h |   2 +
  6 files changed,  insertions(+), 2 deletions(-)
  create mode 100644 drivers/gpu/drm/sprd/sprd_dpu.c
  create mode 100644 drivers/gpu/drm/sprd/sprd_dpu.h

diff --git a/drivers/gpu/drm/sprd/Kconfig b/drivers/gpu/drm/sprd/Kconfig
index 6e80cc9..9b4ef9a 100644
--- a/drivers/gpu/drm/sprd/Kconfig
+++ b/drivers/gpu/drm/sprd/Kconfig
@@ -3,6 +3,7 @@ config DRM_SPRD
depends on ARCH_SPRD || COMPILE_TEST
depends on DRM && OF
select DRM_KMS_HELPER
+   select VIDEOMODE_HELPERS
select DRM_GEM_CMA_HELPER
select DRM_KMS_CMA_HELPER
select DRM_MIPI_DSI
diff --git a/drivers/gpu/drm/sprd/Makefile b/drivers/gpu/drm/sprd/Makefile
index 86d95d9..6c25bfa 100644
--- a/drivers/gpu/drm/sprd/Makefile
+++ b/drivers/gpu/drm/sprd/Makefile
@@ -1,5 +1,5 @@
  # SPDX-License-Identifier: GPL-2.0
  
-subdir-ccflags-y += -I$(srctree)/$(src)

+obj-y := sprd_drm.o \
+   sprd_dpu.o
  
-obj-y := sprd_drm.o

diff --git a/drivers/gpu/drm/sprd/sprd_dpu.c b/drivers/gpu/drm/sprd/sprd_dpu.c
new file mode 100644
index 000..d562d44
--- /dev/null
+++ b/drivers/gpu/drm/sprd/sprd_dpu.c
@@ -0,0 +1,985 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2020 Unisoc Inc.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "sprd_drm.h"
+#include "sprd_dpu.h"
+
+/* Global control registers */
+#define REG_DPU_CTRL   0x04
+#define REG_DPU_CFG0   0x08
+#define REG_PANEL_SIZE 0x20
+#define REG_BLEND_SIZE 0x24
+#define REG_BG_COLOR   0x2C
+
+/* Layer0 control registers */
+#define REG_LAY_BASE_ADDR0 0x30
+#define REG_LAY_BASE_ADDR1 0x34
+#define REG_LAY_BASE_ADDR2 0x38
+#define REG_LAY_CTRL   0x40
+#define REG_LAY_SIZE   0x44
+#define REG_LAY_PITCH  0x48
+#define REG_LAY_POS0x4C
+#define REG_LAY_ALPHA  0x50
+#define REG_LAY_CROP_START 0x5C
+
+/* Interrupt control registers */
+#define REG_DPU_INT_EN 0x1E0
+#define REG_DPU_INT_CLR0x1E4
+#define REG_DPU_INT_STS0x1E8
+
+/* DPI control registers */
+#define REG_DPI_CTRL   0x1F0
+#define REG_DPI_H_TIMING   0x1F4
+#define REG_DPI_V_TIMING   0x1F8
+
+/* MMU control registers */
+#define REG_MMU_EN 0x800
+#define REG_MMU_VPN_RANGE  0x80C
+#define REG_MMU_VAOR_ADDR_RD   0x818
+#define REG_MMU_VAOR_ADDR_WR   0x81C
+#define REG_MMU_INV_ADDR_RD0x820
+#define REG_MMU_INV_ADDR_WR0x824
+#define REG_MMU_PPN1   0x83C
+#define REG_MMU_RANGE1 0x840
+#define REG_MMU_PPN2   0x844
+#define REG_MMU_RANGE2 0x848
+
+/* Global control bits */
+#define BIT_DPU_RUNBIT(0)
+#define BIT_DPU_STOP   BIT(1)
+#define BIT_DPU_REG_UPDATE BIT(2)
+#define BIT_DPU_IF_EDPIBIT(0)
+
+/* Layer control bits */
+#define BIT_DPU_LAY_EN BIT(0)
+#define BIT_DPU_LAY_LAYER_ALPHA(0x01 << 2)
+#define BIT_DPU_LAY_COMBO_ALPHA(0x02 << 2)
+#define BIT_DPU_LAY_FORMAT_YUV422_2PLANE   (0x00 << 4)
+#define BIT_DPU_LAY_FORMAT_YUV420_2PLANE   (0x01 << 4)
+#define BIT_DPU_LAY_FORMAT_YUV420_3PLANE   (0x02 << 4)
+#define BIT_DPU_LAY_FORMAT_ARGB(0x03 << 4)
+#define BIT_DPU_LAY_FORMAT_RGB565  (0x04 << 4)
+#define BIT_DPU_LAY_DATA_ENDIAN_B0B1B2B3   (0x00 << 8)
+#define BIT_DPU_LAY_DATA_ENDIAN_B3B2B1B0   (0x01 << 8)
+#define BIT_DPU_LAY_NO_SWITCH  (0x00 << 10)
+#define BIT_DPU_LAY_RB_OR_UV_SWITCH(0x01 << 10)
+#define BIT_DPU_LAY_MODE_BLEND_NORMAL  (0x00 << 16)
+#define BIT_DPU_LAY_MODE_BLEND_PREMULT (0x01 << 16)
+
+/* Interrupt control & status bits */
+#define BIT_DPU_INT_DONE   BIT(0)
+#define BIT_DPU_INT_TE BIT(1)
+#define BIT_DPU_INT_ERRBIT(2)
+#define BIT_DPU_INT_UPDATE_DONEBIT(4)
+#define BIT_DPU_INT_VSYNC  

Re: [PATCH v2] iommu: Check dev->iommu in iommu_dev_xxx functions

2021-02-16 Thread Robin Murphy

On 2021-02-12 17:28, Shameerali Kolothum Thodi wrote:




-Original Message-
From: Shameerali Kolothum Thodi
Sent: 12 February 2021 16:45
To: 'Robin Murphy' ; linux-kernel@vger.kernel.org;
io...@lists.linux-foundation.org
Cc: j...@8bytes.org; jean-phili...@linaro.org; w...@kernel.org; Zengtao (B)
; linux...@openeuler.org
Subject: RE: [PATCH v2] iommu: Check dev->iommu in iommu_dev_xxx functions




-Original Message-
From: Robin Murphy [mailto:robin.mur...@arm.com]
Sent: 12 February 2021 16:39
To: Shameerali Kolothum Thodi ;
linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org
Cc: j...@8bytes.org; jean-phili...@linaro.org; w...@kernel.org; Zengtao (B)
; linux...@openeuler.org
Subject: Re: [PATCH v2] iommu: Check dev->iommu in iommu_dev_xxx

functions


On 2021-02-12 14:54, Shameerali Kolothum Thodi wrote:

Hi Robin/Joerg,


-Original Message-
From: Shameer Kolothum

[mailto:shameerali.kolothum.th...@huawei.com]

Sent: 01 February 2021 12:41
To: linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org
Cc: j...@8bytes.org; robin.mur...@arm.com; jean-phili...@linaro.org;
w...@kernel.org; Zengtao (B) ;
linux...@openeuler.org
Subject: [Linuxarm] [PATCH v2] iommu: Check dev->iommu in

iommu_dev_xxx

functions

The device iommu probe/attach might have failed leaving dev->iommu
to NULL and device drivers may still invoke these functions resulting
in a crash in iommu vendor driver code. Hence make sure we check that.

Also added iommu_ops to the "struct dev_iommu" and set it if the dev
is successfully associated with an iommu.

Fixes: a3a195929d40 ("iommu: Add APIs for multiple domains per

device")

Signed-off-by: Shameer Kolothum



---
v1 --> v2:
   -Added iommu_ops to struct dev_iommu based on the discussion with

Robin.

   -Rebased against iommu-tree core branch.


A gentle ping on this...


Is there a convincing justification for maintaining yet another copy of
the ops pointer rather than simply dereferencing iommu_dev->ops at point
of use?



TBH, nothing I can think of now. That was mainly the way I interpreted your
suggestion
from the v1.  Now it looks like you didn’t mean it :). I am Ok to rework it to
dereference
it from iommu_dev. Please let me know.


So we can do something like this,

index fd76e2f579fe..5fd31a3cec18 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2865,10 +2865,12 @@ EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
   */
  int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
+   if (dev->iommu && dev->iommu->iommu_dev && dev->iommu->iommu_dev->ops)
+   struct iommu_ops  *ops = dev->iommu->iommu_dev->ops;
  
-   if (ops && ops->dev_enable_feat)

-   return ops->dev_enable_feat(dev, feat);
+   if (ops->dev_enable_feat)
+   return ops->dev_enable_feat(dev, feat);
+   }
  
 return -ENODEV;

  }

Again, not sure we need to do the checking for iommu->dev and ops here. If the
dev->iommu is set, is it safe to assume that we have a valid iommu->iommu_dev
and ops always? (Maybe it is safer to do the checking in case something
else breaks this assumption in future). Please let me know your thoughts.


I think it *could* happen that dev->iommu is set by iommu_fwspec_init() 
but iommu_probe_device() later refuses the device for whatever reason, 
so we would still need to check iommu->iommu_dev to be completely safe. 
We can assume iommu_dev->ops is valid, since if the IOMMU driver has 
returned something bogus from .probe_device then it's a major bug in 
that driver and crashing is the best indicator :)
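
(So, putting those two points together, I'd expect it to end up looking 
something like this - untested, just to illustrate:

int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features feat)
{
	if (dev->iommu && dev->iommu->iommu_dev) {
		const struct iommu_ops *ops = dev->iommu->iommu_dev->ops;

		if (ops->dev_enable_feat)
			return ops->dev_enable_feat(dev, feat);
	}

	return -ENODEV;
}

and similarly for the disable/enabled variants.)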


Robin.



Thanks,
Shameer





Re: [RFC PATCH 1/8] of/device: Allow specifying a custom iommu_spec to of_dma_configure

2021-02-16 Thread Robin Murphy

Hi Mikko,

On 2021-02-08 16:38, Mikko Perttunen wrote:

To allow for more customized device tree bindings that point to IOMMUs,
allow manual specification of iommu_spec to of_dma_configure.

The initial use case for this is with Host1x, where the driver manages
a set of device tree-defined IOMMU contexts that are dynamically
allocated to various users. These contexts don't correspond to
platform devices and are instead attached to dummy devices on a custom
software bus.


I'd suggest taking a closer look at the patches that made this 
of_dma_configure_id() in the first place, and the corresponding bus code 
in fsl-mc. At this level, Host1x sounds effectively identical to DPAA2 
in terms of being a bus of logical devices composed from bits of 
implicit behind-the-scenes hardware. I mean, compare your series title 
to the fact that their identifiers are literally named "Isolation 
Context ID" ;)


Please just use the existing mechanisms to describe a mapping between 
Host1x context IDs and SMMU Stream IDs, rather than what looks like a 
giant hacky mess here.


(This also reminds me I wanted to rip out all the PCI special-cases and 
convert pci_dma_configure() over to passing its own IDs too, so thanks 
for the memory-jog...)


Robin.


Signed-off-by: Mikko Perttunen 
---
  drivers/iommu/of_iommu.c  | 12 
  drivers/of/device.c   |  9 +
  include/linux/of_device.h | 34 +++---
  include/linux/of_iommu.h  |  6 --
  4 files changed, 44 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index e505b9130a1c..3fefa6c63863 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -87,8 +87,7 @@ int of_get_dma_window(struct device_node *dn, const char 
*prefix, int index,
  }
  EXPORT_SYMBOL_GPL(of_get_dma_window);
  
-static int of_iommu_xlate(struct device *dev,

- struct of_phandle_args *iommu_spec)
+int of_iommu_xlate(struct device *dev, struct of_phandle_args *iommu_spec)
  {
const struct iommu_ops *ops;
	struct fwnode_handle *fwnode = &iommu_spec->np->fwnode;
@@ -117,6 +116,7 @@ static int of_iommu_xlate(struct device *dev,
module_put(ops->owner);
return ret;
  }
+EXPORT_SYMBOL_GPL(of_iommu_xlate);
  
  static int of_iommu_configure_dev_id(struct device_node *master_np,

 struct device *dev,
@@ -177,7 +177,8 @@ static int of_iommu_configure_device(struct device_node 
*master_np,
  
  const struct iommu_ops *of_iommu_configure(struct device *dev,

   struct device_node *master_np,
-  const u32 *id)
+  const u32 *id,
+  struct of_phandle_args *iommu_spec)
  {
const struct iommu_ops *ops = NULL;
struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
@@ -209,7 +210,10 @@ const struct iommu_ops *of_iommu_configure(struct device 
*dev,
err = pci_for_each_dma_alias(to_pci_dev(dev),
 of_pci_iommu_init, &info);
} else {
-   err = of_iommu_configure_device(master_np, dev, id);
+   if (iommu_spec)
+   err = of_iommu_xlate(dev, iommu_spec);
+   else
+   err = of_iommu_configure_device(master_np, dev, id);
  
  		fwspec = dev_iommu_fwspec_get(dev);

if (!err && fwspec)
diff --git a/drivers/of/device.c b/drivers/of/device.c
index aedfaaafd3e7..84ada2110c5b 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -88,8 +88,9 @@ int of_device_add(struct platform_device *ofdev)
   * can use a platform bus notifier and handle BUS_NOTIFY_ADD_DEVICE events
   * to fix up DMA configuration.
   */
-int of_dma_configure_id(struct device *dev, struct device_node *np,
-   bool force_dma, const u32 *id)
+int __of_dma_configure(struct device *dev, struct device_node *np,
+   bool force_dma, const u32 *id,
+   struct of_phandle_args *iommu_spec)
  {
const struct iommu_ops *iommu;
const struct bus_dma_region *map = NULL;
@@ -170,7 +171,7 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
dev_dbg(dev, "device is%sdma coherent\n",
coherent ? " " : " not ");
  
-	iommu = of_iommu_configure(dev, np, id);

+   iommu = of_iommu_configure(dev, np, id, iommu_spec);
if (PTR_ERR(iommu) == -EPROBE_DEFER) {
kfree(map);
return -EPROBE_DEFER;
@@ -184,7 +185,7 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
dev->dma_range_map = map;
return 0;
  }
-EXPORT_SYMBOL_GPL(of_dma_configure_id);
+EXPORT_SYMBOL_GPL(__of_dma_configure);
  
  int of_device_register(struct platform_device *pdev)

  {
diff --git 

Re: [PATCH v2] iommu: Check dev->iommu in iommu_dev_xxx functions

2021-02-12 Thread Robin Murphy

On 2021-02-12 14:54, Shameerali Kolothum Thodi wrote:

Hi Robin/Joerg,


-Original Message-
From: Shameer Kolothum [mailto:shameerali.kolothum.th...@huawei.com]
Sent: 01 February 2021 12:41
To: linux-kernel@vger.kernel.org; io...@lists.linux-foundation.org
Cc: j...@8bytes.org; robin.mur...@arm.com; jean-phili...@linaro.org;
w...@kernel.org; Zengtao (B) ;
linux...@openeuler.org
Subject: [Linuxarm] [PATCH v2] iommu: Check dev->iommu in iommu_dev_xxx
functions

The device iommu probe/attach might have failed leaving dev->iommu
to NULL and device drivers may still invoke these functions resulting
in a crash in iommu vendor driver code. Hence make sure we check that.

Also added iommu_ops to the "struct dev_iommu" and set it if the dev
is successfully associated with an iommu.

Fixes: a3a195929d40 ("iommu: Add APIs for multiple domains per device")
Signed-off-by: Shameer Kolothum 
---
v1 --> v2:
  -Added iommu_ops to struct dev_iommu based on the discussion with Robin.
  -Rebased against iommu-tree core branch.


A gentle ping on this...


Is there a convincing justification for maintaining yet another copy of 
the ops pointer rather than simply dereferencing iommu_dev->ops at point 
of use?


Robin.


Thanks,
Shameer


---
  drivers/iommu/iommu.c | 19 +++
  include/linux/iommu.h |  2 ++
  2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fd76e2f579fe..6023d0b7c542 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -217,6 +217,7 @@ static int __iommu_probe_device(struct device *dev,
struct list_head *group_list
}

dev->iommu->iommu_dev = iommu_dev;
+   dev->iommu->ops = iommu_dev->ops;

group = iommu_group_get_for_dev(dev);
if (IS_ERR(group)) {
@@ -2865,10 +2866,8 @@ EXPORT_SYMBOL_GPL(iommu_fwspec_add_ids);
   */
  int iommu_dev_enable_feature(struct device *dev, enum
iommu_dev_features feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
-
-   if (ops && ops->dev_enable_feat)
-   return ops->dev_enable_feat(dev, feat);
+   if (dev->iommu && dev->iommu->ops->dev_enable_feat)
+   return dev->iommu->ops->dev_enable_feat(dev, feat);

return -ENODEV;
  }
@@ -2881,10 +2880,8 @@
EXPORT_SYMBOL_GPL(iommu_dev_enable_feature);
   */
  int iommu_dev_disable_feature(struct device *dev, enum
iommu_dev_features feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
-
-   if (ops && ops->dev_disable_feat)
-   return ops->dev_disable_feat(dev, feat);
+   if (dev->iommu && dev->iommu->ops->dev_disable_feat)
+   return dev->iommu->ops->dev_disable_feat(dev, feat);

return -EBUSY;
  }
@@ -2892,10 +2889,8 @@
EXPORT_SYMBOL_GPL(iommu_dev_disable_feature);

  bool iommu_dev_feature_enabled(struct device *dev, enum
iommu_dev_features feat)
  {
-   const struct iommu_ops *ops = dev->bus->iommu_ops;
-
-   if (ops && ops->dev_feat_enabled)
-   return ops->dev_feat_enabled(dev, feat);
+   if (dev->iommu && dev->iommu->ops->dev_feat_enabled)
+   return dev->iommu->ops->dev_feat_enabled(dev, feat);

return false;
  }
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 524ffc2ff64f..ff0c76bdfb67 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -354,6 +354,7 @@ struct iommu_fault_param {
   * @fault_param: IOMMU detected device fault reporting data
   * @fwspec:IOMMU fwspec data
   * @iommu_dev: IOMMU device this device is linked to
+ * @ops:iommu-ops for talking to the iommu_dev
   * @priv:  IOMMU Driver private data
   *
   * TODO: migrate other per device data pointers under iommu_dev_data,
e.g.
@@ -364,6 +365,7 @@ struct dev_iommu {
struct iommu_fault_param*fault_param;
struct iommu_fwspec *fwspec;
struct iommu_device *iommu_dev;
+   const struct iommu_ops  *ops;
void*priv;
  };

--
2.17.1


Re: bcm2711_thermal: Kernel panic - not syncing: Asynchronous SError Interrupt

2021-02-10 Thread Robin Murphy

On 2021-02-10 13:15, Nicolas Saenz Julienne wrote:

[ Add Robin, Catalin and Florian in case they want to chime in ]

Hi Juerg, thanks for the report!

On Wed, 2021-02-10 at 11:48 +0100, Juerg Haefliger wrote:

Trying to dump the BCM2711 registers kills the kernel:

# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/range
0-efc
# cat /sys/kernel/debug/regmap/dummy-avs-monitor\@fd5d2000/registers

[   62.857661] SError Interrupt on CPU1, code 0xbf02 -- SError


So ESR's IDS (bit 24) is set, which means it's an 'Implementation Defined
SError,' hence IIUC the rest of the error code is meaningless to anyone outside
of Broadcom/RPi.


It's imp-def from the architecture's PoV, but the implementation in this 
case is Cortex-A72, where 0x02 means an attributable, containable 
Slave Error:


https://developer.arm.com/documentation/100095/0003/system-control/aarch64-register-descriptions/exception-syndrome-register--el1-and-el3?lang=en

In other words, the thing at the other end of an interconnect 
transaction said "no" :)


(The fact that Cortex-A72 gets too far ahead of itself to take it as a 
synchronous external abort is a mild annoyance, but hey...)



The regmap is created through the following syscon device:

avs_monitor: avs-monitor@7d5d2000 {
compatible = "brcm,bcm2711-avs-monitor",
 "syscon", "simple-mfd";
reg = <0x7d5d2000 0xf00>;

thermal: thermal {
compatible = "brcm,bcm2711-thermal";
#thermal-sensor-cells = <0>;
};
};

I've done some tests with devmem, and the whole <0x7d5d2000 0xf00> range is
full of addresses that trigger this same error. Also note that as per Florian's
comments[1]: "AVS_RO_REGISTERS_0: 0x7d5d2200 - 0x7d5d22e3." But from what I can
tell, at least 0x7d5d22b0 seems to be faulty too.

Any ideas/comments? My guess is that those addresses are marked somehow as
secure, and only for VC4 to access (VC4 is RPi4's co-processor). Ultimately,
the solution is to narrow the register range exposed by avs-monitor to whatever
bcm2711-thermal needs (which is ATM a single 32bit register).


When a peripheral decodes a region of address space, nobody says it has 
to accept accesses to *every* address in that space; registers may be 
sparsely populated, and although some devices might be "nice" and make 
unused areas behave as RAZ/WI, others may throw slave errors if you poke 
at the wrong places. As you note, in a TrustZone-aware device some 
registers may only exist in one or other of the Secure/Non-Secure 
address spaces.


Even when there is a defined register at a given address, it still 
doesn't necessarily accept all possible types of access; it wouldn't be 
particularly friendly, but a device *could* have, say, some registers 
that support 32-bit accesses and others that only support 16-bit 
accesses, and thus throw slave errors if you do the wrong thing in the 
wrong place.


It really all depends on the device itself.

Robin.



Regards,
Nicolas

[1] 
https://lore.kernel.org/linux-pm/82125042-684a-b4e2-fbaa-45a393b2c...@gmx.net/


[   62.857671] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
[   62.857674] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[   62.857676] pstate: 2085 (nzCv daIf -PAN -UAO -TCO BTYPE=--)
[   62.857679] pc : regmap_mmio_read32le+0x1c/0x34
[   62.857681] lr : regmap_mmio_read+0x50/0x80
[   62.857682] sp : 8000105c3c00
[   62.857685] x29: 8000105c3c00 x28: 0014
[   62.857694] x27: 0014 x26: d2ea1c2060b0
[   62.857699] x25: 4e34408ecc00 x24: 0efc
[   62.857704] x23: 8000105c3e20 x22: 8000105c3d3c
[   62.857710] x21: 8000105c3d3c x20: 0014
[   62.857715] x19: 4e344037a900 x18: 0020
[   62.857720] x17:  x16: 
[   62.857725] x15: 4e3447ac40f0 x14: 0003
[   62.857730] x13: 4e34422c x12: 4e34422a0046
[   62.857735] x11: d2ea1c8765e0 x10: 
[   62.857741] x9 : d2ea1b9495a0 x8 : 4e34429ef980
[   62.857746] x7 : 000f x6 : 4e34422a004b
[   62.857751] x5 : fff9 x4 : 
[   62.857757] x3 : d2ea1b949550 x2 : d2ea1b949330
[   62.857761] x1 : 0014 x0 : 
[   62.857767] Kernel panic - not syncing: Asynchronous SError Interrupt
[   62.857770] CPU: 1 PID: 478 Comm: cat Not tainted 5.11.0-rc7 #4
[   62.857773] Hardware name: Raspberry Pi 4 Model B Rev 1.2 (DT)
[   62.857775] Call trace:
[   62.85]  dump_backtrace+0x0/0x1e0
[   62.857778]  show_stack+0x24/0x70
[   62.857780]  dump_stack+0xd0/0x12c
[   62.857782]  panic+0x168/0x370
[   62.857783]  nmi_panic+0x98/0xa0
[   62.857786]  arm64_serror_panic+0x8c/0x98
[   62.857787]  do_serror+0x3c/0x6c
[   62.857789]  el1_error+0x78/0xf0
[   62.857791]  regmap_mmio_read32le+0x1c/0x34
[   62.857793]  

Re: DMA direct mapping fix for 5.4 and earlier stable branches

2021-02-09 Thread Robin Murphy

On 2021-02-09 12:36, Sumit Garg wrote:

Hi Christoph,

On Tue, 9 Feb 2021 at 15:06, Christoph Hellwig  wrote:


On Tue, Feb 09, 2021 at 10:23:12AM +0100, Greg KH wrote:

   From the viewpoint of ZeroCopy using DMABUF, is 5.4 not
mature enough, and is 5.10 mature enough?
   This is the most important point for judging migration.


How do you judge "mature"?

And again, if a feature isn't present in a specific kernel version, why
would you think that it would be a viable solution for you to use?


I'm pretty sure dma_get_sgtable has been around much longer and was
supposed to work, but only really did work properly for arm32, and
for platforms with coherent DMA.  I bet he is using non-coherent arm64,


It's an arm64 platform using coherent DMA where device coherent DMA
memory pool is defined in the DT as follows:

 reserved-memory {
 #address-cells = <2>;
 #size-cells = <2>;
 ranges;

 
 encbuffer: encbuffer@0xb000 {
 compatible = "shared-dma-pool";
 reg = <0 0xb000 0 0x0800>; // this
area used with dma-coherent
 no-map;
 };
 
 };

Device is dma-coherent as per following DT property:

 codec {
 compatible = "socionext,uniphier-pxs3-codec";
 
 memory-region = <>;
 dma-coherent;
 
 };

And call chain to create device coherent DMA pool is as follows:

rmem_dma_device_init();
   dma_init_coherent_memory();
 memremap();
   ioremap_wc();

which simply maps coherent DMA memory into vmalloc space on arm64.

The thing I am unclear about is why this is called a new feature rather
than a bug in dma_common_get_sgtable(), which is failing to handle vmalloc
addresses, while at the same time the DMA debug APIs specifically handle
vmalloc addresses [1].


It's not a bug, it's a fundamental design failure. dma_get_sgtable() has 
only ever sort-of-worked for DMA buffers that come from CMA or regular 
page allocations. In particular, a "no-map" DMA pool is not backed by 
kernel memory, so does not have any corresponding page structs, so it's 
impossible to generate a *valid* scatterlist to represent memory from 
that pool, regardless of what you might get away with provided you don't 
poke too hard at it.
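
(For context, the common implementation essentially boils down to the 
below - paraphrased from memory, so treat the details as approximate - 
which is exactly why it falls apart once cpu_addr has no backing struct 
page:

int dma_common_get_sgtable(struct device *dev, struct sg_table *sgt,
			   void *cpu_addr, dma_addr_t dma_addr, size_t size,
			   unsigned long attrs)
{
	/* assumes the buffer is ordinary kernel memory with a struct page */
	struct page *page = virt_to_page(cpu_addr);
	int ret;

	ret = sg_alloc_table(sgt, 1, GFP_KERNEL);
	if (!ret)
		sg_set_page(sgt->sgl, page, PAGE_ALIGN(size), 0);
	return ret;
}

Any scatterlist it hands back for a "no-map" pool is meaningless at best.)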


It is not a good API...

Robin.



[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/kernel/dma/debug.c?h=linux-5.4.y#n1462

-Sumit


and it would be broken for other drivers there as well if people did
test them, which they apparently so far did not.


Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-09 Thread Robin Murphy

On 2021-02-09 11:57, Yi Sun wrote:

On 21-02-07 18:40:36, Keqian Zhu wrote:

Hi Yi,

On 2021/2/7 17:56, Yi Sun wrote:

Hi,

On 21-01-28 23:17:41, Keqian Zhu wrote:

[...]


+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, &iommu->domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_split_block(d->domain, dma->iova, dma->size);
+   }
+}


This should be a switch to prepare for dirty log start. Per Intel
Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
It enables Accessed/Dirty Flags in second-level paging entries.
So, a generic iommu interface here is better. For Intel iommu, it
enables SLADE. For ARM, it splits block.

Indeed, a generic interface name is better.

The vendor iommu driver performs vendor-specific actions to start dirty logging, and 
Intel iommu and ARM smmu may differ. Besides, we may add more actions in the ARM 
smmu driver in future.

One question: though I am not familiar with the Intel iommu, I think it should also 
split block mappings besides enabling SLADE. Right?


I am not familiar with the ARM smmu. :) So I want to clarify whether the block
in the smmu is a big page, e.g. a 2M page? Intel VT-d manages memory per
page, 4KB/2MB/1GB.


Indeed, what you call large pages, we call blocks :)

Robin.


There are two ways to manage dirty pages.
1. Keep the default granularity. Just set SLADE to enable dirty tracking.
2. Split big pages to 4KB to get finer granularity.

But the question about the second solution is whether it can benefit user
space, e.g. live migration. If my understanding of the smmu block (i.e.
the big page) is correct, have you collected some performance data to
prove that the split can improve performance? Thanks!


Thanks,
Keqian




Re: [RFC PATCH 10/11] vfio/iommu_type1: Optimize dirty bitmap population based on iommu HWDBM

2021-02-09 Thread Robin Murphy

On 2021-02-07 09:56, Yi Sun wrote:

Hi,

On 21-01-28 23:17:41, Keqian Zhu wrote:

[...]


+static void vfio_dma_dirty_log_start(struct vfio_iommu *iommu,
+struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, >domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_split_block(d->domain, dma->iova, dma->size);
+   }
+}


This should be a switch to prepare for dirty log start. Per Intel
Vtd spec, there is SLADE defined in Scalable-Mode PASID Table Entry.
It enables Accessed/Dirty Flags in second-level paging entries.
So, a generic iommu interface here is better. For Intel iommu, it
enables SLADE. For ARM, it splits block.


From a quick look, VT-D's SLADE and SMMU's HTTU appear to be the exact 
same thing. This step isn't about enabling or disabling that feature 
itself (the proposal for SMMU is to simply leave HTTU enabled all the 
time), it's about controlling the granularity at which the dirty status 
can be detected/reported at all, since that's tied to the pagetable 
structure.


However, if an IOMMU were to come along with some other way of reporting 
dirty status that didn't depend on the granularity of individual 
mappings, then indeed it wouldn't need this operation.


Robin.


+
+static void vfio_dma_dirty_log_stop(struct vfio_iommu *iommu,
+   struct vfio_dma *dma)
+{
+   struct vfio_domain *d;
+
+   list_for_each_entry(d, >domain_list, next) {
+   /* Go through all domain anyway even if we fail */
+   iommu_merge_page(d->domain, dma->iova, dma->size,
+d->prot | dma->prot);
+   }
+}


Same as above comment, a generic interface is required here.


+
+static void vfio_iommu_dirty_log_switch(struct vfio_iommu *iommu, bool start)
+{
+   struct rb_node *n;
+
+   /* Split and merge even if all iommu don't support HWDBM now */
+   for (n = rb_first(>dma_list); n; n = rb_next(n)) {
+   struct vfio_dma *dma = rb_entry(n, struct vfio_dma, node);
+
+   if (!dma->iommu_mapped)
+   continue;
+
+   /* Go through all dma range anyway even if we fail */
+   if (start)
+   vfio_dma_dirty_log_start(iommu, dma);
+   else
+   vfio_dma_dirty_log_stop(iommu, dma);
+   }
+}
+
  static int vfio_iommu_type1_dirty_pages(struct vfio_iommu *iommu,
unsigned long arg)
  {
@@ -2812,8 +2900,10 @@ static int vfio_iommu_type1_dirty_pages(struct 
vfio_iommu *iommu,
pgsize = 1 << __ffs(iommu->pgsize_bitmap);
if (!iommu->dirty_page_tracking) {
ret = vfio_dma_bitmap_alloc_all(iommu, pgsize);
-   if (!ret)
+   if (!ret) {
iommu->dirty_page_tracking = true;
+   vfio_iommu_dirty_log_switch(iommu, true);
+   }
}
mutex_unlock(>lock);
return ret;
@@ -2822,6 +2912,7 @@ static int vfio_iommu_type1_dirty_pages(struct vfio_iommu 
*iommu,
if (iommu->dirty_page_tracking) {
iommu->dirty_page_tracking = false;
vfio_dma_bitmap_free_all(iommu);
+   vfio_iommu_dirty_log_switch(iommu, false);
}
mutex_unlock(>lock);
return 0;
--
2.19.1




Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Robin Murphy

On 2021-02-05 11:48, Robin Murphy wrote:

On 2021-02-05 09:13, Keqian Zhu wrote:

Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of TTD by hardware. It is
essential to track dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
   include/linux/io-pgtable.h  |  1 +
   3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct 
iommu_domain *domain,

   .pgsize_bitmap    = smmu->pgsize_bitmap,
   .ias    = ias,
   .oas    = oas,
+    .httu_hd    = smmu->features & ARM_SMMU_FEAT_HTTU_HD,
   .coherent_walk    = smmu->features & 
ARM_SMMU_FEAT_COHERENCY,

   .tlb    = _smmu_flush_ops,
   .iommu_dev    = smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)

   if (reg & IDR0_HYP)
   smmu->features |= ARM_SMMU_FEAT_HYP;
   +    switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this 
to be meaningful. Jean-Philippe is already carrying a suitable patch 
in the SVA stack[1].

Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register 
info unconditionally. I have some concern about it:


If the override flag has HTTU but hardware doesn't support it, then 
driver will use this feature but receive access fault or permission 
fault from SMMU unexpectedly.

1) If IOPF is not supported, then kernel can not work normally.
2) If IOPF is supported, kernel will perform useless actions, such as 
HTTU based dma dirty tracking (this series).


Yes, if the IORT describes the SMMU incorrectly, things will not work 
well. Just like if it describes the wrong base address or the wrong 
interrupt numbers, things will also not work well. The point is that 
incorrect firmware can be updated in the field fairly easily; incorrect 
hardware can not.


Say the SMMU designer hard-codes the ID register field to 0x2 because 
the SMMU itself is capable of HTTU, and they assume it's always going to 
be wired up coherently, but then a customer integrates it to a 
non-coherent interconnect. Firmware needs to override that value to 
prevent an OS thinking that the claimed HTTU capability is ever going to 
work.


Or say the SMMU *is* integrated correctly, but due to an erratum 
discovered later in the interconnect or SMMU itself, it turns out DBM 
doesn't always work reliably, but AF is still OK. Firmware needs to 
downgrade the indicated level of support from that which was intended to 
that which works reliably.


Or say someone forgets to set an integration tieoff so their SMMU 
reports 0x0 even though it and the interconnect *can* happily support 
HTTU. In that case, firmware may want to upgrade the value to *allow* an 
OS to use HTTU despite the ID register being wrong.


As the IORT spec doesn't give an explicit explanation for HTTU 
override, can we comprehend it as a mask for HTTU related hardware 
register?

So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;


No, it literally states that the OS must use the value of the firmware 
field *instead* of the value from the hardware field.


Oops, apologies for an oversight there - I've been reviewing IORT spec 
updates lately so naturally had the newest version open already. Turns 
out these descriptions were only clarified in the most recent release, 
so if you were looking at an older document they *were* horribly vague.


Robin.


+    case IDR0_HTTU_NONE:
+    break;
+    case IDR0_HTTU_HA:
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+    break;
+    case IDR0_HTTU_HAD:
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+    smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+    break;
+    default:
+    dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+    return -ENXIO;
+    }
+
   /*
    * The coherency feature as set by FW is used in preference 
to the ID

    * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h

index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
   #define IDR0_ASID1

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-05 Thread Robin Murphy

On 2021-02-05 09:13, Keqian Zhu wrote:

Hi Robin and Jean,

On 2021/2/5 3:50, Robin Murphy wrote:

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of TTD by hardware. It is
essential to track dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
   include/linux/io-pgtable.h  |  1 +
   3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
   .pgsize_bitmap= smmu->pgsize_bitmap,
   .ias= ias,
   .oas= oas,
+.httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
   .coherent_walk= smmu->features & ARM_SMMU_FEAT_COHERENCY,
   .tlb= _smmu_flush_ops,
   .iommu_dev= smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
   if (reg & IDR0_HYP)
   smmu->features |= ARM_SMMU_FEAT_HYP;
   +switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this to be 
meaningful. Jean-Philippe is already carrying a suitable patch in the SVA 
stack[1].

Robin, Thanks for pointing it out.

Jean, I see that the IORT HTTU flag overrides the hardware register info 
unconditionally. I have some concern about it:

If the override flag has HTTU but hardware doesn't support it, then driver will 
use this feature but receive access fault or permission fault from SMMU 
unexpectedly.
1) If IOPF is not supported, then kernel can not work normally.
2) If IOPF is supported, kernel will perform useless actions, such as HTTU 
based dma dirty tracking (this series).


Yes, if the IORT describes the SMMU incorrectly, things will not work 
well. Just like if it describes the wrong base address or the wrong 
interrupt numbers, things will also not work well. The point is that 
incorrect firmware can be updated in the field fairly easily; incorrect 
hardware can not.


Say the SMMU designer hard-codes the ID register field to 0x2 because 
the SMMU itself is capable of HTTU, and they assume it's always going to 
be wired up coherently, but then a customer integrates it to a 
non-coherent interconnect. Firmware needs to override that value to 
prevent an OS thinking that the claimed HTTU capability is ever going to 
work.


Or say the SMMU *is* integrated correctly, but due to an erratum 
discovered later in the interconnect or SMMU itself, it turns out DBM 
doesn't always work reliably, but AF is still OK. Firmware needs to 
downgrade the indicated level of support from that which was intended to 
that which works reliably.


Or say someone forgets to set an integration tieoff so their SMMU 
reports 0x0 even though it and the interconnect *can* happily support 
HTTU. In that case, firmware may want to upgrade the value to *allow* an 
OS to use HTTU despite the ID register being wrong.



As the IORT spec doesn't give an explicit explanation for HTTU override, can we 
comprehend it as a mask for HTTU related hardware register?
So the logic becomes: smmu->feature = HTTU override & IDR0_HTTU;


No, it literally states that the OS must use the value of the firmware 
field *instead* of the value from the hardware field.
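In other words, something along these lines (a hypothetical sketch of the
probe logic, not the actual patch; fw_httu_valid/fw_httu are made-up names
standing for whatever the IORT parsing provides):

	u32 httu = FIELD_GET(IDR0_HTTU, reg);

	if (fw_httu_valid)	/* firmware (IORT) supplied an HTTU override */
		httu = fw_httu;	/* used *instead of* the ID register field */

	switch (httu) {
	case IDR0_HTTU_NONE:
		break;
	case IDR0_HTTU_HA:
		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
		break;
	case IDR0_HTTU_HAD:
		smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
		smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
		break;
	}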



+case IDR0_HTTU_NONE:
+break;
+case IDR0_HTTU_HA:
+smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+break;
+case IDR0_HTTU_HAD:
+smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+break;
+default:
+dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+return -ENXIO;
+}
+
   /*
* The coherency feature as set by FW is used in preference to the ID
* register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
   #define IDR0_ASID16(1 << 12)
   #define IDR0_ATS(1 << 10)
   #define IDR0_HYP(1 << 9)
+#define IDR0_HTTUGENMASK(7, 6)
+#define IDR0_HTTU_NONE0
+#define IDR0_HTTU_HA1
+#define IDR0_HTTU_HAD2
   #define IDR0_COHACC(1 << 4)
   #define IDR0_TTFG

Re: [PATCH] dma-mapping: benchmark: pretend DMA is transmitting

2021-02-04 Thread Robin Murphy

On 2021-02-04 22:58, Barry Song wrote:

In a real dma mapping user case, after dma_map is done, data will be
transmitted. Thus, in a multi-threaded user scenario, IOMMU contention
should not be that severe. For example, if users enable multiple
threads to send network packets through 1G/10G/100Gbps NIC, usually
the steps will be: map -> transmission -> unmap.  Transmission delay
reduces the contention of IOMMU. Here a delay is added to simulate
the transmission for TX case so that the tested result could be
more accurate.

RX case would be much more tricky. It is not supported yet.


I guess it might be a reasonable approximation to map several pages, 
then unmap them again after a slightly more random delay. Or maybe 
divide the threads into pairs of mappers and unmappers respectively 
filling up and draining proper little buffer pools.
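As a rough sketch of the first idea, inside the benchmark thread loop
(hypothetical, just to show the shape of it - prandom_u32_max() or similar
would provide the jitter):

	dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir);
	if (unlikely(dma_mapping_error(map->dev, dma_addr)))
		goto out;

	/* hold the mapping for a randomised "transfer" time, 0..dma_trans_ns */
	ndelay(prandom_u32_max(map->bparam.dma_trans_ns + 1));

	dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);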



Signed-off-by: Barry Song 
---
  kernel/dma/map_benchmark.c  | 11 +++
  tools/testing/selftests/dma/dma_map_benchmark.c | 17 +++--
  2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c
index 1b1b8ff875cb..1976db7e34e4 100644
--- a/kernel/dma/map_benchmark.c
+++ b/kernel/dma/map_benchmark.c
@@ -21,6 +21,7 @@
  #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
  #define DMA_MAP_MAX_THREADS   1024
  #define DMA_MAP_MAX_SECONDS   300
+#define DMA_MAP_MAX_TRANS_DELAY(10 * 1000 * 1000) /* 10ms */


Using MSEC_PER_SEC might be sufficiently self-documenting?


  #define DMA_MAP_BIDIRECTIONAL 0
  #define DMA_MAP_TO_DEVICE 1
@@ -36,6 +37,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* time for DMA transmission in ns */
__u64 expansion[10];/* For future use */
  };
  
@@ -87,6 +89,10 @@ static int map_benchmark_thread(void *data)

map_etime = ktime_get();
map_delta = ktime_sub(map_etime, map_stime);
  
+		/* Pretend DMA is transmitting */

+   if (map->dir != DMA_FROM_DEVICE)
+   ndelay(map->bparam.dma_trans_ns);


TBH I think the option of a fixed delay between map and unmap might be a 
handy thing in general, so having the direction check at all seems 
needlessly restrictive. As long as the driver implements all the basic 
building blocks, combining them to simulate specific traffic patterns 
can be left up to the benchmark tool.


Robin.


+
unmap_stime = ktime_get();
dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir);
unmap_etime = ktime_get();
@@ -218,6 +224,11 @@ static long map_benchmark_ioctl(struct file *file, 
unsigned int cmd,
return -EINVAL;
}
  
+		if (map->bparam.dma_trans_ns > DMA_MAP_MAX_TRANS_DELAY) {

+   pr_err("invalid transmission delay\n");
+   return -EINVAL;
+   }
+
if (map->bparam.node != NUMA_NO_NODE &&
!node_possible(map->bparam.node)) {
pr_err("invalid numa node\n");
diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c 
b/tools/testing/selftests/dma/dma_map_benchmark.c
index 7065163a8388..dbf426e2fb7f 100644
--- a/tools/testing/selftests/dma/dma_map_benchmark.c
+++ b/tools/testing/selftests/dma/dma_map_benchmark.c
@@ -14,6 +14,7 @@
  #define DMA_MAP_BENCHMARK _IOWR('d', 1, struct map_benchmark)
  #define DMA_MAP_MAX_THREADS   1024
  #define DMA_MAP_MAX_SECONDS 300
+#define DMA_MAP_MAX_TRANS_DELAY(10 * 1000 * 1000) /* 10ms */
  
  #define DMA_MAP_BIDIRECTIONAL	0

  #define DMA_MAP_TO_DEVICE 1
@@ -35,6 +36,7 @@ struct map_benchmark {
__s32 node; /* which numa node this benchmark will run on */
__u32 dma_bits; /* DMA addressing capability */
__u32 dma_dir; /* DMA data direction */
+   __u32 dma_trans_ns; /* delay for DMA transmission in ns */
__u64 expansion[10];/* For future use */
  };
  
@@ -45,12 +47,12 @@ int main(int argc, char **argv)

/* default single thread, run 20 seconds on NUMA_NO_NODE */
int threads = 1, seconds = 20, node = -1;
/* default dma mask 32bit, bidirectional DMA */
-   int bits = 32, dir = DMA_MAP_BIDIRECTIONAL;
+   int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL;
  
  	int cmd = DMA_MAP_BENCHMARK;

char *p;
  
-	while ((opt = getopt(argc, argv, "t:s:n:b:d:")) != -1) {

+   while ((opt = getopt(argc, argv, "t:s:n:b:d:x:")) != -1) {
switch (opt) {
case 't':
threads = atoi(optarg);
@@ -67,6 +69,9 @@ int main(int argc, char **argv)
case 'd':
dir = atoi(optarg);
break;
+   case 'x':
+  

Re: [RFC PATCH 06/11] iommu/arm-smmu-v3: Scan leaf TTD to sync hardware dirty log

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

During dirty log tracking, the user will try to retrieve the dirty log from
the iommu if it supports a hardware dirty log. This adds a new interface
named sync_dirty_log in the iommu layer and arm smmuv3 implements it,
which scans the leaf TTD and treats it as dirty if it is writable (as we
only enable HTTU for stage 1, check that AP[2] is not set).

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 +++
  drivers/iommu/io-pgtable-arm.c  | 90 +
  drivers/iommu/iommu.c   | 41 ++
  include/linux/io-pgtable.h  |  4 +
  include/linux/iommu.h   | 17 
  5 files changed, 179 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2434519e4bb6..43d0536b429a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2548,6 +2548,32 @@ static size_t arm_smmu_merge_page(struct iommu_domain 
*domain, unsigned long iov
return ops->merge_page(ops, iova, paddr, size, prot);
  }
  
+static int arm_smmu_sync_dirty_log(struct iommu_domain *domain,

+  unsigned long iova, size_t size,
+  unsigned long *bitmap,
+  unsigned long base_iova,
+  unsigned long bitmap_pgshift)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & ARM_SMMU_FEAT_HTTU_HD)) {
+   dev_err(smmu->dev, "don't support HTTU_HD and sync dirty 
log\n");
+   return -EPERM;
+   }
+
+   if (!ops || !ops->sync_dirty_log) {
+   pr_err("don't support sync dirty log\n");
+   return -ENODEV;
+   }
+
+   /* To ensure all inflight transactions are completed */
+   arm_smmu_flush_iotlb_all(domain);


What about transactions that arrive between the point that this 
completes, and the point - potentially much later - that we actually 
access any given PTE during the walk? I don't see what this is supposed 
to be synchronising against, even if it were just a CMD_SYNC (I 
especially don't see why we'd want to knock out the TLBs).



+
+   return ops->sync_dirty_log(ops, iova, size, bitmap,
+   base_iova, bitmap_pgshift);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2649,6 +2675,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
.merge_page = arm_smmu_merge_page,
+   .sync_dirty_log = arm_smmu_sync_dirty_log,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index 17390f258eb1..6cfe1ef3fedd 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -877,6 +877,95 @@ static size_t arm_lpae_merge_page(struct io_pgtable_ops 
*ops, unsigned long iova
return __arm_lpae_merge_page(data, iova, paddr, size, lvl, ptep, prot);
  }
  
+static int __arm_lpae_sync_dirty_log(struct arm_lpae_io_pgtable *data,

+unsigned long iova, size_t size,
+int lvl, arm_lpae_iopte *ptep,
+unsigned long *bitmap,
+unsigned long base_iova,
+unsigned long bitmap_pgshift)
+{
+   arm_lpae_iopte pte;
+   struct io_pgtable *iop = >iop;
+   size_t base, next_size;
+   unsigned long offset;
+   int nbits, ret;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return -EINVAL;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return -EINVAL;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt)) {
+   if (pte & ARM_LPAE_PTE_AP_RDONLY)
+   return 0;
+
+   /* It is writable, set the bitmap */
+   nbits = size >> bitmap_pgshift;
+   offset = (iova - base_iova) >> bitmap_pgshift;
+   bitmap_set(bitmap, offset, nbits);
+   return 0;
+   } else {
+   /* To traverse next level */
+   next_size = ARM_LPAE_BLOCK_SIZE(lvl + 1, data);
+  

Re: [RFC PATCH 05/11] iommu/arm-smmu-v3: Merge a span of page to block descriptor

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

When stopping dirty log tracking, we need to recover all block descriptors
which were split when dirty log tracking started. This adds a new
interface named merge_page in the iommu layer and arm smmuv3 implements it,
which reinstalls block mappings and unmaps the span of page mappings.

It is the caller's duty to find contiguous physical memory.

During page merging, other interfaces are not expected to be working, 
so race conditions do not exist. And we flush all iotlbs after the merge 
procedure is completed to ease the pressure on the iommu, as we will merge a 
huge range of page mappings in general.


Again, I think we need better reasoning than "race conditions don't 
exist because we don't expect them to exist".



Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 20 ++
  drivers/iommu/io-pgtable-arm.c  | 78 +
  drivers/iommu/iommu.c   | 75 
  include/linux/io-pgtable.h  |  2 +
  include/linux/iommu.h   | 10 +++
  5 files changed, 185 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 5469f4fca820..2434519e4bb6 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2529,6 +2529,25 @@ static size_t arm_smmu_split_block(struct iommu_domain 
*domain,
return ops->split_block(ops, iova, size);
  }
  
+static size_t arm_smmu_merge_page(struct iommu_domain *domain, unsigned long iova,

+ phys_addr_t paddr, size_t size, int prot)
+{
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and merge page\n");
+   return 0;
+   }
+
+   if (!ops || !ops->merge_page) {
+   pr_err("don't support merge page\n");
+   return 0;
+   }
+
+   return ops->merge_page(ops, iova, paddr, size, prot);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2629,6 +2648,7 @@ static struct iommu_ops arm_smmu_ops = {
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
.split_block= arm_smmu_split_block,
+   .merge_page = arm_smmu_merge_page,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index f3b7f7115e38..17390f258eb1 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -800,6 +800,83 @@ static size_t arm_lpae_split_block(struct io_pgtable_ops 
*ops,
return __arm_lpae_split_block(data, iova, size, lvl, ptep);
  }
  
+static size_t __arm_lpae_merge_page(struct arm_lpae_io_pgtable *data,

+   unsigned long iova, phys_addr_t paddr,
+   size_t size, int lvl, arm_lpae_iopte *ptep,
+   arm_lpae_iopte prot)
+{
+   arm_lpae_iopte pte, *tablep;
+   struct io_pgtable *iop = >iop;
+   struct io_pgtable_cfg *cfg = >iop.cfg;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   ptep += ARM_LPAE_LVL_IDX(iova, lvl, data);
+   pte = READ_ONCE(*ptep);
+   if (WARN_ON(!pte))
+   return 0;
+
+   if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
+   if (iopte_leaf(pte, lvl, iop->fmt))
+   return size;
+
+   /* Race does not exist */
+   if (cfg->bbml == 1) {
+   prot |= ARM_LPAE_PTE_NT;
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   io_pgtable_tlb_flush_walk(iop, iova, size,
+ ARM_LPAE_GRANULE(data));
+
+   prot &= ~(ARM_LPAE_PTE_NT);
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   } else {
+   __arm_lpae_init_pte(data, paddr, prot, lvl, ptep);
+   }
+
+   tablep = iopte_deref(pte, data);
+   __arm_lpae_free_pgtable(data, lvl + 1, tablep);
+   return size;
+   } else if (iopte_leaf(pte, lvl, iop->fmt)) {
+   /* The size is too small, already merged */
+   return size;
+   }
+
+   /* Keep on walkin */
+   ptep = iopte_deref(pte, data);
+   return 

Re: [RFC PATCH 04/11] iommu/arm-smmu-v3: Split block descriptor to a span of page

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

A block descriptor is not a proper granule for dirty log tracking. This
adds a new interface named split_block in the iommu layer and arm smmuv3
implements it, which splits a block descriptor into an equivalent span of
page descriptors.

During block splitting, other interfaces are not expected to be working,
so race conditions do not exist. And we flush all iotlbs after the split
procedure is completed to ease the pressure on the iommu, as we will split a
huge range of block mappings in general.


"Not expected to be" is not the same thing as "can not". Presumably the 
whole point of dirty log tracking is that it can be run speculatively in 
the background, so is there any actual guarantee that the guest can't, 
say, issue a hotplug event that would cause some memory to be released 
back to the host and unmapped while a scan might be in progress? Saying 
effectively "there is no race condition as long as you assume there is 
no race condition" isn't all that reassuring...


That said, it's not very clear why patches #4 and #5 are here at all, 
given that patches #6 and #7 appear quite happy to handle block entries.



Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c |  20 
  drivers/iommu/io-pgtable-arm.c  | 122 
  drivers/iommu/iommu.c   |  40 +++
  include/linux/io-pgtable.h  |   2 +
  include/linux/iommu.h   |  10 ++
  5 files changed, 194 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9208881a571c..5469f4fca820 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2510,6 +2510,25 @@ static int arm_smmu_domain_set_attr(struct iommu_domain 
*domain,
return ret;
  }
  
+static size_t arm_smmu_split_block(struct iommu_domain *domain,

+  unsigned long iova, size_t size)
+{
+   struct arm_smmu_device *smmu = to_smmu_domain(domain)->smmu;
+   struct io_pgtable_ops *ops = to_smmu_domain(domain)->pgtbl_ops;
+
+   if (!(smmu->features & (ARM_SMMU_FEAT_BBML1 | ARM_SMMU_FEAT_BBML2))) {
+   dev_err(smmu->dev, "don't support BBML1/2 and split block\n");
+   return 0;
+   }
+
+   if (!ops || !ops->split_block) {
+   pr_err("don't support split block\n");
+   return 0;
+   }
+
+   return ops->split_block(ops, iova, size);
+}
+
  static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args)
  {
return iommu_fwspec_add_ids(dev, args->args, 1);
@@ -2609,6 +2628,7 @@ static struct iommu_ops arm_smmu_ops = {
.device_group   = arm_smmu_device_group,
.domain_get_attr= arm_smmu_domain_get_attr,
.domain_set_attr= arm_smmu_domain_set_attr,
+   .split_block= arm_smmu_split_block,
.of_xlate   = arm_smmu_of_xlate,
.get_resv_regions   = arm_smmu_get_resv_regions,
.put_resv_regions   = generic_iommu_put_resv_regions,
diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index e299a44808ae..f3b7f7115e38 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -79,6 +79,8 @@
  #define ARM_LPAE_PTE_SH_IS(((arm_lpae_iopte)3) << 8)
  #define ARM_LPAE_PTE_NS   (((arm_lpae_iopte)1) << 5)
  #define ARM_LPAE_PTE_VALID(((arm_lpae_iopte)1) << 0)
+/* Block descriptor bits */
+#define ARM_LPAE_PTE_NT(((arm_lpae_iopte)1) << 16)
  
  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)

  /* Ignore the contiguous bit for block splitting */
@@ -679,6 +681,125 @@ static phys_addr_t arm_lpae_iova_to_phys(struct 
io_pgtable_ops *ops,
return iopte_to_paddr(pte, data) | iova;
  }
  
+static size_t __arm_lpae_split_block(struct arm_lpae_io_pgtable *data,

+unsigned long iova, size_t size, int lvl,
+arm_lpae_iopte *ptep);
+
+static size_t arm_lpae_do_split_blk(struct arm_lpae_io_pgtable *data,
+   unsigned long iova, size_t size,
+   arm_lpae_iopte blk_pte, int lvl,
+   arm_lpae_iopte *ptep)
+{
+   struct io_pgtable_cfg *cfg = >iop.cfg;
+   arm_lpae_iopte pte, *tablep;
+   phys_addr_t blk_paddr;
+   size_t tablesz = ARM_LPAE_GRANULE(data);
+   size_t split_sz = ARM_LPAE_BLOCK_SIZE(lvl, data);
+   int i;
+
+   if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
+   return 0;
+
+   tablep = __arm_lpae_alloc_pages(tablesz, GFP_ATOMIC, cfg);
+   if (!tablep)
+   return 0;
+
+   blk_paddr = iopte_to_paddr(blk_pte, data);
+   

Re: [RFC PATCH 01/11] iommu/arm-smmu-v3: Add feature detection for HTTU

2021-02-04 Thread Robin Murphy

On 2021-01-28 15:17, Keqian Zhu wrote:

From: jiangkunkun 

The SMMU which supports HTTU (Hardware Translation Table Update) can
update the access flag and the dirty state of TTD by hardware. It is
essential to track dirty pages of DMA.

This adds feature detection; no functional change.

Co-developed-by: Keqian Zhu 
Signed-off-by: Kunkun Jiang 
---
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 
  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  8 
  include/linux/io-pgtable.h  |  1 +
  3 files changed, 25 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 8ca7415d785d..0f0fe71cc10d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1987,6 +1987,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain 
*domain,
.pgsize_bitmap  = smmu->pgsize_bitmap,
.ias= ias,
.oas= oas,
+   .httu_hd= smmu->features & ARM_SMMU_FEAT_HTTU_HD,
.coherent_walk  = smmu->features & ARM_SMMU_FEAT_COHERENCY,
.tlb= _smmu_flush_ops,
.iommu_dev  = smmu->dev,
@@ -3224,6 +3225,21 @@ static int arm_smmu_device_hw_probe(struct 
arm_smmu_device *smmu)
if (reg & IDR0_HYP)
smmu->features |= ARM_SMMU_FEAT_HYP;
  
+	switch (FIELD_GET(IDR0_HTTU, reg)) {


We need to accommodate the firmware override as well if we need this to 
be meaningful. Jean-Philippe is already carrying a suitable patch in the 
SVA stack[1].



+   case IDR0_HTTU_NONE:
+   break;
+   case IDR0_HTTU_HA:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   break;
+   case IDR0_HTTU_HAD:
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HA;
+   smmu->features |= ARM_SMMU_FEAT_HTTU_HD;
+   break;
+   default:
+   dev_err(smmu->dev, "unknown/unsupported HTTU!\n");
+   return -ENXIO;
+   }
+
/*
 * The coherency feature as set by FW is used in preference to the ID
 * register, but warn on mismatch.
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 96c2e9565e00..e91bea44519e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -33,6 +33,10 @@
  #define IDR0_ASID16   (1 << 12)
  #define IDR0_ATS  (1 << 10)
  #define IDR0_HYP  (1 << 9)
+#define IDR0_HTTU  GENMASK(7, 6)
+#define IDR0_HTTU_NONE 0
+#define IDR0_HTTU_HA   1
+#define IDR0_HTTU_HAD  2
  #define IDR0_COHACC   (1 << 4)
  #define IDR0_TTF  GENMASK(3, 2)
  #define IDR0_TTF_AARCH64  2
@@ -286,6 +290,8 @@
  #define CTXDESC_CD_0_TCR_TBI0 (1ULL << 38)
  
  #define CTXDESC_CD_0_AA64		(1UL << 41)

+#define CTXDESC_CD_0_HD(1UL << 42)
+#define CTXDESC_CD_0_HA(1UL << 43)
  #define CTXDESC_CD_0_S(1UL << 44)
  #define CTXDESC_CD_0_R(1UL << 45)
  #define CTXDESC_CD_0_A(1UL << 46)
@@ -604,6 +610,8 @@ struct arm_smmu_device {
  #define ARM_SMMU_FEAT_RANGE_INV   (1 << 15)
  #define ARM_SMMU_FEAT_BTM (1 << 16)
  #define ARM_SMMU_FEAT_SVA (1 << 17)
+#define ARM_SMMU_FEAT_HTTU_HA  (1 << 18)
+#define ARM_SMMU_FEAT_HTTU_HD  (1 << 19)
u32 features;
  
  #define ARM_SMMU_OPT_SKIP_PREFETCH	(1 << 0)

diff --git a/include/linux/io-pgtable.h b/include/linux/io-pgtable.h
index ea727eb1a1a9..1a00ea8562c7 100644
--- a/include/linux/io-pgtable.h
+++ b/include/linux/io-pgtable.h
@@ -97,6 +97,7 @@ struct io_pgtable_cfg {
unsigned long   pgsize_bitmap;
unsigned intias;
unsigned intoas;
+   boolhttu_hd;


This is very specific to the AArch64 stage 1 format, not a generic 
capability - I think it should be a quirk flag rather than a common field.


Robin.

[1] 
https://jpbrucker.net/git/linux/commit/?h=sva/current=1ef7d512fb9082450dfe0d22ca4f7e35625a097b



boolcoherent_walk;
const struct iommu_flush_ops*tlb;
struct device   *iommu_dev;



Re: [PATCH] drm/lima: Use delayed timer as default in devfreq profile

2021-02-04 Thread Robin Murphy

On 2021-02-03 02:01, Qiang Yu wrote:

On Tue, Feb 2, 2021 at 10:02 PM Lukasz Luba  wrote:




On 2/2/21 1:01 AM, Qiang Yu wrote:

Hi Lukasz,

Thanks for the explanation. So the deferred timer option goes wrong when
the GPU goes from idle to busy within only one polling period, in this
case 50ms, right?


Not exactly. The driver sets the polling interval to 50ms (in this case)
because it needs a ~3-frame average load (at 60fps). I have discovered
quite recently that on systems with 2 CPUs or more, the devfreq core is
not monitoring the devices, even for seconds. Therefore, we might end up
with quite a big amount of work that the GPU is doing, but we don't know
about it: the devfreq core didn't check because the timer didn't fire. Then
suddenly that CPU, which had the deferred timer registered last time,
wakes up and the timer triggers a check of our device. We get the stats,
but they might be showing the load from 1 sec, not 50ms. We feed them into
the governor. The governor sees the new load, but was tested and configured
for 50ms, so it might try to raise the frequency to max. The GPU work might
already be lower and there is no need for such a freq. Then the CPU goes
idle again, so there is no devfreq core check for the next e.g. 1 sec, but
the frequency stays at the max OPP and we burn power.

So, it's completely unreliable. We might get stuck at the min frequency and
suffer frame drops, or sometimes get stuck at the max freq and burn more
power when there is no such need.

Similar for the thermal governor, which is confused by these old stats and
long-period stats, longer than 50ms.

Stats from the last e.g. ~1 sec tell you nothing about the real recent GPU
workload.

Oh, right, I missed this case.




But the delayed timer will wake up the CPU every 50ms even when the system is
idle; will this cause more power consumption in a case like phone suspend?


No, in case of phone suspend it won't increase the power consumption.
The device won't be woken up, it will stay in suspend.

I mean the CPU is woken up frequently by the timer when the phone is
suspended, not the whole device (like the display).

It seems better to have the deferred timer when the device is suspended,
for power saving, and the delayed timer when the device is in a working
state. The user knows this and can use sysfs to change it.


Doesn't devfreq_suspend_device() already cancel any timer work either 
way in that case?


Robin.


Setting the delayed timer as default is reasonable, so the patch is:
Reviewed-by: Qiang Yu 

Regards,
Qiang



Regards,
Lukasz




Regards,
Qiang


On Mon, Feb 1, 2021 at 5:53 PM Lukasz Luba  wrote:


Hi Qiang,

On 1/30/21 1:51 PM, Qiang Yu wrote:

Thanks for the patch. But I can't observe any difference on glmark2
with or without this patch.
Maybe you can provide another test which can benefit from it.


This is a design problem and has an impact on the whole system.
There are a few issues. When the device is not checked and there are
long delays between the last check and the current one, the history is broken.
It confuses the devfreq governor and the thermal governor (Intelligent Power
Allocation (IPA)). The thermal governor works on stale stats data and makes
stupid decisions, because there are no new stats (device not checked).
Similar applies to the devfreq simple_ondemand governor, where it 'tries' to
work on a long period, even 3 sec, and make a prediction for the next
frequency based on it (which is broken).

How it should be done: constant reliable check is needed, then:
- period is guaranteed and has fixed size, e.g 50ms or 100ms.
- device status is quite recent so thermal devfreq cooling provides
 'fresh' data into thermal governor

This would prevent odd behavior and solve the broken cases.



Considering it will wake up CPU more frequently, and user may choose
to change this by sysfs,
I'd like to not apply it.


The deferred timer is the wrong option for a GPU; for UFS or eMMC it makes
more sense. It's also not recommended for NoC busses. I discovered that
some time ago and proposed an option to switch to the delayed timer.
Trust me, it wasn't obvious to find out that this missing check has
those impacts. So other engineers or users might not know that some
problems they face (especially when the device load is changing) are due
to this delayed vs deferred timer, and they will change it in sysfs.

Regards,
Lukasz



Regards,
Qiang





Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Robin Murphy

On 2021-02-04 07:29, Christoph Hellwig wrote:

On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:

This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.


Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables.


Indeed, I skimmed the cover letter and immediately thought that this 
whole thing is just the restricted DMA pool concept[1] again, only from 
a slightly different angle.


Robin.

[1] 
https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/


Re: [PATCH v2 1/7] dt-bindings: usb: convert rockchip,dwc3.txt to yaml

2021-02-04 Thread Robin Murphy

On 2021-02-03 16:52, Johan Jonker wrote:

In the past Rockchip dwc3 usb nodes were manually checked.
With the conversion of snps,dwc3.yaml to a common document
we can now convert rockchip,dwc3.txt to yaml as well.
Remove node wrapper.

Added properties for rk3399 are:
   power-domains
   resets
   reset-names

Signed-off-by: Johan Jonker 
---
  .../devicetree/bindings/usb/rockchip,dwc3.txt  |  56 ---
  .../devicetree/bindings/usb/rockchip,dwc3.yaml | 103 +
  2 files changed, 103 insertions(+), 56 deletions(-)
  delete mode 100644 Documentation/devicetree/bindings/usb/rockchip,dwc3.txt
  create mode 100644 Documentation/devicetree/bindings/usb/rockchip,dwc3.yaml

diff --git a/Documentation/devicetree/bindings/usb/rockchip,dwc3.txt 
b/Documentation/devicetree/bindings/usb/rockchip,dwc3.txt
deleted file mode 100644
index 945204932..0
--- a/Documentation/devicetree/bindings/usb/rockchip,dwc3.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-Rockchip SuperSpeed DWC3 USB SoC controller
-
-Required properties:
-- compatible:  should contain "rockchip,rk3399-dwc3" for rk3399 SoC
-- clocks:  A list of phandle + clock-specifier pairs for the
-   clocks listed in clock-names
-- clock-names: Should contain the following:
-  "ref_clk"  Controller reference clk, have to be 24 MHz
-  "suspend_clk"  Controller suspend clk, have to be 24 MHz or 32 KHz
-  "bus_clk"  Master/Core clock, have to be >= 62.5 MHz for SS
-   operation and >= 30MHz for HS operation
-  "grf_clk"  Controller grf clk
-
-Required child node:
-A child node must exist to represent the core DWC3 IP block. The name of
-the node is not important. The content of the node is defined in dwc3.txt.
-
-Phy documentation is provided in the following places:
-Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.yaml - USB2.0 PHY
-Documentation/devicetree/bindings/phy/phy-rockchip-typec.txt - Type-C PHY
-
-Example device nodes:
-
-   usbdrd3_0: usb@fe80 {
-   compatible = "rockchip,rk3399-dwc3";
-   clocks = < SCLK_USB3OTG0_REF>, < SCLK_USB3OTG0_SUSPEND>,
-< ACLK_USB3OTG0>, < ACLK_USB3_GRF>;
-   clock-names = "ref_clk", "suspend_clk",
- "bus_clk", "grf_clk";
-   #address-cells = <2>;
-   #size-cells = <2>;
-   ranges;
-   usbdrd_dwc3_0: dwc3@fe80 {
-   compatible = "snps,dwc3";
-   reg = <0x0 0xfe80 0x0 0x10>;
-   interrupts = ;
-   dr_mode = "otg";
-   };
-   };
-
-   usbdrd3_1: usb@fe90 {
-   compatible = "rockchip,rk3399-dwc3";
-   clocks = < SCLK_USB3OTG1_REF>, < SCLK_USB3OTG1_SUSPEND>,
-< ACLK_USB3OTG1>, < ACLK_USB3_GRF>;
-   clock-names = "ref_clk", "suspend_clk",
- "bus_clk", "grf_clk";
-   #address-cells = <2>;
-   #size-cells = <2>;
-   ranges;
-   usbdrd_dwc3_1: dwc3@fe90 {
-   compatible = "snps,dwc3";
-   reg = <0x0 0xfe90 0x0 0x10>;
-   interrupts = ;
-   dr_mode = "otg";
-   };
-   };
diff --git a/Documentation/devicetree/bindings/usb/rockchip,dwc3.yaml 
b/Documentation/devicetree/bindings/usb/rockchip,dwc3.yaml
new file mode 100644
index 0..fdf9497bc
--- /dev/null
+++ b/Documentation/devicetree/bindings/usb/rockchip,dwc3.yaml
@@ -0,0 +1,103 @@
+# SPDX-License-Identifier: GPL-2.0
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/usb/rockchip,dwc3.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Rockchip SuperSpeed DWC3 USB SoC controller
+
+maintainers:
+  - Heiko Stuebner 
+
+description:
+  The common content of the node is defined in snps,dwc3.yaml.
+
+  Phy documentation is provided in the following places.
+
+  USB2.0 PHY
+  Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.yaml
+
+  Type-C PHY
+  Documentation/devicetree/bindings/phy/phy-rockchip-typec.txt
+
+allOf:
+  - $ref: snps,dwc3.yaml#
+
+properties:
+  compatible:
+items:
+  - enum:
+  - rockchip,rk3399-dwc3
+  - const: snps,dwc3
+
+  reg:
+maxItems: 1
+
+  interrupts:
+maxItems: 1
+
+  clocks:
+items:
+  - description:
+  Controller reference clock, must be 24 MHz
+  - description:
+  Controller suspend clock, must be 24 MHz or 32 KHz
+  - description:
+  Master/Core clock, must be >= 62.5 MHz for SS
+  operation and >= 30MHz for HS operation
+  - description:
+  Controller aclk_usb3_rksoc_axi_perf clock


I'm pretty sure these last 3 don't belong to the controller itself, 
hence why they were in the glue layer node to begin with.



+  - description:
+   

Re: [PATCH 2/2] iommu: add Unisoc iommu basic driver

2021-02-02 Thread Robin Murphy

On 2021-02-02 14:41, Joerg Roedel wrote:

On Tue, Feb 02, 2021 at 02:34:34PM +, Robin Murphy wrote:

Nope, I believe if Arm Ltd. had any involvement in this I'd know about it :)


Okay, got confused by thinking of ARM as the CPU architecture, not the
company :)
But given the intel/ and amd/ subdirectories refer to company names as
well, the same is true for arm/.


Right, trying to group IOMMU drivers by supposed CPU architecture is 
already a demonstrable non-starter; does intel-iommu count as x86, or 
IA-64, or do you want two copies? :P


I somehow doubt anyone would license one of Arm's SMMUs to go in a 
RISC-V/MIPS/etc. based SoC, but in principle, they *could*. In fact it's 
precisely cases like this one - where silicon vendors come up with their 
own little scatter-gather unit to go with their own display controller 
etc. - that I imagine are most likely to get reused if the vendor 
decides to experiment with different CPUs to reach new market segments.


Robin.


Re: [PATCH 2/2] iommu: add Unisoc iommu basic driver

2021-02-02 Thread Robin Murphy

On 2021-02-02 14:01, Joerg Roedel wrote:

On Tue, Feb 02, 2021 at 06:42:57PM +0800, Chunyan Zhang wrote:

From: Chunyan Zhang 

This iommu module can be used by Unisoc's multimedia devices, such as
display, Image codec(jpeg) and a few signal processors, including
VSP(video), GSP(graphic), ISP(image), and CPP(camera pixel processor), etc.

Signed-off-by: Chunyan Zhang 
---
  drivers/iommu/Kconfig  |  12 +
  drivers/iommu/Makefile |   1 +
  drivers/iommu/sprd-iommu.c | 598 +


This looks like it actually belongs under drivers/iommu/arm/, no?


Nope, I believe if Arm Ltd. had any involvement in this I'd know about it :)

Robin.


Re: [RESENT PATCH] arm64: cpuinfo: Add "model name" in /proc/cpuinfo for 64bit tasks also

2021-02-02 Thread Robin Murphy

On 2021-02-01 23:58, Tianling Shen wrote:

From: Sumit Gupta 

Removed restriction of displaying model name for 32 bit tasks only.
This can be used for 64 bit tasks as well, and it's useful for some
tools that already parse this, such as coreutils `uname -p`, Ubuntu
model name display etc.


How exactly is it useful? It clearly isn't necessary for compatibility, 
since AArch64 userspace has apparently been running quite happily for 8 
years without it. It also doesn't convey anything meaningful to the 
user, since they already know they're on an Armv8-compatible processor 
by the fact that they're running AArch64 userspace at all.


Robin.


It should be like this:
```
$ cat '/proc/cpuinfo' | grep 'model name' | head -n 1
model name : ARMv8 Processor rev X (v8l)
```

Link: 
https://lore.kernel.org/lkml/1472461345-28219-1-git-send-email-sum...@nvidia.com/

Signed-off-by: Sumit Gupta 
Signed-off-by: Tianling Shen 
---
  arch/arm64/kernel/cpuinfo.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/cpuinfo.c b/arch/arm64/kernel/cpuinfo.c
index 77605aec25fe..d69b4e486098 100644
--- a/arch/arm64/kernel/cpuinfo.c
+++ b/arch/arm64/kernel/cpuinfo.c
@@ -148,8 +148,7 @@ static int c_show(struct seq_file *m, void *v)
 * "processor".  Give glibc what it expects.
 */
seq_printf(m, "processor\t: %d\n", i);
-   if (compat)
-   seq_printf(m, "model name\t: ARMv8 Processor rev %d 
(%s)\n",
+   seq_printf(m, "model name\t: ARMv8 Processor rev %d (%s)\n",
   MIDR_REVISION(midr), COMPAT_ELF_PLATFORM);
  
  		seq_printf(m, "BogoMIPS\t: %lu.%02lu\n",




Re: [PATCH] iommu: Update the document of IOMMU_DOMAIN_UNMANAGED

2021-02-02 Thread Robin Murphy

On 2021-02-02 08:53, Keqian Zhu wrote:

Signed-off-by: Keqian Zhu 
---
  include/linux/iommu.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 77e561ed57fd..e8f2efae212b 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -68,7 +68,7 @@ struct iommu_domain_geometry {
   *  devices
   *IOMMU_DOMAIN_IDENTITY   - DMA addresses are system physical addresses
   *IOMMU_DOMAIN_UNMANAGED  - DMA mappings managed by IOMMU-API user, used
- *   for VMs
+ *   for VMs or userspace driver frameworks


Given that "VMs" effectively has to mean VFIO, doesn't it effectively 
already imply other uses of VFIO anyway? Unmanaged domains are also used 
in other subsystems/drivers inside the kernel and we're not naming 
those, so I don't see that it's particularly helpful to specifically 
call out one more VFIO use-case.


Perhaps the current wording could be generalised a little more, but we 
certainly don't want to start trying to maintain an exhaustive list of 
users here...


Robin.


   *IOMMU_DOMAIN_DMA- Internally used for DMA-API implementations.
   *  This flag allows IOMMU drivers to implement
   *  certain optimizations for these domains



Re: [PATCH V2 3/3] Adding device_dma_parameters->offset_preserve_mask to NVMe driver.

2021-02-02 Thread Robin Murphy

On 2021-02-02 11:21, Andy Shevchenko wrote:

On Mon, Feb 01, 2021 at 04:25:55PM -0800, Jianxiong Gao wrote:


+   if (dma_set_min_align_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1))


Side note: we have DMA_BIT_MASK(), please use it.


FWIW I'd actually disagree on that point. Conceptually, this is a very 
different thing from dev->{coherent_}dma_mask. It does not need to 
handle everything up to 2^64-1 correctly without overflow issues, and 
data alignments typically *are* defined in terms of sizes rather than 
numbers of bits.


In fact, it might be neat to just have callers pass a size directly to a 
dma_set_min_align() interface which asserts that it is a power of two 
and stores it as a mask internally.
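Say, something like this (a hypothetical sketch of that idea, layered on the
dma_set_min_align_mask() from this series rather than any existing interface):

	static inline int dma_set_min_align(struct device *dev, size_t align)
	{
		/* alignment is a size, and must be a power of two */
		if (WARN_ON(!is_power_of_2(align)))
			return -EINVAL;

		return dma_set_min_align_mask(dev, align - 1);
	}

	/* ...so the NVMe call site would then read: */
	dma_set_min_align(dev->dev, NVME_CTRL_PAGE_SIZE);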


Robin.




+   dev_warn(dev->dev, "dma_set_min_align_mask failed to
set offset\n");




Re: [PATCH v4 1/2] perf/smmuv3: Don't reserve the PMCG register spaces

2021-02-01 Thread Robin Murphy

On 2021-01-30 07:14, Zhen Lei wrote:

According to the SMMUv3 specification:
Each PMCG counter group is represented by one 4KB page (Page 0) with one
optional additional 4KB page (Page 1), both of which are at IMPLEMENTATION
DEFINED base addresses.

This means that the PMCG register spaces may be within the 64KB pages of
the SMMUv3 register space. When both the SMMU and PMCG drivers reserve
their own resources, a resource conflict occurs.

To avoid this conflict, don't reserve the PMCG regions.


I said my review on v3 stood either way, but for the avoidance of doubt,

Reviewed-by: Robin Murphy 

I hadn't considered that a comment is a very good idea, in case the 
cleanup-script crew find this in future and try to "simplify" it :)


Thanks,
Robin.


Suggested-by: Robin Murphy 
Signed-off-by: Zhen Lei 
---
  drivers/perf/arm_smmuv3_pmu.c | 25 +++--
  1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/drivers/perf/arm_smmuv3_pmu.c b/drivers/perf/arm_smmuv3_pmu.c
index 74474bb322c3f26..5e894f957c7b935 100644
--- a/drivers/perf/arm_smmuv3_pmu.c
+++ b/drivers/perf/arm_smmuv3_pmu.c
@@ -793,17 +793,30 @@ static int smmu_pmu_probe(struct platform_device *pdev)
.capabilities   = PERF_PMU_CAP_NO_EXCLUDE,
};
  
-	smmu_pmu->reg_base = devm_platform_get_and_ioremap_resource(pdev, 0, _0);

-   if (IS_ERR(smmu_pmu->reg_base))
-   return PTR_ERR(smmu_pmu->reg_base);
+   /*
+* The register spaces of the PMCG may be in the register space of
+* other devices. For example, SMMU. Therefore, the PMCG resources are
+* not reserved to avoid resource conflicts with other drivers.
+*/
+   res_0 = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+   if (!res_0)
+   return ERR_PTR(-EINVAL);
+   smmu_pmu->reg_base = devm_ioremap(dev, res_0->start, 
resource_size(res_0));
+   if (!smmu_pmu->reg_base)
+   return ERR_PTR(-ENOMEM);
  
  	cfgr = readl_relaxed(smmu_pmu->reg_base + SMMU_PMCG_CFGR);
  
  	/* Determine if page 1 is present */

if (cfgr & SMMU_PMCG_CFGR_RELOC_CTRS) {
-   smmu_pmu->reloc_base = devm_platform_ioremap_resource(pdev, 1);
-   if (IS_ERR(smmu_pmu->reloc_base))
-   return PTR_ERR(smmu_pmu->reloc_base);
+   struct resource *res_1;
+
+   res_1 = platform_get_resource(pdev, IORESOURCE_MEM, 1);
+   if (!res_1)
+   return ERR_PTR(-EINVAL);
+   smmu_pmu->reloc_base = devm_ioremap(dev, res_1->start, 
resource_size(res_1));
+   if (!smmu_pmu->reloc_base)
+   return ERR_PTR(-ENOMEM);
} else {
smmu_pmu->reloc_base = smmu_pmu->reg_base;
}



Re: [PATCH] driver/perf: Remove ARM_SMMU_V3_PMU dependency on ARM_SMMU_V3

2021-02-01 Thread Robin Murphy

On 2021-02-01 10:24, John Garry wrote:

The ARM_SMMU_V3_PMU dependency on ARM_SMMU_V3 was added with the idea
that a SMMUv3 PMCG would only exist on a system with an associated SMMUv3.

However it is not the job of Kconfig to make these sorts of decisions (even
if it were true), so remove the dependency.


Reviewed-by: Robin Murphy 


Signed-off-by: John Garry 

diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 3075cf171f78..77522e5efe11 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -62,7 +62,7 @@ config ARM_PMU_ACPI
  
  config ARM_SMMU_V3_PMU

 tristate "ARM SMMUv3 Performance Monitors Extension"
-depends on ARM64 && ACPI && ARM_SMMU_V3
+depends on ARM64 && ACPI
   help
   Provides support for the ARM SMMUv3 Performance Monitor Counter
   Groups (PMCG), which provide monitoring of transactions passing



Re: [PATCH v3 3/3] iommu/arm-smmu-v3: Reserving the entire SMMU register space

2021-02-01 Thread Robin Murphy

On 2021-01-30 01:54, Leizhen (ThunderTown) wrote:



On 2021/1/29 23:27, Robin Murphy wrote:

On 2021-01-27 11:32, Zhen Lei wrote:

commit 52f3fab0067d ("iommu/arm-smmu-v3: Don't reserve implementation
defined register space") only reserves the basic SMMU register space. So
the ECMDQ register space is not covered, and it should be mapped again. The
size of this ECMDQ resource is not fixed; it depends on
SMMU_IDR6.CMDQ_CONTROL_PAGE_LOG2NUMQ, so handling its resource reservation
to avoid a resource conflict with the PMCG is a bit more complicated.

Therefore, the resources of the PMCG are not reserved, and the entire SMMU
resource is reserved.


This is the opposite of what I suggested. My point was that it will make the 
most sense to map the ECMDQ pages as a separate request anyway, therefore there 
is no reason to stop mapping page 0 and page 1 separately either.


I don't understand why the ECMDQ mapping must be performed separately, if the 
conflict with the PMCG resources is eliminated. ECMDQ cannot be a separate 
driver like the PMCG.


I mean in terms of the basic practice of not mapping megabytes worth of 
IMP-DEF crap that this driver doesn't need or even understand. If we 
don't have ECMDQ, we definitely don't need anything beyond page 1, so 
there's no point mapping it all, and indeed it's safest not to anyway. 
Even if we do have ECMDQ, it's still safer not to map all the unknown 
stuff that may be in between, and until we've mapped page 0 we don't 
know whether we have ECMDQ or not.


Therefore the most sensible thing to do either way is to map the basic 
page(s) first, then map the ECMDQ pages specifically if we determine 
that we need to. And either way we don't even need to think about this 
until adding ECMDQ support.
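Concretely, that just means keeping roughly what the driver does today
(sketch only, reusing the helpers this patch removes):

	/* page 0, and page 1 if the SMMU claims more than 64K of registers */
	smmu->base = arm_smmu_ioremap(dev, ioaddr, ARM_SMMU_REG_SZ);
	if (IS_ERR(smmu->base))
		return PTR_ERR(smmu->base);

	if (arm_smmu_resource_size(smmu) > SZ_64K) {
		smmu->page1 = arm_smmu_ioremap(dev, ioaddr + SZ_64K,
					       ARM_SMMU_REG_SZ);
		if (IS_ERR(smmu->page1))
			return PTR_ERR(smmu->page1);
	} else {
		smmu->page1 = smmu->base;
	}

	/*
	 * Then, only once IDR6 shows ECMDQ is actually present, ioremap those
	 * specific pages as further separate requests.
	 */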


Robin.


If we need to expand the page 0 mapping to cover more of page 0 to reach the 
SMMU_CMDQ_CONTROL_PAGE_* registers, we can do that when we actually add ECMDQ 
support.

Robin.


Suggested-by: Robin Murphy 
Signed-off-by: Zhen Lei 
---
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 --
   2 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f04c55a7503c790..fde24403b06a9e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -3476,14 +3476,6 @@ static int arm_smmu_set_bus_ops(struct iommu_ops *ops)
   return err;
   }
   -static void __iomem *arm_smmu_ioremap(struct device *dev, resource_size_t 
start,
-  resource_size_t size)
-{
-    struct resource res = DEFINE_RES_MEM(start, size);
-
-    return devm_ioremap_resource(dev, );
-}
-
   static int arm_smmu_device_probe(struct platform_device *pdev)
   {
   int irq, ret;
@@ -3519,22 +3511,14 @@ static int arm_smmu_device_probe(struct platform_device 
*pdev)
   }
   ioaddr = res->start;
   -    /*
- * Don't map the IMPLEMENTATION DEFINED regions, since they may contain
- * the PMCG registers which are reserved by the PMU driver.
- */
-    smmu->base = arm_smmu_ioremap(dev, ioaddr, ARM_SMMU_REG_SZ);
+    smmu->base = devm_ioremap_resource(dev, res);
   if (IS_ERR(smmu->base))
   return PTR_ERR(smmu->base);
   -    if (arm_smmu_resource_size(smmu) > SZ_64K) {
-    smmu->page1 = arm_smmu_ioremap(dev, ioaddr + SZ_64K,
-   ARM_SMMU_REG_SZ);
-    if (IS_ERR(smmu->page1))
-    return PTR_ERR(smmu->page1);
-    } else {
+    if (arm_smmu_resource_size(smmu) > SZ_64K)
+    smmu->page1 = smmu->base + SZ_64K;
+    else
   smmu->page1 = smmu->base;
-    }
     /* Interrupt lines */
   diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h 
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index da525f46dab4fc1..63f2b476987d6ae 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -152,8 +152,6 @@
   #define ARM_SMMU_PRIQ_IRQ_CFG1    0xd8
   #define ARM_SMMU_PRIQ_IRQ_CFG2    0xdc
   -#define ARM_SMMU_REG_SZ    0xe00
-
   /* Common MSI config fields */
   #define MSI_CFG0_ADDR_MASK    GENMASK_ULL(51, 2)
   #define MSI_CFG2_SH    GENMASK(5, 4)



.





Re: [PATCH v5 06/27] dt-bindings: mediatek: Add binding for mt8192 IOMMU

2021-02-01 Thread Robin Murphy

On 2021-01-29 11:45, Tomasz Figa wrote:

On Mon, Jan 25, 2021 at 4:34 PM Yong Wu  wrote:


On Mon, 2021-01-25 at 13:18 +0900, Tomasz Figa wrote:

On Wed, Jan 20, 2021 at 4:08 PM Yong Wu  wrote:


On Wed, 2021-01-20 at 13:15 +0900, Tomasz Figa wrote:

On Wed, Jan 13, 2021 at 3:45 PM Yong Wu  wrote:


On Wed, 2021-01-13 at 14:30 +0900, Tomasz Figa wrote:

On Thu, Dec 24, 2020 at 8:35 PM Yong Wu  wrote:


On Wed, 2020-12-23 at 17:18 +0900, Tomasz Figa wrote:

On Wed, Dec 09, 2020 at 04:00:41PM +0800, Yong Wu wrote:

This patch adds descriptions for the mt8192 IOMMU and SMI.

mt8192 is also MTK IOMMU gen2, which uses the ARM Short-Descriptor translation
table format. The M4U-SMI HW diagram is as below:

                  EMI
                   |
                  M4U
                   |
              -------------
               SMI Common
              -------------
                   |
       +------+------+------+------+- ... -+-------+
       |      |      |      |      |       |       |
     larb0  larb1  larb2  larb4   ...    larb19  larb20
     disp0  disp1   mdp   vdec    ...     IPE     IPE

All the connections are fixed in HW; SW can NOT adjust them.

mt8192 M4U supports a 0~16GB iova range. We preassign different engines
to different iova ranges:

   domain-id   module     iova-range   larbs
       0       disp       0 ~ 4G       larb0/1
       1       vcodec     4G ~ 8G      larb4/5/7
       2       cam/mdp    8G ~ 12G     larb2/9/11/13/14/16/17/18/19/20


Why do we preassign these addresses in DT? Shouldn't it be a user's or
integrator's decision to split the 16 GB address range into sub-ranges
and define which larbs those sub-ranges are shared with?


The problem is that we can't split the 16GB range with the larb as the unit.
For example, ccu0 below (larb13 port9/10) is an independent
range (domain), while the other ports in larb13 are in another domain.

disp/vcodec/cam/mdp don't have any special iova requirement; they could
access any range. vcodec could also be located at 8G~12G, as it doesn't care
where its iova is. Here I preassigned as above, following our internal
project setting.


Let me try to understand this a bit more. Given the split you're
proposing, is there actually any isolation enforced between particular
domains? For example, if I program vcodec with a DMA address from
the 0-4G range, would the IOMMU actually generate a fault, even if
disp had some memory mapped at that address?


In this case, we will get a fault with the current SW setting.



Okay, thanks.





Why set this in DT? Only to simplify the code. Assume we put it in the
platform data instead: we have up to 32 larbs, each larb has up to 32 ports,
and each port may be in a different iommu domain, so we would need a big
array for this, whereas with the DT method we only use a macro to get the
domain.

While replying to this mail, I happened to see that there is a
"dev->dma_range_map" which has the "dma-ranges" information. I think I could
use this value to work out which domain the device belongs to; then there is
no need to put the domid in DT. I will test this.
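
For illustration, a rough sketch of that idea (the helper name and the 4GB
bank granularity are assumptions for the example, based on the struct
bus_dma_region layout of that time; not the mt8192 driver code):

#include <linux/dma-direct.h>   /* struct bus_dma_region */

static int mtk_iommu_domid_from_ranges(struct device *dev)
{
        const struct bus_dma_region *map = dev->dma_range_map;

        /* No dma-ranges property: default to the 0~4G bank/domain */
        if (!map || !map->size)
                return 0;

        /* Pick the 4GB bank that contains the start of the first window */
        return (int)(map->dma_start >> 32);
}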


My feeling is that the only part that needs to be enforced statically
is the reserved IOVA range for CCUs. The other ranges should be
determined dynamically, although I think I need to understand better
how the hardware and your proposed design work to tell what would be
likely the best choice here.


I have removed the domid patch in v6 and get the domain id in [27/33]
of v6.

Regarding making the other ranges dynamic, the commit message of [30/33]
in v6 should be helpful. The problem is that we have a bank_sel setting
for iova[32:33]; currently we preassign this value, so all the
ranges are fixed. If you adjust this setting, you can let vcodec access
0~4G.


Okay, so it sounds like we effectively have four 4G address spaces and
we can assign the master devices to them. I guess each of these
address spaces makes for an IOMMU group.


Yes. Each address space is an IOMMU group.



It's fine to pre-assign the devices to those groups for now, but it
definitely shouldn't be hardcoded in DT, because it depends on the use
case of the device. I'll take a look at v6, but it sounds like it
should be fine if it doesn't take the address space assignment from DT
anymore.


Thanks very much for your review.



Hmm, I had a look at v6 and it still has the address spaces hardcoded
in the DTS.


Sorry, I didn't follow. Where do you mean? Or please reply in v6.

I only added the preassign list as a comment in the file
(dt-binding/memory/mt8192-larb-port.h). I thought our iommu consumers may
need it when they use these ports; they need to add a dma-ranges property if
their iova is over 4GB.


That's exactly the problem. v6 simply replaced one way to describe the
policy (domain ID) with another (dma-ranges). However, DT is not the
right place to describe 

Re: [PATCH v3 2/3] perf/smmuv3: Add a MODULE_SOFTDEP() to indicate dependency on SMMU

2021-01-29 Thread Robin Murphy

On 2021-01-29 15:34, John Garry wrote:

On 29/01/2021 15:12, Robin Murphy wrote:

On 2021-01-27 11:32, Zhen Lei wrote:
The MODULE_SOFTDEP() gives user space a hint of the loading sequence. And
when the command "modprobe arm_smmuv3_pmu" is executed, arm_smmu_v3.ko is
automatically loaded in advance.


Why do we need this? If probe order doesn't matter when both drivers 
are built-in, why should module load order?


TBH I'm not sure why we even have a Kconfig dependency on ARM_SMMU_V3, 
given that the drivers operate completely independently :/


Can that Kconfig dependency just be removed? I think that it was added 
under the idea that there is no point in having the SMMUv3 PMU driver 
without the SMMUv3 driver.


A PMCG *might* be usable for simply counting transactions to measure 
device activity regardless of its associated SMMU being enabled. Either 
way, it's not really Kconfig's job to decide what makes sense (beyond 
the top-level "can this driver *ever* be used on this platform" 
visibility choices). Imagine if we gave every PCI/USB/etc. device driver 
an explicit dependency on at least one host controller driver being 
enabled...


Robin.


Re: [PATCH v6 07/33] iommu: Avoid reallocate default domain for a group

2021-01-28 Thread Robin Murphy

On 2021-01-28 21:14, Will Deacon wrote:

On Thu, Jan 28, 2021 at 09:10:20PM +, Will Deacon wrote:

On Wed, Jan 27, 2021 at 05:39:16PM +0800, Yong Wu wrote:

On Tue, 2021-01-26 at 22:23 +, Will Deacon wrote:

On Mon, Jan 11, 2021 at 07:18:48PM +0800, Yong Wu wrote:

If group->default_domain exists, avoid reallocating it.

In some iommu drivers, several devices may share a group. Avoid
reallocating the default domain in this case.

Signed-off-by: Yong Wu 
---
  drivers/iommu/iommu.c | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 3d099a31ddca..f4b87e6abe80 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -266,7 +266,8 @@ int iommu_probe_device(struct device *dev)
 * support default domains, so the return value is not yet
 * checked.
 */
-   iommu_alloc_default_domain(group, dev);
+   if (!group->default_domain)
+   iommu_alloc_default_domain(group, dev);


I don't really get what this achieves, since iommu_alloc_default_domain()
looks like this:

static int iommu_alloc_default_domain(struct iommu_group *group,
  struct device *dev)
{
unsigned int type;

if (group->default_domain)
return 0;

...

in which case, it should be fine?


Oh, sorry, I overlooked this; the current code is enough.
I will remove this patch and send the next version this week.
Thanks very much.


Actually, looking at this again, if we're dropping this patch and patch 6
just needs the kfree() moving about, then there's no need to repost. The
issue that Robin and Paul are discussing can be handled separately.


FWIW patch #6 gets dropped as well now, since Rob has applied the 
standalone fix (89c7cb1608ac).


Robin.


Argh, except that this version of the patches doesn't apply :)

So after all that, please go ahead and post a v7 ASAP based on this branch:

https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=for-joerg/mtk

Will



Re: [PATCH 3/3] Adding device_dma_parameters->offset_preserve_mask to NVMe driver.

2021-01-28 Thread Robin Murphy

On 2021-01-28 00:38, Jianxiong Gao wrote:

The NVMe driver relies on the address offset to function properly.
This patch adds the offset-preserve mask to the NVMe driver when mapping
via dma_map_sg_attrs and unmapping via nvme_unmap_sg. The mask
depends on the page size defined by the CC.MPS register of the NVMe
controller.

Signed-off-by: Jianxiong Gao 
---
  drivers/nvme/host/pci.c | 19 +--
  1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 856aa31931c1..0b23f04068be 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -580,12 +580,15 @@ static void nvme_free_sgls(struct nvme_dev *dev, struct 
request *req)
  static void nvme_unmap_sg(struct nvme_dev *dev, struct request *req)
  {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
-
+   if (dma_set_page_offset_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1))
+   dev_warn(dev->dev, "dma_set_page_offset_mask failed to set 
offset\n");
if (is_pci_p2pdma_page(sg_page(iod->sg)))
pci_p2pdma_unmap_sg(dev->dev, iod->sg, iod->nents,
rq_dma_dir(req));
else
dma_unmap_sg(dev->dev, iod->sg, iod->nents, rq_dma_dir(req));
+   if (dma_set_page_offset_mask(dev->dev, 0))
+   dev_warn(dev->dev, "dma_set_page_offset_mask failed to reset 
offset\n");
  }
  
  static void nvme_unmap_data(struct nvme_dev *dev, struct request *req)

@@ -842,7 +845,7 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
  {
struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
blk_status_t ret = BLK_STS_RESOURCE;
-   int nr_mapped;
+   int nr_mapped, offset_ret;
  
  	if (blk_rq_nr_phys_segments(req) == 1) {

struct bio_vec bv = req_bvec(req);
@@ -868,12 +871,24 @@ static blk_status_t nvme_map_data(struct nvme_dev *dev, 
struct request *req,
if (!iod->nents)
goto out_free_sg;
  
+	offset_ret = dma_set_page_offset_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);

+   if (offset_ret) {
+   dev_warn(dev->dev, "dma_set_page_offset_mask failed to set 
offset\n");
+   goto out_free_sg;
+   }
+
if (is_pci_p2pdma_page(sg_page(iod->sg)))
nr_mapped = pci_p2pdma_map_sg_attrs(dev->dev, iod->sg,
iod->nents, rq_dma_dir(req), DMA_ATTR_NO_WARN);
else
nr_mapped = dma_map_sg_attrs(dev->dev, iod->sg, iod->nents,
 rq_dma_dir(req), DMA_ATTR_NO_WARN);
+
+   offset_ret = dma_set_page_offset_mask(dev->dev, 0);
+   if (offset_ret) {
+   dev_warn(dev->dev, "dma_set_page_offset_mask failed to reset 
offset\n");
+   goto out_free_sg;


If it were possible for this to fail, you might leak the DMA mapping 
here. However if dev->dma_parms somehow disappeared since a dozen lines 
above then I think you've got far bigger problems anyway.


That said, do you really need to keep toggling this back and forth all 
the time? Even if the device does make other mappings elsewhere that 
don't necessarily need the same strict alignment, would it be 
significantly harmful just to set it once at probe and leave it in place 
anyway?
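
For illustration, a sketch of that alternative: set the mask once when the
device is set up, rather than toggling it around every map/unmap.
dma_set_page_offset_mask() is the helper added by patch 1/3; the exact call
site in the NVMe probe/reset path is an assumption for the example:

/* Called once from the probe/reset path, e.g. alongside nvme_setup_prp_pools() */
static int nvme_set_dma_alignment(struct nvme_dev *dev)
{
        /* NVMe data pointers always preserve the offset within a CC.MPS page */
        return dma_set_page_offset_mask(dev->dev, NVME_CTRL_PAGE_SIZE - 1);
}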


Robin.


+   }
if (!nr_mapped)
goto out_free_sg;
  



Re: [PATCH 1/3] Adding page_offset_mask to device_dma_parameters

2021-01-28 Thread Robin Murphy

On 2021-01-28 00:38, Jianxiong Gao wrote:

Some devices rely on the address offset in a page to function
correctly (NVMe driver as an example). These devices may use
a different page size than the Linux kernel. The address offset
has to be preserved upon mapping, and in order to do so, we
need to record the page_offset_mask first.

Signed-off-by: Jianxiong Gao 
---
  include/linux/device.h  |  1 +
  include/linux/dma-mapping.h | 17 +
  2 files changed, 18 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 1779f90eeb4c..f44e0659fc66 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -292,6 +292,7 @@ struct device_dma_parameters {
 */
unsigned int max_segment_size;
unsigned long segment_boundary_mask;
+   unsigned int page_offset_mask;


Could we call this something more like "min_align_mask" (sorry, I can't 
think of a name that's actually good and descriptive right now). 
Essentially I worry that having "page" in there is going to be too easy 
to misinterpret as having anything to do with what "page" means almost 
everywhere else (even before you throw IOMMU pages into the mix).


Also note that of all the possible ways to pack two ints and a long, 
this one is the worst ;)
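
For illustration, the kind of layout (and name) being hinted at;
"min_align_mask" is only the placeholder suggested above, not a settled name:

struct device_dma_parameters {
        /* Keep the two unsigned ints adjacent so they share one 64-bit slot */
        unsigned int max_segment_size;
        unsigned int min_align_mask;            /* was: page_offset_mask */
        unsigned long segment_boundary_mask;
};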


Robin.


  };
  
  /**

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 2e49996a8f39..5529a31fefba 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -500,6 +500,23 @@ static inline int dma_set_seg_boundary(struct device *dev, 
unsigned long mask)
return -EIO;
  }
  
+static inline unsigned int dma_get_page_offset_mask(struct device *dev)

+{
+   if (dev->dma_parms)
+   return dev->dma_parms->page_offset_mask;
+   return 0;
+}
+
+static inline int dma_set_page_offset_mask(struct device *dev,
+   unsigned int page_offset_mask)
+{
+   if (dev->dma_parms) {
+   dev->dma_parms->page_offset_mask = page_offset_mask;
+   return 0;
+   }
+   return -EIO;
+}
+
  static inline int dma_get_cache_alignment(void)
  {
  #ifdef ARCH_DMA_MINALIGN



Re: [PATCH 1/1] iommu/arm-smmu-v3: add support for BBML

2021-01-28 Thread Robin Murphy

On 2021-01-28 15:18, Keqian Zhu wrote:



On 2021/1/27 17:39, Robin Murphy wrote:

On 2021-01-27 07:36, Keqian Zhu wrote:



On 2021/1/27 10:01, Leizhen (ThunderTown) wrote:



On 2021/1/26 18:12, Will Deacon wrote:

On Mon, Jan 25, 2021 at 08:23:40PM +, Robin Murphy wrote:

Now we probably will need some degree of BBML feature awareness for the
sake of SVA if and when we start using it for CPU pagetables, but I still
cannot see any need to consider it in io-pgtable.


Agreed; I don't think this is something that io-pgtable should have to care
about.

Hi,

I have a question here :-).
If the old table is not live, then the break procedure seems unnecessary.
Am I missing something?


The MMU is allowed to prefetch translations at any time, so not following the 
proper update procedure could still potentially lead to a TLB conflict, even if 
there's no device traffic to worry about disrupting.

Robin.


Thanks. Does the MMU you mention here include both the MMU and the SMMU?
I know that on the SMMU side, ATS can prefetch translations.


Yes, both - VMSAv8 allows speculative translation table walks, so SMMUv3 
inherits from there (per 3.21.1 "Translation tables and TLB invalidation 
completion behavior").


Robin.



Keqian



Thanks,
Keqian



Yes, SVA works in stall mode, and the failed device access requests are not
discarded.

Let me look for examples. The BBML usage scenario was described to me by a
former colleague.



Will

.








.



Re: [PATCH v2] of/device: Update dma_range_map only when dev has valid dma-ranges

2021-01-27 Thread Robin Murphy

On 2021-01-27 19:09, Rob Herring wrote:

On Wed, Jan 27, 2021 at 7:13 AM Robin Murphy  wrote:


[ + Christoph, Marek ]

On 2021-01-27 13:00, Paul Kocialkowski wrote:

Hi,

On Tue 19 Jan 21, 18:52, Yong Wu wrote:

The commit e0d072782c73 ("dma-mapping: introduce DMA range map,
supplanting dma_pfn_offset") always updates dma_range_map even though it was
already set, as in the sunxi_mbus driver; the issue is reported at [1].
This patch avoids this (updating it only when the dev has valid dma-ranges).

Meanwhile, dma_range_map contains the device's dma-ranges information, so
this patch also moves setting dma_range_map before of_iommu_configure. The
iommu driver may need to know the dma_address requirements of its iommu
consumer devices.


Just a gentle ping on this issue, it would be nice to have this fix merged
ASAP, in the next RC :)


Ack to that - Rob, Frank, do you want to take this through the OF tree,
or shall we take it through the DMA-mapping tree like the original culprit?


I've already got some fixes queued up and can take it.


Brilliant, thanks!


Suggested-by doesn't mean you are happy with the implementation. So
Acked-by or Reviewed-by?


It still feels slightly awkward to give a tag to say "yes, this is 
exactly what I suggested", but for the avoidance of doubt,


Reviewed-by: Robin Murphy 

Cheers,
Robin.


Re: [PATCH v2] of/device: Update dma_range_map only when dev has valid dma-ranges

2021-01-27 Thread Robin Murphy

On 2021-01-27 19:07, Rob Herring wrote:

On Tue, Jan 19, 2021 at 4:52 AM Yong Wu  wrote:


The commit e0d072782c73 ("dma-mapping: introduce DMA range map,
supplanting dma_pfn_offset") always updates dma_range_map even though it was
already set, as in the sunxi_mbus driver; the issue is reported at [1].
This patch avoids this (updating it only when the dev has valid dma-ranges).


only when dev *doesn't* have valid dma-ranges?


I believe the intent is to mean when a valid "dma-ranges" property is 
specified in DT. The implementation allows DT to take precedence even if 
platform code *has* already installed a dma_range_map, although we 
shouldn't expect that to ever happen (except perhaps in the wild early 
days of platform bringup).


Robin.


Meanwhile, dma_range_map contains the device's dma-ranges information, so
this patch also moves setting dma_range_map before of_iommu_configure. The
iommu driver may need to know the dma_address requirements of its iommu
consumer devices.

[1] 
https://lore.kernel.org/linux-arm-kernel/5c7946f3-b56e-da00-a750-be097c7ce...@arm.com/

CC: Rob Herring 
CC: Frank Rowand 
Fixes: e0d072782c73 ("dma-mapping: introduce DMA range map, supplanting 
dma_pfn_offset"),
Suggested-by: Robin Murphy 
Signed-off-by: Yong Wu 
Signed-off-by: Paul Kocialkowski 
Reviewed-by: Rob Herring 
---
  drivers/of/device.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index aedfaaafd3e7..1122daa8e273 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -162,9 +162,11 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
 mask = DMA_BIT_MASK(ilog2(end) + 1);
 dev->coherent_dma_mask &= mask;
 *dev->dma_mask &= mask;
-   /* ...but only set bus limit if we found valid dma-ranges earlier */
-   if (!ret)
+   /* ...but only set bus limit and range map if we found valid dma-ranges 
earlier */
+   if (!ret) {
 dev->bus_dma_limit = end;
+   dev->dma_range_map = map;
+   }

 coherent = of_dma_is_coherent(np);
 dev_dbg(dev, "device is%sdma coherent\n",
@@ -172,6 +174,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,

 iommu = of_iommu_configure(dev, np, id);
 if (PTR_ERR(iommu) == -EPROBE_DEFER) {
+   /* Don't touch range map if it wasn't set from a valid 
dma-ranges */
+   if (!ret)
+   dev->dma_range_map = NULL;
 kfree(map);
 return -EPROBE_DEFER;
 }
@@ -181,7 +186,6 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,

 arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);

-   dev->dma_range_map = map;
 return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
--
2.18.0





Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2021-01-27 Thread Robin Murphy

On 2021-01-27 13:09, Marcelo Tosatti wrote:

On Wed, Jan 27, 2021 at 12:36:30PM +, Robin Murphy wrote:

On 2021-01-27 12:19, Marcelo Tosatti wrote:

On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:

Hi,

On 2020-06-25 23:34, Nitesh Narayan Lal wrote:

From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time task,
it will still be returned to the caller for pinning of its IRQ threads.
Having these unwanted IRQ threads on an isolated CPU adds a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
lib/cpumask.c | 16 +++-
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
#include 
#include 
#include 
+#include 
/**
 * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 */
unsigned int cpumask_local_spread(unsigned int i, int node)
{
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+   mask = housekeeping_cpumask(hk_flags);


AFAICS, this generally resolves to something based on cpu_possible_mask
rather than cpu_online_mask as before, so could now potentially return an
offline CPU. Was that an intentional change?


Robin,

AFAICS online CPUs should be filtered.


Apologies if I'm being thick, but can you explain how? In the case of
isolation being disabled or compiled out, housekeeping_cpumask() is
literally just "return cpu_possible_mask;". If we then iterate over that
with for_each_cpu() and just return the i'th possible CPU (e.g. in the
NUMA_NO_NODE case), what guarantees that CPU is actually online?

Robin.


Nothing, but that was the situation before 
1abdfe706a579a702799fce465bceb9fb01d407c
as well.


True, if someone calls from a racy context then there's not much we can 
do to ensure that any CPU *remains* online after we initially observed 
it to be, but when it's called from somewhere safe like a cpuhp offline 
handler, then picking from cpu_online_mask *did* always do the right 
thing (by my interpretation), whereas picking from 
housekeeping_cpumask() might not.


This is why I decided to ask rather than just send a patch to fix what I 
think might be a bug - I have no objection if this *is* intended 
behaviour, other than suggesting we amend the "...selects an online 
CPU..." comment if that aspect was never meant to be relied upon.
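
For illustration, a sketch of one way to keep the documented guarantee while
still honouring the housekeeping restriction: intersect the housekeeping mask
with cpu_online_mask before spreading. The on-stack cpumask is only for
brevity; a real fix would need cpumask_var_t and/or hotplug locking:

unsigned int cpumask_local_spread(unsigned int i, int node)
{
        const struct cpumask *hk = housekeeping_cpumask(HK_FLAG_DOMAIN |
                                                        HK_FLAG_MANAGED_IRQ);
        cpumask_t mask;
        int cpu;

        /* Only ever hand out CPUs that are both housekeeping and online */
        cpumask_and(&mask, hk, cpu_online_mask);
        if (cpumask_empty(&mask))
                cpumask_copy(&mask, cpu_online_mask);

        /* Wrap: we always want a cpu. */
        i %= cpumask_weight(&mask);
        if (node == NUMA_NO_NODE) {
                for_each_cpu(cpu, &mask)
                        if (i-- == 0)
                                return cpu;
        } else {
                /* NUMA first. */
                for_each_cpu_and(cpu, cpumask_of_node(node), &mask)
                        if (i-- == 0)
                                return cpu;

                for_each_cpu(cpu, &mask) {
                        /* Skip NUMA nodes, done above. */
                        if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
                                continue;
                        if (i-- == 0)
                                return cpu;
                }
        }
        BUG();
}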


Thanks,
Robin.



cpumask_local_spread() should probably be disabling CPU hotplug.

Thomas?




I was just looking at the current code since I had the rare presence of mind
to check if something suitable already existed before I start open-coding
"any online CPU, but local node preferred" logic for handling IRQ affinity
in a driver - cpumask_local_spread() appears to be almost what I want (if a
bit more heavyweight), if only it would actually guarantee an online CPU as
the kerneldoc claims :(

Robin.


/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
-   for_each_cpu(cpu, cpu_online_mask) {
+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;







Re: [PATCH v2] of/device: Update dma_range_map only when dev has valid dma-ranges

2021-01-27 Thread Robin Murphy

[ + Christoph, Marek ]

On 2021-01-27 13:00, Paul Kocialkowski wrote:

Hi,

On Tue 19 Jan 21, 18:52, Yong Wu wrote:

The commit e0d072782c73 ("dma-mapping: introduce DMA range map,
supplanting dma_pfn_offset") always updates dma_range_map even though it was
already set, as in the sunxi_mbus driver; the issue is reported at [1].
This patch avoids this (updating it only when the dev has valid dma-ranges).

Meanwhile, dma_range_map contains the device's dma-ranges information, so
this patch also moves setting dma_range_map before of_iommu_configure. The
iommu driver may need to know the dma_address requirements of its iommu
consumer devices.


Just a gentle ping on this issue, it would be nice to have this fix merged
ASAP, in the next RC :)


Ack to that - Rob, Frank, do you want to take this through the OF tree, 
or shall we take it through the DMA-mapping tree like the original culprit?


Thanks,
Robin.



Cheers,

Paul


[1] 
https://lore.kernel.org/linux-arm-kernel/5c7946f3-b56e-da00-a750-be097c7ce...@arm.com/

CC: Rob Herring 
CC: Frank Rowand 
Fixes: e0d072782c73 ("dma-mapping: introduce DMA range map, supplanting 
dma_pfn_offset"),
Suggested-by: Robin Murphy 
Signed-off-by: Yong Wu 
Signed-off-by: Paul Kocialkowski 
Reviewed-by: Rob Herring 
---
  drivers/of/device.c | 10 +++---
  1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index aedfaaafd3e7..1122daa8e273 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -162,9 +162,11 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
mask = DMA_BIT_MASK(ilog2(end) + 1);
dev->coherent_dma_mask &= mask;
*dev->dma_mask &= mask;
-   /* ...but only set bus limit if we found valid dma-ranges earlier */
-   if (!ret)
+   /* ...but only set bus limit and range map if we found valid dma-ranges 
earlier */
+   if (!ret) {
dev->bus_dma_limit = end;
+   dev->dma_range_map = map;
+   }
  
  	coherent = of_dma_is_coherent(np);

dev_dbg(dev, "device is%sdma coherent\n",
@@ -172,6 +174,9 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
  
  	iommu = of_iommu_configure(dev, np, id);

if (PTR_ERR(iommu) == -EPROBE_DEFER) {
+   /* Don't touch range map if it wasn't set from a valid 
dma-ranges */
+   if (!ret)
+   dev->dma_range_map = NULL;
kfree(map);
return -EPROBE_DEFER;
}
@@ -181,7 +186,6 @@ int of_dma_configure_id(struct device *dev, struct 
device_node *np,
  
  	arch_setup_dma_ops(dev, dma_start, size, iommu, coherent);
  
-	dev->dma_range_map = map;

return 0;
  }
  EXPORT_SYMBOL_GPL(of_dma_configure_id);
--
2.18.0





Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2021-01-27 Thread Robin Murphy

On 2021-01-27 12:19, Marcelo Tosatti wrote:

On Wed, Jan 27, 2021 at 11:57:16AM +, Robin Murphy wrote:

Hi,

On 2020-06-25 23:34, Nitesh Narayan Lal wrote:

From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time task,
it will still be returned to the caller for pinning of its IRQ threads.
Having these unwanted IRQ threads on an isolated CPU adds a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
   lib/cpumask.c | 16 +++-
   1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
   #include 
   #include 
   #include 
+#include 
   /**
* cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
*/
   unsigned int cpumask_local_spread(unsigned int i, int node)
   {
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
+   hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;
+   mask = housekeeping_cpumask(hk_flags);


AFAICS, this generally resolves to something based on cpu_possible_mask
rather than cpu_online_mask as before, so could now potentially return an
offline CPU. Was that an intentional change?


Robin,

AFAICS online CPUs should be filtered.


Apologies if I'm being thick, but can you explain how? In the case of 
isolation being disabled or compiled out, housekeeping_cpumask() is 
literally just "return cpu_possible_mask;". If we then iterate over that 
with for_each_cpu() and just return the i'th possible CPU (e.g. in the 
NUMA_NO_NODE case), what guarantees that CPU is actually online?


Robin.


I was just looking at the current code since I had the rare presence of mind
to check if something suitable already existed before I start open-coding
"any online CPU, but local node preferred" logic for handling IRQ affinity
in a driver - cpumask_local_spread() appears to be almost what I want (if a
bit more heavyweight), if only it would actually guarantee an online CPU as
the kerneldoc claims :(

Robin.


/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
if (node == NUMA_NO_NODE) {
-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
-   for_each_cpu(cpu, cpu_online_mask) {
+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;





Re: [Patch v4 1/3] lib: Restrict cpumask_local_spread to housekeeping CPUs

2021-01-27 Thread Robin Murphy

Hi,

On 2020-06-25 23:34, Nitesh Narayan Lal wrote:

From: Alex Belits 

The current implementation of cpumask_local_spread() does not respect
isolated CPUs, i.e., even if a CPU has been isolated for a real-time task,
it will still be returned to the caller for pinning of its IRQ threads.
Having these unwanted IRQ threads on an isolated CPU adds a latency
overhead.

Restrict the CPUs that are returned for spreading IRQs only to the
available housekeeping CPUs.

Signed-off-by: Alex Belits 
Signed-off-by: Nitesh Narayan Lal 
---
  lib/cpumask.c | 16 +++-
  1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index fb22fb266f93..85da6ab4fbb5 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -6,6 +6,7 @@
  #include 
  #include 
  #include 
+#include 
  
  /**

   * cpumask_next - get the next cpu in a cpumask
@@ -205,22 +206,27 @@ void __init free_bootmem_cpumask_var(cpumask_var_t mask)
   */
  unsigned int cpumask_local_spread(unsigned int i, int node)
  {
-   int cpu;
+   int cpu, hk_flags;
+   const struct cpumask *mask;
  
+	hk_flags = HK_FLAG_DOMAIN | HK_FLAG_MANAGED_IRQ;

+   mask = housekeeping_cpumask(hk_flags);


AFAICS, this generally resolves to something based on cpu_possible_mask 
rather than cpu_online_mask as before, so could now potentially return 
an offline CPU. Was that an intentional change?


I was just looking at the current code since I had the rare presence of 
mind to check if something suitable already existed before I start 
open-coding "any online CPU, but local node preferred" logic for 
handling IRQ affinity in a driver - cpumask_local_spread() appears to be 
almost what I want (if a bit more heavyweight), if only it would 
actually guarantee an online CPU as the kerneldoc claims :(


Robin.


/* Wrap: we always want a cpu. */
-   i %= num_online_cpus();
+   i %= cpumask_weight(mask);
  
  	if (node == NUMA_NO_NODE) {

-   for_each_cpu(cpu, cpu_online_mask)
+   for_each_cpu(cpu, mask) {
if (i-- == 0)
return cpu;
+   }
} else {
/* NUMA first. */
-   for_each_cpu_and(cpu, cpumask_of_node(node), cpu_online_mask)
+   for_each_cpu_and(cpu, cpumask_of_node(node), mask) {
if (i-- == 0)
return cpu;
+   }
  
-		for_each_cpu(cpu, cpu_online_mask) {

+   for_each_cpu(cpu, mask) {
/* Skip NUMA nodes, done above. */
if (cpumask_test_cpu(cpu, cpumask_of_node(node)))
continue;



Re: [PATCH 1/1] iommu/arm-smmu-v3: add support for BBML

2021-01-27 Thread Robin Murphy

On 2021-01-27 07:36, Keqian Zhu wrote:



On 2021/1/27 10:01, Leizhen (ThunderTown) wrote:



On 2021/1/26 18:12, Will Deacon wrote:

On Mon, Jan 25, 2021 at 08:23:40PM +, Robin Murphy wrote:

Now we probably will need some degree of BBML feature awareness for the
sake of SVA if and when we start using it for CPU pagetables, but I still
cannot see any need to consider it in io-pgtable.


Agreed; I don't think this is something that io-pgtable should have to care
about.

Hi,

I have a question here :-).
If the old table is not live, then the break procedure seems unnecessary.
Am I missing something?


The MMU is allowed to prefetch translations at any time, so not 
following the proper update procedure could still potentially lead to a 
TLB conflict, even if there's no device traffic to worry about disrupting.
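
For illustration, the break-before-make sequence in pseudocode (not io-pgtable
or SMMUv3 driver code; the barrier and invalidation details are elided and
simplified):

static void bbm_update_entry(u64 *ptep, u64 new_pte)
{
        WRITE_ONCE(*ptep, 0);           /* 1. break: make the entry invalid */
        /*
         * 2. invalidate: make the break visible to the walker, then issue
         *    the TLBI for this entry and wait for completion (CMD_TLBI_*
         *    followed by CMD_SYNC on SMMUv3) - elided here.
         */
        WRITE_ONCE(*ptep, new_pte);     /* 3. make: install the new mapping */
}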


Robin.


Thanks,
Keqian



Yes, SVA works in stall mode, and the failed device access requests are not
discarded.

Let me look for examples. The BBML usage scenario was described to me by a
former colleague.



Will

.








