Re: [PATCH v3 02/24] dt-bindings: memory: mediatek: Convert SMI to DT schema

2020-10-05 Thread Yong Wu
On Fri, 2020-10-02 at 13:08 +0200, Krzysztof Kozlowski wrote:
> On Wed, Sep 30, 2020 at 03:06:25PM +0800, Yong Wu wrote:
> > Convert MediaTek SMI to DT schema.
> > 
> > Signed-off-by: Yong Wu 
> > ---
> >  .../mediatek,smi-common.txt   |  49 -
> >  .../mediatek,smi-common.yaml  | 100 ++
> >  .../memory-controllers/mediatek,smi-larb.txt  |  49 -
> >  .../memory-controllers/mediatek,smi-larb.yaml |  91 
> >  4 files changed, 191 insertions(+), 98 deletions(-)
> >  delete mode 100644 
> > Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.txt
> >  create mode 100644 
> > Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
> >  delete mode 100644 
> > Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.txt
> >  create mode 100644 
> > Documentation/devicetree/bindings/memory-controllers/mediatek,smi-larb.yaml
...
> > +properties:
> > +  compatible:
> > +oneOf:
> > +  - enum:
> > +  - mediatek,mt2701-smi-common
> > +  - mediatek,mt2712-smi-common
> > +  - mediatek,mt6779-smi-common
> > +  - mediatek,mt8173-smi-common
> > +  - mediatek,mt8183-smi-common
> > +
> > +  - description: for mt7623
> > +items:
> > +  - const: mediatek,mt7623-smi-common
> > +  - const: mediatek,mt2701-smi-common
> > +
> > +  reg:
> > +maxItems: 1
> > +
> > +  clocks:
> > +description: |
> > +  apb and smi are mandatory. the async is only for generation 1 smi HW.
> > +  gals(global async local sync) also is optional, here is the list which
> > +  require gals: mt6779 and mt8183.
> > +minItems: 2
> > +maxItems: 4
> > +items:
> > +  - description: apb is Advanced Peripheral Bus clock, It's the clock for
> > +  setting the register.
> > +  - description: smi is the clock for transfer data and command.
> > +  - description: async is asynchronous clock, it help transform the smi clock
> > +  into the emi clock domain.
> > +  - description: gals0 is the path0 clock of gals.
> > +  - description: gals1 is the path1 clock of gals.
> > +
> > +  clock-names:
> > +oneOf:
> > +  - items:
> > +  - const: apb
> > +  - const: smi
> > +  - items:
> > +  - const: apb
> > +  - const: smi
> > +  - const: async
> > +  - items:
> > +  - const: apb
> > +  - const: smi
> > +  - const: gals0
> > +  - const: gals1
> 
> Similarly to my comment to other properties, this requirement per
> compatible should be part of the schema within 'if-then'.

I'm not so familiar with this format. Does it have "if-then-'else
if'-then-else"?

I tried below instead of the clocks segment above:

===
if:
  properties:
    compatible:
      enum:
        - mediatek,mt6779-smi-common
        - mediatek,mt8183-smi-common

then:
  properties:
    clock:
      items:
        - description: apb is the clock for setting the register.
        - description: smi is the clock for transfer data and command.
        - description: gals0 is the path0 clock of gals(global async local sync).
        - description: gals1 is the path1 clock of gals.
    clock-names:
      items:
        - const: apb
        - const: smi
        - const: gals0
        - const: gals1
else:
  if:
    properties:
      compatible:
        contains:
          enum:
            - mediatek,mt2701-smi-common

  then:
    properties:
      clocks:
        items:
          - description: apb is the clock for setting the register.
          - description: smi is the clock for transfer data and command.
          - description: async is asynchronous clock, it help transform the smi clock
            into the emi clock domain.
      clock-names:
        items:
          - const: apb
          - const: smi
          - const: async
  else:
    properties:
      clocks:
        items:
          - description: apb is the clock for setting the register.
          - description: smi is the clock for transfer data and command.
      clock-names:
        items:
          - const: apb
          - const: smi


But I got a warning when dt_binding_check:

  CHKDT   Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml
  SCHEMA  Documentation/devicetree/bindings/processed-schema-examples.yaml
  DTC     Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.example.dt.yaml
  CHECK   Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.example.dt.yaml
.../Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.example.dt.yaml:
 smi@14022000: 'clock-names', 'clocks' do not match any of the regexes: 'pinctrl-[0-9]+'
From schema: .../Documentation/devicetree/bindings/memory-controllers/mediatek,smi-common.yaml

Any 
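
For what it's worth, that "do not match any of the regexes" warning usually
means the example's clocks/clock-names are being rejected by
'additionalProperties: false': in the attempt above they are only declared
inside the if/then branches, so the top-level schema no longer knows about
them (note also the stray "clock:" vs "clocks:" spelling in the first
branch). A minimal sketch of the usual structure, assuming
additionalProperties: false is kept, declares the properties once at the top
level and only narrows them per compatible:

===
properties:
  clocks:
    minItems: 2
    maxItems: 4

  clock-names:
    minItems: 2
    maxItems: 4

allOf:
  - if:
      properties:
        compatible:
          contains:
            enum:
              - mediatek,mt6779-smi-common
              - mediatek,mt8183-smi-common
    then:
      properties:
        clock-names:
          items:
            - const: apb
            - const: smi
            - const: gals0
            - const: gals1
===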

Re: [PATCH v3 01/24] dt-bindings: iommu: mediatek: Convert IOMMU to DT schema

2020-10-05 Thread Yong Wu
On Fri, 2020-10-02 at 13:07 +0200, Krzysztof Kozlowski wrote:
> On Wed, Sep 30, 2020 at 03:06:24PM +0800, Yong Wu wrote:
> > Convert MediaTek IOMMU to DT schema.
> > 
> > Signed-off-by: Yong Wu 
> > ---
> >  .../bindings/iommu/mediatek,iommu.txt | 103 
> >  .../bindings/iommu/mediatek,iommu.yaml| 154 ++
> >  2 files changed, 154 insertions(+), 103 deletions(-)
> >  delete mode 100644 
> > Documentation/devicetree/bindings/iommu/mediatek,iommu.txt
> >  create mode 100644 
> > Documentation/devicetree/bindings/iommu/mediatek,iommu.yaml
> > 

[...]

> > +properties:
> > +  compatible:
> > +oneOf:
> > +  - enum:
> > +  - mediatek,mt2701-m4u # mt2701 generation one HW
> > +  - mediatek,mt2712-m4u # mt2712 generation two HW
> > +  - mediatek,mt6779-m4u # mt6779 generation two HW
> > +  - mediatek,mt8173-m4u # mt8173 generation two HW
> > +  - mediatek,mt8183-m4u # mt8183 generation two HW
> > +
> > +  - description: mt7623 generation one HW
> > +items:
> > +  - const: mediatek,mt7623-m4u
> > +  - const: mediatek,mt2701-m4u
> > +
> > +  reg:
> > +maxItems: 1
> > +
> > +  interrupts:
> > +maxItems: 1
> > +
> > +  clocks:
> > +description: |
> > +  bclk is optional. here is the list which require this bclk:
> > +  mt2701, mt2712, mt7623 and mt8173.
> 
> Similarly to my comment in other patch, this should be part of schema
> within 'if-then'.

Thanks for the review.

I will change it like this:

=
  clocks:
    items:
      - description: bclk is the block clock.

  clock-names:
    items:
      - const: bclk

required:
  - compatible
  - reg
  - interrupts
  - mediatek,larbs
  - '#iommu-cells'

if:
  properties:
    compatible:
      contains:
        enum:
          - mediatek,mt2701-m4u
          - mediatek,mt2712-m4u
          - mediatek,mt8173-m4u

then:
  required:
    - clocks
==

If this is not right, please tell me.
(dt_binding_check is ok.)
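
As a side note (just a structural sketch, not a required change): such
conditionals are usually wrapped in an allOf list so that further
per-compatible constraints can be appended next to this one later, e.g.:

===
allOf:
  - if:
      properties:
        compatible:
          contains:
            enum:
              - mediatek,mt2701-m4u
              - mediatek,mt2712-m4u
              - mediatek,mt8173-m4u
    then:
      required:
        - clocks
===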

> 
> Best regards,
> Krzysztof

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v3 06/24] dt-bindings: mediatek: Add binding for mt8192 IOMMU

2020-10-05 Thread Yong Wu
Hi Krzysztof,

On Fri, 2020-10-02 at 13:10 +0200, Krzysztof Kozlowski wrote:
> On Wed, Sep 30, 2020 at 03:06:29PM +0800, Yong Wu wrote:
> > This patch adds descriptions for mt8192 IOMMU and SMI.
> > 
> > mt8192 also is MTK IOMMU gen2 which uses ARM Short-Descriptor translation
> > table format. The M4U-SMI HW diagram is as below:
> > 
> >                  EMI
> >                   |
> >                  M4U
> >                   |
> >              ------------
> >               SMI Common
> >              ------------
> >                   |
> >   +-------+------+------+------+ .. +-------+-------+
> >   |       |      |      |           |       |
> > larb0   larb1  larb2  larb4   ..  larb19  larb20
> > disp0   disp1   mdp   vdec         IPE     IPE
> > 
> > All the connections are HW fixed, SW can NOT adjust it.
> > 
> > mt8192 M4U supports 0~16GB iova range. We preassign different engines
> > into different iova ranges:
> > 
> > domain-id  module   iova-range                 larbs
> >     0      disp     0 ~ 4G                     larb0/1
> >     1      vcodec   4G ~ 8G                    larb4/5/7
> >     2      cam/mdp  8G ~ 12G                   larb2/9/11/13/14/16/17/18/19/20
> >     3      CCU0     0x4000_ ~ 0x43ff_          larb13: port 9/10
> >     4      CCU1     0x4400_ ~ 0x47ff_          larb14: port 4/5
> > 
> > The iova range for CCU0/1(camera control unit) is HW requirement.
> > 
> > Signed-off-by: Yong Wu 
> > Reviewed-by: Rob Herring 
> > ---
> >  .../bindings/iommu/mediatek,iommu.yaml|   9 +-
> >  .../mediatek,smi-common.yaml  |   5 +-
> >  .../memory-controllers/mediatek,smi-larb.yaml |   3 +-
> >  include/dt-bindings/memory/mt8192-larb-port.h | 239 ++
> >  4 files changed, 251 insertions(+), 5 deletions(-)
> >  create mode 100644 include/dt-bindings/memory/mt8192-larb-port.h
> 
> I see it depends on previous patches but does it have to be within one
> commit? Is it not bisectable? The memory changes/bindings could go via
> memory tree if this is split.

Thanks for the review.

I can split this into two patchsets in the next version: one for iommu and
the other for smi.

Only patch [18/24] changes both the iommu and smi code. I don't plan to
split it, and the smi patch [24/24] doesn't depend on it (won't conflict).

Since 18/24 also touches the smi code, I expect it could get an Acked-by
from you or Matthias, and then Joerg could take it.

Thanks.

> 
> Best regards,
> Krzysztof

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v5 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Nicolin Chen
On Mon, Oct 05, 2020 at 12:56:38PM +0200, Thierry Reding wrote:
> On Mon, Oct 05, 2020 at 11:41:08AM +0300, Dmitry Osipenko wrote:
> > 05.10.2020 00:57, Nicolin Chen пишет:
> > > On Sat, Oct 03, 2020 at 05:06:42PM +0300, Dmitry Osipenko wrote:
> > >> 03.10.2020 09:59, Nicolin Chen пишет:
> > >>>  static int tegra_smmu_of_xlate(struct device *dev,
> > >>>struct of_phandle_args *args)
> > >>>  {
> > >>> +   struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> > >>> +   struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > >>> u32 id = args->args[0];
> > >>>  
> > >>> +   put_device(&iommu_pdev->dev);
> > >>> +
> > >>> +   if (!mc || !mc->smmu)
> > >>> +   return -EPROBE_DEFER;
> > >>
> > >> I'm not very excited by seeing code in the patches that can't be
> > >> explained by the patch authors and will appreciate if you could provide
> > >> a detailed explanation about why this NULL checking is needed because I
> > >> think it is unneeded, especially given that other IOMMU drivers don't
> > >> have such check.
> > > 
> > > This function could be called from of_iommu_configure(), which is
> > > part of another driver's probe(). So I think it's safer to have a
> > > check. Yet, given that the mc driver is added at the "arch_initcall" stage,
> > > you are probably right that there's really no need at this moment
> > > because all clients should be called after mc/smmu are inited. So
> > > I'll resend a v6, if that makes you happy.
> > 
> > I wanted to get the explanation :) I'm very curious why it's actually
> > needed because I'm not 100% sure whether it's not needed at all.
> > 
> > I'd assume that the only possible problem could be if some device is
> > created in parallel with the MC probing and there is no locking that
> > could prevent this in the drivers core. It's not apparent to me whether
> > this situation could happen at all in practice.
> > 
> > The MC is created early and at that time everything is sequential, so
> > it should indeed be safe to remove the check.
> 
> I think I now remember exactly why the "hack" in tegra_smmu_probe()
> exists. The reason is that the MC driver does this:
> 
>   mc->smmu = tegra_smmu_probe(...);
> 
> That means that mc->smmu is going to be NULL until tegra_smmu_probe()
> has finished. But tegra_smmu_probe() calls bus_set_iommu() and that in
> turn calls ->probe_device(). So the purpose of the "hack" in the
> tegra_smmu_probe() function was to make sure mc->smmu was available at
> that point, because, well, it is already known, but we haven't gotten
> around to storing it yet.

Yea, that's exactly what I described in the commit message of a
later version of this patch. Yet, we can drop it now as a device
will probe() and call ->probe_device() again.

> ->of_xlate() can theoretically be called as early as right after
> bus_set_iommu() via of_iommu_configure() if that is called in parallel
> with tegra_smmu_probe(). I think that's very unlikely, but I'm not 100%
> sure that it can't happen.

As my previous reply to Dmitry above, I don't think it'd happen,
given that mc/smmu drivers are inited during arch_initcall stage
while all clients should be probed during device_initcall stage,
once this rework patch gets merged.

> In any case, I do agree with Dmitry that we should have a comment here
> explaining why this is necessary. Even if we're completely certain that
> this is necessary, it's not obvious and therefore should get that
> comment. And if we're not certain that it's necessary, it's probably
> also good to mention that in the comment so that eventually it can be
> determined or the check removed if it proves to be unnecessary.

Well, I have answered him, and also removed the mc/smmu check in
v6 already.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Nicolin Chen
On Mon, Oct 05, 2020 at 11:57:54AM +0200, Thierry Reding wrote:
> On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote:
> > On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote:
> > > 02.10.2020 09:08, Nicolin Chen пишет:
> > > >  static int tegra_smmu_of_xlate(struct device *dev,
> > > >struct of_phandle_args *args)
> > > >  {
> > > > +   struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> > > > +   struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > > > u32 id = args->args[0];
> > > >  
> > > > +   of_node_put(args->np);
> > > 
> > > of_find_device_by_node() takes device reference and not the np
> > > reference. This is a bug, please remove of_node_put().
> > 
Looks like so. Replacing it with put_device(&iommu_pdev->dev);
> 
> Putting the put_device() here is wrong, though. You need to make sure
> you keep a reference to it as long as you keep accessing the data that
> is owned by it.

I am confused. You said in the other reply (to Dmitry) that we do
need to put_device(mc->dev), where mc->dev should be the same as
iommu_pdev->dev. But here your comment sounds like we should not call
put_device() at all, since ->probe_device/group_device/attach_dev()
will use it later.

> Like I said earlier, this is a bit weird in this case because we're
> self-referencing, so iommu_pdev->dev is going to stay around as long as
> the SMMU is. However, it might be worth to properly track the lifetime
> anyway just so that the code can serve as a good example of how to do
> things.

What's this "track-the-lifetime"?

> If you decide to go for the shortcut and not track this reference
> properly, then at least you need to add a comment as to why it is safe
> to do in this case. This ensures that readers are aware of the
> circumstances and don't copy this bad code into a context where the
> circumstances are different.

I don't quite get this "shortcut" here either...mind elaborating?
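
For context, here is a minimal sketch of what "tracking the lifetime
properly" could look like: keep the reference taken by
of_find_device_by_node() for as long as the mc data is used, and drop it in
->release_device(). This is only an illustration of the idea, not the code
under review, and it assumes the driver stores mc via dev_iommu_priv_set():

static int tegra_smmu_of_xlate(struct device *dev,
			       struct of_phandle_args *args)
{
	struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
	struct tegra_mc *mc;

	if (!iommu_pdev)
		return -ENODEV;

	/* Keep the reference: it pins iommu_pdev (and thus mc) while in use. */
	mc = platform_get_drvdata(iommu_pdev);
	dev_iommu_priv_set(dev, mc);

	return iommu_fwspec_add_ids(dev, &args->args[0], 1);
}

static void tegra_smmu_release_device(struct device *dev)
{
	struct tegra_mc *mc = dev_iommu_priv_get(dev);

	/* Drop the reference taken in ->of_xlate(). */
	if (mc)
		put_device(mc->dev);
}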
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 3/3] iommu/tegra-smmu: Add PCI support

2020-10-05 Thread Nicolin Chen
On Mon, Oct 05, 2020 at 12:04:19PM +0200, Thierry Reding wrote:
> > +err_bus_set: __maybe_unused;
> > +   bus_set_iommu(&platform_bus_type, NULL);
> > +err_unregister:
> > +   iommu_device_unregister(&smmu->iommu);
> > +err_sysfs:
> > +   iommu_device_sysfs_remove(&smmu->iommu);
> 
> Can you please switch to label names that are more consistent with the
> others in this driver? Notably the ones in tegra_smmu_domain_alloc().
> The idea is to describe in the name of the label what's happening at the
> label. Something like this, for example:
> 
> unset_platform_bus:
>   bus_set_iommu(&platform_bus_type, NULL);
> unregister:
>   iommu_device_unregister(&smmu->iommu);
> remove_sysfs:
>   iommu_device_sysfs_remove(&smmu->iommu);

Okay. Will update in v7.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: clean up the DMA mapping headers

2020-10-05 Thread Christoph Hellwig
I've pulled this into the dma-mapping for-next tree now.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


Re: [PATCH v2 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

2020-10-05 Thread Boris Brezillon
On Mon, 5 Oct 2020 16:16:32 +0100
Steven Price  wrote:

> On 05/10/2020 15:50, Boris Brezillon wrote:
> > On Tue, 22 Sep 2020 15:16:48 +0100
> > Robin Murphy  wrote:
> >   
> >> Midgard GPUs have ACE-Lite master interfaces which allow systems to
> >> integrate them in an I/O-coherent manner. It seems that from the GPU's
> >> viewpoint, the rest of the system is its outer shareable domain, and so
> >> even when snoop signals are wired up, they are only emitted for outer
> >> shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does
> >> indeed get coherent pagetable walks working nicely for the coherent
> >> T620 in the Arm Juno SoC.
> >>
> >> Reviewed-by: Steven Price 
> >> Tested-by: Neil Armstrong 
> >> Signed-off-by: Robin Murphy 
> >> ---
> >>   drivers/iommu/io-pgtable-arm.c | 11 ++-
> >>   1 file changed, 10 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/iommu/io-pgtable-arm.c 
> >> b/drivers/iommu/io-pgtable-arm.c
> >> index dc7bcf858b6d..b4072a18e45d 100644
> >> --- a/drivers/iommu/io-pgtable-arm.c
> >> +++ b/drivers/iommu/io-pgtable-arm.c
> >> @@ -440,7 +440,13 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
> >> arm_lpae_io_pgtable *data,
> >><< ARM_LPAE_PTE_ATTRINDX_SHIFT);
> >>}
> >>   
> >> -  if (prot & IOMMU_CACHE)
> >> +  /*
> >> +   * Also Mali has its own notions of shareability wherein its Inner
> >> +   * domain covers the cores within the GPU, and its Outer domain is
> >> +   * "outside the GPU" (i.e. either the Inner or System domain in CPU
> >> +   * terms, depending on coherency).
> >> +   */
> >> +  if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
> >>pte |= ARM_LPAE_PTE_SH_IS;
> >>else
> >>pte |= ARM_LPAE_PTE_SH_OS;  
> > 
> > Actually, it still doesn't work on s922x :-/. For it to work
> > correctly, I need to drop the outer shareable flag here.
> 
> The logic here does seem a bit odd. Originally it was:
> 
> IOMMU_CACHE -> Inner shared (value 3)
> !IOMMU_CACHE -> Outer shared (value 2)
> 
> For Mali we're forcing everything to the second option. But Mali being 
> Mali doesn't do things the same as LPAE, so for Mali we have:
> 
> 0 - not shared
> 1 - reserved
> 2 - inner(*) and outer shareable
> 3 - inner shareable only
> 
> (*) where "inner" means internal to the GPU, and "outer" means shared 
> with the CPU "inner". Very confusing!
> 
> So originally we had:
> IOMMU_CACHE -> not shared with CPU (only internally in the GPU)
> !IOMMU_CACHE -> shared with CPU
> 
> The change above gets us to "always shared", dropping the SH_OS bit 
> would get us to not even shareable between cores (which doesn't sound 
> like what we want).

Thanks for this explanation.

> 
> It's not at all clear to me why the change helps, but I suspect we want 
> at least "inner" shareable.

Right. Looks like all this was caused by a bad conflict resolution
during a rebase. Sorry for the noise :-/.
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 05/13] genirq: Prepare for default affinity to be passed to __irq_alloc_descs()

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

Instead of plugging in irq_default_affinity at the lowest level in
desc_smp_init() if the caller didn't specify an affinity for this
specific IRQ in its array, allow the default affinity to be passed
down from __irq_alloc_descs() instead.

This is in preparation for irq domains having their own default (and
indeed maximum) affinities.

An alternative possibility would have been to make the caller allocate
an entire array of struct irq_affinity_desc for the allocation, but
that's a lot nastier than simply passing in an additional const cpumask.

This commit has no visible API change outside static functions in
irqdesc.c for now; the actual change will come later.

Signed-off-by: David Woodhouse 
---
 include/linux/interrupt.h |  2 ++
 kernel/irq/irqdesc.c  | 21 +
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index f9aee3538461..cd0ff293486a 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -364,6 +364,8 @@ unsigned int irq_calc_affinity_vectors(unsigned int minvec, 
unsigned int maxvec,
 
 #else /* CONFIG_SMP */
 
+#define irq_default_affinity (NULL)
+
 static inline int irq_set_affinity(unsigned int irq, const struct cpumask *m)
 {
return -EINVAL;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..4ac91b9fc618 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -81,8 +81,6 @@ static int alloc_masks(struct irq_desc *desc, int node)
 static void desc_smp_init(struct irq_desc *desc, int node,
  const struct cpumask *affinity)
 {
-   if (!affinity)
-   affinity = irq_default_affinity;
cpumask_copy(desc->irq_common_data.affinity, affinity);
 
 #ifdef CONFIG_GENERIC_PENDING_IRQ
@@ -465,12 +463,16 @@ static void free_desc(unsigned int irq)
 
 static int alloc_descs(unsigned int start, unsigned int cnt, int node,
   const struct irq_affinity_desc *affinity,
+  const struct cpumask *default_affinity,
   struct module *owner)
 {
struct irq_desc *desc;
int i;
 
/* Validate affinity mask(s) */
+   if (!default_affinity || cpumask_empty(default_affinity))
+   return -EINVAL;
+
if (affinity) {
for (i = 0; i < cnt; i++) {
if (cpumask_empty(&affinity[i].mask))
@@ -479,7 +481,7 @@ static int alloc_descs(unsigned int start, unsigned int 
cnt, int node,
}
 
for (i = 0; i < cnt; i++) {
-   const struct cpumask *mask = NULL;
+   const struct cpumask *mask = default_affinity;
unsigned int flags = 0;
 
if (affinity) {
@@ -488,10 +490,10 @@ static int alloc_descs(unsigned int start, unsigned int 
cnt, int node,
IRQD_MANAGED_SHUTDOWN;
}
mask = &affinity->mask;
-   node = cpu_to_node(cpumask_first(mask));
affinity++;
}
 
+   node = cpu_to_node(cpumask_first(mask));
desc = alloc_desc(start + i, node, flags, mask, owner);
if (!desc)
goto err;
@@ -538,7 +540,7 @@ int __init early_irq_init(void)
nr_irqs = initcnt;
 
for (i = 0; i < initcnt; i++) {
-   desc = alloc_desc(i, node, 0, NULL, NULL);
+   desc = alloc_desc(i, node, 0, irq_default_affinity, NULL);
set_bit(i, allocated_irqs);
irq_insert_desc(i, desc);
}
@@ -573,7 +575,8 @@ int __init early_irq_init(void)
raw_spin_lock_init(&desc[i].lock);
lockdep_set_class(&desc[i].lock, &irq_desc_lock_class);
mutex_init(&desc[i].request_mutex);
-   desc_set_defaults(i, &desc[i], node, NULL, NULL);
+   desc_set_defaults(i, &desc[i], node, irq_default_affinity,
+ NULL);
}
return arch_early_irq_init();
 }
@@ -590,12 +593,14 @@ static void free_desc(unsigned int irq)
unsigned long flags;
 
raw_spin_lock_irqsave(&desc->lock, flags);
-   desc_set_defaults(irq, desc, irq_desc_get_node(desc), NULL, NULL);
+   desc_set_defaults(irq, desc, irq_desc_get_node(desc),
+ irq_default_affinity, NULL);
raw_spin_unlock_irqrestore(&desc->lock, flags);
 }
 
 static inline int alloc_descs(unsigned int start, unsigned int cnt, int node,
  const struct irq_affinity_desc *affinity,
+ const struct cpumask *default_affinity,
  struct module *owner)
 {
u32 i;
@@ -803,7 +808,7 @@ __irq_alloc_descs(int irq, unsigned int from, unsigned int 
cnt, int node,
if (ret)
goto unlock;
}
-   ret = alloc_descs(start, cnt, node, 

[PATCH 04/13] x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

Some hypervisors can allow the guest to use the Extended Destination ID
field in the IOAPIC RTE and MSI address to address up to 32768 CPUs.

Signed-off-by: David Woodhouse 
---
 arch/x86/include/asm/mpspec.h   |  1 +
 arch/x86/include/asm/x86_init.h |  2 ++
 arch/x86/kernel/apic/apic.c | 15 ++-
 arch/x86/kernel/apic/msi.c  |  9 -
 arch/x86/kernel/x86_init.c  |  1 +
 5 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index e90ac7e9ae2c..25ee8ca0a1f2 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -42,6 +42,7 @@ extern DECLARE_BITMAP(mp_bus_not_pci, MAX_MP_BUSSES);
 extern unsigned int boot_cpu_physical_apicid;
 extern u8 boot_cpu_apic_version;
 extern unsigned long mp_lapic_addr;
+extern int msi_ext_dest_id;
 
 #ifdef CONFIG_X86_LOCAL_APIC
 extern int smp_found_config;
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 397196fae24d..5af3fe9e38f3 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -114,6 +114,7 @@ struct x86_init_pci {
  * @init_platform: platform setup
  * @guest_late_init:   guest late init
  * @x2apic_available:  X2APIC detection
+ * @msi_ext_dest_id:   MSI and IOAPIC support 15-bit APIC IDs
  * @init_mem_mapping:  setup early mappings during init_mem_mapping()
  * @init_after_bootmem:guest init after boot allocator is 
finished
  */
@@ -121,6 +122,7 @@ struct x86_hyper_init {
void (*init_platform)(void);
void (*guest_late_init)(void);
bool (*x2apic_available)(void);
+   bool (*msi_ext_dest_id)(void);
void (*init_mem_mapping)(void);
void (*init_after_bootmem)(void);
 };
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index a75767052a92..459c78558f36 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1837,6 +1837,8 @@ static __init void x2apic_enable(void)
 
 static __init void try_to_enable_x2apic(int remap_mode)
 {
+   u32 apic_limit = 0;
+
if (x2apic_state == X2APIC_DISABLED)
return;
 
@@ -1858,7 +1860,15 @@ static __init void try_to_enable_x2apic(int remap_mode)
return;
}
 
-   x2apic_set_max_apicid(255);
+   /*
+* If the hypervisor supports extended destination ID
+* in IOAPIC and MSI, we can support that many CPUs.
+*/
+   if (x86_init.hyper.msi_ext_dest_id()) {
+   msi_ext_dest_id = 1;
+   apic_limit = 32767;
+   } else
+   apic_limit = 255;
}
 
/*
@@ -1867,6 +1877,9 @@ static __init void try_to_enable_x2apic(int remap_mode)
 */
x2apic_phys = 1;
}
+   if (apic_limit)
+   x2apic_set_max_apicid(apic_limit);
+
x2apic_enable();
 }
 
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 356f8acf4927..4d891967bea4 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -23,6 +23,8 @@
 
 struct irq_domain *x86_pci_msi_default_domain __ro_after_init;
 
+int msi_ext_dest_id __ro_after_init;
+
 static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, 
int dmar)
 {
msg->address_hi = MSI_ADDR_BASE_HI;
@@ -45,10 +47,15 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, 
struct msi_msg *msg, int
 * Only the IOMMU itself can use the trick of putting destination
 * APIC ID into the high bits of the address. Anything else would
 * just be writing to memory if it tried that, and needs IR to
-* address APICs above 255.
+* address APICs which can't be addressed in the normal 32-bit
+* address range at 0xFFEx. That is typically just 8 bits, but
+* some hypervisors allow the extended destination ID field in bits
+* 11-5 to be used, giving support for 15 bits of APIC IDs in total.
 */
if (dmar)
msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
+   else if (msi_ext_dest_id && cfg->dest_apicid < 0x8000)
+   msg->address_lo |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid) >> 3;
else
WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid));
 }
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index a3038d8deb6a..8b395821cb8d 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -110,6 +110,7 @@ struct x86_init_ops x86_init __initdata = {
.init_platform  = x86_init_noop,
.guest_late_init= x86_init_noop,

[PATCH 12/13] iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

It took me a while to realise that this "IRQ remapping" driver exists
not to actually remap interrupts, but to return -EINVAL if anyone ever
tries to set the affinity to a set of CPUs which can't be reached
*without* remapping. Having fixed the core IRQ domain code to handle
such limited, all this hackery can now die.

I haven't deleted it entirely because its existence still causes the
kernel to use X2APIC in cluster mode not physical. I'm not sure we
*want* that, but it wants further thought and testing before ripping
it out too.

Signed-off-by: David Woodhouse 
---
 drivers/iommu/hyperv-iommu.c | 149 +--
 1 file changed, 1 insertion(+), 148 deletions(-)

diff --git a/drivers/iommu/hyperv-iommu.c b/drivers/iommu/hyperv-iommu.c
index e09e2d734c57..46a794d34f57 100644
--- a/drivers/iommu/hyperv-iommu.c
+++ b/drivers/iommu/hyperv-iommu.c
@@ -24,156 +24,12 @@
 #include "irq_remapping.h"
 
 #ifdef CONFIG_IRQ_REMAP
-
-/*
- * According 82093AA IO-APIC spec , IO APIC has a 24-entry Interrupt
- * Redirection Table. Hyper-V exposes one single IO-APIC and so define
- * 24 IO APIC remmapping entries.
- */
-#define IOAPIC_REMAPPING_ENTRY 24
-
-static cpumask_t ioapic_max_cpumask = { CPU_BITS_NONE };
-static struct irq_domain *ioapic_ir_domain;
-
-static int hyperv_ir_set_affinity(struct irq_data *data,
-   const struct cpumask *mask, bool force)
-{
-   struct irq_data *parent = data->parent_data;
-   struct irq_cfg *cfg = irqd_cfg(data);
-   struct IO_APIC_route_entry *entry;
-   int ret;
-
-   /* Return error If new irq affinity is out of ioapic_max_cpumask. */
-   if (!cpumask_subset(mask, &ioapic_max_cpumask))
-   return -EINVAL;
-
-   ret = parent->chip->irq_set_affinity(parent, mask, force);
-   if (ret < 0 || ret == IRQ_SET_MASK_OK_DONE)
-   return ret;
-
-   entry = data->chip_data;
-   entry->dest = cfg->dest_apicid;
-   entry->vector = cfg->vector;
-   send_cleanup_vector(cfg);
-
-   return 0;
-}
-
-static struct irq_chip hyperv_ir_chip = {
-   .name   = "HYPERV-IR",
-   .irq_ack= apic_ack_irq,
-   .irq_set_affinity   = hyperv_ir_set_affinity,
-};
-
-static int hyperv_irq_remapping_alloc(struct irq_domain *domain,
-unsigned int virq, unsigned int nr_irqs,
-void *arg)
-{
-   struct irq_alloc_info *info = arg;
-   struct irq_data *irq_data;
-   struct irq_desc *desc;
-   int ret = 0;
-
-   if (!info || info->type != X86_IRQ_ALLOC_TYPE_IOAPIC || nr_irqs > 1)
-   return -EINVAL;
-
-   ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg);
-   if (ret < 0)
-   return ret;
-
-   irq_data = irq_domain_get_irq_data(domain, virq);
-   if (!irq_data) {
-   irq_domain_free_irqs_common(domain, virq, nr_irqs);
-   return -EINVAL;
-   }
-
-   irq_data->chip = &hyperv_ir_chip;
-
-   /*
-* If there is interrupt remapping function of IOMMU, setting irq
-* affinity only needs to change IRTE of IOMMU. But Hyper-V doesn't
-* support interrupt remapping function, setting irq affinity of IO-APIC
-* interrupts still needs to change IO-APIC registers. But ioapic_
-* configure_entry() will ignore value of cfg->vector and cfg->
-* dest_apicid when IO-APIC's parent irq domain is not the vector
-* domain.(See ioapic_configure_entry()) In order to setting vector
-* and dest_apicid to IO-APIC register, IO-APIC entry pointer is saved
-* in the chip_data and hyperv_irq_remapping_activate()/hyperv_ir_set_
-* affinity() set vector and dest_apicid directly into IO-APIC entry.
-*/
-   irq_data->chip_data = info->ioapic.entry;
-
-   /*
-* Hypver-V IO APIC irq affinity should be in the scope of
-* ioapic_max_cpumask because no irq remapping support.
-*/
-   desc = irq_data_to_desc(irq_data);
-   cpumask_copy(desc->irq_common_data.affinity, &ioapic_max_cpumask);
-
-   return 0;
-}
-
-static void hyperv_irq_remapping_free(struct irq_domain *domain,
-unsigned int virq, unsigned int nr_irqs)
-{
-   irq_domain_free_irqs_common(domain, virq, nr_irqs);
-}
-
-static int hyperv_irq_remapping_activate(struct irq_domain *domain,
- struct irq_data *irq_data, bool reserve)
-{
-   struct irq_cfg *cfg = irqd_cfg(irq_data);
-   struct IO_APIC_route_entry *entry = irq_data->chip_data;
-
-   entry->dest = cfg->dest_apicid;
-   entry->vector = cfg->vector;
-
-   return 0;
-}
-
-static const struct irq_domain_ops hyperv_ir_domain_ops = {
-   .alloc = hyperv_irq_remapping_alloc,
-   .free = hyperv_irq_remapping_free,
-   .activate = hyperv_irq_remapping_activate,
-};
-
 static int __init 

[PATCH 13/13] x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

This allows the host to indicate that IOAPIC and MSI emulation supports
15-bit destination IDs, allowing up to 32Ki CPUs without remapping.

Signed-off-by: David Woodhouse 
---
 Documentation/virt/kvm/cpuid.rst | 4 
 arch/x86/include/uapi/asm/kvm_para.h | 1 +
 arch/x86/kernel/kvm.c| 6 ++
 3 files changed, 11 insertions(+)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index a7dff9186bed..1726b5925d2b 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -92,6 +92,10 @@ KVM_FEATURE_ASYNC_PF_INT  14  guest checks 
this feature bit
   async pf acknowledgment msr
   0x4b564d07.
 
+KVM_FEATURE_MSI_EXT_DEST_ID   15  guest checks this feature bit
+  before using extended destination
+  ID bits in MSI address bits 11-5.
+
KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24  host will warn if no guest-side
  per-cpu warps are expected in
   kvmclock
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 812e9b4c1114..950afebfba88 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -32,6 +32,7 @@
 #define KVM_FEATURE_POLL_CONTROL   12
 #define KVM_FEATURE_PV_SCHED_YIELD 13
 #define KVM_FEATURE_ASYNC_PF_INT   14
+#define KVM_FEATURE_MSI_EXT_DEST_ID15
 
 #define KVM_HINTS_REALTIME  0
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1b51b727b140..4986b4399aef 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -743,12 +743,18 @@ static void __init kvm_init_platform(void)
x86_platform.apic_post_init = kvm_apic_init;
 }
 
+static bool __init kvm_msi_ext_dest_id(void)
+{
+   return kvm_para_has_feature(KVM_FEATURE_MSI_EXT_DEST_ID);
+}
+
 const __initconst struct hypervisor_x86 x86_hyper_kvm = {
.name   = "KVM",
.detect = kvm_detect,
.type   = X86_HYPER_KVM,
.init.guest_late_init   = kvm_guest_init,
.init.x2apic_available  = kvm_para_available,
+   .init.msi_ext_dest_id   = kvm_msi_ext_dest_id,
.init.init_platform = kvm_init_platform,
 };
 
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 01/13] x86/apic: Use x2apic in guest kernels even with unusable CPUs.

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

If the BIOS has enabled x2apic mode, then leave it enabled and don't
fall back to xapic just because some CPUs can't be targeted.

Previously, if there are CPUs with APIC IDs > 255, the kernel will
disable x2apic and fall back to xapic. And then not use the additional
CPUs anyway, since xapic can't target them either.

In fact, xapic mode can target even *fewer* CPUs, since it can't use
the one with APIC ID 255 either. Using x2apic instead gives us at least
one more CPU without needing interrupt remapping in the guest.

Signed-off-by: David Woodhouse 
---
 arch/x86/include/asm/apic.h|  1 +
 arch/x86/kernel/apic/apic.c| 18 ++
 arch/x86/kernel/apic/x2apic_phys.c |  9 +
 3 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 1c129abb7f09..b0fd204e0023 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -259,6 +259,7 @@ static inline u64 native_x2apic_icr_read(void)
 
 extern int x2apic_mode;
 extern int x2apic_phys;
+extern void __init x2apic_set_max_apicid(u32 apicid);
 extern void __init check_x2apic(void);
 extern void x2apic_setup(void);
 static inline int x2apic_enabled(void)
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index b3eef1d5c903..a75767052a92 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1841,14 +1841,24 @@ static __init void try_to_enable_x2apic(int remap_mode)
return;
 
if (remap_mode != IRQ_REMAP_X2APIC_MODE) {
-   /* IR is required if there is APIC ID > 255 even when running
-* under KVM
+   /*
+* If there are APIC IDs above 255, they cannot be used
+* without interrupt remapping. In a virtual machine, but
+* not on bare metal, X2APIC can be used anyway. In the
+* case where BIOS has enabled X2APIC, don't disable it
+* just because there are APIC IDs that can't be used.
+* They couldn't be used if the kernel falls back to XAPIC
+* anyway.
 */
if (max_physical_apicid > 255 ||
!x86_init.hyper.x2apic_available()) {
pr_info("x2apic: IRQ remapping doesn't support X2APIC 
mode\n");
-   x2apic_disable();
-   return;
+   if (!x2apic_mode) {
+   x2apic_disable();
+   return;
+   }
+
+   x2apic_set_max_apicid(255);
}
 
/*
diff --git a/arch/x86/kernel/apic/x2apic_phys.c 
b/arch/x86/kernel/apic/x2apic_phys.c
index bc9693841353..b4b4e89c1118 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -8,6 +8,12 @@
 int x2apic_phys;
 
 static struct apic apic_x2apic_phys;
+static u32 x2apic_max_apicid;
+
+void __init x2apic_set_max_apicid(u32 apicid)
+{
+   x2apic_max_apicid = apicid;
+}
 
 static int __init set_x2apic_phys_mode(char *arg)
 {
@@ -98,6 +104,9 @@ static int x2apic_phys_probe(void)
 /* Common x2apic functions, also used by x2apic_cluster */
 int x2apic_apic_id_valid(u32 apicid)
 {
+   if (x2apic_max_apicid && apicid > x2apic_max_apicid)
+   return 0;
+
return 1;
 }
 
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 11/13] x86/smp: Allow more than 255 CPUs even without interrupt remapping

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

Now that external interrupt affinity can be limited to the range of
CPUs that can be reached through legacy IOAPIC RTEs and MSI, it is
possible to use additional CPUs.

Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/apic/apic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 069f5e9f1d28..750a92464bec 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -1881,8 +1881,6 @@ static __init void try_to_enable_x2apic(int remap_mode)
 */
x2apic_phys = 1;
}
-   if (apic_limit)
-   x2apic_set_max_apicid(apic_limit);
 
/* Build the affinity mask for interrupts that can't be remapped. */
cpumask_clear(&x86_non_ir_cpumask);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 07/13] irqdomain: Add max_affinity argument to irq_domain_alloc_descs()

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

This is the maximum possible set of CPUs which can be used. Use it
to calculate the default affinity requested from __irq_alloc_descs()
by first attempting to find the intersection with irq_default_affinity,
or falling back to using just the max_affinity if the intersection
would be empty.

Signed-off-by: David Woodhouse 
---
 include/linux/irqdomain.h |  3 ++-
 kernel/irq/ipi.c  |  2 +-
 kernel/irq/irqdomain.c| 45 +--
 3 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 5d9de881..6b5576da77f0 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -278,7 +278,8 @@ extern void irq_set_default_host(struct irq_domain *host);
 extern struct irq_domain *irq_get_default_host(void);
 extern int irq_domain_alloc_descs(int virq, unsigned int nr_irqs,
  irq_hw_number_t hwirq, int node,
- const struct irq_affinity_desc *affinity);
+ const struct irq_affinity_desc *affinity,
+ const struct cpumask *max_affinity);
 
 static inline struct fwnode_handle *of_node_to_fwnode(struct device_node *node)
 {
diff --git a/kernel/irq/ipi.c b/kernel/irq/ipi.c
index 43e3d1be622c..13f56210eca9 100644
--- a/kernel/irq/ipi.c
+++ b/kernel/irq/ipi.c
@@ -75,7 +75,7 @@ int irq_reserve_ipi(struct irq_domain *domain,
}
}
 
-   virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL);
+   virq = irq_domain_alloc_descs(-1, nr_irqs, 0, NUMA_NO_NODE, NULL, dest);
if (virq <= 0) {
pr_warn("Can't reserve IPI, failed to alloc descs\n");
return -ENOMEM;
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index c93e00ca11d8..ffd41f21afca 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -660,7 +660,8 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
}
 
/* Allocate a virtual interrupt number */
-   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node), 
NULL);
+   virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
+ NULL, NULL);
if (virq <= 0) {
pr_debug("-> virq allocation failed\n");
return 0;
@@ -1011,25 +1012,57 @@ int irq_domain_translate_twocell(struct irq_domain *d,
 EXPORT_SYMBOL_GPL(irq_domain_translate_twocell);
 
 int irq_domain_alloc_descs(int virq, unsigned int cnt, irq_hw_number_t hwirq,
-  int node, const struct irq_affinity_desc *affinity)
+  int node, const struct irq_affinity_desc *affinity,
+  const struct cpumask *max_affinity)
 {
+   cpumask_var_t default_affinity;
unsigned int hint;
+   int i;
+
+   /* Check requested per-IRQ affinities are in the possible range */
+   if (affinity && max_affinity) {
+   for (i = 0; i < cnt; i++)
+   if (!cpumask_subset(&affinity[i].mask, max_affinity))
+   return -EINVAL;
+   }
+
+   /*
+* Generate default affinity. Either the possible subset of
+* irq_default_affinity if such a subset is non-empty, or fall
+* back to the provided max_affinity if there is no intersection.
+* And just a copy of irq_default_affinity in the
+* !CONFIG_CPUMASK_OFFSTACK case.
+*/
+   memset(&default_affinity, 0, sizeof(default_affinity));
+   if ((max_affinity &&
+!cpumask_subset(irq_default_affinity, max_affinity))) {
+   if (!alloc_cpumask_var(&default_affinity, GFP_KERNEL))
+   return -ENOMEM;
+   cpumask_and(default_affinity, max_affinity,
+   irq_default_affinity);
+   if (cpumask_empty(default_affinity))
+   cpumask_copy(default_affinity, max_affinity);
+   } else if (cpumask_available(default_affinity))
+   cpumask_copy(default_affinity, irq_default_affinity);
 
if (virq >= 0) {
virq = __irq_alloc_descs(virq, virq, cnt, node, THIS_MODULE,
-affinity, NULL);
+affinity, default_affinity);
} else {
hint = hwirq % nr_irqs;
if (hint == 0)
hint++;
virq = __irq_alloc_descs(-1, hint, cnt, node, THIS_MODULE,
-affinity, NULL);
+affinity, default_affinity);
if (virq <= 0 && hint > 1) {
virq = __irq_alloc_descs(-1, 1, cnt, node, THIS_MODULE,
-affinity, NULL);
+affinity, 

[PATCH 08/13] genirq: Add irq_domain_set_affinity()

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

This allows a maximal affinity to be set, for IRQ domains which cannot
target all CPUs in the system.

Signed-off-by: David Woodhouse 
---
 include/linux/irqdomain.h |  4 
 kernel/irq/irqdomain.c| 28 ++--
 kernel/irq/manage.c   | 19 +++
 3 files changed, 45 insertions(+), 6 deletions(-)

diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h
index 6b5576da77f0..cf686f18c1fa 100644
--- a/include/linux/irqdomain.h
+++ b/include/linux/irqdomain.h
@@ -151,6 +151,7 @@ struct irq_domain_chip_generic;
  *  drivers using the generic chip library which uses this pointer.
  * @parent: Pointer to parent irq_domain to support hierarchy irq_domains
  * @debugfs_file: dentry for the domain debugfs file
+ * @max_affinity: mask of CPUs targetable by this IRQ domain
  *
  * Revmap data, used internally by irq_domain
  * @revmap_direct_max_irq: The largest hwirq that can be set for controllers 
that
@@ -177,6 +178,7 @@ struct irq_domain {
 #ifdef CONFIG_GENERIC_IRQ_DEBUGFS
struct dentry   *debugfs_file;
 #endif
+   const struct cpumask *max_affinity;
 
/* reverse map data. The linear map gets appended to the irq_domain */
irq_hw_number_t hwirq_max;
@@ -453,6 +455,8 @@ extern void irq_domain_set_info(struct irq_domain *domain, 
unsigned int virq,
void *chip_data, irq_flow_handler_t handler,
void *handler_data, const char *handler_name);
 extern void irq_domain_reset_irq_data(struct irq_data *irq_data);
+extern int irq_domain_set_affinity(struct irq_domain *domain,
+  const struct cpumask *affinity);
 #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
 extern struct irq_domain *irq_domain_create_hierarchy(struct irq_domain 
*parent,
unsigned int flags, unsigned int size,
diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c
index ffd41f21afca..0b298355be1b 100644
--- a/kernel/irq/irqdomain.c
+++ b/kernel/irq/irqdomain.c
@@ -661,7 +661,7 @@ unsigned int irq_create_mapping(struct irq_domain *domain,
 
/* Allocate a virtual interrupt number */
virq = irq_domain_alloc_descs(-1, 1, hwirq, of_node_to_nid(of_node),
- NULL, NULL);
+ NULL, domain->max_affinity);
if (virq <= 0) {
pr_debug("-> virq allocation failed\n");
return 0;
@@ -1078,6 +1078,27 @@ void irq_domain_reset_irq_data(struct irq_data *irq_data)
 }
 EXPORT_SYMBOL_GPL(irq_domain_reset_irq_data);
 
+/**
+ * irq_domain_set_affinity - Set maximum CPU affinity for domain
+ * @parent:Domain to set affinity for
+ * @affinity:  Pointer to cpumask, consumed by domain
+ *
+ * Sets the maximal set of CPUs to which interrupts in this domain may
+ * be delivered. Must only be called after creation, before any interrupts
+ * have been in the domain.
+ *
+ * This function retains a pointer to the cpumask which is passed in.
+ */
+int irq_domain_set_affinity(struct irq_domain *domain,
+   const struct cpumask *affinity)
+{
+   if (cpumask_empty(affinity))
+   return -EINVAL;
+   domain->max_affinity = affinity;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(irq_domain_set_affinity);
+
 #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
 /**
  * irq_domain_create_hierarchy - Add a irqdomain into the hierarchy
@@ -1110,6 +1131,9 @@ struct irq_domain *irq_domain_create_hierarchy(struct 
irq_domain *parent,
if (domain) {
domain->parent = parent;
domain->flags |= flags;
+   if (parent && parent->max_affinity)
+   irq_domain_set_affinity(domain,
+   parent->max_affinity);
}
 
return domain;
@@ -1375,7 +1399,7 @@ int __irq_domain_alloc_irqs(struct irq_domain *domain, 
int irq_base,
virq = irq_base;
} else {
virq = irq_domain_alloc_descs(irq_base, nr_irqs, 0, node,
- affinity, NULL);
+ affinity, domain->max_affinity);
if (virq < 0) {
pr_debug("cannot allocate IRQ(base %d, count %d)\n",
 irq_base, nr_irqs);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5df903fccb60..e8c5f8ecd306 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -345,6 +345,10 @@ int irq_set_affinity_locked(struct irq_data *data, const 
struct cpumask *mask,
struct irq_desc *desc = irq_data_to_desc(data);
int ret = 0;
 
+   if (data->domain && data->domain->max_affinity &&
+   !cpumask_subset(mask, data->domain->max_affinity))
+   return -EINVAL;
+
if (!chip || !chip->irq_set_affinity)
return -EINVAL;
 
@@ -484,13 +488,20 @@ int 

[PATCH 02/13] x86/msi: Only use high bits of MSI address for DMAR unit

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

The Intel IOMMU has an MSI-like configuration for its interrupt, but
it isn't really MSI. So it gets to abuse the high 32 bits of the address,
and puts the high 24 bits of the extended APIC ID there.

This isn't something that can be used in the general case for real MSIs,
since external devices using the high bits of the address would be
performing writes to actual memory space above 4GiB, not targeted at the
APIC.

Factor the hack out and allow it only to be used when appropriate,
adding a WARN_ON_ONCE() if other MSIs are targeted at an unreachable APIC
ID. That should never happen since the legacy MSI messages are not supposed
to be used with Interrupt Remapping enabled.

The x2apic_enabled() check isn't needed because we won't bring up CPUs
with higher APIC IDs unless x2apic is enabled anyway.

Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/apic/msi.c | 34 --
 1 file changed, 28 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 6313f0a05db7..356f8acf4927 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -23,13 +23,10 @@
 
 struct irq_domain *x86_pci_msi_default_domain __ro_after_init;
 
-static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg)
+static void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, 
int dmar)
 {
msg->address_hi = MSI_ADDR_BASE_HI;
 
-   if (x2apic_enabled())
-   msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
-
msg->address_lo =
MSI_ADDR_BASE_LO |
((apic->irq_dest_mode == 0) ?
@@ -43,18 +40,42 @@ static void __irq_msi_compose_msg(struct irq_cfg *cfg, 
struct msi_msg *msg)
MSI_DATA_LEVEL_ASSERT |
MSI_DATA_DELIVERY_FIXED |
MSI_DATA_VECTOR(cfg->vector);
+
+   /*
+* Only the IOMMU itself can use the trick of putting destination
+* APIC ID into the high bits of the address. Anything else would
+* just be writing to memory if it tried that, and needs IR to
+* address APICs above 255.
+*/
+   if (dmar)
+   msg->address_hi |= MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid);
+   else
+   WARN_ON_ONCE(MSI_ADDR_EXT_DEST_ID(cfg->dest_apicid));
 }
 
 void x86_vector_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
 {
-   __irq_msi_compose_msg(irqd_cfg(data), msg);
+   __irq_msi_compose_msg(irqd_cfg(data), msg, 0);
 }
 
+/*
+ * The Intel IOMMU (ab)uses the high bits of the MSI address to contain the
+ * high bits of the destination APIC ID. This can't be done in the general
+ * case for MSIs as it would be targeting real memory above 4GiB not the
+ * APIC.
+ */
+static void dmar_msi_compose_msg(struct irq_data *data, struct msi_msg *msg)
+{
+   __irq_msi_compose_msg(irqd_cfg(data), msg, 1);
+
+
+
+}
 static void irq_msi_update_msg(struct irq_data *irqd, struct irq_cfg *cfg)
 {
struct msi_msg msg[2] = { [1] = { }, };
 
-   __irq_msi_compose_msg(cfg, msg);
+   __irq_msi_compose_msg(cfg, msg, 0);
irq_data_get_irq_chip(irqd)->irq_write_msi_msg(irqd, msg);
 }
 
@@ -288,6 +309,7 @@ static struct irq_chip dmar_msi_controller = {
.irq_ack= irq_chip_ack_parent,
.irq_set_affinity   = msi_domain_set_affinity,
.irq_retrigger  = irq_chip_retrigger_hierarchy,
+   .irq_compose_msi_msg= dmar_msi_compose_msg,
.irq_write_msi_msg  = dmar_msi_write_msg,
.flags  = IRQCHIP_SKIP_SET_WAKE,
 };
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 09/13] x86/irq: Add x86_non_ir_cpumask

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

This is the mask of CPUs to which IRQs can be delivered without interrupt
remapping.

Signed-off-by: David Woodhouse 
---
 arch/x86/include/asm/mpspec.h  |  1 +
 arch/x86/kernel/apic/apic.c| 12 
 arch/x86/kernel/apic/io_apic.c |  2 ++
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
index 25ee8ca0a1f2..b2090be5b444 100644
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -141,5 +141,6 @@ static inline void physid_set_mask_of_physid(int physid, 
physid_mask_t *map)
 #define PHYSID_MASK_NONE   { {[0 ... PHYSID_ARRAY_SIZE-1] = 0UL} }
 
 extern physid_mask_t phys_cpu_present_map;
+extern cpumask_t x86_non_ir_cpumask;
 
 #endif /* _ASM_X86_MPSPEC_H */
diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
index 459c78558f36..069f5e9f1d28 100644
--- a/arch/x86/kernel/apic/apic.c
+++ b/arch/x86/kernel/apic/apic.c
@@ -103,6 +103,9 @@ EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_apicid);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_bios_cpu_apicid);
 EXPORT_EARLY_PER_CPU_SYMBOL(x86_cpu_to_acpiid);
 
+/* Mask of CPUs which can be targeted by non-remapped interrupts. */
+cpumask_t x86_non_ir_cpumask = { CPU_BITS_ALL };
+
 #ifdef CONFIG_X86_32
 
 /*
@@ -1838,6 +1841,7 @@ static __init void x2apic_enable(void)
 static __init void try_to_enable_x2apic(int remap_mode)
 {
u32 apic_limit = 0;
+   int i;
 
if (x2apic_state == X2APIC_DISABLED)
return;
@@ -1880,6 +1884,14 @@ static __init void try_to_enable_x2apic(int remap_mode)
if (apic_limit)
x2apic_set_max_apicid(apic_limit);
 
+   /* Build the affinity mask for interrupts that can't be remapped. */
+   cpumask_clear(&x86_non_ir_cpumask);
+   i = min_t(unsigned int, num_possible_cpus() - 1, apic_limit);
+   for ( ; i >= 0; i--) {
+   if (cpu_physical_id(i) <= apic_limit)
+   cpumask_set_cpu(i, &x86_non_ir_cpumask);
+   }
+
x2apic_enable();
 }
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index aa9a3b54a96c..4d0ef46fedb9 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2098,6 +2098,8 @@ static int mp_alloc_timer_irq(int ioapic, int pin)
struct irq_alloc_info info;
 
ioapic_set_alloc_attr(&info, NUMA_NO_NODE, 0, 0);
+   if (domain->parent == x86_vector_domain)
+   info.mask = &x86_non_ir_cpumask;
info.devid = mpc_ioapic_id(ioapic);
info.ioapic.pin = pin;
mutex_lock(_mutex);
-- 
2.26.2

___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu


[PATCH 06/13] genirq: Add default_affinity argument to __irq_alloc_descs()

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

It already takes an array of affinities for each individual irq being
allocated but that's awkward to allocate and populate in the case
where they're all the same and inherited from the irqdomain, so pass
the default in separately as a simple cpumask.

Signed-off-by: David Woodhouse 
---
 include/linux/irq.h| 10 ++
 kernel/irq/devres.c|  8 ++--
 kernel/irq/irqdesc.c   | 10 --
 kernel/irq/irqdomain.c |  6 +++---
 4 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 1b7f4dfee35b..6e119594d35d 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -897,15 +897,17 @@ unsigned int arch_dynirq_lower_bound(unsigned int from);
 
 int __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
  struct module *owner,
- const struct irq_affinity_desc *affinity);
+ const struct irq_affinity_desc *affinity,
+ const struct cpumask *default_affinity);
 
 int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
   unsigned int cnt, int node, struct module *owner,
-  const struct irq_affinity_desc *affinity);
+  const struct irq_affinity_desc *affinity,
+  const struct cpumask *default_affinity);
 
 /* use macros to avoid needing export.h for THIS_MODULE */
 #define irq_alloc_descs(irq, from, cnt, node)  \
-   __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL)
+   __irq_alloc_descs(irq, from, cnt, node, THIS_MODULE, NULL, NULL)
 
 #define irq_alloc_desc(node)   \
irq_alloc_descs(-1, 0, 1, node)
@@ -920,7 +922,7 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, 
unsigned int from,
irq_alloc_descs(-1, from, cnt, node)
 
 #define devm_irq_alloc_descs(dev, irq, from, cnt, node)\
-   __devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL)
+   __devm_irq_alloc_descs(dev, irq, from, cnt, node, THIS_MODULE, NULL, 
NULL)
 
 #define devm_irq_alloc_desc(dev, node) \
devm_irq_alloc_descs(dev, -1, 0, 1, node)
diff --git a/kernel/irq/devres.c b/kernel/irq/devres.c
index f6e5515ee077..079339decc23 100644
--- a/kernel/irq/devres.c
+++ b/kernel/irq/devres.c
@@ -170,6 +170,8 @@ static void devm_irq_desc_release(struct device *dev, void 
*res)
  * @affinity:  Optional pointer to an irq_affinity_desc array of size @cnt
  * which hints where the irq descriptors should be allocated
  * and which default affinities to use
+ * @default_affinity: Optional pointer to a cpumask indicating the default
+ *  affinity to use where not specified by the @affinity array
  *
  * Returns the first irq number or error code.
  *
@@ -177,7 +179,8 @@ static void devm_irq_desc_release(struct device *dev, void 
*res)
  */
 int __devm_irq_alloc_descs(struct device *dev, int irq, unsigned int from,
   unsigned int cnt, int node, struct module *owner,
-  const struct irq_affinity_desc *affinity)
+  const struct irq_affinity_desc *affinity,
+  const struct cpumask *default_affinity)
 {
struct irq_desc_devres *dr;
int base;
@@ -186,7 +189,8 @@ int __devm_irq_alloc_descs(struct device *dev, int irq, 
unsigned int from,
if (!dr)
return -ENOMEM;
 
-   base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity);
+   base = __irq_alloc_descs(irq, from, cnt, node, owner, affinity,
+default_affinity);
if (base < 0) {
devres_free(dr);
return base;
diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 4ac91b9fc618..fcc3b8a1fe01 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -770,15 +770,21 @@ EXPORT_SYMBOL_GPL(irq_free_descs);
  * @affinity:  Optional pointer to an affinity mask array of size @cnt which
  * hints where the irq descriptors should be allocated and which
  * default affinities to use
+ * @default_affinity: Optional pointer to a cpumask indicating the default
+ *  affinity where not specified in the @affinity array
  *
  * Returns the first irq number or error code
  */
 int __ref
 __irq_alloc_descs(int irq, unsigned int from, unsigned int cnt, int node,
- struct module *owner, const struct irq_affinity_desc 
*affinity)
+ struct module *owner, const struct irq_affinity_desc 
*affinity,
+ const struct cpumask *default_affinity)
 {
int start, ret;
 
+   if (!default_affinity)
+   default_affinity = irq_default_affinity;
+
if (!cnt)
return -EINVAL;
 
@@ -808,7 +814,7 @@ __irq_alloc_descs(int irq, unsigned int from, 

[PATCH 03/13] x86/ioapic: Handle Extended Destination ID field in RTE

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

The IOAPIC Redirection Table Entries contain an 8-bit Extended
Destination ID field which maps to bits 11-4 of the MSI address.

The lowest bit is used to indicate remappable format, when interrupt
remapping is in use. A hypervisor can use the other 7 bits to permit
guests to address up to 15 bits of APIC IDs, thus allowing 32768 vCPUs
before having to expose a vIOMMU and interrupt remapping to the guest.

No behavioural change in this patch, since nothing yet permits APIC IDs
above 255 to be used with the non-IR IOAPIC domain.

Signed-off-by: David Woodhouse 
---
 arch/x86/include/asm/io_apic.h |  3 ++-
 arch/x86/kernel/apic/io_apic.c | 19 +--
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
index a1a26f6d3aa4..e65a0b7379d0 100644
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -78,7 +78,8 @@ struct IO_APIC_route_entry {
mask:  1,   /* 0: enabled, 1: disabled */
__reserved_2: 15;
 
-   __u32   __reserved_3: 24,
+   __u32   __reserved_3: 17,
+   ext_dest:  7,
dest:  8;
 } __attribute__ ((packed));
 
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index a33380059db6..aa9a3b54a96c 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1239,10 +1239,10 @@ static void io_apic_print_entries(unsigned int apic, 
unsigned int nr_entries)
   buf, (ir_entry->index2 << 15) | ir_entry->index,
   ir_entry->zero);
else
-   printk(KERN_DEBUG "%s, %s, D(%02X), M(%1d)\n",
+   printk(KERN_DEBUG "%s, %s, D(%02X%02X), M(%1d)\n",
   buf,
   entry.dest_mode == IOAPIC_DEST_MODE_LOGICAL ?
-  "logical " : "physical",
+  "logical " : "physical", entry.ext_dest,
   entry.dest, entry.delivery_mode);
}
 }
@@ -1410,6 +1410,7 @@ void native_restore_boot_irq_mode(void)
 */
if (ioapic_i8259.pin != -1) {
struct IO_APIC_route_entry entry;
+   u32 apic_id = read_apic_id();
 
		memset(&entry, 0, sizeof(entry));
entry.mask  = IOAPIC_UNMASKED;
@@ -1417,7 +1418,8 @@ void native_restore_boot_irq_mode(void)
entry.polarity  = IOAPIC_POL_HIGH;
entry.dest_mode = IOAPIC_DEST_MODE_PHYSICAL;
entry.delivery_mode = dest_ExtINT;
-   entry.dest  = read_apic_id();
+   entry.dest  = apic_id & 0xff;
+   entry.ext_dest  = apic_id >> 8;
 
/*
 * Add it to the IO-APIC irq-routing table:
@@ -1861,7 +1863,8 @@ static void ioapic_configure_entry(struct irq_data *irqd)
 * ioapic chip to verify that.
 */
	if (irqd->chip == &ioapic_chip) {
-   mpd->entry.dest = cfg->dest_apicid;
+   mpd->entry.dest = cfg->dest_apicid & 0xff;
+   mpd->entry.ext_dest = cfg->dest_apicid >> 8;
mpd->entry.vector = cfg->vector;
}
for_each_irq_pin(entry, mpd->irq_2_pin)
@@ -2027,6 +2030,7 @@ static inline void __init unlock_ExtINT_logic(void)
int apic, pin, i;
struct IO_APIC_route_entry entry0, entry1;
unsigned char save_control, save_freq_select;
+   u32 apic_id;
 
pin  = find_isa_irq_pin(8, mp_INT);
if (pin == -1) {
@@ -2042,11 +2046,13 @@ static inline void __init unlock_ExtINT_logic(void)
entry0 = ioapic_read_entry(apic, pin);
clear_IO_APIC_pin(apic, pin);
 
+   apic_id = hard_smp_processor_id();
	memset(&entry1, 0, sizeof(entry1));
 
entry1.dest_mode = IOAPIC_DEST_MODE_PHYSICAL;
entry1.mask = IOAPIC_UNMASKED;
-   entry1.dest = hard_smp_processor_id();
+   entry1.dest = apic_id & 0xff;
+   entry1.ext_dest = apic_id >> 8;
entry1.delivery_mode = dest_ExtINT;
entry1.polarity = entry0.polarity;
entry1.trigger = IOAPIC_EDGE;
@@ -2949,7 +2955,8 @@ static void mp_setup_entry(struct irq_cfg *cfg, struct 
mp_chip_data *data,
memset(entry, 0, sizeof(*entry));
entry->delivery_mode = apic->irq_delivery_mode;
entry->dest_mode = apic->irq_dest_mode;
-   entry->dest  = cfg->dest_apicid;
+   entry->dest  = cfg->dest_apicid & 0xff;
+   entry->ext_dest  = cfg->dest_apicid >> 8;
entry->vector= cfg->vector;
entry->trigger   = data->trigger;
entry->polarity  = data->polarity;
-- 
2.26.2


[PATCH 0/13] Fix per-domain IRQ affinity, allow >255 CPUs on x86 without IRQ remapping

2020-10-05 Thread David Woodhouse
Linux currently refuses to use >255 CPUs on x86 unless it has interrupt
remapping. This is a bit gratuitous because it could use those extra
CPUs just fine; it just can't target external interrupts at them.

The only problem is that our generic IRQ domain code can't cope with
the concept of domains which can only target a subset of CPUs.

The hyperv-iommu IRQ remapping driver works around this — not by
actually doing any remapping, but just returning -EINVAL if the
affinity is ever set to an unreachable CPU. This almost works, but ends
up being a bit late because irq_set_affinity_locked() doesn't call into
the irqchip driver immediately; the error only happens later.

This patch series implements a per-domain "maximum affinity" set and
uses it for the non-remapped IOAPIC and MSI domains on x86. As well as
allowing more CPUs to be used without interrupt remapping, this also
fixes the case where some IOAPICs or PCI devices aren't actually in
scope of any active IOMMU and are operating without remapping.

While we're at it, recognise that the 8-bit limit is a bit gratuitous
and a hypervisor could offer at least 15 bits of APIC ID in the IOAPIC
RTE and MSI address bits 11-5 without even needing to use remapping.
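
As a rough illustration of that encoding (the helper name and standalone
function below are invented for this cover letter, they are not part of the
series), a 15-bit APIC ID gets split across the legacy destination field and
the extended destination ID, mirroring the dest/ext_dest split in the IOAPIC
RTE:

	/* Sketch only: compose a non-remapped MSI address for a 15-bit APIC ID */
	static u32 example_msi_address(u32 apicid)
	{
		u32 addr = 0xfee00000;			/* MSI address base */

		addr |= (apicid & 0xff) << 12;		/* legacy destination, bits 19-12 */
		addr |= (apicid >> 8) << 5;		/* extended destination ID, bits 11-5 */
							/* bit 4 stays 0: not remappable format */
		return addr;
	}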

David Woodhouse (13):
  x86/apic: Use x2apic in guest kernels even with unusable CPUs.
  x86/msi: Only use high bits of MSI address for DMAR unit
  x86/ioapic: Handle Extended Destination ID field in RTE
  x86/apic: Support 15 bits of APIC ID in IOAPIC/MSI where available
  genirq: Prepare for default affinity to be passed to __irq_alloc_descs()
  genirq: Add default_affinity argument to __irq_alloc_descs()
  irqdomain: Add max_affinity argument to irq_domain_alloc_descs()
  genirq: Add irq_domain_set_affinity()
  x86/irq: Add x86_non_ir_cpumask
  x86/irq: Limit IOAPIC and MSI domains' affinity without IR
  x86/smp: Allow more than 255 CPUs even without interrupt remapping
  iommu/irq_remapping: Kill most of hyperv-iommu.c now it's redundant
  x86/kvm: Add KVM_FEATURE_MSI_EXT_DEST_ID

 Documentation/virt/kvm/cpuid.rst |   4 +
 arch/x86/include/asm/apic.h  |   1 +
 arch/x86/include/asm/io_apic.h   |   3 +-
 arch/x86/include/asm/mpspec.h|   2 +
 arch/x86/include/asm/x86_init.h  |   2 +
 arch/x86/include/uapi/asm/kvm_para.h |   1 +
 arch/x86/kernel/apic/apic.c  |  41 +-
 arch/x86/kernel/apic/io_apic.c   |  23 --
 arch/x86/kernel/apic/msi.c   |  44 +--
 arch/x86/kernel/apic/x2apic_phys.c   |   9 +++
 arch/x86/kernel/kvm.c|   6 ++
 arch/x86/kernel/x86_init.c   |   1 +
 drivers/iommu/hyperv-iommu.c | 149 +--
 include/linux/interrupt.h|   2 +
 include/linux/irq.h  |  10 ++-
 include/linux/irqdomain.h|   7 +-
 kernel/irq/devres.c  |   8 +-
 kernel/irq/ipi.c |   2 +-
 kernel/irq/irqdesc.c |  29 ---
 kernel/irq/irqdomain.c   |  69 ++--
 kernel/irq/manage.c  |  19 -
 21 files changed, 240 insertions(+), 192 deletions(-)



[PATCH 10/13] x86/irq: Limit IOAPIC and MSI domains' affinity without IR

2020-10-05 Thread David Woodhouse
From: David Woodhouse 

When interrupt remapping isn't enabled, only the first 255 CPUs can
receive external interrupts. Set the appropriate max affinity for
the IOAPIC and MSI IRQ domains accordingly.

This also fixes the case where interrupt remapping is enabled but some
devices are not within the scope of any active IOMMU.

Signed-off-by: David Woodhouse 
---
 arch/x86/kernel/apic/io_apic.c | 2 ++
 arch/x86/kernel/apic/msi.c | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 4d0ef46fedb9..1c8ce7bc098f 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -2332,6 +2332,8 @@ static int mp_irqdomain_create(int ioapic)
}
 
ip->irqdomain->parent = parent;
+   if (parent == x86_vector_domain)
+   irq_domain_set_affinity(ip->irqdomain, &x86_non_ir_cpumask);
 
if (cfg->type == IOAPIC_DOMAIN_LEGACY ||
cfg->type == IOAPIC_DOMAIN_STRICT)
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 4d891967bea4..af5ce5c4da02 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -259,6 +259,7 @@ struct irq_domain * __init 
native_create_pci_msi_domain(void)
pr_warn("Failed to initialize PCI-MSI irqdomain.\n");
} else {
d->flags |= IRQ_DOMAIN_MSI_NOMASK_QUIRK;
+   irq_domain_set_affinity(d, &x86_non_ir_cpumask);
}
return d;
 }
@@ -479,6 +480,8 @@ struct irq_domain *hpet_create_irq_domain(int hpet_id)
irq_domain_free_fwnode(fn);
kfree(domain_info);
}
+   if (parent == x86_vector_domain)
+   irq_domain_set_affinity(d, &x86_non_ir_cpumask);
return d;
 }
 
-- 
2.26.2



Re: [PATCH v2 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

2020-10-05 Thread Steven Price

On 05/10/2020 15:50, Boris Brezillon wrote:

On Tue, 22 Sep 2020 15:16:48 +0100
Robin Murphy  wrote:


Midgard GPUs have ACE-Lite master interfaces which allows systems to
integrate them in an I/O-coherent manner. It seems that from the GPU's
viewpoint, the rest of the system is its outer shareable domain, and so
even when snoop signals are wired up, they are only emitted for outer
shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does
indeed get coherent pagetable walks working nicely for the coherent
T620 in the Arm Juno SoC.

Reviewed-by: Steven Price 
Tested-by: Neil Armstrong 
Signed-off-by: Robin Murphy 
---
  drivers/iommu/io-pgtable-arm.c | 11 ++-
  1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index dc7bcf858b6d..b4072a18e45d 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -440,7 +440,13 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
arm_lpae_io_pgtable *data,
<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
}
  
-	if (prot & IOMMU_CACHE)

+   /*
+* Also Mali has its own notions of shareability wherein its Inner
+* domain covers the cores within the GPU, and its Outer domain is
+* "outside the GPU" (i.e. either the Inner or System domain in CPU
+* terms, depending on coherency).
+*/
+   if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
pte |= ARM_LPAE_PTE_SH_IS;
else
pte |= ARM_LPAE_PTE_SH_OS;


Actually, it still doesn't work on s922x :-/. For it to work
correctly, I need to drop the outer shareable flag here.


The logic here does seem a bit odd. Originally it was:

IOMMU_CACHE -> Inner shared (value 3)
!IOMMU_CACHE -> Outer shared (value 2)

For Mali we're forcing everything to the second option. But Mali being 
Mali doesn't do things the same as LPAE, so for Mali we have:


0 - not shared
1 - reserved
2 - inner(*) and outer shareable
3 - inner shareable only

(*) where "inner" means internal to the GPU, and "outer" means shared 
with the CPU "inner". Very confusing!


So originally we had:
IOMMU_CACHE -> not shared with CPU (only internally in the GPU)
!IOMMU_CACHE -> shared with CPU

The change above gets us to "always shared", dropping the SH_OS bit 
would get us to not even shareable between cores (which doesn't sound 
like what we want).


It's not at all clear to me why the change helps, but I suspect we want 
at least "inner" shareable.


Steve


@@ -1049,6 +1055,9 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, 
void *cookie)
cfg->arm_mali_lpae_cfg.transtab = virt_to_phys(data->pgd) |
  ARM_MALI_LPAE_TTBR_READ_INNER |
  ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
+   if (cfg->coherent_walk)
+   cfg->arm_mali_lpae_cfg.transtab |= 
ARM_MALI_LPAE_TTBR_SHARE_OUTER;
+
	return &data->iop;
  
  out_free_data:






Re: [PATCH v2 1/3] iommu/io-pgtable-arm: Support coherency for Mali LPAE

2020-10-05 Thread Boris Brezillon
On Tue, 22 Sep 2020 15:16:48 +0100
Robin Murphy  wrote:

> Midgard GPUs have ACE-Lite master interfaces which allows systems to
> integrate them in an I/O-coherent manner. It seems that from the GPU's
> viewpoint, the rest of the system is its outer shareable domain, and so
> even when snoop signals are wired up, they are only emitted for outer
> shareable accesses. As such, setting the TTBR_SHARE_OUTER bit does
> indeed get coherent pagetable walks working nicely for the coherent
> T620 in the Arm Juno SoC.
> 
> Reviewed-by: Steven Price 
> Tested-by: Neil Armstrong 
> Signed-off-by: Robin Murphy 
> ---
>  drivers/iommu/io-pgtable-arm.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> index dc7bcf858b6d..b4072a18e45d 100644
> --- a/drivers/iommu/io-pgtable-arm.c
> +++ b/drivers/iommu/io-pgtable-arm.c
> @@ -440,7 +440,13 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct 
> arm_lpae_io_pgtable *data,
>   << ARM_LPAE_PTE_ATTRINDX_SHIFT);
>   }
>  
> - if (prot & IOMMU_CACHE)
> + /*
> +  * Also Mali has its own notions of shareability wherein its Inner
> +  * domain covers the cores within the GPU, and its Outer domain is
> +  * "outside the GPU" (i.e. either the Inner or System domain in CPU
> +  * terms, depending on coherency).
> +  */
> + if (prot & IOMMU_CACHE && data->iop.fmt != ARM_MALI_LPAE)
>   pte |= ARM_LPAE_PTE_SH_IS;
>   else
>   pte |= ARM_LPAE_PTE_SH_OS;

Actually, it still doesn't work on s922x :-/. For it to work
correctly, I need to drop the outer shareable flag here.

> @@ -1049,6 +1055,9 @@ arm_mali_lpae_alloc_pgtable(struct io_pgtable_cfg *cfg, 
> void *cookie)
>   cfg->arm_mali_lpae_cfg.transtab = virt_to_phys(data->pgd) |
> ARM_MALI_LPAE_TTBR_READ_INNER |
> ARM_MALI_LPAE_TTBR_ADRMODE_TABLE;
> + if (cfg->coherent_walk)
> + cfg->arm_mali_lpae_cfg.transtab |= 
> ARM_MALI_LPAE_TTBR_SHARE_OUTER;
> +
>   return &data->iop;
>  
>  out_free_data:



[PATCH AUTOSEL 5.4 4/4] iommu/vt-d: Fix lockdep splat in iommu_flush_dev_iotlb()

2020-10-05 Thread Sasha Levin
From: Lu Baolu 

[ Upstream commit 1a3f2fd7fc4e8f24510830e265de2ffb8e3300d2 ]

Lock(>lock) without disabling irq causes lockdep warnings.

[   12.703950] 
[   12.703962] WARNING: possible irq lock inversion dependency detected
[   12.703975] 5.9.0-rc6+ #659 Not tainted
[   12.703983] 
[   12.703995] systemd-udevd/284 just changed the state of lock:
[   12.704007] bd6ff4d8 (device_domain_lock){..-.}-{2:2}, at:
   iommu_flush_dev_iotlb.part.57+0x2e/0x90
[   12.704031] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   12.704043]  (>lock){+.+.}-{2:2}
[   12.704045]

   and interrupts could create inverse lock ordering between
   them.

[   12.704073]
   other info that might help us debug this:
[   12.704085]  Possible interrupt unsafe locking scenario:

[   12.704097]CPU0CPU1
[   12.704106]
[   12.704115]   lock(>lock);
[   12.704123]local_irq_disable();
[   12.704134]lock(device_domain_lock);
[   12.704146]lock(>lock);
[   12.704158]   
[   12.704164] lock(device_domain_lock);
[   12.704174]
*** DEADLOCK ***

Signed-off-by: Lu Baolu 
Link: https://lore.kernel.org/r/20200927062428.13713-1-baolu...@linux.intel.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/intel-iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index 2ffec65df3889..1147626f0d253 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -2560,14 +2560,14 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
}
 
/* Setup the PASID entry for requests without PASID: */
-   spin_lock(&iommu->lock);
+   spin_lock_irqsave(&iommu->lock, flags);
if (hw_pass_through && domain_type_is_si(domain))
ret = intel_pasid_setup_pass_through(iommu, domain,
dev, PASID_RID2PASID);
else
ret = intel_pasid_setup_second_level(iommu, domain,
dev, PASID_RID2PASID);
-   spin_unlock(&iommu->lock);
+   spin_unlock_irqrestore(&iommu->lock, flags);
if (ret) {
dev_err(dev, "Setup RID2PASID failed\n");
dmar_remove_one_dev_info(dev);
-- 
2.25.1



[PATCH AUTOSEL 5.8 12/12] iommu/vt-d: Fix lockdep splat in iommu_flush_dev_iotlb()

2020-10-05 Thread Sasha Levin
From: Lu Baolu 

[ Upstream commit 1a3f2fd7fc4e8f24510830e265de2ffb8e3300d2 ]

Lock(>lock) without disabling irq causes lockdep warnings.

[   12.703950] 
[   12.703962] WARNING: possible irq lock inversion dependency detected
[   12.703975] 5.9.0-rc6+ #659 Not tainted
[   12.703983] 
[   12.703995] systemd-udevd/284 just changed the state of lock:
[   12.704007] bd6ff4d8 (device_domain_lock){..-.}-{2:2}, at:
   iommu_flush_dev_iotlb.part.57+0x2e/0x90
[   12.704031] but this lock took another, SOFTIRQ-unsafe lock in the past:
[   12.704043]  (>lock){+.+.}-{2:2}
[   12.704045]

   and interrupts could create inverse lock ordering between
   them.

[   12.704073]
   other info that might help us debug this:
[   12.704085]  Possible interrupt unsafe locking scenario:

[   12.704097]CPU0CPU1
[   12.704106]
[   12.704115]   lock(>lock);
[   12.704123]local_irq_disable();
[   12.704134]lock(device_domain_lock);
[   12.704146]lock(>lock);
[   12.704158]   
[   12.704164] lock(device_domain_lock);
[   12.704174]
*** DEADLOCK ***

Signed-off-by: Lu Baolu 
Link: https://lore.kernel.org/r/20200927062428.13713-1-baolu...@linux.intel.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Sasha Levin 
---
 drivers/iommu/intel/iommu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index fbe0b0cc56edf..24a84d294fd01 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2617,7 +2617,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
}
 
/* Setup the PASID entry for requests without PASID: */
-   spin_lock(&iommu->lock);
+   spin_lock_irqsave(&iommu->lock, flags);
if (hw_pass_through && domain_type_is_si(domain))
ret = intel_pasid_setup_pass_through(iommu, domain,
dev, PASID_RID2PASID);
@@ -2627,7 +2627,7 @@ static struct dmar_domain 
*dmar_insert_one_dev_info(struct intel_iommu *iommu,
else
ret = intel_pasid_setup_second_level(iommu, domain,
dev, PASID_RID2PASID);
-   spin_unlock(&iommu->lock);
+   spin_unlock_irqrestore(&iommu->lock, flags);
if (ret) {
dev_err(dev, "Setup RID2PASID failed\n");
dmar_remove_one_dev_info(dev);
-- 
2.25.1



Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Mon, Oct 05, 2020 at 04:28:53PM +0300, Dmitry Osipenko wrote:
> 05.10.2020 14:15, Thierry Reding wrote:
> > On Mon, Oct 05, 2020 at 01:36:55PM +0300, Dmitry Osipenko wrote:
> >> 05.10.2020 12:53, Thierry Reding wrote:
> >>> On Fri, Oct 02, 2020 at 05:50:08PM +0300, Dmitry Osipenko wrote:
>  02.10.2020 17:22, Dmitry Osipenko wrote:
> >>  static int tegra_smmu_of_xlate(struct device *dev,
> >>   struct of_phandle_args *args)
> >>  {
> >> +  struct platform_device *iommu_pdev = 
> >> of_find_device_by_node(args->np);
> >> +  struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> >>u32 id = args->args[0];
> >>  
> >> +  of_node_put(args->np);
> >> +
> >> +  if (!mc || !mc->smmu)
> >> +  return -EPROBE_DEFER;
> > platform_get_drvdata(NULL) will crash.
> >
> 
>  Actually, platform_get_drvdata(NULL) can't happen. I overlooked this.
> >>>
> >>> How so? It's technically possible for the iommus property to reference a
> >>> device tree node for which no platform device will ever be created, in
> >>> which case of_find_device_by_node() will return NULL. That's very
> >>> unlikely and perhaps worth just crashing on to make sure it gets fixed
> >>> immediately.
> >>
> >> The tegra_smmu_ops are registered from the SMMU driver itself and MC
> >> driver sets platform data before SMMU is initialized, hence device is
> >> guaranteed to exist and mc can't be NULL.
> > 
> > Yes, but that assumes that args->np points to the memory controller's
> > device tree node. It's obviously a mistake to do this, but I don't think
> > anyone will prevent you from doing this:
> > 
> > iommus = <&{/chosen} 0>;
> > 
> > In that case, since no platform device is created for the /chosen node,
> > iommu_pdev will end up being NULL and platform_get_drvdata() will crash.
> 
> But then Tegra SMMU isn't associated with the device's IOMMU path, and
> thus, tegra_smmu_of_xlate() won't be invoked for this device.

Indeed, you're right! It used to be that ops were assigned to the bus
without any knowledge about the specific instances that might exist, but
nowadays there's struct iommu_device which properly encapsulates all of
that, so yeah, I don't think this can ever be NULL.

Although that makes me wonder why we aren't going one step further and
pass struct iommu_device * into ->of_xlate(), which would avoid the need
for us to do the look up once more.

Thierry


https://lists.linuxfoundation.org/mailman/listinfo/iommu

Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Dmitry Osipenko
05.10.2020 14:15, Thierry Reding wrote:
> On Mon, Oct 05, 2020 at 01:36:55PM +0300, Dmitry Osipenko wrote:
>> 05.10.2020 12:53, Thierry Reding wrote:
>>> On Fri, Oct 02, 2020 at 05:50:08PM +0300, Dmitry Osipenko wrote:
 02.10.2020 17:22, Dmitry Osipenko wrote:
>>  static int tegra_smmu_of_xlate(struct device *dev,
>> struct of_phandle_args *args)
>>  {
>> +struct platform_device *iommu_pdev = 
>> of_find_device_by_node(args->np);
>> +struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
>>  u32 id = args->args[0];
>>  
>> +of_node_put(args->np);
>> +
>> +if (!mc || !mc->smmu)
>> +return -EPROBE_DEFER;
> platform_get_drvdata(NULL) will crash.
>

 Actually, platform_get_drvdata(NULL) can't happen. I overlooked this.
>>>
>>> How so? It's technically possible for the iommus property to reference a
>>> device tree node for which no platform device will ever be created, in
>>> which case of_find_device_by_node() will return NULL. That's very
>>> unlikely and perhaps worth just crashing on to make sure it gets fixed
>>> immediately.
>>
>> The tegra_smmu_ops are registered from the SMMU driver itself and MC
>> driver sets platform data before SMMU is initialized, hence device is
>> guaranteed to exist and mc can't be NULL.
> 
> Yes, but that assumes that args->np points to the memory controller's
> device tree node. It's obviously a mistake to do this, but I don't think
> anyone will prevent you from doing this:
> 
>   iommus = <&{/chosen} 0>;
> 
> In that case, since no platform device is created for the /chosen node,
> iommu_pdev will end up being NULL and platform_get_drvdata() will crash.

But then Tegra SMMU isn't associated with the device's IOMMU path, and
thus, tegra_smmu_of_xlate() won't be invoked for this device.

> That said, I'm fine with not adding a check for that. If anyone really
> does end up messing this up they deserve the crash.
> 
> I'm still a bit undecided about the mc->smmu check because I haven't
> convinced myself yet that it can't happen.

For now I can't see any realistic situation where mc->smmu could be NULL.

Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-10-05 Thread Christoph Hellwig
On Mon, Oct 05, 2020 at 11:44:10AM +0100, Lorenzo Pieralisi wrote:
> > I see that there are both OF and ACPI hooks in pci_dma_configure() and
> > both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> > but I'm not convinced this even does what it's intended to do.  The
> > driver core calls this via the bus->dma_configure callback before
> > probing a driver, but then what happens when the driver calls
> > pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
> > don't see anywhere that would take into account the existing
> > dev->dma_mask.  It seems for example that pci_dma_configure() could
> > produce a 42 bit mask as we have here, then the driver could override
> > that with anything that the dma_ops.dma_supported() callback finds
> > acceptable, and I don't see any instances where the current
> > dev->dma_mask is considered.  Am I overlooking something? 
> 
> I don't think so but Christoph and Robin can provide more input on
> this - it is a long story.
> 
> ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
> this does not prevent a driver from overriding the dev->dma_mask but DMA
> mapping code still takes into account the dev->bus_dma_limit.
> 
> This may help:
> 
> git log -p 03bfdc31176c

This is at best a historic artefact.  Bus drivers have no business
messing with the DMA mask, dev->bus_dma_limit is the way to communicate
addressing limits on the bus (or another interconnect closer to the CPU).
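
In other words, the DMA mapping code combines the two values roughly like
this (illustrative helper only, not an existing kernel API), so a driver
overriding dev->dma_mask still cannot address past what the bus allows:

	static u64 example_effective_dma_limit(struct device *dev)
	{
		/* assumes the driver has already set dev->dma_mask */
		return min_not_zero(*dev->dma_mask, dev->bus_dma_limit);
	}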


Re: [PATCH] cma: decrease CMA_ALIGNMENT lower limit to 2

2020-10-05 Thread Christoph Hellwig
Thanks,

applied to dma-mapping for-next.


Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Mon, Oct 05, 2020 at 01:36:55PM +0300, Dmitry Osipenko wrote:
> 05.10.2020 12:53, Thierry Reding wrote:
> > On Fri, Oct 02, 2020 at 05:50:08PM +0300, Dmitry Osipenko wrote:
> >> 02.10.2020 17:22, Dmitry Osipenko wrote:
>   static int tegra_smmu_of_xlate(struct device *dev,
>  struct of_phandle_args *args)
>   {
>  +struct platform_device *iommu_pdev = 
>  of_find_device_by_node(args->np);
>  +struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
>   u32 id = args->args[0];
>   
>  +of_node_put(args->np);
>  +
>  +if (!mc || !mc->smmu)
>  +return -EPROBE_DEFER;
> >>> platform_get_drvdata(NULL) will crash.
> >>>
> >>
> >> Actually, platform_get_drvdata(NULL) can't happen. I overlooked this.
> > 
> > How so? It's technically possible for the iommus property to reference a
> > device tree node for which no platform device will ever be created, in
> > which case of_find_device_by_node() will return NULL. That's very
> > unlikely and perhaps worth just crashing on to make sure it gets fixed
> > immediately.
> 
> The tegra_smmu_ops are registered from the SMMU driver itself and MC
> driver sets platform data before SMMU is initialized, hence device is
> guaranteed to exist and mc can't be NULL.

Yes, but that assumes that args->np points to the memory controller's
device tree node. It's obviously a mistake to do this, but I don't think
anyone will prevent you from doing this:

iommus = <&{/chosen} 0>;

In that case, since no platform device is created for the /chosen node,
iommu_pdev will end up being NULL and platform_get_drvdata() will crash.

That said, I'm fine with not adding a check for that. If anyone really
does end up messing this up they deserve the crash.

I'm still a bit undecided about the mc->smmu check because I haven't
convinced myself yet that it can't happen.

Thierry



Re: [PATCH v5 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Mon, Oct 05, 2020 at 11:41:08AM +0300, Dmitry Osipenko wrote:
> 05.10.2020 00:57, Nicolin Chen wrote:
> > On Sat, Oct 03, 2020 at 05:06:42PM +0300, Dmitry Osipenko wrote:
> >> 03.10.2020 09:59, Nicolin Chen wrote:
> >>>  static int tegra_smmu_of_xlate(struct device *dev,
> >>>  struct of_phandle_args *args)
> >>>  {
> >>> + struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> >>> + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> >>>   u32 id = args->args[0];
> >>>  
> >>> + put_device(&iommu_pdev->dev);
> >>> +
> >>> + if (!mc || !mc->smmu)
> >>> + return -EPROBE_DEFER;
> >>
> >> I'm not very excited by seeing code in the patches that can't be
> >> explained by the patch authors and will appreciate if you could provide
> >> a detailed explanation about why this NULL checking is needed because I
> >> think it is unneeded, especially given that other IOMMU drivers don't
> >> have such check.
> > 
> > This function could be called from of_iommu_configure(), which is
> > a part of other driver's probe(). So I think it's safer to have a
> > check. Yet, given mc driver is added to the "arch_initcall" stage,
> > you are probably right that there's no really need at this moment
> > because all clients should be called after mc/smmu are inited. So
> > I'll resend a v6, if that makes you happy.
> 
> I wanted to get the explanation :) I'm very curious why it's actually
> needed because I'm not 100% sure whether it's not needed at all.
> 
> I'd assume that the only possible problem could be if some device is
> created in parallel with the MC probing and there is no locking that
> could prevent this in the drivers core. It's not apparent to me whether
> this situation could happen at all in practice.
> 
> The MC is created early and at that time everything is sequential, so
> it's indeed should be safe to remove the check.

I think I now remember exactly why the "hack" in tegra_smmu_probe()
exists. The reason is that the MC driver does this:

mc->smmu = tegra_smmu_probe(...);

That means that mc->smmu is going to be NULL until tegra_smmu_probe()
has finished. But tegra_smmu_probe() calls bus_set_iommu() and that in
turn calls ->probe_device(). So the purpose of the "hack" in the
tegra_smmu_probe() function was to make sure mc->smmu was available at
that point, because, well, it is already known, but we haven't gotten
around to storing it yet.

->of_xlate() can theoretically be called as early as right after
bus_set_iommu() via of_iommu_configure() if that is called in parallel
with tegra_smmu_probe(). I think that's very unlikely, but I'm not 100%
sure that it can't happen.
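
To spell out that ordering, here is a simplified sketch of the MC probe path
(error handling and the rest of the real function omitted, so treat it as
illustrative rather than the actual driver code):

	static int example_mc_probe(struct platform_device *pdev)
	{
		struct tegra_mc *mc;

		mc = devm_kzalloc(&pdev->dev, sizeof(*mc), GFP_KERNEL);
		if (!mc)
			return -ENOMEM;

		platform_set_drvdata(pdev, mc);
		/* ... remaining MC setup ... */

		/*
		 * tegra_smmu_probe() calls bus_set_iommu(), which can already
		 * trigger ->probe_device()/->of_xlate() for client devices,
		 * so those callbacks may run before the assignment below has
		 * completed and observe mc->smmu == NULL.
		 */
		mc->smmu = tegra_smmu_probe(&pdev->dev, mc->soc->smmu, mc);

		return PTR_ERR_OR_ZERO(mc->smmu);
	}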

In any case, I do agree with Dmitry that we should have a comment here
explaining why this is necessary. Even if we're completely certain that
this is necessary, it's not obvious and therefore should get that
comment. And if we're not certain that it's necessary, it's probably
also good to mention that in the comment so that eventually it can be
determined or the check removed if it proves to be unnecessary.

Thierry



Re: [RFC 0/3] iommu: Reserved regions for IOVAs beyond dma_mask and iommu aperture

2020-10-05 Thread Lorenzo Pieralisi
[+Christoph]

On Tue, Sep 29, 2020 at 12:18:49PM -0600, Alex Williamson wrote:
> On Tue, 29 Sep 2020 09:18:22 +0200
> Auger Eric  wrote:
> 
> > Hi all,
> > 
> > [also correcting some outdated email addresses + adding Lorenzo in cc]
> > 
> > On 9/29/20 12:42 AM, Alex Williamson wrote:
> > > On Mon, 28 Sep 2020 21:50:34 +0200
> > > Eric Auger  wrote:
> > >   
> > >> VFIO currently exposes the usable IOVA regions through the
> > >> VFIO_IOMMU_GET_INFO ioctl / VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> > >> capability. However it fails to take into account the dma_mask
> > >> of the devices within the container. The top limit currently is
> > >> defined by the iommu aperture.  
> > > 
> > > I think that dma_mask is traditionally a DMA API interface for a device
> > > driver to indicate to the DMA layer which mappings are accessible to the
> > > device.  On the other hand, vfio makes use of the IOMMU API where the
> > > driver is in userspace.  That userspace driver has full control of the
> > > IOVA range of the device, therefore dma_mask is mostly irrelevant to
> > > vfio.  I think the issue you're trying to tackle is that the IORT code
> > > is making use of the dma_mask to try to describe a DMA address
> > > limitation imposed by the PCI root bus, living between the endpoint
> > > device and the IOMMU.  Therefore, if the IORT code is exposing a
> > > topology or system imposed device limitation, this seems much more akin
> > > to something like an MSI reserved range, where it's not necessarily the
> > > device or the IOMMU with the limitation, but something that sits
> > > between them.  
> > 
> > First I think I failed to explain the context. I worked on NVMe
> > passthrough on ARM. The QEMU NVMe backend uses VFIO to program the
> > physical device. The IOVA allocator there currently uses an IOVA range
> > within [0x1, 1ULL << 39]. This IOVA layout rather is arbitrary if I
> > understand correctly.
> 
> 39 bits is the minimum available on some VT-d systems, so it was
> probably considered a reasonable minimum address width to consider.
> 
> > I noticed we rapidly get some VFIO MAP DMA
> > failures because the allocated IOVA collide with the ARM MSI reserved
> > IOVA window [0x800, 0x810]. Since  9b77e5c79840 ("vfio/type1:
> > Check reserved region conflict and update iova list"), such VFIO MAP DMA
> > attempts to map IOVAs belonging to host reserved IOVA windows fail. So,
> > by using the VFIO_IOMMU_GET_INFO ioctl /
> > VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE I can change the IOVA allocator to
> > avoid allocating within this range and others. While working on this, I
> > tried to automatically compute the min/max IOVAs and change the
> > arbitrary [0x1, 1ULL << 39]. My SMMUv2 supports up to 48b so
> > naturally the max IOVA was computed as 1ULL << 48. The QEMU NVMe backend
> > allocates at the bottom and at the top of the range. I noticed the use
> > case was not working as soon as the top IOVA was more than 1ULL << 42.
> > And then we noticed the dma_mask was set to 42 by using
> > cat  /sys/bus/pci/devices/0005:01:00.0/dma_mask_bits. So my
> > interpretation is the dma_mask was somehow containing the info the
> > device couldn't handle IOVAs beyond a certain limit.
> 
> I see that there are both OF and ACPI hooks in pci_dma_configure() and
> both modify dev->dma_mask, which is what pci-sysfs is exposing here,
> but I'm not convinced this even does what it's intended to do.  The
> driver core calls this via the bus->dma_configure callback before
> probing a driver, but then what happens when the driver calls
> pci_set_dma_mask()?  This is just a wrapper for dma_set_mask() and I
> don't see anywhere that would take into account the existing
> dev->dma_mask.  It seems for example that pci_dma_configure() could
> produce a 42 bit mask as we have here, then the driver could override
> that with anything that the dma_ops.dma_supported() callback finds
> acceptable, and I don't see any instances where the current
> dev->dma_mask is considered.  Am I overlooking something? 

I don't think so but Christoph and Robin can provide more input on
this - it is a long story.

ACPI and OF bindings set a default dma_mask (and dev->bus_dma_limit),
this does not prevent a driver from overriding the dev->dma_mask but DMA
mapping code still takes into account the dev->bus_dma_limit.

This may help:

git log -p 03bfdc31176c

Thanks,
Lorenzo


Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Dmitry Osipenko
05.10.2020 12:53, Thierry Reding wrote:
> On Fri, Oct 02, 2020 at 05:50:08PM +0300, Dmitry Osipenko wrote:
>> 02.10.2020 17:22, Dmitry Osipenko wrote:
  static int tegra_smmu_of_xlate(struct device *dev,
   struct of_phandle_args *args)
  {
 +  struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
 +  struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
u32 id = args->args[0];
  
 +  of_node_put(args->np);
 +
 +  if (!mc || !mc->smmu)
 +  return -EPROBE_DEFER;
>>> platform_get_drvdata(NULL) will crash.
>>>
>>
>> Actually, platform_get_drvdata(NULL) can't happen. I overlooked this.
> 
> How so? It's technically possible for the iommus property to reference a
> device tree node for which no platform device will ever be created, in
> which case of_find_device_by_node() will return NULL. That's very
> unlikely and perhaps worth just crashing on to make sure it gets fixed
> immediately.

The tegra_smmu_ops are registered from the SMMU driver itself and MC
driver sets platform data before SMMU is initialized, hence device is
guaranteed to exist and mc can't be NULL.

Re: [PATCH v3 14/14] iommu/amd: Adopt IO page table framework

2020-10-05 Thread Jonathan Cameron
On Sun, 4 Oct 2020 01:45:49 +
Suravee Suthikulpanit  wrote:

> Switch to using IO page table framework for AMD IOMMU v1 page table.
> 
> Signed-off-by: Suravee Suthikulpanit 

One minor thing inline.


> ---
>  drivers/iommu/amd/iommu.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
> index 77f44b927ae7..6f8316206fb8 100644
> --- a/drivers/iommu/amd/iommu.c
> +++ b/drivers/iommu/amd/iommu.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1573,6 +1574,22 @@ static int pdev_iommuv2_enable(struct pci_dev *pdev)
>   return ret;
>  }
>  
> +struct io_pgtable_ops *
> +amd_iommu_setup_io_pgtable_ops(struct iommu_dev_data *dev_data,
> +struct protection_domain *domain)
> +{
> + struct amd_iommu *iommu = amd_iommu_rlookup_table[dev_data->devid];
> +
> + domain->iop.pgtbl_cfg = (struct io_pgtable_cfg) {
> + .pgsize_bitmap  = AMD_IOMMU_PGSIZES,
> + .ias= IOMMU_IN_ADDR_BIT_SIZE,
> + .oas= IOMMU_OUT_ADDR_BIT_SIZE,
> + .iommu_dev  = &iommu->dev->dev,
> + };
> +
> + return alloc_io_pgtable_ops(AMD_IOMMU_V1, &domain->iop.pgtbl_cfg, 
> domain);
> +}
> +
>  /*
>   * If a device is not yet associated with a domain, this function makes the
>   * device visible in the domain
> @@ -1580,6 +1597,7 @@ static int pdev_iommuv2_enable(struct pci_dev *pdev)
>  static int attach_device(struct device *dev,
>struct protection_domain *domain)
>  {
> + struct io_pgtable_ops *pgtbl_ops;
>   struct iommu_dev_data *dev_data;
>   struct pci_dev *pdev;
>   unsigned long flags;
> @@ -1623,6 +1641,12 @@ static int attach_device(struct device *dev,
>  skip_ats_check:
>   ret = 0;
>  
> + pgtbl_ops = amd_iommu_setup_io_pgtable_ops(dev_data, domain);
> + if (!pgtbl_ops) {

Perhaps cleaner to not store in a local variable if you aren't going to use it?

if (!amd_iommu_setup_io_pgtable_ops(dev_data, domain)) {

> + ret = -ENOMEM;
> + goto out;
> + }
> +
>   do_attach(dev_data, domain);
>  
>   /*
> @@ -1958,6 +1982,8 @@ static void amd_iommu_domain_free(struct iommu_domain 
> *dom)
>   if (domain->dev_cnt > 0)
>   cleanup_domain(domain);
>  
> + free_io_pgtable_ops(&domain->iop.iop.ops);
> +
>   BUG_ON(domain->dev_cnt != 0);
>  
>   if (!dom)




Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Thierry Reding
On Mon, Oct 05, 2020 at 12:38:20PM +0300, Dmitry Osipenko wrote:
> 05.10.2020 12:16, Thierry Reding wrote:
> ...
> >> I think you meant regmap in regards to protecting IO accesses, but this
> >> is not what regmap is about if IO accesses are atomic by nature.
> > 
> > Then why is there regmap-mmio?
> 
> To protect programming sequences for example, actually that's the only
> real use-case I saw in kernel drivers once. In our case there are no
> sequences that require protection, at least I'm not aware about that.

True. But I'd still prefer to have a more formal mechanism of handing
out access to registers.

Either way, this isn't very relevant in the case of tegra20-devfreq
because there's really no reason for it to be a separate driver with
device tree lookup since it's all internal to the MC driver. The only
reason (unless I'm missing something) for that is to have the code
located in drivers/devfreq. We can do that without requiring DT lookup
either like we did for tegra-smmu/tegra-mc, or by directly copying the
code into drivers/memory.

It's become fairly common in recent years to group code by IP rather
than functionality. We nowadays have good tools to help with subsystem
wide refactoring that make it much less necessary for subsystem code to
be concentrated in a single directory.

> >> Secondly, you're missing that tegra30-devfreq driver will also need to
> >> perform the MC lookup [3] and then driver will no longer work for the
> >> older DTs if phandle becomes mandatory because these DTs do not have the
> >> MC phandle in the ACTMON node.
> >>
> >> [3]
> >> https://github.com/grate-driver/linux/commit/441d19281f9b3428a4db1ecb3a02e1ec02a8ad7f
> >>
> >> So we will need the fall back for T30/124 as well.
> > 
> > No, we don't need the fallback because this is new functionality which
> > can and should be gated on the existence of the new phandle. If there's
> > no phandle then we have no choice but to use the old code to ensure old
> > behaviour.
> 
> You just repeated what I was trying to say:)
> 
> Perhaps I spent a bit too much time touching that code, to the point that I
> lost the feeling that there is a need to clarify everything in detail.

I assumed that by "fall back" you meant the lookup-by-compatible. But
what I was talking about is the fall back to the current code which does
not use the MC device tree node at all. The latter is going to be
required to ensure that the code continues to work as-is. But the former
is not required because we have fall back code that already works with
old device trees. New code that is using the memory controller's timings
nodes can be gated on the existence of the phandle in DT and doing so is
not going to break backwards-compatibility.

But perhaps I was misunderstanding what you were trying to say.

Thierry



Re: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

2020-10-05 Thread Steven Price

On 05/10/2020 09:39, Boris Brezillon wrote:

On Mon, 5 Oct 2020 09:34:06 +0100
Steven Price  wrote:


On 05/10/2020 09:15, Boris Brezillon wrote:

Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200
Neil Armstrong  wrote:
   

Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

According to a downstream commit I found in the Khadas vendor kernel,
the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows
how to handle this properly) we should describe it as such. Otherwise
the mismatch leads to all manner of fun with mismatched attributes and
inadvertently snooping stale data from caches, which would account for
at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy 
---
   arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 
   1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi 
b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
index 9b8548e5f6e5..ee8fcae9f9f0 100644
--- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
+++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
@@ -135,3 +135,7 @@ map1 {
};
};
   };
+
+&mali {
+   dma-coherent;
+};
  


Thanks a lot for digging, I'll run a test to confirm it fixes the issue !


Sorry for the late reply. I triggered a dEQP run with this patch applied
and I see a bunch of "panfrost ffe4.gpu: matching BO is not heap type"
errors (see below for a full backtrace). That doesn't seem to happen when
we drop this dma-coherent property.

[  690.945731] [ cut here ]
[  690.950003] panfrost ffe4.gpu: matching BO is not heap type (GPU VA = 
319a000)
[  690.950051] WARNING: CPU: 0 PID: 120 at 
drivers/gpu/drm/panfrost/panfrost_mmu.c:465 
panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.968854] Modules linked in:
[  690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: GW
 5.9.0-rc5-02434-g7d8109ec5a42 #784
[  690.981964] Hardware name: Khadas VIM3 (DT)
[  690.986107] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
[  690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.002836] sp : 800011bcbcd0
[  691.006114] x29: 800011bcbcf0 x28: f3fe3800
[  691.011375] x27: ceaf5350 x26: ca5fc500
[  691.016636] x25: f32409c0 x24: 0001
[  691.021897] x23: f3240880 x22: f3e3a800
[  691.027159] x21:  x20: 
[  691.032420] x19: 00010001 x18: 0020
[  691.037681] x17:  x16: 
[  691.042942] x15: f3fe3c70 x14: 
[  691.048204] x13: 8000116c2428 x12: 8000116c2086
[  691.053466] x11: 800011bcbcd0 x10: 800011bcbcd0
[  691.058727] x9 : fffe x8 : 
[  691.063988] x7 : 7420706165682074 x6 : 8000116c1816
[  691.069249] x5 :  x4 : 
[  691.074510] x3 :  x2 : 8000e348c000
[  691.079771] x1 : f1b91ff9af2df000 x0 : 
[  691.085033] Call trace:
[  691.087452]  panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.092712]  irq_thread_fn+0x2c/0xa0
[  691.096246]  irq_thread+0x184/0x208
[  691.099699]  kthread+0x140/0x160
[  691.102890]  ret_from_fork+0x10/0x34
[  691.106424] ---[ end trace b5dd8c2dfada8236 ]---
   


It's quite possible this is caused by the GPU seeing a stale page table
entry, so perhaps coherency isn't working as well as it should...

Do you get an "Unhandled Page fault" message after this?


Yep (see below).

--->8---


[...]

[  689.805864] panfrost ffe4.gpu: Unhandled Page fault in AS0 at VA 
0x03146080
[  689.805864] Reason: TODO
[  689.805864] raw fault status: 0x10003C3
[  689.805864] decoded fault status: SLAVE FAULT
[  689.805864] exception type 0xC3: TRANSLATION_FAULT_LEVEL3
[  689.805864] access type 0x3: WRITE
[  689.805864] source id 0x100
[  690.170419] panfrost ffe4.gpu: gpu sched timeout, js=1, config=0x7300, 
status=0x8, head=0x3101100, tail=0x3101100, sched_job=4b442768
[  690.770373] panfrost ffe4.gpu: error powering up gpu shader
[  690.945123] panfrost ffe4.gpu: error powering up gpu shader
[  690.945731] [ cut here ]



That's a write fault from level 3 of the page table triggered by shader
core 0 in a fragment job. So could be writing out the frame buffer.

It would be interesting to see if a patch like below would work round
it:

8<
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c 
b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index e8f7b11352d2..5144860afdea 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -460,9 +460,12 @@ static int panfrost_mmu_map_fault_addr(struct 
panfrost_device *pfdev, int as,
 
 	bo = bomapping->obj;

if (!bo->is_heap) {
-   dev_WARN(pfdev->dev, "matching BO is not heap type (GPU VA = 
%llx)",
+

Re: [PATCH v4 3/3] iommu/tegra-smmu: Add PCI support

2020-10-05 Thread Thierry Reding
On Thu, Oct 01, 2020 at 11:08:07PM -0700, Nicolin Chen wrote:
> This patch simply adds support for PCI devices.
> 
> Signed-off-by: Nicolin Chen 
> ---
> 
> Changelog
> v3->v4
>  * Dropped !iommu_present() check
>  * Added CONFIG_PCI check in the exit path
> v2->v3
>  * Replaced ternary conditional operator with if-else in .device_group()
>  * Dropped change in tegra_smmu_remove()
> v1->v2
>  * Added error-out labels in tegra_smmu_probe()
>  * Dropped pci_request_acs() since IOMMU core would call it.
> 
>  drivers/iommu/tegra-smmu.c | 37 +++--
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
> index 02d02b0c55c4..b701a7b55e84 100644
> --- a/drivers/iommu/tegra-smmu.c
> +++ b/drivers/iommu/tegra-smmu.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -865,7 +866,11 @@ static struct iommu_group 
> *tegra_smmu_device_group(struct device *dev)
>   group->smmu = smmu;
>   group->soc = soc;
>  
> - group->group = iommu_group_alloc();
> + if (dev_is_pci(dev))
> + group->group = pci_device_group(dev);
> + else
> + group->group = generic_device_group(dev);
> +
>   if (IS_ERR(group->group)) {
>   devm_kfree(smmu->dev, group);
>   mutex_unlock(&smmu->lock);
> @@ -1069,22 +1074,32 @@ struct tegra_smmu *tegra_smmu_probe(struct device 
> *dev,
>   iommu_device_set_fwnode(>iommu, dev->fwnode);
>  
>   err = iommu_device_register(&smmu->iommu);
> - if (err) {
> - iommu_device_sysfs_remove(&smmu->iommu);
> - return ERR_PTR(err);
> - }
> + if (err)
> + goto err_sysfs;
>  
>   err = bus_set_iommu(&platform_bus_type, &tegra_smmu_ops);
> - if (err < 0) {
> - iommu_device_unregister(&smmu->iommu);
> - iommu_device_sysfs_remove(&smmu->iommu);
> - return ERR_PTR(err);
> - }
> + if (err < 0)
> + goto err_unregister;
> +
> +#ifdef CONFIG_PCI
> + err = bus_set_iommu(&pci_bus_type, &tegra_smmu_ops);
> + if (err < 0)
> + goto err_bus_set;
> +#endif
>  
>   if (IS_ENABLED(CONFIG_DEBUG_FS))
>   tegra_smmu_debugfs_init(smmu);
>  
>   return smmu;
> +
> +err_bus_set: __maybe_unused;
> + bus_set_iommu(&platform_bus_type, NULL);
> +err_unregister:
> + iommu_device_unregister(&smmu->iommu);
> +err_sysfs:
> + iommu_device_sysfs_remove(&smmu->iommu);

Can you please switch to label names that are more consistent with the
others in this driver? Notably the ones in tegra_smmu_domain_alloc().
The idea is to describe in the name of the label what's happening at the
label. Something like this, for example:

unset_platform_bus:
	bus_set_iommu(&platform_bus_type, NULL);
unregister:
	iommu_device_unregister(&smmu->iommu);
remove_sysfs:
	iommu_device_sysfs_remove(&smmu->iommu);

Thierry



Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Fri, Oct 02, 2020 at 11:58:29AM -0700, Nicolin Chen wrote:
> On Fri, Oct 02, 2020 at 06:02:18PM +0300, Dmitry Osipenko wrote:
> > 02.10.2020 09:08, Nicolin Chen wrote:
> > >  static int tegra_smmu_of_xlate(struct device *dev,
> > >  struct of_phandle_args *args)
> > >  {
> > > + struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> > > + struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > >   u32 id = args->args[0];
> > >  
> > > + of_node_put(args->np);
> > 
> > of_find_device_by_node() takes device reference and not the np
> > reference. This is a bug, please remove of_node_put().
> 
> Looks like so. Replacing it with put_device(&iommu_pdev->dev);

Putting the put_device() here is wrong, though. You need to make sure
you keep a reference to it as long as you keep accessing the data that
is owned by it.

Like I said earlier, this is a bit weird in this case because we're
self-referencing, so iommu_pdev->dev is going to stay around as long as
the SMMU is. However, it might be worth to properly track the lifetime
anyway just so that the code can serve as a good example of how to do
things.

If you decide to go for the shortcut and not track this reference
properly, then at least you need to add a comment as to why it is safe
to do in this case. This ensures that readers are aware of the
circumstances and don't copy this bad code into a context where the
circumstances are different.
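
For illustration, a minimal sketch of the properly tracked variant (under
the assumption that the reference is kept for as long as the client device
uses mc->smmu and dropped again on release; the swgroup handling of the real
->of_xlate() is left out):

	static int example_of_xlate(struct device *dev,
				    struct of_phandle_args *args)
	{
		struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
		struct tegra_mc *mc;

		if (!iommu_pdev)
			return -ENODEV;

		mc = platform_get_drvdata(iommu_pdev);
		if (!mc || !mc->smmu) {
			/* drop the reference taken by of_find_device_by_node() */
			put_device(&iommu_pdev->dev);
			return -EPROBE_DEFER;
		}

		dev_iommu_priv_set(dev, mc->smmu);

		/*
		 * Keep the reference to iommu_pdev->dev for as long as dev
		 * uses mc->smmu and drop it again in ->release_device().
		 */
		return 0;
	}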

Thierry



Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Fri, Oct 02, 2020 at 05:50:08PM +0300, Dmitry Osipenko wrote:
> 02.10.2020 17:22, Dmitry Osipenko wrote:
> >>  static int tegra_smmu_of_xlate(struct device *dev,
> >>   struct of_phandle_args *args)
> >>  {
> >> +  struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> >> +  struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> >>u32 id = args->args[0];
> >>  
> >> +  of_node_put(args->np);
> >> +
> >> +  if (!mc || !mc->smmu)
> >> +  return -EPROBE_DEFER;
> > platform_get_drvdata(NULL) will crash.
> > 
> 
> Actually, platform_get_drvdata(NULL) can't happen. I overlooked this.

How so? It's technically possible for the iommus property to reference a
device tree node for which no platform device will ever be created, in
which case of_find_device_by_node() will return NULL. That's very
unlikely and perhaps worth just crashing on to make sure it gets fixed
immediately.

Thierry



Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Fri, Oct 02, 2020 at 05:22:41PM +0300, Dmitry Osipenko wrote:
> 02.10.2020 09:08, Nicolin Chen wrote:
> >  static int tegra_smmu_of_xlate(struct device *dev,
> >struct of_phandle_args *args)
> >  {
> > +   struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
> > +   struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
> > u32 id = args->args[0];
> >  
> > +   of_node_put(args->np);
> > +
> > +   if (!mc || !mc->smmu)
> > +   return -EPROBE_DEFER;
> 
> platform_get_drvdata(NULL) will crash.
> 
> > +   dev_iommu_priv_set(dev, mc->smmu);
> 
> I think put_device(mc->dev) is missed here, doesn't it?

Yeah, I think we'd need that here, otherwise we'd be leaking a
reference. Worse, even, mc->dev is the same device that owns the SMMU,
so we're basically incrementing our own reference here and never
releasing it. We also need that put_device(mc->dev) in the error case
above because we already hold the reference there.

Thierry



Re: [PATCH v4 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Thierry Reding
On Fri, Oct 02, 2020 at 12:53:28PM -0700, Nicolin Chen wrote:
> On Fri, Oct 02, 2020 at 05:58:29PM +0300, Dmitry Osipenko wrote:
> > 02.10.2020 17:22, Dmitry Osipenko wrote:
> > > 02.10.2020 09:08, Nicolin Chen wrote:
> > >> -static void tegra_smmu_release_device(struct device *dev)
> > >> -{
> > >> -dev_iommu_priv_set(dev, NULL);
> > >> -}
> > >> +static void tegra_smmu_release_device(struct device *dev) {}
> > > 
> > > Please keep the braces as-is.
> > > 
> > 
> > I noticed that you borrowed this style from the sun50i-iommu driver, but
> > this is a bit unusual coding style for the c files. At least to me it's
> > unusual to see header-style function stub in a middle of c file. But
> > maybe it's just me.
> 
> I don't see a rule in ./Documentation/process/coding-style.rst
> against this, and there're plenty of drivers doing so. If you
> feel uncomfortable with this style, you may add a rule to that
> doc so everyone will follow :)

I also prefer braces on separate lines. Even better would be to just
drop this entirely and make ->release_device() optional. At least the
following drivers could be cleaned up that way:

* fsl-pamu
* msm-iommu
* sun50i-iommu
* tegra-gart
* tegra-smmu

And it looks like mtk-iommu and mtk-iommu-v1 do only iommu_fwspec_free()
in their ->release_device() implementations, but that's already done via
iommu_release_device().
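
Making it optional would just be a NULL check in the core before the call,
something along these lines (sketch only, not an actual patch):

	if (ops->release_device)
		ops->release_device(dev);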

Thierry



Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Dmitry Osipenko
05.10.2020 12:16, Thierry Reding wrote:
...
>> I think you meant regmap in regards to protecting IO accesses, but this
>> is not what regmap is about if IO accesses are atomic by nature.
> 
> Then why is there regmap-mmio?

To protect programming sequences, for example; actually, that's the
only real use-case I've seen in kernel drivers. In our case there are
no sequences that require protection, at least none that I'm aware of.

>> Secondly, you're missing that the tegra30-devfreq driver will also need to
>> perform the MC lookup [3] and then the driver will no longer work for the
>> older DTs if phandle becomes mandatory because these DTs do not have the
>> MC phandle in the ACTMON node.
>>
>> [3]
>> https://github.com/grate-driver/linux/commit/441d19281f9b3428a4db1ecb3a02e1ec02a8ad7f
>>
>> So we will need the fall back for T30/124 as well.
> 
> No, we don't need the fallback because this is new functionality which
> can and should be gated on the existence of the new phandle. If there's
> no phandle then we have no choice but to use the old code to ensure old
> behaviour.

You just repeated what I was trying to say :)

Perhaps I spent a bit too much time touching that code, to the point
that I lost the feeling that there was a need to clarify everything in
detail.
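For what it's worth, the gating being talked about here would look
roughly like this (sketch only; the property name and helper are made
up for illustration):

static struct platform_device *example_get_mc(struct device *dev)
{
	struct device_node *np;
	struct platform_device *mc_pdev;

	/* New DTs carry an (assumed) phandle to the memory controller. */
	np = of_parse_phandle(dev->of_node, "nvidia,memory-controller", 0);
	if (!np)
		return NULL; /* old DT: the caller keeps the legacy code path */

	mc_pdev = of_find_device_by_node(np);
	of_node_put(np);

	return mc_pdev;
}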

Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Thierry Reding
On Mon, Oct 05, 2020 at 11:14:27AM +0300, Dmitry Osipenko wrote:
> 05.10.2020 10:13, Thierry Reding пишет:
> ...
> > Have you also seen that sun50i-iommu does look up the SMMU from a
> > phandle using of_find_device_by_node()? So I think you've shown yourself
> > that even "modern" drivers avoid global pointers and look up via
> > phandle.
> 
> I have no problem with the lookup by phandle and I'm all for it. It's
> now apparent to me that you completely missed my point, but that is
> probably my fault for not having conveyed it properly from the start.
> I just wanted to avoid the incompatible DT changes which could break
> older DTs + I simply wanted to improve the older code without
> introducing new features, that's it.
> 
> Anyway, after your comments I started to look at how the interconnect
> patches could be improved and found new things, like that OPPs now
> support ICC and that the EMC has a working EMC_STAT, and I also
> discovered syscon and simple-mfd. This means that we won't need the
> global pointers at all, neither for SMMU, nor for interconnect, nor
> for the EMC drivers :)

Well, evidently discussion on mailing lists actually works. =)

Thierry



Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Thierry Reding
On Thu, Oct 01, 2020 at 10:04:30PM +0300, Dmitry Osipenko wrote:
> ...
> >> There are dozens variants of the panels and system could easily have
> >> more than one panel, hence a direct lookup by phandle is a natural
> >> choice for the panels.
> > 
> > Not really, there's typically only just one panel. But that's just one
> > example. EMC would be another. There's only a single EMC on Tegra and
> > yet for something like interconnects we still reference it by phandle.
> 
> Interconnect is a generic API.
> 
> > PMC is another case and so is CAR, GPIO (at least on early Tegra SoCs)
> > and pinmux, etc.
> > 
> > The example of GPIO shows very well how this is important. If we had
> > made the assumption from the beginning that there was only ever going to
> > be a single GPIO controller, then we would've had a big problem when the
> > first SoC shipped that had multiple GPIO controllers.
> 
> This is true, but only if all these words are applied to the generic APIs.

The reason this is true for generic APIs is that people actually think
about generic APIs a bit more than about custom APIs, because they
need to be more future-proof.

That doesn't make it wrong to think hard about using custom APIs because
we also want those to be somewhat future-proof.

> >> While all Tegra SoCs have a single fixed MC in the system, and thus,
> >> there is no real need to use phandle because we can't mix up MC with
> >> anything else.
> > 
> > The same is true for the SMMU, and yet the iommus property references
> > the SMMU by phandle. There are a *lot* of cases where you could imply
> > dependencies because you have intimate knowledge about the hardware
> > within drivers. But the point is to avoid this wherever possible so
> > that the DTB is as self-describing as possible.
> > 
>  older DTs if DT change will be needed. Please give a detailed 
>  explanation.
> >>>
> >>> New functionality doesn't have to work with older DTs.
> >>
> >> This is fine in general, but I'm afraid that in this particular case we
> >> will need to have a fall back anyways because otherwise it should break
> >> the old functionality.
> > 
> > It looks like tegra20-devfreq is the only one that currently does this
> > lookup via compatible string. And looking at the driver, what it does is
> > pretty horrible, to be honest. It gets a reference to the memory
> > controller and then simply accesses registers within the memory
> > controller without any type of protection against concurrent accesses or
> > reference counting to make sure the registers it accesses are still
> > valid. At the very least this should've been a regmap.
> 
> Regmap is about abstracting accesses to devices that may sit on
> different types of buses, like I2C or SPI for example. Or devices that
> have a non-trivial registers mapping, or have slow IO and need caching.

Those are common uses, yes.

> I think you meant regmap in regards to protecting IO accesses, but this
> is not what regmap is about if IO accesses are atomic by nature.

Then why is there regmap-mmio?

> The tegra20-devfreq functionality is very separate from the rest of the
> memory controller, hence there are no conflicts with regard to hardware
> accesses, so there is nothing to protect.
> 
> Also, Regmap API itself doesn't manage refcounting of the mappings.

That may be true now, but at least it is something formal rather than
just dereferencing some pointer and accessing registers through it. If
this ever becomes a problem it's something that we can more easily
address.

> > And not
> > coincidentally, regmaps are usually passed around by referencing their
> > provider via phandle.
> 
> Any real-world examples? I think you're mixing up regmap with something
> else.

syscon is the most obvious that comes to mind. It is meant to address
the kind of use-case that tegra20-devfreq apparently needs here, where
you have registers for certain functionality that are located in a
completely different IP block from the rest of that functionality. Often
there are better alternatives to solve this, by reusing existing
infrastructure, such as pinmux.

In cases where no subsystem exists we typically use syscon, which are
implemented via regmaps, to gain access to shared registers.
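As a rough illustration of that pattern (the property and register
names here are made up for the example), a consumer would do something
like:

static int example_read_mc_stat(struct device *dev, u32 *val)
{
	struct regmap *regs;

	/* Look up the syscon-backed regmap via a phandle property. */
	regs = syscon_regmap_lookup_by_phandle(dev->of_node,
					       "nvidia,memory-controller");
	if (IS_ERR(regs))
		return PTR_ERR(regs);

	/* All accesses then go through the regmap, not a raw pointer. */
	return regmap_read(regs, EXAMPLE_MC_STAT_REG, val);
}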

> The devfreq driver works just like the SMMU and GART. The devfreq device
> is supposed to be created only once both MC and EMC drivers are loaded
> and we know that they can't go away [1].
> 
> [1]
> https://patchwork.ozlabs.org/project/linux-tegra/patch/20200814000621.8415-32-dig...@gmail.com/

Huh... why is the tegra20-devfreq device instantiated from the EMC
driver? That doesn't make any sense to me. If there aren't any registers
that the driver accesses, then it would make more sense to subsume that
functionality under some different driver (tegra20-mc most likely by
the looks of things).

On a side-note: once we move tegra20-devfreq into tegra20-mc, there's no
need for this look up at all anymore.

> Hence the 

Re: [PATCH v5 2/3] iommu/tegra-smmu: Rework tegra_smmu_probe_device()

2020-10-05 Thread Dmitry Osipenko
05.10.2020 00:57, Nicolin Chen пишет:
> On Sat, Oct 03, 2020 at 05:06:42PM +0300, Dmitry Osipenko wrote:
>> 03.10.2020 09:59, Nicolin Chen пишет:
>>>  static int tegra_smmu_of_xlate(struct device *dev,
>>>struct of_phandle_args *args)
>>>  {
>>> +   struct platform_device *iommu_pdev = of_find_device_by_node(args->np);
>>> +   struct tegra_mc *mc = platform_get_drvdata(iommu_pdev);
>>> u32 id = args->args[0];
>>>  
>>> +   put_device(&iommu_pdev->dev);
>>> +
>>> +   if (!mc || !mc->smmu)
>>> +   return -EPROBE_DEFER;
>>
>> I'm not very excited by seeing code in the patches that can't be
>> explained by the patch authors and will appreciate if you could provide
>> a detailed explanation about why this NULL checking is needed because I
>> think it is unneeded, especially given that other IOMMU drivers don't
>> have such a check.
> 
> This function could be called from of_iommu_configure(), which is
> part of another driver's probe(). So I think it's safer to have a
> check. Yet, given that the mc driver is registered at the
> "arch_initcall" stage, you are probably right that there's really no
> need at this moment because all clients should be probed after
> mc/smmu are initialized. So I'll resend a v6, if that makes you happy.

I wanted to get the explanation :) I'm very curious why it's actually
needed because I'm not 100% sure whether it's not needed at all.

I'd assume that the only possible problem could be if some device is
created in parallel with the MC probing and there is no locking that
could prevent this in the driver core. It's not apparent to me whether
this situation could happen at all in practice.

The MC is created early and at that time everything is sequential, so
it should indeed be safe to remove the check.

>> I'm asking this question second time now, please don't ignore review
>> comments next time.
> 
> I think I missed your reply or misunderstood it.
> 

Okay!

Re: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

2020-10-05 Thread Boris Brezillon
On Mon, 5 Oct 2020 09:34:06 +0100
Steven Price  wrote:

> On 05/10/2020 09:15, Boris Brezillon wrote:
> > Hi Robin, Neil,
> > 
> > On Wed, 16 Sep 2020 10:26:43 +0200
> > Neil Armstrong  wrote:
> >   
> >> Hi Robin,
> >>
> >> On 16/09/2020 01:51, Robin Murphy wrote:  
> >>> According to a downstream commit I found in the Khadas vendor kernel,
> >>> the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows
> >>> how to handle this properly) we should describe it as such. Otherwise
> >>> the mismatch leads to all manner of fun with mismatched attributes and
> >>> inadvertently snooping stale data from caches, which would account for
> >>> at least some of the brokenness observed on this platform.
> >>>
> >>> Signed-off-by: Robin Murphy 
> >>> ---
> >>>   arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 
> >>>   1 file changed, 4 insertions(+)
> >>>
> >>> diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi 
> >>> b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> >>> index 9b8548e5f6e5..ee8fcae9f9f0 100644
> >>> --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> >>> +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> >>> @@ -135,3 +135,7 @@ map1 {
> >>>   };
> >>>   };
> >>>   };
> >>> +
> >>> +&mali {
> >>> + dma-coherent;
> >>> +};
> >>>  
> >>
> >> Thanks a lot for digging, I'll run a test to confirm it fixes the issue !  
> > 
> > Sorry for the late reply. I triggered a dEQP run with this patch applied
> > and I see a bunch of "panfrost ffe4.gpu: matching BO is not heap type"
> > errors (see below for a full backtrace). That doesn't seem to happen when
> > we drop this dma-coherent property.
> > 
> > [  690.945731] [ cut here ]
> > [  690.950003] panfrost ffe4.gpu: matching BO is not heap type (GPU VA 
> > = 319a000)
> > [  690.950051] WARNING: CPU: 0 PID: 120 at 
> > drivers/gpu/drm/panfrost/panfrost_mmu.c:465 
> > panfrost_mmu_irq_handler_thread+0x47c/0x650
> > [  690.968854] Modules linked in:
> > [  690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: GW
> >  5.9.0-rc5-02434-g7d8109ec5a42 #784
> > [  690.981964] Hardware name: Khadas VIM3 (DT)
> > [  690.986107] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
> > [  690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650
> > [  690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650
> > [  691.002836] sp : 800011bcbcd0
> > [  691.006114] x29: 800011bcbcf0 x28: f3fe3800
> > [  691.011375] x27: ceaf5350 x26: ca5fc500
> > [  691.016636] x25: f32409c0 x24: 0001
> > [  691.021897] x23: f3240880 x22: f3e3a800
> > [  691.027159] x21:  x20: 
> > [  691.032420] x19: 00010001 x18: 0020
> > [  691.037681] x17:  x16: 
> > [  691.042942] x15: f3fe3c70 x14: 
> > [  691.048204] x13: 8000116c2428 x12: 8000116c2086
> > [  691.053466] x11: 800011bcbcd0 x10: 800011bcbcd0
> > [  691.058727] x9 : fffe x8 : 
> > [  691.063988] x7 : 7420706165682074 x6 : 8000116c1816
> > [  691.069249] x5 :  x4 : 
> > [  691.074510] x3 :  x2 : 8000e348c000
> > [  691.079771] x1 : f1b91ff9af2df000 x0 : 
> > [  691.085033] Call trace:
> > [  691.087452]  panfrost_mmu_irq_handler_thread+0x47c/0x650
> > [  691.092712]  irq_thread_fn+0x2c/0xa0
> > [  691.096246]  irq_thread+0x184/0x208
> > [  691.099699]  kthread+0x140/0x160
> > [  691.102890]  ret_from_fork+0x10/0x34
> > [  691.106424] ---[ end trace b5dd8c2dfada8236 ]---
> >   
> 
> It's quite possible this is caused by the GPU seeing a stale page table 
> entry, so perhaps coherency isn't working as well as it should...
> 
> Do you get an "Unhandled Page fault" message after this?

Yep (see below).

--->8---

[  689.640491] [ cut here ]
[  689.644754] panfrost ffe4.gpu: matching BO is not heap type (GPU VA = 
3146000)
[  689.644802] WARNING: CPU: 0 PID: 120 at 
drivers/gpu/drm/panfrost/panfrost_mmu.c:465 
panfrost_mmu_irq_handler_thread+0x47c/0x650
[  689.663607] Modules linked in:
[  689.31] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: GW
 5.9.0-rc5-02434-g7d8109ec5a42 #784
[  689.676717] Hardware name: Khadas VIM3 (DT)
[  689.680860] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
[  689.686380] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  689.691987] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  689.697590] sp : 800011bcbcd0
[  689.700867] x29: 800011bcbcf0 x28: f3fe3800 
[  689.706128] x27: d89cf750 x26: da34a800 
[  689.711389] x25: f32409c0 x24: 0001 
[  689.716650] x23: f3240880 x22: d456e000 
[  689.721911] x21:  x20:  
[  689.727173] x19: 00010001 x18: 0020 
[  

Re: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

2020-10-05 Thread Steven Price

On 05/10/2020 09:15, Boris Brezillon wrote:

Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200
Neil Armstrong  wrote:


Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:

According to a downstream commit I found in the Khadas vendor kernel,
the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows
how to handle this properly) we should describe it as such. Otherwise
the mismatch leads to all manner of fun with mismatched attributes and
inadvertently snooping stale data from caches, which would account for
at least some of the brokenness observed on this platform.

Signed-off-by: Robin Murphy 
---
  arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 
  1 file changed, 4 insertions(+)

diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi 
b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
index 9b8548e5f6e5..ee8fcae9f9f0 100644
--- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
+++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
@@ -135,3 +135,7 @@ map1 {
};
};
  };
+
+&mali {
+   dma-coherent;
+};
   


Thanks a lot for digging, I'll run a test to confirm it fixes the issue !


Sorry for the late reply. I triggered a dEQP run with this patch applied
and I see a bunch of "panfrost ffe4.gpu: matching BO is not heap type"
errors (see below for a full backtrace). That doesn't seem to happen when
we drop this dma-coherent property.

[  690.945731] [ cut here ]
[  690.950003] panfrost ffe4.gpu: matching BO is not heap type (GPU VA = 
319a000)
[  690.950051] WARNING: CPU: 0 PID: 120 at 
drivers/gpu/drm/panfrost/panfrost_mmu.c:465 
panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.968854] Modules linked in:
[  690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: GW
 5.9.0-rc5-02434-g7d8109ec5a42 #784
[  690.981964] Hardware name: Khadas VIM3 (DT)
[  690.986107] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
[  690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.002836] sp : 800011bcbcd0
[  691.006114] x29: 800011bcbcf0 x28: f3fe3800
[  691.011375] x27: ceaf5350 x26: ca5fc500
[  691.016636] x25: f32409c0 x24: 0001
[  691.021897] x23: f3240880 x22: f3e3a800
[  691.027159] x21:  x20: 
[  691.032420] x19: 00010001 x18: 0020
[  691.037681] x17:  x16: 
[  691.042942] x15: f3fe3c70 x14: 
[  691.048204] x13: 8000116c2428 x12: 8000116c2086
[  691.053466] x11: 800011bcbcd0 x10: 800011bcbcd0
[  691.058727] x9 : fffe x8 : 
[  691.063988] x7 : 7420706165682074 x6 : 8000116c1816
[  691.069249] x5 :  x4 : 
[  691.074510] x3 :  x2 : 8000e348c000
[  691.079771] x1 : f1b91ff9af2df000 x0 : 
[  691.085033] Call trace:
[  691.087452]  panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.092712]  irq_thread_fn+0x2c/0xa0
[  691.096246]  irq_thread+0x184/0x208
[  691.099699]  kthread+0x140/0x160
[  691.102890]  ret_from_fork+0x10/0x34
[  691.106424] ---[ end trace b5dd8c2dfada8236 ]---



It's quite possible this is caused by the GPU seeing a stale page table 
entry, so perhaps coherency isn't working as well as it should...


Do you get an "Unhandled Page fault" message after this? It might give 
some clues. Coherency issues are a pain to debug though and it's of 
course possible there are issues on this specific platform.


Steve


Re: [PATCH 8/8] WIP: add a dma_alloc_contiguous API

2020-10-05 Thread Christoph Hellwig
On Fri, Oct 02, 2020 at 05:50:40PM +, Tomasz Figa wrote:
> Hi Christoph,
> 
> On Wed, Sep 30, 2020 at 06:09:17PM +0200, Christoph Hellwig wrote:
> > Add a new API that returns a virtually non-contiguous array of pages
> > and dma address.  This API is only implemented for dma-iommu and will
> > not be implemented for non-iommu DMA API instances that have to allocate
> > contiguous memory.  It is up to the caller to check if the API is
> > available.
> 
> Would you mind shedding some more light on what made the previous attempt
> not work well? I liked the previous API because it was more consistent with
> the regular dma_alloc_coherent().

The problem is that with a dma_alloc_noncoherent that can return pages
not in the kernel mapping we can't just use virt_to_page to fill in
scatterlists or mmap the buffer to userspace, but would need new helpers
and another two methods.

> >  - no kernel mapping or only temporary kernel mappings are required.
> >That is as a better replacement for DMA_ATTR_NO_KERNEL_MAPPING
> >  - a kernel mapping is required for cached and DMA mapped pages, but
> >the driver also needs the pages to e.g. map them to userspace.
> >In that sense it is a replacement for some aspects of the recently
> >removed and never fully implemented DMA_ATTR_NON_CONSISTENT
> 
> What's the expected allocation and mapping flow with the latter? Would that be
> 
> pages = dma_alloc_noncoherent(...)
> vaddr = vmap(pages, ...);
> 
> ?

Yes.  With the vmap step optional for replacements of
DMA_ATTR_NO_KERNEL_MAPPING, which is another nightmare to deal with.

> Would one just use the usual dma_sync_for_{cpu,device}() for cache
> invalidate/clean, while keeping the mapping in place?

Yes.  And make sure the API isn't implemented when VIVT caches are
used, but that isn't really different from the current interface.
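Put together, a consumer of such an API would look roughly like the
sketch below. Note that the allocator name and signature follow the WIP
series being discussed here (returning an array of pages), so treat
them as assumptions rather than a final interface:

static int example_alloc(struct device *dev, size_t size)
{
	struct page **pages;
	dma_addr_t dma_handle;
	void *vaddr;

	pages = dma_alloc_noncoherent(dev, size, &dma_handle,
				      DMA_BIDIRECTIONAL, GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	/* Optional: only needed if the driver wants a kernel mapping. */
	vaddr = vmap(pages, size >> PAGE_SHIFT, VM_MAP, PAGE_KERNEL);
	if (!vaddr)
		return -ENOMEM;

	/* Usual cache maintenance around device accesses. */
	dma_sync_single_for_device(dev, dma_handle, size, DMA_TO_DEVICE);
	/* ... device work ... */
	dma_sync_single_for_cpu(dev, dma_handle, size, DMA_FROM_DEVICE);

	vunmap(vaddr);
	/* Freeing would use whatever counterpart the final API provides. */
	return 0;
}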


Re: [PATCH 3/3] arm64: dts: meson: Describe G12b GPU as coherent

2020-10-05 Thread Boris Brezillon
Hi Robin, Neil,

On Wed, 16 Sep 2020 10:26:43 +0200
Neil Armstrong  wrote:

> Hi Robin,
> 
> On 16/09/2020 01:51, Robin Murphy wrote:
> > According to a downstream commit I found in the Khadas vendor kernel,
> > the GPU on G12b is wired up for ACE-lite, so (now that Panfrost knows
> > how to handle this properly) we should describe it as such. Otherwise
> > the mismatch leads to all manner of fun with mismatched attributes and
> > inadvertently snooping stale data from caches, which would account for
> > at least some of the brokenness observed on this platform.
> > 
> > Signed-off-by: Robin Murphy 
> > ---
> >  arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi 
> > b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> > index 9b8548e5f6e5..ee8fcae9f9f0 100644
> > --- a/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> > +++ b/arch/arm64/boot/dts/amlogic/meson-g12b.dtsi
> > @@ -135,3 +135,7 @@ map1 {
> > };
> > };
> >  };
> > +
> > +&mali {
> > +   dma-coherent;
> > +};
> >   
> 
> Thanks a lot for digging, I'll run a test to confirm it fixes the issue !

Sorry for the late reply. I triggered a dEQP run with this patch applied
and I see a bunch of "panfrost ffe4.gpu: matching BO is not heap type"
errors (see below for a full backtrace). That doesn't seem to happen when
we drop this dma-coherent property.

[  690.945731] [ cut here ]
[  690.950003] panfrost ffe4.gpu: matching BO is not heap type (GPU VA = 
319a000)
[  690.950051] WARNING: CPU: 0 PID: 120 at 
drivers/gpu/drm/panfrost/panfrost_mmu.c:465 
panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.968854] Modules linked in:
[  690.971878] CPU: 0 PID: 120 Comm: irq/27-panfrost Tainted: GW
 5.9.0-rc5-02434-g7d8109ec5a42 #784
[  690.981964] Hardware name: Khadas VIM3 (DT)
[  690.986107] pstate: 6005 (nZCv daif -PAN -UAO BTYPE=--)
[  690.991627] pc : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  690.997232] lr : panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.002836] sp : 800011bcbcd0
[  691.006114] x29: 800011bcbcf0 x28: f3fe3800 
[  691.011375] x27: ceaf5350 x26: ca5fc500 
[  691.016636] x25: f32409c0 x24: 0001 
[  691.021897] x23: f3240880 x22: f3e3a800 
[  691.027159] x21:  x20:  
[  691.032420] x19: 00010001 x18: 0020 
[  691.037681] x17:  x16:  
[  691.042942] x15: f3fe3c70 x14:  
[  691.048204] x13: 8000116c2428 x12: 8000116c2086 
[  691.053466] x11: 800011bcbcd0 x10: 800011bcbcd0 
[  691.058727] x9 : fffe x8 :  
[  691.063988] x7 : 7420706165682074 x6 : 8000116c1816 
[  691.069249] x5 :  x4 :  
[  691.074510] x3 :  x2 : 8000e348c000 
[  691.079771] x1 : f1b91ff9af2df000 x0 :  
[  691.085033] Call trace:
[  691.087452]  panfrost_mmu_irq_handler_thread+0x47c/0x650
[  691.092712]  irq_thread_fn+0x2c/0xa0
[  691.096246]  irq_thread+0x184/0x208
[  691.099699]  kthread+0x140/0x160
[  691.102890]  ret_from_fork+0x10/0x34
[  691.106424] ---[ end trace b5dd8c2dfada8236 ]---


Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Dmitry Osipenko
05.10.2020 10:13, Thierry Reding пишет:
...
> Have you also seen that sun50i-iommu does look up the SMMU from a
> phandle using of_find_device_by_node()? So I think you've shown yourself
> that even "modern" drivers avoid global pointers and look up via
> phandle.

I have no problem with the lookup by phandle and I'm all for it. It's
now apparent to me that you completely missed my point, but that is
probably my fault for not having conveyed it properly from the start.
I just wanted to avoid the incompatible DT changes which could break
older DTs + I simply wanted to improve the older code without
introducing new features, that's it.

Anyway, after your comments I started to look at how the interconnect
patches could be improved and found new things, like that OPPs now
support ICC and that the EMC has a working EMC_STAT, and I also
discovered syscon and simple-mfd. This means that we won't need the
global pointers at all, neither for SMMU, nor for interconnect, nor
for the EMC drivers :)

Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Thierry Reding
On Fri, Oct 02, 2020 at 04:55:34AM +0300, Dmitry Osipenko wrote:
> 02.10.2020 04:07, Nicolin Chen пишет:
> > On Thu, Oct 01, 2020 at 11:33:38PM +0300, Dmitry Osipenko wrote:
> > If we can't come to an agreement on globalizing mc pointer, would
> > it be possible to pass tegra_mc_driver through tegra_smmu_probe()
> > so we can continue to use driver_find_device_by_fwnode() as v1?
> >
> > v1: https://lkml.org/lkml/2020/9/26/68
> 
>  tegra_smmu_probe() already takes a struct tegra_mc *. Did you mean
>  tegra_smmu_probe_device()? I don't think we can do that because it isn't
> >>>
> >>> I was saying to have a global parent_driver pointer: similar to
> >>> my v1, yet rather than "extern" the tegra_mc_driver, we pass it
> >>> through tegra_smmu_probe() and store it in a static global value
> >>> so as to call tegra_smmu_get_by_fwnode() in ->probe_device().
> >>>
> >>> Though I agree that creating a global device pointer (mc) might
> >>> be controversial, yet having a global parent_driver pointer may
> >>> not be against the rule, considering that it is common in iommu
> >>> drivers to call driver_find_device_by_fwnode in probe_device().
> >>
> >> You don't need the global pointer if you have SMMU OF node.
> >>
> >> You could also get driver pointer from mc->dev->driver.
> >>
> >> But I don't think you need to do this at all. The probe_device() could
> >> be invoked only for the tegra_smmu_ops and then seems you could use
> >> dev_iommu_priv_set() in tegra_smmu_of_xlate(), like sun50i-iommu driver
> >> does.
> > 
> > Getting iommu device pointer using driver_find_device_by_fwnode()
> > is a common practice in ->probe_device() of other iommu drivers.
> 
> Please give me a full list of the IOMMU drivers which use this method.

ARM SMMU and ARM SMMU v3 do this and so does virtio-iommu. Pretty much
all the other drivers for ARM platforms have their own variations of
tegra_smmu_find() using of_find_device_by_node() at some point.

What others do differently is that they call of_find_device_by_node()
from ->of_xlate(), which is notably different from what we do in
tegra-smmu (where we call it from ->probe_device()). It's entirely
possible that we can do that as well, which is what we've been
discussing in a different sub-thread, but like I mentioned there I do
recall that being problematic, otherwise I wouldn't have left all the
comments in the code.

If we can determine that moving this to ->of_xlate() works fine in all
cases, then I think that's something that we should do for tegra-smmu to
become more consistent with other drivers.

> > But this requires a device_driver pointer that tegra-smmu doesn't
> > have. So passing tegra_mc_driver through tegra_smmu_probe() will
> > address it.
> > 
> 
> If you're borrowing code and ideas from other drivers, then at least
> please borrow them from a modern, good-looking driver. And I already
> pointed out that following cargo cult is not always a good idea.
> 
> ARM-SMMU isn't a modern driver and it has legacy code. You shouldn't
> copy it blindly. The sun50i-iommu driver was added half a year ago; you
> may use it as a reference.

That's nonsense. There's no such thing as "modern" drivers in Linux
because they are constantly improved. Yes, ARM SMMU may have legacy code
paths, but that's because it has been around for much longer than others
and therefore is much more mature.

I can't say much about sun50i-iommu because I'm not familiar with it,
but I have seen plenty of "modern" drivers that turn out to be much
worse than "old" drivers. New doesn't always mean better.

> Always consult the IOMMU core code. If you're too unsure about
> something, then maybe better to start a new thread and ask Joerg about
> the best modern practices that IOMMU drivers should use.

This doesn't really have anything to do with the IOMMU core code. This
has to do with platform and firmware code, so the IOMMU core is only
marginally involved.

Thierry



Re: [PATCH v3 2/3] iommu/tegra-smmu: Rework .probe_device and .attach_dev

2020-10-05 Thread Thierry Reding
On Thu, Oct 01, 2020 at 11:33:38PM +0300, Dmitry Osipenko wrote:
> 01.10.2020 14:04, Nicolin Chen пишет:
> > On Thu, Oct 01, 2020 at 12:23:16PM +0200, Thierry Reding wrote:
> >> It looks to me like the only reason why you need this new global API is
> >> because PCI devices may not have a device tree node with a phandle 
> >> to
> >> the IOMMU. However, SMMU support for PCI will only be enabled if 
> >> the
> >> root complex has an iommus property, right? In that case, can't we
> >> simply do something like this:
> >>
> >>if (dev_is_pci(dev))
> >>np = find_host_bridge(dev)->of_node;
> >>else
> >>np = dev->of_node;
> > 
> >>> I personally am not a fan of adding a path for PCI device either,
> >>> since PCI/IOMMU cores could have taken care of it while the same
> >>> path can't be used for other buses.
> >>
> >> There's already plenty of other drivers that do something similar to
> >> this. Take a look at the arm-smmu driver, for example, which seems to be
> >> doing exactly the same thing to finding the right device tree node to
> >> look at (see dev_get_dev_node() in drivers/iommu/arm-smmu/arm-smmu.c).
> > 
> > Hmm..okay..that is quite convincing then...
> 
> Not very convincing to me. I don't see "plenty of other drivers";
> there is only one arm-smmu driver.

There's ARM SMMU, ARM SMMU v3 and at least FSL PAMU. Even some of the
x86 platforms use dev_is_pci() to special-case PCI devices. That's just
because PCI is fundamentally different from fixed devices such as those
on a platform bus.

> The dev_get_dev_node() is under CONFIG_ARM_SMMU_LEGACY_DT_BINDINGS (!).
> Guys, doesn't it look strange to you? :)

See above, there are other cases where PCI devices are treated special.
For example, pretty much every driver that supports PCI differentiates
between PCI and other devices in their ->device_group() callback.
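A typical ->device_group() implementation shows that split clearly
(generic helpers, not tied to any particular driver):

static struct iommu_group *example_device_group(struct device *dev)
{
	/* PCI devices may have to share a group because of DMA aliases. */
	if (dev_is_pci(dev))
		return pci_device_group(dev);

	/* Everything else gets a group of its own. */
	return generic_device_group(dev);
}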

> The arm-smmu driver does a similar thing for the modern bindings to what
> Nicolin's v3 is doing.

I don't really have any objections to doing something similar to what
ARM SMMU does. My main objection is to the use of a global pointer that
is used to look up the SMMU. As you see, the ARM SMMU driver also does
this lookup via driver_find_device_by_fwnode() rather than storing a
global pointer.
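That lookup pattern is roughly the following (sketch only; the helper
name is made up and tegra_mc_driver is assumed to be visible here):

static struct tegra_mc *example_mc_get_by_fwnode(struct fwnode_handle *fwnode)
{
	struct device *dev;

	dev = driver_find_device_by_fwnode(&tegra_mc_driver.driver, fwnode);
	if (!dev)
		return NULL;

	/* This takes a reference on the device; the caller must drop it. */
	return dev_get_drvdata(dev);
}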

Also you can't quite equate ARM SMMU with Tegra SMMU. ARM SMMU can
properly deal with devices behind a PCI host bridge, whereas on Tegra
all those devices are thrown in the same bucket with the same IOMMU
domain. So it's to be expected that some things will have to be
different between the two drivers.

> >>> If we can't come to an agreement on globalizing mc pointer, would
> >>> it be possible to pass tegra_mc_driver through tegra_smmu_probe()
> >>> so we can continue to use driver_find_device_by_fwnode() as v1?
> >>>
> >>> v1: https://lkml.org/lkml/2020/9/26/68
> >>
> >> tegra_smmu_probe() already takes a struct tegra_mc *. Did you mean
> >> tegra_smmu_probe_device()? I don't think we can do that because it isn't
> > 
> > I was saying to have a global parent_driver pointer: similar to
> > my v1, yet rather than "extern" the tegra_mc_driver, we pass it
> > through tegra_smmu_probe() and store it in a static global value
> > so as to call tegra_smmu_get_by_fwnode() in ->probe_device().
> > 
> > Though I agree that creating a global device pointer (mc) might
> > be controversial, yet having a global parent_driver pointer may
> > not be against the rule, considering that it is common in iommu
> > drivers to call driver_find_device_by_fwnode in probe_device().
> 
> You don't need the global pointer if you have SMMU OF node.
> 
> You could also get driver pointer from mc->dev->driver.
> 
> But I don't think you need to do this at all. The probe_device() could
> be invoked only for the tegra_smmu_ops and then seems you could use
> dev_iommu_priv_set() in tegra_smmu_of_xlate(), like sun50i-iommu driver
> does.

Have you also seen that sun50i-iommu does look up the SMMU from a
phandle using of_find_device_by_node()? So I think you've shown yourself
that even "modern" drivers avoid global pointers and look up via
phandle.

Thierry

