Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 08:48:29AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
> > On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > > > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > > > I think we agreed (or agree to disagree and commit) that for the
> > > > > > device types we have for SIOV, the VFIO based approach works well
> > > > > > without having to re-invent another way to do the same things. Not
> > > > > > looking for a shortcut by any means, but we need to plan around
> > > > > > existing hardware. Looks like vDPA took some shortcuts then by not
> > > > > > abstracting the iommu uAPI instead :-)? If all the necessary
> > > > > > hardware had been available, this would be a solved puzzle.
> > > > > 
> > > > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > > > the "home" and needs to stand alone.
> > > > > 
> > > > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > > > 
> > > > So just to make this clear: I did check internally whether there are
> > > > any plans for vDPA + SVM. There are none at the moment.
> > > 
> > > Not SVM, SIOV.
> > 
> > ... And that includes it. I should have said vDPA + PASID; no current
> > plans. I have no idea who set expectations with you. Can you please put
> > me in touch with that person? Privately is fine.
> 
> It was the team that argued VDPA had to be done through VFIO - SIOV
> and PASID was one of their reasons it had to be VFIO, check the list
> archives

Humm... I could search the archives, but the point is I'm confirming that
there is no forward-looking plan!

And whoever did was basing it on a hypothetical strawman argument that
wasn't grounded in reality.

> 
> If they didn't plan to use it, bit of a strawman argument, right?

This doesn't need to continue like the debates :-) Pun intended :-)

I don't think it makes any sense to have an abstract, strawman-argument
design discussion. Yi is looking into PASID management alone. The rest
of the IOMMU related topics should wait until we have another
*real* use requiring consolidation.

Contrary to your argument, vDPA going with a half-blown, device-only
iommu user without considering existing abstractions like containers
and such in VFIO is part of the reason the gap is big at the moment.
And you might not agree, but that's beside the point.

Rather than pivoting ourselves around hypothetical, strawman,
non-intersecting arguments, suggesting an architecture without having done
a proof of concept to validate the proposal should stop. We have to ground
ourselves in reality.

For the use cases we have so far for SIOV, VFIO and mdev seem to be the
right candidates and address them well. Now you might disagree, but as
noted we all agreed to move past this.



RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Liu, Yi L
Hi Jason,

> From: Jason Wang 
> Sent: Thursday, October 22, 2020 10:56 AM
> 
[...]
> If you (Intel) don't have plans to do vDPA, you should not prevent other
> vendors from implementing PASID capable hardware through a non-VFIO
> subsystem/uAPI on top of your SIOV architecture. Isn't that right?

Yes, that's true.

> So if Intel is willing to collaborate on the PoC, I'd be happy to help.
> E.g. it's not hard to have a PASID capable virtio device through qemu,
> and we can start from there.

Actually, I'm already doing a PoC to move the PASID allocation/free
interface out of VFIO, so that other users can use it as well. I think
this is also what you suggested previously. :-) I'll send it out when
it's ready and seek your help in maturing it. Does that sound good to you?

Regards,
Yi Liu

> 
> Thanks
> 
> 
> >



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Jason Wang


On 2020/10/22 11:54 AM, Liu, Yi L wrote:

> Hi Jason,
>
>> From: Jason Wang 
>> Sent: Thursday, October 22, 2020 10:56 AM
>>
> [...]
>> If you (Intel) don't have plans to do vDPA, you should not prevent other
>> vendors from implementing PASID capable hardware through a non-VFIO
>> subsystem/uAPI on top of your SIOV architecture. Isn't that right?
>
> Yes, that's true.
>
>> So if Intel is willing to collaborate on the PoC, I'd be happy to help.
>> E.g. it's not hard to have a PASID capable virtio device through qemu,
>> and we can start from there.
>
> Actually, I'm already doing a PoC to move the PASID allocation/free
> interface out of VFIO, so that other users can use it as well. I think
> this is also what you suggested previously. :-) I'll send it out when
> it's ready and seek your help in maturing it. Does that sound good to you?


Yes, fine with me.

Thanks


> Regards,
> Yi Liu
>
>> Thanks


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 03:24:42PM -0300, Jason Gunthorpe wrote:
> 
> > Contrary to your argument, vDPA going with a half-blown, device-only
> > iommu user without considering existing abstractions like containers
> 
> VDPA IOMMU was done *for Intel*, as the kind of half-architected thing
> you are advocating should be allowed for IDXD here. Not sure why you
> think bashing that work is going to help your case here.

I'm not bashing that work, sorry if it comes across that way,
but it just feels like double standards.

I'm not sure why you tie IDXD and VDPA together here. How IDXD uses native
SVM is orthogonal to how we achieve mdev passthrough to the guest and vSVM.
We visited that exact thing multiple times. Doing SVM is quite simple and
doesn't carry the weight of the long list of other things (Kevin explained
this in detail not too long ago) we need to accomplish for mdev
passthrough.

For SVM, you just need access to the hw, mmio, and bind_mm to get a PASID
bound with the IOMMU.

For IDXD, creating passthrough devices for guest access and vSVM is
done through the VFIO path.

For guest SVM, we expose mdevs to the guest OS, and idxd in the guest
provides vSVM services. vSVM is *not* built around native SVM interfaces.
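For reference, the native SVM flow described above boils down to roughly
the following sketch, using the kernel's SVA API (the driver context and
error handling here are illustrative only, not IDXD's actual code):

#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>

/* Illustrative only: bind the current process address space to the
 * device and fetch the PASID the device must carry in its descriptors.
 */
static int example_bind_current_mm(struct device *dev, int *pasid)
{
	struct iommu_sva *handle;

	/* bind_mm: set up a PASID entry pointing at current->mm */
	handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	*pasid = iommu_sva_get_pasid(handle);

	/* on teardown: iommu_sva_unbind_device(handle) */
	return 0;
}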


WARNING in dma_map_page_attrs

2020-10-21 Thread syzbot
Hello,

syzbot found the following issue on:

HEAD commit:    c4d6fe73 Merge tag 'xarray-5.9' of git://git.infradead.org..
git tree:   upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14862ff050
kernel config:  https://syzkaller.appspot.com/x/.config?x=7d790573d3e379c4
dashboard link: https://syzkaller.appspot.com/bug?extid=34dc2fea3478e659af01
compiler:   gcc (GCC) 10.1.0-syz 20200507

Unfortunately, I don't have any reproducer for this issue yet.

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+34dc2fea3478e659a...@syzkaller.appspotmail.com

infiniband syz1: set active
infiniband syz1: added vcan0
[ cut here ]
WARNING: CPU: 1 PID: 9851 at kernel/dma/mapping.c:149 
dma_map_page_attrs+0x493/0x700 kernel/dma/mapping.c:149
Modules linked in:
CPU: 1 PID: 9851 Comm: syz-executor.1 Not tainted 5.9.0-syzkaller #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
RIP: 0010:dma_map_page_attrs+0x493/0x700 kernel/dma/mapping.c:149
Code: 80 3c 10 00 0f 85 ed 01 00 00 48 8b 1d 36 c3 fa 0c e9 2d fc ff ff 48 89 
c3 e9 d1 fd ff ff e8 04 12 12 00 0f 0b e8 fd 11 12 00 <0f> 0b 49 c7 c4 ff ff ff 
ff e9 d5 fd ff ff e8 ea 11 12 00 48 8d 7b
RSP: 0018:c90001546c68 EFLAGS: 00010246
RAX: 0004 RBX: 894d0040 RCX: c9000dbe4000
RDX: 0004 RSI: 815d3b03 RDI: 88806a988b00
RBP: 8880236cc400 R08: 0002 R09: 
R10: 0002 R11:  R12: ea8db300
R13: 88806a9886e8 R14: 04b8 R15: 0002
FS:  7f678fae2700() GS:88802ce0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 7f299a39b190 CR3: 69f31000 CR4: 00350ee0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 dma_map_single_attrs include/linux/dma-mapping.h:279 [inline]
 ib_dma_map_single include/rdma/ib_verbs.h:3967 [inline]
 ib_mad_post_receive_mads+0x23f/0xd60 drivers/infiniband/core/mad.c:2715
 ib_mad_port_start drivers/infiniband/core/mad.c:2862 [inline]
 ib_mad_port_open drivers/infiniband/core/mad.c:3016 [inline]
 ib_mad_init_device+0x72b/0x1400 drivers/infiniband/core/mad.c:3092
 add_client_context+0x405/0x5e0 drivers/infiniband/core/device.c:680
 enable_device_and_get+0x1d5/0x3c0 drivers/infiniband/core/device.c:1301
 ib_register_device drivers/infiniband/core/device.c:1376 [inline]
 ib_register_device+0x7a7/0xa40 drivers/infiniband/core/device.c:1335
 rxe_register_device+0x46d/0x570 drivers/infiniband/sw/rxe/rxe_verbs.c:1182
 rxe_add+0x12fe/0x16d0 drivers/infiniband/sw/rxe/rxe.c:247
 rxe_net_add+0x8c/0xe0 drivers/infiniband/sw/rxe/rxe_net.c:507
 rxe_newlink drivers/infiniband/sw/rxe/rxe.c:269 [inline]
 rxe_newlink+0xb7/0xe0 drivers/infiniband/sw/rxe/rxe.c:250
 nldev_newlink+0x30e/0x540 drivers/infiniband/core/nldev.c:1555
 rdma_nl_rcv_msg+0x367/0x690 drivers/infiniband/core/netlink.c:195
 rdma_nl_rcv_skb drivers/infiniband/core/netlink.c:239 [inline]
 rdma_nl_rcv+0x2f2/0x440 drivers/infiniband/core/netlink.c:259
 netlink_unicast_kernel net/netlink/af_netlink.c:1304 [inline]
 netlink_unicast+0x533/0x7d0 net/netlink/af_netlink.c:1330
 netlink_sendmsg+0x856/0xd90 net/netlink/af_netlink.c:1919
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:671
 sys_sendmsg+0x6e8/0x810 net/socket.c:2353
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2407
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45d9f9
Code: bd b1 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 
89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 
8b b1 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:7f678fae1c88 EFLAGS: 0246 ORIG_RAX: 002e
RAX: ffda RBX: 0071f480 RCX: 0045d9f9
RDX:  RSI: 2200 RDI: 0003
RBP: 004aab13 R08:  R09: 
R10:  R11: 0246 R12: 0075bf00
R13: 7ffc6f9b8bbf R14: 7f678fac2000 R15: 0003


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkal...@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.


[PATCH v4 7/7] mm: Remove examples from enum zone_type comment

2020-10-21 Thread Nicolas Saenz Julienne
We can't really list every setup in common code. On top of that they are
unlikely to stay true for long as things change in the arch trees
independently of this comment.

Suggested-by: Christoph Hellwig 
Signed-off-by: Nicolas Saenz Julienne 
---
 include/linux/mmzone.h | 20 
 1 file changed, 20 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf696c05e..9d0c454d23cd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -354,26 +354,6 @@ enum zone_type {
 * DMA mask is assumed when ZONE_DMA32 is defined. Some 64-bit
 * platforms may need both zones as they support peripherals with
 * different DMA addressing limitations.
- *
- * Some examples:
- *
- *  - i386 and x86_64 have a fixed 16M ZONE_DMA and ZONE_DMA32 for the
- *    rest of the lower 4G.
- *
- *  - arm only uses ZONE_DMA, the size, up to 4G, may vary depending on
- *    the specific device.
- *
- *  - arm64 has a fixed 1G ZONE_DMA and ZONE_DMA32 for the rest of the
- *    lower 4G.
- *
- *  - powerpc only uses ZONE_DMA, the size, up to 2G, may vary
- *    depending on the specific device.
- *
- *  - s390 uses ZONE_DMA fixed to the lower 2G.
- *
- *  - ia64 and riscv only use ZONE_DMA32.
- *
- *  - parisc uses neither.
 */
 #ifdef CONFIG_ZONE_DMA
ZONE_DMA,
-- 
2.28.0



Re: [PATCH v3 07/14] iommu/ioasid: Add an iterator API for ioasid_set

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:34PM -0700, Jacob Pan wrote:
> Users of an ioasid_set may not keep track of all the IOASIDs allocated
> under the set. When collective actions are needed for each IOASID, it
> is useful to iterate over all the IOASIDs within the set. For example,
> when the ioasid_set is freed, the user might perform the same cleanup
> operation on each IOASID.
> 
> This patch adds an API to iterate all the IOASIDs within the set.
> 
> Signed-off-by: Jacob Pan 

Could add a short description of the function parameters, but

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/ioasid.c | 17 +
>  include/linux/ioasid.h |  9 +
>  2 files changed, 26 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index cf8c7d34e2de..9628e78b2ab4 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -701,6 +701,23 @@ int ioasid_adjust_set(struct ioasid_set *set, int quota)
>  EXPORT_SYMBOL_GPL(ioasid_adjust_set);
>  
>  /**
> + * ioasid_set_for_each_ioasid - Iterate over all the IOASIDs within the set
> + *
> + * Caller must hold a reference to the set and handle its own locking.
> + */
> +void ioasid_set_for_each_ioasid(struct ioasid_set *set,
> + void (*fn)(ioasid_t id, void *data),
> + void *data)
> +{
> + struct ioasid_data *entry;
> + unsigned long index;
> +
> + xa_for_each(&set->xa, index, entry)
> + fn(index, data);
> +}
> +EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
> +
> +/**
>   * ioasid_find - Find IOASID data
>   * @set: the IOASID set
>   * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 0a5e82148eb9..aab58bc26714 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -75,6 +75,9 @@ int ioasid_register_allocator(struct ioasid_allocator_ops 
> *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_detach_data(ioasid_t ioasid);
> +void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> + void (*fn)(ioasid_t id, void *data),
> + void *data);
>  #else /* !CONFIG_IOASID */
>  static inline void ioasid_install_capacity(ioasid_t total)
>  {
> @@ -131,5 +134,11 @@ static inline int ioasid_attach_data(ioasid_t ioasid, 
> void *data)
>  static inline void ioasid_detach_data(ioasid_t ioasid)
>  {
>  }
> +
> +static inline void ioasid_set_for_each_ioasid(struct ioasid_set *sdata,
> +   void (*fn)(ioasid_t id, void 
> *data),
> +   void *data)
> +{
> +}
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 
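As a caller-side illustration of the iterator (the callback and context
type here are hypothetical, not part of the patch):

/* Hypothetical per-IOASID cleanup callback */
static void example_clear_ioasid(ioasid_t id, void *data)
{
	struct example_ctx *ctx = data;	/* caller-private state */

	/* tear down whatever hardware context ctx holds for this ID */
	example_ctx_destroy(ctx, id);
}

static void example_teardown_set(struct ioasid_set *set, struct example_ctx *ctx)
{
	/* per the kerneldoc: caller holds a set reference and its own lock */
	mutex_lock(&ctx->lock);
	ioasid_set_for_each_ioasid(set, example_clear_ioasid, ctx);
	mutex_unlock(&ctx->lock);
}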


Re: [PATCH 07/13] x86: Secure Launch kernel early boot stub

2020-10-21 Thread Daniel Kiper
On Mon, Oct 19, 2020 at 01:18:22PM -0400, Arvind Sankar wrote:
> On Mon, Oct 19, 2020 at 04:51:53PM +0200, Daniel Kiper wrote:
> > On Fri, Oct 16, 2020 at 04:51:51PM -0400, Arvind Sankar wrote:
> > > On Thu, Oct 15, 2020 at 08:26:54PM +0200, Daniel Kiper wrote:
> > > >
> > > > I am discussing with Ross the other option. We can create
> > > > .rodata.mle_header section and put it at fixed offset as
> > > > kernel_info is. So, we would have, e.g.:
> > > >
> > > > arch/x86/boot/compressed/vmlinux.lds.S:
> > > > .rodata.kernel_info KERNEL_INFO_OFFSET : {
> > > > *(.rodata.kernel_info)
> > > > }
> > > > ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, 
> > > > "kernel_info at bad address!")
> > > >
> > > > .rodata.mle_header MLE_HEADER_OFFSET : {
> > > > *(.rodata.mle_header)
> > > > }
> > > > ASSERT(ABSOLUTE(mle_header) == MLE_HEADER_OFFSET, "mle_header 
> > > > at bad address!")
> > > >
> > > > arch/x86/boot/compressed/sl_stub.S:
> > > > #define mleh_rva(X) (((X) - mle_header) + MLE_HEADER_OFFSET)
> > > >
> > > > .section ".rodata.mle_header", "a"
> > > >
> > > > SYM_DATA_START(mle_header)
> > > > .long   0x9082ac5a/* UUID0 */
> > > > .long   0x74a7476f/* UUID1 */
> > > > .long   0xa2555c0f/* UUID2 */
> > > > .long   0x42b651cb/* UUID3 */
> > > > .long   0x0034/* MLE header size */
> > > > .long   0x00020002/* MLE version 2.2 */
> > > > .long   mleh_rva(sl_stub_entry)/* Linear entry point of MLE 
> > > > (virt. address) */
> > > > .long   0x/* First valid page of MLE */
> > > > .long   0x/* Offset within binary of first byte of 
> > > > MLE */
> > > > .long   0x/* Offset within binary of last byte + 1 
> > > > of MLE */
> > > > .long   0x0223/* Bit vector of MLE-supported 
> > > > capabilities */
> > > > .long   0x/* Starting linear address of command 
> > > > line (unused) */
> > > > .long   0x/* Ending linear address of command line 
> > > > (unused) */
> > > > SYM_DATA_END(mle_header)
> > > >
> > > > Of course MLE_HEADER_OFFSET has to be defined as a constant somewhere.
> > > > Anyway, is it acceptable?
> >
> > What do you think about my MLE_HEADER_OFFSET and related stuff proposal?
> >
>
> I'm wondering if it would be easier to just allow relocations in these
> special "header" sections. I need to check how easy/hard it is to do
> that without triggering linker warnings.

Ross and I are still bouncing some ideas around. We came to the conclusion
that putting mle_header into the kernel .rodata.kernel_info section, or even
the arch/x86/boot/compressed/kernel_info.S file, would be the easiest thing
to do at this point. Of course I would suggest some renaming too, e.g.
.rodata.kernel_info to .rodata.kernel_headers, etc. Does that make sense
to you?

Daniel


Re: [PATCH 07/13] x86: Secure Launch kernel early boot stub

2020-10-21 Thread Ross Philipson
On 10/21/20 12:18 PM, Arvind Sankar wrote:
> On Wed, Oct 21, 2020 at 05:28:33PM +0200, Daniel Kiper wrote:
>> On Mon, Oct 19, 2020 at 01:18:22PM -0400, Arvind Sankar wrote:
>>> On Mon, Oct 19, 2020 at 04:51:53PM +0200, Daniel Kiper wrote:
 On Fri, Oct 16, 2020 at 04:51:51PM -0400, Arvind Sankar wrote:
> On Thu, Oct 15, 2020 at 08:26:54PM +0200, Daniel Kiper wrote:
>>
>> I am discussing with Ross the other option. We can create
>> .rodata.mle_header section and put it at fixed offset as
>> kernel_info is. So, we would have, e.g.:
>>
>> arch/x86/boot/compressed/vmlinux.lds.S:
>> .rodata.kernel_info KERNEL_INFO_OFFSET : {
>> *(.rodata.kernel_info)
>> }
>> ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, "kernel_info 
>> at bad address!")
>>
>> .rodata.mle_header MLE_HEADER_OFFSET : {
>> *(.rodata.mle_header)
>> }
>> ASSERT(ABSOLUTE(mle_header) == MLE_HEADER_OFFSET, "mle_header at 
>> bad address!")
>>
>> arch/x86/boot/compressed/sl_stub.S:
>> #define mleh_rva(X) (((X) - mle_header) + MLE_HEADER_OFFSET)
>>
>> .section ".rodata.mle_header", "a"
>>
>> SYM_DATA_START(mle_header)
>> .long   0x9082ac5a/* UUID0 */
>> .long   0x74a7476f/* UUID1 */
>> .long   0xa2555c0f/* UUID2 */
>> .long   0x42b651cb/* UUID3 */
>> .long   0x0034/* MLE header size */
>> .long   0x00020002/* MLE version 2.2 */
>> .long   mleh_rva(sl_stub_entry)/* Linear entry point of MLE 
>> (virt. address) */
>> .long   0x/* First valid page of MLE */
>> .long   0x/* Offset within binary of first byte of 
>> MLE */
>> .long   0x/* Offset within binary of last byte + 1 
>> of MLE */
>> .long   0x0223/* Bit vector of MLE-supported 
>> capabilities */
>> .long   0x/* Starting linear address of command line 
>> (unused) */
>> .long   0x/* Ending linear address of command line 
>> (unused) */
>> SYM_DATA_END(mle_header)
>>
>> Of course MLE_HEADER_OFFSET has to be defined as a constant somewhere.
>> Anyway, is it acceptable?

 What do you think about my MLE_HEADER_OFFSET and related stuff proposal?

>>>
>>> I'm wondering if it would be easier to just allow relocations in these
>>> special "header" sections. I need to check how easy/hard it is to do
>>> that without triggering linker warnings.
>>
>> Ross and I are still bouncing some ideas around. We came to the conclusion
>> that putting mle_header into the kernel .rodata.kernel_info section, or even
>> the arch/x86/boot/compressed/kernel_info.S file, would be the easiest thing
>> to do at this point. Of course I would suggest some renaming too, e.g.
>> .rodata.kernel_info to .rodata.kernel_headers, etc. Does that make sense
>> to you?
>>
>> Daniel
> 
> I haven't been able to come up with any different options that don't
> require post-processing of the kernel image. Allowing relocations in
> specific sections seems to not be possible with lld, and anyway would
> require the fields to be 64-bit sized so it doesn't really help.
> 
> Putting mle_header into kernel_info seems like a reasonable thing to me,
> and if you do that, putting it into kernel_info.S would seem to be
> necessary?  Would you also have a fixed field with the offset of the

That seems like a reasonable place for it to go.

> mle_header from kernel_info?  That seems nicer than having the
> bootloader scan the variable data for magic strings.

Yes, kernel_info will have a field with the offset of the mle_header. I
agree that would be nicer.

Thanks
Ross

> 



Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Jason Wang


On 2020/10/22 1:51 AM, Raj, Ashok wrote:

> On Wed, Oct 21, 2020 at 08:48:29AM -0300, Jason Gunthorpe wrote:
>> On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
>>> On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
>>>>> On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
>>>>>> On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
>>>>>>> I think we agreed (or agree to disagree and commit) that for the
>>>>>>> device types we have for SIOV, the VFIO based approach works well
>>>>>>> without having to re-invent another way to do the same things. Not
>>>>>>> looking for a shortcut by any means, but we need to plan around
>>>>>>> existing hardware. Looks like vDPA took some shortcuts then by not
>>>>>>> abstracting the iommu uAPI instead :-)? If all the necessary
>>>>>>> hardware had been available, this would be a solved puzzle.
>>>>>>
>>>>>> I think it is the opposite, vIOMMU and related has outgrown VFIO as
>>>>>> the "home" and needs to stand alone.
>>>>>>
>>>>>> Apparently the HW that will need PASID for vDPA is Intel HW, so if
>>>>>
>>>>> So just to make this clear: I did check internally whether there are
>>>>> any plans for vDPA + SVM. There are none at the moment.
>>>>
>>>> Not SVM, SIOV.
>>>
>>> ... And that includes it. I should have said vDPA + PASID; no current
>>> plans. I have no idea who set expectations with you. Can you please put
>>> me in touch with that person? Privately is fine.
>>
>> It was the team that argued VDPA had to be done through VFIO - SIOV
>> and PASID was one of their reasons it had to be VFIO, check the list
>> archives
>
> Humm... I could search the archives, but the point is I'm confirming that
> there is no forward-looking plan!
>
> And whoever did was basing it on a hypothetical strawman argument that
> wasn't grounded in reality.
>
>> If they didn't plan to use it, bit of a strawman argument, right?
>
> This doesn't need to continue like the debates :-) Pun intended :-)
>
> I don't think it makes any sense to have an abstract, strawman-argument
> design discussion. Yi is looking into PASID management alone. The rest
> of the IOMMU related topics should wait until we have another
> *real* use requiring consolidation.
>
> Contrary to your argument, vDPA going with a half-blown, device-only
> iommu user without considering existing abstractions like containers
> and such in VFIO is part of the reason the gap is big at the moment.
> And you might not agree, but that's beside the point.


Can you explain why it must care about VFIO abstractions? vDPA is trying
to hide device details, which is fundamentally different from what VFIO
wants to do. vDPA allows the parent to deal with IOMMU stuff, and if
necessary, the parent can talk with IOMMU drivers directly via IOMMU APIs.


> Rather than pivoting ourselves around hypothetical, strawman,
> non-intersecting arguments, suggesting an architecture without having done
> a proof of concept to validate the proposal should stop. We have to ground
> ourselves in reality.


The reality is that VFIO should not be the only user of (v)SVA/SIOV/PASID.
The kernel already has users like GPU or uacce.


> For the use cases we have so far for SIOV, VFIO and mdev seem to be the
> right candidates and address them well. Now you might disagree, but as
> noted we all agreed to move past this.


The mdev is not perfect for sure, but that's another topic.

If you (Intel) don't have plans to do vDPA, you should not prevent other
vendors from implementing PASID capable hardware through a non-VFIO
subsystem/uAPI on top of your SIOV architecture. Isn't that right?

So if Intel is willing to collaborate on the PoC, I'd be happy to help.
E.g. it's not hard to have a PASID capable virtio device through qemu,
and we can start from there.


Thanks

[PATCH v4 0/7] arm64: Default to 32-bit wide ZONE_DMA

2020-10-21 Thread Nicolas Saenz Julienne
Using two distinct DMA zones turned out to be problematic. Here's an
attempt to go back to a saner default.

I tested this on both a RPi4 and QEMU.

---

Changes since v3:
 - Drop patch adding define in dma-mapping
 - Address small review changes
 - Update Ard's patch
 - Add new patch removing examples from mmzone.h

Changes since v2:
 - Introduce Ard's patch
 - Improve OF dma-ranges parsing function
 - Add unit test for OF function
 - Address small changes
 - Move crashkernel reservation later in boot process

Changes since v1:
 - Parse dma-ranges instead of using machine compatible string

Ard Biesheuvel (1):
  arm64: mm: Set ZONE_DMA size based on early IORT scan

Nicolas Saenz Julienne (6):
  arm64: mm: Move reserve_crashkernel() into mem_init()
  arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()
  of/address: Introduce of_dma_get_max_cpu_address()
  of: unittest: Add test for of_dma_get_max_cpu_address()
  arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges
  mm: Remove examples from enum zone_type comment

 arch/arm64/mm/init.c  | 16 ++--
 drivers/acpi/arm64/iort.c | 52 +++
 drivers/of/address.c  | 42 +++
 drivers/of/unittest.c | 18 ++
 include/linux/acpi_iort.h |  4 +++
 include/linux/mmzone.h| 20 ---
 include/linux/of.h|  7 ++
 7 files changed, 130 insertions(+), 29 deletions(-)

-- 
2.28.0



Re: [PATCH 07/13] x86: Secure Launch kernel early boot stub

2020-10-21 Thread Arvind Sankar
On Wed, Oct 21, 2020 at 05:28:33PM +0200, Daniel Kiper wrote:
> On Mon, Oct 19, 2020 at 01:18:22PM -0400, Arvind Sankar wrote:
> > On Mon, Oct 19, 2020 at 04:51:53PM +0200, Daniel Kiper wrote:
> > > On Fri, Oct 16, 2020 at 04:51:51PM -0400, Arvind Sankar wrote:
> > > > On Thu, Oct 15, 2020 at 08:26:54PM +0200, Daniel Kiper wrote:
> > > > >
> > > > > I am discussing with Ross the other option. We can create
> > > > > .rodata.mle_header section and put it at fixed offset as
> > > > > kernel_info is. So, we would have, e.g.:
> > > > >
> > > > > arch/x86/boot/compressed/vmlinux.lds.S:
> > > > > .rodata.kernel_info KERNEL_INFO_OFFSET : {
> > > > > *(.rodata.kernel_info)
> > > > > }
> > > > > ASSERT(ABSOLUTE(kernel_info) == KERNEL_INFO_OFFSET, 
> > > > > "kernel_info at bad address!")
> > > > >
> > > > > .rodata.mle_header MLE_HEADER_OFFSET : {
> > > > > *(.rodata.mle_header)
> > > > > }
> > > > > ASSERT(ABSOLUTE(mle_header) == MLE_HEADER_OFFSET, "mle_header 
> > > > > at bad address!")
> > > > >
> > > > > arch/x86/boot/compressed/sl_stub.S:
> > > > > #define mleh_rva(X) (((X) - mle_header) + MLE_HEADER_OFFSET)
> > > > >
> > > > > .section ".rodata.mle_header", "a"
> > > > >
> > > > > SYM_DATA_START(mle_header)
> > > > > .long   0x9082ac5a/* UUID0 */
> > > > > .long   0x74a7476f/* UUID1 */
> > > > > .long   0xa2555c0f/* UUID2 */
> > > > > .long   0x42b651cb/* UUID3 */
> > > > > .long   0x0034/* MLE header size */
> > > > > .long   0x00020002/* MLE version 2.2 */
> > > > > .long   mleh_rva(sl_stub_entry)/* Linear entry point of 
> > > > > MLE (virt. address) */
> > > > > .long   0x/* First valid page of MLE */
> > > > > .long   0x/* Offset within binary of first byte 
> > > > > of MLE */
> > > > > .long   0x/* Offset within binary of last byte + 
> > > > > 1 of MLE */
> > > > > .long   0x0223/* Bit vector of MLE-supported 
> > > > > capabilities */
> > > > > .long   0x/* Starting linear address of command 
> > > > > line (unused) */
> > > > > .long   0x/* Ending linear address of command 
> > > > > line (unused) */
> > > > > SYM_DATA_END(mle_header)
> > > > >
> > > > > Of course MLE_HEADER_OFFSET has to be defined as a constant somewhere.
> > > > > Anyway, is it acceptable?
> > >
> > > What do you think about my MLE_HEADER_OFFSET and related stuff proposal?
> > >
> >
> > I'm wondering if it would be easier to just allow relocations in these
> > special "header" sections. I need to check how easy/hard it is to do
> > that without triggering linker warnings.
> 
> Ross and I are still bouncing some ideas around. We came to the conclusion
> that putting mle_header into the kernel .rodata.kernel_info section, or even
> the arch/x86/boot/compressed/kernel_info.S file, would be the easiest thing
> to do at this point. Of course I would suggest some renaming too, e.g.
> .rodata.kernel_info to .rodata.kernel_headers, etc. Does that make sense
> to you?
> 
> Daniel

I haven't been able to come up with any different options that don't
require post-processing of the kernel image. Allowing relocations in
specific sections seems to not be possible with lld, and anyway would
require the fields to be 64-bit sized so it doesn't really help.

Putting mle_header into kernel_info seems like a reasonable thing to me,
and if you do that, putting it into kernel_info.S would seem to be
necessary?  Would you also have a fixed field with the offset of the
mle_header from kernel_info?  That seems nicer than having the
bootloader scan the variable data for magic strings.
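To make that concrete, a sketch of the fixed part of kernel_info as a
bootloader might view it, if the proposed field is added (the
mle_header_offset name and placement are hypothetical; the first four
fields mirror the existing kernel_info layout):

#include <stdint.h>

struct kernel_info {
	char	 header[4];	/* "LToP" magic */
	uint32_t size;		/* size of the fixed part */
	uint32_t size_total;	/* fixed part plus variable-size data */
	uint32_t setup_type_max;
	uint32_t mle_header_offset;	/* proposed: offset of mle_header
					 * within the protected-mode image,
					 * so no magic-string scanning */
};

The bootloader would locate kernel_info through the boot protocol's
kernel_info offset field in the setup header, then read mle_header_offset
directly.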


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Jason Gunthorpe
On Wed, Oct 21, 2020 at 01:03:15PM -0700, Raj, Ashok wrote:

> I'm not sure why you tie IDXD and VDPA together here. How IDXD uses native
> SVM is orthogonal to how we achieve mdev passthrough to the guest and
> vSVM.

Everyone assumes that vIOMMU and SIOV aka PASID are going to be needed
on the VDPA side as well; I think that is why JasonW brought this up
in the first place.

We may not see vSVA for VDPA, but that seems like some special sub
mode of all the other vIOMMU and PASID stuff, and not a completely
distinct thing.

Jason


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Raj, Ashok
On Wed, Oct 21, 2020 at 08:32:18PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 21, 2020 at 01:03:15PM -0700, Raj, Ashok wrote:
> 
> > I'm not sure why you tie IDXD and VDPA together here. How IDXD uses
> > native SVM is orthogonal to how we achieve mdev passthrough to the
> > guest and vSVM.
> 
> Everyone assumes that vIOMMU and SIOV aka PASID is going to be needed
> on the VDPA side as well, I think that is why JasonW brought this up
> in the first place.

True; to that effect, we are working on moving PASID allocation outside
of VFIO, so that both agents, VFIO and vDPA with PASID, when that becomes
available, can support one way to allocate and manage PASIDs from user
space.

Since the IOASID is almost standalone, this is possible.


Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Jason Gunthorpe
On Wed, Oct 21, 2020 at 10:51:46AM -0700, Raj, Ashok wrote:

> > If they didn't plan to use it, bit of a strawman argument, right?
> 
> This doesn't need to continue like the debates :-) Pun intended :-)
> 
> I don't think it makes any sense to have an abstract, strawman-argument
> design discussion. Yi is looking into PASID management alone. The rest
> of the IOMMU related topics should wait until we have another
> *real* use requiring consolidation.

Actually I'm really annoyed right now that the other Intel team wasted
quite a lot of the rest of our time arguing about vDPA and vfio
with no actual interest in this technology.

So you'll excuse me if I'm not particularly enamored with this
discussion right now.

> Contrary to your argument, vDPA going with a half-blown, device-only
> iommu user without considering existing abstractions like containers

VDPA IOMMU was done *for Intel*, as the kind of half-architected thing
you are advocating should be allowed for IDXD here. Not sure why you
think bashing that work is going to help your case here.

I'm saying Intel needs to get its architecture together and stop
creating this mess across the kernel to support Intel devices.

Jason


Re: [PATCH v3 03/14] iommu/ioasid: Add a separate function for detach data

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:30PM -0700, Jacob Pan wrote:
> IOASID private data can be cleared by ioasid_attach_data() with a NULL
> data pointer. A common use case is for a caller to free the data
> afterward. ioasid_attach_data() calls synchronize_rcu() before returning,
> so that the data can safely be freed with no outstanding readers.
> However, since synchronize_rcu() may sleep, ioasid_attach_data() cannot
> be used under spinlocks.
> 
> This patch adds ioasid_detach_data() as a separate API, so that
> synchronize_rcu() is called only in that path. ioasid_attach_data() can
> then be used under spinlocks. In addition, this change makes the API
> symmetrical.
> 
> Signed-off-by: Jacob Pan 

A typo below, but

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/intel/svm.c |  4 ++--
>  drivers/iommu/ioasid.c| 54 
> ++-
>  include/linux/ioasid.h|  5 -
>  3 files changed, 50 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c
> index 2c5645f0737a..06a16bee7b65 100644
> --- a/drivers/iommu/intel/svm.c
> +++ b/drivers/iommu/intel/svm.c
> @@ -398,7 +398,7 @@ int intel_svm_bind_gpasid(struct iommu_domain *domain, 
> struct device *dev,
>   list_add_rcu(&sdev->list, &svm->devs);
>   out:
>   if (!IS_ERR_OR_NULL(svm) && list_empty(&svm->devs)) {
> - ioasid_attach_data(data->hpasid, NULL);
> + ioasid_detach_data(data->hpasid);
>   kfree(svm);
>   }
>  
> @@ -441,7 +441,7 @@ int intel_svm_unbind_gpasid(struct device *dev, int pasid)
>* the unbind, IOMMU driver will get notified
>* and perform cleanup.
>*/
> - ioasid_attach_data(pasid, NULL);
> + ioasid_detach_data(pasid);
>   kfree(svm);
>   }
>   }
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 5f63af07acd5..6cfbdfb492e0 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -272,24 +272,58 @@ int ioasid_attach_data(ioasid_t ioasid, void *data)
>  
>   spin_lock(&ioasid_allocator_lock);
>   ioasid_data = xa_load(&active_allocator->xa, ioasid);
> - if (ioasid_data)
> - rcu_assign_pointer(ioasid_data->private, data);
> - else
> +
> + if (!ioasid_data) {
>   ret = -ENOENT;
> - spin_unlock(&ioasid_allocator_lock);
> + goto done_unlock;
> + }
>  
> - /*
> -  * Wait for readers to stop accessing the old private data, so the
> -  * caller can free it.
> -  */
> - if (!ret)
> - synchronize_rcu();
> + if (ioasid_data->private) {
> + ret = -EBUSY;
> + goto done_unlock;
> + }
> + rcu_assign_pointer(ioasid_data->private, data);
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
>  
>   return ret;
>  }
>  EXPORT_SYMBOL_GPL(ioasid_attach_data);
>  
>  /**
> + * ioasid_detach_data - Clear the private data of an ioasid
> + *
> + * @ioasid: the IOASIDD to clear private data

IOASID

> + */
> +void ioasid_detach_data(ioasid_t ioasid)
> +{
> + struct ioasid_data *ioasid_data;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ioasid_data = xa_load(&active_allocator->xa, ioasid);
> +
> + if (!ioasid_data) {
> + pr_warn("IOASID %u not found to detach data from\n", ioasid);
> + goto done_unlock;
> + }
> +
> + if (ioasid_data->private) {
> + rcu_assign_pointer(ioasid_data->private, NULL);
> + goto done_unlock;
> + }
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> + /*
> +  * Wait for readers to stop accessing the old private data,
> +  * so the caller can free it.
> +  */
> + synchronize_rcu();
> +}
> +EXPORT_SYMBOL_GPL(ioasid_detach_data);
> +
> +/**
>   * ioasid_alloc - Allocate an IOASID
>   * @set: the IOASID set
>   * @min: the minimum ID (inclusive)
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 9c44947a68c8..c7f649fa970a 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -40,7 +40,7 @@ void *ioasid_find(struct ioasid_set *set, ioasid_t ioasid,
>  int ioasid_register_allocator(struct ioasid_allocator_ops *allocator);
>  void ioasid_unregister_allocator(struct ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
> -
> +void ioasid_detach_data(ioasid_t ioasid);
>  #else /* !CONFIG_IOASID */
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   ioasid_t max, void *private)
> @@ -72,5 +72,8 @@ static inline int ioasid_attach_data(ioasid_t ioasid, void 
> *data)
>   return -ENOTSUPP;
>  }
>  
> +static inline void ioasid_detach_data(ioasid_t ioasid)
> +{
> +}
>  #endif /* CONFIG_IOASID */
>  #endif /* __LINUX_IOASID_H */
> -- 
> 2.7.4
> 
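To illustrate the calling convention after this change (the caller and its
state type are invented for the example): ioasid_attach_data() may now sit
under a spinlock, while ioasid_detach_data(), which still calls
synchronize_rcu(), may not.

static int example_set_private(ioasid_t pasid, struct example_state *st)
{
	int ret;

	spin_lock(&st->lock);
	ret = ioasid_attach_data(pasid, st);	/* no longer sleeps */
	spin_unlock(&st->lock);

	return ret;
}

static void example_clear_private(ioasid_t pasid, struct example_state *st)
{
	/* may sleep: waits for RCU readers of the private data */
	ioasid_detach_data(pasid);
	kfree(st);				/* now safe to free */
}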

Re: [PATCH v3 08/14] iommu/ioasid: Add reference counting functions

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:35PM -0700, Jacob Pan wrote:
> There can be multiple users of an IOASID, each user could have hardware
> contexts associated with the IOASID. In order to align lifecycles,
> reference counting is introduced in this patch. It is expected that when
> an IOASID is being freed, each user will drop a reference only after its
> context is cleared.
> 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/ioasid.c | 117 
> +
>  include/linux/ioasid.h |  24 ++
>  2 files changed, 141 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 9628e78b2ab4..828cc44b1b1c 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -16,8 +16,26 @@ static ioasid_t ioasid_capacity = PCI_PASID_MAX;
>  static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>  static DEFINE_XARRAY_ALLOC(ioasid_sets);
>  
> +enum ioasid_state {
> + IOASID_STATE_INACTIVE,
> + IOASID_STATE_ACTIVE,
> + IOASID_STATE_FREE_PENDING,
> +};
> +
> +/**
> + * struct ioasid_data - Meta data about ioasid
> + *
> + * @id:  Unique ID
> + * @users:   Number of active users
> + * @state:   Track state of the IOASID
> + * @set: ioasid_set of the IOASID belongs to
> + * @private: Private data associated with the IOASID
> + * @rcu: For free after RCU grace period
> + */
>  struct ioasid_data {
>   ioasid_t id;
> + refcount_t users;
> + enum ioasid_state state;
>   struct ioasid_set *set;
>   void *private;
>   struct rcu_head rcu;
> @@ -511,6 +529,8 @@ ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t 
> min, ioasid_t max,
>   goto exit_free;
>   }
>   data->id = id;
> + data->state = IOASID_STATE_ACTIVE;
> + refcount_set(&data->users, 1);
>  
>   /* Store IOASID in the per set data */
>   if (xa_err(xa_store(&set->xa, id, data, GFP_ATOMIC))) {
> @@ -560,6 +580,14 @@ static void ioasid_free_locked(struct ioasid_set *set, 
> ioasid_t ioasid)
>   if (WARN_ON(!xa_load(&ioasid_sets, data->set->id)))
>   return;
>  
> + /* Free is already in progress */
> + if (data->state == IOASID_STATE_FREE_PENDING)
> + return;

But the previous call to ioasid_free_locked() dropped a reference, then
returned because more refs were held. Shouldn't this call also
dec_and_test() the reference and call ioasid_do_free_locked() if
necessary?

> +
> + data->state = IOASID_STATE_FREE_PENDING;
> + if (!refcount_dec_and_test(&data->users))
> + return;
> +
>   ioasid_do_free_locked(data);
>  }
>  
> @@ -717,6 +745,95 @@ void ioasid_set_for_each_ioasid(struct ioasid_set *set,
>  }
>  EXPORT_SYMBOL_GPL(ioasid_set_for_each_ioasid);
>  
> +int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to get unknown IOASID %u\n", ioasid);
> + return -EINVAL;
> + }
> + if (data->state == IOASID_STATE_FREE_PENDING) {
> + pr_err("Trying to get IOASID being freed%u\n", ioasid);

Strange placement of the %u

> + return -EBUSY;
> + }
> +
> + /* Check set ownership if the set is non-null */
> + if (set && data->set != set) {
> + pr_err("Trying to get IOASID %u outside the set\n", ioasid);
> + /* data found but does not belong to the set */
> + return -EACCES;
> + }
> + refcount_inc(&data->users);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_locked);

If this is a public facing API, let's add a lockdep_assert_held() to make
sure callers do hold the right lock. Same for ioasid_put_locked().

Thanks,
Jean

> +
> +/**
> + * ioasid_get - Obtain a reference to an ioasid
> + * @set: the ioasid_set to check permission against if not NULL
> + * @ioasid:  the ID to remove
> + *
> + *
> + * Return: 0 on success, error if failed.
> + */
> +int ioasid_get(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + int ret;
> +
> + spin_lock(&ioasid_allocator_lock);
> + ret = ioasid_get_locked(set, ioasid);
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get);
> +
> +bool ioasid_put_locked(struct ioasid_set *set, ioasid_t ioasid)
> +{
> + struct ioasid_data *data;
> +
> + data = xa_load(&active_allocator->xa, ioasid);
> + if (!data) {
> + pr_err("Trying to put unknown IOASID %u\n", ioasid);
> + return false;
> + }
> + if (set && data->set != set) {
> + pr_err("Trying to drop IOASID %u outside the set\n", ioasid);
> + return false;
> + }
> + if (!refcount_dec_and_test(&data->users))
> + return false;
> +
> + ioasid_do_free_locked(data);
> +
> + return true;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_put_locked);
> +
> +/**
> + * ioasid_put - Release a reference to an ioasid
> + * @set: the ioasid_set 
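For reference, the suggested assertion would look something like this in
ioasid_get_locked() (a sketch only; the rest of the body follows the
patch above):

int ioasid_get_locked(struct ioasid_set *set, ioasid_t ioasid)
{
	struct ioasid_data *data;

	/* catch callers that don't hold the allocator lock */
	lockdep_assert_held(&ioasid_allocator_lock);

	data = xa_load(&active_allocator->xa, ioasid);
	/* ... state and ownership checks as in the patch above ... */
	refcount_inc(&data->users);

	return 0;
}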

Re: [PATCH v3 05/14] iommu/ioasid: Redefine IOASID set and allocation APIs

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:32PM -0700, Jacob Pan wrote:
> ioasid_set was introduced as an arbitrary token that is shared by a
> group of IOASIDs. For example, two IOASIDs allocated via the same
> ioasid_set pointer belong to the same set.
> 
> For guest SVA usages, system-wide IOASID resources need to be
> partitioned such that each VM can have its own quota and being managed
> separately. ioasid_set is the perfect candidate for meeting such
> requirements. This patch redefines and extends ioasid_set with the
> following new fields:
> - Quota
> - Reference count
> - Storage of its namespace
> - The token is now stored in the ioasid_set with types
> 
> Basic ioasid_set level APIs are introduced that wire up these new data.
> Existing users of IOASID APIs are converted where a host IOASID set is
> allocated for bare-metal usage.
> 
> Signed-off-by: Liu Yi L 
> Signed-off-by: Jacob Pan 
> ---
>  drivers/iommu/intel/iommu.c |  26 +++--
>  drivers/iommu/intel/pasid.h |   1 +
>  drivers/iommu/intel/svm.c   |  25 ++--
>  drivers/iommu/ioasid.c  | 277 
> 
>  include/linux/ioasid.h  |  53 +++--
>  5 files changed, 333 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index e7bcb299e51e..872391890323 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -104,6 +104,9 @@
>   */
>  #define INTEL_IOMMU_PGSIZES  (~0xFFFUL)
>  
> +/* PASIDs used by host SVM */
> +struct ioasid_set *host_pasid_set;
> +
>  static inline int agaw_to_level(int agaw)
>  {
>   return agaw + 2;
> @@ -3147,8 +3150,8 @@ static void intel_vcmd_ioasid_free(ioasid_t ioasid, 
> void *data)
>* Sanity check the ioasid owner is done at upper layer, e.g. VFIO
>* We can only free the PASID when all the devices are unbound.
>*/
> - if (ioasid_find(NULL, ioasid, NULL)) {
> - pr_alert("Cannot free active IOASID %d\n", ioasid);
> + if (IS_ERR(ioasid_find(host_pasid_set, ioasid, NULL))) {
> + pr_err("IOASID %d to be freed but not in system set\n", ioasid);
>   return;
>   }
>   vcmd_free_pasid(iommu, ioasid);
> @@ -,8 +3336,17 @@ static int __init init_dmars(void)
>   goto free_iommu;
>  
>   /* PASID is needed for scalable mode irrespective to SVM */
> - if (intel_iommu_sm)
> + if (intel_iommu_sm) {
>   ioasid_install_capacity(intel_pasid_max_id);
> + /* We should not run out of IOASIDs at boot */
> + host_pasid_set = ioasid_set_alloc(NULL, PID_MAX_DEFAULT,
> +   IOASID_SET_TYPE_NULL);
> + if (IS_ERR_OR_NULL(host_pasid_set)) {
> + pr_err("Failed to allocate host PASID set %lu\n",

Printing a negative value as unsigned

> + PTR_ERR(host_pasid_set));
> + intel_iommu_sm = 0;
> + }
> + }
>  
>   /*
>* for each drhd
> @@ -3381,7 +3393,7 @@ static int __init init_dmars(void)
>   disable_dmar_iommu(iommu);
>   free_dmar_iommu(iommu);
>   }
> -
> + ioasid_set_put(host_pasid_set);
>   kfree(g_iommus);
>  
>  error:
> @@ -5163,7 +5175,7 @@ static void auxiliary_unlink_device(struct dmar_domain 
> *domain,
>   domain->auxd_refcnt--;
>  
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  }
>  
>  static int aux_domain_add_dev(struct dmar_domain *domain,
> @@ -5181,7 +5193,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   int pasid;
>  
>   /* No private data needed for the default pasid */
> - pasid = ioasid_alloc(NULL, PASID_MIN,
> + pasid = ioasid_alloc(host_pasid_set, PASID_MIN,
>pci_max_pasids(to_pci_dev(dev)) - 1,
>NULL);
>   if (pasid == INVALID_IOASID) {
> @@ -5224,7 +5236,7 @@ static int aux_domain_add_dev(struct dmar_domain 
> *domain,
>   spin_unlock(&iommu->lock);
>   spin_unlock_irqrestore(&device_domain_lock, flags);
>   if (!domain->auxd_refcnt && domain->default_pasid > 0)
> - ioasid_free(domain->default_pasid);
> + ioasid_free(host_pasid_set, domain->default_pasid);
>  
>   return ret;
>  }
> diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
> index c9850766c3a9..ccdc23446015 100644
> --- a/drivers/iommu/intel/pasid.h
> +++ b/drivers/iommu/intel/pasid.h
> @@ -99,6 +99,7 @@ static inline bool pasid_pte_is_present(struct pasid_entry 
> *pte)
>  }
>  
>  extern u32 intel_pasid_max_id;
> +extern struct ioasid_set *host_pasid_set;
>  int intel_pasid_alloc_id(void *ptr, int start, int end, gfp_t gfp);
>  void intel_pasid_free_id(int pasid);
>  void 

Re: [PATCH v3 06/14] iommu/ioasid: Introduce API to adjust the quota of an ioasid_set

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:33PM -0700, Jacob Pan wrote:
> Each ioasid_set is given a quota during allocation. As system
> administrators balance resources among VMs, we shall support the
> adjustment of quota at runtime. The new quota cannot be less than the
> outstanding IOASIDs already allocated within the set. The extra quota
> will be returned to the system-wide IOASID pool if the new quota is
> smaller than the existing one.
> 
> Signed-off-by: Jacob Pan 

Minor comments below, but

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/ioasid.c | 47 +++
>  include/linux/ioasid.h |  6 ++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 61e25c2375ab..cf8c7d34e2de 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -654,6 +654,53 @@ void ioasid_set_put(struct ioasid_set *set)
>  EXPORT_SYMBOL_GPL(ioasid_set_put);
>  
>  /**
> + * ioasid_adjust_set - Adjust the quota of an IOASID set
> + * @set: IOASID set to be assigned
> + * @quota:   Quota allowed in this set
> + *
> + * Return 0 on success. If the new quota is smaller than the number of
> + * IOASIDs already allocated, -EINVAL will be returned. No change will be
> + * made to the existing quota.
> + */
> +int ioasid_adjust_set(struct ioasid_set *set, int quota)

@quota probably doesn't need to be signed (since it's the same as an
ioasid_t, which is unsigned).

> +{
> + int ret = 0;
> +
> + if (quota <= 0)
> + return -EINVAL;
> +
> + spin_lock(&ioasid_allocator_lock);
> + if (set->nr_ioasids > quota) {
> + pr_err("New quota %d is smaller than outstanding IOASIDs %d\n",
> + quota, set->nr_ioasids);
> + ret = -EINVAL;
> + goto done_unlock;
> + }
> +
> + if ((quota > set->quota) &&
> + (quota - set->quota > ioasid_capacity_avail)) {

misaligned

> + ret = -ENOSPC;
> + goto done_unlock;
> + }
> +
> + /* Return the delta back to system pool */
> + ioasid_capacity_avail += set->quota - quota;
> +
> + /*
> +  * May have a policy to prevent giving all available IOASIDs
> +  * to one set. But we don't enforce here, it should be in the
> +  * upper layers.
> +  */

I think it would be OK to check for fairness here. But anyway, we don't
have to worry about this yet, so I'd just drop the comment.

Thanks,
Jean

> + set->quota = quota;
> +
> +done_unlock:
> + spin_unlock(&ioasid_allocator_lock);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_adjust_set);
> +
> +/**
>   * ioasid_find - Find IOASID data
>   * @set: the IOASID set
>   * @ioasid: the IOASID to find
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index 1ae213b660f0..0a5e82148eb9 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -62,6 +62,7 @@ struct ioasid_allocator_ops {
>  void ioasid_install_capacity(ioasid_t total);
>  ioasid_t ioasid_get_capacity(void);
>  struct ioasid_set *ioasid_set_alloc(void *token, ioasid_t quota, int type);
> +int ioasid_adjust_set(struct ioasid_set *set, int quota);
>  void ioasid_set_get(struct ioasid_set *set);
>  void ioasid_set_put(struct ioasid_set *set);
>  
> @@ -99,6 +100,11 @@ static inline struct ioasid_set *ioasid_set_alloc(void 
> *token, ioasid_t quota, i
>   return ERR_PTR(-ENOTSUPP);
>  }
>  
> +static inline int ioasid_adjust_set(struct ioasid_set *set, int quota)
> +{
> + return -ENOTSUPP;
> +}
> +
>  static inline void ioasid_set_put(struct ioasid_set *set)
>  {
>  }
> -- 
> 2.7.4
> 
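Caller-side, the API is expected to be used along these lines
(hypothetical VM-management helper):

/* Shrink or grow a VM's PASID quota at runtime. Fails with -EINVAL if
 * the set already holds more IOASIDs than the new quota, and -ENOSPC
 * if the system pool cannot cover an increase.
 */
static int example_resize_vm_quota(struct ioasid_set *vm_set, int new_quota)
{
	return ioasid_adjust_set(vm_set, new_quota);
}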


Re: [PATCH v3 04/14] iommu/ioasid: Support setting system-wide capacity

2020-10-21 Thread Jean-Philippe Brucker
On Mon, Sep 28, 2020 at 02:38:31PM -0700, Jacob Pan wrote:
> IOASID is a system-wide resource that could vary on different systems.
> The default capacity is 20 bits as defined in the PCI-E specifications.
> This patch adds a function to allow adjusting system IOASID capacity.
> For VT-d this is set during boot as part of the Intel IOMMU
> initialization.
> 
> Signed-off-by: Jacob Pan 

Reviewed-by: Jean-Philippe Brucker 

> ---
>  drivers/iommu/intel/iommu.c |  5 +
>  drivers/iommu/ioasid.c  | 20 
>  include/linux/ioasid.h  | 11 +++
>  3 files changed, 36 insertions(+)
> 
> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
> index 18ed3b3c70d7..e7bcb299e51e 100644
> --- a/drivers/iommu/intel/iommu.c
> +++ b/drivers/iommu/intel/iommu.c
> @@ -42,6 +42,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

(not in alphabetical order)

>  #include 
>  #include 
>  #include 
> @@ -3331,6 +3332,10 @@ static int __init init_dmars(void)
>   if (ret)
>   goto free_iommu;
>  
> + /* PASID is needed for scalable mode irrespective to SVM */
> + if (intel_iommu_sm)
> + ioasid_install_capacity(intel_pasid_max_id);
> +
>   /*
>* for each drhd
>*   enable fault log
> diff --git a/drivers/iommu/ioasid.c b/drivers/iommu/ioasid.c
> index 6cfbdfb492e0..4277cb17e15b 100644
> --- a/drivers/iommu/ioasid.c
> +++ b/drivers/iommu/ioasid.c
> @@ -10,6 +10,10 @@
>  #include 
>  #include 
>  
> +/* Default to PCIe standard 20 bit PASID */
> +#define PCI_PASID_MAX 0x100000
> +static ioasid_t ioasid_capacity = PCI_PASID_MAX;
> +static ioasid_t ioasid_capacity_avail = PCI_PASID_MAX;
>  struct ioasid_data {
>   ioasid_t id;
>   struct ioasid_set *set;
> @@ -17,6 +21,22 @@ struct ioasid_data {
>   struct rcu_head rcu;
>  };
>  
> +void ioasid_install_capacity(ioasid_t total)
> +{
> + if (ioasid_capacity && ioasid_capacity != PCI_PASID_MAX) {
> + pr_warn("IOASID capacity is already set.\n");
> + return;
> + }
> + ioasid_capacity = ioasid_capacity_avail = total;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_install_capacity);
> +
> +ioasid_t ioasid_get_capacity(void)
> +{
> + return ioasid_capacity;
> +}
> +EXPORT_SYMBOL_GPL(ioasid_get_capacity);
> +
>  /*
>   * struct ioasid_allocator_data - Internal data structure to hold information
>   * about an allocator. There are two types of allocators:
> diff --git a/include/linux/ioasid.h b/include/linux/ioasid.h
> index c7f649fa970a..7fc320656be2 100644
> --- a/include/linux/ioasid.h
> +++ b/include/linux/ioasid.h
> @@ -32,6 +32,8 @@ struct ioasid_allocator_ops {
>  #define DECLARE_IOASID_SET(name) struct ioasid_set name = { 0 }
>  
>  #if IS_ENABLED(CONFIG_IOASID)
> +void ioasid_install_capacity(ioasid_t total);
> +ioasid_t ioasid_get_capacity(void);
>  ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min, ioasid_t max,
> void *private);
>  void ioasid_free(ioasid_t ioasid);
> @@ -42,6 +44,15 @@ void ioasid_unregister_allocator(struct 
> ioasid_allocator_ops *allocator);
>  int ioasid_attach_data(ioasid_t ioasid, void *data);
>  void ioasid_detach_data(ioasid_t ioasid);
>  #else /* !CONFIG_IOASID */
> +static inline void ioasid_install_capacity(ioasid_t total)
> +{
> +}
> +
> +static inline ioasid_t ioasid_get_capacity(void)
> +{
> + return 0;
> +}
> +
>  static inline ioasid_t ioasid_alloc(struct ioasid_set *set, ioasid_t min,
>   ioasid_t max, void *private)
>  {
> -- 
> 2.7.4
> 


[PATCH v4 6/7] arm64: mm: Set ZONE_DMA size based on early IORT scan

2020-10-21 Thread Nicolas Saenz Julienne
From: Ard Biesheuvel 

We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

Instructing the DMA layer about these limitations is straight-forward,
even though we had to fix some issues regarding memory limits set in
the IORT for named components, and regarding the handling of ACPI _DMA
methods. However, the DMA layer also needs to be able to allocate
memory that is guaranteed to meet those DMA constraints, for bounce
buffering as well as allocating the backing for consistent mappings.

This is why the 1 GB ZONE_DMA was introduced recently. Unfortunately,
it turns out that having a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes
problems with kdump, and potentially in other places where allocations
cannot cross zone boundaries. Therefore, we should avoid having two
separate DMA zones when possible.

So let's do an early scan of the IORT, and only create the ZONE_DMA
if we encounter any devices that need it. This puts the burden on
the firmware to describe such limitations in the IORT, which may be
redundant (and less precise) if _DMA methods are also being provided.
However, it should be noted that this situation is highly unusual for
arm64 ACPI machines. Also, the DMA subsystem still gives precedence to
the _DMA method if implemented, and so we will not lose the ability to
perform streaming DMA outside the ZONE_DMA if the _DMA method permits
it.

Cc: Jeremy Linton 
Cc: Lorenzo Pieralisi 
Cc: Nicolas Saenz Julienne 
Cc: Rob Herring 
Cc: Christoph Hellwig 
Cc: Robin Murphy 
Cc: Hanjun Guo 
Cc: Sudeep Holla 
Cc: Anshuman Khandual 
Signed-off-by: Ard Biesheuvel 
[nsaenz: Rebased, removed documentation change and add declaration in 
acpi_iort.h]
Signed-off-by: Nicolas Saenz Julienne 

---

Changes since v3:
 - Use min_not_zero()
 - Check ACPI revision
 - Remove unnecessary #ifdef in zone_sizes_init()

 arch/arm64/mm/init.c  |  3 ++-
 drivers/acpi/arm64/iort.c | 52 +++
 include/linux/acpi_iort.h |  4 +++
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 94e38f99748b..f5d4f85506e4 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -29,6 +29,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -190,7 +191,7 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
 
 #ifdef CONFIG_ZONE_DMA
dt_zone_dma_bits = ilog2(of_dma_get_max_cpu_address(NULL));
-   zone_dma_bits = min(32U, dt_zone_dma_bits);
+   zone_dma_bits = min3(32U, dt_zone_dma_bits, acpi_iort_get_zone_dma_size());
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
diff --git a/drivers/acpi/arm64/iort.c b/drivers/acpi/arm64/iort.c
index 9929ff50c0c0..05fe4a076bab 100644
--- a/drivers/acpi/arm64/iort.c
+++ b/drivers/acpi/arm64/iort.c
@@ -1718,3 +1718,55 @@ void __init acpi_iort_init(void)
 
iort_init_platform_devices();
 }
+
+#ifdef CONFIG_ZONE_DMA
+/*
+ * Check the IORT whether any devices exist whose DMA mask is < 32 bits.
+ * If so, return the smallest value encountered, or 32 otherwise.
+ */
+unsigned int __init acpi_iort_get_zone_dma_size(void)
+{
+   struct acpi_table_iort *iort;
+   struct acpi_iort_node *node, *end;
+   acpi_status status;
+   u8 limit = 32;
+   int i;
+
+   if (acpi_disabled)
+   return limit;
+
+   status = acpi_get_table(ACPI_SIG_IORT, 0,
+   (struct acpi_table_header **)&iort);
+   if (ACPI_FAILURE(status))
+   return limit;
+
+   node = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->node_offset);
+   end = ACPI_ADD_PTR(struct acpi_iort_node, iort, iort->header.length);
+
+   for (i = 0; i < iort->node_count; i++) {
+   if (node >= end)
+   break;
+
+   switch (node->type) {
+   struct acpi_iort_named_component *ncomp;
+   struct acpi_iort_root_complex *rc;
+
+   case ACPI_IORT_NODE_NAMED_COMPONENT:
+   ncomp = (struct acpi_iort_named_component *)node->node_data;
+   limit = min_not_zero(limit, ncomp->memory_address_limit);
+   break;
+
+   case ACPI_IORT_NODE_PCI_ROOT_COMPLEX:
+   if (node->revision < 1)
+   break;
+
+   rc = (struct acpi_iort_root_complex *)node->node_data;
+   limit = min_not_zero(limit, rc->memory_address_limit);
+   break;
+   }
+   node = ACPI_ADD_PTR(struct acpi_iort_node, node, node->length);
+   }
+
+   acpi_put_table(&iort->header);
+
+   return limit;
+}
+#endif

[PATCH v4 5/7] arm64: mm: Set ZONE_DMA size based on devicetree's dma-ranges

2020-10-21 Thread Nicolas Saenz Julienne
We recently introduced a 1 GB sized ZONE_DMA to cater for platforms
incorporating masters that can address less than 32 bits of DMA, in
particular the Raspberry Pi 4, which has 4 or 8 GB of DRAM, but has
peripherals that can only address up to 1 GB (and its PCIe host
bridge can only access the bottom 3 GB).

The DMA layer also needs to be able to allocate memory that is
guaranteed to meet those DMA constraints, for bounce buffering as well
as allocating the backing for consistent mappings. This is why the 1 GB
ZONE_DMA was introduced recently. Unfortunately, it turns out that having
a 1 GB ZONE_DMA as well as a ZONE_DMA32 causes problems with kdump, and
potentially in other places where allocations cannot cross zone
boundaries. Therefore, we should avoid having two separate DMA zones
when possible.

So, with the help of of_dma_get_max_cpu_address(), get the topmost
physical address accessible to all DMA masters in the system and use that
information to fine-tune ZONE_DMA's size. In the absence of addressing
limited masters, ZONE_DMA will span the whole 32-bit address space;
otherwise, in the case of the Raspberry Pi 4, it'll only span the 30-bit
address space, with ZONE_DMA32 covering the rest of the 32-bit address
space.
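
To illustrate the math (a standalone sketch, not part of the patch;
ilog2() is re-implemented so the example compiles on its own):

#include <stdio.h>

/* How a 1 GB (0x40000000) dma-ranges ceiling, as on the Raspberry Pi 4,
 * becomes a 30-bit ZONE_DMA. */
static unsigned int ilog2(unsigned long long v)
{
	unsigned int bits = 0;

	while (v >>= 1)
		bits++;
	return bits;
}

int main(void)
{
	unsigned long long max_cpu_addr = 0x40000000ULL; /* from dma-ranges */
	unsigned int dt_zone_dma_bits = ilog2(max_cpu_addr);	/* 30 */
	unsigned int zone_dma_bits = dt_zone_dma_bits < 32 ? dt_zone_dma_bits : 32;

	printf("zone_dma_bits = %u\n", zone_dma_bits);	/* ZONE_DMA spans 1 GB */
	return 0;
}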

Signed-off-by: Nicolas Saenz Julienne 

---

Changes since v3:
 - Simplify code for readability.

Changes since v2:
 - Updated commit log by shamelessly copying Ard's ACPI commit log

 arch/arm64/mm/init.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 410721fc4fc0..94e38f99748b 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -42,8 +42,6 @@
 #include 
 #include 
 
-#define ARM64_ZONE_DMA_BITS	30
-
 /*
  * We need to be able to catch inadvertent references to memstart_addr
  * that occur (potentially in generic code) before arm64_memblock_init()
@@ -188,9 +186,11 @@ static phys_addr_t __init max_zone_phys(unsigned int zone_bits)
 static void __init zone_sizes_init(unsigned long min, unsigned long max)
 {
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
+   unsigned int __maybe_unused dt_zone_dma_bits;
 
 #ifdef CONFIG_ZONE_DMA
-   zone_dma_bits = ARM64_ZONE_DMA_BITS;
+   dt_zone_dma_bits = ilog2(of_dma_get_max_cpu_address(NULL));
+   zone_dma_bits = min(32U, dt_zone_dma_bits);
arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
-- 
2.28.0



[PATCH v4 4/7] of: unittest: Add test for of_dma_get_max_cpu_address()

2020-10-21 Thread Nicolas Saenz Julienne
Introduce a test for of_dma_get_max_cpu_address(); it uses the same DT
data as the rest of the dma-ranges unit tests.
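
As a worked example (range values hypothetical, chosen only to mirror the
shape of the test data), of_dma_get_max_cpu_address() reports the
exclusive end of the highest dma-ranges mapping it finds, which is where
the expected value in the assertion below comes from:

#include <stdio.h>

/* A dma-ranges entry mapping to CPU address 0x40000000 with size
 * 0x10000000 makes the helper report 0x50000000. */
int main(void)
{
	unsigned long long cpu_addr = 0x40000000ULL;	/* range.cpu_addr */
	unsigned long long size = 0x10000000ULL;	/* range.size */

	printf("max CPU address: %#llx\n", cpu_addr + size); /* 0x50000000 */
	return 0;
}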

Signed-off-by: Nicolas Saenz Julienne 

---
Changes since v3:
 - Remove HAS_DMA guards

 drivers/of/unittest.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/drivers/of/unittest.c b/drivers/of/unittest.c
index 06cc988faf78..b9a4d047a95e 100644
--- a/drivers/of/unittest.c
+++ b/drivers/of/unittest.c
@@ -869,6 +869,23 @@ static void __init of_unittest_changeset(void)
 #endif
 }
 
+static void __init of_unittest_dma_get_max_cpu_address(void)
+{
+   struct device_node *np;
+   phys_addr_t cpu_addr;
+
+   np = of_find_node_by_path("/testcase-data/address-tests");
+   if (!np) {
+   pr_err("missing testcase data\n");
+   return;
+   }
+
+   cpu_addr = of_dma_get_max_cpu_address(np);
+   unittest(cpu_addr == 0x50000000,
+            "of_dma_get_max_cpu_address: wrong CPU addr %pad (expecting %x)\n",
+            &cpu_addr, 0x50000000);
+}
+
 static void __init of_unittest_dma_ranges_one(const char *path,
u64 expect_dma_addr, u64 expect_paddr)
 {
@@ -3266,6 +3283,7 @@ static int __init of_unittest(void)
of_unittest_changeset();
of_unittest_parse_interrupts();
of_unittest_parse_interrupts_extended();
+   of_unittest_dma_get_max_cpu_address();
of_unittest_parse_dma_ranges();
of_unittest_pci_dma_ranges();
of_unittest_match_node();
-- 
2.28.0



[PATCH v4 1/7] arm64: mm: Move reserve_crashkernel() into mem_init()

2020-10-21 Thread Nicolas Saenz Julienne
crashkernel might reserve memory located in ZONE_DMA. We plan to delay
ZONE_DMA's initialization after unflattening the devicetree and ACPI's
boot table initialization, so move it later in the boot process.
Specifically into mem_init(), this is the last place crashkernel will be
able to reserve the memory before the page allocator kicks in. There
isn't any apparent reason for doing this earlier.

Signed-off-by: Nicolas Saenz Julienne 
---
 arch/arm64/mm/init.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 095540667f0f..fc4ab0d6d5d2 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -386,8 +386,6 @@ void __init arm64_memblock_init(void)
else
arm64_dma32_phys_limit = PHYS_MASK + 1;
 
-   reserve_crashkernel();
-
reserve_elfcorehdr();
 
high_memory = __va(memblock_end_of_DRAM() - 1) + 1;
@@ -508,6 +506,8 @@ void __init mem_init(void)
else
swiotlb_force = SWIOTLB_NO_FORCE;
 
+   reserve_crashkernel();
+
set_max_mapnr(max_pfn - PHYS_PFN_OFFSET);
 
 #ifndef CONFIG_SPARSEMEM_VMEMMAP
-- 
2.28.0



[PATCH v4 2/7] arm64: mm: Move zone_dma_bits initialization into zone_sizes_init()

2020-10-21 Thread Nicolas Saenz Julienne
zone_dma_bits's initialization happens earlier than it's actually
needed, in arm64_memblock_init(). So move it into the more suitable
zone_sizes_init().

Signed-off-by: Nicolas Saenz Julienne 
---
 arch/arm64/mm/init.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index fc4ab0d6d5d2..410721fc4fc0 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -190,6 +190,8 @@ static void __init zone_sizes_init(unsigned long min, unsigned long max)
unsigned long max_zone_pfns[MAX_NR_ZONES]  = {0};
 
 #ifdef CONFIG_ZONE_DMA
+   zone_dma_bits = ARM64_ZONE_DMA_BITS;
+   arm64_dma_phys_limit = max_zone_phys(zone_dma_bits);
max_zone_pfns[ZONE_DMA] = PFN_DOWN(arm64_dma_phys_limit);
 #endif
 #ifdef CONFIG_ZONE_DMA32
@@ -376,11 +378,6 @@ void __init arm64_memblock_init(void)
 
early_init_fdt_scan_reserved_mem();
 
-   if (IS_ENABLED(CONFIG_ZONE_DMA)) {
-   zone_dma_bits = ARM64_ZONE_DMA_BITS;
-   arm64_dma_phys_limit = max_zone_phys(ARM64_ZONE_DMA_BITS);
-   }
-
if (IS_ENABLED(CONFIG_ZONE_DMA32))
arm64_dma32_phys_limit = max_zone_phys(32);
else
-- 
2.28.0



[PATCH v4 3/7] of/address: Introduce of_dma_get_max_cpu_address()

2020-10-21 Thread Nicolas Saenz Julienne
Introduce of_dma_get_max_cpu_address(), which provides the highest CPU
physical address addressable by all DMA masters in the system. It's
especially useful for setting memory zone sizes at early boot time.
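
A hedged usage sketch (illustrative only, not part of this series;
example_dma_ceiling() and bus_np are hypothetical):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/of.h>

/* Passing NULL walks the whole tree, as early arch code would; passing a
 * node restricts the walk to that node's sub-tree, as the unit test does. */
static phys_addr_t __init example_dma_ceiling(struct device_node *bus_np)
{
	phys_addr_t system_max = of_dma_get_max_cpu_address(NULL);
	phys_addr_t subtree_max = of_dma_get_max_cpu_address(bus_np);

	return min(system_max, subtree_max);
}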

Signed-off-by: Nicolas Saenz Julienne 

---

Changes since v3:
 - use u64 with cpu_end

Changes since v2:
 - Use PHYS_ADDR_MAX
 - return phys_dma_t
 - Rename function
 - Correct subject
 - Add support to start parsing from an arbitrary device node in order
   for the function to work with unit tests

 drivers/of/address.c | 42 ++
 include/linux/of.h   |  7 +++
 2 files changed, 49 insertions(+)

diff --git a/drivers/of/address.c b/drivers/of/address.c
index eb9ab4f1e80b..47dfe5881e18 100644
--- a/drivers/of/address.c
+++ b/drivers/of/address.c
@@ -1024,6 +1024,48 @@ int of_dma_get_range(struct device_node *np, const struct bus_dma_region **map)
 }
 #endif /* CONFIG_HAS_DMA */
 
+/**
+ * of_dma_get_max_cpu_address - Gets highest CPU address suitable for DMA
+ * @np: The node to start searching from or NULL to start from the root
+ *
+ * Gets the highest CPU physical address that is addressable by all DMA masters
+ * in the sub-tree pointed by np, or the whole tree if NULL is passed. If no
+ * DMA constrained device is found, it returns PHYS_ADDR_MAX.
+ */
+phys_addr_t __init of_dma_get_max_cpu_address(struct device_node *np)
+{
+   phys_addr_t max_cpu_addr = PHYS_ADDR_MAX;
+   struct of_range_parser parser;
+   phys_addr_t subtree_max_addr;
+   struct device_node *child;
+   struct of_range range;
+   const __be32 *ranges;
+   u64 cpu_end = 0;
+   int len;
+
+   if (!np)
+   np = of_root;
+
+   ranges = of_get_property(np, "dma-ranges", &len);
+   if (ranges && len) {
+   of_dma_range_parser_init(&parser, np);
+   for_each_of_range(&parser, &range)
+   if (range.cpu_addr + range.size > cpu_end)
+   cpu_end = range.cpu_addr + range.size;
+
+   if (max_cpu_addr > cpu_end)
+   max_cpu_addr = cpu_end;
+   }
+
+   for_each_available_child_of_node(np, child) {
+   subtree_max_addr = of_dma_get_max_cpu_address(child);
+   if (max_cpu_addr > subtree_max_addr)
+   max_cpu_addr = subtree_max_addr;
+   }
+
+   return max_cpu_addr;
+}
+
 /**
  * of_dma_is_coherent - Check if device is coherent
 * @np: device node
diff --git a/include/linux/of.h b/include/linux/of.h
index 481ec0467285..db8db8f2c967 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -558,6 +558,8 @@ int of_map_id(struct device_node *np, u32 id,
   const char *map_name, const char *map_mask_name,
   struct device_node **target, u32 *id_out);
 
+phys_addr_t of_dma_get_max_cpu_address(struct device_node *np);
+
 #else /* CONFIG_OF */
 
 static inline void of_core_init(void)
@@ -995,6 +997,11 @@ static inline int of_map_id(struct device_node *np, u32 id,
return -EINVAL;
 }
 
+static inline phys_addr_t of_dma_get_max_cpu_address(struct device_node *np)
+{
+   return PHYS_ADDR_MAX;
+}
+
 #define of_match_ptr(_ptr) NULL
 #define of_match_node(_matches, _node) NULL
 #endif /* CONFIG_OF */
-- 
2.28.0



Re: [PATCH 2/4] iommu/mediatek: Add iotlb_sync_range() support

2020-10-21 Thread Robin Murphy

On 2020-10-19 12:30, Chao Hao wrote:

The MTK_IOMMU driver currently writes one page entry and does a TLB
flush at a time. It would be more optimal to aggregate the writes and
flush the bus buffer at the end.


That's exactly what iommu_iotlb_gather_add_page() is meant to achieve. 
Rather than jumping straight into hacking up a new API to go round the 
back of the existing API design, it would be far better to ask the 
question of why that's not behaving as expected.



For a 50 MB buffer mapping, if the mtk_iommu driver uses iotlb_sync_range()
instead of tlb_add_range() and tlb_flush_walk/leaf(), performance improves
by 50% or more (depending on the page sizes involved) compared with
flushing after each page entry update. So we prefer to use
iotlb_sync_range() to replace iotlb_sync(), tlb_add_range() and
tlb_flush_walk/leaf() on MTK platforms.


In the case of mapping, it sounds like what you actually want to do is 
hook up .iotlb_sync_map and generally make IO_PGTABLE_QUIRK_TLBI_ON_MAP 
cleverer, because the current implementation is as dumb as it could 
possibly be. In fact if we simply passed an address range to 
.iotlb_sync_map, io-pgtable probably wouldn't need to be involved at all 
any more.
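
For illustration, a minimal sketch of that direction (hypothetical: a
ranged .iotlb_sync_map callback does not exist in the tree today;
mtk_iommu_tlb_flush_range_sync() is the driver helper shown in the patch
below):

/* If .iotlb_sync_map carried the just-mapped range, the driver could
 * flush once per map operation instead of once per page. */
static void mtk_iommu_iotlb_sync_map(struct iommu_domain *domain,
				     unsigned long iova, size_t size)
{
	mtk_iommu_tlb_flush_range_sync(iova, size, 0, NULL);
}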


Robin.


Signed-off-by: Chao Hao 
---
  drivers/iommu/mtk_iommu.c | 6 ++
  1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c
index 785b228d39a6..d3400c15ff7b 100644
--- a/drivers/iommu/mtk_iommu.c
+++ b/drivers/iommu/mtk_iommu.c
@@ -224,6 +224,11 @@ static void mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size,
}
  }
  
+static void __mtk_iommu_tlb_flush_range_sync(unsigned long iova, size_t size)
+{
+   mtk_iommu_tlb_flush_range_sync(iova, size, 0, NULL);
+}
+
  static void mtk_iommu_tlb_flush_page_nosync(struct iommu_iotlb_gather *gather,
unsigned long iova, size_t granule,
void *cookie)
@@ -536,6 +541,7 @@ static const struct iommu_ops mtk_iommu_ops = {
.map            = mtk_iommu_map,
.unmap          = mtk_iommu_unmap,
.flush_iotlb_all = mtk_iommu_flush_iotlb_all,
+   .iotlb_sync_range = __mtk_iommu_tlb_flush_range_sync,
.iotlb_sync = mtk_iommu_iotlb_sync,
.iova_to_phys   = mtk_iommu_iova_to_phys,
.probe_device   = mtk_iommu_probe_device,




Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs

2020-10-21 Thread Jason Gunthorpe
On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
> On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > > I think we agreed (or agree to disagree and commit) for device types 
> > > > > that 
> > > > > we have for SIOV, VFIO based approach works well without having to 
> > > > > re-invent 
> > > > > another way to do the same things. Not looking for a shortcut by any 
> > > > > means, 
> > > > > but we need to plan around existing hardware though. Looks like vDPA 
> > > > > took 
> > > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > > necessary hardware was available.. This would be a solved puzzle. 
> > > > 
> > > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > > the "home" and needs to stand alone.
> > > > 
> > > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > > 
> > > So just to make this clear, I did check internally if there are any plans
> > > for vDPA + SVM. There are none at the moment. 
> > 
> > Not SVM, SIOV.
> 
> ... And that included.. I should have said vDPA + PASID, No current plans. 
> I have no idea who set expectations with you. Can you please put me in touch 
> with that person, privately is fine.

It was the team that argued VDPA had to be done through VFIO - SIOV
and PASID was one of their reasons it had to be VFIO; check the list
archives.

If they didn't plan to use it, bit of a strawman argument, right?

Jason