Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-05-31 Thread Jonas Gorski
Hi,

On Tue, 30 May 2023 at 23:34, Bjorn Helgaas  wrote:
> On Fri, May 12, 2023 at 02:48:51PM -0500, Bjorn Helgaas wrote:
> > On Fri, May 12, 2023 at 01:56:29PM +0300, Andy Shevchenko wrote:
> > > On Tue, May 09, 2023 at 01:21:22PM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Apr 04, 2023 at 11:11:01AM -0500, Bjorn Helgaas wrote:
> > > > > On Thu, Mar 30, 2023 at 07:24:27PM +0300, Andy Shevchenko wrote:
> > > > > > Provide two new helper macros to iterate over PCI device
> > > > > > resources and convert users.
> > > >
> > > > > Applied 2-7 to pci/resource for v6.4, thanks, I really like this!
> > > >
> > > > This is 09cc90063240 ("PCI: Introduce pci_dev_for_each_resource()")
> > > > upstream now.
> > > >
> > > > Coverity complains about each use,
> > >
> > > It needs more clarification here. Use of reduced variant of the
> > > macro or all of them? If the former one, then I can speculate that
> > > Coverity (famous for false positives) simply doesn't understand `for
> > > (type var; var ...)` code.
> >
> > True, Coverity finds false positives.  It flagged every use in
> > drivers/pci and drivers/pnp.  It didn't mention the arch/alpha, arm,
> > mips, powerpc, sh, or sparc uses, but I think it just didn't look at
> > those.
> >
> > It flagged both:
> >
> >   pbus_size_io    pci_dev_for_each_resource(dev, r)
> >   pbus_size_mem   pci_dev_for_each_resource(dev, r, i)
> >
> > Here's a spreadsheet with a few more details (unfortunately I don't
> > know how to make it dump the actual line numbers or analysis like I
> > pasted below, so "pci_dev_for_each_resource" doesn't appear).  These
> > are mostly in the "Drivers-PCI" component.
> >
> > https://docs.google.com/spreadsheets/d/1ohOJwxqXXoDUA0gwopgk-z-6ArLvhN7AZn4mIlDkHhQ/edit?usp=sharing
> >
> > These particular reports are in the "High Impact Outstanding" tab.
>
> Where are we at?  Are we going to ignore this because some Coverity
> reports are false positives?

Looking at the code, I understand where Coverity is coming from:

#define __pci_dev_for_each_res0(dev, res, ...)                         \
        for (unsigned int __b = 0;                                     \
             res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;  \
             __b++)

res is assigned before __b is checked for being less than
PCI_NUM_RESOURCES, so when the loop terminates (__b ==
PCI_NUM_RESOURCES) it has already been made to point past the end of
the array.

Rewriting the test expression as

        __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));

should avoid the (Coverity) warning by relying on short-circuit
evaluation: pci_resource_n() is then only evaluated while __b is in
range.

It probably makes the code slightly slower, as res will now also be
checked for being non-NULL (which is always true), but I doubt the
difference will be significant (or that any of the users are in hot
paths).
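To make the difference between the two test expressions concrete, here is a
minimal userspace sketch. It is not kernel code: NUM_RES, struct dev,
resource_n() and the two for_each_res_*() macros are hypothetical stand-ins
for PCI_NUM_RESOURCES, struct pci_dev, pci_resource_n() and the real macro.
It only counts how often the lookup is made with an out-of-range index.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_RES 4  /* hypothetical stand-in for PCI_NUM_RESOURCES */

struct res { int flags; };
struct dev { struct res resource[NUM_RES]; };

/* Number of lookups made with an index that is already out of range. */
static int lookups_past_end;

/* Mock of pci_resource_n(); only the index bookkeeping matters here. */
static struct res *resource_n(struct dev *d, unsigned int n)
{
    if (n >= NUM_RES)
        lookups_past_end++;
    return d->resource + n;  /* at most one past the end, never dereferenced */
}

/* Comma form: the assignment runs before the bound check on every test. */
#define for_each_res_comma(d, r)                                  \
    for (unsigned int i_ = 0;                                     \
         (r) = resource_n((d), i_), i_ < NUM_RES;                 \
         i_++)

/* && form: the lookup is skipped as soon as the index is out of range. */
#define for_each_res_and(d, r)                                    \
    for (unsigned int i_ = 0;                                     \
         i_ < NUM_RES && ((r) = resource_n((d), i_));             \
         i_++)

/* Walk all resources with the comma form; returns lookups past the end. */
static int past_end_with_comma(void)
{
    struct dev d = { 0 };
    struct res *r;
    int visited = 0;

    lookups_past_end = 0;
    for_each_res_comma(&d, r) {
        assert(r != NULL);
        visited++;
    }
    assert(visited == NUM_RES);
    return lookups_past_end;
}

/* Same walk with the && form. */
static int past_end_with_and(void)
{
    struct dev d = { 0 };
    struct res *r;
    int visited = 0;

    lookups_past_end = 0;
    for_each_res_and(&d, r) {
        assert(r != NULL);
        visited++;
    }
    assert(visited == NUM_RES);
    return lookups_past_end;
}
```

Both forms visit the same NUM_RES entries; they differ only in whether the
final, failing evaluation of the loop condition still performs a lookup.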

Regards,
Jonas



Re: [PATCH v3 22/34] csky: Convert __pte_free_tlb() to use ptdescs

2023-05-31 Thread Guo Ren
Acked-by: Guo Ren 

On Thu, Jun 1, 2023 at 5:34 AM Vishal Moola (Oracle) wrote:
>
> Part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents.
>
> Signed-off-by: Vishal Moola (Oracle) 
> ---
>  arch/csky/include/asm/pgalloc.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
> index 7d57e5da0914..9c84c9012e53 100644
> --- a/arch/csky/include/asm/pgalloc.h
> +++ b/arch/csky/include/asm/pgalloc.h
> @@ -63,8 +63,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
>
>  #define __pte_free_tlb(tlb, pte, address)  \
>  do {   \
> -   pgtable_pte_page_dtor(pte); \
> -   tlb_remove_page(tlb, pte);  \
> +   pagetable_pte_dtor(page_ptdesc(pte));   \
> +   tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));  \
>  } while (0)
>
>  extern void pagetable_init(void);
> --
> 2.40.1
>


-- 
Best Regards
 Guo Ren



[qemu-mainline test] 181057: regressions - FAIL

2023-05-31 Thread osstest service owner
flight 181057 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181057/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-i386-libvirt-pair 10 xen-install/src_host fail REGR. vs. 180691
 test-amd64-amd64-libvirt-pair 30 leak-check/check/src_host fail REGR. vs. 180691
 test-amd64-amd64-libvirt-pair 31 leak-check/check/dst_host fail REGR. vs. 180691
 test-amd64-i386-libvirt  23 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-xsm 23 leak-check/check fail REGR. vs. 180691
 build-arm64-xsm   6 xen-build fail REGR. vs. 180691
 build-arm64   6 xen-build fail REGR. vs. 180691
 test-amd64-amd64-libvirt 23 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-xsm  23 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-vhd 19 guest-start/debian.repeat fail REGR. vs. 180691
 test-amd64-i386-xl-vhd  21 guest-start/debian.repeat fail REGR. vs. 180691
 test-amd64-i386-libvirt-raw  22 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-xl-qcow2 24 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt 21 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt-qcow2 20 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-xl-vhd  20 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt-raw 20 leak-check/check fail REGR. vs. 180691

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-vhd   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 build-arm64-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 180691
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 180691
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail like 180691
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 180691
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180691
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 180691
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl  15 migrate-support-check fail never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-arndale  15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale  16 saverestore-support-check fail never pass
 

[PATCH v2 2/2] net: Update MemReentrancyGuard for NIC

2023-05-31 Thread Akihiko Odaki
Recently MemReentrancyGuard was added to DeviceState to record that the
device is engaging in I/O. The network device backend needs to update it
when delivering a packet to a device.

This implementation follows what the bottom half code does, but it does
not add a tracepoint for the case where the network device backend starts
delivering a packet to a device that is already engaged in I/O. This is
because such reentrancy happens frequently for
qemu_flush_queued_packets() and is insignificant.

Fixes: CVE-2023-3019
Reported-by: Alexander Bulekov 
Signed-off-by: Akihiko Odaki 
---
 include/net/net.h |  1 +
 net/net.c | 14 ++
 2 files changed, 15 insertions(+)

diff --git a/include/net/net.h b/include/net/net.h
index a7d8deaccb..685ec58318 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -124,6 +124,7 @@ typedef QTAILQ_HEAD(NetClientStateList, NetClientState) NetClientStateList;
 typedef struct NICState {
     NetClientState *ncs;
     NICConf *conf;
+    MemReentrancyGuard *reentrancy_guard;
     void *opaque;
     bool peer_deleted;
 } NICState;
diff --git a/net/net.c b/net/net.c
index 982df2479f..3523cceafc 100644
--- a/net/net.c
+++ b/net/net.c
@@ -332,6 +332,7 @@ NICState *qemu_new_nic(NetClientInfo *info,
     nic = g_malloc0(info->size + sizeof(NetClientState) * queues);
     nic->ncs = (void *)nic + info->size;
     nic->conf = conf;
+    nic->reentrancy_guard = reentrancy_guard;
     nic->opaque = opaque;
 
 for (i = 0; i < queues; i++) {
@@ -805,6 +806,7 @@ static ssize_t qemu_deliver_packet_iov(NetClientState *sender,
                                        int iovcnt,
                                        void *opaque)
 {
+    MemReentrancyGuard *owned_reentrancy_guard;
     NetClientState *nc = opaque;
     int ret;
 
@@ -817,12 +819,24 @@ static ssize_t qemu_deliver_packet_iov(NetClientState *sender,
         return 0;
     }
 
+    if (nc->info->type != NET_CLIENT_DRIVER_NIC ||
+        qemu_get_nic(nc)->reentrancy_guard->engaged_in_io) {
+        owned_reentrancy_guard = NULL;
+    } else {
+        owned_reentrancy_guard = qemu_get_nic(nc)->reentrancy_guard;
+        owned_reentrancy_guard->engaged_in_io = true;
+    }
+
     if (nc->info->receive_iov && !(flags & QEMU_NET_PACKET_FLAG_RAW)) {
         ret = nc->info->receive_iov(nc, iov, iovcnt);
     } else {
         ret = nc_sendv_compat(nc, iov, iovcnt, flags);
     }
 
+    if (owned_reentrancy_guard) {
+        owned_reentrancy_guard->engaged_in_io = false;
+    }
+
     if (ret == 0) {
         nc->receive_disabled = 1;
     }
-- 
2.40.1




[PATCH v2 0/2] net: Update MemReentrancyGuard for NIC

2023-05-31 Thread Akihiko Odaki
Recently MemReentrancyGuard was added to DeviceState to record that the
device is engaging in I/O. The network device backend needs to update it
when delivering a packet to a device.

This implementation follows what the bottom half code does, but it does
not add a tracepoint for the case where the network device backend starts
delivering a packet to a device that is already engaged in I/O. This is
because such reentrancy happens frequently for
qemu_flush_queued_packets() and is insignificant.

This series consists of two patches. The first patch makes a bulk change to
add a new parameter to qemu_new_nic() and does not contain behavioral changes.
The second patch implements the actual MemReentrancyGuard update.
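
The "owned guard" idea behind the second patch can be sketched in isolation
as follows. This is an illustrative userspace model, not QEMU code: Guard,
deliver_guarded(), Probe, probe_receive() and run_probe() are hypothetical
names, and the NULL check on the guard merely stands in for the patch's
"is this client a NIC" test. The flag is raised only by the outermost
delivery and lowered only by the call that raised it, so a nested delivery
leaves it untouched.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for MemReentrancyGuard; only the flag is modeled. */
typedef struct Guard {
    bool engaged_in_io;
} Guard;

typedef int (*deliver_fn)(void *opaque);

/*
 * Set the flag around deliver() only if this call is the one that raised it.
 * A nested call sees the flag already set, owns nothing, and leaves it alone.
 */
static int deliver_guarded(Guard *guard, deliver_fn deliver, void *opaque)
{
    Guard *owned = NULL;
    int ret;

    if (guard && !guard->engaged_in_io) {
        owned = guard;
        owned->engaged_in_io = true;
    }

    ret = deliver(opaque);

    if (owned) {
        owned->engaged_in_io = false;
    }
    return ret;
}

/* Probe used to exercise one level of reentrancy. */
typedef struct Probe {
    Guard guard;
    int depth;
    bool engaged_after_nested;
} Probe;

/* Receive callback that re-enters delivery once, like a flushed queue. */
static int probe_receive(void *opaque)
{
    Probe *p = opaque;

    assert(p->guard.engaged_in_io);
    if (p->depth++ == 0) {
        deliver_guarded(&p->guard, probe_receive, p);
        /* The nested call must not have cleared the outer call's flag. */
        p->engaged_after_nested = p->guard.engaged_in_io;
    }
    return 1;
}

/* Run one outer delivery that triggers one nested delivery. */
static Probe run_probe(void)
{
    Probe p = { .guard = { false }, .depth = 0, .engaged_after_nested = false };

    deliver_guarded(&p.guard, probe_receive, &p);
    return p;
}
```

After run_probe() returns, the flag is clear again, while the value recorded
inside the callback shows that it stayed set across the nested delivery.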

V1 -> V2: Added the 'Fixes: CVE-2023-3019' tag

Akihiko Odaki (2):
  net: Provide MemReentrancyGuard * to qemu_new_nic()
  net: Update MemReentrancyGuard for NIC

 include/net/net.h |  2 ++
 hw/net/allwinner-sun8i-emac.c |  3 ++-
 hw/net/allwinner_emac.c   |  3 ++-
 hw/net/cadence_gem.c  |  3 ++-
 hw/net/dp8393x.c  |  3 ++-
 hw/net/e1000.c|  3 ++-
 hw/net/e1000e.c   |  2 +-
 hw/net/eepro100.c |  4 +++-
 hw/net/etraxfs_eth.c  |  3 ++-
 hw/net/fsl_etsec/etsec.c  |  3 ++-
 hw/net/ftgmac100.c|  3 ++-
 hw/net/i82596.c   |  2 +-
 hw/net/igb.c  |  2 +-
 hw/net/imx_fec.c  |  2 +-
 hw/net/lan9118.c  |  3 ++-
 hw/net/mcf_fec.c  |  3 ++-
 hw/net/mipsnet.c  |  3 ++-
 hw/net/msf2-emac.c|  3 ++-
 hw/net/mv88w8618_eth.c|  3 ++-
 hw/net/ne2000-isa.c   |  3 ++-
 hw/net/ne2000-pci.c   |  3 ++-
 hw/net/npcm7xx_emc.c  |  3 ++-
 hw/net/opencores_eth.c|  3 ++-
 hw/net/pcnet.c|  3 ++-
 hw/net/rocker/rocker_fp.c |  4 ++--
 hw/net/rtl8139.c  |  3 ++-
 hw/net/smc91c111.c|  3 ++-
 hw/net/spapr_llan.c   |  3 ++-
 hw/net/stellaris_enet.c   |  3 ++-
 hw/net/sungem.c   |  2 +-
 hw/net/sunhme.c   |  3 ++-
 hw/net/tulip.c|  3 ++-
 hw/net/virtio-net.c   |  6 --
 hw/net/vmxnet3.c  |  2 +-
 hw/net/xen_nic.c  |  4 ++--
 hw/net/xgmac.c|  3 ++-
 hw/net/xilinx_axienet.c   |  3 ++-
 hw/net/xilinx_ethlite.c   |  3 ++-
 hw/usb/dev-network.c  |  3 ++-
 net/net.c | 15 +++
 40 files changed, 90 insertions(+), 41 deletions(-)

-- 
2.40.1




[PATCH v2 1/2] net: Provide MemReentrancyGuard * to qemu_new_nic()

2023-05-31 Thread Akihiko Odaki
Recently MemReentrancyGuard was added to DeviceState to record that the
device is engaging in I/O. The network device backend needs to update it
when delivering a packet to a device.

In preparation for such a change, add MemReentrancyGuard * as a
parameter of qemu_new_nic().

Signed-off-by: Akihiko Odaki 
---
 include/net/net.h | 1 +
 hw/net/allwinner-sun8i-emac.c | 3 ++-
 hw/net/allwinner_emac.c   | 3 ++-
 hw/net/cadence_gem.c  | 3 ++-
 hw/net/dp8393x.c  | 3 ++-
 hw/net/e1000.c| 3 ++-
 hw/net/e1000e.c   | 2 +-
 hw/net/eepro100.c | 4 +++-
 hw/net/etraxfs_eth.c  | 3 ++-
 hw/net/fsl_etsec/etsec.c  | 3 ++-
 hw/net/ftgmac100.c| 3 ++-
 hw/net/i82596.c   | 2 +-
 hw/net/igb.c  | 2 +-
 hw/net/imx_fec.c  | 2 +-
 hw/net/lan9118.c  | 3 ++-
 hw/net/mcf_fec.c  | 3 ++-
 hw/net/mipsnet.c  | 3 ++-
 hw/net/msf2-emac.c| 3 ++-
 hw/net/mv88w8618_eth.c| 3 ++-
 hw/net/ne2000-isa.c   | 3 ++-
 hw/net/ne2000-pci.c   | 3 ++-
 hw/net/npcm7xx_emc.c  | 3 ++-
 hw/net/opencores_eth.c| 3 ++-
 hw/net/pcnet.c| 3 ++-
 hw/net/rocker/rocker_fp.c | 4 ++--
 hw/net/rtl8139.c  | 3 ++-
 hw/net/smc91c111.c| 3 ++-
 hw/net/spapr_llan.c   | 3 ++-
 hw/net/stellaris_enet.c   | 3 ++-
 hw/net/sungem.c   | 2 +-
 hw/net/sunhme.c   | 3 ++-
 hw/net/tulip.c| 3 ++-
 hw/net/virtio-net.c   | 6 --
 hw/net/vmxnet3.c  | 2 +-
 hw/net/xen_nic.c  | 4 ++--
 hw/net/xgmac.c| 3 ++-
 hw/net/xilinx_axienet.c   | 3 ++-
 hw/net/xilinx_ethlite.c   | 3 ++-
 hw/usb/dev-network.c  | 3 ++-
 net/net.c | 1 +
 40 files changed, 75 insertions(+), 41 deletions(-)

diff --git a/include/net/net.h b/include/net/net.h
index 1448d00afb..a7d8deaccb 100644
--- a/include/net/net.h
+++ b/include/net/net.h
@@ -157,6 +157,7 @@ NICState *qemu_new_nic(NetClientInfo *info,
                        NICConf *conf,
                        const char *model,
                        const char *name,
+                       MemReentrancyGuard *reentrancy_guard,
                        void *opaque);
 void qemu_del_nic(NICState *nic);
 NetClientState *qemu_get_subqueue(NICState *nic, int queue_index);
diff --git a/hw/net/allwinner-sun8i-emac.c b/hw/net/allwinner-sun8i-emac.c
index fac4405f45..cc350d40e5 100644
--- a/hw/net/allwinner-sun8i-emac.c
+++ b/hw/net/allwinner-sun8i-emac.c
@@ -824,7 +824,8 @@ static void allwinner_sun8i_emac_realize(DeviceState *dev, Error **errp)
 
     qemu_macaddr_default_if_unset(&s->conf.macaddr);
     s->nic = qemu_new_nic(&net_allwinner_sun8i_emac_info, &s->conf,
-                           object_get_typename(OBJECT(dev)), dev->id, s);
+                          object_get_typename(OBJECT(dev)), dev->id,
+                          &dev->mem_reentrancy_guard, s);
     qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a);
 }
 
diff --git a/hw/net/allwinner_emac.c b/hw/net/allwinner_emac.c
index 372e5b66da..e10965de14 100644
--- a/hw/net/allwinner_emac.c
+++ b/hw/net/allwinner_emac.c
@@ -453,7 +453,8 @@ static void aw_emac_realize(DeviceState *dev, Error **errp)
 
     qemu_macaddr_default_if_unset(&s->conf.macaddr);
     s->nic = qemu_new_nic(&net_aw_emac_info, &s->conf,
-                          object_get_typename(OBJECT(dev)), dev->id, s);
+                          object_get_typename(OBJECT(dev)), dev->id,
+                          &dev->mem_reentrancy_guard, s);
     qemu_format_nic_info_str(qemu_get_queue(s->nic), s->conf.macaddr.a);
 
     fifo8_create(&s->rx_fifo, RX_FIFO_SIZE);
diff --git a/hw/net/cadence_gem.c b/hw/net/cadence_gem.c
index 42ea2411a2..a7bce1c120 100644
--- a/hw/net/cadence_gem.c
+++ b/hw/net/cadence_gem.c
@@ -1633,7 +1633,8 @@ static void gem_realize(DeviceState *dev, Error **errp)
     qemu_macaddr_default_if_unset(&s->conf.macaddr);
 
     s->nic = qemu_new_nic(&net_gem_info, &s->conf,
-                          object_get_typename(OBJECT(dev)), dev->id, s);
+                          object_get_typename(OBJECT(dev)), dev->id,
+                          &dev->mem_reentrancy_guard, s);
 
     if (s->jumbo_max_len > MAX_FRAME_SIZE) {
         error_setg(errp, "jumbo-max-len is greater than %d",
diff --git a/hw/net/dp8393x.c b/hw/net/dp8393x.c
index 45b954e46c..abfcc6f69f 100644
--- a/hw/net/dp8393x.c
+++ b/hw/net/dp8393x.c
@@ -943,7 +943,8 @@ static void dp8393x_realize(DeviceState *dev, Error **errp)
   "dp8393x-regs", SONIC_REG_COUNT << s->it_shift);
 
     s->nic = qemu_new_nic(&net_dp83932_info, &s->conf,
-                          object_get_typename(OBJECT(dev)), dev->id, s);
+                          object_get_typename(OBJECT(dev)), dev->id,
+                          &dev->mem_reentrancy_guard, s);
 qemu_format_nic_info_str(qemu_get_queue(s->nic), 

RE: [XEN][PATCH v6 08/19] xen/device-tree: Add device_tree_find_node_by_path() to find nodes in device tree

2023-05-31 Thread Henry Wang
Hi Vikram,

> -----Original Message-----
> Hi Henry & Michal,
> Changed this for v7. Will send it out soon.
> 
> @Henry, I didn't add your Reviewed-by as the patch has changed a bit with
> the renaming. Can you please review v7 and give your feedback?

Thanks for the reminder, yes I would be more than happy to review
the v7 series once you send it. 

Kind regards,
Henry



[linux-linus test] 181033: regressions - trouble: blocked/broken/fail/pass

2023-05-31 Thread osstest service owner
flight 181033 linux-linus real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181033/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-armhf-libvirt  broken
 test-armhf-armhf-xl-credit1   8 xen-boot fail REGR. vs. 180278
 build-armhf-libvirt   5 host-build-prep  fail REGR. vs. 180278
 test-amd64-amd64-xl-vhd 21 guest-start/debian.repeat fail REGR. vs. 180278

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-examine  8 reboot   fail  like 180278
 test-armhf-armhf-xl-arndale   8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 180278
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 180278
 test-armhf-armhf-xl-credit2   8 xen-boot fail  like 180278
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180278
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 180278
 test-armhf-armhf-xl-multivcpu  8 xen-boot fail like 180278
 test-armhf-armhf-xl   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-vhd   8 xen-boot fail  like 180278
 test-armhf-armhf-xl-rtds  8 xen-boot fail  like 180278
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 180278
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1  15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1  16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl  15 migrate-support-check fail never pass
 test-arm64-arm64-xl  16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2  15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2  16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qcow2 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd  14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd  15 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-raw 14 migrate-support-check fail never pass

version targeted for testing:
 linux                48b1320a674e1ff5de2fad8606bee38f724594dc
baseline version:
 linux                6c538e1adbfc696ac4747fb10d63e704344f763d

Last test of basis   180278  2023-04-16 19:41:46 Z   45 days
Failing since        180281  2023-04-17 06:24:36 Z   44 days   84 attempts
Testing same since   181033  2023-05-31 12:43:43 Z    0 days    1 attempts


2558 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  broken  
 build-i386-libvirt   pass
 build-amd64-pvops  pass
 build-arm64-pvops  pass
 build-armhf-pvops  pass
 build-i386-pvops

[PATCH] automation: zen3 dom0pvh test

2023-05-31 Thread Stefano Stabellini
From: Stefano Stabellini 

Add a PVH Dom0 test for the zen3 runner.

Signed-off-by: Stefano Stabellini 
---
 automation/gitlab-ci/test.yaml | 8 
 1 file changed, 8 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index fbe2c0589a..d5cb238b0a 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -202,6 +202,14 @@ zen3p-smoke-x86-64-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.12-gcc-debug
 
+zen3p-smoke-x86-64-dom0pvh-gcc-debug:
+  extends: .zen3p-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh dom0pvh 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.12-gcc-debug
+
 zen3p-pci-hvm-x86-64-gcc-debug:
   extends: .zen3p-x86-64
   script:
-- 
2.25.1




[xen-unstable-smoke test] 181054: tolerable all pass - PUSHED

2023-05-31 Thread osstest service owner
flight 181054 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181054/

Failures :-/ but no regressions.

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail never pass
 test-armhf-armhf-xl  15 migrate-support-check fail never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail never pass

version targeted for testing:
 xen  dc98fa74446e5abe417e5ba9a6a632b50444cfa1
baseline version:
 xen  94200e1bae07e725cc07238c11569c5cab7befb7

Last test of basis   181018  2023-05-30 20:00:24 Z    1 days
Failing since        181031  2023-05-31 11:00:27 Z    0 days    4 attempts
Testing same since   181054  2023-05-31 19:00:25 Z    0 days    1 attempts


People who touched revisions under test:
  Bobby Eshleman 
  George Dunlap 
  Jan Beulich 
  Juergen Gross 
  Olaf Hering 
  Oleksii Kurochko 
  Roger Pau Monné 
  Stefano Stabellini 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  pass
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  pass
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/xen.git
   94200e1bae..dc98fa7444  dc98fa74446e5abe417e5ba9a6a632b50444cfa1 -> smoke



[PATCH v3 14/34] powerpc: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/powerpc/mm/book3s64/mmu_context.c | 10 +++---
 arch/powerpc/mm/book3s64/pgtable.c | 32 +-
 arch/powerpc/mm/pgtable-frag.c | 46 +-
 3 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/mmu_context.c b/arch/powerpc/mm/book3s64/mmu_context.c
index c766e4c26e42..1715b07c630c 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -246,15 +246,15 @@ static void destroy_contexts(mm_context_t *ctx)
 static void pmd_frag_destroy(void *pmd_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pmd_frag);
+   ptdesc = virt_to_ptdesc(pmd_frag);
/* drop all the pending references */
count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-   if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+   if (atomic_sub_and_test(PMD_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 85c84e89e3ea..1212deeabe15 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -306,22 +306,22 @@ static pmd_t *get_pmd_from_cache(struct mm_struct *mm)
 static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 {
void *ret = NULL;
-   struct page *page;
+   struct ptdesc *ptdesc;
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
if (mm == &init_mm)
gfp &= ~__GFP_ACCOUNT;
-   page = alloc_page(gfp);
-   if (!page)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pmd_page_ctor(page)) {
-   __free_pages(page, 0);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   atomic_set(&page->pt_frag_refcount, 1);
+   atomic_set(&ptdesc->pt_frag_refcount, 1);
 
-   ret = page_address(page);
+   ret = ptdesc_address(ptdesc);
/*
 * if we support only one fragment just return the
 * allocated page.
@@ -331,12 +331,12 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 
spin_lock(&mm->page_table_lock);
/*
-* If we find pgtable_page set, we return
+* If we find ptdesc_page set, we return
 * the allocated page with single fragment
 * count.
 */
if (likely(!mm->context.pmd_frag)) {
-   atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
+   atomic_set(&ptdesc->pt_frag_refcount, PMD_FRAG_NR);
mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
}
spin_unlock(&mm->page_table_lock);
@@ -357,15 +357,15 @@ pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
 
 void pmd_fragment_free(unsigned long *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
 
-   if (PageReserved(page))
-   return free_reserved_page(page);
+   if (pagetable_is_reserved(ptdesc))
+   return free_reserved_ptdesc(ptdesc);
 
-   BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
-   if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+   BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0);
+   if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..8961f1540209 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -18,15 +18,15 @@
 void pte_frag_destroy(void *pte_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pte_frag);
+   ptdesc = virt_to_ptdesc(pte_frag);
/* drop all the pending references */
count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-   if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   if (atomic_sub_and_test(PTE_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
@@ -55,25 +55,25 @@ static pte_t *get_pte_from_cache(struct mm_struct *mm)
 static pte_t *__alloc_for_ptecache(struct 

[PATCH v3 25/34] m68k: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/m68k/include/asm/mcf_pgalloc.h  | 41 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  8 +++---
 arch/m68k/mm/motorola.c  |  4 +--
 3 files changed, 27 insertions(+), 26 deletions(-)

diff --git a/arch/m68k/include/asm/mcf_pgalloc.h b/arch/m68k/include/asm/mcf_pgalloc.h
index 5c2c0a864524..9eb4ef9e6d77 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -7,20 +7,19 @@
 
 extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long) pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 extern const char bad_pmd_string[];
 
 extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   unsigned long page = __get_free_page(GFP_DMA);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
 
-   memset((void *)page, 0, PAGE_SIZE);
-   return (pte_t *) (page);
+   return (pte_t *) (ptdesc_address(ptdesc));
 }
 
 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
@@ -35,36 +34,36 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
  unsigned long address)
 {
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_DMA, 0);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA, 0);
pte_t *pte;
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pte = page_address(page);
-   clear_page(pte);
+   pte = ptdesc_address(ptdesc);
+   pagetable_clear(pte);
 
return pte;
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
{
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 /*
@@ -75,16 +74,18 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_page((unsigned long) pgd);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
pgd_t *new_pgd;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_NOWARN, 0);
 
-   new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN);
-   if (!new_pgd)
+   if (!ptdesc)
return NULL;
+   new_pgd = (pgd_t *) ptdesc_address(ptdesc);
+
memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT);
return new_pgd;
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h b/arch/m68k/include/asm/sun3_pgalloc.h
index 198036aff519..ff48573db2c0 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -17,10 +17,10 @@
 
 extern const char bad_pmd_string[];
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t *pte)
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index c75984e2d86b..594575a0780c 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -161,7 +161,7 @@ void *get_pointer_table(int type)
 * m68k doesn't have SPLIT_PTE_PTLOCKS for not having
 * SMP.
 */
-   pgtable_pte_page_ctor(virt_to_page(page));
+   

[PATCH v3 18/34] mm: Remove page table members from struct page

2023-05-31 Thread Vishal Moola (Oracle)
The page table members are now split out into their own ptdesc struct.
Remove them from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm_types.h | 14 --
 include/linux/pgtable.h  |  3 ---
 2 files changed, 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6161fe1ae5b8..31ffa1be21d0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -141,20 +141,6 @@ struct page {
struct {/* Tail pages of compound page */
unsigned long compound_head;/* Bit zero is set */
};
-   struct {/* Page table pages */
-   unsigned long _pt_pad_1;/* compound_head */
-   pgtable_t pmd_huge_pte; /* protected by page->ptl */
-   unsigned long _pt_s390_gaddr;   /* mapping */
-   union {
-   struct mm_struct *pt_mm; /* x86 pgds only */
-   atomic_t pt_frag_refcount; /* powerpc */
-   };
-#if ALLOC_SPLIT_PTLOCKS
-   spinlock_t *ptl;
-#else
-   spinlock_t ptl;
-#endif
-   };
struct {/* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5f12622d1521..3b89dd028973 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1020,10 +1020,7 @@ struct ptdesc {
 TABLE_MATCH(flags, __page_flags);
 TABLE_MATCH(compound_head, pt_list);
 TABLE_MATCH(compound_head, _pt_pad_1);
-TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
 TABLE_MATCH(mapping, _pt_s390_gaddr);
-TABLE_MATCH(pt_mm, pt_mm);
-TABLE_MATCH(ptl, ptl);
 #undef TABLE_MATCH
 static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 
-- 
2.40.1
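
The TABLE_MATCH() lines that remain in the hunk above keep struct ptdesc layout-compatible with struct page at compile time. A minimal userspace sketch of that technique follows. It assumes C11 and uses hypothetical stand-in types (demo_page, demo_ptdesc) and a hypothetical DEMO_TABLE_MATCH() macro; none of these are the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's struct page / struct ptdesc. */
struct demo_page {
	unsigned long flags;
	unsigned long compound_head;
	void *mapping;
	unsigned long private_data;
};

struct demo_ptdesc {
	unsigned long __page_flags;
	unsigned long _pt_pad_1;
	unsigned long _pt_s390_gaddr;
};

/*
 * Same idea as TABLE_MATCH(): a descriptor field that overlays a page
 * field must live at the same offset, or the build fails.
 */
#define DEMO_TABLE_MATCH(pg, pt)					\
	static_assert(offsetof(struct demo_page, pg) ==			\
		      offsetof(struct demo_ptdesc, pt),			\
		      "offset mismatch: " #pg " vs " #pt)

DEMO_TABLE_MATCH(flags, __page_flags);
DEMO_TABLE_MATCH(compound_head, _pt_pad_1);
DEMO_TABLE_MATCH(mapping, _pt_s390_gaddr);
#undef DEMO_TABLE_MATCH

/* The descriptor must never be larger than the structure it overlays. */
static_assert(sizeof(struct demo_ptdesc) <= sizeof(struct demo_page),
	      "descriptor must fit inside the page structure");
```

If a field is moved in only one of the two structures, compilation stops at the matching assertion, which is the property the patch relies on once the page table members are dropped from struct page.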




[PATCH v3 15/34] x86: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/x86/mm/pgtable.c | 46 +--
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..79681557fce6 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -52,7 +52,7 @@ early_param("userpte", setup_userpte);
 
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
-   pgtable_pte_page_dtor(pte);
+   pagetable_pte_dtor(page_ptdesc(pte));
paravirt_release_pte(page_to_pfn(pte));
paravirt_tlb_remove_table(tlb, pte);
 }
@@ -60,7 +60,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 #if CONFIG_PGTABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
/*
 * NOTE! For PAE, any changes to the top page-directory-pointer-table
@@ -69,8 +69,8 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #ifdef CONFIG_X86_PAE
tlb->need_flush_all = 1;
 #endif
-   pgtable_pmd_page_dtor(page);
-   paravirt_tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -92,16 +92,16 @@ void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 
 static inline void pgd_list_add(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-   list_add(&page->lru, &pgd_list);
+   list_add(&ptdesc->pt_list, &pgd_list);
 }
 
 static inline void pgd_list_del(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-   list_del(&page->lru);
+   list_del(&ptdesc->pt_list);
 }
 
 #define UNSHARED_PTRS_PER_PGD  \
@@ -112,12 +112,12 @@ static inline void pgd_list_del(pgd_t *pgd)
 
 static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
 {
-   virt_to_page(pgd)->pt_mm = mm;
+   virt_to_ptdesc(pgd)->pt_mm = mm;
 }
 
 struct mm_struct *pgd_page_get_mm(struct page *page)
 {
-   return page->pt_mm;
+   return page_ptdesc(page)->pt_mm;
 }
 
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
@@ -213,11 +213,14 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 {
int i;
+   struct ptdesc *ptdesc;
 
for (i = 0; i < count; i++)
if (pmds[i]) {
-   pgtable_pmd_page_dtor(virt_to_page(pmds[i]));
-   free_page((unsigned long)pmds[i]);
+   ptdesc = virt_to_ptdesc(pmds[i]);
+
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
mm_dec_nr_pmds(mm);
}
 }
@@ -232,16 +235,21 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
gfp &= ~__GFP_ACCOUNT;
 
for (i = 0; i < count; i++) {
-   pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
-   if (!pmd)
+   pmd_t *pmd = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(gfp, 0);
+
+   if (!ptdesc)
failed = true;
-   if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
-   free_page((unsigned long)pmd);
-   pmd = NULL;
+   if (ptdesc && !pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
+   ptdesc = NULL;
failed = true;
}
-   if (pmd)
+   if (ptdesc) {
mm_inc_nr_pmds(mm);
+   pmd = (pmd_t *)ptdesc_address(ptdesc);
+   }
+
pmds[i] = pmd;
}
 
@@ -838,7 +846,7 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
free_page((unsigned long)pmd_sv);
 
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
free_page((unsigned long)pmd);
 
return 1;
-- 
2.40.1
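
The preallocate_pmds()/free_pmds() hunks above follow an allocate, construct, unwind-on-failure pattern. The sketch below mocks that pattern in userspace. All demo_* names and the failure-injection knob are hypothetical, and the final "free everything if any slot failed" step is an assumption about the surrounding function, which the hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

struct demo_ptdesc { bool constructed; };

static int demo_live;			/* descriptors currently allocated */
static int demo_ctor_calls;		/* number of ctor invocations so far */
static int demo_ctor_fail_at = -1;	/* ctor call index that fails; -1 = never */

static struct demo_ptdesc *demo_pagetable_alloc(void)
{
	struct demo_ptdesc *pt = calloc(1, sizeof(*pt));

	if (pt)
		demo_live++;
	return pt;
}

static void demo_pagetable_free(struct demo_ptdesc *pt)
{
	free(pt);
	demo_live--;
}

/* Stand-in for a constructor that may fail, e.g. on lock allocation. */
static bool demo_pmd_ctor(struct demo_ptdesc *pt)
{
	if (demo_ctor_calls++ == demo_ctor_fail_at)
		return false;
	pt->constructed = true;
	return true;
}

static void demo_pmd_dtor(struct demo_ptdesc *pt)
{
	pt->constructed = false;
}

/* Destroy and free every populated slot. */
static void demo_free_pmds(struct demo_ptdesc *pmds[], int count)
{
	for (int i = 0; i < count; i++) {
		if (!pmds[i])
			continue;
		demo_pmd_dtor(pmds[i]);
		demo_pagetable_free(pmds[i]);
		pmds[i] = NULL;
	}
}

/* Fill all slots; on any failure release what was built and report it. */
static int demo_preallocate_pmds(struct demo_ptdesc *pmds[], int count)
{
	bool failed = false;

	for (int i = 0; i < count; i++) {
		struct demo_ptdesc *pt = demo_pagetable_alloc();

		if (!pt)
			failed = true;
		if (pt && !demo_pmd_ctor(pt)) {
			/* Never constructed, so free without calling the dtor. */
			demo_pagetable_free(pt);
			pt = NULL;
			failed = true;
		}
		pmds[i] = pt;
	}

	if (failed) {
		demo_free_pmds(pmds, count);
		return -1;
	}
	return 0;
}
```

The point of the pattern is that a slot whose constructor failed is freed immediately and stored as NULL, so the later cleanup pass only runs the destructor on descriptors that were actually constructed.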




[PATCH v3 28/34] openrisc: Convert __pte_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/openrisc/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/openrisc/include/asm/pgalloc.h b/arch/openrisc/include/asm/pgalloc.h
index b7b2b8d16fad..c6a73772a546 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -66,10 +66,10 @@ extern inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1




[PATCH v3 32/34] sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable pte constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/sparc/mm/srmmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index 13f027afc875..8393faa3e596 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -355,7 +355,8 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
return NULL;
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
-   if (page_ref_inc_return(page) == 2 && !pgtable_pte_page_ctor(page)) {
+   if (page_ref_inc_return(page) == 2 &&
+   !pagetable_pte_ctor(page_ptdesc(page))) {
page_ref_dec(page);
ptep = NULL;
}
@@ -371,7 +372,7 @@ void pte_free(struct mm_struct *mm, pgtable_t ptep)
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
if (page_ref_dec_return(page) == 1)
-   pgtable_pte_page_dtor(page);
+   pagetable_pte_dtor(page_ptdesc(page));
spin_unlock(&mm->page_table_lock);
 
srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
-- 
2.40.1




[PATCH v3 13/34] mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}

2023-05-31 Thread Vishal Moola (Oracle)
Create pagetable_pte_ctor(), pagetable_pmd_ctor(), pagetable_pte_dtor(),
and pagetable_pmd_dtor(), and make the original pgtable
constructors/destructors wrappers around them.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 56 ++
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 72725aa6c30d..2c7d27348ea9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2867,20 +2867,34 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
+static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
 {
-   if (!ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_table(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pte_page_ctor(struct page *page)
+{
+   return pagetable_pte_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   ptlock_free(ptdesc);
+   __folio_clear_table(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pte_dtor(page_ptdesc(page));
 }
 
 #define pte_offset_map_lock(mm, pmd, address, ptlp)\
@@ -2962,20 +2976,34 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
return ptl;
 }
 
-static inline bool pgtable_pmd_page_ctor(struct page *page)
+static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 {
-   if (!pmd_ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!pmd_ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_table(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+   return pagetable_pmd_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pmd_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   pmd_ptlock_free(ptdesc);
+   __folio_clear_table(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pmd_dtor(page_ptdesc(page));
 }
 
 /*
-- 
2.40.1
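
The shape of this patch (new descriptor-based constructor/destructor, with the old page-based entry points reduced to thin wrappers) can be sketched in userspace as below. This is only an illustration: the demo_* types are hypothetical, the descriptor is embedded as the first member rather than overlaid by a cast as in the kernel, and a malloc()ed int stands in for the split page table lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical descriptor: a table flag plus a separately allocated lock. */
struct demo_ptdesc {
	bool is_table;
	int *lock;
};

/* Hypothetical page: the descriptor is simply its first member here. */
struct demo_page {
	struct demo_ptdesc pt;
};

static long demo_nr_pagetable;		/* stand-in for the NR_PAGETABLE counter */
static bool demo_fail_lock_alloc;	/* test knob: force the ctor to fail */

static struct demo_ptdesc *demo_page_ptdesc(struct demo_page *page)
{
	return &page->pt;
}

/* New-style constructor: works on the descriptor and may fail. */
static bool demo_pagetable_pte_ctor(struct demo_ptdesc *ptdesc)
{
	ptdesc->lock = demo_fail_lock_alloc ? NULL : malloc(sizeof(*ptdesc->lock));
	if (!ptdesc->lock)
		return false;
	ptdesc->is_table = true;
	demo_nr_pagetable++;
	return true;
}

/* New-style destructor: undoes everything the constructor did. */
static void demo_pagetable_pte_dtor(struct demo_ptdesc *ptdesc)
{
	free(ptdesc->lock);
	ptdesc->lock = NULL;
	ptdesc->is_table = false;
	demo_nr_pagetable--;
}

/* Old-style entry points kept as wrappers so existing callers still build. */
static bool demo_pgtable_pte_page_ctor(struct demo_page *page)
{
	return demo_pagetable_pte_ctor(demo_page_ptdesc(page));
}

static void demo_pgtable_pte_page_dtor(struct demo_page *page)
{
	demo_pagetable_pte_dtor(demo_page_ptdesc(page));
}
```

Keeping the wrappers lets each architecture be converted in its own patch; once no caller is left, the wrappers can be deleted without touching the descriptor-based functions.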




[PATCH v3 22/34] csky: Convert __pte_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/csky/include/asm/pgalloc.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index 7d57e5da0914..9c84c9012e53 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -63,8 +63,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page(tlb, pte);  \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));  \
 } while (0)
 
 extern void pagetable_init(void);
-- 
2.40.1




[PATCH v3 34/34] mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

2023-05-31 Thread Vishal Moola (Oracle)
These functions are no longer necessary. Remove them and clean up the
Documentation referencing them.

Signed-off-by: Vishal Moola (Oracle) 
---
 Documentation/mm/split_page_table_lock.rst| 12 +--
 .../zh_CN/mm/split_page_table_lock.rst| 14 ++---
 include/linux/mm.h| 20 ---
 3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index 50ee0dfc95be..4bffec728340 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -53,7 +53,7 @@ Support of split page table lock by an architecture
 ===
 
 There's no need in special enabling of PTE split page table lock: everything
-required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
+required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which
 must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
@@ -63,8 +63,8 @@ This field shares storage with page->ptl.
 PMD split lock only makes sense if you have more than two page table
 levels.
 
-PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
-allocation and pgtable_pmd_page_dtor() on freeing.
+PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table
+allocation and pagetable_pmd_dtor() on freeing.
 
 Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
 pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
@@ -72,7 +72,7 @@ paths: i.e X86_PAE preallocate few PMDs on pgd_alloc().
 
 With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
 
-NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
+NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must
 be handled properly.
 
 page->ptl
@@ -92,7 +92,7 @@ trick:
split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs
one more cache line for indirect access;
 
-The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in
-pgtable_pmd_page_ctor() for PMD table.
+The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in
+pagetable_pmd_ctor() for PMD table.
 
 Please, never access page->ptl directly -- use appropriate helper.
diff --git a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
index 4fb7aa666037..a2c288670a24 100644
--- a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
+++ b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
@@ -56,16 +56,16 @@ Hugetlb特定的辅助函数:
 架构对分页表锁的支持
 
 
-没有必要特别启用PTE分页表锁:所有需要的东西都由pgtable_pte_page_ctor()
-和pgtable_pte_page_dtor()完成,它们必须在PTE表分配/释放时被调用。
+没有必要特别启用PTE分页表锁:所有需要的东西都由pagetable_pte_ctor()
+和pagetable_pte_dtor()完成,它们必须在PTE表分配/释放时被调用。
 
 确保架构不使用slab分配器来分配页表:slab使用page->slab_cache来分配其页
 面。这个区域与page->ptl共享存储。
 
 PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
-启用PMD分页锁需要在PMD表分配时调用pgtable_pmd_page_ctor(),在释放时调
-用pgtable_pmd_page_dtor()。
+启用PMD分页锁需要在PMD表分配时调用pagetable_pmd_ctor(),在释放时调
+用pagetable_pmd_dtor()。
 
 分配通常发生在pmd_alloc_one()中,释放发生在pmd_free()和pmd_free_tlb()
 中,但要确保覆盖所有的PMD表分配/释放路径:即X86_PAE在pgd_alloc()中预先
@@ -73,7 +73,7 @@ PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
 一切就绪后,你可以设置CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK。
 
-注意:pgtable_pte_page_ctor()和pgtable_pmd_page_ctor()可能失败--必
+注意:pagetable_pte_ctor()和pagetable_pmd_ctor()可能失败--必
 须正确处理。
 
 page->ptl
@@ -90,7 +90,7 @@ page->ptl用于访问分割页表锁,其中'page'是包含该表的页面struc
的指针并动态分配它。这允许在启用DEBUG_SPINLOCK或DEBUG_LOCK_ALLOC的
情况下使用分页锁,但由于间接访问而多花了一个缓存行。
 
-PTE表的spinlock_t分配在pgtable_pte_page_ctor()中,PMD表的spinlock_t
-分配在pgtable_pmd_page_ctor()中。
+PTE表的spinlock_t分配在pagetable_pte_ctor()中,PMD表的spinlock_t
+分配在pagetable_pmd_ctor()中。
 
 请不要直接访问page->ptl - -使用适当的辅助函数。
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2c7d27348ea9..218cad2041a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2878,11 +2878,6 @@ static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
return true;
 }
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
-{
-   return pagetable_pte_ctor(page_ptdesc(page));
-}
-
 static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
 {
struct folio *folio = ptdesc_folio(ptdesc);
@@ -2892,11 +2887,6 @@ static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
lruvec_stat_sub_folio(folio, NR_PAGETABLE);
 }
 
-static inline void pgtable_pte_page_dtor(struct page *page)
-{
-   pagetable_pte_dtor(page_ptdesc(page));
-}
-
 #define pte_offset_map_lock(mm, pmd, address, ptlp)\
 ({ \
spinlock_t *__ptl = pte_lockptr(mm, pmd);   \
@@ -2987,11 +2977,6 @@ static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
return true;
 }
 

Re: [XEN][PATCH v6 08/19] xen/device-tree: Add device_tree_find_node_by_path() to find nodes in device tree

2023-05-31 Thread Vikram Garhwal

Hi Henry & Michal,


On 5/9/23 4:29 AM, Michal Orzel wrote:


On 04/05/2023 06:23, Henry Wang wrote:


Hi Vikram,


-Original Message-
Subject: [XEN][PATCH v6 08/19] xen/device-tree: Add
device_tree_find_node_by_path() to find nodes in device tree

Add device_tree_find_node_by_path() to find a matching node with path for a
dt_device_node.

Reason behind this function:
 Each time overlay nodes are added using .dtbo, a new fdt(memcpy of
 device_tree_flattened) is created and updated with overlay nodes. This
 updated fdt is further unflattened to a dt_host_new. Next, we need to find
 the overlay nodes in dt_host_new, find the overlay node's parent in dt_host
 and add the nodes as child under their parent in the dt_host. Thus we need
 this function to search for node in different unflattened device trees.

Also, make dt_find_node_by_path() static inline.

Signed-off-by: Vikram Garhwal 
---
  xen/common/device_tree.c  |  5 +++--
  xen/include/xen/device_tree.h | 17 +++--
  2 files changed, 18 insertions(+), 4 deletions(-)


[...]


  /**
- * dt_find_node_by_path - Find a node matching a full DT path
+ * device_tree_find_node_by_path - Generic function to find a node
matching the
+ * full DT path for any given unflatten device tree
+ * @dt_node: The device tree to search

I noticed that you missed Michal's comment here about renaming the
"dt_node" here to "dt" to match below function prototype...

This is one thing. The other is that in v5 you said this is to be a generic
function where you can search from the middle of a device tree. This means that
the parameter should be named "node" or "from", and the description needs to say
"The node to start searching from". Also, seeing the lack of ->allnext, you can
mention that this is inclusive (i.e. the passed node will also be searched).

Changed this for v7. Will send it out soon.

@Henry, I didn't add your Reviewed-by because the patch changed a bit with the
renaming. Can you please review v7 and give your feedback?


~Michal
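
For reference, the "inclusive search starting from a given node" semantics discussed above can be sketched in plain C as follows. The demo_dt_node type, the demo_find_node_by_path_from() name, and the use of a case-sensitive strcmp() are assumptions made for this illustration only; they are not taken from the Xen tree.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical node: full path plus a link to the next node in the tree walk. */
struct demo_dt_node {
	const char *full_name;
	struct demo_dt_node *allnext;
};

/*
 * Walk the allnext chain beginning at @from (inclusive) and return the
 * first node whose full path equals @path, or NULL if there is none.
 */
static struct demo_dt_node *
demo_find_node_by_path_from(struct demo_dt_node *from, const char *path)
{
	for (struct demo_dt_node *np = from; np; np = np->allnext)
		if (np->full_name && strcmp(np->full_name, path) == 0)
			return np;

	return NULL;
}
```

Because the walk starts at the passed node itself, passing the root of any unflattened tree searches the whole tree, while passing an inner node searches only that node and whatever follows it in the chain.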





[PATCH v3 31/34] sparc64: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/sparc/mm/init_64.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..8a1618c3b435 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2893,14 +2893,15 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-   if (!page)
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);
+
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
-   return (pte_t *) page_address(page);
+   return (pte_t *) ptdesc_address(ptdesc);
 }
 
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -2910,10 +2911,10 @@ void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 static void __pte_free(pgtable_t pte)
 {
-   struct page *page = virt_to_page(pte);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pte);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 void pte_free(struct mm_struct *mm, pgtable_t pte)
-- 
2.40.1




[PATCH v3 20/34] arm: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

late_alloc() also uses the __get_free_pages() helper function. Convert
this to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/arm/include/asm/tlb.h | 12 +++-
 arch/arm/mm/mmu.c  |  6 +++---
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index b8cbe03ad260..f40d06ad5d2a 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -39,7 +39,9 @@ static inline void __tlb_remove_table(void *_table)
 static inline void
 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
 
 #ifndef CONFIG_ARM_LPAE
/*
@@ -50,17 +52,17 @@ __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
__tlb_adjust_range(tlb, addr - PAGE_SIZE, 2 * PAGE_SIZE);
 #endif
 
-   tlb_remove_table(tlb, pte);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 static inline void
 __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
 {
 #ifdef CONFIG_ARM_LPAE
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 #endif
 }
 
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 22292cf3381c..294518fd0240 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -737,11 +737,11 @@ static void __init *early_alloc(unsigned long sz)
 
 static void *__init late_alloc(unsigned long sz)
 {
-   void *ptr = (void *)__get_free_pages(GFP_PGTABLE_KERNEL, get_order(sz));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, get_order(sz));
 
-   if (!ptr || !pgtable_pte_page_ctor(virt_to_page(ptr)))
+   if (!ptdesc || !pagetable_pte_ctor(ptdesc))
BUG();
-   return ptr;
+   return ptdesc_address(ptdesc);
 }
 
 static pte_t * __init arm_pte_alloc(pmd_t *pmd, unsigned long addr,
-- 
2.40.1




[PATCH v3 17/34] s390: Convert various pgalloc functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/s390/include/asm/pgalloc.h |   4 +-
 arch/s390/include/asm/tlb.h |   4 +-
 arch/s390/mm/pgalloc.c  | 108 
 3 files changed, 59 insertions(+), 57 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..00ad9b88fda9 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -86,7 +86,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long vmaddr)
if (!table)
return NULL;
crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
-   if (!pgtable_pmd_page_ctor(virt_to_page(table))) {
+   if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) {
crst_table_free(mm, table);
return NULL;
}
@@ -97,7 +97,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
if (mm_pmd_folded(mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
crst_table_free(mm, (unsigned long *) pmd);
 }
 
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index b91f4a9b044c..383b1f91442c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -89,12 +89,12 @@ static inline void pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd,
 {
if (mm_pmd_folded(tlb->mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
__tlb_adjust_range(tlb, address, PAGE_SIZE);
tlb->mm->context.flush_mm = 1;
tlb->freed_tables = 1;
tlb->cleared_puds = 1;
-   tlb_remove_table(tlb, pmd);
+   tlb_remove_ptdesc(tlb, pmd);
 }
 
 /*
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 6b99932abc66..eeb7c95b98cf 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -43,17 +43,17 @@ __initcall(page_table_register_sysctl);
 
 unsigned long *crst_table_alloc(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_KERNEL, CRST_ALLOC_ORDER);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, CRST_ALLOC_ORDER);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   arch_set_page_dat(page, CRST_ALLOC_ORDER);
-   return (unsigned long *) page_to_virt(page);
+   arch_set_page_dat(ptdesc_page(ptdesc), CRST_ALLOC_ORDER);
+   return (unsigned long *) ptdesc_to_virt(ptdesc);
 }
 
 void crst_table_free(struct mm_struct *mm, unsigned long *table)
 {
-   free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+   pagetable_free(virt_to_ptdesc(table));
 }
 
 static void __crst_table_upgrade(void *arg)
@@ -140,21 +140,21 @@ static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
 
 struct page *page_table_alloc_pgste(struct mm_struct *mm)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
u64 *table;
 
-   page = alloc_page(GFP_KERNEL);
-   if (page) {
-   table = (u64 *)page_to_virt(page);
+   ptdesc = pagetable_alloc(GFP_KERNEL, 0);
+   if (ptdesc) {
+   table = (u64 *)ptdesc_to_virt(ptdesc);
memset64(table, _PAGE_INVALID, PTRS_PER_PTE);
memset64(table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
}
-   return page;
+   return ptdesc_page(ptdesc);
 }
 
 void page_table_free_pgste(struct page *page)
 {
-   __free_page(page);
+   pagetable_free(page_ptdesc(page));
 }
 
 #endif /* CONFIG_PGSTE */
@@ -230,7 +230,7 @@ void page_table_free_pgste(struct page *page)
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
unsigned long *table;
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned int mask, bit;
 
/* Try to get a fragment of a 4K page as a 2K page table */
@@ -238,9 +238,9 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = NULL;
spin_lock_bh(&mm->context.lock);
if (!list_empty(&mm->context.pgtable_list)) {
-   page = list_first_entry(&mm->context.pgtable_list,
-   struct page, lru);
-   mask = atomic_read(&page->pt_frag_refcount);
+   ptdesc = list_first_entry(&mm->context.pgtable_list,
+   struct ptdesc, pt_list);
+   mask = atomic_read(&ptdesc->pt_frag_refcount);
/*
 * The pending removal bits must also be checked.
 * Failure to do so might 

[PATCH v3 23/34] hexagon: Convert __pte_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/hexagon/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/hexagon/include/asm/pgalloc.h b/arch/hexagon/include/asm/pgalloc.h
index f0c47e6a7427..55988625e6fb 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -87,10 +87,10 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
max_kernel_seg = pmdindex;
 }
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor((pte));   \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor((page_ptdesc(pte))); \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1




[PATCH v3 29/34] riscv: Convert alloc_{pmd, pte}_late() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/include/asm/pgalloc.h |  8 
 arch/riscv/mm/init.c | 16 ++--
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 59dc12b5b7e8..d169a4f41a2e 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -153,10 +153,10 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-#define __pte_free_tlb(tlb, pte, buf)   \
-do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, buf)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 #endif /* CONFIG_MMU */
 
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2f7a7c345a6a..2fe6ca1b1f95 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -353,12 +353,10 @@ static inline phys_addr_t __init alloc_pte_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pte_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pte_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc));
+   return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pte_mapping(pte_t *ptep,
@@ -436,12 +434,10 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pmd_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pmd_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc));
+   return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pmd_mapping(pmd_t *pmdp,
-- 
2.40.1




[PATCH v3 33/34] um: Convert {pmd, pte}_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/um/include/asm/pgalloc.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index 8ec7cd46dd96..de5e31c64793 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -25,19 +25,19 @@
  */
 extern pgd_t *pgd_alloc(struct mm_struct *);
 
-#define __pte_free_tlb(tlb,pte, address)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb),(pte));   \
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
 
-#define __pmd_free_tlb(tlb, pmd, address)  \
-do {   \
-   pgtable_pmd_page_dtor(virt_to_page(pmd));   \
-   tlb_remove_page((tlb),virt_to_page(pmd));   \
-} while (0)\
+#define __pmd_free_tlb(tlb, pmd, address)  \
+do {   \
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));\
+   tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd)); \
+} while (0)
 
 #endif
 
-- 
2.40.1
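
A note on the macro shape kept by this conversion: both __pte_free_tlb()
and __pmd_free_tlb() expand to two calls wrapped in do { ... } while (0),
and the patch drops the stray line continuation that followed the old
__pmd_free_tlb() body. The sketch below is only an illustration of why the
wrapper matters. The demo_* names are hypothetical stand-ins, not kernel
functions.

```c
#include <assert.h>

/* Hypothetical counters standing in for the destructor and TLB-removal calls. */
static int dtor_calls;
static int remove_calls;

static void demo_dtor(int *table)
{
	(void)table;
	dtor_calls++;
}

static void demo_remove(int *table)
{
	(void)table;
	remove_calls++;
}

/* Two statements wrapped so that the macro expands to exactly one statement. */
#define demo_free_tlb(table)		\
do {					\
	demo_dtor(table);		\
	demo_remove(table);		\
} while (0)

/*
 * Without the do/while wrapper this if/else would not compile, and without
 * the else branch only the first call would be conditional.
 */
static void demo_free_if(int cond, int *table)
{
	if (cond)
		demo_free_tlb(table);
	else
		(void)table;
}
```

Calling demo_free_if(1, &t) runs both calls once, and demo_free_if(0, &t)
runs neither, which is the behaviour a caller of the real macros relies on.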




[PATCH v3 26/34] mips: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/mips/include/asm/pgalloc.h | 31 +--
 arch/mips/mm/pgtable.c  |  7 ---
 2 files changed, 21 insertions(+), 17 deletions(-)

diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index f72e737dda21..3ba1fdb06502 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -51,13 +51,13 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_pages((unsigned long)pgd, PGD_TABLE_ORDER);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
-#define __pte_free_tlb(tlb,pte,address)\
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -65,18 +65,18 @@ do { \
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_pages(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_pages(pg, PMD_TABLE_ORDER);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = (pmd_t *)ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -90,10 +90,13 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, PUD_TABLE_ORDER);
 
-   pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_TABLE_ORDER);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = (pud_t *)ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/mips/mm/pgtable.c b/arch/mips/mm/pgtable.c
index b13314be5d0e..6be3493d7722 100644
--- a/arch/mips/mm/pgtable.c
+++ b/arch/mips/mm/pgtable.c
@@ -10,10 +10,11 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, PGD_TABLE_ORDER);
 
-   ret = (pgd_t *) __get_free_pages(GFP_KERNEL, PGD_TABLE_ORDER);
-   if (ret) {
+   if (ptdesc) {
+   ret = (pgd_t *) ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1
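
For readers following the conversion pattern used in pmd_alloc_one()
above (allocate a descriptor, run the constructor, free on failure, then
take the table address), here is a minimal user-space sketch. It is not
kernel code: struct demo_ptdesc and every demo_* function are hypothetical
stand-ins built on malloc()/calloc(), and the real struct ptdesc is laid
out very differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_PAGE_SIZE 4096u

/* Hypothetical descriptor; it only models what the sketch needs. */
struct demo_ptdesc {
	void *table;		/* zeroed memory holding the table entries */
	bool ctor_done;		/* set by the constructor */
};

/* Stand-in for pagetable_alloc(): 2^order zeroed pages plus a descriptor. */
static struct demo_ptdesc *demo_pagetable_alloc(unsigned int order)
{
	struct demo_ptdesc *pt = malloc(sizeof(*pt));

	if (!pt)
		return NULL;
	pt->table = calloc((size_t)1 << order, DEMO_PAGE_SIZE);
	if (!pt->table) {
		free(pt);
		return NULL;
	}
	pt->ctor_done = false;
	return pt;
}

/* Stand-in for pagetable_free(). */
static void demo_pagetable_free(struct demo_ptdesc *pt)
{
	free(pt->table);
	free(pt);
}

/* Stand-in for ptdesc_address(). */
static void *demo_ptdesc_address(const struct demo_ptdesc *pt)
{
	return pt->table;
}

/* Stand-in for pagetable_pmd_ctor(); always succeeds in this sketch. */
static bool demo_pmd_ctor(struct demo_ptdesc *pt)
{
	pt->ctor_done = true;
	return true;
}

/* Same shape as the converted pmd_alloc_one(): allocate, construct, or unwind. */
static void *demo_pmd_alloc_one(unsigned int order, struct demo_ptdesc **out)
{
	struct demo_ptdesc *pt = demo_pagetable_alloc(order);

	if (!pt)
		return NULL;
	if (!demo_pmd_ctor(pt)) {
		demo_pagetable_free(pt);
		return NULL;
	}
	*out = pt;
	return demo_ptdesc_address(pt);
}
```

The point of the shape is that the caller never touches the backing
allocation directly: every step goes through the descriptor, so the
allocation and free paths stay symmetric.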




[PATCH v3 21/34] arm64: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/arm64/include/asm/tlb.h | 14 --
 arch/arm64/mm/mmu.c  |  7 ---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index c995d1f4594f..2c29239d05c3 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -75,18 +75,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
  unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
-   tlb_remove_table(tlb, pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
  unsigned long addr)
 {
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 #endif
 
@@ -94,7 +96,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
  unsigned long addr)
 {
-   tlb_remove_table(tlb, virt_to_page(pudp));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pudp));
 }
 #endif
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..5867a0e917b9 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -426,6 +426,7 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 static phys_addr_t pgd_pgtable_alloc(int shift)
 {
phys_addr_t pa = __pgd_pgtable_alloc(shift);
+   struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
/*
 * Call proper page table ctor in case later we need to
@@ -433,12 +434,12 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 * this pre-allocated page table.
 *
 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
-* folded, and if so pgtable_pmd_page_ctor() becomes nop.
+* folded, and if so pagetable_pte_ctor() becomes nop.
 */
if (shift == PAGE_SHIFT)
-   BUG_ON(!pgtable_pte_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pte_ctor(ptdesc));
else if (shift == PMD_SHIFT)
-   BUG_ON(!pgtable_pmd_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
return pa;
 }
-- 
2.40.1




[PATCH v3 16/34] s390: Convert various gmap functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/s390/mm/gmap.c | 230 
 1 file changed, 128 insertions(+), 102 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 81c683426b49..010e87df7299 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -34,7 +34,7 @@
 static struct gmap *gmap_alloc(unsigned long limit)
 {
struct gmap *gmap;
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long *table;
unsigned long etype, atype;
 
@@ -67,12 +67,12 @@ static struct gmap *gmap_alloc(unsigned long limit)
spin_lock_init(&gmap->guest_table_lock);
spin_lock_init(&gmap->shadow_lock);
refcount_set(&gmap->ref_count, 1);
-   page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
-   if (!page)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+   if (!ptdesc)
goto out_free;
-   page->_pt_s390_gaddr = 0;
-   list_add(&page->lru, &gmap->crst_list);
-   table = page_to_virt(page);
+   ptdesc->_pt_s390_gaddr = 0;
+   list_add(&ptdesc->pt_list, &gmap->crst_list);
+   table = ptdesc_to_virt(ptdesc);
crst_table_init(table, etype);
gmap->table = table;
gmap->asce = atype | _ASCE_TABLE_LENGTH |
@@ -181,25 +181,25 @@ static void gmap_rmap_radix_tree_free(struct radix_tree_root *root)
  */
 static void gmap_free(struct gmap *gmap)
 {
-   struct page *page, *next;
+   struct ptdesc *ptdesc, *next;
 
/* Flush tlb of all gmaps (if not already done for shadows) */
if (!(gmap_is_shadow(gmap) && gmap->removed))
gmap_flush_tlb(gmap);
/* Free all segment & region tables. */
-   list_for_each_entry_safe(page, next, &gmap->crst_list, lru) {
-   page->_pt_s390_gaddr = 0;
-   __free_pages(page, CRST_ALLOC_ORDER);
+   list_for_each_entry_safe(ptdesc, next, &gmap->crst_list, pt_list) {
+   ptdesc->_pt_s390_gaddr = 0;
+   pagetable_free(ptdesc);
}
gmap_radix_tree_free(&gmap->guest_to_host);
gmap_radix_tree_free(&gmap->host_to_guest);
 
/* Free additional data for a shadow gmap */
if (gmap_is_shadow(gmap)) {
-   /* Free all page tables. */
-   list_for_each_entry_safe(page, next, &gmap->pt_list, lru) {
-   page->_pt_s390_gaddr = 0;
-   page_table_free_pgste(page);
+   /* Free all ptdesc tables. */
+   list_for_each_entry_safe(ptdesc, next, &gmap->pt_list, pt_list) {
+   ptdesc->_pt_s390_gaddr = 0;
+   page_table_free_pgste(ptdesc_page(ptdesc));
}
gmap_rmap_radix_tree_free(&gmap->host_to_rmap);
/* Release reference to the parent */
@@ -308,27 +308,27 @@ EXPORT_SYMBOL_GPL(gmap_get_enabled);
 static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
unsigned long init, unsigned long gaddr)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long *new;
 
/* since we dont free the gmap table until gmap_free we can unlock */
-   page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
-   if (!page)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+   if (!ptdesc)
return -ENOMEM;
-   new = page_to_virt(page);
+   new = ptdesc_to_virt(ptdesc);
crst_table_init(new, init);
spin_lock(&gmap->guest_table_lock);
if (*table & _REGION_ENTRY_INVALID) {
-   list_add(&page->lru, &gmap->crst_list);
+   list_add(&ptdesc->pt_list, &gmap->crst_list);
*table = __pa(new) | _REGION_ENTRY_LENGTH |
(*table & _REGION_ENTRY_TYPE_MASK);
-   page->_pt_s390_gaddr = gaddr;
-   page = NULL;
+   ptdesc->_pt_s390_gaddr = gaddr;
+   ptdesc = NULL;
}
spin_unlock(&gmap->guest_table_lock);
-   if (page) {
-   page->_pt_s390_gaddr = 0;
-   __free_pages(page, CRST_ALLOC_ORDER);
+   if (ptdesc) {
+   ptdesc->_pt_s390_gaddr = 0;
+   pagetable_free(ptdesc);
}
return 0;
 }
@@ -341,15 +341,15 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
  */
 static unsigned long __gmap_segment_gaddr(unsigned long *entry)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long offset, mask;
 
offset = (unsigned long) entry / sizeof(unsigned long);
offset = (offset & (PTRS_PER_PMD - 1)) * PMD_SIZE;
mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-   page = virt_to_page((void *)((unsigned long) 

[PATCH v3 19/34] pgalloc: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/pgalloc.h | 62 +--
 1 file changed, 37 insertions(+), 25 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index a7cf825befae..1e5799ba2e56 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -18,7 +18,11 @@
  */
 static inline pte_t *__pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   return (pte_t *)__get_free_page(GFP_PGTABLE_KERNEL);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);
+
+   if (!ptdesc)
+   return NULL;
+   return (pte_t *)ptdesc_address(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL
@@ -41,7 +45,7 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
  */
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long)pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 /**
@@ -49,7 +53,7 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
  * @mm: the mm_struct of the current context
  * @gfp: GFP flags to use for the allocation
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocates a ptdesc and runs the pagetable_pte_ctor().
  *
  * This function is intended for architectures that need
  * anything beyond simple page allocation or must have custom GFP flags.
@@ -58,17 +62,17 @@ static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
  */
 static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp)
 {
-   struct page *pte;
+   struct ptdesc *ptdesc;
 
-   pte = alloc_page(gfp);
-   if (!pte)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(pte)) {
-   __free_page(pte);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   return pte;
+   return ptdesc_page(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE
@@ -76,7 +80,7 @@ static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp)
  * pte_alloc_one - allocate a page for PTE-level user page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocates a ptdesc and runs the pagetable_pte_ctor().
  *
  * Return: `struct page` initialized as page table or %NULL on error
  */
@@ -98,8 +102,10 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
  */
 static inline void pte_free(struct mm_struct *mm, struct page *pte_page)
 {
-   pgtable_pte_page_dtor(pte_page);
-   __free_page(pte_page);
+   struct ptdesc *ptdesc = page_ptdesc(pte_page);
+
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 
@@ -110,7 +116,7 @@ static inline void pte_free(struct mm_struct *mm, struct page *pte_page)
  * pmd_alloc_one - allocate a page for PMD-level page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the pgtable_pmd_page_ctor().
+ * Allocates a ptdesc and runs the pagetable_pmd_ctor().
  * Allocations use %GFP_PGTABLE_USER in user context and
  * %GFP_PGTABLE_KERNEL in kernel context.
  *
@@ -118,28 +124,30 @@ static inline void pte_free(struct mm_struct *mm, struct page *pte_page)
  */
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
gfp_t gfp = GFP_PGTABLE_USER;
 
if (mm == &init_mm)
gfp = GFP_PGTABLE_KERNEL;
-   page = alloc_page(gfp);
-   if (!page)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pmd_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
-   return (pmd_t *)page_address(page);
+   return (pmd_t *)ptdesc_address(ptdesc);
 }
 #endif
 
 #ifndef __HAVE_ARCH_PMD_FREE
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
+
BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
-   free_page((unsigned long)pmd);
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 #endif
 
@@ -149,11 +157,15 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 
 static inline pud_t *__pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-   gfp_t gfp = GFP_PGTABLE_USER;
+   gfp_t gfp 

[PATCH v3 24/34] loongarch: Convert various functions to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/loongarch/include/asm/pgalloc.h | 27 +++
 arch/loongarch/mm/pgtable.c  |  7 ---
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/loongarch/include/asm/pgalloc.h b/arch/loongarch/include/asm/pgalloc.h
index af1d1e4a6965..5f5afbf1f10c 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -45,9 +45,9 @@ extern void pagetable_init(void);
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, address)  \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -55,18 +55,18 @@ do { \
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_page(GFP_KERNEL_ACCOUNT);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, 0);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_page(pg);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = (pmd_t *)ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -80,10 +80,13 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-   pud = (pud_t *) __get_free_page(GFP_KERNEL);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = (pud_t *)ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c
index 36a6dc0148ae..cdba10ffc0df 100644
--- a/arch/loongarch/mm/pgtable.c
+++ b/arch/loongarch/mm/pgtable.c
@@ -11,10 +11,11 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, 0);
 
-   ret = (pgd_t *) __get_free_page(GFP_KERNEL);
-   if (ret) {
+   if (ptdesc) {
+   ret = (pgd_t *)ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1




[PATCH v3 30/34] sh: Convert pte_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/sh/include/asm/pgalloc.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index a9e98233c4d4..5d8577ab1591 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_SH_PGALLOC_H
 #define __ASM_SH_PGALLOC_H
 
+#include 
 #include 
 
 #define __HAVE_ARCH_PMD_ALLOC_ONE
@@ -31,10 +32,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif /* __ASM_SH_PGALLOC_H */
-- 
2.40.1




[PATCH v3 27/34] nios2: Convert __pte_free_tlb() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/nios2/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index ecd1657bb2ce..ce6bb8e74271 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -28,10 +28,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-   do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+   do {\
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
} while (0)
 
 #endif /* _ASM_NIOS2_PGALLOC_H */
-- 
2.40.1




[PATCH v3 11/34] mm: Convert pmd_ptlock_free() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bc2f139de4e7..ffc82355fea6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2931,12 +2931,12 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
return ptlock_init(ptdesc);
 }
 
-static inline void pmd_ptlock_free(struct page *page)
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+   VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(page);
+   ptlock_free(ptdesc_page(ptdesc));
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
@@ -2949,7 +2949,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
 
@@ -2973,7 +2973,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page)
 
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page);
+   pmd_ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
-- 
2.40.1




[PATCH v3 10/34] mm: Convert ptlock_init() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8e63e60c399c..bc2f139de4e7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2838,7 +2838,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
-static inline bool ptlock_init(struct page *page)
+static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
/*
 * prep_new_page() initialize page->private (and therefore page->ptl)
@@ -2847,10 +2847,10 @@ static inline bool ptlock_init(struct page *page)
 * It can happen if arch try to use slab for page table allocation:
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
-   VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page_ptdesc(page)))
+   VM_BUG_ON_PAGE(*(unsigned long *)&ptdesc->ptl, ptdesc_page(ptdesc));
+   if (!ptlock_alloc(ptdesc))
return false;
-   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
+   spin_lock_init(ptlock_ptr(ptdesc));
return true;
 }
 
@@ -2863,13 +2863,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
 }
 static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
+static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-   if (!ptlock_init(page))
+   if (!ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
@@ -2928,7 +2928,7 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(ptdesc_page(ptdesc));
+   return ptlock_init(ptdesc);
 }
 
 static inline void pmd_ptlock_free(struct page *page)
-- 
2.40.1




[PATCH v3 09/34] mm: Convert pmd_ptlock_init() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f7263fcd821..8e63e60c399c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2923,12 +2923,12 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
-static inline bool pmd_ptlock_init(struct page *page)
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   page->pmd_huge_pte = NULL;
+   ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(page);
+   return ptlock_init(ptdesc_page(ptdesc));
 }
 
 static inline void pmd_ptlock_free(struct page *page)
@@ -2948,7 +2948,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
 }
 
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void pmd_ptlock_free(struct page *page) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
@@ -2964,7 +2964,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
-   if (!pmd_ptlock_init(page))
+   if (!pmd_ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
-- 
2.40.1




[PATCH v3 12/34] mm: Convert ptlock_free() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 10 +-
 mm/memory.c|  4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ffc82355fea6..72725aa6c30d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2807,7 +2807,7 @@ static inline void pagetable_clear(void *x)
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
-extern void ptlock_free(struct page *page);
+void ptlock_free(struct ptdesc *ptdesc);
 
 static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
@@ -2823,7 +2823,7 @@ static inline bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-static inline void ptlock_free(struct page *page)
+static inline void ptlock_free(struct ptdesc *ptdesc)
 {
 }
 
@@ -2864,7 +2864,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void ptlock_free(struct page *page) {}
+static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
@@ -2878,7 +2878,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page);
+   ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
@@ -2936,7 +2936,7 @@ static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(ptdesc_page(ptdesc));
+   ptlock_free(ptdesc);
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
diff --git a/mm/memory.c b/mm/memory.c
index 8d37dd302f2f..df0251243dfa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5949,8 +5949,8 @@ bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-void ptlock_free(struct page *page)
+void ptlock_free(struct ptdesc *ptdesc)
 {
-   kmem_cache_free(page_ptl_cachep, page->ptl);
+   kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
 }
 #endif
-- 
2.40.1




[PATCH v3 08/34] mm: Convert ptlock_ptr() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/x86/xen/mmu_pv.c |  2 +-
 include/linux/mm.h| 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index b3b8d289b9ab..f469862e3ef4 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -651,7 +651,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm)
spinlock_t *ptl = NULL;
 
 #if USE_SPLIT_PTE_PTLOCKS
-   ptl = ptlock_ptr(page);
+   ptl = ptlock_ptr(page_ptdesc(page));
spin_lock_nest_lock(ptl, &mm->page_table_lock);
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1fd16ac96036..6f7263fcd821 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2809,9 +2809,9 @@ void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-   return page->ptl;
+   return ptdesc->ptl;
 }
 #else /* ALLOC_SPLIT_PTLOCKS */
 static inline void ptlock_cache_init(void)
@@ -2827,15 +2827,15 @@ static inline void ptlock_free(struct page *page)
 {
 }
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-   return &page->ptl;
+   return &ptdesc->ptl;
 }
 #endif /* ALLOC_SPLIT_PTLOCKS */
 
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_page(*pmd));
+   return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
 static inline bool ptlock_init(struct page *page)
@@ -2850,7 +2850,7 @@ static inline bool ptlock_init(struct page *page)
VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
if (!ptlock_alloc(page_ptdesc(page)))
return false;
-   spin_lock_init(ptlock_ptr(page));
+   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
return true;
 }
 
@@ -2920,7 +2920,7 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
+   return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
-- 
2.40.1
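
The two ptlock_ptr() variants touched above differ only in whether the
lock is allocated separately or embedded in the descriptor. The sketch
below models that choice in user space so the two shapes can be compared
side by side. Everything here is a hypothetical stand-in:
DEMO_ALLOC_SPLIT_PTLOCKS, demo_spinlock_t and the demo_* functions are not
kernel names, and calloc()/free() replace the kernel's slab cache.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical lock type; a real spinlock_t is not modelled here. */
typedef struct {
	int locked;
} demo_spinlock_t;

/* Compile-time switch standing in for ALLOC_SPLIT_PTLOCKS. */
#ifndef DEMO_ALLOC_SPLIT_PTLOCKS
#define DEMO_ALLOC_SPLIT_PTLOCKS 1
#endif

struct demo_ptdesc {
#if DEMO_ALLOC_SPLIT_PTLOCKS
	demo_spinlock_t *ptl;	/* lock lives in separately allocated memory */
#else
	demo_spinlock_t ptl;	/* lock is embedded in the descriptor */
#endif
};

#if DEMO_ALLOC_SPLIT_PTLOCKS
static int demo_ptlock_alloc(struct demo_ptdesc *pt)
{
	pt->ptl = calloc(1, sizeof(*pt->ptl));
	return pt->ptl != NULL;
}

static void demo_ptlock_free(struct demo_ptdesc *pt)
{
	free(pt->ptl);
	pt->ptl = NULL;
}

static demo_spinlock_t *demo_ptlock_ptr(struct demo_ptdesc *pt)
{
	return pt->ptl;
}
#else
static int demo_ptlock_alloc(struct demo_ptdesc *pt)
{
	(void)pt;
	return 1;	/* nothing to allocate */
}

static void demo_ptlock_free(struct demo_ptdesc *pt)
{
	(void)pt;	/* nothing to free */
}

static demo_spinlock_t *demo_ptlock_ptr(struct demo_ptdesc *pt)
{
	return &pt->ptl;
}
#endif
```

Callers only ever use demo_ptlock_ptr(), so they compile unchanged under
either setting; that is the property the accessor is there to provide.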




[PATCH v3 01/34] mm: Add PAGE_TYPE_OP folio functions

2023-05-31 Thread Vishal Moola (Oracle)
No folio equivalents for page type operations have been defined, so
define them for later folio conversions.

Also changes the Page##uname macros to take in const struct page* since
we only read the memory here.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/page-flags.h | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..e99a616b9bcd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -908,6 +908,8 @@ static inline bool is_page_hwpoison(struct page *page)
 
 #define PageType(page, flag)   \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+#define folio_test_type(folio, flag)   \
+   ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 
 static inline int page_type_has_type(unsigned int page_type)
 {
@@ -920,20 +922,34 @@ static inline int page_has_type(struct page *page)
 }
 
 #define PAGE_TYPE_OPS(uname, lname)\
-static __always_inline int Page##uname(struct page *page)  \
+static __always_inline int Page##uname(const struct page *page)  \
 {  \
return PageType(page, PG_##lname);  \
 }  \
+static __always_inline int folio_test_##lname(const struct folio *folio)\
+{  \
+   return folio_test_type(folio, PG_##lname);  \
+}  \
 static __always_inline void __SetPage##uname(struct page *page)  \
 {  \
VM_BUG_ON_PAGE(!PageType(page, 0), page);   \
page->page_type &= ~PG_##lname; \
 }  \
+static __always_inline void __folio_set_##lname(struct folio *folio)   \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \
+   folio->page.page_type &= ~PG_##lname;   \
+}  \
 static __always_inline void __ClearPage##uname(struct page *page)  \
 {  \
VM_BUG_ON_PAGE(!Page##uname(page), page);   \
page->page_type |= PG_##lname;  \
-}
+}  \
+static __always_inline void __folio_clear_##lname(struct folio *folio) \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_##lname(folio), folio); \
+   folio->page.page_type |= PG_##lname;\
+}  \
 
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
-- 
2.40.1




[PATCH v3 07/34] mm: Convert ptlock_alloc() to use ptdescs

2023-05-31 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 6 +++---
 mm/memory.c| 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3a9c40e90dd7..1fd16ac96036 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2806,7 +2806,7 @@ static inline void pagetable_clear(void *x)
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
+bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
 static inline spinlock_t *ptlock_ptr(struct page *page)
@@ -2818,7 +2818,7 @@ static inline void ptlock_cache_init(void)
 {
 }
 
-static inline bool ptlock_alloc(struct page *page)
+static inline bool ptlock_alloc(struct ptdesc *ptdesc)
 {
return true;
 }
@@ -2848,7 +2848,7 @@ static inline bool ptlock_init(struct page *page)
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page))
+   if (!ptlock_alloc(page_ptdesc(page)))
return false;
spin_lock_init(ptlock_ptr(page));
return true;
diff --git a/mm/memory.c b/mm/memory.c
index 8358f3b853f2..8d37dd302f2f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5938,14 +5938,14 @@ void __init ptlock_cache_init(void)
SLAB_PANIC, NULL);
 }
 
-bool ptlock_alloc(struct page *page)
+bool ptlock_alloc(struct ptdesc *ptdesc)
 {
spinlock_t *ptl;
 
ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
if (!ptl)
return false;
-   page->ptl = ptl;
+   ptdesc->ptl = ptl;
return true;
 }
 
-- 
2.40.1




[PATCH v3 05/34] mm: add utility functions for ptdesc

2023-05-31 Thread Vishal Moola (Oracle)
Introduce utility functions setting the foundation for ptdescs. These
will also assist in the splitting out of ptdesc from struct page.

Functions that focus on the descriptor are prefixed with ptdesc_* while
functions that focus on the pagetable are prefixed with pagetable_*.

pagetable_alloc() is defined to allocate new ptdesc pages as compound
pages. This is to standardize ptdescs by allowing for one allocation
and one free function, in contrast to 2 allocation and 2 free functions.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/tlb.h | 11 +++
 include/linux/mm.h| 61 +++
 include/linux/pgtable.h   | 12 
 3 files changed, 84 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b46617207c93..6bade9e0e799 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -481,6 +481,17 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
return tlb_remove_page_size(tlb, page, PAGE_SIZE);
 }
 
+static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+{
+   tlb_remove_table(tlb, pt);
+}
+
+/* Like tlb_remove_ptdesc, but for page-like page directories. */
+static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
+{
+   tlb_remove_page(tlb, ptdesc_page(pt));
+}
+
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 42ff3e04c006..620537e2f94f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2747,6 +2747,62 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
+static inline struct ptdesc *virt_to_ptdesc(const void *x)
+{
+   return page_ptdesc(virt_to_page(x));
+}
+
+static inline void *ptdesc_to_virt(const struct ptdesc *pt)
+{
+   return page_to_virt(ptdesc_page(pt));
+}
+
+static inline void *ptdesc_address(const struct ptdesc *pt)
+{
+   return folio_address(ptdesc_folio(pt));
+}
+
+static inline bool pagetable_is_reserved(struct ptdesc *pt)
+{
+   return folio_test_reserved(ptdesc_folio(pt));
+}
+
+/**
+ * pagetable_alloc - Allocate pagetables
+ * @gfp:GFP flags
+ * @order:  desired pagetable order
+ *
+ * pagetable_alloc allocates a page table descriptor as well as all pages
+ * described by it.
+ *
+ * Return: The ptdesc describing the allocated page tables.
+ */
+static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
+{
+   struct page *page = alloc_pages(gfp | __GFP_COMP, order);
+
+   return page_ptdesc(page);
+}
+
+/**
+ * pagetable_free - Free pagetables
+ * @pt:The page table descriptor
+ *
+ * pagetable_free frees a page table descriptor as well as all page
+ * tables described by said ptdesc.
+ */
+static inline void pagetable_free(struct ptdesc *pt)
+{
+   struct page *page = ptdesc_page(pt);
+
+   __free_pages(page, compound_order(page));
+}
+
+static inline void pagetable_clear(void *x)
+{
+   clear_page(x);
+}
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
@@ -2973,6 +3029,11 @@ static inline void mark_page_reserved(struct page *page)
adjust_managed_page_count(page, -1);
 }
 
+static inline void free_reserved_ptdesc(struct ptdesc *pt)
+{
+   free_reserved_page(ptdesc_page(pt));
+}
+
 /*
  * Default method to free all the __init memory into the buddy system.
  * The freed pages will be poisoned with pattern "poison" if it's within
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c997e9878969..5f12622d1521 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1027,6 +1027,18 @@ TABLE_MATCH(ptl, ptl);
 #undef TABLE_MATCH
 static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 
+#define ptdesc_page(pt) (_Generic((pt), \
+   const struct ptdesc *:  (const struct page *)(pt),  \
+   struct ptdesc *:(struct page *)(pt)))
+
+#define ptdesc_folio(pt)   (_Generic((pt), \
+   const struct ptdesc *:  (const struct folio *)(pt), \
+   struct ptdesc *:(struct folio *)(pt)))
+
+#define page_ptdesc(p) (_Generic((p),  \
+   const struct page *:(const struct ptdesc *)(p), \
+   struct page *:  (struct ptdesc *)(p)))
+
 /*
  * No-op macros that just return the current protection value. Defined here
  * because these macros can be used even if CONFIG_MMU is not defined.
-- 
2.40.1
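
The `ptdesc_page()`, `ptdesc_folio()` and `page_ptdesc()` macros in the patch above use `_Generic` so that a cast between descriptor types keeps the const qualifier of its argument. A minimal sketch of that pattern follows, assuming two hypothetical types `struct page_like` and `struct desc_like` in place of the kernel structures. The `result_is_const()` macro and the two helper functions are illustrative additions, not part of the series.

```c
/* Hypothetical stand-ins for struct page and struct ptdesc. */
struct page_like { unsigned long flags; };
struct desc_like { unsigned long flags; };

/* Cast that preserves constness, in the same shape as ptdesc_page(). */
#define desc_to_page(d)	(_Generic((d),					\
	const struct desc_like *:	(const struct page_like *)(d),	\
	struct desc_like *:		(struct page_like *)(d)))

/* Illustrative helper: evaluates to 1 when the converted pointer is const. */
#define result_is_const(d)	_Generic(desc_to_page(d),		\
	const struct page_like *:	1,				\
	struct page_like *:		0)

static struct desc_like sample;

/* A const descriptor pointer yields a const page pointer. */
static int const_in_const_out(void)
{
	const struct desc_like *c = &sample;

	return result_is_const(c);
}

/* A mutable descriptor pointer yields a mutable page pointer. */
static int mutable_in_mutable_out(void)
{
	struct desc_like *m = &sample;

	return result_is_const(m);
}
```

A plain cast macro would silently drop const. With the `_Generic` form, passing a const pointer to a function that expects a mutable one still produces a compiler diagnostic after the conversion.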




[PATCH v3 00/34] Split ptdesc from struct page

2023-05-31 Thread Vishal Moola (Oracle)
The MM subsystem is trying to shrink struct page. This patchset
introduces a memory descriptor for page table tracking - struct ptdesc.

This patchset introduces ptdesc, splits ptdesc from struct page, and
converts many callers of page table constructor/destructors to use ptdescs.

Ptdesc is a foundation to further standardize page tables, and eventually
allow for dynamic allocation of page tables independent of struct page.
However, the use of pages for page table tracking is quite deeply
ingrained and varied across architectures, so there is still a lot of
work to be done before that can happen.

This is rebased on next-20230531.

v3:
  Got an Acked-by
  Fixed the arm64 compilation issue
  Rename some ptdesc utility functions to be pagetable_* instead
  Add some comments to functions describing their uses

Vishal Moola (Oracle) (34):
  mm: Add PAGE_TYPE_OP folio functions
  s390: Use _pt_s390_gaddr for gmap address tracking
  s390: Use pt_frag_refcount for pagetables
  pgtable: Create struct ptdesc
  mm: add utility functions for ptdesc
  mm: Convert pmd_pgtable_page() to pmd_ptdesc()
  mm: Convert ptlock_alloc() to use ptdescs
  mm: Convert ptlock_ptr() to use ptdescs
  mm: Convert pmd_ptlock_init() to use ptdescs
  mm: Convert ptlock_init() to use ptdescs
  mm: Convert pmd_ptlock_free() to use ptdescs
  mm: Convert ptlock_free() to use ptdescs
  mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}
  powerpc: Convert various functions to use ptdescs
  x86: Convert various functions to use ptdescs
  s390: Convert various gmap functions to use ptdescs
  s390: Convert various pgalloc functions to use ptdescs
  mm: Remove page table members from struct page
  pgalloc: Convert various functions to use ptdescs
  arm: Convert various functions to use ptdescs
  arm64: Convert various functions to use ptdescs
  csky: Convert __pte_free_tlb() to use ptdescs
  hexagon: Convert __pte_free_tlb() to use ptdescs
  loongarch: Convert various functions to use ptdescs
  m68k: Convert various functions to use ptdescs
  mips: Convert various functions to use ptdescs
  nios2: Convert __pte_free_tlb() to use ptdescs
  openrisc: Convert __pte_free_tlb() to use ptdescs
  riscv: Convert alloc_{pmd, pte}_late() to use ptdescs
  sh: Convert pte_free_tlb() to use ptdescs
  sparc64: Convert various functions to use ptdescs
  sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents
  um: Convert {pmd, pte}_free_tlb() to use ptdescs
  mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

 Documentation/mm/split_page_table_lock.rst|  12 +-
 .../zh_CN/mm/split_page_table_lock.rst|  14 +-
 arch/arm/include/asm/tlb.h|  12 +-
 arch/arm/mm/mmu.c |   6 +-
 arch/arm64/include/asm/tlb.h  |  14 +-
 arch/arm64/mm/mmu.c   |   7 +-
 arch/csky/include/asm/pgalloc.h   |   4 +-
 arch/hexagon/include/asm/pgalloc.h|   8 +-
 arch/loongarch/include/asm/pgalloc.h  |  27 ++-
 arch/loongarch/mm/pgtable.c   |   7 +-
 arch/m68k/include/asm/mcf_pgalloc.h   |  41 ++--
 arch/m68k/include/asm/sun3_pgalloc.h  |   8 +-
 arch/m68k/mm/motorola.c   |   4 +-
 arch/mips/include/asm/pgalloc.h   |  31 +--
 arch/mips/mm/pgtable.c|   7 +-
 arch/nios2/include/asm/pgalloc.h  |   8 +-
 arch/openrisc/include/asm/pgalloc.h   |   8 +-
 arch/powerpc/mm/book3s64/mmu_context.c|  10 +-
 arch/powerpc/mm/book3s64/pgtable.c|  32 +--
 arch/powerpc/mm/pgtable-frag.c|  46 ++--
 arch/riscv/include/asm/pgalloc.h  |   8 +-
 arch/riscv/mm/init.c  |  16 +-
 arch/s390/include/asm/pgalloc.h   |   4 +-
 arch/s390/include/asm/tlb.h   |   4 +-
 arch/s390/mm/gmap.c   | 222 +++---
 arch/s390/mm/pgalloc.c| 126 +-
 arch/sh/include/asm/pgalloc.h |   9 +-
 arch/sparc/mm/init_64.c   |  17 +-
 arch/sparc/mm/srmmu.c |   5 +-
 arch/um/include/asm/pgalloc.h |  18 +-
 arch/x86/mm/pgtable.c |  46 ++--
 arch/x86/xen/mmu_pv.c |   2 +-
 include/asm-generic/pgalloc.h |  62 +++--
 include/asm-generic/tlb.h |  11 +
 include/linux/mm.h| 155 
 include/linux/mm_types.h  |  14 --
 include/linux/page-flags.h|  20 +-
 include/linux/pgtable.h   |  61 +
 mm/memory.c   |   8 +-
 39 files changed, 665 insertions(+), 449 deletions(-)

-- 
2.40.1




[PATCH v3 02/34] s390: Use _pt_s390_gaddr for gmap address tracking

2023-05-31 Thread Vishal Moola (Oracle)
s390 uses page->index to keep track of page tables for the guest address
space. In an attempt to consolidate the usage of page fields in s390,
replace _pt_pad_2 with _pt_s390_gaddr to replace page->index in gmap.

This will help with the splitting of struct ptdesc from struct page, as
well as allow s390 to use _pt_frag_refcount for fragmented page table
tracking.

Since page->_pt_s390_gaddr aliases with mapping, ensure it's set to NULL
before freeing the pages as well.

This also reverts commit 7e25de77bc5ea ("s390/mm: use pmd_pgtable_page()
helper in __gmap_segment_gaddr()") which had s390 use
pmd_pgtable_page() to get a gmap page table, as pmd_pgtable_page()
should be used for more generic process page tables.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/s390/mm/gmap.c  | 56 +++-
 include/linux/mm_types.h |  2 +-
 2 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index dc90d1eb0d55..81c683426b49 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -70,7 +70,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
if (!page)
goto out_free;
-   page->index = 0;
+   page->_pt_s390_gaddr = 0;
list_add(&page->lru, &gmap->crst_list);
table = page_to_virt(page);
crst_table_init(table, etype);
@@ -187,16 +187,20 @@ static void gmap_free(struct gmap *gmap)
if (!(gmap_is_shadow(gmap) && gmap->removed))
gmap_flush_tlb(gmap);
/* Free all segment & region tables. */
-   list_for_each_entry_safe(page, next, &gmap->crst_list, lru)
+   list_for_each_entry_safe(page, next, &gmap->crst_list, lru) {
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
+   }
gmap_radix_tree_free(&gmap->guest_to_host);
gmap_radix_tree_free(&gmap->host_to_guest);
 
/* Free additional data for a shadow gmap */
if (gmap_is_shadow(gmap)) {
/* Free all page tables. */
-   list_for_each_entry_safe(page, next, &gmap->pt_list, lru)
+   list_for_each_entry_safe(page, next, &gmap->pt_list, lru) {
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
+   }
gmap_rmap_radix_tree_free(&gmap->host_to_rmap);
/* Release reference to the parent */
gmap_put(gmap->parent);
@@ -318,12 +322,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
list_add(&page->lru, &gmap->crst_list);
*table = __pa(new) | _REGION_ENTRY_LENGTH |
(*table & _REGION_ENTRY_TYPE_MASK);
-   page->index = gaddr;
+   page->_pt_s390_gaddr = gaddr;
page = NULL;
}
spin_unlock(&gmap->guest_table_lock);
-   if (page)
+   if (page) {
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
+   }
return 0;
 }
 
@@ -336,12 +342,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
 static unsigned long __gmap_segment_gaddr(unsigned long *entry)
 {
struct page *page;
-   unsigned long offset;
+   unsigned long offset, mask;
 
offset = (unsigned long) entry / sizeof(unsigned long);
offset = (offset & (PTRS_PER_PMD - 1)) * PMD_SIZE;
-   page = pmd_pgtable_page((pmd_t *) entry);
-   return page->index + offset;
+   mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+   page = virt_to_page((void *)((unsigned long) entry & mask));
+
+   return page->_pt_s390_gaddr + offset;
 }
 
 /**
@@ -1351,6 +1359,7 @@ static void gmap_unshadow_pgt(struct gmap *sg, unsigned long raddr)
/* Free page table */
page = phys_to_page(pgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
 }
 
@@ -1379,6 +1388,7 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
/* Free page table */
page = phys_to_page(pgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
}
 }
@@ -1409,6 +1419,7 @@ static void gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr)
/* Free segment table */
page = phys_to_page(sgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
 }
 
@@ -1437,6 +1448,7 @@ static void __gmap_unshadow_r3t(struct gmap *sg, unsigned long raddr,
/* Free segment table */
page = phys_to_page(sgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
}
 }
@@ -1467,6 +1479,7 @@ static void gmap_unshadow_r3t(struct gmap *sg, unsigned long raddr)
/* Free region 3 table */
  

[PATCH v3 06/34] mm: Convert pmd_pgtable_page() to pmd_ptdesc()

2023-05-31 Thread Vishal Moola (Oracle)
Converts pmd_pgtable_page() to pmd_ptdesc() and all its callers. This
removes some direct accesses to struct page, working towards splitting
out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/mm.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 620537e2f94f..3a9c40e90dd7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2912,15 +2912,15 @@ static inline void pgtable_pte_page_dtor(struct page *page)
 
 #if USE_SPLIT_PMD_PTLOCKS
 
-static inline struct page *pmd_pgtable_page(pmd_t *pmd)
+static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 {
unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-   return virt_to_page((void *)((unsigned long) pmd & mask));
+   return virt_to_ptdesc((void *)((unsigned long) pmd & mask));
 }
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_pgtable_page(pmd));
+   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
@@ -2939,7 +2939,7 @@ static inline void pmd_ptlock_free(struct page *page)
ptlock_free(page);
 }
 
-#define pmd_huge_pte(mm, pmd) (pmd_pgtable_page(pmd)->pmd_huge_pte)
+#define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
 
 #else
 
-- 
2.40.1




[PATCH v3 04/34] pgtable: Create struct ptdesc

2023-05-31 Thread Vishal Moola (Oracle)
Currently, page table information is stored within struct page. As part
of simplifying struct page, create struct ptdesc for page table
information.

Signed-off-by: Vishal Moola (Oracle) 
---
 include/linux/pgtable.h | 52 +
 1 file changed, 52 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..c997e9878969 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -975,6 +975,58 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
 #endif /* CONFIG_MMU */
 
+
+/**
+ * struct ptdesc - Memory descriptor for page tables.
+ * @__page_flags: Same as page flags. Unused for page tables.
+ * @pt_list: List of used page tables. Used for s390 and x86.
+ * @_pt_pad_1: Padding that aliases with page's compound head.
+ * @pmd_huge_pte: Protected by ptdesc->ptl, used for THPs.
+ * @_pt_s390_gaddr: Aliases with page's mapping. Used for s390 gmap only.
+ * @pt_mm: Used for x86 pgds.
+ * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
+ * @ptl: Lock for the page table.
+ *
+ * This struct overlays struct page for now. Do not modify without a good
+ * understanding of the issues.
+ */
+struct ptdesc {
+   unsigned long __page_flags;
+
+   union {
+   struct list_head pt_list;
+   struct {
+   unsigned long _pt_pad_1;
+   pgtable_t pmd_huge_pte;
+   };
+   };
+   unsigned long _pt_s390_gaddr;
+
+   union {
+   struct mm_struct *pt_mm;
+   atomic_t pt_frag_refcount;
+   unsigned long index;
+   };
+
+#if ALLOC_SPLIT_PTLOCKS
+   spinlock_t *ptl;
+#else
+   spinlock_t ptl;
+#endif
+};
+
+#define TABLE_MATCH(pg, pt)\
+   static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
+TABLE_MATCH(flags, __page_flags);
+TABLE_MATCH(compound_head, pt_list);
+TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
+TABLE_MATCH(mapping, _pt_s390_gaddr);
+TABLE_MATCH(pt_mm, pt_mm);
+TABLE_MATCH(ptl, ptl);
+#undef TABLE_MATCH
+static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
+
 /*
  * No-op macros that just return the current protection value. Defined here
  * because these macros can be used even if CONFIG_MMU is not defined.
-- 
2.40.1




[PATCH v3 03/34] s390: Use pt_frag_refcount for pagetables

2023-05-31 Thread Vishal Moola (Oracle)
s390 currently uses _refcount to identify fragmented page tables.
The page table struct already has a member pt_frag_refcount used by
powerpc, so have s390 use that instead of the _refcount field as well.
This improves the safety for _refcount and the page table tracking.

This also allows us to simplify the tracking since we can once again use
the lower byte of pt_frag_refcount instead of the upper byte of _refcount.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/s390/mm/pgalloc.c | 38 +++---
 1 file changed, 15 insertions(+), 23 deletions(-)

diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..6b99932abc66 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -182,20 +182,17 @@ void page_table_free_pgste(struct page *page)
  * As follows from the above, no unallocated or fully allocated parent
  * pages are contained in mm_context_t::pgtable_list.
  *
- * The upper byte (bits 24-31) of the parent page _refcount is used
+ * The lower byte (bits 0-7) of the parent page pt_frag_refcount is used
  * for tracking contained 2KB-pgtables and has the following format:
  *
  *   PP  AA
- * 01234567upper byte (bits 24-31) of struct page::_refcount
+ * 01234567upper byte (bits 0-7) of struct page::pt_frag_refcount
  *   ||  ||
  *   ||  |+--- upper 2KB-pgtable is allocated
  *   ||  + lower 2KB-pgtable is allocated
  *   |+--- upper 2KB-pgtable is pending for removal
  *   + lower 2KB-pgtable is pending for removal
  *
- * (See commit 620b4e903179 ("s390: use _refcount for pgtables") on why
- * using _refcount is possible).
- *
  * When 2KB-pgtable is allocated the corresponding AA bit is set to 1.
  * The parent page is either:
  *   - added to mm_context_t::pgtable_list in case the second half of the
@@ -243,11 +240,12 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
if (!list_empty(&mm->context.pgtable_list)) {
page = list_first_entry(&mm->context.pgtable_list,
struct page, lru);
-   mask = atomic_read(&page->_refcount) >> 24;
+   mask = atomic_read(&page->pt_frag_refcount);
/*
 * The pending removal bits must also be checked.
 * Failure to do so might lead to an impossible
-* value of (i.e 0x13 or 0x23) written to _refcount.
+* value of (i.e 0x13 or 0x23) written to
+* pt_frag_refcount.
 * Such values violate the assumption that pending and
 * allocation bits are mutually exclusive, and the rest
 * of the code unrails as result. That could lead to
@@ -259,8 +257,8 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
bit = mask & 1; /* =1 -> second 2K */
if (bit)
table += PTRS_PER_PTE;
-   atomic_xor_bits(&page->_refcount,
-   0x01U << (bit + 24));
+   atomic_xor_bits(&page->pt_frag_refcount,
+   0x01U << bit);
list_del(&page->lru);
}
}
@@ -281,12 +279,12 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = (unsigned long *) page_to_virt(page);
if (mm_alloc_pgste(mm)) {
/* Return 4K page table with PGSTEs */
-   atomic_xor_bits(&page->_refcount, 0x03U << 24);
+   atomic_xor_bits(&page->pt_frag_refcount, 0x03U);
memset64((u64 *)table, _PAGE_INVALID, PTRS_PER_PTE);
memset64((u64 *)table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
} else {
/* Return the first 2K fragment of the page */
-   atomic_xor_bits(&page->_refcount, 0x01U << 24);
+   atomic_xor_bits(&page->pt_frag_refcount, 0x01U);
memset64((u64 *)table, _PAGE_INVALID, 2 * PTRS_PER_PTE);
spin_lock_bh(&mm->context.lock);
list_add(&page->lru, &mm->context.pgtable_list);
@@ -323,22 +321,19 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
 * will happen outside of the critical section from this
 * function or from __tlb_remove_table()
 */
-   mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
-   mask >>= 24;
+   mask = atomic_xor_bits(&page->pt_frag_refcount, 0x11U << bit);
if (mask & 0x03U)
list_add(&page->lru, &mm->context.pgtable_list);
else
list_del(&page->lru);
spin_unlock_bh(&mm->context.lock);
-   mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
-   

Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-05-31 Thread Bjorn Helgaas
On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:
> ...

> Looking at the code I understand where coverity is coming from:
> 
> #define __pci_dev_for_each_res0(dev, res, ...) \
>for (unsigned int __b = 0;  \
> res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> __b++)
> 
>  res will be assigned before __b is checked for being less than
> PCI_NUM_RESOURCES, making it point to behind the array at the end of
> the last loop iteration.
> 
> Rewriting the test expression as
> 
> __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> 
> should avoid the (coverity) warning by making use of lazy evaluation.
> 
> It probably makes the code slightly less performant as res will now be
> checked for being not NULL (which will always be true), but I doubt it
> will be significant (or in any hot paths).

Thanks a lot for looking into this!  I think you're right, and I think
the rewritten expression is more logical as well.  Do you want to post
a patch for it?

Bjorn
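
The difference between the two test expressions discussed above can be reproduced outside the kernel. A minimal sketch follows, assuming a hypothetical four-element array and a hypothetical `slot_n()` accessor in place of `pci_resource_n()`. Unlike the real macro, the accessor returns NULL for an out-of-range index and counts such lookups, so the effect of the evaluation order becomes observable.

```c
#include <stddef.h>

#define NUM_SLOTS 4	/* hypothetical stand-in for PCI_NUM_RESOURCES */

static int slots[NUM_SLOTS];
static unsigned int oob_lookups;	/* lookups made with idx >= NUM_SLOTS */

/* Hypothetical accessor: counts out-of-range lookups instead of performing them. */
static int *slot_n(unsigned int idx)
{
	if (idx >= NUM_SLOTS) {
		oob_lookups++;
		return NULL;
	}
	return &slots[idx];
}

/* Comma form: the lookup runs before the bound check on every evaluation. */
static unsigned int walk_comma(void)
{
	unsigned int visited = 0;
	int *res;

	oob_lookups = 0;
	for (unsigned int b = 0; res = slot_n(b), b < NUM_SLOTS; b++)
		visited += (res != NULL);
	return visited;
}

/* Short-circuit form: the lookup runs only when the index is in range. */
static unsigned int walk_and(void)
{
	unsigned int visited = 0;
	int *res;

	oob_lookups = 0;
	for (unsigned int b = 0; b < NUM_SLOTS && (res = slot_n(b)); b++)
		visited += (res != NULL);
	return visited;
}
```

Both loops visit the same elements. Only the comma form performs one extra lookup with an index equal to the array size after the last iteration, which is the access the static analyzer objects to.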



[xen-unstable test] 181027: tolerable FAIL - PUSHED

2023-05-31 Thread osstest service owner
flight 181027 xen-unstable real [real]
flight 181056 xen-unstable real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/181027/
http://logs.test-lab.xenproject.org/osstest/logs/181056/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-xl-pvhv2-amd 13 debian-fixup fail pass in 181056-retest
 test-amd64-amd64-libvirt-pair 28 guest-migrate/dst_host/src_host/debian.repeat fail pass in 181056-retest

Tests which did not succeed, but are not blocking:
 test-amd64-amd64-xl-qemuu-debianhvm-i386-xsm 12 debian-hvm-install fail like 180976
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop fail like 181007
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 181007
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 181007
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 181007
 test-amd64-i386-xl-qemut-ws16-amd64 19 guest-stop fail like 181007
 test-amd64-i386-xl-qemut-win7-amd64 19 guest-stop fail like 181007
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail like 181007
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 181007
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 181007
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop fail like 181007
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 181007
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 181007
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 15 migrate-support-check fail never pass
 test-arm64-arm64-xl 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-thunderx 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-thunderx 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit1 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-xsm 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-credit2 15 migrate-support-check fail never pass
 test-arm64-arm64-xl-credit2 16 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit2 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2 16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 15 migrate-support-check fail never pass
 test-armhf-armhf-xl 16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1 16 saverestore-support-check fail never pass
 test-arm64-arm64-xl-vhd 14 migrate-support-check fail never pass
 test-arm64-arm64-xl-vhd 15 saverestore-support-check fail never pass
 test-amd64-i386-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-vhd 15 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-arndale 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail never pass

version targeted for testing:
 xen  

[qemu-mainline test] 181041: regressions - trouble: blocked/broken/fail/pass

2023-05-31 Thread osstest service owner
flight 181041 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181041/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-armhf  broken
 test-amd64-amd64-libvirt-pair 30 leak-check/check/src_host fail REGR. vs. 180691
 test-amd64-amd64-libvirt-pair 31 leak-check/check/dst_host fail REGR. vs. 180691
 test-amd64-i386-libvirt 23 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-xsm 23 leak-check/check fail REGR. vs. 180691
 build-arm64-xsm 6 xen-build fail REGR. vs. 180691
 build-arm64 6 xen-build fail REGR. vs. 180691
 test-amd64-amd64-libvirt 23 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-xsm 23 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-pair 30 leak-check/check/src_host fail REGR. vs. 180691
 test-amd64-i386-libvirt-pair 31 leak-check/check/dst_host fail REGR. vs. 180691
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 build-armhf 5 host-build-prep fail REGR. vs. 180691
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-vhd 19 guest-start/debian.repeat fail REGR. vs. 180691
 test-amd64-i386-xl-vhd 24 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-raw 22 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-xl-qcow2 24 leak-check/check fail REGR. vs. 180691

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-vhd   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-qcow2  1 build-check(1)   blocked  n/a
 test-armhf-armhf-libvirt-raw  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-arndale   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-multivcpu  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-rtds  1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-vhd   1 build-check(1)   blocked  n/a
 build-arm64-libvirt   1 build-check(1)   blocked  n/a
 build-armhf-libvirt   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-credit1   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl-credit2   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 180691
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180691
 test-amd64-i386-xl-qemuu-win7-amd64 19 guest-stop fail like 180691
 test-amd64-i386-xl-pvshim 14 guest-start  fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-check fail   never pass

version targeted for testing:
 qemuu 51bdb0b57a2d9e84d6915fbae7b5d76c8820cf3c
baseline version:
 qemuu 6972ef1440a9d685482d78672620a7482f2bd09a

Last test of basis   180691  2023-05-17 10:45:22 Z   14 days
Failing since        180699  2023-05-18 07:21:24 Z   13 days   51 attempts
Testing same since   181041  2023-05-31 15:02:29 Z    0 days    1 attempts


People who touched revisions under test:
  Afonso Bordado 
  Akihiko Odaki 
  Akihiro Suda 
  Alex Bennée 
  Alex Williamson 
  Alexander Bulekov 
  Alexander Graf 
  Alistair Francis 
  Ani 

Re: [XEN][PATCH v6 02/19] common/device_tree: handle memory allocation failure in __unflatten_device_tree()

2023-05-31 Thread Vikram Garhwal

Hi Michal,

On 5/5/23 2:38 AM, Michal Orzel wrote:

On 03/05/2023 01:36, Vikram Garhwal wrote:


Change __unflatten_device_tree() return type to integer so it can propagate
memory allocation failure. Add panic() in dt_unflatten_host_device_tree() for
memory allocation failure during boot.

Signed-off-by: Vikram Garhwal 

I think we are missing a Fixes tag.

Like the below line?
Fixes: fb97eb6 ("xen/arm: Create a hierarchical device tree")

Original patch for your reference: 
https://github.com/xen-project/xen/commit/fb97eb614acfbcc812098bbbe5dde99271fe0a0d


Regards,
Vikram


---
  xen/common/device_tree.c | 13 ++---
  1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/xen/common/device_tree.c b/xen/common/device_tree.c
index 5f7ae45304..fc38a0b3dd 100644
--- a/xen/common/device_tree.c
+++ b/xen/common/device_tree.c
@@ -2056,8 +2056,8 @@ static unsigned long unflatten_dt_node(const void *fdt,
   * @fdt: The fdt to expand
   * @mynodes: The device_node tree created by the call
   */
-static void __init __unflatten_device_tree(const void *fdt,
-   struct dt_device_node **mynodes)
+static int __init __unflatten_device_tree(const void *fdt,
+  struct dt_device_node **mynodes)
  {
  unsigned long start, mem, size;
  struct dt_device_node **allnextp = mynodes;
@@ -2078,6 +2078,8 @@ static void __init __unflatten_device_tree(const void *fdt,

  /* Allocate memory for the expanded device tree */
  mem = (unsigned long)_xmalloc (size + 4, __alignof__(struct dt_device_node));
+if ( !mem )
+return -ENOMEM;

  ((__be32 *)mem)[size / 4] = cpu_to_be32(0xdeadbeef);

@@ -2095,6 +2097,8 @@ static void __init __unflatten_device_tree(const void *fdt,
  *allnextp = NULL;

  dt_dprintk(" <- unflatten_device_tree()\n");
+
+return 0;
  }

  static void dt_alias_add(struct dt_alias_prop *ap,
@@ -2179,7 +2183,10 @@ dt_find_interrupt_controller(const struct dt_device_match *matches)

  void __init dt_unflatten_host_device_tree(void)
  {
-__unflatten_device_tree(device_tree_flattened, &dt_host);
+int error = __unflatten_device_tree(device_tree_flattened, &dt_host);

NIT: there should be a blank line between definitions and rest of the code


+if ( error )
+panic("__unflatten_device_tree failed with error %d\n", error);
+
  dt_alias_scan();
  }

--
2.17.1



FWICS, patches 2 and 4 are not strictly related to DTBO and are fixing issues
and propagating errors which is always good. Therefore by moving them to the
start of the series, they could be merged right away reducing the number of
patches to review. At the moment, they can't be because patch 3 placed
in-between is strictly related to the series.

@julien?

~Michal






Re: [RFC PATCH v1 0/9] Hypervisor-Enforced Kernel Integrity

2023-05-31 Thread Sean Christopherson
On Tue, May 30, 2023, Rick P Edgecombe wrote:
> On Fri, 2023-05-26 at 17:22 +0200, Mickaël Salaün wrote:
> > > > Can the guest kernel ask the host VMM's emulated devices to DMA into
> > > > the protected data? It should go through the host userspace mappings I
> > > > think, which don't care about EPT permissions. Or did I miss where you
> > > > are protecting that another way? There are a lot of easy ways to ask
> > > > the host to write to guest memory that don't involve the EPT. You
> > > > probably need to protect the host userspace mappings, and also the
> > > > places in KVM that kmap a GPA provided by the guest.
> > > 
> > > Good point, I'll check this confused deputy attack. Extended KVM
> > > protections should indeed handle all ways to map guests' memory.  I'm
> > > wondering if current VMMs would gracefully handle such new restrictions
> > > though.
> > 
> > I guess the host could map arbitrary data to the guest, so that need to be
> > handled, but how could the VMM (not the host kernel) bypass/update EPT
> > initially used for the guest (and potentially later mapped to the host)?
> 
> Well traditionally both QEMU and KVM accessed guest memory via host
> mappings instead of the EPT. So I'm wondering what is stopping the
> guest from passing a protected gfn when setting up the DMA, and QEMU
> being enticed to write to it? The emulator as well would use these host
> userspace mappings and not consult the EPT IIRC.
> 
> I think Sean was suggesting host userspace should be more involved in
> this process, so perhaps it could protect its own alias of the
> protected memory, for example mprotect() it as read-only.

Ya, though "suggesting" is really "demanding, unless someone provides super
strong justification for handling this directly in KVM".  It's basically the
same argument that led to Linux Security Modules: I'm all for KVM providing the
framework and plumbing, but I don't want KVM to get involved in defining
policy, threat models, etc.
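
To make the mprotect() idea quoted above concrete, here is a minimal sketch of a
userspace process write-protecting its own alias of a page. It assumes a Linux
host and uses only the POSIX mmap()/mprotect() calls; map_alias() and
protect_alias() are hypothetical names for illustration, and the anonymous page
merely stands in for a VMM's mapping of guest memory. It is not taken from QEMU
or KVM.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one anonymous read-write page (stand-in for a guest-memory alias). */
static void *map_alias(size_t *len)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(len != NULL);
    if ( page <= 0 )
        return NULL;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if ( p == MAP_FAILED )
        return NULL;

    *len = (size_t)page;
    return p;
}

/* Drop write permission on the alias; returns 0 on success, -1 on error. */
static int protect_alias(void *addr, size_t len)
{
    return mprotect(addr, len, PROT_READ);
}
```

After protect_alias() succeeds, reads through the alias still work, while a
write from the same process would fault (SIGSEGV) instead of silently modifying
the page.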



Re: [PATCH v6 5/6] xen/riscv: introduce an implementation of macros from

2023-05-31 Thread Oleksii
On Tue, 2023-05-30 at 18:00 +0200, Jan Beulich wrote:
> > +static uint32_t read_instr(unsigned long pc)
> > +{
> > +    uint16_t instr16 = *(uint16_t *)pc;
> > +
> > +    if ( GET_INSN_LENGTH(instr16) == 2 )
> > +        return (uint32_t)instr16;
> > +    else
> > +        return *(uint32_t *)pc;
> > +}
> 
> As long as this function is only used on Xen code, it's kind of okay.
> There you/we control whether code can change behind our backs. But as
> soon as you might use this on guest code, the double read is going to
> be a problem
Would it be enough to add a comment saying that read_instr() should be used
only on Xen code? Or does some lock need to be introduced?

> (I think; I wonder how hardware is supposed to deal with
> the situation: Maybe they indeed fetch in 16-bit quantities?).
I thought that it reads an amount of bytes corresponding to the i-cache size,
and then the pipeline tracks whether an instruction is 16 or 32 bits wide.

At least something similar is done for BOOM RISC-V CPU [1].

[1]
https://github.com/riscv-boom/riscv-boom/blob/master/docs/sections/instruction-fetch-stage.rst#id64
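
For illustration only, one way to avoid re-reading the low halfword is to load
it once and, for a 32-bit instruction, load only the upper halfword afterwards.
The sketch below is a standalone assumption, not the patch's code:
read_instr_once(), load_le16() and insn_length() are hypothetical names, and
insn_length() stands in for GET_INSN_LENGTH() using the rule that low bits 0b11
mark a 32-bit encoding. It only avoids the double read at source level; code
touching guest memory would still need the proper guest-access helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a little-endian halfword from two single-byte loads. */
static uint16_t load_le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((uint16_t)p[1] << 8));
}

/* Hypothetical stand-in for GET_INSN_LENGTH(): low bits 0b11 mean 32-bit. */
static unsigned int insn_length(uint16_t low)
{
    return ((low & 0x3) == 0x3) ? 4 : 2;
}

/* Read one instruction; the low halfword is loaded exactly once. */
static uint32_t read_instr_once(const uint8_t *pc)
{
    uint16_t low, high;

    assert(pc != NULL);

    low = load_le16(pc);
    if ( insn_length(low) == 2 )
        return low;

    /* 32-bit encoding: fetch only the upper half, reuse the low half. */
    high = load_le16(pc + 2);
    return ((uint32_t)high << 16) | low;
}
```

The decision about the instruction length and the returned low bits then come
from the same load, so they cannot disagree even if the memory changes between
the two loads.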



Re: [PATCH v3 0/6] block: add blk_io_plug_call() API

2023-05-31 Thread Stefan Hajnoczi
Hi Kevin,
Do you want to review the thread-local blk_io_plug() patch series or
should I merge it?

Thanks,
Stefan


signature.asc
Description: PGP signature


[libvirt test] 181023: tolerable FAIL - PUSHED

2023-05-31 Thread osstest service owner
flight 181023 libvirt real [real]
flight 181049 libvirt real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/181023/
http://logs.test-lab.xenproject.org/osstest/logs/181049/

Failures :-/ but no regressions.

Tests which are failing intermittently (not blocking):
 test-amd64-amd64-libvirt-vhd 19 guest-start/debian.repeat fail pass in 181049-retest

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check   fail like 180985
 test-armhf-armhf-libvirt 16 saverestore-support-check fail  like 180985
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail  like 180985
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-xsm 16 saverestore-support-check fail   never pass
 test-arm64-arm64-libvirt 15 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt 16 saverestore-support-check fail   never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-check fail   never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-qcow2 14 migrate-support-check fail never pass
 test-arm64-arm64-libvirt-qcow2 15 saverestore-support-check fail never pass
 test-arm64-arm64-libvirt-raw 14 migrate-support-check fail   never pass
 test-arm64-arm64-libvirt-raw 15 saverestore-support-check fail   never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail   never pass
 test-armhf-armhf-libvirt-raw 14 migrate-support-check fail   never pass

version targeted for testing:
 libvirt  9222f35dc6917f00d166be3bb69ac4e5ff8536f0
baseline version:
 libvirt  e35b5df3f561ea5678a21aa1b39f14308fc6363c

Last test of basis   180985  2023-05-28 04:18:52 Z    3 days
Testing same since   181023  2023-05-31 04:21:47 Z    0 days    1 attempts


People who touched revisions under test:
  Michal Privoznik 
  김인수 

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvops  pass
 build-arm64-pvops  pass
 build-armhf-pvops  pass
 build-i386-pvops pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm   pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm  pass
 test-amd64-amd64-libvirt-xsm pass
 test-arm64-arm64-libvirt-xsm pass
 test-amd64-i386-libvirt-xsm  pass
 test-amd64-amd64-libvirt pass
 test-arm64-arm64-libvirt pass
 test-armhf-armhf-libvirt pass
 test-amd64-i386-libvirt  pass
 test-amd64-amd64-libvirt-pair  pass
 test-amd64-i386-libvirt-pair pass
 test-arm64-arm64-libvirt-qcow2   pass
 test-armhf-armhf-libvirt-qcow2   pass
 test-arm64-arm64-libvirt-raw pass
 test-armhf-armhf-libvirt-raw pass
 test-amd64-i386-libvirt-raw  pass
 test-amd64-amd64-libvirt-vhd fail




[xen-unstable-smoke test] 181044: regressions - trouble: blocked/broken/pass

2023-05-31 Thread osstest service owner
flight 181044 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181044/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-armhf  broken
 build-armhf   5 host-build-prep  fail REGR. vs. 181018

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-xl   1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt 15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  15 migrate-support-check fail   never pass
 test-arm64-arm64-xl-xsm  16 saverestore-support-check fail   never pass

version targeted for testing:
 xen  04f25e9048c375898430a58e1c570806896252cb
baseline version:
 xen  94200e1bae07e725cc07238c11569c5cab7befb7

Last test of basis   181018  2023-05-30 20:00:24 Z    0 days
Failing since        181031  2023-05-31 11:00:27 Z    0 days    3 attempts
Testing same since   181044  2023-05-31 16:01:57 Z    0 days    1 attempts


People who touched revisions under test:
  Bobby Eshleman 
  Jan Beulich 
  Juergen Gross 
  Oleksii Kurochko 
  Roger Pau Monné 
  Stefano Stabellini 

jobs:
 build-arm64-xsm  pass
 build-amd64  pass
 build-armhf  broken  
 build-amd64-libvirt  pass
 test-armhf-armhf-xl  blocked 
 test-arm64-arm64-xl-xsm  pass
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  pass
 test-amd64-amd64-libvirt pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary

broken-job build-armhf broken

Not pushing.


commit 04f25e9048c375898430a58e1c570806896252cb
Author: Jan Beulich 
Date:   Wed May 31 16:04:30 2023 +0200

vPCI: fix test harness build

The earlier commit introduced two uses of is_hardware_domain().

Fixes: 465217b0f872 ("vPCI: account for hidden devices")
Reported-by: Andrew Cooper 
Signed-off-by: Jan Beulich 
Acked-by: Roger Pau Monné 

commit 7a2f0ba0d08562fc09c6dd865c6cb3468185be1f
Author: Jan Beulich 
Date:   Wed May 31 16:04:12 2023 +0200

vPCI: add test harness entry to ./MAINTAINERS

Signed-off-by: Jan Beulich 
Acked-by: Roger Pau Monné 

commit 465217b0f872602b4084a1b0fa2ef75377cb3589
Author: Jan Beulich 
Date:   Wed May 31 12:01:11 2023 +0200

vPCI: account for hidden devices

Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
console) are associated with DomXEN, not Dom0. This means that while
looking for overlapping BARs such devices cannot be found on Dom0's list
of devices; DomXEN's list also needs to be scanned.

Suppress vPCI init altogether for r/o devices (which constitute a subset
of hidden ones).

Signed-off-by: Jan Beulich 
Reviewed-by: Roger Pau Monné 
Tested-by: Stefano Stabellini 

commit 445fdc641e304ff41a544f8f5926a13b604c08ad
Author: Juergen Gross 
Date:   Wed May 31 12:00:40 2023 +0200

xen/include/public: fix 9pfs xenstore path description

In xen/include/public/io/9pfs.h the name of the Xenstore backend node
"security-model" should be "security_model", as this is how the Xen
tools are creating it and qemu is reading it.

Fixes: ad58142e73a9 ("xen/public: move xenstore related doc into 9pfs.h")
Fixes: cf1d2d22fdfd ("docs/misc: Xen transport for 9pfs")
Signed-off-by: Juergen Gross 
Reviewed-by: Jason Andryuk 
Acked-by: Stefano Stabellini 

commit 0f80a46ffa6bfd5d111fc2e64ee5983513627e4d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:13 2023 +0200

xen/riscv: remove dummy_bss variable

After introduction of initial pagetables there is no any sense
in dummy_bss variable as bss section will not be empty anymore.

Signed-off-by: Oleksii Kurochko 
Acked-by: Bobby Eshleman 

commit 0d74fc2b2f85586ceb5672aedc79c666e529381d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:05 2023 +0200

xen/riscv: setup initial pagetables

The patch does two thing:
1. Setup 

[xen-unstable-smoke bisection] complete build-amd64

2023-05-31 Thread osstest service owner
branch xen-unstable-smoke
xenbranch xen-unstable-smoke
job build-amd64
testid xen-build

Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
Tree: qemuu git://xenbits.xen.org/qemu-xen.git
Tree: xen git://xenbits.xen.org/xen.git

*** Found and reproduced problem changeset ***

  Bug is in tree:  xen git://xenbits.xen.org/xen.git
  Bug introduced:  465217b0f872602b4084a1b0fa2ef75377cb3589
  Bug not present: 445fdc641e304ff41a544f8f5926a13b604c08ad
  Last fail repro: http://logs.test-lab.xenproject.org/osstest/logs/181050/


  commit 465217b0f872602b4084a1b0fa2ef75377cb3589
  Author: Jan Beulich 
  Date:   Wed May 31 12:01:11 2023 +0200
  
  vPCI: account for hidden devices
  
  Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
  console) are associated with DomXEN, not Dom0. This means that while
  looking for overlapping BARs such devices cannot be found on Dom0's list
  of devices; DomXEN's list also needs to be scanned.
  
  Suppress vPCI init altogether for r/o devices (which constitute a subset
  of hidden ones).
  
  Signed-off-by: Jan Beulich 
  Reviewed-by: Roger Pau Monné 
  Tested-by: Stefano Stabellini 


For bisection revision-tuple graph see:
   http://logs.test-lab.xenproject.org/osstest/results/bisect/xen-unstable-smoke/build-amd64.xen-build.html
Revision IDs in each graph node refer, respectively, to the Trees above.


Running cs-bisection-step 
--graph-out=/home/logs/results/bisect/xen-unstable-smoke/build-amd64.xen-build 
--summary-out=tmp/181050.bisection-summary --basis-template=181018 
--blessings=real,real-bisect,real-retry xen-unstable-smoke build-amd64 xen-build
Searching for failure / basis pass:
 181035 fail [host=himrod2] / 181018 [host=himrod0] 181016 [host=himrod0] 
180963 ok.
Failure / basis pass flights: 181035 / 180963
(tree with no url: minios)
(tree with no url: ovmf)
(tree with no url: seabios)
Tree: qemu git://xenbits.xen.org/qemu-xen-traditional.git
Tree: qemuu git://xenbits.xen.org/qemu-xen.git
Tree: xen git://xenbits.xen.org/xen.git
Latest 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
Basis pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
f54dd5b53ee516fa1d4c106e0744ce0083acfcdc
Generating revisions with ./adhoc-revtuple-generator  
git://xenbits.xen.org/qemu-xen-traditional.git#3d273dd05e51e5a1ffba3d98c7437ee84e8f8764-3d273dd05e51e5a1ffba3d98c7437ee84e8f8764
 
git://xenbits.xen.org/qemu-xen.git#8c51cd970509b97d8378d175646ec32889828158-8c51cd970509b97d8378d175646ec32889828158
 
git://xenbits.xen.org/xen.git#f54dd5b53ee516fa1d4c106e0744ce0083acfcdc-465217b0f872602b4084a1b0fa2ef75377cb3589
Loaded 5002 nodes in revision graph
Searching for test results:
 180963 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
f54dd5b53ee516fa1d4c106e0744ce0083acfcdc
 181016 [host=himrod0]
 181018 [host=himrod0]
 181037 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
8347d6bb29bfd0c3b5acdc078574a8643c5a5637
 181031 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
 181032 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
f54dd5b53ee516fa1d4c106e0744ce0083acfcdc
 181034 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
 181038 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
e66003e7be1996c9dd8daca54ba34ad5bb58d668
 181039 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
0d74fc2b2f85586ceb5672aedc79c666e529381d
 181040 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
0f80a46ffa6bfd5d111fc2e64ee5983513627e4d
 181035 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
 181043 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
445fdc641e304ff41a544f8f5926a13b604c08ad
 181045 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
 181046 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
445fdc641e304ff41a544f8f5926a13b604c08ad
 181047 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
 181048 pass 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
445fdc641e304ff41a544f8f5926a13b604c08ad
 181050 fail 3d273dd05e51e5a1ffba3d98c7437ee84e8f8764 
8c51cd970509b97d8378d175646ec32889828158 
465217b0f872602b4084a1b0fa2ef75377cb3589
Searching 

[PATCH] x86/microcode: Prevent attempting updates known to fail

2023-05-31 Thread Alejandro Vallejo
If IA32_MSR_MCU_CONTROL exists, then it's possible a CPU may be unable to
perform microcode updates. This is controlled through the DIS_MCU_LOAD bit.

This patch checks that the CPU that got the request is capable of doing an
update. If it is, then we let the procedure go through. While not enough
for the general case (different CPUs with different settings), this patch
copes with the far more common scenario of all CPUs being locked.

Note that for the uncommon general case, we already have some logic in
place to emit a message in xl dmesg in order to notify the admin that they
should reboot the machine ASAP.

Signed-off-by: Alejandro Vallejo 
---
 xen/arch/x86/cpu/microcode/core.c | 27 +++
 xen/arch/x86/include/asm/cpufeature.h |  1 +
 xen/arch/x86/include/asm/msr-index.h  |  5 +
 3 files changed, 33 insertions(+)

diff --git a/xen/arch/x86/cpu/microcode/core.c b/xen/arch/x86/cpu/microcode/core.c
index cd456c476f..e507945932 100644
--- a/xen/arch/x86/cpu/microcode/core.c
+++ b/xen/arch/x86/cpu/microcode/core.c
@@ -697,6 +697,17 @@ static long cf_check microcode_update_helper(void *data)
     return ret;
 }
 
+static bool this_cpu_can_install_update(void)
+{
+    uint64_t mcu_ctrl;
+
+    if ( !cpu_has_mcu_ctrl )
+        return true;
+
+    rdmsrl(MSR_MCU_CONTROL, mcu_ctrl);
+    return !(mcu_ctrl & MCU_CONTROL_DIS_MCU_LOAD);
+}
+
 int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
 {
     int ret;
@@ -708,6 +719,22 @@ int microcode_update(XEN_GUEST_HANDLE(const_void) buf, unsigned long len)
     if ( !ucode_ops.apply_microcode )
         return -EINVAL;
 
+    if ( !this_cpu_can_install_update() )
+    {
+        /*
+         * This CPU can't install microcode, so it makes no sense to try to
+         * go on. We're implicitly trusting firmware sanity in that all
+         * CPUs are expected to have a homogeneous setting. If, for some
+         * reason, another CPU happens to be locked down when this one
+         * isn't then unpleasantness will follow. In particular, some CPUs
+         * will be updated while others will not. A very stern message will
+         * be displayed in xl-dmesg that case, strongly advising to reboot the
+         * machine.
+         */
+        printk("WARNING: microcode not installed due to DIS_MCU_LOAD=1");
+        return -EACCES;
+    }
+
     buffer = xmalloc_flex_struct(struct ucode_buf, buffer, len);
     if ( !buffer )
         return -ENOMEM;
diff --git a/xen/arch/x86/include/asm/cpufeature.h b/xen/arch/x86/include/asm/cpufeature.h
index ace31e3b1f..0118171d7e 100644
--- a/xen/arch/x86/include/asm/cpufeature.h
+++ b/xen/arch/x86/include/asm/cpufeature.h
@@ -192,6 +192,7 @@ static inline bool boot_cpu_has(unsigned int feat)
 #define cpu_has_if_pschange_mc_no boot_cpu_has(X86_FEATURE_IF_PSCHANGE_MC_NO)
 #define cpu_has_tsx_ctrl        boot_cpu_has(X86_FEATURE_TSX_CTRL)
 #define cpu_has_taa_no          boot_cpu_has(X86_FEATURE_TAA_NO)
+#define cpu_has_mcu_ctrl        boot_cpu_has(X86_FEATURE_MCU_CTRL)
 #define cpu_has_fb_clear        boot_cpu_has(X86_FEATURE_FB_CLEAR)
 
 /* Synthesized. */
diff --git a/xen/arch/x86/include/asm/msr-index.h b/xen/arch/x86/include/asm/msr-index.h
index 2749e433d2..5c1350b5f9 100644
--- a/xen/arch/x86/include/asm/msr-index.h
+++ b/xen/arch/x86/include/asm/msr-index.h
@@ -165,6 +165,11 @@
 #define  PASID_PASID_MASK   0x000f
 #define  PASID_VALID        (_AC(1, ULL) << 31)
 
+#define MSR_MCU_CONTROL 0x1406
+#define  MCU_CONTROL_LOCK   (_AC(1, ULL) <<  0)
+#define  MCU_CONTROL_DIS_MCU_LOAD   (_AC(1, ULL) <<  1)
+#define  MCU_CONTROL_EN_SMM_BYPASS  (_AC(1, ULL) <<  2)
+
 #define MSR_UARCH_MISC_CTRL 0x1b01
 #define  UARCH_CTRL_DOITM   (_AC(1, ULL) <<  0)
 
-- 
2.34.1
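
As a standalone illustration of the check described in this patch, the sketch
below applies the same bit test to a plain 64-bit value instead of reading the
MSR. The bit positions are the ones defined in the msr-index.h hunk above;
can_install_update() and its has_mcu_ctrl parameter are hypothetical stand-ins
for this_cpu_can_install_update() and cpu_has_mcu_ctrl, so that the logic can be
exercised without hardware access.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit layout as given in the patch's msr-index.h hunk. */
#define MCU_CONTROL_LOCK          (UINT64_C(1) << 0)
#define MCU_CONTROL_DIS_MCU_LOAD  (UINT64_C(1) << 1)
#define MCU_CONTROL_EN_SMM_BYPASS (UINT64_C(1) << 2)

/*
 * Hypothetical pure-function variant of the check: without the MSR the
 * update is assumed possible, otherwise DIS_MCU_LOAD decides.
 */
static bool can_install_update(bool has_mcu_ctrl, uint64_t mcu_ctrl)
{
    if ( !has_mcu_ctrl )
        return true;

    return !(mcu_ctrl & MCU_CONTROL_DIS_MCU_LOAD);
}
```

Only DIS_MCU_LOAD influences the result; LOCK and EN_SMM_BYPASS are ignored by
this check, matching the helper in the patch.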




Re: [PATCH RFC v2] vPCI: account for hidden devices

2023-05-31 Thread Stefano Stabellini
On Wed, 31 May 2023, Jan Beulich wrote:
> On 31.05.2023 00:38, Stefano Stabellini wrote:
> > On Fri, 26 May 2023, Jan Beulich wrote:
> >> On 25.05.2023 21:24, Stefano Stabellini wrote:
> >>> On Thu, 25 May 2023, Jan Beulich wrote:
>  On 25.05.2023 01:37, Stefano Stabellini wrote:
> > On Wed, 24 May 2023, Jan Beulich wrote:
>  RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>   modify_bars() to consistently respect BARs of hidden devices while
>   setting up "normal" ones (i.e. to avoid as much as possible the
>   "continue" path introduced here), setting up of the former may want
>   doing first.
> >>>
> >>> But BARs of hidden devices should be mapped into dom0 physmap?
> >>
> >> Yes.
> >
> > The BARs would be mapped read-only (not read-write), right? Otherwise we
> > let dom0 access devices that belong to Xen, which doesn't seem like a
> > good idea.
> >
> > But even if we map the BARs read-only, what is the benefit of mapping
> > them to Dom0? If Dom0 loads a driver for it and the driver wants to
> > initialize the device, the driver will crash because the MMIO region is
> > read-only instead of read-write, right?
> >
> > How does this device hiding work for dom0? How does dom0 know not to
> > access a device that is present on the PCI bus but is used by Xen?
> 
>  None of these are new questions - this has all been this way for PV Dom0,
>  and so far we've limped along quite okay. That's not to say that we
>  shouldn't improve things if we can, but that first requires ideas as to
>  how.
> >>>
> >>> For PV, that was OK because PV requires extensive guest modifications
> >>> anyway. We only run Linux and few BSDs as Dom0. So, making the interface
> >>> cleaner and reducing guest changes is nice-to-have but not critical.
> >>>
> >>> For PVH, this is different. One of the top reasons for AMD to work on
> >>> PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
> >>> dom0less/hyperlaunch). It could be anything from Zephyr to a
> >>> proprietary RTOS like VxWorks. Minimal guest changes for advanced
> >>> features (e.g. Dom0 S3) might be OK but in general I think we should aim
> >>> at (almost) zero guest changes. On ARM, it is already the case (with some
> >>> non-upstream patches for dom0less PCI.)
> >>>
> >>> For this specific patch, which is necessary to enable PVH on AMD x86 in
> >>> gitlab-ci, we can do anything we want to make it move faster. But
> >>> medium/long term I think we should try to make non-Xen-aware PVH Dom0
> >>> possible.
> >>
> >> I don't think Linux could boot as PVH Dom0 without any awareness. Hence
> >> I guess it's not easy to see how other OSes might. What you're after
> >> looks rather like a HVM Dom0 to me, with it being unclear where the
> >> external emulator then would run (in a stubdom maybe, which might be
> >> possible to arrange for via the dom0less way of creating boot time
> >> DomU-s) and how it would get any necessary xenstore based information.
> > 
> > I know that Linux has lots of Xen awareness scattered everywhere so it
> > is difficult to tell what's what. Leaving the PVH entry point aside for
> > this discussion, what else is really needed for a Linux without
> > CONFIG_XEN to boot as PVH Dom0?
> > 
> > Same question from a different angle: let's say that we boot Zephyr or
> > another RTOS as HVM Dom0, what is really required for the emulator to
> > emulate? I am hoping that the answer is "nothing" except for maybe a
> > UART.
> > 
> > It comes down to how much legacy stuff the guest OS expects to find.
> > Legacy stuff that would normally be emulated by QEMU. I am counting on
> > the fact that a modern OS doesn't expect any of the legacy stuff (e.g.
> > PIIX3/Q35/E1000) if it is not advertised in the firmware tables.
> 
> And that's where I expect the problems start: We don't really alter
> things like the DSDT and SSDTs, and we also don't parse them. So we
> won't know what firmware describes there. Hence we have to expect that
> any legacy device might be present in the underlying platform, and
> hence would also need offering either by passing through or by
> emulation. Yet we can't sensibly emulate everything in Xen itself.

I see your point, thanks for the explanation. I can see it might require
some work in that area, either by removing those devices from the
firmware tables (that we currently don't even parse) or by passing
through those devices when possible. FYI there is also an ACPI SOT table
[1] that could maybe be used for this, but nobody has used it so far.

[1] https://wiki.xenproject.org/images/0/02/Status-override-table.pdf



Re: [PATCH] xen/cpu-policy: Add an IBRS -> AUTO_IBRS dependency

2023-05-31 Thread Alejandro Vallejo
On Wed, May 31, 2023 at 04:30:28PM +0100, Andrew Cooper wrote:
> AUTO_IBRS is an extension over regular (AMD) IBRS, and needs hiding if IBRS is
> levelled out for any reason.
True that. My bad.

> ---
>  xen/tools/gen-cpuid.py | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
> index f28ff708a2fc..973fcc1c64e8 100755
> --- a/xen/tools/gen-cpuid.py
> +++ b/xen/tools/gen-cpuid.py
> @@ -319,7 +319,7 @@ def crunch_numbers(state):
>  # as dependent features simplifies Xen's logic, and prevents the guest
>  # from seeing implausible configurations.
>  IBRSB: [STIBP, SSBD, INTEL_PSFD],
> -IBRS: [AMD_STIBP, AMD_SSBD, PSFD,
> +IBRS: [AMD_STIBP, AMD_SSBD, PSFD, AUTO_IBRS,
> IBRS_ALWAYS, IBRS_FAST, IBRS_SAME_MODE],
>  AMD_STIBP: [STIBP_ALWAYS],
>  
> 
> base-commit: 465217b0f872602b4084a1b0fa2ef75377cb3589
> -- 
> 2.30.2
> 

LGTM

Alejandro



[ovmf test] 181036: all pass - PUSHED

2023-05-31 Thread osstest service owner
flight 181036 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181036/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf 9f12d6b6ecf8ffe9cd4d93fe0976fdbaf2ded4f0
baseline version:
 ovmf d15d2667d58d40c0748919ac4b5771b875c0780b

Last test of basis   181028  2023-05-31 09:12:10 Z    0 days
Testing same since   181036  2023-05-31 13:10:43 Z    0 days    1 attempts


People who touched revisions under test:
  Zhihao Li 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvops  pass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   d15d2667d5..9f12d6b6ec  9f12d6b6ecf8ffe9cd4d93fe0976fdbaf2ded4f0 -> xen-tested-master



[PATCH] xen/cpu-policy: Add an IBRS -> AUTO_IBRS dependency

2023-05-31 Thread Andrew Cooper
AUTO_IBRS is an extension over regular (AMD) IBRS, and needs hiding if IBRS is
levelled out for any reason.

Fixes: defaf651631a ("x86/hvm: Expose Automatic IBRS to guests")
Signed-off-by: Andrew Cooper 
---
CC: Jan Beulich 
CC: Roger Pau Monné 
CC: Wei Liu 
CC: Alejandro Vallejo 

This was an oversight of mine when reviewing the aforementioned patch.
---
 xen/tools/gen-cpuid.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/tools/gen-cpuid.py b/xen/tools/gen-cpuid.py
index f28ff708a2fc..973fcc1c64e8 100755
--- a/xen/tools/gen-cpuid.py
+++ b/xen/tools/gen-cpuid.py
@@ -319,7 +319,7 @@ def crunch_numbers(state):
 # as dependent features simplifies Xen's logic, and prevents the guest
 # from seeing implausible configurations.
 IBRSB: [STIBP, SSBD, INTEL_PSFD],
-IBRS: [AMD_STIBP, AMD_SSBD, PSFD,
+IBRS: [AMD_STIBP, AMD_SSBD, PSFD, AUTO_IBRS,
IBRS_ALWAYS, IBRS_FAST, IBRS_SAME_MODE],
 AMD_STIBP: [STIBP_ALWAYS],
 

base-commit: 465217b0f872602b4084a1b0fa2ef75377cb3589
-- 
2.30.2
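
To illustrate why the dependency matters, here is a minimal sketch in C of
how dependent features get hidden once their parent is levelled out. The
feature bits, the deps table and hide_orphaned_features() are hypothetical
names made up for this example; Xen's real logic is generated from the
table in gen-cpuid.py and is not reproduced here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature bits, for illustration only. */
#define FEAT_IBRS          (UINT32_C(1) << 0)
#define FEAT_AMD_STIBP     (UINT32_C(1) << 1)
#define FEAT_AUTO_IBRS     (UINT32_C(1) << 2)
#define FEAT_STIBP_ALWAYS  (UINT32_C(1) << 3)

/* One entry per parent feature: which features depend on it. */
struct dep {
    uint32_t feature;
    uint32_t dependents;
};

static const struct dep deps[] = {
    { FEAT_IBRS,      FEAT_AMD_STIBP | FEAT_AUTO_IBRS },
    { FEAT_AMD_STIBP, FEAT_STIBP_ALWAYS },
};

/*
 * Clear every feature whose parent is absent.  The loop repeats until
 * nothing changes, so removals propagate through chains of dependencies.
 */
static uint32_t hide_orphaned_features(uint32_t fs)
{
    int changed;

    do {
        changed = 0;
        for (size_t i = 0; i < sizeof(deps) / sizeof(deps[0]); i++) {
            if (!(fs & deps[i].feature) && (fs & deps[i].dependents)) {
                fs &= ~deps[i].dependents;
                changed = 1;
            }
        }
    } while (changed);

    return fs;
}
```

Without the AUTO_IBRS entry in the IBRS row, a feature set with IBRS
cleared would keep advertising AUTO_IBRS, which is the implausible
configuration the patch is meant to prevent.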




[xen-unstable-smoke test] 181035: regressions - trouble: blocked/fail

2023-05-31 Thread osstest service owner
flight 181035 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181035/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64                   6 xen-build                fail REGR. vs. 181018
 build-arm64-xsm               6 xen-build                fail REGR. vs. 181018
 build-armhf                   6 xen-build                fail REGR. vs. 181018

Tests which did not succeed, but are not blocking:
 build-amd64-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  1 build-check(1)blocked n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl   1 build-check(1)   blocked  n/a

version targeted for testing:
 xen  465217b0f872602b4084a1b0fa2ef75377cb3589
baseline version:
 xen  94200e1bae07e725cc07238c11569c5cab7befb7

Last test of basis   181018  2023-05-30 20:00:24 Z    0 days
Testing same since   181031  2023-05-31 11:00:27 Z    0 days    2 attempts


People who touched revisions under test:
  Bobby Eshleman 
  Jan Beulich 
  Juergen Gross 
  Oleksii Kurochko 
  Stefano Stabellini 

jobs:
 build-arm64-xsm  fail
 build-amd64  fail
 build-armhf  fail
 build-amd64-libvirt  blocked 
 test-armhf-armhf-xl  blocked 
 test-arm64-arm64-xl-xsm  blocked 
 test-amd64-amd64-xl-qemuu-debianhvm-amd64blocked 
 test-amd64-amd64-libvirt blocked 



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.


commit 465217b0f872602b4084a1b0fa2ef75377cb3589
Author: Jan Beulich 
Date:   Wed May 31 12:01:11 2023 +0200

vPCI: account for hidden devices

Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
console) are associated with DomXEN, not Dom0. This means that while
looking for overlapping BARs such devices cannot be found on Dom0's list
of devices; DomXEN's list also needs to be scanned.

Suppress vPCI init altogether for r/o devices (which constitute a subset
of hidden ones).

Signed-off-by: Jan Beulich 
Reviewed-by: Roger Pau Monné 
Tested-by: Stefano Stabellini 

commit 445fdc641e304ff41a544f8f5926a13b604c08ad
Author: Juergen Gross 
Date:   Wed May 31 12:00:40 2023 +0200

xen/include/public: fix 9pfs xenstore path description

In xen/include/public/io/9pfs.h the name of the Xenstore backend node
"security-model" should be "security_model", as this is how the Xen
tools are creating it and qemu is reading it.

Fixes: ad58142e73a9 ("xen/public: move xenstore related doc into 9pfs.h")
Fixes: cf1d2d22fdfd ("docs/misc: Xen transport for 9pfs")
Signed-off-by: Juergen Gross 
Reviewed-by: Jason Andryuk 
Acked-by: Stefano Stabellini 

commit 0f80a46ffa6bfd5d111fc2e64ee5983513627e4d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:13 2023 +0200

xen/riscv: remove dummy_bss variable

After the introduction of initial pagetables there is no longer any
point in the dummy_bss variable, as the bss section will not be empty anymore.

Signed-off-by: Oleksii Kurochko 
Acked-by: Bobby Eshleman 

commit 0d74fc2b2f85586ceb5672aedc79c666e529381d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:05 2023 +0200

xen/riscv: setup initial pagetables

The patch does two things:
1. Set up initial pagetables.
2. Enable the MMU, after which execution ends up in
   cont_after_mmu_is_enabled()

Signed-off-by: Oleksii Kurochko 
Acked-by: Bobby Eshleman 

commit ec337ce2e972b70619f5a076b20910a2ff4fea7a
Author: Oleksii Kurochko 
Date:   Wed May 31 11:59:53 2023 +0200

xen/riscv: align __bss_start

The bss clearing loop requires proper alignment of __bss_start.

ALIGN(PAGE_SIZE) before "*(.bss.page_aligned)" in xen.lds.S
was removed as any contribution to "*(.bss.page_aligned)" has to
specify proper alignment 

[qemu-mainline test] 181021: regressions - FAIL

2023-05-31 Thread osstest service owner
flight 181021 qemu-mainline real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181021/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-amd64-amd64-libvirt-pair 30 leak-check/check/src_host fail REGR. vs. 180691
 test-amd64-amd64-libvirt-pair 31 leak-check/check/dst_host fail REGR. vs. 180691
 test-amd64-amd64-xl-xsm 22 guest-start/debian.repeat fail REGR. vs. 180691
 test-amd64-i386-libvirt  23 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-xsm 23 leak-check/check fail REGR. vs. 180691
 build-arm64-xsm   6 xen-build fail REGR. vs. 180691
 build-arm64   6 xen-build fail REGR. vs. 180691
 test-amd64-amd64-libvirt 23 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-xsm  23 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-pair 30 leak-check/check/src_host fail REGR. vs. 180691
 test-amd64-i386-libvirt-pair 31 leak-check/check/dst_host fail REGR. vs. 180691
 test-amd64-amd64-xl-qcow2    24 leak-check/check fail REGR. vs. 180691
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-xl-qemuu-debianhvm-i386-xsm 12 debian-hvm-install fail REGR. vs. 180691
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 21 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-xl-vhd  21 guest-start/debian.repeat fail REGR. vs. 180691
 test-amd64-amd64-libvirt-vhd 22 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-libvirt-raw  22 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt 21 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt-qcow2 20 leak-check/check   fail REGR. vs. 180691
 test-armhf-armhf-xl-vhd  20 leak-check/check fail REGR. vs. 180691
 test-armhf-armhf-libvirt-raw 20 leak-check/check fail REGR. vs. 180691
 test-amd64-i386-xl-qemuu-win7-amd64 12 windows-install   fail REGR. vs. 180691

Tests which did not succeed, but are not blocking:
 test-arm64-arm64-libvirt-raw  1 build-check(1)   blocked  n/a
 test-arm64-arm64-libvirt-xsm  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit1   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-credit2   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-thunderx  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-vhd   1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 build-arm64-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop fail like 180691
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-armhf-armhf-libvirt 16 saverestore-support-check fail like 180691
 test-armhf-armhf-libvirt-qcow2 15 saverestore-support-check fail like 180691
 test-amd64-i386-xl-qemuu-ws16-amd64 19 guest-stop fail like 180691
 test-armhf-armhf-libvirt-raw 15 saverestore-support-check fail like 180691
 test-amd64-amd64-qemuu-nested-amd 20 debian-hvm-install/l1/l2 fail like 180691
 test-amd64-i386-xl-pvshim 14 guest-start fail never pass
 test-amd64-i386-libvirt  15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-xsm 15 migrate-support-check fail never pass
 test-amd64-i386-libvirt-xsm  15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt 15 migrate-support-check fail never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-amd64-i386-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 15 migrate-support-check fail never pass
 test-armhf-armhf-xl-multivcpu 16 saverestore-support-check fail never pass
 test-armhf-armhf-xl-credit1  15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit1  16 saverestore-support-check fail never pass
 test-amd64-amd64-libvirt-vhd 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  15 migrate-support-check fail never pass
 test-armhf-armhf-xl-credit2  16 saverestore-support-check fail never pass
 test-amd64-i386-libvirt-raw  14 migrate-support-check fail never pass
 test-armhf-armhf-libvirt 15 migrate-support-check fail never pass
 test-armhf-armhf-xl  15 migrate-support-check fail never pass
 test-armhf-armhf-xl  16 saverestore-support-check fail never pass
 test-armhf-armhf-libvirt-qcow2 14 migrate-support-check fail never pass
 test-armhf-armhf-xl-rtds 15 migrate-support-check fail never 

[ANNOUNCE] Call for agenda items for 1 June Community Call @ 1500 UTC

2023-05-31 Thread George Dunlap
Hi all,

Sorry for the late notice on this -- somehow my reminder didn't trigger.  I
believe we had discussed at the last call to hold this meeting, and skip
the one in July.


The proposed agenda is in
https://cryptpad.fr/pad/#/2/pad/edit/eddrScu3DyYdHZ6f0ReVGazZ/ and you can
edit to add items.  Alternatively, you can reply to this mail directly.

Agenda items are appreciated a few days before the call: please put your name
beside items if you edit the document.

Note the following administrative conventions for the call:
* Unless agreed otherwise in the previous meeting, the call is on the 1st
Thursday of each month at 1600 British Time (either GMT or BST)
* I usually send out a meeting reminder a few days before with a
provisional agenda

* To allow time to switch between meetings, we'll plan on starting the
agenda at 16:05 sharp.  Aim to join by 16:03 if possible to allocate time
to sort out technical difficulties 

* If you want to be CC'ed please add or remove yourself from the
sign-up-sheet at
https://cryptpad.fr/pad/#/2/pad/edit/D9vGzihPxxAOe6RFPz0sRCf+/

Best Regards
George


== Dial-in Information ==
## Meeting time
16:00 - 17:00 British time
Further International meeting times:


https://www.timeanddate.com/worldclock/meetingdetails.html?year=2023&month=6&day=1&hour=15&min=0&sec=0&p1=1234&p2=37&p3=224&p4=179


## Dial in details
Web: https://meet.jit.si/XenProjectCommunityCall

Dial-in info and pin can be found here:

https://meet.jit.si/static/dialInInfo.html?room=XenProjectCommunityCall


Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-31 Thread Juergen Gross

On 31.05.23 10:35, Borislav Petkov wrote:

[0.018357] MTRR default type: uncachable
[0.022347] MTRR fixed ranges enabled:
[0.026085]   0-9 write-back
[0.029650]   A-B uncachable
[0.033214]   C-F write-protect
[0.037039] MTRR variable ranges enabled:
[0.041038]   0 base 000 mask 0003FFC write-back
[0.047383]   1 base 004 mask 0003FFFC000 write-back
[0.053730]   2 base 0044000 mask 0003000 write-back
[0.060076]   3 base 000AE00 mask 0003E00 uncachable
[0.066421]   4 base 000B000 mask 0003000 uncachable
[0.072768]   5 base 000C000 mask 0003FFFC000 uncachable
[0.079114]   6 disabled
[0.081635]   7 disabled
[0.084156]   8 disabled
[0.086677]   9 disabled
[0.089203] total RAM covered: 16352M
[0.093023] Found optimal setting for mtrr clean up
[0.097734]  gran_size: 64K  chunk_size: 64M num_reg: 8  lose cover RAM: 0G


One other note: why does mtrr_cleanup() think that using 8 instead of 6
variable MTRRs would be an "optimal setting"?

IMO it should replace the original setup only in case it is using _less_
MTRRs than before.

Additionally I believe mtrr_cleanup() would make much more sense if it
wouldn't be __init, but being usable when trying to add additional MTRRs
in the running system in case we run out of MTRRs.

It should probably be based on the new MTRR map anyway...


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: OpenPGP public key


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 3/3] multiboot2: do not set StdOut mode unconditionally

2023-05-31 Thread Jan Beulich
On 31.05.2023 12:57, Roger Pau Monné wrote:
> On Wed, Apr 05, 2023 at 12:36:55PM +0200, Jan Beulich wrote:
>> On 31.03.2023 11:59, Roger Pau Monne wrote:
>>> @@ -887,6 +881,15 @@ void __init efi_multiboot2(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable
>>>  
>>>  efi_arch_edid(gop_handle);
>>>  }
>>> +else
>>> +{
>>> +/* If no GOP, init ConOut (StdOut) to the max supported size. */
>>> +efi_console_set_mode();
>>> +
>>> +if ( StdOut->QueryMode(StdOut, StdOut->Mode->Mode,
>>> +   &cols, &rows) == EFI_SUCCESS )
>>> +efi_arch_console_init(cols, rows);
>>> +}
>>
>> Instead of making this an "else", wouldn't you better check that a
>> valid gop_mode was found? efi_find_gop_mode() can return ~0 after all.
> 
> When using vga=current gop_mode would also be ~0, in order for
> efi_set_gop_mode() to not change the current mode,

And then we'd skip efi_console_set_mode() here as well, which I think
is what we want with "vga=current"?

> I was trying to
> avoid exposing keep_current or similar extra variable to signal this.
> 
>> Furthermore, what if the active mode doesn't support text output? (I
>> consider the spec unclear in regard to whether this is possible, but
>> maybe I simply didn't find the right place stating it.)
>>
>> Finally I think efi_arch_console_init() wants calling nevertheless.
>>
>> So altogether maybe
>>
>> if ( gop_mode == ~0 ||
>>  StdOut->QueryMode(StdOut, StdOut->Mode->Mode,
>>&cols, &rows) != EFI_SUCCESS )
> 
> I think it would make more sense to call efi_console_set_mode() only
> if the current StdOut mode is not valid, as anything different from
> vga=current will already force a GOP mode change.

Hmm, this may also make sense. I guess I'd like to see the combined
result to be better able to judge.

Jan



Re: [PATCH 2/2] vPCI: fix test harness build

2023-05-31 Thread Jan Beulich
On 31.05.2023 15:51, Roger Pau Monné wrote:
> On Wed, May 31, 2023 at 03:19:56PM +0200, Jan Beulich wrote:
>> The earlier commit introduced two uses of is_hardware_domain().
>>
>> Fixes: 465217b0f872 ("vPCI: account for hidden devices")
>> Reported-by: Andrew Cooper 
>> Signed-off-by: Jan Beulich 
> 
> Acked-by: Roger Pau Monné 

Thanks.

> We do rely on the compiler always removing the call to
> pci_get_pdev(dom_xen, ...) or AFAICT that would also trigger an error
> as there's no definition of dom_xen in this scope.

Not really, no. The stub macro itself discards all its arguments:

#define pci_get_pdev(...) (&test_pdev)

Jan
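
The point about the stub discarding its arguments can be sketched as
follows. This is a stand-alone illustration, not the harness code: struct
pci_dev_stub, stub_pdev, get_pdev_stub() and lookup_stub_id() are
hypothetical names. Because a variadic macro that never mentions
__VA_ARGS__ drops its arguments during preprocessing, the compiler never
sees the undeclared identifier, and no dead-code elimination is needed.

```c
/* Hypothetical stand-in for the device structure used by the test harness. */
struct pci_dev_stub {
    int id;
};

static struct pci_dev_stub stub_pdev = { 42 };

/* The arguments are never expanded, so they need not even be declared. */
#define get_pdev_stub(...) (&stub_pdev)

static int lookup_stub_id(void)
{
    /* dom_xen is not declared anywhere in this file, yet this compiles. */
    return get_pdev_stub(dom_xen, 0, 0)->id;
}
```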




Re: [patch] x86/smpboot: Fix the parallel bringup decision

2023-05-31 Thread Tom Lendacky

On 5/31/23 02:44, Thomas Gleixner wrote:

The decision to allow parallel bringup of secondary CPUs checks
CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
parallel bootup because accessing the local APIC is intercepted and raises
a #VC or #VE, which cannot be handled at that point.

The check works correctly, but only for AMD encrypted guests. TDX does not
set that flag.

As there is no real connection between CC attributes and the inability to
support parallel bringup, replace this with a generic control flag in
x86_cpuinit and let SEV-ES and TDX init code disable it.

Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
Reported-by: Kirill A. Shutemov 
Signed-off-by: Thomas Gleixner 


Still works for SEV-ES/SEV-SNP with parallel boot properly disabled.

Tested-by: Tom Lendacky 
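
For readers skimming the thread, the shape of the change can be sketched
in a few lines of stand-alone C. This is a simplified model, not the
kernel code: struct cpuinit_ops_model, cpuinit_model, the two
*_early_init_model() functions and init_parallel_bringup_model() are
hypothetical names. The idea is a flag that defaults to true and that
platform init code clears, instead of the generic code testing a CC
attribute.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified model of the per-platform CPU init ops. */
struct cpuinit_ops_model {
    bool parallel_bringup;
};

/* Parallel bringup is allowed unless a platform opts out. */
static struct cpuinit_ops_model cpuinit_model = {
    .parallel_bringup = true,
};

/* Each encrypted-guest flavour clears the flag from its own init path. */
static void tdx_early_init_model(void)
{
    cpuinit_model.parallel_bringup = false;
}

static void sev_es_early_init_model(void)
{
    cpuinit_model.parallel_bringup = false;
}

/* The generic decision no longer needs to know why it was disabled. */
static bool init_parallel_bringup_model(void)
{
    if (!cpuinit_model.parallel_bringup) {
        puts("Parallel CPU startup disabled by the platform");
        return false;
    }
    return true;
}
```

The design choice is that the generic code only consumes the flag, so a
new platform with the same limitation needs no change in smpboot.c.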


---
  arch/x86/coco/tdx/tdx.c |   11 +++
  arch/x86/include/asm/x86_init.h |3 +++
  arch/x86/kernel/smpboot.c   |   19 ++-
  arch/x86/kernel/x86_init.c  |1 +
  arch/x86/mm/mem_encrypt_amd.c   |   15 +++
  5 files changed, 32 insertions(+), 17 deletions(-)

--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -871,5 +871,16 @@ void __init tdx_early_init(void)
x86_platform.guest.enc_tlb_flush_required   = tdx_tlb_flush_required;
x86_platform.guest.enc_status_change_finish = tdx_enc_status_changed;
  
+	/*

+* TDX intercepts the RDMSR to read the X2APIC ID in the parallel
+* bringup low level code. That raises #VE which cannot be handled
+* there.
+*
+* Intel-TDX has a secure RDMSR hypercall, but that needs to be
+* implemented seperately in the low level startup ASM code.
+* Until that is in place, disable parallel bringup for TDX.
+*/
+   x86_cpuinit.parallel_bringup = false;
+
pr_info("Guest detected\n");
  }
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -177,11 +177,14 @@ struct x86_init_ops {
   * struct x86_cpuinit_ops - platform specific cpu hotplug setups
   * @setup_percpu_clockev: set up the per cpu clock event device
   * @early_percpu_clock_init:  early init of the per cpu clock event device
+ * @fixup_cpu_id:  fixup function for cpuinfo_x86::phys_proc_id
+ * @parallel_bringup:  Parallel bringup control
   */
  struct x86_cpuinit_ops {
void (*setup_percpu_clockev)(void);
void (*early_percpu_clock_init)(void);
void (*fixup_cpu_id)(struct cpuinfo_x86 *c, int node);
+   bool parallel_bringup;
  };
  
  struct timespec64;

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1267,23 +1267,8 @@ void __init smp_prepare_cpus_common(void
  /* Establish whether parallel bringup can be supported. */
  bool __init arch_cpuhp_init_parallel_bringup(void)
  {
-   /*
-* Encrypted guests require special handling. They enforce X2APIC
-* mode but the RDMSR to read the APIC ID is intercepted and raises
-* #VC or #VE which cannot be handled in the early startup code.
-*
-* AMD-SEV does not provide a RDMSR GHCB protocol so the early
-* startup code cannot directly communicate with the secure
-* firmware. The alternative solution to retrieve the APIC ID via
-* CPUID(0xb), which is covered by the GHCB protocol, is not viable
-* either because there is no enforcement of the CPUID(0xb)
-* provided "initial" APIC ID to be the same as the real APIC ID.
-*
-* Intel-TDX has a secure RDMSR hypercall, but that needs to be
-* implemented seperately in the low level startup ASM code.
-*/
-   if (cc_platform_has(CC_ATTR_GUEST_STATE_ENCRYPT)) {
-   pr_info("Parallel CPU startup disabled due to guest state encryption\n");
+   if (!x86_cpuinit.parallel_bringup) {
+   pr_info("Parallel CPU startup disabled by the platform\n");
return false;
}
  
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -126,6 +126,7 @@ struct x86_init_ops x86_init __initdata
  struct x86_cpuinit_ops x86_cpuinit = {
.early_percpu_clock_init= x86_init_noop,
.setup_percpu_clockev   = setup_secondary_APIC_clock,
+   .parallel_bringup   = true,
  };
  
  static void default_nmi_init(void) { };

--- a/arch/x86/mm/mem_encrypt_amd.c
+++ b/arch/x86/mm/mem_encrypt_amd.c
@@ -501,6 +501,21 @@ void __init sme_early_init(void)
x86_platform.guest.enc_status_change_finish  = amd_enc_status_change_finish;
x86_platform.guest.enc_tlb_flush_required= amd_enc_tlb_flush_required;
x86_platform.guest.enc_cache_flush_required  = amd_enc_cache_flush_required;
+
+   /*
+* AMD-SEV-ES intercepts the RDMSR to read the X2APIC ID in the
+* parallel bringup low level code. That raises #VC which 

Re: [PATCH 2/2] vPCI: fix test harness build

2023-05-31 Thread Roger Pau Monné
On Wed, May 31, 2023 at 03:19:56PM +0200, Jan Beulich wrote:
> The earlier commit introduced two uses of is_hardware_domain().
> 
> Fixes: 465217b0f872 ("vPCI: account for hidden devices")
> Reported-by: Andrew Cooper 
> Signed-off-by: Jan Beulich 

Acked-by: Roger Pau Monné 

We do rely on the compiler always removing the call to
pci_get_pdev(dom_xen, ...) or AFAICT that would also trigger an error
as there's no definition of dom_xen in this scope.

Thanks, Roger.



[PATCH AUTOSEL 4.14 10/10] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index cd58f582c50c1..b649f1a68b417 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -780,7 +780,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2
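
The behavioural difference is easiest to see with the two predicates side
by side. The following is a stand-alone sketch: enum op_model,
FLAG_FUA_MODEL and both functions are hypothetical stand-ins for
req_op(), REQ_FUA and the condition in blkif_queue_rw_req(), not the
kernel definitions.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the block layer operation and flag. */
enum op_model { OP_READ_MODEL, OP_WRITE_MODEL, OP_FLUSH_MODEL };
#define FLAG_FUA_MODEL 0x1u

/* Old condition: any request carrying FUA is treated as a flush/barrier. */
static bool treated_as_flush_old(enum op_model op, unsigned int flags)
{
    return op == OP_FLUSH_MODEL || (flags & FLAG_FUA_MODEL) != 0;
}

/* New condition: FUA is only honoured when the request is a write. */
static bool treated_as_flush_new(enum op_model op, unsigned int flags)
{
    return op == OP_FLUSH_MODEL ||
           (op == OP_WRITE_MODEL && (flags & FLAG_FUA_MODEL) != 0);
}
```

A read with the FUA flag set satisfies the old predicate but not the new
one, which is the case the commit message describes as causing data loss.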




Re: [PATCH 1/2] vPCI: add test harness entry to ./MAINTAINERS

2023-05-31 Thread Roger Pau Monné
On Wed, May 31, 2023 at 03:19:31PM +0200, Jan Beulich wrote:
> Signed-off-by: Jan Beulich 

Acked-by: Roger Pau Monné 

Thanks, Roger.



[PATCH AUTOSEL 4.19 13/13] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 7ee618ab1567b..b4807d12ef29c 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -779,7 +779,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH AUTOSEL 5.4 16/17] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index d0538c03f0332..da67621ebc212 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -779,7 +779,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH AUTOSEL 5.10 20/21] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 6f33d62331b1f..d68a8ca2161fb 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -792,7 +792,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH AUTOSEL 5.15 22/24] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 24a86d829f92a..831747ba8113c 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -780,7 +780,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH AUTOSEL 6.1 30/33] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 35b9bcad9db90..5ddf393aa390f 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -780,7 +780,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH AUTOSEL 6.3 34/37] xen/blkfront: Only check REQ_FUA for writes

2023-05-31 Thread Sasha Levin
From: Ross Lagerwall 

[ Upstream commit b6ebaa8100090092aa602530d7e8316816d0c98d ]

The existing code silently converts read operations with the
REQ_FUA bit set into write-barrier operations. This results in data
loss as the backend scribbles zeroes over the data instead of returning
it.

While the REQ_FUA bit doesn't make sense on a read operation, at least
one well-known out-of-tree kernel module does set it and since it
results in data loss, let's be safe here and only look at REQ_FUA for
writes.

Signed-off-by: Ross Lagerwall 
Acked-by: Juergen Gross 
Link: https://lore.kernel.org/r/20230426164005.2213139-1-ross.lagerw...@citrix.com
Signed-off-by: Juergen Gross 
Signed-off-by: Sasha Levin 
---
 drivers/block/xen-blkfront.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 23ed258b57f0e..c1890c8a9f6e7 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -780,7 +780,8 @@ static int blkif_queue_rw_req(struct request *req, struct blkfront_ring_info *ri
ring_req->u.rw.handle = info->handle;
ring_req->operation = rq_data_dir(req) ?
BLKIF_OP_WRITE : BLKIF_OP_READ;
-   if (req_op(req) == REQ_OP_FLUSH || req->cmd_flags & REQ_FUA) {
+   if (req_op(req) == REQ_OP_FLUSH ||
+   (req_op(req) == REQ_OP_WRITE && (req->cmd_flags & REQ_FUA))) {
/*
 * Ideally we can do an unordered flush-to-disk.
 * In case the backend onlysupports barriers, use that.
-- 
2.39.2




[PATCH 2/2] vPCI: fix test harness build

2023-05-31 Thread Jan Beulich
The earlier commit introduced two uses of is_hardware_domain().

Fixes: 465217b0f872 ("vPCI: account for hidden devices")
Reported-by: Andrew Cooper 
Signed-off-by: Jan Beulich 

--- a/tools/tests/vpci/emul.h
+++ b/tools/tests/vpci/emul.h
@@ -82,6 +82,8 @@ typedef union {
 
 #define __hwdom_init
 
+#define is_hardware_domain(d) ((void)(d), false)
+
 #define has_vpci(d) true
 
 #define xzalloc(type) ((type *)calloc(1, sizeof(type)))
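
A note on the shape of the stub: a comma expression such as ((void)(d), false) still evaluates its argument, presumably so that variables passed only to the macro keep counting as used, while the result is a constant. A minimal sketch of that idiom follows; struct domain here is a hypothetical placeholder and count_hwdom() is an invented caller, neither is taken from the harness.

```c
#include <stdbool.h>

/* Hypothetical placeholder; the real harness has its own struct domain. */
struct domain { int domain_id; };

/*
 * Stub in the style of the patch: the argument is evaluated and discarded,
 * and the whole expression always yields false.
 */
#define is_hardware_domain(d) ((void)(d), false)

/* Invented caller: counts how often the hardware-domain path would be taken. */
static int count_hwdom(const struct domain *doms, int n)
{
    int hits = 0;

    for ( int i = 0; i < n; i++ )
        if ( is_hardware_domain(&doms[i]) )
            hits++;

    return hits;
}
```

With this stub the hardware-domain branches in code under test are never taken, so count_hwdom() returns 0 for any input.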




[PATCH 1/2] vPCI: add test harness entry to ./MAINTAINERS

2023-05-31 Thread Jan Beulich
Signed-off-by: Jan Beulich 

--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -568,6 +568,7 @@
 VPCI
 M: Roger Pau Monné 
 S: Supported
+F: tools/tests/vpci/
 F: xen/drivers/vpci/
 F: xen/include/xen/vpci.h
 




[PATCH 0/2] vPCI: test harness adjustments

2023-05-31 Thread Jan Beulich
1: add test harness entry to ./MAINTAINERS
2: fix test harness build

Jan



Re: [dm-devel] [PATCH v2 00/16] Diskseq support in loop, device-mapper, and blkback

2023-05-31 Thread Christoph Hellwig
On Tue, May 30, 2023 at 04:31:00PM -0400, Demi Marie Obenour wrote:
> This work aims to allow userspace to create and destroy block devices
> in a race-free way, and to allow them to be exposed to other Xen VMs via
> blkback without races.
> 
> Changes since v1:
> 
> - Several device-mapper fixes added.

Let's get these reviewed by the DM maintainers independently.  This
series is mixing up way too many things.



Re: [PATCH v6 5/6] xen/riscv: introduce an implementation of macros from

2023-05-31 Thread Jan Beulich
On 31.05.2023 12:40, Oleksii wrote:
> On Tue, 2023-05-30 at 18:00 +0200, Jan Beulich wrote:
>> On 29.05.2023 14:13, Oleksii Kurochko wrote:
>>> +static uint32_t read_instr(unsigned long pc)
>>> +{
>>> +    uint16_t instr16 = *(uint16_t *)pc;
>>> +
>>> +    if ( GET_INSN_LENGTH(instr16) == 2 )
>>> +    return (uint32_t)instr16;
>>> +    else
>>> +    return *(uint32_t *)pc;
>>> +}
>>
>> As long as this function is only used on Xen code, it's kind of okay.
>> There you/we control whether code can change behind our backs. But as
>> soon as you might use this on guest code, the double read is going to
>> be a problem (I think; I wonder how hardware is supposed to deal with
>> the situation: Maybe they indeed fetch in 16-bit quantities?).
> I'll check how the hardware fetches instructions.
> 
> I am trying to figure out why the double-read can be a problem. It
> looks pretty safe to read 16 bits ( they will be available for any
> instruction length with the assumption that the minimal instruction
> length is 16 ), then check the length of the instruction, and if it is
> 32-bit instruction, read it as uint32_t.

Simply consider what happens if a buggy or malicious entity changes the
code between the two reads. And not just with the detection of "break"
in mind that you use it for here.

Jan
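
One way to avoid the double read is to fetch each 16-bit parcel exactly once and build the result from the values already read. This is only a sketch under stated assumptions: little-endian parcels, and the RISC-V rule that an encoding whose low two bits are both set is at least 32 bits long. insn_is_32bit() and read_instr_once() are hypothetical names, not Xen functions.

```c
#include <stdint.h>

/* RISC-V: low two bits == 0b11 means a 32-bit (or longer) encoding. */
static int insn_is_32bit(uint16_t parcel)
{
    return (parcel & 0x3) == 0x3;
}

/*
 * Read each 16-bit parcel exactly once through a volatile pointer, so the
 * length decision and the returned low bits come from the same read.
 */
static uint32_t read_instr_once(const volatile uint16_t *pc)
{
    uint16_t lo = pc[0];

    if ( !insn_is_32bit(lo) )
        return lo;

    return lo | ((uint32_t)pc[1] << 16);
}
```

The low half is never re-read, so a concurrent writer cannot make the length check and the returned instruction disagree. The result still has to be treated as untrusted input if it comes from guest memory.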



[ovmf test] 181028: all pass - PUSHED

2023-05-31 Thread osstest service owner
flight 181028 ovmf real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181028/

Perfect :-)
All tests in this flight passed as required
version targeted for testing:
 ovmf d15d2667d58d40c0748919ac4b5771b875c0780b
baseline version:
 ovmf d8e5d35ede7158ccbb9abf600e65b9aa6e043f74

Last test of basis   181024  2023-05-31 05:10:47 Z    0 days
Testing same since   181028  2023-05-31 09:12:10 Z    0 days    1 attempts


People who touched revisions under test:
  Abner Chang 

jobs:
 build-amd64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvops  pass
 build-i386-pvops pass
 test-amd64-amd64-xl-qemuu-ovmf-amd64 pass
 test-amd64-i386-xl-qemuu-ovmf-amd64  pass



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Pushing revision :

To xenbits.xen.org:/home/xen/git/osstest/ovmf.git
   d8e5d35ede..d15d2667d5  d15d2667d58d40c0748919ac4b5771b875c0780b -> xen-tested-master



[linux-linus test] 181020: regressions - FAIL

2023-05-31 Thread osstest service owner
flight 181020 linux-linus real [real]
flight 181030 linux-linus real-retest [real]
http://logs.test-lab.xenproject.org/osstest/logs/181020/
http://logs.test-lab.xenproject.org/osstest/logs/181030/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 test-armhf-armhf-xl-credit1   8 xen-boot fail REGR. vs. 180278

Tests which did not succeed, but are not blocking:
 test-armhf-armhf-examine              8 reboot                    fail  like 180278
 test-armhf-armhf-xl-arndale           8 xen-boot                  fail  like 180278
 test-amd64-amd64-xl-qemut-win7-amd64 19 guest-stop                fail  like 180278
 test-amd64-amd64-xl-qemuu-win7-amd64 19 guest-stop                fail  like 180278
 test-armhf-armhf-xl-credit2           8 xen-boot                  fail  like 180278
 test-amd64-amd64-qemuu-nested-amd    20 debian-hvm-install/l1/l2  fail  like 180278
 test-amd64-amd64-xl-qemuu-ws16-amd64 19 guest-stop                fail  like 180278
 test-armhf-armhf-xl-multivcpu         8 xen-boot                  fail  like 180278
 test-armhf-armhf-libvirt-raw          8 xen-boot                  fail  like 180278
 test-armhf-armhf-libvirt              8 xen-boot                  fail  like 180278
 test-armhf-armhf-xl-rtds              8 xen-boot                  fail  like 180278
 test-armhf-armhf-xl                   8 xen-boot                  fail  like 180278
 test-armhf-armhf-libvirt-qcow2        8 xen-boot                  fail  like 180278
 test-armhf-armhf-xl-vhd               8 xen-boot                  fail  like 180278
 test-amd64-amd64-xl-qemut-ws16-amd64 19 guest-stop                fail  like 180278
 test-amd64-amd64-libvirt             15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-credit1          15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-thunderx         15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-credit1          16 saverestore-support-check fail  never pass
 test-arm64-arm64-xl-thunderx         16 saverestore-support-check fail  never pass
 test-arm64-arm64-xl                  15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl                  16 saverestore-support-check fail  never pass
 test-arm64-arm64-xl-credit2          15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-credit2          16 saverestore-support-check fail  never pass
 test-arm64-arm64-libvirt-xsm         15 migrate-support-check     fail  never pass
 test-arm64-arm64-libvirt-xsm         16 saverestore-support-check fail  never pass
 test-arm64-arm64-xl-xsm              15 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-xsm              16 saverestore-support-check fail  never pass
 test-amd64-amd64-libvirt-qemuu-debianhvm-amd64-xsm 13 migrate-support-check fail  never pass
 test-amd64-amd64-libvirt-xsm         15 migrate-support-check     fail  never pass
 test-amd64-amd64-libvirt-qcow2       14 migrate-support-check     fail  never pass
 test-arm64-arm64-libvirt-raw         14 migrate-support-check     fail  never pass
 test-arm64-arm64-libvirt-raw         15 saverestore-support-check fail  never pass
 test-arm64-arm64-xl-vhd              14 migrate-support-check     fail  never pass
 test-arm64-arm64-xl-vhd              15 saverestore-support-check fail  never pass
 test-amd64-amd64-libvirt-raw         14 migrate-support-check     fail  never pass

version targeted for testing:
 linux                8b817fded42d8fe3a0eb47b1149d907851a3c942
baseline version:
 linux                6c538e1adbfc696ac4747fb10d63e704344f763d

Last test of basis   180278  2023-04-16 19:41:46 Z   44 days
Failing since        180281  2023-04-17 06:24:36 Z   44 days   83 attempts
Testing same since   181002  2023-05-29 16:11:58 Z    1 days    4 attempts


2553 people touched revisions under test,
not listing them all

jobs:
 build-amd64-xsm  pass
 build-arm64-xsm  pass
 build-i386-xsm   pass
 build-amd64  pass
 build-arm64  pass
 build-armhf  pass
 build-i386   pass
 build-amd64-libvirt  pass
 build-arm64-libvirt  pass
 build-armhf-libvirt  pass
 build-i386-libvirt   pass
 build-amd64-pvops  pass
 build-arm64-pvops  pass
 build-armhf-pvops  pass
 build-i386-pvops pass
 test-amd64-amd64-xl  pass
 

Re: [PATCH v2 2/3] x86: Expose Automatic IBRS to guests

2023-05-31 Thread Andrew Cooper
On 31/05/2023 10:01 am, Alejandro Vallejo wrote:
> On Tue, May 30, 2023 at 06:31:03PM +0100, Andrew Cooper wrote:
>> I've committed this, but made two tweaks to the commit message.  First,
>> "x86/hvm" in the subject because it's important context at a glance.
> Sure, that makes sense.
>
>> Second, I've adjusted the bit about PV guests.  The reason why we can't
>> expose it yet is because Xen doesn't currently context switch EFER
>> between PV guests.
>>
>> ~Andrew
> We could of course context switch EFER sensibly, but what would that mean
> for Automatic IBRS? It can't be trivially used for domain-to-domain
> isolation because every domain is in a co-equal protection level. Is there
> a non-obvious edge that exposing some interface to it gives for PV? The
> only useful case I can think of is PVH, and that seems to be subsumed by
> HVM.

Hence why it's fine to not worry about PV for now.

Right now, when we decide to use IBRS on AMD, we set it unilaterally. 
This turns out to perform better than flipping it on privilege
changes (whether that's non-Xen <-> Xen, or guest user <-> kernel).

PV guests are obscure corner cases these days, and fall outside of
anything the hardware vendors care about when it comes to prediction
mode.  The only sane option is to have Xen explicitly tell the PV
guest what Xen is doing, and let the guest decide if it wants to do
anything further in terms of protections.

~Andrew



Re: [xen-unstable-smoke test] 181031: regressions - trouble: blocked/fail

2023-05-31 Thread Andrew Cooper
On 31/05/2023 1:17 pm, osstest service owner wrote:
> flight 181031 xen-unstable-smoke real [real]
> http://logs.test-lab.xenproject.org/osstest/logs/181031/
>
> Regressions :-(
>
> Tests which did not succeed and are blocking,
> including tests which could not be run:
>  build-amd64   6 xen-build fail REGR. vs. 181018
>  build-arm64-xsm   6 xen-build fail REGR. vs. 181018
>  build-armhf   6 xen-build fail REGR. vs. 181018

Real failure, caused by the vPCI change.

http://logs.test-lab.xenproject.org/osstest/logs/181031/build-arm64-xsm/6.ts-xen-build.log

The userspace test needs is_hardware_domain().

~Andrew



[xen-unstable-smoke test] 181031: regressions - trouble: blocked/fail

2023-05-31 Thread osstest service owner
flight 181031 xen-unstable-smoke real [real]
http://logs.test-lab.xenproject.org/osstest/logs/181031/

Regressions :-(

Tests which did not succeed and are blocking,
including tests which could not be run:
 build-amd64   6 xen-build fail REGR. vs. 181018
 build-arm64-xsm   6 xen-build fail REGR. vs. 181018
 build-armhf   6 xen-build fail REGR. vs. 181018

Tests which did not succeed, but are not blocking:
 build-amd64-libvirt   1 build-check(1)   blocked  n/a
 test-amd64-amd64-libvirt  1 build-check(1)   blocked  n/a
 test-amd64-amd64-xl-qemuu-debianhvm-amd64  1 build-check(1)   blocked  n/a
 test-arm64-arm64-xl-xsm   1 build-check(1)   blocked  n/a
 test-armhf-armhf-xl   1 build-check(1)   blocked  n/a

version targeted for testing:
 xen  465217b0f872602b4084a1b0fa2ef75377cb3589
baseline version:
 xen  94200e1bae07e725cc07238c11569c5cab7befb7

Last test of basis   181018  2023-05-30 20:00:24 Z    0 days
Testing same since   181031  2023-05-31 11:00:27 Z    0 days    1 attempts


People who touched revisions under test:
  Bobby Eshleman 
  Jan Beulich 
  Juergen Gross 
  Oleksii Kurochko 
  Stefano Stabellini 

jobs:
 build-arm64-xsm  fail
 build-amd64  fail
 build-armhf  fail
 build-amd64-libvirt  blocked 
 test-armhf-armhf-xl  blocked 
 test-arm64-arm64-xl-xsm  blocked 
 test-amd64-amd64-xl-qemuu-debianhvm-amd64blocked 
 test-amd64-amd64-libvirt blocked 



sg-report-flight on osstest.test-lab.xenproject.org
logs: /home/logs/logs
images: /home/logs/images

Logs, config files, etc. are available at
http://logs.test-lab.xenproject.org/osstest/logs

Explanation of these reports, and of osstest in general, is at
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README.email;hb=master
http://xenbits.xen.org/gitweb/?p=osstest.git;a=blob;f=README;hb=master

Test harness code can be found at
http://xenbits.xen.org/gitweb?p=osstest.git;a=summary


Not pushing.


commit 465217b0f872602b4084a1b0fa2ef75377cb3589
Author: Jan Beulich 
Date:   Wed May 31 12:01:11 2023 +0200

vPCI: account for hidden devices

Hidden devices (e.g. an add-in PCI serial card used for Xen's serial
console) are associated with DomXEN, not Dom0. This means that while
looking for overlapping BARs such devices cannot be found on Dom0's list
of devices; DomXEN's list also needs to be scanned.

Suppress vPCI init altogether for r/o devices (which constitute a subset
of hidden ones).

Signed-off-by: Jan Beulich 
Reviewed-by: Roger Pau Monné 
Tested-by: Stefano Stabellini 

commit 445fdc641e304ff41a544f8f5926a13b604c08ad
Author: Juergen Gross 
Date:   Wed May 31 12:00:40 2023 +0200

xen/include/public: fix 9pfs xenstore path description

In xen/include/public/io/9pfs.h the name of the Xenstore backend node
"security-model" should be "security_model", as this is how the Xen
tools are creating it and qemu is reading it.

Fixes: ad58142e73a9 ("xen/public: move xenstore related doc into 9pfs.h")
Fixes: cf1d2d22fdfd ("docs/misc: Xen transport for 9pfs")
Signed-off-by: Juergen Gross 
Reviewed-by: Jason Andryuk 
Acked-by: Stefano Stabellini 

commit 0f80a46ffa6bfd5d111fc2e64ee5983513627e4d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:13 2023 +0200

xen/riscv: remove dummy_bss variable

After introduction of initial pagetables there is no any sense
in dummy_bss variable as bss section will not be empty anymore.

Signed-off-by: Oleksii Kurochko 
Acked-by: Bobby Eshleman 

commit 0d74fc2b2f85586ceb5672aedc79c666e529381d
Author: Oleksii Kurochko 
Date:   Wed May 31 12:00:05 2023 +0200

xen/riscv: setup initial pagetables

The patch does two thing:
1. Setup initial pagetables.
2. Enable MMU which end up with code in
   cont_after_mmu_is_enabled()

Signed-off-by: Oleksii Kurochko 
Acked-by: Bobby Eshleman 

commit ec337ce2e972b70619f5a076b20910a2ff4fea7a
Author: Oleksii Kurochko 
Date:   Wed May 31 11:59:53 2023 +0200

xen/riscv: align __bss_start

bss clear cycle requires proper alignment of __bss_start.

ALIGN(PAGE_SIZE) before "*(.bss.page_aligned)" in xen.lds.S
was removed as any contribution to "*(.bss.page_aligned)" have to
specify proper aligntment 

[BROKEN] Re: [PATCH v9 0/5] enable MMU for RISC-V

2023-05-31 Thread Andrew Cooper
On 25/05/2023 4:28 pm, Oleksii Kurochko wrote:
> Oleksii Kurochko (5):
>   xen/riscv: add VM space layout
>   xen/riscv: introduce setup_initial_pages
>   xen/riscv: align __bss_start
>   xen/riscv: setup initial pagetables
>   xen/riscv: remove dummy_bss variable

These have just been committed.

But as I fed back to early drafts of this series, patch 2 is
sufficiently fragile and unwise as to be unacceptable in this form.

enable_mmu() is unsafe in multiple ways: the compiler may reorder
statements (the label needs to be in the asm statement for that to work
correctly), and it *depends* on hooking all exceptions and
pagefault.

Any exception other than pagefault, or not taking a pagefault causes it
to malfunction, which means you will fail to boot depending on where Xen
was loaded into memory.  It may not explode inside Qemu right now, but
it will not function reliably in the general case.

Furthermore, a combination of patch 2 and 4 breaks the CI integration of
looking for "All set up" at the end of start_xen().  It's not ok, from a
code quality point of view, to defer 99% of start_xen()'s functionality
into an unrelated function.


Please do not do anything else until you've addressed these issues. 
enable_mmu() needs to return normally, cont_after_mmu_is_enabled() needs
deleting entirely, and there needs to be an identity page for Xen to
land on so it isn't jumping into the void and praying not to explode.

Other minor issues include page.h not having __ASSEMBLY__ guards, mm.c
locally externing cpu0_boot_stack[] from setup.c when the declaration
needs to be in a header file somewhere, and SPDX tags.

~Andrew



Re: [patch] x86/smpboot: Fix the parallel bringup decision

2023-05-31 Thread Kirill A. Shutemov
On Wed, May 31, 2023 at 09:44:26AM +0200, Thomas Gleixner wrote:
> The decision to allow parallel bringup of secondary CPUs checks
> CC_ATTR_GUEST_STATE_ENCRYPT to detect encrypted guests. Those cannot use
> parallel bootup because accessing the local APIC is intercepted and raises
> a #VC or #VE, which cannot be handled at that point.
> 
> The check works correctly, but only for AMD encrypted guests. TDX does not
> set that flag.
> 
> As there is no real connection between CC attributes and the inability to
> support parallel bringup, replace this with a generic control flag in
> x86_cpuinit and let SEV-ES and TDX init code disable it.
> 
> Fixes: 0c7ffa32dbd6 ("x86/smpboot/64: Implement arch_cpuhp_init_parallel_bringup() and enable it")
> Reported-by: Kirill A. Shutemov 
> Signed-off-by: Thomas Gleixner 

Tested-by: Kirill A. Shutemov 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
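
The design described in the quoted changelog, a generic capability flag that platform init code clears instead of a check on CC attributes, can be sketched in isolation. Everything below is a hypothetical miniature for illustration: struct cpuinit_ops, its field and both functions are invented names and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Hypothetical miniature of a per-platform init-ops structure. */
struct cpuinit_ops {
    bool parallel_bringup;
};

/* Parallel bringup is allowed unless some platform code opts out. */
static struct cpuinit_ops cpuinit = { .parallel_bringup = true };

/* Invented hook standing in for guest-specific init (e.g. SEV-ES or TDX). */
static void encrypted_guest_init(void)
{
    cpuinit.parallel_bringup = false;
}

/* Generic code consults only the flag; it knows nothing about CC attributes. */
static bool can_bring_up_in_parallel(void)
{
    return cpuinit.parallel_bringup;
}
```

The point of the design is that the decision lives with the code that knows about the restriction, so a new guest type only has to clear the flag in its own init path.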



xen with colors issue again

2023-05-31 Thread Oleg Nikitenko
Hello,

I built the xlnx_rebase_4.17 branch and ran it in our environment with
colors.
I ran into the following issue. It looks like some device gets stuck.
It may show up immediately on start or 20-30 minutes later, even with
no DomUs.
The Xen command line is
"console=dtuart dtuart=serial0 dom0_mem=1800M dom0_max_vcpus=2
dom0_vcpus_pin bootscrub=0 vwfi=native sched=null timer_slop=0
llc-coloring=on llc-way-size=64K xen-llc-colors=0-1 dom0-llc-colors=2-8";

This is the first one that comes up:

(XEN) d0v0 Forwarding AES operation: 3254779951
Assert occurred from file xcsudma.c at line 143

I found out that this is inside a DMA handler in the FSBL code:

void XCsuDma_Transfer(XCsuDma *InstancePtr, XCsuDma_Channel Channel,
u64 Addr, u32 Size, u8 EnDataLast)

Xil_AssertVoid(Size <= (u32)(XCSUDMA_SIZE_MAX));

This is the second one that comes up:

[  188.737910] zynqmp_aes firmware:zynqmp-firmware:zynqmp-aes: ERROR: Gcm
Tag mismatch
(XEN) d0v0 Forwarding AES operation: 3254779951
[  188.748496] zynqmp_aes firmware:zynqmp-firmware:zynqmp-aes: ERROR : Non
word aligned data
(XEN) d0v0 Forwarding AES operation: 3254779951
[  198.826279] zynqmp_aes firmware:zynqmp-firmware:zynqmp-aes: ERROR : Non
word aligned data
(XEN) d0v0 Forwarding AES operation: 3254779951
[  198.837363] zynqmp_aes firmware:zynqmp-firmware:zynqmp-aes: ERROR:
Invalid
(XEN) d0v0 Forwarding AES operation: 3254779951
Received exception
MSR: 0x200, EAR: 0x181, EDR: 0x0, ESR: 0x861
[  229.916284] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  229.916667] (detected by 1, t=5252 jiffies, g=1953, q=101)
[  229.922286] rcu: All QSes seen, last rcu_sched kthread activity 5252
(4294949715-4294944463), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  229.934569] rcu: rcu_sched kthread timer wakeup didn't happen for 5251
jiffies! g1953 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[  229.945727] rcu: Possible timer handling issue on cpu=0
timer-softirq=3481
[  229.952734] rcu: rcu_sched kthread starved for 5252 jiffies! g1953 f0x2
RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=0
[  229.962940] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM
is now expected behavior.
[  229.971936] rcu: RCU grace-period kthread stack dump:
[  229.977041] task:rcu_sched   state:R stack:0 pid:   12 ppid:
2 flags:0x0008
[  229.985433] Call trace:
[  229.987939]  __switch_to+0xf4/0x13c
[  229.991486]  __schedule+0x2f0/0x690
[  229.995032]  schedule+0x5c/0xc4
[  229.998232]  schedule_timeout+0x80/0xf0
[  230.002125]  rcu_gp_fqs_loop+0xf0/0x2b4
[  230.006017]  rcu_gp_kthread+0xe8/0x100
[  230.009824]  kthread+0x120/0x130
[  230.013111]  ret_from_fork+0x10/0x20
[  230.016744] rcu: Stack dump where RCU GP kthread last ran:
[  230.022279] Task dump for CPU 0:
[  230.025567] task:tokio-runtime-w state:R  running task stack:0
pid:  795 ppid:   408 flags:0x
[  230.035514] Call trace:
[  230.038022]  __switch_to+0xf4/0x13c
[  230.041569]  0xffae44320a00
[  292.936283] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[  292.936646] (detected by 1, t=21007 jiffies, g=1953, q=113)
[  292.942354] rcu: All QSes seen, last rcu_sched kthread activity 21007
(4294965470-4294944463), jiffies_till_next_fqs=1, root ->qsmask 0x0
[  292.954723] rcu: rcu_sched kthread timer wakeup didn't happen for 21006
jiffies! g1953 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
[  292.965968] rcu: Possible timer handling issue on cpu=0
timer-softirq=3481
[  292.972975] rcu: rcu_sched kthread starved for 21007 jiffies! g1953 f0x2
RCU_GP_WAIT_FQS(5) ->state=0x200 ->cpu=0
[  292.983268] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM
is now expected behavior.
[  292.992264] rcu: RCU grace-period kthread stack dump:
[  292.997368] task:rcu_sched   state:R stack:0 pid:   12 ppid:
2 flags:0x0008
[  293.005759] Call trace:
[  293.008266]  __switch_to+0xf4/0x13c
[  293.011814]  __schedule+0x2f0/0x690
[  293.015359]  schedule+0x5c/0xc4
[  293.018560]  schedule_timeout+0x80/0xf0
[  293.022453]  rcu_gp_fqs_loop+0xf0/0x2b4
[  293.026345]  rcu_gp_kthread+0xe8/0x100
[  293.030151]  kthread+0x120/0x130
[  293.033438]  ret_from_fork+0x10/0x20
[  293.037071] rcu: Stack dump where RCU GP kthread last ran:
[  293.042607] Task dump for CPU 0:
[  293.045895] task:tokio-runtime-w state:R  running task stack:0
pid:  795 ppid:   408 flags:0x
[  293.055842] Call trace:
[  293.058350]  __switch_to+0xf4/0x13c
[  293.061897]  0xffae44320a00

Maybe someone can give me some hints on where I should look?

Regards,
Oleg


Re: [PATCH v2 3/3] multiboot2: do not set StdOut mode unconditionally

2023-05-31 Thread Roger Pau Monné
On Wed, Apr 05, 2023 at 12:36:55PM +0200, Jan Beulich wrote:
> On 31.03.2023 11:59, Roger Pau Monne wrote:
> > @@ -887,6 +881,15 @@ void __init efi_multiboot2(EFI_HANDLE ImageHandle, 
> > EFI_SYSTEM_TABLE *SystemTable
> >  
> >  efi_arch_edid(gop_handle);
> >  }
> > +else
> > +{
> > +/* If no GOP, init ConOut (StdOut) to the max supported size. */
> > +efi_console_set_mode();
> > +
> > +if ( StdOut->QueryMode(StdOut, StdOut->Mode->Mode,
> > +   &cols, &rows) == EFI_SUCCESS )
> > +efi_arch_console_init(cols, rows);
> > +}
> 
> Instead of making this an "else", wouldn't you better check that a
> valid gop_mode was found? efi_find_gop_mode() can return ~0 after all.

When using vga=current, gop_mode would also be ~0 so that
efi_set_gop_mode() does not change the current mode; I was trying to
avoid exposing keep_current or a similar extra variable to signal this.

> Furthermore, what if the active mode doesn't support text output? (I
> consider the spec unclear in regard to whether this is possible, but
> maybe I simply didn't find the right place stating it.)
> 
> Finally I think efi_arch_console_init() wants calling nevertheless.
> 
> So altogether maybe
> 
> if ( gop_mode == ~0 ||
>  StdOut->QueryMode(StdOut, StdOut->Mode->Mode,
>, ) != EFI_SUCCESS )

I think it would make more sense to call efi_console_set_mode() only
if the current StdOut mode is not valid, as anything different from
vga=current will already force a GOP mode change.

Thanks, Roger.



Re: [PATCH v6 5/6] xen/riscv: introduce an implementation of macros from

2023-05-31 Thread Oleksii
On Tue, 2023-05-30 at 18:00 +0200, Jan Beulich wrote:
> On 29.05.2023 14:13, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/include/asm/bug.h
> > +++ b/xen/arch/riscv/include/asm/bug.h
> > @@ -7,4 +7,32 @@
> >  #ifndef _ASM_RISCV_BUG_H
> >  #define _ASM_RISCV_BUG_H
> >  
> > +#ifndef __ASSEMBLY__
> > +
> > +#define BUG_INSTR "ebreak"
> > +
> > +/*
> > + * The base instruction set has a fixed length of 32-bit naturally aligned
> > + * instructions.
> > + *
> > + * There are extensions of variable length ( where each instruction can be
> > + * any number of 16-bit parcels in length ) but they aren't used in Xen
> > + * and Linux kernel ( where these definitions were taken from ).
> 
> This, at least to some degree, looks to contradict ...
> 
> > + * Compressed ISA is used now where the instruction length is 16 bit  and
> > + * 'ebreak' instruction, in this case, can be either 16 or 32 bit (
> > + * depending on if compressed ISA is used or not )
> 
> ... this. Plus there already is CONFIG_RISCV_ISA_C, so compressed insns
> can very well be used in Xen.
Thanks. You are right. The comment should be updated.

> 
> > @@ -114,7 +116,134 @@ static void do_unexpected_trap(const struct cpu_user_regs *regs)
> >  die();
> >  }
> >  
> > +void show_execution_state(const struct cpu_user_regs *regs)
> > +{
> > +    printk("implement show_execution_state(regs)\n");
> > +}
> > +
> > +/*
> > + * TODO: change early_printk's function to early_printk with format
> > + *   when s(n)printf() will be added.
> 
> What is this comment about? I don't think I understand what it says
> needs doing.
I meant that it would be nice to introduce a second version of
early_printk() that takes a format string, as printk() does.

But the comment no longer makes sense, because all early_printk() calls
in do_bug_frame() were changed to printk().

I will therefore update the comment.

> 
> > + * Probably the TODO won't be needed as generic do_bug_frame()
> > + * has been introduced and current implementation will be replaced
> > + * with generic one when panic(), printk() and find_text_region()
> > + * (virtual memory?) will be ready/merged
> > + */
> > +int do_bug_frame(const struct cpu_user_regs *regs, vaddr_t pc)
> 
> While it's going to be the maintainers to judge, I continue to be
> unconvinced that introducing copies of common functions (also in
> patch 1) is a good idea.
Generally I agree with you. But as I mentioned before, and in the comment
above do_bug_frame(), the reason not to use the generic implementation of
do_bug_frame() now is that it would require compiling the whole of Xen's
common code (there is no way to enable just the parts needed for this one
function).

I think that after this patch series I'll introduce compilation of
Xen's common code, and once that is merged do_bug_frame() can be
removed.

> 
> > +{
> > +    const struct bug_frame *start, *end;
> > +    const struct bug_frame *bug = NULL;
> > +    unsigned int id = 0;
> > +    const char *filename, *predicate;
> > +    int lineno;
> > +
> > +    static const struct bug_frame* bug_frames[] = {
> 
> Nit: * and blank want to swap places. I would also expect another
> "const".
Thanks. I'll update that.

> 
> > +static uint32_t read_instr(unsigned long pc)
> > +{
> > +    uint16_t instr16 = *(uint16_t *)pc;
> > +
> > +    if ( GET_INSN_LENGTH(instr16) == 2 )
> > +    return (uint32_t)instr16;
> > +    else
> > +    return *(uint32_t *)pc;
> > +}
> 
> As long as this function is only used on Xen code, it's kind of okay.
> There you/we control whether code can change behind our backs. But as
> soon as you might use this on guest code, the double read is going to
> be a problem (I think; I wonder how hardware is supposed to deal with
> the situation: Maybe they indeed fetch in 16-bit quantities?).
I'll check how the hardware fetches instructions.

I am trying to figure out why the double-read can be a problem. It
looks pretty safe to read 16 bits ( they will be available for any
instruction length with the assumption that the minimal instruction
length is 16 ), then check the length of the instruction, and if it is
32-bit instruction, read it as uint32_t.
> 
> > --- a/xen/arch/riscv/xen.lds.S
> > +++ b/xen/arch/riscv/xen.lds.S
> > @@ -40,6 +40,16 @@ SECTIONS
> >  . = ALIGN(PAGE_SIZE);
> >  .rodata : {
> >  _srodata = .;  /* Read-only data */
> > +    /* Bug frames table */
> > +   __start_bug_frames = .;
> > +   *(.bug_frames.0)
> > +   __stop_bug_frames_0 = .;
> > +   *(.bug_frames.1)
> > +   __stop_bug_frames_1 = .;
> > +   *(.bug_frames.2)
> > +   __stop_bug_frames_2 = .;
> > +   *(.bug_frames.3)
> > +   __stop_bug_frames_3 = .;
> >  *(.rodata)
> >  *(.rodata.*)
> >  *(.data.rel.ro)
> 
> Nit: There looks to be an off-by-one in how you indent your addition
> (except for the comment).
Thanks. One space is really 

[PATCH 1/1] doc: clarify intended usage of ~/control/ xentore path

2023-05-31 Thread Yann Dirson
Signed-off-by: Yann Dirson 
---
 docs/misc/xenstore-paths.pandoc | 29 +
 1 file changed, 29 insertions(+)

diff --git a/docs/misc/xenstore-paths.pandoc b/docs/misc/xenstore-paths.pandoc
index f07ef90f63..5501033893 100644
--- a/docs/misc/xenstore-paths.pandoc
+++ b/docs/misc/xenstore-paths.pandoc
@@ -432,6 +432,35 @@ by udev ("0") or will be run by the toolstack directly 
("1").
 
 ### Platform Feature and Control Paths
 
+ ~/control = "" []
+
+Directory to hold feature and control paths.  This directory is not
+guest-writable, only the toolstack is allowed to create new child
+nodes under this.
+
+Children of this nodes can have one of several types:
+
+* platform features: using name pattern `platform-feature-*`, they may
+  be set by the toolstack to inform the guest, and are not writable by
+  the guest.
+
+* guest features: using name pattern `feature-*`, they may be created
+  by the toolstack with an empty value (`""`), should be set writable
+  by the guest which can then advertize to the toolstack its
+  (non-)usage of the feature with values `"0"` and `"1"` respectively.
+  The lack of update by the guest can be interpreted by the toolstack
+  as the lack of supporting software (PV driver, guest agent, ...) in
+  the guest.
+
+* control nodes: using any name not matching the above pattern, they
+  are used by the toolstack or by the guest to signal a specific
+  condition to the other end, which is expected to watch it to react
+  to changes.
+
+Note: the presence of a control node in itself advertises the
+underlying toolstack feature, it is not necessary to add an extra
+platform-feature for such cases.
+
  ~/control/sysrq = (""|COMMAND) [w]
 
 This is the PV SysRq control node. A toolstack can write a single character
-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com
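
The three kinds of ~/control children described in the patch can be told apart by name alone. A minimal sketch of that classification follows, assuming only the two name patterns given in the text; the enum, has_prefix() and classify_control_node() are hypothetical names and not part of libxl or any toolstack.

```c
#include <string.h>

/* Hypothetical classification of ~/control children, by name pattern. */
enum control_node_kind {
    NODE_PLATFORM_FEATURE,  /* platform-feature-*: set by the toolstack */
    NODE_GUEST_FEATURE,     /* feature-*: guest reports "0" or "1" */
    NODE_CONTROL,           /* anything else: signalling node */
};

static int has_prefix(const char *name, const char *prefix)
{
    return strncmp(name, prefix, strlen(prefix)) == 0;
}

/* The platform-feature- check is listed first for readability. */
static enum control_node_kind classify_control_node(const char *name)
{
    if ( has_prefix(name, "platform-feature-") )
        return NODE_PLATFORM_FEATURE;
    if ( has_prefix(name, "feature-") )
        return NODE_GUEST_FEATURE;
    return NODE_CONTROL;
}
```

For example, "sysrq" (the node documented right after the added text) falls into the control-node category, while a name starting with "feature-" is treated as a guest feature.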



[PATCH 0/1] RFC: clarify intended usage of ~/control/ xentore path

2023-05-31 Thread Yann Dirson
This proposal, spurred by a discrepancy between how toolstacks handle
the control nodes, tries to summarize what I understand to be the
spirit of ~/control/, from its children already described in the
xenstore-paths document, and from the libxl behaviour.

Yann Dirson (1):
  doc: clarify intended usage of ~/control/ xentore path

 docs/misc/xenstore-paths.pandoc | 29 +
 1 file changed, 29 insertions(+)

-- 
2.30.2



Yann Dirson | Vates Platform Developer

XCP-ng & Xen Orchestra - Vates solutions
w: vates.fr | xcp-ng.org | xen-orchestra.com



Re: [PATCH v6 4/6] xen/riscv: introduce trap_init()

2023-05-31 Thread Oleksii
On Tue, 2023-05-30 at 17:44 +0200, Jan Beulich wrote:
> On 29.05.2023 14:13, Oleksii Kurochko wrote:
> > --- a/xen/arch/riscv/traps.c
> > +++ b/xen/arch/riscv/traps.c
> > @@ -12,6 +12,31 @@
> >  #include 
> >  #include 
> >  
> > +#define cast_to_bug_frame(addr) \
> > +    (const struct bug_frame *)(addr)
> 
> I can't find a use for this; should it be dropped or moved to some
> later patch? In any event, if it's intended to survive, it needs yet
> another pair of parentheses.
You are right. It should be a part of the next patch.
Thanks.

> 
> > +/*
> > + * Initialize the trap handling.
> > + *
> > + * The function is called after MMU is enabled.
> > + */
> > +void trap_init(void)
> 
> Is this going to be used for secondary processors as well? If not,
> it will want to be __init.
I think I'll use it for secondary processors.

> 
> > +{
> > +    /*
> > + * When the MMU is off, addr varialbe will be a physical address otherwise
> > + * it would be a virtual address.
> > + *
> > + * It will work fine as:
> > + *  - access to addr is PC-relative.
> > + *  - -nopie is used. -nopie really suppresses the compiler emitting
> > + *    code going through .got (which then indeed would mean using absolute
> > + *    addresses).
> > + */
> 
> Is all of this comment still relevant now that you're running with
> the MMU already enabled?
Not really. I think the comment above the trap_init() function will be enough.
I'll remove this comment.

~ Oleksii



Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-31 Thread Borislav Petkov
On Wed, May 31, 2023 at 11:31:37AM +0200, Juergen Gross wrote:
> What it did would have been printed if pr_debug() would have been
> active. :-(

Lemme turn those into pr_info(). pr_debug() is nuts.

> Did you check whether CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT was the same in both
> kernels you've tested?

Yes, it is enabled.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v2 2/3] multiboot2: parse console= and vga= options when setting GOP mode

2023-05-31 Thread Jan Beulich
On 31.05.2023 11:30, Roger Pau Monné wrote:
> On Wed, May 31, 2023 at 11:15:44AM +0200, Jan Beulich wrote:
>> On 30.05.2023 18:02, Roger Pau Monné wrote:
>>> On Wed, Apr 05, 2023 at 12:15:26PM +0200, Jan Beulich wrote:
>>>> On 31.03.2023 11:59, Roger Pau Monne wrote:
>>>>> Only set the GOP mode if vga is selected in the console option,
>>>>
>>>> This particular aspect of the behavior is inconsistent with legacy
>>>> boot behavior: There "vga=" isn't qualified by what "console=" has.
>>>
>>> Hm, I find this very odd, why would we fiddle with the VGA (or the GOP
>>> here) if we don't intend to use it for output?
>>
>> Because we also need to arrange for what Dom0 possibly wants to use.
>> It has no way of setting the mode the low-level (BIOS or EFI) way.
> 
> I understand this might be needed when Xen is booted as an EFI
> application, but otherwise it should be the bootloader that takes care
> of setting such mode, as (most?) OSes are normally loaded with boot
> services already exited.

The bootloader doing this is a quirk imo. In the Linux case this implies
knowing the inner structure of the binary to be booted, to communicate
the necessary information, plus peeking into the kernel's command line.

Furthermore I wasn't referring to the EFI-with-bootloader case, but the
legacy BIOS one plus the (mentioned by you) EFI-application one. Even
the MB2 protocol allows the bootloader to hand over with boot services
not exited yet, iirc, so even in that case Xen would be in the position
of using boot service functions (from efi_multiboot2()).

Jan



Re: [PATCH v6 00/16] x86/mtrr: fix handling with PAT but without MTRR

2023-05-31 Thread Juergen Gross

On 31.05.23 10:35, Borislav Petkov wrote:
> On Wed, May 31, 2023 at 09:28:57AM +0200, Juergen Gross wrote:
>> Can you please boot the system with the MTRR patches and specify "mtrr=debug"
>> on the command line? I'd be interested in the raw register values being read
>> and the resulting memory type map.
>
> This is exactly why I wanted this option. And you're already putting it
> to good use. :-P
>
> Full dmesg below.
>
> [0.012878] last_pfn = 0x45 max_arch_pfn = 0x4
> [0.018357] MTRR default type: uncachable
> [0.022347] MTRR fixed ranges enabled:
> [0.026085]   0-9 write-back
> [0.029650]   A-B uncachable
> [0.033214]   C-F write-protect
> [0.037039] MTRR variable ranges enabled:
> [0.041038]   0 base 000 mask 0003FFC write-back

16 GB WB at address 0.

> [0.047383]   1 base 004 mask 0003FFFC000 write-back

1 GB WB at address 16GB.

> [0.053730]   2 base 0044000 mask 0003000 write-back

256MB WB at address 17GB.

This means per default 0-44fff are WB.

> [0.060076]   3 base 000AE00 mask 0003E00 uncachable

32MB UC at AE00

> [0.066421]   4 base 000B000 mask 0003000 uncachable

256MB UC at B000

> [0.072768]   5 base 000C000 mask 0003FFFC000 uncachable

512MB UC at C000

So an UC hole at AE00-.

> [0.079114]   6 disabled
> [0.081635]   7 disabled
> [0.084156]   8 disabled
> [0.086677]   9 disabled
> [0.089203] total RAM covered: 16352M
> [0.093023] Found optimal setting for mtrr clean up

It seems as if mtrr_cleanup() did change the MTRR settings.

What it did would have been printed if pr_debug() would have been
active. :-(

> [0.097734]  gran_size: 64K  chunk_size: 64M num_reg: 8  lose cover RAM: 0G
> [0.104864] MTRR map: 6 entries (3 fixed + 3 variable; max 23), built from 10 variable MTRRs
> [0.113294]   0: -0009 write-back
> [0.119033]   1: 000a-000b uncachable
> [0.124771]   2: 000c-000f write-protect
> [0.130769]   3: 0010-adff write-back
> [0.136508]   4: ae00-afff uncachable
> [0.142246]   5: 0001-00044fff write-back

The MTRR map seems to be fine assuming the MTRR values before the "clean up".

> [0.147992] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT
> [0.155122] e820: update [mem 0xae00-0xafff] usable ==> reserved
> [0.161663] e820: update [mem 0xc000-0x] usable ==> reserved
> [0.168358] e820: update [mem 0x11000-0x1] usable ==> reserved
> [0.175227] WARNING: BIOS bug: CPU MTRRs don't cover all of memory, losing 3840MB of RAM.


Clean up messed with the settings, resulting in loss of RAM.
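
As a side note, the per-range sizes quoted above follow from the base/mask pair of each variable MTRR. A minimal sketch of that arithmetic, assuming byte-granular values and a contiguous mask; the helper names and the example numbers are made up for illustration and are not taken from the kernel or from the dmesg above:

```c
#include <stdint.h>

/*
 * Size in bytes of the range described by a variable MTRR mask, assuming
 * phys_bits physical address bits and a contiguous mask.
 * Hypothetical helper for illustration, not kernel code.
 */
uint64_t mtrr_range_size(uint64_t mask, unsigned int phys_bits)
{
    uint64_t addr_mask = (phys_bits >= 64) ? ~0ULL : ((1ULL << phys_bits) - 1);

    /* Bits 11:0 of the mask register hold flags (bit 11 is the valid bit). */
    mask &= addr_mask & ~0xfffULL;

    /* For a contiguous mask the lowest set bit equals the range size. */
    return mask & (~mask + 1);
}

/* An address is covered when it matches base on every bit set in mask. */
int mtrr_covers(uint64_t base, uint64_t mask, uint64_t addr)
{
    mask &= ~0xfffULL;
    return (addr & mask) == (base & mask);
}
```

With an assumed 46 physical address bits, a mask of 0x3FFC00000800 describes a 16 GB range: the lowest set address bit is bit 34.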

Did you check whether CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT was the same in both
kernels you've tested?


Juergen




Re: [PATCH v2 2/3] multiboot2: parse console= and vga= options when setting GOP mode

2023-05-31 Thread Roger Pau Monné
On Wed, May 31, 2023 at 11:15:44AM +0200, Jan Beulich wrote:
> On 30.05.2023 18:02, Roger Pau Monné wrote:
> > On Wed, Apr 05, 2023 at 12:15:26PM +0200, Jan Beulich wrote:
> >> On 31.03.2023 11:59, Roger Pau Monne wrote:
> >>> Only set the GOP mode if vga is selected in the console option,
> >>
> >> This particular aspect of the behavior is inconsistent with legacy
> >> boot behavior: There "vga=" isn't qualified by what "console=" has.
> > 
> > Hm, I find this very odd, why would we fiddle with the VGA (or the GOP
> > here) if we don't intend to use it for output?
> 
> Because we also need to arrange for what Dom0 possibly wants to use.
> It has no way of setting the mode the low-level (BIOS or EFI) way.

I understand this might be needed when Xen is booted as an EFI
application, but otherwise it should be the bootloader that takes care
of setting such mode, as (most?) OSes are normally loaded with boot
services already exited.

I've removed the parsing of the console= option and unconditionally
parse vga= now.  We can always adjust later.

> >>> otherwise just fetch the information from the current mode in order to
> >>> make it available to dom0.
> >>>
> >>> Introduce support for passing the command line to the efi_multiboot2()
> >>> helper, and parse the console= and vga= options if present.
> >>>
> >>> Add support for the 'gfx' and 'current' vga options, ignore the 'keep'
> >>> option, and print a warning message about other options not being
> >>> currently implemented.
> >>>
> >>> Signed-off-by: Roger Pau Monné 
> >>> [...] 
> >>> --- a/xen/arch/x86/efi/efi-boot.h
> >>> +++ b/xen/arch/x86/efi/efi-boot.h
> >>> @@ -786,7 +786,30 @@ static bool __init efi_arch_use_config_file(EFI_SYSTEM_TABLE *SystemTable)
> >>>  
> >>>  static void __init efi_arch_flush_dcache_area(const void *vaddr, UINTN size) { }
> >>>  
> >>> -void __init efi_multiboot2(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable)
> >>> +/* Return the next occurrence of opt in cmd. */
> >>> +static const char __init *get_option(const char *cmd, const char *opt)
> >>> +{
> >>> +const char *s = cmd, *o = NULL;
> >>> +
> >>> +if ( !cmd || !opt )
> >>
> >> I can see why you need to check "cmd", but there's no need to check "opt"
> >> I would say.
> > 
> > Given this is executed without a page-fault handler in place I thought
> > it was best to be safe rather than sorry, and avoid any pointer
> > dereferences.
> 
> Hmm, I see. We don't do so elsewhere, though, I think.

If you insist I can remove it, otherwise I will just leave as-is.

> 
> >>> @@ -807,7 +830,60 @@ void __init efi_multiboot2(EFI_HANDLE ImageHandle, EFI_SYSTEM_TABLE *SystemTable
> >>>  
> >>>  if ( gop )
> >>>  {
> >>> -gop_mode = efi_find_gop_mode(gop, 0, 0, 0);
> >>> +const char *opt = NULL, *last = cmdline;
> >>> +/* Default console selection is "com1,vga". */
> >>> +bool vga = true;
> >>> +
> >>> +/* For the console option the last occurrence is the enforced one. */
> >>> +while ( (last = get_option(last, "console=")) != NULL )
> >>> +opt = last;
> >>> +
> >>> +if ( opt )
> >>> +{
> >>> +const char *s = strstr(opt, "vga");
> >>> +
> >>> +if ( !s || s > strpbrk(opt, " ") )
> >>
> >> Why strpbrk() and not the simpler strchr()? Or did you mean to also look
> >> for tabs, but then didn't include \t here (and in get_option())? (Legacy
> >> boot code also takes \r and \n as separators, btw, but I'm unconvinced
> >> of the need.)
> > 
> > I was originally checking for more characters here and didn't switch
> > when removing those.  I will add \t.
> > 
> >> Also aiui this is UB when the function returns NULL, as relational
> >> operators (excluding equality ones) may only be applied when both
> >> addresses refer to the same object (or to the end of an involved array).
> > 
> > Hm, I see, thanks for spotting. So I would need to do:
> > 
> > s > (strpbrk(opt, " ") ?: s)
> > 
> > So that we don't compare against NULL.
> > 
> > Also the original code was wrong AFAICT, as strpbrk() returning NULL
> > should result in vga=true (as it would imply console= is the last
> > option on the command line).
> 
> I'm afraid I'm having trouble telling what "original code" you're referring to here.
> Iirc you really only add code to the function. And boot/cmdline.c has no
> use of strpbrk() afaics.

Original code in the patch, anyway this is now gone because I no
longer parse console=.
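
For the record, a minimal sketch of how the check could be written without a relational comparison against a possibly-NULL pointer. This is not the code from the patch: last_option() and value_has_vga() are hypothetical helpers, and the "vga" test is a plain substring match for illustration:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the value of the last "opt" in cmd that starts an
 * option word, or NULL if there is none. Hypothetical helper.
 */
const char *last_option(const char *cmd, const char *opt)
{
    const char *found = NULL;
    size_t len = strlen(opt);

    for ( const char *s = cmd; s && (s = strstr(s, opt)) != NULL; s += len )
        if ( s == cmd || s[-1] == ' ' || s[-1] == '\t' )
            found = s + len;

    return found;
}

/* True if "vga" occurs inside the option value that starts at val. */
bool value_has_vga(const char *val)
{
    size_t span = strcspn(val, " \t");   /* length up to separator or end */
    const char *s = strstr(val, "vga");

    /* s and val point into the same string, so the subtraction is defined. */
    return s && (size_t)(s - val) < span;
}
```

Using strcspn() gives the length of the value whether or not a separator follows, so the case of console= being the last option on the command line needs no special handling.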

> >>> +vga = false;
> >>> +}
> >>> +
> >>> +if ( vga )
> >>> +{
> >>> +unsigned int width = 0, height = 0, depth = 0;
> >>> +bool keep_current = false;
> >>> +
> >>> +last = cmdline;
> >>> +while ( (last = get_option(last, "vga=")) != NULL )
> >>
> >> It's yet different for "vga=", I'm afraid: Early boot code (boot/cmdline.c)
> >> finds the first instance only. Normal command line 

Re: [PATCH RFC v2] vPCI: account for hidden devices

2023-05-31 Thread Jan Beulich
On 31.05.2023 00:38, Stefano Stabellini wrote:
> On Fri, 26 May 2023, Jan Beulich wrote:
>> On 25.05.2023 21:24, Stefano Stabellini wrote:
>>> On Thu, 25 May 2023, Jan Beulich wrote:
>>>> On 25.05.2023 01:37, Stefano Stabellini wrote:
>>>>> On Wed, 24 May 2023, Jan Beulich wrote:
>>>>>>>> RFC: _setup_hwdom_pci_devices()' loop may want splitting: For
>>>>>>>>      modify_bars() to consistently respect BARs of hidden devices while
>>>>>>>>      setting up "normal" ones (i.e. to avoid as much as possible the
>>>>>>>>      "continue" path introduced here), setting up of the former may want
>>>>>>>>      doing first.
>>>>>>>
>>>>>>> But BARs of hidden devices should be mapped into dom0 physmap?
>>>>>>
>>>>>> Yes.
>>>>>
>>>>> The BARs would be mapped read-only (not read-write), right? Otherwise we
>>>>> let dom0 access devices that belong to Xen, which doesn't seem like a
>>>>> good idea.
>>>>>
>>>>> But even if we map the BARs read-only, what is the benefit of mapping
>>>>> them to Dom0? If Dom0 loads a driver for it and the driver wants to
>>>>> initialize the device, the driver will crash because the MMIO region is
>>>>> read-only instead of read-write, right?
>>>>>
>>>>> How does this device hiding work for dom0? How does dom0 know not to
>>>>> access a device that is present on the PCI bus but is used by Xen?
>>>>
>>>> None of these are new questions - this has all been this way for PV Dom0,
>>>> and so far we've limped along quite okay. That's not to say that we
>>>> shouldn't improve things if we can, but that first requires ideas as to
>>>> how.
>>>
>>> For PV, that was OK because PV requires extensive guest modifications
>>> anyway. We only run Linux and few BSDs as Dom0. So, making the interface
>>> cleaner and reducing guest changes is nice-to-have but not critical.
>>>
>>> For PVH, this is different. One of the top reasons for AMD to work on
>>> PVH is to enable arbitrary non-Linux OSes as Dom0 (when paired with
>>> dom0less/hyperlaunch). It could be anything from Zephyr to a
>>> proprietary RTOS like VxWorks. Minimal guest changes for advanced
>>> features (e.g. Dom0 S3) might be OK but in general I think we should aim
>>> at (almost) zero guest changes. On ARM, it is already the case (with some
>>> non-upstream patches for dom0less PCI.)
>>>
>>> For this specific patch, which is necessary to enable PVH on AMD x86 in
>>> gitlab-ci, we can do anything we want to make it move faster. But
>>> medium/long term I think we should try to make non-Xen-aware PVH Dom0
>>> possible.
>>
>> I don't think Linux could boot as PVH Dom0 without any awareness. Hence
>> I guess it's not easy to see how other OSes might. What you're after
>> looks rather like a HVM Dom0 to me, with it being unclear where the
>> external emulator then would run (in a stubdom maybe, which might be
>> possible to arrange for via the dom0less way of creating boot time
>> DomU-s) and how it would get any necessary xenstore based information.
> 
> I know that Linux has lots of Xen awareness scattered everywhere so it
> is difficult to tell what's what. Leaving the PVH entry point aside for
> this discussion, what else is really needed for a Linux without
> CONFIG_XEN to boot as PVH Dom0?
> 
> Same question from a different angle: let's say that we boot Zephyr or
> another RTOS as HVM Dom0, what is really required for the emulator to
> emulate? I am hoping that the answer is "nothing" except for maybe a
> UART.
> 
> It comes down to how much legacy stuff the guest OS expects to find.
> Legacy stuff that would normally be emulated by QEMU. I am counting on
> the fact that a modern OS doesn't expect any of the legacy stuff (e.g.
> PIIX3/Q35/E1000) if it is not advertised in the firmware tables.

And that's where I expect the problems start: We don't really alter
things like the DSDT and SSDTs, and we also don't parse them. So we
won't know what firmware describes there. Hence we have to expect that
any legacy device might be present in the underlying platform, and
hence would also need offering either by passing through or by
emulation. Yet we can't sensibly emulate everything in Xen itself.

> If
> there is no need for QEMU, I don't know if I would call it PVH or HVM
> but either way we are good. 
> 
> Same for xenstore: there should be no need for xenstore without
> CONFIG_XEN.

Right, it may be possible to get away without xenstore.

Jan


