Re: [PATCH v7 2/2] x86/time: prefer CMOS over EFI_GET_TIME

2024-09-17 Thread Marek Marczykowski-Górecki
On Fri, Sep 13, 2024 at 09:59:07AM +0200, Roger Pau Monne wrote:
> The EFI_GET_TIME implementation is well known to be broken for many firmware
> implementations; for Xen the result on such implementations is:
> 
> [ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
> CPU:0
> RIP:e008:[<62ccfa70>] 62ccfa70
> [...]
> Xen call trace:
>[<62ccfa70>] R 62ccfa70
>[<732e9a3f>] S 732e9a3f
>[] F arch/x86/time.c#get_cmos_time+0x1b3/0x26e
>[] F init_xen_time+0x28/0xa4
>[] F __start_xen+0x1ee7/0x2578
>[] F __high_start+0x94/0xa0
> 
> Pagetable walk from 62ccfa70:
>  L4[0x000] = 00207ef1c063 
>  L3[0x001] = 5d6c0063 
>  L2[0x116] = 800062c001e3  (PSE)
> 
> 
> Panic on CPU 0:
> FATAL PAGE FAULT
> [error_code=0011]
> Faulting linear address: 62ccfa70
> 
> 
> Swap the preference to default to CMOS first, and EFI later, in an attempt to
> use EFI_GET_TIME as a last resort option only.  Note that Linux for example
> doesn't allow calling the get_time method, and instead provides a dummy handler
> that unconditionally returns EFI_UNSUPPORTED on x86-64.
> 
> Such change in the preferences requires some re-arranging of the function
> logic, so that panic messages with workaround suggestions are suitably printed.
> 
> Signed-off-by: Roger Pau Monné 

Since this changes behavior for running on EFI,
Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v2:
>  - Updated to match previous changes.
> ---
>  xen/arch/x86/time.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> index e4751684951e..b86e4d58b40c 100644
> --- a/xen/arch/x86/time.c
> +++ b/xen/arch/x86/time.c
> @@ -1592,14 +1592,14 @@ static void __init probe_wallclock(void)
>          wallclock_source = WALLCLOCK_XEN;
>          return;
>      }
> -    if ( efi_enabled(EFI_RS) && efi_get_time() )
> +    if ( cmos_rtc_probe() )
>      {
> -        wallclock_source = WALLCLOCK_EFI;
> +        wallclock_source = WALLCLOCK_CMOS;
>          return;
>      }
> -    if ( cmos_rtc_probe() )
> +    if ( efi_enabled(EFI_RS) && efi_get_time() )
>      {
> -        wallclock_source = WALLCLOCK_CMOS;
> +        wallclock_source = WALLCLOCK_EFI;
>          return;
>      }
>  
> -- 
> 2.46.0
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v7 1/2] x86/time: introduce command line option to select wallclock

2024-09-16 Thread Marek Marczykowski-Górecki
On Mon, Sep 16, 2024 at 02:11:08PM +0100, Andrew Cooper wrote:
> On 13/09/2024 8:59 am, Roger Pau Monne wrote:
> > diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
> > index 959cf45b55d9..2a9b3b9b8975 100644
> > --- a/docs/misc/xen-command-line.pandoc
> > +++ b/docs/misc/xen-command-line.pandoc
> > @@ -2816,6 +2816,27 @@ vwfi to `native` reduces irq latency significantly. It can also lead to
> >  suboptimal scheduling decisions, but only when the system is
> >  oversubscribed (i.e., in total there are more vCPUs than pCPUs).
> >  
> > +### wallclock (x86)
> > +> `= auto | xen | cmos | efi`
> > +
> > +> Default: `auto`
> > +
> > +Allow forcing the usage of a specific wallclock source.
> > +
> > + * `auto` let the hypervisor select the clocksource based on internal
> > +   heuristics.
> > +
> > + * `xen` force usage of the Xen shared_info wallclock when booted as a Xen
> > +   guest.  This option is only available if the hypervisor was compiled with
> > +   `CONFIG_XEN_GUEST` enabled.
> > +
> > + * `cmos` force usage of the CMOS RTC wallclock.
> > +
> > + * `efi` force usage of the EFI_GET_TIME run-time method when booted from EFI
> > +   firmware.
> 
> For both `xen` and `efi`, something should be said about "if selected
> and not satisfied, Xen falls back to other heuristics".
> 
> > +
> > +If the selected option is invalid or not available Xen will default to `auto`.
> 
> I'm afraid that I'm firmly of the opinion that "auto" on the cmdline is
> unnecessary complexity.  Auto is the default, and doesn't need
> specifying explicitly.  That in turn simplifies ...

What about overriding an earlier choice? For example, overriding a built-in
cmdline? That said, with the current code, the same can be achieved with
wallclock=foo, and living with the warning in the boot log...

> > +
> >  ### watchdog (x86)
> >  > `= force | `
> >  
> > diff --git a/xen/arch/x86/time.c b/xen/arch/x86/time.c
> > index 29b026735e5d..e4751684951e 100644
> > --- a/xen/arch/x86/time.c
> > +++ b/xen/arch/x86/time.c
> > @@ -1552,6 +1552,37 @@ static const char *__init wallclock_type_to_string(void)
> >  return "";
> >  }
> >  
> > +static int __init cf_check parse_wallclock(const char *arg)
> > +{
> > +wallclock_source = WALLCLOCK_UNSET;
> > +
> > +if ( !arg )
> > +return -EINVAL;
> > +
> > +if ( !strcmp("auto", arg) )
> > +ASSERT(wallclock_source == WALLCLOCK_UNSET);
> 
> ... this.
> 
> Hitting this assert will manifest as a system reboot/hang with no
> information on serial/VGA, because all of this runs prior to getting up
> the console.  You only get to see it on a PVH boot because we bodge the
> Xen console into default-existence.

This assert is a no-op, as wallclock_source is unconditionally set to
WALLCLOCK_UNSET a few lines above.

> So, ASSERT()/etc really need avoiding wherever possible in cmdline parsing.
> 
> In this case, all it serves to do is break examples like `wallclock=xen
> wallclock=auto` case, which is unlike our latest-takes-precedence
> behaviour everywhere else.
> 
> > +else if ( !strcmp("xen", arg) )
> > +{
> > +if ( !xen_guest )
> 
> We don't normally treat this path as an error when parsing (we know what
> it is, even if we can't action it).  Instead, there's no_config_param()
> to be more friendly (for PVH at least).
> 
> It's a bit awkward, but this should do:
> 
>     {
> #ifdef CONFIG_XEN_GUEST
>         wallclock_source = WALLCLOCK_XEN;
> #else
>         no_config_param("XEN_GUEST", "wallclock", s, ss);
> #endif
>     }

Can you boot a binary built with CONFIG_XEN_GUEST=y as native? If so,
the above will not be enough; a runtime check is needed anyway.

> There probably wants to be something similar for EFI, although it's not
> a plain CONFIG so it might be more tricky.

It needs to be a runtime check here even more so. Not only because of
different boot modes, but also due to the interaction with efi=no-rs (or any
other reason for not having runtime services). See the comment there.
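
To illustrate the point about the runtime check, here is a minimal standalone
sketch (not the actual Xen code): the parser would only record the request, and
a later step validates it against state that is only known at runtime. The
names resolve_wallclock(), is_xen_guest and efi_rs_enabled are made up for this
example; in Xen the corresponding checks would presumably be xen_guest and
efi_enabled(EFI_RS).

```c
#include <stdbool.h>

/* Wallclock sources, mirroring the names used in the patch. */
enum wallclock {
    WALLCLOCK_UNSET,
    WALLCLOCK_XEN,
    WALLCLOCK_CMOS,
    WALLCLOCK_EFI,
};

/*
 * Hypothetical helper: validate a requested source against runtime state.
 * Returns the request if it is usable, or WALLCLOCK_UNSET so that the
 * caller falls back to probing.
 */
enum wallclock resolve_wallclock(enum wallclock requested,
                                 bool is_xen_guest, bool efi_rs_enabled)
{
    /* "xen" only makes sense when actually running as a Xen guest. */
    if ( requested == WALLCLOCK_XEN && !is_xen_guest )
        return WALLCLOCK_UNSET;

    /* "efi" needs run-time services, which efi=no-rs may have disabled. */
    if ( requested == WALLCLOCK_EFI && !efi_rs_enabled )
        return WALLCLOCK_UNSET;

    return requested;
}
```

With this shape, a binary built with CONFIG_XEN_GUEST=y but booted natively
would still end up probing instead of using a source that isn't there.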

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v6 1/2] x86/time: introduce command line option to select wallclock

2024-09-12 Thread Marek Marczykowski-Górecki
On Thu, Sep 12, 2024 at 03:47:53PM +0200, Roger Pau Monné wrote:
> On Thu, Sep 12, 2024 at 03:30:29PM +0200, Marek Marczykowski-Górecki wrote:
> > On Thu, Sep 12, 2024 at 02:56:55PM +0200, Roger Pau Monné wrote:
> > > On Thu, Sep 12, 2024 at 01:57:00PM +0200, Jan Beulich wrote:
> > > > On 12.09.2024 13:15, Roger Pau Monne wrote:
> > > > > --- a/xen/arch/x86/time.c
> > > > > +++ b/xen/arch/x86/time.c
> > > > > @@ -1552,6 +1552,35 @@ static const char *__init wallclock_type_to_string(void)
> > > > >  return "";
> > > > >  }
> > > > >  
> > > > > +static int __init cf_check parse_wallclock(const char *arg)
> > > > > +{
> > > > > +if ( !arg )
> > > > > +return -EINVAL;
> > > > > +
> > > > > +if ( !strcmp("auto", arg) )
> > > > > +wallclock_source = WALLCLOCK_UNSET;
> > > > > +else if ( !strcmp("xen", arg) )
> > > > > +{
> > > > > +if ( !xen_guest )
> > > > > +return -EINVAL;
> > > > > +
> > > > > +wallclock_source = WALLCLOCK_XEN;
> > > > > +}
> > > > > +else if ( !strcmp("cmos", arg) )
> > > > > +wallclock_source = WALLCLOCK_CMOS;
> > > > > +else if ( !strcmp("efi", arg) )
> > > > > +/*
> > > > > + * Checking if run-time services are available must be done after
> > > > > + * command line parsing.
> > > > > + */
> > > > > +wallclock_source = WALLCLOCK_EFI;
> > > > > +else
> > > > > +return -EINVAL;
> > > > > +
> > > > > +return 0;
> > > > > +}
> > > > > +custom_param("wallclock", parse_wallclock);
> > > > > +
> > > > >  static void __init probe_wallclock(void)
> > > > >  {
> > > > >  ASSERT(wallclock_source == WALLCLOCK_UNSET);
> > > > > @@ -2527,7 +2556,15 @@ int __init init_xen_time(void)
> > > > >  
> > > > >  open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
> > > > >  
> > > > > -probe_wallclock();
> > > > > +/*
> > > > > + * EFI run time services can be disabled from the command line, hence the
> > > > > + * check for them cannot be done as part of the wallclock option parsing.
> > > > > + */
> > > > > +if ( wallclock_source == WALLCLOCK_EFI && !efi_enabled(EFI_RS) )
> > > > > +wallclock_source = WALLCLOCK_UNSET;
> > > > > +
> > > > > +if ( wallclock_source == WALLCLOCK_UNSET )
> > > > > +probe_wallclock();
> > > > 
> > > > I don't want to stand in the way, and I can live with this form, so I'd like to
> > > > ask that EFI folks or Andrew provide the necessary A-b / R-b. I'd like to note
> > > > though that there continue to be quirks here. They may not be affecting the
> > > > overall result as long as we have only three possible wallclocks, but there
> > > > might be problems if a 4th one appeared.
> > > > 
> > > > With the EFI_RS check in the command line handler and assuming the default case
> > > > of no "efi=no-rs" on the command line, EFI_RS may still end up being off by the
> > > > time the command line is being parsed. With "wallclock=cmos wallclock=efi" this
> > > > would result in no probing and "cmos" chosen if there was that check in place.
> > > > With the logic as added here there will be probing in that case. Replace "cmos"
> > > > by a hypothetical non-default 4th wallclock type (i.e. probing picking "cmos"),
> > > > and I expect you can see how the result would then not necessarily be as
> > > > expected.
> > > 
> > > My expectation would be that if "wallclock=cmos wallclock=efi" is used
> > > the last option overrides any previous on

Re: [PATCH v6 1/2] x86/time: introduce command line option to select wallclock

2024-09-12 Thread Marek Marczykowski-Górecki
On Thu, Sep 12, 2024 at 02:56:55PM +0200, Roger Pau Monné wrote:
> On Thu, Sep 12, 2024 at 01:57:00PM +0200, Jan Beulich wrote:
> > On 12.09.2024 13:15, Roger Pau Monne wrote:
> > > --- a/xen/arch/x86/time.c
> > > +++ b/xen/arch/x86/time.c
> > > @@ -1552,6 +1552,35 @@ static const char *__init wallclock_type_to_string(void)
> > >  return "";
> > >  }
> > >  
> > > +static int __init cf_check parse_wallclock(const char *arg)
> > > +{
> > > +if ( !arg )
> > > +return -EINVAL;
> > > +
> > > +if ( !strcmp("auto", arg) )
> > > +wallclock_source = WALLCLOCK_UNSET;
> > > +else if ( !strcmp("xen", arg) )
> > > +{
> > > +if ( !xen_guest )
> > > +return -EINVAL;
> > > +
> > > +wallclock_source = WALLCLOCK_XEN;
> > > +}
> > > +else if ( !strcmp("cmos", arg) )
> > > +wallclock_source = WALLCLOCK_CMOS;
> > > +else if ( !strcmp("efi", arg) )
> > > +/*
> > > + * Checking if run-time services are available must be done after
> > > + * command line parsing.
> > > + */
> > > +wallclock_source = WALLCLOCK_EFI;
> > > +else
> > > +return -EINVAL;
> > > +
> > > +return 0;
> > > +}
> > > +custom_param("wallclock", parse_wallclock);
> > > +
> > >  static void __init probe_wallclock(void)
> > >  {
> > >  ASSERT(wallclock_source == WALLCLOCK_UNSET);
> > > @@ -2527,7 +2556,15 @@ int __init init_xen_time(void)
> > >  
> > >  open_softirq(TIME_CALIBRATE_SOFTIRQ, local_time_calibration);
> > >  
> > > -probe_wallclock();
> > > +/*
> > > + * EFI run time services can be disabled from the command line, hence the
> > > + * check for them cannot be done as part of the wallclock option parsing.
> > > + */
> > > +if ( wallclock_source == WALLCLOCK_EFI && !efi_enabled(EFI_RS) )
> > > +wallclock_source = WALLCLOCK_UNSET;
> > > +
> > > +if ( wallclock_source == WALLCLOCK_UNSET )
> > > +probe_wallclock();
> > 
> > I don't want to stand in the way, and I can live with this form, so I'd like to
> > ask that EFI folks or Andrew provide the necessary A-b / R-b. I'd like to note
> > though that there continue to be quirks here. They may not be affecting the
> > overall result as long as we have only three possible wallclocks, but there
> > might be problems if a 4th one appeared.
> > 
> > With the EFI_RS check in the command line handler and assuming the default case
> > of no "efi=no-rs" on the command line, EFI_RS may still end up being off by the
> > time the command line is being parsed. With "wallclock=cmos wallclock=efi" this
> > would result in no probing and "cmos" chosen if there was that check in place.
> > With the logic as added here there will be probing in that case. Replace "cmos"
> > by a hypothetical non-default 4th wallclock type (i.e. probing picking "cmos"),
> > and I expect you can see how the result would then not necessarily be as
> > expected.
> 
> My expectation would be that if "wallclock=cmos wallclock=efi" is used
> the last option overrides any previous one, and hence if that last
> option is not valid the logic will fallback to the default selection
> (in this case to probing).

That would be my expectation too. If some kind of preference were
expected, it should look like wallclock=efi,cmos, but I don't think we
need that.

> Thinking about this, it might make sense to unconditionally set
> wallclock_source = WALLCLOCK_UNSET at the start of parse_wallclock()
> to avoid previous instances carrying over if later ones are not valid.

This may be a good idea. But more importantly, the behavior should be
included in the option documentation (that if a selected value is not
available, it falls back to auto). And maybe a log message when that
happens (but I'm okay with skipping this one, as the selected wallclock
source is logged already)?
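
A rough standalone sketch of the "reset first" idea, so that an invalid later
option does not leave an earlier one in effect. This is a simplified stand-in,
not the patch itself: parse_wallclock_sketch() and current_wallclock() are
made-up names, the runtime checks for "xen" and "efi" are left out, and -1
stands in for -EINVAL.

```c
#include <string.h>

enum wallclock {
    WALLCLOCK_UNSET,
    WALLCLOCK_XEN,
    WALLCLOCK_CMOS,
    WALLCLOCK_EFI,
};

static enum wallclock wallclock_source = WALLCLOCK_UNSET;

/* Simplified parser: returns 0 on success, -1 for an unknown value. */
int parse_wallclock_sketch(const char *arg)
{
    /* Drop any earlier choice, so an invalid option means "auto". */
    wallclock_source = WALLCLOCK_UNSET;

    if ( !arg )
        return -1;

    if ( !strcmp(arg, "auto") )
        return 0;

    if ( !strcmp(arg, "xen") )
        wallclock_source = WALLCLOCK_XEN;
    else if ( !strcmp(arg, "cmos") )
        wallclock_source = WALLCLOCK_CMOS;
    else if ( !strcmp(arg, "efi") )
        wallclock_source = WALLCLOCK_EFI;
    else
        return -1;

    return 0;
}

/* Accessor, only here so the result can be inspected from outside. */
enum wallclock current_wallclock(void)
{
    return wallclock_source;
}
```

With "wallclock=cmos wallclock=bogus" the second call fails and leaves the
source unset, so the caller would probe, which is the fallback the
documentation should describe.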

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [XEN PATCH 1/3] EFI: address violations of MISRA C Rule 13.6

2024-09-11 Thread Marek Marczykowski-Górecki
On Wed, Sep 11, 2024 at 02:50:03PM +0200, Jan Beulich wrote:
> On 10.09.2024 21:06, Federico Serafini wrote:
> > Refactor the code to improve readability
> 
> I question this aspect. I'm not the maintainer of this code anymore, so
> my view probably doesn't matter much here.
> 
> > and address violations of
> > MISRA C:2012 Rule 13.6 ("The operand of the `sizeof' operator shall
> > not contain any expression which has potential side effect").
> 
> Where's the potential side effect? Since you move ...
> 
> > --- a/xen/common/efi/runtime.c
> > +++ b/xen/common/efi/runtime.c
> > @@ -250,14 +250,20 @@ int efi_get_info(uint32_t idx, union xenpf_efi_info *info)
> >  info->cfg.addr = __pa(efi_ct);
> >  info->cfg.nent = efi_num_ct;
> >  break;
> > +
> >  case XEN_FW_EFI_VENDOR:
> > +{
> > +XEN_GUEST_HANDLE_PARAM(CHAR16) vendor_name =
> > +guest_handle_cast(info->vendor.name, CHAR16);
> 
> .. this out, it must be the one. I've looked at it, yet I can't spot
> anything:
> 
> #define guest_handle_cast(hnd, type) ({ \
> type *_x = (hnd).p; \
> (XEN_GUEST_HANDLE_PARAM(type)) { _x };  \
> })
> 
> As a rule of thumb, when things aren't obvious, please call out the
> specific aspect / property in descriptions of such patches.

I guess it's because guest_handle_cast() is a macro, yet it's lowercase
so looks like a function? Wasn't there some other MISRA rule about
lowercase/uppercase for macro names?

And yes, I don't really see why this would violate the side effect rule
either.
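
For context, what Rule 13.6 is normally aimed at, as far as I understand it:
sizeof does not evaluate its operand (unless it is a variable length array), so
a side effect written there silently never happens. A small standalone sketch
with made-up helper names, nothing from the Xen tree:

```c
#include <stddef.h>

/* Counter that makes the side effect of next_value() observable. */
static int calls;

/* Hypothetical helper with a side effect: bumps the call counter. */
int next_value(void)
{
    return ++calls;
}

/* Returns the call count after taking sizeof of a call expression. */
int calls_after_sizeof(void)
{
    /* Only the type (int) matters here; next_value() is not called. */
    size_t size = sizeof(next_value());

    (void)size;
    return calls;
}
```

The counter stays at 0 after calls_after_sizeof(), which is the kind of
surprise the rule wants to avoid. Whether guest_handle_cast() has anything
comparable is exactly what I can't see.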

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [REGRESSION] kernel NULL pointer dereference in xen-balloon with mem hotplug

2024-09-06 Thread Marek Marczykowski-Górecki
On Fri, Sep 06, 2024 at 12:30:03PM +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> On 08.08.24 12:31, Marek Marczykowski-Górecki wrote:
> > 
> > When testing Linux 6.11-rc2, I got a crash like the one below. It's a PVH
> > guest started with 400MB memory, and then extended via mem hotplug (I
> > don't know what exact size it was at this time, but up to 4GB). It
> > was quite early in the domU boot process; I suspect it could be the
> > first mem hotplug event happening there.
> > Unfortunately I don't have a reliable reproducer, it crashed only once
> > over several test runs. I don't remember seeing such a crash before, so it
> > looks like a regression in 6.11. I'm not sure if that matters, but it's
> > on ADL, very similar to the qubes-hw2 gitlab runner.
> 
> Marek, did this happen again or do things appear to be resolved? Asking
> because I'm tracking this as a regression.

I haven't investigated it more, and also haven't run any later tests on
6.11, so I don't know if it's still there, but I suspect it might be.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v2] xen: PE/COFF image header

2024-08-23 Thread Marek Marczykowski-Górecki
On Mon, Jul 29, 2024 at 01:42:46PM +0200, Jan Beulich wrote:
> On 23.07.2024 20:22, Milan Djokic wrote:
> > From: Nikola Jelic 
> > 
> > Added PE/COFF generic image header which shall be used for EFI
> > application format for x86/risc-v. x86 and risc-v source shall be adjusted
> > to use this header in following commits. pe.h header is taken over from
> > linux kernel with minor changes in terms of formatting and structure member comments.
> > Also, COFF relocation and win cert structures are omitted, since these are not relevant for Xen.
> > 
> > Origin: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 36e4fc57fc16
> > 
> > Signed-off-by: Nikola Jelic 
> > Signed-off-by: Milan Djokic 

Acked-by: Marek Marczykowski-Górecki 

> This looks okay to me now, but I can't ack it (or more precisely my ack
> wouldn't mean anything). There are a few style issues in comments, but
> leaving them as they are in Linux may be intentional. Just one question,
> more to other maintainers than to you:
> 
> > +#define IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE          0x0040
> > +#define IMAGE_DLL_CHARACTERISTICS_FORCE_INTEGRITY       0x0080
> > +#define IMAGE_DLL_CHARACTERISTICS_NX_COMPAT             0x0100
> > +#define IMAGE_DLLCHARACTERISTICS_NO_ISOLATION           0x0200
> > +#define IMAGE_DLLCHARACTERISTICS_NO_SEH                 0x0400
> > +#define IMAGE_DLLCHARACTERISTICS_NO_BIND                0x0800
> > +#define IMAGE_DLLCHARACTERISTICS_WDM_DRIVER             0x2000
> > +#define IMAGE_DLLCHARACTERISTICS_TERMINAL_SERVER_AWARE  0x8000
> > +
> > +#define IMAGE_DLLCHARACTERISTICS_EX_CET_COMPAT          0x0001
> > +#define IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT  0x0040
> 
> The naming inconsistency (underscore or not after DLL) is somewhat
> unhelpful. Do we maybe want to diverge from what Linux has here? Note
> that e.g. the GNU binutils header has at least a comment there.

Indeed it doesn't look great, but IMO leaving it consistent with Linux
is okay, as it eases updating and porting/comparing other code if needed.

> What I'm puzzled by is IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT
> having the same value as IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE. Are
> these meant to apply to the same field? Or do these values rather
> relate to IMAGE_DEBUG_TYPE_EX_DLLCHARACTERISTICS? Some clarification
> may be needed here, or the two entries may simply want omitting for
> now.

One has the _EX_ infix and the other doesn't, so IMO together with the visual
separation it's clear they apply to different fields.
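
To spell out why the equal values are harmless, a small sketch under the
assumption that the two sets of flags are stored in separate fields. The
struct, its field widths and the helper names are hypothetical, made up only
to show the idea; only the two macro values are taken from the quoted header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as in the quoted header. */
#define IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE          0x0040
#define IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT  0x0040

/* Hypothetical container; the real fields live in different structures. */
struct image_flags {
    uint16_t dll_characteristics;    /* regular DLL characteristics */
    uint16_t ex_dll_characteristics; /* extended (_EX_) characteristics */
};

/* Tests the non-_EX_ flag against the regular field only. */
bool has_dynamic_base(const struct image_flags *f)
{
    return f->dll_characteristics & IMAGE_DLL_CHARACTERISTICS_DYNAMIC_BASE;
}

/* Tests the _EX_ flag against the extended field only. */
bool has_forward_cfi_compat(const struct image_flags *f)
{
    return f->ex_dll_characteristics &
           IMAGE_DLLCHARACTERISTICS_EX_FORWARD_CFI_COMPAT;
}
```

As long as each mask is only ever applied to its own field, setting 0x0040 in
one of them says nothing about the other.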

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: Assertion failed at arch/x86/genapic/x2apic.c:38 on S3 resume nested in KVM on AMD

2024-08-08 Thread Marek Marczykowski-Górecki
On Thu, Aug 08, 2024 at 01:22:30PM +0200, Jan Beulich wrote:
> On 23.07.2024 16:28, Marek Marczykowski-Górecki wrote:
> > I'm observing a crash like the one below when trying to resume from S3.
> > It happens on Xen nested in KVM (QEMU 9.0, Linux 6.9.3) but only on AMD.
> > The very same software stack on Intel works just fine. QEMU is running
> > with "-cpu host,+svm,+invtsc -machine q35,kernel-irqchip=split -device
> > amd-iommu,intremap=on -smp 2" among others.
> > 
> > (XEN) Preparing system for ACPI S3 state.
> > (XEN) Disabling non-boot CPUs ...
> > (XEN) Broke affinity for IRQ1, new: {0-1}
> > (XEN) Broke affinity for IRQ20, new: {0-1}
> > (XEN) Broke affinity for IRQ22, new: {0-1}
> > (XEN) Entering ACPI S3 state.
> > (XEN) Finishing wakeup from ACPI S3 state.
> > (XEN) Enabling non-boot CPUs  ...
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) [ Xen-4.20  x86_64  debug=y  Not tainted ]
> > (XEN) CPU:1
> > (XEN) RIP:e008:[] 
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN) RFLAGS: 00010096   CONTEXT: hypervisor
> > (XEN) rax: 830278a25f50   rbx: 0001   rcx: 
> > 82d0405e1700
> > (XEN) rdx: 003233412000   rsi: 8302739da2d8   rdi: 
> > 
> > (XEN) rbp: 00c8   rsp: 8302739d7e78   r8:  
> > 0001
> > (XEN) r9:  8302739d7fa0   r10: 0001   r11: 
> > 
> > (XEN) r12: 0001   r13: 0001   r14: 
> > 
> > (XEN) r15:    cr0: 8005003b   cr4: 
> > 007506e0
> > (XEN) cr3: 7fa7a000   cr2: 
> > (XEN) fsb:    gsb:    gss: 
> > 
> > (XEN) ds:    es:    fs:    gs:    ss:    cs: e008
> > (XEN) Xen code around  
> > (x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9):
> > (XEN)  cf 82 ff ff eb b7 0f 0b <0f> 0b 48 8d 05 9c fc 33 00 48 8b 0d a5 
> > 0a 35 00
> > (XEN) Xen stack trace from rsp=8302739d7e78:
> > (XEN) 00c8 0001 
> > 0001
> > (XEN) 82d0402f1d83 8302739d7fff 
> > 00c8
> > (XEN)0001 0001 82d04031adb9 
> > 0001
> > (XEN)   
> > 82d040276677
> > (XEN)   
> > 
> > (XEN)88810037c000 0001 0246 
> > deadbeefdeadf00d
> > (XEN)0001   
> > 811d130a
> > (XEN)deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d 
> > 0100
> > (XEN)811d130a e033 0246 
> > c900400b3ef8
> > (XEN)e02b beef beef 
> > beef
> > (XEN)beef e011 8302739de000 
> > 003233412000
> > (XEN)007506e0   
> > 0002
> > (XEN)0002
> > (XEN) Xen call trace:
> > (XEN)[] R 
> > x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
> > (XEN)[] S setup_local_APIC+0x26/0x449
> > (XEN)[] S start_secondary+0x1c4/0x37a
> > (XEN)[] S __high_start+0x87/0xd0
> > (XEN) 
> > (XEN) 
> > (XEN) 
> > (XEN) Panic on CPU 1:
> > (XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, this_cpu))' failed at arch/x86/genapic/x2apic.c:38
> > (XEN) 
> 
> Would you mind giving the patch below a try?

Yes, this seems to fix the issue, thanks!

> Jan
> 
> x86/x2APIC: correct cluster tracking upon CPUs going down for S3
> 
> Downing CPUs for S3 is somewhat special: Since we can expect the system
> to come back up in exactly the same hardware configuration, per-CPU data
> for the secondary CPUs isn't de-allocated (and then cleared upon re-
> allocation when the CPUs are being brought back up). Therefore the
> cluster_cpus per-CPU pointer will retain its value for all CPUs other
> than the f

[REGRESSION] kernel NULL pointer dereference in xen-balloon with mem hotplug

2024-08-08 Thread Marek Marczykowski-Górecki
nqa.qubes-os.org/tests/108883/file/system_tests-qubes.tests.integ.vm_qrexec_gui.TC_20_NonAudio_whonix-workstation-17.test_105.guest-test-inst-vm2.log
Other logs, including dom0 and Xen messages:
https://openqa.qubes-os.org/tests/108883#downloads

The kernel config is built by merging
https://github.com/QubesOS/qubes-linux-kernel/blob/005ae1ac3819d957379e48fb2cfd33f511a47275/config-base
with
https://github.com/QubesOS/qubes-linux-kernel/blob/005ae1ac3819d957379e48fb2cfd33f511a47275/config-qubes
(options set in the latter take precedence).
In particular, it has:
CONFIG_XEN_BALLOON_MEMORY_HOTPLUG=y
CONFIG_XEN_UNPOPULATED_ALLOC=y

#regzbot introduced: v6.10..v6.11-rc2

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-07 Thread Marek Marczykowski-Górecki
On Wed, Aug 07, 2024 at 12:26:26PM +0200, Jürgen Groß wrote:
> On 07.08.24 12:23, Marek Marczykowski-Górecki wrote:
> > On Tue, Aug 06, 2024 at 05:24:22PM +0200, Jürgen Groß wrote:
> > > On 06.08.24 17:21, Marek Marczykowski-Górecki wrote:
> > > > On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> > > > > Marek,
> > > > > 
> > > > > On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > > > > > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > > > > > Hello,
> > > > > > > 
> > > > > > > while it feels like we had a similar situation before, I can't 
> > > > > > > seem to be
> > > > > > > able to find traces thereof, or associated (Linux) commits.
> > > > > > 
> > > > > > Is it some AMD Threadripper system by a chance? Previous thread on 
> > > > > > this
> > > > > > issue:
> > > > > > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > > > > > 
> > > > > > > With
> > > > > > > 
> > > > > > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > > > > > ...
> > > > > > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 
> > > > > > > pages to be allocated)
> > > > > > > ...
> > > > > > > (XEN)  Loaded kernel: 8100->8400
> > > > > > > 
> > > > > > > the kernel occupies the space from 16Mb to 64Mb in the initial 
> > > > > > > allocation.
> > > > > > > Page tables come (almost) directly above:
> > > > > > > 
> > > > > > > (XEN)  Page tables:   84001000->84026000
> > > > > > > 
> > > > > > > I.e. they're just above the 64Mb boundary. Yet sadly in the host 
> > > > > > > E820 map
> > > > > > > there is
> > > > > > > 
> > > > > > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > > > > > 
> > > > > > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't tolerate
> > > > > > > such an overlap (also if it was overlapping the kernel image, e.g. if on the
> > > > > > > machine in question a sufficiently much larger kernel was used). Yet with its
> > > > > > > fundamental goal of making its E820 match the host one I'm also in trouble
> > > > > > > thinking of possible solutions / workarounds. I certainly do not see Xen
> > > > > > > trying to cover for this, as the E820 map re-arrangement is purely a kernel
> > > > > > > side decision (forward ported kernels got away without, and what e.g. the
> > > > > > > BSDs do is entirely unknown to me).
> > > > > > 
> > > > > > In Qubes we have worked around the issue by moving the kernel lower
> > > > > > (CONFIG_PHYSICAL_START=0x20):
> > > > > > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > > > > > 
> > > > > > Far from ideal, but gets it bootable...
> > > > > > 
> > > > > 
> > > > > could you test the attached kernel patches? They should fix the issue 
> > > > > without
> > > > > having to modify CONFIG_PHYSICAL_START.
> > > > > 
> > > > > I have tested them to boot up without problem on my test system, but 
> > > > > I don't
> > > > > have access to a system showing the E820 map conflict you are seeing.
> > > > > 
> > > > > The patches have been developed against kernel 6.11-rc2, but I think 
> > > > > they
> > > > > should apply to a 6.10 and maybe even an older kernel.
> > > > 
> > > > Sure, but tomorrow-ish.
> > > 
> > > Thanks.
> > 
> > Seems to work :)
> > 
> > Snippets from Xen 

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-07 Thread Marek Marczykowski-Górecki
On Tue, Aug 06, 2024 at 05:24:22PM +0200, Jürgen Groß wrote:
> On 06.08.24 17:21, Marek Marczykowski-Górecki wrote:
> > On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> > > Marek,
> > > 
> > > On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > > > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > > > Hello,
> > > > > 
> > > > > while it feels like we had a similar situation before, I can't seem 
> > > > > to be
> > > > > able to find traces thereof, or associated (Linux) commits.
> > > > 
> > > > Is it some AMD Threadripper system by a chance? Previous thread on this
> > > > issue:
> > > > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > > > 
> > > > > With
> > > > > 
> > > > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > > > ...
> > > > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 
> > > > > pages to be allocated)
> > > > > ...
> > > > > (XEN)  Loaded kernel: 8100->8400
> > > > > 
> > > > > the kernel occupies the space from 16Mb to 64Mb in the initial 
> > > > > allocation.
> > > > > Page tables come (almost) directly above:
> > > > > 
> > > > > (XEN)  Page tables:   84001000->84026000
> > > > > 
> > > > > I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 
> > > > > map
> > > > > there is
> > > > > 
> > > > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > > > 
> > > > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't tolerate
> > > > > such an overlap (also if it was overlapping the kernel image, e.g. if on the
> > > > > machine in question a sufficiently much larger kernel was used). Yet with its
> > > > > fundamental goal of making its E820 match the host one I'm also in trouble
> > > > > thinking of possible solutions / workarounds. I certainly do not see Xen
> > > > > trying to cover for this, as the E820 map re-arrangement is purely a kernel
> > > > > side decision (forward ported kernels got away without, and what e.g. the
> > > > > BSDs do is entirely unknown to me).
> > > > 
> > > > In Qubes we have worked around the issue by moving the kernel lower
> > > > (CONFIG_PHYSICAL_START=0x20):
> > > > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > > > 
> > > > Far from ideal, but gets it bootable...
> > > > 
> > > 
> > > could you test the attached kernel patches? They should fix the issue 
> > > without
> > > having to modify CONFIG_PHYSICAL_START.
> > > 
> > > I have tested them to boot up without problem on my test system, but I 
> > > don't
> > > have access to a system showing the E820 map conflict you are seeing.
> > > 
> > > The patches have been developed against kernel 6.11-rc2, but I think they
> > > should apply to a 6.10 and maybe even an older kernel.
> > 
> > Sure, but tomorrow-ish.
> 
> Thanks.

Seems to work :)

Snippets from Xen log:

(XEN) EFI RAM map:
(XEN)  [, 0009] (usable)
(XEN)  [000a, 000f] (reserved)
(XEN)  [0010, 03ff] (usable)
(XEN)  [0400, 04011fff] (ACPI NVS)
(XEN)  [04012000, 09df1fff] (usable)
(XEN)  [09df2000, 09ff] (reserved)
(XEN)  [0a00, a8840fff] (usable)
(XEN)  [a8841000, a9d9] (reserved)
(XEN)  [a9da, a9dd4fff] (ACPI data)
(XEN)  [a9dd5000, a9dd5fff] (reserved)
(XEN)  [a9dd6000, a9f20fff] (ACPI data)
(XEN)  [a9f21000, aa099fff] (ACPI NVS)
(XEN)  [aa09a000, ab1fefff] (reserved)
(XEN)  [ab1ff000, abff] (usable)
(XEN)  [ac00, afff] (reserved)
(XEN)  [b250, b2580fff] 

Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-08-06 Thread Marek Marczykowski-Górecki
On Tue, Aug 06, 2024 at 04:12:32PM +0200, Jürgen Groß wrote:
> Marek,
> 
> On 17.06.24 16:03, Marek Marczykowski-Górecki wrote:
> > On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> > > Hello,
> > > 
> > > while it feels like we had a similar situation before, I can't seem to be
> > > able to find traces thereof, or associated (Linux) commits.
> > 
> > Is it some AMD Threadripper system by a chance? Previous thread on this
> > issue:
> > https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/
> > 
> > > With
> > > 
> > > (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> > > ...
> > > (XEN)  Dom0 alloc.:   00044000->00044800 (619175 pages to be allocated)
> > > ...
> > > (XEN)  Loaded kernel: 8100->8400
> > > 
> > > the kernel occupies the space from 16Mb to 64Mb in the initial allocation.
> > > Page tables come (almost) directly above:
> > > 
> > > (XEN)  Page tables:   84001000->84026000
> > > 
> > > I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 map
> > > there is
> > > 
> > > (XEN)  [0400, 04009fff] (ACPI NVS)
> > > 
> > > i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't tolerate
> > > such an overlap (also if it was overlapping the kernel image, e.g. if on the
> > > machine in question a sufficiently much larger kernel was used). Yet with its
> > > fundamental goal of making its E820 match the host one I'm also in trouble
> > > thinking of possible solutions / workarounds. I certainly do not see Xen
> > > trying to cover for this, as the E820 map re-arrangement is purely a kernel
> > > side decision (forward ported kernels got away without, and what e.g. the
> > > BSDs do is entirely unknown to me).
> > 
> > In Qubes we have worked around the issue by moving the kernel lower
> > (CONFIG_PHYSICAL_START=0x20):
> > https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282
> > 
> > Far from ideal, but gets it bootable...
> > 
> 
> could you test the attached kernel patches? They should fix the issue without
> having to modify CONFIG_PHYSICAL_START.
> 
> I have tested them to boot up without problem on my test system, but I don't
> have access to a system showing the E820 map conflict you are seeing.
> 
> The patches have been developed against kernel 6.11-rc2, but I think they
> should apply to a 6.10 and maybe even an older kernel.

Sure, but tomorrow-ish.

> If possible it would be nice to verify suspend to disk still working, as
> the kernel will need to access the ACPI NVS area in this case.

That might be harder, as Qubes OS doesn't support suspend to disk, but
I'll see if something can be done.
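
For reference, the conflict described in the quoted text boils down to a
range-overlap test against the host E820 map. A minimal sketch in C, not
taken from Xen or Linux: all sketch_* names are hypothetical, and entry
ends are inclusive only because that is how the Xen log prints them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified E820 entry types for illustration only. */
enum sketch_e820_type {
    SKETCH_RAM,
    SKETCH_RESERVED,
    SKETCH_ACPI_DATA,
    SKETCH_ACPI_NVS,
};

struct sketch_e820_entry {
    uint64_t start;             /* first byte of the range */
    uint64_t end;               /* last byte of the range (inclusive) */
    enum sketch_e820_type type;
};

/*
 * Return true if the half-open allocation [start, end) touches any
 * entry of the map that is not usable RAM.
 */
static bool sketch_conflicts(const struct sketch_e820_entry *map, size_t n,
                             uint64_t start, uint64_t end)
{
    size_t i;

    assert(map || n == 0);

    for ( i = 0; i < n; i++ )
        if ( map[i].type != SKETCH_RAM &&
             start <= map[i].end && end > map[i].start )
            return true;

    return false;
}
```

With an illustrative map that has an ACPI NVS entry starting right at the
64Mb boundary, an allocation ending exactly at 64Mb does not conflict,
while anything placed just above that boundary does.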

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v2] automation: upgrade Yocto to scarthgap

2024-07-30 Thread Marek Marczykowski-Górecki
On Tue, Jul 30, 2024 at 03:01:52PM +0100, Andrew Cooper wrote:
> On 30/07/2024 2:46 pm, Marek Marczykowski-Górecki wrote:
> > On Fri, Jul 26, 2024 at 05:19:42PM -0700, Stefano Stabellini wrote:
> >> Upgrade Yocto to a newer version. Use ext4 as image format for testing
> >> with QEMU on ARM and ARM64 as the default is WIC and it is not available
> >> for our xen-image-minimal target.
> >>
> >> Also update the tar.bz2 filename for the rootfs.
> >>
> >> Signed-off-by: Stefano Stabellini 
> > Reviewed-by: Marek Marczykowski-Górecki 
> >
> >> ---
> >>
> >> all yocto tests pass:
> >> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/1390081173
> 
> That test ran on gitlab-docker-pug, not qubes-ambrosia, so doesn't
> confirm the fix to the xattr issue.

There is one on ambrosia too:
https://gitlab.com/xen-project/people/sstabellini/xen/-/jobs/7423043016

> Seeing as I'm going to need to rebuild the container anyway, I'll see
> about forcing this and double checking.

But double-checking is a good idea anyway.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v2] automation: upgrade Yocto to scarthgap

2024-07-30 Thread Marek Marczykowski-Górecki
On Fri, Jul 26, 2024 at 05:19:42PM -0700, Stefano Stabellini wrote:
> Upgrade Yocto to a newer version. Use ext4 as image format for testing
> with QEMU on ARM and ARM64 as the default is WIC and it is not available
> for our xen-image-minimal target.
> 
> Also update the tar.bz2 filename for the rootfs.
> 
> Signed-off-by: Stefano Stabellini 

Reviewed-by: Marek Marczykowski-Górecki 

> ---
> 
> all yocto tests pass:
> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/1390081173
> 
> Changes in v2:
> - s/EXT4/IMAGE_FMT/
> - set IMAGE_FMT before the call to project_build
> - also update the filename xen-image-minimal-qemuarm.rootfs.tar.bz2
> ---
>  automation/build/yocto/build-yocto.sh   | 15 ---
>  automation/build/yocto/yocto.inc|  4 ++--
>  automation/gitlab-ci/build.yaml |  2 +-
>  automation/scripts/qemu-smoke-dom0-arm32.sh |  2 +-
>  4 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/automation/build/yocto/build-yocto.sh b/automation/build/yocto/build-yocto.sh
> index 93ce81ce82..e1e69f2bb5 100755
> --- a/automation/build/yocto/build-yocto.sh
> +++ b/automation/build/yocto/build-yocto.sh
> @@ -25,6 +25,7 @@ TARGET_SUPPORTED="qemuarm qemuarm64 qemux86-64"
>  VERBOSE="n"
>  TARGETLIST=""
>  BUILDJOBS="8"
> +IMAGE_FMT=""
>  
>  # actions to do
>  do_clean="n"
> @@ -38,8 +39,9 @@ build_result=0
>  # layers to include in the project
>  build_layerlist="poky/meta poky/meta-poky poky/meta-yocto-bsp \
>   meta-openembedded/meta-oe meta-openembedded/meta-python \
> + meta-openembedded/meta-networking \
>   meta-openembedded/meta-filesystems \
> - meta-openembedded/meta-networking meta-virtualization"
> + meta-virtualization"
>  
>  # yocto image to build
>  build_image="xen-image-minimal"
> @@ -175,7 +177,7 @@ function project_build() {
>  mkdir -p $OUTPUTDIR
>  cp $BUILDDIR/tmp/deploy/images/qemuarm/zImage $OUTPUTDIR
>  cp $BUILDDIR/tmp/deploy/images/qemuarm/xen-qemuarm $OUTPUTDIR
> -cp $BUILDDIR/tmp/deploy/images/qemuarm/xen-image-minimal-qemuarm.tar.bz2 $OUTPUTDIR
> +cp $BUILDDIR/tmp/deploy/images/qemuarm/xen-image-minimal-qemuarm.rootfs.tar.bz2 $OUTPUTDIR
>  fi
>  fi
>  ) || return 1
> @@ -196,7 +198,7 @@ function project_run() {
>  
>  /usr/bin/expect <<EOF
>  set timeout 1000
> -spawn bash -c "runqemu serialstdio nographic slirp"
> +spawn bash -c "runqemu serialstdio nographic slirp ${IMAGE_FMT}"
>  
>  expect_after {
>  -re "(.*)\r" {
> @@ -356,6 +358,13 @@ for f in ${TARGETLIST}; do
>  run_task project_create "${f}"
>  fi
>  if [ -f "${BUILDDIR}/${f}/conf/local.conf" ]; then
> +# Set the right image target
> +if [ "$f" = "qemux86-64" ]; then
> +IMAGE_FMT=""
> +else
> +IMAGE_FMT="ext4"
> +fi
> +
>  if [ "${do_build}" = "y" ]; then
>  run_task project_build "${f}"
>  fi
> diff --git a/automation/build/yocto/yocto.inc b/automation/build/yocto/yocto.inc
> index 2f3b1a5b2a..209df7dde9 100644
> --- a/automation/build/yocto/yocto.inc
> +++ b/automation/build/yocto/yocto.inc
> @@ -6,10 +6,10 @@
>  # YOCTOVERSION-TARGET for x86_64 hosts
>  # YOCTOVERSION-TARGET-arm64v8 for arm64 hosts
>  # For example you can build an arm64 container with the following command:
> -# make yocto/kirkstone-qemuarm64-arm64v8
> +# make yocto/scarthgap-qemuarm64-arm64v8
>  
>  # Yocto versions we are currently using.
> -YOCTO_VERSION = kirkstone
> +YOCTO_VERSION = scarthgap
>  
>  # Yocto BSPs we want to build for.
>  YOCTO_TARGETS = qemuarm64 qemuarm qemux86-64
> diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
> index 7ce88d38e7..32045cef0c 100644
> --- a/automation/gitlab-ci/build.yaml
> +++ b/automation/gitlab-ci/build.yaml
> @@ -212,7 +212,7 @@
>script:
>  - ./automation/build/yocto/build-yocto.sh -v --log-dir=./logs --xen-dir=`pwd` ${YOCTO_BOARD} ${YOCTO_OUTPUT}
>variables:
> -YOCTO_VERSION: kirkstone
> +YOCTO_VERSION: scarthgap
>  CONTAINER: yocto:${YOCTO_VERSION}-${YOCTO_BOARD}${YOCTO_HOST}
>artifacts:
>  paths:
> diff --git a/automation/scripts/qemu-smoke-dom0-arm32.sh b/automation/scripts/qemu-smoke-dom0-arm32.sh
> index 31c05cc840..eaaea5a982 100755
> --- a/automation/scripts/qemu-smoke-dom0-arm32.sh
> +++ b/automation/scripts/qemu-smoke-dom0-arm32.sh
> @@ -8,7 +8,7 @@ cd binaries
>  
>  mkdir rootfs
>  cd rootfs
> -tar xvf ../xen-image-minimal-qemuarm.tar.bz2
> +tar xvf ../xen-image-minimal-qemuarm.rootfs.tar.bz2
>  mkdir -p ./root
>  echo "name=\"test\"
>  memory=400
> -- 
> 2.25.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v2] x86/shutdown: change default reboot method preference

2024-07-29 Thread Marek Marczykowski-Górecki
On Fri, Sep 15, 2023 at 09:43:47AM +0200, Roger Pau Monne wrote:
> The current logic to choose the preferred reboot method is based on the mode Xen
> has been booted into, so if the box is booted from UEFI, the preferred reboot
> method will be to use the ResetSystem() run time service call.
> 
> However, that method seems to be widely untested, and quite often leads to a
> result similar to:
> 
> Hardware Dom0 shutdown: rebooting machine
> [ Xen-4.18-unstable  x86_64  debug=y  Tainted:   C]
> CPU:0
> RIP:e008:[<0017>] 0017
> RFLAGS: 00010202   CONTEXT: hypervisor
> [...]
> Xen call trace:
>[<0017>] R 0017
>[] S 83207eff7b50
>[] F machine_restart+0x1da/0x261
>[] F apic_wait_icr_idle+0/0x37
>[] F smp_call_function_interrupt+0xc7/0xcb
>[] F call_function_interrupt+0x20/0x34
>[] F do_IRQ+0x150/0x6f3
>[] F common_interrupt+0x132/0x140
>[] F arch/x86/acpi/cpu_idle.c#acpi_idle_do_entry+0x113/0x129
>[] F arch/x86/acpi/cpu_idle.c#acpi_processor_idle+0x3eb/0x5f7
>[] F arch/x86/domain.c#idle_loop+0xec/0xee
> 
> 
> Panic on CPU 0:
> FATAL TRAP: vector = 6 (invalid opcode)
> 
> 
> Which in most cases does lead to a reboot, however that's unreliable.
> 
> Change the default reboot preference to prefer ACPI over UEFI if available and
> not in reduced hardware mode.
> 
> This is in line with what Linux does, so it's unlikely to cause issues on current
> and future hardware, since there's a much higher chance of vendors testing
> hardware with Linux rather than Xen.
> 
> Add a special case for one Acer model that does require being rebooted using
> ResetSystem().  See Linux commit 0082517fa4bce for rationale.
> 
> I'm not aware of using ACPI reboot causing issues on boxes that do have
> properly implemented ResetSystem() methods.

With the Acer quirk, and the info Jan posted in the thread, this
sentence technically is not true. I don't think it warrants any code
change in this patch (it's clearly a less common and less problematic
issue than a crash during ResetSystem(), and it can still be worked around
with a cmdline option). But it might warrant adjusting the commit message.

> Signed-off-by: Roger Pau Monné 

Other points still stand, and I think this generally is an improvement,
so, preferably with adjusted commit message:

Acked-by: Marek Marczykowski-Górecki 

> ---
> Changes since v1:
>  - Add special case for Acer model to use UEFI reboot.
>  - Adjust commit message.
> ---
>  xen/arch/x86/shutdown.c | 19 +++
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/xen/arch/x86/shutdown.c b/xen/arch/x86/shutdown.c
> index 7619544d14da..3816ede1afe5 100644
> --- a/xen/arch/x86/shutdown.c
> +++ b/xen/arch/x86/shutdown.c
> @@ -150,19 +150,20 @@ static void default_reboot_type(void)
>  
>  if ( xen_guest )
>  reboot_type = BOOT_XEN;
> +else if ( !acpi_disabled && !acpi_gbl_reduced_hardware )
> +reboot_type = BOOT_ACPI;
>  else if ( efi_enabled(EFI_RS) )
>  reboot_type = BOOT_EFI;
> -else if ( acpi_disabled )
> -reboot_type = BOOT_KBD;
>  else
> -reboot_type = BOOT_ACPI;
> +reboot_type = BOOT_KBD;
>  }
>  
>  static int __init cf_check override_reboot(const struct dmi_system_id *d)
>  {
>  enum reboot_type type = (long)d->driver_data;
>  
> -if ( type == BOOT_ACPI && acpi_disabled )
> +if ( (type == BOOT_ACPI && acpi_disabled) ||
> + (type == BOOT_EFI && !efi_enabled(EFI_RS)) )
>  type = BOOT_KBD;
>  
>  if ( reboot_type != type )
> @@ -172,6 +173,7 @@ static int __init cf_check override_reboot(const struct dmi_system_id *d)
>  [BOOT_KBD]  = "keyboard controller",
>  [BOOT_ACPI] = "ACPI",
>  [BOOT_CF9]  = "PCI",
> +[BOOT_EFI]  = "UEFI",
>  };
>  
>  reboot_type = type;
> @@ -530,6 +532,15 @@ static const struct dmi_system_id __initconstrel reboot_dmi_table[] = {
>  DMI_MATCH(DMI_PRODUCT_NAME, "PowerEdge R740"),
>  },
>  },
> +{    /* Handle problems with rebooting on Acer TravelMate X514-51T. */
> +.callback = override_reboot,
> +.driver_data = (void *)(long)BOOT_EFI,
> +.ident = "Acer TravelMate X514-51T",
> +.matches = {
> +DMI_MATCH(DMI_SYS_VENDOR, "Acer"),
> +DMI_MATCH(DMI_PRODUCT_NAME, "TravelMate X514-51T"),
> +},
> +},
>  { }
>  };
>  
> -- 
> 2.42.0
> 
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


[PATCH v7 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-07-25 Thread Marek Marczykowski-Górecki
In some cases, only a few registers on a page need to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask of which
qwords are supposed to be read-only and an address where the page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of a more efficient structure, because there
aren't supposed to be many pages needing this precise r/o control.
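
The per-page bitmask described above can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the code from
this patch: all sketch_* names are hypothetical, a 4k page and 8-byte
granularity are assumed, and locking, the list itself and the mapping
used for write emulation are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE    4096u
#define SKETCH_SUBPAGE_GRAN 8u    /* one bit per qword */
#define SKETCH_MASK_WORDS   (SKETCH_PAGE_SIZE / SKETCH_SUBPAGE_GRAN / 64u)

/* Hypothetical per-page entry: a set bit means the qword is read-only. */
struct sketch_subpage_ro {
    uint64_t ro_qwords[SKETCH_MASK_WORDS];
};

/*
 * Mark [offset, offset + size) within one page as read-only.
 * Misaligned, empty or out-of-page requests are refused, both start
 * and size alike.
 */
static int sketch_mark_ro(struct sketch_subpage_ro *e,
                          unsigned int offset, unsigned int size)
{
    unsigned int i;

    assert(e);

    if ( size == 0 ||
         offset % SKETCH_SUBPAGE_GRAN || size % SKETCH_SUBPAGE_GRAN ||
         offset >= SKETCH_PAGE_SIZE || size > SKETCH_PAGE_SIZE - offset )
        return -1;

    for ( i = offset / SKETCH_SUBPAGE_GRAN;
          i < (offset + size) / SKETCH_SUBPAGE_GRAN; i++ )
        e->ro_qwords[i / 64] |= UINT64_C(1) << (i % 64);

    return 0;
}

/* Would a write hitting the qword at this page offset be let through? */
static bool sketch_write_allowed(const struct sketch_subpage_ro *e,
                                 unsigned int offset)
{
    unsigned int i = (offset % SKETCH_PAGE_SIZE) / SKETCH_SUBPAGE_GRAN;

    assert(e);

    return !((e->ro_qwords[i / 64] >> (i % 64)) & 1);
}
```

For example, protecting 64 bytes at page offset 0x40 sets bits 8-15 only,
so writes to offsets 0x38 or 0x80 on the same page would still be
emulated and allowed.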

The way this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what the guest is trying to write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed-through device partially used by Xen.
As of the next patch, it applies only to a configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function must not be called with overlapping
ranges, or on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v7:
- refuse misaligned start in release build too, to have release build
  running what was tested in debug build
- simplify return from subpage_mmio_ro_add_page
Changes in v6:
- fix return type of subpage_mmio_find_page()
- change 'iter' pointer to 'new_entry' bool and move list_add()
- comment why different error handling for unaligned start / size
- code style
Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, a

[PATCH v7 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-25 Thread Marek Marczykowski-Górecki
On older systems, XHCI xcap had a layout where no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (2):
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 261 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 +++--
 6 files changed, 313 insertions(+), 19 deletions(-)

base-commit: b25b28ede1cba43eda1e0b84ad967683b8196847
-- 
git-series 0.9.1



[PATCH v7 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-25 Thread Marek Marczykowski-Górecki
Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page does contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1



Re: [PATCH v6 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-25 Thread Marek Marczykowski-Górecki
On Thu, Jul 25, 2024 at 11:26:31AM +0200, Jan Beulich wrote:
> On 23.07.2024 05:24, Marek Marczykowski-Górecki wrote:
> > + * so tolerate it.
> > + * But unaligned size would result in smaller area, so deny it.
> > + */
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +return -EINVAL;
> 
> I hoped you would, when adding the comment, recall an earlier comment of
> mine: If you want to tolerate mis-aligned start in release builds, you
> need to make further adjustments to the subsequent logic (at which
> point the respective assertion may become pointless); see below. While
> things may work okay without (I didn't fully convince myself either way),
> the main point here is that you want to make sure we test in debug builds
> what's actually used in release one. Hence subtleties like this would
> better be dealt with uniformly between release and debug builds.

Right, and I think this is a good argument to not try to accept
unaligned size either, even if it would be possible here.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH 07/12] libxl: Allow stubdomain to control interupts of PCI device

2024-07-25 Thread Marek Marczykowski-Górecki
On Thu, Jul 25, 2024 at 02:06:04PM +, Anthony PERARD wrote:
> On Thu, May 16, 2024 at 03:58:28PM +0200, Marek Marczykowski-Górecki wrote:
> > Especially allow it to control MSI/MSI-X enabling bits. This part only
> > writes a flag to a sysfs, the actual implementation is on the kernel
> > side.
> >
> > This requires Linux >= 5.10 in dom0 (or relevant patch backported).
> 
> Does it not work before 5.10? Because the
> Documentation/ABI/testing/sysfs-driver-pciback in linux tree say that
> allow_interrupt_control is in 5.6.

For MSI-X to work at least with Linux it needs a fixup
2c269f42d0f382743ab230308b836ffe5ae9b2ae, which was backported to
5.10.201, but not further. 

> > diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
> > index 96cb4da0794e..6f357b70b815 100644
> > --- a/tools/libs/light/libxl_pci.c
> > +++ b/tools/libs/light/libxl_pci.c
> > @@ -1513,6 +1513,14 @@ static void pci_add_dm_done(libxl__egc *egc,
> >  rc = ERROR_FAIL;
> >  goto out;
> >  }
> > +} else if (libxl_is_stubdom(ctx, domid, NULL)) {
> > +/* Allow acces to MSI enable flag in PCI config space for the stubdom */
> 
> s/acces/access/
> 
> > +if ( sysfs_write_bdf(gc, SYSFS_PCIBACK_DRIVER"/allow_interrupt_control",
> > + pci) < 0 ) {
> > +LOGD(ERROR, domainid, "Setting allow_interrupt_control for device");
> > +rc = ERROR_FAIL;
> > +goto out;
> 
> Is it possible to make this non-fatal for cases where the kernel is
> older than the introduction of the new setting? Or does PCI passthrough
> not work at all with a stubdom before the change in the kernel?

MSI/MSI-X will not work. And if QEMU wouldn't hide MSI/MSI-X (upstream
one doesn't), Linux won't fall back to INTx, so the device won't work at
all.

> If making this new setting conditional is an option, you could
> potentially improve the error code returned by sysfs_write_bdf() to
> distinguish between an open() failure and write() failure, to avoid
> checking the existence of the path ahead of the call. But maybe that's
> pointless because it doesn't appear possible to distinguish between
> permission denied and not found.
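
For illustration, the open() vs write() distinction suggested in the
quoted text could look roughly like the sketch below. It is a standalone
example using plain POSIX calls, not libxl's sysfs_write_bdf(): the
helper name and the return codes are hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes, chosen for this sketch only. */
enum sketch_sysfs_result {
    SKETCH_OK = 0,
    SKETCH_ERR_OPEN = -1,   /* attribute missing or not accessible */
    SKETCH_ERR_WRITE = -2,  /* attribute opened, but the write failed */
};

/*
 * Write a string to an existing attribute file and report separately
 * whether opening or writing failed. The file is never created.
 */
static int sketch_sysfs_write(const char *path, const char *value)
{
    size_t len;
    ssize_t ret;
    int fd;

    assert(path && value);

    len = strlen(value);

    fd = open(path, O_WRONLY);
    if ( fd < 0 )
        return SKETCH_ERR_OPEN;

    ret = write(fd, value, len);
    close(fd);

    return (ret == (ssize_t)len) ? SKETCH_OK : SKETCH_ERR_WRITE;
}
```

A caller could then treat SKETCH_ERR_OPEN as "setting not available" and
SKETCH_ERR_WRITE as a hard failure, if making the setting conditional
turns out to be acceptable at all.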

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: xen | Failed pipeline for staging-4.19 | 2d7b6170

2024-07-24 Thread Marek Marczykowski-Górecki
On Wed, Jul 24, 2024 at 03:36:44PM +, Anthony PERARD wrote:
> On Wed, Jul 24, 2024 at 03:18:50PM +0200, Jan Beulich wrote:
> > On 24.07.2024 15:15, GitLab wrote:
> > > 
> > > 
> > > Pipeline #1385987377 has failed!
> > > 
> > > Project: xen ( https://gitlab.com/xen-project/hardware/xen )
> > > Branch: staging-4.19 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/commits/staging-4.19 )
> > > 
> > > Commit: 2d7b6170 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/commit/2d7b6170cc69f8a1a60c52d87ba61f6b1f180132
> > >  )
> > > Commit Message: hotplug: Restore block-tap phy compatibility (a...
> > > Commit Author: Jason Andryuk ( https://gitlab.com/jandryuk-amd )
> > > Committed by: Jan Beulich ( https://gitlab.com/jbeulich )
> > > 
> > > 
> > > Pipeline #1385987377 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/pipelines/1385987377 ) 
> > > triggered by Jan Beulich ( https://gitlab.com/jbeulich )
> > > had 3 failed jobs.
> > > 
> > > Job #7415912260 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912260/raw )
> > > 
> > > Stage: test
> > > Name: qemu-alpine-x86_64-gcc
> > 
> > This is the one known to fail more often than not, I think, but ...
> > 
> > > Job #7415912175 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912175/raw )
> > > 
> > > Stage: build
> > > Name: ubuntu-24.04-x86_64-clang
> > > Job #7415912173 ( 
> > > https://gitlab.com/xen-project/hardware/xen/-/jobs/7415912173/raw )
> > > 
> > > Stage: build
> > > Name: ubuntu-22.04-x86_64-gcc
> > 
> > ... for these two I can't spot any failure in the referenced logs. What's going on
> > there?
> 
> There is crucial information missing in that email, the actual reason
> given by gitlab for the failures: "There has been a timeout failure or
> the job got stuck." (That message can be seen when going to the url,
> removing "/raw" part, and scrolling to the top. Or looking at the side
> bar and seeing a duration that's well above 1h)
> 
> Communication between gitlab and the runner might be broken in those
> cases, or the runner stopped working.

This time the runner VM got hit with
https://lore.kernel.org/xen-devel/ZO0WrR5J0xuwDIxW@mail-itl/ . So, I
guess the failure is warranted, just not the one you'd expect...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Assertion failed at arch/x86/genapic/x2apic.c:38 on S3 resume nested in KVM on AMD

2024-07-23 Thread Marek Marczykowski-Górecki
Hi,

I'm observing a crash like the one below when trying to resume from S3.
It happens on Xen nested in KVM (QEMU 9.0, Linux 6.9.3) but only on AMD.
The very same software stack on Intel works just fine. QEMU is running
with "-cpu host,+svm,+invtsc -machine q35,kernel-irqchip=split -device
amd-iommu,intremap=on -smp 2" among others.

(XEN) Preparing system for ACPI S3 state.
(XEN) Disabling non-boot CPUs ...
(XEN) Broke affinity for IRQ1, new: {0-1}
(XEN) Broke affinity for IRQ20, new: {0-1}
(XEN) Broke affinity for IRQ22, new: {0-1}
(XEN) Entering ACPI S3 state.
(XEN) Finishing wakeup from ACPI S3 state.
(XEN) Enabling non-boot CPUs  ...
(XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, this_cpu))' failed at arch/x86/genapic/x2apic.c:38
(XEN) [ Xen-4.20  x86_64  debug=y  Not tainted ]
(XEN) CPU:1
(XEN) RIP:e008:[] x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
(XEN) RFLAGS: 00010096   CONTEXT: hypervisor
(XEN) rax: 830278a25f50   rbx: 0001   rcx: 82d0405e1700
(XEN) rdx: 003233412000   rsi: 8302739da2d8   rdi: 
(XEN) rbp: 00c8   rsp: 8302739d7e78   r8:  0001
(XEN) r9:  8302739d7fa0   r10: 0001   r11: 
(XEN) r12: 0001   r13: 0001   r14: 
(XEN) r15:    cr0: 8005003b   cr4: 007506e0
(XEN) cr3: 7fa7a000   cr2: 
(XEN) fsb:    gsb:    gss: 
(XEN) ds:    es:    fs:    gs:    ss:    cs: e008
(XEN) Xen code around  (x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9):
(XEN)  cf 82 ff ff eb b7 0f 0b <0f> 0b 48 8d 05 9c fc 33 00 48 8b 0d a5 0a 35 00
(XEN) Xen stack trace from rsp=8302739d7e78:
(XEN) 00c8 0001 0001
(XEN) 82d0402f1d83 8302739d7fff 00c8
(XEN)0001 0001 82d04031adb9 0001
(XEN)   82d040276677
(XEN)   
(XEN)88810037c000 0001 0246 deadbeefdeadf00d
(XEN)0001   811d130a
(XEN)deadbeefdeadf00d deadbeefdeadf00d deadbeefdeadf00d 0100
(XEN)811d130a e033 0246 c900400b3ef8
(XEN)e02b beef beef beef
(XEN)beef e011 8302739de000 003233412000
(XEN)007506e0   0002
(XEN)0002
(XEN) Xen call trace:
(XEN)[] R x2apic.c#init_apic_ldr_x2apic_cluster+0x8a/0x1b9
(XEN)[] S setup_local_APIC+0x26/0x449
(XEN)[] S start_secondary+0x1c4/0x37a
(XEN)[] S __high_start+0x87/0xd0
(XEN) 
(XEN) 
(XEN) 
(XEN) Panic on CPU 1:
(XEN) Assertion 'cpumask_test_cpu(this_cpu, per_cpu(cluster_cpus, this_cpu))' failed at arch/x86/genapic/x2apic.c:38
(XEN) 

On a release build, it hits "BUG" later on in the same file.

I've tested the attached patch from Roger, but that assertion didn't
fail (or it crashed before reaching that part).
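
For reference, a standalone sketch of the ID-to-LDR relation that the
ASSERT() in the attached patch checks. The expression is the one from
x2apic_ldr_from_id() in the diff below; the sketch_* names are
hypothetical, and taking the cluster from the upper 16 bits of the LDR
is my reading of the x2APIC logical ID layout, not something shown in
the patch itself.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Same expression as x2apic_ldr_from_id() in the patch: the upper bits
 * of the APIC ID select the cluster (LDR bits 31:16), the low 4 bits
 * select a one-hot position within the cluster (LDR bits 15:0).
 */
static uint32_t sketch_x2apic_ldr_from_id(uint32_t id)
{
    return ((id & ~UINT32_C(0xf)) << 12) | (UINT32_C(1) << (id & 0xf));
}

/* Assumption: the cluster number is the upper half of the LDR. */
static uint32_t sketch_x2apic_cluster(uint32_t ldr)
{
    return ldr >> 16;
}

/* Mirrors the check added by the patch, outside of Xen. */
static void sketch_check_ldr(uint32_t id, uint32_t ldr)
{
    assert(sketch_x2apic_ldr_from_id(id) == ldr);
}
```

With this, APIC ID 0x23 maps to LDR 0x00020008, i.e. cluster 2, bit 3.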

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
diff --git a/xen/arch/x86/genapic/x2apic.c b/xen/arch/x86/genapic/x2apic.c
index 371dd100c742..fe8e664e1b63 100644
--- a/xen/arch/x86/genapic/x2apic.c
+++ b/xen/arch/x86/genapic/x2apic.c
@@ -30,8 +30,11 @@ static inline u32 x2apic_cluster(unsigned int cpu)
 static void cf_check init_apic_ldr_x2apic_cluster(void)
 {
 unsigned int cpu, this_cpu = smp_processor_id();
+uint32_t id = apic_read(APIC_ID);
+uint32_t ldr = apic_read(APIC_LDR);
 
-per_cpu(cpu_2_logical_apicid, this_cpu) = apic_read(APIC_LDR);
+ASSERT(x2apic_ldr_from_id(id) == ldr);
+per_cpu(cpu_2_logical_apicid, this_cpu) = ldr;
 
 if ( per_cpu(cluster_cpus, this_cpu) )
 {
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 9cfc82666ae5..2a010a6363b7 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1064,11 +1064,6 @@ static const struct hvm_mmio_ops vlapic_mmio_ops = {
 .write = vlapic_mmio_write,
 };
 
-static uint32_t x2apic_ldr_from_id(uint32_t id)
-{
-return ((id & ~0xf) << 12) | (1 << (id & 0xf));
-}
-
 static void set_x2apic_id(struct vlapic *vlapic)
 {
 const struct vcpu *v = vlapic_vcpu(vlapic);
diff --git a/xen/arch/x86/include/asm/apic.h b/xen/arch/x86/include/asm/api

Re: [PATCH 2/2] x86/efi: Unlock NX if necessary

2024-07-23 Thread Marek Marczykowski-Górecki
On Tue, Jul 23, 2024 at 12:25:32PM +0200, Marek Marczykowski-Górecki wrote:
> On Mon, Jul 22, 2024 at 11:18:38AM +0100, Andrew Cooper wrote:
> > EFI systems can run with NX disabled, as has been discovered on a Broadwell
> > Supermicro X10SRM-TF system.
> > 
> > Prior to commit fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot
> > path"), the logic to unlock NX was common to all boot paths, but that commit
> > moved it out of the native-EFI boot path.
> > 
> > Have the EFI path attempt to unlock NX, rather than just blindly refusing to
> > boot when CONFIG_REQUIRE_NX is active.
> > 
> > Fixes: fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot path")
> > Link: https://xcp-ng.org/forum/post/80520
> > Reported-by: Gene Bright 
> > Signed-off-by: Andrew Cooper 
> 
> Acked-by: Andrew Cooper 

Ugh, wrong copy paste:
Acked-by: Marek Marczykowski-Górecki 

I should finish my coffee first...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH 2/2] x86/efi: Unlock NX if necessary

2024-07-23 Thread Marek Marczykowski-Górecki
On Mon, Jul 22, 2024 at 11:18:38AM +0100, Andrew Cooper wrote:
> EFI systems can run with NX disabled, as has been discovered on a Broadwell
> Supermicro X10SRM-TF system.
> 
> Prior to commit fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot
> path"), the logic to unlock NX was common to all boot paths, but that commit
> moved it out of the native-EFI boot path.
> 
> Have the EFI path attempt to unlock NX, rather than just blindly refusing to
> boot when CONFIG_REQUIRE_NX is active.
> 
> Fixes: fc3090a47b21 ("x86/boot: Clear XD_DISABLE from the early boot path")
> Link: https://xcp-ng.org/forum/post/80520
> Reported-by: Gene Bright 
> Signed-off-by: Andrew Cooper 

Acked-by: Andrew Cooper 

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
> 
> Note.  Entirely speculative coding, based only on the forum report.
> ---
>  xen/arch/x86/efi/efi-boot.h | 33 ++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index 4e4be7174751..158350aa14e4 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -736,13 +736,33 @@ static void __init efi_arch_handle_module(const struct file *file,
>  efi_bs->FreePool(ptr);
>  }
>  
> +static bool __init intel_unlock_nx(void)
> +{
> +uint64_t val, disable;
> +
> +rdmsrl(MSR_IA32_MISC_ENABLE, val);
> +
> +disable = val & MSR_IA32_MISC_ENABLE_XD_DISABLE;
> +
> +if ( !disable )
> +return false;
> +
> +wrmsrl(MSR_IA32_MISC_ENABLE, val & ~disable);
> +trampoline_misc_enable_off |= disable;
> +
> +return true;
> +}
> +
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax;
> +uint32_t eax, ebx, ecx, edx;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
> +cpuid(0, &eax, &ebx, &ecx, &edx);
> +boot_cpu_data.x86_vendor = x86_cpuid_lookup_vendor(ebx, ecx, edx);
> +
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
>  eax = cpuid_eax(0x8000U);
> @@ -752,10 +772,17 @@ static void __init efi_arch_cpu(void)
>  caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
>  
>  /*
> - * This check purposefully doesn't use cpu_has_nx because
> + * These checks purposefully don't use cpu_has_nx because
>   * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> + * with CONFIG_REQUIRE_NX.
> + *
> + * If NX isn't available, it might be hidden.  Try to reactivate it.
>   */
> +if ( !boot_cpu_has(X86_FEATURE_NX) &&
> + boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
> + intel_unlock_nx() )
> +caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
> +
>  if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
>   !boot_cpu_has(X86_FEATURE_NX) )
>  blexit(L"This build of Xen requires NX support");
> -- 
> 2.39.2
> 
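
The unlock step in intel_unlock_nx() above boils down to a test-and-clear of one bit. Below is a minimal sketch of that logic on a plain value rather than a real MSR. The helper name and the macro are hypothetical; the bit position (34, "XD Bit Disable" in IA32_MISC_ENABLE) is stated from the Intel SDM rather than from this patch. The real function additionally records the bit in trampoline_misc_enable_off.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 34 of IA32_MISC_ENABLE: "XD Bit Disable" (hypothetical macro name). */
#define MISC_ENABLE_XD_DISABLE (UINT64_C(1) << 34)

/*
 * Model of the unlock step, operating on a copy of the MSR value.
 * Returns true and clears the bit in *val if XD_DISABLE was set;
 * returns false and leaves *val untouched otherwise.
 */
static bool unlock_nx_model(uint64_t *val)
{
    uint64_t disable = *val & MISC_ENABLE_XD_DISABLE;

    if ( !disable )
        return false;

    *val &= ~disable;

    return true;
}
```

Only when this returns true is it worth re-reading the extended feature leaf, which is what the patch does.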

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 1/2] x86/efi: Simplify efi_arch_cpu() a little

2024-07-23 Thread Marek Marczykowski-Górecki
On Mon, Jul 22, 2024 at 11:18:37AM +0100, Andrew Cooper wrote:
> Make the "no extended leaves" case fatal and remove one level of indentation.
> Defer the max-leaf acquisition until it is first used.
> 
> No functional change.
> 
> Signed-off-by: Andrew Cooper 

Acked-by: Marek Marczykowski-Górecki 

> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Daniel P. Smith 
> CC: Marek Marczykowski-Górecki 
> CC: Alejandro Vallejo 
> CC: Gene Bright 
> ---
>  xen/arch/x86/efi/efi-boot.h | 31 ---
>  1 file changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index f282358435f1..4e4be7174751 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -738,29 +738,30 @@ static void __init efi_arch_handle_module(const struct file *file,
>  
>  static void __init efi_arch_cpu(void)
>  {
> -uint32_t eax = cpuid_eax(0x8000U);
> +uint32_t eax;
>  uint32_t *caps = boot_cpu_data.x86_capability;
>  
>  boot_tsc_stamp = rdtsc();
>  
>  caps[FEATURESET_1c] = cpuid_ecx(1);
>  
> -if ( (eax >> 16) == 0x8000 && eax > 0x8000U )
> -{
> -caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
> +eax = cpuid_eax(0x8000U);
> +if ( (eax >> 16) != 0x8000 || eax < 0x8000U )
> +blexit(L"In 64bit mode, but no extended CPUID leaves?!?");
>  
> -/*
> - * This check purposefully doesn't use cpu_has_nx because
> - * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> - * with CONFIG_REQUIRE_NX
> - */
> -if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> - !boot_cpu_has(X86_FEATURE_NX) )
> -blexit(L"This build of Xen requires NX support");
> +caps[FEATURESET_e1d] = cpuid_edx(0x8001U);
>  
> -if ( cpu_has_nx )
> -trampoline_efer |= EFER_NXE;
> -}
> +/*
> + * This check purposefully doesn't use cpu_has_nx because
> + * cpu_has_nx bypasses the boot_cpu_data read if Xen was compiled
> + * with CONFIG_REQUIRE_NX
> + */
> +if ( IS_ENABLED(CONFIG_REQUIRE_NX) &&
> +     !boot_cpu_has(X86_FEATURE_NX) )
> +blexit(L"This build of Xen requires NX support");
> +
> +if ( cpu_has_nx )
> +trampoline_efer |= EFER_NXE;
>  }
>  
>  static void __init efi_arch_blexit(void)
> -- 
> 2.39.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH v6 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki
In some cases, only a few registers on a page need to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. The current API allows marking only whole pages read-only,
which sometimes may cover other registers that the guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.
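
The per-page bookkeeping described above can be sketched as follows. This is not the code from the patch, only an illustration under stated assumptions: a 4096-byte page, the 8-byte granularity mentioned above (so 512 bits per page), and hypothetical names (struct subpage_ro, subpage_ro_mark(), subpage_write_allowed()). Rejecting a write that touches any protected chunk is likewise an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u
#define GRAN      8u                    /* bytes covered by one bit */
#define NR_ELEMS  (PAGE_SIZE / GRAN)    /* 512 bits per page */

/* Hypothetical per-page record: one bit per 8-byte chunk. */
struct subpage_ro {
    uint64_t ro_elems[NR_ELEMS / 64];
};

/* Mark the chunks covering byte offsets [offset_s, offset_e] read-only. */
static void subpage_ro_mark(struct subpage_ro *p,
                            unsigned int offset_s, unsigned int offset_e)
{
    assert(offset_s % GRAN == 0 && offset_e < PAGE_SIZE);

    for ( unsigned int i = offset_s; i <= offset_e; i += GRAN )
        p->ro_elems[(i / GRAN) / 64] |= UINT64_C(1) << ((i / GRAN) % 64);
}

/* A write is allowed only if none of the chunks it touches is read-only. */
static bool subpage_write_allowed(const struct subpage_ro *p,
                                  unsigned int offset, unsigned int len)
{
    if ( len == 0 || offset >= PAGE_SIZE || len > PAGE_SIZE - offset )
        return false;

    for ( unsigned int i = offset / GRAN; i <= (offset + len - 1) / GRAN; i++ )
        if ( p->ro_elems[i / 64] & (UINT64_C(1) << (i % 64)) )
            return false;

    return true;
}
```

Marking offsets 0x40-0x7f read-only, for instance, leaves writes to 0x38-0x3f and 0x80-0x87 allowed, while a write that straddles 0x3c-0x43 is refused.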

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what the guest is trying to write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed-through device partially used by Xen.
As of the next patch, it applies only to a configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v6:
- fix return type of subpage_mmio_find_page()
- change 'iter' pointer to 'new_entry' bool and move list_add()
- comment why different error handling for unaligned start / size
- code style
Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_c

[PATCH v6 0/3] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-22 Thread Marek Marczykowski-Górecki
On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (3):
  xen/list: add LIST_HEAD_RO_AFTER_INIT
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 265 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 ++--
 xen/include/xen/list.h  |   3 +-
 7 files changed, 320 insertions(+), 19 deletions(-)

base-commit: a99f25f7ac60544e9af4b3b516d7566ba8841cc4
-- 
git-series 0.9.1



[PATCH v6 1/3] xen/list: add LIST_HEAD_RO_AFTER_INIT

2024-07-22 Thread Marek Marczykowski-Górecki
Similar to LIST_HEAD_READ_MOSTLY.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Jan Beulich 
---
New in v5
---
 xen/include/xen/list.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/xen/list.h b/xen/include/xen/list.h
index 6506ac40893b..62169f46742e 100644
--- a/xen/include/xen/list.h
+++ b/xen/include/xen/list.h
@@ -42,6 +42,9 @@ struct list_head {
 #define LIST_HEAD_READ_MOSTLY(name) \
 struct list_head __read_mostly name = LIST_HEAD_INIT(name)
 
+#define LIST_HEAD_RO_AFTER_INIT(name) \
+struct list_head __ro_after_init name = LIST_HEAD_INIT(name)
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 list->next = list;
-- 
git-series 0.9.1



[PATCH v6 3/3] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-22 Thread Marek Marczykowski-Górecki
Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page does contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.
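
To put a number on "not the whole page": the whole-page approach being replaced rounds the DbC register block out to page boundaries with PFN_DOWN()/PFN_UP(). A minimal standalone sketch of that arithmetic follows. The macros are re-defined here for a 4 KiB page, and the helper names and the example address are hypothetical; the 64-byte size comes from the commit message of patch 2/3.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_SIZE   (UINT64_C(1) << PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)
#define PFN_UP(x)   (((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)

/* First and last page frame covered when protecting whole pages. */
static uint64_t first_pfn(uint64_t start)
{
    return PFN_DOWN(start);
}

static uint64_t last_pfn(uint64_t start, uint64_t size)
{
    return PFN_UP(start + size) - 1;
}

/* Bytes made read-only that are not part of the requested region. */
static uint64_t collateral_bytes(uint64_t start, uint64_t size)
{
    uint64_t pages = last_pfn(start, size) - first_pfn(start) + 1;

    return pages * PAGE_SIZE - size;
}
```

For a 64-byte block at a made-up address of 0xfe008700, whole-page protection covers a single frame (0xfe008) and write-protects 4032 bytes of unrelated register space, which is exactly what the sub-page API avoids.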

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1



Re: [PATCH] CI: workaround broken selinux+docker interaction in yocto

2024-07-22 Thread Marek Marczykowski-Górecki
On Mon, Jul 22, 2024 at 06:16:51PM +0100, Andrew Cooper wrote:
> On 20/07/2024 1:15 am, Marek Marczykowski-Górecki wrote:
> > `cp --preserve=xattr` doesn't work in docker when SELinux is enabled. It
> > tries to set the "security.selinux" xattr, but SELinux (or overlay fs?)
> > denies it.
> > Work around it by skipping security.selinux xattr copying.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Tested here:
> > https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7386198058
> >
> > But since yocto container fails to build, it isn't exactly easy to apply
> > this patch...
> > "kirkstone" branch of meta-virtualization seems to target Xen 4.15 and
> > 4.16, so it isn't exactly surprising it fails to build with 4.19.
> 
> Why is the external version of Xen relevant to rebuilding the container ?

I think it tries to build xen_git.bb, which fetches "master" branch, and
this fails to build with its current state.

> Or is it that kirkstone has updated since the container was last built?
> 
> I'm not familiar with yocto, and a quick glance at the docs hasn't
> helped...
> 
> ~Andrew
> 
> >
> > I tried also bumping yocto version to scarthgap (which supposedly should
> > have updated pygrub patch), but that fails to build for me too, with a
> > different error:
> >
> > ERROR: Layer 'filesystems-layer' depends on layer 'networking-layer', but this layer is not enabled in your configuration
> > ERROR: Parse failure with the specified layer added, exiting.
> > ...
> > ERROR: Nothing PROVIDES 'xen-image-minimal'. Close matches:
> >   core-image-minimal
> >   core-image-minimal-dev
> > Parsing of 2472 .bb files complete (0 cached, 2472 parsed). 4309 targets, 101 skipped, 0 masked, 0 errors.

In the meantime I've solved this issue by reordering layers in
build-yocto.sh (meta-networking before meta-filesystems). But then, ran
out of disk space (40GB wasn't enough) and hasn't retried yet...

> > ---
> >  automation/build/yocto/yocto.dockerfile.in | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/automation/build/yocto/yocto.dockerfile.in b/automation/build/yocto/yocto.dockerfile.in
> > index fbaa4e191caa..600db7bf4d19 100644
> > --- a/automation/build/yocto/yocto.dockerfile.in
> > +++ b/automation/build/yocto/yocto.dockerfile.in
> > @@ -68,6 +68,10 @@ RUN locale-gen en_US.UTF-8 && update-locale LC_ALL=en_US.UTF-8 \
> >  ENV LANG en_US.UTF-8
> >  ENV LC_ALL en_US.UTF-8
> >  
> > +# Workaround `cp --preserve=xattr` not working in docker when SELinux is
> > +# enabled
> > +RUN echo "security.selinux skip" >> /etc/xattr.conf
> > +
> >  # Create a user for the build (we don't want to build as root).
> >  ENV USER_NAME docker-build
> >  ARG host_uid=1000
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki
On Mon, Jul 22, 2024 at 03:01:45PM +0200, Jan Beulich wrote:
> On 22.07.2024 14:36, Marek Marczykowski-Górecki wrote:
> > On Mon, Jul 22, 2024 at 02:09:15PM +0200, Jan Beulich wrote:
> >> On 19.07.2024 04:33, Marek Marczykowski-Górecki wrote:
> >>> +int __init subpage_mmio_ro_add(
> >>> +paddr_t start,
> >>> +size_t size)
> >>> +{
> >>> +mfn_t mfn_start = maddr_to_mfn(start);
> >>> +paddr_t end = start + size - 1;
> >>> +mfn_t mfn_end = maddr_to_mfn(end);
> >>> +unsigned int offset_end = 0;
> >>> +int rc;
> >>> +bool subpage_start, subpage_end;
> >>> +
> >>> +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> >>> +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> >>> +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> >>> +return -EINVAL;
> >>
> >> I think I had asked before: Why is misaligned size something that wants a
> >> release build fallback to the assertion, but not misaligned start?
> > 
> > A misaligned start will lead to protecting a larger area, not a smaller
> > one, so it is not an unsafe thing to do. But I can also make it return an
> > error, it shouldn't happen after all.
> 
> Well, I wouldn't mind if you kept what you have, just with a (brief) comment
> making clear why there is a difference in treatment. After all you could
> treat mis-aligned size similarly, making the protected area larger, too.

Ok.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-22 Thread Marek Marczykowski-Górecki
On Mon, Jul 22, 2024 at 02:09:15PM +0200, Jan Beulich wrote:
> On 19.07.2024 04:33, Marek Marczykowski-Górecki wrote:
> > @@ -4910,6 +4921,254 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
> >  return rc;
> >  }
> >  
> > +static void __iomem *subpage_mmio_find_page(mfn_t mfn)
> > +{
> > +struct subpage_ro_range *entry;
> 
> With the function returning void*, my first reaction was to ask why this
> isn't pointer-to-const. Yet then ...
> 
> > +list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +if ( mfn_eq(entry->mfn, mfn) )
> > +return entry;
> 
> ... you're actually returning entry here, just with its type zapped for
> no apparent reason. I also question the __iomem in the return type.

Right, a leftover from some earlier version.

> > +static int __init subpage_mmio_ro_add_page(
> > +mfn_t mfn,
> > +unsigned int offset_s,
> > +unsigned int offset_e)
> > +{
> > +struct subpage_ro_range *entry = NULL, *iter;
> > +unsigned int i;
> > +
> > +entry = subpage_mmio_find_page(mfn);
> > +if ( !entry )
> > +{
> > +/* iter == NULL marks it was a newly allocated entry */
> > +iter = NULL;
> 
> Yet you don't use "iter" for other purposes anymore. I think the variable
> wants renaming and shrinking to e.g. a simple bool.

+1

> > +entry = xzalloc(struct subpage_ro_range);
> > +if ( !entry )
> > +return -ENOMEM;
> > +entry->mfn = mfn;
> > +}
> > +
> > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > +{
> > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > +entry->ro_elems);
> 
> Nit: Indentation looks to be off by 1 here.
> 
> > +ASSERT(!oldbit);
> > +}
> > +
> > +if ( !iter )
> > +list_add(&entry->list, &subpage_ro_ranges);
> 
> What's wrong with doing this right in the earlier conditional?
> 
> > +int __init subpage_mmio_ro_add(
> > +paddr_t start,
> > +size_t size)
> > +{
> > +mfn_t mfn_start = maddr_to_mfn(start);
> > +paddr_t end = start + size - 1;
> > +mfn_t mfn_end = maddr_to_mfn(end);
> > +unsigned int offset_end = 0;
> > +int rc;
> > +bool subpage_start, subpage_end;
> > +
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +return -EINVAL;
> 
> I think I had asked before: Why is misaligned size something that wants a
> release build fallback to the assertion, but not misaligned start?

A misaligned start will lead to protecting a larger area, not a smaller
one, so it is not an unsafe thing to do. But I can also make it return an
error, it shouldn't happen after all.
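
To illustrate the "larger, not smaller" point, here is a small sketch of rounding a range outwards to the granularity. It is an illustration only, not the code from the patch: the 8-byte granularity matches MMIO_RO_SUBPAGE_GRAN as described in the commit message, but align_down(), align_up() and widen_range() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define GRAN UINT64_C(8)    /* assumed sub-page granularity in bytes */

static uint64_t align_down(uint64_t x)
{
    return x & ~(GRAN - 1);
}

static uint64_t align_up(uint64_t x)
{
    return (x + GRAN - 1) & ~(GRAN - 1);
}

/*
 * Widen [start, start + size) outwards to GRAN boundaries. The result
 * always contains the requested range, so nothing that was meant to be
 * read-only ends up writable.
 */
static void widen_range(uint64_t start, uint64_t size,
                        uint64_t *out_start, uint64_t *out_size)
{
    uint64_t s = align_down(start);
    uint64_t e = align_up(start + size);

    *out_start = s;
    *out_size = e - s;
}
```

A request for 12 bytes at 0x1004 becomes 16 bytes at 0x1000. Rounding in the other direction (truncating the size) would leave the tail unprotected, which is why only that case gets an explicit error in release builds.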

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH] CI: workaround broken selinux+docker interaction in yocto

2024-07-19 Thread Marek Marczykowski-Górecki
`cp --preserve=xattr` doesn't work in docker when SELinux is enabled. It
tries to set the "security.selinux" xattr, but SELinux (or overlay fs?)
denies it.
Work around it by skipping security.selinux xattr copying.

Signed-off-by: Marek Marczykowski-Górecki 
---
Tested here:
https://gitlab.com/xen-project/people/marmarek/xen/-/jobs/7386198058

But since yocto container fails to build, it isn't exactly easy to apply
this patch...
"kirkstone" branch of meta-virtualization seems to target Xen 4.15 and
4.16, so it isn't exactly surprising it fails to build with 4.19.

I tried also bumping yocto version to scarthgap (which supposedly should
have updated pygrub patch), but that fails to build for me too, with a
different error:

ERROR: Layer 'filesystems-layer' depends on layer 'networking-layer', but this layer is not enabled in your configuration
ERROR: Parse failure with the specified layer added, exiting.
...
ERROR: Nothing PROVIDES 'xen-image-minimal'. Close matches:
  core-image-minimal
  core-image-minimal-dev
Parsing of 2472 .bb files complete (0 cached, 2472 parsed). 4309 targets, 101 skipped, 0 masked, 0 errors.
---
 automation/build/yocto/yocto.dockerfile.in | 4 
 1 file changed, 4 insertions(+)

diff --git a/automation/build/yocto/yocto.dockerfile.in b/automation/build/yocto/yocto.dockerfile.in
index fbaa4e191caa..600db7bf4d19 100644
--- a/automation/build/yocto/yocto.dockerfile.in
+++ b/automation/build/yocto/yocto.dockerfile.in
@@ -68,6 +68,10 @@ RUN locale-gen en_US.UTF-8 && update-locale LC_ALL=en_US.UTF-8 \
 ENV LANG en_US.UTF-8
 ENV LC_ALL en_US.UTF-8
 
+# Workaround `cp --preserve=xattr` not working in docker when SELinux is
+# enabled
+RUN echo "security.selinux skip" >> /etc/xattr.conf
+
 # Create a user for the build (we don't want to build as root).
 ENV USER_NAME docker-build
 ARG host_uid=1000
-- 
2.45.2




[PATCH v5 0/3] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-07-18 Thread Marek Marczykowski-Górecki
On older systems, XHCI xcap had a layout that no other (interesting) registers
were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (3):
  xen/list: add LIST_HEAD_RO_AFTER_INIT
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  23 +++-
 xen/arch/x86/mm.c   | 262 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 +++--
 xen/include/xen/list.h  |   3 +-
 7 files changed, 317 insertions(+), 19 deletions(-)

base-commit: a99f25f7ac60544e9af4b3b516d7566ba8841cc4
-- 
git-series 0.9.1



[PATCH v5 2/3] x86/mm: add API for marking only part of a MMIO page read only

2024-07-18 Thread Marek Marczykowski-Górecki
In some cases, only a few registers on a page need to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. The current API allows marking only whole pages read-only,
which sometimes may cover other registers that the guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what the guest is trying to write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed-through device partially used by Xen.
As of the next patch, it applies only to a configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v5:
- use subpage_mmio_find_page helper, simplifying several functions
- use LIST_HEAD_RO_AFTER_INIT
- don't use subpage_ro_lock in __init
- drop #ifdef in mm.h
- return error on unaligned size in subpage_mmio_ro_add() instead of
  extending the size (in release build)
Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_ctxt.mfn for mmcfg case
- comment where fields of mmio_ro_ctxt are used
- use bool for result of __test_and_set_bit
- do not open-code mfn_to_maddr()
- remove leftover RCU
- mention hvm_hap_nested_page_fault() explicitly in the co

[PATCH v5 3/3] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-07-18 Thread Marek Marczykowski-Górecki
Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page does contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1



[PATCH v5 1/3] xen/list: add LIST_HEAD_RO_AFTER_INIT

2024-07-18 Thread Marek Marczykowski-Górecki
Similar to LIST_HEAD_READ_MOSTLY.

Signed-off-by: Marek Marczykowski-Górecki 
---
New in v5
---
 xen/include/xen/list.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/xen/include/xen/list.h b/xen/include/xen/list.h
index 6506ac40893b..62169f46742e 100644
--- a/xen/include/xen/list.h
+++ b/xen/include/xen/list.h
@@ -42,6 +42,9 @@ struct list_head {
 #define LIST_HEAD_READ_MOSTLY(name) \
 struct list_head __read_mostly name = LIST_HEAD_INIT(name)
 
+#define LIST_HEAD_RO_AFTER_INIT(name) \
+struct list_head __ro_after_init name = LIST_HEAD_INIT(name)
+
 static inline void INIT_LIST_HEAD(struct list_head *list)
 {
 list->next = list;
-- 
git-series 0.9.1
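
A minimal usage sketch for the new macro, compiled outside Xen. The __ro_after_init attribute is stubbed out to nothing here (an assumption made only so the snippet builds standalone; in Xen, as I understand it, it places the object in a section that becomes read-only once boot has finished), and demo_ranges, demo_list_add() and demo_list_empty() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

/* Assumption: stubbed out so the example builds outside the Xen tree. */
#define __ro_after_init

#define LIST_HEAD_RO_AFTER_INIT(name) \
    struct list_head __ro_after_init name = LIST_HEAD_INIT(name)

/* A list meant to be modified only during init and read-only afterwards. */
static LIST_HEAD_RO_AFTER_INIT(demo_ranges);

/* Hypothetical helper: insert new_entry right after head. */
static void demo_list_add(struct list_head *new_entry, struct list_head *head)
{
    assert(new_entry != NULL && head != NULL);

    new_entry->next = head->next;
    new_entry->prev = head;
    head->next->prev = new_entry;
    head->next = new_entry;
}

/* Hypothetical helper: an empty list head points back at itself. */
static int demo_list_empty(const struct list_head *head)
{
    return head->next == head;
}
```

The intended pattern is that every demo_list_add() call happens from init code; later code only walks the list.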



Re: [PATCH for-4.19] docs/checklist: Fix XEN_EXTRAVERSION inconsistency for release candidates

2024-07-16 Thread Marek Marczykowski-Górecki
On Tue, Jul 16, 2024 at 03:55:58PM +0200, Anthony PERARD wrote:
> On Tue, Jul 16, 2024 at 10:22:18AM +0200, Juergen Gross wrote:
> > On 16.07.24 09:46, Jan Beulich wrote:
> > > On 16.07.2024 09:33, Julien Grall wrote:
> > > > Hi,
> > > > 
> > > > On 16/07/2024 08:24, Jan Beulich wrote:
> > > > > On 16.07.2024 09:22, Julien Grall wrote:
> > > > > > On 16/07/2024 07:47, Jan Beulich wrote:
> > > > > > > On 15.07.2024 18:56, Julien Grall wrote:
> > > > > > > > On 15/07/2024 16:50, Andrew Cooper wrote:
> > > > > > > > > An earlier part of the checklist states:
> > > > > > > > > 
> > > > > > > > >   * change xen-unstable README. The banner (generated 
> > > > > > > > > using figlet) should say:
> > > > > > > > >   - "Xen 4.5" in releases and on stable branches
> > > > > > > > >   - "Xen 4.5-unstable" on unstable
> > > > > > > > >   - "Xen 4.5-rc" for release candidate
> > > > > > > > > 
> > > > > > > > > Update the notes about XEN_EXTRAVERSION to match.
> > > > > 
> > > > > When this is the purpose of the patch, ...
> > > > > 
> > > > > > > > We have been tagging the tree with 4.5.0-rcX. So I think it 
> > > > > > > > would be
> > > > > > > > better to update the wording so we use a consistent naming.
> > > > > > > 
> > > > > > > I find:
> > > > > > > 
> > > > > > > 4.18-rc
> > > > > > > 4.17-rc
> > > > > > > 4.16-rc
> > > > > > > 4.15-rc
> > > > > > 
> > > > > > Hmmm... I don't think we are looking at the same thing. I was
> > > > > > specifically looking at the tag and *not* XEN_EXTRAVERSION.
> > > > > 
> > > > > ... why would we be looking at tags?
> > > > 
> > > > As I wrote, consistency across the naming scheme we use.
> > > > 
> > > > > The tags (necessarily) have RC numbers,
> > > > 
> > > > Right but they also *have* the .0.
> > > > 
> > > > > so are going to be different from XEN_EXTRAVERSION in any event.
> > > > 
> > > > Sure they are not going to be 100% the same. However, they could have
> > > > some similarity.
> > > > 
> > > > As I pointed out multiple times now, to me it is odd we are tagging the
> > > > tree with 4.19.0-rcX, but we use 4.19-rc.
> > > > 
> > > > Furthermore, if you look at the history of the document. It is quite
> > > > clear that the goal was consistency (the commit mentioned by Andrew
> > > > happened after). Yes it wasn't respected but I can't tell exactly why.
> > > > 
> > > > So as we try to correct the documentation, I think we should also look
> > > > at consistency. If you *really* want to drop the .0, then I think it
> > > > should happen for the tag as well (again for consistency).
> > > 
> > > I don't see why (but I also wouldn't mind the dropping from the tag).
> > > They are going to be different. Whether they're different in one or two
> > > aspects is secondary to me. I rather view the consistency goal to be
> > > with what we've been doing in the last so many releases.
> > 
> > Another aspect to look at would be version sorting. This will be interesting
> > when e.g. having a Xen rpm package installed and upgrading it with a later
> > version. I don't think we want to regard replacing an -rc version with the
> > .0 version to be a downgrade, so the version numbers should be sorted by
> > "sort -V" in the correct order.
> 
> Package versioning in distributions is not something we have to deal with
> upstream. It's up to whoever writes the package version to make sure
> that -rc versions sort older than the actual release.
> 
> While trying to find out whether SPEC files were dealing with the "-rc"
> suffix, I found a Fedora doc explaining how to deal with RCs:
> https://docs.fedoraproject.org/en-US/packaging-guidelines/Versioning/
> They say to replace the dash with a tilde, so "-rc" becomes "~rc", and
> the package manager knows what to do with it.
> 
> Some other distributions know how to deal with the "rc" suffix, but the
> dash "-" isn't actually allowed in the version string:
> https://man.archlinux.org/man/vercmp.8
> 
> So unless we forgo "-rc" in tags, there's no way we can take into
> account how distributions' package managers sort version numbers. Also,
> there's no need to: it is the job of the packager to deal with version
> numbers; we just need to make it simple enough and consistent.

XEN_EXTRAVERSION isn't only about the version used for packaging (where
indeed some changes for -rc will likely be needed anyway, as different
packaging systems have different ways of dealing with it). It's also about
the version reported by Xen in various places, like `xl info xen_version`.
IMO it makes sense to have a consistent format there (always 3 parts
separated by a dot). It makes life easier for any tooling making use of
this value.
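
To show what I mean by tooling, here is a rough sketch of a consumer that expects the 3-part form. The struct and function names are hypothetical and this is not taken from any existing tool; it only illustrates that a fixed "MAJOR.MINOR.PATCH" prefix keeps the parsing trivial, while a 2-part string needs a special case.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for a parsed version string. */
struct xen_version_sketch {
    unsigned int major, minor, patch;
    char extra[16];          /* whatever follows the third number, e.g. "-rc" */
};

/*
 * Parse "MAJOR.MINOR.PATCH[extra]".
 * Returns 0 on success, -1 if the string lacks the three dotted parts.
 */
static int parse_three_part_version(const char *s,
                                    struct xen_version_sketch *v)
{
    int consumed = 0;

    assert(s != NULL && v != NULL);

    if ( sscanf(s, "%u.%u.%u%n",
                &v->major, &v->minor, &v->patch, &consumed) != 3 )
        return -1;

    /* Copy the suffix, truncating safely if it is unexpectedly long. */
    snprintf(v->extra, sizeof(v->extra), "%s", s + consumed);

    return 0;
}
```

A string like "4.19.0-rc" parses directly, while "4.19-rc" is rejected by this parser and would need separate handling.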

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: systemd units are not installed in 4.19.0-rc2 anymore

2024-07-15 Thread Marek Marczykowski-Górecki
On Mon, Jul 15, 2024 at 11:07:42AM +0100, Andrew Cooper wrote:
> On 15/07/2024 9:11 am, Jan Beulich wrote:
> > On 13.07.2024 15:02, Andrew Cooper wrote:
> >> On 13/07/2024 3:45 am, Marek Marczykowski-Górecki wrote:
> >>> Hi,
> >>>
> >>> Something has changed between -rc1 and -rc2 that systemd units are not
> >>> installed anymore by default.
> >>>
> >>> Reproducer: 
> >>>
> >>> ./configure --prefix=/usr
> >>> make dist-tools
> >>> ls dist/install/usr/lib/systemd/system
> >>>
> >>> It does work, if I pass --enable-systemd to ./configure.
> >>>
> >>> My guess is the actual change is earlier, specifically 6ef4fa1e7fe7
> >>> "tools: (Actually) drop libsystemd as a dependency", but configure was
> >>> regenerated only later. But TBH, I don't fully understand interaction
> >>> between those m4 macros...
> >> Between -rc1 and -rc2 was 7cc95f41669d
> >>
> >> That regenerated the existing configure scripts with Autoconf 2.71, vs
> >> 2.69 previously.
> >>
> >> Glancing through again, I can't spot anything that looks relevant.
> >>
> >>
> >> 6ef4fa1e7fe7 for systemd itself was regenerated, and I had to go out of
> >> my way to get autoconf 2.69 to do it.
> > Yet was it correct for that commit to wholesale drop
> > AX_CHECK_SYSTEMD_ENABLE_AVAILABLE? That's, afaics, the only place where
> > $systemd would have been set to y in the absence of --enable-systemd.
> 
> Hmm.
> 
> Yes it was right to drop that, because the whole purpose of the change
> was to break the dependency with systemd.
> 
> Thereafter, looking for systemd in the build environment is a bogus
> heuristic, and certainly one which would go wrong in XenServer's build
> system where the Mock chroot strictly has only the declared dependencies.
> 
> I see two options.
> 
> 1) Activate Systemd by default on Linux now (as it's basically free), or
> 2) Update CHANGELOG to note this behaviour
> 
> Personally I think 2 is the better option, because we don't special case
> the other init systems.

But we do install classic init scripts by default, no? Why not systemd
units then (which are harmless on non-systemd system)?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




systemd units are not installed in 4.19.0-rc2 anymore

2024-07-12 Thread Marek Marczykowski-Górecki
Hi,

Something has changed between -rc1 and -rc2 that systemd units are not
installed anymore by default.

Reproducer: 

./configure --prefix=/usr
make dist-tools
ls dist/install/usr/lib/systemd/system

It does work, if I pass --enable-systemd to ./configure.

My guess is the actual change is earlier, specifically 6ef4fa1e7fe7
"tools: (Actually) drop libsystemd as a dependency", but configure was
regenerated only later. But TBH, I don't fully understand interaction
between those m4 macros...

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 06/17] x86/EFI: address violations of MISRA C:2012 Directive 4.10

2024-07-01 Thread Marek Marczykowski-Górecki
On Mon, Jul 01, 2024 at 03:36:01PM +0200, Alessandro Zucchelli wrote:
> From: Simone Ballarin 
> 
> Add inclusion guard to address violations of
> MISRA C:2012 Directive 4.10 ("Precautions shall be taken in order
> to prevent the contents of a header file being included more than
> once").
> 
> Mechanical change.
> 
> Signed-off-by: Simone Ballarin 
> Signed-off-by: Maria Celeste Cesario 
> Signed-off-by: Nicola Vetrini 
> Signed-off-by: Alessandro Zucchelli 
> 
> ---
> Changes in v4:
> - Modified inclusion guard.
> Changes in v3:
> - remove trailing underscores
> - change inclusion guard name to adhere to the new standard
> Changes in v2:
> - remove changes in "xen/arch/x86/efi/efi-boot.h"
> 
> Note:
> Changes in efi-boot.h have been removed since the file is
> intended to be included by common/efi/boot.c only. This motivation
> is not enough to raise a deviation record, so the violation is
> still present.

I'm confused by this comment. It says changes in efi-boot.h have been
removed, yet the patch does include them.

> ---
>  xen/arch/x86/efi/efi-boot.h | 7 +++
>  xen/arch/x86/efi/runtime.h  | 5 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/xen/arch/x86/efi/efi-boot.h b/xen/arch/x86/efi/efi-boot.h
> index f282358435..c6be744f2b 100644
> --- a/xen/arch/x86/efi/efi-boot.h
> +++ b/xen/arch/x86/efi/efi-boot.h
> @@ -3,6 +3,11 @@
>   * is intended to be included by common/efi/boot.c _only_, and
>   * therefore can define arch specific global variables.
>   */
> +
> +#ifndef X86_EFI_EFI_BOOT_H
> +#define X86_EFI_EFI_BOOT_H
> +
> +
>  #include 
>  #include 
>  #include 
> @@ -912,6 +917,8 @@ void asmlinkage __init efi_multiboot2(EFI_HANDLE ImageHandle,
>  efi_exit_boot(ImageHandle, SystemTable);
>  }
>  
> +#endif /* X86_EFI_EFI_BOOT_H */
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/xen/arch/x86/efi/runtime.h b/xen/arch/x86/efi/runtime.h
> index 77866c5f21..88ab5651e9 100644
> --- a/xen/arch/x86/efi/runtime.h
> +++ b/xen/arch/x86/efi/runtime.h
> @@ -1,3 +1,6 @@
> +#ifndef X86_EFI_RUNTIME_H
> +#define X86_EFI_RUNTIME_H
> +
>  #include 
>  #include 
>  #include 
> @@ -17,3 +20,5 @@ void efi_update_l4_pgtable(unsigned int l4idx, l4_pgentry_t l4e)
>  }
>  }
>  #endif
> +
> +#endif /* X86_EFI_RUNTIME_H */
> -- 
> 2.34.1
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Regression in xen-blkfront regarding sector sizes

2024-06-24 Thread Marek Marczykowski-Górecki
Hi,

Some Qubes users report a regression in xen-blkfront regarding block
size reporting. It works fine on 6.8.8, but appears broken on 6.9.2.

The specific problem is that blkfront reports a block size of 512, even for
backend devices with a block size of 4096. This, for example, makes 512-byte
reads with O_DIRECT fail, and appears to break mounting a filesystem on such
a device (at least an xfs one).

For example it looks like this:

[user@dom0 ~]$ head /sys/block/loop12/queue/*_block_size
==> /sys/block/loop12/queue/logical_block_size <==
4096

==> /sys/block/loop12/queue/physical_block_size <==
4096

[user@dom0 bin]$ qvm-run -p the-vm 'head /sys/block/xvdi/queue/*_block_size'
==> /sys/block/xvdi/queue/logical_block_size <==
512

==> /sys/block/xvdi/queue/physical_block_size <==
512

and then:

$ sudo dd if=/dev/xvdi of=/dev/null count=1 status=progress iflag=direct
/usr/bin/dd: error reading '/dev/xvdi': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000170858 s, 0.0 kB/s

and mounting fails like this:

[   68.055045] SGI XFS with ACLs, security attributes, realtime, scrub, quota, no debug enabled
[   68.057308] I/O error, dev xvdi, sector 0 op 0x0:(READ) flags 0x1000 phys_seg 1 prio class 0
[   68.057333] XFS (xvdi): SB validate failed with error -5.

More details at https://github.com/QubesOS/qubes-issues/issues/9293

Rusty suspects it's related to
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/block/xen-blkfront.c?id=ba3f67c1163812b5d7ec33705c31edaa30ce6c51,
so I'm cc-ing people mentioned there too.
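
To spell out why the dd case above fails: a direct read has to be aligned to the device's real logical block size, so a request sized according to the (wrong) 512 reported by the frontend is rejected by a 4096-byte backend. A tiny sketch of that check, with a hypothetical helper name and no claim about where exactly the kernel performs it:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if a direct I/O request at the given offset
 * and length is aligned to logical_block_size (in bytes).
 */
static bool direct_io_aligned(uint64_t offset, uint64_t len,
                              unsigned int logical_block_size)
{
    assert(logical_block_size != 0);

    return len != 0 &&
           offset % logical_block_size == 0 &&
           len % logical_block_size == 0;
}
```

A 512-byte read at offset 0 looks fine against the reported block size of 512, but is not aligned for the real block size of 4096.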

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH for-4.19 v2] tools/xl: Open xldevd.log with O_CLOEXEC

2024-06-21 Thread Marek Marczykowski-Górecki
On Fri, Jun 21, 2024 at 05:16:56PM +0100, Andrew Cooper wrote:
> `xl devd` has been observed leaking /var/log/xldevd.log into children.
> 
> Note this is specifically safe; dup2() leaves O_CLOEXEC disabled on newfd, so
> after setting up stdout/stderr, it's only the logfile fd which will close on
> exec().
> 
> Link: https://github.com/QubesOS/qubes-issues/issues/8292
> Reported-by: Demi Marie Obenour 
> Signed-off-by: Andrew Cooper 

Reviewed-by: Marek Marczykowski-Górecki 

> ---
> CC: Anthony PERARD 
> CC: Juergen Gross 
> CC: Demi Marie Obenour 
> CC: Marek Marczykowski-Górecki 
> CC: Oleksii Kurochko 
> 
> Also entirely speculative based on the QubesOS ticket.
> 
> v2:
>  * Extend the commit message to explain why stdout/stderr aren't closed by
>this change
> 
> For 4.19.  This bugfix was posted earlier, but fell between the cracks.
> ---
>  tools/xl/xl_utils.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> index 17489d182954..060186db3a59 100644
> --- a/tools/xl/xl_utils.c
> +++ b/tools/xl/xl_utils.c
> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
>  exit(-1);
>  }
>  
> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
>  free(fullname);
>  assert(logfile >= 3);
>  
> -- 
> 2.39.2
> 
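
For anyone wanting to convince themselves of the dup2() remark in the commit message, here is a small standalone sketch. It is not xl code: the helper names are hypothetical, and the path and target descriptor are chosen by the caller.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for O_CLOEXEC under strict C modes */
#endif

#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd has FD_CLOEXEC set, 0 if it does not, -1 on error. */
static int fd_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}

/*
 * Open path with O_CLOEXEC, then duplicate it onto target_fd with dup2().
 * Returns the originally opened descriptor, or -1 on failure.
 */
static int open_cloexec_and_dup(const char *path, int target_fd)
{
    int fd;

    assert(path != NULL && target_fd >= 0);

    fd = open(path, O_WRONLY | O_CLOEXEC);
    if ( fd < 0 )
        return -1;

    if ( dup2(fd, target_fd) < 0 )
    {
        close(fd);
        return -1;
    }

    return fd;
}
```

The descriptor returned by open() carries FD_CLOEXEC, while the copy made by dup2() does not, which is why stdout/stderr survive exec() and only the logfile fd is closed.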

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: Design session notes: GPU acceleration in Xen

2024-06-17 Thread Marek Marczykowski-Górecki
On Mon, Jun 17, 2024 at 09:46:29AM +0200, Roger Pau Monné wrote:
> On Sun, Jun 16, 2024 at 08:38:19PM -0400, Demi Marie Obenour wrote:
> > In both cases, the device physical
> > addresses are identical to dom0’s physical addresses.
> 
> Yes, but a PV dom0 physical address space can be very scattered.
> 
> IIRC there's a hypercall to request physically contiguous memory for
> PV, but you don't want to be using that every time you allocate a
> buffer (not sure it would support the sizes needed by the GPU
> anyway).

Indeed that isn't going to fly. In older Qubes versions we had PV
sys-net with PCI passthrough for a network card. After some uptime it
was basically impossible to restart it and still have enough contiguous
memory for a network driver, and that was with _much_ smaller
buffers, like 2M or 4M. At least not without shutting down a lot more
things to free some more memory.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: ACPI NVS range conflicting with Dom0 page tables (or kernel image)

2024-06-17 Thread Marek Marczykowski-Górecki
On Mon, Jun 17, 2024 at 01:22:37PM +0200, Jan Beulich wrote:
> Hello,
> 
> while it feels like we had a similar situation before, I can't seem to be
> able to find traces thereof, or associated (Linux) commits.

Is it some AMD Threadripper system by any chance? Previous thread on this
issue:
https://lore.kernel.org/xen-devel/CAOCpoWdOH=xgxiqsc1c5ueb1thxajh4wizbczq-qt+d_kak...@mail.gmail.com/

> With
> 
> (XEN)  Dom0 kernel: 64-bit, PAE, lsb, paddr 0x100 -> 0x400
> ...
> (XEN)  Dom0 alloc.:   00044000->00044800 (619175 pages to be allocated)
> ...
> (XEN)  Loaded kernel: 8100->8400
> 
> the kernel occupies the space from 16Mb to 64Mb in the initial allocation.
> Page tables come (almost) directly above:
> 
> (XEN)  Page tables:   84001000->84026000
> 
> I.e. they're just above the 64Mb boundary. Yet sadly in the host E820 map
> there is
> 
> (XEN)  [0400, 04009fff] (ACPI NVS)
> 
> i.e. a non-RAM range starting at 64Mb. The kernel (currently) won't tolerate
> such an overlap (also if it was overlapping the kernel image, e.g. if on the
> machine in question a sufficiently larger kernel was used). Yet with its
> fundamental goal of making its E820 match the host one I'm also in trouble
> thinking of possible solutions / workarounds. I certainly do not see Xen
> trying to cover for this, as the E820 map re-arrangement is purely a kernel
> side decision (forward ported kernels got away without, and what e.g. the
> BSDs do is entirely unknown to me).

In Qubes we have worked around the issue by moving the kernel lower
(CONFIG_PHYSICAL_START=0x20):
https://github.com/QubesOS/qubes-linux-kernel/commit/3e8be4ac1682370977d4d0dc1d782c428d860282

Far from ideal, but gets it bootable...
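
The reasoning behind the workaround is just a range overlap check: with the kernel at 16Mb..64Mb the page tables land right above 64Mb, where the ACPI NVS range starts; loading the kernel lower moves the page tables out of the way. A sketch of that check follows, using half-open ranges. The struct and helper are hypothetical, and any concrete sizes plugged into it are illustrative only, not values taken from the log above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MiB(x) ((uint64_t)(x) << 20)

/* Half-open physical address range [start, end). */
struct phys_range {
    uint64_t start, end;
};

/* True if the two half-open ranges share at least one byte. */
static bool ranges_overlap(struct phys_range a, struct phys_range b)
{
    assert(a.start <= a.end && b.start <= b.end);

    return a.start < b.end && b.start < a.end;
}
```

With a non-RAM range starting at 64Mb, a kernel image ending exactly at 64Mb does not overlap it, but page tables placed just above the image do; shifting the image start down by a few Mb avoids both.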

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki
On Tue, Jun 11, 2024 at 04:07:03PM +0200, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 03:15:42PM +0200, Marek Marczykowski-Górecki wrote:
> > Its location is discovered at startup
> > (device presents a linked-list of capabilities in one of its BARs).
> > The spec talks only about alignment of individual registers, not the
> > whole group...
> 
> Never mind then, I had the expectation we could get away with a single
> page, but doesn't look to be the case.
> 
> I assume the spec doesn't mention anything about the BAR where the
> capabilities reside having a size <= 4KiB.

No, and in fact I see it's a BAR of 64KiB on one of the devices...

> > > Maybe worth adding a comment that the logic here intends to deal only
> > > with the RW bits of a page that's otherwise RO, and that by not
> > > handling the RO regions the intention is that those are dealt just
> > > like fully RO pages.
> > 
> > I can extend the comment, but I assumed it's kinda implied already (if
> > nothing else, by the function name).
> 
> Well, at this point we know the write is not going to make it to host
> memory.  The only reason to not handle the access here is that we want
> to unify the consequences it has for a guest writing to a RO address.

Yup.

> > > I guess there's some message printed when attempting to write to a RO
> > > page that you would also like to print here?
> > 
> > If a HVM domain writes to an R/O area, it is crashed, so you will get a
> > message. This applies to both full page R/O and partial R/O. PV doesn't
> > go through subpage_mmio_write_accept().
> 
> Oh, crashing the domain is more strict than I was expecting.

That's how it was before, I'm not really changing it here. It's less
strict for PV though (it either gets a #PF forwarded back to the guest,
or is ignored).

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki
On Tue, Jun 11, 2024 at 02:55:22PM +0200, Roger Pau Monné wrote:
> On Tue, Jun 11, 2024 at 01:38:35PM +0200, Marek Marczykowski-Górecki wrote:
> > On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> > > On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > > > +if ( !entry )
> > > > +{
> > > > +/* iter == NULL marks it was a newly allocated entry */
> > > > +iter = NULL;
> > > > +entry = xzalloc(struct subpage_ro_range);
> > > > +if ( !entry )
> > > > +return -ENOMEM;
> > > > +entry->mfn = mfn;
> > > > +}
> > > > +
> > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > +{
> > > > +bool oldbit = __test_and_set_bit(i / MMIO_RO_SUBPAGE_GRAN,
> > > > +entry->ro_elems);
> > > > +ASSERT(!oldbit);
> > > > +}
> > > > +
> > > > +if ( !iter )
> > > > +list_add(&entry->list, &subpage_ro_ranges);
> > > > +
> > > > +return iter ? 1 : 0;
> > > > +}
> > > > +
> > > > +/* This needs subpage_ro_lock already taken */
> > > > +static void __init subpage_mmio_ro_remove_page(
> > > > +mfn_t mfn,
> > > > +unsigned int offset_s,
> > > > +unsigned int offset_e)
> > > > +{
> > > > +struct subpage_ro_range *entry = NULL, *iter;
> > > > +unsigned int i;
> > > > +
> > > > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > > > +{
> > > > +if ( mfn_eq(iter->mfn, mfn) )
> > > > +{
> > > > +entry = iter;
> > > > +break;
> > > > +}
> > > > +}
> > > > +if ( !entry )
> > > > +return;
> > > > +
> > > > +for ( i = offset_s; i <= offset_e; i += MMIO_RO_SUBPAGE_GRAN )
> > > > +__clear_bit(i / MMIO_RO_SUBPAGE_GRAN, entry->ro_elems);
> > > > +
> > > > +if ( !bitmap_empty(entry->ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN) )
> > > > +return;
> > > > +
> > > > +list_del(&entry->list);
> > > > +if ( entry->mapped )
> > > > +iounmap(entry->mapped);
> > > > +xfree(entry);
> > > > +}
> > > > +
> > > > +int __init subpage_mmio_ro_add(
> > > > +paddr_t start,
> > > > +size_t size)
> > > > +{
> > > > +mfn_t mfn_start = maddr_to_mfn(start);
> > > > +paddr_t end = start + size - 1;
> > > > +mfn_t mfn_end = maddr_to_mfn(end);
> > > > +unsigned int offset_end = 0;
> > > > +int rc;
> > > > +bool subpage_start, subpage_end;
> > > > +
> > > > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > > > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > > > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > > > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> > > > +
> > > > +if ( !size )
> > > > +return 0;
> > > > +
> > > > +if ( mfn_eq(mfn_start, mfn_end) )
> > > > +{
> > > > +/* Both starting and ending parts handled at once */
> > > > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > > > +subpage_end = false;
> > > 
> > > Given the intended usage of this, don't we want to limit to only a
> > > single page?  So that PFN_DOWN(start + size) == PFN_DOWN(start), as
> > > that would simplify the logic here?
> > 
> > I have considered that, but I haven't found anything in the spec
> > mandating the XHCI DbC registers to not cross page boundary. Currently
> > (on a system I test this on) they don't cross page boundary, but I don't
> > want to assume extra constraints - to avoid issues like before (when
> > on the older system I tested, the DbC registers didn't share a page with
> > other registers, but then they shared the page on newer hardware).
> 
> Oh, from our conversation at XenSummit I got the impression debug 

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki
On Tue, Jun 11, 2024 at 12:40:49PM +0200, Roger Pau Monné wrote:
> On Wed, May 22, 2024 at 05:39:03PM +0200, Marek Marczykowski-Górecki wrote:
> > In some cases, only a few registers on a page need to be write-protected.
> > Examples include USB3 console (64 bytes worth of registers) or MSI-X's
> > PBA table (which doesn't need to span the whole table either), although
> > in the latter case the spec forbids placing other registers on the same
> > page. Current API allows only marking whole pages read-only,
> > which sometimes may cover other registers that guest may need to
> > write into.
> > 
> > Currently, when a guest tries to write to an MMIO page on the
> > mmio_ro_ranges, it's either immediately crashed on EPT violation - if
> > that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
> > from userspace (like, /dev/mem), it will try to fixup by updating page
> > tables (that Xen again will force to read-only) and will hit that #PF
> > again (looping endlessly). Both behaviors are undesirable if guest could
> > actually be allowed the write.
> > 
> > Introduce an API that allows marking part of a page read-only. Since
> > sub-page permissions are not a thing in page tables (they are in EPT,
> > but not granular enough), do this via emulation (or simply page fault
> > handler for PV) that handles writes that are supposed to be allowed.
> > The new subpage_mmio_ro_add() takes a start physical address and the
> > region size in bytes. Both start address and the size need to be 8-byte
> > aligned, as a practical simplification (allows using smaller bitmask,
> > and a smaller granularity isn't really necessary right now).
> > It will internally add relevant pages to mmio_ro_ranges, but if either
> > start or end address is not page-aligned, it additionally adds that page
> > to a list for sub-page R/O handling. The list holds a bitmask which
> > qwords are supposed to be read-only and an address where page is mapped
> > for write emulation - this mapping is done only on the first access. A
> > plain list is used instead of more efficient structure, because there
> > isn't supposed to be many pages needing this precise r/o control.
> > 
> > The mechanism this API is plugged in is slightly different for PV and
> > HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
> > it's already called for #PF on read-only MMIO page. For HVM however, EPT
> > violation on p2m_mmio_direct page results in a direct domain_crash() for
> > non hardware domains.  To reach mmio_ro_emulated_write(), change how
> > write violations for p2m_mmio_direct are handled - specifically, check
> > if they relate to such partially protected page via
> > subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
> > them too. This decodes what guest is trying to write and finally calls
> > mmio_ro_emulated_write(). The EPT write violation is detected as
> > npfec.write_access and npfec.present both being true (similar to other
> > places), which may cover some other (future?) cases - if that happens,
> > emulator might get involved unnecessarily, but since it's limited to
> > pages marked with subpage_mmio_ro_add() only, the impact is minimal.
> > Both of those paths need an MFN to which guest tried to write (to check
> > which part of the page is supposed to be read-only, and where
> > the page is mapped for writes). This information currently isn't
> > available directly in mmio_ro_emulated_write(), but in both cases it is
> > already resolved somewhere higher in the call tree. Pass it down to
> > mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
> > 
> > This may give a bit more access to the instruction emulator to HVM
> > guests (the change in hvm_hap_nested_page_fault()), but only for pages
> > explicitly marked with subpage_mmio_ro_add() - so, only if the guest has
> > been passed through a device partially used by Xen.
> > As of the next patch, this applies only to a configuration explicitly
> > documented as not security supported.
> > 
> > The subpage_mmio_ro_add() function cannot be called with overlapping
> > ranges, nor on pages already added to mmio_ro_ranges separately.
> > Successful calls would result in correct handling, but error paths may
> > result in incorrect state (like pages removed from mmio_ro_ranges too
> > early). Debug build has asserts for relevant cases.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > Shadow mode is not tested, but I don't expect it to work differently

Re: [PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-06-11 Thread Marek Marczykowski-Górecki
On Fri, Jun 07, 2024 at 09:01:25AM +0200, Jan Beulich wrote:
> On 22.05.2024 17:39, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,34 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to MMIO_RO_SUBPAGE_GRAN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Since there is currently no subpage_mmio_ro_remove(), relevant device should
> > + * not be hot-unplugged.
> 
> Yet there are no guarantees whatsoever. I think we should refuse
> hot-unplug attempts (not just here, but also e.g. for an EHCI
> controller that we use the debug feature of), but doing so would
> likely require coordination with Dom0. Nothing to be done right
> here, of course.
> 
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success
> > + */
> > +#define MMIO_RO_SUBPAGE_GRAN 8
> > +int subpage_mmio_ro_add(paddr_t start, size_t size);
> > +#ifdef CONFIG_HVM
> > +bool subpage_mmio_write_accept(mfn_t mfn, unsigned long gla);
> > +#endif
> 
> I'd suggest to omit the #ifdef here. The declaration alone doesn't
> hurt, and the #ifdef harms readability (if only a bit).

Ok.


> > --- a/xen/arch/x86/mm.c
> > +++ b/xen/arch/x86/mm.c
> > @@ -150,6 +150,17 @@ bool __read_mostly machine_to_phys_mapping_valid;
> >  
> >  struct rangeset *__read_mostly mmio_ro_ranges;
> >  
> > +/* Handling sub-page read-only MMIO regions */
> > +struct subpage_ro_range {
> > +struct list_head list;
> > +mfn_t mfn;
> > +void __iomem *mapped;
> > +DECLARE_BITMAP(ro_elems, PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN);
> > +};
> > +
> > +static LIST_HEAD(subpage_ro_ranges);
> 
> With all modifications happening from __init code, this likely wants
> to be LIST_HEAD_RO_AFTER_INIT() (which would need introducing, to
> parallel LIST_HEAD_READ_MOSTLY()).

Makes sense. And then I would be comfortable with dropping the spinlock
as Roger suggested.
I tried to make this API a bit more generic than I currently need, but
indeed it can be simplified for this particular use case.

> > +int __init subpage_mmio_ro_add(
> > +paddr_t start,
> > +size_t size)
> > +{
> > +mfn_t mfn_start = maddr_to_mfn(start);
> > +paddr_t end = start + size - 1;
> > +mfn_t mfn_end = maddr_to_mfn(end);
> > +unsigned int offset_end = 0;
> > +int rc;
> > +bool subpage_start, subpage_end;
> > +
> > +ASSERT(IS_ALIGNED(start, MMIO_RO_SUBPAGE_GRAN));
> > +ASSERT(IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN));
> > +if ( !IS_ALIGNED(size, MMIO_RO_SUBPAGE_GRAN) )
> > +size = ROUNDUP(size, MMIO_RO_SUBPAGE_GRAN);
> 
> I'm puzzled: You first check suitable alignment and then adjust size
> to have suitable granularity. Either it is a mistake to call the
> function with a bad size, or it is not. If it's a mistake, the
> release build alternative to the assertion would be to return an
> error. If it's not a mistake, the assertion ought to go away.
> 
> If the assertion is to stay, then I'll further question why the
> other one doesn't also have release build safety fallback logic.

For some reason I read your earlier comment as a request to (try to)
continue safely in this case. But indeed an error is a better option, it
isn't supposed to happen anyway.
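
As an illustration of the error-return option discussed here, below is a
minimal sketch of rejecting bad arguments up front instead of asserting and
then rounding. It is not the code from the patch: check_subpage_range() is a
hypothetical helper, and the wrap-around check is an extra assumption of this
sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Granularity as named in the patch; everything else here is hypothetical. */
#define MMIO_RO_SUBPAGE_GRAN 8u

/* Return 0 if the range is acceptable, -EINVAL otherwise. */
static int check_subpage_range(uint64_t start, size_t size)
{
    /* Reject misaligned input rather than silently rounding it up. */
    if ( (start % MMIO_RO_SUBPAGE_GRAN) || (size % MMIO_RO_SUBPAGE_GRAN) )
        return -EINVAL;

    /* Reject ranges whose last byte would wrap around the address space. */
    if ( size && start + (size - 1) < start )
        return -EINVAL;

    return 0;
}
```

With a check like this, debug and release builds behave the same way, and
"end" can safely be computed from the unmodified "size".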

> > +if ( !size )
> > +return 0;
> > +
> > +if ( mfn_eq(mfn_start, mfn_end) )
> > +{
> > +/* Both starting and ending parts handled at once */
> > +subpage_start = PAGE_OFFSET(start) || PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +subpage_end = false;
> > +}
> > +else
> > +{
> > +subpage_start = PAGE_OFFSET(start);
> > +subpage_end = PAGE_OFFSET(end) != PAGE_SIZE - 1;
> > +}
> 
> Since you calculate "end" before adjusting "size", the logic here
> depends on there being the assertion further up.
> 
> Jan

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH] MAINTAINERS: alter EFI section

2024-06-10 Thread Marek Marczykowski
On Mon, Jun 10, 2024 at 08:38:45AM +0200, Jan Beulich wrote:
> To get past the recurring friction on the approach to take wrt
> workarounds needed for various firmware flaws, I'm stepping down as the
> maintainer of our code interfacing with EFI firmware. Two new
> maintainers are being introduced in my place.
> 
> Signed-off-by: Jan Beulich 

I'm not sure what is the proper tag for cases like this, but:
Acked-by: Marek Marczykowski 

> ---
> For the new maintainers, here's a 1st patch to consider right away:
> https://lists.xen.org/archives/html/xen-devel/2024-03/msg00931.html.
> 
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -308,7 +308,9 @@ F:automation/eclair_analysis/
>  F:   automation/scripts/eclair
>  
>  EFI
> -M:   Jan Beulich 
> +M:   Daniel P. Smith 
> +M:   Marek Marczykowski-Górecki 
> +R:   Jan Beulich 
>  S:   Supported
>  F:   xen/arch/x86/efi/
>  F:   xen/arch/x86/include/asm/efi*.h

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH for-4.19 v1] automation: add a test for HVM domU on PVH dom0

2024-06-10 Thread Marek Marczykowski-Górecki
On Mon, Jun 10, 2024 at 04:25:01PM +0100, Andrew Cooper wrote:
> On 10/06/2024 2:32 pm, Marek Marczykowski-Górecki wrote:
> > This tests if QEMU works in PVH dom0. QEMU in dom0 requires enabling TUN
> > in the kernel, so do that too.
> >
> > Add it to both x86 runners, similar to the PVH domU test.
> >
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> Acked-by: Andrew Cooper 
> 
> CC Oleksii.
> 
> > ---
> > Requires rebuilding test-artifacts/kernel/6.1.19
> 
> Ok.
> 
> But on a tangent, shouldn't that move forwards somewhat?

There is already "[PATCH 08/12] automation: update kernel for x86 tests"
in the stubdom test series. And as noted in the cover letter there, most
patches can be applied independently, and also they got R-by/A-by from
Stefano already.

> > I'm actually not sure if there is a sense in testing HVM domU on both
> > runners, when PVH domU variant is already tested on both. Are there any
> > differences between Intel and AMD relevant for QEMU in dom0?
> 
> It's not just Qemu, it's also HVMLoader, and the particulars of VT-x/SVM
> VMExit decode information in order to generate ioreqs.

For just HVM, we have PCI passthrough tests on both - they run HVM (but
on PV dom0). My question was more about PVH-dom0 specific parts.

> I'd firmly suggest having both.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH v1] automation: add a test for HVM domU on PVH dom0

2024-06-10 Thread Marek Marczykowski-Górecki
This tests if QEMU works in PVH dom0. QEMU in dom0 requires enabling TUN
in the kernel, so do that too.

Add it to both x86 runners, similar to the PVH domU test.

Signed-off-by: Marek Marczykowski-Górecki 
---
Requires rebuilding test-artifacts/kernel/6.1.19

I'm actually not sure if there is a sense in testing HVM domU on both
runners, when PVH domU variant is already tested on both. Are there any
differences between Intel and AMD relevant for QEMU in dom0?
---
 automation/gitlab-ci/test.yaml| 16 
 automation/scripts/qubes-x86-64.sh| 19 ---
 .../tests-artifacts/kernel/6.1.19.dockerfile  |  1 +
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 902139e14893..898d2adc8c5b 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -175,6 +175,14 @@ adl-smoke-x86-64-dom0pvh-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.18-gcc-debug
 
+adl-smoke-x86-64-dom0pvh-hvm-gcc-debug:
+  extends: .adl-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh dom0pvh-hvm 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.18-gcc-debug
+
 adl-suspend-x86-64-gcc-debug:
   extends: .adl-x86-64
   script:
@@ -215,6 +223,14 @@ zen3p-smoke-x86-64-dom0pvh-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.18-gcc-debug
 
+zen3p-smoke-x86-64-dom0pvh-hvm-gcc-debug:
+  extends: .zen3p-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh dom0pvh-hvm 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.18-gcc-debug
+
 zen3p-pci-hvm-x86-64-gcc-debug:
   extends: .zen3p-x86-64
   script:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 087ab2dc171c..816c5dd9aa77 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -19,8 +19,8 @@ vif = [ "bridge=xenbr0", ]
 disk = [ ]
 '
 
-### test: smoke test & smoke test PVH
-if [ -z "${test_variant}" ] || [ "${test_variant}" = "dom0pvh" ]; then
+### test: smoke test & smoke test PVH & smoke test HVM
+if [ -z "${test_variant}" ] || [ "${test_variant}" = "dom0pvh" ] || [ "${test_variant}" = "dom0pvh-hvm" ]; then
 passed="ping test passed"
 domU_check="
 ifconfig eth0 192.168.0.2
@@ -37,10 +37,23 @@ done
 set -x
 echo \"${passed}\"
 "
-if [ "${test_variant}" = "dom0pvh" ]; then
+if [ "${test_variant}" = "dom0pvh" ] || [ "${test_variant}" = "dom0pvh-hvm" ]; then
 extra_xen_opts="dom0=pvh"
 fi
 
+if [ "${test_variant}" = "dom0pvh-hvm" ]; then
+domU_config='
+type = "hvm"
+name = "domU"
+kernel = "/boot/vmlinuz"
+ramdisk = "/boot/initrd-domU"
+extra = "root=/dev/ram0 console=hvc0"
+memory = 512
+vif = [ "bridge=xenbr0", ]
+disk = [ ]
+'
+fi
+
 ### test: S3
 elif [ "${test_variant}" = "s3" ]; then
 passed="suspend test passed"
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..021bde26c790 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,7 @@ RUN curl -fsSLO 
https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable TUN && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
2.44.0




Re: Segment truncation in multi-segment PCI handling?

2024-06-10 Thread Marek Marczykowski-Górecki
On Mon, Jun 10, 2024 at 12:11:58PM +0200, Jan Beulich wrote:
> On 10.06.2024 11:46, Roger Pau Monné wrote:
> > On Mon, Jun 10, 2024 at 10:41:19AM +0200, Jan Beulich wrote:
> >> On 10.06.2024 10:28, Roger Pau Monné wrote:
> >>> On Mon, Jun 10, 2024 at 09:58:11AM +0200, Jan Beulich wrote:
> >>>> On 07.06.2024 21:52, Andrew Cooper wrote:
> >>>>> On 07/06/2024 8:46 pm, Marek Marczykowski-Górecki wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> I've got a new system, and it has two PCI segments:
> >>>>>>
> >>>>>> 0000:00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
> >>>>>> 0000:00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P [Intel Graphics] (rev 08)
> >>>>>> ...
> >>>>>> 10000:e0:06.0 System peripheral: Intel Corporation RST VMD Managed Controller
> >>>>>> 10000:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
> >>>>>> 10000:e1:00.0 Non-Volatile memory controller: Phison Electronics Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)
> >>>>>>
> >>>>>> But looks like Xen doesn't handle it correctly:
> >>>
> >>> In the meantime you can probably disable VMD from the firmware and the
> >>> NVMe devices should appear on the regular PCI bus.
> >>>
> >>>>>> (XEN) 0000:e0:06.0: unknown type 0
> >>>>>> (XEN) 0000:e0:06.2: unknown type 0
> >>>>>> (XEN) 0000:e1:00.0: unknown type 0
> >>>>>> ...
> >>>>>> (XEN) ==== PCI devices ====
> >>>>>> (XEN) ==== segment 0000 ====
> >>>>>> (XEN) 0000:e1:00.0 - NULL - node -1
> >>>>>> (XEN) 0000:e0:06.2 - NULL - node -1
> >>>>>> (XEN) 0000:e0:06.0 - NULL - node -1
> >>>>>> (XEN) 0000:2b:00.0 - d0 - node -1  - MSIs < 161 >
> >>>>>> (XEN) 0000:00:1f.6 - d0 - node -1  - MSIs < 148 >
> >>>>>> ...
> >>>>>>
> >>>>>> This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
> >>>>>> 0x10000 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
> >>>>>> Segment Groups, each with 256 bus segments.
> >>>>>>
> >>>>>> Fortunately, I don't need this to work, if I disable VMD in the
> >>>>>> firmware, I get a single segment and everything works fine.
> >>>>>>
> >>>>>
> >>>>> This is a known issue.  Work is being done, albeit slowly.
> >>>>
> >>>> Is work being done? After the design session in Prague I put it on my
> >>>> todo list, but at low priority. I'd be happy to take it off there if I
> >>>> knew someone else is looking into this.
> >>>
> >>> We had a design session about VMD?  If so I'm afraid I've missed it.
> >>
> >> In Prague last year, not just now in Lisbon.
> >>
> >>>>> 0x10000 is indeed not a spec-compliant PCI segment.  It's something
> >>>>> model specific the Linux VMD driver is doing.
> >>>>
> >>>> I wouldn't call this "model specific" - this numbering is purely a
> >>>> software one (and would need coordinating between Dom0 and Xen).
> >>>
> >>> Hm, TBH I'm not sure whether Xen needs to be aware of VMD devices.
> >>> The resources used by the VMD devices are all assigned to the VMD
> >>> root.  My current hypothesis is that it might be possible to manage
> >>> such devices without Xen being aware of their existence.
> >>
> >> Well, it may be possible to have things work in Dom0 without Xen
> >> knowing much. Then Dom0 would need to suppress any physdevop calls
> >> with such software-only segment numbers (in order to at least not
> >> confuse Xen). I'd be curious though how e.g. MSI setup would work in
> >> such a scenario.
> > 
> > IIRC from my read of the spec,
> 
> So you have found a spec somewhere? I didn't so far, and I had even asked
> Intel ...
> 
> > VMD devices don't use regular MSI
> > data/address fields, and 

Segment truncation in multi-segment PCI handling?

2024-06-07 Thread Marek Marczykowski-Górecki
Hi,

I've got a new system, and it has two PCI segments:

0000:00:00.0 Host bridge: Intel Corporation Device 7d14 (rev 04)
0000:00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P [Intel Graphics] (rev 08)
...
10000:e0:06.0 System peripheral: Intel Corporation RST VMD Managed Controller
10000:e0:06.2 PCI bridge: Intel Corporation Device 7ecb (rev 10)
10000:e1:00.0 Non-Volatile memory controller: Phison Electronics Corporation PS5021-E21 PCIe4 NVMe Controller (DRAM-less) (rev 01)

But looks like Xen doesn't handle it correctly:

(XEN) 0000:e0:06.0: unknown type 0
(XEN) 0000:e0:06.2: unknown type 0
(XEN) 0000:e1:00.0: unknown type 0
...
(XEN) ==== PCI devices ====
(XEN) ==== segment 0000 ====
(XEN) 0000:e1:00.0 - NULL - node -1
(XEN) 0000:e0:06.2 - NULL - node -1
(XEN) 0000:e0:06.0 - NULL - node -1
(XEN) 0000:2b:00.0 - d0 - node -1  - MSIs < 161 >
(XEN) 0000:00:1f.6 - d0 - node -1  - MSIs < 148 >
...

This isn't exactly surprising, since pci_sbdf_t.seg is uint16_t, so
0x10000 doesn't fit. OSDev wiki says PCI Express can have 65536 PCI
Segment Groups, each with 256 bus segments.

Fortunately, I don't need this to work, if I disable VMD in the
firmware, I get a single segment and everything works fine.
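
To make the truncation concrete, here is a small standalone sketch. It assumes
the second segment is numbered 0x10000, i.e. a software-assigned value above
what 16 bits can hold. struct sbdf16 and both helpers are hypothetical
stand-ins for illustration, not Xen's actual pci_sbdf_t definition.

```c
#include <stdint.h>

/* Hypothetical stand-in for an SBDF with a 16-bit segment field. */
struct sbdf16 {
    uint16_t seg;
    uint8_t bus;
    uint8_t devfn;
};

/* Store a wider, software-assigned segment number into the 16-bit field. */
static struct sbdf16 make_sbdf(uint32_t seg, uint8_t bus, uint8_t dev,
                               uint8_t fn)
{
    struct sbdf16 s = {
        .seg = (uint16_t)seg,                 /* bits above 15 are dropped */
        .bus = bus,
        .devfn = (uint8_t)((dev << 3) | (fn & 7)),
    };

    return s;
}

/* Nonzero if the segment number cannot be stored without truncation. */
static int seg_truncated(uint32_t seg)
{
    return seg > UINT16_MAX;
}
```

Under that assumption, a device at 10000:e0:06.0 ends up recorded with
segment 0, which would match the devices showing up under the first segment
in the log above.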

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: NULL pointer dereference in xenbus_thread->...

2024-05-31 Thread Marek Marczykowski-Górecki
On Tue, Mar 26, 2024 at 11:00:50AM +0000, Julien Grall wrote:
> Hi Marek,
> 
> +Juergen for visibility
> 
> When sending a bug report, I would suggest to CC relevant people as
> otherwise it can get lost (not many people monitor xen-devel if they are not
> CCed).
> 
> Cheers,
> 
> On 25/03/2024 16:17, Marek Marczykowski-Górecki wrote:
> > On Sun, Oct 22, 2023 at 04:14:30PM +0200, Marek Marczykowski-Górecki wrote:
> > > On Mon, Aug 28, 2023 at 11:50:36PM +0200, Marek Marczykowski-Górecki wrote:
> > > > Hi,
> > > > 
> > > > I've noticed in Qubes's CI failure like this:
> > > > 
> > > > [  871.271292] BUG: kernel NULL pointer dereference, address: 
> > > > 
> > > > [  871.275290] #PF: supervisor read access in kernel mode
> > > > [  871.277282] #PF: error_code(0x) - not-present page
> > > > [  871.279182] PGD 106fdb067 P4D 106fdb067 PUD 106fdc067 PMD 0
> > > > [  871.281071] Oops:  [#1] PREEMPT SMP NOPTI
> > > > [  871.282698] CPU: 1 PID: 28 Comm: xenbus Not tainted 
> > > > 6.1.43-1.qubes.fc37.x86_64 #1
> > > > [  871.285222] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> > > > BIOS rel-1.16.0-0-gd239552-rebuilt.opensuse.org 04/01/2014
> > > > [  871.23] RIP: e030:__wake_up_common+0x4c/0x180
> > > > [  871.292838] Code: 24 0c 89 4c 24 08 4d 85 c9 74 0a 41 f6 01 04 0f 85 
> > > > a3 00 00 00 48 8b 43 08 4c 8d 40 e8 48 83 c3 08 49 8d 40 18 48 39 c3 74 
> > > > 5b <49> 8b 40 18 31 ed 4c 8d 70 e8 45 8b 28 41 f6 c5 04 75 5f 49 8b 40
> > > > [  871.299776] RSP: e02b:c900400f7e10 EFLAGS: 00010082
> > > > [  871.301656] RAX:  RBX: 88810541ce98 RCX: 
> > > > 
> > > > [  871.304255] RDX: 0001 RSI: 0003 RDI: 
> > > > 88810541ce90
> > > > [  871.306714] RBP: c900400f0280 R08: ffe8 R09: 
> > > > c900400f7e68
> > > > [  871.309937] R10: 7ff0 R11: 888100ad3000 R12: 
> > > > c900400f7e68
> > > > [  871.312326] R13:  R14:  R15: 
> > > > 
> > > > [  871.314647] FS:  () GS:88813ff0() 
> > > > knlGS:
> > > > [  871.317677] CS:  1e030 DS:  ES:  CR0: 80050033
> > > > [  871.319644] CR2:  CR3: 0001067fe000 CR4: 
> > > > 00040660
> > > > [  871.321973] Call Trace:
> > > > [  871.322782]  
> > > > [  871.323494]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.324901]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.326310]  ? show_trace_log_lvl+0x1d3/0x2ef
> > > > [  871.327721]  ? __wake_up_common_lock+0x82/0xd0
> > > > [  871.329147]  ? __die_body.cold+0x8/0xd
> > > > [  871.330378]  ? page_fault_oops+0x163/0x1a0
> > > > [  871.331691]  ? exc_page_fault+0x70/0x170
> > > > [  871.332946]  ? asm_exc_page_fault+0x22/0x30
> > > > [  871.334454]  ? __wake_up_common+0x4c/0x180
> > > > [  871.335777]  __wake_up_common_lock+0x82/0xd0
> > > > [  871.337183]  ? process_writes+0x240/0x240
> > > > [  871.338461]  process_msg+0x18e/0x2f0
> > > > [  871.339627]  xenbus_thread+0x165/0x1c0
> > > > [  871.340830]  ? cpuusage_read+0x10/0x10
> > > > [  871.342032]  kthread+0xe9/0x110
> > > > [  871.343317]  ? kthread_complete_and_exit+0x20/0x20
> > > > [  871.345020]  ret_from_fork+0x22/0x30
> > > > [  871.346239]  
> > > > [  871.347060] Modules linked in: snd_hda_codec_generic ledtrig_audio 
> > > > snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec 
> > > > snd_hda_core snd_hwdep snd_seq snd_seq_device joydev snd_pcm 
> > > > intel_rapl_msr ppdev intel_rapl_common snd_timer pcspkr e1000e snd 
> > > > soundcore i2c_piix4 parport_pc parport loop fuse xenfs dm_crypt 
> > > > crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni 
> > > > polyval_generic floppy ghash_clmulni_intel sha512_ssse3 serio_raw 
> > > > virtio_scsi virtio_console bochs xhci_pci xhci_pci_renesas xhci_hcd 
> > > > qemu_fw_cfg drm_vram_helper drm_ttm_helper ttm ata_generic pata_acpi 
> > > > xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn 
> > > > scsi_dh_rdac scsi_dh_emc scsi_dh_alua uinpu

Re: [PATCH 1/3] CI: Remove CI_COMMIT_REF_PROTECTED requirement for HW jobs

2024-05-31 Thread Marek Marczykowski-Górecki
On Thu, May 30, 2024 at 05:43:12PM -0700, Stefano Stabellini wrote:
> On Thu, 30 May 2024, Marek Marczykowski-Górecki wrote:
> > On Wed, May 29, 2024 at 03:19:43PM +0100, Andrew Cooper wrote:
> > > This restriction doesn't provide any security because anyone with suitable
> > > permissions on the HW runners can bypass it with this local patch.
> > > 
> > > Requiring branches to be protected hampers usability of transient testing
> > > branches (specifically, can't delete branches except via the Gitlab UI).
> > >
> > > Drop the requirement.
> > > 
> > > Fixes: 746774cd1786 ("automation: introduce a dom0less test run on Xilinx hardware")
> > > Fixes: 0ab316e7e15f ("automation: add a smoke and suspend test on an Alder Lake system")
> > > Signed-off-by: Andrew Cooper 
> > 
> > Runners used to be set to run only on protected branches. I think it
> > isn't the case anymore from what I see, but it needs checking (I don't
> > see specific settings in all the projects). If it were still the case,
> > removing variable check would result in jobs forever pending.
> 
> Andrew, thank you so much for pointing this out.
> 
> I think the idea was that we can specify the individual users with
> access to protected branches. We cannot add restrictions for unprotected
> branches. So if we set the gitlab runner to only run protected jobs,
> then the $CI_COMMIT_REF_PROTECTED check makes sense. Not for security,
> but to prevent the jobs from getting stuck waiting for a runner that
> will never arrive.
> 
> However, like Marek said, now the gitlab runners don't have the
> "Protected" check set, so it is all useless :-(
> 
> I would prefer to set "Protected" in the gitlab runners settings so that
> it becomes easier to specify users that can and cannot trigger the jobs.

Owners of subprojects can control branch protection rules, so this
feature doesn't help with limiting access to runners added to the whole
group. Qubes runners are not group runners, they are project runners
added only to select projects.

I don't remember why exactly runners got "protected" disabled, but AFAIR
there was some issue with that setting.

> Then, we'll need the $CI_COMMIT_REF_PROTECTED check, not for security,
> but to avoid pipelines getting stuck for unprotected branches.
> 
> It is really difficult to restrict users from triggering jobs in other
> way because they are all automatically added to all subprojects.
> 
> 
> Would you guys be OK if I set "Protected" in the Xilinx and Qubes gitlab
> runners as soon as possible?


-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH 1/3] CI: Remove CI_COMMIT_REF_PROTECTED requirement for HW jobs

2024-05-29 Thread Marek Marczykowski-Górecki
On Wed, May 29, 2024 at 03:19:43PM +0100, Andrew Cooper wrote:
> This restriction doesn't provide any security because anyone with suitable
> permissions on the HW runners can bypass it with this local patch.
> 
> Requiring branches to be protected hampers usability of transient testing
> branches (specifically, can't delete branches except via the Gitlab UI).
>
> Drop the requirement.
> 
> Fixes: 746774cd1786 ("automation: introduce a dom0less test run on Xilinx hardware")
> Fixes: 0ab316e7e15f ("automation: add a smoke and suspend test on an Alder Lake system")
> Signed-off-by: Andrew Cooper 

Runners used to be set to run only on protected branches. I think it
isn't the case anymore from what I see, but it needs checking (I don't
see specific settings in all the projects). If it were still the case,
removing variable check would result in jobs forever pending.

Other than that, I'm okay with this change, since the hw runners are
added only to select projects. You can interpret this as Acked-by, once
you verify that the runners are indeed not limited to protected branches.

I will need to adjust the settings of my project, to set "QUBES_JOBS" only
on some branches - I used to use branch protection rules as a proxy for
selecting on which branch to run hw tests...

> ---
> CC: Roger Pau Monné 
> CC: Stefano Stabellini 
> CC: Michal Orzel 
> CC: Marek Marczykowski-Górecki 
> CC: Oleksii Kurochko 
> 
> Fixes because this wants backporting, but it also needs acks from both Marek
> and Stefano as the owners of the hardware in question.
> ---
>  automation/gitlab-ci/test.yaml | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
> index ad249fa0a5d9..efd3ad46f08e 100644
> --- a/automation/gitlab-ci/test.yaml
> +++ b/automation/gitlab-ci/test.yaml
> @@ -92,7 +92,7 @@
>  when: always
>only:
>  variables:
> -  - $XILINX_JOBS == "true" && $CI_COMMIT_REF_PROTECTED == "true"
> +  - $XILINX_JOBS == "true"
>tags:
>  - xilinx
>  
> @@ -112,7 +112,7 @@
>  when: always
>    only:
>  variables:
> -  - $QUBES_JOBS == "true" && $CI_COMMIT_REF_PROTECTED == "true"
> +  - $QUBES_JOBS == "true"
>tags:
>  - qubes-hw2
>  
> -- 
> 2.30.2
> 

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v4 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-23 Thread Marek Marczykowski-Górecki
On Wed, May 22, 2024 at 05:39:02PM +0200, Marek Marczykowski-Górecki wrote:
> On older systems, the XHCI xcap layout was such that no other (interesting)
> registers were placed on the same page as the debug capability, so Linux was fine with
> making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
> needs to write to some other registers on the same page too.
> 
> Add a generic API for making just parts of an MMIO page R/O and use it to fix
> USB3 console with share=yes or share=hwdom options. More details in commit
> messages.
> 
> Marek Marczykowski-Górecki (2):
>   x86/mm: add API for marking only part of a MMIO page read only
>   drivers/char: Use sub-page ro API to make just xhci dbc cap RO

Does any other x86 maintainer feel comfortable ack-ing this series? Jan
already reviewed 2/2 here (but not 1/2 in this version), but also said
he is not comfortable with letting this in without a second maintainer
approval: 
https://lore.kernel.org/xen-devel/7655e401-b927-4250-ae63-05361a5ee...@suse.com/

> 
>  xen/arch/x86/hvm/emulate.c  |   2 +-
>  xen/arch/x86/hvm/hvm.c  |   4 +-
>  xen/arch/x86/include/asm/mm.h   |  25 +++-
>  xen/arch/x86/mm.c   | 273 +-
>  xen/arch/x86/pv/ro-page-fault.c |   6 +-
>  xen/drivers/char/xhci-dbc.c |  36 ++--
>  6 files changed, 327 insertions(+), 19 deletions(-)
> 
> base-commit: b0082b908391b29b7c4dd5e6c389ebd6481926f8
> -- 
> git-series 0.9.1

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH v4 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-22 Thread Marek Marczykowski-Górecki
Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page does contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.
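
For illustration, the following sketch shows how much of a page gets covered
when a small register block is protected at page granularity, which is the
situation the old rangeset_add_range() call created. It is a standalone
approximation, not Xen code: pfn_down()/pfn_up() mimic the PFN_DOWN()/PFN_UP()
macros assuming 4 KiB pages, and covering_pages()/collateral_bytes() are
hypothetical helper names. The size is assumed to be nonzero.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (UINT64_C(1) << PAGE_SHIFT)

/* Local equivalents of PFN_DOWN()/PFN_UP(), assuming 4 KiB pages. */
static uint64_t pfn_down(uint64_t addr)
{
    return addr >> PAGE_SHIFT;
}

static uint64_t pfn_up(uint64_t addr)
{
    return (addr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Inclusive range of page frames touched by [base, base + size). */
struct pfn_range {
    uint64_t first, last;
};

static struct pfn_range covering_pages(uint64_t base, uint64_t size)
{
    struct pfn_range r = { pfn_down(base), pfn_up(base + size) - 1 };

    return r;
}

/* Bytes that become read-only although they lie outside the block. */
static uint64_t collateral_bytes(uint64_t base, uint64_t size)
{
    struct pfn_range r = covering_pages(base, size);

    return (r.last - r.first + 1) * PAGE_SIZE - size;
}
```

For a 64-byte block sitting in the middle of a page, 4032 bytes of unrelated
register space become read-only as well, which is the collateral the sub-page
API is meant to avoid.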

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v4:
- restore mmio_ro_ranges in the fallback case
- set XHCI_SHARE_NONE in the fallback case
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..c45e4b6825cc 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,28 @@ static void __init cf_check dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+uart->dbc.share = XHCI_SHARE_NONE;
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+if ( rangeset_add_range(mmio_ro_ranges,
+ PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset),
+ PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+uart->dbc.xhc_dbc_offset +
+sizeof(*uart->dbc.dbc_reg)) - 1) )
+printk(XENLOG_INFO
+   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+}
 #endif
 }
 
-- 
git-series 0.9.1



[PATCH v4 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-22 Thread Marek Marczykowski-Górecki
On older systems, the XHCI xcap layout was such that no other (interesting)
registers were placed on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Marek Marczykowski-Górecki (2):
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   4 +-
 xen/arch/x86/include/asm/mm.h   |  25 +++-
 xen/arch/x86/mm.c   | 273 +-
 xen/arch/x86/pv/ro-page-fault.c |   6 +-
 xen/drivers/char/xhci-dbc.c |  36 ++--
 6 files changed, 327 insertions(+), 19 deletions(-)

base-commit: b0082b908391b29b7c4dd5e6c389ebd6481926f8
-- 
git-series 0.9.1



[PATCH v4 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Marek Marczykowski-Górecki
In some cases, only a few registers on a page need to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. Current API allows only marking whole pages read-only,
which sometimes may cover other registers that guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask of which
qwords are supposed to be read-only and an address where the page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of a more efficient structure, because there
aren't supposed to be many pages needing this precise r/o control.
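
The per-page bitmask described above can be sketched as follows. This is a
simplified standalone model, not the implementation from this patch: struct
ro_page, ro_page_mark() and ro_page_write_ok() are hypothetical names, a 4 KiB
page is assumed, and the policy of rejecting a write that touches any
protected qword is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE            4096u
#define MMIO_RO_SUBPAGE_GRAN 8u
#define RO_ELEMS             (PAGE_SIZE / MMIO_RO_SUBPAGE_GRAN) /* 512 bits */

/* Hypothetical per-page record: one bit per 8-byte element. */
struct ro_page {
    uint8_t ro_elems[RO_ELEMS / 8];
};

/*
 * Mark [offset, offset + size) within the page as read-only. Both values
 * are assumed to be granularity-aligned and to stay within the page.
 */
static void ro_page_mark(struct ro_page *p, unsigned int offset,
                         unsigned int size)
{
    for ( unsigned int i = offset / MMIO_RO_SUBPAGE_GRAN;
          i < (offset + size) / MMIO_RO_SUBPAGE_GRAN; i++ )
        p->ro_elems[i / 8] |= (uint8_t)(1u << (i % 8));
}

/* Accept a write only if it stays in the page and hits no protected qword. */
static bool ro_page_write_ok(const struct ro_page *p, unsigned int offset,
                             unsigned int len)
{
    if ( !len || offset >= PAGE_SIZE || len > PAGE_SIZE - offset )
        return false;

    for ( unsigned int i = offset / MMIO_RO_SUBPAGE_GRAN;
          i <= (offset + len - 1) / MMIO_RO_SUBPAGE_GRAN; i++ )
        if ( p->ro_elems[i / 8] & (1u << (i % 8)) )
            return false;

    return true;
}
```

With 8-byte granularity the whole page needs only 512 bits of state, which is
why a coarser-than-byte bitmask is a reasonable trade-off here.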

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash() for
non hardware domains.  To reach mmio_ro_emulated_write(), change how
write violations for p2m_mmio_direct are handled - specifically, check
if they relate to such partially protected page via
subpage_mmio_write_accept() and if so, call hvm_emulate_one_mmio() for
them too. This decodes what the guest is trying to write and finally calls
mmio_ro_emulated_write(). The EPT write violation is detected as
npfec.write_access and npfec.present both being true (similar to other
places), which may cover some other (future?) cases - if that happens,
emulator might get involved unnecessarily, but since it's limited to
pages marked with subpage_mmio_ro_add() only, the impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.
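
The HVM-side condition described above can be modelled as a small predicate. This is a toy model with hypothetical types, not Xen's actual structures; the last argument stands in for the result of subpage_mmio_write_accept():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the p2m type and the fault error code bits. */
enum sketch_p2m_type { SKETCH_P2M_OTHER, SKETCH_P2M_MMIO_DIRECT };

struct sketch_npfec {
    bool write_access;
    bool present;
};

enum sketch_action { SKETCH_EXISTING_HANDLING, SKETCH_EMULATE_WRITE };

/*
 * Only a write fault on a present, direct-MMIO page that is on the
 * sub-page list is routed to the emulator; everything else keeps
 * whatever handling it had before.
 */
static enum sketch_action sketch_classify_fault(enum sketch_p2m_type type,
                                                struct sketch_npfec npfec,
                                                bool page_on_subpage_list)
{
    if ( type == SKETCH_P2M_MMIO_DIRECT && npfec.write_access &&
         npfec.present && page_on_subpage_list )
        return SKETCH_EMULATE_WRITE;

    return SKETCH_EXISTING_HANDLING;
}
```

The point of the last operand is that pages never passed to subpage_mmio_ro_add() see no behavioural change.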

This may give a bit more access to the instruction emulator to HVM
guests (the change in hvm_hap_nested_page_fault()), but only for pages
explicitly marked with subpage_mmio_ro_add() - so, if the guest has a
passed through a device partially used by Xen.
As of the next patch, it applies only configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v4:
- rename SUBPAGE_MMIO_RO_ALIGN to MMIO_RO_SUBPAGE_GRAN
- guard subpage_mmio_write_accept with CONFIG_HVM, as it's used only
  there
- rename ro_qwords to ro_elems
- use unsigned arguments for subpage_mmio_ro_remove_page()
- use volatile for __iomem
- do not set mmio_ro_ctxt.mfn for mmcfg case
- comment where fields of mmio_ro_ctxt are used
- use bool for result of __test_and_set_bit
- do not open-code mfn_to_maddr()
- remove leftover RCU
- mention hvm_hap_nested_page_fault() explicitly in the commit message
Changes in v3:
- use unsigned int for loop iterators
- use __set_bit/__clear_bit when under spinlock
- avoid ioremap() under spinlock
- do not cast away const
- handle unaligned parameters in release build
- comment fixes
- remove RCU - the add functions are __init and

Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Marek Marczykowski-Górecki
On Wed, May 22, 2024 at 03:29:51PM +0200, Jan Beulich wrote:
> On 22.05.2024 15:22, Marek Marczykowski-Górecki wrote:
> > On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> >> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> >>> +static void subpage_mmio_write_emulate(
> >>> +mfn_t mfn,
> >>> +unsigned int offset,
> >>> +const void *data,
> >>> +unsigned int len)
> >>> +{
> >>> +struct subpage_ro_range *entry;
> >>> +void __iomem *addr;
> >>
> >> Wouldn't this better be pointer-to-volatile, with ...
> > 
> > Shouldn't then most other uses of __iomem in the code base be this way
> > too? I see volatile only in few places...
> 
> Quite likely, yet being consistent at least in new code is going to be
> at least desirable.

I tried. Build fails because iounmap() doesn't declare its argument as
volatile, so it triggers -Werror=discarded-qualifiers...

I'll change it just in subpage_mmio_write_emulate(), but leave
subpage_mmio_map_page() (was _get_page) without volatile.
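
For illustration, a standalone sketch of the qualifier issue, using made-up names (sketch_unmap stands in for an unmap routine whose parameter is a plain `void *`; it is not Xen's iounmap()). Stores go through pointer-to-volatile, and the qualifier has to be dropped explicitly at the one call that does not accept it:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static int sketch_unmap_calls;

/* Stand-in for an unmap routine whose parameter is not volatile-qualified. */
static void sketch_unmap(void *va)
{
    (void)va;
    sketch_unmap_calls++;
}

/* Emulate a 1/2/4/8-byte store through a pointer-to-volatile. */
static int sketch_mmio_write(volatile void *addr, const void *data,
                             unsigned int len)
{
    switch ( len )
    {
    case 1: { uint8_t v;  memcpy(&v, data, sizeof(v)); *(volatile uint8_t *)addr = v;  break; }
    case 2: { uint16_t v; memcpy(&v, data, sizeof(v)); *(volatile uint16_t *)addr = v; break; }
    case 4: { uint32_t v; memcpy(&v, data, sizeof(v)); *(volatile uint32_t *)addr = v; break; }
    case 8: { uint64_t v; memcpy(&v, data, sizeof(v)); *(volatile uint64_t *)addr = v; break; }
    default:
        return -1;   /* unsupported access width */
    }

    return 0;
}

static void sketch_release(volatile void *addr)
{
    /*
     * Passing addr directly would implicitly discard volatile, which is
     * what -Werror=discarded-qualifiers rejects; going through uintptr_t
     * makes the drop explicit.
     */
    sketch_unmap((void *)(uintptr_t)addr);
}
```

Whether to cast at the call site or to change the callee's prototype is a separate question; the sketch only shows where the diagnostic comes from.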

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab


signature.asc
Description: PGP signature


Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Marek Marczykowski-Górecki
On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > +static void subpage_mmio_write_emulate(
> > +mfn_t mfn,
> > +unsigned int offset,
> > +const void *data,
> > +unsigned int len)
> > +{
> > +struct subpage_ro_range *entry;
> > +void __iomem *addr;
> 
> Wouldn't this better be pointer-to-volatile, with ...

Shouldn't then most other uses of __iomem in the code base be this way
too? I see volatile only in few places...

> > +list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +{
> > +if ( mfn_eq(entry->mfn, mfn) )
> > +{
> > +if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, 
> > entry->ro_qwords) )
> > +{
> > + write_ignored:
> > +gprintk(XENLOG_WARNING,
> > +"ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len 
> > %u\n",
> > +mfn_x(mfn), offset, len);
> > +return;
> > +}
> > +
> > +addr = subpage_mmio_get_page(entry);
> > +if ( !addr )
> > +{
> > +gprintk(XENLOG_ERR,
> > +"Failed to map page for MMIO write at 
> > 0x%"PRI_mfn"%03x\n",
> > +mfn_x(mfn), offset);
> > +return;
> > +}
> > +
> > +switch ( len )
> > +{
> > +case 1:
> > +writeb(*(const uint8_t*)data, addr);
> > +break;
> > +case 2:
> > +writew(*(const uint16_t*)data, addr);
> > +break;
> > +case 4:
> > +writel(*(const uint32_t*)data, addr);
> > +break;
> > +case 8:
> > +writeq(*(const uint64_t*)data, addr);
> > +break;
> 
> ... this being how it's written? (If so, volatile suitably carried through to
> other places as well.)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v3 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-22 Thread Marek Marczykowski-Górecki
On Wed, May 22, 2024 at 10:05:05AM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > --- a/xen/drivers/char/xhci-dbc.c
> > +++ b/xen/drivers/char/xhci-dbc.c
> > @@ -1216,20 +1216,19 @@ static void __init cf_check 
> > dbc_uart_init_postirq(struct serial_port *port)
> >  break;
> >  }
> >  #ifdef CONFIG_X86
> > -/*
> > - * This marks the whole page as R/O, which may include other registers
> > - * unrelated to DbC. Xen needs only DbC area protected, but it seems
> > - * Linux's XHCI driver (as of 5.18) works without writting to the whole
> > - * page, so keep it simple.
> > - */
> > -if ( rangeset_add_range(mmio_ro_ranges,
> > -PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> > - uart->dbc.xhc_dbc_offset),
> > -PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> > -   uart->dbc.xhc_dbc_offset +
> > -sizeof(*uart->dbc.dbc_reg)) - 1) )
> > -printk(XENLOG_INFO
> > -   "Error while adding MMIO range of device to 
> > mmio_ro_ranges\n");
> > +if ( subpage_mmio_ro_add(
> > + (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
> > +  uart->dbc.xhc_dbc_offset,
> > + sizeof(*uart->dbc.dbc_reg)) )
> > +{
> > +printk(XENLOG_WARNING
> > +   "Error while marking MMIO range of XHCI console as R/O, "
> > +   "making the whole device R/O (share=no)\n");
> 
> Since you mention "share=no" here, wouldn't you then better also update the
> respective struct field, even if (right now) there may be nothing subsequently
> using that? Except that dbc_ensure_running() actually is looking at it, and
> that's not an __init function.

That case is just an optimization - if pci_ro_device() is used, nobody
else could write to PCI_COMMAND behind the driver's back, so there is no
point checking. Anyway, yes, it makes sense to adjust dbc->share too.

> > +if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
> > +printk(XENLOG_WARNING
> > +   "Failed to mark read-only %pp used for XHCI console\n",
> > +   &uart->dbc.sbdf);
> > +}
> >  #endif
> >  }
> 
> It's been a long time since v2 and the description doesn't say anything in
> this regard: Is there a reason not to retain the rangeset addition alongside
> the pci_ro_device() on the fallback path?

pci_ro_device() prevents the device from being assigned to domU at all, so
that case is covered already. Dom0 would fail to load any driver (if
nothing else - because it can't size the BARs with R/O config space), so
a _well behaving_ Dom0 would also not touch the device in this case.
But otherwise, yes, it makes sense to keep adding to mmio_ro_ranges in the
fallback path.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-22 Thread Marek Marczykowski-Górecki
On Wed, May 22, 2024 at 09:52:44AM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -2009,6 +2009,14 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned 
> > long gla,
> >  goto out_put_gfn;
> >  }
> >  
> > +if ( (p2mt == p2m_mmio_direct) && npfec.write_access && npfec.present 
> > &&
> > + subpage_mmio_write_accept(mfn, gla) &&
> 
> Afaics subpage_mmio_write_accept() is unreachable then when CONFIG_HVM=n?

Right, the PV path hits mmio_ro_emulated_write() without my changes
already.
Do you suggest putting subpage_mmio_write_accept() under #ifdef
CONFIG_HVM?

> > + (hvm_emulate_one_mmio(mfn_x(mfn), gla) == X86EMUL_OKAY) )
> > +{
> > +rc = 1;
> > +goto out_put_gfn;
> > +}
> 
> Overall this new if() is pretty similar to the immediate preceding one.
> So similar that I wonder whether the two shouldn't be folded. 

I can do that if you prefer.

> In fact
> it looks as if the new one is needed only for the case where you'd pass
> through (to a DomU) a device partially used by Xen. That could certainly
> do with mentioning explicitly.

Well, the change in mmio_ro_emulated_write() is relevant to both dom0
and domU. It simply wasn't reachable (in this case) for HVM domU before
(but was for PV already).

> > +static void __iomem *subpage_mmio_get_page(struct subpage_ro_range *entry)
> 
> Considering what the function does and what it returns, perhaps better
> s/get/map/? The "get_page" part of the name generally has a different
> meaning in Xen's memory management.

Ok.

> > +{
> > +void __iomem *mapped_page;
> > +
> > +if ( entry->mapped )
> > +return entry->mapped;
> > +
> > +mapped_page = ioremap(mfn_x(entry->mfn) << PAGE_SHIFT, PAGE_SIZE);
> > +
> > +spin_lock(&subpage_ro_lock);
> > +/* Re-check under the lock */
> > +if ( entry->mapped )
> > +{
> > +spin_unlock(&subpage_ro_lock);
> > +iounmap(mapped_page);
> 
> The only unmap is on an error path here and on another error path elsewhere.
> IOW it looks as if devices with such marked pages are meant to never be hot
> unplugged. I can see that being intentional for the XHCI console, but imo
> such a restriction also needs prominently calling out in a comment next to
> e.g. the function declaration.

The v1 included subpage_mmio_ro_remove() function (which would need to
be used in case of hot-unplug of such device, if desirable), but since
this series doesn't introduce any use of it (as you say, it isn't
desirable for XHCI console specifically), you asked me to remove it...

Should I add an explicit comment about the limitation, instead of having
it implicit by not having subpage_mmio_ro_remove() there?

> > +return entry->mapped;
> > +}
> > +
> > +entry->mapped = mapped_page;
> > +spin_unlock(&subpage_ro_lock);
> > +return entry->mapped;
> > +}
> > +
> > +static void subpage_mmio_write_emulate(
> > +mfn_t mfn,
> > +unsigned int offset,
> > +const void *data,
> > +unsigned int len)
> > +{
> > +struct subpage_ro_range *entry;
> > +void __iomem *addr;
> 
> Wouldn't this better be pointer-to-volatile, with ...
> 
> > +list_for_each_entry(entry, &subpage_ro_ranges, list)
> > +{
> > +if ( mfn_eq(entry->mfn, mfn) )
> > +{
> > +if ( test_bit(offset / SUBPAGE_MMIO_RO_ALIGN, 
> > entry->ro_qwords) )
> > +{
> > + write_ignored:
> > +gprintk(XENLOG_WARNING,
> > +"ignoring write to R/O MMIO 0x%"PRI_mfn"%03x len 
> > %u\n",
> > +mfn_x(mfn), offset, len);
> > +return;
> > +}
> > +
> > +addr = subpage_mmio_get_page(entry);
> > +if ( !addr )
> > +{
> > +gprintk(XENLOG_ERR,
> > +"Failed to map page for MMIO write at 
> > 0x%"PRI_mfn"%03x\n",
> > +mfn_x(mfn), offset);
> > +return;
> > +}
> > +
> > +switch ( len )
> > +{
> > +case 1:
> > +writeb(*(const uint8_t*)data, addr);
> > +  

Re: [PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-21 Thread Marek Marczykowski-Górecki
On Tue, May 21, 2024 at 05:16:58PM +0200, Jan Beulich wrote:
> On 21.05.2024 04:54, Marek Marczykowski-Górecki wrote:
> > --- a/xen/arch/x86/include/asm/mm.h
> > +++ b/xen/arch/x86/include/asm/mm.h
> > @@ -522,9 +522,27 @@ extern struct rangeset *mmio_ro_ranges;
> >  void memguard_guard_stack(void *p);
> >  void memguard_unguard_stack(void *p);
> >  
> > +/*
> > + * Add more precise r/o marking for a MMIO page. Range specified here
> > + * will still be R/O, but the rest of the page (not marked as R/O via 
> > another
> > + * call) will have writes passed through.
> > + * The start address and the size must be aligned to SUBPAGE_MMIO_RO_ALIGN.
> > + *
> > + * This API cannot be used for overlapping ranges, nor for pages already 
> > added
> > + * to mmio_ro_ranges separately.
> > + *
> > + * Return values:
> > + *  - negative: error
> > + *  - 0: success
> > + */
> > +#define SUBPAGE_MMIO_RO_ALIGN 8
> 
> This isn't just alignment, but also (and perhaps more importantly) 
> granularity.
> I think the name wants to express this.

SUBPAGE_MMIO_RO_GRANULARITY? Sounds a bit long...

> 
> > @@ -4910,6 +4921,260 @@ long arch_memory_op(unsigned long cmd, 
> > XEN_GUEST_HANDLE_PARAM(void) arg)
> >  return rc;
> >  }
> >  
> > +/*
> > + * Mark part of the page as R/O.
> > + * Returns:
> > + * - 0 on success - first range in the page
> > + * - 1 on success - subsequent range in the page
> > + * - <0 on error
> > + *
> > + * This needs subpage_ro_lock already taken */
> 
> Nit: Comment style (full stop and */ on its own line).
> 
> > +static int __init subpage_mmio_ro_add_page(
> > +mfn_t mfn, unsigned int offset_s, unsigned int offset_e)
> > +{
> > +struct subpage_ro_range *entry = NULL, *iter;
> > +unsigned int i;
> > +
> > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +{
> > +if ( mfn_eq(iter->mfn, mfn) )
> > +{
> > +entry = iter;
> > +break;
> > +}
> > +}
> > +if ( !entry )
> > +{
> > +/* iter == NULL marks it was a newly allocated entry */
> > +iter = NULL;
> > +entry = xzalloc(struct subpage_ro_range);
> > +if ( !entry )
> > +return -ENOMEM;
> > +entry->mfn = mfn;
> > +}
> > +
> > +for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> > +{
> > +int oldbit = __test_and_set_bit(i / SUBPAGE_MMIO_RO_ALIGN,
> > +entry->ro_qwords);
> 
> Why int, not bool?

Because __test_and_set_bit returns int. But I can change to bool if you
prefer.

> > +ASSERT(!oldbit);
> > +}
> > +
> > +if ( !iter )
> > +list_add(&entry->list, &subpage_ro_ranges);
> > +
> > +return iter ? 1 : 0;
> > +}
> > +
> > +/* This needs subpage_ro_lock already taken */
> > +static void __init subpage_mmio_ro_remove_page(
> > +mfn_t mfn,
> > +int offset_s,
> > +int offset_e)
> 
> Can either of these be negative? The more that ...

Right, I can change them to unsigned. They are unsigned already in
subpage_mmio_ro_add_page.

> > +{
> > +struct subpage_ro_range *entry = NULL, *iter;
> > +unsigned int i;
> 
> ... this is used ...
> 
> > +list_for_each_entry(iter, &subpage_ro_ranges, list)
> > +{
> > +if ( mfn_eq(iter->mfn, mfn) )
> > +{
> > +entry = iter;
> > +break;
> > +}
> > +}
> > +if ( !entry )
> > +return;
> > +
> > +for ( i = offset_s; i <= offset_e; i += SUBPAGE_MMIO_RO_ALIGN )
> 
> ... with both of them?
> 
> > +__clear_bit(i / SUBPAGE_MMIO_RO_ALIGN, entry->ro_qwords);
> > +
> > +if ( !bitmap_empty(entry->ro_qwords, PAGE_SIZE / 
> > SUBPAGE_MMIO_RO_ALIGN) )
> > +return;
> > +
> > +list_del(&entry->list);
> > +if ( entry->mapped )
> > +iounmap(entry->mapped);
> > +xfree(entry);
> > +}
> > +
> > +int __init subpage_mmio_ro_add(
> > +paddr_t start,
> > +size_t size)
> > +{
> > +mfn_t mfn_start = maddr_to_mfn(start);
> > +paddr_t end = start + size - 1;
> > +mfn_t mfn_end = maddr_to_mfn(end);
> > +unsigned int offset_end = 0;
> > +int rc;
> > +boo

[PATCH v3 0/2] Add API for making parts of a MMIO page R/O and use it in XHCI console

2024-05-20 Thread Marek Marczykowski-Górecki
On older systems, the XHCI xcap layout placed no other (interesting) registers
on the same page as the debug capability, so Linux was fine with
making the whole page R/O. But at least on Tiger Lake and Alder Lake, Linux
needs to write to some other registers on the same page too.

Add a generic API for making just parts of an MMIO page R/O and use it to fix
USB3 console with share=yes or share=hwdom options. More details in commit
messages.

Technically it may still qualify for 4.19, since v1 was sent well before
the last posting date. But I realize it's quite late and it isn't a top
priority series, so if it won't hit 4.19, it's okay with me too.

Marek Marczykowski-Górecki (2):
  x86/mm: add API for marking only part of a MMIO page read only
  drivers/char: Use sub-page ro API to make just xhci dbc cap RO

 xen/arch/x86/hvm/emulate.c  |   2 +-
 xen/arch/x86/hvm/hvm.c  |   8 +-
 xen/arch/x86/include/asm/mm.h   |  18 ++-
 xen/arch/x86/mm.c   | 268 +-
 xen/arch/x86/pv/ro-page-fault.c |   1 +-
 xen/drivers/char/xhci-dbc.c |  27 +--
 6 files changed, 309 insertions(+), 15 deletions(-)

base-commit: b0082b908391b29b7c4dd5e6c389ebd6481926f8
-- 
git-series 0.9.1



[PATCH v3 2/2] drivers/char: Use sub-page ro API to make just xhci dbc cap RO

2024-05-20 Thread Marek Marczykowski-Górecki
Not the whole page, which may contain other registers too. The XHCI
specification describes DbC as designed to be controlled by a different
driver, but does not mandate placing registers on a separate page. In fact
on Tiger Lake and newer (at least), this page does contain other registers
that Linux tries to use. And with share=yes, a domU would use them too.
Without this patch, PV dom0 would fail to initialize the controller,
while HVM would be killed on EPT violation.

With `share=yes`, this patch gives domU more access to the emulator
(although a HVM with any emulated device already has plenty of it). This
configuration is already documented as unsafe with untrusted guests and
not security supported.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v3:
- indentation fix
- remove stale comment
- fallback to pci_ro_device() if subpage_mmio_ro_add() fails
- extend commit message
Changes in v2:
 - adjust for simplified subpage_mmio_ro_add() API
---
 xen/drivers/char/xhci-dbc.c | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/xen/drivers/char/xhci-dbc.c b/xen/drivers/char/xhci-dbc.c
index 8e2037f1a5f7..cac9d90d3e39 100644
--- a/xen/drivers/char/xhci-dbc.c
+++ b/xen/drivers/char/xhci-dbc.c
@@ -1216,20 +1216,19 @@ static void __init cf_check dbc_uart_init_postirq(struct serial_port *port)
 break;
 }
 #ifdef CONFIG_X86
-/*
- * This marks the whole page as R/O, which may include other registers
- * unrelated to DbC. Xen needs only DbC area protected, but it seems
- * Linux's XHCI driver (as of 5.18) works without writting to the whole
- * page, so keep it simple.
- */
-if ( rangeset_add_range(mmio_ro_ranges,
-PFN_DOWN((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
- uart->dbc.xhc_dbc_offset),
-PFN_UP((uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
-   uart->dbc.xhc_dbc_offset +
-sizeof(*uart->dbc.dbc_reg)) - 1) )
-printk(XENLOG_INFO
-   "Error while adding MMIO range of device to mmio_ro_ranges\n");
+if ( subpage_mmio_ro_add(
+ (uart->dbc.bar_val & PCI_BASE_ADDRESS_MEM_MASK) +
+  uart->dbc.xhc_dbc_offset,
+ sizeof(*uart->dbc.dbc_reg)) )
+{
+printk(XENLOG_WARNING
+   "Error while marking MMIO range of XHCI console as R/O, "
+   "making the whole device R/O (share=no)\n");
+if ( pci_ro_device(0, uart->dbc.sbdf.bus, uart->dbc.sbdf.devfn) )
+printk(XENLOG_WARNING
+   "Failed to mark read-only %pp used for XHCI console\n",
+   &uart->dbc.sbdf);
+}
 #endif
 }
 
-- 
git-series 0.9.1



[PATCH v3 1/2] x86/mm: add API for marking only part of a MMIO page read only

2024-05-20 Thread Marek Marczykowski-Górecki
In some cases, only a few registers on a page need to be write-protected.
Examples include USB3 console (64 bytes worth of registers) or MSI-X's
PBA table (which doesn't need to span the whole table either), although
in the latter case the spec forbids placing other registers on the same
page. The current API allows marking only whole pages read-only,
which sometimes may cover other registers that the guest may need to
write into.

Currently, when a guest tries to write to an MMIO page on the
mmio_ro_ranges, it's either immediately crashed on EPT violation - if
that's HVM, or if PV, it gets #PF. In case of Linux PV, if access was
from userspace (like, /dev/mem), it will try to fixup by updating page
tables (that Xen again will force to read-only) and will hit that #PF
again (looping endlessly). Both behaviors are undesirable if guest could
actually be allowed the write.

Introduce an API that allows marking part of a page read-only. Since
sub-page permissions are not a thing in page tables (they are in EPT,
but not granular enough), do this via emulation (or simply page fault
handler for PV) that handles writes that are supposed to be allowed.
The new subpage_mmio_ro_add() takes a start physical address and the
region size in bytes. Both start address and the size need to be 8-byte
aligned, as a practical simplification (allows using smaller bitmask,
and a smaller granularity isn't really necessary right now).
It will internally add relevant pages to mmio_ro_ranges, but if either
start or end address is not page-aligned, it additionally adds that page
to a list for sub-page R/O handling. The list holds a bitmask which
qwords are supposed to be read-only and an address where page is mapped
for write emulation - this mapping is done only on the first access. A
plain list is used instead of more efficient structure, because there
isn't supposed to be many pages needing this precise r/o control.
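
As a rough illustration of the start/size handling described above (a sketch under assumptions, not the function from this patch): every frame the range touches goes into the whole-page R/O set, and only a partially covered first or last frame needs sub-page tracking. The struct and function names are hypothetical, and 4 KiB pages are assumed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)
#define SKETCH_PAGE_MASK  (SKETCH_PAGE_SIZE - 1)
#define SKETCH_GRAN       8u

/* Hypothetical result: which frames are covered, and which are partial. */
struct sketch_split {
    uint64_t pfn_first, pfn_last;     /* all frames touched by the range */
    bool head_partial, tail_partial;  /* frames needing sub-page tracking */
    unsigned int head_s, head_e;      /* inclusive offsets in first frame */
    unsigned int tail_s, tail_e;      /* inclusive offsets in last frame */
};

/* Returns 0 on success, -1 on empty, misaligned or overflowing input. */
static int sketch_split_range(uint64_t start, uint64_t size,
                              struct sketch_split *out)
{
    uint64_t end;

    if ( !size || start % SKETCH_GRAN || size % SKETCH_GRAN )
        return -1;

    end = start + size - 1;
    if ( end < start )
        return -1;

    out->pfn_first = start >> SKETCH_PAGE_SHIFT;
    out->pfn_last = end >> SKETCH_PAGE_SHIFT;
    out->head_partial = out->tail_partial = false;
    out->head_s = out->head_e = out->tail_s = out->tail_e = 0;

    if ( out->pfn_first == out->pfn_last )
    {
        /* Single frame: partial unless it is covered edge to edge. */
        if ( (start & SKETCH_PAGE_MASK) ||
             (end & SKETCH_PAGE_MASK) != SKETCH_PAGE_MASK )
        {
            out->head_partial = true;
            out->head_s = (unsigned int)(start & SKETCH_PAGE_MASK);
            out->head_e = (unsigned int)(end & SKETCH_PAGE_MASK);
        }
        return 0;
    }

    if ( start & SKETCH_PAGE_MASK )
    {
        out->head_partial = true;
        out->head_s = (unsigned int)(start & SKETCH_PAGE_MASK);
        out->head_e = (unsigned int)SKETCH_PAGE_MASK;
    }

    if ( (end & SKETCH_PAGE_MASK) != SKETCH_PAGE_MASK )
    {
        out->tail_partial = true;
        out->tail_s = 0;
        out->tail_e = (unsigned int)(end & SKETCH_PAGE_MASK);
    }

    return 0;
}
```

For a 64-byte region at 0x10001040 this yields a single frame with offsets 0x40..0x7f tracked at sub-page level; frames strictly between the first and the last are always fully read-only.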

The mechanism this API is plugged in is slightly different for PV and
HVM. For both paths, it's plugged into mmio_ro_emulated_write(). For PV,
it's already called for #PF on read-only MMIO page. For HVM however, EPT
violation on p2m_mmio_direct page results in a direct domain_crash().
To reach mmio_ro_emulated_write(), change how write violations for
p2m_mmio_direct are handled - specifically, check if they relate to such
partially protected page via subpage_mmio_write_accept() and if so, call
hvm_emulate_one_mmio() for them too. This decodes what the guest is trying
to write and finally calls mmio_ro_emulated_write(). The EPT write
violation is detected as npfec.write_access and npfec.present both being
true (similar to other places), which may cover some other (future?)
cases - if that happens, emulator might get involved unnecessarily, but
since it's limited to pages marked with subpage_mmio_ro_add() only, the
impact is minimal.
Both of those paths need an MFN to which guest tried to write (to check
which part of the page is supposed to be read-only, and where
the page is mapped for writes). This information currently isn't
available directly in mmio_ro_emulated_write(), but in both cases it is
already resolved somewhere higher in the call tree. Pass it down to
mmio_ro_emulated_write() via new mmio_ro_emulate_ctxt.mfn field.

This may give a bit more access to the instruction emulator to HVM
guests, but only for pages explicitly marked with subpage_mmio_ro_add().
As of the next patch, this applies only to a configuration explicitly
documented as not security supported.

The subpage_mmio_ro_add() function cannot be called with overlapping
ranges, and on pages already added to mmio_ro_ranges separately.
Successful calls would result in correct handling, but error paths may
result in incorrect state (like pages removed from mmio_ro_ranges too
early). Debug build has asserts for relevant cases.

Signed-off-by: Marek Marczykowski-Górecki 
---
Shadow mode is not tested, but I don't expect it to work differently than
HAP in areas related to this patch.

Changes in v3:
- use unsigned int for loop iterators
- use __set_bit/__clear_bit when under spinlock
- avoid ioremap() under spinlock
- do not cast away const
- handle unaligned parameters in release build
- comment fixes
- remove RCU - the add functions are __init and actual usage is only
  much later after domains are running
- add checks overlapping ranges in debug build and document the
  limitations
- change subpage_mmio_ro_add() so the error path doesn't potentially
  remove pages from mmio_ro_ranges
- move printing message to avoid one goto in
  subpage_mmio_write_emulate()
Changes in v2:
- Simplify subpage_mmio_ro_add() parameters
- add to mmio_ro_ranges from within subpage_mmio_ro_add()
- use ioremap() instead of caller-provided fixmap
- use 8-bytes granularity (largest supported single write) and a bitmap
  instead of a rangeset
- clarify commit message
- change how it's plugged in for HVM domain, to not change

Re: [PATCH 06/12] RFC: automation: Add linux stubdom build and smoke test

2024-05-17 Thread Marek Marczykowski-Górecki
On Fri, May 17, 2024 at 05:40:52PM -0700, Stefano Stabellini wrote:
> On Thu, 16 May 2024, Marek Marczykowski-Górecki wrote:
> > Add minimal linux-stubdom smoke test. It starts a simple HVM with
> > linux-stubdom. The actual stubdom implementation is taken from Qubes OS
> > and then stripped off Qubes-specific code. In particular, the remaining
> > code does _not_ support:
> >  - direct kernel boot (implemented by relaying on specific guest disk
> >laying in Qubes OS)
> >  - graphical console (used Qubes GUI agent injected into
> >stubdomain's qemu)
> >  - audio input/output (used Qubes audio agent inside stubdomain)
> >  - USB passthrough (used qrexec <-> usbip proxy inside stubdomain)
> >  - setting up DHCP server (assumes guest addressing used in Qubes OS)
> > 
> > For this smoke test, the relevant part is missing direct kernel boot, as
> > that's used in other smoke tests. Solve this by preparing disk image
> > with proper bootloader (grub) installed. Since the test script is
> > running on arm64 to control x86_64 box, it cannot (easily) install grub
> > directly. For this reason, prepare bootsector as part of the Xen build
> > (which runs on x86_64) and then prepend do the disk image during the
> > test (and adjust partitions table afterwards).
> 
> I am not an expert on this, but do you think it would be possible to use
> network boot and tftp instead of grub on emulated disk? That would not
> require us to build neither /grub-core.img nor build_domU_disk().

Honestly, I don't know. I guess I'd need at least dnsmasq in dom0, and
also iPXE for the domU (if not built already?). I can try for this test.
But a later test (the PCI one) connects a network card and dom0 can't
really set up its own DHCP on that network. Additionally, combining this with
vif network for PXE might be confusing down the road.

> I am trying to avoid grub-core.img and disk.img because I think direct
> kernel boot or network boot are easier to maintain and more similar to
> the other tests. If you see the ARM tests, they all use tftp boot.

The ARM ones boot as dom0less, where there is only one boot mode for the
system to start in. Here, we have two: xen+dom0 (which already
does network boot), and then domU started by dom0. The latter would
need either a separate DHCP server on a separate network (vif interface
in dom0 should be fine), or some other way to separate dom0/domU boot
mode.

That said, the stubdom used in Qubes does support direct kernel boot. It
is removed from this version, because it relies on specific disk layout
(it reserves xvdd for this purpose). But I do want to bring this
capability to the upstream version too at some point.

> > Signed-off-by: Marek Marczykowski-Górecki 
> > ---
> > The test is implemented using hardware runner, because some of the
> > further tests will require it (for example PCI passthrough with
> > stubdomain). But if there is strong desire to have stubdomain tested
> > inside qemu tests (to be included in patchew runs), it is probably an
> > option for this basic smoke test.
> 
> Thanks for this amazing work. This is a great start, we can see how to
> create more tests after merging this one.
> 
> 
> > For now I'm keeping stubdomain code (build and glue scripts) in separate
> > repository on my github account. This is far from ideal. What would be
> > preferred option? New repository on xenbits? Or add directly into
> > xen.git (stubdom directory)? Honestly, I'd rather avoid the latter, as
> > from packager point of view those are mostly separate beings (similar to
> > qemu, where many use distribution-provide one instead of the one bundled
> > with Xen) and it's convenient to not need to rebuild stubdomain on every
> > hypervisor change (like a security patch).
> 
> My suggestion is to create repositories under gitlab.com/xen-project

gitlab.com/xen-project/stubdom-dm-linux ?
Initially I can create the repository under people/marmarek/.

Is there any preference regarding git history? I see two options:
1. Preserve the current history, where there is a lot of qubes-specific
work and on top a bunch of commits making it not qubes-specific (this is
what is there now).
2. Start with fresh history and reference original repository (and the
commit id) in the initial commit message.

> > Another topic is QEMU version inside stubdomain. It needs to be a
> > separate build due to vastly different configure options, so I cannot
> > reuse the qemu binary built for dom0 (or distribution-provided one if
> > Xen is configured to use it). But also, at this moment qemu for
> > stubdomain needs few extra patches that are not upstream yet.
> > What should be the proper soluti

Re: [PATCH v2 2/4] tools: Import standalone sd_notify() implementation from systemd

2024-05-16 Thread Marek Marczykowski-Górecki
On Thu, May 16, 2024 at 07:58:02PM +0100, Andrew Cooper wrote:
> ... in order to avoid linking against the whole of libsystemd.
> 
> Only minimal changes to the upstream copy, to function as a drop-in
> replacement for sd_notify() and as a header-only library.

Maybe add explicit link to the original source?

> Signed-off-by: Andrew Cooper 
> ---
> CC: Anthony PERARD 
> CC: Juergen Gross 
> CC: George Dunlap 
> CC: Jan Beulich 
> CC: Stefano Stabellini 
> CC: Julien Grall 
> CC: Christian Lindig 
> CC: Edwin Török 
> 
> v2:
>  * New
> ---
>  tools/include/xen-sd-notify.h | 98 +++
>  1 file changed, 98 insertions(+)
>  create mode 100644 tools/include/xen-sd-notify.h
> 
> diff --git a/tools/include/xen-sd-notify.h b/tools/include/xen-sd-notify.h
> new file mode 100644
> index ..eda9d8b22d9e
> --- /dev/null
> +++ b/tools/include/xen-sd-notify.h
> @@ -0,0 +1,98 @@

...

> +static inline void xen_sd_closep(int *fd) {

Static inline is one of the changes vs upstream, and gitlab-ci is not
happy about it:

/builds/xen-project/patchew/xen/tools/xenstored/../../tools/include/xen-sd-notify.h:45:3: error: cleanup argument not a function
   45 |   int __attribute__((cleanup(sd_closep))) fd = -1;
  |   ^~~
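
For reference, a minimal sketch of how the GCC/Clang cleanup attribute is normally wired up; the names are hypothetical and this is not the header under review. Judging only from the log above, the attribute names sd_closep while the quoted function is xen_sd_closep, which would explain the diagnostic, but that is a reading of the excerpt rather than a confirmed cause:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

static int sketch_close_count;

/*
 * Cleanup handler: receives a pointer to the variable going out of scope.
 * The attribute below must name this exact function.
 */
static inline void sketch_closep(int *fd)
{
    if ( *fd >= 0 )
        close(*fd);
    sketch_close_count++;
}

/* Opens path read-only; the descriptor is closed automatically on return. */
static int sketch_open_and_drop(const char *path)
{
    int __attribute__((cleanup(sketch_closep))) fd = open(path, O_RDONLY);

    return fd >= 0;
}
```

A static inline handler is accepted here as long as it is declared before use and the name in the attribute matches.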

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH 11/12] automation: stubdom test with boot from CDROM

2024-05-16 Thread Marek Marczykowski-Górecki
Based on the initial stubdomain test, add booting from CDROM. It's
significantly different in terms of emulated devices (contrary to PV
disk, the cdrom is backed by qemu), so test that path too.
Schedule it on the AMD runner, as it has fewer tests right now.

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/build/alpine/3.19-arm64v8.dockerfile   |  1 +-
 automation/gitlab-ci/build.yaml   |  2 +-
 automation/gitlab-ci/test.yaml|  8 ++-
 automation/scripts/qubes-x86-64.sh| 58 +++-
 automation/tests-artifacts/alpine/3.19.dockerfile |  3 +-
 5 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/automation/build/alpine/3.19-arm64v8.dockerfile b/automation/build/alpine/3.19-arm64v8.dockerfile
index 12810f87ecc6..03a3f28ff686 100644
--- a/automation/build/alpine/3.19-arm64v8.dockerfile
+++ b/automation/build/alpine/3.19-arm64v8.dockerfile
@@ -49,3 +49,4 @@ RUN apk --no-cache add \
   fakeroot \
   sfdisk \
   e2fsprogs \
+  xorriso \
diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 134a01d03efa..f1e6a6144c90 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -324,10 +324,12 @@ alpine-3.19-rootfs-export:
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
 - cp /grub-core.img binaries/grub-core.img
+- cp /grub-core-eltorito.img binaries/grub-core-eltorito.img
   artifacts:
 paths:
   - binaries/initrd.tar.gz
   - binaries/grub-core.img
+  - binaries/grub-core-eltorito.img
   tags:
 - x86_64
 
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 76cc430ae00f..4e4dca91c26e 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -239,6 +239,14 @@ zen3p-pci-stubdom-x86-64-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.19-gcc-debug
 
+zen3p-stubdom-hvm-cdboot-x86-64-gcc-debug:
+  extends: .zen3p-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh stubdom-hvm-cdboot 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.19-gcc-debug
+
 qemu-smoke-dom0-arm64-gcc:
   extends: .qemu-arm64
   script:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 816c16fbab3e..b4f5c846ffe3 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -19,6 +19,7 @@ vif = [ "bridge=xenbr0", ]
 disk = [ ]
 '
 domU_disk_path=
+domU_disk_type=disk
 
 ### helper functions
 
@@ -27,27 +28,47 @@ build_domU_disk() {
 local initrd="$2"
 local rootfs="$3"
 local output="$4"
+local img_type="$5"
 local grubcfg="$rootfs/boot/grub2/grub.cfg"
-local kernel_cmdline="root=/dev/xvda1 console=hvc0 earlyprintk=xen"
+local kernel_cmdline
 
 mkdir -p "$rootfs/boot/grub2"
 cp "$kernel" "$rootfs/boot/vmlinuz"
+if [ "$img_type" = "disk" ]; then
+kernel_cmdline="root=/dev/xvda1 console=hvc0 earlyprintk=xen"
+elif [ "$img_type" = "cdrom" ]; then
+kernel_cmdline="root=/dev/sr0 console=hvc0 earlyprintk=xen"
+fi
 echo "linux /boot/vmlinuz $kernel_cmdline" >> "$grubcfg"
 if [ -n "$initrd" ]; then
 cp "$initrd" "$rootfs/boot/initrd.img"
 echo "initrd /boot/initrd.img" >> "$grubcfg"
 fi
 echo "boot" >> "$grubcfg"
-size=$(du -sm "$rootfs")
-size=${size%%  *}
-# add 5M margin
-size=$(( size + 5 ))
-mke2fs -d "$rootfs" "$output.part1" ${size}m
-cat "$rootfs/usr/lib/grub/i386-pc/boot_hybrid.img" binaries/grub-core.img > "$output"
-# align for the partition 1 start (2048 sectors)
-truncate -s $((2048 * 512)) "$output"
-cat "$output.part1" >> "$output"
-echo ",,linux,*" | sfdisk "$output"
+if [ "$img_type" = "disk" ]; then
+size=$(du -sm "$rootfs")
+size=${size%%  *}
+# add 5M margin
+size=$(( size + 5 ))
+mke2fs -d "$rootfs" "$output.part1" ${size}m
+cat "$rootfs/usr/lib/grub/i386-pc/boot_hybrid.img" binaries/grub-core.img > "$output"
+# align for the partition 1 start (2048 sectors)
+truncate -s $((2048 * 512)) "$output"
+cat "$output.part1" >> "$output"
+echo ",,linux,*" | sfdisk "$output"
+elif [ "$img_type" = "cdrom" ]; then
+cp binaries/grub-core-eltorito.img "$rootfs/boot/"
+xorriso -as mkisofs \
+   

[PATCH 09/12] WIP: automation: temporarily add 'testlab' tag to stubdomain build

2024-05-16 Thread Marek Marczykowski-Górecki
Make it run on newer runners that have a new enough kernel for
dracut-install.

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/gitlab-ci/build.yaml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 9b9e5464f179..134a01d03efa 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -356,6 +356,9 @@ alpine-3.19-gcc-debug:
   variables:
 CONTAINER: alpine:3.19
 STUBDOM_LINUX: y
+  tags:
+  - x86_64
+  - testlab
 
 debian-stretch-gcc-debug:
   extends: .gcc-x86-64-build-debug
-- 
git-series 0.9.1



[PATCH 07/12] libxl: Allow stubdomain to control interrupts of PCI device

2024-05-16 Thread Marek Marczykowski-Górecki
Especially allow it to control MSI/MSI-X enabling bits. This part only
writes a flag to sysfs, the actual implementation is on the kernel
side.

This requires Linux >= 5.10 in dom0 (or relevant patch backported).
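
For reference, the equivalent manual operation from a dom0 shell would look roughly like the sketch below. The helper name and its optional second argument are made up for illustration, and the default path assumes that SYSFS_PCIBACK_DRIVER expands to /sys/bus/pci/drivers/pciback.

```shell
#!/bin/sh
# Hypothetical helper, not part of libxl: write a device's BDF to pciback's
# allow_interrupt_control node, which is what the patch does through
# sysfs_write_bdf(). The second argument exists only so the sketch can be
# pointed at a scratch directory; the default path is an assumption.
allow_interrupt_control() {
    bdf=$1
    driver_dir=${2:-/sys/bus/pci/drivers/pciback}
    echo "$bdf" > "$driver_dir/allow_interrupt_control"
}

# Example (needs root and a device already bound to pciback):
#   allow_interrupt_control 0000:03:00.0
```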

Signed-off-by: Marek Marczykowski-Górecki 
---
 tools/libs/light/libxl_pci.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/tools/libs/light/libxl_pci.c b/tools/libs/light/libxl_pci.c
index 96cb4da0794e..6f357b70b815 100644
--- a/tools/libs/light/libxl_pci.c
+++ b/tools/libs/light/libxl_pci.c
@@ -1513,6 +1513,14 @@ static void pci_add_dm_done(libxl__egc *egc,
 rc = ERROR_FAIL;
 goto out;
 }
+} else if (libxl_is_stubdom(ctx, domid, NULL)) {
+/* Allow access to MSI enable flag in PCI config space for the stubdom */
+if ( sysfs_write_bdf(gc, SYSFS_PCIBACK_DRIVER"/allow_interrupt_control",
+ pci) < 0 ) {
+LOGD(ERROR, domainid, "Setting allow_interrupt_control for device");
+rc = ERROR_FAIL;
+goto out;
+}
 }
 
 out_no_irq:
-- 
git-series 0.9.1



[PATCH 04/12] automation: increase verbosity of starting a domain

2024-05-16 Thread Marek Marczykowski-Górecki
And start collecting qemu log earlier, so it isn't lost in case of a
timeout during domain startup.
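
The log collection relies on tagging each stream with a prefix, so that guest console and qemu output stay distinguishable in the combined test log. A minimal sketch of that prefixing is below; `prefix_lines` is a hypothetical name, and the tag is assumed not to contain a `/` (it is pasted into a sed expression).

```shell
#!/bin/sh
# Hypothetical helper: prepend "(tag) " to every line read from stdin,
# the same way the test script tags domU console and qemu-dm output.
prefix_lines() {
    sed -e "s/^/($1) /"
}

printf 'line one\nline two\n' | prefix_lines qemu-dm
```

In the real script the input would come from a `tail -F` started before `xl create`, so output from a domain that hangs during startup is still captured.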

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/scripts/qemu-alpine-x86_64.sh| 2 +-
 automation/scripts/qemu-smoke-dom0-arm32.sh | 2 +-
 automation/scripts/qemu-smoke-dom0-arm64.sh | 2 +-
 automation/scripts/qubes-x86-64.sh  | 4 ++--
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/automation/scripts/qemu-alpine-x86_64.sh b/automation/scripts/qemu-alpine-x86_64.sh
index 8e398dcea34b..a188d60ea6f3 100755
--- a/automation/scripts/qemu-alpine-x86_64.sh
+++ b/automation/scripts/qemu-alpine-x86_64.sh
@@ -56,7 +56,7 @@ bash /etc/init.d/xencommons start
 
 xl list
 
-xl create -c /root/test.cfg
+xl -vvv create -c /root/test.cfg
 
 " > etc/local.d/xen.start
 chmod +x etc/local.d/xen.start
diff --git a/automation/scripts/qemu-smoke-dom0-arm32.sh b/automation/scripts/qemu-smoke-dom0-arm32.sh
index d91648905669..3d208cd55bfa 100755
--- a/automation/scripts/qemu-smoke-dom0-arm32.sh
+++ b/automation/scripts/qemu-smoke-dom0-arm32.sh
@@ -21,7 +21,7 @@ echo "#!/bin/bash
 
 xl list
 
-xl create -c /root/test.cfg
+xl -vvv create -c /root/test.cfg
 
 " > ./root/xen.start
 echo "bash /root/xen.start" >> ./etc/init.d/xen-watchdog
diff --git a/automation/scripts/qemu-smoke-dom0-arm64.sh b/automation/scripts/qemu-smoke-dom0-arm64.sh
index e0bb37af3610..afc24074eef8 100755
--- a/automation/scripts/qemu-smoke-dom0-arm64.sh
+++ b/automation/scripts/qemu-smoke-dom0-arm64.sh
@@ -52,7 +52,7 @@ bash /etc/init.d/xencommons start
 
 xl list
 
-xl create -c /root/test.cfg
+xl -vvv create -c /root/test.cfg
 
 " > etc/local.d/xen.start
 chmod +x etc/local.d/xen.start
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 4beeff17d31b..bd620b0d9273 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -112,7 +112,6 @@ echo \"${passed}\"
 "
 
 dom0_check="
-tail -F /var/log/xen/qemu-dm-domU.log &
 until grep -q \"^domU Welcome to Alpine Linux\" /var/log/xen/console/guest-domU.log; do
 sleep 1
 done
@@ -167,7 +166,8 @@ ifconfig xenbr0 192.168.0.1
 
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" &
-xl create /etc/xen/domU.cfg
+tail -F /var/log/xen/qemu-dm-domU.log 2>/dev/null | sed -e \"s/^/(qemu-dm) /\" &
+xl -vvv create /etc/xen/domU.cfg
 ${dom0_check}
 " > etc/local.d/xen.start
 chmod +x etc/local.d/xen.start
-- 
git-series 0.9.1



[PATCH 01/12] automation: include domU kernel messages in the console output log

2024-05-16 Thread Marek Marczykowski-Górecki
Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..4beeff17d31b 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -131,6 +131,8 @@ mkdir sys
 rm var/run
 echo "#!/bin/sh
 
+echo 8 > /proc/sys/kernel/printk
+
 ${domU_check}
 " > etc/local.d/xen.start
 chmod +x etc/local.d/xen.start
-- 
git-series 0.9.1



[PATCH 05/12] automation: prevent grub unpacking initramfs

2024-05-16 Thread Marek Marczykowski-Górecki
Grub fails to unpack a larger initramfs (a ~250MB one), so let Linux do it.

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/scripts/qubes-x86-64.sh | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index bd620b0d9273..77cb0d45815d 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -189,7 +189,7 @@ CONTROLLER=control@thor.testnet
 echo "
 multiboot2 (http)/gitlab-ci/xen $CONSOLE_OPTS loglvl=all guest_loglvl=all dom0_mem=4G console_timestamps=boot $extra_xen_opts
 module2 (http)/gitlab-ci/vmlinuz console=hvc0 root=/dev/ram0 earlyprintk=xen
-module2 (http)/gitlab-ci/initrd-dom0
+module2 --nounzip (http)/gitlab-ci/initrd-dom0
 " > $TFTP/grub.cfg
 
 cp -f binaries/xen $TFTP/xen
-- 
git-series 0.9.1



[PATCH 12/12] [DO NOT MERGE] switch to my containers fork

2024-05-16 Thread Marek Marczykowski-Górecki
---
 automation/gitlab-ci/build.yaml | 19 ---
 automation/gitlab-ci/test.yaml  |  9 -
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index f1e6a6144c90..88a59692a881 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -260,7 +260,7 @@
 
 alpine-3.19-arm64-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.19-arm64v8
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.19-arm64v8
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.19-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.19
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.19
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
 - cp /grub-core.img binaries/grub-core.img
@@ -335,7 +335,7 @@ alpine-3.19-rootfs-export:
 
 kernel-6.1.90-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.90
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.90
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
@@ -350,11 +350,13 @@ kernel-6.1.90-export:
 
 alpine-3.19-gcc:
   extends: .gcc-x86-64-build
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19
 
 alpine-3.19-gcc-debug:
   extends: .gcc-x86-64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19
 STUBDOM_LINUX: y
@@ -445,28 +447,33 @@ debian-bookworm-gcc-debug-arm64-randconfig:
 
 alpine-3.19-gcc-arm64:
   extends: .gcc-arm64-build
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 
 alpine-3.19-gcc-debug-arm64:
   extends: .gcc-arm64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 
 alpine-3.19-gcc-arm64-randconfig:
   extends: .gcc-arm64-build
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 RANDCONFIG: y
 
 alpine-3.19-gcc-debug-arm64-randconfig:
   extends: .gcc-arm64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 RANDCONFIG: y
 
 alpine-3.19-gcc-debug-arm64-staticmem:
   extends: .gcc-arm64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 EXTRA_XEN_CONFIG: |
@@ -476,6 +483,7 @@ alpine-3.19-gcc-debug-arm64-staticmem:
 
 alpine-3.19-gcc-debug-arm64-static-shared-mem:
   extends: .gcc-arm64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 EXTRA_XEN_CONFIG: |
@@ -485,6 +493,7 @@ alpine-3.19-gcc-debug-arm64-static-shared-mem:
 
 alpine-3.19-gcc-debug-arm64-boot-cpupools:
   extends: .gcc-arm64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19-arm64v8
 EXTRA_XEN_CONFIG: |
@@ -598,11 +607,13 @@ debian-bookworm-gcc-arm64-cppcheck:
 
 alpine-3.19-clang:
   extends: .clang-x86-64-build
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19
 
 alpine-3.19-clang-debug:
   extends: .clang-x86-64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: alpine:3.19
 
@@ -698,11 +709,13 @@ debian-bookworm-32-gcc-debug:
 
 fedora-gcc:
   extends: .gcc-x86-64-build
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: fedora:39
 
 fedora-gcc-debug:
   extends: .gcc-x86-64-build-debug
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
   variables:
 CONTAINER: fedora:39
 
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 4e4dca91c26e..0f36036d8275 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -1,6 +1,6 @@
 .test-jobs-common:
   stage: test
-  image: registry.gitlab.com/xen-project/xen/${CONTAINER}
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/${CONTAINER}
 
 .arm64-test-needs: &arm64-test-needs
   - alpine-3.19-arm64-rootfs-export
@@ -16,6 +16,7 @@
 
 .qemu-arm64:
   extends: .test-jobs-common
+  image: registry.gitlab.com/xen-project/xen/${CONTAINER}
   variables:
 CO

[PATCH 03/12] automation: switch to alpine:3.19

2024-05-16 Thread Marek Marczykowski-Górecki
Alpine 3.19 is needed for the upcoming stubdomain tests, as the linux stubdomain
build requires the dracut-core package (the dracut-install tool specifically),
which isn't available in 3.18. While technically it will be needed only
in the x86_64 builds, switch the Alpine version everywhere for uniformity.
Note this bumps the kernel version requirement on docker runners:
dracut-install uses the faccessat2() syscall, which was introduced in Linux
5.8.
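
A runner could be checked against that requirement with something like the sketch below. It is not part of this series; `kernel_at_least` is a hypothetical helper that only compares the major and minor numbers of a release string.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release string $1 (e.g. "6.1.90"
# or "5.10.0-28-amd64") is at least version $2.$3.
kernel_at_least() {
    release=$1 want_major=$2 want_minor=$3
    major=${release%%.*}          # text before the first dot
    rest=${release#*.}            # text after the first dot
    minor=${rest%%[!0-9]*}        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# faccessat2() needs Linux 5.8 or newer on the host running the container.
kernel_at_least "$(uname -r)" 5 8 ||
    echo "host kernel is older than 5.8, dracut-install may fail" >&2
```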

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/build/alpine/3.18-arm64v8.dockerfile   | 49 +--
 automation/build/alpine/3.18.dockerfile   | 51 +--
 automation/build/alpine/3.19-arm64v8.dockerfile   | 49 ++-
 automation/build/alpine/3.19.dockerfile   | 51 ++-
 automation/gitlab-ci/build.yaml   | 56 +++
 automation/gitlab-ci/test.yaml| 52 +++---
 automation/scripts/containerize   |  4 +-
 automation/tests-artifacts/alpine/3.18-arm64v8.dockerfile | 65 +
 automation/tests-artifacts/alpine/3.18.dockerfile | 66 +
 automation/tests-artifacts/alpine/3.19-arm64v8.dockerfile | 65 -
 automation/tests-artifacts/alpine/3.19.dockerfile | 67 -
 11 files changed, 288 insertions(+), 287 deletions(-)
 delete mode 100644 automation/build/alpine/3.18-arm64v8.dockerfile
 delete mode 100644 automation/build/alpine/3.18.dockerfile
 create mode 100644 automation/build/alpine/3.19-arm64v8.dockerfile
 create mode 100644 automation/build/alpine/3.19.dockerfile
 delete mode 100644 automation/tests-artifacts/alpine/3.18-arm64v8.dockerfile
 delete mode 100644 automation/tests-artifacts/alpine/3.18.dockerfile
 create mode 100644 automation/tests-artifacts/alpine/3.19-arm64v8.dockerfile
 create mode 100644 automation/tests-artifacts/alpine/3.19.dockerfile

diff --git a/automation/build/alpine/3.18-arm64v8.dockerfile b/automation/build/alpine/3.18-arm64v8.dockerfile
deleted file mode 100644
index 91e90220240f..
--- a/automation/build/alpine/3.18-arm64v8.dockerfile
+++ /dev/null
@@ -1,49 +0,0 @@
-FROM --platform=linux/arm64/v8 alpine:3.18
-LABEL maintainer.name="The Xen Project" \
-  maintainer.email="xen-devel@lists.xenproject.org"
-
-ENV USER root
-
-RUN mkdir /build
-WORKDIR /build
-
-# build depends
-RUN apk --no-cache add \
-  \
-  # xen build deps
-  argp-standalone \
-  autoconf \
-  bash \
-  bison \
-  curl \
-  dev86 \
-  dtc-dev \
-  flex \
-  gcc \
-  git \
-  iasl \
-  libaio-dev \
-  libfdt \
-  linux-headers \
-  make \
-  musl-dev  \
-  ncurses-dev \
-  ocaml \
-  ocaml-findlib \
-  patch  \
-  python3-dev \
-  py3-setuptools \
-  texinfo \
-  util-linux-dev \
-  xz-dev \
-  yajl-dev \
-  zlib-dev \
-  \
-  # qemu build deps
-  glib-dev \
-  libattr \
-  libcap-ng-dev \
-  pixman-dev \
-  # qubes test deps
-  openssh-client \
-  fakeroot \
diff --git a/automation/build/alpine/3.18.dockerfile b/automation/build/alpine/3.18.dockerfile
deleted file mode 100644
index 8d5dac05b01f..
--- a/automation/build/alpine/3.18.dockerfile
+++ /dev/null
@@ -1,51 +0,0 @@
-FROM --platform=linux/amd64 alpine:3.18
-LABEL maintainer.name="The Xen Project" \
-  maintainer.email="xen-devel@lists.xenproject.org"
-
-ENV USER root
-
-RUN mkdir /build
-WORKDIR /build
-
-# build depends
-RUN apk --no-cache add \
-  \
-  # xen build deps
-  argp-standalone \
-  autoconf \
-  bash \
-  bison \
-  clang \
-  curl \
-  dev86 \
-  flex \
-  g++ \
-  gcc \
-  git \
-  grep \
-  iasl \
-  libaio-dev \
-  libc6-compat \
-  linux-headers \
-  make \
-  musl-dev  \
-  ncurses-dev \
-  ocaml \
-  ocaml-findlib \
-  patch  \
-  python3-dev \
-  py3-setuptools \
-  texinfo \
-  util-linux-dev \
-  xz-dev \
-  yajl-dev \
-  zlib-dev \
-  \
-  # qemu build deps
-  glib-dev \
-  libattr \
-  libcap-ng-dev \
-  ninja \
-  pixman-dev \
-  # livepatch-tools deps
-  elfutils-dev \
diff --git a/automation/build/alpine/3.19-arm64v8.dockerfile b/automation/build/alpine/3.19-arm64v8.dockerfile
new file mode 100644
index ..158cf465a9ff
--- /dev/null
+++ b/automation/build/alpine/3.19-arm64v8.dockerfile
@@ -0,0 +1,49 @@
+FROM --platform=linux/arm64/v8 alpine:3.19
+LABEL maintainer.name="The Xen Project" \
+  maintainer.email="xen-devel@lists.xenproject.org"
+
+ENV USER root
+
+RUN mkdir /build
+WORKDIR /build
+
+# build depends
+RUN apk --no-cache add \
+  \
+  # xen build deps
+  argp-standalone \
+  autoconf \
+  bash \
+  bison \
+  curl \
+  dev86 \
+  dtc-dev \
+  flex \
+  gcc \
+  git \
+  iasl \
+  libaio-dev \
+  libfdt \
+  linux-headers \
+  make \
+  musl-dev  \
+  ncurses-dev \
+  ocaml \
+  ocaml-findlib \
+  patch  \
+  python3-dev \
+  py3-setuptools \
+  texinfo \
+  util-linux-dev \
+  xz-dev \
+  yajl-dev \
+  zlib-dev \
+  \
+  # qemu build deps
+  glib-dev \
+  libattr \
+  libcap-ng-dev \
+  pixman-dev \
+  # qubes test dep

[PATCH 06/12] RFC: automation: Add linux stubdom build and smoke test

2024-05-16 Thread Marek Marczykowski-Górecki
Add a minimal linux-stubdom smoke test. It starts a simple HVM with
linux-stubdom. The actual stubdom implementation is taken from Qubes OS
and then stripped of Qubes-specific code. In particular, the remaining
code does _not_ support:
 - direct kernel boot (implemented by relying on a specific guest disk
   layout in Qubes OS)
 - graphical console (used Qubes GUI agent injected into
   stubdomain's qemu)
 - audio input/output (used Qubes audio agent inside stubdomain)
 - USB passthrough (used qrexec <-> usbip proxy inside stubdomain)
 - setting up DHCP server (assumes guest addressing used in Qubes OS)

For this smoke test, the relevant part is the missing direct kernel boot, as
that's used in other smoke tests. Solve this by preparing a disk image
with a proper bootloader (grub) installed. Since the test script is
running on arm64 to control an x86_64 box, it cannot (easily) install grub
directly. For this reason, prepare the bootsector as part of the Xen build
(which runs on x86_64) and then prepend it to the disk image during the
test (and adjust the partition table afterwards).
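
The prepend step can be sketched roughly as follows. This is only an illustration: `assemble_image` is a hypothetical helper name, and the 2048-sector alignment assumes 512-byte sectors with the first partition starting at 1 MiB.

```shell
#!/bin/sh
# Hypothetical helper: build a raw disk image from prebuilt boot code and a
# filesystem image. The boot code goes first, the file is then resized to
# exactly 2048 sectors (1 MiB) so the filesystem starts on the partition 1
# boundary, and the filesystem is appended. Boot code larger than 1 MiB
# would be cut off by the truncate step.
assemble_image() {
    bootcode=$1 part=$2 output=$3
    cat "$bootcode" > "$output"
    truncate -s $((2048 * 512)) "$output"
    cat "$part" >> "$output"
}

# Example, followed by writing a partition table that covers the rest of
# the image:
#   assemble_image boot.img disk.img.part1 disk.img
#   echo ",,linux,*" | sfdisk disk.img
```

Doing the assembly with plain file operations is what lets the arm64 controller produce an x86_64-bootable image without running grub-install itself.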

Signed-off-by: Marek Marczykowski-Górecki 
---
The test is implemented using hardware runner, because some of the
further tests will require it (for example PCI passthrough with
stubdomain). But if there is strong desire to have stubdomain tested
inside qemu tests (to be included in patchew runs), it is probably an
option for this basic smoke test.

For now I'm keeping the stubdomain code (build and glue scripts) in a separate
repository on my github account. This is far from ideal. What would be
the preferred option? A new repository on xenbits? Or adding it directly into
xen.git (stubdom directory)? Honestly, I'd rather avoid the latter, as
from a packager's point of view those are mostly separate beings (similar to
qemu, where many use the distribution-provided one instead of the one bundled
with Xen) and it's convenient to not need to rebuild the stubdomain on every
hypervisor change (like a security patch).

Another topic is the QEMU version inside the stubdomain. It needs to be a
separate build due to vastly different configure options, so I cannot
reuse the qemu binary built for dom0 (or the distribution-provided one if
Xen is configured to use it). But also, at this moment qemu for the
stubdomain needs a few extra patches that are not upstream yet.
What should be the proper solution here (after upstreaming all the
patches)?

Generally, I try to add tests early, even though there is still some
work to do for proper stubdomain integration into upstream Xen, so any
cleanups and future changes (like the CDROM libxl patches by Jason
Andryuk) can be made with more confidence and reduce risk of
regressions.

The patch is RFC only because of the stubdom repository location.
---
 automation/build/alpine/3.19-arm64v8.dockerfile   |  2 +-
 automation/build/alpine/3.19.dockerfile   |  9 ++-
 automation/gitlab-ci/build.yaml   |  3 +-
 automation/gitlab-ci/test.yaml|  8 +-
 automation/scripts/build  | 12 ++-
 automation/scripts/qubes-x86-64.sh| 87 +++-
 automation/tests-artifacts/alpine/3.19.dockerfile |  6 +-
 7 files changed, 123 insertions(+), 4 deletions(-)

diff --git a/automation/build/alpine/3.19-arm64v8.dockerfile b/automation/build/alpine/3.19-arm64v8.dockerfile
index 158cf465a9ff..12810f87ecc6 100644
--- a/automation/build/alpine/3.19-arm64v8.dockerfile
+++ b/automation/build/alpine/3.19-arm64v8.dockerfile
@@ -47,3 +47,5 @@ RUN apk --no-cache add \
   # qubes test deps
   openssh-client \
   fakeroot \
+  sfdisk \
+  e2fsprogs \
diff --git a/automation/build/alpine/3.19.dockerfile b/automation/build/alpine/3.19.dockerfile
index 0be6d7c85fe7..108284613987 100644
--- a/automation/build/alpine/3.19.dockerfile
+++ b/automation/build/alpine/3.19.dockerfile
@@ -49,3 +49,12 @@ RUN apk --no-cache add \
   pixman-dev \
   # livepatch-tools deps
   elfutils-dev \
+  # stubdom deps
+  dracut-core \
+  quilt \
+  gnupg \
+  libseccomp-dev \
+  glib-static \
+  gmp-dev \
+  mpc1-dev \
+  mpfr-dev \
diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index b186289bbd82..783a0687ba34 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -323,9 +323,11 @@ alpine-3.19-rootfs-export:
   image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.19
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
+- cp /grub-core.img binaries/grub-core.img
   artifacts:
 paths:
   - binaries/initrd.tar.gz
+  - binaries/grub-core.img
   tags:
 - x86_64
 
@@ -353,6 +355,7 @@ alpine-3.19-gcc-debug:
   extends: .gcc-x86-64-build-debug
   variables:
 CONTAINER: alpine:3.19
+STUBDOM_LINUX: y
 
 debian-stretch-gcc-debug:
   extends: .gcc-x86-64-build-debug
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index f62d426a8d34..80d10eb7f476 100644
---

[PATCH 08/12] automation: update kernel for x86 tests

2024-05-16 Thread Marek Marczykowski-Górecki
Update 6.1.x kernel to the latest version in this branch. This is
especially needed to include MSI-X related fixes for stubdomain
("xen-pciback: Consider INTx disabled when MSI/MSI-X is enabled").

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/gitlab-ci/build.yaml |  4 +-
 automation/gitlab-ci/test.yaml  |  2 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 40 +--
 automation/tests-artifacts/kernel/6.1.90.dockerfile | 40 ++-
 4 files changed, 43 insertions(+), 43 deletions(-)
 delete mode 100644 automation/tests-artifacts/kernel/6.1.19.dockerfile
 create mode 100644 automation/tests-artifacts/kernel/6.1.90.dockerfile

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 783a0687ba34..9b9e5464f179 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -331,9 +331,9 @@ alpine-3.19-rootfs-export:
   tags:
 - x86_64
 
-kernel-6.1.19-export:
+kernel-6.1.90-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.90
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index 80d10eb7f476..e3910f4c1a9f 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -12,7 +12,7 @@
 
 .x86-64-test-needs: &x86-64-test-needs
   - alpine-3.19-rootfs-export
-  - kernel-6.1.19-export
+  - kernel-6.1.90-export
 
 .qemu-arm64:
   extends: .test-jobs-common
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
deleted file mode 100644
index 3a4096780d20..
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ /dev/null
@@ -1,40 +0,0 @@
-FROM --platform=linux/amd64 debian:bookworm
-LABEL maintainer.name="The Xen Project" \
-  maintainer.email="xen-devel@lists.xenproject.org"
-
-ENV DEBIAN_FRONTEND=noninteractive
-ENV LINUX_VERSION=6.1.19
-ENV USER root
-
-RUN mkdir /build
-WORKDIR /build
-
-# build depends
-RUN apt-get update && \
-apt-get --quiet --yes install \
-build-essential \
-libssl-dev \
-bc \
-curl \
-flex \
-bison \
-libelf-dev \
-&& \
-apt-get autoremove -y && \
-apt-get clean && \
-rm -rf /var/lib/apt/lists* /tmp/* /var/tmp/*
-
-# Build the kernel
-RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSION".tar.xz && \
-tar xvJf linux-"$LINUX_VERSION".tar.xz && \
-cd linux-"$LINUX_VERSION" && \
-make defconfig && \
-make xen.config && \
-scripts/config --enable BRIDGE && \
-scripts/config --enable IGC && \
-cp .config .config.orig && \
-cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
-make -j$(nproc) bzImage && \
-cp arch/x86/boot/bzImage / && \
-cd /build && \
-rm -rf linux-"$LINUX_VERSION"*
diff --git a/automation/tests-artifacts/kernel/6.1.90.dockerfile b/automation/tests-artifacts/kernel/6.1.90.dockerfile
new file mode 100644
index ..46cadf02ca78
--- /dev/null
+++ b/automation/tests-artifacts/kernel/6.1.90.dockerfile
@@ -0,0 +1,40 @@
+FROM --platform=linux/amd64 debian:bookworm
+LABEL maintainer.name="The Xen Project" \
+  maintainer.email="xen-devel@lists.xenproject.org"
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LINUX_VERSION=6.1.90
+ENV USER root
+
+RUN mkdir /build
+WORKDIR /build
+
+# build depends
+RUN apt-get update && \
+apt-get --quiet --yes install \
+build-essential \
+libssl-dev \
+bc \
+curl \
+flex \
+bison \
+libelf-dev \
+&& \
+apt-get autoremove -y && \
+apt-get clean && \
+rm -rf /var/lib/apt/lists* /tmp/* /var/tmp/*
+
+# Build the kernel
+RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSION".tar.xz && \
+tar xvJf linux-"$LINUX_VERSION".tar.xz && \
+cd linux-"$LINUX_VERSION" && \
+make defconfig && \
+make xen.config && \
+scripts/config --enable BRIDGE && \
+scripts/config --enable IGC && \
+cp .config .config.orig && \
+cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
+make -j$(nproc) bzImage && \
+cp arch/x86/boot/bzImage / && \
+cd /build && \
+rm -rf linux-"$LINUX_VERSION"*
-- 
git-series 0.9.1



[PATCH 10/12] automation: stubdom test with PCI passthrough

2024-05-16 Thread Marek Marczykowski-Górecki
Based on the initial stubdomain test and existing PCI passthrough tests,
add one that combines both.
Schedule it on the AMD runner, as it has fewer tests right now.
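
The new variant reuses the HVM path: the domain type is derived from the test name by stripping the "pci-" prefix, with "stubdom" mapped to "hvm". A minimal sketch of that mapping, using a hypothetical `domain_type_for` helper that the script itself does not define:

```shell
#!/bin/sh
# Hypothetical helper: derive the xl domain type from a test variant name.
# "pci-pv" -> "pv", "pci-hvm" -> "hvm"; a stubdomain test is still an HVM
# guest, so "pci-stubdom" -> "hvm".
domain_type_for() {
    type=${1#pci-}
    if [ "$type" = "stubdom" ]; then
        type=hvm
    fi
    echo "$type"
}

domain_type_for pci-stubdom   # prints: hvm
```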

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/gitlab-ci/test.yaml |  8 
 automation/scripts/qubes-x86-64.sh | 30 +-
 2 files changed, 33 insertions(+), 5 deletions(-)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index e3910f4c1a9f..76cc430ae00f 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -231,6 +231,14 @@ zen3p-pci-hvm-x86-64-gcc-debug:
 - *x86-64-test-needs
 - alpine-3.19-gcc-debug
 
+zen3p-pci-stubdom-x86-64-gcc-debug:
+  extends: .zen3p-x86-64
+  script:
+- ./automation/scripts/qubes-x86-64.sh pci-stubdom 2>&1 | tee ${LOGFILE}
+  needs:
+- *x86-64-test-needs
+- alpine-3.19-gcc-debug
+
 qemu-smoke-dom0-arm64-gcc:
   extends: .qemu-arm64
   script:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index fc73403dbadf..816c16fbab3e 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -98,8 +98,8 @@ ping -c 10 192.168.0.2 || exit 1
 echo \"${passed}\"
 "
 
-### test: pci-pv, pci-hvm
-elif [ "${test_variant}" = "pci-pv" ] || [ "${test_variant}" = "pci-hvm" ]; then
+### test: pci-pv, pci-hvm, pci-stubdom
+elif [ "${test_variant}" = "pci-pv" ] || [ "${test_variant}" = "pci-hvm" ] || [ "${test_variant}" = "pci-stubdom" ]; then
 
 if [ -z "$PCIDEV" ]; then
 echo "Please set 'PCIDEV' variable with BDF of test network adapter" >&2
@@ -109,15 +109,35 @@ elif [ "${test_variant}" = "pci-pv" ] || [ "${test_variant}" = "pci-hvm" ]; then
 
 passed="pci test passed"
 
-domU_config='
+domain_type="${test_variant#pci-}"
+if [ "$test_variant" = "pci-stubdom" ]; then
+domain_type="hvm"
+domU_config='
+type = "hvm"
+disk = [ "/srv/disk.img,format=raw,vdev=xvda" ]
+device_model_version = "qemu-xen"
+device_model_stubdomain_override = 1
+on_reboot = "destroy"
+# libxl configures vkb backend to be dom0 instead of the stubdomain, defer
+# changing that until there is consensus what to do about VGA output (VNC)
+vkb_device = 0
+'
+domU_disk_path=/srv/disk.img
+else
+domU_config='
 type = "'${test_variant#pci-}'"
-name = "domU"
 kernel = "/boot/vmlinuz"
 ramdisk = "/boot/initrd-domU"
 extra = "root=/dev/ram0 console=hvc0 earlyprintk=xen"
+disk = [ ]
+'
+fi
+
+# common part
+domU_config="$domU_config"'
+name = "domU"
 memory = 512
 vif = [ ]
-disk = [ ]
 pci = [ "'$PCIDEV',seize=1" ]
 on_reboot = "destroy"
 '
-- 
git-series 0.9.1



[PATCH 02/12] automation: update fedora build to F39

2024-05-16 Thread Marek Marczykowski-Górecki
Fedora 29 is long EOL

Signed-off-by: Marek Marczykowski-Górecki 
---
 automation/build/fedora/29.dockerfile | 46 +
 automation/build/fedora/39.dockerfile | 46 -
 automation/gitlab-ci/build.yaml   |  4 +-
 3 files changed, 48 insertions(+), 48 deletions(-)
 delete mode 100644 automation/build/fedora/29.dockerfile
 create mode 100644 automation/build/fedora/39.dockerfile

diff --git a/automation/build/fedora/29.dockerfile b/automation/build/fedora/29.dockerfile
deleted file mode 100644
index f473ae13e7c1..
--- a/automation/build/fedora/29.dockerfile
+++ /dev/null
@@ -1,46 +0,0 @@
-FROM --platform=linux/amd64 fedora:29
-LABEL maintainer.name="The Xen Project" \
-  maintainer.email="xen-devel@lists.xenproject.org"
-
-# install Xen depends
-RUN dnf -y install \
-clang \
-gcc \
-gcc-c++ \
-ncurses-devel \
-zlib-devel \
-openssl-devel \
-python-devel \
-python3-devel \
-libuuid-devel \
-pkgconfig \
-flex \
-bison \
-libaio-devel \
-glib2-devel \
-yajl-devel \
-pixman-devel \
-glibc-devel \
-make \
-binutils \
-git \
-wget \
-acpica-tools \
-python-markdown \
-patch \
-checkpolicy \
-dev86 \
-xz-devel \
-bzip2 \
-nasm \
-ocaml \
-ocaml-findlib \
-golang \
-# QEMU
-ninja-build \
-&& dnf clean all && \
-rm -rf /var/cache/dnf
-
-RUN useradd --create-home user
-USER user
-WORKDIR /build
diff --git a/automation/build/fedora/39.dockerfile b/automation/build/fedora/39.dockerfile
new file mode 100644
index ..054f73444060
--- /dev/null
+++ b/automation/build/fedora/39.dockerfile
@@ -0,0 +1,46 @@
+FROM --platform=linux/amd64 fedora:39
+LABEL maintainer.name="The Xen Project" \
+  maintainer.email="xen-devel@lists.xenproject.org"
+
+# install Xen depends
+RUN dnf -y install \
+clang \
+gcc \
+gcc-c++ \
+ncurses-devel \
+zlib-devel \
+openssl-devel \
+python-devel \
+python3-devel \
+libuuid-devel \
+pkgconfig \
+flex \
+bison \
+libaio-devel \
+glib2-devel \
+yajl-devel \
+pixman-devel \
+glibc-devel \
+make \
+binutils \
+git \
+wget \
+acpica-tools \
+python-markdown \
+patch \
+checkpolicy \
+dev86 \
+xz-devel \
+bzip2 \
+nasm \
+ocaml \
+ocaml-findlib \
+golang \
+# QEMU
+ninja-build \
+&& dnf clean all && \
+rm -rf /var/cache/dnf
+
+RUN useradd --create-home user
+USER user
+WORKDIR /build
diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 49d6265ad5b4..69665ec5b11f 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -691,12 +691,12 @@ debian-bookworm-32-gcc-debug:
 fedora-gcc:
   extends: .gcc-x86-64-build
   variables:
-CONTAINER: fedora:29
+CONTAINER: fedora:39
 
 fedora-gcc-debug:
   extends: .gcc-x86-64-build-debug
   variables:
-CONTAINER: fedora:29
+CONTAINER: fedora:39
 
 # Ubuntu Trusty's Clang is 3.4 while Xen requires 3.5
 
-- 
git-series 0.9.1



[PATCH 00/12] automation: Add build and test for Linux stubdomain

2024-05-16 Thread Marek Marczykowski-Górecki
Initial patches can be applied independently but all are needed before the
"automation: Add linux stubdom build and smoke test" patch.
And later "libxl: Allow stubdomain to control interupts of PCI device" and
"automation: update kernel for x86 tests" are needed before the PCI passthrough
test (but both can be committed earlier as they don't depend on others).

See the "automation: Add linux stubdom build and smoke test" patch description
for more details.

Note the Alpine version bump requires rebuilding containers, but so does the
actual test patch (extra dependencies), so it probably makes sense to do it at
the same time.

Marek Marczykowski-Górecki (12):
  automation: include domU kernel messages in the console output log
  automation: update fedora build to F39
  automation: switch to alpine:3.19
  automation: increase verbosity of starting a domain
  automation: prevent grub unpacking initramfs
  RFC: automation: Add linux stubdom build and smoke test
  libxl: Allow stubdomain to control interupts of PCI device
  automation: update kernel for x86 tests
  WIP: automation: temporarily add 'testlab' tag to stubdomain build
  automation: stubdom test with PCI passthrough
  automation: stubdom test with boot from CDROM
  [DO NOT MERGE] switch to my containers fork

 automation/build/alpine/3.18-arm64v8.dockerfile   |  49 +--
 automation/build/alpine/3.18.dockerfile   |  51 +--
 automation/build/alpine/3.19-arm64v8.dockerfile   |  52 ++-
 automation/build/alpine/3.19.dockerfile   |  60 +++-
 automation/build/fedora/29.dockerfile |  46 +--
 automation/build/fedora/39.dockerfile |  46 ++-
 automation/gitlab-ci/build.yaml   |  85 ++--
 automation/gitlab-ci/test.yaml|  87 ++--
 automation/scripts/build  |  12 +-
 automation/scripts/containerize   |   4 +-
 automation/scripts/qemu-alpine-x86_64.sh  |   2 +-
 automation/scripts/qemu-smoke-dom0-arm32.sh   |   2 +-
 automation/scripts/qemu-smoke-dom0-arm64.sh   |   2 +-
 automation/scripts/qubes-x86-64.sh| 153 ++-
 automation/tests-artifacts/alpine/3.18-arm64v8.dockerfile |  65 +---
 automation/tests-artifacts/alpine/3.18.dockerfile |  66 +---
 automation/tests-artifacts/alpine/3.19-arm64v8.dockerfile |  65 +++-
 automation/tests-artifacts/alpine/3.19.dockerfile |  72 +++-
 automation/tests-artifacts/kernel/6.1.19.dockerfile   |  40 +--
 automation/tests-artifacts/kernel/6.1.90.dockerfile   |  40 ++-
 tools/libs/light/libxl_pci.c  |   8 +-
 21 files changed, 614 insertions(+), 393 deletions(-)
 delete mode 100644 automation/build/alpine/3.18-arm64v8.dockerfile
 delete mode 100644 automation/build/alpine/3.18.dockerfile
 create mode 100644 automation/build/alpine/3.19-arm64v8.dockerfile
 create mode 100644 automation/build/alpine/3.19.dockerfile
 delete mode 100644 automation/build/fedora/29.dockerfile
 create mode 100644 automation/build/fedora/39.dockerfile
 delete mode 100644 automation/tests-artifacts/alpine/3.18-arm64v8.dockerfile
 delete mode 100644 automation/tests-artifacts/alpine/3.18.dockerfile
 create mode 100644 automation/tests-artifacts/alpine/3.19-arm64v8.dockerfile
 create mode 100644 automation/tests-artifacts/alpine/3.19.dockerfile
 delete mode 100644 automation/tests-artifacts/kernel/6.1.19.dockerfile
 create mode 100644 automation/tests-artifacts/kernel/6.1.90.dockerfile

base-commit: 319a5125ca2649e6eb95670b4d721260025c187d
-- 
git-series 0.9.1



Linux HVM fails to start with PANIC: early exception 0x00 IP 0010:clear_page_orig+0x12/0x40 error 0

2024-05-11 Thread Marek Marczykowski-Górecki
Hi,

I've got a report[1] that after some update Linux HVM fails to start with the
error as in the subject. It looks to be caused by some change between
Xen 4.17.3 and 4.17.4. Here the failure is on Linux 6.6.25 (both dom0
and domU), but the 6.1.62 that worked with older Xen before, now fails
too. The full error (logged via earlyprintk=xen) is:

[0.009500] Using GB pages for direct mapping
PANIC: early exception 0x00 IP 10:b01c32e2 error 0 cr2 0xa08649801000
[0.009606] CPU: 0 PID: 0 Comm: swapper Not tainted 6.6.25-1.qubes.fc37.x86_64 #1
[0.009665] Hardware name: Xen HVM domU, BIOS 4.17.4 04/26/2024
[0.009710] RIP: 0010:clear_page_orig+0x12/0x40
[0.009766] Code: 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 31 c0 b9 40 00 00 00 0f 1f 44 00 00 ff c9 <48> 89 07 48 89 47 08 48 89 47 10 48 89 47 18 48 89 47 20 48 89 47
[0.009862] RSP: :b0e03d58 EFLAGS: 00010016 ORIG_RAX: 
[0.009915] RAX:  RBX:  RCX: 003f
[0.009967] RDX: 9801 RSI:  RDI: a08649801000
[0.010015] RBP: 0001 R08: 0001 R09: 6b7f283562d74b16
[0.010063] R10:  R11:  R12: 0001
[0.010112] R13:  R14: b0e22a08 R15: a0864000
[0.010161] FS:  () GS:b16ea000() knlGS:
[0.010214] CS:  0010 DS:  ES:  CR0: 80050033
[0.010257] CR2: a08649801000 CR3: 08e8 CR4: 00b0
[0.010310] Call Trace:
[0.010341]  
[0.010372]  ? early_fixup_exception+0xf7/0x190
[0.010416]  ? early_idt_handler_common+0x2f/0x3a
[0.010460]  ? clear_page_orig+0x12/0x40
[0.010501]  ? alloc_low_pages+0xeb/0x150
[0.010541]  ? __kernel_physical_mapping_init+0x1d2/0x630
[0.010588]  ? init_memory_mapping+0x83/0x160
[0.010631]  ? init_mem_mapping+0x9a/0x460
[0.010669]  ? memblock_reserve+0x6d/0xf0
[0.010709]  ? setup_arch+0x796/0xf90
[0.010748]  ? start_kernel+0x63/0x420
[0.010787]  ? x86_64_start_reservations+0x18/0x30
[0.010828]  ? x86_64_start_kernel+0x96/0xa0
[0.010868]  ? secondary_startup_64_no_verify+0x18f/0x19b
[0.010918]  

I'm pretty sure the exception 0 is misleading here; I don't see how it
could be #DE.

More logs (including full hypervisor log) are attached to the linked
issue.

This is on an HP 240 g7, and my educated guess is that it has an Intel Celeron
N4020 CPU. I cannot reproduce the issue on different hardware.

PVH domains seem to work.

Any ideas what could have happened here?

[1] https://github.com/QubesOS/qubes-issues/issues/9217

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH v8 0/6] MSI-X support with qemu in stubdomain, and other related changes

2024-05-09 Thread Marek Marczykowski-Górecki
This series includes changes to make MSI-X work with a Linux stubdomain,
especially with the Intel Wifi 6 AX210 card. It takes care of the remaining
reasons for QEMU to access /dev/mem, and also handles the Intel Wifi card
violating the spec by putting some registers on the same page as the MSI-X
table.
Besides the stubdomain case (which I care about more), this is also necessary
for PCI passthrough to work with lockdown enabled in dom0 (when QEMU runs in
dom0).

See individual patches for details.

This series also includes tests for MSI-X using the new approach (preventing
QEMU from accessing /dev/mem). For them to work, a QEMU change is needed that
makes use of the changes introduced here. It can be seen at
https://github.com/marmarek/qemu/commits/msix

Here is the pipeline that used the QEMU fork above:
https://gitlab.com/xen-project/people/marmarek/xen/-/pipelines/1285354093
(the build failures are due to issues with fetching or building newer QEMU
 discussed on Matrix)

v7:
 - "x86/msi: passthrough all MSI-X vector ctrl writes to device model" is
   already applied

Marek Marczykowski-Górecki (6):
  x86/msi: Extend per-domain/device warning mechanism
  x86/hvm: Allow access to registers on the same page as MSI-X table
  automation: prevent QEMU access to /dev/mem in PCI passthrough tests
  automation: switch to a wifi card on ADL system
  [DO NOT APPLY] switch to qemu fork
  [DO NOT APPLY] switch to alternative artifact repo

 Config.mk   |   4 +-
 automation/gitlab-ci/build.yaml |   4 +-
 automation/gitlab-ci/test.yaml  |   4 +-
 automation/scripts/qubes-x86-64.sh  |   9 +-
 automation/tests-artifacts/alpine/3.18.dockerfile   |   7 +-
 automation/tests-artifacts/kernel/6.1.19.dockerfile |   2 +-
 xen/arch/x86/hvm/vmsi.c | 208 -
 xen/arch/x86/include/asm/msi.h  |  22 +-
 xen/arch/x86/msi.c  |  47 ++-
 9 files changed, 288 insertions(+), 19 deletions(-)

base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
-- 
git-series 0.9.1



[PATCH v8 6/6] [DO NOT APPLY] switch to alternative artifact repo

2024-05-09 Thread Marek Marczykowski-Górecki
For testing, switch to my containers registry that includes containers
rebuilt with changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 49d6265ad5b4..0d7e311417d8 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v8 1/6] x86/msi: Extend per-domain/device warning mechanism

2024-05-09 Thread Marek Marczykowski-Górecki
The arch_msix struct had a single "warned" field holding the domid for which
a warning was issued. An upcoming patch will need a similar mechanism for a few
more warnings, so change it to save a bit field of issued warnings.
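
As an illustration only (not part of the patch), the warn-once logic can be
sketched as a plain function instead of a macro. The names warn_state and
check_warn_maskall are hypothetical, and domid_t is a stand-in typedef; the
fields mirror the ones the patch adds to struct arch_msix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t domid_t;       /* stand-in for Xen's domid_t */

/* Hypothetical cut-down copy of the warning state in struct arch_msix. */
struct warn_state {
    domid_t warned_domid;       /* domain the flags below refer to */
    union {
        uint8_t all;            /* all flags at once, for a cheap reset */
        struct {
            bool maskall : 1;   /* one bit per kind of warning */
        };
    } warned_kind;
};

/*
 * Return true only the first time it is called for a given domain: a change
 * of domain clears every flag, repeated calls for the same domain return
 * false until another domain is seen.
 */
static bool check_warn_maskall(struct warn_state *s, domid_t domid)
{
    assert(s);

    if ( s->warned_domid != domid )
    {
        s->warned_domid = domid;
        s->warned_kind.all = 0;
    }

    if ( s->warned_kind.maskall )
        return false;

    s->warned_kind.maskall = true;
    return true;
}
```

A caller would wrap its printk() in "if ( check_warn_maskall(&state, domid) )".
Adding another warning kind then only costs one more bit in the union, which
is presumably why the patch generalises over the field name with a macro.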

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v6:
- add MSIX_CHECK_WARN macro (Jan)
- drop struct name from warned_kind union (Jan)

New in v5
---
 xen/arch/x86/include/asm/msi.h | 17 -
 xen/arch/x86/msi.c |  5 +
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..bcfdfd35345d 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -208,6 +208,15 @@ struct msg_address {
PCI_MSIX_ENTRY_SIZE + \
(~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1)))
 
+#define MSIX_CHECK_WARN(msix, domid, which) ({ \
+if ( (msix)->warned_domid != (domid) ) \
+{ \
+(msix)->warned_domid = (domid); \
+(msix)->warned_kind.all = 0; \
+} \
+(msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \
+})
+
 struct arch_msix {
 unsigned int nr_entries, used_entries;
 struct {
@@ -217,7 +226,13 @@ struct arch_msix {
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
 bool host_maskall, guest_maskall;
-domid_t warned;
+domid_t warned_domid;
+union {
+uint8_t all;
+struct {
+bool maskall   : 1;
+};
+} warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..42c793426da3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 domid_t domid = pdev->domain->domain_id;
 
 maskall = true;
-if ( pdev->msix->warned != domid )
-{
-pdev->msix->warned = domid;
+if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
 printk(XENLOG_G_WARNING
"cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
desc->irq, domid, &pdev->sbdf);
-}
 }
 pdev->msix->host_maskall = maskall;
 if ( maskall || pdev->msix->guest_maskall )
-- 
git-series 0.9.1



[PATCH v8 4/6] automation: switch to a wifi card on ADL system

2024-05-09 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on an MSI-X page. This tests the
"x86/hvm: Allow access to registers on the same page as MSI-X table"
feature. Switch it only for the HVM test, because MSI-X adjacent write is
not supported on PV.

This also requires including drivers and firmware in the system used for tests.
Remove firmware unrelated to the test, so as not to increase the initrd size
too much (all firmware takes over 100MB compressed).
Finally, adjust the test script to handle not only eth0 as a test device,
but also wlan0, and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Stefano Stabellini 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml  | 4 
 automation/scripts/qubes-x86-64.sh  | 7 +++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index ad249fa0a5d9..6803cae116b5 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1



[PATCH v8 3/6] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2024-05-09 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 in lockdown and in stubdomain.
Simulate this environment by removing the /dev/mem device node. A full test
for lockdown and stubdomain will come later, when all requirements are
in place.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Stefano Stabellini 
---
This can be applied only after QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU will not have access to /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v8 5/6] [DO NOT APPLY] switch to qemu fork

2024-05-09 Thread Marek Marczykowski-Górecki
This makes the tests use the patched QEMU, to actually test the new behavior.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1



[PATCH v8 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-05-09 Thread Marek Marczykowski-Górecki
Some devices (notably the Intel Wifi 6 AX210 card) keep auxiliary registers
on the same page as the MSI-X table. The device model (especially one in a
stubdomain) cannot really handle those, as direct writes to that page are
refused (the page is on the mmio_ro_ranges list). Instead, extend
msixtbl_mmio_ops to handle such accesses too.

Doing this requires correlating the read/write location with the guest
MSI-X table address. Since QEMU doesn't map the MSI-X table to the guest,
it requires msixtbl_entry->gtable, which is HVM-only. A similar feature
for PV would need to be done separately.

This will also be used to read the Pending Bit Array, if it lives on the same
page, so that QEMU does not need /dev/mem access at all (especially helpful
with lockdown enabled in dom0). If the PBA lives on another page, QEMU will
map it to the guest directly.
If the PBA lives on the same page, discard writes and log a message.
Technically, writes outside of the PBA could be allowed, but at this moment
the precise location of the PBA isn't saved, and also no known device abuses
the spec in this way (at least yet).

To access those registers, msixtbl_mmio_ops needs the relevant page
mapped. MSI handling already has infrastructure for that, using fixmap,
so try to map the first/last page of the MSI-X table (if necessary) and save
their fixmap indexes. Note that msix_get_fixmap() does reference
counting and reuses an existing mapping, so just call it directly, even if
the page was mapped before. Also, it uses a specific range of fixmap
indexes which doesn't include 0, so use 0 as the default ("not mapped")
value, which simplifies the code a bit.

Based on the assumption that all MSI-X page accesses are handled by Xen, do
not forward adjacent accesses to other hypothetical ioreq servers, even
if the access wasn't handled for some reason (failure to map pages etc).
Relevant places log a message about that already.

Signed-off-by: Marek Marczykowski-Górecki 
---
Changes in v8:
- rename adjacent_handle to get_adjacent_idx
- put SBDF at the start of error messages
- use 0 for ADJACENT_DONT_HANDLE (it's FIX_RESERVED)
- merge conditions in msixtbl_range into one "if"
- add assert for address alignment
- change back to setting pval to ~0UL at the start of adjacent_read
Changes in v7:
- simplify logic based on assumption that all access to MSI-X pages are
  handled by Xen (Roger)
- move calling adjacent_handle() into adjacent_{read,write}() (Roger)
- move range check into msixtbl_addr_to_desc() (Roger)
- fix off-by-one when initializing adj_access_idx[ADJ_IDX_LAST] (Roger)
- no longer distinguish between unhandled write due to PBA nearby and
  other reasons
- add missing break after ASSERT_UNREACHABLE (Jan)
Changes in v6:
- use MSIX_CHECK_WARN macro
- extend assert on fixmap_idx
- add break in default label, after ASSERT_UNREACHABLE(), and move
  setting default there
- style fixes
Changes in v5:
- style fixes
- include GCC version in the commit message
- warn only once (per domain, per device) about failed adjacent access
Changes in v4:
- drop same_page parameter of msixtbl_find_entry(), distinguish two
  cases in relevant callers
- rename adj_access_table_idx to adj_access_idx
- code style fixes
- drop alignment check in adjacent_{read,write}() - all callers already
  have it earlier
- delay mapping first/last MSI-X pages until preparing device for a
  passthrough
v3:
 - merge handling into msixtbl_mmio_ops
 - extend commit message
v2:
 - adjust commit message
 - pass struct domain to msixtbl_page_handler_get_hwaddr()
 - reduce local variables used only once
 - log a warning if write is forbidden if MSI-X and PBA lives on the same
   page
 - do not passthrough unaligned accesses
 - handle accesses both before and after MSI-X table
---
 xen/arch/x86/hvm/vmsi.c| 208 --
 xen/arch/x86/include/asm/msi.h |   5 +-
 xen/arch/x86/msi.c |  42 +++-
 3 files changed, 245 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/hvm/vmsi.c b/xen/arch/x86/hvm/vmsi.c
index 17983789..d506d6adaaf6 100644
--- a/xen/arch/x86/hvm/vmsi.c
+++ b/xen/arch/x86/hvm/vmsi.c
@@ -180,6 +180,10 @@ static bool msixtbl_initialised(const struct domain *d)
 return d->arch.hvm.msixtbl_list.next;
 }
 
+/*
+ * Lookup an msixtbl_entry on the same page as given addr. It's up to the
+ * caller to check if address is strictly part of the table - if relevant.
+ */
 static struct msixtbl_entry *msixtbl_find_entry(
 struct vcpu *v, unsigned long addr)
 {
@@ -187,8 +191,8 @@ static struct msixtbl_entry *msixtbl_find_entry(
 struct domain *d = v->domain;
 
 list_for_each_entry( entry, &d->arch.hvm.msixtbl_list, list )
-if ( addr >= entry->gtable &&
- addr < entry->gtable + entry->table_len )
+if ( PFN_DOWN(addr) >= PFN_DOWN(entry->gtable) &&
+ PFN_DOWN(addr) <= PFN_DOWN(entry->gtable + entry->table_len - 1) )
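
To illustrate what the changed condition does (a standalone sketch, not the
Xen code): the old check matched only addresses inside the table, while the
new one matches any address on a page that holds part of the table. The
helper names on_table_page and in_table are hypothetical, and 4 KiB pages
are assumed.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)     /* address -> page frame number */

/*
 * True if addr falls on any page that holds part of the table
 * [gtable, gtable + table_len), even if addr is outside the table itself.
 */
static bool on_table_page(unsigned long addr, unsigned long gtable,
                          unsigned long table_len)
{
    assert(table_len > 0);

    return PFN_DOWN(addr) >= PFN_DOWN(gtable) &&
           PFN_DOWN(addr) <= PFN_DOWN(gtable + table_len - 1);
}

/* True only if addr is strictly inside the table (the pre-patch check). */
static bool in_table(unsigned long addr, unsigned long gtable,
                     unsigned long table_len)
{
    return addr >= gtable && addr < gtable + table_len;
}
```

With the page-granular lookup, a caller that cares about the table proper
still has to apply the stricter in_table() style check itself, which is what
the new comment above msixtbl_find_entry() points out.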
 

Re: [PATCH v7 2/6] x86/hvm: Allow access to registers on the same page as MSI-X table

2024-05-08 Thread Marek Marczykowski-Górecki
On Wed, May 08, 2024 at 06:09:48PM +0200, Roger Pau Monné wrote:
> On Tue, May 07, 2024 at 02:44:02PM +0200, Marek Marczykowski-Górecki wrote:
> > Some devices (notably Intel Wifi 6 AX210 card) keep auxiliary registers
> > on the same page as MSI-X table. Device model (especially one in
> > stubdomain) cannot really handle those, as direct writes to that page is
> > refused (page is on the mmio_ro_ranges list). Instead, extend
> > msixtbl_mmio_ops to handle such accesses too.
> > 
> > Doing this, requires correlating read/write location with guest
> > MSI-X table address. Since QEMU doesn't map MSI-X table to the guest,
> > it requires msixtbl_entry->gtable, which is HVM-only. Similar feature
> > for PV would need to be done separately.
> > 
> > This will be also used to read Pending Bit Array, if it lives on the same
> > page, making QEMU not needing /dev/mem access at all (especially helpful
> > with lockdown enabled in dom0). If PBA lives on another page, QEMU will
> > map it to the guest directly.
> > If PBA lives on the same page, discard writes and log a message.
> > Technically, writes outside of PBA could be allowed, but at this moment
> > the precise location of PBA isn't saved, and also no known device abuses
> > the spec in this way (at least yet).
> > 
> > To access those registers, msixtbl_mmio_ops need the relevant page
> > mapped. MSI handling already has infrastructure for that, using fixmap,
> > so try to map first/last page of the MSI-X table (if necessary) and save
> > their fixmap indexes. Note that msix_get_fixmap() does reference
> > counting and reuses existing mapping, so just call it directly, even if
> > the page was mapped before. Also, it uses a specific range of fixmap
> > indexes which doesn't include 0, so use 0 as default ("not mapped")
> > value - which simplifies code a bit.
> > 
> > Based on assumption that all MSI-X page accesses are handled by Xen, do
> > not forward adjacent accesses to other hypothetical ioreq servers, even
> > if the access wasn't handled for some reason (failure to map pages etc).
> > Relevant places log a message about that already.
> > 
> > Signed-off-by: Marek Marczykowski-Górecki 
> 
> Thanks, just a couple of minor comments, I think the only relevant one
> is that you can drop ADJACENT_DONT_HANDLE unless there's something
> I'm missing.  The rest are mostly cosmetic, but if you have to respin
> and agree with them might be worth addressing.
> 
> Sorry for giving this feedback so late in the process, I should have
> attempted to review earlier versions.
> 
> > ---
> > Changes in v7:
> > - simplify logic based on assumption that all access to MSI-X pages are
> >   handled by Xen (Roger)
> > - move calling adjacent_handle() into adjacent_{read,write}() (Roger)
> > - move range check into msixtbl_addr_to_desc() (Roger)
> > - fix off-by-one when initializing adj_access_idx[ADJ_IDX_LAST] (Roger)
> > - no longer distinguish between unhandled write due to PBA nearby and
> >   other reasons
> > - add missing break after ASSERT_UNREACHABLE (Jan)
> > Changes in v6:
> > - use MSIX_CHECK_WARN macro
> > - extend assert on fixmap_idx
> > - add break in default label, after ASSERT_UNREACHABLE(), and move
> >   setting default there
> > - style fixes
> > Changes in v5:
> > - style fixes
> > - include GCC version in the commit message
> > - warn only once (per domain, per device) about failed adjacent access
> > Changes in v4:
> > - drop same_page parameter of msixtbl_find_entry(), distinguish two
> >   cases in relevant callers
> > - rename adj_access_table_idx to adj_access_idx
> > - code style fixes
> > - drop alignment check in adjacent_{read,write}() - all callers already
> >   have it earlier
> > - delay mapping first/last MSI-X pages until preparing device for a
> >   passthrough
> > v3:
> >  - merge handling into msixtbl_mmio_ops
> >  - extend commit message
> > v2:
> >  - adjust commit message
> >  - pass struct domain to msixtbl_page_handler_get_hwaddr()
> >  - reduce local variables used only once
> >  - log a warning if write is forbidden if MSI-X and PBA lives on the same
> >page
> >  - do not passthrough unaligned accesses
> >  - handle accesses both before and after MSI-X table
> > ---
> >  xen/arch/x86/hvm/vmsi.c| 205 --
> >  xen/arch/x86/include/asm/msi.h |   5 +-
> >  xen/arch/x86/msi.c |  42 +++-
> >  3 files changed, 242 insertions(

Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Marek Marczykowski-Górecki
On Tue, May 07, 2024 at 01:32:00PM +0200, Marek Marczykowski-Górecki wrote:
> On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
> > `xl devd` has been observed leaking /var/log/xldevd.log into children.
> > 
> > Link: https://github.com/QubesOS/qubes-issues/issues/8292
> > Reported-by: Demi Marie Obenour 
> > Signed-off-by: Andrew Cooper 
> > ---
> > CC: Anthony PERARD 
> > CC: Juergen Gross 
> > CC: Demi Marie Obenour 
> > CC: Marek Marczykowski-Górecki 
> > 
> > Also entirely speculative based on the QubesOS ticket.
> > ---
> >  tools/xl/xl_utils.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> > index 17489d182954..060186db3a59 100644
> > --- a/tools/xl/xl_utils.c
> > +++ b/tools/xl/xl_utils.c
> > @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
> >  exit(-1);
> >  }
> >  
> > -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> > +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | 
> > O_CLOEXEC, 0644));
> 
> This one might be not enough, as the FD gets dup2()-ed to stdout/stderr
> just outside of the context here, and then inherited by various hotplug
> script. Just adding O_CLOEXEC here means the hotplug scripts will run
> with stdout/stderr closed. The scripts shipped with Xen do redirect
> stderr to a log quite early, but a) it doesn't do it for stdout, and b)
> custom hotplug scripts are a valid use case.
> Without that, I see at least few potential issues:
> - some log messages may be lost (minor, but annoying)
> - something might simply fail on writing to a closed FD, breaking the
>   hotplug script
> - FD 1 will be used as first free FD for any open() or similar call - if
>   a tool later tries writing something to stdout, it will gets written
>   to that FD - worse of all three

Wait, the above is wrong, dup does not copy the O_CLOEXEC flag over to
the new FD. So, maybe your patch is correct after all.
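
For reference, a small standalone sketch (not xl code) that checks this:
POSIX specifies that the descriptor created by dup2() has FD_CLOEXEC
cleared, whatever the flag on the original. The helper names and the choice
of target descriptor are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Return 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
static int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    return flags < 0 ? -1 : !!(flags & FD_CLOEXEC);
}

/*
 * Open path with O_CLOEXEC, duplicate it onto target with dup2() and report
 * whether the duplicate carries FD_CLOEXEC (1), does not (0), or an error
 * occurred (-1).  Both descriptors are closed before returning.
 */
static int cloexec_survives_dup2(const char *path, int target)
{
    int fd = open(path, O_WRONLY | O_CLOEXEC);
    int ret;

    if ( fd < 0 )
        return -1;

    assert(has_cloexec(fd) == 1);   /* open() set it on the original */

    /* dup2(fd, fd) is a no-op that keeps the flag, so treat it as an error. */
    if ( fd == target || dup2(fd, target) < 0 )
    {
        close(fd);
        return -1;
    }

    ret = has_cloexec(target);
    close(target);
    close(fd);
    return ret;
}
```

Calling cloexec_survives_dup2("/dev/null", 100) should return 0, i.e. the
copy made over stdout/stderr stays open across exec while the original
logfile descriptor does not.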

> What should be the behavior of hotplug scripts logging? Should they
> always take care of their own logging? If so, the hotplug calling part
> should redirect stdout/stderr to /dev/null IMO. But if `xl` should
> provide some default logging for them (like, the xldevd.log here?), then
> the O_CLOEXEC should be set only after duplicating logfile over stdout/err.
> 
> >  free(fullname);
> >  assert(logfile >= 3);
> >  
> > 
> > base-commit: ebab808eb1bb8f24c7d0dd41b956e48cb1824b81
> > prerequisite-patch-id: 212e50457e9b6bdfd06a97da545a5aa7155bb919
> 
> Which one is this? I don't see it in staging, nor in any of your
> branches on xenbits. Lore finds "tools/libxs: Open /dev/xen/xenbus fds
> as O_CLOEXEC" which I guess is correct, but I have no idea how it
> correlates it, as this hash doesn't appear anywhere in the message, nor
> its headers...
> 
> -- 
> Best Regards,
> Marek Marczykowski-Górecki
> Invisible Things Lab



-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




Re: [PATCH] tools/xl: Open xldevd.log with O_CLOEXEC

2024-05-07 Thread Marek Marczykowski-Górecki
On Tue, May 07, 2024 at 03:15:48PM +0100, Andrew Cooper wrote:
> On 07/05/2024 12:32 pm, Marek Marczykowski-Górecki wrote:
> > On Tue, May 07, 2024 at 12:08:06PM +0100, Andrew Cooper wrote:
> >> `xl devd` has been observed leaking /var/log/xldevd.log into children.
> >>
> >> Link: https://github.com/QubesOS/qubes-issues/issues/8292
> >> Reported-by: Demi Marie Obenour 
> >> Signed-off-by: Andrew Cooper 
> >> ---
> >> CC: Anthony PERARD 
> >> CC: Juergen Gross 
> >> CC: Demi Marie Obenour 
> >> CC: Marek Marczykowski-Górecki 
> >>
> >> Also entirely speculative based on the QubesOS ticket.
> >> ---
> >>  tools/xl/xl_utils.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> >> index 17489d182954..060186db3a59 100644
> >> --- a/tools/xl/xl_utils.c
> >> +++ b/tools/xl/xl_utils.c
> >> @@ -270,7 +270,7 @@ int do_daemonize(const char *name, const char *pidfile)
> >>  exit(-1);
> >>  }
> >>  
> >> -CHK_SYSCALL(logfile = open(fullname, O_WRONLY|O_CREAT|O_APPEND, 0644));
> >> +CHK_SYSCALL(logfile = open(fullname, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644));
> > This one might not be enough, as the FD gets dup2()-ed to stdout/stderr
> > just outside of the context here, and then inherited by various hotplug
> > scripts. Just adding O_CLOEXEC here means the hotplug scripts will run
> > with stdout/stderr closed.
> 
> Lovely :(  Yes - this won't work.  I guess what we want instead is:
> 
> diff --git a/tools/xl/xl_utils.c b/tools/xl/xl_utils.c
> index 060186db3a59..a0ce7dd7fa21 100644
> --- a/tools/xl/xl_utils.c
> +++ b/tools/xl/xl_utils.c
> @@ -282,6 +282,7 @@ int do_daemonize(const char *name, const char *pidfile)
>  dup2(logfile, 2);
>  
>  close(nullfd);
> +    close(logfile);
>  
>  CHK_SYSCALL(daemon(0, 1));
>  
> which at least means there's not a random extra fd attached to the logfile.

But logfile is a global variable, and it looks to be used in dolog()...
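
Since dolog() still needs the descriptor, one alternative to closing it would
be to leave logfile open and only mark it close-on-exec, either with
O_CLOEXEC at open() time as in the original patch or with fcntl() afterwards.
A minimal sketch of the latter follows. set_cloexec() is a hypothetical
helper, not an existing xl function, and this is not necessarily the fix that
was adopted.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 * Marks fd close-on-exec while preserving its other descriptor flags.
 * Returns 0 on success and -1 on error.
 */
int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;

    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC) < 0 ? -1 : 0;
}
```

In do_daemonize() this would be called on logfile after the dup2() calls, so
that FDs 1 and 2 remain inherited by children while the extra descriptor used
by dolog() is not.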

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab




[PATCH v7 5/6] [DO NOT APPLY] switch to qemu fork

2024-05-07 Thread Marek Marczykowski-Górecki
This makes the tests use the patched QEMU, so that the new behavior is
actually tested.
---
 Config.mk | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Config.mk b/Config.mk
index a962f095ca16..5e220a1284e4 100644
--- a/Config.mk
+++ b/Config.mk
@@ -220,8 +220,8 @@ endif
 OVMF_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/ovmf.git
 OVMF_UPSTREAM_REVISION ?= ba91d0292e593df8528b66f99c1b0b14fadc8e16
 
-QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git
-QEMU_UPSTREAM_REVISION ?= master
+QEMU_UPSTREAM_URL ?= https://github.com/marmarek/qemu
+QEMU_UPSTREAM_REVISION ?= origin/msix
 
 MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
 MINIOS_UPSTREAM_REVISION ?= b6a5b4d72b88e5c4faed01f5a44505de022860fc
-- 
git-series 0.9.1



[PATCH v7 1/6] x86/msi: Extend per-domain/device warning mechanism

2024-05-07 Thread Marek Marczykowski-Górecki
The arch_msix struct had a single "warned" field holding the domid for which
a warning was issued. An upcoming patch will need a similar mechanism for a
few more warnings, so change it to store a bit field of issued warnings.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Jan Beulich 
---
Changes in v6:
- add MSIX_CHECK_WARN macro (Jan)
- drop struct name from warned_kind union (Jan)

New in v5
---
 xen/arch/x86/include/asm/msi.h | 17 ++++++++++++++++-
 xen/arch/x86/msi.c             |  5 +----
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/xen/arch/x86/include/asm/msi.h b/xen/arch/x86/include/asm/msi.h
index 997ccb87be0c..bcfdfd35345d 100644
--- a/xen/arch/x86/include/asm/msi.h
+++ b/xen/arch/x86/include/asm/msi.h
@@ -208,6 +208,15 @@ struct msg_address {
PCI_MSIX_ENTRY_SIZE + \
(~PCI_MSIX_BIRMASK & (PAGE_SIZE - 1)))
 
+#define MSIX_CHECK_WARN(msix, domid, which) ({ \
+if ( (msix)->warned_domid != (domid) ) \
+{ \
+(msix)->warned_domid = (domid); \
+(msix)->warned_kind.all = 0; \
+} \
+(msix)->warned_kind.which ? false : ((msix)->warned_kind.which = true); \
+})
+
 struct arch_msix {
 unsigned int nr_entries, used_entries;
 struct {
@@ -217,7 +226,13 @@ struct arch_msix {
 int table_idx[MAX_MSIX_TABLE_PAGES];
 spinlock_t table_lock;
 bool host_maskall, guest_maskall;
-domid_t warned;
+domid_t warned_domid;
+union {
+uint8_t all;
+struct {
+bool maskall   : 1;
+};
+} warned_kind;
 };
 
 void early_msi_init(void);
diff --git a/xen/arch/x86/msi.c b/xen/arch/x86/msi.c
index e721aaf5c001..42c793426da3 100644
--- a/xen/arch/x86/msi.c
+++ b/xen/arch/x86/msi.c
@@ -364,13 +364,10 @@ static bool msi_set_mask_bit(struct irq_desc *desc, bool host, bool guest)
 domid_t domid = pdev->domain->domain_id;
 
 maskall = true;
-if ( pdev->msix->warned != domid )
-{
-pdev->msix->warned = domid;
+if ( MSIX_CHECK_WARN(pdev->msix, domid, maskall) )
 printk(XENLOG_G_WARNING
"cannot mask IRQ %d: masking MSI-X on Dom%d's %pp\n",
desc->irq, domid, &pdev->sbdf);
-}
 }
 pdev->msix->host_maskall = maskall;
 if ( maskall || pdev->msix->guest_maskall )
-- 
git-series 0.9.1
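
The warn-once logic of MSIX_CHECK_WARN in the patch above can also be written
as a plain function, which may be easier to follow than the statement
expression. The sketch below is standalone and makes several assumptions: it
uses an explicit bitmask instead of the bool bit-field union, and the names
warn_state, check_warn(), WARN_MASKALL and WARN_OTHER are hypothetical
(WARN_OTHER only stands in for the warnings a later patch would add).

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t domid_t;

/* One bit per warning kind; WARN_OTHER is a made-up placeholder. */
enum warn_kind {
    WARN_MASKALL = 1u << 0,
    WARN_OTHER   = 1u << 1,
};

/* Illustrative stand-in for the warned_domid/warned_kind pair. */
struct warn_state {
    domid_t warned_domid;
    uint8_t warned_kinds;
};

/*
 * Returns true if a warning of this kind should be printed now, i.e. it has
 * not been issued yet for this domain. Seeing a different domain resets the
 * set of issued warnings, as in the macro.
 */
bool check_warn(struct warn_state *s, domid_t domid, uint8_t kind)
{
    if (s->warned_domid != domid) {
        s->warned_domid = domid;
        s->warned_kinds = 0;
    }

    if (s->warned_kinds & kind)
        return false;

    s->warned_kinds |= kind;
    return true;
}
```

With this shape, each warning kind is reported at most once per domain, and a
device reassigned to another domain starts with a clean set of warnings.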



[PATCH v7 3/6] automation: prevent QEMU access to /dev/mem in PCI passthrough tests

2024-05-07 Thread Marek Marczykowski-Górecki
/dev/mem access doesn't work in dom0 under lockdown or in a stubdomain.
Simulate this environment by removing the /dev/mem device node. A full test
for lockdown and stubdomain will come later, once all requirements are
in place.

Signed-off-by: Marek Marczykowski-Górecki 
Acked-by: Stefano Stabellini 
---
This can be applied only after the QEMU change is committed. Otherwise the
test will fail.
---
 automation/scripts/qubes-x86-64.sh | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index d81ed7b931cf..7eabc1bd6ad4 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -163,6 +163,8 @@ ifconfig eth0 up
 ifconfig xenbr0 up
 ifconfig xenbr0 192.168.0.1
 
+# ensure QEMU won't have access to /dev/mem
+rm -f /dev/mem
 # get domU console content into test log
 tail -F /var/log/xen/console/guest-domU.log 2>/dev/null | sed -e \"s/^/(domU) /\" &
 xl create /etc/xen/domU.cfg
-- 
git-series 0.9.1



[PATCH v7 6/6] [DO NOT APPLY] switch to alternative artifact repo

2024-05-07 Thread Marek Marczykowski-Górecki
For testing, switch to my container registry, which includes containers
rebuilt with the changes in this series.
---
 automation/gitlab-ci/build.yaml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/automation/gitlab-ci/build.yaml b/automation/gitlab-ci/build.yaml
index 49d6265ad5b4..0d7e311417d8 100644
--- a/automation/gitlab-ci/build.yaml
+++ b/automation/gitlab-ci/build.yaml
@@ -320,7 +320,7 @@ qemu-system-ppc64-8.1.0-ppc64-export:
 
 alpine-3.18-rootfs-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/alpine:3.18
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/alpine:3.18
   script:
 - mkdir binaries && cp /initrd.tar.gz binaries/initrd.tar.gz
   artifacts:
@@ -331,7 +331,7 @@ alpine-3.18-rootfs-export:
 
 kernel-6.1.19-export:
   extends: .test-jobs-artifact-common
-  image: registry.gitlab.com/xen-project/xen/tests-artifacts/kernel:6.1.19
+  image: registry.gitlab.com/xen-project/people/marmarek/xen/tests-artifacts/kernel:6.1.19
   script:
 - mkdir binaries && cp /bzImage binaries/bzImage
   artifacts:
-- 
git-series 0.9.1



[PATCH v7 4/6] automation: switch to a wifi card on ADL system

2024-05-07 Thread Marek Marczykowski-Górecki
Switch to a wifi card that has registers on an MSI-X page. This tests the
"x86/hvm: Allow writes to registers on the same page as MSI-X table"
feature. Switch it only for the HVM test, because MSI-X adjacent writes are
not supported on PV.

This also requires including drivers and firmware in the test system.
Remove firmware unrelated to the test, to avoid increasing the initrd size
too much (all firmware takes over 100MB compressed).
Finally, adjust the test script to handle not only eth0 as a test device,
but also wlan0, and connect it to the wifi network.

Signed-off-by: Marek Marczykowski-Górecki 
Reviewed-by: Stefano Stabellini 
---
This needs two new gitlab variables: WIFI_HW2_SSID and WIFI_HW2_PSK. I'll
provide them in private.

This change requires rebuilding test containers.

This can be applied only after the QEMU change is committed. Otherwise the
test will fail.
---
 automation/gitlab-ci/test.yaml                      | 4 ++++
 automation/scripts/qubes-x86-64.sh                  | 7 +++++++
 automation/tests-artifacts/alpine/3.18.dockerfile   | 7 +++++++
 automation/tests-artifacts/kernel/6.1.19.dockerfile | 2 ++
 4 files changed, 20 insertions(+)

diff --git a/automation/gitlab-ci/test.yaml b/automation/gitlab-ci/test.yaml
index ad249fa0a5d9..6803cae116b5 100644
--- a/automation/gitlab-ci/test.yaml
+++ b/automation/gitlab-ci/test.yaml
@@ -193,6 +193,10 @@ adl-pci-pv-x86-64-gcc-debug:
 
 adl-pci-hvm-x86-64-gcc-debug:
   extends: .adl-x86-64
+  variables:
+PCIDEV: "00:14.3"
+WIFI_SSID: "$WIFI_HW2_SSID"
+WIFI_PSK: "$WIFI_HW2_PSK"
   script:
 - ./automation/scripts/qubes-x86-64.sh pci-hvm 2>&1 | tee ${LOGFILE}
   needs:
diff --git a/automation/scripts/qubes-x86-64.sh b/automation/scripts/qubes-x86-64.sh
index 7eabc1bd6ad4..60498ef1e89a 100755
--- a/automation/scripts/qubes-x86-64.sh
+++ b/automation/scripts/qubes-x86-64.sh
@@ -94,6 +94,13 @@ on_reboot = "destroy"
 domU_check="
 set -x -e
 interface=eth0
+if [ -e /sys/class/net/wlan0 ]; then
+interface=wlan0
+set +x
+wpa_passphrase "$WIFI_SSID" "$WIFI_PSK" > /etc/wpa_supplicant.conf
+set -x
+wpa_supplicant -B -iwlan0 -c /etc/wpa_supplicant.conf
+fi
 ip link set \"\$interface\" up
 timeout 30s udhcpc -i \"\$interface\"
 pingip=\$(ip -o -4 r show default|cut -f 3 -d ' ')
diff --git a/automation/tests-artifacts/alpine/3.18.dockerfile b/automation/tests-artifacts/alpine/3.18.dockerfile
index 9cde6c9ad4da..c323e266c7da 100644
--- a/automation/tests-artifacts/alpine/3.18.dockerfile
+++ b/automation/tests-artifacts/alpine/3.18.dockerfile
@@ -34,6 +34,13 @@ RUN \
   apk add udev && \
   apk add pciutils && \
   apk add libelf && \
+  apk add wpa_supplicant && \
+  # Select firmware for hardware tests
+  apk add linux-firmware-other && \
+  mkdir /lib/firmware-preserve && \
+  mv /lib/firmware/iwlwifi-so-a0-gf-a0* /lib/firmware-preserve/ && \
+  rm -rf /lib/firmware && \
+  mv /lib/firmware-preserve /lib/firmware && \
   \
   # Xen
   cd / && \
diff --git a/automation/tests-artifacts/kernel/6.1.19.dockerfile b/automation/tests-artifacts/kernel/6.1.19.dockerfile
index 3a4096780d20..84ed5dff23ae 100644
--- a/automation/tests-artifacts/kernel/6.1.19.dockerfile
+++ b/automation/tests-artifacts/kernel/6.1.19.dockerfile
@@ -32,6 +32,8 @@ RUN curl -fsSLO https://cdn.kernel.org/pub/linux/kernel/v6.x/linux-"$LINUX_VERSI
 make xen.config && \
 scripts/config --enable BRIDGE && \
 scripts/config --enable IGC && \
+scripts/config --enable IWLWIFI && \
+scripts/config --enable IWLMVM && \
 cp .config .config.orig && \
 cat .config.orig | grep XEN | grep =m |sed 's/=m/=y/g' >> .config && \
 make -j$(nproc) bzImage && \
-- 
git-series 0.9.1


