[PATCH v2 1/2] x86/hvm/trace: Use a different trace type for AMD processors

2024-05-23 Thread George Dunlap
A long-standing usability sub-optimality with xenalyze is the
necessity to specify `--svm-mode` when analyzing AMD processors.  This
fundamentally comes about because the same trace event ID is used for
both VMX and SVM, but the contents of the trace must be interpreted
differently.

Instead, allocate separate trace events for VMX and SVM vmexits in
Xen; this will allow all readers to properly interpret the meaning of
the vmexit reason.

In xenalyze, first remove the redundant call to init_hvm_data();
there's no way to get to hvm_vmexit_process() without it being already
initialized by the set_vcpu_type call in hvm_process().

Replace this with set_hvm_exit_reason_data(), and move setting of
hvm->exit_reason_* into that function.

Modify hvm_process and hvm_vmexit_process to handle all four potential
values appropriately.

If SVM entries are encountered, set opt.svm_mode so that other
SVM-specific functionality is triggered.

Remove the `--svm-mode` command-line option, since it's now redundant.

Signed-off-by: George Dunlap 
---
v2:
- Rebase to tip of staging
- Rebase over xentrace_format removal
- Fix typo in commit message
- Remove --svm-mode command-line flag

CC: Andrew Cooper 
CC: Jan Beulich 
CC: Roger Pau Monne 
CC: Anthony Perard 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Olaf Hering 
---
 tools/xentrace/xenalyze.c  | 37 +++--
 xen/arch/x86/hvm/svm/svm.c |  4 ++--
 xen/arch/x86/hvm/vmx/vmx.c |  4 ++--
 xen/include/public/trace.h |  6 --
 4 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index ce6a85d50b..9c4463b0e8 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -1437,14 +1437,6 @@ void init_hvm_data(struct hvm_data *h, struct vcpu_data *v) {
 
 h->init = 1;
 
-    if(opt.svm_mode) {
-        h->exit_reason_max = HVM_SVM_EXIT_REASON_MAX;
-        h->exit_reason_name = hvm_svm_exit_reason_name;
-    } else {
-        h->exit_reason_max = HVM_VMX_EXIT_REASON_MAX;
-        h->exit_reason_name = hvm_vmx_exit_reason_name;
-    }
-
 if(opt.histogram_interrupt_eip) {
 int count = 
((1ULL<summary.guest_interrupt[i].count=0;
 }
 
+void set_hvm_exit_reason_data(struct hvm_data *h, unsigned event) {
+    if (((event & ~TRC_64_FLAG) == TRC_HVM_SVM_EXIT) ||
+        opt.svm_mode) {
+        opt.svm_mode = 1;
+        h->exit_reason_max = HVM_SVM_EXIT_REASON_MAX;
+        h->exit_reason_name = hvm_svm_exit_reason_name;
+    } else {
+        h->exit_reason_max = HVM_VMX_EXIT_REASON_MAX;
+        h->exit_reason_name = hvm_vmx_exit_reason_name;
+    }
+}
+
 /* PV data */
 enum {
 PV_HYPERCALL=1,
@@ -5088,13 +5092,13 @@ void hvm_vmexit_process(struct record_info *ri, struct hvm_data *h,
 
 r = (typeof(r))ri->d;
 
-    if(!h->init)
-        init_hvm_data(h, v);
+    if(!h->exit_reason_name)
+        set_hvm_exit_reason_data(h, ri->event);
 
     h->vmexit_valid=1;
     bzero(&h->inflight, sizeof(h->inflight));
 
-    if(ri->event == TRC_HVM_VMEXIT64) {
+    if(ri->event & TRC_64_FLAG) {
 if(v->guest_paging_levels != 4)
 {
 if ( verbosity >= 6 )
@@ -5316,8 +5320,10 @@ void hvm_process(struct pcpu_info *p)
 break;
 default:
 switch(ri->event) {
-case TRC_HVM_VMEXIT:
-case TRC_HVM_VMEXIT64:
+case TRC_HVM_VMX_EXIT:
+case TRC_HVM_VMX_EXIT64:
+case TRC_HVM_SVM_EXIT:
+case TRC_HVM_SVM_EXIT64:
 UPDATE_VOLUME(p, hvm[HVM_VOL_VMEXIT], ri->size);
 hvm_vmexit_process(ri, h, v);
 break;
@@ -10884,11 +10890,6 @@ const struct argp_option cmd_opts[] =  {
       .arg = "HZ",
       .doc = "Cpu speed of the tracing host, used to convert tsc into seconds.", },
 
-    { .name = "svm-mode",
-      .key = OPT_SVM_MODE,
-      .group = OPT_GROUP_HARDWARE,
-      .doc = "Assume AMD SVM-style vmexit error codes.  (Default is Intel VMX.)", },
-
     { .name = "progress",
       .key = OPT_PROGRESS,
       .doc = "Progress dialog.  Requires the zenity (GTK+) executable.", },
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index db530d55f2..988250dbc1 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2571,10 +2571,10 @@ void asmlinkage svm_vmexit_handler(void)
     exit_reason = vmcb->exitcode;
 
     if ( hvm_long_mode_active(v) )
-        TRACE_TIME(TRC_HVM_VMEXIT64 | (vcpu_guestmode ? TRC_HVM_NESTEDFLAG : 0),
+        TRACE_TIME(TRC_HVM_SVM_EXIT64 | (vcpu_guestmode ? TRC_HVM_NESTEDFLAG : 0),
                    exit_reason, regs->rip, regs->rip >> 32);
     else
-        TRACE_TIME(TRC_HVM_VMEXIT | (vcpu_guestmode ? TRC_HVM_NESTEDFLAG : 0),
+        TRACE_TIME(TRC_HVM_SVM_EXIT | (vcpu_guestmode ? TRC_HVM_NESTEDFLAG : 0),
                    exit_reason, regs->rip, regs->rip >> 32);

[PATCH v2 2/2] tools/xenalyze: Ignore HVM_EMUL events harder

2024-05-23 Thread George Dunlap
To unify certain common sanity checks, checks are done very early in
processing based only on the top-level type.

Unfortunately, when TRC_HVM_EMUL was introduced, it broke some of the
assumptions about how the top-level types worked.  Namely, traces of
this type will show up outside of HVM contexts: in idle domains and in
PV domains.

Make an explicit exception for TRC_HVM_EMUL types in a number of places:

 - Pass the record info pointer to toplevel_assert_check, so that it
   can exclude TRC_HVM_EMUL records from idle and vcpu data_mode
   checks

 - Don't attempt to set the vcpu data_type in hvm_process for
   TRC_HVM_EMUL records.

Signed-off-by: George Dunlap 
Acked-by: Andrew Cooper 
---
CC: Andrew Cooper 
CC: Anthony Perard 
CC: Olaf Hering 
---
 tools/xentrace/xenalyze.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index 9c4463b0e8..d95e52695f 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -21,6 +21,7 @@
 #define _XOPEN_SOURCE 600
 #include 
 #include 
+#include <stdbool.h>
 #include 
 #include 
 #include 
@@ -5305,8 +5306,11 @@ void hvm_process(struct pcpu_info *p)
 
 assert(p->current);
 
-    if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
-        return;
+    /* HVM_EMUL types show up in all contexts */
+    if(ri->evt.sub != 0x4) {
+        if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
+            return;
+    }
 
 switch ( ri->evt.sub ) {
 case 2: /* HVM_HANDLER */
@@ -9447,9 +9451,10 @@ static struct tl_assert_mask tl_assert_checks[TOPLEVEL_MAX] = {
 /* There are a lot of common assumptions for the various processing
  * routines.  Check them all in one place, doing something else if
  * they don't pass. */
-int toplevel_assert_check(int toplevel, struct pcpu_info *p)
+int toplevel_assert_check(int toplevel, struct record_info *ri, struct pcpu_info *p)
 {
     struct tl_assert_mask mask;
+    bool is_hvm_emul = (toplevel == TOPLEVEL_HVM) && (ri->evt.sub == 0x4);
 
 mask = tl_assert_checks[toplevel];
 
@@ -9459,7 +9464,7 @@ int toplevel_assert_check(int toplevel, struct pcpu_info *p)
 goto fail;
 }
 
-    if( mask.not_idle_domain )
+    if( mask.not_idle_domain && !is_hvm_emul)
 {
 /* Can't do this check w/o first doing above check */
 assert(mask.p_current);
@@ -9478,7 +9483,8 @@ int toplevel_assert_check(int toplevel, struct pcpu_info *p)
 v = p->current;
 
     if ( ! (v->data_type == VCPU_DATA_NONE
-            || v->data_type == mask.vcpu_data_mode) )
+            || v->data_type == mask.vcpu_data_mode
+            || is_hvm_emul) )
 {
         /* This may happen for track_dirty_vram, which causes a SHADOW_WRMAP_BF trace f/ dom0 */
         fprintf(warn, "WARNING: Unexpected vcpu data type for d%dv%d on proc %d! Expected %d got %d. Not processing\n",
@@ -9525,7 +9531,7 @@ void process_record(struct pcpu_info *p) {
 return;
 
 /* Unify toplevel assertions */
-    if ( toplevel_assert_check(toplevel, p) )
+    if ( toplevel_assert_check(toplevel, ri, p) )
 {
 switch(toplevel) {
 case TRC_GEN_MAIN:
-- 
2.25.1




Re: [PATCH] tools/golang: Add missing golang bindings for vlan

2024-05-20 Thread George Dunlap
On Mon, May 20, 2024 at 9:21 AM Henry Wang  wrote:
>
> It is noticed that commit:
> 3bc14e4fa4b9 ("tools/libs/light: Add vlan field to libxl_device_nic")
> introduces a new "vlan" string field to libxl_device_nic. But the
> golang bindings are missing. Add it in this patch.
>
> Fixes: 3bc14e4fa4b9 ("tools/libs/light: Add vlan field to libxl_device_nic")
> Signed-off-by: Henry Wang 

Acked-by: George Dunlap 



Re: [PATCH] Revert "evtchn: refuse EVTCHNOP_status for Xen-bound event channels"

2024-05-15 Thread George Dunlap
On Tue, May 14, 2024 at 10:51 AM Andrew Cooper
 wrote:
>
> On 14/05/2024 10:25 am, Jan Beulich wrote:
> > On 03.04.2024 08:16, Jan Beulich wrote:
> >> On 02.04.2024 19:06, Andrew Cooper wrote:
> >>> The commit makes a claim without any kind of justification.
> >> Well, what does "have no business" leave open?
> >>
> >>> The claim is false, and the commit broke lsevtchn in dom0.
> >> Or alternatively lsevtchn was doing something that was never meant to work
> >> (from Xen's perspective).
> >>
> >>>  It is also quite
> >>> obvious from XSM_TARGET that it has broken device model stubdoms too.
> >> Why would that be "obvious"? What business would a stubdom have to look at
> >> Xen's side of an evtchn?
> >>
> >>> Whether to return information about a xen-owned evtchn is a matter of 
> >>> policy,
> >>> and it's not acceptable to short circuit the XSM on the matter.
> >> I can certainly accept this as one possible view point. As in so many cases
> >> I'm afraid I dislike you putting it as if it was the only possible one.
> >>
> >> In summary: The supposed justification you claim is missing in the original
> >> change is imo also missing here then: What business would any entity in the
> >> system have to look at Xen's side of an event channel? Back at the time, 3
> >> people agreed that it's "none".
> > You've never responded to this reply of mine, or its follow-up. You also
> > didn't chime in on the discussion Daniel and I were having. I consider my
> > objections unaddressed, and in fact I continue to consider the change to
> > be wrong. Therefore it was inappropriate for you to commit it; it needs
> > reverting asap. If you're not going to do so, I will.
>
> You tried defending breaking a utility with "well it shouldn't exist then".
>
> You don't have a leg to stand on, and two maintainers of relevant
> subsystems here just got tired of bullshit being presented in place of
> any credible argument for having done the change in the way you did.
>
> The correct response was "Sorry I broke things.  Lets revert this for
> now to unbreak, and I'll see about reworking it to not intentionally
> subvert Xen's security mechanism".
>
> As it stands, you're 2-1 outvoted, and wasted any sympathy I may have
> had for the principle of the change based on the absurdity of your
> arguments.

We have a simple process for dealing with this situation, Andy, which
you didn't follow.  You can't just go checking things in because you
feel strongly about it.

 -George



Re: [PATCH 2/5] x86/cpu-policy: Add SVM features already used by Xen

2024-05-01 Thread George Dunlap
On Wed, May 1, 2024 at 11:39 AM Andrew Cooper  wrote:
>
> On 01/05/2024 11:00 am, George Dunlap wrote:
> > On Mon, Apr 29, 2024 at 4:16 PM Andrew Cooper  
> > wrote:
> >> These will replace svm_feature_flags and the SVM_FEATURE_* constants over 
> >> the
> >> next few changes.  Take the opportunity to rationalise some names.
> >>
> >> Drop the opencoded "inherit from host" logic in calculate_hvm_max_policy() 
> >> and
> >> use 'h'/'!' annotations.  The logic needs to operate on fs, not the policy
> >> object, given its position within the function.
> >>
> >> Drop some trailing whitespace introduced when this block of code was last
> >> moved.
> >>
> >> Signed-off-by: Andrew Cooper 
> >> ---
> >> CC: Jan Beulich 
> >> CC: Roger Pau Monné 
> >> CC: Stefano Stabellini 
> >> CC: Xenia Ragiadakou 
> >> CC: Sergiy Kibrik 
> >> CC: George Dunlap 
> >> CC: Andrei Semenov 
> >> CC: Vaishali Thakkar 
> >> ---
> >>  tools/misc/xen-cpuid.c  | 11 +++
> >>  xen/arch/x86/cpu-policy.c   | 17 +
> >>  xen/include/public/arch-x86/cpufeatureset.h | 14 ++
> >>  xen/tools/gen-cpuid.py  |  3 +++
> >>  4 files changed, 33 insertions(+), 12 deletions(-)
> >>
> >> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> >> index ab09410a05d6..0d01b0e797f1 100644
> >> --- a/tools/misc/xen-cpuid.c
> >> +++ b/tools/misc/xen-cpuid.c
> >> @@ -266,6 +266,17 @@ static const char *const str_m10Ah[32] =
> >>
> >>  static const char *const str_eAd[32] =
> >>  {
> >> +[ 0] = "npt", [ 1] = "v-lbr",
> >> +[ 2] = "svm-lock",[ 3] = "nrips",
> >> +[ 4] = "v-tsc-rate",  [ 5] = "vmcb-cleanbits",
> >> +[ 6] = "flush-by-asid",   [ 7] = "decode-assist",
> >> +
> >> +[10] = "pause-filter",
> >> +[12] = "pause-thresh",
> >> +/* 14 */  [15] = "v-loadsave",
> >> +[16] = "v-gif",
> >> +/* 18 */  [19] = "npt-sss",
> >> +[20] = "v-spec-ctrl",
> >>  };
> >>
> >>  static const char *const str_e1Fa[32] =
> >> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> >> index 4b6d96276399..da4401047e89 100644
> >> --- a/xen/arch/x86/cpu-policy.c
> >> +++ b/xen/arch/x86/cpu-policy.c
> >> @@ -9,7 +9,6 @@
> >>  #include 
> >>  #include 
> >>  #include 
> >> -#include 
> >>  #include 
> >>  #include 
> >>  #include 
> >> @@ -748,22 +747,16 @@ static void __init calculate_hvm_max_policy(void)
> >>  if ( !cpu_has_vmx )
> >>  __clear_bit(X86_FEATURE_PKS, fs);
> >>
> >> -/*
> >> +/*
> >>   * Make adjustments to possible (nested) virtualization features 
> >> exposed
> >>   * to the guest
> >>   */
> >> -if ( p->extd.svm )
> >> +if ( test_bit(X86_FEATURE_SVM, fs) )
> >>  {
> >> -/* Clamp to implemented features which require hardware support. 
> >> */
> >> -p->extd.raw[0xa].d &= ((1u << SVM_FEATURE_NPT) |
> >> -   (1u << SVM_FEATURE_LBRV) |
> >> -   (1u << SVM_FEATURE_NRIPS) |
> >> -   (1u << SVM_FEATURE_PAUSEFILTER) |
> >> -   (1u << SVM_FEATURE_DECODEASSISTS));
> >> -/* Enable features which are always emulated. */
> >> -p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
> >> +/* Xen always emulates cleanbits. */
> >> +__set_bit(X86_FEATURE_VMCB_CLEANBITS, fs);
> >>  }
> > What about this line, at the end of recalculate_cpuid_policy()?
> >
> >   if ( !p->extd.svm )
> > p->extd.raw[0xa] = EMPTY_LEAF;
> >
> > Is there a reason to continue to operate directly on the policy here,
> > or would it be better to do this up in the featureset section?
>
> That's still needed.
>
> Otherwise in a !SVM VM you still see svm_rev and nr_asids in a leaf that
> should be all zeroes.

...Uh, yes of course we still need to clear the non-existent CPUID
bits.  I was asking if we should change *how* we should clear them.

In the hunk I responded to, when enabling VMCBCLEAN, we switch from
setting bits in the policy, to setting bits in the featureset.  I was
asking whether it would make sense to do something like

if !test_bit(X86_FEATURE_SVM, fs)
fs[FEATURESET_eAd]  = 0;

before the x86_cpu_featureset_to_policy() instead.

 -George



Re: [PATCH 2/5] x86/cpu-policy: Add SVM features already used by Xen

2024-05-01 Thread George Dunlap
On Mon, Apr 29, 2024 at 4:16 PM Andrew Cooper  wrote:
>
> These will replace svm_feature_flags and the SVM_FEATURE_* constants over the
> next few changes.  Take the opportunity to rationalise some names.
>
> Drop the opencoded "inherit from host" logic in calculate_hvm_max_policy() and
> use 'h'/'!' annotations.  The logic needs to operate on fs, not the policy
> object, given its position within the function.
>
> Drop some trailing whitespace introduced when this block of code was last
> moved.
>
> Signed-off-by: Andrew Cooper 
> ---
> CC: Jan Beulich 
> CC: Roger Pau Monné 
> CC: Stefano Stabellini 
> CC: Xenia Ragiadakou 
> CC: Sergiy Kibrik 
> CC: George Dunlap 
> CC: Andrei Semenov 
> CC: Vaishali Thakkar 
> ---
>  tools/misc/xen-cpuid.c  | 11 +++
>  xen/arch/x86/cpu-policy.c   | 17 +
>  xen/include/public/arch-x86/cpufeatureset.h | 14 ++
>  xen/tools/gen-cpuid.py  |  3 +++
>  4 files changed, 33 insertions(+), 12 deletions(-)
>
> diff --git a/tools/misc/xen-cpuid.c b/tools/misc/xen-cpuid.c
> index ab09410a05d6..0d01b0e797f1 100644
> --- a/tools/misc/xen-cpuid.c
> +++ b/tools/misc/xen-cpuid.c
> @@ -266,6 +266,17 @@ static const char *const str_m10Ah[32] =
>
>  static const char *const str_eAd[32] =
>  {
> +[ 0] = "npt", [ 1] = "v-lbr",
> +[ 2] = "svm-lock",[ 3] = "nrips",
> +[ 4] = "v-tsc-rate",  [ 5] = "vmcb-cleanbits",
> +[ 6] = "flush-by-asid",   [ 7] = "decode-assist",
> +
> +[10] = "pause-filter",
> +[12] = "pause-thresh",
> +/* 14 */  [15] = "v-loadsave",
> +[16] = "v-gif",
> +/* 18 */  [19] = "npt-sss",
> +[20] = "v-spec-ctrl",
>  };
>
>  static const char *const str_e1Fa[32] =
> diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
> index 4b6d96276399..da4401047e89 100644
> --- a/xen/arch/x86/cpu-policy.c
> +++ b/xen/arch/x86/cpu-policy.c
> @@ -9,7 +9,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -748,22 +747,16 @@ static void __init calculate_hvm_max_policy(void)
>  if ( !cpu_has_vmx )
>  __clear_bit(X86_FEATURE_PKS, fs);
>
> -/*
> +/*
>   * Make adjustments to possible (nested) virtualization features exposed
>   * to the guest
>   */
> -if ( p->extd.svm )
> +if ( test_bit(X86_FEATURE_SVM, fs) )
>  {
> -/* Clamp to implemented features which require hardware support. */
> -p->extd.raw[0xa].d &= ((1u << SVM_FEATURE_NPT) |
> -   (1u << SVM_FEATURE_LBRV) |
> -   (1u << SVM_FEATURE_NRIPS) |
> -   (1u << SVM_FEATURE_PAUSEFILTER) |
> -   (1u << SVM_FEATURE_DECODEASSISTS));
> -/* Enable features which are always emulated. */
> -p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
> +/* Xen always emulates cleanbits. */
> +__set_bit(X86_FEATURE_VMCB_CLEANBITS, fs);
>  }

What about this line, at the end of recalculate_cpuid_policy()?

  if ( !p->extd.svm )
p->extd.raw[0xa] = EMPTY_LEAF;

Is there a reason to continue to operate directly on the policy here,
or would it be better to do this up in the featureset section?

 -George



Re: [PATCH 1/5] x86/cpu-policy: Infrastructure for the AMD SVM and SEV leaves

2024-05-01 Thread George Dunlap
On Tue, Apr 30, 2024 at 2:25 PM Andrew Cooper  wrote:
>
> On 30/04/2024 1:45 pm, Jan Beulich wrote:
> > On 29.04.2024 17:16, Andrew Cooper wrote:
> >> Allocate two new feature leaves, and extend cpu_policy with the non-feature
> >> fields too.
> >>
> >> The CPUID dependency between the SVM bit on the whole SVM leaf is
> >> intentionally deferred, to avoid transiently breaking nested virt.
> > In reply to this I meant to ask that you at least add those dependencies in
> > commented-out form, such that from looking at gen-cpuid.py it becomes clear
> > they're intentionally omitted. But you don't add feature identifiers either,
> > making dependencies impossible to express. Maybe this sentence was really
> > meant for another of the patches? (Then my request would actually apply
> > there.)
>
> This is necessary because c/s 4f8b0e94d7ca is buggy.  Notice how it puts
> an edit to the policy object in the middle of a block of logic editing
> the featureset, which ends with writing the featureset back over the
> policy object.
>
> And it's not the first outstanding problem from what is a very small
> number of nested-virt patches so far...

I specifically raised this on the x86 maintainers call, and you said
to go ahead with it.

 -George



Re: [PATCH 1/3] x86/hvm/trace: Use a different trace type for AMD processors

2024-04-26 Thread George Dunlap
On Fri, Apr 26, 2024 at 4:18 PM Andrew Cooper  wrote:
>
> On 26/04/2024 3:32 pm, George Dunlap wrote:
> > A long-standing usability sub-optimality with xenalyze is the
> > necessity to specify `--svm-mode` when analyzing AMD processors.  This
> > fundamentally comes about because the same trace event ID is used for
> > both VMX and SVM, but the contents of the trace must be interpreted
> > differently.
> >
> > Instead, allocate separate trace events for VMX and SVM vmexits in
> > Xen; this will allow all readers to properly intrepret the meaning of
>
> interpret ?

Oops, yes.

> > In xenalyze, first remove the redundant call to init_hvm_data();
> > there's no way to get to hvm_vmexit_process() without it being already
> > initialized by the set_vcpu_type call in hvm_process().
> >
> > Replace this with set_hvm_exit_reson_data(), and move setting of
> > hvm->exit_reason_* into that function.
> >
> > Modify hvm_process and hvm_vmexit_process to handle all four potential
> > values appropriately.
> >
> > If SVM entries are encountered, set opt.svm_mode so that other
> > SVM-specific functionality is triggered.
>
> Given that xenalyze is now closely tied to Xen, and that we're
> technically changing the ABI here, is there any point keeping `--svm-mode` ?
>
> I'm unsure of the utility of reading the buggy trace records from an
> older version of Xen.

Yeah, I thought about that.  If nobody argues to keep it, I guess I'll
rip it out for v2.

> > Also add lines in `formats` for xentrace_format.
>
> Personally I'd have put patch 3 first, and reduced the churn here.

I actually meant to add an `RFC` to the third patch.  I'd already
implemented this by the time you suggested ripping it out.  Doing it
this way doesn't add much overhead if we do end up ripping out
xentrace_format, and does save time if we don't.

> > Signed-off-by: George Dunlap 
> > ---
> > NB that this patch goes on top of Andrew's trace cleanup series:
> >
> > https://lore.kernel.org/xen-devel/20240318163552.3808695-1-andrew.coop...@citrix.com/
>
> The delta in Xen is trivial.  I'm happy if you want to commit this, and
> I can rebase over it.

It's trivial in part *because of* your series.  Without your series I
would have had to do a bunch of munging around with DO_TRC_BLAH_BLAH
to make it compile.  In fact, I started doing it directly on staging,
but quickly moved onto your series to save myself some time. :-)

Since I'm planning on re-submitting your series if you don't (unless
you really object), this way should minimize churn.

I'm happy to post a branch to gitlab if anyone wants to see the combined result.

 -George



Re: [PATCH 2/3] tools/xenalyze: Ignore HVM_EMUL events harder

2024-04-26 Thread George Dunlap
On Fri, Apr 26, 2024 at 4:06 PM Andrew Cooper  wrote:
>
> On 26/04/2024 3:32 pm, George Dunlap wrote:
> > To unify certain common sanity checks, checks are done very early in
> > processing based only on the top-level type.
> >
> > Unfortunately, when TRC_HVM_EMUL was introduced, it broke some of the
> > assumptions about how the top-level types worked.  Namely, traces of
> > this type will show up outside of HVM contexts: in idle domains and in
> > PV domains.
> >
> > Make an explicit exception for TRC_HVM_EMUL types in a number of places:
> >
> >  - Pass the record info pointer to toplevel_assert_check, so that it
> >can exclude TRC_HVM_EMUL records from idle and vcpu data_mode
> >checks
> >
> >  - Don't attempt to set the vcpu data_type in hvm_process for
> >TRC_HVM_EMUL records.
> >
> > Signed-off-by: George Dunlap 
>
> Acked-by: Andrew Cooper 
>
> Although I'm tempted to say that if records of this type show up outside
> of HVM context, then it's misnamed or we've got an error in Xen.  Any
> view on which?

I didn't add them; they seem to be about doing emulation of devices on
behalf of an HVM domain.  But since some of these are things like
timers and such, the actual code ends up being run in random contexts;
probably interrupt contexts (which of course can occur when a non-HVM
guest is running).

One example of an event that showed up this way was
TRC_HVM_EMUL_RTC_STOP_TIMER, which is traced from
xen/arch/x86/hvm/rtc.c:rtc_pf_callback().  I didn't trace back how it
came to be called while a PV guest was running, but it was pretty
obvious that it was legit.

It would certainly make the xenalyze code cleaner to make a separate
top-level tracing category for them; but they certainly do seem to be
directly related to HVM, so doing so would seem to be putting the cart
before the horse, as they say (although I could be convinced
otherwise).

 -George



[PATCH 3/3] tools/xentrace: Remove xentrace_format

2024-04-26 Thread George Dunlap
xentrace_format was always of limited utility, since trace records
across pcpus were processed out of order; it was superseded by xenalyze
over a decade ago.

But for several releases, the `formats` file it has depended on for
proper operation has not even been included in `make install` (which
generally means it doesn't get picked up by distros either); yet
nobody has seemed to complain.

Simply remove xentrace_format, and point people to xenalyze instead.

NB that there is no man page for xenalyze, so the "see also" on the
xentrace man page is simply removed for now.

Signed-off-by: George Dunlap 
---
CC: Andrew Cooper 
CC: Jan Beulich 
CC: Roger Pau Monne 
CC: Anthony Perard 
CC: Olaf Hering 
---
 docs/man/xentrace.8.pod|   5 +-
 docs/man/xentrace_format.1.pod |  46 --
 tools/xentrace/Makefile|   3 -
 tools/xentrace/formats | 231 -
 tools/xentrace/xentrace.c  |   2 +-
 tools/xentrace/xentrace_format | 264 -
 6 files changed, 2 insertions(+), 549 deletions(-)

diff --git a/docs/man/xentrace.8.pod b/docs/man/xentrace.8.pod
index 4c174a84c0..7c9f69c67f 100644
--- a/docs/man/xentrace.8.pod
+++ b/docs/man/xentrace.8.pod
@@ -20,7 +20,7 @@ D1...D5 are the trace data.
 Data is dumped onto the standard output (which must not be a TTY) or a
 I specified on the command line.
 
-The output should be parsed using the tool xentrace_format, which can
+The output should be parsed using the tool xenalyze, which can
 produce human-readable output in ASCII format.
 
 
@@ -157,6 +157,3 @@ B<xentrace> collects the following events from the trace buffer:
 
 Mark A. Williamson 
 
-=head1 SEE ALSO
-
-xentrace_format(1)
diff --git a/docs/man/xentrace_format.1.pod b/docs/man/xentrace_format.1.pod
deleted file mode 100644
index e05479a83b..00
--- a/docs/man/xentrace_format.1.pod
+++ /dev/null
@@ -1,46 +0,0 @@
-=head1 NAME
-
-xentrace_format - pretty-print Xen trace data
-
-=head1 SYNOPSIS
-
-B<xentrace_format> [ I<file> ]
-
-=head1 DESCRIPTION
-
-B<xentrace_format> parses trace data in B<xentrace> binary format from
-standard input and reformats it according to the rules in a file of
-definitions (I<file>), printing to standard output.
-
-The rules in I<file> should have the format shown below:
-
-I<event_id> I<whitespace> I<format>
-
-Each rule should start on a new line.
-
-The format string may include format specifiers, such as:
-%(cpu)d, %(tsc)d, %(event)d, %(1)d, %(2)d, %(3)d, %(4)d, %(5)d
-
-[ the `d' format specifier output in decimal, alternatively `x'
-  will output in hexadecimal and `o' will output in octal ]
-
-These correspond to the CPU number, event ID, timestamp counter and
-the 5 data fields from the trace record.  There should be one such
-rule for each type of event to be pretty-printed (events which do not
-have formatting rules are ignored).
-
-A sample format file for Xen's predefined trace events is available
-in the file tools/xentrace/formats in the Xen source tree.
-
-Depending on your system and the rate at which trace data is produced,
-this script may not be able to keep up with the output of
-B<xentrace> if it is piped directly.  In these circumstances you
-should have B<xentrace> output to a file for processing off-line.
-
-=head1 AUTHOR
-
-Mark A. Williamson 
-
-=head1 SEE ALSO
-
-xentrace(8)
diff --git a/tools/xentrace/Makefile b/tools/xentrace/Makefile
index d50d400472..bf960c0867 100644
--- a/tools/xentrace/Makefile
+++ b/tools/xentrace/Makefile
@@ -10,7 +10,6 @@ LDLIBS += $(ARGP_LDFLAGS)
 BIN := xenalyze
 SBIN:= xentrace xentrace_setsize
 LIBBIN  := xenctx
-SCRIPTS := xentrace_format
 
 TARGETS := $(BIN) $(SBIN) $(LIBBIN)
 
@@ -24,13 +23,11 @@ install: all
 	$(INSTALL_DIR) $(DESTDIR)$(LIBEXEC_BIN)
 	$(INSTALL_PROG) $(BIN) $(DESTDIR)$(bindir)
 	$(INSTALL_PROG) $(SBIN) $(DESTDIR)$(sbindir)
-	$(INSTALL_PYTHON_PROG) $(SCRIPTS) $(DESTDIR)$(bindir)
 	$(INSTALL_PROG) $(LIBBIN) $(DESTDIR)$(LIBEXEC_BIN)
 
 .PHONY: uninstall
 uninstall:
 	rm -f $(addprefix $(DESTDIR)$(LIBEXEC_BIN)/, $(LIBBIN))
-	rm -f $(addprefix $(DESTDIR)$(bindir)/, $(SCRIPTS))
 	rm -f $(addprefix $(DESTDIR)$(sbindir)/, $(SBIN))
 	rm -f $(addprefix $(DESTDIR)$(bindir)/, $(BIN))
 
diff --git a/tools/xentrace/formats b/tools/xentrace/formats
deleted file mode 100644
index 3381c1cda5..00
--- a/tools/xentrace/formats
+++ /dev/null
@@ -1,231 +0,0 @@
-0x00000000  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  unknown (0x%(event)016x)  [ 0x%(1)08x 0x%(2)08x 0x%(3)08x 0x%(4)08x 0x%(5)08x 0x%(6)08x 0x%(7)08x ]
-
-0x0001f001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  lost_records  0x%(1)08x
-0x0001f002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  wrap_buffer   0x%(1)08x
-0x0001f003  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  cpu_change    0x%(1)08x
-0x0001f004  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  trace_irq     [ vector = %(1)d, count = %(2)d, tot_cycles = 0x%(3)08x, max_cycles = 0x%(4)08x ]
-
-0x00021002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  continue_running  [ dom:vcpu = 0x%(1)08x ]
-0x00021011  

[PATCH 1/3] x86/hvm/trace: Use a different trace type for AMD processors

2024-04-26 Thread George Dunlap
A long-standing usability sub-optimality with xenalyze is the
necessity to specify `--svm-mode` when analyzing AMD processors.  This
fundamentally comes about because the same trace event ID is used for
both VMX and SVM, but the contents of the trace must be interpreted
differently.

Instead, allocate separate trace events for VMX and SVM vmexits in
Xen; this will allow all readers to properly intrepret the meaning of
the vmexit reason.

In xenalyze, first remove the redundant call to init_hvm_data();
there's no way to get to hvm_vmexit_process() without it being already
initialized by the set_vcpu_type call in hvm_process().

Replace this with set_hvm_exit_reason_data(), and move setting of
hvm->exit_reason_* into that function.

Modify hvm_process and hvm_vmexit_process to handle all four potential
values appropriately.

If SVM entries are encountered, set opt.svm_mode so that other
SVM-specific functionality is triggered.

Also add lines in `formats` for xentrace_format.

Signed-off-by: George Dunlap 
---
NB that this patch goes on top of Andrew's trace cleanup series:

https://lore.kernel.org/xen-devel/20240318163552.3808695-1-andrew.coop...@citrix.com/

CC: Andrew Cooper 
CC: Jan Beulich 
CC: Roger Pau Monne 
CC: Anthony Perard 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Olaf Hering 
---
 tools/xentrace/formats |  6 --
 tools/xentrace/xenalyze.c  | 32 +++-
 xen/arch/x86/hvm/svm/svm.c |  4 ++--
 xen/arch/x86/hvm/vmx/vmx.c |  4 ++--
 xen/include/public/trace.h |  6 --
 5 files changed, 31 insertions(+), 21 deletions(-)

diff --git a/tools/xentrace/formats b/tools/xentrace/formats
index afb5ee0112..3381c1cda5 100644
--- a/tools/xentrace/formats
+++ b/tools/xentrace/formats
@@ -90,8 +90,10 @@
 0x00041002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  domain_destroy  [ dom = 0x%(1)08x ]
 
 0x00081001  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  VMENTRY
-0x00081002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  VMEXIT  [ exitcode = 0x%(1)08x, rIP  = 0x%(2)08x ]
-0x00081102  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  VMEXIT  [ exitcode = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
+0x00081002  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  VMX_EXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(2)08x ]
+0x00081102  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  VMX_EXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
+0x00081003  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  SVM_EXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(2)08x ]
+0x00081103  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  SVM_EXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
 0x00081401  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  nVMENTRY
 0x00081402  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  nVMEXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(2)08x ]
 0x00081502  CPU%(cpu)d  %(tsc)d (+%(reltsc)8d)  nVMEXIT [ exitcode = 0x%(1)08x, rIP  = 0x%(3)08x%(2)08x ]
diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index ce6a85d50b..ceb07229d1 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -1437,14 +1437,6 @@ void init_hvm_data(struct hvm_data *h, struct vcpu_data *v) {
 
 h->init = 1;
 
-    if(opt.svm_mode) {
-        h->exit_reason_max = HVM_SVM_EXIT_REASON_MAX;
-        h->exit_reason_name = hvm_svm_exit_reason_name;
-    } else {
-        h->exit_reason_max = HVM_VMX_EXIT_REASON_MAX;
-        h->exit_reason_name = hvm_vmx_exit_reason_name;
-    }
-
 if(opt.histogram_interrupt_eip) {
 int count = 
((1ULL<summary.guest_interrupt[i].count=0;
 }
 
+void set_hvm_exit_reason_data(struct hvm_data *h, unsigned event) {
+    if (((event & ~TRC_64_FLAG) == TRC_HVM_SVM_EXIT) ||
+        opt.svm_mode) {
+        opt.svm_mode = 1;
+        h->exit_reason_max = HVM_SVM_EXIT_REASON_MAX;
+        h->exit_reason_name = hvm_svm_exit_reason_name;
+    } else {
+        h->exit_reason_max = HVM_VMX_EXIT_REASON_MAX;
+        h->exit_reason_name = hvm_vmx_exit_reason_name;
+    }
+}
+
 /* PV data */
 enum {
 PV_HYPERCALL=1,
@@ -5088,13 +5092,13 @@ void hvm_vmexit_process(struct record_info *ri, struct 
hvm_data *h,
 
 r = (typeof(r))ri->d;
 
-    if(!h->init)
-        init_hvm_data(h, v);
+    if(!h->exit_reason_name)
+        set_hvm_exit_reason_data(h, ri->event);
 
     h->vmexit_valid=1;
     bzero(&h->inflight, sizeof(h->inflight));
 
-    if(ri->event == TRC_HVM_VMEXIT64) {
+    if(ri->event & TRC_64_FLAG) {
 if(v->guest_paging_levels != 4)
 {
 if ( verbosity >= 6 )
@@ -5316,8 +5320,10 @@ void hvm_process(struct pcpu_info *p)
         break;
     default:
         switch(ri->event) {
-        case TRC_HVM_VMEXIT:
-        case TRC_HVM_VMEXIT64:
+        case TRC_HVM_VMX_EXIT:
+        case TRC_HVM_VMX_EXIT64:
+        case TRC_HVM_SVM_EXIT:
+        case TRC_HVM_SVM_EXIT64:
             UPDATE_VOLUME(p, hvm[HVM_VOL_VMEXIT], ri->size);
             hvm_vmexit_process(ri, h, v);
             break;
diff --git a/xen/arch/x8

[PATCH 2/3] tools/xenalyze: Ignore HVM_EMUL events harder

2024-04-26 Thread George Dunlap
To unify certain common sanity checks, xenalyze performs them very
early in processing, based only on the top-level trace type.

Unfortunately, when TRC_HVM_EMUL was introduced, it broke some of the
assumptions about how the top-level types worked.  Namely, traces of
this type will show up outside of HVM contexts: in idle domains and in
PV domains.

Make an explicit exception for TRC_HVM_EMUL types in a number of places:

 - Pass the record info pointer to toplevel_assert_check, so that it
   can exclude TRC_HVM_EMUL records from idle and vcpu data_mode
   checks

 - Don't attempt to set the vcpu data_type in hvm_process for
   TRC_HVM_EMUL records.

Signed-off-by: George Dunlap 
---
CC: Andrew Cooper 
CC: Anthony Perard 
CC: Olaf Hering 
---
 tools/xentrace/xenalyze.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tools/xentrace/xenalyze.c b/tools/xentrace/xenalyze.c
index ceb07229d1..8fb45dec76 100644
--- a/tools/xentrace/xenalyze.c
+++ b/tools/xentrace/xenalyze.c
@@ -21,6 +21,7 @@
 #define _XOPEN_SOURCE 600
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -5305,8 +5306,11 @@ void hvm_process(struct pcpu_info *p)
 
     assert(p->current);
 
-    if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
-        return;
+    /* HVM_EMUL types show up in all contexts */
+    if(ri->evt.sub != 0x4) {
+        if(vcpu_set_data_type(p->current, VCPU_DATA_HVM))
+            return;
+    }
 
 switch ( ri->evt.sub ) {
 case 2: /* HVM_HANDLER */
@@ -9447,9 +9451,10 @@ static struct tl_assert_mask 
tl_assert_checks[TOPLEVEL_MAX] = {
 /* There are a lot of common assumptions for the various processing
  * routines.  Check them all in one place, doing something else if
  * they don't pass. */
-int toplevel_assert_check(int toplevel, struct pcpu_info *p)
+int toplevel_assert_check(int toplevel, struct record_info *ri, struct 
pcpu_info *p)
 {
     struct tl_assert_mask mask;
+    bool is_hvm_emul = (toplevel == TOPLEVEL_HVM) && (ri->evt.sub == 0x4);
 
     mask = tl_assert_checks[toplevel];
 
@@ -9459,7 +9464,7 @@ int toplevel_assert_check(int toplevel, struct pcpu_info 
*p)
 goto fail;
 }
 
-    if( mask.not_idle_domain )
+    if( mask.not_idle_domain && !is_hvm_emul)
     {
         /* Can't do this check w/o first doing above check */
         assert(mask.p_current);
@@ -9478,7 +9483,8 @@ int toplevel_assert_check(int toplevel, struct pcpu_info 
*p)
     v = p->current;
 
     if ( ! (v->data_type == VCPU_DATA_NONE
-            || v->data_type == mask.vcpu_data_mode) )
+            || v->data_type == mask.vcpu_data_mode
+            || is_hvm_emul) )
 {
 /* This may happen for track_dirty_vram, which causes a 
SHADOW_WRMAP_BF trace f/ dom0 */
 fprintf(warn, "WARNING: Unexpected vcpu data type for d%dv%d on 
proc %d! Expected %d got %d. Not processing\n",
@@ -9525,7 +9531,7 @@ void process_record(struct pcpu_info *p) {
         return;
 
     /* Unify toplevel assertions */
-    if ( toplevel_assert_check(toplevel, p) )
+    if ( toplevel_assert_check(toplevel, ri, p) )
 {
 switch(toplevel) {
 case TRC_GEN_MAIN:
-- 
2.25.1




[PATCH 0/3] Further trace improvements

2024-04-26 Thread George Dunlap
Some further trace improvements:

 - Remove the need to specify `--svm-mode` every time running xenalyze
   on a trace generated on an AMD box.

 - Get rid of warnings due to unhandled HVM_EMUL traces

 - Completely remove obsolete xentrace_format

This series is meant to be applied on top of Andy's series "xen/trace:
Treewide API cleanup":

https://lore.kernel.org/xen-devel/20240318163552.3808695-1-andrew.coop...@citrix.com/

George Dunlap (3):
  x86/hvm/trace: Use a different trace type for AMD processors
  tools/xenalyze: Ignore HVM_EMUL events harder
  tools/xentrace: Remove xentrace_format

 docs/man/xentrace.8.pod|   5 +-
 docs/man/xentrace_format.1.pod |  46 --
 tools/xentrace/Makefile|   3 -
 tools/xentrace/formats | 229 
 tools/xentrace/xenalyze.c  |  50 ---
 tools/xentrace/xentrace.c  |   2 +-
 tools/xentrace/xentrace_format | 264 -
 xen/arch/x86/hvm/svm/svm.c |   4 +-
 xen/arch/x86/hvm/vmx/vmx.c |   4 +-
 xen/include/public/trace.h |   6 +-
 10 files changed, 41 insertions(+), 572 deletions(-)
 delete mode 100644 docs/man/xentrace_format.1.pod
 delete mode 100644 tools/xentrace/formats
 delete mode 100644 tools/xentrace/xentrace_format

-- 
2.25.1




Re: [PATCH] svm: Fix MISRA 8.2 violation

2024-04-26 Thread George Dunlap
On Thu, Apr 25, 2024 at 5:00 PM Andrew Cooper  wrote:
>
> On 25/04/2024 10:12 am, George Dunlap wrote:
> > Misra 8.2 requires named parameters in prototypes.  Use the name from
> > the implementaiton.
> >
> > Fixes 0d19d3aab0 ("svm/nestedsvm: Introduce nested capabilities bit").
>
> Nit.  We treat Fixes: as a tag.

Oops -- I'll fix that on check-in.  Gmail is also underlining my
misspelling of "implementation" in red, so I'll fix that as well.

> Also you might want to include the Reported-by's.

I thought about adding a Reported-by; but this was reported by Eclair,
which is run automatically, right?  We typically don't give a
"Reported-by" to the person who flagged up the osstest or gitlab
failure, so I figured this would be similar.

I'm happy to give credit where it's due, however.  You were the first
one who reported it to me, but Nicola also raised it; and I guess it's
someone else who's paying for the Eclair runs.  So what "Reported-by"
did you have in mind?

> Otherwise, Acked-by: Andrew Cooper  although
> I have a strong wish to shorten the parameter name.  That can be done at
> a later point.

Thanks,
 -George



Re: [PATCH net] xen-netfront: Add missing skb_mark_for_recycle

2024-04-25 Thread George Dunlap
Greg,

We're issuing an XSA for this; can you issue a CVE?

Thanks,
 -George Dunlap

On Tue, Apr 2, 2024 at 9:25 PM Arthur Borsboom 
wrote:

> After having a better look, I have found the patch in linux-next
>
>
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=0cd74ffcf4fb0536718241d59d2c124578624d83
>
> On Tue, 2 Apr 2024 at 10:20, Arthur Borsboom 
> wrote:
> >
> > On Fri, 29 Mar 2024 at 10:47, Arthur Borsboom 
> wrote:
> > >
> > > On Wed, 27 Mar 2024 at 13:15, Jesper Dangaard Brouer 
> wrote:
> > > >
> > > > Notice that skb_mark_for_recycle() is introduced later than fixes
> tag in
> > > > 6a5bcd84e886 ("page_pool: Allow drivers to hint on SKB recycling").
> > > >
> > > > It is believed that fixes tag were missing a call to
> page_pool_release_page()
> > > > between v5.9 to v5.14, after which is should have used
> skb_mark_for_recycle().
> > > > Since v6.6 the call page_pool_release_page() were removed (in
> 535b9c61bdef
> > > > ("net: page_pool: hide page_pool_release_page()") and remaining
> callers
> > > > converted (in commit 6bfef2ec0172 ("Merge branch
> > > > 'net-page_pool-remove-page_pool_release_page'")).
> > > >
> > > > This leak became visible in v6.8 via commit dba1b8a7ab68
> ("mm/page_pool: catch
> > > > page_pool memory leaks").
> > > >
> > > > Fixes: 6c5aa6fc4def ("xen networking: add basic XDP support for
> xen-netfront")
> > > > Reported-by: Arthur Borsboom 
> > > > Signed-off-by: Jesper Dangaard Brouer 
> > > > ---
> > > > Compile tested only, can someone please test this
> > >
> > > I have tested this patch on Xen 4.18.1 with VM (Arch Linux) kernel
> 6.9.0-rc1.
> > >
> > > I have tested this patch on Xen 4.18.1 with VM (Arch Linux) kernel
> > > mainline git repository resulted in failures (same with kernel 6.8.1).
> > > The patched kernel 6.9.0-rc1 performs as expected; cloning the git
> > > repository was successful and no kernel traces observed.
> > > Hereby my tested by:
> > >
> > > Tested-by: Arthur Borsboom 
> > >
> > >
> > >
> > > >  drivers/net/xen-netfront.c |1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
> > > > index ad29f370034e..8d2aee88526c 100644
> > > > --- a/drivers/net/xen-netfront.c
> > > > +++ b/drivers/net/xen-netfront.c
> > > > @@ -285,6 +285,7 @@ static struct sk_buff
> *xennet_alloc_one_rx_buffer(struct netfront_queue *queue)
> > > > return NULL;
> > > > }
> > > > skb_add_rx_frag(skb, 0, page, 0, 0, PAGE_SIZE);
> > > > +   skb_mark_for_recycle(skb);
> > > >
> > > > /* Align ip header to a 16 bytes boundary */
> > > > skb_reserve(skb, NET_IP_ALIGN);
> > > >
> > > >
> >
> > I don't see this patch yet in linux-next.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/log
> >
> > Any idea in which kernel release this patch will be included?
>


[PATCH] svm: Fix MISRA 8.2 violation

2024-04-25 Thread George Dunlap
Misra 8.2 requires named parameters in prototypes.  Use the name from
the implementaiton.

Fixes 0d19d3aab0 ("svm/nestedsvm: Introduce nested capabilities bit").

Signed-off-by: George Dunlap 
---
CC: Andrew Cooper 
CC: Jan Beulich 
CC: Roger Pau Monne 
CC: Nicola Vetrini 
---
 xen/arch/x86/include/asm/hvm/nestedhvm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/include/asm/hvm/nestedhvm.h 
b/xen/arch/x86/include/asm/hvm/nestedhvm.h
index 0568acb25f..ea2c1bc328 100644
--- a/xen/arch/x86/include/asm/hvm/nestedhvm.h
+++ b/xen/arch/x86/include/asm/hvm/nestedhvm.h
@@ -83,7 +83,7 @@ static inline bool vvmcx_valid(const struct vcpu *v)
 }
 
 
-void start_nested_svm(struct hvm_function_table *);
-void start_nested_vmx(struct hvm_function_table *);
+void start_nested_svm(struct hvm_function_table *hvm_function_table);
+void start_nested_vmx(struct hvm_function_table *hvm_function_table);
 
 #endif /* _HVM_NESTEDHVM_H */
-- 
2.25.1




Re: [PATCH v2 1/3] tools/golang: When returning pointers, actually allocate structures

2024-04-19 Thread George Dunlap
On Fri, Apr 19, 2024 at 4:27 PM Nick Rosbrook  wrote:
>
> On Fri, Apr 19, 2024 at 10:00 AM George Dunlap  
> wrote:
> >
> > In a handful of cases, it was decided to return a pointer to a
> > structure rather than the plain structure itself, due to the size.
> > However, in these cases the structure was never allocated, leading to
> > a nil pointer exception when calling the relevant `fromC` method.
> >
> > Allocate structures before attempting to fill them in.
> >
> > Fixes: 453713b1750 ("golang/xenlight: Add host-related functionality")
> > Reported-by: Tobias Fitschen 
> > Signed-off-by: George Dunlap 
> > Tested-by: Tobias Fitschen 
>
> Acked-by: Nick Rosbrook 
>
> This is one of the reasons I don't like using named return values (if
> you can help it) and naked returns. When you declare the variable
> explicitly you are less likely to explicitly initialize it to nil. Or,
> it's a compile time error if you forget all together.

Yes, after having more experience in golang I wouldn't be opposed to
getting rid of all those now.  But as I'm hoping to backport this as a
bug fix, that would be a subsequent patch. :-)

Thanks,
 -George



Re: Detecting whether dom0 is in a VM

2024-04-19 Thread George Dunlap
On Fri, Apr 19, 2024 at 4:29 PM George Dunlap  wrote:
>
> On Fri, Jul 7, 2023 at 3:56 PM George Dunlap  wrote:
> >> >>> Xen's public interface offers access to the featuresets known / found /
> >> >>> used by the hypervisor. See XEN_SYSCTL_get_cpu_featureset, accessible
> >> >>> via xc_get_cpu_featureset().
> >> >>>
> >> >>
> >> >> Are any of these exposed in dom0 via sysctl, or hypfs?
> >> >
> >> > sysctl - yes (as the quoted name also says). hypfs no, afaict.
> >> >
> >> >>  SYSCTLs are
> >> >> unfortunately not stable interfaces, correct?  So it wouldn't be 
> >> >> practical
> >> >> for systemd to use them.
> >> >
> >> > Indeed, neither sysctl-s nor the libxc interfaces are stable.
> >>
> >> Thinking of it, xen-cpuid is a wrapper tool around those. They may want
> >> to look at its output (and, if they want to use it, advise distros to
> >> also package it), which I think we try to keep reasonably stable,
> >> albeit without providing any guarantees.
> >
> >
> > We haven't had any clear guidance yet on what the systemd team want in the 
> >  question; I just sort of assumed they 
> > wanted the L-1 virtualization *if possible*.  It sounds like `vm-other` 
> > would be acceptable, particularly as a fall-back output if there's no way 
> > to get Xen's picture of the cpuid.
> >
> > It looks like xen-cpuid is available on Fedora, Debian, Ubuntu, and the old 
> > Virt SIG CentOS packages; so I'd expect most packages to follow suit.  
> > That's a place to start.
> >
> > Just to take the discussion all the way to its conclusion:
> >
> > - Supposing xen-cpuid isn't available, is there any other way to tell if 
> > Xen is running in a VM from dom0?
> >
> > - Would it make sense to expose that information somewhere, either in sysfs 
> > or in hypfs (or both?), so that eventually even systems which may not get 
> > the memo about packaging xen-cpuid will get support (or if the systemd guys 
> > would rather avoid executing another process if possible)?
>
> Resurrecting this thread.
>
> To recap:
>
> Currently, `systemd-detect-virt` has the following  input / output table:
>
> 1. Xen on real hardware, domU: xen
> 2. Xen on real hardware, dom0: vm-other
> 3. Xen in a VM, domU: xen
> 4. Xen in a VM, dom0: vm-other
>
> It's the dom0 lines (2 and 4) which are problematic; we'd ideally like
> those to be `none` and `vm-other` (or the actual value, like `xen` or
> `kvm`).
>
> It looks like ATM, /proc/xen/capabilities will contain `control_d` if
> it's a dom0.  Simply unilaterally returning `none` if
> /proc/xen/capabilities contains `control_d` would correct the vast
> majority of instances (since the number of instances of Xen on real
> hardware is presumably higher than the number running virtualized).
>
> Is /proc/xen/capabilities expected to stay around?  I don't see
> anything equivalent in /dev/xen.
>
> Other than adding a new interface to Xen, is there any way to tell if
> Xen is running in a VM?  If we do need to expose an interface, what
> would be the best way to do that?

Actually, the entire code for detection is here:

https://github.com/systemd/systemd/blob/f5fefec786e35ef7606e2b14e2530878b74ca879/src/basic/virt.c

The core issue seems to be this bit:

/* Detect from CPUID */
v = detect_vm_cpuid();
if (v < 0)
        return v;
if (v == VIRTUALIZATION_VM_OTHER)
        other = true;
else if (v != VIRTUALIZATION_NONE)
        goto finish;

/* If we are in Dom0 and have not yet finished, finish with
   the result of detect_vm_cpuid */
if (xen_dom0 > 0)
        goto finish;

Basically, the code expects that dom0 will be able to read the raw
virtualization CPUID leaves, and that the result will be
VIRTUALIZATION_NONE on real hardware.

But in fact, the result appears to be VIRTUALIZATION_VM_OTHER -- which
would mean that 1) the CPUID instruction for reading CPUID leaf 0x1
succeeded, 2) the 'hypervisor' bit was set, but 3) the signature at
the hypervisor information leaf (0x40000000) didn't match any
hypervisor (including XenVMMXenVMM, which it's looking for).

What do we do in regard to these for dom0?

 -George



Re: Detecting whether dom0 is in a VM

2024-04-19 Thread George Dunlap
On Fri, Jul 7, 2023 at 3:56 PM George Dunlap  wrote:
>> >>> Xen's public interface offers access to the featuresets known / found /
>> >>> used by the hypervisor. See XEN_SYSCTL_get_cpu_featureset, accessible
>> >>> via xc_get_cpu_featureset().
>> >>>
>> >>
>> >> Are any of these exposed in dom0 via sysctl, or hypfs?
>> >
>> > sysctl - yes (as the quoted name also says). hypfs no, afaict.
>> >
>> >>  SYSCTLs are
>> >> unfortunately not stable interfaces, correct?  So it wouldn't be practical
>> >> for systemd to use them.
>> >
>> > Indeed, neither sysctl-s nor the libxc interfaces are stable.
>>
>> Thinking of it, xen-cpuid is a wrapper tool around those. They may want
>> to look at its output (and, if they want to use it, advise distros to
>> also package it), which I think we try to keep reasonably stable,
>> albeit without providing any guarantees.
>
>
> We haven't had any clear guidance yet on what the systemd team want in the 
>  question; I just sort of assumed they wanted 
> the L-1 virtualization *if possible*.  It sounds like `vm-other` would be 
> acceptable, particularly as a fall-back output if there's no way to get Xen's 
> picture of the cpuid.
>
> It looks like xen-cpuid is available on Fedora, Debian, Ubuntu, and the old 
> Virt SIG CentOS packages; so I'd expect most packages to follow suit.  That's 
> a place to start.
>
> Just to take the discussion all the way to its conclusion:
>
> - Supposing xen-cpuid isn't available, is there any other way to tell if Xen 
> is running in a VM from dom0?
>
> - Would it make sense to expose that information somewhere, either in sysfs 
> or in hypfs (or both?), so that eventually even systems which may not get the 
> memo about packaging xen-cpuid will get support (or if the systemd guys would 
> rather avoid executing another process if possible)?

Resurrecting this thread.

To recap:

Currently, `systemd-detect-virt` has the following  input / output table:

1. Xen on real hardware, domU: xen
2. Xen on real hardware, dom0: vm-other
3. Xen in a VM, domU: xen
4. Xen in a VM, dom0: vm-other

It's the dom0 lines (2 and 4) which are problematic; we'd ideally like
those to be `none` and `vm-other` (or the actual value, like `xen` or
`kvm`).

It looks like ATM, /proc/xen/capabilities will contain `control_d` if
it's a dom0.  Simply unilaterally returning `none` if
/proc/xen/capabilities contains `control_d` would correct the vast
majority of instances (since the number of instances of Xen on real
hardware is presumably higher than the number running virtualized).

Is /proc/xen/capabilities expected to stay around?  I don't see
anything equivalent in /dev/xen.

Other than adding a new interface to Xen, is there any way to tell if
Xen is running in a VM?  If we do need to expose an interface, what
would be the best way to do that?

 -George



[PATCH v2 3/3] tools/golang: Run `go vet` as part of the build process

2024-04-19 Thread George Dunlap
Signed-off-by: George Dunlap 
---
CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/golang/xenlight/Makefile b/tools/golang/xenlight/Makefile
index c5bb6b94a8..645e7b3a82 100644
--- a/tools/golang/xenlight/Makefile
+++ b/tools/golang/xenlight/Makefile
@@ -29,6 +29,7 @@ $(subst .gen.,.%.,$(GOXL_GEN_FILES)): gengotypes.py 
$(LIBXL_SRC_DIR)/libxl_types
 # so that it can find the actual library.
 .PHONY: build
 build: xenlight.go $(GOXL_GEN_FILES)
+   CGO_CFLAGS="$(CFLAGS_libxenlight) $(CFLAGS_libxentoollog) 
$(APPEND_CFLAGS)" CGO_LDFLAGS="$(call xenlibs-ldflags,light toollog) 
$(APPEND_LDFLAGS)" $(GO) vet
CGO_CFLAGS="$(CFLAGS_libxenlight) $(CFLAGS_libxentoollog) 
$(APPEND_CFLAGS)" CGO_LDFLAGS="$(call xenlibs-ldflags,light toollog) 
$(APPEND_LDFLAGS)" $(GO) build -x
 
 .PHONY: install
-- 
2.25.1




[PATCH v2 1/3] tools/golang: When returning pointers, actually allocate structures

2024-04-19 Thread George Dunlap
In a handful of cases, it was decided to return a pointer to a
structure rather than the plain structure itself, due to the size.
However, in these cases the structure was never allocated, leading to
a nil pointer exception when calling the relevant `fromC` method.

Allocate structures before attempting to fill them in.

Fixes: 453713b1750 ("golang/xenlight: Add host-related functionality")
Reported-by: Tobias Fitschen 
Signed-off-by: George Dunlap 
Tested-by: Tobias Fitschen 
---
v2:
 - Added Fixes: tag
 - Added Tested-by tag

NB this is a candidate for backport.

CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/xenlight.go | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/golang/xenlight/xenlight.go 
b/tools/golang/xenlight/xenlight.go
index a45c636952..d793f172e5 100644
--- a/tools/golang/xenlight/xenlight.go
+++ b/tools/golang/xenlight/xenlight.go
@@ -999,6 +999,7 @@ func (ctx *Context) GetPhysinfo() (physinfo *Physinfo, err 
error) {
err = Error(ret)
return
}
+   physinfo = &Physinfo{}
err = physinfo.fromC(&cphys)
 
return
@@ -1010,6 +1011,7 @@ func (ctx *Context) GetVersionInfo() (info *VersionInfo, 
err error) {
 
cinfo = C.libxl_get_version_info(ctx.ctx)
 
+   info = &VersionInfo{}
err = info.fromC(cinfo)
 
return
@@ -1027,6 +1029,7 @@ func (ctx *Context) DomainInfo(Id Domid) (di *Dominfo, 
err error) {
return
}
 
+   di = &Dominfo{}
err = di.fromC()
 
return
-- 
2.25.1




[PATCH v2 2/3] golang/xenlight: Ensure comments aren't interpreted as docstrings

2024-04-19 Thread George Dunlap
Go has always interpreted a comment directly before a function as a
docstring, so having the C function prototypes which the golang method
is meant to wrap in the comment before was always a bit non-standard.
However, recent versions of `go fmt` now attempt to normalize these
docstrings as well, leading to strange changes, particularly if `go
fmt` is run on save.

Go through and put a space between non-docstring comments and methods,
so that `go fmt` leaves the comments alone.

No functional change.

Signed-off-by: George Dunlap 
---
v2:
 - New (replaced previous `go fmt` patch)

CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/xenlight.go | 28 
 1 file changed, 28 insertions(+)

diff --git a/tools/golang/xenlight/xenlight.go 
b/tools/golang/xenlight/xenlight.go
index d793f172e5..7f08657187 100644
--- a/tools/golang/xenlight/xenlight.go
+++ b/tools/golang/xenlight/xenlight.go
@@ -562,6 +562,7 @@ func (dt DomainType) String() (str string) {
 }
 
 // const char *libxl_scheduler_to_string(libxl_scheduler p);
+
 func (s Scheduler) String() string {
cs := C.libxl_scheduler_to_string(C.libxl_scheduler(s))
// No need to free const return value
@@ -570,6 +571,7 @@ func (s Scheduler) String() string {
 }
 
 // int libxl_scheduler_from_string(const char *s, libxl_scheduler *e);
+
 func (s *Scheduler) FromString(gstr string) (err error) {
*s, err = SchedulerFromString(gstr)
return
@@ -594,6 +596,7 @@ func SchedulerFromString(name string) (s Scheduler, err 
error) {
 
 // libxl_cpupoolinfo * libxl_list_cpupool(libxl_ctx*, int *nb_pool_out);
 // void libxl_cpupoolinfo_list_free(libxl_cpupoolinfo *list, int nb_pool);
+
 func (ctx *Context) ListCpupool() (list []Cpupoolinfo) {
var nbPool C.int
 
@@ -617,6 +620,7 @@ func (ctx *Context) ListCpupool() (list []Cpupoolinfo) {
 }
 
 // int libxl_cpupool_info(libxl_ctx *ctx, libxl_cpupoolinfo *info, uint32_t 
poolid);
+
 func (ctx *Context) CpupoolInfo(Poolid uint32) (pool Cpupoolinfo, err error) {
var c_cpupool C.libxl_cpupoolinfo
 
@@ -638,6 +642,7 @@ func (ctx *Context) CpupoolInfo(Poolid uint32) (pool 
Cpupoolinfo, err error) {
 //  uint32_t *poolid);
 // FIXME: uuid
 // FIXME: Setting poolid
+
 func (ctx *Context) CpupoolCreate(Name string, Scheduler Scheduler, Cpumap 
Bitmap) (err error, Poolid uint32) {
poolid := C.uint32_t(C.LIBXL_CPUPOOL_POOLID_ANY)
name := C.CString(Name)
@@ -666,6 +671,7 @@ func (ctx *Context) CpupoolCreate(Name string, Scheduler 
Scheduler, Cpumap Bitma
 }
 
 // int libxl_cpupool_destroy(libxl_ctx *ctx, uint32_t poolid);
+
 func (ctx *Context) CpupoolDestroy(Poolid uint32) (err error) {
ret := C.libxl_cpupool_destroy(ctx.ctx, C.uint32_t(Poolid))
if ret != 0 {
@@ -677,6 +683,7 @@ func (ctx *Context) CpupoolDestroy(Poolid uint32) (err 
error) {
 }
 
 // int libxl_cpupool_cpuadd(libxl_ctx *ctx, uint32_t poolid, int cpu);
+
 func (ctx *Context) CpupoolCpuadd(Poolid uint32, Cpu int) (err error) {
ret := C.libxl_cpupool_cpuadd(ctx.ctx, C.uint32_t(Poolid), C.int(Cpu))
if ret != 0 {
@@ -689,6 +696,7 @@ func (ctx *Context) CpupoolCpuadd(Poolid uint32, Cpu int) 
(err error) {
 
 // int libxl_cpupool_cpuadd_cpumap(libxl_ctx *ctx, uint32_t poolid,
 // const libxl_bitmap *cpumap);
+
 func (ctx *Context) CpupoolCpuaddCpumap(Poolid uint32, Cpumap Bitmap) (err 
error) {
var cbm C.libxl_bitmap
if err = Cpumap.toC(); err != nil {
@@ -706,6 +714,7 @@ func (ctx *Context) CpupoolCpuaddCpumap(Poolid uint32, 
Cpumap Bitmap) (err error
 }
 
 // int libxl_cpupool_cpuremove(libxl_ctx *ctx, uint32_t poolid, int cpu);
+
 func (ctx *Context) CpupoolCpuremove(Poolid uint32, Cpu int) (err error) {
ret := C.libxl_cpupool_cpuremove(ctx.ctx, C.uint32_t(Poolid), 
C.int(Cpu))
if ret != 0 {
@@ -718,6 +727,7 @@ func (ctx *Context) CpupoolCpuremove(Poolid uint32, Cpu 
int) (err error) {
 
 // int libxl_cpupool_cpuremove_cpumap(libxl_ctx *ctx, uint32_t poolid,
 //const libxl_bitmap *cpumap);
+
 func (ctx *Context) CpupoolCpuremoveCpumap(Poolid uint32, Cpumap Bitmap) (err 
error) {
var cbm C.libxl_bitmap
if err = Cpumap.toC(); err != nil {
@@ -735,6 +745,7 @@ func (ctx *Context) CpupoolCpuremoveCpumap(Poolid uint32, 
Cpumap Bitmap) (err er
 }
 
 // int libxl_cpupool_rename(libxl_ctx *ctx, const char *name, uint32_t poolid);
+
 func (ctx *Context) CpupoolRename(Name string, Poolid uint32) (err error) {
name := C.CString(Name)
defer C.free(unsafe.Pointer(name))
@@ -749,6 +760,7 @@ func (ctx *Context) CpupoolRename(Name string, Poolid 
uint32) (err error) {
 }
 
 // int libxl_cpupool_cpuadd_node(libxl_ctx *ctx, uint32_t poolid, int node, 
int *cpus);
+
 func (ctx *Context) CpupoolCpuaddNode(Poolid uint32, Node int) (Cpus int, err 
error) {
ccpus := C.int(0)
 
@@ -764,6 +776,7 @@ func (ctx

Re: [PATCH 2/3] golang/xenlight: Run `go fmt` on non-generated golang files

2024-04-19 Thread George Dunlap
On Fri, Apr 19, 2024 at 11:59 AM George Dunlap  wrote:
>
> No functional change.
>
> Signed-off-by: George Dunlap 
> ---
> CC: Nick Rosbrook 
> CC: Anthony PERARD 
> ---
>  tools/golang/xenlight/xenlight.go | 55 +--
>  1 file changed, 30 insertions(+), 25 deletions(-)
>
> diff --git a/tools/golang/xenlight/xenlight.go 
> b/tools/golang/xenlight/xenlight.go
> index d793f172e5..0ff28590ee 100644
> --- a/tools/golang/xenlight/xenlight.go
> +++ b/tools/golang/xenlight/xenlight.go
> @@ -633,9 +633,11 @@ func (ctx *Context) CpupoolInfo(Poolid uint32) (pool 
> Cpupoolinfo, err error) {
>  }
>
>  // int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
> -//  libxl_scheduler sched,
> -//  libxl_bitmap cpumap, libxl_uuid *uuid,
> -//  uint32_t *poolid);
> +//
> +// libxl_scheduler sched,
> +// libxl_bitmap cpumap, libxl_uuid *uuid,
> +// uint32_t *poolid);

Gah, sorry, I forgot to delete some of these blank lines that were
added in.  I'll fix it up in v2.

 -George



[PATCH 2/3] golang/xenlight: Run `go fmt` on non-generated golang files

2024-04-19 Thread George Dunlap
No functional change.

Signed-off-by: George Dunlap 
---
CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/xenlight.go | 55 +--
 1 file changed, 30 insertions(+), 25 deletions(-)

diff --git a/tools/golang/xenlight/xenlight.go 
b/tools/golang/xenlight/xenlight.go
index d793f172e5..0ff28590ee 100644
--- a/tools/golang/xenlight/xenlight.go
+++ b/tools/golang/xenlight/xenlight.go
@@ -633,9 +633,11 @@ func (ctx *Context) CpupoolInfo(Poolid uint32) (pool 
Cpupoolinfo, err error) {
 }
 
 // int libxl_cpupool_create(libxl_ctx *ctx, const char *name,
-//  libxl_scheduler sched,
-//  libxl_bitmap cpumap, libxl_uuid *uuid,
-//  uint32_t *poolid);
+//
+// libxl_scheduler sched,
+// libxl_bitmap cpumap, libxl_uuid *uuid,
+// uint32_t *poolid);
+//
 // FIXME: uuid
 // FIXME: Setting poolid
 func (ctx *Context) CpupoolCreate(Name string, Scheduler Scheduler, Cpumap 
Bitmap) (err error, Poolid uint32) {
@@ -688,7 +690,8 @@ func (ctx *Context) CpupoolCpuadd(Poolid uint32, Cpu int) 
(err error) {
 }
 
 // int libxl_cpupool_cpuadd_cpumap(libxl_ctx *ctx, uint32_t poolid,
-// const libxl_bitmap *cpumap);
+//
+// const libxl_bitmap *cpumap);
 func (ctx *Context) CpupoolCpuaddCpumap(Poolid uint32, Cpumap Bitmap) (err 
error) {
var cbm C.libxl_bitmap
if err = Cpumap.toC(); err != nil {
@@ -717,7 +720,8 @@ func (ctx *Context) CpupoolCpuremove(Poolid uint32, Cpu 
int) (err error) {
 }
 
 // int libxl_cpupool_cpuremove_cpumap(libxl_ctx *ctx, uint32_t poolid,
-//const libxl_bitmap *cpumap);
+//
+// const libxl_bitmap *cpumap);
 func (ctx *Context) CpupoolCpuremoveCpumap(Poolid uint32, Cpumap Bitmap) (err 
error) {
var cbm C.libxl_bitmap
if err = Cpumap.toC(); err != nil {
@@ -789,9 +793,7 @@ func (ctx *Context) CpupoolMovedomain(Poolid uint32, Id 
Domid) (err error) {
return
 }
 
-//
 // Utility functions
-//
 func (ctx *Context) CpupoolFindByName(name string) (info Cpupoolinfo, found 
bool) {
plist := ctx.ListCpupool()
 
@@ -939,7 +941,7 @@ func (bm Bitmap) String() (s string) {
return
 }
 
-//int libxl_get_max_cpus(libxl_ctx *ctx);
+// int libxl_get_max_cpus(libxl_ctx *ctx);
 func (ctx *Context) GetMaxCpus() (maxCpus int, err error) {
ret := C.libxl_get_max_cpus(ctx.ctx)
if ret < 0 {
@@ -950,7 +952,7 @@ func (ctx *Context) GetMaxCpus() (maxCpus int, err error) {
return
 }
 
-//int libxl_get_online_cpus(libxl_ctx *ctx);
+// int libxl_get_online_cpus(libxl_ctx *ctx);
 func (ctx *Context) GetOnlineCpus() (onCpus int, err error) {
ret := C.libxl_get_online_cpus(ctx.ctx)
if ret < 0 {
@@ -961,7 +963,7 @@ func (ctx *Context) GetOnlineCpus() (onCpus int, err error) 
{
return
 }
 
-//int libxl_get_max_nodes(libxl_ctx *ctx);
+// int libxl_get_max_nodes(libxl_ctx *ctx);
 func (ctx *Context) GetMaxNodes() (maxNodes int, err error) {
ret := C.libxl_get_max_nodes(ctx.ctx)
if ret < 0 {
@@ -972,7 +974,7 @@ func (ctx *Context) GetMaxNodes() (maxNodes int, err error) 
{
return
 }
 
-//int libxl_get_free_memory(libxl_ctx *ctx, uint64_t *memkb);
+// int libxl_get_free_memory(libxl_ctx *ctx, uint64_t *memkb);
 func (ctx *Context) GetFreeMemory() (memkb uint64, err error) {
var cmem C.uint64_t
	ret := C.libxl_get_free_memory(ctx.ctx, &cmem)
@@ -987,7 +989,7 @@ func (ctx *Context) GetFreeMemory() (memkb uint64, err 
error) {
 
 }
 
-//int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo)
+// int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo)
 func (ctx *Context) GetPhysinfo() (physinfo *Physinfo, err error) {
var cphys C.libxl_physinfo
C.libxl_physinfo_init(&cphys)
@@ -1005,7 +1007,7 @@ func (ctx *Context) GetPhysinfo() (physinfo *Physinfo, 
err error) {
return
 }
 
-//const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx);
+// const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx);
 func (ctx *Context) GetVersionInfo() (info *VersionInfo, err error) {
var cinfo *C.libxl_version_info
 
@@ -1044,7 +1046,7 @@ func (ctx *Context) DomainUnpause(Id Domid) (err error) {
return
 }
 
-//int libxl_domain_pause(libxl_ctx *ctx, uint32_t domain);
+// int libxl_domain_pause(libxl_ctx *ctx, uint32_t domain);
 func (ctx *Context) DomainPause(id Domid) (err error) {
ret := C.libxl_domain_pause(ctx.ctx, C.uint32_t(id), nil)
 
@@ -1054,7 +1056,7 @@ func (ctx *Context) DomainPause(id Domid) (err error) {
return
 }
 
-//int libxl_domain_shutdown(libxl_ctx *ctx, uint32_t domid);
+// int libxl_domain_shutdown(libxl_ctx *ctx, uint32_t domid);
 func (ctx *Context) DomainShutdown(id Domid) (err error) {
ret := C.libxl_domain_shutdown(ctx.ctx, C.uint32_t(id), nil)
 
@@ -1064,7 +1066,7 @@ func (ctx *Context) DomainShutdown(id Do

[PATCH 1/3] tools/golang: When returning pointers, actually allocate structures

2024-04-19 Thread George Dunlap
In a handful of cases, it was decided to return a pointer to a
structure rather than the plain structure itself, due to the size.
However, in these cases the structure was never allocated, leading to
a nil pointer dereference when calling the relevant `fromC` method.

Allocate structures before attempting to fill them in.

Reported-by: Tobias Fitschen 
Signed-off-by: George Dunlap 
---
This has been compile-tested only; Tobias, I'd appreciate a test if you get a 
chance.

CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/xenlight.go | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/golang/xenlight/xenlight.go 
b/tools/golang/xenlight/xenlight.go
index a45c636952..d793f172e5 100644
--- a/tools/golang/xenlight/xenlight.go
+++ b/tools/golang/xenlight/xenlight.go
@@ -999,6 +999,7 @@ func (ctx *Context) GetPhysinfo() (physinfo *Physinfo, err 
error) {
err = Error(ret)
return
}
+   physinfo = &Physinfo{}
err = physinfo.fromC(&cphys)
 
return
@@ -1010,6 +1011,7 @@ func (ctx *Context) GetVersionInfo() (info *VersionInfo, 
err error) {
 
cinfo = C.libxl_get_version_info(ctx.ctx)
 
+   info = &VersionInfo{}
err = info.fromC(cinfo)
 
return
@@ -1027,6 +1029,7 @@ func (ctx *Context) DomainInfo(Id Domid) (di *Dominfo, 
err error) {
return
}
 
+   di = &Dominfo{}
err = di.fromC(&cdi)
 
return
-- 
2.25.1




[PATCH 3/3] tools/golang: Run `go vet` as part of the build process

2024-04-19 Thread George Dunlap
Signed-off-by: George Dunlap 
---
CC: Nick Rosbrook 
CC: Anthony PERARD 
---
 tools/golang/xenlight/Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/golang/xenlight/Makefile b/tools/golang/xenlight/Makefile
index c5bb6b94a8..645e7b3a82 100644
--- a/tools/golang/xenlight/Makefile
+++ b/tools/golang/xenlight/Makefile
@@ -29,6 +29,7 @@ $(subst .gen.,.%.,$(GOXL_GEN_FILES)): gengotypes.py 
$(LIBXL_SRC_DIR)/libxl_types
 # so that it can find the actual library.
 .PHONY: build
 build: xenlight.go $(GOXL_GEN_FILES)
+   CGO_CFLAGS="$(CFLAGS_libxenlight) $(CFLAGS_libxentoollog) 
$(APPEND_CFLAGS)" CGO_LDFLAGS="$(call xenlibs-ldflags,light toollog) 
$(APPEND_LDFLAGS)" $(GO) vet
CGO_CFLAGS="$(CFLAGS_libxenlight) $(CFLAGS_libxentoollog) 
$(APPEND_CFLAGS)" CGO_LDFLAGS="$(call xenlibs-ldflags,light toollog) 
$(APPEND_LDFLAGS)" $(GO) build -x
 
 .PHONY: install
-- 
2.25.1




[PATCH] x86/cpuid-policy: Add AMD SVM CPUID leaf to featureset

2024-04-17 Thread George Dunlap
Currently, the CPUID leaf for SVM features (extd 0xa.edx) is manually
twiddled:

 - hvm_max_policy takes host_policy and clamps it to supported
   features (with some features unilaterally enabled because they're
   always emulated)

 - hvm_default_policy is copied from there

 - When recalculate_policy() is called for a guest, if SVM is clear,
   then the entire leaf is zeroed out.

Move to a mode where the extended features are off by default, and
enabled when nested_virt is enabled.

In cpufeatureset.h, define a new featureset word for the AMD SVM
features, and declare all of the bits defined in
x86/include/asm/hvm/svm/svm.h.  Mark the ones we currently pass
through to the "max policy" as HAP-only and optional.

In cpu-policy.h, define FEATURESET_ead, and convert the un-named space
in struct cpu_policy into the appropriate union.  FIXME: Do this in a
prerequisite patch, and change all references to p->extd.raw[0xa].

Update x86_cpu_X_to_Y and Y_to_X to copy this into and out of the
appropriate leaf.

Populate this during boot in generic_identify().

Add the new featureset definition into libxl_cpuid.c.

Update the code in calculate_hvm_max_policy() to do nothing with the
"normal" CPUID bits, and use the feature bit to unconditionally enable
VMCBCLEAN. FIXME Move this to a follow-up patch.

In recalculate_cpuid_policy(), enable max_fs when nested_hvm() is
true.

Signed-off-by: George Dunlap 
---
CC: Andrew Cooper 
CC: Jan Beulich 
CC: Roger Pau Monne 
---
 tools/libs/light/libxl_cpuid.c  |  1 +
 xen/arch/x86/cpu-policy.c   | 19 +--
 xen/arch/x86/cpu/common.c   |  2 ++
 xen/include/public/arch-x86/cpufeatureset.h | 16 
 xen/include/xen/lib/x86/cpu-policy.h| 10 +-
 xen/lib/x86/cpuid.c |  4 +++-
 6 files changed, 40 insertions(+), 12 deletions(-)

diff --git a/tools/libs/light/libxl_cpuid.c b/tools/libs/light/libxl_cpuid.c
index ce4f3c7095..2c5749c3a0 100644
--- a/tools/libs/light/libxl_cpuid.c
+++ b/tools/libs/light/libxl_cpuid.c
@@ -342,6 +342,7 @@ int libxl_cpuid_parse_config(libxl_cpuid_policy_list 
*policy, const char* str)
 CPUID_ENTRY(0x0007,  1, CPUID_REG_EDX),
 MSR_ENTRY(0x10a, CPUID_REG_EAX),
 MSR_ENTRY(0x10a, CPUID_REG_EDX),
+CPUID_ENTRY(0x8000000a, NA, CPUID_REG_EDX),
 #undef MSR_ENTRY
 #undef CPUID_ENTRY
 };
diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 4b6d962763..4a5d1b916b 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -754,14 +754,8 @@ static void __init calculate_hvm_max_policy(void)
  */
 if ( p->extd.svm )
 {
-/* Clamp to implemented features which require hardware support. */
-p->extd.raw[0xa].d &= ((1u << SVM_FEATURE_NPT) |
-   (1u << SVM_FEATURE_LBRV) |
-   (1u << SVM_FEATURE_NRIPS) |
-   (1u << SVM_FEATURE_PAUSEFILTER) |
-   (1u << SVM_FEATURE_DECODEASSISTS));
 /* Enable features which are always emulated. */
-p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
+__set_bit(X86_FEATURE_VMCBCLEAN, fs);
 }
 
 guest_common_max_feature_adjustments(fs);
@@ -909,6 +903,14 @@ void recalculate_cpuid_policy(struct domain *d)
 __clear_bit(X86_FEATURE_VMX, max_fs);
 __clear_bit(X86_FEATURE_SVM, max_fs);
 }
+else
+{
+/* 
+ * Enable SVM features.  This will be empty on VMX
+ * hosts. 
+ */
+fs[FEATURESET_ead] = max_fs[FEATURESET_ead];
+}
 }
 
 /*
@@ -975,9 +977,6 @@ void recalculate_cpuid_policy(struct domain *d)
  ((vpmu_mode & XENPMU_MODE_ALL) && !is_hardware_domain(d)) )
 p->basic.raw[0xa] = EMPTY_LEAF;
 
-if ( !p->extd.svm )
-p->extd.raw[0xa] = EMPTY_LEAF;
-
 if ( !p->extd.page1gb )
 p->extd.raw[0x19] = EMPTY_LEAF;
 }
diff --git a/xen/arch/x86/cpu/common.c b/xen/arch/x86/cpu/common.c
index 28d7f34c4d..5093379a43 100644
--- a/xen/arch/x86/cpu/common.c
+++ b/xen/arch/x86/cpu/common.c
@@ -477,6 +477,8 @@ static void generic_identify(struct cpuinfo_x86 *c)
	c->x86_capability[FEATURESET_e7d] = cpuid_edx(0x80000007);
	if (c->extended_cpuid_level >= 0x80000008)
		c->x86_capability[FEATURESET_e8b] = cpuid_ebx(0x80000008);
+	if (c->extended_cpuid_level >= 0x8000000a)
+		c->x86_capability[FEATURESET_ead] = cpuid_edx(0x8000000a);
	if (c->extended_cpuid_level >= 0x80000021)
		c->x86_capability[FEATURESET_e21a] = cpuid_eax(0x80000021);
 
diff --git a/xen/include/public/arch-x86/cpufeatureset.h 
b/xen/include/public/arch-x86/cpufeatureset.h
index 53f13dec31..c5c712

Rewritten XSA status page, xsa.json

2024-04-15 Thread George Dunlap
Hey all,

Some of you may have noticed that xenbits.xenproject.org/xsa/ doesn't
currently list XSA-456.  This has prompted me to rewrite the perl code
which generates that area of the webpage into golang, which is much
easier for the current security team to understand and modify.  The
draft replacement is here:

https://xenbits.xenproject.org/people/gdunlap/xsa-draft/

In particular, if you use `xsa.json` for any purpose, let me know if
you have any issues.  If I don't hear any objections I'll replace the
website generation code on Wednesday.

Thanks,
 -George



Re: [XEN PATCH v2 1/2] xen/sched: address violations of MISRA C:2012 Rule 16.3

2024-04-04 Thread George Dunlap
On Thu, Apr 4, 2024 at 8:49 AM Federico Serafini
 wrote:
>
> Add break statement to address a violation of MISRA C:2012 Rule 16.3
> ("An unconditional `break' statement shall terminate every
> switch-clause").
> Replace deprecated comment /* FALLTHRU */ with pseudo-keyword
> fallthrough to meet the requirements to deviate the rule.
>
> No functional change.
>
> Signed-off-by: Federico Serafini 

Acked-by: George Dunlap 



Re: [PATCH] do_multicall and MISRA Rule 8.3

2024-04-03 Thread George Dunlap
On Tue, Mar 19, 2024 at 3:39 AM Stefano Stabellini
 wrote:
> > The main use of fixed width types, to me, is in interface structure
> > definitions - between Xen and hardware / firmware, or in hypercall
> > structures. I'm afraid I have a hard time seeing good uses outside of
> > that. Even in inline assembly, types of variables used for inputs or
> > outputs don't strictly need to be fixed-width; I can somewhat accept
> > their use there for documentation purposes.
>
>
> Non-ABI interfaces are OK with native types.
>
> Our ABI interfaces are, for better or for worse, described using C
> header files. Looking at these header files, it should be clear the size
> and alignment of all integer parameters.
>
> To that end, I think we should use fixed-width types in all ABIs,
> including hypercall entry points. In my opinion, C hypercall entry
> points are part of the ABI and should match the integer types used in
> the public header files. I don't consider the little assembly code on
> hypercall entry as important.

So as Jan pointed out in the recent call, the "actual" ABI is
"register X, Y, Z are arguments 1-3".  The key thing for me then is
whether it's clear what values in register X are acceptable.  If the C
function implementing the hypercall has an argument of type "unsigned
int", is it clear what values the guest is allowed to be put into the
corresponding register, and how they'll be interpreted, as opposed to
"unsigned long"?

If we can document the expectations, for each architecture, for how
the "native types" map to sizes, then I think that should be
sufficient.

 -George



Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit

2024-03-28 Thread George Dunlap
On Thu, Mar 28, 2024 at 6:44 AM Jan Beulich  wrote:
> > As to why to have each vendor independent code check for HAP -- there
> > are in fact two implementations of the code; it's nice to be able to
> > look in one place for each implementation to determine the
> > requirements.  Additionally, it would be possible, in the future, for
> > one of the nested virt implementations to enable shadow mode, while
> > the other one didn't.  The current arrangement makes that easy.
>
> Well, first - how likely is this, when instead we've been considering
> whether we could get rid of shadow mode? And then this is balancing
> between ease of future changes vs avoiding redundancy where it can be
> avoided. I'm not going to insist on centralizing the HAP check, but I
> certainly wanted the option to be considered.

I considered it before replying to you; so consider it considered. :-)

> >>> --- a/xen/arch/x86/hvm/nestedhvm.c
> >>> +++ b/xen/arch/x86/hvm/nestedhvm.c
> >>> @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void)
> >>>  __clear_bit(0x80, shadow_io_bitmap[0]);
> >>>  __clear_bit(0xed, shadow_io_bitmap[1]);
> >>>
> >>> +/*
> >>> + * NB this must be called after all command-line processing has been
> >>> + * done, so that if (for example) HAP is disabled, nested virt is
> >>> + * disabled as well.
> >>> + */
> >>> +if ( cpu_has_vmx )
> >>> +start_nested_vmx(&hvm_funcs);
> >>> +else if ( cpu_has_svm )
> >>> +start_nested_svm(&hvm_funcs);
> >>
> >> Instead of passing the pointer, couldn't you have the functions return
> >> bool, and then set hvm_funcs.caps.nested_virt from that? Passing that
> >> pointer looks somewhat odd to me.
> >
> > For one, it makes start_nested_XXX symmetric to start_XXX, which also
> > has access to the full hvm_funcs structure (albeit in the former case
> > because it's creating the structure).
>
> This last aspect is the crucial aspect: start{svm,vmx}() are indeed quite
> special because of this. Everywhere else we have accessor helpers when
> hvm_funcs needs consulting, e.g. ...
>
> >  For two, both of them need to read caps.hap.
>
> ... caps _reads_ would want using (an) accessor(s), too.
>
> >  For three, it's just more flexible -- there may be
> > future things that we want to modify in the start_nested_*()
> > functions.  If we did as you suggest, and then added (say)
> > caps.nested_virt_needs_hap, then we'd either need to add a second
> > function, or refactor it to look like this.
>
> Right, I can see this as an argument when taking, as you do, speculation
> on future needs into account. Albeit I'm then inclined to say that even
> such modifications may better be done through accessor helpers.

Why access it through accessor helpers?

I consider that it's the SVM and VMX "construction/setup" code
respectively which "own" hvm_funcs (as evidenced by the fact that they
create the structures and then return them); and I consider the
start_nested_{vmx/svm} to be a part of the same code; so I consider it
valid for them to have direct access to the structure.

> >> Is there a reason to use direct calls here rather than a true hook
> >> (seeing that hvm_funcs must have been populated by the time we make it
> >> here)? I understand we're (remotely) considering to switch away from
> >> using hooks at some point, but right now this feels somewhat
> >> inconsistent if not further justified.
> >
> > Again it mimics the calls to start_vmx/svm in hvm_enable.  But I could
> > look at adding a function pointer to see if it works.
>
> Andrew - do you have any input here towards those somewhat vague plans
> of getting rid of the hook functions? I guess they're more relevant in
> places where we can't easily use the altcall machinery (i.e. not here).

Rather than do all that work collecting information, why don't we just
check it in as it is?  Obviously if we're thinking about getting rid
of hook functions at some point anyway, it can't be all that bad.
There is an aesthetic justification for the lack of hook, in that it's
similar to the start_vmx/svm() functions.

So far I really don't think the remaining "open" changes we're
discussing are worth either of our efforts.  I don't plan to make any
of these changes unless another x86 maintainer seconds your request
(or you can convince me they're worth making).

(The other two changes -- correcting the function name in the commit
message, and removing the extra blank line -- I've already changed
locally, so could check in with your ack.)

 -George



Re: [PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit

2024-03-27 Thread George Dunlap
On Mon, Mar 18, 2024 at 2:17 PM Jan Beulich  wrote:
>
> On 13.03.2024 13:24, George Dunlap wrote:
> > In order to make implementation and testing tractable, we will require
> > specific host functionality.  Add a nested_virt bit to hvm_funcs.caps,
> > and return an error if a domain is created with nested virt and this
> > bit isn't set.  Create VMX and SVM callbacks to be executed from
> > start_nested_svm(), which is guaranteed to execute after all
>
> Nit: nestedhvm_setup() (or, with different wording, start_nested_{svm,vmx}()).

Oops, fixed.

> > --- a/xen/arch/x86/domain.c
> > +++ b/xen/arch/x86/domain.c
> > @@ -673,6 +673,12 @@ int arch_sanitise_domain_config(struct 
> > xen_domctl_createdomain *config)
> >   */
> >  config->flags |= XEN_DOMCTL_CDF_oos_off;
> >
> > +if ( nested_virt && !hvm_nested_virt_supported() )
> > +{
> > +dprintk(XENLOG_INFO, "Nested virt requested but not available\n");
> > +return -EINVAL;
> > +}
> > +
> >  if ( nested_virt && !hap )
> >  {
> >  dprintk(XENLOG_INFO, "Nested virt not supported without HAP\n");
>
> As mentioned in reply to v1, I think what both start_nested_{svm,vmx}() check
> is redundant with this latter check. I think that check isn't necessary there
> (yet unlike suggested in reply to v1 I don't think anymore that the check here
> can alternatively be dropped). And even if it was to be kept for some reason,
> I'm having some difficulty seeing why vendor code needs to do that check, when
> nestedhvm_setup() could do it for both of them.

I'm having a bit of trouble resolving the antecedents in this
paragraph (what "this" and "there" are referring to).

Are you saying that we should set hvm_funcs.caps.nested_virt to 'true'
even if the hardware doesn't support HAP, because we check that here?

That seems like a very strange way to go about things; hvm_funcs.caps
should reflect the actual capabilities of the hardware.  Suppose at
some point we want to expose nested virt capability to the toolstack
-- wouldn't it make more sense to be able to just read
`hvm_funcs.caps.nested_virt`, rather than having to remember to && it
with `hvm_funcs.caps.hap`?

And as you say, you can't get rid of the HAP check here, because we
also want to deny nested virt if the admin deliberately created a
guest in shadow mode on a system that has HAP available.  So it's not
redundant: one is checking the capabilities of the system, the other
is checking the requested guest configuration.

As to why to have each vendor independent code check for HAP -- there
are in fact two implementations of the code; it's nice to be able to
look in one place for each implementation to determine the
requirements.  Additionally, it would be possible, in the future, for
one of the nested virt implementations to enable shadow mode, while
the other one didn't.  The current arrangement makes that easy.

> > --- a/xen/arch/x86/hvm/nestedhvm.c
> > +++ b/xen/arch/x86/hvm/nestedhvm.c
> > @@ -150,6 +150,16 @@ static int __init cf_check nestedhvm_setup(void)
> >  __clear_bit(0x80, shadow_io_bitmap[0]);
> >  __clear_bit(0xed, shadow_io_bitmap[1]);
> >
> > +/*
> > + * NB this must be called after all command-line processing has been
> > + * done, so that if (for example) HAP is disabled, nested virt is
> > + * disabled as well.
> > + */
> > +if ( cpu_has_vmx )
> > +start_nested_vmx(&hvm_funcs);
> > +else if ( cpu_has_svm )
> > +start_nested_svm(&hvm_funcs);
>
> Instead of passing the pointer, couldn't you have the functions return
> bool, and then set hvm_funcs.caps.nested_virt from that? Passing that
> pointer looks somewhat odd to me.

For one, it makes start_nested_XXX symmetric to start_XXX, which also
has access to the full hvm_funcs structure (albeit in the former case
because it's creating the structure).  For two, both of them need to
read caps.hap.  For three, it's just more flexible -- there may be
future things that we want to modify in the start_nested_*()
functions.  If we did as you suggest, and then added (say)
caps.nested_virt_needs_hap, then we'd either need to add a second
function, or refactor it to look like this.

> Is there a reason to use direct calls here rather than a true hook
> (seeing that hvm_funcs must have been populated by the time we make it
> here)? I understand we're (remotely) considering to switch away from
> using hooks at some point, but right now this feels somewhat
> inconsistent if not further justified.

Again it mimics the calls to start_vmx/svm in hvm_enable.  But I could
look at adding a function pointer to see if it works.

 -George



Re: [PATCH 7/7] xen/trace: Drop old trace API

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> With all users updated to the new API, drop the old API.  This includes all of
> asm/hvm/trace.h, which allows us to drop some includes.
>
> Signed-off-by: Andrew Cooper 

Reviewed-by: George Dunlap 



Re: [PATCH 5/7] xen: Switch to new TRACE() API

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> (Almost) no functional change.
>
> irq_move_cleanup_interrupt() changes two smp_processor_id() calls to the 'me'
> local variable which manifests as a minor code improvement.  All other
> differences in the compiled binary are to do with line numbers changing.

Probably also worth noting that (as Jan points out in 7/7) by
bypassing HVMTRACE_ND, we're also bypassing the whole DEFAULT_HVM_*
and DO_TRC_HVM_* thing, which in theory could streamline the code by
causing the compiler to elide certain trace points.

 -George



Re: [PATCH 6/7] xen/trace: Update final {__,}trace_var() users to the new API

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> This logically drops the cycles parameter, as it's now encoded directly in the
> event code.

So to make Jan's comment more specific, what about something like this:

This primarily consists of converting the cycles parameter into the
appropriate function: 0 -> trace(), 1 -> trace_time().

> No functional change.
>
> Signed-off-by: Andrew Cooper 

Reviewed-by: George Dunlap 



Re: [PATCH 5/7] xen: Switch to new TRACE() API

2024-03-20 Thread George Dunlap
On Wed, Mar 20, 2024 at 1:45 PM Jan Beulich  wrote:
>
> On 18.03.2024 17:35, Andrew Cooper wrote:
> > @@ -736,9 +736,19 @@ static void vlapic_update_timer(struct vlapic *vlapic, 
> > uint32_t lvtt,
> >  delta = delta * vlapic->hw.timer_divisor / old_divisor;
> >  }
> >
> > -TRACE_2_LONG_3D(TRC_HVM_EMUL_LAPIC_START_TIMER, 
> > TRC_PAR_LONG(delta),
> > -TRC_PAR_LONG(is_periodic ? period : 0),
> > -vlapic->pt.irq);
> > +if ( unlikely(tb_init_done) )
> > +{
> > +struct {
> > +uint64_t delta, period;
> > +uint32_t irq;
> > +} __packed d = {
> > +.delta = delta,
> > +.period = is_periodic ? period : 0,
> > +.irq = vlapic->pt.irq,
> > +};
> > +
> > +trace_time(TRC_HVM_EMUL_LAPIC_START_TIMER, sizeof(d), &d);
> > +}
>
> Instead of this open-coding, how about
>
> if ( is_periodic )
> TRACE_TIME(TRC_HVM_EMUL_LAPIC_START_TIMER, delta, delta >> 32,
>period, period >> 32, vlapic->pt.irq);
> else
> TRACE_TIME(TRC_HVM_EMUL_LAPIC_START_TIMER, delta, delta >> 32,
>0, 0, vlapic->pt.irq);
>
> thus improving similarity / grep-ability with ...

Yuck -- I'm not keen on breaking the similarity, but I'm *really* not
a fan of duplicating code.  Both are kind of "code smell"-y to me, but
I think the second seems worse.

It sort of looks to me like the source of the problem is that the
`period` variable is overloaded somehow; in that it's used to update
some calculation even if !is_periodic, and so the two places it's used
as an actual period (the trace code, and the call to
`create_periodic_time()`) need to figure out if `periodic` is for them
to use or not.

What about adding a variable, "timer_period" for that purpose?
Something like the following?

diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index dcbcf4a1fe..ea14fc1587 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -729,6 +729,8 @@ static void vlapic_update_timer(struct vlapic
*vlapic, uint32_t lvtt,

 if ( delta && (is_oneshot || is_periodic) )
 {
+uint64_t timer_period = 0;
+
 if ( vlapic->hw.timer_divisor != old_divisor )
 {
 period = (uint64_t)vlapic_get_reg(vlapic, APIC_TMICT)
@@ -736,12 +738,15 @@ static void vlapic_update_timer(struct vlapic
*vlapic, uint32_t lvtt,
 delta = delta * vlapic->hw.timer_divisor / old_divisor;
 }

-TRACE_2_LONG_3D(TRC_HVM_EMUL_LAPIC_START_TIMER, TRC_PAR_LONG(delta),
-TRC_PAR_LONG(is_periodic ? period : 0),
-vlapic->pt.irq);
+if ( is_periodic )
+timer_period = period;
+
+TRACE_TIME(TRC_HVM_EMUL_LAPIC_START_TIMER, delta, delta >> 32,
+   timer_period, timer_period >> 32,
+   vlapic->pt.irq);

 create_periodic_time(current, &vlapic->pt, delta,
- is_periodic ? period : 0, vlapic->pt.irq,
+     timer_period, vlapic->pt.irq,
  is_periodic ? vlapic_pt_cb : NULL,
  &vlapic->timer_last_update, false);


As with Jan, I'd be OK with checking it in the way it is if you prefer, so:

Reviewed-by: George Dunlap 



Re: [PATCH 4/7] xen/sched: Clean up trace handling

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> There is no need for bitfields anywhere - use more sensible types.  There is
> also no need to cast 'd' to (unsigned char *) before passing it to a function
> taking void *.  Switch to new trace_time() API.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper 
> Reviewed-by: Dario Faggioli 

Reviewed-by: George Dunlap 



Re: [PATCH 3/7] xen/rt: Clean up trace handling

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> Most uses of bitfields and __packed are unnecessary.  There is also no need to
> cast 'd' to (unsigned char *) before passing it to a function taking void *.
> Switch to new trace_time() API.
>
> No functional change.
>
> Signed-off-by: Andrew Cooper 
> Reviewed-by: Dario Faggioli 

Reviewed-by: George Dunlap 



Re: [PATCH 2/7] xen/credit2: Clean up trace handling

2024-03-20 Thread George Dunlap
On Wed, Mar 20, 2024 at 12:16 PM George Dunlap  wrote:
>
> On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  
> wrote:
> >
> > There is no need for bitfields anywhere - use more sensible types.  There is
> > also no need to cast 'd' to (unsigned char *) before passing it to a 
> > function
> > taking void *.  Switch to new trace_time() API.
> >
> > No functional change.
>
> Hey Andrew -- overall changes look great, thanks for doing this very
> detailed work.

For my own posterity later:

I've done a close review of the rest of the patch and didn't see any
issues other than the loss of sign in a handful of cases.

 -George



Re: [PATCH 2/7] xen/credit2: Clean up trace handling

2024-03-20 Thread George Dunlap
On Wed, Mar 20, 2024 at 1:13 PM Jan Beulich  wrote:
>
> On 20.03.2024 13:19, Andrew Cooper wrote:
> > On 20/03/2024 12:16 pm, George Dunlap wrote:
> >> On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  
> >> wrote:
> >>> There is no need for bitfields anywhere - use more sensible types.  There 
> >>> is
> >>> also no need to cast 'd' to (unsigned char *) before passing it to a 
> >>> function
> >>> taking void *.  Switch to new trace_time() API.
> >>>
> >>> No functional change.
> >> Hey Andrew -- overall changes look great, thanks for doing this very
> >> detailed work.
> >>
> >> One issue here is that you've changed a number of signed values to
> >> unsigned values; for example:
> >>
> >>> @@ -1563,16 +1559,16 @@ static s_time_t tickle_score(const struct 
> >>> scheduler *ops, s_time_t now,
> >>>  if ( unlikely(tb_init_done) )
> >>>  {
> >>>  struct {
> >>> -unsigned unit:16, dom:16;
> >>> -int credit, score;
> >>> -} d;
> >>> -d.dom = cur->unit->domain->domain_id;
> >>> -d.unit = cur->unit->unit_id;
> >>> -d.credit = cur->credit;
> >>> -d.score = score;
> >>> -__trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
> >>> -sizeof(d),
> >> -(unsigned char *)&d);
> >>> +uint16_t unit, dom;
> >>> +uint32_t credit, score;
> >> ...here you change `int` to `uint32_t`; but `credit` and `score` are
> >> both signed values, which may be negative.  There are a number of
> >> other similar instances.  In general, if there's a signed value, it
> >> was meant.
> >
> > Oh - this is a consequence of being reviewed that way in earlier iterations.
>
> Which in turn is a result of us still having way to many uses of plain
> int when signed quantities aren't meant. Plus my suggestion to make
> this explicit by saying "signed int" was rejected.

Not complaining, but just pointing out to maybe save some time in the
future:  The vast majority of fields in the trace records in this file
are unsigned; in fact, two of the fields in this structure are
unsigned.  That should have been a hint that being signed was likely
to be intentional in this code, and to look at the context before
removing it.  (In this case, for example just above here there's an
"if ( score < 0 )" showing that score might be negative.)

 -George



Re: [PATCH 2/7] xen/credit2: Clean up trace handling

2024-03-20 Thread George Dunlap
On Wed, Mar 20, 2024 at 12:19 PM Andrew Cooper
 wrote:
>
> On 20/03/2024 12:16 pm, George Dunlap wrote:
> > On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  
> > wrote:
> >> There is no need for bitfields anywhere - use more sensible types.  There 
> >> is
> >> also no need to cast 'd' to (unsigned char *) before passing it to a 
> >> function
> >> taking void *.  Switch to new trace_time() API.
> >>
> >> No functional change.
> > Hey Andrew -- overall changes look great, thanks for doing this very
> > detailed work.
> >
> > One issue here is that you've changed a number of signed values to
> > unsigned values; for example:
> >
> >> @@ -1563,16 +1559,16 @@ static s_time_t tickle_score(const struct 
> >> scheduler *ops, s_time_t now,
> >>  if ( unlikely(tb_init_done) )
> >>  {
> >>  struct {
> >> -unsigned unit:16, dom:16;
> >> -int credit, score;
> >> -} d;
> >> -d.dom = cur->unit->domain->domain_id;
> >> -d.unit = cur->unit->unit_id;
> >> -d.credit = cur->credit;
> >> -d.score = score;
> >> -__trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
> >> -sizeof(d),
> >> -(unsigned char *)&d);
> >> +uint16_t unit, dom;
> >> +uint32_t credit, score;
> > ...here you change `int` to `uint32_t`; but `credit` and `score` are
> > both signed values, which may be negative.  There are a number of
> > other similar instances.  In general, if there's a signed value, it
> > was meant.
>
> Oh - this is a consequence of being reviewed that way in earlier iterations.
>
> If they really can hold negative numbers, they can become int32_t's.
> What's important is that they have a clearly-specified width.

Oh yes, sorry if that wasn't clear -- I was suggesting int32_t, unless
there was some compelling reason to do otherwise.  I just realize that
the last time this series had some review comments, it took 3 years to
come back around, and it would be a shame if that happened again; so
I'm happy to go through and do the change.

 -George



Re: [PATCH 2/7] xen/credit2: Clean up trace handling

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> There is no need for bitfields anywhere - use more sensible types.  There is
> also no need to cast 'd' to (unsigned char *) before passing it to a function
> taking void *.  Switch to new trace_time() API.
>
> No functional change.

Hey Andrew -- overall changes look great, thanks for doing this very
detailed work.

One issue here is that you've changed a number of signed values to
unsigned values; for example:

> @@ -1563,16 +1559,16 @@ static s_time_t tickle_score(const struct scheduler 
> *ops, s_time_t now,
>  if ( unlikely(tb_init_done) )
>  {
>  struct {
> -unsigned unit:16, dom:16;
> -int credit, score;
> -} d;
> -d.dom = cur->unit->domain->domain_id;
> -d.unit = cur->unit->unit_id;
> -d.credit = cur->credit;
> -d.score = score;
> -__trace_var(TRC_CSCHED2_TICKLE_CHECK, 1,
> -sizeof(d),
> -(unsigned char *)&d);
> +uint16_t unit, dom;
> +uint32_t credit, score;

...here you change `int` to `uint32_t`; but `credit` and `score` are
both signed values, which may be negative.  There are a number of
other similar instances.  In general, if there's a signed value, it
was meant.

I don't want to delay this another 3 years, so if there's not a strong
argument in favor of converting the signed types into uint32_t, I can
make the change myself and re-submit the series.

(I'll probably end up making a change to xenalyze.c as well, to update
the structure definitions there to match what's in Xen, so it wouldn't
be any extra effort.)

 -George



Re: [PATCH 1/7] xen/trace: Introduce new API

2024-03-20 Thread George Dunlap
On Mon, Mar 18, 2024 at 4:36 PM Andrew Cooper  wrote:
>
> trace() and trace_time(), in function form for struct arguments, and macro
> form for simple uint32_t list arguments.
>
> This will be used to clean up the mess of macros which exists throughout the
> codebase, as well as eventually dropping __trace_var().
>
> There is intentionally no macro to split a 64-bit parameter in the new API,
> for MISRA reasons.
>
> Signed-off-by: Andrew Cooper 

Thanks for doing this, it all looks good to me.

I'll note that last time this was posted, there were some questions.
One was regarding underscores in the macro parameters.  I don't care
about this enough to stop it going in.

The other question was about the use of ##__VA_ARGS__. I don't
actually understand what that comment was trying to say, but from a
quick Google search it appears that 1) this will do what I think it
should do, and 2) it works in both gcc and clang.

We should give Jan a chance to explain his second point more clearly
before checking it in; but:

Reviewed-by: George Dunlap 



Re: [PATCH v5] x86/PoD: tie together P2M update and increment of entry count

2024-03-20 Thread George Dunlap
On Tue, Mar 19, 2024 at 1:22 PM Jan Beulich  wrote:
>
> When not holding the PoD lock across the entire region covering P2M
> update and stats update, the entry count - if to be incorrect at all -
> should indicate too large a value in preference to a too small one, to
> avoid functions bailing early when they find the count is zero. However,
> instead of moving the increment ahead (and adjust back upon failure),
> extend the PoD-locked region.
>
> Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
> Signed-off-by: Jan Beulich 

Oh!  Thanks for doing this -- I hadn't responded because I wasn't sure
whether I was bikeshedding, and then it sort of fell off my radar.  At
any rate:

Reviewed-by: George Dunlap 



Re: [XEN PATCH 10/10] xen/sched: address violations of MISRA C Rule 20.7

2024-03-18 Thread George Dunlap
On Mon, Mar 18, 2024 at 11:54 AM Nicola Vetrini
 wrote:
>
> MISRA C Rule 20.7 states: "Expressions resulting from the expansion
> of macro parameters shall be enclosed in parentheses". Therefore, some
> macro definitions should gain additional parentheses to ensure that all
> current and future users will be safe with respect to expansions that
> can possibly alter the semantics of the passed-in macro parameter.
>
> No functional change.
>
> Signed-off-by: Nicola Vetrini 
> ---
>  xen/common/sched/private.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/xen/common/sched/private.h b/xen/common/sched/private.h
> index 459d1dfb11a5..c0e7c96d24f4 100644
> --- a/xen/common/sched/private.h
> +++ b/xen/common/sched/private.h
> @@ -540,7 +540,7 @@ static inline void sched_unit_unpause(const struct 
> sched_unit *unit)
>  }
>
>  #define REGISTER_SCHEDULER(x) static const struct scheduler *x##_entry \
> -  __used_section(".data.schedulers") = 
> +  __used_section(".data.schedulers") = &(x)

Arguably this is safe, because any `x` which would be problematic in
this line wouldn't compile in the line above.

But it's almost certainly not worth the effort of documenting or deviating, so:

Acked-by: George Dunlap 



Re: [PATCH] do_multicall and MISRA Rule 8.3

2024-03-15 Thread George Dunlap
On Fri, Mar 15, 2024 at 2:13 PM Jan Beulich  wrote:
>
> On 15.03.2024 14:55, Julien Grall wrote:
> > Hi Jan,
> >
> > On 15/03/2024 13:24, Jan Beulich wrote:
> >> On 15.03.2024 13:17, George Dunlap wrote:
> >>> On Fri, Mar 15, 2024 at 11:57 AM Jan Beulich  wrote:
> >>>>> It sounds like Andy and Stefano feel like this is a situation where "a
> >>>>> fixed width quantity is meant"; absent any further guidance from the
> >>>>> CODING_STYLE about when fixed widths should or should not be used, I
> >>>>> don't think this change would be a violation of CODING_STYLE.
[snip]
> >>> Other than "it's in CODING_STYLE", and "it's not really necessary
> >>> because it's ensured in the assembly code", you haven't advanced a
> >>> single reason why "uint32_t" is problematic.
> >>
> >> And it isn't, I never said it would be. But if we set rules for
> >> ourselves, why would we take the first opportunity to not respect them?
> >
> > I am a bit confused. Reading through the thread you seem to agree that
> > the written rules are respected here. So what rules are you talking about?
>
> What was proposed is use of a fixed width type where according to my
> reading ./CODING_STYLE says it shouldn't be used.

This conversation is starting to get frustrating.  That's simply not
what it says, and I pointed that out just a few messages ago.

To reiterate: the text says fixed-width types are OK when a fixed-width
quantity is "meant"; and in this case, Stefano and Andy "mean" to
use a fixed-width quantity.  The implied subtext of that sentence
could be, "Don't use fixed width types unless there's a good reason to
use a fixed width", and both Andy and Stefano think there's a good
reason.  This argument you haven't really addressed at all, except
with a specious "slippery slope" argument meant to nullify the
exception, and which you now simply attempt to ignore.

I venture to assert that for most people, the rules are a means to an
end: That end being code which is correct, robust, fast, easy to
write, maintain, debug, and review patches for.  What I agreed to,
when I accepted this patch, was that *in general* we would avoid using
fixed-width types; but that there were cases where doing so made
sense.  Some of those were discussed in the thread above.

Andy and Stefano have already put forward reasons why they think a
fixed-width type would be better here, which are related to "end
goals": namely, more robust and easy to maintain code.  When I asked
what "end goals" would be negatively impacted by using a fixed-width
type, you explicitly said that there were none!  That the *only*
reason you're continuing to argue is that we have a document somewhere
that says we should -- a document which explicitly acknowledges that
there will be exceptions!

The ideal response would have been to lay out a reasonably
comprehensive set of criteria for when to use basic types, when to use
fixed-width types, and how each criteria advanced the end goals of a
better codebase.  A sufficient response would have been to lay out
reasons why "better codebase", in this instance, tilts towards using a
basic type rather than a fixed-width type.

Saying "That's what the rules say", without motivating it by
explaining how it connects to a better codebase, is just not a helpful
response; and making that argument after it's been pointed out that
that is *not* what the rules say is even worse.

 -George



Re: [PATCH] do_multicall and MISRA Rule 8.3

2024-03-15 Thread George Dunlap
On Fri, Mar 15, 2024 at 11:57 AM Jan Beulich  wrote:
> > It sounds like Andy and Stefano feel like this is a situation where "a
> > fixed width quantity is meant"; absent any further guidance from the
> > CODING_STYLE about when fixed widths should or should not be used, I
> > don't think this change would be a violation of CODING_STYLE.
>
> As with any not sufficiently clear statement, that's certainly true here,
> too. Yet if we try to give as wide meaning as possible to "a fixed width
> quantity is meant", there's basically no restriction on use of fixed width
> types because everyone can just say "but I mean a fixed width quantity
> here". I think the earlier sentence needs taking with higher priority,
> i.e. if a basic type does for the purpose, that's what should be used. The
> 2nd sentence then only tries to further clarify what the 1st means.

Come, now.  There are lots of situations where we just need some sort
of number, and there's no real need to worry about the exact size.
There are other situations, where we mean "whatever covers the whole
address space" or the like, where it makes sense to have something
like "unsigned long", which changes size, but in predictable and
useful ways.  There are other situations, like when talking over an
API to code which may be compiled by a different compiler, or may be
running in a different processor mode, where we want to be more
specific, and set an exact number of bits.

Should we use uint32_t for random loop variables?  Pretty clearly
"No".  Should we use uint32_t for the C entry of a hypercall, even
though the assembly code allegedly makes that unnecessary?  At least
two core maintainers think "maybe just to be safe".  That's hardly a
slippery slope of "anyone can say anything".

Other than "it's in CODING_STYLE", and "it's not really necessary
because it's ensured in the assembly code", you haven't advanced a
single reason why "uint32_t" is problematic.

 -George



Re: [PATCH] do_multicall and MISRA Rule 8.3

2024-03-15 Thread George Dunlap
On Fri, Mar 15, 2024 at 6:54 AM Jan Beulich  wrote:
>
> On 15.03.2024 01:21, Stefano Stabellini wrote:
> > On Mon, 11 Mar 2024, Julien Grall wrote:
> >> On 11/03/2024 11:32, George Dunlap wrote:
> >>> On Sat, Mar 9, 2024 at 1:59 AM Stefano Stabellini
> >>>  wrote:
> >>>>
> >>>> I would like to resurrect this thread and ask other opinions.
> >>>>
> >>>>
> >>>> On Thu, 23 Nov 2023, Jan Beulich wrote:
> >>>>> On 22.11.2023 22:46, Stefano Stabellini wrote:
> >>>>>> Two out of three do_multicall definitions/declarations use uint32_t as
> >>>>>> type for the "nr_calls" parameters. Change the third one to be
> >>>>>> consistent with the other two.
> >>>>>>
> >>>>>> Link:
> >>>>>> https://lore.kernel.org/xen-devel/7e3abd4c0ef5127a07a60de1bf090a8aefac8e5c.1692717906.git.federico.seraf...@bugseng.com/
> >>>>>> Link:
> >>>>>> https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2308251502430.6458@ubuntu-linux-20-04-desktop/
> >>>>>> Signed-off-by: Stefano Stabellini 
> >>>>>> ---
> >>>>>> Note that a previous discussion showed disagreement between
> >>>>>> maintainers
> >>>>>> on this topic. The source of disagreements are that we don't want to
> >>>>>> change a guest-visible ABI and we haven't properly documented how to
> >>>>>> use
> >>>>>> types for guest ABIs.
> >>>>>>
> >>>>>> As an example, fixed-width types have the advantage of being explicit
> >>>>>> about their size but sometimes register-size types are required (e.g.
> >>>>>> unsigned long). The C specification says little about the size of
> >>>>>> unsigned long and today, and we even use unsigned int in guest ABIs
> >>>>>> without specifying the expected width of unsigned int on the various
> >>>>>> arches. As Jan pointed out, in Xen we assume sizeof(int) >= 4, but
> >>>>>> that's not written anywhere as far as I can tell.
> >>>>>>
> >>>>>> I think the appropriate solution would be to document properly our
> >>>>>> expectations of both fixed-width and non-fixed-width types, and how to
> >>>>>> use them for guest-visible ABIs.
> >>>>>>
> >>>>>> In this patch I used uint32_t for a couple of reasons:
> >>>>>> - until we have better documentation, I feel more confident in using
> >>>>>>explicitly-sized integers in guest-visible ABIs
> >>>>>
> >>>>> I disagree with this way of looking at it. Guests don't invoke these
> >>>>> functions directly, and our assembly code sitting in between already is
> >>>>> expected to (and does) guarantee that (in the case here) unsigned int
> >>>>> would be okay to use (as would be unsigned long, but at least on x86
> >>>>> that's slightly less efficient), in line with what ./CODING_STYLE says.
> >>>>>
> >>>>> Otoh structure definitions in the public interface of course need to
> >>>>> use fixed with types (and still doesn't properly do so in a few cases).
> >>>
> >>> You didn't address the other argument, which was that all the other
> >>> definitions have uint32_t; in particular,
> >>> common/multicall.c:do_multicall() takes uint32_t.  Surely that should
> >>> match the non-compat definition in include/hypercall-defs.c?
> >>>
> >>> Whether they should both be `unsigned int` or `uint32_t` I don't
> >>> really feel like I have a good enough grasp of the situation to form a
> >>> strong opinion.
> >>
> >> FWIW +1. We at least need some consistency.
> >
> > Consistency is my top concern. Let's put the "unsigned int" vs
> > "uint32_t" argument aside.
> >
> > do_multicall is not consistent with itself. We need
> > hypercall-defs.c:do_multicall and multicall.c:do_multicall to match.
> >
> > Option1) We can change hypercall-defs.c:do_multicall to use uint32_t.
> >
> > Option2) Or we can change multicall.c:do_multicall to use unsigned int.
> >
> > I went with Option1. Andrew expressed his strong preference toward
> > Option1 in the past

Re: [PATCH] SUPPORT.MD: Fix matrix generation after 43c416d0d819 and 77c39a53cf5b

2024-03-15 Thread George Dunlap
On Thu, Mar 14, 2024 at 5:39 PM Julien Grall  wrote:
>
> From: Julien Grall 
>
> The script docs/support-matrix-generate throw the following error on the
> latest staging.

I wonder if it would be worth adding a follow-up patch to run that
script, maybe only when debug=y, so it catches errors in development.

 -George



Re: [PATCH] docs/misra: document the expected sizes of integer types

2024-03-14 Thread George Dunlap
On Thu, Mar 14, 2024 at 7:36 AM Jan Beulich  wrote:
>
> On 14.03.2024 00:23, Stefano Stabellini wrote:
> > Xen makes assumptions about the size of integer types on the various
> > architectures. Document these assumptions.
>
> This all reads as if we required exact widths. Is that really the case?

At least one thing here is that *all compilers on the architecture*
need to have the same idea.  If not, we absolutely need to change
"unsigned int" to "uint32_t" in any public interface.

A second point is that the assumptions are not only about the minimum
number of bits, but also about storage size and alignment.  Again, if
we don't assume that "unsigned int" is exactly 4 bytes, then we should
go through and change it to "uint32_t" anywhere that size or alignment
matters.

 -George



Re: [PATCH v4] x86/PoD: tie together P2M update and increment of entry count

2024-03-13 Thread George Dunlap
On Wed, Mar 13, 2024 at 2:00 PM Jan Beulich  wrote:
>
> When not holding the PoD lock across the entire region covering P2M
> update and stats update, the entry count - if to be incorrect at all -
> should indicate too large a value in preference to a too small one, to
> avoid functions bailing early when they find the count is zero. However,
> instead of moving the increment ahead (and adjust back upon failure),
> extend the PoD-locked region.
>
> Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
> Signed-off-by: Jan Beulich 

Would you mind commenting on why you went with multiple unlocks,
rather than multiple if statements?

e.g.,

```
rc = p2m_set_entry(...);

/* Do the pod entry adjustment while holding the lock on success */
if ( rc == 0 ) {
 /* adjust pod entries */
}

pod_unlock(p2m);

/* Do the rest of the clean-up and error handling */
if ( rc == 0 ) {
```

Just right now the multiple unlocks makes me worry that we may forget
one at some point.

 -George



Re: Mailing list changes

2024-03-13 Thread George Dunlap
On Wed, Mar 13, 2024 at 2:59 PM Kelly Choi  wrote:
>
> Hi everyone,
>
> We recently discussed some mailing list changes in our last community call.
>
> Issue:
>
> Lists.xenproject.org software is outdated, DKIM, DMARC, ARC is no longer 
> sufficient, and there's no clear update path from Mailman2 to Mailman3.
>
> Potential solution:
>
> Host them at https://subspace.kernel.org/ due to us being part of the Linux 
> Foundation
> No additional cost to the project

It should also be pointed out that the list address wouldn't change;
on our end we'd point the MX for lists.xenproject.org to kernel.org's
mail servers (or something like that).

 -George



[PATCH v2 0/3] AMD Nested Virt Preparation

2024-03-13 Thread George Dunlap
This series lays the groundwork for revamp of the AMD nested virt
functionality.  The first two patches are clean-ups or reorganizations
of existing code.  The final patch is the first major step towards making
the feature supportable: allowing Xen to refuse nested virt support if certain
hardware features are not present.

George Dunlap (3):
  x86: Move SVM features exposed to guest into hvm_max_cpu_policy
  nestedsvm: Disable TscRateMSR
  svm/nestedsvm: Introduce nested capabilities bit

 docs/designs/nested-svm-cpu-features.md  | 111 +++
 xen/arch/x86/cpu-policy.c|  29 ++---
 xen/arch/x86/domain.c|   6 +
 xen/arch/x86/hvm/nestedhvm.c |  10 ++
 xen/arch/x86/hvm/svm/nestedsvm.c |  16 ++-
 xen/arch/x86/hvm/svm/svm.c   |  57 --
 xen/arch/x86/hvm/vmx/vvmx.c  |   8 ++
 xen/arch/x86/include/asm/hvm/hvm.h   |  16 ++-
 xen/arch/x86/include/asm/hvm/nestedhvm.h |   4 +
 xen/arch/x86/include/asm/hvm/svm/nestedsvm.h |   5 -
 10 files changed, 184 insertions(+), 78 deletions(-)
 create mode 100644 docs/designs/nested-svm-cpu-features.md

-- 
2.25.1




[PATCH v2 2/3] nestedsvm: Disable TscRateMSR

2024-03-13 Thread George Dunlap
The primary purpose of TSC scaling, from our perspective, is to
maintain the fiction of an "invariant TSC" across migrates between
platforms with different clock speeds.

On AMD, the TscRateMSR CPUID bit is unconditionally enabled in the
"host cpuid", even if the hardware doesn't actually support it.
According to c/s fd14a1943c4 ("nestedsvm: Support TSC Rate MSR"),
testing showed that emulating TSC scaling in an L1 was more expensive
than emulating TSC scaling on an L0 (due to extra sets of vmexit /
vmenter).

However, the current implementation seems to be broken.

First of all, the final L2 scaling ratio should be a composition of
the L0 scaling ratio and the L1 scaling ratio; there's no indication
this is being done anywhere.

Secondly, it's not clear that the L1 tsc scaling ratio actually
affects the L0 tsc scaling ratio.  The stored value (ns_tscratio) is
used to affect the tsc *offset*, but doesn't seem to actually be
factored into d->hvm.tsc_scaling_ratio.  (Which shouldn't be
per-domain anyway, but per-vcpu.)  Having the *offset* scaled
according to the nested scaling without the actual RDTSC itself also
being scaled has got to produce inconsistent results.

For now, just disable the functionality entirely until we can
implement it properly:

- Don't set TSCRATEMSR in the host CPUID policy

- Remove MSR_AMD64_TSC_RATIO emulation handling, so that the guest
  gets a #GP if it tries to access it (as it should when
  TSCRATEMSR is clear)

- Remove ns_tscratio from struct nestedhvm, and all code that touches
  it

Unfortunately this means ripping out the scaling calculation stuff as
well, since it's only used in the nested case; it's there in the git
tree if we need it for reference when we re-introduce it.

Signed-off-by: George Dunlap 
---
v2:
- Port over move to hvm_max_cpu_policy

CC: Jan Beulich 
CC: Andrew Cooper 
CC: "Roger Pau Monné" 
CC: Wei Liu 
---
 xen/arch/x86/cpu-policy.c|  3 +-
 xen/arch/x86/hvm/svm/nestedsvm.c |  2 -
 xen/arch/x86/hvm/svm/svm.c   | 57 
 xen/arch/x86/include/asm/hvm/svm/nestedsvm.h |  5 --
 4 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index bd047456eb..5952ff20e6 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -741,8 +741,7 @@ static void __init calculate_hvm_max_policy(void)
(1u << SVM_FEATURE_PAUSEFILTER) |
(1u << SVM_FEATURE_DECODEASSISTS));
 /* Enable features which are always emulated. */
-p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
-   (1u << SVM_FEATURE_TSCRATEMSR));
+p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
 }
 
 guest_common_max_feature_adjustments(fs);
diff --git a/xen/arch/x86/hvm/svm/nestedsvm.c b/xen/arch/x86/hvm/svm/nestedsvm.c
index e4e01add8c..a5319ab729 100644
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -146,8 +146,6 @@ int cf_check nsvm_vcpu_reset(struct vcpu *v)
 svm->ns_msr_hsavepa = INVALID_PADDR;
 svm->ns_ovvmcb_pa = INVALID_PADDR;
 
-svm->ns_tscratio = DEFAULT_TSC_RATIO;
-
 svm->ns_cr_intercepts = 0;
 svm->ns_dr_intercepts = 0;
 svm->ns_exception_intercepts = 0;
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index b551eac807..34b9f603bc 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -777,43 +777,6 @@ static int cf_check svm_get_guest_pat(struct vcpu *v, u64 
*gpat)
 return 1;
 }
 
-static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio)
-{
-uint64_t mult, frac, scaled_host_tsc;
-
-if ( ratio == DEFAULT_TSC_RATIO )
-return host_tsc;
-
-/*
- * Suppose the most significant 32 bits of host_tsc and ratio are
- * tsc_h and mult, and the least 32 bits of them are tsc_l and frac,
- * then
- * host_tsc * ratio * 2^-32
- * = host_tsc * (mult * 2^32 + frac) * 2^-32
- * = host_tsc * mult + (tsc_h * 2^32 + tsc_l) * frac * 2^-32
- * = host_tsc * mult + tsc_h * frac + ((tsc_l * frac) >> 32)
- *
- * Multiplications in the last two terms are between 32-bit integers,
- * so both of them can fit in 64-bit integers.
- *
- * Because mult is usually less than 10 in practice, it's very rare
- * that host_tsc * mult can overflow a 64-bit integer.
- */
-mult = ratio >> 32;
-frac = ratio & ((1ULL << 32) - 1);
-scaled_host_tsc  = host_tsc * mult;
-scaled_host_tsc += (host_tsc >> 32) * frac;
-scaled_host_tsc += ((host_tsc & ((1ULL << 32) - 1)) * frac) >> 32;
-
-return scaled_host_tsc;
-}
-
-static uint64_t svm_get_tsc_offset(uint64_t host_tsc, uint64_t guest_tsc,
-uint64_t ratio

[PATCH v2 3/3] svm/nestedsvm: Introduce nested capabilities bit

2024-03-13 Thread George Dunlap
In order to make implementation and testing tractable, we will require
specific host functionality.  Add a nested_virt bit to hvm_funcs.caps,
and return an error if a domain is created with nested virt and this
bit isn't set.  Create VMX and SVM callbacks to be executed from
start_nested_svm(), which is guaranteed to execute after all
command-line options have been processed.

For VMX, start with always enabling it if HAP is present; this
shouldn't change current behavior.

For SVM, require some basic functionality, adding a document
explaining the rationale.

NB that only SVM CPUID bits 0-7 have been considered.  Bits 10-16 may
be considered in a follow-up patch.

Signed-off-by: George Dunlap 
---
v2:
 - Fixed typo in title
 - Added hvm_nested_virt_supported() def for !CONFIG_HVM
 - Rebased over previous changes
 - Tweak some wording in document
 - Require npt rather than nrips twice
 - Remove stray __init from header
 - Set caps.nested_virt from callback from nestedhvm_setup()

CC: Andrew Cooper 
CC: George Dunlap 
CC: Jan Beulich 
CC: Julien Grall 
CC: Stefano Stabellini 
CC: Wei Liu 
CC: "Roger Pau Monné" 
CC: Jun Nakajima 
CC: Kevin Tian 
---
 docs/designs/nested-svm-cpu-features.md  | 111 +++
 xen/arch/x86/domain.c|   6 ++
 xen/arch/x86/hvm/nestedhvm.c |  10 ++
 xen/arch/x86/hvm/svm/nestedsvm.c |  14 +++
 xen/arch/x86/hvm/vmx/vvmx.c  |   8 ++
 xen/arch/x86/include/asm/hvm/hvm.h   |  16 +++-
 xen/arch/x86/include/asm/hvm/nestedhvm.h |   4 +
 7 files changed, 168 insertions(+), 1 deletion(-)

diff --git a/docs/designs/nested-svm-cpu-features.md 
b/docs/designs/nested-svm-cpu-features.md
new file mode 100644
index 00..837a96df05
--- /dev/null
+++ b/docs/designs/nested-svm-cpu-features.md
@@ -0,0 +1,111 @@
+# Nested SVM (AMD) CPUID requirements
+
+The first step in making nested SVM production-ready is to make sure
+that all features are implemented and well-tested.  To make this
+tractable, we will initially be limiting the "supported" range of
+nested virt to a specific subset of host and guest features.  This
+document describes the criteria for deciding on features, and the
+rationale behind each feature.
+
+For AMD, all virtualization-related features can be found in CPUID
+leaf 800A:edx
+
+# Criteria
+
+- Processor support: At a minimum we want to support processors from
+  the last 5 years.  All things being equal, we'd prefer to cover
+  older processors than not.  Bits 0:7 were available in the very
+  earliest processors; and even through bit 15 we should be pretty
+  good support-wise.
+
+- Faithfulness to hardware: We need the behavior of the "virtual cpu"
+  from the L1 hypervisor's perspective to be as close as possible to
+  the original hardware.  In particular, the behavior of the hardware
+  on error paths 1) is not easy to understand or test, 2) can be the
+  source of surprising vulnerabilities.  (See XSA-7 for an example of a
+  case where subtle error-handling differences can open up a privilege
+  escalation.)  We should avoid emulating any bit of the hardware with
+  complex error paths if we can at all help it.
+
+- Cost of implementation: We want to minimize the cost of
+  implementation (where this includes bringing an existing sub-par
+  implementation up to speed).  All things being equal, we'll favor a
+  configuration which does not require any new implementation.
+
+- Performance: All things being equal, we'd prefer to choose a set of
+  L0 / L1 CPUID bits that are faster than slower.
+
+
+# Bits
+
+- 0 `NP` *Nested Paging*: Required both for L0 and L1.
+
+  Based primarily on faithfulness and performance, as well as
+  potential cost of implementation.  Available on earliest hardware,
+  so no compatibility issues.
+
+- 1 `LbrVirt` *LBR / debugging virtualization*: Require for L0 and L1.
+
+  For L0 this is required for performance: There's no way to tell the
+  guests not to use the LBR-related registers; and if the guest does,
+  then you have to save and restore all LBR-related registers on
+  context switch, which is prohibitive.  Furthermore, the additional
+  emulation risks a security-relevant difference to come up.
+
+  Providing it to L1 when we have it in L0 is basically free, and
+  already implemented.
+
+  Just require it and provide it.
+
+- 2 `SVML` *SVM Lock*: Not required for L0, not provided to L1
+
+  Seems to be about enabling an operating system to prevent "blue
+  pill" attacks against itself.
+
+  Xen doesn't use it, nor provide it; so it would need to be
+  implemented.  The best way to protect a guest OS is to leave nested
+  virt disabled in the tools.
+
+- 3 `NRIPS` NRIP Save: Require for both L0 and L1
+
+  If NRIPS is not present, the software interrupt injection
+  functionality can't be used; and Xen has to emulate it.  That's
+  another source of potential security issues.  If hardware supports
+  it, then providi

[PATCH v2 1/3] x86: Move SVM features exposed to guest into hvm_max_cpu_policy

2024-03-13 Thread George Dunlap
Currently (nested) SVM features we're willing to expose to the guest
are defined in calculate_host_policy, and stored in host_cpu_policy.
This is the wrong place for this; move it into
calculate_hvm_max_policy(), and store it in hvm_max_cpu_policy.

Signed-off-by: George Dunlap 
---
v2:
- New
---
 xen/arch/x86/cpu-policy.c | 30 +-
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 2acc27632f..bd047456eb 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -398,19 +398,6 @@ static void __init calculate_host_policy(void)
 if ( vpmu_mode == XENPMU_MODE_OFF )
 p->basic.raw[0xa] = EMPTY_LEAF;
 
-if ( p->extd.svm )
-{
-/* Clamp to implemented features which require hardware support. */
-p->extd.raw[0xa].d &= ((1u << SVM_FEATURE_NPT) |
-   (1u << SVM_FEATURE_LBRV) |
-   (1u << SVM_FEATURE_NRIPS) |
-   (1u << SVM_FEATURE_PAUSEFILTER) |
-   (1u << SVM_FEATURE_DECODEASSISTS));
-/* Enable features which are always emulated. */
-p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
-   (1u << SVM_FEATURE_TSCRATEMSR));
-}
-
 /* 0x00ce  MSR_INTEL_PLATFORM_INFO */
 /* probe_cpuid_faulting() sanity checks presence of MISC_FEATURES_ENABLES 
*/
 p->platform_info.cpuid_faulting = cpu_has_cpuid_faulting;
@@ -741,6 +728,23 @@ static void __init calculate_hvm_max_policy(void)
 if ( !cpu_has_vmx )
 __clear_bit(X86_FEATURE_PKS, fs);
 
+/* 
+ * Make adjustments to possible (nested) virtualization features exposed
+ * to the guest
+ */
+if ( p->extd.svm )
+{
+/* Clamp to implemented features which require hardware support. */
+p->extd.raw[0xa].d &= ((1u << SVM_FEATURE_NPT) |
+   (1u << SVM_FEATURE_LBRV) |
+   (1u << SVM_FEATURE_NRIPS) |
+   (1u << SVM_FEATURE_PAUSEFILTER) |
+   (1u << SVM_FEATURE_DECODEASSISTS));
+/* Enable features which are always emulated. */
+p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
+   (1u << SVM_FEATURE_TSCRATEMSR));
+}
+
 guest_common_max_feature_adjustments(fs);
 guest_common_feature_adjustments(fs);
 
-- 
2.25.1




Re: [PATCH v3] x86/PoD: tie together P2M update and increment of entry count

2024-03-13 Thread George Dunlap
On Wed, Mar 13, 2024 at 12:25 PM George Dunlap  wrote:
> I keep missing your post-commit-message remarks due to the way I'm
> applying your series.

Er, just to be clear, this is a problem with my workflow, not with
your patches...

 -George



Re: [PATCH v3] x86/PoD: tie together P2M update and increment of entry count

2024-03-13 Thread George Dunlap
On Wed, Mar 13, 2024 at 12:19 PM Jan Beulich  wrote:
>
> On 13.03.2024 11:58, George Dunlap wrote:
> > On Tue, Mar 12, 2024 at 3:22 PM Jan Beulich  wrote:
> >>
> >> When not holding the PoD lock across the entire region covering P2M
> >> update and stats update, the entry count - if to be incorrect at all -
> >> should indicate too large a value in preference to a too small one, to
> >> avoid functions bailing early when they find the count is zero. However,
> >> instead of moving the increment ahead (and adjust back upon failure),
> >> extend the PoD-locked region.
> >>
> >> Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
> >> Signed-off-by: Jan Beulich 
> >> ---
> >> The locked region could be shrunk again, by having multiple unlock
> >> calls. But I think both ioreq_request_mapcache_invalidate() and
> >> domain_crash() are fair enough to call with the lock still held?
> >> ---
> >> v3: Extend locked region instead. Add Fixes: tag.
> >> v2: Add comments.
> >>
> >> --- a/xen/arch/x86/mm/p2m-pod.c
> >> +++ b/xen/arch/x86/mm/p2m-pod.c
> >> @@ -1348,16 +1348,22 @@ mark_populate_on_demand(struct domain *d
> >>  }
> >>  }
> >>
> >> +/*
> >> + * P2M update and stats increment need to collectively be under PoD 
> >> lock,
> >> + * to prevent code elsewhere observing PoD entry count being zero 
> >> despite
> >> + * there actually still being PoD entries (created by the 
> >> p2m_set_entry()
> >> + * invocation below).
> >> + */
> >> +pod_lock(p2m);
> >> +
> >>  /* Now, actually do the two-way mapping */
> >>  rc = p2m_set_entry(p2m, gfn, INVALID_MFN, order,
> >> p2m_populate_on_demand, p2m->default_access);
> >>  if ( rc == 0 )
> >>  {
> >> -pod_lock(p2m);
> >>  p2m->pod.entry_count += 1UL << order;
> >>  p2m->pod.entry_count -= pod_count;
> >>  BUG_ON(p2m->pod.entry_count < 0);
> >> -pod_unlock(p2m);
> >>
> >>  ioreq_request_mapcache_invalidate(d);
> >>  }
> >> @@ -1373,6 +1379,8 @@ mark_populate_on_demand(struct domain *d
> >>  domain_crash(d);
> >>  }
> >>
> >> +pod_unlock(p2m);
> >
> > We're confident that neither domain_crash() nor
> > ioreq_request_mapcache_invalidate() will grab any of the p2m locks?
>
> There's no doubt about ioreq_request_mapcache_invalidate(). domain_crash(),
> otoh, invokes show_execution_state(), which in principle would be nice to
> dump the guest stack among other things. My patch doing so was reverted, so
> right now there's no issue there. Plus any attempt to do so would need to
> be careful anyway regarding locks. But as you see it is not a clear cut no,
> so ...
>
> > If so,
> >
> > Reviewed-by: George Dunlap 
>
> ... rather than taking this (thanks), maybe I indeed better follow the
> alternative outlined in the post-commit-message remark?

I keep missing your post-commit-message remarks due to the way I'm
applying your series.  Yes, that had occurred to me as well -- I don't
think this is a hot path, and I do think it would be good to avoid
laying a trap for future people wanting to change domain_crash(); in
particular as that would change a domain crash into either a host
crash or a potential deadlock.

I think I would go with multiple if statements, rather than multiple
unlock calls though.

 -George



Re: [PATCH v3] x86/PoD: tie together P2M update and increment of entry count

2024-03-13 Thread George Dunlap
On Tue, Mar 12, 2024 at 3:22 PM Jan Beulich  wrote:
>
> When not holding the PoD lock across the entire region covering P2M
> update and stats update, the entry count - if to be incorrect at all -
> should indicate too large a value in preference to a too small one, to
> avoid functions bailing early when they find the count is zero. However,
> instead of moving the increment ahead (and adjust back upon failure),
> extend the PoD-locked region.
>
> Fixes: 99af3cd40b6e ("x86/mm: Rework locking in the PoD layer")
> Signed-off-by: Jan Beulich 
> ---
> The locked region could be shrunk again, by having multiple unlock
> calls. But I think both ioreq_request_mapcache_invalidate() and
> domain_crash() are fair enough to call with the lock still held?
> ---
> v3: Extend locked region instead. Add Fixes: tag.
> v2: Add comments.
>
> --- a/xen/arch/x86/mm/p2m-pod.c
> +++ b/xen/arch/x86/mm/p2m-pod.c
> @@ -1348,16 +1348,22 @@ mark_populate_on_demand(struct domain *d
>  }
>  }
>
> +/*
> + * P2M update and stats increment need to collectively be under PoD lock,
> + * to prevent code elsewhere observing PoD entry count being zero despite
> + * there actually still being PoD entries (created by the p2m_set_entry()
> + * invocation below).
> + */
> +pod_lock(p2m);
> +
>  /* Now, actually do the two-way mapping */
>  rc = p2m_set_entry(p2m, gfn, INVALID_MFN, order,
> p2m_populate_on_demand, p2m->default_access);
>  if ( rc == 0 )
>  {
> -pod_lock(p2m);
>  p2m->pod.entry_count += 1UL << order;
>  p2m->pod.entry_count -= pod_count;
>  BUG_ON(p2m->pod.entry_count < 0);
> -pod_unlock(p2m);
>
>  ioreq_request_mapcache_invalidate(d);
>  }
> @@ -1373,6 +1379,8 @@ mark_populate_on_demand(struct domain *d
>  domain_crash(d);
>  }
>
> +pod_unlock(p2m);

We're confident that neither domain_crash() nor
ioreq_request_mapcache_invalidate() will grab any of the p2m locks?

If so,

Reviewed-by: George Dunlap 



Re: [PATCH v2] x86/PoD: move increment of entry count

2024-03-11 Thread George Dunlap
On Tue, Jan 4, 2022 at 10:58 AM Jan Beulich  wrote:

> When not holding the PoD lock across the entire region covering P2M
> update and stats update, the entry count should indicate too large a
> value in preference to a too small one, to avoid functions bailing early
> when they find the count is zero. Hence increments should happen ahead
> of P2M updates, while decrements should happen only after. Deal with the
> one place where this hasn't been the case yet.
>
> Signed-off-by: Jan Beulich 
> ---
> v2: Add comments.
> ---
> While it might be possible to hold the PoD lock over the entire
> operation, I didn't want to chance introducing a lock order violation on
> a perhaps rarely taken code path.
>

[No idea how I missed this 2 years ago, sorry for the delay]

Actually I think just holding the lock is probably the right thing to do.
We already grab gfn_lock() over the entire operation, and p2m_set_entry()
ASSERTs gfn_locked_by_me(); and all we have to do to trigger the check is
boot any guest in PoD mode at all; surely we have osstest tests for that?

 -George


Re: [PATCH] do_multicall and MISRA Rule 8.3

2024-03-11 Thread George Dunlap
On Sat, Mar 9, 2024 at 1:59 AM Stefano Stabellini wrote:
>
> I would like to resurrect this thread and ask other opinions.
>
>
> On Thu, 23 Nov 2023, Jan Beulich wrote:
> > On 22.11.2023 22:46, Stefano Stabellini wrote:
> > > Two out of three do_multicall definitions/declarations use uint32_t as
> > > type for the "nr_calls" parameters. Change the third one to be
> > > consistent with the other two.
> > >
> > > Link: https://lore.kernel.org/xen-devel/7e3abd4c0ef5127a07a60de1bf090a8aefac8e5c.1692717906.git.federico.seraf...@bugseng.com/
> > > Link: https://lore.kernel.org/xen-devel/alpine.DEB.2.22.394.2308251502430.6458@ubuntu-linux-20-04-desktop/
> > > Signed-off-by: Stefano Stabellini 
> > > ---
> > > Note that a previous discussion showed disagreement between maintainers
> > > on this topic. The source of disagreements are that we don't want to
> > > change a guest-visible ABI and we haven't properly documented how to use
> > > types for guest ABIs.
> > >
> > > As an example, fixed-width types have the advantage of being explicit
> > > about their size but sometimes register-size types are required (e.g.
> > > unsigned long). The C specification says little about the size of
> > > unsigned long, and today we even use unsigned int in guest ABIs
> > > without specifying the expected width of unsigned int on the various
> > > arches. As Jan pointed out, in Xen we assume sizeof(int) >= 4, but
> > > that's not written anywhere as far as I can tell.
> > >
> > > I think the appropriate solution would be to document properly our
> > > expectations of both fixed-width and non-fixed-width types, and how to
> > > use them for guest-visible ABIs.
> > >
> > > In this patch I used uint32_t for a couple of reasons:
> > > - until we have better documentation, I feel more confident in using
> > >   explicitly-sized integers in guest-visible ABIs
> >
> > I disagree with this way of looking at it. Guests don't invoke these
> > functions directly, and our assembly code sitting in between already is
> > expected to (and does) guarantee that (in the case here) unsigned int
> > would be okay to use (as would be unsigned long, but at least on x86
> > that's slightly less efficient), in line with what ./CODING_STYLE says.
> >
> > Otoh structure definitions in the public interface of course need to
> > use fixed with types (and still doesn't properly do so in a few cases).

You didn't address the other argument, which was that all the other
definitions have uint32_t; in particular,
common/multicall.c:do_multicall() takes uint32_t.  Surely that should
match the non-compat definition in include/hypercall-defs.c?

Whether they should both be `unsigned int` or `uint32_t` I don't
really feel like I have a good enough grasp of the situation to form a
strong opinion.

 -George



Re: [PATCH v2] Argo: don't obtain excess page references

2024-03-07 Thread George Dunlap
On Wed, Mar 6, 2024 at 11:38 PM Christopher Clark wrote:
>
> On Sun, Feb 18, 2024 at 10:01 AM Julien Grall  wrote:
> >
> > Hi Jan,
> >
> > On 14/02/2024 10:12, Jan Beulich wrote:
> > > find_ring_mfn() already holds a page reference when trying to obtain a
> > > writable type reference. We shouldn't make assumptions on the general
> > > reference count limit being effectively "infinity". Obtain merely a type
> > > ref, re-using the general ref by only dropping the previously acquired
> > > one in the case of an error.
> > >
> > > Signed-off-by: Jan Beulich 
> >
> > Reviewed-by: Julien Grall 
> >
> > > ---
> > > I further question the log-dirty check there: The present P2M type of a
> > > page doesn't really matter for writing to the page (plus it's stale by
> > > the time it is looked at). Instead I think every write to such a page
> > > needs to be accompanied by a call to paging_mark_dirty().
> >
> > I agree with that.
>
> Adding OpenXT mailing list as I have found that I have not had the
> time available that I had hoped for to spend on reviewing this Argo
> change, and to provide opportunity for downstream feedback.
>
> Link to the posted patch (start of this thread):
> https://lists.xenproject.org/archives/html/xen-devel/2024-02/msg00858.html

Could we add some more designated reviewers / maintainers to the Argo
code to help spread the load a bit?

 -George



Re: [PATCH 5/6] nestedsvm: Remove bogus debug message from nestedsvm_check_intercepts

2024-03-04 Thread George Dunlap
On Tue, Feb 27, 2024 at 12:47 AM Andrew Cooper wrote:
>
> On 06/02/2024 1:20 am, George Dunlap wrote:
> > Changeset ef3e8db8068 ("x86/hvm: Corrections and improvements to
> > unhandled vmexit logging") introduced a printk to the default path of
> > the switch statement in nestedsvm_check_intercepts(), complaining of
> > an unknown exit reason.
> >
> > Unfortunately, the "core" switch statement which is meant to handle
> > all vmexit reasons is in nsvm_vmcb_guest_intercepts_exitcode(); the
> > switch statement in nestedsvm_check_intercepts() is only meant to
> > superimpose on top of that some special-casing for how to handle interaction
> > between L1 and L0 vmexits.
> >
> > Remove the printk, and add a comment to prevent future confusion.
> >
> > Signed-off-by: George Dunlap 
>
> Erm...   The addition of this printk was very deliberate, to point out
> where security fixes are needed.
>
> It's not bogus in the slightest.  It is an error for exit reasons to not
> be inspected for safety in this path.

I'm a bit at a loss how to respond to this.  As I wrote above, exit
reasons are inspected for safety in this path -- in
nsvm_vmcb_guest_intercepts_exitcode().  If you want the patch
reverted, you'll have to explain why that's not sufficient.

 -George



Re: [PATCH 6/6] svm/nestedvm: Introduce nested capabilities bit

2024-02-22 Thread George Dunlap
On Tue, Feb 20, 2024 at 12:25 AM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > --- /dev/null
> > +++ b/docs/designs/nested-svm-cpu-features.md
> > @@ -0,0 +1,110 @@
> > +# Nested SVM (AMD) CPUID requirements
> > +
> > +The first step in making nested SVM production-ready is to make sure
> > +that all features are implemented and well-tested.  To make this
> > +tractable, we will initially be limiting the "supported" range of
> > +nested virt to a specific subset of host and guest features.  This
> > +document describes the criteria for deciding on features, and the
> > +rationale behind each feature.
> > +
> > +For AMD, all virtualization-related features can be found in CPUID
> > +leaf 800A:edx
> > +
> > +# Criteria
> > +
> > +- Processor support: At a minimum we want to support processors from
> > +  the last 5 years.  All things being equal, older processors are
> > +  better.
>
> Nit: Perhaps missing "covering"? Generally I hope newer processors are
> "better".

I've reworded it.

> > +  For L0 this is required for performance: There's no way to tell the
> > +  guests not to use the LBR-related registers; and if the guest does,
> > +  then you have to save and restore all LBR-related registers on
> > +  context switch, which is prohibitive.
>
> "prohibitive" is too strong imo; maybe "undesirable"?

Prohibitive sounds strong, but I don't think it really is; here's the
dictionary definition:

"(of a price or charge) so high as to prevent something being done or bought."

The cost of saving the registers on every context switch is so high as
to prevent us from wanting to do it.

I could change it to "too expensive".

> > --- a/xen/arch/x86/hvm/svm/nestedhvm.h
> > +++ b/xen/arch/x86/hvm/svm/nestedhvm.h
> > @@ -35,6 +35,7 @@ enum nestedhvm_vmexits
> >  nestedsvm_check_intercepts(struct vcpu *v, struct cpu_user_regs *regs,
> >  uint64_t exitcode);
> >  void svm_nested_features_on_efer_update(struct vcpu *v);
> > +void __init start_nested_svm(struct hvm_function_table *svm_function_table);
>
> No section placement attributes on declarations, please.

Ack.

> > --- a/xen/arch/x86/hvm/svm/nestedsvm.c
> > +++ b/xen/arch/x86/hvm/svm/nestedsvm.c
> > @@ -1666,3 +1666,17 @@ void svm_nested_features_on_efer_update(struct vcpu *v)
> >  }
> >  }
> >  }
> > +
> > +void __init start_nested_svm(struct hvm_function_table *svm_function_table)
> > +{
> > +/*
> > + * Required host functionality to support nested virt.  See
> > + * docs/designs/nested-svm-cpu-features.md for rationale.
> > + */
> > +svm_function_table->caps.nested_virt =
> > +cpu_has_svm_nrips &&
> > +cpu_has_svm_lbrv &&
> > +cpu_has_svm_nrips &&
>
> nrips twice? Was the earlier one meant to be npt?

Er, yes exactly.



> > --- a/xen/arch/x86/hvm/vmx/vmx.c
> > +++ b/xen/arch/x86/hvm/vmx/vmx.c
> > @@ -3021,6 +3021,9 @@ const struct hvm_function_table * __init start_vmx(void)
> >  if ( cpu_has_vmx_tsc_scaling )
> >  vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
> >
> > +/* TODO: Require hardware support before enabling nested virt */
> > +vmx_function_table.caps.nested_virt = vmx_function_table.caps.hap;
>
> This won't have the intended effect if hap_supported() ends up clearing
> the bit (used as input here) due to a command line option override. I
> wonder if instead this wants doing e.g. in a new hook to be called from
> nestedhvm_setup(). On the SVM side the hook function would then be the
> start_nested_svm() that you already introduce, with a caps.hap check
> added.

I take it presmp_initcall()'s are guaranteed to run before __initcall()'s?

If so that's probably a good idea.

> Since you leave the other nested-related if() in place in
> arch_sanitise_domain_config(), all ought to be well, but I think that
> other if() really wants replacing by the one you presently add.

Ack.

I'll probably check in patches 1,2,3, and 5, and resend the other two,
unless you'd like to see all the changes again...

 -George







Re: [PATCH 5/6] nestedsvm: Remove bogus debug message from nestedsvm_check_intercepts

2024-02-22 Thread George Dunlap
On Mon, Feb 19, 2024 at 11:56 PM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > Changeset ef3e8db8068 ("x86/hvm: Corrections and improvements to
> > unhandled vmexit logging") introduced a printk to the default path of
> > the switch statement in nestedsvm_check_intercepts(), complaining of
> > an unknown exit reason.
> >
> > Unfortunately, the "core" switch statement which is meant to handle
> > all vmexit reasons is in nsvm_vmcb_guest_intercepts_exitcode(); the
> > switch statement in nestedsvm_check_intercepts() is only meant to
> > superimpose on top of that some special-casing for how to handle interaction
> > between L1 and L0 vmexits.
> >
> > Remove the printk, and add a comment to prevent future confusion.
> >
> > Signed-off-by: George Dunlap 
>
> Reviewed-by: Jan Beulich 
>
> I wonder if a Fixes: tag is warranted here.

Yes, I think probably so.


 -George



Re: [PATCH 4/6] nestedsvm: Disable TscRateMSR

2024-02-22 Thread George Dunlap
On Thu, Feb 22, 2024 at 5:50 PM Jan Beulich  wrote:
>
> On 22.02.2024 10:30, George Dunlap wrote:
> > On Wed, Feb 21, 2024 at 6:52 PM Jan Beulich  wrote:
> >>>> But then of course Andrew may know of reasons why all of this is done
> >>>> in calculate_host_policy() in the first place, rather than in HVM
> >>>> policy calculation.
> >>>
> >>> It sounds like maybe you're confusing host_policy with
> >>> x86_capabilities?  From what I can tell:
> >>>
> >>> *  the "basic" cpu_has_X macros resolve to boot_cpu_has(), which
> >>> resolves to cpu_has(&boot_cpu_data, ...), which is completely
> >>> independent of the cpu-policy.c:host_cpu_policy
> >>>
> >>> * cpu-policy.c:host_cpu_policy only affects what is advertised to
> >>> guests, via {pv,hvm}_cpu_policy and featureset bits.  Most notably a
> >>> quick skim doesn't show any mechanism by which host_cpu_policy could
> >>> affect what features Xen itself decides to use.
> >>
> >> I'm not mixing the two, no; the two are still insufficiently disentangled.
> >> There's really no reason (long term) to have both host policy and
> >> x86_capabilities. Therefore I'd prefer if new code (including a basically
> >> fundamental re-write as is going to be needed for nested) to avoid
> >> needlessly further extending x86_capabilities. Unless of course there's
> >> something fundamentally wrong with eliminating the redundancy, which
> >> likely Andrew would be in the best position to point out.
> >
> > So I don't know the history of how things got to be the way they are,
> > nor really much about the code but what I've gathered from skimming
> > through while creating this patch series.  But from that impression,
> > the only issue I really see with the current code is the confusing
> > naming.  The cpufeature.h code has this nice infrastructure to allow
> > you to, for instance, enable or disable certain bits on the
> > command-line; and the interface for querying all the different bits of
> > functionality is all nicely put in one place.  Moving the
> > svm_feature_flags into x86_capabilities would immediately allow SVM to
> > take advantage of this infrastructure; it's not clear to me how this
> > would be "needless".
> >
> > Furthermore, it looks to me like host_cpu_policy is used as a starting
> > point for generating pv_cpu_policy and hvm_cpu_policy, both of which
> > are only used for guest cpuid generation.  Given that the format of
> > those policies is fixed, and there's a lot of "copy this bit from the
> > host policy wholesale", it seems like no matter what, you'd want a
> > host_cpu_policy.
> >
> > And in any case -- all that is kind of moot.  *Right now*,
> > host_cpu_policy is only used for guest cpuid policy creation; *right
> > now*, the nested virt features of AMD are handled in the
> > host_cpu_policy; *right now*, we're advertising to guests bits which
> > are not properly virtualized; *right now* these bits are actually set
> > unconditionally, regardless of whether they're even available on the
> > hardware; *right now*, Xen uses svm_feature_flags to determine its own
> > use of TscRateMSR; so *right now*, removing this bit from
> > host_cpu_policy won't prevent Xen from using TscRateMSR itself.
> >
> > (Unless my understanding of the code is wrong, in which case I'd
> > appreciate a correction.)
>
> There's nothing wrong afaics, just missing at least one aspect: Did you
> see all the featureset <-> policy conversions in cpu-policy.c? That (to
> me at least) clearly is a sign of unnecessary duplication of the same
> data. This goes as far as seeding the host policy from the raw one, just
> to then immediately run x86_cpu_featureset_to_policy(), thus overwriting
> a fair part of what was first taken from the raw policy. That's necessary
> right now, because setup_{force,clear}_cpu_cap() act on
> boot_cpu_data.x86_capability[], not the host policy.
>
> As to the "needless" further up, it's only as far as moving those bits
> into x86_capability[] would further duplicate information, rather than
> (for that piece at least) putting them into the policies right away. But
> yes, if the goal is to have setup_{force,clear}_cpu_cap() be able to
> control those bits as well, then going the intermediate step would be
> unavoidable at this point in time.

I'm still not sure of what needs to happen to move this forward.

As I said, I'm not opposed to doing some prep work; but I don't want
to randomly guess as to what kinds of changes are wanted.

Re: [PATCH 4/6] nestedsvm: Disable TscRateMSR

2024-02-22 Thread George Dunlap
On Wed, Feb 21, 2024 at 6:52 PM Jan Beulich  wrote:
> >> But then of course Andrew may know of reasons why all of this is done
> >> in calculate_host_policy() in the first place, rather than in HVM
> >> policy calculation.
> >
> > It sounds like maybe you're confusing host_policy with
> > x86_capabilities?  From what I can tell:
> >
> > *  the "basic" cpu_has_X macros resolve to boot_cpu_has(), which
> > resolves to cpu_has(&boot_cpu_data, ...), which is completely
> > independent of the cpu-policy.c:host_cpu_policy
> >
> > * cpu-policy.c:host_cpu_policy only affects what is advertised to
> > guests, via {pv,hvm}_cpu_policy and featureset bits.  Most notably a
> > quick skim doesn't show any mechanism by which host_cpu_policy could
> > affect what features Xen itself decides to use.
>
> I'm not mixing the two, no; the two are still insufficiently disentangled.
> There's really no reason (long term) to have both host policy and
> x86_capabilities. Therefore I'd prefer if new code (including a basically
> fundamental re-write as is going to be needed for nested) to avoid
> needlessly further extending x86_capabilities. Unless of course there's
> something fundamentally wrong with eliminating the redundancy, which
> likely Andrew would be in the best position to point out.

So I don't know the history of how things got to be the way they are,
nor really much about the code but what I've gathered from skimming
through while creating this patch series.  But from that impression,
the only issue I really see with the current code is the confusing
naming.  The cpufeature.h code has this nice infrastructure to allow
you to, for instance, enable or disable certain bits on the
command-line; and the interface for querying all the different bits of
functionality is all nicely put in one place.  Moving the
svm_feature_flags into x86_capabilities would immediately allow SVM to
take advantage of this infrastructure; it's not clear to me how this
would be "needless".

Furthermore, it looks to me like host_cpu_policy is used as a starting
point for generating pv_cpu_policy and hvm_cpu_policy, both of which
are only used for guest cpuid generation.  Given that the format of
those policies is fixed, and there's a lot of "copy this bit from the
host policy wholesale", it seems like no matter what, you'd want a
host_cpu_policy.

And in any case -- all that is kind of moot.  *Right now*,
host_cpu_policy is only used for guest cpuid policy creation; *right
now*, the nested virt features of AMD are handled in the
host_cpu_policy; *right now*, we're advertising to guests bits which
are not properly virtualized; *right now* these bits are actually set
unconditionally, regardless of whether they're even available on the
hardware; *right now*, Xen uses svm_feature_flags to determine its own
use of TscRateMSR; so *right now*, removing this bit from
host_cpu_policy won't prevent Xen from using TscRateMSR itself.

(Unless my understanding of the code is wrong, in which case I'd
appreciate a correction.)

If at some point in the future x86_capabilities and host_cpu_policy
were merged somehow, whoever did the merging would have to untangle
the twiddling of these bits anyway.  What I'm changing in this patch
wouldn't make that any harder.

> > Not sure exactly why the nested virt stuff is done at the
> > host_cpu_policy level rather than the hvm_cpu_policy level, but since
> > that's where it is, that's where we need to change it.
> >
> > FWIW, as I said in response to your comment on 2/6, it would be nicer
> > if we moved svm_feature_flags into the "capabilities" section; but
> > that's a different set of work.
>
> Can as well reply here then, rather than there: If the movement from
> host to HVM policy was done first, the patch here could more sanely go
> on top, and patch 2 could then also go on top, converting the separate
> variable to host policy accesses, quite possibly introducing a similar
> wrapper as you introduce there right now.
>
> But no, I'm not meaning to make this a requirement; this would merely be
> an imo better approach. My ack there stands, in case you want to keep
> (and commit) the change as is.

I mean, I don't mind doing a bit more prep work, if I know that's the
direction we want to go in.  "Actually, since you're doing a bit of
clean-up anyway -- right now host_cpu_policy is only used to calculate
guest policy, but we'd like to move over to being the Source of Truth
for the host instead of x86_capabilities.  While you're here, would
you mind moving the nested virt policy stuff into hvm_cpu_policy
instead?"

I certainly wouldn't want to move svm_feature_flags into
host_cpu_policy while it's still got random other "guest-only" policy
bits in it; and auditing it for such policy bits is out of the scope
of this work.

 -George



Re: [PATCH 2/2] almost fully ignore zero-size flush requests

2024-02-21 Thread George Dunlap
On Wed, Feb 21, 2024 at 3:17 PM Jan Beulich  wrote:
> > #1 by itself is probably enough to counterindicate this kind of
> > behavior.  Add them together, and I'm inclined to say that we should
> > write a policy against such optimizations, without specific
> > justifications.
>
> It's not like I didn't give any justification. So I guess you mean
> without better (whatever that means) justification.

Sorry, what I meant was that the policy would have to include a sketch
for what sorts of justifications would be acceptable.

For instance, here's a justification I would consider for this sort of thing:

A. In use-case X, there is hard limit Y on the binary size.  For X's
configuration, with a reasonably small number of features enabled, we
are already close to 90% of the way there.  If we were to consistently
use this sort of manual code size optimization techniques across the
codebase, we could cut down the total size of the code base by 25%.

Here's a situation I would absolutely not consider worth it:

B. If we consistently use this sort of code size optimization
techniques across the codebase, we could cut down the entire size of
the codebase by 0.1%.  There are no hard limits, we're just trying to
generally keep things smaller.

Filling our codebase with these sorts of logic puzzles ("Why are we
binary or-ing the offset and the length?") makes it more difficult for
people to understand the code base and increases the risk of someone
making a mistake as they try to change it.  For instance, is this
change really equivalent, given that previously one of the comparisons
had >=?  It turns out yes, but only because we filter out situations
where the length is 0; what if we were to move things around again,
such that we actually can get here with length 0?

Making the binary 0.1% smaller is absolutely not worth the cost of
that.  I'm not sure even 5% would be worth that cost, given that we
don't really have any hard limits we're in danger of exceeding (at
least that I'm aware of).

But a minimum justification for allowing these sorts of things would
need to include a concrete prediction of the improvement we would get
by applying these sorts of things all over the place; not simply, "in
this instance it goes from three to two branches".

 -George



Re: [PATCH 4/6] nestedsvm: Disable TscRateMSR

2024-02-21 Thread George Dunlap
On Mon, Feb 19, 2024 at 11:22 PM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > For now, just disable the functionality entirely until we can
> > implement it properly:
> >
> > - Don't set TSCRATEMSR in the host CPUID policy
>
> This goes too far: This way you would (in principle) also affect guests
> with nesting disabled. According to the earlier parts of the description
> there's also no issue with it in that case. What you want to make sure
> it that in the HVM policy the bit isn't set.
>
> While presently resolving to cpu_has_svm_feature(), I think
> cpu_has_tsc_ratio really ought to resolve to the host policy field.
> Of course then requiring the host policy to reflect reality rather than
> having what is "always emulated". IOW ...
>
> > --- a/xen/arch/x86/cpu-policy.c
> > +++ b/xen/arch/x86/cpu-policy.c
> > @@ -407,8 +407,7 @@ static void __init calculate_host_policy(void)
> > (1u << SVM_FEATURE_PAUSEFILTER) |
> > (1u << SVM_FEATURE_DECODEASSISTS));
> >  /* Enable features which are always emulated. */
> > -p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
> > -   (1u << SVM_FEATURE_TSCRATEMSR));
> > +p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
>
> ... this likely wants replacing altogether by not overriding what we
> found in hardware, which would apparently mean moving the two bit
> masks to the earlier "clamping" expression.
>
> But then of course Andrew may know of reasons why all of this is done
> in calculate_host_policy() in the first place, rather than in HVM
> policy calculation.

It sounds like maybe you're confusing host_policy with
x86_capabilities?  From what I can tell:

*  the "basic" cpu_has_X macros resolve to boot_cpu_has(), which
resolves to cpu_has(&boot_cpu_data, ...), which is completely
independent of the cpu-policy.c:host_cpu_policy

* cpu-policy.c:host_cpu_policy only affects what is advertised to
guests, via {pv,hvm}_cpu_policy and featureset bits.  Most notably a
quick skim doesn't show any mechanism by which host_cpu_policy could
affect what features Xen itself decides to use.

Not sure exactly why the nested virt stuff is done at the
host_cpu_policy level rather than the hvm_cpu_policy level, but since
that's where it is, that's where we need to change it.

FWIW, as I said in response to your comment on 2/6, it would be nicer
if we moved svm_feature_flags into the "capabilities" section; but
that's a different set of work.

 -George





Re: [PATCH 1/6] xen/hvm: Convert hap_capabilities into a bitfield

2024-02-20 Thread George Dunlap
On Wed, Feb 21, 2024 at 3:23 PM Jan Beulich  wrote:
>
> On 21.02.2024 08:02, George Dunlap wrote:
> > On Mon, Feb 19, 2024 at 9:36 PM Jan Beulich  wrote:
> >> On 06.02.2024 02:20, George Dunlap wrote:
> >>> --- a/xen/arch/x86/hvm/vmx/vmcs.c
> >>> +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> >>> @@ -113,8 +113,8 @@ static int cf_check parse_ept_param_runtime(const char *s)
> >>>  int val;
> >>>
> >>>  if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
> >>> - !(hvm_funcs.hap_capabilities &
> >>> -   (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
> >>> + !(hvm_funcs.caps.hap_superpage_2mb ||
> >>> +   hvm_funcs.caps.hap_superpage_1gb) )
> >>>  {
> >>>  printk("VMX: EPT not available, or not in use - ignoring\n");
> >>
> >> Just to mention it: The conditional and the log message don't really
> >> fit together. (I was first wondering what the 2mb/1gb checks had to
> >> do here at all, but that's immediately clear when seeing that the
> >> only sub-option here is "exec-sp".)
> >
> > So you mean basically that the checks & error message are poorly
> > factored, because there's only a single sub-option?  (i.e., if there
> > were options which didn't rely on superpages, the check would be
> > incorrect?)
>
> Right.
>
> > Let me know if there's something concrete you'd like me to do here.
>
> Nothing. I meant to express this by starting with "Just to mention it".

Right, and when I said "let me know" I meant, "I'm going to ignore
this unless you say something, feel free to say nothing". :-D

I understood that you weren't asking for anything, but maybe coming
back to this after a few days you'd've had a simple fix.  I wouldn't
mind changing the text of the message, but I didn't feel like finding
a better text.  Reorganizing the checks (which seems closer to the
Right Thing) is off-topic for this patch of course.

 -George



Re: [PATCH 3/6] xen/hvm: Move other hvm_function_table booleans into the caps bitfield

2024-02-20 Thread George Dunlap
On Tue, Feb 20, 2024 at 12:08 AM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > --- a/xen/arch/x86/include/asm/hvm/hvm.h
> > +++ b/xen/arch/x86/include/asm/hvm/hvm.h
> > @@ -86,20 +86,19 @@ struct hvm_vcpu_nonreg_state {
> >  struct hvm_function_table {
> >  const char *name;
> >
> > -/* Support Hardware-Assisted Paging? */
> > -bool hap_supported;
> > -
> > -/* Necessary hardware support for alternate p2m's? */
> > -bool altp2m_supported;
> > -bool singlestep_supported;
> > -
> > -/* Hardware virtual interrupt delivery enable? */
> > -bool virtual_intr_delivery_enabled;
> > -
> >  struct {
> >  /* Indicate HAP capabilities. */
> > -bool hap_superpage_1gb:1,
> > -hap_superpage_2mb:1;
> > +bool hap:1,
> > + hap_superpage_1gb:1,
> > + hap_superpage_2mb:1,
> > +
> > +/* Altp2m capabilities */
> > +altp2m:1,
> > +singlestep:1,
> > +
> > +/* Hardware virtual interrupt delivery enable? */
> > +virtual_intr_delivery;
> > +
> >  } caps;
>
> Nit (spotted only while looking at patch 6): You're adding a stray blank
> line at the end of the structure. Further I expect virtual_intr_delivery
> would also want to be just a single bit?

Oh yes, good catch.  (I kind of feel like ":1" should be the default
for bools, but hey...)

I'll fix up this and the 0/1 => false/true thing.

 -George



Re: [PATCH 2/6] svm: Improve type of cpu_has_svm_feature

2024-02-20 Thread George Dunlap
On Mon, Feb 19, 2024 at 11:24 PM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > --- a/xen/arch/x86/include/asm/hvm/svm/svm.h
> > +++ b/xen/arch/x86/include/asm/hvm/svm/svm.h
> > @@ -38,7 +38,10 @@ extern u32 svm_feature_flags;
> >  #define SVM_FEATURE_SSS   19 /* NPT Supervisor Shadow Stacks */
> >  #define SVM_FEATURE_SPEC_CTRL 20 /* MSR_SPEC_CTRL virtualisation */
> >
> > -#define cpu_has_svm_feature(f) (svm_feature_flags & (1u << (f)))
> > +static inline bool cpu_has_svm_feature(unsigned int feat)
> > +{
> > +return svm_feature_flags & (1u << (feat));
> > +}
> >  #define cpu_has_svm_npt   cpu_has_svm_feature(SVM_FEATURE_NPT)
> >  #define cpu_has_svm_lbrv  cpu_has_svm_feature(SVM_FEATURE_LBRV)
> >  #define cpu_has_svm_svml  cpu_has_svm_feature(SVM_FEATURE_SVML)
>
> Having seen patch 4 now, I have to raise the question here as well: Why
> do we need a separate variable (svm_feature_flags) when we could use
> the host policy (provided it isn't abused; see comments on patch 4)?

We certainly don't need an extra variable; in fact, moving all of
these into the host cpuid policy thing would make it easier, for
example, to test some of the guest creation restrictions: One could
use the Xen command-line parameters to disable some of the bits, then
try to create a nested SVM guest and verify that it fails as expected.

I would like to do that eventually, but this patch is already done and
improves the code, so I just kept it.

Let me know if you'd like me to simply drop this patch instead.

 -George



Re: [PATCH 2/6] svm: Improve type of cpu_has_svm_feature

2024-02-20 Thread George Dunlap
On Mon, Feb 19, 2024 at 10:03 PM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > The "effective type" of the cpu_has_svm_feature macro is effectively
> > an unsigned long with one bit set (or not); at least one place someone
> > felt compelled to do a !! to make sure that they got a boolean out of
> > it.
> >
> > Ideally the whole of this would be folded into the cpufeature.h
> > infrastructure.  But for now, duplicate the more type-safe static
> > inlines in that file, and remove the !!.
> >
> > Signed-off-by: George Dunlap 
>
> Acked-by: Jan Beulich 
> albeit preferably with ...
>
> > --- a/xen/arch/x86/include/asm/hvm/svm/svm.h
> > +++ b/xen/arch/x86/include/asm/hvm/svm/svm.h
> > @@ -38,7 +38,10 @@ extern u32 svm_feature_flags;
> >  #define SVM_FEATURE_SSS   19 /* NPT Supervisor Shadow Stacks */
> >  #define SVM_FEATURE_SPEC_CTRL 20 /* MSR_SPEC_CTRL virtualisation */
> >
> > -#define cpu_has_svm_feature(f) (svm_feature_flags & (1u << (f)))
> > +static inline bool cpu_has_svm_feature(unsigned int feat)
> > +{
> > +return svm_feature_flags & (1u << (feat));
>
> ... the inner pair of parentheses dropped.

Done, thanks.

 -George



Re: [PATCH 1/6] xen/hvm: Convert hap_capabilities into a bitfield

2024-02-20 Thread George Dunlap
On Mon, Feb 19, 2024 at 9:36 PM Jan Beulich  wrote:
>
> On 06.02.2024 02:20, George Dunlap wrote:
> > hvm_function_table is an internal structure; rather than manually
> > |-ing and &-ing bits, just make it a boolean bitfield and let the
> > compiler do all the work.  This makes everything easier to read, and
> > presumably allows the compiler more flexibility in producing efficient
> > code.
> >
> > No functional change intended.
> >
> > Signed-off-by: George Dunlap 
>
> Reviewed-by: Jan Beulich 
>
> > ---
> > Questions:
> >
> > * Should hap_superpage_2m really be set unconditionally, or should we
> >   condition it on cpu_has_svm_npt?
>
> That's HAP capabilities; there's not going to be any use of HAP when
> there's no NPT (on an AMD system). IOW - all is fine as is, imo.

Basically there are two stances to take:

1. hap_superpage_2m always has meaning.  If there's no HAP, then there
are no HAP superpages, so we should say that there are no superpages.

2. hap_superpage_2m only has meaning if hap is enabled.  If there's no
HAP, then the question "are there superpages" is moot; nobody should
be paying attention to it.

The second is not without precedent, but is it really the best stance?
 It means that before reading hap_superpage_2m, you have to first
check hap; and it introduces a risk (no matter how small) that someone
will forget to check, and end up doing the wrong thing.

Not a huge deal, but overall my vote would be #1.  I may send a patch
at some point in the future.

> > * Do we really need to "!!cpu_has_svm_npt"?  If so, wouldn't it be
> >   better to put the "!!"  in the #define, rather than requiring the
> >   user to know that it's needed?
>
> Considering that hap_supported is bool now, the !! can simply be
> dropped. We've been doing so as code was touched anyway, not in a
> concerted effort.

Right, forgot to actually delete this para after adding a patch to address it.

> > --- a/xen/arch/x86/hvm/vmx/vmcs.c
> > +++ b/xen/arch/x86/hvm/vmx/vmcs.c
> > @@ -113,8 +113,8 @@ static int cf_check parse_ept_param_runtime(const char 
> > *s)
> >  int val;
> >
> >  if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
> > - !(hvm_funcs.hap_capabilities &
> > -   (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
> > + !(hvm_funcs.caps.hap_superpage_2mb ||
> > +   hvm_funcs.caps.hap_superpage_1gb) )
> >  {
> >  printk("VMX: EPT not available, or not in use - ignoring\n");
>
> Just to mention it: The conditional and the log message don't really
> fit together. (I was first wondering what the 2mb/1gb checks had to
> do here at all, but that's immediately clear when seeing that the
> only sub-option here is "exec-sp".)

So you mean basically that the checks & error message are poorly
factored, because there's only a single sub-option?  (i.e., if there
were options which didn't rely on superpages, the check would be
incorrect?)

Let me know if there's something concrete you'd like me to do here.

> > @@ -104,8 +96,11 @@ struct hvm_function_table {
> >  /* Hardware virtual interrupt delivery enable? */
> >  bool virtual_intr_delivery_enabled;
> >
> > -/* Indicate HAP capabilities. */
> > -unsigned int hap_capabilities;
> > +struct {
> > +/* Indicate HAP capabilities. */
> > +bool hap_superpage_1gb:1,
> > +hap_superpage_2mb:1;
>
> Nit: Would be nice imo if the two identifiers aligned vertically with
> one another.

Done.

 -George



Re: [PATCH V3] libxl: Add "grant_usage" parameter for virtio disk devices

2024-02-20 Thread George Dunlap
On Thu, Feb 15, 2024 at 11:07 PM Oleksandr Tyshchenko
 wrote:
>
> From: Oleksandr Tyshchenko 
>
> Allow administrators to control whether Xen grant mappings for
> the virtio disk devices should be used. By default (when new
> parameter is not specified), the existing behavior is retained
> (we enable grants if backend-domid != 0).
>
> Signed-off-by: Oleksandr Tyshchenko 

Golang bits:

Acked-by: George Dunlap 



Re: Stats on Xen tarball downloads

2024-02-20 Thread George Dunlap
On Mon, Feb 19, 2024 at 6:38 PM Jan Beulich  wrote:
>
> On 19.02.2024 11:31, Roger Pau Monné wrote:
> > On Mon, Feb 19, 2024 at 06:01:54PM +0800, George Dunlap wrote:
> >> One of the questions we had with respect to changing our release
> >> practice (for instance, making the process more light-weight so that
> >> we could do a point release after every XSA) was, "How many people are
> >> actually using the tarballs?"
> >
> > What would this more lightweight process involve from a downstream
> > PoV?  IOW: in what would the contents of the tarball change compared
> > to the current releases?
>
> From all prior discussion my conclusion was "no tarball at all".

Or at very least, the tarball would be a simple `git archive` of a
release tag.   Right now the tarball creation has a number of
annoyingly manual parts about it.


 -George



Re: Stats on Xen tarball downloads

2024-02-20 Thread George Dunlap
On Mon, Feb 19, 2024 at 11:34 PM Elliott Mitchell  wrote:
>
> On Mon, Feb 19, 2024 at 06:01:54PM +0800, George Dunlap wrote:
> >
> > Looking at the *non*-4.18 downloads, nearly all of them have user
> > agents that make it clear they're part of automated build systems:
> > user agents like curl and wget, but also "Go-http-client", "libfetch",
> 
>
> I reject this claim.  `curl` or `wget` could be part of an interactive
> operation.  Telling a browser to copy a URL into the paste buffer, then
> using `wget`/`curl` is entirely possible.  I may be the outlier, but I
> routinely do this.

It's not just the user agent; there are certain statistical
regularities that make me think it's automated.  e.g., a specific
version of curl always downloading a specific version of the tarball,
the tar.gz and the tar.gz.sig being downloaded exactly the same time
distance apart.  There certainly *are* manual wget / curl invocations,
but the majority of them look to me like they're part of automated
systems.

(And the "Go-http-client" instances are kind of fascinating to me --
someone wrote something on golang to download the Xen tarball?  And
always 4.15.1?  And it's being run from both NH, USA and from Finland,
and a handful of other places that seem unrelated?  What project is
this?)

> > It's not really clear to me why we'd be getting 300-ish people
> > downloading the Xen 4.18.0 tarball, 2/3 of which are on Windows.  But
> > then I'm also not sure why someone would *fake* hundreds of downloads
> > a week from unique IP addresses; and in particular, if you were going
> > to fake hundreds of downloads a week, I'm not sure why you'd only fake
> > the most recent release.
>
> Remember the browser wars?  At one point many sites were looking for
> IE/Windows and sending back error messages without those.  Getting the
> tarball on Windows doesn't seem too likely, faking the browser was
> pretty common for a while.

Right, which is why I wanted to look more into the rest of the data to
see if I could get a feel for it.  There are very few Windows user
agents for the other versions; the handful of browser agents for
non-4.18.0 tarballs look very normal and unix-y.  So the question is,
why would you fake loads of downloads for Chrome / Firefox / Edge on
Windows *only* for 4.18.0?

I agree that none of the current explanations make a lot of sense; but
I continue to believe that the "We have loads of actual humans
downloading the 4.18.0 tarball via browsers, even on Windows" is the
least-bad fit.  (Feel free to propose others, though.)

 -George



Re: [PATCH 2/2] almost fully ignore zero-size flush requests

2024-02-20 Thread George Dunlap
On Tue, Feb 20, 2024 at 4:26 PM Jan Beulich  wrote:
> >> +if ( (cflush->offset | cflush->length) > PAGE_SIZE ||
> >
> > This is confusing. I understand you are trying to force the compiler to
> > optimize. But is it really worth it? After all, the rest of operation
> > will outweight this check (cache flush are quite expensive).
>
> From purely a performance point of view it may not be worth it. From
> code size angle (taken globally) I already view this differently.
> Plus I think that we ought to aim at avoiding undesirable patterns,
> just because people tend to clone existing code when they can. Thing
> is that (as per below) the two of us apparently disagree on what
> "undesirable" is in cases like this one.
>
> > We probably should take a more generic decision (and encode in our
> > policy) because you seem to like this pattern and I dislike it :). Not
> > sure what the others think.

This is similar to the policy question I raised among the x86
committers a few weeks ago: You're manually specifying a more specific
behavior than is required, rather than specifying what you want and
then letting the compiler optimize things.  The problem with this is
twofold:

1. It's harder for humans to read and understand the intent

2. It ties the compiler's hands.  If you write your intent, then the
compiler is free to apply the optimization or not, or apply a
different optimization.  If you specify this optimization, then the
compiler has fewer ways that it's allowed to compile the code.

#1 by itself is probably enough to counterindicate this kind of
behavior.  Add them together, and I'm inclined to say that we should
write a policy against such optimizations, without specific
justifications.

 -George



Re: [PATCH v2] Constify some parameters

2024-02-20 Thread George Dunlap
On Wed, Feb 14, 2024 at 5:48 PM Frediano Ziglio
 wrote:
>
> Make clear they are not changed in the functions.
>
> Signed-off-by: Frediano Ziglio 
> Reviewed-by: Jan Beulich 

Sched bits:

Acked-by: George Dunlap 



Re: [PATCH 5/5] mm: add the __must_check attribute to {gfn,mfn}_add()

2024-02-18 Thread George Dunlap
On Wed, Feb 14, 2024 at 7:42 PM Roger Pau Monne  wrote:
>
> It's not obvious from the function itself whether the incremented value will 
> be
> stored in the parameter, or returned to the caller.  That has led to bugs in
> the past as callers assume the incremented value is stored in the parameter.
>
> Add the __must_check attribute to the function to easily spot callers that
> don't consume the returned value, which signals an error in the caller logic.
>
> No functional change intended.

Just thinking out loud here... I wonder if "mfn_plus()" would be less
likely to be misunderstood.  Catching this during compile is def
better than catching it w/ Coverity or Eclair, but even better to help
people not make the mistake in the first place.

 -George



Re: [PATCH v2 1/2] x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly

2024-02-08 Thread George Dunlap
On Thu, Feb 8, 2024 at 4:02 PM Jan Beulich  wrote:
>
> On 08.02.2024 07:32, George Dunlap wrote:
> > On Thu, Jun 23, 2022 at 7:54 PM Jan Beulich  wrote:
> >
> >> Grant P2M entries, which are covered by p2m_is_any_ram(), wouldn't pass
> >> the get_page() unless the grant was a local one. These need to take the
> >> same path as foreign entries. Just the assertion there is not valid for
> >> local grants, and hence it triggering needs to be avoided.
> >>
> >
> > I think I'd say:
> >
> > ---
> > The 'fast' path of p2m_get_page_from_gfn handles three cases: normal ram,
> > foreign p2m entries, and grant map entries.  For normal ram and grant table
> > entries, get_page() is called, but for foreign entries,
> > page_get_owner_and_reference() is called, since the current domain is
> > expected not to be the owner.
> >
> > Unfortunately, grant maps are *also* generally expected to be owned by
> > foreign domains; so this function will fail for any p2m entry containing a
> > grant map that doesn't happen to be local.
> >
> > Have grant maps take the same path as foreign entries.  Since grants may
> > actually be either foreign or local, adjust the assertion to allow for this.
> > ---
>
> Sure, thanks, I can use this, but then I'd perhaps ought to add your
> S-o-b instead of ...

> ... R-b, requiring yet someone else's ack?

Legally I think the SoB is more for the provenance of the code than
the commit messages; so it would mainly be to credit me, which I'm not
particularly fussed by.

That said, we did just put something in MAINTAINERS about how to deal
with this situation; You sending the patch implicitly approves all the
changes I made, so then if I give an R-b, that approves all the
changes you made, satisfying the requirements.

 -George



Re: [PATCH v2 1/2] x86/p2m: make p2m_get_page_from_gfn() handle grant case correctly

2024-02-07 Thread George Dunlap
On Thu, Jun 23, 2022 at 7:54 PM Jan Beulich  wrote:

> Grant P2M entries, which are covered by p2m_is_any_ram(), wouldn't pass
> the get_page() unless the grant was a local one. These need to take the
> same path as foreign entries. Just the assertion there is not valid for
> local grants, and hence it triggering needs to be avoided.
>

I think I'd say:

---
The 'fast' path of p2m_get_page_from_gfn handles three cases: normal ram,
foreign p2m entries, and grant map entries.  For normal ram and grant table
entries, get_page() is called, but for foreign entries,
page_get_owner_and_reference() is called, since the current domain is
expected not to be the owner.

Unfortunately, grant maps are *also* generally expected to be owned by
foreign domains; so this function will fail for any p2m entry containing a
grant map that doesn't happen to be local.

Have grant maps take the same path as foreign entries.  Since grants may
actually be either foreign or local, adjust the assertion to allow for this.
---

One more comment...


> Signed-off-by: Jan Beulich 
> ---
> Using | instead of || helps the compiler fold the two p2m_is_*().
> ---
> v2: The shared case was fine; limit to grant adjustment.
>
> --- a/xen/arch/x86/mm/p2m.c
> +++ b/xen/arch/x86/mm/p2m.c
> @@ -357,11 +357,11 @@ struct page_info *p2m_get_page_from_gfn(
>   && !((q & P2M_UNSHARE) && p2m_is_shared(*t)) )
>  {
>  page = mfn_to_page(mfn);
> -if ( unlikely(p2m_is_foreign(*t)) )
> +if ( unlikely(p2m_is_foreign(*t) | p2m_is_grant(*t)) )
>

I'm not a fan of this.  If you replace it with || you can have my R-b
immediately; otherwise we'll have to wait until we can discuss our general
policy on this sort of thing at the x86 maintainer's call.

 -George


Re: [PATCH] x86/altp2m: p2m_altp2m_get_or_propagate() should honor ap2m->default_access

2024-02-07 Thread George Dunlap
On Thu, Feb 8, 2024 at 9:21 AM Tamas K Lengyel  wrote:
>
> On Wed, Feb 7, 2024 at 5:21 PM Andrew Cooper  
> wrote:
> >
> > On 07/02/2024 1:18 am, George Dunlap wrote:
> > > On Tue, Feb 6, 2024 at 6:08 PM Petr Beneš  wrote:
> > >> From: Petr Beneš 
> > >>
> > >> This patch addresses a behavior discrepancy in the handling of altp2m 
> > >> views,
> > >> where upon the creation and subsequent EPT violation, the page access
> > >> permissions were incorrectly inherited from the hostp2m instead of 
> > >> respecting
> > >> the altp2m default_access.
> > >>
> > >> Previously, when a new altp2m view was established with restrictive
> > >> default_access permissions and activated via xc_altp2m_switch_to_view(),
> > >> it failed to trigger an event on the first access violation.  This 
> > >> behavior
> > >> diverged from the intended mechanism, where the altp2m's default_access
> > >> should dictate the initial permissions, ensuring proper event triggering 
> > >> on
> > >> access violations.
> > >>
> > >> The correction involves modifying the handling mechanism to respect the
> > >> altp2m view's default_access upon its activation, eliminating the need 
> > >> for
> > >> setting memory access permissions for the entire altp2m range (e.g. 
> > >> within
> > >> xen-access.c).  This change not only aligns the behavior with the 
> > >> expected
> > >> access control logic but also results in a significant performance 
> > >> improvement
> > >> by reducing the overhead associated with setting memory access 
> > >> permissions
> > >> across the altp2m range.
> > >>
> > >> Signed-off-by: Petr Beneš 
> > > Thanks Petr, this looks like a great change.
> > >
> > > Two things:
> > >
> > > - Probably worth adjusting the comment at the top of
> > > p2m_altp2m_get_or_propagate to mention that you use the altp2m
> > > default_access when propagating from the host p2m
> > >
> > > - This represents a change in behavior, so probably at least worth a
> > > mention in CHANGELOG.md?
> >
> > This is a bugfix to a tech preview feature.  It's not remotely close to
> > CHANGELOG material.
> >
> > Tangent.  SUPPORT.md says tech preview on ARM, despite there being no
> > support in the slightest.  Patches were posted and never taken.
> >
> > > Tamas, I guess this is OK from an interface compatibility point of
> > > view?  In theory it should always have been behaving this way.
> >
> > Given the already-provided Ack, I expect that has been considered and
> > deemed ok.
>
> Correct, this is just a bugfix.

Er, ok, just one more comment: this could allow an altp2m to have more
permissions than the host; for example, the host p2m entry could be
p2m_access_r, but if the altp2m's default_access were p2m_access_rw,
it would override that.  Is that the behavior we want?  Or do we want
to do some sort of intersection of permissions?

If the former, I'd propose the comment be adjusted thus:

 * If the entry is invalid, and the host entry was valid, propagate
 * the host's entry to the altp2m, retaining page order but using the
 * altp2m's default_access, and indicate that the caller should re-try
 * the faulting instruction.

I could do that on check-in.

 -George



Re: [PATCH v2 2/2] x86/p2m: aid the compiler in folding p2m_is_...()

2024-02-06 Thread George Dunlap
On Thu, Feb 1, 2024 at 10:15 PM Jan Beulich  wrote:

> On 01.02.2024 14:32, George Dunlap wrote:
> > On Thu, Jun 23, 2022 at 12:54 PM Jan Beulich  wrote:
> >
> >> By using | instead of || or (in the negated form) && chances increase
> >> for the compiler to recognize that both predicates can actually be
> >> folded into an expression requiring just a single branch (via OR-ing
> >> together the respective P2M_*_TYPES constants).
> >>
> >> Signed-off-by: Jan Beulich 
> >>
> >
> > Sorry for the delay.  Git complains that this patch is malformed:
> >
> > error: `git apply --index`: error: corrupt patch at line 28
> >
> > Similar complaint from patchew when it was posted:
> >
> > https://patchew.org/Xen/5d6c927e-7d7c-5754-e7eb-65d1e70f6...@suse.com/
>
> Not sure what to say. The patch surely is well-formed. It applies fine
> using patch (when not taken from email). When taken from email, patch
> mentions that it strips CRs (I'm running my email client on Windows),
> but the saved email still applies fine. "git am" indeed is unhappy
> when taking the plain file as saved from email, albeit here with an
> error different from yours. If I edit the saved email to retain just
> the From: and Subject: tags, all is fine.
>

That still doesn't work for me.


> I can't tell what git doesn't like. The error messages (the one you
> see and the one I got) tell me nothing.


The raw email looks like this:

```
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -379,7 +379,7 @@ struct page_info *p2m_get_page_from_gfn(
 return page;
=20
 /* Error path: not a suitable GFN at all */
-if ( !p2m_is_ram(*t) && !p2m_is_paging(*t) && !p2m_is_pod(*t) &&
+if ( !(p2m_is_ram(*t) | p2m_is_paging(*t) | p2m_is_pod(*t)) &&
  !mem_sharing_is_fork(p2m->domain) )
 return NULL;
 }
```

Note the "=20" at the beginning of the empty line.  Why `patch` handles it
but `git am` doesn't, who knows.


> I'm also not aware of there
> being a requirement that patches I send via email need to be
> "git am"-able (unlike in xsa.git, where I edit patches enough to be
> suitable for that), nor am I aware how I would convince my email
> client and/or server to omit whatever git doesn't like or to add
> whatever git is missing.
>
> Bottom line - your response would be actionable by me only in so far
> as I could switch to using "git send-email". Which I'm afraid I'm not
> going to do unless left with no other choice. The way I've been
> sending patches has worked well for over 20 years, and for different
> projects. (I'm aware Andrew has some special "Jan" command to apply
> patches I send, but I don't know any specifics.)
>

In the general case, I'm not going to review a patch without being able to
see it in context; and it's not reasonable to expect reviewers to have
specific contributor-specific scripts for doing so.  If we run into this
issue in the future, and you want my review, you may have to post a git
tree somewhere, or attach the patch as an attachment or something.  (Or you
can try to figure out why `git am` isn't working and try to upstream a fix.)

That said, in this case, context isn't really necessary to understand the
change, so it won't be necessary.

The logic of the change is obviously correct; but it definitely reduces the
readability.  I kind of feel like whether this sort of optimization is
worth the benefits is more a general x86 maintainer policy decision.  Maybe
we can talk about it at the next maintainer's meeting I'll be at?

 -George


Re: [PATCH] x86/altp2m: p2m_altp2m_get_or_propagate() should honor ap2m->default_access

2024-02-06 Thread George Dunlap
On Tue, Feb 6, 2024 at 6:08 PM Petr Beneš  wrote:
>
> From: Petr Beneš 
>
> This patch addresses a behavior discrepancy in the handling of altp2m views,
> where upon the creation and subsequent EPT violation, the page access
> permissions were incorrectly inherited from the hostp2m instead of respecting
> the altp2m default_access.
>
> Previously, when a new altp2m view was established with restrictive
> default_access permissions and activated via xc_altp2m_switch_to_view(),
> it failed to trigger an event on the first access violation.  This behavior
> diverged from the intended mechanism, where the altp2m's default_access
> should dictate the initial permissions, ensuring proper event triggering on
> access violations.
>
> The correction involves modifying the handling mechanism to respect the
> altp2m view's default_access upon its activation, eliminating the need for
> setting memory access permissions for the entire altp2m range (e.g. within
> xen-access.c).  This change not only aligns the behavior with the expected
> access control logic but also results in a significant performance improvement
> by reducing the overhead associated with setting memory access permissions
> across the altp2m range.
>
> Signed-off-by: Petr Beneš 

Thanks Petr, this looks like a great change.

Two things:

- Probably worth adjusting the comment at the top of
p2m_altp2m_get_or_propagate to mention that you use the altp2m
default_access when propagating from the host p2m

- This represents a change in behavior, so probably at least worth a
mention in CHANGELOG.md?

Tamas, I guess this is OK from an interface compatibility point of
view?  In theory it should always have been behaving this way.

 -George



Re: [XEN PATCH] golang: Regen binding

2024-02-05 Thread George Dunlap
On Tue, Feb 6, 2024 at 12:53 AM Anthony PERARD
 wrote:
>
> Fixes: 024e7131be5c ("tools: don't expose XENFEAT_hvm_pirqs by default")
> Signed-off-by: Anthony PERARD 

Acked-by: George Dunlap 



Re: [PATCH 6/6] svm/nestedvm: Introduce nested capabilities bit

2024-02-05 Thread George Dunlap
On Tue, Feb 6, 2024 at 9:20 AM George Dunlap  wrote:

> @@ -655,6 +658,12 @@ static inline bool hvm_altp2m_supported(void)
>  return hvm_funcs.caps.altp2m;
>  }
>
> +/* Returns true if we have the minimum hardware requirements for nested virt 
> */
> +static inline bool hvm_nested_virt_supported(void)
> +{
> +return hvm_funcs.caps.nested_virt;
> +}

NB this is missing the !CONFIG_HVM version of this function; I'll
include it in v2.

 -George



[PATCH 6/6] svm/nestedvm: Introduce nested capabilities bit

2024-02-05 Thread George Dunlap
In order to make implementation and testing tractable, we will require
specific host functionality.  Add a nested_virt bit to hvm_funcs.caps,
and return an error if a domain is created with nested virt and this
bit isn't set.

For VMX, start with always enabling it if HAP is present; this
shouldn't change current behavior.

For SVM, require some basic functionality, adding a document
explaining the rationale.

NB that only SVM CPUID bits 0-7 have been considered.  Bits 10-16 may
be considered in a follow-up patch.

Signed-off-by: George Dunlap 
---
CC: Andrew Cooper 
CC: George Dunlap 
CC: Jan Beulich 
CC: Julien Grall 
CC: Stefano Stabellini 
CC: Wei Liu 
CC: "Roger Pau Monné" 
CC: Jun Nakajima 
CC: Kevin Tian 
---
 docs/designs/nested-svm-cpu-features.md | 110 
 xen/arch/x86/domain.c   |   6 ++
 xen/arch/x86/hvm/svm/nestedhvm.h|   1 +
 xen/arch/x86/hvm/svm/nestedsvm.c|  14 +++
 xen/arch/x86/hvm/svm/svm.c  |   2 +
 xen/arch/x86/hvm/vmx/vmx.c  |   3 +
 xen/arch/x86/include/asm/hvm/hvm.h  |  11 ++-
 7 files changed, 146 insertions(+), 1 deletion(-)

diff --git a/docs/designs/nested-svm-cpu-features.md 
b/docs/designs/nested-svm-cpu-features.md
new file mode 100644
index 00..7ffb8daefd
--- /dev/null
+++ b/docs/designs/nested-svm-cpu-features.md
@@ -0,0 +1,110 @@
+# Nested SVM (AMD) CPUID requirements
+
+The first step in making nested SVM production-ready is to make sure
+that all features are implemented and well-tested.  To make this
+tractable, we will initially be limiting the "supported" range of
+nested virt to a specific subset of host and guest features.  This
+document describes the criteria for deciding on features, and the
+rationale behind each feature.
+
+For AMD, all virtualization-related features can be found in CPUID
+leaf 800A:edx
+
+# Criteria
+
+- Processor support: At a minimum we want to support processors from
+  the last 5 years.  All things being equal, older processors are
+  better.  Bits 0:7 were available in the very earliest processors;
+  and even through bit 15 we should be pretty good support-wise.
+
+- Faithfulness to hardware: We need the behavior of the "virtual cpu"
+  from the L1 hypervisor's perspective to be as close as possible to
+  the original hardware.  In particular, the behavior of the hardware
+  on error paths 1) is not easy to understand or test, 2) can be the
+  source of surprising vulnerabiliies.  (See XSA-7 for an example of a
+  case where subtle error-handling differences can open up a privilege
+  escalation.)  We should avoid emulating any bit of the hardware with
+  complex error paths if we can at all help it.
+
+- Cost of implementation: We want to minimize the cost of
+  implementation (where this includes bringing an existing sub-par
+  implementation up to speed).  All things being equal, we'll favor a
+  configuration which does not require any new implementation.
+
+- Performance: All things being equal, we'd prefer to choose a set of
+  L0 / L1 CPUID bits that are faster rather than slower.
+
+
+# Bits
+
+- 0 `NP` *Nested Paging*: Required both for L0 and L1.
+
+  Based primarily on faithfulness and performance, as well as
+  potential cost of implementation.  Available on earliest hardware,
+  so no compatibility issues.
+
+- 1 `LbrVirt` *LBR / debugging virtualization*: Require for L0 and L1.
+
+  For L0 this is required for performance: There's no way to tell the
+  guests not to use the LBR-related registers; and if the guest does,
+  then you have to save and restore all LBR-related registers on
+  context switch, which is prohibitive.  Furthermore, the additional
+  emulation risks a security-relevant difference to come up.
+
+  Providing it to L1 when we have it in L0 is basically free, and
+  already implemented.
+
+  Just require it and provide it.
+
+- 2 `SVML` *SVM Lock*: Not required for L0, not provided to L1
+
+  Seems to be about enabling an operating system to prevent "blue
+  pill" attacks against itself.
+
+  Xen doesn't use it, nor provide it; so it would need to be
+  implementend.  The best way to protect a guest OS is to leave nested
+  virt disabled in the tools.
+
+- 3 `NRIPS` NRIP Save: Require for both L0 and L1
+
+  If NRIPS is not present, the software interrupt injection
+  functionality can't be used; and Xen has to emulate it.  That's
+  another source of potential security issues.  If hardware supports
+  it, then providing it to guest is basically free.
+
+- 4 `TscRateMsr`: Not required by L0, not provided to L1
+
+  The main putative use for this would be trying to maintain an
+  invariant TSC across cores with different clock speeds, or after a
+  migrate.  Unlike others, this doesn't have an error path to worry
+  about compatibility-wise; and according to tests done when nestedSVM
+  was first implemented, it's actually faster to emulate TscRateMSR in
+  the L0 hypervisor tha

[PATCH 2/6] svm: Improve type of cpu_has_svm_feature

2024-02-05 Thread George Dunlap
The "effective type" of the cpu_has_svm_feature macro is effectively
an unsigned long with one bit set (or not); at least one place someone
felt compelled to do a !! to make sure that they got a boolean out of
it.

Ideally the whole of this would be folded into the cpufeature.h
infrastructure.  But for now, duplicate the more type-safe static
inlines in that file, and remove the !!.

Signed-off-by: George Dunlap 
---
CC: Jan Beulich 
CC: Andrew Cooper 
CC: "Roger Pau Monné" 
CC: Wei Liu 
---
 xen/arch/x86/hvm/svm/svm.c | 2 +-
 xen/arch/x86/include/asm/hvm/svm/svm.h | 5 -
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 5741287355..40bc1ffbc6 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2580,7 +2580,7 @@ const struct hvm_function_table * __init start_svm(void)
 if ( !printed )
 printk(" - none\n");
 
-svm_function_table.hap_supported = !!cpu_has_svm_npt;
+svm_function_table.hap_supported = cpu_has_svm_npt;
 svm_function_table.caps.hap_superpage_2mb = true;
 svm_function_table.caps.hap_superpage_1gb = cpu_has_page1gb;
 
diff --git a/xen/arch/x86/include/asm/hvm/svm/svm.h 
b/xen/arch/x86/include/asm/hvm/svm/svm.h
index 687d35be40..9e9a148845 100644
--- a/xen/arch/x86/include/asm/hvm/svm/svm.h
+++ b/xen/arch/x86/include/asm/hvm/svm/svm.h
@@ -38,7 +38,10 @@ extern u32 svm_feature_flags;
 #define SVM_FEATURE_SSS   19 /* NPT Supervisor Shadow Stacks */
 #define SVM_FEATURE_SPEC_CTRL 20 /* MSR_SPEC_CTRL virtualisation */
 
-#define cpu_has_svm_feature(f) (svm_feature_flags & (1u << (f)))
+static inline bool cpu_has_svm_feature(unsigned int feat)
+{
+return svm_feature_flags & (1u << (feat));
+}
 #define cpu_has_svm_npt   cpu_has_svm_feature(SVM_FEATURE_NPT)
 #define cpu_has_svm_lbrv  cpu_has_svm_feature(SVM_FEATURE_LBRV)
 #define cpu_has_svm_svml  cpu_has_svm_feature(SVM_FEATURE_SVML)
-- 
2.25.1




[PATCH 0/6] AMD Nested Virt Preparation

2024-02-05 Thread George Dunlap
This series lays the groundwork for revamp of the AMD nested virt
functionality.  The first five patches are clean-ups or reorganizations
of existing code.  The final patch is the first major step towards making
the feature supportable: allowing Xen to refuse nested virt support if certain
hardware features are not present.

George Dunlap (6):
  xen/hvm: Convert hap_capabilities into a bitfield
  svm: Improve type of cpu_has_svm_feature
  xen/hvm: Move other hvm_function_table booleans into the caps bitfield
  nestedsvm: Disable TscRateMSR
  nestedsvm: Remove bogus debug message from nestedsvm_check_intercepts
  svm/nestedvm: Introduce nested capabilities bit

 docs/designs/nested-svm-cpu-features.md  | 110 +++
 xen/arch/x86/cpu-policy.c|   3 +-
 xen/arch/x86/domain.c|   6 +
 xen/arch/x86/hvm/hvm.c   |  14 +--
 xen/arch/x86/hvm/svm/nestedhvm.h |   1 +
 xen/arch/x86/hvm/svm/nestedsvm.c |  22 +++-
 xen/arch/x86/hvm/svm/svm.c   |  65 +--
 xen/arch/x86/hvm/vlapic.c|   4 +-
 xen/arch/x86/hvm/vmx/vmcs.c  |   6 +-
 xen/arch/x86/hvm/vmx/vmx.c   |  19 ++--
 xen/arch/x86/include/asm/hvm/hvm.h   |  47 
 xen/arch/x86/include/asm/hvm/svm/nestedsvm.h |   5 -
 xen/arch/x86/include/asm/hvm/svm/svm.h   |   5 +-
 13 files changed, 191 insertions(+), 116 deletions(-)
 create mode 100644 docs/designs/nested-svm-cpu-features.md

-- 
2.25.1




[PATCH 5/6] nestedsvm: Remove bogus debug message from nestedsvm_check_intercepts

2024-02-05 Thread George Dunlap
Changeset ef3e8db8068 ("x86/hvm: Corrections and improvements to
unhandled vmexit logging") introduced a printk to the default path of
the switch statement in nestedsvm_check_intercepts(), complaining of
an unknown exit reason.

Unfortunately, the "core" switch statement which is meant to handle
all vmexit reasons is in nsvm_vmcb_guest_intercepts_exitcode(); the
switch statement in nestedsvm_check_intercepts() is only meant to
superimpose on top of that some special-casing for how L1 and L0
vmexits interact.

Remove the printk, and add a comment to prevent future confusion.

Signed-off-by: George Dunlap 
---
 xen/arch/x86/hvm/svm/nestedsvm.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/hvm/svm/nestedsvm.c b/xen/arch/x86/hvm/svm/nestedsvm.c
index d02a59f184..a5319ab729 100644
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -1292,6 +1292,10 @@ nestedsvm_check_intercepts(struct vcpu *v, struct 
cpu_user_regs *regs,
 ASSERT(vcpu_nestedhvm(v).nv_vmexit_pending == 0);
 is_intercepted = nsvm_vmcb_guest_intercepts_exitcode(v, regs, exitcode);
 
+/*
+ * Handle specific interactions between exits that the guest and host
+ * may both want to intercept.
+ */
 switch ( exitcode )
 {
 case VMEXIT_INVALID:
@@ -1347,8 +1351,6 @@ nestedsvm_check_intercepts(struct vcpu *v, struct 
cpu_user_regs *regs,
 /* Always let the guest handle VMMCALL/VMCALL */
 return NESTEDHVM_VMEXIT_INJECT;
 default:
-gprintk(XENLOG_ERR, "Unexpected nested vmexit: reason %#"PRIx64"\n",
-exitcode);
 break;
 }
 
-- 
2.25.1




[PATCH 4/6] nestedsvm: Disable TscRateMSR

2024-02-05 Thread George Dunlap
The primary purpose of TSC scaling, from our perspective, is to
maintain the fiction of an "invariant TSC" across migrates between
platforms with different clock speeds.

On AMD, the TscRateMSR CPUID bit is unconditionally enabled in the
"host cpuid", even if the hardware doesn't actually support it.
According to c/s fd14a1943c4 ("nestedsvm: Support TSC Rate MSR"),
testing showed that emulating TSC scaling in an L1 was more expensive
than emulating TSC scaling on an L0 (due to extra sets of vmexit /
vmenter).

However, the current implementation seems to be broken.

First of all, the final L2 scaling ratio should be a composition of
the L0 scaling ratio and the L1 scaling ratio; there's no indication
this is being done anywhere.

Secondly, it's not clear that the L1 tsc scaling ratio actually
affects the L0 tsc scaling ratio.  The stored value (ns_tscratio) is
used to affect the tsc *offset*, but doesn't seem to actually be
factored into d->hvm.tsc_scaling_ratio.  (Which shouldn't be
per-domain anyway, but per-vcpu.)  Having the *offset* scaled
according to the nested scaling without the actual RDTSC itself also
being scaled has got to produce inconsistent results.

For now, just disable the functionality entirely until we can
implement it properly:

- Don't set TSCRATEMSR in the host CPUID policy

- Remove MSR_AMD64_TSC_RATIO emulation handling, so that the guest
  gets a #GP if it tries to access it (as it should when
  TSCRATEMSR is clear)

- Remove ns_tscratio from struct nestedhvm, and all code that touches
  it

Unfortunately this means ripping out the scaling calculation stuff as
well, since it's only used in the nested case; it's there in the git
tree if we need it for reference when we re-introduce it.

Signed-off-by: George Dunlap 
---
CC: Jan Beulich 
CC: Andrew Cooper 
CC: "Roger Pau Monné" 
CC: Wei Liu 
---
 xen/arch/x86/cpu-policy.c|  3 +-
 xen/arch/x86/hvm/svm/nestedsvm.c |  2 -
 xen/arch/x86/hvm/svm/svm.c   | 57 
 xen/arch/x86/include/asm/hvm/svm/nestedsvm.h |  5 --
 4 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/xen/arch/x86/cpu-policy.c b/xen/arch/x86/cpu-policy.c
index 10079c26ae..d71abbc44a 100644
--- a/xen/arch/x86/cpu-policy.c
+++ b/xen/arch/x86/cpu-policy.c
@@ -407,8 +407,7 @@ static void __init calculate_host_policy(void)
(1u << SVM_FEATURE_PAUSEFILTER) |
(1u << SVM_FEATURE_DECODEASSISTS));
 /* Enable features which are always emulated. */
-p->extd.raw[0xa].d |= ((1u << SVM_FEATURE_VMCBCLEAN) |
-   (1u << SVM_FEATURE_TSCRATEMSR));
+p->extd.raw[0xa].d |= (1u << SVM_FEATURE_VMCBCLEAN);
 }
 
 /* 0x00ce  MSR_INTEL_PLATFORM_INFO */
diff --git a/xen/arch/x86/hvm/svm/nestedsvm.c b/xen/arch/x86/hvm/svm/nestedsvm.c
index ee9602f5c8..d02a59f184 100644
--- a/xen/arch/x86/hvm/svm/nestedsvm.c
+++ b/xen/arch/x86/hvm/svm/nestedsvm.c
@@ -146,8 +146,6 @@ int cf_check nsvm_vcpu_reset(struct vcpu *v)
 svm->ns_msr_hsavepa = INVALID_PADDR;
 svm->ns_ovvmcb_pa = INVALID_PADDR;
 
-svm->ns_tscratio = DEFAULT_TSC_RATIO;
-
 svm->ns_cr_intercepts = 0;
 svm->ns_dr_intercepts = 0;
 svm->ns_exception_intercepts = 0;
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index b551eac807..34b9f603bc 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -777,43 +777,6 @@ static int cf_check svm_get_guest_pat(struct vcpu *v, u64 
*gpat)
 return 1;
 }
 
-static uint64_t scale_tsc(uint64_t host_tsc, uint64_t ratio)
-{
-uint64_t mult, frac, scaled_host_tsc;
-
-if ( ratio == DEFAULT_TSC_RATIO )
-return host_tsc;
-
-/*
- * Suppose the most significant 32 bits of host_tsc and ratio are
- * tsc_h and mult, and the least 32 bits of them are tsc_l and frac,
- * then
- * host_tsc * ratio * 2^-32
- * = host_tsc * (mult * 2^32 + frac) * 2^-32
- * = host_tsc * mult + (tsc_h * 2^32 + tsc_l) * frac * 2^-32
- * = host_tsc * mult + tsc_h * frac + ((tsc_l * frac) >> 32)
- *
- * Multiplications in the last two terms are between 32-bit integers,
- * so both of them can fit in 64-bit integers.
- *
- * Because mult is usually less than 10 in practice, it's very rare
- * that host_tsc * mult can overflow a 64-bit integer.
- */
-mult = ratio >> 32;
-frac = ratio & ((1ULL << 32) - 1);
-scaled_host_tsc  = host_tsc * mult;
-scaled_host_tsc += (host_tsc >> 32) * frac;
-scaled_host_tsc += ((host_tsc & ((1ULL << 32) - 1)) * frac) >> 32;
-
-return scaled_host_tsc;
-}
-
-static uint64_t svm_get_tsc_offset(uint64_t host_tsc, uint64_t guest_tsc,
-uint64_t ratio)
-{
-return guest_tsc - scale_tsc(host_tsc, ratio);

[PATCH 1/6] xen/hvm: Convert hap_capabilities into a bitfield

2024-02-05 Thread George Dunlap
hvm_function_table is an internal structure; rather than manually
|-ing and &-ing bits, just make it a boolean bitfield and let the
compiler do all the work.  This makes everything easier to read, and
presumably allows the compiler more flexibility in producing efficient
code.

No functional change intended.

Signed-off-by: George Dunlap 
---
Questions:

* Should hap_superpage_2mb really be set unconditionally, or should we
  condition it on cpu_has_svm_npt?

* Do we really need to "!!cpu_has_svm_npt"?  If so, wouldn't it be
  better to put the "!!"  in the #define, rather than requiring the
  user to know that it's needed?

CC: Jan Beulich 
CC: Andrew Cooper 
CC: "Roger Pau Monné" 
CC: Wei Liu 
CC: Jun Nakajima 
CC: Kevin Tian 
---
 xen/arch/x86/hvm/hvm.c |  8 
 xen/arch/x86/hvm/svm/svm.c |  4 ++--
 xen/arch/x86/hvm/vmx/vmcs.c|  4 ++--
 xen/arch/x86/hvm/vmx/vmx.c |  8 ++--
 xen/arch/x86/include/asm/hvm/hvm.h | 19 +++
 5 files changed, 17 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index e8deeb0222..ae9d4c4756 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -174,17 +174,17 @@ static int __init cf_check hvm_enable(void)
 {
 printk("HVM: Hardware Assisted Paging (HAP) detected\n");
 printk("HVM: HAP page sizes: 4kB");
-if ( fns->hap_capabilities & HVM_HAP_SUPERPAGE_2MB )
+if ( fns->caps.hap_superpage_2mb )
 {
 printk(", 2MB%s", opt_hap_2mb ? "" : " [disabled]");
 if ( !opt_hap_2mb )
-hvm_funcs.hap_capabilities &= ~HVM_HAP_SUPERPAGE_2MB;
+hvm_funcs.caps.hap_superpage_2mb = false;
 }
-if ( fns->hap_capabilities & HVM_HAP_SUPERPAGE_1GB )
+if ( fns->caps.hap_superpage_1gb )
 {
 printk(", 1GB%s", opt_hap_1gb ? "" : " [disabled]");
 if ( !opt_hap_1gb )
-hvm_funcs.hap_capabilities &= ~HVM_HAP_SUPERPAGE_1GB;
+hvm_funcs.caps.hap_superpage_1gb = false;
 }
 printk("\n");
 }
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 65f437e958..5741287355 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2581,8 +2581,8 @@ const struct hvm_function_table * __init start_svm(void)
 printk(" - none\n");
 
 svm_function_table.hap_supported = !!cpu_has_svm_npt;
-svm_function_table.hap_capabilities = HVM_HAP_SUPERPAGE_2MB |
-(cpu_has_page1gb ? HVM_HAP_SUPERPAGE_1GB : 0);
+svm_function_table.caps.hap_superpage_2mb = true;
+svm_function_table.caps.hap_superpage_1gb = cpu_has_page1gb;
 
 return &svm_function_table;
 }
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 4fe1213855..53f9d81aa9 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -113,8 +113,8 @@ static int cf_check parse_ept_param_runtime(const char *s)
 int val;
 
 if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
- !(hvm_funcs.hap_capabilities &
-   (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
+ !(hvm_funcs.caps.hap_superpage_2mb ||
+   hvm_funcs.caps.hap_superpage_1gb) )
 {
 printk("VMX: EPT not available, or not in use - ignoring\n");
 return 0;
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 1500dca603..9cfc0140b4 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2989,12 +2989,8 @@ const struct hvm_function_table * __init start_vmx(void)
 vmx_function_table.hap_supported = 1;
 vmx_function_table.altp2m_supported = 1;
 
-vmx_function_table.hap_capabilities = 0;
-
-if ( cpu_has_vmx_ept_2mb )
-vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_2MB;
-if ( cpu_has_vmx_ept_1gb )
-vmx_function_table.hap_capabilities |= HVM_HAP_SUPERPAGE_1GB;
+vmx_function_table.caps.hap_superpage_2mb = cpu_has_vmx_ept_2mb;
+vmx_function_table.caps.hap_superpage_1gb = cpu_has_vmx_ept_1gb;
 
 setup_ept_dump();
 }
diff --git a/xen/arch/x86/include/asm/hvm/hvm.h 
b/xen/arch/x86/include/asm/hvm/hvm.h
index 985c1c14c6..f50476f50f 100644
--- a/xen/arch/x86/include/asm/hvm/hvm.h
+++ b/xen/arch/x86/include/asm/hvm/hvm.h
@@ -61,14 +61,6 @@ enum hvm_intblk {
 #define HVM_INTR_SHADOW_SMI0x0004
 #define HVM_INTR_SHADOW_NMI0x0008
 
-/*
- * HAP super page capabilities:
- * bit0: if 2MB super page is allowed?
- * bit1: if 1GB super page is allowed?
- */
-#define HVM_HAP_SUPERPAGE_2MB   0x0001
-#define HVM_HAP_SUPERPAGE_1GB   0x0002
-
 #define HVM_EVENT_VECTOR_UNSET(-1)
 #define HVM_EVENT_VECTOR_UPDATING (-2)
 
@@ 

[PATCH 3/6] xen/hvm: Move other hvm_function_table booleans into the caps bitfield

2024-02-05 Thread George Dunlap
Moving them all together has several advantages:
 * Collects them all in one part of the struct
 * The `caps` field means that we can drop the "_supported" suffix, as it's
   clear what is meant.

Signed-off-by: George Dunlap 
---
CC: Jan Beulich 
CC: Andrew Cooper 
CC: "Roger Pau Monné" 
CC: Wei Liu 
CC: Jun Nakajima 
CC: Kevin Tian 
---
 xen/arch/x86/hvm/hvm.c |  6 +++---
 xen/arch/x86/hvm/svm/svm.c |  2 +-
 xen/arch/x86/hvm/vlapic.c  |  4 ++--
 xen/arch/x86/hvm/vmx/vmcs.c|  2 +-
 xen/arch/x86/hvm/vmx/vmx.c |  8 
 xen/arch/x86/include/asm/hvm/hvm.h | 29 ++---
 6 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index ae9d4c4756..aa2f2d054a 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -136,7 +136,7 @@ static struct notifier_block cpu_nfb = {
 
 static bool __init hap_supported(struct hvm_function_table *fns)
 {
-if ( !fns->hap_supported )
+if ( !fns->caps.hap )
 {
 printk("HVM: Hardware Assisted Paging (HAP) not detected\n");
 return false;
@@ -144,7 +144,7 @@ static bool __init hap_supported(struct hvm_function_table 
*fns)
 
 if ( !opt_hap_enabled )
 {
-fns->hap_supported = 0;
+fns->caps.hap = 0;
 printk("HVM: Hardware Assisted Paging (HAP) detected but disabled\n");
 return false;
 }
@@ -190,7 +190,7 @@ static int __init cf_check hvm_enable(void)
 }
 
 if ( !opt_altp2m_enabled )
-hvm_funcs.altp2m_supported = 0;
+hvm_funcs.caps.altp2m = 0;
 
 if ( opt_hvm_fep )
 warning_add(warning_hvm_fep);
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 40bc1ffbc6..b551eac807 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -2580,7 +2580,7 @@ const struct hvm_function_table * __init start_svm(void)
 if ( !printed )
 printk(" - none\n");
 
-svm_function_table.hap_supported = cpu_has_svm_npt;
+svm_function_table.caps.hap = cpu_has_svm_npt;
 svm_function_table.caps.hap_superpage_2mb = true;
 svm_function_table.caps.hap_superpage_1gb = cpu_has_page1gb;
 
diff --git a/xen/arch/x86/hvm/vlapic.c b/xen/arch/x86/hvm/vlapic.c
index 71a4b954b0..dcbcf4a1fe 100644
--- a/xen/arch/x86/hvm/vlapic.c
+++ b/xen/arch/x86/hvm/vlapic.c
@@ -1326,7 +1326,7 @@ int vlapic_has_pending_irq(struct vcpu *v)
 if ( irr == -1 )
 return -1;
 
-if ( hvm_funcs.virtual_intr_delivery_enabled &&
+if ( hvm_funcs.caps.virtual_intr_delivery &&
  !nestedhvm_vcpu_in_guestmode(v) )
 return irr;
 
@@ -1361,7 +1361,7 @@ int vlapic_ack_pending_irq(struct vcpu *v, int vector, 
bool force_ack)
 int isr;
 
 if ( !force_ack &&
- hvm_funcs.virtual_intr_delivery_enabled )
+ hvm_funcs.caps.virtual_intr_delivery )
 return 1;
 
 /* If there's no chance of using APIC assist then bail now. */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 53f9d81aa9..aff69d5320 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -112,7 +112,7 @@ static int cf_check parse_ept_param_runtime(const char *s)
 struct domain *d;
 int val;
 
-if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
+if ( !cpu_has_vmx_ept || !hvm_funcs.caps.hap ||
  !(hvm_funcs.caps.hap_superpage_2mb ||
hvm_funcs.caps.hap_superpage_1gb) )
 {
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 9cfc0140b4..4bcf436d2c 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2963,7 +2963,7 @@ const struct hvm_function_table * __init start_vmx(void)
 return NULL;
 }
 
-vmx_function_table.singlestep_supported = cpu_has_monitor_trap_flag;
+vmx_function_table.caps.singlestep = cpu_has_monitor_trap_flag;
 
 if ( cpu_has_vmx_dt_exiting )
 vmx_function_table.set_descriptor_access_exiting =
@@ -2986,8 +2986,8 @@ const struct hvm_function_table * __init start_vmx(void)
 printk("VMX: Disabling executable EPT superpages due to 
CVE-2018-12207\n");
 }
 
-vmx_function_table.hap_supported = 1;
-vmx_function_table.altp2m_supported = 1;
+vmx_function_table.caps.hap = 1;
+vmx_function_table.caps.altp2m = 1;
 
 vmx_function_table.caps.hap_superpage_2mb = cpu_has_vmx_ept_2mb;
 vmx_function_table.caps.hap_superpage_1gb = cpu_has_vmx_ept_1gb;
@@ -3000,7 +3000,7 @@ const struct hvm_function_table * __init start_vmx(void)
 vmx_function_table.update_eoi_exit_bitmap = vmx_update_eoi_exit_bitmap;
 vmx_function_table.process_isr = vmx_process_isr;
 vmx_function_table.handle_eoi = vmx_handle_eoi;
-vmx_function_table.virtual_intr_delivery_enabled = true;
+vmx_functio

Re: Next steps for Rust guest agent

2024-02-01 Thread George Dunlap
On Thu, Feb 1, 2024 at 4:57 PM Julien Grall  wrote:
>
> Hi George,
>
> On 01/02/2024 16:55, George Dunlap wrote:
> > On Thu, Jan 11, 2024 at 12:27 PM Yann Dirson  wrote:
> >>> - what should be the criteria to advertise it as official Xenproject
> >>> guest agent ?
> >>
> >> What do people think here?
> >
> > As we discussed at the community call, I think that we should
> > basically set a date at which we consider this the official Xen
> > Project guest agent.  Anyone who wants to have input can give it
> > before then.  Then once you guys think it's ready, we can start to
> > "market" it to the distros.
>
> +1
>
> >
> > Shall we say 29 February, 8 weeks from now?
>
> This is 4 weeks away. I am fine with that, but checking this is the date
> you intended to set.

...

I have no idea how that 8 got there... yes, 4 weeks is what I meant to write.

 -George



Re: Next steps for Rust guest agent

2024-02-01 Thread George Dunlap
On Thu, Jan 11, 2024 at 12:27 PM Yann Dirson  wrote:
> > - what should be the criteria to advertise it as official Xenproject
> > guest agent ?
>
> What do people think here?

As we discussed at the community call, I think that we should
basically set a date at which we consider this the official Xen
Project guest agent.  Anyone who wants to have input can give it
before then.  Then once you guys think it's ready, we can start to
"market" it to the distros.

Shall we say 29 February, 8 weeks from now?

 -George



Re: [PATCH v2 1/3] x86/PoD: simplify / improve p2m_pod_cache_add()

2024-02-01 Thread George Dunlap
On Tue, Jan 4, 2022 at 9:48 AM Jan Beulich  wrote:

> Avoid recurring MFN -> page or page -> MFN translations. Drop the pretty
> pointless local variable "p". Make sure the MFN logged in a debugging
> error message is actually the offending one. Return negative errno
> values rather than -1 (presently no caller really cares, but imo this
> should change). Adjust style.
>
> Signed-off-by: Jan Beulich 
>

Sorry, not sure how I managed to miss this:

Reviewed-by: George Dunlap 


Re: [PATCH v2 2/2] x86/p2m: aid the compiler in folding p2m_is_...()

2024-02-01 Thread George Dunlap
On Thu, Jun 23, 2022 at 12:54 PM Jan Beulich  wrote:

> By using | instead of || or (in the negated form) && chances increase
> for the compiler to recognize that both predicates can actually be
> folded into an expression requiring just a single branch (via OR-ing
> together the respective P2M_*_TYPES constants).
>
> Signed-off-by: Jan Beulich 
>

Sorry for the delay.  Git complains that this patch is malformed:

error: `git apply --index`: error: corrupt patch at line 28

Similar complaint from patchew when it was posted:

https://patchew.org/Xen/5d6c927e-7d7c-5754-e7eb-65d1e70f6...@suse.com/

 -George


Re: [PATCH] x86/p2m-pt: fix off by one in entry check assert

2024-01-24 Thread George Dunlap
On Wed, Jan 24, 2024 at 8:45 AM Roger Pau Monne  wrote:
>
> The MMIO RO rangeset overlap check is bogus: the rangeset is inclusive so the
> passed end mfn should be the last mfn to be mapped (not last + 1).
>
> Fixes: 6fa1755644d0 ('amd/npt/shadow: replace assert that prevents creating 
> 2M/1G MMIO entries')
> Signed-off-by: Roger Pau Monné 

Reviewed-by: George Dunlap 



Re: Sketch of an idea for handling the "mixed workload" problem

2024-01-22 Thread George Dunlap
On Mon, Jan 22, 2024 at 1:02 PM George Dunlap  wrote:
> 2. two VMs running kernbench, but not competing for vcpu
> 3. four VMs running kernbench, competing for vcpus

Sorry, this should be competing for *P*cpus

 -George


