Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Oliver Upton
Hi Thomas,

On Fri, Oct 1, 2021 at 12:59 PM Thomas Gleixner  wrote:
>
> Marcelo,
>
> On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> > On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> >> But even if that would be the case, then what prevents the stale time
> >> stamps to be visible? Nothing:
> >>
> >> T0:t = now();
> >>  -> pause
> >>  -> resume
> >>  -> magic "fixup"
> >> T1:dostuff(t);
> >
> > Yes.
> >
> > BTW, you could have a userspace notification (then applications
> > could handle this if desired).
>
> Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
> CLOCK_REALTIME. That's what applications which are sensitive to clock
> REALTIME jumps use today.
>
> >>   Now the proposed change is creating exactly the same problem:
> >>
> >>   >> > +       if (data.flags & KVM_CLOCK_REALTIME) {
> >>   >> > +               u64 now_real_ns = ktime_get_real_ns();
> >>   >> > +
> >>   >> > +               /*
> >>   >> > +                * Avoid stepping the kvmclock backwards.
> >>   >> > +                */
> >>   >> > +               if (now_real_ns > data.realtime)
> >>   >> > +                       data.clock += now_real_ns - data.realtime;
> >>   >> > +       }
> >>
> >>   IOW, it takes the time between pause and resume into account and
> >>   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >>   jump forward by exactly that amount of time.
> >
> > Well, it is assuming that the
> >
> >  T0:t = now();
> >  T1:pause vm()
> >  T2:  finish vm migration()
> >  T3:dostuff(t);
> >
> > Interval between T1 and T2 is small (and that the guest
> > clocks are synchronized up to a given boundary).
>
> Yes, I understand that, but it's an assumption and there is no boundary
> for the time jump in the proposed patches, which rings my alarm bells :)
>
> > But i suppose adding a limit to the forward clock advance
> > (in the migration case) is useful:
> >
> >   1) If migration (well actually, only the final steps
> >  to finish migration, the time between when guest is paused
> >  on source and is resumed on destination) takes too long,
> >  then too bad: fix it to be shorter if you want the clocks
> >  to have close to zero change to realtime on migration.
> >
> >   2) Avoid the other bugs in case of large forward advance.
> >
> > Maybe having it configurable, with, say, a 1 minute maximum by default
> > is a good choice?
>
> Don't know what 1 minute does in terms of applications etc. You have to
> do some experiments on that.

I debated quite a bit on what the absolute limit should be for
advancing the KVM clock, and settled on doing no checks in the kernel
besides the monotonicity invariant. End of the day, userspace can
ignore all of the rules that KVM will try to enforce on the kvm
clock/TSC and jump it as it sees fit (both are already directly
writable). But I agree that there has to be some reasoning around what is
acceptable. We have an absolute limit on how far forward we will yank
the KVM clock and TSC in our userspace, but of course it has a TOCTOU
problem for whatever madness can come in between userspace and the
time the kernel actually services the ioctl.
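
For illustration, the shape of that clamp in userspace (the helper and
the 1 minute cap are hypothetical, purely for the sake of example):

    #include <stdint.h>
    #include <time.h>

    #define NSEC_PER_SEC   1000000000ULL
    #define MAX_ADVANCE_NS (60 * NSEC_PER_SEC)  /* illustrative cap */

    /* clock/realtime are the pair returned by KVM_GET_CLOCK. */
    static uint64_t clamped_kvmclock(uint64_t clock, uint64_t realtime)
    {
            struct timespec ts;
            uint64_t now_real_ns, delta;

            clock_gettime(CLOCK_REALTIME, &ts);
            now_real_ns = (uint64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;

            if (now_real_ns <= realtime)
                    return clock;   /* never step kvmclock backwards */

            delta = now_real_ns - realtime;
            if (delta > MAX_ADVANCE_NS)
                    delta = MAX_ADVANCE_NS;

            return clock + delta;
    }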

--
Thanks,
Oliver


> > An alternative would be to advance only the guests REALTIME clock, from
> > data about how long steps T1-T2 took.
>
> Yes, that's what would happen in the cooperative S2IDLE or S3 case when
> the guest resumes.
>
> >> So now virt came along and created a hard to solve circular dependency
> >> problem:
> >>
> >>- If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
> >>  sync, but everything else is happy.
> >>
> >>- If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
> >>  loose, but NTP/PTP is happy.
> >
> > One must handle the
> >
> >  T0:t = now();
> >   -> pause
> >   -> resume
> >   -> magic "fixup"
> >  T1:dostuff(t);
> >
> > fact if one is going to use savevm/restorevm anyway, so...
> > (it is kind of unfixable, unless you modify your application
> to accept notifications to redo any computation based on t, isn't it?).
>
> Well yes, but what applications can deal with is CLOCK_REALTIME jumping
> because that's a property of it. Not so much the CLOCK_MONOTONIC part.
>
> >> If you decide that correctness is overrated, then please document it
> >> clearly instead of trying to pretend being correct.
> >
> > Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> > would be correct, right? And its probably not very hard to do.
>
> Time _is_ hard to get right.
>
> So you might experiment with something like this as a stop gap:
>
>   Provide the guest something like this:
>
>   u64  migration_seq;
>   u64  realtime_delta_ns;
>
>   in the shared clock page
>
>   Do not forward jump clock MONOTONIC.
>
>   On resume kick an IPI where the guest handler does:
>
>  if (clock_data->migration_seq == migration_seq)
> return;
>
>  migration_seq = clock_data->migration_seq;
>
>  ts64 = { 0, 0 };
>  timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
>  timekeeping_inject_sleeptime64(&ts64);

Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Oliver Upton
Marcelo,

On Fri, Oct 1, 2021 at 12:11 PM Marcelo Tosatti  wrote:
>
> On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> > On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > > [...]
> > > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > > +   time since recording the clock values.
> > >
> > > You can't advance both kvmclock (kvmclock_offset variable) and the
> > > TSCs, which would be double counting.
> > >
> > > So you have to either add the elapsed realtime (1) between
> > > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > > TSCs. If you do both, there is double counting. Am i missing
> > > something?
> >
> > Probably one of these two (but it's worth pointing out both of them):
> >
> > 1) the attribute that's introduced here *replaces*
> > KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> >
> > 2) the adjustment formula later in the algorithm does not care about how
> > much time passed between step 1 and step 4.  It just takes two well
> > known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> > the same on the destination as if the guest was still running on the
> > source.  It is irrelevant that one of them is before migration and one
> > is after; all that matters is that one is on the source and one is on the
> > destination.
>
> OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay
> which is introduced during migration (which is what i would guess is
> the lower hanging fruit) (for guests using TSC).

The series gives userspace the ability to modify the guest's
perception of the TSC in whatever way it sees fit. The algorithm in
the documentation provides a suggestion to userspace on how to do
exactly that. I kept that advancement logic out of the kernel because
IMO it is an implementation detail: users have differing opinions on
how clocks should behave across a migration and KVM shouldn't have any
baked-in rules around it.

At the same time, userspace can choose to _not_ jump the TSC and use
the available interfaces to just migrate the existing state of the
TSCs.

When I had initially proposed this series upstream, Paolo astutely
pointed out that there was no good way to get a (CLOCK_REALTIME, TSC)
pairing, which is critical for the TSC advancement algorithm in the
documentation. Google's best way to get (CLOCK_REALTIME, TSC) exists
in userspace [1], hence the missing kvm clock changes. So, in all, the
spirit of the KVM clock changes is to provide missing UAPI around the
clock/TSC, with the side effect of changing the guest-visible value.

[1] https://cloud.google.com/spanner/docs/true-time-external-consistency

> My point was that, by advancing the _TSC value_ by:
>
> T0. stop guest vcpus    (source)
> T1. KVM_GET_CLOCK       (source)
> T2. KVM_SET_CLOCK       (destination)
> T3. Write guest TSCs    (destination)
> T4. resume guest        (destination)
>
> new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
>
> t_0:    host TSC at KVM_GET_CLOCK time.
> off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
>         TSC offset is fixed).
> ...
>
> +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> +   structure. KVM will advance the VM's kvmclock to account for elapsed
> +   time since recording the clock values.
>
> Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
> (hopefully modern guests on modern hosts will use TSC clocksource,
> whose clock_gettime is faster... some people are using that already).
>

Hopefully the above explanation made it clearer how the TSCs are
supposed to get advanced, and why it isn't done in the kernel.

> At some point QEMU should enable invariant TSC flag by default?
>
> That said, the point is: why not advance the _TSC_ values
> (instead of kvmclock nanoseconds), as doing so would reduce
> the "the CLOCK_REALTIME delay which is introduced during migration"
> for both kvmclock users and modern tsc clocksource users.
>
> So yes, i also like this patchset, but would like it even more
> if it fixed the case above as well (and not sure whether adding
> the migration delta to KVMCLOCK makes it harder to fix TSC case
> later).
>
> > Perhaps we can add to step 6 something like:
> >
> > > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > > +   elapsed since recording state and (2) difference in TSCs between the
> > > +   source and destination machine:
> > > +
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > > +

Re: [PATCH v2 03/11] KVM: arm64: Encapsulate reset request logic in a helper function

2021-10-01 Thread Oliver Upton
On Thu, Sep 30, 2021 at 11:05 PM Reiji Watanabe  wrote:
>
> On Thu, Sep 23, 2021 at 12:16 PM Oliver Upton  wrote:
> >
> > In its implementation of the PSCI function, KVM needs to request that a
> > target vCPU resets before its next entry into the guest. Wrap the logic
> > for requesting a reset in a function for later use by other implemented
> > PSCI calls.
> >
> > No functional change intended.
> >
> > Signed-off-by: Oliver Upton 
> > ---
> >  arch/arm64/kvm/psci.c | 59 +--
> >  1 file changed, 35 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> > index 310b9cb2b32b..bb59b692998b 100644
> > --- a/arch/arm64/kvm/psci.c
> > +++ b/arch/arm64/kvm/psci.c
> > @@ -64,9 +64,40 @@ static inline bool kvm_psci_valid_affinity(unsigned long 
> > affinity)
> > return !(affinity & ~MPIDR_HWID_BITMASK);
> >  }
> >
> > -static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
> > +static void kvm_psci_vcpu_request_reset(struct kvm_vcpu *vcpu,
> > +   unsigned long entry_addr,
> > +   unsigned long context_id,
> > +   bool big_endian)
> >  {
> > struct vcpu_reset_state *reset_state;
> > +
> > +   lockdep_assert_held(&vcpu->kvm->lock);
> > +
> > +   reset_state = &vcpu->arch.reset_state;
> > +   reset_state->pc = entry_addr;
> > +
> > +   /* Propagate caller endianness */
> > +   reset_state->be = big_endian;
> > +
> > +   /*
> > +* NOTE: We always update r0 (or x0) because for PSCI v0.1
> > +* the general purpose registers are undefined upon CPU_ON.
> > +*/
> > +   reset_state->r0 = context_id;
> > +
> > +   WRITE_ONCE(reset_state->reset, true);
> > +   kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> > +
> > +   /*
> > +* Make sure the reset request is observed if the change to
> > +* power_state is observed.
> > +*/
> > +   smp_wmb();
> > +   vcpu->arch.power_off = false;
> > +}
> > +
> > +static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
> > +{
> > struct kvm *kvm = source_vcpu->kvm;
> > struct kvm_vcpu *vcpu = NULL;
> > unsigned long cpu_id;
> > @@ -90,29 +121,9 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu 
> > *source_vcpu)
> > return PSCI_RET_INVALID_PARAMS;
> > }
> >
> > -   reset_state = &vcpu->arch.reset_state;
> > -
> > -   reset_state->pc = smccc_get_arg2(source_vcpu);
> > -
> > -   /* Propagate caller endianness */
> > -   reset_state->be = kvm_vcpu_is_be(source_vcpu);
> > -
> > -   /*
> > -* NOTE: We always update r0 (or x0) because for PSCI v0.1
> > -* the general purpose registers are undefined upon CPU_ON.
> > -*/
> > -   reset_state->r0 = smccc_get_arg3(source_vcpu);
> > -
> > -   WRITE_ONCE(reset_state->reset, true);
> > -   kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> > -
> > -   /*
> > -* Make sure the reset request is observed if the change to
> > -* power_state is observed.
> > -*/
> > -   smp_wmb();
> > -
> > -   vcpu->arch.power_off = false;
> > +   kvm_psci_vcpu_request_reset(vcpu, smccc_get_arg2(source_vcpu),
> > +   smccc_get_arg3(source_vcpu),
> > +   kvm_vcpu_is_be(source_vcpu));
> > kvm_vcpu_wake_up(vcpu);
> >
> > return PSCI_RET_SUCCESS;
> > --
> > 2.33.0.685.g46640cef36-goog
>
> Reviewed-by: Reiji Watanabe 
>
> Not directly related to the patch, but the (original) code doesn't
> do any sanity checking for the entry address although the PSCI spec says:
>
> "INVALID_ADDRESS is returned when the entry point address is known
> by the implementation to be invalid, because it is in a range that
> is known not to be available to the caller."

Right, I had noticed the same but was a tad too lazy to address in
this series :) Thanks for the review, Reiji!

--
Best,
Oliver


Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Thomas Gleixner
Marcelo,

On Fri, Oct 01 2021 at 09:05, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
>> But even if that would be the case, then what prevents the stale time
>> stamps to be visible? Nothing:
>> 
>> T0:t = now();
>>  -> pause
>>  -> resume
>>  -> magic "fixup"
>> T1:dostuff(t);
>
> Yes.
>
> BTW, you could have a userspace notification (then applications 
> could handle this if desired).

Well, we have that via timerfd with TFD_TIMER_CANCEL_ON_SET for
CLOCK_REALTIME. That's what applications which are sensitive to clock
REALTIME jumps use today.
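
For reference, that pattern looks roughly like this in userspace
(illustrative sketch, error handling trimmed):

    #include <sys/timerfd.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <errno.h>

    /* Returns 1 when CLOCK_REALTIME was stepped, 0 on normal expiry. */
    static int wait_for_realtime_jump(void)
    {
            struct itimerspec its = {
                    /* Far-future absolute expiry; only the cancellation
                     * on a clock step is of interest here. */
                    .it_value.tv_sec = 0x7fffffff,
            };
            uint64_t expirations;
            int fd = timerfd_create(CLOCK_REALTIME, 0);
            int jumped = 0;

            timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
                            &its, NULL);

            /* read() fails with ECANCELED when CLOCK_REALTIME is set. */
            if (read(fd, &expirations, sizeof(expirations)) < 0 &&
                errno == ECANCELED)
                    jumped = 1;

            close(fd);
            return jumped;
    }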

>>   Now the proposed change is creating exactly the same problem:
>> 
>>   >> > +       if (data.flags & KVM_CLOCK_REALTIME) {
>>   >> > +               u64 now_real_ns = ktime_get_real_ns();
>>   >> > +
>>   >> > +               /*
>>   >> > +                * Avoid stepping the kvmclock backwards.
>>   >> > +                */
>>   >> > +               if (now_real_ns > data.realtime)
>>   >> > +                       data.clock += now_real_ns - data.realtime;
>>   >> > +       }
>> 
>>   IOW, it takes the time between pause and resume into account and
>>   forwards the underlying base clock which makes CLOCK_MONOTONIC
>>   jump forward by exactly that amount of time.
>
> Well, it is assuming that the
>
>  T0:t = now();
>  T1:pause vm()
>  T2:  finish vm migration()
>  T3:dostuff(t);
>
> Interval between T1 and T2 is small (and that the guest
> clocks are synchronized up to a given boundary).

Yes, I understand that, but it's an assumption and there is no boundary
for the time jump in the proposed patches, which rings my alarm bells :)

> But i suppose adding a limit to the forward clock advance 
> (in the migration case) is useful:
>
>   1) If migration (well actually, only the final steps
>  to finish migration, the time between when guest is paused
>  on source and is resumed on destination) takes too long,
>  then too bad: fix it to be shorter if you want the clocks
>  to have close to zero change to realtime on migration.
>
>   2) Avoid the other bugs in case of large forward advance.
>
> Maybe having it configurable, with, say, a 1 minute maximum by default
> is a good choice?

Don't know what 1 minute does in terms of applications etc. You have to
do some experiments on that.

> An alternative would be to advance only the guests REALTIME clock, from 
> data about how long steps T1-T2 took.

Yes, that's what would happen in the cooperative S2IDLE or S3 case when
the guest resumes.

>> So now virt came along and created a hard to solve circular dependency
>> problem:
>> 
>>- If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
>>  sync, but everything else is happy.
>>  
>>- If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
>>  loose, but NTP/PTP is happy.
>
> One must handle the
>
>  T0:t = now();
>   -> pause
>   -> resume
>   -> magic "fixup"
>  T1:dostuff(t);
>
> fact if one is going to use savevm/restorevm anyway, so...
> (it is kind of unfixable, unless you modify your application
> to accept notifications to redo any computation based on t, isn't it?).

Well yes, but what applications can deal with is CLOCK_REALTIME jumping
because that's a property of it. Not so much the CLOCK_MONOTONIC part.

>> If you decide that correctness is overrated, then please document it
>> clearly instead of trying to pretend being correct.
>
> Based on the above, advancing only CLOCK_REALTIME (and not CLOCK_MONOTONIC)
> would be correct, right? And its probably not very hard to do.

Time _is_ hard to get right. 

So you might experiment with something like this as a stop gap:

  Provide the guest something like this:

  u64  migration_seq;
  u64  realtime_delta_ns;

  in the shared clock page

  Do not forward jump clock MONOTONIC.

  On resume kick an IPI where the guest handler does:

 if (clock_data->migration_seq == migration_seq)
return;

 migration_seq = clock_data->migration_seq;

 ts64 = { 0, 0 };
 timespec64_add_ns(&ts64, clock_data->realtime_delta_ns);
 timekeeping_inject_sleeptime64(&ts64);

  Make sure that the IPI completes before you migrate the guest another
  time or implement it slightly smarter, but you get the idea :)

That's what we use for suspend time injection, but it should just work
without frozen tasks as well. It will forward clock REALTIME by the
amount of time spent during migration. It'll also modify the BOOTTIME
offset by the same amount, but that's not a tragedy.

The function will also reset NTP state so the NTP/PTP daemon knows that
there was a kernel initiated time jump and it can work out easily what
to do like it does on resume from an actual suspend. It will also
invoke clock_was_set() which makes all the other time related updates
trigger and wakeup tasks w

Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Oliver Upton
On Fri, Oct 1, 2021 at 7:41 AM Paolo Bonzini  wrote:
>
> On 01/10/21 16:39, Paolo Bonzini wrote:
> > On 16/09/21 20:15, Oliver Upton wrote:
> >> +       if (data.flags & ~KVM_CLOCK_REALTIME)
> >>                return -EINVAL;
> >
> > Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there
> > may be programs that expect to send back to KVM_SET_CLOCK whatever they
> > got from KVM_GET_CLOCK.
>
> Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no
> need to do that!

Yeah, I don't know the story on the interface but it is really odd
that userspace needs to blow away flags to successfully write the
clock structure.
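
Concretely, a VMM today ends up doing something like this (sketch;
KVM_CLOCK_REALTIME is the flag this series adds):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>

    static int migrate_clock(int src_vm_fd, int dst_vm_fd)
    {
            struct kvm_clock_data data;

            if (ioctl(src_vm_fd, KVM_GET_CLOCK, &data) < 0)
                    return -1;

            /* KVM_SET_CLOCK rejects read-only flags such as
             * KVM_CLOCK_TSC_STABLE, so mask off everything that is
             * not valid input. */
            data.flags &= KVM_CLOCK_REALTIME;

            return ioctl(dst_vm_fd, KVM_SET_CLOCK, &data);
    }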

--
Thanks,
Oliver


Re: KVM/arm64: Guest ABI changes do not appear rollback-safe

2021-10-01 Thread Oliver Upton
On Fri, Oct 1, 2021 at 4:43 AM Marc Zyngier  wrote:
>
> On Thu, 30 Sep 2021 18:24:23 +0100,
> Oliver Upton  wrote:
> >
> > Hey Marc,
> >
> > On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier  wrote:
> > >
> > > Hi Oliver,
> > >
> > > On Wed, 29 Sep 2021 19:22:05 +0100,
> > > Oliver Upton  wrote:
> > > >
> > > > I have some lingering thoughts on this subject since we last spoke and
> > > > wanted to discuss.
> > > >
> > > > I'm having a hard time figuring out how a VMM should handle a new
> > > > hypercall identity register introduced on a newer kernel. In order to
> > > > maintain guest ABI, the VMM would need to know about that register and
> > > > zero it when restoring an older guest.
> > >
> > > Just as it would need to be able to discover any new system register
> > > exposed by default, as it happens at all times. Which is why we have a
> > > way to discover all the registers, architected or not.
> > >
> > > > Perhaps instead we could reserve a range of firmware registers as the
> > > > 'hypercall identity' registers. Implement all of them as RAZ/WI by
> > > > default, encouraging userspace to zero these registers away for older
> > > > VMs but still allowing an old userspace to pick up new KVM features.
> > > > Doing so would align the hypercall identity registers with the feature
> > > > ID registers from the architecture.
> > >
> > > The range already exists in the form of the "coprocessor" 0x14. I
> > > don't see the need to expose it as RAZ/WI, however. If userspace
> > > doesn't know about how this pseudo-register works, it won't be able to
> > > program it anyway.
> > >
> > > I don't buy the parallel with the ID-regs either. They are RAZ/WI by
> > > default so that they don't UNDEF at runtime. The meaning of a RAZ
> > > id-register is also well defined (feature not implemented), and the
> > > CPU cannot write to them. In a way, the ID-regs *are* the enumeration
> > > mechanism.
> > >
> > > Our firmware registers don't follow the same rules. Userspace can
> > > write to them, and there is no such "not implemented" rule (case in
> > > point, PSCI). We also have a separate enumeration mechanism
> > > (GET_ONE_REG), which is (more or less) designed for userspace to find
> > > what is implemented.
> > >
> > > For these reasons, I don't immediately see the point of advertising a
> > > set of registers ahead of time, before userspace grows an
> > > understanding of what these registers mean.
> >
> > Supposing we don't preallocate some hypercall ID registers, how can we
> > safely migrate a guest from an older kernel to newer kernel? Ideally,
> > we would preserve the hypercall feature set across the migration which
> > could work for a while with the first set of registers that get
> > defined, but whenever a new hypercall firmware register comes along
> > then the VMM will be clueless to the new ABI.
>
> My expectations were that whenever userspace writes a set of firmware
> register, all the defaults are invalidated. For example say that
> host-A know about a single hypercall register, while host-B knows
> about two. Upon writing to the first register, the host would clear
> the set of available services in the second one. If userspace
> eventually writes there, the value would stick if valid.
>
> Also, remember that pseudo-registers don't have to be 64bit. We could
> define a new class of hypercall-specific registers that would be much
> wider, and thus have a larger write granularity (KVM supports anything
> from 8 to 2048 bits). This would make it pretty easy to implement the
> above.
>
> > Fundamentally, I don't think userspace should need a patch to preserve
> > ABI on a newer kernel. Despite that, it would seem that userspace will
> > need to learn of any firmware registers that control hypercall
> > features which come after the initial set that gets proposed. If
> > KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall,
> > architectural feature ID registers) from other parts of the vCPU
> > state, it would be clear to what registers to zero on a newer kernel.
> > Apologies if it is distracting to mention the feature ID registers
> > here, but both are on my mind currently and want to make sure there is
> > some consistency in how features get handled on newer kernels,
> > architected or not.
>
> The problem I see is that we will always need to grow the HC register
> space one way or another, no matter how many we reserve. Your approach
> only works if we don't exceed that particular range. Maybe it will
> never be a problem, but it remains that this is not scalable.
>
> If we wanted to be safe, we'd reserve the whole of the possible space
> as defined by the SMCCC spec. Given that we currently have two HC
> spaces (the ARM-controlled one, and the KVM-specific one), the
> function space being 16bits in each case, that's 16kB worth of zeroes
> that userspace has to save/restore at all times... I'm not overly
> enthusiastic.

Yeah, that's definitely overkill.

I agree with your po

Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 05:12:20PM +0200, Paolo Bonzini wrote:
> On 01/10/21 12:32, Marcelo Tosatti wrote:
> > > +1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
> > > +   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
> > > [...]
> > > +4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
> > > +   (k_0) and realtime nanoseconds (r_0) in their respective fields.
> > > +   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
> > > +   structure. KVM will advance the VM's kvmclock to account for elapsed
> > > +   time since recording the clock values.
> > 
> > You can't advance both kvmclock (kvmclock_offset variable) and the
> > TSCs, which would be double counting.
> > 
> > So you have to either add the elapsed realtime (1) between
> > KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
> > TSCs. If you do both, there is double counting. Am i missing
> > something?
> 
> Probably one of these two (but it's worth pointing out both of them):
> 
> 1) the attribute that's introduced here *replaces*
> KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.
> 
> 2) the adjustment formula later in the algorithm does not care about how
> much time passed between step 1 and step 4.  It just takes two well
> known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
> the same on the destination as if the guest was still running on the
> source.  It is irrelevant that one of them is before migration and one
> is after; all that matters is that one is on the source and one is on the
> destination.

OK, so it still relies on NTPd daemon to fix the CLOCK_REALTIME delay 
which is introduced during migration (which is what i would guess is
the lower hanging fruit) (for guests using TSC).

My point was that, by advancing the _TSC value_ by:

T0. stop guest vcpus    (source)
T1. KVM_GET_CLOCK       (source)
T2. KVM_SET_CLOCK       (destination)
T3. Write guest TSCs    (destination)
T4. resume guest        (destination)

new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1

t_0:    host TSC at KVM_GET_CLOCK time.
off_n:  TSC offset at vcpu-n (as long as no guest TSC writes are performed,
        TSC offset is fixed).
...

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

Only kvmclock is advanced (by passing r_0). But a guest might not use kvmclock
(hopefully modern guests on modern hosts will use TSC clocksource,
whose clock_gettime is faster... some people are using that already).

At some point QEMU should enable invariant TSC flag by default?

That said, the point is: why not advance the _TSC_ values
(instead of kvmclock nanoseconds), as doing so would reduce 
the "the CLOCK_REALTIME delay which is introduced during migration"
for both kvmclock users and modern tsc clocksource users.

So yes, i also like this patchset, but would like it even more
if it fixed the case above as well (and not sure whether adding
the migration delta to KVMCLOCK makes it harder to fix TSC case
later).

> Perhaps we can add to step 6 something like:
> 
> > +6. Adjust the guest TSC offsets for every vCPU to account for (1) time
> > +   elapsed since recording state and (2) difference in TSCs between the
> > +   source and destination machine:
> > +
> > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > +
> 
> "off + t - k * freq" is the guest TSC value corresponding to a time of 0
> in kvmclock.  The above formula ensures that it is the same on the
> destination as it was on the source.
> 
> Also, the names are a bit hard to follow.  Perhaps
> 
>   t_0         tsc_src
>   t_1         tsc_dest
>   k_0         guest_src
>   k_1         guest_dest
>   r_0         host_src
>   off_n       ofs_src[i]
>   new_off_n   ofs_dest[i]
> 
> Paolo
> 
> 



[PATCH v2] KVM: arm64: Allow KVM to be disabled from the command line

2021-10-01 Thread Marc Zyngier
Although KVM can be compiled out of the kernel, it cannot be disabled
at runtime. Allow this possibility by introducing a new mode that
will prevent KVM from initialising.

This is useful in the (limited) circumstances where you don't want
KVM to be available (what is wrong with you?), or when you want
to install another hypervisor instead (good luck with that).

Reviewed-by: David Brazdil 
Acked-by: Will Deacon 
Acked-by: Suzuki K Poulose 
Signed-off-by: Marc Zyngier 
---

Notes:
v2: Dropped the id_aa64mmfr1_vh=0 setting so that KVM can be disabled
and yet stay in VHE mode on platforms that require it.
I kept the AB/RB's, but please shout if you disagree!

 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 arch/arm64/include/asm/kvm_host.h   |  1 +
 arch/arm64/kvm/arm.c| 14 +-
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 91ba391f9b32..f268731a3d4d 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2365,6 +2365,8 @@
kvm-arm.mode=
[KVM,ARM] Select one of KVM/arm64's modes of operation.
 
+   none: Forcefully disable KVM.
+
nvhe: Standard nVHE-based mode, without support for
  protected guests.
 
diff --git a/arch/arm64/include/asm/kvm_host.h 
b/arch/arm64/include/asm/kvm_host.h
index f8be56d5342b..019490c67976 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -58,6 +58,7 @@
 enum kvm_mode {
KVM_MODE_DEFAULT,
KVM_MODE_PROTECTED,
+   KVM_MODE_NONE,
 };
 enum kvm_mode kvm_get_mode(void);
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index fe102cd2e518..658171231af9 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -2064,6 +2064,11 @@ int kvm_arch_init(void *opaque)
return -ENODEV;
}
 
+   if (kvm_get_mode() == KVM_MODE_NONE) {
+   kvm_info("KVM disabled from command line\n");
+   return -ENODEV;
+   }
+
in_hyp_mode = is_kernel_in_hyp_mode();
 
if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
@@ -2137,8 +2142,15 @@ static int __init early_kvm_mode_cfg(char *arg)
return 0;
}
 
-   if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode()))
+   if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {
+   kvm_mode = KVM_MODE_DEFAULT;
return 0;
+   }
+
+   if (strcmp(arg, "none") == 0) {
+   kvm_mode = KVM_MODE_NONE;
+   return 0;
+   }
 
return -EINVAL;
 }
-- 
2.30.2
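
Usage is then just a matter of the command line, e.g.:

    # Append to the kernel command line:
    kvm-arm.mode=none

    # KVM refuses to initialise and the boot log carries the
    # "KVM disabled from command line" message added in the hunk above.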



Re: [PATCH v8 5/7] kvm: x86: protect masterclock with a seqcount

2021-10-01 Thread Paolo Bonzini

On 16/09/21 20:15, Oliver Upton wrote:

+       seq = read_seqcount_begin(&ka->pvclock_sc);
+       do {
+               use_master_clock = ka->use_master_clock;


Oops, the "seq" assignment should be inside the "do".  With that fixed, 
my tests seem to work.  I will shortly push the result to kvm/queue.
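
That is, the read side retries as a unit, along these lines
(illustrative fragment, not the exact queued diff):

    do {
            seq = read_seqcount_begin(&ka->pvclock_sc);
            use_master_clock = ka->use_master_clock;
            /* ... snapshot the rest of the masterclock state ... */
    } while (read_seqcount_retry(&ka->pvclock_sc, seq));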


Paolo



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Paolo Bonzini

On 01/10/21 17:39, Oliver Upton wrote:

Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no
need to do that!


Yeah, I don't know the story on the interface but it is really odd
that userspace needs to blow away flags to successfully write the
clock structure.


Yeah, let's fix it now and accept all three flags. I would like that,
even though it cannot be fixed in existing kernels.


Paolo



Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Paolo Bonzini

On 01/10/21 12:32, Marcelo Tosatti wrote:

+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (t_0),
+   kvmclock nanoseconds (k_0), and realtime nanoseconds (r_0).
[...]
+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.


You can't advance both kvmclock (kvmclock_offset variable) and the
TSCs, which would be double counting.

So you have to either add the elapsed realtime (1) between
KVM_GET_CLOCK to kvmclock (which this patch is doing), or to the
TSCs. If you do both, there is double counting. Am i missing
something?


Probably one of these two (but it's worth pointing out both of them):

1) the attribute that's introduced here *replaces*
KVM_SET_MSR(MSR_IA32_TSC), so the TSC is not added.

2) the adjustment formula later in the algorithm does not care about how
much time passed between step 1 and step 4.  It just takes two well
known (TSC, kvmclock) pairs, and uses them to ensure the guest TSC is
the same on the destination as if the guest was still running on the
source.  It is irrelevant that one of them is before migration and one
is after; all that matters is that one is on the source and one is on the
destination.

Perhaps we can add to step 6 something like:


+6. Adjust the guest TSC offsets for every vCPU to account for (1) time
+   elapsed since recording state and (2) difference in TSCs between the
+   source and destination machine:
+
+   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
+


"off + t - k * freq" is the guest TSC value corresponding to a time of 0
in kvmclock.  The above formula ensures that it is the same on the
destination as it was on the source.

Also, the names are a bit hard to follow.  Perhaps

t_0         tsc_src
t_1         tsc_dest
k_0         guest_src
k_1         guest_dest
r_0         host_src
off_n       ofs_src[i]
new_off_n   ofs_dest[i]
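
With those names, step 6 reduces to something like this (sketch; freq
is the guest TSC frequency in ticks per kvmclock nanosecond, shown as a
double only for brevity):

    #include <stdint.h>

    /* The guest TSC at kvmclock time 0, i.e. ofs + tsc - guest * freq,
     * must read the same on source and destination. */
    static uint64_t ofs_dest(uint64_t tsc_src, uint64_t tsc_dest,
                             uint64_t guest_src, uint64_t guest_dest,
                             uint64_t ofs_src, double freq)
    {
            return tsc_src + ofs_src
                   + (uint64_t)((guest_dest - guest_src) * freq)
                   - tsc_dest;
    }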

Paolo



Re: [PATCH] KVM: arm64: Allow KVM to be disabled from the command line

2021-10-01 Thread Marc Zyngier
On Fri, 01 Oct 2021 10:27:18 +0100,
Suzuki K Poulose  wrote:
> 
> On 30/09/2021 11:29, Marc Zyngier wrote:
> > On Wed, 29 Sep 2021 11:35:46 +0100,
> > Suzuki K Poulose  wrote:
> >> 
> >>> + if (strcmp(arg, "none") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {
> >> 
> >> nit: Does this really need to WARN here? Unlike the "nvhe" case, if the
> >> user wants to keep KVM out of the picture for, say, debugging
> >> something, it is perfectly OK to allow the kernel to be running at EL2
> >> without having to change the firmware to alter the landing EL for the
> >> kernel?
> > 
> > Well, the doc says "run in nVHE mode" and the option forces
> > id_aa64mmfr1.vh=0. The WARN_ON() will only fires on broken^Wfruity HW
> > that is VHE only. Note that this doesn't rely on any firmware change
> > (we drop from EL2 to EL1 and stay there).
> 
> Ah, ok. So the "none" is in fact "nvhe + no-kvm". That's the bit I
> missed. TBH, that name to me sounds like "no KVM" at all, which is what
> we want. The question is, do we really need "none" to force vh == 0 ? I
> understand this is only a problem on a rare set of HWs. But the generic
> option looks deceiving.
> 
> That said, I am happy to leave this as is and the doc says so.

I think you have a point here. Conflating the two things is a bit odd,
and we might as well let the user pick the configuration they want
(they can always pass 'id_aa64mmfr1.vh=0' themselves).

I'll respin the patch with this change.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Paolo Bonzini

On 01/10/21 16:39, Paolo Bonzini wrote:

On 16/09/21 20:15, Oliver Upton wrote:

+       if (data.flags & ~KVM_CLOCK_REALTIME)
                return -EINVAL;


Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there 
may be programs that expect to send back to KVM_SET_CLOCK whatever they 
got from KVM_GET_CLOCK.


Nevermind, KVM_SET_CLOCK is already rejecting KVM_CLOCK_TSC_STABLE so no 
need to do that!


Paolo



Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Paolo Bonzini

On 16/09/21 20:15, Oliver Upton wrote:

+       if (data.flags & ~KVM_CLOCK_REALTIME)
                return -EINVAL;


Let's accept KVM_CLOCK_HOST_TSC here even though it's not used; there 
may be programs that expect to send back to KVM_SET_CLOCK whatever they 
got from KVM_GET_CLOCK.




Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Paolo Bonzini

On 01/10/21 01:02, Thomas Gleixner wrote:


  Now the proposed change is creating exactly the same problem:

  +       if (data.flags & KVM_CLOCK_REALTIME) {
  +               u64 now_real_ns = ktime_get_real_ns();
  +
  +               /*
  +                * Avoid stepping the kvmclock backwards.
  +                */
  +               if (now_real_ns > data.realtime)
  +                       data.clock += now_real_ns - data.realtime;
  +       }


Indeed, though it's opt-in (you can always not pass KVM_CLOCK_REALTIME 
and then the kernel will not muck with the value you gave it).



virt came along and created a hard to solve circular dependency
problem:

   - If CLOCK_MONOTONIC stops for too long then NTP/PTP gets out of
 sync, but everything else is happy.
 
   - If CLOCK_MONOTONIC jumps too far forward, then all hell breaks
 loose, but NTP/PTP is happy.


Yes, I agree that this sums it up.

For example QEMU (meaning: Marcelo :)) has gone for the former and 
"hoping" that NTP/PTP sorts it out sooner or later.  The clock in 
nanoseconds is sent out to the destination and restored.


Google's userspace instead went for the latter.  The reason is that 
they've always started running on the destination before finishing the 
memory copy[1], therefore it's much easier to bound the CLOCK_MONOTONIC 
jump.


I do like very much the cooperative S2IDLE or even S3 way to handle the 
brownout during live migration.  However if your stopping time is 
bounded, these patches are nice because, on current processors that have 
TSC scaling, they make it possible to keep the illusion of the TSC 
running.  Of course that's a big "if"; however, you can always bound the 
stopping time by aborting the restart on the destination machine once 
you get close enough to the limit.


Paolo

[1] see https://dl.acm.org/doi/pdf/10.1145/3296975.3186415, figure 3



Re: [PATCH v2 06/11] KVM: arm64: Add support for SYSTEM_SUSPEND PSCI call

2021-10-01 Thread Marc Zyngier
On Thu, 30 Sep 2021 18:40:43 +0100,
Oliver Upton  wrote:
> 
> On Thu, Sep 30, 2021 at 5:29 AM Marc Zyngier  wrote:
> >
> > I think there is an additional feature we need, which is to give
> > control back to userspace every time a wake-up condition occurs (I did
> > touch on that in [1]). This would give the opportunity to the VMM to
> > do whatever it needs to perform.
> >
> > A typical use case would be to implement wake-up from certain
> > interrupts only (mask non-wake-up IRQs on suspend request, return to
> > the guest for WFI, wake-up returns to the VMM to restore the previous
> > interrupt configuration, resume).
> >
> > Userspace would be free to clear the suspend state and resume the
> > guest, or to reenter WFI.
> 
> Yeah, this is definitely needed for the series.
> 
> Just to make sure it's clear, what should the default behavior be if
> userspace doesn't touch state at all and it calls KVM_RUN again? It
> would seem to me that we should resume the guest by default, much like
> we would do for the SUSPEND event exit.

My mental model is that the suspend state is "sticky". Once set, it
stays and the guest doesn't execute any instruction until cleared by
the VMM.

It has the advantage that the VMM always knows the state the vcpu is
in, because that's what it asked for, and the vcpu can't change state
on its own.

It means that the VMM would have to set the vcpu state to 'RUNNABLE'
once it is satisfied that it can be run.
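
In VMM terms that would look something like this (sketch, assuming the
suspend state rides on the existing MP_STATE interface):

    #include <linux/kvm.h>
    #include <sys/ioctl.h>
    #include <err.h>

    static void resume_vcpu(int vcpu_fd)
    {
            struct kvm_mp_state mp = { .mp_state = KVM_MP_STATE_RUNNABLE };

            /* The vCPU stays suspended until userspace clears the
             * state; re-entering KVM_RUN alone is not enough. */
            if (ioctl(vcpu_fd, KVM_SET_MP_STATE, &mp) < 0)
                    err(1, "KVM_SET_MP_STATE");
            if (ioctl(vcpu_fd, KVM_RUN, 0) < 0)
                    err(1, "KVM_RUN");
    }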

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v2 05/11] KVM: arm64: Defer WFI emulation as a requested event

2021-10-01 Thread Marc Zyngier
On Thu, 30 Sep 2021 18:09:07 +0100,
Sean Christopherson  wrote:
> 
> On Thu, Sep 30, 2021, Marc Zyngier wrote:
> > On Thu, 23 Sep 2021 20:16:04 +0100, Oliver Upton  wrote:
> > > @@ -681,6 +687,9 @@ static void check_vcpu_requests(struct kvm_vcpu *vcpu)
> > >   if (kvm_check_request(KVM_REQ_SLEEP, vcpu))
> > >   kvm_vcpu_sleep(vcpu);
> > >  
> > > + if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> > > + kvm_vcpu_suspend(vcpu);
> > > +
> 
> ...
> 
> > > diff --git a/arch/arm64/kvm/handle_exit.c b/arch/arm64/kvm/handle_exit.c
> > > index 275a27368a04..5e5ef9ff4fba 100644
> > > --- a/arch/arm64/kvm/handle_exit.c
> > > +++ b/arch/arm64/kvm/handle_exit.c
> > > @@ -95,8 +95,7 @@ static int kvm_handle_wfx(struct kvm_vcpu *vcpu)
> > >   } else {
> > >   trace_kvm_wfx_arm64(*vcpu_pc(vcpu), false);
> > >   vcpu->stat.wfi_exit_stat++;
> > > - kvm_vcpu_block(vcpu);
> > > - kvm_clear_request(KVM_REQ_UNHALT, vcpu);
> > > + kvm_make_request(KVM_REQ_SUSPEND, vcpu);
> > >   }
> > >  
> > >   kvm_incr_pc(vcpu);
> > 
> > This is a change in behaviour. At the point where the blocking
> > happens, PC will have already been incremented. I'd rather you don't
> > do that. Instead, make the helper available and call into it directly,
> > preserving the current semantics.
> 
> Is there architectural behavior that KVM can emulate?  E.g. if you
> were to probe a physical CPU while it's waiting, would you observe
> the pre-WFI PC, or the post-WFI PC?  Following arch behavior would
> be ideal because it eliminates subjectivity.

The architecture doesn't really say (that's one of the many IMPDEF
behaviours). However, there are at least some extensions (such as
statistical profiling) that do require the PC to be accurately
recorded together with the effects of the instructions at that PC.

If an implementation was to pre-increment the PC, the corresponding
trace would be inaccurate.

> Regardless of the architectural behavior, changing KVM's behavior
> should be done explicitly in a separate patch.
>
> Irrespective of PC behavior, I would caution against using a request
> for handling WFI.  Deferring the WFI opens up the possibility for
> all sorts of ordering oddities, e.g. if KVM exits to userspace
> between here and check_vcpu_requests(), then KVM can end up with a
> "spurious" pending KVM_REQ_SUSPEND if maniupaltes vCPU state.  I
> highly doubt that userspace VMMs would actually do that, as it would
> basically require a signal from userspace, but it's not impossible,
> and at the very least the pending request is yet another thing to
> worry about in the future.

+1. It really isn't worth the complexity.

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v3 02/10] KVM: arm64: vgic-v3: Check redist region is not above the VM IPA size

2021-10-01 Thread Marc Zyngier
On Tue, 28 Sep 2021 19:47:56 +0100,
Ricardo Koller  wrote:
> 
> Verify that the redistributor regions do not extend beyond the
> VM-specified IPA range (phys_size). This can happen when using
> KVM_VGIC_V3_ADDR_TYPE_REDIST or KVM_VGIC_V3_ADDR_TYPE_REDIST_REGIONS
> with:
> 
>   base + size > phys_size AND base < phys_size
> 
> Add the missing check into vgic_v3_alloc_redist_region() which is called
> when setting the regions, and into vgic_v3_check_base() which is called
> when attempting the first vcpu-run. The vcpu-run check does not apply to
> KVM_VGIC_V3_ADDR_TYPE_REDIST_REGIONS because the regions size is known
> before the first vcpu-run. Note that using the REDIST_REGIONS API
> results in a different check, which already exists, at first vcpu run:
> that the number of redist regions is enough for all vcpus.
> 
> Finally, this patch also enables some extra tests in
> vgic_v3_alloc_redist_region() by calculating "size" early for the legacy
> redist api: like checking that the REDIST region can fit all the already
> created vcpus.
> 
> Signed-off-by: Ricardo Koller 
> ---
>  arch/arm64/kvm/vgic/vgic-mmio-v3.c | 6 --
>  arch/arm64/kvm/vgic/vgic-v3.c  | 4 
>  2 files changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm64/kvm/vgic/vgic-mmio-v3.c 
> b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
> index a09cdc0b953c..9be02bf7865e 100644
> --- a/arch/arm64/kvm/vgic/vgic-mmio-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-mmio-v3.c
> @@ -796,7 +796,9 @@ static int vgic_v3_alloc_redist_region(struct kvm *kvm, 
> uint32_t index,
>   struct vgic_dist *d = &kvm->arch.vgic;
>   struct vgic_redist_region *rdreg;
>   struct list_head *rd_regions = &d->rd_regions;
> - size_t size = count * KVM_VGIC_V3_REDIST_SIZE;
> + int nr_vcpus = atomic_read(&kvm->online_vcpus);
> + size_t size = count ? count * KVM_VGIC_V3_REDIST_SIZE
> + : nr_vcpus * KVM_VGIC_V3_REDIST_SIZE;
>   int ret;
>  
>   /* cross the end of memory ? */
> @@ -840,7 +842,7 @@ static int vgic_v3_alloc_redist_region(struct kvm *kvm, 
> uint32_t index,
>  
>   rdreg->base = VGIC_ADDR_UNDEF;
>  
> - ret = vgic_check_ioaddr(kvm, &rdreg->base, base, SZ_64K);
> + ret = vgic_check_iorange(kvm, &rdreg->base, base, SZ_64K, size);
>   if (ret)
>   goto free;
>  
> diff --git a/arch/arm64/kvm/vgic/vgic-v3.c b/arch/arm64/kvm/vgic/vgic-v3.c
> index 21a6207fb2ee..27ee674631b3 100644
> --- a/arch/arm64/kvm/vgic/vgic-v3.c
> +++ b/arch/arm64/kvm/vgic/vgic-v3.c
> @@ -486,6 +486,10 @@ bool vgic_v3_check_base(struct kvm *kvm)
>   if (rdreg->base + vgic_v3_rd_region_size(kvm, rdreg) <
>   rdreg->base)
>   return false;
> +
> + if (rdreg->base + vgic_v3_rd_region_size(kvm, rdreg) >
> + kvm_phys_size(kvm))
> + return false;

Why can't we replace these two checks with a single call to your new
fancy helper?

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v3 01/10] kvm: arm64: vgic: Introduce vgic_check_iorange

2021-10-01 Thread Marc Zyngier
On Thu, 30 Sep 2021 22:19:16 +0100,
Ricardo Koller  wrote:
> 
> On Thu, Sep 30, 2021 at 09:02:12AM +0200, Eric Auger wrote:
> > Hi,
> > 
> > On 9/29/21 11:17 PM, Ricardo Koller wrote:
> > > On Wed, Sep 29, 2021 at 06:29:21PM +0200, Eric Auger wrote:
> > >> Hi Ricardo,
> > >>
> > >> On 9/28/21 8:47 PM, Ricardo Koller wrote:
> > >>> Add the new vgic_check_iorange helper that checks that an iorange is
> > >>> sane: the start address and size have valid alignments, the range is
> > >>> within the addressable PA range, start+size doesn't overflow, and the
> > >>> start wasn't already defined.
> > >>>
> > >>> No functional change.
> > >>>
> > >>> Signed-off-by: Ricardo Koller 
> > >>> ---
> > >>>  arch/arm64/kvm/vgic/vgic-kvm-device.c | 22 ++
> > >>>  arch/arm64/kvm/vgic/vgic.h|  4 
> > >>>  2 files changed, 26 insertions(+)
> > >>>
> > >>> diff --git a/arch/arm64/kvm/vgic/vgic-kvm-device.c 
> > >>> b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> > >>> index 7740995de982..f714aded67b2 100644
> > >>> --- a/arch/arm64/kvm/vgic/vgic-kvm-device.c
> > >>> +++ b/arch/arm64/kvm/vgic/vgic-kvm-device.c
> > >>> @@ -29,6 +29,28 @@ int vgic_check_ioaddr(struct kvm *kvm, phys_addr_t 
> > >>> *ioaddr,
> > >>> return 0;
> > >>>  }
> > >>>  
> > >>> +int vgic_check_iorange(struct kvm *kvm, phys_addr_t *ioaddr,
> > >>> +  phys_addr_t addr, phys_addr_t alignment,
> > >>> +  phys_addr_t size)
> > >>> +{
> > >>> +   int ret;
> > >>> +
> > >>> +   ret = vgic_check_ioaddr(kvm, ioaddr, addr, alignment);
> > >> nit: not related to this patch but I am just wondering why we are
> > >> passing phys_addr_t *ioaddr downto vgic_check_ioaddr and thus to
> > >>
> > >> vgic_check_iorange()? This must be a leftover of some old code?
> > >>
> > > It's used to check that the base of a region is not already set.
> > > kvm_vgic_addr() uses it to make that check;
> > > vgic_v3_alloc_redist_region() does not:
> > >
> > >   rdreg->base = VGIC_ADDR_UNDEF; // so the "not already defined" check 
> > > passes
> > >   ret = vgic_check_ioaddr(kvm, &rdreg->base, base, SZ_64K);
> > Yes but I meant why a pointer?
> 
> I can't think of any good reason. It must be some leftover as you said.

It definitely is. Please have a patch to fix that. Also, it doesn't
look like vgic_check_ioaddr() has any other user at the end of the
series. Worth getting rid of.

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 09:05:27AM -0300, Marcelo Tosatti wrote:
> On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> > Marcelo,
> > 
> > On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> > >> On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> > >> 
> > >> Thomas, CC'ed, has deeper understanding of problems with 
> > >> forward time jumps than I do. Thomas, any comments?
> > >
> > > Based on the earlier discussion about the problems of synchronizing
> > > the guests clock via a notification to the NTP/Chrony daemon 
> > > (where there is a window where applications can read the stale
> > > value of the clock), a possible solution would be triggering
> > > an NMI on the destination (so that it runs ASAP, with higher
> > > priority than application/kernel).
> > >
> > > What would this NMI do, exactly?
> > 
> > Nothing. You cannot do anything time related in an NMI.
> > 
> > You might queue irq work which handles that, but that would still not
> > prevent user space or kernel space from observing the stale time stamp
> > depending on the execution state from where it resumes.
> 
> Yes.
> 
> > >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> > >> for either vm pause / vm resume (well, if paused for long periods of 
> > >> time) 
> > >> or savevm / restorevm.
> > >
> > > Maybe with the NMI above, it would be possible to use
> > > the realtime clock as a way to know time elapsed between
> > > events and advance guest clock without the current 
> > > problematic window.
> > 
> > As much duct tape you throw at the problem, it cannot be solved ever
> > because it's fundamentally broken. All you can do is to make the
> > observation windows smaller, but that's just curing the symptom.
> 
> Yes.
> 
> > The problem is that the guest is paused/resumed without getting any
> > information about that and the execution of the guest is stopped at an
> > arbitrary instruction boundary from which it resumes after migration or
> > restore. So there is no way to guarantee that after resume all vCPUs are
> > executing in a state which can handle that.
> > 
> > But even if that would be the case, then what prevents the stale time
> > stamps to be visible? Nothing:
> > 
> > T0:t = now();
> >  -> pause
> >  -> resume
> >  -> magic "fixup"
> > T1:dostuff(t);
> 
> Yes.
> 
> BTW, you could have a userspace notification (then applications 
> could handle this if desired).
> 
> > But that's not a fundamental problem because every preemptible or
> > interruptible code has the same issue:
> > 
> > T0:t = now();
> >  -> preemption or interrupt
> > T1:dostuff(t);
> > 
> > Which is usually not a problem, but It becomes a problem when T1 - T0 is
> > greater than the usual expectations which can obviously be trivially
> > achieved by guest migration or a savevm, restorevm cycle.
> > 
> > But let's go a step back and look at the clocks and their expectations:
> > 
> > CLOCK_MONOTONIC:
> > 
> >   Monotonically increasing clock which counts unless the system
> >   is in suspend. On resume it continues counting without jumping
> >   forward.
> > 
> >   That's the reference clock for everything else and therefore it
> >   is important that it does _not_ jump around.
> > 
> >   The reasons why CLOCK_MONOTONIC stops during suspend is
> >   historical and any attempt to change that breaks the world and
> >   some more because making it jump forward will trigger all sorts
> >   of timeouts, watchdogs and whatever. The last attempt to make
> >   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
> >   weeks. It's not going to be attempted again. See a3ed0e4393d6
> >   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
> >   details.
> > 
> >   Now the proposed change is creating exactly the same problem:
> > 
> >   >> > +       if (data.flags & KVM_CLOCK_REALTIME) {
> >   >> > +               u64 now_real_ns = ktime_get_real_ns();
> >   >> > +
> >   >> > +               /*
> >   >> > +                * Avoid stepping the kvmclock backwards.
> >   >> > +                */
> >   >> > +               if (now_real_ns > data.realtime)
> >   >> > +                       data.clock += now_real_ns - data.realtime;
> >   >> > +       }
> > 
> >   IOW, it takes the time between pause and resume into account and
> >   forwards the underlying base clock which makes CLOCK_MONOTONIC
> >   jump forward by exactly that amount of time.
> 
> Well, it is assuming that the
> 
>  T0:t = now();
>  T1:pause vm()
>  T2:  finish vm migration()
>  T3:dostuff(t);
> 
> Interval between T1 and T2 is small (and that the guest
> clocks are synchronized up to a given boundary).
> 
> But i suppose adding a limit to the forward clock advance 
> (in the migration case) is useful:
> 
>   1) If migration (well actually, only the final steps
>  to finish migration, the time between when guest is pause

Re: [PATCH v8 4/7] KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 01:02:23AM +0200, Thomas Gleixner wrote:
> Marcelo,
> 
> On Thu, Sep 30 2021 at 16:21, Marcelo Tosatti wrote:
> > On Wed, Sep 29, 2021 at 03:56:29PM -0300, Marcelo Tosatti wrote:
> >> On Thu, Sep 16, 2021 at 06:15:35PM +, Oliver Upton wrote:
> >> 
> >> Thomas, CC'ed, has deeper understanding of problems with 
> >> forward time jumps than I do. Thomas, any comments?
> >
> > Based on the earlier discussion about the problems of synchronizing
> > the guests clock via a notification to the NTP/Chrony daemon 
> > (where there is a window where applications can read the stale
> > value of the clock), a possible solution would be triggering
> > an NMI on the destination (so that it runs ASAP, with higher
> > priority than application/kernel).
> >
> > What would this NMI do, exactly?
> 
> Nothing. You cannot do anything time related in an NMI.
> 
> You might queue irq work which handles that, but that would still not
> prevent user space or kernel space from observing the stale time stamp
> depending on the execution state from where it resumes.

Yes.

> >> As a note: this makes it not OK to use KVM_CLOCK_REALTIME flag 
> >> for either vm pause / vm resume (well, if paused for long periods of time) 
> >> or savevm / restorevm.
> >
> > Maybe with the NMI above, it would be possible to use
> > the realtime clock as a way to know time elapsed between
> > events and advance guest clock without the current 
> > problematic window.
> 
> As much duct tape you throw at the problem, it cannot be solved ever
> because it's fundamentally broken. All you can do is to make the
> observation windows smaller, but that's just curing the symptom.

Yes.

> The problem is that the guest is paused/resumed without getting any
> information about that and the execution of the guest is stopped at an
> arbitrary instruction boundary from which it resumes after migration or
> restore. So there is no way to guarantee that after resume all vCPUs are
> executing in a state which can handle that.
> 
> But even if that were the case, then what prevents the stale time
> stamps from being visible? Nothing:
> 
> T0: t = now();
>  -> pause
>  -> resume
>  -> magic "fixup"
> T1: dostuff(t);

Yes.

BTW, you could have a userspace notification (then applications 
could handle this if desired).
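A minimal sketch of one mechanism applications can already use on Linux
to observe CLOCK_REALTIME being stepped: a timerfd armed with
TFD_TIMER_CANCEL_ON_SET, whose blocking read() fails with ECANCELED
when the realtime clock is set discontinuously.

#include <sys/timerfd.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
	struct itimerspec its = { 0 };
	uint64_t ticks;
	int fd = timerfd_create(CLOCK_REALTIME, 0);

	if (fd < 0)
		return 1;

	/* Arbitrary far-future expiry; we only care about cancellation. */
	its.it_value.tv_sec = time(NULL) + 86400;
	if (timerfd_settime(fd, TFD_TIMER_ABSTIME | TFD_TIMER_CANCEL_ON_SET,
			    &its, NULL) < 0)
		return 1;

	/* read() fails with ECANCELED when CLOCK_REALTIME is stepped. */
	if (read(fd, &ticks, sizeof(ticks)) < 0 && errno == ECANCELED)
		printf("CLOCK_REALTIME was stepped\n");

	close(fd);
	return 0;
}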

> But that's not a fundamental problem, because any preemptible or
> interruptible code has the same issue:
>
> T0: t = now();
>  -> preemption or interrupt
> T1: dostuff(t);
>
> Which is usually not a problem, but it becomes one when T1 - T0 is
> greater than the usual expectations, which can obviously be trivially
> achieved by guest migration or a savevm/restorevm cycle.
> 
> But let's go a step back and look at the clocks and their expectations:
> 
> CLOCK_MONOTONIC:
> 
>   Monotonically increasing clock which counts unless the system
>   is in suspend. On resume it continues counting without jumping
>   forward.
> 
>   That's the reference clock for everything else and therefore it
>   is important that it does _not_ jump around.
> 
>   The reason why CLOCK_MONOTONIC stops during suspend is
>   historical and any attempt to change that breaks the world and
>   some more because making it jump forward will trigger all sorts
>   of timeouts, watchdogs and whatever. The last attempt to make
>   CLOCK_MONOTONIC behave like CLOCK_BOOTTIME was reverted within 3
>   weeks. It's not going to be attempted again. See a3ed0e4393d6
>   ("Revert: Unify CLOCK_MONOTONIC and CLOCK_BOOTTIME") for
>   details.
> 
>   Now the proposed change is creating exactly the same problem:
> 
>   >> > +  if (data.flags & KVM_CLOCK_REALTIME) {
>   >> > +  u64 now_real_ns = ktime_get_real_ns();
>   >> > +
>   >> > +  /*
>   >> > +   * Avoid stepping the kvmclock backwards.
>   >> > +   */
>   >> > +  if (now_real_ns > data.realtime)
>   >> > +  data.clock += now_real_ns - data.realtime;
>   >> > +  }
> 
>   IOW, it takes the time between pause and resume into account and
>   forwards the underlying base clock which makes CLOCK_MONOTONIC
>   jump forward by exactly that amount of time.

Well, it is assuming that, in

 T0: t = now();
 T1: pause vm()
 T2: finish vm migration()
 T3: dostuff(t);

the interval between T1 and T2 is small (and that the guest
clocks are synchronized up to a given boundary).

But I suppose adding a limit to the forward clock advance
(in the migration case) is useful:

1) If migration (well actually, only the final steps
   to finish migration, i.e. the time between when the guest is paused
   on the source and resumed on the destination) takes too long,
   then too bad: make it shorter if you want the clocks
   to stay close to realtime across migration.

2) Avoid the other bugs in case of large forward advance.

Maybe having it configurable, with, say, a 1 minute maximum by default,
is a good choice?
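A userspace model of such a clamp on top of the quoted hunk; the
tunable name and its 60 second default are made up for illustration
(in the kernel this would presumably be a module parameter):

#include <stdint.h>

#define MAX_ADVANCE_NS	(60ULL * 1000 * 1000 * 1000)	/* made-up default */

static uint64_t advance_kvmclock(uint64_t clock, uint64_t saved_realtime,
				 uint64_t now_real_ns)
{
	uint64_t delta;

	/* Never step the clock backwards, as in the quoted hunk. */
	if (now_real_ns <= saved_realtime)
		return clock;

	delta = now_real_ns - saved_realtime;
	if (delta > MAX_ADVANCE_NS)
		delta = MAX_ADVANCE_NS;	/* cap unbounded forward jumps */

	return clock + delta;
}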

Re: KVM/arm64: Guest ABI changes do not appear rollback-safe

2021-10-01 Thread Marc Zyngier
On Thu, 30 Sep 2021 18:24:23 +0100,
Oliver Upton  wrote:
> 
> Hey Marc,
> 
> On Thu, Sep 30, 2021 at 12:32 AM Marc Zyngier  wrote:
> >
> > Hi Oliver,
> >
> > On Wed, 29 Sep 2021 19:22:05 +0100,
> > Oliver Upton  wrote:
> > >
> > > I have some lingering thoughts on this subject since we last spoke and
> > > wanted to discuss.
> > >
> > > I'm having a hard time figuring out how a VMM should handle a new
> > > hypercall identity register introduced on a newer kernel. In order to
> > > maintain guest ABI, the VMM would need to know about that register and
> > > zero it when restoring an older guest.
> >
> > Just as it would need to be able to discover any new system register
> > exposed by default, as it happens at all times. Which is why we have a
> > way to discover all the registers, architected or not.
> >
> > > Perhaps instead we could reserve a range of firmware registers as the
> > > 'hypercall identity' registers. Implement all of them as RAZ/WI by
> > > default, encouraging userspace to zero these registers away for older
> > > VMs but still allowing an old userspace to pick up new KVM features.
> > > Doing so would align the hypercall identity registers with the feature
> > > ID registers from the architecture.
> >
> > The range already exists in the form of the "coprocessor" 0x14. I
> > don't see the need to expose it as RAZ/WI, however. If userspace
> > doesn't know about how this pseudo-register works, it won't be able to
> > program it anyway.
> >
> > I don't buy the parallel with the ID-regs either. They are RAZ/WI by
> > default so that they don't UNDEF at runtime. The meaning of a RAZ
> > id-register is also well defined (feature not implemented), and the
> > CPU cannot write to them. In a way, the ID-regs *are* the enumeration
> > mechanism.
> >
> > Our firmware registers don't follow the same rules. Userspace can
> > write to them, and there is no such "not implemented" rule (case in
> > point, PSCI). We also have a separate enumeration mechanism
> > (GET_ONE_REG), which is (more or less) designed for userspace to find
> > what is implemented.
> >
> > For these reasons, I don't immediately see the point of advertising a
> > set of registers ahead of time, before userspace grows an
> > understanding of what these registers mean.
> 
> Supposing we don't preallocate some hypercall ID registers, how can we
> safely migrate a guest from an older kernel to a newer kernel? Ideally,
> we would preserve the hypercall feature set across the migration, which
> could work for a while with the first set of registers that get
> defined, but whenever a new hypercall firmware register comes along
> the VMM will be clueless about the new ABI.

My expectations were that whenever userspace writes to a firmware
register, all the defaults are invalidated. For example, say that
host-A knows about a single hypercall register, while host-B knows
about two. Upon writing to the first register, the host would clear
the set of available services in the second one. If userspace
eventually writes there, the value would stick if valid.

Also, remember that pseudo-registers don't have to be 64bit. We could
define a new class of hypercall-specific registers that would be much
wider, and thus have a larger write granularity (KVM supports anything
from 8 to 2048 bits). This would make it pretty easy to implement the
above.
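To make the enumeration point concrete, here is a sketch of how a VMM
could walk KVM_GET_REG_LIST and zero any "coprocessor" 0x14 firmware
register it does not recognise. known_fw_reg() is a made-up VMM-side
predicate, error handling is trimmed, and registers with special write
rules (e.g. KVM_REG_ARM_PSCI_VERSION) are skipped:

#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

extern int known_fw_reg(uint64_t id);	/* assumption: VMM-provided */

static void zero_unknown_fw_regs(int vcpu_fd)
{
	struct kvm_reg_list probe = { .n = 0 };
	struct kvm_reg_list *list;
	uint64_t zero = 0, i;

	/* First call fails with E2BIG but tells us how many regs exist. */
	ioctl(vcpu_fd, KVM_GET_REG_LIST, &probe);

	list = calloc(1, sizeof(*list) + probe.n * sizeof(uint64_t));
	list->n = probe.n;
	ioctl(vcpu_fd, KVM_GET_REG_LIST, list);

	for (i = 0; i < list->n; i++) {
		struct kvm_one_reg reg = {
			.id   = list->reg[i],
			.addr = (uint64_t)&zero,
		};

		/* "Coprocessor" 0x14 holds the arm64 firmware registers. */
		if ((reg.id & KVM_REG_ARM_COPROC_MASK) != KVM_REG_ARM_FW)
			continue;
		if (reg.id == KVM_REG_ARM_PSCI_VERSION)
			continue;	/* has its own write rules */
		if (!known_fw_reg(reg.id))
			ioctl(vcpu_fd, KVM_SET_ONE_REG, &reg);
	}
	free(list);
}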

> Fundamentally, I don't think userspace should need a patch to preserve
> ABI on a newer kernel. Despite that, it would seem that userspace will
> need to learn of any firmware registers that control hypercall
> features which come after the initial set that gets proposed. If
> KVM_GET_REG_LIST were to disambiguate between ID registers (hypercall,
> architectural feature ID registers) from other parts of the vCPU
> state, it would be clear to what registers to zero on a newer kernel.
> Apologies if it is distracting to mention the feature ID registers
> here, but both are on my mind currently and want to make sure there is
> some consistency in how features get handled on newer kernels,
> architected or not.

The problem I see is that we will always need to grow the HC register
space one way or another, no matter how many we reserve. Your approach
only works if we don't exceed that particular range. Maybe it will
never be a problem, but it remains that this is not scalable.

If we wanted to be safe, we'd reserve the whole of the possible space
as defined by the SMCCC spec. Given that we currently have two HC
spaces (the ARM-controlled one, and the KVM-specific one), the
function space being 16bits in each case, that's 16kB worth of zeroes
that userspace has to save/restore at all times... I'm not overly
enthusiastic.
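(That figure is just the bitmap arithmetic: 2 function spaces x 2^16
function IDs x 1 bit each = 131072 bits = 16 KiB.)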

Thanks,

M.

-- 
Without deviation from the norm, progress is not possible.


Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Marcelo Tosatti
On Fri, Oct 01, 2021 at 11:17:33AM +0200, Paolo Bonzini wrote:
> On 30/09/21 21:14, Marcelo Tosatti wrote:
> > > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
> > Hi Oliver,
> > 
> > This won't advance the TSC values themselves, right?
> 
> Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is
> advanced too.
> 
> Paolo

+4. Invoke the KVM_SET_CLOCK ioctl, providing the kvmclock nanoseconds
+   (k_0) and realtime nanoseconds (r_0) in their respective fields.
+   Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
+   structure. KVM will advance the VM's kvmclock to account for elapsed
+   time since recording the clock values.

You can't advance both kvmclock (kvmclock_offset variable) and the TSCs,
which would be double counting.

So you have to either add the elapsed realtime (1) between KVM_GET_CLOCK
and KVM_SET_CLOCK to kvmclock (which this patch is doing), or to the TSCs.
If you do both, there is double counting. Am I missing something?

To make it clearer: the TSC clocksource is faster than the kvmclock
source, so we'd rather use it when possible, which is achievable with
TSC scaling support on HW.

1: As mentioned earlier, just using the realtime clock delta between
hosts can introduce problems. So we need a scheme to:

- Find the offset between host clocks, with upper and lower
  bounds on error.
- Take appropriate actions based on that (for example,
  do not use KVM_CLOCK_REALTIME flag on KVM_SET_CLOCK
  if the delta between hosts is large).

Which can be done in userspace or kernel space... (hum, but maybe
delegating this to userspace will introduce different solutions
to the same problem?).
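A sketch of that policy on the VMM side, with a made-up threshold and
made-up offset/error inputs (e.g. estimates derived from NTP/PTP on
both hosts):

#include <stdint.h>
#include <stdlib.h>

#define MAX_HOST_SKEW_NS	(10LL * 1000 * 1000)	/* 10ms, arbitrary */

/*
 * offset_ns: estimated destination-source CLOCK_REALTIME offset;
 * err_ns: bound on the estimation error.
 */
static int use_realtime_flag(int64_t offset_ns, int64_t err_ns)
{
	/* Worst-case disagreement between the two hosts' clocks. */
	return llabs(offset_ns) + err_ns <= MAX_HOST_SKEW_NS;
}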

> > This (advancing the TSC values by the realtime elapsed time) would be
> > awesome because TSC clock_gettime() vdso is faster, and some
> > applications prefer to just read from TSC directly.
> > See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
> > 
> > The advancement with this patchset only applies to kvmclock.
> > 
> 
> 



Re: [PATCH v2 03/11] KVM: arm64: Encapsulate reset request logic in a helper function

2021-10-01 Thread Reiji Watanabe
On Thu, Sep 23, 2021 at 12:16 PM Oliver Upton  wrote:
>
> In its implementation of the PSCI function, KVM needs to request that a
> target vCPU resets before its next entry into the guest. Wrap the logic
> for requesting a reset in a function for later use by other implemented
> PSCI calls.
>
> No functional change intended.
>
> Signed-off-by: Oliver Upton 
> ---
>  arch/arm64/kvm/psci.c | 59 +--
>  1 file changed, 35 insertions(+), 24 deletions(-)
>
> diff --git a/arch/arm64/kvm/psci.c b/arch/arm64/kvm/psci.c
> index 310b9cb2b32b..bb59b692998b 100644
> --- a/arch/arm64/kvm/psci.c
> +++ b/arch/arm64/kvm/psci.c
> @@ -64,9 +64,40 @@ static inline bool kvm_psci_valid_affinity(unsigned long affinity)
> return !(affinity & ~MPIDR_HWID_BITMASK);
>  }
>
> -static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
> +static void kvm_psci_vcpu_request_reset(struct kvm_vcpu *vcpu,
> +   unsigned long entry_addr,
> +   unsigned long context_id,
> +   bool big_endian)
>  {
> struct vcpu_reset_state *reset_state;
> +
> +   lockdep_assert_held(&vcpu->kvm->lock);
> +
> +   reset_state = &vcpu->arch.reset_state;
> +   reset_state->pc = entry_addr;
> +
> +   /* Propagate caller endianness */
> +   reset_state->be = big_endian;
> +
> +   /*
> +* NOTE: We always update r0 (or x0) because for PSCI v0.1
> +* the general purpose registers are undefined upon CPU_ON.
> +*/
> +   reset_state->r0 = context_id;
> +
> +   WRITE_ONCE(reset_state->reset, true);
> +   kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> +
> +   /*
> +* Make sure the reset request is observed if the change to
> +* power_state is observed.
> +*/
> +   smp_wmb();
> +   vcpu->arch.power_off = false;
> +}
> +
> +static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
> +{
> struct kvm *kvm = source_vcpu->kvm;
> struct kvm_vcpu *vcpu = NULL;
> unsigned long cpu_id;
> @@ -90,29 +121,9 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu)
> return PSCI_RET_INVALID_PARAMS;
> }
>
> -   reset_state = &vcpu->arch.reset_state;
> -
> -   reset_state->pc = smccc_get_arg2(source_vcpu);
> -
> -   /* Propagate caller endianness */
> -   reset_state->be = kvm_vcpu_is_be(source_vcpu);
> -
> -   /*
> -* NOTE: We always update r0 (or x0) because for PSCI v0.1
> -* the general purpose registers are undefined upon CPU_ON.
> -*/
> -   reset_state->r0 = smccc_get_arg3(source_vcpu);
> -
> -   WRITE_ONCE(reset_state->reset, true);
> -   kvm_make_request(KVM_REQ_VCPU_RESET, vcpu);
> -
> -   /*
> -* Make sure the reset request is observed if the change to
> -* power_state is observed.
> -*/
> -   smp_wmb();
> -
> -   vcpu->arch.power_off = false;
> +   kvm_psci_vcpu_request_reset(vcpu, smccc_get_arg2(source_vcpu),
> +   smccc_get_arg3(source_vcpu),
> +   kvm_vcpu_is_be(source_vcpu));
> kvm_vcpu_wake_up(vcpu);
>
> return PSCI_RET_SUCCESS;
> --
> 2.33.0.685.g46640cef36-goog

Reviewed-by: Reiji Watanabe 

Not directly related to the patch, but the (original) code doesn't
do any sanity checking for the entry address although the PSCI spec says:

"INVALID_ADDRESS is returned when the entry point address is known
by the implementation to be invalid, because it is in a range that
is known not to be available to the caller."
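A sketch of what such a check could look like in kvm_psci_vcpu_on(),
where kvm_guest_ipa_is_valid() is a hypothetical helper (a memslot
lookup would be the obvious implementation) and PSCI_RET_INVALID_ADDRESS
is the return code PSCI v1.0 defines for this case:

	unsigned long entry_addr = smccc_get_arg2(source_vcpu);

	/* Hypothetical helper: does this IPA land inside a memslot? */
	if (!kvm_guest_ipa_is_valid(vcpu->kvm, entry_addr))
		return PSCI_RET_INVALID_ADDRESS;

	kvm_psci_vcpu_request_reset(vcpu, entry_addr,
				    smccc_get_arg3(source_vcpu),
				    kvm_vcpu_is_be(source_vcpu));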


Thanks,
Reiji


Re: [PATCH] KVM: arm64: Allow KVM to be disabled from the command line

2021-10-01 Thread Suzuki K Poulose

On 30/09/2021 11:29, Marc Zyngier wrote:

> On Wed, 29 Sep 2021 11:35:46 +0100,
> Suzuki K Poulose  wrote:
>>
>> On 03/09/2021 10:16, Marc Zyngier wrote:

>>> Although KVM can be compiled out of the kernel, it cannot be disabled
>>> at runtime. Allow this possibility by introducing a new mode that
>>> will prevent KVM from initialising.
>>>
>>> This is useful in the (limited) circumstances where you don't want
>>> KVM to be available (what is wrong with you?), or when you want
>>> to install another hypervisor instead (good luck with that).
>>>
>>> Signed-off-by: Marc Zyngier 
>>> ---
>>>  Documentation/admin-guide/kernel-parameters.txt |  3 +++
>>>  arch/arm64/include/asm/kvm_host.h               |  1 +
>>>  arch/arm64/kernel/idreg-override.c              |  1 +
>>>  arch/arm64/kvm/arm.c                            | 14 +++++++++++++-
>>>  4 files changed, 18 insertions(+), 1 deletion(-)

>>> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
>>> index 91ba391f9b32..cc5f68846434 100644
>>> --- a/Documentation/admin-guide/kernel-parameters.txt
>>> +++ b/Documentation/admin-guide/kernel-parameters.txt
>>> @@ -2365,6 +2365,9 @@
>>>  	kvm-arm.mode=
>>>  			[KVM,ARM] Select one of KVM/arm64's modes of operation.
>>>  
>>> +			none: Forcefully disable KVM and run in nVHE mode,
>>> +			      preventing KVM from ever initialising.
>>> +
>>>  			nvhe: Standard nVHE-based mode, without support for
>>>  			      protected guests.
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index f8be56d5342b..019490c67976 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -58,6 +58,7 @@
>>>  enum kvm_mode {
>>>  	KVM_MODE_DEFAULT,
>>>  	KVM_MODE_PROTECTED,
>>> +	KVM_MODE_NONE,
>>>  };
>>>  enum kvm_mode kvm_get_mode(void);
>>> diff --git a/arch/arm64/kernel/idreg-override.c b/arch/arm64/kernel/idreg-override.c
>>> index d8e606fe3c21..57013c1b6552 100644
>>> --- a/arch/arm64/kernel/idreg-override.c
>>> +++ b/arch/arm64/kernel/idreg-override.c
>>> @@ -95,6 +95,7 @@ static const struct {
>>>  	char	alias[FTR_ALIAS_NAME_LEN];
>>>  	char	feature[FTR_ALIAS_OPTION_LEN];
>>>  } aliases[] __initconst = {
>>> +	{ "kvm-arm.mode=none",		"id_aa64mmfr1.vh=0" },
>>>  	{ "kvm-arm.mode=nvhe",		"id_aa64mmfr1.vh=0" },
>>>  	{ "kvm-arm.mode=protected",	"id_aa64mmfr1.vh=0" },
>>>  	{ "arm64.nobti",		"id_aa64pfr1.bt=0" },
>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>> index fe102cd2e518..cdc70e238316 100644
>>> --- a/arch/arm64/kvm/arm.c
>>> +++ b/arch/arm64/kvm/arm.c
>>> @@ -2064,6 +2064,11 @@ int kvm_arch_init(void *opaque)
>>>  		return -ENODEV;
>>>  	}
>>>  
>>> +	if (kvm_get_mode() == KVM_MODE_NONE) {
>>> +		kvm_info("KVM disabled from command line\n");
>>> +		return -ENODEV;
>>> +	}
>>> +
>>>  	in_hyp_mode = is_kernel_in_hyp_mode();
>>>  
>>>  	if (cpus_have_final_cap(ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE) ||
>>> @@ -2137,8 +2142,15 @@ static int __init early_kvm_mode_cfg(char *arg)
>>>  		return 0;
>>>  	}
>>>  
>>> -	if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode()))
>>> +	if (strcmp(arg, "nvhe") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {
>>> +		kvm_mode = KVM_MODE_DEFAULT;
>>>  		return 0;
>>> +	}
>>> +
>>> +	if (strcmp(arg, "none") == 0 && !WARN_ON(is_kernel_in_hyp_mode())) {


>> nit: Does this really need to WARN here ? Unlike the "nvhe" case, if the
>> user wants to keep the KVM out of the picture for, say, debugging
>> something, it is perfectly Ok to allow the kernel to be running at EL2
>> without having to change the Firmware to alter the landing EL for the
>> kernel ?


> Well, the doc says "run in nVHE mode" and the option forces
> id_aa64mmfr1.vh=0. The WARN_ON() will only fire on broken^Wfruity HW
> that is VHE only. Note that this doesn't rely on any firmware change
> (we drop from EL2 to EL1 and stay there).


Ah, ok. So the "none" is in fact "nvhe + no-kvm". That's the bit I
missed. TBH, that name to me sounds like "no KVM" at all, which is what
we want. The question is, do we really need "none" to force vh == 0 ? I
understand this is only a problem on a rare set of HWs. But the generic
option looks deceiving.

That said, I am happy to leave this as is and the doc says so.



> We could add another option (none-vhe?) that stays at EL2 and still
> disables KVM if there is an appetite for it.


Na. Don't think that is necessary.




Otherwise,

Acked-by: Suzuki K Poulose 


Suzuki
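For reference, exercising the patch as posted amounts to booting with

    kvm-arm.mode=none

on the kernel command line; kvm_arch_init() then returns -ENODEV after
logging "KVM disabled from command line".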


Re: [PATCH v8 7/7] KVM: x86: Expose TSC offset controls to userspace

2021-10-01 Thread Paolo Bonzini

On 30/09/21 21:14, Marcelo Tosatti wrote:

> > +   new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1
>
> Hi Oliver,
>
> This won't advance the TSC values themselves, right?


Why not?  It affects the TSC offset in the vmcs, so the TSC in the VM is 
advanced too.


Paolo
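Spelling out the quoted formula: with guest_tsc = host_tsc + offset,
keeping the guest TSC continuous across the blackout while advancing it
by the elapsed kvmclock time yields the new offset. A sketch, assuming
freq_hz is the guest TSC frequency in Hz and using a 128-bit
intermediate (GCC/Clang extension) to avoid overflow:

#include <stdint.h>

/*
 * t_0/t_1: host TSC reads at save/restore; off_n: guest TSC offset on
 * the source; k_0/k_1: kvmclock readings (ns) at save/restore.
 */
static int64_t new_tsc_offset(int64_t t_0, int64_t off_n, int64_t t_1,
			      int64_t k_0, int64_t k_1, int64_t freq_hz)
{
	/* Elapsed kvmclock time, converted from ns to guest TSC ticks. */
	int64_t elapsed = (int64_t)(((__int128)(k_1 - k_0) * freq_hz) /
				    1000000000LL);

	/* new_off_n = t_0 + off_n + (k_1 - k_0) * freq - t_1 */
	return t_0 + off_n + elapsed - t_1;
}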


> This (advancing the TSC values by the realtime elapsed time) would be
> awesome because TSC clock_gettime() vdso is faster, and some
> applications prefer to just read from TSC directly.
> See "x86: kvmguest: use TSC clocksource if invariant TSC is exposed".
>
> The advancement with this patchset only applies to kvmclock.





Re: [PATCH v2 02/11] KVM: arm64: Clean up SMC64 PSCI filtering for AArch32 guests

2021-10-01 Thread Reiji Watanabe
On Thu, Sep 23, 2021 at 12:16 PM Oliver Upton  wrote:
>
> The only valid calling SMC calling convention from an AArch32 state is
> SMC32. Disallow any PSCI function that sets the SMC64 function ID bit
> when called from AArch32 rather than comparing against known SMC64 PSCI
> functions.
>
> Signed-off-by: Oliver Upton 

Looks nice.
Reviewed-by: Reiji Watanabe 


Re: [PATCH v2 01/11] KVM: arm64: Drop unused vcpu param to kvm_psci_valid_affinity()

2021-10-01 Thread Reiji Watanabe
On Thu, Sep 23, 2021 at 12:16 PM Oliver Upton  wrote:
>
> The helper function does not need a pointer to the vCPU, as it only
> consults a constant mask; drop the unused vcpu parameter.
>
> Signed-off-by: Oliver Upton 

Reviewed-by: Reiji Watanabe 