Re: kvmclock doesn't work, help?
On Wed, Dec 23, 2015 at 11:27 AM, Marcelo Tosatti wrote:
> On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti wrote:
>> > (busy spin is equally problematic as IPI for realtime guests).
>>
>> I disagree. It's never been safe to call clock_gettime from an RT
>> task and expect a guarantee of real-time performance. We could fix
>> that, but it's not even safe on non-KVM.
>
> The problem is how long the IPI (or busy spinning in case of version
> above) interrupts the vcpu.

The busy spin should be a few hundred cycles in the very worst case (a
couple of remote cache misses timed such that the guest is spinning
the whole time). The IPI is always thousands of cycles no matter what
the guest is doing.

>
>> Sending an IPI *always* stalls the task. Taking a lock (which is
>> effectively what this is doing) only stalls the tasks that contend for
>> the lock, which, most of the time, means that nothing stalls.
>>
>> Also, if the host disables preemption or otherwise boosts its priority
>> while version is odd, then the actual stall will be very short, in
>> contrast to an IPI-induced stall, which will be much, much longer.
>>
>> --Andy
>
> 1) The updates are rare.
> 2) There are no user complaints about the IPI mechanism.

If KVM ever starts directly propagating corrected time (CLOCK_MONOTONIC,
for example), then the updates won't be rare. Maybe I'll try to
instrument this.

--Andy
Re: kvmclock doesn't work, help?
On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti wrote:
> > On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
> >> [cc: John Stultz -- maybe you have ideas on how this should best
> >> integrate with the core code]
> >>
> >> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote:
> >> > Can you write an actual proposal (with details) that accommodates the
> >> > issue described at "Assuming a stable TSC across physical CPUS, and a
> >> > stable TSC" ?
> >> >
> >> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
> >> > realtime guests.
> >>
> >> This shouldn't require many details, and I don't think there's an ABI
> >> change. The rules are:
> >>
> >> When the overall system timebase changes (e.g. when the selected
> >> clocksource changes or when update_pvclock_gtod is called), the KVM
> >> host would:
> >>
> >> optionally: preempt_disable(); /* for performance */
> >>
> >> for all vms {
> >>
> >>   for all registered pvti structures {
> >>     pvti->version++; /* should be odd now */
> >>   }
> >
> > pvti is userspace data, so you have to pin it before?
>
> Yes.
>
> Fortunately, most systems probably only have one page of pvti
> structures, I think (unless there are a ton of vcpus), so the
> performance impact should be negligible.
>
> >
> >>   /* Note: right now, any vcpu that tries to access pvti will start
> >>      infinite looping. We should add cpu_relax() to the guests. */
> >>
> >>   for all registered pvti structures {
> >>     update everything except pvti->version;
> >>   }
> >>
> >>   for all registered pvti structures {
> >>     pvti->version++; /* should be even now */
> >>   }
> >>
> >>   cond_resched();
> >> }
> >>
> >> Is this enough detail? This should work with all existing guests,
> >> too, unless there's a buggy guest out there that actually fails to
> >> double-check version.
> >
> > What is the advantage of this over the brute force method, given
> > that guests will busy spin?
> >
> > (busy spin is equally problematic as IPI for realtime guests).
>
> I disagree. It's never been safe to call clock_gettime from an RT
> task and expect a guarantee of real-time performance. We could fix
> that, but it's not even safe on non-KVM.

The problem is how long the IPI (or busy spinning in case of version
above) interrupts the vcpu.

> Sending an IPI *always* stalls the task. Taking a lock (which is
> effectively what this is doing) only stalls the tasks that contend for
> the lock, which, most of the time, means that nothing stalls.
>
> Also, if the host disables preemption or otherwise boosts its priority
> while version is odd, then the actual stall will be very short, in
> contrast to an IPI-induced stall, which will be much, much longer.
>
> --Andy

1) The updates are rare.
2) There are no user complaints about the IPI mechanism.

Don't see a reason to change this.

For the suspend issue, though, there are complaints (guests on laptops
which fail to use masterclock).
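[Editorial note] The suspend complaint mentioned above appears to trace back to the raw-versus-NTP-corrected table quoted in the older messages of this thread. As a rough illustration only, the sketch below redoes that table's arithmetic in C. The values are the ones from the table, scaled by 100 to stay in integers; the array names and the helper switch_jump_x100() are hypothetical and do not exist in KVM.

```c
#include <assert.h>
#include <stdint.h>

/* Table values from the thread, scaled by 100 so the arithmetic stays
 * in integers. Index i corresponds to row t_i of the quoted table. */
static const int64_t raw_x100[5]       = {1000, 2000, 3000, 4000, 5000};
static const int64_t corrected_x100[5] = {1000, 1999, 2998, 3997, 4996};

/* Hypothetical helper: the signed jump a reader would observe if its
 * clock were switched from the raw column to the corrected column at
 * step i. A negative result means time appears to go backwards. */
int64_t switch_jump_x100(int i)
{
    assert(i >= 0 && i < 5);
    return corrected_x100[i] - raw_x100[i];
}
```

Under these assumptions, switching at t0 is harmless because the jump is 0, while switching at t4 gives a jump of -4 in the scaled units, that is, 0.04 table units backwards. The gap grows with every row, so the later the switch happens, the larger the backward step.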
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti wrote:
> On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
>> [cc: John Stultz -- maybe you have ideas on how this should best
>> integrate with the core code]
>>
>> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote:
>> > Can you write an actual proposal (with details) that accommodates the
>> > issue described at "Assuming a stable TSC across physical CPUS, and a
>> > stable TSC" ?
>> >
>> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> > realtime guests.
>>
>> This shouldn't require many details, and I don't think there's an ABI
>> change. The rules are:
>>
>> When the overall system timebase changes (e.g. when the selected
>> clocksource changes or when update_pvclock_gtod is called), the KVM
>> host would:
>>
>> optionally: preempt_disable(); /* for performance */
>>
>> for all vms {
>>
>>   for all registered pvti structures {
>>     pvti->version++; /* should be odd now */
>>   }
>
> pvti is userspace data, so you have to pin it before?

Yes.

Fortunately, most systems probably only have one page of pvti
structures, I think (unless there are a ton of vcpus), so the
performance impact should be negligible.

>
>>   /* Note: right now, any vcpu that tries to access pvti will start
>>      infinite looping. We should add cpu_relax() to the guests. */
>>
>>   for all registered pvti structures {
>>     update everything except pvti->version;
>>   }
>>
>>   for all registered pvti structures {
>>     pvti->version++; /* should be even now */
>>   }
>>
>>   cond_resched();
>> }
>>
>> Is this enough detail? This should work with all existing guests,
>> too, unless there's a buggy guest out there that actually fails to
>> double-check version.
>
> What is the advantage of this over the brute force method, given
> that guests will busy spin?
>
> (busy spin is equally problematic as IPI for realtime guests).

I disagree. It's never been safe to call clock_gettime from an RT
task and expect a guarantee of real-time performance. We could fix
that, but it's not even safe on non-KVM.

Sending an IPI *always* stalls the task. Taking a lock (which is
effectively what this is doing) only stalls the tasks that contend for
the lock, which, most of the time, means that nothing stalls.

Also, if the host disables preemption or otherwise boosts its priority
while version is odd, then the actual stall will be very short, in
contrast to an IPI-induced stall, which will be much, much longer.

--Andy
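[Editorial note] The version-based update proposed in this message has the shape of a sequence lock. The following is a minimal user-space sketch of that pattern in C11, not the KVM implementation. The type pvti_sketch, its two payload fields, and the functions host_update_all() and guest_read() are hypothetical stand-ins; the real pvti layout and the pinning of guest pages are not modelled, and a real guest would place a cpu_relax()-style hint inside the retry loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a pvti structure.
 * This is NOT the real pvclock ABI layout. */
struct pvti_sketch {
    _Atomic uint32_t version;        /* odd while an update is in flight */
    _Atomic uint64_t tsc_timestamp;
    _Atomic uint64_t system_time;
};

struct pvti_snapshot {
    uint64_t tsc_timestamp;
    uint64_t system_time;
};

/* Host side: the three passes from the proposal, applied to n structures. */
void host_update_all(struct pvti_sketch *pvti, size_t n,
                     uint64_t tsc_timestamp, uint64_t system_time)
{
    size_t i;

    /* Pass 1: make every version odd so readers know to retry. */
    for (i = 0; i < n; i++) {
        uint32_t v = atomic_load_explicit(&pvti[i].version,
                                          memory_order_relaxed);
        assert((v & 1) == 0);        /* no update may already be in flight */
        atomic_store_explicit(&pvti[i].version, v + 1,
                              memory_order_relaxed);
    }
    /* Order the odd versions before the payload writes below. */
    atomic_thread_fence(memory_order_release);

    /* Pass 2: update everything except the version. */
    for (i = 0; i < n; i++) {
        atomic_store_explicit(&pvti[i].tsc_timestamp, tsc_timestamp,
                              memory_order_relaxed);
        atomic_store_explicit(&pvti[i].system_time, system_time,
                              memory_order_relaxed);
    }

    /* Pass 3: make every version even again, publishing the payload. */
    for (i = 0; i < n; i++) {
        uint32_t v = atomic_load_explicit(&pvti[i].version,
                                          memory_order_relaxed);
        atomic_store_explicit(&pvti[i].version, v + 1,
                              memory_order_release);
    }
}

/* Guest side: retry while the version is odd or changed during the read. */
struct pvti_snapshot guest_read(struct pvti_sketch *p)
{
    struct pvti_snapshot s;
    uint32_t v1, v2;

    do {
        v1 = atomic_load_explicit(&p->version, memory_order_acquire);
        s.tsc_timestamp = atomic_load_explicit(&p->tsc_timestamp,
                                               memory_order_relaxed);
        s.system_time = atomic_load_explicit(&p->system_time,
                                             memory_order_relaxed);
        /* Order the payload reads before the version re-check. */
        atomic_thread_fence(memory_order_acquire);
        v2 = atomic_load_explicit(&p->version, memory_order_relaxed);
    } while ((v1 & 1) || v1 != v2);

    return s;
}
```

A reader only spins while some version is odd, which is the window between pass 1 and pass 3. That is the basis for the argument above that the stall is confined to readers that actually contend with an update, whereas an IPI interrupts every vcpu regardless of what it is doing.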
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote: > [cc: John Stultz -- maybe you have ideas on how this should best > integrate with the core code] > > On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote: > > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: > >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti > >> wrote: > >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti > >> >> wrote: > >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti > >> >> >> wrote: > >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski > >> >> >> >> wrote: > >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini > >> >> >> >> > wrote: > >> >> >> >> >> > >> >> >> >> >> > >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >> >> >>> > RAW TSC NTP corrected TSC > >> >> >> >> >>> > t0 10 10 > >> >> >> >> >>> > t1 20 19.99 > >> >> >> >> >>> > t2 30 29.98 > >> >> >> >> >>> > t3 40 39.97 > >> >> >> >> >>> > t4 50 49.96 > >> >> > > >> >> > (1) > >> >> > > >> >> >> >> >>> > > >> >> >> >> >>> > ... > >> >> >> >> >>> > > >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >> >> >>> > you can see what will happen. > >> >> >> >> >>> > >> >> >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> >> >> > >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. > >> >> >> >> >> After > >> >> >> >> >> resume, the TSC certainly increases at the same rate as > >> >> >> >> >> before, but the > >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased > >> >> >> >> >> slower > >> >> >> >> >> than the guest kvmclock. > >> >> >> >> > > >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? 
> >> >> >> >> > > >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after > >> >> >> >> > resume to > >> >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> >> >> > backwards. > >> >> >> >> > > >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied > >> >> >> >> > on top > >> >> >> >> > of kvmclock, and this shouldn't matter. > >> >> >> >> > > >> >> >> >> > I still feel like I'm missing something very basic here. > >> >> >> >> > > >> >> >> >> > >> >> >> >> OK, I think I get it. > >> >> >> >> > >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the > >> >> >> >> host's > >> >> >> >> correction to the guest. If it did, indeed, propagate the > >> >> >> >> correction > >> >> >> >> then, after resume, the host's new system_time would match the > >> >> >> >> guest's > >> >> >> >> idea of it (after accounting for the guest's long nap), and I > >> >> >> >> don't > >> >> >> >> think there would be a problem. > >> >> >> >> That being said, I can't find the code in the masterclock stuff > >> >> >> >> that > >> >> >> >> would actually do this. > >> >> >> > > >> >> >> > Guest clock is maintained by guest timekeeping code, which does: > >> >> >> > > >> >> >> > timer_interrupt() > >> >> >> > offset = read clocksource since last timer interrupt > >> >> >> > accumulate_to_systemclock(offset) > >> >> >> > > >> >> >> > The frequency correction of NTP in the host can be applied to > >> >> >> > kvmclock, which will be visible to the guest > >> >> >> > at "read clocksource since last timer interrupt" > >> >> >> > (kvmclock_clocksource_read function). > >> >> >> > >> >> >> pvclock_clocksource_read? That seems to do the same thing as all the > >> >> >> other clocksource access functions. > >> >> >> > >> >> >> > > >> >> >> > This does not mean that the NTP correction in the host is > >> >> >> > propagated > >> >> >> > to the guests system clock directly. 
> >> >> >> > > >> >> >> > (For example, the guest can run NTP which is free to do further > >> >> >> > adjustments at "accumulate_to_systemclock(offset)" time). > >> >> >> > >> >> >> Of course. But I expected that, in the absence of NTP on the guest, > >> >> >> that the guest would track the host's *corrected* time. > >> >> >> > >> >> >> > > >> >> >> >> If, on the other hand, the host's NTP correction is not supposed > >> >> >> >> to > >> >> >> >> propagate to the guest, > >> >> >> > > >> >> >> > This is optional. There is a module option to control this, in > >> >> >> > fact. > >> >> >> > > >> >> >> > Its nice to have, because then you can execute a guest without NTP > >> >> >> > (say without network connection), and have a kvmclock (kvmclock is > >> >> >> > a > >> >> >> > clocksource, not a guest system clock) which is NTP corrected. > >> >>
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 12:25 PM, Andy Lutomirski wrote: > [cc: John Stultz -- maybe you have ideas on how this should best > integrate with the core code] > > On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote: >> On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: >>> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti >>> wrote: >>> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: >>> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti >>> >> wrote: >>> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >>> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti >>> >> >> wrote: >>> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >>> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >>> >> >> >> wrote: >>> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >>> >> >> >> > wrote: >>> >> >> >> >> >>> >> >> >> >> >>> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >>> >> >> >> >>> > RAW TSC NTP corrected TSC >>> >> >> >> >>> > t0 10 10 >>> >> >> >> >>> > t1 20 19.99 >>> >> >> >> >>> > t2 30 29.98 >>> >> >> >> >>> > t3 40 39.97 >>> >> >> >> >>> > t4 50 49.96 >>> >> > >>> >> > (1) >>> >> > >>> >> >> >> >>> > >>> >> >> >> >>> > ... >>> >> >> >> >>> > >>> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >>> >> >> >> >>> > you can see what will happen. >>> >> >> >> >>> >>> >> >> >> >>> Sure, but why would you ever switch from one to the other? >>> >> >> >> >> >>> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. >>> >> >> >> >> After >>> >> >> >> >> resume, the TSC certainly increases at the same rate as before, >>> >> >> >> >> but the >>> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased >>> >> >> >> >> slower >>> >> >> >> >> than the guest kvmclock. >>> >> >> >> > >>> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? 
>>> >> >> >> > >>> >> >> >> > If it's the host's, then wouldn't systemtime be reset after >>> >> >> >> > resume to >>> >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go >>> >> >> >> > backwards. >>> >> >> >> > >>> >> >> >> > If it's the guest's, then the guest's NTP correction is applied >>> >> >> >> > on top >>> >> >> >> > of kvmclock, and this shouldn't matter. >>> >> >> >> > >>> >> >> >> > I still feel like I'm missing something very basic here. >>> >> >> >> > >>> >> >> >> >>> >> >> >> OK, I think I get it. >>> >> >> >> >>> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the >>> >> >> >> host's >>> >> >> >> correction to the guest. If it did, indeed, propagate the >>> >> >> >> correction >>> >> >> >> then, after resume, the host's new system_time would match the >>> >> >> >> guest's >>> >> >> >> idea of it (after accounting for the guest's long nap), and I don't >>> >> >> >> think there would be a problem. >>> >> >> >> That being said, I can't find the code in the masterclock stuff >>> >> >> >> that >>> >> >> >> would actually do this. >>> >> >> > >>> >> >> > Guest clock is maintained by guest timekeeping code, which does: >>> >> >> > >>> >> >> > timer_interrupt() >>> >> >> > offset = read clocksource since last timer interrupt >>> >> >> > accumulate_to_systemclock(offset) >>> >> >> > >>> >> >> > The frequency correction of NTP in the host can be applied to >>> >> >> > kvmclock, which will be visible to the guest >>> >> >> > at "read clocksource since last timer interrupt" >>> >> >> > (kvmclock_clocksource_read function). >>> >> >> >>> >> >> pvclock_clocksource_read? That seems to do the same thing as all the >>> >> >> other clocksource access functions. >>> >> >> >>> >> >> > >>> >> >> > This does not mean that the NTP correction in the host is propagated >>> >> >> > to the guests system clock directly. 
>>> >> >> > >>> >> >> > (For example, the guest can run NTP which is free to do further >>> >> >> > adjustments at "accumulate_to_systemclock(offset)" time). >>> >> >> >>> >> >> Of course. But I expected that, in the absence of NTP on the guest, >>> >> >> that the guest would track the host's *corrected* time. >>> >> >> >>> >> >> > >>> >> >> >> If, on the other hand, the host's NTP correction is not supposed to >>> >> >> >> propagate to the guest, >>> >> >> > >>> >> >> > This is optional. There is a module option to control this, in fact. >>> >> >> > >>> >> >> > Its nice to have, because then you can execute a guest without NTP >>> >> >> > (say without network connection), and have a kvmclock (kvmclock is a >>> >> >> > clocksource, not a guest system clock) which is NTP corrected. >>> >> >> >>> >> >> Can you point to how this works? I found kvm_guest_time_update, whch >>> >> >> is called under circumstances that I haven't untangled. I can't >>> >> >> really tell wha
Re: kvmclock doesn't work, help?
[cc: John Stultz -- maybe you have ideas on how this should best integrate with the core code] On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote: > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti >> >> wrote: >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti >> >> >> wrote: >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> >> >> wrote: >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >> >> >>> > RAW TSC NTP corrected TSC >> >> >> >> >>> > t0 10 10 >> >> >> >> >>> > t1 20 19.99 >> >> >> >> >>> > t2 30 29.98 >> >> >> >> >>> > t3 40 39.97 >> >> >> >> >>> > t4 50 49.96 >> >> > >> >> > (1) >> >> > >> >> >> >> >>> > >> >> >> >> >>> > ... >> >> >> >> >>> > >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >> >> >>> > you can see what will happen. >> >> >> >> >>> >> >> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. >> >> >> >> >> After >> >> >> >> >> resume, the TSC certainly increases at the same rate as before, >> >> >> >> >> but the >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased >> >> >> >> >> slower >> >> >> >> >> than the guest kvmclock. >> >> >> >> > >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> >> >> > >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after >> >> >> >> > resume to >> >> >> >> > the NTP corrected value? 
If so, the guest wouldn't see time go >> >> >> >> > backwards. >> >> >> >> > >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied >> >> >> >> > on top >> >> >> >> > of kvmclock, and this shouldn't matter. >> >> >> >> > >> >> >> >> > I still feel like I'm missing something very basic here. >> >> >> >> > >> >> >> >> >> >> >> >> OK, I think I get it. >> >> >> >> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the >> >> >> >> host's >> >> >> >> correction to the guest. If it did, indeed, propagate the >> >> >> >> correction >> >> >> >> then, after resume, the host's new system_time would match the >> >> >> >> guest's >> >> >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> >> >> think there would be a problem. >> >> >> >> That being said, I can't find the code in the masterclock stuff that >> >> >> >> would actually do this. >> >> >> > >> >> >> > Guest clock is maintained by guest timekeeping code, which does: >> >> >> > >> >> >> > timer_interrupt() >> >> >> > offset = read clocksource since last timer interrupt >> >> >> > accumulate_to_systemclock(offset) >> >> >> > >> >> >> > The frequency correction of NTP in the host can be applied to >> >> >> > kvmclock, which will be visible to the guest >> >> >> > at "read clocksource since last timer interrupt" >> >> >> > (kvmclock_clocksource_read function). >> >> >> >> >> >> pvclock_clocksource_read? That seems to do the same thing as all the >> >> >> other clocksource access functions. >> >> >> >> >> >> > >> >> >> > This does not mean that the NTP correction in the host is propagated >> >> >> > to the guests system clock directly. >> >> >> > >> >> >> > (For example, the guest can run NTP which is free to do further >> >> >> > adjustments at "accumulate_to_systemclock(offset)" time). >> >> >> >> >> >> Of course. But I expected that, in the absence of NTP on the guest, >> >> >> that the guest would track the host's *corrected* time. 
>> >> >> >> >> >> > >> >> >> >> If, on the other hand, the host's NTP correction is not supposed to >> >> >> >> propagate to the guest, >> >> >> > >> >> >> > This is optional. There is a module option to control this, in fact. >> >> >> > >> >> >> > Its nice to have, because then you can execute a guest without NTP >> >> >> > (say without network connection), and have a kvmclock (kvmclock is a >> >> >> > clocksource, not a guest system clock) which is NTP corrected. >> >> >> >> >> >> Can you point to how this works? I found kvm_guest_time_update, whch >> >> >> is called under circumstances that I haven't untangled. I can't >> >> >> really tell what it's trying to do. >> >> > >> >> > Documentation/virtual/kvm/timekeeping.txt. >> >> > >> >> >> >> That document is really long. I skimmed it and found nothing. >> > >> > kvm_guest_time_u
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: > On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: > > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: > >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti > >> wrote: > >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti > >> >> wrote: > >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski > >> >> >> wrote: > >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini > >> >> >> > wrote: > >> >> >> >> > >> >> >> >> > >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >> >> >>> > RAW TSC NTP corrected TSC > >> >> >> >>> > t0 10 10 > >> >> >> >>> > t1 20 19.99 > >> >> >> >>> > t2 30 29.98 > >> >> >> >>> > t3 40 39.97 > >> >> >> >>> > t4 50 49.96 > >> > > >> > (1) > >> > > >> >> >> >>> > > >> >> >> >>> > ... > >> >> >> >>> > > >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >> >> >>> > you can see what will happen. > >> >> >> >>> > >> >> >> >>> Sure, but why would you ever switch from one to the other? > >> >> >> >> > >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. > >> >> >> >> After > >> >> >> >> resume, the TSC certainly increases at the same rate as before, > >> >> >> >> but the > >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased > >> >> >> >> slower > >> >> >> >> than the guest kvmclock. > >> >> >> > > >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> >> >> > > >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume > >> >> >> > to > >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go > >> >> >> > backwards. 
> >> >> >> > > >> >> >> > If it's the guest's, then the guest's NTP correction is applied on > >> >> >> > top > >> >> >> > of kvmclock, and this shouldn't matter. > >> >> >> > > >> >> >> > I still feel like I'm missing something very basic here. > >> >> >> > > >> >> >> > >> >> >> OK, I think I get it. > >> >> >> > >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> >> >> correction to the guest. If it did, indeed, propagate the correction > >> >> >> then, after resume, the host's new system_time would match the > >> >> >> guest's > >> >> >> idea of it (after accounting for the guest's long nap), and I don't > >> >> >> think there would be a problem. > >> >> >> That being said, I can't find the code in the masterclock stuff that > >> >> >> would actually do this. > >> >> > > >> >> > Guest clock is maintained by guest timekeeping code, which does: > >> >> > > >> >> > timer_interrupt() > >> >> > offset = read clocksource since last timer interrupt > >> >> > accumulate_to_systemclock(offset) > >> >> > > >> >> > The frequency correction of NTP in the host can be applied to > >> >> > kvmclock, which will be visible to the guest > >> >> > at "read clocksource since last timer interrupt" > >> >> > (kvmclock_clocksource_read function). > >> >> > >> >> pvclock_clocksource_read? That seems to do the same thing as all the > >> >> other clocksource access functions. > >> >> > >> >> > > >> >> > This does not mean that the NTP correction in the host is propagated > >> >> > to the guests system clock directly. > >> >> > > >> >> > (For example, the guest can run NTP which is free to do further > >> >> > adjustments at "accumulate_to_systemclock(offset)" time). > >> >> > >> >> Of course. But I expected that, in the absence of NTP on the guest, > >> >> that the guest would track the host's *corrected* time. 
> >> >> > >> >> > > >> >> >> If, on the other hand, the host's NTP correction is not supposed to > >> >> >> propagate to the guest, > >> >> > > >> >> > This is optional. There is a module option to control this, in fact. > >> >> > > >> >> > Its nice to have, because then you can execute a guest without NTP > >> >> > (say without network connection), and have a kvmclock (kvmclock is a > >> >> > clocksource, not a guest system clock) which is NTP corrected. > >> >> > >> >> Can you point to how this works? I found kvm_guest_time_update, whch > >> >> is called under circumstances that I haven't untangled. I can't > >> >> really tell what it's trying to do. > >> > > >> > Documentation/virtual/kvm/timekeeping.txt. > >> > > >> > >> That document is really long. I skimmed it and found nothing. > > > > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set. > > > > This happens when: > > - kvmclock is enabled or disabled by the guest. > > - periodically to propagate NTP correction to kvmclock clock. > > - guest vcpu switching between host pcpus when TSCs are out of sync.
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti >> wrote: >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti >> >> wrote: >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> >> wrote: >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >> >>> > RAW TSC NTP corrected TSC >> >> >> >>> > t0 10 10 >> >> >> >>> > t1 20 19.99 >> >> >> >>> > t2 30 29.98 >> >> >> >>> > t3 40 39.97 >> >> >> >>> > t4 50 49.96 >> > >> > (1) >> > >> >> >> >>> > >> >> >> >>> > ... >> >> >> >>> > >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >> >>> > you can see what will happen. >> >> >> >>> >> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> >> >> resume, the TSC certainly increases at the same rate as before, but >> >> >> >> the >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased >> >> >> >> slower >> >> >> >> than the guest kvmclock. >> >> >> > >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> >> > >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go >> >> >> > backwards. >> >> >> > >> >> >> > If it's the guest's, then the guest's NTP correction is applied on >> >> >> > top >> >> >> > of kvmclock, and this shouldn't matter. >> >> >> > >> >> >> > I still feel like I'm missing something very basic here. >> >> >> > >> >> >> >> >> >> OK, I think I get it. 
>> >> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> >> >> correction to the guest. If it did, indeed, propagate the correction >> >> >> then, after resume, the host's new system_time would match the guest's >> >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> >> think there would be a problem. >> >> >> That being said, I can't find the code in the masterclock stuff that >> >> >> would actually do this. >> >> > >> >> > Guest clock is maintained by guest timekeeping code, which does: >> >> > >> >> > timer_interrupt() >> >> > offset = read clocksource since last timer interrupt >> >> > accumulate_to_systemclock(offset) >> >> > >> >> > The frequency correction of NTP in the host can be applied to >> >> > kvmclock, which will be visible to the guest >> >> > at "read clocksource since last timer interrupt" >> >> > (kvmclock_clocksource_read function). >> >> >> >> pvclock_clocksource_read? That seems to do the same thing as all the >> >> other clocksource access functions. >> >> >> >> > >> >> > This does not mean that the NTP correction in the host is propagated >> >> > to the guests system clock directly. >> >> > >> >> > (For example, the guest can run NTP which is free to do further >> >> > adjustments at "accumulate_to_systemclock(offset)" time). >> >> >> >> Of course. But I expected that, in the absence of NTP on the guest, >> >> that the guest would track the host's *corrected* time. >> >> >> >> > >> >> >> If, on the other hand, the host's NTP correction is not supposed to >> >> >> propagate to the guest, >> >> > >> >> > This is optional. There is a module option to control this, in fact. >> >> > >> >> > Its nice to have, because then you can execute a guest without NTP >> >> > (say without network connection), and have a kvmclock (kvmclock is a >> >> > clocksource, not a guest system clock) which is NTP corrected. >> >> >> >> Can you point to how this works? 
I found kvm_guest_time_update, whch >> >> is called under circumstances that I haven't untangled. I can't >> >> really tell what it's trying to do. >> > >> > Documentation/virtual/kvm/timekeeping.txt. >> > >> >> That document is really long. I skimmed it and found nothing. > > kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set. > > This happens when: > - kvmclock is enabled or disabled by the guest. > - periodically to propagate NTP correction to kvmclock clock. > - guest vcpu switching between host pcpus when TSCs are out of sync. > - after migration. > - after savevm/loadvm. > >> >> In any case, this still seems much more convoluted than it has to be. >> >> In the case in which the host has a stable TSC (tsc is selected in the >> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all >> >> the time on the last few ge
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote:
> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti wrote:
> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote:
> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote:
> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote:
> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote:
> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote:
> >> >> >>> >       RAW TSC    NTP corrected TSC
> >> >> >>> > t0    10         10
> >> >> >>> > t1    20         19.99
> >> >> >>> > t2    30         29.98
> >> >> >>> > t3    40         39.97
> >> >> >>> > t4    50         49.96
> >
> > (1)
> >
> >> >> >>> >
> >> >> >>> > ...
> >> >> >>> >
> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> >> >> >>> > you can see what will happen.
> >> >> >>>
> >> >> >>> Sure, but why would you ever switch from one to the other?
> >> >> >>
> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After
> >> >> >> resume, the TSC certainly increases at the same rate as before, but
> >> >> >> the raw TSC restarted counting from 0 and systemtime has increased
> >> >> >> slower than the guest kvmclock.
> >> >> >
> >> >> > Wait, are we talking about the host's NTP or the guest's NTP?
> >> >> >
> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to
> >> >> > the NTP corrected value? If so, the guest wouldn't see time go
> >> >> > backwards.
> >> >> >
> >> >> > If it's the guest's, then the guest's NTP correction is applied on top
> >> >> > of kvmclock, and this shouldn't matter.
> >> >> >
> >> >> > I still feel like I'm missing something very basic here.
> >> >> >
> >> >>
> >> >> OK, I think I get it.
> >> >>
> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's
> >> >> correction to the guest. If it did, indeed, propagate the correction
> >> >> then, after resume, the host's new system_time would match the guest's
> >> >> idea of it (after accounting for the guest's long nap), and I don't
> >> >> think there would be a problem.
> >> >> That being said, I can't find the code in the masterclock stuff that
> >> >> would actually do this.
> >> >
> >> > Guest clock is maintained by guest timekeeping code, which does:
> >> >
> >> > timer_interrupt()
> >> >   offset = read clocksource since last timer interrupt
> >> >   accumulate_to_systemclock(offset)
> >> >
> >> > The frequency correction of NTP in the host can be applied to
> >> > kvmclock, which will be visible to the guest
> >> > at "read clocksource since last timer interrupt"
> >> > (kvmclock_clocksource_read function).
> >>
> >> pvclock_clocksource_read? That seems to do the same thing as all the
> >> other clocksource access functions.
> >>
> >> >
> >> > This does not mean that the NTP correction in the host is propagated
> >> > to the guest's system clock directly.
> >> >
> >> > (For example, the guest can run NTP which is free to do further
> >> > adjustments at "accumulate_to_systemclock(offset)" time).
> >>
> >> Of course. But I expected that, in the absence of NTP on the guest,
> >> that the guest would track the host's *corrected* time.
> >>
> >> >
> >> >> If, on the other hand, the host's NTP correction is not supposed to
> >> >> propagate to the guest,
> >> >
> >> > This is optional. There is a module option to control this, in fact.
> >> >
> >> > It's nice to have, because then you can execute a guest without NTP
> >> > (say without network connection), and have a kvmclock (kvmclock is a
> >> > clocksource, not a guest system clock) which is NTP corrected.
> >>
> >> Can you point to how this works? I found kvm_guest_time_update, which
> >> is called under circumstances that I haven't untangled. I can't
> >> really tell what it's trying to do.
> >
> > Documentation/virtual/kvm/timekeeping.txt.
> >
>
> That document is really long. I skimmed it and found nothing.

kvm_guest_time_update is called when KVM_REQ_UPDATE_CLOCK is set.

This happens when:
- kvmclock is enabled or disabled by the guest.
- periodically to propagate NTP correction to kvmclock clock.
- guest vcpu switching between host pcpus when TSCs are out of sync.
- after migration.
- after savevm/loadvm.

> >> In any case, this still seems much more convoluted than it has to be.
> >> In the case in which the host has a stable TSC (tsc is selected in the
> >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all
> >> the time on the last few generations of CPUs, then the core
> >> timekeeping code is already exposing a linear function that's supposed
> >> to be used for monotonic, cpu-local access to a corrected nanosecond
> >> counter. It's even
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti wrote: > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> wrote: >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> > wrote: >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >>> > RAW TSC NTP corrected TSC >> >> >>> > t0 10 10 >> >> >>> > t1 20 19.99 >> >> >>> > t2 30 29.98 >> >> >>> > t3 40 39.97 >> >> >>> > t4 50 49.96 > > (1) > >> >> >>> > >> >> >>> > ... >> >> >>> > >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >>> > you can see what will happen. >> >> >>> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> >> resume, the TSC certainly increases at the same rate as before, but the >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower >> >> >> than the guest kvmclock. >> >> > >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> > >> >> > If it's the host's, then wouldn't systemtime be reset after resume to >> >> > the NTP corrected value? If so, the guest wouldn't see time go >> >> > backwards. >> >> > >> >> > If it's the guest's, then the guest's NTP correction is applied on top >> >> > of kvmclock, and this shouldn't matter. >> >> > >> >> > I still feel like I'm missing something very basic here. >> >> > >> >> >> >> OK, I think I get it. >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> >> correction to the guest. If it did, indeed, propagate the correction >> >> then, after resume, the host's new system_time would match the guest's >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> think there would be a problem. 
>> >> That being said, I can't find the code in the masterclock stuff that >> >> would actually do this. >> > >> > Guest clock is maintained by guest timekeeping code, which does: >> > >> > timer_interrupt() >> > offset = read clocksource since last timer interrupt >> > accumulate_to_systemclock(offset) >> > >> > The frequency correction of NTP in the host can be applied to >> > kvmclock, which will be visible to the guest >> > at "read clocksource since last timer interrupt" >> > (kvmclock_clocksource_read function). >> >> pvclock_clocksource_read? That seems to do the same thing as all the >> other clocksource access functions. >> >> > >> > This does not mean that the NTP correction in the host is propagated >> > to the guests system clock directly. >> > >> > (For example, the guest can run NTP which is free to do further >> > adjustments at "accumulate_to_systemclock(offset)" time). >> >> Of course. But I expected that, in the absence of NTP on the guest, >> that the guest would track the host's *corrected* time. >> >> > >> >> If, on the other hand, the host's NTP correction is not supposed to >> >> propagate to the guest, >> > >> > This is optional. There is a module option to control this, in fact. >> > >> > Its nice to have, because then you can execute a guest without NTP >> > (say without network connection), and have a kvmclock (kvmclock is a >> > clocksource, not a guest system clock) which is NTP corrected. >> >> Can you point to how this works? I found kvm_guest_time_update, whch >> is called under circumstances that I haven't untangled. I can't >> really tell what it's trying to do. > > Documentation/virtual/kvm/timekeeping.txt. > That document is really long. I skimmed it and found nothing. >> In any case, this still seems much more convoluted than it has to be. 
>> In the case in which the host has a stable TSC (tsc is selected in the >> core timekeeping code, VCLOCK_TSC is set, etc), which is basically all >> the time on the last few generations of CPUs, then the core >> timekeeping code is already exposing a linear function that's supposed >> to be used for monotonic, cpu-local access to a corrected nanosecond >> counter. It's even in pretty much exactly the right form to pass >> through to the guest via pvclock in the gtod data. Why doesn't KVM >> pass it through verbatim, updated in real time? Is there some legacy >> reason that KVM must apply its own corrections and has to jump through >> hoops to pause vcpus when updating those vcpu's copies of the pvclock >> data? > > Read the comment on x86.c which starts with > " * > * Assuming a stable TSC across physical CPUS, and a stable TSC > * across virtual CPUs, the following condition is possible. > * Each numbered line represents an event visible to both > * CPUs at the next numbered event. > " A coup
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: > On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: > > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski > >> wrote: > >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini > >> > wrote: > >> >> > >> >> > >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >> >>> > RAW TSC NTP corrected TSC > >> >>> > t0 10 10 > >> >>> > t1 20 19.99 > >> >>> > t2 30 29.98 > >> >>> > t3 40 39.97 > >> >>> > t4 50 49.96 (1) > >> >>> > > >> >>> > ... > >> >>> > > >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >> >>> > you can see what will happen. > >> >>> > >> >>> Sure, but why would you ever switch from one to the other? > >> >> > >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> >> resume, the TSC certainly increases at the same rate as before, but the > >> >> raw TSC restarted counting from 0 and systemtime has increased slower > >> >> than the guest kvmclock. > >> > > >> > Wait, are we talking about the host's NTP or the guest's NTP? > >> > > >> > If it's the host's, then wouldn't systemtime be reset after resume to > >> > the NTP corrected value? If so, the guest wouldn't see time go > >> > backwards. > >> > > >> > If it's the guest's, then the guest's NTP correction is applied on top > >> > of kvmclock, and this shouldn't matter. > >> > > >> > I still feel like I'm missing something very basic here. > >> > > >> > >> OK, I think I get it. > >> > >> Marcelo, I thought that kvmclock was supposed to propagate the host's > >> correction to the guest. If it did, indeed, propagate the correction > >> then, after resume, the host's new system_time would match the guest's > >> idea of it (after accounting for the guest's long nap), and I don't > >> think there would be a problem. > >> That being said, I can't find the code in the masterclock stuff that > >> would actually do this. 
> > > > Guest clock is maintained by guest timekeeping code, which does: > > > > timer_interrupt() > > offset = read clocksource since last timer interrupt > > accumulate_to_systemclock(offset) > > > > The frequency correction of NTP in the host can be applied to > > kvmclock, which will be visible to the guest > > at "read clocksource since last timer interrupt" > > (kvmclock_clocksource_read function). > > pvclock_clocksource_read? That seems to do the same thing as all the > other clocksource access functions. > > > > > This does not mean that the NTP correction in the host is propagated > > to the guests system clock directly. > > > > (For example, the guest can run NTP which is free to do further > > adjustments at "accumulate_to_systemclock(offset)" time). > > Of course. But I expected that, in the absence of NTP on the guest, > that the guest would track the host's *corrected* time. > > > > >> If, on the other hand, the host's NTP correction is not supposed to > >> propagate to the guest, > > > > This is optional. There is a module option to control this, in fact. > > > > Its nice to have, because then you can execute a guest without NTP > > (say without network connection), and have a kvmclock (kvmclock is a > > clocksource, not a guest system clock) which is NTP corrected. > > Can you point to how this works? I found kvm_guest_time_update, whch > is called under circumstances that I haven't untangled. I can't > really tell what it's trying to do. Documentation/virtual/kvm/timekeeping.txt. > In any case, this still seems much more convoluted than it has to be. > In the case in which the host has a stable TSC (tsc is selected in the > core timekeeping code, VCLOCK_TSC is set, etc), which is basically all > the time on the last few generations of CPUs, then the core > timekeeping code is already exposing a linear function that's supposed > to be used for monotonic, cpu-local access to a corrected nanosecond > counter. 
It's even in pretty much exactly the right form to pass > through to the guest via pvclock in the gtod data. Why doesn't KVM > pass it through verbatim, updated in real time? Is there some legacy > reason that KVM must apply its own corrections and has to jump through > hoops to pause vcpus when updating those vcpu's copies of the pvclock > data? Read the comment on x86.c which starts with " * * Assuming a stable TSC across physical CPUS, and a stable TSC * across virtual CPUs, the following condition is possible. * Each numbered line represents an event visible to both * CPUs at the next numbered event. " > >> then shouldn't KVM just update system_time on > >> resume to whatever the guest would think it had (which I think would > >> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly > >> shifted by some per-guest constant offset). > >>
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote: >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> > wrote: >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >>> > RAW TSC NTP corrected TSC >> >>> > t0 10 10 >> >>> > t1 20 19.99 >> >>> > t2 30 29.98 >> >>> > t3 40 39.97 >> >>> > t4 50 49.96 >> >>> > >> >>> > ... >> >>> > >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >>> > you can see what will happen. >> >>> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> resume, the TSC certainly increases at the same rate as before, but the >> >> raw TSC restarted counting from 0 and systemtime has increased slower >> >> than the guest kvmclock. >> > >> > Wait, are we talking about the host's NTP or the guest's NTP? >> > >> > If it's the host's, then wouldn't systemtime be reset after resume to >> > the NTP corrected value? If so, the guest wouldn't see time go >> > backwards. >> > >> > If it's the guest's, then the guest's NTP correction is applied on top >> > of kvmclock, and this shouldn't matter. >> > >> > I still feel like I'm missing something very basic here. >> > >> >> OK, I think I get it. >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> correction to the guest. If it did, indeed, propagate the correction >> then, after resume, the host's new system_time would match the guest's >> idea of it (after accounting for the guest's long nap), and I don't >> think there would be a problem. >> That being said, I can't find the code in the masterclock stuff that >> would actually do this. 
> > Guest clock is maintained by guest timekeeping code, which does: > > timer_interrupt() > offset = read clocksource since last timer interrupt > accumulate_to_systemclock(offset) > > The frequency correction of NTP in the host can be applied to > kvmclock, which will be visible to the guest > at "read clocksource since last timer interrupt" > (kvmclock_clocksource_read function). pvclock_clocksource_read? That seems to do the same thing as all the other clocksource access functions. > > This does not mean that the NTP correction in the host is propagated > to the guests system clock directly. > > (For example, the guest can run NTP which is free to do further > adjustments at "accumulate_to_systemclock(offset)" time). Of course. But I expected that, in the absence of NTP on the guest, that the guest would track the host's *corrected* time. > >> If, on the other hand, the host's NTP correction is not supposed to >> propagate to the guest, > > This is optional. There is a module option to control this, in fact. > > Its nice to have, because then you can execute a guest without NTP > (say without network connection), and have a kvmclock (kvmclock is a > clocksource, not a guest system clock) which is NTP corrected. Can you point to how this works? I found kvm_guest_time_update, whch is called under circumstances that I haven't untangled. I can't really tell what it's trying to do. In any case, this still seems much more convoluted than it has to be. In the case in which the host has a stable TSC (tsc is selected in the core timekeeping code, VCLOCK_TSC is set, etc), which is basically all the time on the last few generations of CPUs, then the core timekeeping code is already exposing a linear function that's supposed to be used for monotonic, cpu-local access to a corrected nanosecond counter. It's even in pretty much exactly the right form to pass through to the guest via pvclock in the gtod data. Why doesn't KVM pass it through verbatim, updated in real time? 
Is there some legacy reason that KVM must apply its own corrections
and has to jump through hoops to pause vcpus when updating those
vcpus' copies of the pvclock data?

>
>> then shouldn't KVM just update system_time on
>> resume to whatever the guest would think it had (which I think would
>> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> shifted by some per-guest constant offset).
>>
>> --Andy
>
> Sure, you could add a correction to compensate and make sure
> the guest clock does not see time backwards.
>

Could you help do that? You understand the code far better than I do.

As it stands, it simply doesn't work on any system that suspends and
resumes (unless maybe the system has the upcoming Intel ART feature,
and I have no clue when that'll show up).

--Andy
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: > On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote: > > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote: > >> > >> > >> On 14/12/2015 23:31, Andy Lutomirski wrote: > >>> > RAW TSC NTP corrected TSC > >>> > t0 10 10 > >>> > t1 20 19.99 > >>> > t2 30 29.98 > >>> > t3 40 39.97 > >>> > t4 50 49.96 > >>> > > >>> > ... > >>> > > >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, > >>> > you can see what will happen. > >>> > >>> Sure, but why would you ever switch from one to the other? > >> > >> The guest uses the raw TSC and systemtime = 0 until suspend. After > >> resume, the TSC certainly increases at the same rate as before, but the > >> raw TSC restarted counting from 0 and systemtime has increased slower > >> than the guest kvmclock. > > > > Wait, are we talking about the host's NTP or the guest's NTP? > > > > If it's the host's, then wouldn't systemtime be reset after resume to > > the NTP corrected value? If so, the guest wouldn't see time go > > backwards. > > > > If it's the guest's, then the guest's NTP correction is applied on top > > of kvmclock, and this shouldn't matter. > > > > I still feel like I'm missing something very basic here. > > > > OK, I think I get it. > > Marcelo, I thought that kvmclock was supposed to propagate the host's > correction to the guest. If it did, indeed, propagate the correction > then, after resume, the host's new system_time would match the guest's > idea of it (after accounting for the guest's long nap), and I don't > think there would be a problem. > That being said, I can't find the code in the masterclock stuff that > would actually do this. 
Guest clock is maintained by guest timekeeping code, which does:

timer_interrupt()
    offset = read clocksource since last timer interrupt
    accumulate_to_systemclock(offset)

The frequency correction of NTP in the host can be applied to
kvmclock, which will be visible to the guest
at "read clocksource since last timer interrupt"
(kvmclock_clocksource_read function).

This does not mean that the NTP correction in the host is propagated
to the guest's system clock directly.

(For example, the guest can run NTP, which is free to do further
adjustments at "accumulate_to_systemclock(offset)" time).

> If, on the other hand, the host's NTP correction is not supposed to
> propagate to the guest,

This is optional. There is a module option to control this, in fact.

It's nice to have, because then you can execute a guest without NTP
(say without network connection), and have a kvmclock (kvmclock is a
clocksource, not a guest system clock) which is NTP corrected.

> then shouldn't KVM just update system_time on
> resume to whatever the guest would think it had (which I think would
> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
> shifted by some per-guest constant offset).
>
> --Andy

Sure, you could add a correction to compensate and make sure
the guest clock does not see time backwards.
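The guest-side loop described in this message can be sketched in C. This is a minimal illustration, not kernel code: `guest_clock_sketch`, `timer_interrupt_sketch` and the `guest_adj_ppm` parameter are hypothetical names, and the parts-per-million adjustment only stands in for whatever correction a guest NTP daemon might apply at accumulation time.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest timekeeping state; the names are illustrative only. */
struct guest_clock_sketch {
    uint64_t last_cs;   /* clocksource reading at the previous timer interrupt */
    uint64_t system_ns; /* guest system clock, built up by accumulation */
};

/*
 * One timer interrupt:
 *   offset = read clocksource since last timer interrupt
 *   accumulate_to_systemclock(offset)
 * cs_now is the current clocksource reading (kvmclock in the thread), so a
 * host-side frequency correction is already folded into it. guest_adj_ppm
 * models a further guest-side adjustment and is an assumption of this sketch.
 */
static void timer_interrupt_sketch(struct guest_clock_sketch *g,
                                   uint64_t cs_now, int64_t guest_adj_ppm)
{
    assert(cs_now >= g->last_cs); /* the clocksource must not run backwards */

    uint64_t offset = cs_now - g->last_cs;
    int64_t adjust = (int64_t)offset * guest_adj_ppm / 1000000;

    g->last_cs = cs_now;
    /* Unsigned wraparound makes a negative adjust subtract as intended. */
    g->system_ns += offset + (uint64_t)adjust;
}
```

With `guest_adj_ppm` set to 0 the guest system clock simply follows the clocksource, which corresponds to a guest that runs no NTP of its own.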
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 02:31:10PM -0800, Andy Lutomirski wrote: > On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti wrote: > > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote: > >> > >> > >> On 11/12/2015 22:57, Andy Lutomirski wrote: > >> > I'm still not seeing the issue. > >> > > >> > The formula is: > >> > > >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >> > >> > pvti->tsc_shift) + pvti->system_time > >> > > >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value > >> > after suspend/resume, you would also need to update system_time. > >> > > >> > I don't see what this has to do with suspend/resume or with whether > >> > the effective scale factor is greater than or less than one. The only > >> > suspend/resume interaction I can see is that, if the host allows the > >> > guest-observed TSC value to jump (which is arguably a bug, what that's > >> > not important here), it needs to update pvti before resuming the > >> > guest. > >> > >> Which is not an issue, since freezing obviously gets all CPUs out of > >> guest mode. > >> > >> Marcelo, can you provide an example with made-up values for tsc and pvti? > > > > I meant "systemtime" at ^. > > > > guest visible clock = systemtime (updated at time 0, guest initialization) > > + scaled tsc reads=LARGE VALUE. > > ^^ > > guest reads clock to memory at location A = scaled tsc read. > > -> suspend resume event > > guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled > > tsc reads=0. > > guest reads clock to memory at location B. > > > > So before the suspend/resume event, the clock is the RAW TSC values > > (scaled by kvmclock, but the frequency of the RAW TSC). > > > > After suspend/resume event, the clock is updated from the host > > via get_kernel_ns(), which reads the corrected NTP frequency TSC. > > > > So you switch the timebase, from a clock running at a given frequency, > > to a clock running at another frequency (effective frequency).
> >
> > Example:
> >
> >       RAW TSC   NTP corrected TSC
> > t0    10        10
> > t1    20        19.99
> > t2    30        29.98
> > t3    40        39.97
> > t4    50        49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
>
> Sure, but why would you ever switch from one to the other?

Because that's what happens when you ask kvmclock to update from system
time (which is a reliable clock, resistant to suspend/resume issues).

> I'm still not seeing the scenario under which this discontinuity is
> visible to anything other than the kvmclock code itself.

Host userspace can see it if it uses TSC and clock_gettime()
and expects them to run hand in hand.

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.
>
> --Andy

clock_gettime should be monotonic as well.
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote: > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote: >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >>> > RAW TSC NTP corrected TSC >>> > t0 10 10 >>> > t1 20 19.99 >>> > t2 30 29.98 >>> > t3 40 39.97 >>> > t4 50 49.96 >>> > >>> > ... >>> > >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >>> > you can see what will happen. >>> >>> Sure, but why would you ever switch from one to the other? >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> resume, the TSC certainly increases at the same rate as before, but the >> raw TSC restarted counting from 0 and systemtime has increased slower >> than the guest kvmclock. > > Wait, are we talking about the host's NTP or the guest's NTP? > > If it's the host's, then wouldn't systemtime be reset after resume to > the NTP corrected value? If so, the guest wouldn't see time go > backwards. > > If it's the guest's, then the guest's NTP correction is applied on top > of kvmclock, and this shouldn't matter. > > I still feel like I'm missing something very basic here. > OK, I think I get it. Marcelo, I thought that kvmclock was supposed to propagate the host's correction to the guest. If it did, indeed, propagate the correction then, after resume, the host's new system_time would match the guest's idea of it (after accounting for the guest's long nap), and I don't think there would be a problem. That being said, I can't find the code in the masterclock stuff that would actually do this. If, on the other hand, the host's NTP correction is not supposed to propagate to the guest, then shouldn't KVM just update system_time on resume to whatever the guest would think it had (which I think would be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly shifted by some per-guest constant offset). 
--Andy
Re: kvmclock doesn't work, help?
On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote:
>
>
> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >       RAW TSC   NTP corrected TSC
>> > t0    10        10
>> > t1    20        19.99
>> > t2    30        29.98
>> > t3    40        39.97
>> > t4    50        49.96
>> >
>> > ...
>> >
>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> > you can see what will happen.
>>
>> Sure, but why would you ever switch from one to the other?
>
> The guest uses the raw TSC and systemtime = 0 until suspend. After
> resume, the TSC certainly increases at the same rate as before, but the
> raw TSC restarted counting from 0 and systemtime has increased slower
> than the guest kvmclock.

Wait, are we talking about the host's NTP or the guest's NTP?

If it's the host's, then wouldn't systemtime be reset after resume to
the NTP corrected value? If so, the guest wouldn't see time go
backwards.

If it's the guest's, then the guest's NTP correction is applied on top
of kvmclock, and this shouldn't matter.

I still feel like I'm missing something very basic here.

--Andy
Re: kvmclock doesn't work, help?
On 14/12/2015 23:31, Andy Lutomirski wrote:
> >       RAW TSC   NTP corrected TSC
> > t0    10        10
> > t1    20        19.99
> > t2    30        29.98
> > t3    40        39.97
> > t4    50        49.96
> >
> > ...
> >
> > if you suddenly switch from RAW TSC to NTP corrected TSC,
> > you can see what will happen.
>
> Sure, but why would you ever switch from one to the other?

The guest uses the raw TSC and systemtime = 0 until suspend. After
resume, the TSC certainly increases at the same rate as before, but the
raw TSC restarted counting from 0 and systemtime has increased slower
than the guest kvmclock.

Paolo

> The only things that need to be monotonic are the output from
> vread_pvclock and the in-kernel equivalent, I think.
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti wrote: > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote: >> >> >> On 11/12/2015 22:57, Andy Lutomirski wrote: >> > I'm still not seeing the issue. >> > >> > The formula is: >> > >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >> >> > pvti->tsc_shift) + pvti->system_time >> > >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value >> > after suspend/resume, you would also need to update system_time. >> > >> > I don't see what this has to do with suspend/resume or with whether >> > the effective scale factor is greater than or less than one. The only >> > suspend/resume interaction I can see is that, if the host allows the >> > guest-observed TSC value to jump (which is arguably a bug, what that's >> > not important here), it needs to update pvti before resuming the >> > guest. >> >> Which is not an issue, since freezing obviously gets all CPUs out of >> guest mode. >> >> Marcelo, can you provide an example with made-up values for tsc and pvti? > > I meant "systemtime" at ^. > > guest visible clock = systemtime (updated at time 0, guest initialization) + > scaled tsc reads=LARGE VALUE. > ^^ > guest reads clock to memory at location A = scaled tsc read. > -> suspend resume event > guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc > reads=0. > guest reads clock to memory at location B. > > So before the suspend/resume event, the clock is the RAW TSC values > (scaled by kvmclock, but the frequency of the RAW TSC). > > After suspend/resume event, the clock is updated from the host > via get_kernel_ns(), which reads the corrected NTP frequency TSC. > > So you switch the timebase, from a clock running at a given frequency, > to a clock running at another frequency (effective frequency). > > Example: > > RAW TSC NTP corrected TSC > t0 10 10 > t1 20 19.99 > t2 30 29.98 > t3 40 39.97 > t4 50 49.96 > > ... 
>
> if you suddenly switch from RAW TSC to NTP corrected TSC,
> you can see what will happen.

Sure, but why would you ever switch from one to the other?

I'm still not seeing the scenario under which this discontinuity is
visible to anything other than the kvmclock code itself.

The only things that need to be monotonic are the output from
vread_pvclock and the in-kernel equivalent, I think.

--Andy
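The arithmetic behind the quoted table can be written out as a small C sketch. Values are kept in hundredths of a unit, so 19.99 becomes 1999 and no floating point is needed. All function names are hypothetical, and the compensated variant simply adds a constant offset at the switch point; it is an illustration of the idea, not a description of what KVM does.

```c
#include <assert.h>
#include <stdint.h>

/* Table values in hundredths: t0 = 10.00 is 1000, t1 corrected = 19.99 is 1999. */
static int64_t raw_clock(int tick)
{
    assert(tick >= 0);
    return 1000 + 1000 * (int64_t)tick;
}

static int64_t corrected_clock(int tick)
{
    assert(tick >= 0);
    return 1000 + 999 * (int64_t)tick;
}

/*
 * Difference seen by two back-to-back reads when the timebase is switched
 * from raw to corrected at `tick`. A negative result means the second read
 * is smaller than the first, i.e. time appears to go backwards.
 */
static int64_t switch_jump(int tick)
{
    return corrected_clock(tick) - raw_clock(tick);
}

/*
 * Hypothetical compensation: keep following the corrected rate after the
 * switch, but add the gap measured at the switch point as a constant offset.
 */
static int64_t compensated_clock(int tick, int switch_tick)
{
    if (tick < switch_tick)
        return raw_clock(tick);
    return corrected_clock(tick) - switch_jump(switch_tick);
}
```

Here `switch_jump(4)` evaluates to -4, which is the 0.04 gap between 50 and 49.96 in the t4 row, while `compensated_clock(4, 4)` stays at 5000 and then advances at the corrected rate.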
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote:
>
>
> On 11/12/2015 22:57, Andy Lutomirski wrote:
> > I'm still not seeing the issue.
> >
> > The formula is:
> >
> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> > pvti->tsc_shift) + pvti->system_time
> >
> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> > after suspend/resume, you would also need to update system_time.
> >
> > I don't see what this has to do with suspend/resume or with whether
> > the effective scale factor is greater than or less than one. The only
> > suspend/resume interaction I can see is that, if the host allows the
> > guest-observed TSC value to jump (which is arguably a bug, but that's
> > not important here), it needs to update pvti before resuming the
> > guest.
>
> Which is not an issue, since freezing obviously gets all CPUs out of
> guest mode.
>
> Marcelo, can you provide an example with made-up values for tsc and pvti?

I meant "systemtime" at ^.

guest visible clock = systemtime (updated at time 0, guest initialization) + scaled tsc reads=LARGE VALUE.
                      ^^
guest reads clock to memory at location A = scaled tsc read.
-> suspend resume event
guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc reads=0.
guest reads clock to memory at location B.

So before the suspend/resume event, the clock is the RAW TSC values
(scaled by kvmclock, but the frequency of the RAW TSC).

After suspend/resume event, the clock is updated from the host
via get_kernel_ns(), which reads the corrected NTP frequency TSC.

So you switch the timebase, from a clock running at a given frequency,
to a clock running at another frequency (effective frequency).

Example:

        RAW TSC         NTP corrected TSC
t0      10              10
t1      20              19.99
t2      30              29.98
t3      40              39.97
t4      50              49.96

...

if you suddenly switch from RAW TSC to NTP corrected TSC,
you can see what will happen.

Does that make sense?

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc
> > reads=0.
>
> > Can you clarify concretely what goes wrong here?
> >
> > (I'm also at a bit of a loss as to why this needs both system_time and
> > tsc_timestamp. They're redundant in the sense that you could set
> > tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> > tsc_shift from system_time without changing the result of the
> > calculation.)
>
> You would have to ensure that all elements of pvti are rounded correctly
> whenever the base TSC is updated. Doable, but it does seem simpler to
> keep subtract-TSC and add-nanoseconds separate.
>
> Paolo
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 10:07:21AM -0800, Andy Lutomirski wrote: > On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti wrote: > > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: > >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti > >> wrote: > >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > >> >> I'm trying to clean up kvmclock and I can't get it to work at all. My > >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > >> >> > >> >> If I boot an SMP (2 vcpus) guest, tracing says: > >> >> > >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 0 > >> >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> > >> >> > >> >> If I boot a UP guest, tracing says: > >> >> > >> >> qemu-system-x86-2567 [001] 
102370.447484: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > >> >> masterclock 0 hostclock tsc offsetmatched 1 > >> >> > >> >> I suspect, but I haven't verified, that this is fallout from: > >> >> > >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > >> >> Author: Marcelo Tosatti > >> >> Date: Wed May 14 12:43:24 2014 -0300 > >> >> > >> >> KVM: x86: disable master clock if TSC is reset during suspend > >> >> > >> >> Updating system_time from the kernel clock once master clock > >> >> has been enabled can result in time backwards event, in case > >> >> kernel clock frequency is lower than TSC frequency. > >> >> > >> >> Disable master clock in case it is necessary to update it > >> >> from the resume path. > >> >> > >> >> Signed-off-by: Marcelo Tosatti > >> >> Signed-off-by: Paolo Bonzini > >> >> > >> >> > >> >> Can we please stop making kvmclock more complex? It's a beast right > >> >> now, and not in a good way. It's far too tangled with the vclock > >> >> machinery on both the host and guest sides, the pvclock stuff is not > >> >> well thought out (even in principle in an ABI sense), and it's never > >> >> been clear to my what problem exactly the kvmclock stuff is supposed > >> >> to solve. > >> >> > >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and > >> >> start over. A correctly functioning KVM guest using TSC (i.e. > >> >> ignoring kvmclock entirely) > >> >> seems to work rather more reliably and > >> >> considerably faster than a kvmclock guest. > >> >> > >> >> --Andy > >> >> > >> >> -- > >> >> Andy Lutomirski > >> >> AMA Capital Management, LLC > >> > > >> > Andy, > >> > > >> > I am all for solving practical problems rather than pleasing aesthetic > >> > pleasure. 
> >> > > >> >> Updating system_time from the kernel clock once master clock > >> >> has been enabled can result in time backwards event, in case > >> >> kernel clock frequency is lower than TSC frequency. > >> >> > >> >> Disable master clock in case it is necessary to update it > >> >> from the resume path. > >> > > >> >> once master clock > >> >> has been enabled can result in time backwards event, in case > >> >> kernel clock frequency is lower than TSC frequency. > >> > > >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc > >> > reads. > >> > > >> > If the effective frequency of the kernel clock is lower (for example > >> > due to NTP correcting the TSC frequency of the system), and you resume > >> > and update the system, the following happens: > >> > > >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc > >> > reads=LARGE VALUE. > > > > guest reads clock to memory at location A = scaled tsc read. > > > > (note TSC is counting at frequency higher than advertised by > > processor, thats why NTP has
Re: kvmclock doesn't work, help?
On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti wrote: > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti wrote: >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: >> >> I'm trying to clean up kvmclock and I can't get it to work at all. My >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. >> >> >> >> If I boot an SMP (2 vcpus) guest, tracing says: >> >> >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 0 >> >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> >> >> >> If I boot a UP guest, tracing says: >> >> >> >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2571 [002] 102370.447688: 
kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> >> I suspect, but I haven't verified, that this is fallout from: >> >> >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f >> >> Author: Marcelo Tosatti >> >> Date: Wed May 14 12:43:24 2014 -0300 >> >> >> >> KVM: x86: disable master clock if TSC is reset during suspend >> >> >> >> Updating system_time from the kernel clock once master clock >> >> has been enabled can result in time backwards event, in case >> >> kernel clock frequency is lower than TSC frequency. >> >> >> >> Disable master clock in case it is necessary to update it >> >> from the resume path. >> >> >> >> Signed-off-by: Marcelo Tosatti >> >> Signed-off-by: Paolo Bonzini >> >> >> >> >> >> Can we please stop making kvmclock more complex? It's a beast right >> >> now, and not in a good way. It's far too tangled with the vclock >> >> machinery on both the host and guest sides, the pvclock stuff is not >> >> well thought out (even in principle in an ABI sense), and it's never >> >> been clear to my what problem exactly the kvmclock stuff is supposed >> >> to solve. >> >> >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> >> start over. A correctly functioning KVM guest using TSC (i.e. >> >> ignoring kvmclock entirely) >> >> seems to work rather more reliably and >> >> considerably faster than a kvmclock guest. >> >> >> >> --Andy >> >> >> >> -- >> >> Andy Lutomirski >> >> AMA Capital Management, LLC >> > >> > Andy, >> > >> > I am all for solving practical problems rather than pleasing aesthetic >> > pleasure. >> > >> >> Updating system_time from the kernel clock once master clock >> >> has been enabled can result in time backwards event, in case >> >> kernel clock frequency is lower than TSC frequency. >> >> >> >> Disable master clock in case it is necessary to update it >> >> from the resume path. 
>> > >> >> once master clock >> >> has been enabled can result in time backwards event, in case >> >> kernel clock frequency is lower than TSC frequency. >> > >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. >> > >> > If the effective frequency of the kernel clock is lower (for example >> > due to NTP correcting the TSC frequency of the system), and you resume >> > and update the system, the following happens: >> > >> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc >> > reads=LARGE VALUE. > > guest reads clock to memory at location A = scaled tsc read. > > (note TSC is counting at frequency higher than advertised by > processor, thats why NTP has to "slow down" the kernel clock > which is maintained by successive reads of the TSC). > >> > suspend/resume event. >> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc >> > reads=0. > > Now the guest visible clock contains a tsc_timestamp that has been > corrected
Re: kvmclock doesn't work, help?
On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: > On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti wrote: > > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > >> I'm trying to clean up kvmclock and I can't get it to work at all. My > >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > >> > >> If I boot an SMP (2 vcpus) guest, tracing says: > >> > >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 0 > >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> > >> > >> If I boot a UP guest, tracing says: > >> > >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > >> masterclock 0 hostclock tsc offsetmatched 1 > >> > >> I 
suspect, but I haven't verified, that this is fallout from: > >> > >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > >> Author: Marcelo Tosatti > >> Date: Wed May 14 12:43:24 2014 -0300 > >> > >> KVM: x86: disable master clock if TSC is reset during suspend > >> > >> Updating system_time from the kernel clock once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > >> > >> Disable master clock in case it is necessary to update it > >> from the resume path. > >> > >> Signed-off-by: Marcelo Tosatti > >> Signed-off-by: Paolo Bonzini > >> > >> > >> Can we please stop making kvmclock more complex? It's a beast right > >> now, and not in a good way. It's far too tangled with the vclock > >> machinery on both the host and guest sides, the pvclock stuff is not > >> well thought out (even in principle in an ABI sense), and it's never > >> been clear to my what problem exactly the kvmclock stuff is supposed > >> to solve. > >> > >> I'm somewhat tempted to suggest that we delete kvmclock entirely and > >> start over. A correctly functioning KVM guest using TSC (i.e. > >> ignoring kvmclock entirely) > >> seems to work rather more reliably and > >> considerably faster than a kvmclock guest. > >> > >> --Andy > >> > >> -- > >> Andy Lutomirski > >> AMA Capital Management, LLC > > > > Andy, > > > > I am all for solving practical problems rather than pleasing aesthetic > > pleasure. > > > >> Updating system_time from the kernel clock once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > >> > >> Disable master clock in case it is necessary to update it > >> from the resume path. > > > >> once master clock > >> has been enabled can result in time backwards event, in case > >> kernel clock frequency is lower than TSC frequency. > > > > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. 
> >
> > If the effective frequency of the kernel clock is lower (for example
> > due to NTP correcting the TSC frequency of the system), and you resume
> > and update the system, the following happens:
> >
> > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc
> > reads=LARGE VALUE.

guest reads clock to memory at location A = scaled tsc read.

(note TSC is counting at frequency higher than advertised by
processor, that's why NTP has to "slow down" the kernel clock
which is maintained by successive reads of the TSC).

> > suspend/resume event.
> > guest visible clock = tsc_timestamp (updated at time N) + scaled tsc
> > reads=0.

Now the guest visible clock contains a tsc_timestamp that has been
corrected by NTP, over say 5 days. So the tiny NTP correction has added
up to something significant.

guest reads clock to memory at location B = reads tsc_timestamp.

Clock v
Re: kvmclock doesn't work, help?
On 11/12/2015 22:57, Andy Lutomirski wrote:
> I'm still not seeing the issue.
>
> The formula is:
>
> (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
> pvti->tsc_shift) + pvti->system_time
>
> Obviously, if you reset pvti->tsc_timestamp to the current tsc value
> after suspend/resume, you would also need to update system_time.
>
> I don't see what this has to do with suspend/resume or with whether
> the effective scale factor is greater than or less than one. The only
> suspend/resume interaction I can see is that, if the host allows the
> guest-observed TSC value to jump (which is arguably a bug, but that's
> not important here), it needs to update pvti before resuming the
> guest.

Which is not an issue, since freezing obviously gets all CPUs out of
guest mode.

Marcelo, can you provide an example with made-up values for tsc and pvti?

> Can you clarify concretely what goes wrong here?
>
> (I'm also at a bit of a loss as to why this needs both system_time and
> tsc_timestamp. They're redundant in the sense that you could set
> tsc_timestamp to zero and subtract (tsc_timestamp * tsc_to_system_mul) >>
> tsc_shift from system_time without changing the result of the
> calculation.)

You would have to ensure that all elements of pvti are rounded correctly
whenever the base TSC is updated. Doable, but it does seem simpler to
keep subtract-TSC and add-nanoseconds separate.

Paolo
Re: kvmclock doesn't work, help?
On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti wrote: > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: >> I'm trying to clean up kvmclock and I can't get it to work at all. My >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. >> >> If I boot an SMP (2 vcpus) guest, tracing says: >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 0 >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> If I boot a UP guest, tracing says: >> >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> >> I suspect, but I haven't verified, that this is fallout from: >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f >> Author: Marcelo Tosatti >> 
Date: Wed May 14 12:43:24 2014 -0300 >> >> KVM: x86: disable master clock if TSC is reset during suspend >> >> Updating system_time from the kernel clock once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. >> >> Disable master clock in case it is necessary to update it >> from the resume path. >> >> Signed-off-by: Marcelo Tosatti >> Signed-off-by: Paolo Bonzini >> >> >> Can we please stop making kvmclock more complex? It's a beast right >> now, and not in a good way. It's far too tangled with the vclock >> machinery on both the host and guest sides, the pvclock stuff is not >> well thought out (even in principle in an ABI sense), and it's never >> been clear to my what problem exactly the kvmclock stuff is supposed >> to solve. >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> start over. A correctly functioning KVM guest using TSC (i.e. >> ignoring kvmclock entirely) >> seems to work rather more reliably and >> considerably faster than a kvmclock guest. >> >> --Andy >> >> -- >> Andy Lutomirski >> AMA Capital Management, LLC > > Andy, > > I am all for solving practical problems rather than pleasing aesthetic > pleasure. > >> Updating system_time from the kernel clock once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. >> >> Disable master clock in case it is necessary to update it >> from the resume path. > >> once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. > > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. 
>
> If the effective frequency of the kernel clock is lower (for example
> due to NTP correcting the TSC frequency of the system), and you resume
> and update the system, the following happens:
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc
> reads=LARGE VALUE.
> suspend/resume event.
> guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
>

I'm still not seeing the issue.

The formula is:

(((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
pvti->tsc_shift) + pvti->system_time

Obviously, if you reset pvti->tsc_timestamp to the current tsc value
after suspend/resume, you would also need to update system_time.

I don't see what this has to do with suspend/resume or with whether
the effective scale factor is greater than or less than one. The only
suspend/resume interaction I can see is that, if the host allows the
guest-observed TSC value to jump (which is arguably a bug, but that's
not important here), it needs to update pvti before resuming the
guest.

Can you clarify concretely what goes wrong here?

(I'm also at a bit of a loss as to why this needs both system_time and tsc_ti
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > I'm trying to clean up kvmclock and I can't get it to work at all. My > host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > > If I boot an SMP (2 vcpus) guest, tracing says: > > qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > > If I boot a UP guest, tracing says: > > qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > I suspect, but I haven't verified, that this is fallout from: > > commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > Author: Marcelo Tosatti > Date: Wed May 14 12:43:24 2014 -0300 > > KVM: x86: disable master clock if TSC is reset during 
suspend
>
> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
>
> Disable master clock in case it is necessary to update it
> from the resume path.
>
> Signed-off-by: Marcelo Tosatti
> Signed-off-by: Paolo Bonzini
>
>
> Can we please stop making kvmclock more complex? It's a beast right
> now, and not in a good way. It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.
>
>
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over. A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.
>
> --Andy

Users can do that, if they want. "clocksource=tsc" kernel option.
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 02:27:36PM -0800, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote:
> >
> >
> > On 09/12/2015 22:49, Andy Lutomirski wrote:
> >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
> >>>
> >>>
> >>> On 09/12/2015 22:10, Andy Lutomirski wrote:
> >>>> Can we please stop making kvmclock more complex? It's a beast right
> >>>> now, and not in a good way. It's far too tangled with the vclock
> >>>> machinery on both the host and guest sides, the pvclock stuff is not
> >>>> well thought out (even in principle in an ABI sense), and it's never
> >>>> been clear to me what problem exactly the kvmclock stuff is supposed
> >>>> to solve.
> >>>
> >>> It's supposed to solve the problem that:
> >>>
> >>> - not all hosts have a working TSC
> >>
> >> Fine, but we don't need any vdso integration for that.
> >
> > Well, you still want a fast time source. That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
>
> >
> >>> - even if they all do, virtual machines can be migrated (or
> >>> saved/restored) to a host with a different TSC frequency
> >>>
> >>> - any MMIO- or PIO-based mechanism to access the current time is orders
> >>> of magnitude slower than the TSC and less precise too.
> >>
> >> Yup. But TSC by itself gets that benefit, too.
> >
> > Yes, the problem is if you want to solve all three of them. The first
> > two are solved by the ACPI PM timer with a decent resolution (70
> > ns---much faster anyway than an I/O port access). The third is solved
> > by TSC. To solve all three, you need kvmclock.
>
> Still confused. Is kvmclock really used in cases where even the host
> can't pull off working TSC?
>
> >
> >>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> >>>> start over. A correctly functioning KVM guest using TSC (i.e.
> >>>> ignoring kvmclock entirely) seems to work rather more reliably and
> >>>> considerably faster than a kvmclock guest.
> >>>
> >>> If all your hosts have a working TSC and you don't do migration or
> >>> save/restore, that's a valid configuration. It's not a good default,
> >>> however.
> >>
> >> Er?
> >>
> >> kvmclock is still really quite slow and buggy.
> >
> > Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> > shouldn't even with vdso disabled, it's definitely not slower than PIO.
> >
> >> And the patch I identified is definitely a problem here:
> >>
> >> [ 136.131241] KVM: disabling fast timing permanently due to inability
> >> to recover from suspend
> >>
> >> I got that on the host with this whitespace-damaged patch:
> >>
> >> if (backwards_tsc) {
> >> u64 delta_cyc = max_tsc - local_tsc;
> >> + if (!backwards_tsc_observed)
> >> + pr_warn("KVM: disabling fast timing
> >> permanently due to inability to recover from suspend\n");
> >>
> >> when I suspended and resumed.
> >>
> >> Can anyone explain what problem
> >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On
> >> brief inspection, it just seems to be incorrect. Shouldn't KVM's
> >> normal TSC logic handle that case right? After all, all vcpus should
> >> be paused when we resume from suspend. At worst, we should just need
> >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually,
> >> shouldn't we do that regardless of which way the TSC jumped on
> >> suspend/resume? After all, the TSC-to-wall-clock offset is quite
> >> likely to change except on the very small handful of CPUs (if any)
> >> that keep the TSC running in S3 and hibernate.)
> >
> > I don't recall the details of that patch, so Marcelo will have to answer
> > this, or Alex too since he chimed in the original thread. At least it
> > should be made conditional on the existence of a VM at suspend time (and
> > the master clock stuff should be made per VM, as I suggested at
> > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
> >
> > It would indeed be great if the master clock could be dropped. But I'm
> > definitely missing some of the subtle details. :(
>
> Me, too.
>
> Anyway, see the attached untested patch. Marcelo?
>
> --Andy

Read the last email, about the problem.
Re: kvmclock doesn't work, help?
On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: > I'm trying to clean up kvmclock and I can't get it to work at all. My > host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. > > If I boot an SMP (2 vcpus) guest, tracing says: > > qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 0 > qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc > qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: > vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: > vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc > qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > > If I boot a UP guest, tracing says: > > qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: > masterclock 0 hostclock tsc offsetmatched 1 > > I suspect, but I haven't verified, that this is fallout from: > > commit 16a9602158861687c78b6de6dc6a79e6e8a9136f > Author: Marcelo Tosatti > Date: Wed May 14 12:43:24 2014 -0300 > > KVM: x86: disable master clock if TSC is reset during 
suspend
>
> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
>
> Disable master clock in case it is necessary to update it
> from the resume path.
>
> Signed-off-by: Marcelo Tosatti
> Signed-off-by: Paolo Bonzini
>
>
> Can we please stop making kvmclock more complex? It's a beast right
> now, and not in a good way. It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.
>
> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over. A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.
>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

Andy,

I am all for solving practical problems rather than pleasing aesthetic
pleasure.

> Updating system_time from the kernel clock once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.
>
> Disable master clock in case it is necessary to update it
> from the resume path.

> once master clock
> has been enabled can result in time backwards event, in case
> kernel clock frequency is lower than TSC frequency.

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads.

If the effective frequency of the kernel clock is lower (for example
due to NTP correcting the TSC frequency of the system), and you resume
and update the system, the following happens:

guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads=LARGE VALUE.
suspend/resume event.
guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote:
>>
>>
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
>>>>
>>>>
>>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>>> Can we please stop making kvmclock more complex? It's a beast right
>>>>> now, and not in a good way. It's far too tangled with the vclock
>>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>>> well thought out (even in principle in an ABI sense), and it's never
>>>>> been clear to me what problem exactly the kvmclock stuff is supposed
>>>>> to solve.
>>>>
>>>> It's supposed to solve the problem that:
>>>>
>>>> - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source. That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.
>
>>
>>>> - even if they all do, virtual machines can be migrated (or
>>>> saved/restored) to a host with a different TSC frequency
>>>>
>>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>>> of magnitude slower than the TSC and less precise too.
>>>
>>> Yup. But TSC by itself gets that benefit, too.
>>
>> Yes, the problem is if you want to solve all three of them. The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access). The third is solved
>> by TSC. To solve all three, you need kvmclock.
>
> Still confused. Is kvmclock really used in cases where even the host
> can't pull off a working TSC?
>
>>
>>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>>> start over. A correctly functioning KVM guest using TSC (i.e.
>>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>>> considerably faster than a kvmclock guest.
>>>>
>>>> If all your hosts have a working TSC and you don't do migration or
>>>> save/restore, that's a valid configuration. It's not a good default,
>>>> however.
>>>
>>> Er?
>>>
>>> kvmclock is still really quite slow and buggy.
>>
>> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
>> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>>
>>> And the patch I identified is definitely a problem here:
>>>
>>> [ 136.131241] KVM: disabling fast timing permanently due to inability
>>> to recover from suspend
>>>
>>> I got that on the host with this whitespace-damaged patch:
>>>
>>> if (backwards_tsc) {
>>> u64 delta_cyc = max_tsc - local_tsc;
>>> + if (!backwards_tsc_observed)
>>> + pr_warn("KVM: disabling fast timing
>>> permanently due to inability to recover from suspend\n");
>>>
>>> when I suspended and resumed.
>>>
>>> Can anyone explain what problem
>>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On
>>> brief inspection, it just seems to be incorrect. Shouldn't KVM's
>>> normal TSC logic handle that case right? After all, all vcpus should
>>> be paused when we resume from suspend. At worst, we should just need
>>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually,
>>> shouldn't we do that regardless of which way the TSC jumped on
>>> suspend/resume? After all, the TSC-to-wall-clock offset is quite
>>> likely to change except on the very small handful of CPUs (if any)
>>> that keep the TSC running in S3 and hibernate.)
>>
>> I don't recall the details of that patch, so Marcelo will have to answer
>> this, or Alex too since he chimed in the original thread. At least it
>> should be made conditional on the existence of a VM at suspend time (and
>> the master clock stuff should be made per VM, as I suggested at
>> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>>
>> It would indeed be great if the master clock could be dropped. But I'm
>> definitely missing some of the subtle details. :(
>
> Me, too.
>
> Anyway, see the attached untested patch. Marcelo?

That patch seems to work. I have valid timing before and after host
suspend. When I suspend and resume the host with a running guest, I
get:

[ 26.504071] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 26.505253] clocksource: 'kvm-clock' wd_now: 66744c542 wd_last: 564b09794 mask:
[ 26.506436] clocksource: 'tsc' cs_now: fee310b133c8 cs_last: cf5d0b952 mask:

in the guest, which is arguably correct. KVM could be further
improved to update the tsc offset after suspend/resume to get rid of
that artifact.

--Andy
Re: kvmclock doesn't work, help?
On 09/12/2015 23:27, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote:
>> On 09/12/2015 22:49, Andy Lutomirski wrote:
>>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
>>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>>> Can we please stop making kvmclock more complex? It's a beast right
>>>>> now, and not in a good way. It's far too tangled with the vclock
>>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>>> well thought out (even in principle in an ABI sense), and it's never
>>>>> been clear to me what problem exactly the kvmclock stuff is supposed
>>>>> to solve.
>>>>
>>>> It's supposed to solve the problem that:
>>>>
>>>> - not all hosts have a working TSC
>>>
>>> Fine, but we don't need any vdso integration for that.
>>
>> Well, you still want a fast time source. That was a given. :)
>
> If the host can't figure out how to give *itself* a fast time source,
> I'd be surprised if KVM can manage to give the guest a fast, reliable
> time source.

There's no vdso integration unless the host has a constant, nonstop
(fully "working") TSC. That's the meaning of PVCLOCK_TSC_STABLE_BIT.

So, correction: if you can pull it off, you still want a fast time
source. Otherwise, you still want one that is as fast as possible,
especially on the kernel side.

>>>> - even if they all do, virtual machines can be migrated (or
>>>> saved/restored) to a host with a different TSC frequency
>>>>
>>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>>> of magnitude slower than the TSC and less precise too.
>>
>> the problem is if you want to solve all three of them. The first
>> two are solved by the ACPI PM timer with a decent resolution (70
>> ns---much faster anyway than an I/O port access). The third is solved
>> by TSC. To solve all three, you need kvmclock.
>
> Still confused. Is kvmclock really used in cases where even the host
> can't pull off a working TSC?

You can certainly provide kvmclock even if you lack constant-rate or
nonstop TSC. Those are only a requirement for vdso.

If the host has a constant-rate TSC, but the rate differs per physical
CPU (common on older NUMA machines), you can easily provide a working
kvmclock. It cannot support vdso because you'll need to read the time
from a non-preemptable section, but it will work because KVM can update
the kvmclock parameters on VCPU migration, and it's still faster than
anything else. (The purpose of the now-gone migration notifiers was to
support vdso even in this case.)

If the host doesn't even have constant-rate TSC, you can still provide
kernel-only kvmclock reads through cpufreq notifiers.

Paolo
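The three host cases Paolo lists can be summarised as a small decision function. This is a sketch for illustration only: `struct host_tsc`, `enum kvmclock_mode` and `pick_mode()` are hypothetical names that do not exist in KVM, and folding "same rate on every physical CPU" into the vdso condition is an assumption drawn from the per-CPU-rate case he describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of what the host TSC can guarantee. */
struct host_tsc {
    bool constant_rate;    /* rate does not follow CPU frequency changes */
    bool nonstop;          /* keeps counting in deep idle states */
    bool same_on_all_cpus; /* same rate on every physical CPU (assumption) */
};

/* Hypothetical labels for the three situations described in the mail. */
enum kvmclock_mode {
    KVMCLOCK_VDSO,               /* stable enough for user-space (vdso) reads */
    KVMCLOCK_KERNEL_ON_MIGRATION,/* kernel-only; refreshed on VCPU migration */
    KVMCLOCK_KERNEL_ON_CPUFREQ   /* kernel-only; refreshed via cpufreq notifiers */
};

static enum kvmclock_mode pick_mode(struct host_tsc t)
{
    /* The case PVCLOCK_TSC_STABLE_BIT is meant to advertise. */
    if (t.constant_rate && t.nonstop && t.same_on_all_cpus)
        return KVMCLOCK_VDSO;
    /* Constant rate but not fully "working": no vdso, kernel reads only.
     * Treating every such host this way is an assumption of this sketch. */
    if (t.constant_rate)
        return KVMCLOCK_KERNEL_ON_MIGRATION;
    /* Not even constant-rate: still usable from the kernel. */
    return KVMCLOCK_KERNEL_ON_CPUFREQ;
}
```

The point of the model is that kvmclock itself is available in all three branches; only the first one permits the vdso fast path.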
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote:
>
>
> On 09/12/2015 22:49, Andy Lutomirski wrote:
>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
>>>
>>>
>>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>>> Can we please stop making kvmclock more complex? It's a beast right
>>>> now, and not in a good way. It's far too tangled with the vclock
>>>> machinery on both the host and guest sides, the pvclock stuff is not
>>>> well thought out (even in principle in an ABI sense), and it's never
>>>> been clear to me what problem exactly the kvmclock stuff is supposed
>>>> to solve.
>>>
>>> It's supposed to solve the problem that:
>>>
>>> - not all hosts have a working TSC
>>
>> Fine, but we don't need any vdso integration for that.
>
> Well, you still want a fast time source. That was a given. :)

If the host can't figure out how to give *itself* a fast time source,
I'd be surprised if KVM can manage to give the guest a fast, reliable
time source.

>
>>> - even if they all do, virtual machines can be migrated (or
>>> saved/restored) to a host with a different TSC frequency
>>>
>>> - any MMIO- or PIO-based mechanism to access the current time is orders
>>> of magnitude slower than the TSC and less precise too.
>>
>> Yup. But TSC by itself gets that benefit, too.
>
> Yes, the problem is if you want to solve all three of them. The first
> two are solved by the ACPI PM timer with a decent resolution (70
> ns---much faster anyway than an I/O port access). The third is solved
> by TSC. To solve all three, you need kvmclock.

Still confused. Is kvmclock really used in cases where even the host
can't pull off a working TSC?

>
>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>>> start over. A correctly functioning KVM guest using TSC (i.e.
>>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>>> considerably faster than a kvmclock guest.
>>>
>>> If all your hosts have a working TSC and you don't do migration or
>>> save/restore, that's a valid configuration. It's not a good default,
>>> however.
>>
>> Er?
>>
>> kvmclock is still really quite slow and buggy.
>
> Unless it takes 3-4000 clock cycles for a gettimeofday, which it
> shouldn't even with vdso disabled, it's definitely not slower than PIO.
>
>> And the patch I identified is definitely a problem here:
>>
>> [ 136.131241] KVM: disabling fast timing permanently due to inability
>> to recover from suspend
>>
>> I got that on the host with this whitespace-damaged patch:
>>
>> if (backwards_tsc) {
>> u64 delta_cyc = max_tsc - local_tsc;
>> + if (!backwards_tsc_observed)
>> + pr_warn("KVM: disabling fast timing
>> permanently due to inability to recover from suspend\n");
>>
>> when I suspended and resumed.
>>
>> Can anyone explain what problem
>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On
>> brief inspection, it just seems to be incorrect. Shouldn't KVM's
>> normal TSC logic handle that case right? After all, all vcpus should
>> be paused when we resume from suspend. At worst, we should just need
>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually,
>> shouldn't we do that regardless of which way the TSC jumped on
>> suspend/resume? After all, the TSC-to-wall-clock offset is quite
>> likely to change except on the very small handful of CPUs (if any)
>> that keep the TSC running in S3 and hibernate.)
>
> I don't recall the details of that patch, so Marcelo will have to answer
> this, or Alex too since he chimed in the original thread. At least it
> should be made conditional on the existence of a VM at suspend time (and
> the master clock stuff should be made per VM, as I suggested at
> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).
>
> It would indeed be great if the master clock could be dropped. But I'm
> definitely missing some of the subtle details. :(

Me, too.

Anyway, see the attached untested patch. Marcelo?

--Andy

From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id:
From: Andy Lutomirski
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

This gets rid of the "did TSC go backwards" logic and just updates all
clocks. It should work better (no more disabling of fast timing) and
more reliably (all of the clocks are actually updated).

Signed-off-by: Andy Lutomirski
---
 arch/x86/kvm/x86.c | 75 +++---
 1 file changed, 3 insertions(+), 72 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index eed32283d22c..c88f91f4b1a3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
 unsigned int __read_mostly lapic_timer_advance_ns = 0;
 module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);

-static bool __read_mostly backwards_
Re: kvmclock doesn't work, help?
On 09/12/2015 22:49, Andy Lutomirski wrote:
> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
>>
>>
>> On 09/12/2015 22:10, Andy Lutomirski wrote:
>>> Can we please stop making kvmclock more complex? It's a beast right
>>> now, and not in a good way. It's far too tangled with the vclock
>>> machinery on both the host and guest sides, the pvclock stuff is not
>>> well thought out (even in principle in an ABI sense), and it's never
>>> been clear to me what problem exactly the kvmclock stuff is supposed
>>> to solve.
>>
>> It's supposed to solve the problem that:
>>
>> - not all hosts have a working TSC
>
> Fine, but we don't need any vdso integration for that.

Well, you still want a fast time source. That was a given. :)

>> - even if they all do, virtual machines can be migrated (or
>> saved/restored) to a host with a different TSC frequency
>>
>> - any MMIO- or PIO-based mechanism to access the current time is orders
>> of magnitude slower than the TSC and less precise too.
>
> Yup. But TSC by itself gets that benefit, too.

Yes, the problem is if you want to solve all three of them. The first
two are solved by the ACPI PM timer with a decent resolution (70
ns---much faster anyway than an I/O port access). The third is solved
by TSC. To solve all three, you need kvmclock.

>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>>> start over. A correctly functioning KVM guest using TSC (i.e.
>>> ignoring kvmclock entirely) seems to work rather more reliably and
>>> considerably faster than a kvmclock guest.
>>
>> If all your hosts have a working TSC and you don't do migration or
>> save/restore, that's a valid configuration. It's not a good default,
>> however.
>
> Er?
>
> kvmclock is still really quite slow and buggy.

Unless it takes 3-4000 clock cycles for a gettimeofday, which it
shouldn't even with vdso disabled, it's definitely not slower than PIO.

> And the patch I identified is definitely a problem here:
>
> [ 136.131241] KVM: disabling fast timing permanently due to inability
> to recover from suspend
>
> I got that on the host with this whitespace-damaged patch:
>
> if (backwards_tsc) {
> u64 delta_cyc = max_tsc - local_tsc;
> + if (!backwards_tsc_observed)
> + pr_warn("KVM: disabling fast timing
> permanently due to inability to recover from suspend\n");
>
> when I suspended and resumed.
>
> Can anyone explain what problem
> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On
> brief inspection, it just seems to be incorrect. Shouldn't KVM's
> normal TSC logic handle that case right? After all, all vcpus should
> be paused when we resume from suspend. At worst, we should just need
> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually,
> shouldn't we do that regardless of which way the TSC jumped on
> suspend/resume? After all, the TSC-to-wall-clock offset is quite
> likely to change except on the very small handful of CPUs (if any)
> that keep the TSC running in S3 and hibernate.)

I don't recall the details of that patch, so Marcelo will have to answer
this, or Alex too since he chimed in the original thread. At least it
should be made conditional on the existence of a VM at suspend time (and
the master clock stuff should be made per VM, as I suggested at
https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html).

It would indeed be great if the master clock could be dropped. But I'm
definitely missing some of the subtle details. :(

Paolo
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote:
>
>
> On 09/12/2015 22:10, Andy Lutomirski wrote:
>> Can we please stop making kvmclock more complex? It's a beast right
>> now, and not in a good way. It's far too tangled with the vclock
>> machinery on both the host and guest sides, the pvclock stuff is not
>> well thought out (even in principle in an ABI sense), and it's never
>> been clear to me what problem exactly the kvmclock stuff is supposed
>> to solve.
>
> It's supposed to solve the problem that:
>
> - not all hosts have a working TSC

Fine, but we don't need any vdso integration for that.

>
> - even if they all do, virtual machines can be migrated (or
> saved/restored) to a host with a different TSC frequency

OK, I buy that. So we want to export a linear function that the guest
applies to the TSC. I suppose we also want ntp frequency corrections on
the host to propagate to the guest.

>
> - any MMIO- or PIO-based mechanism to access the current time is orders
> of magnitude slower than the TSC and less precise too.

Yup. But TSC by itself gets that benefit, too.

>
>> I'm somewhat tempted to suggest that we delete kvmclock entirely and
>> start over. A correctly functioning KVM guest using TSC (i.e.
>> ignoring kvmclock entirely) seems to work rather more reliably and
>> considerably faster than a kvmclock guest.
>
> If all your hosts have a working TSC and you don't do migration or
> save/restore, that's a valid configuration. It's not a good default,
> however.

Er?

kvmclock is still really quite slow and buggy. And the patch I
identified is definitely a problem here:

[ 136.131241] KVM: disabling fast timing permanently due to inability
to recover from suspend

I got that on the host with this whitespace-damaged patch:

if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
+ if (!backwards_tsc_observed)
+ pr_warn("KVM: disabling fast timing
permanently due to inability to recover from suspend\n");

when I suspended and resumed.

Can anyone explain what problem
16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On
brief inspection, it just seems to be incorrect. Shouldn't KVM's
normal TSC logic handle that case right? After all, all vcpus should
be paused when we resume from suspend. At worst, we should just need
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually,
shouldn't we do that regardless of which way the TSC jumped on
suspend/resume? After all, the TSC-to-wall-clock offset is quite
likely to change except on the very small handful of CPUs (if any)
that keep the TSC running in S3 and hibernate.)

--Andy
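The "just request a clock update on all vcpus, whichever way the TSC jumped" idea can be modelled in user space. This is only a sketch under stated assumptions: `struct model_vcpu`, `struct model_vm`, `MODEL_REQ_CLOCK_UPDATE` and `request_clock_update_all()` are hypothetical stand-ins, not the kernel's structures or its `kvm_make_request()` machinery.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the KVM_REQ_CLOCK_UPDATE request bit. */
#define MODEL_REQ_CLOCK_UPDATE (UINT64_C(1) << 0)

/* Hypothetical user-space model of a vcpu's pending-request bitmask. */
struct model_vcpu {
    uint64_t requests;
};

struct model_vm {
    struct model_vcpu *vcpus;
    size_t nr_vcpus;
};

/*
 * On resume, flag every vcpu of every VM for a clock update without
 * first checking whether the TSC moved forwards or backwards.
 * Returns the number of vcpus flagged.
 */
static size_t request_clock_update_all(struct model_vm *vms, size_t nr_vms)
{
    size_t flagged = 0;

    assert(vms != NULL || nr_vms == 0);
    for (size_t i = 0; i < nr_vms; i++) {
        for (size_t j = 0; j < vms[i].nr_vcpus; j++) {
            vms[i].vcpus[j].requests |= MODEL_REQ_CLOCK_UPDATE;
            flagged++;
        }
    }
    return flagged;
}
```

The design choice being illustrated is the absence of any "did the TSC go backwards" branch: every vcpu recomputes its clock parameters the next time it runs, so no special case is needed for the direction of the jump.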
Re: kvmclock doesn't work, help?
On 09/12/2015 22:10, Andy Lutomirski wrote:
> Can we please stop making kvmclock more complex? It's a beast right
> now, and not in a good way. It's far too tangled with the vclock
> machinery on both the host and guest sides, the pvclock stuff is not
> well thought out (even in principle in an ABI sense), and it's never
> been clear to me what problem exactly the kvmclock stuff is supposed
> to solve.

It's supposed to solve the problem that:

- not all hosts have a working TSC

- even if they all do, virtual machines can be migrated (or
saved/restored) to a host with a different TSC frequency

- any MMIO- or PIO-based mechanism to access the current time is orders
of magnitude slower than the TSC and less precise too.

> I'm somewhat tempted to suggest that we delete kvmclock entirely and
> start over. A correctly functioning KVM guest using TSC (i.e.
> ignoring kvmclock entirely) seems to work rather more reliably and
> considerably faster than a kvmclock guest.

If all your hosts have a working TSC and you don't do migration or
save/restore, that's a valid configuration. It's not a good default,
however.

Paolo