Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-05 Thread Paolo Bonzini


On 02/09/2016 19:31, Luiz Capitulino wrote:
> On Fri, 2 Sep 2016 19:00:41 +0200
> Paolo Bonzini  wrote:
> 
>> On 31/08/2016 19:05, Luiz Capitulino wrote:
>>>   vcpu0: 18446742405270834952
>>>   vcpu1: 18446742405270834952
>>>   vcpu2: 18446742405270834952
>>>   vcpu3: 18446742405270834952
>>>
>>>  - We'll probably need to export the TSC multiplier too.
>>>However, I've been using only the TSC offset for now.
>>>So, let's get this merged first and do the TSC multiplier
>>>as a second step  
>>
>> You'll need to export the number of fractional bits in the multiplier,
>> too.  It's going to be a very simple patch, so please do everything now.
> 
> I didn't want to expose the multiplier before testing our tracing
> procedure with it.

Exposing the multiplier should be independent of tracing.  I can test it
for you.

Paolo

> So far we've been only using the TSC offset (and
> it works great). I don't even know if I have a machine around to
> test it, so it could take a bit.
> 
>> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.
> 
> Will do.
> 
>>> Signed-off-by: Luiz Capitulino 
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |  1 +
>>>  arch/x86/kvm/svm.c  |  1 +
>>>  arch/x86/kvm/vmx.c  |  8 
>>>  arch/x86/kvm/x86.c  | 30 ++
>>>  4 files changed, 40 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 33ae3a4..5714bbd 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>>> bool (*has_wbinvd_exit)(void);
>>>  
>>> u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
>>> +   u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>>> void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>>>  
>>> u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index af523d8..c851477 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>> .has_wbinvd_exit = svm_has_wbinvd_exit,
>>>  
>>> .read_tsc_offset = svm_read_tsc_offset,
>>> +   .read_cached_tsc_offset = svm_read_tsc_offset,
>>> .write_tsc_offset = svm_write_tsc_offset,
>>> .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>>> .read_l1_tsc = svm_read_l1_tsc,
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 5cede40..82dfe42 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>>> u64 hv_deadline_tsc;
>>>  
>>> u64 current_tsc_ratio;
>>> +   u64 cached_tsc_offset;
>>>  
>>> bool guest_pkru_valid;
>>> u32 guest_pkru;
>>> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>>> return vmcs_read64(TSC_OFFSET);
>>>  }
>>>  
>>> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
>>> +{
>>> +   return to_vmx(vcpu)->cached_tsc_offset;
>>> +}
>>> +
>>>  /*
>>>   * writes 'offset' into guest's timestamp counter offset register
>>>   */
>>> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
>>> *vcpu, u64 offset)
>>>vmcs_read64(TSC_OFFSET), offset);
>>> vmcs_write64(TSC_OFFSET, offset);
>>> }
>>> +   to_vmx(vcpu)->cached_tsc_offset = offset;
>>>  }
>>>  
>>>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
>>> adjustment)
>>> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>> .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>>>  
>>> .read_tsc_offset = vmx_read_tsc_offset,
>>> +   .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>>> .write_tsc_offset = vmx_write_tsc_offset,
>>> .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>>> .read_l1_tsc = vmx_read_l1_tsc,  
>>
>> You need to handle SVM as well.  So you might as well simplify the code:
> 
> SVM is handled:
> 
>   +   .read_cached_tsc_offset = svm_read_tsc_offset,
> 
>> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
>>
>> - add a tsc_offset field in struct kvm_vcpu_arch
>>
>> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field
> 
> Given that SVM is handled, you still want me to do this?
> 
>> Then in a fifth patch export the TSC offset (and multiplier ;)) to
>> userspace.
>>
>> I'm not very happy about having a single file for all TSC offsets.
>> Creating subdirectories under the PID-FD per-VM directory would be nicer
>> in the long run.
> 
> I think Steven would also prefer that, but some people raised the
> concern at KVM Forum that creating per vcpu dirs in debugfs may
> consume considerable memory for a system running several dozen
> if not hundreds of VMs. This concern seems valid to me, but I
> can do either way.
> 


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-05 Thread Paolo Bonzini


On 02/09/2016 19:31, Luiz Capitulino wrote:
> On Fri, 2 Sep 2016 19:00:41 +0200
> Paolo Bonzini  wrote:
> 
>> On 31/08/2016 19:05, Luiz Capitulino wrote:
>>>   vcpu0: 18446742405270834952
>>>   vcpu1: 18446742405270834952
>>>   vcpu2: 18446742405270834952
>>>   vcpu3: 18446742405270834952
>>>
>>>  - We'll probably need to export the TSC multiplier too.
>>>However, I've been using only the TSC offset for now.
>>>So, let's get this merged first and do the TSC multiplier
>>>as a second step  
>>
>> You'll need to export the number of fractional bits in the multiplier,
>> too.  It's going to be a very simple patch, so please do everything now.
> 
> I didn't want to expose the multiplier before testing our tracing
> procedure with it.

Exposing the multiplier should be independent of tracing.  I can test it
for you.

Paolo

> So far we've been only using the TSC offset (and
> it works great). I don't even know if I have a machine around to
> test it, so it could take a bit.
> 
>> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.
> 
> Will do.
> 
>>> Signed-off-by: Luiz Capitulino 
>>> ---
>>>  arch/x86/include/asm/kvm_host.h |  1 +
>>>  arch/x86/kvm/svm.c  |  1 +
>>>  arch/x86/kvm/vmx.c  |  8 
>>>  arch/x86/kvm/x86.c  | 30 ++
>>>  4 files changed, 40 insertions(+)
>>>
>>> diff --git a/arch/x86/include/asm/kvm_host.h 
>>> b/arch/x86/include/asm/kvm_host.h
>>> index 33ae3a4..5714bbd 100644
>>> --- a/arch/x86/include/asm/kvm_host.h
>>> +++ b/arch/x86/include/asm/kvm_host.h
>>> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>>> bool (*has_wbinvd_exit)(void);
>>>  
>>> u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
>>> +   u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>>> void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>>>  
>>> u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
>>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>>> index af523d8..c851477 100644
>>> --- a/arch/x86/kvm/svm.c
>>> +++ b/arch/x86/kvm/svm.c
>>> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>>> .has_wbinvd_exit = svm_has_wbinvd_exit,
>>>  
>>> .read_tsc_offset = svm_read_tsc_offset,
>>> +   .read_cached_tsc_offset = svm_read_tsc_offset,
>>> .write_tsc_offset = svm_write_tsc_offset,
>>> .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>>> .read_l1_tsc = svm_read_l1_tsc,
>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
>>> index 5cede40..82dfe42 100644
>>> --- a/arch/x86/kvm/vmx.c
>>> +++ b/arch/x86/kvm/vmx.c
>>> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>>> u64 hv_deadline_tsc;
>>>  
>>> u64 current_tsc_ratio;
>>> +   u64 cached_tsc_offset;
>>>  
>>> bool guest_pkru_valid;
>>> u32 guest_pkru;
>>> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>>> return vmcs_read64(TSC_OFFSET);
>>>  }
>>>  
>>> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
>>> +{
>>> +   return to_vmx(vcpu)->cached_tsc_offset;
>>> +}
>>> +
>>>  /*
>>>   * writes 'offset' into guest's timestamp counter offset register
>>>   */
>>> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
>>> *vcpu, u64 offset)
>>>vmcs_read64(TSC_OFFSET), offset);
>>> vmcs_write64(TSC_OFFSET, offset);
>>> }
>>> +   to_vmx(vcpu)->cached_tsc_offset = offset;
>>>  }
>>>  
>>>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
>>> adjustment)
>>> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>>> .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>>>  
>>> .read_tsc_offset = vmx_read_tsc_offset,
>>> +   .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>>> .write_tsc_offset = vmx_write_tsc_offset,
>>> .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>>> .read_l1_tsc = vmx_read_l1_tsc,  
>>
>> You need to handle SVM as well.  So you might as well simplify the code:
> 
> SVM is handled:
> 
>   +   .read_cached_tsc_offset = svm_read_tsc_offset,
> 
>> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
>>
>> - add a tsc_offset field in struct kvm_vcpu_arch
>>
>> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field
> 
> Given that SVM is handled, you still want me to do this?
> 
>> Then in a fifth patch export the TSC offset (and multiplier ;)) to
>> userspace.
>>
>> I'm not very happy about having a single file for all TSC offsets.
>> Creating subdirectories under the PID-FD per-VM directory would be nicer
>> in the long run.
> 
> I think Steven would also prefer that, but some people raised the
> concern at KVM Forum that creating per vcpu dirs in debugfs may
> consume considerable memory for a system running several dozen
> if not hundreds of VMs. This concern seems valid to me, but I
> can do either way.
> 


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Masami Hiramatsu
On Fri, 2 Sep 2016 21:23:11 -0300
Marcelo Tosatti  wrote:

> On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> > On Fri, 2 Sep 2016 09:43:01 -0400
> > Stefan Hajnoczi  wrote:
> > 
> > > Can TSC offset changes occur at runtime?
> 
> Yes, but Linux guests don't write to the TSC offset
> after booting and unless user does manual TSC writes.
> 
> > > One example is vcpu hotplug where the tracing tool would need to fetch
> > > the new vcpu's TSC offset after tracing has already started.
> > > 
> > > Another example is if QEMU or the guest change the TSC offset while
> > > running.  If the tracing tool doesn't notice this then trace events will 
> > > have
> > > incorrect timestamps.
> 
> So what happens is this:
> 
> HostTSC (a variable). 
> GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

The same idea has been done by Yoshihiro
http://events.linuxfoundation.org/sites/events/files/cojp13_yunomae.pdf

 
> Then the algorithm processes the trace as follows:
> line = for each line(guest_trace)
> line = line - GuestTSCOffset (only the timestamp of course)
> 
> So from the moment the guest writes a new TSC offset, the host 
> should use the new TSC offset to subtract from the trace entries.
> The trace entries are in fact:
> 
> HostTSC + GuestTSCOffset
> 
> So the guest trace should contain entries for "USE NEW TSC OFFSET,
> VALUE: xxx", which can be done (hum not sure if guest entries
> or host entries).
> 
> However, correct me if i am wrong, the usecase seems to be:
> 
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.

IIRC, previous (current?) method is to run trace-cmd at first (before
boot the guest) so that it can get tsc-offset event and
can wait on a special unix domain socket.

For above usecase, we have to have an interface to get the current
tsc offset like Luis suggested.

> 
> Another option is to have notifications as follows: record on a buffer 
> the following:
> 
> [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
> [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
> 
> Then when merging the trace entries, you do:
> 
> line = for each line(host trace)
> write_to_merged_trace(line)
> if (contains_tsc_offset_event(line)) {
> GuestTSCOffset = line.GuestTSCOffset
> if (!guest_tsc_offset_initialized) {
> process_all_guest_lines(
> line = line - GuestTSCOffset (only the timestamp of course)
> }
> }
> 
> Aha, fail: the traces on the host are not sufficient to know when 
> to use which offset to subtract on the guest trace.
> 
> So the only possibility is to have the guest inform the occurrence
> of the events: however the guest does not have access to the TSC offset.
> 
> So the host needs to inform the new tsc offset value and the guest needs
> to inform when the event occurred on its side. So the algorithm can use
> information on both traces to know which value to subtract on the
> algorithm above.
> 
> Is this necessary? Or people do:
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.
> 
> > I believe there are tracepoints for these events. They would obviously
> > need to be enabled for the tracer to catch them.

Yes, Yoshihiro introduced a tracepoint for that.
http://lkml.iu.edu/hypermail/linux/kernel/1306.1/01741.html

So, we have to trace this event in host side.

> > 
> > I would also recommend that they go into their own instance to make
> > sure other events do not overwrite them.
> > 
> > -- Steve

Thanks,

-- 
Masami Hiramatsu 


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Masami Hiramatsu
On Fri, 2 Sep 2016 21:23:11 -0300
Marcelo Tosatti  wrote:

> On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> > On Fri, 2 Sep 2016 09:43:01 -0400
> > Stefan Hajnoczi  wrote:
> > 
> > > Can TSC offset changes occur at runtime?
> 
> Yes, but Linux guests don't write to the TSC offset
> after booting and unless user does manual TSC writes.
> 
> > > One example is vcpu hotplug where the tracing tool would need to fetch
> > > the new vcpu's TSC offset after tracing has already started.
> > > 
> > > Another example is if QEMU or the guest change the TSC offset while
> > > running.  If the tracing tool doesn't notice this then trace events will 
> > > have
> > > incorrect timestamps.
> 
> So what happens is this:
> 
> HostTSC (a variable). 
> GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

The same idea has been done by Yoshihiro
http://events.linuxfoundation.org/sites/events/files/cojp13_yunomae.pdf

 
> Then the algorithm processes the trace as follows:
> line = for each line(guest_trace)
> line = line - GuestTSCOffset (only the timestamp of course)
> 
> So from the moment the guest writes a new TSC offset, the host 
> should use the new TSC offset to subtract from the trace entries.
> The trace entries are in fact:
> 
> HostTSC + GuestTSCOffset
> 
> So the guest trace should contain entries for "USE NEW TSC OFFSET,
> VALUE: xxx", which can be done (hum not sure if guest entries
> or host entries).
> 
> However, correct me if i am wrong, the usecase seems to be:
> 
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.

IIRC, previous (current?) method is to run trace-cmd at first (before
boot the guest) so that it can get tsc-offset event and
can wait on a special unix domain socket.

For above usecase, we have to have an interface to get the current
tsc offset like Luis suggested.

> 
> Another option is to have notifications as follows: record on a buffer 
> the following:
> 
> [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
> [ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
> 
> Then when merging the trace entries, you do:
> 
> line = for each line(host trace)
> write_to_merged_trace(line)
> if (contains_tsc_offset_event(line)) {
> GuestTSCOffset = line.GuestTSCOffset
> if (!guest_tsc_offset_initialized) {
> process_all_guest_lines(
> line = line - GuestTSCOffset (only the timestamp of course)
> }
> }
> 
> Aha, fail: the traces on the host are not sufficient to know when 
> to use which offset to subtract on the guest trace.
> 
> So the only possibility is to have the guest inform the occurrence
> of the events: however the guest does not have access to the TSC offset.
> 
> So the host needs to inform the new tsc offset value and the guest needs
> to inform when the event occurred on its side. So the algorithm can use
> information on both traces to know which value to subtract on the
> algorithm above.
> 
> Is this necessary? Or people do:
> 1) Boot guest.
> 2) run trace-cmd
> 3) run workload
> 4) read traces on host.
> 
> > I believe there are tracepoints for these events. They would obviously
> > need to be enabled for the tracer to catch them.

Yes, Yoshihiro introduced a tracepoint for that.
http://lkml.iu.edu/hypermail/linux/kernel/1306.1/01741.html

So, we have to trace this event in host side.

> > 
> > I would also recommend that they go into their own instance to make
> > sure other events do not overwrite them.
> > 
> > -- Steve

Thanks,

-- 
Masami Hiramatsu 


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 20:49:37 -0300
Marcelo Tosatti  wrote:

> On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:  
> > > We need to retrieve a VM's TSC offset in order to use
> > > the host's TSC to merge host and guest traces. This is
> > > explained in detail in this thread:
> > > 
> > >   [Qemu-devel] [RFC] host and guest kernel trace merging
> > >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > > 
> > > Today, the only way to retrieve a VM's TSC offset is
> > > by using the kvm_write_tsc_offset tracepoint. This has
> > > a few problems. First, the tracepoint is only emitted
> > > when the VM boots, which requires a reboot to get it if
> > > the VM is already running. Second, tracepoints are not
> > > supposed to be ABIs in case they need to be consumed by
> > > user-space tools.
> > > 
> > > This commit exports a VM's TSC offset to user-space via
> > > debugfs. A new file called "tsc-offset" is created in
> > > the VM's debugfs directory. For example:
> > > 
> > >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > > 
> > > This file contains one TSC offset per line, for each
> > > vCPU. For example:
> > > 
> > >   vcpu0: 18446742405270834952
> > >   vcpu1: 18446742405270834952
> > >   vcpu2: 18446742405270834952
> > >   vcpu3: 18446742405270834952
> > > 
> > > There are some important observations about this
> > > solution:
> > > 
> > >  - While all vCPUs TSC offsets should be equal for the
> > >cases we care about (ie. stable TSC and no write to
> > >the TSC MSR), I chose to follow the spec and export
> > >each vCPU's TSC offset (might also be helpful for
> > >debugging)
> > > 
> > >  - The TSC offset is only useful after the VM has booted
> > > 
> > >  - We'll probably need to export the TSC multiplier too.
> > >However, I've been using only the TSC offset for now.
> > >So, let's get this merged first and do the TSC multiplier
> > >as a second step  
> > 
> > Can TSC offset changes occur at runtime?
> > 
> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will 
> > have
> > incorrect timestamps.
> > 
> > Stefan  
> 
> Yes they can, and the interface should mention that "the user is
> responsible for handling races of execution" (IMO).
> 
> So the workflow is:
> 
> 1) User boots VM and knows the state of the VM.
> 2) User runs trace-cmd on the host.
> 
> Is there a need to automate gathering of traces? (that is to know the
> state of reboots and so forth). I don't see one. In that case, the above
> workflow is functional.
> 
> Can you add such comments to the interface Luiz (that the value
> read is potentially stale).

Sure, no problem.


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 20:49:37 -0300
Marcelo Tosatti  wrote:

> On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> > On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:  
> > > We need to retrieve a VM's TSC offset in order to use
> > > the host's TSC to merge host and guest traces. This is
> > > explained in detail in this thread:
> > > 
> > >   [Qemu-devel] [RFC] host and guest kernel trace merging
> > >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > > 
> > > Today, the only way to retrieve a VM's TSC offset is
> > > by using the kvm_write_tsc_offset tracepoint. This has
> > > a few problems. First, the tracepoint is only emitted
> > > when the VM boots, which requires a reboot to get it if
> > > the VM is already running. Second, tracepoints are not
> > > supposed to be ABIs in case they need to be consumed by
> > > user-space tools.
> > > 
> > > This commit exports a VM's TSC offset to user-space via
> > > debugfs. A new file called "tsc-offset" is created in
> > > the VM's debugfs directory. For example:
> > > 
> > >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > > 
> > > This file contains one TSC offset per line, for each
> > > vCPU. For example:
> > > 
> > >   vcpu0: 18446742405270834952
> > >   vcpu1: 18446742405270834952
> > >   vcpu2: 18446742405270834952
> > >   vcpu3: 18446742405270834952
> > > 
> > > There are some important observations about this
> > > solution:
> > > 
> > >  - While all vCPUs TSC offsets should be equal for the
> > >cases we care about (ie. stable TSC and no write to
> > >the TSC MSR), I chose to follow the spec and export
> > >each vCPU's TSC offset (might also be helpful for
> > >debugging)
> > > 
> > >  - The TSC offset is only useful after the VM has booted
> > > 
> > >  - We'll probably need to export the TSC multiplier too.
> > >However, I've been using only the TSC offset for now.
> > >So, let's get this merged first and do the TSC multiplier
> > >as a second step  
> > 
> > Can TSC offset changes occur at runtime?
> > 
> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will 
> > have
> > incorrect timestamps.
> > 
> > Stefan  
> 
> Yes they can, and the interface should mention that "the user is
> responsible for handling races of execution" (IMO).
> 
> So the workflow is:
> 
> 1) User boots VM and knows the state of the VM.
> 2) User runs trace-cmd on the host.
> 
> Is there a need to automate gathering of traces? (that is to know the
> state of reboots and so forth). I don't see one. In that case, the above
> workflow is functional.
> 
> Can you add such comments to the interface Luiz (that the value
> read is potentially stale).

Sure, no problem.


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Marcelo Tosatti
On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> On Fri, 2 Sep 2016 09:43:01 -0400
> Stefan Hajnoczi  wrote:
> 
> > Can TSC offset changes occur at runtime?

Yes, but Linux guests don't write to the TSC offset
after booting and unless user does manual TSC writes.

> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will 
> > have
> > incorrect timestamps.

So what happens is this:

HostTSC (a variable). 
GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

Then the algorithm processes the trace as follows:
line = for each line(guest_trace)
line = line - GuestTSCOffset (only the timestamp of course)

So from the moment the guest writes a new TSC offset, the host 
should use the new TSC offset to subtract from the trace entries.
The trace entries are in fact:

HostTSC + GuestTSCOffset

So the guest trace should contain entries for "USE NEW TSC OFFSET,
VALUE: xxx", which can be done (hum not sure if guest entries
or host entries).

However, correct me if i am wrong, the usecase seems to be:

1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

Another option is to have notifications as follows: record on a buffer 
the following:

[ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
[ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]

Then when merging the trace entries, you do:

line = for each line(host trace)
write_to_merged_trace(line)
if (contains_tsc_offset_event(line)) {
GuestTSCOffset = line.GuestTSCOffset
if (!guest_tsc_offset_initialized) {
process_all_guest_lines(
line = line - GuestTSCOffset (only the timestamp of course)
}
}

Aha, fail: the traces on the host are not sufficient to know when 
to use which offset to subtract on the guest trace.

So the only possibility is to have the guest inform the occurrence
of the events: however the guest does not have access to the TSC offset.

So the host needs to inform the new tsc offset value and the guest needs
to inform when the event occurred on its side. So the algorithm can use
information on both traces to know which value to subtract on the
algorithm above.

Is this necessary? Or people do:
1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

> I believe there are tracepoints for these events. They would obviously
> need to be enabled for the tracer to catch them.
> 
> I would also recommend that they go into their own instance to make
> sure other events do not overwrite them.
> 
> -- Steve


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Marcelo Tosatti
On Fri, Sep 02, 2016 at 10:15:41AM -0400, Steven Rostedt wrote:
> On Fri, 2 Sep 2016 09:43:01 -0400
> Stefan Hajnoczi  wrote:
> 
> > Can TSC offset changes occur at runtime?

Yes, but Linux guests don't write to the TSC offset
after booting and unless user does manual TSC writes.

> > One example is vcpu hotplug where the tracing tool would need to fetch
> > the new vcpu's TSC offset after tracing has already started.
> > 
> > Another example is if QEMU or the guest change the TSC offset while
> > running.  If the tracing tool doesn't notice this then trace events will 
> > have
> > incorrect timestamps.

So what happens is this:

HostTSC (a variable). 
GuestTSC (variable) = GuestTSCOffset (fixed) + HostTSC (variable)

Then the algorithm processes the trace as follows:
line = for each line(guest_trace)
line = line - GuestTSCOffset (only the timestamp of course)

So from the moment the guest writes a new TSC offset, the host 
should use the new TSC offset to subtract from the trace entries.
The trace entries are in fact:

HostTSC + GuestTSCOffset

So the guest trace should contain entries for "USE NEW TSC OFFSET,
VALUE: xxx", which can be done (hum not sure if guest entries
or host entries).

However, correct me if i am wrong, the usecase seems to be:

1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

Another option is to have notifications as follows: record on a buffer 
the following:

[ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]
[ EVENT: TSC offset write, VAL: [ host tsc value, guest tsc offset ] ]

Then when merging the trace entries, you do:

line = for each line(host trace)
write_to_merged_trace(line)
if (contains_tsc_offset_event(line)) {
GuestTSCOffset = line.GuestTSCOffset
if (!guest_tsc_offset_initialized) {
process_all_guest_lines(
line = line - GuestTSCOffset (only the timestamp of course)
}
}

Aha, fail: the traces on the host are not sufficient to know when 
to use which offset to subtract on the guest trace.

So the only possibility is to have the guest inform the occurrence
of the events: however the guest does not have access to the TSC offset.

So the host needs to inform the new tsc offset value and the guest needs
to inform when the event occurred on its side. So the algorithm can use
information on both traces to know which value to subtract on the
algorithm above.

Is this necessary? Or people do:
1) Boot guest.
2) run trace-cmd
3) run workload
4) read traces on host.

> I believe there are tracepoints for these events. They would obviously
> need to be enabled for the tracer to catch them.
> 
> I would also recommend that they go into their own instance to make
> sure other events do not overwrite them.
> 
> -- Steve


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Marcelo Tosatti
On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >cases we care about (ie. stable TSC and no write to
> >the TSC MSR), I chose to follow the spec and export
> >each vCPU's TSC offset (might also be helpful for
> >debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step
> 
> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.
> 
> Stefan

Yes they can, and the interface should mention that "the user is
responsible for handling races of execution" (IMO).

So the workflow is:

1) User boots VM and knows the state of the VM.
2) User runs trace-cmd on the host.

Is there a need to automate gathering of traces? (that is to know the
state of reboots and so forth). I don't see one. In that case, the above
workflow is functional.

Can you add such comments to the interface Luiz (that the value
read is potentially stale).







Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Marcelo Tosatti
On Fri, Sep 02, 2016 at 09:43:01AM -0400, Stefan Hajnoczi wrote:
> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >cases we care about (ie. stable TSC and no write to
> >the TSC MSR), I chose to follow the spec and export
> >each vCPU's TSC offset (might also be helpful for
> >debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step
> 
> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.
> 
> Stefan

Yes they can, and the interface should mention that "the user is
responsible for handling races of execution" (IMO).

So the workflow is:

1) User boots VM and knows the state of the VM.
2) User runs trace-cmd on the host.

Is there a need to automate gathering of traces? (that is to know the
state of reboots and so forth). I don't see one. In that case, the above
workflow is functional.

Can you add such comments to the interface Luiz (that the value
read is potentially stale).







Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 19:00:41 +0200
Paolo Bonzini  wrote:

> On 31/08/2016 19:05, Luiz Capitulino wrote:
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step  
> 
> You'll need to export the number of fractional bits in the multiplier,
> too.  It's going to be a very simple patch, so please do everything now.

I didn't want to expose the multiplier before testing our tracing
procedure with it. So far we've been only using the TSC offset (and
it works great). I don't even know if I have a machine around to
test it, so it could take a bit.

> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

Will do.

> > Signed-off-by: Luiz Capitulino 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/svm.c  |  1 +
> >  arch/x86/kvm/vmx.c  |  8 
> >  arch/x86/kvm/x86.c  | 30 ++
> >  4 files changed, 40 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 33ae3a4..5714bbd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -952,6 +952,7 @@ struct kvm_x86_ops {
> > bool (*has_wbinvd_exit)(void);
> >  
> > u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> > +   u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
> > void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
> >  
> > u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index af523d8..c851477 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> > .has_wbinvd_exit = svm_has_wbinvd_exit,
> >  
> > .read_tsc_offset = svm_read_tsc_offset,
> > +   .read_cached_tsc_offset = svm_read_tsc_offset,
> > .write_tsc_offset = svm_write_tsc_offset,
> > .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
> > .read_l1_tsc = svm_read_l1_tsc,
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 5cede40..82dfe42 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -616,6 +616,7 @@ struct vcpu_vmx {
> > u64 hv_deadline_tsc;
> >  
> > u64 current_tsc_ratio;
> > +   u64 cached_tsc_offset;
> >  
> > bool guest_pkru_valid;
> > u32 guest_pkru;
> > @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
> > return vmcs_read64(TSC_OFFSET);
> >  }
> >  
> > +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> > +{
> > +   return to_vmx(vcpu)->cached_tsc_offset;
> > +}
> > +
> >  /*
> >   * writes 'offset' into guest's timestamp counter offset register
> >   */
> > @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
> > *vcpu, u64 offset)
> >vmcs_read64(TSC_OFFSET), offset);
> > vmcs_write64(TSC_OFFSET, offset);
> > }
> > +   to_vmx(vcpu)->cached_tsc_offset = offset;
> >  }
> >  
> >  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
> > adjustment)
> > @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> >  
> > .read_tsc_offset = vmx_read_tsc_offset,
> > +   .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
> > .write_tsc_offset = vmx_write_tsc_offset,
> > .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
> > .read_l1_tsc = vmx_read_l1_tsc,  
> 
> You need to handle SVM as well.  So you might as well simplify the code:

SVM is handled:

+   .read_cached_tsc_offset = svm_read_tsc_offset,

> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
> 
> - add a tsc_offset field in struct kvm_vcpu_arch
> 
> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Given that SVM is handled, you still want me to do this?

> Then in a fifth patch export the TSC offset (and multiplier ;)) to
> userspace.
> 
> I'm not very happy about having a single file for all TSC offsets.
> Creating subdirectories under the PID-FD per-VM directory would be nicer
> in the long run.

I think Steven would also prefer that, but some people raised the
concern at KVM Forum that creating per vcpu dirs in debugfs may
consume considerable memory for a system running several dozen
if not hundreds of VMs. This concern seems valid to me, but I
can do either way.


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 19:00:41 +0200
Paolo Bonzini  wrote:

> On 31/08/2016 19:05, Luiz Capitulino wrote:
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step  
> 
> You'll need to export the number of fractional bits in the multiplier,
> too.  It's going to be a very simple patch, so please do everything now.

I didn't want to expose the multiplier before testing our tracing
procedure with it. So far we've been only using the TSC offset (and
it works great). I don't even know if I have a machine around to
test it, so it could take a bit.

> arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

Will do.

> > Signed-off-by: Luiz Capitulino 
> > ---
> >  arch/x86/include/asm/kvm_host.h |  1 +
> >  arch/x86/kvm/svm.c  |  1 +
> >  arch/x86/kvm/vmx.c  |  8 
> >  arch/x86/kvm/x86.c  | 30 ++
> >  4 files changed, 40 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 33ae3a4..5714bbd 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -952,6 +952,7 @@ struct kvm_x86_ops {
> > bool (*has_wbinvd_exit)(void);
> >  
> > u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> > +   u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
> > void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
> >  
> > u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index af523d8..c851477 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
> > .has_wbinvd_exit = svm_has_wbinvd_exit,
> >  
> > .read_tsc_offset = svm_read_tsc_offset,
> > +   .read_cached_tsc_offset = svm_read_tsc_offset,
> > .write_tsc_offset = svm_write_tsc_offset,
> > .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
> > .read_l1_tsc = svm_read_l1_tsc,
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 5cede40..82dfe42 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -616,6 +616,7 @@ struct vcpu_vmx {
> > u64 hv_deadline_tsc;
> >  
> > u64 current_tsc_ratio;
> > +   u64 cached_tsc_offset;
> >  
> > bool guest_pkru_valid;
> > u32 guest_pkru;
> > @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
> > return vmcs_read64(TSC_OFFSET);
> >  }
> >  
> > +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> > +{
> > +   return to_vmx(vcpu)->cached_tsc_offset;
> > +}
> > +
> >  /*
> >   * writes 'offset' into guest's timestamp counter offset register
> >   */
> > @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu 
> > *vcpu, u64 offset)
> >vmcs_read64(TSC_OFFSET), offset);
> > vmcs_write64(TSC_OFFSET, offset);
> > }
> > +   to_vmx(vcpu)->cached_tsc_offset = offset;
> >  }
> >  
> >  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
> > adjustment)
> > @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
> > .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> >  
> > .read_tsc_offset = vmx_read_tsc_offset,
> > +   .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
> > .write_tsc_offset = vmx_write_tsc_offset,
> > .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
> > .read_l1_tsc = vmx_read_l1_tsc,  
> 
> You need to handle SVM as well.  So you might as well simplify the code:

SVM is handled:

+   .read_cached_tsc_offset = svm_read_tsc_offset,

> - add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset
> 
> - add a tsc_offset field in struct kvm_vcpu_arch
> 
> - replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Given that SVM is handled, you still want me to do this?

> Then in a fifth patch export the TSC offset (and multiplier ;)) to
> userspace.
> 
> I'm not very happy about having a single file for all TSC offsets.
> Creating subdirectories under the PID-FD per-VM directory would be nicer
> in the long run.

I think Steven would also prefer that, but some people raised the
concern at KVM Forum that creating per vcpu dirs in debugfs may
consume considerable memory for a system running several dozen
if not hundreds of VMs. This concern seems valid to me, but I
can do either way.


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Paolo Bonzini


On 31/08/2016 19:05, Luiz Capitulino wrote:
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
>  - We'll probably need to export the TSC multiplier too.
>However, I've been using only the TSC offset for now.
>So, let's get this merged first and do the TSC multiplier
>as a second step

You'll need to export the number of fractional bits in the multiplier,
too.  It's going to be a very simple patch, so please do everything now.

arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

> Signed-off-by: Luiz Capitulino 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  |  1 +
>  arch/x86/kvm/vmx.c  |  8 
>  arch/x86/kvm/x86.c  | 30 ++
>  4 files changed, 40 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 33ae3a4..5714bbd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>   bool (*has_wbinvd_exit)(void);
>  
>   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>  
>   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index af523d8..c851477 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .has_wbinvd_exit = svm_has_wbinvd_exit,
>  
>   .read_tsc_offset = svm_read_tsc_offset,
> + .read_cached_tsc_offset = svm_read_tsc_offset,
>   .write_tsc_offset = svm_write_tsc_offset,
>   .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>   .read_l1_tsc = svm_read_l1_tsc,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..82dfe42 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>   u64 hv_deadline_tsc;
>  
>   u64 current_tsc_ratio;
> + u64 cached_tsc_offset;
>  
>   bool guest_pkru_valid;
>   u32 guest_pkru;
> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>   return vmcs_read64(TSC_OFFSET);
>  }
>  
> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + return to_vmx(vcpu)->cached_tsc_offset;
> +}
> +
>  /*
>   * writes 'offset' into guest's timestamp counter offset register
>   */
> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, 
> u64 offset)
>  vmcs_read64(TSC_OFFSET), offset);
>   vmcs_write64(TSC_OFFSET, offset);
>   }
> + to_vmx(vcpu)->cached_tsc_offset = offset;
>  }
>  
>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
> adjustment)
> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>  
>   .read_tsc_offset = vmx_read_tsc_offset,
> + .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>   .write_tsc_offset = vmx_write_tsc_offset,
>   .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>   .read_l1_tsc = vmx_read_l1_tsc,

You need to handle SVM as well.  So you might as well simplify the code:

- add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset

- add a tsc_offset field in struct kvm_vcpu_arch

- replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Then in a fifth patch export the TSC offset (and multiplier ;)) to
userspace.

I'm not very happy about having a single file for all TSC offsets.
Creating subdirectories under the PID-FD per-VM directory would be nicer
in the long run.

Paolo


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Paolo Bonzini


On 31/08/2016 19:05, Luiz Capitulino wrote:
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
>  - We'll probably need to export the TSC multiplier too.
>However, I've been using only the TSC offset for now.
>So, let's get this merged first and do the TSC multiplier
>as a second step

You'll need to export the number of fractional bits in the multiplier,
too.  It's going to be a very simple patch, so please do everything now.

arch/x86/kvm/x86.c is huge; please create a new file arch/x86/kvm/debugfs.c.

> Signed-off-by: Luiz Capitulino 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/svm.c  |  1 +
>  arch/x86/kvm/vmx.c  |  8 
>  arch/x86/kvm/x86.c  | 30 ++
>  4 files changed, 40 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 33ae3a4..5714bbd 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -952,6 +952,7 @@ struct kvm_x86_ops {
>   bool (*has_wbinvd_exit)(void);
>  
>   u64 (*read_tsc_offset)(struct kvm_vcpu *vcpu);
> + u64 (*read_cached_tsc_offset)(struct kvm_vcpu *vcpu);
>   void (*write_tsc_offset)(struct kvm_vcpu *vcpu, u64 offset);
>  
>   u64 (*read_l1_tsc)(struct kvm_vcpu *vcpu, u64 host_tsc);
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index af523d8..c851477 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -5065,6 +5065,7 @@ static struct kvm_x86_ops svm_x86_ops = {
>   .has_wbinvd_exit = svm_has_wbinvd_exit,
>  
>   .read_tsc_offset = svm_read_tsc_offset,
> + .read_cached_tsc_offset = svm_read_tsc_offset,
>   .write_tsc_offset = svm_write_tsc_offset,
>   .adjust_tsc_offset_guest = svm_adjust_tsc_offset_guest,
>   .read_l1_tsc = svm_read_l1_tsc,
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 5cede40..82dfe42 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -616,6 +616,7 @@ struct vcpu_vmx {
>   u64 hv_deadline_tsc;
>  
>   u64 current_tsc_ratio;
> + u64 cached_tsc_offset;
>  
>   bool guest_pkru_valid;
>   u32 guest_pkru;
> @@ -2608,6 +2609,11 @@ static u64 vmx_read_tsc_offset(struct kvm_vcpu *vcpu)
>   return vmcs_read64(TSC_OFFSET);
>  }
>  
> +static u64 vmx_read_cached_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> + return to_vmx(vcpu)->cached_tsc_offset;
> +}
> +
>  /*
>   * writes 'offset' into guest's timestamp counter offset register
>   */
> @@ -2632,6 +2638,7 @@ static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, 
> u64 offset)
>  vmcs_read64(TSC_OFFSET), offset);
>   vmcs_write64(TSC_OFFSET, offset);
>   }
> + to_vmx(vcpu)->cached_tsc_offset = offset;
>  }
>  
>  static void vmx_adjust_tsc_offset_guest(struct kvm_vcpu *vcpu, s64 
> adjustment)
> @@ -11275,6 +11282,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
>   .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>  
>   .read_tsc_offset = vmx_read_tsc_offset,
> + .read_cached_tsc_offset = vmx_read_cached_tsc_offset,
>   .write_tsc_offset = vmx_write_tsc_offset,
>   .adjust_tsc_offset_guest = vmx_adjust_tsc_offset_guest,
>   .read_l1_tsc = vmx_read_l1_tsc,

You need to handle SVM as well.  So you might as well simplify the code:

- add a kvm_vcpu_write_tsc_offset wrapper for kvm_x86_ops->write_tsc_offset

- add a tsc_offset field in struct kvm_vcpu_arch

- replace kvm_x86_ops->read_tsc_offset with accesses to the new field

Then in a fifth patch export the TSC offset (and multiplier ;)) to
userspace.

I'm not very happy about having a single file for all TSC offsets.
Creating subdirectories under the PID-FD per-VM directory would be nicer
in the long run.

Paolo


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 12:26:55 -0400
Luiz Capitulino  wrote:

> I guess that what tools like trace-cmd want to do in those cases
> is to warn the user and discard the trace. A simple way of doing
> this would be to re-check that the TSC offset are the same after
> tracing is done. It could also use inotify, in case it works
> for debugfs (never tried it myself).

The second idea was probably stupid, never mind me :)


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 12:26:55 -0400
Luiz Capitulino  wrote:

> I guess that what tools like trace-cmd want to do in those cases
> is to warn the user and discard the trace. A simple way of doing
> this would be to re-check that the TSC offset are the same after
> tracing is done. It could also use inotify, in case it works
> for debugfs (never tried it myself).

The second idea was probably stupid, never mind me :)


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi  wrote:

> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >cases we care about (ie. stable TSC and no write to
> >the TSC MSR), I chose to follow the spec and export
> >each vCPU's TSC offset (might also be helpful for
> >debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step  
> 
> Can TSC offset changes occur at runtime?

Yes. IIRC, if the system has an unstable TSC, KVM will adjust the
TSC offset when migrating the vCPU to other cores (although
tracing with unstable TSC is not supported). Also, the guest can
write to the TSC MSR at any time (although I don't know if Linux
ever does this).

> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I guess that what tools like trace-cmd want to do in those cases
is to warn the user and discard the trace. A simple way of doing
this would be to re-check that the TSC offset are the same after
tracing is done. It could also use inotify, in case it works
for debugfs (never tried it myself).


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Luiz Capitulino
On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi  wrote:

> On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> > We need to retrieve a VM's TSC offset in order to use
> > the host's TSC to merge host and guest traces. This is
> > explained in detail in this thread:
> > 
> >   [Qemu-devel] [RFC] host and guest kernel trace merging
> >   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> > 
> > Today, the only way to retrieve a VM's TSC offset is
> > by using the kvm_write_tsc_offset tracepoint. This has
> > a few problems. First, the tracepoint is only emitted
> > when the VM boots, which requires a reboot to get it if
> > the VM is already running. Second, tracepoints are not
> > supposed to be ABIs in case they need to be consumed by
> > user-space tools.
> > 
> > This commit exports a VM's TSC offset to user-space via
> > debugfs. A new file called "tsc-offset" is created in
> > the VM's debugfs directory. For example:
> > 
> >   /sys/kernel/debug/kvm/51696-10/tsc-offset
> > 
> > This file contains one TSC offset per line, for each
> > vCPU. For example:
> > 
> >   vcpu0: 18446742405270834952
> >   vcpu1: 18446742405270834952
> >   vcpu2: 18446742405270834952
> >   vcpu3: 18446742405270834952
> > 
> > There are some important observations about this
> > solution:
> > 
> >  - While all vCPUs TSC offsets should be equal for the
> >cases we care about (ie. stable TSC and no write to
> >the TSC MSR), I chose to follow the spec and export
> >each vCPU's TSC offset (might also be helpful for
> >debugging)
> > 
> >  - The TSC offset is only useful after the VM has booted
> > 
> >  - We'll probably need to export the TSC multiplier too.
> >However, I've been using only the TSC offset for now.
> >So, let's get this merged first and do the TSC multiplier
> >as a second step  
> 
> Can TSC offset changes occur at runtime?

Yes. IIRC, if the system has an unstable TSC, KVM will adjust the
TSC offset when migrating the vCPU to other cores (although
tracing with unstable TSC is not supported). Also, the guest can
write to the TSC MSR at any time (although I don't know if Linux
ever does this).

> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I guess that what tools like trace-cmd want to do in those cases
is to warn the user and discard the trace. A simple way of doing
this would be to re-check that the TSC offset are the same after
tracing is done. It could also use inotify, in case it works
for debugfs (never tried it myself).


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Steven Rostedt
On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi  wrote:

> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I believe there are tracepoints for these events. They would obviously
need to be enabled for the tracer to catch them.

I would also recommend that they go into their own instance to make
sure other events do not overwrite them.

-- Steve


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Steven Rostedt
On Fri, 2 Sep 2016 09:43:01 -0400
Stefan Hajnoczi  wrote:

> Can TSC offset changes occur at runtime?
> 
> One example is vcpu hotplug where the tracing tool would need to fetch
> the new vcpu's TSC offset after tracing has already started.
> 
> Another example is if QEMU or the guest change the TSC offset while
> running.  If the tracing tool doesn't notice this then trace events will have
> incorrect timestamps.

I believe there are tracepoints for these events. They would obviously
need to be enabled for the tracer to catch them.

I would also recommend that they go into their own instance to make
sure other events do not overwrite them.

-- Steve


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Stefan Hajnoczi
On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> We need to retrieve a VM's TSC offset in order to use
> the host's TSC to merge host and guest traces. This is
> explained in detail in this thread:
> 
>   [Qemu-devel] [RFC] host and guest kernel trace merging
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> 
> Today, the only way to retrieve a VM's TSC offset is
> by using the kvm_write_tsc_offset tracepoint. This has
> a few problems. First, the tracepoint is only emitted
> when the VM boots, which requires a reboot to get it if
> the VM is already running. Second, tracepoints are not
> supposed to be ABIs in case they need to be consumed by
> user-space tools.
> 
> This commit exports a VM's TSC offset to user-space via
> debugfs. A new file called "tsc-offset" is created in
> the VM's debugfs directory. For example:
> 
>   /sys/kernel/debug/kvm/51696-10/tsc-offset
> 
> This file contains one TSC offset per line, for each
> vCPU. For example:
> 
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
> There are some important observations about this
> solution:
> 
>  - While all vCPUs TSC offsets should be equal for the
>cases we care about (ie. stable TSC and no write to
>the TSC MSR), I chose to follow the spec and export
>each vCPU's TSC offset (might also be helpful for
>debugging)
> 
>  - The TSC offset is only useful after the VM has booted
> 
>  - We'll probably need to export the TSC multiplier too.
>However, I've been using only the TSC offset for now.
>So, let's get this merged first and do the TSC multiplier
>as a second step

Can TSC offset changes occur at runtime?

One example is vcpu hotplug where the tracing tool would need to fetch
the new vcpu's TSC offset after tracing has already started.

Another example is if QEMU or the guest change the TSC offset while
running.  If the tracing tool doesn't notice this then trace events will have
incorrect timestamps.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH 4/4] kvm: x86: export TSC offset to user-space

2016-09-02 Thread Stefan Hajnoczi
On Wed, Aug 31, 2016 at 01:05:45PM -0400, Luiz Capitulino wrote:
> We need to retrieve a VM's TSC offset in order to use
> the host's TSC to merge host and guest traces. This is
> explained in detail in this thread:
> 
>   [Qemu-devel] [RFC] host and guest kernel trace merging
>   https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html
> 
> Today, the only way to retrieve a VM's TSC offset is
> by using the kvm_write_tsc_offset tracepoint. This has
> a few problems. First, the tracepoint is only emitted
> when the VM boots, which requires a reboot to get it if
> the VM is already running. Second, tracepoints are not
> supposed to be ABIs in case they need to be consumed by
> user-space tools.
> 
> This commit exports a VM's TSC offset to user-space via
> debugfs. A new file called "tsc-offset" is created in
> the VM's debugfs directory. For example:
> 
>   /sys/kernel/debug/kvm/51696-10/tsc-offset
> 
> This file contains one TSC offset per line, for each
> vCPU. For example:
> 
>   vcpu0: 18446742405270834952
>   vcpu1: 18446742405270834952
>   vcpu2: 18446742405270834952
>   vcpu3: 18446742405270834952
> 
> There are some important observations about this
> solution:
> 
>  - While all vCPUs TSC offsets should be equal for the
>cases we care about (ie. stable TSC and no write to
>the TSC MSR), I chose to follow the spec and export
>each vCPU's TSC offset (might also be helpful for
>debugging)
> 
>  - The TSC offset is only useful after the VM has booted
> 
>  - We'll probably need to export the TSC multiplier too.
>However, I've been using only the TSC offset for now.
>So, let's get this merged first and do the TSC multiplier
>as a second step

Can TSC offset changes occur at runtime?

One example is vcpu hotplug where the tracing tool would need to fetch
the new vcpu's TSC offset after tracing has already started.

Another example is if QEMU or the guest change the TSC offset while
running.  If the tracing tool doesn't notice this then trace events will have
incorrect timestamps.

Stefan


signature.asc
Description: PGP signature