Re: [Xen-devel] [PATCH] vt-d: use two 32-bit writes to update DMAR fault address registers

2017-10-09 Thread Zhang, Haozhong
On 10/10/17 13:36 +0800, Tian, Kevin wrote:
> > From: Roger Pau Monné [mailto:roger@citrix.com]
> > Sent: Wednesday, September 20, 2017 4:31 PM
> > 
> > On Mon, Sep 11, 2017 at 02:00:48PM +0800, Haozhong Zhang wrote:
> > > The 64-bit DMAR fault address is composed of two 32-bit registers,
> > > DMAR_FEADDR_REG and DMAR_FEUADDR_REG. According to the VT-d spec:
> > > "Software is expected to access 32-bit registers as aligned doublewords",
> > > a hypervisor should use two 32-bit writes to DMAR_FEADDR_REG and
> > > DMAR_FEUADDR_REG separately in order to update a 64-bit fault address,
> > > rather than a 64-bit write to DMAR_FEADDR_REG.
> > 
> > I would add:
> > 
> > "Note that when x2APIC is disabled DMAR_FEUADDR_REG is reserved and
> > it's not
> > necessary to update it."
> > 
> > > Though I haven't seen any errors caused by such a 64-bit write on
> > > real machines, it's still better to follow the specification.
> > >
> > > Signed-off-by: Haozhong Zhang 
> > 
> > Given the reply from Kevin:
> > 
> > Reviewed-by: Roger Pau Monné 
> > 
> 
> Haozhong, can you resend a new version with patch description
> updated?

Sorry, I forgot it and will send.
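
For reference, the change being described is roughly the following (a sketch
only, assuming Xen's dmar_writel()/dmar_writeq() register helpers; the
msg_addr_* variable names are illustrative):

/* before: a single 64-bit write spanning two 32-bit registers */
dmar_writeq(iommu->reg, DMAR_FEADDR_REG, msg_addr);

/* after: two aligned 32-bit writes, one per register */
dmar_writel(iommu->reg, DMAR_FEADDR_REG, msg_addr_lo);
dmar_writel(iommu->reg, DMAR_FEUADDR_REG, msg_addr_hi);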

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen

2016-07-18 Thread Zhang, Haozhong
On 07/19/16 08:58, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, July 18, 2016 5:02 PM
> > 
> > On 07/18/16 16:36, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Monday, July 18, 2016 8:29 AM
> > > >
> > > > Hi,
> > > >
> > > > Following is version 2 of the design doc for supporting vNVDIMM in
> > > > Xen. It's basically the summary of discussion on previous v1 design
> > > > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg6.html).
> > > > Any comments are welcome. The corresponding patches are WIP.
> > > >
> > > > Thanks,
> > > > Haozhong
> > >
> > > It's a very clear doc. Thanks a lot!
> > >
> > > >
> > > > 4.2.2 Detection of Host pmem Devices
> > > >
> > > >  The detection and initialization of host pmem devices require a
> > > >  non-trivial driver to interact with the corresponding ACPI namespace
> > > >  devices, parse namespace labels and take necessary recovery actions.
> > > >  Instead of duplicating the comprehensive Linux pmem driver in the Xen
> > > >  hypervisor, our design leaves it to Dom0 Linux and lets Dom0 Linux
> > > >  report detected host pmem devices to the Xen hypervisor.
> > > >
> > > >  Our design takes the following steps to detect host pmem devices when
> > > >  Xen boots.
> > > >  (1) As when booting on bare metal, host pmem devices are detected by
> > > >  the Dom0 Linux NVDIMM driver.
> > > >
> > > >  (2) Our design extends the Linux NVDIMM driver to report the SPAs and
> > > >  sizes of the pmem devices and reserved areas to the Xen hypervisor
> > > >  via a new hypercall.
> > >
> > > Does Linux need to provide reserved area to Xen? Why not leaving Xen
> > > to decide reserved area within reported pmem regions and then return
> > > reserved info to Dom0 NVDIMM driver to balloon out?
> > >
> > 
> > NVDIMM can be used as persistent storage like a disk drive, so the
> > reservation should be done outside of Xen and Dom0, for example, by an
> > administrator who is expected to make the necessary data backups in
> > advance.
> 
> What prevents NVDIMM driver from reserving some region itself before
> reporting to user space?
>

Nothing in theory prevents the driver from doing the reservation itself. I
just mean the reservation should be initiated by someone who can ensure,
for example, that the current data on pmem is either useless or properly
backed up. The reservation is of course ultimately performed by the driver.

> > 
> > Therefore, dom0 linux actually reports (instead of providing) the
> > reserved area to Xen, and the latter checks if the reserved area is
> > large enough and (if yes) asks dom0 to balloon out the reserved area.
> 
> It looks non-intuitive since the administrator doesn't know the actual
> requirement of Xen. Then the administrator has to guess and try. Even if it
> finally works, the reserved size may not be optimal.
>
> If the Dom0 NVDIMM driver does the reservation itself and notifies Xen, at
> least there is a way for Xen to return a failure with the required size, and
> then the NVDIMM driver can adjust the reservation as desired the second time.
>
> Did I misunderstand the flow here?
>

I designed it to let the administrator calculate the reserved size and
pass it to the driver. But you are right: it's better to let Xen advise
the reserved size to the NVDIMM driver in Dom0, so there is no need for a
manually calculated size.
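
For example, assuming a 4 KiB page size and a 32-byte struct page_info on
x86-64 (illustrative figures), the advised reservation would roughly be

    reserved_size = (pmem_size / PAGE_SIZE) * sizeof(struct page_info)
                  = (128 GiB / 4 KiB) * 32 B = 1 GiB

for a 128 GiB pmem device.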

Thanks,
Haozhong

> > 
> > > >
> > > >  (3) Xen hypervisor then checks
> > > >  - whether the SPA and size of the newly reported pmem device overlap
> > > >    with any previously reported pmem devices;
> > > >  - whether the reserved area can fit in the pmem device and is
> > > >    large enough to hold page_info structs for itself.
> > > >
> > > >  If any checks fail, the reported pmem device will be ignored by
> > > >  Xen hypervisor and hence will not be used by any
> > > >  guests. Otherwise, Xen hypervisor will record the reported
> > > >  parameters and create page_info structs in the reserved area.
> > > >
> > > >  (4) Because the reserved area is now used by Xen hypervisor, it
> > > >  should not be accessible by Dom0 any more. Therefore, if a host
> > > >  pmem device is recorded by Xen hypervisor, Xen will unmap its
> > > >  reserved area from Dom0. Our design also needs to extend Linux
> > > >  NVDIMM driver to "balloon out" the reserved area after it
> > > >  successfully reports a pmem device to Xen hypervisor.

Re: [Xen-devel] [RFC Design Doc v2] Add vNVDIMM support for Xen

2016-07-18 Thread Zhang, Haozhong
On 07/18/16 16:36, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, July 18, 2016 8:29 AM
> > 
> > Hi,
> > 
> > Following is version 2 of the design doc for supporting vNVDIMM in
> > Xen. It's basically the summary of discussion on previous v1 design
> > (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg6.html).
> > Any comments are welcome. The corresponding patches are WIP.
> > 
> > Thanks,
> > Haozhong
> 
> It's a very clear doc. Thanks a lot!
> 
> > 
> > 4.2.2 Detection of Host pmem Devices
> > 
> >  The detection and initialization of host pmem devices require a
> >  non-trivial driver to interact with the corresponding ACPI namespace
> >  devices, parse namespace labels and take necessary recovery actions.
> >  Instead of duplicating the comprehensive Linux pmem driver in the Xen
> >  hypervisor, our design leaves it to Dom0 Linux and lets Dom0 Linux
> >  report detected host pmem devices to the Xen hypervisor.
> > 
> >  Our design takes the following steps to detect host pmem devices when
> >  Xen boots.
> >  (1) As when booting on bare metal, host pmem devices are detected by
> >  the Dom0 Linux NVDIMM driver.
> > 
> >  (2) Our design extends the Linux NVDIMM driver to report the SPAs and
> >  sizes of the pmem devices and reserved areas to the Xen hypervisor
> >  via a new hypercall.
> 
> Does Linux need to provide reserved area to Xen? Why not leaving Xen
> to decide reserved area within reported pmem regions and then return
> reserved info to Dom0 NVDIMM driver to balloon out?
>

NVDIMM can be used as persistent storage like a disk drive, so the
reservation should be done outside of Xen and Dom0, for example, by an
administrator who is expected to make the necessary data backups in
advance.

Therefore, dom0 linux actually reports (instead of providing) the
reserved area to Xen, and the latter checks if the reserved area is
large enough and (if yes) asks dom0 to balloon out the reserved area.

> > 
> >  (3) Xen hypervisor then checks
> >  - whether the SPA and size of the newly reported pmem device overlap
> >    with any previously reported pmem devices;
> >  - whether the reserved area can fit in the pmem device and is
> >    large enough to hold page_info structs for itself.
> > 
> >  If any checks fail, the reported pmem device will be ignored by
> >  Xen hypervisor and hence will not be used by any
> >  guests. Otherwise, Xen hypervisor will record the reported
> >  parameters and create page_info structs in the reserved area.
> > 
> >  (4) Because the reserved area is now used by Xen hypervisor, it
> >  should not be accessible by Dom0 any more. Therefore, if a host
> >  pmem device is recorded by Xen hypervisor, Xen will unmap its
> >  reserved area from Dom0. Our design also needs to extend Linux
> >  NVDIMM driver to "balloon out" the reserved area after it
> >  successfully reports a pmem device to Xen hypervisor.
> 
> Then both ndctl and Xen become sources of reservation requests to the
> Linux NVDIMM driver. You don't need to change ndctl as described in
> 4.2.1. The user can still use ndctl to reserve for Dom0's own purposes.
>

I missed something here: the Dom0 pmem driver should also prevent
further operations on the host namespace after it successfully reports to
Xen. In this way, we can prevent userspace tools like ndctl from breaking
the host pmem device.

Thanks.
Haozhong



Re: [Xen-devel] [BUG] Xen's LLC, UCNA and MEM testing shows error message "Failed to inject MSR"

2016-05-26 Thread Zhang, Haozhong
On 05/27/16 13:07, Zhang, PengtaoX wrote:
> Bug detailed description:
> 
> On Haswell-EX and Broadwell-EX servers, testing Xen's LLC, MEM and UCNA
> shows the error message: "Failed to inject MSR: Invalid argument".
> 
> Environment :
> 
> HW: Haswell-EX/Broadwell-EX
> Xen: Xen 4.7.0 RC3
> Dom0: Linux 4.6.0
> 
> Reproduce steps:
> 
> 1. Compile xen-mceinj in xen: xen/tools/tests/mce-test/tools
> 2. Run the command: xen/tools/tests/mce-test/tools/xen-mceinj -t 0
> 
> Current result:
> 
> Step 2 shows the error messages:
> get gaddr of error inject is: 0x180020
> Failed to inject MSR: Invalid argument
> 
> Base error log:
> 
> The Xen 4.7 RC3 and RC1 logs are the same; the RC1 log is attached.
> Console log:
> (XEN) HV MSR INJECT (interpose) target 0 actual 0 MSR 0x17a <-- 0x5
> (XEN) HV MSR INJECT (interpose) target 0 actual 0 MSR 0x41d <-- 0xbd208a
> (XEN) HV MSR INJECT (interpose) target 0 actual 0 MSR 0x41f <-- 0x86

Hi Pengtao,

I'll send a v2 patch of 
http://lists.xenproject.org/archives/html/xen-devel/2016-05/msg02534.html
which will fix this bug.

Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-03-19 Thread Zhang, Haozhong
On 03/17/16 22:12, Xu, Quan wrote:
> On March 17, 2016 9:37pm, Haozhong Zhang  wrote:
> > For PV guests (if we add vNVDIMM support for them in future), as I'm going 
> > to
> > use page_info struct for it, I suppose the current mechanism in Xen can 
> > handle
> > this case. I'm not familiar with PV memory management 
> 
> The below web may be helpful:
> http://wiki.xen.org/wiki/X86_Paravirtualised_Memory_Management
> 
> :)
> Quan
> 

Thanks!

Haozhong



Re: [Xen-devel] [PATCH v5 6/6] docs: Add descriptions of TSC scaling in xl.cfg and tscmode.txt

2016-02-28 Thread Zhang, Haozhong
On 02/29/16 10:02, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:jbeul...@suse.com]
> > Sent: Friday, February 26, 2016 4:01 PM
> > 
> > >>> On 26.02.16 at 05:37, <kevin.t...@intel.com> wrote:
> > >>  From: Zhang, Haozhong
> > >> Sent: Tuesday, February 23, 2016 10:05 AM
> > >>
> > >> Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > >
> > > Reviewed-by: Kevin Tian <kevin.t...@intel.com>, except:
> > >
> > >> +
> > >> +Hardware TSC Scaling
> > >> +
> > >> +Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
> > >> +by guest rdtsc/p increasing in a different frequency than the host
> > >> +TSC frequency.
> > >> +
> > >> +If a HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
> > >
> > > 'HVM container' means something different. We usually use "HVM domain"
> > > as you may see in other places in this doc.
> > 
> > But I think this is specifically meant to refer to both HVM and PVH
> > domains.
> > 
> 
> First, I have a feeling that many people today refer to containers
> running within a VM as 'VM container', which is a bit confusing to
> 'HVM container' purpose here. Couldn't we use 'HVM domains'
> to cover both HVM and PVH (which is PV-HVM)? Curious whether
> there is formal definition of those terminologies...
>

I call it 'HVM container' because I use has_hvm_container_domain(d)
| #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
to check whether TSC scaling can be used by a domain, which, in the current
implementation, is either an HVM domain (d->guest_type == guest_type_hvm)
or a PVH domain (d->guest_type == guest_type_pvh).

I also noticed another macro, is_hvm_domain(d)
| #define is_hvm_domain(d) ((d)->guest_type == guest_type_hvm)
so I think 'HVM domain' cannot be used to refer to both HVM and PVH
domains.

> Second, even when 'HVM container' can be used as you explained,
> it's inconsistent with other places in same doc, where only 'HVM
> domain' is used. I'd think consistency is more important in this
> patch series, and then if 'HVM container' is really preferred which
> should be a separate patch to update all related docs.
>

Or, maybe I should make it explicit, i.e. using 'HVM and PVH domains'
rather than 'HVM container'.

Haozhong



Re: [Xen-devel] [PATCH v5 6/6] docs: Add descriptions of TSC scaling in xl.cfg and tscmode.txt

2016-02-25 Thread Zhang, Haozhong
On 02/26/16 12:37, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, February 23, 2016 10:05 AM
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> 
> Reviewed-by: Kevin Tian <kevin.t...@intel.com>, except:
> 
> > +
> > +Hardware TSC Scaling
> > +
> > +Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
> > +by guest rdtsc/p increasing in a different frequency than the host
> > +TSC frequency.
> > +
> > +If a HVM container in default TSC mode (tsc_mode=0) or PVRDTSCP mode
> 
> 'HVM container' means something different. We usually use "HVM domain"
> as you may see in other places in this doc.
>

I'll change to 'HVM domain'.

Thanks,
Haozhong



Re: [Xen-devel] [PATCH v4 06/10] x86/hvm: Setup TSC scaling ratio

2016-02-16 Thread Zhang, Haozhong
On 02/05/16 21:54, Jan Beulich wrote:
> >>> On 17.01.16 at 22:58,  wrote:
> > +u64 hvm_get_tsc_scaling_ratio(u32 gtsc_khz)
> > +{
> > +u64 ratio;
> > +
> > +if ( !hvm_tsc_scaling_supported )
> > +return 0;
> > +
> > +/*
> > + * The multiplication of the first two terms may overflow a 64-bit
> > + * integer, so use mul_u64_u32_div() instead to keep precision.
> > + */
> > +ratio = mul_u64_u32_div(1ULL << hvm_funcs.tsc_scaling_ratio_frac_bits,
> > +gtsc_khz, cpu_khz);
> 
> Is this the only use for this new math64 function? If so, I don't
> see the point of adding that function, because (leaving limited
> significant bits aside) the above simply is
> 
> (gtsc_khz << hvm_funcs.tsc_scaling_ratio_frac_bits) / cpu_khz
> 
> which can be had without any multiplication. Personally, if indeed
> the only use I'd favor converting the above to inline assembly
> here instead of adding that helper function (just like we have a
> number of asm()-s in x86/time.c for similar reasons).
>

OK, I'll rewrite it as asm(). mul_u64_u32_div() will not be used any
more and will be removed.

I'll also inline another math64 function mul_u64_u64_shr() in its single
caller hvm_scale_tsc(). Then the math64 patch will be dropped in the
next version.
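
For instance, with VMX's 48 fractional bits, a guest at gtsc_khz = 2,000,000
on a host with cpu_khz = 2,500,000 would give (numbers illustrative)

    ratio = (2000000 << 48) / 2500000 = 0.8 * 2^48

and note it is the intermediate product 2000000 * 2^48 that overflows
64 bits, which is why either mul_u64_u32_div() or the asm() version is
needed.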

> > +void hvm_setup_tsc_scaling(struct vcpu *v)
> > +{
> > +v->arch.hvm_vcpu.tsc_scaling_ratio =
> > +hvm_get_tsc_scaling_ratio(v->domain->arch.tsc_khz);
> > +}
> 
> So why again is this per-vCPU setup of per-vCPU state when it
> only depends on a per-domain input? If this was per-domain, its
> setup could be where it belongs - in arch_hvm_load().
>

It's a per-domain state. I'll move the state to x86's struct arch_domain,
where the other TSC fields are (or to struct hvm_domain, because this is
only used for HVM?).

Then it will be set up in tsc_set_info() after the guest TSC frequency is
determined.
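
Roughly (a sketch only; the field placement and names are not final):

void hvm_setup_tsc_scaling(struct domain *d)
{
    /* per-domain ratio, derived once the guest TSC frequency is known */
    d->arch.hvm_domain.tsc_scaling_ratio =
        hvm_get_tsc_scaling_ratio(d->arch.tsc_khz);
}

with the call placed in tsc_set_info() once d->arch.tsc_khz is set.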

> > @@ -5504,6 +5536,9 @@ void hvm_vcpu_reset_state(struct vcpu *v, uint16_t cs, uint16_t ip)
> >  hvm_set_segment_register(v, x86_seg_gdtr, &reg);
> >  hvm_set_segment_register(v, x86_seg_idtr, &reg);
> >  
> > +if ( hvm_tsc_scaling_supported && !d->arch.vtsc )
> > +hvm_setup_tsc_scaling(v);
> 
> Could you remind me why this is needed? What state of the guest
> would have changed making this necessary? Is this perhaps just
> because it's per-vCPU instead of per-domain?
> 
> Jan
> 

Yes, just because I mistakenly made it per-vCPU. So it will not be
necessary in this patch once tsc_scaling_ratio becomes per-domain.

Haozhong



Re: [Xen-devel] [PATCH v4 05/10] x86: Add functions for 64-bit integer arithmetic

2016-02-16 Thread Zhang, Haozhong
On 02/05/16 21:36, Jan Beulich wrote:
> >>> On 17.01.16 at 22:58,  wrote:
> > This patch adds several functions to take multiplication, division and
> > shifting involving 64-bit integers.
> > 
> > Signed-off-by: Haozhong Zhang 
> > Reviewed-by: Boris Ostrovsky 
> > ---
> > Changes in v4:
> >  (addressing Jan Beulich's comments)
> >  * Rewrite mul_u64_u64_shr() in assembly.
> 
> Thanks, but it puzzles me that the other one didn't get converted
> as well. Anyway, I'm not going to make this a requirement, since
> at least it appears to match Linux'es variant.
>

I can't remember why I didn't rewrite mul_u64_u32_div(), especially when
it can be easily implemented as

static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
{
    u64 quotient, remainder;

    asm volatile ( "mulq %3; divq %4"
                   : "=a" (quotient), "=d" (remainder)
                   : "0" (a), "rm" ((u64) mul), "c" ((u64) divisor) );

    return quotient;
}

I'll modify it in the next version.

> > +static inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int n)
> > +{
> > +u64 hi, lo;
> > +
> > +asm volatile ( "mulq %2; shrdq %1,%0"
> > +   : "=a" (lo), "=d" (hi)
> > +   : "rm" (mul), "0" (a), "c" (n) );
> 
> SHRD formally is a 3-operand instruction, and the fact that gas'
> AT&T syntax supports a 2-operand "alias" is, well, odd. Please
> let's use the specification mandated 3-operand form properly,
> to avoid surprises with e.g. clang.
>

OK, I'll change it to the 3-operand form.
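
E.g. something like (just a sketch, with the shift count referenced
explicitly as %cl):

static inline u64 mul_u64_u64_shr(u64 a, u64 mul, unsigned int n)
{
    u64 hi, lo;

    /* lo = low 64 bits of (a * mul) shifted right by n, filled from hi */
    asm volatile ( "mulq %2; shrdq %%cl, %1, %0"
                   : "=a" (lo), "=d" (hi)
                   : "rm" (mul), "0" (a), "c" (n) );

    return lo;
}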

Haozhong



Re: [Xen-devel] [RESEND PATCH v4 09/10] vmx: Add VMX RDTSC(P) scaling support

2016-02-16 Thread Zhang, Haozhong
On 02/05/16 22:06, Jan Beulich wrote:
> >>> On 19.01.16 at 03:55,  wrote:
> > @@ -2107,6 +2115,14 @@ const struct hvm_function_table * __init 
> > start_vmx(void)
> >   && cpu_has_vmx_secondary_exec_control )
> >  vmx_function_table.pvh_supported = 1;
> >  
> > +if ( cpu_has_vmx_tsc_scaling )
> > +{
> > +vmx_function_table.default_tsc_scaling_ratio = 
> > VMX_TSC_MULTIPLIER_DEFAULT;
> > +vmx_function_table.max_tsc_scaling_ratio = VMX_TSC_MULTIPLIER_MAX;
> > +vmx_function_table.tsc_scaling_ratio_frac_bits = 48;
> > +vmx_function_table.setup_tsc_scaling = vmx_setup_tsc_scaling;
> > +}
> 
> Same comments here as on the earlier patch - it indeed looks as if
> tsc_scaling_ratio_frac_bits would be the ideal field to dynamically
> initialize, as it being zero will not yield any bad behavior afaict.
>

Yes, I'll make changes similar to my reply under patch 4.

> Also please consider making all fields together a sub-structure
> of struct hvm_function_table, such that the above would become
> 
> vmx_function_table.tsc_scaling.default_ratio = 
> VMX_TSC_MULTIPLIER_DEFAULT;
> vmx_function_table.tsc_scaling.max_ratio = VMX_TSC_MULTIPLIER_MAX;
> vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
> vmx_function_table.tsc_scaling.setup = vmx_setup_tsc_scaling;
> 
> keeping everything nicely together.
>

OK, I'll put them in a sub-structure in the earlier patch that
introduces those fields.

> > @@ -258,6 +259,9 @@ extern u64 vmx_ept_vpid_cap;
> >  #define VMX_MISC_CR3_TARGET 0x01ff
> >  #define VMX_MISC_VMWRITE_ALL0x2000
> >  
> > +#define VMX_TSC_MULTIPLIER_DEFAULT  0x0001ULL
> 
> Considering this and the respective SVM value - do we really
> need the separate field in struct hvm_function_table? Both are
> 1ULL << tsc_scaling.ratio_frac_bits after all.
>

I'll remove VMX_TSC_MULTIPLIER_DEFAULT and DEFAULT_TSC_RATIO (for SVM),
and use ratio_frac_bits to initialize default_ratio.
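
Roughly (a sketch only; the exact field names may differ in the next
version):

/* in struct hvm_function_table */
struct {
    /* number of fractional bits of the scaling ratio (VMX: 48, SVM: 32) */
    uint8_t  ratio_frac_bits;
    /* maximum-allowed scaling ratio */
    uint64_t max_ratio;
    /* 1ULL << ratio_frac_bits, i.e. no scaling */
    uint64_t default_ratio;
    void (*setup)(struct vcpu *v);
} tsc_scaling;

and in start_vmx():

if ( cpu_has_vmx_tsc_scaling )
{
    vmx_function_table.tsc_scaling.ratio_frac_bits = 48;
    vmx_function_table.tsc_scaling.max_ratio = VMX_TSC_MULTIPLIER_MAX;
    vmx_function_table.tsc_scaling.default_ratio = 1ULL << 48;
    vmx_function_table.tsc_scaling.setup = vmx_setup_tsc_scaling;
}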

Thanks,
Haozhong



Re: [Xen-devel] [PATCH v4 04/10] x86/hvm: Collect information of TSC scaling ratio

2016-02-16 Thread Zhang, Haozhong
On 02/05/16 19:41, Jan Beulich wrote:
> >>> On 17.01.16 at 22:58,  wrote:
> > Both VMX TSC scaling and SVM TSC ratio use the 64-bit TSC scaling ratio,
> > but the number of fractional bits of the ratio is different between VMX
> > and SVM. This patch adds the architecture code to collect the number of
> > fractional bits and other related information into fields of struct
> > hvm_function_table so that they can be used in the common code.
> > 
> > Signed-off-by: Haozhong Zhang 
> > Reviewed-by: Kevin Tian 
> > Reviewed-by: Boris Ostrovsky 
> > ---
> > Changes in v4:
> >  (addressing Jan Beulich's comments in v3 patch 12)
> >  * Set TSC scaling parameters in hvm_funcs conditionally.
> >  * Remove TSC scaling parameter tsc_scaling_supported in hvm_funcs which
> >can be derived from other parameters.
> >  (code cleanup)
> >  * Merge with v3 patch 11 "x86/hvm: Detect TSC scaling through hvm_funcs"
> >whose work can be done early in this patch.
> 
> I really think this the scope of these changes should have invalidated
> all earlier tags.
>

I'll remove all R-b tags.

> > --- a/xen/arch/x86/hvm/svm/svm.c
> > +++ b/xen/arch/x86/hvm/svm/svm.c
> > @@ -1450,6 +1450,14 @@ const struct hvm_function_table * __init 
> > start_svm(void)
> >  if ( !cpu_has_svm_nrips )
> >  clear_bit(SVM_FEATURE_DECODEASSISTS, &svm_feature_flags);
> >  
> > +if ( cpu_has_tsc_ratio )
> > +{
> > +svm_function_table.default_tsc_scaling_ratio = DEFAULT_TSC_RATIO;
> > +svm_function_table.max_tsc_scaling_ratio = ~TSC_RATIO_RSVD_BITS;
> > +svm_function_table.tsc_scaling_ratio_frac_bits = 32;
> > +svm_function_table.scale_tsc = svm_scale_tsc;
> > +}
> > +
> >  #define P(p,s) if ( p ) { printk(" - %s\n", s); printed = 1; }
> >  P(cpu_has_svm_npt, "Nested Page Tables (NPT)");
> >  P(cpu_has_svm_lbrv, "Last Branch Record (LBR) Virtualisation");
> > @@ -2269,8 +2277,6 @@ static struct hvm_function_table __initdata 
> > svm_function_table = {
> >  .nhvm_vmcx_hap_enabled = nsvm_vmcb_hap_enabled,
> >  .nhvm_intr_blocked = nsvm_intr_blocked,
> >  .nhvm_hap_walk_L1_p2m = nsvm_hap_walk_L1_p2m,
> > -
> > -.scale_tsc= svm_scale_tsc,
> >  };
> 
> From at the first glance purely mechanical POV this change was
> unnecessary with ...
> 
> > @@ -249,6 +261,8 @@ void hvm_set_guest_tsc_fixed(struct vcpu *v, u64 
> > guest_tsc, u64 at_tsc);
> >  u64 hvm_get_guest_tsc_fixed(struct vcpu *v, u64 at_tsc);
> >  #define hvm_get_guest_tsc(v) hvm_get_guest_tsc_fixed(v, 0)
> >  
> > +#define hvm_tsc_scaling_supported (!!hvm_funcs.default_tsc_scaling_ratio)
> 
> ... this, but considering our general aim to avoid having NULL
> callback pointers wherever possible, I think this is more than just
> a mechanical concern: I'd prefer if at least the callback pointer
> always be statically initialized, and ideally also two of the other
> fields. Only one field should be dynamically initialized (unless -
> considering the VMX code to come - static initialization is
> impossible), and ideally one which, if zero, would not have any
> bad consequences if used by mistake (frac_bits maybe). And
> perhaps an ASSERT() should be placed inside svm_scale_tsc()
> making sure the dynamically initialized field actually is initialized.
>

Combined with your comments for patch 9, I'll leave only
tsc_scaling_ratio_frac_bits to be dynamically initialized.

> The conditional here would then check _all_ fields which either
> vendor's code leaves uninitialized (i.e. the VMX patch may then
> add to the above).
>

so it would be
#define hvm_tsc_scaling_supported (!!hvm_funcs.tsc_scaling_ratio_frac_bits)
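
plus, per your suggestion, an ASSERT() inside svm_scale_tsc() (a sketch;
scale_tsc() below stands for the existing ratio-multiplication helper):

static uint64_t svm_scale_tsc(const struct vcpu *v, uint64_t tsc)
{
    /* catch use of the dynamically initialized field before start_svm() set it */
    ASSERT(hvm_funcs.tsc_scaling_ratio_frac_bits != 0);
    return scale_tsc(tsc, v->arch.hvm_vcpu.tsc_scaling_ratio);
}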


Thanks,
Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-15 Thread Zhang, Haozhong
On 02/03/16 23:47, Konrad Rzeszutek Wilk wrote:
> > > > >  Open: It seems no system call/ioctl is provided by Linux kernel to
> > > > >get the physical address from a virtual address.
> > > > >/proc//pagemap provides information of mapping from
> > > > >VA to PA. Is it an acceptable solution to let QEMU parse this
> > > > >file to get the physical address?
> > > > 
> > > > Does it work in a non-root scenario?
> > > >
> > > 
> > > Seemingly no, according to Documentation/vm/pagemap.txt in Linux kernel:
> > > | Since Linux 4.0 only users with the CAP_SYS_ADMIN capability can get 
> > > PFNs.
> > > | In 4.0 and 4.1 opens by unprivileged fail with -EPERM.  Starting from
> > > | 4.2 the PFN field is zeroed if the user does not have CAP_SYS_ADMIN.
> > > | Reason: information about PFNs helps in exploiting Rowhammer 
> > > vulnerability.
> 
> Ah right.
> > >
> > > A possible alternative is to add a new hypercall similar to
> > > XEN_DOMCTL_memory_mapping but receiving virtual address as the address
> > > parameter and translating to machine address in the hypervisor.
> > 
> > That might work.
> 
> That won't work.
> 
> This is a userspace VMA - which means the once the ioctl is done we swap
> to kernel virtual addresses. Now we may know that the prior cr3 has the
> userspace virtual address and walk it down - but what if the domain
> that is doing this is PVH? (or HVM) - the cr3 of userspace is tucked somewhere
> inside the kernel.
> 
> Which means this hypercall would need to know the Linux kernel task structure
> to find this.
> 
> May I propose another solution - an stacking driver (similar to loop). You
> setup it up (ioctl /dev/pmem0/guest.img, get some /dev/mapper/guest.img 
> created).
> Then mmap the /dev/mapper/guest.img - all of the operations are the same - 
> except
> it may have an extra ioctl - get_pfns - which would provide the data in 
> similar
> form to pagemap.txt.
>

This stacking driver approach seems to still need privileged permission
and more modifications in the kernel, so ...

> But folks will then ask - why don't you just use pagemap? Could the pagemap
> have an extra security capability check? One that can be set for
> QEMU?
>

... I would like to use pagemap and mlock.

Haozhong

> > 
> > 
> > > > >  Open: For a large pmem, mmap(2) is very possible to not map all SPA
> > > > >occupied by pmem at the beginning, i.e. QEMU may not be able to
> > > > >get all SPA of pmem from buf (in virtual address space) when
> > > > >calling XEN_DOMCTL_memory_mapping.
> > > > >Can mmap flag MAP_LOCKED or mlock(2) be used to enforce the
> > > > >entire pmem being mmaped?
> > > > 
> > > > Ditto
> > > >
> > > 
> > > No. If I take the above alternative for the first open, maybe the new
> > > hypercall above can inject page faults into dom0 for the unmapped
> > > virtual address so as to enforce dom0 Linux to create the page
> > > mapping.
> 
> Ugh. That sounds hacky. And you wouldn't neccessarily be safe.
> Imagine that the system admin decides to defrag the /dev/pmem filesystem.
> Or move the files (disk images) around. If they do that - we may
> still have the guest mapped to system addresses which may contain filesystem
> metadata now, or a different guest image. We MUST mlock or lock the file
> during the duration of the guest.
> 
> 



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-14 Thread Zhang, Haozhong
On 02/04/16 20:24, Stefano Stabellini wrote:
> On Thu, 4 Feb 2016, Haozhong Zhang wrote:
> > On 02/03/16 15:22, Stefano Stabellini wrote:
> > > On Wed, 3 Feb 2016, George Dunlap wrote:
> > > > On 03/02/16 12:02, Stefano Stabellini wrote:
> > > > > On Wed, 3 Feb 2016, Haozhong Zhang wrote:
> > > > >> Or, we can make a file system on /dev/pmem0, create files on it, set
> > > > >> the owner of those files to xen-qemuuser-domid$domid, and then pass
> > > > >> those files to QEMU. In this way, non-root QEMU should be able to
> > > > >> mmap those files.
> > > > >
> > > > > Maybe that would work. Worth adding it to the design, I would like to
> > > > > read more details on it.
> > > > >
> > > > > Also note that QEMU initially runs as root but drops privileges to
> > > > > xen-qemuuser-domid$domid before the guest is started. Initially QEMU
> > > > > *could* mmap /dev/pmem0 while is still running as root, but then it
> > > > > wouldn't work for any devices that need to be mmap'ed at run time
> > > > > (hotplug scenario).
> > > >
> > > > This is basically the same problem we have for a bunch of other things,
> > > > right?  Having xl open a file and then pass it via qmp to qemu should
> > > > work in theory, right?
> > >
> > > Is there one /dev/pmem? per assignable region?
> > 
> > Yes.
> > 
> > BTW, I'm wondering whether and how non-root qemu works with xl disk
> > configuration that is going to access a host block device, e.g.
> >  disk = [ '/dev/sdb,,hda' ]
> > If that works with non-root qemu, I may take the similar solution for
> > pmem.
>  
> Today the user is required to give the correct ownership and access mode
> to the block device, so that non-root QEMU can open it. However in the
> case of PCI passthrough, QEMU needs to mmap /dev/mem, as a consequence
> the feature doesn't work at all with non-root QEMU
> (http://marc.info/?l=xen-devel&m=145261763600528).
> 
> If there is one /dev/pmem device per assignable region, then it would be
> conceivable to change its ownership so that non-root QEMU can open it.
> Or, better, the file descriptor could be passed by the toolstack via
> qmp.

Passing a file descriptor via QMP is not enough.

Let me clarify where the requirement for root/privileged permissions
comes from. The primary workflow in my design that maps a host pmem
region, or files in a host pmem region, to a guest is as follows:
 (1) QEMU in Dom0 mmaps the host pmem (the host /dev/pmem0 or files on
 /dev/pmem0) into its virtual address space, i.e. the guest virtual
 address space.
 (2) QEMU asks the Xen hypervisor to map the host physical address, i.e.
 the SPA occupied by the host pmem, to a DomU. This step requires the
 translation from the guest virtual address (where the host pmem is
 mmaped in (1)) to the host physical address. The translation can be
 done by either
(a) QEMU parsing its own /proc/self/pagemap,
 or
(b) the Xen hypervisor doing the translation by itself [1] (though
this choice is not quite doable per Konrad's comments [2]).

[1] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00434.html
[2] http://lists.xenproject.org/archives/html/xen-devel/2016-02/msg00606.html

For 2-a, reading /proc/self/pagemap requires the CAP_SYS_ADMIN capability
since Linux kernel 4.0. Furthermore, if we don't mlock the mapped host
pmem (by adding the MAP_LOCKED flag to mmap or calling mlock after mmap),
pagemap will not contain all mappings. However, mlock may require
privileged permission to lock memory larger than RLIMIT_MEMLOCK. Because
mlock operates on memory, the permission to open(2) the host pmem files
does not solve the problem, and therefore passing a file descriptor via
QMP does not help.
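
As an illustration of 2-a (a sketch only; error handling trimmed), the
per-page translation QEMU would have to do looks like:

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

static uint64_t va_to_pfn(const void *va)
{
    uint64_t entry = 0;
    long psize = sysconf(_SC_PAGESIZE);
    int fd = open("/proc/self/pagemap", O_RDONLY);

    if ( fd < 0 )
        return 0;
    /* one 64-bit entry per page, indexed by VA / page size */
    if ( pread(fd, &entry, sizeof(entry),
               ((uintptr_t)va / psize) * sizeof(entry)) != sizeof(entry) )
        entry = 0;
    close(fd);

    if ( !(entry & (1ULL << 63)) )        /* bit 63: page present */
        return 0;
    return entry & ((1ULL << 55) - 1);    /* bits 0-54: PFN (needs CAP_SYS_ADMIN) */
}

and the mapped pmem must be mlock'ed first so that the pages are actually
present.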

For 2-b, from Konrad's comments [2], mlock is also required and
privileged permission may be required consequently.

Note that the mapping and the address translation are done before QEMU
drops privileged permissions, so non-root QEMU should be able to work
with the above design until we start considering vNVDIMM hotplug (which is
not supported by the current vNVDIMM implementation in QEMU). In
the hotplug case, we may let Xen pass explicit flags to QEMU to keep it
running with root permissions.

Haozhong




Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-02 Thread Zhang, Haozhong
On 02/02/16 16:03, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, February 02, 2016 3:53 PM
> > 
> > On 02/02/16 15:48, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Tuesday, February 02, 2016 3:39 PM
> > > >
> > > > > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > > > > power off/on? I guess since Qemu process is killed the allocated pmem
> > > > > will be freed so you may switch to file-backed method to keep
> > > > > persistency (however copy would take time for large pmem trunk). Or
> > > > > will you find some way to keep pmem managed separated from qemu
> > > > > qemu life-cycle (then pmem is not efficiently reused)?
> > > > >
> > > >
> > > > It all depends on guests themselves. clwb/clflushopt/pcommit
> > > > instructions are exposed to guest that are used by guests to make
> > > > writes to pmem persistent.
> > > >
> > >
> > > I meant from guest p.o.v, a range of pmem should be persistent
> > > cross VM power on/off, i.e. the content needs to be maintained
> > > somewhere so guest can get it at next power on...
> > >
> > > Thanks
> > > Kevin
> > 
> > It's just like what we do for guest disk: as long as we always assign
> > the same host pmem device or the same files on file systems on a host
> > pmem device to the guest, the guest can find its last data on pmem.
> > 
> > Haozhong
> 
> This is the detail which I'd like to learn. If it's Qemu to request 
> host pmem and then free when exit, the very pmem may be 
> allocated to another process later. How do you achieve the 'as
> long as'?
> 

QEMU receives the following parameters

 -object memory-backend-file,id=mem1,share,mem-path=/dev/pmem0,size=10G \
 -device nvdimm,memdev=mem1,id=nv1
 
which configure a vNVDIMM device backed by the host pmem device
/dev/pmem0. The backend can also be a file on a file system on
/dev/pmem0. The system address space range occupied by /dev/pmem0 (or by
files on /dev/pmem0) is then mapped into the guest physical address
space, and all accesses from the guest are applied directly to the
host device without any interception by QEMU.

If we always provide the same vNVDIMM parameters (especially mem-path
and size), the guest observes the same vNVDIMM devices across
boots.

In Xen, I'll implement the similar configuration options in xl.cfg, e.g.
 nvdimms = [ '/dev/pmem0', '/dev/pmem1', '/mnt/pmem2/file_on_pmem2' ]

Haozhong



Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-01 Thread Zhang, Haozhong
Hi Kevin,

Thanks for your review!

On 02/02/16 14:33, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, February 01, 2016 1:44 PM
> > 
> [...]
> > 
> > 1.2 ACPI Support
> > 
> >  ACPI provides two factors of support for NVDIMM. First, NVDIMM
> >  devices are described by firmware (BIOS/EFI) to OS via ACPI-defined
> >  NVDIMM Firmware Interface Table (NFIT). Second, several functions of
> >  NVDIMM, including operations on namespace labels, S.M.A.R.T and
> >  hotplug, are provided by ACPI methods (_DSM and _FIT).
> > 
> > 1.2.1 NFIT
> > 
> >  NFIT is a new system description table added in ACPI v6 with
> >  signature "NFIT". It contains a set of structures.
> 
> Can I consider only NFIT as a minimal requirement, while other stuff
> (_DSM and _FIT) are optional?
>

No. ACPI namespace devices for NVDIMM should also be present. However,
_DSM under those ACPI namespace devices can be implemented so that it
supports no functions. _FIT is optional and is used for NVDIMM hotplug.

> > 
> > 
> > 2. NVDIMM/vNVDIMM Support in Linux Kernel/KVM/QEMU
> > 
> > 2.1 NVDIMM Driver in Linux Kernel
> > 
> [...]
> > 
> >  Userspace applications can mmap(2) the whole pmem into its own
> >  virtual address space. Linux kernel maps the system physical address
> >  space range occupied by pmem into the virtual address space, so that every
> >  normal memory loads/writes with proper flushing instructions are
> >  applied to the underlying pmem NVDIMM regions.
> > 
> >  Alternatively, a DAX file system can be made on /dev/pmemX. Files on
> >  that file system can be used in the same way as above. As Linux
> >  kernel maps the system address space range occupied by those files on
> >  NVDIMM to the virtual address space, reads/writes on those files are
> >  applied to the underlying NVDIMM regions as well.
> 
> Does it mean only file-based interface is supported by Linux today, and 
> pmem aware application cannot use normal memory allocation interface
> like malloc for the purpose?
>

right

> > 
> > 2.2 vNVDIMM Implementation in KVM/QEMU
> > 
> >  (1) Address Mapping
> > 
> >   As described before, the host Linux NVDIMM driver provides a block
> >   device interface (/dev/pmem0 at the bottom) for a pmem NVDIMM
> >   region. QEMU can than mmap(2) that device into its virtual address
> >   space (buf). QEMU is responsible to find a proper guest physical
> >   address space range that is large enough to hold /dev/pmem0. Then
> >   QEMU passes the virtual address of mmapped buf to a KVM API
> >   KVM_SET_USER_MEMORY_REGION that maps in EPT the host physical
> >   address range of buf to the guest physical address space range where
> >   the virtual pmem device will be.
> > 
> >   In this way, all guest writes/reads on the virtual pmem device is
> >   applied directly to the host one.
> > 
> >   Besides, above implementation also allows to back a virtual pmem
> >   device by a mmapped regular file or a piece of ordinary ram.
> 
> What's the point of backing pmem with ordinary ram? I can buy-in
> the value of file-backed option which although slower does sustain
> the persistency attribute. However with ram-backed method there's
> no persistency so violates guest expectation.
>

Well, it is not a necessity. The current vNVDIMM implementation in
QEMU reuses QEMU's dimm device model, which happens to support a RAM
backend. A possible use is debugging vNVDIMM on machines without NVDIMM.

> btw, how is persistency guaranteed in KVM/QEMU, cross guest 
> power off/on? I guess since Qemu process is killed the allocated pmem
> will be freed so you may switch to file-backed method to keep 
> persistency (however copy would take time for large pmem trunk). Or
> will you find some way to keep pmem managed separated from qemu
> qemu life-cycle (then pmem is not efficiently reused)?
>

It all depends on the guests themselves. The clwb/clflushopt/pcommit
instructions are exposed to the guest, which uses them to make its
writes to pmem persistent.
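
For completeness, the sequence a guest uses is just (a sketch, assuming
64-byte cache lines and a toolchain that knows the clwb/pcommit mnemonics):

#include <stddef.h>
#include <stdint.h>

static void pmem_persist(const void *addr, size_t len)
{
    const char *p = (const char *)((uintptr_t)addr & ~(uintptr_t)63);
    const char *end = (const char *)addr + len;

    for ( ; p < end; p += 64 )                  /* flush each cache line */
        asm volatile ( "clwb %0" :: "m" (*p) );
    asm volatile ( "sfence; pcommit; sfence" ::: "memory" );
}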

Haozhong

> > 3. Design of vNVDIMM in Xen
> > 
> > 3.2 Address Mapping
> > 
> > 3.2.2 Alternative Design
> > 
> >  Jan Beulich's comments [7] on my question "why must pmem resource
> >  management and partition be done in hypervisor":
> >  | Because that's where memory management belongs. And PMEM,
> >  | other than PBLK, is just another form of RAM.
> >  | ...
> >  | The main issue is that this would imo be a layering violation
> > 
> >  George Dunlap's comments [8]:
> >  | This is not the case for PMEM.  The whole point o

Re: [Xen-devel] [RFC Design Doc] Add vNVDIMM support for Xen

2016-02-01 Thread Zhang, Haozhong
On 02/02/16 15:48, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, February 02, 2016 3:39 PM
> > 
> > > btw, how is persistency guaranteed in KVM/QEMU, cross guest
> > > power off/on? I guess since Qemu process is killed the allocated pmem
> > > will be freed so you may switch to file-backed method to keep
> > > persistency (however copy would take time for large pmem trunk). Or
> > > will you find some way to keep pmem managed separated from qemu
> > > qemu life-cycle (then pmem is not efficiently reused)?
> > >
> > 
> > It all depends on guests themselves. clwb/clflushopt/pcommit
> > instructions are exposed to guest that are used by guests to make
> > writes to pmem persistent.
> > 
> 
> I meant from guest p.o.v, a range of pmem should be persistent
> cross VM power on/off, i.e. the content needs to be maintained
> somewhere so guest can get it at next power on...
> 
> Thanks
> Kevin

It's just like what we do for guest disk: as long as we always assign
the same host pmem device or the same files on file systems on a host
pmem device to the guest, the guest can find its last data on pmem.

Haozhong



Re: [Xen-devel] [PATCH 0/4] add support for vNVDIMM

2016-01-20 Thread Zhang, Haozhong
On 01/20/16 14:35, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Zhang, Haozhong wrote:
> > On 01/20/16 12:43, Stefano Stabellini wrote:
> > > On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > > > > From: Zhang, Haozhong
> > > > > Sent: Tuesday, December 29, 2015 7:32 PM
> > > > > 
> > > > > This patch series is the Xen part patch to provide virtual NVDIMM to
> > > > > guest. The corresponding QEMU patch series is sent separately with the
> > > > > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > > > > 
> > > > > * Background
> > > > > 
> > > > >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> > > > >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> > > > >  and configured by _DSM method of NVDIMM device in ACPI. Some
> > > > >  documents can be found at
> > > > >  [1] ACPI 6: 
> > > > > http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > > > >  [2] NVDIMM Namespace: 
> > > > > http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > > > >  [3] DSM Interface Example:
> > > > > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > > > >  [4] Driver Writer's Guide:
> > > > > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > > > > 
> > > > >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> > > > >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> > > > >  mapped into CPU's address space and are accessed via normal memory
> > > > >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > > > > 
> > > > >  This patch series and the corresponding QEMU patch series enable Xen
> > > > >  to provide vNVDIMM devices to HVM domains.
> > > > > 
> > > > > * Design
> > > > > 
> > > > >  Supporting vNVDIMM in PMEM mode has three requirements.
> > > > > 
> > > > 
> > > > Although this design is about vNVDIMM, some background of how pNVDIMM
> > > > is managed in Xen would be helpful to understand the whole design since
> > > > in PMEM mode you need map pNVDIMM into GFN addr space so there's
> > > > a matter of how pNVDIMM is allocated.
> > > 
> > > Yes, some background would be very helpful. Given that there are so many
> > > moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> > > I suggest that we start with a design document for this feature.
> > 
> > Let me prepare a design document. Basically, it would include the
> > following contents. Please let me know if you want anything additional
> > to be included.
> 
> Thank you!
> 
> 
> > * What NVDIMM is and how it is used
> > * Software interface of NVDIMM
> >   - ACPI NFIT: what parameters are recorded and their usage
> >   - ACPI SSDT: what _DSM methods are provided and their functionality
> >   - New instructions: clflushopt/clwb/pcommit
> > * How the linux kernel drives NVDIMM
> >   - ACPI parsing
> >   - Block device interface
> >   - Partition NVDIMM devices
> > * How KVM/QEMU implements vNVDIMM
> 
> This is a very good start.
> 
> 
> > * What I propose to implement vNVDIMM in Xen
> >   - Xen hypervisor/toolstack: new instruction enabling and address mapping
> >   - Dom0 Linux kernel: host NVDIMM driver
> >   - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping
> 
> This is OK. It might be also good to list other options that were
> discussed, but it is certainly not necessary in first instance.

I'll include them.

And one thing I missed above:
* What I propose to implement vNVDIMM in Xen
  - Building vNFIT and vSSDT: copy them from QEMU to Xen toolstack

I know it is controversial and will record other options and my reason
for this choice.

Thanks,
Haozhong



Re: [Xen-devel] [PATCH 0/4] add support for vNVDIMM

2016-01-20 Thread Zhang, Haozhong
On 01/20/16 12:43, Stefano Stabellini wrote:
> On Wed, 20 Jan 2016, Tian, Kevin wrote:
> > > From: Zhang, Haozhong
> > > Sent: Tuesday, December 29, 2015 7:32 PM
> > > 
> > > This patch series is the Xen part patch to provide virtual NVDIMM to
> > > guest. The corresponding QEMU patch series is sent separately with the
> > > title "[PATCH 0/2] add vNVDIMM support for Xen".
> > > 
> > > * Background
> > > 
> > >  NVDIMM (Non-Volatile Dual In-line Memory Module) is going to be
> > >  supported on Intel's platform. NVDIMM devices are discovered via ACPI
> > >  and configured by _DSM method of NVDIMM device in ACPI. Some
> > >  documents can be found at
> > >  [1] ACPI 6: 
> > > http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
> > >  [2] NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
> > >  [3] DSM Interface Example:
> > > http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
> > >  [4] Driver Writer's Guide:
> > > http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
> > > 
> > >  The upstream QEMU (commits 5c42eef ~ 70d1fb9) has added support to
> > >  provide virtual NVDIMM in PMEM mode, in which NVDIMM devices are
> > >  mapped into CPU's address space and are accessed via normal memory
> > >  read/write and three special instructions (clflushopt/clwb/pcommit).
> > > 
> > >  This patch series and the corresponding QEMU patch series enable Xen
> > >  to provide vNVDIMM devices to HVM domains.
> > > 
> > > * Design
> > > 
> > >  Supporting vNVDIMM in PMEM mode has three requirements.
> > > 
> > 
> > Although this design is about vNVDIMM, some background of how pNVDIMM
> > is managed in Xen would be helpful to understand the whole design since
> > in PMEM mode you need map pNVDIMM into GFN addr space so there's
> > a matter of how pNVDIMM is allocated.
> 
> Yes, some background would be very helpful. Given that there are so many
> moving parts on this (Xen, the Dom0 kernel, QEMU, hvmloader, libxl)
> I suggest that we start with a design document for this feature.

Let me prepare a design document. Basically, it would include the
following contents. Please let me know if you want anything additional
to be included.

* What NVDIMM is and how it is used
* Software interface of NVDIMM
  - ACPI NFIT: what parameters are recorded and their usage
  - ACPI SSDT: what _DSM methods are provided and their functionality
  - New instructions: clflushopt/clwb/pcommit
* How the linux kernel drives NVDIMM
  - ACPI parsing
  - Block device interface
  - Partition NVDIMM devices
* How KVM/QEMU implements vNVDIMM
* What I propose to implement vNVDIMM in Xen
  - Xen hypervisor/toolstack: new instruction enabling and address mapping
  - Dom0 Linux kernel: host NVDIMM driver
  - QEMU: virtual NFIT/SSDT, _DSM handling, and role in address mapping

Haozhong



Re: [Xen-devel] [PATCH 4/4] hvmloader: add support to load extra ACPI tables from qemu

2016-01-19 Thread Zhang, Haozhong
On 01/20/16 13:14, Tian, Kevin wrote:
> > From: Jan Beulich [mailto:jbeul...@suse.com]
> > Sent: Tuesday, January 19, 2016 7:47 PM
> > 
> > >>> On 19.01.16 at 12:37,  wrote:
> > > On Mon, Jan 18, 2016 at 01:46:29AM -0700, Jan Beulich wrote:
> > >> >>> On 18.01.16 at 01:52,  wrote:
> > >> > On 01/15/16 10:10, Jan Beulich wrote:
> > >> >> >>> On 29.12.15 at 12:31,  wrote:
> > >> >> > NVDIMM devices are detected and configured by software through
> > >> >> > ACPI. Currently, QEMU maintains ACPI tables of vNVDIMM devices. This
> > >> >> > patch extends the existing mechanism in hvmloader of loading 
> > >> >> > passthrough
> > >> >> > ACPI tables to load extra ACPI tables built by QEMU.
> > >> >>
> > >> >> Mechanically the patch looks okay, but whether it's actually needed
> > >> >> depends on whether indeed we want NV RAM managed in qemu
> > >> >> instead of in the hypervisor (where imo it belongs); I didn' see any
> > >> >> reply yet to that same comment of mine made (iirc) in the context
> > >> >> of another patch.
> > >> >
> > >> > One purpose of this patch series is to provide vNVDIMM backed by host
> > >> > NVDIMM devices. It requires some drivers to detect and manage host
> > >> > NVDIMM devices (including parsing ACPI, managing labels, etc.) that
> > >> > are not trivial, so I leave this work to the dom0 linux. Current Linux
> > >> > kernel abstract NVDIMM devices as block devices (/dev/pmemXX). QEMU
> > >> > then mmaps them into certain range of dom0's address space and asks
> > >> > Xen hypervisor to map that range of address space to a domU.
> > >> >
> > >
> > > OOI Do we have a viable solution to do all these non-trivial things in
> > > core hypervisor?  Are you proposing designing a new set of hypercalls
> > > for NVDIMM?
> > 
> > That's certainly a possibility; I lack sufficient detail to make myself
> > an opinion which route is going to be best.
> > 
> > Jan
> 
> Hi, Haozhong,
> 
> Are NVDIMM related ACPI table in plain text format, or do they require
> a ACPI parser to decode? Is there a corresponding E820 entry?
>

Most of it is in plain format, but the driver still evaluates the _FIT
(firmware interface table) method, and decoding is needed in that case.

> Above information would be useful to help decide the direction.
> 
> In a glimpse I like Jan's idea that it's better to let Xen manage NVDIMM
> since it's a type of memory resource while for memory we expect hypervisor
> to centrally manage.
> 
> However in another thought the answer is different if we view this 
> resource as a MMIO resource, similar to PCI BAR MMIO, ACPI NVS, etc.
> then it should be fine to have Dom0 manage NVDIMM then Xen just controls
> the mapping based on existing io permission mechanism.
>

It's more like an MMIO device than normal RAM.

> Another possible point for this model is that PMEM is only one mode of 
> NVDIMM device, which can be also exposed as a storage device. In the
> latter case the management has to be in Dom0. So we don't need to
> scatter the management role into Dom0/Xen based on different modes.
>

An NVDIMM device in pmem mode is exposed as a storage device (a block
device /dev/pmemXX) in Linux, and it's also used like a disk drive
(you can make a file system on it, create files on it, and even pass
files rather than a whole /dev/pmemXX to guests).
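
For example, on the Dom0 side (with a DAX-capable file system; paths are
illustrative):

 mkfs.ext4 /dev/pmem0
 mount -o dax /dev/pmem0 /mnt/pmem0
 # files under /mnt/pmem0 can then be assigned to individual guests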

> Back to your earlier questions:
> 
> > (1) The QEMU patches use xc_hvm_map_io_range_to_ioreq_server() to map
> > the host NVDIMM to domU, which results VMEXIT for every guest
> > read/write to the corresponding vNVDIMM devices. I'm going to find
> > a way to passthrough the address space range of host NVDIMM to a
> > guest domU (similarly to what xen-pt in QEMU uses)
> > 
> > (2) Xen currently does not check whether the address that QEMU asks to
> > map to domU is really within the host NVDIMM address
> > space. Therefore, Xen hypervisor needs a way to decide the host
> > NVDIMM address space which can be done by parsing ACPI NFIT
> > tables.
> 
> If you look at how ACPI OpRegion is handled for IGD passthrough:
> 
>  241 ret = xc_domain_iomem_permission(xen_xc, xen_domid,
>  242 (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  243 XEN_PCI_INTEL_OPREGION_PAGES,
>  244 XEN_PCI_INTEL_OPREGION_ENABLE_ACCESSED);
> 
>  254 ret = xc_domain_memory_mapping(xen_xc, xen_domid,
>  255 (unsigned long)(igd_guest_opregion >> XC_PAGE_SHIFT),
>  256 (unsigned long)(igd_host_opregion >> XC_PAGE_SHIFT),
>  257 XEN_PCI_INTEL_OPREGION_PAGES,
>  258 DPCI_ADD_MAPPING);
>

Yes, I've noticed these two functions. The additional work would be
adding new ones that can accept virtual addresses, as QEMU has no easy
way to get the physical address of /dev/pmemXX and can only mmap it
into its virtual address space.
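
To make the constraint concrete (a sketch; error handling omitted), all
QEMU has after setting up the backend is a virtual address:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

static void *map_pmem(const char *path, size_t size)
{
    int fd = open(path, O_RDWR);

    if ( fd < 0 )
        return MAP_FAILED;
    /* the return value is a Dom0 *virtual* address; the SPA behind it is
     * unknown to QEMU, so an interface taking machine frames cannot be
     * called from here directly */
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}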

> Above can address your 2 questions. Xen doesn't need to tell exactly
> whether the 

Re: [Xen-devel] [PATCH v2 2/2] x86/hvm: add support for pcommit instruction

2016-01-04 Thread Zhang, Haozhong
On 01/05/16 15:19, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Tuesday, January 05, 2016 3:15 PM
> > 
> > On 01/05/16 15:08, Tian, Kevin wrote:
> > > > From: Zhang, Haozhong
> > > > Sent: Wednesday, December 30, 2015 7:49 PM
> > > >
> > > > Pass PCOMMIT CPU feature into HVM domain. Currently, we do not intercept
> > > > pcommit instruction for L1 guest, and allow L1 to intercept pcommit
> > > > instruction for L2 guest.
> > >
> > > Could you elaborate why different policies are used for L1/L2? And better
> > > add a comment in code (at least for vvmx) to describe the intention.
> > >
> > > Thanks
> > > Kevin
> > 
> > The intention is that we completely expose pcommit (both the
> > instruction and VMEXIT caused by pcommit) to L1.
> > 
> > Haozhong
> 
> My question is why pcommit can't not be directly in L2 w/o interception?
> 
> Thanks
> Kevin
> 

Isn't it because L1 hypervisor may decide to intercept L2 pcommit?

Haozhong



Re: [Xen-devel] [PATCH v2 2/2] x86/hvm: add support for pcommit instruction

2016-01-04 Thread Zhang, Haozhong
On 01/05/16 15:08, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Wednesday, December 30, 2015 7:49 PM
> > 
> > Pass PCOMMIT CPU feature into HVM domain. Currently, we do not intercept
> > pcommit instruction for L1 guest, and allow L1 to intercept pcommit
> > instruction for L2 guest.
> 
> Could you elaborate why different policies are used for L1/L2? And better
> add a comment in code (at least for vvmx) to describe the intention.
> 
> Thanks
> Kevin

The intention is that we completely expose pcommit (both the
instruction and VMEXIT caused by pcommit) to L1.

Haozhong



Re: [Xen-devel] [PATCH v2 09/14] x86/hvm: Setup TSC scaling ratio

2015-12-10 Thread Zhang, Haozhong
On 12/10/15 18:27, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, December 07, 2015 4:59 AM
> > 
> > This patch adds a field tsc_scaling_ratio in struct hvm_vcpu to
> > record the TSC scaling ratio, and sets it up when tsc_set_info() is
> > called for a vcpu or when a vcpu is restored or reset.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  xen/arch/x86/hvm/hvm.c| 30 ++
> >  xen/arch/x86/hvm/svm/svm.c|  6 --
> >  xen/arch/x86/time.c   | 13 -
> >  xen/include/asm-x86/hvm/hvm.h |  5 +
> >  xen/include/asm-x86/hvm/svm/svm.h |  3 ---
> >  xen/include/asm-x86/hvm/vcpu.h|  2 ++
> >  xen/include/asm-x86/math64.h  | 30 ++
> >  7 files changed, 83 insertions(+), 6 deletions(-)
> >  create mode 100644 xen/include/asm-x86/math64.h
> > 
> > diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
> > index 0e63c33..52a0ef8 100644
> > --- a/xen/arch/x86/hvm/hvm.c
> > +++ b/xen/arch/x86/hvm/hvm.c
> > @@ -65,6 +65,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -301,6 +302,29 @@ int hvm_set_guest_pat(struct vcpu *v, u64 guest_pat)
> >  return 1;
> >  }
> > 
> > +void hvm_setup_tsc_scaling(struct vcpu *v)
> > +{
> > +u64 ratio;
> > +
> > +if ( !hvm_funcs.tsc_scaling_supported )
> > +return;
> > +
> > +/*
> > + * The multiplication of the first two terms may overflow a 64-bit
> > + * integer, so use mul_u64_u32_div() instead to keep precision.
> > + */
> > +ratio = mul_u64_u32_div(1ULL << hvm_funcs.tsc_scaling_ratio_frac_bits,
> > +v->domain->arch.tsc_khz, cpu_khz);
> > +
> > +if ( ratio == 0 || ratio > hvm_funcs.max_tsc_scaling_ratio )
> > +return;
> 
> How will you check such error in other places? tsc_scaling_ratio is
> left w/ default value, while if you don't detect the issue that that
> ratio will be used for wrong scale...
>

The intention here is to fall back to the default ratio so that it
works as if no TSC scaling were used. However, I forgot to also fall
back v->domain->arch.tsc_khz and others to their default values (i.e. the
values used when there is no TSC scaling). I'll add that in the next version.

Haozhong



Re: [Xen-devel] [PATCH v2 14/14] docs: Add descriptions of TSC scaling in xl.cfg and tscmode.txt

2015-12-10 Thread Zhang, Haozhong
On 12/10/15 18:40, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, December 07, 2015 4:59 AM
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> > ---
> >  docs/man/xl.cfg.pod.5 | 15 ++-
> >  docs/misc/tscmode.txt | 14 ++
> >  2 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> > index 2aca8dd..7e19a9b 100644
> > --- a/docs/man/xl.cfg.pod.5
> > +++ b/docs/man/xl.cfg.pod.5
> > @@ -1313,9 +1313,18 @@ deprecated. Options are:
> > 
> >  =item B<"default">
> > 
> > -Guest rdtsc/p executed natively when monotonicity can be guaranteed
> > +Guest rdtsc/p is executed natively when monotonicity can be guaranteed
> >  and emulated otherwise (with frequency scaled if necessary).
> > 
> > +If a HVM container in B TSC mode is not migrated from other hosts
> 
> "migrated from" -> "migrated to"?
>

I mean "migrated from" here. If the current host supports TSC scaling
and a domain is migrated from another host w/ different host TSC
frequency, then domain may have a different guest TSC frequency than
the current host. Thus, "not migrated from other hosts" is used here
to eliminate such case.

> > +and the host TSC monotonicity can be guaranteed, the guest and host TSC
> > +frequencies will be the same.
> > +
> > +If a HVM container in B TSC mode is migrated to a host that can
> > +guarantee the TSC monotonicity and supports Intel VMX TSC scaling/AMD SVM
> 
> and -> or? Do we think TSC scaling a must to ensure TSC monotonicity? It comes
> to the rescue only when host can't ensure monotonicity...
>

No, I intend to describe the guest behavior when hardware TSC scaling is used.

Really I should say "_host_ TSC monotonicity" here.

> > +TSC ratio, guest rdtsc/p will still execute natively after migration and 
> > the
> > +guest TSC frequencies before and after migration will be the same.
> 
> will be the same before and after migration.
>

will modify in the next version.

> > +
> >  =item B<"always_emulate">
> > 
> >  Guest rdtsc/p always emulated at 1GHz (kernel and user). Guest rdtsc/p
> > @@ -1337,6 +1346,10 @@ determine when a restore/migration has occurred and
> > assumes guest
> >  obtains/uses pvclock-like mechanism to adjust for monotonicity and
> >  frequency changes.
> > 
> > +If a HVM container in B TSC mode can execute both guest
> > +rdtsc and guest rdtscp natively, then the guest TSC frequency will be
> > +determined in the similar way to that of B TSC mode.
> > +
> >  =back
> > 
> >  Please see F for more information on this option.
> > diff --git a/docs/misc/tscmode.txt b/docs/misc/tscmode.txt
> > index e8c84e8..f3b70be 100644
> > --- a/docs/misc/tscmode.txt
> > +++ b/docs/misc/tscmode.txt
> > @@ -297,3 +297,17 @@ and also much faster than nearly all OS-provided time
> > mechanisms.
> >  While pvrtscp is too complex for most apps, certain enterprise
> >  TSC-sensitive high-TSC-frequency apps may find it useful to
> >  obtain a significant performance gain.
> > +
> > +Hardware TSC Scaling
> > +
> > +Intel VMX TSC scaling and AMD SVM TSC ratio allow the guest TSC read
> > +by guest rdtsc/p increasing in the different frequency than the host
> 
> "in the different" -> "in a different"
>

will modify

> > +TSC frequency.
> > +
> > +For a HVM container is in default TSC mode (tsc_mode=0) or PVRDTSCP
> 
> For a HVM container *which* is
>

stupid error... will modify

> > +mode (tsc_mode=3) and can execute both guest rdtsc and rdtscp
> > +natively, if it is not migrated from other hosts, the guest and host
> > +TSC frequencies will be the same. 
> 
> "the guest and host TSC frequencies remain the same if the guest is
> not migrated to other host."
> 
> and the condition is that the host supports constant TSC feature.
>

Yes, I'll modify in the next version.

Thanks,
Haozhong

> > If it is migrated to a host
> > +supporting Intel VMX TSC scaling/AMD SVM TSC ratio and can still
> > +execute guest rdtsc and rdtscp natively, the guest TSC frequencies
> > +before and after migration will be the same.
> > --
> > 2.6.3
> 



Re: [Xen-devel] [PATCH v2 08/14] x86/hvm: Collect information of TSC scaling ratio

2015-12-10 Thread Zhang, Haozhong
On 12/10/15 18:19, Tian, Kevin wrote:
> > From: Zhang, Haozhong
> > Sent: Monday, December 07, 2015 4:59 AM
> > 
> > Both VMX TSC scaling and SVM TSC ratio use the 64-bit TSC scaling ratio,
> > but the number of fractional bits of the ratio is different between VMX
> > and SVM. This patch adds the architecture code to collect the number of
> > fractional bits and other related information into fields of struct
> > hvm_function_table so that they can be used in the common code.
> > 
> > Signed-off-by: Haozhong Zhang <haozhong.zh...@intel.com>
> 
> Reviewed-by: Kevin Tian <kevin.t...@intel.com>, with one comment
> 
> 
> > diff --git a/xen/include/asm-x86/hvm/hvm.h b/xen/include/asm-x86/hvm/hvm.h
> > index aba63ab..8b10a67 100644
> > --- a/xen/include/asm-x86/hvm/hvm.h
> > +++ b/xen/include/asm-x86/hvm/hvm.h
> > @@ -100,6 +100,18 @@ struct hvm_function_table {
> >  unsigned int hap_capabilities;
> > 
> >  /*
> > + * Parameters of hardware-assisted TSC scaling.
> > + */
> > +/* is TSC scaling supported? */
> > +bool_t   tsc_scaling_supported;
> > +/* number of bits of the fractional part of TSC scaling ratio */
> > +uint8_t  tsc_scaling_ratio_frac_bits;
> > +/* default TSC scaling ratio (no scaling) */
> > +uint64_t default_tsc_scaling_ratio;
> > +/* maxmimum-allowed TSC scaling ratio */
> 
> maxmimum -> maximum

will fix in the next version

Thanks,
Haozhong

> 
> > +uint64_t max_tsc_scaling_ratio;
> > +
> > +/*
> >   * Initialise/destroy HVM domain/vcpu resources
> >   */
> >  int  (*domain_initialise)(struct domain *d);
> > --
> > 2.6.3
> 
