Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-20 Thread Andrew Cooper
On 20/10/2016 10:14, Haozhong Zhang wrote:
>
>
>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
>> work
>> and figure out what is on the DIMM, and which areas are safe to use.
> I don't understand this ordering of events.  Dom0 needs to have a
> mapping to even write the on-media structure to indicate a
> reservation.  So, initial dom0 access can't depend on metadata
> reservation already being present.

 I agree.

 Overall, I think the following is needed.

 * Xen starts up.
 ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
 needs to note this information somehow.
 ** Xen might find some Type 7 E820 regions, and needs to note this
 information somehow.
>>>
>>> IIUC, this is to collect MFNs and no need to create frame table and
>>> M2P at this stage. If so, what is different from ...
>>>
 * Xen starts dom0.
 * Once OSPM is running, a Xen component in Linux needs to collect and
 report all NVDIMM SPA/MFN regions it knowns about.
 ** This covers the AML-only case, and the hotplug case.
>>>
>>> ... the MFNs reported here, especially that the former is a subset
>>> (hotplug ones not included in the former) of latter.
>>
>> Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
>> when it is capable of working things out itself, (which can aid with
>> debugging one half of this arrangement).  Also, the MFNS found by Xen
>> alone can be present in the default memory map for dom0.
>>
>
> Sure, I'll add code to parsing NFIT in Xen to discover statically
> plugged pmem mode NVDIMM and their MFNs.
>
> By the default memory map for dom0, do you mean making
> XENMEM_memory_map returns above MFNs in Dom0 E820?

Potentially, yes.  Particularly if type 7 is reserved for NVDIMM, it
would be good to report this information properly.

>
>>>
>>> (There is no E820 hole or SRAT entries to tell which address range is
>>> reserved for hotplugged NVDIMM)
>>>
 * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>>>
>>> Two questions:
>>> 1. Why is this request necessary? Even without such requests like what
>>>   my current implementation, Dom0 can still access NVDIMM.
>>
>> Can it?  (if so, great, but I don't think this holds in the general
>> case.)  Is that a side effect of the NVDIMM being covered by a hole in
>> the E820?
>
> In my development environment, NVDIMM MFNs are not covered by any E820
> entry and appear after RAM MFNs.
>
> Can you explain more about this point? Why can it work if covered by
> E820 hole?

It is a question, not a statement.  If things currently work fine then
great.  However,  there does seem to be a lot of flexibility in how the
regions are reported, so please be mindful to this when developing the code.

>
>>
>>>
>>> 2. Who initiates the requests? If it's the libnvdimm driver, that
>>>   means we still need to introduce Xen specific code to the driver.
>>>
>>>   Or the requests are issued by OSPM (or the Xen component you
>>>   mentioned above) when they probe new dimms?
>>>
>>>   For the latter, Dan, do you think it's acceptable in NFIT code to
>>>   call the Xen component to request the access permission of the pmem
>>>   regions, e.g. in apic_nfit_insert_resource(). Of course, it's only
>>>   used for Dom0 case.
>>
>> The libnvdimm driver should continue to use ioremap() or whatever it
>> currently does.  There shouldn't be Xen modifications like that.
>>
>> The one issue will come if libnvdimm tries to ioremap()/other an area
>> which Xen is unaware is an NVDIMM, and rejects the mapping request.
>> Somehow, a Xen component will need to find the MFN/SPA layout and
>> register this information with Xen, before the ioremap() call made by
>> the libnvdimm driver.  Perhaps a notifier mechanism out from the ACPI
>> subsystem might be the best way to make this work in a clean way.
>>
>
> Yes, this is necessary for hotplugged NVDIMM.

Ok.

>
>>>
 ** This should work, as Xen is aware that there is something there
 to be
 mapped (rather than just empty physical address space).
 * Dom0 finds that some NVDIMM ranges are now available for use
 (probably
 modelled as hotplug events).
 * /dev/pmem $STUFF starts happening as normal.

 At some pointer later after dom0 policy decisions are made
 (ultimately,
 by the host administrator):
 * If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
 Xen of the SPA/MFN regions which are safe to use.
 * Xen then incorporates these regions into its idea of RAM, and starts
 using them for whatever.

>>>
>>> Agree. I think we may not need to fix the way/format/... to make the
>>> reservation, and instead let the users (host administrators), who have
>>> better understanding of their data, make the proper decision.
>>
>> Yes.  This is the best course of action.
>>
>>>
>>> In a worse case that no reservation is made, Xen hypervisor could 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-20 Thread Haozhong Zhang

On 10/14/16 13:18 +0100, Andrew Cooper wrote:

On 14/10/16 08:08, Haozhong Zhang wrote:

On 10/13/16 20:33 +0100, Andrew Cooper wrote:

On 13/10/16 19:59, Dan Williams wrote:

On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
 wrote:

On 13/10/16 16:40, Dan Williams wrote:

On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich 
wrote:
[..]

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Well, my opinion certainly doesn't count much here, but I
continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other
applications of
persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and
dom0 is
a VM running in ring1/3 with the nvdimm driver.  This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen.

This is where my understanding the Xen model breaks down.  Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?


No.  I am trying to point out that the current suggestion wont work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.




Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
work
and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events.  Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation.  So, initial dom0 access can't depend on metadata
reservation already being present.


I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
information somehow.


IIUC, this is to collect MFNs and no need to create frame table and
M2P at this stage. If so, what is different from ...


* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
report all NVDIMM SPA/MFN regions it knowns about.
** This covers the AML-only case, and the hotplug case.


... the MFNs reported here, especially that the former is a subset
(hotplug ones not included in the former) of latter.


Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
when it is capable of working things out itself, (which can aid with
debugging one half of this arrangement).  Also, the MFNS found by Xen
alone can be present in the default memory map for dom0.



Sure, I'll add code to parsing NFIT in Xen to discover statically
plugged pmem mode NVDIMM and their MFNs.

By the default memory map for dom0, do you mean making
XENMEM_memory_map returns above MFNs in Dom0 E820?



(There is no E820 hole or SRAT entries to tell which address range is
reserved for hotplugged NVDIMM)


* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.


Two questions:
1. Why is this request necessary? Even without such requests like what
  my current implementation, Dom0 can still access NVDIMM.


Can it?  (if so, great, but I don't think this holds in the general
case.)  Is that a side effect of the NVDIMM being covered by a hole in
the E820?


In my development environment, NVDIMM MFNs are not covered by any E820
entry and appear after RAM MFNs.

Can you explain more about this point? Why can it work if covered by
E820 hole?


The current logic for what dom0 may access by default is
somewhat ad-hoc, and I have a gut feeling that it won't work with E820
type 7 regions.



  Or do you mean Xen hypervisor should by default disallow Dom0 to
  access MFNs reported in previous step until they are requested?


No - I am not suggesting this.



2. Who initiates the requests? If it's the libnvdimm driver, that
  means we still need to introduce Xen specific code to the driver.

  Or the requests are issued by OSPM (or the Xen component you
  mentioned above) when they probe new dimms?

  For the latter, Dan, do you think it's acceptable in NFIT code to
  call the Xen component to request the access permission of the pmem
  regions, e.g. in apic_nfit_insert_resource(). Of 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-20 Thread Haozhong Zhang

On 10/14/16 04:16 -0600, Jan Beulich wrote:

On 13.10.16 at 17:46,  wrote:

On 10/13/16 03:08 -0600, Jan Beulich wrote:

On 13.10.16 at 10:53,  wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
\_ ___/
  V
 /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.


Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel.


No such independent concept seems exist right now. It may be hard to
define such concept, because it's hard to know the common requirements
(e.g. size/alignment/...)  from ALL OSes. Making each component to
maintain its own reservation in its own way seems more flexible.


Or if we were to consider Dom0

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-14 Thread Andrew Cooper
On 14/10/16 08:08, Haozhong Zhang wrote:
> On 10/13/16 20:33 +0100, Andrew Cooper wrote:
>> On 13/10/16 19:59, Dan Williams wrote:
>>> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
>>>  wrote:
 On 13/10/16 16:40, Dan Williams wrote:
> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich 
> wrote:
> [..]
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>> Well, my opinion certainly doesn't count much here, but I
>> continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and in the Xen case this could then be shared
>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>> "just a guest", things should even be the other way around: Xen gets
>> all of the OS reserved space, and Dom0 needs something custom.
> You haven't made the case why Xen is special and other
> applications of
> persistent memory are not.
 In a Xen system, Xen runs in the baremetal root-mode ring0, and
 dom0 is
 a VM running in ring1/3 with the nvdimm driver.  This is the opposite
 way around to the KVM model.

 Dom0, being the hardware domain, has default ownership of all the
 hardware, but to gain access in the first place, it must request a
 mapping from Xen.
>>> This is where my understanding the Xen model breaks down.  Are you
>>> saying dom0 can't access the persistent memory range unless the ring0
>>> agent has metadata storage space for tracking what it maps into dom0?
>>
>> No.  I am trying to point out that the current suggestion wont work, and
>> needs re-designing.
>>
>> Xen *must* be able to properly configure mappings of the NVDIMM for
>> dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
>> corruption will occur.
>>
>> Whether this means no Xen metadata, or the metadata living elsewhere in
>> regular ram, such as the main frametable, is an implementation detail.
>>
>>>
 Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
 work
 and figure out what is on the DIMM, and which areas are safe to use.
>>> I don't understand this ordering of events.  Dom0 needs to have a
>>> mapping to even write the on-media structure to indicate a
>>> reservation.  So, initial dom0 access can't depend on metadata
>>> reservation already being present.
>>
>> I agree.
>>
>> Overall, I think the following is needed.
>>
>> * Xen starts up.
>> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
>> needs to note this information somehow.
>> ** Xen might find some Type 7 E820 regions, and needs to note this
>> information somehow.
>
> IIUC, this is to collect MFNs and no need to create frame table and
> M2P at this stage. If so, what is different from ...
>
>> * Xen starts dom0.
>> * Once OSPM is running, a Xen component in Linux needs to collect and
>> report all NVDIMM SPA/MFN regions it knowns about.
>> ** This covers the AML-only case, and the hotplug case.
>
> ... the MFNs reported here, especially that the former is a subset
> (hotplug ones not included in the former) of latter.

Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
when it is capable of working things out itself, (which can aid with
debugging one half of this arrangement).  Also, the MFNS found by Xen
alone can be present in the default memory map for dom0.

>
> (There is no E820 hole or SRAT entries to tell which address range is
> reserved for hotplugged NVDIMM)
>
>> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>
> Two questions:
> 1. Why is this request necessary? Even without such requests like what
>   my current implementation, Dom0 can still access NVDIMM.

Can it?  (if so, great, but I don't think this holds in the general
case.)  Is that a side effect of the NVDIMM being covered by a hole in
the E820?  The current logic for what dom0 may access by default is
somewhat ad-hoc, and I have a gut feeling that it won't work with E820
type 7 regions.

>
>   Or do you mean Xen hypervisor should by default disallow Dom0 to
>   access MFNs reported in previous step until they are requested?

No - I am not suggesting this.

>
> 2. Who initiates the requests? If it's the libnvdimm driver, that
>   means we still need to introduce Xen specific code to the driver.
>
>   Or the requests are issued by OSPM (or the Xen component you
>   mentioned above) when they probe new dimms?
>
>   For the latter, Dan, do you think it's acceptable in NFIT code to
>   call the Xen component to request the access permission of the pmem
>   regions, e.g. in apic_nfit_insert_resource(). Of course, it's only
>   used for Dom0 case.

The libnvdimm driver should continue to use ioremap() or whatever it
currently does.  There shouldn't be 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-14 Thread Jan Beulich
>>> On 13.10.16 at 17:46,  wrote:
> On 10/13/16 03:08 -0600, Jan Beulich wrote:
> On 13.10.16 at 10:53,  wrote:
>>> On 10/13/16 02:34 -0600, Jan Beulich wrote:
>>> On 12.10.16 at 18:19,  wrote:
> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:
> On 12.10.16 at 17:42,  wrote:
>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
>>> On 12.10.16 at 16:58,  wrote:
> On 10/12/16 05:32 -0600, Jan Beulich wrote:
> On 12.10.16 at 12:33,  wrote:
>>> The layout is shown as the following diagram.
>>>
>>> +---+---+---+--+--+
>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>> |  by kernel|   Table   | Block | for Xen  |  |
>>> +---+---+---+--+--+
>>> \_ ___/
>>>   V
>>>  /dev/pmem0
>>
>>I have to admit that I dislike this, for not being OS-agnostic.
>>Neither should there be any Xen-specific region, nor should the
>>"whatever used by kernel" one be restricted to just Linux. What
>>I could see is an OS-reserved area ahead of the partition table,
>>the exact usage of which depends on which OS is currently
>>running (and in the Xen case this might be both Xen _and_ the
>>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>running under Xen, the Dom0 may not have a need for as much
>>control data as it has when running on bare hardware, for it
>>controlling less (if any) of the actual memory ranges when Xen
>>is present.
>>
>
> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
> to know where the reserved area is?  Or do you mean it's not if it's
> defined by a protocol that is accepted by all OSes?

 The latter - we clearly won't get away without some agreement on
 where to retrieve position and size of this area. I was simply
 assuming that such a protocol already exists.

>>>
>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>> may actively use with the Xen reservation that the Dom0 kernel does
>>> not consume.  Explain again what is wrong with the partition approach?
>>
>> Not sure what was unclear in my previous reply. I don't think there
>> should be apriori knowledge of whether Xen is (going to be) used on
>> a system, and even if it gets used, but just occasionally, it would
>> (apart from the abstract considerations already given) be a waste
>> of resources to set something aside that could be used for other
>> purposes while Xen is not running. Static partitioning should only be
>> needed for persistent data.
>
> The reservation needs to be persistent / static even if the data is
> volatile, as is the case with struct page, because we can't have the
> size of the device change depending on use.  So, from the aspect of
> wasting space while Xen is not in use, both partitions and the
> intrinsic reservation approach suffer the same problem. Setting that
> aside I don't want to mix 2 different use cases into the same
> reservation.

Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.

>>>
>>> I think Dan means that there should be a clear separation between
>>> reservations for different usages (kernel/xen/...). The libnvdimm
>>> driver is for the linux kernel and only needs to maintain the
>>> reservation for kernel functionality. For others including xen/dm/...,
>>> if they want reservation for their own purpose, they should maintain
>>> their own reservations out of libnvdimm driver and avoid bothering the
>>> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
>>>
>>> IIUC, one existing example is device-mapper device (dm) which needs to
>>> reserve on-device area for its own meta-data. Its choice is to store
>>> the meta-data on the block device (/dev/pmemN) provided by the
>>> libnvdimm driver.
>>>
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>>
>>Well, my opinion certainly doesn't 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-14 Thread Jan Beulich
>>> On 13.10.16 at 17:40,  wrote:
> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
> [..]
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>>
>> Well, my opinion certainly doesn't count much here, but I continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and in the Xen case this could then be shared
>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>> "just a guest", things should even be the other way around: Xen gets
>> all of the OS reserved space, and Dom0 needs something custom.
> 
> You haven't made the case why Xen is special and other applications of
> persistent memory are not.

Well, I'm implying this from there being a special Linux reservation.
Xen (as explained by Andrew) sitting underneath the Dom0 kernel
(other than ...

>  The current struct page reservation
> supports fundamental address-ability of persistent memory namespaces
> for the rest of the kernel.  The Xen reservation is application
> specific.  XFS, EXT4, and DM also have application specific usages of
> persistent memory and consume metadata space out of a block device. If
> we don't need an XFS-mode nvdimm device, why do we need Xen-mode?

... all the examples you give) by implication is special then too. If
you made the kernel be no different than the other examples you
give, Xen probably shouldn't be any different anymore either.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-14 Thread Haozhong Zhang

On 10/13/16 20:33 +0100, Andrew Cooper wrote:

On 13/10/16 19:59, Dan Williams wrote:

On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
 wrote:

On 13/10/16 16:40, Dan Williams wrote:

On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
[..]

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other applications of
persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
a VM running in ring1/3 with the nvdimm driver.  This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen.

This is where my understanding the Xen model breaks down.  Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?


No.  I am trying to point out that the current suggestion wont work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.




Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events.  Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation.  So, initial dom0 access can't depend on metadata
reservation already being present.


I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
information somehow.


IIUC, this is to collect MFNs and no need to create frame table and
M2P at this stage. If so, what is different from ...


* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
report all NVDIMM SPA/MFN regions it knowns about.
** This covers the AML-only case, and the hotplug case.


... the MFNs reported here, especially that the former is a subset
(hotplug ones not included in the former) of latter.

(There is no E820 hole or SRAT entries to tell which address range is
reserved for hotplugged NVDIMM)


* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.


Two questions:
1. Why is this request necessary? Even without such requests like what
  my current implementation, Dom0 can still access NVDIMM.

  Or do you mean Xen hypervisor should by default disallow Dom0 to
  access MFNs reported in previous step until they are requested?

2. Who initiates the requests? If it's the libnvdimm driver, that
  means we still need to introduce Xen specific code to the driver.

  Or the requests are issued by OSPM (or the Xen component you
  mentioned above) when they probe new dimms?

  For the latter, Dan, do you think it's acceptable in NFIT code to
  call the Xen component to request the access permission of the pmem
  regions, e.g. in apic_nfit_insert_resource(). Of course, it's only
  used for Dom0 case.


** This should work, as Xen is aware that there is something there to be
mapped (rather than just empty physical address space).
* Dom0 finds that some NVDIMM ranges are now available for use (probably
modelled as hotplug events).
* /dev/pmem $STUFF starts happening as normal.

At some pointer later after dom0 policy decisions are made (ultimately,
by the host administrator):
* If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
Xen of the SPA/MFN regions which are safe to use.
* Xen then incorporates these regions into its idea of RAM, and starts
using them for whatever.



Agree. I think we may not need to fix the way/format/... to make the
reservation, and instead let the users (host administrators), who have
better understanding of their data, make the proper decision.

In a worse case that no reservation is made, Xen hypervisor could turn
to use RAM for management structures for NVDIMM, with the cost of less
RAM for guests.

Thanks,
Haozhong


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Andrew Cooper
On 13/10/16 19:59, Dan Williams wrote:
> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
>  wrote:
>> On 13/10/16 16:40, Dan Williams wrote:
>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
>>> [..]
> I think we can do the similar for Xen, like to lay another pseudo
> device on /dev/pmem and do the reservation, like 2. in my previous
> reply.
 Well, my opinion certainly doesn't count much here, but I continue to
 consider this a bad idea. For entities like drivers it may well be
 appropriate, but I think there ought to be an independent concept
 of "OS reserved", and in the Xen case this could then be shared
 between hypervisor and Dom0 kernel. Or if we were to consider Dom0
 "just a guest", things should even be the other way around: Xen gets
 all of the OS reserved space, and Dom0 needs something custom.
>>> You haven't made the case why Xen is special and other applications of
>>> persistent memory are not.
>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
>> a VM running in ring1/3 with the nvdimm driver.  This is the opposite
>> way around to the KVM model.
>>
>> Dom0, being the hardware domain, has default ownership of all the
>> hardware, but to gain access in the first place, it must request a
>> mapping from Xen.
> This is where my understanding the Xen model breaks down.  Are you
> saying dom0 can't access the persistent memory range unless the ring0
> agent has metadata storage space for tracking what it maps into dom0?

No.  I am trying to point out that the current suggestion wont work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.

>
>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
>> and figure out what is on the DIMM, and which areas are safe to use.
> I don't understand this ordering of events.  Dom0 needs to have a
> mapping to even write the on-media structure to indicate a
> reservation.  So, initial dom0 access can't depend on metadata
> reservation already being present.

I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
information somehow.
* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
report all NVDIMM SPA/MFN regions it knowns about.
** This covers the AML-only case, and the hotplug case.
* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
** This should work, as Xen is aware that there is something there to be
mapped (rather than just empty physical address space).
* Dom0 finds that some NVDIMM ranges are now available for use (probably
modelled as hotplug events).
* /dev/pmem $STUFF starts happening as normal.

At some pointer later after dom0 policy decisions are made (ultimately,
by the host administrator):
* If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
Xen of the SPA/MFN regions which are safe to use.
* Xen then incorporates these regions into its idea of RAM, and starts
using them for whatever.

~Andrew


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Dan Williams
On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper
 wrote:
> On 13/10/16 16:40, Dan Williams wrote:
>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
>> [..]
 I think we can do the similar for Xen, like to lay another pseudo
 device on /dev/pmem and do the reservation, like 2. in my previous
 reply.
>>> Well, my opinion certainly doesn't count much here, but I continue to
>>> consider this a bad idea. For entities like drivers it may well be
>>> appropriate, but I think there ought to be an independent concept
>>> of "OS reserved", and in the Xen case this could then be shared
>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>> "just a guest", things should even be the other way around: Xen gets
>>> all of the OS reserved space, and Dom0 needs something custom.
>> You haven't made the case why Xen is special and other applications of
>> persistent memory are not.
>
> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
> a VM running in ring1/3 with the nvdimm driver.  This is the opposite
> way around to the KVM model.
>
> Dom0, being the hardware domain, has default ownership of all the
> hardware, but to gain access in the first place, it must request a
> mapping from Xen.

This is where my understanding the Xen model breaks down.  Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?
That can't be true because then PCI memory ranges would not work
without metadata reserve space.  Dom0 still needs to map and write the
DIMMs to even set up the struct page reservation, it isn't established
by default.

> Xen therefore needs to know and cope with being able
> to give dom0 a mapping to the nvdimms, without touching the content of
> the nvidmm itself (so as to avoid corrupting data).

Is it true that this metadata only comes into use when remapping the
dom0 discovered range(s) into a guest VM?

> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
> and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events.  Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation.  So, initial dom0 access can't depend on metadata
reservation already being present.

> At this point, a Xen subsystem in Linux could choose one or more areas
> to hand back to the hypervisor to use as RAM/other.

To me all this configuration seems to come after the fact.  After dom0
sees /dev/pmemX devices, then it can go to work carving it up and
writing Xen specific metadata to the range(s).  The struct page
reservation never comes into the picture.  In fact, a raw mode
namespace (one without a reservation) could be used in this model, the
nvdimm core never needs to know what is happening.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Andrew Cooper
On 13/10/16 16:40, Dan Williams wrote:
> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
> [..]
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>> Well, my opinion certainly doesn't count much here, but I continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and in the Xen case this could then be shared
>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>> "just a guest", things should even be the other way around: Xen gets
>> all of the OS reserved space, and Dom0 needs something custom.
> You haven't made the case why Xen is special and other applications of
> persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
a VM running in ring1/3 with the nvdimm driver.  This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen.  Xen therefore needs to know and cope with being able
to give dom0 a mapping to the nvdimms, without touching the content of
the nvidmm itself (so as to avoid corrupting data).

Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
and figure out what is on the DIMM, and which areas are safe to use.

At this point, a Xen subsystem in Linux could choose one or more areas
to hand back to the hypervisor to use as RAM/other.

~Andrew


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Haozhong Zhang

On 10/13/16 03:08 -0600, Jan Beulich wrote:

On 13.10.16 at 10:53,  wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
\_ ___/
  V
 /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.


Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel.


No such independent concept seems exist right now. It may be hard to
define such concept, because it's hard to know the common requirements
(e.g. size/alignment/...)  from ALL OSes. Making each component to
maintain its own reservation in its own way seems more flexible.


Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Dan Williams
On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich  wrote:
[..]
>> I think we can do the similar for Xen, like to lay another pseudo
>> device on /dev/pmem and do the reservation, like 2. in my previous
>> reply.
>
> Well, my opinion certainly doesn't count much here, but I continue to
> consider this a bad idea. For entities like drivers it may well be
> appropriate, but I think there ought to be an independent concept
> of "OS reserved", and in the Xen case this could then be shared
> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
> "just a guest", things should even be the other way around: Xen gets
> all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other applications of
persistent memory are not.  The current struct page reservation
supports fundamental address-ability of persistent memory namespaces
for the rest of the kernel.  The Xen reservation is application
specific.  XFS, EXT4, and DM also have application specific usages of
persistent memory and consume metadata space out of a block device. If
we don't need an XFS-mode nvdimm device, why do we need Xen-mode?


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Haozhong Zhang

+Dan Williams

I accidentally dropped him in my last reply. Add him back.

On 10/13/16 16:53 +0800, Haozhong Zhang wrote:

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
   \_ ___/
 V
/dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Thanks,
Haozhong


The kernel needs to know about the struct page reservation because it
needs to manage the lifetime of page references vs the lifetime of the
device.  It does not have the same relationship with a Xen reservation
which is why I'm proposing they be managed separately.


I don't think I understand the difference you try to point out here.
Linux'es struct page and Xen's struct page_info serve the same
fundamental purpose.

Jan


___
Linux-nvdimm mailing list
linux-nvd...@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Jan Beulich
>>> On 13.10.16 at 10:53,  wrote:
> On 10/13/16 02:34 -0600, Jan Beulich wrote:
> On 12.10.16 at 18:19,  wrote:
>>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:
>>> On 12.10.16 at 17:42,  wrote:
> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
> On 12.10.16 at 16:58,  wrote:
>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>> On 12.10.16 at 12:33,  wrote:
> The layout is shown as the following diagram.
>
> +---+---+---+--+--+
> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
> |  by kernel|   Table   | Block | for Xen  |  |
> +---+---+---+--+--+
> \_ ___/
>   V
>  /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.

>>>
>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>> to know where the reserved area is?  Or do you mean it's not if it's
>>> defined by a protocol that is accepted by all OSes?
>>
>> The latter - we clearly won't get away without some agreement on
>> where to retrieve position and size of this area. I was simply
>> assuming that such a protocol already exists.
>>
>
> No, we should not mix the struct page reservation that the Dom0 kernel
> may actively use with the Xen reservation that the Dom0 kernel does
> not consume.  Explain again what is wrong with the partition approach?

 Not sure what was unclear in my previous reply. I don't think there
 should be apriori knowledge of whether Xen is (going to be) used on
 a system, and even if it gets used, but just occasionally, it would
 (apart from the abstract considerations already given) be a waste
 of resources to set something aside that could be used for other
 purposes while Xen is not running. Static partitioning should only be
 needed for persistent data.
>>>
>>> The reservation needs to be persistent / static even if the data is
>>> volatile, as is the case with struct page, because we can't have the
>>> size of the device change depending on use.  So, from the aspect of
>>> wasting space while Xen is not in use, both partitions and the
>>> intrinsic reservation approach suffer the same problem. Setting that
>>> aside I don't want to mix 2 different use cases into the same
>>> reservation.
>>
>>Then you didn't understand what I've said: I certainly didn't mean
>>the reservation to vary from a device perspective. However, when
>>Xen is in use I don't see why part of that static reservation couldn't
>>be used by Xen, and another part by the Dom0 kernel. The kernel
>>obviously would need to ask the hypervisor how much of the space
>>is left, and where that area starts.
>>
> 
> I think Dan means that there should be a clear separation between
> reservations for different usages (kernel/xen/...). The libnvdimm
> driver is for the linux kernel and only needs to maintain the
> reservation for kernel functionality. For others including xen/dm/...,
> if they want reservation for their own purpose, they should maintain
> their own reservations out of libnvdimm driver and avoid bothering the
> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
> 
> IIUC, one existing example is device-mapper device (dm) which needs to
> reserve on-device area for its own meta-data. Its choice is to store
> the meta-data on the block device (/dev/pmemN) provided by the
> libnvdimm driver.
> 
> I think we can do the similar for Xen, like to lay another pseudo
> device on /dev/pmem and do the reservation, like 2. in my previous
> reply.

Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Haozhong Zhang

On 10/13/16 02:34 -0600, Jan Beulich wrote:

On 12.10.16 at 18:19,  wrote:

On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:

On 12.10.16 at 17:42,  wrote:

On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:

On 12.10.16 at 16:58,  wrote:

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
\_ ___/
  V
 /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?


The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.



No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.


The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.


Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.



I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Thanks,
Haozhong


The kernel needs to know about the struct page reservation because it
needs to manage the lifetime of page references vs the lifetime of the
device.  It does not have the same relationship with a Xen reservation
which is why I'm proposing they be managed separately.


I don't think I understand the difference you try to point out here.
Linux'es struct page and Xen's struct page_info serve the same
fundamental purpose.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-13 Thread Jan Beulich
>>> On 12.10.16 at 18:19,  wrote:
> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:
> On 12.10.16 at 17:42,  wrote:
>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
>>> On 12.10.16 at 16:58,  wrote:
> On 10/12/16 05:32 -0600, Jan Beulich wrote:
> On 12.10.16 at 12:33,  wrote:
>>> The layout is shown as the following diagram.
>>>
>>> +---+---+---+--+--+
>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>> |  by kernel|   Table   | Block | for Xen  |  |
>>> +---+---+---+--+--+
>>> \_ ___/
>>>   V
>>>  /dev/pmem0
>>
>>I have to admit that I dislike this, for not being OS-agnostic.
>>Neither should there be any Xen-specific region, nor should the
>>"whatever used by kernel" one be restricted to just Linux. What
>>I could see is an OS-reserved area ahead of the partition table,
>>the exact usage of which depends on which OS is currently
>>running (and in the Xen case this might be both Xen _and_ the
>>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>running under Xen, the Dom0 may not have a need for as much
>>control data as it has when running on bare hardware, for it
>>controlling less (if any) of the actual memory ranges when Xen
>>is present.
>>
>
> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
> to know where the reserved area is?  Or do you mean it's not if it's
> defined by a protocol that is accepted by all OSes?

 The latter - we clearly won't get away without some agreement on
 where to retrieve position and size of this area. I was simply
 assuming that such a protocol already exists.

>>>
>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>> may actively use with the Xen reservation that the Dom0 kernel does
>>> not consume.  Explain again what is wrong with the partition approach?
>>
>> Not sure what was unclear in my previous reply. I don't think there
>> should be apriori knowledge of whether Xen is (going to be) used on
>> a system, and even if it gets used, but just occasionally, it would
>> (apart from the abstract considerations already given) be a waste
>> of resources to set something aside that could be used for other
>> purposes while Xen is not running. Static partitioning should only be
>> needed for persistent data.
> 
> The reservation needs to be persistent / static even if the data is
> volatile, as is the case with struct page, because we can't have the
> size of the device change depending on use.  So, from the aspect of
> wasting space while Xen is not in use, both partitions and the
> intrinsic reservation approach suffer the same problem. Setting that
> aside I don't want to mix 2 different use cases into the same
> reservation.

Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.

> The kernel needs to know about the struct page reservation because it
> needs to manage the lifetime of page references vs the lifetime of the
> device.  It does not have the same relationship with a Xen reservation
> which is why I'm proposing they be managed separately.

I don't think I understand the difference you try to point out here.
Linux'es struct page and Xen's struct page_info serve the same
fundamental purpose.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Jan Beulich
>>> On 12.10.16 at 17:42,  wrote:
> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
> On 12.10.16 at 16:58,  wrote:
>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>> On 12.10.16 at 12:33,  wrote:
> The layout is shown as the following diagram.
>
> +---+---+---+--+--+
> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
> |  by kernel|   Table   | Block | for Xen  |  |
> +---+---+---+--+--+
> \_ ___/
>   V
>  /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.

>>>
>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>> to know where the reserved area is?  Or do you mean it's not if it's
>>> defined by a protocol that is accepted by all OSes?
>>
>> The latter - we clearly won't get away without some agreement on
>> where to retrieve position and size of this area. I was simply
>> assuming that such a protocol already exists.
>>
> 
> No, we should not mix the struct page reservation that the Dom0 kernel
> may actively use with the Xen reservation that the Dom0 kernel does
> not consume.  Explain again what is wrong with the partition approach?

Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Dan Williams
On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich  wrote:
 On 12.10.16 at 17:42,  wrote:
>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
>> On 12.10.16 at 16:58,  wrote:
 On 10/12/16 05:32 -0600, Jan Beulich wrote:
 On 12.10.16 at 12:33,  wrote:
>> The layout is shown as the following diagram.
>>
>> +---+---+---+--+--+
>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>> |  by kernel|   Table   | Block | for Xen  |  |
>> +---+---+---+--+--+
>> \_ ___/
>>   V
>>  /dev/pmem0
>
>I have to admit that I dislike this, for not being OS-agnostic.
>Neither should there be any Xen-specific region, nor should the
>"whatever used by kernel" one be restricted to just Linux. What
>I could see is an OS-reserved area ahead of the partition table,
>the exact usage of which depends on which OS is currently
>running (and in the Xen case this might be both Xen _and_ the
>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>running under Xen, the Dom0 may not have a need for as much
>control data as it has when running on bare hardware, for it
>controlling less (if any) of the actual memory ranges when Xen
>is present.
>

 Isn't this OS-reserved area still not OS-agnostic, as it requires OS
 to know where the reserved area is?  Or do you mean it's not if it's
 defined by a protocol that is accepted by all OSes?
>>>
>>> The latter - we clearly won't get away without some agreement on
>>> where to retrieve position and size of this area. I was simply
>>> assuming that such a protocol already exists.
>>>
>>
>> No, we should not mix the struct page reservation that the Dom0 kernel
>> may actively use with the Xen reservation that the Dom0 kernel does
>> not consume.  Explain again what is wrong with the partition approach?
>
> Not sure what was unclear in my previous reply. I don't think there
> should be apriori knowledge of whether Xen is (going to be) used on
> a system, and even if it gets used, but just occasionally, it would
> (apart from the abstract considerations already given) be a waste
> of resources to set something aside that could be used for other
> purposes while Xen is not running. Static partitioning should only be
> needed for persistent data.
>

The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.

The kernel needs to know about the struct page reservation because it
needs to manage the lifetime of page references vs the lifetime of the
device.  It does not have the same relationship with a Xen reservation
which is why I'm proposing they be managed separately.

Note that Toshi and Mike added DM for DAX.  This enabling ends up
writing DM metadata on the device without adding new reservation
mechanisms to the nvdimm core.  I'm struggling to see how the Xen use
case is materially different DM.  In the end it's an application
specific metadata space.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Dan Williams
On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich  wrote:
 On 12.10.16 at 16:58,  wrote:
>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>> On 12.10.16 at 12:33,  wrote:
 The layout is shown as the following diagram.

 +---+---+---+--+--+
 | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
 |  by kernel|   Table   | Block | for Xen  |  |
 +---+---+---+--+--+
 \_ ___/
   V
  /dev/pmem0
>>>
>>>I have to admit that I dislike this, for not being OS-agnostic.
>>>Neither should there be any Xen-specific region, nor should the
>>>"whatever used by kernel" one be restricted to just Linux. What
>>>I could see is an OS-reserved area ahead of the partition table,
>>>the exact usage of which depends on which OS is currently
>>>running (and in the Xen case this might be both Xen _and_ the
>>>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>running under Xen, the Dom0 may not have a need for as much
>>>control data as it has when running on bare hardware, for it
>>>controlling less (if any) of the actual memory ranges when Xen
>>>is present.
>>>
>>
>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>> to know where the reserved area is?  Or do you mean it's not if it's
>> defined by a protocol that is accepted by all OSes?
>
> The latter - we clearly won't get away without some agreement on
> where to retrieve position and size of this area. I was simply
> assuming that such a protocol already exists.
>

No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Jan Beulich
>>> On 12.10.16 at 16:58,  wrote:
> On 10/12/16 05:32 -0600, Jan Beulich wrote:
> On 12.10.16 at 12:33,  wrote:
>>> The layout is shown as the following diagram.
>>>
>>> +---+---+---+--+--+
>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>> |  by kernel|   Table   | Block | for Xen  |  |
>>> +---+---+---+--+--+
>>> \_ ___/
>>>   V
>>>  /dev/pmem0
>>
>>I have to admit that I dislike this, for not being OS-agnostic.
>>Neither should there be any Xen-specific region, nor should the
>>"whatever used by kernel" one be restricted to just Linux. What
>>I could see is an OS-reserved area ahead of the partition table,
>>the exact usage of which depends on which OS is currently
>>running (and in the Xen case this might be both Xen _and_ the
>>Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>running under Xen, the Dom0 may not have a need for as much
>>control data as it has when running on bare hardware, for it
>>controlling less (if any) of the actual memory ranges when Xen
>>is present.
>>
> 
> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
> to know where the reserved area is?  Or do you mean it's not if it's
> defined by a protocol that is accepted by all OSes?

The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Haozhong Zhang

On 10/12/16 05:32 -0600, Jan Beulich wrote:

On 12.10.16 at 12:33,  wrote:

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
\_ ___/
  V
 /dev/pmem0


I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.



Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?

Let me list another two methods just coming to my mind.

1. The first method extends the usage of the super block used by
  current Linux kernel to reserve space on pmem.

  Current Linux kernel places a super block of the following
  structure near the beginning of a pmem namespace.

   struct nd_pfn_sb {
   u8 signature[PFN_SIG_LEN];
   u8 uuid[16];
   u8 parent_uuid[16];
   __le32 flags;
   __le16 version_major;
   __le16 version_minor;
   __le64 dataoff; /* relative to namespace_base + start_pad */
   __le64 npfns;
   __le32 mode;
   /* minor-version-1 additions for section alignment */
   __le32 start_pad;
   __le32 end_trunc;
   /* minor-version-2 record the base alignment of the mapping */
   __le32 align;
   u8 padding[4000];
   __le64 checksum;
   }

   Two interesting fields here are 'dataoff' and 'mode':
   - 'dataoff' indicates the offset where the data area starts,
 ie. IIUC, the part that can be accessed via /dev/pmemN or
 /dev/daxN.
   - 'mode' indicates whether Linux puts struct page for this
 namespace in the ram (= PFN_MODE_RAM) or on the device (=
 PFN_MODE_PMEM).

   Currently for Linux, only 'mode' is customizable, while 'dataoff'
   is not. If mode == PFN_MODE_RAM, no reservation for struct page is
   made on the device, and dataoff starts almost immediately after
   the super block except a small reserved area in between for other
   structures and alignment. If mode == PFN_MODE_PMEM, the size of
   the reservation is decided by kernel, i.e. 64 bytes per struct
   page.

   I propose to make the size of the reserved area customizable,
   e.g. via ioctl and ndctl.
   - If mode == PFN_MODE_PMEM and
 * if the given reserved size is large enough to hold what an OS
   (not limited to Linux) wants to put in, then the OS just
   starts use it as desired;
 * if the given reserved size is not enough, then the OS reports
   error and may take other fallback actions.
   - If mode == PFN_MODE_RAM and
 * if the reserved size is zero, then it's the current way that
   Linux uses the device;
 * if the reserved size is non-zero, I would like to reserve this
   case for hypervisor (right now, namely Xen hypervisor)
   usage. That is, the OS should not use the reserved area. For
   Xen, we could add a function in xen driver in kernel to report
   the reserved area to hypervisor.

  I guess this might be the OS-agnostic way Jan expects, but Dan may
  object to.


2. Lay another pseudo device on the block device (e.g. /dev/pmemN)
  provided by the NVDIMM driver.

  This pseudo device can reserve the size according to user's
  requirement. The reservation information can be persistently
  recorded in a super block before the reserved area.

  This pseudo device also implements another pseudo block device to
  allow the non-reserved area be accessed as a block device (we can
  even implement it as DAX-capable).

  pseudo block device
/-^---\
+--+---+---+---+
|  whatever used   | Super |  reserved by  |   |
| by NVDIMM driver | Block | pseudo device |   |
+--+---+---+---+
\_ ___/
  V
 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Jan Beulich
>>> On 12.10.16 at 12:33,  wrote:
> The layout is shown as the following diagram.
> 
> +---+---+---+--+--+
> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
> |  by kernel|   Table   | Block | for Xen  |  |
> +---+---+---+--+--+
> \_ ___/
> V
>/dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.

The assumption of course is that the reserved area holds no
persistent data. If that assumption didn't hold, you'd have to
have per-OS reserved areas anyway (as many of them as
there might be OSes [planned to get] installed on a particular
system).

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Haozhong Zhang

On 10/11/16 13:17 -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk
 wrote:

On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:

On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
[..]
>> Right, but why does the libnvdimm core need to know about this
>> specific Xen reservation?  For example, if Xen wants some in-kernel
>
> Let me turn this around - why does the libnvdimm core need to know about
> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> for example can also poke a hole in this and fill it with its
> OS-management meta-data?

Specifically the core needs to know so that it can answer the Linux
specific question of whether the pfn returned by ->direct_access() has
a corresponding struct page or not. It's tied to the lifetime of the
device and the usage of the reservation needs to be coordinated
against the references of those pages.  If FreeBSD decides it needs to
reserve "struct page" capacity at the start of the device, I would
hope that it reuses the same on-device info block that Linux is using
and not create a new "FreeBSD-mode" device type.


The issue here (as I understand, I may be missing something new)
is that the size of this special namespace may be different. That is
the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
64 bytes (numbers pulled out of the sky).

Hence one would have to expand or such to re-use this.


Sure, but we could support that today.  If FreeBSD lays down the info
block it is free to make a bigger reservation and Linux would be happy
to use a smaller subset.  If we, as an industry, want this "struct
page" reservation to be common we can take it to a standards body to
make as a cross-OS guarantee... but I think this is separate from the
Xen reservation.


To be honest I do not yet understand what metadata Xen wants to store
in the device, but it seems the producer and consumer of that metadata
is Xen itself and not the wider Linux kernel as is the case with
struct page.  Can you fill me in on what problem Xen solves with this


Exactly!

reservation?


The same as Linux - its variant of 'struct page'. Which I think is
smaller than the Linux one, but perhaps it is not?



If the hypervisor needs to know where it can store some metadata, can
that be satisfied with userspace tooling in Dom0? Something like,
"/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem
with files to hand to guests".  So my question is not about the
rationale for having metadata, it's why does the Linux kernel need to
know about the Xen reservation? As far as I can see it is independent
/ opaque to the kernel.


Thank everyone for all these comments!

How about doing the reservation in the following way:

1. Create partition(s) on /dev/pmemX and make sure space besides the
  partition table and potential padding before the first partition is
  large enough to hold Xen's management structures and a super block
  introduced in step 2. The space besides the partition table,
  padding and the super block will be used as the reserved area.

2. Write a super block before above reserved area. The super block
  records the base address and the size of the reserved area. It also
  contains a signature and a checksum to identify itself.

The layout is shown as the following diagram.

+---+---+---+--+--+
| whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
|  by kernel|   Table   | Block | for Xen  |  |
+---+---+---+--+--+
   \_ ___/
  V
 /dev/pmem0

Above two steps can be done via a userspace program and do not need
Xen hypervisor running. The partitions on the device can be used
regardless of the existence of Xen hypervisor.

3. When Xen is running, implement a function in Dom0 Linux xen driver
  (drivers/xen/) to response to udevd events that notify the
  detection of the pmem regions.

  This function searches on the pmem region for the super block
  created in step 2. If one is found, it will know this pmem region
  has been prepared for Xen usage.

  Then it gets the base address and size of the reserved area (from
  super block) and the entire address ranges of the pmem region (from
  pmem driver), and reports them to Xen hypervisor.

The implementation of this step can be completely included in the
kernel Xen driver. (It may also be implemented as a udevd service in
userspace, if it's not considered as unsafe)

Thanks,
Haozhong


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-12 Thread Jan Beulich
>>> On 11.10.16 at 17:53,  wrote:
> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
> Andrew Cooper  10/10/16 6:44 PM >>>
>>>On 10/10/16 01:35, Haozhong Zhang wrote:
 Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
memory management data structures, i.e. frame table and M2P table.
 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
hypervisor.
>>>
>>>However, I can't see any justification for 1).  Dom0 should not be
>>>involved in Xen's management of its own frame table and m2p.  The mfns
>>>making up the pmem/pblk regions should be treated just like any other
>>>MMIO regions, and be handed wholesale to dom0 by default.
>>
>> That precludes the use as RAM extension, and I thought earlier rounds of
>> discussion had got everyone in agreement that at least for the pmem case
>> we will need some control data in Xen.
> 
> The missing piece for me is why this reservation for control data
> needs to be done in the libnvdimm core?  I would expect that any dax
> capable file could be mapped and made available to a guest.  This
> includes /dev/ramX devices that are dax capable, but are external to
> the libnvdimm sub-system.

Despite me being the only one on the To list, I don't think the question
was really meant to be directed to me.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Andrew Cooper
On 11/10/16 20:48, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
>>  wrote:
>>> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
>> [..]
 Right, but why does the libnvdimm core need to know about this
 specific Xen reservation?  For example, if Xen wants some in-kernel
>>> Let me turn this around - why does the libnvdimm core need to know about
>>> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
>>> for example can also poke a hole in this and fill it with its
>>> OS-management meta-data?
>> Specifically the core needs to know so that it can answer the Linux
>> specific question of whether the pfn returned by ->direct_access() has
>> a corresponding struct page or not. It's tied to the lifetime of the
>> device and the usage of the reservation needs to be coordinated
>> against the references of those pages.  If FreeBSD decides it needs to
>> reserve "struct page" capacity at the start of the device, I would
>> hope that it reuses the same on-device info block that Linux is using
>> and not create a new "FreeBSD-mode" device type.
> The issue here (as I understand, I may be missing something new)
> is that the size of this special namespace may be different. That is
> the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
> 64 bytes (numbers pulled out of the sky).
>
> Hence one would have to expand or such to re-use this.
>> To be honest I do not yet understand what metadata Xen wants to store
>> in the device, but it seems the producer and consumer of that metadata
>> is Xen itself and not the wider Linux kernel as is the case with
>> struct page.  Can you fill me in on what problem Xen solves with this
> Exactly!
>> reservation?
> The same as Linux - its variant of 'struct page'. Which I think is
> smaller than the Linux one, but perhaps it is not?

There is still a bootstrapping issue though, which looks (in its current
form) to cause data corruption.

I hope I am mistaken, and apologies if I am, but clearly we cannot build
a solution that has data corruption in anything other than an
exceptional circumstance.

So far, the sequence of boot operations appears to look like this:

Xen boots, and may find some NVDIMM SPA/MFN ranges via the NFIT table. 
Any ranges available only from AML need dynamically reporting back to
Xen at a later point, once OSPM is up and running.

The NVDIMMs must be mappable by dom0 so the contents can be inspected
and deemed to be safe by the nvdimm driver/host admin, before Xen starts
writing to any of it (for whatever reason).

If this isn't the case, then simply booting a Xen/dom0 combo will end up
corrupting a region before working out that it is safe to do so.

~Andrew


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Dan Williams
On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
>>  wrote:
>> > On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
>> [..]
>> >> Right, but why does the libnvdimm core need to know about this
>> >> specific Xen reservation?  For example, if Xen wants some in-kernel
>> >
>> > Let me turn this around - why does the libnvdimm core need to know about
>> > Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
>> > for example can also poke a hole in this and fill it with its
>> > OS-management meta-data?
>>
>> Specifically the core needs to know so that it can answer the Linux
>> specific question of whether the pfn returned by ->direct_access() has
>> a corresponding struct page or not. It's tied to the lifetime of the
>> device and the usage of the reservation needs to be coordinated
>> against the references of those pages.  If FreeBSD decides it needs to
>> reserve "struct page" capacity at the start of the device, I would
>> hope that it reuses the same on-device info block that Linux is using
>> and not create a new "FreeBSD-mode" device type.
>
> The issue here (as I understand, I may be missing something new)
> is that the size of this special namespace may be different. That is
> the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
> 64 bytes (numbers pulled out of the sky).
>
> Hence one would have to expand or such to re-use this.

Sure, but we could support that today.  If FreeBSD lays down the info
block it is free to make a bigger reservation and Linux would be happy
to use a smaller subset.  If we, as an industry, want this "struct
page" reservation to be common we can take it to a standards body to
make as a cross-OS guarantee... but I think this is separate from the
Xen reservation.

>> To be honest I do not yet understand what metadata Xen wants to store
>> in the device, but it seems the producer and consumer of that metadata
>> is Xen itself and not the wider Linux kernel as is the case with
>> struct page.  Can you fill me in on what problem Xen solves with this
>
> Exactly!
>> reservation?
>
> The same as Linux - its variant of 'struct page'. Which I think is
> smaller than the Linux one, but perhaps it is not?
>

If the hypervisor needs to know where it can store some metadata, can
that be satisfied with userspace tooling in Dom0? Something like,
"/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem
with files to hand to guests".  So my question is not about the
rationale for having metadata, it's why does the Linux kernel need to
know about the Xen reservation? As far as I can see it is independent
/ opaque to the kernel.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
>  wrote:
> > On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
> [..]
> >> Right, but why does the libnvdimm core need to know about this
> >> specific Xen reservation?  For example, if Xen wants some in-kernel
> >
> > Let me turn this around - why does the libnvdimm core need to know about
> > Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> > for example can also poke a hole in this and fill it with its
> > OS-management meta-data?
> 
> Specifically the core needs to know so that it can answer the Linux
> specific question of whether the pfn returned by ->direct_access() has
> a corresponding struct page or not. It's tied to the lifetime of the
> device and the usage of the reservation needs to be coordinated
> against the references of those pages.  If FreeBSD decides it needs to
> reserve "struct page" capacity at the start of the device, I would
> hope that it reuses the same on-device info block that Linux is using
> and not create a new "FreeBSD-mode" device type.

The issue here (as I understand, I may be missing something new)
is that the size of this special namespace may be different. That is
the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
64 bytes (numbers pulled out of the sky).

Hence one would have to expand or such to re-use this.
> 
> To be honest I do not yet understand what metadata Xen wants to store
> in the device, but it seems the producer and consumer of that metadata
> is Xen itself and not the wider Linux kernel as is the case with
> struct page.  Can you fill me in on what problem Xen solves with this

Exactly!
> reservation?

The same as Linux - its variant of 'struct page'. Which I think is
smaller than the Linux one, but perhaps it is not?



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
> Andrew, why are you providing input to this so late?

First of sorry for this outburst. It was quite uncalled for and quite
unprofessional. You of all people have so much on your plate that I am
astonished that you are able to operate with some many pokers in the
fire.

Again, my sincere apology.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Dan Williams
On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
[..]
>> Right, but why does the libnvdimm core need to know about this
>> specific Xen reservation?  For example, if Xen wants some in-kernel
>
> Let me turn this around - why does the libnvdimm core need to know about
> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> for example can also poke a hole in this and fill it with its
> OS-management meta-data?

Specifically the core needs to know so that it can answer the Linux
specific question of whether the pfn returned by ->direct_access() has
a corresponding struct page or not. It's tied to the lifetime of the
device and the usage of the reservation needs to be coordinated
against the references of those pages.  If FreeBSD decides it needs to
reserve "struct page" capacity at the start of the device, I would
hope that it reuses the same on-device info block that Linux is using
and not create a new "FreeBSD-mode" device type.

To be honest I do not yet understand what metadata Xen wants to store
in the device, but it seems the producer and consumer of that metadata
is Xen itself and not the wider Linux kernel as is the case with
struct page.  Can you fill me in on what problem Xen solves with this
reservation?


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Andrew Cooper
On 11/10/16 06:52, Haozhong Zhang wrote:
> On 10/10/16 17:43, Andrew Cooper wrote:
>> On 10/10/16 01:35, Haozhong Zhang wrote:
>>> Overview
>>> 
>>> This RFC kernel patch series along with corresponding patch series of
>>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
>>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
>>>
>>> Xen hypervisor does not include an NVDIMM driver, so it needs the
>>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
>>> devices. We currently only supports NVDIMM devices in pmem mode.
>>>
>>> Design and Implementation
>>> =
>>> The complete design can be found at
>>>   
>>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
>>>
>>> All patch series can be found at
>>>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
>>>   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
>>>   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
>>>   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
>>>
>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>memory management data structures, i.e. frame table and M2P table.
>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>hypervisor.
>> Please can we take a step back here before diving down a rabbit hole.
>>
>>
>> How do pblk/pmem regions appear in the E820 map at boot?  At the very
>> least, I would expect at least a large reserved region.
> ACPI specification does not require them to appear in E820, though
> it defines E820 type-7 for persistent memory.

Ok, so we might get some E820 type-7 ranges, or some holes.

>
>> Is the MFN information (SPA in your terminology, so far as I can tell)
>> available in any static APCI tables, or are they only available as a
>> result of executing AML methods?
>>
> For NVDIMM devices already plugged at power on, their MFN information
> can be got from NFIT table. However, MFN information for hotplugged
> NVDIMM devices should be got via AML _FIT method, so point 2) is needed.

How does NVDIMM hotplug compare to RAM hotplug?  Are the hotplug regions
described at boot and marked as initially not present, or do you only
know the hotplugged SPA at the point that it is hotplugged?

I certainly agree that there needs to be a propagation of the hotplug
notification from OSPM to Xen, which will involve some glue in the Xen
subsystem in Linux, but I would expect that this would be similar to the
existing plain RAM hotplug mechanism.

>
>> If the MFN information is only available via AML, then point 2) is
>> needed, although the reporting back to Xen should be restricted to a xen
>> component, rather than polluting the main device driver.
>>
>> However, I can't see any justification for 1).  Dom0 should not be
>> involved in Xen's management of its own frame table and m2p.  The mfns
>> making up the pmem/pblk regions should be treated just like any other
>> MMIO regions, and be handed wholesale to dom0 by default.
>>
> Do you mean to treat them as mmio pages of type p2m_mmio_direct and
> map them to guest by map_mmio_regions()?

I don't see any reason why it shouldn't be treated like this.  Xen
shouldn't be treating it as anything other than an opaque block of MFNs.

The concept of trying to map a DAX file into the guest physical address
space of a VM is indeed new and doesn't fit into Xen's current model,
but all that fixing this requires is a new privileged mapping hypercall
which takes a source domid and gfn scatter list, and a destination domid
and scatter list.  (I see from a quick look at your Xen series that your
XENMEM_populate_pmemmap looks roughly like this)

~Andrew


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 07:37:09PM +0100, Andrew Cooper wrote:
> On 11/10/16 06:52, Haozhong Zhang wrote:
> > On 10/10/16 17:43, Andrew Cooper wrote:
> >> On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Overview
> >>> 
> >>> This RFC kernel patch series along with corresponding patch series of
> >>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> >>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >>>
> >>> Xen hypervisor does not include an NVDIMM driver, so it needs the
> >>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> >>> devices. We currently only supports NVDIMM devices in pmem mode.
> >>>
> >>> Design and Implementation
> >>> =
> >>> The complete design can be found at
> >>>   
> >>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >>>
> >>> All patch series can be found at
> >>>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >>>   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >>>   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >>>   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >>>
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>hypervisor.
> >> Please can we take a step back here before diving down a rabbit hole.
> >>
> >>
> >> How do pblk/pmem regions appear in the E820 map at boot?  At the very
> >> least, I would expect at least a large reserved region.
> > ACPI specification does not require them to appear in E820, though
> > it defines E820 type-7 for persistent memory.
> 
> Ok, so we might get some E820 type-7 ranges, or some holes.
> 
> >
> >> Is the MFN information (SPA in your terminology, so far as I can tell)
> >> available in any static APCI tables, or are they only available as a
> >> result of executing AML methods?
> >>
> > For NVDIMM devices already plugged at power on, their MFN information
> > can be got from NFIT table. However, MFN information for hotplugged
> > NVDIMM devices should be got via AML _FIT method, so point 2) is needed.
> 
> How does NVDIMM hotplug compare to RAM hotplug?  Are the hotplug regions
> described at boot and marked as initially not present, or do you only
> know the hotplugged SPA at the point that it is hotplugged?

The latter. You have no idea of the size until you get an ACPI hotplug.
The ACPI hotplug contains the NFIT MADT table so based on that you
can populate the machine.
> 
> I certainly agree that there needs to be a propagation of the hotplug
> notification from OSPM to Xen, which will involve some glue in the Xen
> subsystem in Linux, but I would expect that this would be similar to the
> existing plain RAM hotplug mechanism.

I am actually not sure how ACPI RAM hotplug mechanism is suppose to work
in practice. I thought that the regions (E820) are marked as reserved
and the 'RAM' slots nicely in there.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 07:37:09PM +0100, Andrew Cooper wrote:
> On 11/10/16 06:52, Haozhong Zhang wrote:
> > On 10/10/16 17:43, Andrew Cooper wrote:
> >> On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Overview
> >>> 
> >>> This RFC kernel patch series along with corresponding patch series of
> >>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> >>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >>>
> >>> Xen hypervisor does not include an NVDIMM driver, so it needs the
> >>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> >>> devices. We currently only supports NVDIMM devices in pmem mode.
> >>>
> >>> Design and Implementation
> >>> =
> >>> The complete design can be found at
> >>>   
> >>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >>>
> >>> All patch series can be found at
> >>>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >>>   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >>>   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >>>   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >>>
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>hypervisor.
> >> Please can we take a step back here before diving down a rabbit hole.
> >>
> >>
> >> How do pblk/pmem regions appear in the E820 map at boot?  At the very
> >> least, I would expect at least a large reserved region.
> > ACPI specification does not require them to appear in E820, though
> > it defines E820 type-7 for persistent memory.
> 
> Ok, so we might get some E820 type-7 ranges, or some holes.
> 
> >
> >> Is the MFN information (SPA in your terminology, so far as I can tell)
> >> available in any static APCI tables, or are they only available as a
> >> result of executing AML methods?
> >>
> > For NVDIMM devices already plugged at power on, their MFN information
> > can be got from NFIT table. However, MFN information for hotplugged
> > NVDIMM devices should be got via AML _FIT method, so point 2) is needed.
> 
> How does NVDIMM hotplug compare to RAM hotplug?  Are the hotplug regions
> described at boot and marked as initially not present, or do you only
> know the hotplugged SPA at the point that it is hotplugged?
> 
> I certainly agree that there needs to be a propagation of the hotplug
> notification from OSPM to Xen, which will involve some glue in the Xen
> subsystem in Linux, but I would expect that this would be similar to the
> existing plain RAM hotplug mechanism.
> 
> >
> >> If the MFN information is only available via AML, then point 2) is
> >> needed, although the reporting back to Xen should be restricted to a xen
> >> component, rather than polluting the main device driver.
> >>
> >> However, I can't see any justification for 1).  Dom0 should not be
> >> involved in Xen's management of its own frame table and m2p.  The mfns
> >> making up the pmem/pblk regions should be treated just like any other
> >> MMIO regions, and be handed wholesale to dom0 by default.
> >>
> > Do you mean to treat them as mmio pages of type p2m_mmio_direct and
> > map them to guest by map_mmio_regions()?
> 
> I don't see any reason why it shouldn't be treated like this.  Xen
> shouldn't be treating it as anything other than an opaque block of MFNs.
> 
> The concept of trying to map a DAX file into the guest physical address
> space of a VM is indeed new and doesn't fit into Xen's current model,
> but all that fixing this requires is a new privileged mapping hypercall
> which takes a source domid and gfn scatter list, and a destination domid
> and scatter list.  (I see from a quick look at your Xen series that your
> XENMEM_populate_pmemmap looks roughly like this)

That can be quite big. Say you want to map an DAX file that has
size of 1TB and the this GFN scatter list has 1073741824 entries?

How do you envision handling this in Xen and populating the P2M entries
with this information?

> 
> ~Andrew
> ___
> Linux-nvdimm mailing list
> linux-nvd...@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 07:15:42PM +0100, Andrew Cooper wrote:
> On 11/10/16 18:51, Dan Williams wrote:
> > On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk
> >  wrote:
> >> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
> >>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
> >>> Andrew Cooper  10/10/16 6:44 PM >>>
> > On 10/10/16 01:35, Haozhong Zhang wrote:
> >> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
> >> tasks:
> >> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>memory management data structures, i.e. frame table and M2P table.
> >> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>hypervisor.
> > However, I can't see any justification for 1).  Dom0 should not be
> > involved in Xen's management of its own frame table and m2p.  The mfns
> > making up the pmem/pblk regions should be treated just like any other
> > MMIO regions, and be handed wholesale to dom0 by default.
>  That precludes the use as RAM extension, and I thought earlier rounds of
>  discussion had got everyone in agreement that at least for the pmem case
>  we will need some control data in Xen.
> >>> The missing piece for me is why this reservation for control data
> >>> needs to be done in the libnvdimm core?  I would expect that any dax
> >> Isn't it done this way with Linux? That is say if the machine has
> >> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
> >> for the NVDIMM ranges somewhere. That place can be in regions on the
> >> NVDIMM that ndctl can reserve.
> > Yes.
> 
> I do not see any sensible usecase for Xen to use NVDIMMs as plain RAM;

I just gave you one. This is the 'usecase' that Linux has to deal with
now that the core kernel folks have pointed out that they don't want
'struct page' for the MMIO regions. This mechanism came about this and
finding a place _somewhere_ to deal with having to have 'struct page'
for the SPA ranges of the NVDIMM.

> NVDIMMs are far more valuable for higher level management in dom0.

Andrew, why are you providing input to this so late?

Haozhong provided an nice design document outlining the problem and
the solution he suggested.

> 
> I certainly think that such a usecase should be out-of-scope for initial
> Xen/NVDIMM support, even if only to reduce the complexity to start with.
> 
> A repeated complain I have of large feature submissions like this is
> that, by trying to solve all potential usecases at one, end up being
> overly complicated to develop, understand and review.

On the other hand - if you don't take these complicated issues from the
start, then you may have to redesign and redevelop this after the first
version which has been set in stone and committed.

> 
> >
> >>> capable file could be mapped and made available to a guest.  This
> >>> includes /dev/ramX devices that are dax capable, but are external to
> >>> the libnvdimm sub-system.
> >> This is more of just keeping track of the ranges if say the DAX file is
> >> extremely fragmented and requires a lot of 'struct pages' to keep track of
> >> when stiching up the VMA.
> > Right, but why does the libnvdimm core need to know about this
> > specific Xen reservation?  For example, if Xen wants some in-kernel
> > driver to own a pmem region and place its own metadata on the device I
> > would recommend something like:
> >
> > bdev = blkdev_get_by_path("/dev/pmemX",  FMODE_EXCL...);
> > bdev_direct_access(bdev, ...);
> >
> > ...in other words, I don't think we want libnvdimm to grow new device
> > types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
> > just claim the resulting device.
> 
> I completely agree.
> 
> Whatever ends up happening between Xen and dom0, there should be no
> modifications like this to the nvdimm driver.  I will go so far as to
> say that there shouldn't be any modifications to the nvdimm driver
> (other than perhaps new query hooks so the Xen subsystem in Linux can
> query information to then pass up to Xen, if the existing queryability
> is insufficient).

Haozhong and Jan had been chatting about this in terms of how to keep
track of a guest having non-contingous SPAs of NVDIMM stiched to a guest.

The initial idea was to treat it as MMIO, but of course if you have 1
page ranges over say 1TB you end up consuming tons of memory to keep
track of this (the same way Linux would if you wanted to mmap an file
from DAX fs).

Other solutions were an bitmap, but that can also be cumbersome to deal
with. In the end the suggestion that was proposed was the one that Linux
choose - stash the 'struct page' in the NVDIMM.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
> 260sn3756f-1
>   (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NOT)
>   for ; Tue, 11 Oct 2016 17:51:21 +
> Received: by mail-oi0-f43.google.com with SMTP id d132so32700570oib.2
> for ; Tue, 11 Oct 2016 10:51:20 -0700 (PDT)
> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
> d=intel-com.20150623.gappssmtp.com; s=20150623;
> h=mime-version:in-reply-to:references:from:date:message-id:subject:to
>  :cc;
> bh=vXHG8Ke0lr+jk8ivMDq3ZpmmHHjC205aTSytpqjXFgo=;
> b=CiKg4tJf1DGU2x/pSCYU7Jx79oCXMSIApwY2zJjO9Lny3erPxUyjNhszNyQkceYK1A
>  Gzuw05eETGT/k0UWamFdN/ZXF3PucSXIXqrVtTS9kLQBlKPTWQJvndSRqZ6lPb36mlSA
>  BrkdOREz5O/V7p/iGYhnxZU9eyfVY1ekgeMvTKP3su9Ye4Nk6GJYMEb5HSTCm1Ckmoq5
>  T4Rlw6gcnbHCLx27vcghySG4YXcQ4r2qSPcSmAysve77sYCPYlM9XRVpzfPBTmINKGUo
>  9w7MgVs5KG0dG60j1fJNjXoY0WSoP3uI67e69afqjAChzVndGDgMXjOzGrQ6+KQF088Q
>  JeiQ==
> X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
> d=1e100.net; s=20130820;
> h=x-gm-message-state:mime-version:in-reply-to:references:from:date
>  :message-id:subject:to:cc;
> bh=vXHG8Ke0lr+jk8ivMDq3ZpmmHHjC205aTSytpqjXFgo=;
> b=SmizBvFSmUHAy/WKfbD4m+QVSajIfcD9SQW7hwqmiwUtrACa2PxQWyx0dHe6DOqVVx
>  jYHSxbbMiz105BMwxfv2pZlAl+phFkj8APxpL2XF36SIsq5u9+evlqBUuzGcpVJ+tXyI
>  0xO0qfyspvNwLwJnkZ2bOxO9FM5cRhGGIAQ2uJCVIixLTPstJgkFL3taQ6bfr/epJGoF
>  VbYrGRu0nxGTWEqk14q0YBt2uiDLWu6WiF8izG/fnyM39wzS0ZsO31hco3jpBWiq7X5N
>  Ehn8ePiR9iYfowHhT3s2PefnrirD0zlJAamVqnbTNQS93PT26dWpm/vc8HVYiMLj+Fq8
>  s2rw==
> X-Gm-Message-State: 
> AA6/9RlGCiscMzjRlXRLSGCPLACOp/VdD9I/y/dQ+vytyQN0tniPrwPxFp4VQtNbW/PYF1zzfyAX+iUOa+dgEsrg
> X-Received: by 10.202.84.69 with SMTP id i66mr3504473oib.93.1476208279931;
>  Tue, 11 Oct 2016 10:51:19 -0700 (PDT)
> MIME-Version: 1.0
> Received: by 10.157.39.201 with HTTP; Tue, 11 Oct 2016 10:51:19 -0700 (PDT)
> In-Reply-To: <20161011165811.GO19349@localhost.localdomain>
> References: <20161010003523.4423-1-haozhong.zh...@intel.com>
>   
> <57fcf26a0278000f1...@prv-mh.provo.novell.com>
>   
> <20161011165811.GO19349@localhost.localdomain>
> From: Dan Williams 
> Date: Tue, 11 Oct 2016 10:51:19 -0700
> Message-ID: 
> 

Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Andrew Cooper
On 11/10/16 18:51, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk
>  wrote:
>> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
>>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
>>> Andrew Cooper  10/10/16 6:44 PM >>>
> On 10/10/16 01:35, Haozhong Zhang wrote:
>> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
>> tasks:
>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>memory management data structures, i.e. frame table and M2P table.
>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>hypervisor.
> However, I can't see any justification for 1).  Dom0 should not be
> involved in Xen's management of its own frame table and m2p.  The mfns
> making up the pmem/pblk regions should be treated just like any other
> MMIO regions, and be handed wholesale to dom0 by default.
 That precludes the use as RAM extension, and I thought earlier rounds of
 discussion had got everyone in agreement that at least for the pmem case
 we will need some control data in Xen.
>>> The missing piece for me is why this reservation for control data
>>> needs to be done in the libnvdimm core?  I would expect that any dax
>> Isn't it done this way with Linux? That is say if the machine has
>> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
>> for the NVDIMM ranges somewhere. That place can be in regions on the
>> NVDIMM that ndctl can reserve.
> Yes.

I do not see any sensible usecase for Xen to use NVDIMMs as plain RAM;
NVDIMMs are far more valuable for higher level management in dom0.

I certainly think that such a usecase should be out-of-scope for initial
Xen/NVDIMM support, even if only to reduce the complexity to start with.

A repeated complain I have of large feature submissions like this is
that, by trying to solve all potential usecases at one, end up being
overly complicated to develop, understand and review.

>
>>> capable file could be mapped and made available to a guest.  This
>>> includes /dev/ramX devices that are dax capable, but are external to
>>> the libnvdimm sub-system.
>> This is more of just keeping track of the ranges if say the DAX file is
>> extremely fragmented and requires a lot of 'struct pages' to keep track of
>> when stiching up the VMA.
> Right, but why does the libnvdimm core need to know about this
> specific Xen reservation?  For example, if Xen wants some in-kernel
> driver to own a pmem region and place its own metadata on the device I
> would recommend something like:
>
> bdev = blkdev_get_by_path("/dev/pmemX",  FMODE_EXCL...);
> bdev_direct_access(bdev, ...);
>
> ...in other words, I don't think we want libnvdimm to grow new device
> types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
> just claim the resulting device.

I completely agree.

Whatever ends up happening between Xen and dom0, there should be no
modifications like this to the nvdimm driver.  I will go so far as to
say that there shouldn't be any modifications to the nvdimm driver
(other than perhaps new query hooks so the Xen subsystem in Linux can
query information to then pass up to Xen, if the existing queryability
is insufficient).

~Andrew


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Dan Williams
On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk
 wrote:
> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
>>  Andrew Cooper  10/10/16 6:44 PM >>>
>> >>On 10/10/16 01:35, Haozhong Zhang wrote:
>> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
>> >>> tasks:
>> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>> >>>memory management data structures, i.e. frame table and M2P table.
>> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>> >>>hypervisor.
>> >>
>> >>However, I can't see any justification for 1).  Dom0 should not be
>> >>involved in Xen's management of its own frame table and m2p.  The mfns
>> >>making up the pmem/pblk regions should be treated just like any other
>> >>MMIO regions, and be handed wholesale to dom0 by default.
>> >
>> > That precludes the use as RAM extension, and I thought earlier rounds of
>> > discussion had got everyone in agreement that at least for the pmem case
>> > we will need some control data in Xen.
>>
>> The missing piece for me is why this reservation for control data
>> needs to be done in the libnvdimm core?  I would expect that any dax
>
> Isn't it done this way with Linux? That is say if the machine has
> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
> for the NVDIMM ranges somewhere. That place can be in regions on the
> NVDIMM that ndctl can reserve.

Yes.

>> capable file could be mapped and made available to a guest.  This
>> includes /dev/ramX devices that are dax capable, but are external to
>> the libnvdimm sub-system.
>
> This is more of just keeping track of the ranges if say the DAX file is
> extremely fragmented and requires a lot of 'struct pages' to keep track of
> when stiching up the VMA.

Right, but why does the libnvdimm core need to know about this
specific Xen reservation?  For example, if Xen wants some in-kernel
driver to own a pmem region and place its own metadata on the device I
would recommend something like:

bdev = blkdev_get_by_path("/dev/pmemX",  FMODE_EXCL...);
bdev_direct_access(bdev, ...);

...in other words, I don't think we want libnvdimm to grow new device
types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
just claim the resulting device.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Konrad Rzeszutek Wilk
On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
>  Andrew Cooper  10/10/16 6:44 PM >>>
> >>On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following 
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>hypervisor.
> >>
> >>However, I can't see any justification for 1).  Dom0 should not be
> >>involved in Xen's management of its own frame table and m2p.  The mfns
> >>making up the pmem/pblk regions should be treated just like any other
> >>MMIO regions, and be handed wholesale to dom0 by default.
> >
> > That precludes the use as RAM extension, and I thought earlier rounds of
> > discussion had got everyone in agreement that at least for the pmem case
> > we will need some control data in Xen.
> 
> The missing piece for me is why this reservation for control data
> needs to be done in the libnvdimm core?  I would expect that any dax

Isn't it done this way with Linux? That is say if the machine has
4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
for the NVDIMM ranges somewhere. That place can be in regions on the
NVDIMM that ndctl can reserve.

> capable file could be mapped and made available to a guest.  This
> includes /dev/ramX devices that are dax capable, but are external to
> the libnvdimm sub-system.

This is more of just keeping track of the ranges if say the DAX file is
extremely fragmented and requires a lot of 'struct pages' to keep track of
when stiching up the VMA.

> 
> ___
> Xen-devel mailing list
> xen-de...@lists.xen.org
> https://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Dan Williams
On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich  wrote:
 Andrew Cooper  10/10/16 6:44 PM >>>
>>On 10/10/16 01:35, Haozhong Zhang wrote:
>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>memory management data structures, i.e. frame table and M2P table.
>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>hypervisor.
>>
>>However, I can't see any justification for 1).  Dom0 should not be
>>involved in Xen's management of its own frame table and m2p.  The mfns
>>making up the pmem/pblk regions should be treated just like any other
>>MMIO regions, and be handed wholesale to dom0 by default.
>
> That precludes the use as RAM extension, and I thought earlier rounds of
> discussion had got everyone in agreement that at least for the pmem case
> we will need some control data in Xen.

The missing piece for me is why this reservation for control data
needs to be done in the libnvdimm core?  I would expect that any dax
capable file could be mapped and made available to a guest.  This
includes /dev/ramX devices that are dax capable, but are external to
the libnvdimm sub-system.


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-11 Thread Jan Beulich
>>> Andrew Cooper  10/10/16 6:44 PM >>>
>On 10/10/16 01:35, Haozhong Zhang wrote:
>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>memory management data structures, i.e. frame table and M2P table.
>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>hypervisor.
>
>However, I can't see any justification for 1).  Dom0 should not be
>involved in Xen's management of its own frame table and m2p.  The mfns
>making up the pmem/pblk regions should be treated just like any other
>MMIO regions, and be handed wholesale to dom0 by default.

That precludes the use as RAM extension, and I thought earlier rounds of
discussion had got everyone in agreement that at least for the pmem case
we will need some control data in Xen.

Jan



Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-10 Thread Haozhong Zhang
On 10/10/16 17:43, Andrew Cooper wrote:
> On 10/10/16 01:35, Haozhong Zhang wrote:
> > Overview
> > 
> > This RFC kernel patch series along with corresponding patch series of
> > Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> > NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >
> > Xen hypervisor does not include an NVDIMM driver, so it needs the
> > assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> > devices. We currently only supports NVDIMM devices in pmem mode.
> >
> > Design and Implementation
> > =
> > The complete design can be found at
> >   
> > https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >
> > All patch series can be found at
> >   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >
> > Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
> > 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >memory management data structures, i.e. frame table and M2P table.
> > 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >hypervisor.
> 
> Please can we take a step back here before diving down a rabbit hole.
> 
> 
> How do pblk/pmem regions appear in the E820 map at boot?  At the very
> least, I would expect at least a large reserved region.

ACPI specification does not require them to appear in E820, though
it defines E820 type-7 for persistent memory.

> 
> Is the MFN information (SPA in your terminology, so far as I can tell)
> available in any static APCI tables, or are they only available as a
> result of executing AML methods?
>

For NVDIMM devices already plugged at power on, their MFN information
can be got from NFIT table. However, MFN information for hotplugged
NVDIMM devices should be got via AML _FIT method, so point 2) is needed.

> 
> If the MFN information is only available via AML, then point 2) is
> needed, although the reporting back to Xen should be restricted to a xen
> component, rather than polluting the main device driver.
> 
> However, I can't see any justification for 1).  Dom0 should not be
> involved in Xen's management of its own frame table and m2p.  The mfns
> making up the pmem/pblk regions should be treated just like any other
> MMIO regions, and be handed wholesale to dom0 by default.
>

Do you mean to treat them as mmio pages of type p2m_mmio_direct and
map them to guest by map_mmio_regions()?

Thanks,
Haozhong


Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen

2016-10-10 Thread Andrew Cooper
On 10/10/16 01:35, Haozhong Zhang wrote:
> Overview
> 
> This RFC kernel patch series along with corresponding patch series of
> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
>
> Xen hypervisor does not include an NVDIMM driver, so it needs the
> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> devices. We currently only supports NVDIMM devices in pmem mode.
>
> Design and Implementation
> =
> The complete design can be found at
>   https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
>
> All patch series can be found at
>   Xen:  https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
>   QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
>   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
>   ndctl:https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
>
> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>memory management data structures, i.e. frame table and M2P table.
> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>hypervisor.

Please can we take a step back here before diving down a rabbit hole.


How do pblk/pmem regions appear in the E820 map at boot?  At the very
least, I would expect at least a large reserved region.

Is the MFN information (SPA in your terminology, so far as I can tell)
available in any static APCI tables, or are they only available as a
result of executing AML methods?


If the MFN information is only available via AML, then point 2) is
needed, although the reporting back to Xen should be restricted to a xen
component, rather than polluting the main device driver.

However, I can't see any justification for 1).  Dom0 should not be
involved in Xen's management of its own frame table and m2p.  The mfns
making up the pmem/pblk regions should be treated just like any other
MMIO regions, and be handed wholesale to dom0 by default.

~Andrew