Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 20/10/2016 10:14, Haozhong Zhang wrote:
>>>>>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
>>>>>> work and figure out what is on the DIMM, and which areas are safe
>>>>>> to use.
>>>>> I don't understand this ordering of events.  Dom0 needs to have a
>>>>> mapping to even write the on-media structure to indicate a
>>>>> reservation.  So, initial dom0 access can't depend on metadata
>>>>> reservation already being present.
>>>>
>>>> I agree.
>>>>
>>>> Overall, I think the following is needed.
>>>>
>>>> * Xen starts up.
>>>> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
>>>> needs to note this information somehow.
>>>> ** Xen might find some Type 7 E820 regions, and needs to note this
>>>> information somehow.
>>>
>>> IIUC, this is to collect MFNs and no need to create frame table and
>>> M2P at this stage. If so, what is different from ...
>>>
>>>> * Xen starts dom0.
>>>> * Once OSPM is running, a Xen component in Linux needs to collect and
>>>> report all NVDIMM SPA/MFN regions it knows about.
>>>> ** This covers the AML-only case, and the hotplug case.
>>>
>>> ... the MFNs reported here, especially that the former is a subset
>>> (hotplug ones not included in the former) of latter.
>>
>> Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
>> when it is capable of working things out itself, (which can aid with
>> debugging one half of this arrangement).  Also, the MFNs found by Xen
>> alone can be present in the default memory map for dom0.
>>
>
> Sure, I'll add code to parse the NFIT in Xen to discover statically
> plugged pmem mode NVDIMM and their MFNs.
>
> By the default memory map for dom0, do you mean making
> XENMEM_memory_map return the above MFNs in the Dom0 E820?

Potentially, yes.  Particularly if type 7 is reserved for NVDIMM, it
would be good to report this information properly.

>
>>>
>>> (There is no E820 hole or SRAT entries to tell which address range is
>>> reserved for hotplugged NVDIMM)
>>>
>>>> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>>>
>>> Two questions:
>>> 1. Why is this request necessary? Even without such requests, as in
>>> my current implementation, Dom0 can still access NVDIMM.
>>
>> Can it?  (if so, great, but I don't think this holds in the general
>> case.)  Is that a side effect of the NVDIMM being covered by a hole in
>> the E820?
>
> In my development environment, NVDIMM MFNs are not covered by any E820
> entry and appear after RAM MFNs.
>
> Can you explain more about this point? Why can it work if covered by
> E820 hole?

It is a question, not a statement.  If things currently work fine then
great.  However, there does seem to be a lot of flexibility in how the
regions are reported, so please be mindful of this when developing the
code.

>
>>
>>>
>>> 2. Who initiates the requests? If it's the libnvdimm driver, that
>>> means we still need to introduce Xen specific code to the driver.
>>>
>>> Or the requests are issued by OSPM (or the Xen component you
>>> mentioned above) when they probe new dimms?
>>>
>>> For the latter, Dan, do you think it's acceptable in NFIT code to
>>> call the Xen component to request the access permission of the pmem
>>> regions, e.g. in acpi_nfit_insert_resource(). Of course, it's only
>>> used for Dom0 case.
>>
>> The libnvdimm driver should continue to use ioremap() or whatever it
>> currently does.  There shouldn't be Xen modifications like that.
>>
>> The one issue will come if libnvdimm tries to ioremap()/other an area
>> which Xen is unaware is an NVDIMM, and rejects the mapping request.
>> Somehow, a Xen component will need to find the MFN/SPA layout and
>> register this information with Xen, before the ioremap() call made by
>> the libnvdimm driver.  Perhaps a notifier mechanism out from the ACPI
>> subsystem might be the best way to make this work in a clean way.
>>
>
> Yes, this is necessary for hotplugged NVDIMM.

Ok.

>
>>>> ** This should work, as Xen is aware that there is something there to
>>>> be mapped (rather than just empty physical address space).
>>>> * Dom0 finds that some NVDIMM ranges are now available for use
>>>> (probably modelled as hotplug events).
>>>> * /dev/pmem $STUFF starts happening as normal.
>>>>
>>>> At some point later after dom0 policy decisions are made
>>>> (ultimately, by the host administrator):
>>>> * If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
>>>> Xen of the SPA/MFN regions which are safe to use.
>>>> * Xen then incorporates these regions into its idea of RAM, and
>>>> starts using them for whatever.
>>>
>>> Agree. I think we may not need to fix the way/format/... to make the
>>> reservation, and instead let the users (host administrators), who have
>>> better understanding of their data, make the proper decision.
>>
>> Yes.  This is the best course of action.
>>
>>>
>>> In a worse case that no reservation is made, Xen hypervisor could
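
The ordering discussed in this message (ranges are registered with Xen first, and only then can a mapping request for them succeed) can be modelled in a few lines of C. This is only a sketch under assumptions: `pmem_registry`, `pmem_register()` and `pmem_may_map()` are hypothetical names invented for illustration, not Xen or Linux interfaces, and the fixed-size table stands in for whatever bookkeeping Xen would really use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping: none of these names exist in Xen or Linux. */
#define MAX_PMEM_RANGES 16

struct pmem_range {
    uint64_t start_mfn;   /* first machine frame of the range */
    uint64_t nr_mfns;     /* number of frames in the range */
};

struct pmem_registry {
    struct pmem_range ranges[MAX_PMEM_RANGES];
    size_t count;
};

/* Record a range found by the hypervisor's own scan or reported by dom0. */
bool pmem_register(struct pmem_registry *reg,
                   uint64_t start_mfn, uint64_t nr_mfns)
{
    assert(reg != NULL);
    if (nr_mfns == 0 || reg->count == MAX_PMEM_RANGES)
        return false;
    if (start_mfn + nr_mfns < start_mfn)    /* reject wrap-around */
        return false;
    reg->ranges[reg->count].start_mfn = start_mfn;
    reg->ranges[reg->count].nr_mfns = nr_mfns;
    reg->count++;
    return true;
}

/* Allow a mapping request only if it lies entirely inside one known range. */
bool pmem_may_map(const struct pmem_registry *reg,
                  uint64_t start_mfn, uint64_t nr_mfns)
{
    assert(reg != NULL);
    if (nr_mfns == 0 || start_mfn + nr_mfns < start_mfn)
        return false;
    for (size_t i = 0; i < reg->count; i++) {
        const struct pmem_range *r = &reg->ranges[i];
        if (start_mfn >= r->start_mfn &&
            start_mfn + nr_mfns <= r->start_mfn + r->nr_mfns)
            return true;
    }
    return false;
}
```

In this model a request for a hotplugged range fails until something (for instance a notifier-driven Xen component in dom0) has called `pmem_register()` for it, which is the ordering problem raised above for the ioremap() case.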
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/14/16 13:18 +0100, Andrew Cooper wrote:
> On 14/10/16 08:08, Haozhong Zhang wrote:
>> On 10/13/16 20:33 +0100, Andrew Cooper wrote:
>>> On 13/10/16 19:59, Dan Williams wrote:
>>>> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
>>>>> On 13/10/16 16:40, Dan Williams wrote:
>>>>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>>>>>> [..]
>>>>>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>>>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>>>>>> reply.
>>>>>>> Well, my opinion certainly doesn't count much here, but I continue
>>>>>>> to consider this a bad idea. For entities like drivers it may well
>>>>>>> be appropriate, but I think there ought to be an independent
>>>>>>> concept of "OS reserved", and in the Xen case this could then be
>>>>>>> shared between hypervisor and Dom0 kernel. Or if we were to
>>>>>>> consider Dom0 "just a guest", things should even be the other way
>>>>>>> around: Xen gets all of the OS reserved space, and Dom0 needs
>>>>>>> something custom.
>>>>>> You haven't made the case why Xen is special and other applications
>>>>>> of persistent memory are not.
>>>>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0
>>>>> is a VM running in ring1/3 with the nvdimm driver.  This is the
>>>>> opposite way around to the KVM model.
>>>>>
>>>>> Dom0, being the hardware domain, has default ownership of all the
>>>>> hardware, but to gain access in the first place, it must request a
>>>>> mapping from Xen.
>>>> This is where my understanding of the Xen model breaks down.  Are you
>>>> saying dom0 can't access the persistent memory range unless the ring0
>>>> agent has metadata storage space for tracking what it maps into dom0?
>>>
>>> No.  I am trying to point out that the current suggestion won't work,
>>> and needs re-designing.
>>>
>>> Xen *must* be able to properly configure mappings of the NVDIMM for
>>> dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
>>> corruption will occur.
>>>
>>> Whether this means no Xen metadata, or the metadata living elsewhere in
>>> regular ram, such as the main frametable, is an implementation detail.
>>>
>>>>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
>>>>> work and figure out what is on the DIMM, and which areas are safe to
>>>>> use.
>>>> I don't understand this ordering of events.  Dom0 needs to have a
>>>> mapping to even write the on-media structure to indicate a
>>>> reservation.  So, initial dom0 access can't depend on metadata
>>>> reservation already being present.
>>>
>>> I agree.
>>>
>>> Overall, I think the following is needed.
>>>
>>> * Xen starts up.
>>> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
>>> needs to note this information somehow.
>>> ** Xen might find some Type 7 E820 regions, and needs to note this
>>> information somehow.
>>
>> IIUC, this is to collect MFNs and no need to create frame table and
>> M2P at this stage. If so, what is different from ...
>>
>>> * Xen starts dom0.
>>> * Once OSPM is running, a Xen component in Linux needs to collect and
>>> report all NVDIMM SPA/MFN regions it knows about.
>>> ** This covers the AML-only case, and the hotplug case.
>>
>> ... the MFNs reported here, especially that the former is a subset
>> (hotplug ones not included in the former) of latter.
>
> Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
> when it is capable of working things out itself, (which can aid with
> debugging one half of this arrangement).  Also, the MFNs found by Xen
> alone can be present in the default memory map for dom0.
>

Sure, I'll add code to parse the NFIT in Xen to discover statically
plugged pmem mode NVDIMM and their MFNs.

By the default memory map for dom0, do you mean making
XENMEM_memory_map return the above MFNs in the Dom0 E820?

>>
>> (There is no E820 hole or SRAT entries to tell which address range is
>> reserved for hotplugged NVDIMM)
>>
>>> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>>
>> Two questions:
>> 1. Why is this request necessary? Even without such requests, as in
>> my current implementation, Dom0 can still access NVDIMM.
>
> Can it?  (if so, great, but I don't think this holds in the general
> case.)  Is that a side effect of the NVDIMM being covered by a hole in
> the E820?

In my development environment, NVDIMM MFNs are not covered by any E820
entry and appear after RAM MFNs.

Can you explain more about this point? Why can it work if covered by
E820 hole?

> The current logic for what dom0 may access by default is somewhat
> ad-hoc, and I have a gut feeling that it won't work with E820 type 7
> regions.
>
>>
>> Or do you mean Xen hypervisor should by default disallow Dom0 to
>> access MFNs reported in previous step until they are requested?
>
> No - I am not suggesting this.
>
>>
>> 2. Who initiates the requests? If it's the libnvdimm driver, that
>> means we still need to introduce Xen specific code to the driver.
>>
>> Or the requests are issued by OSPM (or the Xen component you
>> mentioned above) when they probe new dimms?
>>
>> For the latter, Dan, do you think it's acceptable in NFIT code to
>> call the Xen component to request the access permission of the pmem
>> regions, e.g. in acpi_nfit_insert_resource(). Of course, it's only
>> used for Dom0 case.
>
> The
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/14/16 04:16 -0600, Jan Beulich wrote:
On 13.10.16 at 17:46, wrote:
On 10/13/16 03:08 -0600, Jan Beulich wrote:
On 13.10.16 at 10:53, wrote:
On 10/13/16 02:34 -0600, Jan Beulich wrote:
On 12.10.16 at 18:19, wrote:
On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
On 12.10.16 at 17:42, wrote:
On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
On 12.10.16 at 16:58, wrote:
On 10/12/16 05:32 -0600, Jan Beulich wrote:
On 12.10.16 at 12:33, wrote:

The layout is shown as the following diagram.

 +---------------+-----------+-------+----------+--------------+
 | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
 |  by kernel    |   Table   | Block | for Xen  |              |
 +---------------+-----------+-------+----------+--------------+
                 \_____________________ _______________________/
                                       V
                                  /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.

Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is?  Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?

The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.

No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume.  Explain again what is wrong with the partition approach?

Not sure what was unclear in my previous reply. I don't think there
should be a priori knowledge of whether Xen is (going to be) used on a
system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.

The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use.  So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.

Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.

I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including
xen/dm/..., if they want reservation for their own purpose, they
should maintain their own reservations out of libnvdimm driver and
avoid bothering the libnvdimm driver (e.g. add specific handling in
libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel.

No such independent concept seems to exist right now. It may be hard
to define such a concept, because it's hard to know the common
requirements (e.g. size/alignment/...) from ALL OSes. Making each
component maintain its own reservation in its own way seems more
flexible.

Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

Sure, it's possible to implement the driver in a way that
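
The shared "OS reserved" idea quoted above (one static reservation, part used by Xen, the rest by the Dom0 kernel, with the kernel asking the hypervisor where the leftover starts and how large it is) might look roughly like the following. This is a sketch only: `reserved_area` and `split_reservation()` are hypothetical names, and handing Xen the front of the area is an arbitrary assumption, since the thread leaves the arbitration protocol "tbd".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a reserved area on the device. */
struct reserved_area {
    uint64_t start;   /* byte offset of the area on the device */
    uint64_t size;    /* length of the area in bytes */
};

/*
 * Split one static reservation: Xen takes xen_bytes from the front
 * (an assumption made for this sketch), Dom0 gets whatever is left.
 * Returns false if Xen asks for more than the reservation holds.
 */
bool split_reservation(const struct reserved_area *total, uint64_t xen_bytes,
                       struct reserved_area *xen, struct reserved_area *dom0)
{
    assert(total != NULL && xen != NULL && dom0 != NULL);
    if (xen_bytes > total->size)
        return false;
    xen->start = total->start;
    xen->size = xen_bytes;
    dom0->start = total->start + xen_bytes;
    dom0->size = total->size - xen_bytes;
    return true;
}
```

The size of the device never changes in this model; only the internal split of the fixed reservation depends on whether Xen is running, which is the distinction drawn in the exchange above.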
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 14/10/16 08:08, Haozhong Zhang wrote:
> On 10/13/16 20:33 +0100, Andrew Cooper wrote:
>> On 13/10/16 19:59, Dan Williams wrote:
>>> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
>>>> On 13/10/16 16:40, Dan Williams wrote:
>>>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>>>>> [..]
>>>>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>>>>> reply.
>>>>>> Well, my opinion certainly doesn't count much here, but I continue
>>>>>> to consider this a bad idea. For entities like drivers it may well
>>>>>> be appropriate, but I think there ought to be an independent concept
>>>>>> of "OS reserved", and in the Xen case this could then be shared
>>>>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>>>>> "just a guest", things should even be the other way around: Xen gets
>>>>>> all of the OS reserved space, and Dom0 needs something custom.
>>>>> You haven't made the case why Xen is special and other applications
>>>>> of persistent memory are not.
>>>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0
>>>> is a VM running in ring1/3 with the nvdimm driver.  This is the
>>>> opposite way around to the KVM model.
>>>>
>>>> Dom0, being the hardware domain, has default ownership of all the
>>>> hardware, but to gain access in the first place, it must request a
>>>> mapping from Xen.
>>> This is where my understanding of the Xen model breaks down.  Are you
>>> saying dom0 can't access the persistent memory range unless the ring0
>>> agent has metadata storage space for tracking what it maps into dom0?
>>
>> No.  I am trying to point out that the current suggestion won't work,
>> and needs re-designing.
>>
>> Xen *must* be able to properly configure mappings of the NVDIMM for
>> dom0, *without* modifying any content on the NVDIMM.  Otherwise, data
>> corruption will occur.
>>
>> Whether this means no Xen metadata, or the metadata living elsewhere in
>> regular ram, such as the main frametable, is an implementation detail.
>>
>>>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to
>>>> work and figure out what is on the DIMM, and which areas are safe to
>>>> use.
>>> I don't understand this ordering of events.  Dom0 needs to have a
>>> mapping to even write the on-media structure to indicate a
>>> reservation.  So, initial dom0 access can't depend on metadata
>>> reservation already being present.
>>
>> I agree.
>>
>> Overall, I think the following is needed.
>>
>> * Xen starts up.
>> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
>> needs to note this information somehow.
>> ** Xen might find some Type 7 E820 regions, and needs to note this
>> information somehow.
>
> IIUC, this is to collect MFNs and no need to create frame table and
> M2P at this stage. If so, what is different from ...
>
>> * Xen starts dom0.
>> * Once OSPM is running, a Xen component in Linux needs to collect and
>> report all NVDIMM SPA/MFN regions it knows about.
>> ** This covers the AML-only case, and the hotplug case.
>
> ... the MFNs reported here, especially that the former is a subset
> (hotplug ones not included in the former) of latter.

Hopefully nothing.  However, Xen shouldn't exclusively rely on the dom0
when it is capable of working things out itself, (which can aid with
debugging one half of this arrangement).  Also, the MFNs found by Xen
alone can be present in the default memory map for dom0.

>
> (There is no E820 hole or SRAT entries to tell which address range is
> reserved for hotplugged NVDIMM)
>
>> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>
> Two questions:
> 1. Why is this request necessary? Even without such requests, as in
> my current implementation, Dom0 can still access NVDIMM.

Can it?  (if so, great, but I don't think this holds in the general
case.)  Is that a side effect of the NVDIMM being covered by a hole in
the E820?

The current logic for what dom0 may access by default is somewhat
ad-hoc, and I have a gut feeling that it won't work with E820 type 7
regions.

>
> Or do you mean Xen hypervisor should by default disallow Dom0 to
> access MFNs reported in previous step until they are requested?

No - I am not suggesting this.

>
> 2. Who initiates the requests? If it's the libnvdimm driver, that
> means we still need to introduce Xen specific code to the driver.
>
> Or the requests are issued by OSPM (or the Xen component you
> mentioned above) when they probe new dimms?
>
> For the latter, Dan, do you think it's acceptable in NFIT code to
> call the Xen component to request the access permission of the pmem
> regions, e.g. in acpi_nfit_insert_resource(). Of course, it's only
> used for Dom0 case.

The libnvdimm driver should continue to use ioremap() or whatever it
currently does.  There shouldn't be
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 14/10/16 08:08, Haozhong Zhang wrote:
> On 10/13/16 20:33 +0100, Andrew Cooper wrote:
>> On 13/10/16 19:59, Dan Williams wrote:
>>> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
>>>> On 13/10/16 16:40, Dan Williams wrote:
>>>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>>>>> [..]
>>>>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>>>>> reply.
>>>>>> Well, my opinion certainly doesn't count much here, but I continue to
>>>>>> consider this a bad idea. For entities like drivers it may well be
>>>>>> appropriate, but I think there ought to be an independent concept
>>>>>> of "OS reserved", and in the Xen case this could then be shared
>>>>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>>>>> "just a guest", things should even be the other way around: Xen gets
>>>>>> all of the OS reserved space, and Dom0 needs something custom.
>>>>> You haven't made the case why Xen is special and other applications of
>>>>> persistent memory are not.
>>>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
>>>> a VM running in ring1/3 with the nvdimm driver. This is the opposite
>>>> way around to the KVM model.
>>>>
>>>> Dom0, being the hardware domain, has default ownership of all the
>>>> hardware, but to gain access in the first place, it must request a
>>>> mapping from Xen.
>>> This is where my understanding of the Xen model breaks down. Are you
>>> saying dom0 can't access the persistent memory range unless the ring0
>>> agent has metadata storage space for tracking what it maps into dom0?
>>
>> No. I am trying to point out that the current suggestion won't work, and
>> needs re-designing.
>>
>> Xen *must* be able to properly configure mappings of the NVDIMM for
>> dom0, *without* modifying any content on the NVDIMM. Otherwise, data
>> corruption will occur.
>>
>> Whether this means no Xen metadata, or the metadata living elsewhere in
>> regular ram, such as the main frametable, is an implementation detail.
>>
>>>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
>>>> and figure out what is on the DIMM, and which areas are safe to use.
>>> I don't understand this ordering of events. Dom0 needs to have a
>>> mapping to even write the on-media structure to indicate a
>>> reservation. So, initial dom0 access can't depend on metadata
>>> reservation already being present.
>>
>> I agree.
>>
>> Overall, I think the following is needed.
>>
>> * Xen starts up.
>> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
>> needs to note this information somehow.
>> ** Xen might find some Type 7 E820 regions, and needs to note this
>> information somehow.
>
> IIUC, this is to collect MFNs and no need to create frame table and
> M2P at this stage. If so, what is different from ...
>
>> * Xen starts dom0.
>> * Once OSPM is running, a Xen component in Linux needs to collect and
>> report all NVDIMM SPA/MFN regions it knows about.
>> ** This covers the AML-only case, and the hotplug case.
>
> ... the MFNs reported here, especially that the former is a subset
> (hotplug ones not included in the former) of latter.

Hopefully nothing. However, Xen shouldn't exclusively rely on the dom0
when it is capable of working things out itself (which can aid with
debugging one half of this arrangement). Also, the MFNs found by Xen
alone can be present in the default memory map for dom0.

>
> (There is no E820 hole or SRAT entries to tell which address range is
> reserved for hotplugged NVDIMM)
>
>> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
>
> Two questions:
> 1. Why is this request necessary? Even without such requests like what
> my current implementation, Dom0 can still access NVDIMM.

Can it? (If so, great, but I don't think this holds in the general
case.) Is that a side effect of the NVDIMM being covered by a hole in
the E820?

The current logic for what dom0 may access by default is somewhat
ad-hoc, and I have a gut feeling that it won't work with E820 type 7
regions.

>
> Or do you mean Xen hypervisor should by default disallow Dom0 to
> access MFNs reported in previous step until they are requested?

No - I am not suggesting this.

>
> 2. Who initiates the requests? If it's the libnvdimm driver, that
> means we still need to introduce Xen specific code to the driver.
>
> Or the requests are issued by OSPM (or the Xen component you
> mentioned above) when they probe new dimms?
>
> For the latter, Dan, do you think it's acceptable in NFIT code to
> call the Xen component to request the access permission of the pmem
> regions, e.g. in acpi_nfit_insert_resource(). Of course, it's only
> used for Dom0 case.

The libnvdimm driver should continue to use ioremap() or whatever it
currently does. There shouldn't be Xen modifications like that.

The one issue
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 13.10.16 at 17:46, wrote:
> On 10/13/16 03:08 -0600, Jan Beulich wrote:
>>>>> On 13.10.16 at 10:53, wrote:
>>> On 10/13/16 02:34 -0600, Jan Beulich wrote:
>>>>>>> On 12.10.16 at 18:19, wrote:
>>>>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>>>>>>>>> On 12.10.16 at 17:42, wrote:
>>>>>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>>>>>>>>> On 12.10.16 at 16:58, wrote:
>>>>>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>>>>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>>>>>> The layout is shown as the following diagram.
>>>>>>>>>>>
>>>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>>>>>> | by kernel     | Table     | Block | for Xen  |              |
>>>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>>>                 \_____________________ _______________________/
>>>>>>>>>>>                                       V
>>>>>>>>>>>                                  /dev/pmem0
>>>>>>>>>>
>>>>>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>>>>>> the exact usage of which depends on which OS is currently
>>>>>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>>>>>> control data as it has when running on bare hardware, for it
>>>>>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>>>>>> is present.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>>>>>> defined by a protocol that is accepted by all OSes?
>>>>>>>>
>>>>>>>> The latter - we clearly won't get away without some agreement on
>>>>>>>> where to retrieve position and size of this area. I was simply
>>>>>>>> assuming that such a protocol already exists.
>>>>>>>
>>>>>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>>>>>> may actively use with the Xen reservation that the Dom0 kernel does
>>>>>>> not consume. Explain again what is wrong with the partition approach?
>>>>>>
>>>>>> Not sure what was unclear in my previous reply. I don't think there
>>>>>> should be a priori knowledge of whether Xen is (going to be) used on
>>>>>> a system, and even if it gets used, but just occasionally, it would
>>>>>> (apart from the abstract considerations already given) be a waste
>>>>>> of resources to set something aside that could be used for other
>>>>>> purposes while Xen is not running. Static partitioning should only be
>>>>>> needed for persistent data.
>>>>>
>>>>> The reservation needs to be persistent / static even if the data is
>>>>> volatile, as is the case with struct page, because we can't have the
>>>>> size of the device change depending on use. So, from the aspect of
>>>>> wasting space while Xen is not in use, both partitions and the
>>>>> intrinsic reservation approach suffer the same problem. Setting that
>>>>> aside I don't want to mix 2 different use cases into the same
>>>>> reservation.
>>>>
>>>> Then you didn't understand what I've said: I certainly didn't mean
>>>> the reservation to vary from a device perspective. However, when
>>>> Xen is in use I don't see why part of that static reservation
>>>> couldn't be used by Xen, and another part by the Dom0 kernel. The
>>>> kernel obviously would need to ask the hypervisor how much of the
>>>> space is left, and where that area starts.
>>>
>>> I think Dan means that there should be a clear separation between
>>> reservations for different usages (kernel/xen/...). The libnvdimm
>>> driver is for the linux kernel and only needs to maintain the
>>> reservation for kernel functionality. For others including xen/dm/...,
>>> if they want reservation for their own purpose, they should maintain
>>> their own reservations out of libnvdimm driver and avoid bothering the
>>> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
>>>
>>> IIUC, one existing example is device-mapper device (dm) which needs to
>>> reserve on-device area for its own meta-data. Its choice is to store
>>> the meta-data on the block device (/dev/pmemN) provided by the
>>> libnvdimm driver.
>>>
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>>
>> Well, my opinion certainly doesn't count much here, but I continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 13.10.16 at 17:40, wrote:
> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
> [..]
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>>
>> Well, my opinion certainly doesn't count much here, but I continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and in the Xen case this could then be shared
>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>> "just a guest", things should even be the other way around: Xen gets
>> all of the OS reserved space, and Dom0 needs something custom.
>
> You haven't made the case why Xen is special and other applications of
> persistent memory are not.

Well, I'm implying this from there being a special Linux reservation.
Xen (as explained by Andrew) sitting underneath the Dom0 kernel (other
than ...

> The current struct page reservation
> supports fundamental address-ability of persistent memory namespaces
> for the rest of the kernel. The Xen reservation is application
> specific. XFS, EXT4, and DM also have application specific usages of
> persistent memory and consume metadata space out of a block device. If
> we don't need an XFS-mode nvdimm device, why do we need Xen-mode?

... all the examples you give) by implication is special then too.

If you made the kernel be no different than the other examples you
give, Xen probably shouldn't be any different anymore either.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/13/16 20:33 +0100, Andrew Cooper wrote:
> On 13/10/16 19:59, Dan Williams wrote:
>> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
>>> On 13/10/16 16:40, Dan Williams wrote:
>>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>>>> [..]
>>>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>>>> reply.
>>>>> Well, my opinion certainly doesn't count much here, but I continue to
>>>>> consider this a bad idea. For entities like drivers it may well be
>>>>> appropriate, but I think there ought to be an independent concept
>>>>> of "OS reserved", and in the Xen case this could then be shared
>>>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>>>> "just a guest", things should even be the other way around: Xen gets
>>>>> all of the OS reserved space, and Dom0 needs something custom.
>>>> You haven't made the case why Xen is special and other applications of
>>>> persistent memory are not.
>>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
>>> a VM running in ring1/3 with the nvdimm driver. This is the opposite
>>> way around to the KVM model.
>>>
>>> Dom0, being the hardware domain, has default ownership of all the
>>> hardware, but to gain access in the first place, it must request a
>>> mapping from Xen.
>> This is where my understanding of the Xen model breaks down. Are you
>> saying dom0 can't access the persistent memory range unless the ring0
>> agent has metadata storage space for tracking what it maps into dom0?
>
> No. I am trying to point out that the current suggestion won't work, and
> needs re-designing.
>
> Xen *must* be able to properly configure mappings of the NVDIMM for
> dom0, *without* modifying any content on the NVDIMM. Otherwise, data
> corruption will occur.
>
> Whether this means no Xen metadata, or the metadata living elsewhere in
> regular ram, such as the main frametable, is an implementation detail.
>
>>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
>>> and figure out what is on the DIMM, and which areas are safe to use.
>> I don't understand this ordering of events. Dom0 needs to have a
>> mapping to even write the on-media structure to indicate a
>> reservation. So, initial dom0 access can't depend on metadata
>> reservation already being present.
>
> I agree.
>
> Overall, I think the following is needed.
>
> * Xen starts up.
> ** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
> needs to note this information somehow.
> ** Xen might find some Type 7 E820 regions, and needs to note this
> information somehow.

IIUC, this is to collect MFNs and no need to create frame table and
M2P at this stage. If so, what is different from ...

> * Xen starts dom0.
> * Once OSPM is running, a Xen component in Linux needs to collect and
> report all NVDIMM SPA/MFN regions it knows about.
> ** This covers the AML-only case, and the hotplug case.

... the MFNs reported here, especially that the former is a subset
(hotplug ones not included in the former) of latter.

(There is no E820 hole or SRAT entries to tell which address range is
reserved for hotplugged NVDIMM)

> * Dom0 requests a mapping of the NVDIMMs via the usual mechanism.

Two questions:
1. Why is this request necessary? Even without such requests like what
my current implementation, Dom0 can still access NVDIMM.

Or do you mean Xen hypervisor should by default disallow Dom0 to
access MFNs reported in previous step until they are requested?

2. Who initiates the requests? If it's the libnvdimm driver, that
means we still need to introduce Xen specific code to the driver.

Or the requests are issued by OSPM (or the Xen component you
mentioned above) when they probe new dimms?

For the latter, Dan, do you think it's acceptable in NFIT code to
call the Xen component to request the access permission of the pmem
regions, e.g. in acpi_nfit_insert_resource(). Of course, it's only
used for Dom0 case.

> ** This should work, as Xen is aware that there is something there to be
> mapped (rather than just empty physical address space).
> * Dom0 finds that some NVDIMM ranges are now available for use (probably
> modelled as hotplug events).
> * /dev/pmem $STUFF starts happening as normal.
>
> At some point later after dom0 policy decisions are made (ultimately, by
> the host administrator):
> * If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
> Xen of the SPA/MFN regions which are safe to use.
> * Xen then incorporates these regions into its idea of RAM, and starts
> using them for whatever.

Agree. I think we may not need to fix the way/format/... to make the
reservation, and instead let the users (host administrators), who have
better understanding of their data, make the proper decision.

In the worst case, where no reservation is made, the Xen hypervisor
could fall back to using RAM for the NVDIMM management structures, at
the cost of less RAM for guests.

Thanks,
Haozhong
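To put rough numbers on the RAM fallback mentioned above, here is a back-of-the-envelope sketch. The page size and per-page entry sizes are assumptions chosen for illustration (4 KiB pages, 32 bytes of frame-table entry and 8 bytes of M2P entry per page); they are not taken from this thread, and `management_overhead` is a hypothetical helper, not a Xen interface.

```python
# Rough cost of keeping per-page management structures for an NVDIMM in
# regular RAM. All sizes below are ASSUMPTIONS for illustration only.

PAGE_SIZE = 4096          # assumed page size in bytes
FRAME_ENTRY_BYTES = 32    # assumed frame-table entry size per page
M2P_ENTRY_BYTES = 8       # assumed M2P entry size per page

GiB = 1 << 30
TiB = 1 << 40


def management_overhead(nvdimm_bytes,
                        page_size=PAGE_SIZE,
                        frame_entry=FRAME_ENTRY_BYTES,
                        m2p_entry=M2P_ENTRY_BYTES):
    """Return the page count and bookkeeping bytes for an NVDIMM of the given size."""
    pages = nvdimm_bytes // page_size
    return {
        "pages": pages,
        "frame_table_bytes": pages * frame_entry,
        "m2p_bytes": pages * m2p_entry,
        "total_bytes": pages * (frame_entry + m2p_entry),
    }


cost = management_overhead(1 * TiB)
print("frame table: %.1f GiB" % (cost["frame_table_bytes"] / GiB))
print("M2P:         %.1f GiB" % (cost["m2p_bytes"] / GiB))
print("total:       %.1f GiB" % (cost["total_bytes"] / GiB))
```

Under these assumptions the bookkeeping is a little under 1% of the NVDIMM capacity, which is why it matters whether it lives on the device or is taken out of RAM that guests could otherwise use.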
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 13/10/16 19:59, Dan Williams wrote:
> On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
>> On 13/10/16 16:40, Dan Williams wrote:
>>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>>> [..]
>>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>>> reply.
>>>> Well, my opinion certainly doesn't count much here, but I continue to
>>>> consider this a bad idea. For entities like drivers it may well be
>>>> appropriate, but I think there ought to be an independent concept
>>>> of "OS reserved", and in the Xen case this could then be shared
>>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>>> "just a guest", things should even be the other way around: Xen gets
>>>> all of the OS reserved space, and Dom0 needs something custom.
>>> You haven't made the case why Xen is special and other applications of
>>> persistent memory are not.
>> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
>> a VM running in ring1/3 with the nvdimm driver. This is the opposite
>> way around to the KVM model.
>>
>> Dom0, being the hardware domain, has default ownership of all the
>> hardware, but to gain access in the first place, it must request a
>> mapping from Xen.
> This is where my understanding of the Xen model breaks down. Are you
> saying dom0 can't access the persistent memory range unless the ring0
> agent has metadata storage space for tracking what it maps into dom0?

No. I am trying to point out that the current suggestion won't work, and
needs re-designing.

Xen *must* be able to properly configure mappings of the NVDIMM for
dom0, *without* modifying any content on the NVDIMM. Otherwise, data
corruption will occur.

Whether this means no Xen metadata, or the metadata living elsewhere in
regular ram, such as the main frametable, is an implementation detail.

>> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
>> and figure out what is on the DIMM, and which areas are safe to use.
> I don't understand this ordering of events. Dom0 needs to have a
> mapping to even write the on-media structure to indicate a
> reservation. So, initial dom0 access can't depend on metadata
> reservation already being present.

I agree.

Overall, I think the following is needed.

* Xen starts up.
** Xen might find some NVDIMM SPA/MFN ranges in the NFIT table, and
   needs to note this information somehow.
** Xen might find some Type 7 E820 regions, and needs to note this
   information somehow.
* Xen starts dom0.
* Once OSPM is running, a Xen component in Linux needs to collect and
  report all NVDIMM SPA/MFN regions it knows about.
** This covers the AML-only case, and the hotplug case.
* Dom0 requests a mapping of the NVDIMMs via the usual mechanism.
** This should work, as Xen is aware that there is something there to be
   mapped (rather than just empty physical address space).
* Dom0 finds that some NVDIMM ranges are now available for use (probably
  modelled as hotplug events).
* /dev/pmem $STUFF starts happening as normal.

At some point later after dom0 policy decisions are made (ultimately, by
the host administrator):
* If an area of NVDIMM is chosen for Xen to use, Dom0 needs to inform
  Xen of the SPA/MFN regions which are safe to use.
* Xen then incorporates these regions into its idea of RAM, and starts
  using them for whatever.

~Andrew
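
The discovery half of the sequence above (Xen noting Type 7 E820 regions at
boot, then folding in whatever dom0 reports once OSPM is running) can be
modelled in a few lines. This is only an illustrative sketch, not Xen code:
the names Region, pmem_mfn_ranges and merge_ranges are made up for the
example, NFIT parsing is left out, and 4 KiB pages are assumed. The value 7
is the ACPI persistent-memory address range type mentioned in the thread.

```python
from dataclasses import dataclass

E820_TYPE_PMEM = 7   # ACPI address range type for persistent memory
PAGE_SHIFT = 12      # assumption: 4 KiB pages


@dataclass(frozen=True)
class Region:
    """One memory map entry (hypothetical model of an E820 entry)."""
    start: int  # physical byte address (SPA)
    size: int   # length in bytes
    type: int   # E820 type code


def pmem_mfn_ranges(e820):
    """Return [(first_mfn, end_mfn)] (end exclusive) for Type 7 entries.

    Partial pages at either end are dropped, so only whole frames are
    reported.
    """
    out = []
    for r in e820:
        if r.type != E820_TYPE_PMEM or r.size == 0:
            continue
        first = (r.start + (1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT  # round up
        end = (r.start + r.size) >> PAGE_SHIFT                   # round down
        if end > first:
            out.append((first, end))
    return out


def merge_ranges(*sources):
    """Union the MFN ranges from several sources (for example the boot-time
    scan and the ranges dom0 reports later), coalescing overlapping or
    adjacent ranges."""
    merged = []
    for start, end in sorted(r for src in sources for r in src):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Because the boot-time result and the dom0-reported result go through the
same merge, the statically discovered ranges being a subset of what dom0
reports later makes no difference to the outcome.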
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Thu, Oct 13, 2016 at 9:01 AM, Andrew Cooper wrote:
> On 13/10/16 16:40, Dan Williams wrote:
>> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
>> [..]
>>>> I think we can do the similar for Xen, like to lay another pseudo
>>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>>> reply.
>>> Well, my opinion certainly doesn't count much here, but I continue to
>>> consider this a bad idea. For entities like drivers it may well be
>>> appropriate, but I think there ought to be an independent concept
>>> of "OS reserved", and in the Xen case this could then be shared
>>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>>> "just a guest", things should even be the other way around: Xen gets
>>> all of the OS reserved space, and Dom0 needs something custom.
>> You haven't made the case why Xen is special and other applications of
>> persistent memory are not.
>
> In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
> a VM running in ring1/3 with the nvdimm driver. This is the opposite
> way around to the KVM model.
>
> Dom0, being the hardware domain, has default ownership of all the
> hardware, but to gain access in the first place, it must request a
> mapping from Xen.

This is where my understanding of the Xen model breaks down. Are you
saying dom0 can't access the persistent memory range unless the ring0
agent has metadata storage space for tracking what it maps into dom0?

That can't be true because then PCI memory ranges would not work
without metadata reserve space. Dom0 still needs to map and write the
DIMMs to even set up the struct page reservation, it isn't established
by default.

> Xen therefore needs to know and cope with being able
> to give dom0 a mapping to the nvdimms, without touching the content of
> the nvdimm itself (so as to avoid corrupting data).

Is it true that this metadata only comes into use when remapping the
dom0 discovered range(s) into a guest VM?

> Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
> and figure out what is on the DIMM, and which areas are safe to use.

I don't understand this ordering of events. Dom0 needs to have a
mapping to even write the on-media structure to indicate a
reservation. So, initial dom0 access can't depend on metadata
reservation already being present.

> At this point, a Xen subsystem in Linux could choose one or more areas
> to hand back to the hypervisor to use as RAM/other.

To me all this configuration seems to come after the fact. After
dom0 sees /dev/pmemX devices, then it can go to work carving it up and
writing Xen specific metadata to the range(s). The struct page
reservation never comes into the picture. In fact, a raw mode
namespace (one without a reservation) could be used in this model, the
nvdimm core never needs to know what is happening.
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 13/10/16 16:40, Dan Williams wrote:
> On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
> [..]
>>> I think we can do the similar for Xen, like to lay another pseudo
>>> device on /dev/pmem and do the reservation, like 2. in my previous
>>> reply.
>> Well, my opinion certainly doesn't count much here, but I continue to
>> consider this a bad idea. For entities like drivers it may well be
>> appropriate, but I think there ought to be an independent concept
>> of "OS reserved", and in the Xen case this could then be shared
>> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
>> "just a guest", things should even be the other way around: Xen gets
>> all of the OS reserved space, and Dom0 needs something custom.
> You haven't made the case why Xen is special and other applications of
> persistent memory are not.

In a Xen system, Xen runs in the baremetal root-mode ring0, and dom0 is
a VM running in ring1/3 with the nvdimm driver. This is the opposite
way around to the KVM model.

Dom0, being the hardware domain, has default ownership of all the
hardware, but to gain access in the first place, it must request a
mapping from Xen. Xen therefore needs to know and cope with being able
to give dom0 a mapping to the nvdimms, without touching the content of
the nvdimm itself (so as to avoid corrupting data).

Once dom0 has a mapping of the nvdimm, the nvdimm driver can go to work
and figure out what is on the DIMM, and which areas are safe to use.

At this point, a Xen subsystem in Linux could choose one or more areas
to hand back to the hypervisor to use as RAM/other.

~Andrew
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/13/16 03:08 -0600, Jan Beulich wrote:
>>>> On 13.10.16 at 10:53, wrote:
>> On 10/13/16 02:34 -0600, Jan Beulich wrote:
>>>>>> On 12.10.16 at 18:19, wrote:
>>>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>>>>>>>> On 12.10.16 at 17:42, wrote:
>>>>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>>>>>>>> On 12.10.16 at 16:58, wrote:
>>>>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>>>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>>>>> The layout is shown as the following diagram.
>>>>>>>>>>
>>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>>                 \_____________________ _______________________/
>>>>>>>>>>                                       V
>>>>>>>>>>                                  /dev/pmem0
>>>>>>>>>
>>>>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>>>>> the exact usage of which depends on which OS is currently
>>>>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>>>>> control data as it has when running on bare hardware, for it
>>>>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>>>>> is present.
>>>>>>>>
>>>>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>>>>> defined by a protocol that is accepted by all OSes?
>>>>>>>
>>>>>>> The latter - we clearly won't get away without some agreement on
>>>>>>> where to retrieve position and size of this area. I was simply
>>>>>>> assuming that such a protocol already exists.
>>>>>>
>>>>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>>>>> may actively use with the Xen reservation that the Dom0 kernel does
>>>>>> not consume. Explain again what is wrong with the partition approach?
>>>>>
>>>>> Not sure what was unclear in my previous reply. I don't think there
>>>>> should be a priori knowledge of whether Xen is (going to be) used on
>>>>> a system, and even if it gets used, but just occasionally, it would
>>>>> (apart from the abstract considerations already given) be a waste
>>>>> of resources to set something aside that could be used for other
>>>>> purposes while Xen is not running. Static partitioning should only be
>>>>> needed for persistent data.
>>>>
>>>> The reservation needs to be persistent / static even if the data is
>>>> volatile, as is the case with struct page, because we can't have the
>>>> size of the device change depending on use. So, from the aspect of
>>>> wasting space while Xen is not in use, both partitions and the
>>>> intrinsic reservation approach suffer the same problem. Setting that
>>>> aside I don't want to mix 2 different use cases into the same
>>>> reservation.
>>>
>>> Then you didn't understand what I've said: I certainly didn't mean
>>> the reservation to vary from a device perspective. However, when
>>> Xen is in use I don't see why part of that static reservation couldn't
>>> be used by Xen, and another part by the Dom0 kernel. The kernel
>>> obviously would need to ask the hypervisor how much of the space
>>> is left, and where that area starts.
>>
>> I think Dan means that there should be a clear separation between
>> reservations for different usages (kernel/xen/...). The libnvdimm
>> driver is for the linux kernel and only needs to maintain the
>> reservation for kernel functionality. For others including xen/dm/...,
>> if they want reservation for their own purpose, they should maintain
>> their own reservations out of libnvdimm driver and avoid bothering the
>> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
>>
>> IIUC, one existing example is device-mapper device (dm) which needs to
>> reserve on-device area for its own meta-data. Its choice is to store
>> the meta-data on the block device (/dev/pmemN) provided by the
>> libnvdimm driver.
>>
>> I think we can do the similar for Xen, like to lay another pseudo
>> device on /dev/pmem and do the reservation, like 2. in my previous
>> reply.
>
> Well, my opinion certainly doesn't count much here, but I continue to
> consider this a bad idea. For entities like drivers it may well be
> appropriate, but I think there ought to be an independent concept
> of "OS reserved", and in the Xen case this could then be shared
> between hypervisor and Dom0 kernel.

No such independent concept seems to exist right now. It may be hard to
define such a concept, because it's hard to know the common requirements
(e.g. size/alignment/...) from ALL OSes. Making each component
maintain its own reservation in its own way seems more flexible.

> Or if we were to consider Dom0
> "just a guest", things should even be the other way around: Xen gets
> all of the OS reserved space, and Dom0 needs something custom.

Sure, it's possible to implement the driver in a way that if the
driver finds it runs on Xen, then it just leaves the OS reserved area
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Thu, Oct 13, 2016 at 2:08 AM, Jan Beulich wrote:
[..]
>> I think we can do the similar for Xen, like to lay another pseudo
>> device on /dev/pmem and do the reservation, like 2. in my previous
>> reply.
>
> Well, my opinion certainly doesn't count much here, but I continue to
> consider this a bad idea. For entities like drivers it may well be
> appropriate, but I think there ought to be an independent concept
> of "OS reserved", and in the Xen case this could then be shared
> between hypervisor and Dom0 kernel. Or if we were to consider Dom0
> "just a guest", things should even be the other way around: Xen gets
> all of the OS reserved space, and Dom0 needs something custom.

You haven't made the case why Xen is special and other applications of
persistent memory are not. The current struct page reservation
supports fundamental address-ability of persistent memory namespaces
for the rest of the kernel. The Xen reservation is application
specific. XFS, EXT4, and DM also have application specific usages of
persistent memory and consume metadata space out of a block device. If
we don't need an XFS-mode nvdimm device, why do we need Xen-mode?
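
For a sense of scale, the struct page reservation discussed above can be
estimated with simple arithmetic. The figures below are assumptions, not
numbers from the thread: 64 bytes per struct page and 4 KiB pages are typical
for x86-64 Linux but depend on kernel configuration, and the helper name
struct_page_reservation is invented for this sketch.

```python
def struct_page_reservation(namespace_bytes, page_size=4096,
                            struct_page_bytes=64):
    """Bytes needed to hold one struct page per page of the namespace.

    page_size and struct_page_bytes are assumed defaults, not values
    taken from the thread.
    """
    pages = namespace_bytes // page_size
    return pages * struct_page_bytes


# A 1 TiB namespace has 2**28 pages of 4 KiB; at 64 bytes each that is
# 2**34 bytes, i.e. 16 GiB, or 64/4096 = 1.5625% of the capacity.
one_tib = 1 << 40
reserved = struct_page_reservation(one_tib)
overhead = reserved / one_tib
```

Under these assumptions the reservation is a fixed fraction of the namespace,
which is why it has to be sized when the namespace is set up rather than
varied with use.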
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
+Dan Williams

I accidentally dropped him in my last reply. Add him back.

On 10/13/16 16:53 +0800, Haozhong Zhang wrote:
> On 10/13/16 02:34 -0600, Jan Beulich wrote:
>>>>> On 12.10.16 at 18:19, wrote:
>>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>>>>>>> On 12.10.16 at 17:42, wrote:
>>>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>>>>>>> On 12.10.16 at 16:58, wrote:
>>>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>>>> The layout is shown as the following diagram.
>>>>>>>>>
>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>                 \_____________________ _______________________/
>>>>>>>>>                                       V
>>>>>>>>>                                  /dev/pmem0
>>>>>>>>
>>>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>>>> the exact usage of which depends on which OS is currently
>>>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>>>> control data as it has when running on bare hardware, for it
>>>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>>>> is present.
>>>>>>>
>>>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>>>> defined by a protocol that is accepted by all OSes?
>>>>>>
>>>>>> The latter - we clearly won't get away without some agreement on
>>>>>> where to retrieve position and size of this area. I was simply
>>>>>> assuming that such a protocol already exists.
>>>>>
>>>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>>>> may actively use with the Xen reservation that the Dom0 kernel does
>>>>> not consume. Explain again what is wrong with the partition approach?
>>>>
>>>> Not sure what was unclear in my previous reply. I don't think there
>>>> should be a priori knowledge of whether Xen is (going to be) used on
>>>> a system, and even if it gets used, but just occasionally, it would
>>>> (apart from the abstract considerations already given) be a waste
>>>> of resources to set something aside that could be used for other
>>>> purposes while Xen is not running. Static partitioning should only be
>>>> needed for persistent data.
>>>
>>> The reservation needs to be persistent / static even if the data is
>>> volatile, as is the case with struct page, because we can't have the
>>> size of the device change depending on use. So, from the aspect of
>>> wasting space while Xen is not in use, both partitions and the
>>> intrinsic reservation approach suffer the same problem. Setting that
>>> aside I don't want to mix 2 different use cases into the same
>>> reservation.
>>
>> Then you didn't understand what I've said: I certainly didn't mean
>> the reservation to vary from a device perspective. However, when
>> Xen is in use I don't see why part of that static reservation couldn't
>> be used by Xen, and another part by the Dom0 kernel. The kernel
>> obviously would need to ask the hypervisor how much of the space
>> is left, and where that area starts.
>
> I think Dan means that there should be a clear separation between
> reservations for different usages (kernel/xen/...). The libnvdimm
> driver is for the linux kernel and only needs to maintain the
> reservation for kernel functionality. For others including xen/dm/...,
> if they want reservation for their own purpose, they should maintain
> their own reservations out of libnvdimm driver and avoid bothering the
> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
>
> IIUC, one existing example is device-mapper device (dm) which needs to
> reserve on-device area for its own meta-data. Its choice is to store
> the meta-data on the block device (/dev/pmemN) provided by the
> libnvdimm driver.
>
> I think we can do the similar for Xen, like to lay another pseudo
> device on /dev/pmem and do the reservation, like 2. in my previous
> reply.
>
> Thanks,
> Haozhong
>
>>> The kernel needs to know about the struct page reservation because it
>>> needs to manage the lifetime of page references vs the lifetime of the
>>> device. It does not have the same relationship with a Xen reservation
>>> which is why I'm proposing they be managed separately.
>>
>> I don't think I understand the difference you try to point out here.
>> Linux's struct page and Xen's struct page_info serve the same
>> fundamental purpose.
>>
>> Jan
>>
>> ___
>> Linux-nvdimm mailing list
>> linux-nvd...@lists.01.org
>> https://lists.01.org/mailman/listinfo/linux-nvdimm
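
Jan's alternative quoted above, one static reservation of which Xen uses a
part and the Dom0 kernel the rest, comes down to a split that the kernel can
query. The sketch below is purely hypothetical: no such interface exists in
the thread, the names align_up and split_reservation are invented, the start
address is assumed to be aligned already, and placing Xen's share at the
front is an arbitrary choice.

```python
def align_up(value, align):
    """Round value up to the next multiple of align (a power of two)."""
    return (value + align - 1) & ~(align - 1)


def split_reservation(start, size, xen_bytes, align=4096):
    """Split one static reservation between Xen and the Dom0 kernel.

    Xen takes the front of the area, rounded up to `align`. The second
    tuple is what the kernel would learn when it asks the hypervisor how
    much of the space is left and where that area starts.
    Returns ((xen_start, xen_size), (dom0_start, dom0_size)).
    """
    xen_size = align_up(xen_bytes, align)
    if xen_size > size:
        raise ValueError("reservation too small for Xen's share")
    return (start, xen_size), (start + xen_size, size - xen_size)
```

With xen_bytes set to 0 (Xen not running) the kernel gets the whole area, so
the size of the reservation itself never changes from the device's point of
view, which is the property Dan asks for.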
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 13.10.16 at 10:53, wrote:
> On 10/13/16 02:34 -0600, Jan Beulich wrote:
>> On 12.10.16 at 18:19, wrote:
>>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>>>> On 12.10.16 at 17:42, wrote:
>>>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>>>> On 12.10.16 at 16:58, wrote:
>>>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>>>> The layout is shown as the following diagram.
>>>>>>>>>
>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>>                 \______________________ ______________________/
>>>>>>>>>                                        V
>>>>>>>>>                                   /dev/pmem0
>>>>>>>>
>>>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>>>> the exact usage of which depends on which OS is currently
>>>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>>>> control data as it has when running on bare hardware, for it
>>>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>>>> is present.
>>>>>>>
>>>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>>>> defined by a protocol that is accepted by all OSes?
>>>>>>
>>>>>> The latter - we clearly won't get away without some agreement on
>>>>>> where to retrieve position and size of this area. I was simply
>>>>>> assuming that such a protocol already exists.
>>>>>
>>>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>>>> may actively use with the Xen reservation that the Dom0 kernel does
>>>>> not consume. Explain again what is wrong with the partition approach?
>>>>
>>>> Not sure what was unclear in my previous reply. I don't think there
>>>> should be apriori knowledge of whether Xen is (going to be) used on
>>>> a system, and even if it gets used, but just occasionally, it would
>>>> (apart from the abstract considerations already given) be a waste
>>>> of resources to set something aside that could be used for other
>>>> purposes while Xen is not running. Static partitioning should only be
>>>> needed for persistent data.
>>>
>>> The reservation needs to be persistent / static even if the data is
>>> volatile, as is the case with struct page, because we can't have the
>>> size of the device change depending on use. So, from the aspect of
>>> wasting space while Xen is not in use, both partitions and the
>>> intrinsic reservation approach suffer the same problem. Setting that
>>> aside I don't want to mix 2 different use cases into the same
>>> reservation.
>>
>> Then you didn't understand what I've said: I certainly didn't mean
>> the reservation to vary from a device perspective. However, when
>> Xen is in use I don't see why part of that static reservation couldn't
>> be used by Xen, and another part by the Dom0 kernel. The kernel
>> obviously would need to ask the hypervisor how much of the space
>> is left, and where that area starts.
>
> I think Dan means that there should be a clear separation between
> reservations for different usages (kernel/xen/...). The libnvdimm
> driver is for the linux kernel and only needs to maintain the
> reservation for kernel functionality. For others including xen/dm/...,
> if they want reservation for their own purpose, they should maintain
> their own reservations out of libnvdimm driver and avoid bothering the
> libnvdimm driver (e.g. add specific handling in libnvdimm driver).
>
> IIUC, one existing example is device-mapper device (dm) which needs to
> reserve on-device area for its own meta-data. Its choice is to store
> the meta-data on the block device (/dev/pmemN) provided by the
> libnvdimm driver.
>
> I think we can do the similar for Xen, like to lay another pseudo
> device on /dev/pmem and do the reservation, like 2. in my previous
> reply.

Well, my opinion certainly doesn't count much here, but I continue to
consider this a bad idea. For entities like drivers it may well be
appropriate, but I think there ought to be an independent concept
of "OS reserved", and in the Xen case this could then be shared
between hypervisor and Dom0 kernel. Or if we were to consider Dom0
"just a guest", things should even be the other way around: Xen gets
all of the OS reserved space, and Dom0 needs something custom.

Jan
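The idea argued for above, one static "OS reserved" area whose internal split is decided at run time, with the kernel asking the hypervisor how much space is left and where it starts, could be sketched roughly as follows. Nothing here is an existing Xen or Linux interface: `Region`, `split_reserved`, and the page-granular split are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

PAGE_SIZE = 4096  # assumed allocation granularity in bytes


@dataclass(frozen=True)
class Region:
    start: int  # byte offset from the start of the device
    size: int   # length in bytes

    @property
    def end(self) -> int:
        return self.start + self.size


def split_reserved(reserved: Region, xen_bytes: int) -> Tuple[Region, Region]:
    """Split a fixed reserved area into (xen_part, dom0_part).

    The reserved area itself never changes size; only its internal
    division depends on how much the hypervisor claims.
    """
    # Round the hypervisor's claim up to whole pages.
    xen_size = -(-xen_bytes // PAGE_SIZE) * PAGE_SIZE
    if xen_size > reserved.size:
        raise ValueError("hypervisor claim exceeds the reserved area")
    xen = Region(reserved.start, xen_size)
    # Whatever is left is what the Dom0 kernel would be told it may use.
    dom0 = Region(xen.end, reserved.size - xen_size)
    return xen, dom0


if __name__ == "__main__":
    reserved = Region(start=0, size=16 * PAGE_SIZE)
    xen, dom0 = split_reserved(reserved, xen_bytes=5000)
    print(xen, dom0)
```

With `xen_bytes=0` the whole area falls to the kernel, which corresponds to the bare-metal case where Xen is not running.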
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/13/16 02:34 -0600, Jan Beulich wrote:
> On 12.10.16 at 18:19, wrote:
>> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>>> On 12.10.16 at 17:42, wrote:
>>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>>> On 12.10.16 at 16:58, wrote:
>>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>>> The layout is shown as the following diagram.
>>>>>>>>
>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>>                 \______________________ ______________________/
>>>>>>>>                                        V
>>>>>>>>                                   /dev/pmem0
>>>>>>>
>>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>>> the exact usage of which depends on which OS is currently
>>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>>> control data as it has when running on bare hardware, for it
>>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>>> is present.
>>>>>>
>>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>>> defined by a protocol that is accepted by all OSes?
>>>>>
>>>>> The latter - we clearly won't get away without some agreement on
>>>>> where to retrieve position and size of this area. I was simply
>>>>> assuming that such a protocol already exists.
>>>>
>>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>>> may actively use with the Xen reservation that the Dom0 kernel does
>>>> not consume. Explain again what is wrong with the partition approach?
>>>
>>> Not sure what was unclear in my previous reply. I don't think there
>>> should be apriori knowledge of whether Xen is (going to be) used on
>>> a system, and even if it gets used, but just occasionally, it would
>>> (apart from the abstract considerations already given) be a waste
>>> of resources to set something aside that could be used for other
>>> purposes while Xen is not running. Static partitioning should only be
>>> needed for persistent data.
>>
>> The reservation needs to be persistent / static even if the data is
>> volatile, as is the case with struct page, because we can't have the
>> size of the device change depending on use. So, from the aspect of
>> wasting space while Xen is not in use, both partitions and the
>> intrinsic reservation approach suffer the same problem. Setting that
>> aside I don't want to mix 2 different use cases into the same
>> reservation.
>
> Then you didn't understand what I've said: I certainly didn't mean
> the reservation to vary from a device perspective. However, when
> Xen is in use I don't see why part of that static reservation couldn't
> be used by Xen, and another part by the Dom0 kernel. The kernel
> obviously would need to ask the hypervisor how much of the space
> is left, and where that area starts.

I think Dan means that there should be a clear separation between
reservations for different usages (kernel/xen/...). The libnvdimm
driver is for the linux kernel and only needs to maintain the
reservation for kernel functionality. For others including xen/dm/...,
if they want reservation for their own purpose, they should maintain
their own reservations out of libnvdimm driver and avoid bothering the
libnvdimm driver (e.g. add specific handling in libnvdimm driver).

IIUC, one existing example is device-mapper device (dm) which needs to
reserve on-device area for its own meta-data. Its choice is to store
the meta-data on the block device (/dev/pmemN) provided by the
libnvdimm driver.

I think we can do the similar for Xen, like to lay another pseudo
device on /dev/pmem and do the reservation, like 2. in my previous
reply.

Thanks,
Haozhong

>> The kernel needs to know about the struct page reservation because it
>> needs to manage the lifetime of page references vs the lifetime of the
>> device. It does not have the same relationship with a Xen reservation
>> which is why I'm proposing they be managed separately.
>
> I don't think I understand the difference you try to point out here.
> Linux'es struct page and Xen's struct page_info serve the same
> fundamental purpose.
>
> Jan
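The device-mapper analogy above, where a consumer keeps its own metadata on the block device instead of teaching the libnvdimm driver about it, might look roughly like the sketch below. The magic value, the field layout, and the function names are all hypothetical; the thread does not define any such on-media format.

```python
import struct
import zlib

MAGIC = b"XENPMEM\0"     # hypothetical 8-byte signature
BODY_FMT = "<8sIQQ"      # magic, version, reserved start, reserved size
BODY_LEN = struct.calcsize(BODY_FMT)


def pack_superblock(start: int, size: int, version: int = 1) -> bytes:
    """Serialize a record describing where the consumer's reserved area lives."""
    body = struct.pack(BODY_FMT, MAGIC, version, start, size)
    # Append a CRC32 so a stale or foreign block is not mistaken for ours.
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_superblock(blob: bytes):
    """Return (version, start, size), or raise ValueError if the block is invalid."""
    body = blob[:BODY_LEN]
    (crc,) = struct.unpack("<I", blob[BODY_LEN:BODY_LEN + 4])
    if zlib.crc32(body) != crc:
        raise ValueError("checksum mismatch")
    magic, version, start, size = struct.unpack(BODY_FMT, body)
    if magic != MAGIC:
        raise ValueError("not a reservation super block")
    return version, start, size


if __name__ == "__main__":
    blob = pack_superblock(start=4096, size=1 << 20)
    print(unpack_superblock(blob))
```

The point of the sketch is only that the reservation is discoverable from data on the device, so the driver underneath needs no knowledge of who owns it.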
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 12.10.16 at 18:19, wrote:
> On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
>> On 12.10.16 at 17:42, wrote:
>>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>>> On 12.10.16 at 16:58, wrote:
>>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>>> On 12.10.16 at 12:33, wrote:
>>>>>>> The layout is shown as the following diagram.
>>>>>>>
>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>>                 \______________________ ______________________/
>>>>>>>                                        V
>>>>>>>                                   /dev/pmem0
>>>>>>
>>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>>> Neither should there be any Xen-specific region, nor should the
>>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>>> the exact usage of which depends on which OS is currently
>>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>>> control data as it has when running on bare hardware, for it
>>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>>> is present.
>>>>>
>>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>>> defined by a protocol that is accepted by all OSes?
>>>>
>>>> The latter - we clearly won't get away without some agreement on
>>>> where to retrieve position and size of this area. I was simply
>>>> assuming that such a protocol already exists.
>>>
>>> No, we should not mix the struct page reservation that the Dom0 kernel
>>> may actively use with the Xen reservation that the Dom0 kernel does
>>> not consume. Explain again what is wrong with the partition approach?
>>
>> Not sure what was unclear in my previous reply. I don't think there
>> should be apriori knowledge of whether Xen is (going to be) used on
>> a system, and even if it gets used, but just occasionally, it would
>> (apart from the abstract considerations already given) be a waste
>> of resources to set something aside that could be used for other
>> purposes while Xen is not running. Static partitioning should only be
>> needed for persistent data.
>
> The reservation needs to be persistent / static even if the data is
> volatile, as is the case with struct page, because we can't have the
> size of the device change depending on use. So, from the aspect of
> wasting space while Xen is not in use, both partitions and the
> intrinsic reservation approach suffer the same problem. Setting that
> aside I don't want to mix 2 different use cases into the same
> reservation.

Then you didn't understand what I've said: I certainly didn't mean
the reservation to vary from a device perspective. However, when
Xen is in use I don't see why part of that static reservation couldn't
be used by Xen, and another part by the Dom0 kernel. The kernel
obviously would need to ask the hypervisor how much of the space
is left, and where that area starts.

> The kernel needs to know about the struct page reservation because it
> needs to manage the lifetime of page references vs the lifetime of the
> device. It does not have the same relationship with a Xen reservation
> which is why I'm proposing they be managed separately.

I don't think I understand the difference you try to point out here.
Linux'es struct page and Xen's struct page_info serve the same
fundamental purpose.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 12.10.16 at 17:42, wrote:
> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>> On 12.10.16 at 16:58, wrote:
>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>> On 12.10.16 at 12:33, wrote:
>>>>> The layout is shown as the following diagram.
>>>>>
>>>>> +---------------+-----------+-------+----------+--------------+
>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>                 \______________________ ______________________/
>>>>>                                        V
>>>>>                                   /dev/pmem0
>>>>
>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>> Neither should there be any Xen-specific region, nor should the
>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>> I could see is an OS-reserved area ahead of the partition table,
>>>> the exact usage of which depends on which OS is currently
>>>> running (and in the Xen case this might be both Xen _and_ the
>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>> running under Xen, the Dom0 may not have a need for as much
>>>> control data as it has when running on bare hardware, for it
>>>> controlling less (if any) of the actual memory ranges when Xen
>>>> is present.
>>>
>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>> to know where the reserved area is? Or do you mean it's not if it's
>>> defined by a protocol that is accepted by all OSes?
>>
>> The latter - we clearly won't get away without some agreement on
>> where to retrieve position and size of this area. I was simply
>> assuming that such a protocol already exists.
>
> No, we should not mix the struct page reservation that the Dom0 kernel
> may actively use with the Xen reservation that the Dom0 kernel does
> not consume. Explain again what is wrong with the partition approach?

Not sure what was unclear in my previous reply. I don't think there
should be apriori knowledge of whether Xen is (going to be) used on
a system, and even if it gets used, but just occasionally, it would
(apart from the abstract considerations already given) be a waste
of resources to set something aside that could be used for other
purposes while Xen is not running. Static partitioning should only be
needed for persistent data.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Wed, Oct 12, 2016 at 9:01 AM, Jan Beulich wrote:
> On 12.10.16 at 17:42, wrote:
>> On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
>>> On 12.10.16 at 16:58, wrote:
>>>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>>>> On 12.10.16 at 12:33, wrote:
>>>>>> The layout is shown as the following diagram.
>>>>>>
>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>>>> |  by kernel    |   Table   | Block | for Xen  |              |
>>>>>> +---------------+-----------+-------+----------+--------------+
>>>>>>                 \______________________ ______________________/
>>>>>>                                        V
>>>>>>                                   /dev/pmem0
>>>>>
>>>>> I have to admit that I dislike this, for not being OS-agnostic.
>>>>> Neither should there be any Xen-specific region, nor should the
>>>>> "whatever used by kernel" one be restricted to just Linux. What
>>>>> I could see is an OS-reserved area ahead of the partition table,
>>>>> the exact usage of which depends on which OS is currently
>>>>> running (and in the Xen case this might be both Xen _and_ the
>>>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>>>> running under Xen, the Dom0 may not have a need for as much
>>>>> control data as it has when running on bare hardware, for it
>>>>> controlling less (if any) of the actual memory ranges when Xen
>>>>> is present.
>>>>
>>>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>>>> to know where the reserved area is? Or do you mean it's not if it's
>>>> defined by a protocol that is accepted by all OSes?
>>>
>>> The latter - we clearly won't get away without some agreement on
>>> where to retrieve position and size of this area. I was simply
>>> assuming that such a protocol already exists.
>>
>> No, we should not mix the struct page reservation that the Dom0 kernel
>> may actively use with the Xen reservation that the Dom0 kernel does
>> not consume. Explain again what is wrong with the partition approach?
>
> Not sure what was unclear in my previous reply. I don't think there
> should be apriori knowledge of whether Xen is (going to be) used on
> a system, and even if it gets used, but just occasionally, it would
> (apart from the abstract considerations already given) be a waste
> of resources to set something aside that could be used for other
> purposes while Xen is not running. Static partitioning should only be
> needed for persistent data.

The reservation needs to be persistent / static even if the data is
volatile, as is the case with struct page, because we can't have the
size of the device change depending on use. So, from the aspect of
wasting space while Xen is not in use, both partitions and the
intrinsic reservation approach suffer the same problem. Setting that
aside I don't want to mix 2 different use cases into the same
reservation.

The kernel needs to know about the struct page reservation because it
needs to manage the lifetime of page references vs the lifetime of the
device. It does not have the same relationship with a Xen reservation
which is why I'm proposing they be managed separately.

Note that Toshi and Mike added DM for DAX. This enabling ends up
writing DM metadata on the device without adding new reservation
mechanisms to the nvdimm core. I'm struggling to see how the Xen use
case is materially different DM. In the end it's an application
specific metadata space.
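To give a feel for how much space a struct page style reservation takes, and why its size is tied to the device rather than to what happens to be running, a rough estimate follows. It assumes 4 KiB pages, 64 bytes of per-page metadata (a common figure for struct page on x86-64 Linux builds, but configuration dependent), and 2 MiB alignment. All three values and the function name are assumptions, not figures stated in this thread.

```python
def metadata_reservation(device_bytes: int,
                         page_size: int = 4096,
                         entry_bytes: int = 64,
                         align: int = 2 * 1024 * 1024) -> int:
    """Estimate the bytes needed to hold one metadata entry per device page."""
    # Number of pages the device covers, rounded up.
    pages = -(-device_bytes // page_size)
    raw = pages * entry_bytes
    # Round the reservation up to the assumed alignment.
    return -(-raw // align) * align


if __name__ == "__main__":
    one_tib = 1 << 40
    print(metadata_reservation(one_tib) // (1 << 30), "GiB")
```

Under these assumptions a 1 TiB device needs 16 GiB of per-page metadata, i.e. 1/64 of its capacity, regardless of whether the consumer is the Dom0 kernel or the hypervisor.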
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Wed, Oct 12, 2016 at 8:39 AM, Jan Beulich wrote:
> On 12.10.16 at 16:58, wrote:
>> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>>> On 12.10.16 at 12:33, wrote:
>>>> The layout is shown as the following diagram.
>>>>
>>>> +---------------+-----------+-------+----------+--------------+
>>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>>> |   by kernel   |   Table   | Block | for Xen  |              |
>>>> +---------------+-----------+-------+----------+--------------+
>>>>                 \______________________ ______________________/
>>>>                                        V
>>>>                                   /dev/pmem0
>>>
>>> I have to admit that I dislike this, for not being OS-agnostic.
>>> Neither should there be any Xen-specific region, nor should the
>>> "whatever used by kernel" one be restricted to just Linux. What
>>> I could see is an OS-reserved area ahead of the partition table,
>>> the exact usage of which depends on which OS is currently
>>> running (and in the Xen case this might be both Xen _and_ the
>>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>>> running under Xen, the Dom0 may not have a need for as much
>>> control data as it has when running on bare hardware, for it
>>> controlling less (if any) of the actual memory ranges when Xen
>>> is present.
>>
>> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
>> to know where the reserved area is? Or do you mean it's not if it's
>> defined by a protocol that is accepted by all OSes?
>
> The latter - we clearly won't get away without some agreement on
> where to retrieve position and size of this area. I was simply
> assuming that such a protocol already exists.

No, we should not mix the struct page reservation that the Dom0 kernel
may actively use with the Xen reservation that the Dom0 kernel does
not consume. Explain again what is wrong with the partition approach?
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 12.10.16 at 16:58, wrote:
> On 10/12/16 05:32 -0600, Jan Beulich wrote:
>> On 12.10.16 at 12:33, wrote:
>>> The layout is shown as the following diagram.
>>>
>>> +---------------+-----------+-------+----------+--------------+
>>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>>> |   by kernel   |   Table   | Block | for Xen  |              |
>>> +---------------+-----------+-------+----------+--------------+
>>>                 \______________________ ______________________/
>>>                                        V
>>>                                   /dev/pmem0
>>
>> I have to admit that I dislike this, for not being OS-agnostic.
>> Neither should there be any Xen-specific region, nor should the
>> "whatever used by kernel" one be restricted to just Linux. What
>> I could see is an OS-reserved area ahead of the partition table,
>> the exact usage of which depends on which OS is currently
>> running (and in the Xen case this might be both Xen _and_ the
>> Dom0 kernel, arbitrated by a tbd protocol). After all, when
>> running under Xen, the Dom0 may not have a need for as much
>> control data as it has when running on bare hardware, for it
>> controlling less (if any) of the actual memory ranges when Xen
>> is present.
>
> Isn't this OS-reserved area still not OS-agnostic, as it requires OS
> to know where the reserved area is? Or do you mean it's not if it's
> defined by a protocol that is accepted by all OSes?

The latter - we clearly won't get away without some agreement on
where to retrieve position and size of this area. I was simply
assuming that such a protocol already exists.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/12/16 05:32 -0600, Jan Beulich wrote:
> On 12.10.16 at 12:33, wrote:
>> The layout is shown as the following diagram.
>>
>> +---------------+-----------+-------+----------+--------------+
>> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
>> |   by kernel   |   Table   | Block | for Xen  |              |
>> +---------------+-----------+-------+----------+--------------+
>>                 \______________________ ______________________/
>>                                        V
>>                                   /dev/pmem0
>
> I have to admit that I dislike this, for not being OS-agnostic.
> Neither should there be any Xen-specific region, nor should the
> "whatever used by kernel" one be restricted to just Linux. What
> I could see is an OS-reserved area ahead of the partition table,
> the exact usage of which depends on which OS is currently
> running (and in the Xen case this might be both Xen _and_ the
> Dom0 kernel, arbitrated by a tbd protocol). After all, when
> running under Xen, the Dom0 may not have a need for as much
> control data as it has when running on bare hardware, for it
> controlling less (if any) of the actual memory ranges when Xen
> is present.

Isn't this OS-reserved area still not OS-agnostic, as it requires OS
to know where the reserved area is? Or do you mean it's not if it's
defined by a protocol that is accepted by all OSes?

Let me list two other methods that just came to mind.

1. The first method extends the usage of the super block used by the
   current Linux kernel to reserve space on pmem.

   The current Linux kernel places a super block of the following
   structure near the beginning of a pmem namespace.

       struct nd_pfn_sb {
               u8 signature[PFN_SIG_LEN];
               u8 uuid[16];
               u8 parent_uuid[16];
               __le32 flags;
               __le16 version_major;
               __le16 version_minor;
               __le64 dataoff; /* relative to namespace_base + start_pad */
               __le64 npfns;
               __le32 mode;
               /* minor-version-1 additions for section alignment */
               __le32 start_pad;
               __le32 end_trunc;
               /* minor-version-2 record the base alignment of the mapping */
               __le32 align;
               u8 padding[4000];
               __le64 checksum;
       };

   Two interesting fields here are 'dataoff' and 'mode':
   - 'dataoff' indicates the offset where the data area starts, i.e.
     IIUC, the part that can be accessed via /dev/pmemN or /dev/daxN.
   - 'mode' indicates whether Linux puts struct page for this namespace
     in the RAM (= PFN_MODE_RAM) or on the device (= PFN_MODE_PMEM).

   Currently for Linux, only 'mode' is customizable, while 'dataoff' is
   not. If mode == PFN_MODE_RAM, no reservation for struct page is made
   on the device, and dataoff starts almost immediately after the super
   block, except for a small reserved area in between for other
   structures and alignment. If mode == PFN_MODE_PMEM, the size of the
   reservation is decided by the kernel, i.e. 64 bytes per struct page.

   I propose to make the size of the reserved area customizable,
   e.g. via ioctl and ndctl.
   - If mode == PFN_MODE_PMEM and
     * if the given reserved size is large enough to hold what an OS
       (not limited to Linux) wants to put in, then the OS just starts
       using it as desired;
     * if the given reserved size is not enough, then the OS reports an
       error and may take other fallback actions.
   - If mode == PFN_MODE_RAM and
     * if the reserved size is zero, then it's the current way that
       Linux uses the device;
     * if the reserved size is non-zero, I would like to reserve this
       case for hypervisor (right now, namely Xen hypervisor) usage.
       That is, the OS should not use the reserved area. For Xen, we
       could add a function in the xen driver in the kernel to report
       the reserved area to the hypervisor.

   I guess this might be the OS-agnostic way Jan expects, but Dan may
   object to it.

2. Lay another pseudo device on the block device (e.g. /dev/pmemN)
   provided by the NVDIMM driver.

   This pseudo device can reserve the size according to the user's
   requirement. The reservation information can be persistently
   recorded in a super block before the reserved area.

   This pseudo device also implements another pseudo block device to
   allow the non-reserved area to be accessed as a block device (we can
   even implement it as DAX-capable).

                                                pseudo block device
                                              /----------^----------\
   +------------------+-------+---------------+---------------------+
   | whatever used    | Super | reserved by   |                     |
   | by NVDIMM driver | Block | pseudo device |                     |
   +------------------+-------+---------------+---------------------+
                      \______________________ ______________________/
                                             V
                                        /dev/pmem0
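
The four cases of the first method above can be written down as a small decision function. This is only a sketch under assumptions: every `sketch_` / `SKETCH_` name and the enum values are made up for illustration and are not libnvdimm, ndctl, or Xen interfaces. Only the four cases and the figure of 64 bytes per struct page come from the mail.

```c
#include <stdint.h>

/* Hypothetical stand-ins for PFN_MODE_RAM / PFN_MODE_PMEM; the numeric
 * values are illustrative and not taken from the kernel headers. */
enum sketch_mode { SKETCH_MODE_RAM, SKETCH_MODE_PMEM };

/* Who may use the reserved area, following the four cases in the mail. */
enum sketch_owner {
    SKETCH_OWNER_NONE,       /* RAM mode, nothing reserved: current behaviour */
    SKETCH_OWNER_OS,         /* PMEM mode, area large enough for the OS */
    SKETCH_OWNER_HYPERVISOR, /* RAM mode, non-zero area: OS must not use it */
    SKETCH_OWNER_ERROR       /* PMEM mode, area too small: report, fall back */
};

/* Figure quoted in the mail: 64 bytes per struct page on Linux. */
#define SKETCH_STRUCT_PAGE_BYTES 64u

/* Bytes an OS with 64-byte page descriptors needs for npfns page frames. */
static uint64_t sketch_os_bytes_needed(uint64_t npfns)
{
    return npfns * SKETCH_STRUCT_PAGE_BYTES;
}

/* Map (mode, reserved size) to the party allowed to use the reservation. */
static enum sketch_owner sketch_classify(enum sketch_mode mode,
                                         uint64_t reserved_bytes,
                                         uint64_t os_bytes_needed)
{
    if (mode == SKETCH_MODE_PMEM)
        return reserved_bytes >= os_bytes_needed ? SKETCH_OWNER_OS
                                                 : SKETCH_OWNER_ERROR;
    return reserved_bytes == 0 ? SKETCH_OWNER_NONE : SKETCH_OWNER_HYPERVISOR;
}
```

As a rough size check, a 1 GiB namespace with 4 KiB pages has 262144 page frames, so an OS with 64-byte descriptors would need 16 MiB of reservation. The 4 KiB page size is an assumption of this example.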
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 12.10.16 at 12:33, wrote:
> The layout is shown as the following diagram.
>
> +---------------+-----------+-------+----------+--------------+
> | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
> |   by kernel   |   Table   | Block | for Xen  |              |
> +---------------+-----------+-------+----------+--------------+
>                 \______________________ ______________________/
>                                        V
>                                   /dev/pmem0

I have to admit that I dislike this, for not being OS-agnostic.
Neither should there be any Xen-specific region, nor should the
"whatever used by kernel" one be restricted to just Linux. What
I could see is an OS-reserved area ahead of the partition table,
the exact usage of which depends on which OS is currently
running (and in the Xen case this might be both Xen _and_ the
Dom0 kernel, arbitrated by a tbd protocol). After all, when
running under Xen, the Dom0 may not have a need for as much
control data as it has when running on bare hardware, for it
controlling less (if any) of the actual memory ranges when Xen
is present.

The assumption of course is that the reserved area holds no
persistent data. If that assumption didn't hold, you'd have to
have per-OS reserved areas anyway (as many of them as there
might be OSes [planned to get] installed on a particular system).

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/11/16 13:17 -0700, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
>>> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote:
>>>> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
>>> [..]
>>>>> Right, but why does the libnvdimm core need to know about this
>>>>> specific Xen reservation? For example, if Xen wants some in-kernel
>>>>
>>>> Let me turn this around - why does the libnvdimm core need to know about
>>>> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
>>>> for example can also poke a hole in this and fill it with its
>>>> OS-management meta-data?
>>>
>>> Specifically the core needs to know so that it can answer the Linux
>>> specific question of whether the pfn returned by ->direct_access() has
>>> a corresponding struct page or not. It's tied to the lifetime of the
>>> device and the usage of the reservation needs to be coordinated
>>> against the references of those pages. If FreeBSD decides it needs to
>>> reserve "struct page" capacity at the start of the device, I would
>>> hope that it reuses the same on-device info block that Linux is using
>>> and not create a new "FreeBSD-mode" device type.
>>
>> The issue here (as I understand, I may be missing something new)
>> is that the size of this special namespace may be different. That is
>> the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
>> 64 bytes (numbers pulled out of the sky).
>>
>> Hence one would have to expand or such to re-use this.
>
> Sure, but we could support that today. If FreeBSD lays down the info
> block it is free to make a bigger reservation and Linux would be happy
> to use a smaller subset. If we, as an industry, want this "struct
> page" reservation to be common we can take it to a standards body to
> make as a cross-OS guarantee... but I think this is separate from the
> Xen reservation.
>
>>> To be honest I do not yet understand what metadata Xen wants to store
>>> in the device, but it seems the producer and consumer of that metadata
>>> is Xen itself and not the wider Linux kernel as is the case with
>>> struct page. Can you fill me in on what problem Xen solves with this
>>
>> Exactly!
>>
>>> reservation?
>>
>> The same as Linux - its variant of 'struct page'. Which I think is
>> smaller than the Linux one, but perhaps it is not?
>
> If the hypervisor needs to know where it can store some metadata, can
> that be satisfied with userspace tooling in Dom0? Something like,
> "/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem
> with files to hand to guests". So my question is not about the
> rationale for having metadata, it's why does the Linux kernel need to
> know about the Xen reservation? As far as I can see it is independent
> / opaque to the kernel.

Thanks, everyone, for all these comments!

How about doing the reservation in the following way:

1. Create partition(s) on /dev/pmemX and make sure the space before
   the first partition, beyond the partition table and potential
   padding, is large enough to hold Xen's management structures and a
   super block introduced in step 2. The space before the first
   partition that is not taken by the partition table, the padding, or
   the super block will be used as the reserved area.

2. Write a super block before the above reserved area. The super block
   records the base address and the size of the reserved area. It also
   contains a signature and a checksum to identify itself.

   The layout is shown as the following diagram.

   +---------------+-----------+-------+----------+--------------+
   | whatever used | Partition | Super | Reserved | /dev/pmem0p1 |
   |   by kernel   |   Table   | Block | for Xen  |              |
   +---------------+-----------+-------+----------+--------------+
                   \______________________ ______________________/
                                          V
                                     /dev/pmem0

   The above two steps can be done via a userspace program and do not
   need the Xen hypervisor running. The partitions on the device can
   be used regardless of the existence of the Xen hypervisor.

3. When Xen is running, implement a function in the Dom0 Linux xen
   driver (drivers/xen/) to respond to udevd events that notify the
   detection of the pmem regions. This function searches the pmem
   region for the super block created in step 2. If one is found, it
   will know this pmem region has been prepared for Xen usage. Then it
   gets the base address and size of the reserved area (from the super
   block) and the entire address ranges of the pmem region (from the
   pmem driver), and reports them to the Xen hypervisor.

   The implementation of this step can be completely included in the
   kernel Xen driver. (It may also be implemented as a udevd service
   in userspace, if that is not considered unsafe.)

Thanks,
Haozhong
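
Steps 2 and 3 above hinge on a super block that can be recognised by its signature and checksum before its base and size are trusted. The following is a minimal sketch of that check, not the proposed on-media format: the struct layout, the signature string, the `sketch_` names, the host-endian fields, and the simple multiplicative checksum are all assumptions made for illustration. The mail only says that the super block holds a base address, a size, a signature, and a checksum.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical on-media layout; none of these names exist in Xen or Linux. */
#define SKETCH_SIG_LEN 16
static const uint8_t sketch_sig[SKETCH_SIG_LEN] = "XEN_PMEM_RSV";

struct sketch_rsv_sb {
    uint8_t  signature[SKETCH_SIG_LEN];
    uint64_t rsv_base;  /* start of reserved area, as offset into the region */
    uint64_t rsv_size;  /* size of reserved area in bytes */
    uint64_t checksum;  /* computed with this field set to zero */
};

/* Placeholder checksum over the whole block with the checksum field zeroed. */
static uint64_t sketch_checksum(const struct sketch_rsv_sb *sb)
{
    struct sketch_rsv_sb tmp;
    const uint8_t *p = (const uint8_t *)&tmp;
    uint64_t sum = 0;

    memcpy(&tmp, sb, sizeof(tmp));
    tmp.checksum = 0;
    for (size_t i = 0; i < sizeof(tmp); i++)
        sum = sum * 31 + p[i];
    return sum;
}

/* What a userspace tool would do in step 2. */
static void sketch_sb_init(struct sketch_rsv_sb *sb, uint64_t base, uint64_t size)
{
    memset(sb, 0, sizeof(*sb));
    memcpy(sb->signature, sketch_sig, SKETCH_SIG_LEN);
    sb->rsv_base = base;
    sb->rsv_size = size;
    sb->checksum = sketch_checksum(sb);
}

/* What the Dom0 side would check in step 3 before reporting anything. */
static bool sketch_sb_valid(const struct sketch_rsv_sb *sb, uint64_t region_size)
{
    if (memcmp(sb->signature, sketch_sig, SKETCH_SIG_LEN) != 0)
        return false;
    if (sb->checksum != sketch_checksum(sb))
        return false;
    /* The reserved area must be non-empty and lie inside the pmem region. */
    return sb->rsv_size != 0 &&
           sb->rsv_base <= region_size &&
           sb->rsv_size <= region_size - sb->rsv_base;
}
```

Only a region for which `sketch_sb_valid()` succeeds would have its reserved range reported to the hypervisor; a region without a valid super block is left untouched.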
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> On 11.10.16 at 17:53, wrote:
> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
>> Andrew Cooper 10/10/16 6:44 PM >>>
>>> On 10/10/16 01:35, Haozhong Zhang wrote:
>>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
>>>> tasks:
>>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>>    memory management data structures, i.e. frame table and M2P table.
>>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>>    hypervisor.
>>>
>>> However, I can't see any justification for 1). Dom0 should not be
>>> involved in Xen's management of its own frame table and m2p. The mfns
>>> making up the pmem/pblk regions should be treated just like any other
>>> MMIO regions, and be handed wholesale to dom0 by default.
>>
>> That precludes the use as RAM extension, and I thought earlier rounds of
>> discussion had got everyone in agreement that at least for the pmem case
>> we will need some control data in Xen.
>
> The missing piece for me is why this reservation for control data
> needs to be done in the libnvdimm core? I would expect that any dax
> capable file could be mapped and made available to a guest. This
> includes /dev/ramX devices that are dax capable, but are external to
> the libnvdimm sub-system.

Despite me being the only one on the To list, I don't think the question
was really meant to be directed to me.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 11/10/16 20:48, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote:
>>> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
>> [..]
>>>> Right, but why does the libnvdimm core need to know about this
>>>> specific Xen reservation? For example, if Xen wants some in-kernel
>>> Let me turn this around - why does the libnvdimm core need to know about
>>> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
>>> for example can also poke a hole in this and fill it with its
>>> OS-management meta-data?
>> Specifically the core needs to know so that it can answer the Linux
>> specific question of whether the pfn returned by ->direct_access() has
>> a corresponding struct page or not. It's tied to the lifetime of the
>> device and the usage of the reservation needs to be coordinated
>> against the references of those pages. If FreeBSD decides it needs to
>> reserve "struct page" capacity at the start of the device, I would
>> hope that it reuses the same on-device info block that Linux is using
>> and not create a new "FreeBSD-mode" device type.
> The issue here (as I understand, I may be missing something new)
> is that the size of this special namespace may be different. That is
> the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
> 64 bytes (numbers pulled out of the sky).
>
> Hence one would have to expand or such to re-use this.
>> To be honest I do not yet understand what metadata Xen wants to store
>> in the device, but it seems the producer and consumer of that metadata
>> is Xen itself and not the wider Linux kernel as is the case with
>> struct page. Can you fill me in on what problem Xen solves with this
> Exactly!
>> reservation?
> The same as Linux - its variant of 'struct page'. Which I think is
> smaller than the Linux one, but perhaps it is not?

There is still a bootstrapping issue though, which looks (in its current form) to cause data corruption. I hope I am mistaken, and apologies if I am, but clearly we cannot build a solution that has data corruption in anything other than an exceptional circumstance.

So far, the sequence of boot operations appears to look like this:

Xen boots, and may find some NVDIMM SPA/MFN ranges via the NFIT table. Any ranges available only from AML need dynamically reporting back to Xen at a later point, once OSPM is up and running.

The NVDIMMs must be mappable by dom0 so the contents can be inspected and deemed to be safe by the nvdimm driver/host admin, before Xen starts writing to any of it (for whatever reason). If this isn't the case, then simply booting a Xen/dom0 combo will end up corrupting a region before working out that it is safe to do so.

~Andrew
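The ordering constraint in the message above (dom0 must be able to map and inspect a region before the hypervisor writes to it) can be illustrated with a small toy model. This is only a sketch: `PmemRegion`, `RegionState` and the method names are hypothetical and do not correspond to anything in Xen or Linux.

```python
from enum import Enum, auto


class RegionState(Enum):
    DISCOVERED = auto()      # range is known (e.g. from NFIT, or reported by dom0)
    MAPPED_TO_DOM0 = auto()  # dom0 can read the contents
    VETTED = auto()          # driver/admin has declared an area safe to use


class PmemRegion:
    """Toy model of one NVDIMM range; not a real Xen data structure."""

    def __init__(self, start_mfn: int, nr_mfns: int) -> None:
        self.start_mfn = start_mfn
        self.nr_mfns = nr_mfns
        self.state = RegionState.DISCOVERED

    def map_to_dom0(self) -> None:
        self.state = RegionState.MAPPED_TO_DOM0

    def vet(self) -> None:
        # Inspection is only possible once dom0 has a mapping.
        if self.state is not RegionState.MAPPED_TO_DOM0:
            raise RuntimeError("dom0 cannot inspect a region it has not mapped")
        self.state = RegionState.VETTED

    def hypervisor_write(self) -> str:
        # Writing earlier than this is the corruption case described above.
        if self.state is not RegionState.VETTED:
            raise PermissionError("refusing to write before the region is vetted")
        return "write allowed"
```

In this model a write attempted straight after discovery raises an error, and only succeeds after `map_to_dom0()` followed by `vet()`.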
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 12:48 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote:
>> > On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
>> [..]
>> >> Right, but why does the libnvdimm core need to know about this
>> >> specific Xen reservation? For example, if Xen wants some in-kernel
>> >
>> > Let me turn this around - why does the libnvdimm core need to know about
>> > Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
>> > for example can also poke a hole in this and fill it with its
>> > OS-management meta-data?
>>
>> Specifically the core needs to know so that it can answer the Linux
>> specific question of whether the pfn returned by ->direct_access() has
>> a corresponding struct page or not. It's tied to the lifetime of the
>> device and the usage of the reservation needs to be coordinated
>> against the references of those pages. If FreeBSD decides it needs to
>> reserve "struct page" capacity at the start of the device, I would
>> hope that it reuses the same on-device info block that Linux is using
>> and not create a new "FreeBSD-mode" device type.
>
> The issue here (as I understand, I may be missing something new)
> is that the size of this special namespace may be different. That is
> the 'struct page' on FreeBSD could be 256 bytes while on Linux it is
> 64 bytes (numbers pulled out of the sky).
>
> Hence one would have to expand or such to re-use this.

Sure, but we could support that today. If FreeBSD lays down the info block it is free to make a bigger reservation and Linux would be happy to use a smaller subset. If we, as an industry, want this "struct page" reservation to be common we can take it to a standards body to make it a cross-OS guarantee... but I think this is separate from the Xen reservation.

>> To be honest I do not yet understand what metadata Xen wants to store
>> in the device, but it seems the producer and consumer of that metadata
>> is Xen itself and not the wider Linux kernel as is the case with
>> struct page. Can you fill me in on what problem Xen solves with this
>
> Exactly!
>> reservation?
>
> The same as Linux - its variant of 'struct page'. Which I think is
> smaller than the Linux one, but perhaps it is not?
>

If the hypervisor needs to know where it can store some metadata, can that be satisfied with userspace tooling in Dom0? Something like, "/dev/pmem0p1 == Xen metadata" and "/dev/pmem0p2 == DAX filesystem with files to hand to guests". So my question is not about the rationale for having metadata, it's why does the Linux kernel need to know about the Xen reservation? As far as I can see it is independent / opaque to the kernel.
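The sizing point in this message (one OS lays down a larger reservation, another uses a smaller subset of it) can be made concrete with a little arithmetic. The sketch below assumes 4 KiB pages and one metadata entry per page, and reuses the illustrative 64-byte and 256-byte entry sizes from the thread, which were explicitly "pulled out of the sky". The function name is hypothetical, and the calculation ignores the fact that the reservation itself occupies part of the device.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


def reservation_bytes(device_bytes: int, per_page_bytes: int,
                      page_size: int = PAGE_SIZE) -> int:
    """Bytes needed to hold one metadata entry per page of the device."""
    nr_pages = device_bytes // page_size
    return nr_pages * per_page_bytes


TIB = 1 << 40
GIB = 1 << 30

small = reservation_bytes(TIB, 64)    # 64-byte entries (illustrative)
large = reservation_bytes(TIB, 256)   # 256-byte entries (illustrative)

print(small // GIB, "GiB vs", large // GIB, "GiB")
```

Under these assumptions a 1 TiB device needs 16 GiB for 64-byte entries and 64 GiB for 256-byte entries, so a reservation sized for the larger entry is big enough for a consumer that only needs the smaller one.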
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 12:28:56PM -0700, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote:
> > On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
> [..]
> >> Right, but why does the libnvdimm core need to know about this
> >> specific Xen reservation? For example, if Xen wants some in-kernel
> >
> > Let me turn this around - why does the libnvdimm core need to know about
> > Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> > for example can also poke a hole in this and fill it with its
> > OS-management meta-data?
>
> Specifically the core needs to know so that it can answer the Linux
> specific question of whether the pfn returned by ->direct_access() has
> a corresponding struct page or not. It's tied to the lifetime of the
> device and the usage of the reservation needs to be coordinated
> against the references of those pages. If FreeBSD decides it needs to
> reserve "struct page" capacity at the start of the device, I would
> hope that it reuses the same on-device info block that Linux is using
> and not create a new "FreeBSD-mode" device type.

The issue here (as I understand, I may be missing something new) is that the size of this special namespace may be different. That is, the 'struct page' on FreeBSD could be 256 bytes while on Linux it is 64 bytes (numbers pulled out of the sky).

Hence one would have to expand or such to re-use this.

>
> To be honest I do not yet understand what metadata Xen wants to store
> in the device, but it seems the producer and consumer of that metadata
> is Xen itself and not the wider Linux kernel as is the case with
> struct page. Can you fill me in on what problem Xen solves with this

Exactly!

> reservation?

The same as Linux - its variant of 'struct page'. Which I think is smaller than the Linux one, but perhaps it is not?
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
> Andrew, why are you providing input to this so late?

First off, sorry for this outburst. It was quite uncalled for and quite unprofessional. You of all people have so much on your plate that I am astonished that you are able to operate with so many pokers in the fire. Again, my sincere apology.
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 11:33 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 11, 2016 at 10:51:19AM -0700, Dan Williams wrote:
[..]
>> Right, but why does the libnvdimm core need to know about this
>> specific Xen reservation? For example, if Xen wants some in-kernel
>
> Let me turn this around - why does the libnvdimm core need to know about
> Linux specific parts? Shouldn't this be OS agnostic, so that FreeBSD
> for example can also poke a hole in this and fill it with its
> OS-management meta-data?

Specifically the core needs to know so that it can answer the Linux specific question of whether the pfn returned by ->direct_access() has a corresponding struct page or not. It's tied to the lifetime of the device and the usage of the reservation needs to be coordinated against the references of those pages. If FreeBSD decides it needs to reserve "struct page" capacity at the start of the device, I would hope that it reuses the same on-device info block that Linux is using and not create a new "FreeBSD-mode" device type.

To be honest I do not yet understand what metadata Xen wants to store in the device, but it seems the producer and consumer of that metadata is Xen itself and not the wider Linux kernel as is the case with struct page. Can you fill me in on what problem Xen solves with this reservation?
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 11/10/16 06:52, Haozhong Zhang wrote:
> On 10/10/16 17:43, Andrew Cooper wrote:
>> On 10/10/16 01:35, Haozhong Zhang wrote:
>>> Overview
>>>
>>> This RFC kernel patch series along with corresponding patch series of
>>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
>>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
>>>
>>> Xen hypervisor does not include an NVDIMM driver, so it needs the
>>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
>>> devices. We currently only support NVDIMM devices in pmem mode.
>>>
>>> Design and Implementation
>>> =
>>> The complete design can be found at
>>>
>>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
>>>
>>> All patch series can be found at
>>> Xen: https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
>>> QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
>>> Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
>>> ndctl: https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
>>>
>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>    memory management data structures, i.e. frame table and M2P table.
>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>    hypervisor.
>> Please can we take a step back here before diving down a rabbit hole.
>>
>>
>> How do pblk/pmem regions appear in the E820 map at boot? At the very
>> least, I would expect at least a large reserved region.
> ACPI specification does not require them to appear in E820, though
> it defines E820 type-7 for persistent memory.

Ok, so we might get some E820 type-7 ranges, or some holes.

>
>> Is the MFN information (SPA in your terminology, so far as I can tell)
>> available in any static ACPI tables, or are they only available as a
>> result of executing AML methods?
>>
> For NVDIMM devices already plugged at power on, their MFN information
> can be got from NFIT table. However, MFN information for hotplugged
> NVDIMM devices should be got via AML _FIT method, so point 2) is needed.

How does NVDIMM hotplug compare to RAM hotplug? Are the hotplug regions described at boot and marked as initially not present, or do you only know the hotplugged SPA at the point that it is hotplugged?

I certainly agree that there needs to be a propagation of the hotplug notification from OSPM to Xen, which will involve some glue in the Xen subsystem in Linux, but I would expect that this would be similar to the existing plain RAM hotplug mechanism.

>
>> If the MFN information is only available via AML, then point 2) is
>> needed, although the reporting back to Xen should be restricted to a xen
>> component, rather than polluting the main device driver.
>>
>> However, I can't see any justification for 1). Dom0 should not be
>> involved in Xen's management of its own frame table and m2p. The mfns
>> making up the pmem/pblk regions should be treated just like any other
>> MMIO regions, and be handed wholesale to dom0 by default.
>>
> Do you mean to treat them as mmio pages of type p2m_mmio_direct and
> map them to guest by map_mmio_regions()?

I don't see any reason why it shouldn't be treated like this. Xen shouldn't be treating it as anything other than an opaque block of MFNs.

The concept of trying to map a DAX file into the guest physical address space of a VM is indeed new and doesn't fit into Xen's current model, but all that fixing this requires is a new privileged mapping hypercall which takes a source domid and gfn scatter list, and a destination domid and scatter list. (I see from a quick look at your Xen series that your XENMEM_populate_pmemmap looks roughly like this)

~Andrew
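The shape of the mapping operation described at the end of this message (a source domid with a gfn list, and a destination domid with a gfn list) can be sketched as a toy model. Everything below is hypothetical: `MapRequest`, `apply_mapping` and the dictionary-based p2m are illustrations only, and they are not the argument layout of XENMEM_populate_pmemmap or of any real hypercall.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MapRequest:
    """Hypothetical argument layout; not a real hypercall ABI."""
    src_domid: int
    src_gfns: List[int]
    dst_domid: int
    dst_gfns: List[int]


def apply_mapping(p2m: Dict[Tuple[int, int], int], req: MapRequest) -> None:
    """Point each destination gfn at the frame behind the matching source gfn.

    The p2m is modelled as a dict keyed by (domid, gfn) with the mfn as value.
    """
    if len(req.src_gfns) != len(req.dst_gfns):
        raise ValueError("source and destination lists must have equal length")
    # Resolve every source first so a missing entry leaves the table untouched.
    frames = [p2m[(req.src_domid, gfn)] for gfn in req.src_gfns]
    for dst_gfn, mfn in zip(req.dst_gfns, frames):
        p2m[(req.dst_domid, dst_gfn)] = mfn
```

For example, with dom0 (domid 0) holding gfns 10 and 11, a request naming destination gfns 100 and 101 in domid 1 leaves both domains pointing at the same two frames.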
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 07:37:09PM +0100, Andrew Cooper wrote:
> On 11/10/16 06:52, Haozhong Zhang wrote:
> > On 10/10/16 17:43, Andrew Cooper wrote:
> >> On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Overview
> >>>
> >>> This RFC kernel patch series along with corresponding patch series of
> >>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> >>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >>>
> >>> Xen hypervisor does not include an NVDIMM driver, so it needs the
> >>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> >>> devices. We currently only support NVDIMM devices in pmem mode.
> >>>
> >>> Design and Implementation
> >>> =
> >>> The complete design can be found at
> >>>
> >>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >>>
> >>> All patch series can be found at
> >>> Xen: https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >>> QEMU: https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >>> Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >>> ndctl: https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >>>
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>    memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>    hypervisor.
> >> Please can we take a step back here before diving down a rabbit hole.
> >>
> >>
> >> How do pblk/pmem regions appear in the E820 map at boot? At the very
> >> least, I would expect at least a large reserved region.
> > ACPI specification does not require them to appear in E820, though
> > it defines E820 type-7 for persistent memory.
>
> Ok, so we might get some E820 type-7 ranges, or some holes.
>
> >
> >> Is the MFN information (SPA in your terminology, so far as I can tell)
> >> available in any static ACPI tables, or are they only available as a
> >> result of executing AML methods?
> >>
> > For NVDIMM devices already plugged at power on, their MFN information
> > can be got from NFIT table. However, MFN information for hotplugged
> > NVDIMM devices should be got via AML _FIT method, so point 2) is needed.
>
> How does NVDIMM hotplug compare to RAM hotplug? Are the hotplug regions
> described at boot and marked as initially not present, or do you only
> know the hotplugged SPA at the point that it is hotplugged?

The latter. You have no idea of the size until you get an ACPI hotplug. The ACPI hotplug contains the NFIT MADT table so based on that you can populate the machine.

>
> I certainly agree that there needs to be a propagation of the hotplug
> notification from OSPM to Xen, which will involve some glue in the Xen
> subsystem in Linux, but I would expect that this would be similar to the
> existing plain RAM hotplug mechanism.

I am actually not sure how the ACPI RAM hotplug mechanism is supposed to work in practice. I thought that the regions (E820) are marked as reserved and the 'RAM' slots nicely in there.
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 07:37:09PM +0100, Andrew Cooper wrote:
> On 11/10/16 06:52, Haozhong Zhang wrote:
> > On 10/10/16 17:43, Andrew Cooper wrote:
> >> On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Overview
> >>>
> >>> This RFC kernel patch series along with corresponding patch series of
> >>> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> >>> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >>>
> >>> Xen hypervisor does not include an NVDIMM driver, so it needs the
> >>> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> >>> devices. We currently only support NVDIMM devices in pmem mode.
> >>>
> >>> Design and Implementation
> >>> =========================
> >>> The complete design can be found at
> >>>
> >>> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >>>
> >>> All patch series can be found at
> >>>   Xen:          https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> >>>   QEMU:         https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> >>>   Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> >>>   ndctl:        https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >>>
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>    memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>    hypervisor.
> >> Please can we take a step back here before diving down a rabbit hole.
> >>
> >>
> >> How do pblk/pmem regions appear in the E820 map at boot? At the very
> >> least, I would expect at least a large reserved region.
> > ACPI specification does not require them to appear in E820, though
> > it defines E820 type-7 for persistent memory.
>
> Ok, so we might get some E820 type-7 ranges, or some holes.
>
> >
> >> Is the MFN information (SPA in your terminology, so far as I can tell)
> >> available in any static ACPI tables, or are they only available as a
> >> result of executing AML methods?
> >>
> > For NVDIMM devices already plugged at power on, their MFN information
> > can be got from NFIT table. However, MFN information for hotplugged
> > NVDIMM devices should be got via AML _FIT method, so point 2) is needed.
>
> How does NVDIMM hotplug compare to RAM hotplug? Are the hotplug regions
> described at boot and marked as initially not present, or do you only
> know the hotplugged SPA at the point that it is hotplugged?
>
> I certainly agree that there needs to be a propagation of the hotplug
> notification from OSPM to Xen, which will involve some glue in the Xen
> subsystem in Linux, but I would expect that this would be similar to the
> existing plain RAM hotplug mechanism.
>
> >
> >> If the MFN information is only available via AML, then point 2) is
> >> needed, although the reporting back to Xen should be restricted to a xen
> >> component, rather than polluting the main device driver.
> >>
> >> However, I can't see any justification for 1). Dom0 should not be
> >> involved in Xen's management of its own frame table and m2p. The mfns
> >> making up the pmem/pblk regions should be treated just like any other
> >> MMIO regions, and be handed wholesale to dom0 by default.
> >>
> > Do you mean to treat them as mmio pages of type p2m_mmio_direct and
> > map them to guest by map_mmio_regions()?
>
> I don't see any reason why it shouldn't be treated like this. Xen
> shouldn't be treating it as anything other than an opaque block of MFNs.
>
> The concept of trying to map a DAX file into the guest physical address
> space of a VM is indeed new and doesn't fit into Xen's current model,
> but all that fixing this requires is a new privileged mapping hypercall
> which takes a source domid and gfn scatter list, and a destination domid
> and scatter list. (I see from a quick look at your Xen series that your
> XENMEM_populate_pmemmap looks roughly like this)

That can be quite big. Say you want to map a DAX file that has a size of
1TB and this GFN scatter list has 1073741824 entries? How do you envision
handling this in Xen and populating the P2M entries with this information?

>
> ~Andrew
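To put rough numbers on the scatter-list concern above, here is a back-of-the-envelope sketch. It is not taken from the patch series: the 4 KiB page size and the 8-byte entry width are assumptions, and the helper names are made up for illustration. Under those assumptions a fully fragmented 1 TiB file needs 2^28 entries (about 2 GiB per list); the 1073741824 figure quoted in the mail would correspond to 1 KiB granularity instead.

```python
PAGE_SIZE = 4096      # assumption: x86 base page size (4 KiB)
ENTRY_BYTES = 8       # assumption: one 64-bit frame number per list entry
TIB = 1 << 40


def scatter_list_entries(file_bytes, page_size=PAGE_SIZE):
    """Worst case: every page is its own extent, so one entry per page."""
    return -(-file_bytes // page_size)  # ceiling division


def scatter_list_bytes(file_bytes, page_size=PAGE_SIZE, entry_bytes=ENTRY_BYTES):
    """Memory needed to hold one such list."""
    return scatter_list_entries(file_bytes, page_size) * entry_bytes


if __name__ == "__main__":
    print(scatter_list_entries(TIB))                  # entries with 4 KiB pages
    print(scatter_list_bytes(TIB) // (1 << 30))       # GiB for a single list
    print(scatter_list_entries(TIB, page_size=1024))  # entries at 1 KiB granularity
```

Since the hypercall described above takes both a source and a destination list, the worst-case argument size would be twice the single-list figure, which is why some form of batching or extent encoding would likely be needed.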
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 07:15:42PM +0100, Andrew Cooper wrote:
> On 11/10/16 18:51, Dan Williams wrote:
> > On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk wrote:
> >> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
> >>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
> >>>> >>> Andrew Cooper 10/10/16 6:44 PM >>>
> >>>>> On 10/10/16 01:35, Haozhong Zhang wrote:
> >>>>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
> >>>>>> tasks:
> >>>>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>>>>    memory management data structures, i.e. frame table and M2P table.
> >>>>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>>>>    hypervisor.
> >>>>> However, I can't see any justification for 1). Dom0 should not be
> >>>>> involved in Xen's management of its own frame table and m2p. The mfns
> >>>>> making up the pmem/pblk regions should be treated just like any other
> >>>>> MMIO regions, and be handed wholesale to dom0 by default.
> >>>> That precludes the use as RAM extension, and I thought earlier rounds of
> >>>> discussion had got everyone in agreement that at least for the pmem case
> >>>> we will need some control data in Xen.
> >>> The missing piece for me is why this reservation for control data
> >>> needs to be done in the libnvdimm core? I would expect that any dax
> >> Isn't it done this way with Linux? That is say if the machine has
> >> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
> >> for the NVDIMM ranges somewhere. That place can be in regions on the
> >> NVDIMM that ndctl can reserve.
> > Yes.
>
> I do not see any sensible usecase for Xen to use NVDIMMs as plain RAM;

I just gave you one. This is the 'usecase' that Linux has to deal with
now that the core kernel folks have pointed out that they don't want
'struct page' for the MMIO regions. This mechanism came about because of
that, and because of the need to find a place _somewhere_ for the
'struct page' covering the SPA ranges of the NVDIMM.

> NVDIMMs are far more valuable for higher level management in dom0.

Andrew, why are you providing input to this so late? Haozhong provided
a nice design document outlining the problem and the solution he suggested.

>
> I certainly think that such a usecase should be out-of-scope for initial
> Xen/NVDIMM support, even if only to reduce the complexity to start with.
>
> A repeated complaint I have of large feature submissions like this is
> that, by trying to solve all potential usecases at once, they end up being
> overly complicated to develop, understand and review.

On the other hand - if you don't take these complicated issues into
account from the start, then you may have to redesign and redevelop this
after the first version has been set in stone and committed.

>
> >
> >>> capable file could be mapped and made available to a guest. This
> >>> includes /dev/ramX devices that are dax capable, but are external to
> >>> the libnvdimm sub-system.
> >> This is more of just keeping track of the ranges if say the DAX file is
> >> extremely fragmented and requires a lot of 'struct pages' to keep track of
> >> when stitching up the VMA.
> > Right, but why does the libnvdimm core need to know about this
> > specific Xen reservation? For example, if Xen wants some in-kernel
> > driver to own a pmem region and place its own metadata on the device I
> > would recommend something like:
> >
> >     bdev = blkdev_get_by_path("/dev/pmemX", FMODE_EXCL...);
> >     bdev_direct_access(bdev, ...);
> >
> > ...in other words, I don't think we want libnvdimm to grow new device
> > types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
> > just claim the resulting device.
>
> I completely agree.
>
> Whatever ends up happening between Xen and dom0, there should be no
> modifications like this to the nvdimm driver. I will go so far as to
> say that there shouldn't be any modifications to the nvdimm driver
> (other than perhaps new query hooks so the Xen subsystem in Linux can
> query information to then pass up to Xen, if the existing queryability
> is insufficient).

Haozhong and Jan had been chatting about this in terms of how to keep
track of a guest having non-contiguous SPAs of NVDIMM stitched to it.
The initial idea was to treat it as MMIO, but of course if you have
1 page ranges over say 1TB you end up consuming tons of memory to keep
track of this (the same way Linux would if you wanted to mmap a file from
a DAX fs). Another solution was a bitmap, but that can also be cumbersome
to deal with.

In the end the suggestion that was proposed was the one that Linux
chose - stash the 'struct page' in the NVDIMM.
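The memory cost behind "stash the 'struct page' in the NVDIMM" can be sketched numerically. This is only an illustration under stated assumptions, not figures from the patch series: it assumes 4 KiB pages, 64 bytes per Linux 'struct page', and for Xen 32 bytes per frame-table entry plus 8 bytes per M2P entry (typical x86-64 values; check the tree you are working with). The function and constant names are hypothetical.

```python
PAGE_SIZE = 4096               # assumption: 4 KiB pages
LINUX_STRUCT_PAGE = 64         # assumption: bytes per Linux 'struct page'
XEN_FRAME_PLUS_M2P = 32 + 8    # assumption: Xen frame-table entry + M2P entry
GIB = 1 << 30
TIB = 1 << 40


def metadata_bytes(region_bytes, per_page_bytes, page_size=PAGE_SIZE):
    """Per-page bookkeeping needed to cover a region of the given size."""
    return (region_bytes // page_size) * per_page_bytes


if __name__ == "__main__":
    for name, per_page in (("Linux struct page", LINUX_STRUCT_PAGE),
                           ("Xen frame table + M2P", XEN_FRAME_PLUS_M2P)):
        gib = metadata_bytes(TIB, per_page) / GIB
        print(f"{name}: {gib:.0f} GiB of metadata per TiB of NVDIMM")
```

With these assumed sizes the bookkeeping for a 1 TiB NVDIMM is larger than the whole of RAM on the 4GB machine used as an example earlier in the thread, which is the motivation for reserving space for it on the device itself.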
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 11/10/16 18:51, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
>>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
>>>> >>> Andrew Cooper 10/10/16 6:44 PM >>>
>>>>> On 10/10/16 01:35, Haozhong Zhang wrote:
>>>>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
>>>>>> tasks:
>>>>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>>>>    memory management data structures, i.e. frame table and M2P table.
>>>>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>>>>    hypervisor.
>>>>> However, I can't see any justification for 1). Dom0 should not be
>>>>> involved in Xen's management of its own frame table and m2p. The mfns
>>>>> making up the pmem/pblk regions should be treated just like any other
>>>>> MMIO regions, and be handed wholesale to dom0 by default.
>>>> That precludes the use as RAM extension, and I thought earlier rounds of
>>>> discussion had got everyone in agreement that at least for the pmem case
>>>> we will need some control data in Xen.
>>> The missing piece for me is why this reservation for control data
>>> needs to be done in the libnvdimm core? I would expect that any dax
>> Isn't it done this way with Linux? That is say if the machine has
>> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
>> for the NVDIMM ranges somewhere. That place can be in regions on the
>> NVDIMM that ndctl can reserve.
> Yes.

I do not see any sensible usecase for Xen to use NVDIMMs as plain RAM;
NVDIMMs are far more valuable for higher level management in dom0.

I certainly think that such a usecase should be out-of-scope for initial
Xen/NVDIMM support, even if only to reduce the complexity to start with.

A repeated complaint I have of large feature submissions like this is
that, by trying to solve all potential usecases at once, they end up being
overly complicated to develop, understand and review.

>
>>> capable file could be mapped and made available to a guest. This
>>> includes /dev/ramX devices that are dax capable, but are external to
>>> the libnvdimm sub-system.
>> This is more of just keeping track of the ranges if say the DAX file is
>> extremely fragmented and requires a lot of 'struct pages' to keep track of
>> when stitching up the VMA.
> Right, but why does the libnvdimm core need to know about this
> specific Xen reservation? For example, if Xen wants some in-kernel
> driver to own a pmem region and place its own metadata on the device I
> would recommend something like:
>
>     bdev = blkdev_get_by_path("/dev/pmemX", FMODE_EXCL...);
>     bdev_direct_access(bdev, ...);
>
> ...in other words, I don't think we want libnvdimm to grow new device
> types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
> just claim the resulting device.

I completely agree.

Whatever ends up happening between Xen and dom0, there should be no
modifications like this to the nvdimm driver. I will go so far as to
say that there shouldn't be any modifications to the nvdimm driver
(other than perhaps new query hooks so the Xen subsystem in Linux can
query information to then pass up to Xen, if the existing queryability
is insufficient).

~Andrew
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 9:58 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
>> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
>> >>>> Andrew Cooper 10/10/16 6:44 PM >>>
>> >>On 10/10/16 01:35, Haozhong Zhang wrote:
>> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
>> >>> tasks:
>> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>> >>>    memory management data structures, i.e. frame table and M2P table.
>> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>> >>>    hypervisor.
>> >>
>> >>However, I can't see any justification for 1). Dom0 should not be
>> >>involved in Xen's management of its own frame table and m2p. The mfns
>> >>making up the pmem/pblk regions should be treated just like any other
>> >>MMIO regions, and be handed wholesale to dom0 by default.
>> >
>> > That precludes the use as RAM extension, and I thought earlier rounds of
>> > discussion had got everyone in agreement that at least for the pmem case
>> > we will need some control data in Xen.
>>
>> The missing piece for me is why this reservation for control data
>> needs to be done in the libnvdimm core? I would expect that any dax
>
> Isn't it done this way with Linux? That is say if the machine has
> 4GB of RAM and the NVDIMM is in TB range. You want to put the 'struct page'
> for the NVDIMM ranges somewhere. That place can be in regions on the
> NVDIMM that ndctl can reserve.

Yes.

>> capable file could be mapped and made available to a guest. This
>> includes /dev/ramX devices that are dax capable, but are external to
>> the libnvdimm sub-system.
>
> This is more of just keeping track of the ranges if say the DAX file is
> extremely fragmented and requires a lot of 'struct pages' to keep track of
> when stitching up the VMA.

Right, but why does the libnvdimm core need to know about this
specific Xen reservation? For example, if Xen wants some in-kernel
driver to own a pmem region and place its own metadata on the device I
would recommend something like:

    bdev = blkdev_get_by_path("/dev/pmemX", FMODE_EXCL...);
    bdev_direct_access(bdev, ...);

...in other words, I don't think we want libnvdimm to grow new device
types for every possible in-kernel user, Xen, MD, DM, etc. Instead,
just claim the resulting device.
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 08:53:33AM -0700, Dan Williams wrote:
> On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
> >>>> Andrew Cooper 10/10/16 6:44 PM >>>
> >>On 10/10/16 01:35, Haozhong Zhang wrote:
> >>> Xen hypervisor needs assistance from Dom0 Linux kernel for following
> >>> tasks:
> >>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >>>    memory management data structures, i.e. frame table and M2P table.
> >>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >>>    hypervisor.
> >>
> >>However, I can't see any justification for 1). Dom0 should not be
> >>involved in Xen's management of its own frame table and m2p. The mfns
> >>making up the pmem/pblk regions should be treated just like any other
> >>MMIO regions, and be handed wholesale to dom0 by default.
> >
> > That precludes the use as RAM extension, and I thought earlier rounds of
> > discussion had got everyone in agreement that at least for the pmem case
> > we will need some control data in Xen.
>
> The missing piece for me is why this reservation for control data
> needs to be done in the libnvdimm core? I would expect that any dax

Isn't it done this way with Linux? That is, say the machine has 4GB of RAM and the NVDIMM is in the TB range. You want to put the 'struct page' for the NVDIMM ranges somewhere. That place can be in regions on the NVDIMM that ndctl can reserve.

> capable file could be mapped and made available to a guest. This
> includes /dev/ramX devices that are dax capable, but are external to
> the libnvdimm sub-system.

This is more of just keeping track of the ranges if, say, the DAX file is extremely fragmented and requires a lot of 'struct pages' to keep track of when stitching up the VMA.
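To put rough numbers on the 'struct page' point above, here is a minimal sketch of the arithmetic. It assumes 4 KiB pages, and the caller supplies the per-page metadata size. The helper names are hypothetical and are not taken from Linux, Xen or ndctl.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12  /* assumption: 4 KiB pages */

/* Number of page frames needed to cover `bytes` of pmem, rounded up. */
static uint64_t pmem_page_count(uint64_t bytes)
{
    return (bytes + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
}

/*
 * Bytes of per-page metadata needed to track `bytes` of pmem when each
 * page frame costs `per_page` bytes of metadata.
 */
static uint64_t pmem_metadata_bytes(uint64_t bytes, uint64_t per_page)
{
    assert(per_page > 0);
    return pmem_page_count(bytes) * per_page;
}
```

Under these assumptions, `pmem_metadata_bytes(1ULL << 40, 64)` says that 1 TiB of pmem (2^28 pages) with a 64-byte descriptor per page needs 16 GiB of metadata. The 64-byte figure is an assumed, typical size for an x86-64 'struct page'. That is four times the 4GB of RAM in the example, which is why reserving space on the NVDIMM itself is attractive. The same arithmetic would apply to Xen's frame table and M2P once their per-page entry sizes are plugged in.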
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On Tue, Oct 11, 2016 at 6:08 AM, Jan Beulich wrote:
>>>> Andrew Cooper 10/10/16 6:44 PM >>>
>>On 10/10/16 01:35, Haozhong Zhang wrote:
>>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>>    memory management data structures, i.e. frame table and M2P table.
>>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>>    hypervisor.
>>
>>However, I can't see any justification for 1). Dom0 should not be
>>involved in Xen's management of its own frame table and m2p. The mfns
>>making up the pmem/pblk regions should be treated just like any other
>>MMIO regions, and be handed wholesale to dom0 by default.
>
> That precludes the use as RAM extension, and I thought earlier rounds of
> discussion had got everyone in agreement that at least for the pmem case
> we will need some control data in Xen.

The missing piece for me is why this reservation for control data needs to be done in the libnvdimm core? I would expect that any dax capable file could be mapped and made available to a guest. This includes /dev/ramX devices that are dax capable, but are external to the libnvdimm sub-system.
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
>>> Andrew Cooper 10/10/16 6:44 PM >>>
>On 10/10/16 01:35, Haozhong Zhang wrote:
>> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
>> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>>    memory management data structures, i.e. frame table and M2P table.
>> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>>    hypervisor.
>
>However, I can't see any justification for 1). Dom0 should not be
>involved in Xen's management of its own frame table and m2p. The mfns
>making up the pmem/pblk regions should be treated just like any other
>MMIO regions, and be handed wholesale to dom0 by default.

That precludes the use as RAM extension, and I thought earlier rounds of discussion had got everyone in agreement that at least for the pmem case we will need some control data in Xen.

Jan
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/10/16 17:43, Andrew Cooper wrote:
> On 10/10/16 01:35, Haozhong Zhang wrote:
> > Overview
> >
> > This RFC kernel patch series along with corresponding patch series of
> > Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> > NVDIMM devices to Xen HVM domU as vNVDIMM devices.
> >
> > Xen hypervisor does not include an NVDIMM driver, so it needs the
> > assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> > devices. We currently only support NVDIMM devices in pmem mode.
> >
> > Design and Implementation
> >
> > The complete design can be found at
> > https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
> >
> > All patch series can be found at
> > Xen:          https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> > QEMU:         https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> > Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> > ndctl:        https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
> >
> > Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
> > 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
> >    memory management data structures, i.e. frame table and M2P table.
> > 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
> >    hypervisor.
>
> Please can we take a step back here before diving down a rabbit hole.
>
> How do pblk/pmem regions appear in the E820 map at boot? At the very
> least, I would expect at least a large reserved region.

The ACPI specification does not require them to appear in E820, though it defines E820 type 7 for persistent memory.

> Is the MFN information (SPA in your terminology, so far as I can tell)
> available in any static ACPI tables, or are they only available as a
> result of executing AML methods?

For NVDIMM devices already plugged in at power-on, their MFN information can be obtained from the NFIT table. However, MFN information for hotplugged NVDIMM devices has to be obtained via the AML _FIT method, so point 2) is needed.

> If the MFN information is only available via AML, then point 2) is
> needed, although the reporting back to Xen should be restricted to a xen
> component, rather than polluting the main device driver.
>
> However, I can't see any justification for 1). Dom0 should not be
> involved in Xen's management of its own frame table and m2p. The mfns
> making up the pmem/pblk regions should be treated just like any other
> MMIO regions, and be handed wholesale to dom0 by default.

Do you mean to treat them as mmio pages of type p2m_mmio_direct and map them to guest by map_mmio_regions()?

Thanks,
Haozhong
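As an illustration of the E820 type-7 point in the reply above, the following sketch scans a memory map for persistent-memory entries and turns them into MFN ranges. It is a simplified model, not Xen or Linux code: the `e820_entry` layout, the `mfn_range` type and `collect_pmem_mfns()` are hypothetical, and 4 KiB pages are assumed. As the reply notes, firmware is not required to list NVDIMMs in E820 at all, so a scan like this can only complement the NFIT table and the _FIT method.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT     12  /* assumption: 4 KiB pages */
#define E820_TYPE_PMEM 7   /* E820 type 7: persistent memory */

/* Hypothetical, simplified view of one E820 entry. */
struct e820_entry {
    uint64_t addr;  /* start physical address */
    uint64_t size;  /* length in bytes */
    uint32_t type;  /* address range type */
};

/* Half-open MFN range [start, end). */
struct mfn_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Write the whole pages covered by each type-7 entry into `out`
 * (at most `max` ranges) and return how many ranges were written.
 */
static size_t collect_pmem_mfns(const struct e820_entry *map, size_t n,
                                struct mfn_range *out, size_t max)
{
    size_t found = 0;

    assert(map != NULL && out != NULL);
    for (size_t i = 0; i < n && found < max; i++) {
        if (map[i].type != E820_TYPE_PMEM)
            continue;
        /* Round the start up and the end down to page boundaries. */
        uint64_t start = (map[i].addr + (1ULL << PAGE_SHIFT) - 1) >> PAGE_SHIFT;
        uint64_t end = (map[i].addr + map[i].size) >> PAGE_SHIFT;
        if (start < end)
            out[found++] = (struct mfn_range){ start, end };
    }
    return found;
}
```

Entries that do not contain a whole page are skipped, so a caller only ever sees ranges it could map page by page.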
Re: [Xen-devel] [RFC KERNEL PATCH 0/2] Add Dom0 NVDIMM support for Xen
On 10/10/16 01:35, Haozhong Zhang wrote:
> Overview
>
> This RFC kernel patch series along with corresponding patch series of
> Xen, QEMU and ndctl implements Xen vNVDIMM, which can map the host
> NVDIMM devices to Xen HVM domU as vNVDIMM devices.
>
> Xen hypervisor does not include an NVDIMM driver, so it needs the
> assistance from the driver in Dom0 Linux kernel to manage NVDIMM
> devices. We currently only support NVDIMM devices in pmem mode.
>
> Design and Implementation
>
> The complete design can be found at
> https://lists.xenproject.org/archives/html/xen-devel/2016-07/msg01921.html.
>
> All patch series can be found at
> Xen:          https://github.com/hzzhan9/xen.git nvdimm-rfc-v1
> QEMU:         https://github.com/hzzhan9/qemu.git xen-nvdimm-rfc-v1
> Linux kernel: https://github.com/hzzhan9/nvdimm.git xen-nvdimm-rfc-v1
> ndctl:        https://github.com/hzzhan9/ndctl.git pfn-xen-rfc-v1
>
> Xen hypervisor needs assistance from Dom0 Linux kernel for following tasks:
> 1) Reserve an area on NVDIMM devices for Xen hypervisor to place
>    memory management data structures, i.e. frame table and M2P table.
> 2) Report SPA ranges of NVDIMM devices and the reserved area to Xen
>    hypervisor.

Please can we take a step back here before diving down a rabbit hole.

How do pblk/pmem regions appear in the E820 map at boot? At the very least, I would expect at least a large reserved region.

Is the MFN information (SPA in your terminology, so far as I can tell) available in any static ACPI tables, or are they only available as a result of executing AML methods?

If the MFN information is only available via AML, then point 2) is needed, although the reporting back to Xen should be restricted to a xen component, rather than polluting the main device driver.

However, I can't see any justification for 1). Dom0 should not be involved in Xen's management of its own frame table and m2p. The mfns making up the pmem/pblk regions should be treated just like any other MMIO regions, and be handed wholesale to dom0 by default.

~Andrew