[ANNOUNCE] ndctl v62
https://github.com/pmem/ndctl/releases/tag/v62 This release incorporates functionality up to the 4.19 kernel, and a number of bug fixes and improvements. Highlights include addition of the 'ndctl monitor' command to monitor for SMART health events, use of the new max_available_extent sysfs attribute for namespace creation, verbosity levels for ndctl-list, a udev rule for enabling the LSS latch when supported, a bypass route for making the unsafe shutdown count available for non-privileged users, improvements to ndctl-inject-smart that include an 'uninject' option for all fields, and a new unit test, a number of static analysis fixes, and unit test improvements and fixes. Shortlog: Dan Williams (2): ndctl, test: check availability of MAP_SYNC for poison test ndctl: Remove dependency on linker garbage collection Keith Busch (5): ndctl: Use max_available_extent for namespace ndctl: Create ndctl udev rules for dirty shutdown ndctl, intel: Fallback to smart cached shutdown_count ndctl: Add 'list' verbose options ndctl: Work around kernel memory corruption Masayoshi Mizuma (3): ndctl, test: remove the firmware image file before the test end ndctl, documentation: Clarify the dimm id for ndctl list d option ndctl, test: add a new unit test for max_available_extent namespace Maxwell William (1): util/strbuf.h: include sys/types.h for ssize_t definition. QI Fuli (14): ndctl, monitor: add a new command - monitor ndctl, monitor: add main ndctl monitor configuration file ndctl, monitor: add the unit file of systemd for ndctl-monitor service ndctl, documentation: add man page for monitor ndctl, test: add a new unit test for monitor ndctl, monitor: fix the lack of detection of invalid dimm-events ndctl, inject-smart: add an interface to inject ctrl-temperature ndctl, monitor: Fix duplicate prefix in monitor.log ndctl, monitor: add [--verbose] option to emit extra debug messages ndctl, list: add alarm_enable_ to list ndctl, monitor: fix the lack of detection of invalid path of log file ndctl, monitor: set default log destination to syslog if "--daemon" is specified ndctl, monitor: add timestamp and pid to log messages in log_file() ndctl, monitor: add [Install] Section to systemd unit file of ndctl-monitor Ross Zwisler (2): ndctl: simplify JSON print flag handling ndctl list: always output array without --human Vishal Verma (33): Documentation: add a newline in namespace Theory of Operations ndctl: Add CONTRIBUTING.md ndctl, test: add start/wait scrub to injection tests libndctl: fix the uninject-error API actually injecting errors ndctl, test: Fix dax.sh return code ndctl, test: fix timeouts in device-dax contrib/do_abidiff: make the build more robust ndctl: add an API to check support for smart injection ndctl, test: fix tests for the array vs object listing fix ndctl, test: convert remaining tests to use test/common ndctl, bash-completion: add completion for ndctl-monitor ndctl, monitor: Add a config-file section to the man page ndctl, monitor: fix memory leak in read_config_file ndctl, monitor: Fix memory leak in monitor_event ndctl, monitor: improve error reporting throughout monitor.c Documentation, create-namespace: clarify fsdax wording ndctl, monitor: fix a resource leak in parse_monitor_event ndctl, documentation: document the label-version option for init-labels ndctl: deprecate undocumented short-options ndctl, inject-smart: Fix man page to match the current behavior ndctl inject-smart: add an option to uninject smart fields ndctl, test/monitor: fix inject-smart field in test_filter_dimmevent ndctl, inject-smart: continue in spite of errors for uninject-all ndctl, tests: add a new unit test for inject-smart ndctl, inject: fix a resource leak in ndctl_namespace_get_clear_unit ndctl: fix a resource leak in submit_get_firmware_info libndctl: fix a resource leak in ndctl_dimm_get_{{event_}flags, health} ndctl, test: fix a potential null pointer dereference in 'ndctl test' ndctl, test: fix a resource leak in check_smart_threshold ndctl, prepare-release.sh: fix revision update checks ndctl: fix potential null dereference in the smart error handler ndctl, udev: fix a resource leak in save_unsafe_shutdown_count ndctl: release v62 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [ndctl PATCH] ndctl: Work around kernel memory corruption
On 08/13/2018 01:31 PM, Keith Busch wrote: > Kernel commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden > ars_status output length handling") contained an incorrect ars status > output size calculation and may overrun the buffer provided by 4 > bytes. This patch adds 4 bytes to the buffer the user space allocates > so that the kernel's overrun doesn't corrupt the application's heap. > > See kernel patch for more details: > > https://patchwork.kernel.org/patch/10563103/ > > Signed-off-by: Keith Busch Reviewed-by: Dave Jiang > --- > ndctl/lib/ars.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/ndctl/lib/ars.c b/ndctl/lib/ars.c > index c78e3bf..bd75131 100644 > --- a/ndctl/lib/ars.c > +++ b/ndctl/lib/ars.c > @@ -133,7 +133,16 @@ NDCTL_EXPORT struct ndctl_cmd > *ndctl_bus_cmd_new_ars_status(struct ndctl_cmd *ar > } > > size = sizeof(*cmd) + ars_cap_cmd->max_ars_out; > - cmd = calloc(1, size); > + > + /* > + * Older kernels have a bug that miscalculates the output length of the > + * ars status and will overrun the provided buffer by 4 bytes, > + * corrupting the memory. Add an additional 4 bytes in the allocation > + * size to prevent that corruption. See kernel patch for more details: > + * > + * https://patchwork.kernel.org/patch/10563103/ > + */ > + cmd = calloc(1, size + 4); > if (!cmd) > return NULL; > > ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[ndctl PATCH] ndctl: Work around kernel memory corruption
Kernel commit efda1b5d87cb ("acpi, nfit, libnvdimm: fix / harden ars_status output length handling") contained an incorrect ars status output size calculation and may overrun the buffer provided by 4 bytes. This patch adds 4 bytes to the buffer the user space allocates so that the kernel's overrun doesn't corrupt the application's heap. See kernel patch for more details: https://patchwork.kernel.org/patch/10563103/ Signed-off-by: Keith Busch --- ndctl/lib/ars.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/ndctl/lib/ars.c b/ndctl/lib/ars.c index c78e3bf..bd75131 100644 --- a/ndctl/lib/ars.c +++ b/ndctl/lib/ars.c @@ -133,7 +133,16 @@ NDCTL_EXPORT struct ndctl_cmd *ndctl_bus_cmd_new_ars_status(struct ndctl_cmd *ar } size = sizeof(*cmd) + ars_cap_cmd->max_ars_out; - cmd = calloc(1, size); + + /* +* Older kernels have a bug that miscalculates the output length of the +* ars status and will overrun the provided buffer by 4 bytes, +* corrupting the memory. Add an additional 4 bytes in the allocation +* size to prevent that corruption. See kernel patch for more details: +* +* https://patchwork.kernel.org/patch/10563103/ +*/ + cmd = calloc(1, size + 4); if (!cmd) return NULL; -- 2.14.4 ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v2 3/3] powerpc/mce: Handle memcpy_mcsafe
On Thu, Apr 05, 2018 at 05:15:00PM +1000, Balbir Singh wrote: Add a blocking notifier callback to be called in real-mode on machine check exceptions for UE (ld/st) errors only. It's been a while, but is this patchset still being pursued? This patch in particular (callbacks for MCE handling) has other device memory use cases and I'd like to move it along. The patch registers a callback on boot to be notified of machine check exceptions and returns a NOTIFY_STOP when a page of interest is seen as the source of the machine check exception. This page of interest is a ZONE_DEVICE page and hence for now, for memcpy_mcsafe to work, the page needs to belong to ZONE_DEVICE and memcpy_mcsafe should be used to access the memory. The patch also modifies the NIP of the exception context to go back to the fixup handler (in memcpy_mcsafe) and does not print any error message as the error is treated as returned via a return value and handled. Signed-off-by: Balbir Singh --- arch/powerpc/include/asm/mce.h | 3 +- arch/powerpc/kernel/mce.c | 77 -- 2 files changed, 77 insertions(+), 3 deletions(-) diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h index 3a1226e9b465..a76638e3e47e 100644 --- a/arch/powerpc/include/asm/mce.h +++ b/arch/powerpc/include/asm/mce.h @@ -125,7 +125,8 @@ struct machine_check_event { enum MCE_UeErrorType ue_error_type:8; uint8_t effective_address_provided; uint8_t physical_address_provided; - uint8_t reserved_1[5]; + uint8_t error_return; + uint8_t reserved_1[4]; uint64_teffective_address; uint64_tphysical_address; uint8_t reserved_2[8]; diff --git a/arch/powerpc/kernel/mce.c b/arch/powerpc/kernel/mce.c index efdd16a79075..b9e4881fa8c5 100644 --- a/arch/powerpc/kernel/mce.c +++ b/arch/powerpc/kernel/mce.c @@ -28,7 +28,9 @@ #include #include #include +#include +#include #include #include @@ -54,6 +56,52 @@ static struct irq_work mce_event_process_work = { DECLARE_WORK(mce_ue_event_work, machine_process_ue_event); +static BLOCKING_NOTIFIER_HEAD(mce_notifier_list); + +int register_mce_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_register(_notifier_list, nb); +} +EXPORT_SYMBOL_GPL(register_mce_notifier); + +int unregister_mce_notifier(struct notifier_block *nb) +{ + return blocking_notifier_chain_unregister(_notifier_list, nb); +} +EXPORT_SYMBOL_GPL(unregister_mce_notifier); + + +static int check_memcpy_mcsafe(struct notifier_block *nb, + unsigned long val, void *data) +{ + /* +* val contains the physical_address of the bad address +*/ + unsigned long pfn = val >> PAGE_SHIFT; + struct page *page = realmode_pfn_to_page(pfn); + int rc = NOTIFY_DONE; + + if (!page) + goto out; + + if (is_zone_device_page(page)) /* for HMM and PMEM */ + rc = NOTIFY_STOP; +out: + return rc; +} + +struct notifier_block memcpy_mcsafe_nb = { + .priority = 0, + .notifier_call = check_memcpy_mcsafe, +}; + +int mce_mcsafe_register(void) +{ + register_mce_notifier(_mcsafe_nb); + return 0; +} +arch_initcall(mce_mcsafe_register); + static void mce_set_error_info(struct machine_check_event *mce, struct mce_error_info *mce_err) { @@ -151,9 +199,31 @@ void save_mce_event(struct pt_regs *regs, long handled, mce->u.ue_error.effective_address_provided = true; mce->u.ue_error.effective_address = addr; if (phys_addr != ULONG_MAX) { + int rc; + const struct exception_table_entry *entry; + + /* +* Once we have the physical address, we check to +* see if the current nip has a fixup entry. +* Having a fixup entry plus the notifier stating +* that it can handle the exception is an indication +* that we should return to the fixup entry and +* return an error from there +*/ mce->u.ue_error.physical_address_provided = true; mce->u.ue_error.physical_address = phys_addr; - machine_check_ue_event(mce); + + rc = blocking_notifier_call_chain(_notifier_list, + phys_addr, NULL); Could we pass mce entirely here instead of just phys_addr? It would allow the callback itself to set error_return if needed. + if (rc & NOTIFY_STOP_MASK) { +
Re: [PATCH V3 3/4] mm: add a function to differentiate the pages is from DAX device memory
On Tue, Aug 14, 2018 at 01:41:40AM +0800, Zhang,Yi wrote: > > > On 2018年08月09日 17:23, Pankaj Gupta wrote: > >> DAX driver hotplug the device memory and move it to memory zone, these > >> pages will be marked reserved flag, however, some other kernel componet > >> will misconceive these pages are reserved mmio (ex: we map these dev_dax > >> or fs_dax pages to kvm for DIMM/NVDIMM backend). Together with the type > >> MEMORY_DEVICE_FS_DAX, we can use is_dax_page() to differentiate the pages > >> is DAX device memory or not. > >> > >> Signed-off-by: Zhang Yi > >> Signed-off-by: Zhang Yu > >> --- > >> include/linux/mm.h | 12 > >> 1 file changed, 12 insertions(+) > >> > >> diff --git a/include/linux/mm.h b/include/linux/mm.h > >> index 68a5121..de5cbc3 100644 > >> --- a/include/linux/mm.h > >> +++ b/include/linux/mm.h > >> @@ -889,6 +889,13 @@ static inline bool is_device_public_page(const struct > >> page *page) > >>page->pgmap->type == MEMORY_DEVICE_PUBLIC; > >> } > >> > >> +static inline bool is_dax_page(const struct page *page) > >> +{ > >> + return is_zone_device_page(page) && > >> + (page->pgmap->type == MEMORY_DEVICE_FS_DAX || > >> + page->pgmap->type == MEMORY_DEVICE_DEV_DAX); > >> +} > > I think question from Dan for KVM VM with 'MEMORY_DEVICE_PUBLIC' still > > holds? > > I am also interested to know if there is any use-case. > > > > Thanks, > > Pankaj > Yes, it is, thanks for your remind, Pankaj. > Adding Jerome for Dan's questions on V1: > [Dan]: > > Jerome, might there be any use case to pass MEMORY_DEVICE_PUBLIC > memory to a guest vm? Yes and no, i am not sure how we are going to do it. But being able to share GPU among multiple VM is on TODO list and those GPU will have MEMORY_DEVICE_PUBLIC|PRIVATE depending on the platform. So either we pass down the real underlying resource to the guest, or we will pass down a fake one and have guest and host driver talk to each other so that the host driver can do overall resource management accross multiple guests. So i would say that for now you can ignore MEMORY_DEVICE_PUBLIC and when we get to the KVM guest sharing of those and decide how we want to do it then we can update kvm to properly interpret those. Cheers, Jérôme ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [ndctl PATCH 1/2] ndctl: fix potential null dereference in the smart error handler
On Fri, Aug 10, 2018 at 06:40:52PM -0600, Vishal Verma wrote: > Static analysis reports that can potentially dereference a NULL pointer > in the smart cmd error handler. This can particular instance won't ever > be hit in practice as the handler is only registered for smart commands, > and smart commands are currently only DIMM commands, and will always > have a dimm object. However for completeness, and to avoid future > errors, we should perform a NULL check in the handler anyway. Hmm, I purposefully didn't have the NULL check because the dimm is never not set in this path. Looks like a false positive, but the NULL check is harmless. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional
On Mon, Aug 13, 2018 at 12:12:52PM +0200, Jan Kara wrote: > > The generic/081 regression appears to be a device-mapper issue... > > I'll see if this reproduces for me. Doesn't seem to be related to the DAX > patches you caary though. It does seem to be a DAX-specific failure though. > > The generic/344 failure seems to be caused by a WARNING triggered in > > the nvdimm code: > > OK, apparently this is nothing new for you as generic/344 fails for you > even with 3.17. But it should not :). I'll try to see if I can reproduce > this in my test setup during more test runs (I don't remember seeing it > during occasional runs I do) and debug it further. Thanks! In case it wasn't clear, I wasn't planning on letting these failures prevent the patches from going upstream. As you say, the generic/081 failure looks unrelated to ext4, and the generic/344 isn't a regression. - Ted ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH v4 1/2] dax: dax_layout_busy_page() warn on !exceptional
On Fri 10-08-18 22:10:53, Theodore Y. Ts'o wrote: > On Fri, Aug 10, 2018 at 04:33:49PM -0400, Theodore Y. Ts'o wrote: > > I just kicked off a DAX test ("gce-xfstests -c dax -g auto") with > > CONFIG_KASAN disabled, and I expect it shouldn't show up anything > > concerning. So assuming nothing surprising pops up, yes it should be > > merged at the next merge window. > > ... and here are the results. The first is 4.17, and the second is > the ext4 git tree: > > ext4/dax: 488 tests, 4 failures, 97 skipped, 2647 seconds > Failures: ext4/033 generic/344 generic/491 generic/503 > > ext4/dax: 488 tests, 3 failures, 97 skipped, 2637 seconds > Failures: generic/081 generic/344 generic/388 > > The generic/388 failure is a known flake (shutdown stress test). > > The generic/081 regression appears to be a device-mapper issue: > > generic/081 [22:06:33][ 15.079661] run fstests generic/081 at > 2018-08-10 22:06:33 > [ 15.795745] device-mapper: ioctl: can't change device type (old=4 vs > new=1) after initial table load. > [failed, exit status 1] [22:06:36]- output mismatch (see > /results/ext4/results-dax/generic/081.out.bad) > --- tests/generic/081.out 2018-08-09 18:00:42.0 -0400 > +++ /results/ext4/results-dax/generic/081.out.bad 2018-08-10 > 22:06:36.440005460 -0400 > @@ -1,2 +1,4 @@ > QA output created by 081 > Silence is golden > +Failed to create snapshot > +(see /results/ext4/results-dax/generic/081.full for details) > ... > (Run 'diff -u tests/generic/081.out > /results/ext4/results-dax/generic/081.out.bad' to see the entire diff) I'll see if this reproduces for me. Doesn't seem to be related to the DAX patches you caary though. > The generic/344 failure seems to be caused by a WARNING triggered in > the nvdimm code: OK, apparently this is nothing new for you as generic/344 fails for you even with 3.17. But it should not :). I'll try to see if I can reproduce this in my test setup during more test runs (I don't remember seeing it during occasional runs I do) and debug it further. Honza > generic/344 [22:06:36][ 18.126280] run fstests generic/344 at > 2018-08-10 22:06:36 > [ 18.303113] EXT4-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at > your own risk > [ 18.456988] EXT4-fs (pmem1): DAX enabled. Warning: EXPERIMENTAL, use at > your own risk > [ 97.375912] WARNING: CPU: 2 PID: 1712 at > /usr/projects/linux/ext4/mm/memory.c:1801 insert_pfn+0x15a/0x170 > [ 97.377261] CPU: 2 PID: 1712 Comm: holetest Not tainted > 4.18.0-rc4-xfstests-00039-g863c37fcb14f #497 > [ 97.378486] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.11.1-1 04/01/2014 > [ 97.379516] RIP: 0010:insert_pfn+0x15a/0x170 > [ 97.380064] Code: 19 1b 01 eb dd 48 85 d2 74 07 48 23 1d 3f 19 1b 01 48 09 > df 48 89 f8 0f 1f 40 00 48 b9 00 02 00 00 00 00 00 04 48 09 c1 eb c8 <0f> 0b > e9 5e ff ff ff bb f4 ff ff ff e9 5e ff ff ff e8 80 7b ec ff > [ 97.382469] RSP: :acfd0457fc60 EFLAGS: 00010206 > [ 97.383123] RAX: 00179e3b RBX: fff0 RCX: > 0002 > [ 97.384062] RDX: 000f RSI: 8f5c28f5c28f5c29 RDI: > 800179e3b225 > [ 97.385134] RBP: 923761654558 R08: 0001 R09: > 0001 > [ 97.386213] R10: 92376f415cc0 R11: 0001 R12: > 92377e880cc0 > [ 97.387264] R13: 7fab98aab000 R14: 003e860d R15: > 92376156 > [ 97.388327] FS: 7fab9049c700() GS:92377f20() > knlGS: > [ 97.389514] CS: 0010 DS: ES: CR0: 80050033 > [ 97.390367] CR2: 7fab98aabc00 CR3: 0006a165a004 CR4: > 00360ee0 > [ 97.391432] Call Trace: > [ 97.391798] __vm_insert_mixed+0x7e/0xc0 > [ 97.392376] vmf_insert_mixed_mkwrite+0xf/0x30 > [ 97.393048] dax_iomap_pte_fault+0xb8b/0xe40 > [ 97.393691] ext4_dax_huge_fault+0x145/0x200 > [ 97.394268] do_wp_page+0x175/0x5b0 > [ 97.394710] __handle_mm_fault+0x587/0xbb0 > [ 97.395228] __do_page_fault+0x20c/0x490 > [ 97.395729] ? async_page_fault+0x8/0x30 > [ 97.396251] async_page_fault+0x1e/0x30 > [ 97.396719] RIP: 0033:0x5598144275ea > [ 97.397187] Code: 0f 85 8a 00 00 00 31 d2 48 85 db 4b 8d 04 34 7e 1f 0f 1f > 80 00 00 00 00 48 89 d1 48 83 c2 01 48 0f af 0d 71 1b 20 00 48 39 d3 <48> 89 > 2c 08 75 e8 8b 0d 36 1b 20 00 31 c0 85 c9 74 0a 8b 15 2e 1b > [ 97.399752] RSP: 002b:7fab9049bf20 EFLAGS: 00010212 > [ 97.400541] RAX: 7fab90c9ec00 RBX: 0001 RCX: > 07e0d000 > [ 97.401603] RDX: 7e0e RSI: RDI: > 0001 > [ 97.402673] RBP: 7fab9049c700 R08: 7fab9049c700 R09: > 7fab9049c700 > [ 97.403755] R10: 7fab9049c9d0 R11: 0202 R12: > 7fab90c9e000 > [ 97.404851] R13: 7ffc4608c9e0 R14: 0c00 R15: > 55981608e250 >
Re: [PATCH V3 3/4] mm: add a function to differentiate the pages is from DAX device memory
On 2018年08月09日 17:23, Pankaj Gupta wrote: >> DAX driver hotplug the device memory and move it to memory zone, these >> pages will be marked reserved flag, however, some other kernel componet >> will misconceive these pages are reserved mmio (ex: we map these dev_dax >> or fs_dax pages to kvm for DIMM/NVDIMM backend). Together with the type >> MEMORY_DEVICE_FS_DAX, we can use is_dax_page() to differentiate the pages >> is DAX device memory or not. >> >> Signed-off-by: Zhang Yi >> Signed-off-by: Zhang Yu >> --- >> include/linux/mm.h | 12 >> 1 file changed, 12 insertions(+) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 68a5121..de5cbc3 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -889,6 +889,13 @@ static inline bool is_device_public_page(const struct >> page *page) >> page->pgmap->type == MEMORY_DEVICE_PUBLIC; >> } >> >> +static inline bool is_dax_page(const struct page *page) >> +{ >> +return is_zone_device_page(page) && >> +(page->pgmap->type == MEMORY_DEVICE_FS_DAX || >> +page->pgmap->type == MEMORY_DEVICE_DEV_DAX); >> +} > I think question from Dan for KVM VM with 'MEMORY_DEVICE_PUBLIC' still holds? > I am also interested to know if there is any use-case. > > Thanks, > Pankaj Yes, it is, thanks for your remind, Pankaj. Adding Jerome for Dan's questions on V1: [Dan]: Jerome, might there be any use case to pass MEMORY_DEVICE_PUBLIC memory to a guest vm? > >> + >> #else /* CONFIG_DEV_PAGEMAP_OPS */ >> static inline void dev_pagemap_get_ops(void) >> { >> @@ -912,6 +919,11 @@ static inline bool is_device_public_page(const struct >> page *page) >> { >> return false; >> } >> + >> +static inline bool is_dax_page(const struct page *page) >> +{ >> +return false; >> +} >> #endif /* CONFIG_DEV_PAGEMAP_OPS */ >> >> static inline void get_page(struct page *page) >> -- >> 2.7.4 >> >> ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH V3 0/4] Fix kvm misconceives NVDIMM pages as reserved mmio
On 2018年08月09日 17:02, Jan Kara wrote: > On Thu 09-08-18 18:52:48, Zhang Yi wrote: >> For device specific memory space, when we move these area of pfn to >> memory zone, we will set the page reserved flag at that time, some of >> these reserved for device mmio, and some of these are not, such as >> NVDIMM pmem. >> >> Now, we map these dev_dax or fs_dax pages to kvm for DIMM/NVDIMM >> backend, since these pages are reserved. the check of >> kvm_is_reserved_pfn() misconceives those pages as MMIO. Therefor, we >> introduce 2 page map types, MEMORY_DEVICE_FS_DAX/MEMORY_DEVICE_DEV_DAX, >> to indentify these pages are from NVDIMM pmem. and let kvm treat these >> as normal pages. >> >> Without this patch, Many operations will be missed due to this >> mistreatment to pmem pages. For example, a page may not have chance to >> be unpinned for KVM guest(in kvm_release_pfn_clean); not able to be >> marked as dirty/accessed(in kvm_set_pfn_dirty/accessed) etc. >> >> V1: >> https://lkml.org/lkml/2018/7/4/91 >> >> V2: >> https://lkml.org/lkml/2018/7/10/135 >> >> V3: >> [PATCH V3 1/4] Needs Comments. >> [PATCH V3 2/4] Update the description of MEMORY_DEVICE_DEV_DAX: Jan >> [PATCH V3 3/4] Acked-by: Jan in V2 > Hum, but it is not the the patch... > > Honza Sorry, I missed that, will add in the next version, thanks for your review ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH V3 0/4] Fix kvm misconceives NVDIMM pages as reserved mmio
On 2018年08月10日 21:27, David Hildenbrand wrote: > On 09.08.2018 12:52, Zhang Yi wrote: >> For device specific memory space, when we move these area of pfn to >> memory zone, we will set the page reserved flag at that time, some of >> these reserved for device mmio, and some of these are not, such as >> NVDIMM pmem. >> >> Now, we map these dev_dax or fs_dax pages to kvm for DIMM/NVDIMM >> backend, since these pages are reserved. the check of >> kvm_is_reserved_pfn() misconceives those pages as MMIO. Therefor, we >> introduce 2 page map types, MEMORY_DEVICE_FS_DAX/MEMORY_DEVICE_DEV_DAX, >> to indentify these pages are from NVDIMM pmem. and let kvm treat these >> as normal pages. >> >> Without this patch, Many operations will be missed due to this >> mistreatment to pmem pages. For example, a page may not have chance to >> be unpinned for KVM guest(in kvm_release_pfn_clean); not able to be >> marked as dirty/accessed(in kvm_set_pfn_dirty/accessed) etc. >> > I am right now looking into (and trying to better document) PG_reserved > - and having a hard time :) . > > One of the main points about reserved pages is that the struct pages are > not to be touched. See [1] (I know that statement is fairly old, but it > resembles what PG_reserved is actually used for nowadays - with some > exceptions unfortunately.). > > Struct pages part of user space tables that are PG_reserved can indicate > (as of now according to my research) > - MMIO pages > - Selected MMAPed pages - e.g. vDSO > - Zero page > - PMEM pages as you correctly state > > So I wonder, if it is really the right approach to silently go ahead and > treat reserved pages just like they would not be reserved. Maybe the > right approach would rather be to do something about pmem pages being > reserved. Yes, they are never to be given to the page allocator, but I > wonder if PG_reserved is strictly needed for that. > > [1] https://lists.linuxcoding.com/kernel/2005-q3/msg10350.html Thanks David list the long history of Page reserved, By now, I think we treat nvdimm as a device not a DRAM, also has it's device driver which manager its own device memory. From this perspective, it is reasonable to set these pages as zone device memory and mark reserved flag. @Dan @Dave, how do you think about this? > >> V1: >> https://lkml.org/lkml/2018/7/4/91 >> >> V2: >> https://lkml.org/lkml/2018/7/10/135 >> >> V3: >> [PATCH V3 1/4] Needs Comments. >> [PATCH V3 2/4] Update the description of MEMORY_DEVICE_DEV_DAX: Jan >> [PATCH V3 3/4] Acked-by: Jan in V2 >> [PATCH V3 4/4] Needs Comments. >> >> Zhang Yi (4): >> kvm: remove redundant reserved page check >> mm: introduce memory type MEMORY_DEVICE_DEV_DAX >> mm: add a function to differentiate the pages is from DAX device >> memory >> kvm: add a check if pfn is from NVDIMM pmem. >> >> drivers/dax/pmem.c | 1 + >> include/linux/memremap.h | 8 >> include/linux/mm.h | 12 >> virt/kvm/kvm_main.c | 16 >> 4 files changed, 29 insertions(+), 8 deletions(-) >> > ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm