RE: nfit_test.ko.xz needs unknown symbol ....
On Thursday, November 8, 2018 5:32 PM Dan Williams wrote:
> On Thu, Nov 8, 2018 at 6:45 AM Dorau, Lukasz wrote:
> >
> > Hi,
> >
> > I am building and installing the linux kernel from the 'libnvdimm-for-next'
> > branch at git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git
> > with the 'nfit_test' kernel module.
> >
> > I have set in the .config file:
> >
> > CONFIG_X86_PMEM_LEGACY=m
> > CONFIG_ZONE_DEVICE=y
> > CONFIG_LIBNVDIMM=m
> > CONFIG_BLK_DEV_PMEM=m
> > CONFIG_ND_BLK=m
> > CONFIG_BTT=y
> > CONFIG_NVDIMM_PFN=y
> > CONFIG_NVDIMM_DAX=y
> > CONFIG_DEV_DAX_PMEM=m
> >
> > I ran:
> >
> > $ make
> > $ make M=tools/testing/nvdimm
> > $ sudo make M=tools/testing/nvdimm modules_install
> > $ sudo make modules_install
> >
> > and I got:
> >
> > [...]
> > DEPMOD 4.19.0-rc4-nfit_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol libnvdimm_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol acpi_nfit_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol pmem_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol device_dax_test
> >
> > Does anyone know what could be wrong here?
>
> Do you get the error if you run the same script again? For external
> modules to find symbols the symbol map needs to be installed, but the
> first time you run the process nothing has been installed. So, it's
> possible to see this error on the first run on a new kernel version,
> but it's benign and should disappear for subsequent builds.

My script:

$ make
$ make M=tools/testing/nvdimm

The first time I run 'modules_install' I get:

$ sudo make M=tools/testing/nvdimm modules_install
[...]
DEPMOD 4.19.0-rc4-nfit_test
depmod: WARNING: could not open /lib/modules/4.19.0-rc4-nfit_test/modules.order: No such file or directory
depmod: WARNING: could not open /lib/modules/4.19.0-rc4-nfit_test/modules.builtin: No such file or directory

and for subsequent builds I always get the same:

$ sudo make modules_install
[...]
DEPMOD 4.19.0-rc4-nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol libnvdimm_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol acpi_nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol pmem_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol device_dax_test

$ sudo make M=tools/testing/nvdimm modules_install
[...]
DEPMOD 4.19.0-rc4-nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol libnvdimm_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol acpi_nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol pmem_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol device_dax_test

--
Lukasz
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
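The four symbols depmod complains about are the ones the nfit_test stub exports provide once the test modules are installed. When triaging a batch of these warnings, it can help to pull out just the unresolved symbol names. A minimal sketch (the warning format is taken from the output above; the helper name is made up for illustration):

```python
def unresolved_symbols(depmod_output):
    """Collect symbol names from depmod 'needs unknown symbol <sym>' warnings."""
    marker = "needs unknown symbol "
    symbols = []
    for line in depmod_output.splitlines():
        if marker in line:
            # everything after the marker is the symbol name
            symbols.append(line.split(marker, 1)[1].strip())
    return symbols
```

Running this over the log above would yield `libnvdimm_test`, `acpi_nfit_test`, `pmem_test`, and `device_dax_test` — exactly the providers that must be installed before depmod can resolve nfit_test.ko.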
Re: fsdax memory error handling regression
On Tue, Nov 6, 2018 at 10:01 PM Williams, Dan J wrote: > > On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote: > > On Tue, Nov 06, 2018 at 03:44:47AM +, Williams, Dan J wrote: > > > Hi Willy, > > > > > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh" > > > test > > > from the ndctl repository: > > > > I'll try to run this myself later today. > > > > > I tried to get this test going on -next before the merge window, > > > but > > > -next was not bootable for me. Bisection points to: > > > > > > 9f32d221301c dax: Convert dax_lock_mapping_entry to XArray > > > > > > At first glance I think we need the old "always retry if we slept" > > > behavior. Otherwise this failure seems similar to the issue fixed > > > by > > > Ross' change to always retry on any potential collision: > > > > > > b1f382178d15 ext4: close race between direct IO and > > > ext4_break_layouts() > > > > > > I'll take a closer look tomorrow to see if that guess is plausible. > > > > If your thought is correct, then this should be all that's needed: > > > > diff --git a/fs/dax.c b/fs/dax.c > > index 616e36ea6aaa..529ac9d7c10a 100644 > > --- a/fs/dax.c > > +++ b/fs/dax.c > > @@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page) > > entry = xas_load(&xas); > > if (dax_is_locked(entry)) { > > entry = get_unlocked_entry(&xas); > > - /* Did the page move while we slept? */ > > - if (dax_to_pfn(entry) != page_to_pfn(page)) { > > - xas_unlock_irq(&xas); > > - continue; > > - } > > + xas_unlock_irq(&xas); > > + continue; > > } > > dax_lock_entry(&xas, entry); > > xas_unlock_irq(&xas); > > No, that doesn't work. > > > > > I don't quite understand how we'd find a PFN for this page in the > > tree > > after the page has had page->mapping removed. 
However, the more I > > look > > at this path, the more I don't like it -- it doesn't handle returning > > NULL explicitly, nor does it handle the situation where a PMD is > > split > > to form multiple PTEs explicitly, it just kind of relies on those bit > > patterns not matching. > > > > So I kind of like the "just retry without doing anything clever" > > situation > > that the above patch takes us to. > > I've been hacking at this today and am starting to lean towards > "revert" over "fix" for the amount of changes needed to get this back > on its feet. I've been able to get the test passing again with the > below changes directly on top of commit 9f32d221301c "dax: Convert > dax_lock_mapping_entry to XArray". That said, I have thus far been > unable to rebase this patch on top of v4.20-rc1 and yield a functional > result. Willy? Thoughts? I'm holding off a revert of the fsdax conversion awaiting whether you see a way to address the concerns incrementally. > My concerns are: > - I can't determine if dax_unlock_entry() wants an unlocked entry > parameter, or locked. The dax_insert_pfn_mkwrite() and > dax_unlock_mapping_entry() usages seem to disagree. > > - The multi-order use case of Xarray is a mystery to me. It seems to > want to know the order of entries a-priori with a choice to use > XA_STATE_ORDER() vs XA_STATE(). This falls over in > dax_unlock_mapping_entry() and other places where the only source of > the order of the entry is determined from dax_is_pmd_entry() i.e. the > Xarray itself. PageHead() does not work for DAX pages because > PageHead() is only established by the page allocator and DAX pages > never participate in the page allocator. > > - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed > for inode lifetime synchronization, not just for walking the radix. > That lock needs to be dropped before sleeping, and if we slept the > inode may no longer exist. 
> - I could not see how the pattern:
>         entry = xas_load(&xas);
>         if (dax_is_locked(entry)) {
>                 entry = get_unlocked_entry(&xas);
>   ...was safe given that get_unlocked_entry() turns around and does
>   validation that the entry is !xa_internal_entry() and !NULL.
>
> - The usage of internal entries in grab_mapping_entry() seems to need
>   auditing. Previously we would compare the entry size against
>   @size_flag, but now if the index hits a multi-order entry in
>   get_unlocked_entry() afaics it could be internal and we need to convert
>   it to the actual entry before aborting... at least to match the v4.19
>   behavior.
>
> This all seems to merit a rethink of the dax integration of XArray.
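Willy's suggestion upthread amounts to "if the entry was observed locked, drop everything and retry from scratch" rather than comparing pfns after sleeping. A toy user-space model of that retry loop (the entry store, event list, and field names are all invented stand-ins for the XArray, waitqueue, and dax entry bits):

```python
def lock_entry(store, index, events):
    """Toy model of dax_lock_mapping_entry() with the 'just retry'
    behaviour: whenever the entry is locked, restart the whole lookup."""
    attempts = 0
    while True:
        attempts += 1
        entry = store.get(index)
        if entry is None:
            return None, attempts      # mapping went away while we slept
        if entry["locked"]:
            if events:
                events.pop(0)(store)   # stand-in for sleeping on the waitqueue
            continue                   # retry without being clever
        entry["locked"] = True         # dax_lock_entry()
        return entry, attempts
```

The point of the model is that correctness no longer depends on bit patterns (pfn comparisons, PMD-split leftovers) matching up after a sleep — the loop simply re-validates everything each pass.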
[PATCH 1/2] mm: make dev_pagemap_mapping_shift() externally visible
KVM has a use case for determining the size of a dax mapping. The KVM code has easy access to the address and the mm; hence the change in parameters. Signed-off-by: Barret Rhoden --- include/linux/mm.h | 3 +++ mm/memory-failure.c | 38 +++--- mm/util.c | 34 ++ 3 files changed, 40 insertions(+), 35 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5411de93a363..51215d695753 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -935,6 +935,9 @@ static inline bool is_pci_p2pdma_page(const struct page *page) } #endif /* CONFIG_DEV_PAGEMAP_OPS */ +unsigned long dev_pagemap_mapping_shift(unsigned long address, + struct mm_struct *mm); + static inline void get_page(struct page *page) { page = compound_head(page); diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 0cd3de3550f0..c3f2c6a8607e 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -265,40 +265,6 @@ void shake_page(struct page *p, int access) } EXPORT_SYMBOL_GPL(shake_page); -static unsigned long dev_pagemap_mapping_shift(struct page *page, - struct vm_area_struct *vma) -{ - unsigned long address = vma_address(page, vma); - pgd_t *pgd; - p4d_t *p4d; - pud_t *pud; - pmd_t *pmd; - pte_t *pte; - - pgd = pgd_offset(vma->vm_mm, address); - if (!pgd_present(*pgd)) - return 0; - p4d = p4d_offset(pgd, address); - if (!p4d_present(*p4d)) - return 0; - pud = pud_offset(p4d, address); - if (!pud_present(*pud)) - return 0; - if (pud_devmap(*pud)) - return PUD_SHIFT; - pmd = pmd_offset(pud, address); - if (!pmd_present(*pmd)) - return 0; - if (pmd_devmap(*pmd)) - return PMD_SHIFT; - pte = pte_offset_map(pmd, address); - if (!pte_present(*pte)) - return 0; - if (pte_devmap(*pte)) - return PAGE_SHIFT; - return 0; -} - /* * Failure handling: if we can't find or can't kill a process there's * not much we can do. We just print a message and ignore otherwise. 
@@ -329,7 +295,9 @@ static void add_to_kill(struct task_struct *tsk, struct page *p, tk->addr = page_address_in_vma(p, vma); tk->addr_valid = 1; if (is_zone_device_page(p)) - tk->size_shift = dev_pagemap_mapping_shift(p, vma); + tk->size_shift = + dev_pagemap_mapping_shift(vma_address(p, vma), + vma->vm_mm); else tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT; diff --git a/mm/util.c b/mm/util.c index 8bf08b5b5760..61bc9bab931d 100644 --- a/mm/util.c +++ b/mm/util.c @@ -780,3 +780,37 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen) out: return res; } + +unsigned long dev_pagemap_mapping_shift(unsigned long address, + struct mm_struct *mm) +{ + pgd_t *pgd; + p4d_t *p4d; + pud_t *pud; + pmd_t *pmd; + pte_t *pte; + + pgd = pgd_offset(mm, address); + if (!pgd_present(*pgd)) + return 0; + p4d = p4d_offset(pgd, address); + if (!p4d_present(*p4d)) + return 0; + pud = pud_offset(p4d, address); + if (!pud_present(*pud)) + return 0; + if (pud_devmap(*pud)) + return PUD_SHIFT; + pmd = pmd_offset(pud, address); + if (!pmd_present(*pmd)) + return 0; + if (pmd_devmap(*pmd)) + return PMD_SHIFT; + pte = pte_offset_map(pmd, address); + if (!pte_present(*pte)) + return 0; + if (pte_devmap(*pte)) + return PAGE_SHIFT; + return 0; +} +EXPORT_SYMBOL_GPL(dev_pagemap_mapping_shift); -- 2.19.1.930.g4563a0d9d0-goog
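dev_pagemap_mapping_shift() reports the level at which the address is mapped as a shift: PUD_SHIFT for a 1 GiB devmap mapping, PMD_SHIFT for 2 MiB, PAGE_SHIFT for a base page, and 0 for "not present". A quick sanity check of what those shifts translate to, using the conventional x86_64 4-level paging constants (assumed here, not taken from the patch):

```python
# Assumed x86_64 values: 4 KiB base pages, 2 MiB PMD, 1 GiB PUD mappings.
PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT = 12, 21, 30

def mapping_size(shift):
    """Turn the shift returned by dev_pagemap_mapping_shift() into a
    mapping size in bytes; 0 means the address is not mapped."""
    return (1 << shift) if shift else 0
```

This is why callers like memory-failure's add_to_kill() can store the shift directly in tk->size_shift and reason about blast radius from it.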
[PATCH 0/2] kvm: Use huge pages for DAX-backed files
This patch series depends on dax pages not being PageReserved. Once that is in place, these changes will let KVM use huge pages with dax-backed files. Without the PageReserved change, KVM and DAX still work with these patches, simply without huge pages - which is the current situation. RFC/discussion thread: https://lore.kernel.org/lkml/20181029210716.212159-1-b...@google.com/ Barret Rhoden (2): mm: make dev_pagemap_mapping_shift() externally visible kvm: Use huge pages for DAX-backed files arch/x86/kvm/mmu.c | 34 -- include/linux/mm.h | 3 +++ mm/memory-failure.c | 38 +++--- mm/util.c | 34 ++ 4 files changed, 72 insertions(+), 37 deletions(-) -- 2.19.1.930.g4563a0d9d0-goog ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH 2/2] kvm: Use huge pages for DAX-backed files
This change allows KVM to map DAX-backed files made of huge pages with huge mappings in the EPT/TDP. DAX pages are not PageTransCompound. The existing check is trying to determine if the mapping for the pfn is a huge mapping or not. For non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound. For DAX, we can check the page table itself. Note that KVM already faulted in the page (or huge page) in the host's page table, and we hold the KVM mmu spinlock (grabbed before checking the mmu seq). Signed-off-by: Barret Rhoden --- arch/x86/kvm/mmu.c | 34 -- 1 file changed, 32 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c index cf5f572f2305..2df8c459dc6a 100644 --- a/arch/x86/kvm/mmu.c +++ b/arch/x86/kvm/mmu.c @@ -3152,6 +3152,36 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn) return -EFAULT; } +static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn) +{ + struct page *page = pfn_to_page(pfn); + unsigned long hva, map_shift; + + if (!is_zone_device_page(page)) + return PageTransCompoundMap(page); + + /* +* DAX pages do not use compound pages. The page should have already +* been mapped into the host-side page table during try_async_pf(), so +* we can check the page tables directly. +*/ + hva = gfn_to_hva(kvm, gfn); + if (kvm_is_error_hva(hva)) + return false; + + /* +* Our caller grabbed the KVM mmu_lock with a successful +* mmu_notifier_retry, so we're safe to walk the page table. 
+*/ + map_shift = dev_pagemap_mapping_shift(hva, current->mm); + switch (map_shift) { + case PMD_SHIFT: + case PUD_SHIFT: + return true; + } + return false; +} + static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, gfn_t *gfnp, kvm_pfn_t *pfnp, int *levelp) @@ -3168,7 +3198,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu, */ if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) && level == PT_PAGE_TABLE_LEVEL && - PageTransCompoundMap(pfn_to_page(pfn)) && + pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) && !mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) { unsigned long mask; /* @@ -5678,7 +5708,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm, */ if (sp->role.direct && !kvm_is_reserved_pfn(pfn) && - PageTransCompoundMap(pfn_to_page(pfn))) { + pfn_is_huge_mapped(kvm, sp->gfn, pfn)) { pte_list_remove(rmap_head, sptep); need_tlb_flush = 1; goto restart; -- 2.19.1.930.g4563a0d9d0-goog
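The decision pfn_is_huge_mapped() makes reduces to: non-device pages keep the PageTransCompoundMap() check, while DAX (zone-device) pages are huge exactly when the host page table maps them at PMD or PUD level. A sketch of that predicate with assumed x86_64 shift values (note the comparison must be against *_SHIFT constants, not *_SIZE — the two differ by orders of magnitude):

```python
PAGE_SHIFT, PMD_SHIFT, PUD_SHIFT = 12, 21, 30   # assumed x86_64 values

def pfn_is_huge_mapped(map_shift, is_zone_device, page_trans_compound):
    """Model of the patch's decision: DAX pages consult the host-side
    page-table walk result; everything else uses the compound-page check."""
    if not is_zone_device:
        return page_trans_compound
    return map_shift in (PMD_SHIFT, PUD_SHIFT)
```

With this in place, transparent_hugepage_adjust() can promote a DAX-backed gfn to a large EPT/TDP mapping whenever the host itself mapped it huge.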
Re: [mm PATCH v5 0/7] Deferred page init improvements
On 18-11-05 13:19:25, Alexander Duyck wrote:
> This patchset is essentially a refactor of the page initialization logic
> that is meant to provide for better code reuse while providing a
> significant improvement in deferred page initialization performance.
>
> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> memory per node I have seen the following. In the case of regular memory
> initialization the deferred init time was decreased from 3.75s to 1.06s on
> average. For the persistent memory the initialization time dropped from
> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> deferred memory initialization performance, and a 26% improvement in the
> persistent memory initialization performance.

Hi Alex,

Please try to run your persistent memory init experiment with Daniel's
patches:

https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/

The performance should improve by much more than 26%.

Overall, your work looks good, but it needs to be considered how easy it
will be to merge with ktask. I will try to complete the review today.

Thank you,
Pasha
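The quoted percentages follow from expressing the speed-up as (old − new) / new: 3.75s → 1.06s is ~253%, and 24.17s → 19.12s is ~26%. A one-liner to check the arithmetic:

```python
def improvement_pct(old_seconds, new_seconds):
    """Relative improvement the way the cover letter states it:
    (old - new) / new, as a percentage."""
    return (old_seconds - new_seconds) / new_seconds * 100.0
```

So the "253%" figure is a rate improvement (about 3.5x faster), not a reduction of the init time by that fraction.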
[PATCH 01/11] keys-encrypted: add nvdimm key format type to encrypted keys
Adding nvdimm key format type to encrypted keys in order to limit the size of the key to 32bytes. Signed-off-by: Dave Jiang --- security/keys/encrypted-keys/encrypted.c | 29 - 1 file changed, 20 insertions(+), 9 deletions(-) diff --git a/security/keys/encrypted-keys/encrypted.c b/security/keys/encrypted-keys/encrypted.c index d92cbf9687c3..182b4f136bdf 100644 --- a/security/keys/encrypted-keys/encrypted.c +++ b/security/keys/encrypted-keys/encrypted.c @@ -45,6 +45,7 @@ static const char hmac_alg[] = "hmac(sha256)"; static const char blkcipher_alg[] = "cbc(aes)"; static const char key_format_default[] = "default"; static const char key_format_ecryptfs[] = "ecryptfs"; +static const char key_format_nvdimm[] = "nvdimm"; static unsigned int ivsize; static int blksize; @@ -54,6 +55,7 @@ static int blksize; #define HASH_SIZE SHA256_DIGEST_SIZE #define MAX_DATA_SIZE 4096 #define MIN_DATA_SIZE 20 +#define KEY_NVDIMM_PAYLOAD_LEN 32 static struct crypto_shash *hash_tfm; @@ -62,12 +64,13 @@ enum { }; enum { - Opt_error = -1, Opt_default, Opt_ecryptfs + Opt_error = -1, Opt_default, Opt_ecryptfs, Opt_nvdimm }; static const match_table_t key_format_tokens = { {Opt_default, "default"}, {Opt_ecryptfs, "ecryptfs"}, + {Opt_nvdimm, "nvdimm"}, {Opt_error, NULL} }; @@ -195,6 +198,7 @@ static int datablob_parse(char *datablob, const char **format, key_format = match_token(p, key_format_tokens, args); switch (key_format) { case Opt_ecryptfs: + case Opt_nvdimm: case Opt_default: *format = p; *master_desc = strsep(&datablob, " \t"); @@ -625,15 +629,22 @@ static struct encrypted_key_payload *encrypted_key_alloc(struct key *key, format_len = (!format) ? 
strlen(key_format_default) : strlen(format); decrypted_datalen = dlen; payload_datalen = decrypted_datalen; - if (format && !strcmp(format, key_format_ecryptfs)) { - if (dlen != ECRYPTFS_MAX_KEY_BYTES) { - pr_err("encrypted_key: keylen for the ecryptfs format " - "must be equal to %d bytes\n", - ECRYPTFS_MAX_KEY_BYTES); - return ERR_PTR(-EINVAL); + if (format) { + if (!strcmp(format, key_format_ecryptfs)) { + if (dlen != ECRYPTFS_MAX_KEY_BYTES) { + pr_err("encrypted_key: keylen for the ecryptfs format must be equal to %d bytes\n", + ECRYPTFS_MAX_KEY_BYTES); + return ERR_PTR(-EINVAL); + } + decrypted_datalen = ECRYPTFS_MAX_KEY_BYTES; + payload_datalen = sizeof(struct ecryptfs_auth_tok); + } else if (!strcmp(format, key_format_nvdimm)) { + if (decrypted_datalen != KEY_NVDIMM_PAYLOAD_LEN) { + pr_err("encrypted_key: nvdimm key payload incorrect length: %d\n", + decrypted_datalen); + return ERR_PTR(-EINVAL); + } } - decrypted_datalen = ECRYPTFS_MAX_KEY_BYTES; - payload_datalen = sizeof(struct ecryptfs_auth_tok); } encrypted_datalen = roundup(decrypted_datalen, blksize); ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
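The patch's restructured encrypted_key_alloc() enforces a per-format payload length: ecryptfs keys must be exactly ECRYPTFS_MAX_KEY_BYTES, and the new nvdimm format must be exactly KEY_NVDIMM_PAYLOAD_LEN (32) bytes. A user-space model of that policy check (the helper name is invented; ECRYPTFS_MAX_KEY_BYTES is assumed to be 64, matching the eCryptfs maximum key size):

```python
ECRYPTFS_MAX_KEY_BYTES = 64   # assumed eCryptfs maximum key size
KEY_NVDIMM_PAYLOAD_LEN = 32   # from the patch

def check_payload_len(key_format, dlen):
    """Mirror the per-format length policy in encrypted_key_alloc()."""
    if key_format == "ecryptfs" and dlen != ECRYPTFS_MAX_KEY_BYTES:
        raise ValueError("ecryptfs keys must be %d bytes" % ECRYPTFS_MAX_KEY_BYTES)
    if key_format == "nvdimm" and dlen != KEY_NVDIMM_PAYLOAD_LEN:
        raise ValueError("nvdimm keys must be %d bytes" % KEY_NVDIMM_PAYLOAD_LEN)
    return dlen   # "default" format accepts the caller's length
```

This captures why the format token has to be parsed before the payload is sized: the format dictates the only legal passphrase length for security DSMs.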
[PATCH 00/11] Additional patches for nvdimm security support
The following series adds additional support for nvdimm security. 1. Converted logon keys to encrypted-keys. 2. Add overwrite DSM support 3. Add DSM 1.8 master passphrase support The patch series is based off the branch from here: https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=for-5.0/nvdimm-security Instead of squashing the previous changes, they are kept for history purposes to document how we arrived to the current iteration. Mimi, Patch 1 requires your ack as it makes changes to encrypted-keys by adding the nvdimm key format type. Dan wanted to restrict the key to 32bytes during creation. I also wouldn't mind if you take a look at patch 2 and make sure I'm providing correct usage of encrypted keys. Thank you! --- Dave Jiang (11): keys-encrypted: add nvdimm key format type to encrypted keys libnvdimm/security: change clear text nvdimm keys to encrypted keys libnvdimm/security: add override module param for key self verification libnvdimm/security: introduce NDD_SECURITY_BUSY flag acpi/nfit, libnvdimm/security: Add security DSM overwrite support tools/testing/nvdimm: Add overwrite support for nfit_test libnvdimm/security: add overwrite status notification libnvdimm/security: add documentation for ovewrite acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test acpi/nfit: prevent indiscriminate DSM payload dumping for security DSMs Documentation/nvdimm/security.txt| 68 +++- drivers/acpi/nfit/Kconfig|7 drivers/acpi/nfit/core.c | 31 ++ drivers/acpi/nfit/intel.c| 245 +++ drivers/acpi/nfit/intel.h| 22 + drivers/acpi/nfit/nfit.h |7 drivers/nvdimm/core.c|3 drivers/nvdimm/dimm.c|3 drivers/nvdimm/dimm_devs.c | 50 +++ drivers/nvdimm/nd-core.h |9 - drivers/nvdimm/nd.h | 19 + drivers/nvdimm/region_devs.c |7 drivers/nvdimm/security.c| 496 +- include/linux/libnvdimm.h| 22 + security/keys/encrypted-keys/encrypted.c | 29 +- tools/testing/nvdimm/dimm_devs.c |2 
tools/testing/nvdimm/test/nfit.c | 141 + 17 files changed, 895 insertions(+), 266 deletions(-) -- ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH 02/11] libnvdimm/security: change clear text nvdimm keys to encrypted keys
In order to make nvdimm more secure, encrypted keys will be used instead of clear text keys. A master key will be created to seal encrypted nvdimm keys. The master key can be a trusted key generated from TPM 2.0 or a less secure user key. In the process of this conversion, the kernel cached key will be removed in order to simplify the verification process. The hardware will be used to verify the decrypted user payload directly. Signed-off-by: Dave Jiang --- Documentation/nvdimm/security.txt | 29 ++- drivers/nvdimm/dimm.c |3 drivers/nvdimm/dimm_devs.c|2 drivers/nvdimm/nd-core.h |3 drivers/nvdimm/nd.h |5 - drivers/nvdimm/security.c | 316 ++--- 6 files changed, 108 insertions(+), 250 deletions(-) diff --git a/Documentation/nvdimm/security.txt b/Documentation/nvdimm/security.txt index 50cbb6cb96a1..11240ce48755 100644 --- a/Documentation/nvdimm/security.txt +++ b/Documentation/nvdimm/security.txt @@ -39,34 +39,39 @@ The DIMM id would be provided along with the key payload (passphrase) to the kernel. The security keys are managed on the basis of a single key per DIMM. The -key "passphrase" is expected to be 32bytes long or padded to 32bytes. This is +key "passphrase" is expected to be 32bytes long. This is similar to the ATA security specification [2]. A key is initially acquired via the request_key() kernel API call and retrieved from userspace. It is up to the user to provide an upcall application to retrieve the key in whatever fashion meets their security requirements. -A nvdimm user logon key has the description format of: +A nvdimm encrypted key has the description format of: nvdimm: +See Documentation/security/keys/trusted-encrypted.rst for details on encrypted +keys. The encrypted key for the DIMM is generated from a master key that is +either a trusted key created via TPM2 or a less secure user key. The payload +for the nvdimm encrypted key is randomly generated by the kernel and is not +visible to the user. 
The user is responsible for retaining the encrypted key +blob generated from the encrypted key to be loaded at the next boot session. + 4. Unlocking When the DIMMs are being enumerated by the kernel, the kernel will attempt to -retrieve the key from its keyring. If that fails, it will attempt to acquire the key from the userspace upcall function. This is the only time a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked until reboot. 5. Update - -When doing an update, it is expected that the new key with the 64bit payload of -format described above is added via the keyutils API or utility. The update -command written to the sysfs attribute will be with the format: +When doing an update, it is expected that the new key is added via the +keyutils API or utility. The update command written to the sysfs attribute +will be with the format: update -It is expected that a user logon key has been injected via keyutils to provide -the payload for the update operation. The kernel will take the new user key, -attempt the update operation with the nvdimm, and replace the existing key's -payload with the new passphrase. +It is expected that a encrypted key has been injected via keyutils to provide +the payload for the update operation. The kernel will take the new user key +and the old user key to attempt the update operation. If there is no old key id due to a security enabling, then a 0 should be passed in. If a nvdimm has an existing passphrase, then an "old" key should @@ -80,7 +85,7 @@ frozen by a user with root privelege. 7. Disable -- The security disable command format is: -disable +disable An "old" key with the passphrase payload that is tied to the nvdimm should be injected with a key description that does not have the "nvdimm:" prefix and @@ -89,7 +94,7 @@ its keyid should be passed in via sysfs. 8. 
Secure Erase --- The command format for doing a secure erase is: -erase +erase An "old" key with the passphrase payload that is tied to the nvdimm should be injected with a key description that does not have the "nvdimm:" prefix and diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c index 225a8474c44a..c3886a22b5e4 100644 --- a/drivers/nvdimm/dimm.c +++ b/drivers/nvdimm/dimm.c @@ -112,9 +112,6 @@ static int nvdimm_probe(struct device *dev) static int nvdimm_remove(struct device *dev) { struct nvdimm_drvdata *ndd = dev_get_drvdata(dev); - struct nvdimm *nvdimm = to_nvdimm(dev); - - nvdimm_security_release(nvdimm); if (!ndd) return 0; diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c index 5d34251c4f3b..269f401a5e68 100644 --- a/drivers/nvdimm/dimm_devs.c +++ b/drivers/nvdimm/dimm_devs.c @@ -485,7 +485,7 @@ struct nvdimm *__nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_dat nvdimm->nu
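The sysfs command formats documented above are all space-separated strings of an operation plus one or two key serials, with 0 standing in for "no previous key" when security is first enabled. A hedged sketch of the parsing a tool (or the kernel's sysfs store handler) would perform — the function name is illustrative, not from the patch:

```python
def parse_security_cmd(buf):
    """Split a sysfs security command such as 'update 0 12345678' or
    'erase 42' into (operation, [key serials]). Serial 0 means no key."""
    parts = buf.split()
    if not parts:
        raise ValueError("empty command")
    op, serials = parts[0], [int(x) for x in parts[1:]]
    return op, serials
```

For example, enabling a passphrase on a previously unsecured DIMM is written as `update 0 <new keyid>`, which parses to `("update", [0, <new keyid>])`.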
[PATCH 03/11] libnvdimm/security: add override module param for key self verification
Provide the user an override via kernel module parameter for security key self verification. no_key_self_verify parameter is being added to bypass security key verify against the hardware during nvdimm unlock path. Signed-off-by: Dave Jiang --- drivers/nvdimm/security.c | 11 +++ 1 file changed, 11 insertions(+) diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c index ee741199d623..d2831e61f3d8 100644 --- a/drivers/nvdimm/security.c +++ b/drivers/nvdimm/security.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 /* Copyright(c) 2018 Intel Corporation. All rights reserved. */ +#include #include #include #include @@ -14,6 +15,10 @@ #include "nd-core.h" #include "nd.h" +static bool no_key_self_verify; +module_param(no_key_self_verify, bool, 0644); +MODULE_PARM_DESC(no_key_self_verify, "Bypass security key self verify"); + /* * Retrieve user injected key */ @@ -235,6 +240,12 @@ int nvdimm_security_unlock_dimm(struct nvdimm *nvdimm) * other security operations. */ if (nvdimm->state == NVDIMM_SECURITY_UNLOCKED) { + /* bypass if user override */ + if (no_key_self_verify) { + mutex_unlock(&nvdimm->sec_mutex); + return 0; + } + key = nvdimm_self_verify_key(nvdimm); if (!key) { rc = nvdimm_security_freeze_lock(nvdimm); ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
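The override only matters on the unlock path of an already-unlocked DIMM: normally the cached key is verified against the hardware, but with no_key_self_verify set that step is skipped entirely. A toy model of the control flow (state strings and the callback are invented for illustration):

```python
def unlock(state, hw_verify, no_key_self_verify=False):
    """Model of nvdimm_security_unlock_dimm(): when the DIMM is already
    unlocked, optionally bypass the hardware key self-verification."""
    if state == "unlocked":
        if no_key_self_verify:
            return 0               # user override: trust without verifying
        return 0 if hw_verify() else -1   # -1 stands in for an errno
    return 0                        # locked path handled elsewhere
```

The test below confirms the verification callback is never invoked when the module parameter is set.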
[PATCH 04/11] libnvdimm/security: introduce NDD_SECURITY_BUSY flag
Adding a flag for nvdimm->flags to support erase functions. While it's ok to hold the nvdimm_bus lock for secure erase due to minimal time to execute the command, overwrite requires a significantly longer time and makes this impossible. The flag will block any drivers from being loaded and DIMMs being probed. Signed-off-by: Dave Jiang --- drivers/nvdimm/nd.h |1 + drivers/nvdimm/region_devs.c |7 + drivers/nvdimm/security.c| 61 ++ include/linux/libnvdimm.h|2 + 4 files changed, 65 insertions(+), 6 deletions(-) diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h index f8d8f0a2a40d..6cb1cd4a39d0 100644 --- a/drivers/nvdimm/nd.h +++ b/drivers/nvdimm/nd.h @@ -250,6 +250,7 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t phys, void nvdimm_set_aliasing(struct device *dev); void nvdimm_set_locked(struct device *dev); void nvdimm_clear_locked(struct device *dev); +int nvdimm_check_security_busy(struct nvdimm *nvdimm); struct nd_btt *to_nd_btt(struct device *dev); struct nd_gen_sb { diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c index 174a418cb171..a097282b2c01 100644 --- a/drivers/nvdimm/region_devs.c +++ b/drivers/nvdimm/region_devs.c @@ -78,6 +78,13 @@ int nd_region_activate(struct nd_region *nd_region) for (i = 0; i < nd_region->ndr_mappings; i++) { struct nd_mapping *nd_mapping = &nd_region->mapping[i]; struct nvdimm *nvdimm = nd_mapping->nvdimm; + int rc; + + rc = nvdimm_check_security_busy(nvdimm); + if (rc) { + nvdimm_bus_unlock(&nd_region->dev); + return rc; + } /* at least one null hint slot per-dimm for the "no-hint" case */ flush_data_size += sizeof(void *); diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c index d2831e61f3d8..2a83be47798e 100644 --- a/drivers/nvdimm/security.c +++ b/drivers/nvdimm/security.c @@ -19,6 +19,27 @@ static bool no_key_self_verify; module_param(no_key_self_verify, bool, 0644); MODULE_PARM_DESC(no_key_self_verify, "Bypass security key self verify"); +/* + * Check if we are doing 
security wipes + */ +int nvdimm_check_security_busy(struct nvdimm *nvdimm) +{ + if (test_bit(NDD_SECURITY_BUSY, &nvdimm->flags)) + return -EBUSY; + + return 0; +} + +static inline void nvdimm_set_security_busy(struct nvdimm *nvdimm) +{ + set_bit(NDD_SECURITY_BUSY, &nvdimm->flags); +} + +static inline void nvdimm_clear_security_busy(struct nvdimm *nvdimm) +{ + clear_bit(NDD_SECURITY_BUSY, &nvdimm->flags); +} + /* * Retrieve user injected key */ @@ -85,6 +106,13 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, unsigned int keyid) nvdimm_bus_lock(dev); mutex_lock(&nvdimm->sec_mutex); + rc = nvdimm_check_security_busy(nvdimm); + if (rc < 0) { + dev_warn(dev, "Security operation in progress.\n"); + goto out; + } + + nvdimm_set_security_busy(nvdimm); if (atomic_read(&nvdimm->busy)) { dev_warn(dev, "Unable to secure erase while DIMM active.\n"); rc = -EBUSY; @@ -113,6 +141,7 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, unsigned int keyid) key_put(key); out: + nvdimm_clear_security_busy(nvdimm); mutex_unlock(&nvdimm->sec_mutex); nvdimm_bus_unlock(dev); nvdimm_security_get_state(nvdimm); @@ -130,15 +159,19 @@ int nvdimm_security_freeze_lock(struct nvdimm *nvdimm) return -EOPNOTSUPP; mutex_lock(&nvdimm->sec_mutex); + rc = nvdimm_check_security_busy(nvdimm); + if (rc < 0) + goto out; + rc = nvdimm->security_ops->freeze_lock(nvdimm); - if (rc < 0) { - mutex_unlock(&nvdimm->sec_mutex); - return rc; - } + if (rc < 0) + goto out; nvdimm_security_get_state(nvdimm); + +out: mutex_unlock(&nvdimm->sec_mutex); - return 0; + return rc; } int nvdimm_security_disable(struct nvdimm *nvdimm, unsigned int keyid) @@ -156,6 +189,12 @@ int nvdimm_security_disable(struct nvdimm *nvdimm, unsigned int keyid) return -EOPNOTSUPP; mutex_lock(&nvdimm->sec_mutex); + rc = nvdimm_check_security_busy(nvdimm); + if (rc < 0) { + mutex_unlock(&nvdimm->sec_mutex); + return rc; + } + /* look for a key from cached key */ key = nvdimm_lookup_user_key(dev, keyid); if (!key) { @@ -232,6 +271,12 @@ int 
nvdimm_security_unlock_dimm(struct nvdimm *nvdimm) return 0; mutex_lock(&nvdimm->sec_mutex); + rc = nvdimm_check_security_busy(nvdimm); + if (rc < 0) { + mutex_unlock(&nvdimm->sec_mutex); + return rc; + } +
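The NDD_SECURITY_BUSY protocol above is a simple flag-guard: every security entry point checks the bit and bails with -EBUSY, while the long-running operation sets it on entry and clears it on exit. A user-space sketch of those three helpers (plain integer bit ops here; the kernel uses atomic test_bit/set_bit/clear_bit, and the bit position is an assumption):

```python
NDD_SECURITY_BUSY = 1 << 0   # illustrative bit position

def check_security_busy(flags):
    """Mirror nvdimm_check_security_busy(): nonzero result means busy."""
    return -1 if flags & NDD_SECURITY_BUSY else 0   # -1 stands in for -EBUSY

def set_security_busy(flags):
    return flags | NDD_SECURITY_BUSY

def clear_security_busy(flags):
    return flags & ~NDD_SECURITY_BUSY
```

This is why overwrite can drop the nvdimm_bus lock for its multi-minute runtime: the busy bit, not the lock, is what keeps drivers and other security ops away from the DIMM.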
[PATCH 07/11] libnvdimm/security: add overwrite status notification
Adding sysfs notification for when overwrite has completed to allow user monitoring app to be aware of overwrite completion status. Signed-off-by: Dave Jiang --- drivers/acpi/nfit/core.c |5 + drivers/nvdimm/dimm_devs.c | 10 ++ drivers/nvdimm/nd-core.h |1 + drivers/nvdimm/security.c |2 ++ include/linux/libnvdimm.h |1 + 5 files changed, 19 insertions(+) diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index de4e00059277..3e6c7b653872 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2033,6 +2033,11 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc) if (!nvdimm) continue; + rc = nvdimm_setup_security_events(nvdimm); + if (rc < 0) + dev_warn(acpi_desc->dev, + "no security event setup failed\n"); + nfit_kernfs = sysfs_get_dirent(nvdimm_kobj(nvdimm)->sd, "nfit"); if (nfit_kernfs) nfit_mem->flags_attr = sysfs_get_dirent(nfit_kernfs, diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c index b613c131bfb9..39e40074b5df 100644 --- a/drivers/nvdimm/dimm_devs.c +++ b/drivers/nvdimm/dimm_devs.c @@ -508,6 +508,16 @@ struct nvdimm *__nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_dat } EXPORT_SYMBOL_GPL(__nvdimm_create); +int nvdimm_setup_security_events(struct nvdimm *nvdimm) +{ + nvdimm->overwrite_state = sysfs_get_dirent(nvdimm->dev.kobj.sd, + "security"); + if (!nvdimm->overwrite_state) + return -ENODEV; + return 0; +} +EXPORT_SYMBOL_GPL(nvdimm_setup_security_events); + int alias_dpa_busy(struct device *dev, void *data) { resource_size_t map_end, blk_start, new; diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h index 20a8216c503d..b96e1b10e3eb 100644 --- a/drivers/nvdimm/nd-core.h +++ b/drivers/nvdimm/nd-core.h @@ -49,6 +49,7 @@ struct nvdimm { struct mutex sec_mutex; struct delayed_work dwork; unsigned int overwrite_tmo; + struct kernfs_node *overwrite_state; }; /** diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c index 725acd24..f5ba633545b7 100644 --- 
a/drivers/nvdimm/security.c +++ b/drivers/nvdimm/security.c @@ -122,6 +122,8 @@ void nvdimm_overwrite_query(struct work_struct *work) else dev_info(&nvdimm->dev, "Overwrite completed\n"); + if (nvdimm->overwrite_state) + sysfs_notify_dirent(nvdimm->overwrite_state); nvdimm->overwrite_tmo = 0; nvdimm_clear_security_busy(nvdimm); nvdimm_security_get_state(nvdimm); diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h index 479421ce62c0..c3c5a1c6b1b7 100644 --- a/include/linux/libnvdimm.h +++ b/include/linux/libnvdimm.h @@ -227,6 +227,7 @@ static inline struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, cmd_mask, num_flush, flush_wpq, NULL, NULL); } +int nvdimm_setup_security_events(struct nvdimm *nvdimm); const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd); const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd); u32 nd_cmd_in_size(struct nvdimm *nvdimm, int cmd, ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH 08/11] libnvdimm/security: add documentation for overwrite
Add overwrite command usage to security documentation. Signed-off-by: Dave Jiang --- Documentation/nvdimm/security.txt | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/Documentation/nvdimm/security.txt b/Documentation/nvdimm/security.txt index 11240ce48755..dfe70a8fa25b 100644 --- a/Documentation/nvdimm/security.txt +++ b/Documentation/nvdimm/security.txt @@ -96,9 +96,19 @@ its keyid should be passed in via sysfs. The command format for doing a secure erase is: erase -An "old" key with the passphrase payload that is tied to the nvdimm should be -injected with a key description that does not have the "nvdimm:" prefix and -its keyid should be passed in via sysfs. +9. Overwrite + +The command format for doing an overwrite is: +overwrite + +Overwrite can be done without a key if security is not enabled. A key serial +of 0 can be passed in to indicate no key. + +The sysfs attribute "security" can be polled to wait on overwrite completion. +Overwrite can last tens of minutes or more depending on nvdimm size. + +An encrypted key with the current key passphrase that is tied to the nvdimm +should be injected and its keyid should be passed in via sysfs. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf [2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
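The documentation above says the "security" sysfs attribute can be polled to wait on overwrite completion. A minimal userspace sketch of that polling flow follows. This is illustrative only: the attribute path and the exact state strings depend on the kernel version and DIMM provider, so the path used below and the helper names are assumptions, not part of the patch.

```c
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Return 1 once the reported security state no longer reads "overwrite".
 * The state strings ("overwrite", "disabled", ...) are assumed from the
 * security_show() handler quoted earlier in this series. */
int overwrite_done(const char *state)
{
	return strncmp(state, "overwrite", strlen("overwrite")) != 0;
}

/* Block until the DIMM leaves the overwrite state. attr_path is assumed
 * to be something like /sys/bus/nd/devices/nmem0/security; the kernel's
 * sysfs_notify_dirent() wakes the poll() with POLLPRI. Returns 0 on
 * completion, -1 on error. */
int wait_overwrite(const char *attr_path)
{
	char buf[64];
	struct pollfd pfd;
	ssize_t n;

	pfd.fd = open(attr_path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLPRI | POLLERR;

	for (;;) {
		/* sysfs attributes must be re-read from offset 0 */
		n = pread(pfd.fd, buf, sizeof(buf) - 1, 0);
		if (n < 0)
			break;
		buf[n] = '\0';
		if (overwrite_done(buf)) {
			close(pfd.fd);
			return 0;
		}
		if (poll(&pfd, 1, -1) < 0)
			break;
	}
	close(pfd.fd);
	return -1;
}
```

Since overwrite can run for tens of minutes, a real monitor would likely add a timeout to the poll() call rather than blocking indefinitely.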
[PATCH 11/11] acpi/nfit: prevent indiscriminate DSM payload dumping for security DSMs
Right now when debug is enabled, we dump the command buffer indiscriminately. This exposes the clear text payload for security DSMs. Introduce a kernel config option that gates payload dumping, so that production kernels can leave it off and avoid exposing the passphrases. Signed-off-by: Dave Jiang --- drivers/acpi/nfit/Kconfig |7 +++ drivers/acpi/nfit/core.c | 24 +--- 2 files changed, 28 insertions(+), 3 deletions(-) diff --git a/drivers/acpi/nfit/Kconfig b/drivers/acpi/nfit/Kconfig index f7c57e33499e..a0a8eabda2e8 100644 --- a/drivers/acpi/nfit/Kconfig +++ b/drivers/acpi/nfit/Kconfig @@ -13,3 +13,10 @@ config ACPI_NFIT To compile this driver as a module, choose M here: the module will be called nfit. + +config NFIT_SECURITY_DEBUG + bool "Turn on debugging for NVDIMM security features" + depends on ACPI_NFIT + help + Turn on debug output for NVDIMM security DSM commands. This + should not be enabled on a production kernel. diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 867e6fea3737..baaf5308de35 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -405,6 +405,21 @@ static u8 nfit_dsm_revid(unsigned family, unsigned func) return id; } +static bool is_security_cmd(unsigned int cmd, unsigned int func, + unsigned int family) +{ + if (cmd != ND_CMD_CALL) + return false; + + if (family == NVDIMM_FAMILY_INTEL) { + if (func >= NVDIMM_INTEL_GET_SECURITY_STATE && + func <= NVDIMM_INTEL_MASTER_SECURE_ERASE) + return true; + } + + return false; +} + int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm, unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc) { @@ -489,9 +504,12 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm, dev_dbg(dev, "%s cmd: %d: func: %d input length: %d\n", dimm_name, cmd, func, in_buf.buffer.length); - print_hex_dump_debug("nvdimm in ", DUMP_PREFIX_OFFSET, 4, 4, - in_buf.buffer.pointer, - min_t(u32, 256, 
in_buf.buffer.length), true); + if ((call_pkg && !is_security_cmd(cmd, func, call_pkg->nd_family)) || + IS_ENABLED(CONFIG_NFIT_SECURITY_DEBUG)) { + print_hex_dump_debug("nvdimm in ", DUMP_PREFIX_OFFSET, 4, 4, + in_buf.buffer.pointer, + min_t(u32, 256, in_buf.buffer.length), true); + } /* call the BIOS, prefer the named methods over _DSM if available */ if (nvdimm && cmd == ND_CMD_GET_CONFIG_SIZE
[PATCH 06/11] tools/testing/nvdimm: Add overwrite support for nfit_test
With the implementation of Intel NVDIMM DSM overwrite, we are adding unit test to nfit_test for testing of overwrite operation. Signed-off-by: Dave Jiang --- drivers/acpi/nfit/intel.h|1 + tools/testing/nvdimm/dimm_devs.c |2 + tools/testing/nvdimm/test/nfit.c | 55 ++ 3 files changed, 58 insertions(+) diff --git a/drivers/acpi/nfit/intel.h b/drivers/acpi/nfit/intel.h index a4e01de966dc..0140d9b6a0b7 100644 --- a/drivers/acpi/nfit/intel.h +++ b/drivers/acpi/nfit/intel.h @@ -52,6 +52,7 @@ extern const struct nvdimm_security_ops *intel_security_ops; #define ND_INTEL_STATUS_INVALID_PASS 11 #define ND_INTEL_STATUS_OVERWRITE_UNSUPPORTED 0x10007 #define ND_INTEL_STATUS_OQUERY_INPROGRESS 0x10007 +#define ND_INTEL_STATUS_OQUERY_SEQUENCE_ERR0x20007 #define ND_INTEL_SEC_STATE_ENABLED 0x02 #define ND_INTEL_SEC_STATE_LOCKED 0x04 diff --git a/tools/testing/nvdimm/dimm_devs.c b/tools/testing/nvdimm/dimm_devs.c index de3a22988dc5..3f8ad3d445d2 100644 --- a/tools/testing/nvdimm/dimm_devs.c +++ b/tools/testing/nvdimm/dimm_devs.c @@ -29,6 +29,8 @@ ssize_t security_show(struct device *dev, return sprintf(buf, "locked\n"); case NVDIMM_SECURITY_FROZEN: return sprintf(buf, "frozen\n"); + case NVDIMM_SECURITY_OVERWRITE: + return sprintf(buf, "overwrite\n"); case NVDIMM_SECURITY_UNSUPPORTED: default: return sprintf(buf, "unsupported\n"); diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index e4b560494e0a..c885fe136f42 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -149,6 +149,7 @@ static int dimm_fail_cmd_code[NUM_DCR]; struct nfit_test_sec { u8 state; u8 passphrase[32]; + u64 overwrite_end_time; } dimm_sec_info[NUM_DCR]; static const struct nd_intel_smart smart_def = { @@ -1073,6 +1074,50 @@ static int nd_intel_test_cmd_secure_erase(struct nfit_test *t, return 0; } +static int nd_intel_test_cmd_overwrite(struct nfit_test *t, + struct nd_intel_overwrite *nd_cmd, + unsigned int buf_len, int dimm) +{ + struct device *dev 
= &t->pdev.dev; + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; + + if ((sec->state & ND_INTEL_SEC_STATE_ENABLED) && + memcmp(nd_cmd->passphrase, sec->passphrase, + ND_INTEL_PASSPHRASE_SIZE) != 0) { + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS; + dev_dbg(dev, "overwrite: wrong passphrase\n"); + return 0; + } + + memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE); + sec->state = ND_INTEL_SEC_STATE_OVERWRITE; + dev_dbg(dev, "overwrite progressing.\n"); + sec->overwrite_end_time = get_jiffies_64() + 5 * HZ; + + return 0; +} + +static int nd_intel_test_cmd_query_overwrite(struct nfit_test *t, + struct nd_intel_query_overwrite *nd_cmd, + unsigned int buf_len, int dimm) +{ + struct device *dev = &t->pdev.dev; + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; + + if (!(sec->state & ND_INTEL_SEC_STATE_OVERWRITE)) { + nd_cmd->status = ND_INTEL_STATUS_OQUERY_SEQUENCE_ERR; + return 0; + } + + if (time_is_before_jiffies64(sec->overwrite_end_time)) { + sec->overwrite_end_time = 0; + sec->state = 0; + dev_dbg(dev, "overwrite is complete\n"); + } else + nd_cmd->status = ND_INTEL_STATUS_OQUERY_INPROGRESS; + return 0; +} + static int get_dimm(struct nfit_mem *nfit_mem, unsigned int func) { int i; @@ -1144,6 +1189,14 @@ static int nfit_test_ctl(struct nvdimm_bus_descriptor *nd_desc, rc = nd_intel_test_cmd_secure_erase(t, buf, buf_len, i); break; + case NVDIMM_INTEL_OVERWRITE: + rc = nd_intel_test_cmd_overwrite(t, + buf, buf_len, i - t->dcr_idx); + break; + case NVDIMM_INTEL_QUERY_OVERWRITE: + rc = nd_intel_test_cmd_query_overwrite(t, + buf, buf_len, i - t->dcr_idx); + break; case ND_INTEL_ENABLE_LSS_STATUS: rc = nd_intel_test_cmd_set_lss_status(t, buf, buf_len); @@ -2379,6 +2432,8 @@ static void nfit_test0_setup(struct nfit_test *t) set_bit(NVDIMM_INTEL_UNLOCK_UNIT, &acpi_desc->dimm_cmd_force_en); set_bit(NVDIMM_INTEL_FREEZE_LOCK, &acpi_desc->dimm_cmd_force_en); set_bit(NVDIMM_INTEL_SECURE_ERASE, &acpi_desc->dimm_cmd
[PATCH 09/11] acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support
With Intel DSM 1.8 [1] two new security DSMs are introduced: enable/update master passphrase and master secure erase. The master passphrase allows a secure erase to be performed without the user passphrase that is set on the NVDIMM. The commands of master_update and master_erase are added to the sysfs knob in order to initiate the DSMs. They are similar in operation mechanism compared to update and erase. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang --- Documentation/nvdimm/security.txt | 23 ++ drivers/acpi/nfit/core.c |2 + drivers/acpi/nfit/intel.c | 132 - drivers/acpi/nfit/intel.h | 18 + drivers/acpi/nfit/nfit.h |6 +- drivers/nvdimm/dimm_devs.c| 16 drivers/nvdimm/nd-core.h |1 drivers/nvdimm/nd.h |5 + drivers/nvdimm/security.c | 34 -- include/linux/libnvdimm.h | 15 10 files changed, 236 insertions(+), 16 deletions(-) diff --git a/Documentation/nvdimm/security.txt b/Documentation/nvdimm/security.txt index dfe70a8fa25b..a2d954349417 100644 --- a/Documentation/nvdimm/security.txt +++ b/Documentation/nvdimm/security.txt @@ -110,5 +110,28 @@ Overwrite can last tens of minutes or more depending on nvdimm size. An encrypted key with the current key passphrase that is tied to the nvdimm should be injected and its keyid should be passed in via sysfs. +10. Master Update +- +The command format for doing a master update is: +update + +The operating mechanism for master update is identical to update except the +master passphrase key is passed to the kernel. The master passphrase key +is just another encrypted key. + +This command is only available when security is disabled. + +11. Master Erase + +The command format for doing a master erase is: +master_erase + +This command has the same operating mechanism as erase except the master +passphrase key is passed to the kernel. The master passphrase key is just +another encrypted key. 
+ +This command is only available when the master security is enabled, indicated +by the extended security status. + [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf [2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index 3e6c7b653872..867e6fea3737 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -389,6 +389,8 @@ static u8 nfit_dsm_revid(unsigned family, unsigned func) [NVDIMM_INTEL_SECURE_ERASE] = 2, [NVDIMM_INTEL_OVERWRITE] = 2, [NVDIMM_INTEL_QUERY_OVERWRITE] = 2, + [NVDIMM_INTEL_SET_MASTER_PASSPHRASE] = 2, + [NVDIMM_INTEL_MASTER_SECURE_ERASE] = 2, }, }; u8 id; diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index f6d2b217f1c8..4e47fc2b524c 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -20,6 +20,123 @@ #include /* for wbinvd_on_all_cpus() on !SMP builds */ +static int intel_dimm_security_master_erase(struct nvdimm *nvdimm, + const struct nvdimm_key_data *nkey) +{ + int cmd_rc, rc = 0; + struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); + struct { + struct nd_cmd_pkg pkg; + struct nd_intel_secure_erase cmd; + } nd_cmd = { + .pkg = { + .nd_command = NVDIMM_INTEL_MASTER_SECURE_ERASE, + .nd_family = NVDIMM_FAMILY_INTEL, + .nd_size_in = ND_INTEL_PASSPHRASE_SIZE, + .nd_size_out = ND_INTEL_STATUS_SIZE, + .nd_fw_size = ND_INTEL_STATUS_SIZE, + }, + .cmd = { + .status = 0, + }, + }; + + if (!test_bit(NVDIMM_INTEL_MASTER_SECURE_ERASE, &nfit_mem->dsm_mask)) + return -ENOTTY; + + /* flush all cache before we erase DIMM */ + wbinvd_on_all_cpus(); + memcpy(nd_cmd.cmd.passphrase, nkey->data, + sizeof(nd_cmd.cmd.passphrase)); + rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc); + if (rc < 0) + goto out; + if (cmd_rc < 0) { + rc = cmd_rc; + goto out; + } + + switch (nd_cmd.cmd.status) { + case 0: + break; + case ND_INTEL_STATUS_NOT_SUPPORTED: + rc = 
-EOPNOTSUPP; + goto out; + case ND_INTEL_STATUS_INVALID_PASS: + rc = -EINVAL; + goto out; + case ND_INTEL_STATUS_INVALID_STATE: + default: + rc = -ENXIO; + goto out; + } + + /* DIMM erased, invalidate all
[PATCH 05/11] acpi/nfit, libnvdimm/security: Add security DSM overwrite support
We are adding support for the security calls of overwrite and query overwrite from Intel DSM spec v1.7. This will allow triggering of overwrite on Intel NVDIMMs. The overwrite operation can take tens of minutes. When the overwrite DSM is issued successfully, the NVDIMMs will be inaccessible. The kernel will do backoff polling to detect when the overwrite process is completed. According to the DSM spec v1.7, the 128G NVDIMMs can take up to 15 minutes to perform overwrite and larger DIMMs will take longer. Signed-off-by: Dave Jiang --- drivers/acpi/nfit/intel.c | 113 +++ drivers/acpi/nfit/intel.h |3 + drivers/acpi/nfit/nfit.h |1 drivers/nvdimm/core.c |3 + drivers/nvdimm/dimm_devs.c | 22 drivers/nvdimm/nd-core.h |4 ++ drivers/nvdimm/nd.h|8 +++ drivers/nvdimm/security.c | 116 include/linux/libnvdimm.h |4 ++ 9 files changed, 273 insertions(+), 1 deletion(-) diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c index f8f63f483859..f6d2b217f1c8 100644 --- a/drivers/acpi/nfit/intel.c +++ b/drivers/acpi/nfit/intel.c @@ -20,6 +20,112 @@ #include /* for wbinvd_on_all_cpus() on !SMP builds */ +static int intel_dimm_security_query_overwrite(struct nvdimm *nvdimm) +{ + int cmd_rc, rc = 0; + struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); + struct { + struct nd_cmd_pkg pkg; + struct nd_intel_query_overwrite cmd; + } nd_cmd = { + .pkg = { + .nd_command = NVDIMM_INTEL_QUERY_OVERWRITE, + .nd_family = NVDIMM_FAMILY_INTEL, + .nd_size_in = 0, + .nd_size_out = ND_INTEL_STATUS_SIZE, + .nd_fw_size = ND_INTEL_STATUS_SIZE, + }, + .cmd = { + .status = 0, + }, + }; + + if (!test_bit(NVDIMM_INTEL_QUERY_OVERWRITE, &nfit_mem->dsm_mask)) + return -ENOTTY; + + rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc); + if (rc < 0) + goto out; + if (cmd_rc < 0) { + rc = cmd_rc; + goto out; + } + + switch (nd_cmd.cmd.status) { + case 0: + break; + case ND_INTEL_STATUS_OQUERY_INPROGRESS: + rc = -EBUSY; + goto out; + default: + rc = -ENXIO; + goto out; + } + + /* flush 
all cache before we make the nvdimms available */ + wbinvd_on_all_cpus(); + nfit_mem->overwrite = false; + +out: + return rc; +} + +static int intel_dimm_security_overwrite(struct nvdimm *nvdimm, + const struct nvdimm_key_data *nkey) +{ + int cmd_rc, rc = 0; + struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm); + struct { + struct nd_cmd_pkg pkg; + struct nd_intel_overwrite cmd; + } nd_cmd = { + .pkg = { + .nd_command = NVDIMM_INTEL_OVERWRITE, + .nd_family = NVDIMM_FAMILY_INTEL, + .nd_size_in = ND_INTEL_PASSPHRASE_SIZE, + .nd_size_out = ND_INTEL_STATUS_SIZE, + .nd_fw_size = ND_INTEL_STATUS_SIZE, + }, + .cmd = { + .status = 0, + }, + }; + + if (!test_bit(NVDIMM_INTEL_OVERWRITE, &nfit_mem->dsm_mask)) + return -ENOTTY; + + /* flush all cache before we erase DIMM */ + wbinvd_on_all_cpus(); + memcpy(nd_cmd.cmd.passphrase, nkey->data, + sizeof(nd_cmd.cmd.passphrase)); + rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc); + if (rc < 0) + goto out; + if (cmd_rc < 0) { + rc = cmd_rc; + goto out; + } + + switch (nd_cmd.cmd.status) { + case 0: + nfit_mem->overwrite = true; + break; + case ND_INTEL_STATUS_OVERWRITE_UNSUPPORTED: + rc = -ENOTSUPP; + break; + case ND_INTEL_STATUS_INVALID_PASS: + rc = -EINVAL; + goto out; + case ND_INTEL_STATUS_INVALID_STATE: + default: + rc = -ENXIO; + goto out; + } + + out: + return rc; +} + static int intel_dimm_security_erase(struct nvdimm *nvdimm, const struct nvdimm_key_data *nkey) { @@ -319,6 +425,11 @@ static int intel_dimm_security_state(struct nvdimm *nvdimm, return 0; } + if (nfit_mem->overwrite == true) { + *state = NVDIMM_SECURITY_OVERWRITE; + return 0; + } + *state = NVDIMM_SECURITY_DISABLED; rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_c
[PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test
Adding test support for new Intel DSM from v1.8. The ability of simulating master passphrase update and master secure erase have been added to nfit_test. Signed-off-by: Dave Jiang --- tools/testing/nvdimm/test/nfit.c | 86 ++ 1 file changed, 86 insertions(+) diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c index c885fe136f42..602f6703614e 100644 --- a/tools/testing/nvdimm/test/nfit.c +++ b/tools/testing/nvdimm/test/nfit.c @@ -148,7 +148,9 @@ static unsigned long dimm_fail_cmd_flags[NUM_DCR]; static int dimm_fail_cmd_code[NUM_DCR]; struct nfit_test_sec { u8 state; + u8 ext_state; u8 passphrase[32]; + u8 master_passphrase[32]; u64 overwrite_end_time; } dimm_sec_info[NUM_DCR]; @@ -951,6 +953,7 @@ static int nd_intel_test_cmd_security_status(struct nfit_test *t, nd_cmd->status = 0; nd_cmd->state = sec->state; + nd_cmd->extended_state = sec->ext_state; dev_dbg(dev, "security state (%#x) returned\n", nd_cmd->state); return 0; @@ -1067,7 +1070,9 @@ static int nd_intel_test_cmd_secure_erase(struct nfit_test *t, dev_dbg(dev, "secure erase: wrong passphrase\n"); } else { memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE); + memset(sec->master_passphrase, 0, ND_INTEL_PASSPHRASE_SIZE); sec->state = 0; + sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED; dev_dbg(dev, "secure erase: done\n"); } @@ -1112,12 +1117,69 @@ static int nd_intel_test_cmd_query_overwrite(struct nfit_test *t, if (time_is_before_jiffies64(sec->overwrite_end_time)) { sec->overwrite_end_time = 0; sec->state = 0; + sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED; dev_dbg(dev, "overwrite is complete\n"); } else nd_cmd->status = ND_INTEL_STATUS_OQUERY_INPROGRESS; return 0; } +static int nd_intel_test_cmd_master_set_pass(struct nfit_test *t, + struct nd_intel_set_master_passphrase *nd_cmd, + unsigned int buf_len, int dimm) +{ + struct device *dev = &t->pdev.dev; + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; + + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) { + nd_cmd->status 
= ND_INTEL_STATUS_NOT_SUPPORTED; + dev_dbg(dev, "master set passphrase in wrong state\n"); + } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) { + nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE; + dev_dbg(dev, "master set passphrase in wrong security state\n"); + } else if (memcmp(nd_cmd->old_pass, sec->master_passphrase, + ND_INTEL_PASSPHRASE_SIZE) != 0) { + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS; + dev_dbg(dev, "master set passphrase wrong passphrase\n"); + } else { + memcpy(sec->master_passphrase, nd_cmd->new_pass, + ND_INTEL_PASSPHRASE_SIZE); + nd_cmd->status = 0; + dev_dbg(dev, "master passphrase updated\n"); + } + + return 0; +} + +static int nd_intel_test_cmd_master_secure_erase(struct nfit_test *t, + struct nd_intel_master_secure_erase *nd_cmd, + unsigned int buf_len, int dimm) +{ + struct device *dev = &t->pdev.dev; + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; + + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) { + nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED; + dev_dbg(dev, "master erase in wrong state\n"); + } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) { + nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE; + dev_dbg(dev, "master erase in wrong security state\n"); + } else if (memcmp(nd_cmd->passphrase, sec->master_passphrase, + ND_INTEL_PASSPHRASE_SIZE) != 0) { + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS; + dev_dbg(dev, "master secure erase: wrong passphrase\n"); + } else { + memset(sec->master_passphrase, 0, ND_INTEL_PASSPHRASE_SIZE); + sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED; + memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE); + sec->state = 0; + dev_dbg(dev, "master secure erase: done\n"); + } + + return 0; +} + + static int get_dimm(struct nfit_mem *nfit_mem, unsigned int func) { int i; @@ -1197,6 +1259,14 @@ static int nfit_test_ctl(struct nvdimm_bus_descriptor *nd_desc, rc = nd_intel_test_cmd_query_overwrite(t, buf, buf_len, i - t->dcr_idx); break; + case NVDIMM_INTEL_SET_MASTER_PASSPHRASE:
Re: [mm PATCH v5 0/7] Deferred page init improvements
On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > On 18-11-05 13:19:25, Alexander Duyck wrote: > > This patchset is essentially a refactor of the page initialization logic > > that is meant to provide for better code reuse while providing a > > significant improvement in deferred page initialization performance. > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > memory per node I have seen the following. In the case of regular memory > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > average. For the persistent memory the initialization time dropped from > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > deferred memory initialization performance, and a 26% improvement in the > > persistent memory initialization performance. > > Hi Alex, > > Please try to run your persistent memory init experiment with Daniel's > patches: > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/ I've taken a quick look at it. It seems like a bit of a brute force way to try and speed things up. I would be worried about it potentially introducing performance issues if the number of CPUs thrown at it ends up exceeding the maximum throughput of the memory. The data provided with patch 11 seems to point to issues such as that. In the case of the E7-8895 example cited it is increasing the number of CPUs used for memory initialization from 8 to 72, a 9x increase in the number of CPUs but it is yielding only a 3.88x speedup. > The performance should improve by much more than 26%. The 26% improvement, or speedup of 1.26x using the ktask approach, was for persistent memory, not deferred memory init. The ktask patch doesn't do anything for persistent memory since it takes the hot-plug path and isn't handled via the deferred memory init. I had increased deferred memory init to about 3.53x the original speed (3.75s to 1.06s) on the system which I was testing. 
I do agree the two patches should be able to synergistically boost each other though as this patch set was meant to make the init much more cache friendly so as a result it should scale better as you add additional cores. I know I had done some playing around with fake numa to split up a single node into 8 logical nodes and I had seen a similar speedup of about 3.85x with my test memory initializing in about 275ms. > Overall, your work looks good, but it needs to be considered how easy it > will be > to merge with ktask. I will try to complete the review today. > > Thank you, > Pasha Looking over the patches they are still in the RFC stage and the data is in need of updates since it is referencing 4.15-rc kernels as its baseline. If anything I really think the ktask patch 11 would be easier to rebase around my patch set than the other way around. Also, this series is in Andrew's mmots as of a few days ago, so I think it will be in the next mmotm that comes out. The integration with the ktask code should be pretty straightforward. If anything I think my code would probably make it easier since it gets rid of the need to do all this in two passes. The only new limitation it would add is that you would probably want to split up the work along either max order or section aligned boundaries. What it would essentially do is make things so that each of the ktask threads would probably look more like deferred_grow_zone which after my patch set is actually a fairly simple function. Thanks. - Alex
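The work-splitting constraint Alex describes, carving the deferred range along section-aligned boundaries so that parallel init threads never share a section, can be modeled in isolation. This is an illustrative sketch, not kernel code: PAGES_PER_SECTION here is the x86_64 value for 4K pages, and the helper names are made up.

```c
/* x86_64: a memory section is 128MB, i.e. 2^15 pages of 4K */
#define PAGES_PER_SECTION (1UL << 15)

/* Given a chunk start, return the chunk end: the next section
 * boundary, clamped to the end of the overall PFN range. */
unsigned long chunk_end(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long next = (start_pfn / PAGES_PER_SECTION + 1) *
			     PAGES_PER_SECTION;

	return next < end_pfn ? next : end_pfn;
}

/* Count how many section-aligned chunks a range splits into; each
 * chunk would be one independent unit of work for an init thread. */
unsigned long count_chunks(unsigned long start_pfn, unsigned long end_pfn)
{
	unsigned long n = 0;

	while (start_pfn < end_pfn) {
		start_pfn = chunk_end(start_pfn, end_pfn);
		n++;
	}
	return n;
}
```

Splitting on section boundaries keeps each thread's stream of struct page writes contiguous, which is exactly the cache-friendly behavior the patch set is aiming for.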
Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
> > > + unsigned long epfn = PFN_DOWN(epa); > > > + unsigned long spfn = PFN_UP(spa); > > > + > > > + /* > > > + * Verify the end is at least past the start of the zone and > > > + * that we have at least one PFN to initialize. > > > + */ > > > + if (zone->zone_start_pfn < epfn && spfn < epfn) { > > > + /* if we went too far just stop searching */ > > > + if (zone_end_pfn(zone) <= spfn) > > > + break; > > > > Set *idx = U64_MAX here, then break. This way after we are outside this > > while loop idx is always equal to U64_MAX. > > Actually I think what you are asking for is the logic that is outside > of the while loop we are breaking out of. So if you check at the end of > the function there is the bit of code with the comment "signal end of > iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to > ULONG_MAX, and *out_epfn to 0. > > The general idea I had with the function is that you could use either > the index or spfn < epfn checks to determine if you keep going or not. Yes, I meant to remove that *idx = U64_MAX after the loop, it is confusing to have a loop: while (*idx != U64_MAX) { ... } *idx = U64_MAX; So, it is better to set idx to U64_MAX inside the loop before the break. > > > > > > + > > > + if (out_spfn) > > > + *out_spfn = max(zone->zone_start_pfn, spfn); > > > + if (out_epfn) > > > + *out_epfn = min(zone_end_pfn(zone), epfn); > > > > Don't we need to verify after adjustment that out_spfn != out_epfn, so > > there is at least one PFN to initialize? > > We have a few checks that I believe prevent that. 
Before we get to this > point we have verified the following: > zone->zone_start < epfn > spfn < epfn > > The other check that should be helping to prevent that is the break > statement above that is forcing us to exit if spfn is somehow already > past the end of the zone, that essentially maps out: > spfn < zone_end_pfn(zone) > > So the only check we don't have is: > zone->zone_start < zone_end_pfn(zone) > > If I am not mistaken that is supposed to be a given is it not? I would > assume we don't have any zones that are completely empty or inverted > that would be called here do we? The "if (zone_end_pfn(zone) <= spfn)" check won't break out wrongly here: the equal sign in <= takes care of the case I was thinking of. Yes, logic looks good. Thank you Pasha
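The range/zone intersection logic being reviewed above can be modeled in isolation. A hedged sketch: the struct and helper below are simplified stand-ins for the kernel's zone and memblock iterator, not the actual implementation.

```c
#include <stdbool.h>

/* Simplified model of a zone; field names are illustrative stand-ins
 * for zone->zone_start_pfn and zone_end_pfn(zone). */
struct zone_model {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/* Clamp the memblock range [spfn, epfn) to the zone. Returns true and
 * writes the intersection when at least one PFN overlaps, false when
 * the range misses the zone entirely or is empty. */
bool clamp_range_to_zone(const struct zone_model *zone,
			 unsigned long spfn, unsigned long epfn,
			 unsigned long *out_spfn, unsigned long *out_epfn)
{
	/* the end must be past the zone start and the range non-empty */
	if (zone->start_pfn >= epfn || spfn >= epfn)
		return false;
	/* range starts at or beyond the zone end: went too far */
	if (zone->end_pfn <= spfn)
		return false;
	*out_spfn = spfn > zone->start_pfn ? spfn : zone->start_pfn;
	*out_epfn = epfn < zone->end_pfn ? epfn : zone->end_pfn;
	return *out_spfn < *out_epfn;
}
```

This mirrors the argument in the review: given the two entry checks plus the "went too far" break, the clamped bounds can never invert as long as the zone itself is non-empty.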
[PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware Interface Table), is the first known instance of a memory range described by a unique "target" proximity domain. "Initiator" and "target" proximity domains are the approach that the ACPI HMAT (Heterogeneous Memory Attributes Table) uses to describe the unique performance properties of a memory range relative to a given initiator (e.g. CPU or DMA device). Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y char-device follows the traditional notion of 'numa-node' where the attribute conveys the closest online numa-node. That numa-node attribute is useful for cpu-binding and memory-binding processes *near* the device. However, when the memory range backing a 'pmem' or 'dax' device is onlined (memory hot-add) the memory-only-numa-node representing that address range needs to be differentiated from the set of online nodes. In other words, the numa-node association of the device depends on whether you can bind processes *near* the cpu-numa-node in the offline device-case, or bind processes *on* the memory-range directly after the backing address range is onlined. Allow for the case that platform firmware describes persistent memory with a unique proximity domain, i.e. when it is distinct from the proximity of DRAM and CPUs that are on the same socket. Plumb the Linux numa-node translation of that proximity through the libnvdimm region device to namespaces that are in device-dax mode. With this in place the proposed kmem driver [1] can optionally discover a unique numa-node number for the address range as it transitions the memory from an offline state managed by a device-driver to an online memory range managed by the core-mm. 
[1]: https://lkml.org/lkml/2018/10/23/9 Reported-by: Fan Du Cc: Michael Ellerman Cc: "Oliver O'Halloran" Cc: Dave Hansen Cc: Jérôme Glisse Signed-off-by: Dan Williams --- arch/powerpc/platforms/pseries/papr_scm.c |1 + drivers/acpi/nfit/core.c |8 ++-- drivers/acpi/numa.c |1 + drivers/dax/bus.c |4 +++- drivers/dax/bus.h |3 ++- drivers/dax/dax-private.h |4 drivers/dax/pmem/core.c |4 +++- drivers/nvdimm/e820.c |1 + drivers/nvdimm/nd.h |2 +- drivers/nvdimm/of_pmem.c |1 + drivers/nvdimm/region_devs.c |1 + include/linux/libnvdimm.h |1 + 12 files changed, 25 insertions(+), 6 deletions(-) diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c index ee9372b65ca5..6a0a35b872d1 100644 --- a/arch/powerpc/platforms/pseries/papr_scm.c +++ b/arch/powerpc/platforms/pseries/papr_scm.c @@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p) memset(&ndr_desc, 0, sizeof(ndr_desc)); ndr_desc.attr_groups = region_attr_groups; ndr_desc.numa_node = dev_to_node(&p->pdev->dev); + ndr_desc.target_node = ndr_desc.numa_node; ndr_desc.res = &p->res; ndr_desc.of_node = p->dn; ndr_desc.provider_data = p; diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c index f8c638f3c946..2225e3de33ac 100644 --- a/drivers/acpi/nfit/core.c +++ b/drivers/acpi/nfit/core.c @@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc, ndr_desc->res = &res; ndr_desc->provider_data = nfit_spa; ndr_desc->attr_groups = acpi_nfit_region_attribute_groups; - if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) + if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) { ndr_desc->numa_node = acpi_map_pxm_to_online_node( spa->proximity_domain); - else + ndr_desc->target_node = acpi_map_pxm_to_node( + spa->proximity_domain); + } else { ndr_desc->numa_node = NUMA_NO_NODE; + ndr_desc->target_node = NUMA_NO_NODE; + } /* * Persistence domain bits are hierarchical, if diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c index 
274699463b4f..b9d86babb13a 100644 --- a/drivers/acpi/numa.c +++ b/drivers/acpi/numa.c @@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm) return node; } +EXPORT_SYMBOL(acpi_map_pxm_to_node); /** * acpi_map_pxm_to_online_node - Map proximity ID to online node diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c index 568168500217..c620ad52d7e5 100644 --- a/drivers/dax/bus.c +++ b/drivers/dax/bus.c @@ -214,7 +214,7 @@ static void dax_region_unregister(void *region) } struct dax_region *alloc_dax_region(struct device *parent, int region_id, - struct resource *res, unsigned int align, + struct resource *res, int target_node, unsigned int align, unsigned long p
Re: [mm PATCH v5 0/7] Deferred page init improvements
On 18-11-09 15:14:35, Alexander Duyck wrote: > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > This patchset is essentially a refactor of the page initialization logic > > > that is meant to provide for better code reuse while providing a > > > significant improvement in deferred page initialization performance. > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent > > > memory per node I have seen the following. In the case of regular memory > > > initialization the deferred init time was decreased from 3.75s to 1.06s on > > > average. For the persistent memory the initialization time dropped from > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the > > > deferred memory initialization performance, and a 26% improvement in the > > > persistent memory initialization performance. > > > > Hi Alex, > > > > Please try to run your persistent memory init experiment with Daniel's > > patches: > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/ > > I've taken a quick look at it. It seems like a bit of a brute force way > to try and speed things up. I would be worried about it potentially There is a limit to max number of threads that ktasks start. The memory throughput is *much* higher than what one CPU can maxout in a node, so there is no reason to leave the other CPUs sit idle during boot when they can help to initialize. > introducing performance issues if the number of CPUs thrown at it end > up exceeding the maximum throughput of the memory. > > The data provided with patch 11 seems to point to issues such as that. > In the case of the E7-8895 example cited it is increasing the numbers > of CPUs used from memory initialization from 8 to 72, a 9x increase in > the number of CPUs but it is yeilding only a 3.88x speedup. Yes, but in both cases we are far from maxing out the memory throughput. 
The 3.88x is indeed low, and I do not know what slows it down.

Daniel,

Could you please check why multi-threading efficiency is so low here?

I bet there is some atomic operation that introduces contention within a
node. It should be possible to resolve.

> > > The performance should improve by much more than 26%.
> >
> > The 26% improvement, or speedup of 1.26x using the ktask approach, was
> > for persistent memory, not deferred memory init. The ktask patch
> > doesn't do anything for persistent memory since it takes the hot-plug
> > path and isn't handled via the deferred memory init.

Ah, I thought in your experiment persistent memory takes the deferred init
path. So, what exactly in your patches makes this 1.26x speedup?

> > I had increased deferred memory init to about 3.53x the original speed
> > (3.75s to 1.06s) on the system which I was testing. I do agree the two
> > patches should be able to synergistically boost each other though, as
> > this patch set was meant to make the init much more cache friendly so
> > as a result it should scale better as you add additional cores. I know
> > I had done some playing around with fake numa to split up a single node
> > into 8 logical nodes and I had seen a similar speedup of about 3.85x
> > with my test memory initializing in about 275ms.
> >
> > > Overall, your work looks good, but it needs to be considered how easy
> > > it will be to merge with ktask. I will try to complete the review today.
> > >
> > > Thank you,
> > > Pasha
> >
> > Looking over the patches they are still in the RFC stage and the data
> > is in need of updates since it is referencing 4.15-rc kernels as its
> > baseline. If anything I really think the ktask patch 11 would be easier
> > to rebase around my patch set than the other way around. Also, this
> > series is in Andrew's mmots as of a few days ago, so I think it will be
> > in the next mmotm that comes out.

I do not disagree, I think these two patch series should complement each
other.
But, if your changes make it impossible for ktask, I would strongly argue
against it, as the potential improvements with ktasks are much higher. But,
so far I do not see anything, so I think they can work together. I am still
reviewing your work.

> The integration with the ktask code should be pretty straight forward.
> If anything I think my code would probably make it easier since it gets
> rid of the need to do all this in two passes. The only new limitation
> it would add is that you would probably want to split up the work along
> either max order or section aligned boundaries. What it would

Which is totally OK, it should make ktasks scale even better.

> essentially do is make things so that each of the ktask threads would
> probably look more like deferred_grow_zone which after my patch set is
> actually a fairly simple function.
>
> Thanks.

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm
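To make the "split the work along section aligned boundaries" idea concrete, here is a minimal userspace sketch. The names and the 2048-pfn section size are illustrative assumptions, not kernel API: the point is only that each worker thread would be handed chunks whose boundaries fall on section multiples, so no two threads ever touch the same max-order block.

```c
#include <stddef.h>

/* Illustrative section size in pfns; real kernels derive this from
 * SECTION_SIZE_BITS, so treat this value as an assumption. */
#define TOY_SECTION_PFNS 2048UL

/* Advance from pfn to the next section boundary, clamped to end. */
static unsigned long toy_next_chunk_end(unsigned long pfn, unsigned long end)
{
	unsigned long aligned = (pfn + TOY_SECTION_PFNS) & ~(TOY_SECTION_PFNS - 1);

	return aligned < end ? aligned : end;
}

/* Count how many section-bounded chunks a [start, end) pfn range splits
 * into -- i.e. how many independent work items ktask-style threads get. */
int toy_count_chunks(unsigned long start, unsigned long end)
{
	int n = 0;

	while (start < end) {
		start = toy_next_chunk_end(start, end);
		n++;
	}
	return n;
}
```

For example, the range [1000, 5000) splits at the 2048 and 4096 boundaries into three chunks, each safely handed to a different thread.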
Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
On Fri, 2018-11-09 at 18:26 -0500, Pavel Tatashin wrote: > > +/** > > + * for_each_free_mem_range_in_zone - iterate through zone specific free > > + * memblock areas > > + * @i: u64 used as loop variable > > + * @zone: zone in which all of the memory blocks reside > > + * @p_start: ptr to phys_addr_t for start address of the range, can be > > %NULL > > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL > > + * > > + * Walks over free (memory && !reserved) areas of memblock in a specific > > + * zone. Available as soon as memblock is initialized. > > + */ > > +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) > > \ > > + for (i = 0, \ > > +__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\ > > +i != (u64)ULLONG_MAX; \ > > +__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end)) > > +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */ > > Use U64_MAX instead of ULLONG_MAX, and avoid u64 cast. I know other > places in this file use UULONG_MAX with cast, but I think U64_MAX is > better. Okay, maybe I will submit a follow up that just cleans all of these up. 
> > + > > /** > > * for_each_free_mem_range - iterate through free memblock areas > > * @i: u64 used as loop variable > > diff --git a/mm/memblock.c b/mm/memblock.c > > index 7df468c8ebc8..f1d1fbfd1ae7 100644 > > --- a/mm/memblock.c > > +++ b/mm/memblock.c > > @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t > > base, phys_addr_t size, > > return 0; > > } > > #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */ > > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT > > +/** > > + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone() > > + * > > + * @idx: pointer to u64 loop variable > > + * @zone: zone in which all of the memory blocks reside > > + * @out_start: ptr to ulong for start pfn of the range, can be %NULL > > + * @out_end: ptr to ulong for end pfn of the range, can be %NULL > > + * > > + * This function is meant to be a zone/pfn specific wrapper for the > > + * for_each_mem_range type iterators. Specifically they are used in the > > + * deferred memory init routines and as such we were duplicating much of > > + * this logic throughout the code. So instead of having it in multiple > > + * locations it seemed like it would make more sense to centralize this to > > + * one new iterator that does everything they need. > > + */ > > +void __init_memblock > > +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone, > > +unsigned long *out_spfn, unsigned long *out_epfn) > > +{ > > + int zone_nid = zone_to_nid(zone); > > + phys_addr_t spa, epa; > > + int nid; > > + > > + __next_mem_range(idx, zone_nid, MEMBLOCK_NONE, > > +&memblock.memory, &memblock.reserved, > > +&spa, &epa, &nid); > > + > > + while (*idx != ULLONG_MAX) { > > Ditto, use U64_MAX Same here. > > + unsigned long epfn = PFN_DOWN(epa); > > + unsigned long spfn = PFN_UP(spa); > > + > > + /* > > +* Verify the end is at least past the start of the zone and > > +* that we have at least one PFN to initialize. 
> > +*/ > > + if (zone->zone_start_pfn < epfn && spfn < epfn) { > > + /* if we went too far just stop searching */ > > + if (zone_end_pfn(zone) <= spfn) > > + break; > > Set *idx = U64_MAX here, then break. This way after we are outside this > while loop idx is always equals to U64_MAX. Actually I think what you are asking for is the logic that is outside of the while loop we are breaking out of. So if you check at the end of the function there is the bit of code with the comment "signal end of iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to ULONG_MAX, and *out_epfn to 0. The general idea I had with the function is that you could use either the index or spfn < epfn checks to determine if you keep going or not. > > > + > > + if (out_spfn) > > + *out_spfn = max(zone->zone_start_pfn, spfn); > > + if (out_epfn) > > + *out_epfn = min(zone_end_pfn(zone), epfn); > > Don't we need to verify after adjustment that out_spfn != out_epfn, so > there is at least one PFN to initialize? We have a few checks that I believe prevent that. Before we get to this point we have verified the following: zone->zone_start < epfn spfn < epfn The other check that should be helping to prevent that is the break statement above that is forcing us to exit if spfn is somehow already past the end of the zone, that essentially maps out: spfn < zone_end_pfn(zone) So the only check we don't have is: zone->zone_start < zone_end_pfn(zone) If I am not mistaken that is supp
Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator
> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + *	memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)	\
> +	for (i = 0,							\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);	\
> +	     i != (u64)ULLONG_MAX;					\
> +	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

Use U64_MAX instead of ULLONG_MAX, and avoid the u64 cast. I know other
places in this file use ULLONG_MAX with a cast, but I think U64_MAX is
better.

> +
> /**
>  * for_each_free_mem_range - iterate through free memblock areas
>  * @i: u64 used as loop variable
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 7df468c8ebc8..f1d1fbfd1ae7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
>  	return 0;
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/**
> + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> + *
> + * @idx: pointer to u64 loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> + *
> + * This function is meant to be a zone/pfn specific wrapper for the
> + * for_each_mem_range type iterators. Specifically they are used in the
> + * deferred memory init routines and as such we were duplicating much of
> + * this logic throughout the code. So instead of having it in multiple
> + * locations it seemed like it would make more sense to centralize this to
> + * one new iterator that does everything they need.
> + */
> +void __init_memblock
> +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +			     unsigned long *out_spfn, unsigned long *out_epfn)
> +{
> +	int zone_nid = zone_to_nid(zone);
> +	phys_addr_t spa, epa;
> +	int nid;
> +
> +	__next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> +			 &memblock.memory, &memblock.reserved,
> +			 &spa, &epa, &nid);
> +
> +	while (*idx != ULLONG_MAX) {

Ditto, use U64_MAX

> +		unsigned long epfn = PFN_DOWN(epa);
> +		unsigned long spfn = PFN_UP(spa);
> +
> +		/*
> +		 * Verify the end is at least past the start of the zone and
> +		 * that we have at least one PFN to initialize.
> +		 */
> +		if (zone->zone_start_pfn < epfn && spfn < epfn) {
> +			/* if we went too far just stop searching */
> +			if (zone_end_pfn(zone) <= spfn)
> +				break;

Set *idx = U64_MAX here, then break. This way, after we are outside this
while loop, idx always equals U64_MAX.

> +
> +			if (out_spfn)
> +				*out_spfn = max(zone->zone_start_pfn, spfn);
> +			if (out_epfn)
> +				*out_epfn = min(zone_end_pfn(zone), epfn);

Don't we need to verify after adjustment that out_spfn != out_epfn, so
there is at least one PFN to initialize?

The rest looks good. Once the above is fixed:

Reviewed-by: Pavel Tatashin

Thank you,
Pasha
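The sentinel convention under discussion here (iteration ends when the index becomes U64_MAX) is easy to model outside the kernel. In this toy sketch a static array stands in for memblock's range table, and every name is made up for illustration:

```c
#include <stdint.h>

/* Toy range table standing in for memblock; {0, 0} terminates it. */
static const struct toy_range { uint64_t start, end; } toy_ranges[] = {
	{ 10, 20 }, { 40, 50 }, { 0, 0 }
};

/* Fetch the next range; an index of UINT64_MAX signals end of iteration,
 * mirroring the U64_MAX sentinel the review asks for. */
static void toy_next_range(uint64_t *idx, uint64_t *s, uint64_t *e)
{
	if (toy_ranges[*idx].end == 0) {
		*idx = UINT64_MAX;	/* signal end of iteration */
		return;
	}
	*s = toy_ranges[*idx].start;
	*e = toy_ranges[*idx].end;
	(*idx)++;
}

#define toy_for_each_range(i, s, e)			\
	for (i = 0, toy_next_range(&i, &s, &e);		\
	     i != UINT64_MAX;				\
	     toy_next_range(&i, &s, &e))

/* Sum of range lengths, exercising the iterator end to end. */
uint64_t toy_total_len(void)
{
	uint64_t i, s = 0, e = 0, len = 0;

	toy_for_each_range(i, s, e)
		len += e - s;
	return len;
}
```

The macro has the same shape as for_each_free_mem_pfn_range_in_zone: prime the iterator in the init clause, test the sentinel in the condition, and re-invoke the helper in the increment clause.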
Re: [mm PATCH v5 0/7] Deferred page init improvements
On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> On 18-11-09 15:14:35, Alexander Duyck wrote:
> > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > This patchset is essentially a refactor of the page initialization logic
> > > > that is meant to provide for better code reuse while providing a
> > > > significant improvement in deferred page initialization performance.
> > > >
> > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > > memory per node I have seen the following. In the case of regular memory
> > > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > > average. For the persistent memory the initialization time dropped from
> > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > deferred memory initialization performance, and a 26% improvement in the
> > > > persistent memory initialization performance.
> > >
> > > Hi Alex,
> > >
> > > Please try to run your persistent memory init experiment with Daniel's
> > > patches:
> > >
> > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/
> >
> > I've taken a quick look at it. It seems like a bit of a brute force way
> > to try and speed things up. I would be worried about it potentially
>
> There is a limit to the max number of threads that ktask starts. The memory
> throughput is *much* higher than what one CPU can max out in a node, so
> there is no reason to leave the other CPUs sitting idle during boot when
> they can help to initialize.

Right, but right now that limit can still be pretty big when it is
something like 25% of all the CPUs on a 288 CPU system.

One issue is the way the code ends up essentially blowing out the cache
over and over again. Doing things in two passes made it really expensive,
as you took one cache miss to initialize it, and another to free it.
I think getting rid of that is one of the biggest gains with my patch set.

> > introducing performance issues if the number of CPUs thrown at it end
> > up exceeding the maximum throughput of the memory.
> >
> > The data provided with patch 11 seems to point to issues such as that.
> > In the case of the E7-8895 example cited it is increasing the number
> > of CPUs used for memory initialization from 8 to 72, a 9x increase in
> > the number of CPUs, but it is yielding only a 3.88x speedup.
>
> Yes, but in both cases we are far from maxing out the memory throughput.
> The 3.88x is indeed low, and I do not know what slows it down.
>
> Daniel,
>
> Could you please check why multi-threading efficiency is so low here?
>
> I bet there is some atomic operation that introduces contention within a
> node. It should be possible to resolve.

Depending on what kernel this was based on, it probably was something like
SetPageReserved triggering pipeline stalls, or maybe it is freeing the
pages themselves, since I would imagine that has a pretty big overhead and
may end up serializing some things.

> > > The performance should improve by much more than 26%.
> >
> > The 26% improvement, or speedup of 1.26x using the ktask approach, was
> > for persistent memory, not deferred memory init. The ktask patch
> > doesn't do anything for persistent memory since it takes the hot-plug
> > path and isn't handled via the deferred memory init.
>
> Ah, I thought in your experiment persistent memory takes the deferred init
> path. So, what exactly in your patches makes this 1.26x speedup?

Basically it is the folding of the reserved bit into the setting of the
rest of the values written into the page flags. With that change we are
able to tighten the memory initialization loop to the point where it
almost looks like a memset case, as it will put together all of the memory
values outside of the loop and then just loop through assigning the values
for each page.
> >
> > I had increased deferred memory init to about 3.53x the original speed
> > (3.75s to 1.06s) on the system which I was testing. I do agree the two
> > patches should be able to synergistically boost each other though, as
> > this patch set was meant to make the init much more cache friendly so
> > as a result it should scale better as you add additional cores. I know
> > I had done some playing around with fake numa to split up a single node
> > into 8 logical nodes and I had seen a similar speedup of about 3.85x
> > with my test memory initializing in about 275ms.

It is also possible that the 3.8x is an upper limit for something on the
system in terms of scaling. I just realized that the 3.85x I saw with an
8-way split over a 4-socket system isn't too different from the 3.88x that
was recorded for a 9-way split over an 8-socket system.

> > > Overall, your work looks good, but it needs to be considered how easy
> > > it will be to merge with ktask. I will try to complete the review today.
> > >
> > > Thank you,
Re: [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections
On 18-11-05 13:19:45, Alexander Duyck wrote:
> 	}
> -	first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> +
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_init_pfn)) {
> +		pgdat_resize_unlock(pgdat, &flags);
> +		pgdat_init_report_one_done();
> +		return 0;

It would make more sense to add a goto to the end of this function and
report that Node X had 0 pages initialized. Otherwise, it will be confusing
why some nodes are not initialized during boot: it is simply because they
do not have deferred pages left to be initialized.

> +	}
>
> 	/*
> -	 * Initialize and free pages. We do it in two loops: first we initialize
> -	 * struct page, than free to buddy allocator, because while we are
> -	 * freeing pages we can access pages that are ahead (computing buddy
> -	 * page in __free_one_page()).
> +	 * Initialize and free pages in MAX_ORDER sized increments so
> +	 * that we can avoid introducing any issues with the buddy
> +	 * allocator.
> 	 */
> -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> -		nr_pages += deferred_init_pages(zone, spfn, epfn);
> -	}
> -	for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> -		spfn = max_t(unsigned long, first_init_pfn, spfn);
> -		deferred_free_pages(spfn, epfn);
> -	}
> +	while (spfn < epfn)
> +		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
> 	pgdat_resize_unlock(pgdat, &flags);
>
> 	/* Sanity check that the next zone really is unpopulated */
> @@ -1602,9 +1689,9 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
> {

How about order = max(order, max_order)?
> 	unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);

> -	first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> -
> -	if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> +	/* If the zone is empty somebody else may have cleared out the zone */
> +	if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +						 first_deferred_pfn)) {
> 		pgdat_resize_unlock(pgdat, &flags);
> -		return false;
> +		return true;

I am OK with changing to true here; please also set
pgdat->first_deferred_pfn to ULONG_MAX to indicate that there are no more
deferred pages in this node.

Overall, I like this patch; it makes things a lot easier. Assuming the
above is addressed:

Reviewed-by: Pavel Tatashin

Thank you,
Pasha
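As a side note on the `ALIGN(1 << order, PAGES_PER_SECTION)` expression quoted above: the kernel's ALIGN rounds up to a power-of-two multiple, so a grow request is always padded out to whole sections. A userspace sketch of that arithmetic (the 32768 pages-per-section value matches common x86_64 defaults, but treat it as an assumption here):

```c
/* Kernel-style ALIGN: round x up to a multiple of power-of-two a. */
#define TOY_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* 128MB sections / 4KB pages => 32768 pfns per section; an assumption. */
#define TOY_PAGES_PER_SECTION (1UL << 15)

/* Pages a deferred_grow_zone-style request would actually initialize. */
unsigned long toy_nr_pages_needed(unsigned int order)
{
	return TOY_ALIGN(1UL << order, TOY_PAGES_PER_SECTION);
}
```

So even an order-0 request is rounded up to a full section's worth of pages, which is what makes the section-granular bookkeeping safe.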
Re: [mm PATCH v5 0/7] Deferred page init improvements
On 18-11-09 16:46:02, Alexander Duyck wrote: > On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote: > > On 18-11-09 15:14:35, Alexander Duyck wrote: > > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote: > > > > On 18-11-05 13:19:25, Alexander Duyck wrote: > > > > > This patchset is essentially a refactor of the page initialization > > > > > logic > > > > > that is meant to provide for better code reuse while providing a > > > > > significant improvement in deferred page initialization performance. > > > > > > > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of > > > > > persistent > > > > > memory per node I have seen the following. In the case of regular > > > > > memory > > > > > initialization the deferred init time was decreased from 3.75s to > > > > > 1.06s on > > > > > average. For the persistent memory the initialization time dropped > > > > > from > > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for > > > > > the > > > > > deferred memory initialization performance, and a 26% improvement in > > > > > the > > > > > persistent memory initialization performance. > > > > > > > > Hi Alex, > > > > > > > > Please try to run your persistent memory init experiment with Daniel's > > > > patches: > > > > > > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/ > > > > > > I've taken a quick look at it. It seems like a bit of a brute force way > > > to try and speed things up. I would be worried about it potentially > > > > There is a limit to max number of threads that ktasks start. The memory > > throughput is *much* higher than what one CPU can maxout in a node, so > > there is no reason to leave the other CPUs sit idle during boot when > > they can help to initialize. > > Right, but right now that limit can still be pretty big when it is > something like 25% of all the CPUs on a 288 CPU system. It is still OK. About 9 threads per node. 
That machine has 1T of memory, which means 8 nodes need to initialize 2G of
memory each. With 46G/s throughput it should take 0.043s, roughly 10 times
faster than the 0.325s Daniel sees, so there is still room to saturate the
memory throughput. Now, if the multi-threading efficiency is good, it
should take 1.261s / 9 threads = 0.14s.

> One issue is the way the code ends up essentially blowing out the
> cache over and over again. Doing things in two passes made it really
> expensive as you took one cache miss to initialize it, and another to
> free it. I think getting rid of that is one of the biggest gains with
> my patch set.

I am not disputing that your patches make sense; all I am saying is that
ktasks improve init time by an order of magnitude on machines with a large
amount of memory.

Pasha
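The back-of-the-envelope numbers above can be written out explicitly, assuming a 64-byte struct page and 4 KiB base pages (which is what makes 1 TiB of RAM correspond to roughly 2 GiB of struct page data per node on an 8-node box):

```latex
\underbrace{\frac{1\,\text{TiB}}{4\,\text{KiB/page}} \times 64\,\text{B/page}}_{\approx 16\,\text{GiB of struct pages}}
\;\Rightarrow\;
\frac{16\,\text{GiB}}{8\,\text{nodes}} = 2\,\text{GiB/node},
\qquad
\frac{2\,\text{GiB}}{46\,\text{GB/s}} \approx 0.043\,\text{s},
\qquad
\frac{1.261\,\text{s}}{9\,\text{threads}} \approx 0.14\,\text{s}
```

In other words, even the ideal 9-thread split (0.14 s) is still about 3x slower than what the raw memory bandwidth would permit (0.043 s), which is the headroom being argued about.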
Re: [PATCH 02/11] libnvdimm/security: change clear text nvdimm keys to encrypted keys
On Fri, Nov 9, 2018 at 2:13 PM Dave Jiang wrote:
>
> In order to make nvdimm more secure, encrypted keys will be used instead of
> clear text keys. A master key will be created to seal encrypted nvdimm
> keys. The master key can be a trusted key generated from TPM 2.0 or a less
> secure user key.
>
> In the process of this conversion, the kernel cached key will be removed
> in order to simplify the verification process. The hardware will be used to
> verify the decrypted user payload directly.
>
> Signed-off-by: Dave Jiang
> ---
>  Documentation/nvdimm/security.txt |   29 ++-
>  drivers/nvdimm/dimm.c             |    3
>  drivers/nvdimm/dimm_devs.c        |    2
>  drivers/nvdimm/nd-core.h          |    3
>  drivers/nvdimm/nd.h               |    5 -
>  drivers/nvdimm/security.c         |  316 ++---
>  6 files changed, 108 insertions(+), 250 deletions(-)

Removes twice the amount of code that it adds and gains features /
security, nice!
Re: [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize
On 18-11-05 13:19:50, Alexander Duyck wrote:
> This patch is going through and combining the bits in memmap_init_zone and
> memmap_init_zone_device that are related to hotplug into a single function
> called __memmap_init_hotplug.
>
> I also took the opportunity to integrate __init_single_page's functionality
> into this function. In doing so I can get rid of some of the redundancy
> such as the LRU pointers versus the pgmap.

Please don't do it; __init_single_page() is a hard function to optimize, do
not copy its code. Instead, could you split __init_single_page() in two
parts, something like this:

static inline void init_single_page_nolru(struct page *page,
					  unsigned long pfn,
					  unsigned long zone, int nid)
{
	mm_zero_struct_page(page);
	set_page_links(page, zone, nid, pfn);
	init_page_count(page);
	page_mapcount_reset(page);
	page_cpupid_reset_last(page);
#ifdef WANT_PAGE_VIRTUAL
	/* The shift won't overflow because ZONE_NORMAL is below 4G. */
	if (!is_highmem_idx(zone))
		set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}

static void __meminit init_single_page(struct page *page, unsigned long pfn,
				       unsigned long zone, int nid)
{
	init_single_page_nolru(page, pfn, zone, nid);
	INIT_LIST_HEAD(&page->lru);
}

and call init_single_page_nolru() from __init_pageblock()? Also, remove the
WANT_PAGE_VIRTUAL optimization; I do not think it is worth it.

The rest looks very good; please do the above change.

Reviewed-by: Pavel Tatashin

>
> Signed-off-by: Alexander Duyck
> ---
Re: [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links
On 18-11-05 13:19:56, Alexander Duyck wrote:
> This patch modifies the set_page_links function to include the setting of
> the reserved flag via a simple AND and OR operation. The motivation for
> this is the fact that the existing __set_bit call still seems to have
> effects on performance as replacing the call with the AND and OR can reduce
> initialization time.
>
> Looking over the assembly code before and after the change the main
> difference between the two is that the reserved bit is stored in a value
> that is generated outside of the main initialization loop and is then
> written with the other flags field values in one write to the page->flags
> value. Previously the generated value was written and then a btsq
> instruction was issued.
>
> On my x86_64 test system with 3TB of persistent memory per node I saw the
> persistent memory initialization time on average drop from 23.49s to
> 19.12s per node.
>
> Signed-off-by: Alexander Duyck

Reviewed-by: Pavel Tatashin
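The optimization under review is easy to model in userspace: build the combined flags value (reserved bit included) once, outside the loop, so each page takes a single plain store instead of a store plus a separate bit-set instruction. Everything below (the bit position, the struct layout) is an illustrative assumption, not the real page-flags layout:

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_PG_RESERVED (1UL << 11)	/* made-up bit position */

struct toy_page { uint64_t flags; };

/* Fold the reserved bit into one value computed outside the loop; the
 * loop body is then a single store per page, essentially memset-like. */
void toy_init_pages(struct toy_page *pages, size_t n,
		    uint64_t zone_node_flags, int reserved)
{
	uint64_t flags = zone_node_flags | (reserved ? TOY_PG_RESERVED : 0);

	for (size_t i = 0; i < n; i++)
		pages[i].flags = flags;
}
```

The contrast with the pre-patch pattern is that a per-iteration `__set_bit`-style operation (the btsq mentioned above) would add a read-modify-write inside the hot loop instead of letting the compiler emit one straight store stream.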
RE: [PATCH 07/11] libnvdimm/security: add overwrite status notification
> -----Original Message-----
> From: Linux-nvdimm On Behalf Of Dave Jiang
> Sent: Friday, November 09, 2018 4:14 PM
> Subject: [PATCH 07/11] libnvdimm/security: add overwrite status
> notification
> ...
> @@ -2033,6 +2033,11 @@ static int acpi_nfit_register_dimms(struct acpi_nfit_desc *acpi_desc)
> 		if (!nvdimm)
> 			continue;
>
> +		rc = nvdimm_setup_security_events(nvdimm);
> +		if (rc < 0)
> +			dev_warn(acpi_desc->dev,
> +				"no security event setup failed\n");

That seems like a double negative. Printing the rc value (or better yet,
the string for it) is always helpful too.

---
Robert Elliott, HPE Persistent Memory
RE: [PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test
> -Original Message- > From: Linux-nvdimm On Behalf Of > Dave Jiang > Sent: Friday, November 09, 2018 4:15 PM > To: dan.j.willi...@intel.com; zo...@linux.vnet.ibm.com > Cc: linux-nvdimm@lists.01.org > Subject: [PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8 > support for nfit_test ... > +static int nd_intel_test_cmd_master_set_pass(struct nfit_test *t, > + struct nd_intel_set_master_passphrase *nd_cmd, > + unsigned int buf_len, int dimm) > +{ > + struct device *dev = &t->pdev.dev; > + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; > + > + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) { > + nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED; > + dev_dbg(dev, "master set passphrase in wrong state\n"); "master set passphrase:" for consistency > + } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) { > + nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE; > + dev_dbg(dev, "master set passphrase in wrong security > state\n"); "master set passphrase:" for consistency > + } else if (memcmp(nd_cmd->old_pass, sec->master_passphrase, > + ND_INTEL_PASSPHRASE_SIZE) != 0) { > + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS; > + dev_dbg(dev, "master set passphrase wrong > passphrase\n"); "master set passphrase:" for consistency > + } else { > + memcpy(sec->master_passphrase, nd_cmd->new_pass, > + ND_INTEL_PASSPHRASE_SIZE); > + nd_cmd->status = 0; > + dev_dbg(dev, "master passphrase updated\n"); "master set passphrase:" for consistency > + } > + > + return 0; > +} > + > +static int nd_intel_test_cmd_master_secure_erase(struct nfit_test > *t, > + struct nd_intel_master_secure_erase *nd_cmd, > + unsigned int buf_len, int dimm) > +{ > + struct device *dev = &t->pdev.dev; > + struct nfit_test_sec *sec = &dimm_sec_info[dimm]; > + > + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) { > + nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED; > + dev_dbg(dev, "master erase in wrong state\n"); "master secure erase: " for consistency > + } else if (sec->ext_state & 
ND_INTEL_SEC_ESTATE_PLIMIT) {
> +		nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE;
> +		dev_dbg(dev, "master erase in wrong security state\n");

"master secure erase: " for consistency

> +	} else if (memcmp(nd_cmd->passphrase, sec->master_passphrase,
> +			ND_INTEL_PASSPHRASE_SIZE) != 0) {
> +		nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
> +		dev_dbg(dev, "master secure erase: wrong passphrase\n");
> +	} else {
> +		memset(sec->master_passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
> +		sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED;
> +		memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
> +		sec->state = 0;
> +		dev_dbg(dev, "master secure erase: done\n");
> +	}
> +
> +	return 0;
Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages
On 18-11-05 13:20:01, Alexander Duyck wrote:
> +static unsigned long __next_pfn_valid_range(unsigned long *i,
> +					    unsigned long end_pfn)
> {
> -	if (!pfn_valid_within(pfn))
> -		return false;
> -	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> -		return false;
> -	return true;
> +	unsigned long pfn = *i;
> +	unsigned long count;
> +
> +	while (pfn < end_pfn) {
> +		unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
> +		unsigned long pageblock_pfn = min(t, end_pfn);
> +
> +#ifndef CONFIG_HOLES_IN_ZONE
> +		count = pageblock_pfn - pfn;
> +		pfn = pageblock_pfn;
> +		if (!pfn_valid(pfn))
> +			continue;
> +#else
> +		for (count = 0; pfn < pageblock_pfn; pfn++) {
> +			if (pfn_valid_within(pfn)) {
> +				count++;
> +				continue;
> +			}
> +
> +			if (count)
> +				break;
> +		}
> +
> +		if (!count)
> +			continue;
> +#endif
> +		*i = pfn;
> +		return count;
> +	}
> +
> +	return 0;
> }
>
> +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) \
> +	for (i = (start_pfn),						      \
> +	     count = __next_pfn_valid_range(&i, (end_pfn));		      \
> +	     count && ({ pfn = i - count; 1; });			      \
> +	     count = __next_pfn_valid_range(&i, (end_pfn)))

Can this be improved somehow? It took me a while to understand this piece
of code: i is actually the end of the block, not an index by PFN, and
({ pfn = i - count; 1; }) is simply hard to parse. Why can't we make
__next_pfn_valid_range() return both the end and the start of a block?

The rest is good:

Reviewed-by: Pavel Tatashin

Thank you,
Pasha
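Pasha's suggestion, having the helper hand back an explicit [start, end) pair instead of the (end, count) encoding, could look roughly like the userspace sketch below. The toy pfn_valid() and its hole layout are invented for illustration; only the interface shape matters:

```c
#include <stdbool.h>

/* Toy validity map: pfns 0..99 are valid except for a hole at 50..59. */
static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn < 100 && !(pfn >= 50 && pfn < 60);
}

/* Return the next run of valid pfns as an explicit [start, end) pair;
 * false means iteration is over. The caller does no arithmetic, which
 * avoids the hard-to-parse ({ pfn = i - count; 1; }) construct. */
static bool toy_next_valid_range(unsigned long *cursor, unsigned long limit,
				 unsigned long *start, unsigned long *end)
{
	unsigned long pfn = *cursor;

	while (pfn < limit && !toy_pfn_valid(pfn))
		pfn++;
	if (pfn >= limit)
		return false;

	*start = pfn;
	while (pfn < limit && toy_pfn_valid(pfn))
		pfn++;
	*end = pfn;
	*cursor = pfn;
	return true;
}
```

With this shape the for-each macro collapses to a plain `while (toy_next_valid_range(...))` loop, and the loop variable names mean exactly what they say.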