RE: nfit_test.ko.xz needs unknown symbol ....

2018-11-09 Thread Dorau, Lukasz
On Thursday, November 8, 2018 5:32 PM Dan Williams  
wrote:
> On Thu, Nov 8, 2018 at 6:45 AM Dorau, Lukasz  wrote:
> >
> > Hi,
> >
> > I am building and installing the linux kernel from the 'libnvdimm-for-next'
> branch at git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git with
> the 'nfit_test' kernel module.
> >
> > I have set in the .config file:
> >
> > CONFIG_X86_PMEM_LEGACY=m
> > CONFIG_ZONE_DEVICE=y
> > CONFIG_LIBNVDIMM=m
> > CONFIG_BLK_DEV_PMEM=m
> > CONFIG_ND_BLK=m
> > CONFIG_BTT=y
> > CONFIG_NVDIMM_PFN=y
> > CONFIG_NVDIMM_DAX=y
> > CONFIG_DEV_DAX_PMEM=m
> >
> > I ran:
> >
> > $ make
> > $ make M=tools/testing/nvdimm
> > $ sudo make M=tools/testing/nvdimm modules_install
> > $ sudo make modules_install
> >
> > and I got:
> >
> > [...]
> > DEPMOD  4.19.0-rc4-nfit_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-
> nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol libnvdimm_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-
> nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol acpi_nfit_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-
> nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol pmem_test
> > depmod: WARNING: /lib/modules/4.19.0-rc4-
> nfit_test/extra/test/nfit_test.ko.xz needs unknown symbol device_dax_test
> >
> > Does anyone know what could be wrong here?
> 
> Do you get the error if you run the same script again? For external
> modules to find symbols the symbol map needs to be installed, but the
> first time you run the process nothing has been installed. So, it's
> possible to see this error on the first run on a new kernel version,
> but it's benign and should disappear for subsequent builds.
>

My script:

$ make
$ make M=tools/testing/nvdimm

The first time I run 'modules_install' I get:

$ sudo make M=tools/testing/nvdimm modules_install
[...]
  DEPMOD  4.19.0-rc4-nfit_test
depmod: WARNING: could not open 
/lib/modules/4.19.0-rc4-nfit_test/modules.order: No such file or directory
depmod: WARNING: could not open 
/lib/modules/4.19.0-rc4-nfit_test/modules.builtin: No such file or directory

and for subsequent builds I get always the same:

$ sudo make modules_install
[...]
  DEPMOD  4.19.0-rc4-nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol libnvdimm_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol acpi_nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol pmem_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol device_dax_test

$ sudo make M=tools/testing/nvdimm modules_install
[...]
  DEPMOD  4.19.0-rc4-nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol libnvdimm_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol acpi_nfit_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol pmem_test
depmod: WARNING: /lib/modules/4.19.0-rc4-nfit_test/extra/test/nfit_test.ko.xz 
needs unknown symbol device_dax_test

--
Lukasz

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: fsdax memory error handling regression

2018-11-09 Thread Dan Williams
On Tue, Nov 6, 2018 at 10:01 PM Williams, Dan J
 wrote:
>
> On Tue, 2018-11-06 at 06:48 -0800, Matthew Wilcox wrote:
> > On Tue, Nov 06, 2018 at 03:44:47AM +, Williams, Dan J wrote:
> > > Hi Willy,
> > >
> > > I'm seeing the following warning with v4.20-rc1 and the "dax.sh"
> > > test
> > > from the ndctl repository:
> >
> > I'll try to run this myself later today.
> >
> > > I tried to get this test going on -next before the merge window,
> > > but
> > > -next was not bootable for me. Bisection points to:
> > >
> > > 9f32d221301c dax: Convert dax_lock_mapping_entry to XArray
> > >
> > > At first glance I think we need the old "always retry if we slept"
> > > behavior. Otherwise this failure seems similar to the issue fixed
> > > by
> > > Ross' change to always retry on any potential collision:
> > >
> > > b1f382178d15 ext4: close race between direct IO and
> > > ext4_break_layouts()
> > >
> > > I'll take a closer look tomorrow to see if that guess is plausible.
> >
> > If your thought is correct, then this should be all that's needed:
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 616e36ea6aaa..529ac9d7c10a 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -383,11 +383,8 @@ bool dax_lock_mapping_entry(struct page *page)
> >   entry = xas_load(&xas);
> >   if (dax_is_locked(entry)) {
> >   entry = get_unlocked_entry(&xas);
> > - /* Did the page move while we slept? */
> > - if (dax_to_pfn(entry) != page_to_pfn(page)) {
> > - xas_unlock_irq(&xas);
> > - continue;
> > - }
> > + xas_unlock_irq(&xas);
> > + continue;
> >   }
> >   dax_lock_entry(&xas, entry);
> >   xas_unlock_irq(&xas);
>
> No, that doesn't work.
>
> >
> > I don't quite understand how we'd find a PFN for this page in the
> > tree
> > after the page has had page->mapping removed.  However, the more I
> > look
> > at this path, the more I don't like it -- it doesn't handle returning
> > NULL explicitly, nor does it handle the situation where a PMD is
> > split
> > to form multiple PTEs explicitly, it just kind of relies on those bit
> > patterns not matching.
> >
> > So I kind of like the "just retry without doing anything clever"
> > situation
> > that the above patch takes us to.
>
> I've been hacking at this today and am starting to lean towards
> "revert" over "fix" for the amount of changes needed to get this back
> on its feet. I've been able to get the test passing again with the
> below changes directly on top of commit 9f32d221301c "dax: Convert
> dax_lock_mapping_entry to XArray". That said, I have thus far been
> unable to rebase this patch on top of v4.20-rc1 and yield a functional
> result.


Willy? Thoughts? I'm holding off a revert of the fsdax conversion
awaiting whether you see a way to address the concerns incrementally.


> My concerns are:
> - I can't determine if dax_unlock_entry() wants an unlocked entry
> parameter, or locked. The dax_insert_pfn_mkwrite() and
> dax_unlock_mapping_entry() usages seem to disagree.
>
> - The multi-order use case of Xarray is a mystery to me. It seems to
> want to know the order of entries a-priori with a choice to use
> XA_STATE_ORDER() vs XA_STATE(). This falls over in
> dax_unlock_mapping_entry() and other places where the only source of
> the order of the entry is determined from dax_is_pmd_entry() i.e. the
> Xarray itself. PageHead() does not work for DAX pages because
> PageHead() is only established by the page allocator and DAX pages
> never participate in the page allocator.
>
> - The usage of rcu_read_lock() in dax_lock_mapping_entry() is needed
> for inode lifetime synchronization, not just for walking the radix.
> That lock needs to be dropped before sleeping, and if we slept the
> inode may no longer exist.
>
> - I could not see how the pattern:
> entry = xas_load(&xas);
> if (dax_is_locked(entry)) {
> entry = get_unlocked_entry(&xas);
> ...was safe given that get_unlock_entry() turns around and does
> validation that the entry is !xa_internal_entry() and !NULL.
>
> - The usage of internal entries in grab_mapping_entry() seems to need
> auditing. Previously we would compare the entry size against
> @size_flag, but it now if index hits a multi-order entry in
> get_unlocked_entry() afaics it could be internal and we need to convert
> it to the actual entry before aborting... at least to match the v4.19
> behavior.
>
> This all seems to merit a rethink of the dax integration of Xarray.
>
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 1/2] mm: make dev_pagemap_mapping_shift() externally visible

2018-11-09 Thread Barret Rhoden
KVM has a use case for determining the size of a dax mapping.  The KVM
code has easy access to the address and the mm; hence the change in
parameters.

Signed-off-by: Barret Rhoden 
---
 include/linux/mm.h  |  3 +++
 mm/memory-failure.c | 38 +++---
 mm/util.c   | 34 ++
 3 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..51215d695753 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -935,6 +935,9 @@ static inline bool is_pci_p2pdma_page(const struct page 
*page)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+   struct mm_struct *mm);
+
 static inline void get_page(struct page *page)
 {
page = compound_head(page);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..c3f2c6a8607e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -265,40 +265,6 @@ void shake_page(struct page *p, int access)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-   struct vm_area_struct *vma)
-{
-   unsigned long address = vma_address(page, vma);
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-
-   pgd = pgd_offset(vma->vm_mm, address);
-   if (!pgd_present(*pgd))
-   return 0;
-   p4d = p4d_offset(pgd, address);
-   if (!p4d_present(*p4d))
-   return 0;
-   pud = pud_offset(p4d, address);
-   if (!pud_present(*pud))
-   return 0;
-   if (pud_devmap(*pud))
-   return PUD_SHIFT;
-   pmd = pmd_offset(pud, address);
-   if (!pmd_present(*pmd))
-   return 0;
-   if (pmd_devmap(*pmd))
-   return PMD_SHIFT;
-   pte = pte_offset_map(pmd, address);
-   if (!pte_present(*pte))
-   return 0;
-   if (pte_devmap(*pte))
-   return PAGE_SHIFT;
-   return 0;
-}
-
 /*
  * Failure handling: if we can't find or can't kill a process there's
  * not much we can do. We just print a message and ignore otherwise.
@@ -329,7 +295,9 @@ static void add_to_kill(struct task_struct *tsk, struct 
page *p,
tk->addr = page_address_in_vma(p, vma);
tk->addr_valid = 1;
if (is_zone_device_page(p))
-   tk->size_shift = dev_pagemap_mapping_shift(p, vma);
+   tk->size_shift =
+   dev_pagemap_mapping_shift(vma_address(page, vma),
+ vma->vm_mm);
else
tk->size_shift = compound_order(compound_head(p)) + PAGE_SHIFT;
 
diff --git a/mm/util.c b/mm/util.c
index 8bf08b5b5760..61bc9bab931d 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -780,3 +780,37 @@ int get_cmdline(struct task_struct *task, char *buffer, 
int buflen)
 out:
return res;
 }
+
+unsigned long dev_pagemap_mapping_shift(unsigned long address,
+   struct mm_struct *mm)
+{
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(mm, address);
+   if (!pgd_present(*pgd))
+   return 0;
+   p4d = p4d_offset(pgd, address);
+   if (!p4d_present(*p4d))
+   return 0;
+   pud = pud_offset(p4d, address);
+   if (!pud_present(*pud))
+   return 0;
+   if (pud_devmap(*pud))
+   return PUD_SHIFT;
+   pmd = pmd_offset(pud, address);
+   if (!pmd_present(*pmd))
+   return 0;
+   if (pmd_devmap(*pmd))
+   return PMD_SHIFT;
+   pte = pte_offset_map(pmd, address);
+   if (!pte_present(*pte))
+   return 0;
+   if (pte_devmap(*pte))
+   return PAGE_SHIFT;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(dev_pagemap_mapping_shift);
-- 
2.19.1.930.g4563a0d9d0-goog

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 0/2] kvm: Use huge pages for DAX-backed files

2018-11-09 Thread Barret Rhoden
This patch series depends on dax pages not being PageReserved.  Once
that is in place, these changes will let KVM use huge pages with
dax-backed files.  Without the PageReserved change, KVM and DAX still
work with these patches, simply without huge pages - which is the
current situation.

RFC/discussion thread:
https://lore.kernel.org/lkml/20181029210716.212159-1-b...@google.com/

Barret Rhoden (2):
  mm: make dev_pagemap_mapping_shift() externally visible
  kvm: Use huge pages for DAX-backed files

 arch/x86/kvm/mmu.c  | 34 --
 include/linux/mm.h  |  3 +++
 mm/memory-failure.c | 38 +++---
 mm/util.c   | 34 ++
 4 files changed, 72 insertions(+), 37 deletions(-)

-- 
2.19.1.930.g4563a0d9d0-goog

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 2/2] kvm: Use huge pages for DAX-backed files

2018-11-09 Thread Barret Rhoden
This change allows KVM to map DAX-backed files made of huge pages with
huge mappings in the EPT/TDP.

DAX pages are not PageTransCompound.  The existing check is trying to
determine if the mapping for the pfn is a huge mapping or not.  For
non-DAX maps, e.g. hugetlbfs, that means checking PageTransCompound.
For DAX, we can check the page table itself.

Note that KVM already faulted in the page (or huge page) in the host's
page table, and we hold the KVM mmu spinlock (grabbed before checking
the mmu seq).

Signed-off-by: Barret Rhoden 
---
 arch/x86/kvm/mmu.c | 34 --
 1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index cf5f572f2305..2df8c459dc6a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3152,6 +3152,36 @@ static int kvm_handle_bad_page(struct kvm_vcpu *vcpu, 
gfn_t gfn, kvm_pfn_t pfn)
return -EFAULT;
 }
 
+static bool pfn_is_huge_mapped(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn)
+{
+   struct page *page = pfn_to_page(pfn);
+   unsigned long hva, map_shift;
+
+   if (!is_zone_device_page(page))
+   return PageTransCompoundMap(page);
+
+   /*
+* DAX pages do not use compound pages.  The page should have already
+* been mapped into the host-side page table during try_async_pf(), so
+* we can check the page tables directly.
+*/
+   hva = gfn_to_hva(kvm, gfn);
+   if (kvm_is_error_hva(hva))
+   return false;
+
+   /*
+* Our caller grabbed the KVM mmu_lock with a successful
+* mmu_notifier_retry, so we're safe to walk the page table.
+*/
+   map_shift = dev_pagemap_mapping_shift(hva, current->mm);
+   switch (map_shift) {
+   case PMD_SHIFT:
+   case PUD_SIZE:
+   return true;
+   }
+   return false;
+}
+
 static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
gfn_t *gfnp, kvm_pfn_t *pfnp,
int *levelp)
@@ -3168,7 +3198,7 @@ static void transparent_hugepage_adjust(struct kvm_vcpu 
*vcpu,
 */
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
-   PageTransCompoundMap(pfn_to_page(pfn)) &&
+   pfn_is_huge_mapped(vcpu->kvm, gfn, pfn) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
@@ -5678,7 +5708,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 */
if (sp->role.direct &&
!kvm_is_reserved_pfn(pfn) &&
-   PageTransCompoundMap(pfn_to_page(pfn))) {
+   pfn_is_huge_mapped(kvm, sp->gfn, pfn)) {
pte_list_remove(rmap_head, sptep);
need_tlb_flush = 1;
goto restart;
-- 
2.19.1.930.g4563a0d9d0-goog

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 0/7] Deferred page init improvements

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:19:25, Alexander Duyck wrote:
> This patchset is essentially a refactor of the page initialization logic
> that is meant to provide for better code reuse while providing a
> significant improvement in deferred page initialization performance.
> 
> In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> memory per node I have seen the following. In the case of regular memory
> initialization the deferred init time was decreased from 3.75s to 1.06s on
> average. For the persistent memory the initialization time dropped from
> 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> deferred memory initialization performance, and a 26% improvement in the
> persistent memory initialization performance.

Hi Alex,

Please try to run your persistent memory init experiment with Daniel's
patches:

https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/

The performance should improve by much more than 26%.

Overall, your works looks good, but it needs to be considered how easy it will 
be
to merge with ktask. I will try to complete the review today.

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 01/11] keys-encrypted: add nvdimm key format type to encrypted keys

2018-11-09 Thread Dave Jiang
Adding nvdimm key format type to encrypted keys in order to limit the size
of the key to 32bytes.

Signed-off-by: Dave Jiang 
---
 security/keys/encrypted-keys/encrypted.c |   29 -
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/security/keys/encrypted-keys/encrypted.c 
b/security/keys/encrypted-keys/encrypted.c
index d92cbf9687c3..182b4f136bdf 100644
--- a/security/keys/encrypted-keys/encrypted.c
+++ b/security/keys/encrypted-keys/encrypted.c
@@ -45,6 +45,7 @@ static const char hmac_alg[] = "hmac(sha256)";
 static const char blkcipher_alg[] = "cbc(aes)";
 static const char key_format_default[] = "default";
 static const char key_format_ecryptfs[] = "ecryptfs";
+static const char key_format_nvdimm[] = "nvdimm";
 static unsigned int ivsize;
 static int blksize;
 
@@ -54,6 +55,7 @@ static int blksize;
 #define HASH_SIZE SHA256_DIGEST_SIZE
 #define MAX_DATA_SIZE 4096
 #define MIN_DATA_SIZE  20
+#define KEY_NVDIMM_PAYLOAD_LEN 32
 
 static struct crypto_shash *hash_tfm;
 
@@ -62,12 +64,13 @@ enum {
 };
 
 enum {
-   Opt_error = -1, Opt_default, Opt_ecryptfs
+   Opt_error = -1, Opt_default, Opt_ecryptfs, Opt_nvdimm
 };
 
 static const match_table_t key_format_tokens = {
{Opt_default, "default"},
{Opt_ecryptfs, "ecryptfs"},
+   {Opt_nvdimm, "nvdimm"},
{Opt_error, NULL}
 };
 
@@ -195,6 +198,7 @@ static int datablob_parse(char *datablob, const char 
**format,
key_format = match_token(p, key_format_tokens, args);
switch (key_format) {
case Opt_ecryptfs:
+   case Opt_nvdimm:
case Opt_default:
*format = p;
*master_desc = strsep(&datablob, " \t");
@@ -625,15 +629,22 @@ static struct encrypted_key_payload 
*encrypted_key_alloc(struct key *key,
format_len = (!format) ? strlen(key_format_default) : strlen(format);
decrypted_datalen = dlen;
payload_datalen = decrypted_datalen;
-   if (format && !strcmp(format, key_format_ecryptfs)) {
-   if (dlen != ECRYPTFS_MAX_KEY_BYTES) {
-   pr_err("encrypted_key: keylen for the ecryptfs format "
-  "must be equal to %d bytes\n",
-  ECRYPTFS_MAX_KEY_BYTES);
-   return ERR_PTR(-EINVAL);
+   if (format) {
+   if (!strcmp(format, key_format_ecryptfs)) {
+   if (dlen != ECRYPTFS_MAX_KEY_BYTES) {
+   pr_err("encrypted_key: keylen for the ecryptfs 
format must be equal to %d bytes\n",
+   ECRYPTFS_MAX_KEY_BYTES);
+   return ERR_PTR(-EINVAL);
+   }
+   decrypted_datalen = ECRYPTFS_MAX_KEY_BYTES;
+   payload_datalen = sizeof(struct ecryptfs_auth_tok);
+   } else if (!strcmp(format, key_format_nvdimm)) {
+   if (decrypted_datalen != KEY_NVDIMM_PAYLOAD_LEN) {
+   pr_err("encrypted_key: nvdimm key payload 
incorrect length: %d\n",
+   decrypted_datalen);
+   return ERR_PTR(-EINVAL);
+   }
}
-   decrypted_datalen = ECRYPTFS_MAX_KEY_BYTES;
-   payload_datalen = sizeof(struct ecryptfs_auth_tok);
}
 
encrypted_datalen = roundup(decrypted_datalen, blksize);

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 00/11] Additional patches for nvdimm security support

2018-11-09 Thread Dave Jiang
The following series adds additional support for nvdimm security.
1. Converted logon keys to encrypted-keys.
2. Add overwrite DSM support
3. Add DSM 1.8 master passphrase support

The patch series is based off the branch from here:
https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/log/?h=for-5.0/nvdimm-security

Instead of squashing the previous changes, they are kept for history purposes
to document how we arrived to the current iteration.

Mimi,
Patch 1 requires your ack as it makes changes to encrypted-keys by adding
the nvdimm key format type. Dan wanted to restrict the key to 32bytes
during creation. I also wouldn't mind if you take a look at patch 2 and
make sure I'm providing correct usage of encrypted keys. Thank you!

---

Dave Jiang (11):
  keys-encrypted: add nvdimm key format type to encrypted keys
  libnvdimm/security: change clear text nvdimm keys to encrypted keys
  libnvdimm/security: add override module param for key self verification
  libnvdimm/security: introduce NDD_SECURITY_BUSY flag
  acpi/nfit, libnvdimm/security: Add security DSM overwrite support
  tools/testing/nvdimm: Add overwrite support for nfit_test
  libnvdimm/security: add overwrite status notification
  libnvdimm/security: add documentation for ovewrite
  acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support
  tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test
  acpi/nfit: prevent indiscriminate DSM payload dumping for security DSMs


 Documentation/nvdimm/security.txt|   68 +++-
 drivers/acpi/nfit/Kconfig|7 
 drivers/acpi/nfit/core.c |   31 ++
 drivers/acpi/nfit/intel.c|  245 +++
 drivers/acpi/nfit/intel.h|   22 +
 drivers/acpi/nfit/nfit.h |7 
 drivers/nvdimm/core.c|3 
 drivers/nvdimm/dimm.c|3 
 drivers/nvdimm/dimm_devs.c   |   50 +++
 drivers/nvdimm/nd-core.h |9 -
 drivers/nvdimm/nd.h  |   19 +
 drivers/nvdimm/region_devs.c |7 
 drivers/nvdimm/security.c|  496 +-
 include/linux/libnvdimm.h|   22 +
 security/keys/encrypted-keys/encrypted.c |   29 +-
 tools/testing/nvdimm/dimm_devs.c |2 
 tools/testing/nvdimm/test/nfit.c |  141 +
 17 files changed, 895 insertions(+), 266 deletions(-)

--
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 02/11] libnvdimm/security: change clear text nvdimm keys to encrypted keys

2018-11-09 Thread Dave Jiang
In order to make nvdimm more secure, encrypted keys will be used instead of
clear text keys. A master key will be created to seal encrypted nvdimm
keys. The master key can be a trusted key generated from TPM 2.0 or a less
secure user key.

In the process of this conversion, the kernel cached key will be removed
in order to simplify the verification process. The hardware will be used to
verify the decrypted user payload directly.

Signed-off-by: Dave Jiang 
---
 Documentation/nvdimm/security.txt |   29 ++-
 drivers/nvdimm/dimm.c |3 
 drivers/nvdimm/dimm_devs.c|2 
 drivers/nvdimm/nd-core.h  |3 
 drivers/nvdimm/nd.h   |5 -
 drivers/nvdimm/security.c |  316 ++---
 6 files changed, 108 insertions(+), 250 deletions(-)

diff --git a/Documentation/nvdimm/security.txt 
b/Documentation/nvdimm/security.txt
index 50cbb6cb96a1..11240ce48755 100644
--- a/Documentation/nvdimm/security.txt
+++ b/Documentation/nvdimm/security.txt
@@ -39,34 +39,39 @@ The DIMM id would be provided along with the key payload 
(passphrase) to
 the kernel.
 
 The security keys are managed on the basis of a single key per DIMM. The
-key "passphrase" is expected to be 32bytes long or padded to 32bytes. This is
+key "passphrase" is expected to be 32bytes long. This is
 similar to the ATA security specification [2]. A key is initially acquired
 via the request_key() kernel API call and retrieved from userspace. It is up to
 the user to provide an upcall application to retrieve the key in whatever
 fashion meets their security requirements.
 
-A nvdimm user logon key has the description format of:
+A nvdimm encrypted key has the description format of:
 nvdimm:
 
+See Documentation/security/keys/trusted-encrypted.rst for details on encrypted
+keys. The encrypted key for the DIMM is generated from a master key that is
+either a trusted key created via TPM2 or a less secure user key. The payload
+for the nvdimm encrypted key is randomly generated by the kernel and is not
+visible to the user. The user is responsible for retaining the encrypted key
+blob generated from the encrypted key to be loaded at the next boot session.
+
 4. Unlocking
 
 When the DIMMs are being enumerated by the kernel, the kernel will attempt to
-retrieve the key from its keyring. If that fails, it will attempt to
 acquire the key from the userspace upcall function. This is the only time
 a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked
 until reboot.
 
 5. Update
 -
-When doing an update, it is expected that the new key with the 64bit payload of
-format described above is added via the keyutils API or utility. The update
-command written to the sysfs attribute will be with the format:
+When doing an update, it is expected that the new key is added via the
+keyutils API or utility. The update command written to the sysfs attribute
+will be with the format:
 update  
 
-It is expected that a user logon key has been injected via keyutils to provide
-the payload for the update operation. The kernel will take the new user key,
-attempt the update operation with the nvdimm, and replace the existing key's
-payload with the new passphrase.
+It is expected that a encrypted key has been injected via keyutils to provide
+the payload for the update operation. The kernel will take the new user key
+and the old user key to attempt the update operation.
 
 If there is no old key id due to a security enabling, then a 0 should be
 passed in.  If a nvdimm has an existing passphrase, then an "old" key should
@@ -80,7 +85,7 @@ frozen by a user with root privelege.
 7. Disable
 --
 The security disable command format is:
-disable 
+disable 
 
 An "old" key with the passphrase payload that is tied to the nvdimm should be
 injected with a key description that does not have the "nvdimm:" prefix and
@@ -89,7 +94,7 @@ its keyid should be passed in via sysfs.
 8. Secure Erase
 ---
 The command format for doing a secure erase is:
-erase 
+erase 
 
 An "old" key with the passphrase payload that is tied to the nvdimm should be
 injected with a key description that does not have the "nvdimm:" prefix and
diff --git a/drivers/nvdimm/dimm.c b/drivers/nvdimm/dimm.c
index 225a8474c44a..c3886a22b5e4 100644
--- a/drivers/nvdimm/dimm.c
+++ b/drivers/nvdimm/dimm.c
@@ -112,9 +112,6 @@ static int nvdimm_probe(struct device *dev)
 static int nvdimm_remove(struct device *dev)
 {
struct nvdimm_drvdata *ndd = dev_get_drvdata(dev);
-   struct nvdimm *nvdimm = to_nvdimm(dev);
-
-   nvdimm_security_release(nvdimm);
 
if (!ndd)
return 0;
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index 5d34251c4f3b..269f401a5e68 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -485,7 +485,7 @@ struct nvdimm *__nvdimm_create(struct nvdimm_bus 
*nvdimm_bus, void *provider_dat
nvdimm->nu

[PATCH 03/11] libnvdimm/security: add override module param for key self verification

2018-11-09 Thread Dave Jiang
Provide the user an override via kernel module parameter for security key
self verification. no_key_self_verify parameter is being added to bypass
security key verify against the hardware during nvdimm unlock path.

Signed-off-by: Dave Jiang 
---
 drivers/nvdimm/security.c |   11 +++
 1 file changed, 11 insertions(+)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index ee741199d623..d2831e61f3d8 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 /* Copyright(c) 2018 Intel Corporation. All rights reserved. */
 
+#include 
 #include 
 #include 
 #include 
@@ -14,6 +15,10 @@
 #include "nd-core.h"
 #include "nd.h"
 
+static bool no_key_self_verify;
+module_param(no_key_self_verify, bool, 0644);
+MODULE_PARM_DESC(no_key_self_verify, "Bypass security key self verify");
+
 /*
  * Retrieve user injected key
  */
@@ -235,6 +240,12 @@ int nvdimm_security_unlock_dimm(struct nvdimm *nvdimm)
 * other security operations.
 */
if (nvdimm->state == NVDIMM_SECURITY_UNLOCKED) {
+   /* bypass if user override */
+   if (no_key_self_verify) {
+   mutex_unlock(&nvdimm->sec_mutex);
+   return 0;
+   }
+
key = nvdimm_self_verify_key(nvdimm);
if (!key) {
rc = nvdimm_security_freeze_lock(nvdimm);

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 04/11] libnvdimm/security: introduce NDD_SECURITY_BUSY flag

2018-11-09 Thread Dave Jiang
Adding a flag for nvdimm->flags to support erase functions. While it's ok
to hold the nvdimm_bus lock for secure erase due to minimal time to execute
the command, overwrite requires a significantly longer time and makes this
impossible. The flag will block any drivers from being loaded and DIMMs
being probed.

Signed-off-by: Dave Jiang 
---
 drivers/nvdimm/nd.h  |1 +
 drivers/nvdimm/region_devs.c |7 +
 drivers/nvdimm/security.c|   61 ++
 include/linux/libnvdimm.h|2 +
 4 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index f8d8f0a2a40d..6cb1cd4a39d0 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -250,6 +250,7 @@ long nvdimm_clear_poison(struct device *dev, phys_addr_t 
phys,
 void nvdimm_set_aliasing(struct device *dev);
 void nvdimm_set_locked(struct device *dev);
 void nvdimm_clear_locked(struct device *dev);
+int nvdimm_check_security_busy(struct nvdimm *nvdimm);
 struct nd_btt *to_nd_btt(struct device *dev);
 
 struct nd_gen_sb {
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index 174a418cb171..a097282b2c01 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -78,6 +78,13 @@ int nd_region_activate(struct nd_region *nd_region)
for (i = 0; i < nd_region->ndr_mappings; i++) {
struct nd_mapping *nd_mapping = &nd_region->mapping[i];
struct nvdimm *nvdimm = nd_mapping->nvdimm;
+   int rc;
+
+   rc = nvdimm_check_security_busy(nvdimm);
+   if (rc) {
+   nvdimm_bus_unlock(&nd_region->dev);
+   return rc;
+   }
 
/* at least one null hint slot per-dimm for the "no-hint" case 
*/
flush_data_size += sizeof(void *);
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index d2831e61f3d8..2a83be47798e 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -19,6 +19,27 @@ static bool no_key_self_verify;
 module_param(no_key_self_verify, bool, 0644);
 MODULE_PARM_DESC(no_key_self_verify, "Bypass security key self verify");
 
+/*
+ * Check if we are doing security wipes
+ */
+int nvdimm_check_security_busy(struct nvdimm *nvdimm)
+{
+   if (test_bit(NDD_SECURITY_BUSY, &nvdimm->flags))
+   return -EBUSY;
+
+   return 0;
+}
+
+static inline void nvdimm_set_security_busy(struct nvdimm *nvdimm)
+{
+   set_bit(NDD_SECURITY_BUSY, &nvdimm->flags);
+}
+
+static inline void nvdimm_clear_security_busy(struct nvdimm *nvdimm)
+{
+   clear_bit(NDD_SECURITY_BUSY, &nvdimm->flags);
+}
+
 /*
  * Retrieve user injected key
  */
@@ -85,6 +106,13 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, unsigned 
int keyid)
 
nvdimm_bus_lock(dev);
mutex_lock(&nvdimm->sec_mutex);
+   rc = nvdimm_check_security_busy(nvdimm);
+   if (rc < 0) {
+   dev_warn(dev, "Security operation in progress.\n");
+   goto out;
+   }
+
+   nvdimm_set_security_busy(nvdimm);
if (atomic_read(&nvdimm->busy)) {
dev_warn(dev, "Unable to secure erase while DIMM active.\n");
rc = -EBUSY;
@@ -113,6 +141,7 @@ int nvdimm_security_erase(struct nvdimm *nvdimm, unsigned 
int keyid)
key_put(key);
 
  out:
+   nvdimm_clear_security_busy(nvdimm);
mutex_unlock(&nvdimm->sec_mutex);
nvdimm_bus_unlock(dev);
nvdimm_security_get_state(nvdimm);
@@ -130,15 +159,19 @@ int nvdimm_security_freeze_lock(struct nvdimm *nvdimm)
return -EOPNOTSUPP;
 
mutex_lock(&nvdimm->sec_mutex);
+   rc = nvdimm_check_security_busy(nvdimm);
+   if (rc < 0)
+   goto out;
+
rc = nvdimm->security_ops->freeze_lock(nvdimm);
-   if (rc < 0) {
-   mutex_unlock(&nvdimm->sec_mutex);
-   return rc;
-   }
+   if (rc < 0)
+   goto out;
 
nvdimm_security_get_state(nvdimm);
+
+out:
mutex_unlock(&nvdimm->sec_mutex);
-   return 0;
+   return rc;
 }
 
 int nvdimm_security_disable(struct nvdimm *nvdimm, unsigned int keyid)
@@ -156,6 +189,12 @@ int nvdimm_security_disable(struct nvdimm *nvdimm, 
unsigned int keyid)
return -EOPNOTSUPP;
 
mutex_lock(&nvdimm->sec_mutex);
+   rc = nvdimm_check_security_busy(nvdimm);
+   if (rc < 0) {
+   mutex_unlock(&nvdimm->sec_mutex);
+   return rc;
+   }
+
/* look for a key from cached key */
key = nvdimm_lookup_user_key(dev, keyid);
if (!key) {
@@ -232,6 +271,12 @@ int nvdimm_security_unlock_dimm(struct nvdimm *nvdimm)
return 0;
 
mutex_lock(&nvdimm->sec_mutex);
+   rc = nvdimm_check_security_busy(nvdimm);
+   if (rc < 0) {
+   mutex_unlock(&nvdimm->sec_mutex);
+   return rc;
+   }
+

[PATCH 07/11] libnvdimm/security: add overwrite status notification

2018-11-09 Thread Dave Jiang
Adding sysfs notification for when overwrite has completed to allow
user monitoring app to be aware of overwrite completion status.

Signed-off-by: Dave Jiang 
---
 drivers/acpi/nfit/core.c   |5 +
 drivers/nvdimm/dimm_devs.c |   10 ++
 drivers/nvdimm/nd-core.h   |1 +
 drivers/nvdimm/security.c  |2 ++
 include/linux/libnvdimm.h  |1 +
 5 files changed, 19 insertions(+)

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index de4e00059277..3e6c7b653872 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2033,6 +2033,11 @@ static int acpi_nfit_register_dimms(struct 
acpi_nfit_desc *acpi_desc)
if (!nvdimm)
continue;
 
+   rc = nvdimm_setup_security_events(nvdimm);
+   if (rc < 0)
+   dev_warn(acpi_desc->dev,
+   "no security event setup failed\n");
+
nfit_kernfs = sysfs_get_dirent(nvdimm_kobj(nvdimm)->sd, "nfit");
if (nfit_kernfs)
nfit_mem->flags_attr = sysfs_get_dirent(nfit_kernfs,
diff --git a/drivers/nvdimm/dimm_devs.c b/drivers/nvdimm/dimm_devs.c
index b613c131bfb9..39e40074b5df 100644
--- a/drivers/nvdimm/dimm_devs.c
+++ b/drivers/nvdimm/dimm_devs.c
@@ -508,6 +508,16 @@ struct nvdimm *__nvdimm_create(struct nvdimm_bus 
*nvdimm_bus, void *provider_dat
 }
 EXPORT_SYMBOL_GPL(__nvdimm_create);
 
+int nvdimm_setup_security_events(struct nvdimm *nvdimm)
+{
+   nvdimm->overwrite_state = sysfs_get_dirent(nvdimm->dev.kobj.sd,
+   "security");
+   if (!nvdimm->overwrite_state)
+   return -ENODEV;
+   return 0;
+}
+EXPORT_SYMBOL_GPL(nvdimm_setup_security_events);
+
 int alias_dpa_busy(struct device *dev, void *data)
 {
resource_size_t map_end, blk_start, new;
diff --git a/drivers/nvdimm/nd-core.h b/drivers/nvdimm/nd-core.h
index 20a8216c503d..b96e1b10e3eb 100644
--- a/drivers/nvdimm/nd-core.h
+++ b/drivers/nvdimm/nd-core.h
@@ -49,6 +49,7 @@ struct nvdimm {
struct mutex sec_mutex;
struct delayed_work dwork;
unsigned int overwrite_tmo;
+   struct kernfs_node *overwrite_state;
 };
 
 /**
diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 725acd24..f5ba633545b7 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -122,6 +122,8 @@ void nvdimm_overwrite_query(struct work_struct *work)
else
dev_info(&nvdimm->dev, "Overwrite completed\n");
 
+   if (nvdimm->overwrite_state)
+   sysfs_notify_dirent(nvdimm->overwrite_state);
nvdimm->overwrite_tmo = 0;
nvdimm_clear_security_busy(nvdimm);
nvdimm_security_get_state(nvdimm);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 479421ce62c0..c3c5a1c6b1b7 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -227,6 +227,7 @@ static inline struct nvdimm *nvdimm_create(struct 
nvdimm_bus *nvdimm_bus,
cmd_mask, num_flush, flush_wpq, NULL, NULL);
 }
 
+int nvdimm_setup_security_events(struct nvdimm *nvdimm);
 const struct nd_cmd_desc *nd_cmd_dimm_desc(int cmd);
 const struct nd_cmd_desc *nd_cmd_bus_desc(int cmd);
 u32 nd_cmd_in_size(struct nvdimm *nvdimm, int cmd,

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 08/11] libnvdimm/security: add documentation for ovewrite

2018-11-09 Thread Dave Jiang
Add overwrite command usages to security documentation.

Signed-off-by: Dave Jiang 
---
 Documentation/nvdimm/security.txt |   16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/Documentation/nvdimm/security.txt 
b/Documentation/nvdimm/security.txt
index 11240ce48755..dfe70a8fa25b 100644
--- a/Documentation/nvdimm/security.txt
+++ b/Documentation/nvdimm/security.txt
@@ -96,9 +96,19 @@ its keyid should be passed in via sysfs.
 The command format for doing a secure erase is:
 erase 
 
-An "old" key with the passphrase payload that is tied to the nvdimm should be
-injected with a key description that does not have the "nvdimm:" prefix and
-its keyid should be passed in via sysfs.
+9. Overwrite
+
+The command format for doing an overwrite is:
+overwrite 
+
+Overwrite can be done without a key if security is not enabled. A key serial
+of 0 can be passed in to indicate no key.
+
+The sysfs attribute "security" can be polled to wait on overwrite completion.
+Overwrite can last tens of minutes or more depending on nvdimm size.
+
+An encrypted key with the current key passphrase that is tied to the nvdimm
+should be injected and its keyid should be passed in via sysfs.
 
 [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf
 [2]: 
http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 11/11] acpi/nfit: prevent indiscriminate DSM payload dumping for security DSMs

2018-11-09 Thread Dave Jiang
Right now when debug is enabled, we dump the command buffer
indescriminately. This exposes the clear text payload for security DSMs.
Introducing a kernel config to only dump the payload if the config option
is turned on so the production kernels can leave this option off and not
expose the passphrases.

Signed-off-by: Dave Jiang 
---
 drivers/acpi/nfit/Kconfig |7 +++
 drivers/acpi/nfit/core.c  |   24 +---
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/nfit/Kconfig b/drivers/acpi/nfit/Kconfig
index f7c57e33499e..a0a8eabda2e8 100644
--- a/drivers/acpi/nfit/Kconfig
+++ b/drivers/acpi/nfit/Kconfig
@@ -13,3 +13,10 @@ config ACPI_NFIT
 
  To compile this driver as a module, choose M here:
  the module will be called nfit.
+
+config NFIT_SECURITY_DEBUG
+   bool "Turn on debugging for NVDIMM security features"
+   depends on ACPI_NFIT
+   help
+ Turning on debug output for NVDIMM security DSM commands. This
+ should not be turned on on a production kernel.
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 867e6fea3737..baaf5308de35 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -405,6 +405,21 @@ static u8 nfit_dsm_revid(unsigned family, unsigned func)
return id;
 }
 
+static bool is_security_cmd(unsigned int cmd, unsigned int func,
+   unsigned int family)
+{
+   if (cmd != ND_CMD_CALL)
+   return false;
+
+   if (family == NVDIMM_FAMILY_INTEL) {
+   if (func >= NVDIMM_INTEL_GET_SECURITY_STATE &&
+   func <= NVDIMM_INTEL_MASTER_SECURE_ERASE)
+   return true;
+   }
+
+   return false;
+}
+
 int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, struct nvdimm *nvdimm,
unsigned int cmd, void *buf, unsigned int buf_len, int *cmd_rc)
 {
@@ -489,9 +504,12 @@ int acpi_nfit_ctl(struct nvdimm_bus_descriptor *nd_desc, 
struct nvdimm *nvdimm,
 
dev_dbg(dev, "%s cmd: %d: func: %d input length: %d\n",
dimm_name, cmd, func, in_buf.buffer.length);
-   print_hex_dump_debug("nvdimm in  ", DUMP_PREFIX_OFFSET, 4, 4,
-   in_buf.buffer.pointer,
-   min_t(u32, 256, in_buf.buffer.length), true);
+   if ((call_pkg && !is_security_cmd(cmd, func, call_pkg->nd_family)) ||
+   IS_ENABLED(CONFIG_NFIT_SECURITY_DEBUG)) {
+   print_hex_dump_debug("nvdimm in  ", DUMP_PREFIX_OFFSET, 4, 4,
+   in_buf.buffer.pointer,
+   min_t(u32, 256, in_buf.buffer.length), true);
+   }
 
/* call the BIOS, prefer the named methods over _DSM if available */
if (nvdimm && cmd == ND_CMD_GET_CONFIG_SIZE

___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH 06/11] tools/testing/nvdimm: Add overwrite support for nfit_test

2018-11-09 Thread Dave Jiang
With the implementation of Intel NVDIMM DSM overwrite, we are adding unit
test to nfit_test for testing of overwrite operation.

Signed-off-by: Dave Jiang 
---
 drivers/acpi/nfit/intel.h|1 +
 tools/testing/nvdimm/dimm_devs.c |2 +
 tools/testing/nvdimm/test/nfit.c |   55 ++
 3 files changed, 58 insertions(+)

diff --git a/drivers/acpi/nfit/intel.h b/drivers/acpi/nfit/intel.h
index a4e01de966dc..0140d9b6a0b7 100644
--- a/drivers/acpi/nfit/intel.h
+++ b/drivers/acpi/nfit/intel.h
@@ -52,6 +52,7 @@ extern const struct nvdimm_security_ops *intel_security_ops;
 #define ND_INTEL_STATUS_INVALID_PASS   11
 #define ND_INTEL_STATUS_OVERWRITE_UNSUPPORTED  0x10007
 #define ND_INTEL_STATUS_OQUERY_INPROGRESS  0x10007
+#define ND_INTEL_STATUS_OQUERY_SEQUENCE_ERR0x20007
 
 #define ND_INTEL_SEC_STATE_ENABLED 0x02
 #define ND_INTEL_SEC_STATE_LOCKED  0x04
diff --git a/tools/testing/nvdimm/dimm_devs.c b/tools/testing/nvdimm/dimm_devs.c
index de3a22988dc5..3f8ad3d445d2 100644
--- a/tools/testing/nvdimm/dimm_devs.c
+++ b/tools/testing/nvdimm/dimm_devs.c
@@ -29,6 +29,8 @@ ssize_t security_show(struct device *dev,
return sprintf(buf, "locked\n");
case NVDIMM_SECURITY_FROZEN:
return sprintf(buf, "frozen\n");
+   case NVDIMM_SECURITY_OVERWRITE:
+   return sprintf(buf, "overwrite\n");
case NVDIMM_SECURITY_UNSUPPORTED:
default:
return sprintf(buf, "unsupported\n");
diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index e4b560494e0a..c885fe136f42 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -149,6 +149,7 @@ static int dimm_fail_cmd_code[NUM_DCR];
 struct nfit_test_sec {
u8 state;
u8 passphrase[32];
+   u64 overwrite_end_time;
 } dimm_sec_info[NUM_DCR];
 
 static const struct nd_intel_smart smart_def = {
@@ -1073,6 +1074,50 @@ static int nd_intel_test_cmd_secure_erase(struct 
nfit_test *t,
return 0;
 }
 
+static int nd_intel_test_cmd_overwrite(struct nfit_test *t,
+   struct nd_intel_overwrite *nd_cmd,
+   unsigned int buf_len, int dimm)
+{
+   struct device *dev = &t->pdev.dev;
+   struct nfit_test_sec *sec = &dimm_sec_info[dimm];
+
+   if ((sec->state & ND_INTEL_SEC_STATE_ENABLED) &&
+   memcmp(nd_cmd->passphrase, sec->passphrase,
+   ND_INTEL_PASSPHRASE_SIZE) != 0) {
+   nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
+   dev_dbg(dev, "overwrite: wrong passphrase\n");
+   return 0;
+   }
+
+   memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
+   sec->state = ND_INTEL_SEC_STATE_OVERWRITE;
+   dev_dbg(dev, "overwrite progressing.\n");
+   sec->overwrite_end_time = get_jiffies_64() + 5 * HZ;
+
+   return 0;
+}
+
+static int nd_intel_test_cmd_query_overwrite(struct nfit_test *t,
+   struct nd_intel_query_overwrite *nd_cmd,
+   unsigned int buf_len, int dimm)
+{
+   struct device *dev = &t->pdev.dev;
+   struct nfit_test_sec *sec = &dimm_sec_info[dimm];
+
+   if (!(sec->state & ND_INTEL_SEC_STATE_OVERWRITE)) {
+   nd_cmd->status = ND_INTEL_STATUS_OQUERY_SEQUENCE_ERR;
+   return 0;
+   }
+
+   if (time_is_before_jiffies64(sec->overwrite_end_time)) {
+   sec->overwrite_end_time = 0;
+   sec->state = 0;
+   dev_dbg(dev, "overwrite is complete\n");
+   } else
+   nd_cmd->status = ND_INTEL_STATUS_OQUERY_INPROGRESS;
+   return 0;
+}
+
 static int get_dimm(struct nfit_mem *nfit_mem, unsigned int func)
 {
int i;
@@ -1144,6 +1189,14 @@ static int nfit_test_ctl(struct nvdimm_bus_descriptor 
*nd_desc,
rc = nd_intel_test_cmd_secure_erase(t,
buf, buf_len, i);
break;
+   case NVDIMM_INTEL_OVERWRITE:
+   rc = nd_intel_test_cmd_overwrite(t,
+   buf, buf_len, i - t->dcr_idx);
+   break;
+   case NVDIMM_INTEL_QUERY_OVERWRITE:
+   rc = nd_intel_test_cmd_query_overwrite(t,
+   buf, buf_len, i - t->dcr_idx);
+   break;
case ND_INTEL_ENABLE_LSS_STATUS:
rc = nd_intel_test_cmd_set_lss_status(t,
buf, buf_len);
@@ -2379,6 +2432,8 @@ static void nfit_test0_setup(struct nfit_test *t)
set_bit(NVDIMM_INTEL_UNLOCK_UNIT, &acpi_desc->dimm_cmd_force_en);
set_bit(NVDIMM_INTEL_FREEZE_LOCK, &acpi_desc->dimm_cmd_force_en);
set_bit(NVDIMM_INTEL_SECURE_ERASE, &acpi_desc->dimm_cmd

[PATCH 09/11] acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support

2018-11-09 Thread Dave Jiang
With Intel DSM 1.8 [1] two new security DSMs are introduced. Enable/update
master passphrase and master secure erase. The master passphrase allows
a secure erase to be performed without the user passphrase that is set on
the NVDIMM. The commands of master_update and master_erase are added to
the sysfs knob in order to initiate the DSMs. They are similar in opeartion
mechanism compare to update and erase.

[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf

Signed-off-by: Dave Jiang 
---
 Documentation/nvdimm/security.txt |   23 ++
 drivers/acpi/nfit/core.c  |2 +
 drivers/acpi/nfit/intel.c |  132 -
 drivers/acpi/nfit/intel.h |   18 +
 drivers/acpi/nfit/nfit.h  |6 +-
 drivers/nvdimm/dimm_devs.c|   16 
 drivers/nvdimm/nd-core.h  |1 
 drivers/nvdimm/nd.h   |5 +
 drivers/nvdimm/security.c |   34 --
 include/linux/libnvdimm.h |   15 
 10 files changed, 236 insertions(+), 16 deletions(-)

diff --git a/Documentation/nvdimm/security.txt 
b/Documentation/nvdimm/security.txt
index dfe70a8fa25b..a2d954349417 100644
--- a/Documentation/nvdimm/security.txt
+++ b/Documentation/nvdimm/security.txt
@@ -110,5 +110,28 @@ Overwrite can last tens of minutes or more depending on 
nvdimm size.
 An encrypted key with the current key passphrase that is tied to the nvdimm
 should be injected and its keyid should be passed in via sysfs.
 
+10. Master Update
+-
+The command format for doing a master update is:
+update  
+
+The operating mechansim for master update is identical to update except the
+master passphrase key is passed to the kernel. The master passphrase key
+is just another encrypted key.
+
+This command is only available when security is disabled.
+
+11. Master Erase
+
+The command format for doing a master erase is:
+master_erase 
+
+This command has the same operating mechanism as erase except the master
+passphrase key is passed to the kernel. The master passphrase key is just
+another encrypted key.
+
+This command is only available when the master security is enabled, indicated
+by the extended security status.
+
 [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.7.pdf
 [2]: 
http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index 3e6c7b653872..867e6fea3737 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -389,6 +389,8 @@ static u8 nfit_dsm_revid(unsigned family, unsigned func)
[NVDIMM_INTEL_SECURE_ERASE] = 2,
[NVDIMM_INTEL_OVERWRITE] = 2,
[NVDIMM_INTEL_QUERY_OVERWRITE] = 2,
+   [NVDIMM_INTEL_SET_MASTER_PASSPHRASE] = 2,
+   [NVDIMM_INTEL_MASTER_SECURE_ERASE] = 2,
},
};
u8 id;
diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index f6d2b217f1c8..4e47fc2b524c 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -20,6 +20,123 @@
 
 #include  /* for wbinvd_on_all_cpus() on !SMP builds */
 
+static int intel_dimm_security_master_erase(struct nvdimm *nvdimm,
+   const struct nvdimm_key_data *nkey)
+{
+   int cmd_rc, rc = 0;
+   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+   struct {
+   struct nd_cmd_pkg pkg;
+   struct nd_intel_secure_erase cmd;
+   } nd_cmd = {
+   .pkg = {
+   .nd_command = NVDIMM_INTEL_MASTER_SECURE_ERASE,
+   .nd_family = NVDIMM_FAMILY_INTEL,
+   .nd_size_in = ND_INTEL_PASSPHRASE_SIZE,
+   .nd_size_out = ND_INTEL_STATUS_SIZE,
+   .nd_fw_size = ND_INTEL_STATUS_SIZE,
+   },
+   .cmd = {
+   .status = 0,
+   },
+   };
+
+   if (!test_bit(NVDIMM_INTEL_MASTER_SECURE_ERASE, &nfit_mem->dsm_mask))
+   return -ENOTTY;
+
+   /* flush all cache before we erase DIMM */
+   wbinvd_on_all_cpus();
+   memcpy(nd_cmd.cmd.passphrase, nkey->data,
+   sizeof(nd_cmd.cmd.passphrase));
+   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc);
+   if (rc < 0)
+   goto out;
+   if (cmd_rc < 0) {
+   rc = cmd_rc;
+   goto out;
+   }
+
+   switch (nd_cmd.cmd.status) {
+   case 0:
+   break;
+   case ND_INTEL_STATUS_NOT_SUPPORTED:
+   rc = -EOPNOTSUPP;
+   goto out;
+   case ND_INTEL_STATUS_INVALID_PASS:
+   rc = -EINVAL;
+   goto out;
+   case ND_INTEL_STATUS_INVALID_STATE:
+   default:
+   rc = -ENXIO;
+   goto out;
+   }
+
+   /* DIMM erased, invalidate all

[PATCH 05/11] acpi/nfit, libnvdimm/security: Add security DSM overwrite support

2018-11-09 Thread Dave Jiang
We are adding support for the security calls of ovewrite and query
overwrite from Intel DSM spec v1.7. This will allow triggering of
overwrite on Intel NVDIMMs. The overwrite operation can take tens
of minutes. When the overwrite DSM is issued successfully, the NVDIMMs
will be unaccessible. The kernel will do backoff polling to detect when
the overwrite process is completed. According to the DSM spec v1.7,
the 128G NVDIMMs can take up to 15mins to perform overwrite and larger
DIMMs will take longer.

Signed-off-by: Dave Jiang 
---
 drivers/acpi/nfit/intel.c  |  113 +++
 drivers/acpi/nfit/intel.h  |3 +
 drivers/acpi/nfit/nfit.h   |1 
 drivers/nvdimm/core.c  |3 +
 drivers/nvdimm/dimm_devs.c |   22 
 drivers/nvdimm/nd-core.h   |4 ++
 drivers/nvdimm/nd.h|8 +++
 drivers/nvdimm/security.c  |  116 
 include/linux/libnvdimm.h  |4 ++
 9 files changed, 273 insertions(+), 1 deletion(-)

diff --git a/drivers/acpi/nfit/intel.c b/drivers/acpi/nfit/intel.c
index f8f63f483859..f6d2b217f1c8 100644
--- a/drivers/acpi/nfit/intel.c
+++ b/drivers/acpi/nfit/intel.c
@@ -20,6 +20,112 @@
 
 #include  /* for wbinvd_on_all_cpus() on !SMP builds */
 
+static int intel_dimm_security_query_overwrite(struct nvdimm *nvdimm)
+{
+   int cmd_rc, rc = 0;
+   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+   struct {
+   struct nd_cmd_pkg pkg;
+   struct nd_intel_query_overwrite cmd;
+   } nd_cmd = {
+   .pkg = {
+   .nd_command = NVDIMM_INTEL_QUERY_OVERWRITE,
+   .nd_family = NVDIMM_FAMILY_INTEL,
+   .nd_size_in = 0,
+   .nd_size_out = ND_INTEL_STATUS_SIZE,
+   .nd_fw_size = ND_INTEL_STATUS_SIZE,
+   },
+   .cmd = {
+   .status = 0,
+   },
+   };
+
+   if (!test_bit(NVDIMM_INTEL_QUERY_OVERWRITE, &nfit_mem->dsm_mask))
+   return -ENOTTY;
+
+   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc);
+   if (rc < 0)
+   goto out;
+   if (cmd_rc < 0) {
+   rc = cmd_rc;
+   goto out;
+   }
+
+   switch (nd_cmd.cmd.status) {
+   case 0:
+   break;
+   case ND_INTEL_STATUS_OQUERY_INPROGRESS:
+   rc = -EBUSY;
+   goto out;
+   default:
+   rc = -ENXIO;
+   goto out;
+   }
+
+   /* flush all cache before we make the nvdimms available */
+   wbinvd_on_all_cpus();
+   nfit_mem->overwrite = false;
+
+out:
+   return rc;
+}
+
+static int intel_dimm_security_overwrite(struct nvdimm *nvdimm,
+   const struct nvdimm_key_data *nkey)
+{
+   int cmd_rc, rc = 0;
+   struct nfit_mem *nfit_mem = nvdimm_provider_data(nvdimm);
+   struct {
+   struct nd_cmd_pkg pkg;
+   struct nd_intel_overwrite cmd;
+   } nd_cmd = {
+   .pkg = {
+   .nd_command = NVDIMM_INTEL_OVERWRITE,
+   .nd_family = NVDIMM_FAMILY_INTEL,
+   .nd_size_in = ND_INTEL_PASSPHRASE_SIZE,
+   .nd_size_out = ND_INTEL_STATUS_SIZE,
+   .nd_fw_size = ND_INTEL_STATUS_SIZE,
+   },
+   .cmd = {
+   .status = 0,
+   },
+   };
+
+   if (!test_bit(NVDIMM_INTEL_OVERWRITE, &nfit_mem->dsm_mask))
+   return -ENOTTY;
+
+   /* flush all cache before we erase DIMM */
+   wbinvd_on_all_cpus();
+   memcpy(nd_cmd.cmd.passphrase, nkey->data,
+   sizeof(nd_cmd.cmd.passphrase));
+   rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_cmd), &cmd_rc);
+   if (rc < 0)
+   goto out;
+   if (cmd_rc < 0) {
+   rc = cmd_rc;
+   goto out;
+   }
+
+   switch (nd_cmd.cmd.status) {
+   case 0:
+   nfit_mem->overwrite = true;
+   break;
+   case ND_INTEL_STATUS_OVERWRITE_UNSUPPORTED:
+   rc = -ENOTSUPP;
+   break;
+   case ND_INTEL_STATUS_INVALID_PASS:
+   rc = -EINVAL;
+   goto out;
+   case ND_INTEL_STATUS_INVALID_STATE:
+   default:
+   rc = -ENXIO;
+   goto out;
+   }
+
+ out:
+   return rc;
+}
+
 static int intel_dimm_security_erase(struct nvdimm *nvdimm,
const struct nvdimm_key_data *nkey)
 {
@@ -319,6 +425,11 @@ static int intel_dimm_security_state(struct nvdimm *nvdimm,
return 0;
}
 
+   if (nfit_mem->overwrite == true) {
+   *state = NVDIMM_SECURITY_OVERWRITE;
+   return 0;
+   }
+
*state = NVDIMM_SECURITY_DISABLED;
rc = nvdimm_ctl(nvdimm, ND_CMD_CALL, &nd_cmd, sizeof(nd_c

[PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test

2018-11-09 Thread Dave Jiang
Adding test support for new Intel DSM from v1.8. The ability of simulating
master passphrase update and master secure erase have been added to
nfit_test.

Signed-off-by: Dave Jiang 
---
 tools/testing/nvdimm/test/nfit.c |   86 ++
 1 file changed, 86 insertions(+)

diff --git a/tools/testing/nvdimm/test/nfit.c b/tools/testing/nvdimm/test/nfit.c
index c885fe136f42..602f6703614e 100644
--- a/tools/testing/nvdimm/test/nfit.c
+++ b/tools/testing/nvdimm/test/nfit.c
@@ -148,7 +148,9 @@ static unsigned long dimm_fail_cmd_flags[NUM_DCR];
 static int dimm_fail_cmd_code[NUM_DCR];
 struct nfit_test_sec {
u8 state;
+   u8 ext_state;
u8 passphrase[32];
+   u8 master_passphrase[32];
u64 overwrite_end_time;
 } dimm_sec_info[NUM_DCR];
 
@@ -951,6 +953,7 @@ static int nd_intel_test_cmd_security_status(struct 
nfit_test *t,
 
nd_cmd->status = 0;
nd_cmd->state = sec->state;
+   nd_cmd->extended_state = sec->ext_state;
dev_dbg(dev, "security state (%#x) returned\n", nd_cmd->state);
 
return 0;
@@ -1067,7 +1070,9 @@ static int nd_intel_test_cmd_secure_erase(struct 
nfit_test *t,
dev_dbg(dev, "secure erase: wrong passphrase\n");
} else {
memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
+   memset(sec->master_passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
sec->state = 0;
+   sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED;
dev_dbg(dev, "secure erase: done\n");
}
 
@@ -1112,12 +1117,69 @@ static int nd_intel_test_cmd_query_overwrite(struct 
nfit_test *t,
if (time_is_before_jiffies64(sec->overwrite_end_time)) {
sec->overwrite_end_time = 0;
sec->state = 0;
+   sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED;
dev_dbg(dev, "overwrite is complete\n");
} else
nd_cmd->status = ND_INTEL_STATUS_OQUERY_INPROGRESS;
return 0;
 }
 
+static int nd_intel_test_cmd_master_set_pass(struct nfit_test *t,
+   struct nd_intel_set_master_passphrase *nd_cmd,
+   unsigned int buf_len, int dimm)
+{
+   struct device *dev = &t->pdev.dev;
+   struct nfit_test_sec *sec = &dimm_sec_info[dimm];
+
+   if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) {
+   nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED;
+   dev_dbg(dev, "master set passphrase in wrong state\n");
+   } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) {
+   nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE;
+   dev_dbg(dev, "master set passphrase in wrong security state\n");
+   } else if (memcmp(nd_cmd->old_pass, sec->master_passphrase,
+   ND_INTEL_PASSPHRASE_SIZE) != 0) {
+   nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
+   dev_dbg(dev, "master set passphrase wrong passphrase\n");
+   } else {
+   memcpy(sec->master_passphrase, nd_cmd->new_pass,
+   ND_INTEL_PASSPHRASE_SIZE);
+   nd_cmd->status = 0;
+   dev_dbg(dev, "master passphrase updated\n");
+   }
+
+   return 0;
+}
+
+static int nd_intel_test_cmd_master_secure_erase(struct nfit_test *t,
+   struct nd_intel_master_secure_erase *nd_cmd,
+   unsigned int buf_len, int dimm)
+{
+   struct device *dev = &t->pdev.dev;
+   struct nfit_test_sec *sec = &dimm_sec_info[dimm];
+
+   if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) {
+   nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED;
+   dev_dbg(dev, "master erase in wrong state\n");
+   } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) {
+   nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE;
+   dev_dbg(dev, "master erase in wrong security state\n");
+   } else if (memcmp(nd_cmd->passphrase, sec->master_passphrase,
+   ND_INTEL_PASSPHRASE_SIZE) != 0) {
+   nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
+   dev_dbg(dev, "master secure erase: wrong passphrase\n");
+   } else {
+   memset(sec->master_passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
+   sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED;
+   memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
+   sec->state = 0;
+   dev_dbg(dev, "master secure erase: done\n");
+   }
+
+   return 0;
+}
+
+
 static int get_dimm(struct nfit_mem *nfit_mem, unsigned int func)
 {
int i;
@@ -1197,6 +1259,14 @@ static int nfit_test_ctl(struct nvdimm_bus_descriptor 
*nd_desc,
rc = nd_intel_test_cmd_query_overwrite(t,
buf, buf_len, i - t->dcr_idx);
break;
+   case NVDIMM_INTEL_SET_MASTER_PASSPHRASE:

Re: [mm PATCH v5 0/7] Deferred page init improvements

2018-11-09 Thread Alexander Duyck
On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> On 18-11-05 13:19:25, Alexander Duyck wrote:
> > This patchset is essentially a refactor of the page initialization logic
> > that is meant to provide for better code reuse while providing a
> > significant improvement in deferred page initialization performance.
> > 
> > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > memory per node I have seen the following. In the case of regular memory
> > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > average. For the persistent memory the initialization time dropped from
> > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > deferred memory initialization performance, and a 26% improvement in the
> > persistent memory initialization performance.
> 
> Hi Alex,
> 
> Please try to run your persistent memory init experiment with Daniel's
> patches:
> 
> https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/

I've taken a quick look at it. It seems like a bit of a brute force way
to try and speed things up. I would be worried about it potentially
introducing performance issues if the number of CPUs thrown at it end
up exceeding the maximum throughput of the memory.

The data provided with patch 11 seems to point to issues such as that.
In the case of the E7-8895 example cited it is increasing the numbers
of CPUs used from memory initialization from 8 to 72, a 9x increase in
the number of CPUs but it is yeilding only a 3.88x speedup.

> The performance should improve by much more than 26%.

The 26% improvement, or speedup of 1.26x using the ktask approach, was
for persistent memory, not deferred memory init. The ktask patch
doesn't do anything for persistent memory since it is takes the hot-
plug path and isn't handled via the deferred memory init.

I had increased deferred memory init to about 3.53x the original speed
(3.75s to 1.06s) on the system which I was testing. I do agree the two
patches should be able to synergistically boost each other though as
this patch set was meant to make the init much more cache friendly so
as a result it should scale better as you add additional cores. I know
I had done some playing around with fake numa to split up a single node
into 8 logical nodes and I had seen a similar speedup of about 3.85x
with my test memory initializing in about 275ms.

> Overall, your works looks good, but it needs to be considered how easy it 
> will be
> to merge with ktask. I will try to complete the review today.
> 
> Thank you,
> Pasha

Looking over the patches they are still in the RFC stage and the data
is in need of updates since it is referencing 4.15-rc kernels as its
baseline. If anything I really think the ktask patch 11 would be easier
to rebase around my patch set then the other way around. Also, this
series is in Andrew's mmots as of a few days ago, so I think it will be
in the next mmotm that comes out.

The integration with the ktask code should be pretty straight forward.
If anything I think my code would probably make it easier since it gets
rid of the need to do all this in two passes. The only new limitation
it would add is that you would probably want to split up the work along
either max order or section aligned boundaries. What it would
essentially do is make things so that each of the ktask threads would
probably look more like deferred_grow_zone which after my patch set is
actually a fairly simple function.

Thanks.

- Alex




___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator

2018-11-09 Thread Pavel Tatashin
> > > + unsigned long epfn = PFN_DOWN(epa);
> > > + unsigned long spfn = PFN_UP(spa);
> > > +
> > > + /*
> > > +  * Verify the end is at least past the start of the zone and
> > > +  * that we have at least one PFN to initialize.
> > > +  */
> > > + if (zone->zone_start_pfn < epfn && spfn < epfn) {
> > > + /* if we went too far just stop searching */
> > > + if (zone_end_pfn(zone) <= spfn)
> > > + break;
> > 
> > Set *idx = U64_MAX here, then break. This way after we are outside this
> > while loop idx is always equals to U64_MAX.
> 
> Actually I think what you are asking for is the logic that is outside
> of the while loop we are breaking out of. So if you check at the end of
> the function there is the bit of code with the comment "signal end of
> iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to
> ULONG_MAX, and *out_epfn to 0.
> 
> The general idea I had with the function is that you could use either
> the index or spfn < epfn checks to determine if you keep going or not.

Yes, I meant to remove that *idx = U64_MAX after the loop, it is
confusing to have a loop:

while (*idx != U64_MAX) {
...
}

*idx = U64_MAX;


So, it is better to set idx to U643_MAX inside the loop before the
break.

> 
> > 
> > > +
> > > + if (out_spfn)
> > > + *out_spfn = max(zone->zone_start_pfn, spfn);
> > > + if (out_epfn)
> > > + *out_epfn = min(zone_end_pfn(zone), epfn);
> > 
> > Don't we need to verify after adjustment that out_spfn != out_epfn, so
> > there is at least one PFN to initialize?
> 
> We have a few checks that I believe prevent that. Before we get to this
> point we have verified the following:
>   zone->zone_start < epfn
>   spfn < epfn
> 
> The other check that should be helping to prevent that is the break
> statement above that is forcing us to exit if spfn is somehow already
> past the end of the zone, that essentially maps out:
>   spfn < zone_end_pfn(zone)
> 
> So the only check we don't have is:
>   zone->zone_start < zone_end_pfn(zone)
> 
> If I am not mistaken that is supposed to be a given is it not? I would
> assume we don't have any zones that are completely empty or inverted
> that would be called here do we?


if (zone_end_pfn(zone) <= spfn) won't break

Equal sign in <= here takes care of the case I was thinking. Yes, logic looks 
good.

Thank you
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


[PATCH] acpi/nfit, device-dax: Identify differentiated memory with a unique numa-node

2018-11-09 Thread Dan Williams
Persistent memory, as described by the ACPI NFIT (NVDIMM Firmware
Interface Table), is the first known instance of a memory range
described by a unique "target" proximity domain. Where "initiator" and
"target" proximity domains is an approach that the ACPI HMAT
(Heterogeneous Memory Attributes Table) uses to described the unique
performance properties of a memory range relative to a given initiator
(e.g. CPU or DMA device).

Currently the numa-node for a /dev/pmemX block-device or /dev/daxX.Y
char-device follows the traditional notion of 'numa-node' where the
attribute conveys the closest online numa-node. That numa-node attribute
is useful for cpu-binding and memory-binding processes *near* the
device. However, when the memory range backing a 'pmem', or 'dax' device
is onlined (memory hot-add) the memory-only-numa-node representing that
address needs to be differentiated from the set of online nodes. In
other words, the numa-node association of the device depends on whether
you can bind processes *near* the cpu-numa-node in the offline
device-case, or bind process *on* the memory-range directly after the
backing address range is onlined.

Allow for the case that platform firmware describes persistent memory
with a unique proximity domain, i.e. when it is distinct from the
proximity of DRAM and CPUs that are on the same socket. Plumb the Linux
numa-node translation of that proximity through the libnvdimm region
device to namespaces that are in device-dax mode. With this in place the
proposed kmem driver [1] can optionally discover a unique numa-node
number for the address range as it transitions the memory from an
offline state managed by a device-driver to an online memory range
managed by the core-mm.

[1]: https://lkml.org/lkml/2018/10/23/9

Reported-by: Fan Du 
Cc: Michael Ellerman 
Cc: "Oliver O'Halloran" 
Cc: Dave Hansen 
Cc: Jérôme Glisse 
Signed-off-by: Dan Williams 
---
 arch/powerpc/platforms/pseries/papr_scm.c |1 +
 drivers/acpi/nfit/core.c  |8 ++--
 drivers/acpi/numa.c   |1 +
 drivers/dax/bus.c |4 +++-
 drivers/dax/bus.h |3 ++-
 drivers/dax/dax-private.h |4 
 drivers/dax/pmem/core.c   |4 +++-
 drivers/nvdimm/e820.c |1 +
 drivers/nvdimm/nd.h   |2 +-
 drivers/nvdimm/of_pmem.c  |1 +
 drivers/nvdimm/region_devs.c  |1 +
 include/linux/libnvdimm.h |1 +
 12 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
b/arch/powerpc/platforms/pseries/papr_scm.c
index ee9372b65ca5..6a0a35b872d1 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -233,6 +233,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
memset(&ndr_desc, 0, sizeof(ndr_desc));
ndr_desc.attr_groups = region_attr_groups;
ndr_desc.numa_node = dev_to_node(&p->pdev->dev);
+   ndr_desc.target_node = ndr_desc.numa_node;
ndr_desc.res = &p->res;
ndr_desc.of_node = p->dn;
ndr_desc.provider_data = p;
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index f8c638f3c946..2225e3de33ac 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2825,11 +2825,15 @@ static int acpi_nfit_register_region(struct 
acpi_nfit_desc *acpi_desc,
ndr_desc->res = &res;
ndr_desc->provider_data = nfit_spa;
ndr_desc->attr_groups = acpi_nfit_region_attribute_groups;
-   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID)
+   if (spa->flags & ACPI_NFIT_PROXIMITY_VALID) {
ndr_desc->numa_node = acpi_map_pxm_to_online_node(
spa->proximity_domain);
-   else
+   ndr_desc->target_node = acpi_map_pxm_to_node(
+   spa->proximity_domain);
+   } else {
ndr_desc->numa_node = NUMA_NO_NODE;
+   ndr_desc->target_node = NUMA_NO_NODE;
+   }
 
/*
 * Persistence domain bits are hierarchical, if
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 274699463b4f..b9d86babb13a 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -84,6 +84,7 @@ int acpi_map_pxm_to_node(int pxm)
 
return node;
 }
+EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
 /**
  * acpi_map_pxm_to_online_node - Map proximity ID to online node
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 568168500217..c620ad52d7e5 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -214,7 +214,7 @@ static void dax_region_unregister(void *region)
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-   struct resource *res, unsigned int align,
+   struct resource *res, int target_node, unsigned int align,
unsigned long p

Re: [mm PATCH v5 0/7] Deferred page init improvements

2018-11-09 Thread Pavel Tatashin
On 18-11-09 15:14:35, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > This patchset is essentially a refactor of the page initialization logic
> > > that is meant to provide for better code reuse while providing a
> > > significant improvement in deferred page initialization performance.
> > > 
> > > In my testing on an x86_64 system with 384GB of RAM and 3TB of persistent
> > > memory per node I have seen the following. In the case of regular memory
> > > initialization the deferred init time was decreased from 3.75s to 1.06s on
> > > average. For the persistent memory the initialization time dropped from
> > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > deferred memory initialization performance, and a 26% improvement in the
> > > persistent memory initialization performance.
> > 
> > Hi Alex,
> > 
> > Please try to run your persistent memory init experiment with Daniel's
> > patches:
> > 
> > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/
> 
> I've taken a quick look at it. It seems like a bit of a brute force way
> to try and speed things up. I would be worried about it potentially

There is a limit to max number of threads that ktasks start. The memory
throughput is *much* higher than what one CPU can maxout in a node, so
there is no reason to leave the other CPUs sit idle during boot when
they can help to initialize.

> introducing performance issues if the number of CPUs thrown at it end
> up exceeding the maximum throughput of the memory.
> 
> The data provided with patch 11 seems to point to issues such as that.
> In the case of the E7-8895 example cited it is increasing the numbers
> of CPUs used from memory initialization from 8 to 72, a 9x increase in
> the number of CPUs but it is yeilding only a 3.88x speedup.

Yes, but in both cases we are far from maxing out the memory throughput.
The 3.88x is indeed low, and I do not know what slows it down.

Daniel,

Could you please check why multi-threading efficiency is so low here?

I bet, there is some atomic operation introduces a contention within a
node. It should be possible to resolve.

> 
> > The performance should improve by much more than 26%.
> 
> The 26% improvement, or speedup of 1.26x using the ktask approach, was
> for persistent memory, not deferred memory init. The ktask patch
> doesn't do anything for persistent memory since it is takes the hot-
> plug path and isn't handled via the deferred memory init.

Ah, I thought in your experiment persistent memory takes deferred init
path. So, what exactly in your patches make this 1.26x speedup?

> 
> I had increased deferred memory init to about 3.53x the original speed
> (3.75s to 1.06s) on the system which I was testing. I do agree the two
> patches should be able to synergistically boost each other though as
> this patch set was meant to make the init much more cache friendly so
> as a result it should scale better as you add additional cores. I know
> I had done some playing around with fake numa to split up a single node
> into 8 logical nodes and I had seen a similar speedup of about 3.85x
> with my test memory initializing in about 275ms.
> 
> > Overall, your works looks good, but it needs to be considered how easy it 
> > will be
> > to merge with ktask. I will try to complete the review today.
> > 
> > Thank you,
> > Pasha
> 
> Looking over the patches they are still in the RFC stage and the data
> is in need of updates since it is referencing 4.15-rc kernels as its
> baseline. If anything I really think the ktask patch 11 would be easier
> to rebase around my patch set then the other way around. Also, this
> series is in Andrew's mmots as of a few days ago, so I think it will be
> in the next mmotm that comes out.

I do not disagree, I think these two patch series should complement each
other. But, if your changes make it impossible for ktask, I would strongly argue
against it, as the potential improvements with ktasks are much higher.
But, so far I do not see anything, so I think they can work together. I
am still reviewing your work.

> 
> The integration with the ktask code should be pretty straight forward.
> If anything I think my code would probably make it easier since it gets
> rid of the need to do all this in two passes. The only new limitation
> it would add is that you would probably want to split up the work along
> either max order or section aligned boundaries. What it would

Which is totally OK, it should make ktasks scale even better.

> essentially do is make things so that each of the ktask threads would
> probably look more like deferred_grow_zone which after my patch set is
> actually a fairly simple function.
> 
> Thanks.

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator

2018-11-09 Thread Alexander Duyck
On Fri, 2018-11-09 at 18:26 -0500, Pavel Tatashin wrote:
> > +/**
> > + * for_each_free_mem_range_in_zone - iterate through zone specific free
> > + * memblock areas
> > + * @i: u64 used as loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @p_start: ptr to phys_addr_t for start address of the range, can be 
> > %NULL
> > + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> > + *
> > + * Walks over free (memory && !reserved) areas of memblock in a specific
> > + * zone. Available as soon as memblock is initialized.
> > + */
> > +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end)   
> > \
> > +   for (i = 0, \
> > +__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
> > +i != (u64)ULLONG_MAX;  \
> > +__next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> > +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> 
> Use U64_MAX instead of ULLONG_MAX, and avoid u64 cast. I know other
> places in this file use UULONG_MAX with cast, but I think U64_MAX is
> better.

Okay, maybe I will submit a follow up that just cleans all of these up.

> > +
> >  /**
> >   * for_each_free_mem_range - iterate through free memblock areas
> >   * @i: u64 used as loop variable
> > diff --git a/mm/memblock.c b/mm/memblock.c
> > index 7df468c8ebc8..f1d1fbfd1ae7 100644
> > --- a/mm/memblock.c
> > +++ b/mm/memblock.c
> > @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t 
> > base, phys_addr_t size,
> > return 0;
> >  }
> >  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> > +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> > +/**
> > + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> > + *
> > + * @idx: pointer to u64 loop variable
> > + * @zone: zone in which all of the memory blocks reside
> > + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> > + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> > + *
> > + * This function is meant to be a zone/pfn specific wrapper for the
> > + * for_each_mem_range type iterators. Specifically they are used in the
> > + * deferred memory init routines and as such we were duplicating much of
> > + * this logic throughout the code. So instead of having it in multiple
> > + * locations it seemed like it would make more sense to centralize this to
> > + * one new iterator that does everything they need.
> > + */
> > +void __init_memblock
> > +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> > +unsigned long *out_spfn, unsigned long *out_epfn)
> > +{
> > +   int zone_nid = zone_to_nid(zone);
> > +   phys_addr_t spa, epa;
> > +   int nid;
> > +
> > +   __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> > +&memblock.memory, &memblock.reserved,
> > +&spa, &epa, &nid);
> > +
> > +   while (*idx != ULLONG_MAX) {
> 
> Ditto, use U64_MAX

Same here.

> > +   unsigned long epfn = PFN_DOWN(epa);
> > +   unsigned long spfn = PFN_UP(spa);
> > +
> > +   /*
> > +* Verify the end is at least past the start of the zone and
> > +* that we have at least one PFN to initialize.
> > +*/
> > +   if (zone->zone_start_pfn < epfn && spfn < epfn) {
> > +   /* if we went too far just stop searching */
> > +   if (zone_end_pfn(zone) <= spfn)
> > +   break;
> 
> Set *idx = U64_MAX here, then break. This way after we are outside this
> while loop idx is always equals to U64_MAX.

Actually I think what you are asking for is the logic that is outside
of the while loop we are breaking out of. So if you check at the end of
the function there is the bit of code with the comment "signal end of
iteration" where I end up setting *idx to ULLONG_MAX, *out_spfn to
ULONG_MAX, and *out_epfn to 0.

The general idea I had with the function is that you could use either
the index or spfn < epfn checks to determine if you keep going or not.

> 
> > +
> > +   if (out_spfn)
> > +   *out_spfn = max(zone->zone_start_pfn, spfn);
> > +   if (out_epfn)
> > +   *out_epfn = min(zone_end_pfn(zone), epfn);
> 
> Don't we need to verify after adjustment that out_spfn != out_epfn, so
> there is at least one PFN to initialize?

We have a few checks that I believe prevent that. Before we get to this
point we have verified the following:
zone->zone_start < epfn
spfn < epfn

The other check that should be helping to prevent that is the break
statement above that is forcing us to exit if spfn is somehow already
past the end of the zone, that essentially maps out:
spfn < zone_end_pfn(zone)

So the only check we don't have is:
zone->zone_start < zone_end_pfn(zone)

If I am not mistaken that is supp

Re: [mm PATCH v5 3/7] mm: Implement new zone specific memblock iterator

2018-11-09 Thread Pavel Tatashin
> +/**
> + * for_each_free_mem_range_in_zone - iterate through zone specific free
> + * memblock areas
> + * @i: u64 used as loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
> + * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
> + *
> + * Walks over free (memory && !reserved) areas of memblock in a specific
> + * zone. Available as soon as memblock is initialized.
> + */
> +#define for_each_free_mem_pfn_range_in_zone(i, zone, p_start, p_end) \
> + for (i = 0, \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end);\
> +  i != (u64)ULLONG_MAX;  \
> +  __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
> +#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

Use U64_MAX instead of ULLONG_MAX, and avoid u64 cast. I know other
places in this file use UULONG_MAX with cast, but I think U64_MAX is
better.

> +
>  /**
>   * for_each_free_mem_range - iterate through free memblock areas
>   * @i: u64 used as loop variable
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 7df468c8ebc8..f1d1fbfd1ae7 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1239,6 +1239,69 @@ int __init_memblock memblock_set_node(phys_addr_t 
> base, phys_addr_t size,
>   return 0;
>  }
>  #endif /* CONFIG_HAVE_MEMBLOCK_NODE_MAP */
> +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
> +/**
> + * __next_mem_pfn_range_in_zone - iterator for for_each_*_range_in_zone()
> + *
> + * @idx: pointer to u64 loop variable
> + * @zone: zone in which all of the memory blocks reside
> + * @out_start: ptr to ulong for start pfn of the range, can be %NULL
> + * @out_end: ptr to ulong for end pfn of the range, can be %NULL
> + *
> + * This function is meant to be a zone/pfn specific wrapper for the
> + * for_each_mem_range type iterators. Specifically they are used in the
> + * deferred memory init routines and as such we were duplicating much of
> + * this logic throughout the code. So instead of having it in multiple
> + * locations it seemed like it would make more sense to centralize this to
> + * one new iterator that does everything they need.
> + */
> +void __init_memblock
> +__next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
> +  unsigned long *out_spfn, unsigned long *out_epfn)
> +{
> + int zone_nid = zone_to_nid(zone);
> + phys_addr_t spa, epa;
> + int nid;
> +
> + __next_mem_range(idx, zone_nid, MEMBLOCK_NONE,
> +  &memblock.memory, &memblock.reserved,
> +  &spa, &epa, &nid);
> +
> + while (*idx != ULLONG_MAX) {

Ditto, use U64_MAX

> + unsigned long epfn = PFN_DOWN(epa);
> + unsigned long spfn = PFN_UP(spa);
> +
> + /*
> +  * Verify the end is at least past the start of the zone and
> +  * that we have at least one PFN to initialize.
> +  */
> + if (zone->zone_start_pfn < epfn && spfn < epfn) {
> + /* if we went too far just stop searching */
> + if (zone_end_pfn(zone) <= spfn)
> + break;

Set *idx = U64_MAX here, then break. This way after we are outside this
while loop idx is always equals to U64_MAX.

> +
> + if (out_spfn)
> + *out_spfn = max(zone->zone_start_pfn, spfn);
> + if (out_epfn)
> + *out_epfn = min(zone_end_pfn(zone), epfn);

Don't we need to verify after adjustment that out_spfn != out_epfn, so
there is at least one PFN to initialize?

The rest looks good. Once the above is fixed:

Reviewed-by: Pavel Tatashin 

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 0/7] Deferred page init improvements

2018-11-09 Thread Alexander Duyck
On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> On 18-11-09 15:14:35, Alexander Duyck wrote:
> > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > This patchset is essentially a refactor of the page initialization logic
> > > > that is meant to provide for better code reuse while providing a
> > > > significant improvement in deferred page initialization performance.
> > > > 
> > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of 
> > > > persistent
> > > > memory per node I have seen the following. In the case of regular memory
> > > > initialization the deferred init time was decreased from 3.75s to 1.06s 
> > > > on
> > > > average. For the persistent memory the initialization time dropped from
> > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for the
> > > > deferred memory initialization performance, and a 26% improvement in the
> > > > persistent memory initialization performance.
> > > 
> > > Hi Alex,
> > > 
> > > Please try to run your persistent memory init experiment with Daniel's
> > > patches:
> > > 
> > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/
> > 
> > I've taken a quick look at it. It seems like a bit of a brute force way
> > to try and speed things up. I would be worried about it potentially
> 
> There is a limit to max number of threads that ktasks start. The memory
> throughput is *much* higher than what one CPU can maxout in a node, so
> there is no reason to leave the other CPUs sit idle during boot when
> they can help to initialize.

Right, but right now that limit can still be pretty big when it is
something like 25% of all the CPUs on a 288 CPU system.

One issue is the way the code was ends up essentially blowing out the
cache over and over again. Doing things in two passes made it really
expensive as you took one cache miss to initialize it, and another to
free it. I think getting rid of that is one of the biggest gains with
my patch set.

> > introducing performance issues if the number of CPUs thrown at it end
> > up exceeding the maximum throughput of the memory.
> > 
> > The data provided with patch 11 seems to point to issues such as that.
> > In the case of the E7-8895 example cited it is increasing the numbers
> > of CPUs used from memory initialization from 8 to 72, a 9x increase in
> > the number of CPUs but it is yeilding only a 3.88x speedup.
> 
> Yes, but in both cases we are far from maxing out the memory throughput.
> The 3.88x is indeed low, and I do not know what slows it down.
> 
> Daniel,
> 
> Could you please check why multi-threading efficiency is so low here?
> 
> I bet, there is some atomic operation introduces a contention within a
> node. It should be possible to resolve.

Depending on what kernel this was based on it probably was something
like SetPageReserved triggering pipeline stalls, or maybe it is freeing
the pages themselves since I would imagine that has a pretty big
overhead an may end up serializing some things.

> > 
> > > The performance should improve by much more than 26%.
> > 
> > The 26% improvement, or speedup of 1.26x using the ktask approach, was
> > for persistent memory, not deferred memory init. The ktask patch
> > doesn't do anything for persistent memory since it is takes the hot-
> > plug path and isn't handled via the deferred memory init.
> 
> Ah, I thought in your experiment persistent memory takes deferred init
> path. So, what exactly in your patches make this 1.26x speedup?

Basically it is the folding of the reserved bit into the setting of the
rest of the values written into the page flags. With that change we are
able to tighten the memory initialization loop to the point where it
almost looks like a memset case as it will put together all of the
memory values outside of the loop and then just loop through assigning
the values for each page.

> > 
> > I had increased deferred memory init to about 3.53x the original speed
> > (3.75s to 1.06s) on the system which I was testing. I do agree the two
> > patches should be able to synergistically boost each other though as
> > this patch set was meant to make the init much more cache friendly so
> > as a result it should scale better as you add additional cores. I know
> > I had done some playing around with fake numa to split up a single node
> > into 8 logical nodes and I had seen a similar speedup of about 3.85x
> > with my test memory initializing in about 275ms.

It is also possible that the the 3.8x is an upper limit for something
on the system in terms of scaling. I just realized that the 3.85x I saw
with an 8 way split over a 4 socket system isn't too different from the
3.88x that was recorded for a 9 way split over an 8 socket system.

> > > Overall, your works looks good, but it needs to be considered how easy it 
> > > will be
> > > to merge with ktask. I will try to complete the review today.
> > > 
> > > Thank you,

Re: [mm PATCH v5 4/7] mm: Initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:19:45, Alexander Duyck wrote:
>   }
> - first_init_pfn = max(zone->zone_start_pfn, first_init_pfn);
> +
> + /* If the zone is empty somebody else may have cleared out the zone */
> + if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +  first_init_pfn)) {
> + pgdat_resize_unlock(pgdat, &flags);
> + pgdat_init_report_one_done();
> + return 0;

It would make more sense to add goto to the end of this function and report
that in Node X had 0 pages initialized. Otherwise, it will be confusing
why some nodes are not initialized during boot. It is simply because
they do not have deferred pages left to be initialized.


> + }
>  
>   /*
> -  * Initialize and free pages. We do it in two loops: first we initialize
> -  * struct page, than free to buddy allocator, because while we are
> -  * freeing pages we can access pages that are ahead (computing buddy
> -  * page in __free_one_page()).
> +  * Initialize and free pages in MAX_ORDER sized increments so
> +  * that we can avoid introducing any issues with the buddy
> +  * allocator.
>*/
> - for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> - spfn = max_t(unsigned long, first_init_pfn, spfn);
> - nr_pages += deferred_init_pages(zone, spfn, epfn);
> - }
> - for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
> - spfn = max_t(unsigned long, first_init_pfn, spfn);
> - deferred_free_pages(spfn, epfn);
> - }
> + while (spfn < epfn)
> + nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +
>   pgdat_resize_unlock(pgdat, &flags);
>  
>   /* Sanity check that the next zone really is unpopulated */
> @@ -1602,9 +1689,9 @@ deferred_grow_zone(struct zone *zone, unsigned int 
> order)
>  {

How about order = max(order, max_order)?

>   unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);


> - first_init_pfn = max(zone->zone_start_pfn, first_deferred_pfn);
> -
> - if (first_init_pfn >= pgdat_end_pfn(pgdat)) {
> + /* If the zone is empty somebody else may have cleared out the zone */
> + if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> +  first_deferred_pfn)) {
>   pgdat_resize_unlock(pgdat, &flags);
> - return false;
> + return true;

I am OK to change to true here, please also set
pgdat->first_deferred_pfn to ULONG_MAX to indicate that there are no
more deferred pages in this node.


Overall, I like this patch, makes things a lot easier, assuming the
above is addressed:

Reviewed-by: Pavel Tatashin 

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 0/7] Deferred page init improvements

2018-11-09 Thread Pavel Tatashin
On 18-11-09 16:46:02, Alexander Duyck wrote:
> On Fri, 2018-11-09 at 19:00 -0500, Pavel Tatashin wrote:
> > On 18-11-09 15:14:35, Alexander Duyck wrote:
> > > On Fri, 2018-11-09 at 16:15 -0500, Pavel Tatashin wrote:
> > > > On 18-11-05 13:19:25, Alexander Duyck wrote:
> > > > > This patchset is essentially a refactor of the page initialization 
> > > > > logic
> > > > > that is meant to provide for better code reuse while providing a
> > > > > significant improvement in deferred page initialization performance.
> > > > > 
> > > > > In my testing on an x86_64 system with 384GB of RAM and 3TB of 
> > > > > persistent
> > > > > memory per node I have seen the following. In the case of regular 
> > > > > memory
> > > > > initialization the deferred init time was decreased from 3.75s to 
> > > > > 1.06s on
> > > > > average. For the persistent memory the initialization time dropped 
> > > > > from
> > > > > 24.17s to 19.12s on average. This amounts to a 253% improvement for 
> > > > > the
> > > > > deferred memory initialization performance, and a 26% improvement in 
> > > > > the
> > > > > persistent memory initialization performance.
> > > > 
> > > > Hi Alex,
> > > > 
> > > > Please try to run your persistent memory init experiment with Daniel's
> > > > patches:
> > > > 
> > > > https://lore.kernel.org/lkml/20181105165558.11698-1-daniel.m.jor...@oracle.com/
> > > 
> > > I've taken a quick look at it. It seems like a bit of a brute force way
> > > to try and speed things up. I would be worried about it potentially
> > 
> > There is a limit to max number of threads that ktasks start. The memory
> > throughput is *much* higher than what one CPU can maxout in a node, so
> > there is no reason to leave the other CPUs sit idle during boot when
> > they can help to initialize.
> 
> Right, but right now that limit can still be pretty big when it is
> something like 25% of all the CPUs on a 288 CPU system.

It is still OK. About 9 threads per node.

That machine has 1T of memory, which means 8 nodes need to initialize 2G
of memory each. With 46G/s throughout it should take 0.043s Which is 10
times higher than what Daniel sees with 0.325s, so there is still room
to saturate the memory throughput.

Now, if the multi-threadding efficiency is good, it should take
1.261s / 9 threads =  0.14s

> 
> One issue is the way the code was ends up essentially blowing out the
> cache over and over again. Doing things in two passes made it really
> expensive as you took one cache miss to initialize it, and another to
> free it. I think getting rid of that is one of the biggest gains with
> my patch set.

I am not arguing that your patches make sense, all I am saying that
ktasks improve time order of magnitude better on machines with large
amount of memory.

Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [PATCH 02/11] libnvdimm/security: change clear text nvdimm keys to encrypted keys

2018-11-09 Thread Dan Williams
On Fri, Nov 9, 2018 at 2:13 PM Dave Jiang  wrote:
>
> In order to make nvdimm more secure, encrypted keys will be used instead of
> clear text keys. A master key will be created to seal encrypted nvdimm
> keys. The master key can be a trusted key generated from TPM 2.0 or a less
> secure user key.
>
> In the process of this conversion, the kernel cached key will be removed
> in order to simplify the verification process. The hardware will be used to
> verify the decrypted user payload directly.
>
> Signed-off-by: Dave Jiang 
> ---
>  Documentation/nvdimm/security.txt |   29 ++-
>  drivers/nvdimm/dimm.c |3
>  drivers/nvdimm/dimm_devs.c|2
>  drivers/nvdimm/nd-core.h  |3
>  drivers/nvdimm/nd.h   |5 -
>  drivers/nvdimm/security.c |  316 
> ++---
>  6 files changed, 108 insertions(+), 250 deletions(-)

Remove twice the amount of code that it adds and gains features /
security, nice!
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 5/7] mm: Move hot-plug specific memory init into separate functions and optimize

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:19:50, Alexander Duyck wrote:
> This patch is going through and combining the bits in memmap_init_zone and
> memmap_init_zone_device that are related to hotplug into a single function
> called __memmap_init_hotplug.
> 
> I also took the opportunity to integrate __init_single_page's functionality
> into this function. In doing so I can get rid of some of the redundancy
> such as the LRU pointers versus the pgmap.

Please don't do it, __init_single_page() is hard function to optimize,
do not copy its code. Instead could you you split __init_single_page()
in two parts, something like this:

static inline init_single_page_nolru(struct page *page, unsigned long pfn,
   unsigned long zone, int nid) {
mm_zero_struct_page(page);
set_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
page_cpupid_reset_last(page);
#ifdef WANT_PAGE_VIRTUAL
/* The shift won't overflow because ZONE_NORMAL is below 4G. */
if (!is_highmem_idx(zone))
set_page_address(page, __va(pfn << PAGE_SHIFT));
#endif
}


static void __meminit init_single_page(struct page *page, unsigned long pfn, 
unsigned long zone, int nid) 
{
init_single_page_nolru(page, pfn, zone, nid);
INIT_LIST_HEAD(&page->lru);
}

And call init_single_page_nolru() from __init_pageblock() ? Also, remove
WANT_PAGE_VIRTUAL optimization, I do not think it worse it.

The rest looks very good, please do the above change.

Reviewed-by: Pavel Tatashin 

> 
> Signed-off-by: Alexander Duyck 
> ---
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 6/7] mm: Add reserved flag setting to set_page_links

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:19:56, Alexander Duyck wrote:
> This patch modifies the set_page_links function to include the setting of
> the reserved flag via a simple AND and OR operation. The motivation for
> this is the fact that the existing __set_bit call still seems to have
> effects on performance as replacing the call with the AND and OR can reduce
> initialization time.
> 
> Looking over the assembly code before and after the change the main
> difference between the two is that the reserved bit is stored in a value
> that is generated outside of the main initialization loop and is then
> written with the other flags field values in one write to the page->flags
> value. Previously the generated value was written and then then a btsq
> instruction was issued.
> 
> On my x86_64 test system with 3TB of persistent memory per node I saw the
> persistent memory initialization time on average drop from 23.49s to
> 19.12s per node.
> 
> Signed-off-by: Alexander Duyck 

Reviewed-by: Pavel Tatashin 
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: [PATCH 07/11] libnvdimm/security: add overwrite status notification

2018-11-09 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: Linux-nvdimm  On Behalf Of
> Dave Jiang
> Sent: Friday, November 09, 2018 4:14 PM
> Subject: [PATCH 07/11] libnvdimm/security: add overwrite status
> notification
> 
...
> @@ -2033,6 +2033,11 @@ static int acpi_nfit_register_dimms(struct
> acpi_nfit_desc *acpi_desc)
>   if (!nvdimm)
>   continue;
> 
> + rc = nvdimm_setup_security_events(nvdimm);
> + if (rc < 0)
> + dev_warn(acpi_desc->dev,
> + "no security event setup
> failed\n");

That seems like a double negative.

Printing the rc value (or better yet, the string for it)
is always helpful too.

---
Robert Elliott, HPE Persistent Memory


___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


RE: [PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test

2018-11-09 Thread Elliott, Robert (Persistent Memory)



> -Original Message-
> From: Linux-nvdimm  On Behalf Of
> Dave Jiang
> Sent: Friday, November 09, 2018 4:15 PM
> To: dan.j.willi...@intel.com; zo...@linux.vnet.ibm.com
> Cc: linux-nvdimm@lists.01.org
> Subject: [PATCH 10/11] tools/testing/nvdimm: add Intel DSM 1.8
> support for nfit_test
...
> +static int nd_intel_test_cmd_master_set_pass(struct nfit_test *t,
> + struct nd_intel_set_master_passphrase *nd_cmd,
> + unsigned int buf_len, int dimm)
> +{
> + struct device *dev = &t->pdev.dev;
> + struct nfit_test_sec *sec = &dimm_sec_info[dimm];
> +
> + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) {
> + nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED;
> + dev_dbg(dev, "master set passphrase in wrong state\n");

"master set passphrase:" for consistency

> + } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) {
> + nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE;
> + dev_dbg(dev, "master set passphrase in wrong security
> state\n");

"master set passphrase:" for consistency

> + } else if (memcmp(nd_cmd->old_pass, sec->master_passphrase,
> + ND_INTEL_PASSPHRASE_SIZE) != 0) {
> + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
> + dev_dbg(dev, "master set passphrase wrong
> passphrase\n");

"master set passphrase:" for consistency

> + } else {
> + memcpy(sec->master_passphrase, nd_cmd->new_pass,
> + ND_INTEL_PASSPHRASE_SIZE);
> + nd_cmd->status = 0;
> + dev_dbg(dev, "master passphrase updated\n");

"master set passphrase:" for consistency

> + }
> +
> + return 0;
> +}
> +
> +static int nd_intel_test_cmd_master_secure_erase(struct nfit_test
> *t,
> + struct nd_intel_master_secure_erase *nd_cmd,
> + unsigned int buf_len, int dimm)
> +{
> + struct device *dev = &t->pdev.dev;
> + struct nfit_test_sec *sec = &dimm_sec_info[dimm];
> +
> + if (!(sec->ext_state & ND_INTEL_SEC_ESTATE_ENABLED)) {
> + nd_cmd->status = ND_INTEL_STATUS_NOT_SUPPORTED;
> + dev_dbg(dev, "master erase in wrong state\n");

"master secure erase: " for consistency

> + } else if (sec->ext_state & ND_INTEL_SEC_ESTATE_PLIMIT) {
> + nd_cmd->status = ND_INTEL_STATUS_INVALID_STATE;
> + dev_dbg(dev, "master erase in wrong security state\n");

"master secure erase: " for consistency

> + } else if (memcmp(nd_cmd->passphrase, sec->master_passphrase,
> + ND_INTEL_PASSPHRASE_SIZE) != 0) {
> + nd_cmd->status = ND_INTEL_STATUS_INVALID_PASS;
> + dev_dbg(dev, "master secure erase: wrong passphrase\n");
> + } else {
> + memset(sec->master_passphrase, 0,
> ND_INTEL_PASSPHRASE_SIZE);
> + sec->ext_state = ND_INTEL_SEC_ESTATE_ENABLED;
> + memset(sec->passphrase, 0, ND_INTEL_PASSPHRASE_SIZE);
> + sec->state = 0;
> + dev_dbg(dev, "master secure erase: done\n");
> + }
> +
> + return 0;
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm


Re: [mm PATCH v5 7/7] mm: Use common iterator for deferred_init_pages and deferred_free_pages

2018-11-09 Thread Pavel Tatashin
On 18-11-05 13:20:01, Alexander Duyck wrote:
> +static unsigned long __next_pfn_valid_range(unsigned long *i,
> + unsigned long end_pfn)
>  {
> - if (!pfn_valid_within(pfn))
> - return false;
> - if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
> - return false;
> - return true;
> + unsigned long pfn = *i;
> + unsigned long count;
> +
> + while (pfn < end_pfn) {
> + unsigned long t = ALIGN(pfn + 1, pageblock_nr_pages);
> + unsigned long pageblock_pfn = min(t, end_pfn);
> +
> +#ifndef CONFIG_HOLES_IN_ZONE
> + count = pageblock_pfn - pfn;
> + pfn = pageblock_pfn;
> + if (!pfn_valid(pfn))
> + continue;
> +#else
> + for (count = 0; pfn < pageblock_pfn; pfn++) {
> + if (pfn_valid_within(pfn)) {
> + count++;
> + continue;
> + }
> +
> + if (count)
> + break;
> + }
> +
> + if (!count)
> + continue;
> +#endif
> + *i = pfn;
> + return count;
> + }
> +
> + return 0;
>  }
>  
> +#define for_each_deferred_pfn_valid_range(i, start_pfn, end_pfn, pfn, count) 
> \
> + for (i = (start_pfn),\
> +  count = __next_pfn_valid_range(&i, (end_pfn));  \
> +  count && ({ pfn = i - count; 1; }); \
> +  count = __next_pfn_valid_range(&i, (end_pfn)))

Can this be improved somehow? It took me a while to understand this
piece of code. i is actually end of block, and not an index by PFN, ({pfn = i - 
count; 1;}) is
simply hard to parse. Why can't we make __next_pfn_valid_range() to
return both end and a start of a block?

The rest is good:

Reviewed-by: Pavel Tatashin 

Thank you,
Pasha
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm