Re: [PATCH] libnvdimm/security: Fix key lookup permissions

2020-06-23 Thread Ira Weiny
On Tue, Jun 23, 2020 at 09:35:26PM -0700, Dan Williams wrote:
> As of commit 8c0637e950d6 ("keys: Make the KEY_NEED_* perms an enum rather
> than a mask") lookup_user_key() needs an explicit declaration of what it
> wants to do with the key. Add KEY_NEED_SEARCH to fix a warning with the
> below signature, and to fix the inability to retrieve a key.
> 
> WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 
> key_task_permission+0xd3/0x140
> [..]
> RIP: 0010:key_task_permission+0xd3/0x140
> [..]
> Call Trace:
>  lookup_user_key+0xeb/0x6b0
>  ? vsscanf+0x3df/0x840
>  ? key_validate+0x50/0x50
>  ? key_default_cmp+0x20/0x20
>  nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
>  nvdimm_security_store+0x67d/0xb20 [libnvdimm]
>  security_store+0x67/0x1a0 [libnvdimm]
>  kernfs_fop_write+0xcf/0x1c0
>  vfs_write+0xde/0x1d0
>  ksys_write+0x68/0xe0
>  do_syscall_64+0x5c/0xa0
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> 
> Cc: Dan Williams 
> Cc: Vishal Verma 
> Cc: Dave Jiang 
> Cc: Ira Weiny 

Reviewed-by: Ira Weiny 

> Suggested-by: David Howells 
> Fixes: 8c0637e950d6 ("keys: Make the KEY_NEED_* perms an enum rather than a 
> mask")
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/security.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
> index 89b85970912d..4cef69bd3c1b 100644
> --- a/drivers/nvdimm/security.c
> +++ b/drivers/nvdimm/security.c
> @@ -95,7 +95,7 @@ static struct key *nvdimm_lookup_user_key(struct nvdimm 
> *nvdimm,
>   struct encrypted_key_payload *epayload;
> 	struct device *dev = &nvdimm->dev;
>  
> - keyref = lookup_user_key(id, 0, 0);
> + keyref = lookup_user_key(id, 0, KEY_NEED_SEARCH);
>   if (IS_ERR(keyref))
>   return NULL;
>  
> 


[PATCH] libnvdimm/security: Fix key lookup permissions

2020-06-23 Thread Dan Williams
As of commit 8c0637e950d6 ("keys: Make the KEY_NEED_* perms an enum rather
than a mask") lookup_user_key() needs an explicit declaration of what it
wants to do with the key. Add KEY_NEED_SEARCH to fix a warning with the
below signature, and to fix the inability to retrieve a key.

WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 
key_task_permission+0xd3/0x140
[..]
RIP: 0010:key_task_permission+0xd3/0x140
[..]
Call Trace:
 lookup_user_key+0xeb/0x6b0
 ? vsscanf+0x3df/0x840
 ? key_validate+0x50/0x50
 ? key_default_cmp+0x20/0x20
 nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
 nvdimm_security_store+0x67d/0xb20 [libnvdimm]
 security_store+0x67/0x1a0 [libnvdimm]
 kernfs_fop_write+0xcf/0x1c0
 vfs_write+0xde/0x1d0
 ksys_write+0x68/0xe0
 do_syscall_64+0x5c/0xa0
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Cc: Dan Williams 
Cc: Vishal Verma 
Cc: Dave Jiang 
Cc: Ira Weiny 
Suggested-by: David Howells 
Fixes: 8c0637e950d6 ("keys: Make the KEY_NEED_* perms an enum rather than a 
mask")
Signed-off-by: Dan Williams 
---
 drivers/nvdimm/security.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/nvdimm/security.c b/drivers/nvdimm/security.c
index 89b85970912d..4cef69bd3c1b 100644
--- a/drivers/nvdimm/security.c
+++ b/drivers/nvdimm/security.c
@@ -95,7 +95,7 @@ static struct key *nvdimm_lookup_user_key(struct nvdimm 
*nvdimm,
struct encrypted_key_payload *epayload;
	struct device *dev = &nvdimm->dev;
 
-   keyref = lookup_user_key(id, 0, 0);
+   keyref = lookup_user_key(id, 0, KEY_NEED_SEARCH);
if (IS_ERR(keyref))
return NULL;
 


Re: [PATCH v6 0/2] Renovate memcpy_mcsafe with copy_mc_to_{user, kernel}

2020-06-23 Thread Luck, Tony
On Wed, Jun 17, 2020 at 05:31:58PM -0700, Dan Williams wrote:
> No changes since v5 [1], just rebased to v5.8-rc1. No comments since
> that posting back at the end of May either; I will continue to re-post
> weekly, as I am otherwise at a loss for what else to do to move this
> forward. Should it go through Andrew since it's across PPC and x86?
> Thanks again to Michael for the PPC acks.
> 
> [1]: 
> http://lore.kernel.org/r/159062136234.2192412.7285856919306307817.st...@dwillia2-desk3.amr.corp.intel.com
> 
> ---
> 
> The primary motivation to go touch memcpy_mcsafe() is that the existing
> benefit of doing slow "handle with care" copies is obviated on newer
> CPUs. With that concern lifted, it also obviates the need to continue to
> update the MCA-recovery capability detection code currently gated by
> "mcsafe_key". Now the old "mcsafe_key" opt-in to perform the copy with
> concerns for recovery fragility can instead be made an opt-out from the
> default fast copy implementation (enable_copy_mc_fragile()).
> 
> The discussion with Linus on the first iteration of this patch
> identified that memcpy_mcsafe() was misnamed relative to its usage. The
> new names copy_mc_to_user() and copy_mc_to_kernel() clearly indicate the
> intended use case and let the architecture organize the implementation
> accordingly.
> 
> For both powerpc and x86 a copy_mc_generic() implementation is added as
> the backend for these interfaces.
> 
> Patches are relative to v5.8-rc1. It has a recent build success
> notification from the kbuild robot and is passing local nvdimm tests.
> 

Looks good to me. I tested on a Broadwell generation system (i.e.
one of the systems you now list as "fragile"); injecting uncorrected
errors into buffers that are then copied using copy_mc_to_kernel()
results in recovered machine checks with the return value correctly
indicating the amount remaining.
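
For concreteness, the check being described looks roughly like this.
This is a sketch, not the actual test code; it assumes only the
interface the series proposes, where copy_mc_to_kernel() returns the
number of bytes left uncopied:

    #include <linux/errno.h>
    #include <linux/printk.h>
    #include <linux/uaccess.h>

    static int copy_from_possibly_poisoned(void *dst, const void *src,
                                           size_t size)
    {
            /* a machine check truncates the copy; rem is what's left */
            unsigned long rem = copy_mc_to_kernel(dst, src, size);

            if (rem) {
                    pr_info("recovered MCE: %lu of %zu bytes not copied\n",
                            rem, size);
                    return -EIO;
            }
            return 0;
    }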

Reviewed-by: Tony Luck 

-Tony


Re: [GIT PULL] General notification queue and key notifications

2020-06-23 Thread David Howells
Dan Williams  wrote:

> Shall I wait for your further reworks to fix this for v5.8, or is that
> v5.9 material?

It could do with stewing in linux-next for a while, so 5.9 probably.

David


Re: [GIT PULL] General notification queue and key notifications

2020-06-23 Thread Dan Williams
On Tue, Jun 23, 2020 at 5:55 PM David Howells  wrote:
>
> Dan Williams  wrote:
>
> > > This commit:
> > >
> > > >   keys: Make the KEY_NEED_* perms an enum rather than a mask
> > >
> > > ...upstream as:
> > >
> > > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a 
> > > mask
> > >
> > > ...triggers a regression in the libnvdimm unit test that exercises the
> > > encrypted keys used to store nvdimm passphrases. It results in the
> > > below warning.
> >
> > This regression is still present in tip of tree. David, have you had a
> > chance to take a look?
>
> nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
> the key for so that the appropriate security checks can take place in SELinux
> and Smack.  Note that I have a patch in the works that changes this still
> further.
>
> Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
> for you?

It does, thanks.

Shall I wait for your further reworks to fix this for v5.8, or is that
v5.9 material?


Re: [GIT PULL] General notification queue and key notifications

2020-06-23 Thread David Howells
Dan Williams  wrote:

> > This commit:
> >
> > >   keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...upstream as:
> >
> > 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
> >
> > ...triggers a regression in the libnvdimm unit test that exercises the
> > encrypted keys used to store nvdimm passphrases. It results in the
> > below warning.
> 
> This regression is still present in tip of tree. David, have you had a
> chance to take a look?

nvdimm_lookup_user_key() needs to indicate to lookup_user_key() what it wants
the key for so that the appropriate security checks can take place in SELinux
and Smack.  Note that I have a patch in the works that changes this still
further.

Does setting the third argument of lookup_user_key() to KEY_NEED_SEARCH work
for you?

David


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Darrick J. Wong
On Tue, Jun 23, 2020 at 11:40:27PM +0100, Matthew Wilcox wrote:
> On Tue, Jun 23, 2020 at 03:26:58PM -0700, Luck, Tony wrote:
> > On Tue, Jun 23, 2020 at 11:17:41PM +0100, Matthew Wilcox wrote:
> > > It might also be nice to have an madvise() MADV_ZERO option so the
> > > application doesn't have to look up the fd associated with that memory
> > > range, but we haven't floated that idea with the customer yet; I just
> > > thought of it now.
> > 
> > So the conversation between app and kernel goes like this?
> > 
> > 1) machine check
> > 2) Kernel unmaps the 4K page surrounding the poison and sends
> >SIGBUS to the application to say that one cache line is gone
> > 3) App says madvise(MADV_ZERO, that cache line)
> > 4) Kernel says ... "oh, you know how to deal with this" and allocates
> >a new page, copying the 63 good cache lines from the old page and
> >zeroing the missing one. New page is mapped to user.
> 
> That could be one way of implementing it.  My understanding is that
> pmem devices will reallocate bad cachelines on writes, so a better
> implementation would be:
> 
> 1) Kernel receives machine check
> 2) Kernel sends SIGBUS to the application
> 3) App sends madvise(MADV_ZERO, addr, 1 << granularity)
> 4) Kernel does special writes to ensure the cacheline is zeroed
> 5) App does whatever it needs to recover (reconstructs the data or marks
> it as gone)

Frankly, I've wondered why the filesystem shouldn't just be in charge of
all this--

1. kernel receives machine check
2. kernel tattles to xfs
3. xfs looks up which file(s) own the pmem range
4. xfs zeroes the region, clears the poison, and sets AS_EIO on the
   files
5. xfs sends SIGBUS to any programs that had those files mapped to tell
   them "Your data is gone, we've stabilized the storage you had
   mapped."
6. app does whatever it needs to recover

Apps shouldn't have to do this punch-and-reallocate dance, seeing as
they don't currently do that for SCSI disks and the like.

--D

> > Do you have folks lined up to use that?  I don't know that many
> > folks are even catching the SIGBUS :-(
> 
> Had a 75 minute meeting with some people who want to use pmem this
> afternoon ...


Re: [GIT PULL] General notification queue and key notifications

2020-06-23 Thread Dan Williams
On Tue, Jun 16, 2020 at 6:15 PM Williams, Dan J
 wrote:
>
> Hi David,
>
> On Tue, 2020-06-02 at 16:55 +0100, David Howells wrote:
> > Date: Tue, 02 Jun 2020 16:51:44 +0100
> >
> > Hi Linus,
> >
> > Can you pull this, please?  It adds a general notification queue
> > concept
> > and adds an event source for keys/keyrings, such as linking and
> > unlinking
> > keys and changing their attributes.
> [..]
>
> This commit:
>
> >   keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...upstream as:
>
> 8c0637e950d6 keys: Make the KEY_NEED_* perms an enum rather than a mask
>
> ...triggers a regression in the libnvdimm unit test that exercises the
> encrypted keys used to store nvdimm passphrases. It results in the
> below warning.

This regression is still present in tip of tree. David, have you had a
chance to take a look?



>
> ---
>
> WARNING: CPU: 15 PID: 6276 at security/keys/permission.c:35 
> key_task_permission+0xd3/0x140
> Modules linked in: nd_blk(OE) nfit_test(OE) device_dax(OE) ebtable_filter(E) 
> ebtables(E) ip6table_filter(E) ip6_tables(E) kvm_intel(E) kvm(E) irqbypass(E) 
> nd_pmem(OE) dax_pmem(OE) nd_btt(OE) dax_p
> ct10dif_pclmul(E) nd_e820(OE) nfit(OE) crc32_pclmul(E) libnvdimm(OE) 
> crc32c_intel(E) ghash_clmulni_intel(E) serio_raw(E) encrypted_keys(E) 
> trusted(E) nfit_test_iomap(OE) tpm(E) drm(E)
> CPU: 15 PID: 6276 Comm: lt-ndctl Tainted: G   OE 5.7.0-rc6+ #155
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
> RIP: 0010:key_task_permission+0xd3/0x140
> Code: c8 21 d9 39 d9 75 25 48 83 c4 08 4c 89 e6 48 89 ef 5b 5d 41 5c 41 5d e9 
> 1b a7 00 00 bb 01 00 00 00 83 fa 01 0f 84 68 ff ff ff <0f> 0b 48 83 c4 08 b8 
> f3 ff ff ff 5b 5d 41 5c 41 5d c3 83 fa 06
>
> RSP: 0018:addc42db7c90 EFLAGS: 00010297
> RAX: 0001 RBX: 0001 RCX: addc42db7c7c
> RDX:  RSI: 985e1c46e840 RDI: 985e3a03de01
> RBP: 985e3a03de01 R08:  R09: 5461e7bc02a0
> R10: 0004 R11:  R12: 985e1c46e840
> R13:  R14: addc42db7cd8 R15: 985e248c6540
> FS:  7f863c18a780() GS:985e3bbc() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 006d3708 CR3: 000125a1e006 CR4: 00160ee0
> Call Trace:
>  lookup_user_key+0xeb/0x6b0
>  ? vsscanf+0x3df/0x840
>  ? key_validate+0x50/0x50
>  ? key_default_cmp+0x20/0x20
>  nvdimm_get_user_key_payload.part.0+0x21/0x110 [libnvdimm]
>  nvdimm_security_store+0x67d/0xb20 [libnvdimm]
>  security_store+0x67/0x1a0 [libnvdimm]
>  kernfs_fop_write+0xcf/0x1c0
>  vfs_write+0xde/0x1d0
>  ksys_write+0x68/0xe0
>  do_syscall_64+0x5c/0xa0
>  entry_SYSCALL_64_after_hwframe+0x49/0xb3
> RIP: 0033:0x7f863c624547
> Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 
> 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 
> 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> RSP: 002b:7ffd61d8f5e8 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 7ffd61d8f640 RCX: 7f863c624547
> RDX: 0014 RSI: 7ffd61d8f640 RDI: 0005
> RBP: 0005 R08: 0014 R09: 7ffd61d8f4a0
> R10: f455 R11: 0246 R12: 006dbbf0
> R13: 006cd710 R14: 7f863c18a6a8 R15: 7ffd61d8fae0
> irq event stamp: 36976
> hardirqs last  enabled at (36975): [] __slab_alloc+0x70/0x90
> hardirqs last disabled at (36976): [] 
> trace_hardirqs_off_thunk+0x1a/0x1c
> softirqs last  enabled at (35474): [] 
> __do_softirq+0x357/0x466
> softirqs last disabled at (35467): [] irq_exit+0xe6/0xf0
>


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Matthew Wilcox
On Tue, Jun 23, 2020 at 03:26:58PM -0700, Luck, Tony wrote:
> On Tue, Jun 23, 2020 at 11:17:41PM +0100, Matthew Wilcox wrote:
> > It might also be nice to have an madvise() MADV_ZERO option so the
> > application doesn't have to look up the fd associated with that memory
> > range, but we haven't floated that idea with the customer yet; I just
> > thought of it now.
> 
> So the conversation between app and kernel goes like this?
> 
> 1) machine check
> 2) Kernel unmaps the 4K page surrounding the poison and sends
>SIGBUS to the application to say that one cache line is gone
> 3) App says madvise(MADV_ZERO, that cache line)
> 4) Kernel says ... "oh, you know how to deal with this" and allocates
>a new page, copying the 63 good cache lines from the old page and
>zeroing the missing one. New page is mapped to user.

That could be one way of implementing it.  My understanding is that
pmem devices will reallocate bad cachelines on writes, so a better
implementation would be:

1) Kernel receives machine check
2) Kernel sends SIGBUS to the application
3) App sends madvise(MADV_ZERO, addr, 1 << granularity)
4) Kernel does special writes to ensure the cacheline is zeroed
5) App does whatever it needs to recover (reconstructs the data or marks
it as gone)

> Do you have folks lined up to use that?  I don't know that many
> folks are even catching the SIGBUS :-(

Had a 75 minute meeting with some people who want to use pmem this
afternoon ...
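
A sketch of that flow from the application side follows. MADV_ZERO is
only the proposal being floated here, so the constant below is made up;
the siginfo fields are the standard BUS_MCEERR_AR ones:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #ifndef MADV_ZERO
    #define MADV_ZERO 99    /* placeholder; not a real UAPI constant */
    #endif

    /* Step 2 delivers SIGBUS with the blast radius in si_addr_lsb. */
    static void sigbus_handler(int sig, siginfo_t *si, void *ctx)
    {
            size_t len = (size_t)1 << si->si_addr_lsb;
            void *addr = (void *)((uintptr_t)si->si_addr & ~(len - 1));

            /* Step 3: ask the kernel to zero (and so repair) the range */
            if (madvise(addr, len, MADV_ZERO))
                    abort();

            /* Step 5: reconstruct the data or record the range as lost */
    }

    static void install_handler(void)
    {
            struct sigaction sa = {
                    .sa_sigaction = sigbus_handler,
                    .sa_flags = SA_SIGINFO,
            };

            sigaction(SIGBUS, &sa, NULL);
    }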


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Luck, Tony
On Tue, Jun 23, 2020 at 11:17:41PM +0100, Matthew Wilcox wrote:
> It might also be nice to have an madvise() MADV_ZERO option so the
> application doesn't have to look up the fd associated with that memory
> range, but we haven't floated that idea with the customer yet; I just
> thought of it now.

So the conversation between app and kernel goes like this?

1) machine check
2) Kernel unmaps the 4K page surrounding the poison and sends
   SIGBUS to the application to say that one cache line is gone
3) App says madvise(MADV_ZERO, that cache line)
4) Kernel says ... "oh, you know how to deal with this" and allocates
   a new page, copying the 63 good cache lines from the old page and
   zeroing the missing one. New page is mapped to user.

Do you have folks lined up to use that?  I don't know that many
folks are even catching the SIGBUS :-(

-Tony


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Matthew Wilcox
On Tue, Jun 23, 2020 at 03:04:12PM -0700, Luck, Tony wrote:
> On Tue, Jun 23, 2020 at 09:17:45PM +0100, Matthew Wilcox wrote:
> > Hardware actually tells us the blast radius of the error, but we ignore
> > it and take out the entire page.  We've had a customer request to know
> > exactly how much of the page is damaged so they can avoid reconstructing
> > an entire 2MB page if only a single cacheline is damaged.
> > 
> > This is only a strawman that I did in an hour or two; I'd appreciate
> > architectural-level feedback.  Should I just convert memory_failure() to
> > always take an address & granularity?  Should I create a struct to pass
> > around (page, phys, granularity) instead of reconstructing the missing
> > pieces in half a dozen functions?  Is this functionality welcome at all,
> > or is the risk of upsetting applications which expect at least a page
> > of granularity too high?
> 
> What is the interface to these applications that want finer granularity?
> 
> Current code does very poorly with hugetlbfs pages ... user loses the
> whole 2 MB or 1GB. That's just silly (though I've been told that it is
> hard to fix because allowing a hugetlbfs page to be broken up at an arbitrary
> time as the result of a machine check means that the kernel needs locking
> around a bunch of fast paths that currently assume that a huge page will
> stay being a huge page).
> 
> For sub-4K page usage, there are different problems. We can't leave the
> original page with the poisoned cache line mapped to the user as they may
> just access the poison data and trigger another machine check. But if we
> map in some different page with all the good bits copied, the user needs
> to be aware which parts of the page no longer have their data.

This is specifically for DAX.  The interface we were hoping to use to
fix the error is to leave the mapping in place and call

fallocate(fd, FALLOC_FL_ZERO_RANGE, offset, len);

The application will then take care of writing actual data that isn't
zeroes to the file.  Or leave it as zeroes if that's what the application
wants to be its error indicator.
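
In code, that recovery write-path is just the following (a minimal
sketch, assuming the application already knows the fd and file offset
backing the poisoned mapping; fd, off and len here are illustrative):

    #define _GNU_SOURCE
    #include <fcntl.h>

    static int clear_poison(int fd, off_t off, off_t len)
    {
            /* zero the damaged range in place; the mapping stays put */
            if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) < 0)
                    return -1;

            /* then rewrite real data, or keep the zeroes as the
             * application's "data gone" marker */
            return 0;
    }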

The fallocate path doesn't work quite right today, but there's no point
in trying to fix that up if we can't sort out the delivery of the actual
error range.

It might also be nice to have an madvise() MADV_ZERO option so the
application doesn't have to look up the fd associated with that memory
range, but we haven't floated that idea with the customer yet; I just
thought of it now.


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Luck, Tony
On Tue, Jun 23, 2020 at 09:17:45PM +0100, Matthew Wilcox wrote:
> 
> Hardware actually tells us the blast radius of the error, but we ignore
> it and take out the entire page.  We've had a customer request to know
> exactly how much of the page is damaged so they can avoid reconstructing
> an entire 2MB page if only a single cacheline is damaged.
> 
> This is only a strawman that I did in an hour or two; I'd appreciate
> architectural-level feedback.  Should I just convert memory_failure() to
> always take an address & granularity?  Should I create a struct to pass
> around (page, phys, granularity) instead of reconstructing the missing
> pieces in half a dozen functions?  Is this functionality welcome at all,
> or is the risk of upsetting applications which expect at least a page
> of granularity too high?

What is the interface to these applications that want finer granularity?

Current code does very poorly with hugetlbfs pages ... user loses the
whole 2 MB or 1GB. That's just silly (though I've been told that it is
hard to fix because allowing a hugetlbfs page to be broken up at an arbitrary
time as the result of a machine check means that the kernel needs locking
around a bunch of fast paths that currently assume that a huge page will
stay being a huge page).

For sub-4K page usage, there are different problems. We can't leave the
original page with the poisoned cache line mapped to the user as they may
just access the poison data and trigger another machine check. But if we
map in some different page with all the good bits copied, the user needs
to be aware which parts of the page no longer have their data.

-Tony


Re: [RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Dan Williams
On Tue, Jun 23, 2020 at 1:18 PM Matthew Wilcox  wrote:
>
>
> Hardware actually tells us the blast radius of the error, but we ignore
> it and take out the entire page.  We've had a customer request to know
> exactly how much of the page is damaged so they can avoid reconstructing
> an entire 2MB page if only a single cacheline is damaged.
>
> This is only a strawman that I did in an hour or two; I'd appreciate
> architectural-level feedback.  Should I just convert memory_failure() to
> always take an address & granularity?  Should I create a struct to pass
> around (page, phys, granularity) instead of reconstructing the missing
> pieces in half a dozen functions?  Is this functionality welcome at all,
> or is the risk of upsetting applications which expect at least a page
> of granularity too high?
>
> I can see places where I've specified a plain PAGE_SHIFT instead of
> interrogating a compound page for its size.  I'd probably split this
> patch up into two or three pieces for applying.
>
> I've also blindly taken out the call to unmap_mapping_range().  Again,
> the customer requested that we not do this.  That deserves to be in its
> own patch and properly justified.

I had been thinking that we could not do much with the legacy
memory-failure reporting model and that applications that want a new
model would need to opt into it. This topic also dovetails with what
Dave and I had been discussing in terms of coordinating memory error
handling with the filesystem which may have more information about
multiple mappings of a DAX page (reflink) [1].

[1]: http://lore.kernel.org/r/20200311063942.ge10...@dread.disaster.area


[RFC] Make the memory failure blast radius more precise

2020-06-23 Thread Matthew Wilcox


Hardware actually tells us the blast radius of the error, but we ignore
it and take out the entire page.  We've had a customer request to know
exactly how much of the page is damaged so they can avoid reconstructing
an entire 2MB page if only a single cacheline is damaged.

This is only a strawman that I did in an hour or two; I'd appreciate
architectural-level feedback.  Should I just convert memory_failure() to
always take an address & granularity?  Should I create a struct to pass
around (page, phys, granularity) instead of reconstructing the missing
pieces in half a dozen functions?  Is this functionality welcome at all,
or is the risk of upsetting applications which expect at least a page
of granularity too high?

I can see places where I've specified a plain PAGE_SHIFT instead of
interrogating a compound page for its size.  I'd probably split this
patch up into two or three pieces for applying.

I've also blindly taken out the call to unmap_mapping_range().  Again,
the customer requested that we not do this.  That deserves to be in its
own patch and properly justified.

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index ce9120c4f740..09978c5b9493 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -463,7 +463,7 @@ static void mce_irq_work_cb(struct irq_work *entry)
  * Check if the address reported by the CPU is in a format we can parse.
  * It would be possible to add code for most other cases, but all would
  * be somewhat complicated (e.g. segment offset would require an instruction
- * parser). So only support physical addresses up to page granuality for now.
+ * parser). So only support physical addresses up to page granularity for now.
  */
 int mce_usable_address(struct mce *m)
 {
@@ -570,7 +570,6 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
  void *data)
 {
struct mce *mce = (struct mce *)data;
-   unsigned long pfn;
 
if (!mce || !mce_usable_address(mce))
return NOTIFY_DONE;
@@ -579,9 +578,8 @@ static int uc_decode_notifier(struct notifier_block *nb, 
unsigned long val,
mce->severity != MCE_DEFERRED_SEVERITY)
return NOTIFY_DONE;
 
-   pfn = mce->addr >> PAGE_SHIFT;
-   if (!memory_failure(pfn, 0)) {
-   set_mce_nospec(pfn, whole_page(mce));
+   if (!__memory_failure(mce->addr, MCI_MISC_ADDR_LSB(mce->misc), 0)) {
+   set_mce_nospec(mce->addr >> PAGE_SHIFT, whole_page(mce));
mce->kflags |= MCE_HANDLED_UC;
}
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc7b87310c10..f2ea6491df28 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3035,7 +3035,14 @@ enum mf_flags {
MF_MUST_KILL = 1 << 2,
MF_SOFT_OFFLINE = 1 << 3,
 };
-extern int memory_failure(unsigned long pfn, int flags);
+
+int __memory_failure(unsigned long addr, unsigned int granularity, int flags);
+
+static inline int memory_failure(unsigned long pfn, int flags)
+{
+   return __memory_failure(pfn << PAGE_SHIFT, PAGE_SHIFT, flags);
+}
+
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb1fb9b..775b053dcd98 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -198,7 +198,7 @@ struct to_kill {
struct list_head nd;
struct task_struct *tsk;
unsigned long addr;
-   short size_shift;
+   short granularity;
 };
 
 /*
@@ -209,7 +209,7 @@ struct to_kill {
 static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 {
struct task_struct *t = tk->tsk;
-   short addr_lsb = tk->size_shift;
+   short addr_lsb = tk->granularity;
int ret = 0;
 
pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware 
memory corruption\n",
@@ -262,40 +262,6 @@ void shake_page(struct page *p, int access)
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
-static unsigned long dev_pagemap_mapping_shift(struct page *page,
-   struct vm_area_struct *vma)
-{
-   unsigned long address = vma_address(page, vma);
-   pgd_t *pgd;
-   p4d_t *p4d;
-   pud_t *pud;
-   pmd_t *pmd;
-   pte_t *pte;
-
-   pgd = pgd_offset(vma->vm_mm, address);
-   if (!pgd_present(*pgd))
-   return 0;
-   p4d = p4d_offset(pgd, address);
-   if (!p4d_present(*p4d))
-   return 0;
-   pud = pud_offset(p4d, address);
-   if (!pud_present(*pud))
-   return 0;
-   if (pud_devmap(*pud))
-   return PUD_SHIFT;
-   pmd = pmd_offset(pud, address);
-   if (!pmd_present(*pmd))
-   return 0;
-   if (pmd_devmap(*pmd))
-   return PMD_SHIFT;
-   pte = pte_offset_map(pmd, address);
-   if (!pte_present(*pte))
-  

Re: [PATCH 2/2] powerpc/papr_scm: Add support for fetching nvdimm 'fuel-gauge' metric

2020-06-23 Thread Ira Weiny
On Mon, Jun 22, 2020 at 09:54:51AM +0530, Vaibhav Jain wrote:
> We add support for reporting 'fuel-gauge' NVDIMM metric via
> PAPR_PDSM_HEALTH pdsm payload. 'fuel-gauge' metric indicates the usage
> life remaining of a papr-scm compatible NVDIMM. PHYP exposes this
> metric via the H_SCM_PERFORMANCE_STATS.
> 
> The metric value is returned from the pdsm by extending the return
> payload 'struct nd_papr_pdsm_health' without breaking the ABI. A new
> field 'dimm_fuel_gauge' to hold the metric value is introduced at the
> end of the payload struct and its presence is indicated by the
> extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID.
> 
> The patch introduces a new function papr_pdsm_fuel_gauge() that is
> called from papr_pdsm_health(). If fetching NVDIMM performance stats
> is supported then 'papr_pdsm_fuel_gauge()' allocates an output buffer
> large enough to hold the performance stat and passes it to
> drc_pmem_query_stats() that issues the HCALL to PHYP. The return value
> of the stat is then populated in the 'struct
> nd_papr_pdsm_health.dimm_fuel_gauge' field with extension flag
> 'PDSM_DIMM_HEALTH_RUN_GAUGE_VALID' set in 'struct
> nd_papr_pdsm_health.extension_flags'
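
A consumer-side check of that flag might look like this; purely
illustrative, using the UAPI names this patch adds (older kernels leave
extension_flags clear, which is how the ABI stays compatible):

    #include <stdio.h>
    #include <asm/papr_pdsm.h>    /* the UAPI header this series touches */

    static void show_fuel_gauge(const struct nd_papr_pdsm_health *health)
    {
            /* the new field is only meaningful when the flag is set */
            if (health->extension_flags & PDSM_DIMM_HEALTH_RUN_GAUGE_VALID)
                    printf("fuel gauge: %u%% life remaining\n",
                           health->dimm_fuel_gauge);
    }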
> 
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/include/uapi/asm/papr_pdsm.h |  9 +
>  arch/powerpc/platforms/pseries/papr_scm.c | 47 +++
>  2 files changed, 56 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/papr_pdsm.h 
> b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> index 9ccecc1d6840..50ef95e2f5b1 100644
> --- a/arch/powerpc/include/uapi/asm/papr_pdsm.h
> +++ b/arch/powerpc/include/uapi/asm/papr_pdsm.h
> @@ -72,6 +72,11 @@
>  #define PAPR_PDSM_DIMM_CRITICAL  2
>  #define PAPR_PDSM_DIMM_FATAL 3
>  
> +/* struct nd_papr_pdsm_health.extension_flags field flags */
> +
> +/* Indicate that the 'dimm_fuel_gauge' field is valid */
> +#define PDSM_DIMM_HEALTH_RUN_GAUGE_VALID 1
> +
>  /*
>   * Struct exchanged between kernel & ndctl in for PAPR_PDSM_HEALTH
>   * Various flags indicate the health status of the dimm.
> @@ -84,6 +89,7 @@
>  * dimm_locked   : Contents of the dimm can't be modified until 
> CEC reboot
>   * dimm_encrypted: Contents of dimm are encrypted.
>   * dimm_health   : Dimm health indicator. One of 
> PAPR_PDSM_DIMM_
> + * dimm_fuel_gauge   : Life remaining of DIMM as a percentage from 0-100
>   */
>  struct nd_papr_pdsm_health {
>   union {
> @@ -96,6 +102,9 @@ struct nd_papr_pdsm_health {
>   __u8 dimm_locked;
>   __u8 dimm_encrypted;
>   __u16 dimm_health;
> +
> + /* Extension flag PDSM_DIMM_HEALTH_RUN_GAUGE_VALID */
> + __u16 dimm_fuel_gauge;
>   };
>   __u8 buf[ND_PDSM_PAYLOAD_MAX_SIZE];
>   };
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index cb3f9acc325b..39527cd38d9c 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -506,6 +506,45 @@ static int is_cmd_valid(struct nvdimm *nvdimm, unsigned 
> int cmd, void *buf,
>   return 0;
>  }
>  
> +static int papr_pdsm_fuel_gauge(struct papr_scm_priv *p,
> + union nd_pdsm_payload *payload)
> +{
> + int rc, size;
> + struct papr_scm_perf_stat *stat;
> + struct papr_scm_perf_stats *stats;
> +
> +	/* Silently fail if fetching performance metrics isn't supported */
> + if (!p->len_stat_buffer)
> + return 0;
> +
> +	/* Allocate a request buffer large enough to hold a single performance stat */
> + size = sizeof(struct papr_scm_perf_stats) +
> + sizeof(struct papr_scm_perf_stat);
> +
> + stats = kzalloc(size, GFP_KERNEL);
> + if (!stats)
> + return -ENOMEM;
> +
> +	stat = &stats->scm_statistic[0];
> +	memcpy(&stat->statistic_id, "MemLife ", sizeof(stat->statistic_id));
> + stat->statistic_value = 0;
> +
> + /* Fetch the fuel gauge and populate it in payload */
> + rc = drc_pmem_query_stats(p, stats, size, 1, NULL);
> + if (!rc) {

Always best to handle the error case first...

	if (rc) {
		... print debugging from below ...
		goto free_stats;
	}

> + dev_dbg(>pdev->dev,
> + "Fetched fuel-gauge %llu", stat->statistic_value);
> + payload->health.extension_flags |=
> + PDSM_DIMM_HEALTH_RUN_GAUGE_VALID;
> + payload->health.dimm_fuel_gauge = stat->statistic_value;
> +
> + rc = sizeof(struct nd_papr_pdsm_health);
> + }
> +

free_stats:

> + kfree(stats);
> + return rc;
> +}
> +
>  /* Fetch the DIMM health info and populate it in provided package. */
>  static int papr_pdsm_health(struct papr_scm_priv *p,
>   union nd_pdsm_payload *payload)
> @@ -546,6 +585,14 @@ static int 

Re: [PATCH 1/2] powerpc/papr_scm: Fetch nvdimm performance stats from PHYP

2020-06-23 Thread Ira Weiny
On Mon, Jun 22, 2020 at 09:54:50AM +0530, Vaibhav Jain wrote:
> Update papr_scm.c to query dimm performance statistics from PHYP via
> H_SCM_PERFORMANCE_STATS hcall and export them to user-space as PAPR
> specific NVDIMM attribute 'perf_stats' in sysfs. The patch also
> provides sysfs ABI documentation for the stats being reported and
> their meanings.
> 
> During NVDIMM probe time in papr_scm_nvdimm_init() a special variant
> of H_SCM_PERFORMANCE_STATS hcall is issued to check if collection of
> performance statistics is supported or not. If successful then a PHYP
> returns a maximum possible buffer length needed to read all
> performance stats. This returned value is stored in a per-nvdimm
> attribute 'len_stat_buffer'.
> 
> The layout of the request buffer for reading NVDIMM performance stats from
> PHYP is defined in 'struct papr_scm_perf_stats' and 'struct
> papr_scm_perf_stat'. These structs are used in newly introduced
> drc_pmem_query_stats() that issues the H_SCM_PERFORMANCE_STATS hcall.
> 
> The sysfs access function perf_stats_show() uses value
> 'len_stat_buffer' to allocate a buffer large enough to hold all
> possible NVDIMM performance stats and passes it to
> drc_pmem_query_stats() to populate. Finally, statistics reported in the
> buffer are formatted into the sysfs access function output buffer.
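
Since the diff below is truncated before perf_stats_show() itself, here
is a rough sketch of that formatting step; struct names follow this
patch, and endian conversion of the PHYP buffer is elided:

    #include <linux/kernel.h>
    #include <linux/types.h>

    static ssize_t format_perf_stats(struct papr_scm_perf_stats *stats,
                                     char *buf, ssize_t max)
    {
            ssize_t off = 0;
            u32 i;

            for (i = 0; i < stats->num_statistics; i++) {
                    struct papr_scm_perf_stat *s = &stats->scm_statistic[i];

                    /* one "<8-char id> : <value>" line per statistic */
                    off += scnprintf(buf + off, max - off, "%.8s : %llu\n",
                                     (const char *)s->statistic_id,
                                     (unsigned long long)s->statistic_value);
            }

            return off;
    }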
> 
> Signed-off-by: Vaibhav Jain 
> ---
>  Documentation/ABI/testing/sysfs-bus-papr-pmem |  27 
>  arch/powerpc/platforms/pseries/papr_scm.c | 139 ++
>  2 files changed, 166 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-papr-pmem 
> b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> index 5b10d036a8d4..c1a67275c43f 100644
> --- a/Documentation/ABI/testing/sysfs-bus-papr-pmem
> +++ b/Documentation/ABI/testing/sysfs-bus-papr-pmem
> @@ -25,3 +25,30 @@ Description:
> NVDIMM have been scrubbed.
> 	 * "locked"  : Indicating that NVDIMM contents can't
> be modified until next power cycle.
> +
> +What:		/sys/bus/nd/devices/nmemX/papr/perf_stats
> +Date:		May, 2020
> +KernelVersion:	v5.9
> +Contact:	linuxppc-dev , 
> linux-nvdimm@lists.01.org,
> +Description:
> + (RO) Report various performance stats related to papr-scm NVDIMM
> + device.  Each stat is reported on a new line with each line
> composed of a stat-identifier followed by its value. Below are
> + currently known dimm performance stats which are reported:
> +
> + * "CtlResCt" : Controller Reset Count
> + * "CtlResTm" : Controller Reset Elapsed Time
> + * "PonSecs " : Power-on Seconds
> + * "MemLife " : Life Remaining
> + * "CritRscU" : Critical Resource Utilization
> + * "HostLCnt" : Host Load Count
> + * "HostSCnt" : Host Store Count
> + * "HostSDur" : Host Store Duration
> + * "HostLDur" : Host Load Duration
> + * "MedRCnt " : Media Read Count
> + * "MedWCnt " : Media Write Count
> + * "MedRDur " : Media Read Duration
> + * "MedWDur " : Media Write Duration
> + * "CchRHCnt" : Cache Read Hit Count
> + * "CchWHCnt" : Cache Write Hit Count
> + * "FastWCnt" : Fast Write Count
> \ No newline at end of file
> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c 
> b/arch/powerpc/platforms/pseries/papr_scm.c
> index 9c569078a09f..cb3f9acc325b 100644
> --- a/arch/powerpc/platforms/pseries/papr_scm.c
> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
> @@ -62,6 +62,24 @@
>   PAPR_PMEM_HEALTH_FATAL |\
>   PAPR_PMEM_HEALTH_UNHEALTHY)
>  
> +#define PAPR_SCM_PERF_STATS_EYECATCHER __stringify(SCMSTATS)
> +#define PAPR_SCM_PERF_STATS_VERSION 0x1
> +
> +/* Struct holding a single performance metric */
> +struct papr_scm_perf_stat {
> + u8 statistic_id[8];
> + u64 statistic_value;
> +};
> +
> +/* Struct exchanged between kernel and PHYP for fetching drc perf stats */
> +struct papr_scm_perf_stats {
> + u8 eye_catcher[8];
> + u32 stats_version;  /* Should be 0x01 */
 
 PAPR_SCM_PERF_STATS_VERSION?

> + u32 num_statistics; /* Number of stats following */
> +	/* zero or more performance metrics */
> + struct papr_scm_perf_stat scm_statistic[];
> +} __packed;
> +
>  /* private struct associated with each region */
>  struct papr_scm_priv {
>   struct platform_device *pdev;
> @@ -89,6 +107,9 @@ struct papr_scm_priv {
>  
>   /* Health information for the dimm */
>   u64 health_bitmap;
> +
> + /* length of the stat buffer as expected by phyp */
> + size_t len_stat_buffer;
>  };
>  
>  static int drc_pmem_bind(struct papr_scm_priv *p)
> @@ -194,6 

[PATCH AUTOSEL 5.7 26/28] nvdimm/region: always show the 'align' attribute

2020-06-23 Thread Sasha Levin
From: Vishal Verma 

[ Upstream commit 543094e19c82b5d171e139d09a1a3ea0a7361117 ]

It is possible that a platform that is capable of 'namespace labels'
comes up without the labels properly initialized. In this case, the
region's 'align' attribute is hidden. Howerver, once the user does
initialize he labels, the 'align' attribute still stays hidden, which is
unexpected.

The sysfs_update_group() API is meant to address this, and could be
called during region probe, but it has entanglements with the device
'lockdep_mutex'. Therefore, simply make the 'align' attribute always
visible. It doesn't matter what it says for label-less namespaces, since
it is not possible to change their allocation anyway.
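
For contrast, the probe-time alternative would have been roughly the
following (the group name is illustrative); it re-evaluates
region_visible() for each attribute but drags in the device
'lockdep_mutex', which is why the patch avoids it:

    #include <linux/device.h>
    #include <linux/sysfs.h>

    static void nd_region_refresh_attrs(struct device *dev)
    {
            /* re-runs the group's is_visible() callback per attribute */
            if (sysfs_update_group(&dev->kobj, &nd_region_attribute_group))
                    dev_warn(dev, "failed to refresh region attributes\n");
    }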

Suggested-by: Dan Williams 
Signed-off-by: Vishal Verma 
Cc: Dan Williams 
Link: https://lore.kernel.org/r/20200520225026.29426-1-vishal.l.ve...@intel.com
Signed-off-by: Dan Williams 
Signed-off-by: Sasha Levin 
---
 drivers/nvdimm/region_devs.c | 14 ++
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ccbb5b43b8b2c..4502f9c4708d0 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -679,18 +679,8 @@ static umode_t region_visible(struct kobject *kobj, struct 
attribute *a, int n)
return a->mode;
}
 
-	if (a == &dev_attr_align.attr) {
-   int i;
-
-   for (i = 0; i < nd_region->ndr_mappings; i++) {
-		struct nd_mapping *nd_mapping = &nd_region->mapping[i];
-   struct nvdimm *nvdimm = nd_mapping->nvdimm;
-
-			if (test_bit(NDD_LABELING, &nvdimm->flags))
-   return a->mode;
-   }
-   return 0;
-   }
+	if (a == &dev_attr_align.attr)
+   return a->mode;
 
	if (a != &dev_attr_set_cookie.attr
			&& a != &dev_attr_available_size.attr)
-- 
2.25.1