Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Jane Chu

On 6/11/2018 6:50 PM, Naoya Horiguchi wrote:


On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:

On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko  wrote:

On Mon 11-06-18 07:44:39, Dan Williams wrote:
[...]

I'm still trying to understand the next level of detail on where you
think the design should go next? Is it just the HWPoison page flag?
Are you concerned about supporting greater than PAGE_SIZE poison?

I simply do not want to check for HWPoison in a zillion places and have
each type of page get some special handling, which can go wrong very
easily. I am not clear on the details here; this is something for the users of
hwpoison to define: what are the reasonable scenarios where the feature is
useful, then turn that into a feature list that can actually be turned
into a design document. Note the difference from the let's-put-some-more-on-top
approach...


So you want me to pay the toll of writing a design document justifying
all the existing use cases of HWPoison before we fix the DAX bugs, and
the design document may or may not result in any substantive change to
these patches?

Naoya or Andi, can you chime in here?

memory_failure() does 3 things:

  - unmapping the error page from processes using it,
  - isolating the error page with PageHWPoison,
  - logging/reporting.

The unmapping part and the isolating part are quite page-type dependent,
so it seems hard to me to handle them in a generic manner (supporting a new
page type always needs case-specific new code).
But I agree that we can improve the code and documentation to help developers
add support for new page types.

As for documentation, the content of Documentation/vm/hwpoison.rst has not
been updated since 2009, so an update covering the design might be required.
My current thoughts on items to update are:

   - detailing the general workflow,
   - adding a section about soft offline,
   - guidelines for developers to support new types of memory,
   (- and anything else helpful/requested.)

Making the code more readable/self-descriptive would be helpful, though I'm
not yet clear on how.

Anyway, I'll find time to work on this; right now I'm testing the dax
support patches and fixing a bug I found recently.


Thank you. Maybe it's already on your mind, but just in case: when you update
the document, would you include these characteristics of pmem error handling?
  . UE/poison can be repaired until wear and tear reaches a maximum level
  . many user applications mmap the entire capacity, leaving no spare pages
for swapping (unlike volatile memory UE handling)
  . the what-you-see-is-what-you-get nature
Regarding the HWPOISON redesign, a nagging thought is that a memory UE
typically indicates a very small blast radius, less than 4KB. But it seems
that the larger the page size, the greater the 'penalty' in terms of how much
memory ends up being offlined. Is there a way to be frugal?

Thanks!
-jane



Thanks,
Naoya Horiguchi
___
Linux-nvdimm mailing list
Linux-nvdimm@lists.01.org
https://lists.01.org/mailman/listinfo/linux-nvdimm




Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Dan Williams
On Mon, Jun 11, 2018 at 6:50 PM, Naoya Horiguchi
 wrote:
> On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
[..]
> Anyway, I'll find time to work on this; right now I'm testing the dax
> support patches and fixing a bug I found recently.

Ok, with this and other review feedback these patches are not ready
for 4.18. I'll circle back for 4.19 and we can try again.

Thanks for taking a look!


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Naoya Horiguchi
On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
> On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko  wrote:
> > On Mon 11-06-18 07:44:39, Dan Williams wrote:
> > [...]
> >> I'm still trying to understand the next level of detail on where you
> >> think the design should go next? Is it just the HWPoison page flag?
> >> Are you concerned about supporting greater than PAGE_SIZE poison?
> >
> > I simply do not want to check for HWPoison in a zillion places and have
> > each type of page get some special handling, which can go wrong very
> > easily. I am not clear on the details here; this is something for the users of
> > hwpoison to define: what are the reasonable scenarios where the feature is
> > useful, then turn that into a feature list that can actually be turned
> > into a design document. Note the difference from the let's-put-some-more-on-top
> > approach...
> >
> 
> So you want me to pay the toll of writing a design document justifying
> all the existing use cases of HWPoison before we fix the DAX bugs, and
> the design document may or may not result in any substantive change to
> these patches?
> 
> Naoya or Andi, can you chime in here?

memory_failure() does 3 things:

 - unmapping the error page from processes using it,
 - isolating the error page with PageHWPoison,
 - logging/reporting.

The unmapping part and the isolating part are quite page-type dependent,
so it seems hard to me to handle them in a generic manner (supporting a new
page type always needs case-specific new code).
But I agree that we can improve the code and documentation to help developers
add support for new page types.

As for documentation, the content of Documentation/vm/hwpoison.rst has not
been updated since 2009, so an update covering the design might be required.
My current thoughts on items to update are:

  - detailing the general workflow,
  - adding a section about soft offline,
  - guidelines for developers to support new types of memory,
  (- and anything else helpful/requested.)

Making the code more readable/self-descriptive would be helpful, though I'm
not yet clear on how.

Anyway, I'll find time to work on this; right now I'm testing the dax
support patches and fixing a bug I found recently.

Thanks,
Naoya Horiguchi


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Andi Kleen
On Mon, Jun 11, 2018 at 08:19:54AM -0700, Dan Williams wrote:
> On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko  wrote:
> > On Mon 11-06-18 07:44:39, Dan Williams wrote:
> > [...]
> >> I'm still trying to understand the next level of detail on where you
> >> think the design should go next? Is it just the HWPoison page flag?
> >> Are you concerned about supporting greater than PAGE_SIZE poison?
> >
> > I simply do not want to check for HWPoison in a zillion places and have
> > each type of page get some special handling, which can go wrong very
> > easily. I am not clear on the details here; this is something for the users of
> > hwpoison to define: what are the reasonable scenarios where the feature is
> > useful, then turn that into a feature list that can actually be turned
> > into a design document. Note the difference from the let's-put-some-more-on-top
> > approach...
> >
> 
> So you want me to pay the toll of writing a design document justifying
> all the existing use cases of HWPoison before we fix the DAX bugs, and
> the design document may or may not result in any substantive change to
> these patches?
> 
> Naoya or Andi, can you chime in here?

A new document doesn't make any sense. We have the commit messages and
the code comments as design documents, and as usual the ultimate authority is
what the code does.

The guiding light for new memory recovery code is just these sentences (taken
from the beginning of the main file):

 * In general any code for handling new cases should only be added iff:
 * - You know how to test it.
 * - You have a test that can be added to mce-test
 *   https://git.kernel.org/cgit/utils/cpu/mce/mce-test.git/
 * - The case actually shows up as a frequent (top 10) page state in
 *   tools/vm/page-types when running a real workload.

Since persistent memory is so big it makes sense to add support
for it in common code paths. That is usually just kernel copies and
user space execution.

-Andi


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Dan Williams
On Mon, Jun 11, 2018 at 7:56 AM, Michal Hocko  wrote:
> On Mon 11-06-18 07:44:39, Dan Williams wrote:
> [...]
>> I'm still trying to understand the next level of detail on where you
>> think the design should go next? Is it just the HWPoison page flag?
>> Are you concerned about supporting greater than PAGE_SIZE poison?
>
> I simply do not want to check for HWPoison in a zillion places and have
> each type of page get some special handling, which can go wrong very
> easily. I am not clear on the details here; this is something for the users of
> hwpoison to define: what are the reasonable scenarios where the feature is
> useful, then turn that into a feature list that can actually be turned
> into a design document. Note the difference from the let's-put-some-more-on-top
> approach...
>

So you want me to pay the toll of writing a design document justifying
all the existing use cases of HWPoison before we fix the DAX bugs, and
the design document may or may not result in any substantive change to
these patches?

Naoya or Andi, can you chime in here?


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Michal Hocko
On Mon 11-06-18 07:44:39, Dan Williams wrote:
[...]
> I'm still trying to understand the next level of detail on where you
> think the design should go next? Is it just the HWPoison page flag?
> Are you concerned about supporting greater than PAGE_SIZE poison?

I simply do not want to check for HWPoison in a zillion places and have
each type of page get some special handling, which can go wrong very
easily. I am not clear on the details here; this is something for the users of
hwpoison to define: what are the reasonable scenarios where the feature is
useful, then turn that into a feature list that can actually be turned
into a design document. Note the difference from the let's-put-some-more-on-top
approach...

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Dan Williams
On Mon, Jun 11, 2018 at 12:50 AM, Michal Hocko  wrote:
> On Thu 07-06-18 09:52:22, Dan Williams wrote:
>> On Thu, Jun 7, 2018 at 7:37 AM, Michal Hocko  wrote:
>> > On Wed 06-06-18 06:44:45, Dan Williams wrote:
>> >> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko  wrote:
>> >> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
>> >> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
>> >> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
>> >> >> > [...]
>> >> >> >> I'm trying to solve this real world problem when real poison is
>> >> >> >> consumed through a dax mapping:
>> >> >> >>
>> >> >> >> mce: Uncorrected hardware memory error in user-access at af34214200
>> >> >> >> {1}[Hardware Error]: It has been corrected by h/w and requires
>> >> >> >> no further action
>> >> >> >> mce: [Hardware Error]: Machine check events logged
>> >> >> >> {1}[Hardware Error]: event severity: corrected
>> >> >> >> Memory failure: 0xaf34214: reserved kernel page still
>> >> >> >> referenced by 1 users
>> >> >> >> [..]
>> >> >> >> Memory failure: 0xaf34214: recovery action for reserved kernel
>> >> >> >> page: Failed
>> >> >> >> mce: Memory error not recovered
>> >> >> >>
>> >> >> >> ...i.e. currently all poison consumed through dax mappings is
>> >> >> >> needlessly system fatal.
>> >> >> >
>> >> >> > Thanks. That should be a part of the changelog.
>> >> >>
>> >> >> ...added for v3:
>> >> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
>> >> >>
>> >> >> > It would be great to
>> >> >> > describe why this cannot be simply handled by hwpoison code without any
>> >> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
>> >> >> > hwpoison code even care?
>> >> >> >
>> >> >>
>> >> >> Up until we started testing hardware poison recovery for persistent
>> >> >> memory I assumed that the kernel did not need any new enabling to get
>> >> >> basic support for recovering userspace consumed poison.
>> >> >>
>> >> >> However, the recovery code has a dedicated path for many different
>> >> >> page states (see: action_page_types). Without any changes it
>> >> >> incorrectly assumes that a dax mapped page is a page cache page
>> >> >> undergoing dma, or some other pinned operation. It also assumes that
>> >> >> the page must be offlined which is not correct / possible for dax
>> >> >> mapped pages. There is a possibility to repair poison to dax mapped
>> >> >> persistent memory pages, and the pages can't otherwise be offlined
>> >> >> because they 1:1 correspond with a physical storage block, i.e.
>> >> >> offlining pmem would be equivalent to punching a hole in the physical
>> >> >> address space.
>> >> >>
>> >> >> There's also the entanglement of device-dax which guarantees a given
>> >> >> mapping size (4K, 2M, 1G). This requires determining the size of the
>> >> >> mapping encompassing a given pfn to know how much to unmap. Since dax
>> >> >> mapped pfns don't come from the page allocator we need to read the
>> >> >> page size from the page tables, not compound_order(page).
>> >> >
>> >> > OK, but my question is still. Do we really want to do more on top of the
>> >> > existing code and add even more special casing or it is time to rethink
>> >> > the whole hwpoison design?
>> >>
>> >> Well, there's the immediate problem that the current implementation is
>> >> broken for dax and then the longer term problem that the current
>> >> design appears to be too literal with a lot of custom marshaling.
>> >>
>> >> At least for dax in the long term we want to offer an alternative
>> >> error handling model and get the filesystem much more involved. That
>> >> filesystem redesign work has been waiting for the reverse-block-map
>> >> effort to settle in xfs. However, that's more custom work for dax and
>> >> not a redesign that helps the core-mm more generically.
>> >>
>> >> I think the unmap and SIGBUS portion of poison handling is relatively
>> >> straightforward. It's the handling of the page HWPoison page flag that
>> >> seems a bit ad hoc. The current implementation certainly was not
>> >> prepared for the concept that memory can be repaired. set_mce_nospec()
>> >> is a step in the direction of generic memory error handling.
>> >
>> > Agreed! Moreover random checks for HWPoison pages is just a maintenance
>> > hell.
>> >
>> >> Thoughts on other pain points in the design that are on your mind, Michal?
>> >
> > We discussed those at LSFMM this year: https://lwn.net/Articles/753261/
> > The main problem is that there is basically no design description, so the
> > whole feature is very easy to break. Yours is another good example of
> > that. We need to get back to the drawing board and think about how to
> > make this more robust.
>>
>> I saw that article, but to be honest I did not glean any direct
>> suggestions that read on these current patches. I'm interested in
>> discussing a redesign, but I'm not interested in leaving poison
>> unhandled for DAX while we figure it out.

Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-11 Thread Michal Hocko
On Thu 07-06-18 09:52:22, Dan Williams wrote:
> On Thu, Jun 7, 2018 at 7:37 AM, Michal Hocko  wrote:
> > On Wed 06-06-18 06:44:45, Dan Williams wrote:
> >> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko  wrote:
> >> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
> >> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
> >> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> >> >> > [...]
> >> >> >> I'm trying to solve this real world problem when real poison is
> >> >> >> consumed through a dax mapping:
> >> >> >>
> >> >> >> mce: Uncorrected hardware memory error in user-access at af34214200
> >> >> >> {1}[Hardware Error]: It has been corrected by h/w and requires
> >> >> >> no further action
> >> >> >> mce: [Hardware Error]: Machine check events logged
> >> >> >> {1}[Hardware Error]: event severity: corrected
> >> >> >> Memory failure: 0xaf34214: reserved kernel page still
> >> >> >> referenced by 1 users
> >> >> >> [..]
> >> >> >> Memory failure: 0xaf34214: recovery action for reserved kernel
> >> >> >> page: Failed
> >> >> >> mce: Memory error not recovered
> >> >> >>
> >> >> >> ...i.e. currently all poison consumed through dax mappings is
> >> >> >> needlessly system fatal.
> >> >> >
> >> >> > Thanks. That should be a part of the changelog.
> >> >>
> >> >> ...added for v3:
> >> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> >> >>
> >> >> > It would be great to
> >> >> > describe why this cannot be simply handled by hwpoison code without any
> >> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> >> >> > hwpoison code even care?
> >> >> >
> >> >>
> >> >> Up until we started testing hardware poison recovery for persistent
> >> >> memory I assumed that the kernel did not need any new enabling to get
> >> >> basic support for recovering userspace consumed poison.
> >> >>
> >> >> However, the recovery code has a dedicated path for many different
> >> >> page states (see: action_page_types). Without any changes it
> >> >> incorrectly assumes that a dax mapped page is a page cache page
> >> >> undergoing dma, or some other pinned operation. It also assumes that
> >> >> the page must be offlined which is not correct / possible for dax
> >> >> mapped pages. There is a possibility to repair poison to dax mapped
> >> >> persistent memory pages, and the pages can't otherwise be offlined
> >> >> because they 1:1 correspond with a physical storage block, i.e.
> >> >> offlining pmem would be equivalent to punching a hole in the physical
> >> >> address space.
> >> >>
> >> >> There's also the entanglement of device-dax which guarantees a given
> >> >> mapping size (4K, 2M, 1G). This requires determining the size of the
> >> >> mapping encompassing a given pfn to know how much to unmap. Since dax
> >> >> mapped pfns don't come from the page allocator we need to read the
> >> >> page size from the page tables, not compound_order(page).
> >> >
> >> > OK, but my question is still. Do we really want to do more on top of the
> >> > existing code and add even more special casing or it is time to rethink
> >> > the whole hwpoison design?
> >>
> >> Well, there's the immediate problem that the current implementation is
> >> broken for dax and then the longer term problem that the current
> >> design appears to be too literal with a lot of custom marshaling.
> >>
> >> At least for dax in the long term we want to offer an alternative
> >> error handling model and get the filesystem much more involved. That
> >> filesystem redesign work has been waiting for the reverse-block-map
> >> effort to settle in xfs. However, that's more custom work for dax and
> >> not a redesign that helps the core-mm more generically.
> >>
> >> I think the unmap and SIGBUS portion of poison handling is relatively
> >> straightforward. It's the handling of the page HWPoison page flag that
> >> seems a bit ad hoc. The current implementation certainly was not
> >> prepared for the concept that memory can be repaired. set_mce_nospec()
> >> is a step in the direction of generic memory error handling.
> >
> > Agreed! Moreover random checks for HWPoison pages is just a maintenance
> > hell.
> >
> >> Thoughts on other pain points in the design that are on your mind, Michal?
> >
> > We discussed those at LSFMM this year: https://lwn.net/Articles/753261/
> > The main problem is that there is basically no design description, so the
> > whole feature is very easy to break. Yours is another good example of
> > that. We need to get back to the drawing board and think about how to
> > make this more robust.
> 
> I saw that article, but to be honest I did not glean any direct
> suggestions that read on these current patches. I'm interested in
> discussing a redesign, but I'm not interested in leaving poison
> unhandled for DAX while we figure it out.

Sure but that just keeps the status quo and grows DAX 

Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-07 Thread Michal Hocko
On Wed 06-06-18 06:44:45, Dan Williams wrote:
> On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko  wrote:
> > On Tue 05-06-18 07:33:17, Dan Williams wrote:
> >> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
> >> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> >> > [...]
> >> >> I'm trying to solve this real world problem when real poison is
> >> >> consumed through a dax mapping:
> >> >>
> >> >> mce: Uncorrected hardware memory error in user-access at af34214200
> >> >> {1}[Hardware Error]: It has been corrected by h/w and requires
> >> >> no further action
> >> >> mce: [Hardware Error]: Machine check events logged
> >> >> {1}[Hardware Error]: event severity: corrected
> >> >> Memory failure: 0xaf34214: reserved kernel page still
> >> >> referenced by 1 users
> >> >> [..]
> >> >> Memory failure: 0xaf34214: recovery action for reserved kernel
> >> >> page: Failed
> >> >> mce: Memory error not recovered
> >> >>
> >> >> ...i.e. currently all poison consumed through dax mappings is
> >> >> needlessly system fatal.
> >> >
> >> > Thanks. That should be a part of the changelog.
> >>
> >> ...added for v3:
> >> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> >>
> >> > It would be great to
> >> > describe why this cannot be simply handled by hwpoison code without any
> >> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> >> > hwpoison code even care?
> >> >
> >>
> >> Up until we started testing hardware poison recovery for persistent
> >> memory I assumed that the kernel did not need any new enabling to get
> >> basic support for recovering userspace consumed poison.
> >>
> >> However, the recovery code has a dedicated path for many different
> >> page states (see: action_page_types). Without any changes it
> >> incorrectly assumes that a dax mapped page is a page cache page
> >> undergoing dma, or some other pinned operation. It also assumes that
> >> the page must be offlined which is not correct / possible for dax
> >> mapped pages. There is a possibility to repair poison to dax mapped
> >> persistent memory pages, and the pages can't otherwise be offlined
> >> because they 1:1 correspond with a physical storage block, i.e.
> >> offlining pmem would be equivalent to punching a hole in the physical
> >> address space.
> >>
> >> There's also the entanglement of device-dax which guarantees a given
> >> mapping size (4K, 2M, 1G). This requires determining the size of the
> >> mapping encompassing a given pfn to know how much to unmap. Since dax
> >> mapped pfns don't come from the page allocator we need to read the
> >> page size from the page tables, not compound_order(page).
> >
> > OK, but my question is still. Do we really want to do more on top of the
> > existing code and add even more special casing or it is time to rethink
> > the whole hwpoison design?
> 
> Well, there's the immediate problem that the current implementation is
> broken for dax and then the longer term problem that the current
> design appears to be too literal with a lot of custom marshaling.
> 
> At least for dax in the long term we want to offer an alternative
> error handling model and get the filesystem much more involved. That
> filesystem redesign work has been waiting for the reverse-block-map
> effort to settle in xfs. However, that's more custom work for dax and
> not a redesign that helps the core-mm more generically.
> 
> I think the unmap and SIGBUS portion of poison handling is relatively
> straightforward. It's the handling of the page HWPoison page flag that
> seems a bit ad hoc. The current implementation certainly was not
> prepared for the concept that memory can be repaired. set_mce_nospec()
> is a step in the direction of generic memory error handling.

Agreed! Moreover random checks for HWPoison pages is just a maintenance
hell.

> Thoughts on other pain points in the design that are on your mind, Michal?

We discussed those at LSFMM this year: https://lwn.net/Articles/753261/
The main problem is that there is basically no design description, so the
whole feature is very easy to break. Yours is another good example of
that. We need to get back to the drawing board and think about how to
make this more robust.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-06 Thread Dan Williams
On Wed, Jun 6, 2018 at 12:39 AM, Michal Hocko  wrote:
> On Tue 05-06-18 07:33:17, Dan Williams wrote:
>> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
>> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
>> > [...]
>> >> I'm trying to solve this real world problem when real poison is
>> >> consumed through a dax mapping:
>> >>
>> >> mce: Uncorrected hardware memory error in user-access at af34214200
>> >> {1}[Hardware Error]: It has been corrected by h/w and requires
>> >> no further action
>> >> mce: [Hardware Error]: Machine check events logged
>> >> {1}[Hardware Error]: event severity: corrected
>> >> Memory failure: 0xaf34214: reserved kernel page still
>> >> referenced by 1 users
>> >> [..]
>> >> Memory failure: 0xaf34214: recovery action for reserved kernel
>> >> page: Failed
>> >> mce: Memory error not recovered
>> >>
>> >> ...i.e. currently all poison consumed through dax mappings is
>> >> needlessly system fatal.
>> >
>> > Thanks. That should be a part of the changelog.
>>
>> ...added for v3:
>> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
>>
>> > It would be great to
>> > describe why this cannot be simply handled by hwpoison code without any
>> > ZONE_DEVICE specific hacks? The error is recoverable so why does
>> > hwpoison code even care?
>> >
>>
>> Up until we started testing hardware poison recovery for persistent
>> memory I assumed that the kernel did not need any new enabling to get
>> basic support for recovering userspace consumed poison.
>>
>> However, the recovery code has a dedicated path for many different
>> page states (see: action_page_types). Without any changes it
>> incorrectly assumes that a dax mapped page is a page cache page
>> undergoing dma, or some other pinned operation. It also assumes that
>> the page must be offlined which is not correct / possible for dax
>> mapped pages. There is a possibility to repair poison to dax mapped
>> persistent memory pages, and the pages can't otherwise be offlined
>> because they 1:1 correspond with a physical storage block, i.e.
>> offlining pmem would be equivalent to punching a hole in the physical
>> address space.
>>
>> There's also the entanglement of device-dax which guarantees a given
>> mapping size (4K, 2M, 1G). This requires determining the size of the
>> mapping encompassing a given pfn to know how much to unmap. Since dax
>> mapped pfns don't come from the page allocator we need to read the
>> page size from the page tables, not compound_order(page).
>
> OK, but my question is still. Do we really want to do more on top of the
> existing code and add even more special casing or it is time to rethink
> the whole hwpoison design?

Well, there's the immediate problem that the current implementation is
broken for dax and then the longer term problem that the current
design appears to be too literal with a lot of custom marshaling.

At least for dax in the long term we want to offer an alternative
error handling model and get the filesystem much more involved. That
filesystem redesign work has been waiting for the reverse-block-map
effort to settle in xfs. However, that's more custom work for dax and
not a redesign that helps the core-mm more generically.

I think the unmap and SIGBUS portion of poison handling is relatively
straightforward. It's the handling of the page HWPoison page flag that
seems a bit ad hoc. The current implementation certainly was not
prepared for the concept that memory can be repaired. set_mce_nospec()
is a step in the direction of generic memory error handling.

Thoughts on other pain points in the design that are on your mind, Michal?


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-06 Thread Michal Hocko
On Tue 05-06-18 07:33:17, Dan Williams wrote:
> On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
> > On Mon 04-06-18 07:31:25, Dan Williams wrote:
> > [...]
> >> I'm trying to solve this real world problem when real poison is
> >> consumed through a dax mapping:
> >>
> >> mce: Uncorrected hardware memory error in user-access at af34214200
> >> {1}[Hardware Error]: It has been corrected by h/w and requires
> >> no further action
> >> mce: [Hardware Error]: Machine check events logged
> >> {1}[Hardware Error]: event severity: corrected
> >> Memory failure: 0xaf34214: reserved kernel page still
> >> referenced by 1 users
> >> [..]
> >> Memory failure: 0xaf34214: recovery action for reserved kernel
> >> page: Failed
> >> mce: Memory error not recovered
> >>
> >> ...i.e. currently all poison consumed through dax mappings is
> >> needlessly system fatal.
> >
> > Thanks. That should be a part of the changelog.
> 
> ...added for v3:
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html
> 
> > It would be great to
> > describe why this cannot be simply handled by hwpoison code without any
> > ZONE_DEVICE specific hacks? The error is recoverable so why does
> > hwpoison code even care?
> >
> 
> Up until we started testing hardware poison recovery for persistent
> memory I assumed that the kernel did not need any new enabling to get
> basic support for recovering userspace consumed poison.
> 
> However, the recovery code has a dedicated path for many different
> page states (see: action_page_types). Without any changes it
> incorrectly assumes that a dax mapped page is a page cache page
> undergoing dma, or some other pinned operation. It also assumes that
> the page must be offlined which is not correct / possible for dax
> mapped pages. There is a possibility to repair poison to dax mapped
> persistent memory pages, and the pages can't otherwise be offlined
> because they 1:1 correspond with a physical storage block, i.e.
> offlining pmem would be equivalent to punching a hole in the physical
> address space.
> 
> There's also the entanglement of device-dax which guarantees a given
> mapping size (4K, 2M, 1G). This requires determining the size of the
> mapping encompassing a given pfn to know how much to unmap. Since dax
> mapped pfns don't come from the page allocator we need to read the
> page size from the page tables, not compound_order(page).

OK, but my question is still. Do we really want to do more on top of the
existing code and add even more special casing or it is time to rethink
the whole hwpoison design?
-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-05 Thread Dan Williams
On Tue, Jun 5, 2018 at 7:11 AM, Michal Hocko  wrote:
> On Mon 04-06-18 07:31:25, Dan Williams wrote:
> [...]
>> I'm trying to solve this real world problem when real poison is
>> consumed through a dax mapping:
>>
>> mce: Uncorrected hardware memory error in user-access at af34214200
>> {1}[Hardware Error]: It has been corrected by h/w and requires
>> no further action
>> mce: [Hardware Error]: Machine check events logged
>> {1}[Hardware Error]: event severity: corrected
>> Memory failure: 0xaf34214: reserved kernel page still
>> referenced by 1 users
>> [..]
>> Memory failure: 0xaf34214: recovery action for reserved kernel
>> page: Failed
>> mce: Memory error not recovered
>>
>> ...i.e. currently all poison consumed through dax mappings is
>> needlessly system fatal.
>
> Thanks. That should be a part of the changelog.

...added for v3:
https://lists.01.org/pipermail/linux-nvdimm/2018-June/016153.html

> It would be great to
> describe why this cannot be simply handled by the hwpoison code without
> any ZONE_DEVICE-specific hacks. The error is recoverable, so why does
> the hwpoison code even care?
>

Up until we started testing hardware poison recovery for persistent
memory I assumed that the kernel did not need any new enabling to get
basic support for recovering userspace-consumed poison.

However, the recovery code has a dedicated path for many different
page states (see: action_page_types). Without any changes it
incorrectly assumes that a dax-mapped page is a page cache page
undergoing DMA or some other pinned operation. It also assumes that
the page must be offlined, which is neither correct nor possible for
dax-mapped pages. Poison in dax-mapped persistent memory pages can
potentially be repaired, and the pages can't be offlined in any case
because they correspond 1:1 with physical storage blocks, i.e.
offlining pmem would be equivalent to punching a hole in the physical
address space.

There's also the entanglement of device-dax, which guarantees a given
mapping size (4K, 2M, 1G). This requires determining the size of the
mapping encompassing a given pfn to know how much to unmap. Since
dax-mapped pfns don't come from the page allocator, we need to read the
page size from the page tables, not compound_order(page).


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-05 Thread Michal Hocko
On Mon 04-06-18 07:31:25, Dan Williams wrote:
[...]
> I'm trying to solve this real world problem when real poison is
> consumed through a dax mapping:
> 
> mce: Uncorrected hardware memory error in user-access at af34214200
> {1}[Hardware Error]: It has been corrected by h/w and requires
> no further action
> mce: [Hardware Error]: Machine check events logged
> {1}[Hardware Error]: event severity: corrected
> Memory failure: 0xaf34214: reserved kernel page still
> referenced by 1 users
> [..]
> Memory failure: 0xaf34214: recovery action for reserved kernel
> page: Failed
> mce: Memory error not recovered
> 
> ...i.e. currently all poison consumed through dax mappings is
> needlessly system fatal.

Thanks. That should be a part of the changelog. It would be great to
describe why this cannot be simply handled by the hwpoison code without
any ZONE_DEVICE-specific hacks. The error is recoverable, so why does
the hwpoison code even care?

-- 
Michal Hocko
SUSE Labs


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-04 Thread Dan Williams
On Mon, Jun 4, 2018 at 5:40 AM, Michal Hocko wrote:
> On Sat 02-06-18 22:22:43, Dan Williams wrote:
>> Changes since v1 [1]:
>> * Rework the locking to not use lock_page(); instead use a combination
>>   of rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to
>>   validate that dax pages are still associated with the given mapping,
>>   and to prevent the address_space from being freed while
>>   memory_failure() is busy. (Jan)
>>
>> * Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
>>   the case where the injected error is a dax mapping and the pinned
>>   reference needs to be dropped. (Naoya)
>>
>> * Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
>>   mapping of the storage capacity, it could also indicate the zero page.
>>   (Jan)
>>
>> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html
>>
>> ---
>>
>> As it stands, memory_failure() gets thoroughly confused by dev_pagemap
>> backed mappings. The recovery code has specific enabling for several
>> possible page states and needs new enabling to handle poison in dax
>> mappings.
>>
>> In order to support reliable reverse mapping of user space addresses:
>>
>> 1/ Add new locking in the memory_failure() rmap path to prevent races
>> that would typically be handled by the page lock.
>>
>> 2/ Since dev_pagemap pages are hidden from the page allocator and the
>> "compound page" accounting machinery, add a mechanism to determine the
>> size of the mapping that encompasses a given poisoned pfn.
>>
>> 3/ Given pmem errors can be repaired, change the speculatively accessed
>> poison protection, mce_unmap_kpfn(), to be reversible and otherwise
>> allow ongoing access from the kernel.
>
> This doesn't really describe the problem you are trying to solve, or
> why you believe that HWPoison is the best way to handle it. As things
> stand, HWPoison is rather ad-hoc and I am not sure adding more to it is
> really great without some deep reconsideration of how the whole thing
> is done right now, IMHO. Are you actually trying to solve some real
> world problem, or do you merely want to make soft offlining work
> properly?

I'm trying to solve this real world problem when real poison is
consumed through a dax mapping:

mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires
no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still
referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel
page: Failed
mce: Memory error not recovered

...i.e. currently all poison consumed through dax mappings is
needlessly system fatal.


Re: [PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-04 Thread Michal Hocko
On Sat 02-06-18 22:22:43, Dan Williams wrote:
> Changes since v1 [1]:
> * Rework the locking to not use lock_page(); instead use a combination
>   of rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to
>   validate that dax pages are still associated with the given mapping,
>   and to prevent the address_space from being freed while
>   memory_failure() is busy. (Jan)
> 
> * Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
>   the case where the injected error is a dax mapping and the pinned
>   reference needs to be dropped. (Naoya)
> 
> * Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
>   mapping of the storage capacity, it could also indicate the zero page.
>   (Jan)
> 
> [1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html
> 
> ---
> 
> As it stands, memory_failure() gets thoroughly confused by dev_pagemap
> backed mappings. The recovery code has specific enabling for several
> possible page states and needs new enabling to handle poison in dax
> mappings.
> 
> In order to support reliable reverse mapping of user space addresses:
> 
> 1/ Add new locking in the memory_failure() rmap path to prevent races
> that would typically be handled by the page lock.
> 
> 2/ Since dev_pagemap pages are hidden from the page allocator and the
> "compound page" accounting machinery, add a mechanism to determine the
> size of the mapping that encompasses a given poisoned pfn.
> 
> 3/ Given pmem errors can be repaired, change the speculatively accessed
> poison protection, mce_unmap_kpfn(), to be reversible and otherwise
> allow ongoing access from the kernel.

This doesn't really describe the problem you are trying to solve, or
why you believe that HWPoison is the best way to handle it. As things
stand, HWPoison is rather ad-hoc and I am not sure adding more to it is
really great without some deep reconsideration of how the whole thing
is done right now, IMHO. Are you actually trying to solve some real
world problem, or do you merely want to make soft offlining work
properly?

> ---
> 
> Dan Williams (11):
>   device-dax: Convert to vmf_insert_mixed and vm_fault_t
>   device-dax: Cleanup vm_fault de-reference chains
>   device-dax: Enable page_mapping()
>   device-dax: Set page->index
>   filesystem-dax: Set page->index
>   mm, madvise_inject_error: Let memory_failure() optionally take a page reference
>   x86, memory_failure: Introduce {set,clear}_mce_nospec()
>   mm, memory_failure: Pass page size to kill_proc()
>   mm, memory_failure: Fix page->mapping assumptions relative to the page lock
>   mm, memory_failure: Teach memory_failure() about dev_pagemap pages
>   libnvdimm, pmem: Restore page attributes when clearing errors
> 
> 
>  arch/x86/include/asm/set_memory.h |   29 
>  arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
>  arch/x86/kernel/cpu/mcheck/mce.c  |   38 -
>  drivers/dax/device.c  |   97 -
>  drivers/nvdimm/pmem.c |   26 
>  drivers/nvdimm/pmem.h |   13 ++
>  fs/dax.c  |   16 ++
>  include/linux/huge_mm.h   |5 -
>  include/linux/mm.h|1 
>  include/linux/set_memory.h|   14 ++
>  mm/huge_memory.c  |4 -
>  mm/madvise.c  |   18 ++
>  mm/memory-failure.c   |  209 ++---
>  13 files changed, 366 insertions(+), 119 deletions(-)

-- 
Michal Hocko
SUSE Labs


[PATCH v2 00/11] mm: Teach memory_failure() about ZONE_DEVICE pages

2018-06-02 Thread Dan Williams
Changes since v1 [1]:
* Rework the locking to not use lock_page(); instead use a combination
  of rcu_read_lock(), xa_lock_irq(&mapping->pages), and igrab() to
  validate that dax pages are still associated with the given mapping,
  and to prevent the address_space from being freed while
  memory_failure() is busy. (Jan)

* Fix use of MF_COUNT_INCREASED in madvise_inject_error() to account for
  the case where the injected error is a dax mapping and the pinned
  reference needs to be dropped. (Naoya)

* Clarify with a comment that VM_FAULT_NOPAGE may not always indicate a
  mapping of the storage capacity, it could also indicate the zero page.
  (Jan)

[1]: https://lists.01.org/pipermail/linux-nvdimm/2018-May/015932.html
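The validation sequence from the first bullet might look roughly like the following kernel-style sketch. It is not compilable standalone; the helper name and exact structure are assumptions, and the actual patch may differ.

```c
/* Sketch: pin and validate the mapping of a dax page from
 * memory_failure(), without taking lock_page(). */
static struct address_space *dax_mapping_get(struct page *page)
{
	struct address_space *mapping;

	rcu_read_lock();
	mapping = READ_ONCE(page->mapping);
	/* pin the inode so the address_space can't be freed under us */
	if (!mapping || !igrab(mapping->host))
		mapping = NULL;
	rcu_read_unlock();
	if (!mapping)
		return NULL;	/* mapping already gone, nothing to unmap */

	/* serialize against truncate tearing down page->mapping */
	xa_lock_irq(&mapping->pages);
	if (page->mapping != mapping) {
		xa_unlock_irq(&mapping->pages);
		iput(mapping->host);
		return NULL;
	}
	return mapping;	/* caller drops the lock and the inode reference */
}
```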

---

As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.

In order to support reliable reverse mapping of user space addresses:

1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.

2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine the
size of the mapping that encompasses a given poisoned pfn.

3/ Given pmem errors can be repaired, change the speculatively accessed
poison protection, mce_unmap_kpfn(), to be reversible and otherwise
allow ongoing access from the kernel.
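For point 3/, the protection becomes a reversible toggle on the pfn's kernel mapping rather than a one-way unmap. A rough kernel-style sketch of the intended flow (the {set,clear}_mce_nospec() names come from the patch titles above; the clear-poison call site shown is an assumption):

```c
/* On error: take the pfn out of the kernel direct map so it can't be
 * speculatively fetched (previously a one-way mce_unmap_kpfn()). */
set_mce_nospec(pfn);

/* Later, once the pmem driver has successfully cleared the poison in
 * the media, kernel access to the pfn can be restored: */
if (nvdimm_clear_poison(...) > 0)
	clear_mce_nospec(pfn);
```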

---

Dan Williams (11):
  device-dax: Convert to vmf_insert_mixed and vm_fault_t
  device-dax: Cleanup vm_fault de-reference chains
  device-dax: Enable page_mapping()
  device-dax: Set page->index
  filesystem-dax: Set page->index
  mm, madvise_inject_error: Let memory_failure() optionally take a page reference
  x86, memory_failure: Introduce {set,clear}_mce_nospec()
  mm, memory_failure: Pass page size to kill_proc()
  mm, memory_failure: Fix page->mapping assumptions relative to the page lock
  mm, memory_failure: Teach memory_failure() about dev_pagemap pages
  libnvdimm, pmem: Restore page attributes when clearing errors


 arch/x86/include/asm/set_memory.h |   29 
 arch/x86/kernel/cpu/mcheck/mce-internal.h |   15 --
 arch/x86/kernel/cpu/mcheck/mce.c  |   38 -
 drivers/dax/device.c  |   97 -
 drivers/nvdimm/pmem.c |   26 
 drivers/nvdimm/pmem.h |   13 ++
 fs/dax.c  |   16 ++
 include/linux/huge_mm.h   |5 -
 include/linux/mm.h|1 
 include/linux/set_memory.h|   14 ++
 mm/huge_memory.c  |4 -
 mm/madvise.c  |   18 ++
 mm/memory-failure.c   |  209 ++---
 13 files changed, 366 insertions(+), 119 deletions(-)