Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-09-02 Thread Dave Chinner
On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote:
> On 08/29/12 11:59, Dave Chinner wrote:
> > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
> >> And yes, I understand it's ideal, but many applications choose not to
> >> do that for performance reason.
> >> So I think it's helpful if we can surely report to such applications.
> 
> I suspect "performance vs. integrity" is not a correct
> description of the problem.

Right, to be  more precise, it's a "eat my data" vs "integrity"
problem. And in almost all cases I've seen over the years, "eat my
data" is done for performance reasons...

> > If performance is chosen over data integrity, we are under no
> > obligation to keep the error around indefinitely.  Fundamentally,
> > ensuring a write completes successfully is the reponsibility of the
> > application, not the kernel. There are so many different filesytem
> > and storage errors that can be lost right now because data is not
> > fsync()d, adding another one to them really doesn't change anything.
> > IOWs, a memory error is no different to a disk failing or the system
> > crashing when it comes to data integrity. If you care, you use
> > fsync().
> 
> I agree that applications should fsync() or O_SYNC
> when it wants to make sure the written data in on disk.
> 
> AFAIU, what Naoya is going to address is the case where
> fsync() is not necessarily needed.
> 
> For example, if someone do:
>   $ patch -p1 < ../a.patch
>   $ tar cf . > ../a.tar
> 
> and disk failure occurred between "patch" and "tar",
> "tar" will either see uptodate data or I/O error.

No, it won't. The only place AS_EIO is tested is in
filemap_fdatawait_range(), which is only called in the fsync() path.
The only way to report async write IO errors is to use fsync() -
subsequent reads of the file do *not* see the write error.

IOWs, tar will be oblivious of any IO error that preceeded it
reading the files it is copying.

> OTOH, if the failure was detected on dirty pagecache, the current memory
> failure handler invalidates the dirty page and the "tar" command will
> re-read old contents from disk without error.

After an IO error, the dirty page is no longer uptodate - that gets
cleared - so when the page is read the data will be re-read from
disk just like if a memory error occurred. So tar will behave the
same regardless of whether it is a memory error or an IO error (i.e.
reread old data from disk)

> (Well, the failures above are permanent failures.

Write IO errors can also be transient or permanent. Transient, for
example, when a path failure occurs and multipathing then detects
this and fails over to a good path. A subsequent write will then
succeed. Permanent, for example, when someone unplugs a USB drive.

> IOW, the current
>  memory failure handler turns permanent failure into transient error,
>  which is often more difficult to handle, I think.)

The patch I commented on is turning a transient error (error in a
page that is then poisoned and never used again) into a permanent
error (error on an address space that is reported on every future
operation that tries to insert a page into the page cache).

> Naoya's patch will keep the failure information and allows the reader
> to get I/O error when it reads from broken pagecache.

It only adds a hwposion check in add_to_page_cache_locked(). If the
page is already in the cache, then no error will be sent to the
reader because it never gets to add_to_page_cache_locked().

So there's no guarantee that the reader is even going to see the
error, or that they see the error on the page that actually caused
it - access to any missing page in the page cache will trigger it.
And as memory reclaim clears pages off the inode, more and more of
the range of the inode's data will return an error, even though
there is nothing wrong with the data in most of the file.

Indeed, what happens if the application calls fsync, gets the error
and tries to rewrite the page? i.e. it does everything correctly to
handle the write error? With this patch set, the application
cannot insert a replacement page into the page cache, so all
subsequent writes fail! IOWs, it makes it impossible for
applications to recover from a detected and handled memory failure.

I have no issue with reporting the problem to userspace - that needs
t am I saying that the current IO reporting is wonderful and can't
be improved. What I am saying, though, is that I really don't think
this patch set has been well thought through from either an IO path
or userspace error handling point of view.  The problems with this
patch set are quite significant:
- permanent, unclearable error except by complete removal of
  all data on the file. (forcing the removal of all
  good data to recover from a transient error on a small
  amount of data).
- while the error is present, bad data cannot be overwritten
  (applications cannot recover even if they catch the
  

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-09-02 Thread Dave Chinner
On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote:
 On 08/29/12 11:59, Dave Chinner wrote:
  On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
  And yes, I understand it's ideal, but many applications choose not to
  do that for performance reason.
  So I think it's helpful if we can surely report to such applications.
 
 I suspect performance vs. integrity is not a correct
 description of the problem.

Right, to be  more precise, it's a eat my data vs integrity
problem. And in almost all cases I've seen over the years, eat my
data is done for performance reasons...

  If performance is chosen over data integrity, we are under no
  obligation to keep the error around indefinitely.  Fundamentally,
  ensuring a write completes successfully is the reponsibility of the
  application, not the kernel. There are so many different filesytem
  and storage errors that can be lost right now because data is not
  fsync()d, adding another one to them really doesn't change anything.
  IOWs, a memory error is no different to a disk failing or the system
  crashing when it comes to data integrity. If you care, you use
  fsync().
 
 I agree that applications should fsync() or O_SYNC
 when it wants to make sure the written data in on disk.
 
 AFAIU, what Naoya is going to address is the case where
 fsync() is not necessarily needed.
 
 For example, if someone do:
   $ patch -p1  ../a.patch
   $ tar cf .  ../a.tar
 
 and disk failure occurred between patch and tar,
 tar will either see uptodate data or I/O error.

No, it won't. The only place AS_EIO is tested is in
filemap_fdatawait_range(), which is only called in the fsync() path.
The only way to report async write IO errors is to use fsync() -
subsequent reads of the file do *not* see the write error.

IOWs, tar will be oblivious of any IO error that preceeded it
reading the files it is copying.

 OTOH, if the failure was detected on dirty pagecache, the current memory
 failure handler invalidates the dirty page and the tar command will
 re-read old contents from disk without error.

After an IO error, the dirty page is no longer uptodate - that gets
cleared - so when the page is read the data will be re-read from
disk just like if a memory error occurred. So tar will behave the
same regardless of whether it is a memory error or an IO error (i.e.
reread old data from disk)

 (Well, the failures above are permanent failures.

Write IO errors can also be transient or permanent. Transient, for
example, when a path failure occurs and multipathing then detects
this and fails over to a good path. A subsequent write will then
succeed. Permanent, for example, when someone unplugs a USB drive.

 IOW, the current
  memory failure handler turns permanent failure into transient error,
  which is often more difficult to handle, I think.)

The patch I commented on is turning a transient error (error in a
page that is then poisoned and never used again) into a permanent
error (error on an address space that is reported on every future
operation that tries to insert a page into the page cache).

 Naoya's patch will keep the failure information and allows the reader
 to get I/O error when it reads from broken pagecache.

It only adds a hwposion check in add_to_page_cache_locked(). If the
page is already in the cache, then no error will be sent to the
reader because it never gets to add_to_page_cache_locked().

So there's no guarantee that the reader is even going to see the
error, or that they see the error on the page that actually caused
it - access to any missing page in the page cache will trigger it.
And as memory reclaim clears pages off the inode, more and more of
the range of the inode's data will return an error, even though
there is nothing wrong with the data in most of the file.

Indeed, what happens if the application calls fsync, gets the error
and tries to rewrite the page? i.e. it does everything correctly to
handle the write error? With this patch set, the application
cannot insert a replacement page into the page cache, so all
subsequent writes fail! IOWs, it makes it impossible for
applications to recover from a detected and handled memory failure.

I have no issue with reporting the problem to userspace - that needs
t am I saying that the current IO reporting is wonderful and can't
be improved. What I am saying, though, is that I really don't think
this patch set has been well thought through from either an IO path
or userspace error handling point of view.  The problems with this
patch set are quite significant:
- permanent, unclearable error except by complete removal of
  all data on the file. (forcing the removal of all
  good data to recover from a transient error on a small
  amount of data).
- while the error is present, bad data cannot be overwritten
  (applications cannot recover even if they catch the
  error).
- while the error is present, no new data can be written

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-29 Thread Jun'ichi Nomura
On 08/29/12 11:59, Dave Chinner wrote:
> On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
>> And yes, I understand it's ideal, but many applications choose not to
>> do that for performance reason.
>> So I think it's helpful if we can surely report to such applications.

I suspect "performance vs. integrity" is not a correct
description of the problem.

> If performance is chosen over data integrity, we are under no
> obligation to keep the error around indefinitely.  Fundamentally,
> ensuring a write completes successfully is the reponsibility of the
> application, not the kernel. There are so many different filesytem
> and storage errors that can be lost right now because data is not
> fsync()d, adding another one to them really doesn't change anything.
> IOWs, a memory error is no different to a disk failing or the system
> crashing when it comes to data integrity. If you care, you use
> fsync().

I agree that applications should fsync() or O_SYNC
when it wants to make sure the written data in on disk.

AFAIU, what Naoya is going to address is the case where
fsync() is not necessarily needed.

For example, if someone do:
  $ patch -p1 < ../a.patch
  $ tar cf . > ../a.tar

and disk failure occurred between "patch" and "tar",
"tar" will either see uptodate data or I/O error.
OTOH, if the failure was detected on dirty pagecache, the current memory
failure handler invalidates the dirty page and the "tar" command will
re-read old contents from disk without error.

(Well, the failures above are permanent failures. IOW, the current
 memory failure handler turns permanent failure into transient error,
 which is often more difficult to handle, I think.)

Naoya's patch will keep the failure information and allows the reader
to get I/O error when it reads from broken pagecache.

-- 
Jun'ichi Nomura, NEC Corporation
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-29 Thread Jun'ichi Nomura
On 08/29/12 11:59, Dave Chinner wrote:
 On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
 And yes, I understand it's ideal, but many applications choose not to
 do that for performance reason.
 So I think it's helpful if we can surely report to such applications.

I suspect performance vs. integrity is not a correct
description of the problem.

 If performance is chosen over data integrity, we are under no
 obligation to keep the error around indefinitely.  Fundamentally,
 ensuring a write completes successfully is the reponsibility of the
 application, not the kernel. There are so many different filesytem
 and storage errors that can be lost right now because data is not
 fsync()d, adding another one to them really doesn't change anything.
 IOWs, a memory error is no different to a disk failing or the system
 crashing when it comes to data integrity. If you care, you use
 fsync().

I agree that applications should fsync() or O_SYNC
when it wants to make sure the written data in on disk.

AFAIU, what Naoya is going to address is the case where
fsync() is not necessarily needed.

For example, if someone do:
  $ patch -p1  ../a.patch
  $ tar cf .  ../a.tar

and disk failure occurred between patch and tar,
tar will either see uptodate data or I/O error.
OTOH, if the failure was detected on dirty pagecache, the current memory
failure handler invalidates the dirty page and the tar command will
re-read old contents from disk without error.

(Well, the failures above are permanent failures. IOW, the current
 memory failure handler turns permanent failure into transient error,
 which is often more difficult to handle, I think.)

Naoya's patch will keep the failure information and allows the reader
to get I/O error when it reads from broken pagecache.

-- 
Jun'ichi Nomura, NEC Corporation
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-28 Thread Dave Chinner
On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
> On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > > Let me explain more to clarify my whole scenario. If a memory error
> > > hits on a dirty pagecache, kernel works like below:
> > > 
> > >   1. handles a MCE interrupt (logging MCE events,)
> > >   2. calls memory error handler (doing 3 to 6,)
> > >   3. sets PageHWPoison flag on the error page,
> > >   4. unmaps all mappings to processes' virtual addresses,
> > 
> > So nothing in userspace sees the bad page after this.
> > 
> > >   5. sets AS_HWPOISON on mappings to which the error page belongs
> > >   6. invalidates the error page (unlinks it from LRU list and removes
> > >  it from pagecache,)
> > >   (memory error handler finished)
> > 
> > Ok, so the moment a memory error is handled, the page has been
> > removed from the inode's mapping, and it will never be seen by
> > aplications again. It's a transient error
> > 
> > >   7. later accesses to the file returns -EIO,
> > >   8. AS_HWPOISON is cleared when the file is removed or completely
> > >  truncated.
> > 
> >  so why do we have to keep an EIO on the inode forever?
> > 
> > If the page is not dirty, then just tossing it from the cache (as
> > is already done) and rereading it from disk next time it is accessed
> > removes the need for any error to be reported at all. It's
> > effectively a transient error at this point, and as such no errors
> > should be visible from userspace.
> > 
> > If the page is dirty, then it needs to be treated just like any
> > other failed page write - the page is invalidated and the address
> > space is marked with AS_EIO, and that is reported to the next
> > operation that waits on IO on that file (i.e. fsync)
> > 
> > If you have a second application that reads the files that depends
> > on a guarantee of good data, then the first step in that process is
> > that application that writes it needs to use fsync to check the data
> > was written correctly. That ensures that you only have clean pages
> > in the cache before the writer closes the file, and any h/w error
> > then devolves to the above transient clean page invalidation case.
> 
> Thank you for detailed explanations.
> And yes, I understand it's ideal, but many applications choose not to
> do that for performance reason.

You choose: data integrity or performance.

> So I think it's helpful if we can surely report to such applications.

If performance is chosen over data integrity, we are under no
obligation to keep the error around indefinitely.  Fundamentally,
ensuring a write completes successfully is the reponsibility of the
application, not the kernel. There are so many different filesytem
and storage errors that can be lost right now because data is not
fsync()d, adding another one to them really doesn't change anything.
IOWs, a memory error is no different to a disk failing or the system
crashing when it comes to data integrity. If you care, you use
fsync().

> > Hence I fail to see why this type of IO error needs to be sticky.
> > The error on the mapping is transient - it is gone as soon as the
> > page is removed from the mapping. Hence the error can be dropped as
> > soon as it is reported to userspace because the mapping is now error
> > free.
> 
> It's error free only for the applications which do fsync check in
> each write, but not for the applications which don't do.
> I think the penalty for the latters (ignore dirty data lost and get
> wrong results) is too big to consider it as a reasonable trade-off.

I'm guessing that you don't deal with data integrity issues very
often. What you are suggesting is not a reasonable tradeoff - either
applications are coded correctly for data integrity, or they give
up any expectation that errors will be detected and reported
reliably.  Hoping that we might be able to report an error somewhere
to someone who didn't care to avoid or collect in the first place
does not improve the situation for anyone

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-28 Thread Dave Chinner
On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote:
 On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
  On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
   Let me explain more to clarify my whole scenario. If a memory error
   hits on a dirty pagecache, kernel works like below:
   
 1. handles a MCE interrupt (logging MCE events,)
 2. calls memory error handler (doing 3 to 6,)
 3. sets PageHWPoison flag on the error page,
 4. unmaps all mappings to processes' virtual addresses,
  
  So nothing in userspace sees the bad page after this.
  
 5. sets AS_HWPOISON on mappings to which the error page belongs
 6. invalidates the error page (unlinks it from LRU list and removes
it from pagecache,)
 (memory error handler finished)
  
  Ok, so the moment a memory error is handled, the page has been
  removed from the inode's mapping, and it will never be seen by
  aplications again. It's a transient error
  
 7. later accesses to the file returns -EIO,
 8. AS_HWPOISON is cleared when the file is removed or completely
truncated.
  
   so why do we have to keep an EIO on the inode forever?
  
  If the page is not dirty, then just tossing it from the cache (as
  is already done) and rereading it from disk next time it is accessed
  removes the need for any error to be reported at all. It's
  effectively a transient error at this point, and as such no errors
  should be visible from userspace.
  
  If the page is dirty, then it needs to be treated just like any
  other failed page write - the page is invalidated and the address
  space is marked with AS_EIO, and that is reported to the next
  operation that waits on IO on that file (i.e. fsync)
  
  If you have a second application that reads the files that depends
  on a guarantee of good data, then the first step in that process is
  that application that writes it needs to use fsync to check the data
  was written correctly. That ensures that you only have clean pages
  in the cache before the writer closes the file, and any h/w error
  then devolves to the above transient clean page invalidation case.
 
 Thank you for detailed explanations.
 And yes, I understand it's ideal, but many applications choose not to
 do that for performance reason.

You choose: data integrity or performance.

 So I think it's helpful if we can surely report to such applications.

If performance is chosen over data integrity, we are under no
obligation to keep the error around indefinitely.  Fundamentally,
ensuring a write completes successfully is the reponsibility of the
application, not the kernel. There are so many different filesytem
and storage errors that can be lost right now because data is not
fsync()d, adding another one to them really doesn't change anything.
IOWs, a memory error is no different to a disk failing or the system
crashing when it comes to data integrity. If you care, you use
fsync().

  Hence I fail to see why this type of IO error needs to be sticky.
  The error on the mapping is transient - it is gone as soon as the
  page is removed from the mapping. Hence the error can be dropped as
  soon as it is reported to userspace because the mapping is now error
  free.
 
 It's error free only for the applications which do fsync check in
 each write, but not for the applications which don't do.
 I think the penalty for the latters (ignore dirty data lost and get
 wrong results) is too big to consider it as a reasonable trade-off.

I'm guessing that you don't deal with data integrity issues very
often. What you are suggesting is not a reasonable tradeoff - either
applications are coded correctly for data integrity, or they give
up any expectation that errors will be detected and reported
reliably.  Hoping that we might be able to report an error somewhere
to someone who didn't care to avoid or collect in the first place
does not improve the situation for anyone

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-27 Thread Naoya Horiguchi
On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
> On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner 
> > > > > > case
> > > > > > where we have possibilities of data lost. This is because in this 
> > > > > > fix
> > > > > > AS_HWPOISON is cleared when the inode cache is dropped.
> > > 
> > > > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > > > +++ v3.6-rc1/mm/truncate.c
> > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, 
> > > > > > loff_t newsize)
> > > > > >  
> > > > > > oldsize = inode->i_size;
> > > > > > i_size_write(inode, newsize);
> > > > > > +   if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > > > +   mapping_clear_hwpoison(inode->i_mapping);
> > > > > 
> > > > > So only a truncate to zero size will clear the poison flag?
> > > > 
> > > > Yes, this is because we only know if the file is affected by hwpoison,
> > > > but not where the hwpoisoned page is in the file. We could remember it,
> > > > but I did not do it for simplicity.
> > > 
> > > Surely the page has flags on it to say it is poisoned? That is,
> > > after truncating the page cache, if the address space is poisoned,
> > > then you can do a pass across the mapping tree checking if there are
> > > any poisoned pages left? Or perhaps adding a new mapping tree tag so
> > > that the poisoned status is automatically determined by the presence
> > > of the poisoned page in the mapping tree?
> > 
> > The answer for the first question is yes. And for the second question,
> > I don't think it's easy because the mapping tree has no reference to
> > the error page (I explain more about this below, please see also it,)
> > and it can cost too much to search poisoned pages over page cache in
> > each request.
> 
> Which is my point about a radix tree tag - that's very efficient.
> 
> > And for the third question, I think we could do this, but to do it
> > we need an additional space (8 bytes) in struct radix_tree_node.
> > Considering that this new tag is not used so frequently, so I'm not
> > sure that it's worth the cost.
> 
> A radix tree node is currently 560 bytes on x86_64, packed 7 to a
> page. i.e. using 3920 bytes. We can add another 8 bytes to it
> without increasing memory usage at all. So, no cost at all.

OK.

> > > > > What happens if it is the last page in the mapping that is poisoned,
> > > > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > > > 
> > > > When we handle the hwpoisoned inode, the error page should already
> > > > be removed from pagecache, so the remaining pages are not related
> > > > to the error and we need not care about them when we consider bit
> > > > clearing.
> > > 
> > > Sorry, I don't follow. What removed the page from the page cache?
> > > The truncate_page_cache() call that follows the above code hunk is
> > > what does that, so I don't see how it can already be removed from
> > > the page cache
> > 
> > Memory error handler (memory_failure() in mm/memory-failure.c) has
> > removed the error page from the page cache.
> > And truncate_page_cache() that follows this hunk removes all pages
> > belonging to the page cache of the poisoned file (where the error
> > page itself is not included in them.)
> > 
> > Let me explain more to clarify my whole scenario. If a memory error
> > hits on a dirty pagecache, kernel works like below:
> > 
> >   1. handles a MCE interrupt (logging MCE events,)
> >   2. calls memory error handler (doing 3 to 6,)
> >   3. sets PageHWPoison flag on the error page,
> >   4. unmaps all mappings to processes' virtual addresses,
> 
> So nothing in userspace sees the bad page after this.
> 
> >   5. sets AS_HWPOISON on mappings to which the error page belongs
> >   6. invalidates the error page (unlinks it from LRU list and removes
> >  it from pagecache,)
> >   (memory error handler finished)
> 
> Ok, so the moment a memory error is handled, the page has been
> removed from the inode's mapping, and it will never be seen by
> aplications again. It's a transient error
> 
> >   7. later accesses to the file returns -EIO,
> >   8. AS_HWPOISON is cleared when the file is removed or completely
> >  truncated.
> 
>  so why do we have to keep an EIO on the inode forever?
> 
> If the page is not dirty, then just tossing it from the cache (as
> is already done) and rereading it from disk next time it is accessed
> removes the need for any error to be reported at all. It's
> effectively a transient error at this point, and as such no errors
> should be visible from userspace.
> 
> If the page is dirty, then it 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-27 Thread Naoya Horiguchi
On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote:
 On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
  On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
   On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
 On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
  HWPOISON: report sticky EIO for poisoned file still has a corner 
  case
  where we have possibilities of data lost. This is because in this 
  fix
  AS_HWPOISON is cleared when the inode cache is dropped.
   
  --- v3.6-rc1.orig/mm/truncate.c
  +++ v3.6-rc1/mm/truncate.c
  @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, 
  loff_t newsize)
   
  oldsize = inode-i_size;
  i_size_write(inode, newsize);
  +   if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
  +   mapping_clear_hwpoison(inode-i_mapping);
 
 So only a truncate to zero size will clear the poison flag?

Yes, this is because we only know if the file is affected by hwpoison,
but not where the hwpoisoned page is in the file. We could remember it,
but I did not do it for simplicity.
   
   Surely the page has flags on it to say it is poisoned? That is,
   after truncating the page cache, if the address space is poisoned,
   then you can do a pass across the mapping tree checking if there are
   any poisoned pages left? Or perhaps adding a new mapping tree tag so
   that the poisoned status is automatically determined by the presence
   of the poisoned page in the mapping tree?
  
  The answer for the first question is yes. And for the second question,
  I don't think it's easy because the mapping tree has no reference to
  the error page (I explain more about this below, please see also it,)
  and it can cost too much to search poisoned pages over page cache in
  each request.
 
 Which is my point about a radix tree tag - that's very efficient.
 
  And for the third question, I think we could do this, but to do it
  we need an additional space (8 bytes) in struct radix_tree_node.
  Considering that this new tag is not used so frequently, so I'm not
  sure that it's worth the cost.
 
 A radix tree node is currently 560 bytes on x86_64, packed 7 to a
 page. i.e. using 3920 bytes. We can add another 8 bytes to it
 without increasing memory usage at all. So, no cost at all.

OK.

 What happens if it is the last page in the mapping that is poisoned,
 and we truncate that away? Shouldn't that clear the poisoned bit?

When we handle the hwpoisoned inode, the error page should already
be removed from pagecache, so the remaining pages are not related
to the error and we need not care about them when we consider bit
clearing.
   
   Sorry, I don't follow. What removed the page from the page cache?
   The truncate_page_cache() call that follows the above code hunk is
   what does that, so I don't see how it can already be removed from
   the page cache
  
  Memory error handler (memory_failure() in mm/memory-failure.c) has
  removed the error page from the page cache.
  And truncate_page_cache() that follows this hunk removes all pages
  belonging to the page cache of the poisoned file (where the error
  page itself is not included in them.)
  
  Let me explain more to clarify my whole scenario. If a memory error
  hits on a dirty pagecache, kernel works like below:
  
1. handles a MCE interrupt (logging MCE events,)
2. calls memory error handler (doing 3 to 6,)
3. sets PageHWPoison flag on the error page,
4. unmaps all mappings to processes' virtual addresses,
 
 So nothing in userspace sees the bad page after this.
 
5. sets AS_HWPOISON on mappings to which the error page belongs
6. invalidates the error page (unlinks it from LRU list and removes
   it from pagecache,)
(memory error handler finished)
 
 Ok, so the moment a memory error is handled, the page has been
 removed from the inode's mapping, and it will never be seen by
 aplications again. It's a transient error
 
7. later accesses to the file returns -EIO,
8. AS_HWPOISON is cleared when the file is removed or completely
   truncated.
 
  so why do we have to keep an EIO on the inode forever?
 
 If the page is not dirty, then just tossing it from the cache (as
 is already done) and rereading it from disk next time it is accessed
 removes the need for any error to be reported at all. It's
 effectively a transient error at this point, and as such no errors
 should be visible from userspace.
 
 If the page is dirty, then it needs to be treated just like any
 other failed page write - the page is invalidated and the address
 space is marked with AS_EIO, and that is reported to the next
 operation that waits on IO on that file (i.e. fsync)
 
 If you have a second application that reads the files that depends
 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-26 Thread Dave Chinner
On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
> On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner 
> > > > > case
> > > > > where we have possibilities of data lost. This is because in this fix
> > > > > AS_HWPOISON is cleared when the inode cache is dropped.
> > 
> > > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > > +++ v3.6-rc1/mm/truncate.c
> > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
> > > > > newsize)
> > > > >  
> > > > >   oldsize = inode->i_size;
> > > > >   i_size_write(inode, newsize);
> > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > > + mapping_clear_hwpoison(inode->i_mapping);
> > > > 
> > > > So only a truncate to zero size will clear the poison flag?
> > > 
> > > Yes, this is because we only know if the file is affected by hwpoison,
> > > but not where the hwpoisoned page is in the file. We could remember it,
> > > but I did not do it for simplicity.
> > 
> > Surely the page has flags on it to say it is poisoned? That is,
> > after truncating the page cache, if the address space is poisoned,
> > then you can do a pass across the mapping tree checking if there are
> > any poisoned pages left? Or perhaps adding a new mapping tree tag so
> > that the poisoned status is automatically determined by the presence
> > of the poisoned page in the mapping tree?
> 
> The answer for the first question is yes. And for the second question,
> I don't think it's easy because the mapping tree has no reference to
> the error page (I explain more about this below, please see also it,)
> and it can cost too much to search poisoned pages over page cache in
> each request.

Which is my point about a radix tree tag - that's very efficient.

> And for the third question, I think we could do this, but to do it
> we need an additional space (8 bytes) in struct radix_tree_node.
> Considering that this new tag is not used so frequently, so I'm not
> sure that it's worth the cost.

A radix tree node is currently 560 bytes on x86_64, packed 7 to a
page. i.e. using 3920 bytes. We can add another 8 bytes to it
without increasing memory usage at all. So, no cost at all.

> > > > What happens if it is the last page in the mapping that is poisoned,
> > > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > > 
> > > When we handle the hwpoisoned inode, the error page should already
> > > be removed from pagecache, so the remaining pages are not related
> > > to the error and we need not care about them when we consider bit
> > > clearing.
> > 
> > Sorry, I don't follow. What removed the page from the page cache?
> > The truncate_page_cache() call that follows the above code hunk is
> > what does that, so I don't see how it can already be removed from
> > the page cache
> 
> Memory error handler (memory_failure() in mm/memory-failure.c) has
> removed the error page from the page cache.
> And truncate_page_cache() that follows this hunk removes all pages
> belonging to the page cache of the poisoned file (where the error
> page itself is not included in them.)
> 
> Let me explain more to clarify my whole scenario. If a memory error
> hits on a dirty pagecache, kernel works like below:
> 
>   1. handles a MCE interrupt (logging MCE events,)
>   2. calls memory error handler (doing 3 to 6,)
>   3. sets PageHWPoison flag on the error page,
>   4. unmaps all mappings to processes' virtual addresses,

So nothing in userspace sees the bad page after this.

>   5. sets AS_HWPOISON on mappings to which the error page belongs
>   6. invalidates the error page (unlinks it from LRU list and removes
>  it from pagecache,)
>   (memory error handler finished)

Ok, so the moment a memory error is handled, the page has been
removed from the inode's mapping, and it will never be seen by
aplications again. It's a transient error

>   7. later accesses to the file returns -EIO,
>   8. AS_HWPOISON is cleared when the file is removed or completely
>  truncated.

 so why do we have to keep an EIO on the inode forever?

If the page is not dirty, then just tossing it from the cache (as
is already done) and rereading it from disk next time it is accessed
removes the need for any error to be reported at all. It's
effectively a transient error at this point, and as such no errors
should be visible from userspace.

If the page is dirty, then it needs to be treated just like any
other failed page write - the page is invalidated and the address
space is marked with AS_EIO, and that is reported to the next
operation that waits on IO on that file (i.e. fsync)

If you have a second application that reads the files 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-26 Thread Dave Chinner
On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote:
 On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
  On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
   On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
 HWPOISON: report sticky EIO for poisoned file still has a corner 
 case
 where we have possibilities of data lost. This is because in this fix
 AS_HWPOISON is cleared when the inode cache is dropped.
  
 --- v3.6-rc1.orig/mm/truncate.c
 +++ v3.6-rc1/mm/truncate.c
 @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
 newsize)
  
   oldsize = inode-i_size;
   i_size_write(inode, newsize);
 + if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
 + mapping_clear_hwpoison(inode-i_mapping);

So only a truncate to zero size will clear the poison flag?
   
   Yes, this is because we only know if the file is affected by hwpoison,
   but not where the hwpoisoned page is in the file. We could remember it,
   but I did not do it for simplicity.
  
  Surely the page has flags on it to say it is poisoned? That is,
  after truncating the page cache, if the address space is poisoned,
  then you can do a pass across the mapping tree checking if there are
  any poisoned pages left? Or perhaps adding a new mapping tree tag so
  that the poisoned status is automatically determined by the presence
  of the poisoned page in the mapping tree?
 
 The answer for the first question is yes. And for the second question,
 I don't think it's easy because the mapping tree has no reference to
 the error page (I explain more about this below, please see also it,)
 and it can cost too much to search poisoned pages over page cache in
 each request.

Which is my point about a radix tree tag - that's very efficient.

 And for the third question, I think we could do this, but to do it
 we need an additional space (8 bytes) in struct radix_tree_node.
 Considering that this new tag is not used so frequently, so I'm not
 sure that it's worth the cost.

A radix tree node is currently 560 bytes on x86_64, packed 7 to a
page. i.e. using 3920 bytes. We can add another 8 bytes to it
without increasing memory usage at all. So, no cost at all.

What happens if it is the last page in the mapping that is poisoned,
and we truncate that away? Shouldn't that clear the poisoned bit?
   
   When we handle the hwpoisoned inode, the error page should already
   be removed from pagecache, so the remaining pages are not related
   to the error and we need not care about them when we consider bit
   clearing.
  
  Sorry, I don't follow. What removed the page from the page cache?
  The truncate_page_cache() call that follows the above code hunk is
  what does that, so I don't see how it can already be removed from
  the page cache
 
 Memory error handler (memory_failure() in mm/memory-failure.c) has
 removed the error page from the page cache.
 And truncate_page_cache() that follows this hunk removes all pages
 belonging to the page cache of the poisoned file (where the error
 page itself is not included in them.)
 
 Let me explain more to clarify my whole scenario. If a memory error
 hits on a dirty pagecache, kernel works like below:
 
   1. handles a MCE interrupt (logging MCE events,)
   2. calls memory error handler (doing 3 to 6,)
   3. sets PageHWPoison flag on the error page,
   4. unmaps all mappings to processes' virtual addresses,

So nothing in userspace sees the bad page after this.

   5. sets AS_HWPOISON on mappings to which the error page belongs
   6. invalidates the error page (unlinks it from LRU list and removes
  it from pagecache,)
   (memory error handler finished)

Ok, so the moment a memory error is handled, the page has been
removed from the inode's mapping, and it will never be seen by
aplications again. It's a transient error

   7. later accesses to the file returns -EIO,
   8. AS_HWPOISON is cleared when the file is removed or completely
  truncated.

 so why do we have to keep an EIO on the inode forever?

If the page is not dirty, then just tossing it from the cache (as
is already done) and rereading it from disk next time it is accessed
removes the need for any error to be reported at all. It's
effectively a transient error at this point, and as such no errors
should be visible from userspace.

If the page is dirty, then it needs to be treated just like any
other failed page write - the page is invalidated and the address
space is marked with AS_EIO, and that is reported to the next
operation that waits on IO on that file (i.e. fsync)

If you have a second application that reads the files that depends
on a guarantee of good data, then the first step in that process is
that application that writes it needs to use fsync to check the data
was written correctly. That ensures 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-24 Thread Naoya Horiguchi
Hello,

On Thu, Aug 23, 2012 at 04:31:43PM -0400, Naoya Horiguchi wrote:
> On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote:
> > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
...
> > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
> > > index ac8d904..8742397 100644
> > > --- v3.6-rc1.orig/fs/inode.c
> > > +++ v3.6-rc1/fs/inode.c
> > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
> > > nr_to_scan)
> > >   }
> > >  
> > >   /*
> > > +  * Keep inode caches on memory for user processes to certainly
> > > +  * be aware of memory errors.
> > > +  */
> > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) {
> > > + spin_unlock(>i_lock);
> > > + continue;
> > > + }
> > 
> > That chunk prevents reclaiming all the cached pages. However the intention
> > is only to keep the struct inode together with the hwpoison bit?
> 
> Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from
> shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file
> remain for long under high memory pressure.

I might lose your point. Are you suggesting this chunk should come after
if (inode_has_buffers(inode) || inode->i_data.nrpages) { ... } block,
aren't you?  I think that's right, so I'll try and test it this weekend.

> > > + /*
> > >* Referenced or dirty inodes are still in use. Give them
> > >* another pass through the LRU as we canot reclaim them now.
> > >*/
> > > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
> > >   inode->i_state &= ~I_WILL_FREE;
> > >   }
> > >  
> > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop))
> > > + mapping_clear_hwpoison(inode->i_mapping);
> > 
> > Is that clear necessary? Because the bit will be gone with the inode
> > struct: it's going to be de-allocated anyway.
> 
> With the chunk in prune_icache_sb() we keep the inode struct with
> AS_HWPOISON set on memory, so in order to remove it, we need explicitly
> clear the bit.
> Without this clear, the inode remains until system reboot.

And again, you are right here. Without this clear, this inode will be
cleared in destroy_inode().

Thanks,
Naoya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-24 Thread Naoya Horiguchi
On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
> On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > > > where we have possibilities of data lost. This is because in this fix
> > > > AS_HWPOISON is cleared when the inode cache is dropped.
> 
> > > > --- v3.6-rc1.orig/mm/truncate.c
> > > > +++ v3.6-rc1/mm/truncate.c
> > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
> > > > newsize)
> > > >  
> > > > oldsize = inode->i_size;
> > > > i_size_write(inode, newsize);
> > > > +   if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > > +   mapping_clear_hwpoison(inode->i_mapping);
> > > 
> > > So only a truncate to zero size will clear the poison flag?
> > 
> > Yes, this is because we only know if the file is affected by hwpoison,
> > but not where the hwpoisoned page is in the file. We could remember it,
> > but I did not do it for simplicity.
> 
> Surely the page has flags on it to say it is poisoned? That is,
> after truncating the page cache, if the address space is poisoned,
> then you can do a pass across the mapping tree checking if there are
> any poisoned pages left? Or perhaps adding a new mapping tree tag so
> that the poisoned status is automatically determined by the presence
> of the poisoned page in the mapping tree?

The answer for the first question is yes. And for the second question,
I don't think it's easy because the mapping tree has no reference to
the error page (I explain more about this below, please see also it,)
and it can cost too much to search poisoned pages over page cache in
each request.
And for the third question, I think we could do this, but to do it
we need an additional space (8 bytes) in struct radix_tree_node.
Considering that this new tag is not used so frequently, so I'm not
sure that it's worth the cost.

> > > What happens if it is the last page in the mapping that is poisoned,
> > > and we truncate that away? Shouldn't that clear the poisoned bit?
> > 
> > When we handle the hwpoisoned inode, the error page should already
> > be removed from pagecache, so the remaining pages are not related
> > to the error and we need not care about them when we consider bit
> > clearing.
> 
> Sorry, I don't follow. What removed the page from the page cache?
> The truncate_page_cache() call that follows the above code hunk is
> what does that, so I don't see how it can already be removed from
> the page cache

Memory error handler (memory_failure() in mm/memory-failure.c) has
removed the error page from the page cache.
And truncate_page_cache() that follows this hunk removes all pages
belonging to the page cache of the poisoned file (where the error
page itself is not included in them.)

Let me explain more to clarify my whole scenario. If a memory error
hits on a dirty pagecache, kernel works like below:

  1. handles a MCE interrupt (logging MCE events,)
  2. calls memory error handler (doing 3 to 6,)
  3. sets PageHWPoison flag on the error page,
  4. unmaps all mappings to processes' virtual addresses,
  5. sets AS_HWPOISON on mappings to which the error page belongs
  6. invalidates the error page (unlinks it from LRU list and removes
 it from pagecache,)
  (memory error handler finished)
  7. later accesses to the file returns -EIO,
  8. AS_HWPOISON is cleared when the file is removed or completely
 truncated.

This patchset tries to fix the problem in 7 and add a new behavior 8,
where I have an assumption that 1-6 has already worked out.

You may think it strange that the condition of clearing AS_HWPOISON
is checked with file granularity. This is because currently userspace
applications know the memory errors only with file granularity for
simplicity, when they access via read(), write() and/or fsync().
(Only for access via mmap(), error reporting is with page granularity.)
We can do it with page granularity, and I tried 2 approaches:

  a. "adding a new tag in mapping tree" approach
 as I explained above, this needs an additional space on heavily
 used data structure,
  b. "adding a new data structure specific to dirty pagecache error
 management" approach
 I did this in the 1st version of this patchset, but found that
 it's too complicated as the first step.

But I think both of these are some difficulty to be accepted,
so at first I try to start with simpler one.

> > > What about a hole punch over the poisoned range?
> > 
> > For the same reason, this is also not related to when to clear the bit.
> 
> Sure it is - if you remove the poisoned pages from the mapping when
> the hole is punched, then the mapping is no longer poisoned. Hence
> the bit should be cleared at that time as nothing else will ever
> clear it.

If we 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-24 Thread Naoya Horiguchi
On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote:
 On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
  On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
   On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
HWPOISON: report sticky EIO for poisoned file still has a corner case
where we have possibilities of data lost. This is because in this fix
AS_HWPOISON is cleared when the inode cache is dropped.
 
--- v3.6-rc1.orig/mm/truncate.c
+++ v3.6-rc1/mm/truncate.c
@@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
newsize)
 
oldsize = inode-i_size;
i_size_write(inode, newsize);
+   if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
+   mapping_clear_hwpoison(inode-i_mapping);
   
   So only a truncate to zero size will clear the poison flag?
  
  Yes, this is because we only know if the file is affected by hwpoison,
  but not where the hwpoisoned page is in the file. We could remember it,
  but I did not do it for simplicity.
 
 Surely the page has flags on it to say it is poisoned? That is,
 after truncating the page cache, if the address space is poisoned,
 then you can do a pass across the mapping tree checking if there are
 any poisoned pages left? Or perhaps adding a new mapping tree tag so
 that the poisoned status is automatically determined by the presence
 of the poisoned page in the mapping tree?

The answer for the first question is yes. And for the second question,
I don't think it's easy because the mapping tree has no reference to
the error page (I explain more about this below, please see also it,)
and it can cost too much to search poisoned pages over page cache in
each request.
And for the third question, I think we could do this, but to do it
we need an additional space (8 bytes) in struct radix_tree_node.
Considering that this new tag is not used so frequently, so I'm not
sure that it's worth the cost.

   What happens if it is the last page in the mapping that is poisoned,
   and we truncate that away? Shouldn't that clear the poisoned bit?
  
  When we handle the hwpoisoned inode, the error page should already
  be removed from pagecache, so the remaining pages are not related
  to the error and we need not care about them when we consider bit
  clearing.
 
 Sorry, I don't follow. What removed the page from the page cache?
 The truncate_page_cache() call that follows the above code hunk is
 what does that, so I don't see how it can already be removed from
 the page cache

Memory error handler (memory_failure() in mm/memory-failure.c) has
removed the error page from the page cache.
And truncate_page_cache() that follows this hunk removes all pages
belonging to the page cache of the poisoned file (where the error
page itself is not included in them.)

Let me explain more to clarify my whole scenario. If a memory error
hits on a dirty pagecache, kernel works like below:

  1. handles a MCE interrupt (logging MCE events,)
  2. calls memory error handler (doing 3 to 6,)
  3. sets PageHWPoison flag on the error page,
  4. unmaps all mappings to processes' virtual addresses,
  5. sets AS_HWPOISON on mappings to which the error page belongs
  6. invalidates the error page (unlinks it from LRU list and removes
 it from pagecache,)
  (memory error handler finished)
  7. later accesses to the file returns -EIO,
  8. AS_HWPOISON is cleared when the file is removed or completely
 truncated.

This patchset tries to fix the problem in 7 and add a new behavior 8,
where I have an assumption that 1-6 has already worked out.

You may think it strange that the condition of clearing AS_HWPOISON
is checked with file granularity. This is because currently userspace
applications know the memory errors only with file granularity for
simplicity, when they access via read(), write() and/or fsync().
(Only for access via mmap(), error reporting is with page granularity.)
We can do it with page granularity, and I tried 2 approaches:

  a. adding a new tag in mapping tree approach
 as I explained above, this needs an additional space on heavily
 used data structure,
  b. adding a new data structure specific to dirty pagecache error
 management approach
 I did this in the 1st version of this patchset, but found that
 it's too complicated as the first step.

But I think both of these are some difficulty to be accepted,
so at first I try to start with simpler one.

   What about a hole punch over the poisoned range?
  
  For the same reason, this is also not related to when to clear the bit.
 
 Sure it is - if you remove the poisoned pages from the mapping when
 the hole is punched, then the mapping is no longer poisoned. Hence
 the bit should be cleared at that time as nothing else will ever
 clear it.

If we achieve error reporting with the page granularity, I hope we can
also do this easily.

Do I answer all your questions? If 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-24 Thread Naoya Horiguchi
Hello,

On Thu, Aug 23, 2012 at 04:31:43PM -0400, Naoya Horiguchi wrote:
 On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote:
  On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
...
   diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
   index ac8d904..8742397 100644
   --- v3.6-rc1.orig/fs/inode.c
   +++ v3.6-rc1/fs/inode.c
   @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
   nr_to_scan)
 }

 /*
   +  * Keep inode caches on memory for user processes to certainly
   +  * be aware of memory errors.
   +  */
   + if (unlikely(mapping_hwpoison(inode-i_mapping))) {
   + spin_unlock(inode-i_lock);
   + continue;
   + }
  
  That chunk prevents reclaiming all the cached pages. However the intention
  is only to keep the struct inode together with the hwpoison bit?
 
 Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from
 shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file
 remain for long under high memory pressure.

I might lose your point. Are you suggesting this chunk should come after
if (inode_has_buffers(inode) || inode-i_data.nrpages) { ... } block,
aren't you?  I think that's right, so I'll try and test it this weekend.

   + /*
  * Referenced or dirty inodes are still in use. Give them
  * another pass through the LRU as we canot reclaim them now.
  */
   @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
 inode-i_state = ~I_WILL_FREE;
 }

   + if (unlikely(mapping_hwpoison(inode-i_mapping)  drop))
   + mapping_clear_hwpoison(inode-i_mapping);
  
  Is that clear necessary? Because the bit will be gone with the inode
  struct: it's going to be de-allocated anyway.
 
 With the chunk in prune_icache_sb() we keep the inode struct with
 AS_HWPOISON set on memory, so in order to remove it, we need explicitly
 clear the bit.
 Without this clear, the inode remains until system reboot.

And again, you are right here. Without this clear, this inode will be
cleared in destroy_inode().

Thanks,
Naoya
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Dave Chinner
On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
> On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > > where we have possibilities of data lost. This is because in this fix
> > > AS_HWPOISON is cleared when the inode cache is dropped.

> > > --- v3.6-rc1.orig/mm/truncate.c
> > > +++ v3.6-rc1/mm/truncate.c
> > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
> > > newsize)
> > >  
> > >   oldsize = inode->i_size;
> > >   i_size_write(inode, newsize);
> > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > > + mapping_clear_hwpoison(inode->i_mapping);
> > 
> > So only a truncate to zero size will clear the poison flag?
> 
> Yes, this is because we only know if the file is affected by hwpoison,
> but not where the hwpoisoned page is in the file. We could remember it,
> but I did not do it for simplicity.

Surely the page has flags on it to say it is poisoned? That is,
after truncating the page cache, if the address space is poisoned,
then you can do a pass across the mapping tree checking if there are
any poisoned pages left? Or perhaps adding a new mapping tree tag so
that the poisoned status is automatically determined by the presence
of the poisoned page in the mapping tree?

> > What happens if it is the last page in the mapping that is poisoned,
> > and we truncate that away? Shouldn't that clear the poisoned bit?
> 
> When we handle the hwpoisoned inode, the error page should already
> be removed from pagecache, so the remaining pages are not related
> to the error and we need not care about them when we consider bit
> clearing.

Sorry, I don't follow. What removed the page from the page cache?
The truncate_page_cache() call that follows the above code hunk is
what does that, so I don't see how it can already be removed from
the page cache

> > What about a hole punch over the poisoned range?
> 
> For the same reason, this is also not related to when to clear the bit.

Sure it is - if you remove the poisoned pages from the mapping when
the hole is punched, then the mapping is no longer poisoned. Hence
the bit should be cleared at that time as nothing else will ever
clear it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Naoya Horiguchi
On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
> On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > where we have possibilities of data lost. This is because in this fix
> > AS_HWPOISON is cleared when the inode cache is dropped.
> > 
> > For example, consider an application in which a process periodically
> > (every 10 minutes) writes some logs on a file (and closes it after
> > each writes,) and at the end of each day some batch programs run using
> > the log file. If a memory error hits on dirty pagecache of this log file
> > just after periodic write/close and the inode cache is cleared before the
> > next write, then this application is not aware of the error and the batch
> > programs will work wrongly.
> > 
> > To avoid this, this patch makes us pin the hwpoisoned inode on memory
> > until we remove or completely truncate the hwpoisoned file.
> > 
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >  fs/inode.c  | 12 
> >  include/linux/pagemap.h | 11 +++
> >  mm/memory-failure.c |  2 +-
> >  mm/truncate.c   |  2 ++
> >  4 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
> > index ac8d904..8742397 100644
> > --- v3.6-rc1.orig/fs/inode.c
> > +++ v3.6-rc1/fs/inode.c
> > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
> > nr_to_scan)
> > }
> >  
> > /*
> > +* Keep inode caches on memory for user processes to certainly
> > +* be aware of memory errors.
> > +*/
> > +   if (unlikely(mapping_hwpoison(inode->i_mapping))) {
> > +   spin_unlock(>i_lock);
> > +   continue;
> > +   }
> > +
> > +   /*
> >  * Referenced or dirty inodes are still in use. Give them
> >  * another pass through the LRU as we canot reclaim them now.
> >  */
> 
> I don't think you tested this at all. Have a look at what the loop
> does more closely - inodes with poisoned mappings will get stuck
> and reclaim doesn't make progress past them.

Sorry, I overlooked something important in my testing. I'll correct it.
Maybe we need list_move_tail() in this block.

> I think you also need to document this inode lifecycle change

OK, I'll do it.

> > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
> > index 75801ac..82a994f 100644
> > --- v3.6-rc1.orig/mm/truncate.c
> > +++ v3.6-rc1/mm/truncate.c
> > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
> > newsize)
> >  
> > oldsize = inode->i_size;
> > i_size_write(inode, newsize);
> > +   if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> > +   mapping_clear_hwpoison(inode->i_mapping);
> 
> So only a truncate to zero size will clear the poison flag?

Yes, this is because we only know if the file is affected by hwpoison,
but not where the hwpoisoned page is in the file. We could remember it,
but I did not do it for simplicity.

> What happens if it is the last page in the mapping that is poisoned,
> and we truncate that away? Shouldn't that clear the poisoned bit?

When we handle the hwpoisoned inode, the error page should already
be removed from pagecache, so the remaining pages are not related
to the error and we need not care about them when we consider bit
clearing.

> What about a hole punch over the poisoned range?

For the same reason, this is also not related to when to clear the bit.

Thanks,
Naoya
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Dave Chinner
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> where we have possibilities of data lost. This is because in this fix
> AS_HWPOISON is cleared when the inode cache is dropped.
> 
> For example, consider an application in which a process periodically
> (every 10 minutes) writes some logs on a file (and closes it after
> each writes,) and at the end of each day some batch programs run using
> the log file. If a memory error hits on dirty pagecache of this log file
> just after periodic write/close and the inode cache is cleared before the
> next write, then this application is not aware of the error and the batch
> programs will work wrongly.
> 
> To avoid this, this patch makes us pin the hwpoisoned inode on memory
> until we remove or completely truncate the hwpoisoned file.
> 
> Signed-off-by: Naoya Horiguchi 
> ---
>  fs/inode.c  | 12 
>  include/linux/pagemap.h | 11 +++
>  mm/memory-failure.c |  2 +-
>  mm/truncate.c   |  2 ++
>  4 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
> index ac8d904..8742397 100644
> --- v3.6-rc1.orig/fs/inode.c
> +++ v3.6-rc1/fs/inode.c
> @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
> nr_to_scan)
>   }
>  
>   /*
> +  * Keep inode caches on memory for user processes to certainly
> +  * be aware of memory errors.
> +  */
> + if (unlikely(mapping_hwpoison(inode->i_mapping))) {
> + spin_unlock(>i_lock);
> + continue;
> + }
> +
> + /*
>* Referenced or dirty inodes are still in use. Give them
>* another pass through the LRU as we canot reclaim them now.
>*/

I don't think you tested this at all. Have a look at what the loop
does more closely - inodes with poisoned mappings will get stuck
and reclaim doesn't make progress past them.

I think you also need to document this inode lifecycle change

> diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
> index 75801ac..82a994f 100644
> --- v3.6-rc1.orig/mm/truncate.c
> +++ v3.6-rc1/mm/truncate.c
> @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
>  
>   oldsize = inode->i_size;
>   i_size_write(inode, newsize);
> + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> + mapping_clear_hwpoison(inode->i_mapping);

So only a truncate to zero size will clear the poison flag?

What happens if it is the last page in the mapping that is poisoned,
and we truncate that away? Shouldn't that clear the poisoned bit?
What about a hole punch over the poisoned range?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Naoya Horiguchi
On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote:
> On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> > "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> > where we have possibilities of data lost. This is because in this fix
> > AS_HWPOISON is cleared when the inode cache is dropped.
> > 
> > For example, consider an application in which a process periodically
> > (every 10 minutes) writes some logs on a file (and closes it after
> > each writes,) and at the end of each day some batch programs run using
> > the log file. If a memory error hits on dirty pagecache of this log file
> > just after periodic write/close and the inode cache is cleared before the
> > next write, then this application is not aware of the error and the batch
> > programs will work wrongly.
> > 
> > To avoid this, this patch makes us pin the hwpoisoned inode on memory
> > until we remove or completely truncate the hwpoisoned file.
> 
> Good point!
>  
> > Signed-off-by: Naoya Horiguchi 
> > ---
> >  fs/inode.c  | 12 
> >  include/linux/pagemap.h | 11 +++
> >  mm/memory-failure.c |  2 +-
> >  mm/truncate.c   |  2 ++
> >  4 files changed, 26 insertions(+), 1 deletion(-)
> > 
> > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
> > index ac8d904..8742397 100644
> > --- v3.6-rc1.orig/fs/inode.c
> > +++ v3.6-rc1/fs/inode.c
> > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
> > nr_to_scan)
> > }
> >  
> > /*
> > +* Keep inode caches on memory for user processes to certainly
> > +* be aware of memory errors.
> > +*/
> > +   if (unlikely(mapping_hwpoison(inode->i_mapping))) {
> > +   spin_unlock(>i_lock);
> > +   continue;
> > +   }
> 
> That chunk prevents reclaiming all the cached pages. However the intention
> is only to keep the struct inode together with the hwpoison bit?

Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from
shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file
remain for long under high memory pressure.

> > +   /*
> >  * Referenced or dirty inodes are still in use. Give them
> >  * another pass through the LRU as we canot reclaim them now.
> >  */
> > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
> > inode->i_state &= ~I_WILL_FREE;
> > }
> >  
> > +   if (unlikely(mapping_hwpoison(inode->i_mapping) && drop))
> > +   mapping_clear_hwpoison(inode->i_mapping);
> 
> Is that clear necessary? Because the bit will be gone with the inode
> struct: it's going to be de-allocated anyway.

With the chunk in prune_icache_sb() we keep the inode struct with
AS_HWPOISON set on memory, so in order to remove it, we need explicitly
clear the bit.
Without this clear, the inode remains until system reboot.

> > inode->i_state |= I_FREEING;
> > if (!list_empty(>i_lru))
> > inode_lru_list_del(inode);
> > diff --git v3.6-rc1.orig/include/linux/pagemap.h 
> > v3.6-rc1/include/linux/pagemap.h
> > index 4d8d821..9fce4e4 100644
> > --- v3.6-rc1.orig/include/linux/pagemap.h
> > +++ v3.6-rc1/include/linux/pagemap.h
> > @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
> > *mapping)
> >  {
> > return test_bit(AS_HWPOISON, >flags);
> >  }
> > +static inline void mapping_set_hwpoison(struct address_space *mapping)
> > +{
> > +   set_bit(AS_HWPOISON, >flags);
> > +}
> > +static inline void mapping_clear_hwpoison(struct address_space *mapping)
> > +{
> > +   clear_bit(AS_HWPOISON, >flags);
> > +}
> >  #else
> >  static inline int mapping_hwpoison(struct address_space *mapping)
> >  {
> > return 0;
> >  }
> > +static inline void mapping_clear_hwpoison(struct address_space *mapping)
> > +{
> > +}
> >  #endif
> >  
> >  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
> > index a1e7e00..ca064c6 100644
> > --- v3.6-rc1.orig/mm/memory-failure.c
> > +++ v3.6-rc1/mm/memory-failure.c
> > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned 
> > long pfn)
> >  * the first EIO, but we're not worse than other parts
> >  * of the kernel.
> >  */
> > -   set_bit(AS_HWPOISON, >flags);
> > +   mapping_set_hwpoison(mapping);
> > }
> >  
> > return me_pagecache_clean(p, pfn);
> > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
> > index 75801ac..82a994f 100644
> > --- v3.6-rc1.orig/mm/truncate.c
> > +++ v3.6-rc1/mm/truncate.c
> > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
> > newsize)
> >  
> > oldsize = inode->i_size;
> > i_size_write(inode, newsize);
> > +   if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
> 
> It 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Fengguang Wu
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
> "HWPOISON: report sticky EIO for poisoned file" still has a corner case
> where we have possibilities of data lost. This is because in this fix
> AS_HWPOISON is cleared when the inode cache is dropped.
> 
> For example, consider an application in which a process periodically
> (every 10 minutes) writes some logs on a file (and closes it after
> each writes,) and at the end of each day some batch programs run using
> the log file. If a memory error hits on dirty pagecache of this log file
> just after periodic write/close and the inode cache is cleared before the
> next write, then this application is not aware of the error and the batch
> programs will work wrongly.
> 
> To avoid this, this patch makes us pin the hwpoisoned inode on memory
> until we remove or completely truncate the hwpoisoned file.

Good point!
 
> Signed-off-by: Naoya Horiguchi 
> ---
>  fs/inode.c  | 12 
>  include/linux/pagemap.h | 11 +++
>  mm/memory-failure.c |  2 +-
>  mm/truncate.c   |  2 ++
>  4 files changed, 26 insertions(+), 1 deletion(-)
> 
> diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
> index ac8d904..8742397 100644
> --- v3.6-rc1.orig/fs/inode.c
> +++ v3.6-rc1/fs/inode.c
> @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
> nr_to_scan)
>   }
>  
>   /*
> +  * Keep inode caches on memory for user processes to certainly
> +  * be aware of memory errors.
> +  */
> + if (unlikely(mapping_hwpoison(inode->i_mapping))) {
> + spin_unlock(>i_lock);
> + continue;
> + }

That chunk prevents reclaiming all the cached pages. However the intention
is only to keep the struct inode together with the hwpoison bit?

> + /*
>* Referenced or dirty inodes are still in use. Give them
>* another pass through the LRU as we canot reclaim them now.
>*/
> @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
>   inode->i_state &= ~I_WILL_FREE;
>   }
>  
> + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop))
> + mapping_clear_hwpoison(inode->i_mapping);

Is that clear necessary? Because the bit will be gone with the inode
struct: it's going to be de-allocated anyway.

>   inode->i_state |= I_FREEING;
>   if (!list_empty(>i_lru))
>   inode_lru_list_del(inode);
> diff --git v3.6-rc1.orig/include/linux/pagemap.h 
> v3.6-rc1/include/linux/pagemap.h
> index 4d8d821..9fce4e4 100644
> --- v3.6-rc1.orig/include/linux/pagemap.h
> +++ v3.6-rc1/include/linux/pagemap.h
> @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
> *mapping)
>  {
>   return test_bit(AS_HWPOISON, >flags);
>  }
> +static inline void mapping_set_hwpoison(struct address_space *mapping)
> +{
> + set_bit(AS_HWPOISON, >flags);
> +}
> +static inline void mapping_clear_hwpoison(struct address_space *mapping)
> +{
> + clear_bit(AS_HWPOISON, >flags);
> +}
>  #else
>  static inline int mapping_hwpoison(struct address_space *mapping)
>  {
>   return 0;
>  }
> +static inline void mapping_clear_hwpoison(struct address_space *mapping)
> +{
> +}
>  #endif
>  
>  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
> diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
> index a1e7e00..ca064c6 100644
> --- v3.6-rc1.orig/mm/memory-failure.c
> +++ v3.6-rc1/mm/memory-failure.c
> @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned 
> long pfn)
>* the first EIO, but we're not worse than other parts
>* of the kernel.
>*/
> - set_bit(AS_HWPOISON, >flags);
> + mapping_set_hwpoison(mapping);
>   }
>  
>   return me_pagecache_clean(p, pfn);
> diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
> index 75801ac..82a994f 100644
> --- v3.6-rc1.orig/mm/truncate.c
> +++ v3.6-rc1/mm/truncate.c
> @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
>  
>   oldsize = inode->i_size;
>   i_size_write(inode, newsize);
> + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))

It might be a bit better to test !newsize first.

> + mapping_clear_hwpoison(inode->i_mapping);
>  
>   truncate_pagecache(inode, oldsize, newsize);
>  }
> -- 
> 1.7.11.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Fengguang Wu
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
 HWPOISON: report sticky EIO for poisoned file still has a corner case
 where we have possibilities of data lost. This is because in this fix
 AS_HWPOISON is cleared when the inode cache is dropped.
 
 For example, consider an application in which a process periodically
 (every 10 minutes) writes some logs on a file (and closes it after
 each writes,) and at the end of each day some batch programs run using
 the log file. If a memory error hits on dirty pagecache of this log file
 just after periodic write/close and the inode cache is cleared before the
 next write, then this application is not aware of the error and the batch
 programs will work wrongly.
 
 To avoid this, this patch makes us pin the hwpoisoned inode on memory
 until we remove or completely truncate the hwpoisoned file.

Good point!
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 ---
  fs/inode.c  | 12 
  include/linux/pagemap.h | 11 +++
  mm/memory-failure.c |  2 +-
  mm/truncate.c   |  2 ++
  4 files changed, 26 insertions(+), 1 deletion(-)
 
 diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
 index ac8d904..8742397 100644
 --- v3.6-rc1.orig/fs/inode.c
 +++ v3.6-rc1/fs/inode.c
 @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
 nr_to_scan)
   }
  
   /*
 +  * Keep inode caches on memory for user processes to certainly
 +  * be aware of memory errors.
 +  */
 + if (unlikely(mapping_hwpoison(inode-i_mapping))) {
 + spin_unlock(inode-i_lock);
 + continue;
 + }

That chunk prevents reclaiming all the cached pages. However the intention
is only to keep the struct inode together with the hwpoison bit?

 + /*
* Referenced or dirty inodes are still in use. Give them
* another pass through the LRU as we canot reclaim them now.
*/
 @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
   inode-i_state = ~I_WILL_FREE;
   }
  
 + if (unlikely(mapping_hwpoison(inode-i_mapping)  drop))
 + mapping_clear_hwpoison(inode-i_mapping);

Is that clear necessary? Because the bit will be gone with the inode
struct: it's going to be de-allocated anyway.

   inode-i_state |= I_FREEING;
   if (!list_empty(inode-i_lru))
   inode_lru_list_del(inode);
 diff --git v3.6-rc1.orig/include/linux/pagemap.h 
 v3.6-rc1/include/linux/pagemap.h
 index 4d8d821..9fce4e4 100644
 --- v3.6-rc1.orig/include/linux/pagemap.h
 +++ v3.6-rc1/include/linux/pagemap.h
 @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
 *mapping)
  {
   return test_bit(AS_HWPOISON, mapping-flags);
  }
 +static inline void mapping_set_hwpoison(struct address_space *mapping)
 +{
 + set_bit(AS_HWPOISON, mapping-flags);
 +}
 +static inline void mapping_clear_hwpoison(struct address_space *mapping)
 +{
 + clear_bit(AS_HWPOISON, mapping-flags);
 +}
  #else
  static inline int mapping_hwpoison(struct address_space *mapping)
  {
   return 0;
  }
 +static inline void mapping_clear_hwpoison(struct address_space *mapping)
 +{
 +}
  #endif
  
  static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
 diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
 index a1e7e00..ca064c6 100644
 --- v3.6-rc1.orig/mm/memory-failure.c
 +++ v3.6-rc1/mm/memory-failure.c
 @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned 
 long pfn)
* the first EIO, but we're not worse than other parts
* of the kernel.
*/
 - set_bit(AS_HWPOISON, mapping-flags);
 + mapping_set_hwpoison(mapping);
   }
  
   return me_pagecache_clean(p, pfn);
 diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
 index 75801ac..82a994f 100644
 --- v3.6-rc1.orig/mm/truncate.c
 +++ v3.6-rc1/mm/truncate.c
 @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
  
   oldsize = inode-i_size;
   i_size_write(inode, newsize);
 + if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))

It might be a bit better to test !newsize first.

 + mapping_clear_hwpoison(inode-i_mapping);
  
   truncate_pagecache(inode, oldsize, newsize);
  }
 -- 
 1.7.11.4
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Naoya Horiguchi
On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote:
 On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
  HWPOISON: report sticky EIO for poisoned file still has a corner case
  where we have possibilities of data lost. This is because in this fix
  AS_HWPOISON is cleared when the inode cache is dropped.
  
  For example, consider an application in which a process periodically
  (every 10 minutes) writes some logs on a file (and closes it after
  each writes,) and at the end of each day some batch programs run using
  the log file. If a memory error hits on dirty pagecache of this log file
  just after periodic write/close and the inode cache is cleared before the
  next write, then this application is not aware of the error and the batch
  programs will work wrongly.
  
  To avoid this, this patch makes us pin the hwpoisoned inode on memory
  until we remove or completely truncate the hwpoisoned file.
 
 Good point!
  
  Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
  ---
   fs/inode.c  | 12 
   include/linux/pagemap.h | 11 +++
   mm/memory-failure.c |  2 +-
   mm/truncate.c   |  2 ++
   4 files changed, 26 insertions(+), 1 deletion(-)
  
  diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
  index ac8d904..8742397 100644
  --- v3.6-rc1.orig/fs/inode.c
  +++ v3.6-rc1/fs/inode.c
  @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
  nr_to_scan)
  }
   
  /*
  +* Keep inode caches on memory for user processes to certainly
  +* be aware of memory errors.
  +*/
  +   if (unlikely(mapping_hwpoison(inode-i_mapping))) {
  +   spin_unlock(inode-i_lock);
  +   continue;
  +   }
 
 That chunk prevents reclaiming all the cached pages. However the intention
 is only to keep the struct inode together with the hwpoison bit?

Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from
shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file
remain for long under high memory pressure.

  +   /*
   * Referenced or dirty inodes are still in use. Give them
   * another pass through the LRU as we canot reclaim them now.
   */
  @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
  inode-i_state = ~I_WILL_FREE;
  }
   
  +   if (unlikely(mapping_hwpoison(inode-i_mapping)  drop))
  +   mapping_clear_hwpoison(inode-i_mapping);
 
 Is that clear necessary? Because the bit will be gone with the inode
 struct: it's going to be de-allocated anyway.

With the chunk in prune_icache_sb() we keep the inode struct with
AS_HWPOISON set on memory, so in order to remove it, we need explicitly
clear the bit.
Without this clear, the inode remains until system reboot.

  inode-i_state |= I_FREEING;
  if (!list_empty(inode-i_lru))
  inode_lru_list_del(inode);
  diff --git v3.6-rc1.orig/include/linux/pagemap.h 
  v3.6-rc1/include/linux/pagemap.h
  index 4d8d821..9fce4e4 100644
  --- v3.6-rc1.orig/include/linux/pagemap.h
  +++ v3.6-rc1/include/linux/pagemap.h
  @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
  *mapping)
   {
  return test_bit(AS_HWPOISON, mapping-flags);
   }
  +static inline void mapping_set_hwpoison(struct address_space *mapping)
  +{
  +   set_bit(AS_HWPOISON, mapping-flags);
  +}
  +static inline void mapping_clear_hwpoison(struct address_space *mapping)
  +{
  +   clear_bit(AS_HWPOISON, mapping-flags);
  +}
   #else
   static inline int mapping_hwpoison(struct address_space *mapping)
   {
  return 0;
   }
  +static inline void mapping_clear_hwpoison(struct address_space *mapping)
  +{
  +}
   #endif
   
   static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
  diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
  index a1e7e00..ca064c6 100644
  --- v3.6-rc1.orig/mm/memory-failure.c
  +++ v3.6-rc1/mm/memory-failure.c
  @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned 
  long pfn)
   * the first EIO, but we're not worse than other parts
   * of the kernel.
   */
  -   set_bit(AS_HWPOISON, mapping-flags);
  +   mapping_set_hwpoison(mapping);
  }
   
  return me_pagecache_clean(p, pfn);
  diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
  index 75801ac..82a994f 100644
  --- v3.6-rc1.orig/mm/truncate.c
  +++ v3.6-rc1/mm/truncate.c
  @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
  newsize)
   
  oldsize = inode-i_size;
  i_size_write(inode, newsize);
  +   if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
 
 It might be a bit better to test !newsize first.

Ah, OK.

Thanks,
Naoya
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to 

Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Dave Chinner
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
 HWPOISON: report sticky EIO for poisoned file still has a corner case
 where we have possibilities of data lost. This is because in this fix
 AS_HWPOISON is cleared when the inode cache is dropped.
 
 For example, consider an application in which a process periodically
 (every 10 minutes) writes some logs on a file (and closes it after
 each writes,) and at the end of each day some batch programs run using
 the log file. If a memory error hits on dirty pagecache of this log file
 just after periodic write/close and the inode cache is cleared before the
 next write, then this application is not aware of the error and the batch
 programs will work wrongly.
 
 To avoid this, this patch makes us pin the hwpoisoned inode on memory
 until we remove or completely truncate the hwpoisoned file.
 
 Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
 ---
  fs/inode.c  | 12 
  include/linux/pagemap.h | 11 +++
  mm/memory-failure.c |  2 +-
  mm/truncate.c   |  2 ++
  4 files changed, 26 insertions(+), 1 deletion(-)
 
 diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
 index ac8d904..8742397 100644
 --- v3.6-rc1.orig/fs/inode.c
 +++ v3.6-rc1/fs/inode.c
 @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
 nr_to_scan)
   }
  
   /*
 +  * Keep inode caches on memory for user processes to certainly
 +  * be aware of memory errors.
 +  */
 + if (unlikely(mapping_hwpoison(inode-i_mapping))) {
 + spin_unlock(inode-i_lock);
 + continue;
 + }
 +
 + /*
* Referenced or dirty inodes are still in use. Give them
* another pass through the LRU as we canot reclaim them now.
*/

I don't think you tested this at all. Have a look at what the loop
does more closely - inodes with poisoned mappings will get stuck
and reclaim doesn't make progress past them.

I think you also need to document this inode lifecycle change

 diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
 index 75801ac..82a994f 100644
 --- v3.6-rc1.orig/mm/truncate.c
 +++ v3.6-rc1/mm/truncate.c
 @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
  
   oldsize = inode-i_size;
   i_size_write(inode, newsize);
 + if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
 + mapping_clear_hwpoison(inode-i_mapping);

So only a truncate to zero size will clear the poison flag?

What happens if it is the last page in the mapping that is poisoned,
and we truncate that away? Shouldn't that clear the poisoned bit?
What about a hole punch over the poisoned range?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Naoya Horiguchi
On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
 On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
  HWPOISON: report sticky EIO for poisoned file still has a corner case
  where we have possibilities of data lost. This is because in this fix
  AS_HWPOISON is cleared when the inode cache is dropped.
  
  For example, consider an application in which a process periodically
  (every 10 minutes) writes some logs on a file (and closes it after
  each writes,) and at the end of each day some batch programs run using
  the log file. If a memory error hits on dirty pagecache of this log file
  just after periodic write/close and the inode cache is cleared before the
  next write, then this application is not aware of the error and the batch
  programs will work wrongly.
  
  To avoid this, this patch makes us pin the hwpoisoned inode on memory
  until we remove or completely truncate the hwpoisoned file.
  
  Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
  ---
   fs/inode.c  | 12 
   include/linux/pagemap.h | 11 +++
   mm/memory-failure.c |  2 +-
   mm/truncate.c   |  2 ++
   4 files changed, 26 insertions(+), 1 deletion(-)
  
  diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
  index ac8d904..8742397 100644
  --- v3.6-rc1.orig/fs/inode.c
  +++ v3.6-rc1/fs/inode.c
  @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
  nr_to_scan)
  }
   
  /*
  +* Keep inode caches on memory for user processes to certainly
  +* be aware of memory errors.
  +*/
  +   if (unlikely(mapping_hwpoison(inode-i_mapping))) {
  +   spin_unlock(inode-i_lock);
  +   continue;
  +   }
  +
  +   /*
   * Referenced or dirty inodes are still in use. Give them
   * another pass through the LRU as we canot reclaim them now.
   */
 
 I don't think you tested this at all. Have a look at what the loop
 does more closely - inodes with poisoned mappings will get stuck
 and reclaim doesn't make progress past them.

Sorry, I overlooked something important in my testing. I'll correct it.
Maybe we need list_move_tail() in this block.

 I think you also need to document this inode lifecycle change

OK, I'll do it.

  diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
  index 75801ac..82a994f 100644
  --- v3.6-rc1.orig/mm/truncate.c
  +++ v3.6-rc1/mm/truncate.c
  @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
  newsize)
   
  oldsize = inode-i_size;
  i_size_write(inode, newsize);
  +   if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
  +   mapping_clear_hwpoison(inode-i_mapping);
 
 So only a truncate to zero size will clear the poison flag?

Yes, this is because we only know if the file is affected by hwpoison,
but not where the hwpoisoned page is in the file. We could remember it,
but I did not do it for simplicity.

 What happens if it is the last page in the mapping that is poisoned,
 and we truncate that away? Shouldn't that clear the poisoned bit?

When we handle the hwpoisoned inode, the error page should already
be removed from pagecache, so the remaining pages are not related
to the error and we need not care about them when we consider bit
clearing.

 What about a hole punch over the poisoned range?

For the same reason, this is also not related to when to clear the bit.

Thanks,
Naoya
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-23 Thread Dave Chinner
On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote:
 On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote:
  On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote:
   HWPOISON: report sticky EIO for poisoned file still has a corner case
   where we have possibilities of data lost. This is because in this fix
   AS_HWPOISON is cleared when the inode cache is dropped.

   --- v3.6-rc1.orig/mm/truncate.c
   +++ v3.6-rc1/mm/truncate.c
   @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t 
   newsize)

 oldsize = inode-i_size;
 i_size_write(inode, newsize);
   + if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
   + mapping_clear_hwpoison(inode-i_mapping);
  
  So only a truncate to zero size will clear the poison flag?
 
 Yes, this is because we only know if the file is affected by hwpoison,
 but not where the hwpoisoned page is in the file. We could remember it,
 but I did not do it for simplicity.

Surely the page has flags on it to say it is poisoned? That is,
after truncating the page cache, if the address space is poisoned,
then you can do a pass across the mapping tree checking if there are
any poisoned pages left? Or perhaps adding a new mapping tree tag so
that the poisoned status is automatically determined by the presence
of the poisoned page in the mapping tree?

  What happens if it is the last page in the mapping that is poisoned,
  and we truncate that away? Shouldn't that clear the poisoned bit?
 
 When we handle the hwpoisoned inode, the error page should already
 be removed from pagecache, so the remaining pages are not related
 to the error and we need not care about them when we consider bit
 clearing.

Sorry, I don't follow. What removed the page from the page cache?
The truncate_page_cache() call that follows the above code hunk is
what does that, so I don't see how it can already be removed from
the page cache

  What about a hole punch over the poisoned range?
 
 For the same reason, this is also not related to when to clear the bit.

Sure it is - if you remove the poisoned pages from the mapping when
the hole is punched, then the mapping is no longer poisoned. Hence
the bit should be cleared at that time as nothing else will ever
clear it.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-22 Thread Naoya Horiguchi
"HWPOISON: report sticky EIO for poisoned file" still has a corner case
where we have possibilities of data lost. This is because in this fix
AS_HWPOISON is cleared when the inode cache is dropped.

For example, consider an application in which a process periodically
(every 10 minutes) writes some logs on a file (and closes it after
each writes,) and at the end of each day some batch programs run using
the log file. If a memory error hits on dirty pagecache of this log file
just after periodic write/close and the inode cache is cleared before the
next write, then this application is not aware of the error and the batch
programs will work wrongly.

To avoid this, this patch makes us pin the hwpoisoned inode on memory
until we remove or completely truncate the hwpoisoned file.

Signed-off-by: Naoya Horiguchi 
---
 fs/inode.c  | 12 
 include/linux/pagemap.h | 11 +++
 mm/memory-failure.c |  2 +-
 mm/truncate.c   |  2 ++
 4 files changed, 26 insertions(+), 1 deletion(-)

diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
index ac8d904..8742397 100644
--- v3.6-rc1.orig/fs/inode.c
+++ v3.6-rc1/fs/inode.c
@@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
nr_to_scan)
}
 
/*
+* Keep inode caches on memory for user processes to certainly
+* be aware of memory errors.
+*/
+   if (unlikely(mapping_hwpoison(inode->i_mapping))) {
+   spin_unlock(>i_lock);
+   continue;
+   }
+
+   /*
 * Referenced or dirty inodes are still in use. Give them
 * another pass through the LRU as we canot reclaim them now.
 */
@@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
inode->i_state &= ~I_WILL_FREE;
}
 
+   if (unlikely(mapping_hwpoison(inode->i_mapping) && drop))
+   mapping_clear_hwpoison(inode->i_mapping);
+
inode->i_state |= I_FREEING;
if (!list_empty(>i_lru))
inode_lru_list_del(inode);
diff --git v3.6-rc1.orig/include/linux/pagemap.h 
v3.6-rc1/include/linux/pagemap.h
index 4d8d821..9fce4e4 100644
--- v3.6-rc1.orig/include/linux/pagemap.h
+++ v3.6-rc1/include/linux/pagemap.h
@@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
*mapping)
 {
return test_bit(AS_HWPOISON, >flags);
 }
+static inline void mapping_set_hwpoison(struct address_space *mapping)
+{
+   set_bit(AS_HWPOISON, >flags);
+}
+static inline void mapping_clear_hwpoison(struct address_space *mapping)
+{
+   clear_bit(AS_HWPOISON, >flags);
+}
 #else
 static inline int mapping_hwpoison(struct address_space *mapping)
 {
return 0;
 }
+static inline void mapping_clear_hwpoison(struct address_space *mapping)
+{
+}
 #endif
 
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
index a1e7e00..ca064c6 100644
--- v3.6-rc1.orig/mm/memory-failure.c
+++ v3.6-rc1/mm/memory-failure.c
@@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long 
pfn)
 * the first EIO, but we're not worse than other parts
 * of the kernel.
 */
-   set_bit(AS_HWPOISON, >flags);
+   mapping_set_hwpoison(mapping);
}
 
return me_pagecache_clean(p, pfn);
diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
index 75801ac..82a994f 100644
--- v3.6-rc1.orig/mm/truncate.c
+++ v3.6-rc1/mm/truncate.c
@@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
 
oldsize = inode->i_size;
i_size_write(inode, newsize);
+   if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize))
+   mapping_clear_hwpoison(inode->i_mapping);
 
truncate_pagecache(inode, oldsize, newsize);
 }
-- 
1.7.11.4

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky

2012-08-22 Thread Naoya Horiguchi
HWPOISON: report sticky EIO for poisoned file still has a corner case
where we have possibilities of data lost. This is because in this fix
AS_HWPOISON is cleared when the inode cache is dropped.

For example, consider an application in which a process periodically
(every 10 minutes) writes some logs on a file (and closes it after
each writes,) and at the end of each day some batch programs run using
the log file. If a memory error hits on dirty pagecache of this log file
just after periodic write/close and the inode cache is cleared before the
next write, then this application is not aware of the error and the batch
programs will work wrongly.

To avoid this, this patch makes us pin the hwpoisoned inode on memory
until we remove or completely truncate the hwpoisoned file.

Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com
---
 fs/inode.c  | 12 
 include/linux/pagemap.h | 11 +++
 mm/memory-failure.c |  2 +-
 mm/truncate.c   |  2 ++
 4 files changed, 26 insertions(+), 1 deletion(-)

diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c
index ac8d904..8742397 100644
--- v3.6-rc1.orig/fs/inode.c
+++ v3.6-rc1/fs/inode.c
@@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int 
nr_to_scan)
}
 
/*
+* Keep inode caches on memory for user processes to certainly
+* be aware of memory errors.
+*/
+   if (unlikely(mapping_hwpoison(inode-i_mapping))) {
+   spin_unlock(inode-i_lock);
+   continue;
+   }
+
+   /*
 * Referenced or dirty inodes are still in use. Give them
 * another pass through the LRU as we canot reclaim them now.
 */
@@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode)
inode-i_state = ~I_WILL_FREE;
}
 
+   if (unlikely(mapping_hwpoison(inode-i_mapping)  drop))
+   mapping_clear_hwpoison(inode-i_mapping);
+
inode-i_state |= I_FREEING;
if (!list_empty(inode-i_lru))
inode_lru_list_del(inode);
diff --git v3.6-rc1.orig/include/linux/pagemap.h 
v3.6-rc1/include/linux/pagemap.h
index 4d8d821..9fce4e4 100644
--- v3.6-rc1.orig/include/linux/pagemap.h
+++ v3.6-rc1/include/linux/pagemap.h
@@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space 
*mapping)
 {
return test_bit(AS_HWPOISON, mapping-flags);
 }
+static inline void mapping_set_hwpoison(struct address_space *mapping)
+{
+   set_bit(AS_HWPOISON, mapping-flags);
+}
+static inline void mapping_clear_hwpoison(struct address_space *mapping)
+{
+   clear_bit(AS_HWPOISON, mapping-flags);
+}
 #else
 static inline int mapping_hwpoison(struct address_space *mapping)
 {
return 0;
 }
+static inline void mapping_clear_hwpoison(struct address_space *mapping)
+{
+}
 #endif
 
 static inline gfp_t mapping_gfp_mask(struct address_space * mapping)
diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c
index a1e7e00..ca064c6 100644
--- v3.6-rc1.orig/mm/memory-failure.c
+++ v3.6-rc1/mm/memory-failure.c
@@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long 
pfn)
 * the first EIO, but we're not worse than other parts
 * of the kernel.
 */
-   set_bit(AS_HWPOISON, mapping-flags);
+   mapping_set_hwpoison(mapping);
}
 
return me_pagecache_clean(p, pfn);
diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c
index 75801ac..82a994f 100644
--- v3.6-rc1.orig/mm/truncate.c
+++ v3.6-rc1/mm/truncate.c
@@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize)
 
oldsize = inode-i_size;
i_size_write(inode, newsize);
+   if (unlikely(mapping_hwpoison(inode-i_mapping)  !newsize))
+   mapping_clear_hwpoison(inode-i_mapping);
 
truncate_pagecache(inode, oldsize, newsize);
 }
-- 
1.7.11.4

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/