Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote: > On 08/29/12 11:59, Dave Chinner wrote: > > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: > >> And yes, I understand it's ideal, but many applications choose not to > >> do that for performance reason. > >> So I think it's helpful if we can surely report to such applications. > > I suspect "performance vs. integrity" is not a correct > description of the problem. Right, to be more precise, it's a "eat my data" vs "integrity" problem. And in almost all cases I've seen over the years, "eat my data" is done for performance reasons... > > If performance is chosen over data integrity, we are under no > > obligation to keep the error around indefinitely. Fundamentally, > > ensuring a write completes successfully is the reponsibility of the > > application, not the kernel. There are so many different filesytem > > and storage errors that can be lost right now because data is not > > fsync()d, adding another one to them really doesn't change anything. > > IOWs, a memory error is no different to a disk failing or the system > > crashing when it comes to data integrity. If you care, you use > > fsync(). > > I agree that applications should fsync() or O_SYNC > when it wants to make sure the written data in on disk. > > AFAIU, what Naoya is going to address is the case where > fsync() is not necessarily needed. > > For example, if someone do: > $ patch -p1 < ../a.patch > $ tar cf . > ../a.tar > > and disk failure occurred between "patch" and "tar", > "tar" will either see uptodate data or I/O error. No, it won't. The only place AS_EIO is tested is in filemap_fdatawait_range(), which is only called in the fsync() path. The only way to report async write IO errors is to use fsync() - subsequent reads of the file do *not* see the write error. IOWs, tar will be oblivious of any IO error that preceeded it reading the files it is copying. > OTOH, if the failure was detected on dirty pagecache, the current memory > failure handler invalidates the dirty page and the "tar" command will > re-read old contents from disk without error. After an IO error, the dirty page is no longer uptodate - that gets cleared - so when the page is read the data will be re-read from disk just like if a memory error occurred. So tar will behave the same regardless of whether it is a memory error or an IO error (i.e. reread old data from disk) > (Well, the failures above are permanent failures. Write IO errors can also be transient or permanent. Transient, for example, when a path failure occurs and multipathing then detects this and fails over to a good path. A subsequent write will then succeed. Permanent, for example, when someone unplugs a USB drive. > IOW, the current > memory failure handler turns permanent failure into transient error, > which is often more difficult to handle, I think.) The patch I commented on is turning a transient error (error in a page that is then poisoned and never used again) into a permanent error (error on an address space that is reported on every future operation that tries to insert a page into the page cache). > Naoya's patch will keep the failure information and allows the reader > to get I/O error when it reads from broken pagecache. It only adds a hwposion check in add_to_page_cache_locked(). If the page is already in the cache, then no error will be sent to the reader because it never gets to add_to_page_cache_locked(). So there's no guarantee that the reader is even going to see the error, or that they see the error on the page that actually caused it - access to any missing page in the page cache will trigger it. And as memory reclaim clears pages off the inode, more and more of the range of the inode's data will return an error, even though there is nothing wrong with the data in most of the file. Indeed, what happens if the application calls fsync, gets the error and tries to rewrite the page? i.e. it does everything correctly to handle the write error? With this patch set, the application cannot insert a replacement page into the page cache, so all subsequent writes fail! IOWs, it makes it impossible for applications to recover from a detected and handled memory failure. I have no issue with reporting the problem to userspace - that needs t am I saying that the current IO reporting is wonderful and can't be improved. What I am saying, though, is that I really don't think this patch set has been well thought through from either an IO path or userspace error handling point of view. The problems with this patch set are quite significant: - permanent, unclearable error except by complete removal of all data on the file. (forcing the removal of all good data to recover from a transient error on a small amount of data). - while the error is present, bad data cannot be overwritten (applications cannot recover even if they catch the
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 29, 2012 at 02:32:04PM +0900, Jun'ichi Nomura wrote: On 08/29/12 11:59, Dave Chinner wrote: On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: And yes, I understand it's ideal, but many applications choose not to do that for performance reason. So I think it's helpful if we can surely report to such applications. I suspect performance vs. integrity is not a correct description of the problem. Right, to be more precise, it's a eat my data vs integrity problem. And in almost all cases I've seen over the years, eat my data is done for performance reasons... If performance is chosen over data integrity, we are under no obligation to keep the error around indefinitely. Fundamentally, ensuring a write completes successfully is the reponsibility of the application, not the kernel. There are so many different filesytem and storage errors that can be lost right now because data is not fsync()d, adding another one to them really doesn't change anything. IOWs, a memory error is no different to a disk failing or the system crashing when it comes to data integrity. If you care, you use fsync(). I agree that applications should fsync() or O_SYNC when it wants to make sure the written data in on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone do: $ patch -p1 ../a.patch $ tar cf . ../a.tar and disk failure occurred between patch and tar, tar will either see uptodate data or I/O error. No, it won't. The only place AS_EIO is tested is in filemap_fdatawait_range(), which is only called in the fsync() path. The only way to report async write IO errors is to use fsync() - subsequent reads of the file do *not* see the write error. IOWs, tar will be oblivious of any IO error that preceeded it reading the files it is copying. OTOH, if the failure was detected on dirty pagecache, the current memory failure handler invalidates the dirty page and the tar command will re-read old contents from disk without error. After an IO error, the dirty page is no longer uptodate - that gets cleared - so when the page is read the data will be re-read from disk just like if a memory error occurred. So tar will behave the same regardless of whether it is a memory error or an IO error (i.e. reread old data from disk) (Well, the failures above are permanent failures. Write IO errors can also be transient or permanent. Transient, for example, when a path failure occurs and multipathing then detects this and fails over to a good path. A subsequent write will then succeed. Permanent, for example, when someone unplugs a USB drive. IOW, the current memory failure handler turns permanent failure into transient error, which is often more difficult to handle, I think.) The patch I commented on is turning a transient error (error in a page that is then poisoned and never used again) into a permanent error (error on an address space that is reported on every future operation that tries to insert a page into the page cache). Naoya's patch will keep the failure information and allows the reader to get I/O error when it reads from broken pagecache. It only adds a hwposion check in add_to_page_cache_locked(). If the page is already in the cache, then no error will be sent to the reader because it never gets to add_to_page_cache_locked(). So there's no guarantee that the reader is even going to see the error, or that they see the error on the page that actually caused it - access to any missing page in the page cache will trigger it. And as memory reclaim clears pages off the inode, more and more of the range of the inode's data will return an error, even though there is nothing wrong with the data in most of the file. Indeed, what happens if the application calls fsync, gets the error and tries to rewrite the page? i.e. it does everything correctly to handle the write error? With this patch set, the application cannot insert a replacement page into the page cache, so all subsequent writes fail! IOWs, it makes it impossible for applications to recover from a detected and handled memory failure. I have no issue with reporting the problem to userspace - that needs t am I saying that the current IO reporting is wonderful and can't be improved. What I am saying, though, is that I really don't think this patch set has been well thought through from either an IO path or userspace error handling point of view. The problems with this patch set are quite significant: - permanent, unclearable error except by complete removal of all data on the file. (forcing the removal of all good data to recover from a transient error on a small amount of data). - while the error is present, bad data cannot be overwritten (applications cannot recover even if they catch the error). - while the error is present, no new data can be written
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On 08/29/12 11:59, Dave Chinner wrote: > On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: >> And yes, I understand it's ideal, but many applications choose not to >> do that for performance reason. >> So I think it's helpful if we can surely report to such applications. I suspect "performance vs. integrity" is not a correct description of the problem. > If performance is chosen over data integrity, we are under no > obligation to keep the error around indefinitely. Fundamentally, > ensuring a write completes successfully is the reponsibility of the > application, not the kernel. There are so many different filesytem > and storage errors that can be lost right now because data is not > fsync()d, adding another one to them really doesn't change anything. > IOWs, a memory error is no different to a disk failing or the system > crashing when it comes to data integrity. If you care, you use > fsync(). I agree that applications should fsync() or O_SYNC when it wants to make sure the written data in on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone do: $ patch -p1 < ../a.patch $ tar cf . > ../a.tar and disk failure occurred between "patch" and "tar", "tar" will either see uptodate data or I/O error. OTOH, if the failure was detected on dirty pagecache, the current memory failure handler invalidates the dirty page and the "tar" command will re-read old contents from disk without error. (Well, the failures above are permanent failures. IOW, the current memory failure handler turns permanent failure into transient error, which is often more difficult to handle, I think.) Naoya's patch will keep the failure information and allows the reader to get I/O error when it reads from broken pagecache. -- Jun'ichi Nomura, NEC Corporation -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On 08/29/12 11:59, Dave Chinner wrote: On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: And yes, I understand it's ideal, but many applications choose not to do that for performance reason. So I think it's helpful if we can surely report to such applications. I suspect performance vs. integrity is not a correct description of the problem. If performance is chosen over data integrity, we are under no obligation to keep the error around indefinitely. Fundamentally, ensuring a write completes successfully is the reponsibility of the application, not the kernel. There are so many different filesytem and storage errors that can be lost right now because data is not fsync()d, adding another one to them really doesn't change anything. IOWs, a memory error is no different to a disk failing or the system crashing when it comes to data integrity. If you care, you use fsync(). I agree that applications should fsync() or O_SYNC when it wants to make sure the written data in on disk. AFAIU, what Naoya is going to address is the case where fsync() is not necessarily needed. For example, if someone do: $ patch -p1 ../a.patch $ tar cf . ../a.tar and disk failure occurred between patch and tar, tar will either see uptodate data or I/O error. OTOH, if the failure was detected on dirty pagecache, the current memory failure handler invalidates the dirty page and the tar command will re-read old contents from disk without error. (Well, the failures above are permanent failures. IOW, the current memory failure handler turns permanent failure into transient error, which is often more difficult to handle, I think.) Naoya's patch will keep the failure information and allows the reader to get I/O error when it reads from broken pagecache. -- Jun'ichi Nomura, NEC Corporation -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: > On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: > > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > > > Let me explain more to clarify my whole scenario. If a memory error > > > hits on a dirty pagecache, kernel works like below: > > > > > > 1. handles a MCE interrupt (logging MCE events,) > > > 2. calls memory error handler (doing 3 to 6,) > > > 3. sets PageHWPoison flag on the error page, > > > 4. unmaps all mappings to processes' virtual addresses, > > > > So nothing in userspace sees the bad page after this. > > > > > 5. sets AS_HWPOISON on mappings to which the error page belongs > > > 6. invalidates the error page (unlinks it from LRU list and removes > > > it from pagecache,) > > > (memory error handler finished) > > > > Ok, so the moment a memory error is handled, the page has been > > removed from the inode's mapping, and it will never be seen by > > aplications again. It's a transient error > > > > > 7. later accesses to the file returns -EIO, > > > 8. AS_HWPOISON is cleared when the file is removed or completely > > > truncated. > > > > so why do we have to keep an EIO on the inode forever? > > > > If the page is not dirty, then just tossing it from the cache (as > > is already done) and rereading it from disk next time it is accessed > > removes the need for any error to be reported at all. It's > > effectively a transient error at this point, and as such no errors > > should be visible from userspace. > > > > If the page is dirty, then it needs to be treated just like any > > other failed page write - the page is invalidated and the address > > space is marked with AS_EIO, and that is reported to the next > > operation that waits on IO on that file (i.e. fsync) > > > > If you have a second application that reads the files that depends > > on a guarantee of good data, then the first step in that process is > > that application that writes it needs to use fsync to check the data > > was written correctly. That ensures that you only have clean pages > > in the cache before the writer closes the file, and any h/w error > > then devolves to the above transient clean page invalidation case. > > Thank you for detailed explanations. > And yes, I understand it's ideal, but many applications choose not to > do that for performance reason. You choose: data integrity or performance. > So I think it's helpful if we can surely report to such applications. If performance is chosen over data integrity, we are under no obligation to keep the error around indefinitely. Fundamentally, ensuring a write completes successfully is the reponsibility of the application, not the kernel. There are so many different filesytem and storage errors that can be lost right now because data is not fsync()d, adding another one to them really doesn't change anything. IOWs, a memory error is no different to a disk failing or the system crashing when it comes to data integrity. If you care, you use fsync(). > > Hence I fail to see why this type of IO error needs to be sticky. > > The error on the mapping is transient - it is gone as soon as the > > page is removed from the mapping. Hence the error can be dropped as > > soon as it is reported to userspace because the mapping is now error > > free. > > It's error free only for the applications which do fsync check in > each write, but not for the applications which don't do. > I think the penalty for the latters (ignore dirty data lost and get > wrong results) is too big to consider it as a reasonable trade-off. I'm guessing that you don't deal with data integrity issues very often. What you are suggesting is not a reasonable tradeoff - either applications are coded correctly for data integrity, or they give up any expectation that errors will be detected and reported reliably. Hoping that we might be able to report an error somewhere to someone who didn't care to avoid or collect in the first place does not improve the situation for anyone Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Mon, Aug 27, 2012 at 06:05:06PM -0400, Naoya Horiguchi wrote: On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, So nothing in userspace sees the bad page after this. 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) Ok, so the moment a memory error is handled, the page has been removed from the inode's mapping, and it will never be seen by aplications again. It's a transient error 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. so why do we have to keep an EIO on the inode forever? If the page is not dirty, then just tossing it from the cache (as is already done) and rereading it from disk next time it is accessed removes the need for any error to be reported at all. It's effectively a transient error at this point, and as such no errors should be visible from userspace. If the page is dirty, then it needs to be treated just like any other failed page write - the page is invalidated and the address space is marked with AS_EIO, and that is reported to the next operation that waits on IO on that file (i.e. fsync) If you have a second application that reads the files that depends on a guarantee of good data, then the first step in that process is that application that writes it needs to use fsync to check the data was written correctly. That ensures that you only have clean pages in the cache before the writer closes the file, and any h/w error then devolves to the above transient clean page invalidation case. Thank you for detailed explanations. And yes, I understand it's ideal, but many applications choose not to do that for performance reason. You choose: data integrity or performance. So I think it's helpful if we can surely report to such applications. If performance is chosen over data integrity, we are under no obligation to keep the error around indefinitely. Fundamentally, ensuring a write completes successfully is the reponsibility of the application, not the kernel. There are so many different filesytem and storage errors that can be lost right now because data is not fsync()d, adding another one to them really doesn't change anything. IOWs, a memory error is no different to a disk failing or the system crashing when it comes to data integrity. If you care, you use fsync(). Hence I fail to see why this type of IO error needs to be sticky. The error on the mapping is transient - it is gone as soon as the page is removed from the mapping. Hence the error can be dropped as soon as it is reported to userspace because the mapping is now error free. It's error free only for the applications which do fsync check in each write, but not for the applications which don't do. I think the penalty for the latters (ignore dirty data lost and get wrong results) is too big to consider it as a reasonable trade-off. I'm guessing that you don't deal with data integrity issues very often. What you are suggesting is not a reasonable tradeoff - either applications are coded correctly for data integrity, or they give up any expectation that errors will be detected and reported reliably. Hoping that we might be able to report an error somewhere to someone who didn't care to avoid or collect in the first place does not improve the situation for anyone Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: > On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner > > > > > > case > > > > > > where we have possibilities of data lost. This is because in this > > > > > > fix > > > > > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > > > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > > > +++ v3.6-rc1/mm/truncate.c > > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, > > > > > > loff_t newsize) > > > > > > > > > > > > oldsize = inode->i_size; > > > > > > i_size_write(inode, newsize); > > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > > > > > So only a truncate to zero size will clear the poison flag? > > > > > > > > Yes, this is because we only know if the file is affected by hwpoison, > > > > but not where the hwpoisoned page is in the file. We could remember it, > > > > but I did not do it for simplicity. > > > > > > Surely the page has flags on it to say it is poisoned? That is, > > > after truncating the page cache, if the address space is poisoned, > > > then you can do a pass across the mapping tree checking if there are > > > any poisoned pages left? Or perhaps adding a new mapping tree tag so > > > that the poisoned status is automatically determined by the presence > > > of the poisoned page in the mapping tree? > > > > The answer for the first question is yes. And for the second question, > > I don't think it's easy because the mapping tree has no reference to > > the error page (I explain more about this below, please see also it,) > > and it can cost too much to search poisoned pages over page cache in > > each request. > > Which is my point about a radix tree tag - that's very efficient. > > > And for the third question, I think we could do this, but to do it > > we need an additional space (8 bytes) in struct radix_tree_node. > > Considering that this new tag is not used so frequently, so I'm not > > sure that it's worth the cost. > > A radix tree node is currently 560 bytes on x86_64, packed 7 to a > page. i.e. using 3920 bytes. We can add another 8 bytes to it > without increasing memory usage at all. So, no cost at all. OK. > > > > > What happens if it is the last page in the mapping that is poisoned, > > > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > > > > > When we handle the hwpoisoned inode, the error page should already > > > > be removed from pagecache, so the remaining pages are not related > > > > to the error and we need not care about them when we consider bit > > > > clearing. > > > > > > Sorry, I don't follow. What removed the page from the page cache? > > > The truncate_page_cache() call that follows the above code hunk is > > > what does that, so I don't see how it can already be removed from > > > the page cache > > > > Memory error handler (memory_failure() in mm/memory-failure.c) has > > removed the error page from the page cache. > > And truncate_page_cache() that follows this hunk removes all pages > > belonging to the page cache of the poisoned file (where the error > > page itself is not included in them.) > > > > Let me explain more to clarify my whole scenario. If a memory error > > hits on a dirty pagecache, kernel works like below: > > > > 1. handles a MCE interrupt (logging MCE events,) > > 2. calls memory error handler (doing 3 to 6,) > > 3. sets PageHWPoison flag on the error page, > > 4. unmaps all mappings to processes' virtual addresses, > > So nothing in userspace sees the bad page after this. > > > 5. sets AS_HWPOISON on mappings to which the error page belongs > > 6. invalidates the error page (unlinks it from LRU list and removes > > it from pagecache,) > > (memory error handler finished) > > Ok, so the moment a memory error is handled, the page has been > removed from the inode's mapping, and it will never be seen by > aplications again. It's a transient error > > > 7. later accesses to the file returns -EIO, > > 8. AS_HWPOISON is cleared when the file is removed or completely > > truncated. > > so why do we have to keep an EIO on the inode forever? > > If the page is not dirty, then just tossing it from the cache (as > is already done) and rereading it from disk next time it is accessed > removes the need for any error to be reported at all. It's > effectively a transient error at this point, and as such no errors > should be visible from userspace. > > If the page is dirty, then it
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Mon, Aug 27, 2012 at 08:26:07AM +1000, Dave Chinner wrote: On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? The answer for the first question is yes. And for the second question, I don't think it's easy because the mapping tree has no reference to the error page (I explain more about this below, please see also it,) and it can cost too much to search poisoned pages over page cache in each request. Which is my point about a radix tree tag - that's very efficient. And for the third question, I think we could do this, but to do it we need an additional space (8 bytes) in struct radix_tree_node. Considering that this new tag is not used so frequently, so I'm not sure that it's worth the cost. A radix tree node is currently 560 bytes on x86_64, packed 7 to a page. i.e. using 3920 bytes. We can add another 8 bytes to it without increasing memory usage at all. So, no cost at all. OK. What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache Memory error handler (memory_failure() in mm/memory-failure.c) has removed the error page from the page cache. And truncate_page_cache() that follows this hunk removes all pages belonging to the page cache of the poisoned file (where the error page itself is not included in them.) Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, So nothing in userspace sees the bad page after this. 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) Ok, so the moment a memory error is handled, the page has been removed from the inode's mapping, and it will never be seen by aplications again. It's a transient error 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. so why do we have to keep an EIO on the inode forever? If the page is not dirty, then just tossing it from the cache (as is already done) and rereading it from disk next time it is accessed removes the need for any error to be reported at all. It's effectively a transient error at this point, and as such no errors should be visible from userspace. If the page is dirty, then it needs to be treated just like any other failed page write - the page is invalidated and the address space is marked with AS_EIO, and that is reported to the next operation that waits on IO on that file (i.e. fsync) If you have a second application that reads the files that depends
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: > On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner > > > > > case > > > > > where we have possibilities of data lost. This is because in this fix > > > > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > > +++ v3.6-rc1/mm/truncate.c > > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t > > > > > newsize) > > > > > > > > > > oldsize = inode->i_size; > > > > > i_size_write(inode, newsize); > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > > > So only a truncate to zero size will clear the poison flag? > > > > > > Yes, this is because we only know if the file is affected by hwpoison, > > > but not where the hwpoisoned page is in the file. We could remember it, > > > but I did not do it for simplicity. > > > > Surely the page has flags on it to say it is poisoned? That is, > > after truncating the page cache, if the address space is poisoned, > > then you can do a pass across the mapping tree checking if there are > > any poisoned pages left? Or perhaps adding a new mapping tree tag so > > that the poisoned status is automatically determined by the presence > > of the poisoned page in the mapping tree? > > The answer for the first question is yes. And for the second question, > I don't think it's easy because the mapping tree has no reference to > the error page (I explain more about this below, please see also it,) > and it can cost too much to search poisoned pages over page cache in > each request. Which is my point about a radix tree tag - that's very efficient. > And for the third question, I think we could do this, but to do it > we need an additional space (8 bytes) in struct radix_tree_node. > Considering that this new tag is not used so frequently, so I'm not > sure that it's worth the cost. A radix tree node is currently 560 bytes on x86_64, packed 7 to a page. i.e. using 3920 bytes. We can add another 8 bytes to it without increasing memory usage at all. So, no cost at all. > > > > What happens if it is the last page in the mapping that is poisoned, > > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > > > When we handle the hwpoisoned inode, the error page should already > > > be removed from pagecache, so the remaining pages are not related > > > to the error and we need not care about them when we consider bit > > > clearing. > > > > Sorry, I don't follow. What removed the page from the page cache? > > The truncate_page_cache() call that follows the above code hunk is > > what does that, so I don't see how it can already be removed from > > the page cache > > Memory error handler (memory_failure() in mm/memory-failure.c) has > removed the error page from the page cache. > And truncate_page_cache() that follows this hunk removes all pages > belonging to the page cache of the poisoned file (where the error > page itself is not included in them.) > > Let me explain more to clarify my whole scenario. If a memory error > hits on a dirty pagecache, kernel works like below: > > 1. handles a MCE interrupt (logging MCE events,) > 2. calls memory error handler (doing 3 to 6,) > 3. sets PageHWPoison flag on the error page, > 4. unmaps all mappings to processes' virtual addresses, So nothing in userspace sees the bad page after this. > 5. sets AS_HWPOISON on mappings to which the error page belongs > 6. invalidates the error page (unlinks it from LRU list and removes > it from pagecache,) > (memory error handler finished) Ok, so the moment a memory error is handled, the page has been removed from the inode's mapping, and it will never be seen by aplications again. It's a transient error > 7. later accesses to the file returns -EIO, > 8. AS_HWPOISON is cleared when the file is removed or completely > truncated. so why do we have to keep an EIO on the inode forever? If the page is not dirty, then just tossing it from the cache (as is already done) and rereading it from disk next time it is accessed removes the need for any error to be reported at all. It's effectively a transient error at this point, and as such no errors should be visible from userspace. If the page is dirty, then it needs to be treated just like any other failed page write - the page is invalidated and the address space is marked with AS_EIO, and that is reported to the next operation that waits on IO on that file (i.e. fsync) If you have a second application that reads the files
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 01:24:16PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? The answer for the first question is yes. And for the second question, I don't think it's easy because the mapping tree has no reference to the error page (I explain more about this below, please see also it,) and it can cost too much to search poisoned pages over page cache in each request. Which is my point about a radix tree tag - that's very efficient. And for the third question, I think we could do this, but to do it we need an additional space (8 bytes) in struct radix_tree_node. Considering that this new tag is not used so frequently, so I'm not sure that it's worth the cost. A radix tree node is currently 560 bytes on x86_64, packed 7 to a page. i.e. using 3920 bytes. We can add another 8 bytes to it without increasing memory usage at all. So, no cost at all. What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache Memory error handler (memory_failure() in mm/memory-failure.c) has removed the error page from the page cache. And truncate_page_cache() that follows this hunk removes all pages belonging to the page cache of the poisoned file (where the error page itself is not included in them.) Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, So nothing in userspace sees the bad page after this. 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) Ok, so the moment a memory error is handled, the page has been removed from the inode's mapping, and it will never be seen by aplications again. It's a transient error 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. so why do we have to keep an EIO on the inode forever? If the page is not dirty, then just tossing it from the cache (as is already done) and rereading it from disk next time it is accessed removes the need for any error to be reported at all. It's effectively a transient error at this point, and as such no errors should be visible from userspace. If the page is dirty, then it needs to be treated just like any other failed page write - the page is invalidated and the address space is marked with AS_EIO, and that is reported to the next operation that waits on IO on that file (i.e. fsync) If you have a second application that reads the files that depends on a guarantee of good data, then the first step in that process is that application that writes it needs to use fsync to check the data was written correctly. That ensures
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
Hello, On Thu, Aug 23, 2012 at 04:31:43PM -0400, Naoya Horiguchi wrote: > On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: ... > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > > index ac8d904..8742397 100644 > > > --- v3.6-rc1.orig/fs/inode.c > > > +++ v3.6-rc1/fs/inode.c > > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int > > > nr_to_scan) > > > } > > > > > > /* > > > + * Keep inode caches on memory for user processes to certainly > > > + * be aware of memory errors. > > > + */ > > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > > + spin_unlock(>i_lock); > > > + continue; > > > + } > > > > That chunk prevents reclaiming all the cached pages. However the intention > > is only to keep the struct inode together with the hwpoison bit? > > Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from > shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file > remain for long under high memory pressure. I might lose your point. Are you suggesting this chunk should come after if (inode_has_buffers(inode) || inode->i_data.nrpages) { ... } block, aren't you? I think that's right, so I'll try and test it this weekend. > > > + /* > > >* Referenced or dirty inodes are still in use. Give them > > >* another pass through the LRU as we canot reclaim them now. > > >*/ > > > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > > > inode->i_state &= ~I_WILL_FREE; > > > } > > > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > Is that clear necessary? Because the bit will be gone with the inode > > struct: it's going to be de-allocated anyway. > > With the chunk in prune_icache_sb() we keep the inode struct with > AS_HWPOISON set on memory, so in order to remove it, we need explicitly > clear the bit. > Without this clear, the inode remains until system reboot. And again, you are right here. Without this clear, this inode will be cleared in destroy_inode(). Thanks, Naoya -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: > On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > > where we have possibilities of data lost. This is because in this fix > > > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > > --- v3.6-rc1.orig/mm/truncate.c > > > > +++ v3.6-rc1/mm/truncate.c > > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t > > > > newsize) > > > > > > > > oldsize = inode->i_size; > > > > i_size_write(inode, newsize); > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > > > So only a truncate to zero size will clear the poison flag? > > > > Yes, this is because we only know if the file is affected by hwpoison, > > but not where the hwpoisoned page is in the file. We could remember it, > > but I did not do it for simplicity. > > Surely the page has flags on it to say it is poisoned? That is, > after truncating the page cache, if the address space is poisoned, > then you can do a pass across the mapping tree checking if there are > any poisoned pages left? Or perhaps adding a new mapping tree tag so > that the poisoned status is automatically determined by the presence > of the poisoned page in the mapping tree? The answer for the first question is yes. And for the second question, I don't think it's easy because the mapping tree has no reference to the error page (I explain more about this below, please see also it,) and it can cost too much to search poisoned pages over page cache in each request. And for the third question, I think we could do this, but to do it we need an additional space (8 bytes) in struct radix_tree_node. Considering that this new tag is not used so frequently, so I'm not sure that it's worth the cost. > > > What happens if it is the last page in the mapping that is poisoned, > > > and we truncate that away? Shouldn't that clear the poisoned bit? > > > > When we handle the hwpoisoned inode, the error page should already > > be removed from pagecache, so the remaining pages are not related > > to the error and we need not care about them when we consider bit > > clearing. > > Sorry, I don't follow. What removed the page from the page cache? > The truncate_page_cache() call that follows the above code hunk is > what does that, so I don't see how it can already be removed from > the page cache Memory error handler (memory_failure() in mm/memory-failure.c) has removed the error page from the page cache. And truncate_page_cache() that follows this hunk removes all pages belonging to the page cache of the poisoned file (where the error page itself is not included in them.) Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. This patchset tries to fix the problem in 7 and add a new behavior 8, where I have an assumption that 1-6 has already worked out. You may think it strange that the condition of clearing AS_HWPOISON is checked with file granularity. This is because currently userspace applications know the memory errors only with file granularity for simplicity, when they access via read(), write() and/or fsync(). (Only for access via mmap(), error reporting is with page granularity.) We can do it with page granularity, and I tried 2 approaches: a. "adding a new tag in mapping tree" approach as I explained above, this needs an additional space on heavily used data structure, b. "adding a new data structure specific to dirty pagecache error management" approach I did this in the 1st version of this patchset, but found that it's too complicated as the first step. But I think both of these are some difficulty to be accepted, so at first I try to start with simpler one. > > > What about a hole punch over the poisoned range? > > > > For the same reason, this is also not related to when to clear the bit. > > Sure it is - if you remove the poisoned pages from the mapping when > the hole is punched, then the mapping is no longer poisoned. Hence > the bit should be cleared at that time as nothing else will ever > clear it. If we
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 02:39:17PM +1000, Dave Chinner wrote: On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? The answer for the first question is yes. And for the second question, I don't think it's easy because the mapping tree has no reference to the error page (I explain more about this below, please see also it,) and it can cost too much to search poisoned pages over page cache in each request. And for the third question, I think we could do this, but to do it we need an additional space (8 bytes) in struct radix_tree_node. Considering that this new tag is not used so frequently, so I'm not sure that it's worth the cost. What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache Memory error handler (memory_failure() in mm/memory-failure.c) has removed the error page from the page cache. And truncate_page_cache() that follows this hunk removes all pages belonging to the page cache of the poisoned file (where the error page itself is not included in them.) Let me explain more to clarify my whole scenario. If a memory error hits on a dirty pagecache, kernel works like below: 1. handles a MCE interrupt (logging MCE events,) 2. calls memory error handler (doing 3 to 6,) 3. sets PageHWPoison flag on the error page, 4. unmaps all mappings to processes' virtual addresses, 5. sets AS_HWPOISON on mappings to which the error page belongs 6. invalidates the error page (unlinks it from LRU list and removes it from pagecache,) (memory error handler finished) 7. later accesses to the file returns -EIO, 8. AS_HWPOISON is cleared when the file is removed or completely truncated. This patchset tries to fix the problem in 7 and add a new behavior 8, where I have an assumption that 1-6 has already worked out. You may think it strange that the condition of clearing AS_HWPOISON is checked with file granularity. This is because currently userspace applications know the memory errors only with file granularity for simplicity, when they access via read(), write() and/or fsync(). (Only for access via mmap(), error reporting is with page granularity.) We can do it with page granularity, and I tried 2 approaches: a. adding a new tag in mapping tree approach as I explained above, this needs an additional space on heavily used data structure, b. adding a new data structure specific to dirty pagecache error management approach I did this in the 1st version of this patchset, but found that it's too complicated as the first step. But I think both of these are some difficulty to be accepted, so at first I try to start with simpler one. What about a hole punch over the poisoned range? For the same reason, this is also not related to when to clear the bit. Sure it is - if you remove the poisoned pages from the mapping when the hole is punched, then the mapping is no longer poisoned. Hence the bit should be cleared at that time as nothing else will ever clear it. If we achieve error reporting with the page granularity, I hope we can also do this easily. Do I answer all your questions? If
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
Hello, On Thu, Aug 23, 2012 at 04:31:43PM -0400, Naoya Horiguchi wrote: On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: ... diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* + * Keep inode caches on memory for user processes to certainly + * be aware of memory errors. + */ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } That chunk prevents reclaiming all the cached pages. However the intention is only to keep the struct inode together with the hwpoison bit? Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file remain for long under high memory pressure. I might lose your point. Are you suggesting this chunk should come after if (inode_has_buffers(inode) || inode-i_data.nrpages) { ... } block, aren't you? I think that's right, so I'll try and test it this weekend. + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode-i_state = ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode-i_mapping) drop)) + mapping_clear_hwpoison(inode-i_mapping); Is that clear necessary? Because the bit will be gone with the inode struct: it's going to be de-allocated anyway. With the chunk in prune_icache_sb() we keep the inode struct with AS_HWPOISON set on memory, so in order to remove it, we need explicitly clear the bit. Without this clear, the inode remains until system reboot. And again, you are right here. Without this clear, this inode will be cleared in destroy_inode(). Thanks, Naoya -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: > On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > > where we have possibilities of data lost. This is because in this fix > > > AS_HWPOISON is cleared when the inode cache is dropped. > > > --- v3.6-rc1.orig/mm/truncate.c > > > +++ v3.6-rc1/mm/truncate.c > > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t > > > newsize) > > > > > > oldsize = inode->i_size; > > > i_size_write(inode, newsize); > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > > + mapping_clear_hwpoison(inode->i_mapping); > > > > So only a truncate to zero size will clear the poison flag? > > Yes, this is because we only know if the file is affected by hwpoison, > but not where the hwpoisoned page is in the file. We could remember it, > but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? > > What happens if it is the last page in the mapping that is poisoned, > > and we truncate that away? Shouldn't that clear the poisoned bit? > > When we handle the hwpoisoned inode, the error page should already > be removed from pagecache, so the remaining pages are not related > to the error and we need not care about them when we consider bit > clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache > > What about a hole punch over the poisoned range? > > For the same reason, this is also not related to when to clear the bit. Sure it is - if you remove the poisoned pages from the mapping when the hole is punched, then the mapping is no longer poisoned. Hence the bit should be cleared at that time as nothing else will ever clear it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > where we have possibilities of data lost. This is because in this fix > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > For example, consider an application in which a process periodically > > (every 10 minutes) writes some logs on a file (and closes it after > > each writes,) and at the end of each day some batch programs run using > > the log file. If a memory error hits on dirty pagecache of this log file > > just after periodic write/close and the inode cache is cleared before the > > next write, then this application is not aware of the error and the batch > > programs will work wrongly. > > > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > > until we remove or completely truncate the hwpoisoned file. > > > > Signed-off-by: Naoya Horiguchi > > --- > > fs/inode.c | 12 > > include/linux/pagemap.h | 11 +++ > > mm/memory-failure.c | 2 +- > > mm/truncate.c | 2 ++ > > 4 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > index ac8d904..8742397 100644 > > --- v3.6-rc1.orig/fs/inode.c > > +++ v3.6-rc1/fs/inode.c > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int > > nr_to_scan) > > } > > > > /* > > +* Keep inode caches on memory for user processes to certainly > > +* be aware of memory errors. > > +*/ > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > + spin_unlock(>i_lock); > > + continue; > > + } > > + > > + /* > > * Referenced or dirty inodes are still in use. Give them > > * another pass through the LRU as we canot reclaim them now. > > */ > > I don't think you tested this at all. Have a look at what the loop > does more closely - inodes with poisoned mappings will get stuck > and reclaim doesn't make progress past them. Sorry, I overlooked something important in my testing. I'll correct it. Maybe we need list_move_tail() in this block. > I think you also need to document this inode lifecycle change OK, I'll do it. > > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > > index 75801ac..82a994f 100644 > > --- v3.6-rc1.orig/mm/truncate.c > > +++ v3.6-rc1/mm/truncate.c > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t > > newsize) > > > > oldsize = inode->i_size; > > i_size_write(inode, newsize); > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > + mapping_clear_hwpoison(inode->i_mapping); > > So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. > What happens if it is the last page in the mapping that is poisoned, > and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. > What about a hole punch over the poisoned range? For the same reason, this is also not related to when to clear the bit. Thanks, Naoya -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > where we have possibilities of data lost. This is because in this fix > AS_HWPOISON is cleared when the inode cache is dropped. > > For example, consider an application in which a process periodically > (every 10 minutes) writes some logs on a file (and closes it after > each writes,) and at the end of each day some batch programs run using > the log file. If a memory error hits on dirty pagecache of this log file > just after periodic write/close and the inode cache is cleared before the > next write, then this application is not aware of the error and the batch > programs will work wrongly. > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > until we remove or completely truncate the hwpoisoned file. > > Signed-off-by: Naoya Horiguchi > --- > fs/inode.c | 12 > include/linux/pagemap.h | 11 +++ > mm/memory-failure.c | 2 +- > mm/truncate.c | 2 ++ > 4 files changed, 26 insertions(+), 1 deletion(-) > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > index ac8d904..8742397 100644 > --- v3.6-rc1.orig/fs/inode.c > +++ v3.6-rc1/fs/inode.c > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int > nr_to_scan) > } > > /* > + * Keep inode caches on memory for user processes to certainly > + * be aware of memory errors. > + */ > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > + spin_unlock(>i_lock); > + continue; > + } > + > + /* >* Referenced or dirty inodes are still in use. Give them >* another pass through the LRU as we canot reclaim them now. >*/ I don't think you tested this at all. Have a look at what the loop does more closely - inodes with poisoned mappings will get stuck and reclaim doesn't make progress past them. I think you also need to document this inode lifecycle change > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > index 75801ac..82a994f 100644 > --- v3.6-rc1.orig/mm/truncate.c > +++ v3.6-rc1/mm/truncate.c > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > oldsize = inode->i_size; > i_size_write(inode, newsize); > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > + mapping_clear_hwpoison(inode->i_mapping); So only a truncate to zero size will clear the poison flag? What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? What about a hole punch over the poisoned range? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: > On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > > where we have possibilities of data lost. This is because in this fix > > AS_HWPOISON is cleared when the inode cache is dropped. > > > > For example, consider an application in which a process periodically > > (every 10 minutes) writes some logs on a file (and closes it after > > each writes,) and at the end of each day some batch programs run using > > the log file. If a memory error hits on dirty pagecache of this log file > > just after periodic write/close and the inode cache is cleared before the > > next write, then this application is not aware of the error and the batch > > programs will work wrongly. > > > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > > until we remove or completely truncate the hwpoisoned file. > > Good point! > > > Signed-off-by: Naoya Horiguchi > > --- > > fs/inode.c | 12 > > include/linux/pagemap.h | 11 +++ > > mm/memory-failure.c | 2 +- > > mm/truncate.c | 2 ++ > > 4 files changed, 26 insertions(+), 1 deletion(-) > > > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > > index ac8d904..8742397 100644 > > --- v3.6-rc1.orig/fs/inode.c > > +++ v3.6-rc1/fs/inode.c > > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int > > nr_to_scan) > > } > > > > /* > > +* Keep inode caches on memory for user processes to certainly > > +* be aware of memory errors. > > +*/ > > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > > + spin_unlock(>i_lock); > > + continue; > > + } > > That chunk prevents reclaiming all the cached pages. However the intention > is only to keep the struct inode together with the hwpoison bit? Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file remain for long under high memory pressure. > > + /* > > * Referenced or dirty inodes are still in use. Give them > > * another pass through the LRU as we canot reclaim them now. > > */ > > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > > inode->i_state &= ~I_WILL_FREE; > > } > > > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > > + mapping_clear_hwpoison(inode->i_mapping); > > Is that clear necessary? Because the bit will be gone with the inode > struct: it's going to be de-allocated anyway. With the chunk in prune_icache_sb() we keep the inode struct with AS_HWPOISON set on memory, so in order to remove it, we need explicitly clear the bit. Without this clear, the inode remains until system reboot. > > inode->i_state |= I_FREEING; > > if (!list_empty(>i_lru)) > > inode_lru_list_del(inode); > > diff --git v3.6-rc1.orig/include/linux/pagemap.h > > v3.6-rc1/include/linux/pagemap.h > > index 4d8d821..9fce4e4 100644 > > --- v3.6-rc1.orig/include/linux/pagemap.h > > +++ v3.6-rc1/include/linux/pagemap.h > > @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space > > *mapping) > > { > > return test_bit(AS_HWPOISON, >flags); > > } > > +static inline void mapping_set_hwpoison(struct address_space *mapping) > > +{ > > + set_bit(AS_HWPOISON, >flags); > > +} > > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > > +{ > > + clear_bit(AS_HWPOISON, >flags); > > +} > > #else > > static inline int mapping_hwpoison(struct address_space *mapping) > > { > > return 0; > > } > > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > > +{ > > +} > > #endif > > > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > > index a1e7e00..ca064c6 100644 > > --- v3.6-rc1.orig/mm/memory-failure.c > > +++ v3.6-rc1/mm/memory-failure.c > > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned > > long pfn) > > * the first EIO, but we're not worse than other parts > > * of the kernel. > > */ > > - set_bit(AS_HWPOISON, >flags); > > + mapping_set_hwpoison(mapping); > > } > > > > return me_pagecache_clean(p, pfn); > > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > > index 75801ac..82a994f 100644 > > --- v3.6-rc1.orig/mm/truncate.c > > +++ v3.6-rc1/mm/truncate.c > > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t > > newsize) > > > > oldsize = inode->i_size; > > i_size_write(inode, newsize); > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) > > It
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: > "HWPOISON: report sticky EIO for poisoned file" still has a corner case > where we have possibilities of data lost. This is because in this fix > AS_HWPOISON is cleared when the inode cache is dropped. > > For example, consider an application in which a process periodically > (every 10 minutes) writes some logs on a file (and closes it after > each writes,) and at the end of each day some batch programs run using > the log file. If a memory error hits on dirty pagecache of this log file > just after periodic write/close and the inode cache is cleared before the > next write, then this application is not aware of the error and the batch > programs will work wrongly. > > To avoid this, this patch makes us pin the hwpoisoned inode on memory > until we remove or completely truncate the hwpoisoned file. Good point! > Signed-off-by: Naoya Horiguchi > --- > fs/inode.c | 12 > include/linux/pagemap.h | 11 +++ > mm/memory-failure.c | 2 +- > mm/truncate.c | 2 ++ > 4 files changed, 26 insertions(+), 1 deletion(-) > > diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c > index ac8d904..8742397 100644 > --- v3.6-rc1.orig/fs/inode.c > +++ v3.6-rc1/fs/inode.c > @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int > nr_to_scan) > } > > /* > + * Keep inode caches on memory for user processes to certainly > + * be aware of memory errors. > + */ > + if (unlikely(mapping_hwpoison(inode->i_mapping))) { > + spin_unlock(>i_lock); > + continue; > + } That chunk prevents reclaiming all the cached pages. However the intention is only to keep the struct inode together with the hwpoison bit? > + /* >* Referenced or dirty inodes are still in use. Give them >* another pass through the LRU as we canot reclaim them now. >*/ > @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) > inode->i_state &= ~I_WILL_FREE; > } > > + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) > + mapping_clear_hwpoison(inode->i_mapping); Is that clear necessary? Because the bit will be gone with the inode struct: it's going to be de-allocated anyway. > inode->i_state |= I_FREEING; > if (!list_empty(>i_lru)) > inode_lru_list_del(inode); > diff --git v3.6-rc1.orig/include/linux/pagemap.h > v3.6-rc1/include/linux/pagemap.h > index 4d8d821..9fce4e4 100644 > --- v3.6-rc1.orig/include/linux/pagemap.h > +++ v3.6-rc1/include/linux/pagemap.h > @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space > *mapping) > { > return test_bit(AS_HWPOISON, >flags); > } > +static inline void mapping_set_hwpoison(struct address_space *mapping) > +{ > + set_bit(AS_HWPOISON, >flags); > +} > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > +{ > + clear_bit(AS_HWPOISON, >flags); > +} > #else > static inline int mapping_hwpoison(struct address_space *mapping) > { > return 0; > } > +static inline void mapping_clear_hwpoison(struct address_space *mapping) > +{ > +} > #endif > > static inline gfp_t mapping_gfp_mask(struct address_space * mapping) > diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c > index a1e7e00..ca064c6 100644 > --- v3.6-rc1.orig/mm/memory-failure.c > +++ v3.6-rc1/mm/memory-failure.c > @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned > long pfn) >* the first EIO, but we're not worse than other parts >* of the kernel. >*/ > - set_bit(AS_HWPOISON, >flags); > + mapping_set_hwpoison(mapping); > } > > return me_pagecache_clean(p, pfn); > diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c > index 75801ac..82a994f 100644 > --- v3.6-rc1.orig/mm/truncate.c > +++ v3.6-rc1/mm/truncate.c > @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) > > oldsize = inode->i_size; > i_size_write(inode, newsize); > + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) It might be a bit better to test !newsize first. > + mapping_clear_hwpoison(inode->i_mapping); > > truncate_pagecache(inode, oldsize, newsize); > } > -- > 1.7.11.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Good point! Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* + * Keep inode caches on memory for user processes to certainly + * be aware of memory errors. + */ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } That chunk prevents reclaiming all the cached pages. However the intention is only to keep the struct inode together with the hwpoison bit? + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode-i_state = ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode-i_mapping) drop)) + mapping_clear_hwpoison(inode-i_mapping); Is that clear necessary? Because the bit will be gone with the inode struct: it's going to be de-allocated anyway. inode-i_state |= I_FREEING; if (!list_empty(inode-i_lru)) inode_lru_list_del(inode); diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index 4d8d821..9fce4e4 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) { return test_bit(AS_HWPOISON, mapping-flags); } +static inline void mapping_set_hwpoison(struct address_space *mapping) +{ + set_bit(AS_HWPOISON, mapping-flags); +} +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ + clear_bit(AS_HWPOISON, mapping-flags); +} #else static inline int mapping_hwpoison(struct address_space *mapping) { return 0; } +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ +} #endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a1e7e00..ca064c6 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - set_bit(AS_HWPOISON, mapping-flags); + mapping_set_hwpoison(mapping); } return me_pagecache_clean(p, pfn); diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) It might be a bit better to test !newsize first. + mapping_clear_hwpoison(inode-i_mapping); truncate_pagecache(inode, oldsize, newsize); } -- 1.7.11.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Thu, Aug 23, 2012 at 05:11:25PM +0800, Fengguang Wu wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Good point! Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* +* Keep inode caches on memory for user processes to certainly +* be aware of memory errors. +*/ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } That chunk prevents reclaiming all the cached pages. However the intention is only to keep the struct inode together with the hwpoison bit? Yes, we can not reclaim pagecaches from shrink_slab(), but we can do from shrink_zone(). So it shouldn't happen that cached pages on hwpoisoned file remain for long under high memory pressure. + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode-i_state = ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode-i_mapping) drop)) + mapping_clear_hwpoison(inode-i_mapping); Is that clear necessary? Because the bit will be gone with the inode struct: it's going to be de-allocated anyway. With the chunk in prune_icache_sb() we keep the inode struct with AS_HWPOISON set on memory, so in order to remove it, we need explicitly clear the bit. Without this clear, the inode remains until system reboot. inode-i_state |= I_FREEING; if (!list_empty(inode-i_lru)) inode_lru_list_del(inode); diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index 4d8d821..9fce4e4 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) { return test_bit(AS_HWPOISON, mapping-flags); } +static inline void mapping_set_hwpoison(struct address_space *mapping) +{ + set_bit(AS_HWPOISON, mapping-flags); +} +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ + clear_bit(AS_HWPOISON, mapping-flags); +} #else static inline int mapping_hwpoison(struct address_space *mapping) { return 0; } +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ +} #endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a1e7e00..ca064c6 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - set_bit(AS_HWPOISON, mapping-flags); + mapping_set_hwpoison(mapping); } return me_pagecache_clean(p, pfn); diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) It might be a bit better to test !newsize first. Ah, OK. Thanks, Naoya -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* + * Keep inode caches on memory for user processes to certainly + * be aware of memory errors. + */ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } + + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ I don't think you tested this at all. Have a look at what the loop does more closely - inodes with poisoned mappings will get stuck and reclaim doesn't make progress past them. I think you also need to document this inode lifecycle change diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? What about a hole punch over the poisoned range? Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* +* Keep inode caches on memory for user processes to certainly +* be aware of memory errors. +*/ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } + + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ I don't think you tested this at all. Have a look at what the loop does more closely - inodes with poisoned mappings will get stuck and reclaim doesn't make progress past them. Sorry, I overlooked something important in my testing. I'll correct it. Maybe we need list_move_tail() in this block. I think you also need to document this inode lifecycle change OK, I'll do it. diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. What about a hole punch over the poisoned range? For the same reason, this is also not related to when to clear the bit. Thanks, Naoya -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
On Thu, Aug 23, 2012 at 10:39:32PM -0400, Naoya Horiguchi wrote: On Fri, Aug 24, 2012 at 11:31:18AM +1000, Dave Chinner wrote: On Wed, Aug 22, 2012 at 11:17:35AM -0400, Naoya Horiguchi wrote: HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); So only a truncate to zero size will clear the poison flag? Yes, this is because we only know if the file is affected by hwpoison, but not where the hwpoisoned page is in the file. We could remember it, but I did not do it for simplicity. Surely the page has flags on it to say it is poisoned? That is, after truncating the page cache, if the address space is poisoned, then you can do a pass across the mapping tree checking if there are any poisoned pages left? Or perhaps adding a new mapping tree tag so that the poisoned status is automatically determined by the presence of the poisoned page in the mapping tree? What happens if it is the last page in the mapping that is poisoned, and we truncate that away? Shouldn't that clear the poisoned bit? When we handle the hwpoisoned inode, the error page should already be removed from pagecache, so the remaining pages are not related to the error and we need not care about them when we consider bit clearing. Sorry, I don't follow. What removed the page from the page cache? The truncate_page_cache() call that follows the above code hunk is what does that, so I don't see how it can already be removed from the page cache What about a hole punch over the poisoned range? For the same reason, this is also not related to when to clear the bit. Sure it is - if you remove the poisoned pages from the mapping when the hole is punched, then the mapping is no longer poisoned. Hence the bit should be cleared at that time as nothing else will ever clear it. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
"HWPOISON: report sticky EIO for poisoned file" still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Signed-off-by: Naoya Horiguchi --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* +* Keep inode caches on memory for user processes to certainly +* be aware of memory errors. +*/ + if (unlikely(mapping_hwpoison(inode->i_mapping))) { + spin_unlock(>i_lock); + continue; + } + + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode->i_state &= ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode->i_mapping) && drop)) + mapping_clear_hwpoison(inode->i_mapping); + inode->i_state |= I_FREEING; if (!list_empty(>i_lru)) inode_lru_list_del(inode); diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index 4d8d821..9fce4e4 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) { return test_bit(AS_HWPOISON, >flags); } +static inline void mapping_set_hwpoison(struct address_space *mapping) +{ + set_bit(AS_HWPOISON, >flags); +} +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ + clear_bit(AS_HWPOISON, >flags); +} #else static inline int mapping_hwpoison(struct address_space *mapping) { return 0; } +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ +} #endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a1e7e00..ca064c6 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - set_bit(AS_HWPOISON, >flags); + mapping_set_hwpoison(mapping); } return me_pagecache_clean(p, pfn); diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode->i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode->i_mapping) && !newsize)) + mapping_clear_hwpoison(inode->i_mapping); truncate_pagecache(inode, oldsize, newsize); } -- 1.7.11.4 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH 3/3] HWPOISON: prevent inode cache removal to keep AS_HWPOISON sticky
HWPOISON: report sticky EIO for poisoned file still has a corner case where we have possibilities of data lost. This is because in this fix AS_HWPOISON is cleared when the inode cache is dropped. For example, consider an application in which a process periodically (every 10 minutes) writes some logs on a file (and closes it after each writes,) and at the end of each day some batch programs run using the log file. If a memory error hits on dirty pagecache of this log file just after periodic write/close and the inode cache is cleared before the next write, then this application is not aware of the error and the batch programs will work wrongly. To avoid this, this patch makes us pin the hwpoisoned inode on memory until we remove or completely truncate the hwpoisoned file. Signed-off-by: Naoya Horiguchi n-horigu...@ah.jp.nec.com --- fs/inode.c | 12 include/linux/pagemap.h | 11 +++ mm/memory-failure.c | 2 +- mm/truncate.c | 2 ++ 4 files changed, 26 insertions(+), 1 deletion(-) diff --git v3.6-rc1.orig/fs/inode.c v3.6-rc1/fs/inode.c index ac8d904..8742397 100644 --- v3.6-rc1.orig/fs/inode.c +++ v3.6-rc1/fs/inode.c @@ -717,6 +717,15 @@ void prune_icache_sb(struct super_block *sb, int nr_to_scan) } /* +* Keep inode caches on memory for user processes to certainly +* be aware of memory errors. +*/ + if (unlikely(mapping_hwpoison(inode-i_mapping))) { + spin_unlock(inode-i_lock); + continue; + } + + /* * Referenced or dirty inodes are still in use. Give them * another pass through the LRU as we canot reclaim them now. */ @@ -1405,6 +1414,9 @@ static void iput_final(struct inode *inode) inode-i_state = ~I_WILL_FREE; } + if (unlikely(mapping_hwpoison(inode-i_mapping) drop)) + mapping_clear_hwpoison(inode-i_mapping); + inode-i_state |= I_FREEING; if (!list_empty(inode-i_lru)) inode_lru_list_del(inode); diff --git v3.6-rc1.orig/include/linux/pagemap.h v3.6-rc1/include/linux/pagemap.h index 4d8d821..9fce4e4 100644 --- v3.6-rc1.orig/include/linux/pagemap.h +++ v3.6-rc1/include/linux/pagemap.h @@ -59,11 +59,22 @@ static inline int mapping_hwpoison(struct address_space *mapping) { return test_bit(AS_HWPOISON, mapping-flags); } +static inline void mapping_set_hwpoison(struct address_space *mapping) +{ + set_bit(AS_HWPOISON, mapping-flags); +} +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ + clear_bit(AS_HWPOISON, mapping-flags); +} #else static inline int mapping_hwpoison(struct address_space *mapping) { return 0; } +static inline void mapping_clear_hwpoison(struct address_space *mapping) +{ +} #endif static inline gfp_t mapping_gfp_mask(struct address_space * mapping) diff --git v3.6-rc1.orig/mm/memory-failure.c v3.6-rc1/mm/memory-failure.c index a1e7e00..ca064c6 100644 --- v3.6-rc1.orig/mm/memory-failure.c +++ v3.6-rc1/mm/memory-failure.c @@ -652,7 +652,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn) * the first EIO, but we're not worse than other parts * of the kernel. */ - set_bit(AS_HWPOISON, mapping-flags); + mapping_set_hwpoison(mapping); } return me_pagecache_clean(p, pfn); diff --git v3.6-rc1.orig/mm/truncate.c v3.6-rc1/mm/truncate.c index 75801ac..82a994f 100644 --- v3.6-rc1.orig/mm/truncate.c +++ v3.6-rc1/mm/truncate.c @@ -574,6 +574,8 @@ void truncate_setsize(struct inode *inode, loff_t newsize) oldsize = inode-i_size; i_size_write(inode, newsize); + if (unlikely(mapping_hwpoison(inode-i_mapping) !newsize)) + mapping_clear_hwpoison(inode-i_mapping); truncate_pagecache(inode, oldsize, newsize); } -- 1.7.11.4 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/