Re: svn commit: r328914 - in head/sys: kern ufs/ffs
On Tue, Feb 6, 2018 at 1:09 AM, Conrad Meyerwrote: > Hi Kirk, > > On Mon, Feb 5, 2018 at 4:19 PM, Kirk McKusick > wrote: > > Author: mckusick > > Date: Tue Feb 6 00:19:46 2018 > > New Revision: 328914 > > URL: https://svnweb.freebsd.org/changeset/base/328914 > > > > Log: > > ... > > The second problem was that the check hash computed at the end of the > > read was incorrect because the calculation of the check hash on > > completion of the read was being done too soon. > > > > - When a read completes we had the following sequence: > > > > - bufdone() > > -- b_ckhashcalc (calculates check hash) > > -- bufdone_finish() > > --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) > > > > - When we are reading a buffer where one or more pages are already > > in memory (but not all pages, or we wouldn't be doing the read), > > the I/O is done with bogus_page mapped in for the pages that exist > > in the VM cache. This mapping is done to avoid corrupting the > > cached pages if there is any I/O overrun. The vfs_vmio_iodone() > > function is responsible for replacing the bogus_page(s) with the > > cached ones. But we were calculating the check hash before the > > bogus_page(s) were replaced. Hence, when we were calculating the > > check hash, we were partly reading from bogus_page, which means > > we calculated a bad check hash (e.g., because multiple pages have > > been mapped to bogus_page, so its contents are indeterminate). > > > > The second fix is to move the check-hash calculation from bufdone() > > to bufdone_finish() after the call to vfs_vmio_iodone() so that it > > computes the check hash over the correct set of pages. > > Does the b_iodone callback have a very similar potential problem? It > is invoked in bufdone(), before bufdone_finish() and > vfs_vmio_iodone(). > > One example b_iodone is the bdone() callback. It seems that b_iodone > -> bdone() can then wake bwait()ers before any VMIO cached content has > been filled in. > > I don't know that any specific consumers of b_iodone are currently > broken, but it seems like maybe the b_iodone callback should really be > in the later location as well. > It looks to me like this part of bufdone_finish() if (bp->b_flags & B_VMIO) { /* * Set B_CACHE if the op was a normal read and no error * occurred. B_CACHE is set for writes in the b*write() * routines. */ if (bp->b_iocmd == BIO_READ && !(bp->b_flags & (B_INVAL|B_NOCACHE)) && !(bp->b_ioflags & BIO_ERROR)) bp->b_flags |= B_CACHE; vfs_vmio_iodone(bp); } belongs before the callback to b_iodone() rather than in bufdone_finish(). It appears that bufdone_finish() isn't called elsewhere, despite being non-static. I'm curious why this is... Warner ___ svn-src-head@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
Re: svn commit: r328914 - in head/sys: kern ufs/ffs
Hi Kirk, On Mon, Feb 5, 2018 at 4:19 PM, Kirk McKusickwrote: > Author: mckusick > Date: Tue Feb 6 00:19:46 2018 > New Revision: 328914 > URL: https://svnweb.freebsd.org/changeset/base/328914 > > Log: > ... > The second problem was that the check hash computed at the end of the > read was incorrect because the calculation of the check hash on > completion of the read was being done too soon. > > - When a read completes we had the following sequence: > > - bufdone() > -- b_ckhashcalc (calculates check hash) > -- bufdone_finish() > --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) > > - When we are reading a buffer where one or more pages are already > in memory (but not all pages, or we wouldn't be doing the read), > the I/O is done with bogus_page mapped in for the pages that exist > in the VM cache. This mapping is done to avoid corrupting the > cached pages if there is any I/O overrun. The vfs_vmio_iodone() > function is responsible for replacing the bogus_page(s) with the > cached ones. But we were calculating the check hash before the > bogus_page(s) were replaced. Hence, when we were calculating the > check hash, we were partly reading from bogus_page, which means > we calculated a bad check hash (e.g., because multiple pages have > been mapped to bogus_page, so its contents are indeterminate). > > The second fix is to move the check-hash calculation from bufdone() > to bufdone_finish() after the call to vfs_vmio_iodone() so that it > computes the check hash over the correct set of pages. Does the b_iodone callback have a very similar potential problem? It is invoked in bufdone(), before bufdone_finish() and vfs_vmio_iodone(). One example b_iodone is the bdone() callback. It seems that b_iodone -> bdone() can then wake bwait()ers before any VMIO cached content has been filled in. I don't know that any specific consumers of b_iodone are currently broken, but it seems like maybe the b_iodone callback should really be in the later location as well. Best, Conrad ___ svn-src-head@freebsd.org mailing list https://lists.freebsd.org/mailman/listinfo/svn-src-head To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
svn commit: r328914 - in head/sys: kern ufs/ffs
Author: mckusick Date: Tue Feb 6 00:19:46 2018 New Revision: 328914 URL: https://svnweb.freebsd.org/changeset/base/328914 Log: Occasional cylinder-group check-hash errors were being reported on systems running with a heavy filesystem load. Tracking down this bug was elusive because there were actually two problems. Sometimes the in-memory check hash was wrong and sometimes the check hash computed when doing the read was wrong. The occurrence of either error caused a check-hash mismatch to be reported. The first error was that the check hash in the in-memory cylinder group was incorrect. This error was caused by the following sequence of events: - We read a cylinder-group buffer and the check hash is valid. - We update its cg_time and cg_old_time which makes the in-memory check-hash value invalid but we do not mark the cylinder group dirty. - We do not make any other changes to the cylinder group, so we never mark it dirty, thus do not write it out, and hence never update the incorrect check hash for the in-memory buffer. - Later, the buffer gets freed, but the page with the old incorrect check hash is still in the VM cache. - Later, we read the cylinder group again, and the first page with the old check hash is still in the VM cache, but some other pages are not, so we have to do a read. - The read does not actually get the first page from disk, but rather from the VM cache, resulting in the old check hash in the buffer. - The value computed after doing the read does not match causing the error to be printed. The fix for this problem is to only set cg_time and cg_old_time as the cylinder group is being written to disk. This keeps the in-memory check-hash valid unless the cylinder group has had other modifications which will require it to be written with a new check hash calculated. It also requires that the check hash be recalculated in the in-memory cylinder group when it is marked clean after doing a background write. The second problem was that the check hash computed at the end of the read was incorrect because the calculation of the check hash on completion of the read was being done too soon. - When a read completes we had the following sequence: - bufdone() -- b_ckhashcalc (calculates check hash) -- bufdone_finish() --- vfs_vmio_iodone() (replaces bogus pages with the cached ones) - When we are reading a buffer where one or more pages are already in memory (but not all pages, or we wouldn't be doing the read), the I/O is done with bogus_page mapped in for the pages that exist in the VM cache. This mapping is done to avoid corrupting the cached pages if there is any I/O overrun. The vfs_vmio_iodone() function is responsible for replacing the bogus_page(s) with the cached ones. But we were calculating the check hash before the bogus_page(s) were replaced. Hence, when we were calculating the check hash, we were partly reading from bogus_page, which means we calculated a bad check hash (e.g., because multiple pages have been mapped to bogus_page, so its contents are indeterminate). The second fix is to move the check-hash calculation from bufdone() to bufdone_finish() after the call to vfs_vmio_iodone() so that it computes the check hash over the correct set of pages. With these two changes, the occasional cylinder-group check-hash errors are gone. Submitted by: David PfitznerReviewed by: kib Tested by: David Pfitzner Modified: head/sys/kern/vfs_bio.c head/sys/ufs/ffs/ffs_alloc.c head/sys/ufs/ffs/ffs_vfsops.c Modified: head/sys/kern/vfs_bio.c == --- head/sys/kern/vfs_bio.c Tue Feb 6 00:02:30 2018(r328913) +++ head/sys/kern/vfs_bio.c Tue Feb 6 00:19:46 2018(r328914) @@ -4075,10 +4075,6 @@ bufdone(struct buf *bp) runningbufwakeup(bp); if (bp->b_iocmd == BIO_WRITE) dropobj = bp->b_bufobj; - else if ((bp->b_flags & B_CKHASH) != 0) { - KASSERT(buf_mapped(bp), ("biodone: bp %p not mapped", bp)); - (*bp->b_ckhashcalc)(bp); - } /* call optional completion function if requested */ if (bp->b_iodone != NULL) { biodone = bp->b_iodone; @@ -4114,6 +4110,13 @@ bufdone_finish(struct buf *bp) !(bp->b_ioflags & BIO_ERROR)) bp->b_flags |= B_CACHE; vfs_vmio_iodone(bp); + } + if ((bp->b_flags & B_CKHASH) != 0) { + KASSERT(bp->b_iocmd == BIO_READ, + ("bufdone_finish: b_iocmd %d not BIO_READ", bp->b_iocmd)); + KASSERT(buf_mapped(bp), + ("bufdone_finish: bp %p not mapped", bp)); + (*bp->b_ckhashcalc)(bp); } /* Modified: