Hi,

I've discovered that short reads (nonzero b_resid) aren't
handled very well in our kernel and I've proposed a diff
like this to handle short reads of buffercache read-ahead
buffers:

diff --git sys/kern/vfs_bio.c sys/kern/vfs_bio.c
index 95bc80bc0e6..1cc1943d752 100644
--- sys/kern/vfs_bio.c
+++ sys/kern/vfs_bio.c
@@ -534,11 +534,27 @@ bread_cluster_callback(struct buf *bp)
                 */
                buf_fix_mapping(bp, newsize);
                bp->b_bcount = newsize;
        }
 
-       for (i = 1; xbpp[i] != 0; i++) {
+       /* Invalidate read-ahead buffers if read short */
+       if (bp->b_resid > 0) {
+               for (i = 0; xbpp[i] != NULL; i++)
+                       continue;
+               for (i = i - 1; i != 0; i--) {
+                       if (xbpp[i]->b_bufsize <= bp->b_resid) {
+                               bp->b_resid -= xbpp[i]->b_bufsize;
+                               SET(xbpp[i]->b_flags, B_INVAL);
+                       } else if (bp->b_resid > 0) {
+                               bp->b_resid = 0;
+                               SET(xbpp[i]->b_flags, B_INVAL);
+                       } else
+                               break;
+               }
+       }
+
+       for (i = 1; xbpp[i] != NULL; i++) {
                if (ISSET(bp->b_flags, B_ERROR))
                        SET(xbpp[i]->b_flags, B_INVAL | B_ERROR);
                biodone(xbpp[i]);
        }
 

Now I said before that the only issue that this diff didn't
fix was with the xbpp[0] aka the buf we return to FFS: if we
have a 64k sized cluster on our filesystem then we've never
created read-ahead bufs and thus this code never runs and we
never account for the b_resid.  However, this is thankfully
not correct as FFS handles short reads itself (except one
small detail...). Here's a chunk from ffs_read:

        if (lblktosize(fs, nextlbn) >= DIP(ip, size))
                error = bread(vp, lbn, size, &bp);
        else
                error = bread_cluster(vp, lbn, size, &bp);

        if (error)
                break;

        /*
         * We should only get non-zero b_resid when an I/O error
         * has occurred, which should cause us to break above.
         * However, if the short read did not cause an error,
         * then we want to ensure that we do not uiomove bad
         * or uninitialized data.
         */
        size -= bp->b_resid;
        if (size < xfersize) {
                if (size == 0)
                        break;
                xfersize = size;
        }
        error = uiomove(bp->b_data + blkoffset, xfersize, uio);

As you can see it copies (size - bp->b_resid) into the uio.
That would be OK if b_resid was as large as the 'size'. But
due to how bread_cluster extends the b_count to cover for
all additional read-ahead buffers, the transfer in the end
can have a b_resid anywhere in the interval of [0, MAXPHYS]
which can be larger than 'size' that FFS has asked for.

This leads to 'size' underflow because it's an integer and
then uiomove gets a negative value for xfersize which gets
converted to a very large unsigned long (size_t) parameter
for uiomove. And this is bad.  Therefore, additionally I'd
like to assert this in the FFS code itself.  If this is the
way to go, I'll look into other filesystems and propose a
similar check.

diff --git sys/ufs/ffs/ffs_vnops.c sys/ufs/ffs/ffs_vnops.c
index 160e187820f..56c222612a2 100644
--- sys/ufs/ffs/ffs_vnops.c
+++ sys/ufs/ffs/ffs_vnops.c
@@ -244,10 +244,11 @@ ffs_read(void *v)
                 * has occurred, which should cause us to break above.
                 * However, if the short read did not cause an error,
                 * then we want to ensure that we do not uiomove bad
                 * or uninitialized data.
                 */
+               KASSERT(bp->b_resid <= size);
                size -= bp->b_resid;
                if (size < xfersize) {
                        if (size == 0)
                                break;
                        xfersize = size;


So to make it clear: I'd like to commit both changes and
if that's something we agree upon, I'll look into other
filesystems and make sure that they implement similar
assertions.

Opinions?

Reply via email to