[PATCH 1/4] fs: allow short direct-io reads to be completed via buffered IO V2
V1->V2: Check to see if our current ppos is >= i_size after a short DIO
read, just in case it was actually a short read and we need to just
return.  This is similar to what already happens in the write case.

If we have a short read while doing O_DIRECT, instead of just returning,
fall through and try to read the rest via buffered IO.  BTRFS needs this
because if we encounter a compressed or inline extent during DIO, we need
to fall back on buffered IO.  If the extent is compressed we need to read
the entire thing into memory and de-compress it into the user's pages.

I have tested this with fsx and everything works great.

Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 mm/filemap.c |   36 ++++++++++++++++++++++++++++++-----
 1 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 140ebda..829ac9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1263,7 +1263,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 {
 	struct file *filp = iocb->ki_filp;
 	ssize_t retval;
-	unsigned long seg;
+	unsigned long seg = 0;
 	size_t count;
 	loff_t *ppos = &iocb->ki_pos;
 
@@ -1290,21 +1290,47 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,
 			retval = mapping->a_ops->direct_IO(READ, iocb,
 							iov, pos, nr_segs);
 		}
-		if (retval > 0)
+		if (retval > 0) {
 			*ppos = pos + retval;
-		if (retval) {
+			count -= retval;
+		}
+
+		/*
+		 * Btrfs can have a short DIO read if we encounter
+		 * compressed extents, so if there was an error, or if
+		 * we've already read everything we wanted to, or if
+		 * there was a short read because we hit EOF, go ahead
+		 * and return.  Otherwise fallthrough to buffered io for
+		 * the rest of the read.
+		 */
+		if (retval < 0 || !count || *ppos >= size) {
 			file_accessed(filp);
 			goto out;
 		}
 	}
+	count = retval;
 
 	for (seg = 0; seg < nr_segs; seg++) {
 		read_descriptor_t desc;
+		loff_t offset = 0;
+
+		/*
+		 * If we did a short DIO read we need to skip the section of the
+		 * iov that we've already read data into.
+		 */
+		if (count) {
+			if (count > iov[seg].iov_len) {
+				count -= iov[seg].iov_len;
+				continue;
+			}
+			offset = count;
+			count = 0;
+		}
 
 		desc.written = 0;
-		desc.arg.buf = iov[seg].iov_base;
-		desc.count = iov[seg].iov_len;
+		desc.arg.buf = iov[seg].iov_base + offset;
+		desc.count = iov[seg].iov_len - offset;
 		if (desc.count == 0)
 			continue;
 		desc.error = 0;
-- 
1.6.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
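For readers who want to poke at the fallback logic outside the kernel, the iov-skipping loop in the second hunk can be modeled as a small userspace helper. The struct and function names below are invented for illustration; only the skip/resume arithmetic mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the iovec segment lengths. */
struct seg {
	size_t len;
};

/*
 * Model of the patch's loop: "already" is how many bytes a short
 * O_DIRECT read consumed.  Walk the segments, skipping fully-consumed
 * ones, and report where buffered IO should resume.  Like the patch's
 * strict "count > iov_len" test, a segment consumed exactly to its end
 * yields offset == len there (desc.count becomes 0 and the kernel loop
 * simply continues to the next segment).
 */
static size_t resume_after_short_dio(const struct seg *iov, size_t nr_segs,
				     size_t already, size_t *offset)
{
	size_t i;

	for (i = 0; i < nr_segs; i++) {
		if (already > iov[i].len) {	/* whole segment already read */
			already -= iov[i].len;
			continue;
		}
		*offset = already;		/* resume inside this segment */
		return i;
	}
	*offset = 0;
	return nr_segs;				/* everything was read by DIO */
}
```

For example, with three 4096-byte segments and a short DIO read of 5000 bytes, buffered IO resumes at segment 1, offset 904.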
[PATCH 3/4] fs: kill blockdev_direct_IO_no_locking
Christoph said he'd rather everybody use __blockdev_direct_IO directly
instead of having a bunch of random helper functions, so that's what this
patch does.  It's a basic change; I've tested it with xfstests on ext4
and xfs.

Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 fs/block_dev.c              |    5 +++--
 fs/ext4/inode.c             |    8 ++++----
 fs/gfs2/aops.c              |    6 +++---
 fs/ocfs2/aops.c             |    8 +++-----
 fs/xfs/linux-2.6/xfs_aops.c |    7 +++----
 include/linux/fs.h          |    9 ---------
 6 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6dcee88..0f42cbc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -172,8 +172,9 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
 
-	return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
-				iov, offset, nr_segs, blkdev_get_blocks, NULL);
+	return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov,
+				    offset, nr_segs, blkdev_get_blocks, NULL,
+				    NULL, 0);
 }
 
 int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 81d6054..8f37762 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3494,10 +3494,10 @@ static ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode))
-		ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-				 inode->i_sb->s_bdev, iov,
-				 offset, nr_segs,
-				 ext4_get_block, NULL);
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+					   inode->i_sb->s_bdev, iov, offset,
+					   nr_segs, ext4_get_block, NULL, NULL,
+					   0);
 	else
 		ret = blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0c1d0b8..45b23b0 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1039,9 +1039,9 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
 	if (rv != 1)
 		goto out; /* dio not valid, fall back to buffered i/o */
 
-	rv = blockdev_direct_IO_no_locking(rw, iocb, inode, inode->i_sb->s_bdev,
-					   iov, offset, nr_segs,
-					   gfs2_get_block_direct, NULL);
+	rv = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
+				  iov, offset, nr_segs, gfs2_get_block_direct,
+				  NULL, NULL, 0);
 out:
 	gfs2_glock_dq_m(1, &gh);
 	gfs2_holder_uninit(&gh);
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 21441dd..f2e53a9 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -669,11 +669,9 @@ static ssize_t ocfs2_direct_IO(int rw,
 	if (i_size_read(inode) <= offset)
 		return 0;
 
-	ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-					    inode->i_sb->s_bdev, iov, offset,
-					    nr_segs,
-					    ocfs2_direct_IO_get_blocks,
-					    ocfs2_dio_end_io);
+	ret = __blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev, iov,
+				   offset, nr_segs, ocfs2_direct_IO_get_blocks,
+				   ocfs2_dio_end_io, NULL, 0);
 
 	mlog_exit(ret);
 	return ret;
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 0f8b996..fcc14d8 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1617,10 +1617,9 @@ xfs_vm_direct_IO(
 	iocb->private = xfs_alloc_ioend(inode, rw == WRITE ?
 					IOMAP_UNWRITTEN : IOMAP_READ);
 
-	ret = blockdev_direct_IO_no_locking(rw, iocb, inode, bdev, iov,
-					    offset, nr_segs,
-					    xfs_get_blocks_direct,
-					    xfs_end_io_direct);
+	ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs,
+				   xfs_get_blocks_direct, xfs_end_io_direct,
+				   NULL, 0);
 	if (unlikely(ret != -EIOCBQUEUED && iocb->private))
 		xfs_destroy_ioend(iocb->private);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 10704f0..d841e7d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2277,15 +2277,6 @@ static inline ssize_t blockdev_direct_IO(int rw, struct kiocb *iocb,
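The conversion is mechanical: the dropped helper was only a thin wrapper that supplied a NULL submit hook and zero flags, so each caller now spells those defaults out. A minimal sketch of that pattern, with invented names and a toy body (this is not the kernel signature):

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for the submit-io hook that the general entry
 * point (modeling __blockdev_direct_IO) takes as a trailing parameter. */
typedef long (*submit_hook_t)(int rw);

/* One canonical entry point: optional hook plus a flags word. */
static long do_direct_io(int rw, submit_hook_t submit_io, unsigned int flags)
{
	if (submit_io)			/* filesystem-specific submission */
		return submit_io(rw);
	return (long)flags + rw;	/* toy stand-in for the generic path */
}

/* The removed helper amounted to nothing more than this wrapper,
 * which every call site in the patch now inlines by hand: */
static long do_direct_io_no_locking(int rw)
{
	return do_direct_io(rw, /*submit_io*/ NULL, /*flags*/ 0);
}
```

Killing the wrapper trades a little verbosity at call sites for a single, explicit entry point whose extra parameters are visible wherever they matter.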
[PATCH 4/4] Btrfs: add basic DIO read/write support V3
V1->V2:
- Use __blockdev_direct_IO instead of the helper
- Use KM_IRQ0 for kmap instead of KM_USER0

V2->V3:
- Update the submit function to work with my submit hook changes
- Add DIO write support

This provides basic DIO support for reads only.  It does not do any of
the work to recover from mismatching checksums; that will come later.  A
few design changes have been made from Jim's code (sorry Jim!):

1) Use the generic direct-io code.  Jim originally re-wrote all the
generic DIO code in order to account for all of BTRFS's oddities, but
thanks to that work it seems like the best bet is to just ignore
compression and such and just opt to fall back on buffered IO.

2) Fall back on buffered IO for compressed or inline extents.  Jim's
code did its own buffering to make dio with compressed extents work.
Now we just fall back onto normal buffered IO.

3) Lock the entire range during DIO.  I originally had it so we would
lock the extents as get_block was called, and then unlock them as the
endio function was called, which worked great, but if we ever had an
error in the submit_io hook, we could have locked an extent that would
never be submitted for IO, so we wouldn't be able to unlock it.  This
solution fixed that problem and made it a bit cleaner.

I've tested this with fsx and everything works great.  This patch
depends on my dio and filemap.c patches to work.
Thanks,

Signed-off-by: Josef Bacik <jo...@redhat.com>
---
 fs/btrfs/ctree.h     |    7 +
 fs/btrfs/file-item.c |  258 +++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/file.c      |   69 ++++++++++-
 fs/btrfs/inode.c     |  414 +++++++++++++++++++++++++++++++++++---------
 4 files changed, 716 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 746a724..8f0c202 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2257,6 +2257,8 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
 		    struct btrfs_root *root, u64 bytenr, u64 len);
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 			  struct bio *bio, u32 *dst);
+int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
+			      struct bio *bio, u64 logical_offset, u32 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root,
 			     u64 objectid, u64 pos,
@@ -2270,8 +2272,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root,
 			   struct btrfs_ordered_sum *sums);
+int btrfs_csum_file_blocks_dio(struct btrfs_trans_handle *trans,
+			       struct btrfs_root *root, u64 disk_bytenr,
+			       u64 bytes, u32 *csums);
 int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
 		       struct bio *bio, u64 file_start, int contig);
+int btrfs_csum_one_bio_dio(struct btrfs_root *root, struct inode *inode,
+			   struct bio *bio, u32 *csums);
 int btrfs_csum_file_bytes(struct btrfs_root *root, struct inode *inode,
 			  u64 start, unsigned long len);
 struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 54a2550..a3035f7 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -149,13 +149,14 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle *trans,
 }
 
-int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
-			  struct bio *bio, u32 *dst)
+static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
+				   struct inode *inode, struct bio *bio,
+				   u64 logical_offset, u32 *dst, int dio)
 {
 	u32 sum;
 	struct bio_vec *bvec = bio->bi_io_vec;
 	int bio_index = 0;
-	u64 offset;
+	u64 offset = 0;
 	u64 item_start_offset = 0;
 	u64 item_last_offset = 0;
 	u64 disk_bytenr;
@@ -174,8 +175,11 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
 	WARN_ON(bio->bi_vcnt <= 0);
 	disk_bytenr = (u64)bio->bi_sector << 9;
+	if (dio)
+		offset = logical_offset;
 	while (bio_index < bio->bi_vcnt) {
-		offset = page_offset(bvec->bv_page) + bvec->bv_offset;
+		if (!dio)
+			offset = page_offset(bvec->bv_page) + bvec->bv_offset;
 		ret = btrfs_find_ordered_sum(inode, offset, disk_bytenr, &sum);
 		if (ret == 0)
 			goto found;
@@ -238,6 +242,7 @@ found:
 	else
 		set_state_private(io_tree, offset, sum);
 	disk_bytenr += bvec->bv_len;
+	offset += bvec->bv_len;
 	bio_index++;
 	bvec++;
 }
@@
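The heart of the file-item.c change is the offset bookkeeping: a buffered bio derives the file offset from each page-cache page, but a DIO bio has no such pages, so the caller's logical_offset is advanced by each bio_vec's length instead. That accounting can be modeled standalone (hypothetical helper, not the kernel function):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the DIO branch in __btrfs_lookup_bio_sums(): the file offset
 * used to look up each bio_vec's checksum starts at logical_offset and
 * advances by that vec's length, i.e. the "+ offset += bvec->bv_len;"
 * line in the patch.
 */
static void dio_csum_offsets(const uint32_t *bv_len, int nr_vecs,
			     uint64_t logical_offset, uint64_t *out)
{
	uint64_t offset = logical_offset;
	int i;

	for (i = 0; i < nr_vecs; i++) {
		out[i] = offset;	/* offset for this vec's csum lookup */
		offset += bv_len[i];
	}
}
```

For example, a bio with vec lengths {4096, 4096, 8192} at logical offset 16384 looks up checksums at file offsets 16384, 20480, and 24576.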
Re: Disk space accounting and subvolume delete
On 12 May 2010 06:02, Yan, Zheng <yanzh...@21cn.com> wrote:
> On Tue, May 11, 2010 at 11:45 PM, Bruce Guenter <br...@untroubled.org> wrote:
>> On Tue, May 11, 2010 at 08:10:38AM +0800, Yan, Zheng wrote:
>>> This is because the snapshot deleting ioctl only removes the link.
>>
>> Right, I understand that.  That part is not unexpected, as it works
>> just like unlink would.  However...
>>
>>> The corresponding tree is dropped in the background by a kernel
>>> thread.
>>
>> The surprise is that 'sync', in any form I was able to try, does not
>> wait until all or even most of the I/O is completed.  Apparently the
>> standards spec for sync(2) says it is not required to wait for I/O to
>> complete, but AFAIK all other Linux FSes do wait (the man page for
>> sync(2) implies as much, as does the info page for sync in glibc).
>> The only way I've found so far to force this behavior is to unmount,
>> and that's rather intrusive to other users of the FS.
>>
>>> We could probably add another ioctl that waits until the tree has
>>> been completely dropped.
>>
>> Since the expected behavior for sync is to wait until all pending I/O
>> has been completed, I would argue this should be the default action
>> for sync.  Am I misunderstanding something?
>
> Dropping a tree can be lengthy.  It's not good to let sync wait for
> hours.  For most Linux FSes, 'sync' just forces a transaction/journal
> commit.  I don't think they wait for large operations that can span
> multiple transactions to complete.

Disclaimer: I know nothing about the internals of Btrfs!  I have an
analogy as a way of thinking about what deleting a snapshot entails
(which I hope isn't totally bogus).

Deleting a clone of a file system is not like unlinking a single file.
It is analogous to deleting a directory tree.  Syncing in the middle of
a recursive delete will wait for the in-flight I/O to complete, but it
would not wait for the unlink requests from the portion of the directory
tree not yet traversed.  The same would be true when the kernel thread
deletes the snapshot by recursing through its tree.

Mike