[PATCH 1/4] fs: allow short direct-io reads to be completed via buffered IO V2

2010-05-12 Thread Josef Bacik
V1-V2: Check to see if our current ppos is = i_size after a short DIO read,
just in case it was actually a short read and we need to just return.

This is similar to what already happens in the write case.  If we have a short
read while doing O_DIRECT, instead of just returning, fallthrough and try to
read the rest via buffered IO.  BTRFS needs this because if we encounter a
compressed or inline extent during DIO, we need to fallback on buffered.  If the
extent is compressed we need to read the entire thing into memory and
de-compress it into the users pages.  I have tested this with fsx and everything
works great.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 mm/filemap.c |   36 +++-
 1 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 140ebda..829ac9c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1263,7 +1263,7 @@ generic_file_aio_read(struct kiocb *iocb, const struct 
iovec *iov,
 {
struct file *filp = iocb-ki_filp;
ssize_t retval;
-   unsigned long seg;
+   unsigned long seg = 0;
size_t count;
loff_t *ppos = iocb-ki_pos;
 
@@ -1290,21 +1290,47 @@ generic_file_aio_read(struct kiocb *iocb, const struct 
iovec *iov,
retval = mapping-a_ops-direct_IO(READ, iocb,
iov, pos, nr_segs);
}
-   if (retval  0)
+   if (retval  0) {
*ppos = pos + retval;
-   if (retval) {
+   count -= retval;
+   }
+
+   /*
+* Btrfs can have a short DIO read if we encounter
+* compressed extents, so if there was an error, or if
+* we've already read everything we wanted to, or if
+* there was a short read because we hit EOF, go ahead
+* and return.  Otherwise fallthrough to buffered io for
+* the rest of the read.
+*/
+   if (retval  0 || !count || *ppos = size) {
file_accessed(filp);
goto out;
}
}
}
 
+   count = retval;
for (seg = 0; seg  nr_segs; seg++) {
read_descriptor_t desc;
+   loff_t offset = 0;
+
+   /*
+* If we did a short DIO read we need to skip the section of the
+* iov that we've already read data into.
+*/
+   if (count) {
+   if (count  iov[seg].iov_len) {
+   count -= iov[seg].iov_len;
+   continue;
+   }
+   offset = count;
+   count = 0;
+   }
 
desc.written = 0;
-   desc.arg.buf = iov[seg].iov_base;
-   desc.count = iov[seg].iov_len;
+   desc.arg.buf = iov[seg].iov_base + offset;
+   desc.count = iov[seg].iov_len - offset;
if (desc.count == 0)
continue;
desc.error = 0;
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] fs: kill blockdev_direct_IO_no_locking

2010-05-12 Thread Josef Bacik
Christoph said he'd rather everybody use __blockdev_direct_IO directly instead
of having a bunch of random helper functions, so thats what this patch does.
It's a basic change, I've tested it with xfstests on ext4 and xfs.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/block_dev.c  |5 +++--
 fs/ext4/inode.c |8 
 fs/gfs2/aops.c  |6 +++---
 fs/ocfs2/aops.c |8 +++-
 fs/xfs/linux-2.6/xfs_aops.c |7 +++
 include/linux/fs.h  |9 -
 6 files changed, 16 insertions(+), 27 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 6dcee88..0f42cbc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -172,8 +172,9 @@ blkdev_direct_IO(int rw, struct kiocb *iocb, const struct 
iovec *iov,
struct file *file = iocb-ki_filp;
struct inode *inode = file-f_mapping-host;
 
-   return blockdev_direct_IO_no_locking(rw, iocb, inode, I_BDEV(inode),
-   iov, offset, nr_segs, blkdev_get_blocks, NULL);
+   return __blockdev_direct_IO(rw, iocb, inode, I_BDEV(inode), iov,
+   offset, nr_segs, blkdev_get_blocks, NULL,
+   NULL, 0);
 }
 
 int __sync_blockdev(struct block_device *bdev, int wait)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 81d6054..8f37762 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3494,10 +3494,10 @@ static ssize_t ext4_ind_direct_IO(int rw, struct kiocb 
*iocb,
 
 retry:
if (rw == READ  ext4_should_dioread_nolock(inode))
-   ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-inode-i_sb-s_bdev, iov,
-offset, nr_segs,
-ext4_get_block, NULL);
+   ret = __blockdev_direct_IO(rw, iocb, inode,
+  inode-i_sb-s_bdev, iov, offset,
+  nr_segs, ext4_get_block, NULL, NULL,
+  0);
else
ret = blockdev_direct_IO(rw, iocb, inode,
 inode-i_sb-s_bdev, iov,
diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 0c1d0b8..45b23b0 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -1039,9 +1039,9 @@ static ssize_t gfs2_direct_IO(int rw, struct kiocb *iocb,
if (rv != 1)
goto out; /* dio not valid, fall back to buffered i/o */
 
-   rv = blockdev_direct_IO_no_locking(rw, iocb, inode, inode-i_sb-s_bdev,
-  iov, offset, nr_segs,
-  gfs2_get_block_direct, NULL);
+   rv = __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev,
+ iov, offset, nr_segs, gfs2_get_block_direct,
+ NULL, NULL, 0);
 out:
gfs2_glock_dq_m(1, gh);
gfs2_holder_uninit(gh);
diff --git a/fs/ocfs2/aops.c b/fs/ocfs2/aops.c
index 21441dd..f2e53a9 100644
--- a/fs/ocfs2/aops.c
+++ b/fs/ocfs2/aops.c
@@ -669,11 +669,9 @@ static ssize_t ocfs2_direct_IO(int rw,
if (i_size_read(inode) = offset)
return 0;
 
-   ret = blockdev_direct_IO_no_locking(rw, iocb, inode,
-   inode-i_sb-s_bdev, iov, offset,
-   nr_segs,
-   ocfs2_direct_IO_get_blocks,
-   ocfs2_dio_end_io);
+   ret = __blockdev_direct_IO(rw, iocb, inode, inode-i_sb-s_bdev, iov,
+  offset, nr_segs, ocfs2_direct_IO_get_blocks,
+  ocfs2_dio_end_io, NULL, 0);
 
mlog_exit(ret);
return ret;
diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index 0f8b996..fcc14d8 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -1617,10 +1617,9 @@ xfs_vm_direct_IO(
iocb-private = xfs_alloc_ioend(inode, rw == WRITE ?
IOMAP_UNWRITTEN : IOMAP_READ);
 
-   ret = blockdev_direct_IO_no_locking(rw, iocb, inode, bdev, iov,
-   offset, nr_segs,
-   xfs_get_blocks_direct,
-   xfs_end_io_direct);
+   ret = __blockdev_direct_IO(rw, iocb, inode, bdev, iov, offset, nr_segs,
+  xfs_get_blocks_direct, xfs_end_io_direct,
+  NULL, 0);
 
if (unlikely(ret != -EIOCBQUEUED  iocb-private))
xfs_destroy_ioend(iocb-private);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 10704f0..d841e7d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2277,15 +2277,6 @@ static inline ssize_t blockdev_direct_IO(int rw, struct 
kiocb *iocb,
   

[PATCH 4/4] Btrfs: add basic DIO read/write support V3

2010-05-12 Thread Josef Bacik
V1-V2
-Use __blockdev_direct_IO instead of helper
-Use KM_IRQ0 for kmap instead of KM_USER0

V2-V3
-Update the submit function to work with my submit hook changes
-Add DIO write support

This provides basic DIO support for reads only.  It does not do any of the work
to recover from mismatching checksums, that will come later.  A few design
changes have been made from Jim's code (sorry Jim!)

1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
code in order to account for all of BTRFS's oddities, but thanks to that work it
seems like the best bet is to just ignore compression and such and just opt to
fallback on buffered IO.

2) Fallback on buffered IO for compressed or inline extents.  Jim's code did
it's own buffering to make dio with compressed extents work.  Now we just
fallback onto normal buffered IO.

3) Lock the entire range during DIO.  I originally had it so we would lock the
extents as get_block was called, and then unlock them as the endio function was
called, which worked great, but if we ever had an error in the submit_io hook,
we could have locked an extent that would never be submitted for IO, so we
wouldn't be able to unlock it, so this solution fixed that problem and made it a
bit cleaner.

I've tested this with fsx and everything works great.  This patch depends on my
dio and filemap.c patches to work.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h |7 +
 fs/btrfs/file-item.c |  258 +++-
 fs/btrfs/file.c  |   69 -
 fs/btrfs/inode.c |  414 +++---
 4 files changed, 716 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 746a724..8f0c202 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2257,6 +2257,8 @@ int btrfs_del_csums(struct btrfs_trans_handle *trans,
struct btrfs_root *root, u64 bytenr, u64 len);
 int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
  struct bio *bio, u32 *dst);
+int btrfs_lookup_bio_sums_dio(struct btrfs_root *root, struct inode *inode,
+ struct bio *bio, u64 logical_offset, u32 *dst);
 int btrfs_insert_file_extent(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 u64 objectid, u64 pos,
@@ -2270,8 +2272,13 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
*trans,
 int btrfs_csum_file_blocks(struct btrfs_trans_handle *trans,
   struct btrfs_root *root,
   struct btrfs_ordered_sum *sums);
+int btrfs_csum_file_blocks_dio(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root, u64 disk_bytenr,
+  u64 bytes, u32 *csums);
 int btrfs_csum_one_bio(struct btrfs_root *root, struct inode *inode,
   struct bio *bio, u64 file_start, int contig);
+int btrfs_csum_one_bio_dio(struct btrfs_root *root, struct inode *inode,
+  struct bio *bio, u32 *csums);
 int btrfs_csum_file_bytes(struct btrfs_root *root, struct inode *inode,
  u64 start, unsigned long len);
 struct btrfs_csum_item *btrfs_lookup_csum(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 54a2550..a3035f7 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -149,13 +149,14 @@ int btrfs_lookup_file_extent(struct btrfs_trans_handle 
*trans,
 }
 
 
-int btrfs_lookup_bio_sums(struct btrfs_root *root, struct inode *inode,
- struct bio *bio, u32 *dst)
+static int __btrfs_lookup_bio_sums(struct btrfs_root *root,
+  struct inode *inode, struct bio *bio,
+  u64 logical_offset, u32 *dst, int dio)
 {
u32 sum;
struct bio_vec *bvec = bio-bi_io_vec;
int bio_index = 0;
-   u64 offset;
+   u64 offset = 0;
u64 item_start_offset = 0;
u64 item_last_offset = 0;
u64 disk_bytenr;
@@ -174,8 +175,11 @@ int btrfs_lookup_bio_sums(struct btrfs_root *root, struct 
inode *inode,
WARN_ON(bio-bi_vcnt = 0);
 
disk_bytenr = (u64)bio-bi_sector  9;
+   if (dio)
+   offset = logical_offset;
while (bio_index  bio-bi_vcnt) {
-   offset = page_offset(bvec-bv_page) + bvec-bv_offset;
+   if (!dio)
+   offset = page_offset(bvec-bv_page) + bvec-bv_offset;
ret = btrfs_find_ordered_sum(inode, offset, disk_bytenr, sum);
if (ret == 0)
goto found;
@@ -238,6 +242,7 @@ found:
else
set_state_private(io_tree, offset, sum);
disk_bytenr += bvec-bv_len;
+   offset += bvec-bv_len;
bio_index++;
bvec++;
}
@@ 

Re: Disk space accounting and subvolume delete

2010-05-12 Thread Mike Fleetwood
On 12 May 2010 06:02, Yan, Zheng yanzh...@21cn.com wrote:
 On Tue, May 11, 2010 at 11:45 PM, Bruce Guenter br...@untroubled.org wrote:
 On Tue, May 11, 2010 at 08:10:38AM +0800, Yan, Zheng  wrote:
 This is because the snapshot deleting ioctl only removes the a link.

 Right, I understand that.  That part is not unexpected, as it works just
 like unlink would.  However...

 The corresponding tree is dropped in the background by a kernel thread.

 The surprise is that 'sync', in any form I was able to try, does not
 wait until all or even most of the I/O is completed.  Apparently the
 standards spec for sync(2) says it is not required to wait for I/O to
 complete, but AFAIK all other Linux FS do wait (the man page for sync(2)
 implies as much, as does the info page for sync in glibc).

 The only way I've found so far to force this behavior is to unmount, and
 that's rather intrusive to other users of the FS.

 We could probably add another ioctl that waits until the tree has been
 completely dropped.

 Since the expected behavior for sync is to wait until all pending I/O
 has been completed, I would argue this should be the default action for
 sync.  Am I misunderstanding something?


 Dropping a tree can be lengthy. It's not good to let sync wait for hours.
 For most linux FS, 'sync' just force an transaction/journal commit. I don't
 think they wait for large operations that can span multiple transactions to
 complete.

Disclaimer: I know nothing about the internals of Btrfs!

I have an analogy as a way to thinking about what deleting a snapshot
entails (which I hope isn't totally bogus).

Deleting a clone of a file system is not like unlinking a single file.
 It is analogous to deleting a directory tree.  Syncing in the middle
of a recursive delete will wait for the in flight I/O to complete, but
it would not wait for the unlink requests from the portion of the
directory tree not yet traversed.  The same would be true when the
kernel thread deletes the snapshot by recursing through it's tree.

Mike
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html