[RFC PATCH] Btrfs: do not flush csum items of unchanged file data during treelog

2011-04-21 Thread liubo

The current code relogs the entire inode every time during fsync log,
and it is much better suited to small files rather than large ones.

During my performance test, the fsync performace of large files sucks,
and we can ascribe this to the tremendous amount of csum infos of the
large ones, cause we have to flush all of these csum infos into log trees
even when there are only _one_ change in the whole file data.  Apparently,
to optimize fsync, we need to create a filter to skip the unnecessary csum
ones, that is, the corresponding file data remains unchanged before this fsync.

Here I have some test results to show, I use sysbench to do random write + 
fsync.

Sysbench args:
  - Number of threads: 1
  - Extra file open flags: 0
  - 2 files, 4Gb each
  - Block size 4Kb
  - Number of random requests for random IO: 1
  - Read/Write ratio for combined random IO test: 1.50
  - Periodic FSYNC enabled, calling fsync() each 100 requests.
  - Calling fsync() at the end of test, Enabled.
  - Using synchronous I/O mode
  - Doing random write test

Sysbench results:
===
   Operations performed:  0 Read, 1 Write, 200 Other = 10200 Total
   Read 0b  Written 39.062Mb  Total transferred 39.062Mb
===
a) without patch:  (*SPEED* : 451.01Kb/sec)
   112.75 Requests/sec executed

b) with patch: (*SPEED* : 5.1537Mb/sec)
   1319.34 Requests/sec executed

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/ctree.h|   14 --
 fs/btrfs/inode.c|1 +
 fs/btrfs/tree-log.c |   31 +--
 3 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2e61fe1..300bea0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -642,6 +642,12 @@ struct btrfs_root_ref {
 #define BTRFS_FILE_EXTENT_REG 1
 #define BTRFS_FILE_EXTENT_PREALLOC 2
 
+/*
+ * used to indicate that this file extent has just been changed and
+ * its csums need to be updated when fsync tries to log this inode.
+ */
+#define BTRFS_FILE_EXTENT_CSUM_UPTODATE(1  0)
+
 struct btrfs_file_extent_item {
/*
 * transaction id that created this extent
@@ -665,7 +671,9 @@ struct btrfs_file_extent_item {
 */
u8 compression;
u8 encryption;
-   __le16 other_encoding; /* spare for later use */
+   u8 other_encoding; /* spare for later use */
+
+   u8 flag;
 
/* are we inline data or a real extent? */
u8 type;
@@ -2026,7 +2034,9 @@ BTRFS_SETGET_FUNCS(file_extent_compression, struct 
btrfs_file_extent_item,
 BTRFS_SETGET_FUNCS(file_extent_encryption, struct btrfs_file_extent_item,
   encryption, 8);
 BTRFS_SETGET_FUNCS(file_extent_other_encoding, struct btrfs_file_extent_item,
-  other_encoding, 16);
+  other_encoding, 8);
+BTRFS_SETGET_FUNCS(file_extent_flag, struct btrfs_file_extent_item,
+  flag, 8);
 
 /* this returns the number of file bytes represented by the inline item.
  * If an item is compressed, this is the uncompressed size
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a4157cf..ed4e318 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1660,6 +1660,7 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
btrfs_set_file_extent_compression(leaf, fi, compression);
btrfs_set_file_extent_encryption(leaf, fi, encryption);
btrfs_set_file_extent_other_encoding(leaf, fi, other_encoding);
+   btrfs_set_file_extent_flag(leaf, fi, BTRFS_FILE_EXTENT_CSUM_UPTODATE);
 
btrfs_unlock_up_safe(path, 1);
btrfs_set_lock_blocking(leaf);
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index c50271a..baa4a0a 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2591,11 +2591,24 @@ static int drop_objectid_items(struct 
btrfs_trans_handle *trans,
return ret;
 }
 
+static inline int need_csum(struct extent_buffer *src,
+   struct btrfs_file_extent_item *fi,
+   u64 gen, int csum)
+{
+   if (csum 
+   (btrfs_file_extent_generation(src, fi) == gen) 
+   (btrfs_file_extent_flag(src, fi)  BTRFS_FILE_EXTENT_CSUM_UPTODATE))
+   return 1;
+
+   return 0;
+}
+
+
 static noinline int copy_items(struct btrfs_trans_handle *trans,
   struct btrfs_root *log,
   struct btrfs_path *dst_path,
   struct extent_buffer *src,
-  int start_slot, int nr, int inode_only)
+  int start_slot, int nr, int inode_only, int csum)
 {
unsigned long src_offset;
unsigned long dst_offset;
@@ -2653,6 +2666,7 @@ static noinline int copy_items(struct btrfs_trans_handle 
*trans,
btrfs_set_inode_generation(dst_path-nodes[0],
   inode_item, 0);
}
+
/* 

[PATCH] Btrfs: remove trans and root argument from fixup_low_keys()

2011-04-21 Thread Tsutomu Itoh
'trans' and 'root' argument not used in fixup_low_keys() are deleted.
And, the argument that became unnecessary is deleted about the caller
of fixup_low_keys().

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/ctree.c   |   49 ---
 fs/btrfs/ctree.h   |6 +---
 fs/btrfs/dir-item.c|2 +-
 fs/btrfs/extent-tree.c |9 +++
 fs/btrfs/file-item.c   |   11 -
 fs/btrfs/file.c|6 ++--
 fs/btrfs/inode-item.c  |3 +-
 fs/btrfs/inode.c   |2 +-
 fs/btrfs/tree-log.c|2 +-
 9 files changed, 39 insertions(+), 51 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 84d7ca1..d080042 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -36,10 +36,10 @@ static int balance_node_right(struct btrfs_trans_handle 
*trans,
  struct btrfs_root *root,
  struct extent_buffer *dst_buf,
  struct extent_buffer *src_buf);
-static int del_ptr(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+static int del_ptr(struct btrfs_root *root,
   struct btrfs_path *path, int level, int slot);
-static int setup_items_for_insert(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root, struct btrfs_path *path,
+static int setup_items_for_insert(struct btrfs_root *root,
+   struct btrfs_path *path,
struct btrfs_key *cpu_key, u32 *data_size,
u32 total_data, u32 total_size, int nr);
 
@@ -994,8 +994,7 @@ static noinline int balance_level(struct btrfs_trans_handle 
*trans,
if (btrfs_header_nritems(right) == 0) {
clean_tree_block(trans, root, right);
btrfs_tree_unlock(right);
-   wret = del_ptr(trans, root, path, level + 1, pslot +
-  1);
+   wret = del_ptr(root, path, level + 1, pslot + 1);
if (wret)
ret = wret;
root_sub_used(root, right-len);
@@ -1035,7 +1034,7 @@ static noinline int balance_level(struct 
btrfs_trans_handle *trans,
if (btrfs_header_nritems(mid) == 0) {
clean_tree_block(trans, root, mid);
btrfs_tree_unlock(mid);
-   wret = del_ptr(trans, root, path, level + 1, pslot);
+   wret = del_ptr(root, path, level + 1, pslot);
if (wret)
ret = wret;
root_sub_used(root, mid-len);
@@ -1767,8 +1766,7 @@ done:
  * If this fails to write a tree block, it returns -1, but continues
  * fixing up the blocks in ram so the tree is consistent.
  */
-static int fixup_low_keys(struct btrfs_trans_handle *trans,
- struct btrfs_root *root, struct btrfs_path *path,
+static int fixup_low_keys(struct btrfs_path *path,
  struct btrfs_disk_key *key, int level)
 {
int i;
@@ -1794,8 +1792,7 @@ static int fixup_low_keys(struct btrfs_trans_handle 
*trans,
  * This function isn't completely safe. It's the caller's responsibility
  * that the new key won't break the order
  */
-int btrfs_set_item_key_safe(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root, struct btrfs_path *path,
+int btrfs_set_item_key_safe(struct btrfs_path *path,
struct btrfs_key *new_key)
 {
struct btrfs_disk_key disk_key;
@@ -1819,7 +1816,7 @@ int btrfs_set_item_key_safe(struct btrfs_trans_handle 
*trans,
btrfs_set_item_key(eb, disk_key, slot);
btrfs_mark_buffer_dirty(eb);
if (slot == 0)
-   fixup_low_keys(trans, root, path, disk_key, 1);
+   fixup_low_keys(path, disk_key, 1);
return 0;
 }
 
@@ -2584,7 +2581,7 @@ static noinline int __push_leaf_left(struct 
btrfs_trans_handle *trans,
clean_tree_block(trans, root, right);
 
btrfs_item_key(right, disk_key, 0);
-   wret = fixup_low_keys(trans, root, path, disk_key, 1);
+   wret = fixup_low_keys(path, disk_key, 1);
if (wret)
ret = wret;
 
@@ -2971,8 +2968,7 @@ again:
path-nodes[0] = right;
path-slots[0] = 0;
if (path-slots[1] == 0) {
-   wret = fixup_low_keys(trans, root,
-   path, disk_key, 1);
+   wret = fixup_low_keys(path, disk_key, 1);
if (wret)
ret = wret;
}
@@ -3192,7 +3188,7 @@ int btrfs_duplicate_item(struct btrfs_trans_handle *trans,
return ret;
 
path-slots[0]++;
-   ret = setup_items_for_insert(trans, root, path, new_key, item_size,
+   ret = 

Re: [PATCH] Btrfs: remove trans and root argument from fixup_low_keys()

2011-04-21 Thread David Sterba
Hi,

On Thu, Apr 21, 2011 at 06:54:45PM +0900, Tsutomu Itoh wrote:
 'trans' and 'root' argument not used in fixup_low_keys() are deleted.
 And, the argument that became unnecessary is deleted about the caller
 of fixup_low_keys().

heh, coincidence or synchronicity, I've spent some time yesterday fixing
the same problems. There are tons of unused arguments all over the
place.  I have a few patches already, but still not removed all. I'll
send them out today, so you can have a look and to not duplicate work.


dave
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: check return value of kmalloc()

2011-04-21 Thread David Sterba
On Wed, Apr 20, 2011 at 09:51:30AM +0900, Tsutomu Itoh wrote:
  2030delayed = kmalloc(sizeof(*delayed), GFP_NOFS | __GFP_NOFAIL);
 
 I think that it doesn't fail ordinary when __GFP_NOFAIL is specified...

yes I agree with you, my oversight. However, from linux/gfp.h

 * __GFP_NOFAIL: The VM implementation _must_ retry infinitely: the
 * caller cannot handle allocation failures.  This modifier is
 * deprecated and no new users should be added.

in long-term, redoing without NOFAIL would be probably wise. Currently
the btrfs_add_delayed_iput is called at places which do not seem to
like failure, I'm not sure whether possibly blocking indefinetely is
better than occasional failure with chance to do recovery ...

 
  
  and in extent-tree.c:relocate_one_extent()
  
  7992new_extents = kmalloc(sizeof(*new_extents),
  7993 GFP_NOFS);
  
  the value is checked later, new_extents is passed to get_new_locations
  and there it's checked, but no other callers pass potential NULL and the
  check fits here and can be dropped from get_new_locations;

  there's a
  little chance that get_new_locations will be able to succesfully
  allocate the same data a jiffy later.
 
 Yes, therefore I did not check 'new_extents'.

heh reading it again after myself, it sounds quite the opposite than I
wanted: I'd rather see the kmalloc checked right at the callsite, easier
to read and understand, than diving into get_new_locations and there
seeing checking extents for NULL and doing own alloc/free.


david
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: do not flush csum items of unchanged file data during treelog

2011-04-21 Thread Chris Mason
Excerpts from liubo's message of 2011-04-21 03:58:21 -0400:
 
 The current code relogs the entire inode every time during fsync log,
 and it is much better suited to small files rather than large ones.
 
 During my performance test, the fsync performace of large files sucks,
 and we can ascribe this to the tremendous amount of csum infos of the
 large ones, cause we have to flush all of these csum infos into log trees
 even when there are only _one_ change in the whole file data.  Apparently,
 to optimize fsync, we need to create a filter to skip the unnecessary csum
 ones, that is, the corresponding file data remains unchanged before this 
 fsync.
 
 Here I have some test results to show, I use sysbench to do random write + 
 fsync.
 
 Sysbench args:
   - Number of threads: 1
   - Extra file open flags: 0
   - 2 files, 4Gb each
   - Block size 4Kb
   - Number of random requests for random IO: 1
   - Read/Write ratio for combined random IO test: 1.50
   - Periodic FSYNC enabled, calling fsync() each 100 requests.
   - Calling fsync() at the end of test, Enabled.
   - Using synchronous I/O mode
   - Doing random write test
 
 Sysbench results:
 ===
Operations performed:  0 Read, 1 Write, 200 Other = 10200 Total
Read 0b  Written 39.062Mb  Total transferred 39.062Mb
 ===
 a) without patch:  (*SPEED* : 451.01Kb/sec)
112.75 Requests/sec executed
 
 b) with patch: (*SPEED* : 5.1537Mb/sec)
1319.34 Requests/sec executed

Really nice results! Especially considering the small size of the patch.

But, I'd really like to look at using sub transaction ids for this, and
then logging just the part of the inode that had changed since the last
log commit.  It's more complex, but will also help reduce tree searches
for the file items.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: fix easily get into ENOSPC in mixed case

2011-04-21 Thread Sergei Trofimovich
On Wed, 20 Apr 2011 00:55:31 +0300
Sergei Trofimovich sly...@gmail.com wrote:

  Thanks a lot for testing, though.
  
  I guess something messed up your btrfs metadata, cause when btrfs_unlink() 
  wanted to remove A,
  it found that A was just missing...

Yeah. Now I think I completely figured out what's going on:

My mixed partition was correct except it did'n have MIXED bit in superblock
(I mistakenly used LZO bit for that in btrfs-progs, as I took some old patch 
from maillist).

Thus I had D+M marked (aka mixed) block groups, but from superblock's features
there was no way to figure out there is yet such groups. That's why your second 
patch
(looking correct) didn't help me. I've fallen to mixed == 0 case.

 I think interesting the parts are (they were found by Arne before):
 [0.04] new update_space_info: 
 [0.04] space_info has 0 free, is not full
 [0.04] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, 
 readonly=0
 [0.04] new update_space_info: 
 [0.04] space_info has 0 free, is not full
 [0.04] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, 
 readonly=0
 [0.04] new update_space_info: 
 [0.04] space_info has 0 free, is not full
 [0.04] space_info total=0, used=0, pinned=0, reserved=0, may_use=0, 
 readonly=0

Talking with Arne I was convinced my breakage is not worth fixing in kernel.
After I've added MIXED bit to superblocks' features OOpses gone.

So, once again, please disregard my OOpses.

-- 

  Sergei


signature.asc
Description: PGP signature


How Snapshots Inter-relate?

2011-04-21 Thread CACook

I have set up a backup server in the garage which does rsync backups of all my 
servers weekly, and snapshots those backups.  It works wonderfully, and I was 
able to set it up thanks to help from this listserv, thank you.

But I'm accumulating quite a few backups now, unnecessarily.  After a couple 
months I only need a representative one, once a month.  I am afraid to delete 
three of my weekly snapshots out of each month, as I'm afraid they're 
cumulative.  I might end up with nothing valid if each builds on the last, 
forever.

How do I resolve this?

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How Snapshots Inter-relate?

2011-04-21 Thread Calvin Walton
On Thu, 2011-04-21 at 10:17 -0700, cac...@quantum-sci.com wrote:
 I have set up a backup server in the garage which does rsync backups
 of all my servers weekly, and snapshots those backups.  It works
 wonderfully, and I was able to set it up thanks to help from this
 listserv, thank you.
 
 But I'm accumulating quite a few backups now, unnecessarily.  After a
 couple months I only need a representative one, once a month.  I am
 afraid to delete three of my weekly snapshots out of each month, as
 I'm afraid they're cumulative.  I might end up with nothing valid if
 each builds on the last, forever.

You have nothing to worry about. You can delete any snapshot on btrfs
without losing data from any other snapshot. Each snapshot is completely
independent.

This works because data which is shared between multiple snapshots is
reference counted, and won't be deleted until you remove the last
snapshot that references that data.

-- 
Calvin Walton calvin.wal...@kepstin.ca

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Josef Bacik
This just gets us ready to support the SEEK_HOLE and SEEK_DATA flags.  Turns out
using fiemap in things like cp cause more problems than it solves, so lets try
and give userspace an interface that doesn't suck.  So we have

-SEEK_HOLE: this moves the file pos to the nearest hole in the file from the
given position.  If the given position is a hole then pos won't move.  A hole
is defined by whatever the fs feels like defining it to be.  In simple things
like ext2/3 it will simplly mean an unallocated space in the file.  For more
complex things where you have preallocated space then that is left up to the
filesystem.  Since preallocated space is supposed to return all 0's it is
perfectly legitimate to have SEEK_HOLE dump you out at the start of a
preallocated extent, but then again if this is not something you can do and be
sure the extent isn't in the middle of being converted to a real extent then it
is also perfectly legitimate to skip preallocated extents and only park f_pos at
a truly unallocated section.

-SEEK_DATA: this is obviously a little more self-explanatory.  Again the only
ambiguity comes in with preallocated extents.  If you have an fs that can't
reliably tell that the preallocated extent is in the process of turning into a
real extent, it is correct for SEEK_DATA to park you at a preallocated extent.

Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/read_write.c|3 +++
 include/linux/fs.h |4 +++-
 2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 5520f8a..5446d61 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -64,6 +64,9 @@ generic_file_llseek_unlocked(struct file *file, loff_t 
offset, int origin)
return file-f_pos;
offset += file-f_pos;
break;
+   case SEEK_DATA:
+   case SEEK_HOLE:
+   return -EINVAL;
}
 
if (offset  0  !unsigned_offsets(file))
diff --git a/include/linux/fs.h b/include/linux/fs.h
index dbd860a..1b72e0c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -31,7 +31,9 @@
 #define SEEK_SET   0   /* seek relative to beginning of file */
 #define SEEK_CUR   1   /* seek relative to current file position */
 #define SEEK_END   2   /* seek relative to end of file */
-#define SEEK_MAX   SEEK_END
+#define SEEK_HOLE  3   /* seek to the closest hole */
+#define SEEK_DATA  4   /* seek to the closest data */
+#define SEEK_MAX   SEEK_DATA
 
 struct fstrim_range {
__u64 start;
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Btrfs: implement our own -llseek

2011-04-21 Thread Josef Bacik
In order to handle SEEK_HOLE/SEEK_DATA we need to implement our own llseek.
Basically for the normal SEEK_*'s we will just defer to the generic helper, and
for SEEK_HOLE/SEEK_DATA we will use our fiemap helper to figure out the nearest
hole or data.  Currently this helper doesn't check for delalloc bytes for
prealloc space, so for now treat prealloc as data until that is fixed.  Thanks,

Signed-off-by: Josef Bacik jo...@redhat.com
---
 fs/btrfs/ctree.h |3 ++
 fs/btrfs/file.c  |   85 +-
 2 files changed, 87 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index d5f043e..d2991c8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2474,6 +2474,9 @@ int btrfs_csum_truncate(struct btrfs_trans_handle *trans,
 int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start,
 u64 end, struct list_head *list);
 /* inode.c */
+struct extent_map *btrfs_get_extent_fiemap(struct inode *inode, struct page 
*page,
+  size_t pg_offset, u64 start, u64 len,
+  int create);
 
 /* RHEL and EL kernels have a patch that renames PG_checked to FsMisc */
 #if defined(ClearPageFsMisc)  !defined(ClearPageChecked)
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index cd5e82e..f0a7a33 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1406,8 +1406,91 @@ out:
return ret;
 }
 
+static int find_desired_extent(struct inode *inode, loff_t *offset, int origin)
+{
+   struct extent_map *em;
+   struct extent_state *cached_state = NULL;
+   u64 lockstart = *offset;
+   u64 lockend = i_size_read(inode);
+   u64 start = *offset;
+   u64 len = i_size_read(inode);
+   int ret = 0;
+
+   lockend = max_t(u64, BTRFS_I(inode)-root-sectorsize, lockend);
+   len = lockend;
+
+   lock_extent_bits(BTRFS_I(inode)-io_tree, lockstart, lockend, 0,
+cached_state, GFP_NOFS);
+   while (1) {
+   em = btrfs_get_extent_fiemap(inode, NULL, 0, start, len, 0);
+   if (IS_ERR(em)) {
+   ret = -ENXIO;
+   break;
+   }
+
+   if (em-block_start == EXTENT_MAP_HOLE) {
+   if (origin == SEEK_HOLE) {
+   *offset = em-start;
+   free_extent_map(em);
+   break;
+   }
+   } else {
+   if (origin == SEEK_DATA) {
+   *offset = em-start;
+   free_extent_map(em);
+   break;
+   }
+   }
+   start = em-start + em-len;
+   if (test_bit(EXTENT_FLAG_VACANCY, em-flags)) {
+   free_extent_map(em);
+   ret = -ENXIO;
+   break;
+   }
+   free_extent_map(em);
+   }
+   unlock_extent_cached(BTRFS_I(inode)-io_tree, lockstart, lockend,
+cached_state, GFP_NOFS);
+   return ret;
+}
+
+static loff_t btrfs_file_llseek(struct file *file, loff_t offset, int origin)
+{
+   struct inode *inode = file-f_mapping-host;
+   int ret;
+
+   mutex_lock(inode-i_mutex);
+   switch (origin) {
+   case SEEK_END:
+   case SEEK_CUR:
+   offset = generic_file_llseek_unlocked(file, offset, origin);
+   goto out;
+   case SEEK_DATA:
+   case SEEK_HOLE:
+   ret = find_desired_extent(inode, offset, origin);
+   if (ret) {
+   mutex_unlock(inode-i_mutex);
+   return ret;
+   }
+   }
+
+   if (offset  0  !(file-f_mode  FMODE_UNSIGNED_OFFSET))
+   return -EINVAL;
+   if (offset  inode-i_sb-s_maxbytes)
+   return -EINVAL;
+
+   /* Special lock needed here? */
+   if (offset != file-f_pos) {
+   file-f_pos = offset;
+   file-f_version = 0;
+   }
+out:
+   mutex_unlock(inode-i_mutex);
+   return offset;
+}
+
 const struct file_operations btrfs_file_operations = {
-   .llseek = generic_file_llseek,
+   .llseek = btrfs_file_llseek,
.read   = do_sync_read,
.write  = do_sync_write,
.aio_read   = generic_file_aio_read,
-- 
1.7.2.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Theodore Tso

On Apr 21, 2011, at 3:42 PM, Josef Bacik wrote:

 + case SEEK_DATA:
 + case SEEK_HOLE:
 + return -EINVAL;
   }

Maybe we should be returning EOPNOTSUPP?

-- Ted

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Sunil Mushran

On 04/21/2011 01:45 PM, Theodore Tso wrote:

On Apr 21, 2011, at 3:42 PM, Josef Bacik wrote:

+   case SEEK_DATA:
+   case SEEK_HOLE:
+   return -EINVAL;
}


Maybe we should be returning EOPNOTSUPP?

-- Ted


For SEEK_HOLE yes. But we could let the default SEEK_DATA be a null-op.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: remove trans and root argument from fixup_low_keys()

2011-04-21 Thread Tsutomu Itoh
(2011/04/21 19:59), David Sterba wrote:
 Hi,
 
 On Thu, Apr 21, 2011 at 06:54:45PM +0900, Tsutomu Itoh wrote:
 'trans' and 'root' argument not used in fixup_low_keys() are deleted.
 And, the argument that became unnecessary is deleted about the caller
 of fixup_low_keys().
 
 heh, coincidence or synchronicity, I've spent some time yesterday fixing
 the same problems. There are tons of unused arguments all over the
 place.  I have a few patches already, but still not removed all. I'll
 send them out today, so you can have a look and to not duplicate work.

OK, solving of this problem is entrusted to you.

Thanks,
Tsutomu

 
 
 dave
 

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: do not flush csum items of unchanged file data during treelog

2011-04-21 Thread Li Zefan
Chris Mason wrote:
 Excerpts from liubo's message of 2011-04-21 03:58:21 -0400:

 The current code relogs the entire inode every time during fsync log,
 and it is much better suited to small files rather than large ones.

 During my performance test, the fsync performace of large files sucks,
 and we can ascribe this to the tremendous amount of csum infos of the
 large ones, cause we have to flush all of these csum infos into log trees
 even when there are only _one_ change in the whole file data.  Apparently,
 to optimize fsync, we need to create a filter to skip the unnecessary csum
 ones, that is, the corresponding file data remains unchanged before this 
 fsync.

 Here I have some test results to show, I use sysbench to do random write + 
 fsync.

 Sysbench args:
   - Number of threads: 1
   - Extra file open flags: 0
   - 2 files, 4Gb each
   - Block size 4Kb
   - Number of random requests for random IO: 1
   - Read/Write ratio for combined random IO test: 1.50
   - Periodic FSYNC enabled, calling fsync() each 100 requests.
   - Calling fsync() at the end of test, Enabled.
   - Using synchronous I/O mode
   - Doing random write test

 Sysbench results:
 ===
Operations performed:  0 Read, 1 Write, 200 Other = 10200 Total
Read 0b  Written 39.062Mb  Total transferred 39.062Mb
 ===
 a) without patch:  (*SPEED* : 451.01Kb/sec)
112.75 Requests/sec executed

 b) with patch: (*SPEED* : 5.1537Mb/sec)
1319.34 Requests/sec executed
 
 Really nice results! Especially considering the small size of the patch.
 
 But, I'd really like to look at using sub transaction ids for this, and
 then logging just the part of the inode that had changed since the last
 log commit.  It's more complex, but will also help reduce tree searches
 for the file items.
 

And this patch forgot to mention it has compatability issue.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH] Btrfs: do not flush csum items of unchanged file data during treelog

2011-04-21 Thread Chris Mason
Excerpts from Li Zefan's message of 2011-04-21 20:55:40 -0400:
 Chris Mason wrote:
  Excerpts from liubo's message of 2011-04-21 03:58:21 -0400:
 
  The current code relogs the entire inode every time during fsync log,
  and it is much better suited to small files rather than large ones.
 
  During my performance test, the fsync performace of large files sucks,
  and we can ascribe this to the tremendous amount of csum infos of the
  large ones, cause we have to flush all of these csum infos into log trees
  even when there are only _one_ change in the whole file data.  Apparently,
  to optimize fsync, we need to create a filter to skip the unnecessary csum
  ones, that is, the corresponding file data remains unchanged before this 
  fsync.
 
  Here I have some test results to show, I use sysbench to do random write 
  + fsync.
 
  Sysbench args:
- Number of threads: 1
- Extra file open flags: 0
- 2 files, 4Gb each
- Block size 4Kb
- Number of random requests for random IO: 1
- Read/Write ratio for combined random IO test: 1.50
- Periodic FSYNC enabled, calling fsync() each 100 requests.
- Calling fsync() at the end of test, Enabled.
- Using synchronous I/O mode
- Doing random write test
 
  Sysbench results:
  ===
 Operations performed:  0 Read, 1 Write, 200 Other = 10200 Total
 Read 0b  Written 39.062Mb  Total transferred 39.062Mb
  ===
  a) without patch:  (*SPEED* : 451.01Kb/sec)
 112.75 Requests/sec executed
 
  b) with patch: (*SPEED* : 5.1537Mb/sec)
 1319.34 Requests/sec executed
  
  Really nice results! Especially considering the small size of the patch.
  
  But, I'd really like to look at using sub transaction ids for this, and
  then logging just the part of the inode that had changed since the last
  log commit.  It's more complex, but will also help reduce tree searches
  for the file items.
  
 
 And this patch forgot to mention it has compatability issue.

Right, at the very least we want to just use one bit of that field
instead of all 8.  But keeping a sub-transid and putting that in the
generation field of the file extent instead can get us the same benefits
without stealing the bits.

As we push the sub transid into the btree blocks as well, we'll get much
faster tree walks too.  The penalty is in complexity in the logging
code, since it will have to deal with finding extents in the log tree
and merging in the new extents from the file.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] Btrfs: check return value of kmalloc()

2011-04-21 Thread Tsutomu Itoh
The check on the return value of kmalloc() is added to some places.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
V1-V2: adding check code to relocate_one_extent() for the readability,
which is suggested by David Sterba.

 fs/btrfs/extent-tree.c |4 
 fs/btrfs/inode.c   |3 +++
 2 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 31f33ba..cd52f7f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -8059,6 +8059,10 @@ static noinline int relocate_one_extent(struct 
btrfs_root *extent_root,
u64 group_start = group-key.objectid;
new_extents = kmalloc(sizeof(*new_extents),
  GFP_NOFS);
+   if (!new_extents) {
+   ret = -ENOMEM;
+   goto out;
+   }
nr_extents = 1;
ret = get_new_locations(reloc_inode,
extent_key,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index a4157cf..c718d27 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -953,6 +953,7 @@ static int cow_file_range_async(struct inode *inode, struct 
page *locked_page,
 1, 0, NULL, GFP_NOFS);
while (start  end) {
async_cow = kmalloc(sizeof(*async_cow), GFP_NOFS);
+   BUG_ON(!async_cow);
async_cow-inode = inode;
async_cow-root = root;
async_cow-locked_page = locked_page;
@@ -5001,6 +5002,8 @@ static noinline int uncompress_inline(struct btrfs_path 
*path,
inline_size = btrfs_file_extent_inline_item_len(leaf,
btrfs_item_nr(leaf, path-slots[0]));
tmp = kmalloc(inline_size, GFP_NOFS);
+   if (!tmp)
+   return -ENOMEM;
ptr = btrfs_file_extent_inline_start(item);
 
read_extent_buffer(leaf, tmp, ptr, inline_size);

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Matthew Wilcox
On Thu, Apr 21, 2011 at 04:45:54PM -0400, Theodore Tso wrote:
 
 On Apr 21, 2011, at 3:42 PM, Josef Bacik wrote:
 
  +   case SEEK_DATA:
  +   case SEEK_HOLE:
  +   return -EINVAL;
  }
 
 Maybe we should be returning EOPNOTSUPP?

That's definitely wrong ... -ENOSYS might be better!

-- 
Matthew Wilcox  Intel Open Source Technology Centre
Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Christoph Hellwig
We'll also need:

 - a patch to lseek(2) to document the new flags
 - some testcases for xfstests, specially dealing with things that
   were problematic in FIEMAP, e.g. data in delalloc extents, making
   sure stuff in unwrittent extents partially converted actually gets
   copied, etc.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] fs: add SEEK_HOLE and SEEK_DATA flags

2011-04-21 Thread Christoph Hellwig
[Eric: please don't drop the Cc list, thanks!]

On Thu, Apr 21, 2011 at 09:22:55PM -0400, Josef Bacik wrote:
  since all files have a virtual hole at the end, but leaves the position
  unchanged). ??I'd have to write a test program on Solaris to see whether 
  that
  definition is actually true, or if it is a bug in the Solaris man page.
 
 
 lseek's purpose is to reposition the file position, so I'd imagine
 this is just a bug in the man page.

I would be surprised if the bug is around for such a long time, but
otherwise I concur.

 Except you can have blocks allocated past i_size, so returning i_size
 isn't necessarily correct.  Obviously the idea is to have most of the
 filesystems doing the correct thing, but for my first go around I just
 was going to do one since I expect quite a bit of repainting for this
 bikeshed is yet to be done.  But I guess if you have data past i_size
 doing this would encourage the various fs people to fix it.

Trying to be smart about our semantics is not helpful here.  The
interface has been out in Solaris for a while, and we'll have to either
stick to it or use different names for the interface.  And staying
compatible is far more useful for userspace programmers.

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html