[PATCH] Btrfs: cleanup some BUG_ON()

2011-03-23 Thread Tsutomu Itoh
This patch changes some BUG_ON() calls to error returns.
(However, most callers still use BUG_ON().)

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 fs/btrfs/ctree.c   |3 ++-
 fs/btrfs/disk-io.c |5 -
 fs/btrfs/extent-tree.c |   25 ++---
 fs/btrfs/file-item.c   |3 ++-
 fs/btrfs/inode-map.c   |3 ++-
 fs/btrfs/ioctl.c   |5 -
 fs/btrfs/root-tree.c   |6 --
 fs/btrfs/transaction.c |   12 +---
 fs/btrfs/tree-log.c|   15 +--
 9 files changed, 54 insertions(+), 23 deletions(-)

diff -urNp linux-2.6.38/fs/btrfs/ctree.c linux-2.6.38.new/fs/btrfs/ctree.c
--- linux-2.6.38/fs/btrfs/ctree.c   2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/ctree.c   2011-03-23 11:28:09.0 +0900
@@ -3840,7 +3840,8 @@ int btrfs_insert_item(struct btrfs_trans
unsigned long ptr;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
ret = btrfs_insert_empty_item(trans, root, path, cpu_key, data_size);
if (!ret) {
leaf = path->nodes[0];
diff -urNp linux-2.6.38/fs/btrfs/disk-io.c linux-2.6.38.new/fs/btrfs/disk-io.c
--- linux-2.6.38/fs/btrfs/disk-io.c 2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/disk-io.c 2011-03-23 11:44:39.0 +0900
@@ -1160,7 +1160,10 @@ struct btrfs_root *btrfs_read_fs_root_no
 root, fs_info, location->objectid);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path) {
+   kfree(root);
+   return ERR_PTR(-ENOMEM);
+   }
ret = btrfs_search_slot(NULL, tree_root, location, path, 0, 0);
if (ret == 0) {
l = path->nodes[0];
diff -urNp linux-2.6.38/fs/btrfs/extent-tree.c linux-2.6.38.new/fs/btrfs/extent-tree.c
--- linux-2.6.38/fs/btrfs/extent-tree.c 2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/extent-tree.c 2011-03-23 11:28:09.0 +0900
@@ -5444,7 +5444,8 @@ static int alloc_reserved_file_extent(st
size = sizeof(*extent_item) + btrfs_extent_inline_ref_size(type);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
@@ -6438,10 +6439,14 @@ int btrfs_drop_subtree(struct btrfs_tran
BUG_ON(root->root_key.objectid != BTRFS_TREE_RELOC_OBJECTID);
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
 
wc = kzalloc(sizeof(*wc), GFP_NOFS);
-   BUG_ON(!wc);
+   if (!wc) {
+   btrfs_free_path(path);
+   return -ENOMEM;
+   }
 
btrfs_assert_tree_locked(parent);
parent_level = btrfs_header_level(parent);
@@ -6899,7 +6904,11 @@ static noinline int get_new_locations(st
}
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path) {
+   if (exts != *extents)
+   kfree(exts);
+   return -ENOMEM;
+   }
 
cur_pos = extent_key->objectid - offset;
last_byte = extent_key->objectid + extent_key->offset;
@@ -7423,7 +7432,8 @@ static noinline int replace_extents_in_l
int ret;
 
new_extent = kmalloc(sizeof(*new_extent), GFP_NOFS);
-   BUG_ON(!new_extent);
+   if (!new_extent)
+   return -ENOMEM;
 
ref = btrfs_lookup_leaf_ref(root, leaf->start);
BUG_ON(!ref);
@@ -7627,7 +7637,8 @@ static noinline int init_reloc_tree(stru
return 0;
 
root_item = kmalloc(sizeof(*root_item), GFP_NOFS);
-   BUG_ON(!root_item);
+   if (!root_item)
+   return -ENOMEM;
 
ret = btrfs_copy_root(trans, root, root->commit_root,
  &eb, BTRFS_TREE_RELOC_OBJECTID);
@@ -7653,7 +7664,7 @@ static noinline int init_reloc_tree(stru
 
reloc_root = btrfs_read_fs_root_no_radix(root->fs_info->tree_root,
 &root_key);
-   BUG_ON(!reloc_root);
+   BUG_ON(IS_ERR(reloc_root));
reloc_root->last_trans = trans->transid;
reloc_root->commit_root = NULL;
reloc_root->ref_tree = &root->fs_info->reloc_ref_tree;
diff -urNp linux-2.6.38/fs/btrfs/file-item.c linux-2.6.38.new/fs/btrfs/file-item.c
--- linux-2.6.38/fs/btrfs/file-item.c   2011-03-15 10:20:32.0 +0900
+++ linux-2.6.38.new/fs/btrfs/file-item.c   2011-03-23 11:28:09.0 +0900
@@ -48,7 +48,8 @@ int btrfs_insert_file_extent(struct btrf
struct extent_buffer *leaf;
 
path = btrfs_alloc_path();
-   BUG_ON(!path);
+   if (!path)
+   return -ENOMEM;
file_key.objectid = objectid;
file_key.offset = pos;
btrfs_set_key_type(&file_key, BTRFS_EXTENT_DATA_KEY);
diff -urNp linux-2.6.38/fs/btrfs/inode-map.c 

read-only subvolumes?

2011-03-23 Thread Andreas Philipp

Hi all,

When I am creating subvolumes I get this strange behavior. If I create
a subvolume with a name longer than 4 characters it is read-only; if
the name is shorter than 5 characters the subvolume is writeable as
expected. I think this started when I upgraded to kernel version 2.6.38 (I
do not create subvolumes on a regular basis). I will compile one of
the latest 2.6.37 kernels to see whether the problem exists there,
too. Another interesting point is that previously created subvolumes
are not affected.

Thanks,
Andreas Philipp

thor btrfs # btrfs subvolume create 123456789
Create subvolume './123456789'
thor btrfs # touch 123456789/lsdkfj
touch: cannot touch `123456789/lsdkfj': Read-only file system


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: read-only subvolumes?

2011-03-23 Thread Fajar A. Nugraha
On Wed, Mar 23, 2011 at 3:21 PM, Andreas Philipp
philipp.andr...@gmail.com wrote:
 I think it is since I upgraded to kernel version 2.6.38 (I
 do not create subvolumes on a regular basis.).

 thor btrfs # btrfs subvolume create 123456789
 Create subvolume './123456789'
 thor btrfs # touch 123456789/lsdkfj
 touch: cannot touch `123456789/lsdkfj': Read-only file system

It works on my system:

# touch test1
# btrfs su cr 123456789
Create subvolume './123456789'
# touch 123456789/lsdkfj
# uname -a
Linux HP 2.6.38-020638-generic #201103151303 SMP Tue Mar 15 14:33:40
UTC 2011 i686 GNU/Linux

-- 
Fajar


Re: read-only subvolumes?

2011-03-23 Thread Li Zefan
 Hi all,
 
 When I am creating subvolumes I get this strange behavior. If I create
 a subvolume with a name longer than 4 characters it is read-only, if
 the name is shorter than 5 characters the subvolume is writeable as
 expected. I think it is since I upgraded to kernel version 2.6.38 (I
 do not create subvolumes on a regular basis.). I will compile one of
 the latest 2.6.37 kernels to see whether there the problem exists,
 too. Another interesting point is that previously created subvolumes
 are not affected.
 
 Thanks,
 Andreas Philipp
 
 thor btrfs # btrfs subvolume create 123456789
 Create subvolume './123456789'
 thor btrfs # touch 123456789/lsdkfj
 touch: cannot touch `123456789/lsdkfj': Read-only file system

This is really odd, but I can't reproduce it.

I created a btrfs filesystem on a 2.6.37 kernel, rebooted to the latest
2.6.38+, and tried the same procedure as you did, but nothing bad happened.


Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Miao Xie
Hi, Kitayama-san

On Wed, 23 Mar 2011 13:19:18 +0900, Itaru Kitayama wrote:
 On Wed, 23 Mar 2011 12:00:38 +0800
 Miao Xie mi...@cn.fujitsu.com wrote:
 
 I am testing the new version, in which I fixed the slab shrinker problem
 reported by Chris. In the new version, the delayed node is removed before
 the corresponding inode is moved into the unused_inode list (the slab
 shrinker will reclaim inodes in this list). Maybe this method can also fix
 this bug. So could you tell me the reproduction steps or the options of
 mkfs and mount? I will check whether the new patch fixes this bug.
 
 I used the default mkfs options for $TEST_DEV and enabled the space_cache
 and the compress=lzo options upon mounting the partition.

Unfortunately, I can't trigger this warning, but by analyzing the following
information,

 =
 [ INFO: possible irq lock inversion dependency detected ]
 2.6.36-xie+ #122
 -
 kswapd0/49 just changed the state of lock:
  (iprune_sem){.-}, at: [811316d0] shrink_icache_memory+0x4d/0x213
 but this lock took another, RECLAIM_FS-unsafe lock in the past:
  (delayed_node->mutex){+.+.+.}
 
 and interrupts could create inverse lock ordering between them.
[SNIP]
 RECLAIM_FS-ON-W at:
  [81074292] mark_held_locks+0x52/0x70
  [81074354] lockdep_trace_alloc+0xa4/0xc2
  [8110f206] __kmalloc+0x7f/0x154
  [811c2c21] kzalloc+0x14/0x16
  [811c5e83] cache_block_group+0x106/0x238
  [811c7069] find_free_extent+0x4e2/0xa86
  [811c76c1] btrfs_reserve_extent+0xb4/0x142
  [811c78b6] btrfs_alloc_free_block+0x167/0x2af
  [811be610] __btrfs_cow_block+0x103/0x346
  [811bedb8] btrfs_cow_block+0x101/0x110
  [811c05d8] btrfs_search_slot+0x143/0x513
  [811cf5ab] btrfs_lookup_inode+0x2f/0x8f
  [81212405] btrfs_update_delayed_inode+0x75/0x135
  [8121340d] btrfs_run_delayed_items+0xd6/0x131
  [811d64d7] btrfs_commit_transaction+0x28b/0x668
  [811ba786] btrfs_sync_fs+0x6b/0x70
  [81140265] __sync_filesystem+0x6b/0x83
  [81140347] sync_filesystem+0x4c/0x50
  [8111fc56] generic_shutdown_super+0x27/0xd7
  [8111fd5b] kill_anon_super+0x16/0x54
  [8111effd] deactivate_locked_super+0x26/0x46
  [8111f495] deactivate_super+0x45/0x4a
  [81135962] mntput_no_expire+0xd6/0x104
  [81136a87] sys_umount+0x2c1/0x2ec
  [81002ddb] system_call_fastpath+0x16/0x1b

we found GFP_KERNEL was passed into kzalloc(). I think this flag triggers
the above lockdep warning. The attached patch, which applies against the
delayed items operation patch, may fix this problem. Could you test it for me?

Thanks
Miao
From f84daee1d2060beae945a2774cda7af2ef7f3e61 Mon Sep 17 00:00:00 2001
From: Miao Xie mi...@cn.fujitsu.com
Date: Wed, 23 Mar 2011 16:01:16 +0800
Subject: [PATCH] btrfs: use GFP_NOFS instead of GFP_KERNEL

In the filesystem context, we must allocate memory with GFP_NOFS;
otherwise we may start another filesystem operation and trigger a lockdep warning.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f1db57d..fe50cff 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -471,7 +471,7 @@ static int cache_block_group(struct btrfs_block_group_cache *cache,
if (load_cache_only)
return 0;
 
-   caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_KERNEL);
+   caching_ctl = kzalloc(sizeof(*caching_ctl), GFP_NOFS);
BUG_ON(!caching_ctl);
 
INIT_LIST_HEAD(&caching_ctl->list);
-- 
1.7.3.1



Re: read-only subvolumes?

2011-03-23 Thread Andreas Philipp

On 23.03.2011 10:25, Li Zefan wrote:
 Hi all,

 When I am creating subvolumes I get this strange behavior. If I
 create a subvolume with a name longer than 4 characters it is
 read-only, if the name is shorter than 5 characters the subvolume
 is writeable as expected. I think it is since I upgraded to
 kernel version 2.6.38 (I do not create subvolumes on a regular
 basis.). I will compile one of the latest 2.6.37 kernels to see
 whether there the problem exists, too. Another interesting point
 is that previously created subvolumes are not affected.

 Thanks, Andreas Philipp

 thor btrfs # btrfs subvolume create 123456789 Create subvolume
 './123456789' thor btrfs # touch 123456789/lsdkfj touch: cannot
 touch `123456789/lsdkfj': Read-only file system

 This is really odd, but I can't reproduce it.

 I created a btrfs filesystem on 2.6.37 kernel, and rebooted to
 latest 2.6.38+, and tried the procedures as you did, but nothing
 bad happened.
While playing around I found the following three new points:
- Now the length of the subvolume name does not matter. So even the
ones with short names are read-only.
- It also happens on a freshly created btrfs filesystem.
- If I take a snapshot of an old (= writeable) subvolume, the snapshot
is writeable.

I will now reboot into 2.6.37.4, check there, and then report back.

Thanks,
Andreas Philipp



Re: [PATCH] btrfs scrub: make fixups sync, don't reuse fixup bios

2011-03-23 Thread Arne Jansen
Hi Ilya,

On 18.03.2011 17:21, Ilya Dryomov wrote:
 [CC'ing the list to make development public, the patch is against Arne's
 tree at kernel.org]
 
 Below is a quite long diff the primary purpose of which is to make
 fixups totally sync.  They are already sync for checksum failures, this
 patch makes them sync for EIO case as well.  This is required for
 integrating drive swap, since the idea is that I have to fixup up
 everyithing first and then write the correct data to a new device.
 Obviously to do that fixups have to be sync.  The EIO is supposed to
 happen quite rare, so the performance loose from making EIO sync should
 be minimal.
 
 The other significant change is that fixups are now sharing the buffer
 with the parent sbio.  So instead of allocating a page to do a fixup, we
 just grab the page from the sbio buffer.  This is also required for
 drive swap, since when all the fixups are done, sbio buffer will contain
 the right data which I can write to a new device using a single bio.  It
 doesn't affect scrub at all.
 
 The third change is that fixup bios are no longer reused.  This is a
 change that I think should be added even if you don't like the rest.
 You were right at the first place, bios cannot be reused that simply,
 and since it's just a one page bio, it's better to allocate it each time
 we need it.
 
 Since fixups are now sync, I don't embed spage into a fixup structure.
 Instead a pointer is used.

thanks for the patch. The sync path looks good to me. I'd suggest to
see if you can get rid of struct fixup completely. Also there is no need
to increment scrubs_running anymore inside the recheck code.
Your patch reorders some functions, which makes it harder to read. Could
you separate that into two steps?

Thanks,
Arne

 
 This is a preliminary version, it's meant to get us on the same page.
 But if you can give this code some quick testing on real hardware with
 your test cases I'd appreciate that.
 
 I also plan to fix EIO handling in scrub_checksum, but that will happen
 only next week.  My disk should arrive Monday-Tuesday + a couple days to
 play with it.
 
 I may have forgotten something, so ping me on IRC any time.  Also
 disregard my debugging output at the end.
 
 Thanks,
 
   Ilya
 
 ---
 
 diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
 index 85a4d4b..f3fe5a5 100644
 --- a/fs/btrfs/scrub.c
 +++ b/fs/btrfs/scrub.c
 @@ -69,9 +69,6 @@ static int scrub_checksum_tree_block(struct scrub_dev *sdev,
   struct scrub_page *spag, u64 logical,
   void *buffer);
  static int scrub_checksum_super(struct scrub_bio *sbio, void *buffer);
 -static void scrub_recheck_end_io(struct bio *bio, int err);
 -static void scrub_fixup_worker(scrub_work_t *work);
 -static void scrub_fixup(struct scrub_fixup *fixup);
  
  #define SCRUB_PAGES_PER_BIO  16  /* 64k per bio */
  #define SCRUB_BIOS_PER_DEV   16  /* 1 MB per device in flight */
 @@ -117,13 +114,10 @@ struct scrub_dev {
  
  struct scrub_fixup {
   struct scrub_dev*sdev;
 - struct bio  *bio;
   u64 logical;
   u64 physical;
 - struct scrub_page   spag;
 - scrub_work_twork;
 - int err;
 - int recheck;
 + struct page *page;
 + struct scrub_page   *spag;
  };
  
  static void scrub_free_csums(struct scrub_dev *sdev)
 @@ -230,115 +224,19 @@ nomem:
   return ERR_PTR(-ENOMEM);
  }
  
 -/*
 - * scrub_recheck_error gets called when either verification of the page
 - * failed or the bio failed to read, e.g. with EIO. In the latter case,
 - * recheck_error gets called for every page in the bio, even though only
 - * one may be bad
 - */
 -static void scrub_recheck_error(struct scrub_bio *sbio, int ix)
 -{
 - struct scrub_dev *sdev = sbio->sdev;
 - struct btrfs_fs_info *fs_info = sdev->dev->dev_root->fs_info;
 - struct bio *bio = NULL;
 - struct page *page = NULL;
 - struct scrub_fixup *fixup = NULL;
 - int ret;
 -
 - /*
 -  * while we're in here we do not want the transaction to commit.
 -  * To prevent it, we increment scrubs_running. scrub_pause will
 -  * have to wait until we're finished
 -  * we can safely increment scrubs_running here, because we're
 -  * in the context of the original bio which is still marked in_flight
 -  */
 - atomic_inc(&fs_info->scrubs_running);
 -
 - fixup = kzalloc(sizeof(*fixup), GFP_NOFS);
 - if (!fixup)
 - goto malloc_error;
 -
 - fixup->logical = sbio->logical + ix * PAGE_SIZE;
 - fixup->physical = sbio->physical + ix * PAGE_SIZE;
 - fixup->spag = sbio->spag[ix];
 - fixup->sdev = sdev;
 -
 - bio = bio_alloc(GFP_NOFS, 1);
 - if (!bio)
 - goto malloc_error;
 - bio->bi_private = fixup;
 - bio->bi_size = 0;
 - bio->bi_bdev = sdev->dev->bdev;
 - 

Re: read-only subvolumes?

2011-03-23 Thread Andreas Philipp

On 23.03.2011 11:07, Andreas Philipp wrote:

 On 23.03.2011 10:25, Li Zefan wrote:
 Hi all,

 When I am creating subvolumes I get this strange behavior. If I
 create a subvolume with a name longer than 4 characters it is
 read-only, if the name is shorter than 5 characters the
 subvolume is writeable as expected. I think it is since I
 upgraded to kernel version 2.6.38 (I do not create subvolumes
 on a regular basis.). I will compile one of the latest 2.6.37
 kernels to see whether there the problem exists, too. Another
 interesting point is that previously created subvolumes are
 not affected.

 Thanks, Andreas Philipp

 thor btrfs # btrfs subvolume create 123456789 Create subvolume
 './123456789' thor btrfs # touch 123456789/lsdkfj touch:
 cannot touch `123456789/lsdkfj': Read-only file system

 This is really odd, but I can't reproduce it.

 I created a btrfs filesystem on 2.6.37 kernel, and rebooted to
 latest 2.6.38+, and tried the procedures as you did, but nothing
 bad happened.
 While playing around I found the following three new points: - Now
 the length of the subvolume name does not matter. So even the ones
 with short names are read-only. - It also happens to a fresh newly
 created btrfs filesystem. - If I take a snapshot of an old (=
 writeable) subvolume this is writeable.

 I will now reboot into 2.6.37.4, check there, and then report
 back.

Well, this was fast. Everything works as expected on 2.6.37.4. See the
output of uname -a for the exact kernel version below.
I will now reboot into a differently configured kernel version 2.6.38
and look whether the problem is gone there.

Thanks,
Andreas Philipp

thor ~ # uname -a
Linux thor 2.6.37.4 #2 SMP Wed Mar 23 10:25:54 CET 2011 x86_64
Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux



Re: [PATCH RFC] btrfs: Simplify locking

2011-03-23 Thread Tejun Heo
Hello, Chris.

On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
 Ok, the impact of this is really interesting.  If we have very short
 waits where there is no IO at all, this patch tends to lose.  I ran with
 dbench 10 and got about 20% slower tput.
 
 But, if we do any IO at all it wins by at least that much or more.  I
 think we should take this patch and just work on getting rid of the
 scheduling with the mutex held where possible.

I see.

 Tejun, could you please send the mutex_tryspin stuff in?  If we can get
 a sob for that I can send the whole thing.

I'm not sure whether mutex_tryspin() is justified at this point, and,
even if so, how to proceed with it.  Maybe we want to make
mutex_trylock() perform owner spin by default without introducing a
new API.

Given that the difference between SIMPLE and SPIN is small, I think it
would be best to simply use mutex_trylock() for now.  It's not gonna
make much difference either way.

How do you want to proceed?  I can prep patches doing the following.

- Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.

- Drop locking.c and make the lock functions simple wrappers around
  mutex operations.  This makes blocking/unblocking noops.

- Remove all blocking/unblocking calls along with the API.

- Remove locking wrappers and use mutex API directly.

What do you think?

Thanks.

-- 
tejun


Re: [PATCH RFC] btrfs: Simplify locking

2011-03-23 Thread Chris Mason
Excerpts from Tejun Heo's message of 2011-03-23 06:46:14 -0400:
 Hello, Chris.
 
 On Tue, Mar 22, 2011 at 07:13:09PM -0400, Chris Mason wrote:
  Ok, the impact of this is really interesting.  If we have very short
  waits where there is no IO at all, this patch tends to lose.  I ran with
  dbench 10 and got about 20% slower tput.
  
  But, if we do any IO at all it wins by at least that much or more.  I
  think we should take this patch and just work on getting rid of the
  scheduling with the mutex held where possible.
 
 I see.
 
  Tejun, could you please send the mutex_tryspin stuff in?  If we can get
  a sob for that I can send the whole thing.
 
 I'm not sure whether mutex_tryspin() is justified at this point, and,
 even if so, how to proceed with it.  Maybe we want to make
 mutex_trylock() perform owner spin by default without introducing a
 new API.

I'll benchmark without it, but I think the cond_resched is going to have
a pretty big impact.  I'm digging up the related benchmarks I used
during the initial adaptive spin work.

 
 Given that the difference between SIMPLE and SPIN is small, I think it
 would be best to simply use mutex_trylock() for now.  It's not gonna
 make much difference either way.

mutex_trylock is a good start.

 
 How do you want to proceed?  I can prep patches doing the following.
 
 - Convert CONFIG_DEBUG_LOCK_ALLOC to CONFIG_LOCKDEP.
 
 - Drop locking.c and make the lock function simple wrapper around
   mutex operations.  This makes blocking/unblocking noops.
 
 - Remove all blocking/unblocking calls along with the API.

I'd like to keep the blocking/unblocking calls for one release.  I'd
like to finally finish off my patches that do concurrent reads.

 
 - Remove locking wrappers and use mutex API directly.

I'd also like to keep the wrappers until the concurrent reader locking
is done.

 
 What do you think?

Thanks for all the work.

-chris


Re: efficiency of btrfs cow

2011-03-23 Thread Brian J. Murrell
On 11-03-06 11:06 AM, Calvin Walton wrote:
 
 To see exactly what's going on, you should use the btrfs filesystem df
 command to see how space is being allocated for data and metadata
 separately:

OK.  So with an empty filesystem, before my first copy (i.e. the base on
which the next copy will CoW from) df reports:

Filesystem           1K-blocks  Used Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test
                     922746880    56 922746824   1% /mnt/btrfs-test

and btrfs fi df reports:

Data: total=8.00MB, used=0.00
Metadata: total=1.01GB, used=24.00KB
System: total=12.00MB, used=4.00KB

after the first copy df and btrfs fi df report:

Filesystem   1K-blocks  Used Available Use% Mounted on
/dev/mapper/btrfs--test-btrfs--test
 922746880 121402328 801344552  14% /mnt/btrfs-test

root@linux:/mnt/btrfs-test# cat .snapshots/monthly.22/metadata/btrfs_df-stop
Data: total=110.01GB, used=109.26GB
Metadata: total=5.01GB, used=3.26GB
System: total=12.00MB, used=24.00KB

So it's clear that total usage (as reported by df) was 121,402,328KB but
Metadata has two values:

Metadata: total=5.01GB, used=3.26GB

What's the difference between total and used?  And for that matter,
what's the difference between the total and used for Data
(total=110.01GB, used=109.26GB)?

Even if I take the largest values (i.e. the total values) for Data and
Metadata (each converted to KB first) and add them up they are:
120,607,211.52 which is not quite the 121,402,328 that df reports.
There is a 795,116.48KB discrepancy.

In any case, which value from a btrfs fi df should I be subtracting from
df's accounting to get a real accounting of the amount of data used?

Cheers,
b.





[RFC] Tree fragmentation and prefetching

2011-03-23 Thread Arne Jansen
While looking into the performance of scrub, I noticed that a significant
amount of time is spent loading the extent tree and the csum
tree. While this is no surprise, I did some prototyping on how to improve
on it.
The main idea is to load the tree (or parts of it) top-down, order the
needed blocks and distribute it over all disks.
To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

The test setup is a filesystem on 7 Seagate ES.2 1TB disks, filled to
28%. It was created with the current git tree + the round robin patch and
filled with
fs_mark -D 512 -t 16 -n 4096 -F -S0

The 'normal' read is done by enumerating the leaves with btrfs_next_leaf()
and path->reada=2. Both trees are enumerated one after the other.
The prototype currently just uses raw bios, does not make use of the
page cache and does not enter the read pages into the cache. This will
probably add some overhead. It also does not check the crcs.

While it is very promising to implement it for scrub, I think a more
general interface which can be used for every enumeration would be
beneficial. Use cases that come to mind are rebalance, reflink, deletion
of large files, listing of large directories etc..

I'd imagine an interface along the lines of

int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

to trigger the readahead of parts of a tree. Multiple readahead
requests can be given before waiting. This would enable the very
beneficial folding seen above for 'reading both trees'.

Also it would be possible to add a cascading readahead, where the
content of leaves would trigger readaheads in other trees, maybe by
giving a callback for the decisions what to read instead of the fixed
start/end range.

For the implementation I'd need an interface which I haven't been able
to find yet. Currently I can trigger the read of several pages / tree
blocks and wait for the completion of each of them. What I'd need would
be an interface that gives me a callback on each completion or a waiting
function that wakes up on each completion with the information which
pages just completed.
One way to achieve this would be to add a hook, but I gladly take any
implementation hints.

--
Arne




Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Miao Xie
On Wed, 23 Mar 2011 09:57:56 +0800, Miao Xie wrote:
 On Mon, 21 Mar 2011 08:08:17 -0400, Chris Mason wrote:
 I also think that code is racing with the code that frees delayed nodes,
 but haven't yet triggered my debugging printks to prove either one.

 We free delayed nodes when we want to destroy the inode; at that time, just
 one task, which is destroying the inode, can access the delayed nodes, so I
 think ACCESS_ONCE() is enough. What do you think?

 Great, I see what you mean.  The bigger problem right now is that we may do
 a lot of operations in destroy_inode(), which can block the slab
 shrinkers on our metadata operations.  That same stress.sh -n 50 run is
 running into OOM.

 So we need to rework the part where the final free is done.  We could
 keep a ref on the inode until the delayed items are complete, or we
 could let the inode go and make a way to lookup the delayed node when
 the inode is read.
 
 I find the slab shrinkers just reclaim inodes which are in the inode_unused
 list, so if we free delayed nodes before moving the inode into the
 inode_unused list, we can avoid blocking the slab shrinker. Thus maybe we can
 fix the above problem by moving btrfs_remove_delayed_node() from
 btrfs_destroy_inode() to btrfs_drop_inode(). What do you think?

This method is not good, because we may do delayed insertion/deletion and
free the delayed node when we release the inode, but then access its delayed
node again soon.

After reading the lockdep message reported by Kitayama, I guess the reason
for the slab shrinker blocking may not be that we do a lot of operations in
destroy_inode(); the real reason may be that we pass GFP_KERNEL to kzalloc()
(in cache_block_group()), which makes the slab shrinkers hang up. (I haven't
triggered OOM by running stress.sh so far, so I can't locate this bug.)

Thanks
Miao

 

 I'll read more today.

 -chris

 


Re: [PATCH v2 3/6] btrfs: add scrub code and prototypes

2011-03-23 Thread Arne Jansen
On 22.03.2011 17:38, David Sterba wrote:

 David Sterba wrote:
 On Fri, Mar 11, 2011 at 03:49:40PM +0100, Arne Jansen wrote:
 This is the main scrub code.


 sizeof(struct scrub_dev) == 18760 on an x86_64, an order 3 allocation in
 scrub_setup_dev()

 Is this a problem? There are only few allocations of it, one per device.
 
 High order allocations may fail when memory is fragmented, and should be
 avoided when possible. (And it is possible here: allocate each 'struct
 scrub_bio' separately and fill the bios array with pointers.) The scrub
 ioctl may fail to start until an order-3 allocation becomes available.
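
For context, the allocation order follows directly from the size: with 4 KiB
pages, the 18760-byte struct needs 5 pages, which rounds up to the next power
of two, 8 contiguous pages, i.e. order 3. A small sketch of that arithmetic
(the helper name is made up; the kernel's equivalent is get_order()):

```c
#include <stddef.h>

/* Toy version of the kernel's get_order(): what power-of-two number of
 * contiguous 4 KiB pages does an allocation of 'size' bytes need?
 * sizeof(struct scrub_dev) == 18760 needs 5 pages, rounded up to the
 * next power of two = 8 pages, i.e. an order-3 allocation. */
static int alloc_order(size_t size)
{
	size_t pages = (size + 4095) / 4096;	/* pages needed, rounded up */
	int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}
```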
 

I updated this in my git repo.

Thanks,
Arne


Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Arne Jansen
On 23.03.2011 14:06, Arne Jansen wrote:
 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.
 
 a) by tree enumeration with reada=2
reading extent tree: 242s
reading csum tree: 140s


reading both trees: 324s
sorry, the number is wrong, it should be 384s (just the sum of both, minus
rounding errors).


 
 b) prefetch prototype
reading extent tree: 23.5s
reading csum tree: 20.4s
reading both trees: 25.7s
 
 The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled
 28%. It is created with the current git tree + the round robin patch and
 filled with
 
 fs_mark -D 512 -t 16 -n 4096 -F -S0
 
 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
 with path->reada=2. Both trees are being enumerated one after the other.
 The prototype currently just uses raw bios, does not make use of the
 page cache and does not enter the read pages into the cache. This will
 probably add some overhead. It also does not check the crcs.
 
 While it is very promising to implement it for scrub, I think a more
 general interface which can be used for every enumeration would be
 beneficial. Use cases that come to mind are rebalance, reflink, deletion
 of large files, listing of large directories etc..
 
 I'd imagine an interface along the lines of
 
 int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
 int btrfs_readahead_add(struct btrfs_root *root,
 struct btrfs_key *start,
 struct btrfs_key *end,
 struct btrfs_reada_ctx *reada);
 void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);
 
 to trigger the readahead of parts of a tree. Multiple readahead
 requests can be given before waiting. This would enable the very
 beneficial folding seen above for 'reading both trees'.
 
 Also it would be possible to add a cascading readahead, where the
 content of leaves would trigger readaheads in other trees, maybe by
 giving a callback for the decisions what to read instead of the fixed
 start/end range.
 
 For the implementation I'd need an interface which I haven't been able
 to find yet. Currently I can trigger the read of several pages / tree
 blocks and wait for the completion of each of them. What I'd need would
 be an interface that gives me a callback on each completion or a waiting
 function that wakes up on each completion with the information which
 pages just completed.
 One way to achieve this would be to add a hook, but I gladly take any
 implementation hints.
 
 --
 Arne
 
 


[RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Tejun Heo
Hello, guys.

I've been playing with locking in btrfs which has developed custom
locking to avoid excessive context switches in its btree
implementation.

Generally, doing away with the custom implementation and just using
the mutex adaptive owner spinning seems better; however, there's an
interesting distinction in the custom implementation of trylock.  It
distinguishes between simple trylock and tryspin, where the former
just tries once and then fails while the latter does some spinning
before giving up.

Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
just once.  I got curious whether using adaptive spinning on
mutex_trylock() would be beneficial and it seems so, at least for
btrfs anyway.

The following results are from a dbench 50 run on a two-socket, eight-core
Opteron machine with 4GiB of memory and an OCZ Vertex SSD.
During the run, disk stays mostly idle and all CPUs are fully occupied
and the difference in locking performance becomes quite visible.

SIMPLE is with the locking simplification patch[1] applied.  i.e. it
basically just uses mutex.  SPIN is with the patch at the end of this
message applied on top - mutex_trylock() uses adaptive spinning.

USER   SYSTEM   SIRQCXTSW  THROUGHPUT
 SIMPLE 61107  354977217  8099529  845.100 MB/sec
 SPIN   63140  364888214  6840527  879.077 MB/sec

I've been running various different configs from yesterday and the
adaptive spinning trylock consistently posts higher throughput.  The
amount of difference varies but it outperforms consistently.

In general, I think using adaptive spinning on trylock makes sense as
trylock failure usually leads to a costly unlock-relock sequence.
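
As a userspace analogy (NOT the kernel patch, which spins adaptively while the
lock owner runs on another CPU), a trylock that spins a bounded number of
times before giving up could look like this; `mutex_tryspin` and the
`max_spins` knob are invented names for illustration:

```c
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

/* Userspace sketch of "trylock with some spinning": retry
 * pthread_mutex_trylock() up to max_spins times before failing,
 * on the theory that a brief spin is cheaper than the unlock-relock
 * dance a failed trylock usually triggers in the caller. */
static bool mutex_tryspin(pthread_mutex_t *lock, int max_spins)
{
	for (int i = 0; i <= max_spins; i++) {
		if (pthread_mutex_trylock(lock) == 0)
			return true;	/* acquired */
		sched_yield();		/* stand-in for cpu_relax() */
	}
	return false;			/* gave up */
}
```

The kernel version replaces the fixed retry count with "spin while the owner
is on-CPU", which is what makes it adaptive.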

Also, contrary to the comment, the adaptive spinning doesn't seem to
check whether there are pending waiters or not.  Is this intended, or
did the test get lost somehow?

Thanks.

NOT-Signed-off-by: Tejun Heo t...@kernel.org
---
 kernel/mutex.c |   98 +++--
 1 file changed, 61 insertions(+), 37 deletions(-)

Index: work/kernel/mutex.c
===
--- work.orig/kernel/mutex.c
+++ work/kernel/mutex.c
@@ -126,39 +126,33 @@ void __sched mutex_unlock(struct mutex *
 
 EXPORT_SYMBOL(mutex_unlock);
 
-/*
- * Lock a mutex (possibly interruptible), slowpath:
+/**
+ * mutex_spin - optimistic spinning on mutex
+ * @lock: mutex to spin on
+ *
+ * This function implements optimistic spin for acquisition of @lock when
+ * we find that there are no pending waiters and the lock owner is
+ * currently running on a (different) CPU.
+ *
+ * The rationale is that if the lock owner is running, it is likely to
+ * release the lock soon.
+ *
+ * Since this needs the lock owner, and this mutex implementation doesn't
+ * track the owner atomically in the lock field, we need to track it
+ * non-atomically.
+ *
+ * We can't do this for DEBUG_MUTEXES because that relies on wait_lock to
+ * serialize everything.
+ *
+ * CONTEXT:
+ * Preemption disabled.
+ *
+ * RETURNS:
+ * %true if @lock is acquired, %false otherwise.
  */
-static inline int __sched
-__mutex_lock_common(struct mutex *lock, long state, unsigned int subclass,
-   unsigned long ip)
+static inline bool mutex_spin(struct mutex *lock)
 {
-   struct task_struct *task = current;
-   struct mutex_waiter waiter;
-   unsigned long flags;
-
-   preempt_disable();
-   mutex_acquire(&lock->dep_map, subclass, 0, ip);
-
 #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
-   /*
-* Optimistic spinning.
-*
-* We try to spin for acquisition when we find that there are no
-* pending waiters and the lock owner is currently running on a
-* (different) CPU.
-*
-* The rationale is that if the lock owner is running, it is likely to
-* release the lock soon.
-*
-* Since this needs the lock owner, and this mutex implementation
-* doesn't track the owner atomically in the lock field, we need to
-* track it non-atomically.
-*
-* We can't do this for DEBUG_MUTEXES because that relies on wait_lock
-* to serialize everything.
-*/
-
for (;;) {
struct thread_info *owner;
 
@@ -177,12 +171,8 @@ __mutex_lock_common(struct mutex *lock,
if (owner && !mutex_spin_on_owner(lock, owner))
break;
 
-   if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
-   lock_acquired(&lock->dep_map, ip);
-   mutex_set_owner(lock);
-   preempt_enable();
-   return 0;
-   }
+   if (atomic_cmpxchg(&lock->count, 1, 0) == 1)
+   return true;
 
/*
 * When there's no owner, we might have preempted between the
@@ -190,7 +180,7 @@ __mutex_lock_common(struct mutex *lock,
 * we're an RT task that will 

Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Linus Torvalds
On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote:

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, at least for
 btrfs anyway.

Hmm. Seems reasonable to me. The patch looks clean, although part of
that is just the mutex_spin() cleanup that is independent of actually
using it in trylock.

So no objections from me.

Linus


Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Tejun Heo
On Wed, Mar 23, 2011 at 08:48:01AM -0700, Linus Torvalds wrote:
 On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote:
 
  Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
  just once.  I got curious whether using adaptive spinning on
  mutex_trylock() would be beneficial and it seems so, at least for
  btrfs anyway.
 
 Hmm. Seems reasonable to me. The patch looks clean, although part of
 that is just the mutex_spin() cleanup that is independent of actually
 using it in trylock.

Oh, I have two split patches.  Posted the combined one for comments.

 So no objections from me.

Awesome.  Peter, what do you think?  Are there some other tests which
can be useful?

Thanks.

-- 
tejun


Re: efficiency of btrfs cow

2011-03-23 Thread Chester
I'm not a developer, but I think it goes something like this:
btrfs doesn't write the filesystem on the entire device/partition at
format time; rather, it dynamically increases the size of the
filesystem as data is used. That's why formatting a disk in btrfs can
be so fast.

On Wed, Mar 23, 2011 at 12:39 PM, Brian J. Murrell
br...@interlinx.bc.ca wrote:

 On 11-03-06 11:06 AM, Calvin Walton wrote:
 
  To see exactly what's going on, you should use the btrfs filesystem df
  command to see how space is being allocated for data and metadata
  separately:

 OK.  So with an empty filesystem, before my first copy (i.e. the base on
 which the next copy will CoW from) df reports:

 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/mapper/btrfs--test-btrfs--test
                     922746880        56 922746824   1% /mnt/btrfs-test

 and btrfs fi df reports:

 Data: total=8.00MB, used=0.00
 Metadata: total=1.01GB, used=24.00KB
 System: total=12.00MB, used=4.00KB

 after the first copy df and btrfs fi df report:

 Filesystem           1K-blocks      Used Available Use% Mounted on
 /dev/mapper/btrfs--test-btrfs--test
                     922746880 121402328 801344552  14% /mnt/btrfs-test

 root@linux:/mnt/btrfs-test# cat .snapshots/monthly.22/metadata/btrfs_df-stop
 Data: total=110.01GB, used=109.26GB
 Metadata: total=5.01GB, used=3.26GB
 System: total=12.00MB, used=24.00KB

 So it's clear that total usage (as reported by df) was 121,402,328KB but
 Metadata has two values:

 Metadata: total=5.01GB, used=3.26GB

 What's the difference between total and used?  And for that matter,
 what's the difference between the total and used for Data
 (total=110.01GB, used=109.26GB)?

 Even if I take the largest values (i.e. the total values) for Data and
 Metadata (each converted to KB first) and add them up they are:
 120,607,211.52 which is not quite the 121,402,328 that df reports.
 There is a 795,116.48KB discrepancy.

 In any case, which value from a btrfs df fi should I be subtracting from
 df's accounting to get a real accounting of the amount of data used?

 Cheers,
 b.



Re: efficiency of btrfs cow

2011-03-23 Thread Brian J. Murrell
On 11-03-23 11:53 AM, Chester wrote:
 I'm not a developer, but I think it goes something like this:
 btrfs doesn't write the filesystem on the entire device/partition at
 format time, rather, it dynamically increases the size of the
 filesystem as data is used. That's why formating a disk in btrfs can
 be so fast.

Indeed, this much is understood, which is why I am using btrfs fi df to
try to determine how much of the increase in raw device usage is the
dynamic allocation of metadata.

Cheers,
b.





stratified B-trees

2011-03-23 Thread Karn Kallio
I just noticed this today on the arXiv: http://xxx.lanl.gov/abs/1103.4282
The paper describes stratified B-trees; quoting from the abstract:


We describe the `stratified B-tree', which beats the CoW B-tree in every way. 
In particular, it is the first versioned dictionary to achieve optimal 
tradeoffs between space, query and update performance. Therefore, we believe 
there is no longer a good reason to use CoW B-trees for versioned data stores.


The paper mentions that a company called Acunu is developing an 
implementation.  

Are these stratified B-trees something which the btrfs project could use?


Re: [PATCH v4 3/6] btrfs: add scrub code and prototypes

2011-03-23 Thread David Sterba
Hi,

I'm reviewing the atomic counters and the wait/wake infrastructure,
just found two missed mutex_unlock()s in btrfs_scrub_dev() in error
paths.

On Fri, Mar 18, 2011 at 04:55:06PM +0100, Arne Jansen wrote:
 This is the main scrub code.
 
 Updates v3:
 
  - fixed EIO handling, need to reallocate bio instead of reusing it
  - Updated according to David Sterba's review
  - don't clobber bi_flags on reuse, just set UPTODATE
  - allocate long living bios with bio_kmalloc instead of bio_alloc
 
 Updates v4:
 
  - don't restart chunk on commit
  - each EIO leaked a bio
  - the BIO_UPTODATE check was wrong
  - removed some trailing whitespace
  - nstripes int - u64
  - %lld - %llu
  - extent_map reference not freed, leaking them on unmount
  - removed unnecessary mutex locks around 'scrubs_running'
 
 Thanks to Ilya Dryomov and Jan Schmidt for pointing most of those out.
 ---
  fs/btrfs/Makefile |2 +-
  fs/btrfs/ctree.h  |   14 +
  fs/btrfs/scrub.c  | 1496 
 +
  3 files changed, 1511 insertions(+), 1 deletions(-)
 
 diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
 index 31610ea..8fda313 100644
 --- a/fs/btrfs/Makefile
 +++ b/fs/btrfs/Makefile
 @@ -7,4 +7,4 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
 root-tree.o dir-item.o \
  extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
  extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
  export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
 -compression.o delayed-ref.o relocation.o
 +compression.o delayed-ref.o relocation.o scrub.o
 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index fd2b92f..a89a205 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -2619,4 +2619,18 @@ void btrfs_reloc_pre_snapshot(struct 
 btrfs_trans_handle *trans,
 u64 *bytes_to_reserve);
  void btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 struct btrfs_pending_snapshot *pending);
 +
 +/* scrub.c */
 +int btrfs_scrub_dev(struct btrfs_root *root, u64 devid, u64 start, u64 end,
 +struct btrfs_scrub_progress *progress);
 +int btrfs_scrub_pause(struct btrfs_root *root);
 +int btrfs_scrub_pause_super(struct btrfs_root *root);
 +int btrfs_scrub_continue(struct btrfs_root *root);
 +int btrfs_scrub_continue_super(struct btrfs_root *root);
 +int btrfs_scrub_cancel(struct btrfs_root *root);
 +int btrfs_scrub_cancel_dev(struct btrfs_root *root, struct btrfs_device 
 *dev);
 +int btrfs_scrub_cancel_devid(struct btrfs_root *root, u64 devid);
 +int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 + struct btrfs_scrub_progress *progress);
 +
  #endif
 diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
 new file mode 100644
 index 000..85a4d4b
 --- /dev/null
 +++ b/fs/btrfs/scrub.c
 @@ -0,0 +1,1496 @@
 +/*
 + * Copyright (C) 2011 STRATO.  All rights reserved.
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public
 + * License v2 as published by the Free Software Foundation.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
 + * General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public
 + * License along with this program; if not, write to the
 + * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
 + * Boston, MA 021110-1307, USA.
 + */
 +
 +#include <linux/sched.h>
 +#include <linux/pagemap.h>
 +#include <linux/writeback.h>
 +#include <linux/blkdev.h>
 +#include <linux/rbtree.h>
 +#include <linux/slab.h>
 +#include <linux/workqueue.h>
 +#include "ctree.h"
 +#include "volumes.h"
 +#include "disk-io.h"
 +#include "ordered-data.h"
 +
 +/*
 + * This is only the first step towards a full-features scrub. It reads all
 + * extent and super block and verifies the checksums. In case a bad checksum
 + * is found or the extent cannot be read, good data will be written back if
 + * any can be found.
 + *
 + * Future enhancements:
 + *  - To enhance the performance, better read-ahead strategies for the
 + *extent-tree can be employed.
 + *  - In case an unrepairable extent is encountered, track which files are
 + *affected and report them
 + *  - In case of a read error on files with nodatasum, map the file and read
 + *the extent to trigger a writeback of the good copy
 + *  - track and record media errors, throw out bad devices
 + *  - add a readonly mode
 + *  - add a mode to also read unallocated space
 + *  - make the prefetch cancellable
 + */
 +
 +#ifdef SCRUB_BTRFS_WORKER
 +typedef struct btrfs_work scrub_work_t;
 +#define SCRUB_INIT_WORK(work, fn) do { (work)->func = (fn); } while (0)
 +#define SCRUB_QUEUE_WORK(wq, w) do { btrfs_queue_worker((wq), w); } 

Re: [PATCH v4 4/6] btrfs: sync scrub with commit device removal

2011-03-23 Thread David Sterba
Hi,

you are adding a new smp_mb(); can you please explain why it's needed
and document it?

thanks,
dave

On Fri, Mar 18, 2011 at 04:55:07PM +0100, Arne Jansen wrote:
 This adds several synchronizations:
  - for a transaction commit, the scrub gets paused before the
tree roots are committed until the super are safely on disk
  - during a log commit, scrubbing of supers is disabled
  - on unmount, the scrub gets cancelled
  - on device removal, the scrub for the particular device gets cancelled
 
 Signed-off-by: Arne Jansen sensi...@gmx.net
 ---
  fs/btrfs/disk-io.c |1 +
  fs/btrfs/transaction.c |3 +++
  fs/btrfs/tree-log.c|2 ++
  fs/btrfs/volumes.c |2 ++
  4 files changed, 8 insertions(+), 0 deletions(-)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 3e1ea3e..924a366 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2493,6 +2493,7 @@ int close_ctree(struct btrfs_root *root)
  fs_info->closing = 1;
   smp_mb();
  
 + btrfs_scrub_cancel(root);
   btrfs_put_block_group_cache(fs_info);
  
   /*
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 3d73c8d..5a43b20 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -1310,6 +1310,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
 *trans,
  
   WARN_ON(cur_trans != trans-transaction);
  
 + btrfs_scrub_pause(root);
   /* btrfs_commit_tree_roots is responsible for getting the
* various roots consistent with each other.  Every pointer
* in the tree of tree roots has to point to the most up to date
 @@ -1391,6 +1392,8 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
 *trans,
  
  mutex_unlock(&root->fs_info->trans_mutex);
  
 + btrfs_scrub_continue(root);
 +
   if (current-journal_info == trans)
   current-journal_info = NULL;
  
 diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
 index 1f6788f..2be84fa 100644
 --- a/fs/btrfs/tree-log.c
 +++ b/fs/btrfs/tree-log.c
 @@ -2098,7 +2098,9 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans,
* the running transaction open, so a full commit can't hop
* in and cause problems either.
*/
 + btrfs_scrub_pause_super(root);
  write_ctree_super(trans, root->fs_info->tree_root, 1);
 + btrfs_scrub_continue_super(root);
   ret = 0;
  
   mutex_lock(root-log_mutex);
 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 7dc9fa5..ad3ea88 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1330,6 +1330,8 @@ int btrfs_rm_device(struct btrfs_root *root, char 
 *device_path)
   goto error_undo;
  
  device->in_fs_metadata = 0;
 + smp_mb();


 + btrfs_scrub_cancel_dev(root, device);
  
   /*
* the device list mutex makes sure that we don't change
 -- 
 1.7.3.4
 


Re: efficiency of btrfs cow

2011-03-23 Thread Kolja Dummann
 So it's clear that total usage (as reported by df) was 121,402,328KB but
 Metadata has two values:

 Metadata: total=5.01GB, used=3.26GB

 What's the difference between total and used?  And for that matter,
 what's the difference between the total and used for Data
 (total=110.01GB, used=109.26GB)?


total is the space allocated (reserved) for a kind of usage (metadata or
data); space allocated for one kind of usage can't be used for anything
else. The used value is the space actually used out of what has been
allocated for that kind of usage.

The wiki gives you an overview of how to interpret the values:

https://btrfs.wiki.kernel.org/index.php/FAQ#btrfs_filesystem_df_.2Fmountpoint
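
Brian's arithmetic upthread can be reproduced from that definition: summing
the *total* (allocated) values of Data (110.01 GB) and Metadata (5.01 GB),
each converted to KB, gives his 120,607,211.52 KB, which leaves his
795,116.48 KB gap against the 121,402,328 KB df reported. A quick check
(helper names are made up for illustration):

```c
/* Reproduce the numbers from Brian's mail: allocated space is the sum
 * of the per-kind "total" values, converted GB -> KB in base 2. The
 * remaining gap vs. df's 121,402,328 KB is the System chunk plus
 * rounding of the two-decimal GB figures. */
static double gb_to_kb(double gb)
{
	return gb * 1024.0 * 1024.0;	/* GB -> MB -> KB */
}

static double allocated_kb(double data_total_gb, double metadata_total_gb)
{
	return gb_to_kb(data_total_gb) + gb_to_kb(metadata_total_gb);
}
```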

cheers Kolja.


Re: stratified B-trees

2011-03-23 Thread Andi Kleen
Karn Kallio tierplusplusli...@gmail.com writes:

 Are these stratified B-trees something which the btrfs project could use?

The current b*tree is pretty much hardcoded in the disk format, so it 
would be hard to change in a compatible way.

-Andi
-- 
a...@linux.intel.com -- Speaking for myself only


Re: stratified B-trees

2011-03-23 Thread Ezra Ulembeck
On Wed, Mar 23, 2011 at 5:38 PM, Karn Kallio
tierplusplusli...@gmail.com wrote:
 I just noticed this out today on the arXiv : http://xxx.lanl.gov/abs/1103.4282
 The paper describes stratified B-trees and quoting from the abstract:



LOL.
It looks like this paper was generated by a robot:

... Stratified B-trees don’t need block-size tuning, unlike B-trees.
One major advantage is that they are naturally good candidates for
SSDs – the Intel X25M can perform 35,000 random 4K reads/s,
but must write in units of many MBs in order to fully utilise its performance.
This massive asymmetry in block size makes life very hard...

How do you like:
to utilise performance,
massive asymmetry in block size..


 
 We describe the `stratified B-tree', which beats the CoW B-tree in every way.
 In particular, it is the first versioned dictionary to achieve optimal
 tradeoffs between space, query and update performance. Therefore, we believe
 there is no longer a good reason to use CoW B-trees for versioned data stores.
 

 The paper mentions that a company called Acunu is developing an
 implementation.

 Are these stratified B-trees something which the btrfs project could use?


Btrfs wins Linux New Media Award

2011-03-23 Thread Chris Mason
Hi everyone,

During the last Cebit conference, Btrfs was presented with an award for
the most innovative open source project.

I'd like to thank everyone at Linux magazine involved with selecting us,
and since we have so many contributors I wanted to share a picture of
the shiny award:

http://oss.oracle.com/~mason/btrfs/btrfs_award.jpg

Thanks everyone!

Chris


Recovering parent transid verify failed

2011-03-23 Thread Luke Sheldrick
Hi,

I'm having the same issues as previously mentioned.

Apparently the new fsck tool will be able to recover from this?

A few questions: is there a git version I can compile and use for this
already?

If not, is there any indication of when it will be released?

---
Luke Sheldrick
e: l...@sheldrick.co.uk
p: 07880 725099


Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:
 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

 b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

A 10x speed-up looks impressive indeed. Just to be sure I understand, did
I get you right in that you attribute this effect specifically to
enumerating tree leaves in key order vs. disk order when the two are not
aligned?

Regards,
Andrey
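
The gist of Arne's prototype as described above — collect the block addresses
top-down in key order, then issue the reads ordered by disk offset so a
random-read pattern becomes a mostly sequential one — can be illustrated with
a toy sort (hypothetical code, not the prototype's):

```c
#include <stdlib.h>

/* Tree enumeration yields block byte-offsets in key order; sorting
 * them before submitting the reads lets the disk stream through them
 * instead of seeking for every block. */
static int cmp_offset(const void *a, const void *b)
{
	unsigned long long x = *(const unsigned long long *)a;
	unsigned long long y = *(const unsigned long long *)b;

	return (x > y) - (x < y);	/* avoid overflow of x - y */
}

static void order_reads(unsigned long long *offsets, size_t n)
{
	qsort(offsets, n, sizeof(*offsets), cmp_offset);
}
```

The real prototype additionally distributes the sorted batches across all
disks, which is where the multi-spindle parallelism comes from.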


 The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled
 28%. It is created with the current git tree + the round robin patch and
 filled with

 fs_mark -D 512 -t 16 -n 4096 -F -S0

 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
 with path->reada=2. Both trees are being enumerated one after the other.
 The prototype currently just uses raw bios, does not make use of the
 page cache and does not enter the read pages into the cache. This will
 probably add some overhead. It also does not check the crcs.

 While it is very promising to implement it for scrub, I think a more
 general interface which can be used for every enumeration would be
 beneficial. Use cases that come to mind are rebalance, reflink, deletion
 of large files, listing of large directories etc..

 I'd imagine an interface along the lines of

 int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
 int btrfs_readahead_add(struct btrfs_root *root,
                        struct btrfs_key *start,
                        struct btrfs_key *end,
                        struct btrfs_reada_ctx *reada);
 void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);

 to trigger the readahead of parts of a tree. Multiple readahead
 requests can be given before waiting. This would enable the very
 beneficial folding seen above for 'reading both trees'.

 Also it would be possible to add a cascading readahead, where the
 content of leaves would trigger readaheads in other trees, maybe by
 giving a callback for the decisions what to read instead of the fixed
 start/end range.

 For the implementation I'd need an interface which I haven't been able
 to find yet. Currently I can trigger the read of several pages / tree
 blocks and wait for the completion of each of them. What I'd need would
 be an interface that gives me a callback on each completion or a waiting
 function that wakes up on each completion with the information which
 pages just completed.
 One way to achieve this would be to add a hook, but I gladly take any
 implementation hints.

 --
 Arne




Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Chris Mason
Excerpts from Arne Jansen's message of 2011-03-23 09:06:02 -0400:
 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.
 
 a) by tree enumeration with reada=2
reading extent tree: 242s
reading csum tree: 140s
reading both trees: 324s
 
 b) prefetch prototype
reading extent tree: 23.5s
reading csum tree: 20.4s
reading both trees: 25.7s

Very nice, btrfsck does something similar.

 
 The test setup consists of a 7 Seagate ES.2 1TB disks filesystem, filled
 28%. It is created with the current git tree + the round robin patch and
 filled with
 
 fs_mark -D 512 -t 16 -n 4096 -F -S0
 
 The 'normal' read is done by enumerating the leaves by btrfs_next_leaf()
 with path->reada=2. Both trees are being enumerated one after the other.
 The prototype currently just uses raw bios, does not make use of the
 page cache and does not enter the read pages into the cache. This will
 probably add some overhead. It also does not check the crcs.
 
 While implementing it for scrub alone is already very promising, I think
 a more general interface that can be used by every enumeration would be
 beneficial. Use cases that come to mind are rebalance, reflink, deletion
 of large files, listing of large directories, etc.
 
 I'd imagine an interface along the lines of
 
 int btrfs_readahead_init(struct btrfs_reada_ctx *reada);
 int btrfs_readahead_add(struct btrfs_root *root,
 struct btrfs_key *start,
 struct btrfs_key *end,
 struct btrfs_reada_ctx *reada);
 void btrfs_readahead_wait(struct btrfs_reada_ctx *reada);
 
 to trigger the readahead of parts of a tree. Multiple readahead
 requests can be given before waiting. This would enable the very
 beneficial folding seen above for 'reading both trees'.
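Usage could then look like this (pseudocode against the proposed interface above, which does not exist yet; key ranges and root names are placeholders):

```c
/* sketch only -- the proposed API, not existing code */
struct btrfs_reada_ctx reada;

btrfs_readahead_init(&reada);

/* queue both trees before waiting, so the reads can be folded
 * and sorted across all disks */
btrfs_readahead_add(extent_root, &start_key, &end_key, &reada);
btrfs_readahead_add(csum_root, &start_key, &end_key, &reada);

btrfs_readahead_wait(&reada);	/* returns once all queued blocks are in */
```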
 
 Also it would be possible to add a cascading readahead, where the
 content of leaves would trigger readaheads in other trees, maybe by
 giving a callback for the decision of what to read instead of a fixed
 start/end range.
 
 For the implementation I'd need an interface which I haven't been able
 to find yet. Currently I can trigger the read of several pages / tree
 blocks and wait for the completion of each of them. What I'd need is
 an interface that gives me a callback on each completion, or a waiting
 function that wakes up on each completion with information about which
 pages just completed.
 One way to achieve this would be to add a hook, but I'd gladly take any
 implementation hints.

We have the bio endio call backs for this, I think that's the only thing
you can use.

-chris


Re: [RFC PATCH] mutex: Apply adaptive spinning on mutex_trylock()

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 6:48 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Wed, Mar 23, 2011 at 8:37 AM, Tejun Heo t...@kernel.org wrote:

 Currently, mutex_trylock() doesn't use adaptive spinning.  It tries
 just once.  I got curious whether using adaptive spinning on
 mutex_trylock() would be beneficial and it seems so, at least for
 btrfs anyway.

 Hmm. Seems reasonable to me.

TAS/spin with exponential back-off has been the preferred locking
approach in Postgres (and, I believe, in other DBMSes) for years, at
least since '04 when I last touched the Postgres code. Even though the
cost of a 'false negative' in user space is much higher than in the
kernel, it's still just a question of scale (no wonder a measurable
improvement is reported here from dbench on an SSD capable of a few
dozen thousand IOPS).

Regards,
Andrey

 The patch looks clean, although part of that is just the mutex_spin()
 cleanup that is independent of actually using it in trylock.

 So no objections from me.

                    Linus


Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Arne Jansen

On 23.03.2011 20:26, Andrey Kuzmin wrote:

On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:

While looking into the performance of scrub I noticed that a significant
amount of time is being used for loading the extent tree and the csum
tree. While this is no surprise I did some prototyping on how to improve
on it.
The main idea is to load the tree (or parts of it) top-down, order the
needed blocks and distribute it over all disks.
To keep you interested, some results first.

a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s


10x speed-up looks indeed impressive. Just to be sure I understand: do
you attribute this effect specifically to enumerating tree leaves in
key order vs. disk order when the two do not coincide?


Yes. Leaves and the intermediate nodes tend to be quite scattered
around the disk with respect to their logical order.
Reading them in logical (ascending/descending) order requires lots
of seeks.




--
Arne




Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Andrey Kuzmin
On Wed, Mar 23, 2011 at 11:28 PM, Arne Jansen sensi...@gmx.net wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:

 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:

 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
   reading extent tree: 242s
   reading csum tree: 140s
   reading both trees: 324s

 b) prefetch prototype
   reading extent tree: 23.5s
   reading csum tree: 20.4s
   reading both trees: 25.7s

 10x speed-up looks indeed impressive. Just for me to be sure, did I
 get you right in that you attribute this effect specifically to
 enumerating tree leaves in key address vs. disk addresses when these
 two are not aligned?

 Yes. Leaves and the intermediate nodes tend to be quite scattered
 around the disk with respect to their logical order.
 Reading them in logical (ascending/descending) order requires lots
 of seeks.

And the patch actually does on-the-fly defragmentation, right? Why
lose it then? :)

Regards,
Andrey







Re: Recovering parent transid verify failed

2011-03-23 Thread Chris Mason
Excerpts from Luke Sheldrick's message of 2011-03-23 14:12:45 -0400:
 Hi,
 
 I'm having the same issues as previously mentioned.
 
 Apparently the new fsck tool will be able to recover this?
 
 Few questions, is there a GIT version I can compile and use already for this?
 
 If not, is there any indication of when this will be released?

Yes, I'm still hammering out a reliable way to resolve most of these.
But please post the messages you're hitting; it is actually a very
generic problem and has many different causes.

What happened to your FS that made them come up?

Which kernel were you running and what was the FS built on top of?

What happens when you grab the latest btrfsck from git and do:

btrfsck -s 1 /dev/xxx

-chris


Re: Btrfs wins Linux New Media Award

2011-03-23 Thread Ric Wheeler

On 03/23/2011 02:17 PM, Chris Mason wrote:

Hi everyone,

During the last Cebit conference, Btrfs was presented with an award for
the most innovative open source project.

I'd like to thank everyone at Linux magazine involved with selecting us,
and since we have so many contributors I wanted to share a picture of
the shiny award:

http://oss.oracle.com/~mason/btrfs/btrfs_award.jpg

Thanks everyone!

Chris


Congratulations! Very much deserved,

Ric



Re: [RFC] Tree fragmentation and prefetching

2011-03-23 Thread Miao Xie
On Wed, 23 Mar 2011 21:28:25 +0100, Arne Jansen wrote:
 On 23.03.2011 20:26, Andrey Kuzmin wrote:
 On Wed, Mar 23, 2011 at 4:06 PM, Arne Jansen sensi...@gmx.net wrote:
 While looking into the performance of scrub I noticed that a significant
 amount of time is being used for loading the extent tree and the csum
 tree. While this is no surprise I did some prototyping on how to improve
 on it.
 The main idea is to load the tree (or parts of it) top-down, order the
 needed blocks and distribute it over all disks.
 To keep you interested, some results first.

 a) by tree enumeration with reada=2
reading extent tree: 242s
reading csum tree: 140s
reading both trees: 324s

 b) prefetch prototype
reading extent tree: 23.5s
reading csum tree: 20.4s
reading both trees: 25.7s

 10x speed-up looks indeed impressive. Just for me to be sure, did I
 get you right in that you attribute this effect specifically to
 enumerating tree leaves in key address vs. disk addresses when these
 two are not aligned?
 
 Yes. Leaves and the intermediate nodes tend to be quite scattered
 around the disk with respect to their logical order.
 Reading them in logical (ascending/descending) order requires lots
 of seeks.

I'm also dealing with the tree fragmentation problem; I'm trying to
store leaves which have the same parent close together.

Regards
Miao

 




Re: [PATCH V4] btrfs: implement delayed inode items operation

2011-03-23 Thread Itaru Kitayama
On Wed, 23 Mar 2011 17:47:01 +0800
Miao Xie mi...@cn.fujitsu.com wrote:

 We found that GFP_KERNEL was passed into kzalloc(); I think this flag
 triggers the above lockdep warning. The attached patch, which is against
 the delayed items operation patch, may fix this problem. Could you test
 it for me?

The possible irq lock inversion dependency warning seems to go away.   
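For context, the usual shape of such a fix is to change the allocation flags where the allocation happens under a lock that lockdep also sees taken from reclaim or irq context (a hypothetical sketch only; the actual attached patch may do something different):

```c
/* sketch only -- not the attached patch */

/* before: GFP_KERNEL can sleep and recurse into the fs via reclaim */
item = kzalloc(sizeof(*item), GFP_KERNEL);

/* after: forbid fs re-entry during reclaim ... */
item = kzalloc(sizeof(*item), GFP_NOFS);

/* ... or, if the caller can be in atomic context, don't sleep at all */
item = kzalloc(sizeof(*item), GFP_ATOMIC);
```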

itaru   