[PATCH][BTRFS] raid5/6: chunk allocation

2013-02-17 Thread Goffredo Baroncelli
Hi Chris,

I am playing with the raid5/6 code, adapting my disk-usage
patches to it.
During this development I found that the chunk allocation behaves strangely.
Looking at the code, I found the following in volumes.c:

3576 static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,

3730 /*
3731  * this will have to be fixed for RAID1 and RAID10 over
3732  * more drives
3733  */
3734 data_stripes = num_stripes / ncopies;
3735
3736 if (stripe_size * ndevs > max_chunk_size * ncopies) {
3737 stripe_size = max_chunk_size * ncopies;
3738 do_div(stripe_size, ndevs);
3739 }

This code decides how big a chunk is, following two main rules:
1) the chunk stripe shall be less than max_stripe_size
2) the chunk capacity (the space usable by the user) shall
be less than max_chunk_size.

The code above works well for RAID0/RAID1/DUP/SINGLE/RAID10,
but doesn't play well with RAID5/6. In fact, when the chunk
type is BTRFS_BLOCK_GROUP_METADATA, both max_stripe_size
and max_chunk_size are 1GB. If the number of devices (ndevs) is 7
and the raid profile is RAID6, then ncopies is 3 and the stripe_size is
1GB*3/7 = 438MB, which leads to a chunk size of 2.14GB! This is
not the expected value.
I think that we should change the test above, in the RAID6 case, to:

data_stripes = ndevs - 2;
if (stripe_size * data_stripes > max_chunk_size) {
stripe_size = max_chunk_size;
do_div(stripe_size, data_stripes);
}


The patch below should solve this issue, and it also cleans up the logic
a bit by separating the raid5/raid6 code from that of the other
raid profiles.

Anyway, I would like to point out another possible issue:
fragmentation. To avoid fragmentation, should we round up
the stripe size to a saner value, like 256MB?

I know that this could lead to an insane chunk size when
the number of disks is high; but the current logic (where
the stripe_size is equal to the chunk_size / number_of_devices)
could lead to fragmentation problems when different raid profiles
are used together.

BR
G.Baroncelli

Signed-off-by: Goffredo Baroncelli kreij...@inwind.it

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c372264..88d17b4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3724,25 +3724,32 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
stripe_size = devices_info[ndevs-1].max_avail;
num_stripes = ndevs * dev_stripes;
 
-   /*
-    * this will have to be fixed for RAID1 and RAID10 over
-    * more drives
-    */
-   data_stripes = num_stripes / ncopies;
-
-   if (stripe_size * ndevs > max_chunk_size * ncopies) {
-   stripe_size = max_chunk_size * ncopies;
-   do_div(stripe_size, ndevs);
-   }
if (type & BTRFS_BLOCK_GROUP_RAID5) {
raid_stripe_len = find_raid56_stripe_len(ndevs - 1,
 btrfs_super_stripesize(info->super_copy));
-   data_stripes = num_stripes - 1;
-   }
-   if (type & BTRFS_BLOCK_GROUP_RAID6) {
+   data_stripes = ndevs - 1;
+   if (stripe_size * data_stripes > max_chunk_size) {
+   stripe_size = max_chunk_size;
+   do_div(stripe_size, data_stripes);
+   }
+   } else if (type & BTRFS_BLOCK_GROUP_RAID6) {
raid_stripe_len = find_raid56_stripe_len(ndevs - 2,
 btrfs_super_stripesize(info->super_copy));
-   data_stripes = num_stripes - 2;
+   data_stripes = ndevs - 2;
+   if (stripe_size * data_stripes > max_chunk_size) {
+   stripe_size = max_chunk_size;
+   do_div(stripe_size, data_stripes);
+   }
+   } else { /* RAID1, RAID0, RAID10, SINGLE, DUP */
+   /*
+    * this will have to be fixed for RAID1 and RAID10 over
+    * more drives
+    */
+   data_stripes = num_stripes / ncopies;
+   if (stripe_size * ndevs > max_chunk_size * ncopies) {
+   stripe_size = max_chunk_size * ncopies;
+   do_div(stripe_size, ndevs);
+   }
+   }
}
do_div(stripe_size, dev_stripes);
 

 



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] btrfs: clean snapshots one by one

2013-02-17 Thread Alex Lyakas
Hi David,
thank you for addressing this issue.

On Mon, Feb 11, 2013 at 6:11 PM, David Sterba dste...@suse.cz wrote:
 Each time pick one dead root from the list and let the caller know if
 it's needed to continue. This should improve responsiveness during
 umount and balance which at some point wait for cleaning all currently
 queued dead roots.

 A new dead root is added to the end of the list, so the snapshots
 disappear in the order of deletion.

 Process snapshot cleaning is now done only from the cleaner thread and
 the others wake it if needed.
This is great.



 Signed-off-by: David Sterba dste...@suse.cz
 ---

 * btrfs_clean_old_snapshots is removed from the reloc loop, I don't know if this
   is safe wrt reloc's assumptions

 * btrfs_run_delayed_iputs is left in place in super_commit, may get removed as
   well because transaction commit calls it in the end

 * the responsiveness can be improved further if btrfs_drop_snapshot check
   fs_closing, but this needs changes to error handling in the main reloc loop

  fs/btrfs/disk-io.c |8 --
  fs/btrfs/relocation.c  |3 --
  fs/btrfs/transaction.c |   57 
 
  fs/btrfs/transaction.h |2 +-
  4 files changed, 44 insertions(+), 26 deletions(-)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 51bff86..6a02336 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -1635,15 +1635,17 @@ static int cleaner_kthread(void *arg)
 struct btrfs_root *root = arg;

 do {
 +   int again = 0;
 +
 if (!(root->fs_info->sb->s_flags & MS_RDONLY) &&
 mutex_trylock(&root->fs_info->cleaner_mutex)) {
 btrfs_run_delayed_iputs(root);
 -   btrfs_clean_old_snapshots(root);
 +   again = btrfs_clean_one_deleted_snapshot(root);
 mutex_unlock(&root->fs_info->cleaner_mutex);
 btrfs_run_defrag_inodes(root->fs_info);
 }

 -   if (!try_to_freeze()) {
 +   if (!try_to_freeze() && !again) {
 set_current_state(TASK_INTERRUPTIBLE);
 if (!kthread_should_stop())
 schedule();
 @@ -3301,8 +3303,8 @@ int btrfs_commit_super(struct btrfs_root *root)

 mutex_lock(&root->fs_info->cleaner_mutex);
 btrfs_run_delayed_iputs(root);
 -   btrfs_clean_old_snapshots(root);
 mutex_unlock(&root->fs_info->cleaner_mutex);
 +   wake_up_process(root->fs_info->cleaner_kthread);
I am probably missing something, but if the cleaner wakes up here,
won't it attempt cleaning the next snap? Because I don't see the
cleaner checking anywhere that we are unmounting. Or at this point
dead_roots is supposed to be empty?



 /* wait until ongoing cleanup work done */
 down_write(&root->fs_info->cleanup_work_sem);
 diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
 index ba5a321..ab6a718 100644
 --- a/fs/btrfs/relocation.c
 +++ b/fs/btrfs/relocation.c
 @@ -4060,10 +4060,7 @@ int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 group_start)

 while (1) {
 mutex_lock(&fs_info->cleaner_mutex);
 -
 -   btrfs_clean_old_snapshots(fs_info->tree_root);
 ret = relocate_block_group(rc);
 -
 mutex_unlock(&fs_info->cleaner_mutex);
 if (ret < 0) {
 err = ret;
 diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
 index 361fb7d..f1e3606 100644
 --- a/fs/btrfs/transaction.c
 +++ b/fs/btrfs/transaction.c
 @@ -895,7 +895,7 @@ static noinline int commit_cowonly_roots(struct btrfs_trans_handle *trans,
  int btrfs_add_dead_root(struct btrfs_root *root)
  {
 spin_lock(&root->fs_info->trans_lock);
 -   list_add(&root->root_list, &root->fs_info->dead_roots);
 +   list_add_tail(&root->root_list, &root->fs_info->dead_roots);
 spin_unlock(&root->fs_info->trans_lock);
 return 0;
  }
 @@ -1783,31 +1783,50 @@ cleanup_transaction:
  }

  /*
 - * interface function to delete all the snapshots we have scheduled for deletion
 + * return < 0 if error
 + * 0 if there are no more dead_roots at the time of call
 + * 1 there are more to be processed, call me again
 + *
 + * The return value indicates there are certainly more snapshots to delete, but
 + * if there comes a new one during processing, it may return 0. We don't mind,
 + * because btrfs_commit_super will poke cleaner thread and it will process it a
 + * few seconds later.
   */
 -int btrfs_clean_old_snapshots(struct btrfs_root *root)
 +int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root)
  {
 -   LIST_HEAD(list);
 +   int ret;
 +   int run_again = 1;
 struct btrfs_fs_info *fs_info = root->fs_info;
 
 +   if (root->fs_info->sb->s_flags & MS_RDONLY) {
 +   pr_debug("btrfs: cleaner called for RO fs!\n");
 +  

Fixing mount points

2013-02-17 Thread Bob McGowan
First, apologies if this is well known, I'm totally new here, and
haven't figured out yet how to search the archives.

Also, I'm not a subscriber, so please respond to me as well as the list.

I've had a number of issues with distribution installations (separate
and unrelated issues, but history and context).  I had a system set up,
with a separate /home (btrfs, sda1), when it crashed (badly) while
updating and rebuilding the kernel.

I installed again (different window system, same distribution) and, to be
completely safe, I left the /home out of the new setup.  I want (need)
to keep it because it has a very large number of digital photos and I'd
rather not have to restore them again (between audio, video, the photos,
and DVD images, about 800GB).

So, I now have a volume/subvolume on one disk, where the subvolume is
/home (btrfs, sde2), and I want to change fstab to mount sda1 as /home
instead.

For any other FS I've worked with, simple edits of fstab would be
enough, but doing so doesn't appear to be enough for btrfs.

Even though things look OK from the command line, logging in through the
window system fails (actually, just hangs).

I assume this means I should be doing something to clean up the
subvolume?  Or maybe there's something in the Window system
configuration to change?

I'm running Linux Mint 14 KDE.  My fstab for the parts in question looks
like:

# / was on /dev/sde2 during installation
UUID=1a...9 / btrfs   defaults,subvol=@ 0   1
# /home was on /dev/sde2 during installation
UUID=1a...9 /home btrfs   defaults,subvol=@home 0   2

What I want is something like:
# / was on /dev/sde2 during installation
UUID=1a...9 / btrfs   defaults  0   1
# /home is on /dev/sda1
UUID=7f...3 /home btrfs   defaults  0   2
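
One hedged aside worth checking before relying on the edited fstab (the UUID below is the placeholder from above): if the photos on sda1 were created inside a subvolume rather than in the top-level volume, the new /home line would still need a subvol= option, just as the sde2 lines use, e.g.:

```
# /home is on /dev/sda1; only needed if the data lives in a subvolume there
UUID=7f...3 /home btrfs   defaults,subvol=<name>  0   2
```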

Thanks for bearing with me as I learn this new environment. ;-)

Bob


[PATCH v2] btrfs: use kmalloc for lzo de/compress buffer

2013-02-17 Thread Kyungsik Lee
The size of de/compress buffer is small enough to use kmalloc.
Allocating it with kmalloc rather than vmalloc is preferred.

This patch depends on my previous patch, “btrfs: fix decompress buffer size”.

v2: Using vmalloc for workspace->mem due to the size limit.

Signed-off-by: Kyungsik Lee kyungsik@lge.com
Cc: David Sterba dste...@suse.cz
---
 fs/btrfs/lzo.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
index 223893a..b2e 100644
--- a/fs/btrfs/lzo.c
+++ b/fs/btrfs/lzo.c
@@ -40,8 +40,8 @@ static void lzo_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
 
-   vfree(workspace->buf);
-   vfree(workspace->cbuf);
+   kfree(workspace->buf);
+   kfree(workspace->cbuf);
	vfree(workspace->mem);
	kfree(workspace);
 }
@@ -55,8 +55,9 @@ static struct list_head *lzo_alloc_workspace(void)
return ERR_PTR(-ENOMEM);
 
	workspace->mem = vmalloc(LZO1X_MEM_COMPRESS);
-   workspace->buf = vmalloc(PAGE_CACHE_SIZE);
-   workspace->cbuf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE));
+   workspace->buf = kmalloc(PAGE_CACHE_SIZE, GFP_NOFS);
+   workspace->cbuf = kmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE),
+   GFP_NOFS);
	if (!workspace->mem || !workspace->buf || !workspace->cbuf)
goto fail;
 
-- 
1.8.0.3
