[PATCH 0/5] btrfs: Add lzo compression support

2010-10-25 Thread Li Zefan
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and compression speed.

Usage:

# mount -t btrfs -o compress[=zlib,lzo] dev /mnt
or
# mount -t btrfs -o compress-force[=zlib,lzo] dev /mnt

-o compress without argument is still allowed for compatability.

Performance:

The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted the tarball.

(time in second)
   lzozlibnocompress
copy:  10.6   21.714.9
extract:   70.1   94.466.6

(data size in MB)
   lzozlibnocompress
copy:  185.87 108.69  394.49
extract:   193.80 132.36  381.21

Other:

The defrag ioctl is also updated, so one can choose lzo or zlib when
turning on compression in defrag operation.

TODO:

- Update btrfs-progs accordingly.

---
 fs/btrfs/Makefile   |2 +-
 fs/btrfs/btrfs_inode.h  |2 +-
 fs/btrfs/compression.c  |  329 +-
 fs/btrfs/compression.h  |   72 ++--
 fs/btrfs/ctree.h|   11 +-
 fs/btrfs/extent_io.c|5 +-
 fs/btrfs/extent_io.h|   17 ++-
 fs/btrfs/extent_map.c   |2 +
 fs/btrfs/extent_map.h   |3 +-
 fs/btrfs/file.c |2 +
 fs/btrfs/inode.c|   82 ++
 fs/btrfs/ioctl.c|   10 +-
 fs/btrfs/ioctl.h|9 +-
 fs/btrfs/lzo.c  |  409 +++
 fs/btrfs/ordered-data.c |   18 ++-
 fs/btrfs/ordered-data.h |8 +-
 fs/btrfs/super.c|   47 +-
 fs/btrfs/zlib.c |  361 +++--
 18 files changed, 1013 insertions(+), 376 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/5] btrfs: Fix bugs in zlib workspace

2010-10-25 Thread Li Zefan
- Fix a race that can result in alloc_workspace  cpus.
- Fix to check num_workspace after wakeup.

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 fs/btrfs/zlib.c |7 +--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
index b9cd544..e5b8b22 100644
--- a/fs/btrfs/zlib.c
+++ b/fs/btrfs/zlib.c
@@ -75,16 +75,19 @@ again:
return workspace;
 
}
-   spin_unlock(workspace_lock);
if (atomic_read(alloc_workspace)  cpus) {
DEFINE_WAIT(wait);
+
+   spin_unlock(workspace_lock);
prepare_to_wait(workspace_wait, wait, TASK_UNINTERRUPTIBLE);
-   if (atomic_read(alloc_workspace)  cpus)
+   if (atomic_read(alloc_workspace)  cpus  !num_workspace)
schedule();
finish_wait(workspace_wait, wait);
goto again;
}
atomic_inc(alloc_workspace);
+   spin_unlock(workspace_lock);
+
workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
if (!workspace) {
ret = -ENOMEM;
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] btrfs: Allow to add new compression algorithm

2010-10-25 Thread Li Zefan
Make the code aware of compression type, instead of always assuming
zlib compression.

Also make the zlib workspace function as common code for all
compression types.

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 fs/btrfs/btrfs_inode.h  |2 +-
 fs/btrfs/compression.c  |  236 ++-
 fs/btrfs/compression.h  |   66 +
 fs/btrfs/ctree.h|   10 +-
 fs/btrfs/extent_io.c|5 +-
 fs/btrfs/extent_io.h|   17 +++-
 fs/btrfs/extent_map.c   |2 +
 fs/btrfs/extent_map.h   |3 +-
 fs/btrfs/file.c |2 +
 fs/btrfs/inode.c|   82 ++--
 fs/btrfs/ioctl.c|4 +-
 fs/btrfs/ordered-data.c |   18 +++-
 fs/btrfs/ordered-data.h |8 ++-
 fs/btrfs/super.c|   41 +++-
 fs/btrfs/zlib.c |  253 +-
 15 files changed, 472 insertions(+), 277 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 6ad63f1..ccc991c 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -157,7 +157,7 @@ struct btrfs_inode {
/*
 * always compress this one file
 */
-   unsigned force_compress:1;
+   unsigned force_compress:4;
 
struct inode vfs_inode;
 };
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index cb3877c..4c6068b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -62,6 +62,9 @@ struct compressed_bio {
/* number of bytes on disk */
unsigned long compressed_len;
 
+   /* the compression algorithm for this bio */
+   int compress_type;
+
/* number of compressed pages in the array */
unsigned long nr_pages;
 
@@ -186,11 +189,12 @@ static void end_compressed_bio_read(struct bio *bio, int 
err)
/* ok, we're the last bio for this extent, lets start
 * the decompression.
 */
-   ret = btrfs_zlib_decompress_biovec(cb-compressed_pages,
-   cb-start,
-   cb-orig_bio-bi_io_vec,
-   cb-orig_bio-bi_vcnt,
-   cb-compressed_len);
+   ret = btrfs_decompress_biovec(cb-compress_type,
+ cb-compressed_pages,
+ cb-start,
+ cb-orig_bio-bi_io_vec,
+ cb-orig_bio-bi_vcnt,
+ cb-compressed_len);
 csum_failed:
if (ret)
cb-errors = 1;
@@ -611,6 +615,7 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
 
cb-len = uncompressed_len;
cb-compressed_len = compressed_len;
+   cb-compress_type = extent_compress_type(bio_flags);
cb-orig_bio = bio;
 
nr_pages = (compressed_len + PAGE_CACHE_SIZE - 1) /
@@ -700,3 +705,224 @@ int btrfs_submit_compressed_read(struct inode *inode, 
struct bio *bio,
bio_put(comp_bio);
return 0;
 }
+
+static struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES];
+static spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES];
+static int comp_num_workspace[BTRFS_COMPRESS_TYPES];
+static atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES];
+static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES];
+
+struct btrfs_compress_op *btrfs_compress_op[] = {
+   btrfs_zlib_compress,
+};
+
+int __init btrfs_init_compress(void)
+{
+   int i;
+
+   for (i = 0; i  BTRFS_COMPRESS_TYPES; i++) {
+   INIT_LIST_HEAD(comp_idle_workspace[i]);
+   spin_lock_init(comp_workspace_lock[i]);
+   atomic_set(comp_alloc_workspace[i], 0);
+   init_waitqueue_head(comp_workspace_wait[i]);
+   }
+   return 0;
+}
+
+/*
+ * this finds an available workspace or allocates a new one
+ * ERR_PTR is returned if things go bad.
+ */
+static struct list_head *find_workspace(int type)
+{
+   struct list_head *workspace;
+   int cpus = num_online_cpus();
+   int idx = type - 1;
+
+   struct list_head *idle_workspace= comp_idle_workspace[idx];
+   spinlock_t *workspace_lock  = comp_workspace_lock[idx];
+   atomic_t *alloc_workspace   = comp_alloc_workspace[idx];
+   wait_queue_head_t *workspace_wait   = comp_workspace_wait[idx];
+   int *num_workspace  = comp_num_workspace[idx];
+again:
+   spin_lock(workspace_lock);
+   if (!list_empty(idle_workspace)) {
+   workspace = idle_workspace-next;
+   list_del(workspace);
+   (*num_workspace)--;
+   spin_unlock(workspace_lock);
+   return workspace;
+
+   }
+   if (atomic_read(alloc_workspace)  cpus) {
+   DEFINE_WAIT(wait);
+
+   spin_unlock(workspace_lock);
+   prepare_to_wait(workspace_wait, wait, 

[PATCH 3/5] btrfs: Add lzo compression support

2010-10-25 Thread Li Zefan
Lzo is a much faster compression algorithm than gzib, so would allow
more users to enable transparent compression, and some users can
choose from compression ratio and speed for different applications

Usage:

# mount -t btrfs -o compress[=zlib,lzo] dev /mnt
or
# mount -t btrfs -o compress-force[=zlib,lzo] dev /mnt

-o compress without argument is still allowed for compatability.

Performance:

The test copied a linux source tarball (~400M) from an ext4 partition
to the btrfs partition, and then extracted it.

(time in second)
   lzozlibnocompress
copy:  10.6   21.714.9
extract:   70.1   94.466.6

(data size in MB)
   lzozlibnocompress
copy:  185.87 108.69  394.49
extract:   193.80 132.36  381.21

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |1 +
 fs/btrfs/ctree.h   |5 +-
 fs/btrfs/lzo.c |  498 
 fs/btrfs/super.c   |6 +
 6 files changed, 510 insertions(+), 3 deletions(-)
 create mode 100644 fs/btrfs/lzo.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index a35eb36..31610ea 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -6,5 +6,5 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o 
root-tree.o dir-item.o \
   transaction.o inode.o file.o tree-defrag.o \
   extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \
   extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \
-  export.o tree-log.o acl.o free-space-cache.o zlib.o \
+  export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 4c6068b..84c5490 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -714,6 +714,7 @@ static wait_queue_head_t 
comp_workspace_wait[BTRFS_COMPRESS_TYPES];
 
 struct btrfs_compress_op *btrfs_compress_op[] = {
btrfs_zlib_compress,
+   btrfs_lzo_compress,
 };
 
 int __init btrfs_init_compress(void)
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 9b5f2f3..f7ce217 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -73,5 +73,6 @@ struct btrfs_compress_op {
 };
 
 extern struct btrfs_compress_op btrfs_zlib_compress;
+extern struct btrfs_compress_op btrfs_lzo_compress;
 
 #endif
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b0f8b4d..f9c2b00 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -530,8 +530,9 @@ struct btrfs_timespec {
 enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
-   BTRFS_COMPRESS_TYPES = 1,
-   BTRFS_COMPRESS_LAST  = 2,
+   BTRFS_COMPRESS_LZO   = 2,
+   BTRFS_COMPRESS_TYPES = 2,
+   BTRFS_COMPRESS_LAST  = 3,
 };
 
 struct btrfs_inode_item {
diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
new file mode 100644
index 000..0a518f3
--- /dev/null
+++ b/fs/btrfs/lzo.c
@@ -0,0 +1,498 @@
+/*
+ * Copyright (C) 2008 Oracle.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#include linux/kernel.h
+#include linux/slab.h
+#include linux/vmalloc.h
+#include linux/init.h
+#include linux/err.h
+#include linux/sched.h
+#include linux/pagemap.h
+#include linux/bio.h
+#include linux/lzo.h
+#include compression.h
+
+#define LZO_LEN4
+
+struct workspace {
+   void *mem;
+   void *buf;  /* where compressed data goes */
+   void *cbuf; /* where decompressed data goes */
+   struct list_head list;
+};
+
+static void lzo_free_workspace(struct list_head *ws)
+{
+   struct workspace *workspace = list_entry(ws, struct workspace, list);
+
+   vfree(workspace-buf);
+   vfree(workspace-cbuf);
+   vfree(workspace-mem);
+   kfree(workspace);
+}
+
+static struct list_head *lzo_alloc_workspace(void)
+{
+   struct workspace *workspace;
+
+   workspace = kzalloc(sizeof(*workspace), GFP_NOFS);
+   if (!workspace)
+   return ERR_PTR(-ENOMEM);
+
+   workspace-mem = vmalloc(LZO1X_MEM_COMPRESS);
+   workspace-buf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE));
+   workspace-cbuf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE));
+   

[PATCH 4/5] btrfs: Allow to specify compress method when defrag

2010-10-25 Thread Li Zefan
Update defrag ioctl, so one can choose lzo or zlib when turning
on compression in defrag operation.

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c |6 +-
 fs/btrfs/ioctl.h |9 -
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ecc49ee..2ea05df 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -584,8 +584,12 @@ static int btrfs_defrag_file(struct file *file,
}
total_read++;
mutex_lock(inode-i_mutex);
-   if (range-flags  BTRFS_DEFRAG_RANGE_COMPRESS)
+   if (range-flags  BTRFS_DEFRAG_RANGE_COMPRESS) {
BTRFS_I(inode)-force_compress = BTRFS_COMPRESS_ZLIB;
+   if (range-compress_type)
+   BTRFS_I(inode)-force_compress =
+   range-compress_type;
+   }
 
ret  = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
if (ret)
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..776a2b4 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -122,8 +122,15 @@ struct btrfs_ioctl_defrag_range_args {
 */
__u32 extent_thresh;
 
+   /*
+* which compression method to use if turning on compression
+* for this defrag operation.  If unspecified, zlib will
+* be used
+*/
+   __u32 compress_type;
+
/* spare for later */
-   __u32 unused[5];
+   __u32 unused[4];
 };
 
 struct btrfs_ioctl_space_info {
-- 
1.7.0.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/5] btrfs: Extract duplicate decompress code

2010-10-25 Thread Li Zefan
Add a common function to copy decompressed data from working buffer
to bio pages.

Signed-off-by: Li Zefan l...@cn.fujitsu.com
---
 fs/btrfs/compression.c |   92 +++
 fs/btrfs/compression.h |5 ++
 fs/btrfs/lzo.c |  101 +++-
 fs/btrfs/zlib.c|  111 +--
 4 files changed, 115 insertions(+), 194 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 84c5490..c7f695b 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -927,3 +927,95 @@ void __exit btrfs_exit_compress(void)
 {
 free_workspaces();
 }
+
+/*
+ * Copy uncompressed data from working buffer to pages.
+ *
+ * buf_start is the byte offset we're of the start of our workspace buffer.
+ *
+ * total_out is the last byte of the buffer
+ */
+int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
+ unsigned long total_out, u64 disk_start,
+ struct bio_vec *bvec, int vcnt,
+ unsigned long *page_index,
+ unsigned long *pg_offset)
+{
+   unsigned long buf_offset;
+   unsigned long current_buf_start;
+   unsigned long start_byte;
+   unsigned long working_bytes = total_out - buf_start;
+   unsigned long bytes;
+   char *kaddr;
+   struct page *page_out = bvec[*page_index].bv_page;
+
+   /*
+* start byte is the first byte of the page we're currently
+* copying into relative to the start of the compressed data.
+*/
+   start_byte = page_offset(page_out) - disk_start;
+
+   /* we haven't yet hit data corresponding to this page */
+   if (total_out = start_byte)
+   return 1;
+
+   /*
+* the start of the data we care about is offset into
+* the middle of our working buffer
+*/
+   if (total_out  start_byte  buf_start  start_byte) {
+   buf_offset = start_byte - buf_start;
+   working_bytes -= buf_offset;
+   } else {
+   buf_offset = 0;
+   }
+   current_buf_start = buf_start;
+
+   /* copy bytes from the working buffer into the pages */
+   while (working_bytes  0) {
+   bytes = min(PAGE_CACHE_SIZE - *pg_offset,
+   PAGE_CACHE_SIZE - buf_offset);
+   bytes = min(bytes, working_bytes);
+   kaddr = kmap_atomic(page_out, KM_USER0);
+   memcpy(kaddr + *pg_offset, buf + buf_offset, bytes);
+   kunmap_atomic(kaddr, KM_USER0);
+   flush_dcache_page(page_out);
+
+   *pg_offset += bytes;
+   buf_offset += bytes;
+   working_bytes -= bytes;
+   current_buf_start += bytes;
+
+   /* check if we need to pick another page */
+   if (*pg_offset == PAGE_CACHE_SIZE) {
+   (*page_index)++;
+   if (*page_index = vcnt)
+   return 0;
+
+   page_out = bvec[*page_index].bv_page;
+   *pg_offset = 0;
+   start_byte = page_offset(page_out) - disk_start;
+
+   /*
+* make sure our new page is covered by this
+* working buffer
+*/
+   if (total_out = start_byte)
+   return 1;
+
+   /*
+* the next page in the biovec might not be adjacent
+* to the last page, but it might still be found
+* inside this working buffer. bump our offset pointer
+*/
+   if (total_out  start_byte 
+   current_buf_start  start_byte) {
+   buf_offset = start_byte - buf_start;
+   working_bytes = total_out - start_byte;
+   current_buf_start = buf_start + buf_offset;
+   }
+   }
+   }
+
+   return 1;
+}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index f7ce217..5100017 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -34,6 +34,11 @@ int btrfs_decompress_biovec(int type, struct page 
**pages_in, u64 disk_start,
struct bio_vec *bvec, int vcnt, size_t srclen);
 int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page,
 unsigned long start_byte, size_t srclen, size_t destlen);
+int btrfs_decompress_buf2page(char *buf, unsigned long buf_start,
+ unsigned long total_out, u64 disk_start,
+ struct bio_vec *bvec, int vcnt,
+ unsigned long *page_index,
+ 

Re: Determine if a given fs is a btrfs fs

2010-10-25 Thread Jérôme Poulin
On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com wrote:
...
 p4 jerome # btrfs device scan /dev/dm-22
 Scanning for Btrfs filesystems in '/dev/dm-22'
 p4 jerome # echo $?
 0
This is OK.

 p4 jerome # btrfs device scan /dev/sda
 Scanning for Btrfs filesystems in '/dev/sda'
 ERROR: unable to scan the device '/dev/sda'
 p4 jerome # echo $?
 11
...
But isn't that error misleading, btrfs scan was succesfully able to
scan /dev/sda, but, it doesn't contain btrfs, right?
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Determine if a given fs is a btrfs fs

2010-10-25 Thread C Anthony Risinger
On Mon, Oct 25, 2010 at 6:31 AM, Jérôme Poulin jeromepou...@gmail.com wrote:
 On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com wrote:
 ...
 p4 jerome # btrfs device scan /dev/dm-22
 Scanning for Btrfs filesystems in '/dev/dm-22'
 p4 jerome # echo $?
 0
 This is OK.

 p4 jerome # btrfs device scan /dev/sda
 Scanning for Btrfs filesystems in '/dev/sda'
 ERROR: unable to scan the device '/dev/sda'
 p4 jerome # echo $?
 11
 ...
 But isn't that error misleading, btrfs scan was succesfully able to
 scan /dev/sda, but, it doesn't contain btrfs, right?

imo, the best way is:

# root must be btrfs else silent return
[ $(blkid -s TYPE -o value ${root}) = btrfs ] || return 0

at least that the way i do it in my initramfs hook; seems to be reliable.

C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
Hi Chris,

This is the extent of my current queue of Btrfs snapshot/subvol/commit 
stuff. Most of these were posted several months ago.  Can be sent 
upstream during this merge window?  Not having this functionality is 
becoming a bit of a roadblock for our efforts to keep the Ceph data in a 
consistent state.

These patches are also available from

git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls

The first patch is strictly a bug fix for a deadlock in 
btrfs_commit_transaction().

The next few patches are a repost (with a few minor revisions) of the 
async snapshot/subvolume ioctls I originally posted last spring.  They 
include:

 - Some async commit helper functions
 - Start and wait sync ioctls, for initiating and waiting for a sync
 - An ioctl to start a snapshot creation asynchronously, such that you 
   don't have to wait for the full commit to disk.  The interface is cleaned
   up a bit from the original version.

There is also a patch that changes the SNAP_DESTROY ioctl to not do a 
commit before returning.  The rationale here is there is no obvious 
reason (to me at least) why the snapshot removal should be on disk 
before returning; rm(2) and unlink(2) certainly don't do that.  If the 
user disagrees, they can call sync(2).  If you would prefer, I also have 
a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't 
change any existing behavior.

The last item is a change to SNAP_DESTROY to allow deletion of a 
snapshot when the user owns the subvol's root inode and the parent 
directory permissions are such that we would have allowed an rmdir(2).  
Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
semantics completely (except for the empty directory check) by 
duplicating some VFS code.  Whether we want weaker semantics, duplicated 
code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
is distinct from a similar patch (also from Goffredo) that allows 
rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
subvol to be deleted by a non-root user.  As long as I can do that, my 
daemon doesn't have to run as root and I'm a happy camper.  :)

If anybody has any questions or issues with any of this, please let me 
know so I can revise the patches accordingly.

Thanks!
sage

---


Sage Weil (6):
  Btrfs: fix deadlock in btrfs_commit_transaction
  Btrfs: async transaction commit
  Btrfs: add START_SYNC, WAIT_SYNC ioctls
  Btrfs: add SNAP_CREATE_ASYNC ioctl
  Btrfs: make SNAP_DESTROY async
  Btrfs: allow subvol deletion by owner

 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/ioctl.c   |  152 +++-
 fs/btrfs/ioctl.h   |9 +++
 fs/btrfs/transaction.c |  183 +--
 fs/btrfs/transaction.h |4 +
 security/security.c|1 +
 7 files changed, 326 insertions(+), 25 deletions(-)

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction

2010-10-25 Thread Sage Weil
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether
num_writers  1 or should_grow at the top of the loop.  Then, much much
later, we wait for that timeout if either num_writers or should_grow is
true.  However, it's possible for a racing process (calling
btrfs_end_transaction()) to decrement num_writers such that we wait
forever instead of for 1.

Fix this by deciding how long to wait when we wait.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/transaction.c |   12 
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 66e4c66..bf399ea 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
 struct btrfs_root *root)
 {
unsigned long joined = 0;
-   unsigned long timeout = 1;
struct btrfs_transaction *cur_trans;
struct btrfs_transaction *prev_trans = NULL;
DEFINE_WAIT(wait);
@@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
snap_pending = 1;
 
WARN_ON(cur_trans != trans-transaction);
-   if (cur_trans-num_writers  1)
-   timeout = MAX_SCHEDULE_TIMEOUT;
-   else if (should_grow)
-   timeout = 1;
-
mutex_unlock(root-fs_info-trans_mutex);
 
if (flush_on_commit || snap_pending) {
@@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans,
TASK_UNINTERRUPTIBLE);
 
smp_mb();
-   if (cur_trans-num_writers  1 || should_grow)
-   schedule_timeout(timeout);
+   if (cur_trans-num_writers  1)
+   schedule_timeout(MAX_SCHEDULE_TIMEOUT);
+   else if (should_grow)
+   schedule_timeout(1);
 
mutex_lock(root-fs_info-trans_mutex);
finish_wait(cur_trans-writer_wait, wait);
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/6] Btrfs: async transaction commit

2010-10-25 Thread Sage Weil
Add support for an async transaction commit that is ordered such that any
subsequent operations will join the following transaction, but does not
wait until the current commit is fully on disk.  This avoids much of the
latency associated with the btrfs_commit_transaction for callers concerned
with serialization and not safety.

The wait_for_unblock flag controls whether we wait for the 'middle' portion
of commit_transaction to complete, which is necessary if the caller expects
some of the modifications contained in the commit to be available (this is
the case for subvol/snapshot creation).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |1 +
 fs/btrfs/transaction.c |  119 
 fs/btrfs/transaction.h |3 +
 4 files changed, 124 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index eaf286a..4c3c20b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -863,6 +863,7 @@ struct btrfs_fs_info {
struct btrfs_transaction *running_transaction;
wait_queue_head_t transaction_throttle;
wait_queue_head_t transaction_wait;
+   wait_queue_head_t transaction_blocked_wait;
wait_queue_head_t async_submit_wait;
 
struct btrfs_super_block super_copy;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 64f1008..034abc6 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1680,6 +1680,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
init_waitqueue_head(fs_info-transaction_throttle);
init_waitqueue_head(fs_info-transaction_wait);
+   init_waitqueue_head(fs_info-transaction_blocked_wait);
init_waitqueue_head(fs_info-async_submit_wait);
 
__setup_root(4096, 4096, 4096, 4096, tree_root,
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index bf399ea..d720106 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -988,6 +988,123 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info)
return ret;
 }
 
+/*
+ * wait for the current transaction commit to start and block subsequent
+ * transaction joins
+ */
+static void wait_current_trans_commit_start(struct btrfs_root *root,
+   struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-in_commit)
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_blocked_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-in_commit) {
+   finish_wait(root-fs_info-transaction_blocked_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_blocked_wait, wait);
+   }
+}
+
+/*
+ * wait for the current transaction to start and then become unblocked.
+ * caller holds ref.
+ */
+static void wait_current_trans_commit_start_and_unblock(struct btrfs_root 
*root,
+struct btrfs_transaction *trans)
+{
+   DEFINE_WAIT(wait);
+
+   if (trans-commit_done || (trans-in_commit  !trans-blocked))
+   return;
+
+   while (1) {
+   prepare_to_wait(root-fs_info-transaction_wait, wait,
+   TASK_UNINTERRUPTIBLE);
+   if (trans-commit_done ||
+   (trans-in_commit  !trans-blocked)) {
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   break;
+   }
+   mutex_unlock(root-fs_info-trans_mutex);
+   schedule();
+   mutex_lock(root-fs_info-trans_mutex);
+   finish_wait(root-fs_info-transaction_wait,
+   wait);
+   }
+}
+
+/*
+ * commit transactions asynchronously. once btrfs_commit_transaction_async
+ * returns, any subsequent transaction will not be allowed to join.
+ */
+struct btrfs_async_commit {
+   struct btrfs_trans_handle *newtrans;
+   struct btrfs_root *root;
+   struct delayed_work work;
+};
+
+static void do_async_commit(struct work_struct *work)
+{
+   struct btrfs_async_commit *ac =
+   container_of(work, struct btrfs_async_commit, work.work);
+   
+   btrfs_commit_transaction(ac-newtrans, ac-root);
+   kfree(ac);
+}
+
+int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  int wait_for_unblock)
+{
+   struct btrfs_async_commit *ac;
+   struct btrfs_transaction *cur_trans;
+
+   ac = kmalloc(sizeof(*ac), GFP_NOFS);
+   BUG_ON(!ac);
+
+   INIT_DELAYED_WORK(ac-work, 

[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls

2010-10-25 Thread Sage Weil
START_SYNC will start a sync/commit, but not wait for it to
complete.  Any modification started after the ioctl returns is
guaranteed not to be included in the commit.  If a non-NULL
pointer is passed, the transaction id will be returned to
userspace.

WAIT_SYNC will wait for any in-progress commit to complete.  If a
transaction id is specified, the ioctl will block and then
return (success) when the specified transaction has committed.
If it has already committed when we call the ioctl, it returns
immediately.  If the specified transaction doesn't exist, it
returns EINVAL.

If no transaction id is specified, WAIT_SYNC will wait for the
currently committing transaction to finish it's commit to disk.
If there is no currently committing transaction, it returns
success.

These ioctls are useful for applications which want to impose an
ordering on when fs modifications reach disk, but do not want to
wait for the full (slow) commit process to do so.

Picky callers can take the transid returned by START_SYNC and
feed it to WAIT_SYNC, and be certain to wait only as long as
necessary for the transaction _they_ started to reach disk.

Sloppy callers can START_SYNC and WAIT_SYNC without a transid,
and provided they didn't wait too long between the calls, they
will get the same result.  However, if a second commit starts
before they call WAIT_SYNC, they may end up waiting longer for
it to commit as well.  Even so, a START_SYNC+WAIT_SYNC still
guarantees that any operation completed before the START_SYNC
reaches disk.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c   |   34 +++
 fs/btrfs/ioctl.h   |2 +
 fs/btrfs/transaction.c |   52 
 fs/btrfs/transaction.h |1 +
 4 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9254b3d..6306051 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1984,6 +1984,36 @@ long btrfs_ioctl_trans_end(struct file *file)
return 0;
 }
 
+static noinline long btrfs_ioctl_start_sync(struct file *file, void __user 
*argp)
+{
+   struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root;
+   struct btrfs_trans_handle *trans;
+   u64 transid;
+
+   trans = btrfs_start_transaction(root, 0);
+   transid = trans-transid;
+   btrfs_commit_transaction_async(trans, root, 0);
+
+   if (argp)
+   if (copy_to_user(argp, transid, sizeof(transid)))
+   return -EFAULT;
+   return 0;
+}
+
+static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user 
*argp)
+{
+   struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root;
+   u64 transid;
+
+   if (argp) {
+   if (copy_from_user(transid, argp, sizeof(transid)))
+   return -EFAULT;
+   } else {
+   transid = 0;  /* current trans */
+   }
+   return btrfs_wait_for_commit(root, transid);
+}
+
 long btrfs_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
 {
@@ -2034,6 +2064,10 @@ long btrfs_ioctl(struct file *file, unsigned int
case BTRFS_IOC_SYNC:
btrfs_sync_fs(file-f_dentry-d_sb, 1);
return 0;
+   case BTRFS_IOC_START_SYNC:
+   return btrfs_ioctl_start_sync(file, argp);
+   case BTRFS_IOC_WAIT_SYNC:
+   return btrfs_ioctl_wait_sync(file, argp);
}
 
return -ENOTTY;
diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h
index 424694a..e9a2f7e 100644
--- a/fs/btrfs/ioctl.h
+++ b/fs/btrfs/ioctl.h
@@ -178,4 +178,6 @@ struct btrfs_ioctl_space_args {
 #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64)
 #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \
struct btrfs_ioctl_space_args)
+#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64)
+#define BTRFS_IOC_WAIT_SYNC  _IOW(BTRFS_IOCTL_MAGIC, 22, __u64)
 #endif
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index d720106..932bc03 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -270,6 +270,58 @@ static noinline int wait_for_commit(struct btrfs_root 
*root,
return 0;
 }
 
+int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid)
+{
+   struct btrfs_transaction *cur_trans = NULL, *t;
+   int ret;
+
+   mutex_lock(root-fs_info-trans_mutex);
+
+   ret = 0;
+   if (transid) {
+   if (transid = root-fs_info-last_trans_committed)
+   goto out_unlock;
+
+   /* find specified transaction */
+   list_for_each_entry(t, root-fs_info-trans_list, list) {
+   if (t-transid == transid) {
+   cur_trans = t;
+   break;
+   }
+   if (t-transid  transid)
+   break;
+  

[PATCH 5/6] Btrfs: make SNAP_DESTROY async

2010-10-25 Thread Sage Weil
There is no reason to force an immediate commit when deleting a snapshot.
Users have some expectation that space from a deleted snapshot be freed
immediately, but even if we do commit the reclaim is a background process.

If users _do_ want the deletion to be durable, they can call 'sync'.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 679b8a8..9cbda86 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1365,7 +1365,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
BUG_ON(ret);
}
 
-   ret = btrfs_commit_transaction(trans, root);
+   ret = btrfs_end_transaction(trans, root);
BUG_ON(ret);
inode-i_flags |= S_DEAD;
 out_up_write:
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/6] Btrfs: allow subvol deletion by owner

2010-10-25 Thread Sage Weil
Allow users to delete the snapshots/subvols they own.  When CAP_SYS_ADMIN
is not present, require that

 - the user has write and exec permission on the parent directory
 - security_inode_rmdir() doesn't object
 - the user has write and exec permission on the subvol directory
 - the user owns the subvol directory
 - the directory and subvol append flags are not set

This is a bit more strict than the requirements for 'rm -f subvol/*',
which is allowed with just 'wx' on non-owned non-sticky dirs.  It is less
strict than 'rmdir subvol', which has additional requirements if the parent
directory is sticky.  It is less strict than 'rm -rf subvol', which would
require scanning the entire subvol to ensure all content is deletable.

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ioctl.c|   27 ---
 security/security.c |1 +
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 9cbda86..90d2871 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1288,9 +1288,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1325,6 +1322,30 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
goto out_dput;
}
 
+   /*
+* Allow subvol deletion if we own the subvol and we would
+* (approximately) be allowed to rmdir it.  Strictly speaking,
+* we could possibly delete everything with fewer permissions,
+* or might require more permissions to remove all files
+* contained by the subvol, but we aren't trying to mimic
+* directory semantics perfectly.
+*/
+   if (!capable(CAP_SYS_ADMIN)) {
+   err = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = security_inode_rmdir(dir, dentry);
+   if (err)
+   goto out_dput;
+   err = inode_permission(inode, MAY_WRITE | MAY_EXEC);
+   if (err)
+   goto out_dput;
+   err = -EPERM;
+   if (inode-i_uid != current_fsuid() || IS_APPEND(dir) ||
+   IS_APPEND(inode))
+   goto out_dput;
+   }
+
dest = BTRFS_I(inode)-root;
 
mutex_lock(inode-i_mutex);
diff --git a/security/security.c b/security/security.c
index c53949f..1c980ee 100644
--- a/security/security.c
+++ b/security/security.c
@@ -490,6 +490,7 @@ int security_inode_rmdir(struct inode *dir, struct dentry 
*dentry)
return 0;
return security_ops-inode_rmdir(dir, dentry);
 }
+EXPORT_SYMBOL_GPL(security_inode_rmdir);
 
 int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, 
dev_t dev)
 {
-- 
1.6.6.1

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Determine if a given fs is a btrfs fs

2010-10-25 Thread Felix Blanke
Hi,

that was a good hint, thanks a lot!

The best at that solution is that it works for all filesystems: Just test if the
output of $type (type=`blkid -s TYPE -o value $loopdev`) is empty (''), then it 
is no
filesystem and the loopaes password was wrong or the device/fs is broken.



Felix


On 25. October 2010 - 08:29, C Anthony Risinger wrote:
 Date: Mon, 25 Oct 2010 08:29:58 -0500
 From: C Anthony Risinger anth...@extof.me
 To: Jérôme Poulin jeromepou...@gmail.com
 Cc: linux-btrfs linux-btrfs@vger.kernel.org
 Subject: Re: Determine if a given fs is a btrfs fs
 
 On Mon, Oct 25, 2010 at 6:31 AM, Jérôme Poulin jeromepou...@gmail.com wrote:
  On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com 
  wrote:
  ...
  p4 jerome # btrfs device scan /dev/dm-22
  Scanning for Btrfs filesystems in '/dev/dm-22'
  p4 jerome # echo $?
  0
  This is OK.
 
  p4 jerome # btrfs device scan /dev/sda
  Scanning for Btrfs filesystems in '/dev/sda'
  ERROR: unable to scan the device '/dev/sda'
  p4 jerome # echo $?
  11
  ...
  But isn't that error misleading, btrfs scan was succesfully able to
  scan /dev/sda, but, it doesn't contain btrfs, right?
 
 imo, the best way is:
 
 # root must be btrfs else silent return
 [ $(blkid -s TYPE -o value ${root}) = btrfs ] || return 0
 
 at least that the way i do it in my initramfs hook; seems to be reliable.
 
 C Anthony
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
---end quoted text---
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Chris Mason
On Mon, Oct 25, 2010 at 12:07:36PM -0700, Sage Weil wrote:
 Hi Chris,
 
 This is the extent of my current queue of Btrfs snapshot/subvol/commit 
 stuff. Most of these were posted several months ago.  Can be sent 
 upstream during this merge window?  Not having this functionality is 
 becoming a bit of a roadblock for our efforts to keep the Ceph data in a 
 consistent state.
 
 These patches are also available from
 
   git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls
 
 The first patch is strictly a bug fix for a deadlock in 
 btrfs_commit_transaction().
 
 The next few patches are a repost (with a few minor revisions) of the 
 async snapshot/subvolume ioctls I originally posted last spring.  They 
 include:
 
  - Some async commit helper functions
  - Start and wait sync ioctls, for initiating and waiting for a sync
  - An ioctl to start a snapshot creation asynchronously, such that you 
don't have to wait for the full commit to disk.  The interface is cleaned
up a bit from the original version.
 
 There is also a patch that changes the SNAP_DESTROY ioctl to not do a 
 commit before returning.  The rationale here is there is no obvious 
 reason (to me at least) why the snapshot removal should be on disk 
 before returning; rm(2) and unlink(2) certainly don't do that.  If the 
 user disagrees, they can call sync(2).  If you would prefer, I also have 
 a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't 
 change any existing behavior.

These all look good to me and I'm pulling them in.

 
 The last item is a change to SNAP_DESTROY to allow deletion of a 
 snapshot when the user owns the subvol's root inode and the parent 
 directory permissions are such that we would have allowed an rmdir(2).  
 Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
 semantics completely (except for the empty directory check) by 
 duplicating some VFS code.  Whether we want weaker semantics, duplicated 
 code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
 is distinct from a similar patch (also from Goffredo) that allows 
 rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
 subvol to be deleted by a non-root user.  As long as I can do that, my 
 daemon doesn't have to run as root and I'm a happy camper.  :)

Someone at the storage workshop mentioned that this subvol deletion
trick is slightly stronger than rm -rf, to make it include the same
level of permission checks would require testing all the directories in
the tree for permissions.

For now, could you please make a mount -o user_subvol_rm_allowed option?
(or something similar with a better name).

Thanks!

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
On Mon, 25 Oct 2010, Chris Mason wrote:
 These all look good to me and I'm pulling them in.

Great, thanks!

  The last item is a change to SNAP_DESTROY to allow deletion of a 
  snapshot when the user owns the subvol's root inode and the parent 
  directory permissions are such that we would have allowed an rmdir(2).  
  Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) 
  semantics completely (except for the empty directory check) by 
  duplicating some VFS code.  Whether we want weaker semantics, duplicated 
  code, or some new EXPORT_SYMBOLS is up to you I think.  Note that this 
  is distinct from a similar patch (also from Goffredo) that allows 
  rmdir(2) to remove an empty subvol; my goal is to allow a non-empty 
  subvol to be deleted by a non-root user.  As long as I can do that, my 
  daemon doesn't have to run as root and I'm a happy camper.  :)
 
 Someone at the storage workshop mentioned that this subvol deletion
 trick is slightly stronger than rm -rf, to make it include the same
 level of permission checks would require testing all the directories in
 the tree for permissions.

I think that was me :)
 
 For now, could you please make a mount -o user_subvol_rm_allowed option?
 (or something similar with a better name).

Sure.  

Do you have a preference as far as what checks are implemented?  My patch 
implemented a simplified approximation of may_rmdir(); Goffredo's 
duplicated the vfs checks.  I guess I'm leaning toward the latter...

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Replacing the top-level root

2010-10-25 Thread C Anthony Risinger
On Mon, Oct 25, 2010 at 2:58 PM, Chris Mason chris.ma...@oracle.com wrote:

 Oh, and it shouldn't work on the root of the FS either ;)

 -chris

This is not 100% related but...

Will removing [replacing] the top-level root be something that
can/will be supported?  I've asked in the past about this regarding an
initramfs hook i maintain implementing system rollbacks, but I never
really got a solid answer.

For example, right now extlinux support booting btrfs, but _only_ from
the top-level root.  if i just had a way to swap the top-level root
with a different subvol, i could overcome several problems i have with
users all at once:

) users install their system to the top-level root, which means it is
no longer manageable by snapshot scripts [currently]
) if the top-level root could be swapped, extlinux could then boot my
snapshot? (i'm probably wrong here)

set-default is not enough; i'm looking for a way to gain control
over the system state when the system has been installed into the
top-level root.  currently, i have no way to manipulate/move/change
it, because the top-level is essentially immutable, or so it seems.

thus it's not possible to support kernel rollbacks without manually
syncing snapshot/boot to /boot.

is there a solution to this that i'm missing?

C Anthony
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed

2010-10-25 Thread Sage Weil
Add a mount option user_subvol_rm_allowed that allows users to delete a
(potentially non-empty!) subvol when they would otherwise we allowed to do
an rmdir(2).  We duplicate the may_delete() checks from the core VFS code
to implement identical security checks (minus the directory size check).

Signed-off-by: Sage Weil s...@newdream.net
---
 fs/btrfs/ctree.h |1 +
 fs/btrfs/ioctl.c |  109 +++--
 fs/btrfs/super.c |6 ++-
 3 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5ac2bca..140003e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1196,6 +1196,7 @@ struct btrfs_root {
 #define BTRFS_MOUNT_NOSSD  (1  9)
 #define BTRFS_MOUNT_DISCARD(1  10)
 #define BTRFS_MOUNT_FORCE_COMPRESS  (1  11)
+#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1  12)
 
 #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt)
 #define btrfs_set_opt(o, opt)  ((o) |= BTRFS_MOUNT_##opt)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 906e3b3..919b23f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -409,6 +409,76 @@ fail:
return ret;
 }
 
+/*  copy of check_sticky in fs/namei.c() 
+* It's inline, so penalty for filesystems that don't use sticky bit is
+* minimal.
+*/
+static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode)
+{
+   uid_t fsuid = current_fsuid();
+
+   if (!(dir-i_mode  S_ISVTX))
+   return 0;
+   if (inode-i_uid == fsuid)
+   return 0;
+   if (dir-i_uid == fsuid)
+   return 0;
+   return !capable(CAP_FOWNER);
+}
+
+/*  copy of may_delete in fs/namei.c() 
+ * Check whether we can remove a link victim from directory dir, check
+ *  whether the type of victim is right.
+ *  1. We can't do it if dir is read-only (done in permission())
+ *  2. We should have write and exec permissions on dir
+ *  3. We can't remove anything from append-only dir
+ *  4. We can't do anything with immutable dir (done in permission())
+ *  5. If the sticky bit on dir is set we should either
+ * a. be owner of dir, or
+ * b. be owner of victim, or
+ * c. have CAP_FOWNER capability
+ *  6. If the victim is append-only or immutable we can't do antyhing with
+ * links pointing to it.
+ *  7. If we were asked to remove a directory and victim isn't one - ENOTDIR.
+ *  8. If we were asked to remove a non-directory and victim isn't one - 
EISDIR.
+ *  9. We can't remove a root or mountpoint.
+ * 10. We don't allow removal of NFS sillyrenamed files; it's handled by
+ * nfs_async_unlink().
+ */
+
+static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir)
+{
+   int error;
+
+   if (!victim-d_inode)
+   return -ENOENT;
+
+   BUG_ON(victim-d_parent-d_inode != dir);
+   audit_inode_child(victim, dir);
+
+   error = inode_permission(dir, MAY_WRITE | MAY_EXEC);
+   if (error)
+   return error;
+   if (IS_APPEND(dir))
+   return -EPERM;
+   if (btrfs_check_sticky(dir, victim-d_inode)||
+   IS_APPEND(victim-d_inode)||
+   IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode))
+   return -EPERM;
+   if (isdir) {
+   if (!S_ISDIR(victim-d_inode-i_mode))
+   return -ENOTDIR;
+   if (IS_ROOT(victim))
+   return -EBUSY;
+   } else if (S_ISDIR(victim-d_inode-i_mode))
+   return -EISDIR;
+   if (IS_DEADDIR(dir))
+   return -ENOENT;
+   if (victim-d_flags  DCACHE_NFSFS_RENAMED)
+   return -EBUSY;
+   return 0;
+}
+
 /* copy of may_create in fs/namei.c() */
 static inline int btrfs_may_create(struct inode *dir, struct dentry *child)
 {
@@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file 
*file,
int ret;
int err = 0;
 
-   if (!capable(CAP_SYS_ADMIN))
-   return -EPERM;
-
vol_args = memdup_user(arg, sizeof(*vol_args));
if (IS_ERR(vol_args))
return PTR_ERR(vol_args);
@@ -1320,13 +1387,45 @@ static noinline int btrfs_ioctl_snap_destroy(struct 
file *file,
}
 
inode = dentry-d_inode;
+   dest = BTRFS_I(inode)-root;
+   if (!capable(CAP_SYS_ADMIN)){
+   /*
+* Regular user.  Only allow this with a special mount
+* option, and when rmdir(2) would have been allowed.
+*
+* Note that this is _not_ check that the subvol is
+* empty or doesn't contain data that we wouldn't
+* otherwise be able to delete.
+*
+* Users who want to delete empty subvols should try
+* rmdir(2).
+*/
+   err = -EPERM;
+   if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED))
+   

Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations

2010-10-25 Thread Sage Weil
On Mon, 25 Oct 2010, Chris Mason wrote:
 Yes, lets duplicate the vfs checks.  Christoph just sat bolt upright in
 whatever ski lift he's currently riding.

:)

This version is based on Goffredo's patch, but requires the 
user_subvol_rm_allowed mount option, and will proceed even if the subvol 
is nonempty.  (I suspect we also want his other rmdir(2) patch at some 
point as well so that users can rmdir an empty subvol and/or 'rm -r'.)

 We should also make sure they do the subvol rm against the root of the
 subvol (if that check isn't already done), none of the magic to resolve
 the subvol based on any file inside it.  I don't want people to
 accidentally think they are deleting a subdir and have it go higher up
 into the directory tree.
 
 Oh, and it shouldn't work on the root of the FS either ;)

Since the ioctl is based on a name lookup, I added a check that the parent 
inode's root isn't the same as the target root.  That implies the ioctl is 
called the dentry referencing the subvol, and presumably the root doesn't 
have one of those.  Does that cover it?  It isn't possible to have the 
subvol bound to multiple locations in the namespace, right?

$ whoami
sage
$ btrfs sub create a
Create subvolume './a'
$ btrfs sub create a/b
Create subvolume 'a/b'
$ btrfs sub snap a asnap
Create a snapshot of 'a' in './asnap'
$ btrfs sub del asnap/b 
ERROR: 'asnap/b' is not a subvolume
$ btrfs sub del a
Delete subvolume '/mnt/osd27/a/a'
ERROR: cannot delete '/mnt/osd27/a/a'
$ btrfs sub del a/b
Delete subvolume '/mnt/osd27/a/a/b'
$ btrfs sub del a  
Delete subvolume '/mnt/osd27/a/a'
$ btrfs sub del asnap
Delete subvolume '/mnt/osd27/a/asnap'

Sending the patch separately.

Thanks!
sage
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Mark btrfsctl deprecated

2010-10-25 Thread David Nicol
I am certainly not in a position to answer for Chris Mason, but I am
happy to share my response to the question, coming from a perspective
of being somewhat obsessive about not breaking back-compat.

Let's not. As I am certainly within the lot of people in question,
having just done exactly that, I found the two programs -- with very
different styles -- to not be much of a block, and providing two
examples instead of one of code that invokes an ioctl seems fine.

Ideally, everyone who needs to use an ioctl will simply write and
compile a C program that does what they need -- ha ha.

I see no reason to break legacy code (such as it is) that uses
btrfsctl instead of btrfs for a user-space tool to invoke ioctls.
Calling the tool btrfs is actually a little confusing.



On Fri, Oct 22, 2010 at 7:02 AM, Goffredo Baroncelli
kreij...@libero.it kreij...@libero.it wrote:
 Hi Chris,

 what do you think about marking deprecate the btrfsctl program ?

 A lot of people make patch involving both btrfs command and btrfsctl command,
 spending a lot of effort.

 Initially we can put a warning in the btrfctl command which suggest to use the
 btrfs command. and after XX month (six ?) we could remouve the command at all.
 The same for the other utilities like btrfs-show, btrfs-vol

 Of course this is applicable if there is no evidence of regression of btrfs vs
 btrfsctl.

 Regards
 G.Baroncelli
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] btrfs: Add lzo compression support

2010-10-25 Thread Chris Mason
On Mon, Oct 25, 2010 at 03:11:22PM +0800, Li Zefan wrote:
 Lzo is a much faster compression algorithm than gzib, so would allow
 more users to enable transparent compression, and some users can
 choose from compression ratio and compression speed.

This is also much smaller than I expected, really nice.  It looks like
older kernels won't properly deal (nicely give EIO) with lzo compressed
files?

We can add compatbits to deal with that if it is the case.

-chris
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/5] btrfs: Add lzo compression support

2010-10-25 Thread Li Zefan
Chris Mason wrote:
 On Mon, Oct 25, 2010 at 03:11:22PM +0800, Li Zefan wrote:
 Lzo is a much faster compression algorithm than gzib, so would allow
 more users to enable transparent compression, and some users can
 choose from compression ratio and compression speed.
 
 This is also much smaller than I expected, really nice.  It looks like
 older kernels won't properly deal (nicely give EIO) with lzo compressed
 files?
 
 We can add compatbits to deal with that if it is the case.
 

I forgot compatibility issue with older kernels..

Though I didn't test older kernels, I don't think they can deal with lzo
compressed files properly, at least not for inlined extents, in which
case I think btrfs will just show compressed data to the users.

So yes, an incompat flag is needed.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html