[PATCH 0/5] btrfs: Add lzo compression support
Lzo is a much faster compression algorithm than gzib, so would allow more users to enable transparent compression, and some users can choose from compression ratio and compression speed. Usage: # mount -t btrfs -o compress[=zlib,lzo] dev /mnt or # mount -t btrfs -o compress-force[=zlib,lzo] dev /mnt -o compress without argument is still allowed for compatability. Performance: The test copied a linux source tarball (~400M) from an ext4 partition to the btrfs partition, and then extracted the tarball. (time in second) lzozlibnocompress copy: 10.6 21.714.9 extract: 70.1 94.466.6 (data size in MB) lzozlibnocompress copy: 185.87 108.69 394.49 extract: 193.80 132.36 381.21 Other: The defrag ioctl is also updated, so one can choose lzo or zlib when turning on compression in defrag operation. TODO: - Update btrfs-progs accordingly. --- fs/btrfs/Makefile |2 +- fs/btrfs/btrfs_inode.h |2 +- fs/btrfs/compression.c | 329 +- fs/btrfs/compression.h | 72 ++-- fs/btrfs/ctree.h| 11 +- fs/btrfs/extent_io.c|5 +- fs/btrfs/extent_io.h| 17 ++- fs/btrfs/extent_map.c |2 + fs/btrfs/extent_map.h |3 +- fs/btrfs/file.c |2 + fs/btrfs/inode.c| 82 ++ fs/btrfs/ioctl.c| 10 +- fs/btrfs/ioctl.h|9 +- fs/btrfs/lzo.c | 409 +++ fs/btrfs/ordered-data.c | 18 ++- fs/btrfs/ordered-data.h |8 +- fs/btrfs/super.c| 47 +- fs/btrfs/zlib.c | 361 +++-- 18 files changed, 1013 insertions(+), 376 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] btrfs: Fix bugs in zlib workspace
- Fix a race that can result in alloc_workspace cpus. - Fix to check num_workspace after wakeup. Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/zlib.c |7 +-- 1 files changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index b9cd544..e5b8b22 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -75,16 +75,19 @@ again: return workspace; } - spin_unlock(workspace_lock); if (atomic_read(alloc_workspace) cpus) { DEFINE_WAIT(wait); + + spin_unlock(workspace_lock); prepare_to_wait(workspace_wait, wait, TASK_UNINTERRUPTIBLE); - if (atomic_read(alloc_workspace) cpus) + if (atomic_read(alloc_workspace) cpus !num_workspace) schedule(); finish_wait(workspace_wait, wait); goto again; } atomic_inc(alloc_workspace); + spin_unlock(workspace_lock); + workspace = kzalloc(sizeof(*workspace), GFP_NOFS); if (!workspace) { ret = -ENOMEM; -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] btrfs: Allow to add new compression algorithm
Make the code aware of compression type, instead of always assuming zlib compression. Also make the zlib workspace function as common code for all compression types. Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/btrfs_inode.h |2 +- fs/btrfs/compression.c | 236 ++- fs/btrfs/compression.h | 66 + fs/btrfs/ctree.h| 10 +- fs/btrfs/extent_io.c|5 +- fs/btrfs/extent_io.h| 17 +++- fs/btrfs/extent_map.c |2 + fs/btrfs/extent_map.h |3 +- fs/btrfs/file.c |2 + fs/btrfs/inode.c| 82 ++-- fs/btrfs/ioctl.c|4 +- fs/btrfs/ordered-data.c | 18 +++- fs/btrfs/ordered-data.h |8 ++- fs/btrfs/super.c| 41 +++- fs/btrfs/zlib.c | 253 +- 15 files changed, 472 insertions(+), 277 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index 6ad63f1..ccc991c 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -157,7 +157,7 @@ struct btrfs_inode { /* * always compress this one file */ - unsigned force_compress:1; + unsigned force_compress:4; struct inode vfs_inode; }; diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index cb3877c..4c6068b 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -62,6 +62,9 @@ struct compressed_bio { /* number of bytes on disk */ unsigned long compressed_len; + /* the compression algorithm for this bio */ + int compress_type; + /* number of compressed pages in the array */ unsigned long nr_pages; @@ -186,11 +189,12 @@ static void end_compressed_bio_read(struct bio *bio, int err) /* ok, we're the last bio for this extent, lets start * the decompression. */ - ret = btrfs_zlib_decompress_biovec(cb-compressed_pages, - cb-start, - cb-orig_bio-bi_io_vec, - cb-orig_bio-bi_vcnt, - cb-compressed_len); + ret = btrfs_decompress_biovec(cb-compress_type, + cb-compressed_pages, + cb-start, + cb-orig_bio-bi_io_vec, + cb-orig_bio-bi_vcnt, + cb-compressed_len); csum_failed: if (ret) cb-errors = 1; @@ -611,6 +615,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, cb-len = uncompressed_len; cb-compressed_len = compressed_len; + cb-compress_type = extent_compress_type(bio_flags); cb-orig_bio = bio; nr_pages = (compressed_len + PAGE_CACHE_SIZE - 1) / @@ -700,3 +705,224 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio, bio_put(comp_bio); return 0; } + +static struct list_head comp_idle_workspace[BTRFS_COMPRESS_TYPES]; +static spinlock_t comp_workspace_lock[BTRFS_COMPRESS_TYPES]; +static int comp_num_workspace[BTRFS_COMPRESS_TYPES]; +static atomic_t comp_alloc_workspace[BTRFS_COMPRESS_TYPES]; +static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]; + +struct btrfs_compress_op *btrfs_compress_op[] = { + btrfs_zlib_compress, +}; + +int __init btrfs_init_compress(void) +{ + int i; + + for (i = 0; i BTRFS_COMPRESS_TYPES; i++) { + INIT_LIST_HEAD(comp_idle_workspace[i]); + spin_lock_init(comp_workspace_lock[i]); + atomic_set(comp_alloc_workspace[i], 0); + init_waitqueue_head(comp_workspace_wait[i]); + } + return 0; +} + +/* + * this finds an available workspace or allocates a new one + * ERR_PTR is returned if things go bad. + */ +static struct list_head *find_workspace(int type) +{ + struct list_head *workspace; + int cpus = num_online_cpus(); + int idx = type - 1; + + struct list_head *idle_workspace= comp_idle_workspace[idx]; + spinlock_t *workspace_lock = comp_workspace_lock[idx]; + atomic_t *alloc_workspace = comp_alloc_workspace[idx]; + wait_queue_head_t *workspace_wait = comp_workspace_wait[idx]; + int *num_workspace = comp_num_workspace[idx]; +again: + spin_lock(workspace_lock); + if (!list_empty(idle_workspace)) { + workspace = idle_workspace-next; + list_del(workspace); + (*num_workspace)--; + spin_unlock(workspace_lock); + return workspace; + + } + if (atomic_read(alloc_workspace) cpus) { + DEFINE_WAIT(wait); + + spin_unlock(workspace_lock); + prepare_to_wait(workspace_wait, wait,
[PATCH 3/5] btrfs: Add lzo compression support
Lzo is a much faster compression algorithm than gzib, so would allow more users to enable transparent compression, and some users can choose from compression ratio and speed for different applications Usage: # mount -t btrfs -o compress[=zlib,lzo] dev /mnt or # mount -t btrfs -o compress-force[=zlib,lzo] dev /mnt -o compress without argument is still allowed for compatability. Performance: The test copied a linux source tarball (~400M) from an ext4 partition to the btrfs partition, and then extracted it. (time in second) lzozlibnocompress copy: 10.6 21.714.9 extract: 70.1 94.466.6 (data size in MB) lzozlibnocompress copy: 185.87 108.69 394.49 extract: 193.80 132.36 381.21 Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |1 + fs/btrfs/ctree.h |5 +- fs/btrfs/lzo.c | 498 fs/btrfs/super.c |6 + 6 files changed, 510 insertions(+), 3 deletions(-) create mode 100644 fs/btrfs/lzo.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index a35eb36..31610ea 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -6,5 +6,5 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ transaction.o inode.o file.o tree-defrag.o \ extent_map.o sysfs.o struct-funcs.o xattr.o ordered-data.o \ extent_io.o volumes.o async-thread.o ioctl.o locking.o orphan.o \ - export.o tree-log.o acl.o free-space-cache.o zlib.o \ + export.o tree-log.o acl.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 4c6068b..84c5490 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -714,6 +714,7 @@ static wait_queue_head_t comp_workspace_wait[BTRFS_COMPRESS_TYPES]; struct btrfs_compress_op *btrfs_compress_op[] = { btrfs_zlib_compress, + btrfs_lzo_compress, }; int __init btrfs_init_compress(void) diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index 9b5f2f3..f7ce217 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -73,5 +73,6 @@ struct btrfs_compress_op { }; extern struct btrfs_compress_op btrfs_zlib_compress; +extern struct btrfs_compress_op btrfs_lzo_compress; #endif diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index b0f8b4d..f9c2b00 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -530,8 +530,9 @@ struct btrfs_timespec { enum btrfs_compression_type { BTRFS_COMPRESS_NONE = 0, BTRFS_COMPRESS_ZLIB = 1, - BTRFS_COMPRESS_TYPES = 1, - BTRFS_COMPRESS_LAST = 2, + BTRFS_COMPRESS_LZO = 2, + BTRFS_COMPRESS_TYPES = 2, + BTRFS_COMPRESS_LAST = 3, }; struct btrfs_inode_item { diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c new file mode 100644 index 000..0a518f3 --- /dev/null +++ b/fs/btrfs/lzo.c @@ -0,0 +1,498 @@ +/* + * Copyright (C) 2008 Oracle. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + */ + +#include linux/kernel.h +#include linux/slab.h +#include linux/vmalloc.h +#include linux/init.h +#include linux/err.h +#include linux/sched.h +#include linux/pagemap.h +#include linux/bio.h +#include linux/lzo.h +#include compression.h + +#define LZO_LEN4 + +struct workspace { + void *mem; + void *buf; /* where compressed data goes */ + void *cbuf; /* where decompressed data goes */ + struct list_head list; +}; + +static void lzo_free_workspace(struct list_head *ws) +{ + struct workspace *workspace = list_entry(ws, struct workspace, list); + + vfree(workspace-buf); + vfree(workspace-cbuf); + vfree(workspace-mem); + kfree(workspace); +} + +static struct list_head *lzo_alloc_workspace(void) +{ + struct workspace *workspace; + + workspace = kzalloc(sizeof(*workspace), GFP_NOFS); + if (!workspace) + return ERR_PTR(-ENOMEM); + + workspace-mem = vmalloc(LZO1X_MEM_COMPRESS); + workspace-buf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE)); + workspace-cbuf = vmalloc(lzo1x_worst_compress(PAGE_CACHE_SIZE)); +
[PATCH 4/5] btrfs: Allow to specify compress method when defrag
Update defrag ioctl, so one can choose lzo or zlib when turning on compression in defrag operation. Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/ioctl.c |6 +- fs/btrfs/ioctl.h |9 - 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index ecc49ee..2ea05df 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -584,8 +584,12 @@ static int btrfs_defrag_file(struct file *file, } total_read++; mutex_lock(inode-i_mutex); - if (range-flags BTRFS_DEFRAG_RANGE_COMPRESS) + if (range-flags BTRFS_DEFRAG_RANGE_COMPRESS) { BTRFS_I(inode)-force_compress = BTRFS_COMPRESS_ZLIB; + if (range-compress_type) + BTRFS_I(inode)-force_compress = + range-compress_type; + } ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE); if (ret) diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 424694a..776a2b4 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -122,8 +122,15 @@ struct btrfs_ioctl_defrag_range_args { */ __u32 extent_thresh; + /* +* which compression method to use if turning on compression +* for this defrag operation. If unspecified, zlib will +* be used +*/ + __u32 compress_type; + /* spare for later */ - __u32 unused[5]; + __u32 unused[4]; }; struct btrfs_ioctl_space_info { -- 1.7.0.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] btrfs: Extract duplicate decompress code
Add a common function to copy decompressed data from working buffer to bio pages. Signed-off-by: Li Zefan l...@cn.fujitsu.com --- fs/btrfs/compression.c | 92 +++ fs/btrfs/compression.h |5 ++ fs/btrfs/lzo.c | 101 +++- fs/btrfs/zlib.c| 111 +-- 4 files changed, 115 insertions(+), 194 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 84c5490..c7f695b 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -927,3 +927,95 @@ void __exit btrfs_exit_compress(void) { free_workspaces(); } + +/* + * Copy uncompressed data from working buffer to pages. + * + * buf_start is the byte offset we're of the start of our workspace buffer. + * + * total_out is the last byte of the buffer + */ +int btrfs_decompress_buf2page(char *buf, unsigned long buf_start, + unsigned long total_out, u64 disk_start, + struct bio_vec *bvec, int vcnt, + unsigned long *page_index, + unsigned long *pg_offset) +{ + unsigned long buf_offset; + unsigned long current_buf_start; + unsigned long start_byte; + unsigned long working_bytes = total_out - buf_start; + unsigned long bytes; + char *kaddr; + struct page *page_out = bvec[*page_index].bv_page; + + /* +* start byte is the first byte of the page we're currently +* copying into relative to the start of the compressed data. +*/ + start_byte = page_offset(page_out) - disk_start; + + /* we haven't yet hit data corresponding to this page */ + if (total_out = start_byte) + return 1; + + /* +* the start of the data we care about is offset into +* the middle of our working buffer +*/ + if (total_out start_byte buf_start start_byte) { + buf_offset = start_byte - buf_start; + working_bytes -= buf_offset; + } else { + buf_offset = 0; + } + current_buf_start = buf_start; + + /* copy bytes from the working buffer into the pages */ + while (working_bytes 0) { + bytes = min(PAGE_CACHE_SIZE - *pg_offset, + PAGE_CACHE_SIZE - buf_offset); + bytes = min(bytes, working_bytes); + kaddr = kmap_atomic(page_out, KM_USER0); + memcpy(kaddr + *pg_offset, buf + buf_offset, bytes); + kunmap_atomic(kaddr, KM_USER0); + flush_dcache_page(page_out); + + *pg_offset += bytes; + buf_offset += bytes; + working_bytes -= bytes; + current_buf_start += bytes; + + /* check if we need to pick another page */ + if (*pg_offset == PAGE_CACHE_SIZE) { + (*page_index)++; + if (*page_index = vcnt) + return 0; + + page_out = bvec[*page_index].bv_page; + *pg_offset = 0; + start_byte = page_offset(page_out) - disk_start; + + /* +* make sure our new page is covered by this +* working buffer +*/ + if (total_out = start_byte) + return 1; + + /* +* the next page in the biovec might not be adjacent +* to the last page, but it might still be found +* inside this working buffer. bump our offset pointer +*/ + if (total_out start_byte + current_buf_start start_byte) { + buf_offset = start_byte - buf_start; + working_bytes = total_out - start_byte; + current_buf_start = buf_start + buf_offset; + } + } + } + + return 1; +} diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h index f7ce217..5100017 100644 --- a/fs/btrfs/compression.h +++ b/fs/btrfs/compression.h @@ -34,6 +34,11 @@ int btrfs_decompress_biovec(int type, struct page **pages_in, u64 disk_start, struct bio_vec *bvec, int vcnt, size_t srclen); int btrfs_decompress(int type, unsigned char *data_in, struct page *dest_page, unsigned long start_byte, size_t srclen, size_t destlen); +int btrfs_decompress_buf2page(char *buf, unsigned long buf_start, + unsigned long total_out, u64 disk_start, + struct bio_vec *bvec, int vcnt, + unsigned long *page_index, +
Re: Determine if a given fs is a btrfs fs
On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com wrote: ... p4 jerome # btrfs device scan /dev/dm-22 Scanning for Btrfs filesystems in '/dev/dm-22' p4 jerome # echo $? 0 This is OK. p4 jerome # btrfs device scan /dev/sda Scanning for Btrfs filesystems in '/dev/sda' ERROR: unable to scan the device '/dev/sda' p4 jerome # echo $? 11 ... But isn't that error misleading, btrfs scan was succesfully able to scan /dev/sda, but, it doesn't contain btrfs, right? -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Determine if a given fs is a btrfs fs
On Mon, Oct 25, 2010 at 6:31 AM, Jérôme Poulin jeromepou...@gmail.com wrote: On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com wrote: ... p4 jerome # btrfs device scan /dev/dm-22 Scanning for Btrfs filesystems in '/dev/dm-22' p4 jerome # echo $? 0 This is OK. p4 jerome # btrfs device scan /dev/sda Scanning for Btrfs filesystems in '/dev/sda' ERROR: unable to scan the device '/dev/sda' p4 jerome # echo $? 11 ... But isn't that error misleading, btrfs scan was succesfully able to scan /dev/sda, but, it doesn't contain btrfs, right? imo, the best way is: # root must be btrfs else silent return [ $(blkid -s TYPE -o value ${root}) = btrfs ] || return 0 at least that the way i do it in my initramfs hook; seems to be reliable. C Anthony -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/6] Btrfs commit fixes, async subvol operations
Hi Chris, This is the extent of my current queue of Btrfs snapshot/subvol/commit stuff. Most of these were posted several months ago. Can be sent upstream during this merge window? Not having this functionality is becoming a bit of a roadblock for our efforts to keep the Ceph data in a consistent state. These patches are also available from git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls The first patch is strictly a bug fix for a deadlock in btrfs_commit_transaction(). The next few patches are a repost (with a few minor revisions) of the async snapshot/subvolume ioctls I originally posted last spring. They include: - Some async commit helper functions - Start and wait sync ioctls, for initiating and waiting for a sync - An ioctl to start a snapshot creation asynchronously, such that you don't have to wait for the full commit to disk. The interface is cleaned up a bit from the original version. There is also a patch that changes the SNAP_DESTROY ioctl to not do a commit before returning. The rationale here is there is no obvious reason (to me at least) why the snapshot removal should be on disk before returning; rm(2) and unlink(2) certainly don't do that. If the user disagrees, they can call sync(2). If you would prefer, I also have a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't change any existing behavior. The last item is a change to SNAP_DESTROY to allow deletion of a snapshot when the user owns the subvol's root inode and the parent directory permissions are such that we would have allowed an rmdir(2). Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) semantics completely (except for the empty directory check) by duplicating some VFS code. Whether we want weaker semantics, duplicated code, or some new EXPORT_SYMBOLS is up to you I think. Note that this is distinct from a similar patch (also from Goffredo) that allows rmdir(2) to remove an empty subvol; my goal is to allow a non-empty subvol to be deleted by a non-root user. As long as I can do that, my daemon doesn't have to run as root and I'm a happy camper. :) If anybody has any questions or issues with any of this, please let me know so I can revise the patches accordingly. Thanks! sage --- Sage Weil (6): Btrfs: fix deadlock in btrfs_commit_transaction Btrfs: async transaction commit Btrfs: add START_SYNC, WAIT_SYNC ioctls Btrfs: add SNAP_CREATE_ASYNC ioctl Btrfs: make SNAP_DESTROY async Btrfs: allow subvol deletion by owner fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/ioctl.c | 152 +++- fs/btrfs/ioctl.h |9 +++ fs/btrfs/transaction.c | 183 +-- fs/btrfs/transaction.h |4 + security/security.c|1 + 7 files changed, 326 insertions(+), 25 deletions(-) -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] Btrfs: fix deadlock in btrfs_commit_transaction
We calculate timeout (either 1 or MAX_SCHEDULE_TIMEOUT) based on whether num_writers 1 or should_grow at the top of the loop. Then, much much later, we wait for that timeout if either num_writers or should_grow is true. However, it's possible for a racing process (calling btrfs_end_transaction()) to decrement num_writers such that we wait forever instead of for 1. Fix this by deciding how long to wait when we wait. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/transaction.c | 12 1 files changed, 4 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 66e4c66..bf399ea 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -992,7 +992,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, struct btrfs_root *root) { unsigned long joined = 0; - unsigned long timeout = 1; struct btrfs_transaction *cur_trans; struct btrfs_transaction *prev_trans = NULL; DEFINE_WAIT(wait); @@ -1063,11 +1062,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, snap_pending = 1; WARN_ON(cur_trans != trans-transaction); - if (cur_trans-num_writers 1) - timeout = MAX_SCHEDULE_TIMEOUT; - else if (should_grow) - timeout = 1; - mutex_unlock(root-fs_info-trans_mutex); if (flush_on_commit || snap_pending) { @@ -1089,8 +1083,10 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans, TASK_UNINTERRUPTIBLE); smp_mb(); - if (cur_trans-num_writers 1 || should_grow) - schedule_timeout(timeout); + if (cur_trans-num_writers 1) + schedule_timeout(MAX_SCHEDULE_TIMEOUT); + else if (should_grow) + schedule_timeout(1); mutex_lock(root-fs_info-trans_mutex); finish_wait(cur_trans-writer_wait, wait); -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] Btrfs: async transaction commit
Add support for an async transaction commit that is ordered such that any subsequent operations will join the following transaction, but does not wait until the current commit is fully on disk. This avoids much of the latency associated with the btrfs_commit_transaction for callers concerned with serialization and not safety. The wait_for_unblock flag controls whether we wait for the 'middle' portion of commit_transaction to complete, which is necessary if the caller expects some of the modifications contained in the commit to be available (this is the case for subvol/snapshot creation). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |1 + fs/btrfs/transaction.c | 119 fs/btrfs/transaction.h |3 + 4 files changed, 124 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index eaf286a..4c3c20b 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -863,6 +863,7 @@ struct btrfs_fs_info { struct btrfs_transaction *running_transaction; wait_queue_head_t transaction_throttle; wait_queue_head_t transaction_wait; + wait_queue_head_t transaction_blocked_wait; wait_queue_head_t async_submit_wait; struct btrfs_super_block super_copy; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 64f1008..034abc6 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1680,6 +1680,7 @@ struct btrfs_root *open_ctree(struct super_block *sb, init_waitqueue_head(fs_info-transaction_throttle); init_waitqueue_head(fs_info-transaction_wait); + init_waitqueue_head(fs_info-transaction_blocked_wait); init_waitqueue_head(fs_info-async_submit_wait); __setup_root(4096, 4096, 4096, 4096, tree_root, diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index bf399ea..d720106 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -988,6 +988,123 @@ int btrfs_transaction_blocked(struct btrfs_fs_info *info) return ret; } +/* + * wait for the current transaction commit to start and block subsequent + * transaction joins + */ +static void wait_current_trans_commit_start(struct btrfs_root *root, + struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-in_commit) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_blocked_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-in_commit) { + finish_wait(root-fs_info-transaction_blocked_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_blocked_wait, wait); + } +} + +/* + * wait for the current transaction to start and then become unblocked. + * caller holds ref. + */ +static void wait_current_trans_commit_start_and_unblock(struct btrfs_root *root, +struct btrfs_transaction *trans) +{ + DEFINE_WAIT(wait); + + if (trans-commit_done || (trans-in_commit !trans-blocked)) + return; + + while (1) { + prepare_to_wait(root-fs_info-transaction_wait, wait, + TASK_UNINTERRUPTIBLE); + if (trans-commit_done || + (trans-in_commit !trans-blocked)) { + finish_wait(root-fs_info-transaction_wait, + wait); + break; + } + mutex_unlock(root-fs_info-trans_mutex); + schedule(); + mutex_lock(root-fs_info-trans_mutex); + finish_wait(root-fs_info-transaction_wait, + wait); + } +} + +/* + * commit transactions asynchronously. once btrfs_commit_transaction_async + * returns, any subsequent transaction will not be allowed to join. + */ +struct btrfs_async_commit { + struct btrfs_trans_handle *newtrans; + struct btrfs_root *root; + struct delayed_work work; +}; + +static void do_async_commit(struct work_struct *work) +{ + struct btrfs_async_commit *ac = + container_of(work, struct btrfs_async_commit, work.work); + + btrfs_commit_transaction(ac-newtrans, ac-root); + kfree(ac); +} + +int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + int wait_for_unblock) +{ + struct btrfs_async_commit *ac; + struct btrfs_transaction *cur_trans; + + ac = kmalloc(sizeof(*ac), GFP_NOFS); + BUG_ON(!ac); + + INIT_DELAYED_WORK(ac-work,
[PATCH 3/6] Btrfs: add START_SYNC, WAIT_SYNC ioctls
START_SYNC will start a sync/commit, but not wait for it to complete. Any modification started after the ioctl returns is guaranteed not to be included in the commit. If a non-NULL pointer is passed, the transaction id will be returned to userspace. WAIT_SYNC will wait for any in-progress commit to complete. If a transaction id is specified, the ioctl will block and then return (success) when the specified transaction has committed. If it has already committed when we call the ioctl, it returns immediately. If the specified transaction doesn't exist, it returns EINVAL. If no transaction id is specified, WAIT_SYNC will wait for the currently committing transaction to finish it's commit to disk. If there is no currently committing transaction, it returns success. These ioctls are useful for applications which want to impose an ordering on when fs modifications reach disk, but do not want to wait for the full (slow) commit process to do so. Picky callers can take the transid returned by START_SYNC and feed it to WAIT_SYNC, and be certain to wait only as long as necessary for the transaction _they_ started to reach disk. Sloppy callers can START_SYNC and WAIT_SYNC without a transid, and provided they didn't wait too long between the calls, they will get the same result. However, if a second commit starts before they call WAIT_SYNC, they may end up waiting longer for it to commit as well. Even so, a START_SYNC+WAIT_SYNC still guarantees that any operation completed before the START_SYNC reaches disk. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c | 34 +++ fs/btrfs/ioctl.h |2 + fs/btrfs/transaction.c | 52 fs/btrfs/transaction.h |1 + 4 files changed, 89 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9254b3d..6306051 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1984,6 +1984,36 @@ long btrfs_ioctl_trans_end(struct file *file) return 0; } +static noinline long btrfs_ioctl_start_sync(struct file *file, void __user *argp) +{ + struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root; + struct btrfs_trans_handle *trans; + u64 transid; + + trans = btrfs_start_transaction(root, 0); + transid = trans-transid; + btrfs_commit_transaction_async(trans, root, 0); + + if (argp) + if (copy_to_user(argp, transid, sizeof(transid))) + return -EFAULT; + return 0; +} + +static noinline long btrfs_ioctl_wait_sync(struct file *file, void __user *argp) +{ + struct btrfs_root *root = BTRFS_I(file-f_dentry-d_inode)-root; + u64 transid; + + if (argp) { + if (copy_from_user(transid, argp, sizeof(transid))) + return -EFAULT; + } else { + transid = 0; /* current trans */ + } + return btrfs_wait_for_commit(root, transid); +} + long btrfs_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { @@ -2034,6 +2064,10 @@ long btrfs_ioctl(struct file *file, unsigned int case BTRFS_IOC_SYNC: btrfs_sync_fs(file-f_dentry-d_sb, 1); return 0; + case BTRFS_IOC_START_SYNC: + return btrfs_ioctl_start_sync(file, argp); + case BTRFS_IOC_WAIT_SYNC: + return btrfs_ioctl_wait_sync(file, argp); } return -ENOTTY; diff --git a/fs/btrfs/ioctl.h b/fs/btrfs/ioctl.h index 424694a..e9a2f7e 100644 --- a/fs/btrfs/ioctl.h +++ b/fs/btrfs/ioctl.h @@ -178,4 +178,6 @@ struct btrfs_ioctl_space_args { #define BTRFS_IOC_DEFAULT_SUBVOL _IOW(BTRFS_IOCTL_MAGIC, 19, u64) #define BTRFS_IOC_SPACE_INFO _IOWR(BTRFS_IOCTL_MAGIC, 20, \ struct btrfs_ioctl_space_args) +#define BTRFS_IOC_START_SYNC _IOR(BTRFS_IOCTL_MAGIC, 21, __u64) +#define BTRFS_IOC_WAIT_SYNC _IOW(BTRFS_IOCTL_MAGIC, 22, __u64) #endif diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index d720106..932bc03 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -270,6 +270,58 @@ static noinline int wait_for_commit(struct btrfs_root *root, return 0; } +int btrfs_wait_for_commit(struct btrfs_root *root, u64 transid) +{ + struct btrfs_transaction *cur_trans = NULL, *t; + int ret; + + mutex_lock(root-fs_info-trans_mutex); + + ret = 0; + if (transid) { + if (transid = root-fs_info-last_trans_committed) + goto out_unlock; + + /* find specified transaction */ + list_for_each_entry(t, root-fs_info-trans_list, list) { + if (t-transid == transid) { + cur_trans = t; + break; + } + if (t-transid transid) + break; +
[PATCH 5/6] Btrfs: make SNAP_DESTROY async
There is no reason to force an immediate commit when deleting a snapshot. Users have some expectation that space from a deleted snapshot be freed immediately, but even if we do commit the reclaim is a background process. If users _do_ want the deletion to be durable, they can call 'sync'. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c |2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 679b8a8..9cbda86 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1365,7 +1365,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, BUG_ON(ret); } - ret = btrfs_commit_transaction(trans, root); + ret = btrfs_end_transaction(trans, root); BUG_ON(ret); inode-i_flags |= S_DEAD; out_up_write: -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] Btrfs: allow subvol deletion by owner
Allow users to delete the snapshots/subvols they own. When CAP_SYS_ADMIN is not present, require that - the user has write and exec permission on the parent directory - security_inode_rmdir() doesn't object - the user has write and exec permission on the subvol directory - the user owns the subvol directory - the directory and subvol append flags are not set This is a bit more strict than the requirements for 'rm -f subvol/*', which is allowed with just 'wx' on non-owned non-sticky dirs. It is less strict than 'rmdir subvol', which has additional requirements if the parent directory is sticky. It is less strict than 'rm -rf subvol', which would require scanning the entire subvol to ensure all content is deletable. Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ioctl.c| 27 --- security/security.c |1 + 2 files changed, 25 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 9cbda86..90d2871 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1288,9 +1288,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1325,6 +1322,30 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, goto out_dput; } + /* +* Allow subvol deletion if we own the subvol and we would +* (approximately) be allowed to rmdir it. Strictly speaking, +* we could possibly delete everything with fewer permissions, +* or might require more permissions to remove all files +* contained by the subvol, but we aren't trying to mimic +* directory semantics perfectly. +*/ + if (!capable(CAP_SYS_ADMIN)) { + err = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = security_inode_rmdir(dir, dentry); + if (err) + goto out_dput; + err = inode_permission(inode, MAY_WRITE | MAY_EXEC); + if (err) + goto out_dput; + err = -EPERM; + if (inode-i_uid != current_fsuid() || IS_APPEND(dir) || + IS_APPEND(inode)) + goto out_dput; + } + dest = BTRFS_I(inode)-root; mutex_lock(inode-i_mutex); diff --git a/security/security.c b/security/security.c index c53949f..1c980ee 100644 --- a/security/security.c +++ b/security/security.c @@ -490,6 +490,7 @@ int security_inode_rmdir(struct inode *dir, struct dentry *dentry) return 0; return security_ops-inode_rmdir(dir, dentry); } +EXPORT_SYMBOL_GPL(security_inode_rmdir); int security_inode_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev) { -- 1.6.6.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Determine if a given fs is a btrfs fs
Hi, that was a good hint, thanks a lot! The best at that solution is that it works for all filesystems: Just test if the output of $type (type=`blkid -s TYPE -o value $loopdev`) is empty (''), then it is no filesystem and the loopaes password was wrong or the device/fs is broken. Felix On 25. October 2010 - 08:29, C Anthony Risinger wrote: Date: Mon, 25 Oct 2010 08:29:58 -0500 From: C Anthony Risinger anth...@extof.me To: Jérôme Poulin jeromepou...@gmail.com Cc: linux-btrfs linux-btrfs@vger.kernel.org Subject: Re: Determine if a given fs is a btrfs fs On Mon, Oct 25, 2010 at 6:31 AM, Jérôme Poulin jeromepou...@gmail.com wrote: On Sun, Oct 24, 2010 at 5:32 PM, Jérôme Poulin jeromepou...@gmail.com wrote: ... p4 jerome # btrfs device scan /dev/dm-22 Scanning for Btrfs filesystems in '/dev/dm-22' p4 jerome # echo $? 0 This is OK. p4 jerome # btrfs device scan /dev/sda Scanning for Btrfs filesystems in '/dev/sda' ERROR: unable to scan the device '/dev/sda' p4 jerome # echo $? 11 ... But isn't that error misleading, btrfs scan was succesfully able to scan /dev/sda, but, it doesn't contain btrfs, right? imo, the best way is: # root must be btrfs else silent return [ $(blkid -s TYPE -o value ${root}) = btrfs ] || return 0 at least that the way i do it in my initramfs hook; seems to be reliable. C Anthony -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ---end quoted text--- -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations
On Mon, Oct 25, 2010 at 12:07:36PM -0700, Sage Weil wrote: Hi Chris, This is the extent of my current queue of Btrfs snapshot/subvol/commit stuff. Most of these were posted several months ago. Can be sent upstream during this merge window? Not having this functionality is becoming a bit of a roadblock for our efforts to keep the Ceph data in a consistent state. These patches are also available from git://git.kernel.org/pub/scm/linux/kernel/git/sage/btrfs.git snap_ioctls The first patch is strictly a bug fix for a deadlock in btrfs_commit_transaction(). The next few patches are a repost (with a few minor revisions) of the async snapshot/subvolume ioctls I originally posted last spring. They include: - Some async commit helper functions - Start and wait sync ioctls, for initiating and waiting for a sync - An ioctl to start a snapshot creation asynchronously, such that you don't have to wait for the full commit to disk. The interface is cleaned up a bit from the original version. There is also a patch that changes the SNAP_DESTROY ioctl to not do a commit before returning. The rationale here is there is no obvious reason (to me at least) why the snapshot removal should be on disk before returning; rm(2) and unlink(2) certainly don't do that. If the user disagrees, they can call sync(2). If you would prefer, I also have a patch that introduces a new SNAP_DESTROY_ASYNC ioctl that doesn't change any existing behavior. These all look good to me and I'm pulling them in. The last item is a change to SNAP_DESTROY to allow deletion of a snapshot when the user owns the subvol's root inode and the parent directory permissions are such that we would have allowed an rmdir(2). Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) semantics completely (except for the empty directory check) by duplicating some VFS code. Whether we want weaker semantics, duplicated code, or some new EXPORT_SYMBOLS is up to you I think. Note that this is distinct from a similar patch (also from Goffredo) that allows rmdir(2) to remove an empty subvol; my goal is to allow a non-empty subvol to be deleted by a non-root user. As long as I can do that, my daemon doesn't have to run as root and I'm a happy camper. :) Someone at the storage workshop mentioned that this subvol deletion trick is slightly stronger than rm -rf, to make it include the same level of permission checks would require testing all the directories in the tree for permissions. For now, could you please make a mount -o user_subvol_rm_allowed option? (or something similar with a better name). Thanks! -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations
On Mon, 25 Oct 2010, Chris Mason wrote: These all look good to me and I'm pulling them in. Great, thanks! The last item is a change to SNAP_DESTROY to allow deletion of a snapshot when the user owns the subvol's root inode and the parent directory permissions are such that we would have allowed an rmdir(2). Goffredo Baroncelli posted a similar patch that replicates the rmdir(2) semantics completely (except for the empty directory check) by duplicating some VFS code. Whether we want weaker semantics, duplicated code, or some new EXPORT_SYMBOLS is up to you I think. Note that this is distinct from a similar patch (also from Goffredo) that allows rmdir(2) to remove an empty subvol; my goal is to allow a non-empty subvol to be deleted by a non-root user. As long as I can do that, my daemon doesn't have to run as root and I'm a happy camper. :) Someone at the storage workshop mentioned that this subvol deletion trick is slightly stronger than rm -rf, to make it include the same level of permission checks would require testing all the directories in the tree for permissions. I think that was me :) For now, could you please make a mount -o user_subvol_rm_allowed option? (or something similar with a better name). Sure. Do you have a preference as far as what checks are implemented? My patch implemented a simplified approximation of may_rmdir(); Goffredo's duplicated the vfs checks. I guess I'm leaning toward the latter... Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Replacing the top-level root
On Mon, Oct 25, 2010 at 2:58 PM, Chris Mason chris.ma...@oracle.com wrote: Oh, and it shouldn't work on the root of the FS either ;) -chris This is not 100% related but... Will removing [replacing] the top-level root be something that can/will be supported? I've asked in the past about this regarding an initramfs hook i maintain implementing system rollbacks, but I never really got a solid answer. For example, right now extlinux support booting btrfs, but _only_ from the top-level root. if i just had a way to swap the top-level root with a different subvol, i could overcome several problems i have with users all at once: ) users install their system to the top-level root, which means it is no longer manageable by snapshot scripts [currently] ) if the top-level root could be swapped, extlinux could then boot my snapshot? (i'm probably wrong here) set-default is not enough; i'm looking for a way to gain control over the system state when the system has been installed into the top-level root. currently, i have no way to manipulate/move/change it, because the top-level is essentially immutable, or so it seems. thus it's not possible to support kernel rollbacks without manually syncing snapshot/boot to /boot. is there a solution to this that i'm missing? C Anthony -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: allow subvol deletion by unprivileged user with -o user_subvol_rm_allowed
Add a mount option user_subvol_rm_allowed that allows users to delete a (potentially non-empty!) subvol when they would otherwise we allowed to do an rmdir(2). We duplicate the may_delete() checks from the core VFS code to implement identical security checks (minus the directory size check). Signed-off-by: Sage Weil s...@newdream.net --- fs/btrfs/ctree.h |1 + fs/btrfs/ioctl.c | 109 +++-- fs/btrfs/super.c |6 ++- 3 files changed, 110 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5ac2bca..140003e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1196,6 +1196,7 @@ struct btrfs_root { #define BTRFS_MOUNT_NOSSD (1 9) #define BTRFS_MOUNT_DISCARD(1 10) #define BTRFS_MOUNT_FORCE_COMPRESS (1 11) +#define BTRFS_MOUNT_USER_SUBVOL_RM_ALLOWED (1 12) #define btrfs_clear_opt(o, opt)((o) = ~BTRFS_MOUNT_##opt) #define btrfs_set_opt(o, opt) ((o) |= BTRFS_MOUNT_##opt) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 906e3b3..919b23f 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -409,6 +409,76 @@ fail: return ret; } +/* copy of check_sticky in fs/namei.c() +* It's inline, so penalty for filesystems that don't use sticky bit is +* minimal. +*/ +static inline int btrfs_check_sticky(struct inode *dir, struct inode *inode) +{ + uid_t fsuid = current_fsuid(); + + if (!(dir-i_mode S_ISVTX)) + return 0; + if (inode-i_uid == fsuid) + return 0; + if (dir-i_uid == fsuid) + return 0; + return !capable(CAP_FOWNER); +} + +/* copy of may_delete in fs/namei.c() + * Check whether we can remove a link victim from directory dir, check + * whether the type of victim is right. + * 1. We can't do it if dir is read-only (done in permission()) + * 2. We should have write and exec permissions on dir + * 3. We can't remove anything from append-only dir + * 4. We can't do anything with immutable dir (done in permission()) + * 5. If the sticky bit on dir is set we should either + * a. be owner of dir, or + * b. be owner of victim, or + * c. have CAP_FOWNER capability + * 6. If the victim is append-only or immutable we can't do antyhing with + * links pointing to it. + * 7. If we were asked to remove a directory and victim isn't one - ENOTDIR. + * 8. If we were asked to remove a non-directory and victim isn't one - EISDIR. + * 9. We can't remove a root or mountpoint. + * 10. We don't allow removal of NFS sillyrenamed files; it's handled by + * nfs_async_unlink(). + */ + +static int btrfs_may_delete(struct inode *dir,struct dentry *victim,int isdir) +{ + int error; + + if (!victim-d_inode) + return -ENOENT; + + BUG_ON(victim-d_parent-d_inode != dir); + audit_inode_child(victim, dir); + + error = inode_permission(dir, MAY_WRITE | MAY_EXEC); + if (error) + return error; + if (IS_APPEND(dir)) + return -EPERM; + if (btrfs_check_sticky(dir, victim-d_inode)|| + IS_APPEND(victim-d_inode)|| + IS_IMMUTABLE(victim-d_inode) || IS_SWAPFILE(victim-d_inode)) + return -EPERM; + if (isdir) { + if (!S_ISDIR(victim-d_inode-i_mode)) + return -ENOTDIR; + if (IS_ROOT(victim)) + return -EBUSY; + } else if (S_ISDIR(victim-d_inode-i_mode)) + return -EISDIR; + if (IS_DEADDIR(dir)) + return -ENOENT; + if (victim-d_flags DCACHE_NFSFS_RENAMED) + return -EBUSY; + return 0; +} + /* copy of may_create in fs/namei.c() */ static inline int btrfs_may_create(struct inode *dir, struct dentry *child) { @@ -1288,9 +1358,6 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, int ret; int err = 0; - if (!capable(CAP_SYS_ADMIN)) - return -EPERM; - vol_args = memdup_user(arg, sizeof(*vol_args)); if (IS_ERR(vol_args)) return PTR_ERR(vol_args); @@ -1320,13 +1387,45 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file, } inode = dentry-d_inode; + dest = BTRFS_I(inode)-root; + if (!capable(CAP_SYS_ADMIN)){ + /* +* Regular user. Only allow this with a special mount +* option, and when rmdir(2) would have been allowed. +* +* Note that this is _not_ check that the subvol is +* empty or doesn't contain data that we wouldn't +* otherwise be able to delete. +* +* Users who want to delete empty subvols should try +* rmdir(2). +*/ + err = -EPERM; + if (!btrfs_test_opt(root, USER_SUBVOL_RM_ALLOWED)) +
Re: [PATCH 0/6] Btrfs commit fixes, async subvol operations
On Mon, 25 Oct 2010, Chris Mason wrote: Yes, lets duplicate the vfs checks. Christoph just sat bolt upright in whatever ski lift he's currently riding. :) This version is based on Goffredo's patch, but requires the user_subvol_rm_allowed mount option, and will proceed even if the subvol is nonempty. (I suspect we also want his other rmdir(2) patch at some point as well so that users can rmdir an empty subvol and/or 'rm -r'.) We should also make sure they do the subvol rm against the root of the subvol (if that check isn't already done), none of the magic to resolve the subvol based on any file inside it. I don't want people to accidentally think they are deleting a subdir and have it go higher up into the directory tree. Oh, and it shouldn't work on the root of the FS either ;) Since the ioctl is based on a name lookup, I added a check that the parent inode's root isn't the same as the target root. That implies the ioctl is called the dentry referencing the subvol, and presumably the root doesn't have one of those. Does that cover it? It isn't possible to have the subvol bound to multiple locations in the namespace, right? $ whoami sage $ btrfs sub create a Create subvolume './a' $ btrfs sub create a/b Create subvolume 'a/b' $ btrfs sub snap a asnap Create a snapshot of 'a' in './asnap' $ btrfs sub del asnap/b ERROR: 'asnap/b' is not a subvolume $ btrfs sub del a Delete subvolume '/mnt/osd27/a/a' ERROR: cannot delete '/mnt/osd27/a/a' $ btrfs sub del a/b Delete subvolume '/mnt/osd27/a/a/b' $ btrfs sub del a Delete subvolume '/mnt/osd27/a/a' $ btrfs sub del asnap Delete subvolume '/mnt/osd27/a/asnap' Sending the patch separately. Thanks! sage -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Mark btrfsctl deprecated
I am certainly not in a position to answer for Chris Mason, but I am happy to share my response to the question, coming from a perspective of being somewhat obsessive about not breaking back-compat. Let's not. As I am certainly within the lot of people in question, having just done exactly that, I found the two programs -- with very different styles -- to not be much of a block, and providing two examples instead of one of code that invokes an ioctl seems fine. Ideally, everyone who needs to use an ioctl will simply write and compile a C program that does what they need -- ha ha. I see no reason to break legacy code (such as it is) that uses btrfsctl instead of btrfs for a user-space tool to invoke ioctls. Calling the tool btrfs is actually a little confusing. On Fri, Oct 22, 2010 at 7:02 AM, Goffredo Baroncelli kreij...@libero.it kreij...@libero.it wrote: Hi Chris, what do you think about marking deprecate the btrfsctl program ? A lot of people make patch involving both btrfs command and btrfsctl command, spending a lot of effort. Initially we can put a warning in the btrfctl command which suggest to use the btrfs command. and after XX month (six ?) we could remouve the command at all. The same for the other utilities like btrfs-show, btrfs-vol Of course this is applicable if there is no evidence of regression of btrfs vs btrfsctl. Regards G.Baroncelli -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] btrfs: Add lzo compression support
On Mon, Oct 25, 2010 at 03:11:22PM +0800, Li Zefan wrote: Lzo is a much faster compression algorithm than gzib, so would allow more users to enable transparent compression, and some users can choose from compression ratio and compression speed. This is also much smaller than I expected, really nice. It looks like older kernels won't properly deal (nicely give EIO) with lzo compressed files? We can add compatbits to deal with that if it is the case. -chris -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/5] btrfs: Add lzo compression support
Chris Mason wrote: On Mon, Oct 25, 2010 at 03:11:22PM +0800, Li Zefan wrote: Lzo is a much faster compression algorithm than gzib, so would allow more users to enable transparent compression, and some users can choose from compression ratio and compression speed. This is also much smaller than I expected, really nice. It looks like older kernels won't properly deal (nicely give EIO) with lzo compressed files? We can add compatbits to deal with that if it is the case. I forgot compatibility issue with older kernels.. Though I didn't test older kernels, I don't think they can deal with lzo compressed files properly, at least not for inlined extents, in which case I think btrfs will just show compressed data to the users. So yes, an incompat flag is needed. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html