Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
Hi Ted,

> I happened to be going through the source code for write_cache_pages(),
> and I came across a reference to AOP_WRITEPAGE_ACTIVATE. I was curious
> what the heck that was, so I did a search for it, and found this in
> Documentation/filesystems/vfs.txt:
>
>   If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to
>   try too hard if there are problems, and may choose to write out
>   other pages from the mapping if that is easier (e.g. due to
>   internal dependencies). If it chooses not to start writeout, it
>   should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep
>   calling ->writepage on that page.
>
>   See the file "Locking" for more details.
>
> No filesystems are currently returning AOP_WRITEPAGE_ACTIVATE when they
> choose not to write out a page; they call redirty_page_for_writepage()
> instead.
>
> Is this a change we should make, for example when btrfs refuses a
> writepage() when PF_MEMALLOC is set, or when ext4 refuses a writepage()
> if the page involved hasn't been allocated an on-disk block yet (i.e.,
> delayed allocation)? The change seems to be that we should call
> redirty_page_for_writepage() as before, but then _not_ unlock the page,
> and return AOP_WRITEPAGE_ACTIVATE. Is this a good and useful thing for
> us to do?

Sorry, no. AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs cases (and ramdisk later switched to another approach). It assumes that writepage refusals don't happen for the majority of pages; in other words, the VM assumes that even if this particular page can't be written out, many other pages can. The VM then only activates a page when AOP_WRITEPAGE_ACTIVATE is returned. But now ext4 and btrfs refuse all writepage() calls (right?). In other words, I don't think that documentation anticipated the delayed allocation issue ;)

The point is, our dirty page accounting only tracks the per-system-memory dirty ratio and per-task dirty pages; it doesn't account for a per-NUMA-node or per-zone dirty ratio. So refusing writepage while abusing fake NUMA can easily confuse our VM.
If writepage is refused for _all_ pages on a VM LRU list (the LRU is per-zone), page activation doesn't help. It can also lead to OOM.

And I'm sorry, but I have to say that all VM developers consider fake NUMA not production-level quality yet. AFAIK, nobody has seriously tested our VM code in such an environment (linux/arch/x86/Kconfig says "This is only useful for debugging"):

config NUMA_EMU
	bool "NUMA emulation"
	depends on X86_64 && NUMA
	---help---
	  Enable NUMA emulation. A flat machine will be split
	  into virtual nodes when booted with "numa=fake=N", where
	  N is the number of nodes. This is only useful for debugging.

> Right now, the only writepage() function which is returning
> AOP_WRITEPAGE_ACTIVATE is shmem_writepage(), and very curiously it's not
> using redirty_page_for_writepage(). Should it, out of consistency's
> sake if not to keep various zone accounting straight?

Umm. I don't know the reason; I've cc'd Hugh instead :)

> There are some longer-term issues, including the fact that ext4 and
> btrfs are violating some of the rules laid out in
> Documentation/vfs/Locking regarding what writepage() is supposed to do
> under direct reclaim -- something which isn't going to be practical for
> us to change on the file-system side, at least not without doing some
> pretty nasty and serious rework, for both ext4 and I suspect btrfs. But
> if returning AOP_WRITEPAGE_ACTIVATE will help the VM deal more
> gracefully with the fact that ext4 and btrfs will be refusing
> writepage() calls under certain conditions, maybe we should make this
> change?

I'm sorry again. I'm pretty sure our VM also needs to change if we want to solve your company's fake NUMA use case. I think our VM is still unfriendly to delayed allocation; we simply hadn't noticed the ext4 delayed allocation issue ;-)

So, I have two questions:

- I really want to understand the ext4 delayed allocation issue. Can you
  point me to a URL that explains ext4's high-level design and behavior
  for delayed allocation?
- If my understanding is correct, creating very many fake NUMA nodes and
  running a simple dd can reproduce your issue, right?

I'm now guessing that a small enough VM patch can solve this issue (that's only a guess; maybe yes, maybe no), but a correct understanding and a correct way to test are really necessary. Please help.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V2 01/12] Btrfs: Link block groups of different raid types in the same space_info
The size of reserved space is stored in space_info. If block groups of different raid types are linked to separate space_info, changing allocation profile will corrupt reserved space accounting. Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:23:52.921839641 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:23:52.926830638 +0800 @@ -662,6 +662,7 @@ struct btrfs_csum_item { #define BTRFS_BLOCK_GROUP_RAID1(1 << 4) #define BTRFS_BLOCK_GROUP_DUP (1 << 5) #define BTRFS_BLOCK_GROUP_RAID10 (1 << 6) +#define BTRFS_NR_RAID_TYPES 5 struct btrfs_block_group_item { __le64 used; @@ -673,7 +674,8 @@ struct btrfs_space_info { u64 flags; u64 total_bytes;/* total bytes in the space */ - u64 bytes_used; /* total bytes used on disk */ + u64 bytes_used; /* total bytes used, + this does't take mirrors into account */ u64 bytes_pinned; /* total bytes pinned, will be freed when the transaction finishes */ u64 bytes_reserved; /* total bytes the allocator has reserved for @@ -686,6 +688,7 @@ struct btrfs_space_info { delalloc/allocations */ u64 bytes_delalloc; /* number of bytes currently reserved for delayed allocation */ + u64 disk_used; /* total bytes used on disk */ int full; /* indicates that we cannot allocate any more chunks for this space */ @@ -703,7 +706,7 @@ struct btrfs_space_info { int flushing; /* for block groups in our same type */ - struct list_head block_groups; + struct list_head block_groups[BTRFS_NR_RAID_TYPES]; spinlock_t lock; struct rw_semaphore groups_sem; atomic_t caching_threads; diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-26 17:23:52.922840061 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-26 17:23:52.929829246 +0800 @@ -506,6 +506,9 @@ static struct btrfs_space_info *__find_s struct list_head *head = &info->space_info; struct btrfs_space_info *found; + flags &= BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | +BTRFS_BLOCK_GROUP_METADATA; + 
rcu_read_lock(); list_for_each_entry_rcu(found, head, list) { if (found->flags == flags) { @@ -2659,12 +2662,21 @@ static int update_space_info(struct btrf struct btrfs_space_info **space_info) { struct btrfs_space_info *found; + int i; + int factor; + + if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 | +BTRFS_BLOCK_GROUP_RAID10)) + factor = 2; + else + factor = 1; found = __find_space_info(info, flags); if (found) { spin_lock(&found->lock); found->total_bytes += total_bytes; found->bytes_used += bytes_used; + found->disk_used += bytes_used * factor; found->full = 0; spin_unlock(&found->lock); *space_info = found; @@ -2674,14 +2686,18 @@ static int update_space_info(struct btrf if (!found) return -ENOMEM; - INIT_LIST_HEAD(&found->block_groups); + for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) + INIT_LIST_HEAD(&found->block_groups[i]); init_rwsem(&found->groups_sem); init_waitqueue_head(&found->flush_wait); init_waitqueue_head(&found->allocate_wait); spin_lock_init(&found->lock); - found->flags = flags; + found->flags = flags & (BTRFS_BLOCK_GROUP_DATA | + BTRFS_BLOCK_GROUP_SYSTEM | + BTRFS_BLOCK_GROUP_METADATA); found->total_bytes = total_bytes; found->bytes_used = bytes_used; + found->disk_used = bytes_used * factor; found->bytes_pinned = 0; found->bytes_reserved = 0; found->bytes_readonly = 0; @@ -2751,26 +2767,32 @@ u64 btrfs_reduce_alloc_profile(struct bt return flags; } -static u64 btrfs_get_alloc_profile(struct btrfs_root *root, u64 data) +static u64 get_alloc_profile(struct btrfs_root *root, u64 flags) { - struct btrfs_fs_info *info = root->fs_info; - u64 alloc_profile; + if (flags & BTRFS_BLOCK_GROUP_DATA) + flags |= root->fs_info->avail_data_alloc_bits & +root->fs_info->data_alloc_profile; + else if (flags & BTRFS_BLOCK_GROUP_SYSTEM) + flags |= root->fs_info->avail_system_alloc_bits & +root->fs_info->system_alloc_profile; + else if (flags & BTRFS_BLOCK_GROUP_METADATA) + flags |= root->fs_info->avail_metadata_alloc_bits
[PATCH V2 02/12] Btrfs: Kill allocate_wait in space_info
We already have fs_info->chunk_mutex to avoid concurrent chunk creation. Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:24:10.436081649 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:24:10.441079491 +0800 @@ -700,9 +700,7 @@ struct btrfs_space_info { struct list_head list; /* for controlling how we free up space for allocations */ - wait_queue_head_t allocate_wait; wait_queue_head_t flush_wait; - int allocating_chunk; int flushing; /* for block groups in our same type */ diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-26 17:24:10.437084933 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-26 17:24:10.444079704 +0800 @@ -70,6 +70,9 @@ static int find_next_key(struct btrfs_pa struct btrfs_key *key); static void dump_space_info(struct btrfs_space_info *info, u64 bytes, int dump_block_groups); +static int maybe_allocate_chunk(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_space_info *sinfo, u64 num_bytes); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -2690,7 +2693,6 @@ static int update_space_info(struct btrf INIT_LIST_HEAD(&found->block_groups[i]); init_rwsem(&found->groups_sem); init_waitqueue_head(&found->flush_wait); - init_waitqueue_head(&found->allocate_wait); spin_lock_init(&found->lock); found->flags = flags & (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | @@ -3003,71 +3005,6 @@ flush: wake_up(&info->flush_wait); } -static int maybe_allocate_chunk(struct btrfs_root *root, -struct btrfs_space_info *info) -{ - struct btrfs_super_block *disk_super = &root->fs_info->super_copy; - struct btrfs_trans_handle *trans; - bool wait = false; - int ret = 0; - u64 min_metadata; - u64 free_space; - - free_space = btrfs_super_total_bytes(disk_super); - /* -* we allow the metadata to grow to a max of either 10gb or 5% of the -* space in the volume. 
-*/ - min_metadata = min((u64)10 * 1024 * 1024 * 1024, -div64_u64(free_space * 5, 100)); - if (info->total_bytes >= min_metadata) { - spin_unlock(&info->lock); - return 0; - } - - if (info->full) { - spin_unlock(&info->lock); - return 0; - } - - if (!info->allocating_chunk) { - info->force_alloc = 1; - info->allocating_chunk = 1; - } else { - wait = true; - } - - spin_unlock(&info->lock); - - if (wait) { - wait_event(info->allocate_wait, - !info->allocating_chunk); - return 1; - } - - trans = btrfs_start_transaction(root, 1); - if (!trans) { - ret = -ENOMEM; - goto out; - } - - ret = do_chunk_alloc(trans, root->fs_info->extent_root, -4096 + 2 * 1024 * 1024, -info->flags, 0); - btrfs_end_transaction(trans, root); - if (ret) - goto out; -out: - spin_lock(&info->lock); - info->allocating_chunk = 0; - spin_unlock(&info->lock); - wake_up(&info->allocate_wait); - - if (ret) - return 0; - return 1; -} - /* * Reserve metadata space for delalloc. */ @@ -3108,7 +3045,8 @@ again: flushed++; if (flushed == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; flushed++; } else { @@ -3223,7 +3161,8 @@ again: if (used > meta_sinfo->total_bytes) { retries++; if (retries == 1) { - if (maybe_allocate_chunk(root, meta_sinfo)) + if (maybe_allocate_chunk(NULL, root, meta_sinfo, +num_bytes)) goto again; retries++; } else { @@ -3420,13 +3359,28 @@ static void force_metadata_allocation(st rcu_read_unlock(); } +static int should_alloc_chunk(struct btrfs_space_info *sinfo, + u64 alloc_bytes) +{ + u64 num_bytes = sinfo->total_bytes - sinfo->bytes_readonly; + + if (sinfo->bytes_used + sinfo->bytes_reserved + + alloc_bytes + 256 * 1024 * 1024 < num_bytes) + return 0; + + if (s
[PATCH V2 04/12] Btrfs: Kill init_btrfs_i()
All code in init_btrfs_i can be moved into btrfs_alloc_inode() Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c --- 2/fs/btrfs/inode.c 2010-04-26 17:24:41.254078880 +0800 +++ 3/fs/btrfs/inode.c 2010-04-26 17:24:41.270103836 +0800 @@ -3595,40 +3595,10 @@ again: return 0; } -static noinline void init_btrfs_i(struct inode *inode) -{ - struct btrfs_inode *bi = BTRFS_I(inode); - - bi->generation = 0; - bi->sequence = 0; - bi->last_trans = 0; - bi->last_sub_trans = 0; - bi->logged_trans = 0; - bi->delalloc_bytes = 0; - bi->reserved_bytes = 0; - bi->disk_i_size = 0; - bi->flags = 0; - bi->index_cnt = (u64)-1; - bi->last_unlink_trans = 0; - bi->ordered_data_close = 0; - bi->force_compress = 0; - extent_map_tree_init(&BTRFS_I(inode)->extent_tree, GFP_NOFS); - extent_io_tree_init(&BTRFS_I(inode)->io_tree, -inode->i_mapping, GFP_NOFS); - extent_io_tree_init(&BTRFS_I(inode)->io_failure_tree, -inode->i_mapping, GFP_NOFS); - INIT_LIST_HEAD(&BTRFS_I(inode)->delalloc_inodes); - INIT_LIST_HEAD(&BTRFS_I(inode)->ordered_operations); - RB_CLEAR_NODE(&BTRFS_I(inode)->rb_node); - btrfs_ordered_inode_tree_init(&BTRFS_I(inode)->ordered_tree); - mutex_init(&BTRFS_I(inode)->log_mutex); -} - static int btrfs_init_locked_inode(struct inode *inode, void *p) { struct btrfs_iget_args *args = p; inode->i_ino = args->ino; - init_btrfs_i(inode); BTRFS_I(inode)->root = args->root; btrfs_set_inode_space_info(args->root, inode); return 0; @@ -3691,8 +3661,6 @@ static struct inode *new_simple_dir(stru if (!inode) return ERR_PTR(-ENOMEM); - init_btrfs_i(inode); - BTRFS_I(inode)->root = root; memcpy(&BTRFS_I(inode)->location, key, sizeof(*key)); BTRFS_I(inode)->dummy_inode = 1; @@ -4091,7 +4059,6 @@ static struct inode *btrfs_new_inode(str * btrfs_get_inode_index_count has an explanation for the magic * number */ - init_btrfs_i(inode); BTRFS_I(inode)->index_cnt = 2; BTRFS_I(inode)->root = root; BTRFS_I(inode)->generation = trans->transid; @@ -5262,21 +5229,46 @@ unsigned long 
btrfs_force_ra(struct addr struct inode *btrfs_alloc_inode(struct super_block *sb) { struct btrfs_inode *ei; + struct inode *inode; ei = kmem_cache_alloc(btrfs_inode_cachep, GFP_NOFS); if (!ei) return NULL; + + ei->root = NULL; + ei->space_info = NULL; + ei->generation = 0; + ei->sequence = 0; ei->last_trans = 0; ei->last_sub_trans = 0; ei->logged_trans = 0; + ei->delalloc_bytes = 0; + ei->reserved_bytes = 0; + ei->disk_i_size = 0; + ei->flags = 0; + ei->index_cnt = (u64)-1; + ei->last_unlink_trans = 0; + + spin_lock_init(&ei->accounting_lock); ei->outstanding_extents = 0; ei->reserved_extents = 0; - ei->root = NULL; - spin_lock_init(&ei->accounting_lock); + + ei->ordered_data_close = 0; + ei->dummy_inode = 0; + ei->force_compress = 0; + + inode = &ei->vfs_inode; + extent_map_tree_init(&ei->extent_tree, GFP_NOFS); + extent_io_tree_init(&ei->io_tree, &inode->i_data, GFP_NOFS); + extent_io_tree_init(&ei->io_failure_tree, &inode->i_data, GFP_NOFS); + mutex_init(&ei->log_mutex); btrfs_ordered_inode_tree_init(&ei->ordered_tree); INIT_LIST_HEAD(&ei->i_orphan); + INIT_LIST_HEAD(&ei->delalloc_inodes); INIT_LIST_HEAD(&ei->ordered_operations); - return &ei->vfs_inode; + RB_CLEAR_NODE(&ei->rb_node); + + return inode; } void btrfs_destroy_inode(struct inode *inode)
[PATCH V2 03/12] Btrfs: Shrink delay allocated space in a synchronized way
Shrink delay allocated space in a synchronized manner is more controllable than flushing all delay allocated space in an async thread. Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:24:27.895089314 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:24:27.899105313 +0800 @@ -699,10 +699,6 @@ struct btrfs_space_info { struct list_head list; - /* for controlling how we free up space for allocations */ - wait_queue_head_t flush_wait; - int flushing; - /* for block groups in our same type */ struct list_head block_groups[BTRFS_NR_RAID_TYPES]; spinlock_t lock; @@ -927,7 +923,6 @@ struct btrfs_fs_info { struct btrfs_workers endio_meta_write_workers; struct btrfs_workers endio_write_workers; struct btrfs_workers submit_workers; - struct btrfs_workers enospc_workers; /* * fixup workers take dirty pages that didn't properly go through * the cow mechanism and make them safe to write. It happens @@ -2311,6 +2306,7 @@ int btrfs_truncate_inode_items(struct bt u32 min_type); int btrfs_start_delalloc_inodes(struct btrfs_root *root, int delay_iput); +int btrfs_start_one_delalloc_inode(struct btrfs_root *root, int delay_iput); int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, struct extent_state **cached_state); int btrfs_writepages(struct address_space *mapping, diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:24:27.881831438 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:24:27.900080102 +0800 @@ -1768,9 +1768,6 @@ struct btrfs_root *open_ctree(struct sup min_t(u64, fs_devices->num_devices, fs_info->thread_pool_size), &fs_info->generic_worker); - btrfs_init_workers(&fs_info->enospc_workers, "enospc", - fs_info->thread_pool_size, - &fs_info->generic_worker); /* a higher idle thresh on the submit workers makes it much more * likely that bios will be send down in a sane order to the @@ -1818,7 +1815,6 @@ struct btrfs_root *open_ctree(struct sup 
btrfs_start_workers(&fs_info->endio_meta_workers, 1); btrfs_start_workers(&fs_info->endio_meta_write_workers, 1); btrfs_start_workers(&fs_info->endio_write_workers, 1); - btrfs_start_workers(&fs_info->enospc_workers, 1); fs_info->bdi.ra_pages *= btrfs_super_num_devices(disk_super); fs_info->bdi.ra_pages = max(fs_info->bdi.ra_pages, @@ -2049,7 +2045,6 @@ fail_sb_buffer: btrfs_stop_workers(&fs_info->endio_meta_write_workers); btrfs_stop_workers(&fs_info->endio_write_workers); btrfs_stop_workers(&fs_info->submit_workers); - btrfs_stop_workers(&fs_info->enospc_workers); fail_iput: invalidate_inode_pages2(fs_info->btree_inode->i_mapping); iput(fs_info->btree_inode); @@ -2482,7 +2477,6 @@ int close_ctree(struct btrfs_root *root) btrfs_stop_workers(&fs_info->endio_meta_write_workers); btrfs_stop_workers(&fs_info->endio_write_workers); btrfs_stop_workers(&fs_info->submit_workers); - btrfs_stop_workers(&fs_info->enospc_workers); btrfs_close_devices(fs_info->fs_devices); btrfs_mapping_tree_free(&fs_info->mapping_tree); diff -urp 2/fs/btrfs/extent-tree.c 3/fs/btrfs/extent-tree.c --- 2/fs/btrfs/extent-tree.c2010-04-26 17:24:27.896099931 +0800 +++ 3/fs/btrfs/extent-tree.c2010-04-26 17:24:27.913079910 +0800 @@ -73,6 +73,9 @@ static void dump_space_info(struct btrfs static int maybe_allocate_chunk(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_space_info *sinfo, u64 num_bytes); +static int shrink_delalloc(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_space_info *sinfo, u64 to_reclaim); static noinline int block_group_cache_done(struct btrfs_block_group_cache *cache) @@ -2692,7 +2695,6 @@ static int update_space_info(struct btrf for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) INIT_LIST_HEAD(&found->block_groups[i]); init_rwsem(&found->groups_sem); - init_waitqueue_head(&found->flush_wait); spin_lock_init(&found->lock); found->flags = flags & (BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_SYSTEM | @@ -2906,105 +2908,6 @@ static void 
check_force_delalloc(struct meta_sinfo->force_delalloc = 0; } -struct async_flush { - struct btrfs_root *root; - struct btrfs_space_info *info; - struct btrfs_work work; -}; - -static noinline void flush_delalloc_async(struct btrfs_work *work) -{ - struct async_flush *async
[PATCH V2 08/12] Btrfs: Introduce global metadata reservation
Reserve metadata space for extent tree, checksum tree and root tree Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:27:31.644829469 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:27:31.648830941 +0800 @@ -682,21 +682,15 @@ struct btrfs_space_info { u64 bytes_reserved; /* total bytes the allocator has reserved for current allocations */ u64 bytes_readonly; /* total bytes that are read only */ - u64 bytes_super;/* total bytes reserved for the super blocks */ - u64 bytes_root; /* the number of bytes needed to commit a - transaction */ + u64 bytes_may_use; /* number of bytes that may be used for delalloc/allocations */ - u64 bytes_delalloc; /* number of bytes currently reserved for - delayed allocation */ u64 disk_used; /* total bytes used on disk */ int full; /* indicates that we cannot allocate any more chunks for this space */ int force_alloc;/* set if we need to force a chunk alloc for this space */ - int force_delalloc; /* make people start doing filemap_flush until - we're under a threshold */ struct list_head list; diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:27:31.638850832 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:31.649830174 +0800 @@ -1472,10 +1472,6 @@ static int cleaner_kthread(void *arg) struct btrfs_root *root = arg; do { - smp_mb(); - if (root->fs_info->closing) - break; - vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE); if (!(root->fs_info->sb->s_flags & MS_RDONLY) && @@ -1488,11 +1484,9 @@ static int cleaner_kthread(void *arg) if (freezing(current)) { refrigerator(); } else { - smp_mb(); - if (root->fs_info->closing) - break; set_current_state(TASK_INTERRUPTIBLE); - schedule(); + if (!kthread_should_stop()) + schedule(); __set_current_state(TASK_RUNNING); } } while (!kthread_should_stop()); @@ -1504,36 +1498,39 @@ static int transaction_kthread(void *arg struct btrfs_root *root = arg; struct btrfs_trans_handle *trans; struct 
btrfs_transaction *cur; + u64 transid; unsigned long now; unsigned long delay; int ret; do { - smp_mb(); - if (root->fs_info->closing) - break; - delay = HZ * 30; vfs_check_frozen(root->fs_info->sb, SB_FREEZE_WRITE); - mutex_lock(&root->fs_info->transaction_kthread_mutex); - mutex_lock(&root->fs_info->trans_mutex); + spin_lock(&root->fs_info->new_trans_lock); cur = root->fs_info->running_transaction; if (!cur) { - mutex_unlock(&root->fs_info->trans_mutex); + spin_unlock(&root->fs_info->new_trans_lock); goto sleep; } now = get_seconds(); - if (now < cur->start_time || now - cur->start_time < 30) { - mutex_unlock(&root->fs_info->trans_mutex); + if (!cur->blocked && + (now < cur->start_time || now - cur->start_time < 30)) { + spin_unlock(&root->fs_info->new_trans_lock); delay = HZ * 5; goto sleep; } - mutex_unlock(&root->fs_info->trans_mutex); - trans = btrfs_join_transaction(root, 1); - ret = btrfs_commit_transaction(trans, root); + transid = cur->transid; + spin_unlock(&root->fs_info->new_trans_lock); + trans = btrfs_join_transaction(root, 1); + if (transid == trans->transid) { + ret = btrfs_commit_transaction(trans, root); + BUG_ON(ret); + } else { + btrfs_end_transaction(trans, root); + } sleep: wake_up_process(root->fs_info->cleaner_kthread); mutex_unlock(&root->fs_info->transaction_kthread_mutex); @@ -1541,10 +1538,10 @@ sleep: if (freezing(current)) { refrigerator(); } else { - if (root->fs_info->closing) -
[PATCH V2 07/12] Btrfs: Update metadata reservation for delayed allocation
Introduce metadata reservation context for delayed allocation and update various related functions. This patch also introduces EXTENT_FIRST_DELALLOC control bit for set/clear_extent_bit. It tells set/clear_bit_hook whether they are processing the first extent_state with EXTENT_DELALLOC bit set. This change is important if set/clear_extent_bit involves multiple extent_state. Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h --- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.450105767 +0800 +++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:26:55.456080004 +0800 @@ -137,8 +137,8 @@ struct btrfs_inode { * of extent items we've reserved metadata for. */ spinlock_t accounting_lock; + atomic_t outstanding_extents; int reserved_extents; - int outstanding_extents; /* * ordered_data_close is set by truncate when a file that used diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:26:55.451104861 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:26:55.457079656 +0800 @@ -2078,19 +2078,8 @@ int btrfs_remove_block_group(struct btrf u64 btrfs_reduce_alloc_profile(struct btrfs_root *root, u64 flags); void btrfs_set_inode_space_info(struct btrfs_root *root, struct inode *ionde); void btrfs_clear_space_info_full(struct btrfs_fs_info *info); - -int btrfs_unreserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_reserve_metadata_for_delalloc(struct btrfs_root *root, - struct inode *inode, int num_items); -int btrfs_check_data_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); -void btrfs_free_reserved_data_space(struct btrfs_root *root, - struct inode *inode, u64 bytes); -void btrfs_delalloc_reserve_space(struct btrfs_root *root, struct inode *inode, -u64 bytes); -void btrfs_delalloc_free_space(struct btrfs_root *root, struct inode *inode, - u64 bytes); +int btrfs_check_data_free_space(struct inode *inode, u64 bytes); +void btrfs_free_reserved_data_space(struct 
inode *inode, u64 bytes); int btrfs_trans_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root, int num_items, int *retries); @@ -2098,6 +2087,10 @@ void btrfs_trans_release_metadata(struct struct btrfs_root *root); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); +int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes); +int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes); +void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes); void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root); void btrfs_free_block_rsv(struct btrfs_root *root, diff -urp 2/fs/btrfs/extent_io.c 3/fs/btrfs/extent_io.c --- 2/fs/btrfs/extent_io.c 2010-04-26 17:26:55.447090049 +0800 +++ 3/fs/btrfs/extent_io.c 2010-04-26 17:26:55.458079658 +0800 @@ -336,21 +336,18 @@ static int merge_state(struct extent_io_ } static int set_state_cb(struct extent_io_tree *tree, -struct extent_state *state, -unsigned long bits) +struct extent_state *state, int *bits) { if (tree->ops && tree->ops->set_bit_hook) { return tree->ops->set_bit_hook(tree->mapping->host, - state->start, state->end, - state->state, bits); + state, bits); } return 0; } static void clear_state_cb(struct extent_io_tree *tree, - struct extent_state *state, - unsigned long bits) + struct extent_state *state, int *bits) { if (tree->ops && tree->ops->clear_bit_hook) tree->ops->clear_bit_hook(tree->mapping->host, state, bits); @@ -368,9 +365,10 @@ static void clear_state_cb(struct extent */ static int insert_state(struct extent_io_tree *tree, struct extent_state *state, u64 start, u64 end, - int bits) + int *bits) { struct rb_node *node; + int bits_to_set = *bits & ~EXTENT_CTLBITS; int ret; if (end < start) { @@ -385,9 +383,9 @@ static int insert_state(struct extent_io if (ret) 
return ret; - if (bits & EXTENT_DIRTY) + if (bits_to_set & EXTENT_DIRTY)
[PATCH V2 09/12] Btrfs: Metadata reservation for orphan inodes
reserve metadata space for handling orphan inodes Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/btrfs_inode.h 3/fs/btrfs/btrfs_inode.h --- 2/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.113080051 +0800 +++ 3/fs/btrfs/btrfs_inode.h2010-04-26 17:27:52.118079430 +0800 @@ -151,6 +151,7 @@ struct btrfs_inode { * of these. */ unsigned ordered_data_close:1; + unsigned orphan_meta_reserved:1; unsigned dummy_inode:1; /* diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h --- 2/fs/btrfs/ctree.h 2010-04-26 17:27:52.114079844 +0800 +++ 3/fs/btrfs/ctree.h 2010-04-26 17:27:52.119079920 +0800 @@ -1068,7 +1068,6 @@ struct btrfs_root { int ref_cows; int track_dirty; int in_radix; - int clean_orphans; u64 defrag_trans_start; struct btrfs_key defrag_progress; @@ -1082,8 +1081,11 @@ struct btrfs_root { struct list_head root_list; - spinlock_t list_lock; + spinlock_t orphan_lock; struct list_head orphan_list; + struct btrfs_block_rsv *orphan_block_rsv; + int orphan_item_inserted; + int orphan_cleanup_state; spinlock_t inode_lock; /* red-black tree that keeps track of in-memory inodes */ @@ -2079,6 +2081,9 @@ int btrfs_trans_reserve_metadata(struct int num_items, int *retries); void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root); +int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans, + struct inode *inode); +void btrfs_orphan_release_metadata(struct inode *inode); int btrfs_snap_reserve_metadata(struct btrfs_trans_handle *trans, struct btrfs_pending_snapshot *pending); int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes); @@ -2403,6 +2408,13 @@ int btrfs_update_inode(struct btrfs_tran int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode); int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode); void btrfs_orphan_cleanup(struct btrfs_root *root); +void btrfs_orphan_pre_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending, + u64 
*bytes_to_reserve); +void btrfs_orphan_post_snapshot(struct btrfs_trans_handle *trans, + struct btrfs_pending_snapshot *pending); +void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans, + struct btrfs_root *root); int btrfs_cont_expand(struct inode *inode, loff_t size); int btrfs_invalidate_inodes(struct btrfs_root *root); void btrfs_add_delayed_iput(struct inode *inode); diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:27:52.105081158 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:27:52.120080690 +0800 @@ -895,7 +895,8 @@ static int __setup_root(u32 nodesize, u3 root->ref_cows = 0; root->track_dirty = 0; root->in_radix = 0; - root->clean_orphans = 0; + root->orphan_item_inserted = 0; + root->orphan_cleanup_state = 0; root->fs_info = fs_info; root->objectid = objectid; @@ -905,12 +906,13 @@ static int __setup_root(u32 nodesize, u3 root->in_sysfs = 0; root->inode_tree = RB_ROOT; root->block_rsv = NULL; + root->orphan_block_rsv = NULL; INIT_LIST_HEAD(&root->dirty_list); INIT_LIST_HEAD(&root->orphan_list); INIT_LIST_HEAD(&root->root_list); spin_lock_init(&root->node_lock); - spin_lock_init(&root->list_lock); + spin_lock_init(&root->orphan_lock); spin_lock_init(&root->inode_lock); spin_lock_init(&root->accounting_lock); mutex_init(&root->objectid_mutex); @@ -1194,19 +1196,23 @@ again: if (root) return root; - ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid); - if (ret == 0) - ret = -ENOENT; - if (ret < 0) - return ERR_PTR(ret); - root = btrfs_read_fs_root_no_radix(fs_info->tree_root, location); if (IS_ERR(root)) return root; - WARN_ON(btrfs_root_refs(&root->root_item) == 0); set_anon_super(&root->anon_super, NULL); + if (btrfs_root_refs(&root->root_item) == 0) { + ret = -ENOENT; + goto fail; + } + + ret = btrfs_find_orphan_item(fs_info->tree_root, location->objectid); + if (ret < 0) + goto fail; + if (ret == 0) + root->orphan_item_inserted = 1; + ret = radix_tree_preload(GFP_NOFS & ~__GFP_HIGHMEM); if 
(ret) goto fail; @@ -1215,10 +1221,9 @@ again: ret = radix_tree_insert(&fs_info->fs_roots_radix, (unsigned long)root->root_key.ob
[PATCH V2 10/12] Btrfs: Metadata ENOSPC handling for tree log
Previous patches make the allocator return -ENOSPC if there is no unreserved free metadata space. This patch updates the tree log code and various other places to propagate/handle the ENOSPC error. Signed-off-by: Yan Zheng --- diff -urp 2/fs/btrfs/disk-io.c 3/fs/btrfs/disk-io.c --- 2/fs/btrfs/disk-io.c2010-04-26 17:28:05.496079922 +0800 +++ 3/fs/btrfs/disk-io.c2010-04-26 17:28:05.506079726 +0800 @@ -973,42 +973,6 @@ static int find_and_setup_root(struct bt return 0; } -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info) -{ - struct extent_buffer *eb; - struct btrfs_root *log_root_tree = fs_info->log_root_tree; - u64 start = 0; - u64 end = 0; - int ret; - - if (!log_root_tree) - return 0; - - while (1) { - ret = find_first_extent_bit(&log_root_tree->dirty_log_pages, - 0, &start, &end, EXTENT_DIRTY | EXTENT_NEW); - if (ret) - break; - - clear_extent_bits(&log_root_tree->dirty_log_pages, start, end, - EXTENT_DIRTY | EXTENT_NEW, GFP_NOFS); - } - eb = fs_info->log_root_tree->node; - - WARN_ON(btrfs_header_level(eb) != 0); - WARN_ON(btrfs_header_nritems(eb) != 0); - - ret = btrfs_free_reserved_extent(fs_info->tree_root, - eb->start, eb->len); - BUG_ON(ret); - - free_extent_buffer(eb); - kfree(fs_info->log_root_tree); - fs_info->log_root_tree = NULL; - return 0; -} - static struct btrfs_root *alloc_log_tree(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info) { diff -urp 2/fs/btrfs/disk-io.h 3/fs/btrfs/disk-io.h --- 2/fs/btrfs/disk-io.h2010-04-26 17:28:05.495079921 +0800 +++ 3/fs/btrfs/disk-io.h2010-04-26 17:28:05.507080566 +0800 @@ -95,8 +95,6 @@ int btrfs_congested_async(struct btrfs_f unsigned long btrfs_async_submit_limit(struct btrfs_fs_info *info); int btrfs_write_tree_block(struct extent_buffer *buf); int btrfs_wait_tree_block_writeback(struct extent_buffer *buf); -int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, -struct btrfs_fs_info *fs_info); int btrfs_init_log_root_tree(struct btrfs_trans_handle
*trans, struct btrfs_fs_info *fs_info); int btrfs_add_log_tree(struct btrfs_trans_handle *trans, diff -urp 2/fs/btrfs/file-item.c 3/fs/btrfs/file-item.c --- 2/fs/btrfs/file-item.c 2010-04-26 17:28:05.503100326 +0800 +++ 3/fs/btrfs/file-item.c 2010-04-26 17:28:05.507080566 +0800 @@ -656,6 +656,9 @@ again: goto found; } ret = PTR_ERR(item); + if (ret != -EFBIG && ret != -ENOENT) + goto fail_unlock; + if (ret == -EFBIG) { u32 item_size; /* we found one, but it isn't big enough yet */ diff -urp 2/fs/btrfs/tree-log.c 3/fs/btrfs/tree-log.c --- 2/fs/btrfs/tree-log.c 2010-04-26 17:28:05.498105836 +0800 +++ 3/fs/btrfs/tree-log.c 2010-04-26 17:28:05.509079730 +0800 @@ -134,6 +134,7 @@ static int start_log_trans(struct btrfs_ struct btrfs_root *root) { int ret; + int err = 0; mutex_lock(&root->log_mutex); if (root->log_root) { @@ -154,17 +155,19 @@ static int start_log_trans(struct btrfs_ mutex_lock(&root->fs_info->tree_log_mutex); if (!root->fs_info->log_root_tree) { ret = btrfs_init_log_root_tree(trans, root->fs_info); - BUG_ON(ret); + if (ret) + err = ret; } - if (!root->log_root) { + if (err == 0 && !root->log_root) { ret = btrfs_add_log_tree(trans, root); - BUG_ON(ret); + if (ret) + err = ret; } mutex_unlock(&root->fs_info->tree_log_mutex); root->log_batch++; atomic_inc(&root->log_writers); mutex_unlock(&root->log_mutex); - return 0; + return err; } /* @@ -375,7 +378,7 @@ insert: BUG_ON(ret); } } else if (ret) { - BUG(); + return ret; } dst_ptr = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]); @@ -1698,9 +1701,9 @@ static noinline int walk_down_log_tree(s next = btrfs_find_create_tree_block(root, bytenr, blocksize); - wc->process_func(root, next, wc, ptr_gen); - if (*level == 1) { + wc->process_func(root, next, wc, ptr_gen); + path->slots[*level]++; if (wc->free) { btrfs_read_buffer(next, ptr_gen); @@ -1733,35 +1736,7 @@ static noinlin
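The pattern the patch applies in start_log_trans() -- record the first failure in err instead of BUG_ON(), skip the remaining setup steps, and hand the error back to the caller -- can be sketched in plain userspace C. The helper names below are hypothetical stand-ins, not btrfs code:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for btrfs_init_log_root_tree() and
 * btrfs_add_log_tree(): each returns 0 on success or -ENOSPC on failure. */
static int init_log_root_tree(int fail) { return fail ? -ENOSPC : 0; }
static int add_log_tree(int fail)       { return fail ? -ENOSPC : 0; }

/* Mirrors the control flow of the patched start_log_trans(): record the
 * first error instead of BUG_ON(), skip later steps once an error has been
 * seen, and return it. */
static int start_log_trans_sketch(int fail_init, int fail_add)
{
	int ret;
	int err = 0;

	ret = init_log_root_tree(fail_init);
	if (ret)
		err = ret;

	if (err == 0) {
		ret = add_log_tree(fail_add);
		if (ret)
			err = ret;
	}
	return err;	/* 0 on success, -ENOSPC propagated to the caller */
}
```

The key design point is that err remembers the first failure while the locking/unlocking around the sequence (elided here) still runs to completion.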
[PATCH V2 11/12] Btrfs: Pre-allocate space for data relocation
Pre-allocate space for data relocation. This can detect an ENOSPC condition caused by fragmentation of free space.

Signed-off-by: Yan Zheng
---
diff -urp 2/fs/btrfs/ctree.h 3/fs/btrfs/ctree.h
--- 2/fs/btrfs/ctree.h	2010-04-26 17:28:20.493839748 +0800
+++ 3/fs/btrfs/ctree.h	2010-04-26 17:28:20.498830465 +0800
@@ -2419,6 +2419,9 @@ int btrfs_cont_expand(struct inode *inod
 int btrfs_invalidate_inodes(struct btrfs_root *root);
 void btrfs_add_delayed_iput(struct inode *inode);
 void btrfs_run_delayed_iputs(struct btrfs_root *root);
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+			      u64 start, u64 num_bytes, u64 min_size,
+			      loff_t actual_len, u64 *alloc_hint);
 extern const struct dentry_operations btrfs_dentry_operations;
 
 /* ioctl.c */
diff -urp 2/fs/btrfs/inode.c 3/fs/btrfs/inode.c
--- 2/fs/btrfs/inode.c	2010-04-26 17:28:20.489839672 +0800
+++ 3/fs/btrfs/inode.c	2010-04-26 17:28:20.500829420 +0800
@@ -1174,6 +1174,13 @@ out_check:
 					       num_bytes, num_bytes, type);
 		BUG_ON(ret);
 
+		if (root->root_key.objectid ==
+		    BTRFS_DATA_RELOC_TREE_OBJECTID) {
+			ret = btrfs_reloc_clone_csums(inode, cur_offset,
+						      num_bytes);
+			BUG_ON(ret);
+		}
+
 		extent_clear_unlock_delalloc(inode, &BTRFS_I(inode)->io_tree,
 				cur_offset, cur_offset + num_bytes - 1,
 				locked_page, EXTENT_CLEAR_UNLOCK_PAGE |
@@ -6079,16 +6086,15 @@ out_unlock:
 	return err;
 }
 
-static int prealloc_file_range(struct inode *inode, u64 start, u64 end,
-			       u64 alloc_hint, int mode, loff_t actual_len)
+int btrfs_prealloc_file_range(struct inode *inode, int mode,
+			      u64 start, u64 num_bytes, u64 min_size,
+			      loff_t actual_len, u64 *alloc_hint)
 {
 	struct btrfs_trans_handle *trans;
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_key ins;
 	u64 cur_offset = start;
-	u64 num_bytes = end - start;
 	int ret = 0;
-	u64 i_size;
 
 	while (num_bytes > 0) {
 		trans = btrfs_start_transaction(root, 3);
@@ -6097,9 +6103,8 @@ static int prealloc_file_range(struct in
 			break;
 		}
 
-		ret = btrfs_reserve_extent(trans, root, num_bytes,
-					   root->sectorsize, 0, alloc_hint,
-					   (u64)-1, &ins, 1);
+		ret = btrfs_reserve_extent(trans, root, num_bytes, min_size,
+					   0, *alloc_hint, (u64)-1, &ins, 1);
 		if (ret) {
 			btrfs_end_transaction(trans, root);
 			break;
@@ -6116,20 +6121,19 @@ static int prealloc_file_range(struct in
 
 		num_bytes -= ins.offset;
 		cur_offset += ins.offset;
-		alloc_hint = ins.objectid + ins.offset;
+		*alloc_hint = ins.objectid + ins.offset;
 
 		inode->i_ctime = CURRENT_TIME;
 		BTRFS_I(inode)->flags |= BTRFS_INODE_PREALLOC;
 		if (!(mode & FALLOC_FL_KEEP_SIZE) &&
 		    (actual_len > inode->i_size) &&
 		    (cur_offset > inode->i_size)) {
-
 			if (cur_offset > actual_len)
-				i_size = actual_len;
+				i_size_write(inode, actual_len);
 			else
-				i_size = cur_offset;
-			i_size_write(inode, i_size);
-			btrfs_ordered_update_i_size(inode, i_size, NULL);
+				i_size_write(inode, cur_offset);
+			i_size_write(inode, cur_offset);
+			btrfs_ordered_update_i_size(inode, cur_offset, NULL);
 		}
 
 		ret = btrfs_update_inode(trans, root, inode);
@@ -6215,16 +6219,16 @@ static long btrfs_fallocate(struct inode
 		if (em->block_start == EXTENT_MAP_HOLE ||
 		    (cur_offset >= inode->i_size &&
 		     !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-			ret = prealloc_file_range(inode,
-						  cur_offset, last_byte,
-						  alloc_hint, mode, offset+len);
+			ret = btrfs_prealloc_file_range(inode, 0, cur_offset,
+					last_byte - cur_offset,
+					1 << inode->i_blkbits,
+
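The shape of the pre-allocation loop -- keep reserving extents of at least min_size until num_bytes is covered, and surface ENOSPC when free space is too fragmented to supply even min_size -- can be sketched in userspace. The allocator below is a toy stand-in for btrfs_reserve_extent(), not the real one:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical allocator: hands out at most 'largest_free' contiguous
 * bytes per call; fails if even 'min_size' cannot be satisfied. Stands in
 * for btrfs_reserve_extent() to illustrate the loop's shape only. */
static int reserve_extent(uint64_t want, uint64_t min_size,
			  uint64_t largest_free, uint64_t *got)
{
	if (largest_free < min_size)
		return -ENOSPC;		/* free space too fragmented */
	*got = want < largest_free ? want : largest_free;
	return 0;
}

/* Mirrors the btrfs_prealloc_file_range() loop: keep reserving until
 * num_bytes is covered, accepting pieces as small as min_size. */
static int prealloc_sketch(uint64_t num_bytes, uint64_t min_size,
			   uint64_t largest_free)
{
	while (num_bytes > 0) {
		uint64_t got;
		int ret = reserve_extent(num_bytes, min_size,
					 largest_free, &got);
		if (ret)
			return ret;	/* ENOSPC surfaces to the caller */
		num_bytes -= got;
	}
	return 0;
}
```

Because each reservation may be smaller than what remains, the ENOSPC caused by fragmentation is detected up front, at pre-allocation time, rather than in the middle of relocation.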
Re: NFS mount attempts hangs with btrfs on server side
On Wednesday 2010-04-21 12:37, Manio wrote:
> On 2010-04-21 12:29, Jan Engelhardt wrote:
>> I'd rather have a real description rather than "stuff like that".
>> Since I am not on debian, "old nfs-utils" would not happen --
>> rpcbind-0.1.6+git20080930 and nfs-kernel-server-1.1.3
>> should be very much recent enough.
>> Especially since exportability of filesystems is usually not
>> so much a userspace thing.
>
> Unfortunately I can't tell you which package exactly was causing the
> problem.

So I bisected it. The result:

commit 3a340251597a5b0c579c31d8caf9aa3b53a77016
Author: David Woodhouse
Date:   Thu Aug 28 11:05:17 2008 -0400

    Use fsid from statfs for UUID if blkid can't cope (or not used)

    Signed-off-by: David Woodhouse
    Signed-off-by: Steve Dickson

This is the commit that enables exporting btrfs mounts. Its message hints at blkid being unable to get the UUID, yet the standalone /sbin/blkid tool does report one:

/dev/loop2: UUID="e19fe89b-cde3-4ccc-bc70-b759a57bd1c9" UUID_SUB="f29c6218-d040-4546-a227-4dd2d2142817" TYPE="btrfs"

So what went wrong here?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
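For reference, the fallback that commit describes is the filesystem ID the kernel reports through statfs(). A minimal sketch of reading it via the closely related POSIX statvfs() call (plain userspace code, not nfs-utils):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/statvfs.h>

/* Prints the filesystem ID the kernel reports for 'path' -- the same kind
 * of fallback identifier nfs-utils takes from statfs() when blkid yields
 * no UUID. Returns 0 on success, -1 on failure. */
static int print_fsid(const char *path)
{
	struct statvfs st;

	if (statvfs(path, &st) != 0)
		return -1;
	printf("%s: fsid=%#lx\n", path, st.f_fsid);
	return 0;
}
```

If blkid does return a UUID, as shown above for /dev/loop2, this statfs-derived fsid should never be needed -- which is what makes the observed behaviour surprising.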
Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
On Apr 26, 2010, at 6:18 AM, KOSAK wrote:

> AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs case
> (and rd later chose another way). It assumes that writepage refusals
> don't happen for the majority of pages; IOW, the VM assumes that even
> if this page can't be written out, many other pages can. The VM then
> only activates the page when AOP_WRITEPAGE_ACTIVATE is returned. But
> now ext4 and btrfs refuse all writepage() calls. (Right?)

No, not exactly. Btrfs refuses the writepage() in the direct reclaim case (i.e., if PF_MEMALLOC is set), but will do the writepage() in the case of zone scanning. I don't want to speak for Chris, but I assume it's due to stack depth concerns --- if it were just about fs recursion issues, I assume all of the btrfs allocations could be done GFP_NOFS.

Ext4 is slightly different; it refuses writepage() if the on-disk blocks for the page haven't yet been allocated, regardless of whether it's happening for direct reclaim or zone scanning. However, if the on-disk block has been assigned (i.e., this isn't a delalloc case), ext4 will honor the writepage() --- for example, if this is an mmap of an already existing file, or if the space has been preallocated using fallocate(). The reason for ext4's refusal is lock ordering, although I'm investigating whether I can fix this.

If we call set_page_writeback() to set PG_writeback (plus set the various bits of magic fs accounting), and then drop the page lock, does that protect us from random changes happening to the page (i.e., from vmtruncate, etc.)?

> IOW, I don't think that documentation supposes the delayed allocation
> issue ;)
>
> The point is, our dirty page accounting only accounts the
> per-system-memory dirty ratio and per-task dirty pages. It doesn't
> account a per-numa-node nor per-zone dirty ratio. Refusing writepage
> plus fake-numa abuse can therefore easily confuse our VM. If _all_
> pages in a VM LRU list (it's per-zone) are refused, page activation
> doesn't help. It also leads to OOM.
>
> And I'm sorry, I have to say to all VM developers that fake numa is
> not production quality yet. AFAIK, nobody has seriously tested our VM
> code in such an environment. (linux/arch/x86/Kconfig says "This is
> only useful for debugging".)

So I'm sorry I mentioned the fake numa bit, since I think this is a bit of a red herring. That code is in production here, and we've made all sorts of changes so it can be used for more than just debugging. So please ignore it; it's our local hack, and if it breaks, that's our problem.

More importantly, just two weeks ago I talked to someone in the financial sector who was testing out ext4 on an upstream kernel, not using our hacks that force 128MB zones, and he ran into the ext4/OOM problem. It involved Oracle pinning down 3G worth of pages, and him trying to do a huge streaming backup (which of course wasn't using fallocate or direct I/O) under ext4 --- and he had the same issue: an OOM, which I'm pretty sure was caused by ext4_writepage() refusing the writepage() while most of the pages not nailed down by Oracle were delalloc. The same test scenario using ext3 worked just fine, of course.

Under normal circumstances it's not a problem, since statistically there should be enough other pages in the system compared to the number of pages subject to delalloc that pages can usually get pushed out until the writeback code gets around to writing them. But in cases where the zones have been made artificially small, or a big program like Oracle is pinning down a large number of pages, then of course we have problems.

I'm trying to fix things from the file system side, which means trying to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which Documentation/filesystems/Locking describes as something which MUST be used if writepage() is going to refuse a page. And then I discovered that no one is actually using it. So that's why I was asking whether the Locking documentation is out of date, or whether all of the file systems are doing it wrong.

On a related example of how file system code isn't necessarily following what is required/recommended by the Locking documentation: ext2 and ext3 are both NOT using set_page_writeback()/end_page_writeback(), but rather keep the page locked until after they call block_write_full_page(), out of concern that a truncate could come in and screw things up. But looking at Locking now, it appears that set_page_writeback() is as good as the page lock for keeping the truncate code from coming in and screwing everything up? It's not clear to me exactly what guarantees set_page_writeback() provides against truncate. And suppose we are writing out a whole cluster of pages, say 4MB worth; do we need to call set_page_writeback() on every single page in the cluster before we do th
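As a toy model of the contract under discussion (semantics paraphrased from the thread and Documentation/filesystems/vfs.txt, not real vmscan code), the reclaim side's handling of a ->writepage return value looks roughly like this:

```c
#include <assert.h>

#define AOP_WRITEPAGE_ACTIVATE 0x80000	/* value from include/linux/fs.h */

enum page_fate { PAGE_WRITTEN, PAGE_ACTIVATED, PAGE_KEPT_INACTIVE };

/* Toy model of vmscan's pageout() handling of a ->writepage return code:
 * AOP_WRITEPAGE_ACTIVATE moves the page to the active list so reclaim
 * stops retrying it, while a plain "redirtied, return 0" leaves the page
 * churning on the inactive list -- the behaviour the thread argues leads
 * to wasted reclaim passes and, in the worst case, OOM. */
static enum page_fate pageout_model(int writepage_ret, int redirtied)
{
	if (writepage_ret == AOP_WRITEPAGE_ACTIVATE)
		return PAGE_ACTIVATED;		/* VM stops calling writepage */
	if (redirtied)
		return PAGE_KEPT_INACTIVE;	/* VM will retry this page */
	return PAGE_WRITTEN;
}
```

The distinction the model captures is exactly the one in question: redirty_page_for_writepage() plus return 0 keeps the page eligible for repeated reclaim attempts, whereas AOP_WRITEPAGE_ACTIVATE tells the VM to stop trying.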
Re: list subvolumes with new btrfs command
> I am using ubuntu-10.04-rc with a kernel compiled from almost the
> latest source; the btrfs-progs is the latest too.
>
> You can modify the line
>
>     fprintf(stderr, "ERROR: can't perform the search\n");
>
> to
>
>     fprintf(stderr, "ERROR: can't perform the search: %s\n",
>             strerror(errno));
>
> to see what actually happened.

nice:

$ sudo btrfs subvolume list /
ERROR: can't perform the search: Inappropriate ioctl for device

i'm not really familiar with C, or anything this low level; does this help you diagnose my problem?

thanks again for the help thus far,

C Anthony
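The message printed here is just strerror(ENOTTY). A self-contained sketch of the suggested strerror(errno) reporting pattern -- using an arbitrary bogus request number (0xdeadbeef, purely illustrative) in place of the real search ioctl -- reproduces the same error:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Reproduces the reported failure mode: an ioctl request the target file
 * doesn't support fails with ENOTTY, which strerror() renders as
 * "Inappropriate ioctl for device". Returns the captured errno (ENOTTY
 * here), 0 if the ioctl unexpectedly succeeds, or -1 on setup failure. */
static int search_ioctl_errno(void)
{
	FILE *f = tmpfile();	/* a plain file: supports no such ioctl */
	int saved = 0;

	if (!f)
		return -1;
	if (ioctl(fileno(f), 0xdeadbeef, 0) < 0) {
		saved = errno;
		fprintf(stderr, "ERROR: can't perform the search: %s\n",
			strerror(saved));
	}
	fclose(f);
	return saved;
}
```

In the report above the same ENOTTY means the kernel side never recognized the search ioctl at all, which points at a kernel/progs mismatch rather than bad arguments.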
Re: No one seems to be using AOP_WRITEPAGE_ACTIVATE?
On Mon, Apr 26, 2010 at 10:50:45AM -0400, Theodore Tso wrote:
> On Apr 26, 2010, at 6:18 AM, KOSAK wrote:
> > AOP_WRITEPAGE_ACTIVATE was introduced for the ramdisk and tmpfs case
> > (and rd later chose another way). It assumes that writepage refusals
> > don't happen for the majority of pages; IOW, the VM assumes that even
> > if this page can't be written out, many other pages can. The VM then
> > only activates the page when AOP_WRITEPAGE_ACTIVATE is returned. But
> > now ext4 and btrfs refuse all writepage() calls. (Right?)
>
> No, not exactly. Btrfs refuses the writepage() in the direct reclaim
> case (i.e., if PF_MEMALLOC is set), but will do the writepage() in the
> case of zone scanning. I don't want to speak for Chris, but I assume
> it's due to stack depth concerns --- if it were just about fs recursion
> issues, I assume all of the btrfs allocations could be done GFP_NOFS.

Btrfs refuses all PF_MEMALLOC writepage. It will go ahead and process a regular writepage, but in practice that never happens...everyone except a few internal btrfs callers uses writepages. I wish I had thought of stack depth back then, but really this was to keep kswapd out of the heavy work done by delalloc. From a locking point of view we're properly GFP_NOFS, so it's safe, but it just isn't a great way to use precious PF_MEMALLOC cycles.

> Ext4 is slightly different; it refuses writepage() if the on-disk
> blocks for the page haven't yet been allocated, regardless of whether
> it's happening for direct reclaim or zone scanning. However, if the
> on-disk block has been assigned (i.e., this isn't a delalloc case),
> ext4 will honor the writepage() --- for example, if this is an mmap of
> an already existing file, or if the space has been preallocated using
> fallocate(). The reason for ext4's refusal is lock ordering, although
> I'm investigating whether I can fix this.
>
> If we call set_page_writeback() to set PG_writeback (plus set the
> various bits of magic fs accounting), and then drop the page lock,
> does that protect us from random changes happening to the page (i.e.,
> from vmtruncate, etc.)?

PG_writeback will protect you from vmtruncate, but you may also want to have page_mkwrite wait for pages in flight.

> > IOW, I don't think that documentation supposes the delayed
> > allocation issue ;)
> >
> > The point is, our dirty page accounting only accounts the
> > per-system-memory dirty ratio and per-task dirty pages. It doesn't
> > account a per-numa-node nor per-zone dirty ratio. Refusing writepage
> > plus fake-numa abuse can therefore easily confuse our VM. If _all_
> > pages in a VM LRU list (it's per-zone) are refused, page activation
> > doesn't help. It also leads to OOM.
> >
> > And I'm sorry, I have to say to all VM developers that fake numa is
> > not production quality yet. AFAIK, nobody has seriously tested our
> > VM code in such an environment. (linux/arch/x86/Kconfig says "This
> > is only useful for debugging".)
>
> So I'm sorry I mentioned the fake numa bit, since I think this is a
> bit of a red herring. That code is in production here, and we've made
> all sorts of changes so it can be used for more than just debugging.
> So please ignore it; it's our local hack, and if it breaks, that's our
> problem.
>
> More importantly, just two weeks ago I talked to someone in the
> financial sector who was testing out ext4 on an upstream kernel, not
> using our hacks that force 128MB zones, and he ran into the ext4/OOM
> problem. It involved Oracle pinning down 3G worth of pages, and him
> trying to do a huge streaming backup (which of course wasn't using
> fallocate or direct I/O) under ext4 --- and he had the same issue: an
> OOM, which I'm pretty sure was caused by ext4_writepage() refusing the
> writepage() while most of the pages not nailed down by Oracle were
> delalloc. The same test scenario using ext3 worked just fine, of
> course.
>
> Under normal circumstances it's not a problem, since statistically
> there should be enough other pages in the system compared to the
> number of pages subject to delalloc that pages can usually get pushed
> out until the writeback code gets around to writing them. But in
> cases where the zones have been made artificially small, or a big
> program like Oracle is pinning down a large number of pages, then of
> course we have problems.
>
> I'm trying to fix things from the file system side, which means trying
> to understand magic flags like AOP_WRITEPAGE_ACTIVATE, which
> Documentation/filesystems/Locking describes as something which MUST be
> used if writepage() is going to refuse a page. And then I discovered
> that no one is actually using it. So that's why I was asking whether
> the Locking documentation is out of date, or whether all of the file
> systems are doing it wrong.
>
> On a related example of how file system code isn't necessarily
> followin
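The policy Chris describes -- btrfs bails out of ->writepage for any task in direct reclaim and only does the heavy delalloc work for ordinary writeback -- reduces to a single check on the task flags. A toy userspace model of that decision (not the actual btrfs ->writepage):

```c
#include <assert.h>

#define PF_MEMALLOC 0x00000800	/* task flag, value from include/linux/sched.h */

/* Toy model of the btrfs policy described above: refuse ->writepage for
 * any task in direct reclaim (PF_MEMALLOC set), redirtying the page, and
 * proceed only for ordinary writeback. Returns 1 if the page would be
 * written, 0 if refused. Not real kernel code. */
static int btrfs_writepage_policy(unsigned int task_flags)
{
	if (task_flags & PF_MEMALLOC)
		return 0;	/* refused: keep reclaim out of delalloc work */
	return 1;		/* normal writeback path proceeds */
}
```

The cost of this simplicity is exactly the thread's complaint: the refusal is signalled by redirtying rather than by AOP_WRITEPAGE_ACTIVATE, so reclaim has no way to know it should stop retrying the page.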
Re: list subvolumes with new btrfs command
On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
> > I am using ubuntu-10.04-rc with a kernel compiled from almost the
> > latest source; the btrfs-progs is the latest too.
> >
> > You can modify the line
> >
> >     fprintf(stderr, "ERROR: can't perform the search\n");
> >
> > to
> >
> >     fprintf(stderr, "ERROR: can't perform the search: %s\n",
> >             strerror(errno));
> >
> > to see what actually happened.
>
> nice:
>
> $ sudo btrfs subvolume list /
> ERROR: can't perform the search: Inappropriate ioctl for device
>
> i'm not really familiar with C, or anything this low level, does this
> help you diagnose my problem?

Have you tried to run it on the device with the btrfs, not the mount point?

It looks like the ioctl was made too restrictive about its arguments.
--
Hubert Kario
QBS - Quality Business Software
02-656 Warszawa, ul. Ksawerów 30/85
tel. +48 (22) 646-61-51, 646-74-24
www.qbs.com.pl
System Zarządzania Jakością zgodny z normą ISO 9001:2000
Re: list subvolumes with new btrfs command
On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario wrote:
> On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
>> > I am using ubuntu-10.04-rc with a kernel compiled from almost the
>> > latest source; the btrfs-progs is the latest too.
>> >
>> > You can modify the line
>> >
>> >     fprintf(stderr, "ERROR: can't perform the search\n");
>> >
>> > to
>> >
>> >     fprintf(stderr, "ERROR: can't perform the search: %s\n",
>> >             strerror(errno));
>> >
>> > to see what actually happened.
>>
>> nice:
>>
>> $ sudo btrfs subvolume list /
>> ERROR: can't perform the search: Inappropriate ioctl for device
>>
>> i'm not really familiar with C, or anything this low level, does this
>> help you diagnose my problem?
>
> Have you tried to run it on the device with the btrfs, not the mount
> point?
>
> It looks like the ioctl was made too restrictive about its arguments.

ah yes, i missed mentioning that too; tried that:

$ sudo btrfs sub list /dev/sda2
ERROR: '/dev/sda2' is not a subvolume

no dice :(
Re: list subvolumes with new btrfs command
On Mon, Apr 26, 2010 at 2:10 PM, C Anthony Risinger wrote:
> On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario wrote:
>> On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
>>> > I am using ubuntu-10.04-rc with a kernel compiled from almost the
>>> > latest source; the btrfs-progs is the latest too.
>>> >
>>> > You can modify the line
>>> >
>>> >     fprintf(stderr, "ERROR: can't perform the search\n");
>>> >
>>> > to
>>> >
>>> >     fprintf(stderr, "ERROR: can't perform the search: %s\n",
>>> >             strerror(errno));
>>> >
>>> > to see what actually happened.
>>>
>>> nice:
>>>
>>> $ sudo btrfs subvolume list /
>>> ERROR: can't perform the search: Inappropriate ioctl for device
>>>
>>> i'm not really familiar with C, or anything this low level, does
>>> this help you diagnose my problem?
>>
>> Have you tried to run it on the device with the btrfs, not the mount
>> point?
>>
>> It looks like the ioctl was made too restrictive about its arguments.
>
> ah yes, i missed mentioning that too; tried that:
>
> $ sudo btrfs sub list /dev/sda2
> ERROR: '/dev/sda2' is not a subvolume
>
> no dice :(

i tried setting up a loopback with a newly formatted btrfs image + mounting; same result: Inappropriate ioctl for device. same error whether i point the command at the default subvolume or a snapshot. is there anything (missing) i should check in regards to my kernel (module/progs mismatch)?
Re: list subvolumes with new btrfs command
On Mon, Apr 26, 2010 at 3:51 PM, C Anthony Risinger wrote:
> On Mon, Apr 26, 2010 at 2:10 PM, C Anthony Risinger wrote:
>> On Mon, Apr 26, 2010 at 12:58 PM, Hubert Kario wrote:
>>> On Monday 26 April 2010 19:23:21 C Anthony Risinger wrote:
>>>> > I am using ubuntu-10.04-rc with a kernel compiled from almost the
>>>> > latest source; the btrfs-progs is the latest too.
>>>> >
>>>> > You can modify the line
>>>> >
>>>> >     fprintf(stderr, "ERROR: can't perform the search\n");
>>>> >
>>>> > to
>>>> >
>>>> >     fprintf(stderr, "ERROR: can't perform the search: %s\n",
>>>> >             strerror(errno));
>>>> >
>>>> > to see what actually happened.
>>>>
>>>> nice:
>>>>
>>>> $ sudo btrfs subvolume list /
>>>> ERROR: can't perform the search: Inappropriate ioctl for device
>>>>
>>>> i'm not really familiar with C, or anything this low level, does
>>>> this help you diagnose my problem?
>>>
>>> Have you tried to run it on the device with the btrfs, not the
>>> mount point?
>>>
>>> It looks like the ioctl was made too restrictive about its
>>> arguments.
>>
>> ah yes, i missed mentioning that too; tried that:
>>
>> $ sudo btrfs sub list /dev/sda2
>> ERROR: '/dev/sda2' is not a subvolume
>>
>> no dice :(
>
> i tried setting up a loopback with a newly formatted btrfs image +
> mounting; same result: Inappropriate ioctl for device. same error
> whether i point the command at the default subvolume or a snapshot.
> is there anything (missing) i should check in regards to my kernel
> (module/progs mismatch)?

bleh, looks like my kernel didn't have what it needed; i thought the 2.6.33/stock Arch kernel was recent enough. i booted a 2.6.34-rc5 kernel and everything works now:

$ sudo btrfs sub list /
ID 259 top level 5 path vps/var/lib/vps-lxc/tpl/arch-nano
ID 260 top level 5 path vps/var/lib/vps-lxc/dom/dom1

heh, i forgot about those snapshots :-). i will compensate for this possibility in my initrd hook. apologies for the noise.

on a parting note, the "strerror(errno)" change was nice, and might be a useful addition for others, as it also pointed me in the right direction for a permissions problem (without sudo/non-super):

$ btrfs sub list /
ERROR: can't perform the search: Operation not permitted

other than that, thanks for the assistance; the new btrfs tool is nice.

C Anthony
[For review] [PATCH] Check all kmalloc(), etc, functions for success
Hi Chris, et al.,

I was bored on the long flight from Melbourne to LA (and kept awake by crying babies), so I thought I'd dip my toe into kernel programming and see whether any results from kmalloc() were being used without being checked for success first.

Turns out there were quite a few, which I found by hand with a simple "git grep -A2 kmalloc fs/btrfs", and so I've gone through and either BUG_ON()'d them or made them return -ENOMEM in those cases where the return value is checked.

I then installed Coccinelle (packaged in Ubuntu 10.04) and used the kmtest.cocci file to pick up other cases of memory allocations that need to be tested:

http://coccinelle.lip6.fr/impact/kmtest.html

There was one odd case in fs/btrfs/inode.c where the kzalloc() was preceded by a WARN_ON(pages); which would always be true as the only prior reference was its declaration and initialisation to NULL, so I took the liberty of moving that after the allocation and changing it to a BUG_ON().

As I'm new to this I'm only using BUG_ON(), as that seems to be the convention elsewhere in btrfs for this case, though the kernel itself (include/asm-generic/bug.h) indicates that BUG_ON() should only be used as a last resort. Please review the patch and let me know whether I'm on the right track or not! Just be gentle with me, I'm jetlagged. :-)

Patch is included inline and also attached as a MIME attachment to give a better alternative in case of wordwrap breakage in the inline version.
All the best,
Chris

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 396039b..eb6e785 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -351,6 +351,7 @@ int btrfs_submit_compressed_write(struct inode *inode, u64 start,
 	WARN_ON(start & ((u64)PAGE_CACHE_SIZE - 1));
 	cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+	BUG_ON(!cb);
 	atomic_set(&cb->pending_bios, 0);
 	cb->errors = 0;
 	cb->inode = inode;
@@ -588,6 +589,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	compressed_len = em->block_len;
 	cb = kmalloc(compressed_bio_size(root, compressed_len), GFP_NOFS);
+	BUG_ON(!cb);
 	atomic_set(&cb->pending_bios, 0);
 	cb->errors = 0;
 	cb->inode = inode;
@@ -609,6 +611,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 				 PAGE_CACHE_SIZE;
 	cb->compressed_pages = kmalloc(sizeof(struct page *) * nr_pages,
 				       GFP_NOFS);
+	BUG_ON(!cb->compressed_pages);
 	bdev = BTRFS_I(inode)->root->fs_info->fs_devices->latest_bdev;
 
 	for (page_index = 0; page_index < nr_pages; page_index++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e7b8f2c..3e5f0ff 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1967,6 +1967,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 		log_tree_root = kzalloc(sizeof(struct btrfs_root),
 					GFP_NOFS);
+		BUG_ON(!log_tree_root);
 
 		__setup_root(nodesize, leafsize, sectorsize, stripesize,
 			     log_tree_root, fs_info, BTRFS_TREE_LOG_OBJECTID);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index b34d32f..6e20c54 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7161,6 +7161,8 @@ static noinline int relocate_one_extent(struct btrfs_root *extent_root,
 			u64 group_start = group->key.objectid;
 			new_extents = kmalloc(sizeof(*new_extents),
 					      GFP_NOFS);
+			if(!new_extents)
+				goto out;
 			nr_extents = 1;
 			ret = get_new_locations(reloc_inode,
 						extent_key,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 29ff749..59765bc 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -877,6 +877,7 @@ static ssize_t btrfs_file_write(struct file *file, const char __user *buf,
 	file_update_time(file);
 
 	pages = kmalloc(nrptrs * sizeof(struct page *), GFP_KERNEL);
+	BUG_ON(!pages);
 
 	/* generic_write_checks can change our pos */
 	start_pos = pos;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2bfdc64..d1216ba 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -284,6 +284,7 @@ static noinline int add_async_extent(struct async_cow *cow,
 	struct async_extent *async_extent;
 
 	async_extent = kmalloc(sizeof(*async_extent), GFP_NOFS);
+	BUG_ON(!async_extent);
 	async_extent->start = start;
 	async_extent->ram_size = ram_size;
 	async_extent->compressed_size = compressed_size;
@@ -382,8 +383,8 @@ again:
 		if (!(BTRFS_I(inode)->flags
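The same check-the-allocation discipline translates directly to userspace. A sketch of the propagate -ENOMEM idiom the patch adds in the cases where the caller already checks the return value (hypothetical helper, not btrfs code):

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Userspace analogue of the two idioms in the patch: kernel code either
 * BUG_ON()s a failed kmalloc() or returns -ENOMEM to a caller that checks
 * the result. This hypothetical helper takes the second route. */
static int try_alloc(size_t nrptrs)
{
	/* calloc() also catches nrptrs * sizeof(void *) overflow */
	void **pages = calloc(nrptrs, sizeof(*pages));

	if (!pages)
		return -ENOMEM;	/* propagate instead of dereferencing NULL */

	/* ... the caller would fill and use the array here ... */
	free(pages);
	return 0;
}
```

Returning -ENOMEM is the preferable pattern wherever an error path already exists, which matches the bug.h guidance that BUG_ON() is a last resort.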