Re: Writes in idle/How to debug filesystem activity
I use btrfs on a SD card. When my computer is running in idle (i.e. nothing running but wmii and a few background tasks that don't do any i/o; freshly synced), from time to time some [btrfs-*] tasks do writes (for 5 seconds or so). What are those processes doing on my disk, having nothing to write/read? That leads to the second question: How can I debug file system activity in inotify style? There was once a similar thread about this issue; unfortunately, without any constructive answers: http://thread.gmane.org/gmane.comp.file-systems.btrfs/10840 -- Tomasz Chmielewski http://wpkg.org -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Writes in idle/How to debug filesystem activity
There was once a similar thread about this issue; unfortunately, without any constructive answers: http://thread.gmane.org/gmane.comp.file-systems.btrfs/10840 I believe sync does a transaction commit every time. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/inode.c:2299
On Tue, 2011-08-30 at 14:27 +0800, Miao Xie wrote: On mon, 29 Aug 2011 02:45:07 +0100, Maciej Marcin Piechotka wrote: I receive the bug when I try to snapshot from fcron: 2011-08-29T02:00:46.529238+01:00 picard kernel: [ 4155.76] [ cut here ] 2011-08-29T02:00:46.529253+01:00 picard kernel: [ 4155.90] kernel BUG at fs/btrfs/inode.c:2299! 2011-08-29T02:00:46.529256+01:00 picard kernel: [ 4155.98] invalid opcode: [#1] PREEMPT SMP 2011-08-29T02:00:46.529259+01:00 picard kernel: [ 4155.444506] CPU 1 2011-08-29T02:00:46.529264+01:00 picard kernel: [ 4155.444508] Modules linked in: microcode reiserfs btrfs zlib_deflate libcrc32c arc4 snd_hda_codec_conexant iwlagn mac80211 uvcvideo videodev v4l2_compat_ioctl32 cfg80211 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_timer snd thinkpad_acpi soundcore snd_page_alloc hwmon rfkill nvram e1000e zram(C) rtc_cmos tp_smapi thinkpad_ec psmouse fscache kvm_intel kvm configs ipv6 autofs4 usbhid uhci_hcd firewire_ohci firewire_core sdhci_pci ehci_hcd crc_itu_t sdhci mmc_core sd_mod usbcore sr_mod cdrom 2011-08-29T02:00:46.529267+01:00 picard kernel: [ 4155.444569] 2011-08-29T02:00:46.529270+01:00 picard kernel: [ 4155.444576] Pid: 3167, comm: btrfs Tainted: P C 3.0.3-gentoo #1 LENOVO 2242CTO/2242CTO 2011-08-29T02:00:46.529274+01:00 picard kernel: [ 4155.444586] RIP: 0010:[a031339b] [a031339b] btrfs_orphan_del +0x8d/0xa9 [btrfs] 2011-08-29T02:00:46.529276+01:00 picard kernel: [ 4155.444611] RSP: :880111c21b58 EFLAGS: 00010282 2011-08-29T02:00:46.529279+01:00 picard kernel: [ 4155.444619] RAX: fffe RBX: 88011141bd60 RCX: 1000 2011-08-29T02:00:46.529289+01:00 picard kernel: [ 4155.444626] RDX: 000f RSI: 880110c8d250 RDI: 880134092d80 2011-08-29T02:00:46.529292+01:00 picard kernel: [ 4155.444634] RBP: R08: 880111c21a30 R09: 0011 2011-08-29T02:00:46.529297+01:00 picard kernel: [ 4155.444642] R10: 01dc R11: 0286 R12: 880136866300 2011-08-29T02:00:46.529299+01:00 picard kernel: [ 4155.444650] R13: 880130cac800 R14: 0001 R15: 880130cacb88 2011-08-29T02:00:46.529302+01:00 picard kernel: [ 4155.444658] FS: 7f126715a740() GS:88013bc8() knlGS: 2011-08-29T02:00:46.529304+01:00 picard kernel: [ 4155.444666] CS: 0010 DS: ES: CR0: 8005003b 2011-08-29T02:00:46.529307+01:00 picard kernel: [ 4155.444673] CR2: 7f5149eb7008 CR3: 9caca000 CR4: 000406e0 2011-08-29T02:00:46.529309+01:00 picard kernel: [ 4155.444681] DR0: DR1: DR2: 2011-08-29T02:00:46.529311+01:00 picard kernel: [ 4155.444689] DR3: DR6: 0ff0 DR7: 0400 2011-08-29T02:00:46.529314+01:00 picard kernel: [ 4155.444697] Process btrfs (pid: 3167, threadinfo 880111c2, task 88010b35a140) 2011-08-29T02:00:46.529316+01:00 picard kernel: [ 4155.444705] Stack: 2011-08-29T02:00:46.529318+01:00 picard kernel: [ 4155.444711] 00c369c5 880130cac800 880110c8dbe0 00c369c5 2011-08-29T02:00:46.529321+01:00 picard kernel: [ 4155.444720] 880130cacb88 880130cacb90 a03173c4 2011-08-29T02:00:46.562822+01:00 picard kernel: [ 4155.444730] 880136866300 88011141bd60 880110c8dbe0 fffb 2011-08-29T02:00:46.562959+01:00 picard kernel: [ 4155.444739] Call Trace: 2011-08-29T02:00:46.562964+01:00 picard kernel: [ 4155.444756] [a03173c4] ? btrfs_orphan_cleanup+0x196/0x2d9 [btrfs] 2011-08-29T02:00:46.562966+01:00 picard kernel: [ 4155.444773] [a0317862] ? btrfs_lookup_dentry+0x35b/0x392 [btrfs] 2011-08-29T02:00:46.562969+01:00 picard kernel: [ 4155.444790] [a03178a2] ? btrfs_lookup+0x9/0x1f [btrfs] 2011-08-29T02:00:46.562971+01:00 picard kernel: [ 4155.444801] [810ccf63] ? d_alloc_and_lookup+0x3a/0x60 2011-08-29T02:00:46.562973+01:00 picard kernel: [ 4155.444810] [810ce356] ? walk_component+0x210/0x3b7 2011-08-29T02:00:46.562975+01:00 picard kernel: [ 4155.444818] [810cea4a] ? path_lookupat+0x7c/0x29e 2011-08-29T02:00:46.562979+01:00 picard kernel: [ 4155.444827] [810cfc34] ? do_path_lookup+0x1c/0x87 2011-08-29T02:00:46.562981+01:00 picard kernel: [ 4155.444835] [810cffe2] ? user_path_at+0x49/0x7b 2011-08-29T02:00:46.562983+01:00 picard kernel: [ 4155.444843] [810c8781] ? vfs_fstatat+0x3c/0x6a 2011-08-29T02:00:46.562986+01:00 picard kernel: [ 4155.444853] [8102d38e] ? sub_preempt_count+0x83/0x94 2011-08-29T02:00:46.562988+01:00 picard kernel: [ 4155.444861] [810c88ab] ? sys_newstat+0x12/0x2b 2011-08-29T02:00:46.562990+01:00 picard kernel: [ 4155.444870]
[PATCH] Btrfs: handle enospc accounting for free space inodes
Since free space inodes now use normal checksumming we need to make sure to account for their metadata use. So reserve metadata space, and then if we fail to write out the metadata we can just release it, otherwise it will be freed up when the io completes. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- fs/btrfs/extent-tree.c | 18 fs/btrfs/free-space-cache.c | 44 -- fs/btrfs/inode-map.c|6 +++- fs/btrfs/inode.c|2 +- 4 files changed, 47 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1f1d3e8..ccdc4d1 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2755,16 +2755,20 @@ again: num_pages *= 16; num_pages *= PAGE_CACHE_SIZE; - ret = btrfs_check_data_free_space(inode, num_pages); + ret = btrfs_delalloc_reserve_space(inode, num_pages); if (ret) goto out_put; ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, num_pages, num_pages, num_pages, alloc_hint); - if (!ret) + if (!ret) { dcs = BTRFS_DC_SETUP; - btrfs_free_reserved_data_space(inode, num_pages); + btrfs_free_reserved_data_space(inode, num_pages); + } else { + btrfs_delalloc_release_space(inode, num_pages); + } + out_put: iput(inode); out_free: @@ -4002,9 +4006,13 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) struct btrfs_block_rsv *block_rsv = root-fs_info-delalloc_block_rsv; u64 to_reserve = 0; unsigned nr_extents = 0; + int flush = 1; int ret; - if (btrfs_transaction_in_commit(root-fs_info)) + if (btrfs_is_free_space_inode(root, inode)) + flush = 0; + + if (flush btrfs_transaction_in_commit(root-fs_info)) schedule_timeout(1); num_bytes = ALIGN(num_bytes, root-sectorsize); @@ -4023,7 +4031,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes) to_reserve += calc_csum_metadata_size(inode, num_bytes, 1); spin_unlock(BTRFS_I(inode)-lock); - ret = reserve_metadata_bytes(NULL, root, block_rsv, to_reserve, 1); + ret = reserve_metadata_bytes(NULL, root, block_rsv, to_reserve, flush); if (ret) { u64 to_free = 0; unsigned dropped; diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c index d56ba54..13c8fb3 100644 --- a/fs/btrfs/free-space-cache.c +++ b/fs/btrfs/free-space-cache.c @@ -528,6 +528,19 @@ out: return ret; } +/** + * __btrfs_write_out_cache - write out cached info to an inode + * @root - the root the inode belongs to + * @ctl - the free space cache we are going to write out + * @block_group - the block_group for this cache if it belongs to a block_group + * @trans - the trans handle + * @path - the path to use + * @offset - the offset for the key we'll insert + * + * This function writes out a free space cache struct to disk for quick recovery + * on mount. This will return 0 if it was successfull in writing the cache out, + * and -1 if it was not. + */ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, struct btrfs_free_space_ctl *ctl, struct btrfs_block_group_cache *block_group, @@ -551,7 +564,8 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, int index = 0, num_pages = 0; int entries = 0; int bitmaps = 0; - int ret = -1; + int ret; + int err = -1; bool next_page = false; bool out_of_space = false; @@ -559,7 +573,7 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, node = rb_first(ctl-free_space_offset); if (!node) - return 0; + return -1; if (!i_size_read(inode)) return -1; @@ -763,7 +777,6 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, unlock_extent_cached(BTRFS_I(inode)-io_tree, 0, i_size_read(inode) - 1, cached_state, GFP_NOFS); - ret = 0; goto out; } @@ -785,10 +798,8 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, unlock_extent_cached(BTRFS_I(inode)-io_tree, 0, i_size_read(inode) - 1, cached_state, GFP_NOFS); - if (ret) { - ret = 0; + if (ret) goto out; - } BTRFS_I(inode)-generation = trans-transid; @@ -800,7 +811,6 @@ int __btrfs_write_out_cache(struct btrfs_root *root, struct inode *inode, ret =
[PATCH] Btrfs: use the transactions block_rsv for the csum root
The alloc warnings everybody has been seeing is because we have been reserving space for csums, but we weren't actually using that space. So make get_block_rsv() return the trans-block_rsv if we're modifying the csum root. Also set the trans-block_rsv to NULL so that if we modify the csum root when running delayed ref's that comes out of the global reserve like it's supposed to. With this patch I'm not seeing those alloc warnings anymore. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- fs/btrfs/extent-tree.c | 15 +-- fs/btrfs/transaction.c |1 + 2 files changed, 10 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index ccdc4d1..53f6dbd 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3565,10 +3565,12 @@ out: static struct btrfs_block_rsv *get_block_rsv(struct btrfs_trans_handle *trans, struct btrfs_root *root) { - struct btrfs_block_rsv *block_rsv; - if (root-ref_cows) + struct btrfs_block_rsv *block_rsv = NULL; + + if (root-ref_cows || root == root-fs_info-csum_root) block_rsv = trans-block_rsv; - else + + if (!block_rsv) block_rsv = root-block_rsv; if (!block_rsv) @@ -3865,12 +3867,13 @@ static void release_global_block_rsv(struct btrfs_fs_info *fs_info) void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans, struct btrfs_root *root) { + struct btrfs_block_rsv *block_rsv; + if (!trans-bytes_reserved) return; - BUG_ON(trans-block_rsv != root-fs_info-trans_block_rsv); - btrfs_block_rsv_release(root, trans-block_rsv, - trans-bytes_reserved); + block_rsv = root-fs_info-trans_block_rsv; + btrfs_block_rsv_release(root, block_rsv, trans-bytes_reserved); trans-bytes_reserved = 0; } diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 3418acc..329e80a 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -453,6 +453,7 @@ static int __btrfs_end_transaction(struct btrfs_trans_handle *trans, return 0; } + trans-block_rsv = NULL; while (count 4) { unsigned long cur = trans-delayed_ref_updates; trans-delayed_ref_updates = 0; -- 1.7.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: stop passing a trans handle all around the reservation code
The only thing that we need to have a trans handle for is in reserve_metadata_bytes and thats to know how much flushing we can do. So instead of passing it around, just check current-journal_info for a trans_handle so we know if we can commit a transaction to try and free up space or not. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- fs/btrfs/ctree.h|6 +--- fs/btrfs/extent-tree.c | 45 ++- fs/btrfs/free-space-cache.c |4 +-- fs/btrfs/inode.c|4 +- fs/btrfs/relocation.c | 15 ++--- fs/btrfs/transaction.c |8 +++--- 6 files changed, 39 insertions(+), 43 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index caa73cd..a5faf8e 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2237,12 +2237,10 @@ void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv); struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root); void btrfs_free_block_rsv(struct btrfs_root *root, struct btrfs_block_rsv *rsv); -int btrfs_block_rsv_add(struct btrfs_trans_handle *trans, - struct btrfs_root *root, +int btrfs_block_rsv_add(struct btrfs_root *root, struct btrfs_block_rsv *block_rsv, u64 num_bytes); -int btrfs_block_rsv_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, +int btrfs_block_rsv_check(struct btrfs_root *root, struct btrfs_block_rsv *block_rsv, u64 min_reserved, int min_factor, int flush); int btrfs_block_rsv_migrate(struct btrfs_block_rsv *src_rsv, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index d236df7..9bb7159 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -3404,29 +3404,34 @@ static int shrink_delalloc(struct btrfs_trans_handle *trans, return reclaimed = to_reclaim; } -/* - * Retries tells us how many times we've called reserve_metadata_bytes. The - * idea is if this is the first call (retries == 0) then we will add to our - * reserved count if we can't make the allocation in order to hold our place - * while we go and try and free up space. That way for retries 1 we don't try - * and add space, we just check to see if the amount of unused space is = the - * total space, meaning that our reservation is valid. +/** + * reserve_metadata_bytes - try to reserve bytes from the block_rsv's space + * @root - the root we're allocating for + * @block_rsv - the block_rsv we're allocating for + * @orig_bytes - the number of bytes we want + * @flush - wether or not we can flush to make our reservation * - * However if we don't intend to retry this reservation, pass -1 as retries so - * that it short circuits this logic. + * This will reserve orgi_bytes number of bytes from the space info associated + * with the block_rsv. If there is not enough space it will make an attempt to + * flush out space to make room. It will do this by flushing delalloc if + * possible or committing the transaction. If flush is 0 then no attempts to + * regain reservations will be made and this will fail if there is not enough + * space already. */ -static int reserve_metadata_bytes(struct btrfs_trans_handle *trans, - struct btrfs_root *root, +static int reserve_metadata_bytes(struct btrfs_root *root, struct btrfs_block_rsv *block_rsv, u64 orig_bytes, int flush) { struct btrfs_space_info *space_info = block_rsv-space_info; + struct btrfs_trans_handle *trans; u64 unused; u64 num_bytes = orig_bytes; int retries = 0; int ret = 0; bool committed = false; bool flushing = false; + + trans = (struct btrfs_trans_handle *)current-journal_info; again: ret = 0; spin_lock(space_info-lock); @@ -3689,8 +3694,7 @@ void btrfs_free_block_rsv(struct btrfs_root *root, kfree(rsv); } -int btrfs_block_rsv_add(struct btrfs_trans_handle *trans, - struct btrfs_root *root, +int btrfs_block_rsv_add(struct btrfs_root *root, struct btrfs_block_rsv *block_rsv, u64 num_bytes) { @@ -3699,7 +3703,7 @@ int btrfs_block_rsv_add(struct btrfs_trans_handle *trans, if (num_bytes == 0) return 0; - ret = reserve_metadata_bytes(trans, root, block_rsv, num_bytes, 1); + ret = reserve_metadata_bytes(root, block_rsv, num_bytes, 1); if (!ret) { block_rsv_add_bytes(block_rsv, num_bytes, 1); return 0; @@ -3708,8 +3712,7 @@ int btrfs_block_rsv_add(struct btrfs_trans_handle *trans, return ret; } -int btrfs_block_rsv_check(struct btrfs_trans_handle *trans, - struct btrfs_root *root, +int btrfs_block_rsv_check(struct btrfs_root *root,
Re: [PATCH] btrfs: fix warning in iput for bad-inode
Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) -- Sergei signature.asc Description: PGP signature
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On 08/27/2011 01:30 AM, Marco Stornelli wrote: Il 26/08/2011 16:41, Zach Brown ha scritto: Hole: a range of the file that contains no data or is made up entirely of NULL (zero) data. Holes include preallocated ranges of files that have not had actual data written to them. No for me. A hole is made up of zero data? It's a strange definition for me. It's a very natural definition for me. It mirrors the behaviour of read() of sparse data inside i_size that file system authors already have to consider. It's also a reminder for people that this interface is about avoiding reading zeros. Systems that track contents can do this for files that had tons of zeros written. The data is there but the app is specifically asking us to skip it by using SEEK_DATA. - z I think we need to consider a hole and data not present/not written yet as two different cases even they are related. For example, if I do an fallocate without keep size option, then I do a read, I have the same behavior of sparse data inside i_size, but the blocks are allocated so no sparse data in this case. Simply there are no difference from app point of view. Exactly. That's why seek_hole should identify them alike, if possible. But that should not be a requirement because the sole aim here is to improve performance. Identifying a hole as data is not the end of the world. In some cases it may be more efficient. We just have to ensure that we don't identify data as a hole. BTW, we still have the fiemap interface that allows users to identify unwritten extents, etc. Use that if you want that kind of detail. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix warning in iput for bad-inode
On 08/30/2011 12:53 PM, Sergei Trofimovich wrote: Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) Sergei, can you do sysrq+w when this is happening, and maybe turn on the softlockup detector so we can see where sync is getting stuck? Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix warning in iput for bad-inode
On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik jo...@redhat.com wrote: On 08/30/2011 12:53 PM, Sergei Trofimovich wrote: Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) Sergei, can you do sysrq+w when this is happening, and maybe turn on the softlockup detector so we can see where sync is getting stuck? Thanks, Sure. As I keep telling about 2 cases in IRC I will state both here explicitely: ==The First Issue (aka The Hung sync() case) == - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on / type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem (usual nonmixed mode) How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db /home/st/.bogofilter/wordlist.db: 19070 extents found the file is 138MB sqlite database for bayesian SPAM filter, it's being read and written every 20 minutes or so. Maybe, it was writtent even in defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag ~/.bogofilter/wordlist.db $ sync ^Chung in D-state I didn't try to reproduce it yet. As for lockdep I'll try but I'm afraid I will fail to reproduce, but I'll try tomorrow. I suspect I'll need to seriously fragment some file first down to such horrible state. With help of David I've some (hopefully relevant) info: #!/bin/sh -x for i in $(ps aux|grep D[+ ]\?|awk '{print $2}'); do ps $i sudo cat /proc/$i/stack done PID TTY STAT TIME COMMAND 1291 ?D 0:00 [btrfs-endio-wri] [8130055d] btrfs_tree_read_lock+0x6d/0x120 [812b8e88] btrfs_search_slot+0x698/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812ca10f] __btrfs_lookup_bio_sums+0x1cf/0x3e0 [812ca371] btrfs_lookup_bio_sums+0x11/0x20 [812d6a50] btrfs_submit_bio_hook+0x140/0x170 [812ed594] submit_one_bio+0x64/0xa0 [812f14f5] extent_readpages+0xe5/0x100 [812d7aaa] btrfs_readpages+0x1a/0x20 [810a6a02] __do_page_cache_readahead+0x1d2/0x280 [810a6d8c] ra_submit+0x1c/0x20 [810a6ebd] ondemand_readahead+0x12d/0x270 [810a70cc] page_cache_sync_readahead+0x2c/0x40 [81309987] __load_free_space_cache+0x1a7/0x5b0 [81309e61] load_free_space_cache+0xd1/0x190 [812be07b] cache_block_group+0xab/0x290 [812c3def] find_free_extent.clone.71+0x39f/0xab0 [812c5160] btrfs_reserve_extent+0xe0/0x170 [812c56df] btrfs_alloc_free_block+0xcf/0x330 [812b498d] __btrfs_cow_block+0x11d/0x4a0 [812b4df8] btrfs_cow_block+0xe8/0x1a0 [812b8965] btrfs_search_slot+0x175/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812caf6e] btrfs_csum_file_blocks+0xbe/0x670 [812d7d91] add_pending_csums.clone.39+0x41/0x60 [812da528] btrfs_finish_ordered_io+0x218/0x310 [812da635] btrfs_writepage_end_io_hook+0x15/0x20 [8130c71a] end_compressed_bio_write+0x7a/0xe0 [811146f8] bio_endio+0x18/0x30 [812cd8fc] end_workqueue_fn+0xec/0x120 [812fb0ac] worker_loop+0xac/0x520 [8105d486] kthread+0x96/0xa0 [815f9214] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 1296 ?D 0:00 [btrfs-transacti] [812d3d6e] btrfs_commit_transaction+0x22e/0x870 [812cd5bb] transaction_kthread+0x26b/0x280 [8105d486] kthread+0x96/0xa0 [815f9214] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 2891 pts/0D+ 0:00 sync [8109d069] sleep_on_page+0x9/0x10 [8109d28e] wait_on_page_bit+0x6e/0x80 [8109d9aa] filemap_fdatawait_range+0xfa/0x190 [8109da62] filemap_fdatawait+0x22/0x30 [8110ae5a] sync_inodes_sb+0x17a/0x1e0 [8110eae8] __sync_filesystem+0x78/0x80 [8110eb07] sync_one_sb+0x17/0x20 [810e9cc5] iterate_supers+0x95/0xf0 [8110ea2b] sync_filesystems+0x1b/0x20 [8110eb8c] sys_sync+0x1c/0x40 [815f813b] system_call_fastpath+0x16/0x1b [] 0x PID TTY STAT TIME COMMAND cat: /proc/4454/stack: Нет такого файла или каталога disassebbled btrfs_tree_read_lock: 813004f0 btrfs_tree_read_lock push %rbp 813004f1 btrfs_tree_read_lock+0x1 mov%gs:0xb5c0,%rax 813004fa btrfs_tree_read_lock+0xa mov%rsp,%rbp 813004fd
Re: [PATCH] btrfs: fix warning in iput for bad-inode
On 08/30/2011 03:31 PM, Sergei Trofimovich wrote: On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik jo...@redhat.com wrote: On 08/30/2011 12:53 PM, Sergei Trofimovich wrote: Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) Sergei, can you do sysrq+w when this is happening, and maybe turn on the softlockup detector so we can see where sync is getting stuck? Thanks, Sure. As I keep telling about 2 cases in IRC I will state both here explicitely: ==The First Issue (aka The Hung sync() case) == - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on / type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem (usual nonmixed mode) How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db /home/st/.bogofilter/wordlist.db: 19070 extents found the file is 138MB sqlite database for bayesian SPAM filter, it's being read and written every 20 minutes or so. Maybe, it was writtent even in defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag ~/.bogofilter/wordlist.db $ sync ^Chung in D-state I didn't try to reproduce it yet. As for lockdep I'll try but I'm afraid I will fail to reproduce, but I'll try tomorrow. I suspect I'll need to seriously fragment some file first down to such horrible state. With help of David I've some (hopefully relevant) info: #!/bin/sh -x for i in $(ps aux|grep D[+ ]\?|awk '{print $2}'); do ps $i sudo cat /proc/$i/stack done PID TTY STAT TIME COMMAND 1291 ?D 0:00 [btrfs-endio-wri] [8130055d] btrfs_tree_read_lock+0x6d/0x120 [812b8e88] btrfs_search_slot+0x698/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812ca10f] __btrfs_lookup_bio_sums+0x1cf/0x3e0 [812ca371] btrfs_lookup_bio_sums+0x11/0x20 [812d6a50] btrfs_submit_bio_hook+0x140/0x170 [812ed594] submit_one_bio+0x64/0xa0 [812f14f5] extent_readpages+0xe5/0x100 [812d7aaa] btrfs_readpages+0x1a/0x20 [810a6a02] __do_page_cache_readahead+0x1d2/0x280 [810a6d8c] ra_submit+0x1c/0x20 [810a6ebd] ondemand_readahead+0x12d/0x270 [810a70cc] page_cache_sync_readahead+0x2c/0x40 [81309987] __load_free_space_cache+0x1a7/0x5b0 [81309e61] load_free_space_cache+0xd1/0x190 [812be07b] cache_block_group+0xab/0x290 [812c3def] find_free_extent.clone.71+0x39f/0xab0 [812c5160] btrfs_reserve_extent+0xe0/0x170 [812c56df] btrfs_alloc_free_block+0xcf/0x330 [812b498d] __btrfs_cow_block+0x11d/0x4a0 [812b4df8] btrfs_cow_block+0xe8/0x1a0 [812b8965] btrfs_search_slot+0x175/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812caf6e] btrfs_csum_file_blocks+0xbe/0x670 [812d7d91] add_pending_csums.clone.39+0x41/0x60 [812da528] btrfs_finish_ordered_io+0x218/0x310 [812da635] btrfs_writepage_end_io_hook+0x15/0x20 [8130c71a] end_compressed_bio_write+0x7a/0xe0 [811146f8] bio_endio+0x18/0x30 [812cd8fc] end_workqueue_fn+0xec/0x120 [812fb0ac] worker_loop+0xac/0x520 [8105d486] kthread+0x96/0xa0 [815f9214] kernel_thread_helper+0x4/0x10 [] 0x Ok this should have been fixed with Btrfs: use the commit_root for reading free_space_inode crcs which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f. Does your kernel have this commit? Because if it does then we did something wrong. If not it should be in linus's latest tree, so update and it should go away (hopefully). Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: skip locking if searching the commit root in lookup_csums
It's not enough to just search the commit root, since we could be cow'ing the very block we need to search through, which would mean that its locked and we'll still deadlock. So use path-skip_locking as well. Thanks, Signed-off-by: Josef Bacik jo...@redhat.com --- fs/btrfs/file-item.c |4 +++- 1 files changed, 3 insertions(+), 1 deletions(-) diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c index b910694..a1cb782 100644 --- a/fs/btrfs/file-item.c +++ b/fs/btrfs/file-item.c @@ -183,8 +183,10 @@ static int __btrfs_lookup_bio_sums(struct btrfs_root *root, * read from the commit root and sidestep a nasty deadlock * between reading the free space cache and updating the csum tree. */ - if (btrfs_is_free_space_inode(root, inode)) + if (btrfs_is_free_space_inode(root, inode)) { path-search_commit_root = 1; + path-skip_locking = 1; + } disk_bytenr = (u64)bio-bi_sector 9; if (dio) -- 1.7.5.2 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix warning in iput for bad-inode
On 08/30/2011 03:40 PM, Josef Bacik wrote: On 08/30/2011 03:31 PM, Sergei Trofimovich wrote: On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik jo...@redhat.com wrote: On 08/30/2011 12:53 PM, Sergei Trofimovich wrote: Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) Sergei, can you do sysrq+w when this is happening, and maybe turn on the softlockup detector so we can see where sync is getting stuck? Thanks, Sure. As I keep telling about 2 cases in IRC I will state both here explicitely: ==The First Issue (aka The Hung sync() case) == - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on / type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem (usual nonmixed mode) How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db /home/st/.bogofilter/wordlist.db: 19070 extents found the file is 138MB sqlite database for bayesian SPAM filter, it's being read and written every 20 minutes or so. Maybe, it was writtent even in defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag ~/.bogofilter/wordlist.db $ sync ^Chung in D-state I didn't try to reproduce it yet. As for lockdep I'll try but I'm afraid I will fail to reproduce, but I'll try tomorrow. I suspect I'll need to seriously fragment some file first down to such horrible state. With help of David I've some (hopefully relevant) info: #!/bin/sh -x for i in $(ps aux|grep D[+ ]\?|awk '{print $2}'); do ps $i sudo cat /proc/$i/stack done PID TTY STAT TIME COMMAND 1291 ?D 0:00 [btrfs-endio-wri] [8130055d] btrfs_tree_read_lock+0x6d/0x120 [812b8e88] btrfs_search_slot+0x698/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812ca10f] __btrfs_lookup_bio_sums+0x1cf/0x3e0 [812ca371] btrfs_lookup_bio_sums+0x11/0x20 [812d6a50] btrfs_submit_bio_hook+0x140/0x170 [812ed594] submit_one_bio+0x64/0xa0 [812f14f5] extent_readpages+0xe5/0x100 [812d7aaa] btrfs_readpages+0x1a/0x20 [810a6a02] __do_page_cache_readahead+0x1d2/0x280 [810a6d8c] ra_submit+0x1c/0x20 [810a6ebd] ondemand_readahead+0x12d/0x270 [810a70cc] page_cache_sync_readahead+0x2c/0x40 [81309987] __load_free_space_cache+0x1a7/0x5b0 [81309e61] load_free_space_cache+0xd1/0x190 [812be07b] cache_block_group+0xab/0x290 [812c3def] find_free_extent.clone.71+0x39f/0xab0 [812c5160] btrfs_reserve_extent+0xe0/0x170 [812c56df] btrfs_alloc_free_block+0xcf/0x330 [812b498d] __btrfs_cow_block+0x11d/0x4a0 [812b4df8] btrfs_cow_block+0xe8/0x1a0 [812b8965] btrfs_search_slot+0x175/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812caf6e] btrfs_csum_file_blocks+0xbe/0x670 [812d7d91] add_pending_csums.clone.39+0x41/0x60 [812da528] btrfs_finish_ordered_io+0x218/0x310 [812da635] btrfs_writepage_end_io_hook+0x15/0x20 [8130c71a] end_compressed_bio_write+0x7a/0xe0 [811146f8] bio_endio+0x18/0x30 [812cd8fc] end_workqueue_fn+0xec/0x120 [812fb0ac] worker_loop+0xac/0x520 [8105d486] kthread+0x96/0xa0 [815f9214] kernel_thread_helper+0x4/0x10 [] 0x Ok this should have been fixed with Btrfs: use the commit_root for reading free_space_inode crcs which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f. Does your kernel have this commit? Because if it does then we did something wrong. If not it should be in linus's latest tree, so update and it should go away (hopefully). Thanks, Oops looks like that patch won't fix it completely, I just sent another patch that will fix this problem totally, sorry about that [PATCH] Btrfs: skip locking if searching the commit root in lookup_csums Thanks, Josef -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: fix warning in iput for bad-inode
On 08/30/2011 03:40 PM, Josef Bacik wrote: On 08/30/2011 03:31 PM, Sergei Trofimovich wrote: On Tue, 30 Aug 2011 14:02:37 -0400 Josef Bacik jo...@redhat.com wrote: On 08/30/2011 12:53 PM, Sergei Trofimovich wrote: Running 'sync' program after the load does not finish and eats 100%CPU busy-waiting for something in kernel. It's easy to reproduce hang with patch for me. I just run liferea and sync after it. Without patch I haven't managed to hang btrfs up. And I think it's another btrfs bug. I've managed to reproduce it _without_ your patch and _without_ autodefrag enabled by manually running the following commands: $ btrfs fi defrag file-with-20_000-extents $ sync I think your patch just shuffles things a bit and forces autodefrag races to pop-up sooner (which is good! :]) Sergei, can you do sysrq+w when this is happening, and maybe turn on the softlockup detector so we can see where sync is getting stuck? Thanks, Sure. As I keep telling about 2 cases in IRC I will state both here explicitely: ==The First Issue (aka The Hung sync() case) == - it's an unpatched linus's v3.1-rc4-80-g0f43dd5 - /dev/root on / type btrfs (rw,noatime,compress=lzo) - 50% full 30GB filesystem (usual nonmixed mode) How I hung it: $ /usr/sbin/filefrag ~/.bogofilter/wordlist.db /home/st/.bogofilter/wordlist.db: 19070 extents found the file is 138MB sqlite database for bayesian SPAM filter, it's being read and written every 20 minutes or so. Maybe, it was writtent even in defrag/sync time! $~/dev/git/btrfs-progs-unstable/btrfs fi defrag ~/.bogofilter/wordlist.db $ sync ^Chung in D-state I didn't try to reproduce it yet. As for lockdep I'll try but I'm afraid I will fail to reproduce, but I'll try tomorrow. I suspect I'll need to seriously fragment some file first down to such horrible state. With help of David I've some (hopefully relevant) info: #!/bin/sh -x for i in $(ps aux|grep D[+ ]\?|awk '{print $2}'); do ps $i sudo cat /proc/$i/stack done PID TTY STAT TIME COMMAND 1291 ?D 0:00 [btrfs-endio-wri] [8130055d] btrfs_tree_read_lock+0x6d/0x120 [812b8e88] btrfs_search_slot+0x698/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812ca10f] __btrfs_lookup_bio_sums+0x1cf/0x3e0 [812ca371] btrfs_lookup_bio_sums+0x11/0x20 [812d6a50] btrfs_submit_bio_hook+0x140/0x170 [812ed594] submit_one_bio+0x64/0xa0 [812f14f5] extent_readpages+0xe5/0x100 [812d7aaa] btrfs_readpages+0x1a/0x20 [810a6a02] __do_page_cache_readahead+0x1d2/0x280 [810a6d8c] ra_submit+0x1c/0x20 [810a6ebd] ondemand_readahead+0x12d/0x270 [810a70cc] page_cache_sync_readahead+0x2c/0x40 [81309987] __load_free_space_cache+0x1a7/0x5b0 [81309e61] load_free_space_cache+0xd1/0x190 [812be07b] cache_block_group+0xab/0x290 [812c3def] find_free_extent.clone.71+0x39f/0xab0 [812c5160] btrfs_reserve_extent+0xe0/0x170 [812c56df] btrfs_alloc_free_block+0xcf/0x330 [812b498d] __btrfs_cow_block+0x11d/0x4a0 [812b4df8] btrfs_cow_block+0xe8/0x1a0 [812b8965] btrfs_search_slot+0x175/0x8b0 [812c9e18] btrfs_lookup_csum+0x68/0x190 [812caf6e] btrfs_csum_file_blocks+0xbe/0x670 [812d7d91] add_pending_csums.clone.39+0x41/0x60 [812da528] btrfs_finish_ordered_io+0x218/0x310 [812da635] btrfs_writepage_end_io_hook+0x15/0x20 [8130c71a] end_compressed_bio_write+0x7a/0xe0 [811146f8] bio_endio+0x18/0x30 [812cd8fc] end_workqueue_fn+0xec/0x120 [812fb0ac] worker_loop+0xac/0x520 [8105d486] kthread+0x96/0xa0 [815f9214] kernel_thread_helper+0x4/0x10 [] 0x Ok this should have been fixed with Btrfs: use the commit_root for reading free_space_inode crcs which is commit # 2cf8572dac62cc2ff7e995173e95b6c694401b3f. Does your kernel have this commit? Because if it does then we did something wrong. If not it should be in linus's latest tree, so update and it should go away (hopefully). Thanks, Yeah, this one was in my local tree when hung. Oops looks like that patch won't fix it completely, I just sent another patch that will fix this problem totally, sorry about that [PATCH] Btrfs: skip locking if searching the commit root in lookup_csums I'll try to reproduce/test it tomorrow. About the second one: ==The Second Issue (aka The Busy Looping sync() case) == The box is different from first, so conditions are a bit different. - /dev/root on / type btrfs (rw,noatime,autodefrag) (note autodefrag!) - 15% full 594GB filesystem (usual nonmixed mode) $ liferea wait it to calm down. it does a lot of SQLite reads/writes $ sync Got CPU is 100% loaded hung Still reproducible with 2 patches above + $SUBJ
Re: [PATCH] btrfs: fix warning in iput for bad-inode
About the second one: ==The Second Issue (aka The Busy Looping sync() case) == The box is different from first, so conditions are a bit different. - /dev/root on / type btrfs (rw,noatime,autodefrag) (note autodefrag!) - 15% full 594GB filesystem (usual nonmixed mode) $ liferea wait it to calm down. it does a lot of SQLite reads/writes $ sync Got CPU is 100% loaded hung Still reproducible with 2 patches above + $SUBJ one. strace says it hangs in strace() syscall. Stack trace is odd: # cat /proc/`pidof sync`/stack [] 0x got stack of the looper with such hack: $ while :; do sudo echo -; cat /proc/2304/stack | fgrep -v '[] 0x'; done | uniq [8109d39d] find_get_pages_tag+0x14d/0x1a0 [81076981] __lock_acquire+0x5a1/0xbd0 [8109e7bd] filemap_fdatawait_range+0x9d/0x1b0 [8109e8f2] filemap_fdatawait+0x22/0x30 [81106a04] sync_inodes_sb+0x1c4/0x250 [8110b398] __sync_filesystem+0x78/0x80 [8110b3b7] sync_one_sb+0x17/0x20 [810e3fbe] iterate_supers+0x7e/0xe0 [8110b400] sys_sync+0x40/0x60 [81456afb] system_call_fastpath+0x16/0x1b - [811069f1] sync_inodes_sb+0x1b1/0x250 [8110b398] __sync_filesystem+0x78/0x80 [8110b3b7] sync_one_sb+0x17/0x20 [810e3fbe] iterate_supers+0x7e/0xe0 [8110b400] sys_sync+0x40/0x60 [81456afb] system_call_fastpath+0x16/0x1b - [811069f1] sync_inodes_sb+0x1b1/0x250 [8110b398] __sync_filesystem+0x78/0x80 [8110b3b7] sync_one_sb+0x17/0x20 [810e3fbe] iterate_supers+0x7e/0xe0 [8110b400] sys_sync+0x40/0x60 [81456afb] system_call_fastpath+0x16/0x1b - [814564b6] retint_kernel+0x26/0x30 [8109d398] find_get_pages_tag+0x148/0x1a0 [810a7a60] pagevec_lookup_tag+0x20/0x30 [8109e7bd] filemap_fdatawait_range+0x9d/0x1b0 [8109e8f2] filemap_fdatawait+0x22/0x30 [81106a04] sync_inodes_sb+0x1c4/0x250 [8110b398] __sync_filesystem+0x78/0x80 [8110b3b7] sync_one_sb+0x17/0x20 [810e3fbe] iterate_supers+0x7e/0xe0 [8110b400] sys_sync+0x40/0x60 [81456afb] system_call_fastpath+0x16/0x1b Looks like dirty inode count (or dirty page count) is affected by the $SUBJ patch. -- Sergei signature.asc Description: PGP signature
Re: [PATCH 1/4] fs: add SEEK_HOLE and SEEK_DATA flags
On Mon, Aug 22, 2011 at 07:56:31PM +0200, Marco Stornelli wrote: Il 22/08/2011 17:57, Sunil Mushran ha scritto: Any proposal that differentiates between holes is wrong. It should not matter where the hole is. Think of it from the usage-pov. doff = 0; while ((doff = lseek(SEEK_DATA, doff)) != -ENXIO) { hoff = lseek(SEEK_HOLE, doff); read_offset = doff; read_len = hoff -doff; process(); doff = hoff; } The goal is to make this as efficient as follows. Treating the last hole differently adds more code for no benefit. Mmmm.It seems that Josef has to be clear in this point. However I looked for the seek hole test in xfs test suite, but I didn't find anything. Btrfs guys, how have you got tested the implementation? What do you think about this corner case? Al, what do you think about it? The following test was used to test the early implementations. http://oss.oracle.com/~smushran/seek_data/ Thank you very much!! I found another point. Your test fails with my implementation because here (http://www.austingroupbugs.net/view.php?id=415) says: If whence is SEEK_DATA, the file offset shall be set to the smallest location of a byte not within a hole and not less than offset. It shall be an error if no such byte exists. So in this case I return ENXIO but the test expects another value. I have to say that there is a bit of confusion about the real behavior of this new feature :) Which is exactly why I'm trying to get the definitions clarified first, then the behaviour codified in a single test suite we can call the 'authoritive test'. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On 08/25/2011 06:35 PM, Dave Chinner wrote: Agreed, that's the way I'd interpret it, too. So perhaps we need to ensure that this interpretation is actually tested by this test? How about some definitions to work by: Data: a range of the file that contains valid data, regardless of whether it exists in memory or on disk. The valid data can be preceeded and/or followed by an arbitrary number of zero bytes dependent on the underlying implementation of hole detection. Hole: a range of the file that contains no data or is made up entirely of NULL (zero) data. Holes include preallocated ranges of files that have not had actual data written to them. Does that make sense? It has sufficient flexibility in it for the existing generic non-implementation, allows for filesystems to define their own hole detection boundaries (e.g. filesystem block size), and effectively defines how preallocated ranges from fallocate() should be treated (i.e. as holes). If we can agree on those definitions, I think that we should document them in both the kernel and the man page that defines SEEK_HOLE/SEEK_DATA so everyone is on the same page... We should not tie in the definition to existing fs technologies. Instead we should let the fs weigh the cost of providing accurate information with the possible gain in performance. Data: A range in a file that could contain something other than nulls. If in doubt, it is data. Hole: A range in a file that only contains nulls. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On Tue, Aug 30, 2011 at 06:17:02PM -0700, Sunil Mushran wrote: On 08/25/2011 06:35 PM, Dave Chinner wrote: Agreed, that's the way I'd interpret it, too. So perhaps we need to ensure that this interpretation is actually tested by this test? How about some definitions to work by: Data: a range of the file that contains valid data, regardless of whether it exists in memory or on disk. The valid data can be preceeded and/or followed by an arbitrary number of zero bytes dependent on the underlying implementation of hole detection. Hole: a range of the file that contains no data or is made up entirely of NULL (zero) data. Holes include preallocated ranges of files that have not had actual data written to them. Does that make sense? It has sufficient flexibility in it for the existing generic non-implementation, allows for filesystems to define their own hole detection boundaries (e.g. filesystem block size), and effectively defines how preallocated ranges from fallocate() should be treated (i.e. as holes). If we can agree on those definitions, I think that we should document them in both the kernel and the man page that defines SEEK_HOLE/SEEK_DATA so everyone is on the same page... We should not tie in the definition to existing fs technologies. Such as? If we don't use well known, well defined terminology, we end up with ambiguous, vague functionality and inconsistent implementations. Instead we should let the fs weigh the cost of providing accurate information with the possible gain in performance. Data: A range in a file that could contain something other than nulls. If in doubt, it is data. Hole: A range in a file that only contains nulls. And that's -exactly- the ambiguous, vague definition that has raised all these questions in the first place. I was in doubt about whether unwritten extents can be considered a hole, and by your definition that means it should be data. But Andreas seems to be in no doubt it should be considered a hole. Hence if I implement XFS support and Andreas implements ext4 support by your defintion, we end with vastly different behaviour even though the two filesystems use the same underlying technology for preallocated ranges. That's exactly the inconsistency in implementation that I'd like us to avoid. IOWs, the definition needs to be clear enough to prevent these inconsistencies from occurring. Indeed, the phrase preallocated ranges that have not had data written to them is as independent of filesystem implementation or technologies as possible. However, because Linux supports preallocation (unlike our reference platform), and we encourage developers to use it where appropriate, it is best that we define how we expect such ranges to behave clearly. That makes life easier for everyone. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On Wed, 31 Aug 2011, Dave Chinner wrote: On Tue, Aug 30, 2011 at 06:17:02PM -0700, Sunil Mushran wrote: On 08/25/2011 06:35 PM, Dave Chinner wrote: Agreed, that's the way I'd interpret it, too. So perhaps we need to ensure that this interpretation is actually tested by this test? How about some definitions to work by: Data: a range of the file that contains valid data, regardless of whether it exists in memory or on disk. The valid data can be preceeded and/or followed by an arbitrary number of zero bytes dependent on the underlying implementation of hole detection. Hole: a range of the file that contains no data or is made up entirely of NULL (zero) data. Holes include preallocated ranges of files that have not had actual data written to them. Does that make sense? It has sufficient flexibility in it for the existing generic non-implementation, allows for filesystems to define their own hole detection boundaries (e.g. filesystem block size), and effectively defines how preallocated ranges from fallocate() should be treated (i.e. as holes). If we can agree on those definitions, I think that we should document them in both the kernel and the man page that defines SEEK_HOLE/SEEK_DATA so everyone is on the same page... We should not tie in the definition to existing fs technologies. Such as? If we don't use well known, well defined terminology, we end up with ambiguous, vague functionality and inconsistent implementations. Instead we should let the fs weigh the cost of providing accurate information with the possible gain in performance. Data: A range in a file that could contain something other than nulls. If in doubt, it is data. Hole: A range in a file that only contains nulls. And that's -exactly- the ambiguous, vague definition that has raised all these questions in the first place. I was in doubt about whether unwritten extents can be considered a hole, and by your definition that means it should be data. But Andreas seems to be in no doubt it should be considered a hole. Hence if I implement XFS support and Andreas implements ext4 support by your defintion, we end with vastly different behaviour even though the two filesystems use the same underlying technology for preallocated ranges. That's exactly the inconsistency in implementation that I'd like us to avoid. IOWs, the definition needs to be clear enough to prevent these inconsistencies from occurring. Indeed, the phrase preallocated ranges that have not had data written to them is as independent of filesystem implementation or technologies as possible. However, because Linux supports preallocation (unlike our reference platform), and we encourage developers to use it where appropriate, it is best that we define how we expect such ranges to behave clearly. That makes life easier for everyone. Since a sparse file has the holes filled by nulls by definition, it seems fairly clear that they chould count as holes. In fact, I would not be surprised to see some filesystem _only_ report the unwritten pieces of sparse files as holes (not any other ranges of nulls) the question I have is how large does the range of nulls need to be before it's reported as a hole? disk sectors, filesystem blocks, other? David Lang -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs-progs: specify label length larger than 255 bytes cause mkfs.btrfs buffer overflow
Hello, While going through the mkfs.c, I noticed there is an issue for label length checking, mkfs.btrfs will crashed if the label length exceeding 255 bytes, it's easy to triggered that out as below: jeff@pibroch:~/opensource/btrfs-progs$ sudo ./mkfs.btrfs -L `perl -e 'print Ax256'` /usr/src/linux-3.0/img0 WARNING! - Btrfs v0.19-35-g1b444cd IS EXPERIMENTAL WARNING! - see http://btrfs.wiki.kernel.org before using *** buffer overflow detected ***: ./mkfs.btrfs terminated === Backtrace: = /lib/i386-linux-gnu/libc.so.6(__fortify_fail+0x50)[0xb7774df0] /lib/i386-linux-gnu/libc.so.6(+0xe4cca)[0xb7773cca] /lib/i386-linux-gnu/libc.so.6(__strcpy_chk+0x3f)[0xb777305f] ./mkfs.btrfs[0x805acc4] ./mkfs.btrfs[0x805def6] /lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0xb76a5e37] ./mkfs.btrfs[0x8048ef1] === Memory map: .. a tiny patch could fix it. Signed-off-by: Jie Liu jeff@oracle.com --- mkfs.c |4 ++-- 1 files changed, 2 insertions(+), 2 deletions(-) diff --git a/mkfs.c b/mkfs.c index 2e99b95..1598aae 100644 --- a/mkfs.c +++ b/mkfs.c @@ -308,9 +308,9 @@ static char *parse_label(char *input) int i; int len = strlen(input); -if (len BTRFS_LABEL_SIZE) { +if (len = BTRFS_LABEL_SIZE) { fprintf(stderr, Label %s is too long (max %d)\n, input, -BTRFS_LABEL_SIZE); +BTRFS_LABEL_SIZE - 1); exit(1); } for (i = 0; i len; i++) { -- 1.7.4.1 -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On 8/30/2011 8:29 PM, Dave Chinner wrote: And that's -exactly- the ambiguous, vague definition that has raised all these questions in the first place. I was in doubt about whether unwritten extents can be considered a hole, and by your definition that means it should be data. But Andreas seems to be in no doubt it should be considered a hole. Fair enough. Let me rephrase. Data: A range in a file when read could return something other than nulls. Hole: A range in a file when read only returns nulls. Considering preallocated extents only return null, they should be considered holes. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] xfstests 255: add a seek_data/seek_hole tester
On Tue, Aug 30, 2011 at 11:29 PM, Dave Chinner da...@fromorbit.com wrote: On Tue, Aug 30, 2011 at 06:17:02PM -0700, Sunil Mushran wrote: Instead we should let the fs weigh the cost of providing accurate information with the possible gain in performance. Data: A range in a file that could contain something other than nulls. If in doubt, it is data. Hole: A range in a file that only contains nulls. And that's -exactly- the ambiguous, vague definition that has raised all these questions in the first place. I was in doubt about whether unwritten extents can be considered a hole, and by your definition that means it should be data. But Andreas seems to be in no doubt it should be considered a hole. That's fine, though. Different filesystems have different abilities to recognize a data hole - FAT can't do it at all. Perhaps the requirements would be better stated in reverse: If the filesystem knows that a read() will return nulls (for whatever reason based on it's internal knowledge), it can report a hole. If it can't guarantee that, it's data. It's an absolute requirement that SEEK_DATA never miss data. SEEK_HOLE working is a nicety that userspace would appreciate - remember that the consumer here is cp(1), using it to skip empty portions of files and create sparse destination files. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html