Re: [PATCH 3/3] btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace
@@ -59,7 +59,7 @@ static struct list_head *zlib_alloc_workspace(void) workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL), zlib_inflate_workspacesize()); - workspace->strm.workspace = vmalloc(workspacesize); + workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL); workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); Workspace->buf can be a kvmalloc as well as in lzo? or I don't know if there was any purpose for not doing it for zlib. Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace
On 05/31/17 23:41, David Sterba wrote: As alloc_workspace is now protected by memalloc_nofs where needed, we can switch the kmalloc to use GFP_KERNEL. Signed-off-by: David SterbaReviewed-by: Anand Jain Thanks, Anand --- fs/btrfs/lzo.c | 2 +- fs/btrfs/zlib.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index a554856a5f8a..c556f3f3fbf0 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -51,7 +51,7 @@ static struct list_head *lzo_alloc_workspace(void) { struct workspace *workspace; - workspace = kzalloc(sizeof(*workspace), GFP_NOFS); + workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); if (!workspace) return ERR_PTR(-ENOMEM); diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index d5446e18bb59..c1db7572283b 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -53,14 +53,14 @@ static struct list_head *zlib_alloc_workspace(void) struct workspace *workspace; int workspacesize; - workspace = kzalloc(sizeof(*workspace), GFP_NOFS); + workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); if (!workspace) return ERR_PTR(-ENOMEM); workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL), zlib_inflate_workspacesize()); workspace->strm.workspace = vmalloc(workspacesize); - workspace->buf = kmalloc(PAGE_SIZE, GFP_NOFS); + workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); if (!workspace->strm.workspace || !workspace->buf) goto fail; -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] btrfs: add memalloc_nofs protections around alloc_workspace callback
On 05/31/17 23:41, David Sterba wrote: The workspaces are preallocated at the beginning where we can safely use GFP_KERNEL, but in some cases the find_workspace might reach the allocation again, now in a more restricted context when the bios or pages are being compressed. To avoid potential lockup when alloc_workspace -> vmalloc would silently use the GFP_KERNEL, add the memalloc_nofs helpers around the critical call site. Reviewed-by: Anand JainThanks, Anand Signed-off-by: David Sterba --- fs/btrfs/compression.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index ba511dd454d5..39cd164e5a62 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -32,6 +32,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -757,6 +758,7 @@ static struct list_head *find_workspace(int type) struct list_head *workspace; int cpus = num_online_cpus(); int idx = type - 1; + unsigned nofs_flag; struct list_head *idle_ws = _comp_ws[idx].idle_ws; spinlock_t *ws_lock = _comp_ws[idx].ws_lock; @@ -786,7 +788,15 @@ static struct list_head *find_workspace(int type) atomic_inc(total_ws); spin_unlock(ws_lock); + /* +* Allocation helpers call vmalloc that can't use GFP_NOFS, so we have +* to turn it off here because we might get called from the restricted +* context of btrfs_compress_bio/btrfs_compress_pages +*/ + nofs_flag = memalloc_nofs_save(); workspace = btrfs_compress_op[idx]->alloc_workspace(); + memalloc_nofs_restore(nofs_flag); + if (IS_ERR(workspace)) { atomic_dec(total_ws); wake_up(ws_wait); -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data
At 05/31/2017 09:52 PM, David Sterba wrote: On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote: Introduce a wrapper to recover raid56 data. The logical is the same with kernel one, but with different interfaces, since kernel ones cares the performance while in btrfs we don't care that much. Can you please rephrase this paragraph? I'll commit the patch as-is for now but want replace the changelog with an improved one. I'm not sure which direction I should change the paragraph to. To explain more about the wrapper or remove unrelated difference between kernel and btrfs-progs part? Or why we need the wrapper? Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3.2 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
At 05/31/2017 10:30 PM, David Sterba wrote: On Wed, May 31, 2017 at 08:31:35AM +0800, Qu Wenruo wrote: Yes it's hard to find such deadlock especially when lockdep will not detect it. And this makes the advantage of using stack memory in v3 patch more obvious. I didn't realize the extra possible deadlock when memory pressure is high, and to make completely correct usage of GFP_ flags we should let caller to choose its GFP_ flag, which will introduce more modification and more possibility to cause problem. So now I prefer the stack version a little more. The difference is that the stack version will always consume the stack at runtime. The dynamic allocation will not, but we have to add error handling and make sure we use right gfp flags. So it's runtime vs review trade off, I choose to spend time on review. OK, then I'll update the patchset to allow passing gfp flags for each reservation. You mean to add gfp flags to extent_changeset_alloc and update the direct callers or to add gfp flags to the whole reservation codepath? Yes, I was planning to do it. I strongly prefer to use GFP_NOFS for now, although it's not ideal. OK, then keep GFP_NOFS. But I also want to know the reason why. Is it just because we don't have good enough tool to detect possible deadlock caused by wrong GFP_* flags in write path? Thanks, Qu -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging Deadlocks?
On Wed, May 31, 2017 at 5:54 AM, David Sterbawrote: > On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote: >> We've been running BtrFS for a couple months now in production on >> several clusters. We're running on Canonical's 4.8 kernel, and >> currently, in the process of moving to our own patchset atop vanilla >> 4.10+. I'm glad to say it's been a fairly good experience for us. Bar >> some performance issues, it's been largely smooth sailing. > > Yay, thanks for the feedback. > >> There has been one class of persistent issues that has been plaguing >> our cluster is deadlocks. We've seen a fair number of issues where >> there are some number of background threads and user threads are in >> the process of performing operations where some are waiting to start a >> transaction, and at least one background thread or user thread is in >> the process of committing a transaction. Unfortunately, these >> situations are ending in deadlocks, where no threads are making >> progress. > > In such situations, save stacks of all processes (/proc/PID/stack). I > don't want to play terminology here, so by a deadlock I could understand > a system that's making progress so slow that's effectively stuck. This > could happen if the files are freamgented, so eg. traversing extents > takes locks and has a lot of work before it unlocks. Add some extent > sharing and updating references, this adds some points where the threads > just wait. > > The stacktraces could give an idea of what kind of hang it is. > We're saving a dump of the tasks currently running. A recent dump can be found here: http://cwillu.com:8080/50.19.255.106/1. This is the only snapshot I have from a node that's not making any progress. We also see the other case, where tasks are not making progress very quickly, and it causes the kernel hung task detector to kick in. This happens pretty often, and it's difficult to catch when it's happening, but the symptoms can be frustrating, including failed instance healthchecks, poor performance, and high latency for interactive services. Some of the traces we've gotten from the stuck task detector include: https://gist.github.com/sargun/9643c0c380d27a147ef3486e1d51dbdb https://gist.github.com/sargun/8858263b8d04c8ab726738022725ec12 >> We've talked about a couple ideas internally, like adding the ability >> to timeout transactions, abort commits or start_transactions which are >> taking too long, and adding more debugging to get insights into the >> state of the filesystem. Unfortunately, since our usage and knowledge >> of BtrFS is still somewhat nascent, we're unsure of what is the right >> investment. > > There's a kernel-wide hung task detection, but I think a similar > mechanism around just the transaction commits would be useful, as a > debugging option. > > There are number of ways how a transaction can be blocked though, so > we'd need to choose the starting point. Extent-related locks, waiting > for writes, other locks, the intenral transactional logic (and possibly > more). > As a first step, it'd be nice to have the transaction wrapped in a stack frame. We can then instrument it much more easily with off the shelf tools like simple BPF-based kprobes / kretprobes, or ftraces, rather than having to write a custom probe that's familiar with the innards of the txn datastructure, and does its own accounting to keep track of what's in flight. I'll take a cut at something as simple as an in-memory list of transactions which is periodically scanned for transactions which are taking too long, and log whether they're stuck starting, commiting, or in-flight and uncommitted. >> I'm curious, are other people seeing deadlocks crop up in production >> often? How are you going about debugging them, and are there any good >> pieces of advice on avoiding these for production workloads? > > I have seen hangs with kernel 4.9 a while back triggered by a > long-running iozone stress test, but 4.8 was not affected, and 4.10+ > worked fine again. I don't know if/which btrfs patches the 'canonical > 4.8' kernel has, so this might not be related. > > As for deadlocks (double taken lock, lock inversion), I haven't seen > them for a long time. The testing kernels run with lockdep, so we should > be able to see them early. You could try to run turn lockdep on if the > performance penalty is still acceptable for you. But there are still > cases that lockdep does not cover IIRC, due to the higher-level > semantics of the various btrfs trees and locking of extent buffers. For some of these use-cases, we can pretty easily recreate the pattern on the machine. For others, it's more complicated to find out which containers, and datasets were scheduled to be processed on the machine. We've run some sanity, and stress tests, but we can rarely get the filesystem to fall over in a predictable way in these tests compared to some production workloads. -- To unsubscribe from this list: send the line "unsubscribe
Re: [PATCH v2] Btrfs: fix invalid extent maps due to hole punching
On Sun, May 28, 2017 at 10:31:05PM +0100, fdman...@kernel.org wrote: > From: Filipe Manana> > While punching a hole in a range that is not aligned with the sector size > (currently the same as the page size) we can end up leaving an extent map > in memory with a length that is smaller then the sector size, which is > not expected and can lead to problems. This issue is easily detected > after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of > inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the > following for example: > > $ mkfs.btrfs -f /dev/sdb > $ mount /dev/sdb /mnt > $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo > $ xfs_io -c "fpunch 60K 90K" /mnt/foo > $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo > $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo > $ umount /mnt > > After the unmount operation we can see several warnings emmitted due to > underflows related to space reservation counters: > > [ 2837.443299] [ cut here ] > [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 > btrfs_destroy_inode+0xe8/0x27e [btrfs] > [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse > parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev > tpm button se > rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 > async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq > libcrc32c crc32c_gene > ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic > virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod > floppy > [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: GW > 4.10.0-rc8-btrfs-next-43+ #1 > [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > [ 2837.462379] Call Trace: > [ 2837.462379] dump_stack+0x68/0x92 > [ 2837.462379] __warn+0xc2/0xdd > [ 2837.462379] warn_slowpath_null+0x1d/0x1f > [ 2837.462379] btrfs_destroy_inode+0xe8/0x27e [btrfs] > [ 2837.462379] destroy_inode+0x3d/0x55 > [ 2837.462379] evict+0x177/0x17e > [ 2837.462379] dispose_list+0x50/0x71 > [ 2837.462379] evict_inodes+0x132/0x141 > [ 2837.462379] generic_shutdown_super+0x3f/0xeb > [ 2837.462379] kill_anon_super+0x12/0x1c > [ 2837.462379] btrfs_kill_super+0x16/0x21 [btrfs] > [ 2837.462379] deactivate_locked_super+0x30/0x68 > [ 2837.462379] deactivate_super+0x36/0x39 > [ 2837.462379] cleanup_mnt+0x58/0x76 > [ 2837.462379] __cleanup_mnt+0x12/0x14 > [ 2837.462379] task_work_run+0x77/0x9b > [ 2837.462379] prepare_exit_to_usermode+0x9d/0xc5 > [ 2837.462379] syscall_return_slowpath+0x196/0x1b9 > [ 2837.462379] entry_SYSCALL_64_fastpath+0xab/0xad > [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7 > [ 2837.462379] RSP: 002b:7ffdd0d8de58 EFLAGS: 0246 ORIG_RAX: > 00a6 > [ 2837.462379] RAX: RBX: 556f76a39060 RCX: > 7f3ef3e6b9a7 > [ 2837.462379] RDX: 0001 RSI: RDI: > 556f76a3f910 > [ 2837.462379] RBP: 556f76a3f910 R08: 556f76a3e670 R09: > 0015 > [ 2837.462379] R10: 06b4 R11: 0246 R12: > 7f3ef436ce64 > [ 2837.462379] R13: R14: 556f76a39240 R15: > 7ffdd0d8e0e0 > [ 2837.519355] ---[ end trace e79345fe24b30b8d ]--- > [ 2837.596256] [ cut here ] > [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 > btrfs_free_block_groups+0x246/0x3eb [btrfs] > [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse > parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev > tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 > raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor > raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod > cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring > virtio e1000 scsi_mod floppy > [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: GW > 4.10.0-rc8-btrfs-next-43+ #1 > [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 > [ 2837.663359] Call Trace: > [ 2837.663359] dump_stack+0x68/0x92 > [ 2837.663359] __warn+0xc2/0xdd > [ 2837.663359] warn_slowpath_null+0x1d/0x1f > [ 2837.663359] btrfs_free_block_groups+0x246/0x3eb [btrfs] > [ 2837.663359] close_ctree+0x1dd/0x2e1 [btrfs] > [ 2837.663359] ? evict_inodes+0x132/0x141 > [ 2837.663359] btrfs_put_super+0x15/0x17 [btrfs] > [ 2837.663359] generic_shutdown_super+0x6a/0xeb > [ 2837.663359] kill_anon_super+0x12/0x1c > [ 2837.663359] btrfs_kill_super+0x16/0x21 [btrfs] > [ 2837.663359] deactivate_locked_super+0x30/0x68 > [ 2837.663359] deactivate_super+0x36/0x39 > [ 2837.663359]
Re: Debugging Deadlocks?
On Wed, May 31, 2017 at 06:47:09AM +, Duncan wrote: > Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted: > > currently, in the process of moving to our own patchset atop vanilla > > 4.10+ > > Good for getting off kernel 4.8, as in mainline kernel terms that's only > a short-term stable release and support has now ended. In fact, support for 4.10 has already ended too. You either stick with LTS (currently 4.9) or keep upgrading to the newest-and-greatest bleeding edge as soon as it's released. For production machines, the choice is obvious. -- Don't be racist. White, amber or black, all beers should be judged based solely on their merits. Heck, even if occasionally a cider applies for a beer's job, why not? On the other hand, corpo lager is not a race. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
hi
-- Please contact me in the my e-mail addres,(griestkrist1...@gmail.com) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] btrfs: use GFP_KERNEL in init_ipath
Now that init_ipath is called either from a safe context or with memalloc_nofs protection, we can switch to GFP_KERNEL allocations in init_path and init_data_container. Signed-off-by: David Sterba--- fs/btrfs/backref.c | 10 +- fs/btrfs/ioctl.c | 4 ++-- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 24865da63d8f..f723c11bb763 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -16,7 +16,7 @@ * Boston, MA 021110-1307, USA. */ -#include +#include #include #include "ctree.h" #include "disk-io.h" @@ -2305,7 +2305,7 @@ struct btrfs_data_container *init_data_container(u32 total_bytes) size_t alloc_bytes; alloc_bytes = max_t(size_t, total_bytes, sizeof(*data)); - data = vmalloc(alloc_bytes); + data = kvmalloc(alloc_bytes, GFP_KERNEL); if (!data) return ERR_PTR(-ENOMEM); @@ -2339,9 +2339,9 @@ struct inode_fs_paths *init_ipath(s32 total_bytes, struct btrfs_root *fs_root, if (IS_ERR(fspath)) return (void *)fspath; - ifp = kmalloc(sizeof(*ifp), GFP_NOFS); + ifp = kmalloc(sizeof(*ifp), GFP_KERNEL); if (!ifp) { - vfree(fspath); + kvfree(fspath); return ERR_PTR(-ENOMEM); } @@ -2356,6 +2356,6 @@ void free_ipath(struct inode_fs_paths *ipath) { if (!ipath) return; - vfree(ipath->fspath); + kvfree(ipath->fspath); kfree(ipath); } diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index c9cdea8061bc..e4116f9248c2 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -37,7 +37,7 @@ #include #include #include -#include +#include #include #include #include @@ -4588,7 +4588,7 @@ static long btrfs_ioctl_logical_to_ino(struct btrfs_fs_info *fs_info, out: btrfs_free_path(path); - vfree(inodes); + kvfree(inodes); kfree(loi); return ret; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/5] More vmalloc cleanups
All vmalloc/vzalloc calls should be addressed, in this or the currently pending patches. David Sterba (5): btrfs: replace opencoded kvzalloc with the helper btrfs: send: use kvmalloc in iterate_dir_item btrfs: scrub: add memalloc_nofs protection around init_ipath btrfs: use GFP_KERNEL in init_ipath btrfs: adjust includes after vmalloc removal fs/btrfs/backref.c | 10 +- fs/btrfs/check-integrity.c | 11 --- fs/btrfs/ctree.c | 2 +- fs/btrfs/ioctl.c | 4 ++-- fs/btrfs/raid56.c | 11 --- fs/btrfs/scrub.c | 9 + fs/btrfs/send.c| 11 --- 7 files changed, 29 insertions(+), 29 deletions(-) -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] btrfs: send: use kvmalloc in iterate_dir_item
We use a growing buffer for xattrs larger than a page size, at some point vmalloc is unconditionally used for larger buffers. We can still try to avoid it using the kvmalloc helper. Signed-off-by: David Sterba--- fs/btrfs/send.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index 924b1d941b53..7416b17c0eac 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -1083,7 +1083,7 @@ static int iterate_dir_item(struct btrfs_root *root, struct btrfs_path *path, buf = tmp; } if (!buf) { - buf = vmalloc(buf_len); + buf = kvmalloc(buf_len, GFP_KERNEL); if (!buf) { ret = -ENOMEM; goto out; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/5] btrfs: scrub: add memalloc_nofs protection around init_ipath
init_ipath is called from a safe ioctl context and from scrub when printing an error. The protection is added for three reasons: * init_data_container calls vmalloc and this does not work as expected in the GFP_NOFS context, so this silently does GFP_KERNEL and might deadlock in some cases * keep the context constraint of GFP_NOFS, used by scrub * we want to use GFP_KERNEL unconditionally inside init_ipath or its callees Signed-off-by: David Sterba--- fs/btrfs/scrub.c | 9 + 1 file changed, 9 insertions(+) diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index e99be644b19f..096e503e3ddc 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -18,6 +18,7 @@ #include #include +#include #include "ctree.h" #include "volumes.h" #include "disk-io.h" @@ -733,6 +734,7 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, u32 nlink; int ret; int i; + unsigned nofs_flag; struct extent_buffer *eb; struct btrfs_inode_item *inode_item; struct scrub_warning *swarn = warn_ctx; @@ -771,7 +773,14 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root, nlink = btrfs_inode_nlink(eb, inode_item); btrfs_release_path(swarn->path); + /* +* init_path might indirectly call vmalloc, or use GFP_KERNEL. Scrub +* uses GFP_NOFS in this context, so we keep it consistent but it does +* not seem to be strictly necessary. +*/ + nofs_flag = memalloc_nofs_save(); ipath = init_ipath(4096, local_root, swarn->path); + memalloc_nofs_restore(nofs_flag); if (IS_ERR(ipath)) { ret = PTR_ERR(ipath); ipath = NULL; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] btrfs: replace opencoded kvzalloc with the helper
The logic of kmalloc and vmalloc fallback is open coded in several places, we can now use the existing helper. Signed-off-by: David Sterba--- fs/btrfs/check-integrity.c | 11 --- fs/btrfs/raid56.c | 11 --- fs/btrfs/send.c| 9 +++-- 3 files changed, 11 insertions(+), 20 deletions(-) diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c index 6cabc8acee2a..5f8006e4de9d 100644 --- a/fs/btrfs/check-integrity.c +++ b/fs/btrfs/check-integrity.c @@ -94,7 +94,7 @@ #include #include #include -#include +#include #include #include "ctree.h" #include "disk-io.h" @@ -2920,13 +2920,10 @@ int btrfsic_mount(struct btrfs_fs_info *fs_info, fs_info->sectorsize, PAGE_SIZE); return -1; } - state = kzalloc(sizeof(*state), GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); + state = kvzalloc(sizeof(*state), GFP_KERNEL); if (!state) { - state = vzalloc(sizeof(*state)); - if (!state) { - pr_info("btrfs check-integrity: vzalloc() failed!\n"); - return -1; - } + pr_info("btrfs check-integrity: allocation failed!\n"); + return -1; } if (!btrfsic_is_initialized) { diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c index d8ea0eb76325..d68af3c61b49 100644 --- a/fs/btrfs/raid56.c +++ b/fs/btrfs/raid56.c @@ -31,7 +31,7 @@ #include #include #include -#include +#include #include #include "ctree.h" #include "extent_map.h" @@ -218,12 +218,9 @@ int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info) * of a failing mount. */ table_size = sizeof(*table) + sizeof(*h) * num_entries; - table = kzalloc(table_size, GFP_KERNEL | __GFP_NOWARN | __GFP_REPEAT); - if (!table) { - table = vzalloc(table_size); - if (!table) - return -ENOMEM; - } + table = kvzalloc(table_size, GFP_KERNEL); + if (!table) + return -ENOMEM; spin_lock_init(>cache_lock); INIT_LIST_HEAD(>stripe_cache); diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c index e8185c83f667..924b1d941b53 100644 --- a/fs/btrfs/send.c +++ b/fs/btrfs/send.c @@ -6389,13 +6389,10 @@ long btrfs_ioctl_send(struct file *mnt_file, void __user *arg_) alloc_size = sizeof(struct clone_root) * (arg->clone_sources_count + 1); - sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL | __GFP_NOWARN); + sctx->clone_roots = kzalloc(alloc_size, GFP_KERNEL); if (!sctx->clone_roots) { - sctx->clone_roots = vzalloc(alloc_size); - if (!sctx->clone_roots) { - ret = -ENOMEM; - goto out; - } + ret = -ENOMEM; + goto out; } alloc_size = arg->clone_sources_count * sizeof(*arg->clone_sources); -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] btrfs: adjust includes after vmalloc removal
As we don't use vmalloc/vzalloc/vfree directly in ctree.c, we can now use the proper header that defines kvmalloc. Signed-off-by: David Sterba--- fs/btrfs/ctree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 6e1b02dd72d3..3f4daa9d6e2c 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -19,7 +19,7 @@ #include #include #include -#include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 0/3] Update gfp flags in alloc_workspace
We can cheaply get rid of a few GFP_NOFS, as there are only two contexts, initializaton (GFP_KERNEL is ok), and potentially a "NOFS", but with memalloc_nofs it's made safe. David Sterba (3): btrfs: add memalloc_nofs protections around alloc_workspace callback btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace fs/btrfs/compression.c | 10 ++ fs/btrfs/lzo.c | 16 fs/btrfs/zlib.c| 10 +- 3 files changed, 23 insertions(+), 13 deletions(-) -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/3] btrfs: add memalloc_nofs protections around alloc_workspace callback
The workspaces are preallocated at the beginning where we can safely use GFP_KERNEL, but in some cases the find_workspace might reach the allocation again, now in a more restricted context when the bios or pages are being compressed. To avoid potential lockup when alloc_workspace -> vmalloc would silently use the GFP_KERNEL, add the memalloc_nofs helpers around the critical call site. Signed-off-by: David Sterba--- fs/btrfs/compression.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index ba511dd454d5..39cd164e5a62 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -32,6 +32,7 @@ #include #include #include +#include #include "ctree.h" #include "disk-io.h" #include "transaction.h" @@ -757,6 +758,7 @@ static struct list_head *find_workspace(int type) struct list_head *workspace; int cpus = num_online_cpus(); int idx = type - 1; + unsigned nofs_flag; struct list_head *idle_ws = _comp_ws[idx].idle_ws; spinlock_t *ws_lock = _comp_ws[idx].ws_lock; @@ -786,7 +788,15 @@ static struct list_head *find_workspace(int type) atomic_inc(total_ws); spin_unlock(ws_lock); + /* +* Allocation helpers call vmalloc that can't use GFP_NOFS, so we have +* to turn it off here because we might get called from the restricted +* context of btrfs_compress_bio/btrfs_compress_pages +*/ + nofs_flag = memalloc_nofs_save(); workspace = btrfs_compress_op[idx]->alloc_workspace(); + memalloc_nofs_restore(nofs_flag); + if (IS_ERR(workspace)) { atomic_dec(total_ws); wake_up(ws_wait); -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] btrfs: switch kmallocs to GFP_KERNEL in lzo/zlib alloc_workspace
As alloc_workspace is now protected by memalloc_nofs where needed, we can switch the kmalloc to use GFP_KERNEL. Signed-off-by: David Sterba--- fs/btrfs/lzo.c | 2 +- fs/btrfs/zlib.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index a554856a5f8a..c556f3f3fbf0 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -51,7 +51,7 @@ static struct list_head *lzo_alloc_workspace(void) { struct workspace *workspace; - workspace = kzalloc(sizeof(*workspace), GFP_NOFS); + workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); if (!workspace) return ERR_PTR(-ENOMEM); diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index d5446e18bb59..c1db7572283b 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -53,14 +53,14 @@ static struct list_head *zlib_alloc_workspace(void) struct workspace *workspace; int workspacesize; - workspace = kzalloc(sizeof(*workspace), GFP_NOFS); + workspace = kzalloc(sizeof(*workspace), GFP_KERNEL); if (!workspace) return ERR_PTR(-ENOMEM); workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL), zlib_inflate_workspacesize()); workspace->strm.workspace = vmalloc(workspacesize); - workspace->buf = kmalloc(PAGE_SIZE, GFP_NOFS); + workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); if (!workspace->strm.workspace || !workspace->buf) goto fail; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] btrfs: switch to kvmalloc and GFP_KERNEL in lzo/zlib alloc_workspace
The compression workspace buffers are larger than a page so we use vmalloc, unconditionally. This is not always necessary as there might be contiguous memory available. Let's use the kvmalloc helpers that will try kmalloc first and fallback to vmalloc. For that they require GFP_KERNEL flags. As we now have the alloc_workspace calls protected by memalloc_nofs in the critical contexts, we can safely use GFP_KERNEL. Signed-off-by: David Sterba--- fs/btrfs/lzo.c | 14 +++--- fs/btrfs/zlib.c | 6 +++--- 2 files changed, 10 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c index c556f3f3fbf0..cde13cce01a0 100644 --- a/fs/btrfs/lzo.c +++ b/fs/btrfs/lzo.c @@ -18,7 +18,7 @@ #include #include -#include +#include #include #include #include @@ -41,9 +41,9 @@ static void lzo_free_workspace(struct list_head *ws) { struct workspace *workspace = list_entry(ws, struct workspace, list); - vfree(workspace->buf); - vfree(workspace->cbuf); - vfree(workspace->mem); + kvfree(workspace->buf); + kvfree(workspace->cbuf); + kvfree(workspace->mem); kfree(workspace); } @@ -55,9 +55,9 @@ static struct list_head *lzo_alloc_workspace(void) if (!workspace) return ERR_PTR(-ENOMEM); - workspace->mem = vmalloc(LZO1X_MEM_COMPRESS); - workspace->buf = vmalloc(lzo1x_worst_compress(PAGE_SIZE)); - workspace->cbuf = vmalloc(lzo1x_worst_compress(PAGE_SIZE)); + workspace->mem = kvmalloc(LZO1X_MEM_COMPRESS, GFP_KERNEL); + workspace->buf = kvmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL); + workspace->cbuf = kvmalloc(lzo1x_worst_compress(PAGE_SIZE), GFP_KERNEL); if (!workspace->mem || !workspace->buf || !workspace->cbuf) goto fail; diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c index c1db7572283b..c248f9286366 100644 --- a/fs/btrfs/zlib.c +++ b/fs/btrfs/zlib.c @@ -24,7 +24,7 @@ #include #include #include -#include +#include #include #include #include @@ -43,7 +43,7 @@ static void zlib_free_workspace(struct list_head *ws) { struct workspace *workspace = list_entry(ws, struct workspace, list); - vfree(workspace->strm.workspace); + kvfree(workspace->strm.workspace); kfree(workspace->buf); kfree(workspace); } @@ -59,7 +59,7 @@ static struct list_head *zlib_alloc_workspace(void) workspacesize = max(zlib_deflate_workspacesize(MAX_WBITS, MAX_MEM_LEVEL), zlib_inflate_workspacesize()); - workspace->strm.workspace = vmalloc(workspacesize); + workspace->strm.workspace = kvmalloc(workspacesize, GFP_KERNEL); workspace->buf = kmalloc(PAGE_SIZE, GFP_KERNEL); if (!workspace->strm.workspace || !workspace->buf) goto fail; -- 2.12.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v3.2 5/6] btrfs: qgroup: Introduce extent changeset for qgroup reserve functions
On Wed, May 31, 2017 at 08:31:35AM +0800, Qu Wenruo wrote: > >> Yes it's hard to find such deadlock especially when lockdep will not > >> detect it. > >> > >> And this makes the advantage of using stack memory in v3 patch more > >> obvious. > >> > >> I didn't realize the extra possible deadlock when memory pressure is > >> high, and to make completely correct usage of GFP_ flags we should let > >> caller to choose its GFP_ flag, which will introduce more modification > >> and more possibility to cause problem. > >> > >> So now I prefer the stack version a little more. > > > > The difference is that the stack version will always consume the stack > > at runtime. The dynamic allocation will not, but we have to add error > > handling and make sure we use right gfp flags. So it's runtime vs review > > trade off, I choose to spend time on review. > > OK, then I'll update the patchset to allow passing gfp flags for each > reservation. You mean to add gfp flags to extent_changeset_alloc and update the direct callers or to add gfp flags to the whole reservation codepath? I strongly prefer to use GFP_NOFS for now, although it's not ideal. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/20] Btrfs-progs offline scrub
On Tue, May 30, 2017 at 08:54:32PM +0200, David Sterba wrote: > On Thu, May 25, 2017 at 02:21:45PM +0800, Qu Wenruo wrote: > > For any one who wants to try it, it can be get from my repo: > > https://github.com/adam900710/btrfs-progs/tree/offline_scrub > > Qu Wenruo (20): > > btrfs-progs: raid56: Introduce raid56 header for later recovery usage > > btrfs-progs: raid56: Introduce tables for RAID6 recovery > > btrfs-progs: raid56: Allow raid6 to recover 2 data stripes > > btrfs-progs: raid56: Allow raid6 to recover data and p Patches 1-5 applied, with small fixups here and there, that took me more time than I wanted. Yet I'm not sure if I should bother you with the coding style things and grammar fixes. Maybe I should, it becomes too distracting namely in the rest of the patch series. > > btrfs-progs: Introduce wrapper to recover raid56 data > > btrfs-progs: Introduce new btrfs_map_block function which returns more > > unified result. > > btrfs-progs: Allow __btrfs_map_block_v2 to remove unrelated stripes > > btrfs-progs: csum: Introduce function to read out data csums > > I'm about to start merging this patches, in parts. First the patches 1-8 > as they're independent and not intrusive. > > > btrfs-progs: scrub: Introduce structures to support offline scrub for > > RAID56 > > btrfs-progs: scrub: Introduce functions to scrub mirror based tree > > block > > btrfs-progs: scrub: Introduce functions to scrub mirror based data > > blocks > > btrfs-progs: scrub: Introduce function to scrub one mirror-based > > extent > > btrfs-progs: scrub: Introduce function to scrub one data stripe > > btrfs-progs: scrub: Introduce function to verify parities > > btrfs-progs: extent-tree: Introduce function to check if there is any > > extent in given range. > > btrfs-progs: scrub: Introduce function to recover data parity > > btrfs-progs: scrub: Introduce helper to write a full stripe > > btrfs-progs: scrub: Introduce a function to scrub one full stripe > > btrfs-progs: scrub: Introduce function to check a whole block group > > btrfs-progs: scrub: Introduce offline scrub function -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 05/20] btrfs-progs: Introduce wrapper to recover raid56 data
On Thu, May 25, 2017 at 02:21:50PM +0800, Qu Wenruo wrote: > Introduce a wrapper to recover raid56 data. > > The logical is the same with kernel one, but with different interfaces, > since kernel ones cares the performance while in btrfs we don't care > that much. Can you please rephrase this paragraph? I'll commit the patch as-is for now but want replace the changelog with an improved one. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 01/20] btrfs-progs: raid56: Introduce raid56 header for later recovery usage
On Thu, May 25, 2017 at 02:21:46PM +0800, Qu Wenruo wrote: > @@ -0,0 +1,28 @@ > +/* > + * Copyright (C) 2017 Fujitsu. All rights reserved. Please use just plain GPL v2 header in newly created files. The history of who modified what in the given file is stored in git. I can't simply remove the copyright notice from existing files without checking first, but I can stop addding more. > + * > + * This program is free software; you can redistribute it and/or > + * modify it under the terms of the GNU General Public > + * License v2 as published by the Free Software Foundation. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * General Public License for more details. > + * > + * You should have received a copy of the GNU General Public > + * License along with this program; if not, write to the > + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, > + * Boston, MA 021110-1307, USA. > + */ > + > +#ifndef _BTRFS_PROGS_RAID56_H > +#define _BTRFS_PROGS_RAID56_H The pattern for the ifndef macro name is: "__BTRFS_" + descriptive header name + "_H__" just look to other files. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Debugging Deadlocks?
On Tue, May 30, 2017 at 09:12:39AM -0700, Sargun Dhillon wrote: > We've been running BtrFS for a couple months now in production on > several clusters. We're running on Canonical's 4.8 kernel, and > currently, in the process of moving to our own patchset atop vanilla > 4.10+. I'm glad to say it's been a fairly good experience for us. Bar > some performance issues, it's been largely smooth sailing. Yay, thanks for the feedback. > There has been one class of persistent issues that has been plaguing > our cluster is deadlocks. We've seen a fair number of issues where > there are some number of background threads and user threads are in > the process of performing operations where some are waiting to start a > transaction, and at least one background thread or user thread is in > the process of committing a transaction. Unfortunately, these > situations are ending in deadlocks, where no threads are making > progress. In such situations, save stacks of all processes (/proc/PID/stack). I don't want to play terminology here, so by a deadlock I could understand a system that's making progress so slow that's effectively stuck. This could happen if the files are freamgented, so eg. traversing extents takes locks and has a lot of work before it unlocks. Add some extent sharing and updating references, this adds some points where the threads just wait. The stacktraces could give an idea of what kind of hang it is. > We've talked about a couple ideas internally, like adding the ability > to timeout transactions, abort commits or start_transactions which are > taking too long, and adding more debugging to get insights into the > state of the filesystem. Unfortunately, since our usage and knowledge > of BtrFS is still somewhat nascent, we're unsure of what is the right > investment. There's a kernel-wide hung task detection, but I think a similar mechanism around just the transaction commits would be useful, as a debugging option. There are number of ways how a transaction can be blocked though, so we'd need to choose the starting point. Extent-related locks, waiting for writes, other locks, the intenral transactional logic (and possibly more). > I'm curious, are other people seeing deadlocks crop up in production > often? How are you going about debugging them, and are there any good > pieces of advice on avoiding these for production workloads? I have seen hangs with kernel 4.9 a while back triggered by a long-running iozone stress test, but 4.8 was not affected, and 4.10+ worked fine again. I don't know if/which btrfs patches the 'canonical 4.8' kernel has, so this might not be related. As for deadlocks (double taken lock, lock inversion), I haven't seen them for a long time. The testing kernels run with lockdep, so we should be able to see them early. You could try to run turn lockdep on if the performance penalty is still acceptable for you. But there are still cases that lockdep does not cover IIRC, due to the higher-level semantics of the various btrfs trees and locking of extent buffers. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs wiki account
On Wed, May 31, 2017 at 01:46:17PM +0200, David Sterba wrote: > On Wed, May 31, 2017 at 07:35:24PM +0800, chou Dai wrote: > > Hi,I wanted to add some content and edit some content to wiki,but I > > made a registration two week ago(Account Name:Dai chou) . Now, I still > > cannot register, through registration request expired already. And > > Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list > > with an appropriate subject to get attention.". So, I come to here. > > thanks. > > Account approved. > > There are 2 wiki admins who can approve the accounts. For the record, I see in the user creating log that Andre responds to the requests within a reasonable time. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs wiki account
On Wed, May 31, 2017 at 07:35:24PM +0800, chou Dai wrote: > Hi,I wanted to add some content and edit some content to wiki,but I > made a registration two week ago(Account Name:Dai chou) . Now, I still > cannot register, through registration request expired already. And > Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list > with an appropriate subject to get attention.". So, I come to here. > thanks. Account approved. There are 2 wiki admins who can approve the accounts. I used to get emails on new requests but this does not seem to work now. I'd have to manually check if there are pending requests, so asking in the mailinglist should be the last option. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Btrfs wiki account
Hi,I wanted to add some content and edit some content to wiki,but I made a registration two week ago(Account Name:Dai chou) . Now, I still cannot register, through registration request expired already. And Btrfs wiki says"Just drop a mail on linux kernel btrfs mailing list with an appropriate subject to get attention.". So, I come to here. thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: fix invalid extent maps due to hole punching
From: Filipe MananaWhile punching a hole in a range that is not aligned with the sector size (currently the same as the page size) we can end up leaving an extent map in memory with a length that is smaller then the sector size, which is not expected and can lead to problems. This issue is easily detected after the patch from commit a7e3b975a0f9 ("Btrfs: fix reported number of inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the following for example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo $ xfs_io -c "fpunch 60K 90K" /mnt/foo $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo $ umount /mnt After the unmount operation we can see several warnings emmitted due to underflows related to space reservation counters: [ 2837.443299] [ cut here ] [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs] [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: GW 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.462379] Call Trace: [ 2837.462379] dump_stack+0x68/0x92 [ 2837.462379] __warn+0xc2/0xdd [ 2837.462379] warn_slowpath_null+0x1d/0x1f [ 2837.462379] btrfs_destroy_inode+0xe8/0x27e [btrfs] [ 2837.462379] destroy_inode+0x3d/0x55 [ 2837.462379] evict+0x177/0x17e [ 2837.462379] dispose_list+0x50/0x71 [ 2837.462379] evict_inodes+0x132/0x141 [ 2837.462379] generic_shutdown_super+0x3f/0xeb [ 2837.462379] kill_anon_super+0x12/0x1c [ 2837.462379] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.462379] deactivate_locked_super+0x30/0x68 [ 2837.462379] deactivate_super+0x36/0x39 [ 2837.462379] cleanup_mnt+0x58/0x76 [ 2837.462379] __cleanup_mnt+0x12/0x14 [ 2837.462379] task_work_run+0x77/0x9b [ 2837.462379] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.462379] syscall_return_slowpath+0x196/0x1b9 [ 2837.462379] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7 [ 2837.462379] RSP: 002b:7ffdd0d8de58 EFLAGS: 0246 ORIG_RAX: 00a6 [ 2837.462379] RAX: RBX: 556f76a39060 RCX: 7f3ef3e6b9a7 [ 2837.462379] RDX: 0001 RSI: RDI: 556f76a3f910 [ 2837.462379] RBP: 556f76a3f910 R08: 556f76a3e670 R09: 0015 [ 2837.462379] R10: 06b4 R11: 0246 R12: 7f3ef436ce64 [ 2837.462379] R13: R14: 556f76a39240 R15: 7ffdd0d8e0e0 [ 2837.519355] ---[ end trace e79345fe24b30b8d ]--- [ 2837.596256] [ cut here ] [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs] [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: GW 4.10.0-rc8-btrfs-next-43+ #1 [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014 [ 2837.663359] Call Trace: [ 2837.663359] dump_stack+0x68/0x92 [ 2837.663359] __warn+0xc2/0xdd [ 2837.663359] warn_slowpath_null+0x1d/0x1f [ 2837.663359] btrfs_free_block_groups+0x246/0x3eb [btrfs] [ 2837.663359] close_ctree+0x1dd/0x2e1 [btrfs] [ 2837.663359] ? evict_inodes+0x132/0x141 [ 2837.663359] btrfs_put_super+0x15/0x17 [btrfs] [ 2837.663359] generic_shutdown_super+0x6a/0xeb [ 2837.663359] kill_anon_super+0x12/0x1c [ 2837.663359] btrfs_kill_super+0x16/0x21 [btrfs] [ 2837.663359] deactivate_locked_super+0x30/0x68 [ 2837.663359] deactivate_super+0x36/0x39 [ 2837.663359] cleanup_mnt+0x58/0x76 [ 2837.663359] __cleanup_mnt+0x12/0x14 [ 2837.663359] task_work_run+0x77/0x9b [ 2837.663359] prepare_exit_to_usermode+0x9d/0xc5 [ 2837.663359] syscall_return_slowpath+0x196/0x1b9 [ 2837.663359] entry_SYSCALL_64_fastpath+0xab/0xad [ 2837.663359] RIP:
Re: [PATCH] Btrfs: fix invalid extent maps due to hole punching
On Tue, May 30, 2017 at 9:50 PM, Omar Sandovalwrote: > On Sun, May 28, 2017 at 05:31:53PM +0100, fdman...@kernel.org wrote: >> From: Filipe Manana > > [snip] > > Hey, Filipe, > > I saw this warning and tried to apply your patch, but it doesn't apply > cleanly (seems to conflict with 9986277e0e4c ("Btrfs: handle only > applicable errors returned by btrfs_get_extent")). Yes, it was based on an older branch. I'll rebase it. Thanks. > >> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a >> already existed hole.") >> Cc: >> Signed-off-by: Filipe Manana >> --- >> fs/btrfs/file.c | 4 +++- >> 1 file changed, 3 insertions(+), 1 deletion(-) >> >> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c >> index f7d022bc7998..2645d820422c 100644 >> --- a/fs/btrfs/file.c >> +++ b/fs/btrfs/file.c >> @@ -2390,10 +2390,12 @@ static int fill_holes(struct btrfs_trans_handle >> *trans, >> */ >> static int find_first_non_hole(struct inode *inode, u64 *start, u64 *len) >> { >> + struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); >> struct extent_map *em; >> int ret = 0; >> >> - em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, *start, *len, 0); >> + em = btrfs_get_extent(BTRFS_I(inode), NULL, 0, *start, >> + round_up(*len, fs_info->sectorsize), 0); >> if (IS_ERR_OR_NULL(em)) { >> if (!em) >> ret = -ENOMEM; >> -- >> 2.11.0 >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 09/10] xfs: nowait aio support
On Tue 30-05-17 11:13:29, Goldwyn Rodrigues wrote: > > Btw, can you write a small blurb up for the man page to document these > > Ñ•emantics in man-page like language? > > > > Yes, but which man page would it belong to? > Should it be a subsection of errors in io_getevents/io_submit. We don't > want to add ERRORS to io_getevents() because it would be the return > value of the io_getevents call, and not the ones in the iocb structure. > Should it be a new man page, say for iocb(7/8)? I think you should extend the manpage for io_submit(8). There you can add definition of struct iocb in 'DESCRIPTION' section explaining at least the most common fields. You can also explain there which flags can be passed and what are they intended to do. You can also expand EAGAIN error description to specifically mention that in case of NOWAIT aio EAGAIN can be returned if io submission would block. Honza -- Jan KaraSUSE Labs, CR -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH] btrfs: qgroup: Fix hang when using inode_cache and qgroup
Commit 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT") is causing hang, with the following backtrace: Call Trace: __schedule+0x374/0xaf0 schedule+0x3d/0x90 wait_for_commit+0x4a/0x80 [btrfs] ? wake_atomic_t_function+0x60/0x60 btrfs_commit_transaction+0xe0/0xa10 [btrfs] <<< Here ? start_transaction+0xad/0x510 [btrfs] qgroup_reserve+0x1f0/0x350 [btrfs] btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs] ? _raw_spin_unlock+0x27/0x40 btrfs_check_data_free_space+0x6d/0xb0 [btrfs] btrfs_delalloc_reserve_space+0x25/0x70 [btrfs] btrfs_save_ino_cache+0x402/0x650 [btrfs] commit_fs_roots+0xb7/0x170 [btrfs] btrfs_commit_transaction+0x425/0xa10 [btrfs] <<< And here qgroup_reserve+0x1f0/0x350 [btrfs] btrfs_qgroup_reserve_data+0xf8/0x2f0 [btrfs] ? _raw_spin_unlock+0x27/0x40 btrfs_check_data_free_space+0x6d/0xb0 [btrfs] btrfs_delalloc_reserve_space+0x25/0x70 [btrfs] btrfs_direct_IO+0x1c5/0x3b0 [btrfs] generic_file_direct_write+0xab/0x150 btrfs_file_write_iter+0x243/0x530 [btrfs] __vfs_write+0xc9/0x120 vfs_write+0xcb/0x1f0 SyS_pwrite64+0x79/0x90 entry_SYSCALL_64_fastpath+0x18/0xad The problem is that, inode_cache will be written in commit_fs_roots(), which is called in btrfs_commit_transaction(). And when it fails to reserve enough data space, qgroup_reserve() will try to call btrfs_commit_transaction() again, then we are waiting for ourselves. The patch will introduce can_retry parameter for qgroup_reserve(), allowing related callers to avoid deadly commit transaction deadlock. Now for space cache inode, we will not allow qgroup retry, so it will not cause deadlock. Fixes: 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT") Cc: Goldwyn RodriguesSigned-off-by: Qu Wenruo --- Commit 48a89bc4f2ce ("btrfs: qgroups: Retry after commit on getting EDQUOT") is not only causing such deadlock, but also screwing up qgroup reserved space for even generic test cases. I'm afraid we may need to revert that commit if we can't find a good way to fix the newly caused qgroup meta reserved space underflow. (Unlike old bug which is qgroup data reserved space underflow, this time the commit is causing new metadata space underflow). --- fs/btrfs/extent-tree.c | 7 +-- fs/btrfs/qgroup.c | 15 ++- fs/btrfs/qgroup.h | 2 +- fs/btrfs/transaction.c | 2 +- 4 files changed, 17 insertions(+), 9 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index e390451c72e6..15e7fcf3a7ab 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -5817,7 +5817,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root, if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) { /* One for parent inode, two for dir entries */ num_bytes = 3 * fs_info->nodesize; - ret = btrfs_qgroup_reserve_meta(root, num_bytes, true); + ret = btrfs_qgroup_reserve_meta(root, num_bytes, true, true); if (ret) return ret; } else { @@ -5945,6 +5945,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) u64 to_free = 0; unsigned dropped; bool release_extra = false; + bool can_qgroup_retry = true; /* If we are a free space inode we need to not flush since we will be in * the middle of a transaction commit. We also don't need the delalloc @@ -5957,6 +5958,7 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) if (btrfs_is_free_space_inode(inode)) { flush = BTRFS_RESERVE_NO_FLUSH; delalloc_lock = false; + can_qgroup_retry = false; } else if (current->journal_info) { flush = BTRFS_RESERVE_FLUSH_LIMIT; } @@ -5987,7 +5989,8 @@ int btrfs_delalloc_reserve_metadata(struct btrfs_inode *inode, u64 num_bytes) if (test_bit(BTRFS_FS_QUOTA_ENABLED, _info->flags)) { ret = btrfs_qgroup_reserve_meta(root, - nr_extents * fs_info->nodesize, true); + nr_extents * fs_info->nodesize, true, + can_qgroup_retry); if (ret) goto out_fail; } diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c index deffbeb74a0b..3859acb03ae9 100644 --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@ -2322,7 +2322,8 @@ static bool qgroup_check_limits(const struct btrfs_qgroup *qg, u64 num_bytes) return true; } -static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce) +static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool enforce, + bool can_retry) { struct btrfs_root *quota_root; struct btrfs_qgroup *qgroup; @@ -2364,7 +2365,8 @@ static int qgroup_reserve(struct btrfs_root *root, u64 num_bytes, bool
Re: Debugging Deadlocks?
Sargun Dhillon posted on Tue, 30 May 2017 09:12:39 -0700 as excerpted: > We've been running BtrFS for a couple months now in production on > several clusters. We're running on Canonical's 4.8 kernel, and > currently, in the process of moving to our own patchset atop vanilla > 4.10+. I'm glad to say it's been a fairly good experience for us. Bar > some performance issues, it's been largely smooth sailing. > > There has been one class of persistent issues that has been plaguing our > cluster is deadlocks. Being just a list regular and btrfs (personal) user, not a dev or big- time production user, I can't say I've seen a deadlocks problem either here or reported in significant numbers on-list, but beyond that I can't help there. I'm replying, however, regarding your kernel choices. Good for getting off kernel 4.8, as in mainline kernel terms that's only a short-term stable release and support has now ended. But I'm slightly concerned with your kernel 4.10+ choice on production clusters. 4.9 is the most recent mainline and therefore btrfs LTS kernel series, and as such, what I was expecting. Now don't get me wrong, 4.10+ is appropriate ATM as well, and if you're planning to stay current, within the 2-latest-current-kernel-cycles list recommendation, I'd consider it preferred. However, most large-scale production deployments tend to prefer a somewhat slower upgrade cycle than that, in which case 4.9 is preferred as the latest mainline LTS series. As far as LTS series go, this list tries to support the latest two LTS series, as it does the latest two current stable series. While that's rather shorter than the LTS series support in general, it's in keeping with the fact that btrfs remains still stabilizing and as such under heavy development, tho it's far more stable than it was back in the kernel 3.x or early 4.x era. At present that means 4.9 and the previous 4.4, altho in practice 4.4 was long enough ago that we prefer 4.9 unless there's some definite reason it's not going to work for you. But you're not talking as old as 4.4 in any case, so it's a question of 4.9 LTS and staying with that series for awhile, or 4.10+, but upgrading every 10 weeks or so as a new kernel series is released and the second- back, now 4.10 as 4.11 is the newest, becomes the third back and thus slips out of both mainline stable release and btrfs list primary support range. If you're comfortable with a ten-week upgrade cycle on the scale you're running in production, then by all means, go 4.10 or 4.11 at this point and do the upgrades, as that's preferable here for those where it's acceptable, but if not, then I'd strongly recommend the 4.9 LTS series for now, and upgrading LTS kernel series once a year or whatever, after the next LTS series comes out and has had a release or two to shake out the early bugs. OTOH if there's something you really need 4.10 for but would otherwise prefer LTS, then yes, go current now and try to do the 10-week cycle, until the next LTS, then if desired stick with it and drop back to annual or whatever LTS series upgrades. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html