[PATCH] Btrfs: fix race deleting block group from space_info->ro_bgs list
When removing a block group we were deleting it from its space_info's ro_bgs list, using list_del_init, without any synchronization. Fix this by doing the list delete while holding the space_info and block group spinlocks.

This issue was introduced in the 3.19 kernel by the following change:

    Btrfs: move read only block groups onto their own list V2
    commit 633c0aad4c0243a506a3e8590551085ad78af82d

I ran into a kernel crash while a block group was being removed; another task was executing statfs in parallel (iterating the space_info->ro_bgs list) and another task was setting a block group to read-only mode (which adds it to the list space_info->ro_bgs). This happened while running the stress test xfstests/generic/038 I recently made.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/extent-tree.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 5a45253..09145ac 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9424,7 +9424,6 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	 * are still on the list after taking the semaphore
 	 */
 	list_del_init(&block_group->list);
-	list_del_init(&block_group->ro_list);
 	if (list_empty(&block_group->space_info->block_groups[index])) {
 		kobj = block_group->space_info->block_group_kobjs[index];
 		block_group->space_info->block_group_kobjs[index] = NULL;
@@ -9466,6 +9465,9 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	btrfs_remove_free_space_cache(block_group);

 	spin_lock(&block_group->space_info->lock);
+	spin_lock(&block_group->lock);
+	list_del_init(&block_group->ro_list);
+	spin_unlock(&block_group->lock);
 	block_group->space_info->total_bytes -= block_group->key.offset;
 	block_group->space_info->bytes_readonly -= block_group->key.offset;
 	block_group->space_info->disk_total -= block_group->key.offset * factor;
--
2.1.3
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
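The locking pattern the patch establishes — unlink a node only while holding both the list owner's lock and the node's own lock — can be sketched in userspace. The sketch below is illustrative only: it substitutes a toy atomic-flag spinlock for the kernel's spinlock_t and minimal stand-in structs (space_info, block_group, the list helpers) for the real btrfs types; none of this is btrfs code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* A toy spinlock standing in for the kernel's spinlock_t. */
typedef atomic_flag spinlock_t;
void spin_init(spinlock_t *l)   { atomic_flag_clear(l); }
void spin_lock(spinlock_t *l)   { while (atomic_flag_test_and_set(l)) { } }
void spin_unlock(spinlock_t *l) { atomic_flag_clear(l); }

/* Minimal circular doubly-linked list in the style of the kernel's list.h. */
struct list_head { struct list_head *next, *prev; };

void list_init(struct list_head *h) { h->next = h->prev = h; }

void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

/* Unlink and re-initialize, so a second call is harmless (as in the kernel). */
void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	list_init(n);
}

int list_empty(const struct list_head *h) { return h->next == h; }

/* Illustrative stand-ins for btrfs_space_info and the block group cache. */
struct space_info {
	spinlock_t lock;		/* protects ro_bgs membership */
	struct list_head ro_bgs;
};

struct block_group {
	spinlock_t lock;		/* protects this group's ro state */
	struct space_info *space_info;
	struct list_head ro_list;
};

void space_info_init(struct space_info *si)
{
	spin_init(&si->lock);
	list_init(&si->ro_bgs);
}

void block_group_init(struct block_group *bg, struct space_info *si)
{
	spin_init(&bg->lock);
	bg->space_info = si;
	list_init(&bg->ro_list);
}

/* Setting read-only adds the group to ro_bgs under both locks... */
void mark_read_only(struct block_group *bg)
{
	spin_lock(&bg->space_info->lock);
	spin_lock(&bg->lock);
	list_add(&bg->ro_list, &bg->space_info->ro_bgs);
	spin_unlock(&bg->lock);
	spin_unlock(&bg->space_info->lock);
}

/* ...and removal unlinks it under the same two locks, in the same order,
 * which is the essence of the fix (the racy code took no lock at all). */
void remove_block_group(struct block_group *bg)
{
	spin_lock(&bg->space_info->lock);
	spin_lock(&bg->lock);
	list_del_init(&bg->ro_list);
	spin_unlock(&bg->lock);
	spin_unlock(&bg->space_info->lock);
}
```

With this ordering, a reader that takes the space_info lock before walking ro_bgs can never observe a half-unlinked node, which is the statfs race the patch closes.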
Re: btrfs scrub status reports not running when it is
On Thu, Jan 15, 2015 at 12:24:41PM +0100, David Sterba wrote:
> On Wed, Jan 14, 2015 at 02:27:17PM -0800, Zach Brown wrote:
> > On Wed, Jan 14, 2015 at 04:06:02PM -0500, Sandy McArthur Jr wrote:
> > > Sometimes "btrfs scrub status" reports that it is not running when it
> > > still is. I think this is a cosmetic bug. And I believe this is
> > > related to the scrub completing on some drives before others in a
> > > multi-drive btrfs filesystem that is not well balanced.
> >
> > Boy, I don't really know this code, but it looks like:
> >
> > 	if (ss->in_progress)
> > 		printf(", running for %llu seconds\n", ss->duration);
> > 	else
> > 		printf(", interrupted after %llu seconds, not running\n",
> > 		       ss->duration);
> >
> > 	in_progress = is_scrub_running_in_kernel(fdmnt, di_args,
> > 						 fi_args.num_devices);
> >
> > 	static int is_scrub_running_in_kernel(int fd,
> > 			struct btrfs_ioctl_dev_info_args *di_args,
> > 			u64 max_devices)
> > 	{
> > 		struct scrub_progress sp;
> > 		int i;
> > 		int ret;
> >
> > 		for (i = 0; i < max_devices; i++) {
> > 			memset(&sp, 0, sizeof(sp));
> > 			sp.scrub_args.devid = di_args[i].devid;
> > 			ret = ioctl(fd, BTRFS_IOC_SCRUB_PROGRESS, &sp.scrub_args);
> > 			if (ret < 0 && errno == ENODEV)
> > 				continue;
> > 			if (ret < 0 && errno == ENOTCONN)
> > 				return 0;
> >
> > It says that scrub isn't running if any devices have completed. If you
> > drop all those ret < 0 conditional branches that are either noops or
> > wrong, does it work like you'd expect?
>
> Why wrong? The ioctl callback returns -ENODEV or -ENOTCONN that get
> translated to the errno values and ioctl(...) returns -1 in both cases.

Wrong because returning 0 on the first ENOTCONN, instead of continuing to find more devices which might still be scrubbing, leads to this confusing status message.

That's my working theory having spent 15 seconds reading code. I would not be surprised at all if I'm missing something here.

- z
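To make the suspected logic error concrete, the loop's decision can be modeled as a pure function over per-device states (the enum and function names below are illustrative stand-ins, not btrfs-progs code): ENODEV maps to a missing device that is skipped, ENOTCONN to a device whose scrub has finished. The buggy shape bails out with "not running" on the first finished device; the proposed shape keeps scanning for any device that is still scrubbing.

```c
#include <assert.h>
#include <stddef.h>

/* Per-device outcomes of the BTRFS_IOC_SCRUB_PROGRESS probe, modeled as
 * plain states.  All names here are illustrative. */
enum dev_scrub_state {
	SCRUB_RUNNING,	/* ioctl succeeds: scrub in progress on this device */
	SCRUB_FINISHED,	/* ioctl fails with ENOTCONN: scrub done here */
	SCRUB_MISSING,	/* ioctl fails with ENODEV: device not present */
};

/* Shape of the reported behavior: the first finished device ends the
 * scan and the whole scrub is reported as not running. */
int is_running_buggy(const enum dev_scrub_state *devs, size_t n)
{
	int running = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (devs[i] == SCRUB_MISSING)
			continue;		/* ENODEV: skip */
		if (devs[i] == SCRUB_FINISHED)
			return 0;		/* ENOTCONN: bails out early */
		running = 1;
	}
	return running;
}

/* Proposed semantics: the scrub is running if *any* device is still
 * being scrubbed, no matter how many have already finished. */
int is_running_fixed(const enum dev_scrub_state *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (devs[i] == SCRUB_RUNNING)
			return 1;
	return 0;
}
```

With a two-device filesystem where the first device finished early, the buggy scan answers "not running" while a scrub is still active on the second device — the confusing status from the report above.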
Kernel bug in 3.19-rc4
Hi,

I just started some btrfs stress testing on latest linux kernel 3.19-rc4: A few hours later, filesystem stopped working - the kernel bug report can be found below.

The test consists of one massive IO thread (writing 100GB files with dd), and 2 tar instances extracting kernel sources and deleting them afterwards (I can provide the simple bash script doing this, if needed).

System information (Ubuntu 14.04.1, latest kernel):

root@thunder # uname -a
Linux thunder 3.19.0-rc4-custom #1 SMP Mon Jan 12 16:13:44 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
root@thunder # /root/btrfs-progs/btrfs --version
Btrfs v3.18-36-g0173148

Tests are done on 14 SCSI disks, using raid6 for data and metadata:

root@thunder # /root/btrfs-progs/btrfs fi show
Label: 'raid6'  uuid: cbe34d2b-5f75-46cf-9263-9813028ebc19
	Total devices 14 FS bytes used 674.62GiB
	devid  1 size 279.39GiB used 59.24GiB path /dev/cciss/c1d0
	devid  2 size 279.39GiB used 59.22GiB path /dev/cciss/c1d1
	devid  3 size 279.39GiB used 59.22GiB path /dev/cciss/c1d10
	devid  4 size 279.39GiB used 59.22GiB path /dev/cciss/c1d11
	devid  5 size 279.39GiB used 59.22GiB path /dev/cciss/c1d12
	devid  6 size 279.39GiB used 59.22GiB path /dev/cciss/c1d13
	devid  7 size 279.39GiB used 59.22GiB path /dev/cciss/c1d2
	devid  8 size 279.39GiB used 59.22GiB path /dev/cciss/c1d3
	devid  9 size 279.39GiB used 59.22GiB path /dev/cciss/c1d4
	devid 10 size 279.39GiB used 59.22GiB path /dev/cciss/c1d5
	devid 11 size 279.39GiB used 59.22GiB path /dev/cciss/c1d6
	devid 12 size 279.39GiB used 59.22GiB path /dev/cciss/c1d7
	devid 13 size 279.39GiB used 59.22GiB path /dev/cciss/c1d8
	devid 14 size 279.39GiB used 59.22GiB path /dev/cciss/c1d9
Btrfs v3.18-36-g0173148

# This is provided for completeness only, and is taken
# somewhen *before* the kernel crash occurred, so basic
# setup is the same, but allocated/free sizes won't match
root@thunder # /root/btrfs-progs/btrfs fi df /tmp/m
Data, single: total=8.00MiB, used=0.00B
Data, RAID6: total=727.45GiB, used=697.84GiB
System, single:
total=4.00MiB, used=0.00B System, RAID6: total=13.50MiB, used=64.00KiB Metadata, single: total=8.00MiB, used=0.00B Metadata, RAID6: total=3.43GiB, used=805.91MiB GlobalReserve, single: total=272.00MiB, used=0.00B Here's what happens after some hours of stress testing: [85162.472989] [ cut here ] [85162.473071] kernel BUG at fs/btrfs/inode.c:3142! [85162.473139] invalid opcode: [#1] SMP [85162.473212] Modules linked in: btrfs(E) xor(E) raid6_pq(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) hpwdt(E) amd64_edac_mod(E) kvm(E) edac_core(E) shpchp(E) k8temp(E) serio_raw(E) hpilo(E) edac_mce_amd(E) mac_hid(E) i2c_algo_bit(E) ipmi_si(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) lp(E) fscache(E) parport(E) hid_generic(E) usbhid(E) hid(E) hpsa(E) psmouse(E) bnx2(E) cciss(E) pata_acpi(E) pata_amd(E) [85162.473911] CPU: 4 PID: 3039 Comm: btrfs-cleaner Tainted: G E 3.19.0-rc4-custom #1 [85162.474028] Hardware name: HP ProLiant DL585 G2 , BIOS A07 05/02/2011 [85162.474122] task: 88085b054aa0 ti: 88205ad4c000 task.ti: 88205ad4c000 [85162.474230] RIP: 0010:[a06a8182] [a06a8182] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [85162.474422] RSP: 0018:88205ad4fc48 EFLAGS: 00010286 [85162.474497] RAX: ffe4 RBX: 8810a35d42f8 RCX: 88185b896000 [85162.474595] RDX: 6a54 RSI: 0004 RDI: 88185b896138 [85162.474694] RBP: 88205ad4fc88 R08: 0001e670 R09: 88016194b240 [85162.474793] R10: a06bd797 R11: ea0004f71800 R12: 88185baa2000 [85162.474892] R13: 88085f6d7630 R14: 88185baa2458 R15: 0001 [85162.474992] FS: 7fb3f27fb740() GS:88085fd0() knlGS: [85162.475105] CS: 0010 DS: ES: CR0: 8005003b [85162.475184] CR2: 7f896c02c220 CR3: 00085b328000 CR4: 07e0 [85162.475286] Stack: [85162.475318] 88205ad4fc88 a06e6a14 88185b896b04 88105b03e800 [85162.475442] 88016194b240 8810a35d42f8 881e8ffe9a00 88133dc48ea0 [85162.475561] 88205ad4fd18 a0691a57 88016194b244 88016194b240 [85162.475680] Call Trace: [85162.475738] [a06e6a14] ? 
lookup_free_space_inode+0x44/0x100 [btrfs] [85162.475849] [a0691a57] btrfs_remove_block_group+0x137/0x740 [btrfs] [85162.475964] [a06ca8d2] btrfs_remove_chunk+0x672/0x780 [btrfs] [85162.476065] [a06922bf] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs] [85162.476172] [a0699e0c] cleaner_kthread+0x12c/0x190 [btrfs] [85162.476269] [a0699ce0] ? check_leaf+0x350/0x350 [btrfs] [85162.476355] [8108f8d2] kthread+0xd2/0xf0 [85162.476424] [8108f800] ? kthread_create_on_node+0x180/0x180 [85162.476519] [8177bcbc]
BtrFs on drives with error recovery control / TLER?
Hi,

Can anybody comment on how BtrFs (particularly RAID1 mirroring) interacts with drives that offer error recovery control (or TLER in WDC terms)? I generally prefer to buy this type of drive for any serious data storage purposes.

I notice ZFS gets a mention in the Wikipedia article about the topic: http://en.wikipedia.org/wiki/Error_recovery_control

Should BtrFs be mentioned there too?

Regards,
Daniel
Re: price to pay for nocow file bit?
On Thu, Jan 8, 2015 at 11:53 AM, Lennart Poettering <lenn...@poettering.net> wrote:
> On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:
> > On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
> > > Heya! Currently, systemd-journald's disk access patterns (appending
> > > to the end of files, then updating a few pointers in the front)
> > > result in awfully fragmented journal files on btrfs, which has a
> > > pretty negative effect on performance when accessing them.
> > >
> > > Now, to improve things a bit, I yesterday made a change to journald,
> > > to issue the btrfs defrag ioctl when a journal file is rotated, i.e.
> > > when we know that no further writes will be ever done on the file.
> > >
> > > However, I wonder now if I should go one step further even, and use
> > > the equivalent of "chattr -C" (i.e. nocow) on all journal files. I am
> > > wondering what price I would precisely have to pay for that. Judging
> > > by this earlier thread:
> > > http://www.spinics.net/lists/linux-btrfs/msg33134.html
> > > it's mostly about data integrity, which is something I can live with,
> > > given the conservative write patterns of journald, and the fact that
> > > we do our own checksumming and careful data validation. I mean, if
> > > btrfs in this mode provides no worse data integrity semantics than
> > > ext4 I am fully fine with losing this feature for these files.
> >
> > This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.
>
> We already use fallocate(), but this is not enough on cow file systems.
> With fallocate() you can certainly improve fragmentation when appending
> things to a file. But on a COW file system this will help little if we
> change things in the beginning of the file, since COW means that it will
> then make a copy of those blocks and alter the copy, but leave the
> original version unmodified. And if we do that all the time the files
> get heavily fragmented, even though all the blocks we modify have been
> fallocate()d initially...
> > This would work on ext4, xfs, and others, and provide the same benefit
> > (or even better) without filesystem-specific code. journald would
> > preallocate a contiguous chunk past the end of the file for appends, and
>
> That's precisely what we do. But journald's write pattern is not purely
> appending to files, it's append something to the end, then link it up in
> the beginning. And for the append part we are fine with fallocate().
> It's the link up part that completely fucks up fragmentation so far.

I think a per-file autodefrag flag would help a lot here. We've made some improvements for autodefrag and slowly growing log files because we noticed that compression ratios on slowly growing files really weren't very good. The problem was we'd never have more than a single block to compress, so the compression code would give up and write the raw data. Compression + autodefrag on the other hand would take 64-128K and recow it down, giving very good results.

The second problem we hit was with stable page writes. If bdflush decides to write the last block in the file, it's really a wasted IO unless the block is fully filled. We've been experimenting with a patch to leave the last block out of writepages unless it's an fsync/O_SYNC.

I'll code up the per-file autodefrag, we've hit a few use cases that make sense.

-chris
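The point above — fallocate() keeps appends contiguous but cannot keep in-place updates contiguous on a COW filesystem — can be demonstrated with a toy model. Nothing below touches a real filesystem: a "file" is just an array of physical extent numbers, one per block, and a COW overwrite allocates the next free extent for the touched block.

```c
#include <assert.h>
#include <stddef.h>

/* Number of fragments = number of runs of physically consecutive blocks. */
size_t count_fragments(const long *extents, size_t nblocks)
{
	size_t frags, i;

	if (nblocks == 0)
		return 0;
	frags = 1;
	for (i = 1; i < nblocks; i++)
		if (extents[i] != extents[i - 1] + 1)
			frags++;
	return frags;
}

/* A COW overwrite never reuses the old location: the touched block gets
 * the next free extent, no matter how the file was preallocated. */
void cow_overwrite(long *extents, size_t block, long *next_free)
{
	extents[block] = (*next_free)++;
}
```

Start with eight contiguous blocks (one fragment, the fallocate() ideal); a single journald-style "link it up in the beginning" write to block 0 splits the file into two fragments, and each update of a different block adds more — which is what a per-file autodefrag would then clean up.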
btrfs command completion Was: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes
David Sterba posted on Thu, 15 Jan 2015 13:05:46 +0100 as excerpted:

> A shell completion would be great of course, it's in the project ideas.
> There's a starting point
> http://www.spinics.net/lists/linux-btrfs/msg15899.html .

FWIW, in case anyone is interested...

What I did here is a bit different; shell completion would be better, but while I know bash, I don't know bash/shell completion so I couldn't write that without learning it.

What I did is a btrfs wrapper script (which I simply called 'b', a previously here-unused single letter command) that, based on the first parameter or two and the number of parameters, decides if what's there matches a valid btrfs subcommand and whether it looks complete or not, and if it's valid but incomplete, echoes the appropriate btrfs sub help command and prompts for more input.

So just 'b' gives me:

b: btrfs helper (common commands only)
btrfs cmd (Just Enter for help):
b(alance) c(heck) d(ev) f(ilesystem) i(nspect) p(roperty) qg(roup) qu(ota) rec(eive) rep(lace) resc(ue) rest(ore) sc(rub) se(nd) su(bvolume) v(ersion):

There it waits for further input. If I then add an 'f', or if I had typed 'b f' (or 'b fi' or 'b filesystem'; the script checks the status of "btrfs cmd --help" to see if it's valid or not), I get this:

f
btrfs filesystem action (Just Enter for help):
de(frag) df l(abel) r(esize) sh(ow) sy(nc) u(sage):

Further input. If I then add 'df' (or if I had typed 'b fi df'), I get:

df
usage: btrfs filesystem df [options] <path>

Show space usage information for a mount point

    -b|--raw        raw numbers in bytes
    -h              human friendly numbers, base 1024 (default)
    -H              human friendly numbers, base 1000
    --iec           use 1024 as a base (KiB, MiB, GiB, TiB)
    --si            use 1000 as a base (kB, MB, GB, TB)
    -k|--kbytes     show sizes in KiB, or kB with --si
    -m|--mbytes     show sizes in MiB, or MB with --si
    -g|--gbytes     show sizes in GiB, or GB with --si
    -t|--tbytes     show sizes in TiB, or TB with --si

btrfs filesystem df usage is printed above.
Please enter additional parameters here:

Further input. At this point it doesn't check further input, instead simply echoing back what will be the final command and prompting whether to run it or not. If I simply type '/'...

/
btrfs filesystem df /
Final check: OK to run above command (y/N)?

Note the default to N... If at that point I simply enter (or if I hit anything else besides y/Y), of course it doesn't run the command, it simply exits. However, the built command was printed above, making it simple enough to select/paste (assuming gpm or a terminal window in X, thus mouse selection). If I hit 'y', it executes the command.

Meanwhile, if there's more than two parameters and the first two validate, the script assumes the user knows what they are doing and simply executes it as-is.

b fi df /
b: btrfs helper (common commands only)
Data, RAID1: total=3.00GiB, used=1.74GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=768.00MiB, used=299.89MiB
GlobalReserve, single: total=48.00MiB, used=0.00B

b fi df invalid
b: btrfs helper (common commands only)
ERROR: can't access 'invalid'

b fxxx df /

This one's interesting. The script checks the status of "btrfs fxxx --help" and determines that btrfs doesn't consider the fxxx valid, so it simply prints the full btrfs --help output, running it thru $PAGER (if unset, less if it's executable, else no pager) due to length.

Anyone interested in a script such as this? Absent a bash completion script, I found it /tremendously/ helpful with btrfs command basics, and because it prompts with the final command before execution, it helps in learning the commands as well. I still use it for commands I don't use frequently enough to have memorized.

While it's a bit of a hack, thus my not posting it previously, if anyone else would find such a script useful, I could post it. I don't have hosting for it but I suppose it could go on the wiki if enough other folks find it useful. It's 125 lines including comments ATM.
-- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
filesystem corruption/unable to scrub
Hello, I am having trouble with my btrfs setup. An unwanted reset probably caused the corruption. I can mount the filesystem, but cannot perform scrub as this ends with GPF. uname -a Linux sysresccd 3.14.24-alt441-amd64 #2 SMP Sun Nov 16 08:27:16 UTC 2014 x86_64 AMD Phenom(tm) II X4 965 Processor AuthenticAMD GNU/Linux btrfs --version Btrfs v3.17.1 btrfs fi show Label: 'suc_storage' uuid: 76bf605a-936b-4fce-8a74-1eb2c750f51c Total devices 2 FS bytes used 2.57TiB devid1 size 3.64TiB used 3.63TiB path /dev/sda1 devid2 size 3.64TiB used 3.63TiB path /dev/sdb1 btrfs fi df /home # Replace /home with the mount point of your btrfs-filesystem Data, RAID1: total=3.63TiB, used=2.56TiB System, RAID1: total=32.00MiB, used=532.00KiB Metadata, RAID1: total=7.00GiB, used=5.83GiB dmesg dmesg.log [14408.258515] general protection fault: [#1] SMP [14408.258520] Modules linked in: ppdev microcode parport_pc parport acpi_cpufreq serio_raw edac_core edac_mce_amd k10temp sp5100_tco i2c_piix4 shpchp raid10 raid456 async_raid6_recov async_pq async_xor async_memcpy async_tx raid1 raid0 multipath linear ata_generic pata_acpi usb_storage nouveau firewire_ohci firewire_core pata_atiixp ttm drm_kms_helper drm r8169 mii i2c_algo_bit i2c_core mxm_wmi video wmi [14408.258543] CPU: 3 PID: 3100 Comm: btrfs-scrub-2 Not tainted 3.14.24-alt441-amd64 #2 [14408.258546] Hardware name: Gigabyte Technology Co., Ltd. 
GA-MA785GT-UD3H/GA-MA785GT-UD3H, BIOS F8 05/25/2010 [14408.258549] task: 88007158a600 ti: 88011b9e4000 task.ti: 88011b9e4000 [14408.258551] RIP: 0010:[81423e91] [81423e91] scrub_bio_end_io_worker+0xb6/0x5fb [14408.258558] RSP: :88011b9e5d28 EFLAGS: 00010287 [14408.258560] RAX: 8800779ace00 RBX: fffe880118114e40 RCX: 8800b79dd800 [14408.258562] RDX: 8800b79dd8a8 RSI: 0001 RDI: 0014 [14408.258564] RBP: 88011b9e5df8 R08: ea0004604518 R09: 88011b9e5ce8 [14408.258566] R10: 81420d8a R11: 88010001 R12: 8800779ac100 [14408.258568] R13: R14: 8800779ac100 R15: [14408.258570] FS: () GS:88013fcc() knlGS:f75ccb40 [14408.258572] CS: 0010 DS: ES: CR0: 8005003b [14408.258574] CR2: f77c6000 CR3: 3ee94000 CR4: 07e0 [14408.258575] Stack: [14408.258577] 88011b9e5d88 88011b9e5da8 880139ab7200 8801 [14408.258581] 880118114f00 8800b79dd940 8800b7fa1800 000100d8f612 [14408.258583] 8800b79dd8a8 00158107d8ca 8800b79dd800 8800b7fa1800 [14408.258586] Call Trace: [14408.258592] [817b73c1] ? schedule_timeout+0xa1/0xbd [14408.258596] [8107d3aa] ? lock_timer_base+0x4d/0x4d [14408.258600] [81401d95] worker_loop+0x194/0x527 [14408.258604] [81401c01] ? btrfs_queue_worker+0x239/0x239 [14408.258607] [8108eb43] kthread+0xc9/0xd1 [14408.258611] [8108ea7a] ? kthread_freezable_should_stop+0x60/0x60 [14408.258614] [817c11cc] ret_from_fork+0x7c/0xb0 [14408.258617] [8108ea7a] ? kthread_freezable_should_stop+0x60/0x60 [14408.258618] Code: 02 48 8b 09 80 a1 98 00 00 00 fb 48 8b 4d 80 48 83 c2 08 3b 81 38 01 00 00 7c dc eb 9c 48 8b 95 70 ff ff ff 48 8b 42 38 48 8b 18 f0 ff 8b 84 00 00 00 0f 94 c0 84 c0 0f 84 2c 04 00 00 f6 83 98 [14408.258639] RIP [81423e91] scrub_bio_end_io_worker+0xb6/0x5fb [14408.258642] RSP 88011b9e5d28 [14408.258645] ---[ end trace 168b2c0c1e0d1fcc ]--- Running btrfsck with --repair ends also with errors. ... Device extent[2, 1522566955008, 1073741824] didn't find its device. Device extent[2, 1523640696832, 1073741824] didn't find its device. 
Device extent[2, 1524714438656, 1073741824] didn't find its device.
Device extent[2, 1525788180480, 1073741824] didn't find its device.
Device extent[2, 1526861922304, 1073741824] didn't find its device.
Device extent[2, 1527935664128, 1073741824] didn't find its device.
Errors found in extent allocation tree or chunk allocation
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
bad key ordering 1 2
Deleting bad dir index [618631,96,57124] root 270
volumes.c:978: btrfs_alloc_chunk: Assertion `ret` failed.
btrfs check[0x8083f7d]
btrfs check[0x8087624]
btrfs check[0x807d132]
btrfs check[0x807d4fc]
btrfs check[0x807dfc4]
btrfs check[0x80710d7]
btrfs check[0x8071774]
btrfs check[0x80737c0]
btrfs check[0x808060a]
btrfs check[0x805f56d]
btrfs check[0x8061f82]
btrfs check[0x8067797]
btrfs check[0x804afd9]
/lib/libc.so.6(__libc_start_main+0xe6)[0xf7619346]
btrfs check[0x804ac31]

Could you please advise how I could correct the current state?

Thank you in advance,
Pavol
Re: BtrFs on drives with error recovery control / TLER?
Daniel Pocock posted on Thu, 15 Jan 2015 20:54:10 +0100 as excerpted:

> Can anybody comment on how BtrFs (particularly RAID1 mirroring)
> interacts with drives that offer error recovery control (or TLER in WDC
> terms)? I generally prefer to buy this type of drive for any serious
> data storage purposes.
>
> I notice ZFS gets a mention in the Wikipedia article about the topic:
> http://en.wikipedia.org/wiki/Error_recovery_control
>
> Should BtrFs be mentioned there too?

I make no claims to being an expert in this area and others with more expertise will likely be along shortly. However...

In general you have a valid worry, and the recommendation is as with other raid technology: if possible, set your device to a recovery time under 30 seconds, since that's the default Linux SCSI-level link reset time. If the device takes longer than that, the link reset short-circuits the recovery process, and the bad sector never gets reported, marked as such, and remapped to a reserve sector on the device.

On consumer-level devices where setting the device recovery time isn't possible, the hard-wired recovery time can be near two minutes, so the recommendation is to set the Linux SCSI-level link reset time to 120 seconds or so, thus allowing the hardware device to time out first, so it can again recognize the bad sector and do its remapping thing.

In general, this recommendation should apply to all Linux-kernel-based soft-raid technologies (including btrfs, mdraid, dmraid...) where the raid redundancy can fill in the missing data, so letting the device fail fast and potentially trigger a remap is the best strategy.

OTOH, the shorter time wouldn't be recommended (tho a longer SCSI reset time well could be) for a single-device btrfs or a multi-device btrfs in raid0 or single mode, because in those cases the assumption is that there are no other copies of the data, so letting the device take up to two minutes to try to retrieve that data, in the hope that the extra tries will finally be successful, can very possibly save that data...
of course at the cost of a system that goes unresponsive for up to two minutes at a time, which clearly isn't going to work if it's happening frequently.

-- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: btrfs_inode_item's otime?
On 15/01/15 21:48, David Sterba wrote:
> Chandan, please drop the btrfs_inode_otime helper and resend. Thanks.

Thanks! Sorry I'd had no further time to look at this, I've been fully committed with $DAY_JOB and on a number of projects with our local community observatory (if anyone is in/visiting Melbourne and into astronomy ping me for details).

All the best,
Chris

-- Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
Re: Kernel bug in 3.19-rc4
Hi,

On 2015/01/16 10:05, Tomasz Chmielewski wrote:
> > I just started some btrfs stress testing on latest linux kernel
> > 3.19-rc4. A few hours later, filesystem stopped working - the kernel
> > bug report can be found below.
>
> Hi, your "kernel BUG at fs/btrfs/inode.c:3142!" from 3.19-rc4
> corresponds to http://marc.info/?l=linux-btrfs&m=141903172106342&w=2 -
> it was "kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!" in
> 3.18.1, and is exactly the same code in both cases:
>
> 	/* grab metadata reservation from transaction handle */
> 	if (reserve) {
> 		ret = btrfs_orphan_reserve_metadata(trans, inode);
> 		BUG_ON(ret); /* -ENOSPC in reservation; Logic error? JDM */
> 	}

Yeah, it's the same.

BTW, I've tried to reproduce this problem by using the way you told me. However, it hasn't reproduced yet.

Thanks,
Satoru
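For context on why that BUG_ON is flagged as a possible logic error rather than proper handling: a reservation that can legitimately fail with -ENOSPC would normally be propagated, so the caller can unwind (e.g. abort the transaction) instead of oopsing. A hedged userspace sketch of the propagating shape follows; the names are illustrative stand-ins, not the btrfs code path.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for btrfs_orphan_reserve_metadata(): a reservation step that
 * may fail with -ENOSPC.  Purely illustrative. */
typedef int (*reserve_fn)(void);

int reserve_ok(void)     { return 0; }
int reserve_enospc(void) { return -ENOSPC; }

/* Propagating shape: hand the error back instead of BUG_ON(ret), so the
 * caller can unwind gracefully rather than crash the machine. */
int orphan_add_propagating(reserve_fn reserve)
{
	int ret = reserve();

	if (ret)
		return ret;	/* where the quoted code does BUG_ON(ret) */
	/* ... insert the orphan item ... */
	return 0;
}
```

Under memory or space pressure the propagating shape turns the crash reported above into an error the transaction machinery can handle.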
Re: Kernel bug in 3.19-rc4
Hi Marcel, On 2015/01/16 4:46, Marcel Ritter wrote: Hi, I just started some btrfs stress testing on latest linux kernel 3.19-rc4: A few hours later, filesystem stopped working - the kernel bug report can be found below. The test consists of one massive IO thread (writing 100GB files with dd), and 2 tar instances extracting kernel sources and deleting them afterwards (I can provide the simple bash script doing this, if needed). Could you give me this script? Thanks, Satoru System information (Ubuntu 14.04.1, latest kernel): root@thunder # uname -a Linux thunder 3.19.0-rc4-custom #1 SMP Mon Jan 12 16:13:44 CET 2015 x86_64 x86_64 x86_64 GNU/Linux root@thunder # /root/btrfs-progs/btrfs --version Btrfs v3.18-36-g0173148 Tests are done on 14 SCSI disks, using raid6 for data and metadata: root@thunder # /root/btrfs-progs/btrfs fi show Label: 'raid6' uuid: cbe34d2b-5f75-46cf-9263-9813028ebc19 Total devices 14 FS bytes used 674.62GiB devid1 size 279.39GiB used 59.24GiB path /dev/cciss/c1d0 devid2 size 279.39GiB used 59.22GiB path /dev/cciss/c1d1 devid3 size 279.39GiB used 59.22GiB path /dev/cciss/c1d10 devid4 size 279.39GiB used 59.22GiB path /dev/cciss/c1d11 devid5 size 279.39GiB used 59.22GiB path /dev/cciss/c1d12 devid6 size 279.39GiB used 59.22GiB path /dev/cciss/c1d13 devid7 size 279.39GiB used 59.22GiB path /dev/cciss/c1d2 devid8 size 279.39GiB used 59.22GiB path /dev/cciss/c1d3 devid9 size 279.39GiB used 59.22GiB path /dev/cciss/c1d4 devid 10 size 279.39GiB used 59.22GiB path /dev/cciss/c1d5 devid 11 size 279.39GiB used 59.22GiB path /dev/cciss/c1d6 devid 12 size 279.39GiB used 59.22GiB path /dev/cciss/c1d7 devid 13 size 279.39GiB used 59.22GiB path /dev/cciss/c1d8 devid 14 size 279.39GiB used 59.22GiB path /dev/cciss/c1d9 Btrfs v3.18-36-g0173148 # This is provided for completeness only, and is taken # somewhen *before* the kernel crash occured, so basic # setup is the same, but allocated/free sizes won't match root@thunder # /root/btrfs-progs/btrfs fi df /tmp/m 
Data, single: total=8.00MiB, used=0.00B Data, RAID6: total=727.45GiB, used=697.84GiB System, single: total=4.00MiB, used=0.00B System, RAID6: total=13.50MiB, used=64.00KiB Metadata, single: total=8.00MiB, used=0.00B Metadata, RAID6: total=3.43GiB, used=805.91MiB GlobalReserve, single: total=272.00MiB, used=0.00B Here's what happens after some hours of stress testing: [85162.472989] [ cut here ] [85162.473071] kernel BUG at fs/btrfs/inode.c:3142! [85162.473139] invalid opcode: [#1] SMP [85162.473212] Modules linked in: btrfs(E) xor(E) raid6_pq(E) radeon(E) ttm(E) drm_kms_helper(E) drm(E) hpwdt(E) amd64_edac_mod(E) kvm(E) edac_core(E) shpchp(E) k8temp(E) serio_raw(E) hpilo(E) edac_mce_amd(E) mac_hid(E) i2c_algo_bit(E) ipmi_si(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) nfs(E) lockd(E) grace(E) sunrpc(E) lp(E) fscache(E) parport(E) hid_generic(E) usbhid(E) hid(E) hpsa(E) psmouse(E) bnx2(E) cciss(E) pata_acpi(E) pata_amd(E) [85162.473911] CPU: 4 PID: 3039 Comm: btrfs-cleaner Tainted: G E 3.19.0-rc4-custom #1 [85162.474028] Hardware name: HP ProLiant DL585 G2 , BIOS A07 05/02/2011 [85162.474122] task: 88085b054aa0 ti: 88205ad4c000 task.ti: 88205ad4c000 [85162.474230] RIP: 0010:[a06a8182] [a06a8182] btrfs_orphan_add+0x1d2/0x1e0 [btrfs] [85162.474422] RSP: 0018:88205ad4fc48 EFLAGS: 00010286 [85162.474497] RAX: ffe4 RBX: 8810a35d42f8 RCX: 88185b896000 [85162.474595] RDX: 6a54 RSI: 0004 RDI: 88185b896138 [85162.474694] RBP: 88205ad4fc88 R08: 0001e670 R09: 88016194b240 [85162.474793] R10: a06bd797 R11: ea0004f71800 R12: 88185baa2000 [85162.474892] R13: 88085f6d7630 R14: 88185baa2458 R15: 0001 [85162.474992] FS: 7fb3f27fb740() GS:88085fd0() knlGS: [85162.475105] CS: 0010 DS: ES: CR0: 8005003b [85162.475184] CR2: 7f896c02c220 CR3: 00085b328000 CR4: 07e0 [85162.475286] Stack: [85162.475318] 88205ad4fc88 a06e6a14 88185b896b04 88105b03e800 [85162.475442] 88016194b240 8810a35d42f8 881e8ffe9a00 88133dc48ea0 [85162.475561] 88205ad4fd18 a0691a57 88016194b244 88016194b240 [85162.475680] Call 
Trace: [85162.475738] [a06e6a14] ? lookup_free_space_inode+0x44/0x100 [btrfs] [85162.475849] [a0691a57] btrfs_remove_block_group+0x137/0x740 [btrfs] [85162.475964] [a06ca8d2] btrfs_remove_chunk+0x672/0x780 [btrfs] [85162.476065] [a06922bf] btrfs_delete_unused_bgs+0x25f/0x280 [btrfs] [85162.476172] [a0699e0c] cleaner_kthread+0x12c/0x190 [btrfs] [85162.476269] [a0699ce0] ? check_leaf+0x350/0x350 [btrfs] [85162.476355] [8108f8d2]
Re: Kernel bug in 3.19-rc4
> I just started some btrfs stress testing on latest linux kernel
> 3.19-rc4. A few hours later, filesystem stopped working - the kernel
> bug report can be found below.

Hi, your "kernel BUG at fs/btrfs/inode.c:3142!" from 3.19-rc4 corresponds to http://marc.info/?l=linux-btrfs&m=141903172106342&w=2 - it was "kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!" in 3.18.1, and is exactly the same code in both cases:

	/* grab metadata reservation from transaction handle */
	if (reserve) {
		ret = btrfs_orphan_reserve_metadata(trans, inode);
		BUG_ON(ret); /* -ENOSPC in reservation; Logic error? JDM */
	}

-- Tomasz Chmielewski http://www.sslrack.com
[PATCH] fstest: btrfs/006: Add extra check on return value and 'fi show' by device
Reported in Red Hat BZ#1181627, 'btrfs fi show' on an unmounted device will return 1 even though no error happened. Introduced by:
commit 2513077f btrfs-progs: fix device missing of btrfs fi show with seed devices
Patch fixing it: https://patchwork.kernel.org/patch/5626001/
btrfs-progs: Fix wrong return value when executing 'fi show' on an unmounted device.

Reported-by: Vratislav Podzimek vpodz...@redhat.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 tests/btrfs/006     | 51 ++++++++++++++++++++++++++++++++++++++++-------
 tests/btrfs/006.out | 10 ++++++++++
 2 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/tests/btrfs/006 b/tests/btrfs/006
index 715fd80..2d8c1c0 100755
--- a/tests/btrfs/006
+++ b/tests/btrfs/006
@@ -62,33 +62,70 @@
 _scratch_pool_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
 
 # These have to be done unmounted...?
 echo "== Set filesystem label to $LABEL"
-$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV $LABEL
+$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV $LABEL || \
+	_fail "set label failed"
 
 echo "== Get filesystem label"
-$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV
+$BTRFS_UTIL_PROG filesystem label $SCRATCH_DEV || \
+	_fail "get label failed"
+
+echo "== Show filesystem by device(offline)"
+$BTRFS_UTIL_PROG filesystem show $FIRST_POOL_DEV | \
+	_filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show filesystem by device(offline) return value wrong"
 
 echo "== Mount."
 _scratch_mount
 
 echo "== Show filesystem by label"
-$BTRFS_UTIL_PROG filesystem show $LABEL | _filter_btrfs_filesystem_show $TOTAL_DEVS
+$BTRFS_UTIL_PROG filesystem show $LABEL | \
+	_filter_btrfs_filesystem_show $TOTAL_DEVS
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show filesystem by label return value wrong"
 UUID=`$BTRFS_UTIL_PROG filesystem show $LABEL | grep uuid: | awk '{print $NF}'`
 echo "UUID $UUID" >> $seqres.full
 
 echo "== Show filesystem by UUID"
-$BTRFS_UTIL_PROG filesystem show $UUID | _filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+$BTRFS_UTIL_PROG filesystem show $UUID | \
+	_filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show filesystem by UUID return value wrong"
+
+echo "== Show filesystem by device(online)"
+$BTRFS_UTIL_PROG filesystem show $FIRST_POOL_DEV | \
+	_filter_btrfs_filesystem_show $TOTAL_DEVS $UUID
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show filesystem by device(online) return value wrong"
 
 echo "== Sync filesystem"
 $BTRFS_UTIL_PROG filesystem sync $SCRATCH_MNT | _filter_scratch
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "sync filesystem failed"
 
 echo "== Show device stats by mountpoint"
-$BTRFS_UTIL_PROG device stats $SCRATCH_MNT | _filter_btrfs_device_stats $TOTAL_DEVS
+$BTRFS_UTIL_PROG device stats $SCRATCH_MNT | \
+	_filter_btrfs_device_stats $TOTAL_DEVS
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show device stats return value wrong"
 
 echo "== Show device stats by first/scratch dev"
 $BTRFS_UTIL_PROG device stats $SCRATCH_DEV | _filter_btrfs_device_stats
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show device stats return value wrong"
 
 echo "== Show device stats by second dev"
-$BTRFS_UTIL_PROG device stats $FIRST_POOL_DEV | sed -e "s,$FIRST_POOL_DEV,FIRST_POOL_DEV,g"
+$BTRFS_UTIL_PROG device stats $FIRST_POOL_DEV | \
+	sed -e "s,$FIRST_POOL_DEV,FIRST_POOL_DEV,g"
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show device stats return value wrong"
 
 echo "== Show device stats by last dev"
-$BTRFS_UTIL_PROG device stats $LAST_POOL_DEV | sed -e "s,$LAST_POOL_DEV,LAST_POOL_DEV,g"
+$BTRFS_UTIL_PROG device stats $LAST_POOL_DEV | \
+	sed -e "s,$LAST_POOL_DEV,LAST_POOL_DEV,g"
+[ ${PIPESTATUS[0]} -ne 0 ] && \
+	_fail "show device stats return value wrong"
 
 # success, all done
 status=0
diff --git a/tests/btrfs/006.out b/tests/btrfs/006.out
index 22bcb77..497de67 100644
--- a/tests/btrfs/006.out
+++ b/tests/btrfs/006.out
@@ -2,6 +2,11 @@
 == Set filesystem label to TestLabel.006
 == Get filesystem label
 TestLabel.006
+== Show filesystem by device(offline)
+Label: 'TestLabel.006' uuid: UUID
+	Total devices EXACTNUM FS bytes used SIZE
+	devid DEVID size SIZE used SIZE path SCRATCH_DEV
+
 == Mount.
 == Show filesystem by label
 Label: 'TestLabel.006' uuid: UUID
@@ -13,6 +18,11 @@
 Label: 'TestLabel.006' uuid: EXACTUUID
 	Total devices EXACTNUM FS bytes used SIZE
 	devid DEVID size SIZE used SIZE path SCRATCH_DEV
+== Show filesystem by device(online)
+Label: 'TestLabel.006' uuid: EXACTUUID
+	Total devices EXACTNUM FS bytes used SIZE
+	devid DEVID size SIZE used SIZE path SCRATCH_DEV
+
 == Sync filesystem
 FSSync 'SCRATCH_MNT'
 == Show device stats by mountpoint
-- 
2.2.2
Re: btrfs scrub status reports not running when it is
On Wed, Jan 14, 2015 at 02:27:17PM -0800, Zach Brown wrote:
On Wed, Jan 14, 2015 at 04:06:02PM -0500, Sandy McArthur Jr wrote:

Sometimes btrfs scrub status reports that it is not running when it still is. I think this is a cosmetic bug, and I believe it is related to the scrub completing on some drives before others in a multi-drive btrfs filesystem that is not well balanced.

Boy, I don't really know this code, but it looks like:

	if (ss->in_progress)
		printf(", running for %llu seconds\n", ss->duration);
	else
		printf(", interrupted after %llu seconds, not running\n",
		       ss->duration);

	in_progress = is_scrub_running_in_kernel(fdmnt, di_args, fi_args.num_devices);

	static int is_scrub_running_in_kernel(int fd,
			struct btrfs_ioctl_dev_info_args *di_args, u64 max_devices)
	{
		struct scrub_progress sp;
		int i;
		int ret;

		for (i = 0; i < max_devices; i++) {
			memset(&sp, 0, sizeof(sp));
			sp.scrub_args.devid = di_args[i].devid;
			ret = ioctl(fd, BTRFS_IOC_SCRUB_PROGRESS, &sp.scrub_args);
			if (ret < 0 && errno == ENODEV)
				continue;
			if (ret < 0 && errno == ENOTCONN)
				return 0;

It says that scrub isn't running if any devices have completed. If you drop all those ret < 0 conditional branches that are either noops or wrong, does it work like you'd expect?

Why wrong? The ioctl callback returns -ENODEV or -ENOTCONN, which get translated to the errno values, and ioctl(...) returns -1 in both cases.
Re: [PATCH] fstests: fix test btrfs/017 (qgroup shared extent accounting test)
On Wed, Jan 14, 2015 at 11:21:43PM +0000, Filipe Manana wrote:

Currently this test fails in 2 situations:

1) The scratch device supports trim/discard. In this case any modern version of mkfs.btrfs outputs a message (to stderr) informing that a trim is performed, which the golden output doesn't expect:

    btrfs/017 - output mismatch (see /git/xfstests/results//btrfs/017.out.bad)
    --- tests/btrfs/017.out	2015-01-06 11:14:22.730143144 +0000
    +++ /git/xfstests/results//btrfs/017.out.bad	2015-01-14 22:33:01.582195719 +0000
    @@ -1,4 +1,5 @@
     QA output created by 017
    +Performing full device TRIM (100.00GiB) ...
     wrote 8192/8192 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
     4096
     4096
    ...
    (Run 'diff -u tests/btrfs/017.out /git/xfstests/results//btrfs/017.out.bad' to see the entire diff)

So, like other tests do, just redirect mkfs' standard error.

2) On platforms with a page size greater than 4Kb. At the moment btrfs doesn't support a node/leaf size smaller than the page size, but it does support a larger one. So use the max supported node size (64Kb) so that the test runs on any platform currently supported by Linux.

Signed-off-by: Filipe Manana fdman...@suse.com

Reviewed-by: David Sterba dste...@suse.cz
Re: btrfs_inode_item's otime?
On Fri, Jan 09, 2015 at 05:11:42PM +0100, David Sterba wrote:

	--- a/fs/btrfs/inode.c
	+++ b/fs/btrfs/inode.c
	@@ -5835,6 +5835,11 @@ static struct inode *btrfs_new_inode(struct btrfs_trans_handle *trans,
	 			     sizeof(*inode_item));
	 	fill_inode_item(trans, path->nodes[0], inode_item, inode);
	 
	+	/*
	+	 * Set the creation time on the inode.
	+	 */
	+	btrfs_set_stack_timespec_sec( inode.otime, cur_time.tv_sec );

Drop the spaces after/before parens, and also set usec the same way. There's no such thing as 'current_time', only CURRENT_TIME, but that cannot be used directly as a structure. Given that the mtime is set a few lines above, copy the tv_sec and tv_usec from there.

chandan pointed out on IRC the other day that he'd sent a patch for that already: http://www.mail-archive.com/linux-btrfs%40vger.kernel.org/msg17508.html

Though the patch cannot be applied as-is, it's more complete (I've missed a few places where the otime has to be set). Chandan, please drop the btrfs_inode_otime helper and resend. Thanks.
[PATCH v2 1/2 RESEND] btrfs: remove empty fs_devices to prevent memory runout
There is a global list @fs_uuids that keeps an @fs_devices object for each created btrfs. But when a btrfs becomes empty (all devices belonging to it are gone), its @fs_devices remains in the @fs_uuids list until module exit. If we keep running mkfs.btrfs on the same device again and again, all the empty @fs_devices produced are sure to eat up our memory, so this case had better be prevented.

I think that each time we set up btrfs on a device, we should check whether we are stealing some device from another btrfs seen before. To facilitate the search procedure, we could insert all @btrfs_device objects into an rb_root, one @btrfs_device per physical device, with @bdev->bd_dev as the key. Each time device stealing happens, we should replace the corresponding @btrfs_device in the rb_root with an up-to-date version. If the stolen device is the last device in its @fs_devices, then we have an empty btrfs to be deleted.

Actually there are 3 ways to steal devices and end up with an empty btrfs:
1. mkfs, with -f option
2. device add, with -f option
3. device replace, with -f option
We should act in these cases.

Moreover, there are special cases to consider:
o If there are seed devices, then it is assured that the devices in the cloned @fs_devices are not treated as valid devices.
o If a device disappears and reappears without any touch, its @bdev->bd_dev may change, so we have to re-insert it into the rb_root.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
changelog v1->v2: add handling for the device disappear-and-reappear event

*Note*
Actually this handles the case when a device disappears and reappears without any touch. We are going to recycle all dead btrfs_device objects in another patch. Two events lead to the deads:
1) device disappears and never returns again
2) device disappears and returns with a new fs on it
A shrinker shall kill the deads.
---
 fs/btrfs/super.c   |   1 +
 fs/btrfs/volumes.c | 281 ++++++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/volumes.h |   6 ++
 3 files changed, 230 insertions(+), 58 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 60f7cbe..001cba5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2184,6 +2184,7 @@ static void __exit exit_btrfs_fs(void)
 	btrfs_end_io_wq_exit();
 	unregister_filesystem(&btrfs_fs_type);
 	btrfs_exit_sysfs();
+	btrfs_cleanup_valid_dev_root();
 	btrfs_cleanup_fs_uuids();
 	btrfs_exit_compress();
 	btrfs_hash_exit();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0144790..228a7e0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/semaphore.h>
+#include <linux/rbtree.h>
 #include <asm/div64.h>
 #include "ctree.h"
 #include "extent_map.h"
@@ -52,6 +53,126 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
+static struct rb_root valid_dev_root = RB_ROOT;
+
+static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
+{
+	struct rb_node **p;
+	struct rb_node *parent;
+	struct rb_node *new;
+	struct btrfs_device *old_dev;
+
+	WARN_ON(!mutex_is_locked(&uuid_mutex));
+
+	parent = NULL;
+	new = &new_dev->rb_node;
+
+	p = &valid_dev_root.rb_node;
+	while (*p) {
+		parent = *p;
+		old_dev = rb_entry(parent, struct btrfs_device, rb_node);
+
+		if (new_dev->devnum < old_dev->devnum)
+			p = &parent->rb_left;
+		else if (new_dev->devnum > old_dev->devnum)
+			p = &parent->rb_right;
+		else {
+			rb_replace_node(parent, new, &valid_dev_root);
+			RB_CLEAR_NODE(parent);
+
+			goto out;
+		}
+	}
+
+	old_dev = NULL;
+	rb_link_node(new, parent, p);
+	rb_insert_color(new, &valid_dev_root);
+
+out:
+	return old_dev;
+}
+
+static void free_fs_devices(struct btrfs_fs_devices *fs_devices)
+{
+	struct btrfs_device *device;
+
+	WARN_ON(fs_devices->opened);
+	while (!list_empty(&fs_devices->devices)) {
+		device = list_entry(fs_devices->devices.next,
+				    struct btrfs_device, dev_list);
+		list_del(&device->dev_list);
+		rcu_string_free(device->name);
+		kfree(device);
+	}
+	kfree(fs_devices);
+}
+
+static void remove_empty_fs_if_need(struct btrfs_fs_devices *old_fs)
+{
+	struct btrfs_fs_devices *seed_fs;
+
+	if (!list_empty(&old_fs->devices))
+		return;
+
+	list_del(&old_fs->list);
+
+	/* free the seed clones */
+	seed_fs = old_fs->seed;
+	free_fs_devices(old_fs);
+	while (seed_fs) {
+		old_fs = seed_fs;
[PATCH 2/2 RESEND] btrfs: introduce shrinker for rb_tree that keeps valid btrfs_devices
The following patch:
	btrfs: remove empty fs_devices to prevent memory runout
introduces @valid_dev_root, aiming at recording @btrfs_device objects that have corresponding block devices with btrfs. But if a block device is broken or unplugged, no one tells @valid_dev_root to clean up the dead objects. To recycle the memory occupied by those dead objects, we can rely on the shrinker. The shrinker's scan function will traverse @valid_dev_root and try to open the devices one by one; if it fails, or encounters a non-btrfs device, it will remove the dead @btrfs_device.

A special case to deal with is a block device that is unplugged and replugged: it then appears with a new @bdev->bd_dev as devnum. In this case, we should remove the older entry, since we should already have a new one for that block device.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 fs/btrfs/super.c   | 10 ++++++
 fs/btrfs/volumes.c | 74 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  4 +++
 3 files changed, 87 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 001cba5..022381e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2017,6 +2017,12 @@ static struct miscdevice btrfs_misc = {
 	.fops = &btrfs_ctl_fops
 };
 
+static struct shrinker btrfs_valid_dev_shrinker = {
+	.scan_objects = btrfs_valid_dev_scan,
+	.count_objects = btrfs_valid_dev_count,
+	.seeks = DEFAULT_SEEKS,
+};
+
 MODULE_ALIAS_MISCDEV(BTRFS_MINOR);
 MODULE_ALIAS("devname:btrfs-control");
 
@@ -2130,6 +2136,8 @@ static int __init init_btrfs_fs(void)
 
 	btrfs_init_lockdep();
 
+	register_shrinker(&btrfs_valid_dev_shrinker);
+
 	btrfs_print_info();
 
 	err = btrfs_run_sanity_tests();
@@ -2143,6 +2151,7 @@ static int __init init_btrfs_fs(void)
 	return 0;
 
 unregister_ioctl:
+	unregister_shrinker(&btrfs_valid_dev_shrinker);
 	btrfs_interface_exit();
 free_end_io_wq:
 	btrfs_end_io_wq_exit();
@@ -2183,6 +2192,7 @@ static void __exit exit_btrfs_fs(void)
 	btrfs_interface_exit();
 	btrfs_end_io_wq_exit();
 	unregister_filesystem(&btrfs_fs_type);
+	unregister_shrinker(&btrfs_valid_dev_shrinker);
 	btrfs_exit_sysfs();
 	btrfs_cleanup_valid_dev_root();
 	btrfs_cleanup_fs_uuids();
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 228a7e0..5462557 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -54,6 +54,7 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
 static struct rb_root valid_dev_root = RB_ROOT;
+static atomic_long_t unopened_dev_count = ATOMIC_LONG_INIT(0);
 
 static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
 {
@@ -130,6 +131,8 @@ static void free_invalid_device(struct btrfs_device *invalid_dev)
 {
 	struct btrfs_fs_devices *old_fs;
 
+	atomic_long_dec(&unopened_dev_count);
+
 	old_fs = invalid_dev->fs_devices;
 	mutex_lock(&old_fs->device_list_mutex);
 	list_del(&invalid_dev->dev_list);
@@ -605,6 +608,7 @@ static noinline int device_list_add(const char *path,
 		list_add_rcu(&device->dev_list, &fs_devices->devices);
 		fs_devices->num_devices++;
 		mutex_unlock(&fs_devices->device_list_mutex);
+		atomic_long_inc(&unopened_dev_count);
 
 		ret = 1;
 		device->fs_devices = fs_devices;
@@ -778,6 +782,7 @@ again:
 			blkdev_put(device->bdev, device->mode);
 			device->bdev = NULL;
 			fs_devices->open_devices--;
+			atomic_long_inc(&unopened_dev_count);
 		}
 		if (device->writeable) {
 			list_del_init(&device->dev_alloc_list);
@@ -840,8 +845,10 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 		struct btrfs_device *new_device;
 		struct rcu_string *name;
 
-		if (device->bdev)
+		if (device->bdev) {
 			fs_devices->open_devices--;
+			atomic_long_inc(&unopened_dev_count);
+		}
 
 		if (device->writeable &&
 		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
@@ -971,6 +978,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
 			fs_devices->rotating = 1;
 
 		fs_devices->open_devices++;
+		atomic_long_dec(&unopened_dev_count);
 		if (device->writeable &&
 		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
 			fs_devices->rw_devices++;
@@ -6848,3 +6856,67 @@ void btrfs_update_commit_device_bytes_used(struct btrfs_root *root,
 	}
 	unlock_chunks(root);
 }
+
+static unsigned long shrink_valid_dev_root(void)
+{
+	struct rb_node *n;
+	struct btrfs_device *device;
+	struct
Re: [PATCH 04/15] Btrfs: add ref_count and free function for btrfs_bio
The cleanups look good in general, some minor nitpicks below.

On Tue, Jan 13, 2015 at 08:34:37PM +0800, Zhaolei wrote:

	-	kfree(bbio);
	+	put_btrfs_bio(bbio);

Please rename it to btrfs_put_bbio, this is more consistent with the other *_put_* helpers, and 'bbio' distinguishes btrfs_bio from a regular 'bio'.

	 static void btrfs_end_bio(struct bio *bio, int err)
	diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
	index fb0e8c3..db195f0 100644
	--- a/fs/btrfs/volumes.h
	+++ b/fs/btrfs/volumes.h
	@@ -295,6 +295,7 @@ typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
	 #define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
	 
	 struct btrfs_bio {
	+	atomic_t ref_count;

atomic_t refs

	 	atomic_t stripes_pending;
	 	struct btrfs_fs_info *fs_info;
	 	bio_end_io_t *end_io;
	@@ -394,13 +395,8 @@ struct btrfs_balance_control {
	 
	 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
	 				   u64 end, u64 *length);
	-
	-#define btrfs_bio_size(total_stripes, real_stripes)		\
	-	(sizeof(struct btrfs_bio) +				\
	-	 (sizeof(struct btrfs_bio_stripe) * (total_stripes)) +	\
	-	 (sizeof(int) * (real_stripes)) +			\
	-	 (sizeof(u64) * (real_stripes)))
	-
	+void get_btrfs_bio(struct btrfs_bio *bbio);

btrfs_get_bbio

	+void put_btrfs_bio(struct btrfs_bio *bbio);
	 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, u64 logical,
	 		    u64 *length, struct btrfs_bio **bbio_ret, int mirror_num);
Re: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes
On Thu, Jan 15, 2015 at 09:17:01AM +0800, Fan Chengniang/樊成酿 wrote:
On 2015-01-14 23:46, David Sterba wrote:
On Tue, Jan 13, 2015 at 01:53:39PM +0800, Fan Chengniang wrote:
make btrfs qgroups show human readable sizes using the --human-readable option, example:

That's too long to type, and the idea was to add all the long options that force the specific unit base, i.e. --kbytes/--mbytes/..., --raw, --si and --iec. We can possibly make the human readable format the default, because that's what I'd expect to see to have a quick overview, and the other options can be used otherwise. The getopt parser accepts short options if they're unique, so --kb or even --k works as a very convenient shortcut for frequent command-line use.

I have sent a mail for your advice on adding options. In that mail, I asked whether I should use --human-readable and add --kbytes --mbytes ... But you have not replied to me.

So you've sent a v2 where we can see if our ideas match or not and continue from there. Timely replies are not always feasible, I get a lot of mails. Patch iterations are normal, nothing new here.

So, your advice is to add --kbytes --mbytes ... and make human-readable the default behaviour?

qgroupid rfer      excl      max_rfer  max_excl  parent  child
-------- ----      ----      --------  --------  ------  -----
0/5      299.58MiB 299.58MiB 400.00MiB 0.00B     1/1     ---
0/265    299.58MiB 16.00KiB  0.00B     320.00MiB 1/1     ---
0/266    299.58MiB 16.00KiB  350.00MiB 0.00B     ---     ---
1/1      599.16MiB 299.59MiB 800.00MiB 0.00B     ---     0/5,0/265

The values should also be aligned to the right.

It was aligned to the left before my patch, I just kept it.

Ok, take it as a hint for another patch.
Re: btrfs performance - ssd array
Hello,

Could you check how many extents with BTRFS and Ext4:
# filefrag test1

So my findings are odd: on BTRFS, when I run fio with a single worker thread (the target file is 12GB large, and it's 100% random write of 4kb blocks), the number of extents reported by filefrag is around 3. However, when I do the same with 4 worker threads, I get a crazy number of extents - test1: 3141866 extents found. Also, when running with 4 threads, when I check CPU, the sys% utilization takes 80% of CPU (in the top output I see that all of it is consumed by kworker processes). On EXT4 I get only 13 extents even when running with 4 worker threads. (Note that I created RAID10 using mdadm before setting up ext4 there, in order to get a storage setup comparable to what we test with BTRFS.)

Another odd thing is that it takes a very long time for the filefrag utility to return the result on BTRFS, and not only for the case where I got 3 million extents but also for the first case where I ran a single worker and the number of extents was only 3. Filefrag on EXT4 returns immediately.

To see if this is because of bad fragmentation on BTRFS. I am still not sure how fio will test randwrite here, so here are the possibilities:

case 1: if fio doesn't repeatedly write the same position several times, I think you could add --overwrite=0 and retest to see if it helps.

Not sure what parameter you mean here.

case 2: if fio randwrite did write the same position several times, I think you could use the '-o nodatacow' mount option to verify whether BTRFS COW caused the serious fragmentation.

It seems that mounting with this option does have some effect, but not a very significant one, and it is not very deterministic. The IOPS are slightly higher at the beginning (~25,000 IOPS) but IOPS performance is very spiky, and I can still see that CPU sys% is very high. As soon as the kworker threads start consuming CPU, the IOPS performance goes down again to some ~15,000 IOPS.
Re: [PATCH v2 RESEND] btrfs-progs: make btrfs qgroups show human readable sizes
On 2015-01-15 20:30, David Sterba wrote:
On Thu, Jan 15, 2015 at 09:17:01AM +0800, Fan Chengniang/樊成酿 wrote:
On 2015-01-14 23:46, David Sterba wrote:
On Tue, Jan 13, 2015 at 01:53:39PM +0800, Fan Chengniang wrote:
make btrfs qgroups show human readable sizes using the --human-readable option, example:

That's too long to type, and the idea was to add all the long options that force the specific unit base, i.e. --kbytes/--mbytes/..., --raw, --si and --iec. We can possibly make the human readable format the default, because that's what I'd expect to see to have a quick overview, and the other options can be used otherwise. The getopt parser accepts short options if they're unique, so --kb or even --k works as a very convenient shortcut for frequent command-line use.

I have sent a mail for your advice on adding options. In that mail, I asked whether I should use --human-readable and add --kbytes --mbytes ... But you have not replied to me.

So you've sent a v2 where we can see if our ideas match or not and continue from there. Timely replies are not always feasible, I get a lot of mails. Patch iterations are normal, nothing new here.

Sorry for my words, I didn't consider that you get a lot of mail. I will take your advice and improve my patch. This is my personal mail address.

So, your advice is to add --kbytes --mbytes ... and make human-readable the default behaviour?

qgroupid rfer      excl      max_rfer  max_excl  parent  child
-------- ----      ----      --------  --------  ------  -----
0/5      299.58MiB 299.58MiB 400.00MiB 0.00B     1/1     ---
0/265    299.58MiB 16.00KiB  0.00B     320.00MiB 1/1     ---
0/266    299.58MiB 16.00KiB  350.00MiB 0.00B     ---     ---
1/1      599.16MiB 299.59MiB 800.00MiB 0.00B     ---     0/5,0/265

The values should also be aligned to the right.

It was aligned to the left before my patch, I just kept it.

Ok, take it as a hint for another patch.

I have combined them into one patch. Maybe I should separate them.
RE: [PATCH 04/15] Btrfs: add ref_count and free function for btrfs_bio
Hi, David Sterba

* From: David Sterba [mailto:dste...@suse.cz]

The cleanups look good in general, some minor nitpicks below.

On Tue, Jan 13, 2015 at 08:34:37PM +0800, Zhaolei wrote:

	-	kfree(bbio);
	+	put_btrfs_bio(bbio);

Please rename it to btrfs_put_bbio, this is more consistent with the other *_put_* helpers, and 'bbio' distinguishes btrfs_bio from a regular 'bio'.

Good suggestion, I like these unified-format names.

	 static void btrfs_end_bio(struct bio *bio, int err)
	diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
	index fb0e8c3..db195f0 100644
	--- a/fs/btrfs/volumes.h
	+++ b/fs/btrfs/volumes.h
	@@ -295,6 +295,7 @@ typedef void (btrfs_bio_end_io_t) (struct btrfs_bio *bio, int err);
	 #define BTRFS_BIO_ORIG_BIO_SUBMITTED	(1 << 0)
	 
	 struct btrfs_bio {
	+	atomic_t ref_count;

atomic_t refs

Ok.

	 	atomic_t stripes_pending;
	 	struct btrfs_fs_info *fs_info;
	 	bio_end_io_t *end_io;
	@@ -394,13 +395,8 @@ struct btrfs_balance_control {
	 
	 int btrfs_account_dev_extents_size(struct btrfs_device *device, u64 start,
	 				   u64 end, u64 *length);
	-
	-#define btrfs_bio_size(total_stripes, real_stripes)		\
	-	(sizeof(struct btrfs_bio) +				\
	-	 (sizeof(struct btrfs_bio_stripe) * (total_stripes)) +	\
	-	 (sizeof(int) * (real_stripes)) +			\
	-	 (sizeof(u64) * (real_stripes)))
	-
	+void get_btrfs_bio(struct btrfs_bio *bbio);

btrfs_get_bbio

Thanks for your suggestion, I'll include the above changes in v2.

Thanks
Zhaolei

	+void put_btrfs_bio(struct btrfs_bio *bbio);
	 int btrfs_map_block(struct btrfs_fs_info *fs_info, int rw, u64 logical,
	 		    u64 *length, struct btrfs_bio **bbio_ret, int mirror_num);