Re: [PATCH] btrfs: Fix error handling in btrfs_cleanup_ordered_extents
On 10/26/18 13:13, Nikolay Borisov wrote: + if (page_start >= offset && page_end <= (offset + bytes - 1) { fs/btrfs/inode.c: In function 'btrfs_cleanup_ordered_extents': fs/btrfs/inode.c:140:62: error: expected ')' before '{' token if (page_start >= offset && page_end <= (offset + bytes - 1) { ~^~ ) You're welcome :) Holger
Re: Scrub aborts due to corrupt leaf
On 10/10/18 19:44, Chris Murphy wrote: On Wed, Oct 10, 2018 at 10:04 AM, Holger Hoffstätte wrote: On 10/10/18 17:44, Larkin Lowrey wrote: (..) About once a week, or so, I'm running into the above situation where FS seems to deadlock. All IO to the FS blocks, there is no IO activity at all. I have to hard reboot the system to recover. There are no error indications except for the following which occurs well before the FS freezes up: BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of free space BTRFS warning (device dm-3): failed to load free space cache for block group 78691883286528, rebuilding it now Do I have any options other the nuking the FS and starting over? Unmount cleanly & mount again with -o space_cache=v2. I'm pretty sure you have to umount, and then clear the space_cache with 'btrfs check --clear-space-cache=v1' and then do a one time mount with -o space_cache=v2. But anyway, to me that seems premature because we don't even know what's causing the problem. Space cache writeout not honoring errors from the depths below is not unusual, I think there were some fixes recently which Larkin likely doesn't have yet. But yeah, I forgot to mention that cache-v2 alone won't really fix the _underlying_ symptoms. It is, however, vastly more reliable in general. a. Freezing means there's a kernel bug. Hands down. b. Is it freezing on the rebuild? Or something else? c. I think the devs would like to see the output from btrfs-progs v4.17.1, 'btrfs check --mode=lowmem' and see if it finds anything, in particular something not related to free space cache. Apart from performance implications, if only the free space cache inodes/blocks are borked then the rest will (should) work just fine and/or be replaced/overwritten eventually. Well, at least that was the idea. :} -h
Re: Scrub aborts due to corrupt leaf
On 10/10/18 19:25, Larkin Lowrey wrote: On 10/10/2018 12:04 PM, Holger Hoffstätte wrote: On 10/10/18 17:44, Larkin Lowrey wrote: (..) About once a week, or so, I'm running into the above situation where FS seems to deadlock. All IO to the FS blocks, there is no IO activity at all. I have to hard reboot the system to recover. There are no error indications except for the following which occurs well before the FS freezes up: BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of free space BTRFS warning (device dm-3): failed to load free space cache for block group 78691883286528, rebuilding it now Do I have any options other the nuking the FS and starting over? Unmount cleanly & mount again with -o space_cache=v2. It froze while unmounting. The attached zip is a stack dump captured via 'echo t > /proc/sysrq-trigger'. A second attempt after a hard reboot worked. Trace says freespace cache writeout failed midway while the scsi device was resetting itself and then went rrrghh. Probably managed to hit different blocks on the second attempt. So chances are your controller, disk or something else is broken, dying, or both. When things have settled and you have verified that r/o mounting works and is stable, try rescuing the data (when necessary) before scrubbing, dm-device-checking or whatever you have set up. -h
Re: Scrub aborts due to corrupt leaf
On 10/10/18 17:44, Larkin Lowrey wrote: (..) About once a week, or so, I'm running into the above situation where FS seems to deadlock. All IO to the FS blocks, there is no IO activity at all. I have to hard reboot the system to recover. There are no error indications except for the following which occurs well before the FS freezes up: BTRFS warning (device dm-3): block group 78691883286528 has wrong amount of free space BTRFS warning (device dm-3): failed to load free space cache for block group 78691883286528, rebuilding it now Do I have any options other the nuking the FS and starting over? Unmount cleanly & mount again with -o space_cache=v2. -h
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/18 17:46, Hans van Kranenburg wrote: fs.devices() also looks for dev_items in the chunk tree: https://github.com/knorrie/python-btrfs/blob/master/btrfs/ctree.py#L481 So, BOOM! you need root. Or just start a 0, ignore errors and start trying all devids until you found num_devices amount of them that work, yolo. Since I need to walk /sys/fs/btrfs/ anyway I *think* I can just look at the entries in /sys/fs/btrfs//devices/ and query them all directly. The skeleton btrfs_exporter already responds to http requests and returns dummy metrics, using the official python client library. I've found a nice little python sysfs scraper and now only need to figure out how best to map the btrfs info in sysfs to useful metrics. The privileged access issue would only have come into play much later, but it seems I can avoid it after all, which is great. I'm also (re-)learning python in the process, so that's the actual thing slowing me down.. -h
Re: Curious problem: btrfs device stats & unpriviliged access
On 10/08/18 16:40, Hans van Kranenburg wrote: Looking at the kernel side of things in fs/btrfs/ioctl.c I see both BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN. That's the tree search ioctl, for reading arbitrary metadata. The device stats ioctl is IOC_GET_DEV_STATS... Yeah..OK, that is clear and gave me the hint to ask: why is it calling this in the first place? And as it turns out [1] is where it seems to go wrong, as is_block_device() returns 0 (as it should), fi_args.num_devices is never set (remains 0) and it proceeds to call everything below, eventually calling the BTRFS_IOC_FS_INFO ioctl in #1712. And that works fine: 1711 if (fi_args->num_devices != 1) { (gdb) s 1712ret = ioctl(fd, BTRFS_IOC_FS_INFO, fi_args); (gdb) s 1713if (ret < 0) { (gdb) p ret $28 = 0 (gdb) p *fi_args $30 = { max_id = 1, num_devices = 1, fsid = "z%:\371\315\033A\203\267.\200\255;FH\221", nodesize = 16384, sectorsize = 4096, clone_alignment = 4096, reserved32 = 0, reserved = {0 } } It's only when it goes into search_chunk_tree_for_fs_info() where things fail due to CAP_SYS_ADMIN. And all this explains the actual bug: when I call btrfs device stats not on the mountpoint (as I've been trying all this time) but rather on the device, it works properly right away as regular user: (gdb) set args device stats /dev/loop0 (gdb) r Breakpoint 1, get_fs_info (path=path@entry=0x7fffe527 "/dev/loop0", fi_args=fi_args@entry=0x7fffd400, di_ret=di_ret@entry=0x7fffd3f0) at utils.c:1652 1652{ (gdb) c Continuing. [/dev/loop0].write_io_errs0 [/dev/loop0].read_io_errs 0 [/dev/loop0].flush_io_errs0 [/dev/loop0].corruption_errs 0 [/dev/loop0].generation_errs 0 [Inferior 1 (process 2805) exited normally] So this is simply a discrepancy in handling a device vs. the device(s) for a mountpoint. I can do the device stats ioctl as normal user: import btrfs fs = btrfs.FileSystem('/') btrfs.utils.pretty_print(fs.dev_stats(1)) devid: 1 nr_items: 5 flags: 0 write_errs: 0 read_errs: 0 flush_errs: 0 generation_errs: 0 corruption_errs: 0 That works for me too, and that's actually the important part. \o/ Glad we talked about it. :} -h [1] https://github.com/kdave/btrfs-progs/blob/7faaca0d9f78f7162ae603231f693dd8e1af2a41/utils.c#L1666
Curious problem: btrfs device stats & unpriviliged access
(moving the discussion here from GH [1]) Apparently there is something weird going on with the device stats ioctls. I cannot get them to work as regular user, while they work for David. A friend confirms the same issue on his system - no access as non-root. So I made a new empty fs, mounted it, built btrfs-progs-4.17.1 with debug symbols and stepped into search_chunk_tree_for_fs_info(). Everything is fine, all args are correct, right until: (gdb) s 1614ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, _args); (gdb) s 1615if (ret < 0) (gdb) p ret $4 = -1 (gdb) p search_args $5 = {key = {tree_id = 3, min_objectid = 1, max_objectid = 1, min_offset = 1, max_offset = 18446744073709551615, min_transid = 0, max_transid = 18446744073709551615, min_type = 216, max_type = 216, nr_items = 30, unused = 0, unused1 = 0, unused2 = 0, unused3 = 0, unused4 = 0}, buf = '\000' } Looking at the kernel side of things in fs/btrfs/ioctl.c I see both BTRFS_IOC_TREE_SEARCH[_V2} unconditionally require CAP_SYS_ADMIN. So why can Dave get his dev stats as unprivileged user? Does this work for anybody else? And why? :) cheers Holger [1] https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427823190
Monitoring btrfs with Prometheus (and soon OpenMonitoring)
The Prometheus statistics collection/aggregation/monitoring/alerting system [1] is quite popular, easy to use and will probably be the basis for the upcoming OpenMetrics "standard" [2]. Prometheus collects metrics by polling host-local "exporters" that respond to http requests; many such exporters exist, from the generic node_exporter for OS metrics to all sorts of application-/service-specific varieties. Since btrfs already exposes quite a lot of monitorable and - more importantly - actionable runtime information in sysfs it only makes sense to expose these metrics for visualization & alerting. I noodled over the idea some time ago but got sidetracked, besides not being thrilled at all by the idea of doing this in golang (which I *really* dislike). However, exporters can be written in any language as long as they speak the standard response protocol, so an alternative would be to use one of the other official exporter clients. These provide language-native "mini-frameworks" where one only has to fill in the blanks (see [3] for examples). Since the issue just came up in the node_exporter bugtracker [3] I figured I ask if anyone here is interested in helping build a proper standalone btrfs_exporter in C++? :D ..just kidding, I'd probably use python (which I kind of don't really know either :) and build on Hans' python-btrfs library for anything not covered by sysfs. Anybody interested in helping? Apparently there are also golang libs for btrfs [5] but I don't know anything about them (if you do, please comment on the bug), and the idea of adding even more stuff into the monolithic, already creaky and somewhat bloated node_exporter is not appealing to me. Potential problems wrt. btrfs are access to root-only information, like e.g. the btrfs device stats/errors in the aforementioned bug, since exporters are really supposed to run unprivileged due to network exposure. The S.M.A.R.T. exporter [6] solves this with dual-process contortions; obviously it would be better if all relevant metrics were accessible directly in sysfs and not require privileged access, but forking a tiny privileged process every polling interval is probably not that bad. All ideas welcome! cheers, Holger [1] https://www.prometheus.io/ [2] https://openmetrics.io/ [3] https://github.com/prometheus/client_python, https://github.com/prometheus/client_ruby [4] https://github.com/prometheus/node_exporter/issues/1100 [5] https://github.com/prometheus/node_exporter/issues/1100#issuecomment-427651028 [6] https://github.com/cloudandheat/prometheus_smart_exporter
Re: [PATCH 0/5] rb_first to rb_first_cached conversion
On 08/22/18 21:51, Liu Bo wrote: Several structs in btrfs are using rb_first() in a while loop, it'd be more efficient to do this with rb_first_cached() which has the O(1) complexity. This patch set updates five structs which may have a large rb tree in practice Great idea, works for me with no bad side effects so far. So: Tested-by: Holger Hoffstätte cheers Holger
Re: [PATCH 0/2] address lock contention of btree root
On 08/21/18 20:15, Liu Bo wrote: I just realize that patch 2 can result in softlockup as btrfs_search_slot() may return a path with all nodes being in spinning lock, and if the callers want to sleep, we're in trouble. I've removed patch 2 and am re-running the test (xfstests, fsmark and dbench). You mean like this, when trying to balance? :) Got it only once so far, subsequent attempts worked. Otherwise everything seems fine. -h kernel: BTRFS info (device sdc1): relocating block group 4128424067072 flags data kernel: BTRFS info (device sdc1): found 1706 extents kernel: INFO: rcu_sched self-detected stall on CPU kernel: ^I3-: (17999 ticks this GP) idle=f5e/1/4611686018427387906 softirq=269430/269430 fqs=5999 kernel: ^I (t=18000 jiffies g=232869 c=232868 q=4365) kernel: NMI backtrace for cpu 3 kernel: CPU: 3 PID: 4287 Comm: kworker/u8:0 Not tainted 4.18.3 #1 kernel: Hardware name: System manufacturer System Product Name/P8Z68-V LX, BIOS 4105 07/01/2013 kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] kernel: Call Trace: kernel: kernel: dump_stack+0x46/0x60 kernel: nmi_cpu_backtrace.cold.0+0x13/0x57 kernel: ? lapic_can_unplug_cpu.cold.5+0x34/0x34 kernel: nmi_trigger_cpumask_backtrace+0x8f/0x91 kernel: rcu_dump_cpu_stacks+0x87/0xb2 kernel: rcu_check_callbacks.cold.59+0x2ac/0x430 kernel: ? tick_sched_handle.isra.6+0x40/0x40 kernel: update_process_times+0x28/0x60 kernel: tick_sched_handle.isra.6+0x35/0x40 kernel: tick_sched_timer+0x3b/0x80 kernel: __hrtimer_run_queues+0xfe/0x270 kernel: hrtimer_interrupt+0xf4/0x210 kernel: smp_apic_timer_interrupt+0x56/0x110 kernel: apic_timer_interrupt+0xf/0x20 kernel: kernel: RIP: 0010:queued_write_lock_slowpath+0x4a/0x80 kernel: Code: ff 00 00 00 f0 0f b1 13 85 c0 74 32 f0 81 03 00 01 00 00 ba ff 00 00 00 b9 00 01 00 00 8b 03 3d 00 01 00 00 74 0b f3 90 8b 03 <3d> 00 01 00 00 75 f5 89 c8 f0 0f b1 13 3d 00 01 00 00 75 df c6 43 kernel: RSP: 0018:c940fc40 EFLAGS: 0206 ORIG_RAX: ff13 kernel: RAX: 0300 RBX: 88071abf86f0 RCX: 0100 kernel: RDX: 00ff RSI: 8800 RDI: 88071abf86f0 kernel: RBP: R08: 88071abf8690 R09: 880724ebeb58 kernel: R10: 880724ebeb80 R11: R12: 0001 kernel: R13: 8803d0db5f54 R14: 1600 R15: 0006 kernel: btrfs_try_tree_write_lock+0x23/0x60 [btrfs] kernel: btrfs_search_slot+0x2df/0x970 [btrfs] kernel: btrfs_mark_extent_written+0xb0/0xac0 [btrfs] kernel: ? kmem_cache_alloc+0x1a5/0x1b0 kernel: btrfs_finish_ordered_io+0x2e2/0x7a0 [btrfs] kernel: normal_work_helper+0xad/0x2c0 [btrfs] kernel: process_one_work+0x1e3/0x390 kernel: worker_thread+0x2d/0x3c0 kernel: ? process_one_work+0x390/0x390 kernel: kthread+0x111/0x130 kernel: ? kthread_flush_work_fn+0x10/0x10 kernel: ret_from_fork+0x1f/0x30
Re: Regression with crc32c selection?
On 07/23/18 18:50, David Sterba wrote: On Mon, Jul 23, 2018 at 04:13:26PM +0200, Holger Hoffstätte wrote: While backporting a bunch of fixes to my own 4.16.x tree (4.17 had a few too many bugs for my taste) I also ended up merging: Curious, bugs in btrfs or the whole 4.17 kernel? And if bugs, real breakage or backported fixes? Overall. I don't remember specifics but skimming lkml at the time didn't inspire a lot of confidence, and since I already had a large number of hand-picked & backported patches from 4.17/4.18/4.19 :) for btrfs, xfs, net, blk-mq & drivers - just the stuff I care about - I skipped it instead of upgrading & rebasing everything. Might well be that the latest 4.17-stable works reliably, but 4.18 is already around the corner, so.. no really good reason. :) cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression with crc32c selection? (solved - pilot error)
On 07/23/18 16:39, Patrik Lundquist wrote: $ uname -a Linux nas 4.17.0-1-amd64 #1 SMP Debian 4.17.8-1 (2018-07-20) x86_64 GNU/Linux $ dmesg | grep Btrfs [8.168408] Btrfs loaded, crc32c=crc32c-intel $ lsmod | grep crc32 crc32_pclmul 16384 0 libcrc32c 16384 1 btrfs crc32c_generic 16384 0 crc32c_intel 24576 2 Ooohh..thanks for that. I wouldn't be surprised it's because my libcrc is built-in (probably because of built-in xfs) and at initialization time doesn't see any modules, so it always selects the generic impl. Since btrfs is only ever loaded last, the previous btrfs init code could properly detect/load/use the crc32c-intel module. I switched the crc32c impls to built-in and what do you know: Jul 23 17:04:26 ragnarok kernel: Btrfs loaded, crc32c=crc32c-intel \o/ Thanks Patrick! Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
python-btrfs & btrfs-heatmap ebuilds now available for Gentoo
I wanted to migrate my collection of questionable shell scripts for btrfs maintenance/inspection to a more stable foundation and therefore created Gentoo ebuilds for Hans van Kranenburg's excellent python-btrfs [1] and btrfs-heatmap [2] packages. They can be found in my overlay at: https://github.com/hhoffstaette/portage/tree/master/sys-fs/python-btrfs https://github.com/hhoffstaette/portage/tree/master/sys-fs/btrfs-heatmap Both are curently only available for python-3.6 since a change in 3.7 broke the existing python API [3]. As soon as that is fixed I'll add 3.7 to the PYTHON_COMPAT entries. btrfs-heatmap properly uses the python-exec wrapper and therefore works regardless of currently selected default python version. :) I hope this is useful to someone. cheers, Holger [1] https://github.com/knorrie/python-btrfs [2] https://github.com/knorrie/btrfs-heatmap [3] https://github.com/knorrie/python-btrfs/issues/13 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Regression with crc32c selection?
Hi, While backporting a bunch of fixes to my own 4.16.x tree (4.17 had a few too many bugs for my taste) I also ended up merging: df91f56adce1f: libcrc32c: Add crc32c_impl function 9678c54388b6a: btrfs: Remove custom crc32c init code ..which AFAIK went into 4.17 and seemed harmless enough; after fixing up a trivial context conflict it builds, runs, all good..except that btrfs (apprently?) no longer uses the preferred crc32c-intel module, but the crc32c-generic one instead. In order to rule out any mistakes on my part I built 4.18.0-rc6 and it seems to have the same problem: Jul 23 15:55:09 ragnarok kernel: raid6: sse2x1 gen() 11267 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: sse2x1 xor() 8110 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: sse2x2 gen() 13409 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: sse2x2 xor() 9137 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: sse2x4 gen() 15884 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: sse2x4 xor() 10579 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: using algorithm sse2x4 gen() 15884 MB/s Jul 23 15:55:09 ragnarok kernel: raid6: xor() 10579 MB/s, rmw enabled Jul 23 15:55:09 ragnarok kernel: raid6: using ssse3x2 recovery algorithm Jul 23 15:55:09 ragnarok kernel: xor: automatically using best checksumming function avx Jul 23 15:55:09 ragnarok kernel: Btrfs loaded, crc32c=crc32c-generic I understand that the new crc32c_impl() function changed from crypto_tfm_alg_driver_name() to crypto_shash_driver_name() - could this be the reason? The module is loaded just fine, but apprently not used: $lsmod | grep crc32 crc32_pclmul 16384 0 crc32c_intel 24576 0 In other words, is this supposed to happen or is my kernel config somehow no longer right? It worked before and doesn't look too wrong: $grep CRC /etc/kernels/kernel-config-x86_64-4.18.0-rc6 # CONFIG_PCIE_ECRC is not set CONFIG_CRYPTO_CRC32C=y CONFIG_CRYPTO_CRC32C_INTEL=m CONFIG_CRYPTO_CRC32=m CONFIG_CRYPTO_CRC32_PCLMUL=m # CONFIG_CRYPTO_CRCT10DIF is not set CONFIG_CRC_CCITT=m CONFIG_CRC16=y # CONFIG_CRC_T10DIF is not set CONFIG_CRC_ITU_T=y CONFIG_CRC32=y # CONFIG_CRC32_SELFTEST is not set CONFIG_CRC32_SLICEBY8=y # CONFIG_CRC32_SLICEBY4 is not set # CONFIG_CRC32_SARWATE is not set # CONFIG_CRC32_BIT is not set # CONFIG_CRC4 is not set # CONFIG_CRC7 is not set CONFIG_LIBCRC32C=y # CONFIG_CRC8 is not set Ultimately btrfs (and everything else) works, but the process of how the kernel selects a crc32c implementation seems rather mysterious to me. :/ Any insights welcome. If it's a regression I can gladly test fixes. cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue on BTRFS/copy of really huge files
On 07/06/18 10:37, Juergen Sauer wrote: .. Moving a virtual machine from ssd/raid1 subvolume (nocow) into the rotational big store (noocow) fails. After filling up the cachememory (ram) the data flow cuts down to zero 0 kb/sec. In fatal result the copy of an huge file hangs does not proceed any more, load raises infinite, iops falling to zero. In kernel log I find: [ 491.151952] INFO: task kworker/u16:28:1027 blocked for more than 120 seconds. [ 491.151953] Tainted: P O 4.17.3-1-ARCH #1 [ 491.151953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 491.151953] kworker/u16:28 D0 1027 2 0x8000 [ 491.151965] Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs] [ 491.151965] Call Trace: [ 491.151967] ? __schedule+0x282/0x890 [ 491.151969] schedule+0x32/0x90 [ 491.151970] io_schedule+0x12/0x40 [ 491.151971] blk_mq_get_tag+0x146/0x2a0 This has nothing to do with btrfs and is simply one of the remaining (but already fixed upstream) bugs in the blk-mq stack, probably related to sbitmap concurrency and or "tag starvation". I could give you a list of patches from 4.18+ that help (reliably) but I suppose you're not into kernel patching, so the easiest way for you would be to to switch to the old block layer (e.g. by booting with kernel flag scsi_mod.use_blk_mq=0) and use deadline/cfq as before. This should all be fixed & work reliable with 4.18+; it looks that by 4.19 blk-mq will also be enabled by default. cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS losing SE Linux labels on power failure or "reboot -nffd"
On 06/04/18 15:29, Hans van Kranenburg wrote: Hi, On 06/04/2018 03:14 PM, Russell Coker wrote: The command "reboot -nffd" (kernel reboot without flushing kernel buffers or writing status) when run on a BTRFS system with SE Linux will often result in /var/log/audit/audit.log being unlabeled. This recent fix might be what you're looking for: https://www.spinics.net/lists/linux-btrfs/msg77927.html ..which has been in 4.16 since 4.16.11. :) -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] btrfs: Speedup btrfs_read_block_groups()
On 02/22/18 05:52, Qu Wenruo wrote: > btrfs_read_block_groups() is used to build up the block group cache for > all block groups, so it will iterate all block group items in extent > tree. > > For large filesystem (TB level), it will search for BLOCK_GROUP_ITEM > thousands times, which is the most time consuming part of mounting > btrfs. > > So this patch will try to speed it up by: > > 1) Avoid unnecessary readahead >We were using READA_FORWARD to search for block group item. >However block group items are in fact scattered across quite a lot of >leaves. Doing readahead will just waste our IO (especially important >for HDD). > >In real world case, for a filesystem with 3T used space, it would >have about 50K extent tree leaves, but only have 3K block group >items. Meaning we need to iterate 16 leaves to meet one block group >on average. > >So readahead won't help but waste slow HDD seeks. > > 2) Use chunk mapping to locate block group items >Since one block group item always has one corresponding chunk item, >we could use chunk mapping to get the block group item size. > >With block group item size, we can do a pinpoint tree search, instead >of searching with some uncertain value and do forward search. > >In some case, like next BLOCK_GROUP_ITEM is in the next leaf of >current path, we could save such unnecessary tree block read. > > Cc: Ellis H. Wilson III> Signed-off-by: Qu Wenruo > --- > Since all my TB level storage is all occupied by my NAS, any feedback > (especially for the real world mount speed change) is welcome. (sorry for the previous mail without results..finger salad) Decided to give this a try & got some nice results! Probably not on the same scale & nonlinear behaviour as Ellis will provide since I just don't have that much storage, but interesting nevertheless. $btrfs filesystem df /mnt/backup Data, single: total=1.10TiB, used=1.09TiB System, DUP: total=32.00MiB, used=144.00KiB Metadata, DUP: total=4.00GiB, used=2.23GiB GlobalReserve, single: total=512.00MiB, used=0.00B $btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l 1137 current kernel (4.14++ with most of blk-mq+BFQ from 4.16): mount /mnt/backup 0.00s user 0.02s system 1% cpu 1.211 total mount /mnt/backup 0.00s user 0.02s system 2% cpu 1.122 total mount /mnt/backup 0.00s user 0.02s system 2% cpu 1.236 total patched: mount /mnt/backup 0.00s user 0.02s system 1% cpu 1.070 total mount /mnt/backup 0.00s user 0.02s system 1% cpu 1.056 total mount /mnt/backup 0.00s user 0.02s system 1% cpu 1.058 total That's not overwhelming, but still measurable and nice to have! While I was at it I decided to fill up the drive to almost-max ~3.7TB and see how much slower it woulöd get...you won't believe what happened next. :-) $btrfs-debug-tree -t chunk /dev/sdc1 | grep CHUNK_ITEM | wc -l 3719 mount /mnt/backup 0.00s user 0.02s system 2% cpu 1.328 total mount /mnt/backup 0.00s user 0.03s system 2% cpu 1.361 total mount /mnt/backup 0.00s user 0.03s system 2% cpu 1.368 total Over three times the data, almost the same mount time as before? Yes please! Overall this looks like a really nice improvement. Glad to see that my suspicion about the (non)usefulness of the readhead turned out to be true. :) cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] btrfs: Speedup btrfs_read_block_groups()
On 02/22/18 05:52, Qu Wenruo wrote: > btrfs_read_block_groups() is used to build up the block group cache for > all block groups, so it will iterate all block group items in extent > tree. > > For large filesystem (TB level), it will search for BLOCK_GROUP_ITEM > thousands times, which is the most time consuming part of mounting > btrfs. > > So this patch will try to speed it up by: > > 1) Avoid unnecessary readahead >We were using READA_FORWARD to search for block group item. >However block group items are in fact scattered across quite a lot of >leaves. Doing readahead will just waste our IO (especially important >for HDD). > >In real world case, for a filesystem with 3T used space, it would >have about 50K extent tree leaves, but only have 3K block group >items. Meaning we need to iterate 16 leaves to meet one block group >on average. > >So readahead won't help but waste slow HDD seeks. > > 2) Use chunk mapping to locate block group items >Since one block group item always has one corresponding chunk item, >we could use chunk mapping to get the block group item size. > >With block group item size, we can do a pinpoint tree search, instead >of searching with some uncertain value and do forward search. > >In some case, like next BLOCK_GROUP_ITEM is in the next leaf of >current path, we could save such unnecessary tree block read. > > Cc: Ellis H. Wilson III> Signed-off-by: Qu Wenruo Decided to give this a try & got interesting results! The scene of the crime: -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 02/13/18 13:54, Qu Wenruo wrote: > On 2018年02月13日 20:26, Holger Hoffstätte wrote: >> On 02/13/18 12:40, Qu Wenruo wrote: >>>>> The problem is not about how much space it takes, but how many extents >>>>> are here in the filesystem. >> >> I have no idea why btrfs' mount even needs to touch all block groups to >> get going (which seems to be the root of the problem), but here's a >> not so crazy idea for more "mechanical sympathy". Feel free to mock >> me if this is terribly wrong or not possible. ;) >> >> Mounting of even large filesystems (with many extents) seems to be fine >> on SSDS, but not so fine on rotational storage. We've heard that from >> several people with large (multi-TB) filesystems, and obviously it's >> even more terrible on 5400RPM drives because their seeks are sooo sloow. >> >> If the problem is that the bgs are touched/iterated in "tree order", >> would it then not be possible to sort the block groups in physical order >> before trying to load whatever mount needs to load? > > This is in fact a good idea. > Make block group into its own tree. Well, that's not what I was thinking about at all..yet. :) (keep in mind I'm not really that familiar with the internals). Out of curiosity I ran a bit of perf on my own mount process, which is fast (~700 ms) despite being a ~1.1TB fs, mixture of lots of large and small files. Unfortunately it's also very fresh since I recreated it just this weekend, so everything is neatly packed together and fast. In contrast a friend's fs is ~800 GB, but has 11 GB metadata and is pretty old and fragmented (but running an up-to-date kernel). His fs mounts in ~5s. My perf run shows that the only interesting part responsible for mount time is the nested loop in btrfs_read_block_groups calling find_first_block_group (which got inlined & is not in the perf callgraph) over and over again, accounting for 75% of time spent. I now understand your comment why the real solution to this problem is to move bgs into their own tree, and agree: both kitchens and databases have figured out a long time ago that the key to fast scan and lookup performance is to not put different things in the same storage container; in the case of analytical DBMS this is columnar storage. :) But what I originally meant was something much simpler and more brute-force-ish. I see that btrfs_read_block_groups adds readahead (is that actually effective?) but what I was looking for was the equivalent of a DBMS' sequential scan. Right now finding (and loading) a bg seems to involve a nested loop of tree lookups. It seems easier to rip through the entire tree in nice 8MB chunks and discard what you don't need instead of seeking around trying to find all the right bits in scattered order. Could we alleviate cold mounts by starting more readaheads in btrfs_read_block_groups, so that the extent tree is scanned more linearly? cheers, Holger signature.asc Description: OpenPGP digital signature
Re: mount btrfs takes 30 minutes, btrfs check runs out of memory
On 02/13/18 12:40, Qu Wenruo wrote: >>> The problem is not about how much space it takes, but how many extents >>> are here in the filesystem. I have no idea why btrfs' mount even needs to touch all block groups to get going (which seems to be the root of the problem), but here's a not so crazy idea for more "mechanical sympathy". Feel free to mock me if this is terribly wrong or not possible. ;) Mounting of even large filesystems (with many extents) seems to be fine on SSDS, but not so fine on rotational storage. We've heard that from several people with large (multi-TB) filesystems, and obviously it's even more terrible on 5400RPM drives because their seeks are sooo sloow. If the problem is that the bgs are touched/iterated in "tree order", would it then not be possible to sort the block groups in physical order before trying to load whatever mount needs to load? That way the entire process would involve less seeking (no backward seeks for one) and the drive could very likely get more done during a rotation before stepping further. cheers, Holger signature.asc Description: OpenPGP digital signature
Re: Can rsync use reflink to synchronize directories?
On 01/17/18 18:35, MegaBrutal wrote: > Does rsync support copying files with reflink on local btrfs file > system? Of course it could only work if the necessary conditions for > reflinking are met, but it would be very useful. Not yet: https://bugzilla.samba.org/show_bug.cgi?id=10170 -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
On 01/09/18 00:27, Holger Hoffstätte wrote: > On 01/08/18 23:55, Jens Axboe wrote: >> the good old >> >> int srcu_idx = srcu_idx; >> >> should get the job done. > > (Narrator: It didn't.) Narrator: we retract our previous statement and apologize for the confusion. It works fine when you correctly replace all occurrences, not just one. <:) Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/8] blk-mq: protect completion path with RCU
On 01/08/18 20:15, Tejun Heo wrote: > Currently, blk-mq protects only the issue path with RCU. This patch > puts the completion path under the same RCU protection. This will be > used to synchronize issue/completion against timeout by later patches, > which will also add the comments. > > Signed-off-by: Tejun Heo> --- > block/blk-mq.c | 5 + > 1 file changed, 5 insertions(+) > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index ddc9261..6741c3e 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -584,11 +584,16 @@ static void hctx_lock(struct blk_mq_hw_ctx *hctx, int > *srcu_idx) > void blk_mq_complete_request(struct request *rq) > { > struct request_queue *q = rq->q; > + struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, rq->mq_ctx->cpu); > + int srcu_idx; > > if (unlikely(blk_should_fake_timeout(q))) > return; > + > + hctx_lock(hctx, _idx); > if (!blk_mark_rq_complete(rq)) > __blk_mq_complete_request(rq); > + hctx_unlock(hctx, srcu_idx); > } > EXPORT_SYMBOL(blk_mq_complete_request); So I've had v3 running fine with 4.14++ and when I first tried Jens' additional helpers on top, I got a bunch of warnings which I didn't investigate further at the time. Now they are back since the helpers moved into patch #1 and #2 correctly says: .. block/blk-mq.c: In function ‘blk_mq_complete_request’: ./include/linux/srcu.h:175:2: warning: ‘srcu_idx’ may be used uninitialized in this function [-Wmaybe-uninitialized] __srcu_read_unlock(sp, idx); ^~~ block/blk-mq.c:587:6: note: ‘srcu_idx’ was declared here int srcu_idx; ^~~~ ..etc. This is with gcc 7.2.0. I understand that this is a somewhat-false positive since the lock always precedes the unlock & writes to the value, but can we properly initialize or annotate this? cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARNING: CPU: 1 PID: 3016 at fs/btrfs/ctree.h:1564 btrfs_update_device+0x189/0x190 [btrfs]
Apply the patch from https://patchwork.kernel.org/patch/9960893/ and follow the logged instructions re. device resizing (or see https://bugzilla.kernel.org/show_bug.cgi?id=196949 for examples). The patch is unfortunately not yet merged into 4.15rc, otherwise it could be sent to 4.14-stable. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel hangs during balance
On 12/20/17 20:02, Chris Murphy wrote: > I don't know if it's the sending MUA or the list server, but the line > wrapping makes this much harder to follow. I suggest putting it in a > text file and attaching the text file. It's definitely not on the > receiving side, I see it here also: > https://www.spinics.net/lists/linux-btrfs/msg72872.html You can see enough to suggest that blk-mq is hanging, which is "unsurprising" (being kind here) with such an old kernel. Rich, build your kernel with CONFIG_SCSI_MQ_DEFAULT=n or boot with scsi_mod.use_blk_mq=n as kernel prarameter. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs-transacti hammering the system
On 12/01/17 18:34, Matt McKinnon wrote: > Thanks, I'll give space_cache=v2 a shot. Yes, very much recommended. > My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/ Turn autodefrag off and use noatime instead of relatime. Your filesystem also seems very full, that's bad with every filesystem but *especially* with btrfs because the allocator has to work really hard to find free space for COWing. Really consider deleting stuff or adding more space. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2.1 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs
On Tue, 28 Nov 2017 15:20:39 +0800, Qu Wenruo wrote: > [BUG] > fstrim on some btrfs only trims the unallocated space, not trimming any > space in existing block groups. > > [CAUSE] > Before fstrim_range passed to btrfs_trim_fs(), it get truncated to > range [0, super->total_bytes). > So later btrfs_trim_fs() will only be able to trim block groups in range > [0, super->total_bytes). > > While for btrfs, any bytenr aligned to sector size is valid, since btrfs use > its logical address space, there is nothing limiting the location where > we put block groups. > > For btrfs with routine balance, it's quite easy to relocate all > block groups and bytenr of block groups will start beyond super->total_bytes. > > In that case, btrfs will not trim existing block groups. > > [FIX] > Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs() > can get the unmodified range, which is normally set to [0, U64_MAX]. > > Reported-by: Chris Murphy> Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim > ioctl") > Cc: # v4.0+ > Signed-off-by: Qu Wenruo > --- > changelog: > v2: > Locate the root cause in btrfs_ioctl_fitrim(), remove the truncation > so we can still allow user to trim custom range. > v2.1: > Include the missing change in btrfs_trim_fs() and update the commit > message to reflect this. > --- > fs/btrfs/extent-tree.c | 9 + > fs/btrfs/ioctl.c | 11 +++ > 2 files changed, 8 insertions(+), 12 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index f830aa91ac3d..f74958e11008 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -10972,14 +10972,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, > struct fstrim_range *range) > int dev_ret = 0; > int ret = 0; > > - /* > - * try to trim all FS space, our block group may start from non-zero. > - */ > - if (range->len == total_bytes) > - cache = btrfs_lookup_first_block_group(fs_info, range->start); > - else > - cache = btrfs_lookup_block_group(fs_info, range->start); > - > + cache = btrfs_lookup_first_block_group(fs_info, range->start); > for (; cache; cache = next_block_group(fs_info, cache)) { > if (cache->key.objectid >= (range->start + range->len)) { > btrfs_put_block_group(cache); (..snip..) This first hunk should also remove total_bytes, otherwise we get an unused-variable warning: @@ -10972,19 +10972,11 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range) u64 start; u64 end; u64 trimmed = 0; - u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy); int bg_ret = 0; int dev_ret = 0; int ret = 0; ..etc.. cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel 4.14 RAID5 multi disk array on bcache not mounting
On 11/21/17 23:22, Lionel Bouton wrote: > Le 21/11/2017 à 23:04, Andy Leadbetter a écrit : >> I have a 4 disk array on top of 120GB bcache setup, arranged as follows > [...] >> Upgraded today to 4.14.1 from their PPA and the > > 4.14 and 4.14.1 have a nasty bug affecting bcache users. See for example > : > https://www.reddit.com/r/linux/comments/7eh2oz/serious_regression_in_linux_414_using_bcache_can/ 4.14.2 (just out as rc1) will have the fix. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!
On 11/17/17 18:48, Marc MERLIN wrote: > On Thu, Nov 16, 2017 at 09:53:15PM -0800, Marc MERLIN wrote: >>> I suggest that you try lvmcache instead. It's much more flexible than >>> bcache, >>> does pretty much the same job, and has much less of the "hacky" feel to it. >> >> I can read up on it, it's going to be a big pain to convert from one to >> the other, but I can look at it for new filesystems. > > I had a quick read. As expected, it's slower since it goes through all > the LVM overhead that I got rid of recently > https://github.com/stec-inc/EnhanceIO/wiki/PERFORMANCE-COMPARISON-AMONG-dm-cache,-bcache-and-EnhanceIO > > Given the pain it would be for me to switch, I'm going to stick with > bcache and hope it improves. My understanding is that bcache until recently was more or less unmaintained, but got quite a few patches for 4.14 and now has a new maintainer as well. So things are looking up. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!
On 11/16/17 22:45, Marc MERLIN wrote: (snip) >> This BUG() was recently removed and seems to be caused by some kind >> of persistent corruption, which is seen as invalid inline extent. >> See [1], [2] for details. Maybe you can backport them? >> Alternatively just give 4.14 a whirl, it's great. >> >> -h >> >> [1] >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745 >> [2] >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4 > > First thanks a lot for the quick reply, it was super timely considering > my server was rebooting every 20mn :) > I've now been running 4.14 for a couple of hours, and things seem ok > btrfs-wise. Don't pop the champagne just yet, I just read that apprently 4.14 broke bcache for some people [1]. Not sure how much that affects you, but it might well make things worse. Yeah, I know, wonderful. > So, just so that I understand: > 1) I do have some kind of FS problem/corruption (minor? major?) All I know is what's in those commits, I just remembered the description. ;) If I understand the patches correctly you're still supposed to get an "invalid extent inline ref type" message. > 2) it started crashing 4.9.36 and then 4.13 today, every 20mn, probably due > to some background > cleaner process that kept starting and hitting the problem spot Sounds like. > 3) 4.14 does not crash anymore, but it doesn't even report any problem > either. Does it mean > the error that crashed the old kernel is minor enough that the new kernel > doesn't bother even > logging it? See above, you should still get a warning. OTOH it's hard to tell what is going on when you seem to have dm/dmcrypt/bcache lasagne going on.. > 4) I just ran scrub on the filesystem and it ran fine. That's not too depressing. :) > I'm asusming that running btrfs check --force on a mounted filesystem > that is being used is not going to give useful results, unless I leave > the FS read only. Correct? Think so, yes. > As for 4.14, the serial console code seems broken though, I can't get login > or bash > to work anymore on them: > [ 2786.305004] INFO: task login:5636 blocked for more than 120 seconds. > [ 2786.324648] Tainted: G U W > 4.14.0-amd64-stkreg-sysrq-20171018 #1 > [ 2786.347692] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 2786.371742] login D0 5636 1 0xa0020006 I'm out. :/ -h [1] https://marc.info/?t=15108212601 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.13.12: kernel BUG at fs/btrfs/ctree.h:1802!
On 11/16/17 18:07, Marc MERLIN wrote: > Sorry, was missing the kernel number in the subject, just fixed that. > > On Thu, Nov 16, 2017 at 09:04:45AM -0800, Marc MERLIN wrote: >> My server now reboots every 20mn or so, with this. >> Sadly another BUG_ON() and it won't even tell me which filesystem >> it's on >> >> static inline u32 btrfs_extent_inline_ref_size(int type) >> { >> if (type == BTRFS_TREE_BLOCK_REF_KEY || >> type == BTRFS_SHARED_BLOCK_REF_KEY) >> return sizeof(struct btrfs_extent_inline_ref); >> if (type == BTRFS_SHARED_DATA_REF_KEY) >> return sizeof(struct btrfs_shared_data_ref) + >> sizeof(struct btrfs_extent_inline_ref); >> if (type == BTRFS_EXTENT_DATA_REF_KEY) >> return sizeof(struct btrfs_extent_data_ref) + >> offsetof(struct btrfs_extent_inline_ref, offset); >> BUG(); >> return 0; >> } This BUG() was recently removed and seems to be caused by some kind of persistent corruption, which is seen as invalid inline extent. See [1], [2] for details. Maybe you can backport them? Alternatively just give 4.14 a whirl, it's great. -h [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=167ce953ca55bdee20fe56c3c0fa51002435f745 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4335958de2a43c6790c7f6aa0682aa7189983fa4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: add option to only list parent subvolumes
On 09/30/17 19:56, Holger Hoffstätte wrote: > shell hackery as alternative. Anyway, I was sure that at the time the > other letters sounded even worse/were taken, but that may just have been > in my head. ;-) > > I just rechecked and -S is still available, so that's good. Except that it isn't really, since there is already an 'S' case in cmds-subvolume.c as shortcut to --sort: -- case 'S': ret = btrfs_list_parse_sort_string(optarg, _set); -- ..which is why I picked P at the time. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: add option to only list parent subvolumes
On 09/30/17 18:37, Graham Cobb wrote: > On 30/09/17 14:08, Holger Hoffstätte wrote: >> A "root" subvolume is identified by a null parent UUID, so adding a new >> subvolume filter and flag -P ("Parent") does the trick. > > I don't like the naming. The flag you are proposing is really nothing to > do with whether a subvolume is a parent or not: it is about whether it > is a snapshot or not (many subvolumes are both snapshots and also > parents of other snapshots, and many non-snapshots are not the parent of > any subvolumes). You're completely right. I wrote that patch about a year ago because I needed it for my own homegrown backup program and spent approx. 5 seconds finding a free option letter; ultimately I ended up using the mentioned shell hackery as alternative. Anyway, I was sure that at the time the other letters sounded even worse/were taken, but that may just have been in my head. ;-) I just rechecked and -S is still available, so that's good. > I have two suggestions: > > 1) Use -S (meaning "not a snapshot", the inverse of -s). Along with this > change. I would change the usage text to say something like: > > -s list subvolumes originally created as snapshots > -S list subvolumes originally created not as snapshots > > Presumably specifying both -s and -S should be an error. That sounds much better indeed and is a quick fix. While I agree that the "-P /none" filter would be useful too, it's also a different feature and more work than I want to invest in this right now. Maybe later "-S" can simply delegate to "-P none". cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: right-align number columns in btrfs-debugfs output
The values for block group offset, length etc. in btrfs-debugfs' output are left-aligned, which creates unaligned output and makes the usage percentage hard to read/process further. This patch adds right-aligning format specifiers for the number values. Ideally the format values wouldn't be hardcoded but instead derived from the filesystem size, but this seems to work for now. Signed-off-by: Holger Hoffstätte <hol...@applied-asynchrony.com> --- btrfs-debugfs | 9 +++-- 1 file changed, 7 insertions(+), 2 deletions(-) diff --git a/btrfs-debugfs b/btrfs-debugfs index dfb88539..a7ecd16d 100755 --- a/btrfs-debugfs +++ b/btrfs-debugfs @@ -356,8 +356,13 @@ def print_block_groups(mountpoint): ctypes.memmove(ctypes.addressof(bg), p, ctypes.sizeof(bg)) if bg.flags & BTRFS_BLOCK_GROUP_DATA: -print "block group offset %Lu len %Lu used %Lu chunk_objectid %Lu flags %Lu usage %.2f" %\ - (header.objectid, header.offset, bg.used, bg.chunk_objectid, bg.flags, float(bg.used) / float(header.offset)) +print "block group offset %s len %s used %s chunk_objectid %Lu flags %Lu usage %.2f" %\ + ('{:>14}'.format(header.objectid), + '{:>10}'.format(header.offset), + '{:>10}'.format(bg.used), + bg.chunk_objectid, + bg.flags, + float(bg.used) / float(header.offset)) total_free += (header.offset - bg.used) if min_used >= bg.used: -- 2.14.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: add option to only list parent subvolumes
Hi, When listing subvolumes it can be useful to filter out any snapshots; surprisingly enough I couldn't find an option to do this easily, only the opposite (list only snapshots). A "root" subvolume is identified by a null parent UUID, so adding a new subvolume filter and flag -P ("Parent") does the trick. The same can of course be accomplished with shell hackery, e.g.: btrfs subvol list -q -o | grep -i "parent_uuid -" | cut -d " " -f 11 but a built-in way seems less fragile. I originally cooked this up for myself, but David Sterba encouraged me to send this to the list, so here it is. I'm not too proud of the -P but couldn't find a better option letter; suggestions welcome. :) cheers, Holger Signed-off-by: Holger Hoffstätte <hol...@applied-asynchrony.com> --- btrfs-list.c | 6 ++ btrfs-list.h | 1 + cmds-subvolume.c | 8 +++- 3 files changed, 14 insertions(+), 1 deletion(-) diff --git a/btrfs-list.c b/btrfs-list.c index 4cc2ed49..6aa7a290 100644 --- a/btrfs-list.c +++ b/btrfs-list.c @@ -1175,6 +1175,11 @@ static int filter_deleted(struct root_info *ri, u64 data) return ri->deleted; } +static int filter_parent_subvol_only(struct root_info *ri, u64 data) +{ + return uuid_is_null(ri->puuid); +} + static btrfs_list_filter_func all_filter_funcs[] = { [BTRFS_LIST_FILTER_ROOTID] = filter_by_rootid, [BTRFS_LIST_FILTER_SNAPSHOT_ONLY] = filter_snapshot, @@ -1189,6 +1194,7 @@ static btrfs_list_filter_func all_filter_funcs[] = { [BTRFS_LIST_FILTER_FULL_PATH] = filter_full_path, [BTRFS_LIST_FILTER_BY_PARENT] = filter_by_parent, [BTRFS_LIST_FILTER_DELETED] = filter_deleted, + [BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY] = filter_parent_subvol_only, }; struct btrfs_list_filter_set *btrfs_list_alloc_filter_set(void) diff --git a/btrfs-list.h b/btrfs-list.h index 13f44c3a..54aab123 100644 --- a/btrfs-list.h +++ b/btrfs-list.h @@ -142,6 +142,7 @@ enum btrfs_list_filter_enum { BTRFS_LIST_FILTER_FULL_PATH, BTRFS_LIST_FILTER_BY_PARENT, BTRFS_LIST_FILTER_DELETED, + BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY, BTRFS_LIST_FILTER_MAX, }; diff --git a/cmds-subvolume.c b/cmds-subvolume.c index e7ef67d3..2371338e 100644 --- a/cmds-subvolume.c +++ b/cmds-subvolume.c @@ -404,6 +404,7 @@ static const char * const cmd_subvol_list_usage[] = { "-q print the parent uuid of the snapshots", "-R print the uuid of the received snapshots", "-t print the result as a table", + "-P list parent subvolumes only", "-s list snapshots only in the filesystem", "-r list readonly subvolumes (including snapshots)", "-d list deleted subvolumes that are not yet cleaned", @@ -445,7 +446,7 @@ static int cmd_subvol_list(int argc, char **argv) }; c = getopt_long(argc, argv, - "acdgopqsurRG:C:t", long_options, NULL); + "acdgopPqsurRG:C:t", long_options, NULL); if (c < 0) break; @@ -473,6 +474,11 @@ static int cmd_subvol_list(int argc, char **argv) case 't': is_tab_result = 1; break; + case 'P': + btrfs_list_setup_filter(_set, + BTRFS_LIST_FILTER_PARENT_SUBVOL_ONLY, + 0); + break; case 's': btrfs_list_setup_filter(_set, BTRFS_LIST_FILTER_SNAPSHOT_ONLY, -- 2.14.2 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/extent_io.c:1989
On 09/18/17 19:09, Liu Bo wrote: > This 'mirror 0' looks fishy, (as mirror comes from > btrfs_io_bio->mirror_num, which should be at least 1 if raid1 setup is > in use.) > > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you No, it did not; Gentoo always strives to be as close to mainline as possible except for urgent security & low-risk convenience fixes. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix unexpected result when dio reading corrupted blocks
Hello, quick question for backporting.. On 09/15/17 23:06, Liu Bo wrote: > commit 4246a0b63bd8 ("block: add a bi_error field to struct bio") > changed the logic of how dio read endio reports errors. [snip] I've tried to merge this into my 4.9.x++ tree but have a question since the DIO APIs changed recently and itt's hard to tell what is a bug and what is a feature.. :-/ > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -8267,11 +8267,8 @@ static void btrfs_endio_direct_read(struct bio *bio) > struct btrfs_io_bio *io_bio = btrfs_io_bio(bio); > blk_status_t err = bio->bi_status; > > - if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) { > + if (dip->flags & BTRFS_DIO_ORIG_BIO_SUBMITTED) > err = btrfs_subio_endio_read(inode, io_bio, err); > - if (!err) > - bio->bi_status = 0; > - } > > unlock_extent(_I(inode)->io_tree, dip->logical_offset, > dip->logical_offset + dip->bytes - 1); This hunk is fairly easy, just reverse bi_status to bi->error. However.. > @@ -8279,7 +8276,7 @@ static void btrfs_endio_direct_read(struct bio *bio) > > kfree(dip); > > - dio_bio->bi_status = bio->bi_status; > + dio_bio->bi_status = err; > dio_end_io(dio_bio); ^^^ Same here, except that the call to dio_end_io used to take a second parameter (the error code, which has been moved into bi_status in 4.10+) and looked like this: dio_end_io(dio_bio, bio->bi_error); Given that "bio->bi_error" should have been "err" instead, I think err should also be passed to dio_end_io(), so that the whole hunk would look like: .. -dio_bio->bi_error = bio->bi_error; -dio_end_io(dio_bio, bio->bi_error); +dio_bio->bi_error = err; +dio_end_io(dio_bio, err); .. Would this be correct or did I misunderstand some subtle aspect about the DIO error handling? Thanks :) Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 20:15, Holger Hoffstätte wrote: > On 08/01/17 19:34, Austin S. Hemmelgarn wrote: > [..] >> Apparently, if you call fallocate() on a file with an offset of 0 and >> a length longer than the length of the file itself, BTRFS will >> allocate that exact amount of space, instead of just filling in holes >> in the file and allocating space to extend it. If there isn't enough >> space on the filesystem for this, then it will fail, even though it >> would succeed on ext4, XFS, and F2FS. > [..] >> I'm curious to hear anybody's thoughts on this, namely: 1. Is this >> behavior that should be considered implementation defined? 2. If not, >> is my assessment that BTRFS is behaving incorrectly in this case >> accurate? > > IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3) > make it very clear that the expected default behaviour is to extend. > I don't think this can be interpreted in any other way than incorrect > behaviour on behalf of btrfs. > > Your script reproduces for me, so that's a start. Your reproducer should never ENOSPC because it requires exactly 0 new bytes to be allocated, yet it also fails with --keep-size. >From a quick look it seems that btrfs_fallocate() unconditionally calls btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start) to lazily allocate the necessary extent(s), which goes ENOSPC because that size is again the full size of the requested range, not the difference between the existing file size and the new range length. But I might be misreading things.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Odd fallocate behavior on BTRFS.
On 08/01/17 19:34, Austin S. Hemmelgarn wrote: [..] > Apparently, if you call fallocate() on a file with an offset of 0 and > a length longer than the length of the file itself, BTRFS will > allocate that exact amount of space, instead of just filling in holes > in the file and allocating space to extend it. If there isn't enough > space on the filesystem for this, then it will fail, even though it > would succeed on ext4, XFS, and F2FS. [..] > I'm curious to hear anybody's thoughts on this, namely: 1. Is this > behavior that should be considered implementation defined? 2. If not, > is my assessment that BTRFS is behaving incorrectly in this case > accurate? IMHO no and yes, respectively. Both fallocate(2) and posix_fallocate(3) make it very clear that the expected default behaviour is to extend. I don't think this can be interpreted in any other way than incorrect behaviour on behalf of btrfs. Your script reproduces for me, so that's a start. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: add support to sort by topid
On 07/13/17 15:25, Jeffrey Mahoney wrote: > On 7/13/17 12:16 AM, Anand Jain wrote: >> As users generally organize the subvols and snapshots based on the subvol >> directory hierarchy. So providing an ability to sort them by topid would >> help. Thanks. > > What is a topid? I needed to look at the code to discover this and it's > not even documented as part of the root_info structure. Users are going > to have no idea what this means. The idea seems similar to my attempt to add listing of toplevel-only (i.e. non-snapshot) subvolumes, which can be found here: https://github.com/hhoffstaette/btrfs-progs/commit/f0a065e02b I ended up not submitting it since nobody seemed interested at the time and I could work around it for my own purposes by filtering manually on parent uuid (more precisely the lack thereof). cheers Holger signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs: fix integer overflow in calc_reclaim_items_nr
On 06/23/17 16:32, Chris Mason wrote: [..] > -static inline int calc_reclaim_items_nr(struct btrfs_fs_info *fs_info, > +static inline u64 calc_reclaim_items_nr(struct btrfs_fs_info *fs_info, > u64 to_reclaim) > { > u64 bytes; > - int nr; > + u64 nr; > > bytes = btrfs_calc_trans_metadata_size(fs_info, 1); > nr = (int)div64_u64(to_reclaim, bytes); ^^ Isn't this a bit odd? I can't even tell whether it matters, just randomly noticed it because I genuinely dislike type casts.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: fiemap: Cache and merge fiemap extent before submit it to user
On 06/18/17 13:23, Qu Wenruo wrote: > Well, still no good news. > > I created file extents which returns 0x2008 and the last extent with 0x9. > But still failed to reproduce the error message. > > BTW, I noticed that your output is a little strange. > Normally we should have "offset=%llu phys=%llu len=%llu and flags=0x%x". > > But your output seems a little out of shape. > And further more, the len (if it's correct) is not aligned even to 512 bytes. > > Seems something went wrong totally. That seems to be a mail decoding/display problem at your side. Adam's original mail contained: offset=0 phys=2435798867968 len=131072 flags=0x2008 So length looks fine. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/7] Btrfs: fix total_bytes_pinned counter
On 06/07/17 01:45, Omar Sandoval wrote: > From: Omar Sandoval <osan...@fb.com> > > This series fixes several problems with the total_bytes_pinned counter. > Patches 1 and 2 are cleanups. Patches 3 and 4 are straightforward fixes. > Patch 5 is prep for patch 6. Patch 6 is the most complicated fix. > Patches 5 and 6 are ugly, I'd love any suggestions for a cleaner fix. > Finally, patch 7 adds a warning to catch similar issues in the future. Since this didn't really change the code significantly from the previous version (except for the addition of minor patches 2+7) have a Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> Thanks! Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: skip commit transaction if we don't have enough pinned bytes
On 06/02/17 20:14, Omar Sandoval wrote: > On Fri, May 19, 2017 at 11:39:15AM -0600, Liu Bo wrote: >> We commit transaction in order to reclaim space from pinned bytes because >> it could process delayed refs, and in may_commit_transaction(), we check >> first if pinned bytes are enough for the required space, we then check if >> that plus bytes reserved for delayed insert are enough for the required >> space. >> >> This changes the code to the above logic. >> >> Signed-off-by: Liu Bo>> --- >> fs/btrfs/extent-tree.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c >> index e390451c72e6..bded1ddd1bb6 100644 >> --- a/fs/btrfs/extent-tree.c >> +++ b/fs/btrfs/extent-tree.c >> @@ -4837,7 +4837,7 @@ static int may_commit_transaction(struct btrfs_fs_info >> *fs_info, >> >> spin_lock(_rsv->lock); >> if (percpu_counter_compare(_info->total_bytes_pinned, >> - bytes - delayed_rsv->size) >= 0) { >> + bytes - delayed_rsv->size) < 0) { >> spin_unlock(_rsv->lock); >> return -ENOSPC; >> } > > I found this bug in my latest enospc investigation, too. However, the > total_bytes_pinned counter is pretty broken. E.g., on my laptop: > > $ sync; grep -H '' /sys/fs/btrfs/*/allocation/*/total_bytes_pinned > /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/data/total_bytes_pinned:48693501952 > /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/metadata/total_bytes_pinned:-258146304 > /sys/fs/btrfs/f31cc926-37d3-442d-b50a-10c62d47badc/allocation/system/total_bytes_pinned:0 > > I have a patch to fix it that I haven't cleaned up yet, below. Without > it, Bo's fix will probably cause early ENOSPCs. Dave, should we pull > Bo's patch out of for-next? In any case, I'll get my fix sent out. [..patch snipped..] This made me curious since I also found the underflowed metadata counter on my system. I tried to reproduce it after un/remount (to reset the counters) and noticed that I could reliably cause the metadata underflow by defragging a few large subvolumes, which apparently creates enough extent tree movement that the counter quickly goes bananas. It took some backporting, but with your patch applied I can defrag away and have so far not seen a single counter underflow; all of data/metadata/system are always positive after writes and then reliably drop to 0 after sync/commit. Nice! Just thought you'd want to know. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Question re. required since btrfs-progs-4.10
Hi, I'm on Gentoo and wanted to update Docker to 17.03.0, which failed when it couldn't build the btrfs driver due to a missing . This worked fine on another machine the other day, so I dug in and found that the only difference was an intermediate update to btrfs-progs 4.10. Sure enough: since 4.10 ctree.h includes , which is nowhere to be found in my package of linux-headers-4.10 (also not in 4.9). Simply copying sizes.h from the 4.9 kernel sources made everything work, but I'm curious whether this is actually correct - hence my question whether _should_ actually be installed? Is this a bug with Gentoo's kernel header package or a new problem with btrfs-progs? I don't see any good reason why sizes.h should not be installed, but wanted to verify since there are probably users of other distributions here as well. :) Thanks, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: duperemove : some real world figures on BTRFS deduplication
On 12/09/16 16:43, Chris Murphy wrote: >> If compression has nothing to do with this, then this is heavy >> fragmentation. > > It's probably not that fragmented. Due to compression, metadata > describes 128KiB extents even though the data is actually contiguous. > > And it might be the same thing in my case also, even though no > compression is involved. In that case you can quickly collapse physically contiguous ranges by reflink-mv'ing (ie. a recent mv) the file across subvolume boundaries and back. :) -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: resend: Re: Btrfs: adjust len of writes if following a preallocated extent
On 11/23/16 18:21, Stefan Priebe - Profihost AG wrote: > Am 04.11.2016 um 20:20 schrieb Liu Bo: >> If we have >> >> |0--hole--4095||4096--preallocate--12287| >> >> instead of using preallocated space, a 8K direct write will just >> create a new 8K extent and it'll end up with >> >> |0--new extent--8191||8192--preallocate--12287| >> >> It's because we find a hole em and then go to create a new 8K >> extent directly without adjusting @len. > > after applying that one on top of my 4.4 btrfs branch (includes patches > up to 4.10 / next). i'm getting deadlocks in btrfs. *ctrl+f sectorsize* .. That's not surprising if you did what I suspect. If your tree is based on my - now really very retired - 4.4.x queue, then you are likely missing _all the other blocksize/sectorsize patches_ that came in from Chandra Seetharaman et al., which I _really_ carefully patched around, for many good reasons. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
On 11/10/16 17:20, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 05:01:44PM +0100, Holger Hoffstätte wrote: >> On 11/10/16 16:37, Omar Sandoval wrote: >>> On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: >>>> On 11/10/16 00:26, Omar Sandoval wrote: >>>>> From: Omar Sandoval <osan...@fb.com> >>>>> >>>>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to >>>>> errors coming from the qcow2 virtual drive in the host system. The qcow2 >>>>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. >>>>> Every once in awhile, pread() or pwrite() would return EEXIST, which >>>>> makes no sense. This turned out to be a bug in btrfs_get_extent(). >>>>> >>>>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >>>>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >>>>> two threads race on adding the same extent map to an inode's extent map >>>>> tree. However, if the added em is merged with an adjacent em in the >>>>> extent tree, then we'll end up with an existing extent that is not >>>>> identical to but instead encompasses the extent we tried to add. When we >>>>> call merge_extent_mapping() to find the nonoverlapping part of the new >>>>> em, the arithmetic overflows because there is no such thing. We then end >>>>> up trying to add a bogus em to the em_tree, which results in a EEXIST >>>>> that can bubble all the way up to userspace. >>>>> >>>>> Fix it by extending the identical extent map special case. >>>>> >>>>> Signed-off-by: Omar Sandoval <osan...@fb.com> >>>>> --- >>>>> Applies to 4.9-rc4. >>>>> >>>>> Here [1] is a reproducer for this bug that doesn't involve firing up a >>>>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible >>>>> to debug this on my laptop without compiling a custom kernel and >>>>> rebooting just to add printks [3]. >>>>> >>>>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b >>>> >>>> I can't really make this reproducer fail. It builds and runs fine, but just >>>> exits with no messages (other than the one about drop_caches in dmesg). >>>> It creates the 1MB file and always returns 0. Ideas? >>>> >>>> -h >>> >>> It's a race condition, so it doesn't happen 100% of the time. I imagine >>> it depends on the storage speed, as well. On my laptop, which is >>> dm-crypt on top of an SSD, it works about 50% of the time. Could you >>> just try running it 100 times or something and see if it fails? >> >> $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail" >> >> ..couple of thousand runs without problem, only lots of fallocating and >> cache dropping. >> >> Oh well, I tried. :) >> >> -h > > Just out of curiousity, what kind of disk were you trying this on? I've > only been able to trigger it on my laptop and a VM running on my laptop. Tried on both an SSD and an old slowpoke 2.5" rotational disk on USB2. But I also have a ton of other patches and a custom CPU scheduler, so everything it likely my fault anyway. Don't sweat it. :) >From what I can tell the explanation of the problem and the change itself make sense. Would have been nice to be able to repro. cheers, -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
On 11/10/16 16:37, Omar Sandoval wrote: > On Thu, Nov 10, 2016 at 04:11:35PM +0100, Holger Hoffstätte wrote: >> On 11/10/16 00:26, Omar Sandoval wrote: >>> From: Omar Sandoval <osan...@fb.com> >>> >>> My QEMU VM was seeing inexplicable I/O errors that I tracked down to >>> errors coming from the qcow2 virtual drive in the host system. The qcow2 >>> file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. >>> Every once in awhile, pread() or pwrite() would return EEXIST, which >>> makes no sense. This turned out to be a bug in btrfs_get_extent(). >>> >>> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >>> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >>> two threads race on adding the same extent map to an inode's extent map >>> tree. However, if the added em is merged with an adjacent em in the >>> extent tree, then we'll end up with an existing extent that is not >>> identical to but instead encompasses the extent we tried to add. When we >>> call merge_extent_mapping() to find the nonoverlapping part of the new >>> em, the arithmetic overflows because there is no such thing. We then end >>> up trying to add a bogus em to the em_tree, which results in a EEXIST >>> that can bubble all the way up to userspace. >>> >>> Fix it by extending the identical extent map special case. >>> >>> Signed-off-by: Omar Sandoval <osan...@fb.com> >>> --- >>> Applies to 4.9-rc4. >>> >>> Here [1] is a reproducer for this bug that doesn't involve firing up a >>> QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible >>> to debug this on my laptop without compiling a custom kernel and >>> rebooting just to add printks [3]. >>> >>> 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b >> >> I can't really make this reproducer fail. It builds and runs fine, but just >> exits with no messages (other than the one about drop_caches in dmesg). >> It creates the 1MB file and always returns 0. Ideas? >> >> -h > > It's a race condition, so it doesn't happen 100% of the time. I imagine > it depends on the storage speed, as well. On my laptop, which is > dm-crypt on top of an SSD, it works about 50% of the time. Could you > just try running it 100 times or something and see if it fails? $for i ($(seq 1 1000)) ./pread_eexist_repro /mnt/test/$i || echo "fail" ..couple of thousand runs without problem, only lots of fallocating and cache dropping. Oh well, I tried. :) -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
On 11/10/16 16:06, David Sterba wrote: > On Wed, Nov 09, 2016 at 03:26:50PM -0800, Omar Sandoval wrote: >> From: Omar Sandoval>> [snip] >> Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map >> insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where >> two threads race on adding the same extent map to an inode's extent map >> tree. However, if the added em is merged with an adjacent em in the >> extent tree, then we'll end up with an existing extent that is not >> identical to but instead encompasses the extent we tried to add. When we >> call merge_extent_mapping() to find the nonoverlapping part of the new >> em, the arithmetic overflows because there is no such thing. We then end >> up trying to add a bogus em to the em_tree, which results in a EEXIST >> that can bubble all the way up to userspace. > > Is this possibly the same bug that Liu Bo fixes in > https://patchwork.kernel.org/patch/9413129/ ? Seem similar, but I can't reproduce without that patch either.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: deal with existing encompassing extent map in btrfs_get_extent()
On 11/10/16 00:26, Omar Sandoval wrote: > From: Omar Sandoval> > My QEMU VM was seeing inexplicable I/O errors that I tracked down to > errors coming from the qcow2 virtual drive in the host system. The qcow2 > file is a nocow file on my Btrfs drive, which QEMU opens with O_DIRECT. > Every once in awhile, pread() or pwrite() would return EEXIST, which > makes no sense. This turned out to be a bug in btrfs_get_extent(). > > Commit 8dff9c853410 ("Btrfs: deal with duplciates during extent_map > insertion in btrfs_get_extent") fixed a case in btrfs_get_extent() where > two threads race on adding the same extent map to an inode's extent map > tree. However, if the added em is merged with an adjacent em in the > extent tree, then we'll end up with an existing extent that is not > identical to but instead encompasses the extent we tried to add. When we > call merge_extent_mapping() to find the nonoverlapping part of the new > em, the arithmetic overflows because there is no such thing. We then end > up trying to add a bogus em to the em_tree, which results in a EEXIST > that can bubble all the way up to userspace. > > Fix it by extending the identical extent map special case. > > Signed-off-by: Omar Sandoval > --- > Applies to 4.9-rc4. > > Here [1] is a reproducer for this bug that doesn't involve firing up a > QEMU VM. Also, a big shoutout to BCC [2] and BPF for making it possible > to debug this on my laptop without compiling a custom kernel and > rebooting just to add printks [3]. > > 1: https://gist.github.com/osandov/d08aabe5d4dec15517e9fde17012fd3b I can't really make this reproducer fail. It builds and runs fine, but just exits with no messages (other than the one about drop_caches in dmesg). It creates the 1MB file and always returns 0. Ideas? -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: make block group flags in balance printks human-readable
On 11/07/16 22:40, Adam Borowski wrote: > They're not even documented anywhere, letting users with no recourse but > to RTFS. It's no big burden to output the bitfield as words. > > Also, display unknown flags as hex. > > Signed-off-by: Adam Borowski[..] > > /* > + * explain bit flags, prefixed by a '|' that'll be dropped > + */ > +static char *describe_block_group_flags(char *buf, u64 flags) > +{ > +#define BUF_SIZE 128 > + char *buf0 = buf = kmalloc(BUF_SIZE, GFP_NOFS); [..] Maybe I'm missing some clever (?) trick here, but what's the point of passing in a potentially uninitialized 'buf' when it's immediately reassigned locally, and a new value is returned and assigned at the call site? IMHO you'd probably either want to pass the buffer in or return it, but not both - and in that case the allocation should probably be hoisted out into the caller as well, if only to make things a bit more symmetric. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: make block group flags in balance printks human-readable
On 11/04/16 08:26, Adam Borowski wrote: > They're not even documented anywhere, letting users with no recourse but > to RTFS. It's no big burden to output the bitfield as words. > > Also, display unknown flags as hex. > > Signed-off-by: Adam Borowski <kilob...@angband.pl> Very helpful and works (for me) as advertised. Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> > --- > fs/btrfs/relocation.c | 34 -- > 1 file changed, 32 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c > index 0ec8ffa..388216f 100644 > --- a/fs/btrfs/relocation.c > +++ b/fs/btrfs/relocation.c > @@ -4326,6 +4326,34 @@ static struct reloc_control > *alloc_reloc_control(struct btrfs_fs_info *fs_info) > } > > /* > + * explain bit flags, prefixed by a '|' that'll be dropped > + */ > +static void describe_block_group_flags(char *buf, u64 flags) > +{ > + if (!flags) > + *buf += sprintf(buf, "|NONE"); > + else { > +#define DESCRIBE_FLAG(f, d) \ > + if (flags & BTRFS_BLOCK_GROUP_##f) { \ > + buf += sprintf(buf, "|%s", d); \ > + flags &= ~BTRFS_BLOCK_GROUP_##f; \ > + } > + DESCRIBE_FLAG(DATA, "data"); > + DESCRIBE_FLAG(SYSTEM, "system"); > + DESCRIBE_FLAG(METADATA, "metadata"); > + DESCRIBE_FLAG(RAID0,"raid0"); > + DESCRIBE_FLAG(RAID1,"raid1"); > + DESCRIBE_FLAG(DUP, "dup"); > + DESCRIBE_FLAG(RAID10, "raid10"); > + DESCRIBE_FLAG(RAID5,"raid5"); > + DESCRIBE_FLAG(RAID6,"raid6"); > + if (flags) > + buf += sprintf(buf, "|0x%llx", flags); > + } > + *buf = 0; > +} > + > +/* > * function to relocate all extents in a block group. > */ > int btrfs_relocate_block_group(struct btrfs_root *extent_root, u64 > group_start) > @@ -4337,6 +4365,7 @@ int btrfs_relocate_block_group(struct btrfs_root > *extent_root, u64 group_start) > int ret; > int rw = 0; > int err = 0; > + char flags_str[128]; > > rc = alloc_reloc_control(fs_info); > if (!rc) > @@ -4381,9 +4410,10 @@ int btrfs_relocate_block_group(struct btrfs_root > *extent_root, u64 group_start) > goto out; > } > > + describe_block_group_flags(flags_str, rc->block_group->flags); > btrfs_info(extent_root->fs_info, > -"relocating block group %llu flags %llu", > -rc->block_group->key.objectid, rc->block_group->flags); > +"relocating block group %llu flags %s", > +rc->block_group->key.objectid, flags_str+1); > > btrfs_wait_block_group_reservations(rc->block_group); > btrfs_wait_nocow_writers(rc->block_group); > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: imporve delayed refs iterations
On 10/24/16 18:46, David Sterba wrote: > On Fri, Oct 21, 2016 at 05:05:07PM +0800, Wang Xiaoguang wrote: >> This issue was found when I tried to delete a heavily reflinked file, >> when deleting such files, other transaction operation will not have a >> chance to make progress, for example, start_transaction() will blocked >> in wait_current_trans(root) for long time, sometimes it even triggers >> soft lockups, and the time taken to delete such heavily reflinked file >> is also very large, often hundreds of seconds. Using perf top, it reports >> that: > >> [...] With this patch, it just took about 10~15 seconds to >> delte the same file. > > Great improvement! Patch looks good on a quick skim so I'll add it to > next, but proper review is still required. If it helps, I've been running it for ~2 days now with no negative side effects, mostly rsync creating & deleting files with various levels of reflinking (via snapshots). No problems at all. Also tried to manually create & delete a large file with heavy CoW and hundreds of reflinked copies - no problem either and pretty fast. So.. Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: pass correct args to btrfs_async_run_delayed_refs()
On Tue, 18 Oct 2016 15:56:13 +0800, Wang Xiaoguang wrote: > In btrfs_truncate_inode_items()->btrfs_async_run_delayed_refs(), we > swap the arg2 and arg3 wrongly, fix this. > > Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com> > --- > fs/btrfs/inode.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 2b790bd..2f1372b 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -4605,8 +4605,8 @@ int btrfs_truncate_inode_items(struct > btrfs_trans_handle *trans, > BUG_ON(ret); > if (btrfs_should_throttle_delayed_refs(trans, root)) > btrfs_async_run_delayed_refs(root, > - trans->transid, > - trans->delayed_ref_updates * 2, 0); > + trans->delayed_ref_updates * 2, > + trans->transid, 0); > if (be_nice) { > if (truncate_space_check(trans, root, > extent_num_bytes)) { Reviewed-by: Holger Hoffstätte <hol...@applied-asynchrony.com> Passing the wrong transid..why did this ever work? -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: fix false enospc for compression
On 10/06/16 04:51, Wang Xiaoguang wrote: > When testing btrfs compression, sometimes we got ENOSPC error, though fs > still has much free space, xfstests generic/171, generic/172, generic/173, > generic/174, generic/175 can reveal this bug in my test environment when > compression is enabled. > > After some debuging work, we found that it's btrfs_delalloc_reserve_metadata() > which sometimes tries to reserve plenty of metadata space, even for very small > data range. In btrfs_delalloc_reserve_metadata(), the number of metadata bytes > we try to reserve is calculated by the difference between outstanding_extents > and reserved_extents. Please see below case for how ENOSPC occurs: > > 1, Buffered write 128MB data in unit of 128KB, so finially we'll have inode > outstanding extents be 1, and reserved_extents be 1024. Note it's > btrfs_merge_extent_hook() that merges these 128KB units into one big > outstanding extent, but do not change reserved_extents. > > 2, When writing dirty pages, for compression, cow_file_range_async() will > split above big extent in unit of 128KB(compression extent size is 128KB). > When first split opeartion finishes, we'll have 2 outstanding extents and 1024 > reserved extents, and just right now the currently generated ordered extent is > dispatched to run and complete, then btrfs_delalloc_release_metadata()(see > btrfs_finish_ordered_io()) will be called to release metadata, after that we > will have 1 outstanding extents and 1 reserved extents(also see logic in > drop_outstanding_extent()). Later cow_file_range_async() continues to handles > left data range[128KB, 128MB), and if no other ordered extent was dispatched > to run, there will be 1023 outstanding extents and 1 reserved extent. > > 3, Now if another bufferd write for this file enters, then > btrfs_delalloc_reserve_metadata() will at least try to reserve metadata > for 1023 outstanding extents' metadata, for 16KB node size, it'll be > 1023*16384*2*8, > about 255MB, for 64K node size, it'll be 1023*65536*8*2, about 1GB metadata, > so > obviously it's not sane and can easily result in enospc error. > > The root cause is that for compression, its max extent size will no longer be > BTRFS_MAX_EXTENT_SIZE(128MB), it'll be 128KB, so current metadata reservation > method in btrfs is not appropriate or correct, here we introduce: > enum btrfs_metadata_reserve_type { > BTRFS_RESERVE_NORMAL, > BTRFS_RESERVE_COMPRESS, > }; > and expand btrfs_delalloc_reserve_metadata() and > btrfs_delalloc_reserve_space() > by adding a new enum btrfs_metadata_reserve_type argument. When a data range > will > go through compression, we use BTRFS_RESERVE_COMPRESS to reserve metatata. > Meanwhile we introduce EXTENT_COMPRESS flag to mark a data range that will go > through compression path. > > With this patch, we can fix these false enospc error for compression. > > Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com> I took some time again to get this into my tree on top of what's in btrfs-4.9rc1 and managed to merge it after all. Both this and patch #1 seem to work fine, and they don't seem to cause any regressions; ran a couple of both full and incremental rsync backups with >100GB on a new and now compressed subvolume without problem. Also Stefan just reported that his ENOSPC seems to be gone as well, so it seems to be good. \o/ So for both this and patch #1 have a careful: Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> Also a comment about something I found while resolving conflicts caused by the preliminary dedupe suppoort: [..] > int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end, > - struct extent_state **cached_state); > + struct extent_state **cached_state, int flag); > int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end, > - struct extent_state **cached_state); > + struct extent_state **cached_state, int flag); [..] > int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, > struct page **pages, size_t num_pages, > loff_t pos, size_t write_bytes, > - struct extent_state **cached); > + struct extent_state **cached, int flag); Instead of adding "int flag" why not use the already defined btrfs_metadata_reserve_type enum? I know it's just an int at the end of the day, but the dedupe support already added another "int dedupe" argument and it's probably easy to cause confusion. Maybe later it would be beneficial to consolidate the flags into a consistent set of enum values to prevent more "int flag" infla
Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full
On 10/07/16 09:17, Wang Xiaoguang wrote: > Hi, > > On 10/07/2016 03:03 PM, Stefan Priebe - Profihost AG wrote: >> Dear Wang, >> >> can't use v4.8.0 as i always get OOMs and total machine crashes. >> >> Complete traces with your patch and some more btrfs patches applied (in >> the hope in fixes the OOM but it did not): >> http://pastebin.com/raw/6vmRSDm1 > I didn't see any such OOMs... > Can you try holger's tree with my patches. They don't really apply to either 4.4.x (because it has diverged too much by now) or 4.8.x because of the initial dedupe support which came in as part of 4.9rc1 - there are way too many conflicts all over the place and merging them manually took way too much time. It would be useful if you could rebase your patches to for-next. Stefan, have you tried setting THP to 'madvise' or 'never'? Try 'echo madvise > /sys/kernel/mm/transparent_hugepage/enabled' or boot with transparent_hugepage=madvise (or never) kernel flag. I have no idea if it will help, but it's worth a try. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.8rc8 & OOM panic
On 09/28/16 21:45, Holger Hoffstätte wrote: > On 09/28/16 20:46, E V wrote: >> I just booted my backup box with 4.8rc8 and started an rsync onto >> btrfs and it panic'd with OOM a couple hours later. I thought the OOM >> problems from 4.7 we're supposed to be fixed in 4.8, or did I get that >> wrong? No users or anything else on the system. > > Keep in mind that those problems are generic, other filesystems also > suffer: http://www.spinics.net/lists/linux-mm/msg114123.html > > I don't see any recent compaction-related patches in Linus' tree at > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/mm/ Sorry, that was not quite correct. -stable 4.7.x and 4.8 contain a workaround (http://goo.gl/DiwfPH) that is supposed to help. That went into 4.8rc8 already, so not sure what happened in your case. > So unless they get merged in the last minute it looks like 4.8 will > be DOA. Running 4.8 final now, let's see what happens.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 0/6] Btrfs: free space tree and sanity test fixes
On 09/29/16 14:21, Anatoly Pugachev wrote: >> ... >> >> This is fixed by patch >> >> "Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf" >> >> that's in the 4.9 queue. Other than that, the self-tests seem to pass, >> thanks for the test. Would be good if you can test with the mentioned >> patch included or without integrity checker. Thanks for testing. > > updated git kernel to v4.8-rc8-8-gae6dd8d , applied this > "Btrfs: free space tree and sanity test fixes" patchset and added/applied > "Btrfs: remove unnecessary btrfs_mark_buffer_dirty in split_leaf" : > (snip) > Sep 29 00:40:55 ttip kernel: BTRFS: device fsid > 7bb81df9-0e2b-47f2-81ff-c08502d38da6 devid 1 transid 5 /dev/loop4 > Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): disk space caching is > enabled > Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): has skinny extents > Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): flagging fs with big > metadata feature > Sep 29 00:41:30 ttip kernel: BTRFS info (device loop4): creating UUID tree > Sep 29 00:41:31 ttip kernel: BTRFS: device fsid > d0ee7ca3-3be0-465f-857b-19e681181923 devid 1 transid 5 /dev/loop0 > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): enabling free space > tree > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): using free space tree > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): has skinny extents > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): flagging fs with big > metadata feature > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): creating free space > tree > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): setting 1 ro feature > flag > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): setting 2 ro feature > flag > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop0): creating UUID tree > Sep 29 00:41:32 ttip kernel: BTRFS critical (device loop4): corrupt leaf, > non-root leaf's nritems is 0: block=29556736,root=1, slot=0 > Sep 29 00:41:32 ttip kernel: BTRFS info (device loop4): leaf 29556736 total > ptrs 0 free space 16283 > Sep 29 00:41:32 ttip kernel: BTRFS: assertion failed: 0, file: > fs/btrfs/disk-io.c, line: 4059kernel BUG at fs/btrfs/ctree.h:3369! Try to add https://patchwork.kernel.org/patch/9332707/ aka "Btrfs: improve check_node to avoid reading corrupted nodes" which should return -EIO and prevent the BUG(). -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 4.8rc8 & OOM panic
On 09/28/16 20:46, E V wrote: > I just booted my backup box with 4.8rc8 and started an rsync onto > btrfs and it panic'd with OOM a couple hours later. I thought the OOM > problems from 4.7 we're supposed to be fixed in 4.8, or did I get that > wrong? No users or anything else on the system. Keep in mind that those problems are generic, other filesystems also suffer: http://www.spinics.net/lists/linux-mm/msg114123.html I don't see any recent compaction-related patches in Linus' tree at https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/mm/ ..so it seems they have not been merged yet. So unless they get merged in the last minute it looks like 4.8 will be DOA. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full
On 09/28/16 15:06, Stefan Priebe - Profihost AG wrote: > > Yes this is 4.4.22 and no i don't have qgroups enabled so it can't help. > > # btrfs qgroup show /path/ > ERROR: can't perform the search - No such file or directory > ERROR: can't list qgroups: No such file or director > > This is the same output on all backup machines. OK, that is really good to know (your other mails arrived just after I sent mine). The fact that you see this problem with all kernels - even with 4.8rc - *and* on all machines is good (in a way) because it means I haven't messed up anything, and we're not chasing ghosts caused by broken backport patches. >> would be unfortunate, but you could try to disable compression for a >> while and see what happens, assuming the space requirements allow this >> experiment. > Good idea but it does not. I hope i can reproduce this with my already > existing testscript which i've now bumped to use a 37TB partition and > big files rather than a 15GB part and small files. If i can reproduce it > i can also check whether disabling compression fixes this. Great. Remember to undo the compression on existing files, or create them from scratch. > No that's not the case. No rsync nor inplace is involved. I'm dumping > differences directly from ceph and put them on top of a base image but > only for 7 days. So it's not endless fragmenting the file. After 7 days > a clean whole image is dumped. That sounds sane but it's also not at all how you described things to me previosuly ;) But OK. How do you "dump differences directly from Ceph"? I'd assume the VM images are RBDs, but it sounds you're somehow using overlayfs. > yes and no - this is not idea and even very slow if your customers need > backups on a daily basis. So you must be able to mount a specific backup > very fast. And stacking on demand is mostly too slow - but this is far > away from the topic in this thread. I understand the desire to mount & immediately access backups - it's what I do here at home too (every machine can access its own last #n backups via NFS) and it's very useful. Anyway..something is off and you successfully cause it while other people apparently do not. Do you still use those nonstandard mount options with extremely long transaction flush times? -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS: space_info 4 has 18446742286429913088 free, is not full
On 09/28/16 13:35, Wang Xiaoguang wrote: > hello, > > On 09/28/2016 07:15 PM, Stefan Priebe - Profihost AG wrote: >> Dear list, >> >> is there any chance anybody wants to work with me on the following issue? > Though I'm also somewhat new to btrfs, but I'd like to. > >> >> BTRFS: space_info 4 has 18446742286429913088 free, is not full >> BTRFS: space_info total=98247376896, used=77036814336, pinned=0, >> reserved=0, may_use=1808490201088, readonly=0 >> >> i get this nearly every day. >> >> Here are some msg collected from today and yesterday from different servers: >> | BTRFS: space_info 4 has 18446742182612910080 free, is not full | >> | BTRFS: space_info 4 has 18446742254739439616 free, is not full | >> | BTRFS: space_info 4 has 18446743980225085440 free, is not full | >> | BTRFS: space_info 4 has 18446743619906420736 free, is not full | >> | BTRFS: space_info 4 has 18446743647369576448 free, is not full | >> | BTRFS: space_info 4 has 18446742286429913088 free, is not full >> >> What i tried so far without success: >> - use vanilla 4.8-rc8 kernel >> - use latest vanilla 4.4 kernel >> - use latest 4.4 kernel + patches from holger hoffstaette Was that 4.4.22? It contains a patch by Goldwyn Rodrigues called "Prevent qgroup->reserved from going subzero" which should prevent this from happening. This should only affect filesystems with enabled quota; you said you didn't have quota enabled, yet some quota-only patches caused problems on your system (despite being scheduled for 4.9 and apparently working fine everywhere else, even when I specifically tested them *with* quota enabled). So, long story short: something doesn't add up. It means either: - you tried my patchset for 4.4.21 (i.e. *without* the above patch) and should bump to .22 right away - you _do_ have qgroups enabled for some reason (systemd?) - your fs is corrupted and needs nuking - you did something else entirely - unknown unknowns aka. ¯\_(ツ)_/¯ There is also the chance that your use of compress-force (or rather compression in general) causes leakage; compression runs asynchronously and I wouldn't be surprised if that is still full of racy races..which would be unfortunate, but you could try to disable compression for a while and see what happens, assuming the space requirements allow this experiment. You have also not told us whether this happens only on one (potentially corrupted/confused) fs or on every one - my impression was that you have several sharded backup filesystems/machines; not sure if that is still the case. If it happens only on one specific fs chances are it's hosed. > I also met enospc error in 4.8-rc6 when doing big files create and delete > tests, > for my cases, I have written some patches to fix it. > Would you please apply my patches to have a try: > btrfs: try to satisfy metadata requests when every flush_space() returns > btrfs: try to write enough delalloc bytes when reclaiming metadata space > btrfs: make shrink_delalloc() try harder to reclaim metadata space These are all in my series for 4.4.22 and seem to work fine, however Stefan's workload has nothing directly to do with big files; instead it's the worst case scenario in terms of fragmentation (of huge files) and a huge number of extents: incremental backups of VMs via rsync --inplace with forced compression. IMHO this way of making backups is suboptimal in basically every possible way, despite its convenience appeal. With such huge space requirements it would be more effective to have a "current backup" to rsync into and then take a snapshot (for fs consistency), pack the snapshot to a tar.gz (massively better compression than with btrfs), dump them into your Ceph cluster as objects with expiry (preferrably a separate EC pool) and then immediately delete the snapshot from the local fs. That should relieve the landing fs from getting overloaded by COWing and too many snapshots (approx. #VMs * #versions). The obvious downside is that restoring an archived snapshot would require some creative efforts. Other alternatives exist, but are probably even more (too) expensive. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: handle quota reserve failure properly
On Thu, 15 Sep 2016 14:57:48 -0400, Josef Bacik wrote: > btrfs/022 was spitting a warning for the case that we exceed the quota. If we > fail to make our quota reservation we need to clean up our data space > reservation. Thanks, > > Signed-off-by: Josef Bacik> --- > fs/btrfs/extent-tree.c | 9 +++-- > 1 file changed, 3 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 03da2f6..d72eaae 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4286,13 +4286,10 @@ int btrfs_check_data_free_space(struct inode *inode, > u64 start, u64 len) > if (ret < 0) > return ret; > > - /* > - * Use new btrfs_qgroup_reserve_data to reserve precious data space > - * > - * TODO: Find a good method to avoid reserve data space for NOCOW > - * range, but don't impact performance on quota disable case. > - */ > + /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */ > ret = btrfs_qgroup_reserve_data(inode, start, len); > + if (ret) > + btrfs_free_reserved_data_space_noquota(inode, start, len); > return ret; > } > > -- > 2.7.4 This came up before, though slightly different: http://www.spinics.net/lists/linux-btrfs/msg56644.html Which version is correct - with or without _noquota ? -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix memory leak in do_walk_down
On 09/14/16 04:02, Liu Bo wrote: > The extent buffer 'next' needs to be free'd conditionally. > > Signed-off-by: Liu Bo> --- > fs/btrfs/extent-tree.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 5a940ab..779fd72 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -8882,6 +8882,7 @@ static noinline int do_walk_down(struct > btrfs_trans_handle *trans, > >flags[level - 1]); > if (ret < 0) { > btrfs_tree_unlock(next); > + free_extent_buffer(next); > return ret; > } > > This was fixed a long time ago by the following patch by Josef: "Btrfs: don't BUG() during drop snapshot" https://patchwork.kernel.org/patch/7002791/ ..which never got merged. I've been using it since 4.1.x until today without problems. Note that apparently Patchwork got confused here: downloading only the "patch" fragment won't work; you'll need the full "mbox" attachment. It also seems to have been sorted into the wrong list bucket for some reason. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Commit 'synchronize incompat feature bits with sysfs files' still missing in for-next?
I've noticed that the 4.5-rc commit 14e46e04: 'btrfs: synchronize incompat feature bits with sysfs files' [1] was reverted later in [2], but despite fixes to protect sysfs with locks & exorcise GFP_NOFS in favor of GFP_KERNEL it was never reinstated - neither for 4.5-final, nor later..and it's been quite a while since then. Is this still valid? I ask because I've just noticed I've had this in my tree since forever, but have never encountered a problem during balance - probably because of all the other fixes. thanks, Holger [1] goo.gl/2DBMSe [2] goo.gl/cIKgv5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
On 09/12/16 09:38, Wang Xiaoguang wrote: >> Actually even that is not true; both patches seem to be wrong in subtle >> ways. Naohiro's patch seems to prevent the deletion during balance, whereas >> yours prevents the cleaner from kicking in. > Indeed in my patch, I just change "struct mutex delete_unused_bgs_mutex" > to "struct rw_semaphore bg_delete_sem", and try to get bg_delete_sem when > we allocate data space, so this patch should work as before :) > >> >> As a simple reproducer you can convert from -mdup to -msingle (to create >> bloat) and then balance with -musage=10. Depending on which of the two >> patches are applied, you end with bloat that only grows and never shrinks, >> or bloat that ends up in mixed state (dup and single). > Can you give me a simple test script to reproduce your problem. > (I can write it myself, but I'm afraid I may misunderstand you :) ) I don't have a script and no time this week to play with this. Just create an fs with dup metadata, balance -mconvert=single and then back with -mconvert=dup. That's all I tried. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
On 09/09/16 12:18, Holger Hoffstätte wrote: > On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote: > >> cleaner_kthread() may run at any time, in which it'll call >> btrfs_delete_unused_bgs() >> to delete unused block groups. Because this work is asynchronous, it may >> also result >> in false ENOSPC error. > > > With this v3 I can now no longer balance (tested only with metadata). > New chunks are allocated (as balance does) but nothing ever shrinks, until > after unmount/remount, when the cleaner eventually kicks in. > > This might be related to the recent patch by Naohiro Aota: > "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" > > which by itself doesn't seem to do any harm (i.e. everything still seems > to work as expected). Actually even that is not true; both patches seem to be wrong in subtle ways. Naohiro's patch seems to prevent the deletion during balance, whereas yours prevents the cleaner from kicking in. As a simple reproducer you can convert from -mdup to -msingle (to create bloat) and then balance with -musage=10. Depending on which of the two patches are applied, you end with bloat that only grows and never shrinks, or bloat that ends up in mixed state (dup and single). Undoing both makes both balancing and cleaning work again. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] btrfs: should block unused block groups deletion work when allocating data space
On Fri, 09 Sep 2016 16:17:48 +0800, Wang Xiaoguang wrote: > cleaner_kthread() may run at any time, in which it'll call > btrfs_delete_unused_bgs() > to delete unused block groups. Because this work is asynchronous, it may also > result > in false ENOSPC error. With this v3 I can now no longer balance (tested only with metadata). New chunks are allocated (as balance does) but nothing ever shrinks, until after unmount/remount, when the cleaner eventually kicks in. This might be related to the recent patch by Naohiro Aota: "btrfs: let btrfs_delete_unused_bgs() to clean relocated bgs" which by itself doesn't seem to do any harm (i.e. everything still seems to work as expected). -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Finding only non-snapshots via btrfs subvol list
On 07/21/16 16:55, Holger Hoffstätte wrote: > I'm trying to find non-snapshots, i.e. 'top-level' subvolumes in a > filesystem and this seems harder than it IMHO should be. > > The fs is just like: > > /mnt/stuff > subvolA > subvolA-date1 > subvolA-date2 > subvolB > subvolB-date1 > subvolB-date2 > .. > > All I want are the subvol{A,B} *without* the snapshots, but so > far I haven't been able to accomplish this easily with "subvol list" > and its options. -s lists only snapshots, but what I want is the > exact opposite. This question received a deafening lack of feedback, so I just took a swing at this and apparently hit something. When you have a set of subvols and snapshots like so: $./btrfs subvolume list /t/btrfs ID 257 gen 13 top level 5 path a ID 258 gen 16 top level 5 path b ID 259 gen 15 top level 5 path c ID 260 gen 11 top level 5 path a1 ID 261 gen 12 top level 5 path a2 ID 263 gen 14 top level 5 path b1 ID 264 gen 15 top level 5 path c1 ID 265 gen 16 top level 5 path b2 where a,b,c are subvolumes and ?{1,2,3} are snapshots, you can now do: $./btrfs subvolume list -P /t/btrfs ID 257 gen 13 top level 5 path a ID 258 gen 16 top level 5 path b ID 259 gen 15 top level 5 path c Is this of interest? I find it useful to iterate over all parent subvols (though you'll still need cut or awk to get only the name) without accidentally hitting the snapshots, or relying on fragile inhouse naming conventions. The -P was the only meaningful letter left (P for Parent). I first used -S (for grown-up -s ;) but that was already used for matching getopt on --sort. If -S is deemed better I can reroute that to -Z or something, since it's unused in short form. The patch is surprisingly small and was quite easy to write. Nice! cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Please disable balance auto-resume for 4.9 (or even 4.8)
Automatically resuming an interrupted balance has repeatedly caused all sorts of problems because it creates a possible failure mode when a user can least use it: after a crash/power loss/sudden reboot (which, like it or not, is the de facto "fix random problems" approach for many people). The idea behind the automnatic resume is good and important for cases like automation and unattended operation, but nevertheless right now it creates more problems than it fixes. As far as I can see it should be easy enough to simply disable calling btrfs_resume_balance_async() at least on mount (in open_ctree()) and possibly on remount as well. The skip_balance flag could then simply be ignored or removed. I can't say how much work it would be to completely remove the persistent balance state or whether it is useful to be kept around for resume, but at least not continuing would stop filesystems from eating themselves further on mount. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: check btree node's nritems
On 08/05/16 11:24, Holger Hoffstätte wrote: > On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote: > >> When btree node (level = 1) has nritems which equals to zero, >> we can end up with panic due to insert_ptr()'s >> >> BUG_ON(slot > nritems); >> >> where slot is 1 and nritems is 0, as copy_for_split() calls >> insert_ptr(.., path->slots[1] + 1, ...); >> >> A invalid value results in the whole mess, this adds the check >> for btree's node nritems so that we stop reading block when >> when something is wrong. >> >> Signed-off-by: Liu Bo <bo.li@oracle.com> >> --- >> fs/btrfs/disk-io.c | 17 + >> 1 file changed, 17 insertions(+) >> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c >> index 37d1780..a5a22be 100644 >> --- a/fs/btrfs/disk-io.c >> +++ b/fs/btrfs/disk-io.c >> @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root, >> return 0; >> } >> >> +static noinline int check_node(struct btrfs_root *root, >> + struct extent_buffer *node) >> +{ >> +unsigned long nr = btrfs_header_nritems(node); >> + >> +if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) { >> +btrfs_crit(root->fs_info, >> + "corrupt node: block %llu root %llu nritems %lu\n", > > I think the trailing \n can be dropped here, btrfs_crit() already provides > a proper newline. On top of that I get a whole bunch of false positives with this patch. Files that are perfectly readable without it now error out, in which case the logged nritems is always 493 - regardless of file or containing subvolume. Something is fishy here. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: check btree node's nritems
On Wed, 03 Aug 2016 12:57:28 -0700, Liu Bo wrote: > When btree node (level = 1) has nritems which equals to zero, > we can end up with panic due to insert_ptr()'s > > BUG_ON(slot > nritems); > > where slot is 1 and nritems is 0, as copy_for_split() calls > insert_ptr(.., path->slots[1] + 1, ...); > > A invalid value results in the whole mess, this adds the check > for btree's node nritems so that we stop reading block when > when something is wrong. > > Signed-off-by: Liu Bo> --- > fs/btrfs/disk-io.c | 17 + > 1 file changed, 17 insertions(+) > > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c > index 37d1780..a5a22be 100644 > --- a/fs/btrfs/disk-io.c > +++ b/fs/btrfs/disk-io.c > @@ -612,6 +612,20 @@ static noinline int check_leaf(struct btrfs_root *root, > return 0; > } > > +static noinline int check_node(struct btrfs_root *root, > +struct extent_buffer *node) > +{ > + unsigned long nr = btrfs_header_nritems(node); > + > + if (nr <= 0 || nr >= BTRFS_NODEPTRS_PER_BLOCK(root)) { > + btrfs_crit(root->fs_info, > +"corrupt node: block %llu root %llu nritems %lu\n", I think the trailing \n can be dropped here, btrfs_crit() already provides a proper newline. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: memory overflow or undeflow in free space tree / space_info?
On Fri, 29 Jul 2016 22:57:36 +, Holger Hoffstätte wrote: > The only other patch I just found missing and which looks like it > could/should (I think?) work on top of the 4.4.x pagesize-based > calculations in file.c is: > > a2af23b7 "__btrfs_buffered_write: Pass valid file offset when > releasing delalloc space" > > Would that make sense? No it wouldn't, not without some other sectorsize-related patches that came before...and those would just make matters worse. So forget the above. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: memory overflow or undeflow in free space tree / space_info?
On Fri, 29 Jul 2016 17:03:43 -0400, Josef Bacik wrote: > On 07/29/2016 03:14 PM, Omar Sandoval wrote: >> On Fri, Jul 29, 2016 at 12:11:53PM -0700, Omar Sandoval wrote: >>> On Fri, Jul 29, 2016 at 08:40:26PM +0200, Stefan Priebe - Profihost AG >>> wrote: Dear list, i'm seeing btrfs no space messages frequently on big filesystems (> 30TB). In all cases i'm getting a trace like this one a space_info warning. (since commit [1]). Could someone please be so kind and help me debugging / fixing this bug? I'm using space_cache=v2 on all those systems. >>> >>> Hm, so I think this indicates a bug in space accounting somewhere else >>> rather than the free space tree itself. I haven't debugged one of these >>> issues before, I'll see if I can reproduce it. Cc'ing Josef, too. >> >> I should've asked, what sort of filesystem activity triggers this? >> > > Chris just fixed this I think, try his next branch from his git tree > > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git > > and see if it still happens. Thanks, > > Josef Hi Josef, can you say which patch you have in mind? The tree in question doesn't have any of Chandra's pagesize/sectorsize patches (carefully patched around, for stability and LTS patchability) so I hope it's not the recent commit 8b8b08cb "fix delalloc accounting after copy_from_user faults" because that would be too fiddly (at least for me) to backport correctly. The only other patch I just found missing and which looks like it could/should (I think?) work on top of the 4.4.x pagesize-based calculations in file.c is: a2af23b7 "__btrfs_buffered_write: Pass valid file offset when releasing delalloc space" Would that make sense? Neither I nor any other users of that tree have observed weird space-info underflows so far (and I use my fs daily), so it's definitely something peculiar Stefan is doing with his weird compressed rsync-inplace workload. Odd sector offsets causing slowly creeping space_info underflow sounds to me like it just might be the problem. thanks, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: deal with unexpected return value in flush_space
On Wed, 27 Jul 2016 18:42:03 -0700, Liu Bo wrote: > Function start_transaction() can return ERR_PTR(1) when flush is > BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is > > start_transaction (return ERR_PTR(1)) > -> btrfs_block_rsv_add (return 1) > -> reserve_metadata_bytes (return 1) > -> flush_space (return 1) >-> do_chunk_alloc (return 1) > > With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the > flush_state of ALLOC_CHUNK and it successfully allocates a new > chunk, then instead of trying to reserve space again, > reserve_metadata_bytes returns 1 immediately. > > Eventually the callers who call start_transaction() usually just > do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get > a panic when dereferencing a pointer which is ERR_PTR(1). > > This makes flush_space() translate 'ret = 1' to 'ret = 0'. > > Signed-off-by: Liu Bo> --- > We found this 'NULL pointer dereference' on an old 3.8 kernel but > it's not going to happen on the upstream since there is no caller > of btrfs_start_transaction_lflush(). > > fs/btrfs/extent-tree.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) > > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 7a35c9d..a00fb67 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c > @@ -4457,6 +4457,15 @@ void check_system_chunk(struct btrfs_trans_handle > *trans, > } > } > > +/* > + * If force is CHUNK_ALLOC_FORCE: > + *- return 1 if it successfully allocates a chunk, > + *- return errors including -ENOSPC otherwise. > + * If force is NOT CHUNK_ALLOC_FORCE: > + *- return 0 if it doesn't need to allocate a new chunk, > + *- return 1 if it successfully allocates a chunk, > + *- return errors including -ENOSPC otherwise. > + */ > static int do_chunk_alloc(struct btrfs_trans_handle *trans, > struct btrfs_root *extent_root, u64 flags, int force) > { > @@ -4857,7 +4866,7 @@ static int flush_space(struct btrfs_root *root, >btrfs_get_alloc_profile(root, 0), >CHUNK_ALLOC_NO_FORCE); > btrfs_end_transaction(trans, root); > - if (ret == -ENOSPC) > + if (ret == -ENOSPC || ret == 1) > ret = 0; > break; > case COMMIT_TRANS: > -- > 2.5.5 For reviewers - this came up before here: https://patchwork.kernel.org/patch/7778651/ Same fix basically. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Finding only non-snapshots via btrfs subvol list
I'm trying to find non-snapshots, i.e. 'top-level' subvolumes in a filesystem and this seems harder than it IMHO should be. The fs is just like: /mnt/stuff subvolA subvolA-date1 subvolA-date2 subvolB subvolB-date1 subvolB-date2 .. All I want are the subvol{A,B} *without* the snapshots, but so far I haven't been able to accomplish this easily with "subvol list" and its options. -s lists only snapshots, but what I want is the exact opposite. So far the best I could find - except for relying on my ad-hoc naming conventions and inverse-grepping for that - is via -q (print parent UUID) and matching on that: btrfs subvol list -q /mnt/stuff | grep "parent_uuid -" | cut -f 11 -d " " gives me what I want - but eeww. So somehow I think I'm missing something trivial. Is there a better, non-greppy way? Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/4] update bytes_may_use timely to avoid false ENOSPC issue
On 07/20/16 07:56, Wang Xiaoguang wrote: > Currently in btrfs, for data space reservation, it does not update > bytes_may_use in btrfs_update_reserved_bytes() and the decrease operation > will be delayed to be done in extent_clear_unlock_delalloc(), for > fallocate(2), decrease operation is even delayed to be done in end > of btrfs_fallocate(), which is too late. Obviously such delay will > cause unnecessary pressure to enospc system, in [PATCH 4/4], there is > a simpel test script that can reveal such false ENOSPC bug. > > So in this patch set, it will remove RESERVE_FREE, RESERVE_ALLOC and > RESERVE_ALLOC_NO_ACCOUNT, and we always update bytes_may_use timely. > > I'll also commit a fstests test case for this issue. > > Wang Xiaoguang (4): > btrfs: use correct offset for reloc_inode in > prealloc_file_extent_cluster() > btrfs: divide btrfs_update_reserved_bytes() into two functions > btrfs: introduce new EXTENT_CLEAR_DATA_RESV flag > btrfs: update btrfs_space_info's bytes_may_use timely > Just like the previous version, for all 4 patches: Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> via the reproducer script & some very large manual fallocates. thanks! Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ENOSPC / no space on very large devices
On 07/20/16 07:31, Stefan Priebe - Profihost AG wrote: > Hi list, > > while i didn't had the problem for some month i'm now getting ENOSPC on > a regular basis on one host. Well, it's getting better. :) > if i umount the volume i get traces (i already did a clear_cache 4 days > ago to recalculate the space_tree): > > [545031.675797] [ cut here ] > [545031.725166] WARNING: CPU: 1 PID: 17711 at > fs/btrfs/extent-tree.c:5710 btrfs_free_block_groups+0x35a/0x400 [btrfs]() This is "only" a warning, but as we can see below it indicates a real problem. The warning was added only recently to for-next by the patch called "Btrfs: warn_on for unaccounted spaces" [1], but I've had it in my tree forever. Never seen the warning myself. (snip) > [545037.909700] BTRFS: space_info 4 has 18446743523026157568 free, is > not full Wow, ~18.4 exabytes really is a lot of free space. :) So it looks like something underflowed the space_info and now things are confused for about ~550 GB. Unfortunately I have no good idea how to fix that. :( > The kernel is something special - i'm using this one from holger: > https://github.com/hhoffstaette/kernel-patches > > which is basically a 4.4.15 + several patches especially a lot of btrfs > patches up to 4.8 i think. More like for-next with all the pagesize/sectorsize stuff carefully avoided. I'm really looking forward to 4.8, this is becoming unwieldy.. -h [1] https://git.kernel.org/cgit/linux/kernel/git/kdave/linux.git/commit/?h=for-next=d555b6c380c644af63dbdaa7cc14bba041a4e4dd -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kdave/for-next commit 26112f7f472
On 07/08/16 13:55, Jeff Mahoney wrote: > On 7/8/16 7:19 AM, Holger Hoffstätte wrote: >> On 07/08/16 06:24, Jeff Mahoney wrote: >>> Hi Dave - >>> >>> This commit introduces a bug. I ran across it when running xfstests >>> against my own integrated branch. >> >> I can't find that commit id anywhere...? > > Hi Holger - > > This is the for-next branch. It's not in any mainline branch yet. Yes, I understand that. I searched in the github/kdave tree, which only has the for-next-xyz branches. Found it in the one on kernel.org. -h signature.asc Description: OpenPGP digital signature
Re: kdave/for-next commit 26112f7f472
On 07/08/16 06:24, Jeff Mahoney wrote: > Hi Dave - > > This commit introduces a bug. I ran across it when running xfstests > against my own integrated branch. I can't find that commit id anywhere...? > The problem is that btrfs_calc_reclaim_metadata_size didn't used to be > called from recovery, so it was safe to use fs_info->fs_root. With > commit 7c83c6a09 (Btrfs: don't bother kicking async if there's nothing > to reclaim) we do call it from recovery context and fs_info->fs_root is > NULL. > > The fix is to just not switch btrfs_calc_reclaim_metadata_size to take > an fs_info. All the other call sites were using fs_info->fs_root > anyway, so it's not like we're pinning a root somewhere just for this call. I've had this patch from last October in my 4.4.x tree forever: http://www.spinics.net/lists/linux-btrfs/msg48457.html Apparently it fell off the table. Shouldn't that fix it? -h signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On 07/06/16 17:20, Hugo Mills wrote: > On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote: >> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte >> <hol...@applied-asynchrony.com> wrote: >>> On 07/06/16 14:25, Wang Shilong wrote: >>>> 'btrfs file du' is a very useful tool to watch my system >>>> file usage with snapshot aware. >>>> >>>> when trying to run following commands: >>>> [root@localhost btrfs-progs]# btrfs file du / >>>> Total Exclusive Set shared Filename >>>> ERROR: Failed to lookup root id - Inappropriate ioctl for device >>>> ERROR: cannot check space of '/': Unknown error -1 >>>> >>>> and My Filesystem looks like this: >>>> [root@localhost btrfs-progs]# df -Th >>>> Filesystem Type Size Used Avail Use% Mounted on >>>> devtmpfs devtmpfs 16G 0 16G 0% /dev >>>> tmpfs tmpfs 16G 368K 16G 1% /dev/shm >>>> tmpfs tmpfs 16G 1.4M 16G 1% /run >>>> tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup >>>> /dev/sda3 btrfs 60G 19G 40G 33% / >>>> tmpfs tmpfs 16G 332K 16G 1% /tmp >>>> /dev/sdc btrfs 2.8T 166G 1.7T 9% /data >>>> /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot >>>> /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi >>>> tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 >>>> >>>> So I installed Btrfs as my root partition, but boot partition >>>> can be other fs. >>>> >>>> We can Let btrfs tool aware of this is not a btrfs file or >>>> directory and skip those files, so that someone like me >>>> could just run 'btrfs file du /' to scan all btrfs filesystems. >>>> >>>> After patch, it will look like: >>>>Total Exclusive Set shared Filename >>>> skipping not btrfs dir/file: boot >>>> skipping not btrfs dir/file: dev >>>> skipping not btrfs dir/file: proc >>>> skipping not btrfs dir/file: run >>>> skipping not btrfs dir/file: sys >>>> 0.00B 0.00B - //root/.bash_logout >>>> 0.00B 0.00B - //root/.bash_profile >>>> 0.00B 0.00B - //root/.bashrc >>>> 0.00B 0.00B - //root/.cshrc >>>> 0.00B 0.00B - //root/.tcshrc >>>> >>>> This works for me to analysis system usage and analysis >>>> performaces. >>> >>> This is great, but can we please skip the "skipping .." messages? >>> Maybe it's just me but I really don't see the value of printing them >>> when they don't contribute to the result. >>> They also mess up the display. :) >> >> I don't have a taste whether it needed or not, because it is somehow >> useful to let users know some files/directories skipped When you run "find /path -type d" you don't get messages for all the things you just didn't want to find either. >At the absolute minimum, I think that these messages should go to > stderr (like du does when it deosn't have permissions), and should go > away with -q. They're still irritating, but at least you can get rid > of them easily. If anything this should require a --verbose, not the other way around. Maybe instead of breaking the output just indicate the special status via "-- --" values, or default to 0.00? Still, we're explicitly only interested in btrfs stuff and not anything else, so printing non-information can only yield noise. This is very much orthogonal to not printing anything after an otherwise successful command execution. -h signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On 07/06/16 14:25, Wang Shilong wrote: > 'btrfs file du' is a very useful tool to watch my system > file usage with snapshot aware. > > when trying to run following commands: > [root@localhost btrfs-progs]# btrfs file du / > Total Exclusive Set shared Filename > ERROR: Failed to lookup root id - Inappropriate ioctl for device > ERROR: cannot check space of '/': Unknown error -1 > > and My Filesystem looks like this: > [root@localhost btrfs-progs]# df -Th > Filesystem Type Size Used Avail Use% Mounted on > devtmpfs devtmpfs 16G 0 16G 0% /dev > tmpfs tmpfs 16G 368K 16G 1% /dev/shm > tmpfs tmpfs 16G 1.4M 16G 1% /run > tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup > /dev/sda3 btrfs 60G 19G 40G 33% / > tmpfs tmpfs 16G 332K 16G 1% /tmp > /dev/sdc btrfs 2.8T 166G 1.7T 9% /data > /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot > /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi > tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 > > So I installed Btrfs as my root partition, but boot partition > can be other fs. > > We can Let btrfs tool aware of this is not a btrfs file or > directory and skip those files, so that someone like me > could just run 'btrfs file du /' to scan all btrfs filesystems. > > After patch, it will look like: >Total Exclusive Set shared Filename > skipping not btrfs dir/file: boot > skipping not btrfs dir/file: dev > skipping not btrfs dir/file: proc > skipping not btrfs dir/file: run > skipping not btrfs dir/file: sys > 0.00B 0.00B - //root/.bash_logout > 0.00B 0.00B - //root/.bash_profile > 0.00B 0.00B - //root/.bashrc > 0.00B 0.00B - //root/.cshrc > 0.00B 0.00B - //root/.tcshrc > > This works for me to analysis system usage and analysis > performaces. This is great, but can we please skip the "skipping .." messages? Maybe it's just me but I really don't see the value of printing them when they don't contribute to the result. They also mess up the display. :) thanks, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()
On 07/06/16 12:37, Wang Xiaoguang wrote: > Below test scripts can reproduce this false ENOSPC: > #!/bin/bash > dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128 > dev=$(losetup --show -f fs.img) > mkfs.btrfs -f -M $dev > mkdir /tmp/mntpoint > mount /dev/loop0 /tmp/mntpoint > cd mntpoint > xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile > > Above fallocate(2) operation will fail for ENOSPC reason, but indeed > fs still has free space to satisfy this request. The reason is > btrfs_fallocate() dose not decrease btrfs_space_info's bytes_may_use > just in time, and it calls btrfs_free_reserved_data_space_noquota() in > the end of btrfs_fallocate(), which is too late and have already added > false unnecessary pressure to enospc system. See call graph: > btrfs_fallocate() > |-> btrfs_alloc_data_chunk_ondemand() > It will add btrfs_space_info's bytes_may_use accordingly. > |-> btrfs_prealloc_file_range() > It will call btrfs_reserve_extent(), but note that alloc type is > RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will > only increase btrfs_space_info's bytes_reserved accordingly, but > will not decrease btrfs_space_info's bytes_may_use, then obviously > we have overestimated real needed disk space, and it'll impact > other processes who do write(2) or fallocate(2) operations, also > can impact metadata reservation in mixed mode, and bytes_max_use > will only be decreased in the end of btrfs_fallocate(). To fix > this false ENOSPC, we need to decrease btrfs_space_info's > bytes_may_use in btrfs_prealloc_file_range() in time, as what we > do in cow_file_range(), > See call graph in : > cow_file_range() > |-> extent_clear_unlock_delalloc() > |-> clear_extent_bit() > |-> btrfs_clear_bit_hook() > |-> btrfs_free_reserved_data_space_noquota() > This function will decrease bytes_may_use accordingly. > > So this patch choose to call btrfs_free_reserved_data_space() in > __btrfs_prealloc_file_range() for both successful and failed path. > > Also this patch removes some old and useless comments. > > Signed-off-by: Wang Xiaoguang <wangxg.f...@cn.fujitsu.com> Verified that the reproducer script indeed fails (with btrfs ~4.7) and the patch (on top of 1/2) fixes it. Also ran a bunch of other fallocating things without problem. Free space also still seems sane, as far as I could tell. So for both patches: Tested-by: Holger Hoffstätte <hol...@applied-asynchrony.com> cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] btrfs-progs: check improve 'checking extents' scalability
On 06/23/16 21:26, je...@suse.com wrote: > From: Jeff Mahoney> > While running xfstests generic/291, which creates a single file populated > with reflinks to the same extent, I found that fsck had been running for > hours. perf top lead me to find_data_backref as the culprit, and a litte > more digging made it clear: For every extent record we add, we iterate > the entire list first. My test case had ~2M records. That math doesn't > go well. > > This patchset converts the extent_backref list to an rbtree. The test > that used to run for more than 8 hours without completing now takes > less than 20 seconds. Very interesting. Time for science! unpatched: holger>time btrfs check /dev/sdc1 Checking filesystem on /dev/sdc1 .. btrfs check /dev/sdc1 17.03s user 3.79s system 25% cpu 1:22.82 total patched: holger>time btrfs check /dev/sdc1 Checking filesystem on /dev/sdc1 .. btrfs check /dev/sdc1 17.03s user 3.74s system 24% cpu 1:23.24 total This is a 1TB disk with ~850GB data in 4 subvolumes, ~2 snapshots each. I guess it only starts to matter (relative to the necessary I/O cost per extent) when the level of sharing is higher, i.e. many more snapshots? OTOH it doesn't explode, so that's good. :) cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 2/2] btrfs: wait for bdev put
On 06/21/16 12:24, Anand Jain wrote: > From: Anand Jain <anand.j...@oracle.com> > > Further to the commit > bc178622d40d87e75abc131007342429c9b03351 > btrfs: use rcu_barrier() to wait for bdev puts at unmount > > This patch implements a method to time wait on the __free_device() > which actually does the bdev put. This is needed as the user space > running 'btrfs fi show -d' immediately after the replace and > unmount, is still reading older information from the device. > > mail-archive.com/linux-btrfs@vger.kernel.org/msg54188.html > > Signed-off-by: Anand Jain <anand.j...@oracle.com> > [updates: bc178622d40d87e75abc131007342429c9b03351] > --- > v2: Also to make sure bdev_closing is set it needs rcu_barrier(), > restored rcu_barrier(). Looks like this one works reliably again. ;) Tested with a slow disk, no long unmounts or timeout messages. Tested-by: Holger Hoffstätte <holger.hoffstae...@applied-asynchrony.com> thanks! Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/2] btrfs: wait for bdev put
On Tue, 14 Jun 2016 18:55:26 +0800, Anand Jain wrote: > Further to the previous commit > bc178622d40d87e75abc131007342429c9b03351 > btrfs: use rcu_barrier() to wait for bdev puts at unmount > > Since free_device() spinoff __free_device() the rcu_barrier() for > call_rcu(>rcu, free_device); > didn't help. > > This patch reverts changes by > bc178622d40d87e75abc131007342429c9b03351 > and implement a method to wait on the __free_device() by using > a new bdev_closing member in struct btrfs_device. > > Signed-off-by: Anand Jain> [rework: bc178622d40d87e75abc131007342429c9b03351] > --- > fs/btrfs/volumes.c | 44 ++-- > fs/btrfs/volumes.h | 1 + > 2 files changed, 39 insertions(+), 6 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index a4e8d48acd4b..404ce1daebb1 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -27,6 +27,7 @@ > #include > #include > #include > +#include > #include > #include "ctree.h" > #include "extent_map.h" > @@ -254,6 +255,17 @@ static struct btrfs_device *__alloc_device(void) > return dev; > } > > +static int is_device_closing(struct list_head *head) > +{ > + struct btrfs_device *dev; > + > + list_for_each_entry(dev, head, dev_list) { > + if (dev->bdev_closing) > + return 1; > + } > + return 0; > +} > + > static noinline struct btrfs_device *__find_device(struct list_head *head, > u64 devid, u8 *uuid) > { > @@ -832,12 +844,22 @@ again: > static void __free_device(struct work_struct *work) > { > struct btrfs_device *device; > + struct btrfs_device *new_device_addr; > > device = container_of(work, struct btrfs_device, rcu_work); > > if (device->bdev) > blkdev_put(device->bdev, device->mode); > > + /* > + * If we are coming here from btrfs_close_one_device() > + * then it allocates a new device structure for the same > + * devid, so find device again with the devid > + */ > + new_device_addr = __find_device(>fs_devices->devices, > + device->devid, NULL); > + > + new_device_addr->bdev_closing = 0; > rcu_string_free(device->name); > kfree(device); > } > @@ -884,6 +906,12 @@ static void btrfs_close_one_device(struct btrfs_device > *device) > list_replace_rcu(>dev_list, _device->dev_list); > new_device->fs_devices = device->fs_devices; > > + /* > + * So to wait for kworkers to finish all blkdev_puts, > + * so device is really free when umount is done. > + */ > + new_device->bdev_closing = 1; > + > call_rcu(>rcu, free_device); > } > > @@ -912,6 +940,7 @@ int btrfs_close_devices(struct btrfs_fs_devices > *fs_devices) > { > struct btrfs_fs_devices *seed_devices = NULL; > int ret; > + int retry_cnt = 5; > > mutex_lock(_mutex); > ret = __btrfs_close_devices(fs_devices); > @@ -927,12 +956,15 @@ int btrfs_close_devices(struct btrfs_fs_devices > *fs_devices) > __btrfs_close_devices(fs_devices); > free_fs_devices(fs_devices); > } > - /* > - * Wait for rcu kworkers under __btrfs_close_devices > - * to finish all blkdev_puts so device is really > - * free when umount is done. > - */ > - rcu_barrier(); > + > + while (is_device_closing(_devices->devices) && > + --retry_cnt) { > + mdelay(1000); //1 sec > + } > + > + if (!(retry_cnt > 0)) > + printk(KERN_WARNING "BTRFS: %pU bdev_put didn't complete, > giving up\n", > + fs_devices->fsid); > return ret; > } > > diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h > index 0ac90f8d85bd..945e49f5e17d 100644 > --- a/fs/btrfs/volumes.h > +++ b/fs/btrfs/volumes.h > @@ -150,6 +150,7 @@ struct btrfs_device { > /* Counter to record the change of device stats */ > atomic_t dev_stats_ccnt; > atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX]; > + int bdev_closing; > }; > > /* > -- > 2.7.0 I gave this a try and somehow it seems to make unmounting worse: it now always takes ~5s (max retry time) and I see the warning every time. Without the patch unmounting a single volume (disk) is much faster (1-2s), without problems. Any ideas? cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Another 'traling page extent' observation
Hi, I have made another observation regarding extra extents; it seems I'm good at finding these things. Sorry. ;-) This time it's with rsync. I found it when I started to use the --inplace option, which doesn't do rsync's usual write-to-temporary/rename cycle, but instead writes to a destination file directly. All of a sudden many newly backed up files had a traling 4k extent, for no good reason. This has nothing to do with extending overwrites (where new extents would of course be fine); it happens when the file is new. It is also independent of the file size or the filesystem state: it does not seem to be caused by fragmented free space. Reproducer example (current dir is btrfs): $ls -al /tmp/data -rw-r--r-- 1 root root 17569552 Jun 14 16:33 /tmp/data $rm -f data && rsync /tmp/data . && sync && filefrag -ek data Filesystem type is: 9123683e File size of data is 17569552 (17160 blocks of 1024 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 17159: 53918020.. 53935179: 17160: last,eof data: 1 extent found $rm -f data && rsync --inplace /tmp/data . && sync && filefrag -ek data Filesystem type is: 9123683e File size of data is 17569552 (17160 blocks of 1024 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0.. 17155: 48133532.. 48150687: 17156: 1:17156.. 17159: 36734592.. 36734595: 4: 48150688: last,eof data: 2 extents found This is repeatable and independent of the file, so I suspect that after Liu Bo's fix for the previously reported stray extents in the middle of the file with slow buffered writes [1] there's an edge case where a page is still treated differently at the end after close()-ing the file - which rsync does correctly. This is on my 4.4.x++ kernel with btrfs ~4.7 and space_cache=v2, but since it also happens on a fresh volume with v1 it's probably just another off-by-one somewhere in the writeback/page handling. thanks, Holger [1] commit a91326679f aka "Btrfs: make mapping->writeback_index point to the last written page" -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug] Kernel 4.6.1 errors
On Tue, 07 Jun 2016 17:54:26 +0200, g6094199 wrote: > he guys! > > I´m running Debian Sid where i have found several kernel errors today > and most of them are btrfs related. > > uname -a > Linux NAS 4.6.0-trunk-amd64 #1 SMP Debian 4.6-1~exp1 (2016-05-17) x86_64 > GNU/Linux This is 4.6.0-whatever, not 4.6.1. The problem is fixed in the real 4.6.1 (released last week), which you can find described in commit 4704fa54.. at https://cdn.kernel.org/pub/linux/kernel/v4.x/ChangeLog-4.6.1 -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Hot data tracking / hybrid storage
On 05/29/16 19:53, Chris Murphy wrote: > But I'm skeptical of bcache using a hidden area historically for the > bootloader, to put its device metadata. I didn't realize that was the > case. Imagine if LVM were to stuff metadata into the MBR gap, or > mdadm. Egads. On the matter of bcache in general this seems noteworthy: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667 bummer.. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/7] Btrfs: replace BUG() with WARN_ONCE in raid56
On 05/14/16 02:06, Liu Bo wrote: > This BUG() has been triggered by a fuzz testing image, but in fact > btrfs can handle this gracefully by returning -EIO. > > Thus, use WARN_ONCE for warning purpose and don't leave a possible > kernel panic. > > Signed-off-by: Liu Bo> --- > fs/btrfs/raid56.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > index 0b7792e..863f7fe 100644 > --- a/fs/btrfs/raid56.c > +++ b/fs/btrfs/raid56.c > @@ -2139,7 +2139,7 @@ int raid56_parity_recover(struct btrfs_root *root, > struct bio *bio, > > rbio->faila = find_logical_bio_stripe(rbio, bio); > if (rbio->faila == -1) { > - BUG(); > + WARN_ONCE(1, KERN_WARNING "rbio->faila is -1\n"); I'm generally in favor of not BUGing out for no good reason, but what is e.g. an admin (or user) supposed to do when he sees this message? Same for the other rather cryptic WARNs - they contain no actionable information, and are most likely going to be ignored as "debug spam". IMHO things that can be ignored can be deleted. thanks, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: fix unexpected balance crash due to BUG_ON
On Tue, May 3, 2016 at 1:01 AM, Liu Bowrote: > Mounting a btrfs can resume previous balance operations asynchronously. > An user got a crash when one drive has some corrupt sectors. > > Since balance can cancel itself in case of any error, we can gracefully > return errors to upper layers and let balance do the cancel job. > > Reported-by: sash > Signed-off-by: Liu Bo > --- > fs/btrfs/volumes.c | 18 +++--- > 1 file changed, 15 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index bd0f45f..5aed2e2 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -3418,13 +3418,25 @@ static int __btrfs_balance(struct btrfs_fs_info > *fs_info) > ret = btrfs_shrink_device(device, old_size - size_to_free); > if (ret == -ENOSPC) > break; > - BUG_ON(ret); > + if (ret) { > + /* btrfs_shrink_device never returns ret > 0 */ > + WARN_ON_ONCE(ret > 0); > + goto error; > + } > > trans = btrfs_start_transaction(dev_root, 0); > - BUG_ON(IS_ERR(trans)); > + if (IS_ERR(trans)) { > + ret = PTR_ERR(trans); > + goto error; > + } > > ret = btrfs_grow_device(trans, device, old_size); > - BUG_ON(ret); > + if (ret) { > + btrfs_end_transaction(trans, dev_root); > + /* btrfs_grow_device never returns ret > 0 */ > + WARN_ON_ONCE(ret > 0); > + goto error; > + } > > btrfs_end_transaction(trans, dev_root); > } Just a heads up that this seems to introduce a valid warning, since it now can goto error before the first initializing use of path: fs/btrfs/volumes.c: In function 'btrfs_balance': fs/btrfs/volumes.c:3601:2: warning: 'path' may be used uninitialized in this function [-Wmaybe-uninitialized] btrfs_free_path(path); ^ fs/btrfs/volumes.c:3385:21: note: 'path' was declared here struct btrfs_path *path; ^ (it's really in __btrfs_balance which got inlined, so gcc thinks it's at the call site). Simply setting path = NULL at the beginning of __btrfs_balance fixes it, since btrfs_free_path allows NULL values. cheers Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Add device while rebalancing
On 04/27/16 18:40, Juan Alberto Cirez wrote: > If this is so, then it leaves even confused. I was under the > impression that the driving imperative for the creation of btrfs was > to address the shortcomings of current filesystems (within the context > of distributed data). That the idea was to create a low level > filesystem that would be the primary choice as a block/brick layer for a > scale-out, distributed data storage... I can't speak for who was or is motivated by what. Btrfs was a necessary reaction to ZFS, and AFAIK this had nothing to do with distributed storage but rather growing concerns around reliability (checksumming), scalability and operational ease: snapshotting, growing/shrinking etc. It's true that some of btrfs' capabilities make it look like a a good candidate, and e.g. Ceph started out using it. For many reasons that didn't work out (AFAIK btrfs maturity + extensibility) - but it also did not address a fundamental mismatch in requirements, which other filesystems (ext4, xfs) could not address either. btrfs simply does "too much" because it has to; you cannot remove or turn off half of what makes a kernel-based filesystem a usable filesystem. This is kind of sad because at its core btrfs *is* an object store with various trees for metadata handling and whatnot - but there's no easy way to turn off all the "Unix is stupid" stuff. AFAIK Gluster will soon also start managing xattrs differently, so this is not limited to Ceph. I've been following this saga for several years now and it's absolutely *astounding* how many bugs and performance problems Ceph has unearthed in existing filesystems, simply because it stresses them in ways they never have been stressed before..only to create the illusion of a distributed key/value store, badly. I don't want to argue about details, you can read more about some of the reasons in [1]. [grumble grumble exokernels and composable things in userland grumble] cheers Holger [1] http://www.slideshare.net/sageweil1/ceph-and-rocksdb -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Add device while rebalancing
On 04/27/16 17:58, Juan Alberto Cirez wrote: > Correct me if I'm wrong but the sum total of the above seems to > suggest (at first glance) that BRTFS add several layers of complexity, > but for little real benefit (at least in the case use of btrfs at the > brick layer with a distributed filesystem on top)... This may come as a surprise, but the same can be said for every other (common) filesystem (+ device management stack) that can be used standalone. Jeff Darcy (of GlusterFS) just wrote a really nice blog post why current filesystems and their historically grown requirements (mostly as they relate to the POSIX interface standard) are in many ways just not a good fit for scale-out/redundant storage: http://pl.atyp.us/2016-05-updating-posix.html Quite a few of the capabilities & features which are useful or necessary in standalone operation (regardless of single- or multi- device setup) are *actively unhelpful* in a distributed context, which is why e.g. Ceph will soon do away with the on-disk filesystem for data, and manage metadata exclusively by itself. cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Question: raid1 behaviour on failure
On 04/26/16 18:19, Henk Slager wrote: > It looks like a JMS567 + SATA port multipliers behaind it are used in > this drivebay. The command lsusb -v could show that. So your HW > setup is like JBOD, not RAID. I hate to quote the "harmful" trope, but.. SATA Port Multipliers Considered Harmful https://www.usenix.org/system/files/fastpw13-paper7_0.pdf aka: how to make any RAID setup useless in 1 easy step. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 0/3] Fixes for races in relocation and avoid start and wait for unrelated IO
On Mon, Apr 25, 2016 at 3:01 AM,wrote: > The following patches fix 2 hard to hit races in relocation that make its > first phase (MOVE_DATA_EXTENTS) miss extents, triggers a warning in the > second phase (UPDATE_DATA_PTRS) and leaves metadata in an invalid state > (file extent items pointing to areas corresponding to the deleted block > group), leading to a BUG_ON() when attempting to read those extents after > the relocation finishes. Never saw this particular race/error, but decided to give these patches a workout to see whether they cause any new or unrelated problems. Continuous rebalancing (full, partial) for ~30m while unpacking and deleting kernel trees on a 16GB tmpfs-backed loopback device did not cause any problem; balance just cruises along at (sometimes) up to ~1GB/s and does its thing. Finally btrfs check also found nothing wrong. Not sure if this qualifies as testing, but anyway: Tested-by: Holger Hoffstaette cheers, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] btrfs: qgroup: Fix qgroup accounting when creating snapshot
On Tue, 19 Apr 2016 15:19:46 -0700, Mark Fasheh wrote: > On Fri, Apr 15, 2016 at 05:08:22PM +0800, Qu Wenruo wrote: >> Current btrfs qgroup design implies a requirement that after calling >> btrfs_qgroup_account_extents() there must be a commit root switch. >> >> Normally this is OK, as btrfs_qgroup_accounting_extents() is only called >> inside btrfs_commit_transaction() just be commit_cowonly_roots(). >> >> However there is a exception at create_pending_snapshot(), which will >> call btrfs_qgroup_account_extents() but no any commit root switch. >> >> In case of creating a snapshot whose parent root is itself (create a >> snapshot of fs tree), it will corrupt qgroup by the following trace: >> (skipped unrelated data) >> == >> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, >> nr_old_roots = 0, nr_new_roots = 1 >> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer >> = 0, excl = 0 >> qgroup_update_counters: qgid = 5, cur_old_count = 0, cur_new_count = 1, rfer >> = 16384, excl = 16384 >> btrfs_qgroup_account_extent: bytenr = 29786112, num_bytes = 16384, >> nr_old_roots = 0, nr_new_roots = 0 >> == >> >> The problem here is in first qgroup_account_extent(), the >> nr_new_roots of the extent is 1, which means its reference got >> increased, and qgroup increased its rfer and excl. >> >> But at second qgroup_account_extent(), its reference got decreased, but >> between these two qgroup_account_extent(), there is no switch roots. >> This leads to the same nr_old_roots, and this extent just got ignored by >> qgroup, which means this extent is wrongly accounted. >> >> Fix it by call commit_cowonly_roots() after qgroup_account_extent() in >> create_pending_snapshot(), with needed preparation. >> >> Reported-by: Mark Fasheh>> Signed-off-by: Qu Wenruo > > Ok, this version seems to be giving me the right numbers. I'll send a test > for it shortly. I'd still like to know if this patch introduces an > unintended side effects but otherwise, thanks Qu. > --Mark Hi Mark, Can't speak about other side effects since I have not observed any so far, but I can confirm that the previously failing case of deleting a renamed snapshot [1] now works properly with v4 without getting the commit roots in a twist. So: Tested-by: holger.hoffstae...@googlemail.com cheers Holger [1] http://thread.gmane.org/gmane.comp.file-systems.btrfs/55052 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
[cc: Mark and Qu] On 04/08/16 13:51, Holger Hoffstätte wrote: > On 04/08/16 13:14, Filipe Manana wrote: >> Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs >> patches, it didn't reproduce here: > > Great, that's good to know (sort of :). Thanks also to Liu Bo. > >> Are you sure that you are not using some patches not in 4.6? We have a bingo! Reverting "qgroup: Fix qgroup accounting when creating snapshot" from last Wednesday immediately fixes the problem. Was quite easy to find - the triggered WARN_ON was the second one that complained about a mismatch between roots. The only patch that even remotely did something in that area was said qgroup fix. Looks like something is missing there. Suggestions welcome. :) Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
On 04/08/16 13:14, Filipe Manana wrote: > Using Chris' for-linus-4.6 branch, which is 4.5-rc6 + all 4.6 btrfs > patches, it didn't reproduce here: Great, that's good to know (sort of :). Thanks also to Liu Bo. > Are you sure that you are not using some patches not in 4.6? Quite a few, but to offset that I also left out some that have diverged too much or were not that important (block/sectorsize, device handling). But those should not have anything to do with this particular bug. Except for this everything works rock-solid, I use it daily. Should be easy to track down.. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
WARN_ON in record_root_in_trans() when deleting freshly renamed subvolume
Hi, Looks like I just found an exciting new corner case. kernel 4.4.6 with btrfs ~4.6, so 4.6 should reproduce. Try on a fresh volume: $btrfs subvolume create foo Create subvolume './foo' $sync $btrfs subvolume snapshot foo foo-1 Create a snapshot of 'foo' in './foo-1' $sync $mv foo-1 foo.new $btrfs subvolume delete foo.new Delete subvolume (no-commit): '/mnt/test/foo.new' $dmesg [ 226.923316] [ cut here ] [ 226.923339] WARNING: CPU: 1 PID: 5863 at fs/btrfs/transaction.c:319 record_root_in_trans+0xd6/0x100 [btrfs]() [ 226.923340] Modules linked in: auth_rpcgss oid_registry nfsv4 btrfs xor raid6_pq loop nfs lockd grace sunrpc autofs4 sch_fq_codel radeon snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic coretemp crc32_pclmul crc32c_intel aesni_intel i2c_algo_bit uvcvideo snd_hda_codec_hdmi aes_x86_64 drm_kms_helper videobuf2_vmalloc glue_helper videobuf2_memops syscopyarea lrw sysfillrect gf128mul videobuf2_v4l2 sysimgblt snd_usb_audio fb_sys_fops ablk_helper snd_hda_intel videobuf2_core ttm cryptd snd_hwdep v4l2_common usbhid snd_hda_codec snd_usbmidi_lib videodev snd_rawmidi drm snd_hda_core snd_seq_device i2c_i801 snd_pcm i2c_core snd_timer snd r8169 soundcore mii parport_pc parport [ 226.923365] CPU: 1 PID: 5863 Comm: ls Not tainted 4.4.6 #1 [ 226.923366] Hardware name: Gigabyte Technology Co., Ltd. P67-DS3-B3/P67-DS3-B3, BIOS F1 05/06/2011 [ 226.923367] 8800da677d20 813181a8 [ 226.923368] a0aacdbf 8800da677d58 810507b2 880601e90800 [ 226.923369] 8800dacf10a0 880601e90800 880601e909f0 0001 [ 226.923371] Call Trace: [ 226.923374] [] dump_stack+0x4d/0x65 [ 226.923376] [] warn_slowpath_common+0x82/0xc0 [ 226.923378] [] warn_slowpath_null+0x1a/0x20 [ 226.923387] [] record_root_in_trans+0xd6/0x100 [btrfs] [ 226.923395] [] btrfs_record_root_in_trans+0x44/0x70 [btrfs] [ 226.923404] [] start_transaction+0x9e/0x4c0 [btrfs] [ 226.923412] [] btrfs_join_transaction+0x17/0x20 [btrfs] [ 226.923421] [] btrfs_dirty_inode+0x35/0xd0 [btrfs] [ 226.923430] [] btrfs_update_time+0x7d/0xb0 [btrfs] [ 226.923432] [] touch_atime+0x88/0xa0 [ 226.923434] [] iterate_dir+0xdb/0x120 [ 226.923435] [] SyS_getdents+0x88/0xf0 [ 226.923437] [] ? fillonedir+0xd0/0xd0 [ 226.923439] [] entry_SYSCALL_64_fastpath+0x12/0x6a [ 226.923440] ---[ end trace 9c78caf253e284fe ]--- Code looks like: .. static int record_root_in_trans(struct btrfs_trans_handle *trans, struct btrfs_root *root) { if (test_bit(BTRFS_ROOT_REF_COWS, >state) && root->last_trans < trans->transid) { WARN_ON(root == root->fs_info->extent_root); WARN_ON(root->commit_root != root->node); .. There's been a few journal/recovery/directory consistency patches recently, so maybe it's a corner case or an older problem. I'll try to bisect, but meanwhile wanted to report it for discussion. Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] delete obsolete function btrfs_print_tree()
On 04/04/16 18:02, Filipe Manana wrote: > I use this function frequently during development, and there's a good > reason to use it instead of the user space tool btrfs-debug-tree. Good to know, that's why I asked. Printing unwritten extents makes sense. -h -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html