Re: BUG: unable to handle kernel paging request at ffff9fb75f827100
On 21.02.2018 06:09, Christoph Anton Mitterer wrote: > Hi. > > Not sure if that's a bug in btrfs... maybe someone's interested in it. > > Cheers, > Chris. > > # uname -a > Linux heisenberg 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 > GNU/Linux > > > Feb 21 04:55:51 heisenberg kernel: BUG: unable to handle kernel paging > request at 9fb75f827100 > Feb 21 04:55:51 heisenberg kernel: IP: btrfs_drop_inode+0x16/0x40 [btrfs] This looks like the one fixed by e8f1bc1493855e32b7a2a019decc3c353d94daf6 . It's tagged for stable so you should get it eventually. > Feb 21 04:55:51 heisenberg kernel: PGD 410806067 P4D 410806067 PUD 0 > Feb 21 04:55:51 heisenberg kernel: Oops: [#1] SMP PTI > Feb 21 04:55:51 heisenberg kernel: Modules linked in: vhost_net vhost tap > xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat > nf_nat_ipv4 nf_nat tun bridge stp llc ctr ccm fuse ebtable_filter ebtables > devlink cpufreq_userspace cpufreq_powersave cpufreq_conservative ip6t_REJECT > nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter > ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 > nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter > binfmt_misc arc4 snd_hda_codec_hdmi btusb btrtl btbcm btintel bluetooth > snd_hda_codec_realtek snd_hda_codec_generic drbg uvcvideo videobuf2_vmalloc > videobuf2_memops snd_soc_skl snd_usb_audio ansi_cprng snd_soc_skl_ipc > cdc_mbim cdc_wdm snd_usbmidi_lib snd_soc_sst_ipc cdc_ncm snd_soc_sst_dsp > snd_rawmidi ecdh_generic > Feb 21 04:55:51 heisenberg kernel: videobuf2_v4l2 i2c_designware_platform > iwlmvm snd_hda_ext_core videobuf2_core usbnet i2c_designware_core > snd_seq_device snd_soc_sst_match mii snd_soc_core videodev snd_compress media > mac80211 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel > kvm irqbypass intel_cstate snd_hda_intel intel_uncore iwlwifi snd_hda_codec > intel_rapl_perf snd_hda_core snd_hwdep snd_pcm pcspkr 
joydev evdev cfg80211 > serio_raw snd_timer rfkill snd iTCO_wdt sg soundcore iTCO_vendor_support i915 > wmi shpchp drm_kms_helper mei_me battery fujitsu_laptop mei tpm_crb idma64 > button drm sparse_keymap video i2c_algo_bit ac acpi_pad intel_lpss_pci > intel_lpss mfd_core loop parport_pc ppdev sunrpc lp parport ip_tables > x_tables autofs4 algif_skcipher af_alg ext4 crc16 mbcache jbd2 fscrypto ecb > hid_generic > Feb 21 04:55:51 heisenberg kernel: usbhid hid dm_crypt dm_mod raid10 raid456 > async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid1 > raid0 multipath linear md_mod btrfs crc32c_generic xor zstd_decompress > zstd_compress xxhash uas raid6_pq uhci_hcd ehci_pci ehci_hcd usb_storage > sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc > aesni_intel aes_x86_64 crypto_simd glue_helper cryptd e1000e xhci_pci ahci > ptp libahci xhci_hcd psmouse pps_core libata sdhci_pci sdhci i2c_i801 > scsi_mod mmc_core usbcore usb_common > Feb 21 04:55:51 heisenberg kernel: CPU: 3 PID: 50 Comm: kswapd0 Not tainted > 4.14.0-3-amd64 #1 Debian 4.14.17-1 > Feb 21 04:55:51 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK > U757/FJNB2A5, BIOS Version 1.16 07/05/2017 > Feb 21 04:55:51 heisenberg kernel: task: 9fbff9523040 task.stack: > ad5e8378 > Feb 21 04:55:51 heisenberg kernel: RIP: 0010:btrfs_drop_inode+0x16/0x40 > [btrfs] > Feb 21 04:55:51 heisenberg kernel: RSP: 0018:ad5e83783c28 EFLAGS: 00010286 > Feb 21 04:55:51 heisenberg kernel: RAX: 0001 RBX: > 9fbe05d69330 RCX: > Feb 21 04:55:51 heisenberg kernel: RDX: 9fb75f827000 RSI: > c075f2b0 RDI: 9fbe05d69330 > Feb 21 04:55:51 heisenberg kernel: RBP: 9fbff2040800 R08: > 9fbff1ddea08 R09: ad5e83783d88 > Feb 21 04:55:51 heisenberg kernel: R10: 001bf318 R11: > R12: 9fbe05d693b8 > Feb 21 04:55:51 heisenberg kernel: R13: 9fbe05d69488 R14: > 9fbfc26cecc0 R15: > Feb 21 04:55:51 heisenberg kernel: FS: () > GS:9fc01dd8() knlGS: > Feb 21 04:55:51 heisenberg kernel: CS: 0010 DS: ES: CR0: > 
80050033 > Feb 21 04:55:51 heisenberg kernel: CR2: 9fb75f827100 CR3: > 00041020a004 CR4: 003606e0 > Feb 21 04:55:51 heisenberg kernel: DR0: DR1: > DR2: > Feb 21 04:55:51 heisenberg kernel: DR3: DR6: > fffe0ff0 DR7: 0400 > Feb 21 04:55:51 heisenberg kernel: Call Trace: > Feb 21 04:55:51 heisenberg kernel: iput+0xf7/0x210 > Feb 21 04:55:51 heisenberg kernel: __dentry_kill+0xce/0x160 > Feb 21 04:55:51 heisenberg kernel: shrink_dentry_list+0xe0/0x2d0 > Feb 21 04:55:51 heisenberg kernel: prune_dcache_sb+0x52/0x70 > Feb 21 04:55:51 heisenberg kernel: super_cache_scan+0xf7/0x1a0 > Feb 21 04:55:51 heisenberg kernel:
BUG: unable to handle kernel paging request at ffff9fb75f827100
Hi. Not sure if that's a bug in btrfs... maybe someone's interested in it. Cheers, Chris. # uname -a Linux heisenberg 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 GNU/Linux Feb 21 04:55:51 heisenberg kernel: BUG: unable to handle kernel paging request at 9fb75f827100 Feb 21 04:55:51 heisenberg kernel: IP: btrfs_drop_inode+0x16/0x40 [btrfs] Feb 21 04:55:51 heisenberg kernel: PGD 410806067 P4D 410806067 PUD 0 Feb 21 04:55:51 heisenberg kernel: Oops: [#1] SMP PTI Feb 21 04:55:51 heisenberg kernel: Modules linked in: vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat tun bridge stp llc ctr ccm fuse ebtable_filter ebtables devlink cpufreq_userspace cpufreq_powersave cpufreq_conservative ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter binfmt_misc arc4 snd_hda_codec_hdmi btusb btrtl btbcm btintel bluetooth snd_hda_codec_realtek snd_hda_codec_generic drbg uvcvideo videobuf2_vmalloc videobuf2_memops snd_soc_skl snd_usb_audio ansi_cprng snd_soc_skl_ipc cdc_mbim cdc_wdm snd_usbmidi_lib snd_soc_sst_ipc cdc_ncm snd_soc_sst_dsp snd_rawmidi ecdh_generic Feb 21 04:55:51 heisenberg kernel: videobuf2_v4l2 i2c_designware_platform iwlmvm snd_hda_ext_core videobuf2_core usbnet i2c_designware_core snd_seq_device snd_soc_sst_match mii snd_soc_core videodev snd_compress media mac80211 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate snd_hda_intel intel_uncore iwlwifi snd_hda_codec intel_rapl_perf snd_hda_core snd_hwdep snd_pcm pcspkr joydev evdev cfg80211 serio_raw snd_timer rfkill snd iTCO_wdt sg soundcore iTCO_vendor_support i915 wmi shpchp drm_kms_helper mei_me battery fujitsu_laptop mei tpm_crb idma64 button drm sparse_keymap video i2c_algo_bit ac acpi_pad intel_lpss_pci intel_lpss 
mfd_core loop parport_pc ppdev sunrpc lp parport ip_tables x_tables autofs4 algif_skcipher af_alg ext4 crc16 mbcache jbd2 fscrypto ecb hid_generic Feb 21 04:55:51 heisenberg kernel: usbhid hid dm_crypt dm_mod raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid1 raid0 multipath linear md_mod btrfs crc32c_generic xor zstd_decompress zstd_compress xxhash uas raid6_pq uhci_hcd ehci_pci ehci_hcd usb_storage sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper cryptd e1000e xhci_pci ahci ptp libahci xhci_hcd psmouse pps_core libata sdhci_pci sdhci i2c_i801 scsi_mod mmc_core usbcore usb_common Feb 21 04:55:51 heisenberg kernel: CPU: 3 PID: 50 Comm: kswapd0 Not tainted 4.14.0-3-amd64 #1 Debian 4.14.17-1 Feb 21 04:55:51 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK U757/FJNB2A5, BIOS Version 1.16 07/05/2017 Feb 21 04:55:51 heisenberg kernel: task: 9fbff9523040 task.stack: ad5e8378 Feb 21 04:55:51 heisenberg kernel: RIP: 0010:btrfs_drop_inode+0x16/0x40 [btrfs] Feb 21 04:55:51 heisenberg kernel: RSP: 0018:ad5e83783c28 EFLAGS: 00010286 Feb 21 04:55:51 heisenberg kernel: RAX: 0001 RBX: 9fbe05d69330 RCX: Feb 21 04:55:51 heisenberg kernel: RDX: 9fb75f827000 RSI: c075f2b0 RDI: 9fbe05d69330 Feb 21 04:55:51 heisenberg kernel: RBP: 9fbff2040800 R08: 9fbff1ddea08 R09: ad5e83783d88 Feb 21 04:55:51 heisenberg kernel: R10: 001bf318 R11: R12: 9fbe05d693b8 Feb 21 04:55:51 heisenberg kernel: R13: 9fbe05d69488 R14: 9fbfc26cecc0 R15: Feb 21 04:55:51 heisenberg kernel: FS: () GS:9fc01dd8() knlGS: Feb 21 04:55:51 heisenberg kernel: CS: 0010 DS: ES: CR0: 80050033 Feb 21 04:55:51 heisenberg kernel: CR2: 9fb75f827100 CR3: 00041020a004 CR4: 003606e0 Feb 21 04:55:51 heisenberg kernel: DR0: DR1: DR2: Feb 21 04:55:51 heisenberg kernel: DR3: DR6: fffe0ff0 DR7: 0400 Feb 21 04:55:51 heisenberg kernel: Call Trace: Feb 21 04:55:51 heisenberg kernel: iput+0xf7/0x210 Feb 21 04:55:51 heisenberg 
kernel: __dentry_kill+0xce/0x160 Feb 21 04:55:51 heisenberg kernel: shrink_dentry_list+0xe0/0x2d0 Feb 21 04:55:51 heisenberg kernel: prune_dcache_sb+0x52/0x70 Feb 21 04:55:51 heisenberg kernel: super_cache_scan+0xf7/0x1a0 Feb 21 04:55:51 heisenberg kernel: shrink_slab.part.49+0x1e8/0x3e0 Feb 21 04:55:51 heisenberg kernel: shrink_node+0x123/0x300 Feb 21 04:55:51 heisenberg kernel: kswapd+0x299/0x6f0 Feb 21 04:55:51 heisenberg kernel: ? mem_cgroup_shrink_node+0x190/0x190 Feb 21 04:55:51 heisenberg kernel: kthread+0x11a/0x130 Feb 21 04:55:51 heisenberg kernel: ? kthread_create_on_node+0x70/0x70 Feb 21 04:55:51 heisenberg
Re: Status of FST and mount times
On 2018年02月20日 23:41, Austin S. Hemmelgarn wrote: > On 2018-02-20 09:59, Ellis H. Wilson III wrote: >> On 02/16/2018 07:59 PM, Qu Wenruo wrote: >>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote: $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l 3454 >>> >>> OK, this explains everything. >>> >>> There are too many chunks. >>> This means at mount you need to search for block group item 3454 times. >>> >>> Even each search only needs to iterate 3 tree blocks, multiply it 3454 >>> it would still be a big work. >>> Although some tree blocks like the root node and level 1 nodes can be >>> cached, we still need to read about 3500 tree blocks. >>> >>> If the fs is created using 16K nodesize, this means you need to do >>> random read for 54M using 16K blocksize. >>> >>> No wonder it will takes some time. >>> >>> Normally I would expect 1G chunk for each data and metadata chunk. >>> >>> If there is nothing special, it means your filesystem is already larger >>> than 3T. >>> If your used space is way smaller (less than 30%) than 3.5T, then this >>> means your chunk usage is pretty low, and in that case, balance to >>> reduce number of chunks (block groups) would reduce mount time. >> >> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by >> btrfs fi df. So, from what I am hearing, this mount time is normal >> for a filesystem this size. Ignoring a more complex and proper fix >> like the ones we've been discussing, would bumping the nodesize reduce >> the number of chunks, thereby reducing the mount time? > It would probably not. Chunk size is only based on the total size of > the filesystem, with reasonable base values, so you would still need to > have at least as many chunks to store the same amount of data (increase > the node size too much though, and you will end up with more chunks, > because you'll have more empty space wasted). Increasing node size may reduce extent tree size. Although at most reduce one level AFAIK. 
But considering that the higher the node is, the more chance it's cached, reducing tree height wouldn't bring much performance impact AFAIK. If one could do real world benchmark to beat or prove my assumption, it would be much better though. >> >> I don't see why balance would come into play here -- my understanding >> was that was for aged filesystems. The only operations I've done on >> here was: >> 1. Format filesystem clean >> 2. Create a subvolume >> 3. rsync our home directories into that new subvolume >> 4. Create another subvolume >> 5. rsync our home directories into that new subvolume >> >> Accordingly, zero (or at least, extremely little) data should have >> been overwritten, so I would expect things to be fairly well allocated >> already. Please correct me if this is naive thinking. > Your logic is in general correct regarding data, but not necessarily > metadata. Assuming you did not use the `--inplace` option for rsync, it > had to issue a rename for each individual file that got copied in, and > as a result there was likely a lot of metadata being rewritten. > > As far as balance being for aged filesystems, that's not exactly true. > There are four big reasons you might run a balance: > > 1. As part of reshaping a volume. You generally want run a balance > whenever the number of disks in a volume permanently increases (it will > happen automatically when it permanently decreases, as the device > deletion operation is a special type of balance under the hood). It's > also used for converting chunk profiles. > 2. To free up empty space inside chunks when the filesystem is full at > the chunk level. > 3. To redistribute data across multiple disks in a more even manner > after deleting a lot of data. > 4. To reduce the likelihood of 2 or 3 being an issue. > > Reasons 2 and 3 are generally more likely to be needed on old volumes. > Reason 1 is independent of the age of a volume. 
Reason 4 is the reason > for the regular filtered balances that I and some other people recommend > be run as part of preventative maintenance, and is also generally > independent of the age of a volume. > > Qu's suggestion is actually independent of all the above reasons, but > does kind of fit in with the fourth as another case of preventative > maintenance. My suggestion is to use balance to reduce number of block groups, so we could do less search at mount time. It's more like reason 2. But it only works for case where there are a lot of fragments so a lot of chunks are not fully utilized. Unfortunately, that's not the case for OP, so my suggestion doesn't make sense here. BTW, if OP still wants to try something to possibly to reduce mount time with same the fs, I could try some modification to current block group iteration code to see if it makes sense. Thanks, Qu >> I was using btrfs sub del -C for the deletions, so I believe (if that command truly waits for the subvolume to be utterly gone) it captures the entirety
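Qu's estimate earlier in this thread (3454 chunks, 16K nodesize, about 54M of random reads at mount) can be reproduced with a small sketch. The helper names below are hypothetical and this is not btrfs code; it assumes, as Qu describes, that the root and level 1 nodes stay cached, so each block group lookup costs roughly one uncached tree block read.

```c
#include <stdint.h>

/*
 * Rough estimate of the metadata read at mount time, assuming each
 * block group item lookup reads one uncached tree block of `nodesize`
 * bytes. Hypothetical helper for illustration only.
 */
uint64_t mount_read_bytes(uint64_t nr_chunks, uint64_t nodesize)
{
	return nr_chunks * nodesize;
}

/* Convert a byte count to MiB, rounded to the nearest whole MiB. */
uint64_t bytes_to_mib(uint64_t bytes)
{
	const uint64_t mib = 1024 * 1024;

	return (bytes + mib / 2) / mib;
}
```

Calling `bytes_to_mib(mount_read_bytes(3454, 16 * 1024))` gives the figure quoted in the thread. Since the reads are random and 16K each, the cost is dominated by the number of seeks rather than by the total size.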
Re: [PATCH v2 00/27] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"
On Thu, Feb 15, 2018 at 11:04:45AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval
>
> This is v2 of "btrfs-progs as a library".
>
> Most of the changes since v1 are small:
>
> - Rebased onto v4.15
> - Split up btrfs_util_subvolume_path() which was accidentally squashed
>   together with the commit adding btrfs_util_create_subvolume()
> - Renamed btrfs_util_f_* functions to btrfs_util_*_fd for clarity

I like this naming scheme.

> - Added -fvisibility=hidden and a macro for
>   __attribute__((visibility("default")))
> - Changed to use semantic versioning
> - Fixed missing install of btrfsutil.h
> - Documented functions which require root or are non-atomic
> - Added a missing license to setup.py
>
> The bigger change is in the last two patches. Dave requested that I get
> rid of the runtime dependency of libbtrfsutil from libbtrfs. The easiest
> way to do this was to remove the btrfs_list_subvols_print()
> implementation from libbtrfs and put it in cmds-subvolume.c (details in
> patch 26). I'm open to alternatives.

This should be ok as a temporary fix to get the library going. The
column printing helpers will be replaced by libsmartcols (I have a
prototype for that but there are still some issues to fix).

> Omar Sandoval (27):
>   btrfs-progs: get rid of undocumented qgroup inheritance options

For the initial merge I'll skip this patch (and what depends on it), as
the functionality is not yet out of the kernel. I looked at the patch and
am not yet convinced to merge it; more time is needed, but the library
should not be blocked by it.

>   Add libbtrfsutil
>   libbtrfsutil: add Python bindings
>   libbtrfsutil: add btrfs_util_is_subvolume() and
>     btrfs_util_subvolume_id()
>   libbtrfsutil: add qgroup inheritance helpers
>   libbtrfsutil: add btrfs_util_create_subvolume()
>   libbtrfsutil: add btrfs_util_subvolume_path()
>   libbtrfsutil: add btrfs_util_subvolume_info()
>   libbtrfsutil: add btrfs_util_[gs]et_read_only()
>   libbtrfsutil: add btrfs_util_[gs]et_default_subvolume()
>   libbtrfsutil: add subvolume iterator helpers
>   libbtrfsutil: add btrfs_util_create_snapshot()
>   libbtrfsutil: add btrfs_util_delete_subvolume()
>   libbtrfsutil: add btrfs_util_deleted_subvolumes()
>   libbtrfsutil: add filesystem sync helpers

I'm going to add the above to devel now.

>   btrfs-progs: use libbtrfsutil for read-only property
>   btrfs-progs: use libbtrfsutil for sync ioctls
>   btrfs-progs: use libbtrfsutil for set-default
>   btrfs-progs: use libbtrfsutil for get-default
>   btrfs-progs: use libbtrfsutil for subvol create and snapshot
>   btrfs-progs: use libbtrfsutil for subvol delete
>   btrfs-progs: use libbtrfsutil for subvol show
>   btrfs-progs: use libbtrfsutil for subvol sync
>   btrfs-progs: replace test_issubvolume() with btrfs_util_is_subvolume()
>   btrfs-progs: add recursive snapshot/delete using libbtrfsutil
>   btrfs-progs: use libbtrfsutil for subvolume list
>   btrfs-progs: deprecate libbtrfs helpers

Besides the subvol and qgroup inheritance, all of the above look good,
but I'd like to spend more time merging them and we should also add
command-line tests for coverage.

I have more comments or maybe questions about the future development
workflow, but at this point the patchset is in good shape for an
incremental merge.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 02/27] Add libbtrfsutil
On Tue, Feb 20, 2018 at 10:32:17AM -0700, Liu Bo wrote:
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -43,6 +43,8 @@ libbtrfs.so.0.1
> >  library-test
> >  library-test-static
> >  /fssum
> > +/libbtrfsutil.so*
> > +/libbtrfsutil.a
>
> .gitignore is not part of btrfs-progs, is it?

It is, since version 0.19 :)
Re: [PATCH v2 02/27] Add libbtrfsutil
On Thu, Feb 15, 2018 at 11:04:47AM -0800, Omar Sandoval wrote: > From: Omar Sandoval> > Currently, users wishing to manage Btrfs filesystems programatically > have to shell out to btrfs-progs and parse the output. This isn't ideal. > The goal of libbtrfsutil is to provide a library version of as many of > the operations of btrfs-progs as possible and to migrate btrfs-progs to > use it. > > Rather than simply refactoring the existing btrfs-progs code, the code > has to be written from scratch for a couple of reasons: > > * A lot of the btrfs-progs code was not designed with a nice library API > in mind in terms of reusability, naming, and error reporting. > * libbtrfsutil is licensed under the LGPL, whereas btrfs-progs is under > the GPL, which makes it dubious to directly copy or move the code. > > Eventually, most of the low-level btrfs-progs code should either live in > libbtrfsutil or the shared kernel/userspace filesystem code, and > btrfs-progs will just be the CLI wrapper. > > This first commit just includes the build system changes, license, > README, and error reporting helper. 
> > Signed-off-by: Omar Sandoval > --- > .gitignore| 2 + > Makefile | 72 ++-- > Makefile.inc.in | 2 +- > libbtrfsutil/COPYING | 674 > ++ > libbtrfsutil/COPYING.LESSER | 165 ++ > libbtrfsutil/README.md| 35 ++ > libbtrfsutil/btrfsutil.h | 76 + > libbtrfsutil/btrfsutil_internal.h | 40 +++ > libbtrfsutil/errors.c | 55 > 9 files changed, 1100 insertions(+), 21 deletions(-) > create mode 100644 libbtrfsutil/COPYING > create mode 100644 libbtrfsutil/COPYING.LESSER > create mode 100644 libbtrfsutil/README.md > create mode 100644 libbtrfsutil/btrfsutil.h > create mode 100644 libbtrfsutil/btrfsutil_internal.h > create mode 100644 libbtrfsutil/errors.c > > diff --git a/.gitignore b/.gitignore > index 8e607f6e..272d53e4 100644 > --- a/.gitignore > +++ b/.gitignore > @@ -43,6 +43,8 @@ libbtrfs.so.0.1 > library-test > library-test-static > /fssum > +/libbtrfsutil.so* > +/libbtrfsutil.a > .gitignore is not part of btrfs-progs, is it? I Skipped Makefile, the .c and .h look good to me, Reviewed-by: Liu Bo Thanks, -liubo > /tests/*-tests-results.txt > /tests/test-console.txt > diff --git a/Makefile b/Makefile > index 00e21379..7fb70d06 100644 > --- a/Makefile > +++ b/Makefile > @@ -73,10 +73,20 @@ CFLAGS = $(SUBST_CFLAGS) \ >-fPIC \ >-I$(TOPDIR) \ >-I$(TOPDIR)/kernel-lib \ > + -I$(TOPDIR)/libbtrfsutil \ >$(EXTRAWARN_CFLAGS) \ >$(DEBUG_CFLAGS_INTERNAL) \ >$(EXTRA_CFLAGS) > > +LIBBTRFSUTIL_CFLAGS = $(SUBST_CFLAGS) \ > + $(CSTD) \ > + -D_GNU_SOURCE \ > + -fPIC \ > + -fvisibility=hidden \ > + -I$(TOPDIR)/libbtrfsutil \ > + $(EXTRAWARN_CFLAGS) \ > + $(EXTRA_CFLAGS) > + > LDFLAGS = $(SUBST_LDFLAGS) \ > -rdynamic -L$(TOPDIR) \ > $(DEBUG_LDFLAGS_INTERNAL) \ > @@ -121,12 +131,17 @@ libbtrfs_headers = send-stream.h send-utils.h send.h > kernel-lib/rbtree.h btrfs-l > kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \ > kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \ > extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h version.h > +libbtrfsutil_major := $(shell sed -rn 
's/^\#define BTRFS_UTIL_VERSION_MAJOR > ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h) > +libbtrfsutil_minor := $(shell sed -rn 's/^\#define BTRFS_UTIL_VERSION_MINOR > ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h) > +libbtrfsutil_patch := $(shell sed -rn 's/^\#define BTRFS_UTIL_VERSION_PATCH > ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h) > +libbtrfsutil_version := > $(libbtrfsutil_major).$(libbtrfsutil_minor).$(libbtrfsutil_patch) > +libbtrfsutil_objects = libbtrfsutil/errors.o > convert_objects = convert/main.o convert/common.o convert/source-fs.o \ > convert/source-ext2.o convert/source-reiserfs.o > mkfs_objects = mkfs/main.o mkfs/common.o mkfs/rootdir.o > image_objects = image/main.o image/sanitize.o > all_objects = $(objects) $(cmds_objects) $(libbtrfs_objects) > $(convert_objects) \ > - $(mkfs_objects) $(image_objects) > + $(mkfs_objects) $(image_objects) $(libbtrfsutil_objects) > > udev_rules = 64-btrfs-dm.rules > > @@ -246,11 +261,10 @@ static_convert_objects = $(patsubst %.o, %.static.o, > $(convert_objects)) > static_mkfs_objects = $(patsubst %.o, %.static.o, $(mkfs_objects)) > static_image_objects = $(patsubst %.o, %.static.o, $(image_objects)) > > -libs_shared = libbtrfs.so.0.1 > -libs_static = libbtrfs.a > +libs_shared = libbtrfs.so.0.1 libbtrfsutil.so.$(libbtrfsutil_version) > +libs_static = libbtrfs.a
Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW
On Tue, Feb 13, 2018 at 11:00:46AM +0800, Anand Jain wrote:
> Fixes the endianness bug in the fs_info::super_copy by using its
> btrfs_set_super...() function to set values in the SB, as these
> functions manage the endianness compatibility nicely.

Reviewed-by: Liu Bo

Thanks,

-liubo

>
> Signed-off-by: Anand Jain
> ---
> v1->v2: Update change log. Update $Subject.
>         Old:
>         [PATCH] btrfs: use set functions to update latest refs to the SB
>  fs/btrfs/transaction.c | 20
>  1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 04f07144b45c..9220f004001c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info *fs_info)
>
>  	super = fs_info->super_copy;
>
> +	/* update latest btrfs_super_block::chunk_root refs */
>  	root_item = &fs_info->chunk_root->root_item;
> -	super->chunk_root = root_item->bytenr;
> -	super->chunk_root_generation = root_item->generation;
> -	super->chunk_root_level = root_item->level;
> +	btrfs_set_super_chunk_root(super, root_item->bytenr);
> +	btrfs_set_super_chunk_root_generation(super, root_item->generation);
> +	btrfs_set_super_chunk_root_level(super, root_item->level);
>
> +	/* update latest btrfs_super_block::root refs */
>  	root_item = &fs_info->tree_root->root_item;
> -	super->root = root_item->bytenr;
> -	super->generation = root_item->generation;
> -	super->root_level = root_item->level;
> +	btrfs_set_super_root(super, root_item->bytenr);
> +	btrfs_set_super_generation(super, root_item->generation);
> +	btrfs_set_super_root_level(super, root_item->level);
> +
>  	if (btrfs_test_opt(fs_info, SPACE_CACHE))
> -		super->cache_generation = root_item->generation;
> +		btrfs_set_super_cache_generation(super, root_item->generation);
>  	if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
> -		super->uuid_tree_generation = root_item->generation;
> +		btrfs_set_super_uuid_tree_generation(super,
> +						     root_item->generation);
>  }
>
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
> --
> 2.15.0
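To illustrate why the setter functions matter: the superblock is stored on disk in little-endian byte order, so assigning a CPU-order integer directly to a field only happens to work on little-endian hosts. The following is a minimal userspace sketch of that idea. All names in it (`le64_t`, `struct toy_super`, the `toy_*` and `*_sketch` helpers) are hypothetical stand-ins, not the kernel's types or the real btrfs_set_super_*() accessors.

```c
#include <stdint.h>

/* On-disk 64-bit little-endian value, kept as raw bytes (hypothetical). */
typedef struct {
	uint8_t b[8];
} le64_t;

/* Toy stand-in for a superblock with one little-endian field. */
struct toy_super {
	le64_t chunk_root;
};

/* Store a CPU-order value as little-endian bytes on any host. */
le64_t cpu_to_le64_sketch(uint64_t v)
{
	le64_t out;

	for (int i = 0; i < 8; i++)
		out.b[i] = (uint8_t)(v >> (8 * i));
	return out;
}

/* Read little-endian bytes back into a CPU-order value. */
uint64_t le64_to_cpu_sketch(le64_t v)
{
	uint64_t out = 0;

	for (int i = 0; i < 8; i++)
		out |= (uint64_t)v.b[i] << (8 * i);
	return out;
}

/* Setter: callers pass CPU-order values, the conversion is hidden here. */
void toy_set_super_chunk_root(struct toy_super *s, uint64_t bytenr)
{
	s->chunk_root = cpu_to_le64_sketch(bytenr);
}

/* Getter: returns the field in CPU order. */
uint64_t toy_super_chunk_root(const struct toy_super *s)
{
	return le64_to_cpu_sketch(s->chunk_root);
}
```

Routing every write through an accessor keeps the byte-order conversion in one place, which is the property the patch relies on when it replaces the direct `super->field = ...` assignments.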
Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()
On Fri, Feb 16, 2018 at 07:51:38PM +, Dmitriy Gorokh wrote: > On detaching of a disk which is a part of a RAID6 filesystem, the following > kernel OOPS may happen: > > [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, > flush 1, corrupt 0, gen 0 > [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on > /dev/sdo > [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, > flush 1, corrupt 0, gen 0 > [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on > /dev/sdo > [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, > flush 1, corrupt 0, gen 0 > [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo > [63122.935338] BUG: unable to handle kernel NULL pointer dereference at > 0080 > [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] > [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 > [63122.971202] Oops: [#1] SMP > [63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace > cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs > lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev > snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc > soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci > ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor > zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod > hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod > sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last > unloaded: scst] > [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W > 4.14.2-16-scst34x+ #8 > [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS > VirtualBox 12/01/2006 > [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] > [63123.007595] task: 880036ea4040 task.stack: c90006384000 > [63123.007796] RIP: 
0010:fail_bio_stripe+0x58/0xa0 [btrfs] > [63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 > [63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: > 8800b2bd5690 > [63123.008359] RDX: RSI: 88007bb43500 RDI: > 88004beaa000 > [63123.008621] RBP: c90006387ae8 R08: 9910 R09: > 8800b2bd5600 > [63123.008840] R10: 0004 R11: 0001 R12: > 88007bb43500 > [63123.009059] R13: fffb R14: 880036fc5180 R15: > 0004 > [63123.009278] FS: () GS:8800b700() > knlGS: > [63123.009564] CS: 0010 DS: ES: CR0: 80050033 > [63123.009748] CR2: 0080 CR3: b0866000 CR4: > 000406f0 > [63123.009969] Call Trace: > [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] > [63123.010251] bio_endio+0xa1/0x120 > [63123.010378] generic_make_request+0x218/0x270 > [63123.010921] submit_bio+0x66/0x130 > [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] > [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] > [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] > [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] > [63123.011759] ? ___cache_free+0x1c5/0x300 > [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] > [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] > [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] > [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] > [63123.012656] process_one_work+0x14d/0x350 > [63123.012888] worker_thread+0x4d/0x3a0 > [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 > [63123.013192] kthread+0x109/0x140 > [63123.013315] ? process_scheduled_works+0x40/0x40 > [63123.013472] ? 
kthread_stop+0x110/0x110 > [63123.013610] ret_from_fork+0x25/0x30 > [63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 > 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> > 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4 > [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8 > [63123.014678] CR2: 0080 > [63123.016590] ---[ end trace a295ea7259c17880 ]— > > This is reproducible in a cycle, where a series of writes is followed by SCSI > device delete command. The test may take up to few minutes. > Please also place your SOB here. > Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace > bi_bdev with a gendisk pointer and partitions index") > --- > fs/btrfs/raid56.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c > index dec0907dfb8a..fcfc20de2df3 100644 > --- a/fs/btrfs/raid56.c > +++ b/fs/btrfs/raid56.c > @@ -1370,6 +1370,7 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio, > stripe_start = stripe->physical; > if (physical >= stripe_start && >
Re: [PATCH] btrfs: fix NPD during canceling replace when the target is missing
On Tue, Feb 20, 2018 at 10:48:09PM +0800, Anand Jain wrote:
> Replace target can be missing after a reboot during the replace.
> So check if device is null.
>
> BUG: unable to handle kernel NULL pointer dereference at 00b0
> IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
> Call Trace:
>  btrfs_dev_replace_cancel+0x22b/0x250 [btrfs]
>  btrfs_ioctl+0x2216/0x2590 [btrfs]
>  ? do_vfs_ioctl+0x625/0x650
>  do_vfs_ioctl+0x625/0x650
>  ? security_file_ioctl+0x30/0x50
>  SyS_ioctl+0x4e/0x80
>  do_syscall_64+0x5d/0x160
>  entry_SYSCALL64_slow_path+0x25/0x25
>
> Signed-off-by: Anand Jain
> ---
>  fs/btrfs/dev-replace.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 87f975143c05..476981c2cf55 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -749,7 +749,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>  		btrfs_dev_name(src_device), src_device->devid,
>  		btrfs_dev_name(tgt_device));
>
> -	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +	if (tgt_device)
> +		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);

I'll just discard the patch you sent a week ago that removes the 'if' in
the first place.

https://patchwork.kernel.org/patch/10215103/
Re: btrfs_clone_files and bind mounts
On 20.02.2018 17:50, Hristo Venev wrote:
> What is the problem with cloning files between different (vfs)mounts of
> the same filesystem?

The "problem" is not really a problem, but rather a well-imposed
restriction:

From http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html

"Both files must reside within the same filesystem."

And as a matter of fact this is enforced in the generic
vfs_clone_file_range. Of course, if we were to do this across filesystems
then we'd have all the problems associated with not being able to ensure
atomicity of operations.
btrfs_clone_files and bind mounts
What is the problem with cloning files between different (vfs)mounts of the same filesystem?
Re: [PATCH V3] Btrfs: enchanse raid1/10 balance heuristic
Gentle ping.

2018-01-03 0:23 GMT+03:00 Timofey Titovets:
> 2018-01-02 21:31 GMT+03:00 Liu Bo :
>> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
>>> Currently btrfs raid1/10 balancer balance requests to mirrors,
>>> based on pid % num of mirrors.
>>>
>>> Make logic understood:
>>> - if one of underline devices are non rotational
>>> - Queue leght to underline devices
>>>
>>> By default try use pid % num_mirrors guessing, but:
>>> - If one of mirrors are non rotational, repick optimal to it
>>> - If underline mirror have less queue leght then optimal,
>>>    repick to that mirror
>>>
>>> For avoid round-robin request balancing,
>>> lets round down queue leght:
>>> - By 8 for rotational devs
>>> - By 2 for all non rotational devs
>>>
>>
>> Sorry for making a late comment on v3.
>>
>> It's good to choose non-rotational if it could.
>>
>> But I'm not sure whether it's a good idea to guess queue depth here
>> because filesystem is still at a high position of IO stack. It'd
>> probably get good results when running tests, but in practical mixed
>> workloads, the underlying queue depth will be changing all the time.
>
> First version supposed for SSD, SSD + HDD only cases.
> At that version that just a "attempt", make LB on hdd.
> That can be easy dropped, if we decide that's a bad behaviour.
>
> If i understood correctly, which counters used,
> we check count of I/O ops that device processing currently
> (i.e. after merging & etc),
> not queue what not send (i.e. before merging & etc).
>
> i.e. we guessed based on low level block io stuff.
> As example that not work on zram devs (AFAIK, as zram don't have that
> counters).
>
> So, no matter at which level we check that.
>
>> In fact, I think for rotational disks, more merging and less seeking
>> make more sense, even in raid1/10 case.
>>
>> Thanks,
>>
>> -liubo
>
> queue_depth changing must not create big problems there,
> i.e. round_down must make all changes "equal".
>
> For hdd, if we have a "big" (8..16?) queue depth,
> with high probability that hdd overloaded,
> and if other hdd have much less load
> (may be instead of round_down, that better use abs diff > 8)
> we try to requeue requests to other hdd.
>
> That will not show true equal distribution, but in case where
> one disks have more load, and pid based mechanism fail to make LB,
> we will just move part of load to other hdd.
>
> Until load distribution will not changed.
>
> May be for HDD that need to make threshold more aggressive, like 16
> (i.e. afaik SATA drives have hw rq len 31, so just use half of that).
>
> Thanks.
>
>>> Changes:
>>> v1 -> v2:
>>> - Use helper part_in_flight() from genhd.c
>>>   to get queue lenght
>>> - Move guess code to guess_optimal()
>>> - Change balancer logic, try use pid % mirror by default
>>>   Make balancing on spinning rust if one of underline devices
>>>   are overloaded
>>> v2 -> v3:
>>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>>
>>> Signed-off-by: Timofey Titovets
>>> ---
>>>  block/genhd.c | 1 +
>>>  fs/btrfs/volumes.c | 115 -
>>>  2 files changed, 114 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 96a66f671720..a77426a7 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
>>>  				atomic_read(&part->in_flight[1]);
>>>  	}
>>>  }
>>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>>
>>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>>  {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 49810b70afd3..a3b80ba31d4d 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -27,6 +27,7 @@
>>>  #include
>>>  #include
>>>  #include
>>> +#include
>>>  #include
>>>  #include "ctree.h"
>>>  #include "extent_map.h"
>>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>>>  	return ret;
>>>  }
>>>
>>> +/**
>>> + * bdev_get_queue_len - return rounded down in flight queue lenght of bdev
>>> + *
>>> + * @bdev: target bdev
>>> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
>>> + */
>>> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
>>> +{
>>> +	int sum;
>>> +	struct hd_struct *bd_part = bdev->bd_part;
>>> +	struct request_queue *rq = bdev_get_queue(bdev);
>>> +	uint32_t inflight[2] = {0, 0};
>>> +
>>> +	part_in_flight(rq, bd_part, inflight);
>>> +
>>> +	sum = max_t(uint32_t, inflight[0], inflight[1]);
>>> +
>>> +	/*
>>> +	 * Try prevent switch for every sneeze
>>> +	 * By roundup output num by some value
>>> +	 */
>>> +	return ALIGN_DOWN(sum, round_down);
>>> +}
>>> +
>>> +/**
>>> + *
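The rounding step discussed in this thread can be tried outside the kernel. What follows is a minimal userspace sketch, not the patch itself: it assumes a power-of-two rounding factor (which is what the kernel's ALIGN_DOWN() expects), and the names align_down_u32() and rounded_queue_len() are hypothetical helpers made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's ALIGN_DOWN().
 * Assumption: factor is a non-zero power of two.
 */
static uint32_t align_down_u32(uint32_t value, uint32_t factor)
{
	assert(factor != 0 && (factor & (factor - 1)) == 0);
	return value & ~(factor - 1);
}

/*
 * Hypothetical helper following the logic quoted above: take the larger of
 * the two in-flight counters and round it down, so that small differences
 * between mirrors do not trigger a re-pick.
 */
static uint32_t rounded_queue_len(uint32_t inflight0, uint32_t inflight1,
				  uint32_t factor)
{
	uint32_t sum = inflight0 > inflight1 ? inflight0 : inflight1;

	return align_down_u32(sum, factor);
}
```

With a factor of 8, any queue length from 0 to 7 compares as 0, so two rotational mirrors with 3 and 6 requests in flight look equally loaded; with a factor of 2, only a difference of a single request is hidden.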
Re: Status of FST and mount times
On 2018-02-20 09:59, Ellis H. Wilson III wrote:
> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>
>> OK, this explains everything.
>>
>> There are too many chunks.
>> This means at mount you need to search for block group item 3454 times.
>>
>> Even each search only needs to iterate 3 tree blocks, multiply it 3454 it would still be a big work.
>> Although some tree blocks like the root node and level 1 nodes can be cached, we still need to read about 3500 tree blocks.
>>
>> If the fs is created using 16K nodesize, this means you need to do random read for 54M using 16K blocksize.
>>
>> No wonder it will takes some time.
>>
>> Normally I would expect 1G chunk for each data and metadata chunk.
>>
>> If there is nothing special, it means your filesystem is already larger than 3T.
>> If your used space is way smaller (less than 30%) than 3.5T, then this means your chunk usage is pretty low, and in that case, balance to reduce number of chunks (block groups) would reduce mount time.
>
> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by btrfs fi df. So, from what I am hearing, this mount time is normal for a filesystem this size. Ignoring a more complex and proper fix like the ones we've been discussing, would bumping the nodesize reduce the number of chunks, thereby reducing the mount time?

It would probably not. Chunk size is only based on the total size of the filesystem, with reasonable base values, so you would still need to have at least as many chunks to store the same amount of data (increase the node size too much though, and you will end up with more chunks, because you'll have more empty space wasted).

> I don't see why balance would come into play here -- my understanding was that was for aged filesystems. The only operations I've done on here was:
> 1. Format filesystem clean
> 2. Create a subvolume
> 3. rsync our home directories into that new subvolume
> 4. Create another subvolume
> 5. rsync our home directories into that new subvolume
>
> Accordingly, zero (or at least, extremely little) data should have been overwritten, so I would expect things to be fairly well allocated already. Please correct me if this is naive thinking.

Your logic is in general correct regarding data, but not necessarily metadata. Assuming you did not use the `--inplace` option for rsync, it had to issue a rename for each individual file that got copied in, and as a result there was likely a lot of metadata being rewritten.

As far as balance being for aged filesystems, that's not exactly true. There are four big reasons you might run a balance:

1. As part of reshaping a volume. You generally want to run a balance whenever the number of disks in a volume permanently increases (it will happen automatically when it permanently decreases, as the device deletion operation is a special type of balance under the hood). It's also used for converting chunk profiles.
2. To free up empty space inside chunks when the filesystem is full at the chunk level.
3. To redistribute data across multiple disks in a more even manner after deleting a lot of data.
4. To reduce the likelihood of 2 or 3 being an issue.

Reasons 2 and 3 are generally more likely to be needed on old volumes. Reason 1 is independent of the age of a volume. Reason 4 is the reason for the regular filtered balances that I and some other people recommend be run as part of preventative maintenance, and is also generally independent of the age of a volume.

Qu's suggestion is actually independent of all the above reasons, but does kind of fit in with the fourth as another case of preventative maintenance.

>>> I was using btrfs sub del -C for the deletions, so I believe (if that command truly waits for the subvolume to be utterly gone) it captures the entirety of the snapshot.
>>
>> No, snapshot deletion is completely delayed in background.
>>
>> -C only ensures that even a powerloss happen after command return, you won't see the snapshot anywhere, but it will still be deleted in background.
>
> Ah, I had no idea. Thank you! Is there any way to "encourage" btrfs-cleaner to run at specific times, which I presume is the snapshot deletion process you are referring to? If it can be told to run at a given time, can I throttle how fast it works, such that I avoid some of the high foreground interruption I've seen in the past?

I don't think there's any way to do this right now (though it would be nice if there was). In theory, you could adjust the priority of the kernel thread itself, but messing around with kthread priorities is seriously dangerous even if you know exactly what you're doing.
Re: Status of FST and mount times
On 02/16/2018 07:59 PM, Qu Wenruo wrote:
> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>> 3454
>
> OK, this explains everything.
>
> There are too many chunks.
> This means at mount you need to search for block group item 3454 times.
>
> Even each search only needs to iterate 3 tree blocks, multiply it 3454 it would still be a big work.
> Although some tree blocks like the root node and level 1 nodes can be cached, we still need to read about 3500 tree blocks.
>
> If the fs is created using 16K nodesize, this means you need to do random read for 54M using 16K blocksize.
>
> No wonder it will takes some time.
>
> Normally I would expect 1G chunk for each data and metadata chunk.
>
> If there is nothing special, it means your filesystem is already larger than 3T.
> If your used space is way smaller (less than 30%) than 3.5T, then this means your chunk usage is pretty low, and in that case, balance to reduce number of chunks (block groups) would reduce mount time.

The nodesize is 16K, and the filesystem data is 3.32TiB as reported by btrfs fi df. So, from what I am hearing, this mount time is normal for a filesystem this size. Ignoring a more complex and proper fix like the ones we've been discussing, would bumping the nodesize reduce the number of chunks, thereby reducing the mount time?

I don't see why balance would come into play here -- my understanding was that was for aged filesystems. The only operations I've done on here was:
1. Format filesystem clean
2. Create a subvolume
3. rsync our home directories into that new subvolume
4. Create another subvolume
5. rsync our home directories into that new subvolume

Accordingly, zero (or at least, extremely little) data should have been overwritten, so I would expect things to be fairly well allocated already. Please correct me if this is naive thinking.

>> I was using btrfs sub del -C for the deletions, so I believe (if that command truly waits for the subvolume to be utterly gone) it captures the entirety of the snapshot.
>
> No, snapshot deletion is completely delayed in background.
>
> -C only ensures that even a powerloss happen after command return, you won't see the snapshot anywhere, but it will still be deleted in background.

Ah, I had no idea. Thank you! Is there any way to "encourage" btrfs-cleaner to run at specific times, which I presume is the snapshot deletion process you are referring to? If it can be told to run at a given time, can I throttle how fast it works, such that I avoid some of the high foreground interruption I've seen in the past?

Thanks,

ellis
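The "54M" estimate in this thread is simple arithmetic, and a small sketch makes it easy to redo for other chunk counts. The code below is only an illustration under one stated assumption: each block group lookup costs one uncached tree block read (the "about 3500 tree blocks" figure). The function names mount_read_bytes() and bytes_to_mib() are hypothetical and do not exist in btrfs.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes read at mount time if every block group lookup costs
 * blocks_per_lookup uncached tree blocks of nodesize bytes each.
 */
static uint64_t mount_read_bytes(uint64_t chunks, uint64_t blocks_per_lookup,
				 uint64_t nodesize)
{
	return chunks * blocks_per_lookup * nodesize;
}

/* Whole MiB, rounded down. */
static uint64_t bytes_to_mib(uint64_t bytes)
{
	return bytes / (1024 * 1024);
}
```

For 3454 chunks and a 16 KiB nodesize, mount_read_bytes(3454, 1, 16384) gives 56590336 bytes, just under 54 MiB of small random reads, which on a rotational disk is dominated by seek time rather than bandwidth.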
[PATCH 1/1] btrfs: remove assert in btrfs_init_dev_replace_tgtdev()
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.

Signed-off-by: Anand Jain
---
 fs/btrfs/volumes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d4a5f9126e7b..ac29ec0de984 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2655,7 +2655,6 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
 	device->total_bytes = btrfs_device_get_total_bytes(srcdev);
 	device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev);
 	device->bytes_used = btrfs_device_get_bytes_used(srcdev);
-	ASSERT(list_empty(&device->resized_list));
 	device->commit_total_bytes = srcdev->commit_total_bytes;
 	device->commit_bytes_used = device->bytes_used;
 	device->fs_info = fs_info;
-- 
2.15.0
[PATCH] btrfs: fix NPD during canceling replace when the target is missing
Replace target can be missing after a reboot during the replace.
So check if device is null.

BUG: unable to handle kernel NULL pointer dereference at 00b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
 btrfs_dev_replace_cancel+0x22b/0x250 [btrfs]
 btrfs_ioctl+0x2216/0x2590 [btrfs]
 ? do_vfs_ioctl+0x625/0x650
 do_vfs_ioctl+0x625/0x650
 ? security_file_ioctl+0x30/0x50
 SyS_ioctl+0x4e/0x80
 do_syscall_64+0x5d/0x160
 entry_SYSCALL64_slow_path+0x25/0x25

Signed-off-by: Anand Jain
---
 fs/btrfs/dev-replace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 87f975143c05..476981c2cf55 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -749,7 +749,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
 		btrfs_dev_name(src_device), src_device->devid,
 		btrfs_dev_name(tgt_device));
 
-	btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+	if (tgt_device)
+		btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 leave:
 	mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
-- 
2.15.0
[PATCH 1/1] btrfs: fix NPD when target device is missing
The replace target device can be missing in which case we don't
allocate a missing btrfs_device when mounted with the -o degraded.
So check the device before access.

BUG: unable to handle kernel NULL pointer dereference at 00b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
 btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
 btrfs_ioctl+0x2216/0x2590 [btrfs]
 do_vfs_ioctl+0x625/0x650
 SyS_ioctl+0x4e/0x80
 do_syscall_64+0x5d/0x160
 entry_SYSCALL64_slow_path+0x25/0x25

Signed-off-by: Anand Jain
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index dbaa6880a15e..87f975143c05 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -312,7 +312,7 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
 
 static char* btrfs_dev_name(struct btrfs_device *device)
 {
-	if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+	if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
 		return "<missing disk>";
 	else
 		return rcu_str_deref(device->name);
-- 
2.15.0
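The fix above relies on the order of evaluation of the || operator: the NULL test runs first, and the state bit is only read when the pointer is valid. A minimal userspace sketch of the same pattern follows. Everything in it is a simplified stand-in, not kernel code: struct mock_device, its missing flag, mock_dev_name() and the placeholder string are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct btrfs_device. */
struct mock_device {
	bool missing;		/* stands in for BTRFS_DEV_STATE_MISSING */
	const char *name;
};

/*
 * Same shape as the patched check: !device is evaluated first, and because
 * || short-circuits, device->missing is never read through a NULL pointer.
 */
static const char *mock_dev_name(const struct mock_device *device)
{
	if (!device || device->missing)
		return "<missing disk>";	/* illustrative placeholder */
	return device->name;
}
```

Swapping the two operands of || would reintroduce the NULL dereference, which is why the new test is placed on the left.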
Re: [PATCH RESEND] btrfs: delete function btrfs_close_extra_devices()
On 02/20/2018 12:53 AM, David Sterba wrote:
> On Thu, Feb 15, 2018 at 01:02:24PM +0800, Anand Jain wrote:
>> btrfs_close_extra_devices() is not exactly about just closing the opened devices, but its also about free-ing the stale devices which may have scanned into the btrfs_fs_devices::dev_list. The way it picks devices to be freed is by going through the btrfs_fs_devices::dev_list and its seed devices, and finding for devices which do not have the flag BTRFS_DEV_STATE_IN_FS_METADATA nor if it is part of the replace target.
>>
>> However, in the first place the way devices are scanned and added to the btrfs_fs_devices::dev_list have changed for a long time now. During scan when it finds matching fsid+uuid+devid it would add the device to btrfs_fs_devices::dev_list. A matched device with higher generation number overwrites the device with lower generation number during. Further, the stale devices containing the stale fsid are removed at the time of the scan itself. So there isn't any opportunity that btrfs_close_extra_devices() can free the stale device within the fsid which is being mounted.
>>
>> Further about the btrfs_fs_devices::latest_bdev that the btrfs_close_extra_devices() function assigns, is already assigned by the function __btrfs_open_devices().
>>
>> So as this function has no effect, delete it.
>
> I think this is correct. Freeing stale devices as a side effect of mount does not seem right anyway. I'm still not able to convince myself that there's not an unexpected interaction of dev scan and dev replace, as it relies on the state bits and other locks. If you have ideas where to put some asserts or extra checks, please suggest.

Thanks for the comments.

I took a fresh look after a long weekend. Here I found something new. The following test case [1] can simulate a stale SB on the missing device which got deleted. Now if we do a fresh dev scan and mount, the deleted device should not be kept in the device_list after the mount, because it's stale. And as the IN_FS_METADATA flag is not set on this device, this function helps to clean up those devices. The dev scan free_stale can't catch this condition because we won't go deeper than the SB read.

Sorry, my bad, pls ignore this patch. This is a very rare corner case, or I had gone nuts when I investigated this.

[1]
mkfs.btrfs -fq -mraid1 -draid1 /dev/sde /dev/sdf /dev/sdg
modprobe -r btrfs
mount -o degraded,device=/dev/sde /dev/sdf /btrfs
btrfs dev del 3 /btrfs
umount /btrfs
btrfs dev scan /dev/sde /dev/sdf /dev/sdg
mount /dev/sde /btrfs
wipefs -a /dev/sdg <-- this should work as sdg is not part of fsid mounted anymore.

I am thinking of renaming this function to btrfs_free_extra_devices() and adding comments.

Also about the fs_devices::latest_bdev: we certainly assign it in __btrfs_open_devices(), but there we don't care if it picks the replacing target. And suppose the replacing target has the highest (or equal) generation number compared to the others; then we shall read the sys and chunk tree from it. So during the mount context we use and reconstruct using the replacing target, but at the end of the mount context, we use the non-replace target as the latest_bdev.

Thanks, Anand
Re: [PATCH] btrfs: Fix rcu_dereference usage outside of read critical section
On 02/20/2018 07:40 PM, Nikolay Borisov wrote:
> Patch 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
> added a new btrfs_info call with a couple of btrfs_dev_name() args. This
> is wrong since the latter require being called in rcu read side critical
> section. Fix it by instead calling btrfs_info_in_rcu. This fixes the
> following splat:
>
> =============================
> WARNING: suspicious RCU usage
> 4.16.0-rc2-nbor #463 Not tainted
> -----------------------------
> fs/btrfs/dev-replace.c:318 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by btrfs/5698:
>  #0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}, at: [<942cb4ee>] btrfs_dev_replace_cancel+0xac/0x3f0
>
> stack backtrace:
> CPU: 2 PID: 5698 Comm: btrfs Not tainted 4.16.0-rc2-nbor #463
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> Call Trace:
>  dump_stack+0x85/0xc9
>  lockdep_rcu_suspicious+0x123/0x170
>  btrfs_dev_name.part.1+0x6d/0x80
>  btrfs_dev_replace_cancel+0x330/0x3f0
>  btrfs_ioctl+0x2751/0x65b0
>  ? debug_check_no_locks_freed+0x290/0x290
>  ? trace_hardirqs_on_caller+0x400/0x570
>  ? trace_hardirqs_on+0xd/0x10
>  ? btrfs_ioctl_get_supported_features+0x30/0x30
>  ? __handle_mm_fault+0x1aca/0x3230
>  ? lock_downgrade+0x650/0x650
>  ? trace_hardirqs_on+0xd/0x10
>  ? mem_cgroup_commit_charge+0xc0/0xdd0
>  ? _raw_spin_unlock+0x27/0x40
>  ? __handle_mm_fault+0x1aca/0x3230
>  ? lock_downgrade+0x650/0x650
>  ? vm_insert_page+0x650/0x650
>  ? __vma_link_rb+0x125/0x1d0
>  do_vfs_ioctl+0x184/0xf00
>  ? do_vfs_ioctl+0x184/0xf00
>  ? lock_downgrade+0x650/0x650
>  ? ioctl_preallocate+0x1a0/0x1a0
>  ? up_read+0x1f/0x40
>  ? __do_page_fault+0x5c6/0xb30
>  ? SyS_brk+0x412/0x5f0
>  ? mm_fault_error+0x2e0/0x2e0
>  SyS_ioctl+0x41/0x70
>  ? do_vfs_ioctl+0xf00/0xf00
>  do_syscall_64+0x19d/0x5d0
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Fixes: 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
> Signed-off-by: Nikolay Borisov

I notice too. Thanks Nikolay for the fix.

Reviewed-by: Anand Jain

> ---
>  fs/btrfs/dev-replace.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 3b0760f7ec8a..0e776eb90ad8 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -744,7 +744,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>  	ret = btrfs_commit_transaction(trans);
>  	WARN_ON(ret);
>
> -	btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
> +	btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s cancelled",
>  		btrfs_dev_name(src_device), src_device->devid,
>  		btrfs_dev_name(tgt_device));
Re: [PATCH] btrfs/150: add _scratch_dev_pool_get/put to run the test as expected
On 2018-02-20 13:34, Misono, Tomohiro wrote:
> btrfs/150 uses RAID1 profile and make SCRATCH_DEV fail for test.
> However, if SCRATCH_DEV_POOL consists more than two devices, SCRATCH_DEV
> may not be used for RAID1 pair and the tests may not run as expected.
>
> Fix this by add _scratch_dev_pool_get/put like other tests (141, 143
> etc.) do.
>
> Signed-off-by: Tomohiro Misono

Indeed, if we have more devices, it's highly possible that the first
device doesn't have data stripe of the raid1 chunk on it.
(And under most case it won't have data stripe, since during mkfs we use
the first device to contain temporary chunks, so unallocated space of
devid 1 is smaller compared to other devices, and chunk allocator will
use device with more unallocated space)

Reviewed-by: Qu Wenruo

Thanks,
Qu

> ---
>  tests/btrfs/150 | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/tests/btrfs/150 b/tests/btrfs/150
> index 97041b6c..1e4586be 100755
> --- a/tests/btrfs/150
> +++ b/tests/btrfs/150
> @@ -55,6 +55,7 @@ _supported_os Linux
>  _require_scratch
>  _require_fail_make_request
>  _require_scratch_dev_pool 2
> +_scratch_dev_pool_get 2
>
>  SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
>  enable_io_failure()
> @@ -100,6 +101,7 @@ while [[ -z $result ]]; do
>  	disable_io_failure
>  done
>
> +_scratch_dev_pool_put
>  # success, all done
>  status=0
>  exit
>
[PATCH] btrfs: Fix rcu_dereference usage outside of read critical section
Patch 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
added a new btrfs_info call with a couple of btrfs_dev_name() args. This
is wrong since the latter require being called in rcu read side critical
section. Fix it by instead calling btrfs_info_in_rcu. This fixes the
following splat:

=============================
WARNING: suspicious RCU usage
4.16.0-rc2-nbor #463 Not tainted
-----------------------------
fs/btrfs/dev-replace.c:318 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by btrfs/5698:
 #0: (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}, at: [<942cb4ee>] btrfs_dev_replace_cancel+0xac/0x3f0

stack backtrace:
CPU: 2 PID: 5698 Comm: btrfs Not tainted 4.16.0-rc2-nbor #463
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x85/0xc9
 lockdep_rcu_suspicious+0x123/0x170
 btrfs_dev_name.part.1+0x6d/0x80
 btrfs_dev_replace_cancel+0x330/0x3f0
 btrfs_ioctl+0x2751/0x65b0
 ? debug_check_no_locks_freed+0x290/0x290
 ? trace_hardirqs_on_caller+0x400/0x570
 ? trace_hardirqs_on+0xd/0x10
 ? btrfs_ioctl_get_supported_features+0x30/0x30
 ? __handle_mm_fault+0x1aca/0x3230
 ? lock_downgrade+0x650/0x650
 ? trace_hardirqs_on+0xd/0x10
 ? mem_cgroup_commit_charge+0xc0/0xdd0
 ? _raw_spin_unlock+0x27/0x40
 ? __handle_mm_fault+0x1aca/0x3230
 ? lock_downgrade+0x650/0x650
 ? vm_insert_page+0x650/0x650
 ? __vma_link_rb+0x125/0x1d0
 do_vfs_ioctl+0x184/0xf00
 ? do_vfs_ioctl+0x184/0xf00
 ? lock_downgrade+0x650/0x650
 ? ioctl_preallocate+0x1a0/0x1a0
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x5c6/0xb30
 ? SyS_brk+0x412/0x5f0
 ? mm_fault_error+0x2e0/0x2e0
 SyS_ioctl+0x41/0x70
 ? do_vfs_ioctl+0xf00/0xf00
 do_syscall_64+0x19d/0x5d0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
Signed-off-by: Nikolay Borisov
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 3b0760f7ec8a..0e776eb90ad8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -744,7 +744,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
 	ret = btrfs_commit_transaction(trans);
 	WARN_ON(ret);
 
-	btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
+	btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s cancelled",
 		btrfs_dev_name(src_device), src_device->devid,
 		btrfs_dev_name(tgt_device));
 
-- 
2.7.4
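The difference between the two logging calls can be mimicked in userspace. The sketch below is a mock, not RCU: a nesting counter stands in for the read-side critical section, and a violation counter plays the role of the lockdep splat. All names (mock_rcu_read_lock(), mock_rcu_deref(), log_name_plain(), log_name_in_rcu()) are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Userspace mock of RCU read-side state: nesting depth and violation count. */
static int mock_rcu_depth;
static int mock_rcu_violations;

static void mock_rcu_read_lock(void)
{
	mock_rcu_depth++;
}

static void mock_rcu_read_unlock(void)
{
	assert(mock_rcu_depth > 0);
	mock_rcu_depth--;
}

/* Stand-in for rcu_dereference_check(): counts uses outside a read section. */
static const char *mock_rcu_deref(const char *const *slot)
{
	if (mock_rcu_depth == 0)
		mock_rcu_violations++;
	return *slot;
}

/* Like the plain logging call: the name is dereferenced unprotected. */
static void log_name_plain(const char *const *slot, char *buf, size_t len)
{
	snprintf(buf, len, "%s", mock_rcu_deref(slot));
}

/* Like the _in_rcu variant: lock, format the message, unlock. */
static void log_name_in_rcu(const char *const *slot, char *buf, size_t len)
{
	mock_rcu_read_lock();
	snprintf(buf, len, "%s", mock_rcu_deref(slot));
	mock_rcu_read_unlock();
}
```

Calling log_name_in_rcu() leaves the violation counter untouched, while log_name_plain() increments it, which is the same distinction the one-line fix makes by holding the read-side section for the whole duration of message formatting.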