Re: BUG: unable to handle kernel paging request at ffff9fb75f827100

2018-02-20 Thread Nikolay Borisov


On 21.02.2018 06:09, Christoph Anton Mitterer wrote:
> Hi.
> 
> Not sure if that's a bug in btrfs... maybe someone's interested in it.
> 
> Cheers,
> Chris.
> 
> # uname -a
> Linux heisenberg 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 
> GNU/Linux
> 
> 
> Feb 21 04:55:51 heisenberg kernel: BUG: unable to handle kernel paging 
> request at 9fb75f827100
> Feb 21 04:55:51 heisenberg kernel: IP: btrfs_drop_inode+0x16/0x40 [btrfs]

This looks like the one fixed by
e8f1bc1493855e32b7a2a019decc3c353d94daf6 . It's tagged for stable so you
should get it eventually.

> Feb 21 04:55:51 heisenberg kernel: PGD 410806067 P4D 410806067 PUD 0 
> Feb 21 04:55:51 heisenberg kernel: Oops:  [#1] SMP PTI
> Feb 21 04:55:51 heisenberg kernel: Modules linked in: vhost_net vhost tap 
> xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
> nf_nat_ipv4 nf_nat tun bridge stp llc ctr ccm fuse ebtable_filter ebtables 
> devlink cpufreq_userspace cpufreq_powersave cpufreq_conservative ip6t_REJECT 
> nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
> ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 
> nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter 
> binfmt_misc arc4 snd_hda_codec_hdmi btusb btrtl btbcm btintel bluetooth 
> snd_hda_codec_realtek snd_hda_codec_generic drbg uvcvideo videobuf2_vmalloc 
> videobuf2_memops snd_soc_skl snd_usb_audio ansi_cprng snd_soc_skl_ipc 
> cdc_mbim cdc_wdm snd_usbmidi_lib snd_soc_sst_ipc cdc_ncm snd_soc_sst_dsp 
> snd_rawmidi ecdh_generic
> Feb 21 04:55:51 heisenberg kernel:  videobuf2_v4l2 i2c_designware_platform 
> iwlmvm snd_hda_ext_core videobuf2_core usbnet i2c_designware_core 
> snd_seq_device snd_soc_sst_match mii snd_soc_core videodev snd_compress media 
> mac80211 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
> kvm irqbypass intel_cstate snd_hda_intel intel_uncore iwlwifi snd_hda_codec 
> intel_rapl_perf snd_hda_core snd_hwdep snd_pcm pcspkr joydev evdev cfg80211 
> serio_raw snd_timer rfkill snd iTCO_wdt sg soundcore iTCO_vendor_support i915 
> wmi shpchp drm_kms_helper mei_me battery fujitsu_laptop mei tpm_crb idma64 
> button drm sparse_keymap video i2c_algo_bit ac acpi_pad intel_lpss_pci 
> intel_lpss mfd_core loop parport_pc ppdev sunrpc lp parport ip_tables 
> x_tables autofs4 algif_skcipher af_alg ext4 crc16 mbcache jbd2 fscrypto ecb 
> hid_generic
> Feb 21 04:55:51 heisenberg kernel:  usbhid hid dm_crypt dm_mod raid10 raid456 
> async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid1 
> raid0 multipath linear md_mod btrfs crc32c_generic xor zstd_decompress 
> zstd_compress xxhash uas raid6_pq uhci_hcd ehci_pci ehci_hcd usb_storage 
> sd_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc 
> aesni_intel aes_x86_64 crypto_simd glue_helper cryptd e1000e xhci_pci ahci 
> ptp libahci xhci_hcd psmouse pps_core libata sdhci_pci sdhci i2c_i801 
> scsi_mod mmc_core usbcore usb_common
> Feb 21 04:55:51 heisenberg kernel: CPU: 3 PID: 50 Comm: kswapd0 Not tainted 
> 4.14.0-3-amd64 #1 Debian 4.14.17-1
> Feb 21 04:55:51 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
> U757/FJNB2A5, BIOS Version 1.16 07/05/2017
> Feb 21 04:55:51 heisenberg kernel: task: 9fbff9523040 task.stack: 
> ad5e8378
> Feb 21 04:55:51 heisenberg kernel: RIP: 0010:btrfs_drop_inode+0x16/0x40 
> [btrfs]
> Feb 21 04:55:51 heisenberg kernel: RSP: 0018:ad5e83783c28 EFLAGS: 00010286
> Feb 21 04:55:51 heisenberg kernel: RAX: 0001 RBX: 
> 9fbe05d69330 RCX: 
> Feb 21 04:55:51 heisenberg kernel: RDX: 9fb75f827000 RSI: 
> c075f2b0 RDI: 9fbe05d69330
> Feb 21 04:55:51 heisenberg kernel: RBP: 9fbff2040800 R08: 
> 9fbff1ddea08 R09: ad5e83783d88
> Feb 21 04:55:51 heisenberg kernel: R10: 001bf318 R11: 
>  R12: 9fbe05d693b8
> Feb 21 04:55:51 heisenberg kernel: R13: 9fbe05d69488 R14: 
> 9fbfc26cecc0 R15: 
> Feb 21 04:55:51 heisenberg kernel: FS:  () 
> GS:9fc01dd8() knlGS:
> Feb 21 04:55:51 heisenberg kernel: CS:  0010 DS:  ES:  CR0: 
> 80050033
> Feb 21 04:55:51 heisenberg kernel: CR2: 9fb75f827100 CR3: 
> 00041020a004 CR4: 003606e0
> Feb 21 04:55:51 heisenberg kernel: DR0:  DR1: 
>  DR2: 
> Feb 21 04:55:51 heisenberg kernel: DR3:  DR6: 
> fffe0ff0 DR7: 0400
> Feb 21 04:55:51 heisenberg kernel: Call Trace:
> Feb 21 04:55:51 heisenberg kernel:  iput+0xf7/0x210
> Feb 21 04:55:51 heisenberg kernel:  __dentry_kill+0xce/0x160
> Feb 21 04:55:51 heisenberg kernel:  shrink_dentry_list+0xe0/0x2d0
> Feb 21 04:55:51 heisenberg kernel:  prune_dcache_sb+0x52/0x70
> Feb 21 04:55:51 heisenberg kernel:  super_cache_scan+0xf7/0x1a0
> Feb 21 04:55:51 heisenberg kernel:  

BUG: unable to handle kernel paging request at ffff9fb75f827100

2018-02-20 Thread Christoph Anton Mitterer
Hi.

Not sure if that's a bug in btrfs... maybe someone's interested in it.

Cheers,
Chris.

# uname -a
Linux heisenberg 4.14.0-3-amd64 #1 SMP Debian 4.14.17-1 (2018-02-14) x86_64 
GNU/Linux


Feb 21 04:55:51 heisenberg kernel: BUG: unable to handle kernel paging request 
at 9fb75f827100
Feb 21 04:55:51 heisenberg kernel: IP: btrfs_drop_inode+0x16/0x40 [btrfs]
Feb 21 04:55:51 heisenberg kernel: PGD 410806067 P4D 410806067 PUD 0 
Feb 21 04:55:51 heisenberg kernel: Oops:  [#1] SMP PTI
Feb 21 04:55:51 heisenberg kernel: Modules linked in: vhost_net vhost tap 
xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat 
nf_nat_ipv4 nf_nat tun bridge stp llc ctr ccm fuse ebtable_filter ebtables 
devlink cpufreq_userspace cpufreq_powersave cpufreq_conservative ip6t_REJECT 
nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter 
ip6_tables xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 
nf_defrag_ipv4 xt_multiport xt_conntrack nf_conntrack iptable_filter 
binfmt_misc arc4 snd_hda_codec_hdmi btusb btrtl btbcm btintel bluetooth 
snd_hda_codec_realtek snd_hda_codec_generic drbg uvcvideo videobuf2_vmalloc 
videobuf2_memops snd_soc_skl snd_usb_audio ansi_cprng snd_soc_skl_ipc cdc_mbim 
cdc_wdm snd_usbmidi_lib snd_soc_sst_ipc cdc_ncm snd_soc_sst_dsp snd_rawmidi 
ecdh_generic
Feb 21 04:55:51 heisenberg kernel:  videobuf2_v4l2 i2c_designware_platform 
iwlmvm snd_hda_ext_core videobuf2_core usbnet i2c_designware_core 
snd_seq_device snd_soc_sst_match mii snd_soc_core videodev snd_compress media 
mac80211 intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel 
kvm irqbypass intel_cstate snd_hda_intel intel_uncore iwlwifi snd_hda_codec 
intel_rapl_perf snd_hda_core snd_hwdep snd_pcm pcspkr joydev evdev cfg80211 
serio_raw snd_timer rfkill snd iTCO_wdt sg soundcore iTCO_vendor_support i915 
wmi shpchp drm_kms_helper mei_me battery fujitsu_laptop mei tpm_crb idma64 
button drm sparse_keymap video i2c_algo_bit ac acpi_pad intel_lpss_pci 
intel_lpss mfd_core loop parport_pc ppdev sunrpc lp parport ip_tables x_tables 
autofs4 algif_skcipher af_alg ext4 crc16 mbcache jbd2 fscrypto ecb hid_generic
Feb 21 04:55:51 heisenberg kernel:  usbhid hid dm_crypt dm_mod raid10 raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c raid1 
raid0 multipath linear md_mod btrfs crc32c_generic xor zstd_decompress 
zstd_compress xxhash uas raid6_pq uhci_hcd ehci_pci ehci_hcd usb_storage sd_mod 
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel 
aes_x86_64 crypto_simd glue_helper cryptd e1000e xhci_pci ahci ptp libahci 
xhci_hcd psmouse pps_core libata sdhci_pci sdhci i2c_i801 scsi_mod mmc_core 
usbcore usb_common
Feb 21 04:55:51 heisenberg kernel: CPU: 3 PID: 50 Comm: kswapd0 Not tainted 
4.14.0-3-amd64 #1 Debian 4.14.17-1
Feb 21 04:55:51 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
U757/FJNB2A5, BIOS Version 1.16 07/05/2017
Feb 21 04:55:51 heisenberg kernel: task: 9fbff9523040 task.stack: 
ad5e8378
Feb 21 04:55:51 heisenberg kernel: RIP: 0010:btrfs_drop_inode+0x16/0x40 [btrfs]
Feb 21 04:55:51 heisenberg kernel: RSP: 0018:ad5e83783c28 EFLAGS: 00010286
Feb 21 04:55:51 heisenberg kernel: RAX: 0001 RBX: 9fbe05d69330 
RCX: 
Feb 21 04:55:51 heisenberg kernel: RDX: 9fb75f827000 RSI: c075f2b0 
RDI: 9fbe05d69330
Feb 21 04:55:51 heisenberg kernel: RBP: 9fbff2040800 R08: 9fbff1ddea08 
R09: ad5e83783d88
Feb 21 04:55:51 heisenberg kernel: R10: 001bf318 R11:  
R12: 9fbe05d693b8
Feb 21 04:55:51 heisenberg kernel: R13: 9fbe05d69488 R14: 9fbfc26cecc0 
R15: 
Feb 21 04:55:51 heisenberg kernel: FS:  () 
GS:9fc01dd8() knlGS:
Feb 21 04:55:51 heisenberg kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Feb 21 04:55:51 heisenberg kernel: CR2: 9fb75f827100 CR3: 00041020a004 
CR4: 003606e0
Feb 21 04:55:51 heisenberg kernel: DR0:  DR1:  
DR2: 
Feb 21 04:55:51 heisenberg kernel: DR3:  DR6: fffe0ff0 
DR7: 0400
Feb 21 04:55:51 heisenberg kernel: Call Trace:
Feb 21 04:55:51 heisenberg kernel:  iput+0xf7/0x210
Feb 21 04:55:51 heisenberg kernel:  __dentry_kill+0xce/0x160
Feb 21 04:55:51 heisenberg kernel:  shrink_dentry_list+0xe0/0x2d0
Feb 21 04:55:51 heisenberg kernel:  prune_dcache_sb+0x52/0x70
Feb 21 04:55:51 heisenberg kernel:  super_cache_scan+0xf7/0x1a0
Feb 21 04:55:51 heisenberg kernel:  shrink_slab.part.49+0x1e8/0x3e0
Feb 21 04:55:51 heisenberg kernel:  shrink_node+0x123/0x300
Feb 21 04:55:51 heisenberg kernel:  kswapd+0x299/0x6f0
Feb 21 04:55:51 heisenberg kernel:  ? mem_cgroup_shrink_node+0x190/0x190
Feb 21 04:55:51 heisenberg kernel:  kthread+0x11a/0x130
Feb 21 04:55:51 heisenberg kernel:  ? kthread_create_on_node+0x70/0x70
Feb 21 04:55:51 heisenberg 

Re: Status of FST and mount times

2018-02-20 Thread Qu Wenruo


On 2018年02月20日 23:41, Austin S. Hemmelgarn wrote:
> On 2018-02-20 09:59, Ellis H. Wilson III wrote:
>> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>>> On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
>>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>>> 3454
>>>
>>> OK, this explains everything.
>>>
>>> There are too many chunks.
>>> This means that at mount you need to search for a block group item 3454 times.
>>>
>>> Even if each search only needs to iterate 3 tree blocks, multiplied by 3454
>>> it is still a lot of work.
>>> Although some tree blocks like the root node and level 1 nodes can be
>>> cached, we still need to read about 3500 tree blocks.
>>>
>>> If the fs is created using 16K nodesize, this means you need to do
>>> about 54M of random reads using a 16K block size.
>>>
>>> No wonder it takes some time.
>>>
>>> Normally I would expect 1G chunk for each data and metadata chunk.
>>>
>>> If there is nothing special, it means your filesystem is already larger
>>> than 3T.
>>> If your used space is way smaller (less than 30%) than 3.5T, then this
>>> means your chunk usage is pretty low, and in that case, balance to
>>> reduce number of chunks (block groups) would reduce mount time.
>>
>> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by
>> btrfs fi df.  So, from what I am hearing, this mount time is normal
>> for a filesystem this size.  Ignoring a more complex and proper fix
>> like the ones we've been discussing, would bumping the nodesize reduce
>> the number of chunks, thereby reducing the mount time?
> It would probably not.  Chunk size is only based on the total size of
> the filesystem, with reasonable base values, so you would still need to
> have at least as many chunks to store the same amount of data (increase
> the node size too much though, and you will end up with more chunks,
> because you'll have more empty space wasted).

Increasing the node size may reduce the extent tree size, although it
would reduce the tree height by at most one level AFAIK.

But considering that the higher a node is in the tree, the more likely it
is to be cached, reducing the tree height wouldn't have much performance
impact AFAIK.

If someone could run a real-world benchmark to disprove or confirm my
assumption, it would be much better though.
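
As a rough illustration of the estimate quoted above, here is a minimal sketch in C. It assumes one uncached tree block is read per block group lookup and that the upper tree levels stay cached; the function names are hypothetical and are not btrfs code.

```c
#include <stdint.h>

/*
 * Bytes read while looking up every block group item at mount,
 * assuming one uncached tree block (nodesize bytes) per chunk.
 * This is an assumption for illustration, not a measurement.
 */
uint64_t mount_read_bytes(uint64_t nr_chunks, uint64_t nodesize)
{
	return nr_chunks * nodesize;
}

/*
 * Smallest number of chunks that can hold used_bytes if every chunk
 * is chunk_bytes large and completely full (rounds up).
 */
uint64_t min_chunks(uint64_t used_bytes, uint64_t chunk_bytes)
{
	return (used_bytes + chunk_bytes - 1) / chunk_bytes;
}
```

With 3454 chunks and a 16K nodesize, mount_read_bytes() gives 56590336 bytes, which is the roughly 54M of random reads mentioned above. min_chunks() shows why balance cannot help once chunks are already nearly full: the chunk count is bounded below by the used space divided by the chunk size.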

>>
>> I don't see why balance would come into play here -- my understanding
>> was that was for aged filesystems.  The only operations I've done on
>> here was:
>> 1. Format filesystem clean
>> 2. Create a subvolume
>> 3. rsync our home directories into that new subvolume
>> 4. Create another subvolume
>> 5. rsync our home directories into that new subvolume
>>
>> Accordingly, zero (or at least, extremely little) data should have
>> been overwritten, so I would expect things to be fairly well allocated
>> already.  Please correct me if this is naive thinking.
> Your logic is in general correct regarding data, but not necessarily
> metadata.  Assuming you did not use the `--inplace` option for rsync, it
> had to issue a rename for each individual file that got copied in, and
> as a result there was likely a lot of metadata being rewritten.
> 
> As far as balance being for aged filesystems, that's not exactly true.
> There are four big reasons you might run a balance:
> 
> 1. As part of reshaping a volume.  You generally want to run a balance
> whenever the number of disks in a volume permanently increases (it will
> happen automatically when it permanently decreases, as the device
> deletion operation is a special type of balance under the hood).  It's
> also used for converting chunk profiles.
> 2. To free up empty space inside chunks when the filesystem is full at
> the chunk level.
> 3. To redistribute data across multiple disks in a more even manner
> after deleting a lot of data.
> 4. To reduce the likelihood of 2 or 3 being an issue.
> 
> Reasons 2 and 3 are generally more likely to be needed on old volumes.
> Reason 1 is independent of the age of a volume.  Reason 4 is the reason
> for the regular filtered balances that I and some other people recommend
> be run as part of preventative maintenance, and is also generally
> independent of the age of a volume.
> 
> Qu's suggestion is actually independent of all the above reasons, but
> does kind of fit in with the fourth as another case of preventative
> maintenance.

My suggestion is to use balance to reduce the number of block groups, so
we do fewer searches at mount time.

It's more like reason 2.

But it only works when there is a lot of fragmentation, so that many
chunks are not fully utilized.
Unfortunately, that's not the case for OP, so my suggestion doesn't make
sense here.

BTW, if OP still wants to try something that could possibly reduce mount
time with the same fs, I could try some modifications to the current block
group iteration code to see if it makes sense.

Thanks,
Qu

>>
>>>> I was using btrfs sub del -C for the deletions, so I believe (if that
>>>> command truly waits for the subvolume to be utterly gone) it captures
>>>> the entirety

Re: [PATCH v2 00/27] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"

2018-02-20 Thread David Sterba
On Thu, Feb 15, 2018 at 11:04:45AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> This is v2 of "btrfs-progs as a library".
> 
> Most of the changes since v1 are small:
> 
> - Rebased onto v4.15
> - Split up btrfs_util_subvolume_path() which was accidentally squashed 
> together
>   with the commit adding btrfs_util_create_subvolume()
> - Renamed btrfs_util_f_* functions to btrfs_util_*_fd for clarity

I like this naming scheme.

> - Added -fvisibility=hidden and a macro for
>   __attribute__((visibility("default")))
> - Changed to use semantic versioning
> - Fixed missing install of btrfsutil.h
> - Documented functions which require root or are non-atomic
> - Added a missing license to setup.py
> 
> The bigger change is in the last two patches. Dave requested that I get
> rid of the runtime dependency of libbtrfsutil from libbtrfs. The easiest
> way to do this was to remove the btrfs_list_subvols_print()
> implementation from libbtrfs and put it in cmds-subvolume.c (details in
> patch 26). I'm open to alternatives.

This should be ok as a temporary fix to get the library going. The
column printing helpers will be replaced by libsmartcols (I have a
prototype for that but there are still some issues to fix).

> Omar Sandoval (27):
>   btrfs-progs: get rid of undocumented qgroup inheritance options

For initial merge I'll skip this patch (and what depends on it), as the
functionality is not yet out of kernel. I looked at the patch and am not
yet convinced to merge it, more time needed, but the library should not
be blocked by it.

>   Add libbtrfsutil
>   libbtrfsutil: add Python bindings
>   libbtrfsutil: add btrfs_util_is_subvolume() and
> btrfs_util_subvolume_id()
>   libbtrfsutil: add qgroup inheritance helpers
>   libbtrfsutil: add btrfs_util_create_subvolume()
>   libbtrfsutil: add btrfs_util_subvolume_path()
>   libbtrfsutil: add btrfs_util_subvolume_info()
>   libbtrfsutil: add btrfs_util_[gs]et_read_only()
>   libbtrfsutil: add btrfs_util_[gs]et_default_subvolume()
>   libbtrfsutil: add subvolume iterator helpers
>   libbtrfsutil: add btrfs_util_create_snapshot()
>   libbtrfsutil: add btrfs_util_delete_subvolume()
>   libbtrfsutil: add btrfs_util_deleted_subvolumes()
>   libbtrfsutil: add filesystem sync helpers

I'm going to add the above to devel now.

>   btrfs-progs: use libbtrfsutil for read-only property
>   btrfs-progs: use libbtrfsutil for sync ioctls
>   btrfs-progs: use libbtrfsutil for set-default
>   btrfs-progs: use libbtrfsutil for get-default
>   btrfs-progs: use libbtrfsutil for subvol create and snapshot
>   btrfs-progs: use libbtrfsutil for subvol delete
>   btrfs-progs: use libbtrfsutil for subvol show
>   btrfs-progs: use libbtrfsutil for subvol sync
>   btrfs-progs: replace test_issubvolume() with btrfs_util_is_subvolume()
>   btrfs-progs: add recursive snapshot/delete using libbtrfsutil
>   btrfs-progs: use libbtrfsutil for subvolume list
>   btrfs-progs: deprecate libbtrfs helpers

Besides the subvol and qgroup inheritance, all of the above look good,
but I'd like to spend more time merging them and we should also add
commandline tests for coverage.

I have more comments or maybe questions about the future development
workflow, but at this point the patchset is in a good shape for
incremental merge.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 02/27] Add libbtrfsutil

2018-02-20 Thread David Sterba
On Tue, Feb 20, 2018 at 10:32:17AM -0700, Liu Bo wrote:
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -43,6 +43,8 @@ libbtrfs.so.0.1
> >  library-test
> >  library-test-static
> >  /fssum
> > +/libbtrfsutil.so*
> > +/libbtrfsutil.a
> 
> .gitignore is not part of btrfs-progs, is it?

It is, since version 0.19 :)


Re: [PATCH v2 02/27] Add libbtrfsutil

2018-02-20 Thread Liu Bo
On Thu, Feb 15, 2018 at 11:04:47AM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Currently, users wishing to manage Btrfs filesystems programatically
> have to shell out to btrfs-progs and parse the output. This isn't ideal.
> The goal of libbtrfsutil is to provide a library version of as many of
> the operations of btrfs-progs as possible and to migrate btrfs-progs to
> use it.
> 
> Rather than simply refactoring the existing btrfs-progs code, the code
> has to be written from scratch for a couple of reasons:
> 
> * A lot of the btrfs-progs code was not designed with a nice library API
>   in mind in terms of reusability, naming, and error reporting.
> * libbtrfsutil is licensed under the LGPL, whereas btrfs-progs is under
>   the GPL, which makes it dubious to directly copy or move the code.
> 
> Eventually, most of the low-level btrfs-progs code should either live in
> libbtrfsutil or the shared kernel/userspace filesystem code, and
> btrfs-progs will just be the CLI wrapper.
> 
> This first commit just includes the build system changes, license,
> README, and error reporting helper.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  .gitignore|   2 +
>  Makefile  |  72 ++--
>  Makefile.inc.in   |   2 +-
>  libbtrfsutil/COPYING  | 674 
> ++
>  libbtrfsutil/COPYING.LESSER   | 165 ++
>  libbtrfsutil/README.md|  35 ++
>  libbtrfsutil/btrfsutil.h  |  76 +
>  libbtrfsutil/btrfsutil_internal.h |  40 +++
>  libbtrfsutil/errors.c |  55 
>  9 files changed, 1100 insertions(+), 21 deletions(-)
>  create mode 100644 libbtrfsutil/COPYING
>  create mode 100644 libbtrfsutil/COPYING.LESSER
>  create mode 100644 libbtrfsutil/README.md
>  create mode 100644 libbtrfsutil/btrfsutil.h
>  create mode 100644 libbtrfsutil/btrfsutil_internal.h
>  create mode 100644 libbtrfsutil/errors.c
> 
> diff --git a/.gitignore b/.gitignore
> index 8e607f6e..272d53e4 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,6 +43,8 @@ libbtrfs.so.0.1
>  library-test
>  library-test-static
>  /fssum
> +/libbtrfsutil.so*
> +/libbtrfsutil.a
>

.gitignore is not part of btrfs-progs, is it?

I skipped the Makefile; the .c and .h look good to me.

Reviewed-by: Liu Bo 

Thanks,

-liubo

>  /tests/*-tests-results.txt
>  /tests/test-console.txt
> diff --git a/Makefile b/Makefile
> index 00e21379..7fb70d06 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -73,10 +73,20 @@ CFLAGS = $(SUBST_CFLAGS) \
>-fPIC \
>-I$(TOPDIR) \
>-I$(TOPDIR)/kernel-lib \
> +  -I$(TOPDIR)/libbtrfsutil \
>$(EXTRAWARN_CFLAGS) \
>$(DEBUG_CFLAGS_INTERNAL) \
>$(EXTRA_CFLAGS)
>  
> +LIBBTRFSUTIL_CFLAGS = $(SUBST_CFLAGS) \
> +   $(CSTD) \
> +   -D_GNU_SOURCE \
> +   -fPIC \
> +   -fvisibility=hidden \
> +   -I$(TOPDIR)/libbtrfsutil \
> +   $(EXTRAWARN_CFLAGS) \
> +   $(EXTRA_CFLAGS)
> +
>  LDFLAGS = $(SUBST_LDFLAGS) \
> -rdynamic -L$(TOPDIR) \
> $(DEBUG_LDFLAGS_INTERNAL) \
> @@ -121,12 +131,17 @@ libbtrfs_headers = send-stream.h send-utils.h send.h 
> kernel-lib/rbtree.h btrfs-l
>  kernel-lib/crc32c.h kernel-lib/list.h kerncompat.h \
>  kernel-lib/radix-tree.h kernel-lib/sizes.h kernel-lib/raid56.h \
>  extent-cache.h extent_io.h ioctl.h ctree.h btrfsck.h version.h
> +libbtrfsutil_major := $(shell sed -rn 's/^\#define BTRFS_UTIL_VERSION_MAJOR 
> ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h)
> +libbtrfsutil_minor := $(shell sed -rn 's/^\#define BTRFS_UTIL_VERSION_MINOR 
> ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h)
> +libbtrfsutil_patch := $(shell sed -rn 's/^\#define BTRFS_UTIL_VERSION_PATCH 
> ([0-9])+$$/\1/p' libbtrfsutil/btrfsutil.h)
> +libbtrfsutil_version := 
> $(libbtrfsutil_major).$(libbtrfsutil_minor).$(libbtrfsutil_patch)
> +libbtrfsutil_objects = libbtrfsutil/errors.o
>  convert_objects = convert/main.o convert/common.o convert/source-fs.o \
> convert/source-ext2.o convert/source-reiserfs.o
>  mkfs_objects = mkfs/main.o mkfs/common.o mkfs/rootdir.o
>  image_objects = image/main.o image/sanitize.o
>  all_objects = $(objects) $(cmds_objects) $(libbtrfs_objects) 
> $(convert_objects) \
> -   $(mkfs_objects) $(image_objects)
> +   $(mkfs_objects) $(image_objects) $(libbtrfsutil_objects)
>  
>  udev_rules = 64-btrfs-dm.rules
>  
> @@ -246,11 +261,10 @@ static_convert_objects = $(patsubst %.o, %.static.o, 
> $(convert_objects))
>  static_mkfs_objects = $(patsubst %.o, %.static.o, $(mkfs_objects))
>  static_image_objects = $(patsubst %.o, %.static.o, $(image_objects))
>  
> -libs_shared = libbtrfs.so.0.1
> -libs_static = libbtrfs.a
> +libs_shared = libbtrfs.so.0.1 libbtrfsutil.so.$(libbtrfsutil_version)
> +libs_static = libbtrfs.a 

Re: [PATCH v2] btrfs: fix endianness compatibility during the SB RW

2018-02-20 Thread Liu Bo
On Tue, Feb 13, 2018 at 11:00:46AM +0800, Anand Jain wrote:
> Fixes the endianness bug in the fs_info::super_copy by using its
> btrfs_set_super...() function to set values in the SB, as these
> functions manage the endianness compatibility nicely.

Reviewed-by: Liu Bo 

Thanks,

-liubo
> 
> Signed-off-by: Anand Jain 
> ---
> v1->v2: Update change log. Update $Subject.
> Old:
> [PATCH] btrfs: use set functions to update latest refs to the SB
>  fs/btrfs/transaction.c | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
> index 04f07144b45c..9220f004001c 100644
> --- a/fs/btrfs/transaction.c
> +++ b/fs/btrfs/transaction.c
> @@ -1722,19 +1722,23 @@ static void update_super_roots(struct btrfs_fs_info 
> *fs_info)
>  
>   super = fs_info->super_copy;
>  
> + /* update latest btrfs_super_block::chunk_root refs */
>   root_item = &fs_info->chunk_root->root_item;
> - super->chunk_root = root_item->bytenr;
> - super->chunk_root_generation = root_item->generation;
> - super->chunk_root_level = root_item->level;
> + btrfs_set_super_chunk_root(super, root_item->bytenr);
> + btrfs_set_super_chunk_root_generation(super, root_item->generation);
> + btrfs_set_super_chunk_root_level(super, root_item->level);
>  
> + /* update latest btrfs_super_block::root refs */
>   root_item = &fs_info->tree_root->root_item;
> - super->root = root_item->bytenr;
> - super->generation = root_item->generation;
> - super->root_level = root_item->level;
> + btrfs_set_super_root(super, root_item->bytenr);
> + btrfs_set_super_generation(super, root_item->generation);
> + btrfs_set_super_root_level(super, root_item->level);
> +
>   if (btrfs_test_opt(fs_info, SPACE_CACHE))
> - super->cache_generation = root_item->generation;
> + btrfs_set_super_cache_generation(super, root_item->generation);
>   if (test_bit(BTRFS_FS_UPDATE_UUID_TREE_GEN, &fs_info->flags))
> - super->uuid_tree_generation = root_item->generation;
> + btrfs_set_super_uuid_tree_generation(super,
> +  root_item->generation);
>  }
>  
>  int btrfs_transaction_in_commit(struct btrfs_fs_info *info)
> -- 
> 2.15.0
> 
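
For readers unfamiliar with this class of bug, here is a minimal userspace sketch in C of why a setter is needed. On-disk fields are little-endian, so a plain assignment stores the value in host byte order, which only matches on little-endian machines. The struct and function names below are hypothetical stand-ins, not the real btrfs superblock layout or accessors; as far as I understand, the kernel helpers use cpu_to_le64() and friends, while this sketch converts byte by byte to stay self-contained.

```c
#include <stdint.h>

/* Hypothetical stand-in for an on-disk superblock; NOT the real btrfs layout. */
struct demo_super {
	uint8_t root[8];	/* 64-bit value, little-endian on disk */
};

/* Store a CPU-order value as little-endian bytes, whatever the host order. */
void demo_set_super_root(struct demo_super *sb, uint64_t bytenr)
{
	for (int i = 0; i < 8; i++)
		sb->root[i] = (uint8_t)(bytenr >> (8 * i));
}

/* Read the little-endian bytes back into a CPU-order value. */
uint64_t demo_super_root(const struct demo_super *sb)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)sb->root[i] << (8 * i);
	return v;
}
```

Because the conversion lives in the accessor pair, callers never touch the raw bytes and the stored layout is the same on big-endian and little-endian hosts.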


Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()

2018-02-20 Thread Liu Bo
On Fri, Feb 16, 2018 at 07:51:38PM +, Dmitriy Gorokh wrote:
> On detaching of a disk which is a part of a RAID6 filesystem, the following 
> kernel OOPS may happen:
> 
> [63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.719584] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.803516] BTRFS warning (device sdo): lost page write due to IO error on 
> /dev/sdo 
> [63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, 
> flush 1, corrupt 0, gen 0 
> [63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo 
> [63122.935338] BUG: unable to handle kernel NULL pointer dereference at 
> 0080 
> [63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0 
> [63122.971202] Oops:  [#1] SMP 
> [63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace 
> cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs 
> lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev 
> snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc 
> soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci 
> ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor 
> zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod 
> hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod 
> sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last 
> unloaded: scst] 
> [63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 
> 4.14.2-16-scst34x+ #8 
> [63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS 
> VirtualBox 12/01/2006 
> [63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs] 
> [63123.007595] task: 880036ea4040 task.stack: c90006384000 
> [63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs] 
> [63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287 
> [63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: 
> 8800b2bd5690 
> [63123.008359] RDX:  RSI: 88007bb43500 RDI: 
> 88004beaa000 
> [63123.008621] RBP: c90006387ae8 R08: 9910 R09: 
> 8800b2bd5600 
> [63123.008840] R10: 0004 R11: 0001 R12: 
> 88007bb43500 
> [63123.009059] R13: fffb R14: 880036fc5180 R15: 
> 0004 
> [63123.009278] FS: () GS:8800b700() 
> knlGS: 
> [63123.009564] CS: 0010 DS:  ES:  CR0: 80050033 
> [63123.009748] CR2: 0080 CR3: b0866000 CR4: 
> 000406f0 
> [63123.009969] Call Trace: 
> [63123.010085] raid_write_end_io+0x7e/0x80 [btrfs] 
> [63123.010251] bio_endio+0xa1/0x120 
> [63123.010378] generic_make_request+0x218/0x270 
> [63123.010921] submit_bio+0x66/0x130 
> [63123.011073] finish_rmw+0x3fc/0x5b0 [btrfs] 
> [63123.011245] full_stripe_write+0x96/0xc0 [btrfs] 
> [63123.011428] raid56_parity_write+0x117/0x170 [btrfs] 
> [63123.011604] btrfs_map_bio+0x2ec/0x320 [btrfs] 
> [63123.011759] ? ___cache_free+0x1c5/0x300 
> [63123.011909] __btrfs_submit_bio_done+0x26/0x50 [btrfs] 
> [63123.012087] run_one_async_done+0x9c/0xc0 [btrfs] 
> [63123.012257] normal_work_helper+0x19e/0x300 [btrfs] 
> [63123.012429] btrfs_worker_helper+0x12/0x20 [btrfs] 
> [63123.012656] process_one_work+0x14d/0x350 
> [63123.012888] worker_thread+0x4d/0x3a0 
> [63123.013026] ? _raw_spin_unlock_irqrestore+0x15/0x20 
> [63123.013192] kthread+0x109/0x140 
> [63123.013315] ? process_scheduled_works+0x40/0x40 
> [63123.013472] ? kthread_stop+0x110/0x110 
> [63123.013610] ret_from_fork+0x25/0x30 
> [63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 
> 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 
> 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4 
> [63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8 
> [63123.014678] CR2: 0080 
> [63123.016590] ---[ end trace a295ea7259c17880 ]---
> 
> This is reproducible in a cycle, where a series of writes is followed by a
> SCSI device delete command. The test may take up to a few minutes.
> 

Please also place your SOB here.

> Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace 
> bi_bdev with a gendisk pointer and partitions index")
> ---
>  fs/btrfs/raid56.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
> index dec0907dfb8a..fcfc20de2df3 100644
> --- a/fs/btrfs/raid56.c
> +++ b/fs/btrfs/raid56.c
> @@ -1370,6 +1370,7 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
> stripe_start = stripe->physical;
> if (physical >= stripe_start &&
>  

Re: [PATCH] btrfs: fix NPD during canceling replace when the target is missing

2018-02-20 Thread David Sterba
On Tue, Feb 20, 2018 at 10:48:09PM +0800, Anand Jain wrote:
> Replace target can be missing after a reboot during the replace.
> So check if device is null.
> 
> BUG: unable to handle kernel NULL pointer dereference at 00b0
> IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
> Call Trace:
> btrfs_dev_replace_cancel+0x22b/0x250 [btrfs]
> btrfs_ioctl+0x2216/0x2590 [btrfs]
> ? do_vfs_ioctl+0x625/0x650
> do_vfs_ioctl+0x625/0x650
> ? security_file_ioctl+0x30/0x50
> SyS_ioctl+0x4e/0x80
> do_syscall_64+0x5d/0x160
> entry_SYSCALL64_slow_path+0x25/0x25
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/dev-replace.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 87f975143c05..476981c2cf55 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -749,7 +749,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>             btrfs_dev_name(src_device), src_device->devid,
>             btrfs_dev_name(tgt_device));
>  
> -   btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
> +   if (tgt_device)
> +       btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);

I'll just discard the patch you sent a week ago that removes the 'if' in
the first place.

https://patchwork.kernel.org/patch/10215103/
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs_clone_files and bind mounts

2018-02-20 Thread Nikolay Borisov


On 20.02.2018 17:50, Hristo Venev wrote:
> What is the problem with cloning files between different (vfs)mounts of
> the same filesystem?
> 

The "problem" is not really a problem, but rather a well-imposed
restriction:

From http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html:

"Both files must reside within the same filesystem."

And as a matter of fact this is enforced in the generic
vfs_clone_file_range. Of course if we were to do this across filesystems
then we'd have all the problems associated with not being able to ensure
atomicity of operations.


btrfs_clone_files and bind mounts

2018-02-20 Thread Hristo Venev
What is the problem with cloning files between different (vfs)mounts of
the same filesystem?



Re: [PATCH V3] Btrfs: enchanse raid1/10 balance heuristic

2018-02-20 Thread Timofey Titovets
Gentle ping.

2018-01-03 0:23 GMT+03:00 Timofey Titovets :
> 2018-01-02 21:31 GMT+03:00 Liu Bo :
>> On Sat, Dec 30, 2017 at 11:32:04PM +0300, Timofey Titovets wrote:
>>> Currently the btrfs raid1/10 balancer balances requests across mirrors
>>> based on pid % number of mirrors.
>>>
>>> Make the logic aware of:
>>>  - whether one of the underlying devices is non-rotational
>>>  - the queue length of the underlying devices
>>>
>>> By default, try the pid % num_mirrors guess, but:
>>>  - If one of the mirrors is non-rotational, repick it as optimal
>>>  - If another underlying mirror has a shorter queue than the optimal one,
>>>    repick that mirror
>>>
>>> To avoid round-robin request balancing,
>>> round down the queue length:
>>>  - By 8 for rotational devs
>>>  - By 2 for all non-rotational devs
>>>
>>
>> Sorry for making a late comment on v3.
>>
>> It's good to choose non-rotational if it could.
>>
>> But I'm not sure whether it's a good idea to guess queue depth here
>> because filesystem is still at a high position of IO stack.  It'd
>> probably get good results when running tests, but in practical mixed
>> workloads, the underlying queue depth will be changing all the time.
>
> The first version was intended for SSD-only and SSD + HDD cases.
> In this version, load balancing on HDDs is just an "attempt".
> It can easily be dropped if we decide that it is bad behaviour.
>
> If I understood correctly which counters are used,
> we check the count of I/O ops that the device is currently processing
> (i.e. after merging etc.),
> not the queue of requests not yet sent (i.e. before merging etc.).
>
> I.e. we guess based on low-level block I/O stats.
> As an example, that does not work on zram devs (AFAIK, as zram doesn't
> have those counters).
>
> So it does not matter at which level we check that.
>
>> In fact, I think for rotational disks, more merging and less seeking
>> make more sense, even in raid1/10 case.
>>
>> Thanks,
>>
>> -liubo
>
> A changing queue_depth should not create big problems there,
> i.e. round_down should make all small changes "equal".
>
> For an HDD, if we have a "big" (8..16?) queue depth,
> then with high probability that HDD is overloaded,
> and if the other HDD has much less load
> (maybe instead of round_down it is better to use abs diff > 8)
> we try to requeue requests to the other HDD.
>
> That will not give a truly equal distribution, but in the case where
> one disk has more load and the pid-based mechanism fails to balance,
> we will just move part of the load to the other HDD,
>
> until the load distribution changes.
>
> Maybe for HDDs the threshold needs to be more aggressive, like 16
> (i.e. AFAIK SATA drives have a hw queue length of 31, so just use half of that).
>
> Thanks.
>
>>> Changes:
>>>   v1 -> v2:
>>> - Use helper part_in_flight() from genhd.c
>>>   to get queue length
>>> - Move guess code to guess_optimal()
>>> - Change balancer logic, try to use pid % mirror by default
>>>   Balance on spinning rust if one of the underlying devices
>>>   is overloaded
>>>   v2 -> v3:
>>> - Fix arg for RAID10 - use sub_stripes, instead of num_stripes
>>>
>>> Signed-off-by: Timofey Titovets 
>>> ---
>>>  block/genhd.c  |   1 +
>>>  fs/btrfs/volumes.c | 115 
>>> -
>>>  2 files changed, 114 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/genhd.c b/block/genhd.c
>>> index 96a66f671720..a77426a7 100644
>>> --- a/block/genhd.c
>>> +++ b/block/genhd.c
>>> @@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct *part,
>>>   atomic_read(&part->in_flight[1]);
>>>   }
>>>  }
>>> +EXPORT_SYMBOL_GPL(part_in_flight);
>>>
>>>  struct hd_struct *__disk_get_part(struct gendisk *disk, int partno)
>>>  {
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index 49810b70afd3..a3b80ba31d4d 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -27,6 +27,7 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>>  #include 
>>>  #include "ctree.h"
>>>  #include "extent_map.h"
>>> @@ -5153,6 +5154,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>>>   return ret;
>>>  }
>>>
>>> +/**
>>> + * bdev_get_queue_len - return rounded down in flight queue length of bdev
>>> + *
>>> + * @bdev: target bdev
>>> + * @round_down: round factor big for hdd and small for ssd, like 8 and 2
>>> + */
>>> +static int bdev_get_queue_len(struct block_device *bdev, int round_down)
>>> +{
>>> + int sum;
>>> + struct hd_struct *bd_part = bdev->bd_part;
>>> + struct request_queue *rq = bdev_get_queue(bdev);
>>> + uint32_t inflight[2] = {0, 0};
>>> +
>>> + part_in_flight(rq, bd_part, inflight);
>>> +
>>> + sum = max_t(uint32_t, inflight[0], inflight[1]);
>>> +
>>> + /*
>>> +  * Try to prevent switching on every sneeze
>>> +  * by rounding the output down by some value
>>> +  */
>>> + return ALIGN_DOWN(sum, round_down);
>>> +}
>>> +
>>> +/**
>>> + * 
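
The selection rule described in the commit message above can be sketched in user space as follows. This is only an illustration of the prose, not the patch's actual guess_optimal() (which is cut off in this archive); the `Mirror` class and the helper names are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Mirror:
    rotational: bool   # True for an HDD, False for an SSD
    in_flight: int     # requests the device is currently processing


def align_down(value, step):
    # Same effect as the kernel's ALIGN_DOWN() for positive integers.
    return value - (value % step)


def rounded_queue_len(mirror):
    # Round by 8 for rotational devices and by 2 for non-rotational ones,
    # so that small differences in queue length do not cause a switch.
    return align_down(mirror.in_flight, 8 if mirror.rotational else 2)


def guess_optimal(mirrors, pid):
    # Default guess: pid % number of mirrors.
    optimal = pid % len(mirrors)

    # If the default pick is rotational, prefer a non-rotational mirror.
    if mirrors[optimal].rotational:
        for i, m in enumerate(mirrors):
            if not m.rotational:
                optimal = i
                break

    # Repick a mirror whose rounded queue is strictly shorter.
    best = rounded_queue_len(mirrors[optimal])
    for i, m in enumerate(mirrors):
        qlen = rounded_queue_len(m)
        if qlen < best:
            optimal, best = i, qlen
    return optimal
```

With two HDDs holding 3 and 5 requests in flight, both queue lengths round down to 0 and the pid-based pick is kept; with 17 and 2 requests, the first rounds to 16 and the second to 0, so the less loaded disk is chosen regardless of pid.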

Re: Status of FST and mount times

2018-02-20 Thread Austin S. Hemmelgarn

On 2018-02-20 09:59, Ellis H. Wilson III wrote:
> On 02/16/2018 07:59 PM, Qu Wenruo wrote:
>> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>>> 3454
>>
>> OK, this explains everything.
>>
>> There are too many chunks.
>> This means at mount you need to search for block group item 3454 times.
>>
>> Even each search only needs to iterate 3 tree blocks, multiply it 3454
>> it would still be a big work.
>> Although some tree blocks like the root node and level 1 nodes can be
>> cached, we still need to read about 3500 tree blocks.
>>
>> If the fs is created using 16K nodesize, this means you need to do
>> random read for 54M using 16K blocksize.
>>
>> No wonder it will takes some time.
>>
>> Normally I would expect 1G chunk for each data and metadata chunk.
>>
>> If there is nothing special, it means your filesystem is already larger
>> than 3T.
>> If your used space is way smaller (less than 30%) than 3.5T, then this
>> means your chunk usage is pretty low, and in that case, balance to
>> reduce number of chunks (block groups) would reduce mount time.


> The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
> btrfs fi df.  So, from what I am hearing, this mount time is normal for 
> a filesystem this size.  Ignoring a more complex and proper fix like the 
> ones we've been discussing, would bumping the nodesize reduce the number 
> of chunks, thereby reducing the mount time?

Probably not.  Chunk size depends only on the total size of the 
filesystem (with reasonable base values), so you would still need at 
least as many chunks to store the same amount of data.  Increase the 
node size too much, though, and you will end up with more chunks, 
because you'll have more empty space wasted.


> I don't see why balance would come into play here -- my understanding 
> was that was for aged filesystems.  The only operations I've done on 
> here was:
>
> 1. Format filesystem clean
> 2. Create a subvolume
> 3. rsync our home directories into that new subvolume
> 4. Create another subvolume
> 5. rsync our home directories into that new subvolume
>
> Accordingly, zero (or at least, extremely little) data should have been 
> overwritten, so I would expect things to be fairly well allocated 
> already.  Please correct me if this is naive thinking.

Your logic is in general correct regarding data, but not necessarily 
metadata.  Assuming you did not use the `--inplace` option for rsync, it 
had to issue a rename for each individual file that got copied in, and 
as a result there was likely a lot of metadata being rewritten.


As far as balance being for aged filesystems, that's not exactly true. 
There are four big reasons you might run a balance:

1. As part of reshaping a volume.  You generally want to run a balance 
whenever the number of disks in a volume permanently increases (it will 
happen automatically when it permanently decreases, as the device 
deletion operation is a special type of balance under the hood).  It's 
also used for converting chunk profiles.
2. To free up empty space inside chunks when the filesystem is full at 
the chunk level.
3. To redistribute data across multiple disks in a more even manner 
after deleting a lot of data.
4. To reduce the likelihood of 2 or 3 being an issue.

Reasons 2 and 3 are generally more likely to be needed on old volumes. 
Reason 1 is independent of the age of a volume.  Reason 4 is the reason 
for the regular filtered balances that I and some other people recommend 
be run as part of preventative maintenance, and is also generally 
independent of the age of a volume.


Qu's suggestion is actually independent of all the above reasons, but 
does kind of fit in with the fourth as another case of preventative 
maintenance.



>>> I was using btrfs sub del -C for the deletions, so I believe (if that
>>> command truly waits for the subvolume to be utterly gone) it captures
>>> the entirety of the snapshot.
>>
>> No, snapshot deletion is completely delayed in background.
>>
>> -C only ensures that even a powerloss happen after command return, you
>> won't see the snapshot anywhere, but it will still be deleted in 
>> background.
>
> Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
> btrfs-cleaner to run at specific times, which I presume is the snapshot 
> deletion process you are referring to?  If it can be told to run at a 
> given time, can I throttle how fast it works, such that I avoid some of 
> the high foreground interruption I've seen in the past?

I don't think there's any way to do this right now (though it would be 
nice if there was).  In theory, you could adjust the priority of the 
kernel thread itself, but messing around with kthread priorities is 
seriously dangerous even if you know exactly what you're doing.
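
The back-of-the-envelope numbers quoted from Qu earlier in this message can be reproduced with a few lines. This is only a sketch of that estimate: the function names are made up for illustration, and it assumes one uncached tree block read per block group search and 1 GiB per chunk, as in the quoted text.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB


def mount_read_estimate(chunks, nodesize, uncached_blocks_per_search=1):
    """Return (tree blocks read, bytes read) for one search per chunk."""
    blocks = chunks * uncached_blocks_per_search
    return blocks, blocks * nodesize


def allocated_size(chunks, chunk_size=GIB):
    """Space covered by the chunks, assuming a uniform chunk size."""
    return chunks * chunk_size


blocks, read_bytes = mount_read_estimate(3454, 16 * KIB)
print(f"{blocks} tree blocks, about {read_bytes / MIB:.0f} MiB of random reads")
print(f"about {allocated_size(3454) / TIB:.2f} TiB allocated at 1 GiB per chunk")
```

For 3454 chunks and a 16K nodesize this gives roughly the 54M of random 16K reads mentioned above, and about 3.37 TiB of allocated space, which is consistent with the 3.32TiB of data reported by btrfs fi df.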



Re: Status of FST and mount times

2018-02-20 Thread Ellis H. Wilson III

On 02/16/2018 07:59 PM, Qu Wenruo wrote:
> On 2018-02-16 22:12, Ellis H. Wilson III wrote:
>> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
>> 3454
>
> OK, this explains everything.
>
> There are too many chunks.
> This means at mount you need to search for block group item 3454 times.
>
> Even each search only needs to iterate 3 tree blocks, multiply it 3454
> it would still be a big work.
> Although some tree blocks like the root node and level 1 nodes can be
> cached, we still need to read about 3500 tree blocks.
>
> If the fs is created using 16K nodesize, this means you need to do
> random read for 54M using 16K blocksize.
>
> No wonder it will takes some time.
>
> Normally I would expect 1G chunk for each data and metadata chunk.
>
> If there is nothing special, it means your filesystem is already larger
> than 3T.
> If your used space is way smaller (less than 30%) than 3.5T, then this
> means your chunk usage is pretty low, and in that case, balance to
> reduce number of chunks (block groups) would reduce mount time.


The nodesize is 16K, and the filesystem data is 3.32TiB as reported by 
btrfs fi df.  So, from what I am hearing, this mount time is normal for 
a filesystem this size.  Ignoring a more complex and proper fix like the 
ones we've been discussing, would bumping the nodesize reduce the number 
of chunks, thereby reducing the mount time?


I don't see why balance would come into play here -- my understanding 
was that was for aged filesystems.  The only operations I've done on 
here was:

1. Format filesystem clean
2. Create a subvolume
3. rsync our home directories into that new subvolume
4. Create another subvolume
5. rsync our home directories into that new subvolume

Accordingly, zero (or at least, extremely little) data should have been 
overwritten, so I would expect things to be fairly well allocated 
already.  Please correct me if this is naive thinking.



>> I was using btrfs sub del -C for the deletions, so I believe (if that
>> command truly waits for the subvolume to be utterly gone) it captures
>> the entirety of the snapshot.
>
> No, snapshot deletion is completely delayed in background.
>
> -C only ensures that even a powerloss happen after command return, you
> won't see the snapshot anywhere, but it will still be deleted in background.


Ah, I had no idea.  Thank you!  Is there any way to "encourage" 
btrfs-cleaner to run at specific times, which I presume is the snapshot 
deletion process you are referring to?  If it can be told to run at a 
given time, can I throttle how fast it works, such that I avoid some of 
the high foreground interruption I've seen in the past?


Thanks,

ellis


[PATCH 1/1] btrfs: remove assert in btrfs_init_dev_replace_tgtdev()

2018-02-20 Thread Anand Jain
In the same function we just ran btrfs_alloc_device() which means the
btrfs_device::resized_list is sure to be empty and we are protected
with the btrfs_fs_info::volume_mutex.

Signed-off-by: Anand Jain 
---
 fs/btrfs/volumes.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d4a5f9126e7b..ac29ec0de984 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2655,7 +2655,6 @@ int btrfs_init_dev_replace_tgtdev(struct btrfs_fs_info *fs_info,
device->total_bytes = btrfs_device_get_total_bytes(srcdev);
device->disk_total_bytes = btrfs_device_get_disk_total_bytes(srcdev);
device->bytes_used = btrfs_device_get_bytes_used(srcdev);
-   ASSERT(list_empty(&device->resized_list));
device->commit_total_bytes = srcdev->commit_total_bytes;
device->commit_bytes_used = device->bytes_used;
device->fs_info = fs_info;
-- 
2.15.0



[PATCH] btrfs: fix NPD during canceling replace when the target is missing

2018-02-20 Thread Anand Jain
Replace target can be missing after a reboot during the replace.
So check if device is null.

BUG: unable to handle kernel NULL pointer dereference at 00b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x22b/0x250 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
? do_vfs_ioctl+0x625/0x650
do_vfs_ioctl+0x625/0x650
? security_file_ioctl+0x30/0x50
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 87f975143c05..476981c2cf55 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -749,7 +749,8 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
   btrfs_dev_name(src_device), src_device->devid,
   btrfs_dev_name(tgt_device));
 
-   btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+   if (tgt_device)
+       btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
 
 leave:
    mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
-- 
2.15.0



[PATCH 1/1] btrfs: fix NPD when target device is missing

2018-02-20 Thread Anand Jain
The replace target device can be missing, in which case we don't
allocate a missing btrfs_device when mounted with -o degraded.
So check the device before accessing it.

BUG: unable to handle kernel NULL pointer dereference at 00b0
IP: btrfs_destroy_dev_replace_tgtdev+0x43/0xf0 [btrfs]
Call Trace:
btrfs_dev_replace_cancel+0x15f/0x180 [btrfs]
btrfs_ioctl+0x2216/0x2590 [btrfs]
do_vfs_ioctl+0x625/0x650
SyS_ioctl+0x4e/0x80
do_syscall_64+0x5d/0x160
entry_SYSCALL64_slow_path+0x25/0x25

Signed-off-by: Anand Jain 
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index dbaa6880a15e..87f975143c05 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -312,7 +312,7 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info)
 
 static char* btrfs_dev_name(struct btrfs_device *device)
 {
-   if (test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
+   if (!device || test_bit(BTRFS_DEV_STATE_MISSING, &device->dev_state))
return "";
else
return rcu_str_deref(device->name);
-- 
2.15.0



Re: [PATCH RESEND] btrfs: delete function btrfs_close_extra_devices()

2018-02-20 Thread Anand Jain



On 02/20/2018 12:53 AM, David Sterba wrote:

> On Thu, Feb 15, 2018 at 01:02:24PM +0800, Anand Jain wrote:
>> btrfs_close_extra_devices() is not exactly about just closing the opened
>> devices, but it's also about freeing the stale devices which may have
>> been scanned into the btrfs_fs_devices::dev_list. The way it picks devices to
>> be freed is by going through the btrfs_fs_devices::dev_list and its seed
>> devices, and looking for devices which neither have the flag
>> BTRFS_DEV_STATE_IN_FS_METADATA set nor are part of the replace target.
>>
>> However, the way devices are scanned and added to the
>> btrfs_fs_devices::dev_list changed a long time ago. During scan,
>> when it finds a matching fsid+uuid+devid it adds the device to
>> btrfs_fs_devices::dev_list. A matched device with a higher generation number
>> overwrites the device with a lower generation number during the scan.
>>
>> Further, the stale devices containing the stale fsid are removed at the
>> time of the scan itself.
>>
>> So there isn't any opportunity for btrfs_close_extra_devices() to free
>> a stale device within the fsid which is being mounted.
>>
>> Further, the btrfs_fs_devices::latest_bdev that
>> btrfs_close_extra_devices() assigns is already assigned by
>> the function __btrfs_open_devices().
>>
>> So as this function has no effect, delete it.
>
> I think this is correct. Freeing stale devices as a side effect of mount
> does not seem right anyway. I'm still not able to convince myself that
> there's not an unexpected interaction of dev scan and dev replace, as
> it relies on the state bits and other locks. If you have ideas where to
> put some asserts or extra checks, please suggest.



 Thanks for the comments. I took a fresh look after a long weekend
 and found something new. The following test case [1] simulates a stale
 SB on a missing device which got deleted. If we then do a fresh
 dev scan and mount, the deleted device should not be kept in the
 device_list after the mount, because it is stale. As the IN_FS_METADATA
 flag is not set on this device, this function helps to clean up such
 devices. The dev scan free_stale can't catch this condition because
 we won't go deeper than the SB read. Sorry, my bad, please ignore this
 patch; this is a very corner case, or I had gone nuts when I investigated
 this.

[1]
  mkfs.btrfs -fq -mraid1 -draid1 /dev/sde /dev/sdf /dev/sdg
  modprobe -r btrfs
  mount -o degraded,device=/dev/sde /dev/sdf /btrfs
  btrfs dev del 3 /btrfs
  umount /btrfs
  btrfs dev scan /dev/sde /dev/sdf /dev/sdg
  mount /dev/sde /btrfs
  wipefs -a /dev/sdg <-- this should work as sdg is not part of
 fsid mounted anymore.

 I am thinking of renaming this function to btrfs_free_extra_devices()
 and adding comments.

 Also, about fs_devices::latest_bdev: we certainly assign it in
 __btrfs_open_devices(), but there we don't care if it picks the
 replace target. If the replace target has the highest (or an equal)
 generation number compared to the others, we will read the sys
 and chunk trees from it. So during the mount context we use and
 reconstruct using the replace target, but at the end of the mount
 context we use the non-replace target as the latest_bdev.

Thanks, Anand




Re: [PATCH] btrfs: Fix rcu_dereference usage outside of read critical section

2018-02-20 Thread Anand Jain



On 02/20/2018 07:40 PM, Nikolay Borisov wrote:

> Patch 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
> added a new btrfs_info call with a couple of btrfs_dev_name() args. This
> is wrong since the latter must be called within an RCU read-side
> critical section. Fix it by calling btrfs_info_in_rcu instead. This
> fixes the following splat:
>
> =============================
> WARNING: suspicious RCU usage
> 4.16.0-rc2-nbor #463 Not tainted
> -----------------------------
> fs/btrfs/dev-replace.c:318 suspicious rcu_dereference_check() usage!
>
> other info that might help us debug this:
>
> rcu_scheduler_active = 2, debug_locks = 1
> 1 lock held by btrfs/5698:
>   #0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}, at: [<942cb4ee>] btrfs_dev_replace_cancel+0xac/0x3f0
>
> stack backtrace:
> CPU: 2 PID: 5698 Comm: btrfs Not tainted 4.16.0-rc2-nbor #463
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
> Call Trace:
>   dump_stack+0x85/0xc9
>   lockdep_rcu_suspicious+0x123/0x170
>   btrfs_dev_name.part.1+0x6d/0x80
>   btrfs_dev_replace_cancel+0x330/0x3f0
>   btrfs_ioctl+0x2751/0x65b0
>   ? debug_check_no_locks_freed+0x290/0x290
>   ? trace_hardirqs_on_caller+0x400/0x570
>   ? trace_hardirqs_on+0xd/0x10
>   ? btrfs_ioctl_get_supported_features+0x30/0x30
>   ? __handle_mm_fault+0x1aca/0x3230
>   ? lock_downgrade+0x650/0x650
>   ? trace_hardirqs_on+0xd/0x10
>   ? mem_cgroup_commit_charge+0xc0/0xdd0
>   ? _raw_spin_unlock+0x27/0x40
>   ? __handle_mm_fault+0x1aca/0x3230
>   ? lock_downgrade+0x650/0x650
>   ? vm_insert_page+0x650/0x650
>   ? __vma_link_rb+0x125/0x1d0
>   do_vfs_ioctl+0x184/0xf00
>   ? do_vfs_ioctl+0x184/0xf00
>   ? lock_downgrade+0x650/0x650
>   ? ioctl_preallocate+0x1a0/0x1a0
>   ? up_read+0x1f/0x40
>   ? __do_page_fault+0x5c6/0xb30
>   ? SyS_brk+0x412/0x5f0
>   ? mm_fault_error+0x2e0/0x2e0
>   SyS_ioctl+0x41/0x70
>   ? do_vfs_ioctl+0xf00/0xf00
>   do_syscall_64+0x19d/0x5d0
>   entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Fixes: 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
> Signed-off-by: Nikolay Borisov 


 I noticed this too. Thanks, Nikolay, for the fix.

 Reviewed-by: Anand Jain 



> ---
>   fs/btrfs/dev-replace.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
> index 3b0760f7ec8a..0e776eb90ad8 100644
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -744,7 +744,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
>     ret = btrfs_commit_transaction(trans);
>     WARN_ON(ret);
>   
> -   btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
> +   btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s cancelled",
>        btrfs_dev_name(src_device), src_device->devid,
>        btrfs_dev_name(tgt_device));
  




Re: [PATCH] btrfs/150: add _scratch_dev_pool_get/put to run the test as expected

2018-02-20 Thread Qu Wenruo


On 2018-02-20 13:34, Misono, Tomohiro wrote:
> btrfs/150 uses the RAID1 profile and makes SCRATCH_DEV fail for the test.
> However, if SCRATCH_DEV_POOL consists of more than two devices, SCRATCH_DEV
> may not be used for the RAID1 pair and the test may not run as expected.
> 
> Fix this by adding _scratch_dev_pool_get/put like other tests (141, 143
> etc.) do.
> 
> Signed-off-by: Tomohiro Misono 

Indeed, if we have more devices, it's highly possible that the first
device doesn't have data stripe of the raid1 chunk on it.

(And in most cases it won't have a data stripe, since during mkfs we use
the first device to contain temporary chunks, so unallocated space of
devid 1 is smaller compared to other devices, and chunk allocator will
use device with more unallocated space)
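
The effect described above can be illustrated with a toy model. This is a deliberately simplified sketch, assuming the allocator simply prefers the devices with the most unallocated space; the real btrfs chunk allocator is more involved, and `pick_raid1_devices` is a made-up name.

```python
def pick_raid1_devices(unallocated):
    """Pick two devids for a RAID1 chunk, preferring the most free space.

    unallocated maps devid -> unallocated bytes (any consistent unit).
    Ties are broken by the lower devid.
    """
    ranked = sorted(unallocated, key=lambda devid: (-unallocated[devid], devid))
    return sorted(ranked[:2])


# devid 1 lost some space to the temporary chunks created by mkfs.
pool = {1: 9, 2: 10, 3: 10}
print(pick_raid1_devices(pool))
```

In this model, a three-device pool where devid 1 has slightly less unallocated space places the RAID1 data chunk on devids 2 and 3, so failing I/O on the first device would not exercise the RAID1 pair; with only two devices, both are always used.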

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> ---
>  tests/btrfs/150 | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/tests/btrfs/150 b/tests/btrfs/150
> index 97041b6c..1e4586be 100755
> --- a/tests/btrfs/150
> +++ b/tests/btrfs/150
> @@ -55,6 +55,7 @@ _supported_os Linux
>  _require_scratch
>  _require_fail_make_request
>  _require_scratch_dev_pool 2
> +_scratch_dev_pool_get 2
>  
>  SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
>  enable_io_failure()
> @@ -100,6 +101,7 @@ while [[ -z $result ]]; do
>   disable_io_failure
>  done
>  
> +_scratch_dev_pool_put
>  # success, all done
>  status=0
>  exit
> 





[PATCH] btrfs: Fix rcu_dereference usage outside of read critical section

2018-02-20 Thread Nikolay Borisov
Patch 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
added a new btrfs_info call with a couple of btrfs_dev_name() args. This
is wrong since the latter must be called within an RCU read-side
critical section. Fix it by calling btrfs_info_in_rcu instead. This
fixes the following splat:

=============================
WARNING: suspicious RCU usage
4.16.0-rc2-nbor #463 Not tainted
-----------------------------
fs/btrfs/dev-replace.c:318 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by btrfs/5698:
 #0:  (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}, at: [<942cb4ee>] btrfs_dev_replace_cancel+0xac/0x3f0

stack backtrace:
CPU: 2 PID: 5698 Comm: btrfs Not tainted 4.16.0-rc2-nbor #463
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0x85/0xc9
 lockdep_rcu_suspicious+0x123/0x170
 btrfs_dev_name.part.1+0x6d/0x80
 btrfs_dev_replace_cancel+0x330/0x3f0
 btrfs_ioctl+0x2751/0x65b0
 ? debug_check_no_locks_freed+0x290/0x290
 ? trace_hardirqs_on_caller+0x400/0x570
 ? trace_hardirqs_on+0xd/0x10
 ? btrfs_ioctl_get_supported_features+0x30/0x30
 ? __handle_mm_fault+0x1aca/0x3230
 ? lock_downgrade+0x650/0x650
 ? trace_hardirqs_on+0xd/0x10
 ? mem_cgroup_commit_charge+0xc0/0xdd0
 ? _raw_spin_unlock+0x27/0x40
 ? __handle_mm_fault+0x1aca/0x3230
 ? lock_downgrade+0x650/0x650
 ? vm_insert_page+0x650/0x650
 ? __vma_link_rb+0x125/0x1d0
 do_vfs_ioctl+0x184/0xf00
 ? do_vfs_ioctl+0x184/0xf00
 ? lock_downgrade+0x650/0x650
 ? ioctl_preallocate+0x1a0/0x1a0
 ? up_read+0x1f/0x40
 ? __do_page_fault+0x5c6/0xb30
 ? SyS_brk+0x412/0x5f0
 ? mm_fault_error+0x2e0/0x2e0
 SyS_ioctl+0x41/0x70
 ? do_vfs_ioctl+0xf00/0xf00
 do_syscall_64+0x19d/0x5d0
 entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 11ac3f1da5fd ("btrfs: log, when replace, is canceled by the user")
Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/dev-replace.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 3b0760f7ec8a..0e776eb90ad8 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -744,7 +744,7 @@ int btrfs_dev_replace_cancel(struct btrfs_fs_info *fs_info)
ret = btrfs_commit_transaction(trans);
WARN_ON(ret);
 
-   btrfs_info(fs_info, "dev_replace from %s (devid %llu) to %s canceled",
+   btrfs_info_in_rcu(fs_info, "dev_replace from %s (devid %llu) to %s cancelled",
   btrfs_dev_name(src_device), src_device->devid,
   btrfs_dev_name(tgt_device));
 
-- 
2.7.4
