Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 4.12.18 at 22:14, Wilson, Ellis wrote:
> On 12/4/18 8:07 AM, Nikolay Borisov wrote:
>> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>>> With 14TB drives available today, it doesn't take more than a handful of
>>> drives to result in a filesystem that takes around a minute to mount.
>>> As a result of this, I suspect this will become an increasing problem
>>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>>> not a contributor so I have no room to do so -- just shedding some light
>>> on a problem that may deserve attention as filesystem sizes continue to
>>> grow.
>> Would it be possible to provide perf traces of the longer-running mount
>> time? Everyone seems to be fixated on reading block groups (which is
>> likely to be the culprit) but before pointing fingers I'd like concrete
>> evidence pointing at the offender.
> 
> I am glad to collect such traces -- please advise with commands that 
> would achieve that.  If you just mean block traces, I can do that, but I 
> suspect you mean something more BTRFS-specific.

A command that would be good is:

perf record --all-kernel -g mount /dev/vdc /media/scratch/

Of course, replace the device/mount path appropriately. This will result in a
perf.data file which contains stacktraces of the hottest paths executed
during the invocation of mount. It would be appreciated if you could send
this file to the mailing list, or upload it somewhere for interested
people (me and perhaps Qu) to inspect.

If the file turns out to be way too big, you can use

perf report --stdio

to create a text output and send that instead.
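
For completeness, a minimal sketch of the whole capture-and-share workflow
(the output file names are only examples; device and mount point as above):

  perf record --all-kernel -g mount /dev/vdc /media/scratch/
  perf report --stdio -i perf.data > mount-perf.txt
  gzip mount-perf.txt   # keeps the attachment small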

> 
> Best,
> 
> ellis
> 


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Wilson, Ellis
On 12/4/18 8:07 AM, Nikolay Borisov wrote:
> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>> With 14TB drives available today, it doesn't take more than a handful of
>> drives to result in a filesystem that takes around a minute to mount.
>> As a result of this, I suspect this will become an increasing problem
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm
>> not a contributor so I have no room to do so -- just shedding some light
>> on a problem that may deserve attention as filesystem sizes continue to
>> grow.
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

I am glad to collect such traces -- please advise with commands that 
would achieve that.  If you just mean block traces, I can do that, but I 
suspect you mean something more BTRFS-specific.

Best,

ellis



[Mount time bug bounty?] was: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
On 03/12/2018 at 23:22, Hans van Kranenburg wrote:
> [...]
> Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982
>
> What the code is doing here is starting at the beginning of the extent
> tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
> is not that far away), and then based on the information in it, computes
> where the next one will be (just after the end of the vaddr+length of
> it), and then jumps over all normal extent items and searches again near
> where the next block group item has to be. So, yes, that means that they
> depend on each other.
>
> Two possible ways to improve this:
>
> 1. Instead, walk the chunk tree (which has all related items packed
> together) instead to find out at which locations in the extent tree the
> block group items are located and then start getting items in parallel.
> If you have storage with a lot of rotating rust that can deliver much
> more random reads if you ask for more of them at the same time, then
> this can already cause a massive speedup.
>
> 2. Move the block group items somewhere else, where they can nicely be
> grouped together, so that the amount of metadata pages that has to be
> looked up is minimal. Quoting from the link below, "slightly tricky
> [...] but there are no fundamental obstacles".
>
> https://www.spinics.net/lists/linux-btrfs/msg71766.html
>
> I think the main obstacle here is finding a developer with enough
> experience and time to do it. :)

I would definitely be interested in sponsoring at least a part of the
needed time through my company (we are too small to hire kernel
developers full-time but we can make a one-time contribution for
something as valuable to us as shorter mount times).

If needed, it could be split into two steps with separate bounties:
- providing a patch for the latest LTS kernel with a substantial
decrease in mount time in our case (ideally less than a minute instead
of 15 minutes but <5 minutes is already worth it).
- having it integrated in mainline.

I don't have any experience with company sponsorship/bounties but I'm
willing to learn (don't hesitate to make suggestions). I'll have to
discuss it with our accountant to make sure we do it correctly.

Is this the right place to discuss this kind of subject, or should I take
the discussion elsewhere?

Best regards,

Lionel


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Lionel Bouton
On 04/12/2018 at 03:52, Chris Murphy wrote:
> On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
>  wrote:
>> On 03/12/2018 at 20:56, Lionel Bouton wrote:
>>> [...]
>>> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
>>> tuning of the io queue (switching between classic io-schedulers and
>>> blk-mq ones in the virtual machines) and BTRFS mount options
>>> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
>>> in mount time (I managed to reduce the mount of IO requests
>> Sent too quickly: I meant to write "managed to reduce by half the number
>> of IO write requests for the same amount of data written"
>>
>>>  by half on
>>> one server in production though although more tests are needed to
>>> isolate the cause).
> Interesting. I wonder if it's ssd_spread or space_cache=v2 that
> reduces the writes by half, or by how much for each? That's a major
> reduction in writes, and suggests further optimization might be possible
> to help mitigate the wandering trees impact.

Note, the other major changes were:
- upgrading from kernel 4.9 to 4.14,
- using the multi-queue aware bfq scheduler instead of noop.

If BTRFS IO patterns in our case allow bfq to merge io-requests, this
could be another explanation.

Lionel



Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Qu Wenruo



On 2018/12/4 9:07 PM, Nikolay Borisov wrote:
> 
> 
> On 3.12.18 at 20:20, Wilson, Ellis wrote:
>> Hi all,
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
>>
>> With 14TB drives available today, it doesn't take more than a handful of 
>> drives to result in a filesystem that takes around a minute to mount. 
>> As a result of this, I suspect this will become an increasing problem 
>> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
>> not a contributor so I have no room to do so -- just shedding some light 
>> on a problem that may deserve attention as filesystem sizes continue to 
>> grow.
> 
> Would it be possible to provide perf traces of the longer-running mount
> time? Everyone seems to be fixated on reading block groups (which is
> likely to be the culprit) but before pointing fingers I'd like concrete
> evidence pointing at the offender.

IIRC I submitted such an analysis years ago.

Nowadays it may have changed due to the chunk <-> bg <-> dev_extents cross-checking.
So yes, it would be a good idea to show such percentages.
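
As a rough way to gauge the scale of that cross-check (a sketch only; it
assumes a recent btrfs-progs and the /dev/md0 device from the report, and
grepping item names in dump-tree output is only approximate):

  # number of chunk items and dev extent items involved in the
  # chunk <-> block group <-> dev_extent verification at mount
  btrfs inspect-internal dump-tree -t chunk /dev/md0 | grep -c 'CHUNK_ITEM'
  btrfs inspect-internal dump-tree -t dev /dev/md0 | grep -c 'DEV_EXTENT'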

Thanks,
Qu

> 
>>
>> Best,
>>
>> ellis
>>


Re: BTRFS Mount Delay Time Graph

2018-12-04 Thread Nikolay Borisov



On 3.12.18 at 20:20, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

Would it be possible to provide perf traces of the longer-running mount
time? Everyone seems to be fixated on reading block groups (which is
likely to be the culprit) but before pointing fingers I'd like concrete
evidence pointing at the offender.

> 
> Best,
> 
> ellis
> 


Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Chris Murphy
On Mon, Dec 3, 2018 at 1:04 PM Lionel Bouton
 wrote:
>
> On 03/12/2018 at 20:56, Lionel Bouton wrote:
> > [...]
> > Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> > tuning of the io queue (switching between classic io-schedulers and
> > blk-mq ones in the virtual machines) and BTRFS mount options
> > (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> > in mount time (I managed to reduce the mount of IO requests
>
> Sent too quickly: I meant to write "managed to reduce by half the number
> of IO write requests for the same amount of data written"
>
> >  by half on
> > one server in production though although more tests are needed to
> > isolate the cause).

Interesting. I wonder if it's ssd_spread or space_cache=v2 that
reduces the writes by half, or by how much for each? That's a major
reduction in writes, and suggests further optimization might be possible
to help mitigate the wandering trees impact.


-- 
Chris Murphy


Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Qu Wenruo


On 2018/12/4 2:20 AM, Wilson, Ellis wrote:
> Hi all,
> 
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
> 
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
> 
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.
> 
> With 14TB drives available today, it doesn't take more than a handful of 
> drives to result in a filesystem that takes around a minute to mount. 
> As a result of this, I suspect this will become an increasing problem 
> for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
> not a contributor so I have no room to do so -- just shedding some light 
> on a problem that may deserve attention as filesystem sizes continue to 
> grow.

This problem is somewhat known.

If you dig further, it's btrfs_read_block_groups() which tries to
read *ALL* block group items.
And to no one's surprise, the larger the fs grows, the more block group
items need to be read from disk.

We need to delay such reads to improve this case.
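
As a rough illustration (a sketch only, assuming btrfs-progs is installed
and /dev/md0 from the report is the filesystem in question), the number of
block group items the mount has to fetch can be counted from userspace:

  # data block groups are typically ~1GiB each, so a 60TB filesystem
  # has tens of thousands of these items to read at every mount
  btrfs inspect-internal dump-tree -t extent /dev/md0 | grep -c 'BLOCK_GROUP_ITEM'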

Thanks,
Qu

> 
> Best,
> 
> ellis
> 





Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Hans van Kranenburg
Hi,

On 12/3/18 8:56 PM, Lionel Bouton wrote:
> 
> On 03/12/2018 at 19:20, Wilson, Ellis wrote:
>>
>> Many months ago I promised to graph how long it took to mount a BTRFS 
>> filesystem as it grows.  I finally had (made) time for this, and the 
>> attached is the result of my testing.  The image is a fairly 
>> self-explanatory graph, and the raw data is also attached in 
>> comma-delimited format for the more curious.  The columns are: 
>> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>>
>> Experimental setup:
>> - System:
>> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
>> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
>> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
>> - 3 unmount/mount cycles performed in between adding another 250GB of data
>> - 250GB of data added each time in the form of 25x10GB files in their 
>> own directory.  Files generated in parallel each epoch (25 at the same 
>> time, with a 1MB record size).
>> - 240 repetitions of this performed (to collect timings in increments of 
>> 250GB between a 0GB and 60TB filesystem)
>> - Normal "time" command used to measure time to mount.  "Real" time used 
>> of the timings reported from time.
>> - Mount:
>> /dev/md0 on /btrfs type btrfs 
>> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>>
>> At 60TB, we take 30s to mount the filesystem, which is actually not as 
>> bad as I originally thought it would be (perhaps as a result of using 
>> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
>> to comment if folks more intimately familiar with BTRFS think this is 
>> due to the very large files I've used.

Probably yes. The thing that is happening is that all block group items
are read from the extent tree. And, instead of being nicely grouped
together, they are scattered all over the place, at their virtual
address, in between all normal extent items.

So, mount time depends on the cold random read IOPS your storage can do,
the size of the extent tree, and the number of block groups. And your extent
tree has more items in it if you have more extents. So, yes, writing a
lot of 4kiB files should, I think, have a similar effect to a lot of
128MiB files that are each still stored in 1 extent per file.
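
A rough way to see that scattering from userspace (a sketch only, assuming
btrfs-progs; /dev/md0 stands in for whatever device holds the filesystem)
is to dump the extent tree and look at where the BLOCK_GROUP_ITEM keys
land among the ordinary extent items:

  # in each key, objectid = start (virtual address) of the block group and
  # offset = its length; the space between successive items is filled with
  # normal EXTENT_ITEMs, so each one tends to cost another cold random read
  btrfs inspect-internal dump-tree -t extent /dev/md0 | grep 'BLOCK_GROUP_ITEM' | head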

>  I can redo the test with much 
>> more realistic data if people have legitimate reason to think it will 
>> drastically change the result.
> 
> We are hosting some large BTRFS filesystems on Ceph (RBD used by
> QEMU/KVM). I believe the delay is heavily linked to the number of files
> (I didn't check if snapshots matter, and I suspect they do, but not as
> much as the number of "original" files, at least if you don't heavily
> modify existing files but mostly create new ones, as we do).
> As an example, we have a filesystem with 20TB of used space, with 4
> subvolumes hosting multiple millions of files/directories (probably 10-20
> million total; I didn't check the exact number recently, as simply
> counting files is a very long process) and 40 snapshots for each volume.
> Mount takes about 15 minutes.
> We have virtual machines that we don't reboot as often as we would like
> because of these slow mount times.
> 
> If you want to study this, you could:
> - graph the delay for various individual file sizes (instead of 25x10GB,
> create 2,500 x 100MB and 250,000 x 1MB files between each run and
> compare to the original result)
> - graph the delay vs the number of snapshots (probably starting with a
> large number of files in the initial subvolume to start with a
> non-trivial mount delay)
> You may want to study the impact of the differences between snapshots by
> comparing snapshotting without modifications and snapshots made at
> various stages of your subvolume growth.
> 
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests by half on
> one server in production though although more tests are needed to
> isolate the cause).
> I didn't expect much for the mount times, it seems to me that mount is
> mostly constrained by the BTRFS on disk structures needed at mount time
> and how the filesystem reads them (for example it doesn't benefit at all
> from large IO queue depths which probably means that each read depends
> on previous ones which prevents io-schedulers from optimizing anything).

Yes, I think that's true. See btrfs_read_block_groups in extent-tree.c:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/btrfs/extent-tree.c#n9982

What the code is doing here is starting at the beginning of the extent
tree, searching forward until it sees the first BLOCK_GROUP_ITEM (which
is not that far away), and then based on the information in it, computes
where the next one will be (just after the end of the vaddr+length of
it), and then jumps over all normal extent items and searches again near
where the next block group item has to be. So, yes, that means that they
depend on each other.

Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
On 03/12/2018 at 20:56, Lionel Bouton wrote:
> [...]
> Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
> tuning of the io queue (switching between classic io-schedulers and
> blk-mq ones in the virtual machines) and BTRFS mount options
> (space_cache=v2,ssd_spread) but there wasn't any measurable improvement
> in mount time (I managed to reduce the mount of IO requests

Sent too quickly: I meant to write "managed to reduce by half the number
of IO write requests for the same amount of data written"

>  by half on
> one server in production though although more tests are needed to
> isolate the cause).




Re: BTRFS Mount Delay Time Graph

2018-12-03 Thread Lionel Bouton
Hi,

On 03/12/2018 at 19:20, Wilson, Ellis wrote:
> Hi all,
>
> Many months ago I promised to graph how long it took to mount a BTRFS 
> filesystem as it grows.  I finally had (made) time for this, and the 
> attached is the result of my testing.  The image is a fairly 
> self-explanatory graph, and the raw data is also attached in 
> comma-delimited format for the more curious.  The columns are: 
> Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).
>
> Experimental setup:
> - System:
> Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
> 06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
> - 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
> - 3 unmount/mount cycles performed in between adding another 250GB of data
> - 250GB of data added each time in the form of 25x10GB files in their 
> own directory.  Files generated in parallel each epoch (25 at the same 
> time, with a 1MB record size).
> - 240 repetitions of this performed (to collect timings in increments of 
> 250GB between a 0GB and 60TB filesystem)
> - Normal "time" command used to measure time to mount.  "Real" time used 
> of the timings reported from time.
> - Mount:
> /dev/md0 on /btrfs type btrfs 
> (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
>
> At 60TB, we take 30s to mount the filesystem, which is actually not as 
> bad as I originally thought it would be (perhaps as a result of using 
> RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
> to comment if folks more intimately familiar with BTRFS think this is 
> due to the very large files I've used.  I can redo the test with much 
> more realistic data if people have legitimate reason to think it will 
> drastically change the result.

We are hosting some large BTRFS filesystems on Ceph (RBD used by
QEMU/KVM). I believe the delay is heavily linked to the number of files
(I didn't check if snapshots matter, and I suspect they do, but not as
much as the number of "original" files, at least if you don't heavily
modify existing files but mostly create new ones, as we do).
As an example, we have a filesystem with 20TB of used space, with 4
subvolumes hosting multiple millions of files/directories (probably 10-20
million total; I didn't check the exact number recently, as simply
counting files is a very long process) and 40 snapshots for each volume.
Mount takes about 15 minutes.
We have virtual machines that we don't reboot as often as we would like
because of these slow mount times.

If you want to study this, you could:
- graph the delay for various individual file sizes (instead of 25x10GB,
create 2,500 x 100MB and 250,000 x 1MB files between each run and
compare to the original result; a sketch of the small-file case follows
after this list)
- graph the delay vs the number of snapshots (probably starting with a
large number of files in the initial subvolume to start with a
non-trivial mount delay)
You may want to study the impact of the differences between snapshots by
comparing snapshotting without modifications and snapshots made at
various stages of your subvolume growth.
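
A minimal sketch of the small-file variant of the first suggestion above
(the file count, sizes and the /btrfs path are only assumptions to adapt):

  # one "epoch" of 250,000 x 1MB files instead of 25 x 10GB; the point is
  # to multiply the number of extents, not the amount of data written
  dir=/btrfs/epoch-$(date +%s)
  mkdir -p "$dir"
  for i in $(seq 1 250000); do
      dd if=/dev/urandom of="$dir/f$i" bs=1M count=1 status=none
  done
  sync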

Note : recently I tried upgrading from 4.9 to 4.14 kernels, various
tuning of the io queue (switching between classic io-schedulers and
blk-mq ones in the virtual machines) and BTRFS mount options
(space_cache=v2,ssd_spread) but there wasn't any measurable improvement
in mount time (I managed to reduce the mount of IO requests by half on
one server in production though although more tests are needed to
isolate the cause).
I didn't expect much for the mount times, it seems to me that mount is
mostly constrained by the BTRFS on disk structures needed at mount time
and how the filesystem reads them (for example it doesn't benefit at all
from large IO queue depths which probably means that each read depends
on previous ones which prevents io-schedulers from optimizing anything).

Best regards,

Lionel


BTRFS Mount Delay Time Graph

2018-12-03 Thread Wilson, Ellis
Hi all,

Many months ago I promised to graph how long it took to mount a BTRFS 
filesystem as it grows.  I finally had (made) time for this, and the 
attached is the result of my testing.  The image is a fairly 
self-explanatory graph, and the raw data is also attached in 
comma-delimited format for the more curious.  The columns are: 
Filesystem Size (GB), Mount Time 1 (s), Mount Time 2 (s), Mount Time 3 (s).

Experimental setup:
- System:
Linux pgh-sa-1-2 4.20.0-rc4-1.g1ac69b7-default #1 SMP PREEMPT Mon Nov 26 
06:22:42 UTC 2018 (1ac69b7) x86_64 x86_64 x86_64 GNU/Linux
- 6-drive RAID0 (mdraid, 8MB chunks) array of 12TB enterprise drives.
- 3 unmount/mount cycles performed in between adding another 250GB of data
- 250GB of data added each time in the form of 25x10GB files in their 
own directory.  Files generated in parallel each epoch (25 at the same 
time, with a 1MB record size).
- 240 repetitions of this performed (to collect timings in increments of 
250GB between a 0GB and 60TB filesystem)
- Normal "time" command used to measure time to mount.  "Real" time used 
of the timings reported from time.
- Mount:
/dev/md0 on /btrfs type btrfs 
(rw,relatime,space_cache=v2,subvolid=5,subvol=/)
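
For reference, a minimal sketch of how each timing could be taken (the
drop_caches step is an assumption for getting cold-cache numbers, not
something stated in the setup above):

  for run in 1 2 3; do
      umount /btrfs
      echo 3 > /proc/sys/vm/drop_caches   # assumption: force cold metadata reads
      time mount /dev/md0 /btrfs          # the "real" line is what gets graphed
  done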

At 60TB, we take 30s to mount the filesystem, which is actually not as 
bad as I originally thought it would be (perhaps as a result of using 
RAID0 via mdraid rather than native RAID0 in BTRFS).  However, I am open 
to comment if folks more intimately familiar with BTRFS think this is 
due to the very large files I've used.  I can redo the test with much 
more realistic data if people have legitimate reason to think it will 
drastically change the result.

With 14TB drives available today, it doesn't take more than a handful of 
drives to result in a filesystem that takes around a minute to mount. 
As a result of this, I suspect this will become an increasing problem 
for serious users of BTRFS as time goes on.  I'm not complaining as I'm 
not a contributor so I have no room to do so -- just shedding some light 
on a problem that may deserve attention as filesystem sizes continue to 
grow.

Best,

ellis
0,0.018,0.037,0.016
250,0.245,0.098,0.066
500,0.417,0.119,0.138
750,0.284,0.073,0.066
1000,0.506,0.109,0.126
1250,0.824,0.134,0.204
1500,0.779,0.098,0.147
1750,0.805,0.107,0.215
2000,0.87,0.137,0.223
2250,1.009,0.168,0.226
2500,1.094,0.147,0.174
2750,0.908,0.137,0.246
3000,1.144,0.182,0.313
3250,1.232,0.209,0.312
3500,1.287,0.259,0.292
3750,1.29,0.166,0.298
4000,1.521,0.249,0.418
4250,1.448,0.341,0.395
4500,1.441,0.383,0.362
4750,1.555,0.35,0.371
5000,1.825,0.482,0.638
5250,1.731,0.69,0.928
5500,1.8,0.353,0.348
5750,1.979,0.295,1.194
6000,2.115,0.915,1.241
6250,2.238,0.614,1.735
6500,2.025,0.523,0.536
6750,2.15,0.458,0.727
7000,2.415,2.158,1.925
7250,2.589,1.059,2.24
7500,2.371,1.796,2.102
7750,2.737,1.579,1.659
8000,2.768,1.786,2.579
8250,2.979,2.544,2.654
8500,2.994,2.529,2.847
8750,3.042,2.283,2.947
9000,3.209,2.509,3.077
9250,3.124,2.7,3.096
9500,3.13,3.048,3.105
9750,3.444,2.702,3.33
10000,3.671,3.354,3.297
10250,3.639,3.468,3.681
10500,3.693,3.651,3.711
10750,3.729,3.135,3.303
11000,3.846,3.862,3.917
11250,4.006,3.668,3.861
11500,4.113,3.919,3.875
11750,3.968,3.774,3.985
12000,4.205,3.882,4.218
12250,4.454,4.354,4.444
12500,4.528,4.441,4.616
12750,4.688,4.206,4.252
13000,4.551,4.507,4.444
13250,4.806,5.059,4.81
13500,5.041,4.662,4.997
13750,5.057,4.394,4.713
14000,5.029,5.03,4.927
14250,5.173,5.259,5.101
14500,5.104,5.3,5.416
14750,4.809,4.62,4.698
15000,5.045,5.066,4.806
15250,5.101,5.159,5.174
15500,5.074,5.245,5.65
15750,5.123,5.031,5.056
16000,5.518,5.097,5.595
16250,5.318,5.463,5.353
16500,5.63,5.689,5.768
16750,5.375,5.24,5.165
17000,5.578,5.846,5.628
17250,5.73,5.774,5.726
17500,6.108,6.202,6.226
17750,5.645,5.668,5.936
18000,6.308,5.925,6.317
18250,6.19,6.171,6.169
18500,6.442,6.601,6.403
18750,6.558,6.44,6.803
19000,6.664,7.176,6.742
19250,7.37,7.414,6.807
19500,7.021,7.143,7.253
19750,7.051,6.691,7.063
20000,6.942,6.858,7.225
20250,7.617,7.39,7.202
20500,7.239,7.525,7.381
20750,7.638,7.332,7.549
21000,7.697,8.081,7.807
21250,7.867,7.929,7.826
21500,7.98,8.208,8.059
21750,7.79,7.614,7.726
22000,8.144,8.611,8.361
22250,8.19,8.558,8.459
22500,8.685,8.785,8.617
22750,8.702,8.454,8.727
23000,8.653,8.699,8.89
23250,8.897,9.328,9.101
23500,9.245,9.456,9.464
23750,9.242,9.072,9.363
24000,9.367,8.934,9.541
24250,9.2,9.754,9.708
24500,9.622,9.472,9.484
24750,9.756,9.672,10.091
25000,10.207,10.304,9.981
25250,10.135,10.166,9.991
25500,9.969,10.234,10.266
25750,10.098,10.515,10.98
26000,10.811,10.6,11.3
26250,11.211,10.761,10.825
26500,10.799,11.075,10.973
26750,10.72,11.12,11.39
27000,11.463,11.106,11.679
27250,11.644,11.363,11.316
27500,11.541,11.748,11.657
27750,11.292,11.794,11.616
28000,11.888,11.697,12.169
28250,12.298,12.183,12.002
28500,12.124,12.48,12.352
28750,11.347,11.815,12.201
29000,12.009,11.72,12.734
29250,11.918,12.02,12.583
29500,12.445,12.439,12.466
29750,12.071,11.863,12.078