Re: Healthy amount of free space?
On 2018-07-20 01:01, Andrei Borzenkov wrote:
> 18.07.2018 16:30, Austin S. Hemmelgarn wrote:
>> On 2018-07-18 09:07, Chris Murphy wrote:
>>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote:
>>>
>>>> If you're doing a training presentation, it may be worth mentioning
>>>> that preallocation with fallocate() does not behave the same on BTRFS
>>>> as it does on other filesystems. For example, the following sequence
>>>> of commands:
>>>>
>>>> fallocate -l X ./tmp
>>>> dd if=/dev/zero of=./tmp bs=1 count=X
>>>>
>>>> Will always work on ext4, XFS, and most other filesystems, for any
>>>> value of X between zero and just below the total amount of free space
>>>> on the filesystem. On BTRFS though, it will reliably fail with ENOSPC
>>>> for values of X that are greater than _half_ of the total amount of
>>>> free space on the filesystem (actually, greater than just short of
>>>> half). In essence, preallocating space does not prevent COW semantics
>>>> for the first write unless the file is marked NOCOW.
>>>
>>> Is this a bug, or is it suboptimal behavior, or is it intentional?
>> It's been discussed before, though I can't find the email thread right
>> now. Pretty much, this is _technically_ not incorrect behavior, as the
>> documentation for fallocate doesn't say that subsequent writes can't
>> fail due to lack of space. I personally consider it a bug though,
>> because it breaks from existing behavior in a way that is avoidable and
>> defies user expectations.
>>
>> There are two issues here:
>>
>> 1. Regions preallocated with fallocate still do COW on the first write
>> to any given block in that region. This can be handled by either
>> treating the first write to each block as NOCOW, or by allocating a bit
> How is it possible? As long as fallocate actually allocates space, this
> should be checksummed, which means it is no longer possible to overwrite
> it. Maybe fallocate on btrfs could simply reserve space.
> Not sure whether it complies with the fallocate specification, but as
> long as the intention is to ensure a write will not fail for lack of
> space, it should be adequate (to the extent it can be ensured on btrfs,
> of course). Also, a hole in a file returns zeros by definition, which
> also matches fallocate behavior.
Except it doesn't _have_ to be checksummed if there's no data there, and
that will always be the case for a new allocation. When I say it could be
NOCOW, I'm talking specifically about the first write to each newly
allocated block (that is, one either beyond the previous end of the file,
or one in a region that used to be a hole). This obviously won't work for
places where there is already data.
>> of extra space and doing a rotating approach like this for writes:
>> - Write goes into the extra space.
>> - Once the write is done, convert the region covered by the write
>>   into a new block of extra space.
>> - When the final block of the preallocated region is written,
>>   deallocate the extra space.
>> 2. Preallocation does not completely account for the necessary metadata
>> space that will be needed to store the data there. This may not be
>> necessary if the first issue is addressed properly.
>>> And then I wonder what happens with XFS COW:
>>>
>>> fallocate -l X ./tmp
>>> cp --reflink ./tmp ./tmp2
>>> dd if=/dev/zero of=./tmp bs=1 count=X
>> I'm not sure. In this particular case, this will fail on BTRFS for any
>> X larger than just short of one third of the total free space. I would
>> expect it to fail for any X larger than just short of half instead.
>>
>> ZFS gets around this by not supporting fallocate (well, kind of; if
>> you're using glibc and call posix_fallocate, that _will_ work, but it
>> will take forever because it works by writing out each block of space
>> that's being allocated, which, ironically, means that it still
>> potentially suffers from the same issue we have).
> What happens on btrfs then? fallocate specifies that new space should be
> initialized to zero, so something should still write those zeros?
For new regions (places that were holes previously, or were beyond the
end of the file), we create an unwritten extent, which is a region that's
'allocated', but everything reads back as zero. The problem is that we
don't write into the blocks allocated for the unwritten extent at all,
and only deallocate them once a write to another block finishes. In
essence, we're (either explicitly or implicitly) applying COW semantics
to a region that should not be COW until after the first write to each
block.

For the case of calling fallocate on existing data, we don't really do
anything (unless the flag telling fallocate to unshare the region is
passed). This is actually consistent with pretty much every other
filesystem in existence, but that's because pretty much every other
filesystem implicitly provides the same guarantee that fallocate does for
regions that already have data. This case can in theory be handled by the
same looping algorithm I described above without needing the base amount
of space allocated, but I wouldn't consider it important.
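To make the user-visible side of this concrete: whatever the filesystem does internally (an unwritten extent on btrfs/ext4/XFS, real zeroed blocks elsewhere), a freshly preallocated region must read back as zeros and the file size must grow. A minimal sketch of that contract using Python's `os.posix_fallocate` (the filesystem-internal details above are not observable from here):

```python
import os
import tempfile

# Sketch: verify the user-visible fallocate contract -- a preallocated
# region reads back as zeros and the file size is extended, regardless
# of how the filesystem represents the extent internally.
def fallocate_reads_zero(length=1 << 20):
    fd, path = tempfile.mkstemp()
    try:
        os.posix_fallocate(fd, 0, length)      # allocate [0, length)
        assert os.fstat(fd).st_size == length  # size was extended
        data = os.pread(fd, length, 0)         # read the whole region
        return data == b"\x00" * length        # must be all zeros
    finally:
        os.close(fd)
        os.unlink(path)

print(fallocate_reads_zero())  # prints True
```

What this cannot show is the btrfs-specific problem under discussion: whether the blocks backing that region are actually reused by the first write, or COWed into new space.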
Re: Healthy amount of free space?
18.07.2018 16:30, Austin S. Hemmelgarn wrote:
> On 2018-07-18 09:07, Chris Murphy wrote:
>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote:
>>
>>> If you're doing a training presentation, it may be worth mentioning
>>> that preallocation with fallocate() does not behave the same on BTRFS
>>> as it does on other filesystems. For example, the following sequence
>>> of commands:
>>>
>>> fallocate -l X ./tmp
>>> dd if=/dev/zero of=./tmp bs=1 count=X
>>>
>>> Will always work on ext4, XFS, and most other filesystems, for any
>>> value of X between zero and just below the total amount of free space
>>> on the filesystem. On BTRFS though, it will reliably fail with ENOSPC
>>> for values of X that are greater than _half_ of the total amount of
>>> free space on the filesystem (actually, greater than just short of
>>> half). In essence, preallocating space does not prevent COW semantics
>>> for the first write unless the file is marked NOCOW.
>>
>> Is this a bug, or is it suboptimal behavior, or is it intentional?
> It's been discussed before, though I can't find the email thread right
> now. Pretty much, this is _technically_ not incorrect behavior, as the
> documentation for fallocate doesn't say that subsequent writes can't
> fail due to lack of space. I personally consider it a bug though,
> because it breaks from existing behavior in a way that is avoidable and
> defies user expectations.
>
> There are two issues here:
>
> 1. Regions preallocated with fallocate still do COW on the first write
> to any given block in that region. This can be handled by either
> treating the first write to each block as NOCOW, or by allocating a bit
How is it possible? As long as fallocate actually allocates space, this
should be checksummed, which means it is no longer possible to overwrite
it. Maybe fallocate on btrfs could simply reserve space.
Not sure whether it complies with fallocate specification, but as long as intention is to ensure write will not fail for the lack of space it should be adequate (to the extent it can be ensured on btrfs of course). Also hole in file returns zeros by definition which also matches fallocate behavior. > of extra space and doing a rotating approach like this for writes: > - Write goes into the extra space. > - Once the write is done, convert the region covered by the write > into a new block of extra space. > - When the final block of the preallocated region is written, > deallocate the extra space. > 2. Preallocation does not completely account for necessary metadata > space that will be needed to store the data there. This may not be > necessary if the first issue is addressed properly. >> >> And then I wonder what happens with XFS COW: >> >> fallocate -l X ./tmp >> cp --reflink ./tmp ./tmp2 >> dd if=/dev/zero of=./tmp bs=1 count=X > I'm not sure. In this particular case, this will fail on BTRFS for any > X larger than just short of one third of the total free space. I would > expect it to fail for any X larger than just short of half instead. > > ZFS gets around this by not supporting fallocate (well, kind of, if > you're using glibc and call posix_fallocate, that _will_ work, but it > will take forever because it works by writing out each block of space > that's being allocated, which, ironically, means that that still suffers > from the same issue potentially that we have). What happens on btrfs then? fallocate specifies that new space should be initialized to zero, so something should still write those zeros? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Healthy amount of free space?
On 2018-07-18 17:32, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn wrote:
>> On 2018-07-18 13:40, Chris Murphy wrote:
>>> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy wrote:
>>>> I don't know for sure, but based on the addresses reported before
>>>> and after dd for the fallocated tmp file, it looks like Btrfs is not
>>>> using the originally fallocated addresses for dd. So maybe it is
>>>> COWing into new blocks, but is just as quickly deallocating the
>>>> fallocated blocks as it goes, and hence doesn't end up in enospc?
>>>
>>> Previous thread is "Problem with file system" from August 2017. And
>>> there's these reproduce steps from Austin which have fallocate coming
>>> after the dd.
>>>
>>> truncate --size=4G ./test-fs
>>> mkfs.btrfs ./test-fs
>>> mkdir ./test
>>> mount -t auto ./test-fs ./test
>>> dd if=/dev/zero of=./test/test bs=65536 count=32768
>>> fallocate -l 2147483650 ./test/test && echo "Success!"
>>>
>>> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
>>> fallocate in half.
>>>
>>> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
>>> 1000+0 records in
>>> 1000+0 records out
>>> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
>>> [chris@f28s btrfs]$ sync
>>> [chris@f28s btrfs]$ df -h
>>> Filesystem                Size  Used Avail Use% Mounted on
>>> /dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
>>> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>>>
>>> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M
>>> over it, this fails, but I kinda expect that because there's only
>>> 1.1G free space. But maybe that's what you're saying is the bug, it
>>> shouldn't fail?
>> Yes, you're right, I had things backwards (well, kind of, this does
>> work on ext4 and regular XFS, so it arguably should work here).
> I guess I'm confused what it even means to fallocate over a file with
> in-use blocks unless either -d or -p options are used. And from the man
> page, I don't grok the distinction between -d and -p either. But based
> on their descriptions I'd expect they both should work without enospc.
Without any specific options, it forces allocation of any sparse regions
in the file (that is, it gets rid of holes in the file). On BTRFS, I
believe the command also forcibly unshares all the extents in the file
(for the system call, there's a special flag for doing this).
Additionally, you can extend a file with fallocate this way by specifying
a length longer than the current size of the file, which guarantees that
writes into that region will succeed, unlike truncating the file to a
larger size, which just creates a hole at the end of the file to bring it
up to size.

As far as `-d` versus `-p`: `-p` directly translates to the option for
the system call that punches a hole. It requires a length and possibly an
offset, and will punch a hole at that exact location of that exact size.
`-d` is a special option that's only available for the command. It tells
the `fallocate` command to search the file for zero-filled regions and
punch holes there. Neither option should ever trigger an ENOSPC, except
possibly if it has to split an extent for some reason and you are
completely out of metadata space.
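The truncate-versus-fallocate distinction above can be observed from userspace with `SEEK_HOLE`. A sketch (note `SEEK_HOLE`/`SEEK_DATA` support is filesystem-dependent; a filesystem without sparse-file awareness simply reports the whole file as data, and the exact hole offsets shown will vary):

```python
import os
import tempfile

# Sketch: contrast extending a file with truncate (which just creates a
# hole) vs. posix_fallocate (which reserves space). SEEK_HOLE reports
# the offset of the first hole; for a file that is entirely a hole this
# is typically 0, while a filesystem without sparse-file support
# reports the file size instead.
def first_hole(path):
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.lseek(fd, 0, os.SEEK_HOLE)
    finally:
        os.close(fd)

with tempfile.TemporaryDirectory() as d:
    sparse = os.path.join(d, "sparse")
    dense = os.path.join(d, "dense")

    # truncate: extends the logical size only; later writes into the
    # region can still fail with ENOSPC.
    with open(sparse, "wb") as f:
        f.truncate(1 << 20)

    # fallocate: extends the size and reserves space (the thread
    # discusses how btrfs diverges on whether that reservation actually
    # protects the first write).
    with open(dense, "wb") as f:
        os.posix_fallocate(f.fileno(), 0, 1 << 20)

    sparse_size = os.path.getsize(sparse)
    dense_size = os.path.getsize(dense)
    sparse_hole = first_hole(sparse)

    print("sparse: first hole at", sparse_hole)
    print("dense:  first hole at", first_hole(dense))
```

Both files report the same `st_size`; only the allocation behind them differs.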
Re: Healthy amount of free space?
Related on XFS list.

https://www.spinics.net/lists/linux-xfs/msg20722.html
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn wrote: > On 2018-07-18 13:40, Chris Murphy wrote: >> >> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy >> wrote: >> >>> I don't know for sure, but based on the addresses reported before and >>> after dd for the fallocated tmp file, it looks like Btrfs is not using >>> the originally fallocated addresses for dd. So maybe it is COWing into >>> new blocks, but is just as quickly deallocating the fallocated blocks >>> as it goes, and hence doesn't end up in enospc? >> >> >> Previous thread is "Problem with file system" from August 2017. And >> there's these reproduce steps from Austin which have fallocate coming >> after the dd. >> >> truncate --size=4G ./test-fs >> mkfs.btrfs ./test-fs >> mkdir ./test >> mount -t auto ./test-fs ./test >> dd if=/dev/zero of=./test/test bs=65536 count=32768 >> fallocate -l 2147483650 ./test/test && echo "Success!" >> >> >> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and >> fallocate in half. >> >> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000 >> 1000+0 records in >> 1000+0 records out >> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s >> [chris@f28s btrfs]$ sync >> [chris@f28s btrfs]$ df -h >> FilesystemSize Used Avail Use% Mounted on >> /dev/mapper/vg-btrfstest 2.0G 1018M 1.1G 50% /mnt/btrfs >> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp >> >> >> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over >> it, this fails, but I kinda expect that because there's only 1.1G free >> space. But maybe that's what you're saying is the bug, it shouldn't >> fail? > > Yes, you're right, I had things backwards (well, kind of, this does work on > ext4 and regular XFS, so it arguably should work here). I guess I'm confused what it even means to fallocate over a file with in-use blocks unless either -d or -p options are used. And from the man page, I don't grok the distinction between -d and -p either. 
But based on their descriptions I'd expect they both should work without
enospc.

--
Chris Murphy
Re: Healthy amount of free space?
On 2018-07-18 13:40, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy wrote:
>> I don't know for sure, but based on the addresses reported before and
>> after dd for the fallocated tmp file, it looks like Btrfs is not using
>> the originally fallocated addresses for dd. So maybe it is COWing into
>> new blocks, but is just as quickly deallocating the fallocated blocks
>> as it goes, and hence doesn't end up in enospc?
>
> Previous thread is "Problem with file system" from August 2017. And
> there's these reproduce steps from Austin which have fallocate coming
> after the dd.
>
> truncate --size=4G ./test-fs
> mkfs.btrfs ./test-fs
> mkdir ./test
> mount -t auto ./test-fs ./test
> dd if=/dev/zero of=./test/test bs=65536 count=32768
> fallocate -l 2147483650 ./test/test && echo "Success!"
>
> My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
> fallocate in half.
>
> [chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
> 1000+0 records in
> 1000+0 records out
> 1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
> [chris@f28s btrfs]$ sync
> [chris@f28s btrfs]$ df -h
> Filesystem                Size  Used Avail Use% Mounted on
> /dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
> [chris@f28s btrfs]$ sudo fallocate -l 1000m tmp
>
> Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
> it, this fails, but I kinda expect that because there's only 1.1G free
> space. But maybe that's what you're saying is the bug, it shouldn't
> fail?
Yes, you're right, I had things backwards (well, kind of, this does work
on ext4 and regular XFS, so it arguably should work here).
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy wrote:
> I don't know for sure, but based on the addresses reported before and
> after dd for the fallocated tmp file, it looks like Btrfs is not using
> the originally fallocated addresses for dd. So maybe it is COWing into
> new blocks, but is just as quickly deallocating the fallocated blocks
> as it goes, and hence doesn't end up in enospc?

Previous thread is "Problem with file system" from August 2017. And
there's these reproduce steps from Austin which have fallocate coming
after the dd.

truncate --size=4G ./test-fs
mkfs.btrfs ./test-fs
mkdir ./test
mount -t auto ./test-fs ./test
dd if=/dev/zero of=./test/test bs=65536 count=32768
fallocate -l 2147483650 ./test/test && echo "Success!"

My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.

[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp

Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kinda expect that because there's only 1.1G free
space. But maybe that's what you're saying is the bug, it shouldn't
fail?

--
Chris Murphy
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 11:06 AM, Austin S. Hemmelgarn wrote:
> On 2018-07-18 13:04, Chris Murphy wrote:
>> On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn wrote:
>>> I'm not sure. In this particular case, this will fail on BTRFS for
>>> any X larger than just short of one third of the total free space. I
>>> would expect it to fail for any X larger than just short of half
>>> instead.
>>
>> I'm confused. I can't get it to fail when X is 3/4 of free space.
>>
>> lvcreate -V 2g -T vg/thintastic -n btrfstest
>> mkfs.btrfs -M /dev/mapper/vg-btrfstest
>> mount /dev/mapper/vg-btrfstest /mnt/btrfs
>> cd /mnt/btrfs
>> fallocate -l 1500m tmp
>> dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
>>
>> Succeeds. No enospc. This is on kernel 4.17.6.
> Odd, I could have sworn it would fail reliably. Unless something has
> changed since I last tested though, doing it with X equal to the free
> space on the filesystem will fail.

OK well X is being defined twice here, so I can't tell if I'm doing this
correctly. There's fallocate X, and that's 75% of free space for the
empty fs at the time of fallocate. And then there's dd, which is 1450m,
which is ~2.67x the free space at the time of dd.

I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but is just as quickly deallocating the fallocated blocks as
it goes, and hence doesn't end up in enospc?

--
Chris Murphy
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 7:30 AM, Austin S. Hemmelgarn wrote:
> I'm not sure. In this particular case, this will fail on BTRFS for any
> X larger than just short of one third of the total free space. I would
> expect it to fail for any X larger than just short of half instead.

I'm confused. I can't get it to fail when X is 3/4 of free space.

lvcreate -V 2g -T vg/thintastic -n btrfstest
mkfs.btrfs -M /dev/mapper/vg-btrfstest
mount /dev/mapper/vg-btrfstest /mnt/btrfs
cd /mnt/btrfs
fallocate -l 1500m tmp
dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450

Succeeds. No enospc. This is on kernel 4.17.6.

Copied from terminal:

[chris@f28s btrfs]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G   17M  2.0G   1% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1500m /mnt/btrfs/tmp
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1572864000 (384000 blocks of 4096 bytes)
 ext:  logical_offset:  physical_offset:  length:  expected:  flags:
   0:       0..  32767:   16400..  49167:   32768:             unwritten
   1:   32768..  65535:   56576..  89343:   32768:     49168:  unwritten
   2:   65536..  98303:  109824.. 142591:   32768:     89344:  unwritten
   3:   98304.. 131071:  163072.. 195839:   32768:    142592:  unwritten
   4:  131072.. 163839:  216320.. 249087:   32768:    195840:  unwritten
   5:  163840.. 196607:  269568.. 302335:   32768:    249088:  unwritten
   6:  196608.. 229375:  322816.. 355583:   32768:    302336:  unwritten
   7:  229376.. 262143:  376064.. 408831:   32768:    355584:  unwritten
   8:  262144.. 294911:  429312.. 462079:   32768:    408832:  unwritten
   9:  294912.. 327679:  482560.. 515327:   32768:    462080:  unwritten
  10:  327680.. 344063:   89344.. 105727:   16384:    515328:  unwritten
  11:  344064.. 360447:  142592.. 158975:   16384:    105728:  unwritten
  12:  360448.. 376831:  195840.. 212223:   16384:    158976:  unwritten
  13:  376832.. 383999:  249088.. 256255:    7168:    212224:  last,unwritten,eof
tmp: 14 extents found
[chris@f28s btrfs]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  543M  74% /mnt/btrfs
[chris@f28s btrfs]$ sudo dd if=/dev/zero of=/mnt/btrfs/tmp bs=1M count=1450
1450+0 records in
1450+0 records out
1520435200 bytes (1.5 GB, 1.4 GiB) copied, 13.4757 s, 113 MB/s
[chris@f28s btrfs]$ df -h
Filesystem                Size  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G  1.5G  591M  72% /mnt/btrfs
[chris@f28s btrfs]$ filefrag -v tmp
Filesystem type is: 9123683e
File size of tmp is 1520435200 (371200 blocks of 4096 bytes)
 ext:  logical_offset:  physical_offset:  length:  expected:  flags:
   0:       0..  16383:  302336.. 318719:   16384:
   1:   16384..  32767:  355584.. 371967:   16384:    318720:
   2:   32768..  49151:  408832.. 425215:   16384:    371968:
   3:   49152..  65535:  462080.. 478463:   16384:    425216:
   4:   65536..  73727:  515328.. 523519:    8192:    478464:
   5:   73728..  86015:    3328..  15615:   12288:    523520:
   6:   86016..  98303:  256256.. 268543:   12288:     15616:
   7:   98304.. 104959:   49168..  55823:    6656:    268544:
   8:  104960.. 109047:  105728.. 109815:    4088:     55824:
   9:  109048.. 113143:  158976.. 163071:    4096:    109816:
  10:  113144.. 117239:  212224.. 216319:    4096:    163072:
  11:  117240.. 121335:  318720.. 322815:    4096:    216320:
  12:  121336.. 125431:  371968.. 376063:    4096:    322816:
  13:  125432.. 128251:  425216.. 428035:    2820:    376064:
  14:  128252.. 131071:  478464.. 481283:    2820:    428036:
  15:  131072.. 132409:    1460..   2797:    1338:    481284:
  16:  132410.. 165177:  322816.. 355583:   32768:      2798:
  17:  165178.. 197945:  376064.. 408831:   32768:    355584:
  18:  197946.. 230713:  429312.. 462079:   32768:    408832:
  19:  230714.. 263481:  482560.. 515327:   32768:    462080:
  20:  263482.. 296249:   16400..  49167:   32768:    515328:
  21:  296250.. 327687:   56576..  88013:   31438:     49168:
  22:  327688.. 328711:  428036.. 429059:    1024:     88014:
  23:  328712.. 361479:  109824.. 142591:   32768:    429060:
  24:  361480.. 371199:   88014..  97733:    9720:    142592:  last,eof
tmp: 25 extents found
[chris@f28s btrfs]$

*shrug*

--
Chris Murphy
Re: Healthy amount of free space?
On 2018-07-18 09:07, Chris Murphy wrote:
> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote:
>> If you're doing a training presentation, it may be worth mentioning
>> that preallocation with fallocate() does not behave the same on BTRFS
>> as it does on other filesystems. For example, the following sequence
>> of commands:
>>
>> fallocate -l X ./tmp
>> dd if=/dev/zero of=./tmp bs=1 count=X
>>
>> Will always work on ext4, XFS, and most other filesystems, for any
>> value of X between zero and just below the total amount of free space
>> on the filesystem. On BTRFS though, it will reliably fail with ENOSPC
>> for values of X that are greater than _half_ of the total amount of
>> free space on the filesystem (actually, greater than just short of
>> half). In essence, preallocating space does not prevent COW semantics
>> for the first write unless the file is marked NOCOW.
>
> Is this a bug, or is it suboptimal behavior, or is it intentional?
It's been discussed before, though I can't find the email thread right
now. Pretty much, this is _technically_ not incorrect behavior, as the
documentation for fallocate doesn't say that subsequent writes can't
fail due to lack of space. I personally consider it a bug though,
because it breaks from existing behavior in a way that is avoidable and
defies user expectations.

There are two issues here:

1. Regions preallocated with fallocate still do COW on the first write
   to any given block in that region. This can be handled by either
   treating the first write to each block as NOCOW, or by allocating a
   bit of extra space and doing a rotating approach like this for
   writes:
   - Write goes into the extra space.
   - Once the write is done, convert the region covered by the write
     into a new block of extra space.
   - When the final block of the preallocated region is written,
     deallocate the extra space.
2. Preallocation does not completely account for the necessary metadata
   space that will be needed to store the data there. This may not be
   necessary if the first issue is addressed properly.
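The rotating-extra-space idea in issue 1 can be sketched as a toy model (this is purely illustrative, not btrfs code): the filesystem holds the N preallocated blocks plus one scratch block, each first write lands in the scratch block, and the physical block it logically replaces becomes the new scratch. Peak allocation stays at N + 1 blocks instead of the roughly 2N that "COW everything, free afterwards" can reach, which is the shape of the half-of-free-space ENOSPC.

```python
# Toy model of the rotating scratch-block scheme: block ids are just
# integers standing in for physical blocks.
def rotating_first_write(n_blocks, writes):
    prealloc = list(range(n_blocks))  # physical block for each logical block
    scratch = n_blocks                # one extra physical block
    peak = n_blocks + 1               # never need more than N + 1 blocks
    for logical in writes:
        # the write goes into the scratch block...
        written = scratch
        # ...and the old physical block for this logical offset becomes
        # the new scratch space.
        scratch, prealloc[logical] = prealloc[logical], written
    return prealloc, peak

blocks, peak = rotating_first_write(4, [0, 1, 2, 3])
print(blocks, peak)  # prints [4, 0, 1, 2] 5
```

After all four first writes, every logical block points at a written physical block and total usage never exceeded five blocks.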
> And then I wonder what happens with XFS COW:
>
> fallocate -l X ./tmp
> cp --reflink ./tmp ./tmp2
> dd if=/dev/zero of=./tmp bs=1 count=X
I'm not sure. In this particular case, this will fail on BTRFS for any X
larger than just short of one third of the total free space. I would
expect it to fail for any X larger than just short of half instead.

ZFS gets around this by not supporting fallocate (well, kind of; if
you're using glibc and call posix_fallocate, that _will_ work, but it
will take forever because it works by writing out each block of space
that's being allocated, which, ironically, means that it still
potentially suffers from the same issue we have).
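The "write every block" fallback described for ZFS can be sketched as follows. This is a simplified illustration of that style of emulation, not glibc's actual code (the real glibc fallback is more careful about preserving existing file contents); it only handles extending a fresh region with zeros:

```python
import os
import tempfile

# Simplified sketch of a "write every block" posix_fallocate fallback:
# extend the logical size, then touch one byte per block so the
# filesystem is forced to allocate each block. Slow by design -- this
# is why the emulated path "takes forever" on large files.
def emulate_fallocate(fd, offset, length, blocksize=4096):
    end = offset + length
    if os.fstat(fd).st_size < end:
        os.ftruncate(fd, end)              # extend the logical size
    for pos in range(offset, end, blocksize):
        os.pwrite(fd, b"\x00", pos)        # force allocation of the block

fd, path = tempfile.mkstemp()
emulate_fallocate(fd, 0, 64 * 1024)
size = os.fstat(fd).st_size
os.close(fd)
os.unlink(path)
print(size)  # prints 65536
```

Note that on a COW filesystem these writes are themselves COW writes, which is why, as pointed out above, the emulation can still hit the same ENOSPC issue it is meant to avoid.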
Re: Healthy amount of free space?
On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn wrote:
> If you're doing a training presentation, it may be worth mentioning
> that preallocation with fallocate() does not behave the same on BTRFS
> as it does on other filesystems. For example, the following sequence
> of commands:
>
> fallocate -l X ./tmp
> dd if=/dev/zero of=./tmp bs=1 count=X
>
> Will always work on ext4, XFS, and most other filesystems, for any
> value of X between zero and just below the total amount of free space
> on the filesystem. On BTRFS though, it will reliably fail with ENOSPC
> for values of X that are greater than _half_ of the total amount of
> free space on the filesystem (actually, greater than just short of
> half). In essence, preallocating space does not prevent COW semantics
> for the first write unless the file is marked NOCOW.

Is this a bug, or is it suboptimal behavior, or is it intentional?

And then I wonder what happens with XFS COW:

fallocate -l X ./tmp
cp --reflink ./tmp ./tmp2
dd if=/dev/zero of=./tmp bs=1 count=X

--
Chris Murphy
Re: Healthy amount of free space?
On 2018-07-17 13:54, Martin Steigerwald wrote:
> Nikolay Borisov - 17.07.18, 10:16:
>> On 17.07.2018 11:02, Martin Steigerwald wrote:
>>> Nikolay Borisov - 17.07.18, 09:20:
>>>> On 16.07.2018 23:58, Wolf wrote:
>>>>> Greetings,
>>>>> I would like to ask what is a healthy amount of free space to keep
>>>>> on each device for btrfs to be happy?
>>>>>
>>>>> This is how my disk array currently looks like
>>>>>
>>>>> [root@dennas ~]# btrfs fi usage /raid
>>>>> Overall:
>>>>>     Device size:                  29.11TiB
>>>>>     Device allocated:             21.26TiB
>>>>>     Device unallocated:            7.85TiB
>>>>>     Device missing:                  0.00B
>>>>>     Used:                         21.18TiB
>>>>>     Free (estimated):              3.96TiB  (min: 3.96TiB)
>>>>>     Data ratio:                       2.00
>>>>>     Metadata ratio:                   2.00
>>>>>     Global reserve:              512.00MiB  (used: 0.00B)
> […]
>>>>> Btrfs does quite good job of evenly using space on all devices.
>>>>> Now, how low can I let that go? In other words, with how much
>>>>> space free/unallocated remaining should I consider adding a new
>>>>> disk?
>>>> Btrfs will start running into problems when you run out of
>>>> unallocated space. So the best advice is to monitor your device
>>>> unallocated; once it gets really low - like 2-3 GB - I suggest you
>>>> run a balance, which will try to free up unallocated space by
>>>> rewriting data more compactly into sparsely populated block groups.
>>>> If after running balance you haven't really freed any space, then
>>>> you should consider adding a new drive and running balance to even
>>>> out the spread of data/metadata.
>>> What are these issues exactly?
>> For example, if you have plenty of data space but your metadata is
>> full, then you will be getting ENOSPC.
> Of that one I am aware. This just did not happen so far. I did not yet
> add it explicitly to the training slides, but I just made myself a note
> to do that.
>
> Anything else?
If you're doing a training presentation, it may be worth mentioning that
preallocation with fallocate() does not behave the same on BTRFS as it
does on other filesystems. For example, the following sequence of
commands:

fallocate -l X ./tmp
dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any value
of X between zero and just below the total amount of free space on the
filesystem. On BTRFS though, it will reliably fail with ENOSPC for
values of X that are greater than _half_ of the total amount of free
space on the filesystem (actually, greater than just short of half). In
essence, preallocating space does not prevent COW semantics for the
first write unless the file is marked NOCOW.
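The fallocate-then-overwrite sequence above can also be probed programmatically. A sketch of such a probe (on ext4/XFS the overwrite should always succeed; per this thread, on btrfs it can return ENOSPC once the length exceeds roughly half the filesystem's free space):

```python
import errno
import os
import tempfile

# Sketch: preallocate `length` bytes, then overwrite the same range
# with zeros, reporting whether the overwrite hit ENOSPC. This is the
# programmatic equivalent of `fallocate -l X ./tmp` followed by
# `dd if=/dev/zero of=./tmp bs=1 count=X`.
def overwrite_after_fallocate(length):
    fd, path = tempfile.mkstemp()
    try:
        os.posix_fallocate(fd, 0, length)
        chunk = b"\x00" * 65536
        pos = 0
        while pos < length:
            n = min(len(chunk), length - pos)
            try:
                pos += os.pwrite(fd, chunk[:n], pos)
            except OSError as e:
                if e.errno == errno.ENOSPC:
                    return "ENOSPC after %d bytes" % pos
                raise
        return "ok"
    finally:
        os.close(fd)
        os.unlink(path)

print(overwrite_after_fallocate(1 << 20))
```

Run with a length approaching the filesystem's free space to see the divergence between filesystems being discussed here.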
Re: Healthy amount of free space?
Nikolay Borisov - 17.07.18, 10:16: > On 17.07.2018 11:02, Martin Steigerwald wrote: > > Nikolay Borisov - 17.07.18, 09:20: > >> On 16.07.2018 23:58, Wolf wrote: > >>> Greetings, > >>> I would like to ask what what is healthy amount of free space to > >>> keep on each device for btrfs to be happy? > >>> > >>> This is how my disk array currently looks like > >>> > >>> [root@dennas ~]# btrfs fi usage /raid > >>> > >>> Overall: > >>> Device size: 29.11TiB > >>> Device allocated: 21.26TiB > >>> Device unallocated:7.85TiB > >>> Device missing: 0.00B > >>> Used: 21.18TiB > >>> Free (estimated): 3.96TiB (min: 3.96TiB) > >>> Data ratio: 2.00 > >>> Metadata ratio: 2.00 > >>> Global reserve: 512.00MiB (used: 0.00B) > > > > […] > > > >>> Btrfs does quite good job of evenly using space on all devices. > >>> No, > >>> how low can I let that go? In other words, with how much space > >>> free/unallocated remaining space should I consider adding new > >>> disk? > >> > >> Btrfs will start running into problems when you run out of > >> unallocated space. So the best advice will be monitor your device > >> unallocated, once it gets really low - like 2-3 gb I will suggest > >> you run balance which will try to free up unallocated space by > >> rewriting data more compactly into sparsely populated block > >> groups. If after running balance you haven't really freed any > >> space then you should consider adding a new drive and running > >> balance to even out the spread of data/metadata. > > > > What are these issues exactly? > > For example if you have plenty of data space but your metadata is full > then you will be getting ENOSPC. Of that one I am aware. This just did not happen so far. I did not yet add it explicitly to the training slides, but I just make myself a note to do that. Anything else? 
> > I have
> >
> > % btrfs fi us -T /home
> >
> > Overall:
> >     Device size:         340.00GiB
> >     Device allocated:    340.00GiB
> >     Device unallocated:    2.00MiB
> >     Device missing:          0.00B
> >     Used:                308.37GiB
> >     Free (estimated):     14.65GiB (min: 14.65GiB)
> >     Data ratio:               2.00
> >     Metadata ratio:           2.00
> >     Global reserve:      512.00MiB (used: 0.00B)
> >
> >                             Data       Metadata  System
> > Id Path                     RAID1      RAID1     RAID1     Unallocated
> > -- ----------------------- ---------- --------- --------- -----------
> >  1 /dev/mapper/msata-home   165.89GiB   4.08GiB  32.00MiB     1.00MiB
> >  2 /dev/mapper/sata-home    165.89GiB   4.08GiB  32.00MiB     1.00MiB
> > -- ----------------------- ---------- --------- --------- -----------
> >    Total                    165.89GiB   4.08GiB  32.00MiB     2.00MiB
> >    Used                     151.24GiB   2.95GiB  48.00KiB
>
> You already have only 33% of your metadata full, so if your workload
> turned out to actually be making more metadata-heavy changes, i.e.
> snapshots, you could exhaust this and get ENOSPC, despite having
> around 14 GiB of free data space. Furthermore this data space is
> spread around multiple data chunks; depending on how populated they
> are, a balance could be able to free up unallocated space which later
> could be re-purposed for metadata (again, depending on what you are
> doing).

The filesystem above IMO is not fit for snapshots. It would fill up rather quickly, I think even if I balance metadata. Actually I tried this, and as I remember it took at most a day until it was full.

If I read the above figures correctly, at maximum I could gain one additional GiB by balancing metadata. That would not make a huge difference. I bet I am already running this filesystem beyond recommendation, as I bet many would argue it is too full already for regular usage… I do not see the benefit of squeezing the last free space out of it just to fit in another GiB. So I still do not get the point why it would make sense to balance it at this point in time, especially as this 1 GiB I could regain is not even needed. And I do not see th
Re: Healthy amount of free space?
On 2018-07-16 16:58, Wolf wrote:
> Greetings,
> I would like to ask what is a healthy amount of free space to keep on
> each device for btrfs to be happy?
>
> This is how my disk array currently looks like
>
> [root@dennas ~]# btrfs fi usage /raid
> Overall:
>     Device size:          29.11TiB
>     Device allocated:     21.26TiB
>     Device unallocated:    7.85TiB
>     Device missing:          0.00B
>     Used:                 21.18TiB
>     Free (estimated):      3.96TiB (min: 3.96TiB)
>     Data ratio:               2.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
> Data,RAID1: Size:10.61TiB, Used:10.58TiB
>     /dev/mapper/data1     1.75TiB
>     /dev/mapper/data2     1.75TiB
>     /dev/mapper/data3   856.00GiB
>     /dev/mapper/data4   856.00GiB
>     /dev/mapper/data5     1.75TiB
>     /dev/mapper/data6     1.75TiB
>     /dev/mapper/data7     6.29TiB
>     /dev/mapper/data8     6.29TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
>     /dev/mapper/data1     2.00GiB
>     /dev/mapper/data2     3.00GiB
>     /dev/mapper/data3     1.00GiB
>     /dev/mapper/data4     1.00GiB
>     /dev/mapper/data5     3.00GiB
>     /dev/mapper/data6     1.00GiB
>     /dev/mapper/data7     9.00GiB
>     /dev/mapper/data8    10.00GiB

Slightly OT, but the distribution of metadata chunks across devices looks a bit sub-optimal here. If you can tolerate the volume being somewhat slower for a while, I'd suggest balancing these (it should get you better performance long-term).

> System,RAID1: Size:64.00MiB, Used:1.50MiB
>     /dev/mapper/data2    32.00MiB
>     /dev/mapper/data6    32.00MiB
>     /dev/mapper/data7    32.00MiB
>     /dev/mapper/data8    32.00MiB
>
> Unallocated:
>     /dev/mapper/data1  1004.52GiB
>     /dev/mapper/data2  1004.49GiB
>     /dev/mapper/data3  1006.01GiB
>     /dev/mapper/data4  1006.01GiB
>     /dev/mapper/data5  1004.52GiB
>     /dev/mapper/data6  1004.49GiB
>     /dev/mapper/data7  1005.00GiB
>     /dev/mapper/data8  1005.00GiB
>
> Btrfs does quite good job of evenly using space on all devices. Now,
> how low can I let that go? In other words, with how much
> free/unallocated space remaining should I consider adding a new disk?

Disclaimer: What I'm about to say is based on personal experience. YMMV.

It depends on how you use the filesystem.
Realistically, there are a couple of things I consider when trying to decide on this myself:

* How quickly does the total usage increase on average, and how much can it be expected to increase in one day in the worst-case scenario? This isn't really BTRFS specific, but it's worth mentioning. I usually don't let an array get close enough to full that it couldn't safely handle at least one day of the worst-case increase plus another two days of average increases. In BTRFS terms, the 'safely handle' part means you should be adding about 5GB for a multi-TB array like you have, or about 1GB for a sub-TB array.

* What are the typical write patterns? Do files get rewritten in place, or are they only ever rewritten with a replace-by-rename? Are writes mostly random, or mostly sequential? Are writes mostly small or mostly large? The more your workload leans towards the first possibility in each of those questions (in-place rewrites, random access, and small writes), the more free space you should keep on the volume.

* Does this volume see heavy usage of fallocate(), either to preallocate space (note that this _DOES NOT WORK SANELY_ on BTRFS) or to punch holes or remove ranges from files? If whatever software you're using does this a lot on this volume, you want even more free space.

* Do old files tend to get removed in large batches? That is, possibly hundreds or thousands of files at a time. If so, and you're running a reasonably recent (4.x series) kernel or regularly balance the volume to clean up empty chunks, you don't need quite as much free space.

* How quickly can you get a new device added, and is it critical that this volume always be writable? Sounds stupid, but a lot of people don't consider this. If you can trivially get a new device added immediately, you can generally let things go a bit further than you would normally; the same goes if the volume being read-only can be tolerated for a while without significant issues.
It's worth noting that I explicitly do not care about snapshot usage. It rarely has much impact on this other than changing how the total usage increases in a day. Evaluating all of this is of course something I can't really do for you. If I had to guess, with no other information than the allocations shown, I'd say that you're probably generically fine until you get down to about 5GB more than twice the average
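The "monitor your free space" side of the advice above can be scripted. A hedged sketch that reads `btrfs fi usage` output on stdin and flags low unallocated space; the function name and the ~3 GiB cutoff are my own illustration, not anything shipped with btrfs-progs:

```shell
#!/bin/sh
# Hypothetical monitoring helper: return non-zero when the
# "Device unallocated" figure looks dangerously low.
check_unallocated() {
    # Tolerate both "unallocated:  7.85TiB" and squashed "unallocated:7.85TiB"
    line=$(grep -i 'Device unallocated' | head -n 1)
    value=$(printf '%s\n' "$line" | sed -n 's/.*:[[:space:]]*\([0-9.]*\)\([KMGT]iB\).*/\1/p')
    unit=$(printf '%s\n' "$line" | sed -n 's/.*:[[:space:]]*\([0-9.]*\)\([KMGT]iB\).*/\2/p')
    case "$unit" in
        TiB) return 0 ;;                                      # plenty of headroom
        GiB) awk -v v="$value" 'BEGIN { exit !(v >= 3) }' ;;  # warn below ~3 GiB
        *)   return 1 ;;                                      # MiB/KiB or unparsable: act now
    esac
}
```

From cron one might then do something like `btrfs fi usage /raid | check_unallocated || echo "time to balance or add a disk"`, routing the message wherever alerts normally go.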
Re: Healthy amount of free space?
On 17.07.2018 11:02, Martin Steigerwald wrote:
> Hi Nikolay.
>
> Nikolay Borisov - 17.07.18, 09:20:
>> On 16.07.2018 23:58, Wolf wrote:
>>> Greetings,
>>> I would like to ask what is a healthy amount of free space to
>>> keep on each device for btrfs to be happy?
>>>
>>> This is how my disk array currently looks like
>>>
>>> [root@dennas ~]# btrfs fi usage /raid
>>>
>>> Overall:
>>>     Device size:          29.11TiB
>>>     Device allocated:     21.26TiB
>>>     Device unallocated:    7.85TiB
>>>     Device missing:          0.00B
>>>     Used:                 21.18TiB
>>>     Free (estimated):      3.96TiB (min: 3.96TiB)
>>>     Data ratio:               2.00
>>>     Metadata ratio:           2.00
>>>     Global reserve:      512.00MiB (used: 0.00B)
> […]
>>> Btrfs does quite good job of evenly using space on all devices.
>>> Now, how low can I let that go? In other words, with how much
>>> free/unallocated space remaining should I consider adding a new
>>> disk?
>>
>> Btrfs will start running into problems when you run out of
>> unallocated space. So the best advice is to monitor your device
>> unallocated; once it gets really low - like 2-3 GB - I would suggest
>> you run balance, which will try to free up unallocated space by
>> rewriting data more compactly into sparsely populated block groups.
>> If after running balance you haven't really freed any space then you
>> should consider adding a new drive and running balance to even out
>> the spread of data/metadata.
>
> What are these issues exactly?

For example if you have plenty of data space but your metadata is full then you will be getting ENOSPC.
> I have
>
> % btrfs fi us -T /home
>
> Overall:
>     Device size:         340.00GiB
>     Device allocated:    340.00GiB
>     Device unallocated:    2.00MiB
>     Device missing:          0.00B
>     Used:                308.37GiB
>     Free (estimated):     14.65GiB (min: 14.65GiB)
>     Data ratio:               2.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
>                            Data       Metadata  System
> Id Path                    RAID1      RAID1     RAID1     Unallocated
> -- ---------------------- ---------- --------- --------- -----------
>  1 /dev/mapper/msata-home  165.89GiB   4.08GiB  32.00MiB     1.00MiB
>  2 /dev/mapper/sata-home   165.89GiB   4.08GiB  32.00MiB     1.00MiB
> -- ---------------------- ---------- --------- --------- -----------
>    Total                   165.89GiB   4.08GiB  32.00MiB     2.00MiB
>    Used                    151.24GiB   2.95GiB  48.00KiB

You already have only 33% of your metadata full, so if your workload turned out to actually be making more metadata-heavy changes, i.e. snapshots, you could exhaust this and get ENOSPC, despite having around 14 GiB of free data space. Furthermore this data space is spread around multiple data chunks; depending on how populated they are, a balance could be able to free up unallocated space which later could be re-purposed for metadata (again, depending on what you are doing).

> on a RAID-1 filesystem one, part of the time two Plasma desktops +
> KDEPIM and Akonadi + Baloo desktop search + you name it write to like
> mad.
>
> Thanks,
Re: Healthy amount of free space?
Hi Nikolay.

Nikolay Borisov - 17.07.18, 09:20:
> On 16.07.2018 23:58, Wolf wrote:
> > Greetings,
> > I would like to ask what is a healthy amount of free space to
> > keep on each device for btrfs to be happy?
> >
> > This is how my disk array currently looks like
> >
> > [root@dennas ~]# btrfs fi usage /raid
> >
> > Overall:
> >     Device size:          29.11TiB
> >     Device allocated:     21.26TiB
> >     Device unallocated:    7.85TiB
> >     Device missing:          0.00B
> >     Used:                 21.18TiB
> >     Free (estimated):      3.96TiB (min: 3.96TiB)
> >     Data ratio:               2.00
> >     Metadata ratio:           2.00
> >     Global reserve:      512.00MiB (used: 0.00B)

[…]

> > Btrfs does quite good job of evenly using space on all devices.
> > Now, how low can I let that go? In other words, with how much
> > free/unallocated space remaining should I consider adding a new
> > disk?
>
> Btrfs will start running into problems when you run out of
> unallocated space. So the best advice is to monitor your device
> unallocated; once it gets really low - like 2-3 GB - I would suggest
> you run balance, which will try to free up unallocated space by
> rewriting data more compactly into sparsely populated block groups.
> If after running balance you haven't really freed any space then you
> should consider adding a new drive and running balance to even out
> the spread of data/metadata.

What are these issues exactly?
I have

% btrfs fi us -T /home
Overall:
    Device size:         340.00GiB
    Device allocated:    340.00GiB
    Device unallocated:    2.00MiB
    Device missing:          0.00B
    Used:                308.37GiB
    Free (estimated):     14.65GiB (min: 14.65GiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB (used: 0.00B)

                           Data       Metadata  System
Id Path                    RAID1      RAID1     RAID1     Unallocated
-- ---------------------- ---------- --------- --------- -----------
 1 /dev/mapper/msata-home  165.89GiB   4.08GiB  32.00MiB     1.00MiB
 2 /dev/mapper/sata-home   165.89GiB   4.08GiB  32.00MiB     1.00MiB
-- ---------------------- ---------- --------- --------- -----------
   Total                   165.89GiB   4.08GiB  32.00MiB     2.00MiB
   Used                    151.24GiB   2.95GiB  48.00KiB

on a RAID-1 filesystem which one, part of the time two, Plasma desktops + KDEPIM and Akonadi + Baloo desktop search + you name it write to like mad.

Since kernel 4.5 or 4.6 this simply works. Before that, BTRFS sometimes crawled to a halt searching for free blocks, and I had to switch off the laptop uncleanly. When that happened, a balance helped for a while. But since 4.5 or 4.6 this has not happened anymore.

I found that with SLES 12 SP 3 or so there is btrfsmaintenance running a balance weekly, which created an issue on our Proxmox + Ceph on Intel NUC based open-source demo lab. This is for sure no recommended configuration for Ceph, and Ceph is quite slow on these 2.5 inch harddisks and a 1 GBit network link, despite the (albeit somewhat minimal) 5 GiB of M.2 SSD caching. What happened is that the VM crawled to a halt and the kernel gave "task hung for more than 120 seconds" messages. The VM was basically unusable during the balance. Sure, that should not happen with a "proper" setup; it also did not happen without the automatic balance.

Also, what would happen in a hypervisor setup with several thousands of VMs on BTRFS, when several hundred of them decide to start a balance at a similar time? It could probably bring the I/O system below to a halt, as many enterprise storage systems are designed to sustain burst I/O loads, but not maximum utilization over an extended period of time.
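The thundering-herd concern at the end (many guests balancing at once) is commonly mitigated by adding random jitter before scheduled maintenance, so fleet-wide cron jobs spread out instead of starting together. A small sketch; the helper name and the 6-hour window are made-up illustrations, not part of btrfsmaintenance:

```shell
#!/bin/sh
# Pick a random delay in [0, max) seconds. awk's srand() seeds from
# the clock, which is good enough to spread cron starts across hosts;
# plain POSIX sh has no $RANDOM, hence awk.
jitter() {
    max=$1
    awk -v max="$max" 'BEGIN { srand(); print int(rand() * max) }'
}

# A weekly maintenance job might then look like:
#   sleep "$(jitter 21600)"          # spread starts over a 6-hour window
#   btrfs balance start -dusage=25 /srv
jitter 21600
```

This does not make an individual balance any cheaper, but it keeps hundreds of guests from hammering the shared storage backend in the same hour.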
I am really wondering what to recommend in my Linux performance tuning and analysis courses. On my own laptop I do not do regular balances so far, due to my thinking: if it is not broken, do not fix it.

My personal opinion here also is: if the filesystem degrades so much that it becomes unusable without regular maintenance from user space, the filesystem needs to be fixed. Ideally I would not have to worry about whether to regularly balance a BTRFS or not. In other words: I should not have to attend a performance analysis and tuning course in order to use a computer with a BTRFS filesystem.

Thanks,
--
Martin
Re: Healthy amount of free space?
On 16.07.2018 23:58, Wolf wrote:
> Greetings,
> I would like to ask what is a healthy amount of free space to keep on
> each device for btrfs to be happy?
>
> This is how my disk array currently looks like
>
> [root@dennas ~]# btrfs fi usage /raid
> Overall:
>     Device size:          29.11TiB
>     Device allocated:     21.26TiB
>     Device unallocated:    7.85TiB
>     Device missing:          0.00B
>     Used:                 21.18TiB
>     Free (estimated):      3.96TiB (min: 3.96TiB)
>     Data ratio:               2.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
> Data,RAID1: Size:10.61TiB, Used:10.58TiB
>     /dev/mapper/data1     1.75TiB
>     /dev/mapper/data2     1.75TiB
>     /dev/mapper/data3   856.00GiB
>     /dev/mapper/data4   856.00GiB
>     /dev/mapper/data5     1.75TiB
>     /dev/mapper/data6     1.75TiB
>     /dev/mapper/data7     6.29TiB
>     /dev/mapper/data8     6.29TiB
>
> Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
>     /dev/mapper/data1     2.00GiB
>     /dev/mapper/data2     3.00GiB
>     /dev/mapper/data3     1.00GiB
>     /dev/mapper/data4     1.00GiB
>     /dev/mapper/data5     3.00GiB
>     /dev/mapper/data6     1.00GiB
>     /dev/mapper/data7     9.00GiB
>     /dev/mapper/data8    10.00GiB
>
> System,RAID1: Size:64.00MiB, Used:1.50MiB
>     /dev/mapper/data2    32.00MiB
>     /dev/mapper/data6    32.00MiB
>     /dev/mapper/data7    32.00MiB
>     /dev/mapper/data8    32.00MiB
>
> Unallocated:
>     /dev/mapper/data1  1004.52GiB
>     /dev/mapper/data2  1004.49GiB
>     /dev/mapper/data3  1006.01GiB
>     /dev/mapper/data4  1006.01GiB
>     /dev/mapper/data5  1004.52GiB
>     /dev/mapper/data6  1004.49GiB
>     /dev/mapper/data7  1005.00GiB
>     /dev/mapper/data8  1005.00GiB
>
> Btrfs does quite good job of evenly using space on all devices. Now,
> how low can I let that go? In other words, with how much
> free/unallocated space remaining should I consider adding a new disk?

Btrfs will start running into problems when you run out of unallocated space. So the best advice is to monitor your device unallocated; once it gets really low - like 2-3 GB - I would suggest you run balance, which will try to free up unallocated space by rewriting data more compactly into sparsely populated block groups.
If after running balance you haven't really freed any space then you should consider adding a new drive and running balance to even out the spread of data/metadata.

> Thanks for advice :)
>
> W.
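The balance-then-add-a-drive workflow above can be made less disruptive with usage filters, so each pass only rewrites mostly-empty block groups. A sketch of one common approach (the 10/25/50 percentage steps are a convention of mine, not from this thread; the commands are printed rather than executed so they can be reviewed first):

```shell
#!/bin/sh
# Print an incremental balance plan: low usage filters first, so the
# cheap wins come early and each pass touches a bounded amount of data.
balance_plan() {
    mnt=$1
    for pct in 10 25 50; do
        # -dusage=N / -musage=N restrict the balance to data/metadata
        # block groups that are at most N percent full.
        echo "btrfs balance start -dusage=$pct -musage=$pct $mnt"
    done
    # If balance alone frees no unallocated space, add a device and run
    # an unfiltered balance to spread existing chunks across all members.
    echo "btrfs device add <new-device> $mnt && btrfs balance start $mnt"
}

balance_plan /raid
```

`<new-device>` is a placeholder to fill in by hand; pipe the reviewed output through `sh` as root to actually run it.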
Healthy amount of free space?
Greetings,
I would like to ask what is a healthy amount of free space to keep on each device for btrfs to be happy?

This is how my disk array currently looks like

[root@dennas ~]# btrfs fi usage /raid
Overall:
    Device size:          29.11TiB
    Device allocated:     21.26TiB
    Device unallocated:    7.85TiB
    Device missing:          0.00B
    Used:                 21.18TiB
    Free (estimated):      3.96TiB (min: 3.96TiB)
    Data ratio:               2.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB (used: 0.00B)

Data,RAID1: Size:10.61TiB, Used:10.58TiB
    /dev/mapper/data1     1.75TiB
    /dev/mapper/data2     1.75TiB
    /dev/mapper/data3   856.00GiB
    /dev/mapper/data4   856.00GiB
    /dev/mapper/data5     1.75TiB
    /dev/mapper/data6     1.75TiB
    /dev/mapper/data7     6.29TiB
    /dev/mapper/data8     6.29TiB

Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
    /dev/mapper/data1     2.00GiB
    /dev/mapper/data2     3.00GiB
    /dev/mapper/data3     1.00GiB
    /dev/mapper/data4     1.00GiB
    /dev/mapper/data5     3.00GiB
    /dev/mapper/data6     1.00GiB
    /dev/mapper/data7     9.00GiB
    /dev/mapper/data8    10.00GiB

System,RAID1: Size:64.00MiB, Used:1.50MiB
    /dev/mapper/data2    32.00MiB
    /dev/mapper/data6    32.00MiB
    /dev/mapper/data7    32.00MiB
    /dev/mapper/data8    32.00MiB

Unallocated:
    /dev/mapper/data1  1004.52GiB
    /dev/mapper/data2  1004.49GiB
    /dev/mapper/data3  1006.01GiB
    /dev/mapper/data4  1006.01GiB
    /dev/mapper/data5  1004.52GiB
    /dev/mapper/data6  1004.49GiB
    /dev/mapper/data7  1005.00GiB
    /dev/mapper/data8  1005.00GiB

Btrfs does quite good job of evenly using space on all devices. Now, how low can I let that go? In other words, with how much free/unallocated space remaining should I consider adding a new disk?

Thanks for advice :)

W.
--
There are only two hard things in Computer Science: cache invalidation, naming things and off-by-one errors.