Re: exclusive subvolume space missing

2017-12-15 Thread Duncan
Tomasz Pala posted on Fri, 15 Dec 2017 09:22:14 +0100 as excerpted:

> I wonder how this one db-library behaves:
> 
> $  find . -name \*.sqlite | xargs ls -gGhS | head -n1
> -rw-r--r-- 1  15M 2017-12-08 12:14
> ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
> 
> $  ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite |
> head -n1
> File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite
> has 128 extents:
> 
> 
> At least every $HOME/{.{,c}cache,tmp} should be +C...

Many admins will put tmp, and sometimes cache or selected parts of it, on 
tmpfs anyway... thereby both automatically clearing it on reboot, and 
allowing enforced size control as necessary.
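For reference, a minimal sketch of such an fstab entry (the mount point, size
limit and options here are only examples):

tmpfs  /tmp  tmpfs  noatime,nosuid,nodev,size=2G,mode=1777  0  0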

>> And if possible, use nocow for this file.
> 
> Actually, this should be officially advised to use +C for entire /var
> tree and every other tree that might be exposed for hostile write
> patterns, like /home or /tmp (if held on btrfs).
> 
> I'd say, that from security point of view the nocow should be default,
> unless specified for mount or specific file... Currently, if I mount
> with nocow, there is no way to whitelist trusted users or secure
> location, and until btrfs-specific options could be handled per
> subvolume, there is really no alternative.

Nocow disables many of the reasons people run btrfs in the first place, 
including checksumming and damage detection, with auto-repair from other 
copies where available (primarily raid1/10 and dup modes), as well as 
btrfs transparent compression, for users using that.  Additionally, 
snapshotting, another feature people use btrfs for, turns nocow into cow1 
(cow the first time a block is written after a snapshot), since 
snapshotting locks down the previous extent in order to maintain the 
snapshotted reference.

And given that any user can create a snapshot any time they want (even if 
you lock down the btrfs executable, if they're malevolent users and not 
locked to only running specifically whitelisted executables, they can 
always get a copy of the executable elsewhere), and /home or individual 
user subvols may well be auto-snapshotted already, setting nocow isn't 
likely to be of much security value at all.

So nocow is, as one regular wrote, most useful for "this really should go 
on something other than btrfs, but I'm too lazy to set it up that way and 
I'm already on btrfs, so the nocow band-aid is all I got.  And yes, I try 
using my screwdriver as a hammer too, because that's what I have there 
too!"

In that sort of case, just use some other filesystem more appropriate to 
the use-case, and you won't have to worry about btrfs issues, cow-
triggered or otherwise, in the first place.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: exclusive subvolume space missing

2017-12-15 Thread Tomasz Pala
On Tue, Dec 12, 2017 at 08:50:15 +0800, Qu Wenruo wrote:

> Even without snapshot, things can easily go crazy.
> 
> This will write 128M file (max btrfs file extent size) and write it to disk.
> # xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file
> 
> Then, overwrite the 1~128M range.
> # xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
> 
> Guess your real disk usage, it's 127M + 128M = 255M.
> 
> The point here, if there is any reference of a file extent, the whole
> extent won't be freed, even it's only 1M of a 128M extent.

OK, /this/ is scary. I guess nocow prevents this behaviour?
I have chattr'ed +C the file that was eating my space and it ceased.

> Are you pre-allocating the file before write using tools like dd?

I have no idea; this could be checked in the source of
http://pam-abl.sourceforge.net/
But this is plain Berkeley DB (5.3 in my case)... which scares me even
more:

$  rpm -q --what-requires 'libdb-5.2.so()(64bit)' 'libdb-5.3.so()(64bit)' | wc -l
14

#  ipoldek desc -B db5.3
Package:db5.3-5.3.28.0-4.x86_64
Required(by):   apache1-base, apache1-mod_ssl, apr-util-dbm-db,
bogofilter, 
c-icap, c-icap-srv_url_check, courier-authlib, 
courier-authlib-authuserdb, courier-imap, courier-imap-common, 
cyrus-imapd, cyrus-imapd-libs, cyrus-sasl, cyrus-sasl-sasldb, 
db5.3-devel, db5.3-utils, dnshistory, dsniff, evolution-data-server, 
evolution-data-server-libs, exim, gda-db, ggz-server, 
heimdal-libs-common, hotkeys, inn, inn-libs, isync, jabberd, jigdo, 
jigdo-gtk, jnettop, libetpan, libgda3, libgda3-devel, libhome, libqxt, 
libsolv, lizardfs-master, maildrop, moc, mutt, netatalk, nss_updatedb, 
ocaml-dbm, opensips, opensmtpd, pam-pam_abl, pam-pam_ccreds, perl-BDB, 
perl-BerkeleyDB, perl-BerkeleyDB, perl-DB_File, perl-URPM, 
perl-cyrus-imapd, php4-dba, php52-dba, php53-dba, php54-dba, php55-dba, 
php56-dba, php70-dba, php70-dba, php71-dba, php71-dba, php72-dba, 
php72-dba, postfix, python-bsddb, python-modules, python3-bsddb3, 
redland, ruby-modules, sendmail, squid-session_acl, 
squid-time_quota_acl, squidGuard, subversion-libs, swish-e, tomoe-svn, 
webalizer-base, wwwcount

OK, not many user applications here, as they mostly use sqlite.
I wonder how this one db-library behaves:

$  find . -name \*.sqlite | xargs ls -gGhS | head -n1
-rw-r--r-- 1  15M 2017-12-08 12:14 ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite

$  ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:


At least every $HOME/{.{,c}cache,tmp} should be +C...
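For reference, a minimal sketch of doing that by hand (paths are the ones named
above; note that +C on a directory only affects files created afterwards, so
existing files keep their old extents):

$ chattr +C ~/.cache ~/.ccache ~/tmp
$ lsattr -d ~/.cache      # should now show the 'C' attribute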

> And if possible, use nocow for this file.

Actually, it should be officially advised to use +C for the entire /var tree and
every other tree that might be exposed to hostile write patterns, like /home
or /tmp (if held on btrfs).

I'd say that, from a security point of view, nocow should be the default,
unless overridden per mount or per specific file... Currently, if I mount
with nocow, there is no way to whitelist trusted users or secure
locations, and until btrfs-specific options can be handled per
subvolume, there is really no alternative.


-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-11 Thread Qu Wenruo


On 2017-12-11 19:40, Tomasz Pala wrote:
> On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:
> 
>>> I could debug something before I'll clean this up, is there anything you
>>> want to me to check/know about the files?
>>
>> fiemap result along with btrfs dump-tree -t2 result.
> 
> fiemap attached, but dump-tree requires unmounted fs, doesn't it?

It doesn't.

You can dump your tree with fs mounted, although it may affect the accuracy.

The good news is that, in your case, the extent tree isn't really needed, as
there is no shared extent here.

> 
>>> - I've lost 3.6 GB during the night with reasonably small
>>> amount of writes, I guess it might be possible to trash entire
>>> filesystem within 10 minutes if doing this on purpose.
>>
>> That's a little complex.
>> To get into such situation, snapshot must be used and one must know
>> which file extent is shared and how it's shared.
> 
> Hostile user might assume that any of his own files old enough were
> being snapshotted. Unless snapshots are not used at all...
> 
> The 'obvious' solution would be for quotas to limit the data size including
> extents lost due to fragmentation, but this is not the real solution as
> users don't care about fragmentation. So we're back to square one.
> 
>> But as I mentioned, XFS supports reflink, which means file extent can be
>> shared between several inodes.
>>
>> From the message I got from XFS guys, they free any unused space of a
>> file extent, so it should handle it quite well.
> 
> Forgive my ignorance, as I'm not familiar with details, but isn't the
> problem 'solvable' by reusing space freed from the same extent for any
> single (i.e. the same) inode?

Not that easy.

The extent tree design makes it a little tricky to do that,
so btrfs uses the current extent booking, which is the laziest way to delete extents.

> This would certainly increase
> fragmentation of a file, but reduce extent usage significially.
> 
> 
> Still, I don't comprehend the cause of my situation. If - after doing a
> defrag (after snapshotting whatever there were already trashed) btrfs
> decides to allocate new extents for the file, why doesn't is use them
> efficiently as long as I'm not doing snapshots anymore?

Even without snapshot, things can easily go crazy.

This will write a 128M file (the max btrfs file extent size) and sync it to disk.
# xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file

Then, overwrite the 1~128M range.
# xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file

Guess your real disk usage: it's 127M + 128M = 255M.

The point here is that as long as there is any reference to a file extent, the whole
extent won't be freed, even if only 1M of a 128M extent is still referenced.

Defrag, on the other hand, will basically read out the whole 128M file and rewrite it.
Basically the same as:

# dd if=/mnt/btrfs/file of=/mnt/btrfs/file2
# rm /mnt/btrfs/file

In this case, it will create a new 128M file extent, while the old 128M + 127M
extents lose all their references, so they are freed.
As a result, it frees 127M.
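If someone wants to watch this happen, a rough sketch (filefrag comes from
e2fsprogs; the exact numbers will vary):

$ sync; btrfs fi df /mnt/btrfs       # note the Data "used" value
$ xfs_io -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
$ btrfs fi df /mnt/btrfs             # Data "used" should grow by roughly 127M, while the file size is unchanged
$ filefrag -v /mnt/btrfs/file        # shows which extents the file now points at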



> I'm attaching the second fiemap, the same file from last snapshot taken.
> According to this one-liner:
> 
> for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done
> 
> current file doesn't share any physical locations with the old one.
> But still grows, so what does this situation have with snapshots anyway?

In your fiemap, all your file extent is exclusive, so not really related
to snapshot.

But the file is very fragmented.
Most of them is 4K sized, several 8K sized.
And the final extent is 220K sized.

Are you pre-allocating the file before writing, using tools like dd?
If so, just as I explained above, it will at least *DOUBLE* the on-disk
space usage and cause tons of fragments.

It's recommended to use fallocate to preallocate the file instead of things
like dd.
(A preallocated range acts much like nocow, although only for the first write.)
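As a concrete sketch of the two preallocation styles (the file path is just an
example):

$ fallocate -l 128M /mnt/btrfs/file                    # reserves the range; the first write fills it in place
$ dd if=/dev/zero of=/mnt/btrfs/file bs=1M count=128   # writes a real 128M extent, so even the first real write CoWs new extents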

And if possible, use nocow for this file.


> 
> Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
> occupied per extent. How is that possible?

Small appending writes with frequent fsync, or small random DIO.

Avoid such patterns or at least use nocow.
Also avoid using dd to preallocate the file.

Another solution is autodefrag, but I doubt its effect.
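For completeness, autodefrag is a mount option, so it would be enabled with
something like (device and other options are only examples):

/dev/sda2  /  btrfs  noatime,autodefrag  0  0      # in /etc/fstab
# mount -o remount,autodefrag /                    # or at runtime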

Thanks,
Qu

> 





Re: exclusive subvolume space missing

2017-12-11 Thread Tomasz Pala
On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:

>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.

fiemap attached, but dump-tree requires unmounted fs, doesn't it?

>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.

Hostile user might assume that any of his own files old enough were
being snapshotted. Unless snapshots are not used at all...

The 'obvious' solution would be for quotas to limit the data size including
extents lost due to fragmentation, but this is not the real solution as
users don't care about fragmentation. So we're back to square one.

> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.

Forgive my ignorance, as I'm not familiar with details, but isn't the
problem 'solvable' by reusing space freed from the same extent for any
single (i.e. the same) inode? This would certainly increase
fragmentation of a file, but reduce extent usage significantly.


Still, I don't comprehend the cause of my situation. If - after doing a
defrag (after snapshotting whatever was already trashed) - btrfs
decides to allocate new extents for the file, why doesn't it use them
efficiently as long as I'm not doing snapshots anymore?

I'm attaching the second fiemap, the same file from last snapshot taken.
According to this one-liner:

for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done

the current file doesn't share any physical locations with the old one.
But it still grows, so what does this situation have to do with snapshots anyway?

Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
occupied per extent. How is that possible?

-- 
Tomasz Pala 
File log.14 has 933 extents:
#   Logical  Physical Length   Flags
0:   00297a001000 1000 
1:  1000 00297aa01000 1000 
2:  2000 002979ffe000 1000 
3:  3000 00297d1fc000 1000 
4:  4000 00297e5f7000 1000 
5:  5000 00297d1fe000 1000 
6:  6000 00297c7f4000 1000 
7:  7000 00297dbf9000 1000 
8:  8000 00297eff3000 1000 
9:  9000 0029821c7000 1000 
10: a000 002982bbf000 1000 
11: b000 0029803e 1000 
12: c000 00297b40 1000 
13: d000 002979601000 1000 
14: e000 002980dd5000 1000 
15: f000 0029821be000 1000 
16: 0001 00298715f000 1000 
17: 00011000 002985d71000 1000 
18: 00012000 00298537f000 1000 
19: 00013000 00298676 1000 
20: 00014000 00298498d000 1000 
21: 00015000 0029821b4000 1000 
22: 00016000 0029817c7000 1000 
23: 00017000 00298a2fa000 1000 
24: 00018000 002988f1f000 1000 
25: 00019000 00298d47f000 1000 
26: 0001a000 00298c0af000 1000 
27: 0001b000 00298a2ee000 1000 
28: 0001c000 00298a2eb000 1000 
29: 0001d000 0029905f2000 1000 
30: 0001e000 00298f22a000 1000 
31: 0001f000 00298de66000 1000 
32: 0002 00298ace3000 1000 
33: 00021000 00298a2e9000 1000 
34: 00022000 00298a2e7000 1000 
35: 00023000 00298b6c3000 1000 
36: 00024000 002990fd5000 1000 
37: 00025000 002992d6c000 1000 
38: 00026000 0029954db000 1000 
39: 00027000 002993747000 1000 
40: 00028000 002992d62000 1000 
41: 00029000 002992389000 
(remaining extents omitted - the attached 933-extent listing is truncated in the archive)

Re: exclusive subvolume space missing

2017-12-10 Thread Qu Wenruo


On 2017-12-11 07:44, Qu Wenruo wrote:
> 
> 
> On 2017-12-10 19:27, Tomasz Pala wrote:
>> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
>>
 1. is there any switch resulting in 'defrag only exclusive data'?
>>>
>>> IIRC, no.
>>
>> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
>> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
>> defrag. After defragging files were not snapshotted again and I've lost
>> 3.6 GB again, so I got this fully reproducible.
>> There are 7 files, one of which is 99% of the space (10 MB). None of
>> them has nocow set, so they're riding all-btrfs.
>>
>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.
> 
> Both output has nothing related to file name/dir name, but only some
> "meaningless" bytenr, so it should be completely OK to share them.
> 
>>
>> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
>> condition which could be triggered by malicious user during a few hours
>> or faster
> 
> You won't want to hear this:
> The biggest ratio in theory is, 128M / 4K = 32768.
> 
>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.
> 
> But yes, it's possible.
> 
> While on the other hand, XFS, which also supports reflink, handles it
> quite well, so I'm wondering if it's possible for btrfs to follow its
> behavior.
> 
>>
 3. I guess there aren't, so how could I accomplish my target, i.e.
reclaiming space that was lost due to fragmentation, without breaking
spanshoted CoW where it would be not only pointless, but actually 
 harmful?
>>>
>>> What about using old kernel, like v4.13?
>>
>> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
>> will be pushing towards 4.14.
> 
> No, I really mean v4.13.

My fault, it is v3.13.

What a stupid error...

> 
> From btrfs(5):
> ---
>Warning
>Defragmenting with Linux kernel versions < 3.9 or ≥
> 3.14-rc2 as
>well as with Linux stable kernel versions ≥ 3.10.31, ≥
> 3.12.12
>or ≥ 3.13.4 will break up the ref-links of CoW data (for
>example files copied with cp --reflink, snapshots or
>de-duplicated data). This may cause considerable increase of
>space usage depending on the broken up ref-links.
> ---
> 
>>
 4. How can I prevent this from happening again? All the files, that are
written constantly (stats collector here, PostgreSQL database and
logs on other machines), are marked with nocow (+C); maybe some new
attribute to mark file as autodefrag? +t?
>>>
>>> Unfortunately, nocow only works if there is no other subvolume/inode
>>> referring to it.
>>
>> This shouldn't be my case anymore after defrag (==breaking links).
>> I guess no easy way to check refcounts of the blocks?
> 
> No easy way unfortunately.
> It's either time consuming (used by qgroup) or complex (manually tree
> search and do the backref walk by yourself)
> 
>>
>>> But in my understanding, btrfs is not suitable for such conflicting
>>> situation, where you want to have snapshots of frequent partial updates.
>>>
>>> IIRC, btrfs is better for use case where either update is less frequent,
>>> or update is replacing the whole file, not just part of it.
>>>
>>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
>>
>> That is something coherent with my conclusions after 2 years on btrfs,
>> however I didn't expect a single file to eat 1000 times more space than it
>> should...
>>
>>
>> I wonder how many other filesystems were trashed like this - I'm short
>> of ~10 GB on other system, many other users might be affected by that
>> (telling the Internet stories about btrfs running out of space).
> 
> Firstly, no other filesystem supports snapshot.
> So it's pretty hard to get a baseline.
> 
> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.
> 
> But it's quite a hard work to achieve in btrfs, needs years development
> at least.
> 
>>
>> It is not a problem that I need to defrag a file, the problem is I don't 
>> know:
>> 1. whether I need to defrag,
>> 2. *what* should I defrag
>> nor have a tool that would defrag smart - only the exclusive data or, in
>> general, the block that are worth defragging if space released from
>> extents is greater than space lost on inter-snapshot duplication.

Re: exclusive subvolume space missing

2017-12-10 Thread Qu Wenruo


On 2017-12-10 19:27, Tomasz Pala wrote:
> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:
> 
>>> 1. is there any switch resulting in 'defrag only exclusive data'?
>>
>> IIRC, no.
> 
> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
> defrag. After defragging files were not snapshotted again and I've lost
> 3.6 GB again, so I got this fully reproducible.
> There are 7 files, one of which is 99% of the space (10 MB). None of
> them has nocow set, so they're riding all-btrfs.
> 
> I could debug something before I'll clean this up, is there anything you
> want to me to check/know about the files?

fiemap result along with btrfs dump-tree -t2 result.

Both outputs contain nothing related to file names/dir names, only some
"meaningless" bytenrs, so it should be completely OK to share them.

> 
> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS
> condition which could be triggered by malicious user during a few hours
> or faster

You won't want to hear this:
The biggest ratio in theory is 128M / 4K = 32768.

> - I've lost 3.6 GB during the night with reasonably small
> amount of writes, I guess it might be possible to trash entire
> filesystem within 10 minutes if doing this on purpose.

That's a little complex.
To get into such a situation, snapshots must be used and one must know
which file extent is shared and how it's shared.

But yes, it's possible.

While on the other hand, XFS, which also supports reflink, handles it
quite well, so I'm wondering if it's possible for btrfs to follow its
behavior.

> 
>>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>>reclaiming space that was lost due to fragmentation, without breaking
>>>spanshoted CoW where it would be not only pointless, but actually 
>>> harmful?
>>
>> What about using old kernel, like v4.13?
> 
> Unfortunately (I guess you had 3.13 on mind), I need the new ones and
> will be pushing towards 4.14.

No, I really mean v4.13.

From btrfs(5):
---
   Warning
   Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as
   well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12
   or ≥ 3.13.4 will break up the ref-links of CoW data (for
   example files copied with cp --reflink, snapshots or
   de-duplicated data). This may cause considerable increase of
   space usage depending on the broken up ref-links.
---

> 
>>> 4. How can I prevent this from happening again? All the files, that are
>>>written constantly (stats collector here, PostgreSQL database and
>>>logs on other machines), are marked with nocow (+C); maybe some new
>>>attribute to mark file as autodefrag? +t?
>>
>> Unfortunately, nocow only works if there is no other subvolume/inode
>> referring to it.
> 
> This shouldn't be my case anymore after defrag (==breaking links).
> I guess no easy way to check refcounts of the blocks?

No easy way, unfortunately.
It's either time-consuming (what qgroup does) or complex (manually doing the
tree search and the backref walk by yourself).
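For the manual route, a rough sketch of such a backref walk (assuming a
btrfs-progs build with inspect-internal; the address below is just an example
taken from the fiemap earlier in the thread):

$ filefrag -v ./log.14 | head               # on btrfs the "physical" offsets are logical addresses, in 4 KiB blocks
$ btrfs inspect-internal logical-resolve $(( 0x29821c7 * 4096 )) /
(prints every file, in every subvolume/snapshot, that still references the
extent at that logical byte address)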

> 
>> But in my understanding, btrfs is not suitable for such conflicting
>> situation, where you want to have snapshots of frequent partial updates.
>>
>> IIRC, btrfs is better for use case where either update is less frequent,
>> or update is replacing the whole file, not just part of it.
>>
>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.
> 
> That is something coherent with my conclusions after 2 years on btrfs,
> however I didn't expect a single file to eat 1000 times more space than it
> should...
> 
> 
> I wonder how many other filesystems were trashed like this - I'm short
> of ~10 GB on other system, many other users might be affected by that
> (telling the Internet stories about btrfs running out of space).

Firstly, no other filesystem supports snapshot.
So it's pretty hard to get a baseline.

But as I mentioned, XFS supports reflink, which means file extent can be
shared between several inodes.

From the message I got from XFS guys, they free any unused space of a
file extent, so it should handle it quite well.

But it's quite hard to achieve in btrfs; it needs years of development
at least.

> 
> It is not a problem that I need to defrag a file, the problem is I don't know:
> 1. whether I need to defrag,
> 2. *what* should I defrag
> nor have a tool that would defrag smart - only the exclusive data or, in
> general, the block that are worth defragging if space released from
> extents is greater than space lost on inter-snapshot duplication.
> 
> I can't just defrag entire filesystem since it breaks links with snapshots.
> This change was a real deal-breaker here...

IIRC it's better to add an option to make defrag snapshot-aware.
(Don't break snapshot sharing but only to 

Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

#  df
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda2        64G   61G  2.8G  96% /

#  btrfs fi du .
 Total   Exclusive  Set shared  Filename
 0.00B   0.00B   -  ./1/__db.register
  10.00MiB10.00MiB   -  ./1/log.01
  16.00KiB   0.00B   -  ./1/hosts.db
  16.00KiB   0.00B   -  ./1/users.db
 168.00KiB   0.00B   -  ./1/__db.001
  40.00KiB   0.00B   -  ./1/__db.002
  44.00KiB   0.00B   -  ./1/__db.003
  10.28MiB10.00MiB   -  ./1
 0.00B   0.00B   -  ./__db.register
  16.00KiB16.00KiB   -  ./hosts.db
  16.00KiB16.00KiB   -  ./users.db
  10.00MiB10.00MiB   -  ./log.13
 0.00B   0.00B   -  ./__db.001
 0.00B   0.00B   -  ./__db.002
 0.00B   0.00B   -  ./__db.003
  20.31MiB20.03MiB   284.00KiB  .

#  btrfs fi defragment log.13 
#  df
/dev/sda2        64G   54G  9.4G  86% /


6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
> 
> IIRC, no.

I have found a directory - pam_abl databases - which occupies 10 MB (yes,
TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag. After defragging, the files were not snapshotted again and I've lost
3.6 GB again, so I have this fully reproducible.
There are 7 files, one of which is 99% of the space (10 MB). None of
them has nocow set, so they're riding all-btrfs.

I could debug something before I clean this up - is there anything you
want me to check/know about the files?

The fragmentation impact is HUGE here; a 1000:1 ratio is almost a DoS
condition which could be triggered by a malicious user within a few hours
or faster - I've lost 3.6 GB during the night with a reasonably small
amount of writes, and I guess it might be possible to trash the entire
filesystem within 10 minutes if doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>reclaiming space that was lost due to fragmentation, without breaking
>>spanshoted CoW where it would be not only pointless, but actually harmful?
> 
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 in mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files, that are
>>written constantly (stats collector here, PostgreSQL database and
>>logs on other machines), are marked with nocow (+C); maybe some new
>>attribute to mark file as autodefrag? +t?
> 
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be my case anymore after defrag (== breaking links).
I guess there's no easy way to check the refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
> 
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
> 
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.

That is coherent with my conclusions after 2 years on btrfs; however,
I didn't expect a single file to eat 1000 times more space than it
should...


I wonder how many other filesystems were trashed like this - I'm short
of ~10 GB on another system; many other users might be affected by this
(telling the Internet stories about btrfs running out of space).

It is not a problem that I need to defrag a file; the problem is I don't know:
1. whether I need to defrag,
2. *what* I should defrag,
nor do I have a tool that would defrag smartly - only the exclusive data or, in
general, only the blocks that are worth defragging, i.e. where the space released
from extents is greater than the space lost to inter-snapshot duplication.

I can't just defrag the entire filesystem since it breaks links with snapshots.
This change was a real deal-breaker here...

Any way to feed the deduplication code with snapshots maybe? There are
directories and files in the same layout; this could be fast-tracked to
check and deduplicate.
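There is out-of-band deduplication tooling that can already be pointed at a
subvolume and its snapshots, e.g. (a sketch, assuming the duperemove tool is
installed; the paths are made up):

$ duperemove -dr /.snapshots/daily-2017-12-09 /.snapshots/daily-2017-12-10
(-d actually submits the dedupe ioctl, -r recurses; identical blocks in the two
trees end up sharing extents again)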

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 03, 2017 at 01:45:45 +, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
faster than rsync or other old-fashioned methods (see the sketch after this
list). This obviously requires the CoW-once feature;

- caveat: for btrfs-killing usage patterns all the snapshots but the
  last one need to be removed;


2. block-level checksums with RAID1 awareness - in contrast to mdadm
RAIDx, which returns a random data copy from the underlying devices, this is
much less susceptible to bit rot;

- caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
  instead of real balancing), no N-way mirroring nor write-mostly flag.


3. compression - there is no real alternative, however:

- caveat: requires CoW enabled, which makes it not suitable for
  ...systemd journals, which compress with a great ratio (ca. 1:10),
  nor for various databases, as they will be nocowed sooner or later;


4. storage pools you've mentioned - they are actually not much superior to
the LVM-based approach; until one can create a subvolume with a different
profile (e.g. 'disable RAID1 for /var/log/journal') it is still better
to create separate filesystems, meaning one has to use LVM or (the hard
way) partitioning.


Some of the drawbacks above are inherent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.
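As referenced in point 1, a minimal incremental send/receive sketch (subvolume
and target paths are examples; both snapshots must be read-only):

# btrfs subvolume snapshot -r / /snapshots/root-20171210
# btrfs send -p /snapshots/root-20171209 /snapshots/root-20171210 | btrfs receive /mnt/backup
(the -p parent turns the stream into an incremental one, so only the changed
extents are transferred)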

-- 
Tomasz Pala 


How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing)

2017-12-05 Thread Andrei Borzenkov
02.12.2017 03:27, Qu Wenruo wrote:
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.
> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.
> 

Could you also explain how parent qgroup computes exclusive space? I.e.

10:~ # mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM /dev/sdb1 (1023.00MiB) ...
Label:  (null)
UUID:   b9b0643f-a248-4667-9e69-acf5baaef05b
Node size:  16384
Sector size:4096
Filesystem size:1023.00MiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         DUP              51.12MiB
  System:           DUP               8.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
    ID        SIZE  PATH
     1  1023.00MiB  /dev/sdb1

10:~ # mount -t btrfs /dev/sdb1 /mnt
10:~ # cd /mnt
10:/mnt # btrfs quota enable .
10:/mnt # btrfs su cre sub1
Create subvolume './sub1'
10:/mnt # dd if=/dev/urandom of=sub1/file1 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00833739 s, 126 MB/s
10:/mnt # dd if=/dev/urandom of=sub1/file2 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0179272 s, 58.5 MB/s
10:/mnt # btrfs subvolume snapshot sub1 sub2
Create a snapshot of 'sub1' in './sub2'
10:/mnt # dd if=/dev/urandom of=sub2/file2 bs=1K count=1024 conv=notrunc
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0348762 s, 30.1 MB/s
10:/mnt # btrfs qgroup show --sync -p .
qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5          16.00KiB     16.00KiB ---
0/256         2.02MiB      1.02MiB ---
0/257         2.02MiB      1.02MiB ---

So far so good. This is expected, each subvolume has 1MiB shared and
1MiB exclusive.

10:/mnt # btrfs qgroup create 22/7 /mnt
10:/mnt # btrfs qgroup assign --rescan 0/256 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup assign --rescan 0/257 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup show --sync -p .
qgroupid         rfer         excl parent
--------         ----         ---- ------
0/5          16.00KiB     16.00KiB ---
0/256         2.02MiB      1.02MiB 22/7
0/257         2.02MiB      1.02MiB 22/7
22/7          3.03MiB      3.03MiB ---
10:/mnt #

Oops. The total for 22/7 is correct (1MiB shared + 2 * 1MiB exclusive), but
why is all the data treated as exclusive here? It does not match your
explanation ...





Re: exclusive subvolume space missing

2017-12-03 Thread Chris Murphy
On Sun, Dec 3, 2017 at 3:47 AM, Adam Borowski  wrote:

> I'd say that the only good use for nocow is "I wish I have placed this file
> on a non-btrfs, but it'd be too much hassle to repartition".
>
> If you snapshot nocow at all, you get the worst of both worlds.

I think it's better to have the option than not to have it, but for the
regular Joe user I think it's a problem. And that's why I'm not such a
big fan of systemd-journald using chattr +C on journals when on Btrfs
by default. I wouldn't mind it if systemd also made /var/log/journal/
a subvolume, just like it automatically creates /var/lib/machines as
a subvolume. That way, by default, /var/log/journal would be immune to
snapshots.
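Something an admin can already do by hand today is roughly (a sketch; the paths
are the usual defaults and journald only needs to be stopped while the
directory is recreated):

# systemctl stop systemd-journald
# mv /var/log/journal /var/log/journal.old
# btrfs subvolume create /var/log/journal
# chattr +C /var/log/journal       # files created here inherit nocow
# cp -a --reflink=never /var/log/journal.old/. /var/log/journal/   # plain copy so the new files actually get +C
# rm -rf /var/log/journal.old && systemctl start systemd-journald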

Or alternatively a rework of how journals are written to be more COW friendly.


-- 
Chris Murphy


Re: exclusive subvolume space missing

2017-12-03 Thread Chris Murphy
On Fri, Dec 1, 2017 at 5:53 PM, Tomasz Pala  wrote:

> #  btrfs fi usage /
> Overall:
> Device size: 128.00GiB
> Device allocated:117.19GiB
> Device unallocated:   10.81GiB
> Device missing:  0.00B
> Used:103.56GiB
> Free (estimated): 11.19GiB  (min: 11.14GiB)
> Data ratio:   1.98
> Metadata ratio:   2.00
> Global reserve:  146.08MiB  (used: 0.00B)
>
> Data,single: Size:1.19GiB, Used:1.18GiB
>/dev/sda2   1.07GiB
>/dev/sdb2 132.00MiB

This is asking for trouble. Two devices have single-copy data chunks;
if either of those drives dies, you lose that data. But the metadata referring to
those files will survive and Btrfs will keep complaining about them at
every scrub until they're all deleted - there is no command that makes
this easy. You'd have to scrape scrub output, which includes paths to
the missing files, and script something to delete them all.

You should convert this with something like 'btrfs balance start
-dconvert=raid1,soft <mountpoint>'



-- 
Chris Murphy


Re: exclusive subvolume space missing

2017-12-03 Thread Qu Wenruo


On 2017-12-02 17:33, Tomasz Pala wrote:
> OK, I seriously need to address that, as during the night I lost
> 3 GB again:
> 
> On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:
> 
>>> #  btrfs fi sh /
>>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>>> Total devices 2 FS bytes used 44.10GiB
>Total devices 2 FS bytes used 47.28GiB
> 
>>> #  btrfs fi usage /
>>> Overall:
>>> Used: 88.19GiB
>Used: 94.58GiB
>>> Free (estimated): 18.75GiB  (min: 18.75GiB)
>Free (estimated): 15.56GiB  (min: 15.56GiB)
>>>
>>> #  btrfs dev usage /
> - output not changed
> 
>>> #  btrfs fi df /
>>> Data, RAID1: total=51.97GiB, used=43.22GiB
>Data, RAID1: total=51.97GiB, used=46.42GiB
>>> System, RAID1: total=32.00MiB, used=16.00KiB
>>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>>> GlobalReserve, single: total=131.14MiB, used=0.00B
>GlobalReserve, single: total=135.50MiB, used=0.00B
>>>
>>> # df
>> /dev/sda2        64G   45G   19G  71% /
>    /dev/sda2        64G   48G   16G  76% /
>>> However the difference is on active root fs:
>>>
>>> -0/29124.29GiB  9.77GiB
>>> +0/29115.99GiB 76.00MiB
> 0/29119.19GiB  3.28GiB
>>
>> Since you have already showed the size of the snapshots, which hardly
>> goes beyond 1G, it may be possible that extent booking is the cause.
>>
>> And considering it's all exclusive, defrag may help in this case.
> 
> I'm going to try defrag here, but have a bunch of questions before;
> as defrag would break CoW, I don't want to defrag files that span
> multiple snapshots, unless they have huge overhead:
> 1. is there any switch resulting in 'defrag only exclusive data'?

IIRC, no.

> 2. is there any switch resulting in 'defrag only extents fragmented more than 
> X'
>or 'defrag only fragments that would be possibly freed'?

No, either.

> 3. I guess there aren't, so how could I accomplish my target, i.e.
>reclaiming space that was lost due to fragmentation, without breaking
>spanshoted CoW where it would be not only pointless, but actually harmful?

What about using old kernel, like v4.13?

> 4. How can I prevent this from happening again? All the files, that are
>written constantly (stats collector here, PostgreSQL database and
>logs on other machines), are marked with nocow (+C); maybe some new
>attribute to mark file as autodefrag? +t?

Unfortunately, nocow only works if there is no other subvolume/inode
referring to it.

That's to say, if you're using snapshots, then NOCOW won't help as much
as you expect, but it's still much better than normal data CoW.

> 
> For example, the largest file from stats collector:
>  Total   Exclusive  Set shared  Filename
>  432.00KiB   176.00KiB   256.00KiB  load/load.rrd
> 
> but most of them has 'Set shared'==0.
> 
> 5. The stats collector is running from the beginning, according to the
> quota output was not the issue since something happened. If the problem
> was triggered by (guessing) low space condition, and it results in even
> more space lost, there is positive feedback that is dangerous, as makes
> any filesystem unstable ("once you run out of space, you won't recover").
> Does it mean btrfs is simply not suitable (yet?) for frequent updates usage
> pattern, like RRD files?

Hard to say the cause.

But in my understanding, btrfs is not suitable for such a conflicting
situation, where you want to have snapshots of frequent partial updates.

IIRC, btrfs is better for use cases where updates are either less frequent,
or replace the whole file, not just part of it.

So btrfs is good for root filesystem trees like /etc and /usr (and /bin and
/lib, which point to /usr/bin and /usr/lib), but not for /var or /run.

> 
> 6. Or maybe some extra steps just before taking snapshot should be taken?
> I guess 'defrag exclusive' would be perfect here - reclaiming space
> before it is being locked inside snapshot.

Yes, this sounds perfectly reasonable.

Thanks,
Qu

> Rationale behind this is obvious: since the snapshot-aware defrag was
> removed, allow to defrag snapshot exclusive data only.
> This would of course result in partial file defragmentation, but that
> should be enough for pathological cases like mine.





Re: exclusive subvolume space missing

2017-12-03 Thread Adam Borowski
On Sun, Dec 03, 2017 at 01:45:45AM +, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
> >> I got ~500 small files (100-500 kB) updated partially in regular
> >> intervals:
> >> 
> >> # du -Lc **/*.rrd | tail -n1
> >> 105Mtotal
> 
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
> are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file whose size doesn't change from then on.  Every
few minutes, write several bytes into the file, slowly advancing.

This is indeed the worst possible case for btrfs, and nocow doesn't help in
the slightest, as the database doesn't wrap around before a typical snapshot
interval.
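A rough way to reproduce that pattern for testing (a sketch; the file name,
sizes and interval are made up):

xfs_io -f -c "falloc 0 1m" /mnt/btrfs/rrd-like     # preallocate once
off=0
while sleep 60; do                                 # every minute, write a tiny record and advance
    xfs_io -c "pwrite $off 64" -c "fsync" /mnt/btrfs/rrd-like >/dev/null
    off=$(( (off + 64) % (1024 * 1024) ))
done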

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such than a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...;
⠈⠳⣄ (same for all links)


Re: exclusive subvolume space missing

2017-12-02 Thread Duncan
Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:

> On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
> 
>>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
>> [...]
>>> Now make various small changes to the file, say under 16 KiB each.  These
>>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>>> a time I believe (might be 4 KiB, as it was back when the default leaf
>> 
>> I got ~500 small files (100-500 kB) updated partially in regular
>> intervals:
>> 
>> # du -Lc **/*.rrd | tail -n1
>> 105Mtotal

FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
are (other than that a quick google suggests that it's...
round-robin-database... and the database bit alone sounds bad in this context
as database-file rewrites are known to be a worst-case for cow-based
filesystems), but it sounds like you suspect that they have this
rewrite-most pattern that could explain your problem...

>>> But here's the kicker.  Even without a snapshot locking that original 100
>>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>>> at once, once no references to it remain, which means once that last
>>> block of the extent gets rewritten.
> 
> OTOH - should this happen with nodatacow files? As I mentioned before,
> these files are chattred +C (however this was not their initial state
> due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
> Am I wrong thinking, that in such case they should occupy twice their
> size maximum? Or maybe there is some tool that could show me the real
> space wasted by file, including extents count etc?

Nodatacow... isn't as simple as the name might suggest.

For one thing, snapshots depend on COW and lock the extents they reference
in-place, so while a file might be set nocow and that setting is retained,
the first write to a block after a snapshot *MUST* cow that block... because
the snapshot has the existing version referenced and it can't change without
changing the snapshot as well, and that would of course defeat the purpose
of snapshots.

Tho the attribute is retained and further writes to the same already cowed
block won't cow it again.

FWIW, on this list that behavior is often referred to as cow1, cow only the
first time that a block is written after a snapshot locks the previous
version in place.

The effect of cow1 depends on the frequency and extent of block rewrites vs.
the frequency of snapshots of the subvolume they're on.  As should be
obvious if you think about it, once you've done the cow1, further rewrites
to the same block before further snapshots won't cow further, so if only
a few blocks are repeatedly rewritten multiple times between snapshots, the
effect should be relatively small.  Similarly if snapshots happen far more
frequently than block rewrites, since in that case most of the snapshots
won't have anything changed (for that file anyway) since the last one.

However, if most of the file gets rewritten between snapshots and the
snapshot frequency is often enough to be a major factor, the effect can be
practically as bad as if the file weren't nocow in the first place.

If I knew a bit more about rrd's rewrite pattern... and your snapshot
pattern...


Second, as you alluded, for btrfs files must be set nocow before anything
is written to them.  Quoting the chattr (1) manpage:  "If it is set on a
file which already has data blocks, it is undefined when the blocks
assigned to the file will be fully stable."

Not being a dev I don't read the code to know what that means in practice,
but it could well be effectively cow1, which would yield the maximum 2X
size you assumed.

But I think it's best to take "undefined" at its meaning, and assume
worst-case "no effect at all", for size calculation purposes, unless you
really /did/ set it at file creation, before the file had content.

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the
files into it, taking care to actually create them new, that is, use
--reflink=never or copy the files to a different filesystem, perhaps
tmpfs, and back, so they /have/ to be created new.  Of course with the
rewriter (rrdcached, apparently) shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again.
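A command-level sketch of that procedure (the service name and paths are only
examples for an rrd collector):

# systemctl stop rrdcached                                  # whatever rewrites the files
# mkdir /var/lib/rrd.new && chattr +C /var/lib/rrd.new
# cp -a --reflink=never /var/lib/rrd/. /var/lib/rrd.new/    # files are created new, so they get +C
# sync
# mv /var/lib/rrd /var/lib/rrd.old && mv /var/lib/rrd.new /var/lib/rrd
# sync && rm -rf /var/lib/rrd.old && systemctl start rrdcached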

While snapshots will of course continue to keep extents they reference
locked, for unsnapshotted files at least, this process should clear up
any still referenced extents.

Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each.  These
>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
> 
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
> 
> # du -Lc **/*.rrd | tail -n1
> 105Mtotal
> 
>> But here's the kicker.  Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattr'ed +C (however, this was not their initial state
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong in thinking that in such a case they should occupy twice their
size at most? Or maybe there is some tool that could show me the real
space wasted by a file, including extent counts etc.?

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my previous
attempts to gain some space by converting to the single profile, which
failed miserably BTW (would it have been smarter with the "soft" option?),
but I've already managed to clear this up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB, so this issue should
be addressed in btrfs-tools by reporting such a locked-out region. You've
already noted that the single-profile data itself occupied much less.

So this was the first issue; the second is a running overhead that
accumulates over time. Since yesterday, when I had 19 GB free, I've lost
4 GB already. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage.  I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There was no effect above N=10 (for both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have used noatime by default for years, so that's not the source of the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...

> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each.  These
> will each be COWed elsewhere as one might expect. by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I got ~500 small files (100-500 kB) partially updated at regular
intervals:

# du -Lc **/*.rrd | tail -n1
105Mtotal

> But here's the kicker.  Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time, the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Did anyone investigate how this is related to RRD rewrites? I don't use
rrdcached; I never thought that 100 MB of data might trash an entire
filesystem...

best regards,
-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
OK, I seriously need to address that, as during the night I lost
3 GB again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> #  btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB

>> #  btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB
>> Free (estimated): 18.75GiB  (min: 18.75GiB)
   Free (estimated): 15.56GiB  (min: 15.56GiB)
>> 
>> #  btrfs dev usage /
- output not changed

>> #  btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>> 
>> # df
>> /dev/sda2        64G   45G   19G  71% /
   /dev/sda2        64G   48G   16G  76% /
>> However the difference is on active root fs:
>> 
>> -0/29124.29GiB  9.77GiB
>> +0/29115.99GiB 76.00MiB
0/29119.19GiB  3.28GiB
> 
> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
> 
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but I have a bunch of questions first;
as defrag would break CoW, I don't want to defrag files that span
multiple snapshots, unless they have huge overhead:
1. is there any switch resulting in 'defrag only exclusive data'?
2. is there any switch resulting in 'defrag only extents fragmented more than X'
   or 'defrag only fragments that would be possibly freed'?
3. I guess there aren't, so how could I accomplish my target, i.e.
   reclaiming space that was lost due to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless, but actually harmful?
4. How can I prevent this from happening again? All the files that are
   written constantly (stats collector here, PostgreSQL database and
   logs on other machines) are marked with nocow (+C); maybe some new
   attribute to mark a file as autodefrag? +t?

For example, the largest file from stats collector:
 Total   Exclusive  Set shared  Filename
 432.00KiB   176.00KiB   256.00KiB  load/load.rrd

but most of them has 'Set shared'==0.

5. The stats collector has been running from the beginning; according to the
quota output it was not the issue until something happened. If the problem
was triggered by (guessing) a low-space condition, and it results in even
more space lost, there is a positive feedback loop that is dangerous, as it makes
any filesystem unstable ("once you run out of space, you won't recover").
Does that mean btrfs is simply not suitable (yet?) for frequent-update usage
patterns, like RRD files?

6. Or maybe some extra steps should be taken just before taking a snapshot?
I guess 'defrag exclusive' would be perfect here - reclaiming space
before it gets locked inside a snapshot.
The rationale behind this is obvious: since snapshot-aware defrag was
removed, allow defragging snapshot-exclusive data only.
This would of course result in partial file defragmentation, but that
should be enough for pathological cases like mine.

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-01 Thread Duncan
Tomasz Pala posted on Sat, 02 Dec 2017 01:53:39 +0100 as excerpted:

> #  btrfs fi usage /
> Overall:
> Device size:                 128.00GiB
> Device allocated:            117.19GiB
> Device unallocated:           10.81GiB
> Device missing:                  0.00B
> Used:                        103.56GiB
> Free (estimated):             11.19GiB  (min: 11.14GiB)
> Data ratio:                       1.98
> Metadata ratio:                   2.00
> Global reserve:              146.08MiB  (used: 0.00B)
> 
> Data,single: Size:1.19GiB, Used:1.18GiB
>    /dev/sda2   1.07GiB
>    /dev/sdb2 132.00MiB
> 
> Data,RAID1: Size:55.97GiB, Used:50.30GiB
>    /dev/sda2  55.97GiB
>    /dev/sdb2  55.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
>    /dev/sda2   2.00GiB
>    /dev/sdb2   2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2  32.00MiB
>    /dev/sdb2  32.00MiB
> 
> Unallocated:
>    /dev/sda2   4.93GiB
>    /dev/sdb2   5.87GiB

OK, is this supposed to be raid1 or single data, because the above shows
metadata as all raid1, while some data is single tho most is raid1, and
while old mkfs used to create unused single chunks on raid1 that had to
be removed manually via balance, those single data chunks aren't unused.

Which means if it's supposed to raid1, you don't have redundancy on that
single data.

Assuming the intent is raid1, I'd recommend doing...

btrfs balance start -dconvert=raid1,soft /

Probably disable quotas at least temporarily while you do so, tho, as
they don't scale well with balance and make it take much longer.

That should go reasonably fast as it's only a bit over 1 GiB on the one
device, and 132 MiB on the other (from your btrfs device usage), and the
soft allows it to skip chunks that don't need conversion.

It should kill those single entries and even up usage on both devices,
along with making the filesystem much more tolerant of loss of one of
the two devices.
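
If you want to keep an eye on it while it runs, the usual commands apply
(nothing conversion-specific here):

btrfs balance status /
btrfs filesystem df /

The Data,single line in the latter should shrink away as chunks get
converted, and "btrfs balance pause /" / "btrfs balance resume /" work
if you need to interrupt it.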


Other than that, what we can see from the above is that it's a relatively
small filesystem, 64 GiB each on a pair of devices, raid1 but for the
above.

We also see that the allocated chunks vs. chunk usage isn't /too/ bad,
with that being a somewhat common problem.  However, given the relatively
small 64 GiB per device pair-device raid1 filesystem, there is some
slack, about 5 GiB worth, in that raid1 data, that you can recover.

btrfs balance start -dusage=N /

Where N represents a percentage full, so 0-100.  Normally, smaller
values of N complete much faster, with the most effect if they're
enough, because at say 10% usage, 10 90% empty chunks can be rewritten
into a single 100% full chunk.

The idea is to start with a small N value since it completes fast, and
redo with higher values as necessary to shrink the total data chunk
allocated value toward usage.  I too run relatively small btrfs raid1s
and would suggest trying N=5, 20, 40, 70, until the spread between
used and total is under 2 gigs, under a gig if you want to go that far
(nominal data chunk size is a gig so even a full balance will be unlikely
to get you a spread less than that).  Over 70 likely won't get you much
so isn't worth it.
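
FWIW, if you'd rather not babysit it, the same stepping can be scripted;
a minimal sketch (stop it early once the data total/used spread is under
the ~2 gig target mentioned above):

for n in 5 20 40 70; do
    btrfs balance start -dusage=$n /
    btrfs filesystem df /
done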

That should return the excess to unallocated, leaving the filesystem 
able to use the freed space for data or metadata chunks as necessary,
tho you're unlikely to see an increase in available space in (non-btrfs)
df or similar.  If the unallocated value gets down below 1 GiB you may
have issues trying to free space since balance will want space to write
the chunk it's going to write into to free the others, so you probably 
want to keep an eye on this and rebalance if it gets under 2-3 gigs
free space, assuming of course that there's slack between used and
total that /can/ be freed by a rebalance.

FWIW the same can be done with metadata using -musage=, with metadata
chunks being 256 MiB nominal, but keep in mind that global reserve is
allocated from metadata space but doesn't count as used, so you typically
can't get the spread down below half a GiB or so.  And in most cases
it's data chunks that get the big spread, not metadata, so it's much
more common to have to do -d for data than -m for metadata.
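
Should you ever need it, the form is the same, just with the metadata
filter, for example:

btrfs balance start -musage=30 /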


All that said, the numbers don't show a runaway spread between total
and used, so while this might help, it's not going to fix the primary
space being eaten problem of the thread, as I had hoped it might.

Additionally, at 2 GiB total per device, metadata chunks aren't runaway
consuming your space either, as I'd suspect they might if the problem were
for instance atime updates, so while noatime is certainly recommended and
might help some, it doesn't appear to be a primary contributor to the
problem either.


The other possibility that comes to mind here has to do with btrfs COW
write patterns...

Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
the GiB+ example typically used due to the filesystem size 

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 10:21, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:
> 
>>> Actually I should rephrase the problem:
>>>
>>> "snapshot has taken 8 GB of space despite nothing has altered source 
>>> subvolume"
> 
> Actually, after:
> 
> # btrfs balance start -v -dconvert=raid1 /
> ctrl-c on block group 35G/113G
> # btrfs balance start -v -dconvert=raid1,soft /
> # btrfs balance start -v -dusage=55 /
> Done, had to relocate 1 out of 56 chunks
> # btrfs balance start -v -musage=55 /
> Done, had to relocate 2 out of 55 chunks
> 
> and waiting a few minutes after ...the 8 GB I've lost yesterday is back:
> 
> #  btrfs fi sh /
> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
> Total devices 2 FS bytes used 44.10GiB
> devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
> devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2
> 
> #  btrfs fi usage /
> Overall:
> Device size:                 128.00GiB
> Device allocated:            108.00GiB
> Device unallocated:           20.00GiB
> Device missing:                  0.00B
> Used:                         88.19GiB
> Free (estimated):             18.75GiB  (min: 18.75GiB)
> Data ratio:                       2.00
> Metadata ratio:                   2.00
> Global reserve:              131.14MiB  (used: 0.00B)
> 
> Data,RAID1: Size:51.97GiB, Used:43.22GiB
>    /dev/sda2  51.97GiB
>    /dev/sdb2  51.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
>    /dev/sda2   2.00GiB
>    /dev/sdb2   2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>    /dev/sda2  32.00MiB
>    /dev/sdb2  32.00MiB
> 
> Unallocated:
>    /dev/sda2  10.00GiB
>    /dev/sdb2  10.00GiB
> 
> #  btrfs dev usage /
> /dev/sda2, ID: 1
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,RAID1:             51.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:            10.00GiB
> 
> /dev/sdb2, ID: 2
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,RAID1:             51.97GiB
>    Metadata,RAID1:          2.00GiB
>    System,RAID1:           32.00MiB
>    Unallocated:            10.00GiB
> 
> #  btrfs fi df /
> Data, RAID1: total=51.97GiB, used=43.22GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=895.69MiB
> GlobalReserve, single: total=131.14MiB, used=0.00B
> 
> # df
> /dev/sda2        64G   45G   19G  71% /
> 
> However the difference is on active root fs:
> 
> -0/291   24.29GiB   9.77GiB
> +0/291   15.99GiB  76.00MiB
> 
> Still, 45G used, while there is (if I counted this correctly) 25G of data...
> 
>> Then please provide correct qgroup numbers.
>>
>> The correct number should be get by:
> # btrfs quota enable <path>
> # btrfs quota rescan -w <path>
> # btrfs qgroup show -prce --sync <path>
> 
> OK, just added the --sort=excl:
> 
> qgroupid     rfer       excl       max_rfer  max_excl  parent  child
> --------     ----       ----       --------  --------  ------  -----
> 0/5          16.00KiB   16.00KiB   none      none      ---     ---
> 0/361        22.57GiB   7.00MiB    none      none      ---     ---
> 0/358        22.54GiB   7.50MiB    none      none      ---     ---
> 0/343        22.36GiB   7.84MiB    none      none      ---     ---
> 0/345        22.49GiB   8.05MiB    none      none      ---     ---
> 0/357        22.50GiB   9.27MiB    none      none      ---     ---
> 0/360        22.57GiB   10.27MiB   none      none      ---     ---
> 0/344        22.48GiB   11.09MiB   none      none      ---     ---
> 0/359        22.55GiB   12.57MiB   none      none      ---     ---
> 0/362        22.59GiB   22.96MiB   none      none      ---     ---
> 0/302        12.87GiB   31.23MiB   none      none      ---     ---
> 0/428        15.96GiB   38.68MiB   none      none      ---     ---
> 0/294        11.09GiB   47.86MiB   none      none      ---     ---
> 0/336        21.80GiB   49.59MiB   none      none      ---     ---
> 0/300        12.56GiB   51.43MiB   none      none      ---     ---
> 0/342        22.31GiB   52.93MiB   none      none      ---     ---
> 0/333        21.71GiB   54.54MiB   none      none      ---     ---
> 0/363        22.63GiB   58.83MiB   none      none      ---     ---
> 0/370        23.27GiB   59.46MiB   none      none      ---     ---
> 0/305        13.01GiB   61.47MiB   none      none      ---     ---
> 0/331        21.61GiB   61.49MiB   none      none      ---     ---
> 0/334        21.78GiB   62.95MiB   none      none      ---     ---
> 0/306        13.04GiB   64.11MiB   none      none      ---     ---
> 0/304        12.96GiB   64.90MiB   none      none      ---

Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:

>> Actually I should rephrase the problem:
>> 
>> "snapshot has taken 8 GB of space despite nothing has altered source 
>> subvolume"

Actually, after:

# btrfs balance start -v -dconvert=raid1 /
ctrl-c on block group 35G/113G
# btrfs balance start -v -dconvert=raid1,soft /
# btrfs balance start -v -dusage=55 /
Done, had to relocate 1 out of 56 chunks
# btrfs balance start -v -musage=55 /
Done, had to relocate 2 out of 55 chunks

and waiting a few minutes after ...the 8 GB I've lost yesterday is back:

#  btrfs fi sh /
Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
Total devices 2 FS bytes used 44.10GiB
devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2

#  btrfs fi usage /
Overall:
Device size:                 128.00GiB
Device allocated:            108.00GiB
Device unallocated:           20.00GiB
Device missing:                  0.00B
Used:                         88.19GiB
Free (estimated):             18.75GiB  (min: 18.75GiB)
Data ratio:                       2.00
Metadata ratio:                   2.00
Global reserve:              131.14MiB  (used: 0.00B)

Data,RAID1: Size:51.97GiB, Used:43.22GiB
   /dev/sda2  51.97GiB
   /dev/sdb2  51.97GiB

Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
   /dev/sda2   2.00GiB
   /dev/sdb2   2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  32.00MiB

Unallocated:
   /dev/sda2  10.00GiB
   /dev/sdb2  10.00GiB

#  btrfs dev usage /
/dev/sda2, ID: 1
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,RAID1:             51.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:            10.00GiB

/dev/sdb2, ID: 2
   Device size:            64.00GiB
   Device slack:              0.00B
   Data,RAID1:             51.97GiB
   Metadata,RAID1:          2.00GiB
   System,RAID1:           32.00MiB
   Unallocated:            10.00GiB

#  btrfs fi df /
Data, RAID1: total=51.97GiB, used=43.22GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=895.69MiB
GlobalReserve, single: total=131.14MiB, used=0.00B

# df
/dev/sda2        64G   45G   19G  71% /

However the difference is on active root fs:

-0/291   24.29GiB   9.77GiB
+0/291   15.99GiB  76.00MiB

Still, 45G used, while there is (if I counted this correctly) 25G of data...

> Then please provide correct qgroup numbers.
> 
> The correct number should be get by:
> # btrfs quota enable <path>
> # btrfs quota rescan -w <path>
> # btrfs qgroup show -prce --sync <path>

OK, just added the --sort=excl:

qgroupid     rfer       excl       max_rfer  max_excl  parent  child
--------     ----       ----       --------  --------  ------  -----
0/5          16.00KiB   16.00KiB   none      none      ---     ---
0/361        22.57GiB   7.00MiB    none      none      ---     ---
0/358        22.54GiB   7.50MiB    none      none      ---     ---
0/343        22.36GiB   7.84MiB    none      none      ---     ---
0/345        22.49GiB   8.05MiB    none      none      ---     ---
0/357        22.50GiB   9.27MiB    none      none      ---     ---
0/360        22.57GiB   10.27MiB   none      none      ---     ---
0/344        22.48GiB   11.09MiB   none      none      ---     ---
0/359        22.55GiB   12.57MiB   none      none      ---     ---
0/362        22.59GiB   22.96MiB   none      none      ---     ---
0/302        12.87GiB   31.23MiB   none      none      ---     ---
0/428        15.96GiB   38.68MiB   none      none      ---     ---
0/294        11.09GiB   47.86MiB   none      none      ---     ---
0/336        21.80GiB   49.59MiB   none      none      ---     ---
0/300        12.56GiB   51.43MiB   none      none      ---     ---
0/342        22.31GiB   52.93MiB   none      none      ---     ---
0/333        21.71GiB   54.54MiB   none      none      ---     ---
0/363        22.63GiB   58.83MiB   none      none      ---     ---
0/370        23.27GiB   59.46MiB   none      none      ---     ---
0/305        13.01GiB   61.47MiB   none      none      ---     ---
0/331        21.61GiB   61.49MiB   none      none      ---     ---
0/334        21.78GiB   62.95MiB   none      none      ---     ---
0/306        13.04GiB   64.11MiB   none      none      ---     ---
0/304        12.96GiB   64.90MiB   none      none      ---     ---
0/303        12.94GiB   68.39MiB   none      none      ---     ---
0/367        23.20GiB   68.52MiB   none      none      ---     ---
0/366        23.22GiB   69.79MiB   none      none      ---     ---
0/364        22.63GiB   72.03MiB 

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 09:43, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:
> 
>>> qgroupid     rfer       excl
>>> --------     ----       ----
>>> 0/260        12.25GiB   3.22GiB    from 170712 - first snapshot
>>> 0/312        17.54GiB   4.56GiB    from 170811
>>> 0/366        25.59GiB   2.44GiB    from 171028
>>> 0/370        23.27GiB   59.46MiB   from 171118 - prev snapshot
>>> 0/388        21.69GiB   7.16GiB    from 171125 - last snapshot
>>> 0/291        24.29GiB   9.77GiB    default subvolume
>>
>> You may need to manually sync the filesystem (trigger a transaction
>> commitment) to update qgroup accounting.
> 
> The data I've pasted were just calculated.
> 
>>> # btrfs quota enable /
>>> # btrfs qgroup show /
>>> WARNING: quota disabled, qgroup data may be out of date
>>> [...]
>>> # btrfs quota enable /  - for the second time!
>>> # btrfs qgroup show /
>>> WARNING: qgroup data inconsistent, rescan recommended
>>
>> Please wait the rescan, or any number is not correct.
> 
> Here I was pointing that first "quota enable" resulted in "quota
> disabled" warning until I've enabled it once again.
> 
>> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
>> ensure you understand all the limitation.
> 
> I probably won't understand them all, but this is not an issue of my
> concern as I don't use it. There is simply no other way I am aware that
> could show me per-subvolume stats. Well, straightforward way, as the
> hard way I'm using (btrfs send) confirms the problem.

Unfortunately, send doesn't count everything.

The most common thing is, send doesn't count extent booking space.
Try the following command:

# fallocate -l 1G <file>
# mkfs.btrfs -f <file>
# mount <file> <mnt>
# btrfs subv create <mnt>/subv1
# xfs_io -f -c "pwrite 0 128M" -c "sync" <mnt>/subv1/file1
# xfs_io -f -c "fpunch 0 127M" -c "sync" <mnt>/subv1/file1
# btrfs subv snapshot -r <mnt>/subv1 <mnt>/snapshot
# btrfs send <mnt>/snapshot

You will only get about 1M of data from send, while the file still takes
128M of space on-disk.

Btrfs extent booking will only free the whole extent if and only if no
inode refers to *ANY* part of the extent.

Even if only 1M of a 128M file extent is used, it still takes 128M of
space on-disk.
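
You can see this with nothing more than the commands above (a sketch,
using the same placeholder mount point):

# btrfs fi df <mnt>
# du -h <mnt>/subv1/file1

After the fpunch and sync, "fi df" still shows the data usage near 128M,
while plain du reports only about 1M for the file.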

And that's what send can't tell you.
And that's exactly what qgroup *can* tell you.

That's also why I need *CORRECT* qgroup numbers to further investigate
the problem.

> 
> You could simply remove all the quota results I've posted and there will
> be the underlaying problem, that the 25 GB of data I got occupies 52 GB.

If you just want to know why your "25G" of data occupies 52G on disk,
the above is one of the possible explanations.

(And I think I should put it into btrfs(5), although I highly doubt
users will really read it.)

You could try to defrag, but I'm not sure how well defrag works in the
multi-subvolume case.

> At least one recent snapshot, that was taken after some minor (<100 MB) 
> changes
> from the subvolume, that has undergo some minor changes since then,
> occupied 8 GB during one night when the entire system was idling.

The only possible method to fully isolate all the disturbing factors is
to get rid of snapshots.

Build the subvolume from scratch (not even cp --reflink from another
subvolume), then test what happens.

Only in that case can you trust vanilla du (as long as you don't do any
reflinks). Although you can always trust the qgroup numbers, a subvolume
built from scratch makes the exclusive number equal to the referenced
one, which makes debugging a little easier.

Thanks,
Qu

> 
> This was crosschecked on files metadata (mtimes compared) and 'du'
> results.
> 
> 
> As a last-resort I've rebalanced the disk (once again), this time with
> -dconvert=raid1 (to get rid of the single residue).
> 





Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 09:23, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:
> 
>> I assume there is program eating up the space.
>> Not btrfs itself.
> 
> Very doubtful. I've encountered ext3 "eating" problem once, that couldn't be
> find by lsof on 3.4.75 kernel, but the space was returning after killing
> Xorg. The system I'm having problem now is very recent, the space
> doesn't return after reboot/emergency and doesn't sum up with files.

Unlike vanilla df or "fi usage" or "fi df", btrfs quota only counts
on-disk extents.

That's to say, reserved space won't contribute to qgroup - unless one is
using an anonymous file, one that is opened but unlinked, so nobody can
access it except the owner.
(Which I doubt is your case.)

Which should make quota the best tool to debug your problem.
(As long as you respect the various limitations of btrfs quota, especially
that you need to sync, or use the --sync option, to show qgroup numbers.)

> 
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID:   388
>>> # btrfs fi du -s ./snapshot-171125 
>>>  Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>
>> That's the difference between how sub show and quota works.
>>
>> For quota, it's per-root owner check.
> 
> Just to be clear: I've enabled quota _only_ to see subvolume usage on
> spot. And exclusive data - the more detailed approach I've described in
> e-mail I've send a minute ago.
> 
>> Means even a file extent is shared between different inodes, if all
>> inodes are inside the same subvolume, it's counted as exclusive.
>> And if any of the file extent belongs to other subvolume, then it's
>> counted as shared.
> 
> Good to know, but this is almost UID0-only system. There are system
> users (vendor provided) and 2 ssh accounts for su, but nobody uses this
> machine for daily work. The quota values were the last tool I could find
> to debug.
> 
>> For fi du, it's per-inode owner check. (The exact behavior is a little
>> more complex, I'll skip such corner case to make it a little easier to
>> understand).
>>
>> That's to say, if one file extent is shared by different inodes, then
>> it's counted as shared, no matter if these inodes belong to different or
>> the same subvolume.
>>
>> That's to say, "fi du" has a looser condition for "shared" calculation,
>> and that should explain why you have 20+G shared.
> 
> There shouldn't be many multi-inode extents inside single subvolume, as this 
> is mostly fresh
> system, with no containers, no deduplication, snapshots are taken from
> the same running system before or after some more important change is
> done. By 'change' I mean altering text config files mostly (plus
> etckeeper's git metadata), so the volume of difference is extremelly
> low. Actually most of the difs between subvolumes come from updating
> distro packages. There were not much reflink copies made on this
> partition, only one kernel source compiled (.ccache files removed
> today). So this partition is as clean, as it could be after almost
> 5 months in use.
> 
> Actually I should rephrase the problem:
> 
> "snapshot has taken 8 GB of space despite nothing has altered source 
> subvolume"

Then please provide correct qgroup numbers.

The correct number should be get by:
# btrfs quota enable <path>
# btrfs quota rescan -w <path>
# btrfs qgroup show -prce --sync <path>

Rescan and --sync are important to get the correct number.
(while rescan can take a long long time to finish)

Furthermore, please ensure that all deleted files are really deleted.
Btrfs delays file and subvolume deletion, so you may need to sync several
times or use "btrfs subv sync" to ensure deleted files are really gone.

(vanilla du won't tell you if such delayed file deletion is really done)
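
In other words, something like this just before reading the numbers:

# sync
# btrfs subvolume sync /
# btrfs qgroup show -prce --sync /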

Thanks,
Qu
> 





Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid     rfer       excl
>> --------     ----       ----
>> 0/260        12.25GiB   3.22GiB    from 170712 - first snapshot
>> 0/312        17.54GiB   4.56GiB    from 170811
>> 0/366        25.59GiB   2.44GiB    from 171028
>> 0/370        23.27GiB   59.46MiB   from 171118 - prev snapshot
>> 0/388        21.69GiB   7.16GiB    from 171125 - last snapshot
>> 0/291        24.29GiB   9.77GiB    default subvolume
> 
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted were just calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable /   - for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
> 
> Please wait the rescan, or any number is not correct.

Here I was pointing out that the first "quota enable" resulted in a "quota
disabled" warning until I enabled it once again.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitation.

I probably won't understand them all, but that is not really my concern,
as I don't use quotas. There is simply no other way I am aware of that
could show me per-subvolume stats. Well, no straightforward way - the
hard way I'm using (btrfs send) confirms the problem.

You could simply disregard all the quota results I've posted and the
underlying problem remains: the 25 GB of data I have occupies 52 GB.
At least one recent snapshot, taken after some minor (<100 MB) changes
to a subvolume that has undergone only minor changes since then, came to
occupy 8 GB during one night when the entire system was idling.

This was crosschecked on files metadata (mtimes compared) and 'du'
results.


As a last-resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the single residue).

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 "eating" problem once, one that
couldn't be found by lsof on a 3.4.75 kernel, but there the space would
return after killing Xorg. The system I'm having the problem on now is
very recent; the space doesn't return after reboot/emergency and doesn't
add up with the files.

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID:   388
>> # btrfs fi du -s ./snapshot-171125 
>>  Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.

Just to be clear: I've enabled quota _only_ to see subvolume usage on
the spot. And exclusive data - the more detailed approach I've described
in the e-mail I sent a minute ago.

> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.

Good to know, but this is an almost UID0-only system. There are system
users (vendor provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
to debug with.

> For fi du, it's per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner case to make it a little easier to
> understand).
> 
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
> 
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.

There shouldn't be many multi-inode extents inside a single subvolume,
as this is a mostly fresh system, with no containers and no deduplication;
snapshots are taken from the same running system before or after some
more important change is done. By 'change' I mean mostly altering text
config files (plus etckeeper's git metadata), so the volume of differences
is extremely low. Actually most of the diffs between subvolumes come from
updating distro packages. There were not many reflink copies made on this
partition, only one kernel source compiled (.ccache files removed today).
So this partition is as clean as it could be after almost 5 months in use.

Actually I should rephrase the problem:

"snapshot has taken 8 GB of space despite nothing has altered source subvolume"

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID:   388
>>> # btrfs fi du -s ./snapshot-171125 
>>>  Total   Exclusive  Set shared  Filename
>>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>>>
>>> How is that possible? This doesn't even remotely relate to 7.15 GiB
>>> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>>> And the same happens with other snapshots, much more exclusive data
>>> shown in qgroup than actually found in files. So if not files, where
>>> is that space wasted? Metadata?
>>
>>Personally, I'd trust qgroups' output about as far as I could spit
>> Belgium(*).
> 
> Well, there is something wrong here, as after removing the .ccache
> directories inside all the snapshots the 'excl' values decreased
> ...except for the last snapshot (the list below is short by ~40 snapshots
> that have 2 GB excl in total):
> 
> qgroupid     rfer       excl
> --------     ----       ----
> 0/260        12.25GiB   3.22GiB    from 170712 - first snapshot
> 0/312        17.54GiB   4.56GiB    from 170811
> 0/366        25.59GiB   2.44GiB    from 171028
> 0/370        23.27GiB   59.46MiB   from 171118 - prev snapshot
> 0/388        21.69GiB   7.16GiB    from 171125 - last snapshot
> 0/291        24.29GiB   9.77GiB    default subvolume

You may need to manually sync the filesystem (trigger a transaction
commitment) to update qgroup accounting.
> 
> 
> [~/test/snapshot-171125]#  du -sh .
> 15G .
> 
> 
> After changing back to ro I tested how much data really has changed
> between the previous and last snapshot:
> 
> [~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
> At subvol snapshot-171125
> 74.2MiB 0:00:32 [2.28MiB/s]
> 
> This means there can't be 7 GiB of exclusive data in the last snapshot.

Mentioned before, sync the fs first before checking the qgroup numbers.
Or use --sync option along with qgroup show.

> 
> Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
> 5.68GiB 0:03:23 [28.6MiB/s]
> 
> I've created a new snapshot right now to compare it with 171125:
> 75.5MiB 0:00:43 [1.73MiB/s]
> 
> 
> OK, I could even compare all the snapshots in sequence:
> 
> # for i in snapshot-17*; btrfs prop set $i ro true
> # p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv 
> > /dev/null; p="$i"; done
>  1.7GiB 0:00:15 [ 114MiB/s]
> 1.03GiB 0:00:38 [27.2MiB/s]
>  155MiB 0:00:08 [19.1MiB/s]
> 1.08GiB 0:00:47 [23.3MiB/s]
>  294MiB 0:00:29 [ 9.9MiB/s]
>  324MiB 0:00:42 [7.69MiB/s]
> 82.8MiB 0:00:06 [12.7MiB/s]
> 64.3MiB 0:00:05 [11.6MiB/s]
>  137MiB 0:00:07 [19.3MiB/s]
> 85.3MiB 0:00:13 [6.18MiB/s]
> 62.8MiB 0:00:19 [3.21MiB/s]
>  132MiB 0:00:42 [3.15MiB/s]
>  102MiB 0:00:42 [2.42MiB/s]
>  197MiB 0:00:50 [3.91MiB/s]
>  321MiB 0:01:01 [5.21MiB/s]
>  229MiB 0:00:18 [12.3MiB/s]
>  109MiB 0:00:11 [ 9.7MiB/s]
>  139MiB 0:00:14 [9.32MiB/s]
>  573MiB 0:00:35 [15.9MiB/s]
> 64.1MiB 0:00:30 [2.11MiB/s]
>  172MiB 0:00:11 [14.9MiB/s]
> 98.9MiB 0:00:07 [14.1MiB/s]
>   54MiB 0:00:08 [6.17MiB/s]
> 78.6MiB 0:00:02 [32.1MiB/s]
> 15.1MiB 0:00:01 [12.5MiB/s]
> 20.6MiB 0:00:00 [  23MiB/s]
> 20.3MiB 0:00:00 [  23MiB/s]
>  110MiB 0:00:14 [7.39MiB/s]
> 62.6MiB 0:00:11 [5.67MiB/s]
> 65.7MiB 0:00:08 [7.58MiB/s]
>  731MiB 0:00:42 [  17MiB/s]
> 73.7MiB 0:00:29 [ 2.5MiB/s]
>  322MiB 0:00:53 [6.04MiB/s]
>  105MiB 0:00:35 [2.95MiB/s]
> 95.2MiB 0:00:36 [2.58MiB/s]
> 74.2MiB 0:00:30 [2.43MiB/s]
> 75.5MiB 0:00:46 [1.61MiB/s]
> 
> This is 9.3 GB of total diffs between all the snapshots I got.
> Plus 15 GB of initial snapshot means there is about 25 GB used,
> while df reports twice the amount, way too much for overhead:
> /dev/sda2        64G   52G   11G  84% /
> 
> 
> # btrfs quota enable /
> # btrfs qgroup show /
> WARNING: quota disabled, qgroup data may be out of date
> [...]
> # btrfs quota enable /- for the second time!
> # btrfs qgroup show /
> WARNING: qgroup data inconsistent, rescan recommended

Please wait the rescan, or any number is not correct.
(Although it will only be less than actual occupied space)

It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
ensure you understand all the limitation.

> [...]
> 0/428        15.96GiB   19.23MiB    newly created (now) snapshot
> 
> 
> 
> Assuming the qgroups output is bugus and the space isn't physically
> occupied (which is coherent with btrfs fi du output and my expectation)
> the question remains: why is that bogus-excl removed from available
> space as reported by df or btrfs fi df/usage? And how to reclaim it?

Already explained the difference in another thread.

Thanks,
Qu

> 
> 
> [~/test]#  btrfs device usage /
> /dev/sda2, ID: 1
>    Device size:            64.00GiB
>    Device slack:              0.00B
>    Data,single:             1.07GiB
>    Data,RAID1:             55.97GiB
>    Metadata,RAID1:          2.00GiB
>

Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Fri, Dec 01, 2017 at 21:36:14 +, Hugo Mills wrote:

>The thing I'd first go looking for here is some rogue process
> writing lots of data. I've had something like this happen to me
> before, a few times. First, I'd look for large files with "du -ms /* |
> sort -n", then work down into the tree until you find them.

I already did a handful of searches (mounting parent node in separate
directory and diving into default working subvolume on order to unhide
possible things covered by any other mounts on top of actual root fs).
That's how it looks like:

[~/test/@]#  du -sh . 
15G .

>If that doesn't show up anything unusually large, then lsof to look
> for open but deleted files (orphans) which are still being written to
> by some process.

No (deleted) files, the only activity on iotop are internals...

  174 be/4 root   15.64 K/s    3.67 M/s  0.00 %  5.88 % [btrfs-transacti]
 1439 be/4 root    0.00 B/s 1173.22 K/s  0.00 %  0.00 % [kworker/u8:8]

Only systemd-journald is writing, but /var/log is mounted on a separate
ext3 partition (with journald restarted after the mount); this is also
confirmed by looking into the separate mount. Anyway, it can't be
open-but-deleted files, as the usage doesn't change after booting into
emergency mode. The worst thing is that the 8 GB was lost during the
night, when nothing except the stats collector was running.

As already said, this is not the classical "Linux eats my HDD" problem.

>This is very likely _not_ to be a btrfs problem, but instead some
> runaway process writing lots of crap very fast. Log files are probably
> the most plausible location, but not the only one.

That would be visible in iostat or /proc/diskstats - it isn't. The free
space disappears without being physically written, which means it is
some allocation problem.


I also created a list of files modified between the snapshots with:

find test/@ -xdev -newer some_reference_file_inside_snapshot

and there is nothing bigger than a few MBs.
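
A variant that also sorts the modified files by size, to be sure nothing
big hides in there (GNU find assumed, same placeholder reference file):

find test/@ -xdev -type f -newer some_reference_file_inside_snapshot \
    -printf '%s %p\n' | sort -n | tail -n 20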


I've changed the snapshots to rw and removed some data from all the
instances: 4.8 GB in two ISO images and 5 GB-limited .ccache directory.
After this I got 11 GB freed, so the numbers are fine.

#  btrfs fi usage /
Overall:
Device size:                 128.00GiB
Device allocated:            117.19GiB
Device unallocated:           10.81GiB
Device missing:                  0.00B
Used:                        103.56GiB
Free (estimated):             11.19GiB  (min: 11.14GiB)
Data ratio:                       1.98
Metadata ratio:                   2.00
Global reserve:              146.08MiB  (used: 0.00B)

Data,single: Size:1.19GiB, Used:1.18GiB
   /dev/sda2   1.07GiB
   /dev/sdb2 132.00MiB

Data,RAID1: Size:55.97GiB, Used:50.30GiB
   /dev/sda2  55.97GiB
   /dev/sdb2  55.97GiB

Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
   /dev/sda2   2.00GiB
   /dev/sdb2   2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  32.00MiB

Unallocated:
   /dev/sda2   4.93GiB
   /dev/sdb2   5.87GiB

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID:   388
>> # btrfs fi du -s ./snapshot-171125 
>>  Total   Exclusive  Set shared  Filename
>>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>> 
>> How is that possible? This doesn't even remotely relate to 7.15 GiB
>> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>> And the same happens with other snapshots, much more exclusive data
>> shown in qgroup than actually found in files. So if not files, where
>> is that space wasted? Metadata?
> 
>Personally, I'd trust qgroups' output about as far as I could spit
> Belgium(*).

Well, there is something wrong here, as after removing the .ccache
directories inside all the snapshots the 'excl' values decreased
...except for the last snapshot (the list below is short by ~40 snapshots
that have 2 GB excl in total):

qgroupid     rfer       excl
--------     ----       ----
0/260        12.25GiB   3.22GiB    from 170712 - first snapshot
0/312        17.54GiB   4.56GiB    from 170811
0/366        25.59GiB   2.44GiB    from 171028
0/370        23.27GiB   59.46MiB   from 171118 - prev snapshot
0/388        21.69GiB   7.16GiB    from 171125 - last snapshot
0/291        24.29GiB   9.77GiB    default subvolume


[~/test/snapshot-171125]#  du -sh .
15G .


After changing back to ro I tested how much data really has changed
between the previous and last snapshot:

[~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
At subvol snapshot-171125
74.2MiB 0:00:32 [2.28MiB/s]

This means there can't be 7 GiB of exclusive data in the last snapshot.

Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to 

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 00:15, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid     rfer       excl
> --------     ----       ----
> 0/5          16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB   305.79MiB
> 0/298        13.44GiB   312.74MiB
> 0/327        23.79GiB   427.13MiB
> 0/331        23.93GiB   930.51MiB
> 0/260        12.25GiB   3.22GiB
> 0/312        19.70GiB   4.56GiB
> 0/388        28.75GiB   7.15GiB
> 0/291        30.60GiB   9.01GiB    <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

I assume there is program eating up the space.
Not btrfs itself.

> 
> 
> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>  Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125

That's the difference between how sub show and quota works.

For quota, it's a per-root owner check.
That means even if a file extent is shared between different inodes, as
long as all the inodes are inside the same subvolume, it's counted as
exclusive. And if any part of the file extent belongs to another
subvolume, then it's counted as shared.

For fi du, it's per-inode owner check. (The exact behavior is a little
more complex, I'll skip such corner case to make it a little easier to
understand).

That's to say, if one file extent is shared by different inodes, then
it's counted as shared, no matter if these inodes belong to different or
the same subvolume.

That's to say, "fi du" has a looser condition for "shared" calculation,
and that should explain why you have 20+G shared.

Thanks,
Qu


> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?
> 
> btrfs-progs-4.12 running on Linux 4.9.46.
> 
> best regards,
> 





Re: exclusive subvolume space missing

2017-12-01 Thread Hugo Mills
On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid     rfer       excl
> --------     ----       ----
> 0/5          16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333        24.53GiB   305.79MiB
> 0/298        13.44GiB   312.74MiB
> 0/327        23.79GiB   427.13MiB
> 0/331        23.93GiB   930.51MiB
> 0/260        12.25GiB   3.22GiB
> 0/312        19.70GiB   4.56GiB
> 0/388        28.75GiB   7.15GiB
> 0/291        30.60GiB   9.01GiB    <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times. First, I'd look for large files with "du -ms /* |
sort -n", then work down into the tree until you find them.

   If that doesn't show up anything unusually large, then lsof to look
for open but deleted files (orphans) which are still being written to
by some process.
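
   A quick way to do that (plain lsof, nothing btrfs-specific):

lsof -nP +L1

which lists open files with a link count below one, i.e. deleted but
still held open by some process.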

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.

> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>  Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence indended to Belgium.

-- 
Hugo Mills | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: exclusive subvolume space missing

2017-12-01 Thread Duncan
Tomasz Pala posted on Fri, 01 Dec 2017 17:15:55 +0100 as excerpted:

> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /

Scary.

> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:

I don't use quotas so won't claim working knowledge or an explanation of
that side of things, however...
> 
> btrfs-progs-4.12 running on Linux 4.9.46.

Until quite recently btrfs quotas were too buggy to recommend for use.
While the known blocker-level bugs are now fixed, scaling and real-world
performance are still an issue, and AFAIK, the fixes didn't make 4.9 and
may not be backported as the feature was simply known to be broken beyond
reliable usability at that point.

Based on comments in other threads here, I /think/ the critical quota
fixes hit 4.10, but of course not being an LTS, 4.10 is long out of support.
I'd suggest either turning off and forgetting about quotas since it doesn't
appear you actually need them, or upgrading to at least 4.13 and keeping
current, or the LTS 4.14 if you want to stay on the same kernel series for
awhile.

As for the scaling and performance issues, during normal/generic filesystem
use things are generally fine; it's various btrfs maintenance commands such
as balance, snapshot deletion, and btrfs check, that have the scaling
issues, and they have /some/ scaling issues even without quotas, it's just
that quotas makes the problem *much* worse.  One workaround for balance
and snapshot deletion is to temporarily disable quotas while the job is
running, then reenable (and rescan if necessary, as I don't use the feature
here I'm not sure whether it is).  That can literally turn a job that was
looking to take /weeks/ due to the scaling issue, into a job of hours.
Unfortunately, the sorts of conditions that would trigger running a btrfs
check don't lend themselves to the same sort of workaround, so not having
quotas on at all is the only workaround there.
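
In practice the workaround looks something like this (sketch; substitute
whatever balance filters you're actually running):

btrfs quota disable /
btrfs balance start -dusage=50 /
btrfs quota enable /
btrfs quota rescan -w /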


As to your space being eaten problem, the output of btrfs filesystem usage
(and perhaps btrfs device usage if it's a multi-device btrfs) could be
really helpful here, much more so than quota reports if it's a btrfs
issue, or to help eliminate btrfs as the problem if it's not.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
Hello,

I got a problem with btrfs running out of space (not THE
Internet-wide, well known issues with interpretation).

The problem is: something eats the space while not running anything that
justifies this. There were 18 GB free space available, suddenly it
dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
right now, just as I'm writing this e-mail:

/dev/sda2        64G   63G  452M 100% /
/dev/sda2        64G   63G  365M 100% /
/dev/sda2        64G   63G  316M 100% /
/dev/sda2        64G   63G  287M 100% /
/dev/sda2        64G   63G  268M 100% /
/dev/sda2        64G   63G  239M 100% /
/dev/sda2        64G   63G  230M 100% /
/dev/sda2        64G   63G  182M 100% /
/dev/sda2        64G   63G  163M 100% /
/dev/sda2        64G   64G  153M 100% /
/dev/sda2        64G   64G  143M 100% /
/dev/sda2        64G   64G   96M 100% /
/dev/sda2        64G   64G   88M 100% /
/dev/sda2        64G   64G   57M 100% /
/dev/sda2        64G   64G   25M 100% /

while my rough calculations show, that there should be at least 10 GB of
free space. After enabling quotas it is somehow confirmed:

# btrfs qgroup sh --sort=excl / 
qgroupid     rfer       excl
--------     ----       ----
0/5          16.00KiB   16.00KiB
[30 snapshots with about 100 MiB excl]
0/333        24.53GiB   305.79MiB
0/298        13.44GiB   312.74MiB
0/327        23.79GiB   427.13MiB
0/331        23.93GiB   930.51MiB
0/260        12.25GiB   3.22GiB
0/312        19.70GiB   4.56GiB
0/388        28.75GiB   7.15GiB
0/291        30.60GiB   9.01GiB    <- this is the running one

This is about 30 GB total excl (didn't find a switch to sum this up). I
know I can't just add 'excl' to get usage, so I tried to pinpoint the
exact files that occupy space in 0/388 exclusively (this is the last
snapshot taken; all of the snapshots are created from the running fs).


Now, the weird part for me is exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
Subvolume ID:   388
# btrfs fi du -s ./snapshot-171125 
 Total   Exclusive  Set shared  Filename
  21.50GiB    63.35MiB    20.77GiB  snapshot-171125


How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than is actually found in the files. So if not files,
where is that space wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
-- 
Tomasz Pala 