Re: exclusive subvolume space missing

2017-12-01 Thread Duncan
Tomasz Pala posted on Sat, 02 Dec 2017 01:53:39 +0100 as excerpted:

> #  btrfs fi usage /
> Overall:
> Device size:         128.00GiB
> Device allocated:    117.19GiB
> Device unallocated:   10.81GiB
> Device missing:          0.00B
> Used:                103.56GiB
> Free (estimated):     11.19GiB  (min: 11.14GiB)
> Data ratio:               1.98
> Metadata ratio:           2.00
> Global reserve:      146.08MiB  (used: 0.00B)
> 
> Data,single: Size:1.19GiB, Used:1.18GiB
>/dev/sda2   1.07GiB
>/dev/sdb2 132.00MiB
> 
> Data,RAID1: Size:55.97GiB, Used:50.30GiB
>/dev/sda2  55.97GiB
>/dev/sdb2  55.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
>/dev/sda2   2.00GiB
>/dev/sdb2   2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>/dev/sda2  32.00MiB
>/dev/sdb2  32.00MiB
> 
> Unallocated:
>/dev/sda2   4.93GiB
>/dev/sdb2   5.87GiB

OK, is this supposed to be raid1 or single data?  The above shows
metadata as all raid1, while some data is single (tho most is raid1), and
while old mkfs used to create unused single chunks on raid1 that had to
be removed manually via balance, those single data chunks aren't unused.

Which means if it's supposed to be raid1, you don't have redundancy on
that single data.

Assuming the intent is raid1, I'd recommend doing...

btrfs balance start -dconvert=raid1,soft /

Probably disable quotas at least temporarily while you do so, tho, as
they don't scale well with balance and make it take much longer.

That should go reasonably fast as it's only a bit over 1 GiB on the one
device, and 132 MiB on the other (from your btrfs device usage), and the
soft allows it to skip chunks that don't need conversion.

It should kill those single entries and even up usage on both devices,
along with making the filesystem much more tolerant of loss of one of
the two devices.
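A hedged sketch of that sequence as a script (the MNT variable and the
dry-run guard are my additions, not from the thread; it defaults to
printing the commands rather than executing them, so it is safe to run
as-is):

```shell
#!/bin/sh
# Sketch of the convert-to-raid1 sequence recommended above.
# Defaults to dry-run: commands are printed, not executed.
MNT="${MNT:-/}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

# Quotas scale badly with balance, so drop them for the duration.
run btrfs quota disable "$MNT"
# "soft" skips chunks that are already raid1, so only the leftover
# single data chunks (a bit over 1 GiB here) get rewritten.
run btrfs balance start -dconvert=raid1,soft "$MNT"
# Afterwards there should be no Data,single line left in the output.
run btrfs filesystem df "$MNT"
```

Run it with `DRY_RUN=0` (as root, on the real mountpoint) to actually
perform the conversion.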


Other than that, what we can see from the above is that it's a relatively
small filesystem, 64 GiB each on a pair of devices, raid1 but for the
above.

We also see that the allocated-chunks vs. chunk-usage spread isn't /too/
bad, tho that's a somewhat common problem.  However, given the relatively
small 64-GiB-per-device two-device raid1 filesystem, there is some slack,
about 5 GiB worth, in that raid1 data, that you can recover.

btrfs balance start -dusage=N /

Where N represents a percentage full, so 0-100.  Normally, smaller
values of N complete much faster, and have the most effect if they're
enough, because at say 10% usage, ten 90%-empty chunks can be rewritten
into a single 100%-full chunk.

The idea is to start with a small N value since it completes fast, and
redo with higher values as necessary to shrink the total data chunk
allocated value toward usage.  I too run relatively small btrfs raid1s
and would suggest trying N=5, 20, 40, 70, until the spread between
used and total is under 2 gigs, under a gig if you want to go that far
(nominal data chunk size is a gig so even a full balance will be unlikely
to get you a spread less than that).  Over 70 likely won't get you much
so isn't worth it.

That should return the excess to unallocated, leaving the filesystem
able to use the freed space for data or metadata chunks as necessary,
tho you're unlikely to see an increase in available space in (non-btrfs)
df or similar.  If the unallocated value gets down below 1 GiB you may
have trouble freeing space, since balance needs room to write the new
chunk it fills in order to free the others.  So you probably want to
keep an eye on this and rebalance if unallocated gets under 2-3 gigs,
assuming of course that there's slack between used and total that
/can/ be freed by a rebalance.
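The rule of thumb above can be sketched as a bit of arithmetic.  Here
`spread_gib` is a hypothetical helper of my own naming (not a btrfs
tool), fed the Size/Used GiB values that `btrfs fi df` reports:

```shell
#!/bin/sh
# Compute the data "spread" (slack a -dusage balance could return
# to unallocated) from btrfs fi df style Size/Used values in GiB.
spread_gib() {
    awk -v total="$1" -v used="$2" 'BEGIN { printf "%.2f\n", total - used }'
}

# Numbers from the report above: Data,RAID1: Size:55.97GiB, Used:50.30GiB
spread=$(spread_gib 55.97 50.30)
echo "data spread: ${spread} GiB"

# Escalating -dusage steps are only worth it while the spread stays
# above ~2 GiB (the point of diminishing returns mentioned above).
if awk -v s="$spread" 'BEGIN { exit !(s > 2) }'; then
    echo "spread > 2 GiB: start with btrfs balance start -dusage=5 /"
fi
```

On a real system you would re-read the Size/Used numbers after each
`-dusage=N` pass (N = 5, 20, 40, 70) and stop once the spread drops
below ~2 GiB.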

FWIW the same can be done with metadata using -musage=, with metadata
chunks being 256 MiB nominal, but keep in mind that global reserve is
allocated from metadata space but doesn't count as used, so you typically
can't get the spread down below half a GiB or so.  And in most cases
it's data chunks that get the big spread, not metadata, so it's much
more common to have to do -d for data than -m for metadata.


All that said, the numbers don't show a runaway spread between total
and used, so while this might help, it's not going to fix the primary
space being eaten problem of the thread, as I had hoped it might.

Additionally, at 2 GiB total per device, metadata chunks aren't runaway
consuming your space either, as I'd suspect they might if the problem were
for instance atime updates, so while noatime is certainly recommended and
might help some, it doesn't appear to be a primary contributor to the
problem either.


The other possibility that comes to mind here has to do with btrfs COW
write patterns...

Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
the GiB+ example typically used due to the filesystem size 

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 10:21, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:
> 
>>> Actually I should rephrase the problem:
>>>
>>> "snapshot has taken 8 GB of space despite nothing has altered source 
>>> subvolume"
> 
> Actually, after:
> 
> # btrfs balance start -v -dconvert=raid1 /
> ctrl-c on block group 35G/113G
> # btrfs balance start -v -dconvert=raid1,soft /
> # btrfs balance start -v -dusage=55 /
> Done, had to relocate 1 out of 56 chunks
> # btrfs balance start -v -musage=55 /
> Done, had to relocate 2 out of 55 chunks
> 
> and after waiting a few minutes... the 8 GB I lost yesterday is back:
> 
> #  btrfs fi sh /
> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
> Total devices 2 FS bytes used 44.10GiB
> devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
> devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2
> 
> #  btrfs fi usage /
> Overall:
> Device size:         128.00GiB
> Device allocated:    108.00GiB
> Device unallocated:   20.00GiB
> Device missing:          0.00B
> Used:                 88.19GiB
> Free (estimated):     18.75GiB  (min: 18.75GiB)
> Data ratio:               2.00
> Metadata ratio:           2.00
> Global reserve:      131.14MiB  (used: 0.00B)
> 
> Data,RAID1: Size:51.97GiB, Used:43.22GiB
>/dev/sda2  51.97GiB
>/dev/sdb2  51.97GiB
> 
> Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
>/dev/sda2   2.00GiB
>/dev/sdb2   2.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:16.00KiB
>/dev/sda2  32.00MiB
>/dev/sdb2  32.00MiB
> 
> Unallocated:
>/dev/sda2  10.00GiB
>/dev/sdb2  10.00GiB
> 
> #  btrfs dev usage /
> /dev/sda2, ID: 1
>Device size:64.00GiB
>Device slack:  0.00B
>Data,RAID1: 51.97GiB
>Metadata,RAID1:  2.00GiB
>System,RAID1:   32.00MiB
>Unallocated:10.00GiB
> 
> /dev/sdb2, ID: 2
>Device size:64.00GiB
>Device slack:  0.00B
>Data,RAID1: 51.97GiB
>Metadata,RAID1:  2.00GiB
>System,RAID1:   32.00MiB
>Unallocated:10.00GiB
> 
> #  btrfs fi df /
> Data, RAID1: total=51.97GiB, used=43.22GiB
> System, RAID1: total=32.00MiB, used=16.00KiB
> Metadata, RAID1: total=2.00GiB, used=895.69MiB
> GlobalReserve, single: total=131.14MiB, used=0.00B
> 
> # df
> /dev/sda2        64G   45G   19G  71% /
> 
> However, the difference is in the active root fs:
> 
> -0/291  24.29GiB   9.77GiB
> +0/291  15.99GiB  76.00MiB
> 
> Still, 45G used, while there is (if I counted this correctly) 25G of data...
> 
>> Then please provide the correct qgroup numbers.
>>
>> The correct numbers can be obtained with:
>> # btrfs quota enable <mnt>
>> # btrfs quota rescan -w <mnt>
>> # btrfs qgroup show -prce --sync <mnt>
> 
> OK, just added the --sort=excl:
> 
> qgroupid   rfer      excl      max_rfer  max_excl  parent  child
> --------   --------  --------  --------  --------  ------  -----
> 0/5        16.00KiB  16.00KiB  none      none      ---     ---
> 0/361      22.57GiB   7.00MiB  none      none      ---     ---
> 0/358      22.54GiB   7.50MiB  none      none      ---     ---
> 0/343      22.36GiB   7.84MiB  none      none      ---     ---
> 0/345      22.49GiB   8.05MiB  none      none      ---     ---
> 0/357      22.50GiB   9.27MiB  none      none      ---     ---
> 0/360      22.57GiB  10.27MiB  none      none      ---     ---
> 0/344      22.48GiB  11.09MiB  none      none      ---     ---
> 0/359      22.55GiB  12.57MiB  none      none      ---     ---
> 0/362      22.59GiB  22.96MiB  none      none      ---     ---
> 0/302      12.87GiB  31.23MiB  none      none      ---     ---
> 0/428      15.96GiB  38.68MiB  none      none      ---     ---
> 0/294      11.09GiB  47.86MiB  none      none      ---     ---
> 0/336      21.80GiB  49.59MiB  none      none      ---     ---
> 0/300      12.56GiB  51.43MiB  none      none      ---     ---
> 0/342      22.31GiB  52.93MiB  none      none      ---     ---
> 0/333      21.71GiB  54.54MiB  none      none      ---     ---
> 0/363      22.63GiB  58.83MiB  none      none      ---     ---
> 0/370      23.27GiB  59.46MiB  none      none      ---     ---
> 0/305      13.01GiB  61.47MiB  none      none      ---     ---
> 0/331      21.61GiB  61.49MiB  none      none      ---     ---
> 0/334      21.78GiB  62.95MiB  none      none      ---     ---
> 0/306      13.04GiB  64.11MiB  none      none      ---     ---
> 0/304      12.96GiB  64.90MiB  none      none      ---

Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 09:47:19 +0800, Qu Wenruo wrote:

>> Actually I should rephrase the problem:
>> 
>> "snapshot has taken 8 GB of space despite nothing has altered source 
>> subvolume"

Actually, after:

# btrfs balance start -v -dconvert=raid1 /
ctrl-c on block group 35G/113G
# btrfs balance start -v -dconvert=raid1,soft /
# btrfs balance start -v -dusage=55 /
Done, had to relocate 1 out of 56 chunks
# btrfs balance start -v -musage=55 /
Done, had to relocate 2 out of 55 chunks

and after waiting a few minutes... the 8 GB I lost yesterday is back:

#  btrfs fi sh /
Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
Total devices 2 FS bytes used 44.10GiB
devid    1 size 64.00GiB used 54.00GiB path /dev/sda2
devid    2 size 64.00GiB used 54.00GiB path /dev/sdb2

#  btrfs fi usage /
Overall:
Device size:         128.00GiB
Device allocated:    108.00GiB
Device unallocated:   20.00GiB
Device missing:          0.00B
Used:                 88.19GiB
Free (estimated):     18.75GiB  (min: 18.75GiB)
Data ratio:               2.00
Metadata ratio:           2.00
Global reserve:      131.14MiB  (used: 0.00B)

Data,RAID1: Size:51.97GiB, Used:43.22GiB
   /dev/sda2  51.97GiB
   /dev/sdb2  51.97GiB

Metadata,RAID1: Size:2.00GiB, Used:895.69MiB
   /dev/sda2   2.00GiB
   /dev/sdb2   2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  32.00MiB

Unallocated:
   /dev/sda2  10.00GiB
   /dev/sdb2  10.00GiB

#  btrfs dev usage /
/dev/sda2, ID: 1
   Device size:64.00GiB
   Device slack:  0.00B
   Data,RAID1: 51.97GiB
   Metadata,RAID1:  2.00GiB
   System,RAID1:   32.00MiB
   Unallocated:10.00GiB

/dev/sdb2, ID: 2
   Device size:64.00GiB
   Device slack:  0.00B
   Data,RAID1: 51.97GiB
   Metadata,RAID1:  2.00GiB
   System,RAID1:   32.00MiB
   Unallocated:10.00GiB

#  btrfs fi df /
Data, RAID1: total=51.97GiB, used=43.22GiB
System, RAID1: total=32.00MiB, used=16.00KiB
Metadata, RAID1: total=2.00GiB, used=895.69MiB
GlobalReserve, single: total=131.14MiB, used=0.00B

# df
/dev/sda2        64G   45G   19G  71% /

However, the difference is in the active root fs:

-0/291  24.29GiB   9.77GiB
+0/291  15.99GiB  76.00MiB

Still, 45G used, while there is (if I counted this correctly) 25G of data...

> Then please provide the correct qgroup numbers.
> 
> The correct numbers can be obtained with:
> # btrfs quota enable <mnt>
> # btrfs quota rescan -w <mnt>
> # btrfs qgroup show -prce --sync <mnt>

OK, just added the --sort=excl:

qgroupid   rfer      excl      max_rfer  max_excl  parent  child
--------   --------  --------  --------  --------  ------  -----
0/5        16.00KiB  16.00KiB  none      none      ---     ---
0/361      22.57GiB   7.00MiB  none      none      ---     ---
0/358      22.54GiB   7.50MiB  none      none      ---     ---
0/343      22.36GiB   7.84MiB  none      none      ---     ---
0/345      22.49GiB   8.05MiB  none      none      ---     ---
0/357      22.50GiB   9.27MiB  none      none      ---     ---
0/360      22.57GiB  10.27MiB  none      none      ---     ---
0/344      22.48GiB  11.09MiB  none      none      ---     ---
0/359      22.55GiB  12.57MiB  none      none      ---     ---
0/362      22.59GiB  22.96MiB  none      none      ---     ---
0/302      12.87GiB  31.23MiB  none      none      ---     ---
0/428      15.96GiB  38.68MiB  none      none      ---     ---
0/294      11.09GiB  47.86MiB  none      none      ---     ---
0/336      21.80GiB  49.59MiB  none      none      ---     ---
0/300      12.56GiB  51.43MiB  none      none      ---     ---
0/342      22.31GiB  52.93MiB  none      none      ---     ---
0/333      21.71GiB  54.54MiB  none      none      ---     ---
0/363      22.63GiB  58.83MiB  none      none      ---     ---
0/370      23.27GiB  59.46MiB  none      none      ---     ---
0/305      13.01GiB  61.47MiB  none      none      ---     ---
0/331      21.61GiB  61.49MiB  none      none      ---     ---
0/334      21.78GiB  62.95MiB  none      none      ---     ---
0/306      13.04GiB  64.11MiB  none      none      ---     ---
0/304      12.96GiB  64.90MiB  none      none      ---     ---
0/303      12.94GiB  68.39MiB  none      none      ---     ---
0/367      23.20GiB  68.52MiB  none      none      ---     ---
0/366      23.22GiB  69.79MiB  none      none      ---     ---
0/364      22.63GiB  72.03MiB

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 09:43, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:
> 
>>> qgroupid   rfer      excl
>>> --------   --------  --------
>>> 0/260      12.25GiB   3.22GiB   from 170712 - first snapshot
>>> 0/312      17.54GiB   4.56GiB   from 170811
>>> 0/366      25.59GiB   2.44GiB   from 171028
>>> 0/370      23.27GiB  59.46MiB   from 171118 - prev snapshot
>>> 0/388      21.69GiB   7.16GiB   from 171125 - last snapshot
>>> 0/291      24.29GiB   9.77GiB   default subvolume
>>
>> You may need to manually sync the filesystem (trigger a transaction
>> commitment) to update qgroup accounting.
> 
> The data I've pasted were just calculated.
> 
>>> # btrfs quota enable /
>>> # btrfs qgroup show /
>>> WARNING: quota disabled, qgroup data may be out of date
>>> [...]
>>> # btrfs quota enable /  - for the second time!
>>> # btrfs qgroup show /
>>> WARNING: qgroup data inconsistent, rescan recommended
>>
>> Please wait for the rescan to finish, or none of the numbers will be
>> correct.
> 
> Here I was pointing out that the first "quota enable" resulted in a "quota
> disabled" warning until I enabled it once again.
> 
>> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
>> ensure you understand all the limitations.
> 
> I probably won't understand them all, but this is not an issue of my
> concern as I don't use it. There is simply no other way I am aware of that
> could show me per-subvolume stats. Well, no straightforward way, as the
> hard way I'm using (btrfs send) confirms the problem.

Unfortunately, send doesn't count everything.

The most common thing is, send doesn't count extent booking space.
Try the following commands:

# fallocate -l 1G <file>
# mkfs.btrfs -f <file>
# mount <file> <mnt>
# btrfs subv create <mnt>/subv1
# xfs_io -f -c "pwrite 0 128M" -c "sync" <mnt>/subv1/file1
# xfs_io -f -c "fpunch 0 127M" -c "sync" <mnt>/subv1/file1
# btrfs subv snapshot -r <mnt>/subv1 <mnt>/snapshot
# btrfs send <mnt>/snapshot

You will only get the 1M of data, while the file still takes 128M of
space on-disk.

Btrfs extent booking will only free the whole extent if and only if no
inode refers to *ANY* part of the extent.

Even if only 1M of a 128M file extent is used, it still takes 128M of
space on-disk.

And that's what send can't tell you.
And that's also what qgroup can tell you.

That's also why I need *CORRECT* qgroup numbers to further investigate
the problem.
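The bookkeeping described above can be reduced to a toy model (my
illustration, not a btrfs tool): an extent stays fully allocated on disk
as long as any inode still references any byte of it.

```shell
#!/bin/sh
# Toy model of btrfs extent booking for the fpunch example above.
extent_mib=128    # size of the extent written by pwrite
punched_mib=127   # range dropped by fpunch

referenced=$((extent_mib - punched_mib))   # what btrfs send streams
if [ "$referenced" -gt 0 ]; then
    ondisk=$extent_mib    # whole extent stays pinned by the 1M tail
else
    ondisk=0              # freed only when nothing references it
fi

echo "send streams:       ${referenced}M"
echo "on-disk allocation: ${ondisk}M"
```

This is the 1M-streamed vs. 128M-allocated gap that send cannot see but
qgroup exclusive numbers do account for.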

> 
> You could simply remove all the quota results I've posted and there will
> still be the underlying problem: the 25 GB of data I got occupies 52 GB.

If you only want to know why your "25G" of data occupies 52G on disk,
the above is one of the possible explanations.

(And I think I should put it into btrfs(5), although I highly doubt
users will really read it)

You could try to defrag, but I'm not sure if defrag works well in the
multi-subvolume case.

> At least one recent snapshot, taken after some minor (<100 MB) changes
> to the subvolume, which has undergone only minor changes since then,
> occupied 8 GB during one night when the entire system was idling.

The only possible method to fully isolate all the disturbing factors is
to get rid of snapshots.

Build the subvolume from scratch (without even cp --reflink from another
subvolume), then test what happens.

Only in that case can you trust vanilla du (if you don't do any reflinks).
And although you can always trust the qgroup numbers, such a subvolume
built from scratch makes the exclusive number equal the referenced number,
making debugging a little easier.

Thanks,
Qu

> 
> This was crosschecked on files metadata (mtimes compared) and 'du'
> results.
> 
> 
> As a last-resort I've rebalanced the disk (once again), this time with
> -dconvert=raid1 (to get rid of the single residue).
> 





Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 09:23, Tomasz Pala wrote:
> On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:
> 
>> I assume there is program eating up the space.
>> Not btrfs itself.
> 
> Very doubtful. I've encountered an ext3 "eating" problem once that couldn't
> be found by lsof on a 3.4.75 kernel, but the space came back after killing
> Xorg. The system I'm having problems with now is very recent; the space
> doesn't return after reboot/emergency mode and doesn't add up with the files.

Unlike vanilla df or "fi usage" or "fi df", btrfs quota only counts
on-disk extents.

That's to say, reserved space won't contribute to qgroups, unless one is
using an anonymous file, which is opened but unlinked, so no one can
access it except the owner.
(Which I doubt is your case)

That should make quota the best tool to debug your problem.
(As long as you follow the various limitations of btrfs quota; especially,
you need to sync or use the --sync option to show qgroup numbers)

> 
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID:   388
>>> # btrfs fi du -s ./snapshot-171125 
>>>  Total   Exclusive  Set shared  Filename
>>>   21.50GiB63.35MiB20.77GiB  snapshot-171125
>>
>> That's the difference between how sub show and quota works.
>>
>> For quota, it's per-root owner check.
> 
> Just to be clear: I've enabled quota _only_ to see subvolume usage on the
> spot. And exclusive data - the more detailed approach I've described in the
> e-mail I sent a minute ago.
> 
>> That means even if a file extent is shared between different inodes, if
>> all the inodes are inside the same subvolume, it's counted as exclusive.
>> And if any of the file extents belongs to another subvolume, then it's
>> counted as shared.
> 
> Good to know, but this is almost UID0-only system. There are system
> users (vendor provided) and 2 ssh accounts for su, but nobody uses this
> machine for daily work. The quota values were the last tool I could find
> to debug.
> 
>> For fi du, it's per-inode owner check. (The exact behavior is a little
>> more complex, I'll skip such corner case to make it a little easier to
>> understand).
>>
>> That's to say, if one file extent is shared by different inodes, then
>> it's counted as shared, no matter if these inodes belong to different or
>> the same subvolume.
>>
>> That's to say, "fi du" has a looser condition for "shared" calculation,
>> and that should explain why you have 20+G shared.
> 
> There shouldn't be many multi-inode extents inside a single subvolume, as
> this is a mostly fresh system, with no containers, no deduplication;
> snapshots are taken from the same running system before or after some more
> important change is done. By 'change' I mean altering text config files
> mostly (plus etckeeper's git metadata), so the volume of differences is
> extremely low. Actually most of the diffs between subvolumes come from
> updating distro packages. There were not many reflink copies made on this
> partition, only one kernel source compiled (.ccache files removed today).
> So this partition is as clean as it could be after almost 5 months in use.
> 
> Actually I should rephrase the problem:
> 
> "snapshot has taken 8 GB of space despite nothing has altered source 
> subvolume"

Then please provide the correct qgroup numbers.

The correct numbers can be obtained with:
# btrfs quota enable <mnt>
# btrfs quota rescan -w <mnt>
# btrfs qgroup show -prce --sync <mnt>

Rescan and --sync are important to get the correct number.
(while rescan can take a long long time to finish)

And furthermore, please ensure that all deleted files are really deleted.
Btrfs delays file and subvolume deletion, so you may need to sync several
times or use "btrfs subv sync" to ensure deleted files are gone.

(vanilla du won't tell you if such delayed file deletion is really done)
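The flush-then-measure sequence above can be sketched as a script (the
MNT variable and dry-run default are my additions; with DRY_RUN=1, the
default, the commands are only printed, never executed):

```shell
#!/bin/sh
# Flush delayed deletions, then take qgroup numbers that can be trusted.
MNT="${MNT:-/}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    # Print instead of execute unless DRY_RUN is explicitly disabled.
    if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi
}

run sync                                   # commit pending transactions
run btrfs subvolume sync "$MNT"            # wait out delayed subvolume deletion
run btrfs quota rescan -w "$MNT"           # rescan and wait for completion
run btrfs qgroup show -prce --sync "$MNT"  # numbers are meaningful only now
```

Run with `DRY_RUN=0` on a real mountpoint; the rescan may take a long
time on a large filesystem.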

Thanks,
Qu
> 





Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid   rfer      excl
>> --------   --------  --------
>> 0/260      12.25GiB   3.22GiB   from 170712 - first snapshot
>> 0/312      17.54GiB   4.56GiB   from 170811
>> 0/366      25.59GiB   2.44GiB   from 171028
>> 0/370      23.27GiB  59.46MiB   from 171118 - prev snapshot
>> 0/388      21.69GiB   7.16GiB   from 171125 - last snapshot
>> 0/291      24.29GiB   9.77GiB   default subvolume
> 
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted were just calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable /   - for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
> 
> Please wait for the rescan to finish, or none of the numbers will be
> correct.

Here I was pointing out that the first "quota enable" resulted in a "quota
disabled" warning until I enabled it once again.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitations.

I probably won't understand them all, but this is not an issue of my
concern as I don't use it. There is simply no other way I am aware of that
could show me per-subvolume stats. Well, no straightforward way, as the
hard way I'm using (btrfs send) confirms the problem.

You could simply remove all the quota results I've posted and there will
still be the underlying problem: the 25 GB of data I got occupies 52 GB.
At least one recent snapshot, taken after some minor (<100 MB) changes
to the subvolume, which has undergone only minor changes since then,
occupied 8 GB during one night when the entire system was idling.

This was crosschecked on files metadata (mtimes compared) and 'du'
results.


As a last-resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the single residue).

-- 
Tomasz Pala 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 "eating" problem once that couldn't
be found by lsof on a 3.4.75 kernel, but the space came back after killing
Xorg. The system I'm having problems with now is very recent; the space
doesn't return after reboot/emergency mode and doesn't add up with the files.

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID:   388
>> # btrfs fi du -s ./snapshot-171125 
>>  Total   Exclusive  Set shared  Filename
>>   21.50GiB63.35MiB20.77GiB  snapshot-171125
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.

Just to be clear: I've enabled quota _only_ to see subvolume usage on the
spot. And exclusive data - the more detailed approach I've described in the
e-mail I sent a minute ago.

> That means even if a file extent is shared between different inodes, if
> all the inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extents belongs to another subvolume, then it's
> counted as shared.

Good to know, but this is almost UID0-only system. There are system
users (vendor provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
to debug.

> For fi du, it's per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner case to make it a little easier to
> understand).
> 
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
> 
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.

There shouldn't be many multi-inode extents inside a single subvolume, as
this is a mostly fresh system, with no containers, no deduplication;
snapshots are taken from the same running system before or after some more
important change is done. By 'change' I mean altering text config files
mostly (plus etckeeper's git metadata), so the volume of differences is
extremely low. Actually most of the diffs between subvolumes come from
updating distro packages. There were not many reflink copies made on this
partition, only one kernel source compiled (.ccache files removed today).
So this partition is as clean as it could be after almost 5 months in use.

Actually I should rephrase the problem:

"snapshot has taken 8 GB of space despite nothing has altered source subvolume"

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo
>>> Now, the weird part for me is exclusive data count:
>>>
>>> # btrfs sub sh ./snapshot-171125
>>> [...]
>>> Subvolume ID:   388
>>> # btrfs fi du -s ./snapshot-171125 
>>>  Total   Exclusive  Set shared  Filename
>>>   21.50GiB63.35MiB20.77GiB  snapshot-171125
>>>
>>> How is that possible? This doesn't even remotely relate to 7.15 GiB
>>> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>>> And the same happens with other snapshots, much more exclusive data
>>> shown in qgroup than actually found in files. So if not files, where
>>> is that space wasted? Metadata?
>>
>>Personally, I'd trust qgroups' output about as far as I could spit
>> Belgium(*).
> 
> Well, there is something wrong here, as after removing the .ccache
> directories inside all the snapshots the 'excl' values decreased
> ...except for the last snapshot (the list below is short by ~40 snapshots
> that have 2 GB excl in total):
> 
> qgroupid   rfer      excl
> --------   --------  --------
> 0/260      12.25GiB   3.22GiB   from 170712 - first snapshot
> 0/312      17.54GiB   4.56GiB   from 170811
> 0/366      25.59GiB   2.44GiB   from 171028
> 0/370      23.27GiB  59.46MiB   from 171118 - prev snapshot
> 0/388      21.69GiB   7.16GiB   from 171125 - last snapshot
> 0/291      24.29GiB   9.77GiB   default subvolume

You may need to manually sync the filesystem (trigger a transaction
commitment) to update qgroup accounting.
> 
> 
> [~/test/snapshot-171125]#  du -sh .
> 15G .
> 
> 
> After changing back to ro I tested how much data really has changed
> between the previous and last snapshot:
> 
> [~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
> At subvol snapshot-171125
> 74.2MiB 0:00:32 [2.28MiB/s]
> 
> This means there can't be 7 GiB of exclusive data in the last snapshot.

Mentioned before, sync the fs first before checking the qgroup numbers.
Or use --sync option along with qgroup show.

> 
> Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
> 5.68GiB 0:03:23 [28.6MiB/s]
> 
> I've created a new snapshot right now to compare it with 171125:
> 75.5MiB 0:00:43 [1.73MiB/s]
> 
> 
> OK, I could even compare all the snapshots in sequence:
> 
> # for i in snapshot-17*; btrfs prop set $i ro true
> # p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null; p="$i"; done
>  1.7GiB 0:00:15 [ 114MiB/s]
> 1.03GiB 0:00:38 [27.2MiB/s]
>  155MiB 0:00:08 [19.1MiB/s]
> 1.08GiB 0:00:47 [23.3MiB/s]
>  294MiB 0:00:29 [ 9.9MiB/s]
>  324MiB 0:00:42 [7.69MiB/s]
> 82.8MiB 0:00:06 [12.7MiB/s]
> 64.3MiB 0:00:05 [11.6MiB/s]
>  137MiB 0:00:07 [19.3MiB/s]
> 85.3MiB 0:00:13 [6.18MiB/s]
> 62.8MiB 0:00:19 [3.21MiB/s]
>  132MiB 0:00:42 [3.15MiB/s]
>  102MiB 0:00:42 [2.42MiB/s]
>  197MiB 0:00:50 [3.91MiB/s]
>  321MiB 0:01:01 [5.21MiB/s]
>  229MiB 0:00:18 [12.3MiB/s]
>  109MiB 0:00:11 [ 9.7MiB/s]
>  139MiB 0:00:14 [9.32MiB/s]
>  573MiB 0:00:35 [15.9MiB/s]
> 64.1MiB 0:00:30 [2.11MiB/s]
>  172MiB 0:00:11 [14.9MiB/s]
> 98.9MiB 0:00:07 [14.1MiB/s]
>   54MiB 0:00:08 [6.17MiB/s]
> 78.6MiB 0:00:02 [32.1MiB/s]
> 15.1MiB 0:00:01 [12.5MiB/s]
> 20.6MiB 0:00:00 [  23MiB/s]
> 20.3MiB 0:00:00 [  23MiB/s]
>  110MiB 0:00:14 [7.39MiB/s]
> 62.6MiB 0:00:11 [5.67MiB/s]
> 65.7MiB 0:00:08 [7.58MiB/s]
>  731MiB 0:00:42 [  17MiB/s]
> 73.7MiB 0:00:29 [ 2.5MiB/s]
>  322MiB 0:00:53 [6.04MiB/s]
>  105MiB 0:00:35 [2.95MiB/s]
> 95.2MiB 0:00:36 [2.58MiB/s]
> 74.2MiB 0:00:30 [2.43MiB/s]
> 75.5MiB 0:00:46 [1.61MiB/s]
> 
> This is 9.3 GB of total diffs between all the snapshots I got.
> Plus 15 GB of initial snapshot means there is about 25 GB used,
> while df reports twice the amount, way too much for overhead:
> /dev/sda2        64G   52G   11G  84% /
> 
> 
> # btrfs quota enable /
> # btrfs qgroup show /
> WARNING: quota disabled, qgroup data may be out of date
> [...]
> # btrfs quota enable /- for the second time!
> # btrfs qgroup show /
> WARNING: qgroup data inconsistent, rescan recommended

Please wait for the rescan to finish, or none of the numbers will be
correct.
(Although they will only be less than the actual occupied space)

It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
ensure you understand all the limitations.

> [...]
> 0/428      15.96GiB  19.23MiB   newly created (now) snapshot
> 
> 
> 
> Assuming the qgroups output is bogus and the space isn't physically
> occupied (which is consistent with the btrfs fi du output and my
> expectation), the question remains: why is that bogus excl removed from
> the available space as reported by df or btrfs fi df/usage? And how to
> reclaim it?

Already explained the difference in another thread.

Thanks,
Qu

> 
> 
> [~/test]#  btrfs device usage /
> /dev/sda2, ID: 1
>Device size:64.00GiB
>Device slack:  0.00B
>Data,single: 1.07GiB
>Data,RAID1: 55.97GiB
>Metadata,RAID1:  2.00GiB
>

Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Fri, Dec 01, 2017 at 21:36:14 +0000, Hugo Mills wrote:

>The thing I'd first go looking for here is some rogue process
> writing lots of data. I've had something like this happen to me
> before, a few times. First, I'd look for large files with "du -ms /* |
> sort -n", then work down into the tree until you find them.

I already did a handful of searches (mounting the parent node in a separate
directory and diving into the default working subvolume in order to unhide
things possibly covered by any other mounts on top of the actual root fs).
That's how it looks:

[~/test/@]#  du -sh . 
15G .

>If that doesn't show up anything unusually large, then lsof to look
> for open but deleted files (orphans) which are still being written to
> by some process.

No (deleted) files; the only activity in iotop is internal...

  174 be/4 root      15.64 K/s     3.67 M/s  0.00 %  5.88 % [btrfs-transacti]
 1439 be/4 root       0.00 B/s  1173.22 K/s  0.00 %  0.00 % [kworker/u8:8]

Only systemd-journald is writing, but /var/log is mounted on a separate
ext3 partition (with journald restarted after the mount); this is also
confirmed by looking into the separate mount. Anyway, these can't be
open-but-deleted files, as the usage doesn't change after booting into
emergency mode. The worst thing is that the 8 GB was lost during the night,
when nothing except the stats collector was running.

As already said, this is not the classical "Linux eats my HDD" problem.

>This is very likely _not_ to be a btrfs problem, but instead some
> runaway process writing lots of crap very fast. Log files are probably
> the most plausible location, but not the only one.

That would be visible in iostat or /proc/diskstats - it isn't. The free
space disappears without being physically written, which means it is
some allocation problem.


I also created a list of files modified between the snapshots with:

find test/@ -xdev -newer some_reference_file_inside_snapshot

and there is nothing bigger than a few MBs.


I've changed the snapshots to rw and removed some data from all the
instances: 4.8 GB in two ISO images and 5 GB-limited .ccache directory.
After this I got 11 GB freed, so the numbers are fine.

#  btrfs fi usage /
Overall:
Device size: 128.00GiB
Device allocated:117.19GiB
Device unallocated:   10.81GiB
Device missing:  0.00B
Used:103.56GiB
Free (estimated): 11.19GiB  (min: 11.14GiB)
Data ratio:   1.98
Metadata ratio:   2.00
Global reserve:  146.08MiB  (used: 0.00B)

Data,single: Size:1.19GiB, Used:1.18GiB
   /dev/sda2   1.07GiB
   /dev/sdb2 132.00MiB

Data,RAID1: Size:55.97GiB, Used:50.30GiB
   /dev/sda2  55.97GiB
   /dev/sdb2  55.97GiB

Metadata,RAID1: Size:2.00GiB, Used:908.61MiB
   /dev/sda2   2.00GiB
   /dev/sdb2   2.00GiB

System,RAID1: Size:32.00MiB, Used:16.00KiB
   /dev/sda2  32.00MiB
   /dev/sdb2  32.00MiB

Unallocated:
   /dev/sda2   4.93GiB
   /dev/sdb2   5.87GiB

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID:   388
>> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
>> 
>> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
>> And the same happens with other snapshots, much more exclusive data
>> shown in qgroup than actually found in files. So if not files, where
>> is that space wasted? Metadata?
> 
>Personally, I'd trust qgroups' output about as far as I could spit
> Belgium(*).

Well, there is something wrong here, as after removing the .ccache
directories inside all the snapshots the 'excl' values decreased
...except for the last snapshot (the list below is short by ~40 snapshots
that have 2 GB excl in total):

qgroupid     rfer      excl
--------     ----      ----
0/260    12.25GiB   3.22GiB  from 170712 - first snapshot
0/312    17.54GiB   4.56GiB  from 170811
0/366    25.59GiB   2.44GiB  from 171028
0/370    23.27GiB  59.46MiB  from 171118 - prev snapshot
0/388    21.69GiB   7.16GiB  from 171125 - last snapshot
0/291    24.29GiB   9.77GiB  default subvolume


[~/test/snapshot-171125]#  du -sh .
15G .


After changing back to ro I tested how much data really has changed
between the previous and last snapshot:

[~/test]#  btrfs send -p snapshot-171118 snapshot-171125 | pv > /dev/null
At subvol snapshot-171125
74.2MiB 0:00:32 [2.28MiB/s]

This means there can't be 7 GiB of exclusive data in the last snapshot.

Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to 

Re: exclusive subvolume space missing

2017-12-01 Thread Qu Wenruo


On 2017-12-02 00:15, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid       rfer       excl
> --------       ----       ----
> 0/5        16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333      24.53GiB  305.79MiB
> 0/298      13.44GiB  312.74MiB
> 0/327      23.79GiB  427.13MiB
> 0/331      23.93GiB  930.51MiB
> 0/260      12.25GiB    3.22GiB
> 0/312      19.70GiB    4.56GiB
> 0/388      28.75GiB    7.15GiB
> 0/291      30.60GiB    9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

I assume there is a program eating up the space,
not btrfs itself.

> 
> 
> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125

That's the difference between how sub show and quota work.

For quota, it's a per-root owner check.
This means that even if a file extent is shared between different inodes,
it's counted as exclusive as long as all of those inodes are inside the
same subvolume.
If any reference to the file extent belongs to another subvolume, then
it's counted as shared.

For fi du, it's a per-inode owner check. (The exact behavior is a little
more complex; I'll skip the corner cases to keep it easier to
understand.)

That is to say, if one file extent is shared by different inodes, then
it's counted as shared, no matter whether those inodes belong to
different subvolumes or to the same one.

In other words, "fi du" has a looser condition for counting "shared",
and that should explain why you have 20+G shared.
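A toy model may make the distinction clearer. This is not btrfs code, just
an illustration of the two ownership checks described above; the structures
and function names are invented for the example:

```python
# Each extent is (size_in_bytes, list of (subvol_id, inode) references).

def qgroup_exclusive(extents, subvol):
    """qgroup-style, per-root check: an extent is exclusive if every
    reference comes from this one subvolume, no matter how many inodes
    inside it share the extent."""
    return sum(size for size, refs in extents
               if {s for s, _ in refs} == {subvol})

def fi_du_exclusive(extents, subvol, inode):
    """fi-du-style, per-inode check (simplified): exclusive only if a
    single inode references the extent; any second inode, same subvolume
    or not, makes it count as shared."""
    return sum(size for size, refs in extents
               if refs == [(subvol, inode)])

# One 4 MiB extent reflinked between two inodes of the same subvolume 256:
extents = [(4 << 20, [(256, 1), (256, 2)])]
print(qgroup_exclusive(extents, 256))    # per-root: the whole extent is exclusive
print(fi_du_exclusive(extents, 256, 1))  # per-inode: nothing is exclusive
```

Under the per-root rule the reflinked extent stays exclusive, which is how
a subvolume can show gigabytes of qgroup "excl" that `fi du` reports as
shared.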

Thanks,
Qu


> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?
> 
> btrfs-progs-4.12 running on Linux 4.9.46.
> 
> best regards,
> 





Re: [PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super

2017-12-01 Thread Anand Jain



On 12/01/2017 05:19 PM, Nikolay Borisov wrote:

Currently this function executes the inner loop at most once due to the i = 0;
i < 1 condition. Furthermore, the btrfs_super_generation(super) > transid check
in the if condition is never evaluated, since latest is always NULL at that
point, hence the first part of the condition always triggers. The gist of
btrfs_read_dev_super is really just to read the first superblock.

Signed-off-by: Nikolay Borisov 
---
  fs/btrfs/disk-io.c | 27 ---
  1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 82c96607fc46..6d5f632fd1e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3170,37 +3170,18 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
  {
struct buffer_head *bh;
-   struct buffer_head *latest = NULL;
-   struct btrfs_super_block *super;
-   int i;
-   u64 transid = 0;
-   int ret = -EINVAL;
+   int ret;
  
  	/* we would like to check all the supers, but that would make

 * a btrfs mount succeed after a mkfs from a different FS.
 * So, we need to add a special mount option to scan for
 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 */


 We need the loop below to support the above comment at some point;
 instead of removing it, I would prefer to fix it as per the comment.

Thanks, Anand



-   for (i = 0; i < 1; i++) {
-   ret = btrfs_read_dev_one_super(bdev, i, &bh);
-   if (ret)
-   continue;
-
-   super = (struct btrfs_super_block *)bh->b_data;
-
-   if (!latest || btrfs_super_generation(super) > transid) {
-   brelse(latest);
-   latest = bh;
-   transid = btrfs_super_generation(super);
-   } else {
-   brelse(bh);
-   }
-   }
-
-   if (!latest)
+   ret = btrfs_read_dev_one_super(bdev, 0, &bh);
+   if (ret)
return ERR_PTR(ret);
  
-	return latest;

+   return bh;
  }
  
  /*



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

Well, it's at zero now...

# btrfs fi df /export/
Data, single: total=30.45TiB, used=30.25TiB
System, DUP: total=32.00MiB, used=3.62MiB
Metadata, DUP: total=66.50GiB, used=65.16GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


On 01/12/17 16:47, Duncan wrote:

Hans van Kranenburg posted on Fri, 01 Dec 2017 18:06:23 +0100 as
excerpted:


On 12/01/2017 05:31 PM, Matt McKinnon wrote:

Sorry, I missed your in-line reply:



2) How big is this filesystem? What does your `btrfs fi df
/mountpoint` say?



# btrfs fi df /export/
Data, single: total=30.45TiB, used=30.25TiB
System, DUP: total=32.00MiB, used=3.62MiB
Metadata, DUP: total=66.50GiB, used=65.08GiB
GlobalReserve, single: total=512.00MiB, used=53.69MiB


Multi-TiB filesystem, check. total/used ratio looks healthy.


Not so healthy, from here.  Data/metadata are healthy, yes,
but...

Any usage at all of global reserve is a red flag indicating that
something in the filesystem thinks, or thought when it resorted
to global reserve, that space is running out.

Global reserve usage doesn't really hint what the problem is,
but it's definitely a red flag that there /is/ a problem, and
it's easily overlooked, as it apparently was here.

It's likely indication of a bug, possibly one of the ones fixed
right around 4.12/4.13.  I'll let the devs and better experts take
it from there, but I'd certainly be worried until global reserve
drops to zero usage.




Re: btrfs-transacti hammering the system

2017-12-01 Thread Duncan
Hans van Kranenburg posted on Fri, 01 Dec 2017 18:06:23 +0100 as
excerpted:

> On 12/01/2017 05:31 PM, Matt McKinnon wrote:
>> Sorry, I missed your in-line reply:
>> 
>> 
>>> 2) How big is this filesystem? What does your `btrfs fi df
>>> /mountpoint` say?
>>>
>> 
>> # btrfs fi df /export/
>> Data, single: total=30.45TiB, used=30.25TiB
>> System, DUP: total=32.00MiB, used=3.62MiB
>> Metadata, DUP: total=66.50GiB, used=65.08GiB
>> GlobalReserve, single: total=512.00MiB, used=53.69MiB
> 
> Multi-TiB filesystem, check. total/used ratio looks healthy.

Not so healthy, from here.  Data/metadata are healthy, yes,
but...

Any usage at all of global reserve is a red flag indicating that
something in the filesystem thinks, or thought when it resorted
to global reserve, that space is running out.

Global reserve usage doesn't really hint what the problem is,
but it's definitely a red flag that there /is/ a problem, and
it's easily overlooked, as it apparently was here.

It's likely indication of a bug, possibly one of the ones fixed
right around 4.12/4.13.  I'll let the devs and better experts take
it from there, but I'd certainly be worried until global reserve
drops to zero usage.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: exclusive subvolume space missing

2017-12-01 Thread Hugo Mills
On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid       rfer       excl
> --------       ----       ----
> 0/5        16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333      24.53GiB  305.79MiB
> 0/298      13.44GiB  312.74MiB
> 0/327      23.79GiB  427.13MiB
> 0/331      23.93GiB  930.51MiB
> 0/260      12.25GiB    3.22GiB
> 0/312      19.70GiB    4.56GiB
> 0/388      28.75GiB    7.15GiB
> 0/291      30.60GiB    9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times. First, I'd look for large files with "du -ms /* |
sort -n", then work down into the tree until you find them.

   If that doesn't show up anything unusually large, then lsof to look
for open but deleted files (orphans) which are still being written to
by some process.

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.

> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence indended to Belgium.

-- 
Hugo Mills | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: exclusive subvolume space missing

2017-12-01 Thread Duncan
Tomasz Pala posted on Fri, 01 Dec 2017 17:15:55 +0100 as excerpted:

> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /

Scary.

> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:

I don't use quotas so won't claim working knowledge or an explanation of
that side of things, however...
> 
> btrfs-progs-4.12 running on Linux 4.9.46.

Until quite recently btrfs quotas were too buggy to recommend for use.
While the known blocker-level bugs are now fixed, scaling and real-world
performance are still an issue, and AFAIK, the fixes didn't make 4.9 and
may not be backported as the feature was simply known to be broken beyond
reliable usability at that point.

Based on comments in other threads here, I /think/ the critical quota
fixes hit 4.10, but of course not being an LTS, 4.10 is long out of support.
I'd suggest either turning off and forgetting about quotas since it doesn't
appear you actually need them, or upgrading to at least 4.13 and keeping
current, or the LTS 4.14 if you want to stay on the same kernel series for
awhile.

As for the scaling and performance issues, during normal/generic filesystem
use things are generally fine; it's various btrfs maintenance commands such
as balance, snapshot deletion, and btrfs check, that have the scaling
issues, and they have /some/ scaling issues even without quotas, it's just
that quotas make the problem *much* worse.  One workaround for balance
and snapshot deletion is to temporarily disable quotas while the job is
running, then reenable them (and rescan if necessary; as I don't use the
feature here, I'm not sure whether it is).  That can literally turn a job
that was
looking to take /weeks/ due to the scaling issue, into a job of hours.
Unfortunately, the sorts of conditions that would trigger running a btrfs
check don't lend themselves to the same sort of workaround, so not having
quotas on at all is the only workaround there.


As to your space being eaten problem, the output of btrfs filesystem usage
(and perhaps btrfs device usage if it's a multi-device btrfs) could be
really helpful here, much more so than quota reports if it's a btrfs
issue, or to help eliminate btrfs as the problem if it's not.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs-transacti hammering the system

2017-12-01 Thread Chris Murphy
On Fri, Dec 1, 2017 at 12:07 PM, Matt McKinnon  wrote:
> Right.  The file system is 48T, with 17T available, so we're not quite
> pushing it yet.
>
> So far so good on the space_cache=v2 mount.  I'm surprised this isn't on the
> gotcha page in the wiki; it may end up making a world of difference to the
> users here.
>

I'd change one thing at a time, so you learn which change does or doesn't
resolve the problem. For storage of mostly large files, autodefrag
doesn't seem applicable, but I'd leave it on for now since you've
already made the space cache v2 change.


-- 
Chris Murphy


Re: btrfs-progs - failed btrfs replace on RAID1 seems to have left things in a wrong state

2017-12-01 Thread Duncan
Patrik Lundquist posted on Fri, 01 Dec 2017 10:29:43 +0100 as excerpted:

> On 1 December 2017 at 08:18, Duncan <1i5t5.dun...@cox.net> wrote:
>>
>> When udev sees a device it triggers
>> a btrfs device scan, which lets btrfs know which devices belong to which
>> individual btrfs.  But once it associates a device with a particular
>> btrfs, there's nothing to unassociate it -- the only way to do that on
>> a running kernel is to successfully complete a btrfs device remove or
>> replacement... and your replace didn't complete due to error.
>>
>> Of course the other way to do it is to reboot, fresh kernel, fresh
>> btrfs state, and it learns again what devices go with which btrfs
>> when the appearing devices trigger the udev rule that triggers a
>> btrfs scan.
> 
> Or reload the btrfs module.

Thanks.  Yes.  With a monolithic kernel I tend to forget about that (and
as I have a btrfs root it wouldn't be possible anyway), but indeed,
unloading/reloading the btrfs kernel module clears the btrfs device state
tracking as effectively as a reboot.  Good point! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon
Right.  The file system is 48T, with 17T available, so we're not quite 
pushing it yet.


So far so good on the space_cache=v2 mount.  I'm surprised this isn't on 
the gotcha page in the wiki; it may end up making a world of difference 
to the users here.


Thanks again,
Matt

On 01/12/17 13:24, Hans van Kranenburg wrote:

On 12/01/2017 06:57 PM, Holger Hoffstätte wrote:

On 12/01/17 18:34, Matt McKinnon wrote:

Thanks, I'll give space_cache=v2 a shot.


Yes, very much recommended.


My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/


Turn autodefrag off and use noatime instead of relatime.

Your filesystem also seems very full,


We don't know. btrfs fi df only displays allocated space. And that being
full is good, it means not too much free space fragments everywhere.


that's bad with every filesystem but
*especially* with btrfs because the allocator has to work really hard to find
free space for COWing. Really consider deleting stuff or adding more space.





Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 06:57 PM, Holger Hoffstätte wrote:
> On 12/01/17 18:34, Matt McKinnon wrote:
>> Thanks, I'll give space_cache=v2 a shot.
> 
> Yes, very much recommended.
> 
>> My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/
> 
> Turn autodefrag off and use noatime instead of relatime.
> 
> Your filesystem also seems very full,

We don't know. btrfs fi df only displays allocated space. And that being
full is good, it means not too much free space fragments everywhere.

> that's bad with every filesystem but
> *especially* with btrfs because the allocator has to work really hard to find
> free space for COWing. Really consider deleting stuff or adding more space.

-- 
Hans van Kranenburg


Re: btrfs-transacti hammering the system

2017-12-01 Thread Austin S. Hemmelgarn

On 2017-12-01 12:13, Andrei Borzenkov wrote:

01.12.2017 20:06, Hans van Kranenburg wrote:


Additional tips (forgot to ask for your /proc/mounts before):
* Use the noatime mount option, so that merely accessing files does not
lead to changes in metadata,


Isn't 'lazytime' the default today? It gives you correct atime + no extra
metadata updates caused by atime-only updates.
Unless things have changed since the last time this came up, BTRFS does 
not support the 'lazytime' mount option (but it doesn't complain about 
it either).


Also, lazytime is independent from noatime, and using both can have 
benefits (lazytime will still have to write out the inode for every file 
read on the system every 24 hours, but with noatime it only has to write 
out the inode for files that have changed).


On top of all that though, you generally shouldn't trust atime 
because:
1. Many people run with noatime (or patch their kernels to default to 
noatime instead of relatime), so you can't be certain the atime is 
accurate at all.
2. It has somewhat non-intuitive semantics when dealing with directories.
3. Even without noatime thrown in, you only get one-day resolution by 
default (as per the operation of 'relatime').
4. Essentially nothing uses it other than find (which only has one-day 
resolution as it's typically used) and older versions of mutt (which used 
it because of lazy programming), which is why issues 1 and 3 are the case.



Re: btrfs-transacti hammering the system

2017-12-01 Thread Holger Hoffstätte
On 12/01/17 18:34, Matt McKinnon wrote:
> Thanks, I'll give space_cache=v2 a shot.

Yes, very much recommended.

> My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/

Turn autodefrag off and use noatime instead of relatime.

Your filesystem also seems very full, that's bad with every filesystem but
*especially* with btrfs because the allocator has to work really hard to find
free space for COWing. Really consider deleting stuff or adding more space.

-h


Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

Thanks, I'll give space_cache=v2 a shot.

My mount options are: rw,relatime,space_cache,autodefrag,subvolid=5,subvol=/


Re: btrfs-transacti hammering the system

2017-12-01 Thread Andrei Borzenkov
01.12.2017 20:06, Hans van Kranenburg wrote:
> 
> Additional tips (forgot to ask for your /proc/mounts before):
> * Use the noatime mount option, so that merely accessing files does not
> lead to changes in metadata,

Isn't 'lazytime' the default today? It gives you correct atime + no extra
metadata updates caused by atime-only updates.


Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 05:31 PM, Matt McKinnon wrote:
> Sorry, I missed your in-line reply:
> 
> 
>> 1) The one right above, btrfs_write_out_cache, is the write-out of the
>> free space cache v1. Do you see this for multiple seconds going on, and
>> does it match the time when it's writing X MB/s to disk?
>>
> 
> It seems to only last until the next watch update.
> 
> [] io_schedule+0x16/0x40
> [] get_request+0x23e/0x720
> [] blk_queue_bio+0xc1/0x3a0
> [] generic_make_request+0xf8/0x2a0
> [] submit_bio+0x75/0x150
> [] btrfs_map_bio+0xe5/0x2f0 [btrfs]
> [] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
> [] submit_one_bio+0x63/0xa0 [btrfs]
> [] flush_epd_write_bio+0x3b/0x50 [btrfs]
> [] flush_write_bio+0xe/0x10 [btrfs]
> [] btree_write_cache_pages+0x379/0x450 [btrfs]
> [] btree_writepages+0x5d/0x70 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
> [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80
> [btrfs]
> [] btrfs_commit_transaction+0x665/0x900 [btrfs]
> [] transaction_kthread+0x18a/0x1c0 [btrfs]
> [] kthread+0x109/0x140
> [] ret_from_fork+0x25/0x30
> 
> The last three lines will stick around for a while.  Is switching to
> space cache v2 something that everyone should be doing?  Something that
> would be a good test at least?

Yes. Read on.

>> 2) How big is this filesystem? What does your `btrfs fi df
>> /mountpoint` say?
>>
> 
> # btrfs fi df /export/
> Data, single: total=30.45TiB, used=30.25TiB
> System, DUP: total=32.00MiB, used=3.62MiB
> Metadata, DUP: total=66.50GiB, used=65.08GiB
> GlobalReserve, single: total=512.00MiB, used=53.69MiB

Multi-TiB filesystem, check. total/used ratio looks healthy.

>> 3) What kind of workload are you running? E.g. how can you describe it
>> within a range from "big files which just sit there" to "small writes
>> and deletes all over the place all the time"?
> 
> It's a pretty light workload most of the time.  It's a file system that
> exports two NFS shares to a small lab group.  I believe it is more small
> reads all over a large file (MRI imaging) rather than small writes.

Ok.

>> 4) What kernel version is this? `uname -a` output?
> 
> # uname -a
> Linux machine_name 4.12.8-custom #1 SMP Tue Aug 22 10:15:01 EDT 2017
> x86_64 x86_64 x86_64 GNU/Linux
> 

Yes, I'd recommend switching to space_cache v2, which stores the free
space information in a tree instead of separate blobs, and does not
block the transaction while writing out all info of all touched parts of
the filesystem again.

Here's of course the famous presentation with all kinds of info why:

http://events.linuxfoundation.org/sites/events/files/slides/vault2016_0.pdf

How:

* umount the filesystem
* btrfsck --clear-space-cache v1 /block/device
* do a rw mount with the space_cache=v2 option added (only needed
explicitly once)

During that mount, it will generate the free space tree by reading the
extent tree and writing the inverse of it. This will take some time,
depending on how fast your storage can do random reads with a cold disk
cache.
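What "writing the inverse of the extent tree" means can be pictured with a
small sketch. This is illustrative only, not the kernel's implementation;
the function name and data layout are invented for the example:

```python
def invert_extents(region_start, region_end, allocated):
    """Return the free ranges in [region_start, region_end) that are not
    covered by the sorted, non-overlapping (start, end) extents in
    `allocated` -- the complement that a free space tree would store."""
    free, cursor = [], region_start
    for start, end in allocated:
        if start > cursor:
            free.append((cursor, start))   # gap before this extent
        cursor = max(cursor, end)
    if cursor < region_end:
        free.append((cursor, region_end))  # tail gap after the last extent
    return free

# A 1 MiB block group with two allocated extents:
print(invert_extents(0, 1 << 20, [(0, 4096), (8192, 16384)]))
```

The real conversion does this for every block group, which is why the
first space_cache=v2 mount has to walk the whole extent tree.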

For x86_64, using the free space cache v2 is fine since linux 4.5. Up to
4.9, there was a bug for big-endian systems. So, with your kernel it's
absolutely fine.

Why isn't this the default yet? It's because btrfs-progs don't have
support to update the free space tree when doing offline modifications
(like check --repair or btrfstune, which you hopefully don't need often
anyway). So, until that's fully added, you need to do a `btrfsck
--clear-space-cache v2`, then do the offline r/w action and then
generate the tree again on next mount.

Additional tips (forgot to ask for your /proc/mounts before):
* Use the noatime mount option, so that merely accessing files does not
lead to changes in metadata, which lead to writes, which lead to cowing
and writes in a new place, which lead to updates of the free space
administration etc...

-- 
Hans van Kranenburg


exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
Hello,

I got a problem with btrfs running out of space (not THE
Internet-wide, well known issues with interpretation).

The problem is: something eats the space while not running anything that
justifies this. There were 18 GB free space available, suddenly it
dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
right now, just as I'm writing this e-mail:

/dev/sda2        64G   63G  452M 100% /
/dev/sda2        64G   63G  365M 100% /
/dev/sda2        64G   63G  316M 100% /
/dev/sda2        64G   63G  287M 100% /
/dev/sda2        64G   63G  268M 100% /
/dev/sda2        64G   63G  239M 100% /
/dev/sda2        64G   63G  230M 100% /
/dev/sda2        64G   63G  182M 100% /
/dev/sda2        64G   63G  163M 100% /
/dev/sda2        64G   64G  153M 100% /
/dev/sda2        64G   64G  143M 100% /
/dev/sda2        64G   64G   96M 100% /
/dev/sda2        64G   64G   88M 100% /
/dev/sda2        64G   64G   57M 100% /
/dev/sda2        64G   64G   25M 100% /

while my rough calculations show, that there should be at least 10 GB of
free space. After enabling quotas it is somehow confirmed:

# btrfs qgroup sh --sort=excl / 
qgroupid       rfer       excl
--------       ----       ----
0/5        16.00KiB   16.00KiB
[30 snapshots with about 100 MiB excl]
0/333      24.53GiB  305.79MiB
0/298      13.44GiB  312.74MiB
0/327      23.79GiB  427.13MiB
0/331      23.93GiB  930.51MiB
0/260      12.25GiB    3.22GiB
0/312      19.70GiB    4.56GiB
0/388      28.75GiB    7.15GiB
0/291      30.60GiB    9.01GiB <- this is the running one

This is about 30 GB total excl (I didn't find a switch to sum this up). I
know I can't just add up 'excl' to get usage, so I tried to pinpoint the
exact files that occupy space exclusively in 0/388 (this is the last
snapshot taken; all of the snapshots are created from the running fs).
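
Indeed btrfs-progs has no switch for that total; as a throwaway sketch
(the column index assumes the `qgroup show` layout quoted above), the
excl column can be summed like this:

```python
# Sum the "excl" column of `btrfs qgroup show` output; the rows below
# are the ones quoted above (the 30 small snapshots omitted).
qgroup_rows = """\
0/333  24.53GiB  305.79MiB
0/298  13.44GiB  312.74MiB
0/327  23.79GiB  427.13MiB
0/331  23.93GiB  930.51MiB
0/260  12.25GiB    3.22GiB
0/312  19.70GiB    4.56GiB
0/388  28.75GiB    7.15GiB
0/291  30.60GiB    9.01GiB
"""

def to_gib(size):
    """Convert '305.79MiB' or '9.01GiB' to GiB."""
    value, unit = float(size[:-3]), size[-3:]
    return value / 1024 if unit == 'MiB' else value

total_excl = sum(to_gib(row.split()[2]) for row in qgroup_rows.splitlines())
print(f"sum of excl over these rows: {total_excl:.2f} GiB")
```

That comes to roughly 26 GiB; adding the ~30 snapshots at about 100 MiB
excl each lands near the 30 GB quoted above.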


Now, the weird part for me is exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
Subvolume ID:   388
# btrfs fi du -s ./snapshot-171125 
     Total   Exclusive  Set shared  Filename
  21.50GiB    63.35MiB    20.77GiB  snapshot-171125


How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same amount differs in the totals: 28.75-21.50=7.25 GiB.
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than is actually found in files. So if not files, where
is that space wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
-- 
Tomasz Pala 


Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

Sorry, I missed your in-line reply:



1) The one right above, btrfs_write_out_cache, is the write-out of the
free space cache v1. Do you see this for multiple seconds going on, and
does it match the time when it's writing X MB/s to disk?



It seems to only last until the next watch update.

[] io_schedule+0x16/0x40
[] get_request+0x23e/0x720
[] blk_queue_bio+0xc1/0x3a0
[] generic_make_request+0xf8/0x2a0
[] submit_bio+0x75/0x150
[] btrfs_map_bio+0xe5/0x2f0 [btrfs]
[] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
[] submit_one_bio+0x63/0xa0 [btrfs]
[] flush_epd_write_bio+0x3b/0x50 [btrfs]
[] flush_write_bio+0xe/0x10 [btrfs]
[] btree_write_cache_pages+0x379/0x450 [btrfs]
[] btree_writepages+0x5d/0x70 [btrfs]
[] do_writepages+0x1c/0x70
[] __filemap_fdatawrite_range+0xaa/0xe0
[] filemap_fdatawrite_range+0x13/0x20
[] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
[] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80 [btrfs]

[] btrfs_commit_transaction+0x665/0x900 [btrfs]
[] transaction_kthread+0x18a/0x1c0 [btrfs]
[] kthread+0x109/0x140
[] ret_from_fork+0x25/0x30

The last three lines will stick around for a while.  Is switching to 
space cache v2 something that everyone should be doing?  Something that 
would be a good test at least?




2) How big is this filesystem? What does your `btrfs fi df /mountpoint` say?



# btrfs fi df /export/
Data, single: total=30.45TiB, used=30.25TiB
System, DUP: total=32.00MiB, used=3.62MiB
Metadata, DUP: total=66.50GiB, used=65.08GiB
GlobalReserve, single: total=512.00MiB, used=53.69MiB
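
Reading that output, the interesting number for a struggling filesystem
is how much room is left inside the already-allocated chunks. A quick
worked example with the figures above (plain arithmetic, not a btrfs
command):

```python
# Free room left inside allocated chunks, from the `btrfs fi df`
# figures above.
sizes_gib = {  # (total, used) in GiB, taken from the output above
    'Data':     (30.45 * 1024, 30.25 * 1024),  # 30.45/30.25 TiB
    'Metadata': (66.50, 65.08),
}
for kind, (total, used) in sizes_gib.items():
    print(f"{kind}: {total - used:.2f} GiB free inside allocated chunks")
```

So data still has roughly 205 GiB of slack, but metadata chunks are
within about 1.4 GiB of full, which is worth keeping an eye on.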



3) What kind of workload are you running? E.g. how can you describe it
within a range from "big files which just sit there" to "small writes
and deletes all over the place all the time"?



It's a pretty light workload most of the time.  It's a file system that 
exports two NFS shares to a small lab group.  I believe it is more small 
reads all over a large file (MRI imaging) rather than small writes.



4) What kernel version is this? `uname -a` output?



# uname -a
Linux machine_name 4.12.8-custom #1 SMP Tue Aug 22 10:15:01 EDT 2017 
x86_64 x86_64 x86_64 GNU/Linux




Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

These seem to come up most often:

[] transaction_kthread+0x133/0x1c0 [btrfs]
[] kthread+0x109/0x140
[] ret_from_fork+0x25/0x30




Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 04:24 PM, Matt McKinnon wrote:
> Thanks for this.  Here's what I get:

Ok, and which one is displaying most of the time?

> [...]
> 
> [] io_schedule+0x16/0x40
> [] get_request+0x23e/0x720
> [] blk_queue_bio+0xc1/0x3a0
> [] generic_make_request+0xf8/0x2a0
> [] submit_bio+0x75/0x150
> [] btrfs_map_bio+0xe5/0x2f0 [btrfs]
> [] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
> [] submit_one_bio+0x63/0xa0 [btrfs]
> [] flush_epd_write_bio+0x3b/0x50 [btrfs]
> [] flush_write_bio+0xe/0x10 [btrfs]
> [] btree_write_cache_pages+0x379/0x450 [btrfs]
> [] btree_writepages+0x5d/0x70 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
> [] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80 [btrfs]
> [] btrfs_commit_transaction+0x665/0x900 [btrfs]
> 
> [...]
> 
> [] io_schedule+0x16/0x40
> [] wait_on_page_bit+0xe8/0x120
> [] read_extent_buffer_pages+0x1cd/0x2e0 [btrfs]
> [] btree_read_extent_buffer_pages+0x9f/0x100 [btrfs]
> [] read_tree_block+0x32/0x50 [btrfs]
> [] read_block_for_search.isra.32+0x120/0x2e0 [btrfs]
> [] btrfs_next_old_leaf+0x215/0x400 [btrfs]
> [] btrfs_next_leaf+0x10/0x20 [btrfs]
> [] btrfs_lookup_csums_range+0x12e/0x410 [btrfs]
> [] csum_exist_in_range.isra.49+0x2a/0x81 [btrfs]
> [] run_delalloc_nocow+0x9b2/0xa10 [btrfs]
> [] run_delalloc_range+0x68/0x340 [btrfs]
> [] writepage_delalloc.isra.47+0xf0/0x140 [btrfs]
> [] __extent_writepage+0xc7/0x290 [btrfs]
> [] extent_write_cache_pages.constprop.53+0x2b5/0x450 [btrfs]
> [] extent_writepages+0x4d/0x70 [btrfs]
> [] btrfs_writepages+0x28/0x30 [btrfs]
> [] do_writepages+0x1c/0x70
> [] __filemap_fdatawrite_range+0xaa/0xe0
> [] filemap_fdatawrite_range+0x13/0x20
> [] btrfs_fdatawrite_range+0x20/0x50 [btrfs]
> [] __btrfs_write_out_cache+0x3d9/0x420 [btrfs]
> [] btrfs_write_out_cache+0x86/0x100 [btrfs]
> [] btrfs_write_dirty_block_groups+0x261/0x390 [btrfs]
> [] commit_cowonly_roots+0x1fb/0x290 [btrfs]
> [] btrfs_commit_transaction+0x434/0x900 [btrfs]

1) The one right above, btrfs_write_out_cache, is the write-out of the
free space cache v1. Do you see this for multiple seconds going on, and
does it match the time when it's writing X MB/s to disk?

2) How big is this filesystem? What does your `btrfs fi df /mountpoint` say?

3) What kind of workload are you running? E.g. how can you describe it
within a range from "big files which just sit there" to "small writes
and deletes all over the place all the time"?

4) What kernel version is this? `uname -a` output?


-- 
Hans van Kranenburg


Re: btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

Thanks for this.  Here's what I get:


[] transaction_kthread+0x133/0x1c0 [btrfs]
[] kthread+0x109/0x140
[] ret_from_fork+0x25/0x30

...

[] io_schedule+0x16/0x40
[] get_request+0x23e/0x720
[] blk_queue_bio+0xc1/0x3a0
[] generic_make_request+0xf8/0x2a0
[] submit_bio+0x75/0x150
[] btrfs_map_bio+0xe5/0x2f0 [btrfs]
[] btree_submit_bio_hook+0x8c/0xe0 [btrfs]
[] submit_one_bio+0x63/0xa0 [btrfs]
[] flush_epd_write_bio+0x3b/0x50 [btrfs]
[] flush_write_bio+0xe/0x10 [btrfs]
[] btree_write_cache_pages+0x379/0x450 [btrfs]
[] btree_writepages+0x5d/0x70 [btrfs]
[] do_writepages+0x1c/0x70
[] __filemap_fdatawrite_range+0xaa/0xe0
[] filemap_fdatawrite_range+0x13/0x20
[] btrfs_write_marked_extents+0xe9/0x110 [btrfs]
[] btrfs_write_and_wait_transaction.isra.22+0x3d/0x80 [btrfs]

[] btrfs_commit_transaction+0x665/0x900 [btrfs]

...

[] io_schedule+0x16/0x40
[] wait_on_page_bit+0xe8/0x120
[] read_extent_buffer_pages+0x1cd/0x2e0 [btrfs]
[] btree_read_extent_buffer_pages+0x9f/0x100 [btrfs]
[] read_tree_block+0x32/0x50 [btrfs]
[] read_block_for_search.isra.32+0x120/0x2e0 [btrfs]
[] btrfs_next_old_leaf+0x215/0x400 [btrfs]
[] btrfs_next_leaf+0x10/0x20 [btrfs]
[] btrfs_lookup_csums_range+0x12e/0x410 [btrfs]
[] csum_exist_in_range.isra.49+0x2a/0x81 [btrfs]
[] run_delalloc_nocow+0x9b2/0xa10 [btrfs]
[] run_delalloc_range+0x68/0x340 [btrfs]
[] writepage_delalloc.isra.47+0xf0/0x140 [btrfs]
[] __extent_writepage+0xc7/0x290 [btrfs]
[] extent_write_cache_pages.constprop.53+0x2b5/0x450 [btrfs]

[] extent_writepages+0x4d/0x70 [btrfs]
[] btrfs_writepages+0x28/0x30 [btrfs]
[] do_writepages+0x1c/0x70
[] __filemap_fdatawrite_range+0xaa/0xe0
[] filemap_fdatawrite_range+0x13/0x20
[] btrfs_fdatawrite_range+0x20/0x50 [btrfs]
[] __btrfs_write_out_cache+0x3d9/0x420 [btrfs]
[] btrfs_write_out_cache+0x86/0x100 [btrfs]
[] btrfs_write_dirty_block_groups+0x261/0x390 [btrfs]
[] commit_cowonly_roots+0x1fb/0x290 [btrfs]
[] btrfs_commit_transaction+0x434/0x900 [btrfs]

...

[] tree_search_offset.isra.23+0x37/0x1d0 [btrfs]



Re: btrfs-transacti hammering the system

2017-12-01 Thread Hans van Kranenburg
On 12/01/2017 03:25 PM, Matt McKinnon wrote:
> 
> Is there any way to figure out what exactly btrfs-transacti is chugging
> on?  I have a few file systems that seem to get wedged for days on end
> with this process pegged around 100%.  I've stopped all snapshots, made
> sure no quotas were enabled, turned on autodefrag in the mount options,
> tried manual defragging, kernel upgrades, yet still this brings my
> system to a crawl.
> 
> Network I/O to the system seems very tiny.  The only I/O I see to the
> disk is btrfs-transacti writing a couple M/s.
> 
> # time touch foo
> 
> real    2m54.303s
> user    0m0.000s
> sys 0m0.002s
> 
> # uname -r
> 4.12.8-custom
> 
> # btrfs --version
> btrfs-progs v4.13.3
> 
> Yes, I know I'm a bit behind there...

One of the simple things you can do is watch the stack traces of the
kernel thread.

watch 'cat /proc/<pid>/stack'

where <pid> is the pid of the btrfs-transaction process.

In there, you will see a pattern of recurring things, like: it's
searching for free space, it's writing out the free space cache, or other
things. Correlate this with the disk write traffic and see if we get a
step further.
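
That sampling loop can be sketched as follows; `tally_top_frames` is a
hypothetical helper (not part of btrfs-progs) that counts which topmost
frame shows up most often across repeated reads of /proc/<pid>/stack:

```python
# Sketch: tally the topmost kernel stack frame across repeated samples
# of /proc/<pid>/stack, to spot the recurring pattern Hans describes.
from collections import Counter

def tally_top_frames(samples):
    """samples: list of stack-trace strings, one per /proc/<pid>/stack read."""
    counts = Counter()
    for trace in samples:
        frames = [line.strip() for line in trace.splitlines() if line.strip()]
        if frames:
            counts[frames[0]] += 1  # topmost frame of this sample
    return counts.most_common()

# Fed with (abbreviated) traces like the ones quoted in this thread:
demo = [
    "[] io_schedule+0x16/0x40\n[] get_request+0x23e/0x720",
    "[] io_schedule+0x16/0x40\n[] wait_on_page_bit+0xe8/0x120",
    "[] tree_search_offset.isra.23+0x37/0x1d0 [btrfs]",
]
print(tally_top_frames(demo))
```

In real use you would fill `samples` by reading /proc/<pid>/stack in a
loop (root is typically required to read another process's stack).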

-- 
Hans van Kranenburg


btrfs-transacti hammering the system

2017-12-01 Thread Matt McKinnon

Hi All,

Is there any way to figure out what exactly btrfs-transacti is chugging 
on?  I have a few file systems that seem to get wedged for days on end 
with this process pegged around 100%.  I've stopped all snapshots, made 
sure no quotas were enabled, turned on autodefrag in the mount options, 
tried manual defragging, kernel upgrades, yet still this brings my 
system to a crawl.


Network I/O to the system seems very tiny.  The only I/O I see to the 
disk is btrfs-transacti writing a couple M/s.


# time touch foo

real2m54.303s
user0m0.000s
sys 0m0.002s

# uname -r
4.12.8-custom

# btrfs --version
btrfs-progs v4.13.3

Yes, I know I'm a bit behind there...

-Matt





RE: btrfs-progs - failed btrfs replace on RAID1 seems to have left things in a wrong state

2017-12-01 Thread Eric Mesa
Duncan,

Thank you for your thorough response to my problem. I am now wiser in
my understanding of how btrfs works in RAID1 thanks to your words.
Last night I worked with someone in the IRC channel and we essentially
came to the exact same conclusion. I used wipefs -a on the errant
drive. Rebooted and viola. As of last night the replace was running
fine. (Didn't have time to check this morning before heading out) The
people on IRC had recommended filing a bug based on the fact that a
btrfs filesystem was created during the replace, but if I understand
your feedback, this has already been noted and there are patches being
considered.

As for your backup feedback, it has been thoroughly beaten into my
head over the last half-decade that RAID is not backup. Although, I'd
argue that RAID on btrfs or ZFS making use of snapshots is pretty darn
close. (Since it covers the fat-finger situation - although it doesn't
cover the MOBO frying your hard-drives situation) But I do have
offsite backup - it's just that it's with a commercial provider (as
opposed to, say, a friend's house) so I didn't want to have to
download 3TB if things got borked. (I consider that my house
burning/theft backup) And it IS in the plans to have a separate backup
system in my house. I just haven't spent the money yet as it's
currently a bit tight. But I do appreciate that you took the time to
explain that in case I didn't know about it. And it's on the mailing
list archives now so if someone else is under the misunderstanding
that RAID is backup they can also be educated.

Anyway, this is running a bit long. I just want to conclude by again
offering my thanks at your very thorough response. If I hadn't been
able to obtain help on the IRC, this would have put me on the right
path. And it came with knowledge rather than just a list of
instructions. So thanks for that as well.
--
Eric Mesa
http://www.ericmesa.com


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-12-01 Thread Jan Kara
On Wed 29-11-17 13:38:26, Chris Mason wrote:
> On 11/29/2017 12:05 PM, Tejun Heo wrote:
> >On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:
> >>Hello,
> >>
> >>On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:
> >>>What has happened with this patch set?
> >>
> >>No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
> >>can you please route them through the btrfs tree?
> >
> >lol looking at the patchset again, I'm not sure that's obviously the
> >right tree.  It can either be cgroup, block or btrfs.  If no one
> >objects, I'll just route them through cgroup.
> 
> We'll have to coordinate a bit during the next merge window but I don't have
> a problem with these going in through cgroup.  Dave does this sound good to
> you?

Also I was wondering about another thing: How does this play with Josef's
series for metadata writeback (metadata-specific accounting and dirty
writeout)? Would the per-inode selection of cgroup writeback still be
needed when Josef's series is applied, since metadata writeback then won't
be associated with any particular mapping anymore?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: btrfs-progs - failed btrfs replace on RAID1 seems to have left things in a wrong state

2017-12-01 Thread Patrik Lundquist
On 1 December 2017 at 08:18, Duncan <1i5t5.dun...@cox.net> wrote:
>
> When udev sees a device it triggers
> a btrfs device scan, which lets btrfs know which devices belong to which
> individual btrfs.  But once it associates a device with a particular
> btrfs, there's nothing to unassociate it -- the only way to do that on
> a running kernel is to successfully complete a btrfs device remove or
> replacement... and your replace didn't complete due to error.
>
> Of course the other way to do it is to reboot, fresh kernel, fresh
> btrfs state, and it learns again what devices go with which btrfs
> when the appearing devices trigger the udev rule that triggers a
> btrfs scan.

Or reload the btrfs module.


[PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super

2017-12-01 Thread Nikolay Borisov
Currently this function executes the inner loop at most once, due to the
i = 0; i < 1 condition. Furthermore, the btrfs_super_generation(super) > transid
part of the if condition is never evaluated, since latest is always NULL and
the first part of the condition therefore always triggers. The gist of
btrfs_read_dev_super is really just to read the first superblock.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/disk-io.c | 27 ---
 1 file changed, 4 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 82c96607fc46..6d5f632fd1e7 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3170,37 +3170,18 @@ int btrfs_read_dev_one_super(struct block_device *bdev, int copy_num,
 struct buffer_head *btrfs_read_dev_super(struct block_device *bdev)
 {
struct buffer_head *bh;
-   struct buffer_head *latest = NULL;
-   struct btrfs_super_block *super;
-   int i;
-   u64 transid = 0;
-   int ret = -EINVAL;
+   int ret;
 
/* we would like to check all the supers, but that would make
 * a btrfs mount succeed after a mkfs from a different FS.
 * So, we need to add a special mount option to scan for
 * later supers, using BTRFS_SUPER_MIRROR_MAX instead
 */
-   for (i = 0; i < 1; i++) {
-   ret = btrfs_read_dev_one_super(bdev, i, &bh);
-   if (ret)
-   continue;
-
-   super = (struct btrfs_super_block *)bh->b_data;
-
-   if (!latest || btrfs_super_generation(super) > transid) {
-   brelse(latest);
-   latest = bh;
-   transid = btrfs_super_generation(super);
-   } else {
-   brelse(bh);
-   }
-   }
-
-   if (!latest)
+   ret = btrfs_read_dev_one_super(bdev, 0, &bh);
+   if (ret)
return ERR_PTR(ret);
 
-   return latest;
+   return bh;
 }
 
 /*
-- 
2.7.4



[PATCH 1/5] btrfs: Remove dead code

2017-12-01 Thread Nikolay Borisov
trans was statically assigned NULL and this never changed over the course of
btrfs_get_extent. So remove any code which checks whether trans != NULL and
just hardcode the fact that trans is always NULL. This fixes CID#112806.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/inode.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 57785eadb95c..92d140b06271 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6943,7 +6943,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
struct extent_map *em = NULL;
struct extent_map_tree *em_tree = &inode->extent_tree;
struct extent_io_tree *io_tree = &inode->io_tree;
-   struct btrfs_trans_handle *trans = NULL;
const bool new_inline = !page || create;
 
read_lock(&em_tree->lock);
@@ -6984,8 +6983,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
path->reada = READA_FORWARD;
}
 
-   ret = btrfs_lookup_file_extent(trans, root, path,
-  objectid, start, trans != NULL);
+   ret = btrfs_lookup_file_extent(NULL, root, path, objectid, start, 0);
if (ret < 0) {
err = ret;
goto out;
@@ -7181,11 +7179,6 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
trace_btrfs_get_extent(root, inode, em);
 
btrfs_free_path(path);
-   if (trans) {
-   ret = btrfs_end_transaction(trans);
-   if (!err)
-   err = ret;
-   }
if (err) {
free_extent_map(em);
return ERR_PTR(err);
-- 
2.7.4



[PATCH 3/5] btrfs: Fix possible off-by-one in btrfs_search_path_in_tree

2017-12-01 Thread Nikolay Borisov
The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080), so the valid char indexes are in the
range [0, 4079]. Currently the code uses the define itself as an index,
which is an off-by-one.
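
For illustration, the same indexing mistake in a bounds-checked language
makes the bug obvious (in C the write lands one byte past the array
silently):

```python
# Off-by-one illustration (assumed values): for a buffer of
# BTRFS_INO_LOOKUP_PATH_MAX bytes, the last valid index is
# BTRFS_INO_LOOKUP_PATH_MAX - 1.
BTRFS_INO_LOOKUP_PATH_MAX = 4080
name = bytearray(BTRFS_INO_LOOKUP_PATH_MAX)

name[BTRFS_INO_LOOKUP_PATH_MAX - 1] = 0   # last element: fine
try:
    name[BTRFS_INO_LOOKUP_PATH_MAX] = 0   # one past the end
except IndexError:
    print("index 4080 is out of range")
```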

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/ioctl.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index e8adebc8c1b0..fc148b7c4265 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2206,7 +2206,7 @@ static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
if (!path)
return -ENOMEM;
 
-   ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX];
+   ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX - 1];
 
key.objectid = tree_id;
key.type = BTRFS_ROOT_ITEM_KEY;
@@ -2272,8 +2272,8 @@ static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
 static noinline int btrfs_ioctl_ino_lookup(struct file *file,
   void __user *argp)
 {
-struct btrfs_ioctl_ino_lookup_args *args;
-struct inode *inode;
+   struct btrfs_ioctl_ino_lookup_args *args;
+   struct inode *inode;
int ret = 0;
 
args = memdup_user(argp, sizeof(*args));
-- 
2.7.4



[PATCH 4/5] btrfs: Remove redundant NULL check

2017-12-01 Thread Nikolay Borisov
Before returning hole_em in btrfs_get_extent_fiemap we check whether it is
NULL. However, by the time this check is reached we already know that
hole_em is not NULL, because it points to the em we found and has
already been dereferenced.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/inode.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 92d140b06271..9e0473c883ce 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7300,9 +7300,8 @@ struct extent_map *btrfs_get_extent_fiemap(struct btrfs_inode *inode,
em->block_start = EXTENT_MAP_DELALLOC;
em->block_len = found;
}
-   } else if (hole_em) {
+   } else
return hole_em;
-   }
 out:
 
free_extent_map(hole_em);
-- 
2.7.4



[PATCH 2/5] btrfs: Remove dead code

2017-12-01 Thread Nikolay Borisov
'clear' is always set to 0 (BTRFS_FEATURE_COMPAT_SAFE_CLEAR,
BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR and BTRFS_FEATURE_INCOMPAT_SAFE_CLEAR are
all defined to 0). So remove the code that logically can never execute.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/sysfs.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index a8bafed931f4..37dbf2fccedc 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -84,8 +84,6 @@ static int can_modify_feature(struct btrfs_feature_attr *fa)
 
if (set & fa->feature_bit)
val |= 1;
-   if (clear & fa->feature_bit)
-   val |= 2;
 
return val;
 }
-- 
2.7.4



[PATCH 0/5] Misc cleanups

2017-12-01 Thread Nikolay Borisov
Here's a bunch of stuff that Coverity found; this survived a full xfstests run.

Nikolay Borisov (5):
  btrfs: Remove dead code
  btrfs: Remove dead code
  btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
  btrfs: Remove redundant NULL check
  btrfs: Greatly simplify btrfs_read_dev_super

 fs/btrfs/disk-io.c | 27 ---
 fs/btrfs/inode.c   | 12 ++--
 fs/btrfs/ioctl.c   |  6 +++---
 fs/btrfs/sysfs.c   |  2 --
 4 files changed, 9 insertions(+), 38 deletions(-)

-- 
2.7.4
