Re: exclusive subvolume space missing

2017-12-02 Thread Duncan
Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:

> On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:
> 
>>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
>> [...]
>>> Now make various small changes to the file, say under 16 KiB each.  These
>>> will each be COWed elsewhere as one might expect, by default 16 KiB at
>>> a time I believe (might be 4 KiB, as it was back when the default leaf
>> 
>> I got ~500 small files (100-500 kB) updated partially in regular
>> intervals:
>> 
>> # du -Lc **/*.rrd | tail -n1
>> 105M    total

FWIW, I've no idea what rrd files or rrdcached (from the grandparent post)
are, other than that a quick google suggests it's... round-robin-database...
and the database bit alone sounds bad in this context, since database-file
rewrites are a known worst case for cow-based filesystems.  But it sounds
like you suspect they have this rewrite-most pattern, which could explain
your problem...

>>> But here's the kicker.  Even without a snapshot locking that original 100
>>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>>> at once, once no references to it remain, which means once that last
>>> block of the extent gets rewritten.
> 
> OTOH - should this happen with nodatacow files? As I mentioned before,
> these files are chattred +C (however, this was not their initial state,
> due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
> Am I wrong in thinking that in such a case they should occupy at most
> twice their size? Or is there some tool that could show me the real
> space wasted by a file, including the extent count etc.?
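
As for a tool: btrfs-progs' "btrfs filesystem du" reports per-file total,
exclusive and shared bytes, and filefrag from e2fsprogs shows the extent
count and layout, so something along these lines may already tell most of
the story (just a sketch; the rrd path is a placeholder):

btrfs filesystem du /path/to/rrds/        # per-file shared vs. exclusive bytes
btrfs filesystem du -s /path/to/rrds/     # same, summarized
filefrag -v /path/to/rrds/somefile.rrd    # extent count and layout of one file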

Nodatacow... isn't as simple as the name might suggest.

For one thing, snapshots depend on COW and lock the extents they reference
in-place, so while a file might be set nocow and that setting is retained,
the first write to a block after a snapshot *MUST* cow that block... because
the snapshot has the existing version referenced and it can't change without
changing the snapshot as well, and that would of course defeat the purpose
of snapshots.

Tho the attribute is retained and further writes to the same already cowed
block won't cow it again.

FWIW, on this list that behavior is often referred to as cow1, cow only the
first time that a block is written after a snapshot locks the previous
version in place.
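
If you want to watch cow1 happen, here's a quick sketch (paths are just an
example; it assumes /mnt/btr is a btrfs mount, /mnt/btr/test a subvolume in
it, and filefrag from e2fsprogs is available):

touch /mnt/btr/test/demo
chattr +C /mnt/btr/test/demo            # nocow set before any data exists
dd if=/dev/zero of=/mnt/btr/test/demo bs=1M count=10 conv=notrunc
sync; filefrag -v /mnt/btr/test/demo    # note the physical offsets

# rewrite one block with no snapshot in the way: offsets stay put
dd if=/dev/urandom of=/mnt/btr/test/demo bs=16K count=1 conv=notrunc
sync; filefrag -v /mnt/btr/test/demo

# snapshot, then rewrite again: that block gets cow1'd to a new offset
btrfs subvolume snapshot -r /mnt/btr/test /mnt/btr/test-snap
dd if=/dev/urandom of=/mnt/btr/test/demo bs=16K count=1 conv=notrunc
sync; filefrag -v /mnt/btr/test/demo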

The effect of cow1 depends on the frequency and extent of block rewrites vs.
the frequency of snapshots of the subvolume they're on.  As should be
obvious if you think about it, once you've done the cow1, further rewrites
to the same block before further snapshots won't cow further, so if only
a few blocks are repeatedly rewritten multiple times between snapshots, the
effect should be relatively small.  Similarly if snapshots happen far more
frequently than block rewrites, since in that case most of the snapshots
won't have anything changed (for that file anyway) since the last one.

However, if most of the file gets rewritten between snapshots and the
snapshot frequency is often enough to be a major factor, the effect can be
practically as bad as if the file weren't nocow in the first place.

If I knew a bit more about rrd's rewrite pattern... and your snapshot
pattern...


Second, as you alluded to, on btrfs files must be set nocow before anything
is written to them.  Quoting the chattr (1) manpage:  "If it is set on a
file which already has data blocks, it is undefined when the blocks
assigned to the file will be fully stable."

Not being a dev I don't read the code to know what that means in practice,
but it could well be effectively cow1, which would yield the maximum 2X
size you assumed.

But I think it's best to take "undefined" at its meaning, and assume
worst-case "no effect at all", for size calculation purposes, unless you
really /did/ set it at file creation, before the file had content.

And the easiest way to do /that/, and something that might be worthwhile
doing anyway if you think unreclaimed still referenced extents are your
problem, is to set the nocow flag on the /directory/, then copy the
files into it, taking care to actually create them new, that is, use
--reflink=never or copy the files to a different filesystem, perhaps
tmpfs, and back, so they /have/ to be created new.  Of course with the
rewriter (rrdcached, apparently) shut down for the process.

Then, once the files are safely back in place and the filesystem synced
so the data is actually on disk, you can delete the old copies (which
will continue to serve as backups until then), and sync the filesystem
again.
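
In shell terms, roughly (a sketch only; the paths and the rrdcached service
name are assumptions, adjust to your setup):

systemctl stop rrdcached              # nothing may rewrite the files now

mkdir /var/lib/rrd.new
chattr +C /var/lib/rrd.new            # new files inherit nocow from the dir

# real new files, no reflinks, so +C applies from the moment of creation
cp -a --reflink=never /var/lib/rrd/. /var/lib/rrd.new/
sync

mv /var/lib/rrd /var/lib/rrd.old      # keep the originals as a backup
mv /var/lib/rrd.new /var/lib/rrd
systemctl start rrdcached

# once satisfied that everything is fine:
rm -rf /var/lib/rrd.old
sync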

While snapshots will of course continue to keep the extents they reference
locked, for unsnapshotted files at least, this process should clear up any
still referenced but mostly rewritten old extents.

Re: btrfs-transacti hammering the system

2017-12-02 Thread Andrei Borzenkov
01.12.2017 21:04, Austin S. Hemmelgarn wrote:
> On 2017-12-01 12:13, Andrei Borzenkov wrote:
>> 01.12.2017 20:06, Hans van Kranenburg wrote:
>>>
>>> Additional tips (forgot to ask for your /proc/mounts before):
>>> * Use the noatime mount option, so that merely accessing files does not
>>> lead to metadata changes,
>>
>> Isn't 'lazytime' the default today?

Sorry, it is relatime that is today's default; I mixed them up.

>> It gives you correct atime + no extra metadata update caused by an
>> atime-only change.
> Unless things have changed since the last time this came up, BTRFS does
> not support the 'lazytime' mount option (but it doesn't complain about
> it either).
> 

Actually, since util-linux v2.27 "lazytime" is interpreted by the mount
command itself and converted into the MS_LAZYTIME flag, so it should be
available for every filesystem.

bor@10:~> sudo mkfs -t ext4 /dev/sdb1
mke2fs 1.43.7 (16-Oct-2017)
...

bor@10:~> sudo mount -t ext4 -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 8:17 / /mnt rw,relatime shared:152 - ext4 /dev/sdb1
rw,lazytime,data=ordered
bor@10:~> sudo umount /dev/sdb1
bor@10:~> sudo mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
...

bor@10:~> sudo mount -t btrfs -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 0:88 / /mnt rw,relatime shared:152 - btrfs /dev/sdb1
rw,lazytime,space_cache,subvolid=5,subvol=/
bor@10:~>


> Also, lazytime is independent of noatime, and using both can have
> benefits (lazytime will still have to write out the inode for every file
> read on the system every 24 hours, but with noatime it only has to write
> out the inode for files that have changed).
> 

OK, that's true.
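
For completeness, in fstab the combination would look something like this
(a sketch; device, mount point and the remaining options are placeholders):

# noatime avoids atime updates altogether; lazytime batches whatever
# inode timestamp updates (mtime/ctime) remain
/dev/sdb1  /data  btrfs  noatime,lazytime,subvol=/  0  0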


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each.  These
>> will each be COWed elsewhere as one might expect, by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
> 
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
> 
> # du -Lc **/*.rrd | tail -n1
> 105M    total
> 
>> But here's the kicker.  Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattred +C (however, this was not their initial state,
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong in thinking that in such a case they should occupy at most
twice their size? Or is there some tool that could show me the real
space wasted by a file, including the extent count etc.?

-- 
Tomasz Pala 


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my
previous attempts to gain some space by converting to the single profile.
That failed miserably, BTW (would it have been smarter with the "soft"
option?), but I've already managed to clean this up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB away, so it would be good
if btrfs-tools reported such a locked-out region. You've already noted that
the single-profile data itself occupied much less.

So that was the first issue; the second is a running overhead that
accumulates over time. Since yesterday, when I had 19 GB free, I've already
lost 4 GB. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage.  I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There was no effect above N=10 (with either dusage or musage).
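
(For the record, the stepping boils down to something like this -- a rough
sketch:)

# walk the usage filter upwards; stop once 'total' no longer shrinks
for N in 5 10 20 40 70; do
    btrfs balance start -dusage=$N -musage=$N /
    btrfs filesystem df /
done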

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I've been using noatime by default for years, so that's not the source of
the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...

> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each.  These
> will each be COWed elsewhere as one might expect, by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I got ~500 small files (100-500 kB) updated partially in regular
intervals:

# du -Lc **/*.rrd | tail -n1
> 105M    total

> But here's the kicker.  Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time, the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Has anyone investigated how this relates to RRD rewrites? I don't use
rrdcached, and I never thought that 100 MB of data might trash an entire
filesystem...

best regards,
-- 
Tomasz Pala 


Re: snaprotate

2017-12-02 Thread Ulli Horlacher
On Sat 2017-12-02 (13:53), Ulli Horlacher wrote:
> Being a Netapp user for a long time, I have always missed btrfs snapshots
> the way Netapp creates them.
> 
> I have now written snaprotate:
> 
> http://fex.belwue.de/snaprotate.html

Uuuppps... I just noticed that I had already posted this on 2017-09-09!
Sorry for the duplicate!

But the documentation is better now :-)

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel: ++49-711-68565868
70569 Stuttgart (Germany)      WWW: http://www.tik.uni-stuttgart.de/
REF:<20171202125356.ga1...@rus.uni-stuttgart.de>


snaprotate

2017-12-02 Thread Ulli Horlacher
Having been a Netapp user for a long time, I have always missed having
btrfs snapshots managed the way Netapp creates them.

I have now written snaprotate:

http://fex.belwue.de/snaprotate.html

snaprotate creates and manages btrfs read-only snapshots similar to
Netapp's. Snapshot names carry a date_time prefix and a class like hourly,
daily, weekly or single.

Snapshots are stored in a .snapshot/ directory in the subvolume root. 
Example: /local/home/.snapshot/2017-09-09_1200.hourly

You create a snapshot with:

snaprotate <class> <count> <directory> [<directory> ...]

<class>     is your snapshot name
<count>     is the maximum number of snapshots for this class
<directory> is the btrfs subvolume directory you want to snapshot

If count is exceeded then the oldest snapshot will be deleted.

You can use snaprotate either as root or as normal user (for your own
subvolumes).

Example usages:

snaprotate single 3 /data             # create and rotate "single" snapshot
snaprotate test 0 /tmp                # delete all "test" snapshots
snaprotate daily 7 /home /local/share # create and rotate "daily" snapshot
snaprotate -l /home                   # list snapshots
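
To run it unattended, a cron entry along these lines should do (an example
only -- install path, schedule and retention counts are up to you):

# /etc/cron.d/snaprotate (example)
0 * * * *   root  /usr/local/bin/snaprotate hourly 24 /home
30 3 * * *  root  /usr/local/bin/snaprotate daily   7 /home
45 3 * * 0  root  /usr/local/bin/snaprotate weekly  4 /home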


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel: ++49-711-68565868
70569 Stuttgart (Germany)      WWW: http://www.tik.uni-stuttgart.de/
REF:<20171202125356.ga1...@rus.uni-stuttgart.de>


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
OK, I seriously need to address that, as during the night I lost
3 GB again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> #  btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB

>> #  btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB
>> Free (estimated): 18.75GiB  (min: 18.75GiB)
   Free (estimated): 15.56GiB  (min: 15.56GiB)
>> 
>> #  btrfs dev usage /
- output not changed

>> #  btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>> 
>> # df
>> /dev/sda2        64G   45G   19G  71% /
   /dev/sda2        64G   48G   16G  76% /
>> However the difference is on active root fs:
>> 
>> -0/291   24.29GiB   9.77GiB
>> +0/291   15.99GiB  76.00MiB
   0/291   19.19GiB   3.28GiB
> 
> Since you have already shown the size of the snapshots, which hardly
> goes beyond 1G, it may well be that extent bookkeeping is the cause.
> 
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but I have a bunch of questions first;
since defrag breaks CoW, I don't want to defrag files that span multiple
snapshots unless they carry huge overhead:
1. is there any switch meaning 'defrag only exclusive data'?
2. is there any switch meaning 'defrag only extents fragmented more than X'
   or 'defrag only extents that could actually be freed'?
3. I guess there aren't, so how could I accomplish my goal, i.e. reclaim
   the space lost to fragmentation, without breaking snapshotted CoW where
   doing so would be not only pointless but actually harmful?
4. How can I prevent this from happening again? All the files that are
   written constantly (the stats collector here, PostgreSQL databases and
   logs on other machines) are marked nocow (+C); maybe some new attribute
   to mark a file for autodefrag? +t?

For example, the largest file from the stats collector:
 Total       Exclusive   Set shared   Filename
 432.00KiB   176.00KiB   256.00KiB    load/load.rrd

but most of them have 'Set shared' == 0.

5. The stats collector has been running from the beginning, and according
to the quota output it was not the issue until something happened. If the
problem was triggered by (I'm guessing) a low-space condition, and it then
results in even more space being lost, that is a dangerous positive
feedback loop which makes any filesystem unstable ("once you run out of
space, you won't recover"). Does that mean btrfs is simply not suitable
(yet?) for frequent-update usage patterns, like RRD files?

6. Or maybe some extra step should be taken just before a snapshot?
I guess 'defrag exclusive' would be perfect here - reclaiming space
before it gets locked inside a snapshot.
The rationale is obvious: since snapshot-aware defrag was removed, allow
defragmenting only the data that is not shared with any snapshot.
This would of course result in partial file defragmentation, but that
should be enough for pathological cases like mine.
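
In the meantime, the closest manual approximation I can think of is picking
the targets by hand (a rough sketch; the rrd path is a placeholder):

# files with 'Set shared' == 0 are entirely exclusive, so defragmenting
# them cannot break any sharing with snapshots
btrfs filesystem du /path/to/rrds/

# then defragment only those fully-exclusive files, one by one
btrfs filesystem defragment -v /path/to/rrds/load/load.rrd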

-- 
Tomasz Pala 