Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Sat, Aug 11, 2018 at 04:18:35AM +0200, erentheti...@mail.de wrote:
> Write hole:
> 
> 
> > The data will be readable until one of the data blocks becomes
> > inaccessible (bad sector or failed disk). This is because it is only the
> > parity block that is corrupted (old data blocks are still not modified
> > due to btrfs CoW), and the parity block is only required when recovering
> > from a disk failure.
> 
> I am unsure about your meaning. 
> Assuming you perform an unclean shutdown (eg. crash), and after restart
> perform a scrub, with no additional error (bad sector, bit-rot) before
> or after the crash:
> will you lose data? 

No, the parity blocks will be ignored and RAID5 will act like slow RAID0
if no other errors occur.

> Will you be able to mount the filesystem like normal? 

Yes.

> Additionally, will the crash create additional errors like bad
> sectors and/or bit-rot aside from the parity-block corruption?

No, only parity-block corruptions should occur.

> It's actually part of my first mail, where the btrfs Raid5/6 page
> assumes no data damage while the spinics comment implies the opposite.

The above assumes no drive failures or data corruption; however, if you
could always rely on that assumption, you could just use RAID0 instead of RAID5.

The only reason to use RAID5 is to handle cases where at least one block
(or an entire disk) fails, so the behavior of RAID5 when all disks are
working is almost irrelevant.

A drive failure could occur at any time, so even if you mount successfully,
a disk failure immediately afterwards means any stripes affected by the
write hole will be unrecoverably corrupted.

> The write hole does not seem as dangerous if you could simply scrub
> to repair damage (On smaller discs that is, where scrub doesn't take
> enough time for additional errors to occur)

Scrub can repair parity damage on normal data and metadata--it recomputes
parity from data if the data passes a CRC check.

No repair is possible for data in nodatasum files--the parity can be
recomputed, but there is no way to determine if the result is correct.

Metadata is always checksummed and transid verified; alas, there isn't
an easy way to get btrfs to perform an urgent scrub on metadata only.
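
For reference, a post-crash check might look something like this (the
mount point /mnt is just a placeholder, not anything from this thread):

    # all devices present, after the unclean shutdown:
    btrfs scrub start -Bd /mnt     # -B waits, -d prints per-device stats
    btrfs scrub status /mnt
    btrfs dev stats /mnt           # per-device error counters

This scrubs data and metadata together; as noted above, there is currently
no switch to restrict it to metadata only.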

> > Put another way: if all disks are online then RAID5/6 behaves like a slow
> > RAID0, and RAID0 does not have the partial stripe update problem because
> > all of the data blocks in RAID0 are independent. It is only when a disk
> > fails in RAID5/6 that the parity block is combined with data blocks, so
> > it is only in this case that the write hole bug can result in lost data.
> 
> So data will not be lost if no drive has failed?

Correct, but the array will have reduced failure tolerance, and RAID5
only matters when a drive has failed.  It is effectively operating in
degraded mode on parts of the array affected by write hole, and no single
disk failure can be tolerated there.

It is possible to recover the parity by performing an immediate scrub
after reboot, but this cannot be as effective as a proper RAID5 update
journal which avoids making the parity bad in the first place.

> > > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > > to the write hole, but data is. In this configuration you can determine
> > > > with high confidence which files you need to restore from backup, and
> > > > the filesystem will remain writable to replace the restored data, 
> > > > because
> > > > raid1 does not have the write hole bug.
> 
> In regard to my earlier questions, what would change if I do -draid5 -mraid1?

Metadata would be using RAID1 which is not subject to the RAID5 write
hole issue.  It is much more tolerant of unclean shutdowns especially
in degraded mode.

Data in RAID5 may be damaged when the array is in degraded mode and
a write hole occurs (in either order as long as both occur).  Due to
RAID1 metadata, the filesystem will continue to operate properly,
allowing the damaged data to be overwritten or deleted.
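
If you want to try that layout, a rough sketch (device names and mount
point are placeholders, not taken from this thread):

    # new filesystem:
    mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
    # or convert an existing filesystem in place:
    btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt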

> Lost Writes:
> 
> > Hotplugging causes an effect (lost writes) which can behave similarly
> > to the write hole bug in some instances. The similarity ends there.
> 
> Are we speaking about the same problem that is causing transid mismatch? 

Transid mismatch is usually caused by lost writes, by any mechanism
that prevents a write from being completed after the disk reports that
it was completed.
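
If you suspect this has happened, the usual places to look (mount point
is a placeholder) are the kernel log and the per-device counters, e.g.:

    dmesg | grep -i 'parent transid verify failed'
    btrfs dev stats /mnt    # generation_errs is the transid-related counter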

Drives may report that data is "in stable storage", i.e. the drive
believes it can complete the write in the future even if power is lost
now because the drive or controller has capacitors or NVRAM or similar.
If the drive is reset by the SATA host because of a cable disconnect
event, the drive may forget that it has promised to do writes in the
future.  Drives may simply lie, and claim that data has been written to
disk when the data is actually in volatile RAM and will disappear in a
power failure.

btrfs uses a transaction mechanism and CoW metadata to handle lost writes
within an interrupted transaction. 
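
If a drive is suspected of ignoring flushes, one blunt mitigation is to
turn off its volatile write cache (at a performance cost); whether this
actually helps depends on the drive honoring the setting, so treat this
as a sketch rather than a guarantee (/dev/sdX is a placeholder):

    hdparm -W /dev/sdX     # query the write cache state
    hdparm -W 0 /dev/sdX   # disable the volatile write cache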

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Andrei Borzenkov
On 10.08.2018 10:33, Tomasz Pala wrote:
> On Fri, Aug 10, 2018 at 07:03:18 +0300, Andrei Borzenkov wrote:
> 
>>> So - the limit set on any user
>>
>> Does btrfs support per-user quota at all? I am aware only of per-subvolume 
>> quotas.
> 
> Well, this is a kind of deceptive word usage in "post-truth" times.
> 
> In this case both "user" and "quota" are not valid...
> - by "user" I ment general word, not unix-user account; such user might
>   possess some container running full-blown guest OS,
> - by "quota" btrfs means - I guess, dataset-quotas?
> 
> 
> In fact: https://btrfs.wiki.kernel.org/index.php/Quota_support
> "Quota support in BTRFS is implemented at a subvolume level by the use of 
> quota groups or qgroup"
> 
> - what the hell is "quota group" and how it differs from qgroup? According to 
> btrfs-quota(8):
> 
> "The quota groups (qgroups) are managed by the subcommand btrfs qgroup(8)"
> 
> - they are the same... just completely different from traditional "quotas".
> 
> 
> My suggestion would be to completely remove the standalone "quota" word
> from btrfs documentation - there is no "quota", just "subvolume quota"
> or "qgroup" supported.
> 

Well, a qgroup allows you to limit the amount of data that can be stored in
a subvolume (or under a quota group in general), so it behaves like a
traditional quota to me.
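
In practice that looks something like the following (mount point and
subvolume path are placeholders):

    btrfs quota enable /mnt
    btrfs qgroup show -p /mnt              # rfer/excl per subvolume qgroup
    btrfs qgroup limit 10G /mnt/subvol     # cap referenced bytes
    btrfs qgroup limit -e 10G /mnt/subvol  # or cap exclusive bytes only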


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Duncan
Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted:

> But whether data is shared or exclusive seems potentially ephemeral, and
> not something a sysadmin should even be able to anticipate let alone
> individual users.

Define "user(s)".

Arguably, in the context of btrfs tool usage, "user" /is/ the admin, the 
one who cares that it's btrfs in the first place, who should have chosen 
btrfs based on the best-case match for the use-case, and who continues to 
maintain the system's btrfs filesystems using btrfs tools.

Arguably, in this context "user" is /not/ the other users the admin is 
caring for the system in behalf of, who don't care /what/ is under the 
covers so long as it works, to which should be made available more 
appropriate-to-their-needs tools should they be found necessary or useful.

> Going back to the example, I'd expect to give the user a 2GiB quota,
> with 1GiB of initially provisioned data via snapshot, so right off the
> bat they are at 50% usage of their quota. If they were to modify every
> single provisioned file, they'd in effect go from 100% shared data to
> 100% exclusive data, but their quota usage would still be 50%. That's
> completely sane and easily understandable by a regular user. The idea
> that they'd start modifying shared files, and their quota usage climbs
> is weird to me. The state of files being shared or exclusive is not user
> domain terminology anyway.

It's user-domain terminology if the "user" is the admin, who will care 
about shared/exclusive usage in the context of how it affects the usage 
of available storage resources.

"Regular users" as you use the term, that is the non-admins who just need 
to know how close they are to running out of their allotted storage 
resources, shouldn't really need to care about btrfs tool usage in the 
first place, and btrfs commands in general, including btrfs quota related 
commands, really aren't targeted at them, and aren't designed to report 
the type of information they are likely to find useful.  Other tools will 
be more appropriate.

>> The most common case is, you do a snapshot, user would only care how
>> much new space can be written into the subvolume, other than the total
>> subvolume size.
> 
> I think that's expecting a lot of users.

Not really.  Remember, "users" in this context are admins, those to whom 
the duty of maintaining their btrfs falls, and the ones at whom btrfs * 
commands are normally targeted, since this is the btrfs tool designed to 
help them with that job.

And said "users" will (presumably) be concerned about shared/exclusive if 
they're using btrfs quotas because they are trying to well manage the 
filesystem space utilization per subvolume.

(FWIW, "presumably" is thrown in there because here I don't use 
subvolumes /or/ sub-filesystem-level quotas as personally, I prefer to 
manage that at the filesystem level, with multiple independent 
filesystems and the size of individual filesystems enforcing limits on 
how much the stuff stored in them can grow.)

> I also wonder if it expects a lot from services like samba and NFS who
> have to communicate all of this in some sane way to remote clients? My
> expectation is that a remote client shows Free Space on a quota'd system
> to be based on the unused amount of the quota. I also expect if I delete
> a 1GiB file, that my quota consumption goes down. But you're saying it
> would be unchanged if I delete a 1GiB shared file, and would only go
> down if I delete a 1GiB exclusive file. Do samba and NFS know about
> shared and exclusive files? If samba and NFS don't understand this, then
> how is a user supposed to understand it?

There's a reason btrfs quotas don't work with standard VFS level quotas.  
They're managing two different things, and I'd assume the btrfs quota 
information isn't typically what samba/NFS information exporting is 
designed to deal with in the first place.  Just because a screwdriver 
/can/ be used as a hammer doesn't make it the appropriate tool for the 
job.

> And now I'm sufficiently confused I'm ready for the weekend!

LOL!

(I had today/Friday off, arguably why I'm even taking the time to reply, 
but my second day off this "week" is next Tuesday, the last day of the 
schedule-week.  I had actually forgotten that this was the last day of 
the work-week for most, until I saw that, but then, LOL!)

> And we can't have quotas getting busted all of a sudden because the
> sysadmin decides to do -dconvert -mconvert raid1, without requiring the
> sysadmin to double everyone's quota before performing the operation.

Not every_one's_, every-subvolume's.  "Everyone's" quotas shouldn't be 
affected, because that's not what btrfs quotas manage.  There are other 
(non-btrfs) tools for that.

>>> In short: values representing quotas are user-oriented ("the numbers
>>> one bought"), not storage-oriented ("the numbers they actually
>>> occupy").

Btrfs quotas are storage-oriented, and if you're using them, at 

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread erenthetitan
Write hole:


> The data will be readable until one of the data blocks becomes
> inaccessible (bad sector or failed disk). This is because it is only the
> parity block that is corrupted (old data blocks are still not modified
> due to btrfs CoW), and the parity block is only required when recovering
> from a disk failure.

I am unsure about your meaning. 
Assuming you perform an unclean shutdown (eg. crash), and after restart perform 
a scrub, with no additional error (bad sector, bit-rot) before or after the 
crash:
will you lose data? Will you be able to mount the filesystem like normal? 
Additionally, will the crash create additional errors like bad sectors and/or 
bit-rot aside from the parity-block corruption?
It's actually part of my first mail, where the btrfs Raid5/6 page assumes no 
data damage while the spinics comment implies the opposite.
The write hole does not seem as dangerous if you could simply scrub to repair 
damage (On smaller discs that is, where scrub doesn't take enough time for 
additional errors to occur)

> Put another way: if all disks are online then RAID5/6 behaves like a slow
> RAID0, and RAID0 does not have the partial stripe update problem because
> all of the data blocks in RAID0 are independent. It is only when a disk
> fails in RAID5/6 that the parity block is combined with data blocks, so
> it is only in this case that the write hole bug can result in lost data.

So data will not be lost if no drive has failed?

> > > If the filesystem is -draid5 -mraid1 then the metadata is not vulnerable
> > > to the write hole, but data is. In this configuration you can determine
> > > with high confidence which files you need to restore from backup, and
> > > the filesystem will remain writable to replace the restored data, because
> > > raid1 does not have the write hole bug.

In regard to my earlier questions, what would change if I do -draid5 -mraid1?


Lost Writes:


> Hotplugging causes an effect (lost writes) which can behave similarly
> to the write hole bug in some instances. The similarity ends there.

Are we speaking about the same problem that is causing transid mismatch? 

> They are really two distinct categories of problem. Temporary connection
> loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
> and the btrfs requirements for handling connection loss and write holes
> are very different.

What kind of bad things? Will scrub (1/10, 5/6) detect and repair it?

> > > Hot-unplugging a device can cause many lost write events at once, and
> > > each lost write event is very bad.

> Transid mismatch is btrfs detecting data
> that was previously silently corrupted by some component outside of btrfs.
> 
> btrfs can't prevent disks from silently corrupting data. It can only
> try to detect and repair the damage after the damage has occurred.

Aside from the chance that all copies of data are corrupted, is there any way 
scrubbing could fail?

> Normally RAID1/5/6/10 or DUP profiles are used for btrfs metadata, so any
> transid mismatches can be recovered by reading up-to-date data from the
> other mirror copy of the metadata, or by reconstructing the data with
> parity blocks in the RAID 5/6 case. It is only after this recovery
> mechanism fails (i.e. too many disks have a failure or corruption at
> the same time on the same sectors) that the filesystem is ended.

Does this mean that transid mismatch is harmless unless both copies are hit at 
once (and in the case of Raid 6, all three)?


Old hardware:


> > > It's fun and/or scary to put known good and bad hardware in the same
> > > RAID1 array and watch btrfs autocorrecting the bad data after every
> > > other power failure; however, the bad hardware is clearly not sufficient
> > > to implement any sort of reliable data persistence, and arrays with bad
> > > hardware in them will eventually fail.

I am having a hard time wrapping my head around this statement.
If Btrfs can repair corrupted data and Raid 6 allows two disc failures at once 
without data loss, is using old discs, even with a high average error count, not 
still pretty much safe?
You would simply have to repeat the scrubbing process more often to make sure 
that not enough data gets corrupted to break redundancy.

> > > I have one test case where I write millions of errors into a raid5/6 and
> > > the filesystem recovers every single one transparently while verifying
> > > SHA1 hashes of test data. After years of rebuilding busted ext3 on
> > > mdadm-raid5 filesystems, watching btrfs do it all automatically is
> > > just...beautiful.

Once again, if Btrfs is THIS good at repairing data, then is old hardware, 
hotplugging and maybe even (depending on whether I understood your point) the 
write hole really dangerous? Are there bugs that could destroy the data or 
filesystem without corrupting all copies of data (or all copies at once)? 
Assuming Raid 6, corrupted data would not break redundancy and repeated 
scrubbing would fix any upcoming issue.

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 06:55:58PM +0200, erentheti...@mail.de wrote:
> Did I get you right?
> Please correct me if I am wrong:
> 
> Scrubbing seems to have been fixed, you only have to run it once.

Yes.

There is one minor bug remaining here:  when scrub detects an error
on any disk in a raid5/6 array, the error counts are garbage (random
numbers on all the disks).  You will need to inspect btrfs dev stats
or the kernel log messages to learn which disks are injecting errors.

This does not impair the scrubbing function, only the detailed statistics
report (scrub status -d).

If there are no errors, scrub correctly reports 0 for all error counts.
Only raid5/6 is affected this way--other RAID profiles produce correct
scrub statistics.
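
So in practice, after a raid5/6 scrub that reports errors, something like
this is more trustworthy than the scrub counters themselves (mount point
is a placeholder):

    btrfs scrub status -d /mnt   # per-device counters (unreliable here)
    btrfs dev stats /mnt         # reliable per-device error counters
    dmesg | grep -i btrfs        # kernel messages name the failing device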

> Hotplugging (temporary connection loss) is affected by the write hole
> bug, and will create undetectable errors every 16 TB (crc32 limitation).

Hotplugging causes an effect (lost writes) which can behave similarly
to the write hole bug in some instances.  The similarity ends there.

They are really two distinct categories of problem.  Temporary connection
loss can do bad things to all RAID profiles on btrfs (not just RAID5/6)
and the btrfs requirements for handling connection loss and write holes
are very different.

> The write Hole Bug can affect both old and new data. 

Normally, only old data can be affected by the write hole bug.

The "new" data is not committed before the power failure (otherwise we
would call it "old" data), so any corrupted new data will be inaccessible
as a result of the power failure.  The filesystem will roll back to the
last complete committed data tree (discarding all new and modified data
blocks), then replay the fsync log (which repeats and completes some
writes that occurred since the last commit).  This process eliminates
new data from the filesystem whether the new data was corrupted by the
write hole or not.

Only corruptions that affect old data will remain, because old data is
not overwritten by data saved in the fsync log, and old data is not part
of the incomplete data tree that is rolled back after power failure.

Exception:  new data in nodatasum files can also be corrupted, but since
nodatasum disables all data integrity or recovery features it's hard to
define what "corrupted" means for a nodatasum file.

> Reason: BTRFS saves data in fixed size stripes, if the write operation
> fails midway, the stripe is lost.
> This does not matter much for Raid 1/10, data always uses a full stripe,
> and stripes are copied on write. Only new data could be lost.

This is incorrect.  Btrfs saves data in variable-sized extents (between
1 and 32768 4K data blocks) and btrfs has no concept of stripes outside of
its raid layer.  Stripes are never copied.

In RAID 1/10/DUP all data blocks are fully independent of each other,
i.e. writing to any block on these RAID profiles does not corrupt data in
any other block.  As a result these RAID profiles do not allow old data
to be corrupted by partially completed writes of new data.

There is striping in some profiles, but it is only used for performance
in these cases, and has no effect on data recovery.

> However, for some reason Raid 5/6 works with partial stripes, meaning
> that data is stored in stripes not completely filled by prior data,

In RAID 5/6 each data block is related to all other data blocks in the
same stripe with the parity block(s).  If any individual data block in the
stripe is updated, the parity block(s) must also be updated atomically,
or the wrong data will be reconstructed during RAID5/6 recovery.

Because btrfs does nothing to prevent it, some writes will occur
to RAID5/6 stripes that are already partially occupied by old data.
btrfs also does nothing to ensure that parity block updates are atomic,
so btrfs has the write hole bug as a result.
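
A worked toy example may help (two data blocks plus one parity block;
this is a simplification, not btrfs's actual on-disk layout):

    P  = D1 xor D2                  parity before the update
    write D2' and P' = D1 xor D2'   interrupted: D2' hits disk, P' does not
    now the disk holding D1 dies
    rebuild: P xor D2' = D1 xor D2 xor D2'   which is not D1

D1 was never rewritten, yet it is the block that comes back wrong--that
is the write hole.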

> and stripes are removed on write.

Stripes are never removed...?  A stripe is just a group of disk blocks
divided on 64K boundaries, same as mdadm and many hardware RAID5/6
implementations.

> Result: If the operation fails midway, the stripe is lost as is all
> data previously stored in it.

You can only lose as many data blocks in each stripe as there are parity
disks (i.e. raid5 can lose 0 or 1 block, while raid6 can lose 0, 1, or 2
blocks); however, multiple writes can be lost affecting multiple stripes
in a single power loss event.  Losing even 1 block is often too much.  ;)

The data will be readable until one of the data blocks becomes
inaccessible (bad sector or failed disk).  This is because it is only the
parity block that is corrupted (old data blocks are still not modified
due to btrfs CoW), and the parity block is only required when recovering
from a disk failure.

Put another way:  if all disks are online then RAID5/6 behaves like a slow
RAID0, and RAID0 does not have the partial stripe update problem because
all of the data blocks in RAID0 are independent.  It is only when a disk
fails in RAID5/6 that the parity block is combined with data blocks, so
it is only in this case that the write hole bug can result in lost data.

Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread erenthetitan
Did I get you right?
Please correct me if I am wrong:

Scrubbing seems to have been fixed, you only have to run it once.

Hotplugging (temporary connection loss) is affected by the write hole bug, and 
will create undetectable errors every 16 TB (crc32 limitation).

The write Hole Bug can affect both old and new data.
Reason: BTRFS saves data in fixed size stripes, if the write operation fails 
midway, the stripe is lost.
This does not matter much for Raid 1/10, data always uses a full stripe, and 
stripes are copied on write. Only new data could be lost.
However, for some reason Raid 5/6 works with partial stripes, meaning that data 
is stored in stripes not completely filled by prior data, and stripes are 
removed on write.
Result: If the operation fails midway, the stripe is lost, as is all data 
previously stored in it.

Transid Mismatch can silently corrupt data.
Reason: It is a separate metadata failure that is triggered by lost or incomplete 
writes, writes that are lost somewhere during transmission.
It can happen to all BTRFS configurations and is not triggered by the write 
hole.
It could happen due to brown-outs (temporary undersupply of voltage), faulty 
cables, faulty RAM, faulty disc caches, or faulty discs in general.

Both bugs could damage metadata and trigger the following:
Data will be lost (0 to 100% unreadable), and the filesystem will be read-only.
Reason: BTRFS saves metadata as a tree structure. The closer the error is to the 
root, the more data becomes unreadable.

Transid Mismatch can happen up to once every 3 months per device,
depending on the drive hardware!

Question: Does this not make transid mismatch way more dangerous than
the write hole? What would happen to other filesystems, like ext4?

On 10-Aug-2018 09:12:21 +0200, ce3g8...@umail.furryterror.org wrote: 
> > On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> > > I am searching for more information regarding possible bugs related to
> > > BTRFS Raid 5/6. All sites i could find are incomplete and information
> > > contradicts itself:
> > >
> > > The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> > > warns of the write hole bug, stating that your data remains safe
> > > (except data written during power loss, obviously) upon unclean shutdown
> > > unless your data gets corrupted by further issues like bit-rot, drive
> > > failure etc.
> > 
> > The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
> > no mitigations to prevent or avoid it in mainline kernels.
> > 
> > The write hole results from allowing a mixture of old (committed) and
> > new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
> > blocks consisting of one related data or parity block from each disk
> > in the array, such that writes to any of the data blocks affect the
> > correctness of the parity block and vice versa). If the writes were
> > not completed and one or more of the data blocks are not online, the
> > data blocks reconstructed by the raid5/6 algorithm will be corrupt.
> > 
> > If all disks are online, the write hole does not immediately
> > damage user-visible data as the old data blocks can still be read
> > directly; however, should a drive failure occur later, old data may
> > not be recoverable because the parity block will not be correct for
> > reconstructing the missing data block. A scrub can fix write hole
> > errors if all disks are online, and a scrub should be performed after
> > any unclean shutdown to recompute parity data.
> > 
> > The write hole always puts both old and new data at risk of damage;
> > however, due to btrfs's copy-on-write behavior, only the old damaged
> > data can be observed after power loss. The damaged new data will have
> > no references to it written to the disk due to the power failure, so
> > there is no way to observe the new damaged data using the filesystem.
> > Not every interrupted write causes damage to old data, but some will.
> > 
> > Two possible mitigations for the write hole are:
> > 
> > - modify the btrfs allocator to prevent writes to partially filled
> > raid5/6 stripes (similar to what the ssd mount option does, except
> > with the correct parameters to match RAID5/6 stripe boundaries),
> > and advise users to run btrfs balance much more often to reclaim
> > free space in partially occupied raid stripes
> > 
> > - add a stripe write journal to the raid5/6 layer (either in
> > btrfs itself, or in a lower RAID5 layer).
> > 
> > There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
> > to btrfs or dramatically increase the btrfs block size) that also solve
> > the write hole problem but are somewhat more invasive and less practical
> > for btrfs.
> > 
> > Note that the write hole also affects btrfs on top of other similar
> > raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
> > The btrfs CoW layer does not understand how to allocate data to avoid RMW
> > raid5 stripe updates without corrupting 

Re: BUG: scheduling while atomic

2018-08-10 Thread Qu Wenruo


On 8/11/18 6:14 AM, James Courtier-Dutton wrote:
> On 6 August 2018 at 07:26, Qu Wenruo  wrote:
>>
>>
>>> WARNING: CPU: 3 PID: 803 at
>>> /build/linux-hwe-SYRsgd/linux-hwe-4.15.0/fs/btrfs/extent_map.c:77
>>> free_extent_map+0x78/0x90 [btrfs]
>>
>> Then it makes sense, as it's a WARN_ON() line, showing one extent map is
>> still used.
>>
>> If it get freed, it will definitely cause some rbtree corruption.
>>
>>
>> It should be the only free_extent_map() call in the __do_readpage() function.
>> However, a first glance at this function shows nothing wrong, nor any new
>> modification in this function.
>> (Maybe it's the get_extent() hook?)
>>
>> Is there any reproducer? Or special workload?
> The workload is fairly simple.
> 1) The server is receiving 1Gbyte files from across the network, in 10
> minute intervals, and storing them on the HDD.
> 2) A process reads the files, scanning them for certain patterns.
> 3) A cron job deletes the old files.

This looks pretty normal.
Shouldn't cause any problem.

> 
> 
> 
>>
>> And, have you tried "btrfs check --readonly "? If there is any
>> error it would help a lot to locate the problem.
>>
> root@sandisk:~# btrfs check --readonly /dev/sda3
> Checking filesystem on /dev/sda3
> UUID: 8c9063b9-a4bb-48d1-92ba-6adf49af6fb5
> checking extents
> checking free space cache
> block group 6874953940992 has wrong amount of free space
> failed to load free space cache for block group 6874953940992

This problem can be solved by "btrfs check --clear-space-cache=v1".
Normally it shouldn't cause much of a problem.
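
For example (device name taken from the report above; run it with the
filesystem unmounted):

    btrfs check --clear-space-cache=v1 /dev/sda3

The v1 free space cache should simply be rebuilt as the filesystem is
used again.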

> checking fs roots
> checking csums
> checking root refs
> found 1488143566396 bytes used err is 0
> total csum bytes: 1448276084
> total tree bytes: 5079711744
> total fs tree bytes: 3280207872
> total extent tree bytes: 146997248
> btree space waste bytes: 1100557047
> file data blocks allocated: 2266345996288
>  referenced 2235075653632
> root@sandisk:~#
> 
> 
> So, not much to see there.
> Any more ideas?

Not really.

If it's still reproducible, maybe it's worth trying the latest upstream
kernel.

Thanks,
Qu

> 





Re: BUG: scheduling while atomic

2018-08-10 Thread James Courtier-Dutton
On 6 August 2018 at 07:26, Qu Wenruo  wrote:
>
>
>> WARNING: CPU: 3 PID: 803 at
>> /build/linux-hwe-SYRsgd/linux-hwe-4.15.0/fs/btrfs/extent_map.c:77
>> free_extent_map+0x78/0x90 [btrfs]
>
> Then it makes sense, as it's a WARN_ON() line, showing one extent map is
> still used.
>
> If it gets freed, it will definitely cause some rbtree corruption.
>
>
> It should be the only free_extent_map() call in the __do_readpage() function.
> However, a first glance at this function shows nothing wrong, nor any new
> modification in this function.
> (Maybe it's the get_extent() hook?)
>
> Is there any reproducer? Or special workload?
The workload is fairly simple.
1) The server is receiving 1Gbyte files from across the network, in 10
minute intervals, and storing them on the HDD.
2) A process reads the files, scanning them for certain patterns.
3) A cron job deletes the old files.



>
> And, have you tried "btrfs check --readonly "? If there is any
> error it would help a lot to locate the problem.
>
root@sandisk:~# btrfs check --readonly /dev/sda3
Checking filesystem on /dev/sda3
UUID: 8c9063b9-a4bb-48d1-92ba-6adf49af6fb5
checking extents
checking free space cache
block group 6874953940992 has wrong amount of free space
failed to load free space cache for block group 6874953940992
checking fs roots
checking csums
checking root refs
found 1488143566396 bytes used err is 0
total csum bytes: 1448276084
total tree bytes: 5079711744
total fs tree bytes: 3280207872
total extent tree bytes: 146997248
btree space waste bytes: 1100557047
file data blocks allocated: 2266345996288
 referenced 2235075653632
root@sandisk:~#


So, not much to see there.
Any more ideas?


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-10 14:07, Chris Murphy wrote:

On Thu, Aug 9, 2018 at 5:35 PM, Qu Wenruo  wrote:



On 8/10/18 1:48 AM, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.


In reality that's definitely not the case.

 From what I see, most users would care more about exclusively used space
(excl), rather than the total space one subvolume is referring to (rfer).


I'm confused.

So what happens in the following case with quotas enabled on Btrfs:

1. Provision a user with a directory, pre-populated with files, using
snapshot. Let's say it's 1GiB of files.
2. Set a quota for this user's directory, 1GiB.

The way I'm reading the description of Btrfs quotas, the 1GiB quota
applies to exclusive used space. So for starters, they have 1GiB of
shared data that does not affect their 1GiB quota at all.

3. User creates 500MiB worth of new files, this is exclusive usage.
They are still within their quota limit.
4. The shared data becomes obsolete for all but this one user, and is deleted.

Suddenly, 1GiB of shared data for this user is no longer shared data,
it instantly becomes exclusive data and their quota is busted. Now
consider scaling this to 12TiB of storage, with hundreds of users, and
dozens of abruptly busted quotas following this same scenario on a
weekly basis.

I *might* buy off on the idea that an overlay2 based initial
provisioning would not affect quotas. But whether data is shared or
exclusive seems potentially ephemeral, and not something a sysadmin
should even be able to anticipate let alone individual users.

Going back to the example, I'd expect to give the user a 2GiB quota,
with 1GiB of initially provisioned data via snapshot, so right off the
bat they are at 50% usage of their quota. If they were to modify every
single provisioned file, they'd in effect go from 100% shared data to
100% exclusive data, but their quota usage would still be 50%. That's
completely sane and easily understandable by a regular user. The idea
that they'd start modifying shared files, and their quota usage climbs
is weird to me. The state of files being shared or exclusive is not
user domain terminology anyway.
And it's important to note that this is the _only_ way this can sanely 
work for actually partitioning resources, which is the primary classical 
use case for quotas.


Being able to see how much data is shared and exclusive in a subvolume 
is nice, but quota groups are the wrong name for it because the current 
implementation does not work at all like quotas and can trivially result 
in both users escaping quotas (multiple ways), and in quotas being 
overreached by very large amounts for potentially indefinite periods of 
time because of actions of individuals who _don't_ own the data the 
quota is for.





The most common case is, you do a snapshot, and the user would only care how
much new space can be written into the subvolume, rather than the total
subvolume size.


I think that's expecting a lot of users.

I also wonder if it expects a lot from services like samba and NFS who
have to communicate all of this in some sane way to remote clients? My
expectation is that a remote client shows Free Space on a quota'd
system to be based on the unused amount of the quota. I also expect if
I delete a 1GiB file, that my quota consumption goes down. But you're
saying it would be unchanged if I delete a 1GiB shared file, and would
only go down if I delete a 1GiB exclusive file. Do samba and NFS know
about shared and exclusive files? If samba and NFS don't understand
this, then how is a user supposed to understand it?
It might be worth looking at how Samba and NFS work on top of ZFS on a 
platform like FreeNAS and trying to emulate that.


Behavior there is as-follows:

* The total size of the 'disk' reported over SMB (shown on Windows only 
if you map the share as a drive) is equal to the quota for the 
underlying dataset.
* The reported space used on the 'disk' reported over SMB is based on 
physical space usage after compression, with a few caveats relating to 
deduplication:
- Data which is shared across multiple datasets is accounted 
against _all_ datasets that reference it.
- Data which is shared only within a given dataset is accounted 
only once.

* Free space is reported simply as the total size minus the used space.
* Usage reported by 

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-10 14:21, Tomasz Pala wrote:

On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:


I.e.: every shared segment should be accounted within quota (at least once).

I think what you mean to say here is that every shared extent should be
accounted to quotas for every location it is reflinked from.  IOW, that
if an extent is shared between two subvolumes each with its own quota,
they should both have it accounted against their quota.


Yes.


Moreover - if there would be per-subvolume RAID levels someday, the data
should be accounted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).


This is irrelevant to your point here.  In fact, it goes against it,
you're arguing for quotas to report data like `du`, but all of
chunk-profile stuff is invisible to `du` (and everything else in
userspace that doesn't look through BTRFS ioctls).


My point is the user's point of view, not some system tool like du. Consider this:
1. user wants higher (than default) protection of some data,
2. user wants more storage space with less protection.

Ad. 1 - requesting better redundancy is similar to cp --reflink=never
- there are functional differences, but the cost is similar: trading
   space for security,

Ad. 2 - many would like to have .cache, .ccache, tmp or some build
system directory with faster writes and no redundancy at all. This
requires per-file/directory data profile attrs though.

Since we agreed that transparent data compression is the user's storage bonus,
gains from the reduced redundancy should also profit the user.
Do you actually know of any services that do this though?  I mean, 
Amazon S3 and similar services have the option of reduced redundancy 
(and other alternate storage tiers), but they charge 
per-unit-data-per-unit-time with no hard limit on how much space they 
use, and charge different rates for different storage tiers.  In 
comparison, what you appear to be talking about is something more 
similar to Dropbox or Google Drive, where you pay up front for a fixed 
amount of storage for a fixed amount of time and can't use more than 
that, and all the services I know of like that offer exactly one option 
for storage redundancy.


That aside, you seem to be overthinking this.  No sane provider is going 
to give their users the ability to create subvolumes themselves (there's 
too much opportunity for a tiny bug in your software to cost you a _lot_ 
of lost revenue, because creating subvolumes can let you escape qgroups). 
That means in turn that what you're trying to argue for is no 
different from the provider just selling units of storage for different 
redundancy levels separately, and charging different rates for each of 
them.  In fact, that approach is better, because it works independent of 
the underlying storage technology (it will work with hardware RAID, 
LVM2, MD, ZFS, and even distributed storage platforms like Ceph and 
Gluster), _and_ it lets them charge differently than the trivial case of 
N copies costing N times as much as one copy (which is not quite 
accurate in terms of actual management costs).


Now, if BTRFS were to have the ability to set profiles per-file, then 
this might be useful, albeit with the option to tune how it gets accounted.


Disclaimer: all the above statements relate to the concept and
understanding of quotas, not to be confused with qgroups.





Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:

>> I.e.: every shared segment should be accounted within quota (at least once).
> I think what you mean to say here is that every shared extent should be 
> accounted to quotas for every location it is reflinked from.  IOW, that 
> if an extent is shared between two subvolumes each with its own quota, 
> they should both have it accounted against their quota.

Yes.

>> Moreover - if there would be per-subvolume RAID levels someday, the data
>> should be accounted in relation to "default" (filesystem) RAID level,
>> i.e. having a RAID0 subvolume on RAID1 fs should account half of the
>> data, and twice the data in an opposite scenario (like "dup" profile on
>> single-drive filesystem).
>
> This is irrelevant to your point here.  In fact, it goes against it, 
> you're arguing for quotas to report data like `du`, but all of 
> chunk-profile stuff is invisible to `du` (and everything else in 
> userspace that doesn't look through BTRFS ioctls).

My point is the user's point of view, not some system tool like du. Consider this:
1. user wants higher (than default) protection of some data,
2. user wants more storage space with less protection.

Ad. 1 - requesting better redundancy is similar to cp --reflink=never
- there are functional differences, but the cost is similar: trading
  space for security,

Ad. 2 - many would like to have .cache, .ccache, tmp or some build
system directory with faster writes and no redundancy at all. This
requires per-file/directory data profile attrs though.

Since we agreed that transparent data compression is the user's storage bonus,
gains from the reduced redundancy should also profit the user.


Disclaimer: all the above statements relate to the concept and
understanding of quotas, not to be confused with qgroups.

-- 
Tomasz Pala 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Chris Murphy
On Thu, Aug 9, 2018 at 5:35 PM, Qu Wenruo  wrote:
>
>
> On 8/10/18 1:48 AM, Tomasz Pala wrote:
>> On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:
>>
>>> 2) Different limitations on exclusive/shared bytes
>>>Btrfs can set different limit on exclusive/shared bytes, further
>>>complicating the problem.
>>>
>>> 3) Btrfs quota only accounts data/metadata used by the subvolume
>>>It lacks all the shared trees (mentioned below), and in fact such
>>>shared tree can be pretty large (especially for extent tree and csum
>>>tree).
>>
>> I'm not sure about the implications, but just to clarify some things:
>>
>> when limiting somebody's data space we usually don't care about the
>> underlying "savings" coming from any deduplicating technique - these are
>> purely bonuses for system owner, so he could do larger resource overbooking.
>
> In reality that's definitely not the case.
>
> From what I see, most users would care more about exclusively used space
> (excl), rather than the total space one subvolume is referring to (rfer).

I'm confused.

So what happens in the following case with quotas enabled on Btrfs:

1. Provision a user with a directory, pre-populated with files, using
snapshot. Let's say it's 1GiB of files.
2. Set a quota for this user's directory, 1GiB.

The way I'm reading the description of Btrfs quotas, the 1GiB quota
applies to exclusive used space. So for starters, they have 1GiB of
shared data that does not affect their 1GiB quota at all.

3. User creates 500MiB worth of new files, this is exclusive usage.
They are still within their quota limit.
4. The shared data becomes obsolete for all but this one user, and is deleted.

Suddenly, 1GiB of shared data for this user is no longer shared data,
it instantly becomes exclusive data and their quota is busted. Now
consider scaling this to 12TiB of storage, with hundreds of users, and
dozens of abruptly busted quotas following this same scenario on a
weekly basis.

I *might* buy off on the idea that an overlay2 based initial
provisioning would not affect quotas. But whether data is shared or
exclusive seems potentially ephemeral, and not something a sysadmin
should even be able to anticipate let alone individual users.

Going back to the example, I'd expect to give the user a 2GiB quota,
with 1GiB of initially provisioned data via snapshot, so right off the
bat they are at 50% usage of their quota. If they were to modify every
single provisioned file, they'd in effect go from 100% shared data to
100% exclusive data, but their quota usage would still be 50%. That's
completely sane and easily understandable by a regular user. The idea
that they'd start modifying shared files, and their quota usage climbs
is weird to me. The state of files being shared or exclusive is not
user domain terminology anyway.


>
> The most common case is, you do a snapshot, and the user would only care how
> much new space can be written into the subvolume, rather than the total
> subvolume size.

I think that's expecting a lot of users.

I also wonder if it expects a lot from services like samba and NFS who
have to communicate all of this in some sane way to remote clients? My
expectation is that a remote client shows Free Space on a quota'd
system to be based on the unused amount of the quota. I also expect if
I delete a 1GiB file, that my quota consumption goes down. But you're
saying it would be unchanged if I delete a 1GiB shared file, and would
only go down if I delete a 1GiB exclusive file. Do samba and NFS know
about shared and exclusive files? If samba and NFS don't understand
this, then how is a user supposed to understand it?

And now I'm sufficiently confused I'm ready for the weekend!


>> And the numbers accounted should reflect the uncompressed sizes.
>
> No way for current extent based solution.

I'm less concerned about this. But since the extent item shows both
ram and disk byte values, why couldn't the quota and the space
reporting be predicated on the ram value which is always uncompressed?



>
>>
>>
>> Moreover - if there would be per-subvolume RAID levels someday, the data
>> should be accounted in relation to "default" (filesystem) RAID level,
>> i.e. having a RAID0 subvolume on RAID1 fs should account half of the
>> data, and twice the data in an opposite scenario (like "dup" profile on
>> single-drive filesystem).
>
> No possible again for current extent based solution.

It's fine, I think it's unintuitive for DUP or raid1 profiles to cause
quota consumption to double. The underlying configuration of the array
is not the business of the user. They can only be expected to
understand file size. Underlying space consumed, whether compressed,
or duplicated, or compressed and duplicated, is out of scope for the
user. And we can't have quotas getting busted all of a sudden because
the sysadmin decides to do -dconvert -mconvert raid1, without
requiring the sysadmin to double everyone's quota before performing
the operation.





>
>>
>>
>> In 

Re: [COMMAND HANGS] The command 'btrfs subvolume sync -s 2 xyz' can hangs.

2018-08-10 Thread Giuseppe Della Bianca
In data giovedì 9 agosto 2018 20:48:03 CEST, Jeff Mahoney ha scritto:
> On 8/9/18 11:15 AM, Giuseppe Della Bianca wrote:
> > Hi.
> > 
> > My system:
> > - Fedora 28 x86_64
> > - kernel-4.17.7-200
> > - btrfs-progs-4.15.1-1
> > 
> > The command 'btrfs subvolume sync -s 2 xyz' hangs in this case:
> > 
> > - Run command 'btrfs subvolume sync -s 2 xyz' .
> > - After some time the kernel reports an error on the filesystem.
> > 
> >   (error that existed before the command was launched.)
> > 
> > - The filesystem goes in read-only mode.
> > - The command hangs.
> 
> Can you provide the output of 'dmesg' and the contents of
> /proc//stack where  is the pid of the btrfs command process?
> 
> -Jeff

For pid info we have to wait for the problem to reoccur.
(the filesystem has been restored.)

Kernel messages

Aug  7 15:45:33 exnet kernel: WARNING: CPU: 2 PID: 9700 at fs/btrfs/extent-
tree.c:7001 __btrfs_free_extent.isra.70+0x782/0xb10 [btrfs]
Aug  7 15:45:33 exnet kernel: Modules linked in: ppp_deflate bsd_comp 
ppp_async ppp_generic slhc cdc_ether usbnet option mii usb_wwan uas 
usb_storage fuse tun bridge devlink ebtable_filter ebtables bnx2fc cnic uio 
8021q fcoe libfcoe garp mrp stp llc libfc scsi_transport_fc nf_log_ipv4 
nf_log_common xt_LOG xt_limit xt_multiport xt_CHECKSUM iptable_mangle 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 
nf_defrag_ipv4 nf_nat_ipv4 nf_nat ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
nf_defrag_ipv6 xt_conntrack nf_conntrack uinput ip6table_filter ip6_tables 
eeepc_wmi asus_wmi sparse_keymap iTCO_wdt iTCO_vendor_support rfkill ppdev 
snd_hda_codec_hdmi wmi_bmof mxm_wmi snd_usb_audio snd_hda_codec_realtek 
snd_usbmidi_lib snd_hda_codec_generic snd_rawmidi intel_rapl 
x86_pkg_temp_thermal intel_powerclamp coretemp
Aug  7 15:45:33 exnet kernel: snd_hda_intel kvm_intel kvm snd_hda_codec 
snd_hda_core snd_hwdep snd_seq snd_seq_device irqbypass crct10dif_pclmul 
crc32_pclmul snd_pcm ghash_clmulni_intel intel_cstate intel_uncore 
intel_rapl_perf snd_timer snd mei_me soundcore i2c_i801 mei shpchp parport_pc 
parport wmi acpi_pad vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) binfmt_misc 
vboxdrv(OE) btrfs libcrc32c xor zstd_decompress zstd_compress xxhash raid6_pq 
i915 i2c_algo_bit drm_kms_helper e1000e nvme drm crc32c_intel serio_raw 
nvme_core video analog gameport joydev i2c_dev
Aug  7 15:45:33 exnet kernel: CPU: 2 PID: 9700 Comm: btrfs-cleaner Tainted: G   OE 4.17.7-200.fc28.x86_64 #1
Aug  7 15:45:33 exnet kernel: Hardware name: System manufacturer System 
Product Name/Z170M-PLUS, BIOS 0704 02/18/2016
Aug  7 15:45:33 exnet kernel: RIP: 0010:__btrfs_free_extent.isra.
70+0x782/0xb10 [btrfs]
Aug  7 15:45:33 exnet kernel: RSP: 0018:af1c4992bc30 EFLAGS: 00010246
Aug  7 15:45:33 exnet kernel: RAX: fffe RBX: 8fe421682230 RCX: 

Aug  7 15:45:33 exnet kernel: RDX:  RSI:  RDI: 
8fe428f7b068
Aug  7 15:45:33 exnet kernel: RBP: 000e401ec000 R08: af1c4992bba4 R09: 
0046
Aug  7 15:45:33 exnet kernel: R10: 02ef R11:  R12: 
8fe43cd5
Aug  7 15:45:33 exnet kernel: R13: fffe R14:  R15: 
0578
Aug  7 15:45:33 exnet kernel: FS:  () 
GS:8fe5bbd0() knlGS:
Aug  7 15:45:33 exnet kernel: CS:  0010 DS:  ES:  CR0: 
80050033
Aug  7 15:45:33 exnet kernel: CR2: 564e5fb75650 CR3: 00028f20a004 CR4: 
003606e0
Aug  7 15:45:33 exnet kernel: DR0:  DR1:  DR2: 

Aug  7 15:45:33 exnet kernel: DR3:  DR6: fffe0ff0 DR7: 
0400
Aug  7 15:45:33 exnet kernel: Call Trace:
Aug  7 15:45:33 exnet kernel: __btrfs_run_delayed_refs+0x216/0x10b0 [btrfs]
Aug  7 15:45:33 exnet kernel: ? btrfs_set_disk_extent_flags+0x72/0xb0 [btrfs]
Aug  7 15:45:33 exnet kernel: btrfs_run_delayed_refs+0x78/0x180 [btrfs]
Aug  7 15:45:33 exnet kernel: btrfs_should_end_transaction+0x3e/0x60 [btrfs]
Aug  7 15:45:33 exnet kernel: btrfs_drop_snapshot+0x3cf/0x820 [btrfs]
Aug  7 15:45:33 exnet kernel: ? btree_submit_bio_start+0x20/0x20 [btrfs]
Aug  7 15:45:33 exnet kernel: btrfs_clean_one_deleted_snapshot+0xba/0xe0 
[btrfs]
Aug  7 15:45:33 exnet kernel: cleaner_kthread+0x129/0x160 [btrfs]
Aug  7 15:45:33 exnet kernel: kthread+0x112/0x130
Aug  7 15:45:33 exnet kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Aug  7 15:45:33 exnet kernel: ret_from_fork+0x35/0x40
Aug  7 15:45:33 exnet kernel: Code: b8 00 00 00 48 8b 7c 24 18 e8 bb b7 ff ff 
41 89 c5 58 5a c6 44 24 24 00 45 85 ed 0f 84 97 f9 ff ff 41 83 fd fe 0f 85 63 
fc ff ff <0f> 0b 48 8b 3b e8 94 ca 00 00 4d 89 f8 4c 89 f1 48 89 ea ff b4 
Aug  7 15:45:33 exnet kernel: ---[ end trace 961b1007d36aa769 ]---
Aug  7 15:45:33 exnet kernel: BTRFS info (device sda3): leaf 6329344 gen 
1601 total ptrs 135 free space 5456 owner 2
Aug  7 15:45:33 exnet kernel: 

Re: [RFC PATCH 00/17] btrfs zoned block device support

2018-08-10 Thread Qu Wenruo


On 8/10/18 9:32 PM, Hans van Kranenburg wrote:
> On 08/10/2018 09:28 AM, Qu Wenruo wrote:
>>
>>
>> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>>> This series adds zoned block device support to btrfs.
>>>
>>> [...]
>>
>> And this is the patch modifying extent allocator.
>>
>> Despite that, the support for zoned storage looks pretty interesting and
>> has something in common with the planned priority-aware extent allocator.
> 
> Priority-aware allocator? Is someone actually working on that, or is it
> planned like everything is 'planned' (i.e. nice idea, and might happen
> or might as well not happen ever, SIYH)?

I'm working on this, although it will take some time.

It was originally designed to solve the problem where some empty
block groups won't be freed due to pinned bytes.

Thanks,
Qu
> 





Re: [RFC PATCH 00/17] btrfs zoned block device support

2018-08-10 Thread Hans van Kranenburg
On 08/10/2018 09:28 AM, Qu Wenruo wrote:
> 
> 
> On 8/10/18 2:04 AM, Naohiro Aota wrote:
>> This series adds zoned block device support to btrfs.
>>
>> [...]
> 
> And this is the patch modifying extent allocator.
> 
> Despite that, the support for zoned storage looks pretty interesting and
> has something in common with the planned priority-aware extent allocator.

Priority-aware allocator? Is someone actually working on that, or is it
planned like everything is 'planned' (i.e. nice idea, and might happen
or might as well not happen ever, SIYH)?

-- 
Hans van Kranenburg





Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-09 13:48, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.

So - the limit set on any user should enforce maximum and absolute space
he has allocated, including the shared stuff. I could even imagine that
creating a snapshot might immediately "eat" the available quota. In a
way, that quota returned matches (give or take) `du` reported usage,
unless "do not account reflinks withing single qgroup" was easy to implemet.

I.e.: every shared segment should be accounted within quota (at least once).
I think what you mean to say here is that every shared extent should be 
accounted to quotas for every location it is reflinked from.  IOW, that 
if an extent is shared between two subvolumes each with its own quota, 
they should both have it accounted against their quota.


And the numbers accounted should reflect the uncompressed sizes.
This is actually inconsistent with pretty much every other VFS level 
quota system in existence.  Even ZFS does its accounting _after_ 
compression.  At this point, it's actually expected by most sysadmins 
that things behave that way.



Moreover - if there would be per-subvolume RAID levels someday, the data
should be accounted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).
This is irrelevant to your point here.  In fact, it goes against it, 
you're arguing for quotas to report data like `du`, but all of 
chunk-profile stuff is invisible to `du` (and everything else in 
userspace that doesn't look through BTRFS ioctls).



In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-09 19:35, Qu Wenruo wrote:



On 8/10/18 1:48 AM, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.


In reality that's definitely not the case.

 From what I see, most users would care more about exclusively used space
(excl), rather than the total space one subvolume is referring to (rfer).

The most common case is, you do a snapshot, and the user would only care how
much new space can be written into the subvolume, rather than the total
subvolume size.
I would really love to know exactly who these users are, because it 
sounds to me like you've heard from exactly zero people who are 
currently using conventional quotas to impose actual resource limits on 
other filesystems (instead of just using them for accounting, which is a 
valid use case but not what they were originally designed for).




So - the limit set on any user should enforce maximum and absolute space
he has allocated, including the shared stuff. I could even imagine that
creating a snapshot might immediately "eat" the available quota. In a
way, that quota returned matches (give or take) `du` reported usage,
unless "do not account reflinks withing single qgroup" was easy to implemet.


In fact, that's the case. In current implementation, accounting on
extent is the easiest (if not the only) way to implement.



I.e.: every shared segment should be accounted within quota (at least once).


Already accounted, at least for rfer.



And the numbers accounted should reflect the uncompressed sizes.


No way for current extent based solution.

While this may be true, this would be a killer feature to have.





Moreover - if there would be per-subvolume RAID levels someday, the data
should be accounted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).


Not possible either for the current extent based solution.




In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").


Well, if something is not possible or brings such a big performance impact,
there will be no argument about how it should work in the first place.

Thanks,
Qu





Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
On Fri, Aug 10, 2018 at 6:51 AM, Qu Wenruo  wrote:
>
>
> On 8/10/18 6:42 PM, Dan Merillat wrote:
>> On Fri, Aug 10, 2018 at 6:05 AM, Qu Wenruo  wrote:
>
> But considering your amount of block groups, mount itself may take some
> time (before trying to resume balance).

I'd believe it, a clean mount took 2-3 minutes normally.

btrfs check ran out of RAM eventually so I killed it and went on to
trying to mount again.

readonly mounted pretty quickly, so I'm just letting -o remount,rw
spin for however long it needs to.  Readonly access is fine over the
weekend, and hopefully it will be done by Monday.

To be clear, what exactly am I watching with dump-tree to monitor
forward progress?

Thanks again for the help!


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Qu Wenruo


On 8/10/18 6:42 PM, Dan Merillat wrote:
> On Fri, Aug 10, 2018 at 6:05 AM, Qu Wenruo  wrote:
> 
>>
>> I'm not sure about the details, but the fs looks pretty huge.
>> Tons of subvolumes and their free space cache inodes.
> 
> 11TB, 3 or so subvolumes and two snapshots I think.  Not particularly
> large for NAS.
> 
>> But only 3 tree reloc trees, unless you have tons of reflinked files
>> (off-line deduped), it shouldn't cause a lot of problem.
> 
> There's going to be a ton of reflinked files.  Both cp --reflink and
> via the wholefile dedup.
> 
> I freed up ~1/2 TB last month doing dedup.
> 
>> At least, we have some progress dropping tree reloc tree for subvolume 6482.
> 
> Is there a way to get an idea of how much work is left to be done on
> the reloc tree?

You could inspect it by the level.
But each level change is a huge step forward.

>  Can I walk it
> with btrfs-inspect?

Of course you can. Using -b  could show the remaining tree.

> 
> dump-tree -t TREE_RELOC is quite enormous (13+ million lines before I gave up)

No wonder, since it's a kind of snapshot of your subvolume.

> 
>> If you check the dump-tree output for the following data, the "drop key"
>> should change during mount: (inspect dump-tree can be run mounted)
>> item 175 key (TREE_RELOC ROOT_ITEM 6482) itemoff 8271 itemsize 439
>> 
>> drop key (2769795 EXTENT_DATA 12665933824) level 2
>>  ^
>>
>> So for the worst case scenario, there is some way to determine whether
>> it's processing.
> 
> I'll keep an eye on that.
> 
>> And according to the level (3), which is not small for each subvolume, I
>> doubt that's the reason why it's so slow.
>>
>> BTW, for last skip_balance mount, is there any kernel message like
>> "balance: resume skipped"?
> 
> No, the only reference to balance in kern.log is a hung
> btrfs_cancel_balance from the first reboot.

That's strange.

But considering your amount of block groups, mount itself may take some
time (before trying to resume balance).

Thanks,
Qu

> 
>> Have you tried mount the fs readonly with skip_balance? And then remount
>> rw, still with skip_balance?
> 
> No, every operation takes a long time.  It's still running the btrfs
> check, although I'm
> going to cancel it and try mount -o ro,skip_balance before I go to
> sleep and see where it is tomorrow.
> 
> Thank you for taking the time to help me with this.
> 





Re: [PATCH 1/2] btrfs: assert for num_devices below 0

2018-08-10 Thread David Sterba
On Fri, Aug 10, 2018 at 01:53:20PM +0800, Anand Jain wrote:
> In preparation to add helper function to deduce the num_devices with
> replace running, use assert instead of bug_on and warn_on.
> 
> Signed-off-by: Anand Jain 

Ok for the updated condition as it's going to be used in the new helper.

Reviewed-by: David Sterba 


Re: [PATCH v5 2/2] btrfs: add helper btrfs_num_devices() to deduce num_devices

2018-08-10 Thread David Sterba
On Fri, Aug 10, 2018 at 01:53:21PM +0800, Anand Jain wrote:
> When the replace is running the fs_devices::num_devices also includes
> the replace device, however in some operations like device delete and
> balance it needs the actual num_devices without the replace device, so
> now the function btrfs_num_devices() just provides that.
> 
> And here is a scenario where balance and replace items could co-exist.
> Consider balance is started and paused, then a replace is started,
> followed by an unmount or power-cycle of the system. During the following
> mount, open_ctree() first restarts the balance, so it must check for
> the replace device, otherwise our num_devices calculation will be wrong.
> 
> Signed-off-by: Anand Jain 
> ---
> v4->v5: uses assert.
> v3->v4: add comment and drop the inline (sorry missed it before)
> v2->v3: update changelog with the not-so-obvious balance and replace
> co-existence scenario
> v1->v2: add comments
> 
>  fs/btrfs/volumes.c | 32 ++--
>  1 file changed, 18 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 0062615a79be..630f9ec158d0 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1863,6 +1863,21 @@ void btrfs_assign_next_active_device(struct 
> btrfs_device *device,
>   fs_info->fs_devices->latest_bdev = next_device->bdev;
>  }
>  
> +/* Returns btrfs_fs_devices::num_devices excluding replace device if any */
> +static u64 btrfs_num_devices(struct btrfs_fs_info *fs_info)
> +{
> + u64 num_devices = fs_info->fs_devices->num_devices;
> +
> + btrfs_dev_replace_read_lock(&fs_info->dev_replace);
> + if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
> + ASSERT(num_devices > 0);
> + num_devices--;
> + }
> + btrfs_dev_replace_read_unlock(&fs_info->dev_replace);

I'll move the assert with the updated condition here so it covers also
the non dev-replace case. Otherwise ok.

> +
> + return num_devices;
> +}
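
(A rough sketch of the balance/replace coexistence scenario described in the
changelog above; the device names and mount point are placeholders:)

  btrfs balance start --bg /mnt                # start a balance...
  btrfs balance pause /mnt                     # ...and pause it
  btrfs replace start /dev/sdb /dev/sdc /mnt   # replace runs in the background
  # now unmount or power-cycle while the replace is still running; on the
  # next mount open_ctree() resumes the paused balance while
  # fs_devices::num_devices still counts the replace target, which is what
  # btrfs_num_devices() has to correct for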


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
On Fri, Aug 10, 2018 at 6:05 AM, Qu Wenruo  wrote:

>
> I'm not sure about the details, but the fs looks pretty huge.
> Tons of subvolumes and their free space cache inodes.

11TB, 3 or so subvolumes and two snapshots I think.  Not particularly
large for NAS.

> But only 3 tree reloc trees, unless you have tons of reflinked files
> (off-line deduped), it shouldn't cause a lot of problem.

There's going to be a ton of reflinked files.  Both cp --reflink and
via the wholefile dedup.

I freed up ~1/2 TB last month doing dedup.

> At least, we have some progress dropping tree reloc tree for subvolume 6482.

Is there a way to get an idea of how much work is left to be done on
the reloc tree?  Can I walk it
with btrfs-inspect?

dump-tree -t TREE_RELOC is quite enormous (13+ million lines before I gave up)

> If you check the dump-tree output for the following data, the "drop key"
> should change during mount: (inspect dump-tree can be run mounted)
> item 175 key (TREE_RELOC ROOT_ITEM 6482) itemoff 8271 itemsize 439
> 
> drop key (2769795 EXTENT_DATA 12665933824) level 2
>  ^
>
> So for the worst case scenario, there is some way to determine whether
> it's processing.

I'll keep an eye on that.

> And according to the level (3), which is not small for each subvolume, I
> doubt that's the reason why it's so slow.
>
> BTW, for last skip_balance mount, is there any kernel message like
> "balance: resume skipped"?

No, the only reference to balance in kern.log is a hung
btrfs_cancel_balance from the first reboot.

> Have you tried mount the fs readonly with skip_balance? And then remount
> rw, still with skip_balance?

No, every operation takes a long time.  It's still running the btrfs
check, although I'm
going to cancel it and try mount -o ro,skip_balance before I go to
sleep and see where it is tomorrow.

Thank you for taking the time to help me with this.


How to ensure that a snapshot is not corrupted?

2018-08-10 Thread Cerem Cem ASLAN
Original question is here: https://superuser.com/questions/1347843

How can we be sure that a readonly snapshot is not corrupted due to a disk failure?

Is the only way to calculate checksums ourselves and store them
for later verification, or does BTRFS handle that on its own?


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Qu Wenruo


On 8/10/18 5:39 PM, Dan Merillat wrote:
> On Fri, Aug 10, 2018 at 5:13 AM, Qu Wenruo  wrote:
>>
>>
>> On 8/10/18 4:47 PM, Dan Merillat wrote:
>>> Unfortunately that doesn't appear to be it, a forced restart and
>>> attempted to mount with skip_balance leads to the same thing.
>>
>> That's strange.
>>
>> Would you please provide the following output to determine whether we
>> have any balance running?
>>
>> # btrfs inspect dump-super -fFa 
[snip]

Nothing special for super dump.

> 
>>
>> # btrfs inspect dump-tree -t root 
>>
> 
> Too large to include inline, hopefully attaching works.

I'm not sure about the details, but the fs looks pretty huge.
Tons of subvolumes and their free space cache inodes.

But only 3 tree reloc trees, unless you have tons of reflinked files
(off-line deduped), it shouldn't cause a lot of problem.

At least, we have some progress dropping tree reloc tree for subvolume 6482.

If you check the dump-tree output for the following data, the "drop key"
should change during mount: (inspect dump-tree can be run mounted)
item 175 key (TREE_RELOC ROOT_ITEM 6482) itemoff 8271 itemsize 439

drop key (2769795 EXTENT_DATA 12665933824) level 2
 ^

So for the worst case scenario, there is some way to determine whether
it's processing.
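
(A minimal sketch for watching that drop key from userspace; the device path
and the root id 6482 are taken from this thread and may differ elsewhere:)

  while sleep 60; do
          btrfs inspect dump-tree -t root /dev/bcache0 |
                  grep -A8 "TREE_RELOC ROOT_ITEM 6482" | grep "drop key"
  done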

And according to the level (3), which is not small for each subvolume, I
doubt that's the reason why it's so slow.

BTW, for last skip_balance mount, is there any kernel message like
"balance: resume skipped"?
Have you tried mounting the fs readonly with skip_balance, and then
remounting rw, still with skip_balance?
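
(As a sketch, with the device and mount point as placeholders:)

  mount -o ro,skip_balance /dev/bcache0 /mnt/media
  # if the read-only mount comes up, try going read-write without resuming balance
  mount -o remount,rw,skip_balance /mnt/media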

> 
> I think there's a balance though:
> item 178 key (BALANCE TEMPORARY_ITEM 0) itemoff 6945 itemsize 448
> temporary item objectid BALANCE offset 0
> balance status flags 6
> DATA
> profiles 0 devid 0 target 0 flags 0
> usage_min 0 usage_max 0 pstart 0 pend 0
> vstart 0 vend 0 limit_min 0 limit_max 0
> stripes_min 0 stripes_max 0
> METADATA
> profiles 0 devid 0 target 0 flags 0
> usage_min 0 usage_max 0 pstart 0 pend 0
> vstart 0 vend 0 limit_min 0 limit_max 0
> stripes_min 0 stripes_max 0
> SYSTEM
> profiles 0 devid 0 target 0 flags 0
> usage_min 0 usage_max 0 pstart 0 pend 0
> vstart 0 vend 0 limit_min 0 limit_max 0
> stripes_min 0 stripes_max 0
> 
> btrfs check is still running, it's found one thing so far:
> 
> Checking filesystem on /dev/bcache0
> UUID: 16adc029-64c5-45ff-8114-e2f5b2f2d331
> checking extents
> ref mismatch on [13135707160576 16384] extent item 0, found 1
> tree backref 13135707160576 parent 13136850550784 root 13136850550784 not found
> in extent tree
> backpointer mismatch on [13135707160576 16384]

Just a single error, not bad (compared to some catastrophic error).
It may be a false alert since it's a backref for reloc tree, which is
not that common for us to test.

So the fs should be OK to go, just need to figure out how and why
skip_balance is not working.

Thanks,
Qu

> ERROR: errors found in extent allocation tree or chunk allocation
> checking free space cache
> checking fs roots
> 





Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
E: Resending without the 500k attachment.

On Fri, Aug 10, 2018 at 5:13 AM, Qu Wenruo  wrote:
>
>
> On 8/10/18 4:47 PM, Dan Merillat wrote:
>> Unfortunately that doesn't appear to be it, a forced restart and
>> attempted to mount with skip_balance leads to the same thing.
>
> That's strange.
>
> Would you please provide the following output to determine whether we
> have any balance running?
>
> # btrfs inspect dump-super -fFa 

superblock: bytenr=65536, device=/dev/bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0xaeff2ec3 [match]
bytenr                  65536
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    16adc029-64c5-45ff-8114-e2f5b2f2d331
label                   MEDIA
generation              4584957
root                    33947648
sys_array_size          129
chunk_root_generation   4534813
root_level              1
chunk_root              13681127653376
chunk_root_level        1
log_root                0
log_root_transid        0
log_root_level          0
total_bytes             12001954226176
bytes_used              11387838865408
sectorsize              4096
nodesize                16384
leafsize (deprecated)   16384
stripesize              4096
root_dir                6
num_devices             1
compat_flags            0x0
compat_ro_flags         0x0
incompat_flags          0x169
                        ( MIXED_BACKREF |
                          COMPRESS_LZO |
                          BIG_METADATA |
                          EXTENDED_IREF |
                          SKINNY_METADATA )
cache_generation        4584957
uuid_tree_generation    4584925
dev_item.uuid           ec51cc1f-992a-47a2-b7b2-83af026723fd
dev_item.fsid           16adc029-64c5-45ff-8114-e2f5b2f2d331 [match]
dev_item.type           0
dev_item.total_bytes    12001954226176
dev_item.bytes_used     11613258579968
dev_item.io_align       4096
dev_item.io_width       4096
dev_item.sector_size    4096
dev_item.devid          1
dev_item.dev_group      0
dev_item.seek_speed     0
dev_item.bandwidth      0
dev_item.generation     0
sys_chunk_array[2048]:
        item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 13681127456768)
                length 33554432 owner 2 stripe_len 65536 type SYSTEM|DUP
                io_align 65536 io_width 65536 sector_size 4096
                num_stripes 2 sub_stripes 1
                        stripe 0 devid 1 offset 353298808832
                        dev_uuid ec51cc1f-992a-47a2-b7b2-83af026723fd
                        stripe 1 devid 1 offset 353332363264
                        dev_uuid ec51cc1f-992a-47a2-b7b2-83af026723fd
backup_roots[4]:
        backup 0:
                backup_tree_root:       3666753175552   gen: 4584956    level: 1
                backup_chunk_root:      13681127653376  gen: 4534813    level: 1
                backup_extent_root:     3666740674560   gen: 4584956    level: 2
                backup_fs_root:         0               gen: 0          level: 0
                backup_dev_root:        199376896       gen: 4584935    level: 1
                backup_csum_root:       3666753568768   gen: 4584956    level: 3
                backup_total_bytes:     12001954226176
                backup_bytes_used:      11387838865408
                backup_num_devices:     1

        backup 1:
                backup_tree_root:       33947648        gen: 4584957    level: 1
                backup_chunk_root:      13681127653376  gen: 4534813    level: 1
                backup_extent_root:     33980416        gen: 4584957    level: 2
                backup_fs_root:         0               gen: 0          level: 0
                backup_dev_root:        34160640        gen: 4584957    level: 1
                backup_csum_root:       34357248        gen: 4584957    level: 3
                backup_total_bytes:     12001954226176
                backup_bytes_used:      11387838865408
                backup_num_devices:     1

        backup 2:
                backup_tree_root:       3666598461440   gen: 4584954    level: 1
                backup_chunk_root:      13681127653376  gen: 4534813    level: 1
                backup_extent_root:     3666595233792   gen: 4584954    level: 2
                backup_fs_root:         0               gen: 0          level: 0
                backup_dev_root:        199376896       gen: 4584935    level: 1
                backup_csum_root:       300034304       gen: 4584954    level: 3
                backup_total_bytes:     12001954226176
                backup_bytes_used:      11387838898176
                backup_num_devices:     1

        backup 3:
                backup_tree_root:       390998272       gen: 4584955    level: 1
                backup_chunk_root:      13681127653376  gen: 4534813    level: 1
                backup_extent_root:     390293760       gen: 4584955    level: 2
                backup_fs_root:         0               gen: 0          level: 0
                backup_dev_root:        199376896       gen: 4584935    level: 1
                backup_csum_root:       391604480       gen: 4584955    level: 3
                backup_total_bytes:     12001954226176
                backup_bytes_used:      11387838881792
                backup_num_devices:     1


superblock: bytenr=67108864, device=/dev/bcache0
---------------------------------------------------------
csum_type               0 (crc32c)
csum_size               4
csum                    0x0e9e060d [match]
bytenr                  67108864
flags                   0x1
                        ( WRITTEN )
magic                   _BHRfS_M [match]
fsid                    16adc029-64c5-45ff-8114-e2f5b2f2d331
label                   MEDIA

Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Qu Wenruo


On 8/10/18 5:42 PM, Eryu Guan wrote:
> On Fri, Aug 10, 2018 at 05:10:29PM +0800, Qu Wenruo wrote:
>>
>>
>> On 8/10/18 4:54 PM, Filipe Manana wrote:
>>> On Fri, Aug 10, 2018 at 9:46 AM, Qu Wenruo  wrote:


 On 8/9/18 5:26 PM, Filipe Manana wrote:
> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
>> This bug is exposed by populating a high level qgroup, and then make it
>> orphan (high level qgroup without child)
>
> Same comment as in the kernel patch:
>
> "That sentence is confusing. An orphan, by definition [1], is someone
> (or something in this case) without parents.
> But you mention a group without children, so that should be named
> "childless" or simply say "without children".
> So one part of the sentence is wrong, either what is in parenthesis or
> what comes before them.
>
> [1] https://www.thefreedictionary.com/orphan
> "
>
>> with old qgroup numbers, and
>> finally do rescan.
>>
>> Normally rescan should zero out all qgroups' accounting number, but due
>> to a kernel bug which won't mark orphan qgroups dirty, their on-disk
>> data is not updated, thus old numbers remain and cause qgroup
>> corruption.
>>
>> Fixed by the following kernel patch:
>> "btrfs: qgroup: Dirty all qgroups before rescan"
>>
>> Reported-by: Misono Tomohiro 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  tests/btrfs/170 | 82 +
>>  tests/btrfs/170.out |  3 ++
>>  tests/btrfs/group   |  1 +
>>  3 files changed, 86 insertions(+)
>>  create mode 100755 tests/btrfs/170
>>  create mode 100644 tests/btrfs/170.out
>>
>> diff --git a/tests/btrfs/170 b/tests/btrfs/170
>> new file mode 100755
>> index ..bcf8b5c0e4f3
>> --- /dev/null
>> +++ b/tests/btrfs/170
>> @@ -0,0 +1,82 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
>> +#
>> +# FS QA Test 170
>> +#
>> +# Test if btrfs can clear orphan (high level qgroup without child) 
>> qgroup's
>> +# accounting numbers during rescan.
>> +# Fixed by the following kernel patch:
>> +# "btrfs: qgroup: Dirty all qgroups before rescan"
>> +#
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1   # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +   cd /
>> +   rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +_scratch_mkfs > /dev/null 2>&1
>> +_scratch_mount
>> +
>> +
>> +# Populate the fs
>> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
>> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
>> /dev/null
>> +
>> +# Ensure that file reach disk, so it will also appear in snapshot
>
> # Ensure that buffered file data is persisted, so we won't have an
> empty file in the snapshot.
>> +sync
>> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
>> "$SCRATCH_MNT/snapshot"
>> +
>> +
>> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>> +
>> +# Create high level qgroup
>> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
>> +
>> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
>> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
>> +# to ensure it will work, we just ignore the return value.
>
> Comment should go away IMHO. The preferred way is to call
> $BTRFS_UTIL_PROG and have failures noticed
> through differences in the golden output. There's no point in
> mentioning something that currently doesn't work
> if it's not used here.

 In this case, I think we still need to mention why we don't use
 _run_btrfs_util_progs, in fact if we use _run_btrfs_util_progs, the test
 will just fail due to the return value.

 In fact, it's a workaround and worthy noting IIRC.
>>>
>>> Still disagree, because we are not checking the return value and rely
>>> on errors printing something to stderr/stdout.
>>
>> OK, either way I'll introduce a new filter here for filtering out either
>> "Quota data changed, rescan scheduled" or "quotas may be inconsistent,
>> rescan needed".
>>
>> As there is patch floating 

Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Eryu Guan
On Fri, Aug 10, 2018 at 05:10:29PM +0800, Qu Wenruo wrote:
> 
> 
> On 8/10/18 4:54 PM, Filipe Manana wrote:
> > On Fri, Aug 10, 2018 at 9:46 AM, Qu Wenruo  wrote:
> >>
> >>
> >> On 8/9/18 5:26 PM, Filipe Manana wrote:
> >>> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
>  This bug is exposed by populating a high level qgroup, and then make it
>  orphan (high level qgroup without child)
> >>>
> >>> Same comment as in the kernel patch:
> >>>
> >>> "That sentence is confusing. An orphan, by definition [1], is someone
> >>> (or something in this case) without parents.
> >>> But you mention a group without children, so that should be named
> >>> "childless" or simply say "without children".
> >>> So one part of the sentence is wrong, either what is in parenthesis or
> >>> what comes before them.
> >>>
> >>> [1] https://www.thefreedictionary.com/orphan
> >>> "
> >>>
>  with old qgroup numbers, and
>  finally do rescan.
> 
>  Normally rescan should zero out all qgroups' accounting number, but due
>  to a kernel bug which won't mark orphan qgroups dirty, their on-disk
>  data is not updated, thus old numbers remain and cause qgroup
>  corruption.
> 
>  Fixed by the following kernel patch:
>  "btrfs: qgroup: Dirty all qgroups before rescan"
> 
>  Reported-by: Misono Tomohiro 
>  Signed-off-by: Qu Wenruo 
>  ---
>   tests/btrfs/170 | 82 +
>   tests/btrfs/170.out |  3 ++
>   tests/btrfs/group   |  1 +
>   3 files changed, 86 insertions(+)
>   create mode 100755 tests/btrfs/170
>   create mode 100644 tests/btrfs/170.out
> 
>  diff --git a/tests/btrfs/170 b/tests/btrfs/170
>  new file mode 100755
>  index ..bcf8b5c0e4f3
>  --- /dev/null
>  +++ b/tests/btrfs/170
>  @@ -0,0 +1,82 @@
>  +#! /bin/bash
>  +# SPDX-License-Identifier: GPL-2.0
>  +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
>  +#
>  +# FS QA Test 170
>  +#
>  +# Test if btrfs can clear orphan (high level qgroup without child) 
>  qgroup's
>  +# accounting numbers during rescan.
>  +# Fixed by the following kernel patch:
>  +# "btrfs: qgroup: Dirty all qgroups before rescan"
>  +#
>  +seq=`basename $0`
>  +seqres=$RESULT_DIR/$seq
>  +echo "QA output created by $seq"
>  +
>  +here=`pwd`
>  +tmp=/tmp/$$
>  +status=1   # failure is the default!
>  +trap "_cleanup; exit \$status" 0 1 2 3 15
>  +
>  +_cleanup()
>  +{
>  +   cd /
>  +   rm -f $tmp.*
>  +}
>  +
>  +# get standard environment, filters and checks
>  +. ./common/rc
>  +. ./common/filter
>  +
>  +# remove previous $seqres.full before test
>  +rm -f $seqres.full
>  +
>  +# real QA test starts here
>  +
>  +# Modify as appropriate.
>  +_supported_fs btrfs
>  +_supported_os Linux
>  +_require_scratch
>  +
>  +_scratch_mkfs > /dev/null 2>&1
>  +_scratch_mount
>  +
>  +
>  +# Populate the fs
>  +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
>  +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
>  /dev/null
>  +
>  +# Ensure that file reach disk, so it will also appear in snapshot
> >>>
> >>> # Ensure that buffered file data is persisted, so we won't have an
> >>> empty file in the snapshot.
>  +sync
>  +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
>  "$SCRATCH_MNT/snapshot"
>  +
>  +
>  +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
>  +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>  +
>  +# Create high level qgroup
>  +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
>  +
>  +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
>  +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
>  +# to ensure it will work, we just ignore the return value.
> >>>
> >>> Comment should go away IMHO. The preferred way is to call
> >>> $BTRFS_UTIL_PROG and have failures noticed
> >>> through differences in the golden output. There's no point in
> >>> mentioning something that currently doesn't work
> >>> if it's not used here.
> >>
> >> In this case, I think we still need to mention why we don't use
> >> _run_btrfs_util_progs, in fact if we use _run_btrfs_util_progs, the test
> >> will just fail due to the return value.
> >>
> >> In fact, it's a workaround and worthy noting IIRC.
> > 
> > Still disagree, because we are not checking the return value and rely
> > on errors printing something to stderr/stdout.
> 
> OK, either way I'll introduce a new filter here for filtering out either
> "Quota data changed, rescan scheduled" or "quotas may be inconsistent,
> rescan needed".
> 
> As there is patch floating around to change the default behavior of
> 

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 15:55:46 +0800, Qu Wenruo wrote:

>> The first thing about virtually every mechanism should be
>> discoverability and reliability. I expect my quota not to change without
>> my interaction. Never. How did you cope with this?
>> If not - how are you going to explain such weird behaviour to users?
> 
> Read the manual first.
> Not every feature is suitable for every use case.

I, the sysadm, must RTFM.
My users won't comprehend this and moreover - they won't even care.

> IIRC lvm thin is pretty much the same for the same case.

LVM doesn't pretend to be user-oriented, it is the system scope.
LVM didn't name its thin provisioning "quotas".

> For 4 disk with 1T free space each, if you're using RAID5 for data, then
> you can write 3T data.
> But if you're also using RAID10 for metadata, and you're using default
> inline, we can use small files to fill the free space, resulting 2T
> available space.
> 
> So in this case how would you calculate the free space? 3T or 2T or
> anything between them?

The answer is pretty simple: 3T. Rationale:
- this is the space I do can put in a single data stream,
- people are aware that there is metadata overhead with any object;
  after all, metadata are also data,
- while filling the fs with small files the free space available would
  self-adjust after every single file put, so after uploading 1T of such
  files the df should report 1.5T free. There would be nothing weird(er
  than now) in 1T of data having actually eaten 1.5T of storage.

No crystal ball calculations, just KISS; since one _can_ put a 3T file
(non-sparse, incompressible, bulk-written) on a filesystem, the free space is
3T.

> Only yourself know what the heck you're going to use the that 4 disks
> with 1T free space each.
> Btrfs can't look into your head and know what you're thinking.

It shouldn't. I expect raw data - there is 3TB of unallocated space for
current data profile.

> That's the design from the very beginning of btrfs, yelling at me makes
> no sense at all.

Sorry if you perceive me as "yelling" - I must honestly put it down to my
non-native English. I just want to clarify some terminology and
perspective expectations. They are irrelevant to the underlying
technical solutions, but the literal *description* of the solution
you provide should match user expectations of that terminology.

> I have tried to explain what btrfs quota does and it doesn't, if it
> doesn't fit you use case, that's all.
> (Whether you have ever tried to understand is another problem)

I am (more than before) aware of what btrfs quotas are not.

So, my only expectation (except for worldwide peace and other
unrealistic ones) would be to stop using "quotas", "subvolume quotas"
and "qgroups" interchangeably in btrfs context, as IMvHO these are not
plain, well-known "quotas".

-- 
Tomasz Pala 


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Qu Wenruo


On 8/10/18 4:47 PM, Dan Merillat wrote:
> Unfortunately that doesn't appear to be it, a forced restart and
> attempted to mount with skip_balance leads to the same thing.

That's strange.

Would you please provide the following output to determine whether we
have any balance running?

# btrfs inspect dump-super -fFa 

# btrfs inspect dump-tree -t root 

> 
> 20 minutes in, btrfs-transaction had a large burst of reads then started
> spinning the CPU with the disk idle.
> 
> Is this recoverable? I could leave it for a day or so if it may make
> progress, but if not I'd like to start on other options.

When umounted, would you please also try "btrfs check --readonly
" to see if there is anything wrong about the fs?

Thanks,
Qu

> 
> On Fri, Aug 10, 2018 at 3:59 AM, Qu Wenruo  wrote:
>>
>>
>> On 8/10/18 3:40 PM, Dan Merillat wrote:
>>> Kernel 4.17.9, 11tb BTRFS device (md-backed, not btrfs raid)
>>>
>>> I was testing something out and enabled quota groups and started getting
>>> 2-5 minute long pauses where a btrfs-transaction thread spun at 100%.
>>
>> Looks pretty like a running balance and quota.
>>
>> Would you please try with balance disabled (temporarily) with
>> skip_balance mount option to see if it works.
>>
>> If it works, then either try resume balance, or just cancel the balance.
>>
>> Nowadays balance is not needed routinely, especially when you still have
>> unallocated space and enabled quota.
>>
>> Thanks,
>> Qu
>>
>>>
>>> Post-reboot the mount process spins at 100% CPU, occasionally yielding
>>> to a btrfs-transaction thread at 100% CPU.  The switchover is marked
>>> by a burst of disk activity in btrace.
>>>
>>> Btrace shows all disk activity is returning promptly - no hanging submits.
>>>
>>> Currently the mount is at 6+ hours.
>>>
>>> Suggestions on how to go about debugging this?
>>>
>>





Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Qu Wenruo


On 8/10/18 4:54 PM, Filipe Manana wrote:
> On Fri, Aug 10, 2018 at 9:46 AM, Qu Wenruo  wrote:
>>
>>
>> On 8/9/18 5:26 PM, Filipe Manana wrote:
>>> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
 This bug is exposed by populating a high level qgroup, and then make it
 orphan (high level qgroup without child)
>>>
>>> Same comment as in the kernel patch:
>>>
>>> "That sentence is confusing. An orphan, by definition [1], is someone
>>> (or something in this case) without parents.
>>> But you mention a group without children, so that should be named
>>> "childless" or simply say "without children".
>>> So one part of the sentence is wrong, either what is in parenthesis or
>>> what comes before them.
>>>
>>> [1] https://www.thefreedictionary.com/orphan
>>> "
>>>
 with old qgroup numbers, and
 finally do rescan.

 Normally rescan should zero out all qgroups' accounting number, but due
 to a kernel bug which won't mark orphan qgroups dirty, their on-disk
 data is not updated, thus old numbers remain and cause qgroup
 corruption.

 Fixed by the following kernel patch:
 "btrfs: qgroup: Dirty all qgroups before rescan"

 Reported-by: Misono Tomohiro 
 Signed-off-by: Qu Wenruo 
 ---
  tests/btrfs/170 | 82 +
  tests/btrfs/170.out |  3 ++
  tests/btrfs/group   |  1 +
  3 files changed, 86 insertions(+)
  create mode 100755 tests/btrfs/170
  create mode 100644 tests/btrfs/170.out

 diff --git a/tests/btrfs/170 b/tests/btrfs/170
 new file mode 100755
 index ..bcf8b5c0e4f3
 --- /dev/null
 +++ b/tests/btrfs/170
 @@ -0,0 +1,82 @@
 +#! /bin/bash
 +# SPDX-License-Identifier: GPL-2.0
 +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
 +#
 +# FS QA Test 170
 +#
 +# Test if btrfs can clear orphan (high level qgroup without child) 
 qgroup's
 +# accounting numbers during rescan.
 +# Fixed by the following kernel patch:
 +# "btrfs: qgroup: Dirty all qgroups before rescan"
 +#
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo "QA output created by $seq"
 +
 +here=`pwd`
 +tmp=/tmp/$$
 +status=1   # failure is the default!
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
 +_cleanup()
 +{
 +   cd /
 +   rm -f $tmp.*
 +}
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +
 +# remove previous $seqres.full before test
 +rm -f $seqres.full
 +
 +# real QA test starts here
 +
 +# Modify as appropriate.
 +_supported_fs btrfs
 +_supported_os Linux
 +_require_scratch
 +
 +_scratch_mkfs > /dev/null 2>&1
 +_scratch_mount
 +
 +
 +# Populate the fs
 +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
 +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
 /dev/null
 +
 +# Ensure that file reach disk, so it will also appear in snapshot
>>>
>>> # Ensure that buffered file data is persisted, so we won't have an
>>> empty file in the snapshot.
 +sync
 +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
 "$SCRATCH_MNT/snapshot"
 +
 +
 +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
 +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
 +
 +# Create high level qgroup
 +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
 +
 +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
 +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
 +# to ensure it will work, we just ignore the return value.
>>>
>>> Comment should go away IMHO. The preferred way is to call
>>> $BTRFS_UTIL_PROG and have failures noticed
>>> through differences in the golden output. There's no point in
>>> mentioning something that currently doesn't work
>>> if it's not used here.
>>
>> In this case, I think we still need to mention why we don't use
>> _run_btrfs_util_progs, in fact if we use _run_btrfs_util_progs, the test
>> will just fail due to the return value.
>>
>> In fact, it's a workaround and worthy noting IIRC.
> 
> Still disagree, because we are not checking the return value and rely
> on errors printing something to stderr/stdout.

OK, either way I'll introduce a new filter here for filtering out either
"Quota data changed, rescan scheduled" or "quotas may be inconsistent,
rescan needed".

There is also a patch floating around to change the default behavior of
"btrfs qgroup assign" to schedule a rescan automatically, so the test
needs to handle both cases anyway.
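
(A sketch of what such a filter could look like in common/filter; the helper
name here is made up and the final version may well differ:)

  _filter_btrfs_qgroup_warnings()
  {
          sed -e "/Quota data changed, rescan scheduled/d" \
              -e "/quotas may be inconsistent, rescan needed/d"
  }

  # usage: $BTRFS_UTIL_PROG qgroup assign ... 2>&1 | _filter_btrfs_qgroup_warnings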

> 
>>
>>>
 +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
 +
 +# Above assign will mark qgroup inconsistent due to the shared extents
>>>
>>> assign -> assignment
>>>
 +# between 

Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Filipe Manana
On Fri, Aug 10, 2018 at 9:46 AM, Qu Wenruo  wrote:
>
>
> On 8/9/18 5:26 PM, Filipe Manana wrote:
>> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
>>> This bug is exposed by populating a high level qgroup, and then make it
>>> orphan (high level qgroup without child)
>>
>> Same comment as in the kernel patch:
>>
>> "That sentence is confusing. An orphan, by definition [1], is someone
>> (or something in this case) without parents.
>> But you mention a group without children, so that should be named
>> "childless" or simply say "without children".
>> So one part of the sentence is wrong, either what is in parenthesis or
>> what comes before them.
>>
>> [1] https://www.thefreedictionary.com/orphan
>> "
>>
>>> with old qgroup numbers, and
>>> finally do rescan.
>>>
>>> Normally rescan should zero out all qgroups' accounting number, but due
>>> to a kernel bug which won't mark orphan qgroups dirty, their on-disk
>>> data is not updated, thus old numbers remain and cause qgroup
>>> corruption.
>>>
>>> Fixed by the following kernel patch:
>>> "btrfs: qgroup: Dirty all qgroups before rescan"
>>>
>>> Reported-by: Misono Tomohiro 
>>> Signed-off-by: Qu Wenruo 
>>> ---
>>>  tests/btrfs/170 | 82 +
>>>  tests/btrfs/170.out |  3 ++
>>>  tests/btrfs/group   |  1 +
>>>  3 files changed, 86 insertions(+)
>>>  create mode 100755 tests/btrfs/170
>>>  create mode 100644 tests/btrfs/170.out
>>>
>>> diff --git a/tests/btrfs/170 b/tests/btrfs/170
>>> new file mode 100755
>>> index ..bcf8b5c0e4f3
>>> --- /dev/null
>>> +++ b/tests/btrfs/170
>>> @@ -0,0 +1,82 @@
>>> +#! /bin/bash
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
>>> +#
>>> +# FS QA Test 170
>>> +#
>>> +# Test if btrfs can clear orphan (high level qgroup without child) qgroup's
>>> +# accounting numbers during rescan.
>>> +# Fixed by the following kernel patch:
>>> +# "btrfs: qgroup: Dirty all qgroups before rescan"
>>> +#
>>> +seq=`basename $0`
>>> +seqres=$RESULT_DIR/$seq
>>> +echo "QA output created by $seq"
>>> +
>>> +here=`pwd`
>>> +tmp=/tmp/$$
>>> +status=1   # failure is the default!
>>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>>> +
>>> +_cleanup()
>>> +{
>>> +   cd /
>>> +   rm -f $tmp.*
>>> +}
>>> +
>>> +# get standard environment, filters and checks
>>> +. ./common/rc
>>> +. ./common/filter
>>> +
>>> +# remove previous $seqres.full before test
>>> +rm -f $seqres.full
>>> +
>>> +# real QA test starts here
>>> +
>>> +# Modify as appropriate.
>>> +_supported_fs btrfs
>>> +_supported_os Linux
>>> +_require_scratch
>>> +
>>> +_scratch_mkfs > /dev/null 2>&1
>>> +_scratch_mount
>>> +
>>> +
>>> +# Populate the fs
>>> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
>>> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
>>> /dev/null
>>> +
>>> +# Ensure that file reach disk, so it will also appear in snapshot
>>
>> # Ensure that buffered file data is persisted, so we won't have an
>> empty file in the snapshot.
>>> +sync
>>> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
>>> "$SCRATCH_MNT/snapshot"
>>> +
>>> +
>>> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
>>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>>> +
>>> +# Create high level qgroup
>>> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
>>> +
>>> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
>>> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
>>> +# to ensure it will work, we just ignore the return value.
>>
>> Comment should go away IMHO. The preferred way is to call
>> $BTRFS_UTIL_PROG and have failures noticed
>> through differences in the golden output. There's no point in
>> mentioning something that currently doesn't work
>> if it's not used here.
>
> In this case, I think we still need to mention why we don't use
> _run_btrfs_util_progs, in fact if we use _run_btrfs_util_progs, the test
> will just fail due to the return value.
>
> In fact, it's a workaround and worthy noting IIRC.

Still disagree, because we are not checking the return value and rely
on errors printing something to stderr/stdout.

>
>>
>>> +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
>>> +
>>> +# Above assign will mark qgroup inconsistent due to the shared extents
>>
>> assign -> assignment
>>
>>> +# between subvol/snapshot/high level qgroup, do rescan here
>>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>>
>> Use $BTRFS_UTIL_PROG directly instead, and adjust the golden output if 
>> needed.
>
> There is nothing special needed in "quota rescan".
>
> Only qgroup assign/remove could return 1 instead of 0.

And why not use $BTRFS_UTIL_PROG?
Not only is that the preferred way to do it nowadays (I know many older
tests use _run_btrfs_util_prog), but it will also
make this test consistent, as right now it uses both.

>
>>
>>> +
>>> +# Now remove 

Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
Unfortunately that doesn't appear to be it; a forced restart and an
attempted mount with skip_balance leads to the same thing.

20 minutes in, btrfs-transaction had a large burst of reads then started
spinning the CPU with the disk idle.

Is this recoverable? I could leave it for a day or so if it may make
progress, but if not I'd like to start on other options.

On Fri, Aug 10, 2018 at 3:59 AM, Qu Wenruo  wrote:
>
>
> On 8/10/18 3:40 PM, Dan Merillat wrote:
>> Kernel 4.17.9, 11tb BTRFS device (md-backed, not btrfs raid)
>>
>> I was testing something out and enabled quota groups and started getting
>> 2-5 minute long pauses where a btrfs-transaction thread spun at 100%.
>
> Looks pretty like a running balance and quota.
>
> Would you please try with balance disabled (temporarily) with
> skip_balance mount option to see if it works.
>
> If it works, then either try resume balance, or just cancel the balance.
>
> Nowadays balance is not needed routinely, especially when you still have
> unallocated space and enabled quota.
>
> Thanks,
> Qu
>
>>
>> Post-reboot the mount process spins at 100% CPU, occasionally yielding
>> to a btrfs-transaction thread at 100% CPU.  The switchover is marked
>> by a burst of disk activity in btrace.
>>
>> Btrace shows all disk activity is returning promptly - no hanging submits.
>>
>> Currently the mount is at 6+ hours.
>>
>> Suggestions on how to go about debugging this?
>>
>


Re: [PATCH] fstests: btrfs: Add test for corrupted orphan qgroup numbers

2018-08-10 Thread Qu Wenruo


On 8/9/18 5:26 PM, Filipe Manana wrote:
> On Thu, Aug 9, 2018 at 8:45 AM, Qu Wenruo  wrote:
>> This bug is exposed by populating a high level qgroup, and then make it
>> orphan (high level qgroup without child)
> 
> Same comment as in the kernel patch:
> 
> "That sentence is confusing. An orphan, by definition [1], is someone
> (or something in this case) without parents.
> But you mention a group without children, so that should be named
> "childless" or simply say "without children".
> So one part of the sentence is wrong, either what is in parenthesis or
> what comes before them.
> 
> [1] https://www.thefreedictionary.com/orphan
> "
> 
>> with old qgroup numbers, and
>> finally do rescan.
>>
>> Normally rescan should zero out all qgroups' accounting number, but due
>> to a kernel bug which won't mark orphan qgroups dirty, their on-disk
>> data is not updated, thus old numbers remain and cause qgroup
>> corruption.
>>
>> Fixed by the following kernel patch:
>> "btrfs: qgroup: Dirty all qgroups before rescan"
>>
>> Reported-by: Misono Tomohiro 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  tests/btrfs/170 | 82 +
>>  tests/btrfs/170.out |  3 ++
>>  tests/btrfs/group   |  1 +
>>  3 files changed, 86 insertions(+)
>>  create mode 100755 tests/btrfs/170
>>  create mode 100644 tests/btrfs/170.out
>>
>> diff --git a/tests/btrfs/170 b/tests/btrfs/170
>> new file mode 100755
>> index ..bcf8b5c0e4f3
>> --- /dev/null
>> +++ b/tests/btrfs/170
>> @@ -0,0 +1,82 @@
>> +#! /bin/bash
>> +# SPDX-License-Identifier: GPL-2.0
>> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
>> +#
>> +# FS QA Test 170
>> +#
>> +# Test if btrfs can clear orphan (high level qgroup without child) qgroup's
>> +# accounting numbers during rescan.
>> +# Fixed by the following kernel patch:
>> +# "btrfs: qgroup: Dirty all qgroups before rescan"
>> +#
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1   # failure is the default!
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +_cleanup()
>> +{
>> +   cd /
>> +   rm -f $tmp.*
>> +}
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# remove previous $seqres.full before test
>> +rm -f $seqres.full
>> +
>> +# real QA test starts here
>> +
>> +# Modify as appropriate.
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +_scratch_mkfs > /dev/null 2>&1
>> +_scratch_mount
>> +
>> +
>> +# Populate the fs
>> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
>> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
>> /dev/null
>> +
>> +# Ensure that file reach disk, so it will also appear in snapshot
> 
> # Ensure that buffered file data is persisted, so we won't have an
> empty file in the snapshot.
>> +sync
>> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
>> "$SCRATCH_MNT/snapshot"
>> +
>> +
>> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
>> +
>> +# Create high level qgroup
>> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
>> +
>> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
>> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
>> +# to ensure it will work, we just ignore the return value.
> 
> Comment should go away IMHO. The preferred way is to call
> $BTRFS_UTIL_PROG and have failures noticed
> through differences in the golden output. There's no point in
> mentioning something that currently doesn't work
> if it's not used here.

In this case, I think we still need to mention why we don't use
_run_btrfs_util_progs, in fact if we use _run_btrfs_util_progs, the test
will just fail due to the return value.

In fact, it's a workaround and worthy noting IIRC.

> 
>> +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
>> +
>> +# Above assign will mark qgroup inconsistent due to the shared extents
> 
> assign -> assignment
> 
>> +# between subvol/snapshot/high level qgroup, do rescan here
>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> 
> Use $BTRFS_UTIL_PROG directly instead, and adjust the golden output if needed.

There is nothing special needed in "quota rescan".

Only qgroup assign/remove could return 1 instead of 0.

> 
>> +
>> +# Now remove the qgroup relationship and make 1/0 orphan
>> +# Due to the shared extent outside of 1/0, we will mark qgroup inconsistent
>> +# and keep the number of qgroup 1/0
> 
> Missing "." at the end of the sentences.
> 
>> +$BTRFS_UTIL_PROG qgroup remove "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
>> +
>> +# Above removal also marks qgroup inconsistent, rescan again
>> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> 
> Use $BTRFS_UTIL_PROG directly instead, and adjust the golden output if needed.

The extra warning 

Re: [PATCH v2] fstests: btrfs: Add test for corrupted childless qgroup numbers

2018-08-10 Thread Filipe Manana
On Fri, Aug 10, 2018 at 3:20 AM, Qu Wenruo  wrote:
> This bug is exposed by populating a high level qgroup, and then make it
> childless with old qgroup numbers, and finally do rescan.
>
> Normally rescan should zero out all qgroups' accounting number, but due
> to a kernel bug which won't mark childless qgroups dirty, their on-disk
> data is never updated, thus old numbers remain and cause qgroup
> corruption.
>
> Fixed by the following kernel patch:
> "btrfs: qgroup: Dirty all qgroups before rescan"
>
> Reported-by: Misono Tomohiro 
> Signed-off-by: Qu Wenruo 
> ---
> changelog:
> v2:
>   Change the adjective for the offending group, from "orphan" to
>   "childless"

All the previous comments still apply as they weren't addressed, for
example using $BTRFS_UTIL_PROG instead of _run_btrfs_util_prog.
Thanks.

> ---
>  tests/btrfs/170 | 83 +
>  tests/btrfs/170.out |  3 ++
>  tests/btrfs/group   |  1 +
>  3 files changed, 87 insertions(+)
>  create mode 100755 tests/btrfs/170
>  create mode 100644 tests/btrfs/170.out
>
> diff --git a/tests/btrfs/170 b/tests/btrfs/170
> new file mode 100755
> index ..3a810e80562f
> --- /dev/null
> +++ b/tests/btrfs/170
> @@ -0,0 +1,83 @@
> +#! /bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +# Copyright (c) 2018 SUSE Linux Products GmbH.  All Rights Reserved.
> +#
> +# FS QA Test 170
> +#
> +# Test if btrfs can clear high level childless qgroup's accounting numbers
> +# during rescan.
> +#
> +# Fixed by the following kernel patch:
> +# "btrfs: qgroup: Dirty all qgroups before rescan"
> +#
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1   # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs > /dev/null 2>&1
> +_scratch_mount
> +
> +
> +# Populate the fs
> +_run_btrfs_util_prog subvolume create "$SCRATCH_MNT/subvol"
> +_pwrite_byte 0xcdcd 0 1M "$SCRATCH_MNT/subvol/file1" | _filter_xfs_io > 
> /dev/null
> +
> +# Ensure that file reach disk, so it will also appear in snapshot
> +sync
> +_run_btrfs_util_prog subvolume snapshot "$SCRATCH_MNT/subvol" 
> "$SCRATCH_MNT/snapshot"
> +
> +
> +_run_btrfs_util_prog quota enable "$SCRATCH_MNT"
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# Create high level qgroup
> +_run_btrfs_util_prog qgroup create 1/0 "$SCRATCH_MNT"
> +
> +# Don't use _run_btrfs_util_prog here, as it can return 1 to info user
> +# that qgroup is marked inconsistent, this is a bug in btrfs-progs, but
> +# to ensure it will work, we just ignore the return value.
> +$BTRFS_UTIL_PROG qgroup assign "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above assign will mark qgroup inconsistent due to the shared extents
> +# between subvol/snapshot/high level qgroup, do rescan here
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# Now remove the qgroup relationship and make 1/0 childless
> +# Due to the shared extent outside of 1/0, we will mark qgroup inconsistent
> +# and keep the number of qgroup 1/0
> +$BTRFS_UTIL_PROG qgroup remove "$SCRATCH_MNT/snapshot" 1/0 "$SCRATCH_MNT"
> +
> +# Above removal also marks qgroup inconsistent, rescan again
> +_run_btrfs_util_prog quota rescan -w "$SCRATCH_MNT"
> +
> +# After the test, btrfs check will verify qgroup numbers to catch any
> +# corruption.
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/btrfs/170.out b/tests/btrfs/170.out
> new file mode 100644
> index ..9002199e48ed
> --- /dev/null
> +++ b/tests/btrfs/170.out
> @@ -0,0 +1,3 @@
> +QA output created by 170
> +WARNING: quotas may be inconsistent, rescan needed
> +WARNING: quotas may be inconsistent, rescan needed
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index b616c73d09bf..339c977135c0 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -172,3 +172,4 @@
>  167 auto quick replace volume
>  168 auto quick send
>  169 auto quick send
> +170 auto quick qgroup
> --
> 2.18.0
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Qu Wenruo


On 8/10/18 3:40 PM, Dan Merillat wrote:
> Kernel 4.17.9, 11tb BTRFS device (md-backed, not btrfs raid)
> 
> I was testing something out and enabled quota groups and started getting
> 2-5 minute long pauses where a btrfs-transaction thread spun at 100%.

Looks pretty like a running balance and quota.

Would you please try with balance disabled (temporarily) with
skip_balance mount option to see if it works.

If it works, then either try resume balance, or just cancel the balance.

Nowadays balance is not needed routinely, especially when you still have
unallocated space and enabled quota.
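
(If it helps, the suggested sequence might look like this - the device and
mount point are placeholders:)

  mount -o skip_balance /dev/bcache0 /mnt/media   # mount without resuming the balance
  btrfs balance status /mnt/media                 # see what the interrupted balance was doing
  btrfs balance resume /mnt/media                 # either resume it...
  # btrfs balance cancel /mnt/media               # ...or cancel it instead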

Thanks,
Qu

> 
> Post-reboot the mount process spins at 100% CPU, occasionally yielding
> to a btrfs-transaction thread at 100% CPU.  The switchover is marked
> by a burst of disk activity in btrace.
> 
> Btrace shows all disk activity is returning promptly - no hanging submits.
> 
> Currently the mount is at 6+ hours.
> 
> Suggestions on how to go about debugging this?
> 





Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Qu Wenruo


On 8/10/18 3:17 PM, Tomasz Pala wrote:
> On Fri, Aug 10, 2018 at 07:35:32 +0800, Qu Wenruo wrote:
> 
>>> when limiting somebody's data space we usually don't care about the
>>> underlying "savings" coming from any deduplicating technique - these are
>>> purely bonuses for system owner, so he could do larger resource overbooking.
>>
>> In reality that's definitely not the case.
> 
> Definitely? How do you "sell" a disk space when there is no upper bound?
> Every, and I mean _every_ resource quota out in the wild gives you an 
> user-perspective.
> You can assign CPU cores/time, RAM or network bandwidth with HARD limit.
> 
> Only after that you _can_ sometimes assign some best-effort
> outer, not guaranteed limits, like extra network bandwidth or grace
> periods with filesystem usage (disregarding technical details - in case
> of quota you move hard limit beyond and apply lowere soft limit).
> 
> This is the primary quota usage. Quotas don't save system resources,
> quotas are valuables to "sell" (by quotes I mean every possible
> allocations, including interorganisation accouting).
> 
> Quotas are overbookable by design and like I said before, the underlying
> savings mechanism allow sysadm to increase actual overbooking ratio.
> 
> If I run out of CPU, RAM, storage or network I simply need to expand
> such resource. I won't shrink quotas in such case.
> Or apply some other resuorce-saving technique, like LVM with VDO,
> swapping, RAM deduplication etc.
> 
> If that is not the usecase of btrfs quotas, then it should be renamed to
> not confuse users. Using the incorrect terms for things widely known
> leads to user frustration at least.
> 
>> From what I see, most users would care more about exclusively used space
>> (excl), other than the total space one subvolume is referring to (rfer).
> 
> Consider this:
> 1. there is some "template" system-wide snapshot,
> 2. users X and Y have CoW copies of it - both see "0 bytes exclusive"?

Yep, although not zero, it's 16K.

> 3. sysadm removes "template" - what happens to X and Y quotas?

Still 16K, unless X or Y dropes their copy.

> 4. user X removes his copy - what happens to Y quota?

Now Y owns all of the snapshot exclusively.

In fact, that's not the correct way to organize your qgroups.
In your case, you should create a higher-level qgroup (1/0) to contain the
original snapshot and user X's and Y's subvolumes.

In that case, all the snapshots' data and X/Y's newer data are all
exclusive to qgroup 1/0 (as long as you don't do reflink to files out of
subvolume X/Y/snapshot).

And then the exclusive number of qgroup 1/0 should be your total usage;
as long as you don't reflink out of X/Y/the snapshot source, your rfer is
the same as excl, both representing how many bytes are used by all three
subvolumes.

This is in btrfs-quota(5) man page.
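
(A minimal sketch of that layout; the subvolume qgroup ids 0/256-0/258 and the
mount point are placeholders:)

  btrfs quota enable /mnt
  btrfs qgroup create 1/0 /mnt
  btrfs qgroup assign 0/256 1/0 /mnt      # template snapshot
  btrfs qgroup assign 0/257 1/0 /mnt      # user X's subvolume
  btrfs qgroup assign 0/258 1/0 /mnt      # user Y's subvolume
  btrfs qgroup limit 100G 1/0 /mnt        # one hard limit over all three
  btrfs qgroup show -pcre /mnt            # excl of 1/0 = total space used by the group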

> 
> The first thing about virtually every mechanism should be
> discoverability and reliability. I expect my quota not to change without
> my interaction. Never. How did you cope with this?
> If not - how are you going to explain such weird behaviour to users?

Read the manual first.
Not every feature is suitable for every use case.

IIRC lvm thin is pretty much the same for the same case.

> 
> Once again: numbers of quotas *I* got must not be influenced by external
> operations or foreign users.
> 
>> The most common case is, you do a snapshot, user would only care how
>> much new space can be written into the subvolume, other than the total
>> subvolume size.
> 
> If only that would be the case... then exactly - I do care how much new
> data is _guaranteed_ to fit on my storage.
> 
> So please tell me, as I might get it wrong - what happens if source
> subvolume get's removed and the CoWed data are not shared anymore?

It's exclusive to the only owner.

> Is the quota recalculated? - this would be wrong, as there were no new data 
> written.

It's recalculated, and due to the owner change the number will change.
It's about extent ownership; as already stated, not every solution suits
every use case.

If you don't think an ownership change should change the quota numbers,
then just don't use btrfs quota (nor LVM thin, unless I've missed
something) - it doesn't fit your use case.

Your use case needs LVM snapshots (dm-snapshot), or the multi-level
qgroup setup described above.

> Is the quota left intact? - that is wrong too, as it gives a false view
> of the exclusive space taken.
> 
> This is just another reincarnation of the famous "btrfs df" problem you
> couldn't comprehend for so long - when reporting "disk FREE" status I
> want to know the amount of data that is guaranteed to fit in the
> current RAID profile, i.e. ignoring any possible savings from
> compression etc.

Because there are so many ways to use the unallocated space, it's just
impossible to give you a single number for how much space you can use.

For 4 disks with 1T of free space each, if you're using RAID5 for data,
then you can write 3T of data.
But if you're also using RAID10 for metadata, and you're using 
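
(To make the arithmetic above explicit - assuming equal unallocated
space on every device and ignoring metadata overhead - with 4 devices
and 1T unallocated on each: RAID5 data can hold roughly
(4 - 1) x 1T = 3T, RAID6 roughly (4 - 2) x 1T = 2T, and RAID1/RAID10
roughly (4 x 1T) / 2 = 2T, so any single "free space" number depends on
which profile the next allocation will use.)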

Re: Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
[23084.426006] sysrq: SysRq : Show Blocked State
[23084.426085]   taskPC stack   pid father
[23084.426332] mount   D0  4857   4618 0x0080
[23084.426403] Call Trace:
[23084.426531]  ? __schedule+0x2c3/0x830
[23084.426628]  ? __wake_up_common+0x6f/0x120
[23084.426751]  schedule+0x2d/0x90
[23084.426871]  wait_current_trans+0x98/0xc0
[23084.426953]  ? wait_woken+0x80/0x80
[23084.427058]  start_transaction+0x2e9/0x3e0
[23084.427128]  btrfs_drop_snapshot+0x48c/0x860
[23084.427220]  merge_reloc_roots+0xca/0x210
[23084.427277]  btrfs_recover_relocation+0x290/0x420
[23084.427399]  ? btrfs_cleanup_fs_roots+0x174/0x190
[23084.427533]  open_ctree+0x2158/0x2549
[23084.427592]  ? bdi_register_va.part.2+0x10a/0x1a0
[23084.427652]  btrfs_mount_root+0x678/0x730
[23084.427709]  ? pcpu_next_unpop+0x32/0x40
[23084.427797]  ? pcpu_alloc+0x2f6/0x680
[23084.427884]  ? mount_fs+0x30/0x150
[23084.427939]  ? btrfs_decode_error+0x20/0x20
[23084.427996]  mount_fs+0x30/0x150
[23084.428054]  vfs_kern_mount.part.7+0x4f/0xf0
[23084.428111]  btrfs_mount+0x156/0x8ad
[23084.428167]  ? pcpu_block_update_hint_alloc+0x15e/0x1d0
[23084.428226]  ? pcpu_next_unpop+0x32/0x40
[23084.428282]  ? pcpu_alloc+0x2f6/0x680
[23084.428338]  ? mount_fs+0x30/0x150
[23084.428393]  mount_fs+0x30/0x150
[23084.428450]  vfs_kern_mount.part.7+0x4f/0xf0
[23084.428507]  do_mount+0x5b0/0xc60
[23084.428563]  ksys_mount+0x7b/0xd0
[23084.428618]  __x64_sys_mount+0x1c/0x20
[23084.428676]  do_syscall_64+0x55/0x110
[23084.428734]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[23084.428794] RIP: 0033:0x7efeb90daa1a
[23084.428849] RSP: 002b:7ffcc8b8fee8 EFLAGS: 0206 ORIG_RAX:
00a5
[23084.428925] RAX: ffda RBX: 55d5bef05420 RCX: 7efeb90daa1a
[23084.428987] RDX: 55d5bef05600 RSI: 55d5bef05ab0 RDI: 55d5bef05b70
[23084.429048] RBP:  R08: 55d5bef08e40 R09: 003f
[23084.429109] R10: c0ed R11: 0206 R12: 55d5bef05b70
[23084.429170] R13: 55d5bef05600 R14:  R15: 


On Fri, Aug 10, 2018 at 3:40 AM, Dan Merillat  wrote:
> Kernel 4.17.9, 11TB BTRFS device (md-backed, not btrfs raid)
>
> I was testing something out, enabled quota groups, and started getting
> 2-5 minute long pauses where a btrfs-transaction thread spun at 100%.
>
> Post-reboot the mount process spins at 100% CPU, occasionally yielding
> to a btrfs-transaction thread at 100% CPU.  The switchover is marked
> by a burst of disk activity in btrace.
>
> Btrace shows all disk activity is returning promptly - no hanging submits.
>
> Currently the mount has been running for 6+ hours.
>
> Suggestions on how to go about debugging this?


Mount stalls indefinitely after enabling quota groups.

2018-08-10 Thread Dan Merillat
Kernel 4.17.9, 11TB BTRFS device (md-backed, not btrfs raid)

I was testing something out, enabled quota groups, and started getting
2-5 minute long pauses where a btrfs-transaction thread spun at 100%.

Post-reboot the mount process spins at 100% CPU, occasionally yielding
to a btrfs-transaction thread at 100% CPU.  The switchover is marked
by a burst of disk activity in btrace.

Btrace shows all disk activity is returning promptly - no hanging submits.

Currently the mount has been running for 6+ hours.

Suggestions on how to go about debugging this?


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 07:03:18 +0300, Andrei Borzenkov wrote:

>> So - the limit set on any user
> 
> Does btrfs support per-user quota at all? I am aware only of per-subvolume 
> quotas.

Well, this is a kind of deceptive word usage in "post-truth" times.

In this case neither "user" nor "quota" is used in its usual sense...
- by "user" I meant the word in its general sense, not a unix user
  account; such a user might possess a container running a full-blown
  guest OS,
- by "quota" btrfs means - I guess - dataset quotas?


In fact: https://btrfs.wiki.kernel.org/index.php/Quota_support
"Quota support in BTRFS is implemented at a subvolume level by the use of quota 
groups or qgroup"

- what the hell is a "quota group" and how does it differ from a qgroup?
According to btrfs-quota(8):

"The quota groups (qgroups) are managed by the subcommand btrfs qgroup(8)"

- they are the same... just completely different from traditional "quotas".


My suggestion would be to completely remove the standalone word "quota"
from the btrfs documentation - there is no "quota" support, only
"subvolume quotas", i.e. "qgroups".

-- 
Tomasz Pala 


Re: List of known BTRFS Raid 5/6 Bugs?

2018-08-10 Thread Zygo Blaxell
On Fri, Aug 10, 2018 at 03:40:23AM +0200, erentheti...@mail.de wrote:
> I am searching for more information regarding possible bugs related to
> BTRFS Raid 5/6. All sites I could find are incomplete and the
> information contradicts itself:
>
> The Wiki Raid 5/6 Page (https://btrfs.wiki.kernel.org/index.php/RAID56)
> warns of the write hole bug, stating that your data remains safe
> (except data written during power loss, obviously) upon unclean shutdown
> unless your data gets corrupted by further issues like bit-rot, drive
> failure etc.

The raid5/6 write hole bug exists on btrfs (as of 4.14-4.16) and there are
no mitigations to prevent or avoid it in mainline kernels.

The write hole results from allowing a mixture of old (committed) and
new (uncommitted) writes to the same RAID5/6 stripe (i.e. a group of
blocks consisting of one related data or parity block from each disk
in the array, such that writes to any of the data blocks affect the
correctness of the parity block and vice versa).  If the writes were
not completed and one or more of the data blocks are not online, the
data blocks reconstructed by the raid5/6 algorithm will be corrupt.

If all disks are online, the write hole does not immediately
damage user-visible data as the old data blocks can still be read
directly; however, should a drive failure occur later, old data may
not be recoverable because the parity block will not be correct for
reconstructing the missing data block.  A scrub can fix write hole
errors if all disks are online, and a scrub should be performed after
any unclean shutdown to recompute parity data.
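
As a concrete illustration of the "scrub after any unclean shutdown"
advice above (the mount point /mnt is just a placeholder):

  # foreground scrub (-B) with per-device statistics (-d)
  btrfs scrub start -Bd /mnt
  # or start it in the background and poll its progress
  btrfs scrub start /mnt
  btrfs scrub status /mnt

On raid5/6 this recomputes the parity data after the unclean shutdown,
as described above.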

The write hole always puts both old and new data at risk of damage;
however, due to btrfs's copy-on-write behavior, only the old damaged
data can be observed after power loss.  The damaged new data will have
no references to it written to the disk due to the power failure, so
there is no way to observe the new damaged data using the filesystem.
Not every interrupted write causes damage to old data, but some will.

Two possible mitigations for the write hole are:

- modify the btrfs allocator to prevent writes to partially filled
raid5/6 stripes (similar to what the ssd mount option does, except
with the correct parameters to match RAID5/6 stripe boundaries),
and advise users to run btrfs balance much more often to reclaim
free space in partially occupied raid stripes (see the sketch below)

- add a stripe write journal to the raid5/6 layer (either in
btrfs itself, or in a lower RAID5 layer).
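
A hedged sketch of what "run btrfs balance much more often" could look
like in practice (the mount point and the usage thresholds are
illustrative placeholders, not recommendations from this thread):

  # rewrite only data/metadata block groups that are roughly half full
  # or less, compacting partially occupied space instead of rewriting
  # everything
  btrfs balance start -dusage=50 -musage=50 /mnt
  btrfs balance status /mnt

Note that this only illustrates the reclaim step; the allocator change
that would avoid partially filled raid5/6 stripes in the first place is
not in mainline kernels.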

There are assorted other ideas (e.g. copy the RAID-Z approach from zfs
to btrfs or dramatically increase the btrfs block size) that also solve
the write hole problem but are somewhat more invasive and less practical
for btrfs.

Note that the write hole also affects btrfs on top of other similar
raid5/6 implementations (e.g. mdadm raid5 without stripe journal).
The btrfs CoW layer does not understand how to allocate data to avoid RMW
raid5 stripe updates without corrupting existing committed data, and this
limitation applies to every combination of unjournalled raid5/6 and btrfs.

> The Wiki Gotchas Page (https://btrfs.wiki.kernel.org/index.php/Gotchas)
> warns of possible incorrigible "transid" mismatch, not stating which
> versions are affected or what transid mismatch means for your data. It
> does not mention the write hole at all.

Neither raid5 nor write hole is required to produce a transid mismatch
failure.  A transid mismatch usually occurs due to a lost write.  Write hole
is a specific case of lost write, but write hole does not usually produce
transid failures (it produces header or csum failures instead).

During real disk failure events, multiple distinct failure modes can
occur concurrently, i.e. both transid failure and write hole can occur
at different places in the same filesystem as a result of attempting to
use a failing disk over a long period of time.

A transid verify failure is metadata damage.  It will make the filesystem
readonly and make some data inaccessible as described below.

> This Mail Archive
> (https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg55161.html)
> states that scrubbing BTRFS Raid 5/6 will always repair Data Corruption,
> but may corrupt your Metadata while trying to do so - meaning you have
> to scrub twice in a row to ensure data integrity.

Simple corruption (without write hole errors) has been fixed by
scrubbing for at least the last six months or so.  Kernel v4.14.xx and
later can definitely do it these days, for both data and metadata.

If the metadata is damaged in any way (corruption, write hole, or transid
verify failure) on btrfs and btrfs cannot use the raid profile for
metadata to recover the damaged data, the filesystem is usually forever
readonly, and anywhere from 0 to 100% of the filesystem may be readable
depending on where in the metadata tree structure the error occurs (the
closer to the 

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 07:35:32 +0800, Qu Wenruo wrote:

>> when limiting somebody's data space we usually don't care about the
>> underlying "savings" coming from any deduplicating technique - these
>> are purely bonuses for the system owner, so he can do more resource
>> overbooking.
> 
> In reality that's definitely not the case.

Definitely? How do you "sell" disk space when there is no upper bound?
Every, and I mean _every_, resource quota out in the wild gives you a
user perspective.
You can assign CPU cores/time, RAM or network bandwidth with a HARD
limit.

Only after that you _can_ sometimes assign some best-effort,
non-guaranteed outer limits, like extra network bandwidth or grace
periods for filesystem usage (disregarding technical details - in the
case of quota you move the hard limit further out and apply a lower
soft limit).

This is the primary use of quotas. Quotas don't save system resources;
quotas are valuables to "sell" (by quotas I mean every possible
allocation, including inter-organisation accounting).

Quotas are overbookable by design and, like I said before, the underlying
savings mechanisms allow the sysadmin to increase the actual overbooking
ratio.

If I run out of CPU, RAM, storage or network I simply need to expand
that resource; I won't shrink quotas in such a case.
Or I apply some other resource-saving technique, like LVM with VDO,
swapping, RAM deduplication etc.

If that is not the use case of btrfs quotas, then they should be renamed
so as not to confuse users. Using incorrect terms for widely known
concepts leads to user frustration at the very least.

> From what I see, most users care more about exclusively used space
> (excl) than about the total space one subvolume refers to (rfer).

Consider this:
1. there is some "template" system-wide snapshot,
2. users X and Y have CoW copies of it - both see "0 bytes exclusive"?
3. sysadm removes "template" - what happens to X and Y quotas?
4. user X removes his copy - what happens to Y quota?

The first thing about virtually every mechanism should be
discoverability and reliability. I expect my quota not to change without
my interaction. Never. Did you cope with this?
If not - how are you going to explain such weird behaviour to users?

Once again: the quota numbers *I* get must not be influenced by external
operations or other users.

> The most common case is: you take a snapshot, and the user only cares
> how much new space can be written into the subvolume, rather than the
> total subvolume size.

If only that were the case... then exactly - I do care how much new
data is _guaranteed_ to fit on my storage.

So please tell me, as I might have it wrong - what happens if the
source subvolume gets removed and the CoWed data is not shared anymore?
Is the quota recalculated? - that would be wrong, as no new data was
written.
Is the quota left intact? - that is wrong too, as it gives a false view
of the exclusive space taken.

This is just another reincarnation of the famous "btrfs df" problem you
couldn't comprehend for so long - when reporting "disk FREE" status I
want to know the amount of data that is guaranteed to fit in the
current RAID profile, i.e. ignoring any possible savings from
compression etc.


Please note: my assumptions are based on
https://btrfs.wiki.kernel.org/index.php/Quota_support

"File copy and file deletion may both affect limits since the unshared
limit of another qgroup can change if the original volume's files are
deleted and only one copy is remaining"

so if I wrote something invalid, that might be the source of my mistake.


>> And the numbers accounted should reflect the uncompressed sizes.
> 
> Not possible with the current extent-based solution.

OK, since the data is provided by the user, its "compressibility" might
be considered his saving (we only provide transparency).

>> Moreover - if there were per-subvolume RAID levels someday, the data
>> should be accounted in relation to the "default" (filesystem) RAID
>> level, i.e. having a RAID0 subvolume on a RAID1 fs should account for
>> half of the data, and twice the data in the opposite scenario (like a
>> "dup" profile on a single-drive filesystem).
> 
> Again, not possible with the current extent-based solution.

Doesn't an extent have information about the devices it's cloned on?
But OK, this is not important until per-subvolume profiles are
available.

>> In short: values representing quotas are user-oriented ("the numbers one
>> bought"), not storage-oriented ("the numbers they actually occupy").
> 
> Well, if something is not possible or has such a big performance
> impact, there is no point arguing about how it should work in the
> first place.

Actually I think you built something overcomplicated (shared/exclusive)
that only leads to user confusion (especially when one's data becomes
"exclusive" one day for no apparent reason), is misnamed... and doesn't
reflect anything valuable - unless the problems with extent
fragmentation have already been resolved somehow?

So IMHO current quotas are:
- not discoverable for