Power down tests...

2017-08-03 Thread Shyam Prasad N
Hi all,

We're running a couple of experiments on our servers with btrfs
(kernel version 4.4).
And we're running some abrupt power-off tests for a couple of scenarios:

1. We have a filesystem on top of two different btrfs filesystems
(distributed across N disks), i.e. our filesystem lays out data and
metadata on top of these two btrfs filesystems. The test workload
generates a good number of 16MB files on top of the system.
On abrupt power-off and the following reboot, what are the recommended
steps to run? We're attempting a btrfs mount, which sometimes fails.
If it fails, we run a fsck and then mount the btrfs. The issue we're
facing is that a few files end up zero-sized. As a result, there is
either data loss or an inconsistency in the stacked filesystem's
metadata.
We're mounting the btrfs with a commit period of 5s. However, I do
expect btrfs to journal the I/Os that are still dirty. Why then are we
seeing the above behaviour?

2. Another test that we're running is to create one virtual disk on each
of multiple NFS mounts (soft mounts with a timeout of 1 minute), and use
these virtual disks as individual devices for a single btrfs filesystem.
What is the expected btrfs behaviour when one of the virtual disks
becomes unresponsive for a period of time (say 5 minutes)? Does the
expectation change if the NFS mounts are mounted with the sync option?

Thanks in advance for any help you can offer.
-- 
-Shyam


Re: Massive loss of disk space

2017-08-03 Thread pwm

In 30 seconds I should be able to fill about 200MB/s * 30 s = 6GB.

Requiring 6GB of additional free space so that the parity file can grow is 
something I can live with on a 10TB disk.


It seems that for SnapRAID to have any chance of working correctly with 
parity on a BTRFS partition, it would need a min-free configuration 
parameter to make sure there is always enough free space for one parity 
file update.


But as it is right now, requiring that the disk never be filled past 50% 
just because fallocate() wants enough free space for 100% of the original 
file data to be rewritten is obviously not a workable solution.


Right now, it sounds like I should change all parity disks to a different 
file system to avoid the CoW issue. There doesn't seem to be any way to 
turn off CoW for an already existing file, and the parity data is already 
way past 50% so I can't make a copy.
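For reference, the usual workaround (a sketch only, and it needs room for a 
full extra copy, which is exactly what I don't have) is to recreate the file 
with the NOCOW attribute set while it is still empty; file names below are 
just placeholders for the SnapRAID parity file:

touch parity.new
chattr +C parity.new                  # NOCOW only takes effect on an empty file
cp --reflink=never snapraid.parity parity.new   # full physical copy
mv parity.new snapraid.parity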


/Per W

On Thu, 3 Aug 2017, Goffredo Baroncelli wrote:


On 2017-08-03 13:44, Marat Khalili wrote:

On 02/08/17 20:52, Goffredo Baroncelli wrote:

consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeeds [1]: i.e. there is enough 
space on the filesystem. Due to the CoW nature of BTRFS, you cannot rely on the 
already allocated space, because there could be a small time window where both 
the old and the new data exist on disk.
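As a rough sketch (sizes and path are only examples), that scenario in shell 
terms is:

dd if=/dev/zero of=/mnt/test/file bs=1M count=2048                          # a) create a 2GB file
fallocate -o 1G -l 2G /mnt/test/file                                        # b) reserve the range 1GB-3GB
dd if=/dev/zero of=/mnt/test/file bs=1M seek=1024 count=2048 conv=notrunc   # c) write from 1GB to 3GB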

Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2


At this step you are trying to allocate 3GB + 3GB = 6GB, so you have exhausted the 
filesystem space.


c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or 
does it only allocate additional 1GB and check free space for another 1GB? If 
it's only the latter, it is useless.

The file is physically extended

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$



--

With Best Regards,
Marat Khalili





--
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Chris Murphy
On Thu, Aug 3, 2017 at 2:45 PM, Brendan Hide  wrote:
>
> To counter, I think this is a big problem with btrfs, especially in terms of
> user attrition. We don't need "GUI" tools. At all. But we do need that btrfs
> is self-sufficient enough that regular users don't get burnt by what they
> would view as unexpected behaviour.  We have currently a situation where
> btrfs is too demanding on inexperienced users.

I think the top complaint is the manual nature of balancing to avoid
ENOSPC errors when there's still free space, followed by the balancing
needed to avoid/reduce free space fragmentation and thus maintain
consistent performance.

Obviously the kernel needs more intelligent code to free up partially
full block groups, or more precisely to free up contiguous space for it
to write to. That would solve both problems.

But in the meantime, btrfs-progs should ship a policy to do some
minimal balancing to largely obviate this and find the edge cases. Maybe
it's dusage=3 every day, dusage=10 once a week, and dusage=20
musage=20 once a month. I don't know, but some iteration in this area
is better than saying "PUNT!" and putting it in the user's lap.
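Something along these lines, as a sketch (thresholds and mount point are
only examples, not a recommendation):

# daily: repack nearly-empty data block groups
btrfs balance start -dusage=3 /mnt/data
# weekly
btrfs balance start -dusage=10 /mnt/data
# monthly: also repack lightly used metadata block groups
btrfs balance start -dusage=20 -musage=20 /mnt/data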

Better would be a trigger that is statistics based rather than time
based. The metric might be some combination of workload (i.e. idle)
and ratios found in sysfs.

Anyway, the first step is for people on this list to stop
micromanaging their own volumes, and try to center on a sane one size
fits all solution. And then iterate better and better solutions as we
determine the edge cases where one size doesn't fit all. We're
throwing hammers at the problem by default because it's a learned
behavior. We all need to just stop balancing and act like regular
users. And then figure out how to automatically optimize.




>
> I feel we need better worst-case behaviours. For example, if *I* have a
> btrfs on its second-to-last-available chunk, it means I'm not micro-managing
> properly. But users shouldn't have to micro-manage in the first place. Btrfs
> (or a management tool) should just know to balance the least-used chunk
> and/or delete the lowest-priority snapshot, etc. It shouldn't cause my
> services/apps to give diskspace errors when, clearly, there is free space
> available.

Ideally the kernel code needs to do a better job freeing up partial
block groups. But in the meantime, this can be set as an optimization
policy in user space. And it should be in btrfs-progs so it's
consistent across distros. SUSE has a distro specific balancer, on a
systemd timer, but I don't think it's enabled by default and I also
think it's weirdly too aggressive.

If it could be made smarter, with a trigger other than a timer, that'd
be even better.

But doing nothing isn't an option: one of the most consistently negative
user responses about Btrfs is the manual balancing needed to maintain
performance.



> The other "high-level" aspect would be along the lines of better guidance
> and standardisation for distros on how best to configure btrfs. This would
> include guidance/best practices for things like appropriate subvolume
> mountpoints and snapshot paths, sensible schedules or logic (or perhaps even
> example tools/scripts) for balancing and scrubbing the filesystem.

Would they listen? My experience with openSUSE is, nope.


> I don't have all the answers. But I also don't want to have to tell people
> they can't adopt it because a) they don't (or never will) understand it; and
> b) they're going to resent me for their irresponsibly losing their own data.

Sure.

You can read on the linux-raid@ list where there are still constant
problems with users doing crazy things like mdadm --create to fix a
raid assembly problem, obliterating their data by doing so and
then getting pissed. It's like, where the hell do people keep getting
the idea of doing that?

There are six-ways-to-Sunday ways of fixing a Btrfs volume. It reads
like a choose-your-own-adventure book. No, actually it's worse, because
at least the choose-your-own-adventure book tells you what page to go
to next, while Btrfs gives you zero advice on what order to try things in.


-- 
Chris Murphy


[PATCH v2] Btrfs: search parity device wisely

2017-08-03 Thread Liu Bo
After mapping the block with BTRFS_MAP_WRITE, parities have been sorted
to the end positions, so this search can start from the first parity
stripe.

Signed-off-by: Liu Bo 
---
v2: fix typo (data_stripes -> nr_data).

 fs/btrfs/raid56.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 2086383..88a5b7b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2229,12 +2229,13 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
ASSERT(!bio->bi_iter.bi_size);
rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
 
-   for (i = 0; i < rbio->real_stripes; i++) {
+   for (i = rbio->nr_data; i < rbio->real_stripes; i++) {
if (bbio->stripes[i].dev == scrub_dev) {
rbio->scrubp = i;
break;
}
}
+   ASSERT(i < rbio->real_stripes);
 
/* Now we just support the sectorsize equals to page size */
ASSERT(fs_info->sectorsize == PAGE_SIZE);
-- 
2.9.4



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Brendan Hide



On 08/03/2017 09:22 PM, Austin S. Hemmelgarn wrote:

On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)...

[snip]

As far as 'higher-level' management tools, you're using your system 
wrong if you _need_ them.  There is no need for there to be a GUI, or a 
web interface, or a DBus interface, or any other such bloat in the main 
management tools, they work just fine as is and are mostly on par with 
the interfaces provided by LVM, MD, and ZFS (other than the lack of 
machine parseable output).  I'd also argue that if you can't reassemble 
your storage stack by hand without using 'higher-level' tools, you 
should not be using that storage stack as you don't properly understand it.


On the subject of monitoring specifically, part of the issue there is 
kernel side, any monitoring system currently needs to be polling-based, 
not event-based, and as a result monitoring tends to be a very system 
specific affair based on how much overhead you're willing to tolerate. 
The limited stuff that does exist is also trivial to integrate with many 
pieces of existing monitoring infrastructure (like Nagios or monit), and 
therefore the people who care about it a lot (like me) are either 
monitoring by hand, or are just using the tools with their existing 
infrastructure (for example, I use monit already on all my systems, so I 
just make sure to have entries in the config for that to check error 
counters and scrub results), so there's not much in the way of incentive 
for the concerned parties to reinvent the wheel.


To counter, I think this is a big problem with btrfs, especially in 
terms of user attrition. We don't need "GUI" tools. At all. But we do 
need btrfs to be self-sufficient enough that regular users don't get 
burnt by what they would view as unexpected behaviour. We currently 
have a situation where btrfs is too demanding of inexperienced users.


I feel we need better worst-case behaviours. For example, if *I* have a 
btrfs on its second-to-last-available chunk, it means I'm not 
micro-managing properly. But users shouldn't have to micro-manage in the 
first place. Btrfs (or a management tool) should just know to balance 
the least-used chunk and/or delete the lowest-priority snapshot, etc. It 
shouldn't cause my services/apps to give diskspace errors when, clearly, 
there is free space available.


The other "high-level" aspect would be along the lines of better 
guidance and standardisation for distros on how best to configure btrfs. 
This would include guidance/best practices for things like appropriate 
subvolume mountpoints and snapshot paths, sensible schedules or logic 
(or perhaps even example tools/scripts) for balancing and scrubbing the 
filesystem.


I don't have all the answers. But I also don't want to have to tell 
people they can't adopt it because a) they don't (or never will) 
understand it; and b) they're going to resent me when they irresponsibly 
lose their own data.


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-03 14:29, Christoph Anton Mitterer wrote:

On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:

Brendan Hide wrote:

The title seems alarmist to me - and I suspect it is going to be
misconstrued. :-/

 From the release notes at
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Li
nux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-
7.4_Release_Notes-Deprecated_Functionality.html
"Btrfs has been deprecated



Wow... not that this would have any direct effect... it's still quite
alarming, isn't it?

This is not meant as criticism, but I often wonder myself where the
btrfs is going to!? :-/

It's in the kernel now since when? 2009? And while the extremely basic
things (snapshots, etc.) seem to work quite stable... other things seem
to be rather stuck (RAID?)... not to talk about many things that have
been kinda "promised" (fancy different compression algos, n-parity-
raid).
I assume you mean the erasure coding the devs and docs call raid56 when 
you're talking about stuck features, and you're right, it has been 
stuck, but it arguably should have been better tested and verified 
before being merged at all.  As far as other 'raid' profiles, raid1 and 
raid0 work fine, and raid10 is mostly fine once you wrap your head 
around the implications of the inconsistent component device ordering.

There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)... there are still some kinda serious
issues (the attacks/corruptions likely possible via UUID collisions)...
The UUID collision issue is present in almost all volume managers and 
filesystems, it just does more damage in BTRFS, and is exacerbated by 
the brain-dead 'scan everything for BTRFS' policy in udev.


As far as 'higher-level' management tools, you're using your system 
wrong if you _need_ them.  There is no need for there to be a GUI, or a 
web interface, or a DBus interface, or any other such bloat in the main 
management tools; they work just fine as is and are mostly on par with 
the interfaces provided by LVM, MD, and ZFS (other than the lack of 
machine-parseable output).  I'd also argue that if you can't reassemble 
your storage stack by hand without using 'higher-level' tools, you 
should not be using that storage stack, as you don't properly understand it.


On the subject of monitoring specifically, part of the issue there is 
kernel-side: any monitoring system currently needs to be polling-based, 
not event-based, and as a result monitoring tends to be a very 
system-specific affair based on how much overhead you're willing to 
tolerate.  The limited stuff that does exist is also trivial to integrate 
with many pieces of existing monitoring infrastructure (like Nagios or 
monit), and therefore the people who care about it a lot (like me) are 
either monitoring by hand, or are just using the tools with their 
existing infrastructure (for example, I already use monit on all my 
systems, so I just make sure to have entries in its config to check error 
counters and scrub results), so there's not much incentive for the 
concerned parties to reinvent the wheel.
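The checks themselves are trivial; roughly something like this (the mount 
point is just an example):

# flag any non-zero btrfs error counter
btrfs device stats /data | awk '$2 != 0 { bad = 1 } END { exit bad }' \
        || echo "btrfs device errors on /data"
# flag a scrub that found errors
btrfs scrub status /data | grep -q 'with 0 errors' \
        || echo "last scrub on /data reported errors"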

One thing that I miss since long would be the checksumming with
nodatacow.
It has been stated multiple times on the list that this is not possible 
without making nodatacow prone to data loss.

Also it has always been said that the actual performance tuning would
still lie ahead?!
While there hasn't been anything touted specifically as performance 
tuning, performance has improved slightly since I started using BTRFS.



I really like btrfs and use it on all my personal systems... and I
haven't had any data loss since then (only a number of seriously
looking false positives due to bugs in btrfs check ;-) )... but one
still reads every now and then from people here on the list who seem to
suffer from more serious losses.
And this brings up part of the issue with uptake.  People are quick to 
post about issues, but not successes.  I've been running BTRFS on almost 
everything (I don't use it in VM's because of the performance 
implications of having multiple CoW layers) since around kernel 3.9, 
have had no critical issues (ones resulting in data loss) since about 
3.16, and have actually survived quite a few pieces of marginal or 
failed hardware as a result of BTRFS.




So is there any concrete roadmap? Or priority tasks? Is there a lack of
developers?

In order, no, in theory yes but not in practice, and somewhat.

As a general rule, all FOSS projects are short on developers.  Most of 
the work that is occurring on BTRFS is being sponsored by SUSE, 
Facebook, or Fujitsu (at least, I'm pretty sure those are the primary 
sponsors), and their priorities will not necessarily coincide with 
normal end-user priorities.  I'd say though that testing and review are 
just as much short on manpower as development.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Chris Murphy
On Wed, Aug 2, 2017 at 3:11 AM, Wang Shilong  wrote:
> I haven't seen active btrfs developers there for some time; Red Hat looks
> to have put most of their efforts into XFS. It is time to switch to SLES/openSUSE!

I disagree. We need one or more Btrfs developers involved in Fedora.

Fedora runs fairly unmodified upstream kernels, which are kept up to
date. By default, Fedora 24, 25, and 26 users today are on kernel 4.11.11
or 4.11.12. Fedora 25 and 26 will soon be rebased to probably 4.12.5.
That's the stable repo. You can optionally get newer non-rc kernels from
the testing repo. And nightly Rawhide kernels are built as well with the
latest patchset in between rc's. Both Btrfs and Fedora are heavily
invested in containerized deployments, so it seems like a good fit
for both camps.

The problem is the Fedora kernel team has no one sufficiently familiar
with Btrfs, nor anyone at Red Hat to fall back on. But they do have
this with ext4, XFS, device-mapper, and LVM developers. So they're not
going to take on a burden like Btrfs by default without a
knowledgeable pair of eyes to triage issues as they come up. And
instead they're moving to XFS + overlayfs.

There's more opportunity for Btrfs than just as a default file system.
I like the idea of using Btrfs on install media to eliminate the
monolithic isomd5sum check most users skip when testing their USB install
media; to eliminate the device-mapper based persistent overlay for the
install media and use Btrfs seed/sprout instead (which would help the
Sugar on a Stick project as well); and, at least for nightly composes, to
eliminate squashfs xz based images in favor of Btrfs compression (faster
compression and decompression, bigger file sizes, but these are daily
throwaways so I think time is more important).

Anyway - point is that converging on SUSE doesn't help. If anything I
think it'll shrink the market for Btrfs as a general purpose file
system, rather than grow it.

-- 
Chris Murphy


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-03 14:08, waxhead wrote:

Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html 



"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream 
in Red Hat Enterprise Linux 7.4 and will remain available in the Red 
Hat Enterprise Linux 7 series. However, this is the last planned 
update to this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through 
your Red Hat representative on features and requirements you have for 
file systems and storage technology."



First of all I am not a BTRFS dev, but I use it for various projects and 
have high hopes for what it can become.


Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is 
deprecated. It is not removed from the kernel, and so far BTRFS offers 
features that other filesystems don't have. ZFS is something that people 
brag about all the time as a viable alternative, but to me it seems to 
be a pain to manage properly. E.g. grow, add/remove devices, shrink 
etc... good luck doing that right!


BTRFS's biggest problem is not that there are some bits and pieces that 
are thoroughly screwed up (raid5/6, which just got some fixes by the 
way), but the fact that the documentation is rather dated.


There is a simple status page here 
https://btrfs.wiki.kernel.org/index.php/Status


As others have pointed out already, the explanations on the status page 
are not exactly good. For example, compression (which was also mentioned) 
is, as of this writing, marked as 'Mostly ok' with the note '(needs 
verification and source) - auto repair and compression may crash'.


Now, I am aware that many use compression without trouble. I am not sure 
how many use compression on disks with issues and still have no trouble, 
but I would at least expect to see more people yelling on the mailing 
list if that were the case. The problem here is that this message is 
rather scary and certainly does NOT sound like 'mostly ok' to most people.


What exactly needs verification and a source? The 'mostly ok' statement, or 
something else?! A more detailed explanation would be required here to 
avoid scaring people away.
Not certain what was meant here, but there were (a while back) some 
known issues with compressed extents, but I thought those had been fixed.


Same thing with the trim feature, which is marked OK. It clearly says 
that it has performance implications. It is marked OK, so one would 
expect it not to cause the filesystem to fail, but if performance 
becomes so slow that the filesystem gets practically unusable it is of 
course not "OK". The relevant information for people to make a decent 
choice is missing, and I certainly don't know how serious these 
performance implications are, if they are relevant at all...
The performance implications bit shouldn't be listed; that's a given for 
any filesystem with discard support (TRIM is the ATA and eMMC command, 
UNMAP is the SCSI one, and ERASE is the name on SD cards; discard is the 
generic kernel term).  The issue arises from devices that don't have 
support for queuing such commands, which is quite rare for SSDs these days.
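If the latency of synchronous discards is a concern, the usual alternative 
is a periodic trim instead of the discard mount option; a sketch (the mount 
point is just an example):

fstrim -v /data
# or, on distros that ship the util-linux timer unit:
systemctl enable --now fstrim.timer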


Most people interested in BTRFS are probably a bit more paranoid and 
concerned about their data than the average computer user. What people 
tend to forget is that other filesystems have NONE of the redundancy, 
auto-repair and other fancy features that BTRFS has. So for the 
compression example above... if you run compressed files on ext4 and 
your disk gets some corruption, you are in no better a state than 
you would be with btrfs (in fact probably worse). Also, nothing is 
stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means 
you should be VERY safe.


Simple documentation is the key, so HERE ARE MY DEMANDS!!!... ehhh, 
so here is what I think should be done:


1. The documentation needs to either be improved (or old non-relevant 
stuff simply removed / archived somewhere)
2. The status page MUST always be up to date for the latest kernel 
release (It's ok so far , let's hope nobody sleeps here)
3. Proper explanations must be given so the layman and reasonably 

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Christoph Anton Mitterer
On Thu, 2017-08-03 at 20:08 +0200, waxhead wrote:
> Brendan Hide wrote:
> > The title seems alarmist to me - and I suspect it is going to be 
> > misconstrued. :-/
> > 
> > From the release notes at 
> > https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Li
> > nux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-
> > 7.4_Release_Notes-Deprecated_Functionality.html
> > "Btrfs has been deprecated
> > 

Wow... not that this would have any direct effect... it's still quite
alarming, isn't it?

This is not meant as criticism, but I often wonder myself where
btrfs is going!? :-/

It's been in the kernel now since when? 2009? And while the extremely basic
things (snapshots, etc.) seem to work quite stably... other things seem
to be rather stuck (RAID?)... not to mention the many things that have
been kinda "promised" (fancy different compression algos, n-parity
raid).
There are no higher-level management tools (e.g. RAID
management/monitoring, etc.)... there are still some kinda serious
issues (the attacks/corruptions likely possible via UUID collisions)...
One thing that I have missed for a long time is checksumming with
nodatacow.
Also it has always been said that the actual performance tuning would
still lie ahead?!


I really like btrfs and use it on all my personal systems... and I
haven't had any data loss since then (only a number of serious-looking
false positives due to bugs in btrfs check ;-) )... but one
still reads every now and then about people here on the list who seem to
suffer more serious losses.



So is there any concrete roadmap? Or priority tasks? Is there a lack of
developers?

Cheers,
Chris.



Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread waxhead

Brendan Hide wrote:
The title seems alarmist to me - and I suspect it is going to be 
misconstrued. :-/


From the release notes at 
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/7.4_Release_Notes/chap-Red_Hat_Enterprise_Linux-7.4_Release_Notes-Deprecated_Functionality.html


"Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the 
initial release of Red Hat Enterprise Linux 6. Red Hat will not be 
moving Btrfs to a fully supported feature and it will be removed in a 
future major release of Red Hat Enterprise Linux.


The Btrfs file system did receive numerous updates from the upstream 
in Red Hat Enterprise Linux 7.4 and will remain available in the Red 
Hat Enterprise Linux 7 series. However, this is the last planned 
update to this feature.


Red Hat will continue to invest in future technologies to address the 
use cases of our customers, specifically those related to snapshots, 
compression, NVRAM, and ease of use. We encourage feedback through 
your Red Hat representative on features and requirements you have for 
file systems and storage technology."



First of all I am not a BTRFS dev, but I use it for various projects and 
have high hopes for what it can become.


Now, the fact that Red Hat deprecates BTRFS does not mean that BTRFS is 
deprecated. It is not removed from the kernel, and so far BTRFS offers 
features that other filesystems don't have. ZFS is something that people 
brag about all the time as a viable alternative, but to me it seems to 
be a pain to manage properly. E.g. grow, add/remove devices, shrink 
etc... good luck doing that right!


BTRFS's biggest problem is not that there are some bits and pieces that 
are thoroughly screwed up (raid5/6, which just got some fixes by the 
way), but the fact that the documentation is rather dated.


There is a simple status page here 
https://btrfs.wiki.kernel.org/index.php/Status


As others have pointed out already, the explanations on the status page 
are not exactly good. For example, compression (which was also mentioned) 
is, as of this writing, marked as 'Mostly ok' with the note '(needs 
verification and source) - auto repair and compression may crash'.


Now, I am aware that many use compression without trouble. I am not sure 
how many use compression on disks with issues and still have no trouble, 
but I would at least expect to see more people yelling on the mailing 
list if that were the case. The problem here is that this message is 
rather scary and certainly does NOT sound like 'mostly ok' to most people.


What exactly needs verification and a source? The 'mostly ok' statement, or 
something else?! A more detailed explanation would be required here to 
avoid scaring people away.


Same thing with the trim feature, which is marked OK. It clearly says 
that it has performance implications. It is marked OK, so one would 
expect it not to cause the filesystem to fail, but if performance 
becomes so slow that the filesystem gets practically unusable it is of 
course not "OK". The relevant information for people to make a decent 
choice is missing, and I certainly don't know how serious these 
performance implications are, if they are relevant at all...


Most people interested in BTRFS are probably a bit more paranoid and 
concerned about their data than the average computer user. What people 
tend to forget is that other filesystems have NONE of the redundancy, 
auto-repair and other fancy features that BTRFS has. So for the 
compression example above... if you run compressed files on ext4 and 
your disk gets some corruption, you are in no better a state than 
you would be with btrfs (in fact probably worse). Also, nothing is 
stopping you from putting btrfs DUP on an mdadm raid5 or 6, which means 
you should be VERY safe.
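For instance, something along these lines (device names are just examples, 
and data DUP on a single device needs a reasonably recent btrfs-progs):

mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sd[b-e]
mkfs.btrfs -m dup -d dup /dev/md0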


Simple documentation is the key, so HERE ARE MY DEMANDS!!!... ehhh, 
so here is what I think should be done:


1. The documentation needs to either be improved (or old non-relevant 
stuff simply removed / archived somewhere)
2. The status page MUST always be up to date for the latest kernel 
release (It's ok so far , let's hope nobody sleeps here)
3. Proper explanations must be given so the layman and reasonably 
technical people understand the risks / issues for non-ok stuff.
4. There should be links to roadmaps for each feature on the status page 
that clearly state what is being worked on for the NEXT kernel release.








Re: Massive loss of disk space

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-03 13:15, Marat Khalili wrote:

On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli

The file is physically extended

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt


For clarity let's replace the fallocate above with:
$ head -c 1000 /dev/urandom > foo.txt


ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$


According to the explanation by Austin, foo.txt at this point somehow occupies 
2000 bytes of space, because I can reflink it and then write another 1000 bytes 
of data into it without losing the 1000 bytes I already have or running out of 
drive space. (Or is it only true while there are open file handles?)

OK, I think there may be some misunderstanding here.  By 'CoW unwritten 
extents', I mean that when we write to the extent, a CoW operation 
happens, instead of the data being written directly into the extent.  In 
this case, it has nothing to do with reflinking, and Goffredo is correct 
that if your filesystem is small enough, the second fallocate will fail 
there.



Re: Massive loss of disk space

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-03 12:37, Goffredo Baroncelli wrote:

On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:

On 2017-08-02 17:05, Goffredo Baroncelli wrote:

On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:

On 2017-08-02 13:52, Goffredo Baroncelli wrote:

Hi,


[...]


consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeeds [1]: i.e. there is enough 
space on the filesystem. Due to the CoW nature of BTRFS, you cannot rely on the 
already allocated space, because there could be a small time window where both 
the old and the new data exist on disk.



There is also an expectation based on pretty much every other FS in existence 
that calling fallocate() on a range that is already in use is a (possibly 
expensive) no-op, and by extension using fallocate() with an offset of 0 like a 
ftruncate() call will succeed as long as the new size will fit.


The man page of fallocate doesn't guarantee that.

Unfortunately in a COW filesystem the assumption that an allocated area may be 
simply overwritten is not true.

Let me say it with other words: as a general rule, if you want to _write_ 
something in a COW filesystem, you need space. It doesn't matter whether you are 
*over-writing* existing data or *appending* to a file.

Yes, you need space, but you don't need _all_ the space.  For a file that 
already has data in it, you only _need_ as much space as the largest chunk of 
data that can be written at once at a low level, because the moment that first 
write finishes, the space that was used in the file for that region is freed, 
and the next write can go there.  Put a bit differently, you only need to 
allocate what isn't allocated in the region, and then a bit more to handle the 
initial write to the file.

Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a CoW 
filesystem _does not_ need to behave the way BTRFS does.


It seems that ZFS on Linux doesn't support fallocate

see https://github.com/zfsonlinux/zfs/issues/326

So I think that you are referring to posix_fallocate and ZFS on Solaris, 
which I can't test, so I can't comment.

Both Solaris, and FreeBSD (I've got a FreeNAS system at work i checked on).

That said, I'm starting to wonder if just failing fallocate() calls to 
allocate space is actually the right thing to do here after all.  Aside 
from this, we don't reserve metadata space for checksums and similar 
things for the eventual writes (so it's possible to get -ENOSPC on a 
write to an fallocate'ed region anyway because of metadata exhaustion), 
and splitting extents can also cause it to fail, so it's perfectly 
possible for the fallocate assumption to not hold on BTRFS.  The irony 
of this is that if you're in a situation where you actually need to 
reserve space, you're more likely to fail (because if you actually 
_need_ to reserve the space, your filesystem may already be mostly full, 
and therefore any of the above issues may occur).


On the specific note of splitting extents, the following will probably 
fail on BTRFS as well when done with a large enough FS (the turnover 
point ends up being the point at which 256MiB isn't enough space to 
account for all the extents), but will succeed with :
1. Create the filesystem and mount it.  On BTRFS, make sure autodefrag is 
off (this makes it fail more reliably, but is not essential for it to fail).
2. Use fallocate to allocate as large a file as possible (in the BTRFS 
case, try for the size of the filesystem minus 544MiB (512MiB for the 
metadata chunk, 32MiB for the system chunk)).
3. Write half the file using 1MB blocks, skipping 1MB of space 
between each block (so every other 1MB of space is actually written to).

4. Write the other half of the file by filling in the holes.

The net effect of this is to split the single large fallocat'ed extent 
into a very large number of 1MB extents, which in turn eats up lots of 
metadata space and will eventually exhaust it.  While this specific 
exercise requires a large filesystem, more generic real world situations 
exist where this can happen (and I have had this happen before).
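A rough sketch of steps 2-4 above (sizes and path are placeholders; a real 
test would size the file close to the filesystem):

fallocate -l 8G /mnt/test/big
# pass 1: write every other 1MiB block
for ((i = 0; i < 8192; i += 2)); do
    dd if=/dev/zero of=/mnt/test/big bs=1M count=1 seek=$i conv=notrunc status=none
done
# pass 2: fill in the holes
for ((i = 1; i < 8192; i += 2)); do
    dd if=/dev/zero of=/mnt/test/big bs=1M count=1 seek=$i conv=notrunc status=none
done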


[...]

In terms of a COW filesystem, you need the space of a) + the space of b)

No, that is only required if the entire file needs to be written atomically.  
There is some maximal size atomic write that BTRFS can perform as a single 
operation at a low level (I'm not sure if this is equal to the block size, or 
larger, but it doesn't matter much, either way, I'm talking the largest chunk 
of data it will write to a disk in a single operation before updating metadata 
to point to that new data).


To the best of my knowledge there is only a time limit: IIRC every 30 seconds a 
transaction is closed. If you are able to fill the filesystem in this time 
window you are in trouble.
Even with that, it's still possible to implement the method I outlined 
by defining such a limit and forcing a transaction commit when 

Re: Massive loss of disk space

2017-08-03 Thread Marat Khalili
On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli 
>The file is physically extended
>
>ghigo@venice:/tmp$ fallocate -l 1000 foo.txt

For clarity let's replace the fallocate above with:
$ head -c 1000 /dev/urandom > foo.txt

>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
>ghigo@venice:/tmp$ ls -l foo.txt
>-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
>ghigo@venice:/tmp$

According to the explanation by Austin, foo.txt at this point somehow occupies 
2000 bytes of space, because I can reflink it and then write another 1000 bytes 
of data into it without losing the 1000 bytes I already have or running out of 
drive space. (Or is it only true while there are open file handles?)
-- 

With Best Regards,
Marat Khalili


Re: Massive loss of disk space

2017-08-03 Thread Goffredo Baroncelli
On 2017-08-03 13:39, Austin S. Hemmelgarn wrote:
> On 2017-08-02 17:05, Goffredo Baroncelli wrote:
>> On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:
>>> On 2017-08-02 13:52, Goffredo Baroncelli wrote:
 Hi,

>> [...]
>>
 consider the following scenario:

 a) create a 2GB file
 b) fallocate -o 1GB -l 2GB
 c) write from 1GB to 3GB

 after b), the expectation is that c) always succeed [1]: i.e. there is 
 enough space on the filesystem. Due to the COW nature of BTRFS, you cannot 
 rely on the already allocated space because there could be a small time 
 window where both the old and the new data exists on the disk.
>>
>>> There is also an expectation based on pretty much every other FS in 
>>> existence that calling fallocate() on a range that is already in use is a 
>>> (possibly expensive) no-op, and by extension using fallocate() with an 
>>> offset of 0 like a ftruncate() call will succeed as long as the new size 
>>> will fit.
>>
>> The man page of fallocate doesn't guarantee that.
>>
>> Unfortunately in a COW filesystem the assumption that an allocate area may 
>> be simply overwritten is not true.
>>
>> Let me to say it with others words: as general rule if you want to _write_ 
>> something in a cow filesystem, you need space. Doesn't matter if you are 
>> *over-writing* existing data or you are *appending* to a file.
> Yes, you need space, but you don't need _all_ the space.  For a file that 
> already has data in it, you only _need_ as much space as the largest chunk of 
> data that can be written at once at a low level, because the moment that 
> first write finishes, the space that was used in the file for that region is 
> freed, and the next write can go there.  Put a bit differently, you only need 
> to allocate what isn't allocated in the region, and then a bit more to handle 
> the initial write to the file.
> 
> Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that a 
> CoW filesystem _does not_ need to behave like BTRFS is.

It seems that ZFS on Linux doesn't support fallocate

see https://github.com/zfsonlinux/zfs/issues/326

So I think that you are referring to posix_fallocate and ZFS on Solaris, 
which I can't test, so I can't comment.

[...]
>> In terms of a COW filesystem, you need the space of a) + the space of b)
> No, that is only required if the entire file needs to be written atomically.  
> There is some maximal size atomic write that BTRFS can perform as a single 
> operation at a low level (I'm not sure if this is equal to the block size, or 
> larger, but it doesn't matter much, either way, I'm talking the largest chunk 
> of data it will write to a disk in a single operation before updating 
> metadata to point to that new data). 

To the best of my knowledge there is only a time limit: IIRC every 30 seconds a 
transaction is closed. If you are able to fill the filesystem in this time 
window you are in trouble.

[...]
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: Massive loss of disk space

2017-08-03 Thread Goffredo Baroncelli
On 2017-08-03 13:44, Marat Khalili wrote:
> On 02/08/17 20:52, Goffredo Baroncelli wrote:
>> consider the following scenario:
>>
>> a) create a 2GB file
>> b) fallocate -o 1GB -l 2GB
>> c) write from 1GB to 3GB
>>
>> after b), the expectation is that c) always succeed [1]: i.e. there is 
>> enough space on the filesystem. Due to the COW nature of BTRFS, you cannot 
>> rely on the already allocated space because there could be a small time 
>> window where both the old and the new data exists on the disk.
> Just curious. With current implementation, in the following case:
> a) create a 2GB file1 && create a 2GB file2
> b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2

At this step you are trying to allocate 3GB + 3GB = 6GB, so you have exhausted the 
filesystem space.

> c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
> will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or 
> does it only allocate additional 1GB and check free space for another 1GB? If 
> it's only the latter, it is useless.
The file is physically extended

ghigo@venice:/tmp$ fallocate -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1000 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt
ghigo@venice:/tmp$ ls -l foo.txt
-rw-r--r-- 1 ghigo ghigo 1500 Aug  3 18:00 foo.txt
ghigo@venice:/tmp$

> 
> -- 
> 
> With Best Regards,
> Marat Khalili
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range

2017-08-03 Thread Chris Mason



On 08/03/2017 11:25 AM, Wang Shilong wrote:

On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason  wrote:



On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:


From: Filipe Manana 

If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana 



Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong 
Date:   Thu Jul 3 18:22:07 2014 +0800

 Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.


This blame falls on me, thanks for fixing it.


No blame ;)  I'll take code that works any day.

-chris


Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range

2017-08-03 Thread Wang Shilong
On Thu, Aug 3, 2017 at 11:00 PM, Chris Mason  wrote:
>
>
> On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:
>>
>> From: Filipe Manana 
>>
>> If the range being cleared was not marked for defrag and we are not
>> about to clear the range from the defrag status, we don't need to
>> lock and unlock the inode.
>>
>> Signed-off-by: Filipe Manana 
>
>
> Thanks Filipe, looks like it goes all the way back to:
>
> commit 47059d930f0e002ff851beea87d738146804726d
> Author: Wang Shilong 
> Date:   Thu Jul 3 18:22:07 2014 +0800
>
> Btrfs: make defragment work with nodatacow option
>
> I can't see how the inode lock is required here.

This blame falls on me, thanks for fixing it.

Reviewed-by: Wang Shilong 

>
> Reviewed-by: Chris Mason 
>
> -chris
>
>> ---
>>   fs/btrfs/inode.c | 7 ---
>>   1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
>> index eb495e956d53..51c45c0a8553 100644
>> --- a/fs/btrfs/inode.c
>> +++ b/fs/btrfs/inode.c
>> @@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void
>> *private_data,
>> u64 len = state->end + 1 - state->start;
>> u32 num_extents = count_max_extents(len);
>> -   spin_lock(&inode->lock);
>> -   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
>> +   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) {
>> +   spin_lock(&inode->lock);
>> inode->defrag_bytes -= len;
>> -   spin_unlock(&inode->lock);
>> +   spin_unlock(&inode->lock);
>> +   }
>> /*
>>  * set_bit and clear bit hooks normally require _irqsave/restore
>>


Re: [PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio

2017-08-03 Thread Chris Mason



On 08/03/2017 08:44 AM, Nikolay Borisov wrote:

Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.


I like it, thanks.

Reviewed-by: Chris Mason 

-chris


Re: [PATCH] Btrfs: avoid unnecessarily locking inode when clearing a range

2017-08-03 Thread Chris Mason



On 07/27/2017 02:52 PM, fdman...@kernel.org wrote:

From: Filipe Manana 

If the range being cleared was not marked for defrag and we are not
about to clear the range from the defrag status, we don't need to
lock and unlock the inode.

Signed-off-by: Filipe Manana 


Thanks Filipe, looks like it goes all the way back to:

commit 47059d930f0e002ff851beea87d738146804726d
Author: Wang Shilong 
Date:   Thu Jul 3 18:22:07 2014 +0800

Btrfs: make defragment work with nodatacow option

I can't see how the inode lock is required here.

Reviewed-by: Chris Mason 

-chris


---
  fs/btrfs/inode.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index eb495e956d53..51c45c0a8553 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1797,10 +1797,11 @@ static void btrfs_clear_bit_hook(void *private_data,
u64 len = state->end + 1 - state->start;
u32 num_extents = count_max_extents(len);
  
-	spin_lock(&inode->lock);
-   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG))
+   if ((state->state & EXTENT_DEFRAG) && (*bits & EXTENT_DEFRAG)) {
+   spin_lock(&inode->lock);
inode->defrag_bytes -= len;
-   spin_unlock(&inode->lock);
+   spin_unlock(&inode->lock);
+   }
  
  	/*

 * set_bit and clear bit hooks normally require _irqsave/restore




[PATCH] btrfs: Move skip checksum check from btrfs_submit_direct to __btrfs_submit_dio_bio

2017-08-03 Thread Nikolay Borisov
Currently the code checks whether we should do data checksumming in
btrfs_submit_direct and the boolean result of this check is passed to
btrfs_submit_direct_hook, in turn passing it to __btrfs_submit_dio_bio which
actually consumes it. The last function actually has all the necessary context
to figure out whether to skip the check or not, so let's move the check closer
to where it's being consumed. No functional changes.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/inode.c | 18 ++
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5e48d2c10152..a8bd0f951454 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8424,8 +8424,7 @@ static inline blk_status_t 
btrfs_lookup_and_bind_dio_csum(struct inode *inode,
 }
 
 static inline int __btrfs_submit_dio_bio(struct bio *bio, struct inode *inode,
-u64 file_offset, int skip_sum,
-int async_submit)
+u64 file_offset, int async_submit)
 {
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct btrfs_dio_private *dip = bio->bi_private;
@@ -8443,7 +8442,7 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, 
struct inode *inode,
goto err;
}
 
-   if (skip_sum)
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
goto map;
 
if (write && async_submit) {
@@ -8473,8 +8472,7 @@ static inline int __btrfs_submit_dio_bio(struct bio *bio, 
struct inode *inode,
return ret;
 }
 
-static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip,
-   int skip_sum)
+static int btrfs_submit_direct_hook(struct btrfs_dio_private *dip)
 {
struct inode *inode = dip->inode;
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -8537,7 +8535,7 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
 */
atomic_inc(&dip->pending_bios);
 
-   ret = __btrfs_submit_dio_bio(bio, inode, file_offset, skip_sum,
+   ret = __btrfs_submit_dio_bio(bio, inode, file_offset,
 async_submit);
if (ret) {
bio_put(bio);
@@ -8557,8 +8555,7 @@ static int btrfs_submit_direct_hook(struct 
btrfs_dio_private *dip,
} while (submit_len > 0);
 
 submit:
-   ret = __btrfs_submit_dio_bio(bio, inode, file_offset, skip_sum,
-async_submit);
+   ret = __btrfs_submit_dio_bio(bio, inode, file_offset, async_submit);
if (!ret)
return 0;
 
@@ -8583,12 +8580,9 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
struct btrfs_dio_private *dip = NULL;
struct bio *bio = NULL;
struct btrfs_io_bio *io_bio;
-   int skip_sum;
bool write = (bio_op(dio_bio) == REQ_OP_WRITE);
int ret = 0;
 
-   skip_sum = BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM;
-
bio = btrfs_bio_clone(dio_bio);
 
dip = kzalloc(sizeof(*dip), GFP_NOFS);
@@ -8631,7 +8625,7 @@ static void btrfs_submit_direct(struct bio *dio_bio, 
struct inode *inode,
dio_data->unsubmitted_oe_range_end;
}
 
-   ret = btrfs_submit_direct_hook(dip, skip_sum);
+   ret = btrfs_submit_direct_hook(dip);
if (!ret)
return;
 
-- 
2.7.4



Re: [PATCH] Btrfs: search parity device wisely

2017-08-03 Thread kbuild test robot
Hi Liu,

[auto build test ERROR on v4.13-rc3]
[also build test ERROR on next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-search-parity-device-wisely/20170803-193103
config: xtensa-allmodconfig (attached as .config)
compiler: xtensa-linux-gcc (GCC) 4.9.0
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=xtensa 

All errors (new ones prefixed by >>):

   fs/btrfs/raid56.c: In function 'raid56_parity_alloc_scrub_rbio':
>> fs/btrfs/raid56.c:2232:15: error: 'struct btrfs_raid_bio' has no member 
>> named 'data_stripes'
 for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
  ^

vim +2232 fs/btrfs/raid56.c

  2201  
  2202  /*
  2203   * The following code is used to scrub/replace the parity stripe
  2204   *
  2205   * Caller must have already increased bio_counter for getting @bbio.
  2206   *
  2207   * Note: We need make sure all the pages that add into the scrub/replace
  2208   * raid bio are correct and not be changed during the scrub/replace. 
That
  2209   * is those pages just hold metadata or file data with checksum.
  2210   */
  2211  
  2212  struct btrfs_raid_bio *
  2213  raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct 
bio *bio,
  2214 struct btrfs_bio *bbio, u64 stripe_len,
  2215 struct btrfs_device *scrub_dev,
  2216 unsigned long *dbitmap, int 
stripe_nsectors)
  2217  {
  2218  struct btrfs_raid_bio *rbio;
  2219  int i;
  2220  
  2221  rbio = alloc_rbio(fs_info, bbio, stripe_len);
  2222  if (IS_ERR(rbio))
  2223  return NULL;
  2224  bio_list_add(&rbio->bio_list, bio);
  2225  /*
  2226   * This is a special bio which is used to hold the completion 
handler
  2227   * and make the scrub rbio is similar to the other types
  2228   */
  2229  ASSERT(!bio->bi_iter.bi_size);
  2230  rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
  2231  
> 2232  for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
  2233  if (bbio->stripes[i].dev == scrub_dev) {
  2234  rbio->scrubp = i;
  2235  break;
  2236  }
  2237  }
  2238  ASSERT(i < rbio->real_stripes);
  2239  
  2240  /* Now we just support the sectorsize equals to page size */
  2241  ASSERT(fs_info->sectorsize == PAGE_SIZE);
  2242  ASSERT(rbio->stripe_npages == stripe_nsectors);
  2243  bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
  2244  
  2245  /*
  2246   * We have already increased bio_counter when getting bbio, 
record it
  2247   * so we can free it at rbio_orig_end_io().
  2248   */
  2249  rbio->generic_bio_cnt = 1;
  2250  
  2251  return rbio;
  2252  }
  2253  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] Btrfs: search parity device wisely

2017-08-03 Thread kbuild test robot
Hi Liu,

[auto build test ERROR on v4.13-rc3]
[also build test ERROR on next-20170803]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Liu-Bo/Btrfs-search-parity-device-wisely/20170803-193103
config: x86_64-randconfig-x007-201731 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/raid56.c: In function 'raid56_parity_alloc_scrub_rbio':
>> fs/btrfs/raid56.c:2232:15: error: 'struct btrfs_raid_bio' has no member named 'data_stripes'; did you mean 'real_stripes'?
     for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
                    ^~~~~~~~~~~~

vim +2232 fs/btrfs/raid56.c

  2201  
  2202  /*
  2203   * The following code is used to scrub/replace the parity stripe
  2204   *
  2205   * Caller must have already increased bio_counter for getting @bbio.
  2206   *
  2207   * Note: We need make sure all the pages that add into the scrub/replace
  2208   * raid bio are correct and not be changed during the scrub/replace. That
  2209   * is those pages just hold metadata or file data with checksum.
  2210   */
  2211  
  2212  struct btrfs_raid_bio *
  2213  raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info *fs_info, struct bio *bio,
  2214                                 struct btrfs_bio *bbio, u64 stripe_len,
  2215                                 struct btrfs_device *scrub_dev,
  2216                                 unsigned long *dbitmap, int stripe_nsectors)
  2217  {
  2218          struct btrfs_raid_bio *rbio;
  2219          int i;
  2220  
  2221          rbio = alloc_rbio(fs_info, bbio, stripe_len);
  2222          if (IS_ERR(rbio))
  2223                  return NULL;
  2224          bio_list_add(&rbio->bio_list, bio);
  2225          /*
  2226           * This is a special bio which is used to hold the completion handler
  2227           * and make the scrub rbio is similar to the other types
  2228           */
  2229          ASSERT(!bio->bi_iter.bi_size);
  2230          rbio->operation = BTRFS_RBIO_PARITY_SCRUB;
  2231  
> 2232          for (i = rbio->data_stripes; i < rbio->real_stripes; i++) {
  2233                  if (bbio->stripes[i].dev == scrub_dev) {
  2234                          rbio->scrubp = i;
  2235                          break;
  2236                  }
  2237          }
  2238          ASSERT(i < rbio->real_stripes);
  2239  
  2240          /* Now we just support the sectorsize equals to page size */
  2241          ASSERT(fs_info->sectorsize == PAGE_SIZE);
  2242          ASSERT(rbio->stripe_npages == stripe_nsectors);
  2243          bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
  2244  
  2245          /*
  2246           * We have already increased bio_counter when getting bbio, record it
  2247           * so we can free it at rbio_orig_end_io().
  2248           */
  2249          rbio->generic_bio_cnt = 1;
  2250  
  2251          return rbio;
  2252  }
  2253  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
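
The missing member suggests the patch assumes a data_stripes field that this
tree simply does not have; the compiler's hint points at real_stripes instead.
The idea the loop is going for (only the stripe slots after the data stripes
can hold parity, so only those need to be searched for the scrub device) can be
modelled in plain user space as in the sketch below. Everything in it is an
illustrative stand-in: the struct, field names and devids are made up and are
not the real btrfs definitions.

#include <stdio.h>

struct fake_stripe {
	int devid;
};

struct fake_rbio {
	struct fake_stripe stripes[4];
	int nr_data;		/* number of data stripes, kept in the low slots */
	int real_stripes;	/* data stripes + parity stripes */
	int scrubp;		/* index of the stripe being scrubbed */
};

/* Search only the parity slots (nr_data .. real_stripes-1) for scrub_devid. */
static int find_scrub_stripe(struct fake_rbio *rbio, int scrub_devid)
{
	int i;

	for (i = rbio->nr_data; i < rbio->real_stripes; i++) {
		if (rbio->stripes[i].devid == scrub_devid) {
			rbio->scrubp = i;
			return i;
		}
	}
	return -1;
}

int main(void)
{
	struct fake_rbio rbio = {
		.stripes = { { 1 }, { 2 }, { 3 }, { 4 } },	/* 2 data + 2 parity devids */
		.nr_data = 2,
		.real_stripes = 4,
		.scrubp = -1,
	};

	printf("scrub stripe index: %d\n", find_scrub_stripe(&rbio, 4));
	return 0;
}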


Re: Massive loss of disk space

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-03 07:44, Marat Khalili wrote:

On 02/08/17 20:52, Goffredo Baroncelli wrote:

consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeed [1]: i.e. there is 
enough space on the filesystem. Due to the COW nature of BTRFS, you 
cannot rely on the already allocated space because there could be a 
small time window where both the old and the new data exists on the disk.

Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per 
file, or does it only allocate an additional 1GB and check free space for 
another 1GB? If it's only the latter, it is useless.
It will currently allocate 4GB total in this case (2GB for each file), and 
it _should_ succeed.  I think there are corner cases where it can fail 
because of metadata exhaustion, though, and I'm still not certain we 
don't CoW unwritten extents (if we do CoW unwritten extents, then this, 
and fallocate allocation in general, becomes non-deterministic 
as to whether or not it succeeds).
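
For anyone who wants to poke at this directly, here is a minimal user-space
sketch of the a)/b)/c) sequence above for both files. The file names, the 1MiB
write granularity and the error handling are arbitrary illustrative choices,
not anything btrfs-specific; run it on a filesystem with only a few GB free and
watch where ENOSPC shows up.

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define GiB (1024LL * 1024 * 1024)

static char buf[1 << 20];	/* 1MiB of dummy data per write */

/* Write real data into [start, end) so the extents are actually allocated. */
static int fill(int fd, off_t start, off_t end)
{
	off_t off;

	for (off = start; off < end; off += sizeof(buf))
		if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
			return perror("pwrite"), -1;
	return 0;
}

int main(void)
{
	int fd1, fd2;

	memset(buf, 'x', sizeof(buf));

	fd1 = open("file1", O_CREAT | O_RDWR, 0644);
	fd2 = open("file2", O_CREAT | O_RDWR, 0644);
	if (fd1 < 0 || fd2 < 0)
		return perror("open"), 1;

	/* a) create two 2GB files with real data in them */
	if (fill(fd1, 0, 2 * GiB) || fill(fd2, 0, 2 * GiB))
		return 1;

	/* b) fallocate -o 1GB -l 2GB on both */
	if (fallocate(fd1, 0, 1 * GiB, 2 * GiB) ||
	    fallocate(fd2, 0, 1 * GiB, 2 * GiB))
		return perror("fallocate"), 1;

	/* c) write from 1GB to 3GB in both */
	if (fill(fd1, 1 * GiB, 3 * GiB) || fill(fd2, 1 * GiB, 3 * GiB))
		return 1;

	puts("both files made it through a), b) and c)");
	return 0;
}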




Re: Massive loss of disk space

2017-08-03 Thread Marat Khalili

On 02/08/17 20:52, Goffredo Baroncelli wrote:

consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeed [1]: i.e. there is enough 
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the 
already allocated space because there could be a small time window where both 
the old and the new data exists on the disk.

Just curious. With current implementation, in the following case:
a) create a 2GB file1 && create a 2GB file2
b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2
c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2
will (c) always succeed? I.e. does fallocate really allocate 2GB per 
file, or does it only allocate an additional 1GB and check free space for 
another 1GB? If it's only the latter, it is useless.


--

With Best Regards,
Marat Khalili



Re: Massive loss of disk space

2017-08-03 Thread Austin S. Hemmelgarn

On 2017-08-02 17:05, Goffredo Baroncelli wrote:

On 2017-08-02 21:10, Austin S. Hemmelgarn wrote:

On 2017-08-02 13:52, Goffredo Baroncelli wrote:

Hi,


[...]


consider the following scenario:

a) create a 2GB file
b) fallocate -o 1GB -l 2GB
c) write from 1GB to 3GB

after b), the expectation is that c) always succeed [1]: i.e. there is enough 
space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the 
already allocated space because there could be a small time window where both 
the old and the new data exists on the disk.



There is also an expectation based on pretty much every other FS in existence 
that calling fallocate() on a range that is already in use is a (possibly 
expensive) no-op, and by extension using fallocate() with an offset of 0 like a 
ftruncate() call will succeed as long as the new size will fit.


The man page of fallocate doesn't guarantee that.

Unfortunately, in a COW filesystem the assumption that an already allocated 
area can simply be overwritten is not true.

Let me say it in other words: as a general rule, if you want to _write_ 
something in a COW filesystem, you need space. It doesn't matter whether you 
are *over-writing* existing data or *appending* to a file.
Yes, you need space, but you don't need _all_ the space.  For a file 
that already has data in it, you only _need_ as much space as the 
largest chunk of data that can be written at once at a low level, 
because the moment that first write finishes, the space that was used in 
the file for that region is freed, and the next write can go there.  Put 
a bit differently, you only need to allocate what isn't allocated in the 
region, and then a bit more to handle the initial write to the file.
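
As a made-up worked example of that point (the 128MiB figure is purely
illustrative, not a real btrfs constant):

    region to rewrite in place      : 2 GiB, already fully allocated
    largest single low-level write  : 128 MiB                 (assumed)
    free space needed up front      : ~128 MiB plus metadata, not 2 GiB

because as soon as the first 128MiB of new data lands, the old extents it
replaced are free again and the next 128MiB can go there.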


Also, as I said below, _THIS WORKS ON ZFS_.  That immediately means that 
a CoW filesystem _does not_ need to behave the way BTRFS does.





I've checked JFS, XFS, ext4, vfat, NTFS (via NTFS-3G, not the kernel driver), 
NILFS2, OCFS2 (local mode only), F2FS, UFS, and HFS+ on Linux, UFS and HFS+ on 
OS X, UFS and ZFS on FreeBSD, FFS (UFS with a different name) and LFS (log 
structured) on NetBSD, and UFS and ZFS on Solaris, and VxFS on HP-UX, and _all_ 
of them behave correctly here and succeed with the test I listed, while BTRFS 
does not.  This isn't codified in POSIX, but it's also not something that is 
listed as implementation defined, which in turn means that we should be trying 
to match the other implementations.


[...]





My opinion is that in general this behavior is correct due to the COW nature of 
BTRFS.
The only exception that I can find is the "nocow" file. For those cases, 
taking the already allocated space into account would be better.

There are other, saner ways to make that expectation hold though, and I'm not 
even certain that it does as things are implemented (I believe we still CoW 
unwritten extents when data is written to them, because I _have_ had writes to 
fallocate'ed files fail on BTRFS before with -ENOSPC).

The ideal situation IMO is as follows:

1. This particular case (using fallocate() with an offset of 0 to extend a file 
that is already larger than half the remaining free space on the FS) _should_ 
succeed.


This description is not accurate. What happened is the following:
1) you have a file *with valid data*
2) you want to prepare an update of this file and want to be sure to have 
enough space
Except this is not the common case.  Most filesystems aren't CoW, so 
calling fallocate() like this is generally not 'ensuring you have enough 
space', it's 'ensuring the file isn't sparse, and we can write to the 
extra area beyond the end we care about'.
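
In code, the pattern in question looks roughly like the sketch below (the
"data.bin" name, the 64MiB growth and the error handling are placeholders).
On most filesystems the fallocate() call only has to find space for the tail
past the current size; the complaint in this thread is that btrfs can instead
demand space for the whole range and fail with ENOSPC even though most of that
range is already occupied by the file's own data.

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("data.bin", O_RDWR);	/* an existing file with real data */
	off_t old_size, new_size;

	if (fd < 0)
		return perror("open"), 1;

	old_size = lseek(fd, 0, SEEK_END);
	new_size = old_size + (64LL << 20);	/* grow by 64MiB, arbitrary */

	/*
	 * Offset 0, length new_size: "make sure everything up to new_size is
	 * allocated", i.e. the ftruncate-like extension discussed above.
	 */
	if (fallocate(fd, 0, 0, new_size))
		perror("fallocate");
	else
		printf("extended to %lld bytes\n", (long long)new_size);

	close(fd);
	return 0;
}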


at this point fallocate has to guarantee:
a) you have your old data still available
b) you have allocated the space for the update

In terms of a COW filesystem, you need the space of a) + the space of b)
No, that is only required if the entire file needs to be written 
atomically.  There is some maximal size atomic write that BTRFS can 
perform as a single operation at a low level (I'm not sure whether this is 
equal to the block size or larger, but it doesn't matter much; either 
way, I mean the largest chunk of data it will write to a disk in a 
single operation before updating metadata to point to that new data). 
If your total size (original data plus the new space) is less than this 
maximal atomic write size, then the above is true, but if it is larger, 
you only need to allocate space for regions of the fallocate() range 
that aren't already allocated, plus space to accommodate at least one 
write of this maximal atomic write size.  Any space beyond that just 
ends up minimizing the degree of fragmentation introduced by allocation.
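
A sketch of that rolling rewrite, assuming (as described above) that the
filesystem frees the replaced extents once each chunk is safely on disk. The
128MiB chunk size, the 1MiB writes, the fdatasync() per chunk and the "big.bin"
name are illustrative choices, not something btrfs requires:

#define _GNU_SOURCE
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (128LL << 20)	/* assumed low-level "window" per pass */

/* Rewrite [0, len) of an already-allocated file in bounded-size chunks. */
static int rewrite_in_chunks(int fd, off_t len)
{
	static char buf[1 << 20];	/* 1MiB writes inside each chunk */
	off_t chunk, off;

	memset(buf, 'y', sizeof(buf));
	for (chunk = 0; chunk < len; chunk += CHUNK) {
		off_t end = chunk + CHUNK < len ? chunk + CHUNK : len;

		for (off = chunk; off < end; off += sizeof(buf))
			if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t)sizeof(buf))
				return perror("pwrite"), -1;
		/*
		 * Once this chunk is committed, the old extents it replaced
		 * can be freed, so the next chunk only needs about CHUNK of
		 * free space rather than space for the whole file.
		 */
		if (fdatasync(fd))
			return perror("fdatasync"), -1;
	}
	return 0;
}

int main(void)
{
	int fd = open("big.bin", O_RDWR);	/* placeholder: existing large file */

	if (fd < 0)
		return perror("open"), 1;
	if (rewrite_in_chunks(fd, lseek(fd, 0, SEEK_END)))
		return 1;
	close(fd);
	return 0;
}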


The methodology that allows this is really simple.  When you start to 
write data to the file, the first part of the write goes into the newly 
allocated space, and the original region covered by that write gets 
freed.  You can then write into the space that was just freed and 

Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-03 Thread Lutz Vieweg

On 08/03/2017 12:22 AM, Chris Murphy wrote:

Also more interesting is this Stratis project that started up a few months ago:
https://github.com/stratis-storage/stratisd

Which also includes this design document:
https://stratis-storage.github.io/StratisSoftwareDesign.pdf


This concept, if successfully implemented, does not seem to achieve
anything beyond "hide the complexity of its implementation from the user".

No actual new functionality, no reason to assume any additional robustness
or stability, and certainly not a new filesystem,  just yet-another-wrapper.

Keeping users from understanding the complexity of a storage system
they use is a benefit only in the most trivial use cases.

And I find it symptomatic that the section "D-Bus Access Control" in
StratisSoftwareDesign.pdf is empty.


So it's going to use existing device mapper, md, some LVM
stuff, XFS


That is the only part of the Stratis concept that looks reasonable to me.



