Re: [PATCH 0/7] Let user specify the kernel version for features

2015-11-26 Thread Anand Jain


Hope we are in sync on the following:

1.
The term 'auto' that you are using here refers to
 'progs default features being updated at _run time_'.

2.
In the long run, it would mostly be:
  progs-version > LTS-kernel-version
(for the reason that users would need fsck, tools, etc.)



With the new -O comp= option, the concern for users who want to make a
btrfs for a newer kernel is hugely reduced.


No! Actually, the new -O comp= option leaves no concern for users who
want to create _a btrfs disk layout which is compatible with more
than one kernel_.  There are two examples of that above.


Why can't you give a higher kernel version than the current kernel?


  Mount fails.  Please try it!


But that's what the user wants to do. He/she knows what they are doing.
Maybe they are running the btrfs-progs self tests without needing to mount
it (at least some of the tests don't require mounting).


 right. It will continue to fail even with this patch set.



Now we need to auto-align features with the kernel; who knows, one day we
will need to auto-align our libs to upstream packages?


 Align libs to upstream packages? Is there any example you could provide?



Keeping a matrix of different packages like libuuid/acl/attr with
different Makefiles?
At least to me this is not a good idea, and that's the job of
autoconf, IIRC.

And if I were a packager facing such a problem, I'd choose the simplest
solution: just add a line in the PKGBUILD (the packaging system of Arch
Linux) for btrfs.
--
depends=('linux>=3.14')
--
(Yeah, such a simple and slick packaging solution is the reason I like
Arch over other rolling distributions.)

Not everything really needs to be done at the code level.


 As we are handling default features at run time, how is
 this relevant in this context?

Thanks, Anand




Re: [PATCH 0/7] Let user specify the kernel version for features

2015-11-26 Thread Qu Wenruo



Anand Jain wrote on 2015/11/27 06:17 +0800:


Hope we are in sync on the following:

1.
The term 'auto' that you are using here refers to
  'progs default features being updated at _run time_'.


Yes.



2.
In the long run, it would mostly be:
   progs-version > LTS-kernel-version
(for the reason that users would need fsck, tools, etc.)


Also true.

But mkfs default features won't change during one or two LTS kernels.





With the new -O comp= option, the concern for users who want to make a
btrfs for a newer kernel is hugely reduced.


No! Actually, the new -O comp= option leaves no concern for users who
want to create _a btrfs disk layout which is compatible with more
than one kernel_.  There are two examples of that above.


Why can't you give a higher kernel version than the current kernel?


  Mount fails.  Please try it!


But that's what the user wants to do. He/she knows what they are doing.
Maybe they are running the btrfs-progs self tests without needing to mount
it (at least some of the tests don't require mounting).


  right. It will continue to fail even with this patch set.



Now we need to auto-align features with the kernel; who knows, one day we
will need to auto-align our libs to upstream packages?


  Align libs to upstream packages? Is there any example you could provide?


IIRC, in the ancient days libblkid was still included in e2fsprogs, and
its API was different from the current one.


Will we need to support that one with different blkid calls?





Keeping a matrix of different packages like libuuid/acl/attr with
different Makefiles?
At least to me this is not a good idea, and that's the job of
autoconf, IIRC.

And if I were a packager facing such a problem, I'd choose the simplest
solution: just add a line in the PKGBUILD (the packaging system of Arch
Linux) for btrfs.
--
depends=('linux>=3.14')
--
(Yeah, such a simple and slick packaging solution is the reason I like
Arch over other rolling distributions.)

Not everything really needs to be done at the code level.


  As we are handling default features at run time, how is
  this relevant in this context?


I meant, it can be done at the packaging level and it's much easier to do.
One dependency line vs. nearly 200 lines of code.
And it's much more predictable than version-based detection.

Thanks,
Qu



Thanks, Anand









Re: subvols and parents - how?

2015-11-26 Thread Duncan
Christoph Anton Mitterer posted on Tue, 24 Nov 2015 22:25:50 +0100 as
excerpted:

>> Suppose you only want to rollback /, because some update screwed you
>> up,
>> but not /home, which is fine.  If /home is a nested subvolume, then
>> you're now mounting the nested home subvolume from some other nesting
>> tree entirely,
> That's a bit unclear to me,... I thought when I make a snapshot, any
> nested subvols wouldn't be snapshotted and thus be empty dirs.
> So I'd have rather thought that I would simply have no /home (if I didn't
> move it to the rolled-back subvol manually)

What I was intending to convey, but apparently failed to make quite clear
enough, is this. Suppose:

5
|
+-+ subvols (dir)
  |
  +-+ root (subvol)
  | |
  | + home (nested subvol)
  |
  +-+ snaps-2015.0901 (dir)
    |
    +-+ root-2015.0901 (subvol)


As long as you're on the working /, then /home is a nested subvol, and
you don't have to mount it to access it, tho you can if you want.

But now, you roll back to snaps-2015.0901/root-2015.0901.

It won't have /home nested underneath, as you correctly pointed out, but
in order to access it, you now MUST mount /home, which...

#1 could be a pain to setup if you weren't actually mounting it 
previously, just relying on the nested tree, AND...

#2 The point I was trying to make: now, to mount it, you'll be mounting not
a native nested subvol, and not a directly available sibling
5/subvols/home, but actually reaching into an entirely different nesting
structure to grab something down inside it, mounting the
5/subvols/root/home subvolume nested down inside the direct
5/subvols/root sibling subvol.
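
(For illustration, a rough sketch of the mounts that implies; the device
name and mount points are placeholders, not from any real setup:)

  # mount the rolled-back root from the snapshot tree...
  mount -o subvol=subvols/snaps-2015.0901/root-2015.0901 /dev/sdX /mnt/newroot
  # ...then reach into the *other* nesting structure for home
  mount -o subvol=subvols/root/home /dev/sdX /mnt/newroot/home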

With just one level of nesting and one additional mount, it's not too 
hard to keep track of, but if you're dealing with four or five levels of 
subvol nesting and some of them you're mounting the working head copy 
while others you're rolling back, it could get difficult to keep straight 
in your head what's going on.

Consider a layout like this:

5
+-+ subvols (dir)
  |
  +-+ root (subvol)
  | |
  | +-+ home (subvol)
  | | |
  | | +-+ henry (dir, no subvol for henry)
  | | |
  | | +-+ fred (subvol)
  | | | |
  | | | +-+ vms (subvol)
  | | | 
  | | +-+ betty (subvol)
  | |
  | +-+ svr (subvol)
  |   |
  |   +-+ vms (subvol)
  |
  +-+ snaps-2015.0901 (dir)
    |
    +-+ root-2015.0901 (subvol here and below)
    |
    +-+ home-2015.0901
    |
    +-+ fred-2015.0901
    |
    +-+ fred-vms-2015.0901
    |
    +-+ betty-2015.0901
    |
    +-+ svr-2015.0901
    |
    +-+ svr-vms-2015.0901


Now, you were hacked and they encrypted a bunch of stuff, but you were
lucky and caught them before they got everything.  You need to roll back
root but not home; fred is fine, but his vms subvol needs to be rolled back,
betty needs to be rolled back, svr needs to be rolled back, but svr's vms
are fine.

Try to sort THAT out along with the nesting, and keep it all straight
while under the severe pressure of trying to recover from a hack in time
for those svr things to go live for Black Friday in a few hours, a day
when you expect to make as much as you normally do in a month the rest
of the year!  The pressure is on!

Oh, and you weren't actually doing the mounts as you were depending on 
the nested tree, so you have to actually setup the mounts as well, not 
just switch them to point to the appropriate location.

OK, so that's a bit contrived, but encrypt-the-server-for-ransom hacks are
in the news, Black Friday starts in a few hours here, and I think the
point should be obvious! =:^)

(Some years ago, before btrfs, I had something similar setup but with 
partitions.  Disaster struck and I ended up with / from one backup, /usr 
from another, and /var, with the package database of what was installed 
on the other two, from current, or something like that.  Needless to say 
I learned quite some lessons from that, one of which was that everything 
that the package manager installs should be on the same partition with 
the installed-package database, so if it has to be restored from backup, 
at least if it's all old, at least it's all equally old, and the package 
database actually matches what's on the system because it's in the same 
backup!  That partition and btrfs, along with each of its various 
backups, are now 8 GiB each, so it's not like I'll run out of room with 
several levels of backup.  I went mdraid after that, but after an initial 
experiment with lvm on top of the raid, I decided that was too complex to 
deal with in the pressure of a disaster and redid it to multiple raids on 
parallel partitioned hardware.  In a disaster the raid would be bad 
enough to deal with but tolerable, but I did NOT need the complexity of 
lvm on top of raid, and after dealing with the parts of three different 
installs mess, I had the hard-earned wisdom to realize it!)

The same idea applies here.  Once you start reaching into nested subvols 
to get the deeper nested subvols you're trying to mount, it's too much 
and you're just begging 

implications of mixed mode

2015-11-26 Thread Lukas Pirl
Dear list,

if a larger RAID file system (say disk space of 8 TB in total) is
created in mixed mode, what are the implications?

From reading the mailing list and the Wiki, I can think of the following:

+ less hassle with "false positive" ENOSPC
- data and metadata have to have the same replication level
  forever (e.g. RAID 1)
- higher fragmentation
  (does this reduce with no(dir)atime?)
  -> more work for autodefrag

Is that roughly what is to be expected? Any implications on recovery etc.?

In the specific case, the file system usage is as follows:
* data spread over ~20 subvolumes
  * snapshotted with various frequencies
  * compression is used
* mostly archive storage
  * write once
  * read infrequently
* ~500GB of daily rsync'ed system backup

Thanks in advance,

Lukas


Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread Christoph Anton Mitterer
On Fri, 2015-11-27 at 08:40 +0800, Qu Wenruo wrote:
> But since there is no real error
sure... 

> feel free to keep using it or just
> reformat it with skinny-metadata.
That's just ongoing =)

Thanks for all your efforts in that issue =)


Chris.



Re: implications of mixed mode

2015-11-26 Thread Qu Wenruo



Lukas Pirl wrote on 2015/11/27 12:54 +1300:

Dear list,

if a larger RAID file system (say disk space of 8 TB in total) is
created in mixed mode, what are the implications?

 From reading the mailing list and the Wiki, I can think of the following:

+ less hassle with "false positive" ENOSPC


If your "false positive" means unbalanced DATA/METADATA chunk 
allocation, then yes.



- data and metadata have to have the same replication level
   forever (e.g. RAID 1)
- higher fragmentation
   (does this reduce with no(dir)atime?)
   -> more work for autodefrag


They are also true.

And some extra pros and cons due to the fixed (4K) nodesize, small
compared to the 16K default:


+ A little higher performance
  The node/leaf size is restricted to the sectorsize; a smaller node/leaf
  means a smaller range to lock.
  In our SSD tests with highly concurrent operations, the performance is
  overall 10% better than with the 16K nodesize.
  And in extreme metadata-operation cases, like highly concurrent
  sequential writes into small files, it can be 8 times the performance of
  the default 16K nodesize.

- Smaller maximum subvolume size
  Since the tree blocks are smaller but the number of tree levels stays
  the same (level 0 - 7), the upper size limit of a subvolume is hugely
  reduced by the smaller node/leaf size.
  It's quite hard to hit that upper limit, though.

- (Possibly) less developer interest.
  Other developers are trying to remove mixed-bg as a default, so I'd
  expect the trend to be fewer developers focused on mixed-bg.
  And hidden bugs become harder and harder to hit and fix.
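
(For concreteness, a minimal sketch of creating such a filesystem; the
device names and label are placeholders:)

  mkfs.btrfs --mixed -d raid1 -m raid1 -L archive /dev/sdb /dev/sdc
  # with --mixed, data and metadata share the same chunks, so they must use
  # the same profile, and the nodesize is tied to the sectorsize (4K here)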



Is that roughly what is to be expected? Any implications on recovery etc.?


As long as your chunk tree and extent tree are OK, it shouldn't be much
different from a normal fs, at least for now.


Thanks,
Qu


In the specific case, the file system usage is as follows:
* data spread over ~20 subvolumes
   * snapshotted with various frequencies
   * compression is used
* mostly archive storage
   * write once
   * read infrequently
* ~500GB of daily rsync'ed system backup

Thanks in advance,

Lukas







Re: implications of mixed mode

2015-11-26 Thread Duncan
Lukas Pirl posted on Fri, 27 Nov 2015 12:54:57 +1300 as excerpted:

> Dear list,
> 
> if a larger RAID file system (say disk space of 8 TB in total) is
> created in mixed mode, what are the implications?
> 
> From reading the mailing list and the Wiki, I can think of the
> following:
> 
> + less hassle with "false positive" ENOSPC
> - data and metadata have to have the same replication level
>   forever (e.g. RAID 1)
> - higher fragmentation
>   (does this reduce with no(dir)atime?)
> -> more work for autodefrag
> 
> Is that roughly what is to be expected? Any implications on recovery
> etc.?

To the best of my knowledge that looks reasonably accurate.

My big hesitancy would be over the fact that very few will run or test
mixed-mode at the TB-scale filesystem level, and where they do, it's likely
to be in order to work around the current (but set to soon be
eliminated) metadata-only (no data) dup mode limit on single-device,
since in that regard mixed-mode is treated as metadata and dup mode is
allowed.

So you're relatively more likely to run into rarely seen scaling issues 
and perhaps bugs that nobody else has ever run into as (relatively) 
nobody else runs mixed-mode on multi-terabyte-scale btrfs.  If you want 
to be the guinea pig and make it easier for others to try later on, after 
you've flushed out the worst bugs, that's definitely one way to do it.
=:^]

> In the specific case, the file system usage is as follows:
> * data spread over ~20 subvolumes
>   * snapshotted with various frequencies
>   * compression is used
> * mostly archive storage
>   * write once
>   * read infrequently
> * ~500GB of daily rsync'ed system backup

It's worth noting that rsync... seems to stress btrfs more than pretty
much any other common single application.  Its extremely heavy access
pattern just seems to trigger bugs that nothing else does, and while they
do tend to get fixed, it really does seem to push btrfs to the limits, 
and there have been a /lot/ of rsync triggered btrfs bugs reported over 
the years.

Between the stresses of rsyncing half a TiB daily and the relatively 
untested quantity that is mixed-mode btrfs at multi-terabyte scales on 
multi-devices, there's a reasonably high chance that you /will/ be 
working with the devs on various bugs for a while.  If you're willing to
do it, great, somebody putting the filesystem thru those kinds of mixed-
mode paces at that scale is just the sort of thing we need to get 
coverage on that particular not yet well tested corner case, but don't 
expect it to be particularly stable for a couple kernel cycles anyway, 
and after that, you'll still be running a particularly rare corner-case 
that's likely to put new code thru its paces as well, so just be aware of 
the relatively stony path you're signing up to navigate, should you 
choose to go that route.

Meanwhile, assuming you're /not/ deliberately setting out to test a 
rarely tested corner-case with stress tests known to rather too 
frequently get the best of btrfs...

Why are you considering mixed-mode here?  At that size the ENOSPC hassles 
of unmixed-mode btrfs on say single-digit GiB and below really should be 
dwarfed into insignificance, particularly since btrfs since 3.17 or so 
deletes empty chunks instead of letting them build up to the point where 
they're a problem, so what possible reason, other than simply to test it 
and cover that corner-case, could justify mixed-mode at that sort of 
scale?

Unless of course, given that you didn't mention number of devices or 
individual device size, only the 8 TB total, you have in mind a raid of 
something like 1000 8-GB USB sticks, or the like, in which case mixed-
mode on the individual sticks might make some sense (well, to the extent 
that a 1000-device raid of /anything/ makes sense! =:^), given their 8-GB 
each size.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread Qu Wenruo



Christoph Anton Mitterer wrote on 2015/11/26 16:20 +0100:

Hey.

I can confirm that the new patch fixes the issue on both test
filesystems.

Thanks for working that out. I guess there's no longer a need to keep
those old filesystems now?!


Of course, no need to keep it.
But since there is no real error, feel free to keep using it or just
reformat it with skinny-metadata.


Thanks,
Qu



Cheers,
Chris.

On Thu, 2015-11-26 at 15:27 +0100, David Sterba wrote:

On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:

In process_extent_item(), 'metadata' is given an initial value of 0, but
in the non-skinny-metadata case a metadata extent can't be judged just
from the key type, and that case was forgotten.

This causes a lot of false alerts on non-skinny-metadata filesystems.

Fix it by setting the correct metadata value before calling
add_extent_rec().

Reported-by: Christoph Anton Mitterer 
Signed-off-by: Qu Wenruo 


Patch replaced, thanks. The test image is pushed as well.





Re: subvols and parents - how?

2015-11-26 Thread Duncan
Christoph Anton Mitterer posted on Tue, 24 Nov 2015 22:25:50 +0100 as
excerpted:

>> Then there's the security angle to consider.  With the (basically,
>> possibly modified as I suggested) flat layout, mounting something
>> doesn't automatically give people in-tree access to nested subvolumes
>> (subject to normal file permissions, of course), like nested layout
>> does.  And with (possibly modified) flat layout, the whole subvolume
>> tree doesn't need to be mounted all the time either, only when you're
>> actually working with subvolumes.

> Uhm, I don't get the big security advantage here... whether nested or
> manually mounted to a subdir,... if the permissions are insecure I'll
> have a problem... if they're secure, then not.

Consider a setuid-root binary with a recently publicized vuln that's
patched on your system.  But if you have root snapshots from before the patch
and those snapshots are nested below root, then they're always 
accessible.  If the path to the vulnerable setuid is as user accessible 
as it likely was in its original location, then anyone with login access 
to the system is likely to be able to run it from the snapshot... and 
will be able to get root due to the vuln.
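
(A hypothetical illustration; the snapshot path and binary name are
invented, not taken from the thread:)

  # with snapshots nested under /, the old copy stays reachable with no
  # extra mount at all:
  $ ls -l /snapshots/root-pre-patch/usr/bin/some-suid-tool
  -rwsr-xr-x 1 root root 123456 ... some-suid-tool   # still setuid, still vulnerable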

On a flat layout, a snapshot with the vuln would have to be mounted 
before it could be accessed, as otherwise it'd be outside the mounted 
tree.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Duncan
Christoph Anton Mitterer posted on Fri, 27 Nov 2015 01:06:45 +0100 as
excerpted:

> And additionally, allow people to mount subvols with different
> noatime/relatime/atime settings (unless that's already working)... that
> way, they could enable it for things where they want/need it,... and
> disable it where not.

AFAIK, per-subvolume *atime mounts should already be working.  The *atime
mount options are filesystem-generic (aka Linux vfs level), and while my
own use-case doesn't involve subvolumes, the wiki says they should be
working:

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_I_mount_subvolumes_with_different_mount_options.3F

So while personally untested, per-subvolume *atime mount options /should/ 
"just work".

Meanwhile, I've simply grown to hate atime as an inefficient and mostly 
useless drain on resources, so I pretty much just noatime everything, the 
reason I decided to bother patching my kernel to make that the default, 
instead of having yet another option I use everywhere anyway, clogging up 
the options field in my fstab.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [auto-]defrag, nodatacow - general suggestions?(was: btrfs: poor performance on deleting many large files?)

2015-11-26 Thread Duncan
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 01:23:59 +0100 as
excerpted:

> Hey.
> 
> I've worried before about the topics Mitch has raised.
> Some questions.
> 
> 1) AFAIU, the fragmentation problem exists especially for those files
> that see many random writes, especially, but not limited to, big files.
> That databases and VMs are affected by this is probably broadly
> known by now (well, at least by people on this list).
> But I'd guess there are n other cases where such IO patterns can happen
> which one simply never notices, while the btrfs continues to degrade.

The two other known cases are:

1) Bittorrent download files, where the full file size is preallocated 
(and I think fsynced), then the torrent client downloads into it a chunk 
at a time.

The more general case would be any time a file of some size is 
preallocated and then written into more or less randomly, the problem 
being the preallocation, which on traditional rewrite-in-place 
filesystems helps avoid fragmentation (as well as ensuring space to save 
the full file), but on COW-based filesystems like btrfs, triggers exactly 
the fragmentation it was trying to avoid.

At least some torrent clients (ktorrent at least) have an option to turn 
off that preallocation, however, and that would be recommended where 
possible.  Where disabling the preallocation isn't possible, arranging to 
have the client write into a dir with the nocow attribute set, so newly 
created torrent files inherit it and do rewrite-in-place, is highly 
recommended.
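
(A minimal sketch of that, assuming a dedicated download directory; the
paths are placeholders:)

  mkdir -p ~/torrents/incoming
  chattr +C ~/torrents/incoming   # new files created here inherit nocow
  lsattr -d ~/torrents/incoming   # verify: the 'C' attribute is shown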

It's also worth noting that once the download is complete, the files 
aren't going to be rewritten any further, and thus can be moved out of 
the nocow-set download dir and treated normally.  For those who will 
continue to seed the files for some time, this could be done, provided 
the client can seed from a directory different than the download dir.

2) As a subcase of the database file case that people may not think 
about, systemd journal files are known to have had the internal-rewrite-
pattern problem in the past.  Apparently, while they're mostly append-
only in general, they do have an index at the beginning of the file that 
gets rewritten quite a bit.

The problem is much reduced in newer systemd, which is btrfs aware and in 
fact uses btrfs-specific features such as subvolumes in a number of cases 
(creating subvolumes rather than directories where it makes sense in some 
shipped tmpfiles.d config files, for instance), if it's running on 
btrfs.  For the journal, I /think/ (see the next paragraph) that it now 
sets the journal files nocow, and puts them in a dedicated subvolume so 
snapshots of the parent won't snapshot the journals, thereby helping to 
avoid the snapshot-triggered cow1 issue.

On my own systems, however, I've configured journald to only use the 
volatile tmpfs journals in /run, not the permanent /var location, 
tweaking the size of the tmpfs mounted on /run and the journald config so 
it normally stores a full boot session, but of course doesn't store 
journals from previous sessions as they're wiped along with the tmpfs at 
reboot.  I run syslog-ng as well, configured to work with journald, and 
thus have its more traditional append-only plain-text syslogs for 
previous boot sessions.

For my usage that actually seems the best of both worlds as I get journald 
benefits such as service status reports showing the last 10 log entries 
for that service, etc, with those benefits mostly applying to the current 
session only, while I still have the traditional plain-text greppable, 
etc, syslogs, from both the current and previous sessions, back as far as 
my log rotation policy keeps them.  It also keeps the journals entirely 
off of btrfs, so that's one particular problem I don't have to worry 
about at all, the reason I'm a bit fuzzy on the exact details of systemd's 
solution to the journal on btrfs issue.

> So is there any general approach towards this?

The general case is that for normal desktop users, it doesn't tend to be 
a problem, as they don't do either large VMs or large databases, and 
small ones such as the sqlite files generated by firefox and various 
email clients are handled quite well by autodefrag, with that general 
desktop usage being its primary target.

For server usage and the more technically inclined workstation users who 
are running VMs and larger databases, the general feeling seems to be 
that those adminning such systems are, or should be, technically inclined 
enough to do their research and know when measures such as nocow and 
limited snapshotting along with manual defrags where necessary, are 
called for.  And if they don't originally, they find out when they start 
researching why performance isn't what they expected and what to do about 
it. =:^)

> And what are the actual possible consequences? Is it just that fs gets
> slower (due to the fragmentation) or may I even run into other issues to
> the point the space is 

Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Duncan
Christoph Anton Mitterer posted on Thu, 26 Nov 2015 19:25:47 +0100 as
excerpted:

> On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
>> For people doing snapshotting in particular, atime updates can be a big
>> part of the differences between snapshots, so it's particularly
>> important to set noatime if you're snapshotting.

> What all happens when that is left at relatime?
> 
> I'd guess that obviously every time the atime is updated there will be
> some CoW, but only on meta-data blocks, right?

Yes.
 
> Does this then lead to fragmentation problems in the meta-data block
> groups?

I don't believe so.  I think individual metadata elements tend to be 
small enough that several fit in a metadata node (16 KiB by default these 
days, IIRC), so there's no "metadata fragmentation" to speak of.

> And how serious are the effects on space that is eaten up... say I have
> n snapshots and access all of their files... then I'd probably get n
> times the metadata, right? Which would sound quite dramatic...
> 
> Or are just parts of the metadata copied with new atimes?

I think it's whole 4 KiB blocks and possibly whole metadata nodes (16 
KiB), copy-on-write, and these would be relatively small changes 
triggering cow of the entire block/node, aka write amplification.  While 
not too large in themselves, it's the number of them that becomes a 
problem.

IIRC relatime updates once a day on access.  If you're doing daily 
snapshots, updating metadata blocks for all files accessed in the last 24 
hours...

Again, individual snapshots aren't so much of a problem, and if you're 
thinning to the 250 snapshots per subvolume or less as I recommend, the 
problem will remain controlled, but at 250, starting at daily snapshots 
so they all have atime changes for at least all files accessed during 
that 24 hours, that's still a sizable set of unnecessarily modified and 
thus space-taking snapshotted metadata.

But I wouldn't worry about it too much if you're doing say monthly 
snapshots and only keeping a year's worth or less, 12-13 snapshots per 
subvolume total.

In my case, I'm on SSD with their limited write cycles, so while the 
snapshot thing doesn't affect me since my use-case doesn't involve 
snapshots, the SSD write cycle count thing certainly does, and noatime is 
worth it to me for that alone.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Christoph Anton Mitterer
On Thu, 2015-11-26 at 23:29 +, Duncan wrote:
> > but only on meta-data blocks, right?
> Yes.
Okay... so at most it'll get the whole meta-data for a snapshot
separately and not shared anymore...
And when these are chained as in ZFS... it probably amplifies, i.e. a
change deep down in the tree changes all the upper elements as well?
Which shouldn't be too big a problem unless I have a lot of snapshots or
extremely many files.



> I think it's whole 4 KiB blocks and possibly whole metadata nodes (16
> KiB), copy-on-write, and these would be relatively small changes 
> triggering cow of the entire block/node, aka write
> amplification.  While 
> not too large in themselves, it's the number of them that becomes a 
> problem.
Ah... there you say it already =)
But still it's always only meta-data that is copied, never the data,
right?!


> IIRC relatime updates once a day on access.  If you're doing daily 
> snapshots, updating metadata blocks for all files accessed in the
> last 24 
> hours...
Yes...


Wouldn't it be a way to handle that problem if btrfs allowed creating
snapshots for which the atime never gets updated, regardless of any
mount option?

And additionally, allow people to mount subvols with different
noatime/relatime/atime settings (unless that's already working)... that
way, they could enable it for things where they want/need it,... and
disable it where not.


> In my case, I'm on SSD with their limited write cycles, so while the
> snapshot thing doesn't affect me since my use-case doesn't involve 
> snapshots, the SSD write cycle count thing certainly does, and
> noatime is 
> worth it to me for that alone.
I'm always a bit unsure about that... I used to do it as well, for
the wear... but is that really necessary?
With relatime, atime updates happen at most once a day... so at worst
you rewrite... what... some 100 MB (at least in the ext234 case)... and
SSDs seem to bear many more write cycles than advertised.


Cheers,
Chris.



kernel call trace during send/receive

2015-11-26 Thread Christoph Anton Mitterer
Hey.

Just got the following during send/receiving a big snapshot from one
btrfs to another fresh one.

Both under kernel 4.2.6, tools 4.3

The send/receive seems to continue however...

Any ideas what that means?

Cheers,
Chris.

Nov 27 01:52:36 heisenberg kernel: [ cut here ]
Nov 27 01:52:36 heisenberg kernel: WARNING: CPU: 7 PID: 18086 at 
/build/linux-CrHvZ_/linux-4.2.6/fs/btrfs/send.c:5794 
btrfs_ioctl_send+0x661/0x1120 [btrfs]()
Nov 27 01:52:36 heisenberg kernel: Modules linked in: ext4 mbcache jbd2 
nls_utf8 nls_cp437 vfat fat uas vhost_net vhost macvtap macvlan xt_CHECKSUM 
iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 
nf_nat xt_tcpudp tun bridge stp llc fuse ccm ebtable_filter ebtables seqiv ecb 
drbg ansi_cprng algif_skcipher md4 algif_hash af_alg binfmt_misc xfrm_user 
xfrm4_tunnel tunnel4 ipcomp xfrm_ipcomp esp4 ah4 cpufreq_userspace 
cpufreq_powersave cpufreq_stats cpufreq_conservative ip6t_REJECT nf_reject_ipv6 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_policy 
ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 
xt_multiport xt_conntrack nf_conntrack iptable_filter ip_tables x_tables joydev 
rtsx_pci_ms rtsx_pci_sdmmc mmc_core memstick iTCO_wdt iTCO_vendor_support 
x86_pkg_temp_thermal
Nov 27 01:52:36 heisenberg kernel:  intel_powerclamp intel_rapl iosf_mbi 
coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev deflate ctr psmouse 
serio_raw twofish_generic pcspkr btusb btrtl btbcm btintel bluetooth crc16 
uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev 
media twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common sg 
arc4 camellia_generic iwldvm mac80211 iwlwifi cfg80211 rtsx_pci rfkill 
camellia_aesni_avx_x86_64 snd_hda_codec_hdmi tpm_tis tpm 8250_fintek 
camellia_x86_64 snd_hda_codec_realtek snd_hda_codec_generic processor battery 
fujitsu_laptop i2c_i801 ac lpc_ich serpent_avx_x86_64 mfd_core snd_hda_intel 
snd_hda_codec snd_hda_core snd_hwdep snd_pcm shpchp snd_timer e1000e snd 
soundcore i915 ptp pps_core video button drm_kms_helper drm thermal_sys mei_me
Nov 27 01:52:36 heisenberg kernel:  i2c_algo_bit mei serpent_sse2_x86_64 xts 
serpent_generic blowfish_generic blowfish_x86_64 blowfish_common 
cast5_avx_x86_64 cast5_generic cast_common des_generic cbc cmac xcbc rmd160 
sha512_ssse3 sha512_generic sha256_ssse3 sha256_generic hmac crypto_null af_key 
xfrm_algo loop parport_pc ppdev lp parport autofs4 dm_crypt dm_mod md_mod btrfs 
xor raid6_pq uhci_hcd usb_storage sd_mod crc32c_intel aesni_intel aes_x86_64 
glue_helper ahci lrw gf128mul ablk_helper libahci cryptd libata ehci_pci 
xhci_pci ehci_hcd scsi_mod xhci_hcd usbcore usb_common
Nov 27 01:52:36 heisenberg kernel: CPU: 7 PID: 18086 Comm: btrfs Not tainted 
4.2.0-1-amd64 #1 Debian 4.2.6-1
Nov 27 01:52:36 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
E782/FJNB23E, BIOS Version 1.11 05/24/2012
Nov 27 01:52:36 heisenberg kernel:   a02e6260 
8154e2f6 
Nov 27 01:52:36 heisenberg kernel:  8106e5b1 880235a3c42c 
7ffd3d3796c0 8802f0e5c000
Nov 27 01:52:36 heisenberg kernel:  0004 88010543c500 
a02d2d81 88041e5ebb00
Nov 27 01:52:36 heisenberg kernel: Call Trace:
Nov 27 01:52:36 heisenberg kernel:  [] ? dump_stack+0x40/0x50
Nov 27 01:52:36 heisenberg kernel:  [] ? 
warn_slowpath_common+0x81/0xb0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
btrfs_ioctl_send+0x661/0x1120 [btrfs]
Nov 27 01:52:36 heisenberg kernel:  [] ? 
__alloc_pages_nodemask+0x194/0x9e0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
btrfs_ioctl+0x26c/0x2a10 [btrfs]
Nov 27 01:52:36 heisenberg kernel:  [] ? 
sched_move_task+0xca/0x1d0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
cpumask_next_and+0x2e/0x50
Nov 27 01:52:36 heisenberg kernel:  [] ? 
select_task_rq_fair+0x23f/0x5c0
Nov 27 01:52:36 heisenberg kernel:  [] ? 
enqueue_task_fair+0x387/0x1120
Nov 27 01:52:36 heisenberg kernel:  [] ? 
native_sched_clock+0x24/0x80
Nov 27 01:52:36 heisenberg kernel:  [] ? sched_clock+0x5/0x10
Nov 27 01:52:36 heisenberg kernel:  [] ? 
do_vfs_ioctl+0x2c3/0x4a0
Nov 27 01:52:36 heisenberg kernel:  [] ? _do_fork+0x146/0x3a0
Nov 27 01:52:36 heisenberg kernel:  [] ? SyS_ioctl+0x76/0x90
Nov 27 01:52:36 heisenberg kernel:  [] ? 
system_call_fast_compare_end+0xc/0x6b
Nov 27 01:52:36 heisenberg kernel: ---[ end trace f5fa91e2672eead0 ]---



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Qu Wenruo



Mitchell Fossen wrote on 2015/11/25 15:49 -0600:

On Mon, 2015-11-23 at 06:29 +, Duncan wrote:


Using subvolumes was the first recommendation I was going to make, too,
so you're on the right track. =:^)

Also, in case you are using it (you didn't say, but this has been
demonstrated to solve similar issues for others so it's worth
mentioning), try turning btrfs quota functionality off.  While the devs
are working very hard on that feature for btrfs, the fact is that it's
simply still buggy and doesn't work reliably anyway, in addition to
triggering scaling issues before they'd otherwise occur.  So my
recommendation has been, and remains, unless you're working directly with
the devs to fix quota issues (in which case, thanks!), if you actually
NEED quota functionality, use a filesystem where it works reliably, while
if you don't, just turn it off and avoid the scaling and other issues
that currently still come with it.



I did indeed have quotas turned on for the home directories! Since they were
mostly to calculate space used by everyone (since du -hs is so slow) and not
actually needed to limit people, I disabled them.


[[About quota]]
Personally speaking, I'd like to have some comparison between quota
enabled and disabled, to help determine whether it's quota causing the problem.


If you can find a good and reliable reproducer, it would be very helpful 
for developers to improve btrfs.


BTW, it's also a good idea to use ps to see what process is running at
the time your btrfs hangs.


If it's the kernel thread named btrfs-transaction, then it may be related to
quota.






As for defrag, that's quite a topic of its own, with complications
related to snapshots and the nocow file attribute.  Very briefly, if you
haven't been running it regularly or using the autodefrag mount option by
default, chances are your available free space is rather fragmented as
well, and while defrag may help, it may not reduce fragmentation to the
degree you'd like.  (I'd suggest using filefrag to check fragmentation,
but it doesn't know how to deal with btrfs compression, and will report
heavy fragmentation for compressed files even if they're fine.  Since you
use compression, that kind of eliminates using filefrag to actually see
what your fragmentation is.)
Additionally, defrag isn't snapshot aware (they tried it for a few
kernels a couple years ago but it simply didn't scale), so if you're
using snapshots (as I believe Ubuntu does by default on btrfs, at least
taking snapshots for upgrade-in-place), so using defrag on files that
exist in the snapshots as well can dramatically increase space usage,
since defrag will break the reflinks to the snapshotted extents and
create new extents for defragged files.

Meanwhile, the absolute worst-case fragmentation on btrfs occurs with
random-internal-rewrite-pattern files (as opposed to never changed, or
append-only).  Common examples are database files and VM images.  For
/relatively/ small files, to say 256 MiB, the autodefrag mount option is
a reasonably effective solution, but it tends to have scaling issues with
files over half a GiB so you can call this a negative recommendation for
trying that option with half-gig-plus internal-random-rewrite-pattern
files.  There are other mitigation strategies that can be used, but here
the subject gets complex so I'll not detail them.  Suffice it to say that
if the filesystem in question is used with large VM images or database
files and you haven't taken specific fragmentation avoidance measures,
that's very likely a good part of your problem right there, and you can
call this a hint that further research is called for.

If your half-gig-plus files are mostly write-once, for example most media
files unless you're doing heavy media editing, however, then autodefrag
could be a good option in general, as it deals well with such files and
with random-internal-rewrite-pattern files under a quarter gig or so.  Be
aware, however, that if it's enabled on an already heavily fragmented
filesystem (as yours likely is), it's likely to actually make performance
worse until it gets things under control.  Your best bet in that case, if
you have spare devices available to do so, is probably to create a fresh
btrfs and consistently use autodefrag as you populate it from the
existing heavily fragmented btrfs.  That way, it'll never have a chance
for the fragmentation to build up in the first place, and autodefrag used
as a routine mount option should keep it from getting bad in normal use.


Thanks for explaining that! Most of these files are written once and then read
from for the rest of their "lifetime" until the simulations are done and they
get archived/deleted. I'll try leaving autodefrag on and defragging directories
over the holiday weekend when no one is using the server. There is some database
usage, but I turned off COW for its folder and it only gets used sporadically
and shouldn't be a huge factor in day-to-day usage.

Also, is there a 

Re: btrfs check help

2015-11-26 Thread Vincent Olivier

> On Nov 25, 2015, at 8:44 PM, Qu Wenruo  wrote:
> 
> 
> 
> Vincent Olivier wrote on 2015/11/25 11:51 -0500:
>> I should probably point out that there is 64GB of RAM on this machine and 
>> it’s a dual Xeon processor (LGA2011-3) system. Also, there is only Btrfs 
>> served via Samba and the kernel panic was caused by Btrfs (as per what I 
>> remember from the log on the screen just before I rebooted) and happened in 
>> the middle of the night when zero (0) client was connected.
>> 
>> You will find below the full “btrfs check” log for each device in the order 
>> it is listed by “btrfs fi show”.
> 
> There is really no need to do such a thing; as btrfs is able to manage multiple 
> devices, calling btrfsck on any of them is enough as long as it's not hugely 
> damaged.
> 
>> 
>> Can I get a strong confirmation that I should run with the "--repair" option 
>> on each device? Thanks.
> 
> YES.
> 
> Inode nbytes fix is *VERY* safe as long as it's the only error.
> 
> Although that's not all that convincing, since the inode nbytes fix code is written 
> by myself, and authors always tend to believe their code is good.
> But at least some other users with more complicated problems (with inode 
> nbytes errors) have fixed them with it.
> 
> The last decision is still on you anyway.

I will do it on the first device from the “fi show” output and report.
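
(For reference, a rough sketch of the commands involved; the device name is
a placeholder, and the filesystem must be unmounted first:)

  umount /mnt/pool
  btrfs fi show                   # pick the first device it lists
  btrfs check --repair /dev/sdX   # any one member device is enough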

Thanks,

Vincent



How to detect / notify when a raid drive fails?

2015-11-26 Thread Ian Kelling
I'd like to run "mail" when a btrfs raid drive fails, but I don't
know how to detect that a drive has failed. It don't see it in
any docs. Otherwise I assume I would never know until enough
drives fail that the filesystem stops working, and I'd like to
know before that.

- Ian Kelling


Re: How to detect / notify when a raid drive fails?

2015-11-26 Thread Duncan
Ian Kelling posted on Thu, 26 Nov 2015 21:14:57 -0800 as excerpted:

> I'd like to run "mail" when a btrfs raid drive fails, but I don't know
> how to detect that a drive has failed. I don't see it in any docs.
> Otherwise I assume I would never know until enough drives fail that the
> filesystem stops working, and I'd like to know before that.

Btrfs isn't yet mature enough to have a device failure notifier daemon, 
like for instance mdadm does.  There's a patch set going around that adds 
global spares, so btrfs can detect the problem and grab a spare, but it's 
only a rather simplistic initial implementation designed to provide the 
framework for more fancy stuff later, and that's about it in terms of 
anything close, so far.

What generally happens now, however, is that the btrfs will note failures 
attempting to write the device and start queuing up writes.  If the 
device reappears fast enough, btrfs will flush the queue and be back to 
normal.  Otherwise, you pretty much need to reboot and mount degraded, 
then add a device and rebalance. (btrfs device delete missing broke some 
versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)

As for alerts, you'd see the pile of accumulating write errors in the 
kernel log.  Presumably you can write up a script that can alert on that 
and mail you the log or whatever, but I don't believe there's anything 
official or close to it, yet.
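
(A rough sketch of such a script, not an official tool; the mount point and
mail address are placeholders, and it leans on "btrfs device stats" rather
than log scraping:)

  #!/bin/sh
  MNT=/mnt/btrfs
  # print only the per-device error counters that are non-zero
  ERRORS=$(btrfs device stats "$MNT" | awk '$2 != 0')
  if [ -n "$ERRORS" ]; then
      printf '%s\n' "$ERRORS" | mail -s "btrfs device errors on $MNT" admin@example.com
  fi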

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: implications of mixed mode

2015-11-26 Thread Roman Mamedov
On Fri, 27 Nov 2015 10:21:31 +0800
Qu Wenruo  wrote:

> And some extra pros and cons due to the fixed (4K) nodesize, small
> compared to the 16K default:
> 
> + A little higher performance
>    The node/leaf size is restricted to the sectorsize; a smaller node/leaf
>    means a smaller range to lock.
>    In our SSD tests with highly concurrent operations, the performance is
>    overall 10% better than with the 16K nodesize.
>    And in extreme metadata-operation cases, like highly concurrent
>    sequential writes into small files, it can be 8 times the performance of
>    the default 16K nodesize.

This is surprising to read, as I thought 16K is generally faster and that's
why the default value was changed to it from 4K.

https://oss.oracle.com/~mason/blocksizes/
https://git.kernel.org/cgit/linux/kernel/git/mason/btrfs-progs.git/commit/?id=c652e4efb8e2dd76ef1627d8cd649c6af5905902

Seems like the 16K size prevents fragmentation, but since your SSDs do not care
much about fragmentation, that's not adding a benefit for them.

-- 
With respect,
Roman




[PATCH] btrfs-progs: mkfs: Fix a wrong extent buffer size causing wrong superblock csum

2015-11-26 Thread Qu Wenruo
make_btrfs() sets the wrong buf size for the last super block write
out.
The superblock size is always BTRFS_SUPER_INFO_SIZE, not
cfg->sectorsize.

And this makes mkfs.btrfs -f -s 8K fail.

Fix it to BTRFS_SUPER_INFO_SIZE.

Signed-off-by: Qu Wenruo 
---
Thank goodness, this time it's not my super block checksum patches
causing bugs.
---
 utils.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/utils.c b/utils.c
index 60235d8..00355a2 100644
--- a/utils.c
+++ b/utils.c
@@ -554,7 +554,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
 	BUG_ON(sizeof(super) > cfg->sectorsize);
 	memset(buf->data, 0, cfg->sectorsize);
 	memcpy(buf->data, &super, sizeof(super));
-	buf->len = cfg->sectorsize;
+	buf->len = BTRFS_SUPER_INFO_SIZE;
 	csum_tree_block_size(buf, BTRFS_CRC32_SIZE, 0);
 	ret = pwrite(fd, buf->data, cfg->sectorsize, cfg->blocks[0]);
 	if (ret != cfg->sectorsize) {
-- 
2.6.2





Re: [PATCH 00/25] Btrfs-convert rework to support native separate

2015-11-26 Thread David Sterba
On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:
> > As far as the conversion support stays, it's not a problem of course. I
> > don't have a complete picture of all the actual merging conflicts, but
> > the idea is to provide the callback abstraction v2 to allow ext2 and
> > reiser plus allow all the changes of this pathcset.
> >
> Glad to hear that.
> 
> BTW, which reiserfs progs headers are you using?

Sorry I forgot to mention it, it's the latest git version,
https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/

Jeff hasn't released v3.6.25 yet. We have the git version in SUSE
distros so it works for me here.


Re: [PATCH 00/25] Btrfs-convert rework to support native separate

2015-11-26 Thread Qu Wenruo



On 11/26/2015 05:30 PM, David Sterba wrote:

On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:

As far as the conversion support stays, it's not a problem of course. I
don't have a complete picture of all the actual merging conflicts, but
the idea is to provide the callback abstraction v2 to allow ext2 and
reiser plus allow all the changes of this pathcset.


Glad to hear that.

BTW, which reiserfs progs headers are you using?


Sorry I forgot to mention it, it's the latest git version,
https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/

Jeff hasn't released v3.6.25 yet. We have the git version in SUSE
distros so it works for me here.


Thanks, now it should be OK to continue the rebase.

But I'm a little concerned about the unstable headers; unlike ext2, whose
headers are almost stable, reiserfs's seem not to be.


What about rebasing my patch onto your abstraction patch (btrfs-progs: 
convert: add context and operations struct to allow different file 
systems) first and then adding back your reiserfs patch?


Your abstraction patch is quite nice, although it needs some modification to 
work with the new convert.
I hope to add stable things first and don't want another reiserfs change 
to break the compile.


Thanks,
Qu


Re: [PATCH 0/7] Let user specify the kernel version for features

2015-11-26 Thread Anand Jain





With the new -O comp= option, the concern for users who want to make a
btrfs for a newer kernel is hugely reduced.


No! Actually, the new -O comp= option leaves no concern for users who
want to create _a btrfs disk layout which is compatible with more
than one kernel_.  There are two examples of that above.


Why can't you give a higher kernel version than the current kernel?


 Mount fails.  Please try it!





But I still prefer such feature alignment to be done only when specified by
the user, instead of automatically. (Yeah, I've already said this several
times.)
A warning should be enough for the user; sometimes too much automation is
not good.


As said before:
We need the latest btrfs-progs on older kernels, for the obvious reason of
btrfs-progs bug fixes. We don't have to back-port fixes on
btrfs-progs the way we already do in the btrfs kernel. btrfs-progs should
work on any kernel with the "default features as prescribed for that
kernel".

Let's say we don't do this automatically; then the latest btrfs-progs
with a default mkfs.btrfs && mount fails. But a user upgrading btrfs-progs
for fsck bug fixes shouldn't find 'default mkfs.btrfs && mount'
failing. Nor should they have to use a "new" set of mkfs options to create
a fully default FS for an LTS kernel.

Default features based on the btrfs-progs version instead of the kernel
version makes NO sense.


Kernel version never makes sense, especially for non-vanilla.

>

And unfortunately, most of the kernels used in stable distributions are not
vanilla.
And that's *POINT 1*.

That's why I stand against kernel-version-based detection.
You can use the stable /sys/fs/btrfs/features/, but the kernel version?
Not an option even as a fallback.


Yep, that's the reason someone invented sysfs/features, but
unfortunately only from 3.14 on. If not the version, please suggest the best
alternative, _without_ transferring the problem to the user end.
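
(For reference, a minimal sketch of that sysfs probe, available on kernels
>= 3.14 only; the exact feature names listed vary with the running kernel:)

  ls /sys/fs/btrfs/features/
  # e.g. big_metadata  extended_iref  mixed_backref  raid56  skinny_metadata ...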



And adding a warning about not using the latest
features, which are not in their running kernel, is pointless.


 In the context of default features.




You didn't get the point of what to WARN about.

Not warning the user that they are not using the latest features, but warning
that some features may prevent the fs from being mounted on the current kernel.


 It was there before; some patch in the past removed it. I hope you
 remember "Turning on incompatible..."



That's _not_ a backward kernel compatible tool.

btrfs-progs should work "for the kernel". We should avoid adding too
much intelligence into btrfs-progs. I have fixed too many issues and
redesigned progs in this area. Too many bugs were mainly because of the
copy-and-maintain-the-same-code-in-btrfs-progs-and-the-kernel approach
for progs (ref the wiki and my email before). That's a wrong
approach.


Totally agree with this point. There is too much nonsense in btrfs-progs code
copied from the kernel, and due to lack of updates it's very buggy now.
Just check volume.c for allocating data chunks.

But I don't see how this point relates to the feature auto-alignment here.


 The whole point is: don't add more intelligence into progs than
 is required.

 Here it's about default features. And the questions are:
 default against what? The btrfs-progs version itself?
 Why do you want to add another attribute, the btrfs-progs version,
 being relevant at the user end? For users that's like progs not
 being in line with the kernel, a very strong problem statement.

 (A bit vague as of now:) there is a chance that someday
 we would move the mkfs part into the kernel itself, making progs as
 slick as possible.



I don't understand: if the purpose of both of these isn't the
same, what is the point in maintaining the same code? It won't save
effort, mainly because it's like developing a distributed FS where
two parties have to communicate to stay in sync. Which is like using
a cannon to shoo a crow.
But if the reason were a fuse-like kernel-free FS (no one said that,
though) then it's better to do it as a separate project.


especially for tests.


It depends on what's being tested: kernel OR progs? It's the kernel, not progs.


No, both kernel and progs. Just from Dave, even with his typo:

"xfstests is not jsut for testing kernel changes - it tests all of
the filesystem utilities for regressions, too. And so when
inadvertant changes in default behaviour occur, it detects those
regressions too."


 Now in this context, if you are testing the latest btrfs-progs (without
 these patches) on an old LTS kernel, and using the default mkfs options,
 all tests fail. That's something to fix. Without transferring
 implementation difficulties to the user end. And without changing
 xfstests mkfs_options. Because we claim progs is backward kernel
 compatible.



Automatic alignment will keep the default features constant for a given kernel
version. Further, for testing, using a known set of options is even
better.


Yeah, a known set of options becomes unknown on different kernels, thanks to
the hidden feature alignment. Unless you specify it with -O options.

That's *POINT 2*:
Default auto feature alignment makes mkfs.btrfs behavior *unpredictable*.

>

Before auto feature 

Re: [PATCH 0/7] Let user specify the kernel version for features

2015-11-26 Thread Qu Wenruo



On 11/26/2015 07:18 PM, Anand Jain wrote:





With the new -O comp= option, the concern for users who want to make a
btrfs for a newer kernel is hugely reduced.


No! Actually, the new -O comp= option leaves no concern for users who
want to create _a btrfs disk layout which is compatible with more
than one kernel_.  There are two examples of that above.


Why can't you give a higher kernel version than the current kernel?


  Mount fails.  Please try it!


But that's what the user wants to do. He/she knows what they are doing.
Maybe they are running the btrfs-progs self tests without needing to mount
it (at least some of the tests don't require mounting).








But I still prefer such feature alignment to be done only when specified by
the user, instead of automatically (yeah, I've already said this several
times).
A warning should be enough for the user; sometimes too much automation is
not good.


As said before:
we need the latest btrfs-progs on older kernels, for the obvious reason of
btrfs-progs bug fixes. We shouldn't have to backport fixes to btrfs-progs
on top of what we already do for the btrfs kernel. btrfs-progs should
work on any kernel with the "default features as prescribed for that
kernel".

Let's say we don't do this automatically: then the latest btrfs-progs
with a default 'mkfs.btrfs && mount' fails. But a user upgrading btrfs-progs
for fsck bug fixes shouldn't find a default 'mkfs.btrfs && mount'
failing, nor should they have to use a "new" set of mkfs options to create
an all-default FS for an LTS kernel.

Default features based on the btrfs-progs version instead of the kernel
version make NO sense.


The kernel version never makes sense, especially for non-vanilla kernels.

 >

And unfortunately, most of the kernels used in stable distributions are not
vanilla.
And that's *POINT 1*.

That's why I stand against kernel-version-based detection.
You can use the stable /sys/fs/btrfs/features/ interface, but the kernel
version? Not an option, even as a fallback.


 Yep, that's the reason someone invented sysfs/features, but unfortunately
 only from 3.14. If not the version, please suggest the best alternative,
 _without_ transferring the problem to the user end.


A solution which may produce a wrong result is never a solution,
no matter whether it is better than any other one.

Doing it wrong (even if it's sometimes OK) is never better than doing nothing.
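
As a rough illustration of the sysfs route argued for above, here is a
minimal C sketch that asks the running kernel which features it supports.
It assumes only that the kernel is 3.14+ and exposes one file per feature
under /sys/fs/btrfs/features; the helper name is made up, and this is a
sketch of the idea rather than anything btrfs-progs actually ships.

/*
 * Hypothetical run-time feature detection via sysfs, instead of
 * guessing from the kernel version string.
 */
#include <limits.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* 1 = feature advertised, 0 = not advertised,
 * -1 = sysfs interface missing (pre-3.14 kernel or btrfs not loaded) */
static int kernel_supports_feature(const char *name)
{
	char path[PATH_MAX];
	struct stat st;

	if (stat("/sys/fs/btrfs/features", &st) != 0)
		return -1;
	snprintf(path, sizeof(path), "/sys/fs/btrfs/features/%s", name);
	return stat(path, &st) == 0 ? 1 : 0;
}

int main(void)
{
	printf("skinny_metadata: %d\n",
	       kernel_supports_feature("skinny_metadata"));
	printf("extended_iref:   %d\n",
	       kernel_supports_feature("extended_iref"));
	return 0;
}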





And adding a warning about not using the latest
features, which are not in their running kernel anyway, is pointless.


  In the context of default features.




You didn't get the point of what to WARN about.

Not warning the user that they are not using the latest features, but warning
that some features may prevent the fs from being mounted on the current kernel.


  It was there before; some patch in the past removed it. Hope you
  remember "Turning on incompatible..."


Then add it back, not as an informational message but as a warning.
The old output is too easy to ignore.

Or you can just stop mkfs when you detect such an incompatible feature for
the current kernel, and only continue if "-f" is given.






That's _not_ a backward kernel compatible tool.

btrfs-progs should work "for the kernel". We should avoid adding too
much intelligence into btrfs-progs. I have fixed too many issues and
redesigned progs in this area. Too many bugs came mainly from the idea
of copying and maintaining the same code in btrfs-progs and the btrfs
kernel (ref the wiki and my earlier email). That's a wrong approach.


Totally agree with this point. Too much nonsense in the btrfs-progs code
was copied from the kernel, and due to lack of updates it's very buggy now.
Just check volume.c for data chunk allocation.

But I don't see how that relates to the automatic feature alignment here.


  The whole point is: don't add more intelligence into progs than what
  is required.

  Here it's about default features. And the questions are:
  default with respect to what? To the btrfs-progs version itself?
  Why do you want to add another attribute, the btrfs-progs version,
  being relevant at the user end?


End user of what? A single package?

No, the end user of the *whole distribution*.
It's the packagers/backport guys who are responsible for not combining
mismatched kernel(-LTS) and progs versions.


Now we need to auto align feature with kernel, who know one day we will 
need to auto align our libs to upstream package?
Keeping a matrix with different packages like libuuid/acl/attr with 
different Makefile?
At least this is not a good idea for me, and that's the work of 
autoconfig IIRC.


And if I'm a package and face such problem, I'll choose the simplest 
solution, just add a line in PKGBUILD(package system of Archlinux) of btrfs.

--
depends=('linux>=3.14')
--
(Yeah, such simple and slick packaging solution is the reason I like 
Arch over other rolling distribution)


Not every thing really needed to be done in code level.


  For users that's like progs not being in line with the kernel, a very
  strong problem statement.

  (a bit vague as of now) There is a chance that someday we would
  move the mkfs part into the kernel itself, making progs as slick
  as possible.





I don't understand- if the 

Re: [PATCH] btrfs-progs: mkfs: Fix a wrong extent buffer size causing wrong superblock csum

2015-11-26 Thread David Sterba
On Thu, Nov 26, 2015 at 04:15:55PM +0800, Qu Wenruo wrote:
> For make_btrfs(), it's setting the wrong buf size for the last super block write
> out.
> The superblock size is always BTRFS_SUPER_INFO_SIZE, not
> cfg->sectorsize.
> 
> And this makes mkfs.btrfs -f -s 8K fail.
> 
> Fix it to BTRFS_SUPER_INFO_SIZE.
> 
> Signed-off-by: Qu Wenruo 

Already fixed in current devel (bf1ac8305ab3f191d9).

> --- a/utils.c
> +++ b/utils.c
> @@ -554,7 +554,7 @@ int make_btrfs(int fd, struct btrfs_mkfs_config *cfg)
>   BUG_ON(sizeof(super) > cfg->sectorsize);
>   memset(buf->data, 0, cfg->sectorsize);
>   memcpy(buf->data, , sizeof(super));
> - buf->len = cfg->sectorsize;
> + buf->len = BTRFS_SUPER_INFO_SIZE;
>   csum_tree_block_size(buf, BTRFS_CRC32_SIZE, 0);
>   ret = pwrite(fd, buf->data, cfg->sectorsize, cfg->blocks[0]);

Also, this overwrites more bytes than necessary so my fix uses
BTRFS_SUPER_INFO_SIZE everywhere.

>   if (ret != cfg->sectorsize) {


Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread David Sterba
On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
> In process_extent_item(), 'metadata' is given the initial value 0, but in
> the non-skinny-metadata case a metadata extent can't be judged just from the
> key type, and that case was forgotten.
> 
> This causes a lot of false alerts on non-skinny-metadata filesystems.
> 
> Fix it by setting the correct metadata value before calling add_extent_rec().
> 
> Reported-by: Christoph Anton Mitterer 
> Signed-off-by: Qu Wenruo 

Patch replaced, thanks. The test image is pushed as well.


Re: [PATCH 00/25] Btrfs-convert rework to support native separate

2015-11-26 Thread David Sterba
On Thu, Nov 26, 2015 at 06:12:57PM +0800, Qu Wenruo wrote:
> But I'm a little concerned about the unstable headers; unlike ext2, whose
> headers are almost stable, reiserfs seems not to be.

Well, reiserfs is not developed nowadays and I think Jeff implemented
the bits required for btrfs-convert. The configure script will detect whether
the reiser library provides the needed functions; compiling reiser in will
be optional anyway.

> What about rebasing my patch onto your abstract patch (btrfs-progs: 
> convert: add context and operations struct to allow different file 
> systems) first and then adding back your reiserfs patch?

Oh right, the patch is independent, I'll add it to devel.

> Your abstract patch is quite nice, although it needs some modification to 
> work with the new convert.

Yes, that's expected.

> I hope to add the stable things first and don't want another reiserfs change 
> to break the compile.

Ok.


Re: [PATCH] btrfs: Support convert to -d dup for btrfs-convert

2015-11-26 Thread David Sterba
On Thu, Nov 19, 2015 at 05:26:22PM +0800, Zhao Lei wrote:
> Since we will add support for -d dup for non-mixed filesystems, the
> kernel needs to support converting to this raid type.
> 
> This patch removes the limitation for the above case.
> 
> Signed-off-by: Zhao Lei 

Reviewed-by: David Sterba 


Re: [PATCH v3] btrfs-progs: mkfs: Enable -d dup for single device

2015-11-26 Thread David Sterba
On Thu, Nov 19, 2015 at 04:14:16PM -0500, Austin S Hemmelgarn wrote:
> On 2015-11-19 04:36, Zhao Lei wrote:
> > Signed-off-by: Zhao Lei 
> Seeing as I forgot to reply to the previous version after testing it, 
> I'll just reply here now that I've run this version through the same 
> tests I did on the last one.
> 
> I threw everything I could think of at it, and nothing broke, so you can 
> add:
> Tested-by: Austin S. Hemmelgarn

Thanks.

Patch added to devel as it's not code-intrusive, will be probably
released within 4.4.


Re: [PATCH 00/25] Btrfs-convert rework to support native separate

2015-11-26 Thread Jeff Mahoney

On 11/26/15 5:12 AM, Qu Wenruo wrote:
> 
> 
> On 11/26/2015 05:30 PM, David Sterba wrote:
>> On Thu, Nov 26, 2015 at 08:38:23AM +0800, Qu Wenruo wrote:
 As far as the conversion support stays, it's not a problem of
 course. I don't have a complete picture of all the actual
 merging conflicts, but the idea is to provide the callback
 abstraction v2 to allow ext2 and reiser plus allow all the
 changes of this pathcset.
 
>>> Glad to hear that.
>>> 
>>> BTW, which reiserfs progs headers are you using?
>> 
>> Sorry I forgot to mention it, it's the latest git version, 
>> https://git.kernel.org/cgit/linux/kernel/git/jeffm/reiserfsprogs.git/
>>
>>
>> 
>> Jeff hasn't released v3.6.25 yet. We have the git version in SUSE
>> distros so it works for me here.
>> 
> Thanks, now it should be OK to continue the rebase.
> 
> But I'm a little concerned about the unstable headers; unlike ext2, whose
> headers are almost stable, reiserfs seems not to be.

This is entirely due to the fact that splitting out library
functionality for reiserfsprogs was only done to support
btrfs-convert.  Unless there's some pressing need to revise the API,
the headers are pretty much static at this point.

I should just go ahead and release the current snapshot as 3.6.25.
Today's a US holiday and I won't be able to get to it, but I'll do
that in the next few days.

-Jeff

> What about rebasing my patch onto your abstract patch (btrfs-progs: 
> convert: add context and operations struct to allow different file 
> systems) first and then adding back your reiserfs patch?
> 
> Your abstract patch is quite nice, although it needs some modification
> to work with the new convert. I hope to add the stable things first and
> don't want another reiserfs change to break the compile.
> 
> Thanks, Qu
> 


-- 
Jeff Mahoney
SUSE Labs


Re: [PATCH v2] btrfs-progs: fsck: Fix a false alert where extent record has wrong metadata flag

2015-11-26 Thread Christoph Anton Mitterer
Hey.

I can confirm that the new patch fixes the issue on both test
filesystems.

Thanks for working that out. I guess there's no longer a need to keep
those old filesystems now?!

Cheers,
Chris.

On Thu, 2015-11-26 at 15:27 +0100, David Sterba wrote:
> On Wed, Nov 25, 2015 at 02:19:06PM +0800, Qu Wenruo wrote:
> > In process_extent_item(), 'metadata' is given the initial value 0, but in
> > the non-skinny-metadata case a metadata extent can't be judged just from
> > the key type, and that case was forgotten.
> > 
> > This causes a lot of false alerts on non-skinny-metadata filesystems.
> > 
> > Fix it by setting the correct metadata value before calling
> > add_extent_rec().
> > 
> > Reported-by: Christoph Anton Mitterer 
> > Signed-off-by: Qu Wenruo 
> 
> Patch replaced, thanks. The test image is pushed as well.



Re: btrfs send reproducibly fails for a specific subvolume after sending 15 GiB, scrub reports no errors

2015-11-26 Thread Duncan
Hugo Mills posted on Tue, 24 Nov 2015 21:27:46 + as excerpted:

[In the context of btrfs send...]

>-p only sends the file metadata for the changes from the reference
> snapshot to the sent snapshot. -c sends all the file metadata, but will
> preserve the reflinks between the sent snapshot and the (one or more)
> reference snapshots. You can only use one -p (because there's only one
> difference you can compute at any one time), but you can use as many -c
> as you like (because you can share extents with any number of subvols).
> 
>In both cases, the reference snapshot(s) must exist on the
> receiving side.
> 
>In implementation terms, on the receiver, -p takes a (writable)
> snapshot of the reference subvol, and modifies it according to the
> stream data. -c makes a new empty subvol, and populates it from scratch,
> using the reflink ioctl to use data which is known to exist in the
> reference subvols.

Thanks, Hugo.  I had a vague idea that the above was the difference in 
general, but as CAM says, the manpage (and wiki) isn't particularly 
detailed on the differences, so I didn't know whether my vague idea was 
correct or not.  Your explanation makes perfect sense and clears things 
up dramatically. =:^)
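
For anyone curious what that reflink ioctl looks like from userspace, here
is a small sketch in C. It clones one whole file into another on the same
btrfs filesystem via BTRFS_IOC_CLONE, which is roughly the mechanism the
receiver leans on for -c; the error handling is minimal and it's only an
illustration of the idea, not a replacement for cp --reflink.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
	int src, dst;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]);
		return 1;
	}
	src = open(argv[1], O_RDONLY);
	dst = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}
	/* dst ends up referencing src's extents instead of copying the data */
	if (ioctl(dst, BTRFS_IOC_CLONE, src) < 0) {
		perror("BTRFS_IOC_CLONE");
		return 1;
	}
	return 0;
}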

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Using Btrfs on single drives

2015-11-26 Thread Duncan
Russell Coker posted on Wed, 25 Nov 2015 18:20:25 +1100 as excerpted:

> On Sun, 15 Nov 2015 03:01:57 PM Duncan wrote:
>> That looks to me like native drive limitations.
>> 
>> Due to the fact that a modern hard drive spins at the same speed no
>> matter where the read/write head is located, when it's reading/writing
>> to the first part of the drive -- the outside -- much more linear drive
>> distance will pass under the read/write heads in say a tenth of a
>> second than will be the case as the last part of the drive is filled --
>> the inside -- and throughput will be much higher at the first of the
>> drive.
> 
> http://www.coker.com.au/bonnie++/zcav/results.html
> 
> The above page has the results of my ZCAV benchmark (part of the
> Bonnie++ suite) which shows this.  You can safely run ZCAV in read mode
> on a device that's got a filesystem on it so it's not too late to test
> these things.

Thanks.  Those graphs are pretty clear.

Like you, I'd have thought there'd be far fewer zones (3-4) than it turns 
out there are (8ish).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [4.3-rc4] scrubbing aborts before finishing

2015-11-26 Thread Duncan
Martin Steigerwald posted on Wed, 25 Nov 2015 16:35:39 +0100 as excerpted:

> I'd file a bug report, in case anyone is interested. But if the
> interest is like in this mailing list post, I can spare myself the time
> of reporting via bugzilla.
> 
> So does anyone at all care about this issue?

FWIW I'm interested, but haven't as a user seen the same thing here... 
scrubs have worked fine here unless I forget to sudo, in which case they 
don't work at all as they can't get necessary privs.

And I'm stumped as to what else it might be.  I don't even have an idea 
where to start looking for clues.

But not being a dev and having not the foggiest what the problem may be, 
it's not like my interest is going to help much, which is why I didn't 
reply earlier.  Just posting now to say you're not alone in your 
interest.  Things like this that don't seem to have any logic bother me, 
tho, so I really would like to at least learn why it's happening.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Duncan
Mitchell Fossen posted on Wed, 25 Nov 2015 15:49:58 -0600 as excerpted:

> Also, is there a recommendation for relatime vs noatime mount options? I
> don't believe anything that runs on the server needs to use file access
> times, so if it can help with performance/disk usage I'm fine with
> setting it to noatime.

FWIW, I got tired enough of always setting noatime (for over a decade, 
since kernel 2.4 and my standardizing on the then-current reiserfs) that I 
finally found the spot in the kernel where the relatime default is set 
and patched it to be noatime by default.  My kernel scripts now apply that 
on top of my git kernel pulls.

For people doing snapshotting in particular, atime updates can be a big 
part of the differences between snapshots, so it's particularly important 
to set noatime if you're snapshotting.

If you're not doing snapshots, it's somewhat less important, but IIRC it 
was still somewhat more of a performance issue than with ext*, tho I don't 
remember the details; I'd guess it's to do with COWing the metadata 
triggering metadata fragmentation.

Bottom line, use noatime unless you have something that needs atime.  
It's not going to hurt for sure, and should improve performance at least 
somewhat even on ext*.
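
If it helps, a minimal sketch of what that amounts to at the mount(2) level
is below: it remounts an already-mounted filesystem with noatime. The device
and mount point are made up, a real tool would OR in whatever flags the
mount already has, and in practice most people just add noatime to the
options column in fstab instead.

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* roughly "mount -o remount,noatime /mnt/data" */
	if (mount("/dev/sdb1", "/mnt/data", "btrfs",
		  MS_REMOUNT | MS_NOATIME, NULL) != 0) {
		perror("mount");
		return 1;
	}
	return 0;
}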

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



[PATCH] btrfs-progs: docs: mkfs, implications of DUP on devices

2015-11-26 Thread David Sterba
We offer DUP but still depend on the hardware to do the right thing.

Signed-off-by: David Sterba 
---

To the wider audience: feel free to suggest improvements to the manual page text
if you think it's not clear, too technical, etc.

 Documentation/mkfs.btrfs.asciidoc | 32 
 utils.c   |  5 +++--
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc 
b/Documentation/mkfs.btrfs.asciidoc
index c9ba314c2220..0b145c7a01c3 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -50,7 +50,9 @@ mkfs.btrfs uses the entire device space for the filesystem.
 
 *-d|--data *::
 Specify the profile for the data block groups.  Valid values are 'raid0',
-'raid1', 'raid5', 'raid6', 'raid10' or 'single', (case does not matter).
+'raid1', 'raid5', 'raid6', 'raid10' or 'single' or dup (case does not matter).
++
+See 'DUP PROFILES ON A SINGLE DEVICE' for more.
 
 *-m|--metadata *::
 Specify the profile for the metadata block groups.
@@ -60,13 +62,12 @@ Valid values are 'raid0', 'raid1', 'raid5', 'raid6', 
'raid10', 'single' or
 A single device filesystem will default to 'DUP', unless a SSD is detected. 
Then
 it will default to 'single'. The detection is based on the value of
 `/sys/block/DEV/queue/rotational`, where 'DEV' is the short name of the device.
-This is because SSDs can remap the blocks internally to a single copy thus
-deduplicating them which negates the purpose of increased metadata redunancy
-and just wastes space. 
 +
 Note that the rotational status can be arbitrarily set by the underlying block
 device driver and may not reflect the true status (network block device, 
memory-backed
 SCSI devices etc). Use the options '--data/--metadata' to avoid confusion.
++
+See 'DUP PROFILES ON A SINGLE DEVICE' for more details.
 
 *-M|--mixed*::
 Normally the data and metadata block groups are isolated. The 'mixed' mode
@@ -265,6 +266,29 @@ PROFILES
 another one is added, but *mkfs.btrfs* will not let you create DUP on multiple
 devices.
 
+DUP PROFILES ON A SINGLE DEVICE
+---
+
+The mkfs utility will let the user create a filesystem with profiles that write
+the logical blocks to 2 physical locations. Whether there are really 2
+physical copies highly depends on the underlying device type.
+
+For example, a SSD drive can remap the blocks internally to a single copy thus
+deduplicating them. This negates the purpose of increased redunancy and just
+wastes space.
+
+The duplicated data/metadata may still be useful to statistically improve the
+chances on a device that might perform some internal optimizations. The actual
+details are not usually disclosed by vendors. As another example, the widely
+used USB flash or SD cards use a translation layer. The data lifetime may
+be affected by frequent plugging. The memory cells could get damaged, hopefully
+not destroying both copies of particular data.
+
+The traditional rotational hard drives usually fail at the sector level.
+
+In any case, a device that starts to misbehave and repairs from the DUP copy
+should be replaced! *DUP is not backup*.
+
 KNOWN ISSUES
 
 
diff --git a/utils.c b/utils.c
index c20966c19768..d5f60a420135 100644
--- a/utils.c
+++ b/utils.c
@@ -2504,8 +2504,9 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 
data_profile,
return 1;
}
 
-   warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd,
-  "DUP have no effect if your SSD have deduplication 
function");
+   /* warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP) && ssd, 
*/
+   warning_on(!mixed && (data_profile & BTRFS_BLOCK_GROUP_DUP),
+  "DUP may not actually lead to 2 copies on the device, see 
manual page");
 
return 0;
 }
-- 
2.6.2



Re: [PATCH v2 1/5] btrfs-progs: introduce framework to check kernel supported features

2015-11-26 Thread David Sterba
On Tue, Nov 24, 2015 at 03:21:19PM -0500, Austin S Hemmelgarn wrote:
> > I think you mean 2.6.37 here.
> > 67377734fd24c3 "Btrfs: add support for mixed data+metadata block groups"
> This brings up a rather important question:
> Should compat-X.Y mean features that were considered usable in that 
> version, or everything that version offered?  I understand wanting 
> consistency with the kernel versions, but we shouldn't be creating 
> filesystems that we know will break on the specified kernel even if it 
> is mountable on it.

IMO compat refers to the compatibility feature bits so it's whether the
filesystem is mountable on a given version. Usability can be subjective.
I assume the kernel versions in wide use match some of the long term
branches. If it's k.org, we can submit the fixes and distros update
their long term branches.

A table of "is the feature usable" would be interesting but I think it's
for wiki.


Re: btrfs: poor performance on deleting many large files

2015-11-26 Thread Christoph Anton Mitterer
On Thu, 2015-11-26 at 16:52 +, Duncan wrote:
> For people doing snapshotting in particular, atime updates can be a
> big 
> part of the differences between snapshots, so it's particularly
> important 
> to set noatime if you're snapshotting.
What exactly happens when that is left at relatime?

I'd guess that obviously every time the atime is updated there will be
some CoW, but only on metadata blocks, right?

Does this then lead to fragmentation problems in the metadata block
groups?

And how serious are the effects on the space that is eaten up... say I have
n snapshots and access all of their files... then I'd probably get n
times the metadata, right? Which would sound quite dramatic...

Or are just parts of the metadata copied with new atimes?


Thanks,
Chris.



vfs: move btrfs clone ioctls to common code

2015-11-26 Thread Christoph Hellwig
This patch set moves the existing btrfs clone ioctls, which other file
systems have started to implement, to common code, and allows the NFS
server to export this functionality to remote systems.

This work is based originally on my NFS CLONE prototype, which reused
code from Anna Schumaker's NFS COPY prototype, as well as various
updates from Peng Tao to this code.

The patches are also available as a git branch and on gitweb:

git://git.infradead.org/users/hch/pnfs.git clone-for-viro

http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro



[PATCH 3/5] vfs: pull btrfs clone API to vfs layer

2015-11-26 Thread Christoph Hellwig
The btrfs clone ioctls are now adopted by other file systems:
NFS since 4.3 and XFS a few kernel releases in the future, as well as the
previous (incorrect) usage by CIFS.  To avoid the growth of various
slightly incompatible implementations, add one to the core VFS
code.  Note that clones are different from file copies in various
ways:

 - they are atomic vs other writers
 - they support whole file clones
 - they support 64-bit length clones
 - they do not allow partial success (aka short writes)
 - clones are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback
them on top of the recent clone_file_range infrastructure.

Based on earlier work from Peng Tao.

Signed-off-by: Christoph Hellwig 
---
 fs/btrfs/ctree.h |   3 +-
 fs/btrfs/file.c  |   1 +
 fs/btrfs/ioctl.c |  49 ++
 fs/ioctl.c   |  29 +
 fs/nfs/nfs42proc.c   |   1 +
 fs/nfs/nfs4file.c| 107 ---
 fs/read_write.c  |  71 +++
 include/linux/fs.h   |   7 +++-
 include/uapi/linux/fs.h  |   9 
 include/uapi/linux/nfs.h |  11 -
 10 files changed, 140 insertions(+), 148 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd7d888..adc997f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4021,7 +4021,6 @@ void btrfs_get_block_group_info(struct list_head 
*groups_list,
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
   struct btrfs_ioctl_balance_args *bargs);
 
-
 /* file.c */
 int btrfs_auto_defrag_init(void);
 void btrfs_auto_defrag_exit(void);
@@ -4054,6 +4053,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t 
start, loff_t end);
 ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
  struct file *file_out, loff_t pos_out,
  size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+  struct file *file_out, loff_t pos_out, u64 len);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c0ee74..3b61b0a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2921,6 +2921,7 @@ const struct file_operations btrfs_file_operations = {
.compat_ioctl   = btrfs_ioctl,
 #endif
.copy_file_range = btrfs_copy_file_range,
+   .clone_file_range = btrfs_clone_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f92735..85b1cae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, 
loff_t pos_in,
return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-  u64 off, u64 olen, u64 destoff)
+int btrfs_clone_file_range(struct file *src_file, loff_t off,
+   struct file *dst_file, loff_t destoff, u64 len)
 {
-   struct fd src_file;
-   int ret;
-
-   /* the destination must be opened for writing */
-   if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-   return -EINVAL;
-
-   ret = mnt_want_write_file(file);
-   if (ret)
-   return ret;
-
-   src_file = fdget(srcfd);
-   if (!src_file.file) {
-   ret = -EBADF;
-   goto out_drop_write;
-   }
-
-   /* the src must be open for reading */
-   if (!(src_file.file->f_mode & FMODE_READ)) {
-   ret = -EINVAL;
-   goto out_fput;
-   }
-
-   ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
-
-out_fput:
-   fdput(src_file);
-out_drop_write:
-   mnt_drop_write_file(file);
-   return ret;
-}
-
-static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
-{
-   struct btrfs_ioctl_clone_range_args args;
-
-   if (copy_from_user(, argp, sizeof(args)))
-   return -EFAULT;
-   return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
-args.src_length, args.dest_offset);
+   return btrfs_clone_files(dst_file, src_file, off, len, destoff);
 }
 
 /*
@@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_dev_info(root, argp);
case BTRFS_IOC_BALANCE:
return btrfs_ioctl_balance(file, NULL);
-   case BTRFS_IOC_CLONE:
-   return btrfs_ioctl_clone(file, arg, 0, 0, 0);
-   case BTRFS_IOC_CLONE_RANGE:
-   return btrfs_ioctl_clone_range(file, argp);
case BTRFS_IOC_TRANS_START:
return btrfs_ioctl_trans_start(file);
case BTRFS_IOC_TRANS_END:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5d01d26..84c6e79 100644
--- 

[PATCH 4/5] nfsd: Pass filehandle to nfs4_preprocess_stateid_op()

2015-11-26 Thread Christoph Hellwig
From: Anna Schumaker 

This will be needed so COPY can look up the saved_fh in addition to the
current_fh.

Signed-off-by: Anna Schumaker 
---
 fs/nfsd/nfs4proc.c  | 16 +---
 fs/nfsd/nfs4state.c |  5 ++---
 fs/nfsd/state.h |  4 ++--
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a9f096c..3ba10a3 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -774,8 +774,9 @@ nfsd4_read(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
clear_bit(RQ_SPLICE_OK, >rq_flags);
 
/* check stateid */
-   status = nfs4_preprocess_stateid_op(rqstp, cstate, >rd_stateid,
-   RD_STATE, >rd_filp, >rd_tmp_file);
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh,
+   >rd_stateid, RD_STATE,
+   >rd_filp, >rd_tmp_file);
if (status) {
dprintk("NFSD: nfsd4_read: couldn't process stateid!\n");
goto out;
@@ -921,7 +922,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
 
if (setattr->sa_iattr.ia_valid & ATTR_SIZE) {
status = nfs4_preprocess_stateid_op(rqstp, cstate,
-   >sa_stateid, WR_STATE, NULL, NULL);
+   >current_fh, >sa_stateid,
+   WR_STATE, NULL, NULL);
if (status) {
dprintk("NFSD: nfsd4_setattr: couldn't process 
stateid!\n");
return status;
@@ -985,8 +987,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
if (write->wr_offset >= OFFSET_MAX)
return nfserr_inval;
 
-   status = nfs4_preprocess_stateid_op(rqstp, cstate, stateid, WR_STATE,
-   , NULL);
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh,
+   stateid, WR_STATE, , NULL);
if (status) {
dprintk("NFSD: nfsd4_write: couldn't process stateid!\n");
return status;
@@ -1016,7 +1018,7 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
__be32 status = nfserr_notsupp;
struct file *file;
 
-   status = nfs4_preprocess_stateid_op(rqstp, cstate,
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh,
>falloc_stateid,
WR_STATE, , NULL);
if (status != nfs_ok) {
@@ -1055,7 +1057,7 @@ nfsd4_seek(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
__be32 status;
struct file *file;
 
-   status = nfs4_preprocess_stateid_op(rqstp, cstate,
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh,
>seek_stateid,
RD_STATE, , NULL);
if (status) {
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6b800b5..df5dba6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4797,10 +4797,9 @@ nfs4_check_file(struct svc_rqst *rqstp, struct svc_fh 
*fhp, struct nfs4_stid *s,
  */
 __be32
 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
-   struct nfsd4_compound_state *cstate, stateid_t *stateid,
-   int flags, struct file **filpp, bool *tmp_file)
+   struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+   stateid_t *stateid, int flags, struct file **filpp, bool 
*tmp_file)
 {
-   struct svc_fh *fhp = >current_fh;
struct inode *ino = d_inode(fhp->fh_dentry);
struct net *net = SVC_NET(rqstp);
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 77fdf4d..99432b7 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -578,8 +578,8 @@ struct nfsd4_compound_state;
 struct nfsd_net;
 
 extern __be32 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
-   struct nfsd4_compound_state *cstate, stateid_t *stateid,
-   int flags, struct file **filp, bool *tmp_file);
+   struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+   stateid_t *stateid, int flags, struct file **filp, bool 
*tmp_file);
 __be32 nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
 stateid_t *stateid, unsigned char typemask,
 struct nfs4_stid **s, struct nfsd_net *nn);
-- 
1.9.1



[PATCH 1/5] cifs: implement clone_file_range operation

2015-11-26 Thread Christoph Hellwig
And drop the fake support for the btrfs CLONE ioctl - SMB2 copies are
chunked and do not actually implement clone semantics!

Heavily based on a previous patch from Peng Tao.

Signed-off-by: Christoph Hellwig 
---
 fs/cifs/cifsfs.c |  25 ++
 fs/cifs/cifsfs.h |   4 ++-
 fs/cifs/ioctl.c  | 103 +++
 3 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index cbc0f4b..ad7117a 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -914,6 +914,23 @@ const struct inode_operations cifs_symlink_inode_ops = {
 #endif
 };
 
+ssize_t cifs_file_copy_range(struct file *file_in, loff_t pos_in,
+struct file *file_out, loff_t pos_out,
+size_t len, unsigned int flags)
+{
+   unsigned int xid;
+   int rc;
+
+   if (flags)
+   return -EOPNOTSUPP;
+
+   xid = get_xid();
+   rc = cifs_file_clone_range(xid, file_in, file_out, pos_in,
+  len, pos_out, true);
+   free_xid(xid);
+   return rc < 0 ? rc : len;
+}
+
 const struct file_operations cifs_file_ops = {
.read_iter = cifs_loose_read_iter,
.write_iter = cifs_file_write_iter,
@@ -926,6 +943,7 @@ const struct file_operations cifs_file_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
 };
@@ -942,6 +960,8 @@ const struct file_operations cifs_file_strict_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
+   .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
 };
@@ -958,6 +978,7 @@ const struct file_operations cifs_file_direct_ops = {
.mmap = cifs_file_mmap,
.splice_read = generic_file_splice_read,
.unlocked_ioctl  = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.llseek = cifs_llseek,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
@@ -974,6 +995,7 @@ const struct file_operations cifs_file_nobrl_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
 };
@@ -989,6 +1011,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
 };
@@ -1004,6 +1027,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = 
{
.mmap = cifs_file_mmap,
.splice_read = generic_file_splice_read,
.unlocked_ioctl  = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.llseek = cifs_llseek,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
@@ -1014,6 +1038,7 @@ const struct file_operations cifs_dir_ops = {
.release = cifs_closedir,
.read= generic_read_dir,
.unlocked_ioctl  = cifs_ioctl,
+   .copy_file_range = cifs_file_copy_range,
.llseek = generic_file_llseek,
 };
 
diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index c3cc160..797439b 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -131,7 +131,9 @@ extern int  cifs_setxattr(struct dentry *, const char *, 
const void *,
 extern ssize_t cifs_getxattr(struct dentry *, const char *, void *, size_t);
 extern ssize_t cifs_listxattr(struct dentry *, char *, size_t);
 extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long 
arg);
-
+extern int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+struct file *dst_file, u64 off, u64 len,
+u64 destoff, bool dup_extents);
 #ifdef CONFIG_CIFS_NFSD_EXPORT
 extern const struct export_operations cifs_export_ops;
 #endif /* CONFIG_CIFS_NFSD_EXPORT */
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 35cf990..4f92f5c 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -34,73 +34,43 @@
 #include "cifs_ioctl.h"
 #include 
 
-static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
-   unsigned long srcfd, u64 off, u64 len, u64 destoff,
-   bool dup_extents)
+int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+ struct file *dst_file, u64 off, u64 len,
+ u64 destoff, bool dup_extents)
 {
-   int rc;
-   struct cifsFileInfo *smb_file_target = 

[PATCH 2/5] locks: new locks_mandatory_area calling convention

2015-11-26 Thread Christoph Hellwig
Pass a loff_t end for the last byte instead of the 32-bit count
parameter to allow full file clones even on 32-bit architectures.
While we're at it also drop the pointless inode argument and simplify
the read/write selection.

Signed-off-by: Christoph Hellwig 
---
 fs/locks.c | 22 +-
 fs/read_write.c|  5 ++---
 include/linux/fs.h | 28 +---
 3 files changed, 24 insertions(+), 31 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 0d2b326..d503669 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
 
 /**
  * locks_mandatory_area - Check for a conflicting lock
- * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
- * for shared
- * @inode:  the file to check
  * @filp:   how the file was opened (if it was)
- * @offset: start of area to check
- * @count:  length of area to check
+ * @start: first byte in the file to check
+ * @end:   lastbyte in the file to check
+ * @write: %true if checking for write access
  *
  * Searches the inode's list of locks to find any POSIX locks which conflict.
- * This function is called from rw_verify_area() and
- * locks_verify_truncate().
  */
-int locks_mandatory_area(int read_write, struct inode *inode,
-struct file *filp, loff_t offset,
-size_t count)
+int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
+   bool write)
 {
+   struct inode *inode = file_inode(filp);
struct file_lock fl;
int error;
bool sleep = false;
@@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode 
*inode,
fl.fl_flags = FL_POSIX | FL_ACCESS;
if (filp && !(filp->f_flags & O_NONBLOCK))
sleep = true;
-   fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
-   fl.fl_start = offset;
-   fl.fl_end = offset + count - 1;
+   fl.fl_type = write ? F_WRLCK : F_RDLCK;
+   fl.fl_start = start;
+   fl.fl_end = end;
 
for (;;) {
if (filp) {
diff --git a/fs/read_write.c b/fs/read_write.c
index c81ef39..48157dd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const 
loff_t *ppos, size_t
}
 
if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
-   retval = locks_mandatory_area(
-   read_write == READ ? FLOCK_VERIFY_READ : 
FLOCK_VERIFY_WRITE,
-   inode, file, pos, count);
+   retval = locks_mandatory_area(file, pos, pos + count - 1,
+   read_write == READ ? false : true);
if (retval < 0)
return retval;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 870a76e..e640f791 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
 
 #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
 
-#define FLOCK_VERIFY_READ  1
-#define FLOCK_VERIFY_WRITE 2
-
 #ifdef CONFIG_FILE_LOCKING
 extern int locks_mandatory_locked(struct file *);
-extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, 
size_t);
+extern int locks_mandatory_area(struct file *, loff_t, loff_t, bool);
 
 /*
  * Candidates for mandatory locking have the setgid bit set
@@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode 
*inode,
struct file *filp,
loff_t size)
 {
-   if (inode->i_flctx && mandatory_lock(inode))
-   return locks_mandatory_area(
-   FLOCK_VERIFY_WRITE, inode, filp,
-   size < inode->i_size ? size : inode->i_size,
-   (size < inode->i_size ? inode->i_size - size
-: size - inode->i_size)
-   );
-   return 0;
+   if (!inode->i_flctx || !mandatory_lock(inode))
+   return 0;
+
+   if (size < inode->i_size) {
+   return locks_mandatory_area(filp, size, inode->i_size - 1,
+   true);
+   } else {
+   return locks_mandatory_area(filp, inode->i_size, size - 1,
+   true);
+   }
 }
 
 static inline int break_lease(struct inode *inode, unsigned int mode)
@@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file 
*file)
return 0;
 }
 
-static inline int locks_mandatory_area(int rw, struct inode *inode,
-  struct file *filp, loff_t offset,
-  size_t count)
+static inline int locks_mandatory_area(struct file *filp, loff_t start,
+   loff_t end, bool write)
 {
return 0;
 }
-- 
1.9.1


[PATCH 5/5] nfsd: implement the NFSv4.2 CLONE operation

2015-11-26 Thread Christoph Hellwig
This is basically a remote version of the btrfs CLONE operation,
so the implementation is fairly trivial.  Made even more trivial
by stealing the XDR code and general framework from Anna Schumaker's
COPY prototype.

Signed-off-by: Christoph Hellwig 
---
 fs/nfsd/nfs4proc.c   | 47 +++
 fs/nfsd/nfs4xdr.c| 21 +
 fs/nfsd/vfs.c|  8 
 fs/nfsd/vfs.h|  2 ++
 fs/nfsd/xdr4.h   | 10 ++
 include/linux/nfs4.h |  4 ++--
 6 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 3ba10a3..819ad81 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1012,6 +1012,47 @@ nfsd4_write(struct svc_rqst *rqstp, struct 
nfsd4_compound_state *cstate,
 }
 
 static __be32
+nfsd4_clone(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+   struct nfsd4_clone *clone)
+{
+   struct file *src, *dst;
+   __be32 status;
+
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >save_fh,
+   >cl_src_stateid, RD_STATE,
+   , NULL);
+   if (status) {
+   dprintk("NFSD: %s: couldn't process src stateid!\n", __func__);
+   goto out;
+   }
+
+   status = nfs4_preprocess_stateid_op(rqstp, cstate, >current_fh,
+   >cl_dst_stateid, WR_STATE,
+   , NULL);
+   if (status) {
+   dprintk("NFSD: %s: couldn't process dst stateid!\n", __func__);
+   goto out_put_src;
+   }
+
+   /* fix up for NFS-specific error code */
+   if (!S_ISREG(file_inode(src)->i_mode) ||
+   !S_ISREG(file_inode(dst)->i_mode)) {
+   status = nfserr_wrong_type;
+   goto out_put_dst;
+   }
+
+   status = nfsd4_clone_file_range(src, clone->cl_src_pos,
+   dst, clone->cl_dst_pos, clone->cl_count);
+
+out_put_dst:
+   fput(dst);
+out_put_src:
+   fput(src);
+out:
+   return status;
+}
+
+static __be32
 nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
struct nfsd4_fallocate *fallocate, int flags)
 {
@@ -2281,6 +2322,12 @@ static struct nfsd4_operation nfsd4_ops[] = {
.op_name = "OP_DEALLOCATE",
.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
},
+   [OP_CLONE] = {
+   .op_func = (nfsd4op_func)nfsd4_clone,
+   .op_flags = OP_MODIFIES_SOMETHING | OP_CACHEME,
+   .op_name = "OP_CLONE",
+   .op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
+   },
[OP_SEEK] = {
.op_func = (nfsd4op_func)nfsd4_seek,
.op_name = "OP_SEEK",
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 51c9e9c..924416f 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1675,6 +1675,25 @@ nfsd4_decode_fallocate(struct nfsd4_compoundargs *argp,
 }
 
 static __be32
+nfsd4_decode_clone(struct nfsd4_compoundargs *argp, struct nfsd4_clone *clone)
+{
+   DECODE_HEAD;
+
+   status = nfsd4_decode_stateid(argp, >cl_src_stateid);
+   if (status)
+   return status;
+   status = nfsd4_decode_stateid(argp, >cl_dst_stateid);
+   if (status)
+   return status;
+
+   READ_BUF(8 + 8 + 8);
+   p = xdr_decode_hyper(p, >cl_src_pos);
+   p = xdr_decode_hyper(p, >cl_dst_pos);
+   p = xdr_decode_hyper(p, >cl_count);
+   DECODE_TAIL;
+}
+
+static __be32
 nfsd4_decode_seek(struct nfsd4_compoundargs *argp, struct nfsd4_seek *seek)
 {
DECODE_HEAD;
@@ -1785,6 +1804,7 @@ static nfsd4_dec nfsd4_dec_ops[] = {
[OP_READ_PLUS]  = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_SEEK]   = (nfsd4_dec)nfsd4_decode_seek,
[OP_WRITE_SAME] = (nfsd4_dec)nfsd4_decode_notsupp,
+   [OP_CLONE]  = (nfsd4_dec)nfsd4_decode_clone,
 };
 
 static inline bool
@@ -4292,6 +4312,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
[OP_READ_PLUS]  = (nfsd4_enc)nfsd4_encode_noop,
[OP_SEEK]   = (nfsd4_enc)nfsd4_encode_seek,
[OP_WRITE_SAME] = (nfsd4_enc)nfsd4_encode_noop,
+   [OP_CLONE]  = (nfsd4_enc)nfsd4_encode_noop,
 };
 
 /*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 994d66f..5411bf0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -36,6 +36,7 @@
 #endif /* CONFIG_NFSD_V3 */
 
 #ifdef CONFIG_NFSD_V4
+#include "../internal.h"
 #include "acl.h"
 #include "idmap.h"
 #endif /* CONFIG_NFSD_V4 */
@@ -498,6 +499,13 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct 
svc_fh *fhp,
 }
 #endif
 
+__be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
+   u64 dst_pos, u64 count)
+{
+   return nfserrno(vfs_clone_file_range(src, src_pos, dst, 

Re: How to detect / notify when a raid drive fails?

2015-11-26 Thread Ian Kelling
On Thu, Nov 26, 2015, at 09:30 PM, Duncan wrote:
> What generally happens now, however, is that the btrfs will note failures 
> attempting to write the device and start queuing up writes.  If the 
> device reappears fast enough, btrfs will flush the queue and be back to 
> normal.  Otherwise, you pretty much need to reboot and mount degraded, 
> then add a device and rebalance. (btrfs device delete missing broke some 
> versions ago and just got fixed by the latest btrfs-progs-4.3.1, IIRC.)
> 
> As for alerts, you'd see the pile of accumulating write errors in the 
> kernel log.  Presumably you can write up a script that can alert on that 
> and mail you the log or whatever, but I don't believe there's anything 
> official or close to it, yet.

Great info, thanks. Just trying to write a file, sync and read it
sounds like the easiest test for now, especially since I don't
know what the write fail log entries will look like. And setting
up SMART notifications.
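
For the record, a rough sketch of such a probe is below. It writes a block,
fsync()s it, then reads it back with O_DIRECT so the data really comes from
the device rather than the page cache; the path and size are made up, and
it's only an illustration of the idea, not a polished monitoring tool.

/* Hypothetical write/sync/read-back probe; exits non-zero if the
 * round trip fails, so a cron job or script can alert on it. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PROBE_PATH "/mnt/btrfs/.write-probe"	/* made-up path */
#define PROBE_SIZE 4096

int main(void)
{
	char *out, *in;
	int fd;

	if (posix_memalign((void **)&out, 4096, PROBE_SIZE) ||
	    posix_memalign((void **)&in, 4096, PROBE_SIZE))
		return 1;
	memset(out, 0xab, PROBE_SIZE);

	/* write the block and force it to the device */
	fd = open(PROBE_PATH, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0 || write(fd, out, PROBE_SIZE) != PROBE_SIZE || fsync(fd)) {
		perror("write probe");
		return 1;
	}
	close(fd);

	/* read it back, bypassing the page cache, and compare */
	fd = open(PROBE_PATH, O_RDONLY | O_DIRECT);
	if (fd < 0 || read(fd, in, PROBE_SIZE) != PROBE_SIZE ||
	    memcmp(out, in, PROBE_SIZE) != 0) {
		fprintf(stderr, "read-back failed, check the kernel log\n");
		return 1;
	}
	close(fd);
	return 0;
}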

- Ian