Re: BTRFS equivalent for tune2fs?

2014-12-02 Thread Satoru Takeuchi

Hi,

(2014/12/02 16:39), Brendan Hide wrote:

On 2014/12/02 09:31, Brendan Hide wrote:

On 2014/12/02 07:54, MegaBrutal wrote:

Hi all,

I know there is a btrfstune, but it doesn't provide all the
functionality I'm thinking of.

For ext2/3/4 file systems I can get a bunch of useful data with
tune2fs -l. How can I retrieve the same type of information about a
BTRFS file system? (E.g., last mount time, last checked time, blocks
reserved for superuser*, etc.)

* Anyway, does BTRFS even have an option to reserve X% for the superuser?

Btrfs does not yet have this option. I'm certain that specific feature is 
in mind for the future, however.

As regards other equivalents, the same/similar answer applies. There simply aren't a lot 
of tuneables available right now.


Almost forgot about this: btrfs property (get|set)

Again, there are a lot of features still to be added.
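
For a quick illustration of the property interface (mountpoint and
subvolume path are hypothetical):

  # list the properties that apply to an object
  btrfs property list /mnt
  # read and flip the read-only flag of a subvolume
  btrfs property get /mnt/subvol ro
  btrfs property set /mnt/subvol ro true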



Just FYI, there are some descriptions of this topic:

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Set_mount_options_permanently
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Extend_btrfstune_to_be_able_to_tune_more_parameters
https://btrfs.wiki.kernel.org/index.php/Project_ideas#Filesystem_object_properties

Thanks,
Satoru



Re: BTRFS equivalent for tune2fs?

2014-12-02 Thread Duncan
MegaBrutal posted on Tue, 02 Dec 2014 06:54:47 +0100 as excerpted:

 Hi all,
 
 I know there is a btrfstune, but it doesn't provide all the
 functionality I'm thinking of.
 
 For ext2/3/4 file systems I can get a bunch of useful data with tune2fs
 -l. How can I retrieve the same type of information about a BTRFS file
 system? (E.g., last mount time, last checked time, blocks reserved for
 superuser*, etc.)
 
 * Anyway, does BTRFS even have an option to reserve X% for the
 superuser?

btrfs-show-super, btrfs filesystem show, and btrfs filesystem df show 
various btrfs specifics of the filesystem.  Last check time doesn't 
really apply, as the kernel automatically does a lot of checks 
dynamically at mount, and btrfs check isn't designed to be run routinely, 
only to repair a broken filesystem when mounting with the recovery 
option, etc., fails.
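
A sketch of the corresponding invocations (device and mountpoint are
hypothetical):

  btrfs-show-super /dev/sda1   # superblock fields: fsid, generation, flags...
  btrfs filesystem show /mnt   # member devices and per-device allocation
  btrfs filesystem df /mnt     # data/metadata/system usage, incl. GlobalReserve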

As for reserving a percentage for superuser, btrfs doesn't do that 
directly.  There is the GlobalReserve space (as shown by btrfs fi df) for 
use by the filesystem itself.  Other than that, btrfs quotas could I 
think be (ab)used to reserve superuser space, but they're definitely 
optional, and I always recommend not using them if you can avoid it due 
to the additional complexity/overhead/bugs they add.
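
A rough sketch of that (ab)use, with a hypothetical subvolume at /mnt/data
on a 100G filesystem:

  btrfs quota enable /mnt
  # cap the subvolume below the filesystem size, holding some space back
  btrfs qgroup limit 90G /mnt/data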

There's also btrfs property get/set/list, which covers some of what 
tune2fs would do plus a lot more btrfs-specific settings, and not just on 
the filesystem but on devices, subvolumes and individual files too.  But 
while the property infrastructure and some basics are there, I think 
there's more planned that has yet to be implemented.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread MegaBrutal
2014-12-02 8:50 GMT+01:00 Goffredo Baroncelli kreij...@inwind.it:
 On 12/02/2014 01:15 AM, MegaBrutal wrote:
 2014-12-02 0:24 GMT+01:00 Robert White rwh...@pobox.com:
 On 12/01/2014 02:10 PM, MegaBrutal wrote:

 Since having duplicate UUIDs on devices is not a problem for me (I can
 tell the devices apart by their LVM names), the discussion is of little
 relevance to my use case. Of course it's interesting and I like to read
 along, but it is not about the actual problem at hand.


 Which is why you use the device= mount option, which would take LVM names
 and which was repeatedly discussed as solving this very problem.

 Once you decide to duplicate the UUIDs with LVM snapshots you take up the
 burden of disambiguating your storage.

 Which is part of why re-reading was suggested as this was covered in some
 depth and _is_ _exactly_ about the problem at hand.

 Nope.

 root@reproduce-1391429:~# cat /proc/cmdline
 BOOT_IMAGE=/vmlinuz-3.18.0-031800rc5-generic
 root=/dev/mapper/vg-rootlv ro
 rootflags=device=/dev/mapper/vg-rootlv,subvol=@

 Observe, device= mount option is added.

 The device= option is needed only in a btrfs multi-volume scenario.
 If you have only one disk, it is not needed.


I know. I only did this as a demonstration for Robert. He insisted it
would certainly solve the problem. Well, it doesn't.



 root@reproduce-1391429:~# ./reproduce-1391429.sh
 #!/bin/sh -v
 lvs
   LV VG   Attr  LSize   Pool Origin Data%  Move Log Copy%  Convert
   rootlv vg   -wi-ao---   1.00g
   swap0  vg   -wi-ao--- 256.00m

 grub-probe --target=device /
 /dev/mapper/vg-rootlv

 grep " / " /proc/mounts
 rootfs / rootfs rw 0 0
 /dev/dm-1 / btrfs rw,relatime,space_cache 0 0

 lvcreate --snapshot --size=128M --name z vg/rootlv
   Logical volume z created

 lvs
   LV VG   Attr  LSize   Pool Origin Data%  Move Log Copy%  Convert
   rootlv vg   owi-aos--   1.00g
   swap0  vg   -wi-ao--- 256.00m
   z  vg   swi-a-s-- 128.00m  rootlv   0.11

 ls -l /dev/vg/
 total 0
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 rootlv -> ../dm-1
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 swap0 -> ../dm-0
 lrwxrwxrwx 1 root root 7 Dec  2 00:12 z -> ../dm-2

 grub-probe --target=device /
 /dev/mapper/vg-z

 grep " / " /proc/mounts
 rootfs / rootfs rw 0 0
 /dev/dm-2 / btrfs rw,relatime,space_cache 0 0

 What does /proc/self/mountinfo contain?

Before creating snapshot:

15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
17 20 0:5 / /dev rw,relatime - devtmpfs udev
rw,size=241692k,nr_inodes=60423,mode=755
18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
rw,gid=5,mode=620,ptmxmode=000
19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
rw,size=50084k,mode=755
20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-1 rw,space_cache
<- THIS!
21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
rw,size=5120k
26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
rw,size=102400k,mode=755
28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw


After creating snapshot:

15 20 0:15 / /sys rw,nosuid,nodev,noexec,relatime - sysfs sysfs rw
16 20 0:3 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
17 20 0:5 / /dev rw,relatime - devtmpfs udev
rw,size=241692k,nr_inodes=60423,mode=755
18 17 0:12 / /dev/pts rw,nosuid,noexec,relatime - devpts devpts
rw,gid=5,mode=620,ptmxmode=000
19 20 0:16 / /run rw,nosuid,noexec,relatime - tmpfs tmpfs
rw,size=50084k,mode=755
20 0 0:17 /@ / rw,relatime - btrfs /dev/dm-2 rw,space_cache
<- WTF?!
21 15 0:20 / /sys/fs/cgroup rw,relatime - tmpfs none rw,size=4k,mode=755
22 15 0:21 / /sys/fs/fuse/connections rw,relatime - fusectl none rw
23 15 0:6 / /sys/kernel/debug rw,relatime - debugfs none rw
24 15 0:10 / /sys/kernel/security rw,relatime - securityfs none rw
25 19 0:22 / /run/lock rw,nosuid,nodev,noexec,relatime - tmpfs none
rw,size=5120k
26 19 0:23 / /run/shm rw,nosuid,nodev,relatime - tmpfs none rw
27 19 0:24 / /run/user rw,nosuid,nodev,noexec,relatime - tmpfs none
rw,size=102400k,mode=755
28 15 0:25 / /sys/fs/pstore rw,relatime - pstore none rw
29 20 253:1 / /boot rw,relatime - ext2 /dev/vda1 rw


So it's consistent with what /proc/mounts reports.



 And a more important question: is it only the value
 returned by /proc/mounts that is wrong, or is the filesystem
 content also affected?


I quote my bug report on this:

The information reported in /proc/mounts is certainly bogus, since the
origin device is still the one being written; the kernel does not actually
mix up the devices for write operations, and as such, the phenomenon does
not cause 

Re: Moving an entire subvol?

2014-12-02 Thread Hugo Mills
On Tue, Dec 02, 2014 at 08:51:40AM +0530, Shriramana Sharma wrote:
 On Mon, Dec 1, 2014 at 6:24 AM, Chris Murphy li...@colorremedies.com wrote:
  But isn't it just possible to move i.e. reparent a
  subvol so I can move these two under another subvol and have that as
  default?
 
  You can move subvolumes.
 
 OK so I just found out that just mv test1/foo test2/ where test1,
 test2 and foo are all subvolumes is sufficient to reparent foo to
 test2, if what btr sub list shows as top level is indeed the parent
 subvolume.
 
 Is that correct: what btr sub list shows as top level is indeed the
 parent subvolume?

   No, it's the top-level subvolume. (See my earlier mail about
nomenclature). "Parent subvolume" has a number of meanings, none of
which should be the subvolume with subvolid 5.

  My suggestion is subvolumes containing
  binaries shouldn't be located within another subvolume that ends up
  being mounted, that way old binaries with possible vulnerabilities
  aren't exposed in the normal search path.
 
 I'm not sure what you mean. Are you saying that for example /usr/bin should 
 be:
 
 1) a separate subvolume than / or /usr,
 2) not a child subvolume of / or /usr?
 
  openSUSE uses subvol id 5 for installing the OS to, and some
  directories are made subvolumes, such as home, var, and maybe usr.
  Therefore when subvolid 5 is snapshotted, those are exempt, and have to
  be individually snapshotted.
 
 Yes I also noticed that openSUSE creates such separate subvols, but is
 there any particular benefit to making it so?

   In the sense of allowing independent snapshotting, yes. I might
want to back up / (with usr, var, and so forth) only when I do a
system upgrade, but /home every night. Making /home a separate subvol
gives me the ability to snapshot those two areas independently.
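
   A sketch of that split (paths hypothetical):

  # every night
  btrfs subvolume snapshot -r /home /snapshots/home-$(date +%F)
  # only around a system upgrade
  btrfs subvolume snapshot -r / /snapshots/root-pre-upgrade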

  Fedora uses subvolumes root and home by default, and fstab uses
  subvol=root and subvol=home to mount them at / and /home respectively.
 
 This seems similar to Ubuntu's @ and @home setup.
 
 Is there any advantage to either? That is, one model installs root to
 the topmost subvol and makes usr, home etc nested subvols, whereas
 another makes root a nested subvol under the topmost just like usr
 home etc, and then mounts it to /...
 
 In general it seems people (or at least distros) prefer avoiding
 nesting subvolumes. Is there any particular reason for this? Esp in
 regard to /usr etc it would seem that if they are nested within the
 subvol for /, then just mounting that subvol would automatically mount
 all nested subvolumes, right? So the extra effort needed to mount the
 nested subvols would not be necessary, no?

   Nested subvols tend to get messy in practice. It's harder to
replace a higher level one, because you've got to move the lower
level ones around. It's also much harder to make a send/receive
backup of the subvols in their original relationships, because of the
read-only requirement.
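
   As a sketch, the read-only snapshot step that send/receive requires
(paths hypothetical):

  btrfs subvolume snapshot -r /mnt/@home /mnt/@home-ro
  btrfs send /mnt/@home-ro | btrfs receive /backup/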

   Whilst the theory came first, several years of practice have shown
both that nesting subvolumes is generally more awkward to manage, and
that putting files in the top-level subvol can't do what most people
want to do with it. Hence the recommended subvol management layout at
[1].

   Hugo.

[1] https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Subvolumes
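
As a sketch of that layout in fstab (UUID and subvolume names hypothetical):

  UUID=xxxx  /      btrfs  subvol=@,defaults      0  0
  UUID=xxxx  /home  btrfs  subvol=@home,defaults  0  0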

-- 
Hugo Mills | We teach people management skills by examining
hugo@... carfax.org.uk | characters in Shakespeare. You could look at
http://carfax.org.uk/  | Claudius's crisis management techniques, for
PGP: 65E74AC0  | example.   Richard Smith-Jones, Slings and Arrows




Re: Moving an entire subvol?

2014-12-02 Thread Duncan
Shriramana Sharma posted on Tue, 02 Dec 2014 08:51:40 +0530 as excerpted:

 On Mon, Dec 1, 2014 at 6:24 AM, Chris Murphy li...@colorremedies.com
 wrote:
 But isn't it just possible to move i.e. reparent a subvol so I can
 move these two under another subvol and have that as default?

 You can move subvolumes.
 
 OK so I just found out that just mv test1/foo test2/ where test1,
 test2 and foo are all subvolumes is sufficient to reparent foo to test2,
 if what btr sub list shows as top level is indeed the parent
 subvolume.
 
 Is that correct: what btr sub list shows as top level is indeed the
 parent subvolume?

[Noting that my use-case doesn't involve subvolumes so while I've played 
with them a bit my direct knowledge is limited...]

It should be correct, yes.

Subvolumes are in many ways super-directories, so it's little surprise 
simple directory manipulation such as moves would do what you might 
expect.  They just happen to be directly mountable too, and to have 
various btrfs-specific effects such as snapshots stopping at subvolume 
boundaries, usage for btrfs send/receive, etc.
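
For instance (device and names hypothetical), after such a move the
subvolume stays directly mountable:

  mount -o subvol=test2/foo /dev/sdb1 /mnt/foo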

 My suggestion is subvolumes containing binaries shouldn't be located
 within another subvolume that ends up being mounted, that way old
 binaries with possible vulnerabilities aren't exposed in the normal
 search path.
 
 I'm not sure what you mean. Are you saying that for example /usr/bin
 should be:
 
 1) a separate subvolume than / or /usr,
 2) not a child subvolume of / or /usr?

What I believe he's referencing is the potential security issue if for 
example you have older snapshots of /usr (which would include /usr/bin 
and /usr/lib(64)) accessible under normal operating conditions.  These 
snapshots would contain older versions of binaries (and libraries) that 
have been security-updated on the main system, but the snapshots would of 
course contain the still vulnerable versions.  A user trying to do a root-
escalation, for instance, could then access and run one of these old and 
vulnerable versions by specifying the full path instead of just the name, 
thus getting access to a known root-escalation vuln long since patched in 
the main path but still vulnerable in the snapshot path.

If for instance the master id=5 subvolume is still the default and 
routinely mounted, it will have all snapshots appearing as directories 
somewhere beneath its mountpoint in the tree.  If those snapshots contain 
bin or lib dirs, the above security scenario is a real possibility, since 
they'll be accessible in the tree.

So making something other than the master id=5 subvolume the default, 
mounting id=5 only when doing subvolume maintenance not routinely, and 
making such bin/lib-containing snapshots direct children of id=5 instead 
of children of the / subvolume or the like, will keep the snapshots 
containing the possibly vulnerable bins/libs out of normal accessibility 
as they'll only be visible in the tree when id=5 is mounted for snapshot 
maintenance, etc.
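
A sketch of that arrangement (IDs, device and paths hypothetical):

  btrfs subvolume list /mnt             # find the id of the subvol to boot from
  btrfs subvolume set-default 257 /mnt  # make it the default instead of id=5
  # mount the top level only when doing snapshot maintenance
  mount -o subvolid=5 /dev/sda2 /mnt/btrfs-admin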

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-02 Thread Christoph Hellwig
On Mon, Dec 01, 2014 at 10:04:50AM -0500, Theodore Ts'o wrote:
   - convert ext3/4 to use ->update_time instead of the ->dirty_time
 callout so it gets exact notifications (preferably the few
 remaining filesystems as well, although that shouldn't really be a
 blocker)
 
 We could do that, although ext3/4's ->update_time() would be exactly
 the same as the generic update_time() function, so there would be code
 duplication.  If the goal is to get rid of the magic in
 ->dirty_inode() being used to work around how the VFS makes changes
 to fields that end up in the on-disk inode, we would need to audit a
 lot of extra code paths; at the very least, in how the generic quota
 code handles updates to i_size and i_blocks (for example).
 
 And BTW, we don't actually have a dirty_time() function any more in
 the current patch series.  update_time() is currently looking like
 this:

Sorry, I actually meant ->dirty_inode, which is where ext4 currently
hooks in for time updates.  ->update_time was introduced to

 a) more specifically catch the kind of update
 b) allow the filesystem to take locks or start a transaction
before the inode fields are updated, to provide proper atomicity.

It seems like the quota code has the same problem, but given that
neither XFS nor btrfs use it, it seems like no one cared enough to sort
it out properly.

 static int update_time(struct inode *inode, struct timespec *time, int flags)
 {
        if (inode->i_op->update_time)
                return inode->i_op->update_time(inode, time, flags);
 
        if (flags & S_ATIME)
                inode->i_atime = *time;
        if (flags & S_VERSION)
                inode_inc_iversion(inode);
        if (flags & S_CTIME)
                inode->i_ctime = *time;
        if (flags & S_MTIME)
                inode->i_mtime = *time;
 
        if ((inode->i_sb->s_flags & MS_LAZYTIME) && !(flags & S_VERSION) &&
            !(inode->i_state & I_DIRTY))
                __mark_inode_dirty(inode, I_DIRTY_TIME);
        else
                __mark_inode_dirty(inode, I_DIRTY_SYNC);
        return 0;

Why do you need the additional I_DIRTY flag?  A lesser
__mark_inode_dirty should never override a stronger one.

Otherwise this looks fine to me, except that I would split the default
implementation into a new generic_update_time helper.

 XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
 use the I_DIRTY_TIME flag to log the journal timestamps if it so
 desires, and perhaps drop the need for it to use update_time().

We will probably always need a ->update_time to provide proper locking
around the timestamp updates.

 (And
 with XFS doing logical journalling, it may be that you might want to
 include the timestamp update in the journal if you have a journal
 transaction open already, so the disk is spun up or likely to be spun
 up anyway, right?)

XFS transactions are explicitly opened and closed, so during the atime
updates we'll never have one open.

What we could try is to have CIL items that are on indefinite hold
before they are batched into a checkpoint.  We'd still commit them to
an in-memory transaction in ->update_time for that.  All this requires
a lot of thought and will take some time, though.

In its current form the generic lazytime might even be a loss for XFS, as
we're already really good at batching updates from multiple inodes in
the same cluster for the in-place writeback, so I really don't want
to just enable it without those optimizations and a lot of testing.


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Goffredo Baroncelli
I investigated this issue further.

MegaBrutal reported the following issue: after taking an LVM snapshot of the
device of a mounted btrfs filesystem, the snapshot's device name replaces the
original device's name in the output of /proc/mounts. This confuses tools
like grub-probe, which then report a wrong root device.

It has to be pointed out that, by contrast, the link under
/sys/fs/btrfs/<fsid>/devices is correct.


What happens is that *even if the filesystem is mounted*, doing a
btrfs dev scan of a snapshot (of the real volume) replaces the device name
of the filesystem with the snapshot's.
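
A minimal sketch of the trigger, reusing the volume names from the report:

  lvcreate --snapshot --size=128M --name z vg/rootlv
  btrfs device scan /dev/vg/z   # re-registers the fsid under the snapshot name
  grep " / " /proc/mounts       # / is now reported on the snapshot device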

Anand, with b96de000b, tried to fix it; however, further regressions appeared
and Chris reverted this commit (see below).

BR
G.Baroncelli

commit b96de000bc8bc9688b3a2abea4332bd57648a49f
Author: Anand Jain anand.j...@oracle.com
Date:   Thu Jul 3 18:22:05 2014 +0800

Btrfs: device_list_add() should not update list when mounted
[...]


commit 0f23ae74f589304bf33233f85737f4fd368549eb
Author: Chris Mason c...@fb.com
Date:   Thu Sep 18 07:49:05 2014 -0700

Revert "Btrfs: device_list_add() should not update list when mounted"

This reverts commit b96de000bc8bc9688b3a2abea4332bd57648a49f.

This commit is triggering failures to mount by subvolume id in some
configurations.  The main problem is how many different ways this
scanning function is used, both for scanning while mounted and
unmounted.  A proper cleanup is too big for late rcs.

[...]

On 12/02/2014 09:28 AM, MegaBrutal wrote:
[...]



Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Anand Jain




On 02/12/2014 19:14, Goffredo Baroncelli wrote:

I investigated this issue further.

MegaBrutal reported the following issue: after taking an LVM snapshot of the
device of a mounted btrfs filesystem, the snapshot's device name replaces the
original device's name in the output of /proc/mounts. This confuses tools
like grub-probe, which then report a wrong root device.


Very good test case indeed, thanks.

Actual IO would still go to the original device until the FS is remounted.
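
One way to compare the two views (<fsid> stands for the filesystem's UUID):

  grep " / " /proc/mounts              # reports the snapshot device
  ls -l /sys/fs/btrfs/<fsid>/devices/  # compare with what sysfs shows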



It has to be pointed out that, by contrast, the link under
/sys/fs/btrfs/<fsid>/devices is correct.


In this context the above sysfs path will be out of sync with reality;
it's just a stale sysfs entry.




What happens is that *even if the filesystem is mounted*, doing a
btrfs dev scan of a snapshot (of the real volume) replaces the device name
of the filesystem with the snapshot's.


We have some fundamentally wrong stuff here. My original patch tried
to fix it. But we later discovered that some external entities like
systemd and the boot process were using that bug as a feature, and we
had to revert the patch.

Fundamentally, the SCSI inquiry serial number is the only number which
is unique to the device (including virtual devices, though there could
be some legacy virtual devices which didn't follow that strictly;
anyway, those I deem to be device-side issues). Btrfs depends on the
combination of fsid, uuid and devid (and generation number) to identify
a unique device volume, which is weak and easy to get wrong.



[...]

Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Austin S Hemmelgarn

On 2014-12-02 06:54, Anand Jain wrote:
[...]

[PATCH v4 03/10] Btrfs, raid56: don't change bbio and raid_map

2014-12-02 Thread Miao Xie
Because we will reuse bbio and raid_map during the scrub later, it is
better that we don't change any member of bbio and don't free it at
the end of the IO request. So we introduce similar members in the raid
bio, and no longer access those members through bbio.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 66944b9..cb31cc6 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,7 +58,6 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
-
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -146,6 +145,10 @@ struct btrfs_raid_bio {
 
        atomic_t refs;
 
+
+       atomic_t stripes_pending;
+
+       atomic_t error;
        /*
         * these are two arrays of pointers.  We allocate the
         * rbio big enough to hold them both and setup their
@@ -858,13 +861,13 @@ static void raid_write_end_io(struct bio *bio, int err)
 
        bio_put(bio);
 
-       if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+       if (!atomic_dec_and_test(&rbio->stripes_pending))
                return;
 
        err = 0;
 
        /* OK, we have read all the stripes we need to. */
-       if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+       if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
                err = -EIO;
 
        rbio_orig_end_io(rbio, err, 0);
@@ -949,6 +952,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
        rbio->faila = -1;
        rbio->failb = -1;
        atomic_set(&rbio->refs, 1);
+       atomic_set(&rbio->error, 0);
+       atomic_set(&rbio->stripes_pending, 0);
 
        /*
         * the stripe_pages and bio_pages array point to the extra
@@ -1169,7 +1174,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
        set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
        spin_unlock_irq(&rbio->bio_list_lock);
 
-       atomic_set(&rbio->bbio->error, 0);
+       atomic_set(&rbio->error, 0);
 
        /*
         * now that we've set rmw_locked, run through the
@@ -1245,8 +1250,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
                }
        }
 
-       atomic_set(&bbio->stripes_pending, bio_list_size(&bio_list));
-       BUG_ON(atomic_read(&bbio->stripes_pending) == 0);
+       atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
+       BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
        while (1) {
                bio = bio_list_pop(&bio_list);
@@ -1331,11 +1336,11 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed)
        if (rbio->faila == -1) {
                /* first failure on this rbio */
                rbio->faila = failed;
-               atomic_inc(&rbio->bbio->error);
+               atomic_inc(&rbio->error);
        } else if (rbio->failb == -1) {
                /* second failure on this rbio */
                rbio->failb = failed;
-               atomic_inc(&rbio->bbio->error);
+               atomic_inc(&rbio->error);
        } else {
                ret = -EIO;
        }
@@ -1394,11 +1399,11 @@ static void raid_rmw_end_io(struct bio *bio, int err)
 
        bio_put(bio);
 
-       if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+       if (!atomic_dec_and_test(&rbio->stripes_pending))
                return;
 
        err = 0;
-       if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+       if (atomic_read(&rbio->error) > rbio->bbio->max_errors)
                goto cleanup;
 
        /*
@@ -1439,7 +1444,6 @@ static void async_read_rebuild(struct btrfs_raid_bio *rbio)
 static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 {
        int bios_to_read = 0;
-       struct btrfs_bio *bbio = rbio->bbio;
        struct bio_list bio_list;
        int ret;
        int nr_pages = DIV_ROUND_UP(rbio->stripe_len, PAGE_CACHE_SIZE);
@@ -1455,7 +1459,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 
        index_rbio_pages(rbio);
 
-       atomic_set(&rbio->bbio->error, 0);
+       atomic_set(&rbio->error, 0);
        /*
         * build a list of bios to read all the missing parts of this
         * stripe
@@ -1503,7 +1507,7 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
         * the bbio may be freed once we submit the last bio.  Make sure
         * not to touch it after that
         */
-       atomic_set(&bbio->stripes_pending, bios_to_read);
+       atomic_set(&rbio->stripes_pending, bios_to_read);
        while (1) {
                bio = bio_list_pop(&bio_list);
                if (!bio)
@@ -1917,10 +1921,10 @@ static void raid_recover_end_io(struct bio *bio, int err)
                set_bio_pages_uptodate(bio);
        bio_put(bio);
 
-       if (!atomic_dec_and_test(&rbio->bbio->stripes_pending))
+       if (!atomic_dec_and_test(&rbio->stripes_pending))
                return;
 
-       if (atomic_read(&rbio->bbio->error) > rbio->bbio->max_errors)
+       if 

[PATCH v4 01/10] Btrfs: remove noused bbio_ret in __btrfs_map_block in condition

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

bbio_ret in this condition is always !NULL, because the previous code
already has a check-and-skip:
4908 if (!bbio_ret)
4909 goto out;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/volumes.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 54db1fb..6f80aef 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5167,8 +5167,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
                         BTRFS_BLOCK_GROUP_RAID6)) {
                u64 tmp;
 
-               if (bbio_ret && ((rw & REQ_WRITE) || mirror_num > 1)
-                   && raid_map_ret) {
+               if (raid_map_ret && ((rw & REQ_WRITE) || mirror_num > 1)) {
                        int i, rot;
 
                        /* push stripe_nr back to the start of the full stripe */
-- 
1.9.3



[PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
This patchset implements the device scrub/replace function for RAID56; most
of the implementation for the common data is similar to the other RAID types.
The difference, and the difficulty, is the parity processing. The basic idea
is to read and check the data which has a checksum outside of the raid56
stripe lock; if that data is right, then lock the raid56 stripe and read out
the other data in the same stripe; if no IO error happens, calculate the
parity and check the original one; if the original parity is right, the
parity scrub passes, otherwise write out the new one. But if the common
data (not parity) that we read out is wrong, we will try to recover it, and
then check and repair the parity.

And in order to avoid making the code more and more complex, we copied some
of the common data processing code for the parity; the cleanup work is on my
TODO list.

We have done some tests and the patchset worked well. Of course, more tests
are welcome. If you are interested in using or testing it, you can pull
the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was reported
  by Chris.
- Remove the 10th patch; the deadlock that was described in that patch doesn't
  exist on the current kernel.
- Rebase the patchset onto the top of the integration branch.

Changelog v2 -> v3:
- Fix wrong stripe start logical address calculation which was reported
  by Chris.
- Fix unhandled raid bios for parity scrub, which are added into the plug
  list of the head raid bio.
- Fix possible deadlock caused by the pending bios in the plug list
  when the io submitters were going to sleep.
- Fix undealt use-after-free problem of the source device in the final
  device replace procedure.
- Modify the code that is used to avoid the rbio merge.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.

Thanks
Miao

Miao Xie (7):
  Btrfs, raid56: don't change bbio and raid_map
  Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted
  Btrfs, raid56: use a variant to record the operation type
  Btrfs, raid56: support parity scrub on raid56
  Btrfs, replace: write dirty pages into the replace target device
  Btrfs, replace: write raid56 parity into the replace target device
  Btrfs, raid56: fix use-after-free problem in the final device replace
procedure on raid56

Zhao Lei (3):
  Btrfs: remove noused bbio_ret in __btrfs_map_block in condition
  Btrfs: remove unnecessary code of stripe_index assignment in
__btrfs_map_block
  Btrfs, replace: enable dev-replace for raid56

 fs/btrfs/ctree.h   |   7 +-
 fs/btrfs/dev-replace.c |   9 +-
 fs/btrfs/raid56.c  | 763 +-
 fs/btrfs/raid56.h  |  16 +-
 fs/btrfs/scrub.c   | 803 +++--
 fs/btrfs/volumes.c |  52 +++-
 fs/btrfs/volumes.h |  14 +-
 7 files changed, 1531 insertions(+), 133 deletions(-)

-- 
1.9.3



[PATCH v4 07/10] Btrfs, replace: write dirty pages into the replace target device

2014-12-02 Thread Miao Xie
The implementation is simple:
- In order to avoid changing the code logic of btrfs_map_bio and
  RAID56, we add the stripes of the replace target devices at the
  end of the stripe array in the btrfs bio, and we sort those target
  device stripes in the array. We also keep the number of the target
  device stripes in the btrfs bio.
- Except for the write operation on RAID56, none of the other operations
  take the target device stripes into account.
- When we do a write operation, we read the data from the common devices
  and calculate the parity. Then we write the dirty data and the new
  parity out; at this time, we find the relative replace target stripes
  and write the relative data into them.

Note: The function that copies old data on the source device to
the target device was implemented in the past; it is similar to
the other RAID types.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c  | 104 +
 fs/btrfs/volumes.c |  26 --
 fs/btrfs/volumes.h |  10 --
 3 files changed, 97 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 58a8408..16fe456 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -131,6 +131,8 @@ struct btrfs_raid_bio {
        /* number of data stripes (no p/q) */
        int nr_data;
 
+       int real_stripes;
+
        int stripe_npages;
        /*
         * set if we're doing a parity rebuild
@@ -638,7 +640,7 @@ static struct page *rbio_pstripe_page(struct btrfs_raid_bio *rbio, int index)
  */
 static struct page *rbio_qstripe_page(struct btrfs_raid_bio *rbio, int index)
 {
-       if (rbio->nr_data + 1 == rbio->bbio->num_stripes)
+       if (rbio->nr_data + 1 == rbio->real_stripes)
                return NULL;
 
        index += ((rbio->nr_data + 1) * rbio->stripe_len) >>
@@ -981,7 +983,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 {
        struct btrfs_raid_bio *rbio;
        int nr_data = 0;
-       int num_pages = rbio_nr_pages(stripe_len, bbio->num_stripes);
+       int real_stripes = bbio->num_stripes - bbio->num_tgtdevs;
+       int num_pages = rbio_nr_pages(stripe_len, real_stripes);
        int stripe_npages = DIV_ROUND_UP(stripe_len, PAGE_SIZE);
        void *p;
 
@@ -1001,6 +1004,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
        rbio->fs_info = root->fs_info;
        rbio->stripe_len = stripe_len;
        rbio->nr_pages = num_pages;
+       rbio->real_stripes = real_stripes;
        rbio->stripe_npages = stripe_npages;
        rbio->faila = -1;
        rbio->failb = -1;
@@ -1017,10 +1021,10 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
        rbio->bio_pages = p + sizeof(struct page *) * num_pages;
        rbio->dbitmap = p + sizeof(struct page *) * num_pages * 2;
 
-       if (raid_map[bbio->num_stripes - 1] == RAID6_Q_STRIPE)
-               nr_data = bbio->num_stripes - 2;
+       if (raid_map[real_stripes - 1] == RAID6_Q_STRIPE)
+               nr_data = real_stripes - 2;
        else
-               nr_data = bbio->num_stripes - 1;
+               nr_data = real_stripes - 1;
 
        rbio->nr_data = nr_data;
        return rbio;
@@ -1132,7 +1136,7 @@ static int rbio_add_io_page(struct btrfs_raid_bio *rbio,
 static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
 {
        if (rbio->faila >= 0 || rbio->failb >= 0) {
-               BUG_ON(rbio->faila == rbio->bbio->num_stripes - 1);
+               BUG_ON(rbio->faila == rbio->real_stripes - 1);
                __raid56_parity_recover(rbio);
        } else {
                finish_rmw(rbio);
@@ -1193,7 +1197,7 @@ static void index_rbio_pages(struct btrfs_raid_bio *rbio)
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
        struct btrfs_bio *bbio = rbio->bbio;
-       void *pointers[bbio->num_stripes];
+       void *pointers[rbio->real_stripes];
        int stripe_len = rbio->stripe_len;
        int nr_data = rbio->nr_data;
        int stripe;
@@ -1207,11 +1211,11 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 
        bio_list_init(&bio_list);
 
-       if (bbio->num_stripes - rbio->nr_data == 1) {
-               p_stripe = bbio->num_stripes - 1;
-       } else if (bbio->num_stripes - rbio->nr_data == 2) {
-               p_stripe = bbio->num_stripes - 2;
-               q_stripe = bbio->num_stripes - 1;
+       if (rbio->real_stripes - rbio->nr_data == 1) {
+               p_stripe = rbio->real_stripes - 1;
+       } else if (rbio->real_stripes - rbio->nr_data == 2) {
+               p_stripe = rbio->real_stripes - 2;
+               q_stripe = rbio->real_stripes - 1;
        } else {
                BUG();
        }
@@ -1268,7 +1272,7 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
                SetPageUptodate(p);
                pointers[stripe++] = kmap(p);
 
-               raid6_call.gen_syndrome(bbio->num_stripes, PAGE_SIZE,
+

[PATCH v4 05/10] Btrfs, raid56: use a variant to record the operation type

2014-12-02 Thread Miao Xie
We will introduce a new operation type later. If we kept using an integer
variable as a boolean to record each operation type, we would have to add
a new variable for every new operation and increase the size of the raid
bio structure. That is not good. So with this patch we define a distinct
number for each operation, and a single variable can record the operation
type.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/raid56.c | 31 +--
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index c954537..4924388 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -69,6 +69,11 @@
 
 #define RBIO_CACHE_SIZE 1024
 
+enum btrfs_rbio_ops {
+       BTRFS_RBIO_WRITE        = 0,
+       BTRFS_RBIO_READ_REBUILD = 1,
+};
+
 struct btrfs_raid_bio {
        struct btrfs_fs_info *fs_info;
        struct btrfs_bio *bbio;
@@ -131,7 +136,7 @@ struct btrfs_raid_bio {
         * differently from a parity rebuild as part of
         * rmw
         */
-       int read_rebuild;
+       enum btrfs_rbio_ops operation;
 
        /* first bad stripe */
        int faila;
@@ -154,7 +159,6 @@ struct btrfs_raid_bio {
 
        atomic_t refs;
 
-
        atomic_t stripes_pending;
 
        atomic_t error;
@@ -590,8 +594,7 @@ static int rbio_can_merge(struct btrfs_raid_bio *last,
                return 0;
 
        /* reads can't merge with writes */
-       if (last->read_rebuild !=
-           cur->read_rebuild) {
+       if (last->operation != cur->operation) {
                return 0;
        }
 
@@ -784,9 +787,9 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
                        spin_unlock(&rbio->bio_list_lock);
                        spin_unlock_irqrestore(&h->lock, flags);
 
-                       if (next->read_rebuild)
+                       if (next->operation == BTRFS_RBIO_READ_REBUILD)
                                async_read_rebuild(next);
-                       else {
+                       else if (next->operation == BTRFS_RBIO_WRITE) {
                                steal_rbio(rbio, next);
                                async_rmw_stripe(next);
                        }
@@ -1720,6 +1723,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
        }
        bio_list_add(&rbio->bio_list, bio);
        rbio->bio_list_bytes = bio->bi_iter.bi_size;
+       rbio->operation = BTRFS_RBIO_WRITE;
 
        /*
         * don't plug on full rbios, just get them out the door
@@ -1768,7 +1772,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
        faila = rbio->faila;
        failb = rbio->failb;
 
-       if (rbio->read_rebuild) {
+       if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
                spin_lock_irq(&rbio->bio_list_lock);
                set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
                spin_unlock_irq(&rbio->bio_list_lock);
@@ -1785,7 +1789,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
                 * if we're rebuilding a read, we have to use
                 * pages from the bio list
                 */
-               if (rbio->read_rebuild &&
+               if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
                    (stripe == faila || stripe == failb)) {
                        page = page_in_rbio(rbio, stripe, pagenr, 0);
                } else {
@@ -1878,7 +1882,7 @@ pstripe:
         * know they can be trusted.  If this was a read reconstruction,
         * other endio functions will fiddle the uptodate bits
         */
-       if (!rbio->read_rebuild) {
+       if (rbio->operation == BTRFS_RBIO_WRITE) {
                for (i = 0; i < nr_pages; i++) {
                        if (faila != -1) {
                                page = rbio_stripe_page(rbio, faila, i);
@@ -1895,7 +1899,7 @@ pstripe:
                 * if we're rebuilding a read, we have to use
                 * pages from the bio list
                 */
-               if (rbio->read_rebuild &&
+               if (rbio->operation == BTRFS_RBIO_READ_REBUILD &&
                    (stripe == faila || stripe == failb)) {
                        page = page_in_rbio(rbio, stripe, pagenr, 0);
                } else {
@@ -1910,8 +1914,7 @@ cleanup:
        kfree(pointers);
 
 cleanup_io:
-
-       if (rbio->read_rebuild) {
+       if (rbio->operation == BTRFS_RBIO_READ_REBUILD) {
                if (err == 0 &&
                    !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
                        cache_rbio_pages(rbio);
@@ -2050,7 +2053,7 @@ out:
        return 0;
 
 cleanup:
-       if (rbio->read_rebuild)
+       if (rbio->operation == BTRFS_RBIO_READ_REBUILD)
                rbio_orig_end_io(rbio, -EIO, 0);
        return -EIO;
 }
@@ -2076,7 +2079,7 @@ int raid56_parity_recover(struct 

[PATCH v4 09/10] Btrfs, raid56: fix use-after-free problem in the final device replace procedure on raid56

2014-12-02 Thread Miao Xie
The commit c404e0dc (Btrfs: fix use-after-free in the finishing
procedure of the device replace) fixed a use-after-free problem
which happened when removing the source device at the end of a device
replace, but at that time btrfs didn't support device replace
on raid56, so we didn't fix the problem for the raid56 profile.
Now that we have implemented device replace for raid56, we need to
kick that problem out before we enable that function for raid56.

The fix is very simple: we just increase the bio per-cpu
counter before we submit a raid56 io, and decrease the counter
when the raid56 io ends.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- None.

Changelog v2 -> v3:
- New patch to fix undealt use-after-free problem of the source device
  in the final device replace procedure.

Changelog v1 -> v2:
- None.
---
 fs/btrfs/ctree.h   |  7 ++-
 fs/btrfs/dev-replace.c |  4 ++--
 fs/btrfs/raid56.c  | 41 -
 fs/btrfs/raid56.h  |  4 ++--
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c |  7 ++-
 6 files changed, 45 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fc73e86..3770f4c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4156,7 +4156,12 @@ int btrfs_scrub_progress(struct btrfs_root *root, u64 devid,
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
 void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info);
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info);
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount);
+
+static inline void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+{
+       btrfs_bio_counter_sub(fs_info, 1);
+}
 
 /* reada.c */
 struct reada_control {
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 91f6b8f..326919b 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -928,9 +928,9 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info *fs_info)
        percpu_counter_inc(&fs_info->bio_counter);
 }
 
-void btrfs_bio_counter_dec(struct btrfs_fs_info *fs_info)
+void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
-       percpu_counter_dec(&fs_info->bio_counter);
+       percpu_counter_sub(&fs_info->bio_counter, amount);
 
        if (waitqueue_active(&fs_info->replace_wait))
                wake_up(&fs_info->replace_wait);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 7e6f239..44573bf 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -162,6 +162,8 @@ struct btrfs_raid_bio {
         */
        int bio_list_bytes;
 
+       int generic_bio_cnt;
+
        atomic_t refs;
 
        atomic_t stripes_pending;
@@ -354,6 +356,7 @@ static void merge_rbio(struct btrfs_raid_bio *dest,
 {
        bio_list_merge(&dest->bio_list, &victim->bio_list);
        dest->bio_list_bytes += victim->bio_list_bytes;
+       dest->generic_bio_cnt += victim->generic_bio_cnt;
        bio_list_init(&victim->bio_list);
 }
 
@@ -891,6 +894,10 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, int err, int uptodate)
 {
        struct bio *cur = bio_list_get(&rbio->bio_list);
        struct bio *next;
+
+       if (rbio->generic_bio_cnt)
+               btrfs_bio_counter_sub(rbio->fs_info, rbio->generic_bio_cnt);
+
        free_raid_bio(rbio);
 
        while (cur) {
@@ -1775,6 +1782,7 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
        struct btrfs_raid_bio *rbio;
        struct btrfs_plug_cb *plug = NULL;
        struct blk_plug_cb *cb;
+       int ret;
 
        rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
        if (IS_ERR(rbio)) {
@@ -1785,12 +1793,19 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
        rbio->bio_list_bytes = bio->bi_iter.bi_size;
        rbio->operation = BTRFS_RBIO_WRITE;
 
+       btrfs_bio_counter_inc_noblocked(root->fs_info);
+       rbio->generic_bio_cnt = 1;
+
        /*
        * don't plug on full rbios, just get them out the door
        * as quickly as we can
        */
-       if (rbio_is_full(rbio))
-               return full_stripe_write(rbio);
+       if (rbio_is_full(rbio)) {
+               ret = full_stripe_write(rbio);
+               if (ret)
+                       btrfs_bio_counter_dec(root->fs_info);
+               return ret;
+       }
 
        cb = blk_check_plugged(btrfs_raid_unplug, root->fs_info,
                               sizeof(*plug));
@@ -1801,10 +1816,13 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
                INIT_LIST_HEAD(&plug->rbio_list);
        }
        list_add_tail(&rbio->plug_list, &plug->rbio_list);
+       ret = 0;
        } else {
-               return __raid56_parity_write(rbio);
+               ret = __raid56_parity_write(rbio);
+               if (ret)
+                       btrfs_bio_counter_dec(root->fs_info);
        }
-       return 0;
+       return ret;
 }

[PATCH v4 10/10] Btrfs, replace: enable dev-replace for raid56

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v1 -> v4:
- None.
---
 fs/btrfs/dev-replace.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 326919b..51133ea 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -316,11 +316,6 @@ int btrfs_dev_replace_start(struct btrfs_root *root,
struct btrfs_device *tgt_device = NULL;
struct btrfs_device *src_device = NULL;
 
-	if (btrfs_fs_incompat(fs_info, RAID56)) {
-		btrfs_warn(fs_info, "dev_replace cannot yet handle RAID5/RAID6");
-		return -EOPNOTSUPP;
-	}
-
	switch (args->start.cont_reading_from_srcdev_mode) {
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS:
case BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID:
-- 
1.9.3
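
With the guard removed, the new path can be exercised with the standard
replace commands; a quick smoke test might look like this (the device
paths and mount point are placeholders for illustration, not part of
the patch):

  # btrfs replace start -B /dev/sde /dev/sdf /mnt
  # btrfs replace status /mnt

-B keeps the replace in the foreground; status reports progress and the
final result.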

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v4 04/10] Btrfs, scrub: repair the common data on RAID5/6 if it is corrupted

2014-12-02 Thread Miao Xie
This patch implements the RAID5/6 common data repair function. The
implementation is similar to scrub on the other RAID levels such as
RAID1; the difference is that we don't read the data from a mirror,
we use the data repair function of RAID5/6 instead.

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was reported by
  Chris.

Changelog v2 -> v3:
- None.

Changelog v1 -> v2:
- Change some function names in raid56.c to make them fit the code style
  of the raid56.
---
 fs/btrfs/raid56.c  |  52 ++
 fs/btrfs/raid56.h  |   2 +-
 fs/btrfs/scrub.c   | 194 -
 fs/btrfs/volumes.c |  16 -
 fs/btrfs/volumes.h |   4 ++
 5 files changed, 235 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index cb31cc6..c954537 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -58,6 +58,15 @@
  */
 #define RBIO_CACHE_READY_BIT   3
 
+/*
+ * bbio and raid_map are managed by the caller, so we shouldn't free
+ * them here. And besides that, all rbios with this flag should not
+ * be cached, because we need the raid_map to check whether the rbios'
+ * stripes are the same or not, but it is very likely that the caller
+ * has already freed the raid_map, so don't cache those rbios.
+ */
+#define RBIO_HOLD_BBIO_MAP_BIT 4
+
 #define RBIO_CACHE_SIZE 1024
 
 struct btrfs_raid_bio {
@@ -799,6 +808,21 @@ done_nolock:
remove_rbio_from_cache(rbio);
 }
 
+static inline void
+__free_bbio_and_raid_map(struct btrfs_bio *bbio, u64 *raid_map, int need)
+{
+   if (need) {
+   kfree(raid_map);
+   kfree(bbio);
+   }
+}
+
+static inline void free_bbio_and_raid_map(struct btrfs_raid_bio *rbio)
+{
+	__free_bbio_and_raid_map(rbio->bbio, rbio->raid_map,
+			!test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags));
+}
+
 static void __free_raid_bio(struct btrfs_raid_bio *rbio)
 {
int i;
@@ -817,8 +841,9 @@ static void __free_raid_bio(struct btrfs_raid_bio *rbio)
			rbio->stripe_pages[i] = NULL;
		}
	}
-	kfree(rbio->raid_map);
-	kfree(rbio->bbio);
+
+   free_bbio_and_raid_map(rbio);
+
kfree(rbio);
 }
 
@@ -933,11 +958,8 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_root *root,
 
rbio = kzalloc(sizeof(*rbio) + num_pages * sizeof(struct page *) * 2,
GFP_NOFS);
-   if (!rbio) {
-   kfree(raid_map);
-   kfree(bbio);
+   if (!rbio)
return ERR_PTR(-ENOMEM);
-   }
 
	bio_list_init(&rbio->bio_list);
	INIT_LIST_HEAD(&rbio->plug_list);
@@ -1692,8 +1714,10 @@ int raid56_parity_write(struct btrfs_root *root, struct bio *bio,
struct blk_plug_cb *cb;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, 1);
return PTR_ERR(rbio);
+   }
	bio_list_add(&rbio->bio_list, bio);
	rbio->bio_list_bytes = bio->bi_iter.bi_size;
 
@@ -1888,7 +1912,8 @@ cleanup:
 cleanup_io:
 
	if (rbio->read_rebuild) {
-		if (err == 0)
+		if (err == 0 &&
+		    !test_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags))
			cache_rbio_pages(rbio);
		else
			clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
@@ -2038,15 +2063,19 @@ cleanup:
  */
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
  struct btrfs_bio *bbio, u64 *raid_map,
- u64 stripe_len, int mirror_num)
+ u64 stripe_len, int mirror_num, int hold_bbio)
 {
struct btrfs_raid_bio *rbio;
int ret;
 
rbio = alloc_rbio(root, bbio, raid_map, stripe_len);
-   if (IS_ERR(rbio))
+   if (IS_ERR(rbio)) {
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
return PTR_ERR(rbio);
+   }
 
+	if (hold_bbio)
+		set_bit(RBIO_HOLD_BBIO_MAP_BIT, &rbio->flags);
	rbio->read_rebuild = 1;
	bio_list_add(&rbio->bio_list, bio);
	rbio->bio_list_bytes = bio->bi_iter.bi_size;
@@ -2054,8 +2083,7 @@ int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
	rbio->faila = find_logical_bio_stripe(rbio, bio);
	if (rbio->faila == -1) {
BUG();
-   kfree(raid_map);
-   kfree(bbio);
+   __free_bbio_and_raid_map(bbio, raid_map, !hold_bbio);
kfree(rbio);
return -EIO;
}
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index ea5d73b..b310e8c 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -41,7 +41,7 @@ static inline int nr_data_stripes(struct map_lookup *map)
 
 int raid56_parity_recover(struct btrfs_root *root, struct bio *bio,
 

[PATCH v4 02/10] Btrfs: remove unnecessary code of stripe_index assignment in __btrfs_map_block

2014-12-02 Thread Miao Xie
From: Zhao Lei zhao...@cn.fujitsu.com

stripe_index's value is set again in a later line:
	stripe_index = 0;

Signed-off-by: Zhao Lei zhao...@cn.fujitsu.com
Signed-off-by: Miao Xie mi...@cn.fujitsu.com
Reviewed-by: David Sterba dste...@suse.cz
---
Changelog v1 - v4:
- None.
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6f80aef..eeb5b31 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5172,9 +5172,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
		/* push stripe_nr back to the start of the full stripe */
stripe_nr = raid56_full_stripe_start;
-   do_div(stripe_nr, stripe_len);
-
-   stripe_index = do_div(stripe_nr, nr_data_stripes(map));
+   do_div(stripe_nr, stripe_len * nr_data_stripes(map));
 
/* RAID[56] write or recovery. Return all stripes */
num_stripes = map-num_stripes;
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs stuck with lot's of files

2014-12-02 Thread Duncan
Peter Volkov posted on Tue, 02 Dec 2014 04:50:29 +0300 as excerpted:

 On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
 On 12/01/2014 03:46 AM, Peter Volkov wrote:
   (stuff about getting hung up trying to write to one drive)
 
 That drive (/dev/sdn) is probably starting to fail.
 (about failed drive)
 
 Thank you Robert for the answer. It is not likely that drive fails here.
 Similar condition (write to a single drive) happens with other drives
 i.e. such write pattern may happen with any drive.

 After looking at what happens longer I see the following. During stuck
 single processor core is busy 100% of CPU in kernel space (some kworker
 is taking 100% CPU).

FWIW, agreed that it's unlikely to be the drive, especially if you're not 
seeing bus resets or drive errors in dmesg and smart says the drive is 
fine, as I expect it does/will.  It may be a btrfs bug or scaling issue, 
of which btrfs still has some, or it could simply be the single mode vs 
raid0 mode issue I explain below.

# btrfs filesystem df /store/
  Data, single: total=11.92TiB, used=10.86TiB
 
 Regardless of the above...
 
 You have a terabyte of unused but allocated data storage. You probably
 need to balance your system to un-jamb that. That's a lot of space that
 is unavailable to the metadata (etc).
 
 Well, I'm afraid that balance will put fs into even longer stuck.
 
 ASIDE: Having your metadata set to RAID1 (as opposed to the default of
 DUP) seems a little iffy since your data is still set to DUP.
 
 That's true. But why data is duplicated? During btrfs volume creation
 I've set explicitly -d data single.

I believe Robert mis-wrote (thinko).  The btrfs filesystem df clearly 
shows that your data is in single mode, the data default mode, not dup 
mode, which is normally only available to metadata (not data) on a single-
device filesystem, where it is the metadata default.

However, in the original post you /did/ say raid1 for metadata, raid0 for 
data, and the above btrfs filesystem df again clearly says single, not 
raid0.

Which is very likely to be your problem.  In single mode, btrfs will 
create chunks one at a time, picking the device with the most free space 
to allocate it on.  The normal data chunk size is 1 GiB.  Because of the 
most-free-space allocation rule, with N devices (22 in your case) of the 
same size, after N (22) data chunks are allocated you'll tend to have one 
such chunk on each device.

Each of these 1 GiB chunks (along with space freed up by normal delete 
activity in other allocated data chunks) will be filled before another is 
allocated.

Which will mean you're writing a GiB worth of data to one device before 
you switch to the next one.  With your mostly sub-MiB file write pattern, 
that's probably 1500-2000 files written to a chunk on that single device, 
before another chunk is allocated on the next device.

Thus all your activity on that single device!

In raid0 mode, by contrast, the same 1 GiB chunks will be allocated on 
each device, but a stripe of chunks will be allocated across all devices 
(22 in your case) at the same time, and data being written is broken up 
into much smaller per-device strips.  I'm not sure what the actual per-
device strip size is in raid0 mode, but it's *WELL* under a GiB and I
believe in the KiB not MiB range.  It might be 128 KiB, the compression
block size when
the compress mount option is used.

Obviously were you using raid0 data, you'd see the load spread out at 
least somewhat better.  But the df says it's single, not raid0.

To get raid0 mode you can use a balance with filters (see the wiki or 
recent btrfs-balance manpage), or blow away the existing filesystem and 
create a new one, setting --data raid0 when you mkfs.btrfs, and restore 
from backups (which you're already prepared to do if you value your data 
in any case[1]).
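
Concretely, the filtered-balance route is something like the following
(mount point taken from your df output; note that a convert balance
rewrites every data chunk in place, so expect it to run for a long time
on ~11 TiB of data):

# btrfs balance start -dconvert=raid0 /store

Since the chunks are rewritten in place, no mkfs/restore cycle is
strictly needed for the conversion itself, though the usual
keep-current-backups caveat applies.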

That missing btrfs filesystem show, due to the terminating / in /store/ 
(simply /store should work) is somewhat frustrating here, as it'd show 
per-device sizes and utilization.  Assuming near same-sized devices, with 
11 TiB of data being far greater than the 1 GiB data chunk size times 22 
devices I'd guess you're pretty evened out, utilization-wise, but the 
output from both show and df is necessary to get the full story.

 FURTHER ASIDE: raid1 metadata and raid5 data might be good for you given
 22 volumes and 10% empty space it would only cost you half of
 your existing empty space. If you don't RAID your data, there is no
 real point to putting your metadata in RAID.
 
 Is raid5 ready for use? As I read post[1] mentioned on[2] it is still
 some way to make it stable.

You are absolutely correct.  I'd strongly recommend staying AWAY from 
btrfs raid5/6 modes at this time.  While Robert is becoming an active 
regular and has the technical background to point out some things others 
miss, he's still reasonably new to this list and may not have been aware 
of the incomplete status of raid5/6 modes at this time.

Effectively 

Re: Possible to undo subvol delete?

2014-12-02 Thread David Sterba
On Mon, Dec 01, 2014 at 10:14:03PM -0500, Zygo Blaxell wrote:
  export BTRFS_SUBVOLUME_DELETE_CONFIRM=1
  
  Ideas?
 
 Never rely on aliasing or environment variables for defaults, and never
 change default behavior if your releases are old enough that someone
 has built scripts on top of them.  ;)

Exactly.

 If I had to pick the least evil, I'd go for interactive prompting by
 default (do nothing if the interaction fails, e.g. no TTY) and add a
 '-f'/'--force' flag to bypass the prompt.

This sounds acceptable.

 This is consistent with the
 way lvm2 and mdadm work when presented with data-losing or otherwise
 questionable commands and parameters.  It will break scripts, but btrfs
 users should still be expecting that for a while as undesirable default
 behaviors are identified.

How is this going to break scripts?

 OTOH maybe there is no issue with the current behavior.  Only root can
 delete subvolumes, and maybe we assume root knows what they're doing?

With the mount option user_subvol_rm_allowed the user can delete subvols
as well, so it makes sense to add the confirmation.
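
That's simply a mount option; the device and mount point here are
placeholders:

# mount -o user_subvol_rm_allowed /dev/sdX /mnt

With that set, an unprivileged owner can delete subvolumes they can
otherwise write to, which is exactly where a confirmation prompt would
help.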

 On a side note...only root can delete subvolumes, but non-root users
 can create them, which results in...this:
 
   $ /sbin/btrfs sub create foo
   Create subvolume './foo'
  $ date > foo/bar
   $ /sbin/btrfs sub delete foo
   Transaction commit: none (default)
   Delete subvolume '/home/testuser/foo'
   ERROR: cannot delete '/home/testuser/foo' - Operation not permitted
   $ rm -rf foo
   rm: cannot remove `foo': Operation not permitted
   $ cat /proc/version
   Linux version 3.17.1-zb64+ (root@buildbot) (gcc version 4.7.2 (Debian 
 4.7.2-5) ) #1 SMP PREEMPT Tue Oct 21 00:17:49 EDT 2014
 
 ...uh oh?

That's how it works now. I'd like to enable the user to delete their
subvolumes even without the user_subvol_rm_allowed option someday.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible to undo subvol delete?

2014-12-02 Thread David Sterba
On Tue, Dec 02, 2014 at 09:10:15AM +0530, Shriramana Sharma wrote:
  On a side note...only root can delete subvolumes, but non-root users
  can create them, which results in...this:
 
 Not sure about your Debian system, but my openSUSE Tumbleweed (with
 kernel 3.17.2 and btrfsprogs 3.17) requires me to enter the root
 password before creating a subvol (or in fact running anything under
 /sbin or /usr/sbin).

Works for me without the root password on a Tumbleweed installation
(without apparmor/selinux).
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Jan Kara
On Fri 28-11-14 13:14:21, Ted Tso wrote:
 On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
  Hum, when someone calls fsync() for an inode, you likely want to sync
  timestamps to disk even if everything else is clean. I think that doing
  what you did in last version:
  dirty = inode-i_state  I_DIRTY_INODE;
  inode-i_state = ~I_DIRTY_INODE;
  spin_unlock(inode-i_lock);
  if (dirty  I_DIRTY_TIME)
  mark_inode_dirty_sync(inode);
  looks better to me. IMO when someone calls __writeback_single_inode() we
  should write whatever we have...
 
 Yes, but we also have to distinguish between what happens on an
 fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
 (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
 check in the fsync() code path to handle the concern you raised above.
  Ah, this is the thing you have been likely talking about but which I was
constantly missing in my thoughts. You don't want to write times when inode
has only dirty pages and timestamps - I was always thinking about a
situation where inode has only dirty timestamps and not pages. This
situation also complicates the writeback logic because when inode has dirty
pages, you need to track it as a normal dirty inode for page writeback (with
dirtied_when corresponding to the time when the pages were dirtied) but in
parallel you now need to track the information that the inode has timestamps
that weren't written for X long. And even if we stored how old the
timestamps are, it isn't easily possible to keep the list of inodes with just
dirty timestamps sorted by dirty time. So now I finally understand why you
did things the way you did them... Sorry for misleading you.

So let's restart the design so that things are clear:
1) We have new inode bit I_DIRTY_TIME. This means that only timestamps in
the inode have changed. The desired behavior is that inode is with
I_DIRTY_TIME and without I_DIRTY_SYNC | I_DIRTY_DATASYNC is written by
background writeback only once per 24 hours. Such inodes do get written by
sync(2) and fsync(2) calls.

2) Inodes with only I_DIRTY_TIME are tracked in a new dirty list
b_dirty_time. We use i_wb_list list head for this. Unlike b_dirty list,
this list isn't kept sorted by dirtied_when. If queue_io() sees for_sync
bit set in the work item, it will call mark_inode_dirty_sync() for all
inodes in b_dirty_time before queuing io from b_dirty list. Once per hour
(or something like that) flusher thread scans the whole b_dirty_time list
and calls mark_inode_dirty_sync() for all inodes that have too old dirty
timestamps (to detect this we need a new time stamp in the inode).

3) When fsync() sees inode with I_DIRTY_TIME set, it calls
mark_inode_dirty_sync().

4) When we are dropping last inode reference and inode has I_DIRTY_TIME
set, we call mark_inode_dirty_sync().
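
As a minimal sketch of point 3 (illustration only, assuming the
I_DIRTY_TIME flag proposed in this series, which is not in mainline
yet):

#include <linux/fs.h>

/*
 * fsync path: an inode that is dirty only in its timestamps gets
 * promoted to a normally dirty inode so the timestamps reach disk.
 */
static void fsync_flush_lazy_timestamps(struct inode *inode)
{
	bool dirty_time;

	spin_lock(&inode->i_lock);
	dirty_time = inode->i_state & I_DIRTY_TIME;
	spin_unlock(&inode->i_lock);

	if (dirty_time)
		mark_inode_dirty_sync(inode);
}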

And that should be it, right?

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 10/11] Btrfs: fix possible deadlock caused by pending I/O in plug list

2014-12-02 Thread Miao Xie
hi, Chris

On Fri, 28 Nov 2014 16:32:03 -0500, Chris Mason wrote:
 On Wed, Nov 26, 2014 at 10:00 PM, Miao Xie mi...@cn.fujitsu.com wrote:
 On Thu, 27 Nov 2014 09:39:56 +0800, Miao Xie wrote:
  On Wed, 26 Nov 2014 10:02:23 -0500, Chris Mason wrote:
  On Wed, Nov 26, 2014 at 8:04 AM, Miao Xie mi...@cn.fujitsu.com wrote:
  The increase/decrease of bio counter is on the I/O path, so we should
  use io_schedule() instead of schedule(), or the deadlock might be
  triggered by the pending I/O in the plug list. io_schedule() can help
  us because it will flush all the pending I/O before the task is going
  to sleep.

  Can you please describe this deadlock in more detail?  schedule() also 
 triggers
  a flush of the plug list, and if that's no longer sufficient we can run 
 into other
  problems (especially with preemption on).

  Sorry for my miss. I forgot to check the current implementation of 
 schedule(), which flushes the plug list unconditionally. Please ignore this 
 patch.

 I have updated my raid56-scrub-replace branch, please re-pull the branch.

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
 
 Sorry, I wasn't clear.  I do like the patch because it uses a slightly better 
 trigger mechanism for the flush.  I was just worried about a larger deadlock.
 
 I ran the raid56 work with stress.sh overnight, then scrubbed the resulting 
 filesystem and ran balance when the scrub completed.  All of these passed 
 without errors (excellent!).
 
 Then I zero'd 4GB of one drive and ran scrub again.  This was the result.  
 Please make sure CONFIG_DEBUG_PAGEALLOC is enabled and you should be able to 
 reproduce.

I sent out the 4th version of the patchset, please try it.

I have pushed the new patchset to my git tree, you can re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Thanks
Miao

 
 [192392.495260] BUG: unable to handle kernel paging request at 
 880303062f80
 [192392.495279] IP: [a05fe77a] lock_stripe_add+0xba/0x390 [btrfs]
 [192392.495281] PGD 2bdb067 PUD 107e7fd067 PMD 107e7e4067 PTE 800303062060
 [192392.495283] Oops:  [#1] SMP DEBUG_PAGEALLOC
 [192392.495307] Modules linked in: ipmi_devintf loop fuse k10temp coretemp 
 hwmon btrfs raid6_pq zlib_deflate lzo_compress xor xfs exportfs libcrc32c 
 tcp_diag inet_diag nfsv4 ip6table_filter ip6_tables xt_NFLOG nfnetlink_log 
 nfnetlink xt_comment xt_statistic iptable_filter ip_tables x_tables mptctl 
 netconsole autofs4 nfsv3 nfs lockd grace rpcsec_gss_krb5 auth_rpcgss 
 oid_registry sunrpc ipv6 ext3 jbd dm_mod rtc_cmos ipmi_si ipmi_msghandler 
 iTCO_wdt iTCO_vendor_support pcspkr i2c_i801 lpc_ich mfd_core shpchp ehci_pci 
 ehci_hcd mlx4_en ptp pps_core mlx4_core sg ses enclosure button megaraid_sas
 [192392.495310] CPU: 0 PID: 11992 Comm: kworker/u65:2 Not tainted 
 3.18.0-rc6-mason+ #7
 [192392.495310] Hardware name: ZTSYSTEMS Echo Ridge T4  /A9DRPF-10D, BIOS 
 1.07 05/10/2012
 [192392.495323] Workqueue: btrfs-btrfs-scrub btrfs_scrub_helper [btrfs]
 [192392.495324] task: 88013dae9110 ti: 8802296a task.ti: 
 8802296a
 [192392.495335] RIP: 0010:[a05fe77a]  [a05fe77a] 
 lock_stripe_add+0xba/0x390 [btrfs]
 [192392.495335] RSP: 0018:8802296a3ac8  EFLAGS: 00010006
 [192392.495336] RAX: 880577e85018 RBX: 880497f0b2f8 RCX: 
 8801190fb000
 [192392.495337] RDX: 013d RSI: 880303062f80 RDI: 
 040c275a
 [192392.495338] RBP: 8802296a3b48 R08: 880497f0 R09: 
 0001
 [192392.495339] R10:  R11:  R12: 
 0282
 [192392.495339] R13: b250 R14: 880577e85000 R15: 
 880497f0b2a0
 [192392.495340] FS:  () GS:88085fc0() 
 knlGS:
 [192392.495341] CS:  0010 DS:  ES:  CR0: 80050033
 [192392.495342] CR2: 880303062f80 CR3: 05289000 CR4: 
 000407f0
 [192392.495342] Stack:
 [192392.495344]  880755e28000 880497f0 013d 
 8801190fb000
 [192392.495346]   88013dae9110 81090d40 
 8802296a3b00
 [192392.495347]  8802296a3b00 0010 8802296a3b68 
 8801190fb000
 [192392.495348] Call Trace:
 [192392.495353]  [81090d40] ? bit_waitqueue+0xa0/0xa0
 [192392.495363]  [a05fea66] 
 raid56_parity_submit_scrub_rbio+0x16/0x30 [btrfs]
 [192392.495372]  [a05e2f0e] 
 scrub_parity_check_and_repair+0x15e/0x1e0 [btrfs]
 [192392.495380]  [a05e301d] scrub_block_put+0x8d/0x90 [btrfs]
 [192392.495388]  [a05e6ed7] ? scrub_bio_end_io_worker+0xd7/0x870 
 [btrfs]
 [192392.495396]  [a05e6ee9] scrub_bio_end_io_worker+0xe9/0x870 
 [btrfs]
 [192392.495405]  [a05b8c44] normal_work_helper+0x84/0x330 [btrfs]
 [192392.495414]  [a05b8f42] btrfs_scrub_helper+0x12/0x20 [btrfs]
 [192392.495417]  [8106c50f] process_one_work+0x1bf/0x520
 [192392.495419]  [8106c48d] ? 

Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Chris Mason



On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
This patchset implements the device scrub/replace function for RAID56;
most of the implementation for the common data is similar to the other
RAID types. The difference, and the difficulty, is the parity process.
The basic idea is to read and check the data which has a checksum
outside of the raid56 stripe lock; if the data is right, then lock the
raid56 stripe and read out the other data in the same stripe. If no IO
error happens, calculate the parity and check the original one; if the
original parity is right, the scrub parity passes, or else write out
the new one. But if the common data (not parity) that we read out is
wrong, we will try to recover it, and then check and repair the parity.

And in order to avoid making the code more and more complex, we copy
some code of the common data process for the parity; the cleanup work
is in my TODO list.

We have done some tests, and the patchset worked well. Of course, more
tests are welcome. If you are interested in using it or testing it, you
can pull the patchset from

  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

Changelog v3 -> v4:
- Fix the problem that the scrub's raid bio was cached, which was
  reported by Chris.
- Remove the 10th patch; the deadlock that was described in that patch
  doesn't exist on the current kernel.
- Rebase the patchset to the top of the integration branch


Thanks, I'll try this today.  I need to rebase in a new version of the 
RCU patches, can you please cook one on top of v3.18-rc6 instead?


-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Moving an entire subvol?

2014-12-02 Thread David Sterba
On Tue, Dec 02, 2014 at 08:51:40AM +0530, Shriramana Sharma wrote:
  openSUSE uses subvol id 5 for installing the OS to, and some
  directories are made subvolumes such as home var and maybe usr.
  Therefore when subvolid 5 is snapshot, those are exempt, and have to
  be individually snapshot.
 
 Yes I also noticed that openSUSE creates such separate subvols, but is
 there any particular benefit to making it so?

A subvolume is also a snapshotting barrier, so it's convenient to create
subvolumes in well-known paths that contain data that should not be
rolled back (/var/log, /srv, bootloader).
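
For instance, on a freshly created filesystem (paths illustrative):

# btrfs subvolume create /var/log
# btrfs subvolume create /srv

A snapshot of the parent subvolume then stops at those boundaries, so a
rollback leaves /var/log and /srv at their current state.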
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Wang Shilong

 
 
 
 On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 This patchset implement the device scrub/replace function for RAID56, the
 most implementation of the common data is similar to the other RAID type.
 The differentia or difficulty is the parity process. The basic idea is 
 reading
 and check the data which has checksum out of the raid56 stripe lock, if the
 data is right, then lock the raid56 stripe, read out the other data in the
 same stripe, if no IO error happens, calculate the parity and check the
 original one, if the original parity is right, the scrub parity passes.
 or write out the new one. But if the common data(not parity) that we read out
 is wrong, we will try to recover it, and then check and repair the parity.
 And in order to avoid making the code more and more complex, we copy some
 code of common data process for the parity, the cleanup work is in my
 TODO list.
 We have done some test, the patchset worked well. Of course, more tests
 are welcome. If you are interesting to use it or test it, you can pull
 the patchset from
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
 Changelog v3 -> v4:
 - Fix the problem that the scrub's raid bio was cached, which was reported
  by Chris.
 - Remove the 10th patch, the deadlock that was described in that patch 
 doesn't
  exist on the current kernel.
 - Rebase the patchset to the top of integration branch
 
 Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
 patches, can you please cook one on top of v3.18-rc6 instead?
 

BTW, Chris, could you please replace it with this newer version:
https://patchwork.kernel.org/patch/5359251/

Fengguang reported a build failure regarding this.


 -chris
 
 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

Best Regards,
Wang Shilong

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible to undo subvol delete?

2014-12-02 Thread Hugo Mills
On Tue, Dec 02, 2014 at 01:52:52PM +0100, David Sterba wrote:
 On Mon, Dec 01, 2014 at 10:14:03PM -0500, Zygo Blaxell wrote:
   export BTRFS_SUBVOLUME_DELETE_CONFIRM=1
   
   Ideas?
  
  Never rely on aliasing or environment variables for defaults, and never
  change default behavior if your releases are old enough that someone
  has built scripts on top of them.  ;)
 
 Exactly.
 
  If I had to pick the least evil, I'd go for interactive prompting by
  default (do nothing if the interaction fails, e.g. no TTY) and add a
  '-f'/'--force' flag to bypass the prompt.
 
 This sounds acceptable.
 
  This is consistent with the
  way lvm2 and mdadm work when presented with data-losing or otherwise
  questionable commands and parameters.  It will break scripts, but btrfs
  users should still be expecting that for a while as undesirable default
  behaviors are identified.
 
 How is this going to break scripts?

   Any script which relies on being able to delete subvolumes in
unattended operation will now require modification to use -f.
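
   For instance, a nightly rotation line would have to change from the
first form to the second (the '-f' flag being the proposed, currently
hypothetical one):

   btrfs subvolume delete /backups/nightly.7
   btrfs subvolume delete -f /backups/nightly.7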

   Hugo.

-- 
Hugo Mills | Unix: For controlling fungal diseases in crops
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: 65E74AC0  |




Thin metadata and nohole options recommended?

2014-12-02 Thread Shriramana Sharma
From what I'm reading, thin metadata and nohole options were
introduced to make the FS more efficient. Does this mean that for
someone about to do mkfs.btrfs, it is actively recommended to use
these options?

Another pertinent question -- why aren't they default then?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Moving an entire subvol?

2014-12-02 Thread Shriramana Sharma
On Tue, Dec 2, 2014 at 6:58 PM, David Sterba dste...@suse.cz wrote:

 A subvolume is also a snapshotting barrier, so it's convenient to create
 subvolumes in well-known paths that contain data that should not be
 rolled back (/var/log, /srv, bootloader).

Hi David -- a real honour to meet one of the core Btrfs/SuSE (heh,
when that was the spelling!) guys!

That makes sense. Is there anywhere that the official SuSE
recommended subvol layout is mentioned that I can refer to without
having to start up an installer? (I currently chose ext4 for / for
other reasons so I can't refer to my layout.)

I am now reading a SuSECon 2013 presentation by Nyers and Schnell but
they are very generic about the recommendations.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible to undo subvol delete?

2014-12-02 Thread Shriramana Sharma
On Tue, Dec 2, 2014 at 6:26 PM, David Sterba dste...@suse.cz wrote:

 Works for me without the root password on a Tumbleweed installation
 (without apparmor/selinux).

Are you then referring to a btrfs partition mounted with user_subvol_rm_allowed?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा

Re: Possible to undo subvol delete?

2014-12-02 Thread Shriramana Sharma
On Tue, Dec 2, 2014 at 12:41 PM, Satoru Takeuchi
takeuchi_sat...@jp.fujitsu.com wrote:

 Snapper can automatically take a snapshot just before
 taking/deleting snapshots. So, if you delete a snapshot
 by mistake, it's still alive.

Sorta contradicts the whole point of deleting a snapshot, no? Or is it
some sort of trash vs (real) delete mechanism?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Possible to undo subvol delete?

2014-12-02 Thread Zygo Blaxell
On Tue, Dec 02, 2014 at 01:52:52PM +0100, David Sterba wrote:
  On a side note...only root can delete subvolumes, but non-root users
  can create them, which results in...this:
  
  $ /sbin/btrfs sub create foo
  Create subvolume './foo'
  $ date > foo/bar
  $ /sbin/btrfs sub delete foo
  Transaction commit: none (default)
  Delete subvolume '/home/testuser/foo'
  ERROR: cannot delete '/home/testuser/foo' - Operation not permitted
  $ rm -rf foo
  rm: cannot remove `foo': Operation not permitted
  $ cat /proc/version
  Linux version 3.17.1-zb64+ (root@buildbot) (gcc version 4.7.2 (Debian 
  4.7.2-5) ) #1 SMP PREEMPT Tue Oct 21 00:17:49 EDT 2014
  
  ...uh oh?
 
 That's how it works now. I'd like to enable the user to delete their
 subvolumes even without the user_subvol_rm_allowed option someday.

That seems...odd.  It should be symmetrical, i.e. if you can create a
subvol you should be able to delete it, and if can't delete a subvol
then you shouldn't be able to create them either.  I can imagine
quite a bit of havoc could be wrought by an unprivileged user creating
subvols indiscriminately (or in various specific, targeted locations).




Re: Thin metadata and nohole options recommended?

2014-12-02 Thread Zygo Blaxell
On Tue, Dec 02, 2014 at 08:38:15PM +0530, Shriramana Sharma wrote:
 From what I'm reading, thin metadata and nohole options were
 introduced to make the FS more efficient. Does this mean that for
 someone about to do mkfs.btrfs, it is actively recommended to use
 these options?

If you're using older kernels, I'd avoid those options.  I'd still avoid
those options with current kernels unless you're intentionally looking
for bugs.

 Another pertinent question -- why aren't they default then?

It has been one month since the last skinny-metadata fix (fixing a bug
that was as old as the skinny-metadata feature itself) in 3.18-rc3.

It has been two months since the last no-holes fix in 3.17.2.

IMHO if an optional filesystem feature has had a significant bug fixed
in the last six months, it probably shouldn't be enabled by default.  ;)

Skinny-metadata can be enabled after mkfs, though my benchmark results
so far are mixed about whether the theoretical performance benefit
practically materializes.  No-holes is mostly useless unless you are a
fan of huge sparse files.
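
Concretely (the device path is a placeholder, the btrfstune conversion
needs the filesystem unmounted, and you'll want a btrfs-progs recent
enough to have the -x option):

# btrfstune -x /dev/sdX
# mkfs.btrfs -O skinny-metadata,no-holes /dev/sdX

The first enables skinny-metadata on an existing filesystem; the second
opts into both features at mkfs time.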




Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Boaz Harrosh
On 12/02/2014 02:58 PM, Jan Kara wrote:
 On Fri 28-11-14 13:14:21, Ted Tso wrote:
 On Fri, Nov 28, 2014 at 06:23:23PM +0100, Jan Kara wrote:
 Hum, when someone calls fsync() for an inode, you likely want to sync
 timestamps to disk even if everything else is clean. I think that doing
 what you did in last version:
 dirty = inode-i_state  I_DIRTY_INODE;
 inode-i_state = ~I_DIRTY_INODE;
 spin_unlock(inode-i_lock);
 if (dirty  I_DIRTY_TIME)
 mark_inode_dirty_sync(inode);
 looks better to me. IMO when someone calls __writeback_single_inode() we
 should write whatever we have...

 Yes, but we also have to distinguish between what happens on an
 fsync() versus what happens on a periodic writeback if I_DIRTY_PAGES
 (but not I_DIRTY_SYNC or I_DIRTY_DATASYNC) is set.  So there is a
 check in the fsync() code path to handle the concern you raised above.
   Ah, this is the thing you have been likely talking about but which I was
 constantly missing in my thoughts. You don't want to write times when inode
 has only dirty pages and timestamps - 

This I do not understand. I thought that I_DIRTY_TIME, and the whole
lazytime mount option, is only for atime. So if there are dirty
pages then m/ctime has changed as well, and surely we want to
write these times to disk ASAP.

If we are lazy with m/ctime too, then I think I would like an
option for lazy atime only, because m/ctime is cardinal to some
operations even though I might want atime lazy.

Sorry for the slowness, I'm probably missing something.
Thanks
Boaz

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] Btrfs: fix unprotected deletion from pending_chunks list

2014-12-02 Thread Filipe Manana
On block group remove if the corresponding extent map was on the
transaction-pending_chunks list, we were deleting the extent map
from that list, through remove_extent_mapping(), without any
synchronization with chunk allocation (which iterates that list
and adds new elements to it). Fix this by ensuring that this is done
while the chunk mutex is held, since that's the mutex that protects
the list in the chunk allocation code path.

This applies on top (depends on) of my previous patch titled:
Btrfs: fix race between fs trimming and block group remove/allocation

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 17d429d..a7b81b4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9524,19 +9524,25 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
		list_move_tail(&em->list, &root->fs_info->pinned_chunks);
	}
	spin_unlock(&block_group->lock);
-	unlock_chunks(root);
 
	if (remove_em) {
		struct extent_map_tree *em_tree;
 
		em_tree = &root->fs_info->mapping_tree.map_tree;
		write_lock(&em_tree->lock);
+		/*
+		 * The em might be in the pending_chunks list, so make sure the
+		 * chunk mutex is locked, since remove_extent_mapping() will
+		 * delete us from that list.
+		 */
		remove_extent_mapping(em_tree, em);
		write_unlock(&em_tree->lock);
/* once for the tree */
free_extent_map(em);
}
 
+   unlock_chunks(root);
+
btrfs_put_block_group(block_group);
btrfs_put_block_group(block_group);
 
-- 
2.1.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] Btrfs: fix fs mapping extent map leak

2014-12-02 Thread Filipe Manana
On chunk allocation error (label error_del_extent), after adding the
extent map to the tree and to the pending chunks list, we would leave
decrementing the extent map's refcount by 2 instead of 3 (our allocation
+ tree reference + list reference).

Also, on chunk/block group removal, if the block group was on the list
pending_chunks we weren't decrementing the respective list reference.

Detected by 'rmmod btrfs':

[20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
[20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: GWL 
3.17.0-rc5-btrfs-next-1+ #1
[20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[20770.106130]   8800ba867eb8 813e7a13 
8800a2e11040
[20770.106132]  8800ba867ed0 81105d0c  
8800ba867ee0
[20770.106134]  a035d65e 8800ba867ef0 a03b0654 
8800ba867f78
[20770.106136] Call Trace:
[20770.106142]  [813e7a13] dump_stack+0x45/0x56
[20770.106145]  [81105d0c] kmem_cache_destroy+0x4b/0x90
[20770.106164]  [a035d65e] extent_map_exit+0x1a/0x1c [btrfs]
[20770.106176]  [a03b0654] exit_btrfs_fs+0x27/0x9d3 [btrfs]
[20770.106179]  [8109dc97] SyS_delete_module+0x153/0x1c4
[20770.106182]  [8121261b] ? trace_hardirqs_on_thunk+0x3a/0x3c
[20770.106184]  [813ebf52] system_call_fastpath+0x16/0x1b

This applies on top (depends on) of my previous patch titled:
Btrfs: fix race between fs trimming and block group remove/allocation

But the issue in fact was already present before that change, it only
became easier to hit after Josef's 3.18 patch that added automatic
removal of empty block groups.

Signed-off-by: Filipe Manana fdman...@suse.com
---

This replaces my previous patch titled:
Btrfs: fix extent map leak on chunk allocation failure

 fs/btrfs/extent-tree.c | 4 
 fs/btrfs/volumes.c | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a811ed2..17d429d 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9479,6 +9479,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
	memcpy(&key, &block_group->key, sizeof(key));
 
	lock_chunks(root);
+	if (!list_empty(&em->list)) {
+		/* We're in the transaction->pending_chunks list. */
+		free_extent_map(em);
+	}
	spin_lock(&block_group->lock);
	block_group->removed = 1;
/*
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66a5a1e..e936fe3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4496,6 +4496,8 @@ error_del_extent:
free_extent_map(em);
/* One for the tree reference */
free_extent_map(em);
+   /* One for the pending_chunks list reference */
+   free_extent_map(em);
 error:
kfree(devices_info);
return ret;
-- 
2.1.3

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3] fstests: add btrfs test to stress chunk allocation/removal and fstrim

2014-12-02 Thread Filipe Manana
Stress btrfs' block group allocation and deallocation while running
fstrim in parallel. Part of the goal is also to get data block groups
deallocated so that new metadata block groups, using the same physical
device space ranges, get allocated while fstrim is running. This caused
several issues ranging from invalid memory accesses, kernel crashes,
metadata or data corruption, free space cache inconsistencies, free
space leaks and memory leaks.

Signed-off-by: Filipe Manana fdman...@suse.com
---

V2: Addressed Dave's comments.

V3: Missing s/_supported_fs btrfs/_supported_fs generic/
Thanks Eryu.

 tests/generic/038 | 153 ++
 tests/generic/038.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 156 insertions(+)
 create mode 100755 tests/generic/038
 create mode 100644 tests/generic/038.out

diff --git a/tests/generic/038 b/tests/generic/038
new file mode 100755
index 000..5db718c
--- /dev/null
+++ b/tests/generic/038
@@ -0,0 +1,153 @@
+#! /bin/bash
+# FSQA Test No. 038
+#
+# This test was motivated by btrfs issues, but it's generic enough as it
+# doesn't use any btrfs specific features.
+#
+# Stress btrfs' block group allocation and deallocation while running fstrim in
+# parallel. Part of the goal is also to get data block groups deallocated so
+# that new metadata block groups, using the same physical device space ranges,
+# get allocated while fstrim is running. This caused several issues ranging
+# from invalid memory accesses, kernel crashes, metadata or data corruption,
+# free space cache inconsistencies, free space leaks and memory leaks.
+#
+# These issues were fixed by the following btrfs linux kernel patches:
+#
+#   Btrfs: fix invalid block group rbtree access after bg is removed
+#   Btrfs: fix crash caused by block group removal
+#   Btrfs: fix freeing used extents after removing empty block group
+#   Btrfs: fix race between fs trimming and block group remove/allocation
+#   Btrfs: fix race between writing free space cache and trimming
+#   Btrfs: make btrfs_abort_transaction consider existence of new block groups
+#   Btrfs: fix memory leak after block remove + trimming
+#   Btrfs: fix fs mapping extent map leak
+#   Btrfs: fix unprotected deletion from pending_chunks list
+#
+# The issues were found on a qemu/kvm guest with 4 virtual CPUs, 4Gb of ram and
+# scsi-hd devices with discard support enabled (that means hole punching in the
+# disk's image file is performed by the host).
+#
+#---
+#
+# Copyright (C) 2014 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"

tmp=/tmp/$$
status=1	# failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   rm -fr $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_fstrim
+
+rm -f $seqres.full
+
+# Keep allocating and deallocating 1G of data space with the goal of creating
+# and deleting 1 block group constantly. The intention is to race with the
+# fstrim loop below.
+fallocate_loop()
+{
+   local name=$1
+   while true; do
		$XFS_IO_PROG -f -c "falloc -k 0 1G" \
			$SCRATCH_MNT/$name &> /dev/null
		sleep 3
		$XFS_IO_PROG -c "truncate 0" \
			$SCRATCH_MNT/$name &> /dev/null
+   sleep 3
+   done
+}
+
+trim_loop()
+{
+   while true; do
+   $FSTRIM_PROG $SCRATCH_MNT
+   done
+}
+
+# Create a bunch of small files that get their single extent inlined in the
+# btree, so that we consume a lot of metadata space and get a chance of a
+# data block group getting deleted and reused for metadata later. Sometimes
+# the creation of all these files succeeds other times we get ENOSPC failures
+# at some point - this depends on how fast the btrfs' cleaner kthread is
+# notified about empty block groups, how fast it deletes them and how fast
+# the fallocate calls happen. So we don't really care 

Re: [PATCH-v4 1/7] vfs: split update_time() into update_time() and write_time()

2014-12-02 Thread Theodore Ts'o
On Tue, Dec 02, 2014 at 01:20:33AM -0800, Christoph Hellwig wrote:
 Why do you need the additional I_DIRTY flag?  A lesser
 __mark_inode_dirty should never override a stronger one.

Agreed, will fix.

 Otherwise this looks fine to me, except that I would split the default
 implementation into a new generic_update_time helper.

Sure, I can do that.

  XFS doesn't have a ->dirty_time yet, but that way XFS would be able to
  use the I_DIRTY_TIME flag to log the journal timestamps if it so
  desires, and perhaps drop the need for it to use update_time().
 
 We will probably always need a ->update_time to provide proper locking
 around the timestamp updates.

Couldn't you let the VFS set the inode timestamps and then have xfs's
->dirty_time(inode, I_DIRTY_TIME) copy the timestamps to the on-disk
inode structure under the appropriate lock, or am I missing something?

 In the current form the generic lazytime might even be a loss for XFS as
 we're already really good at batching updates from multiple inodes in
 the same cluster for the in-place writeback, so I really don't want
 to just enable it without those optimizations without a lot of testing.

Fair enough; it's not surprising that this might be much more
effective as an optimization for ext4, for no other reason that
timestamp updates are so much heavyweight for us.  I suspect that it
should be a win for btrfs, though, and it should definitely be a win
for those file systems that don't use journalling at all.

 - Ted
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs stuck with lot's of files

2014-12-02 Thread Ian Armstrong
On Tue, 2 Dec 2014 12:48:21 + (UTC)
Duncan 1i5t5.dun...@cox.net wrote:

 Peter Volkov posted on Tue, 02 Dec 2014 04:50:29 +0300 as excerpted:
 
  On Mon, 01/12/2014 at 10:47 -0800, Robert White wrote:
  On 12/01/2014 03:46 AM, Peter Volkov wrote:
(stuff about getting hung up trying to write to one drive)
  
  That drive (/dev/sdn) is probably starting to fail.
  (about failed drive)
  
  Thank you Robert for the answer. It is not likely that drive fails
  here. Similar condition (write to a single drive) happens with
  other drives i.e. such write pattern may happen with any drive.
 
  After looking at what happens longer I see the following. During
  stuck single processor core is busy 100% of CPU in kernel space
  (some kworker is taking 100% CPU).
 
 FWIW, agreed that it's unlikely to be the drive, especially if you're
 not seeing bus resets or drive errors in dmesg and smart says the
 drive is fine, as I expect it does/will.  It may be a btrfs bug or
 scaling issue, of which btrfs still has some, or it could simply be
 the single mode vs raid0 mode issue I explain below.

I encountered a similar problem here a few days ago on a btrfs raid1
partition while using rsync to clone a (~30GB) directory.

Everything started fine, but I came back an hour later to find rsync had
apparently stalled at about 20% with cpu usage at 100% on a single
kworker thread. I was able to kill rsync eventually, and after a while
(don't know how long, but 10 minutes) cpu usage returned to normal.
Restarting rsync resulted in kworker at 100% cpu in less than a minute.
Once stalled there was little drive access happening. Another raid1
partition (mdadm/ext4) on the same drive pair was having no problems.
Nothing showed in the system logs.

In this instance I'd forgotten to delete a temporary 500GB file before
starting rsync, so although recently balanced (musage=80/dusage=80) it
was running at near capacity.

After a reboot, deleting the 500GB file & running balance, everything
returned to normal. Ran rsync again & it completed fine.
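
The balance here was the filtered form mentioned above, i.e.:

# btrfs balance start -musage=80 -dusage=80 /mnt/general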

Running slackware current, with Kernel 3.16.4

# btrfs filesystem df /mnt/general
Data, RAID1: total=1.38TiB, used=1.38TiB
System, RAID1: total=32.00MiB, used=256.00KiB
Metadata, RAID1: total=6.00GiB, used=4.67GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs filesystem show /mnt/general
Label: none  uuid: 592376ea-769f-4abb-915e-aa5e49162d90
Total devices 2 FS bytes used 1.38TiB
devid1 size 1.79TiB used 1.39TiB path /dev/sda4
devid2 size 1.79TiB used 1.39TiB path /dev/sdd4

Btrfs v3.17.2

-- 
Ian
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/2/2014 7:23 AM, Austin S Hemmelgarn wrote:
 Stupid thought, why don't we just add blacklisting based on device
 path like LVM has for pvscan?

That isn't logic that belongs in the kernel, so that is going down the
path of yanking out the device auto probing from btrfs and instead
writing a mount.btrfs helper that can use policies like blacklisting
to auto locate all of the correct devices and pass them all to the
kernel at mount time.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/2/2014 6:54 AM, Anand Jain wrote:
 we have some fundamentally wrong stuff. My original patch tried to
 fix it. But later discovered that some external entities like 
 systmed and boot process is using that bug as a feature and we had 
 to revert the patch.

If systemd is depending on the kernel lying about what device it has
mounted then something is *extremely* broken there and that should be
fixed instead of breaking the kernel.


--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: PROBLEM: #89121 BTRFS mixes up mounted devices with their snapshots

2014-12-02 Thread Phillip Susi

On 12/1/2014 4:45 PM, Konstantin wrote:
 The bug appears also when using mdadm RAID1 - when one of the
 drives is detached from the array then the OS discovers it and
 after a while (not directly, it takes several minutes) it appears
 under /proc/mounts: instead of /dev/md0p1 I see there /dev/sdb1.
 And usually after some hour or so (depending on system workload)
 the PC completely freezes. So discussion about the uniqueness of
 UUIDs or not, a crashing kernel is telling me that there is a
 serious bug.

I'm guessing you are using metadata format 0.9 or 1.0, which put the
metadata at the end of the drive and the filesystem still starts in
sector zero.  1.2 is now the default and would not have this problem
as its metadata is at the start of the disk ( well, 4k from the start
) and the fs starts further down.



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Another No space left balance failure with plenty of space available

2014-12-02 Thread Brendan Hide

Hey, guys

This is on my ArchLinux desktop. Current values as follows and the exact 
error is currently reproducible. Let me know if you want me to run any 
tests/etc. I've made an image (76MB) and can send the link to interested 
parties.


I have come across this once before in the last few weeks. The 
workaround at that time was to run multiple balances with incrementing 
-musage and -dusage values. Whether or not that was a real, imaginary, 
or temporary fix is another story. I have backups but the issue doesn't 
yet appear to cause any symptoms other than these errors.
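
For reference, that incrementing-balance workaround was along these
lines (a sketch; adjust the mount point and steps to taste):

for pct in 10 25 50 75 90; do
    btrfs balance start -musage=$pct -dusage=$pct / || break
done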


The drive is a second-hand 60GB Intel 330 recycled from a decommissioned 
server. The mkfs was run on Nov 4th before I started a migration from 
spinning rust. According to my pacman logs btrfs-progs was on 3.17-1 and 
kernel was 3.17.1-1


root ~ $ uname -a
Linux watricky.invalid.co.za 3.17.4-1-ARCH #1 SMP PREEMPT Fri Nov 21 
21:14:42 CET 2014 x86_64 GNU/Linux

root ~ $ btrfs fi show /
Label: 'arch-btrfs-root'  uuid: 782a0edc-1848-42ea-91cb-de8334f0c248
Total devices 1 FS bytes used 17.44GiB
devid1 size 40.00GiB used 20.31GiB path /dev/sdc1

Btrfs v3.17.2
root ~ $ btrfs fi df /
Data, single: total=18.00GiB, used=16.72GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=738.67MiB
GlobalReserve, single: total=256.00MiB, used=0.00B

Relevant kernel lines:
root ~ $ journalctl -k | grep ^Dec\ 01\ 21\:
Dec 01 21:10:01 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
relocating block group 46166704128 flags 36
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
found 2194 extents
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
relocating block group 45059407872 flags 34
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
found 1 extents
Dec 01 21:10:03 watricky.invalid.co.za kernel: BTRFS info (device sdc1): 
1 enospc errors during balance



-------- Original Message --------
Subject:	Cron <root@watricky> /usr/bin/btrfs balance start -musage=90 / 2>&1 > /dev/null

Date:   Mon, 01 Dec 2014 21:10:03 +0200
From:   (Cron Daemon) r...@watricky.valid.co.za
To: bren...@swiftspirit.co.za



ERROR: error during balancing '/' - No space left on device
There may be more info in syslog - try dmesg | tail



--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Theodore Ts'o
On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
 
 This I do not understand. I thought that I_DIRTY_TIME, and the all
 lazytime mount option, is only for atime. So if there are dirty
 pages then there are also m/ctime that changed and surly we want to
 write these times to disk ASAP.

What are the situations where you are most concerned about mtime or
ctime being accurate after a crash?

I've been running with it on my laptop for a while now, and it's
certainly not a problem for build trees; remember, whenever you need
to update the inode to update i_blocks or i_size, the inode (with its
updated timestamps) will be flushed to disk anyway.

In actual practice, what happens in a build tree is that when make
decides that it needs to update a generated file, the file is
created as a zero-length inode, and m/ctime will be set to the time that
file is created, which is newer than its source files.  As the file is
written, the mtime is updated each time that we actually need to do an
allocating write.  In the case of the linker, it will seek to the
beginning of the file to update the ELF header at the very end of its
operation, and *that* time will be left stale, such that the in-memory
mtime is perhaps a millisecond ahead of the on-disk mtime.  But in the
case of a crash, either time is such that make won't be confused.

I'm not aware of an application which is doing a large number of
non-allocating random writes (such as a database), where
said database actually cares about mtime being correct.  In fact, most
databases use fdatasync() to prevent the mtimes from being sync'ed out
to disk on each transaction, so they don't have guaranteed timestamp
accuracy after a crash anyway.  The problem is even if the database is
using fdatasync(), every five seconds we end up updating the mtime
anyway --- and in the case of ext4, we end up needing to take various
journal locks which on a sufficiently parallel workload and a
sufficiently fast disk, can actually cause measurable contention.

Did you have such a use case or application in mind?
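
For context, once a kernel has this series the option is used like any
other mount option; a minimal sketch, assuming the option keeps the
proposed name:

# remount an existing filesystem with lazytime
mount -o remount,lazytime /
# or persistently via /etc/fstab, e.g.:
#   /dev/sda1  /  ext4  defaults,lazytime  0  1
# timestamp-only updates are then kept in memory and written back when the
# inode is flushed for other reasons, or by the periodic writeback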

 if we are lazytime also with m/ctime then I think I would like an
 option for only atime lazy. because m/ctime is cardinal to some
 operations even though I might want atime lazy.

If there's a sufficiently compelling use case where we do actually
care about mtime/ctime being accurate, and the current semantics don't
provide enough of a guarantee, it's certainly something we could do.
I'd rather keep things simple unless the need is really there.  (After all,
we did create the strictatime mount option, but I'm not sure anyone
ever ends up using it.  It would be a shame if we created a
strictcmtime, which had the same usage rate.)

I'll also note that if it's only about atime updates, with the default
relatime mount option, I'm not sure there's enough of a win to have a
mode to justify a lazyatime only option.  If you really need strict
c/mtime after a crash, maybe the best thing to do is to just simply
not use the lazytime mount option and be done with it.

Cheers,

- Ted


Re: Moving an entire subvol?

2014-12-02 Thread Robert White

On 12/02/2014 07:11 AM, Shriramana Sharma wrote:

On Tue, Dec 2, 2014 at 6:58 PM, David Sterba dste...@suse.cz wrote:


A subvolume is also a snapshotting barrier, so it's convenient to create
subvolumes in well-known paths that contain data that should not be
rolled back (/var/log, /srv, bootloader).



That makes sense. Is there anywhere that the official SuSE
recommended subvol layout is mentioned that I can refer to without
having to start up an installer? (I currently chose ext4 for / for
other reasons so I can't refer to my layout.)


There are lots of ways to arrange your system

My preference is to create the snapshots in a super-volume outside of 
the normally mounted hierarchy. This simplifies the normal operation of 
tools like locate which don't understand that the duplicate files from 
the snapshot are not interesting. It also means that live operations 
(e.g. anything using find) naturally will not traverse the snapshots 
unless the supervolume is mounted explicitly.


I typically call the active root of the system /__System and set it as 
the default subvolume to make booting easier.


As in...

mkfs.btrfs /dev/whatever
mount /dev/whatever /mnt
btrfs sub create /mnt/__System
btrfs sub create /mnt/__System/home
btrfs sub set-default (number) /mnt    # the number comes from: btrfs sub list /mnt
umount /mnt
mount /mnt
(create OS layout in /mnt)


Then when the snapshotting goes on...

mount -o subvol=/ /dev/whatever /maintenance
SUFFIX=$(date +_BACKUP_%Y-%m-%d)
cd /maintenance
btrfs sub snap -r __System __System${SUFFIX}
btrfs sub snap -r __System/home __System_home${SUFFIX}
# etc
cd /
umount /maintenance


---

The Real Way™ to think about your active subvolume layouts is to think 
about what you really need to preserve and how often.


/home is an obvious candidate for frequent snapshots as it is the place 
individual users are most likely to mess up, and it has the most 
irreplaceable data.


/ (e.g. the semantic system root) [in my example /__System], not 
counting its various subvolumes, really only needs backing up before 
system software modifications via apt/yum/portage/whatever your distro 
uses, or right before you start doing anything tricky in /etc.


/etc might rate its own subvolume if you are a tinkerer or your 
system-wide configuration changes a lot.


/var tends to have per system configuration stuff


But the real question is how much complexity of maintenance does the 
system _really_ need, and how much of it are you _really_ going to maintain.


The desire to use a feature just 'cause it's there should be resisted. 
If you are not going to be using the snapshots feature, if you are just 
dropping the box in and you are going to ignore it, then don't subvolume 
it at all.


You are looking for a balance between the theoretical ideal and the 
practical outcome. If you don't know exactly why you are putting the 
subvolume in place then it will likely just end up annoying you without 
giving any value.


Same for taking and positioning the snapshots.

This is a corollary of the rule that states: a backup script that you've 
never done a restore from should be assumed to be an _unsafe_ or 
incomplete backup, or no backup at all.
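
A minimal sketch of testing a restore from a read-only snapshot with
send/receive (the scratch device name is hypothetical; the snapshot name
follows the example above):

mkfs.btrfs -f /dev/scratch
mount /dev/scratch /mnt/scratch
btrfs send /maintenance/__System_BACKUP_2014-12-02 | btrfs receive /mnt/scratch
# compare the restored tree against the original snapshot
diff -r /maintenance/__System_BACKUP_2014-12-02 \
    /mnt/scratch/__System_BACKUP_2014-12-02 && echo restore OK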





Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Andreas Dilger
On Dec 2, 2014, at 12:23 PM, Theodore Ts'o ty...@mit.edu wrote:
 On Tue, Dec 02, 2014 at 07:55:48PM +0200, Boaz Harrosh wrote:
 
  This I do not understand. I thought that I_DIRTY_TIME, and the whole
  lazytime mount option, were only for atime. So if there are dirty
  pages then there are also m/ctime changes, and surely we want to
  write these times to disk ASAP.
 
 What are the situations where you are most concerned about mtime or
 ctime being accurate after a crash?
 
 I've been running with it on my laptop for a while now, and it's
 certainly not a problem for build trees; remember, whenever you need
 to update the inode to update i_blocks or i_size, the inode (with its
 updated timestamps) will be flushed to disk anyway.
[snip]
 I'm not aware of an application which is doing a large number of
 non-allocating random writes (for example, such as a database), where
 said database actually cares about mtime being correct.
[snip]
 Did you have such a use case or application in mind?


One thing that comes to mind is touch/utimes()/utimensat().  Those
should definitely not result in timestamps being kept only in memory
for 24h, since the whole point of those calls is to update the times.
It makes sense for these APIs to dirty the inode for proper writeout.

Cheers, Andreas

 if we are lazytime also with m/ctime then I think I would like an
 option for only atime lazy. because m/ctime is cardinal to some
 operations even though I might want atime lazy.
 
 If there's a sufficiently compelling use case where we do actually
 care about mtime/ctime being accurate, and the current semantics don't
 provide enough of a guarantee, it's certainly something we could do.
 I'd rather keep things simple unless the need is really there.  (After all,
 we did create the strictatime mount option, but I'm not sure anyone
 ever ends up using it.  It would be a shame if we created a
 strictcmtime, which had the same usage rate.)
 
 I'll also note that if it's only about atime updates, with the default
 relatime mount option, I'm not sure there's enough of a win to have a
 mode to justify a lazyatime only option.  If you really need strict
 c/mtime after a crash, maybe the best thing to do is to just simply
 not use the lazytime mount option and be done with it.
 
 Cheers,
 
   - Ted


Cheers, Andreas







Re: [PATCH-v5 1/5] vfs: add support for a lazytime mount option

2014-12-02 Thread Theodore Ts'o
On Tue, Dec 02, 2014 at 01:37:27PM -0700, Andreas Dilger wrote:
 
 One thing that comes to mind is touch/utimes()/utimensat().  Those
 should definitely not result in timestamps being kept only in memory
 for 24h, since the whole point of those calls is to update the times.
 It makes sense for these APIs to dirty the inode for proper writeout.

Not a problem.  Touch/utimes* go through notify_change() and
->setattr, so they won't go through the I_DIRTY_TIME code path.

- Ted


Re: Moving an entire subvol?

2014-12-02 Thread Austin S Hemmelgarn

On 2014-12-02 10:11, Shriramana Sharma wrote:

On Tue, Dec 2, 2014 at 6:58 PM, David Sterba dste...@suse.cz wrote:


A subvolume is also a snapshotting barrier, so it's convenient to create
subvolumes in well-known paths that contain data that should not be
rolled back (/var/log, /srv, bootloader).


Hi David -- a real honour to meet one of the core Btrfs/SuSE (heh,
when that was the spelling!) guys!

That makes sense. Is there anywhere that the official SuSE
recommended subvol layout is mentioned that I can refer to without
having to start up an installer? (I currently chose ext4 for / for
other reasons so I can't refer to my layout.)

I am now reading a SuSECon 2013 presentation by Nyers and Schnell but
they are very generic about the recommendations.


Here's my approach to things:
In the top level of the btrfs filesystem I use for /, I have a subvolume 
called /root. This is what gets mounted on /. I also have a separate 
subvolume called /home for the home directories when I have those on the 
same FS. I place /boot on an entirely separate filesystem because I use 
a bunch of mount options there that would break or slow down other 
filesystems (most notably, noexec, nosuid, nodev, and sync). Within 
both /home and /root, I use a handful of subvolumes to control what gets 
saved in a snapshot, the most notable examples being /var/log, 
/usr/portage, and /home/austin/dropbox.
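
Expressed as commands, that layout would be created roughly like this (a
sketch; the device name is hypothetical and the subvolume list shortened):

mkfs.btrfs /dev/sdX2
mount /dev/sdX2 /mnt
btrfs sub create /mnt/root                  # later mounted as /
btrfs sub create /mnt/home                  # later mounted as /home
mkdir -p /mnt/root/var /mnt/root/usr /mnt/home/austin
# nested subvolumes act as snapshot barriers within /root and /home
btrfs sub create /mnt/root/var/log
btrfs sub create /mnt/root/usr/portage
btrfs sub create /mnt/home/austin/dropbox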


As far as snapshots go, I take a snapshot of /root every time I boot 
and keep the past 2 days' worth, take a snapshot of /home hourly and 
keep a week's worth, and do a snapshot of both when I generate a system 
backup.  I generally don't do snapshots of /boot, as I keep around the 
previous few kernel versions anyway, and mark things there as immutable 
so that I can't accidentally mess them up.
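
As a sketch, that schedule could be driven from cron along these lines
(the /maintenance mount of the top level is an assumption, and pruning of
old snapshots is omitted):

# /etc/crontab
@reboot  root  btrfs sub snap -r /maintenance/root /maintenance/root_$(date +\%F)
0 * * * *  root  btrfs sub snap -r /maintenance/home /maintenance/home_$(date +\%F_\%H)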






Re: btrfs stuck with lot's of files

2014-12-02 Thread Duncan
Ian Armstrong posted on Tue, 02 Dec 2014 18:56:13 + as excerpted:

 On Tue, 2 Dec 2014 12:48:21 + (UTC)
 Duncan 1i5t5.dun...@cox.net wrote:
 
 FWIW, agreed that it's unlikely to be the drive, especially if you're
 not seeing bus resets or drive errors in dmesg and smart says the drive
 is fine, as I expect it does/will.  It may be a btrfs bug or scaling
 issue, of which btrfs still has some, or it could simply be the single
 mode vs raid0 mode issue I explain below.
 
 I encountered a similar problem here a few days ago on a btrfs raid1
 partition while using rsync to clone a (~30GB) directory.
 
 Everything started fine, but I came back an hour later to find rsync had
 apparently stalled at about 20% with cpu usage at 100% on a single
 kworker thread. I was able to kill rsync eventually, and after a while
 (don't know how long, but 10 minutes) cpu usage returned to normal.
 Restarting rsync resulted in kworker at 100% cpu in less than a minute.
 Once stalled there was little drive access happening. Another raid1
 partition (mdadm/ext4) on the same drive pair was having no problems.
 Nothing showed in the system logs.
 
 In this instance I'd forgotten to delete a temporary 500GB file before
 starting rsync, so although recently balanced (musage=80/dusage=80) it
 was running at near capacity.
 
 After a reboot, deleting the 500GB file  running balance, everything
 returned to normal. Ran rsync again  it completed fine.
 
 Running slackware current, with Kernel 3.16.4

FWIW that was my point -- there are still such bugs out there, often 
corner-case so they don't affect most folks most of the time, but out 
there.

I had a similar stall recently, a kworker stuck at 100% that went away 
after I killed whatever app had triggered the problem (pan, the news 
program I'm writing this with, as it happens).  In my case I chalked it 
up to a known corner-case bug in my slightly old 3.17.0 kernel (my use-
case doesn't do read-only snapshots so I'm not affected by that known bug 
that effectively blacklists 3.17.0 for some users; this would have been a 
different one).  I don't /know/ it was that bug, but it most likely was, 
as it's a known but rare corner-case that AFAIK is already fixed in the 
late 3.18-rcs.
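
For what it's worth, when a kworker pegs a CPU like that, a few generic
commands can show what it is spinning on (a sketch; <PID> is a placeholder
and the sysrq line needs sysrq enabled):

top -b -n1 | grep kworker | sort -rnk9 | head   # find the busy kworker
cat /proc/<PID>/stack                           # sample its kernel stack
echo l > /proc/sysrq-trigger; dmesg | tail -n50 # backtraces of all CPUs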

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



raid5 filesystem only mountable ro and not currently fixable after a drive produced read errors

2014-12-02 Thread Konstantin Matuschek
Hello,

I have a raid5 btrfs that refuses to mount rw (ro works) and I think I'm out of 
options to get it fixed.

First, this is roughly what got my filesystem corrupted:


1. I created the raid5 fs in March 2014 using the latest code available (Btrfs 
3.12) on four 4TB devices (each encrypted using dm-crypt). I also created 3 
subvolumes. The command used was:
mkfs.btrfs -O skinny-metadata -d raid5 -m raid5 /dev/mapper/wdred4tb[2345]


2. Around October I noticed one of the drives (wdred4tb3) produced read errors. 
Running a long smartctl self-test would fail as well, and the reported 
Raw_Read_Error_Rate increased steadily.


3. Since I had a spare drive around, but replacing a device wasn't implemented 
back then for raid5, I decided to use the add-then-delete approach outlined 
here: http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Raid5-Status . I 
did *not* remove the failing drive for that.
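
In commands, the add-then-delete approach amounts to the following (a
sketch; per the fi show output below, wdred4tb1 is the added spare):

btrfs device add /dev/mapper/wdred4tb1 /mnt/box
btrfs device delete /dev/mapper/wdred4tb3 /mnt/box   # relocates its data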


4. The rebalance triggered by the btrfs device delete /dev/mapper/wdred4tb3 
command crashed a few times (and read errors kept increasing), but each time I 
started it, a few hundred GiB were moved over to the newly added device. But 
when 414GiB were left on the failing drive, it didn't get further. It now still 
looks like this:
# btrfs fi show /mnt/box
Label: none  uuid: 9f3a48b7-1b88-44f0-a387-f3712fc2c0b6
Total devices 5 FS bytes used 4.43TiB
devid1 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb2
devid2 size 3.64TiB used 414.00GiB path /dev/mapper/wdred4tb3
devid3 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb4
devid4 size 3.64TiB used 1.50TiB path /dev/mapper/wdred4tb5
devid5 size 3.64TiB used 1.10TiB path /dev/mapper/wdred4tb1
Btrfs v3.17.2-50-gcc0723c


5. I tried several things (probably a new kernel around 3.17, probably 
affected by the snapshot bug, but I don't use snapshots, only subvolumes) and 
ended up doing a btrfsck --repair (v3.17-rc3) on the filesystem. I still have 
the complete output of that; let me know if you need it. Here are some lines 
that seem interesting to me:
# btrfsck --repair /dev/mapper/wdred4tb2
enabling repair mode
Checking filesystem on /dev/mapper/wdred4tb2
UUID: 9f3a48b7-1b88-44f0-a387-f3712fc2c0b6
checking extents
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
[...]
owner ref check failed [500170752 16384]
repair deleting extent record: key 500170752 169 0
adding new tree backref on start 500170752 len 16384 parent 7 root 7
[...]
repaired damaged extent references
checking free space cache
cache and super generation don't match, space cache will be invalidated
checking fs roots
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
[...]
Check tree block failed, want=668598272, have=668794880
Csum didn't match
[...]
checking csums
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
Check tree block failed, want=500170752, have=5421517155842471019
read block failed check_tree_block
Error going to next leaf -5
checking root refs
found 1469190132145 bytes used err is 0
total csum bytes: 4750630700
total tree bytes: 6141100032
total fs tree bytes: 345964544
total extent tree bytes: 194052096
btree space waste bytes: 867842012
file data blocks allocated: 4865657503744
 referenced 4895640494080
Btrfs v3.17-rc3
extent buffer leak: start 842235904 len 16384
extent buffer leak: start 842235904 len 16384
[...]


6. As far as I can remember, that was the point when mounting rw stopped 
working. Mounting ro seems to work quite fine though (no idea if data was 
lost/corrupted).



I removed the failing drive today and updated to the latest integration 
branch of cmason's git repository (including Miao Xie's patches for raid56 
replacement) and David's integration-20141125 branch for btrfs-progs. With 
those, I tried a mount with -o ro,degraded,recovery (works, but didn't 
recover). I also tried a btrfsck again, but it just prints some errors and then 
exits.
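
For reference, those mount attempts look roughly like this (device and
mount point as used earlier in this report):

mount -o ro,degraded,recovery /dev/mapper/wdred4tb2 /mnt/box   # works
umount /mnt/box
mount -o degraded /dev/mapper/wdred4tb2 /mnt/box               # rw, fails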
Mounting rw with -o degraded gives the following output in dmesg:

[ 7358.907119] BTRFS: open /dev/dm-4 failed
[ 7358.907860] BTRFS info (device dm-6): allowing degraded mounts
[ 7358.907866] BTRFS info (device dm-6): enabling auto recovery
[ 7358.907870] BTRFS info (device dm-6): disk space caching is enabled
[ 7358.907872] BTRFS: has skinny extents
[ 7360.549993] BTRFS: bdev /dev/dm-4 errs: wr 0, rd 22288, flush 0, corrupt 0, 
gen 0
[ 7377.923939] BTRFS info (device dm-6): The free space cache file 
(7065489637376) is invalid. skip it

[ 7383.443486] BTRFS (device dm-6): parent transid verify failed on 

Re: Possible to undo subvol delete?

2014-12-02 Thread Satoru Takeuchi

(2014/12/03 0:17), Shriramana Sharma wrote:

On Tue, Dec 2, 2014 at 12:41 PM, Satoru Takeuchi
takeuchi_sat...@jp.fujitsu.com wrote:


Snapper can automatically take a snapshot just before
taking/deleting snapshots. So, if you delete a snapshot
by mistake, it's still alive.


Sorta contradicts the whole point of deleting a snapshot, no? Or is it
some sort of trash vs (real) delete mechanism?



It's not a feature of Btrfs itself; it's a snapper feature.
It works as a helper for snapshot management.

1. You take a snapshot /snap with the snapper create command.
2. You delete /snap with the snapper delete command by mistake.
   Then snapper takes a pre snapshot just before deleting
   /snap.
3. Now /snap is deleted; however, the pre snapshot, which is
   the same as /snap before the deletion, is still alive.
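
With the snapper command line, that flow looks roughly like this (a
sketch, assuming snapper behaves as described; the config name and the
snapshot number are hypothetical):

snapper -c data create --description "manual snap"   # step 1
snapper -c data list                                 # note its number, e.g. 42
snapper -c data delete 42                            # step 2, by mistake
snapper -c data list                                 # the pre snapshot taken
                                                     # in step 2 is still listed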

I don't know how Btrfs itself could undo the deletion of a snapshot.
It works if you manage snapshots not with btrfs directly,
but with snapper.

If I'm misunderstanding something, sorry for the noise.

Thanks,
Satoru



Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
 
 
 On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 This patchset implements the device scrub/replace function for RAID56; most
 of the implementation for the common data is similar to the other RAID
 types. The difference, and the difficulty, is the parity processing. The
 basic idea is to read and check the data which has a checksum outside of
 the raid56 stripe lock; if the data is right, then lock the raid56 stripe,
 read out the other data in the same stripe, and if no IO error happens,
 calculate the parity and check the original one. If the original parity is
 right, the parity scrub passes; otherwise, write out the new one. But if
 the common data (not parity) that we read out is wrong, we will try to
 recover it, and then check and repair the parity.

 And in order to avoid making the code more and more complex, we copied some
 of the common data processing code for the parity; the cleanup work is on
 my TODO list.

 We have done some tests and the patchset worked well. Of course, more tests
 are welcome. If you are interested in using or testing it, you can pull
 the patchset from

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

 Changelog v3 -> v4:
 - Fix the problem that the scrub's raid bio was cached, which was reported
   by Chris.
 - Remove the 10th patch; the deadlock that was described in that patch
   doesn't exist on the current kernel.
 - Rebase the patchset to the top of the integration branch
 
 Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
 patches, can you please cook one on top of v3.18-rc6 instead?

No problem.

Thanks
Miao

 
 -chris
 
 



Re: Thin metadata and nohole options recommended?

2014-12-02 Thread Shriramana Sharma
On Tue, Dec 2, 2014 at 10:30 PM, Zygo Blaxell
ce3g8...@umail.furryterror.org wrote:

 IMHO if an optional filesystem feature has had a significant bug fixed
 in the last six months, it probably shouldn't be enabled by default.  ;)

Excellent point. I noted that my SuSE Tumbleweed enables extref by
default but not any of the others. Is that a SuSE-specific
modification or upstream?
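
One way to check what upstream btrfs-progs enables by default is to ask
mkfs itself; a sketch (the exact output format varies by version):

mkfs.btrfs -O list-all   # lists the features this mkfs.btrfs knows about;
                         # newer versions also mark the defaults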

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: [PATCH v4 00/10] Implement device scrub/replace for RAID56

2014-12-02 Thread Miao Xie
On Tue, 2 Dec 2014 08:28:22 -0500, Chris Mason wrote:
 On Tue, Dec 2, 2014 at 7:39 AM, Miao Xie mi...@cn.fujitsu.com wrote:
 This patchset implements the device scrub/replace function for RAID56; most
 of the implementation for the common data is similar to the other RAID
 types. The difference, and the difficulty, is the parity processing. The
 basic idea is to read and check the data which has a checksum outside of
 the raid56 stripe lock; if the data is right, then lock the raid56 stripe,
 read out the other data in the same stripe, and if no IO error happens,
 calculate the parity and check the original one. If the original parity is
 right, the parity scrub passes; otherwise, write out the new one. But if
 the common data (not parity) that we read out is wrong, we will try to
 recover it, and then check and repair the parity.

 And in order to avoid making the code more and more complex, we copied some
 of the common data processing code for the parity; the cleanup work is on
 my TODO list.

 We have done some tests and the patchset worked well. Of course, more tests
 are welcome. If you are interested in using or testing it, you can pull
 the patchset from

   https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace

 Changelog v3 -> v4:
 - Fix the problem that the scrub's raid bio was cached, which was reported
   by Chris.
 - Remove the 10th patch; the deadlock that was described in that patch
   doesn't exist on the current kernel.
 - Rebase the patchset to the top of the integration branch
 
 Thanks, I'll try this today.  I need to rebase in a new version of the RCU 
 patches, can you please cook one on top of v3.18-rc6 instead?

I have updated my raid56-scrub-replace branch, please re-pull it.
  https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
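
For anyone following along, fetching that branch is standard git:

git fetch https://github.com/miaoxie/linux-btrfs.git raid56-scrub-replace
git checkout -b raid56-scrub-replace FETCH_HEAD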

The v4 patchset on the mailing list can be applied on v3.18-rc6 successfully, so
I haven't updated it.

Thanks
Miao


Re: Moving an entire subvol?

2014-12-02 Thread Shriramana Sharma
On Tue, Dec 2, 2014 at 2:04 PM, Hugo Mills h...@carfax.org.uk wrote:

 Is that correct: what btr sub list shows as top level is indeed the
 parent subvolume?

No, it's the top-level subvolume. (See my earlier mail about
 nomenclature). Parent subvolume has a number of meanings, none of
 which should be the subvolume with subvolid 5.

Um, I searched my inbox but didn't find a specific definition from you
for "top-level". You only said it's better to avoid calling it "root"
to avoid confounding it with the subvol that may be mounted at root,
i.e. /.

IIUC the top-level subvolume can only be subvolid 5 which accords
with your later comment:

 that putting files in the top-level subvol can't do what most people
 want to do with it. Hence the recommended subvol management layout at
 [1] https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Subvolumes

... which means that I am not able to understand the output of btr sub
list which gives the subvolid of whichever subvol is currently the
parent (as in outer nesting) subvol. Observe:

$ btr sub list .
ID 257 gen 10 top level 5 path test1
ID 258 gen 10 top level 5 path test2
ID 259 gen 9 top level 258 path test2/foo
$ sudo mv test2/foo test1/
$ btr sub list .
ID 257 gen 10 top level 5 path test1
ID 258 gen 10 top level 5 path test2
ID 259 gen 9 top level 257 path test1/foo
$

So now what is the meaning of top level?

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: Moving an entire subvol?

2014-12-02 Thread Shriramana Sharma
On Wed, Dec 3, 2014 at 2:43 AM, Austin S Hemmelgarn
ahferro...@gmail.com wrote:
 Here's my approach to things:

Wow, thanks a lot people! I'm really benefiting from your experience here.

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


Re: Possible to undo subvol delete?

2014-12-02 Thread Shriramana Sharma
On Wed, Dec 3, 2014 at 5:41 AM, Satoru Takeuchi
takeuchi_sat...@jp.fujitsu.com wrote:
 2. You delete /snap with the snapper delete command by mistake.
    Then snapper takes a pre snapshot just before deleting
    /snap.
 3. Now /snap is deleted; however, the pre snapshot, which is
    the same as /snap before the deletion, is still alive.

 I don't know how Btrfs itself could undo the deletion of a snapshot.
 It works if you manage snapshots not with btrfs directly,
 but with snapper.

 If I'm misunderstanding something, sorry for the noise.

No, nothing misunderstood. Excellent illustration. So using snapper is
sorta like using the higher-level trash instead of lower-level rm, so
that even after we delete, it's still available...

-- 
Shriramana Sharma ஶ்ரீரமணஶர்மா श्रीरमणशर्मा


[PATCH 0/6] More generic inode nlink repair function

2014-12-02 Thread Qu Wenruo
Updated patches 4 and 6; the others are not changed.
This nlink repair function is more generic than the original one.

The old one could only handle a specific case where the inode_ref is
invalid, either pointing to a non-existent parent inode or pointing to an
invalid inode (not a dir, or with a conflicting index/name).

The new one will reset all the backrefs, no matter whether they are valid
or not, and re-add all the valid backrefs; this makes the nlink repair
handle more corruption cases.

Qu Wenruo (6):
  btrfs-progs: print root dir verbose error in fsck
  btrfs-progs: Import btrfs_insert/del/lookup_extref() functions.
  btrfs-progs: Import lookup/del_inode_ref() function.
  btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.
  btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found' 
   fsck mechanism.
  btrfs-progs: Add fixing function for inodes whose nlink dismatch

 Makefile |   2 +-
 cmds-check.c | 311 --
 ctree.c  |   6 +
 ctree.h  |  38 +
 inode-item.c | 318 +++
 inode.c  | 484 +++
 6 files changed, 1148 insertions(+), 11 deletions(-)
 create mode 100644 inode.c

-- 
2.1.3



[PATCH RESEND 1/6] btrfs-progs: print root dir verbose error in fsck

2014-12-02 Thread Qu Wenruo
Before this patch, when btrfsck found an error in a root dir, it would only
output the message "root %llu root dir %llu error" without any
details.

Just add print_inode_error() to print out the whole error.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 389674f..9fc1410 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1984,6 +1984,7 @@ static int check_inode_recs(struct btrfs_root *root,
fprintf(stderr, root %llu root dir %llu error\n,
(unsigned long long)root-root_key.objectid,
(unsigned long long)root_dirid);
+   print_inode_error(root, rec);
error++;
}
} else {
-- 
2.1.3



[PATCH v2 6/6] btrfs-progs: Add fixing function for inodes whose nlink dismatch

2014-12-02 Thread Qu Wenruo
[BUG]
At least two users have already hit a bug in btrfs causing files to go
missing (a Chromium config file).
The missing file's nlink is still 1, but its backref points to a
non-existent parent inode.

This should be a kernel bug, but a btrfsck fix is needed anyway.

[FIX]
For such an nlink-mismatched inode, we delete all the invalid backrefs;
if there is no valid backref left, we create 'lost+found' under the root
of the subvolume and add a link to the inode in that directory.

Reported-by: Mike Gavrilov mikhail.v.gavri...@gmail.com
Reported-by: Ed Tomlinson e...@aei.ca
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com

---
changelog:
v2:
   Use a more generic function, reset_nlink(), to repair the inode
   nlink.
   It will remove all backrefs, including valid ones (along with their
   dir_item/index), set nlink to 0, and then add the valid ones back.
   This reset_nlink() method can handle more nlink errors: not only
   invalid inode_refs but also pure nlink mismatches (2 valid inode_refs
   but nlink is 1).
---
 cmds-check.c | 236 ++-
 1 file changed, 232 insertions(+), 4 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 6419caf..627b794 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -27,6 +27,7 @@
 #include unistd.h
 #include getopt.h
 #include uuid/uuid.h
+#include math.h
 #include ctree.h
 #include volumes.h
 #include repair.h
@@ -1837,6 +1838,18 @@ static int repair_inode_backrefs(struct btrfs_root *root,
struct btrfs_trans_handle *trans;
struct btrfs_key location;
 
+   ret = check_dir_conflict(root, backref-name,
+backref-namelen,
+backref-dir,
+backref-index);
+   if (ret) {
+   /*
+* let nlink fixing routine to handle it,
+* which can do it better.
+*/
+   ret = 0;
+   break;
+   }
location.objectid = rec-ino;
location.type = BTRFS_INODE_ITEM_KEY;
location.offset = 0;
@@ -1874,20 +1887,233 @@ static int repair_inode_backrefs(struct btrfs_root 
*root,
return ret ? ret : repaired;
 }
 
+/*
+ * Search for the inode_ref of given 'ino' to get the filename and
+ * btrfs type.
+ * This will only use the first inode_ref.
+ */
+static int find_file_name_type(struct btrfs_root *root,
+  struct btrfs_path *path,
+  u64 ino, char *buf,
+  int *namelen, u8 *type)
+{
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   struct btrfs_inode_ref *inode_ref;
+   struct btrfs_inode_item *inode_item;
+   struct extent_buffer *node;
+   u32 ret_namelen;
+   int slot;
+   int ret = 0;
+
+   /* Search for name in backref(Use the first one) */
+   key.objectid = ino;
+   key.type = BTRFS_INODE_REF_KEY;
+   key.offset = 0;
+
+   ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
+   if (ret  0)
+   goto out;
+   node = path-nodes[0];
+   slot = path-slots[0];
+   btrfs_item_key_to_cpu(node, found_key, slot);
+   if (found_key.objectid != ino ||
+   found_key.type != BTRFS_INODE_REF_KEY) {
+   ret = -ENOENT;
+   goto out;
+   }
+   inode_ref = btrfs_item_ptr(node, slot, struct btrfs_inode_ref);
+   ret_namelen = btrfs_inode_ref_name_len(node, inode_ref);
+   read_extent_buffer(node, buf, (unsigned long)(inode_ref + 1),
+  ret_namelen);
+   /* Search for inode type */
+   ret = btrfs_previous_item(root, path, ino, BTRFS_INODE_ITEM_KEY);
+   if (ret) {
+   if (ret  0)
+   ret = -ENOENT;
+   goto out;
+   }
+   node = path-nodes[0];
+   slot = path-slots[0];
+   inode_item = btrfs_item_ptr(node, slot, struct btrfs_inode_item);
+
+   if (namelen)
+   *namelen = ret_namelen;
+   if (type)
+   *type = imode_to_type(btrfs_inode_mode(node, inode_item));
+out:
+   btrfs_release_path(path);
+   return ret;
+}
+
+/* Reset the nlink of the inode to the correct one */
+static int reset_nlink(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct btrfs_path *path,
+  struct inode_record *rec)
+{
+   struct inode_backref *backref;
+   struct inode_backref *tmp;
+   struct btrfs_key key;
+   struct btrfs_inode_item *inode_item;
+   int ret = 0;
+
+   /* Remove all backref including the valid ones */
+   list_for_each_entry_safe(backref, tmp, rec-backrefs, list) 

[PATCH v3 4/6] btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.

2014-12-02 Thread Qu Wenruo
Add btrfs_unlink() and btrfs_add_link() functions in inode.c,
for the incoming btrfs_mkdir() and later inode operation functions.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
Changelog:
v2:
   Do the dir name conflict check before adding an inode_backref or
   dir_item/index.
v3:
   Allow btrfs_add_link() to add a dir_index with a given index, iff the
   index doesn't conflict with an existing one.
   Also export the check_dir_conflict() function in inode.c, and fix a
   false alert case in it.
---
 Makefile |   2 +-
 cmds-check.c |   7 +-
 ctree.h  |  15 ++
 inode.c  | 438 +++
 4 files changed, 455 insertions(+), 7 deletions(-)
 create mode 100644 inode.c

diff --git a/Makefile b/Makefile
index 4cae30c..d7a5cbe 100644
--- a/Makefile
+++ b/Makefile
@@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o 
print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
- ulist.o qgroup-verify.o backref.o
+ ulist.o qgroup-verify.o backref.o inode.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/cmds-check.c b/cmds-check.c
index 9fc1410..6419caf 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1599,14 +1599,9 @@ static int repair_inode_orphan_item(struct 
btrfs_trans_handle *trans,
struct btrfs_path *path,
struct inode_record *rec)
 {
-   struct btrfs_key key;
int ret;
 
-   key.objectid = BTRFS_ORPHAN_OBJECTID;
-   key.type = BTRFS_ORPHAN_ITEM_KEY;
-   key.offset = rec-ino;
-
-   ret = btrfs_insert_empty_item(trans, root, path, key, 0);
+   ret = btrfs_add_orphan_item(trans, root, path, rec-ino);
btrfs_release_path(path);
if (!ret)
rec-errors = ~I_ERR_NO_ORPHAN_ITEM;
diff --git a/ctree.h b/ctree.h
index 32b1286..21bafd2 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2444,4 +2444,19 @@ static inline int is_fstree(u64 rootid)
return 1;
return 0;
 }
+
+/* inode.c */
+int check_dir_conflict(struct btrfs_root *root,
+  char *name, int namelen,
+  u64 dir, u64 index);
+int btrfs_add_link(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+  u64 ino, u64 parent_ino, char *name, int namelen,
+  u8 type, u64 *index, int add_backref);
+int btrfs_unlink(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+u64 ino, u64 parent_ino, u64 index, const char *name,
+int namelen, int add_orphan);
+int btrfs_add_orphan_item(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ struct btrfs_path *path,
+ u64 ino);
 #endif
diff --git a/inode.c b/inode.c
new file mode 100644
index 000..aeb0a5c
--- /dev/null
+++ b/inode.c
@@ -0,0 +1,438 @@
+/*
+ * Copyright (C) 2014 Fujitsu.  All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+/*
+ * Unlike inode.c in kernel, which can use most of the kernel infrastructure
+ * like inode/dentry things, in user-land, we can only use inode number to
+ * do directly operation on extent buffer, which may cause extra searching,
+ * but should not be a huge problem since progs is less performence sensitive.
+ */
+#include sys/stat.h
+#include ctree.h
+#include transaction.h
+#include disk-io.h
+#include time.h
+
+/*
+ * Find a free inode index for later btrfs_add_link().
+ * Currently just search from the largest dir_index and +1.
+ */
+static int btrfs_find_free_dir_index(struct btrfs_root *root, u64 dir_ino,
+u64 *ret_ino)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   u64 ret_val = 2;
+   int ret = 0;
+
+   if (!ret_ino)
+   return 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   key.objectid = dir_ino;
+   

[PATCH RESEND 3/6] btrfs-progs: Import lookup/del_inode_ref() function.

2014-12-02 Thread Qu Wenruo
Import the lookup/del_inode_ref() functions in inode-item.c, as base functions
for the incoming btrfs_add_link() and btrfs_unlink() functions.

Also modify btrfs_insert_inode_ref() and split_leaf(), making them able
to deal with the EXTENDED_IREF incompat flag.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 ctree.c  |   6 
 ctree.h  |  10 ++
 inode-item.c | 112 +++
 3 files changed, 128 insertions(+)

diff --git a/ctree.c b/ctree.c
index 23399e2..bd6cb12 100644
--- a/ctree.c
+++ b/ctree.c
@@ -2015,6 +2015,12 @@ static noinline int split_leaf(struct btrfs_trans_handle 
*trans,
int split;
int num_doubles = 0;
 
+   l = path-nodes[0];
+   slot = path-slots[0];
+   if (extend  data_size + btrfs_item_size_nr(l, slot) +
+   sizeof(struct btrfs_item)  BTRFS_LEAF_DATA_SIZE(root))
+   return -EOVERFLOW;
+
/* first try to make some room by pushing left and right */
if (data_size  ins_key-type != BTRFS_DIR_ITEM_KEY) {
wret = push_leaf_right(trans, root, path, data_size, 0);
diff --git a/ctree.h b/ctree.h
index e185e7a..32b1286 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2403,6 +2403,16 @@ int btrfs_insert_inode_extref(struct btrfs_trans_handle 
*trans,
  struct btrfs_root *root,
  const char *name, int name_len,
  u64 inode_objectid, u64 ref_objectid, u64 index);
+struct btrfs_inode_ref
+*btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_path *path,
+   const char *name, int namelen,
+   u64 ino, u64 parent_ino, u64 index, int ins_len);
+int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   const char *name, int name_len,
+   u64 ino, u64 parent_ino, u64 *index);
 
 /* file-item.c */
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
diff --git a/inode-item.c b/inode-item.c
index 7337ac9..5a9b675 100644
--- a/inode-item.c
+++ b/inode-item.c
@@ -89,6 +89,8 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle *trans,
ptr = (unsigned long)(ref + 1);
ret = 0;
} else if (ret  0) {
+   if (ret == EOVERFLOW)
+   ret = -EMLINK;
goto out;
} else {
ref = btrfs_item_ptr(path-nodes[0], path-slots[0],
@@ -102,6 +104,15 @@ int btrfs_insert_inode_ref(struct btrfs_trans_handle 
*trans,
 
 out:
btrfs_free_path(path);
+
+   if (ret == -EMLINK) {
+   if (btrfs_fs_incompat(root-fs_info,
+ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF))
+   ret = btrfs_insert_inode_extref(trans, root, name,
+   name_len,
+   inode_objectid,
+   ref_objectid, index);
+   }
return ret;
 }
 
@@ -147,6 +158,34 @@ int btrfs_insert_inode(struct btrfs_trans_handle *trans, 
struct btrfs_root
return ret;
 }
 
+struct btrfs_inode_ref
+*btrfs_lookup_inode_ref(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   struct btrfs_path *path,
+   const char *name, int namelen,
+   u64 ino, u64 parent_ino, u64 index, int ins_len)
+{
+   struct btrfs_key key;
+   struct btrfs_inode_ref *ret_inode_ref = NULL;
+   int ret = 0;
+
+   key.objectid = ino;
+   key.type = BTRFS_INODE_REF_KEY;
+   key.offset = parent_ino;
+
+   ret = btrfs_search_slot(trans, root, key, path, ins_len,
+   ins_len ? 1 : 0);
+   if (ret)
+   goto out;
+
+   find_name_in_backref(path, name, namelen, ret_inode_ref);
+out:
+   if (ret  0)
+   return ERR_PTR(ret);
+   else
+   return ret_inode_ref;
+}
+
 static inline u64 btrfs_extref_hash(u64 parent_ino, const char *name,
int namelen)
 {
@@ -351,3 +390,76 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+int btrfs_del_inode_ref(struct btrfs_trans_handle *trans,
+   struct btrfs_root *root,
+   const char *name, int name_len,
+   u64 ino, u64 parent_ino, u64 *index)
+{
+   struct btrfs_path *path;
+   struct btrfs_key key;
+   struct btrfs_inode_ref *ref;
+   struct extent_buffer *leaf;
+   unsigned long ptr;
+   unsigned long item_start;
+   u32 item_size;
+   u32 sub_item_len;
+   int ret;
+   int search_ext_refs = 0;
+   int del_len = name_len + sizeof(*ref);
+
+   key.objectid = ino;
+ 

[PATCH RESEND v2 5/6] btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found' fsck mechanism.

2014-12-02 Thread Qu Wenruo
With the previous btrfs inode operations patches, we can now use
btrfs_mkdir() to create the 'lost+found' dir to do some data salvage in
btrfsck.

This patch, along with the previous ones, will make data salvage easier.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
Changelog:
v2:
   Fix a bug that returned the parent ino rather than the existing dir's
   ino.
---
 ctree.h |  2 ++
 inode.c | 92 +
 2 files changed, 94 insertions(+)

diff --git a/ctree.h b/ctree.h
index 21bafd2..682255c 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2459,4 +2459,6 @@ int btrfs_add_orphan_item(struct btrfs_trans_handle 
*trans,
  struct btrfs_root *root,
  struct btrfs_path *path,
  u64 ino);
+int btrfs_mkdir(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   char *name, int namelen, u64 parent_ino, u64 *ino, int mode);
 #endif
diff --git a/inode.c b/inode.c
index aeb0a5c..b354f5a 100644
--- a/inode.c
+++ b/inode.c
@@ -436,3 +436,95 @@ out:
btrfs_free_path(path);
return ret;
 }
+
+/* Fill inode item with 'mode'. Uid/gid to root/root */
+static void fill_inode_item(struct btrfs_trans_handle *trans,
+   struct btrfs_inode_item *inode_item,
+   u32 mode, u32 nlink)
+{
+   time_t now = time(NULL);
+
+   btrfs_set_stack_inode_generation(inode_item, trans-transid);
+   btrfs_set_stack_inode_uid(inode_item, 0);
+   btrfs_set_stack_inode_gid(inode_item, 0);
+   btrfs_set_stack_inode_size(inode_item, 0);
+   btrfs_set_stack_inode_mode(inode_item, mode);
+   btrfs_set_stack_inode_nlink(inode_item, nlink);
+   btrfs_set_stack_timespec_sec(inode_item-atime, now);
+   btrfs_set_stack_timespec_nsec(inode_item-atime, 0);
+   btrfs_set_stack_timespec_sec(inode_item-mtime, now);
+   btrfs_set_stack_timespec_nsec(inode_item-mtime, 0);
+   btrfs_set_stack_timespec_sec(inode_item-ctime, now);
+   btrfs_set_stack_timespec_nsec(inode_item-ctime, 0);
+}
+
+/*
+ * Unlike kernel btrfs_new_inode(), we only create the INODE_ITEM, without
+ * its backref.
+ * The backref is added by btrfs_add_link().
+ */
+static int btrfs_new_inode(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  u64 ino, u32 mode)
+{
+   struct btrfs_inode_item inode_item = {0};
+   int ret = 0;
+
+   fill_inode_item(trans, inode_item, mode, 0);
+   ret = btrfs_insert_inode(trans, root, ino, inode_item);
+   return ret;
+}
+
+/*
+ * Make a dir under the parent inode 'parent_ino' with 'name'
+ * and 'mode', The owner will be root/root.
+ */
+int btrfs_mkdir(struct btrfs_trans_handle *trans, struct btrfs_root *root,
+   char *name, int namelen, u64 parent_ino, u64 *ino, int mode)
+{
+   struct btrfs_dir_item *dir_item;
+   struct btrfs_path *path;
+   u64 ret_ino;
+   int ret = 0;
+
+   path = btrfs_alloc_path();
+   if (!path)
+   return -ENOMEM;
+
+   dir_item = btrfs_lookup_dir_item(NULL, root, path, parent_ino,
+name, namelen, 0);
+   if (IS_ERR(dir_item)) {
+   ret = PTR_ERR(dir_item);
+   goto out;
+   }
+
+   if (dir_item) {
+   struct btrfs_key found_key;
+
+   /*
+* Already have conflicting name, check if it is a dir.
+* Either way, no need to continue.
+*/
+   btrfs_dir_item_key_to_cpu(path-nodes[0], dir_item, found_key);
+   ret_ino = found_key.objectid;
+   if (btrfs_dir_type(path-nodes[0], dir_item) != BTRFS_FT_DIR)
+   ret = -EEXIST;
+   goto out;
+   }
+
+   ret = btrfs_find_free_objectid(NULL, root, parent_ino, ret_ino);
+   if (ret)
+   goto out;
+   ret = btrfs_new_inode(trans, root, ret_ino, mode | S_IFDIR);
+   if (ret)
+   goto out;
+   ret = btrfs_add_link(trans, root, ret_ino, parent_ino, name, namelen,
+BTRFS_FT_DIR, NULL, 1);
+   if (ret)
+   goto out;
+out:
+   btrfs_free_path(path);
+   if (ret == 0  ino)
+   *ino = ret_ino;
+   return ret;
+}
-- 
2.1.3



[PATCH RESEND 2/6] btrfs-progs: Import btrfs_insert/del/lookup_extref() functions.

2014-12-02 Thread Qu Wenruo
Import the btrfs_insert/del/lookup_extref() functions from the kernel for the
incoming btrfs_add_link() and btrfs_unlink() functions.

As the base of the incoming btrfs 'lost+found' recovery mechanism.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 ctree.h  |  14 
 inode-item.c | 206 +++
 2 files changed, 220 insertions(+)

diff --git a/ctree.h b/ctree.h
index 89036de..e185e7a 100644
--- a/ctree.h
+++ b/ctree.h
@@ -2389,6 +2389,20 @@ int btrfs_insert_inode(struct btrfs_trans_handle *trans, 
struct btrfs_root
 int btrfs_lookup_inode(struct btrfs_trans_handle *trans, struct btrfs_root
   *root, struct btrfs_path *path,
   struct btrfs_key *location, int mod);
+struct btrfs_inode_extref
+*btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
+  struct btrfs_path *path, struct btrfs_root *root,
+  u64 ino, u64 parent_ino, u64 index, const char *name,
+  int namelen, int ins_len);
+int btrfs_del_inode_extref(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  const char *name, int name_len,
+  u64 inode_objectid, u64 ref_objectid,
+  u64 *index);
+int btrfs_insert_inode_extref(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root,
+ const char *name, int name_len,
+ u64 inode_objectid, u64 ref_objectid, u64 index);
 
 /* file-item.c */
 int btrfs_del_csums(struct btrfs_trans_handle *trans,
diff --git a/inode-item.c b/inode-item.c
index 8cc98c6..7337ac9 100644
--- a/inode-item.c
+++ b/inode-item.c
@@ -19,6 +19,7 @@
 #include ctree.h
 #include disk-io.h
 #include transaction.h
+#include crc32c.h
 
 static int find_name_in_backref(struct btrfs_path *path, const char * name,
 int name_len, struct btrfs_inode_ref **ref_ret)
@@ -145,3 +146,208 @@ int btrfs_insert_inode(struct btrfs_trans_handle *trans, 
struct btrfs_root
sizeof(*inode_item));
return ret;
 }
+
+static inline u64 btrfs_extref_hash(u64 parent_ino, const char *name,
+   int namelen)
+{
+   return (u64) btrfs_crc32c(parent_ino, name, namelen);
+}
+
+static int
+btrfs_find_name_in_ext_backref(struct btrfs_path *path,
+  u64 parent_ino, const char *name,
+  int namelen,
+  struct btrfs_inode_extref **extref_ret)
+{
+   struct extent_buffer *node;
+   struct btrfs_inode_extref *extref;
+   unsigned long ptr;
+   unsigned long name_ptr;
+   u32 item_size;
+   u32 cur_offset = 0;
+   int ref_name_len;
+   int slot;
+
+   node = path-nodes[0];
+   slot = path-slots[0];
+   item_size = btrfs_item_size_nr(node, slot);
+   ptr = btrfs_item_ptr_offset(node, slot);
+
+   /*
+* Search all extended backrefs in this item. We're only
+* looking through any collisions so most of the time this is
+* just going to compare against one buffer. If all is well,
+* we'll return success and the inode ref object.
+*/
+   while (cur_offset  item_size) {
+   extref = (struct btrfs_inode_extref *) (ptr + cur_offset);
+   name_ptr = (unsigned long)(extref-name);
+   ref_name_len = btrfs_inode_extref_name_len(node, extref);
+
+   if (ref_name_len == namelen 
+   btrfs_inode_extref_parent(node, extref) == parent_ino 
+   (memcmp_extent_buffer(node, name, name_ptr,
+ namelen) == 0)) {
+   if (extref_ret)
+   *extref_ret = extref;
+   return 1;
+   }
+
+   cur_offset += ref_name_len + sizeof(*extref);
+   }
+   return 0;
+}
+
+struct btrfs_inode_extref
+*btrfs_lookup_inode_extref(struct btrfs_trans_handle *trans,
+  struct btrfs_path *path, struct btrfs_root *root,
+  u64 ino, u64 parent_ino, u64 index, const char *name,
+  int namelen, int ins_len)
+{
+   struct btrfs_key key;
+   struct btrfs_inode_extref *extref;
+   int ret = 0;
+
+   key.objectid = ino;
+   key.type = BTRFS_INODE_EXTREF_KEY;
+   key.offset = btrfs_extref_hash(parent_ino, name, namelen);
+
+   ret = btrfs_search_slot(trans, root, key, path, ins_len,
+   ins_len ? 1 : 0);
+   if (ret  0)
+   return ERR_PTR(ret);
+   if (ret  0)
+   return NULL;
+   if (!btrfs_find_name_in_ext_backref(path, parent_ino, name,
+   namelen, extref))
+   return NULL;

Re: [PATCH 0/6] More generic inode nlink repair function

2014-12-02 Thread Ed Tomlinson
Hi,

I'd really like to see these patches included in btrfsck - they repaired my
fs.  Once Qu got them working, they found additional corruptions.  This time
there was no crash or stall, just an umount that left (chromium) files
unlinked...  The bug with these files has been hitting me for a while - I
just did not recognize what was causing it or notice the corruption.

The only objection I have seen to these patches is that they may create a
lost+found directory.  I submit this is expected behavior for a fsck
utility.  When --repair is specified, I expect a fsck to make changes to my
fs, one of which may be adding and populating a lost+found directory.

Thanks
Ed Tomlinson

PS. It would be very interesting to find out WHY these files are ending up 
unlinked.  Ideas?


On Wednesday 03 December 2014 12:18:26 you wrote:
 Updated patches 4 and 6; the others are not changed.
 This nlink repair function is more generic than the original one.
 
 The old one could only handle a specific case where the inode_ref is
 invalid, either pointing to a non-existent parent inode or pointing to an
 invalid inode (not a dir, or with a conflicting index/name).
 
 The new one will reset all the backrefs, no matter whether they are valid
 or not, and re-add all the valid backrefs; this makes the nlink repair
 handle more corruption cases.
 
 Qu Wenruo (6):
   btrfs-progs: print root dir verbose error in fsck
   btrfs-progs: Import btrfs_insert/del/lookup_extref() functions.
   btrfs-progs: Import lookup/del_inode_ref() function.
   btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.
   btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found' 
fsck mechanism.
   btrfs-progs: Add fixing function for inodes whose nlink dismatch
 
  Makefile |   2 +-
  cmds-check.c | 311 --
  ctree.c  |   6 +
  ctree.h  |  38 +
  inode-item.c | 318 +++
  inode.c  | 484 
 +++
  6 files changed, 1148 insertions(+), 11 deletions(-)
  create mode 100644 inode.c
 
 



Re: [PATCH 0/6] More generic inode nlink repair function

2014-12-02 Thread Satoru Takeuchi

Hi,

(2014/12/03 14:03), Ed Tomlinson wrote:

Hi,

I'd really like to see these patches included in btrfsck - they repaired my
fs.  Once Qu got them working, they found additional corruptions.  This time
there was no crash or stall, just an umount that left (chromium) files
unlinked...  The bug with these files has been hitting me for a while - I
just did not recognize what was causing it or notice the corruption.

The only objection I have seen to these patches is that they may create a
lost+found directory.  I submit this is expected behavior for a fsck
utility.  When --repair is specified, I expect a fsck to make changes to my
fs, one of which may be adding and populating a lost+found directory.


How about creating lost+found at mkfs.btrfs time, like ext4 does?

Thanks,
Satoru



Thanks
Ed Tomlinson

PS. It would be very interesting to find out WHY these files are ending up 
unlinked.  Ideas?


On Wednesday 03 December 2014 12:18:26 you wrote:

Updated patches 4 and 6; the others are not changed.
This nlink repair function is more generic than the original one.

The old one could only handle a specific case where the inode_ref is
invalid, either pointing to a non-existent parent inode or pointing to an
invalid inode (not a dir, or with a conflicting index/name).

The new one will reset all the backrefs, no matter whether they are valid
or not, and re-add all the valid backrefs; this makes the nlink repair
handle more corruption cases.

Qu Wenruo (6):
   btrfs-progs: print root dir verbose error in fsck
   btrfs-progs: Import btrfs_insert/del/lookup_extref() functions.
   btrfs-progs: Import lookup/del_inode_ref() function.
   btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.
   btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found'
fsck mechanism.
   btrfs-progs: Add fixing function for inodes whose nlink dismatch

  Makefile |   2 +-
  cmds-check.c | 311 --
  ctree.c  |   6 +
  ctree.h  |  38 +
  inode-item.c | 318 +++
  inode.c  | 484 +++
  6 files changed, 1148 insertions(+), 11 deletions(-)
  create mode 100644 inode.c









Re: [PATCH 1/2] Btrfs: fix fs mapping extent map leak

2014-12-02 Thread Liu Bo
On Tue, Dec 02, 2014 at 06:07:30PM +, Filipe Manana wrote:
 On chunk allocation error (label error_del_extent), after adding the
 extent map to the tree and to the pending chunks list, we would end up
 decrementing the extent map's refcount by 2 instead of 3 (our allocation
 + tree reference + list reference).
 
 Also, on chunk/block group removal, if the block group was on the list
 pending_chunks we weren't decrementing the respective list reference.
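
To see where those three references come from, here is a rough sketch of the
chunk-allocation path (simplified and paraphrased; not the literal
fs/btrfs/volumes.c code of this kernel version):

	/* Rough sketch: references taken on the extent map during
	 * chunk allocation (simplified, not the literal kernel code). */
	em = alloc_extent_map();                  /* ref 1: our allocation */
	/* ... fill in the mapping ... */
	write_lock(&em_tree->lock);
	ret = add_extent_mapping(em_tree, em, 0); /* ref 2: tree reference */
	if (!ret) {
		list_add_tail(&em->list,
			      &trans->transaction->pending_chunks);
		atomic_inc(&em->refs);            /* ref 3: list reference */
	}
	write_unlock(&em_tree->lock);
	/* So the error path must call free_extent_map() three times,
	 * which is exactly what the second hunk below adds. */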
 
 Detected by 'rmmod btrfs':
 
 [20770.105881] kmem_cache_destroy btrfs_extent_map: Slab cache still has objects
 [20770.106127] CPU: 2 PID: 11093 Comm: rmmod Tainted: G W L 3.17.0-rc5-btrfs-next-1+ #1
 [20770.106128] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
 [20770.106130]   8800ba867eb8 813e7a13 8800a2e11040
 [20770.106132]  8800ba867ed0 81105d0c  8800ba867ee0
 [20770.106134]  a035d65e 8800ba867ef0 a03b0654 8800ba867f78
 [20770.106136] Call Trace:
 [20770.106142]  [<ffffffff813e7a13>] dump_stack+0x45/0x56
 [20770.106145]  [<ffffffff81105d0c>] kmem_cache_destroy+0x4b/0x90
 [20770.106164]  [<ffffffffa035d65e>] extent_map_exit+0x1a/0x1c [btrfs]
 [20770.106176]  [<ffffffffa03b0654>] exit_btrfs_fs+0x27/0x9d3 [btrfs]
 [20770.106179]  [<ffffffff8109dc97>] SyS_delete_module+0x153/0x1c4
 [20770.106182]  [<ffffffff8121261b>] ? trace_hardirqs_on_thunk+0x3a/0x3c
 [20770.106184]  [<ffffffff813ebf52>] system_call_fastpath+0x16/0x1b
 
 This applies on top of (depends on) my previous patch titled:
 Btrfs: fix race between fs trimming and block group remove/allocation
 
 But the issue was in fact already present before that change; it only
 became easier to hit after Josef's 3.18 patch that added automatic
 removal of empty block groups.

Good catch.

But I think we can add leak-detection code for extent maps, like what we
did for extent states and extent buffers; then, when we test with DEBUG
options enabled, it's much easier to find this kind of leak.
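
As a rough illustration of that suggestion, a DEBUG-only leak tracker could
look something like the sketch below, modeled on the existing extent_state
and extent_buffer leak checks. The names em_leak_list and em_leak_lock, and
the extra leak_node list_head assumed inside struct extent_map, are made up
for the sake of the example, not existing btrfs APIs:

	#ifdef CONFIG_BTRFS_DEBUG
	static LIST_HEAD(em_leak_list);
	static DEFINE_SPINLOCK(em_leak_lock);

	/* Call from alloc_extent_map(). */
	static void em_leak_add(struct extent_map *em)
	{
		spin_lock(&em_leak_lock);
		list_add(&em->leak_node, &em_leak_list);
		spin_unlock(&em_leak_lock);
	}

	/* Call just before the final kmem_cache_free() of an em. */
	static void em_leak_del(struct extent_map *em)
	{
		spin_lock(&em_leak_lock);
		list_del(&em->leak_node);
		spin_unlock(&em_leak_lock);
	}

	/* Call from extent_map_exit(): anything still listed leaked. */
	static void em_leak_check(void)
	{
		struct extent_map *em;

		while (!list_empty(&em_leak_list)) {
			em = list_first_entry(&em_leak_list,
					      struct extent_map, leak_node);
			pr_err("BTRFS: em leak: start %llu len %llu refs %d\n",
			       (unsigned long long)em->start,
			       (unsigned long long)em->len,
			       atomic_read(&em->refs));
			list_del(&em->leak_node);
		}
	}
	#endif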

Thanks,

-liubo
 
 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
 
 This replaces my previous patch titled:
 Btrfs: fix extent map leak on chunk allocation failure
 
  fs/btrfs/extent-tree.c | 4 ++++
  fs/btrfs/volumes.c | 2 ++
  2 files changed, 6 insertions(+)
 
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index a811ed2..17d429d 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -9479,6 +9479,10 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 	memcpy(&key, &block_group->key, sizeof(key));
 
 	lock_chunks(root);
 +	if (!list_empty(&em->list)) {
 +		/* We're in the transaction->pending_chunks list. */
 +		free_extent_map(em);
 +	}
 	spin_lock(&block_group->lock);
 	block_group->removed = 1;
 	/*
 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 66a5a1e..e936fe3 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -4496,6 +4496,8 @@ error_del_extent:
 	free_extent_map(em);
 	/* One for the tree reference */
 	free_extent_map(em);
 +	/* One for the pending_chunks list reference */
 +	free_extent_map(em);
 error:
 	kfree(devices_info);
 	return ret;
 -- 
 2.1.3
 
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/6] More generic inode nlink repair function

2014-12-02 Thread Qu Wenruo


-------- Original Message --------
Subject: Re: [PATCH 0/6] More generic inode nlink repair function
From: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com
To: Ed Tomlinson e...@aei.ca, linux-btrfs@vger.kernel.org
Date: 2014-12-03 15:12

Hi,

(2014/12/03 14:03), Ed Tomlinson wrote:

Hi,

I'd really like to see these patches included in btrfsck - they repaired
my fs. Once Qu got them working, they found additional corruptions. This
time there was no crash or stall, just an umount that left (chromium)
files unlinked... The bug with these files has been hitting me for a
while - I just did not recognize what was causing it or notice the
corruption.

The only objection I have seen to these patches is that they may create
a lost+found directory. I submit this is expected behavior for a fsck
utility. When --repair is specified, I expect a fsck to make changes to
my fs, one of which may be adding and populating a lost+found directory.


How about creating lost+found at mkfs.btrfs time, like ext4 does?

Thanks,
Satoru


I hope most users will never see the lost+found dir.

Thanks to the COW design, especially for all the metadata, btrfs is
*supposed* to be a fsck-free, self-repairing fs; creating lost+found at
mkfs time would make btrfs look less special next to the old ext*
filesystems.

IMHO, most users will not see the lost+found dir: if the hardware and
kernel work as they should, the usual bit flip can be recovered by the
default DUP or higher RAID metadata profile. The lost+found dir will
only appear if the fs is really badly damaged, there is a kernel bug,
or the fs was intentionally damaged by btrfs-corrupt-block for QA
purposes.

As btrfs matures, such cases should become more and more rare, and most
users will forget that btrfsck can create a lost+found dir, but it will
still be the last hope. So lost+found should only appear when it is
needed, and when it is needed it will be created.


Thanks,
Qu


Thanks
Ed Tomlinson

PS. It would be very interesting to find out WHY these files are ending up unlinked. Ideas?



On Wednesday 03 December 2014 12:18:26 you wrote:

Patches 4 and 6 are updated; the others are unchanged.
This nlink repair function is more generic than the original one.

The old one could only handle the specific case where the inode_ref is
invalid, either pointing to a non-existent parent inode or to an invalid
inode (not a dir, or with a conflicting index/name).

The new one resets all the backrefs, valid or not, and then re-adds only
the valid ones; this lets the nlink repair handle more corruption cases.

Qu Wenruo (6):
   btrfs-progs: print root dir verbose error in fsck
   btrfs-progs: Import btrfs_insert/del/lookup_extref() functions.
   btrfs-progs: Import lookup/del_inode_ref() function.
   btrfs-progs: Add btrfs_unlink() and btrfs_add_link() functions.
   btrfs-progs: Add btrfs_mkdir() function for the incoming 'lost+found' fsck mechanism.
   btrfs-progs: Add fixing function for inodes whose nlink dismatch

  Makefile |   2 +-
  cmds-check.c | 311 --
  ctree.c  |   6 +
  ctree.h  |  38 +
  inode-item.c | 318 +++
  inode.c  | 484 +++

  6 files changed, 1148 insertions(+), 11 deletions(-)
  create mode 100644 inode.c









--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html