Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-17 Thread Lennart Poettering
On Wed, 17.06.15 21:10, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Well, /bin/mount is not a daemon, and it should not be one.
 
 My helper is not a daemon; you were correct the first time: it blocks
 until all needed/enough devices have appeared.
 Anyway, this should be no different from mounting an NFS
 filesystem. In that case too the mount helper blocks until the
 connection is established. The blocking time is not negligible, even
 though not as long as a device timeout ...

Well, the mount tool doesn't wait for the network to be configured or
so. It just waits for a response from the server. That's quite a
difference.

  Well, it's not really ugly. I mean, if the state or properties of a
  device change, then udev should update its information about it, and
  that's done via a retrigger. We do that all the time already, for
  example when an existing loopback device gets a backing file assigned
  or removed. I am pretty sure that loopback case is very close to what
  you want to do here, hence retriggering (either from the kernel side
  or from userspace) appears like an OK thing to do.
 
 What seems strange to me is that in this case the devices haven't
 changed their status.
 How is this problem managed in the md/dm RAID cases?

md has a daemon, mdmon, to my knowledge.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-15 Thread Lennart Poettering
On Fri, 12.06.15 21:16, Anand Jain (anand.j...@oracle.com) wrote:

 
 
 BTRFS_IOC_DEVICES_READY is to check if all the required devices
 are known by the btrfs kernel, so that admin/system-application
 could mount the FS. It is checked against a device in the argument.
 
 However, the actual implementation is a bit more than just that,
 in that it will also scan and register the device
 provided in the argument (same as the btrfs device scan subcommand
 or the BTRFS_IOC_SCAN_DEV ioctl).
 
 So the BTRFS_IOC_DEVICES_READY ioctl isn't a read/view-only ioctl,
 but a write command as well.
 
 Next, in the kernel we only check whether total_devices
 (read from the SB) is equal to num_devices (counted in the list)
 to report the status as 0 (ready) or 1 (not ready). But this
 does not work for the other device-pool states like missing,
 seeding, and replacing, since total_devices is actually not equal
 to num_devices in those states even though the device pool is ready
 for the mount; that is a bug, but it is not part of this discussion.
 
 
 Questions:
 
  - Do we want the BTRFS_IOC_DEVICES_READY ioctl to also scan and
    register the device provided (same as the btrfs device scan
    command or the BTRFS_IOC_SCAN_DEV ioctl),
    OR can BTRFS_IOC_DEVICES_READY be a read-only ioctl interface
    to check the state of the device pool?

I am pretty sure the kernel should not change the API on this now. Hence:
stick to the current behaviour, please.
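
[For illustration, the current behaviour can be probed from C roughly
like this minimal sketch, modelled on what udev's builtin btrfs helper
does; the device path is just an example, and note that the call has
the scan/register side effect discussed above:]

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/btrfs.h>

    /* Ask the kernel whether all devices of the fs that /dev/sdb1
     * belongs to are known. Note: this also *registers* /dev/sdb1,
     * i.e. it is the write command described above, not read-only. */
    int main(void) {
            struct btrfs_ioctl_vol_args args = {};
            int fd, r;

            fd = open("/dev/btrfs-control", O_RDWR);
            if (fd < 0)
                    return 1;

            strncpy(args.name, "/dev/sdb1", sizeof(args.name) - 1);
            r = ioctl(fd, BTRFS_IOC_DEVICES_READY, &args);
            close(fd);
            if (r < 0)
                    return 1;

            printf("%s\n", r == 0 ? "ready" : "not ready");
            return 0;
    }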

  - If the device in the argument is already mounted,
    can it straightaway return 0 (ready)? (As of now it would
    again independently read the SB, determine total_devices,
    and check that against num_devices.)

Yeah, I figure that might make sense to do.

  - What should be the expected return when the FS is mounted
    and there is a missing device?

An error, as it already does.

I am pretty sure that mounting degraded file systems should be an
exceptional operation, and not the common scheme. If it should happen
automatically at all, then it should be triggered by some daemon or
so, but not by udev/systemd.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-15 Thread Lennart Poettering
On Sat, 13.06.15 17:35, Anand Jain (anand.j...@oracle.com) wrote:

 Are there any other users?
 
- If the device in the argument is already mounted,
  can it straightaway return 0 (ready)? (As of now it would
  again independently read the SB, determine total_devices,
  and check that against num_devices.)
 
 
 I think yes; the obvious use case is btrfs mounted in the initrd and later
 coldplugged. There is no point in waiting for anything, as the filesystem is
 obviously there.
 
 
  There is a subtle difference. If the device is already mounted,
  and there are two device paths for the same device, PA and PB,
  the path last given to either 'btrfs dev scan' (BTRFS_IOC_SCAN_DEV)
  or 'btrfs device ready' (BTRFS_IOC_DEVICES_READY) will be shown
  in the 'btrfs filesystem show' or '/proc/self/mounts' output.
  It does not mean that the btrfs kernel code will close the first device
  path and reopen the second given device path; it just updates the device
  path in the kernel.

The device paths shown in /proc/self/mountinfo are also weird in other
cases: if people boot up without an initrd, and use a btrfs fs as root,
then it will always carry the string /dev/root in there, which is
completely useless, since such a device never exists in userspace or
/sys, and hence one cannot make sense of it. Moreover, if one then asks
the kernel for the devices backing the btrfs fs via the ioctl it will
also return /dev/root for it, which is really useless.

I think in general I'd prefer if btrfs would stop returning the device
paths it got from userspace or the kernel, and would always return
sanitized ones that use the official kernel names for the devices in
them. Specifically, the member devices ioctl should always return
names like /dev/sda5, even if I mount something using root= on the
kernel cmdline, or if I mount via a /dev/disk/by-uuid/ symlink
instead of the real kernel name of the device.

Then, I think it would be a good idea to always update the device
string shown in /proc/self/mountinfo to be a concatenated version of
the list of device names reported by the ioctl. So that a btrfs RAID
would show /dev/sda5:/dev/sdb6:/dev/sdc5 or so. And if I remove or
add backing devices the string really should be updated.

The btrfs client side tools then could use udev to get a list of the
device node symlinks for each device to help the user identifying
which backing devices belong to a btrfs pool.
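
[As a sketch of what such client-side code would build on: the member
devices can be enumerated with the BTRFS_IOC_FS_INFO/BTRFS_IOC_DEV_INFO
ioctl pair, and the paths come back as whatever strings the kernel has
stored -- exactly the problem described above. Error handling trimmed.]

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    /* list the backing devices of the btrfs fs mounted at argv[1] */
    int main(int argc, char **argv) {
            struct btrfs_ioctl_fs_info_args fi = {};
            int fd = open(argc > 1 ? argv[1] : "/", O_RDONLY);

            if (fd < 0 || ioctl(fd, BTRFS_IOC_FS_INFO, &fi) < 0)
                    return 1;

            for (__u64 id = 1; id <= fi.max_id; id++) {
                    struct btrfs_ioctl_dev_info_args di = { .devid = id };

                    if (ioctl(fd, BTRFS_IOC_DEV_INFO, &di) < 0) {
                            if (errno == ENODEV)
                                    continue;  /* devid currently unused */
                            return 1;
                    }
                    printf("devid %llu: %s\n",
                           (unsigned long long) di.devid, (char *) di.path);
            }
            return 0;
    }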

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-15 Thread Lennart Poettering
On Sat, 13.06.15 17:09, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Further, the problem will be more intense in this example: if you use dd
  and copy device A to device B, then after you mount device A, by just
  providing device B in the above two commands you can let the kernel
  update the device path. All the I/O (since the device is mounted)
  still goes to device A (not B), but /proc/self/mounts and
  'btrfs fi show' show it as device B (not A).
  
  It's a bug, very tricky to fix.
 
 In the past [*] I proposed a mount.btrfs helper. I tried to move the logic
 outside the kernel.
 I think that the problem is that we try to manage all these cases
 from a device point of view: when a device appears, we register the
 device and we try to mount the filesystem... This works very well
 when there is a single-volume filesystem. For the other cases there is a
 mess between the different layers:

 - kernel
 - udev/systemd
 - initrd logic
 
 My attempt followed a different idea: the mount helper waits for the
 devices if needed, or, where appropriate, mounts the filesystem in
 degraded mode. All devices are passed as mount arguments
 (--device=/dev/sdX), and there is no device registration: this avoids
 all these problems.

Hmm, no. /bin/mount should not block waiting for devices. That's generally
incompatible with how the tool is used, and in particular with how
systemd uses it. We could not make use of such a scheme in
systemd. /bin/mount should always be short-running.

I am pretty sure that if such automatic degraded mounting should be
supported, then this should be done with some background storage
daemon that alters the effect of the READY ioctl somehow after the
timeout, and then retriggers the devices so that systemd takes
note. (Or, alternatively: such a scheme could even be implemented all
in the kernel, based on some configurable kernel setting...)

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-15 Thread Lennart Poettering
On Mon, 15.06.15 19:23, Goffredo Baroncelli (kreij...@inwind.it) wrote:

 On 2015-06-15 12:46, Lennart Poettering wrote:
  On Sat, 13.06.15 17:09, Goffredo Baroncelli (kreij...@libero.it) wrote:
  
  Further, the problem will be more intense in this example: if you use dd
  and copy device A to device B, then after you mount device A, by just
  providing device B in the above two commands you can let the kernel
  update the device path. All the I/O (since the device is mounted)
  still goes to device A (not B), but /proc/self/mounts and
  'btrfs fi show' show it as device B (not A).
 
  It's a bug, very tricky to fix.
 
  In the past [*] I proposed a mount.btrfs helper. I tried to move the
  logic outside the kernel.
  I think that the problem is that we try to manage all these cases
  from a device point of view: when a device appears, we register the
  device and we try to mount the filesystem... This works very well
  when there is a single-volume filesystem. For the other cases there is a
  mess between the different layers:
  
  - kernel
  - udev/systemd
  - initrd logic
 
  My attempt followed a different idea: the mount helper waits for the
  devices if needed, or, where appropriate, mounts the filesystem in
  degraded mode. All devices are passed as mount arguments
  (--device=/dev/sdX), and there is no device registration: this avoids
  all these problems.
  
  Hmm, no. /bin/mount should not block waiting for devices. That's generally
  incompatible with how the tool is used, and in particular with how
  systemd uses it. We could not make use of such a scheme in
  systemd. /bin/mount should always be short-running.
 
 Apart from systemd, what are these incompatibilities?

Well, /bin/mount is not a daemon, and it should not be one.

  I am pretty sure that if such automatic degraded mounting should be
  supported, then this should be done with some background storage
  daemon that alters the effect of the READY ioctl somehow after the
  timeout, and then retriggers the devices so that systemd takes
  note. (Or, alternatively: such a scheme could even be implemented all
  in the kernel, based on some configurable kernel setting...)
 
 I recognize that this solution provides the maximum compatibility
 with the current implementation. However, it seems too complex to
 me. Re-triggering a device seems to me more a workaround than a
 solution.

Well, it's not really ugly. I mean, if the state or properties of a
device change, then udev should update its information about it, and
that's done via a retrigger. We do that all the time already, for
example when an existing loopback device gets a backing file assigned
or removed. I am pretty sure that loopback case is very close to what
you want to do here, hence retriggering (either from the kernel side
or from userspace) appears like an OK thing to do.
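
[For reference: a userspace retrigger is tiny -- it is what "udevadm
trigger --action=change /dev/sdb1" does under the hood, i.e. writing an
action into the device's uevent file. A sketch; the device name is made
up:]

    #include <fcntl.h>
    #include <unistd.h>

    /* synthesize a "change" uevent so that udev re-runs its rules
     * (and hence re-evaluates btrfs readiness) for this device */
    int main(void) {
            int fd = open("/sys/class/block/sdb1/uevent", O_WRONLY);

            if (fd < 0)
                    return 1;
            if (write(fd, "change", 6) != 6)
                    return 1;
            return close(fd);
    }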

 Could a generator do this job? I.e. this generator (or storage
 daemon) waits until all (or enough) devices have appeared, then it
 creates a .mount unit: do you think that is doable?

systemd generators are a way to extend the systemd unit dep tree with
units. They are very short-running, and are executed only very, very
early at boot. They cannot wait for anything, they don't have access
to devices, and they are not re-run when devices appear.
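
[For context: all a generator can emit is plain unit files, written out
very early into /run/systemd/generator -- e.g. a hypothetical data.mount,
sketched from memory below. The waiting is then done by systemd proper,
which orders the mount unit after (and binds it to) the backing .device
unit; the generator itself never waits.]

    [Unit]
    Description=Btrfs data volume

    [Mount]
    What=/dev/disk/by-uuid/XYZ
    Where=/data
    Type=btrfs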

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] install Fedora systemd-nspawn container on btrfs

2015-04-23 Thread Lennart Poettering
On Thu, 23.04.15 14:18, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:

  Pick one:
 
  a) download the raw image and use that, but it will be a loopback file
  with its own file system inside
 
  or:
 
  b) do the dnf/yum install root thing, and install it into a directory
  tree.
 
 I installed the yum package on Arch but couldn't manage to do the install.
 
 # yum -y --releasever=22 --nogpg --installroot=/var/lib/machines/enl
 --disablerepo='*' --enablerepo=fedora install systemd passwd dnf
 fedora-release-server
 Error getting repository data for fedora, repository not found
 # yum repolist all
 repolist: 0
 
 In fact, /etc/yum/repos.d is empty, so I am not surprised.

Of course, you could use the .raw image, mount the external btrfs volume
into it via nspawn's --bind= switch, then use yum inside of that
container to install into the btrfs volume. Then get rid of the .raw
image again, and you still have the btrfs volume, which should be
bootable.

A bit complex, but you were almost there already... ;-)

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] install Fedora systemd-nspawn container on btrfs

2015-04-23 Thread Lennart Poettering
On Thu, 23.04.15 19:00, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:

 On Thu, Apr 23, 2015 at 4:47 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Thu, 23.04.15 14:18, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:
 
   Pick one:
  
   a) download the raw image and use that, but it will be a loopback file
   with its own file system inside
  
   or:
  
   b) do the dnf/yum install root thing, and install it into a directory
   tree.
 
  I installed the yum package on Arch but couldn't manage to do the install.
 
  # yum -y --releasever=22 --nogpg --installroot=/var/lib/machines/enl
  --disablerepo='*' --enablerepo=fedora install systemd passwd dnf
  fedora-release-server
  Error getting repository data for fedora, repository not found
  # yum repolist all
  repolist: 0
 
  In fact, /etc/yum/repos.d is empty, so I am not surprised.
 
  Of course, you could use the .raw image, mount the external btrfs volume
  into it via nspawn's --bind= switch, then use yum inside of that
  container to install into the btrfs volume. Then get rid of the .raw
  image again, and you still have the btrfs volume, which should be
  bootable.
 
  A bit complex, but you were almost there already... ;-)
 
 Wonderful!
 
 # systemd-nspawn -M Fedora-Cloud-Base-22_Beta-20150415.x86_64.raw --
 bind=/var/lib/machines/enl:/mnt
 [root@Fedora-Cloud-Base-22_Beta-20150415 ~]#dnf -y --releasever=22
 --nogpg --installroot=/mnt
  --disablerepo='*' --enablerepo=fedora install systemd passwd dnf
 
 Complete!
 
 -
 $ ls /var/lib/machines/enl
 boot/  etc/   media/  opt/   root/  srv/  tmp/  var/  lib@sbin@
 dev/   home/  mnt/proc/  run/   sys/  usr/  bin@  lib64@
 -
 
 But now it boots but hangs:
 
 #systemd-nspawn -bD /var/lib/machines/enl
 Spawning container enl on /var/lib/machines/enl.
 Press ^] three times within 1s to kill container.
 systemd 219 running in system mode. (+PAM +AUDIT +SELINUX +IMA
 -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL
 +XZ -LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN)
 Detected virtualization 'systemd-nspawn'.
 Detected architecture 'x86-64'.
 Running with unpopulated /etc.
 
 Welcome to Fedora 22 (Twenty Two)!
 
 Initializing machine ID from random generator.
 Populated /etc with preset unit settings.
 Unit etc.mount is bound to inactive unit dev-sdb1.device. Stopping, too.
 Unit var.mount is bound to inactive unit dev-sdb1.device. Stopping, too.
 Cannot add dependency job for unit display-manager.service, ignoring:
 Unit display-manager.service failed to load: No such file or
 directory.
 Startup finished in 51ms.
 
 
 
 Maybe it's my btrfs setup?
  Unit etc.mount is bound to inactive unit dev-sdb1.device. Stopping, too.
  Unit var.mount is bound to inactive unit dev-sdb1.device. Stopping, too.
 Will investigate.
 
 --bind was certainly the most easy trick once the raw image is downloaded.

Hmm, my guess is that you somehow lost the /etc and /var directories
half way, probably because of the weird mounting you are
doing. --bind= should normally be recursive, but maybe this didn't work?

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] install Fedora systemd-nspawn container on btrfs

2015-04-23 Thread Lennart Poettering
On Thu, 23.04.15 19:29, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:

 When in /var/lib/machines/poppy:
 
 root@hortensia ➤➤ machines/poppy # btrfs subvolume list .
 ID 266 gen 98 top level 5 path rootvol
 ID 268 gen 100 top level 5 path var
 ID 269 gen 101 top level 5 path etc
 ID 271 gen 72 top level 266 path var/lib/machines
 ID 272 gen 77 top level 268 path var/tmp
 ID 273 gen 77 top level 268 path var/lib/machines
 
 Anyone from the btrfs ML able to help?

Note that systemd-tmpfiles will create /var/tmp and /var/lib/machines
as subvolumes these days, if they are missing and the file system is btrfs.
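
[That is driven by tmpfiles.d lines of type 'v', which create a btrfs
subvolume when the backing file system is btrfs and fall back to a plain
directory otherwise. From memory, the shipped snippets look roughly like
the following.]

    v /var/tmp 1777 root root 30d
    v /var/lib/machines 0700 - -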

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] install Fedora systemd-nspawn container on btrfs

2015-04-23 Thread Lennart Poettering
On Thu, 23.04.15 13:45, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:

 Not sure what I did wrong, but I can't install/boot my nspawn container.
 Here is my setup:
 
 Archlinux box- updated
 
 1- created 3 btrfs subvol on /dev/sdb1 (SSD). The goal is to manage
 snapshots easily.
 no nested subvol.
 --
 # btrfs subvolume list .
 ID 266 gen 39 top level 5 path rootvol
 ID 268 gen 41 top level 5 path var
 ID 269 gen 42 top level 5 path etc
 # btrfs filesystem show
 Label: 'poppy-root'  uuid: ef1b44cd-e7b0-4166-b933-e7d4d20a1171
 Total devices 1 FS bytes used 64.00KiB
 devid1 size 80.00GiB used 12.00MiB path /dev/sdb1
 --
 
 2 - mount btrfs subvol
 ---
 # mount -t btrfs -o subvol=rootvol /dev/sdb1 /var/lib/machines/enl
 # mkdir /var/lib/machines/enl/var
 # mkdir /var/lib/machines/enl/etc
 # mount -t btrfs -o subvol=etc /dev/sdb1 /var/lib/machines/enl/etc
 # mount -t btrfs -o subvol=var /dev/sdb1 /var/lib/machines/enl/var
 

This isn't really how one would normally use subvolumes. No need to
mount each subvolume explicitly, they are just special directories...

 
 3- install fedora minimal and boot it
 -
 #  machinectl pull-raw --verify=no
 http://ftp.halifax.rwth-aachen.de/fedora/linux/releases/22/Cloud/Images/x86_64/Fedora-Cloud-Base-22_Beta-20150415-x86_64.raw.xz
 $ tar 
 # systemd-nspawn -M Fedora-Cloud-Base-22_Beta-20150415.x86_64.raw
 Spawning container Fedora-Cloud-Base-22_Beta-20150415.x86_64.raw on
 /var/lib/machines/Fedora-Cloud-Base-22_Beta-20150415.x86_64.raw.
 Press ^] three times within 1s to kill container.
 [root@Fedora-Cloud-Base-22_Beta-20150415 ~]#
 
 
 4- install Fedora on /var/lib/machines/enl
 --
 [root@Fedora-Cloud-Base-22_Beta-20150415 ~]# dnf -y --releasever=22
 --nogpg --installroot=/var/lib/machines/MyContainer --disablerepo='*'
 --enablerepo=fedora install systemd passwd dnf fedora-release-server
 vim-minimal
 ..
 INSTALL
 ...
 Complete!
 --


Hmm? With this command you installed another Fedora inside the raw
Fedora image you downloaded. You now have three Linuxes, installed
within each other...

Pick one:

a) download the raw image and use that, but it will be a loopback file
with its own file system inside

or:

b) do the dnf/yum install root thing, and install it into a directory
tree.

Do either of those in the host, not in the container.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] install Fedora systemd-nspawn container on btrfs

2015-04-23 Thread Lennart Poettering
On Thu, 23.04.15 14:57, Andrei Borzenkov (arvidj...@gmail.com) wrote:

 On Thu, Apr 23, 2015 at 2:50 PM, Lennart Poettering
 lenn...@poettering.net wrote:
  On Thu, 23.04.15 13:45, arnaud gaboury (arnaud.gabo...@gmail.com) wrote:
 
  Not sure what I did wrong, but I can't install/boot my nspawn container.
  Here is my setup:
 
  Archlinux box- updated
 
  1- created 3 btrfs subvol on /dev/sdb1 (SSD). The goal is to manage
  snapshots easily.
  no nested subvol.
  --
  # btrfs subvolume list .
  ID 266 gen 39 top level 5 path rootvol
  ID 268 gen 41 top level 5 path var
  ID 269 gen 42 top level 5 path etc
  # btrfs filesystem show
  Label: 'poppy-root'  uuid: ef1b44cd-e7b0-4166-b933-e7d4d20a1171
  Total devices 1 FS bytes used 64.00KiB
  devid1 size 80.00GiB used 12.00MiB path /dev/sdb1
  --
 
  2 - mount btrfs subvol
  ---
  # mount -t btrfs -o subvol=rootvol /dev/sdb1 /var/lib/machines/enl
  # mkdir /var/lib/machines/enl/var
  # mkdir /var/lib/machines/enl/etc
  # mount -t btrfs -o subvol=etc /dev/sdb1 /var/lib/machines/enl/etc
  # mount -t btrfs -o subvol=var /dev/sdb1 /var/lib/machines/enl/var
  
 
  This isn't really how one would normally use subvolumes. No need to
  mount each subvolume explicitly, they are just special directories...
 
 As long as you never clone the parent volume (but why use btrfs then?). As
 soon as you create a clone or snapshot of the parent volume, all children
 will be out of place in it unless you explicitly mount them in the correct
 place in the hierarchy.

Hmm? Not following. The btrfs tool surely doesn't do recursive
snapshots currently. But it doesn't reinstate mount points either (or
even make them persistent), hence I really don't get what you are
saying.

(note that machined's clone command in git *does* recursive snapshots)

Lennart

-- 
Lennart Poettering, Red Hat


Re: Recursive subvolume snapshots and deletion?

2015-03-25 Thread Lennart Poettering
On Mon, 23.03.15 08:36, Chris Mason (c...@fb.com) wrote:

 On Mon, Mar 23, 2015 at 12:57 AM, Lennart Poettering mzerq...@0pointer.de
 wrote:
 Heya!
 
 So what's the story on recursive btrfs snapshotting and snapshot
 removal? For a while now systemd has by default been creating btrfs
 subvolumes for /var/lib/machines, for example. Now, if that code is run
 inside a container, and the container itself is already stored in a
 subvolume, we end up with a subvolume inside a subvolume, which
 currently breaks snapshotting and deletion of the outer container
 subvolume.
 
 What's the plan regarding recursive versions of the operations? Any
 plan to add this? We could work around this in userspace, of course,
 but it would not be atomic, and I'd much prefer if the kernel could do
 this on its own!
 
 Hi Lennart,
 
 I've got a patch to btrfs-progs that can do the recursive snapshotting, let
 me clean it up and submit to Dave.  For recursive deletion I think the same
 method can be used.

OK, so this means you want this to be solved in userspace, hence in a
non-atomic fashion?

What precisely will your code do? Enumerate the subvolumes and simply
snapshot all subvolumes below the desired path? Or anything smarter?
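
[For reference: the non-atomic userspace variant would boil down to
walking the subvolumes below the source and repeating one snapshot step
per subvolume, roughly like this sketch:]

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/btrfs.h>

    /* snapshot the subvolume open at src_fd as <name> inside the
     * directory open at dst_dir_fd; one step of a recursive walk */
    int snapshot_one(int dst_dir_fd, int src_fd, const char *name) {
            struct btrfs_ioctl_vol_args_v2 args = {};

            args.fd = src_fd;
            strncpy(args.name, name, BTRFS_SUBVOL_NAME_MAX);
            return ioctl(dst_dir_fd, BTRFS_IOC_SNAP_CREATE_V2, &args);
    }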

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] systemd and nested Btrfs subvolumes

2015-03-22 Thread Lennart Poettering
On Fri, 20.03.15 18:08, Chris Murphy (li...@colorremedies.com) wrote:

 Sure but now it's missing if you do a rollback, or if you mount any
 different root tree, so immediately special handling is needed.
 
 If machines is a subvolume at the top level, it can always be mounted
 at /var/lib/machines regardless of which root is used.
 
 I also think this is more consistent with
 http://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html
 specifically the section What We Propose when it comes to the
 location and naming convention for Btrfs subvolumes.

Containers are recursively stackable, hence having top-level subvolumes
doesn't work, since the containers should be able to have
subcontainers of their own...

Also, it kinda defeats the whole point of btrfs' subvolume concept,
where subvolumes are little more than special directories.

Lennart

-- 
Lennart Poettering, Red Hat


Recursive subvolume snapshots and deletion?

2015-03-22 Thread Lennart Poettering
Heya!

So what's the story on recursive btrfs snapshotting and snapshot
removal? For a while now systemd has by default been creating btrfs
subvolumes for /var/lib/machines, for example. Now, if that code is run
inside a container, and the container itself is already stored in a
subvolume, we end up with a subvolume inside a subvolume, which
currently breaks snapshotting and deletion of the outer container
subvolume.

What's the plan regarding recursive versions of the operations? Any
plan to add this? We could work around this in userspace, of course,
but it would not be atomic, and I'd much prefer if the kernel could do
this on its own!

Lennart


Re: price to pay for nocow file bit?

2015-01-08 Thread Lennart Poettering
On Wed, 07.01.15 15:10, Josef Bacik (jba...@fb.com) wrote:

 On 01/07/2015 12:43 PM, Lennart Poettering wrote:
 Heya!
 
 Currently, systemd-journald's disk access patterns (appending to the
 end of files, then updating a few pointers in the front) result in
 awfully fragmented journal files on btrfs, which has a pretty
 negative effect on performance when accessing them.
 
 I've been wondering if mount -o autodefrag would deal with this problem but
 I haven't had the chance to look into it.

Hmm, I am kinda interested in a solution that I can just implement in
systemd/journald now and that will then just make things work for
people suffering from the problem. I mean, I can hardly make systemd
patch the mount options of btrfs just because I place a journal file
on some fs...

Is autodefrag supposed to become a default one day?

Anyway, given the pros and cons I have now changed journald to set the
nocow bit on newly created journal files. When files are rotated (and
we hence know we will never ever write to them again) we try to unset
the bit again, and a defrag ioctl is invoked right
after. btrfs currently silently ignores that we unset the bit, and
leaves it set, but I figure I should try to unset it anyway, in case
it learns to honor that one day. After all, after rotating the files
there's no reason to treat the files specially anymore...

I'll keep an eye on this, and see if I still get user complaints about
it. Should autodefrag become default eventually we can get rid of this
code in journald again.
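
[In code, the rotation step described above amounts to roughly the
following sketch; the nocow bit is toggled via the generic file-flags
ioctls, and the defrag is queued on the same fd:]

    #include <sys/ioctl.h>
    #include <linux/fs.h>     /* FS_IOC_GETFLAGS, FS_NOCOW_FL */
    #include <linux/btrfs.h>  /* BTRFS_IOC_DEFRAG */

    /* on rotation: try to drop the nocow bit, then queue a defrag */
    static void finalize_journal_file(int fd) {
            int flags;

            if (ioctl(fd, FS_IOC_GETFLAGS, &flags) >= 0) {
                    flags &= ~FS_NOCOW_FL;
                    /* btrfs currently ignores the unset, see above */
                    (void) ioctl(fd, FS_IOC_SETFLAGS, &flags);
            }
            (void) ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
    }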

One question regarding the btrfs defrag ioctl: playing around with it,
it appears to be asynchronous: the defrag request is simply queued and
the ioctl returns immediately, which is great for my use case. However,
I was wondering whether it has always been async like this? I googled a
bit, and found reports that defrag might take a while, but I am not sure
whether those reports were about the ioctl itself taking that long, or
about the effect of the defrag actually hitting the disk...

Lennart

-- 
Lennart Poettering, Red Hat


Re: price to pay for nocow file bit?

2015-01-08 Thread Lennart Poettering
On Thu, 08.01.15 10:56, Zygo Blaxell (ce3g8...@umail.furryterror.org) wrote:

 On Wed, Jan 07, 2015 at 06:43:15PM +0100, Lennart Poettering wrote:
  Heya!
  
  Currently, systemd-journald's disk access patterns (appending to the
  end of files, then updating a few pointers in the front) result in
  awfully fragmented journal files on btrfs, which has a pretty
  negative effect on performance when accessing them.
  
  Now, to improve things a bit, I yesterday made a change to journald,
  to issue the btrfs defrag ioctl when a journal file is rotated,
  i.e. when we know that no further writes will ever be done on the
  file.
  
  However, I wonder now if I should go one step further even, and use
  the equivalent of chattr -C (i.e. nocow) on all journal files. I am
  wondering what price I would precisely have to pay for
  that. Judging by this earlier thread:
  
  http://www.spinics.net/lists/linux-btrfs/msg33134.html
  
  it's mostly about data integrity, which is something I can live with,
  given the conservative write patterns of journald, and the fact that
  we do our own checksumming and careful data validation. I mean, if
  btrfs in this mode provides no worse data integrity semantics than
  ext4 I am fully fine with losing this feature for these files.
 
 This sounds to me like a job for fallocate with FALLOC_FL_KEEP_SIZE.

We already use fallocate(), but this is not enough on COW file
systems. With fallocate() you can certainly reduce fragmentation when
appending things to a file. But on a COW file system this will help
little if we change things at the beginning of the file, since COW
means that it will then make a copy of those blocks and alter the
copy, but leave the original version unmodified. And if we do that all
the time the files get heavily fragmented, even though all the blocks
we modify were fallocate()d initially...

 This would work on ext4, xfs, and others, and provide the same benefit
 (or even better) without filesystem-specific code.  journald would
 preallocate a contiguous chunk past the end of the file for appends,
 and

That's precisely what we do. But journald's write pattern is not
purely appending to files; it's: append something to the end, then
link it up at the beginning. And for the append part we are
fine with fallocate(). It's the link-up part that completely fucks
up fragmentation so far.

Lennart

-- 
Lennart Poettering, Red Hat


Re: BTRFS_IOC_TREE_SEARCH ioctl

2015-01-07 Thread Lennart Poettering
On Mon, 05.01.15 19:14, Nehemiah Dacres (vivacar...@gmail.com) wrote:

 Is libbtrfs documented or even stable yet? What stage of development is it
 in, anyway? Is there a design spec yet?

Note that the code we use in systemd is not based on libbtrfs, we just
call the ioctls directly.

Lennart

-- 
Lennart Poettering, Red Hat


Re: btrfs_inode_item's otime?

2015-01-07 Thread Lennart Poettering
On Tue, 06.01.15 19:26, David Sterba (dste...@suse.cz) wrote:

  (Of course, even without xstat(), I think it would be good to have an
  unprivileged ioctl to query the otime in btrfs... the TREE_SEARCH
  ioctl after all requires privileges...)
 
 Adding this interface is a different question. I do not like to add
 ioctls that do too specialized things that normally fit into a generic
 interface like the xstat example. We could use the object properties
 instead (ie. export the otime as an extended attribute), but the work on
 that has stalled and it's not ready to just simply add the otime in
 advance.

Exposing this as an xattr sounds great to me too.

Lennart

-- 
Lennart Poettering, Red Hat


price to pay for nocow file bit?

2015-01-07 Thread Lennart Poettering
Heya!

Currently, systemd-journald's disk access patterns (appending to the
end of files, then updating a few pointers in the front) result in
awfully fragmented journal files on btrfs, which has a pretty
negative effect on performance when accessing them.

Now, to improve things a bit, I yesterday made a change to journald,
to issue the btrfs defrag ioctl when a journal file is rotated,
i.e. when we know that no further writes will ever be done on the
file.

However, I wonder now if I should go one step further even, and use
the equivalent of chattr -C (i.e. nocow) on all journal files. I am
wondering precisely what price I would have to pay for
that. Judging by this earlier thread:

http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with,
given the conservative write patterns of journald, and the fact that
we do our own checksumming and careful data validation. I mean, if
btrfs in this mode provides no worse data integrity semantics than
ext4 I am fully fine with losing this feature for these files.

Hence I am mostly interested in what else is lost if this flag is
turned on by default for all journal files journald creates: 

Does this have any effect on functionality? As I understand it, snapshots
still work fine for files marked like that, and so do
reflinks. Any drawback functionality-wise? Apparently file compression
support is lost if the bit is set? (Which I can live with too; journal
files are internally compressed anyway.)

What about performance? Do any operations get substantially slower by
setting this bit? For example, what happens if I take a snapshot of
files with this bit set and then modify the file, does this result in
a full (and hence slow) copy of the file on that occasion? 

I am trying to understand the pros and cons of turning this bit on,
before I can make this change. So far I see one big pro, but I wonder
if there's any major con I should think about?

Thanks,

Lennart


BTRFS_IOC_TREE_SEARCH ioctl

2015-01-05 Thread Lennart Poettering
Heya,

I recently added some btrfs magic to systemd's machinectl/nspawn
tool. More specifically it can now show the disk usage of a container
that is stored in a btrfs subvolume. For that I made use of the btrfs
quota logic. To read the current disk usage of a subvolume I took
inspiration from btrfs-progs, most specifically the
BTRFS_IOC_TREE_SEARCH ioctl(). Unfortunately, documentation for the
ioctl seems to be lacking, but there are some things about it I
fail to grok:

What precisely are the semantics of the ioctl, regarding the search
key min/max values (the fields of struct btrfs_ioctl_search_key)? I
kinda assumed that setting them would result in only objects being
returned that are within the min/max ranges. However, that appears not
to be the case. At least the min_offset/max_offset setting appears to
be ignored?

The code I hacked up is this one:

http://cgit.freedesktop.org/systemd/systemd/tree/src/shared/btrfs-util.c#n427

I try to read the BTRFS_QGROUP_STATUS_KEY and BTRFS_QGROUP_LIMIT_KEY
objects for the subvolume I care about. Hence I initialize .min_type
and .max_type to the two types (in the right order), and then
.min_offset and .max_offset to subvolume id. However, the search ioctl
will still give me entries back with offsets != the subvolume id...

Is this intended behaviour of the search ioctl? If so, what's the
rationale?

My code currently invokes the search ioctl in a loop to work around
the fact that .min_offset/.max_offset don't work as I wish they
did... I wish I could get rid of this loop and the filtering out of the
entries I get back that aren't in the range I specified...

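[For reference, the setup in question looks roughly like this sketch,
mirroring the code linked above; the key constants live in
linux/btrfs_tree.h on current kernels:]

    #include <sys/ioctl.h>
    #include <linux/btrfs.h>
    #include <linux/btrfs_tree.h>

    /* search the quota tree for the qgroup items of one subvolume */
    int search_qgroup(int fd, __u64 subvol_id) {
            struct btrfs_ioctl_search_args args = {};

            args.key.tree_id = BTRFS_QUOTA_TREE_OBJECTID;
            args.key.min_type = BTRFS_QGROUP_STATUS_KEY;
            args.key.max_type = BTRFS_QGROUP_LIMIT_KEY;
            args.key.min_offset = subvol_id;  /* does NOT filter as hoped */
            args.key.max_offset = subvol_id;
            args.key.max_objectid = (__u64) -1;
            args.key.max_transid = (__u64) -1;
            args.key.nr_items = 256;

            return ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
    }
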
Lennart


RFE: per-subvolume timestamp that is updated on every change to a subvolume

2015-01-05 Thread Lennart Poettering
Heya!

I am looking for a nice way to query the overall last modification
timestamp of a subvolume, i.e. the most recent mtime of *any* file or
directory within a subvolume. Ideally, I think, there would be a
btrfs_timespec field for this in struct btrfs_root_item; alas, there
isn't, AFAICS. Any chance this can be added?

Or is there another workable way to query this value? Maybe determine
it from the current generation of a subvolume or so? Is that tracked?
Ideas?

Lennart


Re: BTRFS_IOC_TREE_SEARCH ioctl

2015-01-05 Thread Lennart Poettering
On Mon, 05.01.15 18:22, Hugo Mills (h...@carfax.org.uk) wrote:

 On Mon, Jan 05, 2015 at 06:15:12PM +0100, Lennart Poettering wrote:
  Heya,
  
  I recently added some btrfs magic to systemd's machinectl/nspawn
  tool. More specifically it can now show the disk usage of a container
  that is stored in a btrfs subvolume. For that I made use of the btrfs
  quota logic. To read the current disk usage of a subvolume I took
  inspiration from btrfs-progs, most specifically the
  BTRFS_IOC_TREE_SEARCH ioctl(). Unfortunately, documentation for the
  ioctl seems to be lacking, but there are some things about it I
  fail to grok:
  
  What precisely are the semantics of the ioctl, regarding the search
  key min/max values (the fields of struct btrfs_ioctl_search_key)? I
  kinda assumed that setting them would result in only objects being
  returned that are within the min/max ranges. However, that appears not
  to be the case. At least the min_offset/max_offset setting appears to
  be ignored?
 
This is an old argument. :)
 
Keys have three parts, so it's plausible (but, in this case, wrong)
 to consider the space you're searching to be a 3-dimensional space of
 (object, type, offset), which seems to be what you're expecting. A
 min, max pair would then define an oblong subset of the keyspace from
 which to retrieve keys.

However, that's not actually what's happening. Keys are indexed
 within their tree(s) by a concatenation of the items in the key. A
 key, therefore, should be thought of as a single 136-bit integer, and
 the keys are lexically ordered, (object||type||offset), where || is
 the concatenation operator. You get every key _lexically ordered_
 between the min and max values. This is a superset of the
 3-dimensional results above.

Ah, I see. Makes sense.
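
[In code terms, the ordering Hugo describes is a plain tuple comparison;
the ioctl returns every key k with min <= k <= max under a sketch like
this:]

    #include <stdint.h>

    /* keys order as one 136-bit value: objectid, then type, then
     * offset -- not as three independently filtered axes */
    static int key_cmp(uint64_t obj_a, uint8_t type_a, uint64_t off_a,
                       uint64_t obj_b, uint8_t type_b, uint64_t off_b) {
            if (obj_a != obj_b)
                    return obj_a < obj_b ? -1 : 1;
            if (type_a != type_b)
                    return type_a < type_b ? -1 : 1;
            if (off_a != off_b)
                    return off_a < off_b ? -1 : 1;
            return 0;
    }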

I figure the comments in btrfs.h next to struct
btrfs_ioctl_search_key could use some updating in this regard. They
pretty explicitly suggest that the 3 axes were independent and that each
element individually would be between the respective min/max when
returning...

Ideally the structure would just have two fields called max and
min or so, of type btrfs_disk_key, right? In that case I figure the
behaviour would have been clear. It's particularly confusing that the
disk key fields appear in a different order than otherwise used, and
with min_transid+max_transid in the middle...

Which brings me to my question: how does {min|max}_transid affect the
search result? Is this axis orthogonal, or is it likewise not?

Thanks for the explanations!

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-17 Thread Lennart Poettering
On Mon, 16.06.14 09:05, Josef Bacik (jba...@fb.com) wrote:

 So you are doing all the right things from what I can tell, I'm just
 a little confused about when you guys run fsync.  From what I can
 tell it's only when you open the journal file and when you switch it
 to offline.  I didn't look too much past this point so I don't
 know how often these things happen.  Are you taking an individual
 message, writing it, updating the head of the file and then
 fsync'ing?  Or are you getting a good bit of dirty log data and
 fsyncing occasionally?

The latter. Basically, when opening a file for writing we mark it in the
header as online, then fsync() it. When we close a file we fsync() it,
then change the header to offline, and sync() again. Also, 5 min after
each write we will also put things to offline, until the next write,
when we will put things to online again. Finally, if something is logged
at priorities EMERG, ALERT or CRIT we will sync immediately (which
actually should never happen in real life, unless something is really
broken -- a simple way to check whether anything like this got written is
journalctl -p crit).

Also, we rotate and start a new file every now and then, when we hit a
size limit, but that happens very seldom.

Putting this together: we should normally fsync() only very
infrequently. 
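
[A sketch of that state dance, with hypothetical names; the point is
that fsync() happens only at the few transitions, not per entry:]

    #include <unistd.h>

    enum { STATE_OFFLINE = 0, STATE_ONLINE = 1 };

    struct header { unsigned char state; };

    static void with_file_online(struct header *h, int fd,
                                 void (*append_entries)(void)) {
            h->state = STATE_ONLINE;
            fsync(fd);            /* header change hits the disk first */
            append_entries();     /* appends via mmap, no fsync per entry */
            fsync(fd);            /* flush the appended entries */
            h->state = STATE_OFFLINE;
            fsync(fd);            /* only now declare the file clean */
    }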

 What would cause btrfs problems is if you fallocate(), write a small
 chunk, fsync, write a small chunk again, fsync again etc.  Fallocate
 saves you the first write around, but if the next write is within
 the same block as the previous write we'll end up triggering cow and
 enter fragmented territory.  If this is what is what journald is
 doing then that would be good to know, if not I'd like to know what
 is happening since we shouldn't be fragmenting this badly.

Hmm, the only way I see that happening is if a lot of stuff is
logged at these super-high log levels mentioned above. But then again,
that really should never happen in real life.

Could anyone who's experiencing the slowdowns have a look at the
journalctl output mentioned above? Do you have more than a few lines
printed like that?

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Lennart Poettering
On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

  I am not really following though why this trips up btrfs though. I am
  not sure I understand why this breaks btrfs COW behaviour. I mean,
  fallocate() isn't necessarily supposed to write anything really, it's
  mostly about allocating disk space in advance. I would claim that
  journald's usage of it is very much within the entire reason why it
  exists...
 
 I don't believe that fallocate() makes any difference to fragmentation on 
 BTRFS.  Blocks will be allocated when writes occur, so regardless of an 
 fallocate() call the usage pattern in systemd-journald will cause 
 fragmentation.

journald's write pattern looks something like this: append something to
the end, make sure it is written, then update a few offsets stored at
the beginning of the file to point to the newly appended data. This is
of course not easy to handle for COW file systems. But then again, it's
probably not too different from access patterns of other database or
database-like engines...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sat, 14.06.14 09:52, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Which effectively means that by the time the 8 MiB is filled, each 4 KiB 
  block has been rewritten to a new location and is now an extent unto 
  itself.  So now that 8 MiB is composed of 2048 new extents, each one a 
  single 4 KiB block in size.
 
 Several people pointed at fallocate as the problem. But I don't
 understand the reason.

BTW, the reason we use fallocate() in journald is not about trying to
optimize anything. It's only used for one reason: to avoid SIGBUS on
disk/quota full, since we actually write everything to the files using
mmap(). I mean, writing things with mmap() is always problematic, and
handling write errors is awfully difficult, but at least two of the most
common reasons for failure we'd like protect against in advance, under
the assumption that disk/quota full will be reported immediately by the
fallocate(), and the mmap writes later on will then necessarily succeed.

I am not really following why this trips up btrfs, though. I am
not sure I understand why this breaks btrfs COW behaviour. I mean,
fallocate() isn't necessarily supposed to write anything really, it's
mostly about allocating disk space in advance. I would claim that
journald's usage of it is very much within the entire reason why it
exists...

Anyway, happy to change these things around if necessary, but first I'd
like to have a very good explanation why fallocate() wouldn't be the
right thing to invoke here, and a suggestion what we should do instead
to cover this usecase...
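
[To make the rationale concrete, a sketch of the pattern:]

    #include <fcntl.h>
    #include <sys/mman.h>

    /* reserve the space up front so that later stores through the
     * mapping cannot run into ENOSPC -- which would surface as SIGBUS */
    void *map_journal(int fd, size_t size) {
            if (posix_fallocate(fd, 0, size) != 0)
                    return MAP_FAILED;  /* disk/quota full reported here */

            return mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    }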

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Wed, 11.06.14 20:32, Chris Murphy (li...@colorremedies.com) wrote:

  systemd has a very stupid journal write pattern. It checks if there
  is space in the file for the write, and if not it fallocates the
  small amount of space it needs (it does *4 byte* fallocate calls!)

Not really the case. 

http://cgit.freedesktop.org/systemd/systemd/tree/src/journal/journal-file.c#n354

We allocate 8 MiB at minimum.

  and then does the write to it.  All this does is fragment the crap
  out of the log files because the filesystems cannot optimise the
  allocation patterns.

Well, it would be good if you'd tell me what to do instead...

I am invoking fallocate() in advance, because we write those files with
mmap() and that of course would normally trigger SIGBUS already for the
most boring of reasons, such as disk full/quota full or so. Hence,
before we do anything like that, we invoke fallocate() to ensure that
the space is actually available... As far as I can see, that is pretty much
in line with what fallocate() is supposed to be useful for; the man page
says this explicitly:

 ...After a successful call to posix_fallocate(), subsequent writes
  to bytes in the specified range are guaranteed not to fail because
  of lack of disk space.

Happy to be informed that the man page is wrong. 

I am also happy to change our code, if it really is the wrong thing to
do. Note however that I generally favour correctness and relying on
documented behaviour, instead of nebulous optimizations whose effects
might change with different file systems or kernel versions...

  Yup, it fragments journal files on XFS, too.
  
  http://oss.sgi.com/archives/xfs/2014-03/msg00322.html
  
  IIRC, the systemd developers consider this a filesystem problem and
  so refused to change the systemd code to be nice to the filesystem
  allocators, even though they don't actually need to use fallocate...

What? No need to be a dick. Nobody ever pinged me about this. And yeah, I
think I have a very good reason to use fallocate() -- in fact, the only
reason the man page explicitly mentions.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sun, 15.06.14 05:43, Duncan (1i5t5.dun...@cox.net) wrote:

 The base problem isn't fallocate per se, though it's the trigger in 
 this case.  The base problem is that for COW-based filesystems, *ANY* 
 rewriting of existing file content results in fragmentation.
 
 It just so happens that the only reason there's existing file content to 
 be rewritten (as opposed to simply appending) in this case, is because of 
 the fallocate.  The rewrite of existing file content is the problem, but 
 the existing file content is only there in this case because of the 
 fallocate.
 
 Taking a step back...
 
 On a non-COW filesystem, allocating 8 MiB ahead and writing into it 
 rewrites into the already allocated location, thus guaranteeing extents 
 of 8 MiB each, since once the space is allocated it's simply rewritten 
 in-place.  Thus, on a non-COW filesystem, pre-allocating in something larger 
 than single filesystem blocks when an app knows the data is eventually 
 going to be written in to fill that space anyway is a GOOD thing, which 
 is why systemd is doing it.

Nope, that's not why we do it. We do it to avoid SIGBUS on disk full...

 But on a COW-based filesystem fallocate is the exact opposite, a BAD 
 thing, because an fallocate forces the file to be written out at that 
 size, effectively filled with nulls/blanks.  Then the actual logging 
 comes along and rewrites those nulls/blanks with actual data, but it's 
 now a rewrite, which on a COW, copy-on-write, based filesystem, the 
 rewritten block is copied elsewhere, it does NOT overwrite the existing 
 null/blank block, and elsewhere by definition means detached from the 
 previous blocks, thus in an extent all by itself.

Well, quite frankly I am not entirely sure why fallocate() would be of any
use like that on COW file systems, if this is really how it is
implemented... I mean, as I understand fallocate() -- and as the man
page suggests -- it is something for reserving space on disk, not for
writing out anything. This is why journald is invoking it, to reserve
the space, so that later write accesses to it will not require any
reservation anymore, and hence are unlikely to fail.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-15 Thread Lennart Poettering
On Sat, 14.06.14 12:59, Kai Krakow (hurikha...@gmail.com) wrote:

 
 Duncan 1i5t5.dun...@cox.net schrieb:
 
  As they say, Whoosh!
  
  At least here, I interpreted that remark as primarily sarcastic
  commentary on the systemd devs' apparent attitude, which can be
  (controversially) summarized as: Systemd doesn't have problems because
  it's perfect.  Therefore, any problems you have with systemd must instead
  be with other components which systemd depends on.
 
 Come on, sorry, but this is fud. Really... ;-)

Interestingly, I never commented on anything in this area, and neither
did anybody else from the systemd side, AFAICS. The entire btrfs defrag
thing I wasn't aware of before this thread started on the systemd ML a
few days ago. I am not sure where you get your ideas about our
attitude from. God, with behaviour like that you just make us ignore
you, Duncan.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [HEADS-UP] Discoverable Partitions Spec

2014-03-10 Thread Lennart Poettering
On Mon, 10.03.14 19:34, Goffredo Baroncelli (kreij...@libero.it) wrote:

Heya,

 Instead of relying on the subvolume UUID, why not rely on the subvolume
 name: it would be simpler and more flexible to manage them.
 
 For example supposing to use '@' as prefix for a subvolume name:
 
 @ - root filesystem
 @etc  - etc
 @home - home
 [...]

Well, the name is the property of the admin, really. There needs to be a way
for the admin to label his subvolumes, with a potentially localized
name. This makes it unsuitable for our purpose; we cannot just take
possession of this and leave the admin with nothing.

On GPT there are also GPT partition labels and partition types. The
former are the property of the admin; he can place there whatever he wants,
in whatever language he chooses... The latter, however, are how we make
sense of things on a semantic level.

 Or in another way we could group the different systems in subdirectories:
 
 @home - home of all the systems
 @srv  - srv  of all the systems
 fedora/@  - root of a fedora system
 fedora/@etc   - etc of the fedora system
 fedora2/@ - root of a fedora2 system
 fedora2/@etc  - etc of the fedora2 system

I am pretty sure automatic discovery of mount points should not cover
the use case where people install multiple distributions into the same
btrfs volume. The automatic logic should cover the simple cases only,
and it sounds way over the top to support installing multiple OSes into
the same btrfs... I mean, people can do that if they want to, they just
have to write a proper fstab, which I think is not too much to ask...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [HEADS-UP] Discoverable Partitions Spec

2014-03-10 Thread Lennart Poettering
On Mon, 10.03.14 14:53, Chris Murphy (li...@colorremedies.com) wrote:

 Since it's not a given whether a parent or child subvolume is the one
 being updated, it's ambiguous which one to use/automount if there is
 inheritance of the proposed subvolume type GUID at snapshot
 time. Inheritance of the proposed GUID at snapshot time sounds
 untenable.
 
 Failsafe for rootfs is to snapshot, mount it, apply updates to the
 snapshot in a chroot. That way a failed update means the snapshot can
 be immediately deleted. And the OS is overall more stable by not
 having running binaries modified or yanked out from under it. Here,
 using the current subvolume at reboot is the rollback; changing to the
 snapshot is the upgrade.
 
 Whereas for home, rebooting to the snapshot is a rollback, which I
 certainly don't want by default.
 
 I don't see the right way to handle this automatically or by default.

I am in no way bound to the idea of having subvolume type UUIDs for
this. There are other options. For example, there's already the concept
of having default subvolumes (btrfs subvolume set-default...), maybe
we can extend that a little bit, and allow different kinds of default
subvolumes: one default /home subvolume, one default /srv subvolume,
and so on, plus one default root subvolume. The vocabulary
for the available default subvolumes could be a free-form string where
the empty string would be the existing default subvolume. Or so...

When people play games with subvolumes and snapshots we could then
simply ask them to update where these default subvolumes point... Of
course this simply exports to the admin the problem of combining
subvolumes properly.

Another option would be to introduce a high-level concept of a
subvolume set or so, which binds subvolumes together, and of which
there could be a default subvolume set. Within such a subvolume set one
could then use type UUIDs or so to mount things properly. Or something
like that...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] [HEADS-UP] Discoverable Partitions Spec

2014-03-10 Thread Lennart Poettering
On Mon, 10.03.14 23:39, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Well, the name is the property of the admin, really. There needs to be a
  way for the admin to label his subvolumes, with a potentially localized
  name. This makes it unsuitable for our purpose; we cannot just take
  possession of it and leave the admin with nothing.
 
 Instead of the name, we could use xattrs to store this information.

Ah, using xattrs for this is indeed an option. That way we should be able to
attach any kind of information we like to a subvolume.

Hmm, I figure though that there is currently no way to read xattrs off a
subvolume without first mounting it? Having to mount all subvolumes before we
can make sense of them and mount them to the right place certainly sounds
less than ideal...
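
For concreteness, a minimal sketch of reading such an xattr off an already
mounted subvolume; the xattr name "user.mountpoint" and the path are made up,
no such scheme is specified anywhere:

/* read a hypothetical "user.mountpoint" xattr from a mounted subvolume */
#include <stdio.h>
#include <sys/types.h>
#include <sys/xattr.h>

int main(void) {
        char value[256];
        ssize_t len = getxattr("/mnt/volume/fedora/@etc", "user.mountpoint",
                               value, sizeof(value) - 1);
        if (len < 0) { perror("getxattr"); return 1; }
        value[len] = '\0';
        printf("subvolume wants to be mounted on: %s\n", value);
        return 0;
}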

  On GPT there are also GPT partition labels and partition types. The former
  are the property of the admin; he can put there whatever he wants, in
  whatever language he chooses... The latter, however, are how we make sense
  of things on a semantic level.
  
  Or in another way we could group the different systems in subdirectories:
 
  @home  - home of all the systems
  @srv   - srv  of all the systems
  fedora/@   - root of a fedora system
  fedora/@etc- etc of the fedora system
  fedora2/@  - root of a fedora2 system
  fedora2/@etc   - etc of the fedora2 system
  
  I am pretty sure automatic discovery of mount points should not cover the
  use case where people install multiple distributions into the same btrfs
  volume. The automatic logic should cover the simple cases only, and it
  sounds way over the top to support installing multiple OSes into the same
  btrfs... I mean, people can do that if they want to, they just have to
  write a proper fstab, which I think is not too much to ask...
 
 In your specification you referred to the use case of containers (via
 nspawn / libvirt-lxc), which have to boot a disk image. Why not use a
 container on a btrfs snapshot? I think it would be reasonable to have
 different containers on snapshots of the same filesystem tree.

Hmm, dunno, you might have a point there...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-30 Thread Lennart Poettering
On Thu, 30.09.10 21:59, Kay Sievers (kay.siev...@vrfy.org) wrote:

  So my question is, is this what we want?  Do I just need to quit bitching
  and make it work?  Or am I doing something wrong?  This is a completely new
  area for me, so I'm just looking around at what md/dm does and trying to
  mirror it for my own uses; if that's not what I should be doing please tell
  me, otherwise this seems like a lot of work for a very shitty solution to
  our problem.  Thanks,
 
 Yeah, that matches what I was experiencing when thinking about the
 options. Making a btrfs mount a fake blockdev of zero size seems like a
 pretty weird hack, just to get some 'dead' directories in sysfs. A btrfs
 mount is just not a raw blockdev, and should probably not pretend to be
 one.
 
 I guess a statfs()-like call from the filesystem side and not the block
 side, which can put out such information in some generic way, would fit
 better here.

Note that for my particular use case it would even suffice to have two flags
in struct statfs or struct statvfs that encode whether there's at least one
SSD in the fs, resp. at least one rotating disk in the fs.

/* ST_SSD and ST_ROTATING are proposed flags, they do not exist (yet) */
if (statvfs.f_flag & ST_SSD)
        printf("FS contains at least one SSD disk\n");
if (statvfs.f_flag & ST_ROTATING)
        printf("FS contains at least one rotating disk\n");

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-29 Thread Lennart Poettering
On Wed, 29.09.10 16:25, Ric Wheeler (rwhee...@redhat.com) wrote:

 This in fact is how all current readahead implementations work, be it the
 Fedora, SUSE or Ubuntu readahead or Arjan's sreadahead. What's new is that
 in the systemd case we try to test for SSD/rotating properly, instead of
 just hardcoding a check for /sys/class/block/sda/queue/rotational.
 
 
 A couple of questions pop into mind - is systemd the right place to
 automatically tune readahead?  If this is a generic feature for the
 type of device, it sounds like something that we should be doing
 somewhere else in the stack (not relying on tuning from user space).

Note that this is not the kind of readahead that is controllable via
/sys/class/block/sda/queue/read_ahead_kb. This is about detecting hot files
at boot and then preloading them on the next boot, i.e. the problem Jens once
proposed fcache for.
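
A minimal sketch of the "preload on the next boot" step, assuming a plain
text list of hot files recorded during the previous boot; the list path is
made up, and real implementations (sreadahead and friends) are considerably
more elaborate:

/* ask the kernel to pre-read every file named in a hot-file list */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
        FILE *list = fopen("/var/lib/readahead/hot-files.txt", "r"); /* made up */
        if (!list) { perror("fopen"); return 1; }

        char path[4096];
        while (fgets(path, sizeof(path), list)) {
                path[strcspn(path, "\n")] = '\0';
                int fd = open(path, O_RDONLY);
                if (fd < 0)
                        continue;
                /* start background readahead of the whole file (len 0 = to EOF) */
                posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
                close(fd);
        }
        fclose(list);
        return 0;
}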

 Second question is why is checking in /sys a big deal, would  you
 prefer an interface like we did for alignment in libblkid?

Well, currently there's no way to discover the underlying block devices
if you have a btrfs mount point. This is what Josef's patch added for
us.
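
For reference, a sketch using the device-enumeration ioctls that eventually
landed in mainline btrfs (BTRFS_IOC_FS_INFO / BTRFS_IOC_DEV_INFO); this is
not necessarily the exact interface of the patch discussed in this thread,
and the mount point is made up:

/* list the block devices backing a btrfs mount point */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(void) {
        int fd = open("/mnt/volume", O_RDONLY); /* made-up mount point */
        if (fd < 0) { perror("open"); return 1; }

        struct btrfs_ioctl_fs_info_args fs;
        memset(&fs, 0, sizeof(fs));
        if (ioctl(fd, BTRFS_IOC_FS_INFO, &fs) < 0) { perror("FS_INFO"); return 1; }

        for (__u64 id = 1; id <= fs.max_id; id++) {
                struct btrfs_ioctl_dev_info_args dev;
                memset(&dev, 0, sizeof(dev));
                dev.devid = id;
                if (ioctl(fd, BTRFS_IOC_DEV_INFO, &dev) < 0)
                        continue; /* devid not in use */
                printf("devid %llu: %s\n",
                       (unsigned long long)dev.devid, (char *)dev.path);
        }
        close(fd);
        return 0;
}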

Lennart

-- 
Lennart Poettering - Red Hat, Inc.


Re: [PATCH] Btrfs: add a disk info ioctl to get the disks attached to a filesystem

2010-09-28 Thread Lennart Poettering
On Tue, 28.09.10 20:08, Josef Bacik (jo...@redhat.com) wrote:

 
 On Tue, Sep 28, 2010 at 07:25:13PM -0400, Christoph Hellwig wrote:
  On Tue, Sep 28, 2010 at 04:53:16PM -0400, Josef Bacik wrote:
   This was a request from the systemd guys.  They need a quick and easy way
   to get all devices attached to a Btrfs filesystem in order to check if any
   of the disks are SSD for...something, I didn't ask :).  I've tested this
   with the btrfs-progs patch that accompanies this patch.  Thanks,
  
  So please tell the systemd guys to explain what the fuck they're doing to
  linux-fsdevel and find a proper interface. Chances that they will fuck up
  as much as just about every other lowlevel userspace tool are very high.
  
 
 Lennart? :).  And Christoph, what would be a good interface?  LVM has a
 slaves/ subdir in sysfs which symlinks to all of its devices; would you
 rather I resurrect the sysfs stuff for Btrfs and do a similar thing?  I'm
 open to suggestions, I just took the quick and painless way out.  Thanks,

When doing readahead you want to know whether you are on SSD or rotating
media, because a) you want to order the readahead requests at bootup by
access time on SSDs and by on-disk location on rotating media, and b) you
might want to prioritize readahead reads over other reads on rotating media,
but prefer other reads over readahead reads on SSDs.

This in fact is how all current readahead implementations work, be it the
Fedora, SUSE or Ubuntu readahead or Arjan's sreadahead. What's new is that in
the systemd case we try to test for SSD/rotating properly, instead of just
hardcoding a check for /sys/class/block/sda/queue/rotational.
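
A minimal sketch of that per-device test; the device name is a made-up
example, and in practice it would come from enumerating the devices backing
the filesystem rather than being hardcoded:

/* read /sys/class/block/<dev>/queue/rotational: '0' = SSD, '1' = rotating */
#include <stdio.h>

int main(void) {
        const char *dev = "sdb"; /* made-up device name */
        char path[128], flag;

        snprintf(path, sizeof(path),
                 "/sys/class/block/%s/queue/rotational", dev);

        FILE *f = fopen(path, "r");
        if (!f) { perror(path); return 1; }
        if (fread(&flag, 1, 1, f) == 1)
                printf("%s is %s\n", dev, flag == '0' ? "an SSD" : "rotating");
        fclose(f);
        return 0;
}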

I hope this explains what the fuck we are doing.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.