Notes on support for multiple devices for a single filesystem
FYI: here's a little writeup I did this summer on support for filesystems spanning multiple block devices:

--

=== Notes on support for multiple devices for a single filesystem ===

== Intro ==

Btrfs (and an experimental XFS version) can support multiple underlying block devices for a single filesystem instance in a generalized and flexible way.

Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and the special real-time device in XFS, all data and metadata may be spread over a potentially large number of block devices, and not just one (or two).

== Requirements ==

We want a scheme to support these complex filesystem topologies in a way that is

 a) easy to set up and non-fragile for the users
 b) scalable to a large number of disks in the system
 c) recoverable without requiring user space running first
 d) generic enough to work for multiple filesystems or other consumers

Requirement a) means that a multiple-device filesystem should be mountable by a simple fstab entry (UUID/LABEL or some other cookie) which continues to work when the filesystem topology changes.

Requirement b) implies we must not do a scan over all available block devices in large systems, but use an event-based callout on detection of new block devices.

Requirement c) means there must be some way to add devices to a filesystem by kernel command lines, even if this is not the default way, and might require additional knowledge from the user / system administrator.

Requirement d) means that we should not implement this mechanism inside a single filesystem.

== Prior art ==

* External log and realtime volume

The most common way to specify the external log device and the XFS real time device is to have a mount option that contains the path to the block special device for it. This variant means a mount option is always required, and requires that the device name doesn't change, which is enough with udev-generated unique device names (/dev/disk/by-{label,uuid}).

An alternative way, optionally supported by ext3 and reiserfs and exclusively supported by jfs, is to open the journal device by the device number (dev_t) of the block special device. While this doesn't require an additional mount option when the device number is stored in the filesystem superblock, it relies on the device number being stable, which is getting increasingly unlikely in complex storage topologies.

* RAID (MD) and LVM

Software RAID and volume managers, although not strictly filesystems, have a very similar problem finding their devices. The traditional solution used for early versions of the Linux MD driver and LVM version 1 was to hook into the partition scanning code and add devices with the right partition type to a kernel-internal list of potential RAID / LVM devices. This approach has the advantage of being simple to implement, fast, reliable and not requiring additional user space programs in the boot process. The downside is that it only works with specific partition table formats that allow specifying a partition type, and doesn't work with unpartitioned disks at all.

Recent MD setups and LVM2 thus move the scanning to user space, typically using a command iterating over all block device nodes and performing the format-specific scanning. While this is more flexible than the in-kernel scanning, it scales very badly to a large number of block devices, and requires additional user space commands to run early in the boot process. A variant of this scheme runs a scanning callout from udev once disk devices are detected, which avoids the scanning overhead.

== High-level design considerations ==

Due to requirement b) we need a layer that finds devices for a single fstab entry. We can either do this in user space, or in kernel space.

As we've traditionally always done UUID/LABEL to device mapping in userspace, and we already have libvolume_id and libblkid dealing with the specialized case of UUID/LABEL to single device mapping, I would recommend to keep doing this in user space and try to reuse libvolume_id / libblkid.

There are two options to perform the assembly of the device list for a filesystem:

 1) whenever libvolume_id / libblkid find a device detected as a multi-device capable filesystem it gets added to a list of all devices of this particular filesystem type. At mount time mount(8) or a mount.fstype helper calls out to the libraries to get a list of devices belonging to this filesystem type and translates them to device names, which can be passed to the kernel on the mount command line.

    Advantage: Requires a mount.fstype helper or fs-specific knowledge in mount(8).
    Disadvantages: Requires libvolume_id / libblkid to keep state.

 2) whenever libvolume_id / libblkid find a device detected as a multi-device capable filesystem they call into the kernel through an ioctl / sysfs / etc to add it to a list in kernel space. The kernel code
Re: [PATCH] fix wrong value returned from btrfs_listxattr when buffer is too small
On Fri, 2008-12-12 at 14:36 -0800, Yehuda Sadeh Weinraub wrote:

Fix bug, btrfs_listxattr doesn't return an error when the buffer size is too small (ret was overridden).

Thank you, I've applied this one locally and will push it out.

-chris

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: compilation problem on last unstable
On Wed, Dec 17, 2008 at 05:43:50PM +, Michele Petrazzo wrote:

Hi, I just tried to compile the last unstable version, but:

  CC [M]  /home/michele/btrfs-unstable-standalone/inode.o
/home/michele/btrfs-unstable-standalone/inode.c: In function ‘btrfs_new_inode’:
/home/michele/btrfs-unstable-standalone/inode.c:3470: error: implicit declaration of function ‘current_fsuid’
/home/michele/btrfs-unstable-standalone/inode.c:3471: error: implicit declaration of function ‘current_fsgid’
/home/michele/btrfs-unstable-standalone/inode.c: In function ‘btrfs_cache_create’:
/home/michele/btrfs-unstable-standalone/inode.c:4527: warning: passing argument 5 of ‘kmem_cache_create’ from incompatible pointer type
/home/michele/btrfs-unstable-standalone/inode.c: At top level:
/home/michele/btrfs-unstable-standalone/inode.c:4966: warning: initialization from incompatible pointer type
/home/michele/btrfs-unstable-standalone/inode.c:4970: warning: initialization from incompatible pointer type
/home/michele/btrfs-unstable-standalone/inode.c:5024: warning: initialization from incompatible pointer type
/home/michele/btrfs-unstable-standalone/inode.c:5030: warning: initialization from incompatible pointer type
/home/michele/btrfs-unstable-standalone/inode.c:5040: warning: initialization from incompatible pointer type
make[2]: *** [/home/michele/btrfs-unstable-standalone/inode.o] Error 1
make[1]: *** [_module_/home/michele/btrfs-unstable-standalone] Error 2
make[1]: Leaving directory `/usr/src/linux-headers-2.6.26-1-686'
make: *** [all] Error 2
michele:~/btrfs-unstable-standalone$
michele:~/btrfs-unstable-standalone$ uname -r
2.6.26-1-686

from debian

Currently btrfs only compiles on 2.6.27 and above although support all the way back to 2.6.18 is planned. I'm currently using Ubuntu 8.10 for all btrfs testing.

Thanks,
Michele
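
The version requirement in the reply can be checked before attempting a build. A minimal sketch, assuming the release string has the `uname -r` form shown in the report; `MIN_KERNEL`, `parse_release` and `kernel_supported` are illustrative names and not part of the btrfs build system, and the second release string is a made-up 2.6.27 example:

```python
import re

# Minimum kernel the standalone module is said to compile on (from the reply above).
MIN_KERNEL = (2, 6, 27)

def parse_release(release):
    """Extract the leading numeric version from a `uname -r` style string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return tuple(int(part) for part in match.groups())

def kernel_supported(release, minimum=MIN_KERNEL):
    """Return True if the release is at least the minimum version."""
    return parse_release(release) >= minimum

if __name__ == "__main__":
    # The Debian kernel from the report is older than the stated minimum.
    print(kernel_supported("2.6.26-1-686"))
    # Hypothetical 2.6.27-based release string for comparison.
    print(kernel_supported("2.6.27-7-generic"))
```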
Re: weird bash autocomplete issue
On Wed, Dec 17, 2008 at 15:17, Chris Mason chris.ma...@oracle.com wrote:
On Wed, 2008-12-17 at 14:59 +0100, Kay Sievers wrote:
On Wed, Dec 17, 2008 at 09:45, Roland devz...@web.de wrote:
On Tue, 2008-12-16 at 22:41 +0100, Kay Sievers wrote:

open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fstat64(3, {st_dev=makedev(0, 19), st_ino=256, st_mode=S_IFDIR|0555, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=18, st_atime=2008/12/16-21:32:38, st_mtime=2008/12/16-21:32:37, st_ctime=2008/12/16-21:32:37}) = 0
getdents64(3, {{d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.} {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..} {d_ino=257, d_off=3, d_type=DT_DIR, d_reclen=24, d_name=test} {d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 104
_llseek(3, 3, [3], SEEK_SET) = 0
getdents64(3, {{d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 32

On Tue, Dec 16, 2008 at 22:26, devz...@web.de wrote:

i assume it has something to do with the large value for d_off of the last dirent ?

Looks like, 9223372036854775807 is just LLONG_MAX.

I can not reproduce that (on openSUSE 11.1). I also don't see the _llseek() calls.

weird. no btrfs issue then !?

open(., O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 18), ...
getdents64(3, {
  {d_ino=260, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.}
  {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..}
  {d_ino=261, d_off=3, d_type=DT_REG, d_reclen=24, d_name=a}
  {d_ino=262, d_off=4, d_type=DT_REG, d_reclen=24, d_name=b}
  {d_ino=263, d_off=5, d_type=DT_REG, d_reclen=24, d_name=c}
  {d_ino=264, d_off=6, d_type=DT_DIR, d_reclen=24, d_name=test}
  {d_ino=265, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}
}, 4096) = 176
getdents64(3, {}, 4096) = 0
close(3)

This is with today's git kernel and today's standalone btrfs unstable.

You are using the distro kernel and compile the standalone btrfs module?

yes. to be honest, i`m slightly newer than 11.1 (did zypper dup to latest factory some days ago)

linux:~ # bash -version
GNU bash, version 3.2.39(1)-release (i586-suse-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.

That is still the same bash, the one you use is a 32bit version. Do you run a 32 bit kernel too? I could try that on a 32 bit box then.

At least on my 32 bit box, tab completion works fine.

It works fine here too on 64 bit. I'll try with openSUSE 11.1 on a 32bit box later tonight.

But, the d_off of LLONG_MAX comes from btrfs_readdir(). Git had a feature where it would loop infinitely over a directory in some cases and this was my workaround.

There are other filesystems doing the same, usually with 32bit int max instead of 64 bit int max, I guess that should work fine.

This should be fixed in git by now, so I can drop it if that really is causing problems in bash.

I'll come back if I can reproduce it with the same environment Roland is using.

Kay
Re: weird bash autocomplete issue
On Wed, 2008-12-17 at 14:59 +0100, Kay Sievers wrote:
On Wed, Dec 17, 2008 at 09:45, Roland devz...@web.de wrote:
On Tue, 2008-12-16 at 22:41 +0100, Kay Sievers wrote:

open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fstat64(3, {st_dev=makedev(0, 19), st_ino=256, st_mode=S_IFDIR|0555, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=18, st_atime=2008/12/16-21:32:38, st_mtime=2008/12/16-21:32:37, st_ctime=2008/12/16-21:32:37}) = 0
getdents64(3, {{d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.} {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..} {d_ino=257, d_off=3, d_type=DT_DIR, d_reclen=24, d_name=test} {d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 104
_llseek(3, 3, [3], SEEK_SET) = 0
getdents64(3, {{d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 32

On Tue, Dec 16, 2008 at 22:26, devz...@web.de wrote:

i assume it has something to do with the large value for d_off of the last dirent ?

Looks like, 9223372036854775807 is just LLONG_MAX.

I can not reproduce that (on openSUSE 11.1). I also don't see the _llseek() calls.

weird. no btrfs issue then !?

open(., O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 18), ...
getdents64(3, {
  {d_ino=260, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.}
  {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..}
  {d_ino=261, d_off=3, d_type=DT_REG, d_reclen=24, d_name=a}
  {d_ino=262, d_off=4, d_type=DT_REG, d_reclen=24, d_name=b}
  {d_ino=263, d_off=5, d_type=DT_REG, d_reclen=24, d_name=c}
  {d_ino=264, d_off=6, d_type=DT_DIR, d_reclen=24, d_name=test}
  {d_ino=265, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}
}, 4096) = 176
getdents64(3, {}, 4096) = 0
close(3)

This is with today's git kernel and today's standalone btrfs unstable.

You are using the distro kernel and compile the standalone btrfs module?

yes. to be honest, i`m slightly newer than 11.1 (did zypper dup to latest factory some days ago)

linux:~ # bash -version
GNU bash, version 3.2.39(1)-release (i586-suse-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.

That is still the same bash, the one you use is a 32bit version. Do you run a 32 bit kernel too? I could try that on a 32 bit box then.

At least on my 32 bit box, tab completion works fine.

But, the d_off of LLONG_MAX comes from btrfs_readdir(). Git had a feature where it would loop infinitely over a directory in some cases and this was my workaround.

This should be fixed in git by now, so I can drop it if that really is causing problems in bash.

-chris
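
The value in the trace can be checked directly. The sketch below only confirms that the d_off reported for the last entry equals LLONG_MAX and that it cannot be represented in a signed 32-bit offset. Whether that is what actually trips the 32-bit bash is still an open question in this thread, and the code does not demonstrate it. The helper name `fits_signed` is made up for the illustration.

```python
LLONG_MAX = 2**63 - 1   # largest signed 64-bit value
INT32_MAX = 2**31 - 1   # largest signed 32-bit value

def fits_signed(value, bits):
    """Return True if value is representable as a signed integer of the given width."""
    return -(2 ** (bits - 1)) <= value <= 2 ** (bits - 1) - 1

# d_off of the last dirent ("linux") as shown in the strace output.
d_off_last = 9223372036854775807

print(d_off_last == LLONG_MAX)       # True
print(fits_signed(d_off_last, 64))   # True
print(fits_signed(d_off_last, 32))   # False
print(fits_signed(INT32_MAX, 32))    # True
```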
Re: Notes on support for multiple devices for a single filesystem
On Wed, 17 Dec 2008 08:23:44 -0500 Christoph Hellwig h...@infradead.org wrote:

FYI: here's a little writeup I did this summer on support for filesystems spanning multiple block devices:

--

=== Notes on support for multiple devices for a single filesystem ===

== Intro ==

Btrfs (and an experimental XFS version) can support multiple underlying block devices for a single filesystem instance in a generalized and flexible way.

Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and the special real-time device in XFS, all data and metadata may be spread over a potentially large number of block devices, and not just one (or two).

== Requirements ==

We want a scheme to support these complex filesystem topologies in a way that is

 a) easy to set up and non-fragile for the users
 b) scalable to a large number of disks in the system
 c) recoverable without requiring user space running first
 d) generic enough to work for multiple filesystems or other consumers

Requirement a) means that a multiple-device filesystem should be mountable by a simple fstab entry (UUID/LABEL or some other cookie) which continues to work when the filesystem topology changes.

device topology?

Requirement b) implies we must not do a scan over all available block devices in large systems, but use an event-based callout on detection of new block devices.

Requirement c) means there must be some way to add devices to a filesystem by kernel command lines, even if this is not the default way, and might require additional knowledge from the user / system administrator.

Requirement d) means that we should not implement this mechanism inside a single filesystem.

One thing I've never seen comprehensively addressed is: why do this in the filesystem at all? Why not let MD take care of all this and present a single block device to the fs layer?

Lots of filesystems are violating this, and I'm sure the reasons for this are good, but this document seems like a suitable place in which to briefly describe those reasons.
Btrfs conference call
Hello everyone,

There will be a btrfs conference call today Dec 17th. Topics will include mainline merging, and making a new stable release.

Time: 1:30pm US Eastern (10:30am Pacific)

* Dial-in Number(s):
  * Toll Free: +1-888-967-2253
  * Toll: +1-650-607-2253
* Meeting id: 665734
* Passcode: 428737 (which hopefully spells 4Btrfs)

-chris
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:
On Wed, 17 Dec 2008 08:23:44 -0500 Christoph Hellwig h...@infradead.org wrote:

FYI: here's a little writeup I did this summer on support for filesystems spanning multiple block devices:

--

=== Notes on support for multiple devices for a single filesystem ===

== Intro ==

Btrfs (and an experimental XFS version) can support multiple underlying block devices for a single filesystem instance in a generalized and flexible way.

Unlike the support for external log devices in ext3, jfs, reiserfs, XFS, and the special real-time device in XFS, all data and metadata may be spread over a potentially large number of block devices, and not just one (or two).

== Requirements ==

We want a scheme to support these complex filesystem topologies in a way that is

 a) easy to set up and non-fragile for the users
 b) scalable to a large number of disks in the system
 c) recoverable without requiring user space running first
 d) generic enough to work for multiple filesystems or other consumers

Requirement a) means that a multiple-device filesystem should be mountable by a simple fstab entry (UUID/LABEL or some other cookie) which continues to work when the filesystem topology changes.

device topology?

Requirement b) implies we must not do a scan over all available block devices in large systems, but use an event-based callout on detection of new block devices.

Requirement c) means there must be some way to add devices to a filesystem by kernel command lines, even if this is not the default way, and might require additional knowledge from the user / system administrator.

Requirement d) means that we should not implement this mechanism inside a single filesystem.

One thing I've never seen comprehensively addressed is: why do this in the filesystem at all? Why not let MD take care of all this and present a single block device to the fs layer?

Lots of filesystems are violating this, and I'm sure the reasons for this are good, but this document seems like a suitable place in which to briefly describe those reasons.

I'd almost rather see this doc stick to the device topology interface in hopes of describing something that RAID and MD can use too. But just to toss some information into the pool:

* When moving data around (raid rebuild, restripe, pvmove etc), we want to make sure the data read off the disk is correct before writing it to the new location (checksum verification).

* When moving data around, we don't want to move data that isn't actually used by the filesystem. This could be solved via new APIs, but keeping it crash safe would be very tricky.

* When checksum verification fails on read, the FS should be able to ask the raid implementation for another copy. This could be solved via new APIs.

* Different parts of the filesystem might want different underlying raid parameters. The easiest example is metadata vs data, where a 4k stripesize for data might be a bad idea and a 64k stripesize for metadata would result in many more rmw cycles.

* Sharing the filesystem transaction layer. LVM and MD have to pretend they are a single consistent array of bytes all the time, for each and every write they return as complete to the FS.

By pushing the multiple device support up into the filesystem, I can share the filesystem's transaction layer. Work can be done in larger atomic units, and the filesystem will stay consistent because it is all coordinated.

There are other bits and pieces like high speed front end caching devices that would be difficult in MD/LVM, but since I don't have that coded yet I suppose they don't really count...

-chris
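
The point about asking for another copy when checksum verification fails on read can be sketched as follows. This is a toy model, not btrfs code: the mirrors are plain dictionaries, CRC32 from zlib stands in for whatever checksum the filesystem actually uses, and `read_verified` is a hypothetical name.

```python
import zlib

def read_verified(mirrors, block, expected_crc):
    """Try each mirror in turn and return the first copy whose CRC32 matches.

    mirrors: list of dicts mapping block number -> bytes (stand-ins for devices).
    Returns (mirror_index, data); raises IOError if no copy verifies.
    """
    for index, mirror in enumerate(mirrors):
        data = mirror.get(block)
        if data is not None and zlib.crc32(data) == expected_crc:
            return index, data
    raise IOError("no valid copy of block %d" % block)

good = b"metadata block"
crc = zlib.crc32(good)
mirrors = [
    {7: b"metadata blocc"},  # first copy is corrupted
    {7: good},               # second copy is intact
]

# The corrupted first copy is skipped and the second copy is returned.
index, data = read_verified(mirrors, 7, crc)
print(index, data == good)
```

A block layer that only exposes one flat array of bytes has no place to carry `expected_crc`, which is the gap the bullet describes.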
Re: weird bash autocomplete issue
On Wed, Dec 17, 2008 at 09:45, Roland devz...@web.de wrote:
On Tue, 2008-12-16 at 22:41 +0100, Kay Sievers wrote:
On Tue, Dec 16, 2008 at 21:46, devz...@web.de wrote:
On Tue, Dec 16, 2008 at 20:37, Roland devz...@web.de wrote:

i have come across a weird autocomplete issue i assume it is related to btrfs.

let`s have some dirs:

/non-btrfs-mount
  ./linux
  ./testdir
/btrfs-mount
  ./linux
  ./testdir

now, if i do cd t<tab> in /non-btrfs-mount, t autocompletes to testdir
same for l<tab>inux - bash autocompletes as expected.

now, the weird thing is, that on /btrfs-mount this behaves different. autocompletion for testdir works, but not for linux dir. weird.

can someone reproduce this ?

Open another shell, find the bash process pid of the first shell with:
  ps afx
and do:
  strace -p <pid>
Go back to the first shell, hit tab, and the trace should show what's going on. You see a significant difference there?

ok, here we go (i hope i did not cut important parts). i don`t see the real issue, but i did another interesting finding - see below

bad (cd l<tab>):

open(., O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
fstat64(3, {st_dev=makedev(0, 19), st_ino=256, st_mode=S_IFDIR|0555, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=18, st_atime=2008/12/16-21:32:38, st_mtime=2008/12/16-21:32:37, st_ctime=2008/12/16-21:32:37}) = 0
getdents64(3, {{d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.} {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..} {d_ino=257, d_off=3, d_type=DT_DIR, d_reclen=24, d_name=test} {d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 104
_llseek(3, 3, [3], SEEK_SET) = 0
getdents64(3, {{d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}}, 4096) = 32

On Tue, Dec 16, 2008 at 22:26, devz...@web.de wrote:

i assume it has something to do with the large value for d_off of the last dirent ?

Looks like, 9223372036854775807 is just LLONG_MAX.

I can not reproduce that (on openSUSE 11.1). I also don't see the _llseek() calls.

weird. no btrfs issue then !?

open(., O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 18), ...
getdents64(3, {
  {d_ino=260, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=.}
  {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=..}
  {d_ino=261, d_off=3, d_type=DT_REG, d_reclen=24, d_name=a}
  {d_ino=262, d_off=4, d_type=DT_REG, d_reclen=24, d_name=b}
  {d_ino=263, d_off=5, d_type=DT_REG, d_reclen=24, d_name=c}
  {d_ino=264, d_off=6, d_type=DT_DIR, d_reclen=24, d_name=test}
  {d_ino=265, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name=linux}
}, 4096) = 176
getdents64(3, {}, 4096) = 0
close(3)

This is with today's git kernel and today's standalone btrfs unstable.

You are using the distro kernel and compile the standalone btrfs module?

yes. to be honest, i`m slightly newer than 11.1 (did zypper dup to latest factory some days ago)

linux:~ # bash -version
GNU bash, version 3.2.39(1)-release (i586-suse-linux-gnu)
Copyright (C) 2007 Free Software Foundation, Inc.

That is still the same bash, the one you use is a 32bit version. Do you run a 32 bit kernel too? I could try that on a 32 bit box then.

Thanks,
Kay
Re: Notes on support for multiple devices for a single filesystem
On Wed, Dec 17, 2008 at 21:58, Chris Mason chris.ma...@oracle.com wrote:
On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:

One thing I've never seen comprehensively addressed is: why do this in the filesystem at all? Why not let MD take care of all this and present a single block device to the fs layer?

Lots of filesystems are violating this, and I'm sure the reasons for this are good, but this document seems like a suitable place in which to briefly describe those reasons.

I'd almost rather see this doc stick to the device topology interface in hopes of describing something that RAID and MD can use too. But just to toss some information into the pool:

* When moving data around (raid rebuild, restripe, pvmove etc), we want to make sure the data read off the disk is correct before writing it to the new location (checksum verification).

* When moving data around, we don't want to move data that isn't actually used by the filesystem. This could be solved via new APIs, but keeping it crash safe would be very tricky.

* When checksum verification fails on read, the FS should be able to ask the raid implementation for another copy. This could be solved via new APIs.

* Different parts of the filesystem might want different underlying raid parameters. The easiest example is metadata vs data, where a 4k stripesize for data might be a bad idea and a 64k stripesize for metadata would result in many more rmw cycles.

* Sharing the filesystem transaction layer. LVM and MD have to pretend they are a single consistent array of bytes all the time, for each and every write they return as complete to the FS.

By pushing the multiple device support up into the filesystem, I can share the filesystem's transaction layer. Work can be done in larger atomic units, and the filesystem will stay consistent because it is all coordinated.

There are other bits and pieces like high speed front end caching devices that would be difficult in MD/LVM, but since I don't have that coded yet I suppose they don't really count...

Features like the very nice and useful directory-based snapshots would also not be possible with simple block-based multi-devices, right?

Kay
Re: Btrfs conference call
Hi,

After the event, can someone provide a brief summary of the conference call to the list, please? :)

Thank you!

On Wed, 2008-12-17 at 09:31 -0500, Chris Mason wrote:

Hello everyone,

There will be a btrfs conference call today Dec 17th. Topics will include mainline merging, and making a new stable release.

--
mg
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 22:20 +0100, Kay Sievers wrote:
On Wed, Dec 17, 2008 at 21:58, Chris Mason chris.ma...@oracle.com wrote:
On Wed, 2008-12-17 at 11:53 -0800, Andrew Morton wrote:

There are other bits and pieces like high speed front end caching devices that would be difficult in MD/LVM, but since I don't have that coded yet I suppose they don't really count...

Features like the very nice and useful directory-based snapshots would also not be possible with simple block-based multi-devices, right?

At least for btrfs, the snapshotting is independent from the multi-device code, and you still get snapshotting on single device filesystems.

-chris
Re: Notes on support for multiple devices for a single filesystem
Kay Sievers wrote:
> Features like the very nice and useful directory-based snapshots would
> also not be possible with simple block-based multi-devices, right?

Snapshotting via block device has always been an incredibly dumb hack,
existing primarily because filesystem-based snapshots did not exist for
the filesystem in question.

Snapshots are better at the filesystem level because the filesystem is
the only entity that knows when the filesystem is quiescent and
snapshot-able.

ISTR we had to add ->write_super_lockfs() to hack in support for LVM in
this manner, rather than doing it the right way.

Jeff
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 14:24 -0700, Andreas Dilger wrote:
> I can't speak for btrfs, but I don't think multiple device access from
> the filesystem is a layering violation as some people comment. It is
> just a different type of layering. With ZFS there is a distinct layer
> that is handling the allocation, redundancy, and transactions (SPA,
> DMU) that is exporting an object interface, and the filesystem (ZPL,
> or future versions of Lustre) is built on top of that object
> interface.

Clean interfaces aren't really my best talent, but btrfs also layers
this out. Logical-to-physical mappings happen in a centralized function,
and all of the on-disk structures use logical block numbers.

The only exception to that rule is the superblock offsets on the device.

-chris
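[Editor's illustration] The idea of a single, centralized logical-to-physical mapping function can be sketched as follows. This is a toy model under stated assumptions, not the btrfs implementation: `Chunk`, `ChunkMap` and `map_block` are hypothetical names, the mapping is a plain one-to-one extent table, and striping or mirroring is left out.

```python
from bisect import bisect_right
from collections import namedtuple

# Hypothetical extent record: `length` logical units starting at
# `logical` live on `device` starting at offset `physical`.
Chunk = namedtuple("Chunk", "logical length device physical")


class ChunkMap:
    """Toy table translating logical block numbers to (device, physical)."""

    def __init__(self, chunks):
        self.chunks = sorted(chunks, key=lambda c: c.logical)
        self.starts = [c.logical for c in self.chunks]

    def map_block(self, logical):
        # Find the last chunk whose start is <= logical.
        i = bisect_right(self.starts, logical) - 1
        if i < 0:
            raise KeyError(logical)
        chunk = self.chunks[i]
        offset = logical - chunk.logical
        if offset >= chunk.length:
            raise KeyError(logical)  # falls into a hole after the chunk
        return chunk.device, chunk.physical + offset


cmap = ChunkMap([
    Chunk(logical=0, length=1024, device="devA", physical=4096),
    Chunk(logical=1024, length=1024, device="devB", physical=0),
])
first = cmap.map_block(10)
second = cmap.map_block(1500)
```

Because everything above this function deals only in logical block numbers, moving a chunk to another device means updating one table entry rather than rewriting every structure that points into it.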
Re: Notes on support for multiple devices for a single filesystem
On Dec 17, 2008 08:23 -0500, Christoph Hellwig wrote:
> == Prior art ==
>
> * External log and realtime volume
>
> The most common way to specify the external log device and the XFS
> real time device is to have a mount option that contains the path to
> the block special device for it. This variant means a mount option is
> always required, and requires that the device name doesn't change,
> which is enough with udev-generated unique device names
> (/dev/disk/by-{label,uuid}).
>
> An alternative way, supported optionally by ext3 and reiserfs and
> exclusively supported by jfs, is to open the journal device by the
> device number (dev_t) of the block special device. While this doesn't
> require an additional mount option when the device number is stored in
> the filesystem superblock, it relies on the device number being stable,
> which is getting increasingly unlikely in complex storage topologies.

Just as an FYI here - the dev_t stored in the ext3/4 superblock for the
journal device is only a cached device. The journal is properly
identified by its UUID, and should the device mapping change there is a
journal_dev= option that can be used to specify the new device. The one
shortcoming is that there is no mount.ext3 helper which does this
journal UUID->dev mapping and automatically passes journal_dev= if
needed.

> * RAID (MD) and LVM
>
> Recent MD setups and LVM2 thus move the scanning to user space,
> typically using a command iterating over all block device nodes and
> performing the format-specific scanning. While this is more flexible
> than the in-kernel scanning, it scales very badly to a large number of
> block devices, and requires additional user space commands to run
> early in the boot process. A variant of this scheme runs a scanning
> callout from udev once disk devices are detected, which avoids the
> scanning overhead.
My (admittedly somewhat vague) impression is that with large numbers of
devices the udev callout can itself be a huge overhead because this
involves a userspace fork/exec for each new device being added. For the
same number of devices, a single scan from userspace only requires a
single process, and an equal number of device probes.

Added to this, the blkid cache can be used to eliminate the need to do
any scanning if the devices have not changed from the previous boot,
which makes it unclear which mechanism is more efficient. The drawback
is that the initrd device cache is never going to be up-to-date, so it
wouldn't be useful until the root partition is mounted.

We've used blkid for our testing of Lustre-on-DMU with up to 48 (local)
disks w/o any kind of performance issues. We'll eventually be able to
test on systems with around 400 disks in a JBOD configuration, but until
then we only run on systems with hundreds of disks behind a RAID
controller.

> == High-level design considerations ==
>
> Due to requirement b) we need a layer that finds devices for a single
> fstab entry. We can either do this in user space, or in kernel space.
> As we've traditionally always done UUID/LABEL to device mapping in
> userspace, and we already have libvolume_id and libblkid dealing with
> the specialized case of UUID/LABEL to single device mapping, I would
> recommend that we keep doing this in user space and reuse
> libvolume_id/libblkid.
>
> There are two options to perform the assembly of the device list for a
> filesystem:
>
> 1) whenever libvolume_id / libblkid find a device detected as a
>    multi-device capable filesystem it gets added to a list of all
>    devices of this particular filesystem type. At mount time, mount(8)
>    or a mount.fstype helper calls out to the libraries to get a list
>    of devices belonging to this filesystem type and translates them to
>    device names, which can be passed to the kernel on the mount
>    command line.
I would actually suggest that instead of keeping devices in groups by
the filesystem type, we keep a list of devices with the same UUID and/or
LABEL, and if the mount is looking for this UUID/LABEL it gets the whole
list of matching devices back.

This could also be done in the kernel by having the filesystems register
a probe function that examines the devices/partitions as they are added,
similar to the way that MD used to do it. There would likely be very few
probe functions needed, only ext3/4 (for journal devices), btrfs, and
maybe MD, LVM2 and a handful more.

If we wanted to avoid code duplication, this could share code between
libblkid and the kernel (just the enhanced probe-only functions in the
util-linux-ng implementation), since these functions do little more than
take a pointer, cast it to struct X, check some magic fields and return
match + {LABEL, UUID}, or no-match.

That MD used to check only the partition type doesn't mean that we can't
have simple functions that read the superblock (or equivalent) to make
an internal list of suitable devices attached to a filesystem-type
global structure (possibly split into per-fsUUID sublists if it wants).
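[Editor's illustration] The probe-and-group scheme suggested above (check a magic field, return match + {LABEL, UUID}, then collect devices per filesystem UUID) might look roughly like this. Everything here is an assumption made for the sketch: the 40-byte superblock layout, the `EXAMPLFS` magic, and the names `probe`, `group_by_uuid` and `make_sb` are invented and do not correspond to libblkid, libvolume_id or any real on-disk format.

```python
import struct
import uuid
from collections import defaultdict

# Hypothetical superblock layout: 8-byte magic, 16-byte filesystem
# UUID, 16-byte NUL-padded label.
MAGIC = b"EXAMPLFS"
SB_FORMAT = "<8s16s16s"
SB_SIZE = struct.calcsize(SB_FORMAT)


def probe(superblock):
    """Return (fs_uuid, label) on a magic match, else None."""
    if len(superblock) < SB_SIZE:
        return None
    magic, raw_uuid, raw_label = struct.unpack_from(SB_FORMAT, superblock)
    if magic != MAGIC:
        return None
    label = raw_label.rstrip(b"\0").decode("ascii", errors="replace")
    return uuid.UUID(bytes=raw_uuid), label


def group_by_uuid(devices):
    """Map each filesystem UUID to the device names that carry it."""
    groups = defaultdict(list)
    for name, superblock in devices.items():
        match = probe(superblock)
        if match is not None:
            groups[match[0]].append(name)
    return dict(groups)


def make_sb(fs_uuid, label):
    """Build a superblock image in the hypothetical layout above."""
    return struct.pack(SB_FORMAT, MAGIC, fs_uuid.bytes, label.encode("ascii"))


fs_a = uuid.UUID(int=1)
devices = {
    "/dev/sdb": make_sb(fs_a, "data"),
    "/dev/sdc": make_sb(fs_a, "data"),
    "/dev/sdd": b"\0" * SB_SIZE,  # no recognizable filesystem
}
groups = group_by_uuid(devices)
```

A mount helper looking for a given UUID would then receive the whole list for that key (here `/dev/sdb` and `/dev/sdc`) instead of a single device.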
Re: weird bash autocomplete issue
On Wed, Dec 17, 2008 at 15:46, Kay Sievers kay.siev...@vrfy.org wrote:
> On Wed, Dec 17, 2008 at 15:17, Chris Mason chris.ma...@oracle.com wrote:
>> On Wed, 2008-12-17 at 14:59 +0100, Kay Sievers wrote:
>>> On Wed, Dec 17, 2008 at 09:45, Roland devz...@web.de wrote:
>>>> On Tue, 2008-12-16 at 22:41 +0100, Kay Sievers wrote:
>>>>>> open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY|O_CLOEXEC) = 3
>>>>>> fstat64(3, {st_dev=makedev(0, 19), st_ino=256, st_mode=S_IFDIR|0555, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=18, st_atime=2008/12/16-21:32:38, st_mtime=2008/12/16-21:32:37, st_ctime=2008/12/16-21:32:37}) = 0
>>>>>> getdents64(3, {
>>>>>>   {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name="."}
>>>>>>   {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=".."}
>>>>>>   {d_ino=257, d_off=3, d_type=DT_DIR, d_reclen=24, d_name="test"}
>>>>>>   {d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name="linux"}
>>>>>> }, 4096) = 104
>>>>>> _llseek(3, 3, [3], SEEK_SET) = 0
>>>>>> getdents64(3, {
>>>>>>   {d_ino=258, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name="linux"}
>>>>>> }, 4096) = 32
>>>>>
>>>>> On Tue, Dec 16, 2008 at 22:26, devz...@web.de wrote:
>>>>>> i assume it has something to do with the large value for d_off
>>>>>> of the last dirent ?
>>>>>
>>>>> Looks like, 9223372036854775807 is just LLONG_MAX.
>>>>>
>>>>> I can not reproduce that (on openSUSE 11.1). I also don't see the
>>>>> _llseek() calls.
>>>>
>>>> weird. no btrfs issue then !?
>>>>
>>>>> open(".", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
>>>>> fstat(3, {st_dev=makedev(0, 18), ...
>>>>> getdents64(3, {
>>>>>   {d_ino=260, d_off=2, d_type=DT_DIR, d_reclen=24, d_name="."}
>>>>>   {d_ino=256, d_off=2, d_type=DT_DIR, d_reclen=24, d_name=".."}
>>>>>   {d_ino=261, d_off=3, d_type=DT_REG, d_reclen=24, d_name="a"}
>>>>>   {d_ino=262, d_off=4, d_type=DT_REG, d_reclen=24, d_name="b"}
>>>>>   {d_ino=263, d_off=5, d_type=DT_REG, d_reclen=24, d_name="c"}
>>>>>   {d_ino=264, d_off=6, d_type=DT_DIR, d_reclen=24, d_name="test"}
>>>>>   {d_ino=265, d_off=9223372036854775807, d_type=DT_DIR, d_reclen=32, d_name="linux"}
>>>>> }, 4096) = 176
>>>>> getdents64(3, {}, 4096) = 0
>>>>> close(3)
>>>>>
>>>>> This is with today's git kernel and today's standalone btrfs
>>>>> unstable. You are using the distro kernel and compile the
>>>>> standalone btrfs module?
>>>>
>>>> yes.
>>>>
>>>> to be honest, i`m slightly newer than 11.1 (did zypper dup to
>>>> latest factory some days ago)
>>>>
>>>> linux:~ # bash -version
>>>> GNU bash, version 3.2.39(1)-release (i586-suse-linux-gnu)
>>>> Copyright (C) 2007 Free Software Foundation, Inc.
>>>
>>> That is still the same bash, the one you use is a 32bit version. Do
>>> you run a 32 bit kernel too? I could try that on a 32 bit box then.
>>
>> At least on my 32 bit box, tab completion works fine.
>
> It works fine here too on 64 bit. I'll try with openSUSE 11.1 on a
> 32bit box later tonight.
>
>> But, the d_off of LLONG_MAX comes from btrfs_readdir(). Git had a
>> feature where it would loop infinitely over a directory in some
>> cases and this was my workaround.
>
> There are other filesystems doing the same, usually with 32bit int max
> instead of 64 bit int max, I guess that should work fine.
>
>> This should be fixed in git by now, so I can drop it if that really
>> is causing problems in bash.
>
> I'll come back if I can reproduce it with the same environment Roland
> is using.

I see the same issue on x86 32 bit, with the additional __llseek()
between the getdents64(), and the last entry returned by readdir
ignored.

If I change the returned LLONG_MAX to LONG_MAX in inode.c, it all works
fine, and the __llseek() disappears.

Kay
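[Editor's illustration] The numbers in this thread can be checked with a few lines of arithmetic. The sketch below only shows which of the two end-of-directory markers fits into a 32-bit signed integer; that a 32-bit userspace keeps `d_off` in such a field is an assumption made here to explain the observation, not something the thread confirms. `fits_signed` is a hypothetical helper name.

```python
def fits_signed(value, bits):
    """True if `value` is representable as a two's-complement integer
    of the given width."""
    low = -(1 << (bits - 1))
    high = (1 << (bits - 1)) - 1
    return low <= value <= high


LLONG_MAX = (1 << 63) - 1      # the d_off seen in the strace output
LONG_MAX_32BIT = (1 << 31) - 1  # LONG_MAX on a 32-bit x86 build

old_marker_fits = fits_signed(LLONG_MAX, 32)
new_marker_fits = fits_signed(LONG_MAX_32BIT, 32)
```

Under that assumption, the LLONG_MAX marker cannot be stored in a 32-bit offset while the LONG_MAX marker can, which would be consistent with the report that the 64-bit machines were unaffected and that switching the constant made the problem go away on 32-bit.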
Re: Notes on support for multiple devices for a single filesystem
On Wed, 2008-12-17 at 15:04 -0700, Andreas Dilger wrote:
> On Dec 17, 2008 08:23 -0500, Christoph Hellwig wrote:
>> An alternative way, supported optionally by ext3 and reiserfs and
>> exclusively supported by jfs, is to open the journal device by the
>> device number (dev_t) of the block special device. While this
>> doesn't require an additional mount option when the device number is
>> stored in the filesystem superblock, it relies on the device number
>> being stable, which is getting increasingly unlikely in complex
>> storage topologies.
>
> Just as an FYI here - the dev_t stored in the ext3/4 superblock for
> the journal device is only a cached device. The journal is properly
> identified by its UUID, and should the device mapping change there is
> a journal_dev= option that can be used to specify the new device. The
> one shortcoming is that there is no mount.ext3 helper which does this
> journal UUID->dev mapping and automatically passes journal_dev= if
> needed.

An additional FYI. JFS also treats the dev_t in its superblock the same
way. Since jfs relies on jfs_fsck running at boot time to ensure that
the journal is replayed, jfs_fsck makes sure that the dev_t is accurate.
If not, then it scans all of the block devices until it finds the uuid
of the journal device, updating the superblock so that the kernel will
find the journal.

Shaggy
--
David Kleikamp
IBM Linux Technology Center
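[Editor's illustration] The "cached dev_t, authoritative UUID" behaviour described for ext3/4 and JFS can be modelled as below. This is a sketch of the lookup logic only; `resolve_journal` and the `devices` dictionary (device number to the journal UUID found on that device) are hypothetical and do not mirror the jfs_fsck or e2fsprogs code.

```python
def resolve_journal(cached_dev, journal_uuid, devices):
    """Return (dev, rescanned) for the journal identified by journal_uuid.

    `devices` is a hypothetical view of the system: a dict mapping a
    (major, minor) device number to the journal UUID found on it.
    The cached device number is tried first and only trusted if the
    UUID on that device still matches.
    """
    if devices.get(cached_dev) == journal_uuid:
        return cached_dev, False
    # Cache is stale: scan every device for the journal's UUID.
    for dev, found_uuid in devices.items():
        if found_uuid == journal_uuid:
            return dev, True  # caller should update the superblock
    raise LookupError("journal device not found")


devices = {
    (8, 16): "uuid-of-other-journal",
    (8, 32): "uuid-of-our-journal",
}

# The superblock still says (8, 16), but the journal moved to (8, 32).
dev, rescanned = resolve_journal((8, 16), "uuid-of-our-journal", devices)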
Re: weird bash autocomplete issue
On Wed, 2008-12-17 at 23:15 +0100, Kay Sievers wrote:
> There are other filesystems doing the same, usually with 32bit int max
> instead of 64 bit int max, I guess that should work fine.
>
>> This should be fixed in git by now, so I can drop it if that really
>> is causing problems in bash.
>
> I'll come back if I can reproduce it with the same environment Roland
> is using.
>
> I see the same issue on x86 32 bit, with the additional __llseek()
> between the getdents64(), and the last entry returned by readdir
> ignored.
>
> If I change the returned LLONG_MAX to LONG_MAX in inode.c, it all
> works fine, and the __llseek() disappears.

Ok, thanks, I'll work up a patch.

-chris