Re: [RFD] Incremental fsck
On Wed, 09 Jan 2008 07:40:12 +0300, Al Boldi said: But why wouldn't it be possible to do this on the current fs infrastructure, using just a smart fsck, working incrementally on some sub-dir? If you have /home/usera, /home/userb, and /home/userc, the vast majority of fs screw-ups can't be detected by only looking at one sub-dir. For example, you can't tell definitively that all blocks referenced by an inode under /home/usera are properly only allocated to one file until you *also* look at the inodes under user[bc]. Heck, you can't even tell if the link count for a file is correct unless you walk the entire filesystem - you can find a file with a link count of 3 in the inode, and you find one reference under usera, and a second under userb - you can't tell if the count is one too high or not until you walk through userc and actually see (or fail to see) a third directory entry referencing it. pgpVezxvjaftR.pgp Description: PGP signature
[PATCH][RFC] fast file mapping for loop
Hi, loop.c currently uses the page cache interface to do IO to file backed devices. This works reasonably well for simple things, like mapping an iso9660 file for direct mount and other read-only workloads. Writing is somewhat problematic, as anyone who has really used this feature can attest to - it tends to confuse the vm (hello kswapd) since it break dirty accounting and behaves very erratically on writeout. Did I mention that it's pretty slow as well, for both reads and writes? It also behaves differently than a real drive. For writes, completions are done once they hit page cache. Since loop queues bio's async and hands them off to a thread, you can have a huge backlog of stuff to do. It's hard to attempt to guarentee data safety for file systems on top of loop without making it even slower than it currently is. Back when loop was only used for iso9660 mounting and other simple things, this didn't matter. Now it's often used in xen (and others) setups where we do care about performance AND writing. So the below is a attempt at speeding up loop and making it behave like a real device. It's a somewhat quick hack and is still missing one piece to be complete, but I'll throw it out there for people to play with and comment on. So how does it work? Instead of punting IO to a thread and passing it through the page cache, we instead attempt to send the IO directly to the filesystem block that it maps to. loop maintains a prio tree of known extents in the file (populated lazily on demand, as needed). Advantages of this approach: - It's fast, loop will basically work at device speed. - It's fast, loop it doesn't put a huge amount of system load on the system when busy. When I did comparison tests on my notebook with an external drive, running a simple tiobench on the current in-kernel loop with a sparse file backing rendered the notebook basically unusable while the test was ongoing. The remapper version had no more impact than it did when used directly on the external drive. - It behaves like a real block device. - It's easy to support IO barriers, which is needed to ensure safety especially in virtualized setups. Disadvantages: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... - It'll tie down a bit of memory for the prio tree. This is GREATLY offset by the reduced page cache foot print though. - It cannot be used with the loop encryption stuff. dm-crypt should be used instead, on top of loop (which, I think, is even the recommended way to do this today, so not a big deal). This patch will automatically enable the new operation mode (called fastfs). I added an ioctl (LOOP_SET_FASTFS) that should be implemented in losetup, then we can remove this hunk in the code: + /* +* This needs to be done after setup with another ioctl, +* not automatically like this. +*/ + loop_init_fastfs(lo); + from loop_set_fd(). Patch is against 2.6.23-rc7 ('ish, as of this morning), will probably apply easily to 2.6.22 as well. diff --git a/drivers/block/loop.c b/drivers/block/loop.c index 56e2304..e49bfa8 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -481,16 +481,51 @@ static int do_bio_filebacked(struct loop_device *lo, struct bio *bio) return ret; } +#define __lo_throttle(wq, lock, condition) \ +do { \ + DEFINE_WAIT(__wait);\ + for (;;) { \ + prepare_to_wait((wq), __wait, TASK_UNINTERRUPTIBLE); \ + if (condition) \ + break; \ + spin_unlock_irq((lock));\ + io_schedule(); \ + spin_lock_irq((lock)); \ + } \ + finish_wait((wq), __wait); \ +} while (0)\ + +#define lo_act_bio(bio)((bio)-bi_bdev) +#define LO_BIO_THROTTLE128 + /* - * Add bio to back of pending list + * A normal block device will throttle on request allocation. Do the same + * for loop to prevent millions of bio's queued internally. + */ +static void loop_bio_throttle(struct loop_device *lo, struct bio *bio) +{ + if (lo_act_bio(bio)) + __lo_throttle(lo-lo_bio_wait, lo-lo_lock, + lo-lo_bio_cnt LO_BIO_THROTTLE); +} + +/* + * Add bio to back
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED] Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. In addition unprivileged Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not considered important enough to even mention? No. Because in practice they don't seem to matter. Also because there's no way in which fuse could be done differently to address these issues. Could you clarify, please? I hope I'm getting the wrong end of the stick - it sounds to me like you and Pavel are saying that this patch breaks suspending to ram (and hibernating?) but you want to push it anyway because you haven't been able to produce an instance, don't think suspending or hibernating matter and couldn't fix fuse anyway? This patch has nothing to do with suspend or hibernate. What this patchset does, is help get rid of fusermount, a suid-root mount helper. It also opens up new possibilities, which are not fuse related. Fuse has bad interactions with the freezer, theoretically. In practice, I remember just one bug report (that sparked off this whole do we need freezer, or don't we flamefest), that actually got fixed fairly quickly, ...maybe. Rafael probably remembers better. The 'kill -9' thing is basically due to VFS level locking not being interruptible. It could be changed, but I'm not sure it's worth it. For the suspend issue, there are also no easy solutions. What are the non-easy solutions? The ability to freeze tasks in uninterruptible sleep, or more generally at any preempt point (except when drivers are poking hardware). I know this doesn't play well with userspace hibernate, and I don't think it can be resolved without going the kexec way. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
Hi, On Wed, 9 Jan 2008, Nigel Cunningham wrote: On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote: For the suspend issue, there are also no easy solutions. What are the non-easy solutions? A practical point of view I've seen only fuse rootfs mounts to be a problem. I remember Ubuntu patches for this (WUBI and some other distros install NTFS root). But this probably also depends on the used suspend implementation. Personally I've never had fuse related suspend problem with ordinary mounts during heavy use under development, nor NTFS user problem was tracked down to it in the last one and half year. Regards, Szaka -- NTFS-3G: http://ntfs-3g.org - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Incremental fsck
Andi Kleen wrote: Theodore Tso [EMAIL PROTECTED] writes: Now, there are good reasons for doing periodic checks every N mounts and after M months. And it has to do with PC class hardware. (Ted's aphorism: PC class hardware is cr*p). If these reasons are good ones (some skepticism here) then the correct way to really handle this would be to do regular background scrubbing during runtime; ideally with metadata checksums so that you can actually detect all corruption. But since fsck is so slow and disks are so big this whole thing is a ticking time bomb now. e.g. it is not uncommon to require tens of minutes or even hours of fsck time and some server that reboots only every few months will eat that when it happens to reboot. This means you get a quite long downtime. Has there been some thought about an incremental fsck? While an _incremental_ fsck isn't so easy for existing filesystem types, what is pretty easy to automate is making a read-only snapshot of a filesystem via LVM/DM and then running e2fsck against that. The kernel and filesystem have hooks to flush the changes from cache and make the on-disk state consistent. You can then set the the ext[234] superblock mount count and last check time via tune2fs if all is well, or schedule an outage if there are inconsistencies found. There is a copy of this script at: http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html Note that it might need some tweaks to run with DM/LVM2 commands/output, but is mostly what is needed. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
Hi. Miklos Szeredi wrote: On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED] Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. In addition unprivileged Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not considered important enough to even mention? No. Because in practice they don't seem to matter. Also because there's no way in which fuse could be done differently to address these issues. Could you clarify, please? I hope I'm getting the wrong end of the stick - it sounds to me like you and Pavel are saying that this patch breaks suspending to ram (and hibernating?) but you want to push it anyway because you haven't been able to produce an instance, don't think suspending or hibernating matter and couldn't fix fuse anyway? This patch has nothing to do with suspend or hibernate. What this patchset does, is help get rid of fusermount, a suid-root mount helper. It also opens up new possibilities, which are not fuse related. That's what I thought. So what was Pavel talking about with kill -9 no longer works and suspend no longer works above? I couldn't understand it from the context. Fuse has bad interactions with the freezer, theoretically. In practice, I remember just one bug report (that sparked off this whole do we need freezer, or don't we flamefest), that actually got fixed fairly quickly, ...maybe. Rafael probably remembers better. I think they just gave up and considered it unsolvable. I'm not sure it is. The 'kill -9' thing is basically due to VFS level locking not being interruptible. It could be changed, but I'm not sure it's worth it. For the suspend issue, there are also no easy solutions. What are the non-easy solutions? The ability to freeze tasks in uninterruptible sleep, or more generally at any preempt point (except when drivers are poking hardware). Couldn't some sort of scheduler based solution deal with the uninterruptible sleeping case? I know this doesn't play well with userspace hibernate, and I don't think it can be resolved without going the kexec way. I can see the desirability of kexec when it comes to avoiding the freezer, but comes with its own problems too - having the original context usable is handy, not having to set aside a large amount of space for a second kernel is also desirable and there are still greater issues of transferring information backwards and forwards between the two kernels. Regards, Nigel - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
'updatedb no longer works' is not a problem? I haven't seen any problems with updatedb, and haven't had any bug reports about it either. Ok, I don't know much about FUSE. In current version, if user creates infinite maze and mounts it under ~, updatedb just does not enter it? It doesn't. See Documentation/filesystems/fuse.txt AFAIR there were two security vulnerabilities in fuse's history, one of them an information leak in the kernel module, and the other one an mtab corruption issue in the fusermount utility. I don't think this is such a bad track record. Not bad indeed. But I'd consider 'kill -9 not working' to be DoS vulnerability... The worst that can happen is that a sysadmin doesn't read the docs (likely) before enabling fuse on a multiuser system, and is surprised by a user doing funny things. And _then_ has to go read the docs, or google for some info. This is basically how things normally work, and I don't consider it a DoS. and I'm woried about problems fuse + user mounts expose in other parts of system. I'm worried too, and I'm not saying that enabling unprivileged fuse mounts is completely risk free. Nothing is, and nobody is forced to do it. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
On Wed, Jan 09 2008, Christoph Hellwig wrote: On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... And the way this is done is simply broken. It means you have to get rid of things like delayed or unwritten hands beforehand, it'll be a complete pain for COW or non-block backed filesystems. COW is not that hard to handle, you just need to be notified of moving blocks. If you view the patch as just a tighter integration between loop and fs, I don't think it's necessarily that broken. I did consider these cases, and it can be done with the existing approach. The right way to do this is to allow direct I/O from kernel sources where the filesystem is in-charge of submitting the actual I/O after the pages are handed to it. I think Peter Zijlstra has been looking into something like that for swap over nfs. That does sound like a nice approach, but a lot more work. It'll behave differently too, the advantage of what I proposed is that it behaves like a real device. I'm not asking you to love it (in fact I knew some people would complain about this approach and I understand why), just tossing it out there to get things rolling. If we end up doing it differently I don't really care, I'm not married to any solution but merely wish to solve a problem. If that ends up being solved differently, the outcome is the same to me. -- Jens Axboe - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... And the way this is done is simply broken. It means you have to get rid of things like delayed or unwritten hands beforehand, it'll be a complete pain for COW or non-block backed filesystems. The right way to do this is to allow direct I/O from kernel sources where the filesystem is in-charge of submitting the actual I/O after the pages are handed to it. I think Peter Zijlstra has been looking into something like that for swap over nfs. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 6/9] unprivileged mounts: allow unprivileged mounts
On Tue, Jan 08, 2008 at 12:35:08PM +0100, Miklos Szeredi wrote: Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. For safe filesystems also allow unprivileged forced unmounting. What about to list safe filesystems anywhere in /proc/fs/ ? I think it's very important information for admins. Note, your patch for mount(8) is always trying to use unprivileged mount(2) for non-root users. It's overkill when unprivileged mounts are supported for bind mounts and fuse only. It would be nice to check if FS is safe before switch to unprivileged mode. The safe definition is also very subjective and it depends on your level of paranoia. There should be a way (e.g. /proc) how control and modify the list of safe filesystems. For example I have no problem to mark cifs as safe for my home server. Karel -- Karel Zak [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
On Wed 2008-01-09 09:47:31, Miklos Szeredi wrote: On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote: From: Miklos Szeredi [EMAIL PROTECTED] Use FS_SAFE for fuse fs type, but not for fuseblk. FUSE was designed from the beginning to be safe for unprivileged users. This has also been verified in practice over many years. In addition unprivileged Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not considered important enough to even mention? No. Because in practice they don't seem to matter. Also because there's no way in which fuse could be done differently to address these issues. Could you clarify, please? I hope I'm getting the wrong end of the stick - it sounds to me like you and Pavel are saying that this patch breaks suspending to ram (and hibernating?) but you want to push it anyway because you haven't been able to produce an instance, don't think suspending or hibernating matter and couldn't fix fuse anyway? This patch has nothing to do with suspend or hibernate. What this patchset does, is help get rid of fusermount, a suid-root mount helper. It also opens up new possibilities, which are not fuse related. Fuse has bad interactions with the freezer, theoretically. In practice, I remember just one bug report (that sparked off this whole do we need freezer, or don't we flamefest), that actually got fixed fairly quickly, ...maybe. Rafael probably remembers better. In practice, if the unpriviledged fuse gets enabled, any user can prevent suspend/hibernation from working. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
On Wed, 9 Jan 2008 10:43:21 +0100 Jens Axboe [EMAIL PROTECTED] wrote: On Wed, Jan 09 2008, Christoph Hellwig wrote: On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote: - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... And the way this is done is simply broken. It means you have to get rid of things like delayed or unwritten hands beforehand, it'll be a complete pain for COW or non-block backed filesystems. COW is not that hard to handle, you just need to be notified of moving blocks. If you view the patch as just a tighter integration between loop and fs, I don't think it's necessarily that broken. Filling holes (delayed allocation) and COW are definitely a problem. But at least for the loop use case, most non-cow filesystems will want to preallocate the space for loop file and be done with it. Sparse loop definitely has uses, but generally those users are willing to pay a little performance. Jens' patch falls back to buffered writes for the hole case and pretends cow doesn't exist. It's a good starting point that I hope to extend with something like the extent_map apis. I did consider these cases, and it can be done with the existing approach. The right way to do this is to allow direct I/O from kernel sources where the filesystem is in-charge of submitting the actual I/O after the pages are handed to it. I think Peter Zijlstra has been looking into something like that for swap over nfs. That does sound like a nice approach, but a lot more work. It'll behave differently too, the advantage of what I proposed is that it behaves like a real device. The problem with O_DIRECT (or even O_SYNC) loop is that every write into loop becomes synchronous, and it really changes the performance of things like filemap_fdatawrite. If we just hand ownership of the file over to loop entirely and prevent other openers (perhaps even forcing backups through the loop device), we get fewer corner cases and much better performance. -chris - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
Hi! AFAIR there were two security vulnerabilities in fuse's history, one of them an information leak in the kernel module, and the other one an mtab corruption issue in the fusermount utility. I don't think this is such a bad track record. Not bad indeed. But I'd consider 'kill -9 not working' to be DoS vulnerability... The worst that can happen is that a sysadmin doesn't read the docs (likely) before enabling fuse on a multiuser system, and is surprised by a user doing funny things. And _then_ has to go read the docs, or google for some info. This is basically how things normally work, and I don't consider it a DoS. No, this is not normal. Kill -9 has been estabilished long time ago, and we should not be documenting its now-brokenness in Documentation/filesystems/fuse.txt . For example, my /etc/inittab currently has: kb::kbrequest:/etc/rc/rc.reboot 2 0 # # This file handles system shutdown and reboot. # PATH=/sbin:/bin:/usr/sbin:/usr/bin sync # Kill all processes. wall System is going down NOW\! echo -n -e \rSystem is going down: processes. killall5 -15 echo -n . sleep 1 echo -n . killall5 -9 # Before unmounting file systems write a reboot record to wtmp. echo -n wtmp halt -w # Swap needs to be unmounted because otherwise busy filesystems remain. echo -n swap swapoff -a swapoff /c/swap swapoff /c/swap2 # Unmount file systems echo -n umount. umount -a || ( sync echo -n umount-retry. sleep 1 umount -a || sulogin ) echo -n . mount -n -o remount,ro / # Now halt or reboot. if [ $2 = 0 ] ; then swapoff -a echo halted. halt -p -f else echo rebooting... reboot -d -f fi ...this will break with FUSE enabled, right? (Minor security hole by allowing users to stop c-a-delete, where none existed before?) I'm currently suspending by 'echo mem /sys/power/state'. How should I do that _safely_ with FUSE enabled? If I want to get rid of nasty user in multiuser system, I do su nastyuser 'kill -9 -1' . How do I do the equivalent with FUSE enabled? (Without affecting other users?) Load average was never really meaningful number, but with FUSE enabled, users can set it to 666 without actually eating any CPU. SIGSTOP used to work, allowing you to prevent user processes from working while you examine them. Now SIGSTOP can be delayed for arbitrary time. Heck, imagine malicious user process misbehaves. Before FUSE, you could at least attach it with gdb to look what it is doing. Now you can't. I really believe FUSE vs. signals needs fixing. Either that, or updating all the manpages man 1 kill: - KILL 9 exit this signal may not be blocked + KILL 9 exit this signal may not be blocked, except by FUSE user mount Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Incremental fsck
Valerie Henson wrote: On Jan 8, 2008 8:40 PM, Al Boldi [EMAIL PROTECTED] wrote: Rik van Riel wrote: Al Boldi [EMAIL PROTECTED] wrote: Has there been some thought about an incremental fsck? You know, somehow fencing a sub-dir to do an online fsck? Search for chunkfs Sure, and there is TileFS too. But why wouldn't it be possible to do this on the current fs infrastructure, using just a smart fsck, working incrementally on some sub-dir? Several data structures are file system wide and require finding every allocated file and block to check that they are correct. In particular, block and inode bitmaps can't be checked per subdirectory. Ok, but let's look at this a bit more opportunistic / optimistic. Even after a black-out shutdown, the corruption is pretty minimal, using ext3fs at least. So let's take advantage of this fact and do an optimistic fsck, to assure integrity per-dir, and assume no external corruption. Then we release this checked dir to the wild (optionally ro), and check the next. Once we find external inconsistencies we either fix it unconditionally, based on some preconfigured actions, or present the user with options. All this could be per-dir or using some form of on-the-fly file-block-zoning. And there probably is a lot more to it, but it should conceptually be possible, with more thoughts though... http://infohost.nmt.edu/~val/review/chunkfs.pdf Thanks! -- Al - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 6/9] unprivileged mounts: allow unprivileged mounts
On Tue, Jan 08, 2008 at 12:35:08PM +0100, Miklos Szeredi wrote: Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of this filesystem may not constitute a security problem. Since most filesystems haven't been designed with unprivileged mounting in mind, a thorough audit is needed before setting this flag. For safe filesystems also allow unprivileged forced unmounting. What about to list safe filesystems anywhere in /proc/fs/ ? I think it's very important information for admins. Makes sense. I'll cook up something. Note, your patch for mount(8) is always trying to use unprivileged mount(2) for non-root users. It's overkill when unprivileged mounts are supported for bind mounts and fuse only. It would be nice to check if FS is safe before switch to unprivileged mode. I think the little gain in performance is not worth the added complexity. Especially if the added complexity is in the privileged part, and itself can be a source of security holes. The safe definition is also very subjective and it depends on your level of paranoia. There should be a way (e.g. /proc) how control and modify the list of safe filesystems. For example I have no problem to mark cifs as safe for my home server. OK, also makes some sense. Pavel's examples do point out that fuse isn't as safe as I'd like it to be, so perhaps it would make sense to default to just bind mounts being allowed, and having to explicity enable unprivileged fuse mounts with a sysctl or whatever. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts
On Jan 8 2008 20:08, Miklos Szeredi wrote: On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote: +static int reserve_user_mount(void) +{ + int err = 0; + + spin_lock(vfsmount_lock); + if (nr_user_mounts = max_user_mounts !capable(CAP_SYS_ADMIN)) + err = -EPERM; + else + nr_user_mounts++; + spin_unlock(vfsmount_lock); + return err; +} Would -ENOSPC or -ENOMEM be a more descriptive error here? The logic behind EPERM, is that this failure is only for unprivileged callers. ENOMEM is too specifically about OOM. It could be changed to ENOSPC, ENFILE, EMFILE, or it could remain EPERM. What do others think? ENOSPC: No space remaining on device = 'wth'. ENOMEM: I usually think of a userspace OOM (e.g. malloc'ed out all of your 32-bit address space on 32-bit processes) EMFILE: Too many open files ENFILE: Too many open files in system. ENFILE seems like a temporary winner among these four. Back in the old days, when the number of mounts was limited in Linux, what error value did it return? That one could be used. - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts
On Wed, Jan 09, 2008 at 01:45:09PM +0100, Jan Engelhardt wrote: On Jan 8 2008 20:08, Miklos Szeredi wrote: On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote: +static int reserve_user_mount(void) +{ + int err = 0; + + spin_lock(vfsmount_lock); + if (nr_user_mounts = max_user_mounts !capable(CAP_SYS_ADMIN)) + err = -EPERM; + else + nr_user_mounts++; + spin_unlock(vfsmount_lock); + return err; +} Would -ENOSPC or -ENOMEM be a more descriptive error here? The logic behind EPERM, is that this failure is only for unprivileged callers. ENOMEM is too specifically about OOM. It could be changed to ENOSPC, ENFILE, EMFILE, or it could remain EPERM. What do others think? ENOSPC: No space remaining on device = 'wth'. ENOMEM: I usually think of a userspace OOM (e.g. malloc'ed out all of your 32-bit address space on 32-bit processes) EMFILE: Too many open files ENFILE: Too many open files in system. ENFILE seems like a temporary winner among these four. I see EMFILE, it's still supported by the latest mount(8). Back in the old days, when the number of mounts was limited in Linux, what error value did it return? That one could be used. Copy past from mount-0.99.2: /* Mount failed, complain, but don't die. */ switch (mnt_err) { case EPERM: if (geteuid() == 0) error (mount: mount point %s is not a directory, node); else error (mount: must be superuser to use mount); break; case EBUSY: error (mount: wrong fs type, %s already mounted, %s busy, or other error, spec, node); break; case ENOENT: error (mount: mount point %s does not exist, node); break; case ENOTDIR: error (mount: mount point %s is not a directory, node); break; case EINVAL: error (mount: %s not a mount point, spec); break; case EMFILE: error (mount table full); break; case EIO: error (mount: %s: can't read superblock, spec); break; case ENODEV: error (mount: fs type %s not supported by kernel, type); break; case ENOTBLK: error (mount: %s is not a block device, spec); break; case ENXIO: error (mount: %s is not a valid block device, spec); break; case EACCES: error (mount: block device %s is not permitted on its filesystem, spec); break; default: error (mount: %s, strerror (mnt_err)); break; } Karel -- Karel Zak [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts
case EMFILE: error (mount table full); break; OK, we could go with EMFILE, but the message should be changed to something like maximum unprivileged mount count exceeded. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
I'm not saying fuse is worthless. It is a nice toy for single-user systems. But I do not think we should be merging allow ordinary users to mount their own fuse's before issues above are fixed. I think multi user systems are not all that interesting. And I suspect very few of them want reliably working suspend/hibernate (which they wouldn't get due to other issues anyway), or have weird shutdown scripts which stop when they are unable to umount filesystems. For paranoid sysadmins, I suggest not enabling fuse for unprivileged users, which is pretty easy to do: just don't set /dev/fuse to be world read-writable (which is the default BTW). So your reasons just don't warrant a big effort involving VFS hacking, etc. Patches are of course welcome. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
Hi! ...this will break with FUSE enabled, right? (Minor security hole by allowing users to stop c-a-delete, where none existed before?) Yup (or I don't know, I'm sure there was or is some problem with ptrace, that could be used to create unkillable processes). Fuse could actually be fixed to exit reliably for 'killall5 -9' (it used to), but that has other problems, and it doesn't seem very important to me. But this can be discussed. I think it is better to fix fuse than to rewrite all the shutdown scripts. What cannot be fixed is if one process is inside an fs operation (e.g. unlink), holding a VFS lock (i_mutex) and another process goes to uninterruptible sleep on that lock. There's no way (other than rewriting the VFS) in which that second process could be killed unless you kill the first one or the fuse server. I believe VFS should be rewritten here. Perhaps new TASK_KILLABLE state can help? I really believe FUSE vs. signals needs fixing. Either that, or updating all the manpages man 1 kill: - KILL 9 exit this signal may not be blocked + KILL 9 exit this signal may not be blocked, except by FUSE user mount Heh, there are all very interesting, but most of these issues are not even on my todo list (which has grown into quite a big pile over the years), which means, that they don't seem to matter to people in practice. You seem to be implying that fuse is worthless if these issues are not fixed, but that is very far from the truth, I think. I'm not saying fuse is worthless. It is a nice toy for single-user systems. But I do not think we should be merging allow ordinary users to mount their own fuse's before issues above are fixed. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts
I'm not saying fuse is worthless. It is a nice toy for single-user systems. But I do not think we should be merging allow ordinary users to mount their own fuse's before issues above are fixed. I think multi user systems are not all that interesting. And I suspect very few of them want reliably working suspend/hibernate (which they wouldn't get due to other issues anyway), or have weird shutdown scripts which stop when they are unable to umount filesystems. Weird shutdown scripts? I believe all shutdown scripts have this issue -- if you want to [cleanly] unmount your / filesystem, you need all the opens for write closed, right...? Which self-deadlocked fused holding files open will prevent. For paranoid sysadmins, I suggest not enabling fuse for unprivileged users, which is pretty easy to do: just don't set /dev/fuse to be world read-writable (which is the default BTW). So your reasons just don't warrant a big effort involving VFS hacking, etc. Patches are of course welcome. Well, I believe code with obscure, but almost impossible to fix problems should not be merged... That code _has_ been merged, something like 3 years ago, and is doing fine, thank you. The unprivileged mounts code, which we should be discussing, doesn't change anything about that, except to not require another suid-root utility. Many distributions enabling unprivileged mounting by default _now_, so it's not as if there's some great danger in doing this slightly differently. Anyway, I believe it would be fair to mention kill -9 no longer working and shutdown/hibernation/multiuser problems it implies in the changelogs and probably sysctl documentation or how is this enabled. Sure. Miklos - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFD] Incremental fsck
On Wed, 9 Jan 2008 14:52:14 +0300 Al Boldi [EMAIL PROTECTED] wrote: Ok, but let's look at this a bit more opportunistic / optimistic. You can't play fast and loose with data integrity. Besides, if we looked at things optimistically, we would conclude that no fsck will be needed, ever :) http://infohost.nmt.edu/~val/review/chunkfs.pdf You will really want to read this paper, if you haven't already. -- All Rights Reversed - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello, On Wed, January 9, 2008 05:39, Tetsuo Handa wrote: Hello. Indan Zupancic wrote: I think you focus too much on your way of enforcing filename/attributes pairs. So? So that you miss alternatives and don't see the bigger picture. The same can be achieved by creating the device nodes with expected attributes, and preventing processes from changing those files. The device nodes have to be deletable if some process (including udev) needs to delete. Thus, you cannot unconditionally prevent processes from changing those files. This because expected combinations are known beforehand. Yes. And once those files are present, the MAC system used doesn't have to have special device nodes attributes support. Protecting those files is enough to guarantee filename/attributes pairs. If MAC system needn't to support this filesystem's functionality, who creates those files with warrantee of expected attributes? The udev does? If udev is exploited, who can guarantee? The person that would write the config file for your fs, the one who wants that guarantee. No, this is because rename permission was given for files that it shouldn't had. Do you think all MAC implementation have the same granularity and functionalities? I don't think so. Not all MAC implementation can control with such granularity. This filesystem is designed to be combined with any MAC, although the MAC used with this filesystem should be able to restrict namespace manipulation requests so that this filesystem can remain /dev and visible to userland applications. Good point, but I assume they all have at least a directory granularity, and then /dev/ can be static and udev and other can have free reign in e.g. /dev/dynamic/. Just use subdirs for the dynamic stuff and this granularity problem is, with slight inconvenience, solved. Either you want a process to manage device names and attributes, and then you give it permission to do that, or you want to enforce certain filename/attribute pairs and then you just do it yourself. If I modify udev to enforce certain filename/attribute pairs and the modified udev was exploited, who can guarantee? Don't trust userland application is the basis of restricting access in kernel space. If you can trust userland application, you don't need in-kernel access control. Funny, I thought that it was in the kernel because that's the way to protect processes against eachother, the fs against processes, and for performance reasons. Exploits are in code, and where that code is doesn't matter that much, either kernel or userspace, though if it's exploitable you'll rather not have it in the kernel. So I think it's more secure if the checking would be done by udev than in a special filesystem, even if that means that you're screwed if udev is exploited. Of course you fully trust your own code, naturally. A tiny daemon that communicates with udev and does the checking you have now, and if ok it creates the node is really not much more code than your fs, so as hard to exploit too. Then if udev is hacked you have the same guarantee as you have now. I can think of more alternatives that are as secure or more secure than the current solution. Will your filesystem prevent the trivial case of rm /dev/hda1 ln -s /dev/hda2 /dev/hda1 Of course. To permit the above operation, the following permissions are needed. hda1660 0 6 2 b 3 1 hda1777 0 0 33 l . Yes, I should've read the code before asking that, instead of the other way round. Rename permission can be given for /dev in general, but prohibited for certain files in /dev, the ones you want to have specific attributes. It isn't all or nothing. Do you think all MAC implementation can prohibit renaming for certain files in /dev ? It's forbid modifying certain nodes that process needn't to modify versus forbid breaking filename/attribute pairs of certain nodes. Both have the same effect, except that the first one is generic and can be done by existing MAC systems, while the second one needs a special filesystem and a handful of MAC rules to make it effective. Do you think all MAC implementation can do? I think the first one is implementation specific and the second one is generic. Protecting certain files from being modified seems to me more generic than enforcing filename/attributes pairs on device nodes. And if they can't do it surely they can do it per directory, and the using subdirs solves it. It doesn't matter where they are, it's that a different fs than yours could be mounted over it. You say a MAC can prevent that from happening, but a MAC can also prevent all processes except for udev from modifying /dev. But MAC cannot prevent udev from modifying /dev . And what if exploited? Not all MAC can enforce access control over all processes with the granularity you are talking. And what if a process that cannot be
Re: [PATCH][RFC] fast file mapping for loop
Jens Axboe [EMAIL PROTECTED] writes: So how does it work? Instead of punting IO to a thread and passing it through the page cache, we instead attempt to send the IO directly to the Great -- something like this was needed for a long time. - The file block mappings must not change while loop is using the file. This means that we have to ensure exclusive access to the file and this is the bit that is currently missing in the implementation. It would be nice if we could just do this via open(), ideas welcome... get_write_access()/put_write_access() will block other writers. But as pointed out by others that is not enough for this. I suppose you could use a white list like a special flag for file systems (like ext2/ext3) that do not reallocate blocks. -Andi - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
Here's the latest version of dm-loop, for comparison. To try it out, ln -s dmsetup dmlosetup and supply similar basic parameters to losetup. (using dmsetup version 1.02.11 or higher) Alasdair From: Bryn Reeves [EMAIL PROTECTED] This implements a loopback target for device mapper allowing a regular file to be treated as a block device. Signed-off-by: Bryn Reeves [EMAIL PROTECTED] Signed-off-by: Alasdair G Kergon [EMAIL PROTECTED] --- drivers/md/Kconfig |9 drivers/md/Makefile |1 drivers/md/dm-loop.c | 1036 +++ 3 files changed, 1046 insertions(+) Index: linux-2.6.24-rc6/drivers/md/Kconfig === --- linux-2.6.24-rc6.orig/drivers/md/Kconfig2008-01-07 18:32:05.0 + +++ linux-2.6.24-rc6/drivers/md/Kconfig 2008-01-07 18:33:12.0 + @@ -229,6 +229,15 @@ config DM_CRYPT If unsure, say N. +config DM_LOOP + tristate Loop target (EXPERIMENTAL) + depends on BLK_DEV_DM EXPERIMENTAL + ---help--- + This device-mapper target allows you to treat a regular file as + a block device. + + If unsure, say N. + config DM_SNAPSHOT tristate Snapshot target (EXPERIMENTAL) depends on BLK_DEV_DM EXPERIMENTAL Index: linux-2.6.24-rc6/drivers/md/Makefile === --- linux-2.6.24-rc6.orig/drivers/md/Makefile 2008-01-07 18:32:05.0 + +++ linux-2.6.24-rc6/drivers/md/Makefile2008-01-07 18:33:12.0 + @@ -34,6 +34,7 @@ obj-$(CONFIG_BLK_DEV_MD) += md-mod.o obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o obj-$(CONFIG_DM_CRYPT) += dm-crypt.o obj-$(CONFIG_DM_DELAY) += dm-delay.o +obj-$(CONFIG_DM_LOOP) += dm-loop.o obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o obj-$(CONFIG_DM_MULTIPATH_EMC) += dm-emc.o obj-$(CONFIG_DM_MULTIPATH_HP) += dm-hp-sw.o Index: linux-2.6.24-rc6/drivers/md/dm-loop.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ linux-2.6.24-rc6/drivers/md/dm-loop.c 2008-01-07 18:33:12.0 + @@ -0,0 +1,1036 @@ +/* + * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved. + * + * This file is part of device-mapper. + * + * Extent mapping implementation heavily influenced by mm/swapfile.c + * Bryn Reeves [EMAIL PROTECTED] + * + * File mapping and block lookup algorithms support by + * Heinz Mauelshagen [EMAIL PROTECTED]. + * + * This file is released under the GPL. + */ + +#include linux/kernel.h +#include linux/slab.h +#include linux/fs.h +#include linux/module.h +#include linux/vmalloc.h +#include linux/syscalls.h +#include linux/workqueue.h +#include linux/file.h +#include linux/bio.h + +#include dm.h +#include dm-bio-list.h + +#define DM_LOOP_DAEMON kloopd +#define DM_MSG_PREFIX loop + +enum flags { DM_LOOP_BMAP, DM_LOOP_FSIO }; + +/* + * Loop context + **/ + +struct loop_c { + unsigned long flags; + + /* Backing store */ + + struct file *filp; + char *path; + loff_t offset; + struct block_device *bdev; + unsigned blkbits; /* file system block size shift bits */ + + loff_t size;/* size of entire file in bytes */ + loff_t blocks; /* blocks allocated to loop file */ + sector_t mapped_sectors;/* size of mapped area in sectors */ + + int (*map_fn)(struct dm_target *, struct bio *); + void *map_data; +}; + +/* + * Block map extent + */ +struct dm_loop_extent { + sector_t start; /* start sector in mapped device */ + sector_t to;/* start sector on target device */ + sector_t len; /* length in sectors */ +}; + +/* + * Temporary extent list + */ +struct extent_list { + struct dm_loop_extent *extent; + struct list_head list; +}; + +static struct kmem_cache *dm_loop_extent_cache; + +/* + * Block map private context + */ +struct block_map_c { + int nr_extents; /* number of extents in map */ + struct dm_loop_extent **map;/* linear map of extent pointers */ + struct dm_loop_extent **mru;/* pointer to mru entry */ + spinlock_t mru_lock;/* protects mru */ +}; + +/* + * File map private context + */ +struct file_map_c { + spinlock_t lock;/* protects in */ + struct bio_list in; /* new bios for processing */ + struct bio_list work; /* bios queued for processing */ + struct workqueue_struct *wq;/* workqueue */ + struct work_struct ws; /* loop work */ + struct loop_c *loop;/* for filp
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Quoting Indan Zupancic ([EMAIL PROTECTED]): Hello, On Wed, January 9, 2008 05:39, Tetsuo Handa wrote: Hello. Indan Zupancic wrote: I think you focus too much on your way of enforcing filename/attributes pairs. So? So that you miss alternatives and don't see the bigger picture. These emails again are getting really long, but I think the gist of Indan's suggestion can be concisely summarized: To confine process P3 to /dev/hda2 being 'b 3 2', create /dev/p3, launch P3 in a new mounts namespace, mount --bind /dev/p3 /dev, exec what you want p3 running, and have MAC prevent umount /dev/p3. This is a neat idea, but Tetsuo's rebutall is P3 may be legacy code needing to create or delete /dev/floppy, where -EPERM confuses P3 and prevents it working correctly. Indan's idea is interesting and I like it, but is there an answer to Tetsuo's problem with it? thanks, -serge PS - Indan, you also said in essence if P3 can be trusted to create /dev/floppy why can't it be trusted to create /dev/hda1. I trust that, phrased that way, the question answers itself? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
On Thu, Jan 10, 2008 at 12:43:19AM +0100, [EMAIL PROTECTED] wrote: oh, nice to see that this is still alive. at least i got no crashes and was able to mount and acess more than 300 iso-images with that. were there fixes/chances since then? Little has changed for some time - mostly code cleanups and the occasional bug fix. It's time to give it wider exposure, I think, and we'll find out how well it holds up. Alasdair -- [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
On Thu, January 10, 2008 00:08, Serge E. Hallyn wrote: These emails again are getting really long, but I think the gist of Indan's suggestion can be concisely summarized: No worry, I wasn't planning on extending it, I've said what I've to say. Except... To confine process P3 to /dev/hda2 being 'b 3 2', create /dev/p3, launch P3 in a new mounts namespace, mount --bind /dev/p3 /dev, exec what you want p3 running, and have MAC prevent umount /dev/p3. This is a neat idea, but Tetsuo's rebutall is P3 may be legacy code needing to create or delete /dev/floppy, where -EPERM confuses P3 and prevents it working correctly. Indan's idea is interesting and I like it, but is there an answer to Tetsuo's problem with it? ...that I didn't mean that, but a more simple /dev/ directory protected from any modifications by MAC, /dev/* all the nodes that need to have guaranteed name/attribute pairs, like /dev/null, /dev/zero, /dev/random, etc. and: /dev/dynamic/ being a dir where apps who really need to create/modify device nodes can do whatever they want to do. It can be multiple dirs too, like /dev/snd/, /dev/input/ etc. I guess this covers about 96% of the usecases of this tamper-proof dev fs. You can think of unlikely cases that aren't solved by this, but those can be solved in another way if really wanted (like a checking daemon, modified udev, shadow /dev/, to name a few). But I think doing more is getting ridiculous, because if a process can create a device node, it can also access it and do whatever harm could be done by the confusion caused by unexpected name/attribute pairs. As for information snooping, that's mostly about /dev/null or other things that are known beforehand. PS - Indan, you also said in essence if P3 can be trusted to create /dev/floppy why can't it be trusted to create /dev/hda1. I trust that, phrased that way, the question answers itself? Not exactly. If there's a process that dynamically created certain device nodes, and it wants to create one that doesn't fit the rules, you can't know if it's wrong or if your rules are wrong. The process has a certain policy of naming/creating the devices, but you also have a policy at the kernel side with this fs. If it mismatches you don't know which one is right. If you trust a process to create /dev/hd*, you can also trust it to create the proper /dev/hdXn, no need to verify if /dev/hda1 is really 3 1. The whole thing about filename/attribute pairs is that it's about what applications expect. There aren't many expectations about dynamically created device nodes which might not always be there, because their name isn't stable. The use case for this fs is a malicious app that can create device nodes, and we're worried about mismatching name/attribute pairs. Not about our data, or anything else. Call me an optimist, but I think you don't need to worry about name/attribute pairs. Greetings, Indan - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH][RFC] fast file mapping for loop
On Wednesday 09 January 2008 19:52, Jens Axboe wrote: So how does it work? Instead of punting IO to a thread and passing it through the page cache, we instead attempt to send the IO directly to the filesystem block that it maps to. You told Christoph that just using direct-IO from kernel still doesn't give you the required behaviour... What about queueing the IO directly *and* using direct-IO? I guess it still has to go through the underlying filesystem, but that's probably a good thing. loop maintains a prio tree of known extents in the file (populated lazily on demand, as needed). Just a quick question (I haven't looked closely at the code): how come you are using a prio tree for extents? I don't think they could be overlapping? - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL -mm] 0/4 Unionfs updates/fixes/cleanups
The following is a series of patchsets related to Unionfs. This is the fourth set of patchsets resulting from an lkml review of the entire unionfs code base, in preparation for a merge into mainline. The most significant changes here are a few locking/race bugfix related to branch-management. These patches were tested (where appropriate) on Linus's 2.6.24 latest code (as of v2.6.24-rc7-71-gfd0b45d), MM, as well as the backports to 2.6.{23,22,21,20,19,18,9} on ext2/3/4, xfs, reiserfs, nfs2/3/4, jffs2, ramfs, tmpfs, cramfs, and squashfs (where available). Also tested with LTP-full and with a continuous parallel kernel compile (while forcing cache flushing, manipulating lower branches, etc.). See http://unionfs.filesystems.org/ to download back-ported unionfs code. Please pull from the 'master' branch of git://git.kernel.org/pub/scm/linux/kernel/git/ezk/unionfs.git to receive the following: Erez Zadok (4): Unionfs: merged several printk KERN_CONT together into one pr_debug Unionfs: mmap fixes Unionfs: branch-management related locking fixes Unionfs: ensure we have lower dentries in d_iput commonfops.c |6 ++ debug.c | 51 +-- dentry.c |9 +++-- inode.c | 17 + mmap.c | 26 +- 5 files changed, 76 insertions(+), 33 deletions(-) --- Erez Zadok [EMAIL PROTECTED] - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/4] Unionfs: branch-management related locking fixes
Add necessary locking to dentry/inode branch-configuration, so we get consistent values during branch-management actions. In d_revalidate_chain, -permission, and -create, also lock parent dentry. Signed-off-by: Erez Zadok [EMAIL PROTECTED] --- fs/unionfs/commonfops.c |6 ++ fs/unionfs/dentry.c |6 +- fs/unionfs/inode.c | 17 + 3 files changed, 28 insertions(+), 1 deletions(-) diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c index 2c32ada..f37192f 100644 --- a/fs/unionfs/commonfops.c +++ b/fs/unionfs/commonfops.c @@ -318,6 +318,7 @@ int unionfs_file_revalidate(struct file *file, bool willwrite) * First revalidate the dentry inside struct file, * but not unhashed dentries. */ +reval_dentry: if (unlikely(!d_deleted(dentry) !__unionfs_d_revalidate_chain(dentry, NULL, willwrite))) { err = -ESTALE; @@ -328,6 +329,11 @@ int unionfs_file_revalidate(struct file *file, bool willwrite) dgen = atomic_read(UNIONFS_D(dentry)-generation); fgen = atomic_read(UNIONFS_F(file)-generation); + if (unlikely(sbgen dgen)) { + pr_debug(unionfs: retry dentry revalidation\n); + schedule(); + goto reval_dentry; + } BUG_ON(sbgen dgen); /* diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c index 7646828..d969640 100644 --- a/fs/unionfs/dentry.c +++ b/fs/unionfs/dentry.c @@ -203,7 +203,7 @@ bool is_newer_lower(const struct dentry *dentry) if (!dentry || !UNIONFS_D(dentry)) return false; inode = dentry-d_inode; - if (!inode || !UNIONFS_I(inode) || + if (!inode || !UNIONFS_I(inode)-lower_inodes || ibstart(inode) 0 || ibend(inode) 0) return false; @@ -295,6 +295,8 @@ bool __unionfs_d_revalidate_chain(struct dentry *dentry, struct nameidata *nd, chain_len = 0; sbgen = atomic_read(UNIONFS_SB(dentry-d_sb)-generation); dtmp = dentry-d_parent; + if (dentry != dtmp) + unionfs_lock_dentry(dtmp, UNIONFS_DMUTEX_REVAL_PARENT); dgen = atomic_read(UNIONFS_D(dtmp)-generation); /* XXX: should we check if is_newer_lower all the way up? */ if (unlikely(is_newer_lower(dtmp))) { @@ -315,6 +317,8 @@ bool __unionfs_d_revalidate_chain(struct dentry *dentry, struct nameidata *nd, } purge_inode_data(dtmp-d_inode); } + if (dentry != dtmp) + unionfs_unlock_dentry(dtmp); while (sbgen != dgen) { /* The root entry should always be valid */ BUG_ON(IS_ROOT(dtmp)); diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c index 6095c4f..e15ddb9 100644 --- a/fs/unionfs/inode.c +++ b/fs/unionfs/inode.c @@ -30,6 +30,13 @@ static int unionfs_create(struct inode *parent, struct dentry *dentry, struct nameidata lower_nd; unionfs_read_lock(dentry-d_sb, UNIONFS_SMUTEX_CHILD); + unionfs_lock_dentry(dentry-d_parent, UNIONFS_DMUTEX_PARENT); + valid = __unionfs_d_revalidate_chain(dentry-d_parent, nd, false); + unionfs_unlock_dentry(dentry-d_parent); + if (unlikely(!valid)) { + err = -ESTALE; /* same as what real_lookup does */ + goto out; + } unionfs_lock_dentry(dentry, UNIONFS_DMUTEX_CHILD); valid = __unionfs_d_revalidate_chain(dentry, nd, false); @@ -936,6 +943,14 @@ static int unionfs_permission(struct inode *inode, int mask, const int is_file = !S_ISDIR(inode-i_mode); const int write_mask = (mask MAY_WRITE) !(mask MAY_READ); + if (nd) + unionfs_lock_dentry(nd-dentry, UNIONFS_DMUTEX_CHILD); + + if (!UNIONFS_I(inode)-lower_inodes) { + if (is_file)/* dirs can be unlinked but chdir'ed to */ + err = -ESTALE; /* force revalidate */ + goto out; + } bstart = ibstart(inode); bend = ibend(inode); if (unlikely(bstart 0 || bend 0)) { @@ -1003,6 +1018,8 @@ static int unionfs_permission(struct inode *inode, int mask, out: unionfs_check_inode(inode); unionfs_check_nd(nd); + if (nd) + unionfs_unlock_dentry(nd-dentry); return err; } -- 1.5.2.2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/4] Unionfs: mmap fixes
Ensure we have lower inodes in prepare/commit_write. Signed-off-by: Erez Zadok [EMAIL PROTECTED] --- fs/unionfs/mmap.c | 26 +- 1 files changed, 21 insertions(+), 5 deletions(-) diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c index a0e654b..ad770ac 100644 --- a/fs/unionfs/mmap.c +++ b/fs/unionfs/mmap.c @@ -224,13 +224,26 @@ out: static int unionfs_prepare_write(struct file *file, struct page *page, unsigned from, unsigned to) { + int err; + + unionfs_read_lock(file-f_path.dentry-d_sb, UNIONFS_SMUTEX_PARENT); /* -* Just copy lower inode attributes and return success. Not much -* else to do here. No need to lock either (lockdep won't like it). -* Let commit_write do all the hard work instead. +* This is the only place where we unconditionally copy the lower +* attribute times before calling unionfs_file_revalidate. The +* reason is that our -write calls do_sync_write which in turn will +* call our -prepare_write and then -commit_write. Before our +* -write is called, the lower mtimes are in sync, but by the time +* the VFS calls our -commit_write, the lower mtimes have changed. +* Therefore, the only reasonable time for us to sync up from the +* changed lower mtimes, and avoid an invariant violation warning, +* is here, in -prepare_write. */ unionfs_copy_attr_times(file-f_path.dentry-d_inode); - return 0; + err = unionfs_file_revalidate(file, true); + unionfs_check_file(file); + unionfs_read_unlock(file-f_path.dentry-d_sb); + + return err; } static int unionfs_commit_write(struct file *file, struct page *page, @@ -252,7 +265,6 @@ static int unionfs_commit_write(struct file *file, struct page *page, unionfs_check_file(file); inode = page-mapping-host; - lower_inode = unionfs_lower_inode(inode); if (UNIONFS_F(file) != NULL) lower_file = unionfs_lower_file(file); @@ -282,6 +294,10 @@ static int unionfs_commit_write(struct file *file, struct page *page, goto out; /* if vfs_write succeeded above, sync up our times/sizes */ + lower_inode = lower_file-f_path.dentry-d_inode; + if (!lower_inode) + lower_inode = unionfs_lower_inode(inode); + BUG_ON(!lower_inode); fsstack_copy_inode_size(inode, lower_inode); unionfs_copy_attr_times(inode); mark_inode_dirty_sync(inode); -- 1.5.2.2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/4] Unionfs: ensure we have lower dentries in d_iput
Signed-off-by: Erez Zadok [EMAIL PROTECTED] --- fs/unionfs/dentry.c |3 ++- 1 files changed, 2 insertions(+), 1 deletions(-) diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c index d969640..cd15243 100644 --- a/fs/unionfs/dentry.c +++ b/fs/unionfs/dentry.c @@ -507,9 +507,10 @@ static void unionfs_d_iput(struct dentry *dentry, struct inode *inode) { int bindex, rc; + BUG_ON(!dentry); unionfs_read_lock(dentry-d_sb, UNIONFS_SMUTEX_CHILD); - if (dbstart(dentry) 0) + if (!UNIONFS_D(dentry) || dbstart(dentry) 0) goto drop_lower_inodes; for (bindex = dbstart(dentry); bindex = dbend(dentry); bindex++) { if (unionfs_lower_mnt_idx(dentry, bindex)) { -- 1.5.2.2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/4] Unionfs: merged several printk KERN_CONT together into one pr_debug
CC: Joe Perches [EMAIL PROTECTED] Signed-off-by: Erez Zadok [EMAIL PROTECTED] --- fs/unionfs/debug.c | 51 +-- 1 files changed, 25 insertions(+), 26 deletions(-) diff --git a/fs/unionfs/debug.c b/fs/unionfs/debug.c index 5f1d887..d154c32 100644 --- a/fs/unionfs/debug.c +++ b/fs/unionfs/debug.c @@ -472,16 +472,16 @@ void __show_inode_times(const struct inode *inode, lower_inode = unionfs_lower_inode_idx(inode, bindex); if (unlikely(!lower_inode)) continue; - pr_debug(IT(%lu:%d): , inode-i_ino, bindex); - printk(KERN_CONT %s:%s:%d , file, fxn, line); - printk(KERN_CONT um=%lu/%lu lm=%lu/%lu , - inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec, - lower_inode-i_mtime.tv_sec, - lower_inode-i_mtime.tv_nsec); - printk(KERN_CONT uc=%lu/%lu lc=%lu/%lu\n, - inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec, - lower_inode-i_ctime.tv_sec, - lower_inode-i_ctime.tv_nsec); + pr_debug(IT(%lu:%d): %s:%s:%d +um=%lu/%lu lm=%lu/%lu uc=%lu/%lu lc=%lu/%lu\n, +inode-i_ino, bindex, +file, fxn, line, +inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec, +lower_inode-i_mtime.tv_sec, +lower_inode-i_mtime.tv_nsec, +inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec, +lower_inode-i_ctime.tv_sec, +lower_inode-i_ctime.tv_nsec); } } @@ -496,17 +496,16 @@ void __show_dinode_times(const struct dentry *dentry, lower_inode = unionfs_lower_inode_idx(inode, bindex); if (!lower_inode) continue; - pr_debug(DT(%s:%lu:%d): , dentry-d_name.name, inode-i_ino, -bindex); - printk(KERN_CONT %s:%s:%d , file, fxn, line); - printk(KERN_CONT um=%lu/%lu lm=%lu/%lu , - inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec, - lower_inode-i_mtime.tv_sec, - lower_inode-i_mtime.tv_nsec); - printk(KERN_CONT uc=%lu/%lu lc=%lu/%lu\n, - inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec, - lower_inode-i_ctime.tv_sec, - lower_inode-i_ctime.tv_nsec); + pr_debug(DT(%s:%lu:%d): %s:%s:%d +um=%lu/%lu lm=%lu/%lu uc=%lu/%lu lc=%lu/%lu\n, +dentry-d_name.name, inode-i_ino, bindex, +file, fxn, line, +inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec, +lower_inode-i_mtime.tv_sec, +lower_inode-i_mtime.tv_nsec, +inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec, +lower_inode-i_ctime.tv_sec, +lower_inode-i_ctime.tv_nsec); } } @@ -525,10 +524,10 @@ void __show_inode_counts(const struct inode *inode, lower_inode = unionfs_lower_inode_idx(inode, bindex); if (unlikely(!lower_inode)) continue; - printk(KERN_CONT SIC(%lu:%d:%d): , inode-i_ino, bindex, - atomic_read((inode)-i_count)); - printk(KERN_CONT lc=%d , - atomic_read((lower_inode)-i_count)); - printk(KERN_CONT %s:%s:%d\n, file, fxn, line); + pr_debug(SIC(%lu:%d:%d): lc=%d %s:%s:%d\n, +inode-i_ino, bindex, +atomic_read((inode)-i_count), +atomic_read((lower_inode)-i_count), +file, fxn, line); } } -- 1.5.2.2 - To unsubscribe from this list: send the line unsubscribe linux-fsdevel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts
Quoting Miklos Szeredi ([EMAIL PROTECTED]): From: Miklos Szeredi [EMAIL PROTECTED] Allow bind mounts to unprivileged users if the following conditions are met: - mountpoint is not a symlink - parent mount is owned by the user - the number of user mounts is below the maximum Unprivileged mounts imply MS_SETUSER, and will also have the nosuid and nodev mount flags set. In particular, if mounting process doesn't have CAP_SETUID capability, then the nosuid flag will be added, and if it doesn't have CAP_MKNOD capability, then the nodev flag will be added. That little part by itself is really needed in order to make the ability to remove CAP_MKNOD from a process tree's bounding set meaningful. Else instead of creating /dev/hda1, the user can just mount a filesystem with hda1 existing on it. (Which is why I was surprised when one day I found this code missing :) But of course I'm a fan of the patchset altogether. I plan to review in more detail early next week, but since I liked the previous submission I don't see myself having any complaints, so I'm glad to see the reviews by others. thanks, -serge Signed-off-by: Miklos Szeredi [EMAIL PROTECTED] --- Index: linux/fs/namespace.c === --- linux.orig/fs/namespace.c 2008-01-04 13:47:49.0 +0100 +++ linux/fs/namespace.c 2008-01-04 13:48:01.0 +0100 @@ -487,11 +487,34 @@ static void dec_nr_user_mounts(void) spin_unlock(vfsmount_lock); } -static void set_mnt_user(struct vfsmount *mnt) +static int reserve_user_mount(void) +{ + int err = 0; + + spin_lock(vfsmount_lock); + if (nr_user_mounts = max_user_mounts !capable(CAP_SYS_ADMIN)) + err = -EPERM; + else + nr_user_mounts++; + spin_unlock(vfsmount_lock); + return err; +} + +static void __set_mnt_user(struct vfsmount *mnt) { BUG_ON(mnt-mnt_flags MNT_USER); mnt-mnt_uid = current-fsuid; mnt-mnt_flags |= MNT_USER; + + if (!capable(CAP_SETUID)) + mnt-mnt_flags |= MNT_NOSUID; + if (!capable(CAP_MKNOD)) + mnt-mnt_flags |= MNT_NODEV; +} + +static void set_mnt_user(struct vfsmount *mnt) +{ + __set_mnt_user(mnt); spin_lock(vfsmount_lock); nr_user_mounts++; spin_unlock(vfsmount_lock); @@ -510,10 +533,16 @@ static struct vfsmount *clone_mnt(struct int flag) { struct super_block *sb = old-mnt_sb; - struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname); + struct vfsmount *mnt; + if (flag CL_SETUSER) { + int err = reserve_user_mount(); + if (err) + return ERR_PTR(err); + } + mnt = alloc_vfsmnt(old-mnt_devname); if (!mnt) - return ERR_PTR(-ENOMEM); + goto alloc_failed; mnt-mnt_flags = old-mnt_flags; atomic_inc(sb-s_active); @@ -525,7 +554,7 @@ static struct vfsmount *clone_mnt(struct /* don't copy the MNT_USER flag */ mnt-mnt_flags = ~MNT_USER; if (flag CL_SETUSER) - set_mnt_user(mnt); + __set_mnt_user(mnt); if (flag CL_SLAVE) { list_add(mnt-mnt_slave, old-mnt_slave_list); @@ -550,6 +579,11 @@ static struct vfsmount *clone_mnt(struct spin_unlock(vfsmount_lock); } return mnt; + + alloc_failed: + if (flag CL_SETUSER) + dec_nr_user_mounts(); + return ERR_PTR(-ENOMEM); } static inline void __mntput(struct vfsmount *mnt) @@ -986,22 +1020,26 @@ asmlinkage long sys_oldumount(char __use #endif -static int mount_is_safe(struct nameidata *nd) +/* + * Conditions for unprivileged mounts are: + * - mountpoint is not a symlink + * - mountpoint is in a mount owned by the user + */ +static bool permit_mount(struct nameidata *nd, int *flags) { + struct inode *inode = nd-path.dentry-d_inode; + if (capable(CAP_SYS_ADMIN)) - return 0; - return -EPERM; -#ifdef notyet - if (S_ISLNK(nd-path.dentry-d_inode-i_mode)) - return -EPERM; - if (nd-path.dentry-d_inode-i_mode S_ISVTX) { - if (current-uid != nd-path.dentry-d_inode-i_uid) - return -EPERM; - } - if (vfs_permission(nd, MAY_WRITE)) - return -EPERM; - return 0; -#endif + return true; + + if (S_ISLNK(inode-i_mode)) + return false; + + if (!is_mount_owner(nd-path.mnt, current-fsuid)) + return false; + + *flags |= MS_SETUSER; + return true; } static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry) @@ -1245,9 +1283,10 @@ static int do_loopback(struct nameidata int clone_fl; struct nameidata old_nd; struct vfsmount *mnt = NULL; - int err = mount_is_safe(nd); - if (err) -
Re: [PATCH][RFC] Simple tamper-proof device filesystem.
Hello. Indan Zupancic wrote: Good point, but I assume they all have at least a directory granularity, and then /dev/ can be static and udev and other can have free reign in e.g. /dev/dynamic/. Just use subdirs for the dynamic stuff and this granularity problem is, with slight inconvenience, solved. It seems to me that the alternatives you are proposing include modification of userland applications. But my assumption is that Don't require modification of userland applications. In other words, I want to implement without asking applications to use /dev/dynamic/ or something. This filesystem is intended to provide support for legacy applications. (In fact, this filesystem in TOMOYO Linux is for kernel 2.4.30/2.6.11 and later.) Exploits are in code, and where that code is doesn't matter that much, either kernel or userspace, though if it's exploitable you'll rather not have it in the kernel. So I think it's more secure if the checking would be done by udev than in a special filesystem, even if that means that you're screwed if udev is exploited. Of course you fully trust your own code, naturally. I'm keeping the mechanism as simple as possible so that there is unlikely room (e.g. buffer overflow) for running exploits. A tiny daemon that communicates with udev and does the checking you have now, and if ok it creates the node is really not much more code than your fs, so as hard to exploit too. Then if udev is hacked you have the same guarantee as you have now. Use of a tiny daemon that communicates with udev is not sufficient. The udev is not the only application that modifies /dev files. At least, the tiny daemon should communicate with the kernel so that all requests are checked by the tiny daemon. But use of the tiny daemon (which is a process running in userland) causes a lot of troubles. See the block after the -- boundary -- of this posting. My assumption is that Don't require userland process's assistance, as written at Why not use FUSE?. Protecting certain files from being modified seems to me more generic than enforcing filename/attributes pairs on device nodes. OK. You are saying that from the point of view of what it can. I thought you were saying enforcing filename/attributes pairs from out-of-this-filesystem (e.g. MAC) is more flexible than this-filesystem. rm -f /dev/either-null-or-zero as said before, if this is possible then the MAC config used is wrong. Exactly the same as for your filesystem with mknod /dev/tmp1 c 1 X mount --bind /dev/tmp1 /dev/either-null-or-zero and you count on the MAC to prevent that. An administrator asks MAC to prevent processes (except specific processes who need to do rm -f /dev/either-null-or-zero) from doing rm -f /dev/either-null-or-zero. An administrator asks this filesystem to prevent processes from doing mknod /dev/tmp1 c 1 X. An administrator asks MAC to prevent processes from doing mount --bind /dev/tmp1 /dev/either-null-or-zero. And as for that app, if you trust it to create device nodes, why don't you trust it to make the right nodes too? If that app has a bug that triggers mknod /dev/either-null-or-zero 1$REPLY instead of mknod /dev/either-null-or-zero $REPLY under an unexpected circumstance, it will create unwanted nodes. Thus I don't trust the app. If an administrator wants something else than 3 or 5, you're breaking something. That's the fate of white-list based access control. Does this filesystem sound too strict to support dynamic device? May be this filesystem should be able to permit creation of device nodes that are not listed in the policy file. Can SELinux guarantee the same result as my filesystem even if udev or administrative programs have to be able to modify /dev ? More, because your filesystem doesn't guarantee anything at all on its own. But assuming the MAC is decent enough to protect your fs from being bypassed, I'm sure it can do what's needed fine without your fs. I can't answer for SELinux because I don't know it well. But I trust it can protect files and/or directories, and that's all that's needed to achieve the same end result. I don't know SELinux well, but as far as seeing an example (found by Googling selinux allow mknod) allow udev_t self:capability { chown dac_override dac_read_search fowner fsetid sys_admin sys_nice mknod net_raw net_admin sys_rawio }; I can't find a place to specify filename/attributes pairs in this syntax. So, if the process who is permitted to create device nodes misbehaves, it will generate unexpected filename/attribute pairs. I think SELinux can't guarantee the same result as my filesystem. You seem to assume that the in-kernel implementation is suddenly guaranteed bugfree. I keep the implementation as simple as possible. From your next posting: But I think doing more is getting ridiculous, because if a process can create a device node, it can also access it and do whatever harm could