Re: [RFD] Incremental fsck

2008-01-09 Thread Valdis . Kletnieks
On Wed, 09 Jan 2008 07:40:12 +0300, Al Boldi said:

 But why wouldn't it be possible to do this on the current fs infrastructure, 
 using just a smart fsck, working incrementally on some sub-dir?

If you have /home/usera, /home/userb, and /home/userc, the vast majority of
fs screw-ups can't be detected by only looking at one sub-dir.  For example,
you can't tell definitively that all blocks referenced by an inode under
/home/usera are properly only allocated to one file until you *also* look at
the inodes under user[bc].  Heck, you can't even tell if the link count for
a file is correct unless you walk the entire filesystem - you can find a file
with a link count of 3 in the inode, and you find one reference under usera,
and a second under userb - you can't tell if the count is one too high or
not until you walk through userc and actually see (or fail to see) a third
directory entry referencing it.


pgpVezxvjaftR.pgp
Description: PGP signature


[PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Jens Axboe
Hi,

loop.c currently uses the page cache interface to do IO to file backed
devices. This works reasonably well for simple things, like mapping an
iso9660 file for direct mount and other read-only workloads. Writing is
somewhat problematic, as anyone who has really used this feature can
attest to - it tends to confuse the vm (hello kswapd) since it break
dirty accounting and behaves very erratically on writeout. Did I mention
that it's pretty slow as well, for both reads and writes?

It also behaves differently than a real drive. For writes, completions
are done once they hit page cache. Since loop queues bio's async and
hands them off to a thread, you can have a huge backlog of stuff to do.
It's hard to attempt to guarentee data safety for file systems on top of
loop without making it even slower than it currently is.

Back when loop was only used for iso9660 mounting and other simple
things, this didn't matter. Now it's often used in xen (and others)
setups where we do care about performance AND writing. So the below is a
attempt at speeding up loop and making it behave like a real device.
It's a somewhat quick hack and is still missing one piece to be
complete, but I'll throw it out there for people to play with and
comment on.

So how does it work? Instead of punting IO to a thread and passing it
through the page cache, we instead attempt to send the IO directly to the
filesystem block that it maps to. loop maintains a prio tree of known
extents in the file (populated lazily on demand, as needed). Advantages
of this approach:

- It's fast, loop will basically work at device speed.
- It's fast, loop it doesn't put a huge amount of system load on the
  system when busy. When I did comparison tests on my notebook with an
  external drive, running a simple tiobench on the current in-kernel
  loop with a sparse file backing rendered the notebook basically
  unusable while the test was ongoing. The remapper version had no more
  impact than it did when used directly on the external drive.
- It behaves like a real block device.
- It's easy to support IO barriers, which is needed to ensure safety
  especially in virtualized setups.

Disadvantages:

- The file block mappings must not change while loop is using the file.
  This means that we have to ensure exclusive access to the file and
  this is the bit that is currently missing in the implementation. It
  would be nice if we could just do this via open(), ideas welcome...
- It'll tie down a bit of memory for the prio tree. This is GREATLY
  offset by the reduced page cache foot print though.
- It cannot be used with the loop encryption stuff. dm-crypt should be
  used instead, on top of loop (which, I think, is even the recommended
  way to do this today, so not a big deal).

This patch will automatically enable the new operation mode (called
fastfs). I added an ioctl (LOOP_SET_FASTFS) that should be implemented
in losetup, then we can remove this hunk in the code:

+   /*
+* This needs to be done after setup with another ioctl,
+* not automatically like this.
+*/
+   loop_init_fastfs(lo);
+

from loop_set_fd().

Patch is against 2.6.23-rc7 ('ish, as of this morning), will probably
apply easily to 2.6.22 as well.


diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 56e2304..e49bfa8 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -481,16 +481,51 @@ static int do_bio_filebacked(struct loop_device *lo, 
struct bio *bio)
return ret;
 }
 
+#define __lo_throttle(wq, lock, condition) \
+do {   \
+   DEFINE_WAIT(__wait);\
+   for (;;) {  \
+   prepare_to_wait((wq), __wait, TASK_UNINTERRUPTIBLE);   \
+   if (condition)  \
+   break;  \
+   spin_unlock_irq((lock));\
+   io_schedule();  \
+   spin_lock_irq((lock));  \
+   }   \
+   finish_wait((wq), __wait); \
+} while (0)\
+
+#define lo_act_bio(bio)((bio)-bi_bdev)
+#define LO_BIO_THROTTLE128
+
 /*
- * Add bio to back of pending list
+ * A normal block device will throttle on request allocation. Do the same
+ * for loop to prevent millions of bio's queued internally.
+ */
+static void loop_bio_throttle(struct loop_device *lo, struct bio *bio)
+{
+   if (lo_act_bio(bio))
+   __lo_throttle(lo-lo_bio_wait, lo-lo_lock,
+   lo-lo_bio_cnt  LO_BIO_THROTTLE);
+}
+
+/*
+ * Add bio to back 

Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Miklos Szeredi
  On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
  From: Miklos Szeredi [EMAIL PROTECTED]
 
  Use FS_SAFE for fuse fs type, but not for fuseblk.
 
  FUSE was designed from the beginning to be safe for unprivileged users.  
  This
  has also been verified in practice over many years.  In addition 
  unprivileged
  Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not
  considered important enough to even mention?
  
  No.  Because in practice they don't seem to matter.  Also because
  there's no way in which fuse could be done differently to address
  these issues.
 
 Could you clarify, please? I hope I'm getting the wrong end of the stick
 - it sounds to me like you and Pavel are saying that this patch breaks
 suspending to ram (and hibernating?) but you want to push it anyway
 because you haven't been able to produce an instance, don't think
 suspending or hibernating matter and couldn't fix fuse anyway?

This patch has nothing to do with suspend or hibernate.  What this
patchset does, is help get rid of fusermount, a suid-root mount
helper.  It also opens up new possibilities, which are not fuse
related.

Fuse has bad interactions with the freezer, theoretically.  In
practice, I remember just one bug report (that sparked off this whole
do we need freezer, or don't we flamefest), that actually got fixed
fairly quickly, ...maybe.  Rafael probably remembers better.

  The 'kill -9' thing is basically due to VFS level locking not being
  interruptible.  It could be changed, but I'm not sure it's worth it.
  
  For the suspend issue, there are also no easy solutions.
 
 What are the non-easy solutions?

The ability to freeze tasks in uninterruptible sleep, or more
generally at any preempt point (except when drivers are poking
hardware).

I know this doesn't play well with userspace hibernate, and I don't
think it can be resolved without going the kexec way.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Szabolcs Szakacsits

Hi,

On Wed, 9 Jan 2008, Nigel Cunningham wrote:
 On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
 
  For the suspend issue, there are also no easy solutions.
 
 What are the non-easy solutions?

A practical point of view I've seen only fuse rootfs mounts to be a 
problem. I remember Ubuntu patches for this (WUBI and some other distros 
install NTFS root). But this probably also depends on the used suspend 
implementation.

Personally I've never had fuse related suspend problem with ordinary mounts 
during heavy use under development, nor NTFS user problem was tracked down 
to it in the last one and half year.

Regards,
Szaka

-- 
NTFS-3G:  http://ntfs-3g.org
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-09 Thread Andreas Dilger
Andi Kleen wrote:
 Theodore Tso [EMAIL PROTECTED] writes:
  Now, there are good reasons for doing periodic checks every N mounts
  and after M months.  And it has to do with PC class hardware.  (Ted's
  aphorism: PC class hardware is cr*p).

 If these reasons are good ones (some skepticism here) then the correct
 way to really handle this would be to do regular background scrubbing
 during runtime; ideally with metadata checksums so that you can actually
 detect all corruption.

 But since fsck is so slow and disks are so big this whole thing
 is a ticking time bomb now. e.g. it is not uncommon to require tens
 of minutes or even hours of fsck time and some server that reboots
 only every few months will eat that when it happens to reboot.
 This means you get a quite long downtime.

 Has there been some thought about an incremental fsck?

While an _incremental_ fsck isn't so easy for existing filesystem types,
what is pretty easy to automate is making a read-only snapshot of a
filesystem via LVM/DM and then running e2fsck against that.  The kernel
and filesystem have hooks to flush the changes from cache and make the
on-disk state consistent.

You can then set the the ext[234] superblock mount count and last check
time via tune2fs if all is well, or schedule an outage if there are
inconsistencies found.

There is a copy of this script at:
http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html

Note that it might need some tweaks to run with DM/LVM2 commands/output,
but is mostly what is needed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Nigel Cunningham
Hi.

Miklos Szeredi wrote:
 On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
 From: Miklos Szeredi [EMAIL PROTECTED]

 Use FS_SAFE for fuse fs type, but not for fuseblk.

 FUSE was designed from the beginning to be safe for unprivileged users.  
 This
 has also been verified in practice over many years.  In addition 
 unprivileged
 Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not
 considered important enough to even mention?
 No.  Because in practice they don't seem to matter.  Also because
 there's no way in which fuse could be done differently to address
 these issues.
 Could you clarify, please? I hope I'm getting the wrong end of the stick
 - it sounds to me like you and Pavel are saying that this patch breaks
 suspending to ram (and hibernating?) but you want to push it anyway
 because you haven't been able to produce an instance, don't think
 suspending or hibernating matter and couldn't fix fuse anyway?
 
 This patch has nothing to do with suspend or hibernate.  What this
 patchset does, is help get rid of fusermount, a suid-root mount
 helper.  It also opens up new possibilities, which are not fuse
 related.

That's what I thought. So what was Pavel talking about with kill -9 no
longer works and suspend no longer works above? I couldn't understand
it from the context.

 Fuse has bad interactions with the freezer, theoretically.  In
 practice, I remember just one bug report (that sparked off this whole
 do we need freezer, or don't we flamefest), that actually got fixed
 fairly quickly, ...maybe.  Rafael probably remembers better.

I think they just gave up and considered it unsolvable. I'm not sure it is.

 The 'kill -9' thing is basically due to VFS level locking not being
 interruptible.  It could be changed, but I'm not sure it's worth it.

 For the suspend issue, there are also no easy solutions.
 What are the non-easy solutions?
 
 The ability to freeze tasks in uninterruptible sleep, or more
 generally at any preempt point (except when drivers are poking
 hardware).

Couldn't some sort of scheduler based solution deal with the
uninterruptible sleeping case?

 I know this doesn't play well with userspace hibernate, and I don't
 think it can be resolved without going the kexec way.

I can see the desirability of kexec when it comes to avoiding the
freezer, but comes with its own problems too - having the original
context usable is handy, not having to set aside a large amount of space
for a second kernel is also desirable and there are still greater issues
of transferring information backwards and forwards between the two kernels.

Regards,

Nigel
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Miklos Szeredi
   'updatedb no longer works' is not a problem?
  
  I haven't seen any problems with updatedb, and haven't had any bug
  reports about it either.
 
 Ok, I don't know much about FUSE. In current version, if user creates
 infinite maze and mounts it under ~, updatedb just does not enter it?

It doesn't.  See Documentation/filesystems/fuse.txt

  AFAIR there were two security vulnerabilities in fuse's history, one
  of them an information leak in the kernel module, and the other one an
  mtab corruption issue in the fusermount utility.  I don't think this
  is such a bad track record.
 
 Not bad indeed. But I'd consider 'kill -9 not working' to be DoS
 vulnerability...

The worst that can happen is that a sysadmin doesn't read the docs
(likely) before enabling fuse on a multiuser system, and is surprised
by a user doing funny things.  And _then_ has to go read the docs, or
google for some info.  This is basically how things normally work, and
I don't consider it a DoS.

 and I'm woried about problems fuse + user mounts expose in other
 parts of system.

I'm worried too, and I'm not saying that enabling unprivileged fuse
mounts is completely risk free.  Nothing is, and nobody is forced to
do it.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Jens Axboe
On Wed, Jan 09 2008, Christoph Hellwig wrote:
 On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
  - The file block mappings must not change while loop is using the file.
This means that we have to ensure exclusive access to the file and
this is the bit that is currently missing in the implementation. It
would be nice if we could just do this via open(), ideas welcome...
 
 And the way this is done is simply broken.  It means you have to get
 rid of things like delayed or unwritten hands beforehand, it'll be
 a complete pain for COW or non-block backed filesystems.

COW is not that hard to handle, you just need to be notified of moving
blocks. If you view the patch as just a tighter integration between loop
and fs, I don't think it's necessarily that broken.

I did consider these cases, and it can be done with the existing
approach.

 The right way to do this is to allow direct I/O from kernel sources
 where the filesystem is in-charge of submitting the actual I/O after
 the pages are handed to it.  I think Peter Zijlstra has been looking
 into something like that for swap over nfs.

That does sound like a nice approach, but a lot more work. It'll behave
differently too, the advantage of what I proposed is that it behaves
like a real device.

I'm not asking you to love it (in fact I knew some people would complain
about this approach and I understand why), just tossing it out there to
get things rolling. If we end up doing it differently I don't really
care, I'm not married to any solution but merely wish to solve a
problem. If that ends up being solved differently, the outcome is the
same to me.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Christoph Hellwig
On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
 - The file block mappings must not change while loop is using the file.
   This means that we have to ensure exclusive access to the file and
   this is the bit that is currently missing in the implementation. It
   would be nice if we could just do this via open(), ideas welcome...

And the way this is done is simply broken.  It means you have to get
rid of things like delayed or unwritten hands beforehand, it'll be
a complete pain for COW or non-block backed filesystems.

The right way to do this is to allow direct I/O from kernel sources
where the filesystem is in-charge of submitting the actual I/O after
the pages are handed to it.  I think Peter Zijlstra has been looking
into something like that for swap over nfs.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 6/9] unprivileged mounts: allow unprivileged mounts

2008-01-09 Thread Karel Zak
On Tue, Jan 08, 2008 at 12:35:08PM +0100, Miklos Szeredi wrote:
 Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of
 this filesystem may not constitute a security problem.
 
 Since most filesystems haven't been designed with unprivileged mounting in
 mind, a thorough audit is needed before setting this flag.
 
 For safe filesystems also allow unprivileged forced unmounting.

 What about to list safe filesystems anywhere in /proc/fs/ ? I think
 it's very important information for admins.

 Note, your patch for mount(8) is always trying to use unprivileged
 mount(2) for non-root users. It's overkill when unprivileged mounts are
 supported for bind mounts and fuse only. It would be nice to check
 if FS is safe before switch to unprivileged mode.

 The safe definition is also very subjective and it depends on your
 level of paranoia. There should be a way (e.g. /proc) how control and
 modify the list of safe filesystems. For example I have no problem
 to mark cifs as safe for my home server.

Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Pavel Machek
On Wed 2008-01-09 09:47:31, Miklos Szeredi wrote:
   On Tue 2008-01-08 12:35:09, Miklos Szeredi wrote:
   From: Miklos Szeredi [EMAIL PROTECTED]
  
   Use FS_SAFE for fuse fs type, but not for fuseblk.
  
   FUSE was designed from the beginning to be safe for unprivileged users. 
This
   has also been verified in practice over many years.  In addition 
   unprivileged
   Eh? So 'kill -9 no longer works' and 'suspend no longer works' is not
   considered important enough to even mention?
   
   No.  Because in practice they don't seem to matter.  Also because
   there's no way in which fuse could be done differently to address
   these issues.
  
  Could you clarify, please? I hope I'm getting the wrong end of the stick
  - it sounds to me like you and Pavel are saying that this patch breaks
  suspending to ram (and hibernating?) but you want to push it anyway
  because you haven't been able to produce an instance, don't think
  suspending or hibernating matter and couldn't fix fuse anyway?
 
 This patch has nothing to do with suspend or hibernate.  What this
 patchset does, is help get rid of fusermount, a suid-root mount
 helper.  It also opens up new possibilities, which are not fuse
 related.
 
 Fuse has bad interactions with the freezer, theoretically.  In
 practice, I remember just one bug report (that sparked off this whole
 do we need freezer, or don't we flamefest), that actually got fixed
 fairly quickly, ...maybe.  Rafael probably remembers better.

In practice, if the unpriviledged fuse gets enabled, any user can
prevent suspend/hibernation from working.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Chris Mason
On Wed, 9 Jan 2008 10:43:21 +0100
Jens Axboe [EMAIL PROTECTED] wrote:

 On Wed, Jan 09 2008, Christoph Hellwig wrote:
  On Wed, Jan 09, 2008 at 09:52:32AM +0100, Jens Axboe wrote:
   - The file block mappings must not change while loop is using the
   file. This means that we have to ensure exclusive access to the
   file and this is the bit that is currently missing in the
   implementation. It would be nice if we could just do this via
   open(), ideas welcome...
  
  And the way this is done is simply broken.  It means you have to get
  rid of things like delayed or unwritten hands beforehand, it'll be
  a complete pain for COW or non-block backed filesystems.
 
 COW is not that hard to handle, you just need to be notified of moving
 blocks. If you view the patch as just a tighter integration between
 loop and fs, I don't think it's necessarily that broken.
 
Filling holes (delayed allocation) and COW are definitely a problem.
But at least for the loop use case, most non-cow filesystems will want
to preallocate the space for loop file and be done with it.  Sparse
loop definitely has uses, but generally those users are willing to pay
a little performance.

Jens' patch falls back to buffered writes for the hole case and
pretends cow doesn't exist.  It's a good starting point that I hope to
extend with something like the extent_map apis.

 I did consider these cases, and it can be done with the existing
 approach.
 
  The right way to do this is to allow direct I/O from kernel sources
  where the filesystem is in-charge of submitting the actual I/O after
  the pages are handed to it.  I think Peter Zijlstra has been looking
  into something like that for swap over nfs.
 
 That does sound like a nice approach, but a lot more work. It'll
 behave differently too, the advantage of what I proposed is that it
 behaves like a real device.

The problem with O_DIRECT (or even O_SYNC) loop is that every write
into loop becomes synchronous, and it really changes the performance of
things like filemap_fdatawrite.

If we just hand ownership of the file over to loop entirely and prevent
other openers (perhaps even forcing backups through the loop device),
we get fewer corner cases and much better performance.

-chris
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Pavel Machek
Hi!

   AFAIR there were two security vulnerabilities in fuse's history, one
   of them an information leak in the kernel module, and the other one an
   mtab corruption issue in the fusermount utility.  I don't think this
   is such a bad track record.
  
  Not bad indeed. But I'd consider 'kill -9 not working' to be DoS
  vulnerability...
 
 The worst that can happen is that a sysadmin doesn't read the docs
 (likely) before enabling fuse on a multiuser system, and is surprised
 by a user doing funny things.  And _then_ has to go read the docs, or
 google for some info.  This is basically how things normally work, and
 I don't consider it a DoS.

No, this is not normal. Kill -9 has been estabilished long time ago,
and we should not be documenting its now-brokenness in
Documentation/filesystems/fuse.txt .

For example, my /etc/inittab currently has:

kb::kbrequest:/etc/rc/rc.reboot 2 0

#
# This file handles system shutdown and reboot.
#

PATH=/sbin:/bin:/usr/sbin:/usr/bin
sync 

# Kill all processes.
wall System is going down NOW\!
echo -n -e \rSystem is going down: processes.
killall5 -15
echo -n .
sleep 1
echo -n . 
killall5 -9

# Before unmounting file systems write a reboot record to wtmp.
echo -n wtmp 
halt -w

# Swap needs to be unmounted because otherwise busy filesystems
remain.
echo -n swap 
swapoff -a
swapoff /c/swap
swapoff /c/swap2

# Unmount file systems
echo -n umount.
umount -a || (
sync 
echo -n umount-retry.
sleep 1
umount -a || sulogin
)
echo -n . 
mount -n -o remount,ro /

# Now halt or reboot.
if [ $2 = 0 ] ; then
swapoff -a
echo halted.
halt -p -f
else
echo rebooting...
reboot -d -f
fi

...this will break with FUSE enabled, right? (Minor security hole by
allowing users to stop c-a-delete, where none existed before?)

I'm currently suspending by 'echo mem  /sys/power/state'. How
should I do that _safely_ with FUSE enabled?

If I want to get rid of nasty user in multiuser system, I do 
su nastyuser 'kill -9 -1' . How do I do the equivalent with FUSE
enabled? (Without affecting other users?)

Load average was never really meaningful number, but with FUSE
enabled, users can set it to 666 without actually eating any CPU.

SIGSTOP used to work, allowing you to prevent user processes from
working while you examine them. Now SIGSTOP can be delayed for
arbitrary time.

Heck, imagine malicious user process misbehaves. Before FUSE, you
could at least attach it with gdb to look what it is doing. Now you
can't.

I really believe FUSE vs. signals needs fixing. Either that, or
updating all the manpages

man 1 kill:
-   KILL   9   exit  this signal may not be blocked
+   KILL   9   exit  this signal may not be blocked, except by FUSE 
user mount

Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-09 Thread Al Boldi
Valerie Henson wrote:
 On Jan 8, 2008 8:40 PM, Al Boldi [EMAIL PROTECTED] wrote:
  Rik van Riel wrote:
   Al Boldi [EMAIL PROTECTED] wrote:
Has there been some thought about an incremental fsck?
   
You know, somehow fencing a sub-dir to do an online fsck?
  
   Search for chunkfs
 
  Sure, and there is TileFS too.
 
  But why wouldn't it be possible to do this on the current fs
  infrastructure, using just a smart fsck, working incrementally on some
  sub-dir?

 Several data structures are file system wide and require finding every
 allocated file and block to check that they are correct.  In
 particular, block and inode bitmaps can't be checked per subdirectory.

Ok, but let's look at this a bit more opportunistic / optimistic.

Even after a black-out shutdown, the corruption is pretty minimal, using 
ext3fs at least.  So let's take advantage of this fact and do an optimistic 
fsck, to assure integrity per-dir, and assume no external corruption.  Then 
we release this checked dir to the wild (optionally ro), and check the next.  
Once we find external inconsistencies we either fix it unconditionally, 
based on some preconfigured actions, or present the user with options.

All this could be per-dir or using some form of on-the-fly file-block-zoning.

And there probably is a lot more to it, but it should conceptually be 
possible, with more thoughts though...

 http://infohost.nmt.edu/~val/review/chunkfs.pdf


Thanks!

--
Al

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 6/9] unprivileged mounts: allow unprivileged mounts

2008-01-09 Thread Miklos Szeredi
 On Tue, Jan 08, 2008 at 12:35:08PM +0100, Miklos Szeredi wrote:
  Define a new fs flag FS_SAFE, which denotes, that unprivileged mounting of
  this filesystem may not constitute a security problem.
  
  Since most filesystems haven't been designed with unprivileged mounting in
  mind, a thorough audit is needed before setting this flag.
  
  For safe filesystems also allow unprivileged forced unmounting.
 
  What about to list safe filesystems anywhere in /proc/fs/ ? I think
  it's very important information for admins.

Makes sense.  I'll cook up something.

  Note, your patch for mount(8) is always trying to use unprivileged
  mount(2) for non-root users. It's overkill when unprivileged mounts are
  supported for bind mounts and fuse only. It would be nice to check
  if FS is safe before switch to unprivileged mode.

I think the little gain in performance is not worth the added
complexity.  Especially if the added complexity is in the privileged
part, and itself can be a source of security holes.

  The safe definition is also very subjective and it depends on your
  level of paranoia. There should be a way (e.g. /proc) how control and
  modify the list of safe filesystems. For example I have no problem
  to mark cifs as safe for my home server.

OK, also makes some sense.  Pavel's examples do point out that fuse
isn't as safe as I'd like it to be, so perhaps it would make sense to
default to just bind mounts being allowed, and having to explicity
enable unprivileged fuse mounts with a sysctl or whatever.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-09 Thread Jan Engelhardt

On Jan 8 2008 20:08, Miklos Szeredi wrote:
 On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote:
  +static int reserve_user_mount(void)
  +{
  +   int err = 0;
  +
  +   spin_lock(vfsmount_lock);
  +   if (nr_user_mounts = max_user_mounts  !capable(CAP_SYS_ADMIN))
  +   err = -EPERM;
  +   else
  +   nr_user_mounts++;
  +   spin_unlock(vfsmount_lock);
  +   return err;
  +} 
 
 Would -ENOSPC or -ENOMEM be a more descriptive error here?  

The logic behind EPERM, is that this failure is only for unprivileged
callers.  ENOMEM is too specifically about OOM.  It could be changed
to ENOSPC, ENFILE, EMFILE, or it could remain EPERM.  What do others
think?

ENOSPC: No space remaining on device = 'wth'.
ENOMEM: I usually think of a userspace OOM (e.g. malloc'ed out all of your
32-bit address space on 32-bit processes)
EMFILE: Too many open files
ENFILE: Too many open files in system.

ENFILE seems like a temporary winner among these four.

Back in the old days, when the number of mounts was limited in Linux,
what error value did it return? That one could be used.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-09 Thread Karel Zak
On Wed, Jan 09, 2008 at 01:45:09PM +0100, Jan Engelhardt wrote:
 
 On Jan 8 2008 20:08, Miklos Szeredi wrote:
  On Tue, 2008-01-08 at 12:35 +0100, Miklos Szeredi wrote:
   +static int reserve_user_mount(void)
   +{
   +   int err = 0;
   +
   +   spin_lock(vfsmount_lock);
   +   if (nr_user_mounts = max_user_mounts  !capable(CAP_SYS_ADMIN))
   +   err = -EPERM;
   +   else
   +   nr_user_mounts++;
   +   spin_unlock(vfsmount_lock);
   +   return err;
   +} 
  
  Would -ENOSPC or -ENOMEM be a more descriptive error here?  
 
 The logic behind EPERM, is that this failure is only for unprivileged
 callers.  ENOMEM is too specifically about OOM.  It could be changed
 to ENOSPC, ENFILE, EMFILE, or it could remain EPERM.  What do others
 think?
 
 ENOSPC: No space remaining on device = 'wth'.
 ENOMEM: I usually think of a userspace OOM (e.g. malloc'ed out all of your
 32-bit address space on 32-bit processes)
 EMFILE: Too many open files
 ENFILE: Too many open files in system.
 
 ENFILE seems like a temporary winner among these four.

 I see EMFILE, it's still supported by the latest mount(8).

 Back in the old days, when the number of mounts was limited in Linux,
 what error value did it return? That one could be used.

 Copy  past from mount-0.99.2:

  /* Mount failed, complain, but don't die.  */
  switch (mnt_err)
{
case EPERM:
  if (geteuid() == 0)
error (mount: mount point %s is not a directory, node);
  else
error (mount: must be superuser to use mount);
  break;
case EBUSY:
  error (mount: wrong fs type, %s already mounted, %s busy, 
or other error, spec, node);
  break;
case ENOENT:
  error (mount: mount point %s does not exist, node); break;
case ENOTDIR:
  error (mount: mount point %s is not a directory, node); break;
case EINVAL:
  error (mount: %s not a mount point, spec); break;
case EMFILE:
  error (mount table full); break;
case EIO:
  error (mount: %s: can't read superblock, spec); break;
case ENODEV:
  error (mount: fs type %s not supported by kernel, type); break;
case ENOTBLK:
  error (mount: %s is not a block device, spec); break;
case ENXIO:
  error (mount: %s is not a valid block device, spec); break;
case EACCES:
  error (mount: block device %s is not permitted on its filesystem, spec);
  break;
default:
  error (mount: %s, strerror (mnt_err)); break;
}


  Karel

-- 
 Karel Zak  [EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-09 Thread Miklos Szeredi
 case EMFILE:
   error (mount table full); break;

OK, we could go with EMFILE, but the message should be changed to
something like maximum unprivileged mount count exceeded.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Miklos Szeredi
 I'm not saying fuse is worthless. It is a nice toy for single-user
 systems. But I do not think we should be merging allow ordinary users
 to mount their own fuse's before issues above are fixed.

I think multi user systems are not all that interesting.  And I
suspect very few of them want reliably working suspend/hibernate
(which they wouldn't get due to other issues anyway), or have weird
shutdown scripts which stop when they are unable to umount
filesystems.

For paranoid sysadmins, I suggest not enabling fuse for unprivileged
users, which is pretty easy to do: just don't set /dev/fuse to be
world read-writable (which is the default BTW).

So your reasons just don't warrant a big effort involving VFS hacking,
etc.  Patches are of course welcome.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Pavel Machek
Hi!

  ...this will break with FUSE enabled, right? (Minor security hole by
  allowing users to stop c-a-delete, where none existed before?)
 
 Yup (or I don't know, I'm sure there was or is some problem with
 ptrace, that could be used to create unkillable processes).
 
 Fuse could actually be fixed to exit reliably for 'killall5 -9' (it
 used to), but that has other problems, and it doesn't seem very
 important to me.  But this can be discussed.

I think it is better to fix fuse than to rewrite all the shutdown scripts.

 What cannot be fixed is if one process is inside an fs operation
 (e.g. unlink), holding a VFS lock (i_mutex) and another process goes
 to uninterruptible sleep on that lock.  There's no way (other than
 rewriting the VFS) in which that second process could be killed unless
 you kill the first one or the fuse server.

I believe VFS should be rewritten here. Perhaps new TASK_KILLABLE
state can help?

  I really believe FUSE vs. signals needs fixing. Either that, or
  updating all the manpages
  
  man 1 kill:
  -   KILL   9   exit  this signal may not be blocked
  +   KILL   9   exit  this signal may not be blocked, except by 
  FUSE user mount
 
 Heh, there are all very interesting, but most of these issues are not
 even on my todo list (which has grown into quite a big pile over the
 years), which means, that they don't seem to matter to people in
 practice.
 
 You seem to be implying that fuse is worthless if these issues are not
 fixed, but that is very far from the truth, I think.

I'm not saying fuse is worthless. It is a nice toy for single-user
systems. But I do not think we should be merging allow ordinary users
to mount their own fuse's before issues above are fixed.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 7/9] unprivileged mounts: allow unprivileged fuse mounts

2008-01-09 Thread Miklos Szeredi
   I'm not saying fuse is worthless. It is a nice toy for single-user
   systems. But I do not think we should be merging allow ordinary users
   to mount their own fuse's before issues above are fixed.
  
  I think multi user systems are not all that interesting.  And I
  suspect very few of them want reliably working suspend/hibernate
  (which they wouldn't get due to other issues anyway), or have weird
  shutdown scripts which stop when they are unable to umount
  filesystems.
 
 Weird shutdown scripts? I believe all shutdown scripts have this issue
 -- if you want to [cleanly] unmount your / filesystem, you need all
 the opens for write closed, right...? Which self-deadlocked fused
 holding files open will prevent.
 
  For paranoid sysadmins, I suggest not enabling fuse for unprivileged
  users, which is pretty easy to do: just don't set /dev/fuse to be
  world read-writable (which is the default BTW).
  
  So your reasons just don't warrant a big effort involving VFS hacking,
  etc.  Patches are of course welcome.
 
 Well, I believe code with obscure, but almost impossible to fix
 problems should not be merged...

That code _has_ been merged, something like 3 years ago, and is doing
fine, thank you.

The unprivileged mounts code, which we should be discussing, doesn't
change anything about that, except to not require another suid-root
utility.  Many distributions enabling unprivileged mounting by default
_now_, so it's not as if there's some great danger in doing this
slightly differently.

 Anyway, I believe it would be fair to mention kill -9 no longer
 working and shutdown/hibernation/multiuser problems it implies in the
 changelogs and probably sysctl documentation or how is this enabled.

Sure.

Miklos
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFD] Incremental fsck

2008-01-09 Thread Rik van Riel
On Wed, 9 Jan 2008 14:52:14 +0300
Al Boldi [EMAIL PROTECTED] wrote:

 Ok, but let's look at this a bit more opportunistic / optimistic.

You can't play fast and loose with data integrity.

Besides, if we looked at things optimistically, we would conclude
that no fsck will be needed, ever :)
 
  http://infohost.nmt.edu/~val/review/chunkfs.pdf

You will really want to read this paper, if you haven't already.

-- 
All Rights Reversed
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2008-01-09 Thread Indan Zupancic
Hello,

On Wed, January 9, 2008 05:39, Tetsuo Handa wrote:
 Hello.

 Indan Zupancic wrote:
 I think you focus too much on your way of enforcing filename/attributes
 pairs.
 So?

So that you miss alternatives and don't see the bigger picture.


 The same can be achieved by creating the device nodes with
 expected attributes, and preventing processes from changing those files.
 The device nodes have to be deletable if some process (including udev) needs
 to delete.
 Thus, you cannot unconditionally prevent processes from changing those files.

 This because expected combinations are known beforehand.
 Yes.

 And once those files are present, the MAC system used doesn't have to have
 special
 device nodes attributes support. Protecting those files is enough to
 guarantee filename/attributes pairs.
 If MAC system needn't to support this filesystem's functionality,
 who creates those files with warrantee of expected attributes? The udev does?
 If udev is exploited, who can guarantee?

The person that would write the config file for your fs, the one who wants
that guarantee.


 No, this is because rename permission was given for files that it shouldn't
 had.
 Do you think all MAC implementation have the same granularity and
 functionalities?
 I don't think so. Not all MAC implementation can control with such
 granularity.
 This filesystem is designed to be combined with any MAC,
 although the MAC used with this filesystem should be able to restrict
 namespace manipulation requests so that this filesystem can remain /dev
 and visible to userland applications.

Good point, but I assume they all have at least a directory granularity, and 
then
/dev/ can be static and udev and other can have free reign in e.g. 
/dev/dynamic/.
Just use subdirs for the dynamic stuff and this granularity problem is, with
slight
inconvenience, solved.


 Either you want a process to manage device names and attributes, and then
 you
 give it permission to do that, or you want to enforce certain
 filename/attribute
 pairs and then you just do it yourself.
 If I modify udev to enforce certain filename/attribute pairs and the modified
 udev
 was exploited, who can guarantee?
 Don't trust userland application is the basis of restricting access in
 kernel space.
 If you can trust userland application, you don't need in-kernel access
 control.

Funny, I thought that it was in the kernel because that's the way to protect
processes against eachother, the fs against processes, and for performance
reasons.

Exploits are in code, and where that code is doesn't matter that much, either
kernel or userspace, though if it's exploitable you'll rather not have it in the
kernel. So I think it's more secure if the checking would be done by udev than
in a special filesystem, even if that means that you're screwed if udev is
exploited. Of course you fully trust your own code, naturally.

A tiny daemon that communicates with udev and does the checking you have
now, and if ok it creates the node is really not much more code than your fs,
so as hard to exploit too. Then if udev is hacked you have the same guarantee
as you have now.

I can think of more alternatives that are as secure or more secure than the
current solution.



 Will your filesystem prevent the trivial case of

 rm /dev/hda1
 ln -s /dev/hda2 /dev/hda1

 Of course. To permit the above operation, the following permissions are
 needed.

   hda1660 0   6   2   b   3   1
   hda1777 0   0  33   l   .

Yes, I should've read the code before asking that, instead of the other way
round.


 Rename permission can be given for /dev in general, but prohibited for
 certain files in /dev, the ones you want to have specific attributes.
 It isn't all or nothing.
 Do you think all MAC implementation can prohibit renaming for certain files in
 /dev ?

 It's forbid modifying certain nodes that process needn't to modify
 versus forbid breaking filename/attribute pairs of certain nodes.

 Both have the same effect, except that the first one is generic and
 can be done by existing MAC systems, while the second one needs
 a special filesystem and a handful of MAC rules to make it effective.
 Do you think all MAC implementation can do?
 I think the first one is implementation specific and the second one is
 generic.

Protecting certain files from being modified seems to me more generic than
enforcing filename/attributes pairs on device nodes. And if they can't do it
surely they can do it per directory, and the using subdirs solves it.


 It doesn't matter where they are, it's that a different fs than yours could
 be
 mounted over it. You say a MAC can prevent that from happening, but a
 MAC can also prevent all processes except for udev from modifying /dev.
 But MAC cannot prevent udev from modifying /dev . And what if exploited?
 Not all MAC can enforce access control over all processes with the granularity
 you are talking. And what if a process that cannot be 

Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Andi Kleen
Jens Axboe [EMAIL PROTECTED] writes:

 So how does it work? Instead of punting IO to a thread and passing it
 through the page cache, we instead attempt to send the IO directly to the

Great -- something like this was needed for a long time.

 - The file block mappings must not change while loop is using the file.
   This means that we have to ensure exclusive access to the file and
   this is the bit that is currently missing in the implementation. It
   would be nice if we could just do this via open(), ideas welcome...

get_write_access()/put_write_access() will block other writers.

But as pointed out by others that is not enough for this.

I suppose you could use a white list like a special flag for file systems 
(like ext2/ext3) that do not reallocate blocks.

-Andi
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Alasdair G Kergon
Here's the latest version of dm-loop, for comparison.

To try it out, 
  ln -s dmsetup dmlosetup
and supply similar basic parameters to losetup.
(using dmsetup version 1.02.11 or higher)

Alasdair

From: Bryn Reeves [EMAIL PROTECTED]

This implements a loopback target for device mapper allowing a regular
file to be treated as a block device.

Signed-off-by: Bryn Reeves [EMAIL PROTECTED]
Signed-off-by: Alasdair G Kergon [EMAIL PROTECTED]

---
 drivers/md/Kconfig   |9 
 drivers/md/Makefile  |1 
 drivers/md/dm-loop.c | 1036 +++
 3 files changed, 1046 insertions(+)

Index: linux-2.6.24-rc6/drivers/md/Kconfig
===
--- linux-2.6.24-rc6.orig/drivers/md/Kconfig2008-01-07 18:32:05.0 
+
+++ linux-2.6.24-rc6/drivers/md/Kconfig 2008-01-07 18:33:12.0 +
@@ -229,6 +229,15 @@ config DM_CRYPT
 
  If unsure, say N.
 
+config DM_LOOP
+   tristate Loop target (EXPERIMENTAL)
+   depends on BLK_DEV_DM  EXPERIMENTAL
+   ---help---
+ This device-mapper target allows you to treat a regular file as
+ a block device.
+
+ If unsure, say N.
+
 config DM_SNAPSHOT
tristate Snapshot target (EXPERIMENTAL)
depends on BLK_DEV_DM  EXPERIMENTAL
Index: linux-2.6.24-rc6/drivers/md/Makefile
===
--- linux-2.6.24-rc6.orig/drivers/md/Makefile   2008-01-07 18:32:05.0 
+
+++ linux-2.6.24-rc6/drivers/md/Makefile2008-01-07 18:33:12.0 
+
@@ -34,6 +34,7 @@ obj-$(CONFIG_BLK_DEV_MD)  += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM)   += dm-mod.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
 obj-$(CONFIG_DM_DELAY) += dm-delay.o
+obj-$(CONFIG_DM_LOOP)  += dm-loop.o
 obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_EMC) += dm-emc.o
 obj-$(CONFIG_DM_MULTIPATH_HP)  += dm-hp-sw.o
Index: linux-2.6.24-rc6/drivers/md/dm-loop.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ linux-2.6.24-rc6/drivers/md/dm-loop.c   2008-01-07 18:33:12.0 
+
@@ -0,0 +1,1036 @@
+/*
+ * Copyright (C) 2006-2008 Red Hat, Inc. All rights reserved.
+ *
+ * This file is part of device-mapper.
+ *
+ * Extent mapping implementation heavily influenced by mm/swapfile.c
+ * Bryn Reeves [EMAIL PROTECTED]
+ *
+ * File mapping and block lookup algorithms support by
+ * Heinz Mauelshagen [EMAIL PROTECTED].
+ *
+ * This file is released under the GPL.
+ */
+
+#include linux/kernel.h
+#include linux/slab.h
+#include linux/fs.h
+#include linux/module.h
+#include linux/vmalloc.h
+#include linux/syscalls.h
+#include linux/workqueue.h
+#include linux/file.h
+#include linux/bio.h
+
+#include dm.h
+#include dm-bio-list.h
+
+#define DM_LOOP_DAEMON kloopd
+#define DM_MSG_PREFIX loop
+
+enum flags { DM_LOOP_BMAP, DM_LOOP_FSIO };
+
+/*
+ * Loop context
+ **/
+
+struct loop_c {
+   unsigned long flags;
+
+   /* Backing store */
+
+   struct file *filp;
+   char *path;
+   loff_t offset;
+   struct block_device *bdev;
+   unsigned blkbits;   /* file system block size shift bits */
+
+   loff_t size;/* size of entire file in bytes */
+   loff_t blocks;  /* blocks allocated to loop file */
+   sector_t mapped_sectors;/* size of mapped area in sectors */
+
+   int (*map_fn)(struct dm_target *, struct bio *);
+   void *map_data;
+};
+
+/*
+ * Block map extent
+ */
+struct dm_loop_extent {
+   sector_t start; /* start sector in mapped device */
+   sector_t to;/* start sector on target device */
+   sector_t len;   /* length in sectors */
+};
+
+/*
+ * Temporary extent list
+ */
+struct extent_list {
+   struct dm_loop_extent *extent;
+   struct list_head list;
+};
+
+static struct kmem_cache *dm_loop_extent_cache;
+
+/*
+ * Block map private context
+ */
+struct block_map_c {
+   int nr_extents; /* number of extents in map */
+   struct dm_loop_extent **map;/* linear map of extent pointers */
+   struct dm_loop_extent **mru;/* pointer to mru entry */
+   spinlock_t mru_lock;/* protects mru */
+};
+
+/*
+ * File map private context
+ */
+struct file_map_c {
+   spinlock_t lock;/* protects in */
+   struct bio_list in; /* new bios for processing */
+   struct bio_list work;   /* bios queued for processing */
+   struct workqueue_struct *wq;/* workqueue */
+   struct work_struct ws;  /* loop work */
+   struct loop_c *loop;/* for filp  

Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2008-01-09 Thread Serge E. Hallyn
Quoting Indan Zupancic ([EMAIL PROTECTED]):
 Hello,
 
 On Wed, January 9, 2008 05:39, Tetsuo Handa wrote:
  Hello.
 
  Indan Zupancic wrote:
  I think you focus too much on your way of enforcing filename/attributes
  pairs.
  So?
 
 So that you miss alternatives and don't see the bigger picture.

These emails again are getting really long, but I think the gist of
Indan's suggestion can be concisely summarized:

To confine process P3 to /dev/hda2 being 'b 3 2', create
/dev/p3, launch P3 in a new mounts namespace, mount --bind
/dev/p3 /dev, exec what you want p3 running, and have
MAC prevent umount /dev/p3.

This is a neat idea, but Tetsuo's rebutall is

P3 may be legacy code needing to create or delete
/dev/floppy, where -EPERM confuses P3 and prevents
it working correctly.

Indan's idea is interesting and I like it, but is there an answer to
Tetsuo's problem with it?

thanks,
-serge

PS - Indan, you also said in essence if P3 can be trusted to create
/dev/floppy why can't it be trusted to create /dev/hda1.  I trust that,
phrased that way, the question answers itself?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Alasdair G Kergon
On Thu, Jan 10, 2008 at 12:43:19AM +0100, [EMAIL PROTECTED] wrote:
 oh, nice to see that this is still alive.

 at least i got no crashes and was able to mount and acess more than 300 
 iso-images with that.

 were there fixes/chances since then?

Little has changed for some time - mostly code cleanups and the occasional bug 
fix.

It's time to give it wider exposure, I think, and we'll find out how well it 
holds up.

Alasdair
-- 
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2008-01-09 Thread Indan Zupancic
On Thu, January 10, 2008 00:08, Serge E. Hallyn wrote:
 These emails again are getting really long, but I think the gist of
 Indan's suggestion can be concisely summarized:

No worry, I wasn't planning on extending it, I've said what I've to say.

Except...


   To confine process P3 to /dev/hda2 being 'b 3 2', create
   /dev/p3, launch P3 in a new mounts namespace, mount --bind
   /dev/p3 /dev, exec what you want p3 running, and have
   MAC prevent umount /dev/p3.

 This is a neat idea, but Tetsuo's rebutall is

   P3 may be legacy code needing to create or delete
   /dev/floppy, where -EPERM confuses P3 and prevents
   it working correctly.

 Indan's idea is interesting and I like it, but is there an answer to
 Tetsuo's problem with it?

...that I didn't mean that, but a more simple

/dev/ directory protected from any modifications by MAC,

/dev/* all the nodes that need to have guaranteed name/attribute pairs,
like /dev/null, /dev/zero, /dev/random, etc. and:

/dev/dynamic/ being a dir where apps who really need to create/modify
device nodes can do whatever they want to do. It can be multiple dirs
too, like /dev/snd/, /dev/input/ etc.

I guess this covers about 96% of the usecases of this tamper-proof dev fs.

You can think of unlikely cases that aren't solved by this, but those can
be solved in another way if really wanted (like a checking daemon,
modified udev, shadow /dev/, to name a few).

But I think doing more is getting ridiculous, because if a process can
create a device node, it can also access it and do whatever harm could
be done by the confusion caused by unexpected name/attribute pairs.

As for information snooping, that's mostly about /dev/null or other
things that are known beforehand.

 PS - Indan, you also said in essence if P3 can be trusted to create
 /dev/floppy why can't it be trusted to create /dev/hda1.  I trust that,
 phrased that way, the question answers itself?

Not exactly. If there's a process that dynamically created certain device
nodes, and it wants to create one that doesn't fit the rules, you can't
know if it's wrong or if your rules are wrong. The process has a certain
policy of naming/creating the devices, but you also have a policy at the
kernel side with this fs. If it mismatches you don't know which one is
right.

If you trust a process to create /dev/hd*, you can also trust it to create
the proper /dev/hdXn, no need to verify if /dev/hda1 is really 3 1.

The whole thing about filename/attribute pairs is that it's about what
applications expect. There aren't many expectations about dynamically
created device nodes which might not always be there, because their
name isn't stable.

The use case for this fs is a malicious app that can create device nodes,
and we're worried about mismatching name/attribute pairs. Not about
our data, or anything else. Call me an optimist, but I think you don't
need to worry about name/attribute pairs.

Greetings,

Indan


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH][RFC] fast file mapping for loop

2008-01-09 Thread Nick Piggin
On Wednesday 09 January 2008 19:52, Jens Axboe wrote:

 So how does it work? Instead of punting IO to a thread and passing it
 through the page cache, we instead attempt to send the IO directly to the
 filesystem block that it maps to.

You told Christoph that just using direct-IO from kernel still doesn't
give you the required behaviour... What about queueing the IO directly
*and* using direct-IO? I guess it still has to go through the underlying
filesystem, but that's probably a good thing.

 loop maintains a prio tree of known 
 extents in the file (populated lazily on demand, as needed).

Just a quick question (I haven't looked closely at the code): how come
you are using a prio tree for extents? I don't think they could be
overlapping?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[GIT PULL -mm] 0/4 Unionfs updates/fixes/cleanups

2008-01-09 Thread Erez Zadok

The following is a series of patchsets related to Unionfs.  This is the
fourth set of patchsets resulting from an lkml review of the entire unionfs
code base, in preparation for a merge into mainline.  The most significant
changes here are a few locking/race bugfix related to branch-management.

These patches were tested (where appropriate) on Linus's 2.6.24 latest code
(as of v2.6.24-rc7-71-gfd0b45d), MM, as well as the backports to
2.6.{23,22,21,20,19,18,9} on ext2/3/4, xfs, reiserfs, nfs2/3/4, jffs2,
ramfs, tmpfs, cramfs, and squashfs (where available).  Also tested with
LTP-full and with a continuous parallel kernel compile (while forcing cache
flushing, manipulating lower branches, etc.).  See
http://unionfs.filesystems.org/ to download back-ported unionfs code.

Please pull from the 'master' branch of
git://git.kernel.org/pub/scm/linux/kernel/git/ezk/unionfs.git

to receive the following:

Erez Zadok (4):
  Unionfs: merged several printk KERN_CONT together into one pr_debug
  Unionfs: mmap fixes
  Unionfs: branch-management related locking fixes
  Unionfs: ensure we have lower dentries in d_iput

 commonfops.c |6 ++
 debug.c  |   51 +--
 dentry.c |9 +++--
 inode.c  |   17 +
 mmap.c   |   26 +-
 5 files changed, 76 insertions(+), 33 deletions(-)

---
Erez Zadok
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] Unionfs: branch-management related locking fixes

2008-01-09 Thread Erez Zadok
Add necessary locking to dentry/inode branch-configuration, so we get
consistent values during branch-management actions.  In d_revalidate_chain,
-permission, and -create, also lock parent dentry.

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/unionfs/commonfops.c |6 ++
 fs/unionfs/dentry.c |6 +-
 fs/unionfs/inode.c  |   17 +
 3 files changed, 28 insertions(+), 1 deletions(-)

diff --git a/fs/unionfs/commonfops.c b/fs/unionfs/commonfops.c
index 2c32ada..f37192f 100644
--- a/fs/unionfs/commonfops.c
+++ b/fs/unionfs/commonfops.c
@@ -318,6 +318,7 @@ int unionfs_file_revalidate(struct file *file, bool 
willwrite)
 * First revalidate the dentry inside struct file,
 * but not unhashed dentries.
 */
+reval_dentry:
if (unlikely(!d_deleted(dentry) 
 !__unionfs_d_revalidate_chain(dentry, NULL, willwrite))) {
err = -ESTALE;
@@ -328,6 +329,11 @@ int unionfs_file_revalidate(struct file *file, bool 
willwrite)
dgen = atomic_read(UNIONFS_D(dentry)-generation);
fgen = atomic_read(UNIONFS_F(file)-generation);
 
+   if (unlikely(sbgen  dgen)) {
+   pr_debug(unionfs: retry dentry revalidation\n);
+   schedule();
+   goto reval_dentry;
+   }
BUG_ON(sbgen  dgen);
 
/*
diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
index 7646828..d969640 100644
--- a/fs/unionfs/dentry.c
+++ b/fs/unionfs/dentry.c
@@ -203,7 +203,7 @@ bool is_newer_lower(const struct dentry *dentry)
if (!dentry || !UNIONFS_D(dentry))
return false;
inode = dentry-d_inode;
-   if (!inode || !UNIONFS_I(inode) ||
+   if (!inode || !UNIONFS_I(inode)-lower_inodes ||
ibstart(inode)  0 || ibend(inode)  0)
return false;
 
@@ -295,6 +295,8 @@ bool __unionfs_d_revalidate_chain(struct dentry *dentry, 
struct nameidata *nd,
chain_len = 0;
sbgen = atomic_read(UNIONFS_SB(dentry-d_sb)-generation);
dtmp = dentry-d_parent;
+   if (dentry != dtmp)
+   unionfs_lock_dentry(dtmp, UNIONFS_DMUTEX_REVAL_PARENT);
dgen = atomic_read(UNIONFS_D(dtmp)-generation);
/* XXX: should we check if is_newer_lower all the way up? */
if (unlikely(is_newer_lower(dtmp))) {
@@ -315,6 +317,8 @@ bool __unionfs_d_revalidate_chain(struct dentry *dentry, 
struct nameidata *nd,
}
purge_inode_data(dtmp-d_inode);
}
+   if (dentry != dtmp)
+   unionfs_unlock_dentry(dtmp);
while (sbgen != dgen) {
/* The root entry should always be valid */
BUG_ON(IS_ROOT(dtmp));
diff --git a/fs/unionfs/inode.c b/fs/unionfs/inode.c
index 6095c4f..e15ddb9 100644
--- a/fs/unionfs/inode.c
+++ b/fs/unionfs/inode.c
@@ -30,6 +30,13 @@ static int unionfs_create(struct inode *parent, struct 
dentry *dentry,
struct nameidata lower_nd;
 
unionfs_read_lock(dentry-d_sb, UNIONFS_SMUTEX_CHILD);
+   unionfs_lock_dentry(dentry-d_parent, UNIONFS_DMUTEX_PARENT);
+   valid = __unionfs_d_revalidate_chain(dentry-d_parent, nd, false);
+   unionfs_unlock_dentry(dentry-d_parent);
+   if (unlikely(!valid)) {
+   err = -ESTALE;  /* same as what real_lookup does */
+   goto out;
+   }
unionfs_lock_dentry(dentry, UNIONFS_DMUTEX_CHILD);
 
valid = __unionfs_d_revalidate_chain(dentry, nd, false);
@@ -936,6 +943,14 @@ static int unionfs_permission(struct inode *inode, int 
mask,
const int is_file = !S_ISDIR(inode-i_mode);
const int write_mask = (mask  MAY_WRITE)  !(mask  MAY_READ);
 
+   if (nd)
+   unionfs_lock_dentry(nd-dentry, UNIONFS_DMUTEX_CHILD);
+
+   if (!UNIONFS_I(inode)-lower_inodes) {
+   if (is_file)/* dirs can be unlinked but chdir'ed to */
+   err = -ESTALE;  /* force revalidate */
+   goto out;
+   }
bstart = ibstart(inode);
bend = ibend(inode);
if (unlikely(bstart  0 || bend  0)) {
@@ -1003,6 +1018,8 @@ static int unionfs_permission(struct inode *inode, int 
mask,
 out:
unionfs_check_inode(inode);
unionfs_check_nd(nd);
+   if (nd)
+   unionfs_unlock_dentry(nd-dentry);
return err;
 }
 
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] Unionfs: mmap fixes

2008-01-09 Thread Erez Zadok
Ensure we have lower inodes in prepare/commit_write.

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/unionfs/mmap.c |   26 +-
 1 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/fs/unionfs/mmap.c b/fs/unionfs/mmap.c
index a0e654b..ad770ac 100644
--- a/fs/unionfs/mmap.c
+++ b/fs/unionfs/mmap.c
@@ -224,13 +224,26 @@ out:
 static int unionfs_prepare_write(struct file *file, struct page *page,
 unsigned from, unsigned to)
 {
+   int err;
+
+   unionfs_read_lock(file-f_path.dentry-d_sb, UNIONFS_SMUTEX_PARENT);
/*
-* Just copy lower inode attributes and return success.  Not much
-* else to do here.  No need to lock either (lockdep won't like it).
-* Let commit_write do all the hard work instead.
+* This is the only place where we unconditionally copy the lower
+* attribute times before calling unionfs_file_revalidate.  The
+* reason is that our -write calls do_sync_write which in turn will
+* call our -prepare_write and then -commit_write.  Before our
+* -write is called, the lower mtimes are in sync, but by the time
+* the VFS calls our -commit_write, the lower mtimes have changed.
+* Therefore, the only reasonable time for us to sync up from the
+* changed lower mtimes, and avoid an invariant violation warning,
+* is here, in -prepare_write.
 */
unionfs_copy_attr_times(file-f_path.dentry-d_inode);
-   return 0;
+   err = unionfs_file_revalidate(file, true);
+   unionfs_check_file(file);
+   unionfs_read_unlock(file-f_path.dentry-d_sb);
+
+   return err;
 }
 
 static int unionfs_commit_write(struct file *file, struct page *page,
@@ -252,7 +265,6 @@ static int unionfs_commit_write(struct file *file, struct 
page *page,
unionfs_check_file(file);
 
inode = page-mapping-host;
-   lower_inode = unionfs_lower_inode(inode);
 
if (UNIONFS_F(file) != NULL)
lower_file = unionfs_lower_file(file);
@@ -282,6 +294,10 @@ static int unionfs_commit_write(struct file *file, struct 
page *page,
goto out;
 
/* if vfs_write succeeded above, sync up our times/sizes */
+   lower_inode = lower_file-f_path.dentry-d_inode;
+   if (!lower_inode)
+   lower_inode = unionfs_lower_inode(inode);
+   BUG_ON(!lower_inode);
fsstack_copy_inode_size(inode, lower_inode);
unionfs_copy_attr_times(inode);
mark_inode_dirty_sync(inode);
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] Unionfs: ensure we have lower dentries in d_iput

2008-01-09 Thread Erez Zadok
Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/unionfs/dentry.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/fs/unionfs/dentry.c b/fs/unionfs/dentry.c
index d969640..cd15243 100644
--- a/fs/unionfs/dentry.c
+++ b/fs/unionfs/dentry.c
@@ -507,9 +507,10 @@ static void unionfs_d_iput(struct dentry *dentry, struct 
inode *inode)
 {
int bindex, rc;
 
+   BUG_ON(!dentry);
unionfs_read_lock(dentry-d_sb, UNIONFS_SMUTEX_CHILD);
 
-   if (dbstart(dentry)  0)
+   if (!UNIONFS_D(dentry) || dbstart(dentry)  0)
goto drop_lower_inodes;
for (bindex = dbstart(dentry); bindex = dbend(dentry); bindex++) {
if (unionfs_lower_mnt_idx(dentry, bindex)) {
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] Unionfs: merged several printk KERN_CONT together into one pr_debug

2008-01-09 Thread Erez Zadok
CC: Joe Perches [EMAIL PROTECTED]

Signed-off-by: Erez Zadok [EMAIL PROTECTED]
---
 fs/unionfs/debug.c |   51 +--
 1 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/unionfs/debug.c b/fs/unionfs/debug.c
index 5f1d887..d154c32 100644
--- a/fs/unionfs/debug.c
+++ b/fs/unionfs/debug.c
@@ -472,16 +472,16 @@ void __show_inode_times(const struct inode *inode,
lower_inode = unionfs_lower_inode_idx(inode, bindex);
if (unlikely(!lower_inode))
continue;
-   pr_debug(IT(%lu:%d): , inode-i_ino, bindex);
-   printk(KERN_CONT %s:%s:%d , file, fxn, line);
-   printk(KERN_CONT um=%lu/%lu lm=%lu/%lu ,
-  inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec,
-  lower_inode-i_mtime.tv_sec,
-  lower_inode-i_mtime.tv_nsec);
-   printk(KERN_CONT uc=%lu/%lu lc=%lu/%lu\n,
-  inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec,
-  lower_inode-i_ctime.tv_sec,
-  lower_inode-i_ctime.tv_nsec);
+   pr_debug(IT(%lu:%d): %s:%s:%d 
+um=%lu/%lu lm=%lu/%lu uc=%lu/%lu lc=%lu/%lu\n,
+inode-i_ino, bindex,
+file, fxn, line,
+inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec,
+lower_inode-i_mtime.tv_sec,
+lower_inode-i_mtime.tv_nsec,
+inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec,
+lower_inode-i_ctime.tv_sec,
+lower_inode-i_ctime.tv_nsec);
}
 }
 
@@ -496,17 +496,16 @@ void __show_dinode_times(const struct dentry *dentry,
lower_inode = unionfs_lower_inode_idx(inode, bindex);
if (!lower_inode)
continue;
-   pr_debug(DT(%s:%lu:%d): , dentry-d_name.name, inode-i_ino,
-bindex);
-   printk(KERN_CONT %s:%s:%d , file, fxn, line);
-   printk(KERN_CONT um=%lu/%lu lm=%lu/%lu ,
-  inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec,
-  lower_inode-i_mtime.tv_sec,
-  lower_inode-i_mtime.tv_nsec);
-   printk(KERN_CONT uc=%lu/%lu lc=%lu/%lu\n,
-  inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec,
-  lower_inode-i_ctime.tv_sec,
-  lower_inode-i_ctime.tv_nsec);
+   pr_debug(DT(%s:%lu:%d): %s:%s:%d 
+um=%lu/%lu lm=%lu/%lu uc=%lu/%lu lc=%lu/%lu\n,
+dentry-d_name.name, inode-i_ino, bindex,
+file, fxn, line,
+inode-i_mtime.tv_sec, inode-i_mtime.tv_nsec,
+lower_inode-i_mtime.tv_sec,
+lower_inode-i_mtime.tv_nsec,
+inode-i_ctime.tv_sec, inode-i_ctime.tv_nsec,
+lower_inode-i_ctime.tv_sec,
+lower_inode-i_ctime.tv_nsec);
}
 }
 
@@ -525,10 +524,10 @@ void __show_inode_counts(const struct inode *inode,
lower_inode = unionfs_lower_inode_idx(inode, bindex);
if (unlikely(!lower_inode))
continue;
-   printk(KERN_CONT SIC(%lu:%d:%d): , inode-i_ino, bindex,
-  atomic_read((inode)-i_count));
-   printk(KERN_CONT lc=%d ,
-  atomic_read((lower_inode)-i_count));
-   printk(KERN_CONT %s:%s:%d\n, file, fxn, line);
+   pr_debug(SIC(%lu:%d:%d): lc=%d %s:%s:%d\n,
+inode-i_ino, bindex,
+atomic_read((inode)-i_count),
+atomic_read((lower_inode)-i_count),
+file, fxn, line);
}
 }
-- 
1.5.2.2

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 5/9] unprivileged mounts: allow unprivileged bind mounts

2008-01-09 Thread Serge E. Hallyn
Quoting Miklos Szeredi ([EMAIL PROTECTED]):
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Allow bind mounts to unprivileged users if the following conditions are met:
 
   - mountpoint is not a symlink
   - parent mount is owned by the user
   - the number of user mounts is below the maximum
 
 Unprivileged mounts imply MS_SETUSER, and will also have the nosuid and
 nodev mount flags set.
 
 In particular, if mounting process doesn't have CAP_SETUID capability,
 then the nosuid flag will be added, and if it doesn't have CAP_MKNOD
 capability, then the nodev flag will be added.

That little part by itself is really needed in order to make the ability
to remove CAP_MKNOD from a process tree's bounding set meaningful.
Else instead of creating /dev/hda1, the user can just mount a filesystem
with hda1 existing on it.  (Which is why I was surprised when one day
I found this code missing :)

But of course I'm a fan of the patchset altogether.  I plan to review in
more detail early next week, but since I liked the previous submission I
don't see myself having any complaints, so I'm glad to see the reviews
by others.

thanks,
-serge

 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 ---
 
 Index: linux/fs/namespace.c
 ===
 --- linux.orig/fs/namespace.c 2008-01-04 13:47:49.0 +0100
 +++ linux/fs/namespace.c  2008-01-04 13:48:01.0 +0100
 @@ -487,11 +487,34 @@ static void dec_nr_user_mounts(void)
   spin_unlock(vfsmount_lock);
  }
 
 -static void set_mnt_user(struct vfsmount *mnt)
 +static int reserve_user_mount(void)
 +{
 + int err = 0;
 +
 + spin_lock(vfsmount_lock);
 + if (nr_user_mounts = max_user_mounts  !capable(CAP_SYS_ADMIN))
 + err = -EPERM;
 + else
 + nr_user_mounts++;
 + spin_unlock(vfsmount_lock);
 + return err;
 +}
 +
 +static void __set_mnt_user(struct vfsmount *mnt)
  {
   BUG_ON(mnt-mnt_flags  MNT_USER);
   mnt-mnt_uid = current-fsuid;
   mnt-mnt_flags |= MNT_USER;
 +
 + if (!capable(CAP_SETUID))
 + mnt-mnt_flags |= MNT_NOSUID;
 + if (!capable(CAP_MKNOD))
 + mnt-mnt_flags |= MNT_NODEV;
 +}
 +
 +static void set_mnt_user(struct vfsmount *mnt)
 +{
 + __set_mnt_user(mnt);
   spin_lock(vfsmount_lock);
   nr_user_mounts++;
   spin_unlock(vfsmount_lock);
 @@ -510,10 +533,16 @@ static struct vfsmount *clone_mnt(struct
   int flag)
  {
   struct super_block *sb = old-mnt_sb;
 - struct vfsmount *mnt = alloc_vfsmnt(old-mnt_devname);
 + struct vfsmount *mnt;
 
 + if (flag  CL_SETUSER) {
 + int err = reserve_user_mount();
 + if (err)
 + return ERR_PTR(err);
 + }
 + mnt = alloc_vfsmnt(old-mnt_devname);
   if (!mnt)
 - return ERR_PTR(-ENOMEM);
 + goto alloc_failed;
 
   mnt-mnt_flags = old-mnt_flags;
   atomic_inc(sb-s_active);
 @@ -525,7 +554,7 @@ static struct vfsmount *clone_mnt(struct
   /* don't copy the MNT_USER flag */
   mnt-mnt_flags = ~MNT_USER;
   if (flag  CL_SETUSER)
 - set_mnt_user(mnt);
 + __set_mnt_user(mnt);
 
   if (flag  CL_SLAVE) {
   list_add(mnt-mnt_slave, old-mnt_slave_list);
 @@ -550,6 +579,11 @@ static struct vfsmount *clone_mnt(struct
   spin_unlock(vfsmount_lock);
   }
   return mnt;
 +
 + alloc_failed:
 + if (flag  CL_SETUSER)
 + dec_nr_user_mounts();
 + return ERR_PTR(-ENOMEM);
  }
 
  static inline void __mntput(struct vfsmount *mnt)
 @@ -986,22 +1020,26 @@ asmlinkage long sys_oldumount(char __use
 
  #endif
 
 -static int mount_is_safe(struct nameidata *nd)
 +/*
 + * Conditions for unprivileged mounts are:
 + * - mountpoint is not a symlink
 + * - mountpoint is in a mount owned by the user
 + */
 +static bool permit_mount(struct nameidata *nd, int *flags)
  {
 + struct inode *inode = nd-path.dentry-d_inode;
 +
   if (capable(CAP_SYS_ADMIN))
 - return 0;
 - return -EPERM;
 -#ifdef notyet
 - if (S_ISLNK(nd-path.dentry-d_inode-i_mode))
 - return -EPERM;
 - if (nd-path.dentry-d_inode-i_mode  S_ISVTX) {
 - if (current-uid != nd-path.dentry-d_inode-i_uid)
 - return -EPERM;
 - }
 - if (vfs_permission(nd, MAY_WRITE))
 - return -EPERM;
 - return 0;
 -#endif
 + return true;
 +
 + if (S_ISLNK(inode-i_mode))
 + return false;
 +
 + if (!is_mount_owner(nd-path.mnt, current-fsuid))
 + return false;
 +
 + *flags |= MS_SETUSER;
 + return true;
  }
 
  static int lives_below_in_same_fs(struct dentry *d, struct dentry *dentry)
 @@ -1245,9 +1283,10 @@ static int do_loopback(struct nameidata 
   int clone_fl;
   struct nameidata old_nd;
   struct vfsmount *mnt = NULL;
 - int err = mount_is_safe(nd);
 - if (err)
 - 

Re: [PATCH][RFC] Simple tamper-proof device filesystem.

2008-01-09 Thread Tetsuo Handa
Hello.

Indan Zupancic wrote:
 Good point, but I assume they all have at least a directory granularity, and 
 then
 /dev/ can be static and udev and other can have free reign in e.g. 
 /dev/dynamic/.
 Just use subdirs for the dynamic stuff and this granularity problem is, with
 slight inconvenience, solved.

It seems to me that the alternatives you are proposing include modification of
userland applications. But my assumption is that
Don't require modification of userland applications.
In other words, I want to implement without asking applications
to use /dev/dynamic/ or something.
This filesystem is intended to provide support for legacy applications.
(In fact, this filesystem in TOMOYO Linux is for kernel 2.4.30/2.6.11 and 
later.)



 Exploits are in code, and where that code is doesn't matter that much, either
 kernel or userspace, though if it's exploitable you'll rather not have it in 
 the
 kernel. So I think it's more secure if the checking would be done by udev than
 in a special filesystem, even if that means that you're screwed if udev is
 exploited. Of course you fully trust your own code, naturally.

I'm keeping the mechanism as simple as possible
so that there is unlikely room (e.g. buffer overflow) for running exploits.



 A tiny daemon that communicates with udev and does the checking you have
 now, and if ok it creates the node is really not much more code than your fs,
 so as hard to exploit too. Then if udev is hacked you have the same guarantee
 as you have now.

Use of a tiny daemon that communicates with udev is not sufficient.
The udev is not the only application that modifies /dev files.
At least, the tiny daemon should communicate with the kernel
so that all requests are checked by the tiny daemon.
But use of the tiny daemon (which is a process running in userland)
causes a lot of troubles.
See the block after the -- boundary -- of this posting.

My assumption is that Don't require userland process's assistance,
as written at Why not use FUSE?.



 Protecting certain files from being modified seems to me more generic than
 enforcing filename/attributes pairs on device nodes. 
OK. You are saying that from the point of view of what it can.
I thought you were saying enforcing filename/attributes pairs
from out-of-this-filesystem (e.g. MAC) is more flexible than this-filesystem.



 rm -f /dev/either-null-or-zero
 
 as said before, if this is possible then the MAC config used is wrong. Exactly
 the same as for your filesystem with
 
 mknod /dev/tmp1 c 1 X
 mount --bind /dev/tmp1 /dev/either-null-or-zero
 
 and you count on the MAC to prevent that.

An administrator asks MAC to prevent processes
(except specific processes who need to do rm -f /dev/either-null-or-zero)
from doing rm -f /dev/either-null-or-zero.

An administrator asks this filesystem to prevent processes from doing
mknod /dev/tmp1 c 1 X.

An administrator asks MAC to prevent processes from doing
mount --bind /dev/tmp1 /dev/either-null-or-zero.



 And as for that app, if you trust it to create device nodes, why don't you
 trust it to make the right nodes too?

If that app has a bug that triggers
  mknod /dev/either-null-or-zero 1$REPLY
instead of
  mknod /dev/either-null-or-zero $REPLY
under an unexpected circumstance, it will create unwanted nodes.
Thus I don't trust the app.



 If an administrator wants something else than
 3 or 5, you're breaking something.
That's the fate of white-list based access control.

Does this filesystem sound too strict to support dynamic device?
May be this filesystem should be able to permit creation of device nodes
that are not listed in the policy file.



  Can SELinux guarantee the same result as my filesystem even if udev or
  administrative programs have to be able to modify /dev ?
 
 More, because your filesystem doesn't guarantee anything at all on its own.
 But assuming the MAC is decent enough to protect your fs from being bypassed,
 I'm sure it can do what's needed fine without your fs. I can't answer for 
 SELinux
 because I don't know it well. But I trust it can protect files and/or
 directories, and that's all that's needed to achieve the same end result.

I don't know SELinux well, but as far as seeing an example
(found by Googling selinux allow mknod)

  allow udev_t self:capability { chown dac_override dac_read_search fowner 
fsetid sys_admin sys_nice mknod net_raw net_admin sys_rawio };

I can't find a place to specify filename/attributes pairs in this syntax.
So, if the process who is permitted to create device nodes misbehaves,
it will generate unexpected filename/attribute pairs.
I think SELinux can't guarantee the same result as my filesystem.



 You seem to assume that the in-kernel implementation is suddenly
 guaranteed bugfree.
I keep the implementation as simple as possible.



From your next posting:
 But I think doing more is getting ridiculous, because if a process can
 create a device node, it can also access it and do whatever harm could