Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)
Karel Zak <[EMAIL PROTECTED]> writes: > On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote: >> The following extra security measures are taken for unprivileged >> mounts: >> >> - usermounts are limited by a sysctl tunable >> - force "nosuid,nodev" mount options on the created mount > > The original userspace "user=" solution also implies the "noexec" > option by default (you can override the default by "exec" option). > > It means the kernel based solution is not fully compatible ;-( Why noexec? Either it was a silly or arbitrary decision, or our kernel design may be incomplete. Now I can see not wanting to support executables if you are locking down a system. The classic don't execute a program from a CD just because the CD was stuck in the drive problem. So I can see how executing code from an untrusted source could prevent exploitation of other problems, and we certainly don't want to do it automatically. Eric - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
ChunkFS - measuring cross-chunk references
On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote: On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote: . It would also be good to distinguish between directories referencing files in another chunk, and directories referencing subdirectories in another chunk (which would be simpler to handle, given the topological restrictions on directories, as compared to files and hard links). Modified the tool to distinguish between 1. cross references between directories and files 2. cross references between directories and sub directories 3. cross references within a file (due to huge file size) Below is the result from / partition of ext3 file system: Number of files = 221794 Number of directories = 24457 Total size = 8193116 KB Total data stored = 7187392 KB Size of block groups = 131072 KB Number of inodes per block group = 16288 No. of cross references between directories and sub-directories = 7791 No. of cross references between directories and file = 657 Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570) Thanks for the suggestions. There may also be special things we will need to do to handle scenarios such as BackupPC, where if it looks like a directory contains a huge number of hard links to a particular chunk, we'll need to make sure that directory is either created in the right chunk (possibly with hints from the application) or migrated to the right chunk (but this might cause the inode number of the directory to change --- maybe we allow this as long as the directory has never been stat'ed, so that the inode number has never been observed). The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow it on 64-bit systems, or we need to consider a migration so that even on 32-bit platforms, stat() functions like stat64(), insofar that it uses a stat structure which returns a 64-bit ino_t. - Ted Thanks, Karuna cref.tar.bz2 Description: BZip2 compressed data
Re: ChunkFS - measuring cross-chunk references
On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote: On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote: . It would also be good to distinguish between directories referencing files in another chunk, and directories referencing subdirectories in another chunk (which would be simpler to handle, given the topological restrictions on directories, as compared to files and hard links). Modified the tool to distinguish between 1. cross references between directories and files 2. cross references between directories and sub directories 3. cross references within a file (due to huge file size) Below is the result from / partition of ext3 file system: Number of files = 221794 Number of directories = 24457 Total size = 8193116 KB Total data stored = 7187392 KB Size of block groups = 131072 KB Number of inodes per block group = 16288 No. of cross references between directories and sub-directories = 7791 No. of cross references between directories and file = 657 Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570) Thanks for the suggestions. There may also be special things we will need to do to handle scenarios such as BackupPC, where if it looks like a directory contains a huge number of hard links to a particular chunk, we'll need to make sure that directory is either created in the right chunk (possibly with hints from the application) or migrated to the right chunk (but this might cause the inode number of the directory to change --- maybe we allow this as long as the directory has never been stat'ed, so that the inode number has never been observed). The other thing which we should consider is that chunkfs really requires a 64-bit inode number space, which means either we only allow it on 64-bit systems, or we need to consider a migration so that even on 32-bit platforms, stat() functions like stat64(), insofar that it uses a stat structure which returns a 64-bit ino_t. - Ted Thanks, Karuna cref.tar.bz2 Description: BZip2 compressed data
Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)
On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote: > The following extra security measures are taken for unprivileged > mounts: > > - usermounts are limited by a sysctl tunable > - force "nosuid,nodev" mount options on the created mount The original userspace "user=" solution also implies the "noexec" option by default (you can override the default by "exec" option). It means the kernel based solution is not fully compatible ;-( Karel -- Karel Zak <[EMAIL PROTECTED]> Red Hat Czech s.r.o. Purkynova 99/71, 612 45 Brno, Czech Republic Reg.id: CZ27690016 - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Nikita Danilov wrote: Maybe I failed to describe the problem presicely. Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations. In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, total amount of io necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size. Perhaps, I should talk about how continuation inodes are managed / located on disk. (This is how it is in my current implementation) Right now, there is no distinction between an inode and continuation inode (also referred to as 'cnode' below), except for the EXT2_IS_CONT_FL flag. Every inode holds a list of static number of inodes, currently limited to 4. The structure looks like this: -- -- | cnode 0 |-->| cnode 0 |--> to another cnode or NULL -- -- | cnode 1 |- | cnode 1 |- -- | -- | | cnode 2 |-- | | cnode 2 |-- | -- | | -- | | | cnode 3 | | | | cnode 3 | | | -- | | -- | | | | || | | inodes inodes or NULL I.e. only first cnode in the list carries forward the chain if all the slots are occupied. Every cnode# field contains { ino_t cnode; __u32 start;/* starting logical block number */ __u32 end; /* ending logical block number */ } (current implementation has just one field: cnode) I thought of this structure to avoid recursion and / or use of any data structure while traversing the continuation inodes. Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode has any sparse portions. 'start' and 'end' fields are used to speed-up finding a cnode given a logical block number without the need of actually reading the inode - this can be done away with, perhaps more conveniently by, pinning the cnodes in memory as and when read. Now going back to the Nikita's question, all the cnodes for an inode need to be scanned iff 'start' field or number of blocks or flag EXT2_SPARSE_CONT_FL in any of its cnodes is altered. And yes, the whole attempt is to reduce the number of continuation inodes. Comments, suggestions welcome. AG -- May the source be with you. http://www.cis.ksu.edu/~gud - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote: David Lang writes: > On Tue, 24 Apr 2007, Nikita Danilov wrote: > > > Amit Gud writes: > > > > Hello, > > > > > > > > This is an initial implementation of ChunkFS technique, briefly discussed > > > at: http://lwn.net/Articles/190222 and > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > > > I have a couple of questions about chunkfs repair process. > > > > First, as I understand it, each continuation inode is a sparse file, > > mapping some subset of logical file blocks into block numbers. Then it > > seems, that during "final phase" fsck has to check that these partial > > mappings are consistent, for example, that no two different continuation > > inodes for a given file contain a block number for the same offset. This > > check requires scan of all chunks (rather than of only "active during > > crash"), which seems to return us back to the scalability problem > > chunkfs tries to address. > > not quite. > > this checking is a O(n^2) or worse problem, and it can eat a lot of memory in > the process. with chunkfs you divide the problem by a large constant (100 or > more) for the checks of individual chunks. after those are done then the final > pass checking the cross-chunk links doesn't have to keep track of everything, it > only needs to check those links and what they point to Maybe I failed to describe the problem presicely. Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations. In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, total amount of io necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size. actually, it should be proportional to the number of continuation nodes. The expectation (and design) is that they are rare. If you get into the worst-case situation of all of them being continuation nodes, then you are actually worse off then you were to start with (as you are saying), but numbers from people's real filesystems (assuming a chunk size equal to a block cluster size) indicates that we are more on the order of a fraction of a percent of the nodes. and the expectation is that since the chunk sizes will be substantially larger then the block cluster sizes this should get reduced even more. David Lang - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
David Lang writes: > On Tue, 24 Apr 2007, Nikita Danilov wrote: > > > Amit Gud writes: > > > > Hello, > > > > > > > > This is an initial implementation of ChunkFS technique, briefly discussed > > > at: http://lwn.net/Articles/190222 and > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf > > > > I have a couple of questions about chunkfs repair process. > > > > First, as I understand it, each continuation inode is a sparse file, > > mapping some subset of logical file blocks into block numbers. Then it > > seems, that during "final phase" fsck has to check that these partial > > mappings are consistent, for example, that no two different continuation > > inodes for a given file contain a block number for the same offset. This > > check requires scan of all chunks (rather than of only "active during > > crash"), which seems to return us back to the scalability problem > > chunkfs tries to address. > > not quite. > > this checking is a O(n^2) or worse problem, and it can eat a lot of memory > in > the process. with chunkfs you divide the problem by a large constant (100 or > more) for the checks of individual chunks. after those are done then the > final > pass checking the cross-chunk links doesn't have to keep track of > everything, it > only needs to check those links and what they point to Maybe I failed to describe the problem presicely. Suppose that all chunks have been checked. After that, for every inode I0 having continuations I1, I2, ... In, one has to check that every logical block is presented in at most one of these inodes. For this one has to read I0, with all its indirect (double-indirect, triple-indirect) blocks, then read I1 with all its indirect blocks, etc. And to repeat this for every inode with continuations. In the worst case (every inode has a continuation in every chunk) this obviously is as bad as un-chunked fsck. But even in the average case, total amount of io necessary for this operation is proportional to the _total_ file system size, rather than to the chunk size. > > any ability to mark a filesystem as 'clean' and then not have to check it on > reboot is a bonus on top of this. > > David Lang Nikita. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
On Tue, 24 Apr 2007, Nikita Danilov wrote: Amit Gud writes: Hello, > > This is an initial implementation of ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf I have a couple of questions about chunkfs repair process. First, as I understand it, each continuation inode is a sparse file, mapping some subset of logical file blocks into block numbers. Then it seems, that during "final phase" fsck has to check that these partial mappings are consistent, for example, that no two different continuation inodes for a given file contain a block number for the same offset. This check requires scan of all chunks (rather than of only "active during crash"), which seems to return us back to the scalability problem chunkfs tries to address. not quite. this checking is a O(n^2) or worse problem, and it can eat a lot of memory in the process. with chunkfs you divide the problem by a large constant (100 or more) for the checks of individual chunks. after those are done then the final pass checking the cross-chunk links doesn't have to keep track of everything, it only needs to check those links and what they point to any ability to mark a filesystem as 'clean' and then not have to check it on reboot is a bonus on top of this. David Lang Second, it is not clear how, under assumption of bugs in the file system code (which paper makes at the very beginning), fsck can limit itself only to the chunks that were active at the moment of crash. [...] > > Best, > AG Nikita. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ChunkFS - measuring cross-chunk references
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote: > > > The other thing which we should consider is that chunkfs really > > requires a 64-bit inode number space, which means either we only allow > > does it? > I'd think it needs a "chunk space" number and a 32 bit local inode > number ;) (same for blocks) > But that means that the number which gets exported to userspace via the stat system call will need more than 32 bits worth of ino_t - Ted - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: review: don't block non-blocking writes when frozen
On Tue, Apr 24, 2007 at 09:13:37AM +1000, David Chinner wrote: > So, given the catch-22 you've just presented us can we revisit the > nfsd non-blocking I/O issue again? This affects anyone using DM > snapshots on their NFS servers and has nothing to do with HSMs > or DMAPI... > > FWIW, you can still do non-blocking userspace I/O to a file, so this > XFS patch is still valid for mainline (that's how I tested it). I've been investigating the situation a little bit more, and here's what's going on: - SuS allows for O_NONBLOCK on regular files as per http://www.opengroup.org/onlinepubs/007908799/xsh/write.html - actually implementing O_NONBLOCK semantics for regular fixes breaks userspace when poll/select claims files are ready to read/write but they aren't, see http://lkml.org/lkml/2004/10/17/17 So we can't really expose O_NONBLOCK on regular files to userspace, and we need to make sure in common code this does not happen. EJUKEBOX on snaphots does make sense, though. Can you please send a full patchseries for nfsd, the common code and the xfs writepath so that this actually gets used and behaviour is consistant for all (or at least most) filesystems? Also now that the patch goes to mainline please kill ugly FILP_DELAY_FLAG and just check the flags directly. And it should probably only check O_NONBLOCK. The only architecture having O_NDELAY different from O_NONBLOCK is sparc, and it already translates the value for us. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Interface for the new fallocate() system call
On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote: > On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote: > > Ok. > > In this case we may have to consider following things: > > > > 1) Obviously, for this glibc will have to call fallocate() syscall with > > different arguments on s390, than other archs. I think this should be > > doable and should not be an issue with glibc folks (right?). > > glibc can cope with this easily, will just add > sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override > the generic Linux implementation. > > > 2) we also need to see how strace behaves in this case. With little > > knowledge that I have of strace, I don't think it should depend on > > argument ordering of a system call on different archs (since it uses > > ptrace internally and that should take care of it). But, it will be > > nice if someone can confirm this. > > strace would solve this with #ifdef mess, it already does that in many > places so guess another few lines don't make it significantly worse. I will work on the revised fallocate patchset and will post it soon. Thanks! -- Regards, Amit Arora - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
Amit Gud writes: Hello, > > This is an initial implementation of ChunkFS technique, briefly discussed > at: http://lwn.net/Articles/190222 and > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf I have a couple of questions about chunkfs repair process. First, as I understand it, each continuation inode is a sparse file, mapping some subset of logical file blocks into block numbers. Then it seems, that during "final phase" fsck has to check that these partial mappings are consistent, for example, that no two different continuation inodes for a given file contain a block number for the same offset. This check requires scan of all chunks (rather than of only "active during crash"), which seems to return us back to the scalability problem chunkfs tries to address. Second, it is not clear how, under assumption of bugs in the file system code (which paper makes at the very beginning), fsck can limit itself only to the chunks that were active at the moment of crash. [...] > > Best, > AG Nikita. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tuesday April 24, [EMAIL PROTECTED] wrote: > On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote: > > On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote: > > > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote: > > > > > + page = __grab_cache_page(mapping, index); > > > > > > > > btw, __grab_cache_page should probably get a more descriptive and > > > > non-__-prefixed name now that it's used all over the place. > > > > > > Agreed. Suggestions? ;) > > > > find_or_create_cache_page given that's it's like find_or_create_page + > > add_to_page_cache? > > find_or_create_page adds to page cache as well, though :P I would really like to see the word 'lock' in there, as it does return a locked page, and when I first saw __grab_cache_page recently there was an unlock_page afterwards and I couldn't see where the lock happened, and I was confused for a little while. NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tue, Apr 24, 2007 at 12:18:11PM +0100, Christoph Hellwig wrote: > On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote: > > I was going to try doing that after this patchset. Or do you think it > > would be better to get the __grab_cache_page name right in the > > first place? > > If you plan to fix things up afterwards there's not point in doing > the renaming first. Would be nice to get both into the same release > in the end, though :) OK, will do. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote: > > find_or_create_cache_page given that's it's like find_or_create_page + > > add_to_page_cache? > > find_or_create_page adds to page cache as well, though :P > > All those random little slightly different allocators scattered over > filemap.c are a bit annoying. Basically I think it would be better > to have a single variant that takes gfp_mask of both the pagecache > page, and the radix-tree insertion. Then serveral things can be > converted to use it. > > I was going to try doing that after this patchset. Or do you think it > would be better to get the __grab_cache_page name right in the > first place? If you plan to fix things up afterwards there's not point in doing the renaming first. Would be nice to get both into the same release in the end, though :) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote: > On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote: > > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote: > > > > + page = __grab_cache_page(mapping, index); > > > > > > btw, __grab_cache_page should probably get a more descriptive and > > > non-__-prefixed name now that it's used all over the place. > > > > Agreed. Suggestions? ;) > > find_or_create_cache_page given that's it's like find_or_create_page + > add_to_page_cache? find_or_create_page adds to page cache as well, though :P All those random little slightly different allocators scattered over filemap.c are a bit annoying. Basically I think it would be better to have a single variant that takes gfp_mask of both the pagecache page, and the radix-tree insertion. Then serveral things can be converted to use it. I was going to try doing that after this patchset. Or do you think it would be better to get the __grab_cache_page name right in the first place? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6
On Tuesday 24 April 2007 12:40:24 Jan Kara wrote: > > One of my autoboot test clients gave me this during shutdown. It used > > reiserfs and autofs and NFS heavily. > Jeff has a fix for this bug so it should go away soon... Thanks for > report anyway :). Well I hit two more -- see other mails if you're interested. -Andi - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote: > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote: > > > + page = __grab_cache_page(mapping, index); > > > > btw, __grab_cache_page should probably get a more descriptive and > > non-__-prefixed name now that it's used all over the place. > > Agreed. Suggestions? ;) find_or_create_cache_page given that's it's like find_or_create_page + add_to_page_cache? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote: > > + page = __grab_cache_page(mapping, index); > > btw, __grab_cache_page should probably get a more descriptive and > non-__-prefixed name now that it's used all over the place. Agreed. Suggestions? ;) - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 13/44] mm: restore KERNEL_DS optimisations
On Tue, Apr 24, 2007 at 11:43:18AM +0100, Christoph Hellwig wrote: > On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote: > > Restore the KERNEL_DS optimisation, especially helpful to the 2copy write > > path. > > > > This may be a pretty questionable gain in most cases, especially after the > > legacy 2copy write path is removed, but it doesn't cost much. > > Well, it gets removed later and sets a bad precedence. Instead of > adding hacks we should have proper methods for kernel-space read/writes. > Especially as the latter are a lot simpler and most of the magic > in this patch series is not needed. I'll start this work once > your patch series is in. It was removed earlier and put back in here. I agree it isn't so important, but again it does help that the patchset introduces no obvious regression. You could remove it in your patchset? > In general there seems to be a lot of stuff in the earlier patches > that just goes away later and doesn't make much sense in the series. > Is there a good reason not to simply consolidate out those changes > completely? I guess the first half of the patchset -- the slow deadlock fix for the old prepare_write path -- came about because that's the only reasonable way I could find to fix it. I initially thought it would take a lot longer to convert all filesystems and that we might want to stay compatible for a while, which is why I wanted to ensure that was working. Basically I can't really see which ones you think I should merge and be able retain a working kernel? Granted there are a couple of bugfixes and some slightly orthogonal cleanups in there, but I just thought I'd submit them in the same series because it was a little easier for me. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 16/44] rd convert to new aops
> + page = __grab_cache_page(mapping, index); btw, __grab_cache_page should probably get a more descriptive and non-__-prefixed name now that it's used all over the place. - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 13/44] mm: restore KERNEL_DS optimisations
On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote: > Restore the KERNEL_DS optimisation, especially helpful to the 2copy write > path. > > This may be a pretty questionable gain in most cases, especially after the > legacy 2copy write path is removed, but it doesn't cost much. Well, it gets removed later and sets a bad precedence. Instead of adding hacks we should have proper methods for kernel-space read/writes. Especially as the latter are a lot simpler and most of the magic in this patch series is not needed. I'll start this work once your patch series is in. In general there seems to be a lot of stuff in the earlier patches that just goes away later and doesn't make much sense in the series. Is there a good reason not to simply consolidate out those changes completely? - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6
> One of my autoboot test clients gave me this during shutdown. It used > reiserfs and autofs and NFS heavily. Jeff has a fix for this bug so it should go away soon... Thanks for report anyway :). Honza > Unmounting file systems > BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of > reiserfs sda9] > [ cut here ] > kernel BUG at > /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623! > invalid opcode: [1] SMP > CPU 1 > Modules linked in: > Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44 > RIP: 0010:[] [] > shrink_dcache_for_umount_subtree+0x178/0x250 > RSP: 0018:8100f5f67e18 EFLAGS: 00010292 > RAX: 0060 RBX: 8100f3693a40 RCX: 5207 > RDX: RSI: 0046 RDI: 00014661 > RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005 > R10: R11: R12: 8100f3693aa0 > R13: 00014661 R14: 0050ea70 R15: 0050ead0 > FS: 2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0 > CS: 0010 DS: ES: CR0: 8005003b > CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0 > Process umount (pid: 15791, threadinfo 8100f5f66000, task > 8100f7a08100) > Stack: 810004dab218 810004dab000 80558860 810004dab000 > 8028815b 810004dab000 8027a1a5 > 8100f6c50980 806c1600 8027a2a4 > Call Trace: > [] shrink_dcache_for_umount+0x2f/0x3d > [] generic_shutdown_super+0x19/0xf2 > [] kill_block_super+0x26/0x3b > [] deactivate_super+0x47/0x60 > [] sys_umount+0x1f7/0x22a > [] sys_newstat+0x19/0x31 > [] system_call+0x7e/0x83 > > > Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d > RIP [] shrink_dcache_for_umount_subtree+0x178/0x250 > RSP > /etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault > umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs > > > > -Andi > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to [EMAIL PROTECTED] > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Jan Kara <[EMAIL PROTECTED]> SuSE CR Labs - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tue, Apr 24, 2007 at 05:49:48PM +1000, Neil Brown wrote: > On Tuesday April 24, [EMAIL PROTECTED] wrote: > > > > BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid > > an initial read or other sequence they might be using to handle the > > case of a short write. ecryptfs uses it, others can too. > > > > For buffered writes, this doesn't get passed in (unless they are > > coming from kernel space), so I was debating whether to have it at > > all. However, in the previous API, _nobody_ had to worry about > > short writes, so this flag means I avoid making an API that is > > fundamentally less performant in some situations. > > Ahhh I think I get it now. > > In general, the address_space must cope with the possibility that > fewer than the expected number of bytes is copied. This may leave > parts of the page with invalid data. This can be handled by > pre-loading the page with valid data, however this may cause a > significant performance cost. Right. Bringing the page uptodate at write_begin-time is probably the simplest way to handle it. However, more sophisticated schemes are possible. For example, the generic block routines can recover at write_end-time, and probably can't make use of the flag to do things much better... > The write_begin/write_end interface provide two mechanism by which > this case can be handled more efficiently. > 1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will > not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL). > If that is set, inefficient preparation can be avoided. However the > most common write paths will never set this flag. Yes, loop, nfsd, and filesystem-specific pagecache modification (eg. ext2 directories) are probably the main things that use it. > 2/ The return from write_end can declare that fewer bytes have been > accepted. e.g. part of the page may have been loaded from backing > store, overwriting some of the newly written bytes. If this > return value is reduced, a new write_begin/write_end cycle > may be called to attempt to write the bytes again. Yeah, although you'd have to be careful not to overwrite things if the page is uptodate (because uptodate *really* means uptodate -- ie. it is the only thing we have to synchronise buffered reads from returning the data to userspace). > > Also > + write_end: After a successful write_begin, and data copy, write_end must > +be called. len is the original len passed to write_begin, and copied > +is the amount that was able to be copied (they must be equal if > +write_begin was called with intr == 0). > + > > That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being > set". > And "that was able to be copied" is misleading, as the copy is not done in > write_end. Maybe "that was accepted". Thanks, very good eyes and good suggestions. Actually I'm a bit worried about this copied vs accepted thing -- we've already copied some number of bytes into the pagecache by the time write_end is called. So if the filesystem accepts less and the pagecache page is marked uptodate, then the pagecache is now out of sync with the filesystem. There are a few places where it looks like we get this wrong... but that's for a future patch :P - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tuesday April 24, [EMAIL PROTECTED] wrote: > > BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid > an initial read or other sequence they might be using to handle the > case of a short write. ecryptfs uses it, others can too. > > For buffered writes, this doesn't get passed in (unless they are > coming from kernel space), so I was debating whether to have it at > all. However, in the previous API, _nobody_ had to worry about > short writes, so this flag means I avoid making an API that is > fundamentally less performant in some situations. Ahhh I think I get it now. In general, the address_space must cope with the possibility that fewer than the expected number of bytes is copied. This may leave parts of the page with invalid data. This can be handled by pre-loading the page with valid data, however this may cause a significant performance cost. The write_begin/write_end interface provide two mechanism by which this case can be handled more efficiently. 1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL). If that is set, inefficient preparation can be avoided. However the most common write paths will never set this flag. 2/ The return from write_end can declare that fewer bytes have been accepted. e.g. part of the page may have been loaded from backing store, overwriting some of the newly written bytes. If this return value is reduced, a new write_begin/write_end cycle may be called to attempt to write the bytes again. Also + write_end: After a successful write_begin, and data copy, write_end must +be called. len is the original len passed to write_begin, and copied +is the amount that was able to be copied (they must be equal if +write_begin was called with intr == 0). + That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being set". And "that was able to be copied" is misleading, as the copy is not done in write_end. Maybe "that was accepted". It seems to make sense now. I might try re-reviewing the patches based on this improved understanding only a public holiday looms :-) NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tue, Apr 24, 2007 at 04:59:47PM +1000, Neil Brown wrote: > On Tuesday April 24, [EMAIL PROTECTED] wrote: > > + write_begin: This is intended as a replacement for prepare_write. Called > > +by the generic buffered write code to ask the filesystem to prepare > > +to write len bytes at the given offset in the file. flags is a > > field > > +for AOP_FLAG_xxx flags, described in include/linux/mm.h. > > Putting "This is intended as a replacement.." there sees a bit > dangerous. It could well accidentally remain when the documentation > for prepare_write gets removed. I would make it a separate paragraph > and flesh it out. And include text from prepare_write before that > gets removed. > >write_begin: > This is intended as a replacement for prepare_write. The key > differences being that: > - it returns a locked page (in *pagep) rather than > being given a pre-locked page: > - it can pass arbitrary state to write_end rather than > having to hide stuff in some filesystem-internal > data structure > - The (largely undocumented) flags option. > > Called by the generic bufferred write code to ask an > address_space to prepare to write len bytes at the given > offset in the file. > >The address_space should check that the write will be able to >complete, by allocating space if necessary and doing any other >internal housekeeping. If the write will update parts of any >basic-blocks on storage, then those blocks should be pre-read >(if they haven't been read already) so that the updated blocks >can be written out properly. >The possible flags are listed in include/linux/fs.h (not >mm.h) and include > AOP_FLAG_UNINTERRUPTIBLE: > It is unclear how this should be used. No > current code handles it. Yeah, reasonable points. I'll do an incremental patch to clean up some of the documentation. BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid an initial read or other sequence they might be using to handle the case of a short write. ecryptfs uses it, others can too. For buffered writes, this doesn't get passed in (unless they are coming from kernel space), so I was debating whether to have it at all. However, in the previous API, _nobody_ had to worry about short writes, so this flag means I avoid making an API that is fundamentally less performant in some situations. > > (together with the rest...) > > + > > +The filesystem must return the locked pagecache page for the caller > > +to write into. > > + > > +A void * may be returned in fsdata, which then gets passed into > > +write_end. > > + > > +Returns < 0 on failure, in which case all cleanup must be done and > > +write_end not called. 0 on success, in which case write_end must > > +be called. > > > As you are not including perform_write in the current patchset, maybe > it is best not to include the documentation yet either? Right, missed that, thanks! Nick - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops
On Tuesday April 24, [EMAIL PROTECTED] wrote: > + write_begin: This is intended as a replacement for prepare_write. Called > +by the generic buffered write code to ask the filesystem to prepare > +to write len bytes at the given offset in the file. flags is a field > +for AOP_FLAG_xxx flags, described in include/linux/mm.h. Putting "This is intended as a replacement.." there sees a bit dangerous. It could well accidentally remain when the documentation for prepare_write gets removed. I would make it a separate paragraph and flesh it out. And include text from prepare_write before that gets removed. write_begin: This is intended as a replacement for prepare_write. The key differences being that: - it returns a locked page (in *pagep) rather than being given a pre-locked page: - it can pass arbitrary state to write_end rather than having to hide stuff in some filesystem-internal data structure - The (largely undocumented) flags option. Called by the generic bufferred write code to ask an address_space to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, by allocating space if necessary and doing any other internal housekeeping. If the write will update parts of any basic-blocks on storage, then those blocks should be pre-read (if they haven't been read already) so that the updated blocks can be written out properly. The possible flags are listed in include/linux/fs.h (not mm.h) and include AOP_FLAG_UNINTERRUPTIBLE: It is unclear how this should be used. No current code handles it. (together with the rest...) > + > +The filesystem must return the locked pagecache page for the caller > +to write into. > + > +A void * may be returned in fsdata, which then gets passed into > +write_end. > + > +Returns < 0 on failure, in which case all cleanup must be done and > +write_end not called. 0 on success, in which case write_end must > +be called. As you are not including perform_write in the current patchset, maybe it is best not to include the documentation yet either? > + perform_write: This is a single-call, bulk version of write_begin/write_end > +operations. It is only used in the buffered write path (write_begin > +must still be implemented), and not for in-kernel writes to > pagecache. > +It takes an iov_iter structure, which provides a descriptor for the > +source data (and has associated iov_iter_xxx helpers to operate on > +that data). There are also file, mapping, and pos arguments, which > +specify the destination of the data. > + > +Returns < 0 on failure if nothing was written out, otherwise returns > +the number of bytes copied into pagecache. > + > +fs/libfs.c provides a reasonable template to start with, > demonstrating > +iov_iter routines, and iteration over the destination pagecache. > + NeilBrown - To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html