Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Eric W. Biederman
Karel Zak <[EMAIL PROTECTED]> writes:

> On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
>> The following extra security measures are taken for unprivileged
>> mounts:
>> 
>>  - usermounts are limited by a sysctl tunable
>>  - force "nosuid,nodev" mount options on the created mount
>
>  The original userspace "user=" solution also implies the "noexec"
>  option by default (you can override the default by "exec" option).
>  
>  It means the kernel based solution is not fully compatible ;-(

Why noexec?  Either it was a silly or arbitrary decision, or
our kernel design may be incomplete.

Now I can see not wanting to support executables if you are locking
down a system.  The classic don't execute a program from a CD just because
the CD was stuck in the drive problem.

So I can see how executing code from an untrusted source could prevent
exploitation of other problems, and we certainly don't want to do it
automatically.

Eric
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


ChunkFS - measuring cross-chunk references

2007-04-24 Thread Karuna sagar K

On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:

.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).



Modified the tool to distinguish between
1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from / partition of ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.


There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar that it
uses a stat structure which returns a 64-bit ino_t.

   - Ted




Thanks,
Karuna


cref.tar.bz2
Description: BZip2 compressed data


Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Karuna sagar K

On 4/24/07, Theodore Tso <[EMAIL PROTECTED]> wrote:

On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:

.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).



Modified the tool to distinguish between
1. cross references between directories and files
2. cross references between directories and sub directories
3. cross references within a file (due to huge file size)

Below is the result from / partition of ext3 file system:

Number of files = 221794
Number of directories = 24457
Total size = 8193116 KB
Total data stored = 7187392 KB
Size of block groups = 131072 KB
Number of inodes per block group = 16288
No. of cross references between directories and sub-directories = 7791
No. of cross references between directories and file = 657
Total no. of cross references = 62018 (dir ref = 8448, file ref = 53570)

Thanks for the suggestions.


There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar that it
uses a stat structure which returns a 64-bit ino_t.

   - Ted




Thanks,
Karuna


cref.tar.bz2
Description: BZip2 compressed data


Re: [patch 0/8] mount ownership and unprivileged mount syscall (v4)

2007-04-24 Thread Karel Zak
On Fri, Apr 20, 2007 at 12:25:32PM +0200, Miklos Szeredi wrote:
> The following extra security measures are taken for unprivileged
> mounts:
> 
>  - usermounts are limited by a sysctl tunable
>  - force "nosuid,nodev" mount options on the created mount

 The original userspace "user=" solution also implies the "noexec"
 option by default (you can override the default by "exec" option).
 
 It means the kernel based solution is not fully compatible ;-(

Karel

-- 
 Karel Zak  <[EMAIL PROTECTED]>
 
 Red Hat Czech s.r.o.
 Purkynova 99/71, 612 45 Brno, Czech Republic
 Reg.id: CZ27690016
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Amit Gud

Nikita Danilov wrote:

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.



Perhaps, I should talk about how continuation inodes are managed / 
located on disk. (This is how it is in my current implementation)


Right now, there is no distinction between an inode and continuation 
inode (also referred to as 'cnode' below), except for the 
EXT2_IS_CONT_FL flag. Every inode holds a list of static number of 
inodes, currently limited to 4.


The structure looks like this:

 -- --
| cnode 0  |-->| cnode 0  |--> to another cnode or NULL
 -- --
| cnode 1  |-  | cnode 1  |-
 -- |   --  |
| cnode 2  |--  |  | cnode 2  |--   |
 --  |  |   --  |   |
| cnode 3  | |  |  | cnode 3  | |   |
 --  |  |   --  |   |
  |  |  ||  |   |

   inodes   inodes or NULL

I.e. only first cnode in the list carries forward the chain if all the 
slots are occupied.


Every cnode# field contains
{
ino_t cnode;
__u32 start;/* starting logical block number */
__u32 end;  /* ending logical block number */
}

(current implementation has just one field: cnode)

I thought of this structure to avoid recursion and / or use of any data 
structure while traversing the continuation inodes.


Additional flag, EXT2_SPARSE_CONT_FL would indicate whether the inode 
has any sparse portions. 'start' and 'end' fields are used to speed-up 
finding a cnode given a logical block number without the need of 
actually reading the inode - this can be done away with, perhaps more 
conveniently by, pinning the cnodes in memory as and when read.


Now going back to the Nikita's question, all the cnodes for an inode 
need to be scanned iff 'start' field or number of blocks or flag 
EXT2_SPARSE_CONT_FL in any of its cnodes is altered.


And yes, the whole attempt is to reduce the number of continuation inodes.

Comments, suggestions welcome.


AG
--
May the source be with you.
http://www.cis.ksu.edu/~gud

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


David Lang writes:
> On Tue, 24 Apr 2007, Nikita Danilov wrote:
>
> > Amit Gud writes:
> >
> > Hello,
> >
> > >
> > > This is an initial implementation of ChunkFS technique, briefly discussed
> > > at: http://lwn.net/Articles/190222 and
> > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
> >
> > I have a couple of questions about chunkfs repair process.
> >
> > First, as I understand it, each continuation inode is a sparse file,
> > mapping some subset of logical file blocks into block numbers. Then it
> > seems, that during "final phase" fsck has to check that these partial
> > mappings are consistent, for example, that no two different continuation
> > inodes for a given file contain a block number for the same offset. This
> > check requires scan of all chunks (rather than of only "active during
> > crash"), which seems to return us back to the scalability problem
> > chunkfs tries to address.
>
> not quite.
>
> this checking is a O(n^2) or worse problem, and it can eat a lot of memory in
> the process. with chunkfs you divide the problem by a large constant (100 or
> more) for the checks of individual chunks. after those are done then the final
> pass checking the cross-chunk links doesn't have to keep track of everything, 
it
> only needs to check those links and what they point to

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.


actually, it should be proportional to the number of continuation nodes. The 
expectation (and design) is that they are rare.


If you get into the worst-case situation of all of them being continuation 
nodes, then you are actually worse off then you were to start with (as you are 
saying), but numbers from people's real filesystems (assuming a chunk size equal 
to a block cluster size) indicates that we are more on the order of a fraction 
of a percent of the nodes. and the expectation is that since the chunk sizes 
will be substantially larger then the block cluster sizes this should get 
reduced even more.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
David Lang writes:
 > On Tue, 24 Apr 2007, Nikita Danilov wrote:
 > 
 > > Amit Gud writes:
 > >
 > > Hello,
 > >
 > > >
 > > > This is an initial implementation of ChunkFS technique, briefly discussed
 > > > at: http://lwn.net/Articles/190222 and
 > > > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf
 > >
 > > I have a couple of questions about chunkfs repair process.
 > >
 > > First, as I understand it, each continuation inode is a sparse file,
 > > mapping some subset of logical file blocks into block numbers. Then it
 > > seems, that during "final phase" fsck has to check that these partial
 > > mappings are consistent, for example, that no two different continuation
 > > inodes for a given file contain a block number for the same offset. This
 > > check requires scan of all chunks (rather than of only "active during
 > > crash"), which seems to return us back to the scalability problem
 > > chunkfs tries to address.
 > 
 > not quite.
 > 
 > this checking is a O(n^2) or worse problem, and it can eat a lot of memory 
 > in 
 > the process. with chunkfs you divide the problem by a large constant (100 or 
 > more) for the checks of individual chunks. after those are done then the 
 > final 
 > pass checking the cross-chunk links doesn't have to keep track of 
 > everything, it 
 > only needs to check those links and what they point to

Maybe I failed to describe the problem presicely.

Suppose that all chunks have been checked. After that, for every inode
I0 having continuations I1, I2, ... In, one has to check that every
logical block is presented in at most one of these inodes. For this one
has to read I0, with all its indirect (double-indirect, triple-indirect)
blocks, then read I1 with all its indirect blocks, etc. And to repeat
this for every inode with continuations.

In the worst case (every inode has a continuation in every chunk) this
obviously is as bad as un-chunked fsck. But even in the average case,
total amount of io necessary for this operation is proportional to the
_total_ file system size, rather than to the chunk size.

 > 
 > any ability to mark a filesystem as 'clean' and then not have to check it on 
 > reboot is a bonus on top of this.
 > 
 > David Lang

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread David Lang

On Tue, 24 Apr 2007, Nikita Danilov wrote:


Amit Gud writes:

Hello,

>
> This is an initial implementation of ChunkFS technique, briefly discussed
> at: http://lwn.net/Articles/190222 and
> http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process.

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.


not quite.

this checking is a O(n^2) or worse problem, and it can eat a lot of memory in 
the process. with chunkfs you divide the problem by a large constant (100 or 
more) for the checks of individual chunks. after those are done then the final 
pass checking the cross-chunk links doesn't have to keep track of everything, it 
only needs to check those links and what they point to


any ability to mark a filesystem as 'clean' and then not have to check it on 
reboot is a bonus on top of this.


David Lang


Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

>
> Best,
> AG

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Theodore Tso
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote:
> 
> > The other thing which we should consider is that chunkfs really
> > requires a 64-bit inode number space, which means either we only allow
> 
> does it?
> I'd think it needs a "chunk space" number and a 32 bit local inode
> number ;) (same for blocks)
> 

But that means that the number which gets exported to userspace via
the stat system call will need more than 32 bits worth of ino_t

- Ted

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: review: don't block non-blocking writes when frozen

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 09:13:37AM +1000, David Chinner wrote:
> So, given the catch-22 you've just presented us can we revisit the
> nfsd non-blocking I/O issue again?  This affects anyone using DM
> snapshots on their NFS servers and has nothing to do with HSMs
> or DMAPI...
> 
> FWIW, you can still do non-blocking userspace I/O to a file, so this
> XFS patch is still valid for mainline (that's how I tested it).

I've been investigating the situation a little bit more, and here's
what's going on:

 - SuS allows for O_NONBLOCK on regular files as per
   http://www.opengroup.org/onlinepubs/007908799/xsh/write.html
 - actually implementing O_NONBLOCK semantics for regular fixes
   breaks userspace when poll/select claims files are ready to
   read/write but they aren't, see http://lkml.org/lkml/2004/10/17/17

So we can't really expose O_NONBLOCK on regular files to userspace,
and we need to make sure in common code this does not happen.

EJUKEBOX on snaphots does make sense, though.  Can you please send
a full patchseries for nfsd, the common code and the xfs writepath
so that this actually gets used and behaviour is consistant for all
(or at least most) filesystems?

Also now that the patch goes to mainline please kill ugly FILP_DELAY_FLAG
and just check the flags directly.  And it should probably only check
O_NONBLOCK.  The only architecture having O_NDELAY different from
O_NONBLOCK is sparc, and it already translates the value for us.



-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Interface for the new fallocate() system call

2007-04-24 Thread Amit K. Arora
On Fri, Apr 20, 2007 at 10:59:18AM -0400, Jakub Jelinek wrote:
> On Fri, Apr 20, 2007 at 07:21:46PM +0530, Amit K. Arora wrote:
> > Ok.
> > In this case we may have to consider following things:
> > 
> > 1) Obviously, for this glibc will have to call fallocate() syscall with
> > different arguments on s390, than other archs. I think this should be
> > doable and should not be an issue with glibc folks (right?).
> 
> glibc can cope with this easily, will just add
> sysdeps/unix/sysv/linux/s390/fallocate.c or something similar to override
> the generic Linux implementation.
> 
> > 2) we also need to see how strace behaves in this case. With little
> > knowledge that I have of strace, I don't think it should depend on
> > argument ordering of a system call on different archs (since it uses
> > ptrace internally and that should take care of it). But, it will be
> > nice if someone can confirm this.
> 
> strace would solve this with #ifdef mess, it already does that in many
> places so guess another few lines don't make it significantly worse.

I will work on the revised fallocate patchset and will post it soon.

Thanks!
--
Regards,
Amit Arora
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck

2007-04-24 Thread Nikita Danilov
Amit Gud writes:

Hello,

 > 
 > This is an initial implementation of ChunkFS technique, briefly discussed
 > at: http://lwn.net/Articles/190222 and 
 > http://cis.ksu.edu/~gud/docs/chunkfs-hotdep-val-arjan-gud-zach.pdf

I have a couple of questions about chunkfs repair process. 

First, as I understand it, each continuation inode is a sparse file,
mapping some subset of logical file blocks into block numbers. Then it
seems, that during "final phase" fsck has to check that these partial
mappings are consistent, for example, that no two different continuation
inodes for a given file contain a block number for the same offset. This
check requires scan of all chunks (rather than of only "active during
crash"), which seems to return us back to the scalability problem
chunkfs tries to address.

Second, it is not clear how, under assumption of bugs in the file system
code (which paper makes at the very beginning), fsck can limit itself
only to the chunks that were active at the moment of crash.

[...]

 > 
 > Best,
 > AG

Nikita.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
> On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
> > On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> > > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > > > + page = __grab_cache_page(mapping, index);
> > > > 
> > > > btw, __grab_cache_page should probably get a more descriptive and
> > > > non-__-prefixed name now that it's used all over the place.
> > > 
> > > Agreed. Suggestions? ;)
> > 
> > find_or_create_cache_page given that's it's like find_or_create_page +
> > add_to_page_cache?
> 
> find_or_create_page adds to page cache as well, though :P

I would really like to see the word 'lock' in there, as it does
return a locked page, and when I first saw __grab_cache_page recently
there was an unlock_page afterwards and I couldn't see where the lock
happened, and I was confused for a little while.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 12:18:11PM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
> > I was going to try doing that after this patchset. Or do you think it
> > would be better to get the __grab_cache_page name right in the
> > first place?
> 
> If you plan to fix things up afterwards there's not point in doing
> the renaming first.  Would be nice to get both into the same release
> in the end, though :)

OK, will do.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 01:16:53PM +0200, Nick Piggin wrote:
> > find_or_create_cache_page given that's it's like find_or_create_page +
> > add_to_page_cache?
> 
> find_or_create_page adds to page cache as well, though :P
> 
> All those random little slightly different allocators scattered over
> filemap.c are a bit annoying. Basically I think it would be better
> to have a single variant that takes gfp_mask of both the pagecache
> page, and the radix-tree insertion. Then serveral things can be
> converted to use it.
> 
> I was going to try doing that after this patchset. Or do you think it
> would be better to get the __grab_cache_page name right in the
> first place?

If you plan to fix things up afterwards there's not point in doing
the renaming first.  Would be nice to get both into the same release
in the end, though :)
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 12:11:19PM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> > On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > > +   page = __grab_cache_page(mapping, index);
> > > 
> > > btw, __grab_cache_page should probably get a more descriptive and
> > > non-__-prefixed name now that it's used all over the place.
> > 
> > Agreed. Suggestions? ;)
> 
> find_or_create_cache_page given that's it's like find_or_create_page +
> add_to_page_cache?

find_or_create_page adds to page cache as well, though :P

All those random little slightly different allocators scattered over
filemap.c are a bit annoying. Basically I think it would be better
to have a single variant that takes gfp_mask of both the pagecache
page, and the radix-tree insertion. Then serveral things can be
converted to use it.

I was going to try doing that after this patchset. Or do you think it
would be better to get the __grab_cache_page name right in the
first place?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-24 Thread Andi Kleen
On Tuesday 24 April 2007 12:40:24 Jan Kara wrote:
> > One of my autoboot test clients gave me this during shutdown. It used
> > reiserfs and autofs and NFS heavily.
>   Jeff has a fix for this bug so it should go away soon... Thanks for
> report anyway :).

Well I hit two more -- see other mails if you're interested.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 01:05:49PM +0200, Nick Piggin wrote:
> On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > > + page = __grab_cache_page(mapping, index);
> > 
> > btw, __grab_cache_page should probably get a more descriptive and
> > non-__-prefixed name now that it's used all over the place.
> 
> Agreed. Suggestions? ;)

find_or_create_cache_page given that's it's like find_or_create_page +
add_to_page_cache?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 11:46:47AM +0100, Christoph Hellwig wrote:
> > +   page = __grab_cache_page(mapping, index);
> 
> btw, __grab_cache_page should probably get a more descriptive and
> non-__-prefixed name now that it's used all over the place.

Agreed. Suggestions? ;)

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 13/44] mm: restore KERNEL_DS optimisations

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 11:43:18AM +0100, Christoph Hellwig wrote:
> On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
> > Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
> > path.
> > 
> > This may be a pretty questionable gain in most cases, especially after the
> > legacy 2copy write path is removed, but it doesn't cost much.
> 
> Well, it gets removed later and sets a bad precedence.  Instead of
> adding hacks we should have proper methods for kernel-space read/writes.
> Especially as the latter are a lot simpler and most of the magic
> in this patch series is not needed.  I'll start this work once
> your patch series is in.

It was removed earlier and put back in here. I agree it isn't so
important, but again it does help that the patchset introduces no
obvious regression. You could remove it in your patchset?


> In general there seems to be a lot of stuff in the earlier patches
> that just goes away later and doesn't make much sense in the series.
> Is there a good reason not to simply consolidate out those changes
> completely?

I guess the first half of the patchset -- the slow deadlock fix for
the old prepare_write path -- came about because that's the only
reasonable way I could find to fix it. I initially thought it would
take a lot longer to convert all filesystems and that we might want
to stay compatible for a while, which is why I wanted to ensure that
was working.

Basically I can't really see which ones you think I should merge and
be able retain a working kernel?

Granted there are a couple of bugfixes and some slightly orthogonal
cleanups in there, but I just thought I'd submit them in the same
series because it was a little easier for me.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 16/44] rd convert to new aops

2007-04-24 Thread Christoph Hellwig
> + page = __grab_cache_page(mapping, index);

btw, __grab_cache_page should probably get a more descriptive and
non-__-prefixed name now that it's used all over the place.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 13/44] mm: restore KERNEL_DS optimisations

2007-04-24 Thread Christoph Hellwig
On Tue, Apr 24, 2007 at 11:23:59AM +1000, Nick Piggin wrote:
> Restore the KERNEL_DS optimisation, especially helpful to the 2copy write
> path.
> 
> This may be a pretty questionable gain in most cases, especially after the
> legacy 2copy write path is removed, but it doesn't cost much.

Well, it gets removed later and sets a bad precedence.  Instead of
adding hacks we should have proper methods for kernel-space read/writes.
Especially as the latter are a lot simpler and most of the magic
in this patch series is not needed.  I'll start this work once
your patch series is in.

In general there seems to be a lot of stuff in the earlier patches
that just goes away later and doesn't make much sense in the series.
Is there a good reason not to simply consolidate out those changes
completely?
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: Dentry still in use during umount in 2.6.21-rc5-git6

2007-04-24 Thread Jan Kara
> One of my autoboot test clients gave me this during shutdown. It used
> reiserfs and autofs and NFS heavily.
  Jeff has a fix for this bug so it should go away soon... Thanks for
report anyway :).

Honza

> Unmounting file systems
> BUG: Dentry 8100f3693a40{i=2352220,n=xattrs} still in use (1) [unmount of 
> reiserfs sda9]
> [ cut here ]
> kernel BUG at 
> /mnt/dm-2/newautoboot/autoboot/lsrc/mainline/linux/fs/dcache.c:623!
> invalid opcode:  [1] SMP 
> CPU 1 
> Modules linked in:
> Pid: 15791, comm: umount Not tainted 2.6.21-rc5-git6 #44
> RIP: 0010:[]  [] 
> shrink_dcache_for_umount_subtree+0x178/0x250
> RSP: 0018:8100f5f67e18  EFLAGS: 00010292
> RAX: 0060 RBX: 8100f3693a40 RCX: 5207
> RDX:  RSI: 0046 RDI: 00014661
> RBP: 8100f6dc9cc0 R08: 00a0 R09: 0005
> R10:  R11:  R12: 8100f3693aa0
> R13: 00014661 R14: 0050ea70 R15: 0050ead0
> FS:  2adc863a86d0() GS:8100f7fdc1c0() knlGS:b7be38d0
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: 2adc8626a688 CR3: f628b000 CR4: 06e0
> Process umount (pid: 15791, threadinfo 8100f5f66000, task 
> 8100f7a08100)
> Stack:  810004dab218 810004dab000 80558860 810004dab000
>   8028815b 810004dab000 8027a1a5
>   8100f6c50980 806c1600 8027a2a4
> Call Trace:
>  [] shrink_dcache_for_umount+0x2f/0x3d
>  [] generic_shutdown_super+0x19/0xf2
>  [] kill_block_super+0x26/0x3b
>  [] deactivate_super+0x47/0x60
>  [] sys_umount+0x1f7/0x22a
>  [] sys_newstat+0x19/0x31
>  [] system_call+0x7e/0x83
> 
> 
> Code: 0f 0b eb fe 48 8b 6b 28 48 39 dd 75 04 31 ed eb 04 f0 ff 4d 
> RIP  [] shrink_dcache_for_umount_subtree+0x178/0x250
>  RSP 
> /etc/init.d/boot.d/K14boot.localfs: line 93: 15791 Segmentation fault  
> umount -avt noproc,nonfs,nonfs4,nosmbfs,nocifs,notmpfs
> 
> 
> 
> -Andi
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
-- 
Jan Kara <[EMAIL PROTECTED]>
SuSE CR Labs
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 05:49:48PM +1000, Neil Brown wrote:
> On Tuesday April 24, [EMAIL PROTECTED] wrote:
> > 
> > BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
> > an initial read or other sequence they might be using to handle the
> > case of a short write. ecryptfs uses it, others can too.
> > 
> > For buffered writes, this doesn't get passed in (unless they are
> > coming from kernel space), so I was debating whether to have it at
> > all.  However, in the previous API, _nobody_ had to worry about
> > short writes, so this flag means I avoid making an API that is
> > fundamentally less performant in some situations.
> 
> Ahhh I think I get it now.
> 
>   In general, the address_space must cope with the possibility that
>   fewer than the expected number of bytes is copied.  This may leave
>   parts of the page with invalid data.  This can be handled by
>   pre-loading the page with valid data, however this may cause a
>   significant performance cost.

Right. Bringing the page uptodate at write_begin-time is probably
the simplest way to handle it. However, more sophisticated schemes
are possible. For example, the generic block routines can recover
at write_end-time, and probably can't make use of the flag to do
things much better...


>   The write_begin/write_end interface provide two mechanism by which
>   this case can be handled more efficiently.
>   1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
> not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
> If that is set, inefficient preparation can be avoided.  However the
> most common write paths will never set this flag.

Yes, loop, nfsd, and filesystem-specific pagecache modification
(eg. ext2 directories) are probably the main things that use it.


>   2/ The return from write_end can declare that fewer bytes have been
> accepted. e.g. part of the page may have been loaded from backing
> store, overwriting some of the newly written bytes.  If this
> return value is reduced, a new write_begin/write_end cycle
> may be called to attempt to write the bytes again.

Yeah, although you'd have to be careful not to overwrite things if
the page is uptodate (because uptodate *really* means uptodate --
ie.  it is the only thing we have to synchronise buffered reads from
returning the data to userspace).


> 
> Also
> +  write_end: After a successful write_begin, and data copy, write_end must
> +be called. len is the original len passed to write_begin, and copied
> +is the amount that was able to be copied (they must be equal if
> +write_begin was called with intr == 0).
> +
> 
> That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being
> set".
> And "that was able to be copied" is misleading, as the copy is not done in
> write_end.  Maybe "that was accepted".

Thanks, very good eyes and good suggestions.

Actually I'm a bit worried about this copied vs accepted thing -- we've
already copied some number of bytes into the pagecache by the time write_end
is called. So if the filesystem accepts less and the pagecache page is marked
uptodate, then the pagecache is now out of sync with the filesystem. There
are a few places where it looks like we get this wrong... but that's for a
future patch :P
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
> 
> BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
> an initial read or other sequence they might be using to handle the
> case of a short write. ecryptfs uses it, others can too.
> 
> For buffered writes, this doesn't get passed in (unless they are
> coming from kernel space), so I was debating whether to have it at
> all.  However, in the previous API, _nobody_ had to worry about
> short writes, so this flag means I avoid making an API that is
> fundamentally less performant in some situations.

Ahhh I think I get it now.

  In general, the address_space must cope with the possibility that
  fewer than the expected number of bytes is copied.  This may leave
  parts of the page with invalid data.  This can be handled by
  pre-loading the page with valid data, however this may cause a
  significant performance cost.
  The write_begin/write_end interface provide two mechanism by which
  this case can be handled more efficiently.
  1/ The AOP_FLAG_UNINTERRUPTIBLE flag declares that the write will
not be partial (maybe a different name? AOP_FLAG_NO_PARTIAL).
If that is set, inefficient preparation can be avoided.  However the
most common write paths will never set this flag.
  2/ The return from write_end can declare that fewer bytes have been
accepted. e.g. part of the page may have been loaded from backing
store, overwriting some of the newly written bytes.  If this
return value is reduced, a new write_begin/write_end cycle
may be called to attempt to write the bytes again.

Also
+  write_end: After a successful write_begin, and data copy, write_end must
+be called. len is the original len passed to write_begin, and copied
+is the amount that was able to be copied (they must be equal if
+write_begin was called with intr == 0).
+

That should be "... called without AOP_FLAG_UNINTERRUPTIBLE being
set".
And "that was able to be copied" is misleading, as the copy is not done in
write_end.  Maybe "that was accepted".

It seems to make sense now.  I might try re-reviewing the patches based
on this improved understanding only a public holiday looms :-)

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Nick Piggin
On Tue, Apr 24, 2007 at 04:59:47PM +1000, Neil Brown wrote:
> On Tuesday April 24, [EMAIL PROTECTED] wrote:
> > +  write_begin: This is intended as a replacement for prepare_write. Called
> > +by the generic buffered write code to ask the filesystem to prepare
> > +to write len bytes at the given offset in the file. flags is a 
> > field
> > +for AOP_FLAG_xxx flags, described in include/linux/mm.h.
> 
> Putting "This is intended as a replacement.." there sees a bit
> dangerous.  It could well accidentally remain when the documentation
> for prepare_write gets removed.  I would make it a separate paragraph
> and flesh it out.  And include text from prepare_write before that
> gets removed.
> 
>write_begin:
>  This is intended as a replacement for prepare_write.  The key 
>  differences being that:
>   - it returns a locked page (in *pagep) rather than
>   being given a pre-locked page:
>   - it can pass arbitrary state to write_end rather than
> having to hide stuff in some filesystem-internal
> data structure 
>   - The (largely undocumented) flags option.
> 
>  Called by  the generic bufferred write code to ask an
>  address_space to prepare to write len bytes at the given
>  offset in the file.
> 
>The address_space should check that the write will be able to
>complete, by allocating space if necessary and doing any other
>internal housekeeping.  If the write will update parts of any
>basic-blocks on storage, then those blocks should be pre-read
>(if they haven't been read already) so that the updated blocks
>can be written out properly.
>The possible flags are listed in include/linux/fs.h (not
>mm.h) and include
>   AOP_FLAG_UNINTERRUPTIBLE:
>   It is unclear how this should be used.  No
>   current code handles it.

Yeah, reasonable points. I'll do an incremental patch to clean up
some of the documentation.

BTW. AOP_FLAG_UNINTERRUPTIBLE can be used by filesystems to avoid
an initial read or other sequence they might be using to handle the
case of a short write. ecryptfs uses it, others can too.

For buffered writes, this doesn't get passed in (unless they are
coming from kernel space), so I was debating whether to have it at
all.  However, in the previous API, _nobody_ had to worry about
short writes, so this flag means I avoid making an API that is
fundamentally less performant in some situations.

> 
> (together with the rest...)
> > +
> > +The filesystem must return the locked pagecache page for the caller
> > +to write into.
> > +
> > +A void * may be returned in fsdata, which then gets passed into
> > +write_end.
> > +
> > +Returns < 0 on failure, in which case all cleanup must be done and
> > +write_end not called. 0 on success, in which case write_end must
> > +be called.
> 
> 
> As you are not including perform_write in the current patchset, maybe
> it is best not to include the documentation yet either?

Right, missed that, thanks!

Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch 12/44] fs: introduce write_begin, write_end, and perform_write aops

2007-04-24 Thread Neil Brown
On Tuesday April 24, [EMAIL PROTECTED] wrote:
> +  write_begin: This is intended as a replacement for prepare_write. Called
> +by the generic buffered write code to ask the filesystem to prepare
> +to write len bytes at the given offset in the file. flags is a field
> +for AOP_FLAG_xxx flags, described in include/linux/mm.h.

Putting "This is intended as a replacement.." there sees a bit
dangerous.  It could well accidentally remain when the documentation
for prepare_write gets removed.  I would make it a separate paragraph
and flesh it out.  And include text from prepare_write before that
gets removed.

   write_begin:
 This is intended as a replacement for prepare_write.  The key 
 differences being that:
- it returns a locked page (in *pagep) rather than
  being given a pre-locked page:
- it can pass arbitrary state to write_end rather than
  having to hide stuff in some filesystem-internal
  data structure 
- The (largely undocumented) flags option.

 Called by  the generic bufferred write code to ask an
 address_space to prepare to write len bytes at the given
 offset in the file.

 The address_space should check that the write will be able to
 complete, by allocating space if necessary and doing any other
 internal housekeeping.  If the write will update parts of any
 basic-blocks on storage, then those blocks should be pre-read
 (if they haven't been read already) so that the updated blocks
 can be written out properly.
 The possible flags are listed in include/linux/fs.h (not
 mm.h) and include
AOP_FLAG_UNINTERRUPTIBLE:
It is unclear how this should be used.  No
current code handles it.

(together with the rest...)
> +
> +The filesystem must return the locked pagecache page for the caller
> +to write into.
> +
> +A void * may be returned in fsdata, which then gets passed into
> +write_end.
> +
> +Returns < 0 on failure, in which case all cleanup must be done and
> +write_end not called. 0 on success, in which case write_end must
> +be called.


As you are not including perform_write in the current patchset, maybe
it is best not to include the documentation yet either?

> +  perform_write: This is a single-call, bulk version of write_begin/write_end
> +operations. It is only used in the buffered write path (write_begin
> +must still be implemented), and not for in-kernel writes to 
> pagecache.
> +It takes an iov_iter structure, which provides a descriptor for the
> +source data (and has associated iov_iter_xxx helpers to operate on
> +that data). There are also file, mapping, and pos arguments, which
> +specify the destination of the data.
> +
> +Returns < 0 on failure if nothing was written out, otherwise returns
> +the number of bytes copied into pagecache.
> +
> +fs/libfs.c provides a reasonable template to start with, 
> demonstrating
> +iov_iter routines, and iteration over the destination pagecache.
> +

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html