from:"Andrew Morton"

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-26 Thread Andrew Morton

On Tue, 26 Feb 2008 15:07:45 + Jamie Lokier [EMAIL PROTECTED] wrote:

 SYNC_FILE_RANGE_WRITE scans all pages in the range, looking for dirty
 pages which aren't already queued for write-out.  It marks those with
 a write-out flag, and starts write I/Os at some unspecified time in
 the near future; it can be assumed writes for all the pages will
 complete eventually if there's no errors.  When I/O completes on a
 page, it cleans the page and also clears the write-out flag.
 
 SYNC_FILE_RANGE_WAIT_AFTER waits until all pages in the range don't
 have the write-out flag set.
 
 SYNC_FILE_RANGE_WAIT_BEFORE does the same wait, but before marking
 pages for write-out.  I don't actually see the point in this.  Isn't a
 preceding call with SYNC_FILE_RANGE_WAIT_AFTER equivalent, making
 BEFORE a redundant flag?

Consider the case of pages which are dirty but are already under writeout. 
ie: someone redirtied the page after someone started writing the page out. 
For these pages the kernel needs to

a) wait for the current writeout to complete

b) start new writeout

c) wait for that writeout to complete.

those are the three stages of sync_file_range().  They are independently
selectable and various combinations provide various results.

The reason for providing b) only (SYNC_FILE_RANGE_WRITE) is so that
userspace can get as much data into the queue as possible, to permit the
kernel to optimise IO scheduling better.

If you perform a) and b) together
(SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE) then you are guaranteed
that all data which was dirty when sync_file_range() executed will be sent
into the queue, but you won't get as much data into the queue if the kernel
encounters dirty, under-writeout pages.  This is especially hurtful if
you're trying to feed a lot of little segments into the queue.  In that
case perhaps userspace should do an asynchrnous pass
(SYNC_FILE_RANGE_WRITE) to stuff as much data as poss into the queue, then
a SYNC_FILE_RANGE_WAIT_AFTER pass then a
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
pass to clean up any stragglers.  WHich mode is best very much depends on
the application's file dirtying patterns.  One would have to experiment
with it, and tuning of sync_file_range() usage would occur alongside tuning
of the application's write() design.

It's an interesting problem, with potentially high payback.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/4] autofs4 - implement a miscelaneous device for ioctl control

2008-02-25 Thread Andrew Morton

We have four patches, all with the same title - please avoid doing this.

I could invent titles for them (as I often have to do), but I think it
would be best if you were to do so.  That way, we avoid confusion and
people who later google for the patch's title to understand the patch's
background will get better coverage.

So... resend, please?

Documentation/SubmittingPatches has some tips on all of this - search term
is Subject:.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Proposal for proper durable fsync() and fdatasync()

2008-02-25 Thread Andrew Morton

On Tue, 26 Feb 2008 07:26:50 + Jamie Lokier [EMAIL PROTECTED] wrote:

 (It would be nicer if sync_file_range()
 took a vector of ranges for better elevator scheduling, but let's
 ignore that :-)

Two passes:

Pass 1: shove each of the segments into the queue with
SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE

Pass 2: wait for them all to complete and return accumulated result
with SYNC_FILE_RANGE_WAIT_AFTER


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 01/18] Define functions for page cache handling

2008-02-23 Thread Andrew Morton

On Fri, 15 Feb 2008 16:47:19 -0800 Christoph Lameter [EMAIL PROTECTED] wrote:

 V2-V3:
 - Use mapping instead of a as the address space parameter
 
 We use the macros PAGE_CACHE_SIZE PAGE_CACHE_SHIFT PAGE_CACHE_MASK
 and PAGE_CACHE_ALIGN in various places in the kernel. Many times
 common operations like calculating the offset or the index are coded
 using shifts and adds. This patch provides inline functions to
 get the calculations accomplished without having to explicitly
 shift and add constants.
 
 All functions take an address_space pointer. The address space pointer
 will be used in the future to eventually support a variable size
 page cache. Information reachable via the mapping may then determine
 page size.
 
 New functionRelated base page constant
 
 page_cache_shift(a) PAGE_CACHE_SHIFT
 page_cache_size(a)  PAGE_CACHE_SIZE
 page_cache_mask(a)  PAGE_CACHE_MASK
 page_cache_index(a, pos)Calculate page number from position
 page_cache_next(addr, pos)  Page number of next page
 page_cache_offset(a, pos)   Calculate offset into a page
 page_cache_pos(a, index, offset)
 Form position based on page number
 and an offset.

These sort-of look OK as cleanups and avoidance of accidents.

But the interfaces which they use (passing and address_space) are quite
pointless unless we implement variable page size per address_space.  And as
the chances of that ever happening seem pretty damn small, these changes
are just obfuscation which make the code harder to read and which
pointlessly churn the codebase.

So I'm inclined to drop these patches.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-15 Thread Andrew Morton

On Fri, 15 Feb 2008 04:01:20 -0500 Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Thu, Feb 14, 2008 at 10:21:03PM -0800, Andrew Morton wrote:
  Linus has just merged all the VFS renaming patches, so the decks
  are clear for looking at this work.
  
  However David and Christoph are beavering away on the r-o-bind-mounts
  patches and I expect that there will be overlaps with unprivileged mounts.
  
  Could we coordinate things a bit please?  Decide who goes first, review
  and maybe even test each others work, etc?
 
 Al is setting up a git tree for VFS work.  per-mount r/o will go in
 as one of the first things, aswell as his rework of the path lookup
 logic to fix the intents mess.
 

That didn't answer my question..
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 00/10] mount ownership and unprivileged mount syscall (v8)

2008-02-14 Thread Andrew Morton

On Tue, 05 Feb 2008 22:36:16 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:

 Just documentation updates, compared to the previous submission.
 Thanks to Serge for the relentless reviews :)
 
 Please consider for -mm, and then for 2.6.26.

Linus has just merged all the VFS renaming patches, so the decks
are clear for looking at this work.

However David and Christoph are beavering away on the r-o-bind-mounts
patches and I expect that there will be overlaps with unprivileged mounts.

Could we coordinate things a bit please?  Decide who goes first, review
and maybe even test each others work, etc?

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] reiserfs: use open_bdev_excl

2008-02-06 Thread Andrew Morton

On Thu, 7 Feb 2008 05:45:13 +0100 Christoph Hellwig [EMAIL PROTECTED] wrote:

 On Wed, Dec 26, 2007 at 04:31:01PM +0100, Christoph Hellwig wrote:
  Use the proper helper to open a blockdevice by name for filesystem
  use, this makes sure it's properly claimed (also added for open-by-number)
  and gets rid of the struct file abuse.
  
  Tested by mounting a reiserfs filesystem with external journal.
 
 Folks, what do we do with this patch?  It's been out for more than a
 month but didn't actually get picked up in any tree.

It's still sitting in the great pile of things I got sent over Christmas
and haven't looked at yet.  It didn't look like a bugfix.

  I'd really like
 to see this going in soon.

How come?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-02-05 Thread Andrew Morton

On Tue, 5 Feb 2008 22:48:29 +0100
Oliver Pinter [EMAIL PROTECTED] wrote:

 On 2/5/08, Oliver Pinter [EMAIL PROTECTED] wrote:
  http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/
 
  uploaded:
  kernel image
  .config
  new pictures
  lspci
  lsusb
 
  -
 
  when read for /dev/uba then crashed the kernel, the read is egal, thet
  dd or mount is ...

Looks like you deadlocked in ub_request_fn().  I assume that you were using
ub.c in 2.6.23 and that it worked OK?  If so, we broke it, possibly via
changes to the core block layer.

I think ub.c is basically abandoned in favour of usb-storage.  If so,
perhaps we should remove or disble ub.c?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC PATCH] vfs: optimization to /proc/pid/mountinfo patch

2008-02-04 Thread Andrew Morton

On Mon, 04 Feb 2008 01:15:05 -0800 Ram Pai [EMAIL PROTECTED] wrote:

 1) reports deleted inode in dentry_path() consistent with that in __d_path()
 2) modified __d_path() to use prepend(), reducing the size of __d_path()
 3) moved all the functionality that reports mount information in /proc under
   CONFIG_PROC_FS.
 
 Could not verify if the code would work with CONFIG_PROC_FS=n, since it was
 impossible to disable CONFIG_PROC_FS. Looking for ideas on how to disable
 CONFIG_PROC_FS.
   

Do `make menuconfig', then hit '/' and search for proc_fs.

It'll tell you that you need to set EMBEDDED=y to disable procfs.

  fs/dcache.c  |   59 
 +++
  fs/namespace.c   |2 +
  fs/seq_file.c|2 +
  include/linux/dcache.h   |3 ++
  include/linux/seq_file.h |3 ++

Please resend after testing that, thanks.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2.6.24 REGRESSION] BUG: Soft lockup - with VFS

2008-02-04 Thread Andrew Morton

On Mon, 28 Jan 2008 09:31:43 +0100 Oliver Pinter (Pintér Olivér)  [EMAIL 
PROTECTED] wrote:

 hi all!
 
 in the 2.6.24 become i some soft lockups with usb-phone, when i pluged
 in the mobile, then the vfs-layer crashed. am afternoon can i the
 .config send, and i bisected the kernel, when i have time.
 
 pictures from crash:
 http://students.zipernowsky.hu/~oliverp/kernel/regression_2624/

It looks like selinux's file_has_perm() is doing spin_lock() on an
uninitialised (or already locked) spinlock.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/6] mm: bdi: export BDI attributes in sysfs

2008-01-31 Thread Andrew Morton

On Thu, 31 Jan 2008 10:39:02 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:

  On Tue, 29 Jan 2008 16:49:02 +0100
  Miklos Szeredi [EMAIL PROTECTED] wrote:
  
   From: Peter Zijlstra [EMAIL PROTECTED]
   
   Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info
   object.  This allows us to see and set the various BDI specific
   variables.
   
   In particular this properly exposes the read-ahead window for all
   relevant users and /sys/block/block/queue/read_ahead_kb should be
   deprecated.
  
  This description is not complete.  It implies that the readahead window is
  not properly exposed for some relevant users.  The reader is left
  wondering what on earth this is referring to.  I certainly don't know.
  Perhaps when this information is revealed, we can work out what was
  wrong with per-queue readahead tuning.
 
 I think Peter meant, that the readahead window was only exposed for
 block devices, and not things like NFS or FUSE.

OK.

 
   +blk-NAME
   +
   + Block devices, NAME is 'sda', 'loop0', etc...
  
  But if I've done `mknod /dev/pizza-party 8 0', I'm looking for
  blk-pizza-party, not blk-sda.
  
  But I might still have /dev/sda, too.
 
 An alternative would be to uniformly use MAJOR:MINOR in there.  It
 would work for block devices and anonymous devices (NFS/FUSE) as well.
 
 Would that be any better?

I suppose so.  sysfs likes to use symlinks to point over at related
things in different directories...

  
   +FSTYPE-MAJOR:MINOR
   +
   + Non-block device backed filesystems which provide their own
   + BDI, such as NFS and FUSE.  MAJOR:MINOR is the value of st_dev
   + for files on this filesystem.
   +
   +default
   +
   + The default backing dev, used for non-block device backed
   + filesystems which do not provide their own BDI.
   +
   +Files under /sys/class/bdi/bdi/
   +-
   +
   +read_ahead_kb (read-write)
   +
   + Size of the read-ahead window in kilobytes
   +
   +reclaimable_kb (read-only)
   +
   + Reclaimable (dirty or unstable) memory destined for writeback
   + to this device
   +
   +writeback_kb (read-only)
   +
   + Memory currently under writeback to this device
   +
   +dirty_kb (read-only)
   +
   + Global threshold for reclaimable + writeback memory
   +
   +bdi_dirty_kb (read-only)
   +
   + Current threshold on this BDI for reclaimable + writeback
   + memory
   +
  
  I dunno.  A number of the things which you're exposing are closely tied to
  present-day kernel implementation and may be irrelevant or even
  unimplementable in a few years' time.
 
 Which ones?

I don't know - I misplaced my copy of linux-2.6.44 :)

The whole concept of a BDI might go away, who knows?  Progress in
non-volatile semiconductor storage might make the whole
rotating-platter-with-a-seek-head thing obsolete.

read_ahead_kb is likely to be stable.  writeback_kb is a stable concept
too, although we might lose the ability to keep track of it some time in
the future.

Suppose that /dev/sda and /dev/sdb share the same queue - we lose the ability
to track some of these things?

  They could possibly be moved to debugfs, or something.
 
 I agree, that sysfs should be relatively stable.

This does look more like a debugging feature than a permanently-offered,
support-it-forever part of the kernel ABI.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] vfs: create /proc/pid/mountinfo

2008-01-30 Thread Andrew Morton

On Tue, 29 Jan 2008 14:57:29 +0100
Miklos Szeredi [EMAIL PROTECTED] wrote:

 From: Ram Pai [EMAIL PROTECTED]
 
 /proc/mounts in its current state fails to disambiguate bind mounts,
 especially when the bind mount is subrooted.  Also it does not capture
 propagation state of the mounts(shared-subtree).  The following patch
 addresses the problem.
 
 The patch adds '/proc/pid/mountinfo' which contains a superset of
 information in '/proc/pid/mounts'. The following fields are added:
 
 mntid -- is a unique identifier of the mount
 parent -- the id of the parent mount
 major:minor -- value of st_dev for files on that filesystem
 dir -- the subdir in the filesystem which forms the root of this mount
 propagation-type in the form of propagation_flag[:mntid][,...]
   note: 'shared' flag is followed by the mntid of its peer mount
 'slave' flag is followed by the mntid of its master mount
 'private' flag stands by itself
 'unbindable' flag stands by itself
 
 Also mount options are split into two fileds, the first containing the
 per mount flags, the second the per super block options.
 
 Here is a sample cat /proc/mounts after execution the following commands:

     /proc/pid/mountinfo?
 
 mount --bind /mnt /mnt
 mount --make-shared /mnt
 mount --bind /mnt/1 /var
 mount --make-slave /var
 mount --make-shared /var
 mount --bind /var/abc /tmp
 mount --make-unbindable /proc
 
 2 2 0:1 rootfs rootfs / / rw rw private
 16 2 98:0 ext2 /dev/root / / rw rw private
 17 16 0:3 proc /proc / /proc rw rw unbindable
 18 16 0:10 devpts devpts /dev/pts / rw rw private
 19 16 98:0 ext2 /dev/root /mnt /mnt rw rw shared:19
 20 16 98:0 ext2 /dev/root /mnt/1 /var rw rw shared:21,slave:19
 21 16 98:0 ext2 /dev/root /mnt/1/abc /tmp rw rw shared:20,slave:19
 
 For example, the last line indicates that :
 
 1) The mount is a shared mount.
 2) Its peer mount of mount with id 20
 3) It is also a slave mount of the master-mount with the id  19
 4) The filesystem on device with major/minor number 98:0 and subdirectory
   mnt/1/abc makes the root directory of this mount.
 5) And finally the mount with id 16 is its parent.

Boy, that's complex.  It'd be nicer to have a name:value\n format, like
/proc/meminfo.  But that would really imply one /proc file per mount under
/proc/pid/mountinfo/...

If history teaches us anything, it is that we should make this thing
extensible in ways in which we can be confident will not break existing
parsers.

Does this format ensure that?  It's hard to see how.

 
 [EMAIL PROTECTED]
 
 - new file, rearrange fields
 - for mount ID's use IDA (from the IDR library) instead of a 32bit
   counter, which could overflow
 - print canonical ID's (smallest one within the peer group) for peers
   and master, this is more useful, than a random ID within the same namespace
 - fix a couple of small bugs
 - remove inlines
 - style fixes
 
 Signed-off-by: Ram Pai [EMAIL PROTECTED]
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

The patch adds a new concept of an id for each vfsmount (correct?). 
Perhaps the semantics of that identifier should be captured in code
comments, probably in pnode.c.  Things like:

- They're always unique
- They are allocated on a first-fit-from-zero basis (correct?)
- They are used for... ?

They are also signed, which seems inappropriate.

The new exported-to-everyone dentry_path() probably could do with a bit
more documentation - it's the sort of thing which people keep on wanting
and using.

How does dentry_path() differ from d_path() and why do we need both and can
we get some sharing/consolidation happening here?

Why do d_path() and dentry_path() have differing conventions for displaying
a deleted file and can we fix that?

This patch adds a lot of code which is, I guess, unused if
CONFIG_PROC_FS=n.  Fixable?


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/6] mm: bdi: export BDI attributes in sysfs

2008-01-30 Thread Andrew Morton

On Tue, 29 Jan 2008 16:49:02 +0100
Miklos Szeredi [EMAIL PROTECTED] wrote:

 From: Peter Zijlstra [EMAIL PROTECTED]
 
 Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info
 object.  This allows us to see and set the various BDI specific
 variables.
 
 In particular this properly exposes the read-ahead window for all
 relevant users and /sys/block/block/queue/read_ahead_kb should be
 deprecated.

This description is not complete.  It implies that the readahead window is
not properly exposed for some relevant users.  The reader is left
wondering what on earth this is referring to.  I certainly don't know.
Perhaps when this information is revealed, we can work out what was
wrong with per-queue readahead tuning.

 --- /dev/null 1970-01-01 00:00:00.0 +
 +++ linux/Documentation/ABI/testing/sysfs-class-bdi   2008-01-29 
 13:02:46.0 +0100
 @@ -0,0 +1,50 @@
 +What:/sys/class/bdi/bdi/
 +Date:January 2008
 +Contact: Peter Zijlstra [EMAIL PROTECTED]
 +Description:
 +
 +Provide a place in sysfs for the backing_dev_info object.
 +This allows us to see and set the various BDI specific variables.
 +
 +The bdi identifyer can take the following forms:

identifier

 +blk-NAME
 +
 + Block devices, NAME is 'sda', 'loop0', etc...

But if I've done `mknod /dev/pizza-party 8 0', I'm looking for
blk-pizza-party, not blk-sda.

But I might still have /dev/sda, too.

 +FSTYPE-MAJOR:MINOR
 +
 + Non-block device backed filesystems which provide their own
 + BDI, such as NFS and FUSE.  MAJOR:MINOR is the value of st_dev
 + for files on this filesystem.
 +
 +default
 +
 + The default backing dev, used for non-block device backed
 + filesystems which do not provide their own BDI.
 +
 +Files under /sys/class/bdi/bdi/
 +-
 +
 +read_ahead_kb (read-write)
 +
 + Size of the read-ahead window in kilobytes
 +
 +reclaimable_kb (read-only)
 +
 + Reclaimable (dirty or unstable) memory destined for writeback
 + to this device
 +
 +writeback_kb (read-only)
 +
 + Memory currently under writeback to this device
 +
 +dirty_kb (read-only)
 +
 + Global threshold for reclaimable + writeback memory
 +
 +bdi_dirty_kb (read-only)
 +
 + Current threshold on this BDI for reclaimable + writeback
 + memory
 +

I dunno.  A number of the things which you're exposing are closely tied to
present-day kernel implementation and may be irrelevant or even
unimplementable in a few years' time.

At the very least you should put a HUGE warning in here telling everyone
that these files may disappear or be renamed with new semantics in the
future, and that they should design their userspace code with this in mind.

But that will only prevent userspace from outright crashing.  Once we
expose functionality of this nature, people will come to depend upon it.
We can't stop this.

Suppose $CLUELESS_CORP modifies $LARGE_DATABASE so that it uses these new
fields to optimise its cache population and cache flushout strategies. 
Later, we are forced to remove these fields.  The database now runs all
slowly.

It's just a bad idea to expose deep kernelguts in this way.  We need really
good reasons for doing so, and those reasons should be in the changelog.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 00/26] mount options: fix filesystem's -show_options

2008-01-26 Thread Andrew Morton

 On Thu, 24 Jan 2008 20:33:41 +0100 Miklos Szeredi [EMAIL PROTECTED] wrote:
 Andrew,
 
 Would you please consider these patches for -mm?

Sure, but I'm too lazy to pick through them and work out which ones need
updating, which ones got acked and which ones someone else merged, all on a
very bumpy plane flight ;)

Please resend when the dust has settled?


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 19/26] mount options: fix jfs

2008-01-24 Thread Andrew Morton

 On Thu, 24 Jan 2008 15:15:01 -0600 Dave Kleikamp [EMAIL PROTECTED] wrote:
 On Thu, 2008-01-24 at 20:34 +0100, Miklos Szeredi wrote:
  plain text document attachment (jfs_opts.patch)
  From: Miklos Szeredi [EMAIL PROTECTED]
  
  Add iocharset= and errors= options to /proc/mounts for jfs
  filesystems.
  
  Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 
 Acked-by: Dave Kleikamp [EMAIL PROTECTED]
 
 Andrew,
 Would you like me to add this to the jfs git tree, or would you like to
 handle these patches as a set?
 

My usual algorithm here is to

1: queue all the patches and send the ones which have a maintainer to
   that maintainer until he merges it.

2: If the patches have a dependency upon (say) a VFS patch then I'll
   merge the VFS patch and will then goto 1.

I don't think this particular patch has a VFS depencency so sure, merge
away.  You'll probably see that I merged it anyway, but I'll drop it again
when I see it turn up in your tree (I used to resync with the git trees
at least daily, but I now do this far less frequently because it is such
torture because everyone is paddling in everyone else's puddle).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Andrew Morton

On Wed, 16 Jan 2008 23:12:31 +0100
Miklos Szeredi [EMAIL PROTECTED] wrote:

 The reason, why this patch was dug up, is that if the bdi-sysfs patch
 is going to use device numbers to identify BDIs, then there should be
 a way for the user to map the device number into mount(s).
 
 But it's useful regardless of the bdi-sysfs patch.

Don't know what that is.

 Can this be added to -mm?
 
 In theory it could break userspace, but I think it's very unlikely to
 do so, because stuff is added only at the end of the lines, and
 because most programs probably parse it through the libc interface
 which is not broken by this change.  Despite this, it should be tested
 on as many systems as possible.

Seems like a plain bad idea to me.  There will be any number of home-made
/proc/mounts parsers and we don't know what they do.

 - for mount ID's use IRA instead of a 32bit counter, which could overflow

don't know what an IRA is.

 - print canonical ID's (smallest one within the peer group) for peers
   and master, this is more useful, than a random ID within the same namespace
 - fix a couple of small bugs
 - style fixes
 
 Signed-off-by: Ram Pai [EMAIL PROTECTED]
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Both the newly-added inlines in this patch are wrong.  They will result in
a larger and slower kernel.  This should be very well known by now.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] VFS: extend /proc/mounts

2008-01-16 Thread Andrew Morton

On Thu, 17 Jan 2008 00:58:06 +0100 (CET) Jan Engelhardt [EMAIL PROTECTED] 
wrote:

 
 On Jan 17 2008 00:43, Karel Zak wrote:
  
  Seems like a plain bad idea to me.  There will be any number of home-made
  /proc/mounts parsers and we don't know what they do.
 
  So, let's use /proc/mounts_v2  ;-)
 
 Was not it like don't use /proc for new things?

Well yeah.  If we're going to do a brand new mechanism to expose
per-mount data then we should hunker down and get it right.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [ANNOUNCE] util-linux-ng 2.13.1-rc2

2008-01-05 Thread Andrew Morton

On Wed, 2 Jan 2008 22:10:52 +0100 Karel Zak [EMAIL PROTECTED] wrote:

 
 
  The second util-linux-ng 2.13.1 release candidate is available at
 
 ftp://ftp.kernel.org/pub/linux/utils/util-linux-ng/
 

Interesting.  Thanks.  Which distros are using this, or plan to do so?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Pid namespaces vs locks interaction

2007-12-21 Thread Andrew Morton

On Fri, 21 Dec 2007 15:22:31 +0300
Vitaliy Gusev [EMAIL PROTECTED] wrote:

 fcntl(F_GETLK,..) can return pid of process for not current pid namespace (if 
 process is belonged to the several namespaces). It is true also for pids 
 in /proc/locks. So correct behavior is saving pointer to the struct pid of 
 the process lock owner.
 
 Assigned-off-by: Vitaliy Gusev [EMAIL PROTECTED]

Signed-off-by:, please.

 [diff-pid-to-nspid.patch  text/plain (2.9KB)]

eek.  Please don't include two copies of the same patch in the one email. 
The result applies happily with `patch --dry-run' but then wrecks my tree
when I try to apply it for real.  It tricks me every time, that one.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] rd: support XIP

2007-12-04 Thread Andrew Morton

On Tue, 4 Dec 2007 12:21:00 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 +  *
 +  * Cannot support XIP and highmem, because our -direct_access
 +  * routine for XIP must return memory that is always addressable.
 +  * If XIP was reworked to use pfns and kmap throughout, this
 +  * restriction might be able to be lifted.
*/
 + gfp_flags = GFP_NOIO | __GFP_ZERO;
 +#ifndef CONFIG_BLK_DEV_XIP
 + gfp_flags |= __GFP_HIGHMEM;
 +#endif

A dubious tradeoff?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/2] Make cramfs little endian only

2007-12-04 Thread Andrew Morton

On Tue, 4 Dec 2007 13:01:26 +0100
Andi Drebes [EMAIL PROTECTED] wrote:

 The following patch makes cramfs little endian only. When trying to mount a 
 big endian image,
 an error message is produced.
 
 The changes were tested on the following types of machines:
 An i386 compatible box (little endian)
 UltraSparc IIi (big endian)
 
 Signed-off-by: Andi Drebes [EMAIL PROTECTED]
 ---
  inode.c |  163 
 +---
  1 file changed, 136 insertions(+), 27 deletions(-)
 
 diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
 index 350680f..3fbf567 100644
 --- a/fs/cramfs/inode.c
 +++ b/fs/cramfs/inode.c
 @@ -4,6 +4,10 @@
   * Copyright (C) 1999 Linus Torvalds.
   *
   * This file is released under the GPL.
 + *
 + * Changelog:
 + *   11/07 - Andi Drebes [EMAIL PROTECTED]
 + *   Made cramfs little endian only.
   */
  
  /*
 @@ -40,6 +44,95 @@ static DEFINE_MUTEX(read_mutex);
  #define CRAMINO(x)   (((x)-offset  (x)-size)?(x)-offset2:1)
  #define OFFSET(x)((x)-i_ino)
  
 +#ifdef __BIG_ENDIAN
 +/* Converts a cramfs_info from little endian to big endian. */
 +static inline void cramfs_convert_info_letobe(struct cramfs_info* info)
 +{
 +  info-crc = swab32(info-crc);
 +  info-edition = swab32(info-edition);
 +  info-blocks = swab32(info-blocks);
 +  info-files = swab32(info-files);
 +}
 +
 +/* Converts a cramfs_info from little endian to big endian. */
 +static inline void cramfs_convert_inode_letobe(struct cramfs_inode* inode)
 +{
 + u8* inode_bytes = (u8*)inode;
 + u8 old_nloffs[4];
 +
 + inode-mode = swab16(inode-mode);
 + inode-uid = swab16(inode-uid);
 + inode-size = (inode_bytes[6]  16) | (inode_bytes[5]  8) | 
 (inode_bytes[4]);

eww.  Is there a nicer way of doing that?

Might be a bit tricky given the weird way in which struct cramfs_inode was
defined.

 +
 + /* Save the old values of the namelength and the offset */
 + memcpy(old_nloffs, inode_bytes+8, 4);
 +
 + /* Convert the namelength and the offset */
 + inode_bytes[8] = ((old_nloffs[0]  0x3f)  2) | ((old_nloffs[3]  
 0xc0)  6);
 + inode_bytes[9] = ((old_nloffs[3]  0x3f)  2) | ((old_nloffs[2]  
 0xc0)  6);
 + inode_bytes[10] = ((old_nloffs[2]  0x3f)  2) | ((old_nloffs[1]  
 0xc0)  6);
 + inode_bytes[11] = ((old_nloffs[1]  0x3f)  2) | ((old_nloffs[0]  
 0xc0)  6);
 +}
 +
 +/* Converts a cramfs superblock from little endian to big endian. */
 +static inline void cramfs_convert_super_letobe(struct cramfs_super* super)
 +{
 + super-magic = swab32(super-magic);
 + super-size = swab32(super-size);
 + super-flags = swab32(super-flags);
 + super-future = swab32(super-future);
 + cramfs_convert_info_letobe(super-fsid);
 + cramfs_convert_inode_letobe(super-root);
 +}

These inlines are not sane.  Just removing those three takes the sparc64
fs/cramfs/inode.o from 6856 bytes of text down to 5668, which is rather a
large difference.

The patch has a number of trivial coding-style errors. 
scripts/checkpatch.pl finds them.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 01/19] Define functions for page cache handling

2007-12-03 Thread Andrew Morton

On Fri, 30 Nov 2007 09:34:49 -0800
Christoph Lameter [EMAIL PROTECTED] wrote:

 We use the macros PAGE_CACHE_SIZE PAGE_CACHE_SHIFT PAGE_CACHE_MASK
 and PAGE_CACHE_ALIGN in various places in the kernel. Many times
 common operations like calculating the offset or the index are coded
 using shifts and adds. This patch provides inline functions to
 get the calculations accomplished without having to explicitly
 shift and add constants.
 
 All functions take an address_space pointer. The address space pointer
 will be used in the future to eventually support a variable size
 page cache. Information reachable via the mapping may then determine
 page size.
 
 New functionRelated base page constant
 
 page_cache_shift(a) PAGE_CACHE_SHIFT
 page_cache_size(a)  PAGE_CACHE_SIZE
 page_cache_mask(a)  PAGE_CACHE_MASK
 page_cache_index(a, pos)Calculate page number from position
 page_cache_next(addr, pos)  Page number of next page
 page_cache_offset(a, pos)   Calculate offset into a page
 page_cache_pos(a, index, offset)
 Form position based on page number
 and an offset.
 
 This provides a basis that would allow the conversion of all page cache
 handling in the kernel and ultimately allow the removal of the PAGE_CACHE_*
 constants.
 
 ...

 +/*
 + * Functions that are currently setup for a fixed PAGE_SIZEd. The use of
 + * these will allow the user of largere page sizes in the future.
 + */
 +static inline int mapping_order(struct address_space *a)
 +{
 + return 0;
 +}
 +
 +static inline int page_cache_shift(struct address_space *a)
 +{
 + return PAGE_SHIFT;
 +}
 +
 +static inline unsigned int page_cache_size(struct address_space *a)
 +{
 + return PAGE_SIZE;
 +}
 +
 +static inline unsigned int page_cache_offset(struct address_space *a,
 + loff_t pos)
 +{
 + return pos  ~PAGE_MASK;
 +}
 +
 +static inline pgoff_t page_cache_index(struct address_space *a,
 + loff_t pos)
 +{
 + return pos  page_cache_shift(a);
 +}

These will of course all work OK as they are presently implemented.

But you have callsites doing things like

page_cache_size(page_mapping(page));

which is a whole different thing.  Once page_cache_size() is changed to
look inside the address_space we need to handle races against truncation
and we need to handle the address_space getting reclaimed, etc.

So I think it would be misleading to merge these changes at present - they
make it _look_ like we can have variable PAGE_CACHE_SIZE just by tweaking a
bit of core code, but we in fact cannot do that without a careful review of
all callsites and perhaps the addition of new locking and null-checking.

Now, one possible way around this is to rework all these functions so they
take only a page*, and to create (and assert) the requirement that the caller
has locked the page.  That's a little bit inefficient (additional calls to
page_mapping()) but it does mean that we can now confidently change the
implementation of these functions as you intend.


And a coding nit: when you implement the out-of-line versions of these
functions you're going to stick with VFS conventions and use the identifier
`mapping' to identify the address_space*.  So I think it would be better to
also call in `mapping' in these inlined stubbed functions, rather than `a'.
No?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch] rewrite rd

2007-12-03 Thread Andrew Morton

On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin [EMAIL PROTECTED] wrote:

 Hi,
 
 This is my proposal for a (hopefully) backwards compatible rd driver.
 The motivation for me is not pressing, except that I have this code
 sitting here that is either going to rot or get merged. I'm happy to
 make myself maintainer of this code, but if any real block device
 driver writer would like to, that would be fine by me ;)
 
 Comments?
 
 --
 
 This is a rewrite of the ramdisk block device driver.
 
 The old one is really difficult because it effectively implements a block
 device which serves data out of its own buffer cache. It relies on the dirty
 bit being set, to pin its backing store in cache, however there are non
 trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
 which had recently lead to data corruption. And in general it is completely
 wrong for a block device driver to do this.
 
 The new one is more like a regular block device driver. It has no idea
 about vm/vfs stuff. It's backing store is similar to the buffer cache
 (a simple radix-tree of pages), but it doesn't know anything about page
 cache (the pages in the radix tree are not pagecache pages).
 
 There is one slight downside -- direct block device access and filesystem
 metadata access goes through an extra copy and gets stored in RAM twice.
 However, this downside is only slight, because the real buffercache of the
 device is now reclaimable (because we're not playing crazy games with it),
 so under memory intensive situations, footprint should effectively be the
 same -- maybe even a slight advantage to the new driver because it can also
 reclaim buffer heads.

what about mmap(/dev/ram0)?

 The fact that it now goes through all the regular vm/fs paths makes it
 much more useful for testing, too.
 
textdata bss dec hex filename
2837 849 3844070 fe6 drivers/block/rd.o
3528 371  123911 f47 drivers/block/brd.o
 
 Text is larger, but data and bss are smaller, making total size smaller.
 
 A few other nice things about it:
 - Similar structure and layout to the new loop device handlinag.
 - Dynamic ramdisk creation.
 - Runtime flexible buffer head size (because it is no longer part of the
   ramdisk code).
 - Boot / load time flexible ramdisk size, which could easily be extended
   to a per-ramdisk runtime changeable size (eg. with an ioctl).

This ramdisk driver can use highmem whereas the existing one can't (yes?). 
That's a pretty major difference.  Plus look at the revoltingness in rd.c's
mapping_set_gfp_mask()s.

 +++ linux-2.6/drivers/block/brd.c
 @@ -0,0 +1,536 @@
 +/*
 + * Ram backed block device driver.
 + *
 + * Copyright (C) 2007 Nick Piggin
 + * Copyright (C) 2007 Novell Inc.
 + *
 + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
 + * of their respective owners.
 + */
 +
 +#include linux/init.h
 +#include linux/module.h
 +#include linux/moduleparam.h
 +#include linux/major.h
 +#include linux/blkdev.h
 +#include linux/bio.h
 +#include linux/highmem.h
 +#include linux/gfp.h
 +#include linux/radix-tree.h
 +#include linux/buffer_head.h /* invalidate_bh_lrus() */
 +
 +#include asm/uaccess.h
 +
 +#define SECTOR_SHIFT 9

That's our third definition of SECTOR_SHIFT.

 +#define PAGE_SECTORS_SHIFT   (PAGE_SHIFT - SECTOR_SHIFT)
 +#define PAGE_SECTORS (1  PAGE_SECTORS_SHIFT)
 +
 +/*
 + * Each block ramdisk device has a radix_tree brd_pages of pages that stores
 + * the pages containing the block device's contents. A brd page's -index is
 + * its offset in PAGE_SIZE units. This is similar to, but in no way connected
 + * with, the kernel's pagecache or buffer cache (which sit above our block
 + * device).
 + */
 +struct brd_device {
 + int brd_number;
 + int brd_refcnt;
 + loff_t  brd_offset;
 + loff_t  brd_sizelimit;
 + unsignedbrd_blocksize;
 +
 + struct request_queue*brd_queue;
 + struct gendisk  *brd_disk;
 + struct list_headbrd_list;
 +
 + /*
 +  * Backing store of pages and lock to protect it. This is the contents
 +  * of the block device.
 +  */
 + spinlock_t  brd_lock;
 + struct radix_tree_root  brd_pages;
 +};
 +
 +/*
 + * Look up and return a brd's page for a given sector.
 + */
 +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 +{
 + unsigned long idx;

Could use pgoff_t here if you think that's clearer.

 + struct page *page;
 +
 + /*
 +  * The page lifetime is protected by the fact that we have opened the
 +  * device node -- brd pages will never be deleted under us, so we
 +  * don't need any further locking or refcounting.
 +  *
 +  * This is strictly true for the radix-tree nodes as well (ie. we
 +  * don't actually need the rcu_read_lock()), however that is not a
 +  * documented feature of the radix-tree

Re: [patch 00/19] Page cache: Replace PAGE_CACHE_xx with inline functions

2007-11-29 Thread Andrew Morton

On Thu, 29 Nov 2007 15:20:39 +1100 David Chinner [EMAIL PROTECTED] wrote:

 On Wed, Nov 28, 2007 at 05:10:52PM -0800, Christoph Lameter wrote:
  This patchset cleans up page cache handling by replacing
  open coded shifts and adds with inline function calls.
  
  The ultimate goal is to replace all uses of PAGE_CACHE_xxx in the
  kernel through the use of these functions. All the functions take
  a mapping parameter. The mapping parameter is required if we want
  to support large block sizes in filesystems and block devices.
  
  Patchset against 2.6.24-rc3-mm2.
 
 Reviewed-by: Dave Chinner [EMAIL PROTECTED]

thanks ;)  I'll merge version 2..


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 18/19] Use page_cache_xxx for fs/xfs

2007-11-29 Thread Andrew Morton

On Thu, 29 Nov 2007 15:10:13 +1100 David Chinner [EMAIL PROTECTED] wrote:

 On Wed, Nov 28, 2007 at 08:06:30PM -0800, Christoph Lameter wrote:
  Is this correct?
 
 Yup, looks good now.
 

Given the error rate in the xfs patch my confidence in the rest
of the series isn't terribly high, sorry.

I guess I can suck up the core and xfs bits, but who is going to 
review the rest of them this closely?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext3,4:fdatasync should skip metadata writeout

2007-11-15 Thread Andrew Morton

On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote:

 Currently fdatasync is identical to fsync in ext3,4.
 I think fdatasync should skip journal flush in data=ordered and 
 data=writeback mode
 because this syscall is not required to synchronize the metadata.

I suppose so.  Although one wonders what earthly point there is in syncing
a file's data if we haven't yet written out the metadata which is required
for locating that data.

IOW, fdatasync() is only useful if the application knows that it is overwriting
already-instantiated blocks.

In which case it might as well have used fsync().  For ext2-style filesystems,
anyway.

hm.  It needs some thought.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] ext3,4:fdatasync should skip metadata writeout

2007-11-15 Thread Andrew Morton

On Thu, 15 Nov 2007 22:47:40 -0500 Wendy Cheng [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
 
 On Fri, 16 Nov 2007 11:47:27 +0900 Hisashi Hifumi [EMAIL PROTECTED] wrote:
 
   
 
 Currently fdatasync is identical to fsync in ext3,4.
 I think fdatasync should skip journal flush in data=ordered and 
 data=writeback mode
 because this syscall is not required to synchronize the metadata.
 
 
 
 I suppose so.  Although one wonders what earthly point there is in syncing
 a file's data if we haven't yet written out the metadata which is required
 for locating that data.
 
 IOW, fdatasync() is only useful if the application knows that it is 
 overwriting
 already-instantiated blocks.
 
 In which case it might as well have used fsync().  For ext2-style 
 filesystems,
 anyway.
 
 hm.  It needs some thought.
 
   
 
 
 There are non-trivial amount of performance critical programs, 
 particularly in financial application segment ported from legacy UNIX 
 platforms, know the difference between fsync() and fdatasync(). Those 
 can certainly take advantages of this separation. Don't underestimate 
 the talents of these application programmers.
 

If they're that good, they'll be using sync_file_range() ;)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Massive slowdown when re-querying large nfs dir

2007-11-08 Thread Andrew Morton

 On Thu, 8 Nov 2007 10:44:35 +0300 Al Boldi [EMAIL PROTECTED] wrote:
 Andrew Morton wrote:
I would suggest getting a 'tcpdump -s0' trace and seeing (with
wireshark) what is different between the various cases.
  
   Thanks Neil for looking into this.  Your suggestion has already been
   answered in a previous post, where the difference has been attributed to
   ls -l inducing lookup for the first try, which is fast, and getattr
   for later tries, which is super-slow.
  
   Now it's easy to blame the userland rpc.nfs.V2 server for this, but
   what's not clear is how come 2.4.31 handles getattr faster than 2.6.23?
 
  We broke 2.6?  It'd be interesting to run the ls in an infinite loop on
  the client them start poking at the server.  Is the 2.6 server doing
  physical IO?  Is the 2.6 server consuming more system time?  etc.  A basic
  `vmstat 1' trace for both 2.4 and 2.6 would be a starting point.
 
  Could be that there's some additional latency caused by networking
  changes, too.  I expect the tcpdump/wireshark/etc traces would have
  sufficient resolution for us to be able to see that.
 
 The problem turns out to be tune2fs -O dir_index.
 Removing that feature resolves the big slowdown.

Doh.  Well worked-out.

 Does 2.4.31 support this feature?

No.  This explains it.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Massive slowdown when re-querying large nfs dir

2007-11-07 Thread Andrew Morton

 On Wed, 7 Nov 2007 12:36:26 +0300 Al Boldi [EMAIL PROTECTED] wrote:
 Neil Brown wrote:
  On Tuesday November 6, [EMAIL PROTECTED] wrote:
On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi [EMAIL PROTECTED] wrote:
Al Boldi wrote:
 There is a massive (3-18x) slowdown when re-querying a large nfs dir
 (2k+ entries) using a simple ls -l.

 On 2.6.23 client and server running userland rpc.nfs.V2:
 first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
 more tries: time -p ls -l 2k+ entry dir  in ~8sec

 first  try: time -p ls -l 5k+ entry dir  in ~9sec
 more tries: time -p ls -l 5k+ entry dir  in ~180sec

 On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
 first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
 more tries: time -p ls -l 2k+ entry dir  in ~7sec

 first  try: time -p ls -l 5k+ entry dir  in ~8sec
 more tries: time -p ls -l 5k+ entry dir  in ~43sec

 Remounting the nfs-dir on the client resets the problem.

 Any ideas?
   
Ok, I played some more with this, and it turns out that nfsV3 is a lot
faster.  But, this does not explain why the 2.4.31 kernel is still
over 4-times faster than 2.6.23.
   
Can anybody explain what's going on?
  
   Sure, Neil can! ;)
 
 Thanks Andrew!
 
  Nuh.
  He said userland rpc.nfs.Vx.  I only do kernel-land NFS.  In these
  days of high specialisation, each line of code is owned by a different
  person, and finding the right person is hard
 
  I would suggest getting a 'tcpdump -s0' trace and seeing (with
  wireshark) what is different between the various cases.
 
 Thanks Neil for looking into this.  Your suggestion has already been answered 
 in a previous post, where the difference has been attributed to ls -l 
 inducing lookup for the first try, which is fast, and getattr for later 
 tries, which is super-slow.
 
 Now it's easy to blame the userland rpc.nfs.V2 server for this, but what's 
 not clear is how come 2.4.31 handles getattr faster than 2.6.23?
 

We broke 2.6?  It'd be interesting to run the ls in an infinite loop on the
client them start poking at the server.  Is the 2.6 server doing physical
IO?  Is the 2.6 server consuming more system time?  etc.  A basic `vmstat
1' trace for both 2.4 and 2.6 would be a starting point.

Could be that there's some additional latency caused by networking changes,
too.  I expect the tcpdump/wireshark/etc traces would have sufficient
resolution for us to be able to see that.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 16/31] IGET: Stop FreeVXFS from using iget() and read_inode() [try #5]

2007-11-06 Thread Andrew Morton

On Thu, 25 Oct 2007 17:35:14 +0100 David Howells [EMAIL PROTECTED] wrote:

 Stop the FreeVXFS filesystem from using iget() and read_inode().  Replace
 vxfs_read_inode() with vxfs_iget(), and call that instead of iget().
 vxfs_iget() then uses iget_locked() directly and returns a proper error code
 instead of an inode in the event of an error.
 
 vxfs_fill_super() returns any error incurred when getting the root inode
 instead of EINVAL.

alpha:

fs/freevxfs/vxfs_inode.c:298: error: conflicting types for 'vxfs_iget'
fs/freevxfs/vxfs_extern.h:61: error: previous declaration of 'vxfs_iget' was 
here

Not completely trivial - the ino_t - unsigned long conversion needs to be
propagated fairly widely in there.

(when should we be using ino_t and when should we be using unsigned long,
anyway?)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: + embed-a-struct-path-into-struct-nameidata-instead-of-nd-dentrymnt.pa tch added to -mm tree

2007-11-06 Thread Andrew Morton

 On Tue, 6 Nov 2007 12:30:38 +0100 Jörn Engel [EMAIL PROTECTED] wrote:
 On Tue, 6 November 2007 10:11:49 +0100, Jan Blunck wrote:
  On Mon, Nov 05, Jörn Engel wrote:
  
   This patch changes some 400 lines, most if not all of which get longer
   and more complicated to read.  23 get sufficiently longer to require an
   additional linebreak.  I can't remember complexity being invited into
   the kernel without good reasoning, yet the patch description is
   surprisingly low on reasoning:
Switch from nd-{dentry,mnt} to nd-path.{dentry,mnt} everywhere.
  
  I don't measure complexity by lines of code or length of lines. Maybe I was
  not verbose enough in the description, fair.
 
 If you have a better metric, please share it.  In the paragraph you
 deleted I explicitly asked for _any_ metric that shows favorable
 numbers.  Lacking numbers, we could only argue about our respective
 personal taste.
 
  This is a cleanup series. In mostly no case there is a reason why someone
  would want to use a dentry for itself. This series reflects that fact in
  nameidata where there is absolutly no reason at all.
 
 400+ lines changed in this patch, some 10 in a followup patch that
 combines dentry/vfsmount assignments into a single path assignment.  If
 your argument above was valid, I would expect more simplifications and
 fewer complications.  Call me a sceptic until further patches show up to
 support your point.
 
  It enforced the correct
  order of getting/releasing refcount on dentry,vfsmount pairs.
 
 This argument I buy.
 
  It enables us
  to do some more cleanups wrt lookup (which are coming later).
 
 Please send those patches.  I invite cleanups that do clean things up
 and won't argue against then. ;)
 
  For stacking
  support in VFS it is essential to have the dentry,vfsmount pair in every
  place where you want to traverse the stack.
 
 True, but unrelated to this patch.
 
   If churn is the only effect of this, please considere it NAKed again.
  
  I wonder why you didn't speak up when this series was posted to LKML. It was
  at least posted three times before.
 
 I did speak up.  Once.  If you missed that thread, please forgive me
 missing those in which the same patch I disapproved of were resent
 without me on Cc.
 
 I'm not categorically against this struct path business.  It does have
 some advantages at first glance.  But the patch we're arguing about
 clearly makes code more complicated and harder to read.  We should have
 more than superficial benefits if we decide to pay such a cost.

It sounds like we at least need a better overall changlog, please..
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Massive slowdown when re-querying large nfs dir

2007-11-06 Thread Andrew Morton

 On Tue, 6 Nov 2007 14:28:11 +0300 Al Boldi [EMAIL PROTECTED] wrote:
 Al Boldi wrote:
  There is a massive (3-18x) slowdown when re-querying a large nfs dir (2k+
  entries) using a simple ls -l.
 
  On 2.6.23 client and server running userland rpc.nfs.V2:
  first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
  more tries: time -p ls -l 2k+ entry dir  in ~8sec
 
  first  try: time -p ls -l 5k+ entry dir  in ~9sec
  more tries: time -p ls -l 5k+ entry dir  in ~180sec
 
  On 2.6.23 client and 2.4.31 server running userland rpc.nfs.V2:
  first  try: time -p ls -l 2k+ entry dir  in ~2.5sec
  more tries: time -p ls -l 2k+ entry dir  in ~7sec
 
  first  try: time -p ls -l 5k+ entry dir  in ~8sec
  more tries: time -p ls -l 5k+ entry dir  in ~43sec
 
  Remounting the nfs-dir on the client resets the problem.
 
  Any ideas?
 
 Ok, I played some more with this, and it turns out that nfsV3 is a lot 
 faster.  But, this does not explain why the 2.4.31 kernel is still over 
 4-times faster than 2.6.23.
 
 Can anybody explain what's going on?
 

Sure, Neil can! ;)
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: writeout stalls in current -git

2007-11-05 Thread Andrew Morton

On Fri, 2 Nov 2007 18:33:29 +0800
Fengguang Wu [EMAIL PROTECTED] wrote:

 On Fri, Nov 02, 2007 at 11:15:32AM +0100, Peter Zijlstra wrote:
  On Fri, 2007-11-02 at 10:21 +0800, Fengguang Wu wrote:
  
   Interestingly, no background_writeout() appears, but only
   balance_dirty_pages() and wb_kupdate.  Obviously wb_kupdate won't
   block the process.
  
  Yeah, the background threshold is not (yet) scaled. So it can happen
  that the bdi_dirty limit is below the background limit.
  
  I'm curious though as to these stalls, though, I can't seem to think of
  what goes wrong.. esp since most writeback seems to happen from pdflush.
 
 Me confused too. The new debug patch will confirm whether emerge is
 waiting in balance_dirty_pages().
 
  (or I'm totally misreading it - quite a possible as I'm still recovering
  from a serious cold and not all the green stuff has yet figured out its
  proper place wrt brain cells 'n stuff)
 
 Do take care of yourself.
 
  
  I still have this patch floating around:
 
 I think this patch is OK for 2.6.24 :-)
 
 Reviewed-by: Fengguang Wu [EMAIL PROTECTED] 

I would prefer Tested-by: :(

  
  ---
  Subject: mm: speed up writeback ramp-up on clean systems
  
  We allow violation of bdi limits if there is a lot of room on the
  system. Once we hit half the total limit we start enforcing bdi limits
  and bdi ramp-up should happen. Doing it this way avoids many small
  writeouts on an otherwise idle system and should also speed up the
  ramp-up.

Given the problems we're having in there I'm a bit reluctant to go tossing
hastily put together and inadequately tested stuff onto the fire.  And
that's what this patch looks like to me.

Wanna convince me otherwise?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] NFS: Stop sillyname renames and unmounts from racing

2007-11-05 Thread Andrew Morton

On Sat, 03 Nov 2007 07:09:25 -0400 Steve Dickson [EMAIL PROTECTED] wrote:

 The following patch stops NFS sillyname renames and umounts from racing.

(appropriate cc's added)

 I have a test script does the following:
  1) start nfs server
   2) mount loopback
   3) open file in background
   4) remove file
   5) stop nfs server
   6) kill -9 process which has file open
   7) restart nfs server
   8) umount looback mount.
 
 After umount I got the VFS: Busy inodes after unmount message
 because the processing of the rename has not finished.
 
 Below is a patch that the uses the new silly_count mechanism to
 synchronize sillyname processing and umounts. The patch introduces a
 nfs_put_super() routine that waits until the nfsi-silly_count count
 is zero.
 
 A side-effect of finding and waiting for all the inode to
 find the sillyname processing, is I need to traverse
 the sb-s_inodes list in the supper block. To do that
 safely the inode_lock spin lock has to be held. So for
 modules to be able to see that lock I needed to
 EXPORT_SYMBOL_GPL() it.
 
 Any objections to exporting the inode_lock spin lock?
 If so, how should modules _safely_ access the s_inode list?
 
 steved.
 
 
 Author: Steve Dickson [EMAIL PROTECTED]
 Date:   Wed Oct 31 12:19:26 2007 -0400
 
  Close a unlink/sillyname rename and umount race by added a
  nfs_put_super routine that will run through all the inode
  currently on the super block, waiting for those that are
  in the middle of a sillyname rename or removal.
 
  This patch stop the infamous VFS: Busy inodes after unmount... 
  warning during umounts.
 
  Signed-off-by: Steve Dickson [EMAIL PROTECTED]
 
 diff --git a/fs/inode.c b/fs/inode.c
 index ed35383..da9034a 100644
 --- a/fs/inode.c
 +++ b/fs/inode.c
 @@ -81,6 +81,7 @@ static struct hlist_head *inode_hashtable __read_mostly;
* the i_state of an inode while it is in use..
*/
   DEFINE_SPINLOCK(inode_lock);
 +EXPORT_SYMBOL_GPL(inode_lock);

That's going to make hch unhappy.

Your email client is performing space-stuffing.
See http://mbligh.org/linuxdocs/Email/Clients/Thunderbird

   static struct file_system_type nfs_fs_type = {
   .owner  = THIS_MODULE,
 @@ -223,6 +225,7 @@ static const struct super_operations nfs_sops = {
   .alloc_inode= nfs_alloc_inode,
   .destroy_inode  = nfs_destroy_inode,
   .write_inode= nfs_write_inode,
 + .put_super  = nfs_put_super,
   .statfs = nfs_statfs,
   .clear_inode= nfs_clear_inode,
   .umount_begin   = nfs_umount_begin,
 @@ -1767,6 +1770,30 @@ static void nfs4_kill_super(struct super_block *sb)
   nfs_free_server(server);
   }
 
 +void nfs_put_super(struct super_block *sb)

This was (correctly) declared to be static.  We should define it that way
too (I didn't know you could do this, actually).

 +{
 + struct inode *inode;
 + struct nfs_inode *nfsi;
 + /*
 +  * Make sure there are no outstanding renames
 +  */
 +relock:
 + spin_lock(inode_lock);
 + list_for_each_entry(inode, sb-s_inodes, i_sb_list) {
 + nfsi = NFS_I(inode);
 + if (atomic_read(nfsi-silly_count)  0) {
 + /* Keep this inode around  during the wait */
 + atomic_inc(inode-i_count);
 + spin_unlock(inode_lock);
 + wait_event(nfsi-waitqueue,
 + atomic_read(nfsi-silly_count) == 1);
 + iput(inode);
 + goto relock;
 + }
 + }
 + spin_unlock(inode_lock);
 +}

That's an O(n^2) search.  If it is at all possible to hit a catastrophic
slowdown in here, you can bet that someone out there will indeed hit it in
real life.

I'm too lazy to look, but we might need to check things like I_FREEING
and I_CLEAR before taking a ref on this inode.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

qstr abuse in git-cifs

2007-11-05 Thread Andrew Morton


static int cifs_ci_compare(struct dentry *dentry, struct qstr *a,
   struct qstr *b)
{
struct nls_table *codepage = CIFS_SB(dentry-d_inode-i_sb)-local_nls;

if ((a-len == b-len) 
(nls_strnicmp(codepage, a-name, b-name, a-len) == 0)) {
/*
 * To preserve case, don't let an existing negative dentry's
 * case take precedence.  If a is not a negative dentry, this
 * should have no side effects
 */
memcpy(a-name, b-name, a-len);
return 0;
}
return 1;
}

produces

fs/cifs/dir.c: In function 'cifs_ci_compare':
fs/cifs/dir.c:596: warning: passing argument 1 of '__constant_memcpy' discards 
qualifiers from pointer target type
fs/cifs/dir.c:596: warning: passing argument 1 of '__memcpy' discards 
qualifiers from pointer target type

I suspect that bad things are happening in there.

It's strange for a comparison function to go and alter one of the things
which it's comparing, too.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v3] 0/4 fs/ioctl.c coding style, function renaming/factoring

2007-10-30 Thread Andrew Morton

On Tue, 30 Oct 2007 15:39:55 -0400
Erez Zadok [EMAIL PROTECTED] wrote:

 This series of 4 proposed patches (take 3) changes fs/ioctl.c and Unionfs as
 follows.

The problem is of course that you need these in your tree for ongoing
development, but 2.6.25 is months away.  They look OK to me so I suggest
that you go ahead and commit them to your git tree and I'll drop them
again.  Please resend them for merging in the 2.6.25-rc1 merge window.


unionfs has been hanging around for a long time now and we should work
towards getting it into 2.6.25 or dropped from -mm (sorry).  Right now
would be a great time to get this process underway.  

I have only a partial memory of what the sticking points were, and I have
basically zero knowledge of what was done to address them.  So could you
please, over the next few weeks:

- Send out a description of what the issues were, and how they were addressed

- If issues remain, describe their impact and possible workarounds, all
  that stuff.

- If it mostly-survives all that design-level review and consideration
  then let's go for it: get all the patches cleaned up and consolidated and
  get them emailed out for review no later than 2.6.24-rc5.

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [0/3] Distributed storage. Mirror algo extension for automatic recovery.

2007-10-26 Thread Andrew Morton

On Thu, 18 Oct 2007 23:17:41 +0400
Evgeniy Polyakov [EMAIL PROTECTED] wrote:

 I'm pleased to announce sixth release of the distributed storage
 subsystem, which allows to form a storage on top of remote and local
 nodes, which in turn can be exported to another storage as a node to
 form tree-like storages.

I went back and re-read last month's discussion and I'm not seeing any
reason why we shouldn't start thinking about merging this.

How close is it to that stage?  A peek at your development blog indicates
that things are still changing at a moderate rate?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 01/31] Add an ERR_CAST() macro to complement ERR_PTR and co. [try #5]

2007-10-25 Thread Andrew Morton

On Thu, 25 Oct 2007 16:09:09 -0700
Zach Brown [EMAIL PROTECTED] wrote:

 
  + * ERR_CAST - Explicitly cast an error-valued pointer to another pointer 
  type
  + * @ptr: The pointer to cast.
  + *
  + * Explicitly cast an error-valued pointer to another pointer type in such 
  a
  + * way as to make it clear that's what's going on.
  + */
  +static inline void *ERR_CAST(const void *ptr)
  +{
  +   return (void *) ptr;
  +}
 
 Just to nit, surely you don't need the cast inside the function.  The
 casting happens at the call site between the argument and returned pointer.
 

It'll warn without the cast.

btw, nit time.  This style:

return (void *)ptr;

outnumbers this style

return (void *) ptr;

by 4 to 1.  And I don't find the space attractive, useful or logical,
personally.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fix tmpfs BUG and AOP_WRITEPAGE_ACTIVATE

2007-10-24 Thread Andrew Morton

On Wed, 24 Oct 2007 22:02:15 +0100 (BST)
Hugh Dickins [EMAIL PROTECTED] wrote:

 --- 2.6.24-rc1/mm/shmem.c 2007-10-24 07:16:04.0 +0100
 +++ linux/mm/shmem.c  2007-10-24 20:24:31.0 +0100
 @@ -915,6 +915,11 @@ static int shmem_writepage(struct page *
   struct inode *inode;
  
   BUG_ON(!PageLocked(page));
 + if (!wbc-for_reclaim) {
 + set_page_dirty(page);
 + unlock_page(page);
 + return 0;
 + }
   BUG_ON(page_mapped(page));

Needs a comment, methinks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH -mm] Split fs/Kconfig: ext[234]

2007-10-22 Thread Andrew Morton

On Sat, 6 Oct 2007 12:15:08 +0400
Alexey Dobriyan [EMAIL PROTECTED] wrote:

 Signed-off-by: Alexey Dobriyan [EMAIL PROTECTED]
 ---
 
  fs/Kconfig  |  191 --
  fs/ext2/Kconfig |   55 +
  fs/ext3/Kconfig |   67 
  fs/ext4/Kconfig |   65 +++
  4 files changed, 190 insertions(+), 188 deletions(-)

A reasonable thing to do, but poorly timed.  I'd prefer not to carry a
patch like this through two months of development, please.

Late in the -rc timeframe would be a better time.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Correct behavior on O_DIRECT sparse file writes

2007-10-12 Thread Andrew Morton

On Fri, 12 Oct 2007 16:39:27 -0400
Chris Mason [EMAIL PROTECTED] wrote:

 Hello everyone,
 
 The test below creates a sparse file and then fills a hole with
 O_DIRECT.  As far as I can tell from reading generic_osync_inode, the
 filesystem metadata is only forced to disk if i_size changes during the
 file write.  I've tested ext3, xfs and reiserfs and they all skip the
 commit when filling holes.
 
 I would argue that filling holes via O_DIRECT is supposed to commit the
 metadata required to find those file blocks later.  At least on ext3,
 O_SYNC does force a commit on fill holes  (haven't tested others).
 
 So, is the current behavior a bug or a feature?

I don't think it's a bug.  Sure, O_DIRECT is synchronous, but that's
because it is, err, direct.  Not because it provides extra data-integrity
guarantees.  If you want those guarantees, use O_SYNC as well.

 dd if=/dev/zero of=foo bs=1M seek=1 count=1 oflag=direct
 
 hexdump foo | head -n 2
 000 62b1 ea2d 73e8 c64f f5ef 1af5 dd09 8ccd
 010 75ec 9581 e0ea ae9b e28f b76d a700 4d5b
 
 dd if=/dev/urandom of=foo bs=4k count=1 conv=notrunc oflag=direct
 reboot -nf
 
 (after reboot)
 
 hexdump foo
 000        
 *
 020
 
 -chris
 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: msync(2) bug(?), returns AOP_WRITEPAGE_ACTIVATE to userland

2007-10-11 Thread Andrew Morton

On Sun, 7 Oct 2007 15:20:19 -0400
Erez Zadok [EMAIL PROTECTED] wrote:

 According to vfs.txt, -writepage() may return AOP_WRITEPAGE_ACTIVATE back
 to the VFS/VM.  Indeed some filesystems such as tmpfs can return
 AOP_WRITEPAGE_ACTIVATE; and stackable file systems (e.g., Unionfs) also
 return AOP_WRITEPAGE_ACTIVATE if the lower f/s returned it.
 
 Anyway, some Ubuntu users of Unionfs reported that msync(2) sometimes
 returns AOP_WRITEPAGE_ACTIVATE (decimal 524288) back to userland.
 Therefore, some user programs fail, esp. if they're written such as this:
 
  err = msync(...);
  if (err != 0)
   // fail
 
 They temporarily fixed the specific program in question (apt-get) to check
 
  if (err  0)
   // fail
 
 Is this a bug indeed, or are user programs supposed to handle
 AOP_WRITEPAGE_ACTIVATE (I hope not the latter).  If it's a kernel bug, what
 should the kernel return: a zero, or an -errno (and which one)?
 

shit.  That's a nasty bug.  Really userspace should be testing for -1, but
the msync() library function should only ever return 0 or -1.

Does this fix it?

--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -850,8 +850,10 @@ retry:
 
ret = (*writepage)(page, wbc, data);
 
-   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE))
+   if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) {
unlock_page(page);
+   ret = 0;
+   }
if (ret || (--(wbc-nr_to_write) = 0))
done = 1;
if (wbc-nonblocking  bdi_write_congested(bdi)) {
_

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages

2007-10-08 Thread Andrew Morton

On Mon, 8 Oct 2007 19:45:08 +0800 Yan Zheng [EMAIL PROTECTED] wrote:

 Hi all
 
 The test for VM_CAN_NONLINEAR always fails
 
 Signed-off-by: Yan Zheng[EMAIL PROTECTED]
 
 diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
 --- linux-2.6.23-rc9/mm/fremap.c  2007-10-07 15:03:33.0 +0800
 +++ linux/mm/fremap.c 2007-10-08 19:33:44.0 +0800
 @@ -160,7 +160,7 @@
   if (vma-vm_private_data  !(vma-vm_flags  VM_NONLINEAR))
   goto out;
 
 - if (!vma-vm_flags  VM_CAN_NONLINEAR)
 + if (!(vma-vm_flags  VM_CAN_NONLINEAR))
   goto out;
 
   if (end = start || start  vma-vm_start || end  vma-vm_end)

Lovely.  From this we can deduce that nobody has run remap_file_pages() since
2.6.23-rc1 and that nobody (including the developer who made that change) ran it
while that change was in -mm.

I'm surprise that LTP doesn't have any remap_file_pages() tests.

Have you runtime tested this change?

Thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH]fix VM_CAN_NONLINEAR check in sys_remap_file_pages

2007-10-08 Thread Andrew Morton

On Mon, 8 Oct 2007 10:28:43 -0700
Randy Dunlap [EMAIL PROTECTED] wrote:

 On Mon, 8 Oct 2007 10:04:56 -0700 Andrew Morton wrote:
 
  On Mon, 8 Oct 2007 19:45:08 +0800 Yan Zheng [EMAIL PROTECTED] wrote:
  
   Hi all
   
   The test for VM_CAN_NONLINEAR always fails
   
   Signed-off-by: Yan Zheng[EMAIL PROTECTED]
   
   diff -ur linux-2.6.23-rc9/mm/fremap.c linux/mm/fremap.c
   --- linux-2.6.23-rc9/mm/fremap.c  2007-10-07 15:03:33.0 +0800
   +++ linux/mm/fremap.c 2007-10-08 19:33:44.0 +0800
   @@ -160,7 +160,7 @@
 if (vma-vm_private_data  !(vma-vm_flags  VM_NONLINEAR))
 goto out;
   
   - if (!vma-vm_flags  VM_CAN_NONLINEAR)
   + if (!(vma-vm_flags  VM_CAN_NONLINEAR))
 goto out;
   
 if (end = start || start  vma-vm_start || end  vma-vm_end)
  
  Lovely.  From this we can deduce that nobody has run remap_file_pages() 
  since
  2.6.23-rc1 and that nobody (including the developer who made that change) 
  ran it
  while that change was in -mm.
 
 I've run rmap-test with -M (use remap_file_pages) and
 remap-test from ext3-tools, but not remap_file_pages for some reason.
 
 I'll now add remap_file_pages soon.
 Maybe those other 2 tests aren't strong enough (?).
 Or maybe they don't return a non-0 exit status even when they fail...
 (I'll check.)

Perhaps Yan Zheng can tell us what test was used to demonstrate this?

 
  I'm surprise that LTP doesn't have any remap_file_pages() tests.
 
 quick grep didn't find any for me.

Me either.  There are a few lying around the place which could be
integrated.

It would be good if LTP were to have some remap_file_pages() tests
(please).  As we see here, it is something which we can easily break, and
leave broken for some time.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 1/2] getattr - fill the size of pipes

2007-10-04 Thread Andrew Morton

On Tue, 2 Oct 2007 19:54:53 +0200 (CEST)
Jan Engelhardt [EMAIL PROTECTED] wrote:

 [PATCH]: Fill the size of pipes
 
 Instead of reporting 0 in size when stating() a pipe, we give the number of
 queued bytes. This might avoid using ioctl(FIONREAD) to get this information.
 
 References and derived from: http://lkml.org/lkml/2007/4/2/138
 Cc: Eric Dumazet [EMAIL PROTECTED]
 Signed-off-by: Jan Engelhardt [EMAIL PROTECTED]


Cute feature, but it is (I assume) a Linux-specific extension and is
something which we'll need to maintain for ever and it invites
unportability to older Linuxes and other OSes and it introduces some risk
of breakage of existing applications.  And it slows down fstat on a pipe.

Given that the info can already be obtained via ioctl(FIONREAD) anyway, I
don't think that (gain  pain)?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Andrew Morton

On Mon, 1 Oct 2007 13:55:29 -0700 (PDT)
Christoph Lameter [EMAIL PROTECTED] wrote:

 On Sat, 29 Sep 2007, Andrew Morton wrote:
 
   atomic allocations. And with SLUB using higher order pages, atomic !0
   order allocations will be very very common.
  
  Oh OK.
  
  I thought we'd already fixed slub so that it didn't do that.  Maybe that
  fix is in -mm but I don't think so.
  
  Trying to do atomic order-1 allocations on behalf of arbitray slab caches
  just won't fly - this is a significant degradation in kernel reliability,
  as you've very easily demonstrated.
 
 Ummm... SLAB also does order 1 allocations. We have always done them.
 
 See mm/slab.c
 
 /*
  * Do not go above this order unless 0 objects fit into the slab.
  */
 #define BREAK_GFP_ORDER_HI  1
 #define BREAK_GFP_ORDER_LO  0
 static int slab_break_gfp_order = BREAK_GFP_ORDER_LO;

Do slab and slub use the same underlying page size for each slab?

Single data point: the CONFIG_SLAB boxes which I have access to here are
using order-0 for radix_tree_node, so they won't be failing in the way in
which Peter's machine is.

I've never ever before seen reports of page allocation failures in the
radix-tree node allocation code, and that's the bottom line.  This is just
a drop-dead must-fix show-stopping bug.  We cannot rely upon atomic order-1
allocations succeeding so we cannot use them for radix-tree nodes.  Nor for
lots of other things which we have no chance of identifying.

Peter, is this bug -mm only, or is 2.6.23 similarly failing?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-10-01 Thread Andrew Morton

On Mon, 1 Oct 2007 14:38:55 -0700 (PDT)
Christoph Lameter [EMAIL PROTECTED] wrote:

 On Mon, 1 Oct 2007, Andrew Morton wrote:
 
  Do slab and slub use the same underlying page size for each slab?
 
 SLAB cannot pack objects as dense as SLUB and they have different 
 algorithm to make the choice of order. Thus the number of objects per slab 
 may vary between SLAB and SLUB and therefore also the choice of order to 
 store these objects.
 
  Single data point: the CONFIG_SLAB boxes which I have access to here are
  using order-0 for radix_tree_node, so they won't be failing in the way in
  which Peter's machine is.
 
 Upstream SLUB uses order 0 allocations for the radix tree.

OK, that's a relief.

 MM varies 
 because the use of higher order allocs is more loose if the mobility 
 algorithms are found to be active:
 
 2.6.23-rc8:
 
 Name   Objects ObjsizeSpace Slabs/Part/Cpu  O/S O %Fr %Ef 
 Flg\
 radix_tree_node  14281 552 9.9M 2432/948/17 0  38  79

Ah.  So the already-dropped
slub-exploit-page-mobility-to-increase-allocation-order.patch was the
culprit?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-30 Thread Andrew Morton

On Sun, 30 Sep 2007 05:09:28 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 On Sunday 30 September 2007 05:20, Andrew Morton wrote:
  On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] 
 wrote:
   On Saturday 29 September 2007 19:27, Andrew Morton wrote:
On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra
[EMAIL PROTECTED]
  
   wrote:
  oom-killings, or page allocation failures?  The latter, one hopes.

 Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 
 4.1.2
 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007

 ...


 mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
 Call Trace:
 611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
 611b3888:  [60052ed4] out_of_memory+0x80/0x100
 611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
 611b3948:  [6006c608] allocate_slab+0x5b/0xb0
 611b3968:  [6006c705] new_slab+0x7e/0x183
 611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
 611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
 611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
 611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
 611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
 611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
 611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
 611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
 611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
 611b3b98:  [60056f00] read_pages+0x37/0x9b
 611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
 611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
 611b3c78:  [60050ab4] filemap_fault+0x145/0x278
 611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
 611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
 611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
 611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
 611b3da8:  [60120a7b] __up_read+0x73/0x7b
 611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
 611b3e48:  [60013419] segv+0xac/0x297
 611b3f28:  [60013367] segv_handler+0x68/0x6e
 611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
 611b3f68:  [60023853] userspace+0x13a/0x19d
 611b3fc8:  [60010d58] fork_handler+0x86/0x8d
   
OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.
   
I'm suspecting that did_some_progress thing.
  
   The allocation didn't fail -- it invoked the OOM killer because the
   kernel ran out of unfragmented memory.
 
  We can't run out of unfragmented memory for an order-2 GFP_KERNEL
  allocation in this workload.  We go and synchronously free stuff up to make
  it work.
 
  How did this get broken?
 
 Either no more order-2 pages could be freed, or the ones that were being
 freed were being used by something else (eg. other order-2 slab allocations).

No.  The current design of reclaim (for better or for worse) is that for
order 0,1,2 and 3 allocations we just keep on trying until it works.  That
got broken and I think it got broken at a design level when that
did_some_progress logic went in.  Perhaps something else we did later
worsened things.


   Probably because higher order
   allocations are the new vogue in -mm at the moment ;)
 
  That's a different bug.
 
  bug 1: We shouldn't be doing higher-order allocations in slub because of
  the considerable damage this does to atomic allocations.
 
  bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.
 
 I think one causes 2 as well -- it isn't just considerable damage to atomic
 allocations but to GFP_KERNEL allocations too.

Well sure, because we already broke GFP_KERNEL allocations.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton

On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
 
   start 2 processes that each mmap a separate 64M file, and which does
   sequential writes on them. start a 3th process that does the same with
   64M anonymous.
   
   wait for a while, and you'll see order=1 failures.
  
  Really? That means we can no longer even allocate stacks for forking.
  
  Its surprising that neither lumpy reclaim nor the mobility patches can 
  deal with it? Lumpy reclaim should be able to free neighboring pages to 
  avoid the order 1 failure unless there are lots of pinned pages.
  
  I guess then that lots of pages are pinned through I/O?
 
 memory got massively fragemented, as anti-frag gets easily defeated.
 setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
 order blocks to stay available, so we don't mix types. however 12M on
 128M is rather a lot.
 
 its still on my todo list to look at it further..
 

That would be really really bad (as in: patch-dropping time) if those
order-1 allocations are not atomic.

What's the callsite? 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton

On Sat, 29 Sep 2007 10:47:12 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Sat, 2007-09-29 at 01:13 -0700, Andrew Morton wrote:
  On Fri, 28 Sep 2007 20:25:50 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:
  
   
   On Fri, 2007-09-28 at 11:20 -0700, Christoph Lameter wrote:
   
 start 2 processes that each mmap a separate 64M file, and which does
 sequential writes on them. start a 3th process that does the same with
 64M anonymous.
 
 wait for a while, and you'll see order=1 failures.

Really? That means we can no longer even allocate stacks for forking.

Its surprising that neither lumpy reclaim nor the mobility patches can 
deal with it? Lumpy reclaim should be able to free neighboring pages to 
avoid the order 1 failure unless there are lots of pinned pages.

I guess then that lots of pages are pinned through I/O?
   
   memory got massively fragemented, as anti-frag gets easily defeated.
   setting min_free_kbytes to 12M does seem to solve it - it forces 2 max
   order blocks to stay available, so we don't mix types. however 12M on
   128M is rather a lot.
   
   its still on my todo list to look at it further..
   
  
  That would be really really bad (as in: patch-dropping time) if those
  order-1 allocations are not atomic.
  
  What's the callsite? 
 
 Ah, right, that was the detail... all this lumpy reclaim is useless for
 atomic allocations. And with SLUB using higher order pages, atomic !0
 order allocations will be very very common.

Oh OK.

I thought we'd already fixed slub so that it didn't do that.  Maybe that
fix is in -mm but I don't think so.

Trying to do atomic order-1 allocations on behalf of arbitray slab caches
just won't fly - this is a significant degradation in kernel reliability,
as you've very easily demonstrated.

 One I can remember was:
 
   add_to_page_cache()
 radix_tree_insert()
   radix_tree_node_alloc()
 kmem_cache_alloc()
 
 which is an atomic callsite.
 
 Which leaves us in a situation where we can load pages, because there is
 free memory, but can't manage to allocate memory to track them.. 

Right.  Leading to application failure which for many is equivalent to a
complete system outage.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton

On Sat, 29 Sep 2007 10:53:41 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 
 On Sat, 2007-09-29 at 10:47 +0200, Peter Zijlstra wrote:
 
  Ah, right, that was the detail... all this lumpy reclaim is useless for
  atomic allocations. And with SLUB using higher order pages, atomic !0
  order allocations will be very very common.
  
  One I can remember was:
  
add_to_page_cache()
  radix_tree_insert()
radix_tree_node_alloc()
  kmem_cache_alloc()
  
  which is an atomic callsite.
  
  Which leaves us in a situation where we can load pages, because there is
  free memory, but can't manage to allocate memory to track them.. 
 
 Ah, I found a boot log of one of these sessions, its also full of
 order-2 OOMs.. :-/

oom-killings, or page allocation failures?  The latter, one hopes.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton

On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

  oom-killings, or page allocation failures?  The latter, one hopes.
 
 
 Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
 (Ubuntu 4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
 
 ...
 
 
 mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
 Call Trace:
 611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
 611b3888:  [60052ed4] out_of_memory+0x80/0x100
 611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
 611b3948:  [6006c608] allocate_slab+0x5b/0xb0
 611b3968:  [6006c705] new_slab+0x7e/0x183
 611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
 611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
 611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
 611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
 611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
 611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
 611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
 611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
 611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
 611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
 611b3b98:  [60056f00] read_pages+0x37/0x9b
 611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
 611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
 611b3c78:  [60050ab4] filemap_fault+0x145/0x278
 611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
 611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
 611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
 611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
 611b3da8:  [60120a7b] __up_read+0x73/0x7b
 611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
 611b3e48:  [60013419] segv+0xac/0x297
 611b3f28:  [60013367] segv_handler+0x68/0x6e
 611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
 611b3f68:  [60023853] userspace+0x13a/0x19d
 611b3fc8:  [60010d58] fork_handler+0x86/0x8d

OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
allocations aren't supposed to fail.

I'm suspecting that did_some_progress thing.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [15/17] SLUB: Support virtual fallback via SLAB_VFALLBACK

2007-09-29 Thread Andrew Morton

On Sat, 29 Sep 2007 06:19:33 +1000 Nick Piggin [EMAIL PROTECTED] wrote:

 On Saturday 29 September 2007 19:27, Andrew Morton wrote:
  On Sat, 29 Sep 2007 11:14:02 +0200 Peter Zijlstra [EMAIL PROTECTED] 
 wrote:
oom-killings, or page allocation failures?  The latter, one hopes.
  
   Linux version 2.6.23-rc4-mm1-dirty ([EMAIL PROTECTED]) (gcc version 4.1.2 
   (Ubuntu
   4.1.2-0ubuntu4)) #27 Tue Sep 18 15:40:35 CEST 2007
  
   ...
  
  
   mm_tester invoked oom-killer: gfp_mask=0x40d0, order=2, oomkilladj=0
   Call Trace:
   611b3878:  [6002dd28] printk_ratelimit+0x15/0x17
   611b3888:  [60052ed4] out_of_memory+0x80/0x100
   611b38c8:  [60054b0c] __alloc_pages+0x1ed/0x280
   611b3948:  [6006c608] allocate_slab+0x5b/0xb0
   611b3968:  [6006c705] new_slab+0x7e/0x183
   611b39a8:  [6006cbae] __slab_alloc+0xc9/0x14b
   611b39b0:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b39b8:  [600980f2] do_mpage_readpage+0x3b3/0x472
   611b39e0:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b39f8:  [6006cc81] kmem_cache_alloc+0x51/0x98
   611b3a38:  [6011f89f] radix_tree_preload+0x70/0xbf
   611b3a58:  [6004f8e2] add_to_page_cache+0x22/0xf7
   611b3a98:  [6004f9c6] add_to_page_cache_lru+0xf/0x24
   611b3ab8:  [6009821e] mpage_readpages+0x6d/0x109
   611b3ac0:  [600d59f0] ext3_get_block+0x0/0xf2
   611b3b08:  [6005483d] get_page_from_freelist+0x8d/0xc1
   611b3b88:  [600d6937] ext3_readpages+0x18/0x1a
   611b3b98:  [60056f00] read_pages+0x37/0x9b
   611b3bd8:  [60057064] __do_page_cache_readahead+0x100/0x157
   611b3c48:  [60057196] do_page_cache_readahead+0x52/0x5f
   611b3c78:  [60050ab4] filemap_fault+0x145/0x278
   611b3ca8:  [60022b61] run_syscall_stub+0xd1/0xdd
   611b3ce8:  [6005eae3] __do_fault+0x7e/0x3ca
   611b3d68:  [6005ee60] do_linear_fault+0x31/0x33
   611b3d88:  [6005f149] handle_mm_fault+0x14e/0x246
   611b3da8:  [60120a7b] __up_read+0x73/0x7b
   611b3de8:  [60013177] handle_page_fault+0x11f/0x23b
   611b3e48:  [60013419] segv+0xac/0x297
   611b3f28:  [60013367] segv_handler+0x68/0x6e
   611b3f48:  [600232ad] get_skas_faultinfo+0x9c/0xa1
   611b3f68:  [60023853] userspace+0x13a/0x19d
   611b3fc8:  [60010d58] fork_handler+0x86/0x8d
 
  OK, that's different.  Someone broke the vm - order-2 GFP_KERNEL
  allocations aren't supposed to fail.
 
  I'm suspecting that did_some_progress thing.
 
 The allocation didn't fail -- it invoked the OOM killer because the kernel
 ran out of unfragmented memory.

We can't run out of unfragmented memory for an order-2 GFP_KERNEL
allocation in this workload.  We go and synchronously free stuff up to make
it work.

How did this get broken?

 Probably because higher order
 allocations are the new vogue in -mm at the moment ;)

That's a different bug.

bug 1: We shouldn't be doing higher-order allocations in slub because of
the considerable damage this does to atomic allocations.

bug 2: order-2 GFP_KERNEL allocations shouldn't fail like this.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 11:59:02 -0400 Theodore Tso [EMAIL PROTECTED] wrote:

 On Thu, Sep 27, 2007 at 04:19:12PM +0100, Alan Cox wrote:
   Well it's not my call, just seems like a really bad idea to change the
   error value. You can't claim full coverage for such testing anyway, it's
   one of those things that people will complain about two releases later
   saying it broke app foo.
  
  Strange since we've spent years changing error values and getting them
  right in the past. 
 
 I doubt there any apps which are going to specifically check for EFBIG
 and do soemthing different if they get EOVERFLOW instead.  If it was
 something like EAGAIN or EPERM, I'd be more concerned, but EFBIG
 vs. EOVERFLOW?  C'mon!

Yeah.  There's no correct answer here (apart from get it right the first
time).  There are risks either way, and it _is_ a bug.  Bummer.

  There are real things to worry about - sysfs, sysfs, sysfs, ... and all
  the other crap which is continually breaking stuff, not spec compliance
  corrections that don't break things but move us into compliance with the
  standard
 
 I've got to agree with Alan, the sysfs/udev breakages that we've done
 are far more significant, and the fact that we continue to expose
 internal data structures via sysfs is a gaping open pit is far more
 likely to cause any kind of problems than changing an error return.

Funny you should mention that.  I was staring in astonishment at the
pending sysfs patch pile last night.  Forty syfs patches and twenty-odd
patches against driver core and the kobject layer.

That's a huge amount of churn for a core piece of kernel infrastructure
which has been there for four or five years.  Not a good sign.  I mean,
it's not as if, say, the CPU scheduler guys keep on rewriting all their
junk.

oh, wait..
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch]anon_inodes.c: fix error check in anon_inode_getfd

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 09:54:47 -0700 (PDT) Davide Libenzi [EMAIL PROTECTED] 
wrote:

 On Thu, 27 Sep 2007, Davide Libenzi wrote:
 
  On Wed, 26 Sep 2007, Andrew Morton wrote:
  
   On Thu, 27 Sep 2007 10:30:50 +0800 Yan Zheng [EMAIL PROTECTED] wrote:
   
Hello,

igrab return NULL on error.

Signed-off-by: Yan Zheng[EMAIL PROTECTED]
---
diff -ur linux-2.6.23-rc8/fs/anon_inodes.c linux/fs/anon_inodes.c
--- linux-2.6.23-rc8/fs/anon_inodes.c   2007-09-27 10:05:07.0 
+0800
+++ linux/fs/anon_inodes.c  2007-09-27 10:18:26.0 +0800
@@ -87,8 +87,8 @@
return -ENFILE;

inode = igrab(anon_inode_inode);
-   if (IS_ERR(inode)) {
-   error = PTR_ERR(inode);
+   if (!inode) {
+   error = -ENOENT;
goto err_put_filp;
}
   
   hm, does that code actually need to exist?  igrab() is pretty expensive 
   and
   that fs is permanently mounted anyway...
  
  yes. The inode gets attached the the file*, and on its way out an iput() 
  is done.
 
 Wait. You mean is it worth checking the igrab() return code at all?

Well I was more thinking can we just leave that inode's refcount alone all
the time.  I guess that would be tricky, but I think we can at least use
the much less expensive __iget(), or an open-coded atomic_inc.

inode_lock is a heavily-used singleton.  I remain surprised that it hasn't
bitten us in the ass yet.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch]anon_inodes.c: fix error check in anon_inode_getfd

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 10:56:09 -0700 (PDT) Davide Libenzi [EMAIL PROTECTED] 
wrote:

 Open coded atomic_inc()? Hmm, dunno...

box:/usr/src/25 grep 'atomic_inc.*-i_count' */*.c
fs/block_dev.c: atomic_inc(bdev-bd_inode-i_count);
fs/block_dev.c: atomic_inc(bdev-bd_inode-i_count);
fs/inode.c: atomic_inc(inode-i_count);
fs/inode.c: atomic_inc(inode-i_count);
fs/libfs.c: atomic_inc(inode-i_count);
fs/namei.c: atomic_inc(inode-i_count);
ipc/mqueue.c:   atomic_inc(inode-i_count);
kernel/futex.c: atomic_inc(key-shared.inode-i_count);
mm/shmem.c: atomic_inc(inode-i_count);/* New dentry reference */

The refcount on that inode is never zero *by design*.  So all we'd be doing
here is relying upon our existing design, so I think it'd be an OK thing to do. 
With appropriate code comments, of course.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: 2.6.23-rc8-mm2: BUG near reiserfs_xattr_set

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 21:18:55 +0200
Laurent Riffard [EMAIL PROTECTED] wrote:

 Le 27.09.2007 11:22, Andrew Morton a écrit :
  ftp://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.23-rc8/2.6.23-rc8-mm2/
 
 I've got this BUG a few seconds after I logged in into Gnome desktop :
 
 [partially hand copied BUG]
 BUG: unable to handle kernel NULL pointer dereference at virtual address 
 
 printing eip: c016f55e *pde=0b8a5067 *pte=
 Oops: 0002 [#1] PREEMPT
 ...
 Process beagled...
 ...
 Call Trace:
  __fput+0x124/0x1a9
  fput+0x31/0x35
  reiserfs_xattr_set+0x291/0x2b0 [reiserfs]
  user_set+0x4c/0x57 [reiserfs]
  reiserfs_setxattr+0x81/0xf1 [reiserfs]
  vfs_setxattr+0x7d/0xfa
  setxattr+0xb9/0xd1
  sys_lsetxattr+0x4c/0x85
  sysenter_past_esp+0x57/0x85
 
 EIP: mnt_drop_write+0x5b/0x9d
 

Hi, Dave!

 It's fully reproducible.
 
 /home is mounted with the following options:
/dev/mapper/vglinux1-lvhome on /home type reiserfs 
 (rw,noatime,nodiratime,user_xattr)
 
 This BUG happened with rc8-mm1 too.
 rc6-mm1 works fine.
 I didn't try rc7-mm1.
 
 .config attached.
 ~~
 laurent
 
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] make reiserfs stop using 'struct file' for internal xattr operations

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 14:27:14 -0700
Dave Hansen [EMAIL PROTECTED] wrote:

 On Thu, 2007-09-27 at 22:04 +0100, Christoph Hellwig wrote:
  On Thu, Sep 27, 2007 at 01:53:39PM -0700, Dave Hansen wrote:
   -int reiserfs_commit_write(struct file *f, struct page *page,
   -   unsigned from, unsigned to);
   -int reiserfs_prepare_write(struct file *f, struct page *page,
   -unsigned from, unsigned to);
   +int reiserfs_commit_write(struct page *page, unsigned from, unsigned to);
   +int reiserfs_prepare_write(struct page *page, unsigned from, unsigned 
   to);
  
  I doubt this will work.  These are also used for the -prepare_write
  and -commit_write aops, and the method signature definitively wants
  a file there, even if it's zero..
 
 Oddly enough, I don't see those functions being used in aops:
 
 const struct address_space_operations reiserfs_address_space_operations = {
 .writepage = reiserfs_writepage,
 .readpage = reiserfs_readpage,
 .readpages = reiserfs_readpages,
 .releasepage = reiserfs_releasepage,
 .invalidatepage = reiserfs_invalidatepage,
 .sync_page = block_sync_page,
 .write_begin = reiserfs_write_begin,
 .write_end = reiserfs_write_end,
 .bmap = reiserfs_aop_bmap,
 .direct_IO = reiserfs_direct_IO,
 .set_page_dirty = reiserfs_set_page_dirty,
 };
 
 Plus, reiserfs seems to compile with that patch I just sent.  Sure as
 heck surprised me.
 

That'll be because reiserfs-convert-to-new-aops.patch witched reiserfs over
to -write_begin() and -write_end().

So your stuff becomes dependent on Nick's stuff, and Nick's stuff is still
failing on NFS, I think.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] make reiserfs stop using 'struct file' for internal xattr operations

2007-09-27 Thread Andrew Morton

On Thu, 27 Sep 2007 14:51:25 -0700
Andrew Morton [EMAIL PROTECTED] wrote:

  Plus, reiserfs seems to compile with that patch I just sent.  Sure as
  heck surprised me.
  
 
 That'll be because reiserfs-convert-to-new-aops.patch witched reiserfs over
 to -write_begin() and -write_end().

Actually, we should rename reiserfs_prepare_write and reiserfs_commit_write
to something else to reduce confusion.  Probably lots of other filesystems
would benefit from the same change, post-Nick's-stuff.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch]anon_inodes.c: fix error check in anon_inode_getfd

2007-09-26 Thread Andrew Morton

On Thu, 27 Sep 2007 10:30:50 +0800 Yan Zheng [EMAIL PROTECTED] wrote:

 Hello,
 
 igrab return NULL on error.
 
 Signed-off-by: Yan Zheng[EMAIL PROTECTED]
 ---
 diff -ur linux-2.6.23-rc8/fs/anon_inodes.c linux/fs/anon_inodes.c
 --- linux-2.6.23-rc8/fs/anon_inodes.c 2007-09-27 10:05:07.0 +0800
 +++ linux/fs/anon_inodes.c2007-09-27 10:18:26.0 +0800
 @@ -87,8 +87,8 @@
   return -ENFILE;
 
   inode = igrab(anon_inode_inode);
 - if (IS_ERR(inode)) {
 - error = PTR_ERR(inode);
 + if (!inode) {
 + error = -ENOENT;
   goto err_put_filp;
   }

hm, does that code actually need to exist?  igrab() is pretty expensive and
that fs is permanently mounted anyway...

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Ecryptfs-devel] [PATCH 3/11] eCryptfs: read_write.c routines

2007-09-21 Thread Andrew Morton

On Fri, 21 Sep 2007 16:51:25 -0500
Michael Halcrow [EMAIL PROTECTED] wrote:

 On Wed, Sep 19, 2007 at 10:38:50PM -0700, Andrew Morton wrote:
   + virt = kmap(page_for_lower);
   + rc = ecryptfs_write_lower(ecryptfs_inode, virt, offset, size);
   + kunmap(page_for_lower);
   + return rc;
   +}
  
  argh, kmap.  http://lkml.org/lkml/2007/9/15/55
 
 Here is a patch that moves to kmap_atomic(), adding an intermediate
 copy. Although I would really like to find a way to avoid having to do
 this extra copy.

We might as well stick with kmap.  I was just having a whine - I don't know
what to do about this really, apart from perhaps giving in to reality and
making kmap work better.


btw, I'm not really a great admirer of the whole patchset: it does some
pretty nasty-looking things: allocating dynamic memory, grabbing the
underlying pageframes with virt_to_page(), passing them back into kernel
APIs which are supposed to be called from userspace, etc.  It's all rather
ugly and abusive-looking.

But given that you're trying to do things which the kernel just isn't set
up to do, it isn't immediately obvious what can be done to fix it.

Perhaps there are problems whcih I didn't have time to spot, and perhaps
there are things which could be done to improve it.  But I don't have time
to sit down and absorb it all to a sufficient level of detail to be able to
suggest anything, and nobody else seems to be interested in reading the
patches so whoop, in it all goes.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 02/18] exportfs: add new methods

2007-09-20 Thread Andrew Morton

On Wed, 19 Sep 2007 18:30:25 +0200
Christoph Hellwig [EMAIL PROTECTED] wrote:

 + /*
 +  * It's not a directory.  Life is a little more complicated.
 +  */
 + struct dentry *target_dir, *nresult;
 + char nbuf[NAME_MAX+1];

Lots of stack usage there.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/11] eCryptfs: read_write.c routines

2007-09-19 Thread Andrew Morton

On Mon, 17 Sep 2007 16:46:32 -0500 Michael Halcrow [EMAIL PROTECTED] wrote:

 Add a set of functions through which all I/O to lower files is
 consolidated. This patch adds a new inode_info reference to a
 persistent lower file for each eCryptfs inode; another patch later in
 this series will set that up. This persistent lower file is what the
 read_write.c functions use to call vfs_read() and vfs_write() on the
 lower filesystem, so even when reads and writes come in through
 aops-readpage and aops-writepage, we can satisfy them without
 resorting to direct access to the lower inode's address space.
 Several function declarations are going to be changing with this
 patchset. For now, in order to keep from breaking the build, I am
 putting dummy parameters in for those functions.
 
 ..

 +/**
 + * ecryptfs_write_lower_page_segment
 + * @ecryptfs_inode: The eCryptfs inode
 + * @page_for_lower: The page containing the data to be written to the
 + *  lower file
 + * @offset_in_page: The offset in the @page_for_lower from which to
 + *  start writing the data
 + * @size: The amount of data from @page_for_lower to write to the
 + *lower file
 + *
 + * Determines the byte offset in the file for the given page and
 + * offset within the page, maps the page, and makes the call to write
 + * the contents of @page_for_lower to the lower inode.
 + *
 + * Returns zero on success; non-zero otherwise
 + */
 +int ecryptfs_write_lower_page_segment(struct inode *ecryptfs_inode,
 +   struct page *page_for_lower,
 +   size_t offset_in_page, size_t size)
 +{
 + char *virt;
 + loff_t offset;
 + int rc;
 +
 + offset = (page_for_lower-index  PAGE_CACHE_SHIFT) + offset_in_page;

bug.  You need to cast page.index to loff_t before shifting.

I'd fix it on the spot, but this would be a good time to review the whole
patchset and perhaps the whole fs for this easy-to-do, hard-to-find bug.

 + virt = kmap(page_for_lower);
 + rc = ecryptfs_write_lower(ecryptfs_inode, virt, offset, size);
 + kunmap(page_for_lower);
 + return rc;
 +}

argh, kmap.  http://lkml.org/lkml/2007/9/15/55

 +/**
 + * ecryptfs_write
 + * @ecryptfs_file: The eCryptfs file into which to write
 + * @data: Virtual address where data to write is located
 + * @offset: Offset in the eCryptfs file at which to begin writing the
 + *  data from @data
 + * @size: The number of bytes to write from @data
 + *
 + * Write an arbitrary amount of data to an arbitrary location in the
 + * eCryptfs inode page cache. This is done on a page-by-page, and then
 + * by an extent-by-extent, basis; individual extents are encrypted and
 + * written to the lower page cache (via VFS writes). This function
 + * takes care of all the address translation to locations in the lower
 + * filesystem; it also handles truncate events, writing out zeros
 + * where necessary.
 + *
 + * Returns zero on success; non-zero otherwise
 + */
 +int ecryptfs_write(struct file *ecryptfs_file, char *data, loff_t offset,
 +size_t size)
 +{
 + struct page *ecryptfs_page;
 + char *ecryptfs_page_virt;
 + u64 ecryptfs_file_size = i_size_read(ecryptfs_file-f_dentry-d_inode);

Not loff_t?

 + loff_t data_offset = 0;
 + loff_t pos;
 + int rc = 0;
 +
 + if (offset  ecryptfs_file_size)
 + pos = ecryptfs_file_size;

loff_t = u64.  The typing seems a bit confused?

 + else
 + pos = offset;
 + while (pos  (offset + size)) {
 + pgoff_t ecryptfs_page_idx = (pos  PAGE_CACHE_SHIFT);
 + size_t start_offset_in_page = (pos  ~PAGE_CACHE_MASK);
 + size_t num_bytes = (PAGE_CACHE_SIZE - start_offset_in_page);
 + size_t total_remaining_bytes = ((offset + size) - pos);
 +
 + if (num_bytes  total_remaining_bytes)
 + num_bytes = total_remaining_bytes;
 + if (pos  offset) {
 + size_t total_remaining_zeros = (offset - pos);
 +
 + if (num_bytes  total_remaining_zeros)
 + num_bytes = total_remaining_zeros;
 + }
 + ecryptfs_page = ecryptfs_get1page(ecryptfs_file,
 +   ecryptfs_page_idx);
 + if (IS_ERR(ecryptfs_page)) {
 + rc = PTR_ERR(ecryptfs_page);
 + printk(KERN_ERR %s: Error getting page at 
 +index [%ld] from eCryptfs inode 
 +mapping; rc = [%d]\n, __FUNCTION__,
 +ecryptfs_page_idx, rc);
 + goto out;
 + }
 + if (start_offset_in_page) {
 + /* Read in the page from the lower
 +  * into the eCryptfs inode page cache,
 +  * decrypting */
 + if ((rc =

Re: [PATCH 4/11] eCryptfs: Replace encrypt, decrypt, and inode size write

2007-09-19 Thread Andrew Morton

On Mon, 17 Sep 2007 16:47:10 -0500 Michael Halcrow [EMAIL PROTECTED] wrote:

 Replace page encryption and decryption routines and inode size write
 routine with versions that utilize the read_write.c functions.
 
 Signed-off-by: Michael Halcrow [EMAIL PROTECTED]
 ---
  fs/ecryptfs/crypto.c  |  427 ++--
  fs/ecryptfs/ecryptfs_kernel.h |   14 +-
  fs/ecryptfs/inode.c   |   12 +-
  fs/ecryptfs/mmap.c|  131 -
  fs/ecryptfs/read_write.c  |   12 +-
  5 files changed, 290 insertions(+), 306 deletions(-)
 
 diff --git a/fs/ecryptfs/crypto.c b/fs/ecryptfs/crypto.c
 index 5d8a553..b829d3c 100644
 --- a/fs/ecryptfs/crypto.c
 +++ b/fs/ecryptfs/crypto.c
 @@ -467,8 +467,91 @@ out:
  }
  
  /**
 + * ecryptfs_lower_offset_for_extent
 + *
 + * Convert an eCryptfs page index into a lower byte offset
 + */
 +void ecryptfs_lower_offset_for_extent(loff_t *offset, loff_t extent_num,
 +   struct ecryptfs_crypt_stat *crypt_stat)
 +{
 + (*offset) = ((crypt_stat-extent_size
 +   * crypt_stat-num_header_extents_at_front)
 +  + (crypt_stat-extent_size * extent_num));
 +}
 +
 +/**
 + * ecryptfs_encrypt_extent
 + * @enc_extent_page: Allocated page into which to encrypt the data in
 + *   @page
 + * @crypt_stat: crypt_stat containing cryptographic context for the
 + *  encryption operation
 + * @page: Page containing plaintext data extent to encrypt
 + * @extent_offset: Page extent offset for use in generating IV
 + *
 + * Encrypts one extent of data.
 + *
 + * Return zero on success; non-zero otherwise
 + */
 +static int ecryptfs_encrypt_extent(struct page *enc_extent_page,
 +struct ecryptfs_crypt_stat *crypt_stat,
 +struct page *page,
 +unsigned long extent_offset)
 +{
 + unsigned long extent_base;
 + char extent_iv[ECRYPTFS_MAX_IV_BYTES];
 + int rc;
 +
 + extent_base = (page-index
 +* (PAGE_CACHE_SIZE / crypt_stat-extent_size));

I think this is vulnerable to overflow on 32-bit machines?

 + rc = ecryptfs_derive_iv(extent_iv, crypt_stat,
 + (extent_base + extent_offset));
 + if (rc) {
 + ecryptfs_printk(KERN_ERR, Error attempting to 
 + derive IV for extent [0x%.16x]; 
 + rc = [%d]\n, (extent_base + extent_offset),
 + rc);
 + goto out;
 + }
 + if (unlikely(ecryptfs_verbosity  0)) {
 + ecryptfs_printk(KERN_DEBUG, Encrypting extent 
 + with iv:\n);
 + ecryptfs_dump_hex(extent_iv, crypt_stat-iv_bytes);
 + ecryptfs_printk(KERN_DEBUG, First 8 bytes before 
 + encryption:\n);
 + ecryptfs_dump_hex((char *)
 +   (page_address(page)
 ++ (extent_offset * crypt_stat-extent_size)),
 +   8);
 + }
 + rc = ecryptfs_encrypt_page_offset(crypt_stat, enc_extent_page, 0,
 +   page, (extent_offset
 +  * crypt_stat-extent_size),

maybe that is too, dunno.

 +   crypt_stat-extent_size, extent_iv);
 + if (rc  0) {
 + printk(KERN_ERR %s: Error attempting to encrypt page with 
 +page-index = [%ld], extent_offset = [%ld]; 
 +rc = [%d]\n, __FUNCTION__, page-index, extent_offset,
 +rc);
 + goto out;
 + }
 + rc = 0;
 + if (unlikely(ecryptfs_verbosity  0)) {
 + ecryptfs_printk(KERN_DEBUG, Encrypt extent [0x%.16x]; 
 + rc = [%d]\n, (extent_base + extent_offset),
 + rc);
 + ecryptfs_printk(KERN_DEBUG, First 8 bytes after 
 + encryption:\n);
 + ecryptfs_dump_hex((char *)(page_address(enc_extent_page)), 8);
 + }
 +out:
 + return rc;
 +}
 +
 +/**
   * ecryptfs_encrypt_page
 - * @ctx: The context of the page
 + * @page: Page mapped from the eCryptfs inode for the file; contains
 + *decrypted content that needs to be encrypted (to a temporary
 + *page; not in place) and written out to the lower file
   *
   * Encrypt an eCryptfs page. This is done on a per-extent basis. Note
   * that eCryptfs pages may straddle the lower pages -- for instance,
 @@ -478,128 +561,121 @@ out:
   * file, 24K of page 0 of the lower file will be read and decrypted,
   * and then 8K of page 1 of the lower file will be read and decrypted.
   *
 - * The actual operations performed on each page depends on the
 - * contents of the ecryptfs_page_crypt_context struct.
 - *
   * Returns zero on

Re: [PATCH 6/11] eCryptfs: Update metadata read/write functions

2007-09-19 Thread Andrew Morton

On Mon, 17 Sep 2007 16:48:44 -0500 Michael Halcrow [EMAIL PROTECTED] wrote:

 + if ((rc = ecryptfs_write_lower(ecryptfs_dentry-d_inode,

checkpatch missed the assignment-in-an-if here.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 8/11] eCryptfs: Convert mmap functions to use persistent file

2007-09-19 Thread Andrew Morton

On Mon, 17 Sep 2007 16:50:16 -0500 Michael Halcrow [EMAIL PROTECTED] wrote:

 Convert readpage, prepare_write, and commit_write to use read_write.c
 routines. Remove sync_page; I cannot think of a good reason for
 implementing that in eCryptfs.
 
 Signed-off-by: Michael Halcrow [EMAIL PROTECTED]
 ---
  fs/ecryptfs/mmap.c |  199 
 +++-
  1 files changed, 103 insertions(+), 96 deletions(-)
 
 diff --git a/fs/ecryptfs/mmap.c b/fs/ecryptfs/mmap.c
 index 60e635e..dd68dd3 100644
 --- a/fs/ecryptfs/mmap.c
 +++ b/fs/ecryptfs/mmap.c
 @@ -267,9 +267,78 @@ static void set_header_info(char *page_virt,
  }
  
  /**
 + * ecryptfs_copy_up_encrypted_with_header
 + * @page: Sort of a ``virtual'' representation of the encrypted lower
 + *file. The actual lower file does not have the metadata in
 + *the header.
 + * @crypt_stat: The eCryptfs inode's cryptographic context
 + *
 + * The ``view'' is the version of the file that userspace winds up
 + * seeing, with the header information inserted.
 + */
 +static int
 +ecryptfs_copy_up_encrypted_with_header(struct page *page,
 +struct ecryptfs_crypt_stat *crypt_stat)
 +{
 + loff_t extent_num_in_page = 0;
 + loff_t num_extents_per_page = (PAGE_CACHE_SIZE
 +/ crypt_stat-extent_size);
 + int rc = 0;
 +
 + while (extent_num_in_page  num_extents_per_page) {
 + loff_t view_extent_num = ((page-index * num_extents_per_page)

overflow?

 +   + extent_num_in_page);
 +
 + if (view_extent_num  crypt_stat-num_header_extents_at_front) {
 + /* This is a header extent */
 + char *page_virt;
 +
 + page_virt = kmap_atomic(page, KM_USER0);
 + memset(page_virt, 0, PAGE_CACHE_SIZE);
 + /* TODO: Support more than one header extent */
 + if (view_extent_num == 0) {
 + rc = ecryptfs_read_xattr_region(
 + page_virt, page-mapping-host);
 + set_header_info(page_virt, crypt_stat);
 + }
 + kunmap_atomic(page_virt, KM_USER0);
 + flush_dcache_page(page);
 + if (rc) {
 + ClearPageUptodate(page);
 + printk(KERN_ERR %s: Error reading xattr 
 +region; rc = [%d]\n, __FUNCTION__, rc);
 + goto out;
 + }
 + SetPageUptodate(page);

I don't know what sort of page `page' refers to here, but normally we only
manipulate the page uptodate status under lock_page().


 + } else {
 + /* This is an encrypted data extent */
 + loff_t lower_offset =
 + ((view_extent_num -
 +   crypt_stat-num_header_extents_at_front)
 +  * crypt_stat-extent_size);

overflow?

 + rc = ecryptfs_read_lower_page_segment(
 + page, (lower_offset  PAGE_CACHE_SHIFT),
 + (lower_offset  ~PAGE_CACHE_MASK),
 + crypt_stat-extent_size, page-mapping-host);
 + if (rc) {
 + printk(KERN_ERR %s: Error attempting to read 
 +extent at offset [%lld] in the lower 
 +file; rc = [%d]\n, __FUNCTION__,
 +lower_offset, rc);
 + goto out;
 + }
 + }
 + extent_num_in_page++;
 + }
 +out:
 + return rc;
 +}
 +
 +/**
   * ecryptfs_readpage
 - * @file: This is an ecryptfs file
 - * @page: ecryptfs associated page to stick the read data into
 + * @file: An eCryptfs file
 + * @page: Page from eCryptfs inode mapping into which to stick the read data
   *
   * Read in a page, decrypting if necessary.
   *
 @@ -277,59 +346,35 @@ static void set_header_info(char *page_virt,
   */
  static int ecryptfs_readpage(struct file *file, struct page *page)
  {
 + struct ecryptfs_crypt_stat *crypt_stat =
 + 
 ecryptfs_inode_to_private(file-f_path.dentry-d_inode)-crypt_stat;
   int rc = 0;
 - struct ecryptfs_crypt_stat *crypt_stat;
  
 - BUG_ON(!(file  file-f_path.dentry  file-f_path.dentry-d_inode));
 - crypt_stat = ecryptfs_inode_to_private(file-f_path.dentry-d_inode)
 - -crypt_stat;
   if (!crypt_stat
   || !(crypt_stat-flags  ECRYPTFS_ENCRYPTED)
   || (crypt_stat-flags  ECRYPTFS_NEW_FILE)) {
   ecryptfs_printk(KERN_DEBUG,
   Passing through unencrypted page\n);
 -

Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Andrew Morton

On Tue, 18 Sep 2007 18:00:01 -0700 Mingming Cao [EMAIL PROTECTED] wrote:

 JBD: Replace slab allocations with page cache allocations
 
 JBD allocate memory for committed_data and frozen_data from slab. However
 JBD should not pass slab pages down to the block layer. Use page allocator 
 pages instead. This will also prepare JBD for the large blocksize patchset.
 
 
 Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

__GFP_NOFAIL should only be used when we have no way of recovering
from failure.  The allocation in journal_init_common() (at least)
_can_ recover and hence really shouldn't be using __GFP_NOFAIL.

(Actually, nothing in the kernel should be using __GFP_NOFAIL.  It is 
there as a marker which says we really shouldn't be doing this but
we don't know how to fix it).

So sometime it'd be good if you could review all the __GFP_NOFAILs in
there and see if we can remove some, thanks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [00/41] Large Blocksize Support V7 (adds memmap support)

2007-09-15 Thread Andrew Morton

On Tue, 11 Sep 2007 14:12:26 +0200 Jörn Engel [EMAIL PROTECTED] wrote:

 While I agree with your concern, those numbers are quite silly.  The
 chances of 99.8% of pages being free and the remaining 0.2% being
 perfectly spread across all 2MB large_pages are lower than those of SHA1
 creating a collision.

Actually it'd be pretty easy to craft an application which allocates seven
pages for pagecache, then one for something, then seven for pagecache, then
one for something, etc.

I've had test apps which do that sort of thing accidentally.  The result
wasn't pretty.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC][PATCH] 9p: add readahead support for loose mode

2007-09-15 Thread Andrew Morton

On Fri, 14 Sep 2007 11:02:40 -0500 Eric Van Hensbergen [EMAIL PROTECTED] 
wrote:

 + list_for_each_entry_reverse(tmp_page, page_list, lru) {
 + BUG_ON(count  num_pages);
 + if (add_to_page_cache(tmp_page, mapping,
 + tmp_page-index, GFP_KERNEL)) {
 + page_cache_release(tmp_page);
 + continue;
 + }
 +
 + kv[count].iov_base = kmap(tmp_page);
 + kv[count].iov_len = PAGE_CACHE_SIZE;
 + count++;
 + }
 +
 + read_size = count * PAGE_CACHE_SIZE;
 + if (!read_size)
 + goto cleanup;
 +
 + retval = p9_client_readv(fid, kv, offset, count);
 +
 +cleanup:
 + list_for_each_safe(p, n, page_list) {
 + tmp_page = list_entry(p, struct page, lru);
 + list_del(tmp_page-lru);
 + if (!pagevec_add(lru_pvec, tmp_page))
 + __pagevec_lru_add(lru_pvec);
 + kunmap(tmp_page);
 + flush_dcache_page(tmp_page);
 + SetPageUptodate(tmp_page);
 + unlock_page(tmp_page);
 + }

eww, kmap.  Large amounts of them, apparently.

Be aware that kmap is a) slow and b) deadlockable.  The latter happens when
multiple tasks want to take more than one kmap simultaneously: they all
wait for someone else to release one.  Your code here seems especially
vulnerable to this.

Nick had a kmap-speedup patchset a while back which addressed a) but I
don't know if it addressed the deadlock.  But that patch seemed to die.

Strongly recommend that the code be reworked to use kmap_atomic()


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] [-mm] FS: file name must be unique in the same dir in procfs

2007-09-13 Thread Andrew Morton

 
 From: Zhang Rui [EMAIL PROTECTED]
 File name should be unique in the same directory.
 
 In order to keep the back-compatibility, only a warning is given
 currently, but actions must be taken to fix it when such duplicates
 are detected.
 
 Bug report and a simple fix can be found here: 
 http://bugzilla.kernel.org/show_bug.cgi?id=8798
 
 Signed-off-by: Zhang Rui [EMAIL PROTECTED]
 ---
  fs/proc/generic.c |   12 
  1 file changed, 12 insertions(+)
 
 Index: linux-2.6/fs/proc/generic.c
 ===
 --- linux-2.6.orig/fs/proc/generic.c
 +++ linux-2.6/fs/proc/generic.c
 @@ -523,6 +523,7 @@ static const struct inode_operations pro
  
  static int proc_register(struct proc_dir_entry * dir, struct
 proc_dir_entry * dp)

Your email client is wordwrapping the text.

  {
 + struct proc_dir_entry *tmp = NULL;

That initialisation is unneeded - I removed it.

`tmp' is always a crappy name for anything.  I renamed it to `de'.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] [-mm] FS: file name must be unique in the same dir in procfs

2007-09-10 Thread Andrew Morton

On Mon, 20 Aug 2007 23:50:05 +0800 Zhang Rui [EMAIL PROTECTED] wrote:

 Hi, Oliver,
 Thanks for your comments,
 
 On Mon, 2007-08-20 at 18:45 +0800, Oliver Neukum wrote:
  Am Montag 20 August 2007 schrieb Zhang Rui:
   Files name must be unique in the same directory.
  
   Bug is reported here:
   http://bugzilla.kernel.org/show_bug.cgi?id=8798
  
  Then I'd say fix the callers.
 But at least the callers need to be told that something is wrong first.
 
   This will paper over bugs.
 Hmm, what kind of bugs?
 callers always need to check the return value when calling
 proc_mkdir/create_proc_entry, don't they?
 

Yes, but there's some risk that such a change will cause a
presently-working system to stop working.  It's quite likely, if that
system is checking the procfs-creation return value.

So I think it'd be best if we were to detect the duplication and print a
warning (which should include, if possible, the full pathname and a
dump_stack()) and then the code should proceed as normal: permit the
duplicated entry and return success.

Then, when such duplicates are reported, we can work out what to do about
them on a case-by-case basis.

(btw, please feed your patches through scripts/checkpatch.pl - that one had
a remarkably high coding-style-error-density).
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2/4] 2.6.23-rc5: known regressions

2007-09-03 Thread Andrew Morton

 On Mon, 03 Sep 2007 11:48:00 +0100 H. Peter Anvin [EMAIL PROTECTED] wrote:
 Michal Piotrowski wrote:
  
  Unclassified
  
  Subject : console is messed up after resume from s2ram or switching 
  to console from X
  References  : http://lkml.org/lkml/2007/8/4/6
  Last known good : ?
  Submitter   : Jeff Chua [EMAIL PROTECTED]
  Caused-By   : ?
  Handled-By  : H. Peter Anvin [EMAIL PROTECTED]
Antonino A. Daplas [EMAIL PROTECTED]
  Workaround  : s2ram --force --acpi_sleep 1 --vbe_mode
  Status  : problem is being debugged
  
 
 I'm inclined to write this one off as general STR weirdness.

Both suspend-to-ram and suspend-to-disk are broken on this Vaio.  Running
2.6.23-rc4.


suspend-to-RAM:

a) sometimes hangs during suspend

b) frequently hangs during resume

c) occasionally acts weird after resume.  system requires repeated
   keypresses to make forward progress.

d) on those occasions where resume-from-RAM _does_ work, it takes much
   longer to resume than it used to.

suspend-to-disk:

a) always hangs when netconsole-over-e100 is enabled (might have been a
   2.6.21-2.6.22 regression).

b) usually hangs during suspend


Apart from suspend-to-disk's a), all of the above are post-2.6.21
regressions.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2/4] 2.6.23-rc5: known regressions

2007-09-03 Thread Andrew Morton

 On Mon, 3 Sep 2007 20:36:32 +0800 Jeff Chua [EMAIL PROTECTED] wrote:
 On 9/3/07, Michal Piotrowski [EMAIL PROTECTED] wrote:
 
  Subject : console is messed up after resume from s2ram or switching 
  to console from X
  References  : http://lkml.org/lkml/2007/8/4/6
  Last known good : ?
  Submitter   : Jeff Chua [EMAIL PROTECTED]
  Caused-By   : ?
  Handled-By  : H. Peter Anvin [EMAIL PROTECTED]
Antonino A. Daplas [EMAIL PROTECTED]
  Workaround  : s2ram --force --acpi_sleep 1 --vbe_mode
  Status  : problem is being debugged
 
 Michal,
 
 Can you please close this case. I'm using the workaround and satisfied with 
 it.
 

2.6.21 was OK, and 2.6.23-rc2 needed a manual workaround?

That's a regression.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [2/4] 2.6.23-rc5: known regressions

2007-09-03 Thread Andrew Morton

 On Mon, 3 Sep 2007 05:46:02 -0700 Andrew Morton [EMAIL PROTECTED] wrote:
 Apart from suspend-to-disk's a), all of the above are post-2.6.21
 regressions.

I meant: post-2.6.22 regressions.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Andrew Morton

On Mon, 6 Aug 2007 09:54:03 -0400
Jeff Layton [EMAIL PROTECTED] wrote:

 Apologies for the resend, but the original sending had the date in the
 email header and it caused some of these to bounce...
 
 ( Please consider trimming the Cc list if discussing some aspect of this
 that doesn't concern everyone.)
 
 When an unprivileged process attempts to modify a file that has the
 setuid or setgid bits set, the VFS will attempt to clear these bits. The
 VFS will set the ATTR_KILL_SUID or ATTR_KILL_SGID bits in the ia_valid
 mask, and then call notify_change to clear these bits and set the mode
 accordingly.
 
 With a networked filesystem (NFS in particular but most likely others),
 the client machine may not have credentials that allow for the clearing
 of these bits. In some situations, this can lead to file corruption, or
 to an operation failing outright because the setattr fails.
 
 In this situation, we'd like to just leave the handling of this to
 the server and ignore these bits. The problem is that by the time
 nfs_setattr is called, the VFS has already reinterpreted the ATTR_KILL_*
 bits into a mode change. We can't fix this in the filesystems where
 this is a problem, as doing so would leave us having to second-guess
 what the VFS wants us to do. So we need to change it so that filesystems
 have more flexibility in how to interpret the ATTR_KILL_* bits.
 
 The first patch in the following patchset moves this logic into a helper
 function, and then only calls this helper function for inodes that do
 not have a setattr operation defined. The subsequent patches fix up
 individual filesystem setattr functions to call this helper function.
 
 The upshot of this is that with this change, filesystems that define
 a setattr inode operation are now responsible for handling the ATTR_KILL
 bits as well. They can trivially do so by calling the helper, but they
 must do so.
 
 Some of the follow-on patches may not be strictly necessary, but I
 decided that it was better to take the conservative approach and call
 the helper when I wasn't sure. I've tried to CC the maintainers
 for the individual filesystems as well where I could find them,
 please let me know if there are others who should be informed.
 
 Comments and suggestions appreciated...
 

From a purely practical standpoint: it's a concern that all filesytems need
patching to continue to correctly function after this change.  There might
be filesystems which you missed, and there are out-of-tree filesystems
which won't be updated.

And I think the impact upon the out-of-tree filesystems would be fairly
severe: they quietly and subtly get their secutiry guarantees broken (I
think?)

Is there any way in which we can prevent these problems?  Say

- rename something so that unconverted filesystems will reliably fail to
  compile?

- leave existing filesystems alone, but add a new
  inode_operations.setattr_jeff, which the networked filesytems can
  implement, and teach core vfs to call setattr_jeff in preference to
  setattr?

Something else?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 00/25] move handling of setuid/gid bits from VFS into individual setattr functions (RESEND)

2007-08-07 Thread Andrew Morton

On Tue, 07 Aug 2007 20:45:34 -0400
Trond Myklebust [EMAIL PROTECTED] wrote:

  - rename something so that unconverted filesystems will reliably fail to
compile?
  
  - leave existing filesystems alone, but add a new
inode_operations.setattr_jeff, which the networked filesytems can
implement, and teach core vfs to call setattr_jeff in preference to
setattr?
 
 If you really need to know that the filesystem is handling the flags,
 then how about instead having -setattr() return something which
 indicates which flags it actually handled? That is likely to be a far
 more intrusive change, but it is one which is future-proof.

If we change -setattr so that it will return a positive, non-zero value
which the caller can then check and reliably do printk(that filesystem
needs updating) then that addresses my concern, sure.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch][rfc] fs: fix nobh error handling

2007-08-07 Thread Andrew Morton

On Wed, 8 Aug 2007 04:18:38 +0200 Nick Piggin [EMAIL PROTECTED] wrote:

 On Tue, Aug 07, 2007 at 06:09:03PM -0700, Andrew Morton wrote:
  On Tue, 7 Aug 2007 07:51:29 +0200
  Nick Piggin [EMAIL PROTECTED] wrote:
  
   nobh mode error handling is not just pretty slack, it's wrong.
   
   One cannot zero out the whole page to ensure new blocks are zeroed,
   because it just brings the whole page uptodate with zeroes even if
   that may not be the correct uptodate data. Also, other parts of the
   page may already contain dirty data which would get lost by zeroing
   it out. Thirdly, the writeback of zeroes to the new blocks will also
   erase existing blocks. All these conditions are pagecache and/or
   filesystem corruption.
   
   The problem comes about because we didn't keep track of which buffers
   actually are new or old. However it is not enough just to keep only
   this state, because at the point we start dirtying parts of the page
   (new blocks, with zeroes), the handling of IO errors becomes impossible
   without buffers because the page may only be partially uptodate, in
   which case the page flags allone cannot capture the state of the parts
   of the page.
   
   So allocate all buffers for the page upfront, but leave them unattached
   so that they don't pick up any other references and can be freed when
   we're done. If the error path is hit, then zero the new buffers as the
   regular buffer path does, then attach the buffers to the page so that
   it can actually be written out correctly and be subject to the normal
   IO error handling paths.
   
   As an upshot, we save 1K of kernel stack on ia64 or powerpc 64K page
   systems.
   
   Signed-off-by: Nick Piggin [EMAIL PROTECTED]
   
  
  With this change, nobh_prepare_write() can magically attach buffers to the
  page.  But a filesystem which is running in nobh mode wouldn't expect that,
  and could quite legitimately go BUG, or leak data or, more seriously and
  much less fixably, just go and overwrite page-private, because it knows
  that nobody else is using -private.
 
 I was fairly sure that a filesystem can not assume buffers won't be
 attached, because there are various error case paths thta do exactly
 the same thing (eg. nobh_writepage can call __block_write_full_page
 which will attach buffers). 

oh crap, that's sad.  Either we broke it later on or I misremembered.

 Does any filesystem assume this? Is it not already broken?

Yes, it would be broken.

 
  I'd have thought that it would be better to not attach the buffers and to
  go ahead and do whatever synchronous IO is needed in the error recovery
  code, then free those buffers again.
 
 It is hard because if the synchronous IO fails, then what do you do?

Do what we usually do when an IO error happens: crash the kernel?  (Sorry,
have been spending too long at bugzilla.kernel.org)

 You could try making it up as you go along, but of course if we _can_
 attach the buffers here then it would be preferable to do that. IMO.
  
 
  Also, you have a couple of (cheerily uncommented) PagePrivate() tests in
  there which should be page_has_buffers().
 
 Yeah, I guess the whole thing needs more commenting :P
 page_has_buffers... right, I'll change that.

Did it get much testing?

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Faster ext2_clear_inode()

2007-07-19 Thread Andrew Morton

On Mon, 9 Jul 2007 22:00:03 +0200
Jörn Engel [EMAIL PROTECTED] wrote:

 On Mon, 9 July 2007 22:01:48 +0400, Alexey Dobriyan wrote:
  
  Yes. Note that ext2_clear_inode() is referenced from ext2_sops, so even
  empty, it leaves traces in resulting kernel.
 
 Is that your opinion or have you actually measured a difference?
 I strongly suspect that compilers are smart enough to optimize away a
 call to an empty static function.
 

It saves a big 16 bytes of text here.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] isofs: mounting to regular file may succeed

2007-07-17 Thread Andrew Morton

On Sat, 14 Jul 2007 03:47:21 +0400 Kirill Kuvaldin [EMAIL PROTECTED] wrote:

 It turned out that mounting a corrupted ISO image to a regular file may
 succeed, e.g. if an image was prepared as follows:
 
 $ dd if=correct.iso of=bad.iso bs=4k count=8
 
 We then can mount it to a regular file:
 
 # mount -o loop -t iso9660 bad.iso /tmp/file
 
 But mounting it to a directory fails with -ENOTDIR, simply because 
 the root directory inode doesn't have S_IFDIR set and the condition
 in graft_tree() is met:
 
   if (S_ISDIR(nd-dentry-d_inode-i_mode) !=
 S_ISDIR(mnt-mnt_root-d_inode-i_mode))
   return -ENOTDIR
 
 This is because the root directory inode was read from an incorrect
 block. It's supposed to be read from sbi-s_firstdatazone, which is
 an absolute value and gets messed up in the case of an incorrect image.
 
 In order to somehow circumvent this we have to check that the root
 directory inode is actually a directory after all.
 
 
 Signed-off-by: Kirill Kuvaldin [EMAIL PROTECTED]
 
 diff --git a/fs/isofs/inode.c b/fs/isofs/inode.c
 index 5c3eecf..ce5062a 100644
 --- a/fs/isofs/inode.c
 +++ b/fs/isofs/inode.c
 @@ -840,6 +840,15 @@ root_found:
   goto out_no_root;
   if (!inode-i_op)
   goto out_bad_root;
 +
 + /* Make sure the root inode is a directory */
 + if (!S_ISDIR(inode-i_mode)) {
 + printk(KERN_WARNING
 + isofs_fill_super: root inode is not a directory. 
 + Corrupted media?\n);
 + goto out_iput;
 + }
 +
   /* get the root dentry */
   s-s_root = d_alloc_root(inode);
   if (!(s-s_root))

I don't think any (all?) other filesystems perform checks like this.
Is this something which can/should be performed at the VFS level?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-15 Thread Andrew Morton

On Sun, 15 Jul 2007 21:21:03 +0200 Peter Zijlstra [EMAIL PROTECTED] wrote:

 Shows the current stacktrace where we violate the previously established
 locking order.

yup, but the lock_page() which we did inside truncate_mutex was a 
lock_page() against a different address_space: the blockdev mapping.

So this is OK - we'll never take truncate_mutex against the blockdev
mapping (it doesn't have one, for a start ;))

This is similar to the quite common case where we take inode A's
i_mutex inside inode B's i_mutex, which needs special lockdep annotations.

I think.  I haven't looked into this in detail.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andrew Morton

On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote:

  +   brelse(bh);
  +   up_write(EXT4_I(inode)-xattr_sem);
  +   return error;
  +}
  +
 
 We're doing GFP_KERNEL memory allocations while holding xattr_sem.  This
 can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
 i_truncate_sem and/or journal_start() (I forget whether this still
 happens).  Have we checked whether this can occur and if so, whether we are
 OK from a lock ranking POV?  Bear in mind that journalled-data mode is more
 complex in this regard.

I notice that everyone carefully avoided addressing this ;)

Oh well, hopefully people are testing with lockdep enabled.  As long
as the fs is put under extreme memory pressure, most bugs should be reported.

Except lockdep doesn't know about journal_start(), which has ranking
requirements similar to a semaphore.  Nor does it know about lock_page().
We already have hard-to-hit but deadlockable bugs in this area.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-13 Thread Andrew Morton

On Fri, 13 Jul 2007 16:00:48 +0530 Kalpak Shah [EMAIL PROTECTED] wrote:


   - if (inode-i_nlink = EXT4_LINK_MAX)
   + if (EXT4_DIR_LINK_MAX(inode))
 return -EMLINK;
  
  argh.  WHY_IS_EXT4_FULL_OF_UPPER_CASE_MACROS_WHICH_COULD_BE_IMPLEMENTED
  as_lower_case_inlines?  Sigh.  It's all the old-timers, I guess.
  
  EXT4_DIR_LINK_MAX() is buggy: it evaluates its arg twice.
 
 #define EXT4_DIR_LINK_MAX(dir) (!is_dx(dir)  (dir)-i_nlink = 
 EXT4_LINK_MAX)
 
 This just checks if directory has hash indexing in which case we need not 
 worry about EXT4_LINK_MAX subdir limit. If directory is not hash indexed then 
 we will need to enforce a max subdir limit. 
 
 Sorry, I didn't understand what is the problem with this macro?

Macros should never evaluate their argument more than once, because if they
do they will misbehave when someone passes them an
expression-with-side-effects:

struct inode *p = q;

EXT4_DIR_LINK_MAX(p++);

one expects `p' to have the value q+1 here.  But it might be q+2.

and

EXT4_DIR_LINK_MAX(some_function());

might cause some_function() to be called twice.


This is one of the many problems which gets fixed when we write code in C
rather than in cpp.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andrew Morton

On Fri, 13 Jul 2007 15:33:41 +0200
Peter Zijlstra [EMAIL PROTECTED] wrote:

 On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
 
  Except lockdep doesn't know about journal_start(), which has ranking
  requirements similar to a semaphore.  
 
 Something like so?

Looks OK.

 Or can journal_stop() be done by a different task than the one that did
 journal_start()? - in which case nothing much can be done :-/

Yeah, journal_start() and journal_stop() are well-behaved.
 
 This seems to boot... albeit I did not push it hard.

I fear the consequences of this change :(

Oh well, please keep it alive, maybe beat on it a bit, resend it
later on?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:25 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Journal checksum feature has been added to detect corruption of journal.

That was brief.  No description of what it does, how it does it, why it
does it, how one operates it, why (or why not) one would choose to enable
it nor of the costs of enabling it.

 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 
 diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c
 --- linux024/fs/ext4/super.c  2007-06-25 16:19:24.0 -0500
 +++ linux/fs/ext4/super.c 2007-06-26 08:35:16.0 -0500
 @@ -721,6 +721,7 @@ enum {
   Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
   Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
   Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
 + Opt_journal_checksum, Opt_journal_async_commit,

A new user-visible feature should be accompanied by an update to
Documentation/filesystems/ext4.txt?

   Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
 @@ -760,6 +761,8 @@ static match_table_t tokens = {
   {Opt_journal_update, journal=update},
   {Opt_journal_inum, journal=%u},
   {Opt_journal_dev, journal_dev=%u},
 + {Opt_journal_checksum, journal_checksum},
 + {Opt_journal_async_commit, journal_async_commit},

What's journal_async_commit?

   {Opt_abort, abort},
   {Opt_data_journal, data=journal},
   {Opt_data_ordered, data=ordered},
 @@ -948,6 +951,13 @@ static int parse_options (char *options,
   return 0;
   *journal_devnum = option;
   break;
 + case Opt_journal_checksum:
 + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
 + break;
 + case Opt_journal_async_commit:
 + set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT);
 + set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
 + break;
   case Opt_noload:
   set_opt (sbi-s_mount_opt, NOLOAD);
   break;
 @@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super
   goto failed_mount4;
   }
  
 + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
 + jbd2_journal_set_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
 + jbd2_journal_set_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
 + jbd2_journal_clear_features(sbi-s_journal, 0, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + } else {
 + jbd2_journal_clear_features(sbi-s_journal,
 + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
 + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
 + }

Some discussion of the forward- and backward- compatibility design would be
appropriate.  Also a description of whether and how fsck can be used to fix
up these feature flags.

   /* We have now updated the journal if required, so we can
* validate the data journaling mode. */
   switch (test_opt(sb, DATA_FLAGS)) {
 diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c
 --- linux024/fs/jbd2/commit.c 2007-06-25 16:19:25.0 -0500
 +++ linux/fs/jbd2/commit.c2007-06-26 08:40:03.0 -0500
 @@ -21,6 +21,7 @@
  #include linux/mm.h
  #include linux/pagemap.h
  #include linux/jiffies.h
 +#include linux/crc32.h

I think we just broke CONFIG_EXT4=y, CONFIG_CRC32=n builds.

  /*
   * Default IO end handler for temporary BJ_IO buffer_heads.
 @@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour
   return 1;
  }
  
 -/* Done it all: now write the commit record.  We should have
 +/*
 + * Done it all: now submit the commit record.  We should have
   * cleaned up our previous buffers by now, so if we are in abort
   * mode we can now just skip the rest of the journal write
   * entirely.
   *
   * Returns 1 if the journal needs to be aborted or 0 on success
   */
 -static int journal_write_commit_record(journal_t *journal,
 - transaction_t *commit_transaction)
 +static int journal_submit_commit_record(journal_t *journal,
 + transaction_t *commit_transaction,
 + struct buffer_head **cbh,
 + __u32 crc32_sum)
  {
   struct journal_head *descriptor;
   struct buffer_head *bh;
 @@ -117,21 +121,36 @@ static int journal_write_commit_record(j
  
   bh

Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
 Date: Sun, 01 Jul 2007 03:38:51 -0400
 Sender: [EMAIL PROTECTED]
 Organization: IBM Linux Technology Center
 X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent compilation fixes

I hope this patch series will hit git with nice titles like ext4: extent
compilation fixes and not funny ones like
Morecleanups:ext4_extent_compilation_fixes.

 Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

conversions.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Andrew Morton

On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 From: Dmitry Monakhov [EMAIL PROTECTED]
 Subject: ext4: extent macros cleanup
 
 - Replace math equation to it's macro equivalent

s/it's/its/;)

 - make ext4_ext_grow_indepth() indexes/leaf correct

hm, what was wrong with it?

 Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
 Acked-by: Alex Tomas [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 ---
  fs/ext4/extents.c |   11 +++
  1 files changed, 7 insertions(+), 4 deletions(-)
 
 diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
 index 12fe3d7..1fd00ac 100644
 --- a/fs/ext4/extents.c
 +++ b/fs/ext4/extents.c
 @@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
 ext4_ext_path *path, int bloc
   ext_debug(binsearch for %d(idx):  , block);
 
   l = EXT_FIRST_INDEX(eh) + 1;
 - r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_INDEX(eh);
   while (l = r) {
   m = l + (r - l) / 2;
   if (block  le32_to_cpu(m-ei_block))
 @@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
 ext4_ext_path *path, int block)
   ext_debug(binsearch for %d:  , block);
 
   l = EXT_FIRST_EXTENT(eh) + 1;
 - r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1;
 + r = EXT_LAST_EXTENT(eh);
 
   while (l = r) {
   m = l + (r - l) / 2;
 @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
 struct inode *inode,
   curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
   curp-p_hdr-eh_entries = cpu_to_le16(1);
   curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
 - /* FIXME: it works, but actually path[0] can be index */
 - curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 + 
 + if (path[0].p_hdr-eh_depth)
 +   curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
 + else
 +   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;

whitespace bustage there.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 00:38:09 -0500 Jose R. Santos [EMAIL PROTECTED] wrote:

 
  Alternatively (and preferably) do this via an update to
  Documentation/filesystems/ext4.txt.
 
 Seems like I also need to update the doc on Kconfig as well.  Do you
 prefer this in separate patches? (current patch, kconfig patch, ext4
 doc update patch?

All these changes are logically connected (aren't they?).  A single patch
is fine.

  Shoudln't all this debug info be a per-superblock thing rather than
  kernel-wide?
 
 I don't think it is worth pursuing this feature since this seems to
 have been broken for a while now (its been there since the first git
 revission in ext3) and nobody has noticed it until now.  It could be
 address on a later patch though, since the initial purpose of the patch
 was to fix the broken JBD2_DEBUG option. Of course, this may not be
 clearly express in the changelog. :)
 

I don't think that making it all per-superblock is worth the effort - it's
a developer-only thing and developer will have the knowledge to test ext4
on an otherwise-ext3 setup if they're really fussed about the accuracy.

So yes, a bare make-it-work patch sounds appropriate.  Or remove it, but
hey, it might be useful.  The timestamping stuff certainly looks useful.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andrew Morton

On Tue, 10 Jul 2007 23:18:50 -0400 Mingming Cao [EMAIL PROTECTED] wrote:

 On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
  On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  
   David Chinneer pointed that we need to journal the version number
   updates together with the operations that causes the change of the inode
   version number, in order to survive server crashes so clients won't see
   the counter go backwards.
   
   So increment i_version in fs code is probably the place to ensure the
   inode version changes are stored to disk. It's seems update the ext4
   inode version in every ext4_mark_inode_dirty() is the easiest way.
  
  That still makes us dependent upon _something_ changing the inode.  For
  overwrites the only something is mtime.
  
  If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
  I don't think we do) then I guess we'll need new code in or around
  file_update_time() to do this.
 
 do you mean mark inode dirty all the times in file_update_time()? Not
 sure about the overhead for ext3/4.
 

hm, I hadn't thought about it in any detail.

Maybe something like

--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1238,6 +1238,11 @@ void file_update_time(struct file *file)
sync_it = 1;
}
 
+   if (IS_I_VERSION_64(inode)) {
+   inode_inc_iversion(inode);  /* Takes i_lock on 32-bit */
+   sync_it = 1;
+   }
+
if (sync_it)
mark_inode_dirty_sync(inode);
 }
_

?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: -mm merge plans for 2.6.23

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 12:39:42 +0100 David Woodhouse [EMAIL PROTECTED] wrote:

 On Wed, 2007-07-11 at 13:35 +0200, Christoph Hellwig wrote:
  On Tue, Jul 10, 2007 at 01:31:52AM -0700, Andrew Morton wrote:
   romfs-printk-format-warnings.patch
  
  NACK on this one. 
 
 The rest of it is nacked anyway, until we unify the point and
 get_unmapped_area methods of the MTD API.
 

Methinks you meant
nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, not
romfs-printk-format-warnings.patch.

I'll drop nommu-make-it-possible-for-romfs-to-use-mtd-devices.patch, thamks.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 07:01:08 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

-   /* AKPM: buglet - add `i' to tmp! */
   
   Damn.  After, what, seven years, someone actually fixed it?
   
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   struct commit_header *tmp =
+   (struct commit_header *)(bh-b_data + i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = 
cpu_to_be32(commit_transaction-t_tid);
+
+   if (JBD2_HAS_COMPAT_FEATURE(journal,
+   
JBD2_FEATURE_COMPAT_CHECKSUM)) {
+   tmp-h_chksum_type  = JBD2_CRC32_CHKSUM;
+   tmp-h_chksum_size  = 
JBD2_CRC32_CHKSUM_SIZE;
+   tmp-h_chksum[0]= 
cpu_to_be32(crc32_sum);
+   }
}
   
   And in doing so, changed the on-disk format of the journal commit blocks.
   
   Surely this was worth a mention in the changelog, if not a standalone 
   patch?
   
   I don't think this is worth doing, really.  Why not just leave the format
   as it was, take the loop out and run this code once rather than eight
   times?
 
 Well, we aren't using the rest of the commit block in any case.  I think
 the original intention was that we'd get 8 copies of the commit block so
 we would be sure to get a good one.
 
 I don't know whether we'd rather have 8 copies of the commit block, or
 more potential to expand the commit block?  I don't personally have any
 preference, since the checksum should be a more robust way of checking
 validity than having multiple copies, so we may as well remove the loop
 and stick with a single copy for now.

We've never altered any commit block sectors apart from the zeroeth one
(eight times) due to the above bug.  So I'd suggest that we should formalise
the old bug and leave the format as-is.  That'll leave lots of space spare in
the commit block.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andrew Morton

On Wed, 11 Jul 2007 06:10:56 -0600 Andreas Dilger [EMAIL PROTECTED] wrote:

 On Jul 10, 2007  16:32 -0700, Andrew Morton wrote:
 err = ext4_reserve_inode_write(handle, inode, iloc);
   + if (EXT4_I(inode)-i_extra_isize 
   + EXT4_SB(inode-i_sb)-s_want_extra_isize 
   + !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
   + /* We need extra buffer credits since we may write into EA block
   +  * with this same handle */
   + if ((jbd2_journal_extend(handle,
   +  EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
   + ret = ext4_expand_extra_isize(inode,
   + EXT4_SB(inode-i_sb)-s_want_extra_isize,
   + iloc, handle);
   + if (ret) {
   + EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
   + if (!expand_message) {
   + ext4_warning(inode-i_sb, __FUNCTION__,
   + Unable to expand inode %lu. Delete
   +  some EAs or run e2fsck.,
   + inode-i_ino);
   + expand_message = 1;
   + }
   + }
   + }
   + }
  
  Maybe that message should come out once per mount rather than once per boot
  (or once per modprobe)?
 
 Probably true.
 
  What are the consequences of a jbd2_journal_extend() failure in here?
 
 Not fatal, just like every user of i_extra_isize.  If the inode isn't a
 large inode, or it can't be expanded then there will be a minor loss of
 functionality on that inode.  If the i_extra_isize is critical, then
 the sysadmin will have run e2fsck to force s_min_extra_isize large enough.
 
 Note that this is only applicable for filesystems which are upgraded.  For
 new inodes (i.e. all inodes that exist in the filesystem if it was always
 run on a kernel with the currently understood extra fields) then this will
 never be invoked (until such a time new extra fields are added).

I'd suggest that we get a comment in the code explaining this: this
unchecked error does rather stand out.

   + if (EXT4_I(inode)-i_file_acl) {
   + bh = sb_bread(inode-i_sb, EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + if (!bh)
   + goto cleanup;
   + if (ext4_xattr_check_block(bh)) {
   + ext4_error(inode-i_sb, __FUNCTION__,
   + inode %lu: bad block %llu, inode-i_ino,
   + EXT4_I(inode)-i_file_acl);
   + error = -EIO;
   + goto cleanup;
   + }
   + base = BHDR(bh);
   + first = BFIRST(bh);
   + end = bh-b_data + bh-b_size;
   + min_offs = end - base;
   + free = ext4_xattr_free_space(first, min_offs, base,
   +  total_blk);
   + if (free  new_extra_isize) {
   + if (!tried_min_extra_isize  s_min_extra_isize) {
   + tried_min_extra_isize++;
   + new_extra_isize = s_min_extra_isize;
  
  Aren't we missing a brelse(bh) here?
 
 Seems likely, yes.

OK - could we get a positive ack from someone indicating that this will get
looked at?  Because I am about to forget about it.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:35:48 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Add a mount option to turn off extents.

Please update the changelog to describe the reason for making this change.


 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 ---
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:18.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:02:22.0 -0700
 @@ -725,7 +725,7 @@
   Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
   Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
   Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
 - Opt_grpquota, Opt_extents,
 + Opt_grpquota, Opt_extents, Opt_noextents,
  };
  
  static match_table_t tokens = {
 @@ -776,6 +776,7 @@
   {Opt_usrquota, usrquota},
   {Opt_barrier, barrier=%u},
   {Opt_extents, extents},
 + {Opt_noextents, noextents},
   {Opt_err, NULL},
   {Opt_resize, resize},
  };
 @@ -,6 +1112,9 @@
   case Opt_extents:
   set_opt (sbi-s_mount_opt, EXTENTS);
   break;
 + case Opt_noextents:
 + clear_opt (sbi-s_mount_opt, EXTENTS);
 + break;
   default:
   printk (KERN_ERR
   EXT4-fs: Unrecognized mount option \%s\ 
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:01 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Turn on extents feature by default in ext4 filesystem. User could use
 -o noextents to turn it off.
 

Oh, there you go.

 
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 17:02:22.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 17:03:09.0 -0700
 @@ -1546,6 +1546,12 @@
  
   set_opt(sbi-s_mount_opt, RESERVATION);
  
 + /*
 +  * turn on extents feature by default in ext4 filesystem
 +  * User -o noextents to turn it off
 +  */
 + set_opt (sbi-s_mount_opt, EXTENTS);
 +

Broken coding style.

Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
version 0.07 - see Andy's patch on lkml) and then consider addressing the
(quite large) number of mistakes which are detected.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:32 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
 than 32bit block sizes during mount time.  This ensure proper record
 lenth when writing to the journal.

This patch isn't in Ted's kernel.org directory and hasn't been in -mm. 
Where did it come from?  Is something amiss with ext4 patch management?

 Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
 ---
  fs/ext4/super.c |   11 +++
  1 file changed, 11 insertions(+)
 
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-11 16:15:54.0 
 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c  2007-06-11 16:16:10.0 -0700
 @@ -1804,6 +1804,13 @@

Please prepare patches using `diff -p'

   goto failed_mount3;
   }
  
 + if (ext4_blocks_count(es)  0xULL 
 + !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0,
 +JBD2_FEATURE_INCOMPAT_64BIT)) {
 + printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n);
 + goto failed_mount4;
 + }

It is not appropriate for the text ext4 to appear in a JBD2 message.

   /* We have now updated the journal if required, so we can
* validate the data journaling mode. */
   switch (test_opt(sb, DATA_FLAGS)) {


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 1/5] cleanups: Propagate some i_flags to disk

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:12 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 Propagate flags such as S_APPEND, S_IMMUTABLE, etc. from i_flags into
 ext4-specific i_flags. Hence, when someone sets these flags via a different
 interface than ioctl, they are stored correctly.
 

This changelog is inadequate.  I had to hunt down the equivalent ext3
patch's changelog to understand the reasons for this change.  Please update
this patch's changelog using the below: 

ext3: copy i_flags to inode flags on write

A patch that stores inode flags such as S_IMMUTABLE, S_APPEND, etc.  from
i_flags to EXT3_I(inode)-i_flags when inode is written to disk.  The same
thing is done on GETFLAGS ioctl.

Quota code changes these flags on quota files (to make it harder for
sysadmin to screw himself) and these changes were not correctly propagated
into the filesystem (especially, lsattr did not show them and users were
wondering...).

Propagate flags such as S_APPEND, S_IMMUTABLE, etc.  from i_flags into
ext3-specific i_flags.  Hence, when someone sets these flags via a
different interface than ioctl, they are stored correctly.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:48 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

  On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
   The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
   create_proc_entry() does not do lookups on file names with more that one
   directory deep.  This causes the entry creation to fail and hence, no proc
   file is created.  This patch moves the file to /proc/jbd2-degug.
   
   The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
   some minor alterations to the jbd-stats patch.
  
  I don't think we really want to be adding top-level files in /proc.
  What about using the debugfs filesystem (not to be confused with
  the e2fsprogs 'debugfs' command)?
 
 How about this then?  Moved the file to use debugfs as well as having
 the nice effect of removing more lines than what it adds.
 
 Signed-off-by: Jose R. Santos [EMAIL PROTECTED]

Please clean up the changelog.

The changelog should include information about the location and the content
of these debugfs files.  it should provide any instructions which users
will need to be able to create and use those files.

Alternatively (and preferably) do this via an update to
Documentation/filesystems/ext4.txt.

  fs/jbd2/journal.c|   6220 +42 -0 !
  include/linux/jbd2.h |21 + 1 - 0 !
  2 files changed, 21 insertions(+), 43 deletions(-)

Again, this patch isn't in Ted's kernel.org directory and hasn't been in -mm.

Apart from the lack of testing and review which this causes, it means I
can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
I squint at the diff, but that's harder when the diff wasn't prepared with
`diff -p'.  Oh well.


 Index: linux-2.6.22-rc4/fs/jbd2/journal.c
 ===
 --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 16:16:18.0 
 -0700
 +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
 -0700
 @@ -35,6 +35,7 @@
  #include linux/kthread.h
  #include linux/poison.h
  #include linux/proc_fs.h
 +#include linux/debugfs.h
  
  #include asm/uaccess.h
  #include asm/page.h
 @@ -1954,60 +1955,37 @@
   * /proc tunables
   */
  #if defined(CONFIG_JBD2_DEBUG)
 -int jbd2_journal_enable_debug;
 +u16 jbd2_journal_enable_debug;
  EXPORT_SYMBOL(jbd2_journal_enable_debug);
  #endif
  
 -#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
 +#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)

Has this been compile-tested with CONFIG_DEBUGFS=n?

  
 -#define create_jbd_proc_entry() do {} while (0)
 -#define jbd2_remove_jbd_proc_entry() do {} while (0)
 +#define jbd2_create_debugfs_entry() do {} while (0)
 +#define jbd2_remove_debugfs_entry() do {} while (0)

I suggest that these be converted to (preferable) inline functions while
you're there.

  #endif
  
 @@ -2067,7 +2045,7 @@
   ret = journal_init_caches();
   if (ret != 0)
   jbd2_journal_destroy_caches();
 - create_jbd_proc_entry();
 + jbd2_create_debugfs_entry();
   return ret;
  }
  
 @@ -2078,7 +2056,7 @@
   if (n)
   printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
  #endif
 - jbd2_remove_jbd_proc_entry();
 + jbd2_remove_debugfs_entry();
   jbd2_journal_destroy_caches();
  }
  
 Index: linux-2.6.22-rc4/include/linux/jbd2.h
 ===
 --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
 16:16:18.0 -0700
 +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
 -0700
 @@ -57,7 +57,7 @@
   * CONFIG_JBD2_DEBUG is on.
   */
  #define JBD_EXPENSIVE_CHECKING

JBD2?

 -extern int jbd2_journal_enable_debug;
 +extern u16 jbd2_journal_enable_debug;

Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
going to do that.


Shoudln't all this debug info be a per-superblock thing rather than
kernel-wide?
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:36:56 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch is a spinoff of the old nanosecond patches.

I don't know what the old nanosecond patches are.  A link to a suitable
changlog for those patches would do in a pinch.  Preferable would be to
write a proper changelog for this patch.

 It includes some cleanups and addition of a creation timestamp. The
 EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
 s_{min, want}_extra_isize fields in struct ext3_super_block.
 
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
 Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 
 Index: linux-2.6.22-rc4/fs/ext4/ialloc.c

Please include diffstat output when preparing patches.

 +
 +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)\
 + ((offsetof(typeof(*ext4_inode), field) +\
 +   sizeof((ext4_inode)-field))  \
 + = (EXT4_GOOD_OLD_INODE_SIZE +  \
 + (einode)-i_extra_isize))   \

Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
under what circumstances something will not fit in an inode and what the
consequences of this are.

 +static inline __le32 ext4_encode_extra_time(struct timespec *time)
 +{
 +   return cpu_to_le32((sizeof(time-tv_sec)  4 ?
 +time-tv_sec  32 : 0) |
 +((time-tv_nsec  2)  EXT4_NSEC_MASK));
 +}
 +
 +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
 extra)
 +{
 +   if (sizeof(time-tv_sec)  4)
 +time-tv_sec |= (__u64)(le32_to_cpu(extra)  EXT4_EPOCH_MASK)
 + 32;
 +   time-tv_nsec = (le32_to_cpu(extra)  EXT4_NSEC_MASK)  2;
 +}

Consider uninlining these functions.

 +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)
\
 +do {\
 + (raw_inode)-xtime = cpu_to_le32((inode)-xtime.tv_sec);   \
 + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
 + (raw_inode)-xtime ## _extra = \
 + ext4_encode_extra_time((inode)-xtime);   \
 +} while (0)
 +
 +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)  
\
 +do {\
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
 + (raw_inode)-xtime = cpu_to_le32((einode)-xtime.tv_sec);  \
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
 + (raw_inode)-xtime ## _extra = \
 + ext4_encode_extra_time((einode)-xtime);  \
 +} while (0)
 +
 +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)
\
 +do {\
 + (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
 + if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
 + ext4_decode_extra_time((inode)-xtime,\
 +raw_inode-xtime ## _extra);\
 +} while (0)
 +
 +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)  
\
 +do {\
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
 + (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);  \
 + if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
 + ext4_decode_extra_time((einode)-xtime,   \
 +raw_inode-xtime ## _extra);\
 +} while (0)

Ugly.  I expect these could be implemented as plain old C functions. 
Caller could pass in the address of the ext4_inode field which the function
is to operate upon.

  #if defined(__KERNEL__) || defined(__linux__)

(What's the __linux__ for?)

  #define i_reserved1  osd1.linux1.l_i_reserved1
  #define i_frag   osd2.linux2.l_i_frag
 @@ -539,6 +603,13 @@
   return container_of(inode, struct ext4_inode_info, vfs_inode);
  }
  
 +static inline struct timespec ext4_current_time(struct inode *inode)
 +{
 + return (inode-i_sb-s_time_gran  NSEC_PER_SEC) ?
 + current_fs_time(inode-i_sb) : CURRENT_TIME_SEC;
 +}

Now, I've forgotten how this works.  Remind me, please.  Can ext4
filesystems ever have one-second timestamp granularity?  If so, how does
one cause that to come about?

 --- linux-2.6.22-rc4.orig/include/linux/ext4_fs_i.h   2007-06-11 
 17:22:15.0 -0700
 +++

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:04 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch converts the 32-bit i_version in the generic inode to a 64-bit
 i_version field.
 

That's obvious from the patch.  But what was the reason for making this
(unrelated to ext4) change?

Please update the changelog for this.

 Index: linux-2.6.21/include/linux/fs.h
 ===
 --- linux-2.6.21.orig/include/linux/fs.h
 +++ linux-2.6.21/include/linux/fs.h
 @@ -549,7 +549,7 @@ struct inode {
   uid_t   i_uid;
   gid_t   i_gid;
   dev_t   i_rdev;
 - unsigned long   i_version;
 + u64 i_version;
   loff_t  i_size;
  #ifdef __NEED_I_SIZE_ORDERED
   seqcount_t  i_size_seqcount;

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:16 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch adds a 32-bit i_version_hi field to ext4_inode, which can be used 
 for 64-bit inode versions. This field will store the higher 32 bits of the 
 version, while Jean Noel's patch has added support to store the lower 32-bits 
 in osd1.linux1.l_i_version.

Please wordwrap this changelog entry to less than 80 columns.  Well, less
than 258, anyway ;)

 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 ---
 Index: linux-2.6.21/include/linux/ext4_fs.h
 ===
 --- linux-2.6.21.orig/include/linux/ext4_fs.h
 +++ linux-2.6.21/include/linux/ext4_fs.h
 @@ -342,6 +342,7 @@ struct ext4_inode {
   __le32  i_atime_extra;  /* extra Access time  (nsec  2 | epoch) */
   __le32  i_crtime;   /* File Creation time */
   __le32  i_crtime_extra; /* extra FileCreationtime (nsec  2 | epoch) */
 + __le32  i_version_hi;   /* high 32 bits for 64-bit version */
  };

Aren't there forward- backward-compatibility issues here?  How does the
filesystem driver work out whether this field is present and valid?

The changelog should describe this design issue.
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 3/5] i_version:ext4 inode version read/store

2007-07-10 Thread Andrew Morton

On Sun, 01 Jul 2007 03:37:36 -0400
Mingming Cao [EMAIL PROTECTED] wrote:

 This patch adds 64-bit inode version support to ext4. The lower 32 bits
 are stored in the osd1.linux1.l_i_version field while the high 32 bits
 are stored in the i_version_hi field newly created in the ext4_inode.

So reading the code here does serve to answer the question I raised against
the earlier patch.  A bit.

I'd have thought that this patch and the one which adds i_version_hi should
be folded into a single diff?

 
 Index: linux-2.6.21/fs/ext4/inode.c
 ===
 --- linux-2.6.21.orig/fs/ext4/inode.c
 +++ linux-2.6.21/fs/ext4/inode.c
 @@ -2709,6 +2709,13 @@ void ext4_read_inode(struct inode * inod
   EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
   EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
 
 + inode-i_version = le32_to_cpu(raw_inode-i_disk_version);
 + if (EXT4_INODE_SIZE(inode-i_sb)  EXT4_GOOD_OLD_INODE_SIZE) {
 + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi))
 + inode-i_version |=
 + (__u64)(le32_to_cpu(raw_inode-i_version_hi))  32;

checks the precedence of (type) versus 

OK

 + }

I don't quite see how the above two tests are sufficient to unambiguously
determine that the i_version_hi field is present on-disk.

I guess we're implicitly assuming that if the on-disk inode is big enough
then it _must_ have i_version_hi in there?  If so, why is the comparison
with EXT4_GOOD_OLD_INODE_SIZE needed?

Some description of the overall approach to inode version identification
would be helpful here.

   if (S_ISREG(inode-i_mode)) {
   inode-i_op = ext4_file_inode_operations;
   inode-i_fop = ext4_file_operations;
 @@ -2852,8 +2859,14 @@ static int ext4_do_update_inode(handle_t
   } else for (block = 0; block  EXT4_N_BLOCKS; block++)
   raw_inode-i_block[block] = ei-i_data[block];
 
 - if (ei-i_extra_isize)
 + raw_inode-i_disk_version = cpu_to_le32(inode-i_version);
 + if (ei-i_extra_isize) {
 + if (EXT4_FITS_IN_INODE(raw_inode, ei, i_version_hi)) {

There's no comparison with EXT4_GOOD_OLD_INODE_SIZE here...

 + raw_inode-i_version_hi =
 + cpu_to_le32(inode-i_version  32);
 + }
   raw_inode-i_extra_isize = cpu_to_le16(ei-i_extra_isize);
 + }
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 >

1 - 100 of 281 matches

Mail list logo