Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 04:57:25PM +0100, Andi Kleen wrote:
  Use cp
  or a tar pipeline to move the files.
 
 Are you sure cp handles hardlinks correctly? I know tar does,
 but I have my doubts about cp.

I *think* GNU cp does the right thing with --preserve=links.  I'm not
100% sure, though --- like you, probably, I always use tar for moving
or copying directory hierarchies.

   - Ted


Re: very poor ext3 write performance on big filesystems?

2008-02-18 Thread Theodore Tso
On Mon, Feb 18, 2008 at 05:16:55PM +0100, Tomasz Chmielewski wrote:
 Theodore Tso schrieb:

 I'd really need to know exactly what kind of operations you were
 trying to do that were causing problems before I could say for sure.
 Yes, you said you were removing unneeded files, but how were you doing
 it?  With rm -r of old hard-linked directories?

 Yes, with rm -r.

You should definitely try the spd_readdir hack; that will help reduce
the seek times.  This will probably help on any block-group-oriented
filesystem, including XFS, etc.
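
(A minimal standalone sketch of the idea, for illustration --- the
actual spd_readdir hack is an LD_PRELOAD shim that does the sorting
inside readdir() itself; this toy program just reads a directory,
sorts the entries by inode number, and prints them in that order,
which is the order you would want to process them in to keep the
inode table accesses mostly sequential:)

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    struct ent { ino_t ino; char name[256]; };

    static int by_ino(const void *a, const void *b)
    {
        ino_t ia = ((const struct ent *)a)->ino;
        ino_t ib = ((const struct ent *)b)->ino;
        return (ia > ib) - (ia < ib);
    }

    int main(int argc, char **argv)
    {
        DIR *d = opendir(argc > 1 ? argv[1] : ".");
        struct dirent *de;
        struct ent *v = NULL;
        size_t n = 0, cap = 0;

        if (!d)
            return 1;
        while ((de = readdir(d)) != NULL) {
            if (n == cap) {
                cap = cap ? cap * 2 : 64;
                v = realloc(v, cap * sizeof(*v));
                if (!v)
                    return 1;
            }
            v[n].ino = de->d_ino;
            snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
            n++;
        }
        closedir(d);
        qsort(v, n, sizeof(*v), by_ino);
        /* unlink/stat the entries in this (inode) order */
        for (size_t i = 0; i < n; i++)
            printf("%lu\t%s\n", (unsigned long)v[i].ino, v[i].name);
        free(v);
        return 0;
    }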

 How big are the
 average files involved?  Etc.

 It's hard to estimate the average size of a file. I'd say there are not 
 many files bigger than 50 MB.

Well, Ext4 will help for files bigger than 48k.

The other thing that might help for you is using an external journal
on a separate hard drive (either for ext3 or ext4).  That will help
alleviate some of the seek storms going on, since the journal is
written to only sequentially, and putting it on a separate hard drive
will help remove some of the contention on the hard drive.  

I assume that your 1.2 TB filesystem is located on a RAID array; did
you use the mke2fs -E stride option to make sure all of the bitmaps
don't get concentrated on one hard drive spindle?  One failure mode
which can happen with a 4+1 RAID 5 setup is that all of the block and
inode bitmaps end up laid out on a single hard drive, which then
becomes a bottleneck for bitmap-intensive workloads --- including rm
-rf.  So that's another thing that might be going on.  If you run
dumpe2fs, look at the block numbers for the block and inode
allocation bitmaps, and find that they are all landing on the same
physical hard drive, then that's very clearly the biggest problem
given an rm -rf workload.  You should be able to see
this as well visually; if one hard drive has its hard drive light
almost constantly on, and the other ones don't have much activity,
that's probably what is happening.

- Ted


Re: [RFC] Parallelize IO for e2fsck

2008-01-28 Thread Theodore Tso
On Mon, Jan 28, 2008 at 07:30:05PM +, Pavel Machek wrote:
 
 As user pages are always in highmem, this should be easy to decide:
 only send SIGDANGER when highmem is full. (Yes, there are
 inodes/dentries/file descriptors in lowmem, but I doubt apps will
 respond to SIGDANGER by closing files).

Good point; for a system with at least (say) 2GB of memory, that
definitely makes sense.  For a system with less than 768 megs of
memory (how quaint, but it wasn't that long ago this was a lot of
memory :-), there wouldn't *be* any memory in highmem at all...

- Ted


Re: [RFC] Parallelize IO for e2fsck

2008-01-26 Thread Theodore Tso
On Fri, Jan 25, 2008 at 05:55:51PM -0800, Bryan Henderson wrote:
 I was surprised to see AIX do late allocation by default, because IBM's 
 traditional style is bulletproof systems.  A system where a process can be 
 killed at unpredictable times because of resource demands of unrelated 
 processes doesn't really fit that style.
 
 It's really a fairly unusual application that benefits from late 
 allocation: one that creates a lot more virtual memory than it ever 
 touches.  For example, a sparse array.  Or am I missing something?

I guess it depends on how far you try to take "bulletproof".  OSF/1
used "bulletproof" as its default --- and I had to turn it off on
tsx-11.mit.edu (the first North American ftp server for Linux :-),
because the difference was something like 50 ftp daemons versus over
500 on the same server.  It reserved VM space for the text segment of
every single process, since at least in theory it's possible for
every single text page to get modified using ptrace if (for example) a
debugger were to set a breakpoint on every single page of every
single text segment of every single ftp daemon.

You can also see potential problems for Java programs.  Suppose you
had some gigantic Java Application (say, Lotus Notes, or Websphere
Application Server) which is taking up many, many, MANY gigabytes of
VM space.  Now suppose the Java application needs to fork and exec
some trivial helper program.  For that tiny instant, between the fork
and exec, the VM requirements in bulletproof mode would double,
since while nearly 100% of the time programs will immediately discard the
VM upon the exec, there is always the possibility that the child
process will touch every single data page, forcing a copy on write,
and never do the exec.

There are of course different levels of "bulletproof" between the
extremes of totally bulletproof and late binding, from an
algorithmic standpoint.  For example, you could ignore the needed
pages caused by ptrace(); more challenging would be how to handle
the fork/exec semantics, although there could be kludges such as
strongly encouraging applications to use an old-fashioned BSD-style
vfork() to guarantee that the child couldn't double VM requirements
between the vfork() and exec().
the AIX designers had in mind, and why they didn't choose one of the
more intermediate design choices.  

However, it is fair to say that "100% bulletproof" can require
reserving far more VM resources than you might first expect.  Even a
company which is highly incented to sell large amounts of hardware,
such as Digital, might not have wanted their OS to be only able to
support an embarrassingly small number of simultaneous ftpd
connections.  I know this for sure because the OSF/1 documentation,
when discussing their VM tuning knobs, specifically talked about the
scenario that I ran into with tsx-11.mit.edu.

Regards,

- Ted


Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso
On Fri, Jan 25, 2008 at 10:34:25AM -0600, Eric Sandeen wrote:
  But it was this concern which is why ext3 never exported freeze
  functionality to userspace, even though other commercial filesystems
  do support this.  It wasn't that it wasn't considered; rather, there
  was concern about whether or not it was sufficiently safe to make
  available.
 
 What's the safety concern; that the admin will forget to unfreeze?

That the admin would manage to deadlock him/herself and wedge up the
whole system...

 I'm also not sure I see the point of the timeout in the original patch;
 either you are done snapshotting and ready to unfreeze, or you're not;
 1, or 2, or 3 seconds doesn't really matter.  When you're done, you're
 done, and you can only unfreeze then.  Shouldn't this be done
 programmatically, and not with some pre-determined timeout?

This is only a guess, but I suspect it was a fail-safe in case the
admin did manage to deadlock him/herself.  

I would think a better approach would be to make the filesystem
unfreeze if the file descriptor that was used to freeze the filesystem
is closed, and then have explicit deadlock detection that kills the
process doing the freeze, at which point the filesystem unlocks and
the system can recover.
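
(For concreteness, here is a sketch of what the fd-based interface
looks like from userspace, using the FIFREEZE/FITHAW ioctls that were
eventually merged into mainline; note that the close-unfreezes
fail-safe is the *proposal* above, not something this sketch itself
implements:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>   /* FIFREEZE, FITHAW */

    int main(void)
    {
        int fd = open("/mnt", O_RDONLY);    /* the mount point */

        if (fd < 0 || ioctl(fd, FIFREEZE, 0) < 0) {
            perror("freeze");
            return 1;
        }
        /* ... take the snapshot here ... */
        if (ioctl(fd, FITHAW, 0) < 0)
            perror("thaw");
        /* under the proposed semantics, close(fd) would also
           unfreeze as a fail-safe */
        close(fd);
        return 0;
    }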

- Ted


Re: [RFC] ext3 freeze feature

2008-01-25 Thread Theodore Tso
On Fri, Jan 25, 2008 at 03:18:51PM +0300, Dmitri Monakhov wrote:
 First of all Linux already have at least one open-source(dm-snap),
 and several commercial snapshot solutions. 

Yes, but it requires that the filesystem be stored under LVM.  Unlike
what EVMS v1 allowed us to do, we can't currently take a snapshot of a
bare block device.  This patch could potentially be useful for systems
which aren't using LVM, however...

 You have to realize what delay between 1-3 stages have to be minimal.
 for example dm-snap perform it only for explicit journal flushing.
 From my experience if delay is more than 4-5 seconds whole system becomes
 unstable.

That's the problem.  You can't afford to freeze for very long.

What you *could* do is to start putting processes to sleep if they
attempt to write to the frozen filesystem, and then detect the
deadlock case where the process holding the file descriptor used to
freeze the filesystem gets frozen because it attempted to write to the
filesystem --- at which point it gets some kind of signal (which
defaults to killing the process), and the filesystem is unfrozen and
as part of the unfreeze you wake up all of the processes that were put
to sleep for touching the frozen filesystem.

The other approach would be to say, oh well, the freeze ioctl is
inherently dangerous, and root is allowed to shoot himself in the
foot, so who cares.  :-)

But it was this concern which is why ext3 never exported freeze
functionality to userspace, even though other commercial filesystems
do support this.  It wasn't that it wasn't considered; rather, there
was concern about whether or not it was sufficiently safe to make
available.

And I do agree that we probably should just implement this in a
filesystem-independent way, in which case all of the filesystems that
support this already have the super_operations functions
write_super_lockfs() and unlockfs().

So if this is done using a new system call, there should be no
filesystem-specific changes needed, and all filesystems which support
those super_operations method functions would be able to provide this
functionality to the new system call.
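
(As a sketch of what that looks like on the filesystem side --- these
two methods already exist in struct super_operations; the myfs_*
names and bodies here are placeholders:)

    #include <linux/fs.h>

    /* flush dirty data/journal and block new transactions */
    static void myfs_write_super_lockfs(struct super_block *sb)
    {
        /* filesystem-specific quiesce logic goes here */
    }

    /* allow transactions to proceed again */
    static void myfs_unlockfs(struct super_block *sb)
    {
        /* filesystem-specific unquiesce logic goes here */
    }

    static struct super_operations myfs_sops = {
        .write_super_lockfs = myfs_write_super_lockfs,
        .unlockfs           = myfs_unlockfs,
        /* ... other methods ... */
    };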

 - Ted

P.S.  Oh yeah, it should be noted that freezing at the filesystem
layer does *not* guarantee that changes to the block device aren't
happening via mmap()'ed files.  The LVM needs to freeze writes at the
block device level if it wants to guarantee a completely stable
snapshot image.  So the proposed patch doesn't quite give you those
guarantees, if that was the intended goal.


Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Theodore Tso
On Fri, Jan 25, 2008 at 01:08:09AM +0200, Adrian Bunk wrote:
 In practice, there is a small number of programs that are both the
 common memory hogs and should be able to reduce their memory consumption
 by 10% or 20% without big problems when requested (e.g. Java VMs,
 Firefox and databases come into my mind).

I agree, it's only a few processes where this makes sense.  But for
those that do, it would be useful if they could register with the
kernel when they would like to know (just before the system starts
ejecting cached data, just before swapping, etc.) and at what
frequency.  And presumably, if the kernel notices that a process is
responding to such requests with memory actually getting released back
to the system, that process could get rewarded by making the OOM
killer less likely to target that particular process.

AIX basically did this with SIGDANGER (the signal is ignored by
default), except there wasn't the ability for the process to tell the
kernel at what level of memory pressure it should start getting
notified, and there was no way for the kernel to tell the process how
bad the memory pressure actually was.  On the other hand, it was a
relatively simple design.
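
(A hedged illustration of what the application side looks like ---
SIGDANGER exists on AIX, not Linux, and release_caches() here is a
hypothetical application-specific routine:)

    #include <signal.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t danger;

    static void on_danger(int sig)
    {
        (void)sig;
        danger = 1;     /* main loop checks this and frees caches */
    }

    int main(void)
    {
    #ifdef SIGDANGER    /* defined on AIX only */
        struct sigaction sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_danger;
        sigaction(SIGDANGER, &sa, NULL);
    #endif
        for (;;) {
            if (danger) {
                /* release_caches();  -- hypothetical */
                danger = 0;
            }
            sleep(1);   /* ... normal application work ... */
        }
    }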

In practice very few processes would indeed pay attention to
SIGDANGER, so I think you're quite right there.

 And from a performance point of view letting applications voluntarily 
 free some memory is better even than starting to swap.

Absolutely.

- Ted


Re: [RFD] Incremental fsck

2008-01-12 Thread Theodore Tso
On Wed, Jan 09, 2008 at 02:52:14PM +0300, Al Boldi wrote:
 
 Ok, but let's look at this a bit more opportunistic / optimistic.
 
 Even after a black-out shutdown, the corruption is pretty minimal, using 
 ext3fs at least.


After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any
corruption at all.  If you have crappy hardware, then all bets are off...

 So let's take advantage of this fact and do an optimistic fsck, to
 assure integrity per-dir, and assume no external corruption.  Then
 we release this checked dir to the wild (optionally ro), and check
 the next.  Once we find external inconsistencies we either fix it
 unconditionally, based on some preconfigured actions, or present the
 user with options.

So what can you check?  The *only* things you can check are whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.

What is very hard to check is whether or not the link count on the
inode is correct.  Suppose the link count is 1, but there are actually
two directory entries pointing at it.  Now when someone unlinks the
file through one of those directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname.  Oops.  Data Loss.
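
(To make the global nature of the check concrete, here is a
simplified sketch of the counting that e2fsck does --- the real code
uses an icount abstraction rather than a flat array, but the shape is
the same: you must scan *every* directory before you can validate
*any* inode's link count:)

    #include <stdint.h>
    #include <stdio.h>

    #define NR_INODES 1024              /* toy filesystem */

    static uint16_t refs[NR_INODES];    /* refs seen in dir scan */
    static uint16_t links[NR_INODES];   /* i_links_count on disk */

    /* called once per directory entry, over all directories */
    static void saw_dirent(uint32_t ino)
    {
        refs[ino]++;
    }

    /* only valid after *all* directories have been scanned */
    static void check_link_counts(void)
    {
        for (uint32_t ino = 1; ino < NR_INODES; ino++)
            if (refs[ino] != links[ino])
                printf("inode %u: i_links_count is %u, "
                       "should be %u\n", ino, links[ino], refs[ino]);
    }

    int main(void)
    {
        links[42] = 1;      /* inode claims one link...      */
        saw_dirent(42);     /* ...but two directory entries  */
        saw_dirent(42);     /* actually reference it         */
        check_link_counts();
        return 0;
    }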

This is why doing incremental, on-line fsck'ing is *hard*.  You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse.  And
it's not just the hard link count.  There is a similar issue with the
block allocation bitmap.  Detecting the case where two files
simultaneously claim the same block can't be done if you are doing it
incrementally, and if the filesystem is changing out from under you,
it's impossible, unless you also have the filesystem telling you every
single change while it is happening, and you keep an insane amount of
bookkeeping.

One thing that you *might* be able to do is to mount a filesystem
read-only, and check it in the background while you allow users to
access it read-only.  There are a few caveats, however.  (1) Some
filesystem errors may cause the data to be corrupt, or in the worst
case, could cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should).  (2) If there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write.  You can't just do a mount -o remount if the filesystem
was modified under the OS's nose.

 All this could be per-dir or using some form of on-the-fly file-block-zoning.
 
 And there probably is a lot more to it, but it should conceptually be 
 possible, with more thoughts though...

Many things are possible, in the NASA sense of "with enough thrust,
anything will fly."  Whether or not it is *useful* and *worthwhile*
are of course different questions!  :-)

- Ted


Re: [RFC 0/2] readdir() as an inode operation

2007-10-31 Thread Theodore Tso
On Tue, Oct 30, 2007 at 04:26:04PM +0100, Jan Kara wrote:
  This is a first try to move readdir() to become an inode operation. This is
  necessary for a VFS implementation of something like union-mounts where a
  readdir() needs to read the directory contents of multiple directories.
  Besides that the new interface is no longer giving the struct file to the
  filesystem implementations anymore.
  
  Comments, please?
   Hmm, are you sure there are no users which keep some per-struct-file
 information for directories? File offset is one such obvious thing which
 you've handled but actually filesystem with more complicated structure
 of directory may remember some hints about where we really are, keep
 some readahead information or so...

For example, the ext3 filesystem, when hash tree directories are
enabled, does exactly this.  See ext3_htree_store_dirent() in
fs/ext3/dir.c and ext3_htree_fill_tree() in fs/ext3/namei.c.

So your patch would break ext3 htree support.

- Ted


Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-23 Thread Theodore Tso
On Tue, Oct 23, 2007 at 07:38:20PM +0900, Tetsuo Handa wrote:
  Are you sure the file isn't getting written by some background tasks
  that you weren't aware of?  This seems very strange; what
  virtualization software are you using?  VMware, Xen, KVM?
 I'm using VMware Workstation 6.0.0 build 45731 for x86_64.
 It seems that there were some background tasks that delays writing.
 I tried the following sequence, sync didn't affect.

Or it may be that it takes a while to do a controlled shutdown.

One potential reason for the vmem file being very badly fragmented is
that it might not be getting written in sequential order.  If the
writer is writing the file in random order, then unless you have a
filesystem which can do delayed allocation, the blocks will get
allocated in the order that they are first written, and if the writer
is seeking to random locations to do the writes, that's one way that
you can end up with a very badly fragmented file.
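
(If you want to see this effect for yourself, a small test program
along these lines --- written against a hypothetical scratch file
"randfile" --- will tend to produce a heavily fragmented file on a
filesystem without delayed allocation; compare the filefrag output
against a sequentially written file of the same size:)

    #include <fcntl.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK   4096
    #define NBLKS 4096      /* 16 megs total */

    int main(void)
    {
        char buf[BLK];
        int order[NBLKS];
        int fd = open("randfile", O_CREAT | O_WRONLY | O_TRUNC, 0644);

        if (fd < 0)
            return 1;
        memset(buf, 0xaa, sizeof(buf));
        for (int i = 0; i < NBLKS; i++)
            order[i] = i;
        /* Fisher-Yates shuffle: visit the blocks in random order */
        for (int i = NBLKS - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            int t = order[i]; order[i] = order[j]; order[j] = t;
        }
        /* each pwrite() is the first touch of that block, so the
           allocator hands out blocks in write order, not file order */
        for (int i = 0; i < NBLKS; i++)
            pwrite(fd, buf, BLK, (off_t)order[i] * BLK);
        close(fd);
        return 0;
    }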

Regards,

- Ted


Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-22 Thread Theodore Tso
On Mon, Oct 22, 2007 at 08:58:11PM +0900, Tetsuo Handa wrote:
 
 --- Start VM ---
 --- Suspend VM ---
 [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
 Ubuntu7.10.vmem: 751 extents found, perfection would be 5 extents
 [EMAIL PROTECTED] Ubuntu7.10]# sync
 [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
 Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
 [EMAIL PROTECTED] Ubuntu7.10]# filefrag Ubuntu7.10.vmem
 Ubuntu7.10.vmem: 3281 extents found, perfection would be 5 extents
 --- Resume and poweroff VM ---
 
 What? sync yields more discontiguous?

What filesystem are you using?  ext3?  ext4?  xfs?  And are you using
any non-standard patches, such as some of the delayed allocation
patches that have been floating around?  If you're using ext3, that
shouldn't be happening.

If you use the -v option to filefrag, both before and after the sync,
that might show us what is going on.  The other thing is to use
debugfs and its stat command to get a detailed breakdown of the block
assignments of the file.

Are you sure the file isn't getting written by some background tasks
that you weren't aware of?  This seems very strange; what
virtualization software are you using?  VMware, Xen, KVM?

 - Ted


Re: Does 32.1% non-contiguous mean severely fragmented?

2007-10-20 Thread Theodore Tso
On Sat, Oct 20, 2007 at 12:39:33PM +0900, Tetsuo Handa wrote:
 Theodore Tso wrote:
  beginning of every single block group.  You have a small number of
  files on your system (349) occupying an average of 348 megabytes.  So
  it's not at all surprising that the non-contiguous percentage is 32%.
 I see, thank you. Yes, there are many files splitted in 2GB each.
 
 But what is surprising for me is that I have to wait for more than
 five minutes to save/restore the virtual machine's 512MB-RAM image
 (usually it takes less than five seconds).
 Hdparm reports DMA is on and e2fsck reports no errors,
 so I thought it is severely fragmented.
 May be I should backup all virtual machine's data and
 format the partition and restore them.

Well, that's a little drastic if you're not sure what is going on is
fragmentation.

5 minutes to save/restore a 512MB ram image, assuming that you are
saving somewhere around 576 megs of data, means you are writing less
than 2 megs/second.  That seems to point to something fundamentally
wrong, far worse than can be explained by fragmentation.

First of all, what does the filefrag program (shipped as part of
e2fsprogs, though not included in some distributions) say if you run it as
root on your VM data file?

Secondly, what results do you get when you run the command hdparm -tT
/dev/sda (or /dev/hda if you are using an IDE disk)?

This kind of performance regression is the sort of thing I see on my
laptop when I compile the kernel with the wrong options, and/or disable
AHCI mode in favor of compatibility mode, such that my laptop's SATA
performance (as measured using hdparm) drops from 50 megs/second to 2
megs/second.

Regards,

- Ted


Re: Does "32.1% non-contiguous" mean severely fragmented?

2007-10-19 Thread Theodore Tso
On Fri, Oct 19, 2007 at 10:49:03AM +0900, Tetsuo Handa wrote:

 /data/VMware: 349/19546112 files (32.1% non-contiguous), 31019203/39072080 
 blocks
 
 Does non-contiguous mean fragmented?
 If so, where is ext3defrag?

Not necessarily; it just means that 32% of your files have at least
one discontinuity.  Given the ext3 layout, by definition every 128
megs there will be a discontinuity because of the metadata at the
beginning of every single block group.  You have a small number of
files on your system (349) occupying an average of 348 megabytes.  So
it's not at all surprising that the non-contiguous percentage is 32%.

The Flex BG feature that was recently pulled into 2.6.23-git14
for ext4 is designed to avoid this issue, but a seek every 128 megs is
not a big deal for most workloads and will hopefully not cause you any
problems.

- Ted


Re: [PATCH 13/32] IGET: Stop EXT2 from using iget() and read_inode() [try #2]

2007-10-05 Thread Theodore Tso
On Thu, Oct 04, 2007 at 04:57:08PM +0100, David Howells wrote:
 Stop the EXT2 filesystem from using iget() and read_inode().  Replace
 ext2_read_inode() with ext2_iget(), and call that instead of iget().
 ext2_iget() then uses iget_locked() directly and returns a proper error code
 instead of an inode in the event of an error.
 
 ext2_fill_super() returns any error incurred when getting the root inode
 instead of EINVAL.
 
 Signed-off-by: David Howells [EMAIL PROTECTED]

Acked-by: Theodore Ts'o [EMAIL PROTECTED]

- Ted


Re: Upgrading datastructures between different filesystem versions

2007-09-28 Thread Theodore Tso
On Fri, Sep 28, 2007 at 02:31:46PM +0100, Christoph Hellwig wrote:
 On Fri, Sep 28, 2007 at 03:11:00PM +0200, Erik Mouw wrote:
  There are however ways to confuse it: if you reformat an ext3
  filesystem to reiserfs (version 3), mounting that filesystem without
  -t reiserfs will trick mount(8) into mounting it as an ext3
  filesystem (which will usually fail). This is because the ext3
  superblocks lives at offset 0x400, and the reiserfs superblock at
  0x8000. When you format a partition as reiserfs, it will not erase old
  ext3 superblocks. Before looking for a reiserfs superblock, mount(8)
  first looks for an ext3 superblock. The old ext3 superblock wasn't
  erased, but usually most of the other ext3 structures are and so
  mount(8) will fail to mount the filesystem. Don't know if this
  particular bug is still there, but it has bitten me in the past.
 
 This is easy to fix, though.  Quoting mkfs.xfs:
 
 /*
  * Zero out the beginning of the device, to obliterate any old
  * filesystem signatures out there.  This should take care of
  * swap (somewhere around the page size), jfs (32k),
  * ext[2,3] and reiserfs (64k) - and hopefully all else.
  */
 buf = libxfs_getbuf(xi.ddev, 0, BTOBB(WHACK_SIZE));
 bzero(XFS_BUF_PTR(buf), WHACK_SIZE);
 libxfs_writebuf(buf, LIBXFS_EXIT_ON_FAILURE);
 libxfs_purgebuf(buf);

Ext3 does something similar, zapping space at the beginning AND the
end of the partition (because the MD superblocks are at the end).
It's just a misfeature of reiserfs's mkfs that it doesn't do this.

- Ted


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 04:19:12PM +0100, Alan Cox wrote:
  Well it's not my call, just seems like a really bad idea to change the
  error value. You can't claim full coverage for such testing anyway, it's
  one of those things that people will complain about two releases later
  saying it broke app foo.
 
 Strange since we've spent years changing error values and getting them
 right in the past. 

I doubt there are any apps which are going to specifically check for EFBIG
and do something different if they get EOVERFLOW instead.  If it was
something like EAGAIN or EPERM, I'd be more concerned, but EFBIG
vs. EOVERFLOW?  C'mon!

 There are real things to worry about - sysfs, sysfs, sysfs, ... and all
 the other crap which is continually breaking stuff, not spec compliance
 corrections that don't break things but move us into compliance with the
 standard

I've got to agree with Alan: the sysfs/udev breakages that we've done
are far more significant, and the fact that we continue to expose
internal data structures via sysfs (a gaping open pit) is far more
likely to cause any kind of problems than changing an error return.

  - Ted


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 10:59:17AM -0700, Greg KH wrote:
 Come on now, I'm _very_ tired of this kind of discussion.  Please go
 read the documentation on how to _use_ sysfs from userspace in such a
 way that you can properly access these data structures so that no
 breakage occurs.

I've read it; the question is whether every single application
programmer or system shell script programmer who writes code my system
depends upon has read this document buried in the kernel sources,
or whether things will break spectacularly --- one of those things
that leaves me in suspense each time I update the kernel.

I'm reminded of Rusty's 2003 OLS Keynote, where he points out that
what's important is not making an interface easy to use, but _hard_
_to_ _misuse_.  The fact that sysfs is all laid out in a directory,
but for which some directories/symlinks are OK to use, and some are
NOT OK to use --- is why I call the sysfs interface an open pit.
Sure, if you have the map to the minefield, a minefield is perfectly
safe when you know what to avoid.  But is that the best way to
construct a path/interface for an application programmer to get from
point A to point B?  Maybe, maybe not.

- Ted


Re: [PATCH] fs: Correct SuS compliance for open of large file without options

2007-09-27 Thread Theodore Tso
On Thu, Sep 27, 2007 at 05:28:57PM -0600, Matthew Wilcox wrote:
 On Thu, Sep 27, 2007 at 07:19:27PM -0400, Theodore Tso wrote:
  Would you accept a patch which causes the deprecated sysfs
  files/directories to disappear, even if CONFIG_SYS_DEPRECATED is
  defined, via a boot-time parameter?
 
 How about a mount option?  That way people can test without a reboot:
 
 mount -o remount,deprecated={yes,no} /sys

It would be nice if that would be easy to make work, but the problem
is that remounting sysfs doesn't change the entries in the sysfs tree
that have already been made in the tree.  We could do something such
as creating a sysfs_create_link_deprecated() call which created a
kobject with a new flag indicating it's deprecated, so it could be
filtered out dynamically when /sys is remounted, or when some file
such as /sys/kernel/deprecated_sysfs_files has 0 or 1 written to
it.

The question is whether it's worth it, since we'd have to bloat the
kobject structure by 4 bytes (it currently doesn't have a flags field
from which we could borrow a bit), or whether it's OK just to make the
user reboot.  (I do agree it would be nicer if the user didn't have to
reboot, but most of the time they will need to test the initrd and
init scripts anyway.)

- Ted


Re: Upgrading datastructures between different filesystem versions

2007-09-26 Thread Theodore Tso
On Wed, Sep 26, 2007 at 06:29:19PM -0500, Sachin Gaikwad wrote:
 Is it not the case that VFS takes care of all filesystems available ?
 VFS will see if a particular file belongs to ext3 or ext4 and call
 that FS's drivers to access information ??

No, it doesn't quite work that way.  You have to mount a particular
partition using a specific filesystem (e.g., ntfs, vfat, ext2, ext3,
ext4, etc.).  A partition formatted using ext2 can be mounted using
the ext2, ext3, or ext4 filesystem driver.  You can explicitly specify
what filesystem should be used to mount a particular partition using
the -t option to the mount program, or by specifying a particular
filesystem type in the /etc/fstab file.

- Ted


Re: [RFC 12/26] ext2 white-out support

2007-07-30 Thread Theodore Tso
On Mon, Jul 30, 2007 at 06:13:35PM +0200, Jan Blunck wrote:
 Introduce white-out support to ext2.
 
 Known Bugs:
 - Needs a reserved inode number for white-outs

You picked different reserved inodes for the ext2 and ext3
filesystems.  That's good for a NACK right there.  The codepoints
(i.e., reserved inode numbers, feature bit masks, etc.) for ext2,
ext3, and ext4 MUST NOT overlap.  After all, someone might use tune2fs
-j to convert an ext2 filesystem to ext3, and it's REALLY BAD that
you're using a reserved inode of 7 for ext2, and 9 for ext3.

Also, I note that you have created a new INCOMPAT feature flag support
for whiteouts.  That's really unfortunate; we try to avoid introducing
incompatible feature flags unless absolutely necessary; note that even
adding a COMPAT feature flag means that you need a new version of
e2fsprogs if you want e2fsck to be willing to touch that filesystem.

So --- if you're looking for a way to add whiteout support to
ext2/ext3 without needing a feature bit, here's how.  We allocate a
new inode flag in struct ext3_inode.i_flags:

#define EXT2_WHTOUT_FL   0x0004

We also allocate a new field in the ext2 superblock to store the
whiteout inode.  (Please coordinate with me so it's a superblock
field not in use by ext3/ext4, and so it's reserved so that no one
else uses it.)  The superblock field, call it s_whtout_ino, stores the
inode number for the white out inode.

When you create a new whiteout file, the code checks sb->s_whtout_ino,
and if it is zero, it allocates a new inode, and creates it as a
zero-length regular file (i_mode |= S_IFREG) with the EXT2_WHTOUT_FL
flag set in the inode, and then stores the inode number in
sb->s_whtout_ino.  If sb->s_whtout_ino is non-zero, you must read in
the inode and make sure that the EXT2_WHTOUT_FL is set.  If it is not,
then allocate a new whiteout inode as described previously.  Then link
the inode into the directory as before.

When reading an inode, if the EXT2_WHTOUT_FL flag is set, then set the
in-memory mode of the inode to be S_IFWHT.  

That's pretty much about it.  For cleanliness' sake, it would be good
if ext2_delete_inode clears sb->s_whtout_ino if the last whiteout link
has been deleted, but it's strictly speaking not necessary.  If you do
it this way, the filesystem is completely backwards compatible; the
whiteout files will just appear to be links to a normal zero-length file.

I wouldn't bother with setting the directory type field to be DT_WHT,
given that they will never be returned to userspace anyway.
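
(To tie the above together, a sketch of the creation path --- written
against the *proposed* fields, so EXT2_WHTOUT_FL and s_whtout_ino are
the hypothetical additions described above, and ext2_iget() is the
new-style iget from David Howells' patches elsewhere in this archive:)

    /* find or create the shared whiteout inode for this filesystem */
    static struct inode *ext2_get_whiteout(struct super_block *sb)
    {
        struct inode *inode;

        if (EXT2_SB(sb)->s_whtout_ino) {
            inode = ext2_iget(sb, EXT2_SB(sb)->s_whtout_ino);
            if (!IS_ERR(inode) &&
                (EXT2_I(inode)->i_flags & EXT2_WHTOUT_FL))
                return inode;   /* existing whiteout inode */
            /* flag not set: fall through, allocate a fresh one */
        }
        inode = ext2_new_inode(sb->s_root->d_inode, S_IFREG);
        if (IS_ERR(inode))
            return inode;
        EXT2_I(inode)->i_flags |= EXT2_WHTOUT_FL;
        inode->i_size = 0;                      /* zero-length file */
        EXT2_SB(sb)->s_whtout_ino = inode->i_ino;
        mark_inode_dirty(inode);
        return inode;   /* caller links it into the directory */
    }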

Regards,

- Ted


Re: [RFH] Partition table recovery

2007-07-23 Thread Theodore Tso
On Mon, Jul 23, 2007 at 10:15:21AM +0200, Rene Herman wrote:
 On an integrated system like this, do you consider it acceptable to only do 
 the MS-DOS partitions and not the other types that may be present _inside_ 
 those partitions? (MINIX subpartitions, BSD slices, ...). I believe those 
 should really also be done, but this would require keeping more information 
 again.

Well, I'm considering this to be an MBR backup scheme, so Minix and BSD
slices are legacy systems which are out of scope.  If they are busted
in the same way as the MBR, in terms of not having redundant backups of
critical data, then they have a lot fewer excuses than the MBR, and they
can address that issue in their own way.  The number of Linux users
that also have Minix and BSD partitions is vanishingly small
in any case.

 I (very) briefly looked at blkid but unless I'm mistaken blkid needs device 
 names? The documentation seems to be missing. When scanning the device for 
 the partition table, we've built a list of partitions with offsets into the 
 device and it would be nice if we could hand the fd and the offset off to 
 something directly. If the program has to construct device names itself 
 there's another truckload of pitfalls right there.

Yeah, good point, I'd have to add that support into blkid.  It's been
on my todo list, but I just haven't gotten around to it yet.

 It might in fact make sense to just ask the kernel for the partitions on a 
 device and not bother with scanning anything ourselves. Ie, just walk 
 sysfs. Would you agree? This siginificantly reduces the risk of things 
 getting out of sync, both scanning order and implementation.

My concern with sysfs is that #1, it won't work on older kernels, since
you would need to add new fields to back up what we want, and #2, I'm
still fundamentally distrustful of sysfs because there isn't a bright
line between what is an exported interface that will never change, and
something which is considered an internal implementation detail that
can change whenever some kernel hacker feels like it.  (Or when some
kernel hacker is careless...)  So as far as I'm concerned, sysfs is a
terrible, TERRIBLE way to export a published interface where we
promise stability to userspace.

So I'd just as soon do this in userspace; after all, the partition
managers (and there are multiple ones: fdisk, sfdisk, gpart,
etc.) are all in userspace, and they need to be in sync with the kernel
partition-reading code anyway.  So one more userspace implementation
is in my mind much cleaner than trying to push the needed
functionality into sysfs, and then hoping against hope that it doesn't
accidentally change in the future.

- Ted


Re: [RFH] Partition table recovery

2007-07-22 Thread Theodore Tso
On Sun, Jul 22, 2007 at 07:10:31AM +0300, Al Boldi wrote:
 Sounds great, but it may be advisable to hook this into the partition 
 modification routines instead of mkfs/fsck.  Which would mean that the 
 partition manager could ask the kernel to instruct its fs subsystem to 
 update the backup partition table for each known fs-type that supports such 
 a feature.

Well, let's think about this a bit.  What are the requirements?

1) The partition manager should be able explicitly request that a new
backup of the partition tables be stashed in each filesystem that has
room for such a backup.  That way, when the user affirmatively makes a
partition table change, it can get backed up in all of the right
places automatically.

2) The fsck program should *only* stash a backup of the partition
table if there currently isn't one in the filesystem.  It may be that
the partition table has been corrupted, and so merely doing an fsck
should not transfer a current copy of the partition table to the
filesystem-specific backup area.  It could be that the partition table
was only partially recovered, and we don't want to overwrite the
previously existing backups except on an explicit request from the
system administrator.

3) The mkfs program should automatically create a backup of the
current partition table layout.  That way we get a backup in the newly
created filesystem as soon as it is created.

4) The exact location of the backup may vary from filesystem to
filesystem.  For ext2/3/4, bytes 512-1023 are always unused, and don't
interfere with the boot sector at bytes 0-511, so that's the obvious
location.  Other filesystems may have that location in use, and some
other location might be a better place to store it.  Ideally it will
be a well-known location, that isn't dependent on finding an inode
table, or some such, but that may not be possible for all filesystems.

OK, so how about this as a solution that meets the above requirements?

/sbin/partbackup <device> [<fspart>]

Will scan <device> (i.e., /dev/hda, /dev/sdb, etc.) and create
a 512-byte partition backup, using the format I've previously
described.  If <fspart> is specified on the command line, it
will use the blkid library to determine the filesystem type of
<fspart>, and then attempt to execute
/sbin/partbackupfs.<fstype> to write the partition backup to
<fspart>.  If <fspart> is '-', then it will write the 512-byte
partition table to stdout.  If <fspart> is not specified on
the command line, /sbin/partbackup will iterate over all
partitions in <device>, use the blkid library to attempt to
determine the correct filesystem type, and then execute
/sbin/partbackupfs.<fstype> if such a backup program exists.

/sbin/partbackupfs.<fstype> <fspart>

... is a filesystem-specific program for filesystem type
<fstype>.  It will assure that <fspart> (i.e., /dev/hda1,
/dev/sdb3) is of an appropriate filesystem type, and then read
512 bytes from stdin and write them out to an appropriate
place for that filesystem on <fspart>.

Partition managers will be encouraged to check to see if
/sbin/partbackup exists, and if so, after the partition table is
written, to call it with just one argument (i.e., /sbin/partbackup
/dev/hdb).  They SHOULD provide an option for the user to suppress the
backup from happening, but the backup should be the default behavior.

An /sbin/mkfs.<fstype> program is encouraged to run /sbin/partbackup
with two arguments (i.e., /sbin/partbackup /dev/hdb /dev/hdb3) when
creating a filesystem.

An /sbin/fsck.<fstype> program is encouraged to check to see if a
partition backup exists (assuming the filesystem supports it), and if
not, call /sbin/partbackup with two arguments.

A filesystem utility package for a particular filesystem type is
encouraged to make the above changes to its mkfs and fsck programs, as
well as provide an /sbin/partbackupfs.<fstype> program.

I would do this all in userspace, though.  Is there any reason to get
the kernel involved?  I don't think so.
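
(For example, a minimal /sbin/partbackupfs.ext2 along the lines
described above might look like this sketch --- the filesystem-type
check via the blkid library is omitted for brevity, and the offset
relies on ext2/3/4 leaving bytes 512-1023 unused:)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        char buf[512];
        int fd;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <fspart>\n", argv[0]);
            return 1;
        }
        /* the 512-byte backup arrives on stdin */
        if (read(0, buf, sizeof(buf)) != sizeof(buf))
            return 1;
        fd = open(argv[1], O_WRONLY);
        if (fd < 0)
            return 1;
        /* stash it in the unused region after the boot sector */
        if (pwrite(fd, buf, sizeof(buf), 512) != sizeof(buf))
            return 1;
        fsync(fd);
        close(fd);
        return 0;
    }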

- Ted


Re: [EXT4 set 4][PATCH 5/5] i_version: noversion mount option to disable inode version updates

2007-07-11 Thread Theodore Tso
On Tue, Jul 10, 2007 at 04:31:44PM -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:53 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Add a noversion mount option to disable inode version updates.
 
 Why is this option being offered to our users?  To reduce disk traffic,
 like noatime?
 
 If so, what are the implications of this?  What would the user lose?

This has been removed in the latest patch set; it's needed only for
Lustre, because they set the version field themselves.  Lustre needs
the inode version to be globally monotonically increasing, so it can
order updates between two different files, so it does this itself.
NFSv4 only uses i_version to detect changes, and so there's no need to
use a global atomic counter for i_version.  So the thinking was that
there was no point doing the global atomic counter if it was not necessary.

Since noversion is Lustre-specific, we've dropped that from the list
of patches that we'll push, and so the inode version will only have
local per-inode significance, and not have any global ordering
properties.  

We have not actually benchmarked whether or not doing the global
ordering actually *matters*, i.e., whether the overhead is noticeable.  If
it isn't noticeable, I wouldn't mind changing things so that we always
make i_version globally significant (without a mount option), and make
life a bit easier for the Lustre folks.  Or if some other
distributed filesystem requests a globally significant i_version.  But
we can cross that bridge when we get to it...

- Ted


Re: Versioning file system

2007-07-04 Thread Theodore Tso
On Wed, Jul 04, 2007 at 07:32:34PM +0200, Erik Mouw wrote:
 (sorry for the late reply, just got back from holiday)
 
 On Mon, Jun 18, 2007 at 01:29:56PM -0400, Theodore Tso wrote:
  As I mentioned in my Linux.conf.au presentation a year and a half ago,
  the main use of Streams in Windows to date has been for system
  crackers to hide trojan horse code and rootkits so that system
  administrators couldn't find them.  :-)
 
 The only valid use of Streams in Windows I've seen was a virus checker
 that stored a hash of the file in a separate stream. Checking a file
 was a matter of rehashing it and comparing against the hash stored in
 the special hash data stream for that particular file.

And even that's not a valid use.  All the virus would have to do is to
infect the file, and then update the special hash data stream.  Why
is it that when programmers are told about streams as a potential
technology choice, their thinking becomes fuzzy?  :-)

- Ted


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Theodore Tso
On Thu, Jun 28, 2007 at 11:33:42AM -0700, Andrew Morton wrote:
  Please let us know what you think of Mingming's suggestion of posting
  all the fallocate patches including the ext4 ones as incremental ones
  against the -mm.
 
 I think Mingming was asking that Ted move the current quilt tree into git,
 presumably because she's working off git.

No, mingming and I both work off of the patch queue (which is also
stored in git).  So what mingming was asking for exactly was just
posting the incremental patches and tagging them appropriately to
avoid confusion.

I tried building the patch queue earlier in the week and there were
multiple oops/panics as I ran things through various regression tests,
but that may have been fixed since (the tree was broken over the
weekend and I may have grabbed a broken patch series) or it may have
been a screw up on my part feeding them into our testing grid.  I
haven't had time to try again this week, but I'll try to put together
a new tested ext4 patchset over the weekend.

 I'm not sure what to do, really.  The core kernel patches need to be in
 Ted's tree for testing but that'll create a mess for me.

I don't think we have a problem here.  What we have now is fine, and
it was just people kvetching that Amit reposted patches that were
already in -mm and ext4.

In any case, the plan is to push all of the core bits into Linus tree
for 2.6.22 once it opens up, which should be Real Soon Now, it looks
like.

- Ted


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-29 Thread Theodore Tso
On Fri, Jun 29, 2007 at 10:29:21AM -0400, Jeff Garzik wrote:
 In any case, the plan is to push all of the core bits into Linus tree
 for 2.6.22 once it opens up, which should be Real Soon Now, it looks
 like.
 
 Presumably you mean 2.6.23.

Yes, sorry.  I meant once Linus releases 2.6.22, and we would be
aiming to merge before the 2.6.23-rc1 window.

- Ted


Re: Versioning file system

2007-06-19 Thread Theodore Tso
On Tue, Jun 19, 2007 at 12:26:57AM +0200, Jörn Engel wrote:
 The main difference appears to be the potential size.  Both extended
 attributes and forks allow for extra data that I neither want or need.
 But once the extra space is large enough to hide a rootkit in, it
 becomes a security problem instead of just something pointless.

The other difference is that you can't execute an extended attribute.

You can store kvm/qemu, a complete virtualization environment, shared
libraries, and other executables all inside the forks of a file, and
then execute programs/rootkits out of said file fork(s).

As I mentioned in my LCA presentation, one system administrator
refused to upgrade beyond Solaris 8 because he thought forks were good
for nothing but letting system crackers hide rootkits that wouldn't be
detected by programs like tripwire.  The question then is why in the
world would we want to replicate Sun's mistakes?

- Ted


Re: Versioning file system

2007-06-19 Thread Theodore Tso
On Mon, Jun 18, 2007 at 03:48:15PM -0700, Jeremy Allison wrote:
 Did you ever code up forkdepot ? Just wondering ?

There is a partial implementation lying around somewhere, but there
were a number of problems we ran into that were discussed in the
slide deck.  Basically, if the only program accessing the files
containing forks was the Samba program calling the forkdepot library,
it worked fine.  But if there were other programs (or NFS servers)
that were potentially deleting files or moving files around, then
things fell apart fairly quickly.

 Just because I now agree with you that streams are
 a bad idea doesn't mean the pressure to support them
 in some way in Samba has gone away :-).

What, even with Winfs delaying Microsoft Longwait by years before
finally being flushed?  :-)

- Ted



Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 03:45:24AM -0600, Andreas Dilger wrote:
 Too bad everyone is spending time on 10 similar-but-slightly-different
 filesystems.  This will likely end up with a bunch of filesystems that
 implement some easy subset of features, but will not get polished for
 users or have a full set of features implemented (e.g. ACL, quota, fsck,
 etc).  While I don't think there is a single answer to every question,
 it does seem that the number of filesystem projects has climbed lately.

I view some of the attempts at from-scratch filesystems as ways of
testing out various designs as proofs of concept.  It's a great way
of demo'ing one's ideas, to see how well they work.  There is a huge
chasm between a proof of concept and a full production filesystem that
has great repair/recovery tools, etc.  That's why it's so important to
do the POC implementation first, so folks can see how well it works
before investing a huge amount of effort to make it
production-ready.

So I actually think the number of these new filesystem proposals is a
*good* thing.  It means people are interested in creating new
filesystems, and that's all good.  Eventually, we'll need to decide
which design ideas should be combined, and that may be a little tough
on the egos involved, but that's all part of the darwinian kernel
programming model.  Not all implementations make it into the kernel
mainline.  That doesn't mean that the work that was done on the
various scheduler proposals was useless; they just helped demonstrate
concepts and advanced the debate.

Regards,

- Ted


Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 09:16:30AM -0700, alan wrote:
 
 I just wish that people would learn from the mistakes of others.  The 
 MacOS is a prime example of why you do not want to use a forked 
 filesystem, yet some people still seem to think it is a good idea. 
 (Forked filesystems tend to be fragile and do not play well with 
 non-forked filesystems.)

Jeremy Allison used to be the one who was always pestering me to add
Streams support into ext4, but recently he's admitted that I was right
that it was a Very Bad Idea.

As I mentioned in my Linux.conf.au presentation a year and a half ago,
the main use of Streams in Windows to date has been for system
crackers to hide trojan horse code and rootkits so that system
administrators couldn't find them.  :-)


- Ted


Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 10:33:42AM -0700, Jeremy Allison wrote:
 
 Yeah, ok - but do you have to rub my nose in it every chance you get ?
 
 :-) :-).

Well, I just want to make sure people know that Samba isn't asking for
it any more, and I don't know of any current requests outstanding from
any of the userspace projects.  So there's no one we need to ship off
to the re-education camps about why filesystem fork/streams are a bad
idea.  :-)

- Ted


Re: Versioning file system

2007-06-18 Thread Theodore Tso
On Mon, Jun 18, 2007 at 02:31:14PM -0700, H. Peter Anvin wrote:
 And that makes them different from extended attributes, how?
 
 Both of these really are nothing but ad hocky syntactic sugar for
 directories, sometimes combined with in-filesystem support for small
 data items.

There's a good discussion of the issues involved in my LCA 2006
presentation, which doesn't seem to be on the LCA 2006 site.  Hrm.
I'll have to ask that this be fixed.  In any case, here it is:

http://thunk.org/tytso/forkdepot.odp

- Ted




Re: Read/write counts

2007-06-04 Thread Theodore Tso
On Mon, Jun 04, 2007 at 11:02:23AM -0600, Matthew Wilcox wrote:
 On Mon, Jun 04, 2007 at 09:56:07AM -0700, Bryan Henderson wrote:
  Programs that assume a full transfer are fairly common, but are 
  universally regarded as either broken or just lazy, and when it does cause 
  a problem, it is far more common to fix the application than the kernel.
 
 Linus has explicitly forbidden short reads from being returned.  The
 original poster may get away with it for a specialised case, but for
 example, signals may not cause a return to userspace with a short read
 for exactly this reason.

Hmm, I'm not sure I would go that far.  Per the POSIX specification,
we support the optional BSD-style restartable system calls for signals
which will avoid short reads; but this is only true if SA_RESTART is
passed to sigaction().  Without SA_RESTART, we will indeed return
short reads, as required by POSIX.
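
(A quick illustration of the two behaviors --- drop the SA_RESTART
flag below and a signal arriving mid-read can produce a short count
or -1/EINTR, exactly as POSIX allows:)

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static void handler(int sig) { (void)sig; }

    int main(void)
    {
        struct sigaction sa;
        char buf[4096];
        ssize_t n;

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sa.sa_flags = SA_RESTART;   /* BSD-style: restart the read */
        sigaction(SIGUSR1, &sa, NULL);

        /* with SA_RESTART, a SIGUSR1 delivered here transparently
           restarts the read instead of interrupting it */
        n = read(0, buf, sizeof(buf));
        printf("read returned %zd\n", n);
        return 0;
    }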

I don't think Linus has said that short reads are always evil; I
certainly can't remember him ever making that statement.  Do you have
a pointer to a LKML message where he's said that?

- Ted


Re: Read/write counts

2007-06-04 Thread Theodore Tso
On Mon, Jun 04, 2007 at 08:57:16PM +0200, Roman Zippel wrote:
 That's the last discussion about signals and I/O I can remember:
 http://www.ussg.iu.edu/hypermail/linux/kernel/0208.0/0188.html

Well, I think Linus was saying that we have to do both (where the
signal interrupts and where it doesn't), and I agree with that:

  There are enough reasons to discourage people from using uninterruptible
  sleep (this f*cking application won't die when the network goes down)
  that I don't think this is an issue. We need to handle both cases, and
   ^
  while we can expand on the two cases we have now, we can't remove them. 
  ^^^

Fortunately, although the -ERESTARTSYS framework is a little awkward
(and people can shoot arrows at me for creating it 15 years ago :-), we
do have a way of supporting both styles without _too_ much pain.

- Ted



Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
  Actually, this is a non-issue.  The reason that it is handled for 
  extent-only
  is that this is the only way to allocate space in the filesystem without
  doing the explicit zeroing.  For other filesystems (including ext3 and
  ext4 with block-mapped files) the filesystem should return an error (e.g.
  -EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.
 
 It can be a bit suboptimal from the layout POV.  The reservations code will
 largely save us here, but kernel support might make it a bit better.

Actually, the reservations code won't matter, since glibc will fall
back to its current behavior, which is it will do the preallocation by
explicitly writing zeros to the file.  This wlil result in the same
layout as if we had done the persistent preallocation, but of course
it will mean the posix_fallocate() could potentially take a long time
if you're a PVR and you're reserving a gig or two for a two hour movie
at high quality.  That seems suboptimal, granted, and ideally the
application should be warned about this before it calls
posix_fallocate().  On the other hand, it's what happens today, all
the time, so applications won't be too badly surprised.  
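
As a rough sketch of what that fallback amounts to (read-then-write-back
one byte per block, assuming a fixed 4K block size; real code would
derive the block size from fstatfs() and handle extending the file):

    #include <errno.h>
    #include <unistd.h>

    /* Touch one byte per block so the filesystem allocates it,
     * reading first so existing data is preserved. */
    static int fallocate_slow(int fd, off_t offset, off_t len)
    {
            const off_t bsize = 4096;   /* assumed block size */
            off_t end = offset + len;

            for (off_t pos = offset; pos < end; pos += bsize) {
                    char c = 0;
                    if (pread(fd, &c, 1, pos) < 0)
                            return errno;
                    if (pwrite(fd, &c, 1, pos) != 1)
                            return errno;
            }
            return 0;
    }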

If we think applications programmers badly need to know in advance if
posix_fallocate() will be fast or slow, probably the right thing is to
define a new fpathconf() configuration option so they can query to see
whether a particular file will support a fast posix_fallocate().  I'm
not 100% convinced such complexity is really needed, but I'm willing
to be convinced; what do folks think?
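
From the application side, such a query might look like this; the
_PC_FAST_FALLOCATE name is entirely hypothetical, invented here only to
illustrate the proposal:

    /* Hypothetical: _PC_FAST_FALLOCATE does not exist today; it
     * stands in for the proposed fpathconf() query. */
    long fast = fpathconf(fd, _PC_FAST_FALLOCATE);
    if (fast > 0)
            posix_fallocate(fd, 0, (off_t)2 * 1024 * 1024 * 1024);
    else
            warn_slow_prealloc();   /* app-defined: warn the user */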

- Ted


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 07:02:32PM -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 On May 07, 2007  13:58 -0700, Andrew Morton wrote:
 Final point: it's fairly disappointing that the present implementation is
 ext4-only, and extent-only.  I do think we should be aiming at an ext4
 bitmap-based implementation and an ext3 implementation.
 
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only
 is that this is the only way to allocate space in the filesystem without
 doing the explicit zeroing.  For other filesystems (including ext3 and
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

There is a bit in the extent structure which indicates that the extent
has not been initialized.  When reading from a block where the extent
is marked as uninitialized, ext4 returns zeros, to avoid returning the
uninitialized contents of the disk, which might contain someone else's
love letters, p0rn, or other information which we shouldn't leak out.
When writing to an extent which is uninitialized, we may potentially
have to split the extent into three extents in the worst case.

My understanding is that XFS uses a similar implementation; it's a
pretty obvious and standard way to implement allocated-but-not-initialized
extents.
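
The encoding can be as cheap as one bit stolen from the extent length
field.  A simplified sketch, modeled loosely on the ext4-style on-disk
extent (kernel-style types assumed; not the authoritative definition):

    struct extent {
            __u32 block;      /* first logical block covered      */
            __u16 len;        /* number of blocks; the high bit   */
                              /* marks the extent uninitialized   */
            __u16 start_hi;   /* high 16 bits of physical start   */
            __u32 start_lo;   /* low 32 bits of physical start    */
    };

    #define EXT_UNINIT 0x8000

    static inline int extent_uninitialized(const struct extent *ex)
    {
            return ex->len & EXT_UNINIT;
    }
    /* Reads of an uninitialized extent return zeros; the first
     * write clears the bit, splitting the extent (up to three
     * pieces) if only the middle of it is written. */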

We thought about supporting persistent preallocation for inodes using
indirect blocks, but it would require stealing a bit from each entry
in the indirect block, reducing the maximum size of the filesystem by
a factor of two (i.e., to 2**31 blocks).  It was decided it wasn't worth the
complexity, given the tradeoffs.

- Ted


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Theodore Tso
On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
 We could check the total number of fs free blocks account before
 preallocation happens, if there isn't enough space left, there is no
 need to bother preallocating.

Checking against the fs free blocks is a good idea, since it will
prevent the obvious error case where someone tries to preallocate 10GB
when there is only 2GB left.  But it won't help if there are multiple
processes trying to allocate blocks the same time.  On the other hand,
that case is probably relatively rare, and in that case, the
filesystem was probably going to be left completely full in any case.

On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
 Userspace could presumably repair the mess in most situations by truncating
 the file back again.  The kernel cannot do that because there might be live
 data in amongst there.

Actually, the kernel could do it, in that it could simply release all
uninitialized extents back to the system.  The problem is distinguishing
between the uninitialized extents that had just been newly added, versus
the ones that had been there from before.  (On the other hand, if the
filesystem was completely full, releasing uninitialized blocks wouldn't
be the worst thing in the world to do, although releasing previously
fallocated blocks probably does violate the principle of least
surprise, even if it's what the user would have wanted.)

On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
 If there is enough free space, we could make a reservation window that
 have at least N free blocks and mark it not stealable by other files. So
 later we will not run into the ENOSPC error.

Could you really use a single reservation window?  When the filesystem
is almost full, the free extents are likely going to be scattered all
over the disk.  The general principle of grabbing all of the extents
and keeping them in an in-memory data structure, and only adding them
to the extent tree would work, though; I'm just not sure we could do
it using the existing reservation window code, since it only supports
a single reservation window per file, yes?
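
A design sketch of the multi-extent alternative (illustrative only;
this is not existing ext3/ext4 code):

    /* Design sketch: extents grabbed for one fallocate() call. */
    struct grabbed_extent {
            __u64 start;                /* first physical block */
            __u32 len;                  /* number of blocks     */
            struct list_head list;
    };
    /*
     * 1. Pull free extents off the allocator and chain them here
     *    so no other file can claim them.
     * 2. On ENOSPC, return every grabbed extent: no partial
     *    allocation ever becomes visible on disk.
     * 3. On success, insert them all into the inode's extent
     *    tree, marked uninitialized, in one transaction.
     */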

- Ted


Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
 Except one other issue with online shrinking is that we need to move
 inodes on occasion and this poses a bunch of other problems over just
 remapping the data blocks.

Well, I did say "necessary", and not "sufficient".  But yes, moving
inodes, especially if the inode is currently open gets interesting.  I
don't think there are that many user space applications that would
notice or care if the st_ino of an open file changed out from under
them, but there are obviously userspace applications, such as tar,
that would most definitely care.

- Ted


Re: Ext2/3 block remapping tool

2007-05-01 Thread Theodore Tso
On Tue, May 01, 2007 at 12:52:49PM -0600, Andreas Dilger wrote:
 I think rm -r does a LOT of this kind of operation, like:
 
 stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)
 
 I think find does the same to avoid security problems with malicious
 path manipulation.

Yep, so if you're doing an rm -rf (or any other recursive descent)
while we're doing an on-line shrink, it's going to fail.  I suppose we
could have an in-core inode mapping table that would continue to remap
inode numbers until the next reboot.  I'm not sure we would want to
keep the inode remapping indefinitely, although if we don't it could
also end up screwing up NFS as well.  Not sure I care, though.  :-)

- Ted


Re: Ext2/3 block remapping tool

2007-04-30 Thread Theodore Tso
On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:
 I'd prefer that such functionality be integrated with Takashi's online
 defrag tool, since it needs virtually the same functionality.  For that
 matter, this is also very similar to the block-mapped -> extents tool
 from Aneesh.  It doesn't make sense to have so many separate tools for
 users, especially if they start interfering with each other (i.e. defrag
 undoes the remapping done by your tool).

Yep, in fact, I'm really glad that Jan is working on the remapping
tool because if the on-line defrag kernel interfaces don't have the
right support for it, then that means we need to fix the on-line
defrag patches.  :-)

While we're at it, someone want to start thinking about on-line
shrinking of ext4 filesystems?  Again, the same block remapping
interfaces for defrag and file access optimizations should also be
useful for shrinking filesystems (even if some of the files that need
to be relocated are being actively used).  If not, that probably means
we got the interface wrong.

- Ted


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-30 Thread Theodore Tso
On Sun, Apr 29, 2007 at 08:40:42PM -0500, Matt Mackall wrote:
 chunkfs. The other is reverse maps (aka back pointers) for blocks ->
 inodes and inodes -> directories that obviate the need to have large
 amounts of memory to check for collisions.

Yes, I missed the fact that you had back pointers for blocks as well
as inodes.  So the block table in the tile header gets used for
determining if a block is free, much like is done with FAT, right?

That's a clever system; I like it.  It does mean that there are a lot
more metadata updates, but since you're not journaling, that should
counter that effect to some extent.

IMHO, it's definitely worth a try to see how well it works!

- Ted


Re: [RFC] TileFS - a proposal for scalable integrity checking

2007-04-29 Thread Theodore Tso
On Sat, Apr 28, 2007 at 05:05:22PM -0500, Matt Mackall wrote:
 This is a relatively simple scheme for making a filesystem with
 incremental online consistency checks of both data and metadata.
 Overhead can be well under 1% disk space and CPU overhead may also be
 very small, while greatly improving filesystem integrity.

What's your goal here?  Is it to speed up fsck's after an
unclean shutdown, to the point that you don't need to use a journal
or some kind of soft updates scheme?

Is it to speed up fsck's after the kernel has detected some kind of
internal consistency error?  

Is it to speed up fsck's after you no longer have confidence that
random blocks in the filesystem may have gotten corrupted, due to a
suspend/resume bug, hard drive failure, reported CRC/parity errors
when writing to the device, or reports of massive ECC failures in your
memory that could have caused random blocks to have been written
with multiple bit flips?

The first is relatively easy, but as you move down the list, things
get progressively harder, since it's no longer possible to use a
per-tile clean bit to assume that you can skip checking that
particular tile or chunk.

  Divide disk into a bunch of tiles. For each tile, allocate a one
  block tile header that contains (inode, checksum) pairs for each
  block in the tile. Unused blocks get marked inode -1, filesystem
  metadata blocks -2. The first element contains a last-clean
  timestamp, a clean flag and a checksum for the block itself. For 4K
  blocks with 32-bit inode and CRC, that's 512 blocks per tile (2MB),
  with ~.2% overhead.
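
To make the arithmetic concrete: one 4K header block holds 512
eight-byte entries, so each header describes up to 512 data blocks
(2MB); 4K of header per 2MB of data is 1/513, or roughly 0.2% overhead.
In struct form, the proposal might look like this (a sketch of the
layout described above, not anything that exists):

    /* Sketch of the proposed tile header (one 4K block): */
    struct tile_entry {
            __u32 inode;    /* owner; -1 == free, -2 == metadata */
            __u32 crc;      /* CRC of the corresponding block    */
    };

    struct tile_header {
            /* Entry 0 is repurposed: last-clean timestamp, clean
             * flag, and the header block's own checksum. */
            struct tile_entry e[512];   /* 512 * 8 bytes == 4096 */
    };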

So what happens for files that are bigger than 2MB?  Presumably they
consists of blocks that must come from more than one tile, right?  So
is an inode allowed to reference blocks outside of its tile?  Or does
an inode that needs to span multiple tiles have a local sub-inode in
each tile, with back pointers to parent inode?   

Note that both design paths have some serious tradeoffs.  If you allow
an inode to span multiple tiles, now you can't check the block
allocation data structures without scanning all of the tiles involved.
If you have sub-inodes, now you have to have bidirectional pointers
and you have to validate those pointers after validating all of the
individual tiles.  

This is one of the toughest aspects of either the chunkfs or tilefs
design, and when we discussed chunkfs at past filesystem workshops, we
didn't come to any firm conclusions about the best way to solve it,
except to acknowledge that it is a hard problem.  My personal
inclination is to use substantially bigger chunks than the 2MB that
you've proposed, and make each of the chunks more like 2 or 4 GB each,
and to enforce a rule which says an inode in a chunk can only
reference blocks in that local chunk, and to try like mad to keep
directories referencing inodes in their own chunk, and inodes
referencing blocks within that chunk.  When a file is bigger than a
chunk, then
you will be forced to use indirection pointers that basically say,
for offsets 2GB-4GB, reference inode  in chunk , and for
4GB-8GB, check out inode  in chunk , etc.
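
In code, such a continuation pointer might look like the following
(purely a design sketch for the scheme described above; nothing like
this exists in any current filesystem):

    /* Design sketch: a chunk-local inode forwards byte ranges of
     * a large file to sibling inodes in other chunks. */
    struct cont_ref {
            __u64 first_byte;   /* e.g. 2GB                       */
            __u64 last_byte;    /* e.g. 4GB - 1                   */
            __u32 chunk;        /* chunk holding the next piece   */
            __u32 ino;          /* inode number within that chunk */
    };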

I won't say that this is definitely the best way to do things, but I
note that you haven't really address this design point, and there are
no obvious best ways of handling this.

  [Note that CRCs are optional so we can cut the overhead in half. I
  choose CRCs here because they're capable of catching the vast
  majority of accidental corruptions at a small cost and mostly serve
  to protect against errors not caught by on-disk ECC (eg cable noise,
  kernel bugs, cosmic rays). Replacing CRCs with a stronger hash like
  SHA-n is perfectly doable.]

If the goal is just accidental corruptions, CRC's are just fine.  If
you want better protection against accidental corruption, then the
answer is to use a bigger CRC.  Using a cryptographic hash like SHA-n
is pure overkill unless you're trying to design protection against a
malicious attacker, in which case you've got a much bigger set of
problems that you have to address first --- you don't get a
cryptographically secure filesystem by replacing a CRC with a SHA-n hash
function.

  Every time we write to a tile, we must mark the tile dirty. To cut
  down time to find dirty tiles, the clean bits can be collected into a
  smaller set of blocks, one clean bitmap block per 64GB of data.

Hopefully the clean bitmap block is protected by a checksum.  After
all, the smaller set of clean bitmap block is going to be constantly
updated as tiles get dirtied, and then cleaned.  What if they get
corrupted?  How does the checker notice?  And presumably if there is a
CRC that doesn't verify, it would have to check all of the tiles,
right?

 Checking a tile:
 
  Read the tile
  If clean and current, we're done.
  Check the tile header checksum
  Check the checksum on each block in the tile
  Check that metadata blocks are metadata
  Check that inodes in tile agree with inode 

Re: ChunkFS - measuring cross-chunk references

2007-04-24 Thread Theodore Tso
On Mon, Apr 23, 2007 at 06:02:29PM -0700, Arjan van de Ven wrote:
 
  The other thing which we should consider is that chunkfs really
  requires a 64-bit inode number space, which means either we only allow
 
 does it?
 I'd think it needs a chunk space number and a 32 bit local inode
 number ;) (same for blocks)
 

But that means that the number which gets exported to userspace via
the stat system call will need more than 32 bits worth of ino_t

- Ted



Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Theodore Tso
On Mon, Apr 23, 2007 at 02:53:33PM -0600, Andreas Dilger wrote:
  With a blocksize of 4KB, a block group would be 128 MB. In the original
  Chunkfs paper, Val Henson had mentioned 1GB chunks and I believe it will be
  possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size
  increases the number of cross-chunk references will reduce and hence it
  might be a good idea to present these statistics considering different
  chunk sizes starting from 512MB upto 2GB.
 
 Also, given that cross-chunk references will be more expensive to fix, I
 can imagine the allocation policy for chunkfs will try to avoid this if
 possible, further reducing the number of cross-chunk inodes.  I guess it
 should be more clear whether the cross-chunk references are due to inode
 block references, or because of e.g. directories referencing inodes in
 another chunk.

It would also be good to distinguish between directories referencing
files in another chunk, and directories referencing subdirectories in
another chunk (which would be simpler to handle, given the topological
restrictions on directories, as compared to files and hard links).

There may also be special things we will need to do to handle
scenarios such as BackupPC, where if it looks like a directory
contains a huge number of hard links to a particular chunk, we'll need
to make sure that directory is either created in the right chunk
(possibly with hints from the application) or migrated to the right
chunk (but this might cause the inode number of the directory to
change --- maybe we allow this as long as the directory has never been
stat'ed, so that the inode number has never been observed).

The other thing which we should consider is that chunkfs really
requires a 64-bit inode number space, which means either we only allow
it on 64-bit systems, or we need to consider a migration so that even
on 32-bit platforms, stat() functions like stat64(), insofar as it
uses a stat structure which returns a 64-bit ino_t.
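
Userspace already has most of the plumbing for this via the LFS
interfaces: built with 64-bit file offsets, even a 32-bit program sees a
64-bit st_ino.  A quick illustration (assuming a glibc-style
environment):

    /* Build with -D_FILE_OFFSET_BITS=64 so that stat() maps to
     * stat64() and struct stat carries a 64-bit st_ino even on
     * 32-bit platforms. */
    #include <stdio.h>
    #include <sys/stat.h>

    int main(int argc, char **argv)
    {
            struct stat st;

            if (argc > 1 && stat(argv[1], &st) == 0)
                    printf("ino = %llu (ino_t is %zu bytes)\n",
                           (unsigned long long)st.st_ino,
                           sizeof(st.st_ino));
            return 0;
    }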

- Ted


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-08 Thread Theodore Tso
The reason why I ignore the tar+gzip tests is that in the past Hans
has rigged the test by using a tar ball which was generated by
unpacking a set of kernel sources on a reiser4 filesystem, and then
repacking them using tar+gzip.  The result was a tar file whose files
were optimally laid out so that reiser4 could insert them into the
filesystem b-tree without doing any extra work.

I can't say for sure whether or not this set of benchmarks has done
this (there's not enough information describing the benchmark setup),
but the sad fact of the matter is that people trying to pitch Reiser4
have generated for themselves a reputation for using rigged
benchmarks.  Hans's use of a carefully stacked and ordered tar file
(which is the same as stacking a deck of cards), and your repeated use
of the bonnie++ benchmarks despite being told that the result is
meaningless (zeros compress very well, and very few people are
interested in storing a file of all zeros), has caused me to
look at any benchmarks cited by Reiser4 partisans with a very
jaundiced and skeptical eye.

Fortunately for you, it's not up to me whether or not Reiser4 makes it
into the kernel.  And if it works for you, hey, go wild.  You can
always patch it into your own kernel and encourage others to do the
same with respect to getting it tested and adopted.  My personal take
on it is that Reiser3, Reiser4 and JFS suffer the same problems, which
is to say they have a very small and limited development community,
and this was referenced in Novell's decision to drop Reiser3:

http://linux.wordpress.com/2006/09/27/suse-102-ditching-reiserfs-as-it-default-fs/

SuSE has deprecated Reiser3 *and* JFS, and I believe quite strongly
that the failure of those organizations to attract a diverse development
community is ultimately what doomed them in the long term, both in
terms of support as the kernel evolved and in terms of new feature
support.  It is for that reason that Hans's personality traits, which
tend to drive away those developers who would help him (beyond those
that he hires), have been so self-destructive to Reiser4.  Read the
announcement from Jeff Mahoney of SUSE Labs again; he pointed out
that reiser3 was getting dropped even though it performs better than
ext3 in some scenarios.  There are many other considerations, such as
a filesystem's robustness in the face of on-disk corruption, long-term
maintenance as the kernel evolves, availability of developers to
provide bug fixes, how well the system performs on systems with
multiple cores/CPUs, etc.

- Ted


Re: Reiser4. BEST FILESYSTEM EVER.

2007-04-07 Thread Theodore Tso
On Sat, Apr 07, 2007 at 05:44:57PM -0700, [EMAIL PROTECTED] wrote:
 To get a feel for the performance increases that can be achieved by
 using compression, we look at the total time (in seconds) to run the
 test:

You mean the performance increases of writing a file which is mostly
all zeros?  Yawn.

- Ted


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
 Well, I'm sure the kernel can do better than the code we have in libc
 now.  The kernel has access to the bitmasks which say which blocks have
 already been allocated.  The libc code does not and we have to be very
 simple-minded and simply touch every block.  And this means reading it
 and then writing it back.  The kernel would know when the reading part
 is not necessary.  Add to that the block granularity (we use f_bsize as
 returned from fstatfs but that's not the best value in some cases) and
 you have compelling data to have generic code in the kernel.  The libc
 implementation can then go away completely, which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache.  So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.

Regards,

- Ted


Re: end to end error recovery musings

2007-02-26 Thread Theodore Tso
On Mon, Feb 26, 2007 at 04:33:37PM +1100, Neil Brown wrote:
 Do we want a path in the other direction to handle write errors?  The
 file system could say "Don't worry too much if this block cannot be
 written; just return an error and I will write it somewhere else"?
 This might allow md not to fail a whole drive if there is a single
 write error.

Can someone with knowledge of current disk drive behavior confirm that
for all drives that support bad block sparing, if an attempt to write
to a particular spot on disk results in an error due to bad media at
that spot, the disk drive will automatically rewrite the sector to a
sector in its spare pool, and automatically redirect that sector to
the new location.  I believe this should be always true, so presumably
with all modern disk drives a write error should mean something very
serious has happened.

(Or that someone was in the middle of reconfiguring a FC network and
they're running a kernel that doesn't understand why short-duration FC
timeouts should be retried.  :-)

 Or is that completely un-necessary as all modern devices do bad-block
 relocation for us?
 Is there any need for a bad-block-relocating layer in md or dm?

That's the question.  It wouldn't be that hard for filesystems to be
able to remap a data block, but (a) it would be much more difficult
for fundamental metadata (for example, the inode table), and (b) it's
unnecessary complexity if the lower levels in the storage stack should
always be doing this for us in the case of media errors anyway.

 What about corrected-error counts?  Drives provide them with SMART.
 The SCSI layer could provide some as well.  Md can do a similar thing
 to some extent.  Where these are actually useful predictors of pending
 failure is unclear, but there could be some value.
 e.g. after a certain number of recovered errors raid5 could trigger a
 background consistency check, or a filesystem could trigger a
 background fsck should it support that.

Somewhat off-topic, but my one big regret with how the dm vs. evms
competition settled out was that evms had the ability to perform block
device snapshots using a non-LVM volume as the base --- and that EVMS
allowed a single drive to be partially managed by the LVM layer, and
partially managed by evms.  

What this allowed is the ability to do device snapshots and therefore
background fsck's without needing to convert the entire laptop disk to
using a LVM solution (since to this day I still don't trust initrd's
to always do the right thing when I am constantly replacing the kernel
for kernel development).

I know, I'm weird; distro users have initrds that seem to mostly work,
and it's only weird developers that try to use bleeding-edge kernels
with a RHEL4 userspace that suffer, but it's one of the reasons why
I've avoided initrd's like the plague --- I've wasted entire days
trying to debug problems with the userspace-provided initrd being too
old to support newer 2.6 development kernels.

In any case, the reason why I bring this up is that it would be really
nice if there was a way with a single laptop drive to be able to do
snapshots and background fsck's without having to use initrd's with
device mapper.

- Ted


Re: end to end error recovery musings

2007-02-23 Thread Theodore Tso
On Fri, Feb 23, 2007 at 05:37:23PM -0700, Andreas Dilger wrote:
  Probably the only sane thing to do is to remember the bad sectors and 
  avoid attempting reading them; that would mean marking automatic 
  versus explicitly requested requests to determine whether or not to 
  filter them against a list of discovered bad blocks.
 
 And clearing this list when the sector is overwritten, as it will almost
 certainly be relocated at the disk level.  For that matter, a huge win
 would be to have the MD RAID layer rewrite only the bad sector (in hopes
 of the disk relocating it) instead of failing the whole disk.  Otherwise,
 a few read errors on different disks in a RAID set can take the whole
 system offline.  Apologies if this is already done in recent kernels...

And having a way of making this list available to both the filesystem
and to a userspace utility, so they can more easily deal with doing a
forced rewrite of the bad sector, after determining which file is
involved and perhaps doing something intelligent (up to and including
automatically requesting a backup system to fetch a backup version of
the file, and if it can be determined that the file shouldn't have
been changed since the last backup, automatically fixing up the
corrupted data block :-).

- Ted


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Wed, Feb 14, 2007 at 11:54:54AM -0800, Valerie Henson wrote:
 Background: The eXplode file system checker found a bug in ext2 fsync
 behavior.  Do the following: truncate file A, create file B which
 reallocates one of A's old indirect blocks, fsync file B.  If you then
 crash before file A's metadata is all written out, fsck will complete
 the truncate for file A... thereby deleting file B's data.  So fsync
 file B doesn't guarantee data is on disk after a crash.  Details:

It's actually not the case that fsck will complete the truncate for
file A.  The problem is that while e2fsck is processing indirect
blocks in pass 1, the block which is marked as file A's indirect block
(but which actually contain's file B's data) gets fixed when e2fsck
sees block numbers which look like illegal block numbers.  So this
ends up corrupting file B's data.

This is actually a legal end result, BTW, since POSIX states that the
result of fsync() is undefined if the system crashes.  Technically
fsync() did actually guarantee that file B's data is on disk; the
problem is that e2fsck would corrupt the data afterwards.  Ironically,
fsync()'ing file B actually makes it more likely that it might get
corrupted afterwards, since normally filesystem metadata gets sync'ed
out on 5 second intervals, while data gets sync'ed out at 30 second
intervals.

 * Rearrange order of duplicate block checking and fixing file size in
   fsck.  Not sure how hard this is. (Ted?)

It's not a matter of changing when we deal with fixing the file size,
as described above.  At fsck time, we would need to keep backup
copies of any indirect blocks that get modified for whatever reason,
and then in pass 1D, when we clone a block that has been claimed by
multiple inodes, the inodes which claim the block as a data block
should get a copy of the block before it was modified by e2fsck.

 * Keep a set of still allocated on disk block bitmaps that gets
   flushed whenever a sync happens.  Don't allocate these blocks.
   Journaling file systems already have to do this.

A list would be more efficient, as others have pointed out.  That
would work, although knowing when entries could be removed from
the list is the tricky part.  The machinery for knowing when metadata has been updated
isn't present in ext2, and that's a fair amount of complexity.  You
could clear the list/bitmap after the 5 second metadata flush command
has been kicked off, or if you associate a data block with the
previous inode's owner, you could clear the entry when the inode's
dirty bit has been cleared, but that doesn't completely get rid of the
race unless you tie it to when the write has completed (and this
assumes write barriers to make sure the block was actually flushed to
the media).
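
A sketch of the list-based variant (illustrative only; ext2 has no such
machinery today):

    /* Design sketch: blocks freed in memory whose previous
     * owner's metadata may not have reached the disk yet. */
    struct busy_block {
            unsigned long block;        /* fs block number */
            struct list_head list;
    };
    /* Allocator: skip any block still on the list.  After the
     * periodic metadata writeback completes (ideally behind a
     * write barrier), splice the list away so the blocks become
     * allocatable again. */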

Another very heavyweight approach would be to simply force a full sync
of the filesystem whenever fsync() is called.  Not pretty, and without
the proper write ordering, the race is still potentially there.

I'd say that the best way to handle this is in fsck, but quite frankly
it's relatively low priority bug to handle, since a much simpler
workaround is to tell people to use ext3 instead.

Regards,

- Ted


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 10:39:02AM -0600, Dave Kleikamp wrote:
  It was my understanding from the presentation of Dawson that ext3 and jfs
  have the same problem.
 
 Hmm.  If jfs has the problem, it is a bug.  jfs is designed to handle
 this correctly.  I'm pretty sure I've fixed at least one bug that
 eXplode has uncovered in the past.  I'm not sure what was mentioned in
 the presentation though.  I'd like any information about current
 problems in jfs.

That was not my understanding of the charts that were presented
earlier this week.  Ext3 journaling code will deal with this case
explicitly, just as jfs does.  

- Ted


Re: Fix(es) for ext2 fsync bug

2007-02-15 Thread Theodore Tso
On Thu, Feb 15, 2007 at 11:28:46AM -0800, Junfeng Yang wrote:
 
 Actually,  we found a crash-during-recovery bug in ext3 too.  It's a race
 between resetting the journal super block and replay of the journal.  This
 bug was fixed by Ted long time ago (3 years?).

That was found in your original work (using UML) not the more recent
work using EXPLODE, correct?

- Ted



Re: [RFC][PATCH 2/3] Move the file data to the new blocks

2007-02-11 Thread Theodore Tso
On Thu, Feb 08, 2007 at 11:47:39AM +0100, Jan Kara wrote:
  
  Well.  Do we really?  Are we looking for a 100% solution here, or a 90% one?
   Umm, I think that for ext3 having data on one end of the disk and
 indirect blocks on the other end of the disk does not quite help (not
 to mention that it can create bad free-space fragmentation over time).
 I have not measured it but I'd guess that it would erase the effect of
 moving data closer together. At least for sequential reads..

I don't think anyone is saying we can ignore the metadata; but the
fact is, the cleanest solution for 90% of the problem is to use the
page cache, and as far as the other 10%, Linus has been pushing us to
move at least the directories into the page cache, and it's not insane
to consider moving the rest of the metadata into page cache.  At least
it's something we should consider carefully.

- Ted


Re: Ext3 question: How to compose an inode given a list of data block numbers?

2007-02-08 Thread Theodore Tso
On Thu, Feb 08, 2007 at 02:46:19PM -0800, hlily wrote:
 
 Suppose I have a list of data blocks, does Ext3 provide some functions that
 can help me to build a block list into an inode?
 
 If no such functions, could someone direct me to the right place in Ext3
 code that add block numbers to an inode?

What are you trying to do?  Are you trying to do this from a kernel
module, or from user space?   

- Ted


Re: [PATCH[RFC] kill sysrq-u (emergency remount r/o)

2007-02-05 Thread Theodore Tso
On Mon, Feb 05, 2007 at 09:40:08PM +0100, Jan Engelhardt wrote:
 
 On Feb 5 2007 18:32, Christoph Hellwig wrote:
 
 in two recent discussions (file_list_lock scalability and remount r/o
 on suspend) I stumbled over this emergency remount feature.  It's not
 actually useful because it tries a potentially dangerous remount
 despite writers still being in progress, which we can't get rid of.
 
 The current way is to remount things, and return -EROFS to any process
 that attempts to write(). Unless we want to kill processes to get rid of
 them [most likely we possibly won't], I am fine with how things are atm.
 So, what's the dangerous part, actually?

The dangerous part is that we change f->f_mode for all open files
without regard for whether there might be any writes underway at the
time.  This isn't *serious* although the results might be a little
strange and it might result in a confused return from write(2).  More
seriously, mark_files_ro() in super.c *only* changes f->f_mode and
doesn't deal with the possibility that the file might be mapped
read-write.  For filesystems that do delayed allocation, I'm not at
all convinced that an emergency read-only will result in the
filesystem doing anything at all sane, depending on what else the
filesystem might do when the filesystem is forced into read-only state.
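
For reference, the heart of that code is just a scan over the
superblock's open files, roughly like this (a simplified paraphrase of
the 2.6-era fs/super.c logic, not a verbatim copy); note that nothing
here touches existing read/write mmap()s:

    /* Simplified sketch of mark_files_ro(): */
    static void mark_files_ro(struct super_block *sb)
    {
            struct file *f;

            file_list_lock();
            list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
                    if (S_ISREG(f->f_dentry->d_inode->i_mode))
                            f->f_mode &= ~FMODE_WRITE; /* mmaps untouched! */
            }
            file_list_unlock();
    }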

 sysrq+u is helpful. It is like ( sysrq+s && make sure no further writes
 go to disk ).

I agree it is useful, but if we're going to do it we really should do
it right.  We should have real revoke() functionality on file
descriptors, which revokes all of the mmap()'s (any attempt to write
into a previously read/write mmap will cause a SEGV) as well as
changing f_mode, and then use that to implement emergency read-only
remount.

- Ted


Re: [PATCH 21/35] Unionfs: Inode operations

2006-12-07 Thread Theodore Tso
On Tue, Dec 05, 2006 at 01:50:17PM -0800, Andrew Morton wrote:
 This
 
   /*
* Lorem ipsum dolor sit amet, consectetur
* adipisicing elit, sed do eiusmod tempor
* incididunt ut labore et dolore magna aliqua.
*/
 
 is probably the most common, and is what I use when forced to descrog
 comments.

This is what I normally do by default, unless it's a one-line comment, in
which case my preference is usually for this:

/* Lorem ipsum dolor sit amet, consectetur */

I'm not convinced we really do _need_ to standardize on comment styles
(I can foresee thousands and thousands of trivial patches being
submitted and we'd probably be better off encouraging people to spend
time actually improving the documentation instead of reformatting it :-), 
but if we're going to standardize, that would be my vote.

- Ted




Re: [Ext2-devel] Re: ext3 for 2.4

2001-05-17 Thread Theodore Tso

On Thu, May 17, 2001 at 03:00:28PM -0400, Jeff Garzik wrote:
 AFAIK the original stated intention of ext3 was
 
   cd linux/fs
   cp -a ext2 ext3
   # hack on ext3
 
 That leaves ext2 in ultra-stability,
 no-patches-unless-absolutely-necessary mode.
 
 IMHO prove a new feature, like directories in page cache, journaling,
 etc. in ext3 first.  Then maybe after a year of testing, if people
 actually care, backport those features to ext2.

Alternatively, once we get ext3 with just journaling stable (and with
an option to not do journaling at all), simply do something like this:

cd linux/fs
rm -rf ext2
mv ext3 ext2
cp -r ext2 ext3
# hack hack hack on ext3 and add even more features

So ext3 is always the development version, and ext2 is the stable
version.

- Ted




