from:"Andreas Dilger"

Re: [RFC] ext3 freeze feature ver 0.2

2008-02-26 Thread Andreas Dilger

On Feb 26, 2008  08:39 -0800, Eric Sandeen wrote:
 Takashi Sato wrote:
 
  o Elevate XFS ioctl numbers (XFS_IOC_FREEZE and XFS_IOC_THAW) to the VFS
    As Andreas Dilger and Christoph Hellwig advised me, I have elevated
    them to include/linux/fs.h as below.
      #define FIFREEZE        _IOWR('X', 119, int)
      #define FITHAW          _IOWR('X', 120, int)
    The ioctl numbers used by XFS applications don't need to be changed.
    But my following ioctl for the freeze needs the parameter
    as the timeout period.  So if XFS applications don't want the timeout
    feature as the current implementation, the parameter needs to be
    changed 1 (level?) into 0.
 
 So, existing xfs applications calling the xfs ioctl now will behave
 differently, right?  We can only keep the same ioctl number if the
 calling semantics are the same.  Keeping the same number but changing
 the semantics is harmful, IMHO.

Do we know what this parameter was supposed to mean?

We could special case 1 if needed to keep compatibility (documenting
this clearly), either making it == 0, or some very long timeout (1h
or whatever).  A relatively minor wart I think.
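
A minimal sketch of the special case suggested above, assuming the
argument is a timeout in seconds with 0 meaning "no timeout" and that
existing XFS callers pass 1.  The helper name and the macro are
hypothetical and are not taken from any posted patch.

```c
#include <assert.h>

/* Assumed value passed by existing XFS callers (the old "level"). */
#define FREEZE_LEGACY_XFS_ARG 1

/*
 * Hypothetical helper: translate the user-supplied FIFREEZE argument
 * into a timeout in seconds, where 0 means "freeze without timeout".
 */
int freeze_timeout_from_arg(int arg)
{
    assert(arg >= 0);               /* this sketch rejects negative values */
    if (arg == FREEZE_LEGACY_XFS_ARG)
        return 0;                   /* legacy caller: no timeout */
    return arg;                     /* otherwise: timeout in seconds */
}
```

The cost of this approach is that a caller can no longer ask for a
one-second timeout, which is the "minor wart" that would need to be
documented.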

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: i_version changes

2008-02-13 Thread Andreas Dilger

On Feb 12, 2008  15:06 -0500, J. Bruce Fields wrote:
 On Sun, Feb 10, 2008 at 08:30:41AM +0100, Christoph Hellwig wrote:
  Third using the MS_ flag but then actually having a filesystem
  mount option to enable it is more than confusing.  After all MS_
  options (at least the exported parts) are the mount ABI for common
  options.  Also this option doesn't show up in ->show_options,
  which is something Miklos will beat you up for :)
  I'm also not convinced this should be option behaviour, either you
  do update i_version for a given filesystem or you don't - having
  an obscure mount option will only give you confusion.
 
 That does sound likely to be confusing.  Any chance we could just make
 the new behavior mandatory?

One of the reasons NOT to make it mandatory is that it forces updates
of the inode after every write.  On ext3/ext4 this is expensive, as the
ext3_dirty_inode() packs the inode from memory into the buffer each time,
so that it can be journaled.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: i_version changes

2008-02-13 Thread Andreas Dilger

On Feb 13, 2008  09:07 -0500, Trond Myklebust wrote:
 On Wed, 2008-02-13 at 13:52 +0100, Christoph Hellwig wrote:
  Btw, stupid question:  the commit message for the i_version changes
  mentions this is to work around lack of granularity for ctime updates.
  But all modern filesystems (and I include ext4 in that here) have 64-bit
  timestamps already, so that should be enough.  It would certainly
  avoid all this additional code, and especially the additional space
  used in struct inode which can hurt quite a lot.
 
 Support for 64-bit on-disk time stamps alone does not suffice. In order
 to label all file/directory changes unambiguously, you would also need
 nanosecond timekeeping support.
 
 An example is XFS, which has had on-disk support for 64-bit time stamps
 since forever, but is in practice limited by the Linux default 250Hz
 internal clock. We've seen plenty of examples of NFS clients missing
 updates on the resulting filesystem due to the fact that they occurred
 within 1/250 sec of each other.

The other issue which unfortunately makes ctime a non-starter is the
ability of ctime to go backward due to clock changes.
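
To illustrate both points (coarse clock ticks and clocks stepping
backward), here is a toy model rather than kernel code.  The struct and
function names are made up for this sketch; the 1/250 s tick comes from
the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TICK_NS 4000000ULL  /* 1/250 s, the clock tick mentioned above */

struct toy_inode {
    uint64_t ctime_ns;  /* timestamp, only as fine as the clock tick */
    uint64_t version;   /* change counter, bumped on every update */
};

/* Record a change that happens at wall-clock time now_ns. */
void toy_inode_change(struct toy_inode *ino, uint64_t now_ns)
{
    assert(ino != NULL);
    ino->ctime_ns = now_ns - (now_ns % TICK_NS);  /* round down to a tick */
    ino->version++;                               /* always moves forward */
}
```

Two changes inside the same tick leave ctime_ns identical, and a change
made after the clock is set back yields a smaller ctime_ns, while the
counter differs and grows in both cases.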

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] ext3 freeze feature

2008-02-08 Thread Andreas Dilger

On Feb 08, 2008  19:48 +0900, Takashi Sato wrote:
 OK I would like to implement the freeze feature on VFS
 as the filesystem independent ioctl so that it can be
 available on filesystems that have already had write_super_lockfs()
 and unlockfs().
 The usage for the freeze ioctl is the following.
   int ioctl(int fd, int FIFREEZE, long *timeval);
     fd: file descriptor of mountpoint
     FIFREEZE: request code for freeze
     timeval: timeout period (seconds)

 And the unfreeze ioctl is the following.
   int ioctl(int fd, int FITHAW, NULL);
     fd: file descriptor of mountpoint
     FITHAW: request code for unfreeze

You may as well make the common ioctl the same as the XFS version,
both by number and parameters, so that applications which already
understand the XFS ioctl will work on other filesystems.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-02-06 Thread Andreas Dilger

On Jan 26, 2008  08:27 +0300, Al Boldi wrote:
 Jan Kara wrote:
   data=ordered mode has proven reliable over the years, and it does this
   by ordering filedata flushes before metadata flushes.  But this
   sometimes causes contention in the order of a 10x slowdown for certain
   apps, either due to the misuse of fsync or due to inherent behaviour
   like db's, as well as inherent starvation issues exposed by the
   data=ordered mode.
  
   data=writeback mode alleviates data=ordered mode slowdowns, but only works
   per-mount and is too dangerous to run as a default mode.
  
   This RFC proposes to introduce a tunable which allows to disable fsync
   and changes ordered into writeback writeout on a per-process basis like
   this:
  
     echo 1 > /proc/`pidof process`/softsync
 
  I guess disabling fsync() was already commented on enough. Regarding
  switching to writeback mode on per-process basis - not easily possible
  because sometimes data is not written out by the process which stored
  them (think of mmaped file).
 
 Do you mean there is a locking problem?
 
  And in case of DB, they use direct-io
  anyway most of the time so they don't care about journaling mode anyway.
 
 Testing with sqlite3 and mysql4 shows that performance drastically improves 
 with writeback writeout.
 
   But as Diego wrote, there is definitely some room for improvement in
  current data=ordered mode so the difference shouldn't be as big in the
  end.
 
 Yes, it would be nice to get to the bottom of this starvation problem, but 
 even then, the proposed tunable remains useful for misbehaving apps.

Al, can you try a patch posted to linux-fsdevel and linux-ext4 from
Hisashi Hifumi [EMAIL PROTECTED] to see if this improves
your situation?  Dated Mon, 04 Feb 2008 19:15:25 +0900.

[PATCH] ext3,4:fdatasync should skip metadata writeout when overwriting

It may be that we already have a solution in that patch for database
workloads where the pages are already allocated by avoiding the need
for ordered mode journal flushing in that case.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-30 Thread Andreas Dilger

On Wednesday 30 January 2008, Al Boldi wrote:
 And, a quick test of successive 1sec delayed syncs shows no hangs until
 about 1 minute (~180mb) of db-writeout activity, when the sync abruptly
 hangs for minutes on end, and io-wait shows almost 100%.

How large is the journal in this filesystem?  You can check via
"debugfs -R 'stat <8>' /dev/XXX".  Is this affected by increasing
the journal size?  You can set the journal size via "mke2fs -J size=400"
at format time, or on an unmounted filesystem by running
"tune2fs -O ^has_journal /dev/XXX" then "tune2fs -J size=400 /dev/XXX".

I suspect that the stall is caused by the journal filling up, and then
waiting while the entire journal is checkpointed back to the filesystem
before the next transaction can start.

It is possible to improve this behaviour in JBD by reducing the amount
of space that is cleared if the journal becomes full, and also doing
journal checkpointing before it becomes full.  While that may reduce
performance a small amount, it would help avoid such huge latency problems.
I believe we have such a patch in one of the Lustre branches already,
and while I'm not sure what kernel it is for, the JBD code rarely changes
much.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] ext3: per-process soft-syncing data=ordered mode

2008-01-25 Thread Andreas Dilger

On Jan 24, 2008  23:36 +0300, Al Boldi wrote:
 data=ordered mode has proven reliable over the years, and it does this by 
 ordering filedata flushes before metadata flushes.  But this sometimes 
 causes contention in the order of a 10x slowdown for certain apps, either 
 due to the misuse of fsync or due to inherent behaviour like db's, as well 
 as inherent starvation issues exposed by the data=ordered mode.
 
 data=writeback mode alleviates data=ordered mode slowdowns, but only works
 per-mount and is too dangerous to run as a default mode.
 
 This RFC proposes to introduce a tunable which allows to disable fsync and 
 changes ordered into writeback writeout on a per-process basis like this:
 
   echo 1 > /proc/`pidof process`/softsync

If fsync performance is an issue for you, run the filesystem in data=journal
mode, put the journal on a separate disk and make it big enough that you
don't block on it to flush the data to the filesystem (but not so big that
it is consuming all of your RAM).

That keeps your data guarantees without hurting performance.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] Parallelize IO for e2fsck

2008-01-25 Thread Andreas Dilger

On Jan 24, 2008  17:25 -0700, Zan Lynx wrote:
 Have y'all been following the /dev/mem_notify patches?
 http://article.gmane.org/gmane.linux.kernel/628653

Having the notification be via poll() is a very restrictive processing
model.  Having the notification be via a signal means that any kind of
process (and not just those that are event loop driven) can register
a callback at some arbitrary point in the code and be notified.  I
don't object to the poll() interface, but it would be good to have a
signal mechanism also.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [patch 12/26] mount options: fix ext4

2008-01-25 Thread Andreas Dilger

On Jan 24, 2008  20:33 +0100, Miklos Szeredi wrote:
 Add stripe= option to /proc/mounts for ext4 filesystems.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]

Acked-by: Andreas Dilger [EMAIL PROTECTED]

 Index: linux/fs/ext4/super.c
 ===
 --- linux.orig/fs/ext4/super.c  2008-01-23 12:57:07.0 +0100
 +++ linux/fs/ext4/super.c       2008-01-23 21:43:51.0 +0100
 @@ -742,7 +742,8 @@ static int ext4_show_options(struct seq_
                 seq_puts(seq, ",nomballoc");
         if (!test_opt(sb, DELALLOC))
                 seq_puts(seq, ",nodelalloc");
 -
 +       if (sbi->s_stripe)
 +               seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
  
         /*
          * journal mode get enabled in different ways

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] Parallelize IO for e2fsck

2008-01-24 Thread Andreas Dilger

On Jan 24, 2008  18:32 +0100, Bodo Eggert wrote:
 I think a single, system-wide signal is the second-to-worst solution: All
 applications (or the wrong one, if you select one) would free their caches
 and start to crawl, and either stay in this state or slowly increase their
 caches again until they get signaled again. And the signal would either
 come too early or too late. The userspace daemon could collect the weighted
 demand of memory from all applications and tell them how much to use.

Well, sending a few signals (maybe to the top 5 processes in the OOM killer
list) is still a LOT better than OOM-killing them without warning...  That
way important system processes could be taught to understand SIGDANGER and
maybe do something about it instead of being killed, and if Firefox and
other memory hungry processes flush some of their cache it is not fatal.

I wouldn't think that SIGDANGER means "free all of your cache", since the
memory usage clearly wasn't a problem a few seconds previously, so as
an application writer I'd code it as "flush the oldest 10% of my cache"
or similar, and the kernel could send SIGDANGER again (or kill the real
offender) if the memory usage again becomes an issue.
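
A rough sketch of how an application might react as described above.
Linux has no SIGDANGER, so SIGUSR1 stands in for it here; the toy cache
and all function names are assumptions made for illustration only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for SIGDANGER, which Linux does not provide. */
#define SIG_LOWMEM SIGUSR1

static volatile sig_atomic_t lowmem_pending;

/* Handler only sets a flag; the real work happens outside signal context. */
static void on_lowmem(int sig)
{
    (void)sig;
    lowmem_pending = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int lowmem_install(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_lowmem;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIG_LOWMEM, &sa, NULL);
}

/* Toy cache that only tracks its entry count; a real one evicts by age. */
struct toy_cache {
    size_t count;
};

/*
 * If a low-memory signal arrived, drop about 10% of the entries (at
 * least one when the cache is non-empty).  Returns the number dropped.
 */
size_t cache_handle_lowmem(struct toy_cache *c)
{
    size_t drop;

    assert(c != NULL);
    if (!lowmem_pending)
        return 0;
    lowmem_pending = 0;
    drop = c->count / 10;
    if (drop == 0 && c->count > 0)
        drop = 1;
    c->count -= drop;
    return drop;
}
```

An application would call cache_handle_lowmem() from its normal
processing path; if memory stays tight, a repeated signal shrinks the
cache by another step.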

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger

On Jan 16, 2008  13:30 -0800, Valerie Henson wrote:
 I have a partial solution that sort of blindly manages the buffer
 cache.  First, the user passes e2fsck a parameter saying how much
 memory is available as buffer cache.  The readahead thread reads
 things in and immediately throws them away so they are only in buffer
 cache (no double-caching).  Then readahead and e2fsck work together so
 that readahead only reads in new blocks when the main thread is done
 with earlier blocks.  The already-used blocks get kicked out of buffer
 cache to make room for the new ones.

 What would be nice is to take into account the current total memory
 usage of the whole fsck process and factor that in.  I don't think it
 would be hard to add to the existing cache management framework.
 Thoughts?

I discussed this with Ted at one point also.  This is a generic problem,
not just for readahead, because fsck can run multiple e2fsck in parallel
and in case of many large filesystems on a single node this can cause
memory usage problems also.

What I was proposing is that fsck.{fstype} be modified to return an
estimated minimum amount of memory needed, and some desired amount of
memory (i.e. readahead) to fsck the filesystem, using some parameter like
"fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
return the output in the expected format, or returns an error then fsck
will assume some amount of memory based on the device size and continue
as it does today.

If the fsck.{fstype} does understand this parameter, then fsck makes a
decision based on devices, parallelism, total RAM (less some amount to
avoid thrashing), then it can call the individual fsck commands with
"--maximum-memory MMM /dev/XXX" so each knows how much cache it can
allocate.  This parameter can also be specified by the user if running
e2fsck directly.
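
One possible way the fsck wrapper could split memory between parallel
checkers is sketched below.  The proposal above does not specify a
policy, so this one is only an assumption: every checker gets its
minimum, and whatever is left is shared in proportion to how much more
each one asked for.  All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct fsck_mem_req {
    unsigned long min_kb;      /* reported minimum amount needed */
    unsigned long desired_kb;  /* reported desired amount (e.g. readahead) */
    unsigned long granted_kb;  /* output: value to pass as maximum memory */
};

/* Returns 0 on success, -1 if even the minimums do not fit in avail_kb. */
int fsck_split_memory(struct fsck_mem_req *req, size_t n,
                      unsigned long avail_kb)
{
    unsigned long total_min = 0, total_extra = 0, spare;
    size_t i;

    assert(req != NULL || n == 0);
    for (i = 0; i < n; i++) {
        assert(req[i].desired_kb >= req[i].min_kb);
        total_min += req[i].min_kb;
        total_extra += req[i].desired_kb - req[i].min_kb;
    }
    if (total_min > avail_kb)
        return -1;

    spare = avail_kb - total_min;
    for (i = 0; i < n; i++) {
        unsigned long extra = req[i].desired_kb - req[i].min_kb;
        unsigned long share = extra;

        /* Scale down proportionally when not everything fits. */
        if (total_extra > spare)
            share = (unsigned long)((unsigned long long)extra * spare /
                                    total_extra);
        req[i].granted_kb = req[i].min_kb + share;
    }
    return 0;
}
```

When the minimums alone exceed the available memory, the wrapper would
presumably have to reduce parallelism instead; that case is only
reported here, not handled.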

I haven't looked through your patch yet, but I hope to get to it soon.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger

On Jan 21, 2008  23:17 -0500, [EMAIL PROTECTED] wrote:
 On Tue, 22 Jan 2008 14:38:30 +1100, David Chinner said:
  Perhaps instead of swapping immediately, a SIGLOWMEM could be sent
  to processes that aren't masking the signal followed by a short
  grace period to allow the processes to free up some memory before
  swapping out pages from that process?
 
 AIX had SIGDANGER some 15 years ago.  Admittedly, that was sent when
 the system was about to hit OOM, not when it was about to start swapping.

I'd tried to advocate SIGDANGER some years ago as well, but none of
the kernel maintainers were interested.  It definitely makes sense
to have some sort of mechanism like this.  At the time I first brought
it up it was in conjunction with Netscape using too much cache on some
system, but it would be just as useful for all kinds of other memory-
hungry applications.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] Parallelize IO for e2fsck

2008-01-21 Thread Andreas Dilger

On Jan 22, 2008  14:38 +1100, David Chinner wrote:
 On Mon, Jan 21, 2008 at 04:00:41PM -0700, Andreas Dilger wrote:
  I discussed this with Ted at one point also.  This is a generic problem,
  not just for readahead, because fsck can run multiple e2fsck in parallel
  and in case of many large filesystems on a single node this can cause
  memory usage problems also.
  
  What I was proposing is that fsck.{fstype} be modified to return an
  estimated minimum amount of memory needed, and some desired amount of
  memory (i.e. readahead) to fsck the filesystem, using some parameter like
  "fsck.{fstype} --report-memory-needed /dev/XXX".  If this does not
  return the output in the expected format, or returns an error then fsck
  will assume some amount of memory based on the device size and continue
  as it does today.
 
 And while fsck is running, some other program runs that uses
 memory and blows your carefully calculated parameters to smithereens?

Well, fsck has a rather restricted working environment, because it is
run before most other processes start (i.e. single-user mode).  For fsck
initiated by an admin in other runlevels the admin would need to specify
the upper limit of memory usage.  My proposal was only for the single-user
fsck at boot time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)

2008-01-16 Thread Andreas Dilger

On Jan 15, 2008  22:05 -0500, Rik van Riel wrote:
 With a filesystem that is compartmentalized and checksums metadata,
 I believe that an online fsck is absolutely worth having.
 
 Instead of the filesystem resorting to mounting the whole volume
 read-only on certain errors, part of the filesystem can be offlined
 while an fsck runs.  This could even be done automatically in many
 situations.

In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an error flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group.  I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFD] Incremental fsck

2008-01-09 Thread Andreas Dilger

Andi Kleen wrote:
 Theodore Tso [EMAIL PROTECTED] writes:
  Now, there are good reasons for doing periodic checks every N mounts
  and after M months.  And it has to do with PC class hardware.  (Ted's
  aphorism: PC class hardware is cr*p).

 If these reasons are good ones (some skepticism here) then the correct
 way to really handle this would be to do regular background scrubbing
 during runtime; ideally with metadata checksums so that you can actually
 detect all corruption.

 But since fsck is so slow and disks are so big this whole thing
 is a ticking time bomb now. e.g. it is not uncommon to require tens
 of minutes or even hours of fsck time and some server that reboots
 only every few months will eat that when it happens to reboot.
 This means you get a quite long downtime.

 Has there been some thought about an incremental fsck?

While an _incremental_ fsck isn't so easy for existing filesystem types,
what is pretty easy to automate is making a read-only snapshot of a
filesystem via LVM/DM and then running e2fsck against that.  The kernel
and filesystem have hooks to flush the changes from cache and make the
on-disk state consistent.

You can then set the ext[234] superblock mount count and last check
time via tune2fs if all is well, or schedule an outage if there are
inconsistencies found.

There is a copy of this script at:
http://osdir.com/ml/linux.lvm.devel/2003-04/msg1.html

Note that it might need some tweaks to run with DM/LVM2 commands/output,
but is mostly what is needed.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger

On Nov 28, 2007  14:56 -0800, Nicholas Miell wrote:
 I stand by my belief that SEEK_HOLE/SEEK_DATA is a lousy interface.
 
 It abuses the seek operation to become a query operation, it requires a
 total number of system calls proportional to the number of holes+data and
 it isn't general enough for other similar uses (e.g. total number of
 contiguous extents, compressed extents, offline extents, extents
 currently shared with other inodes, extents embedded in the inode
 (tails), etc.)
 
 Something like the following would be much better:
 
 int getfilextents(int fd, off_t offset, int type, size_t *length, struct
 extent *extents)
 
 with
 
 int fd: open file
 
 offset: offset in file to start reporting extents
 
 type: one of EXTENT_TYPE_HOLE, EXTENT_TYPE_DATA, EXTENT_TYPE_EXTENTS,
 EXTENT_TYPE_COMPRESSED, EXTENT_TYPE_UNCOMPRESSED etc.

This is what FIEMAP is supposed to do.  We wrote a spec and implemented
a prototype for ext4, but haven't had time to make it generic to move
the large part of the code into the VFS.  If someone wanted to take that
up, it would be much appreciated.

See [RFC] add FIEMAP ioctl to efficiently map file allocation in
linux-fsdevel for details on this interface.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC][PATCH] Implement SEEK_HOLE/SEEK_DATA

2007-11-28 Thread Andreas Dilger

On Nov 29, 2007  00:48 +0100, Jörn Engel wrote:
 I didn't follow the discussion much, since it didn't appear to suit
 logfs too well.  In a nutshell, logfs is purely block-based, prepends
 every block with a header, may compress blocks and packs them as tightly
 as possible (byte alignment).

The FIEMAP interface is in some ways well suited to your needs, because it
uses byte offsets instead of FIBMAP, which uses block offsets.  I now see
that one hole in the specification is that your physical extent is not the
same length as the logical extent that the data represents.

I'm not sure that is in itself a reason not to use FIEMAP.  There is already
a provision for logical extents that do not map directly to the physical
layout (i.e. FIEMAP_EXTENT_NO_DIRECT flag on the extent).  In the case of
logfs, the extent fe_length would represent the logical length of the data
that the extent covers, fe_offset is the start of the physical extent, and
FIEMAP_EXTENT_NO_DIRECT indicates that this extent cannot be directly mapped.
This is useful for applications like LILO that would otherwise assume the
physical offset can be used to access the data from the block device.

It would still provide very useful information about the layout of files on
disk for filefrag, and if tar or cp used FIEMAP they could know to read the
data from the start up to the end of an extent to avoid waiting for a seek
in the middle of the IO, and of course skipping holes during copy.  A very
smart tar might even FIEMAP a whole bunch of files and then read the extents
in physical block offset order to reduce seeking.
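
The "very smart tar" idea could look roughly like this: gather the
extents of many files, then sort the pending reads by physical offset
before issuing them.  The read_job struct and function names are
hypothetical; obtaining the extents through FIEMAP is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct read_job {
    int file_id;        /* which file the extent belongs to */
    uint64_t physical;  /* physical start of the extent, in bytes */
    uint64_t length;    /* extent length, in bytes */
};

/* qsort comparator: ascending physical offset, safe against overflow. */
static int by_physical(const void *a, const void *b)
{
    const struct read_job *x = a;
    const struct read_job *y = b;

    return (x->physical > y->physical) - (x->physical < y->physical);
}

/* Reorder the jobs so that reads sweep across the device in one pass. */
void order_reads(struct read_job *jobs, size_t n)
{
    assert(jobs != NULL);
    qsort(jobs, n, sizeof(*jobs), by_physical);
}
```

Only the relative order of the offsets matters here, which matches the
point that most applications do not care about the actual physical
location of an extent.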

I don't think most applications will care about the actual physical location
of an extent, so much as the relative locations of extents within a file and
between files.

 Maybe the MAP part fooled me to believe FIEMAP would also expose
 physical location of extents on the medium.  But reading the proposal
 again, I am unsure about that part.  If physical locations are exposed,
 SEEK_HOLE/SEEK_DATA is significantly more elegant for logfs.  If not,
 FIEMAP could be useful.

SEEK_HOLE/SEEK_DATA only provides a fraction of the useful information
that FIEMAP does.  It won't give users or developers any information about
the on disk layout (which is quite important for knowing if allocation
algorithms are good).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: Beagle and logging inotify events

2007-11-14 Thread Andreas Dilger

On Nov 14, 2007  11:32 -0500, Chuck Lever wrote:
 I disagree: we don't need a bullet-proof log.  We can get a significant 
 performance improvement even with a permanent dnotify log implemented in 
 user-space.  We already have well-defined fallback behavior if such a log 
 is missing or incomplete.

 The problem with a permanent inotify log is that it can become unmanageably 
 enormous, and a performance problem to boot.  Recording at that level of 
 detail makes it more likely that the logger won't be able to keep up with 
 file system activity.

 A lightweight solution gets us most of the way there, is simple to 
 implement, and doesn't introduce many new issues.  As long as it can tell 
 us precisely where the holes are, it shouldn't be a problem.

Jan Kara is working on a patch for ext4 which would store a recursive
timestamp for each directory that gives the latest time that a file in
that directory was modified.  ZFS has a similar mechanism by virtue of
doing full-tree updates during COW of all the metadata blocks and storing
the most recent transaction number in each block.  I suspect btrfs could
do the same thing easily.

That would allow recursive-descent filesystem traversal to be much more
efficient because whole chunks of the filesystem tree can be ignored during
scans.
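
As a toy illustration of why a recursive timestamp helps, the sketch
below walks an in-memory tree and prunes every directory whose
recursive timestamp is not newer than the last scan.  The structure and
names are invented for this example and say nothing about how the ext4
patch stores the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_dir {
    uint64_t rtime;             /* newest change anywhere below this dir */
    size_t nchildren;
    struct toy_dir **children;
};

/* Count the directories that must be visited to find changes after since. */
size_t scan_changed(const struct toy_dir *d, uint64_t since)
{
    size_t visited = 1;
    size_t i;

    assert(d != NULL);
    if (d->rtime <= since)
        return 0;               /* nothing below changed: skip the subtree */
    for (i = 0; i < d->nchildren; i++)
        visited += scan_changed(d->children[i], since);
    return visited;
}
```

An indexer such as Beagle would record the time of its previous scan
and pass it as since, so that unchanged subtrees cost one comparison
instead of a full traversal.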

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [patch 0/6][RFC] Cleanup FIBMAP

2007-10-29 Thread Andreas Dilger

On Oct 29, 2007  12:16 -0700, Mike Waychison wrote:
 Chris Mason wrote:
  Reiserfs and Btrfs also use 0 to mean packed.  It would be nice if there
  was a way to indicate your-data-is-here-but-isn't-alone.  But that's
  more of a feature for the FIEMAP stuff.

 I hadn't heard of FIEMAP, so I went back and read the thread from 
 April/May.  It seems that this is a much better approach than introducing a 
 FIBMAP64.

 What ever happened with this proposal?

We made a patch for ext4 that we need to update and push upstream.  I've
just resent the spec we used in a separate email (attached to old thread)
for reference.

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger

By request on #linuxfs, here is the FIEMAP spec that we used to implement
the FIEMAP support for ext4.  There was an ext4 patch posted on August 29
to linux-ext4 entitled [PATCH] FIEMAP ioctl.   I've asked Kalpak to post
an updated version of that patch along with the changes to the filefrag
tool to use FIEMAP.

==================== FIEMAP_1.0.txt ====================

File Mapping Interface

18 June 2007

Andreas Dilger, Kalpak Shah

Introduction

This document covers the user interface and internal implementation of
an efficient fragmentation reporting tool. This will include addition
of a FIEMAP ioctl to fetch extents and changes to filefrag to use this
ioctl. The main objective of this tool is to efficiently and easily allow
inspection of the disk layout of one or more files without requiring
user access to the underlying storage device(s).

1 Requirements

The tool should be efficient in its use of resources, even for large
files. The FIBMAP ioctl is not suitable for use on large files,
as this can result in millions or even billions of ioctls to get the
mapping information for a single file. It should be possible to get the
information about an arbitrary-sized extent in a single call, and the
kernel component and user tool should efficiently use this information.

The user interface should be simple, and the output should be easily
understood - by default the filename(s), a count of extents (for each
file), and the optimal number of extents for a file with the given
striping parameters. The user interface will be filefrag [options]
{filename ...} and will allow retrieving the fragmentation information
for one or more files specified on the command-line. The output will be
of the form:

/path/to/file1: extents=2 optimal=1

/path/to/file2: extents=10 optimal=4

...

2 Functional specification

The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
block device ioctl used for mapping an individual logical block
address in a file to a physical block address in the block device. The
FIEMAP ioctl will return the logical to physical mapping for the extent
that contains the specified logical byte address.

struct fiemap_extent {
        __u64 fe_offset; /* offset in bytes for the start of the extent */
        __u64 fe_length; /* length in bytes for the extent */
        __u32 fe_flags;  /* returned FIEMAP_EXTENT_* flags for the extent */
        __u32 fe_lun;    /* logical device number for extent (starting at 0) */
};



struct fiemap {
        __u64 fm_start;         /* logical byte offset (in/out) */
        __u64 fm_length;        /* logical length of map (in/out) */
        __u32 fm_flags;         /* FIEMAP_FLAG_* flags (in/out) */
        __u32 fm_extent_count;  /* extents in fm_extents (in/out) */
        __u64 fm_unused;

        struct fiemap_extent fm_extents[0];
};



In the ioctl request, the fiemap struct is initialized with the desired
mapping information.

fiemap.fm_start = {desired start byte offset, 0 if whole file};
fiemap.fm_length = {length of mapping in bytes, ~0ULL if whole file};
fiemap.fm_extent_count = {number of fiemap_extents in fm_extents array};
fiemap.fm_flags = {flags from FIEMAP_FLAG_* array, if needed};

ioctl(fd, FIEMAP, &fiemap);
{ verify fiemap flags are understood }

for (i = 0; i < fiemap.fm_extent_count; i++) {
        { process extent fiemap.fm_extents[i] };
}


The logic for filefrag would be similar to the above. The size of the
extent array will be extrapolated from the filesize and multiple ioctls
of increasing extent count may be called for very large files. filefrag
can easily call the FIEMAP ioctls repeatedly using the end of the last
extent as the start offset for the next ioctl:

fm_start = fm_extents[fm_extent_count - 1].fe_offset +
fm_extents[fm_extent_count - 1].fe_length + 1;

We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.

The FIEMAP_FLAG_* values are specified below. If FIEMAP_FLAG_NO_EXTENTS is
given then the fm_extents array is not filled, and only fm_extent_count is
returned with the total number of extents in the file. Any new flags that
introduce and/or require an incompatible behaviour in an application or
in the kernel need to be in the range specified by FIEMAP_FLAG_INCOMPAT
(e.g. FIEMAP_FLAG_SYNC and FIEMAP_FLAG_NO_EXTENTS would fall into that
range if they were not part of the original specification). This is
currently only for future use. If it turns out that FIEMAP_FLAG_INCOMPAT
is not large enough then it is possible to use the last INCOMPAT flag
0x0100 to indicate that more of the flag range contains incompatible
flags.

#define FIEMAP_FLAG_SYNC        0x0001 /* sync file data before map */
#define FIEMAP_FLAG_HSM_READ    0x0002 /* get data from HSM before map */
#define FIEMAP_FLAG_NUM_EXTENTS 0x0004 /* return only number

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger

On Oct 29, 2007  16:13 -0600, Andreas Dilger wrote:
 On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
  I'm a little bit confused by fe_offset. Is it a physical offset, or a
  logical offset? The reason I ask is that your description above says FIEMAP
  ioctl will return the logical to physical mapping for the extent that
  contains the specified logical byte address. Which seems to imply physical,
  but your math to get to the next logical start in a very fragmented file,
  implies that fe_offset is a logical offset:
  
 fm_start = fm_extents[fm_extent_count - 1].fe_offset +
   fm_extents[fm_extent_count - 1].fe_length + 1; 
 
 Note the distinction between fe_offset (which is a physical offset for
 a single extent) and fm_offset (which is a logical offset for that file).

Actually, that is completely bunk.  What it should say is something like:
filefrag can easily call the FIEMAP ioctls repeatedly using the returned
fm_start and fm_length as the start offset for the next ioctl:

fiemap.fm_start = fiemap.fm_start + fiemap.fm_length + 1;
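
A minimal sketch of that loop follows, using the structures from the RFC with a fixed-size extent array. Everything prefixed demo_ or DEMO_ is hypothetical: demo_fiemap() is a stand-in for ioctl(fd, FIEMAP, &fm) that serves a made-up five-extent file, and it is assumed to rewrite fm_start, fm_length and fm_extent_count to describe what it returned. The sketch also assumes fm_length is a byte count and therefore advances to fm_start + fm_length; the "+ 1" written above would fit only if that sum named the last mapped byte.

```c
#include <assert.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x0020       /* value as quoted in this thread */
#define DEMO_MAX_EXTENTS   2            /* extents fetched per call */
#define DEMO_NEXTENTS      5            /* extents in the made-up file */
#define DEMO_EXT_LEN       4096ULL      /* every demo extent is 4 KiB */

struct fiemap_extent {
    uint64_t fe_offset;                 /* physical byte offset */
    uint64_t fe_length;
    uint32_t fe_flags;
    uint32_t fe_lun;
};

struct fiemap {
    uint64_t fm_start;                  /* logical byte offset (in/out) */
    uint64_t fm_length;                 /* logical length (in/out) */
    uint32_t fm_flags;
    uint32_t fm_extent_count;           /* array size in, extents filled out */
    uint64_t fm_unused;
    struct fiemap_extent fm_extents[DEMO_MAX_EXTENTS];
};

/* Physical placement of the made-up file: deliberately out of order. */
static const uint64_t demo_phys[DEMO_NEXTENTS] = {
    81920, 8192, 40960, 4096, 122880
};

/* Hypothetical stand-in for the ioctl: fills extents from fm_start on. */
int demo_fiemap(struct fiemap *fm)
{
    uint64_t first = fm->fm_start / DEMO_EXT_LEN;
    uint32_t want = fm->fm_extent_count, n = 0;

    assert(want <= DEMO_MAX_EXTENTS);
    while (n < want && first + n < DEMO_NEXTENTS) {
        struct fiemap_extent *fe = &fm->fm_extents[n];

        fe->fe_offset = demo_phys[first + n];
        fe->fe_length = DEMO_EXT_LEN;
        fe->fe_flags = (first + n == DEMO_NEXTENTS - 1) ?
                       FIEMAP_EXTENT_LAST : 0;
        fe->fe_lun = 0;
        n++;
    }
    fm->fm_start = first * DEMO_EXT_LEN;    /* logical range actually mapped */
    fm->fm_length = n * DEMO_EXT_LEN;
    fm->fm_extent_count = n;
    return 0;
}

/* Count all extents by calling repeatedly until FIEMAP_EXTENT_LAST. */
uint32_t demo_count_extents(void)
{
    struct fiemap fm = { 0 };
    uint32_t total = 0;
    int done = 0;

    while (!done) {
        /* Re-initialise the in/out fields before every call. */
        fm.fm_length = ~0ULL;
        fm.fm_flags = 0;
        fm.fm_extent_count = DEMO_MAX_EXTENTS;
        if (demo_fiemap(&fm) != 0 || fm.fm_extent_count == 0)
            break;
        total += fm.fm_extent_count;
        done = (fm.fm_extents[fm.fm_extent_count - 1].fe_flags &
                FIEMAP_EXTENT_LAST) != 0;
        fm.fm_start += fm.fm_length;        /* next unmapped logical byte */
    }
    return total;
}
```

With two extents per call the made-up file is walked in three calls. The caller never looks at fe_offset to decide where to continue, which is the point of the correction: the physical offsets are unordered, while fm_start and fm_length stay in logical file space.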

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger

On Oct 29, 2007  13:57 -0700, Mark Fasheh wrote:
   Thanks for posting this. I believe that an interface such as FIEMAP
 would be very useful to Ocfs2 as well. (I added ocfs2-devel to the e-mail)

I tried to make it as Lustre-agnostic as possible...

 On Mon, Oct 29, 2007 at 01:45:07PM -0600, Andreas Dilger wrote:
  The FIEMAP ioctl (FIle Extent MAP) is similar to the existing FIBMAP
  ioctl block device ioctl used for mapping an individual logical block
  address in a file to a physical block address in the block device. The
  FIEMAP ioctl will return the logical to physical mapping for the extent
  that contains the specified logical byte address.
  
  struct fiemap_extent {
  __u64 fe_offset; /* offset in bytes for the start of the extent */
 
 I'm a little bit confused by fe_offset. Is it a physical offset, or a
 logical offset? The reason I ask is that your description above says FIEMAP
 ioctl will return the logical to physical mapping for the extent that
 contains the specified logical byte address. Which seems to imply physical,
 but your math to get to the next logical start in a very fragmented file,
 implies that fe_offset is a logical offset:
 
fm_start = fm_extents[fm_extent_count - 1].fe_offset +
  fm_extents[fm_extent_count - 1].fe_length + 1; 

Note the distinction between fe_offset (which is a physical offset for
a single extent) and fm_offset (which is a logical offset for that file).

  We do this until we find an extent with FIEMAP_EXTENT_LAST flag set. We
  will also need to re-initialise the fiemap flags, fm_extent_count, fm_end.
 
 I think you meant 'fm_length' instead of 'fm_end' there.

You're right, thanks.

  #define FIEMAP_EXTENT_LAST  0x0020 /* last extent in the file */
  #define FIEMAP_EXTENT_EOF   0x0100 /* fm_start + fm_len beyond EOF*/
 
 Is EOF here considering beyond i_size or beyond allocation?

_EOF == beyond i_size.
_LAST == last extent in the file.

In most cases FIEMAP_EXTENT_EOF will be set at the same time as
FIEMAP_EXTENT_LAST, but in case of e.g. prealloc beyond i_size the 
EOF flag may be set on one or more earlier extents.
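
As a rough illustration of the two flags, the sketch below scans the fe_flags values of a file's extents in order and counts those flagged EOF before the extent flagged LAST, which is the prealloc-beyond-i_size case. The flag values are the ones quoted above; the function name and the idea of passing the flags as a plain array are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIEMAP_EXTENT_LAST 0x0020   /* last extent in the file */
#define FIEMAP_EXTENT_EOF  0x0100   /* extent lies beyond i_size */

/*
 * Count extents that are beyond i_size but are not the last extent.
 * count_eof_before_last() is a hypothetical helper name.
 */
size_t count_eof_before_last(const uint32_t *fe_flags, size_t n)
{
    size_t i, count = 0;

    assert(fe_flags != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (fe_flags[i] & FIEMAP_EXTENT_LAST)
            break;                  /* nothing follows the last extent */
        if (fe_flags[i] & FIEMAP_EXTENT_EOF)
            count++;
    }
    return count;
}
```

A file without preallocation yields 0 here, since EOF and LAST then appear together on the final extent.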

  FIEMAP_EXTENT_NO_DIRECT means data cannot be directly accessed (maybe
  encrypted, compressed, etc.)
 
 Would it be valid to use FIEMAP_EXTENT_NO_DIRECT for marking in-inode data?
 Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
 blocks.

Hmm, but part of the issue would be how to request the extra data, and
what offset it would be given?  One could, for example, use negative
offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
or similar, I hadn't given that much thought.  The other issue is that
I'd like to get the basics of the API in place before it gets too complex.
We can always add functionality with more FIEMAP_FLAG_* (whether in the
INCOMPAT range or not, depending on what is being done).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-10-29 Thread Andreas Dilger

On Oct 29, 2007  17:11 -0700, Mark Fasheh wrote:
 On Mon, Oct 29, 2007 at 04:13:02PM -0600, Andreas Dilger wrote:
   Btrfs, Ocfs2, and Gfs2 pack small amounts of user data directly in inode
   blocks.
  
  Hmm, but part of the issue would be how to request the extra data, and
  what offset it would be given?  One could, for example, use negative
  offsets to represent metadata or something, or add a FIEMAP_EXTENT_META
  or similar, I hadn't given that much thought.
 
 Well, fe_offset and fe_length are already expressed in bytes, so we could
 just put the byte offset to where the inline data starts in there. fe_length
 is just used as the length allocated for inline-data.
 
 If fe_offset is required to be block aligned, then we could add a field to
 express an offset within the block where data would be found - say
 'fe_data_start_offset'. In the non-inline case, we could guarantee that
 fe_data_start_offset is zero. That way software which doesn't want to care
 whether something is inline-data (for example, a backup program) or not
 could just blindly add it to fe_offset before looking at the data.

Oh, I was confused as to what you are asking.  Mapping in-inode data is
just fine using the existing interface.  The byte offset of the data is
given, and the FIEMAP_EXTENT_NO_DIRECT flag is set to indicate that it
isn't necessarily safe to do IO directly to that byte offset in the file
(e.g. tail packed, compressed data, etc).

I was thinking you were asking how to map metadata (e.g. indirect blocks).

Cheers, Andreas
--
Andreas Dilger
Sr. Software Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Re: batching support for transactions

2007-10-03 Thread Andreas Dilger

On Oct 03, 2007  06:42 -0400, Ric Wheeler wrote:
 With 2 threads writing to the same directory, we instantly drop down to 
 234 files/sec.
 
 Is this with HZ=250?
 
 Yes - I assume that with HZ=1000 the batching would start to work again 
 since the penalty for batching would only be 1ms which would add a 0.3ms 
 overhead while waiting for some other thread to join.

This is probably the easiest solution, but at the same time using HZ=1000
adds overhead to the server because of extra interrupts, etc.

 It would seem one of the problems is that we shouldn't really be
 scheduling for a fixed 1 jiffie timeout, but rather only until the
 other threads have a chance to run and join the existing transaction.
 
 This is really very similar to the domain of the IO schedulers - when do 
 you hold off an IO and/or try to combine it.

I was thinking the same.

 my guess would be that yield() doesn't block the first thread long enough
 for the second one to get into the transaction (e.g. on a 2-CPU system
 with 2 threads, yield() will likely do nothing).
 
 Andy tried playing with yield() and it did not do well. Note that this 
 server is a dual CPU box, so your intuition is most likely correct.

How many threads did you try?

 It makes sense to track not only the time to commit a single synchronous
 transaction, but also the time between sync transactions to decide if
 the initial transaction should be held to allow later ones.
 
 Yes, that is what I was trying to suggest with the rate. Even if we are 
 relatively slow, if the IO's are being synched at a low rate, we are 
 effectively adding a potentially nasty latency for each IO.
 
 That would give us two measurements to track per IO device - average 
 commit time and this average IO's/sec rate. That seems very doable.

Agreed.

 Alternately, it might be possible to check if a new thread is trying to
 start a sync handle when the previous one was also synchronous and had
 only a single handle in it, then automatically enable the delay in that 
 case.
 
 I am not sure that this avoids the problem with the current defaults at 
 250HZ where each wait is sufficient to do 3 fully independent 
 transactions ;-)

I was trying to think if there was some way to non-busy-wait that is
less than 1 jiffie.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: Upgrading datastructures between different filesystem versions

2007-09-26 Thread Andreas Dilger

On Sep 25, 2007  23:40 -0600, Jim Cromie wrote:
 kernel learner wrote:
 ext3 filesystem has 32-bit block address and ext4 filesystem has 
 48-bit block address. If a user installs ext4, how will the file 
 system handle already existing block with 32 bit values? 

 Why should it? That's what ext3 is for.

Bzzt. Wrong answer.  The ext4 code will be able to read existing ext3
(and ext2) filesystems just fine.  Otherwise there wouldn't be much
of an upgrade path.

 I'd expect ext4 drivers handling ext3 filesystems to be a distant, secondary 
 goal to getting a fast, reliable, clean 48bit filesystem working.

Far from the truth.  One of the main goals of ext4 is that it is a drop-in
replacement for ext3.  The code is mostly incremental improvements over
ext3, and that IS one of the reasons that it is reliable.  We didn't throw
away 10 years of bug fixes in the ext2/ext3 code when adding the ext4
features.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [patch 2/4] ext2: fix rec_len overflow for 64KB block size

2007-09-25 Thread Andreas Dilger

On Sep 25, 2007  16:30 -0700, Christoph Lameter wrote:
 [2/4]  ext2: fix rec_len overflow
  - prevent rec_len from overflow with 64KB blocksize
 
 Signed-off-by: Takashi Sato [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

Note that we just got a cleaner implementation of this code on the ext4
mailing list from Jan Kara yesterday.  Please use that one instead, from the
thread "Avoid rec_len overflow with 64KB block size".

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [patch 2/5] VFS: pass open file to ->getattr()

2007-09-21 Thread Andreas Dilger

On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
 @@ -1212,7 +1212,8 @@ struct inode_operations {
 - int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
 + int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *,
 + struct file *file);

It's not much of an inode operation anymore if you need to pass a file
to it...  Since the attributes are really part of the inode and not
the file, this seems like a bit of a hack.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [patch 3/5] VFS: pass open file to ->xattr()

2007-09-21 Thread Andreas Dilger

On Sep 21, 2007  16:59 +0200, Miklos Szeredi wrote:
 What I'm saying is that read and write are _no_more_ related to the
 file than fstat.  Read/write operate on inode data, fstat operates on
 inode metadata.

The read and write operations are DEFINITELY related to the file descriptor
because of f_pos.  Each process opening the same file can have a different
f_pos so read/write will work in different locations of the file.

In contrast getattr and getxattr operate on the single inode and you don't
get e.g. a different i_size or i_uid or i_gid depending on who opened a
file, nor is the xattr different.
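
The difference can be seen from userspace. The sketch below is an illustration only (the function name and the temporary path template are made up): it opens the same file twice, moves one descriptor's offset, and checks that the two reads land in different places while fstat() reports identical inode attributes through both descriptors.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if two independent opens of one file keep separate offsets
 * but share inode attributes, 0 if not, -1 on setup failure.
 * fpos_is_per_open() is a hypothetical name for this demonstration.
 */
int fpos_is_per_open(void)
{
    char path[] = "/tmp/fposXXXXXX";
    struct stat sa, sb;
    char a = 0, b = 0;
    int ok = -1;
    int fd1 = mkstemp(path);            /* first open, read/write */
    int fd2;

    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDONLY);         /* second, independent open */
    if (fd2 < 0)
        goto out;
    assert(fd1 != fd2);                 /* two descriptors, one inode */

    /* Only fd1's offset is moved; fd2 stays at offset 0. */
    if (write(fd1, "xyz", 3) != 3 || lseek(fd1, 2, SEEK_SET) != 2)
        goto out;
    if (read(fd1, &a, 1) != 1 || read(fd2, &b, 1) != 1)
        goto out;
    if (fstat(fd1, &sa) != 0 || fstat(fd2, &sb) != 0)
        goto out;

    ok = (a == 'z' && b == 'x' &&
          sa.st_ino == sb.st_ino && sa.st_size == sb.st_size &&
          sa.st_uid == sb.st_uid && sa.st_gid == sb.st_gid);
out:
    if (fd2 >= 0)
        close(fd2);
    close(fd1);
    unlink(path);
    return ok;
}
```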

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [patch 3/5] VFS: pass open file to ->xattr()

2007-09-21 Thread Andreas Dilger

On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
 @@ -1214,10 +1214,12 @@ struct inode_operations {
 + int (*setxattr) (struct dentry *, const char *,const void *,size_t,int,
 +  struct file *);
 + ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t,
 +  struct file *);
 + ssize_t (*listxattr) (struct dentry *, char *, size_t, struct file *);
 + int (*removexattr) (struct dentry *, const char *, struct file *);

Likewise - these are no longer inode operations if you need a file.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [patch 4/5] VFS: allow filesystems to implement atomic open+truncate

2007-09-21 Thread Andreas Dilger

On Sep 21, 2007  14:23 +0200, Miklos Szeredi wrote:
 Add a new attribute flag ATTR_OPEN, with the meaning: truncation was
 initiated by open() due to the O_TRUNC flag.
 
 This way filesystems wanting to implement truncation within their
 ->open() method can ignore such truncate requests.

This is actually something we've needed to do in Lustre for a while also.
We called it ATTR_FROM_OPEN, but I don't really mind ATTR_OPEN either -
the less patching we need to do the better.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Andreas Dilger

On Sep 19, 2007  12:15 -0700, Mingming Cao wrote:
 @@ -96,8 +96,7 @@ static int start_this_handle(journal_t *
  
  alloc_transaction:
   if (!journal->j_running_transaction) {
 - new_transaction = kmalloc(sizeof(*new_transaction),
 - GFP_NOFS|__GFP_NOFAIL);
 + new_transaction = kmalloc(sizeof(*new_transaction), GFP_NOFS);

This should probably be a __GFP_NOFAIL if we are trying to start a new
handle in truncate, as there is no way to propagate an error to the caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: Distributed storage. Move away from char device ioctls.

2007-09-15 Thread Andreas Dilger

On Sep 15, 2007  16:29 +0400, Evgeniy Polyakov wrote:
 Yes, block device itself is not able to scale well, but it is the place
 for redundancy, since filesystem will just fail if underlying device
 does not work correctly and FS actually does not know about where it
 should place redundancy bits - it might happen to be the same broken 
 disk, so I created a low-level device which distribute requests itself.

I actually think there is a place for this - and improvements are
definitely welcome.  Even Lustre needs block-device level redundancy
currently, though we will be working to make Lustre-level redundancy
available in the future (the problem is WAY harder than it seems at
first glance, if you allow writeback caches at the clients and servers).

 When Chris Mason announced btrfs, I found that quite a few new ideas 
 are already implemented there, so I postponed project (although
 direction of the development of the btrfs seems to move to the zfs side
 with some questionable imho points, so I think I can jump to the wagon
 of new filesystems right now). 

This is an area I'm always a bit sad about in OSS development - the need
everyone has to make a new {fs, editor, gui, etc} themselves instead of
spending more time improving the work we already have.  Imagine where the
internet would be (or not) if there were 50 different network protocols
instead of TCP/IP?  If you don't like some things about btrfs, maybe you
can fix them?

To be honest, developing a new filesystem that is actually widely useful
and used is a very time consuming task (see Reiserfs and Reiser4).  It
takes many years before the code is reliable enough for people to trust it,
so most likely any effort you put into this would be wasted unless you can
come up with something that is dramatically better than something existing.

The part that bothers me is that this same effort could have been used to
improve something that more people would use (btrfs in this case).  Of
course, sometimes the new code is substantially better than what currently
exists, and I think btrfs may have laid claim to the current generation of
filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: Distributed storage. Move away from char device ioctls.

2007-09-15 Thread Andreas Dilger

On Sep 15, 2007  12:20 -0400, Robin Humble wrote:
 On Sat, Sep 15, 2007 at 10:35:16AM -0400, Jeff Garzik wrote:
 Lustre is tilted far too much towards high-priced storage,
 
 many (most?) Lustre deployments are with SATA and md raid5 and GigE -
 can't get much cheaper than that.

I have to agree - while Lustre CAN scale up to huge servers and fat pipes,
it can definitely also scale down (which is a LOT easier to do :-).  I can
run a client + MDS + 5 OSTs in a single UML instance using loop devices
for testing w/o problems.

 interestingly, one of the ways to provide dual-attached storage behind
 a failover pair of lustre servers (apart from buying SAS) would be via
 a networked-raid-1 device like Evgeniy's, so I don't see distributed
 block devices and distributed filesystems as being mutually exclusive.

That is definitely true, and there are a number of users who run in
this mode.  We're also working to make Lustre handle the replication
internally (RAID5/6+ at the OST level) so you wouldn't need any kind of
block-level redundancy at all.  I suspect some sites may still use RAID5/6
back-ends anyways to avoid performance loss from taking out a whole OST
due to a single disk failure, but that would definitely not be required.

 and needs improvement before it could be considered for mainline.

It's definitely true, and we are always working at improving it.  In
the past, one of the reasons we DIDN'T want to go into mainline was
that this would restrict our ability to make network protocol changes.
Now that our install base is large, and many of the large sites have
multiple supercomputers mounting multiple global filesystems, we aren't
at liberty to change the network protocol at will anymore.  That said,
we also have network protocol versioning that is akin to the ext3
COMPAT/INCOMPAT feature flags, so we are able to add/change features
without breaking old clients.

 from what I understand (hopefully I am mistaken) they consider a merge
 task to be too daunting as the number of kernel subsystems that any
 scalable distributed filesystem touches is necessarily large.

That's partly true - Lustre has its own RDMA RPC mechanism, but it does
not need kernel patches anymore (we removed the zero-copy callback and
do this at the protocol level because there was too much resistance to it).
We are now also able to run a client filesystem that doesn't require any
kernel patches, since we've given up on trying to get the intents and
raw operations into the VFS, and have worked out other ways to improve
the performance to compensate.  Likewise with parallel directory operations.

It's a bit sad, in a way, because these are features that other filesystems
(especially network fs) could have benefitted from also.

 roadmaps indicate that parts of lustre are likely to move to userspace
 (partly to ease solaris and ZFS ports) so perhaps those performance
 critical parts that remain kernel space will be easier to merge.

This is also true - when that is done the only parts that will remain
in the kernel are the network drivers.  With some network stacks there
is even direct userspace acceleration.  We'll use RDMA and direct IO to
avoid doing any user-kernel data copies.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [RFC] System calls for online defrag

2007-09-04 Thread Andreas Dilger

On Sep 03, 2007  20:03 +0200, Jan Kara wrote:
   I've finally got to writing up a proposal for how system calls allowing
 online filesystem defragmentation, and generally moving file blocks around
 to improve performance, could look. Comments are welcome.
 
 int sys_movedata(int datafd, int spacefd, loff_t from, size_t len)
The call takes blocks used to carry data starting at offset @from of length
 @len in @spacefd and places them instead of corresponding blocks in @datafd.

Calling these @spacefd and @datafd is a bit confusing.  How about @srcfd
and @tgtfd instead?  For defragmentation, are you planning to have @datafd
be the real inode and @spacefd be the temporary inode with defragged data,
or the reverse?  It isn't really clear.

 Data is copied from @datafd to newly spliced data blocks. If @spacefd contains
 a hole in the specified interval, a hole is created also in @datafd in the
 corresponding place. A data block from @spacefd can also replace a hole in
 @datafd - zeros are copied to such data block. @from and @len should be
 multiples of filesystem block size (otherwise EINVAL is returned). Data blocks
 from @datafd in the interval are released, a hole is created in @spacefd.

This is mostly clear except the last sentence.  I would think that the data
blocks in @datafd are kept, getting a copy of the data, while those in
@spacefd are released?

   Another possibility would be to just replace data blocks without any copying
 of data (that would have to be done by the caller before calling
 sys_movedata()). The problem here is how to avoid data loss if someone writes
 to the file after userspace has copied the data and before sys_movedata() is
 called.

Isn't that true in any case?

 ssize_t sys_allocate(int fd, int mode, loff_t goal, ssize_t len)
   Allocate new space to file @fd at offset defined by file position.  Both file
 offset and @len should be a multiple of filesystem block size. The whole
 interval must not contain any allocated blocks. If the interval extends past
 EOF, the file size is changed accordingly.  @mode defines a way the filesystem
 will search for blocks. @mode is a bitwise OR of the following flags:
   ALLOC_FIXED_START - allocation must start at @goal; if not specified, @goal
 is just a hint where to start an allocation
   ALLOC_FIXED_LEN - allocate exactly space for @len; if not specified, up to
 @len bytes may be allocated.
   ALLOC_CONTINGUOUS - allocation must be one contiguous run of blocks

How is this much different than sys_fallocate()?

 int sys_get_free_blocks(const char *fs, loff_t start, loff_t end, int count,
   struct alloc_extent *space)

One alternate possibility is to call the proposed FIEMAP on the block device,
to return lists of free/used extents?  We have a version of that patch for
ext4 and integration into filefrag, so it would be nice to avoid making up
yet another API/tool if that one is sufficient.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [RFC] basic delayed allocation in VFS

2007-07-29 Thread Andreas Dilger

On Jul 28, 2007  20:51 +0100, Christoph Hellwig wrote:
 That doesn't mean I want to argue against Alex's code although I'd of
 course be more happy if we could actually share code between multiple
 filesystems.
 
 Of course the code in its current form should not go into mpage.c but
 rather into ext4 so that it doesn't bloat the kernel for everyone.

Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
was rejected because that functionality should go into the VFS.
Since the performance improvement of delalloc is quite large, we'd
like to get this into the kernel one way or another.  Can we make a
decision if the ext4-specific delalloc is acceptable?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5][TAKE8] manpage for fallocate

2007-07-19 Thread Andreas Dilger

On Jul 18, 2007  20:41 -0700, Mark Fasheh wrote:
 On Sat, Jul 14, 2007 at 12:16:25AM +0530, Amit K. Arora wrote:
  After a successful call, subsequent writes are guaranteed not to fail
  because of lack of disk space.
   
 If a write to an unwritten region requires a node split, that could result
 in the allocation of new meta data which obviously could fail if the disk is
 truly full.
 
 Granted that's unlikely to happen but maybe we should be conservative and
 say something like:
 
 After a successful call, subsequent writes are guaranteed to never require
 allocation of file data. ?
 --Mark

In the worst case, the unwritten extent could be zero-filled before the write
is done, so no extent split is needed.  We discussed this recently for the
ext4 fallocate, but didn't consider it important enough to hold the code.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-16 Thread Andreas Dilger

On Jul 16, 2007  16:52 -0700, Mingming Cao wrote:
 I am not sure why we need GFP_KERNEL flag here. I think we should use
 GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as
 fixing memory leak issue introduced by the ext4 expand inode extra isize
 patch.
 
 Fixing memory allocation issue with expand inode extra isize patch.
 
 - use GFP_NOFS instead of GFP_KERNEL flag for memory allocation
 - use kzalloc instead of kmalloc

This doesn't need kzalloc() for buffer and b_entry_name, since they are
immediately overwritten by memcpy().

 - fix memory leak in the success case, at the end of while loop.
   goto cleanup;
 @@ -1302,7 +1302,15 @@ retry:
   error = ext4_xattr_block_set(handle, inode, i, bs);
   if (error)
   goto cleanup;
 + kfree(b_entry_name);
 + kfree(buffer);
 + brelse(is->iloc.bh);
 + kfree(is);
 + kfree(bs);
 + brelse(bh);
   }
 + up_write(&EXT4_I(inode)->xattr_sem);
 +return 0;
  
  cleanup:
   kfree(b_entry_name);

I don't think you should have brelse(bh) inside the loop, since it is
allocated before the loop starts.
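
A userspace analogue of the pattern under review, as a sketch only: malloc()/free() stand in for the kernel allocators, the "shared" allocation stands in for a resource such as bh that is acquired once before the loop, and all names (process_entries, demo_alloc, demo_free, demo_live_allocs) are hypothetical. Per-iteration buffers are released at the end of each pass, the shared resource is released exactly once after the loop, and the buffer is not zeroed because memcpy() overwrites every byte.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int live_allocs;     /* outstanding allocations, demo bookkeeping */

static void *demo_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        live_allocs++;
    return p;
}

static void demo_free(void *p)
{
    if (p) {
        live_allocs--;
        free(p);
    }
}

int demo_live_allocs(void)
{
    return live_allocs;
}

/* Copy each entry through a scratch buffer; returns 0 or -1 on failure. */
int process_entries(const char *const *entries, size_t n, size_t *bytes_out)
{
    char *shared = demo_alloc(1);   /* acquired once, before the loop */
    char *buffer = NULL;
    size_t i, total = 0;
    int error = -1;

    assert(entries != NULL || n == 0);
    if (!shared)
        return -1;
    for (i = 0; i < n; i++) {
        size_t len = strlen(entries[i]) + 1;

        /* No zeroing needed: memcpy() below fills the whole buffer. */
        buffer = demo_alloc(len);
        if (!buffer)
            goto cleanup;
        memcpy(buffer, entries[i], len);
        total += len;
        /* Release per-iteration memory here or the next pass leaks it. */
        demo_free(buffer);
        buffer = NULL;
    }
    error = 0;
    if (bytes_out)
        *bytes_out = total;
cleanup:
    demo_free(buffer);              /* no-op unless an error path set it */
    demo_free(shared);              /* released once, outside the loop */
    return error;
}
```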

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andreas Dilger

On Jul 13, 2007  15:33 +0200, Peter Zijlstra wrote:
 On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
 Or can journal_stop() be done by a different task than the one that did
 journal_start()? - in which case nothing much can be done :-/

The call to journal_stop() has to be in the same process, since the
journal handle is also held in current->journal_info so the handle
does not need to be passed as an argument all over the VFS.

 This seems to boot... albeit I did not push it hard.

Can you please also make a patch for jbd2?


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-13 Thread Andreas Dilger

On Jul 13, 2007  02:05 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote:
 
   + brelse(bh);
   + up_write(&EXT4_I(inode)->xattr_sem);
   + return error;
   +}
   +
  
  We're doing GFP_KERNEL memory allocations while holding xattr_sem.  This
  can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
  i_truncate_sem and/or journal_start() (I forget whether this still
  happens).  Have we checked whether this can occur and if so, whether we are
  OK from a lock ranking POV?  Bear in mind that journalled-data mode is more
  complex in this regard.
 
 I notice that everyone carefully avoided addressing this ;)
 
 Oh well, hopefully people are testing with lockdep enabled.  As long
 as the fs is put under extreme memory pressure, most bugs should be reported.

I have no objection to changing these to GFP_NOFS or GFP_ATOMIC, because
the number of times this function is called is really quite small (only
for existing inodes when the size of the fixed fields in the inode is
increasing) and the buffers are freed immediately so this won't put any
undue strain on the atomic memory pools.

That said, there is also a GFP_KERNEL allocation in ext3_xattr_block_set()
under xattr_sem, so the same problem would exist there.

I also just noticed that buffer and b_entry_name are leaked in
ext4_expand_extra_isize() if the while loop is run more than one time
(again a relatively rare event).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-12 Thread Andreas Dilger

On Jul 12, 2007  13:56 +0530, Amit K. Arora wrote:
 As you suggest, let us just have two modes for the time being:
 
 #define FALLOC_ALLOCATE   0x1
 #define FALLOC_ALLOCATE_KEEP_SIZE 0x2
 
 As the name suggests, when FALLOC_ALLOCATE_KEEP_SIZE mode is passed it
 will result in file size not being changed even if the preallocation is
 beyond EOF.

What does FALLOC_ALLOCATE mean vs. not passing this flag?  I have no
objection to this as long as the code remains with these as flags
instead of modes...  Essentially just dropping the FALLOC_FL_DEALLOCATE
and FALLOC_FL_DEL_DATA from the interface.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  22:00 -0400, Mingming Cao wrote:
 On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
  On Sun, 01 Jul 2007 03:36:56 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
  
   This patch is a spinoff of the old nanosecond patches.
  
  I don't know what the old nanosecond patches are.  A link to a suitable
  changelog for those patches would do in a pinch.  Preferable would be to
  write a proper changelog for this patch.
  
 I found the original patch
 http://marc.info/?l=linux-ext4&m=115091699809181&w=2
 
 Andreas or Kalpak, is the changelog from the original patch accurate to
 apply here?

Mostly, yes, but the name of the feature flag has changed.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-11 Thread Andreas Dilger

 if requested by the sysadmin.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  23:34 -0400, Trond Myklebust wrote:
 On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
  So my vote is to increment i_version in common code every time any
  change is made to the file, and alloc_inode should initialise it to
  current time, which might be changed by the filesystem before it calls
  unlock_new_inode. 
  ... but doesn't lustre want to control its i_version... so maybe not :-(
 
 If lustre wants to be exportable via pNFS (as Peter Braam has suggested
 it should), then it had better be able to return a change attribute that
 is compatible with the NFSv4.1 spec...

The Lustre use of i_version is a superset of what NFSv4 needs - the Lustre
version can be used to compare the updates of two inodes.  It is set
to be the Lustre transaction number (which is sequentially incremented
on a per filesystem basis), so that we can use the per-inode version
to do replay of client operations even if they have been disconnected for
a long time, which is why we want to be able to control the actual value.
We don't want the version to be updated for e.g. file defragmentation
or other similar internal-only changes which need ext4_mark_inode_dirty().

We had a patch to disable ext4 inode versioning by a flag in the superblock,
but we dropped it at the last minute because it needed some updates and
we didn't want to wait on that for submitting these changes upstream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 2/5] i_version: Add hi 32 bit inode version on ext4 on-disk inode

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  16:30 -0700, Andrew Morton wrote:
  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
  ---
  Index: linux-2.6.21/include/linux/ext4_fs.h
  ===
  --- linux-2.6.21.orig/include/linux/ext4_fs.h
  +++ linux-2.6.21/include/linux/ext4_fs.h
  @@ -342,6 +342,7 @@ struct ext4_inode {
  __le32  i_atime_extra;  /* extra Access time  (nsec << 2 | epoch) */
  __le32  i_crtime;   /* File Creation time */
  __le32  i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */
  +   __le32  i_version_hi;   /* high 32 bits for 64-bit version */
   };
 
 Aren't there forward- and backward-compatibility issues here?  How does the
 filesystem driver work out whether this field is present and valid?

This uses the same EXT4_FITS_IN_INODE() check as any other, so the
compatibility issues are handled.  NFSv4 could live with 32-bit versions
with only a small danger of overflow, so we can still export ext3
filesystems with 128-byte inodes that have been updated to ext4.  For
Lustre (which requires 64-bit versions), we will enforce that space is
available with s_min_extra_isize and RO_COMPAT_EXTRA_ISIZE.

In the case where an older ext3/ext4 filesystem with large inodes does
not have enough space for i_version_hi, the EAs that follow i_extra_isize
will be shifted to make room for it (if possible, which is likely).  There
are no critical fields inside i_extra_isize, so the rare case of a
failure to enlarge i_extra_isize is not a cause for alarm.
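The "does this field fit" test that the compatibility argument relies on can be modelled in userspace roughly as follows. The struct is a toy layout (128 bytes of old fields followed by the extra fields), the SKETCH_ names are hypothetical, and byte order is ignored, so treat this as an illustration of the idea rather than the kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_GOOD_OLD_INODE_SIZE 128

/* Toy on-disk inode: not the real struct ext4_inode. */
struct sketch_raw_inode {
	uint8_t  old_fields[SKETCH_GOOD_OLD_INODE_SIZE];
	uint16_t i_extra_isize;	/* bytes of extra fields in use */
	uint16_t i_pad1;
	uint32_t i_ctime_extra;
	uint32_t i_mtime_extra;
	uint32_t i_atime_extra;
	uint32_t i_crtime;
	uint32_t i_crtime_extra;
	uint32_t i_version_hi;
};

/* True if the whole field lies inside the space this inode declares. */
#define SKETCH_FITS_IN_INODE(raw, field)				\
	(offsetof(struct sketch_raw_inode, field) +			\
	 sizeof((raw)->field) <=					\
	 (size_t)SKETCH_GOOD_OLD_INODE_SIZE + (raw)->i_extra_isize)

/*
 * version_lo is passed in separately because this toy layout does not
 * model where the low 32 bits live.
 */
uint64_t sketch_read_version(const struct sketch_raw_inode *raw,
			     uint32_t version_lo)
{
	uint64_t version = version_lo;

	assert(raw != NULL);
	/* Only trust the high word if the inode is large enough to hold it. */
	if (SKETCH_FITS_IN_INODE(raw, i_version_hi))
		version |= (uint64_t)raw->i_version_hi << 32;
	return version;
}
```

In this layout i_version_hi ends at byte 156, so an inode with i_extra_isize below 28 silently falls back to a 32-bit version instead of reading bytes that may belong to an extended attribute.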

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  09:47 +0100, Christoph Hellwig wrote:
 On Sun, Jul 01, 2007 at 03:37:45AM -0400, Mingming Cao wrote:
  This patch is on top of i_version_update_vfs.
  The i_version field of the inode is set on inode creation and incremented
  when the inode is being modified.
 
 Which is not what i_version is supposed to do.  It'll get you tons of misses
 for NFSv3 filehandles that rely on the generation staying the same for the
 same file.  Please add a new field for the NFSv4 sequence counter instead
 of making i_version unusable.

You are confusing i_generation (the instance of this inode number) with
i_version (whether this file has been modified)?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-11 Thread Andreas Dilger

++;
  +   new_extra_isize = s_min_extra_isize;
 
 Aren't we missing a brelse(bh) here?

Seems likely, yes.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 7][PATCH 1/1]Remove 32000 subdirs limit.

2007-07-11 Thread Andreas Dilger

On Jul 10, 2007  22:40 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:18 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
  A EXT4_FEATURE_RO_COMPAT_DIR_NLINK flag has been added and it is set if
  the subdir count for any directory crosses 65000.
 
 Would I be correct in assuming that a later fsck will clear
 EXT4_FEATURE_RO_COMPAT_DIR_NLINK if there are no longer any 65000 subdir
 directories?

Correct.

 If so, that is worth a mention in the changelog, perhaps?
 
 Please remind us what is the behaviour of an RO_COMPAT flag?  It means that
 old ext4, ext3 and ext2 can only mount this fs read-only, yes?

Also correct.  The COMPAT flag behaviour is described in detail in
Documentation/filesystems/ext[234].txt

  +static inline void ext4_inc_count(handle_t *handle, struct inode *inode)
  +{
  +   inc_nlink(inode);
  +   if (is_dx(inode) && inode->i_nlink > 1) {
  +   /* limit is 16-bit i_links_count */
  +   if (inode->i_nlink >= EXT4_LINK_MAX || inode->i_nlink == 2) {
  +   inode->i_nlink = 1;
  +   EXT4_SET_RO_COMPAT_FEATURE(inode->i_sb,
  + EXT4_FEATURE_RO_COMPAT_DIR_NLINK);
  +   }
  +   }
  +}
 
 Why do we set EXT4_FEATURE_RO_COMPAT_DIR_NLINK if i_nlink==2?

Because that means it was previously 1 (inc_nlink() was already called).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  17:16 +0530, Girish Shilamkar wrote:
   + if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
   + jbd2_journal_set_features(sbi->s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
   + jbd2_journal_set_features(sbi->s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
   + jbd2_journal_clear_features(sbi->s_journal, 0, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + } else {
   + jbd2_journal_clear_features(sbi->s_journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM, 0,
   + JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
   + }
  
  Some discussion of the forward- and backward- compatibility design would be
  appropriate.  Also a description of whether and how fsck can be used to fix
  up these feature flags.

It is forward and backward compatible to enable COMPAT_CHECKSUM.  That just
means the commit blocks will have checksums in them, but older kernels
will just ignore them.  Hmm, I suppose there might be an issue with upgrade,
downgrade, upgrade in that the commit blocks would not have checksums
even though the superblock says they will...

Does that mean we should accept a checksum == 0 as being valid (which
is not very nice, given that 0 is an oft-hit bad value), or that we need
a flag in every commit block which indicates if it actually has a checksum?
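The second option, a per-commit-block marker, can be sketched in userspace as follows. The CRC here is the common reflected CRC-32 (polynomial 0xEDB88320), used only as a stand-in and not claimed to match the kernel routine the patch calls; sketch_commit and its has_checksum field are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32; pass 0 as crc for the first buffer, then chain. */
uint32_t sketch_crc32(uint32_t crc, const void *buf, size_t len)
{
	const unsigned char *p = buf;

	crc = ~crc;
	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
	}
	return ~crc;
}

/* Toy commit block: an explicit marker says whether checksum is valid. */
struct sketch_commit {
	int has_checksum;
	uint32_t checksum;
};

/* Returns 1 if the transaction may be replayed, 0 if it must not. */
int sketch_commit_ok(const struct sketch_commit *c, uint32_t computed)
{
	assert(c != NULL);
	/* Written without checksum support: nothing to verify. */
	if (!c->has_checksum)
		return 1;
	return c->checksum == computed;
}
```

With an explicit marker, a stored checksum of 0 is treated like any other value: it is accepted only when the marker is set and the computed checksum is also 0.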

The INCOMPAT_ASYNC_COMMIT can't be handled safely by older kernels, since
they would assume commit block == complete transaction, which isn't
true if the commit block didn't wait for the rest of the blocks to make
it to the disk.

I don't think e2fsck can be used to individually clean up the feature flags,
but it is always possible to remove and recreate the journal...

   - /* AKPM: buglet - add `i' to tmp! */
  
  Damn.  After, what, seven years, someone actually fixed it?
  
 for (i = 0; i < bh->b_size; i += 512) {
   - journal_header_t *tmp = (journal_header_t*)bh->b_data;
   + struct commit_header *tmp =
   + (struct commit_header *)(bh->b_data + i);
 tmp->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
 tmp->h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
 tmp->h_sequence = cpu_to_be32(commit_transaction->t_tid);
   +
   + if (JBD2_HAS_COMPAT_FEATURE(journal,
   + JBD2_FEATURE_COMPAT_CHECKSUM)) {
   + tmp->h_chksum_type  = JBD2_CRC32_CHKSUM;
   + tmp->h_chksum_size  = JBD2_CRC32_CHKSUM_SIZE;
   + tmp->h_chksum[0]= cpu_to_be32(crc32_sum);
   + }
 }
  
  And in doing so, changed the on-disk format of the journal commit blocks.
  
  Surely this was worth a mention in the changelog, if not a standalone patch?
  
  I don't think this is worth doing, really.  Why not just leave the format
  as it was, take the loop out and run this code once rather than eight
  times?

Well, we aren't using the rest of the commit block in any case.  I think
the original intention was that we'd get 8 copies of the commit block so
we would be sure to get a good one.

I don't know whether we'd rather have 8 copies of the commit block, or
more potential to expand the commit block?  I don't personally have any
preference, since the checksum should be a more robust way of checking
validity than having multiple copies, so we may as well remove the loop
and stick with a single copy for now.

   @@ -328,6 +360,7 @@ static int do_one_pass(journal_t *journa
 unsigned int sequence;
 int blocktype;
 int tag_bytes = journal_tag_bytes(journal);
   + __u32   crc32_sum = ~0; /* Transactional Checksums */
  
  We normally use __u32 for visible-to-userspace stuff.  Kernel code would
  use plain old u32.
 Ok.

Since the checksum is saved to disk, it seems more appropriate to use __u32
or maybe even __be32, though I'm not sure if the crc32 functions do that
correctly or not.
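For readers unfamiliar with why an on-disk value wants an explicit byte order, here is a small portable sketch of storing and loading a 32-bit checksum in big-endian form. The helper names are hypothetical; in the kernel the equivalent conversions are cpu_to_be32() and be32_to_cpu().

```c
#include <assert.h>
#include <stdint.h>

/* Store v most-significant byte first, whatever the host byte order. */
void sketch_put_be32(uint8_t out[4], uint32_t v)
{
	assert(out != 0);
	out[0] = (uint8_t)(v >> 24);
	out[1] = (uint8_t)(v >> 16);
	out[2] = (uint8_t)(v >> 8);
	out[3] = (uint8_t)v;
}

/* Load a value written by sketch_put_be32(). */
uint32_t sketch_get_be32(const uint8_t in[4])
{
	assert(in != 0);
	return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
	       ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}
```

Keeping the in-memory running checksum as a plain integer and converting only at the point where it is written to or read from the block avoids mixing the two representations.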

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Andreas Dilger

On Jul 11, 2007  16:04 -0400, J. Bruce Fields wrote:
 A 32-bit i_version could in theory wrap pretty quickly, couldn't it?
 That's not a problem in itself--the problem would only arise if two
 subsequent client queries of the change attribute happened a multiple of
 2^32 i_version increments apart.
 
 This is more likely than the previous scenario, but still very unlikely.
 I would have guessed that even in situations with a very high rate of
 updates and a low rate of client revalidations, the chance of two
 revalidations happening exactly 2^32 updates apart would still be no
 more than 1 in 2^32.  (Could odd characteristics of the workloads (like
 updates that tend to happen in power-of-2 groups?) make it any more
 likely?)
 
 I'd be happier if ext4 at least allowed the possibility of 64 bits in
 the future.  And there's always the chance someone would find a use for
 an i_version that was nondecreasing, even if nfs didn't care.

This would indeed be the case for ext3 filesystems updated to ext4.
They will only be able to store the low 32 bits of the version, which
is itself normally enough for NFSv4 because it only uses the inequality
check.  Having the full 64 bits available eliminates the risk of
collisions, and given that the spec mandates a 64-bit version I'm sure
someone will take full advantage of it in NFS at some point.
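The collision case is small enough to show directly. In this sketch (hypothetical function names) a client compares the change attribute it saw earlier with the current one; with only the low 32 bits stored, two values exactly 2^32 updates apart look identical.

```c
#include <assert.h>
#include <stdint.h>

/* Inequality check when only the low 32 bits are stored. */
int sketch_changed32(uint64_t seen, uint64_t now)
{
	return (uint32_t)seen != (uint32_t)now;
}

/* Inequality check with the full 64-bit version. */
int sketch_changed64(uint64_t seen, uint64_t now)
{
	return seen != now;
}
```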

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-06 Thread Andreas Dilger

On Jul 06, 2007  09:51 -0400, J. Bruce Fields wrote:
 The use of a mount option means the change attribute could be
 inconsistent across mounts.  If we really need this, wouldn't it make
 more sense for it to be a persistent feature of the filesystem, set at
 mkfs time?

Yes, having it stored into the superblock in s_flags is probably a good
idea.  Kalpak, do you think you could get a patch that adds e.g.
EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [ANNOUNCE] util-linux-ng 2.13-rc1

2007-07-05 Thread Andreas Dilger

On Jul 05, 2007  12:41 -0400, Mike Frysinger wrote:
 On Wednesday 04 July 2007, Christoph Hellwig wrote:
  Sorry, but it's really annoying to pull in a filesystem-specific devel
  package for that.  Having a library is fine, but please move the library
  into util-linux so it's always available without another dependency.
 
 ugh, moving libraries which are already actively maintained by other core 
 projects into util-linux is so not a good idea (ignoring the fact that it'd 
 easily be a pita/waste for distro maintainers)

Some distros (Debian and SuSE I think) split the e2fsprogs libraries
into separate packages so that you are not depending on e2fsprogs,
but rather libuuid and/or libblkid.

  That way xfsprogs could for example drop its own detection library as well.
 
 i dont really think this is dependent on util-linux at all.  nothing is 
 stopping xfsprogs from depending on udev or e2fsprogs now.

In fact, Eric Sandeen and I discussed splitting the xfsprogs libdisk
(or similar, it detects RAID geometry for DM/MD/etc) into a standalone
library so that e2fsprogs could use it.  The only issue is the increased
maintenance and packaging of separate libraries.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-04 Thread Andreas Dilger

On Jul 04, 2007  12:06 +0530, Aneesh Kumar K.V wrote:
 Mingming Cao wrote:
 On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote:
 On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
 +
 +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)                     \
 +do {                                                                      \
 +  (inode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);                \
 +  if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra))      \
 +          ext4_decode_extra_time(&(inode)->xtime,                         \
 +                                 raw_inode->xtime ## _extra);             \
 +} while (0)
 +
 +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)                   \
 +do {                                                                      \
 +  if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))                       \
 +          (einode)->xtime.tv_sec = le32_to_cpu((raw_inode)->xtime);       \
 +  if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))             \
 +          ext4_decode_extra_time(&(einode)->xtime,                        \
 +                                 raw_inode->xtime ## _extra);             \
 +} while (0)
 +
 This nanosecond patch seems to be missing the fix below which is 
 required for http://bugzilla.kernel.org/show_bug.cgi?id=5079
 
 If the timestamp is set to before epoch i.e. a negative timestamp then 
 the file may have its date set into the future on 64-bit systems. So 
 when the timestamp is read it must be cast as signed.
 
 Missed this one.
 Thanks. Will update ext4 patch queue tonight with this fix.
 
 IIRC in the conference call it was decided not to apply this patch. 
 Andreas may be able to update better.

I wasn't on the most recent concall, and I've forgotten the details of
any discussion on a previous concall.

Care really needs to be taken here that negative timestamps are handled
properly.  We can take the sign bit from the inode i_*time, but then we
need to change the load/save of the extra time to use a shift of 31
instead of 32.  If we overflow the epoch we have to ensure that the high
bits of the seconds is handled correctly.
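One possible way to keep negative timestamps intact is sketched below. It is not the shift-by-31 variant mentioned above and is not taken from the patch: it assumes the low 32 bits are read back as a signed value and that 2 epoch bits (below 30 nanosecond bits in the extra word) count how many multiples of 2^32 must be added afterwards. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_EPOCH_BITS 2
#define SKETCH_EPOCH_MASK ((1u << SKETCH_EPOCH_BITS) - 1)

/* Toy on-disk form: low seconds word plus (nsec << 2 | epoch). */
struct sketch_disk_time {
	uint32_t sec_lo;
	uint32_t extra;
};

struct sketch_disk_time sketch_encode_time(int64_t sec, uint32_t nsec)
{
	struct sketch_disk_time d;
	/* Value a reader gets by sign-extending the low 32 bits
	 * (the narrowing conversion wraps on common compilers). */
	int64_t low_signed = (int32_t)(uint32_t)sec;

	/* Representable range with 2 epoch bits, and a valid nsec. */
	assert(sec >= INT32_MIN && sec <= INT32_MAX + 3 * (1LL << 32));
	assert(nsec < 1000000000u);

	d.sec_lo = (uint32_t)sec;
	/* sec - low_signed is a non-negative multiple of 2^32 here. */
	d.extra = (uint32_t)(((sec - low_signed) >> 32) & SKETCH_EPOCH_MASK) |
		  (nsec << SKETCH_EPOCH_BITS);
	return d;
}

void sketch_decode_time(struct sketch_disk_time d, int64_t *sec,
			uint32_t *nsec)
{
	/* Signed read first, so pre-1970 values stay negative. */
	*sec = (int64_t)(int32_t)d.sec_lo +
	       ((int64_t)(d.extra & SKETCH_EPOCH_MASK) << 32);
	*nsec = d.extra >> SKETCH_EPOCH_BITS;
}
```

The point of the signed read is that a timestamp of -1 round-trips as -1 instead of turning into a date far in the future on 64-bit systems.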

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 4/5] i_version:ext4 inode version update

2007-07-03 Thread Andreas Dilger

On Jul 03, 2007  12:19 +0530, Aneesh Kumar K.V wrote:
 Mingming Cao wrote:
 Index: linux-2.6.22-rc4/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc4.orig/fs/ext4/super.c 2007-06-13 17:19:11.0 -0700
 +++ linux-2.6.22-rc4/fs/ext4/super.c 2007-06-13 17:24:45.0 -0700
 @@ -2846,8 +2846,8 @@ out:
  i_size_write(inode, off+len-towrite);
  EXT4_I(inode)->i_disksize = inode->i_size;
  }
 -inode->i_version++;
  inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 +inode->i_version = 1;
  ext4_mark_inode_dirty(handle, inode);
  mutex_unlock(&inode->i_mutex);
  return len - towrite;
 
 
 Is this correct?  Why do we set the quota file inode's version to 1
 during write?

Hmm, I thought we had previously fixed this?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-03 Thread Andreas Dilger

On Jul 03, 2007  10:24 -0400, Trond Myklebust wrote:
 It looks OK to me, but you might want to strip out the now redundant
 i_version updates in add_dirent_to_buf(), ext4_rmdir(), ext4_rename().

Agreed, and I thought we discussed that already on the ext4 list.

 I also have some questions about how this will affect the readdir code:
 unless I missed something, the filp->f_version is still unsigned long,
 so the comparisons and assignments in ext4_readdir()/ext4_dx_readdir()
 no longer make sense.

I don't see them as any worse than existing checks.  For 32-bit systems
we only ever had a 32-bit in-memory version anyway so using only the
low 32 bits of i_version in f_version is no more racy than in the past.
For 64-bit systems using the full on-disk i_version is possible.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-03 Thread Andreas Dilger

On Jul 03, 2007  18:15 -0400, J. Bruce Fields wrote:
 How will nfsd tell whether it can rely on a given filesystem's
 i_version, or whether it should fall back on ctime?

Good question.

  As to the performance concerns raised before: the inode version counter
  update (at least for ext4) is done inside ext4_mark_inode_dirty(), so there
  is no extra IO work to store this counter to disk.
 
 So what's the motivation for the noversion mount option?

Lustre needs to be able to control the version number directly (version
number needs to be ordered between all inodes, is set by Lustre to be a
transaction number).  Instead of trying to incorporate this unused code
into ext4 we just turn off the ext4 version code and let Lustre control
this directly.  It may even be that NFSv4 will need to control the version
numbers itself...

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-07-01 Thread Andreas Dilger

On Jun 30, 2007  11:21 +0100, Christoph Hellwig wrote:
 On Tue, Jun 26, 2007 at 04:02:47PM +0530, Amit K. Arora wrote:
  Currently it is left on the file system implementation. In ext4, we do
  not undo preallocation if some error (say, ENOSPC) is hit. Hence it may
  end up with partial (pre)allocation. This is in line with dd and
  posix_fallocate, which also do not free the partially allocated space.
 
 I can't find anything in the specification of posix_fallocate
 (http://www.opengroup.org/onlinepubs/009695399/functions/posix_fallocate.html)
 that tells what should happen to allocated blocks on error.
 
 But common sense would be to not leak disk space on failure of this
 syscall, and this definitively should not be left up to the filesystem,
 either we always leak it or always free it, and I'd strongly favour
 the latter variant.

I definitely agree that the behaviour should be specified as part of
the interface.  The current behaviour of both ext4 and XFS is that the
successful part of the unallocated extent is left in place when returning
ENOSPC so we considered this the consistent behaviour.  This is the same
as e.g. sys_write() which does not remove the part of the write that was
successful if ENOSPC is hit.  I think this also makes sense for some use
cases, because an application like a PVR may want to preallocate approximately
30min of space, but if it gets only 25min worth then it can at least start
using this while it also begins looking for and/or freeing old files.

If the space is always freed on ENOSPC, then there may be a significant
amount of work done and undone while the application is iterating over
possible sizes until one works.   It is easy for the application to
use fstat() to see the blocks/size actually preallocated on failure, and
explicitly request unallocation of this space if the outcome is undesirable.
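The fstat() check described here can be sketched from the application side with the existing posix_fallocate(3) call (the new syscall under discussion is not used). The helper names are hypothetical, and treating st_blocks as 512-byte units is an assumption that holds on Linux but is not guaranteed everywhere.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Open an anonymous scratch file; returns an fd or -1. */
int sketch_open_scratch(void)
{
	char path[] = "/tmp/sketch-prealloc-XXXXXX";
	int fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);	/* file lives on until the fd is closed */
	return fd;
}

/*
 * Try to preallocate len bytes from offset 0. Returns 0 or an error
 * number (e.g. ENOSPC). Either way, *allocated_bytes reports how much
 * space currently backs the file, so the caller can decide whether a
 * partial result is still useful.
 */
int sketch_prealloc(int fd, off_t len, long long *allocated_bytes)
{
	struct stat st;
	int err;

	assert(allocated_bytes != NULL);
	err = posix_fallocate(fd, 0, len);	/* returns errno value */
	if (fstat(fd, &st) != 0)
		return errno;
	*allocated_bytes = (long long)st.st_blocks * 512;
	return err;
}
```

A PVR-style caller could accept a short allocation and start recording, while a database-style caller could treat any nonzero return as fatal and release the space itself.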

If you think that applications have a strong preference for both kinds
of behaviour (e.g. database which requires the full allocation to succeed,
unlike PVR application above) then this could be encoded into a @mode flag.

   For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
   don't want to expose uninitialized disk blocks to userspace.  I'm not
   sure if this makes sense at all.
 
 This is the xfs unwritten extent behaviour.  But anyway, the important bit
 is uninitialized blocks should never ever leak to userspace, so there is
 no need for the flag.

I agree that we shouldn't need FA_ZERO_SPACE.  If an application wants
explicit zeros written to disk it can just do this with O_DIRECT writes
or similar.

 The more I think about it the more I'd prefer we would just put a simple
 syscall in that implements nothing but the posix_fallocate(3) semantics
 as defined in SuS, and then go on to brainstorm about advanced
 preallocation / layout hint semantics.

I don't think the current @mode flags introduce any significant complexity
in the implementation, and in fact one of the reasons these came up in the
first place was because David pointed out the XFS behaviour did NOT match
with posix_fallocate() and we started getting strange semantics enforced
by monolithic modes.  IMHO, coding for and understanding the semantics of
the monolithic modes is much more complex and less useful than the explicit
flags.

The @mode flags that are currently under consideration are (AFAIK):

FA_FL_DEALLOC   0x01 /* deallocate unwritten extent (default allocate) */
FA_FL_KEEP_SIZE 0x02 /* keep size for EOF {pre,de}alloc (default change size) */
FA_FL_DEL_DATA  0x04 /* delete existing data in alloc range (default keep) */

Your concern about leaking space would imply:

FA_FL_ERR_FREE  0x08 /* free preallocation on error (default keep prealloc) */

The other possible flags that were proposed, to avoid confusing backup and
HSM applications when preallocated space is added or removed from a file
(you don't want a backup app to re-backup a file that was migrated via HSM):

FA_FL_NO_MTIME  0x10 /* keep same mtime (default change on size, data change) */
FA_FL_NO_CTIME  0x20 /* keep same ctime (default change on size, data change) */

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Andreas Dilger

On Jun 28, 2007  23:27 +0530, Amit K. Arora wrote:
 On Thu, Jun 28, 2007 at 02:55:43AM -0700, Andrew Morton wrote:
  Are we all supposed to re-review the entire patchset (or at least #4 and
  #7) again?
 
 As I mentioned in the note above, only patches #4 and #7 were new and
 thus these needed to be reviewed. Other patches are _not_ replacements
 of any of the patches which are already part of -mm and/or in Ted's
 patch queue. They were posted again as just placeholders so that the
 two new patches (#4 & #7) could be reviewed. Sorry for any confusion.

The new patches are definitely a big improvement over the previous API,
and need to go in before fallocate() goes into mainline.  This last set
of changes allows the behaviour of these syscalls to accommodate the various
different semantics desired by XFS in a sensible manner instead of tying
all of the individual behaviours (time update, size update, alloc/free, etc)
into monolithic modes that will never make everyone happy.

My understanding is that you only need to grab #4 and #7 to get fallocate
in your tree in sync with the ext4 patch queue (i.e. they are
incremental over the previous set).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [AppArmor 00/44] AppArmor security module overview

2007-06-27 Thread Andreas Dilger

Any chance you can remove linux-fsdevel from the CC list?  I don't think this
has anything to do with filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger

On Jun 26, 2007  16:15 +0530, Amit K. Arora wrote:
 On Mon, Jun 25, 2007 at 03:52:39PM -0600, Andreas Dilger wrote:
  In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
  For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
  each extent.  For some workloads this would be much faster than truncate
  and reallocate of all the blocks in a file.
 
 In ext4, we already mark each extent having preallocated blocks as
 uninitialized. This is done as part of following code (which is part of
 patch 5/7) in ext4_ext_get_blocks() :  

What I meant is that with XFS_IOC_ALLOCSP the previously-written data
is ZEROED OUT, unlike with fallocate() which leaves previously-written
data alone and only allocates in holes.

So, if you had a sparse file with some data in it:

 A BB

fallocate() would allocate the holes:

0A0BB

XFS_IOC_ALLOCSP would overwrite everything:

00000

In order to specify this for allocation, FA_FL_DEL_DATA would need to make
sense for allocations (as well as the deallocation).  This is fairly easy
to do - just mark all of the existing extents as unallocated, and their
data disappears.
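
To make the difference concrete, here is a toy model of the two behaviours
described above.  It is only a sketch, not kernel code: the enum and the
toy_* helpers are made up for illustration, and a "file" is just an array of
per-block states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-block states for a toy file model. */
enum blk {
	HOLE,	/* unallocated */
	ZERO,	/* allocated, reads back as zeroes */
	DATA	/* allocated, holds previously written data */
};

/* fallocate()-style: allocate only the holes, leave written data alone. */
static void toy_fallocate(enum blk *f, size_t n)
{
	for (size_t i = 0; i < n; i++)
		if (f[i] == HOLE)
			f[i] = ZERO;
}

/* XFS_IOC_ALLOCSP-style, as described above: every block ends up zeroed. */
static void toy_allocsp(enum blk *f, size_t n)
{
	for (size_t i = 0; i < n; i++)
		f[i] = ZERO;
}
```

Starting from { HOLE, DATA, HOLE, DATA, DATA }, toy_fallocate() leaves the
three DATA blocks in place and fills the two holes, while toy_allocsp()
leaves five ZERO blocks.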

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-26 Thread Andreas Dilger

On Jun 26, 2007  17:37 +0530, Amit K. Arora wrote:
 Hmm.. I am thinking of a scenario when the file system supports some
 individual flags, but does not support a particular combination of them.
 Just for example sake, assume we have FA_ZERO_SPACE mode also. Now, if a
 file system supports FA_ZERO_SPACE, FA_ALLOCATE, FA_DEALLOCATE and
 FA_RESV_SPACE; and no other mode (i.e. FA_UNRESV_SPACE is not supported
 for some reason). This means that although we support FA_FL_DEALLOC,
 FA_FL_KEEP_SIZE and FA_FL_DEL_DATA flags, but we do not support the
 combination of all these flags (which is nothing but FA_UNRESV_SPACE).

That is up to the filesystem to determine then.  I just thought it should
be clear to return an error for flags (or as you say combinations thereof)
that the filesystem doesn't understand.

That said, I'd think in most cases the flags are orthogonal, so if you
support some combination of the flags (e.g. FA_FL_DEL_DATA, FA_FL_DEALLOC)
then you will also support other combinations of those flags just from
the way it is coded.

  I also thought another proposed flag was to determine whether mtime (and
  maybe ctime) is changed when doing prealloc/dealloc space?  Default should
  probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
  should decide if we want to allow changing the file w/o changing ctime, if
  that is required even though the file is not visibly changing.  Maybe the
  ctime update should be implicit if the size or mtime are changing?
 
 Is it really required ? I mean, why should we allow users not to update
 ctime/mtime even if the file metadata/data gets updated ? It sounds
 a bit unnatural to me.
 Is there any application scenario in your mind, when you suggest of
 giving this flexibility to userspace ?

One reason is that XFS does NOT update the mtime/ctime when doing the
XFS_IOC_* allocation ioctls.

 I think, modifying ctime/mtime should be dependent on the other flags.
 E.g., if we do not zero out data blocks on allocation/deallocation,
 update only ctime. Otherwise, update ctime and mtime both.

I'm only being the advocate for requirements David Chinner has put
forward due to existing behaviour in XFS.  This is one of the reasons
why I like the flags mechanism we now have - we can encode the
various different behaviours in any way we want and leave it to the
caller.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Patent or not patent a new idea

2007-06-26 Thread Andreas Dilger

On Jun 26, 2007  09:53 -0700, Bryan Henderson wrote:
 md/raid already works happily with different sized drives from
 different manufacturers ...
 
 So I still cannot see anything particularly new.
 
 As compared to md of conventional disk partitions, it brings the ability 
 to create and delete arrays without shutting down all use of the physical 
 disks (to update the partition tables).  (LVM gives you that too).  It 
 also makes managing space much easier because the component devices don't 
 have to be carved from contiguous space on the physical disks.
 
 Neither of those benefits is specific to RAID, but you could probably say 
 that RAID multiplies the problems they address.

That's one of the reasons why I liked AIX doing RAID1 on top of LVM.  It
was possible to safely migrate LEs across disks by virtue of creating a
RAID 1 mirror for that LE and then deleting the old copy.  It wouldn't
be too hard to extend the same notion to RAID 5 or RAID 6 on LVM.  You
could even change the RAID level and number of constituent LEs in a RAID
set on an ad-hoc basis because the LVM mappings would allow this to be
efficient.

Unfortunately, Linux still has distinct DM and MD layers and there doesn't
seem to be any work to combine the two into a more powerful single layer.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-26 Thread Andreas Dilger

On Jun 27, 2007  09:14 +1000, David Chinner wrote:
 Someone on the XFS list had an interesting request - preallocated
 swap files. You can't use unwritten extents for this because
 of sys_swapon()s use of bmap() (XFS returns holes for reading
 unwritten extents), so we need a method of preallocating that does
 not zero or mark the extent unread. i.e. FA_MKSWAP.

Is there a reason why unwritten extents return 0 to bmap()?  This
would seem to be the only impediment from using fallocated files
for swap files.  Maybe if FIEMAP was used by mkswap to get an
UNWRITTEN flag back instead of HOLE it wouldn't be a problem.

 That way we can allocate large swap files that don't need zeroing
 in a single, fast operation, and hence potentially bring new
 swap space online without needing very much memory at all (i.e.
 should succeed in most near-OOM conditions).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger

On Jun 25, 2007  20:33 +0530, Amit K. Arora wrote:
 I have not implemented FA_FL_FREE_ENOSPC and FA_ZERO_SPACE flags yet, as
 *suggested* by Andreas in http://lkml.org/lkml/2007/6/14/323  post.
 If it is decided that these flags are also needed, I will update this
 patch. Thanks!

Can you clarify - what is the current behaviour when ENOSPC (or some other
error) is hit?  Does it keep the current fallocate() or does it free it?

For FA_ZERO_SPACE - I'd think this would (IMHO) be the default - we
don't want to expose uninitialized disk blocks to userspace.  I'm not
sure if this makes sense at all.

 On Mon, Jun 25, 2007 at 07:15:00PM +0530, Amit K. Arora wrote:
  Implement new flags and values for mode argument.
  
  This patch implements the new flags and values for the mode argument
  of the fallocate system call. It is based on the discussion between
  Andreas Dilger and David Chinner on the man page proposed (by the latter)
  on fallocate.
  
  Signed-off-by: Amit Arora [EMAIL PROTECTED]
  
  Index: linux-2.6.22-rc4/include/linux/fs.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/fs.h
  +++ linux-2.6.22-rc4/include/linux/fs.h
  @@ -267,15 +267,16 @@ extern int dir_notify_enable;
   #define SYNC_FILE_RANGE_WAIT_AFTER 4
  
   /*
  - * sys_fallocate modes
  - * Currently sys_fallocate supports two modes:
  - * FA_ALLOCATE  : This is the preallocate mode, using which an application/user
  - *   may request (pre)allocation of blocks.
  - * FA_DEALLOCATE: This is the deallocate mode, which can be used to free
  - *   the preallocated blocks.
  + * sys_fallocate mode flags and values
    */
  -#define FA_ALLOCATE     0x1
  -#define FA_DEALLOCATE   0x2
  +#define FA_FL_DEALLOC   0x01 /* default is allocate */
  +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
  +#define FA_FL_DEL_DATA  0x04 /* default is keep written data on DEALLOC */
  +
  +#define FA_ALLOCATE     0
  +#define FA_DEALLOCATE   FA_FL_DEALLOC
  +#define FA_RESV_SPACE   FA_FL_KEEP_SIZE
  +#define FA_UNRESV_SPACE (FA_FL_DEALLOC | FA_FL_KEEP_SIZE | FA_FL_DEL_DATA)
  
   #ifdef __KERNEL__
  
  Index: linux-2.6.22-rc4/fs/open.c
  ===
  --- linux-2.6.22-rc4.orig/fs/open.c
  +++ linux-2.6.22-rc4/fs/open.c
  @@ -356,23 +356,26 @@ asmlinkage long sys_ftruncate64(unsigned
    * sys_fallocate - preallocate blocks or free preallocated blocks
    * @fd: the file descriptor
    * @mode: mode specifies if fallocate should preallocate blocks OR free
  - *   (unallocate) preallocated blocks. Currently only FA_ALLOCATE and
  - *   FA_DEALLOCATE modes are supported.
  + *   (unallocate) preallocated blocks.
    * @offset: The offset within file, from where (un)allocation is being
    * requested. It should not have a negative value.
    * @len: The amount (in bytes) of space to be (un)allocated, from the offset.
    *
    * This system call, depending on the mode, preallocates or unallocates blocks
    * for a file. The range of blocks depends on the value of offset and len
  - * arguments provided by the user/application. For FA_ALLOCATE mode, if this
  + * arguments provided by the user/application. For FA_ALLOCATE and
  + * FA_RESV_SPACE modes, if the sys_fallocate()
    * system call succeeds, subsequent writes to the file in the given range
    * (specified by offset & len) should not fail - even if the file system
    * later becomes full. Hence the preallocation done is persistent (valid
  - * even after reopen of the file and remount/reboot).
  + * even after reopen of the file and remount/reboot). If FA_RESV_SPACE mode
  + * is passed, the file size will not be changed even if the preallocation
  + * is beyond EOF.
    *
    * It is expected that the ->fallocate() inode operation implemented by the
    * individual file systems will update the file size and/or ctime/mtime
  - * depending on the mode and also on the success of the operation.
  + * depending on the mode (change is visible to user or not - say file size)
  + * and obviously, on the success of the operation.
    *
    * Note: Incase the file system does not support preallocation,
    * posix_fallocate() should fall back to the library implementation (i.e.
  @@ -398,7 +401,8 @@ asmlinkage long sys_fallocate(int fd, in
  
    /* Return error if mode is not supported */
    ret = -EOPNOTSUPP;
  - if (mode != FA_ALLOCATE && mode != FA_DEALLOCATE)
  + if (!(mode == FA_ALLOCATE || mode == FA_DEALLOCATE ||
  +       mode == FA_RESV_SPACE || mode == FA_UNRESV_SPACE))
            goto out;
  
    ret = -EBADF;

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems

Re: [PATCH 4/7][TAKE5] support new modes in fallocate

2007-06-25 Thread Andreas Dilger

On Jun 25, 2007  19:15 +0530, Amit K. Arora wrote:
 +#define FA_FL_DEALLOC   0x01 /* default is allocate */
 +#define FA_FL_KEEP_SIZE 0x02 /* default is extend/shrink size */
 +#define FA_FL_DEL_DATA  0x04 /* default is keep written data on DEALLOC */

In XFS one of the (many) ALLOC modes is to zero existing data on allocate.
For ext4 all this would mean is calling ext4_ext_mark_uninitialized() on
each extent.  For some workloads this would be much faster than truncate
and reallocate of all the blocks in a file.

In that light, please change the comment to /* default is keep existing data */
so that it doesn't imply this is only for DEALLOC.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 7/7][TAKE5] ext4: support new modes

2007-06-25 Thread Andreas Dilger

On Jun 25, 2007  19:20 +0530, Amit K. Arora wrote:
 @@ -2499,7 +2500,8 @@ long ext4_fallocate(struct inode *inode,
* currently supporting (pre)allocate mode for extent-based
* files _only_
*/
 - if (mode != FA_ALLOCATE || !(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
 + if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) ||
 +     !(mode == FA_ALLOCATE || mode == FA_RESV_SPACE))
   return -EOPNOTSUPP;

This should probably just check for the individual flags it can support
(e.g. no FA_FL_DEALLOC, no FA_FL_DEL_DATA).
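
Roughly what I have in mind, as a sketch only: the FA_FL_* values are the
ones from patch 4/7, but TOY_SUPPORTED_FLAGS and toy_check_mode() are
hypothetical names, not what ext4 would actually use.

```c
#include <assert.h>
#include <errno.h>

/* Flag values as proposed in patch 4/7 of this series. */
#define FA_FL_DEALLOC	0x01
#define FA_FL_KEEP_SIZE	0x02
#define FA_FL_DEL_DATA	0x04

/* Hypothetical: the set of flags this filesystem knows how to handle. */
#define TOY_SUPPORTED_FLAGS	(FA_FL_KEEP_SIZE)

/* Reject any mode that carries a flag outside the supported set. */
static int toy_check_mode(int mode)
{
	if (mode & ~TOY_SUPPORTED_FLAGS)
		return -EOPNOTSUPP;
	return 0;
}
```

With this shape, a flag added later is rejected automatically until the
filesystem opts in by extending its supported mask, instead of each named
mode combination having to be listed by hand.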

I also thought another proposed flag was to determine whether mtime (and
maybe ctime) is changed when doing prealloc/dealloc space?  Default should
probably be to change mtime/ctime, and have FA_FL_NO_MTIME.  Someone else
should decide if we want to allow changing the file w/o changing ctime, if
that is required even though the file is not visibly changing.  Maybe the
ctime update should be implicit if the size or mtime are changing?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: vm/fs meetup in september?

2007-06-25 Thread Andreas Dilger

On Jun 26, 2007  12:35 +1000, Nick Piggin wrote:
 Leaving my opinion of higher order pagecache aside, this _may_ be an
 example of something that doesn't need a lot of attention, because it
 should be fairly uncontroversial from a filesystem's POV? (eg. it is
 more a relevant item to memory management and possibly block layer).
 OTOH if it is discussed in the context of "large blocks in the buffer
 layer is crap because we can do it with higher order pagecache", then
 that might be interesting :)

FWIW, being able to have large (8-64kB) blocksize would be great for
ext2/3/4.  We'd sort of been betting on this by limiting the on-disk
extent format to 48-bit physical block numbers, and to have 2 patches
to implement this in as many weeks is excellent.

To me the mechanism doesn't matter, whether through fsblock or high-order
PAGE_SIZE.  I'll let the rest of you duke it out as long as at least one
of them makes it into the kernel.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [34/37] Large blocksize support in ramfs

2007-06-20 Thread Andreas Dilger

On Jun 20, 2007  11:29 -0700, [EMAIL PROTECTED] wrote:
 If you apply this patch and then you can f.e. try this:
 
   mount -tramfs -o10 none /media

 @@ -164,10 +165,15 @@ static int ramfs_fill_super(struct super
 + if (options && *options)
 + order = simple_strtoul(options, NULL, 10);

This is probably a bad name for a mount option.  What about order=10?
Otherwise you prevent any other option from being used in the future.
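
As a userspace sketch of what a named option could look like (the kernel
would more likely use its own option-parsing helpers; toy_parse_order() is a
made-up name for illustration):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return the value of "order=N" from a comma-separated option string,
 * or -1 if the option is absent or malformed.  Other options are skipped,
 * which leaves room for adding more of them later.
 */
static long toy_parse_order(const char *options)
{
	const char *p = options;

	while (p && *p) {
		size_t len = strcspn(p, ",");	/* length of this option */

		if (len > 6 && strncmp(p, "order=", 6) == 0) {
			char *end;
			unsigned long v = strtoul(p + 6, &end, 10);

			/* the number must span the rest of this option */
			return end == p + len ? (long)v : -1;
		}
		p += len;
		if (*p == ',')
			p++;
	}
	return -1;
}
```

A bare "10" is no longer accepted here, but "order=10" can sit next to any
other option in the same string.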

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [36/37] Large blocksize support for ext2

2007-06-20 Thread Andreas Dilger

On Jun 20, 2007  11:29 -0700, [EMAIL PROTECTED] wrote:
 This adds support for a block size of up to 64k on any platform.
 It enables the mounting filesystems that have a larger blocksize
 than the page size.

Might have been good to CC the ext2/3/4 maintainers here?  I definitely
have been waiting for a patch like this for ages (so definitely no
objection from me), but there are a few caveats before this will work
on ext2/3/4.

 Hmmm... Actually there is nothing additional to be done after the earlier
 cleanup of the macros. So just modify copyright.

It is NOT possible to have 64kB blocksize on ext2/3/4 without some small
changes to the directory handling code.  The reason is that an empty 64kB
directory block would have a rec_len == (__u16)2^16 == 0, and this would
cause an error to be hit in the filesystem.  What is needed is to put
2 empty records in such a directory, or to special-case an impossible
value like rec_len = 0x to handle this.
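
A small sketch of the overflow and of the special-case approach.  The
sentinel chosen below is my assumption for illustration (valid rec_len
values are multiples of 4, so an odd value can never occur naturally), and
the toy_* names are hypothetical, not from any merged patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sentinel: not a multiple of 4, so never a valid rec_len. */
#define TOY_REC_LEN_64K	0xFFFFu

/* The problem: a 64kB record length does not fit in 16 bits. */
static uint16_t toy_naive_rec_len(uint32_t len)
{
	return (uint16_t)len;		/* 65536 truncates to 0 */
}

/* Encode a record length for the 16-bit on-disk field. */
static uint16_t toy_rec_len_to_disk(uint32_t len)
{
	return len == 65536 ? TOY_REC_LEN_64K : (uint16_t)len;
}

/* Decode the on-disk field back to the real record length. */
static uint32_t toy_rec_len_from_disk(uint16_t dlen)
{
	return dlen == TOY_REC_LEN_64K ? 65536u : dlen;
}
```

The alternative mentioned above, two empty records per 64kB block, avoids
any encoding change at the cost of special handling when the block is
created or emptied.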

There was a patch to fix the 64kB blocksize directory problem, but it
hasn't been merged anywhere yet seeing as there wasn't previously a
patch to allow larger blocksize...

Having 32kB blocksize has no problems that I'm aware of.  Also, I'm not
sure how it happened, but ext2 SHOULD have an explicit check (as
ext3/4 does) limiting it to EXT2_MAX_BLOCK_SIZE.  Otherwise it appears
that there would be no error reported if the superblock reports e.g. 16MB
blocksize, and all kinds of things would break.

There shouldn't be a problem with increasing EXT{2,3,4}_MAX_BLOCK_SIZE to
32kB (AFAIK), but I haven't looked into this in a while.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [36/37] Large blocksize support for ext2

2007-06-20 Thread Andreas Dilger

On Jun 20, 2007  14:27 -0700, Christoph Lameter wrote:
   Hmmm... Actually there is nothing additional to be done after the earlier
   cleanup of the macros. So just modify copyright.
  
  It is NOT possible to have 64kB blocksize on ext2/3/4 without some small
  changes to the directory handling code.  The reason is that an empty 64kB
  directory block would have a rec_len == (__u16)2^16 == 0, and this would
  cause an error to be hit in the filesystem.  What is needed is to put
  2 empty records in such a directory, or to special-case an impossible
  value like rec_len = 0x to handle this.
  
  There was a patch to fix the 64kB blocksize directory problem, but it
  hasn't been merged anywhere yet seeing as there wasn't previously a
  patch to allow larger blocksize...
 
 mke2fs allows to specify a 64kb blocksize and IA64 can run with 64kb 
 PAGE_SIZE. So this is a bug in ext2fs that needs to be fixed regardless.

True.  I had increased the e2fsprogs blocksize to 16kB after testing it,
and it seems Ted later increased it to 64kB.  The 64kB
directory problem only came out recently.

  Having 32kB blocksize has no problems that I'm aware of.  Also, I'm not
  sure how it happened, but ext2 SHOULD have an explicit check (as
  ext3/4 does) limiting it to EXT2_MAX_BLOCK_SIZE.  Otherwise it appears
  that there would be no error reported if the superblock reports e.g. 16MB
  blocksize, and all kinds of things would break.
 
 mke2fs fails for blocksizes > 64k so you are safe there. I'd like to see 
 that limit lifted?

I don't think extN can go past 64kB blocksize in any case.

  There shouldn't be a problem with increasing EXT{2,3,4}_MAX_BLOCK_SIZE to
  32kB (AFAIK), but I haven't looked into this in a while.
 
 I'd love to see such a patch. That is also useful for arches that have 
 PAGE_SIZE > 4kb without this patchset.

Definitely, which is why we had been working on this originally.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Versioning file system

2007-06-18 Thread Andreas Dilger

On Jun 16, 2007  16:53 +0200, Jörn Engel wrote:
 On Fri, 15 June 2007 15:51:07 -0700, alan wrote:
  Thus, in the end it turns out that this stuff is better handled by
  explicit version-control systems (which require explicit operations to
  manage revisions) and atomic snapshots (for backup.)
  
  ZFS is the cool new thing in that space.  Too bad the license makes it 
  hard to incorporate it into the kernel.
 
 It may be the coolest, but there are others as well.  Btrfs looks good,
 nilfs finally has a cleaner and may be worth a try, logfs will get
 snapshots sooner or later.  Heck, even my crusty old cowlinks can be
 viewed as snapshots.
 
 If one has spare cycles to waste, working on one of those makes more
 sense than implementing file versioning.

Too bad everyone is spending time on 10 similar-but-slightly-different
filesystems.  This will likely end up with a bunch of filesystems that
implement some easy subset of features, but will not get polished for
users or have a full set of features implemented (e.g. ACL, quota, fsck,
etc).  While I don't think there is a single answer to every question,
it does seem that the number of filesystem projects has climbed lately.

Maybe there should be a BOF at OLS to merge these filesystem projects
(btrfs, chunkfs, tilefs, logfs, etc) into a single project with multiple
people working on getting it solid, scalable (parallel readers/writers on
lots of CPUs), robust (checksums, failure localization), recoverable, etc.
I thought Val's FS summits were designed to get developers to collaborate,
but it seems everyone has gone back to their corners to work on their own
filesystem?

Working on getting hooks into DM/MD so that the filesystem and RAID layers
can move beyond "ignorance is bliss" when talking to each other would be
great.  Not rebuilding empty parts of the fs, limit parity resync to parts
of the fs that were in the previous transaction, use fs-supplied checksums
to verify on-disk data is correct, use RAID geometry when doing allocations,
etc.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Read/write counts

2007-06-04 Thread Andreas Dilger

On Jun 04, 2007  06:20 -0400, David H. Lynch Jr. wrote:
 The net result is that implementation would be simpler if I could
 just read/write, the amount of data that can be done with the least
 amount of work, even if that is less than was requested.
 
 If I receive a request to read 512 bytes, and I return that I have read
 486, is either the OS, libc, or something else going to treat that as an
 error, or are they coming back for the rest in a subsequent call ?
 
 I thought I recalled that read()/write() returning a count less than
 requested is not an error.

It is not strictly an error to read/write less than the requested amount,
but you will find that a lot of applications don't handle this correctly.
They will assume that if the amount read/written is != amount requested
that this is an error.  Of course the opposite is also true - some
applications assume that the amount requested == amount read/written and
don't even check whether that is actually the case or not.
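
For reference, the usual way for an application to cope with short reads is
a retry loop along these lines.  This is a generic sketch (read_full() is
just an illustrative name), and it does not help with applications that
never check the return value at all.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Read up to count bytes, retrying after short reads and EINTR.
 * Returns the number of bytes read (less than count only at EOF),
 * or -1 on error with errno set by read().
 */
static ssize_t read_full(int fd, void *buf, size_t count)
{
	char *p = buf;
	size_t done = 0;

	while (done < count) {
		ssize_t n = read(fd, p + done, count - done);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		if (n == 0)
			break;			/* EOF */
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The same pattern applies to write(), except that a return of 0 there is
not an EOF indication.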

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] obsoleting /etc/mtab

2007-05-31 Thread Andreas Dilger

On May 31, 2007  17:11 -0700, H. Peter Anvin wrote:
 NFS takes a binary option block anyway.  However, that's the exception,
 not the rule.

There was recently a patch submitted to linux-fsdevel to change NFS to
use text option parsing.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [patch 2/2] i_version update - ext4 part

2007-05-29 Thread Andreas Dilger

On May 29, 2007  12:44 -0700, Mingming Cao wrote:
 I am a little bit confused about the two patches. 
 
 It appears that in the ext4_expand_inode_extra_isize patch by Kalpak, a
 new 64 bit i_fs_version field is added to ext4 inode structure for inode
 versioning support. read/store of this counter are properly handled, but
 missing the inode versioning counter update.

For the Lustre use of the inode version we don't care about the VFS changes
to i_version.  In fact - we want to be able to control the changes to
inode version ourselves so that e.g. file defragmenting or atime updates
don't change the inode version, and that recovery can restore the version
to a known state along with the rest of the metadata.

That said, since Lustre isn't in the kernel and we patch our version of
ext3 anyways it doesn't really matter what is done for NFS.  We will just
patch in our own behaviour if the final ext4 code isn't suitable in all
of the details.  Having 99% of the code the same at least makes this a
lot less work.

 But later in the second patch by Jean Noel, he re-used the VFS
 inode->i_version for ext4 inode versioning, the counter is being updated every
 time the file is being changed. 

I don't know what the NFS requirements for the version are.  There may
also be some complaints from others if the i_version is 64 bits because
this contributes to generic inode growth and isn't used for other
filesystems.

 To me, i_fs_version and inode_version are the same thing, right?
 Shouldn't we choose one(I assume inode i_version?), and combine these
 two patch together? How about split the inode versioning part from the
 ext4_expand_inode_extra_isize patch(it does multiple things, and
 i_versioning doesn't belong there) and put it together with the rest of
 i_version update patches?

I don't have an objection to that, but I don't think it is required.

 BTW, how could NFS/user space to access the inode version counter?

If the Bull patch uses i_version then knfsd can just access it directly.
I don't think there is any API to access it from userspace.  One option
is to add a virtual EA like user.inode_version and have the kernel fill
this in from i_version.

Lustre will manipulate the ei->i_fs_version directly.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Andreas Dilger

On May 25, 2007  17:58 +1000, Neil Brown wrote:
These devices would find it very hard to support BIO_RW_BARRIER.
Doing this would require keeping track of all in-flight requests
(which some, possibly all, of the above don't) and then:
  When a BIO_RW_BARRIER request arrives:
 wait for all pending writes to complete
 call blkdev_issue_flush on all devices
 issue the barrier write to the target device(s)
as BIO_RW_BARRIER,
 if that is -EOPNOTSUP, re-issue, wait, flush.

We noticed when testing the SLES10 kernel (which has barriers enabled
by default) that ext3 write throughput went from about 170MB/s to about
130MB/s (on high-end RAID storage using no-op scheduler).

The reason (as far as we could tell) is that the barriers are implemented
by flushing and waiting for all previously submitted IOs to finish, but
all that ext3/jbd really care about is that the journal blocks are safely
on disk.

Since the journal blocks are only a small fraction of the total IO in
flight, the barrier + write cache ends up being a lot worse than just
doing synchronous IO with the write cache disabled because no new IO can
be submitted past the barrier, and since that IO is large and contiguous
it might complete much faster than the scattered metadata updates that are
also being checkpointed to disk from the previous transactions.  With jbd
there can be both a running and a committing transaction, and multiple
checkpointing transactions, and the use of barriers breaks this important
optimization.

If ext3 used an external journal this problem would be avoided,
but then there isn't really a need for barriers in the first place, since
the jbd code already will handle the wait for the commit block itself.

We've got a pretty-much complete version of the ext3 journal checksumming
patch that avoids the need to do the pre-commit barrier, since the checksum
can verify at recovery time whether all of the transaction's blocks made
it to disk or not (which is what the commit block is all about in the end).
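
The idea, in sketch form: the commit block carries a checksum over the
transaction's blocks, and recovery replays the transaction only if the
blocks found in the journal still match it.  Everything below is
illustrative; the checksum is an Adler-32-style stand-in rather than the
algorithm in the actual patch, and the struct layout and toy_* names are
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLKSZ	8	/* tiny block size to keep the example short */

/* Hypothetical commit record: block count plus checksum over the blocks. */
struct toy_commit {
	uint32_t nblocks;
	uint32_t csum;
};

/* Stand-in checksum (Adler-32 style), chained through the seed. */
static uint32_t toy_csum(uint32_t seed, const unsigned char *buf, size_t len)
{
	uint32_t a = seed & 0xffff, b = seed >> 16;

	for (size_t i = 0; i < len; i++) {
		a = (a + buf[i]) % 65521;
		b = (b + a) % 65521;
	}
	return (b << 16) | a;
}

/* Checksum over all blocks of one transaction, in journal order. */
static uint32_t toy_txn_csum(unsigned char blocks[][TOY_BLKSZ], uint32_t n)
{
	uint32_t c = 1;

	for (uint32_t i = 0; i < n; i++)
		c = toy_csum(c, blocks[i], TOY_BLKSZ);
	return c;
}

/* At recovery: replay only if the blocks match what was committed. */
static int toy_txn_valid(unsigned char blocks[][TOY_BLKSZ],
			 const struct toy_commit *cb)
{
	return toy_txn_csum(blocks, cb->nblocks) == cb->csum;
}
```

Because a block that never reached the disk shows up as a mismatch, the
commit block no longer has to be ordered strictly after the other journal
blocks.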

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC 4/5] inode reservation v0.1 (benchmark result)

2007-05-24 Thread Andreas Dilger

On May 24, 2007  02:08 +0800, coly wrote:
 Due to the bad design of magic inode and the on-disk layout of magic
 inode. When 30 files created alternatively in each directory, no
 performance advantage exists. When 50 files created alternatively in
 each directory, the patched ext4 will use double time on removing all
 the files and directories.

I don't think the use of magic inodes is the right approach.  One possibility
to avoid changing the on-disk format at all is to only do the reservation in
memory, scaling the reservation with the size of the directory.

The only issue that arises is how to regenerate the same reservation
after a remount.  This might be possible to do by looking into the leaf
block at create time to see which inode numbers are already in use for
that leaf and checking whether there are free inodes in each group.

One way to get the best mapping is possibly checking groups in order of
decreasing number of inodes for that leaf in each group and once a suitable
group has been found doing a few name->hash->inode numbers to get the old
mapping back.  Once this leaf->group mapping has been established it
can be re-used for a given leaf block until that window is full.

Since you need to scan all of a leaf block's dir entries in a hash block
at insert time to look for duplicate names, and the inode numbers are
in the dir entries, this shouldn't introduce any additional disk IO.

Also, regardless of what the mapping turns out to be - the goal is to place
inodes with a similar hash into nearby inodes, and this heuristic works
relatively well for that.  Once the given leaf block's inode range is full
then new inodes can be allocated from a new window as it was done for the
newly-created directory.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Software raid0 will crash the file-system, when each disk is 5TB

2007-05-16 Thread Andreas Dilger

On May 16, 2007  11:09 +1200, Jeff Zheng wrote:
 We are using two 3ware disk array controllers, each of them is connected
 8 750GB harddrives. And we build a software raid0 on top of that. The
 total capacity is 5.5TB+5.5TB=11TB
 
 We use jfs as the file-system, we have a test application that write
 data continuously to the disks. After writing 52 10GB files, jfs
 crashed. And we are not able to recover it, fsck doesn't recognise it
 anymore.
 We then tried xfs, same application, lasted a little longer, but gives
 kernel crash later.

Check if your kernel has CONFIG_LBD enabled.

The kernel doesn't check if the block layer can actually write to
a block device > 2TB.
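
For reference, the 2TB figure falls out of the sector arithmetic.  A minimal
sketch, assuming a 32-bit kernel where sector_t is 32 bits wide without
CONFIG_LBD and the block layer counts 512-byte sectors (the helper names
below are illustrative, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Largest device size, in bytes, addressable with a 32-bit sector number. */
static uint64_t max_bytes_32bit_sector(void)
{
	return ((uint64_t)UINT32_MAX + 1) << SECTOR_SHIFT;
}

/* Sector number as a 32-bit sector_t would hold it; the upper bits are lost. */
static uint32_t sector_32bit(uint64_t byte_offset)
{
	assert((byte_offset & ((1u << SECTOR_SHIFT) - 1)) == 0); /* sector aligned */
	return (uint32_t)(byte_offset >> SECTOR_SHIFT);
}
```

Under those assumptions an offset of exactly 2TB maps back to sector 0, so
writes past that point would alias the start of the device.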

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 0/5][TAKE2] fallocate system call

2007-05-15 Thread Andreas Dilger

On May 14, 2007  18:59 +0530, Amit K. Arora wrote:
  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len)
 
 fd: The descriptor of the open file.
 
 mode*: This specifies the behavior of the system call. Currently the
   system call supports two modes - FA_ALLOCATE and FA_DEALLOCATE.
   FA_ALLOCATE: Applications can use this mode to preallocate blocks to
 a given file (specified by fd). This mode changes the file size if
 the preallocation is done beyond the EOF. It also updates the
 ctime/mtime in the inode of the corresponding file, marking a
 successful allocation.
   FA_DEALLOCATE: This mode can be used by applications to deallocate the
 previously preallocated blocks. This also may change the file size
 and the ctime/mtime.
 * New modes might get added in future. One such new mode which is
   already under discussion is FA_PREALLOCATE, which when used will
   preallocate space but will not change the filesize and [cm]time.
   Since the semantics of this new mode are not clear and agreed upon yet,
   this patchset does not implement it currently.
 
 offset: This is the offset in bytes, from where the preallocation should
   start.
 
 len: This is the number of bytes requested for preallocation (from
   offset).

What is the return value?  I'd hope it is the number of bytes preallocated,
in case of interrupted preallocation for whatever reason (interrupt, out of
space, etc) like a regular write(2) call.  In this case the return type needs
to also be an loff_t to match @len.
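
To illustrate what a byte-count return would mean for callers, here is a
rough sketch of a retry loop in the style of a short write(2).  This reflects
only the semantics suggested above, not what the patch implements;
fake_fallocate() and fallocate_all() are hypothetical names, and the stand-in
simply limits each call to 4096 bytes:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the proposed call: preallocates at most 4096
 * bytes per call and returns the number of bytes done, like a short write(2).
 */
static int64_t fake_fallocate(int fd, int64_t offset, int64_t len)
{
	(void)fd;
	(void)offset;
	return len < 4096 ? len : 4096;
}

/* Keep calling until the whole range is preallocated or no progress is made. */
static int64_t fallocate_all(int64_t (*fn)(int, int64_t, int64_t),
			     int fd, int64_t offset, int64_t len)
{
	int64_t done = 0;

	assert(fn && len >= 0);
	while (done < len) {
		int64_t ret = fn(fd, offset + done, len - done);

		if (ret <= 0)
			break;	/* error or no progress: keep what we have */
		done += ret;
	}
	return done;
}
```

A 64-bit return type is what lets the caller see how far a large request got.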

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC][PATCH 13/14] ext3 whiteout support

2007-05-14 Thread Andreas Dilger

On May 14, 2007  15:14 +0530, Bharata B Rao wrote:
  #define EXT3_FEATURE_INCOMPAT_JOURNAL_DEV 0x0008 /* Journal device */
  #define EXT3_FEATURE_INCOMPAT_META_BG     0x0010
 +#define EXT3_FEATURE_INCOMPAT_WHITEOUT    0x0020

Is this flag reserved with Ted?  It isn't listed in the e2fsprogs repo.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 2/2] file capabilities: accomodate 32 bit capabilities

2007-05-10 Thread Andreas Dilger

On May 08, 2007  16:49 -0500, Serge E. Hallyn wrote:
 Quoting Andreas Dilger ([EMAIL PROTECTED]):
  One of the important use cases I can see today is the ability to
  split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
  grained attributes.
 
 Sounds plausible, though it suffers from both making capabilities far
 more cumbersome (i.e. finding the right capability for what you wanted
 to do) and backward compatibility.  Perhaps at that point we should
 introduce security.capabilityv2 xattrs.  A binary can then carry
 security.capability=CAP_SYS_ADMIN=p, and
 security.capabilityv2=cap_may_clone_mntns=p.

Well, the overhead of each EA is non-trivial (16 bytes/EA) for storing
12 bytes worth of data, so it is probably just better to keep extending
the original capability fields as was in the proposal.

  What we definitely do NOT want to happen is an application that needs
  privileged access (e.g. e2fsck, mount) to stop running because the
  new capabilities _would_ have been granted by the new kernel and are
  not by the old kernel and STRICTXATTR is used.
  
  To me it would seem that having extra capabilities on an old kernel
  is relatively harmless if the old kernel doesn't know what they are.
  It's like having a key to a door that you don't know where it is.
 
 If we ditch the STRICTXATTR option do the semantics seem sane to you?

Seems reasonable.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Andreas Dilger

On May 09, 2007  21:31 +0530, Amit K. Arora wrote:
 2) For FA_UNALLOCATE mode, should the file system allow unallocation
of normal (non-preallocated) blocks (blocks allocated via
regular write/truncate operations) also (i.e. work as punch()) ?
- Though FA_UNALLOCATE mode is yet to be implemented on ext4, still
  we need to finalize on the convention here as a general guideline
  to all the filesystems that implement fallocate.

I would only allow this on FA_ALLOCATE extents.  That means it won't be
possible to do this for filesystems that don't understand unwritten
extents unless there are blocks allocated beyond EOF.

 3) If above is true, the file size will need to be changed
for unallocation when block holding the EOF gets unallocated.
- If we do not unallocate normal (non-preallocated) blocks and we
  do not change the file size on preallocation, then this is a
  non-issue.

Not necessarily.  That will just make the file sparse.  If FA_ALLOCATE
does not change the file size, why should FA_UNALLOCATE?

 4) Should we update mtime & ctime on a successful allocation/
unallocation ?

I would say yes.  If glibc does the fallback fallocate via write() the
mtime/ctime will be updated, so it makes sense to be consistent for
both methods.  Also, it just makes sense from the "this file was modified"
point of view.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Andreas Dilger

On May 07, 2007  21:43 -0400, Theodore Tso wrote:
 On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
  Userspace could presumably repair the mess in most situations by truncating
  the file back again.  The kernel cannot do that because there might be live
  data in amongst there.
 
 Actually, the kernel could do it, in that it could simply release all
 uninitialized extents back to the system.  The problem is distinguishing
 between the uninitialized extents that had just been newly added, versus
 the ones that had been there from before.  (On the other hand, if the
 filesystem was completely full, releasing uninitialized blocks wouldn't
 be the worst thing in the world to do, although releasing previously
 fallocated blocks probably does violate the principle of least
 surprise, even if it's what the user would have wanted.)

I tend to agree with this.  Having fallocate() fill up the filesystem
is exactly what the caller asked for.  A write() that hits ENOSPC doesn't
truncate off the whole write either, nor does dd delete the whole file
when the filesystem is full.

Even checking the statfs() space before doing the fallocate() may be
counterintuitive, since it will return ENOSPC but the filesystem will
not actually be full.  Some applications (e.g. database) may WANT to
fill the filesystem and then get the actual file size back to avoid
trusting statfs() because of metadata overhead (e.g. indirect blocks).

One of the design goals for sys_fallocate() was to allow FA_DELALLOC
to deallocate unwritten extents in a safe manner.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 2/2] file capabilities: accomodate 32 bit capabilities

2007-05-08 Thread Andreas Dilger

On May 08, 2007  14:17 -0500, Serge E. Hallyn wrote:
 As the capability set changes and distributions start tagging
 binaries with capabilities, we would like for running an older
 kernel to not necessarily make those binaries unusable.
 
   (0. Enable the CONFIG_SECURITY_FS_CAPABILITIES option
  when CONFIG_SECURITY=n.)
   (1. Rename CONFIG_SECURITY_FS_CAPABILITIES to
  CONFIG_SECURITY_FILE_CAPABILITIES)
   2. Introduce CONFIG_SECURITY_FILE_CAPABILITIES_STRICTXATTR
  which, when set, prevents loading binaries with capabilities
  set which the kernel doesn't know about.  When not set,
  such capabilities run, ignoring the unknown caps.
   3. To accommodate 64-bit caps, specify that capabilities are
  stored as
   u32 version; u32 eff0; u32 perm0; u32 inh0;
   u32 eff1; u32 perm1; u32 inh1; (etc)

Have you considered how such capabilities will be used in the future?
One of the important use cases I can see today is the ability to
split the heavily-overloaded e.g. CAP_SYS_ADMIN into much more fine
grained attributes.

What we definitely do NOT want to happen is an application that needs
privileged access (e.g. e2fsck, mount) to stop running because the
new capabilities _would_ have been granted by the new kernel and are
not by the old kernel and STRICTXATTR is used.

To me it would seem that having extra capabilities on an old kernel
is relatively harmless if the old kernel doesn't know what they are.
It's like having a key to a door that you don't know where it is.
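
The difference between the two policies can be sketched as follows.  This is
only an illustration under simplifying assumptions: a single 32-bit capability
word, and a made-up KERNEL_KNOWN_CAPS mask and function name rather than
anything from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: bitmask of the capability bits this kernel knows about. */
#define KERNEL_KNOWN_CAPS 0x7fffffffu

/*
 * Decide which file capabilities to grant.
 * strict == true:  refuse the binary if it carries unknown bits.
 * strict == false: drop the unknown bits and grant the rest.
 * Returns false if the binary must not be run.
 */
static bool file_caps_to_grant(uint32_t file_caps, bool strict,
			       uint32_t *granted)
{
	uint32_t unknown = file_caps & ~KERNEL_KNOWN_CAPS;

	assert(granted != NULL);
	if (strict && unknown)
		return false;
	*granted = file_caps & KERNEL_KNOWN_CAPS;
	return true;
}
```

With strict checking a binary tagged for a newer kernel stops running on the
old one; without it the unknown bits are just keys to doors that don't exist.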

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger

On May 03, 2007  21:31 -0700, Andrew Morton wrote:
 On Thu, 26 Apr 2007 23:43:32 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
  + * ext4_fallocate:
  + * preallocate space for a file
  + * mode is for future use, e.g. for unallocating preallocated blocks etc.
  + */
 
 This description is rather thin.  What is the filesystem's actual behaviour
 here?  If the file is using extents then the implementation will do
 something.  If the file is using bitmaps then we will do something else.
 
 But what?   Here is where it should be described.

My understanding is that glibc will handle zero-filling of files for
filesystems that do not support fallocate().

  +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
  +{
  +   handle_t *handle;
  +   ext4_fsblk_t block, max_blocks;
  +   int ret, ret2, nblocks = 0, retries = 0;
  +   struct buffer_head map_bh;
  +   unsigned int credits, blkbits = inode->i_blkbits;
  +
  +   /* Currently supporting (pre)allocate mode _only_ */
  +   if (mode != FA_ALLOCATE)
  +   return -EOPNOTSUPP;
  +
  +   if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
  +   return -ENOTTY;
 
 So we don't implement fallocate on bitmap-based files!  Well that's huge
 news.  The changelog would be an appropriate place to communicate this,
 along with reasons why, or a description of the plan to fix it.
 
 Also, posix says nothing about fallocate() returning ENOTTY.

I _think_ this is to convince glibc to do the zero-filling in userspace,
but I'm not up on the API specifics.

  +   block = offset >> blkbits;
  +   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits) >> blkbits)
  +           - block;
  +   mutex_lock(&EXT4_I(inode)->truncate_mutex);
  +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
  +   mutex_unlock(&EXT4_I(inode)->truncate_mutex);
 
 Now I'm mystified.  Given that we're allocating an arbitrary amount of disk
 space, and that this disk space will require an arbitrary amount of
 metadata, how can we work out how much journal space we'll be needing
 without at least looking at `len'?

Good question.

The uninitialized extent can cover up to 128MB with a single entry.
If @path isn't specified, then ext4_ext_calc_credits_for_insert()
function returns the maximum number of extents needed to insert a leaf,
including splitting all of the index blocks.  That would allow up to 43GB
(340 extents/block * 128MB) to be preallocated, but it still needs to take
the size of the preallocation into account (adding 3 blocks per 43GB - a
leaf block, a bitmap block and a group descriptor).
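
The numbers above can be reproduced with a short calculation.  This is a
sketch assuming 4KB blocks, a 12-byte extent header, 12-byte extent entries
and at most 32768 blocks per extent; extra_credits() is one reading of the
"3 blocks per 43GB" suggestion, not code from the patch:

```c
#include <assert.h>
#include <stdint.h>

#define EXT_HDR_SIZE     12u	/* assumed size of the extent header */
#define EXT_ENTRY_SIZE   12u	/* assumed size of one extent entry */
#define EXT_MAX_LEN_BLKS 32768u	/* assumed max blocks per extent */

/* Number of extent entries that fit in one leaf block. */
static uint32_t extents_per_block(uint32_t blocksize)
{
	assert(blocksize > EXT_HDR_SIZE);
	return (blocksize - EXT_HDR_SIZE) / EXT_ENTRY_SIZE;
}

/* Bytes that one full leaf block of maximal extents can map. */
static uint64_t bytes_per_leaf(uint32_t blocksize)
{
	uint64_t extent_bytes = (uint64_t)EXT_MAX_LEN_BLKS * blocksize;

	return extents_per_block(blocksize) * extent_bytes;
}

/* 3 extra blocks (leaf, bitmap, group descriptor) per leaf's worth of len. */
static uint64_t extra_credits(uint64_t len, uint32_t blocksize)
{
	uint64_t per_leaf = bytes_per_leaf(blocksize);

	return 3 * ((len + per_leaf - 1) / per_leaf);
}
```

For 4KB blocks this gives 340 extents per leaf and 340 * 128MB = 43520MB
(42.5GB) per leaf, which is the "43GB" quoted above.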

Also, since @path is not being given then truncate_mutex is not needed.

  +   ret = ext4_ext_get_blocks(handle, inode, block,
  + max_blocks, &map_bh,
  + EXT4_CREATE_UNINITIALIZED_EXT, 0);
  +   BUG_ON(!ret);
 
 BUG_ON is vicious.  Is it really justified here?  Possibly a WARN_ON and
 ext4_error() would be safer and more useful here.

Ouch, not very friendly error handling.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger

On May 07, 2007  13:58 -0700, Andrew Morton wrote:
 Final point: it's fairly disappointing that the present implementation is
 ext4-only, and extent-only.  I do think we should be aiming at an ext4
 bitmap-based implementation and an ext3 implementation.

Actually, this is a non-issue.  The reason that it is handled for extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error (e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in userspace.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Andreas Dilger

On May 07, 2007  19:02 -0400, Jeff Garzik wrote:
 Andreas Dilger wrote:
 Actually, this is a non-issue.  The reason that it is handled for 
 extent-only is that this is the only way to allocate space in the
 filesystem without doing the explicit zeroing.
 
 Precisely /how/ do you avoid the zeroing issue, for extents?
 
 If I posix_fallocate() 20GB on ext4, it damn well better be zeroed, 
 otherwise the implementation is broken.

In ext4 (as in XFS) there is a flag stored in the extent that tells if
the extent is initialized or not.  Reads from uninitialized extents will
return zero-filled data, and writes that don't span the whole extent
will cause the uninitialized extent to be split into a regular extent
and one or two uninitialized extents (depending where the write is).

My comment was just that the extent doesn't have to be explicitly zero
filled on the disk, by virtue of the fact that the uninitialized flag
will cause reads to return zero.
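
As a rough model of the split described above (not the ext4 code itself), the
sketch below uses a simplified, hypothetical in-memory extent with just a
start, a length and the uninitialized flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical extent: a run of logical blocks plus the flag. */
struct extent {
	uint64_t start;		/* first logical block */
	uint64_t len;		/* number of blocks */
	bool uninit;		/* true: reads return zeroes */
};

/*
 * Split an uninitialized extent for a write covering [wstart, wstart + wlen).
 * Stores up to three extents in out[] and returns how many were produced.
 */
static size_t split_for_write(struct extent ex, uint64_t wstart, uint64_t wlen,
			      struct extent out[3])
{
	size_t n = 0;
	uint64_t ex_end = ex.start + ex.len, wend = wstart + wlen;

	assert(ex.uninit && wlen > 0 && wstart >= ex.start && wend <= ex_end);

	if (wstart > ex.start)	/* untouched head stays uninitialized */
		out[n++] = (struct extent){ ex.start, wstart - ex.start, true };
	out[n++] = (struct extent){ wstart, wlen, false };	/* written part */
	if (wend < ex_end)	/* untouched tail stays uninitialized */
		out[n++] = (struct extent){ wend, ex_end - wend, true };
	return n;
}
```

A write in the middle yields three extents, a write at either edge yields two,
and a write spanning the whole extent just clears the flag.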

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-03 Thread Andreas Dilger

On May 02, 2007  20:57 +1000, David Chinner wrote:
 On Wed, May 02, 2007 at 10:36:12AM +0100, Anton Altaparmakov wrote:
  HSM_READ is definitely _NOT_ required because all  
  it means is if the file is OFFLINE, bring it ONLINE and then return  
  the extent map.
 
 You've got the definition of HSM_READ wrong. If the flag is *not*
 set, then we bring everything back online and return the full extent
 map.
 
 Specifying the flag indicates that we do *not* want the offline
 extents brought back online.  i.e. it is a HSM or a datamover
 (e.g. backup program) that is querying the extents and we want to
 know *exactly* what the current state of the file is right now.
 
 So, if the HSM_READ flag is set, then the application is
 expecting the filesystem to be part of a HSM. Hence if it's not,
 it should return an error because somebody has done something wrong.

In my original proposal I specifically pointed out that the
FIEMAP_FLAG_HSM_READ has the OPPOSITE behaviour to the XFS_IOC_GETBMAPX
BMV_IF_NO_DMAPI_READ flag.  Data is retrieved from HSM only if the
HSM_READ flag is set.  That's why the flag is called HSM_READ instead
of HSM_NO_READ.

The reason is that it seems bad if the default behaviour for calling
ioctl(FIEMAP) would be to force retrieval of data from HSM, and this is
only disabled by specifying a flag.  It makes a lot more sense to just
leave the data as it is and return the extent mapping by default (i.e.
this is the principle of least surprise).  It would probably be equally
surprising and undesirable if the default behaviour was to force all
data out to HSM.



For that matter, I'm also beginning to wonder if the FLAG_HSM_READ should
even be a part of this interface?  I have no problem with returning a
flag that reports if the data is migrated to HSM and whether it is UNMAPPED.

Having FIEMAP force the retrieval of data from HSM strikes me as something
that should be a part of a separate HSM interface, which also needs to be
able to do things like push specific files or parts thereof out to HSM,
set the aging policy, and return information like where the HSM
file lives and how many copies there are.

Do you know the reasoning behind including this into XFS_IOC_GETBMAPX?
Looking at the bmap.c comments it appears it is simply because the API
isn't able to return something like UNMAPPED|HSM_RESIDENT to indicate
there is data in HSM but it has no blocks allocated in the filesystem.

I don't think it makes the operation significantly more efficient than
say ioctl(DMAPI_FORCE_READ); ioctl(FIEMAP) if an application actually
needs the data to be present instead of just returning mapping info that
includes UNMAPPED.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger

On Apr 30, 2007  08:09 -0400, Theodore Tso wrote:
 On Fri, Apr 27, 2007 at 12:09:42PM -0600, Andreas Dilger wrote:
  I'd prefer that such functionality be integrated with Takashi's online
  defrag tool, since it needs virtually the same functionality.  For that
  matter, this is also very similar to the block-mapped -> extents tool
  from Aneesh.  It doesn't make sense to have so many separate tools for
  users, especially if they start interfering with each other (i.e. defrag
  undoes the remapping done by your tool).
 
 While we're at it, someone want to start thinking about on-line
 shrinking of ext4 filesystems?  Again, the same block remapping
 interfaces for defrag and file access optimizations should also be
 useful for shrinking filesystems (even if some of the files that need
 to be relocated are being actively used).  If not, that probably means
 we got the interface wrong.

Except one other issue with online shrinking is that we need to move
inodes on occasion and this poses a bunch of other problems over just
remapping the data blocks.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Ext2/3 block remapping tool

2007-05-01 Thread Andreas Dilger

On May 01, 2007  11:28 -0400, Theodore Tso wrote:
 On Tue, May 01, 2007 at 12:01:42AM -0600, Andreas Dilger wrote:
  Except one other issue with online shrinking is that we need to move
  inodes on occasion and this poses a bunch of other problems over just
  remapping the data blocks.
 
 Well, I did say necessary, and not sufficient.  But yes, moving
 inodes, especially if the inode is currently open gets interesting.  I
 don't think there are that many user space applications that would
 notice or care if the st_ino of an open file changed out from under
 them, but there are obviously userspace applications, such as tar,
 that would most definitely care.

I think rm -r does a LOT of this kind of operation, like:

stat(.); stat(foo); chdir(foo); stat(.); unlink(*); chdir(..); stat(.)

I think find does the same to avoid security problems with malicious
path manipulation.
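
The check behind that pattern is a comparison of (st_dev, st_ino) pairs taken
before descending and after coming back.  A minimal sketch of the idea, not
the actual rm or find code (same_file() and visit_and_return() are names made
up for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Two stat results name the same file iff device and inode number match. */
static bool same_file(const struct stat *a, const struct stat *b)
{
	assert(a && b);
	return a->st_dev == b->st_dev && a->st_ino == b->st_ino;
}

/* Descend into dir and come back, verifying ".." led back to the start. */
static int visit_and_return(const char *dir)
{
	struct stat before, after;

	if (stat(".", &before) != 0 || chdir(dir) != 0)
		return -1;
	/* ... work inside dir would go here ... */
	if (chdir("..") != 0 || stat(".", &after) != 0)
		return -1;
	return same_file(&before, &after) ? 0 : -1;
}
```

If an online shrink renumbered the directory's inode between the two stat()
calls, a check like this would fail even though nothing malicious happened.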

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger

On May 01, 2007  14:22 +1000, David Chinner wrote:
 On Mon, Apr 30, 2007 at 04:44:01PM -0600, Andreas Dilger wrote:
  Hmm, I'd thought offline would migrate to EXTENT_UNKNOWN, but I didn't
 
 I disagree - why would you want to indicate the state is unknown when we know
 very well that it is offline?

If you don't like UNKNOWN, what about UNMAPPED?  I just want a
catch-all flag that indicates this extent contains data but there is
nothing sensible to be returned for the extent mapping.

 Effectively, when your extent is offline in the HSM, it is inaccessible, and
 you have to bring it back from tape so it becomes accessible again. i.e. some
 action is necessary on behalf of the user to make it accessible. So I think
 that OFFLINE is a good name for this state because it really is inaccessible.

What you are calling OFFLINE I would prefer to call UNMAPPED, since that
can be used by applications as a catch-all for "no mapping".  There can
be further flags that give refinements to UNMAPPED that some applications
might care about (e.g. HSM_RESIDENT), but many users/apps will not
if they just want the number of fragments in a given file.

 Also, I don't think secondary is a good term because most large systems
 have more than one tier of storage. One possibility is HSM_RESIDENT
 which indicates the extent is current and resident with a HSM's archive

Sure.

  Can you propose reasonable flag names for these (I can't think of anything
  very good) and a clear explanation of what they mean.  I suspect it will
  only be XFS that uses them initially.  In mke2fs and ext4+mballoc there is
  the concept of stripe unit and stripe width, but as yet they are not
  communicated between the two very well.  I'd be much happier if this info
  could be queried in a standard way from the block layer instead of the
  user having to specify it and the filesystem having to track it.
 
 My preference is definitely for a separate ioctl to grab the
 filesystem geometry so this stuff can be calculated in userspace.
 i.e. the way XFS does it right now (XFS_IOC_FSGEOMETRY). I won't
 bother trying to define names until we decide which approach we take
 to implement this.

Hmm, previously you wrote "This information could be easily passed up in the
flags fields if the filesystem has geometry information".  So, I _think_
what you are saying is that you want 4 flags to convey this start/end
alignment information, but the exact semantics of what a stripe unit and
a stripe width is filesystem specific?

I definitely do NOT want to get into any issues of querying the block
device geometry here.  I was just making a passing comment that ext4+mballoc
can already do RAID-specific allocation alignment, but it depends on the
admin to specify this information and it would be nice if there was some
easy way to get this from userspace/kernel interfaces.

Having an API that can request "tell me the number of blocks from this
offset until the next physical disk boundary" or similar would be useful
to any allocator, and the block layer already needs to know this when
submitting IO.

 In XFS, mkfs.xfs does the work of getting this information
 to see in the filesystem superblock. Here's the code for getting
 sunit/swidth from the underlying block device:
 
 http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libdisk/
 
 Not much in common there ;)

It looks like this might be just what e2fsprogs needs also.

  It does make sense to specify zero for the fm_extent_count array and a
  new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
  extent data itself, for the non-verbose mode of filefrag, and for
  pre-allocating a buffer large enough to hold the file if that is important.
 
 Rather than rely on implicit behaviour of pass in extent count of
 zero and a don't try to return any extents to return the number of
 extents on the file, why not just explicitly define this as a valid
 input flag? i.e. FIEMAP_FLAG_GET_NUMEXTENTS

That's what I said, isn't it?  FIEMAP_FLAG_NO_EXTENTS.  I wonder if my
clever-clever for "return no extents" and "return number of extents"
is wasted :-/.

  - does XFS return an extent for the metadata parts of the file (e.g. btree)?
 
 No, but we can return the extent map for the attribute fork (i.e.
 extended attrs) if asked for (XFS_IOC_GETBMAPA).

This seems like it would be a useful addition to the interface also, having
FIEMAP_FLAG_METADATA request the return of metadata allocations too.

  - does XFS return preallocated extents beyond EOF?
 
 Yes - they are part of the extent map for the file.

OK.

  - does XFS allow non-root users to call xfs_bmap on files they don't own, or
use by non-root users at all?
 
 Users can run xfs_bmap on any file they have permission to
 open(O_RDONLY).
 
The FIBMAP ioctl is for privileged users
only, and I wonder if FIEMAP should be the same, or at least disallow
mapping files that the user can't access especially with FLAG_SYNC and/or
FLAG_HSM_READ

Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-05-01 Thread Andreas Dilger

On May 02, 2007  00:20 +1000, David Chinner wrote:
 My point was that there is a difference between specification and
 implementation - if the specification says something is compulsory,
 then they must be implemented in the filesystem. This is easy
 enough to ensure by code review - we don't need additional interface
 complexity for this

What you seem to be missing about my proposal is that the FLAG_INCOMPAT
is for future use by that part of the specification we haven't thought
of yet...  Having COMPAT/INCOMPAT flags has been very useful for ext2/3/4,
and is much better than having version numbers for the interface.
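
The ext2/3/4-style rule is that unknown COMPAT bits may be ignored while
unknown INCOMPAT bits must cause a refusal.  A small sketch of that rule
applied to an ioctl flags word; the bit layout and SUPPORTED_FLAGS value are
invented for illustration and are not the FIEMAP proposal's actual values:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low bits may be ignored, high bits must be understood. */
#define FLAG_COMPAT_MASK   0x0000ffffu
#define FLAG_INCOMPAT_MASK 0xffff0000u

/* Flags this (hypothetical) implementation understands. */
#define SUPPORTED_FLAGS    0x00010003u

/* Returns 0 if the request can be served, -1 on an unknown INCOMPAT flag. */
static int check_flags(uint32_t flags)
{
	uint32_t unknown = flags & ~SUPPORTED_FLAGS;

	assert((FLAG_COMPAT_MASK & FLAG_INCOMPAT_MASK) == 0);
	/* Unknown COMPAT bits are ignored; unknown INCOMPAT bits are fatal. */
	return (unknown & FLAG_INCOMPAT_MASK) ? -1 : 0;
}
```

An old implementation then rejects only requests it cannot serve correctly,
without needing an interface version number.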

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-30 Thread Andreas Dilger

 are in the file
   is by adding up the length of each extent. What happens if the file
   is changing underneath you e.g. someone punches out a hole
   in the file, or truncates and extends it again between ioctl()
   calls?
  
  Well, that is always true with data once it is out of the caller.
 
 Sure, but this interface requires iterative calls where the n+1
 call is reliant on nothing changing since the first call to be
 accurate. My question is how do you use this interface to reliably
 and accurately get all the extents if you are using iterative summing
 like this?

Maybe it wasn't clear, but the semantics of the ioctl are that it will
return the first extent that contains the requested byte offset in fm_start.
If the file has changed since the last call to FIEMAP then it will restart
with the extent that covers this byte and continue on.  In most cases the
file mapping should be returnable in a single ioctl (assuming a reasonable
extent count).
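
To make those semantics concrete, here is a sketch over a sorted in-memory
extent list.  The struct and function names are hypothetical and much simpler
than the proposed ioctl structures; the point is only the lookup rule and how
a caller with a small buffer would restart:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, hypothetical extent record: byte offset and length in the file. */
struct ext {
	uint64_t start, len;
};

/*
 * Copy to out[] the extents overlapping [fm_start, fm_start + fm_len),
 * beginning with the one that contains fm_start.  map[] must be sorted and
 * non-overlapping.  Returns the number of extents copied (at most max).
 */
static size_t map_range(const struct ext *map, size_t n,
			uint64_t fm_start, uint64_t fm_len,
			struct ext *out, size_t max)
{
	size_t i, count = 0;
	uint64_t end = fm_start + fm_len;

	assert(map && out);
	for (i = 0; i < n && count < max; i++) {
		if (map[i].start + map[i].len <= fm_start)
			continue;	/* entirely before the request */
		if (map[i].start >= end)
			break;		/* entirely after the request */
		out[count++] = map[i];	/* returned whole, not truncated */
	}
	return count;
}

/* Offset at which a caller with a short buffer restarts the next call. */
static uint64_t next_start(const struct ext *last)
{
	return last->start + last->len;
}
```

Because each call is keyed on a byte offset rather than an extent index, a
file that changes between calls just gets mapped from that byte onward.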

   Also, what happens if you ask for an offset/len that doesn't map to
   any extent boundaries - are you truncating the extents returned to
   the off/len passed in?
  
  The request offset will be returned as the start of the actual extent that
  it falls inside.  And the returned extents will end with the extent that
  ends at or after the requested fm_start + fm_len.
 
 Ok, so you round the start inwards and round the end outwards. Can
 you ensure that this is documented in the header file that describes
 this interface?

Sure.

   xfs_bmap gets around this by finding out how many extents there are in the
   file and allocating a buffer that big to hold all the extents so they
   are gathered in a single atomic call (think sparse matrix files)
  
  Yeah, except this might be persistent for a long time if it isn't fully
  read with a single ioctl and the app never continues reading but doesn't
  close the fd.
 
 Not sure I follow you here...

Ah, I was thinking that XFS was keeping a copy of the whole extent
mapping in the kernel to handle getting the data with separate calls.
It does make sense to specify zero for the fm_extent_count array and a
new FIEMAP_FLAG_NO_EXTENTS to return only the count of extents and not the
extent data itself, for the non-verbose mode of filefrag, and for
pre-allocating a buffer large enough to hold the file if that is important.

I'm also going to add a FIEMAP_FLAG_LAST to mark the last extent in the file,
so that iterators using a small buffer don't need to retry to get the last
extent, and it is possible in case of e.g. EINTR (or whatever) to return a
short list without signalling EOF.  I think this is cleaner than returning
a HOLE extent from EOF to ~0ULL.

Another question about semantics -
- does XFS return an extent for the metadata parts of the file (e.g. btree)?
- does XFS return preallocated extents beyond EOF?
- does XFS allow non-root users to call xfs_bmap on files they don't own, or
  use by non-root users at all?  The FIBMAP ioctl is for privileged users
  only, and I wonder if FIEMAP should be the same, or at least disallow
  mapping files that the user can't access especially with FLAG_SYNC and/or
  FLAG_HSM_READ.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Ext2/3 block remapping tool

2007-04-27 Thread Andreas Dilger

On Apr 26, 2007  21:29 +0200, Jan Kara wrote:
   I've been lately playing with remapping ext2/ext3 blocks (especially how
 much it can give us in terms of speed of things like KDE start). For that
 I've written two simple tools (you can get them from
 ftp.suse.com/pub/people/jack/ext3remapper.tar.gz):
   e2block2file to transform (preparsed) output from blktrace into a list
 of accessed files and offsets accessed
   e2remapblocks to use output from e2block2file and remap blocks into big
 chunks in the order in which they were accessed.

Does it map the whole file contiguously, or does it interleave blocks of the
file in the order they are accessed?  I would hope that it maps the whole
file contiguously, and let readahead work properly to fetch the whole file.
Also, keeping the file contiguous avoids fragmentation later if that file is
updated, deleted, etc, and conflicts with allocator/defrag/etc.

   (see README in the tools archive for more details)
 
   So far the tools (especially e2remapblocks ;) work on unmounted
 filesystem. The ultimate goal is to be able to do similar things for
 mounted filesystems but I wanted to see whether block remapping is worth it
 and what kernel interfaces would be useful for achieving the goal.

I'd prefer that such functionality be integrated with Takashi's online
defrag tool, since it needs virtually the same functionality.  For that
matter, this is also very similar to the block-mapped -> extents tool
from Aneesh.  It doesn't make sense to have so many separate tools for
users, especially if they start interfering with each other (i.e. defrag
undoes the remapping done by your tool).

   BTW, the results for KDE startup are as follows:
 The root partition was about 4.8 GB with around 1 GB free. System has
 1GB mem. All measurements (except for warmcache) were performed after
   sync; echo 3 > /proc/sys/vm/drop_caches
 
 Ordinary start: 19.2 20.3 19.5 19.8 19.3; avg. 19.62
 Start with all data cached: 7 7.6 7.3 7.1 7.1; avg. 7.22
 Start with fcache (see thread http://lkml.org/lkml/2006/5/15/46 for details
 on fcache):
   11.3 11 10.3 10.8 10.6; avg. 10.8
 Start with blocks remapped with e2remapblocks:
   13.5 15 13 14.5 14.5; avg. 14.1
 (after remapping, data was stored in 20 contiguous extents on disk)



Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: ChunkFS - measuring cross-chunk references

2007-04-23 Thread Andreas Dilger

On Apr 23, 2007  15:04 +0530, Kalpak Shah wrote:
 On Mon, 2007-04-23 at 12:49 +0530, Karuna sagar K wrote:
  The tool estimates the cross-chunk references from an ext2/3 file
  system. It considers a block group as one chunk and calculates how many
  block groups a file spans. So, the block group size gives
  the estimate of chunk size.
  
  The file systems were aged for about 3-4 months on a developer's laptop.
 
 With a blocksize of 4KB, a block group would be 128 MB. In the original
 Chunkfs paper, Valh had mentioned 1GB chunks and I believe it will be
 possible to use 2GB, 4GB or 8GB chunks in the future. As the chunk size
 increases the number of cross-chunk references will reduce and hence it
 might be a good idea to present these statistics considering different
 chunk sizes starting from 512MB up to 2GB.

Also, given that cross-chunk references will be more expensive to fix, I
can imagine the allocation policy for chunkfs will try to avoid this if
possible, further reducing the number of cross-chunk inodes.  I guess it
should be more clear whether the cross-chunk references are due to inode
block references, or because of e.g. directories referencing inodes in
another chunk.

Also, is it considered a cross-chunk reference if a directory entry is
referencing an inode in another group?  Should there be a continuation
inode in the local group, or is the directory entry itself enough?

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger

On Apr 16, 2007  18:01 +1000, Timothy Shimmin wrote:
 --On 12 April 2007 5:05:50 AM -0600 Andreas Dilger [EMAIL PROTECTED] 
 wrote:
 struct fiemap_extent {
         __u64 fe_start; /* starting offset in bytes */
         __u64 fe_len;   /* length in bytes */
 }
 
 struct fiemap {
         struct fiemap_extent fm_start;  /* offset, length of desired mapping */
         __u32 fm_extent_count;          /* number of extents in array */
         __u32 fm_flags;                 /* flags (similar to XFS_IOC_GETBMAP) */
         __u64 unused;
         struct fiemap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK         0xff
 #define FIEMAP_LEN_HOLE         0x01
 #define FIEMAP_LEN_UNWRITTEN    0x02
 
 All offsets are in bytes to allow cases where filesystems are not doing
 block-aligned/sized allocations (e.g. tail packing).  The fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 The ->fm_extents[] array includes all of the holes in addition to
 allocated extents because this avoids the need to return both the logical
 and physical address for every extent and does not make processing any
 harder.
 
 Well, that's what stood out for me. I was wondering where the fe_block 
 field had gone - the physical address.
 So is your fe_start; /* starting offset */ actually the disk location
 (not a logical file offset)
 _except_ in the header (fiemap) where it is the desired logical offset.

Correct.  The fm_start in the request contains the logical start offset
and length in bytes of the requested fiemap region.  In the returned header
it represents the logical start offset of the extent that contained the
requested start offset, and the logical length of all the returned extents.
I haven't decided whether the returned length should be until EOF, or have
the virtual hole at the end of the file.  I think EOF makes more sense.

The fe_start + fe_len in the fm_extents represent the physical location on
the block device for that extent.  fm_extents[i].fe_start (per Anton) is
undefined if FIEMAP_LEN_HOLE is set, and .fe_len is the length of the hole.

 Okay, looking at your example use below that's what it looks like.
 And when you refer to fm_start below, you mean fm_start.fe_start?
 Sorry, I realise this is just an approximation but this part confused me.

Right, I'll write up a new RFC based on feedback here, and correcting the
various errors in the original proposal.

 So you get rid of all the logical file offsets in the extents because we
 report holes explicitly (and we know everything is contiguous if you
 include the holes).

Correct.  It saves space in the common case.
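
To illustrate why the logical offsets can be dropped: because the returned
list is packed and includes the holes, the logical offset of any entry is
simply the requested start plus the sum of the lengths before it.  A minimal
sketch in C, assuming the flag bits have already been masked out of the
length; struct packed_extent and logical_offset_of() are hypothetical names
used only for illustration and are not part of the proposal:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one returned extent (not the proposed ABI). */
struct packed_extent {
    uint64_t start; /* physical byte offset, undefined for a hole */
    uint64_t len;   /* length in bytes, flag bits already masked off */
    int hole;       /* nonzero if this entry describes a hole */
};

/*
 * Return the logical file offset at which entry idx begins.  base is the
 * logical offset of the first returned entry.  Since holes are listed
 * explicitly, the entries are logically contiguous and the offset is just
 * a running sum of the preceding lengths.
 */
uint64_t logical_offset_of(const struct packed_extent *ext, size_t idx,
                           uint64_t base)
{
    uint64_t off = base;
    size_t i;

    for (i = 0; i < idx; i++)
        off += ext[i].len;
    return off;
}
```

For a list holding a 4096-byte extent, an 8192-byte hole and another
4096-byte extent, the third entry starts at logical offset 12288 even though
no logical offset was stored in any entry.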

 Caller works something like:
 
     char buf[4096];
     struct fiemap *fm = (struct fiemap *)buf;
     int count = (sizeof(buf) - sizeof(*fm)) / sizeof(struct fiemap_extent);
     int fd, rc, i;

     fm->fm_start.fe_start = 0;   /* start of file */
     fm->fm_start.fe_len = -1;    /* end of file */
     fm->fm_extent_count = count; /* max extents in fm_extents[] array */
     fm->fm_flags = 0;            /* maybe "no DMAPI", etc like XFS */

     fd = open(path, O_RDONLY);
     printf("logical\t\tphysical\t\tbytes\n");

     /* The last entry will have fewer extents than the maximum */
     while (fm->fm_extent_count == count) {
         rc = ioctl(fd, FIEMAP, fm);
         if (rc)
             break;

         /* kernel filled in fm_extents[] array, set fm_extent_count
          * to be actual number of extents returned, leaves
          * fm_start.fe_start alone (unlike XFS_IOC_GETBMAP). */

         for (i = 0; i < fm->fm_extent_count; i++) {
             /* the flag bits live in the top of fe_len, strip them */
             __u64 len = fm->fm_extents[i].fe_len & ~FIEMAP_LEN_MASK;
             __u64 fm_next = fm->fm_start.fe_start + len;
             int hole = (fm->fm_extents[i].fe_len &
                         FIEMAP_LEN_HOLE) != 0;
             int unwr = (fm->fm_extents[i].fe_len &
                         FIEMAP_LEN_UNWRITTEN) != 0;

             printf("%llu-%llu\t%llu-%llu\t%llu\t%s%s\n",
                    fm->fm_start.fe_start, fm_next - 1,
                    hole ? 0 : fm->fm_extents[i].fe_start,
                    hole ? 0 : fm->fm_extents[i].fe_start + len - 1,
                    len, hole ? "(hole) " : "",
                    unwr ? "(unwritten) " : "");

             /* get ready for printing next extent, or next ioctl */
             fm->fm_start.fe_start = fm_next;
         }
     }
 
 

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-18 Thread Andreas Dilger

On Apr 16, 2007  21:22 +1000, David Chinner wrote:
 On Thu, Apr 12, 2007 at 05:05:50AM -0600, Andreas Dilger wrote:
  struct fiemap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
  }
  
  struct fiemap {
  struct fiemap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags (similar to XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fiemap_extent fm_extents[0];
  }
  
  #define FIEMAP_LEN_MASK         0xff
  #define FIEMAP_LEN_HOLE         0x01
  #define FIEMAP_LEN_UNWRITTEN    0x02
 
 I'm not sure I like stealing bits from the length to use as flags -
 I'd prefer an explicit field per fiemap_extent for this.

Christoph expressed the same concern.  I'm not dead set against having an
extra 8 bytes per extent (32-bit flags, 32-bit reserved), though it may
mean the need for 50% more ioctls if the file is large.
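
The "50% more ioctls" figure follows from how many extent records fit in a
fixed-size buffer.  A small sketch of that arithmetic, assuming a 4096-byte
buffer and a 32-byte request header (both assumptions made for illustration);
extents_per_buffer() is a hypothetical helper, not part of any proposed API:

```c
#include <stddef.h>

/*
 * Number of extent records that fit in a buffer of buf_size bytes after a
 * fixed header of header_size bytes.  Returns 0 for nonsensical input.
 */
size_t extents_per_buffer(size_t buf_size, size_t header_size,
                          size_t extent_size)
{
    if (extent_size == 0 || buf_size < header_size)
        return 0;
    return (buf_size - header_size) / extent_size;
}
```

With 16-byte extents a 4096-byte buffer holds (4096 - 32) / 16 = 254 records,
while 24-byte extents give (4096 - 32) / 24 = 169 records.  Mapping the same
number of extents therefore takes roughly 254 / 169, or about 1.5 times as
many calls.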


Below is an aggregation of the comments in this thread:

struct fiemap_extent {
        __u64 fe_start; /* starting offset in bytes */
        __u64 fe_len;   /* length in bytes */
        __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */
        __u32 fe_lun;   /* logical storage device number in array */
}

struct fiemap {
        __u64 fm_start;         /* logical start offset of mapping (in/out) */
        __u64 fm_len;           /* logical length of mapping (in/out) */
        __u32 fm_flags;         /* FIEMAP_FLAG_* flags for request (in/out) */
        __u32 fm_extent_count;  /* number of extents in fm_extents (in/out) */
        __u64 fm_unused;
        struct fiemap_extent fm_extents[0];
}

/* flags for the fiemap request */
#define FIEMAP_FLAG_SYNC        0x0001  /* flush delalloc data to disk */
#define FIEMAP_FLAG_HSM_READ    0x0002  /* retrieve data from HSM */
#define FIEMAP_FLAG_INCOMPAT    0xff00  /* must understand these flags */

/* flags for the returned extents */
#define FIEMAP_EXTENT_HOLE      0x0001  /* no space allocated */
#define FIEMAP_EXTENT_UNWRITTEN 0x0002  /* uninitialized space */
#define FIEMAP_EXTENT_UNKNOWN   0x0004  /* in use, location unknown */
#define FIEMAP_EXTENT_ERROR     0x0008  /* error mapping space */
#define FIEMAP_EXTENT_NO_DIRECT 0x0010  /* no direct data access */



SUMMARY OF CHANGES
==================
- use fm_* fields directly in request instead of making it a fiemap_extent
  (though they are laid out identically)

- separate flags word for fm_flags:
  - FIEMAP_FLAG_SYNC = range should be synced to disk before returning
    mapping, may return FIEMAP_EXTENT_UNKNOWN for delalloc writes otherwise
  - FIEMAP_FLAG_HSM_READ = force retrieval + mapping from HSM if specified
    (this has the opposite meaning of XFS's BMV_IF_NO_DMAPI_READ flag)
  - FIEMAP_FLAG_XATTR = omitted for now, can address that in the future
    if there is agreement on whether that is desirable to have or if it is
    better to call ioctl(FIEMAP) on an XATTR fd.
  - FIEMAP_FLAG_INCOMPAT = if flags are set in this mask in request, kernel
    must understand them, or fail ioctl with e.g. EOPNOTSUPP, so that we
    don't request e.g. FIEMAP_FLAG_XATTR and kernel ignores it

- __u64 fm_unused does not take up any extra space on all power-of-two buffer
  sizes (would otherwise be at end of buffer), and may be handy in the future.

- add separate fe_flags word with flags from various suggestions:
  - FIEMAP_EXTENT_HOLE = extent has no space allocation
  - FIEMAP_EXTENT_UNWRITTEN = extent has space allocated but contains no data
  - FIEMAP_EXTENT_UNKNOWN = extent contains data, but location is unknown
    (e.g. HSM, delalloc awaiting sync, etc)
  - FIEMAP_EXTENT_ERROR = error mapping extent.  Should fe_lun == errno?
  - FIEMAP_EXTENT_NO_DIRECT = data cannot be directly accessed (e.g. data
    encrypted, compressed, etc), may want separate flags for these?

- add new fe_lun word per extent for filesystems that manage multiple devices
  (e.g. OCFS, GFS, ZFS, Lustre).  This would otherwise have been unused.
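
The FIEMAP_FLAG_INCOMPAT rule described in the list above could be checked
roughly as follows.  This is only a sketch under stated assumptions: the EX_*
names and check_request_flags() are hypothetical, and the mask value
0xff000000 is assumed here purely for illustration rather than taken from
the proposal.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical request flags; the values are assumptions for this sketch. */
#define EX_FLAG_SYNC       0x00000001u
#define EX_FLAG_HSM_READ   0x00000002u
#define EX_FLAG_INCOMPAT   0xff000000u /* bits the kernel must understand */

/* Flags that this imaginary implementation knows how to handle. */
#define EX_FLAGS_SUPPORTED (EX_FLAG_SYNC | EX_FLAG_HSM_READ)

/*
 * Reject a request that sets any flag inside the incompat mask which the
 * implementation does not understand.  Unknown flags outside the mask are
 * ignored, as an older kernel would ignore them.
 */
int check_request_flags(uint32_t flags)
{
    uint32_t unknown = flags & ~EX_FLAGS_SUPPORTED;

    if (unknown & EX_FLAG_INCOMPAT)
        return -EOPNOTSUPP;
    return 0;
}
```

The point of the split is that a caller asking for something like an xattr
mapping gets a clear failure from a kernel that predates the flag, instead of
silently receiving the data mapping.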


 Given that xfs_bmap uses extra information from the filesystem
 (geometry) to display extra (and frequently used) information
 about the alignment of extents. ie:
 
 chook 681% xfs_bmap -vv fred
 fred:
  EXT: FILE-OFFSET  BLOCK-RANGE  AG AG-OFFSET  TOTAL FLAGS
0: [0..151]:288444888..288445039  8 (1696536..1696687)   152 00010
  FLAG Values:
 01 Unwritten preallocated extent
 001000 Doesn't begin on stripe unit
 000100 Doesn't end   on stripe unit
 10 Doesn't begin on stripe width
 01 Doesn't end   on stripe width

Can you clarify the terminology here?  What is a stripe unit and what is
a stripe width?  Are there N * stripe_unit = stripe_width in e.g. a
RAID

[RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger

 to allow this interface to be mapped
onto XFS_IOC_BMAP internally (or vice versa).  Even for block-mapped
filesystems, they can at least improve over the ->bmap() case by skipping
holes in files that cover [dt]indirect blocks (saving thousands of calls).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [RFC] add FIEMAP ioctl to efficiently map file allocation

2007-04-12 Thread Andreas Dilger

On Apr 12, 2007  12:22 +0100, Anton Altaparmakov wrote:
 On 12 Apr 2007, at 12:05, Andreas Dilger wrote:
 I'm interested in getting input for implementing an ioctl to  
 efficiently map file extents & holes (FIEMAP) instead of looping
 over FIBMAP a billion times.  We already have customers with single
 files in the 10TB range and we additionally need to get the mapping
 over the network so it needs to be efficient in terms of how data
 is passed, and how easily it can be extracted from the filesystem.
 
 struct fibmap_extent {
  __u64 fe_start; /* starting offset in bytes */
  __u64 fe_len;   /* length in bytes */
 }
 
 struct fibmap {
  struct fibmap_extent fm_start;  /* offset, length of desired mapping */
  __u32 fm_extent_count;  /* number of extents in array */
  __u32 fm_flags; /* flags for input request */
  XFS_IOC_GETBMAP) */
  __u64 unused;
  struct fibmap_extent fm_extents[0];
 }
 
 #define FIEMAP_LEN_MASK  0xff
 #define FIEMAP_LEN_HOLE  0x01
 #define FIEMAP_LEN_UNWRITTEN 0x02
 
 Sounds good but I would add:
 
 #define FIEMAP_LEN_NO_DIRECT_ACCESS
 
 This would say that the offset on disk can move at any time or that  
 the data is compressed or encrypted on disk thus the data is not  
 useful for direct disk access.

This makes sense.  Even for Reiserfs the same is true with packed tails,
and I believe if FIBMAP is called on a tail it will migrate the tail into
a block because this might be a sign that the file is a kernel that
LILO wants to boot.

I'd rather not have any such feature in FIEMAP, and just return the
on-disk allocation for the file, so NO_DIRECT_ACCESS is fine with me.
My main reason for FIEMAP is being able to investigate allocation patterns
of files.

By no means is my flag list exhaustive, just the ones that I thought would
be needed to implement this for ext4 and Lustre.

 Also why are you not using 0xff00, i.e. two more zeroes  
 at the end?  Seems unnecessary to drop an extra 8 bits of  
 significance from the byte size...

It was actually just a typo (this was the first time I'd written the
structs and flags down, it is just at the discussion stage).  I'd meant
for it to be 2^56 bytes for the file size as I wrote later in the email.
That said, I think that 2^48 bytes is probably sufficient for most uses,
so that we get 16 bits for flags.  As it is this email already discusses
5 flags, and that would give little room for expansion in the future.

Remember, this is the mapping for a single file (which can't practically
be beyond 2^64 bytes as yet) so it wouldn't be hard for the filesystem to
return a few separate extents which are actually contiguous (assuming that
there will actually be files in filesystems with > 2^48 bytes of contiguous
space).  Since the API is that it will return the extent that contains the
requested start byte, the kernel will be able to detect this case also,
since it won't be able to specify a length for the extent that contains the
start byte.

At most we'd have to call the ioctl() 65536 times for a completely
contiguous 2^64 byte file if the buffer was only large enough for a
single extent.  In reality, I expect any file to have some discontinuities
and the buffer to be large enough for a thousand or more entries so the
corner case is not very bad.
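
The 65536 figure is 2^64 / 2^48 = 2^16: with 16 bits of the length word given
to flags, one extent can describe at most 2^48 bytes, so a fully contiguous
2^64-byte file needs 2^16 extents, one per call in the worst case.  A small
sketch of that arithmetic; worst_case_calls() is a hypothetical helper name
used only for illustration:

```c
#include <stdint.h>

/*
 * Worst-case number of single-extent calls needed to map a contiguous file
 * of 2^file_bits bytes when one extent can describe at most 2^len_bits
 * bytes.  The result is 2^(file_bits - len_bits), saturating at UINT64_MAX
 * when that would not fit in 64 bits.
 */
uint64_t worst_case_calls(unsigned file_bits, unsigned len_bits)
{
    if (len_bits >= file_bits)
        return 1;
    if (file_bits - len_bits >= 64)
        return UINT64_MAX;
    return (uint64_t)1 << (file_bits - len_bits);
}
```

With 48 length bits this gives 65536 calls; keeping 56 length bits (8 flag
bits) would give 256.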

 Finally please make sure that the file system can return in one way  
 or another errors for example when it fails to determine the extents  
 because the system ran out of memory, there was an i/o error,  
 whatever...  It may even be useful to be able to say here is an  
 extent of size X bytes but we do not know where it is on disk because  
 there was an error determining this particular extent's on-disk  
 location for some reason or other...

Yes, that makes sense also, something like FIEMAP_LEN_UNKNOWN, and
FIEMAP_LEN_ERROR.  Consider FIEMAP on a file that was migrated
to tape and currently has no blocks allocated in the filesystem.  We
want to return some indication that there is actual file data and not
just a hole, but at the same time we don't want this to actually return
the file from tape just to generate block mappings for it.

This concept is also present in XFS_IOC_GETBMAPX - BMV_IF_NO_DMAPI_READ,
but this needs to be specified on input to prevent the file being mapped
and I'd rather the opposite (not getting file from tape) be the default,
by principle of least surprise.


 block-aligned/sized allocations (e.g. tail packing).  The  
 fm_extents array
 returned contains the packed list of allocation extents for the file,
 including entries for holes (which have fe_start == 0, and a flag).
 
 Why the fe_start == 0?  Surely just the flag is sufficient...  On  
 NTFS it is perfectly valid to have fe_start == 0 and to have that not  
 be sparse (normally the $Boot system file is stored in the first 8  
 sectors of the volume)...

I thought

Re: ext3, BKL, journal replay, multiple non-bind mounts of same device

2007-04-10 Thread Andreas Dilger

On Apr 10, 2007  20:49 -0400, John Anthony Kazos Jr. wrote:
 Since it is possible for the same block device to be mounted multiple 
 times concurrently by the same filesystem, and since ext3 explicitly 
 disables the BKL during its fill_super operation which would prevent this, 
 what is the result of mounting it multiple times this way? Especially if 
 the filesystem is dirty and a journal is replayed. (In any case, what 
 operation is being performed by ext3/ext4 that requires the BKL to be 
 dropped? What's the need to even consider the BKL during fill_super?)
 
 And in general, how does a filesystem deal with being mounted multiple 
 times in this way? In my testing and exploration so far, everything seems 
 to generally work, but I haven't tried deliberately using different 
 instances of the mount concurrently. Do we end up with locks not being 
 held properly on the superblock because the super_block structure 
 instances don't know about each other? Has dealing with this behavior of 
 bd_claim really been considered before, and if so, what's the general 
 scheme for handling it?

It is a myth (that actually frightened me quite a bit when I first did it)
that the filesystem is mounted twice in this case.  The truth of the matter
is if you mount -t ext3 /dev/ /mnt/1 and ... /mnt/2 you actually
get the equivalent of a bind mount for this block device on the two mount
points.  You can see this easily because e.g. you don't get two kjournald
threads for the two mounts, and it doesn't completely blow up.

If, on the other hand, you tried one mount with ext3 and another with ext4
it will fail the second with -EBUSY.

As for the BKL changes, your best bet is to go back through GIT and/or BK
or search the mailing lists to see when and why that was added.  It appears
to have been 2.6.11, but I don't know why.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: [PATCH 5/13] ext4: use zero_user_page

2007-04-10 Thread Andreas Dilger

On Apr 10, 2007  20:36 -0700, Nate Diller wrote:
 Use zero_user_page() instead of open-coding it. 
 
 Signed-off-by: Nate Diller [EMAIL PROTECTED]

 To: Andrew Morton [EMAIL PROTECTED],
Alexander Viro [EMAIL PROTECTED]
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org

Would have been better to CC the filesystem maintainers directly
(which was one of the reasons Andrew wanted per-fs patches, so they
can be Acked/Nacked independently).

Looks good in any case,

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]

 diff -urpN -X dontdiff linux-2.6.21-rc6-mm1/fs/ext4/inode.c 
 linux-2.6.21-rc6-mm1-test/fs/ext4/inode.c
 --- linux-2.6.21-rc6-mm1/fs/ext4/inode.c  2007-04-10 17:15:04.0 
 -0700
 +++ linux-2.6.21-rc6-mm1-test/fs/ext4/inode.c 2007-04-10 18:33:04.0 
 -0700
 @@ -1791,7 +1791,6 @@ int ext4_block_truncate_page(handle_t *h
   struct inode *inode = mapping->host;
   struct buffer_head *bh;
   int err = 0;
 - void *kaddr;
  
   if ((EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL) &&
   test_opt(inode->i_sb, EXTENTS) &&
 @@ -1808,10 +1807,7 @@ int ext4_block_truncate_page(handle_t *h
*/
   if (!page_has_buffers(page) && test_opt(inode->i_sb, NOBH) &&
ext4_should_writeback_data(inode) && PageUptodate(page)) {
 - kaddr = kmap_atomic(page, KM_USER0);
 - memset(kaddr + offset, 0, length);
 - flush_dcache_page(page);
 - kunmap_atomic(kaddr, KM_USER0);
 + zero_user_page(page, offset, length);
   set_page_dirty(page);
   goto unlock;
   }
 @@ -1864,11 +1860,7 @@ int ext4_block_truncate_page(handle_t *h
   goto unlock;
   }
  
 - kaddr = kmap_atomic(page, KM_USER0);
 - memset(kaddr + offset, 0, length);
 - flush_dcache_page(page);
 - kunmap_atomic(kaddr, KM_USER0);
 -
 + zero_user_page(page, offset, length);
   BUFFER_TRACE(bh, "zeroed end of block");
  
   err = 0;
 diff -urpN -X dontdiff linux-2.6.21-rc6-mm1/fs/ext4/writeback.c 
 linux-2.6.21-rc6-mm1-test/fs/ext4/writeback.c
 --- linux-2.6.21-rc6-mm1/fs/ext4/writeback.c  2007-04-10 18:05:52.0 
 -0700
 +++ linux-2.6.21-rc6-mm1-test/fs/ext4/writeback.c 2007-04-10 
 18:33:04.0 -0700
 @@ -961,7 +961,6 @@ int ext4_wb_writepage(struct page *page,
   loff_t i_size = i_size_read(inode);
   pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
   unsigned offset;
 - void *kaddr;
  
   wb_debug("writepage %lu from inode %lu\n", page->index, inode->i_ino);
  
 @@ -1011,10 +1010,7 @@ int ext4_wb_writepage(struct page *page,
* the  page size, the remaining memory is zeroed when mapped, and
* writes to that region are not written out to the file.
*/
 - kaddr = kmap_atomic(page, KM_USER0);
 - memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
 - flush_dcache_page(page);
 - kunmap_atomic(kaddr, KM_USER0);
 + zero_user_page(page, offset, PAGE_CACHE_SIZE - offset);
   return ext4_wb_write_single_page(page, wbc);
  }
  
 @@ -1065,7 +1061,6 @@ int ext4_wb_block_truncate_page(handle_t
   struct inode *inode = mapping->host;
   struct buffer_head bh, *bhw = &bh;
   unsigned blocksize, length;
 - void *kaddr;
   int err = 0;
  
   wb_debug("partial truncate from %lu on page %lu from inode %lu\n",
 @@ -1104,10 +1099,7 @@ int ext4_wb_block_truncate_page(handle_t
   }
   }
  
 - kaddr = kmap_atomic(page, KM_USER0);
 - memset(kaddr + offset, 0, length);
 - flush_dcache_page(page);
 - kunmap_atomic(kaddr, KM_USER0);
 + zero_user_page(page, offset, length);
   SetPageUptodate(page);
   __set_page_dirty_nobuffers(page);
  

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.


Re: Interface for the new fallocate() system call

2007-04-06 Thread Andreas Dilger

On Apr 05, 2007  16:56 +0530, Amit K. Arora wrote:
 This should work on all the platforms. The only concern I can think of
 here is the convention being followed till now, where all the entities on
 which the action has to be performed by the kernel (say fd, file/device
 name, pid etc.) is the first argument of the system call. If we can live
 with the small exception here, fine.

Yes, it is much cleaner to have fd first, like every other such syscall.

 Or else, we may have to implement the 
 
   int fd, int mode, loff_t offset, loff_t len
 
 as the layout of arguments here. I think only s390 will have a problem
 with this, and we can think of a workaround for it (may be similar to
 what ARM did to implement sync_file_range() system call)   :
 
 asmlinkage long sys_s390_fallocate(int fd, loff_t offset, loff_t len, int mode)
 {
         return sys_fallocate(fd, mode, offset, len);
 }

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

