Re: [PATCH] E2fsprogs: add missing usage for No_COW

2012-06-13 Thread Ted Ts'o
On Wed, Jun 13, 2012 at 03:47:13PM +0800, Liu Bo wrote:
> Add the missing usage for No_COW since we've supported the No_COW flag.
> 
> Signed-off-by: Liu Bo 

Applied, thanks.

- Ted


Re: [PATCH v2] E2fsprogs: add missing usage for No_COW

2012-06-13 Thread Ted Ts'o
On Wed, Jun 13, 2012 at 04:56:42PM +0800, Liu Bo wrote:
> Add the missing usage for No_COW since we've supported the No_COW flag.
> 
> Signed-off-by: Liu Bo 

Applied, although I changed the commit description to read:

chattr: add the -C option to the usage message

 - Ted


Re: Btrfs and data nocow per inode basis

2012-06-12 Thread Ted Ts'o
... and e2fsprogs 1.42.4 has been released, with the No_COW lsattr and
chattr support.  It's in all of the usual places:

ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/v1.42.4

and

http://prdownloads.sourceforge.net/e2fsprogs/e2fsprogs-1.42.4.tar.gz

... and I've uploaded a release to debian unstable, which will
hopefully make it into testing before the upcoming stable release
(Wheezy) freeze.

You can look through the release notes at:

http://e2fsprogs.sourceforge.net/e2fsprogs-release.html#1.42.4

Regards,

- Ted


Re: Btrfs and data nocow per inode basis

2012-06-12 Thread Ted Ts'o
On Tue, Jun 12, 2012 at 04:44:23PM -0400, Chris Mason wrote:
> On Tue, Jun 12, 2012 at 01:15:27PM -0600, Ted Ts'o wrote:
> > It appears the NOCOW_FL flag is currently a no-op in the 3.2 kernel?
> 
> It's not a noop, but it is only setting the NODATACOW flag.  It needs to
> set the nodatasum flag as well, just like the mount -o nodatacow mount
> option does.
> 
> I'll fix this up on the kernel side, thanks Ted.

Here's the final patch to e2fsprogs that will be going into 1.42.4:

commit 5a23c93aeb65d61892a47f8f27bffad38f4759ea
Author: Theodore Ts'o 
Date:   Tue Jun 12 17:09:39 2012 -0400

lsattr, chattr: add support for btrfs's No_COW flag

Signed-off-by: "Theodore Ts'o" 

diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
index f03193c..e2f8ce5 100644
--- a/lib/e2p/pf.c
+++ b/lib/e2p/pf.c
@@ -49,6 +49,7 @@ static struct flags_name flags_array[] = {
{ EXT2_TOPDIR_FL, "T", "Top_of_Directory_Hierarchies" },
{ EXT4_EXTENTS_FL, "e", "Extents" },
{ EXT4_HUGE_FILE_FL, "h", "Huge_file" },
+   { FS_NOCOW_FL, "C", "No_COW" },
{ 0, NULL, NULL }
 };
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index f46a1a9..fb3f7cc 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -301,6 +301,7 @@ struct ext2_dx_countlimit {
 #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
 #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
 /* EXT4_EOFBLOCKS_FL 0x00400000 was here */
+#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define EXT4_SNAPFILE_FL		0x01000000 /* Inode is a snapshot */
 #define EXT4_SNAPFILE_DELETED_FL	0x04000000 /* Snapshot is being deleted */
 #define EXT4_SNAPFILE_SHRUNK_FL		0x08000000 /* Snapshot shrink has completed */
diff --git a/misc/chattr.1.in b/misc/chattr.1.in
index 92f6d70..5a57d2c 100644
--- a/misc/chattr.1.in
+++ b/misc/chattr.1.in
@@ -64,6 +64,15 @@ this file compresses data before storing them on the disk.  Note: please
 make sure to read the bugs and limitations section at the end of this
 document.
 .PP
+A file with the 'C' attribute set will not be subject to copy-on-write
+updates.  This flag is only supported on file systems which perform
+copy-on-write.  (Note: For btrfs, the 'C' flag should be only
+set on new or empty files.  If it is set on a file which already has
+data blocks, it is undefined when the blocks assigned to the file will
+be fully stable.  If the 'C' flag is set on a directory, it will have no
+effect on the directory, but new files created in that directory will
+have the No_COW attribute.)
+.PP
 When a directory with the `D' attribute set is modified,
 the changes are written synchronously on the disk; this is equivalent to
 the `dirsync' mount option applied to a subset of the files.
@@ -159,8 +168,7 @@ maintained by Theodore Ts'o .
 .SH BUGS AND LIMITATIONS
 The `c', 's',  and `u' attributes are not honored 
 by the ext2 and ext3 filesystems as implemented in the current mainline
-Linux kernels.These attributes may be implemented
-in future versions of the ext2 and ext3 filesystems.
+Linux kernels.
 .PP
 The `j' option is only useful if the filesystem is mounted as ext3.
 .PP
diff --git a/misc/chattr.c b/misc/chattr.c
index 8a2d61f..141ea6e 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -107,6 +107,7 @@ static const struct flags_char flags_array[] = {
{ EXT2_UNRM_FL, 'u' },
{ EXT2_NOTAIL_FL, 't' },
{ EXT2_TOPDIR_FL, 'T' },
+   { FS_NOCOW_FL, 'C' },
{ 0, 0 }
 };
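
For reference, chattr's new -C support ultimately boils down to the
standard flags ioctls.  A minimal sketch (mine, not part of the patch
above; error handling trimmed) of setting the flag on a file:

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/fs.h>

    #ifndef FS_NOCOW_FL
    #define FS_NOCOW_FL 0x00800000  /* same value as in the patch above */
    #endif

    int set_nocow(const char *path)
    {
            int fd = open(path, O_RDONLY);
            int flags, ret = -1;

            if (fd < 0)
                    return -1;
            /* the kernel treats the ioctl argument as an int */
            if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
                    flags |= FS_NOCOW_FL;
                    ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
            }
            close(fd);
            return ret;
    }

As the man page text above notes, on btrfs this only has its full
effect when set on new or empty files.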
 


Re: Btrfs and data nocow per inode basis

2012-06-12 Thread Ted Ts'o
It appears the NOCOW_FL flag is currently a no-op in the 3.2 kernel?

 {/mnt}  
2062# grep /mnt /proc/mounts 
/dev/mapper/funarg-btrfs /mnt btrfs rw,relatime,space_cache 0 0
 {/mnt}  
2063# sync ; filefrag -v a
Filesystem type is: 9123683e
File size of a is 32768 (8 blocks, blocksize 4096)
 ext logical physical expected length flags
   0   0 3096   8 eof
a: 1 extent found
 {/mnt}  
2064# dd if=/dev/zero of=a bs=32k conv=notrunc,nocreat count=1 
1+0 records in
1+0 records out
32768 bytes (33 kB) copied, 0.000119266 s, 275 MB/s
 {/mnt}  
2065# sync ; filefrag -v a
Filesystem type is: 9123683e
File size of a is 32768 (8 blocks, blocksize 4096)
 ext logical physical expected length flags
   0   0 3088   8 eof
a: 1 extent found
 {/mnt}  
2066# lsattr a
---C a

I've attached the patch which I was going to commit, but when I tested
it, it appears the flag is being set and displayed correctly, but
btrfs doesn't appear to be honoring it.

Anyone want to explain what's going on?

- Ted

diff --git a/lib/e2p/pf.c b/lib/e2p/pf.c
index f03193c..69181e7 100644
--- a/lib/e2p/pf.c
+++ b/lib/e2p/pf.c
@@ -49,6 +49,7 @@ static struct flags_name flags_array[] = {
{ EXT2_TOPDIR_FL, "T", "Top_of_Directory_Hierarchies" },
{ EXT4_EXTENTS_FL, "e", "Extents" },
{ EXT4_HUGE_FILE_FL, "h", "Huge_file" },
+   { FS_NOCOW_FL, "C", "No_COW" },
{ 0, NULL, NULL }
 };
 
diff --git a/lib/ext2fs/ext2_fs.h b/lib/ext2fs/ext2_fs.h
index f46a1a9..fb3f7cc 100644
--- a/lib/ext2fs/ext2_fs.h
+++ b/lib/ext2fs/ext2_fs.h
@@ -301,6 +301,7 @@ struct ext2_dx_countlimit {
 #define EXT4_EXTENTS_FL			0x00080000 /* Inode uses extents */
 #define EXT4_EA_INODE_FL		0x00200000 /* Inode used for large EA */
 /* EXT4_EOFBLOCKS_FL 0x00400000 was here */
+#define FS_NOCOW_FL			0x00800000 /* Do not cow file */
 #define EXT4_SNAPFILE_FL		0x01000000 /* Inode is a snapshot */
 #define EXT4_SNAPFILE_DELETED_FL	0x04000000 /* Snapshot is being deleted */
 #define EXT4_SNAPFILE_SHRUNK_FL		0x08000000 /* Snapshot shrink has completed */
diff --git a/misc/chattr.1.in b/misc/chattr.1.in
index 92f6d70..18b44bd 100644
--- a/misc/chattr.1.in
+++ b/misc/chattr.1.in
@@ -64,6 +64,10 @@ this file compresses data before storing them on the disk.  Note: please
 make sure to read the bugs and limitations section at the end of this
 document.
 .PP
+A file with the 'C' attribute set will not be subject to copy-on-write
+updates.  This flag is only supported on file systems which perform
+copy-on-write, obviously.
+.PP
 When a directory with the `D' attribute set is modified,
 the changes are written synchronously on the disk; this is equivalent to
 the `dirsync' mount option applied to a subset of the files.
diff --git a/misc/chattr.c b/misc/chattr.c
index 8a2d61f..141ea6e 100644
--- a/misc/chattr.c
+++ b/misc/chattr.c
@@ -107,6 +107,7 @@ static const struct flags_char flags_array[] = {
{ EXT2_UNRM_FL, 'u' },
{ EXT2_NOTAIL_FL, 't' },
{ EXT2_TOPDIR_FL, 'T' },
+   { FS_NOCOW_FL, 'C' },
{ 0, 0 }
 };
 


Re: Btrfs and data nocow per inode basis

2012-06-12 Thread Ted Ts'o
On Tue, Jun 12, 2012 at 07:41:25PM +0200, Goffredo Baroncelli wrote:
> 
> After a bit of googling I found a Liu Bo patches which add the ability
> to set the NOCOW flags to a btrfs file.[1]
> 
> However it seems that it was not present in the current (v1.42.3)
> e2fsprogs suite.
> 
> Is there any reason which stopped the adoption of this patch?

It's a textbook example of how *not* to try to get a patch into
e2fsprogs.

1) It's a huge patch that makes a much larger set of changes than what
is necessary to achieve the desired results (the EXT2_*_FL to
FS_*_FL rename is hugely gratuitous).

2) Because of all of the noise in the patch, something that was
completely missed was that the patch did ***NOT*** allow the setting
of the NOCOW flag.

My apologies for not having the time to send a reply back.  It fell
between the cracks.

> Does it make sense to add the chattr/lsattr capability to the btrfs tool (I
> am thinking about new commands like "btrfs filesystem chattr"/"btrfs
> filesystem lsattr") ?

The lsattr/chattr commands and the flags were originally defined for
the ext2/3/4 file systems.  It was the reiserfs filesystem
that just glommed onto that interface and hijacked the flags
definition into a file system independent interface.  At this point
enough other file systems have used those ioctl's and the flags
interface that I *do* try to coordinate with other file systems.

This means that I coordinate flag assignments and try not to use flag
bits in ext2/3/4 that have been made visible by other file systems
(even though the 32-bit flag space is getting pretty crowded, and
there's not much space left).  And I will add support for flags into
lsattr/chattr that are not supported by ext2/3/4.

But a massive change which tries to do a global rename of EXT2_*_FL to
FS_*_FL in e2fsprogs for no good reason?  That just adds code churn
and it makes it harder to validate that the patch is correct, for no
good user-visible benefit.

Regards,

- Ted


Re: [PATCH] fs: make i_generation a u64

2012-04-12 Thread Ted Ts'o
On Wed, Apr 11, 2012 at 04:42:48PM -0400, Josef Bacik wrote:
> Btrfs stores generation numbers as 64bit numbers, which means we have to
> carry around a u64 in our incore inode in addition to setting i_generation.
> So convert to a u64 so btrfs can kill its incore generation.  Thanks,
> 
> Signed-off-by: Josef Bacik 

Why is btrfs using a 64-bit generation number, out of curiosity?  The
only user of the inode generation number as far as I can tell is NFS,
and even NFSv4 is using a 32-bit generation number

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-18 Thread Ted Ts'o
On Thu, Mar 15, 2012 at 11:42:24AM +0100, Jacek Luczak wrote:
> 
> That was not a SVN server. It was a build host having checkouts of SVN
> projects.
> 
> The many files/dirs case is common for VCS and the SVN is not the only
> that would be affected here. 

Well, with SVN it's 2x or 3x the number of files in the checked out
source code directory, right?  So if a particular source tree has
2,000 files in a source directory, then SVN might have at most 6,000
files, and if you assume each directory entry is 64 bytes, we're still
talking about 375k.  Do you have more files than that in a directory
in practice with SVN?  And if so why?

> AFAIR git.kernel.org was also suffering from the getdents().

git.kernel.org was suffering from a different problem, which was that
the git.kernel.org administrators didn't feel like automatically doing
a "git gc" on all of the repositories, and a lot of people were just
doing "git pushes" and not bothering to gc their repositories.  Since
git.kernel.org users don't have shell access any more, the
git.kernel.org administrators have to be doing automatic git gc's.  By
default git is supposed to automatically do a gc when there are more
than 6700 loose object files (which are distributed across 256 1st
level directories, so in practice a .git/objects/XX directory
shouldn't have more than 30 objects in it, with each directory object
taking 48 bytes).  The problem, I believe, is that "git push" commands
weren't checking the gc.auto limit, and so that's why git.kernel.org had
in the past suffered from large directories.  This is arguably a git
bug, though, and as I mentioned, since we all don't have shell access
to git.kernel.org, this has to be handled automatically now.

> Same applies to commercial products that are
> heavily stuffed with many files/dirs, e.g. ClearCase or Synergy. 

How many files in a directory do we commonly see with these systems?
I'm not familiar with them, and so I don't have a good feel for what
typical directory sizes tend to be.

> The medium size you are referring to would most probably fit into 256k and
> this could be enough for 90% of cases. Large production systems running
> on ext4 need backups thus those would benefit the most here.

Yeah, 256k or 512k is probably the best.  Alternatively, the backup
programs could simply be taught to sort the directory entries by inode
number, and if that's not enough, to grab the initial block numbers
using FIEMAP and then sort by block number.  Of course, all of this
optimization may or may not actually give us as much returns as we
think, given that the disk is probably seeking from other workloads
happening in parallel anyway (another reason why I am suspicious that
timing the tar command may not be an accurate way of measuring actual
performance when you have other tasks accessing the file system in
parallel with the backup).
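
For illustration, a sketch (mine, not from any existing backup tool;
most error handling omitted) of the sort-by-inode pass described
above:

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    struct ent {
            ino_t ino;
            char name[256];
    };

    static int by_ino(const void *a, const void *b)
    {
            ino_t x = ((const struct ent *)a)->ino;
            ino_t y = ((const struct ent *)b)->ino;
            return (x > y) - (x < y);
    }

    /* stat every entry of a directory in inode-number order, so the
     * inode table is read roughly sequentially */
    int stat_in_inode_order(const char *path)
    {
            DIR *d = opendir(path);
            struct dirent *de;
            struct ent *v = NULL, *nv;
            size_t n = 0, cap = 0;
            struct stat st;

            if (!d)
                    return -1;
            while ((de = readdir(d)) != NULL) {
                    if (n == cap) {
                            cap = cap ? 2 * cap : 64;
                            nv = realloc(v, cap * sizeof(*v));
                            if (!nv)
                                    goto fail;
                            v = nv;
                    }
                    v[n].ino = de->d_ino;
                    snprintf(v[n].name, sizeof(v[n].name), "%s", de->d_name);
                    n++;
            }
            qsort(v, n, sizeof(*v), by_ino);
            for (size_t i = 0; i < n; i++)
                    fstatat(dirfd(d), v[i].name, &st, 0);
            free(v);
            return closedir(d);
    fail:
            free(v);
            closedir(d);
            return -1;
    }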

  - Ted


Re: getdents - ext4 vs btrfs performance

2012-03-14 Thread Ted Ts'o
On Wed, Mar 14, 2012 at 03:34:13PM +0100, Lukas Czerner wrote:
> > 
> > You can make it be a RO_COMPAT change instead of an INCOMPAT change,
> > yes.
> 
> Does it have to be RO_COMPAT change though ? Since this would be both
> forward and backward compatible.

The challenge is how do you notice if the file system is mounted on an
older kernel, which then inserts a directory entry without updating
the secondary tree.  The older kernel won't know about the new inode
flag, so it can't clear the flag when it modifies the directory.

We were able to get away with making the original htree feature
read/write compatible because the design for it was anticipated far in
advance, and because it was before the days of enterprise kernels that
had to be supported for seven years.  So we added code into ext3 to
clear the htree flag whenever the directory was modified something
like two years before the htree code made its appearance, and back
then we decided that was fair to assume no one would be using a kernel
that old, or be jumping back and forth between an ancient kernel and a
modern kernel with htree enabled.  Yes, that was playing a bit fast
and loose, but early on in the kernel history, we could do that.  It's
not something I would do today.

The bigger deal is that as Zach pointed out, we can't just index it by
inode number because we have to worry about hard links.  Which means
we need either an explicit counter field added to the directory entry,
or some kind of new offset field.  That we can't just shoehorn in and
retain backwards compatibility.

And once we break backwards compatibility, we might as well look at
the full design space and see what is the most optimal.

  - Ted


Re: getdents - ext4 vs btrfs performance

2012-03-14 Thread Ted Ts'o
On Wed, Mar 14, 2012 at 10:28:20AM -0400, Phillip Susi wrote:
> 
> Do you really think it is that much easier?  Even if it is easier,
> it is still an ugly kludge.  It would be much better to fix the
> underlying problem rather than try to paper over it.

I don't think the choice is obvious.  A solution which solves the
problem 90% plus of the time for real-life workloads, without
requiring any file system format changes, is very appealing to me and,
I think, to many customers.

That being said, for someone who is looking for perfection, I can see
how it would grate on their nerves.

> The same argument could have been made against the current htree
> implementation when it was added.  I think it carries just as little
> weight now as it did then.  People who care about the added
> performance the new feature provides can use it, those who don't,
> won't.  Eventually it will become ubiquitous.

The original htree implementation tried very hard to be fully
backwards compatible for reading and writing.  At the time that was
considered very important, and for some folks it was a major killer
feature.  The original ext3 journalling support was designed in a way
where you could back out and mount such a file system as ext2, and
that was considered important back then, too.

The fact that we've been so successful with ext4's upgrade to features
that require RO_COMPAT or INCOMPAT features does make me more willing
to consider those features, but the reality is that the deployment
time for such a new feature is measured in 2-3 years, minimum.  So I
won't rule out making such a change, but I wouldn't let the fact that
someone wants to sign up to implement a long-term feature be mutually
exclusive with a short-term solution which solves most of the problem
and could be rolled out much more quickly.

   - Ted


Re: getdents - ext4 vs btrfs performance

2012-03-14 Thread Ted Ts'o
On Wed, Mar 14, 2012 at 10:17:37AM -0400, Zach Brown wrote:
> 
> >We could do this if we have two b-trees, one indexed by filename and
> >one indexed by inode number, which is what JFS (and I believe btrfs)
> >does.
> 
> Typically the inode number of the destination inode isn't used to index
> entries for a readdir tree because of (wait for it) hard links.  You end
> up right back where you started with multiple entries per key.

Well, if you are using 32-bit (or even 48-bit) inode numbers and a
64-bit telldir cookie, it's possible to make the right thing happen.
But yes, if you are using 32-bit inode numbers and a 32-bit telldir
cookie, dealing with what happens when you have multiple hard links to
the same inode in the same directory gets tricky.

> A painful solution is to have the key in the readdir tree allocated by
> the tree itself -- count key populations in subtrees per child pointer
> and use that to find free keys.

One thing that might work is to have a 16-bit extra field in the
directory entry that gives a signed offset to the inode number, such
that inode+offset is a unique value within the btree sorted by
inode+offset.  Since this tree is only used for returning
entries in an optimal (or as close to optimal as can be arranged)
order, we could get away with that.
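
To make that concrete, a sketch (my illustration, not ext4 code) of
such a key:

    #include <stdint.h>

    /* Each entry in the readdir tree carries a small signed fudge,
     * chosen at insert time so that ino + fudge is unique even when
     * several hard links to one inode live in the same directory. */
    struct readdir_entry {
            uint64_t ino;      /* inode number of the link target */
            int16_t  fudge;    /* 0 unless a collision forced an offset */
    };

    static inline uint64_t readdir_key(const struct readdir_entry *e)
    {
            return e->ino + (int64_t)e->fudge;  /* tree sort key */
    }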

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-14 Thread Ted Ts'o
On Wed, Mar 14, 2012 at 09:12:02AM +0100, Lukas Czerner wrote:
> I kind of like the idea about having the separate btree with inode
> numbers for the directory reading, just because it does not affect
> allocation policy nor the write performance which is a good thing. Also
> it has been done before in btrfs and it works very well for them. The
> only downside (aside from format change) is the extra step when removing
> the directory entry, but the positives outperform the negatives.

Well, there will be extra (journaled!) block reads and writes involved
when adding or removing directory entries.  So the performance on
workloads that do a large number of directory adds and removes will
have to be impacted.  By how much is not obvious, but it will be
something which is measurable.

> Maybe this might be even done in a way which does not require format
> change. We can have new inode flag (say EXT4_INUM_INDEX_FL) which will
> tell us that there is an inode number btree to use on directory reads.
> Then the pointer to the root of that tree would be stored at the end of
> the root block of the htree (of course if there is still some space for
> it) and the rest is easy.

You can make it be a RO_COMPAT change instead of an INCOMPAT change,
yes.

And if you want to do it as simply as possible, we could just recycle
the current htree approach for the second tree, and simply store the
root in another set of directory blocks.  But by putting the index
nodes in the directory blocks, masquerading as deleted directories, it
means that readdir() has to cycle through them and ignore the index
blocks.

The alternate approach is to use physical block numbers instead of
logical block numbers for the index blocks, and to store it separately
from the blocks containing actual directory entries.  But if we do
that for the inumber tree, the next question that arises is whether we
should do that for the hash tree as well --- and once you open
that can of worms, it gets a lot more complicated.

So the question that really arises here is how wide open do we want to
leave the design space, and whether we are optimizing for the best
possible layout change ignoring the amount of implementation work it
might require, or whether we keep things as simple as possible from a
code change perspective.

There are good arguments that can be made either way, and a lot
depends on the quality of the students you can recruit, the amount of
time they have, and how much review time it will take out of the core
team during the design and implementation phase.

Regards,

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-13 Thread Ted Ts'o
On Wed, Mar 14, 2012 at 10:48:17AM +0800, Yongqiang Yang wrote:
> What if we use inode number as the hash value?  Does it work?

The whole point of using the tree structure is to accelerate filename
-> inode number lookups.  So the namei lookup doesn't have the inode
number; the whole point is to use the filename to look up the inode
number.  So we can't use the inode number as the hash value since
that's what we are trying to find out.

We could do this if we have two b-trees, one indexed by filename and
one indexed by inode number, which is what JFS (and I believe btrfs)
does.

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-13 Thread Ted Ts'o
On Tue, Mar 13, 2012 at 04:22:52PM -0400, Phillip Susi wrote:
> 
> I think a format change would be preferable to runtime sorting.

Are you volunteering to spearhead the design and coding of such a
thing?  Run-time sorting is backwards compatible, and a heck of a lot
easier to code and test...

The reality is we'd probably want to implement run-time sorting
*anyway*, for the vast majority of people who don't want to convert to
a new incompatible file system format.  (Even if you can do the
conversion using e2fsck --- which is possible, but it would be even
more code to write --- system administrators tend to be very
conservative about such things, since they might need to boot an older
kernel, or use a rescue CD that doesn't have an uptodate kernel or
file system utilities, etc.)

> So the index nodes contain the hash ranges for the leaf block, but
> the leaf block only contains the regular directory entries, not a
> hash for each name?  That would mean that adding or removing names
> would require moving around the regular directory entries wouldn't
> it?

They aren't sorted in the leaf block, so we only need to move around
regular directory entries when we do a node split (and at the moment
we don't support shrinking directories), so we don't have to worry
about the reverse case.

> I would think that hash collisions are rare enough that reading a
> directory block you end up not needing once in a blue moon would be
> chalked up under "who cares".  So just stick with hash, offset pairs
> to map the hash to the normal directory entry.

With a 64-bit hash, and if we were actually going to implement this as
a new incompatible feature, you're probably right in terms of
accepting the extra directory block search.

We would still have to implement the case where hash collisions *do*
exist, though, and make sure the right thing happens in that case.
Even if the chance of that happening is 1 in 2**32, with enough
deployed systems (i.e., every Android handset, etc.) it's going to
happen in real life.

- Ted






Re: getdents - ext4 vs btrfs performance

2012-03-13 Thread Ted Ts'o
On Tue, Mar 13, 2012 at 03:05:59PM -0400, Phillip Susi wrote:
> Why not just separate the hash table from the conventional, mostly
> in inode order directory entries?  For instance, the first 200k of
> the directory could be the normal entries that would tend to be in
> inode order ( and e2fsck -D would reorder ), and the last 56k of the
> directory would contain the hash table.  Then readdir() just walks
> the directory like normal, and namei() can check the hash table.

Because that would be a format change.

What we have today is not a hash table; it's a hashed tree, where we
use a fixed-length key for the tree based on the hash of the file
name.  Currently the leaf nodes of the tree are the directory blocks
themselves; that is, the lowest level of the index blocks tells you to
look at directory block N, where that directory contains the directory
indexes for those file names which are in a particular range (say,
between 0x2325777A and 0x2325801).

If we aren't going to change the ordering of the directory,
that means we would need to change things so the leaf nodes contain
the actual file names themselves, so that we know whether or
not we've hit the correct entry before we go to read in a
specific directory block (otherwise, you'd have problems dealing with
hash collisions).  But in that case, instead of storing the pointer to
the directory entry, since the bulk of the size of a directory entry
is the filename itself, you might as well store the inode number in
the tree itself, and be done with it.

And in that case, since you are replicating the directory information
twice over, and it's going to be an incompatible format change anyway,
you might as well just store the second copy of the directory entries
in *another* btree, except this one is indexed by inode number, and
then you use the second tree for readdir(), and you make the
telldir/seekdir cookie be the inode number.  That way readdir() will
always return results in a stat() optimized order, even in the face of
directory fragmentation and file system aging.

**This** is why telldir/seekdir is so evil; in order to do things
correctly in terms of the semantics of readdir() in the face of
telldir/seekdir and file names getting inserted and deleted into the
btree, and the possibility for tree splits/rotations, etc., and the
fact that the cookie is only 32 or 64 bits, you essentially either
need to just do something stupid and have a linear directory a la ext2
and V7 Unix, or you need to store the directory information twice over
in redundant b-trees.

Or, userspace needs to do the gosh-darned sorting by inode, or we do
some hack such as only sorting the inodes using non-swapping kernel
memory if the directory is smaller than some arbitrary size, such as
256k or 512k.

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-11 Thread Ted Ts'o
On Sun, Mar 11, 2012 at 04:30:37AM -0600, Andreas Dilger wrote:
> > if the userspace process could
> > feed us the exact set of filenames that will be used in the directory,
> > plus the exact file sizes for each of the file names...
> 
> Except POSIX doesn't allow anything close to this at all.  Something like
> fallocate() for directories would at least give the kernel a hint about
> how large the directory is, but fewer userspace tools have this information
> than the file size (e.g. tar doesn't really store a "directory", just a
> bunch of pathnames that happen to have largely the same components).

Yes, this was purely a thought experiment idea (although we could try
to extend the interface if it really netted enough gains).  My
point was that even if we could do this, we really are trading off
write performance for read performance (since you would now be doing
the extra seeking on the tar extraction phase).

Ultimately, while I believe it makes sense to optimize the "readdir +
seek" case, I don't believe it makes sense to optimize the "copy out
the directory using tar" benchmark --- since I believe this is much
more of a benchmark workload than something that happens a huge amount
in real life, and even for people who do backups (which unfortunately
as we know is a too-small percentage of the ext4 user base), they
generally will prefer that the "real" use case for the file system be
fast rather than the backup phase, and in point of fact as the file
system gets more fragmented, even if some file systems can offer
perfect name -> block correlation, over time I'll bet their
correlation breaks down and so ext4's disadvantage over those file
systems will disappear in real life usages.  So I don't believe it
makes that much sense to worry overly about the "name -> block"
correlation for freshly formatted file systems.  It's a nice-to-have,
but I really think that's more of a benchmark game than anything else.

> In the cases where size IS known in advance, or at least if the directory
> is going to be large, an allocation policy like mine would essentially
> "predict" where a uniformly distributed hash function will allocate the
> inodes for better reading order.

Well, yes; if we knew in advance how many files (roughly) would be
stored in a directory, there are some optimizations we could do.  The
simplest of which is preallocating the directory blocks so they are
contiguous on disk.  I'm not persuaded yet that the optimizations are
worth the effort of trying to define a new interface and then trying
to convince application programmers to use a hypothetical interface
extension *just* for this purpose, though.

> The unfortunate part is that this will help small directories, where the
> problem is already largely hidden by readahead and the page cache, but it
> doesn't help at all for huge directories that do not fit into cache.  That
> is where my approach would do well, but the difficulty is in knowing at
> create time how large the directory will be.

Well, my goal in proposing this optimization is that helps for the
"medium size" directories in the cold cache case.  The ext4 user who
first kicked off this thread was using his file system for an SVN
server, as I recall.  I could easily believe that he has thousands of
files; maybe even tens of thousands of files in a directory.  But that
probably still fits in 256k, or at best 512k worth of directory blocks.

It definitely won't help for the really large directories, yes; but how
common are they?  The most common case I can think of is squid caches,
but you can configure a larger number of 1st and 2nd level
subdirectories to keep the actual directory size smaller.

> > for directories which are less than
> > this threshold size, the entire directory is sucked in after the first
> > readdir() after an opendir() or rewinddir().  The directory contents
> > are then sorted by inode number (or loaded into an rbtree ordered by
> > inode number), and returned back to userspace in the inode order via
> > readdir().  The directory contents will be released on a closedir() or
> > rewinddir().
> 
> What would the telldir() cookie be in this case?  The inode number, or
> the file descriptor?  What happens if the directory size crosses the
> threshold while the cache is in use?  I guess in that case the cache
> could still be used (according to POSIX) because readdir() doesn't need
> to report entries added after it started.

The telldir() cookie can simply be an ordinal -- so "7" means the 7th
directory entry in the rbtree.  And all we need to do, which obeys
POSIX as you've noted, is that we instantiate the rbtree during the
first readdir() after an opendir() or rewinddir(), and free the old
rbtree when we get the rewinddir(), reach the end of the directory
stream, or get the closedir().

> > 3) If we want to do better than that, we could create a new flag
> > O_NOTELLDIR, which can be set via fcntl.  (This way programs can
> > do something like "dir = opend

Re: getdents - ext4 vs btrfs performance

2012-03-09 Thread Ted Ts'o
On Fri, Mar 09, 2012 at 04:09:43PM -0800, Andreas Dilger wrote:
> > I have also run the correlation.py from Phillip Susi on directory with
> > 10 4k files and indeed the name to block correlation in ext4 is pretty
> > much random :)
> 
> Just reading this on the plane, so I can't find the exact reference
> that I want, but a solution to this problem with htree was discussed
> a few years ago between myself and Coly Li.
> 
> The basic idea is that for large directories the inode allocator
> starts by selecting a range of (relatively) free inodes based on the
> current directory size, and then piecewise maps the hash value for
> the filename into this inode range and uses that as the goal inode.

I've heard you sketch out this idea before, but it's not been clear to
me how well it would work in practice.  The potential downside is that
it fragments the inode table, such that a single inode table block
might not have as many inodes stored in it as we otherwise would.
(Ideally, with 256 byte inodes, there would be 16 of them in each
4k block.)  This is something I'd definitely recommend modelling in
simulation to see how well it works first.

I'd observe that if we knew in advance how many files would be in a
particular directory (i.e., via a hint from userspace at mkdir time),
then it would be possible to optimize this case much more easily.  In
fact, if we had perfect knowledge --- if the userspace process could
feed us the exact set of filenames that will be used in the directory,
plus the exact file sizes for each of the file names --- then it would
be possible to allocate inode numbers and starting block numbers such
that when the files are read in readdir() order, the inode numbers and
block numbers would line up perfectly into sequential writes.  

Of course, the trade off is that by optimizing for read order, the
write order will tend to get painful as well!  So there's a tension
here; making the read part of the benchmark faster will make the
process of writing the directory hierarchy require more seeks --- and
this assuming that you know file names and sizes ahead of time.

(Unless, of course, you play the same game that Hans Reiser did when
he cooked his "tar" benchmark by constructing a kernel source tarball
where the file order in the tarball was perfectly ordered to guarantee
the desired result given Reiser4's hash!  :-)


I suspect the best optimization for now is probably something like
this:

1) Since the vast majority of directories are less than (say) 256k
(this would be a tunable value), for directories which are less than
this threshold size, the entire directory is sucked in after the first
readdir() after an opendir() or rewinddir().  The directory contents
are then sorted by inode number (or loaded into an rbtree ordered by
inode number), and returned back to userspace in the inode order via
readdir().  The directory contents will be released on a closedir() or
rewinddir().

2) For directories larger than this size, we don't want to hold that
much kernel memory during an opendir(), so fall back to ext4's current
scheme.

3) If we want to do better than that, we could create a new flag
O_NOTELLDIR, which can be set via fcntl.  (This way programs can
do something like "dir = opendir(..); fd = dirfd(dir); fcntl(fd,
F_SETFL, O_NOTELLDIR);".)  For programs that know they don't need to
use telldir/seekdir, they can set this flag, which will allow the
above optimization to be used on large directories.
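
Spelled out, the calling sequence would presumably look like this
(entirely hypothetical: O_NOTELLDIR does not exist in any kernel, so
the value below is invented purely so the sketch compiles):

    #include <dirent.h>
    #include <fcntl.h>

    #ifndef O_NOTELLDIR
    #define O_NOTELLDIR 010000000000  /* invented value, illustration only */
    #endif

    void open_no_telldir(const char *path)
    {
            DIR *dir = opendir(path);
            int fd;

            if (!dir)
                    return;
            fd = dirfd(dir);
            /* forfeit telldir()/seekdir() so the kernel may sort freely */
            fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NOTELLDIR);
            /* ... readdir(dir) as usual ... */
            closedir(dir);
    }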


The other thing we could potentially do, if we really cared about this
issue in ext3/4, would be to define a whole new (incompatible)
directory indexing format which avoided this problem.  It would be
an INCOMPAT feature, so it would be a while before it could get used
in practice, which is why I'm not entirely convinced it's worth it ---
especially since I suspect just doing (1) would solve the problem for
the vast majority of ext4's users.


- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-09 Thread Ted Ts'o
Hey Jacek,

I'm curious about the parameters of the set of directories on your
production server.  On an ext4 file system, assuming you've copied the
directories over, what are the result of this command pipeline when
you are cd'ed into the top of the directory hierarchy of interest
(your svn tree, as I recall)?

find . -type d -ls | awk '{print $7}' | sort -n | uniq -c

I'm just curious what the distribution of directories sizes are in
practice for your use case...

Thanks,

- Ted




Re: getdents - ext4 vs btrfs performance

2012-03-02 Thread Ted Ts'o
On Fri, Mar 02, 2012 at 09:26:51AM -0500, Chris Mason wrote:
> 
> filefrag will tell you how many extents each file has, any file with
> more than one extent is interesting.  (The ext4 crowd may have better
> suggestions on measuring fragmentation).

You can get a *huge* amount of information (probably more than you'll
want to analyze) by doing this:

 e2fsck -nf -E fragcheck /dev/ >& /tmp/fragcheck.out

I haven't had time to do this in a while, but a while back I used this
to debug the writeback code with an eye towards reducing
fragmentation.  At the time I was trying to optimize the case of
reducing fragmentation in the easiest case possible, where you start
with an empty file system, and then copy all of the data from another
file system onto it using rsync -avH.

It would be worth while to see what happens with files written by the
compiler and linker.  Given that libelf tends to write .o files
non-sequentially, and without telling us how big the space is in
advance, I could well imagine that we're not doing the best job
avoiding free space fragmentation, which eventually leads to extra
file system aging.

It would be interesting to have a project where someone added
fallocate() support into libelf, and then added some heuristics into
ext4 so that if a file is fallocated to a precise size, or if the file
is fully written and closed before writeback begins, that we use this
to more efficiently pack the space used by the files by the block
allocator.  This is a place where I would not be surprised that XFS
has some better code to avoid accelerated file system aging, and where
we could do better with ext4 with some development effort.

Of course, it might also be possible to hack around this by simply
using VPATH and dropping your build files in a separate place from
your source files, and periodically reformatting the file system where
your build tree lives.  (As a side note, something that works well for
me is to use an SSD for my source files, and a separate 5400rpm HDD
for my build tree.  That allows me to use a smaller and more
affordable SSD, and since the object files can be written
asynchronously by the writeback threads, but the compiler can't move
forward until it gets file data from the .c or .h file, it gets me the
best price/performance for a laptop build environment.)

BTW, I suspect we could make acp even more efficient by teaching it to
use the FIEMAP ioctl to map out the data blocks for all of the files in
the source file system, and then copy the files (or perhaps even
parts of files) in a read order which reduces seeking on the source
drive.
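
A sketch (mine; error handling trimmed) of the FIEMAP probe such a
tool could use to build its read-order sort key:

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>
    #include <linux/fiemap.h>

    /* return the physical address of a file's first extent, or 0 if
     * it cannot be determined */
    static unsigned long long first_extent_phys(int fd)
    {
            union {
                    struct fiemap fm;
                    char pad[sizeof(struct fiemap) +
                             sizeof(struct fiemap_extent)];
            } req;

            memset(&req, 0, sizeof(req));
            req.fm.fm_length = ~0ULL;     /* map the whole file */
            req.fm.fm_extent_count = 1;   /* only the first extent */
            if (ioctl(fd, FS_IOC_FIEMAP, &req.fm) < 0 ||
                req.fm.fm_mapped_extents == 0)
                    return 0;
            return req.fm.fm_extents[0].fe_physical;
    }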

- Ted


Re: getdents - ext4 vs btrfs performance

2012-03-01 Thread Ted Ts'o
On Thu, Mar 01, 2012 at 03:43:41PM +0100, Jacek Luczak wrote:
> 
> Yep, ext4 is close to my wife's closet.
> 

Were all of the file systems freshly laid down, or was this an aged
ext4 file system?

Also you should beware that if you have a workload which is heavy
parallel I/O, with lots of random, read/write accesses to small files,
a benchmark using tar might not be representative of what you will see
in production --- different file systems have different strengths and
weaknesses --- and the fact that ext3/ext4's readdir() returns inodes
in a non-optimal order for stat(2) or unlink(2) or file copy in the
cold cache case may not matter as much as you think in a build server.
(i.e., the directories that do need to be searched will probably be
serviced out of the dentry cache, etc.)

Regards,

- Ted


Re: A Plumber's Wish List for Linux

2011-10-14 Thread Ted Ts'o
On Thu, Oct 13, 2011 at 11:28:39AM +1100, Dave Chinner wrote:
> Yup. xfs_admin already provides an interface for offline
> modification of the UUID for XFS filesystems. I.e. clone the
> filesystem using xfs_copy, then run xfs_admin -U generate  to
> generate a new uuid in the cloned copy before you mount the
> clone

This is probably another thing which perhaps Ric Wheeler's proposed
"generic LVM / file system management front end" should abstract away,
since every single file system has a different way of setting the UUID
in an off-line way.  It's a relatively specialized feature, so I
wouldn't call it high priority to implement first.

  - Ted


Re: [PATCH 0/8] remove i_alloc_sem

2011-06-22 Thread Ted Ts'o
On Wed, Jun 22, 2011 at 01:54:25AM +0200, Jan Kara wrote:
> ext4_page_mkwrite()... Ted, what happened to that patch. Should I resend
> it?

So assuming I fix the refcounting issue in fs/ext4/page_io.c (which I
will do by not dropping the page's refcount until after the workqueue
finishes its job), does your patch need changing, or is it a matter of
fixing the commit description?

In any case, this got dragged out, and it's late in the -rc cycle for 3.0,
so I was planning on queuing this for the next merge window.  (Sorry
for that, this was mostly due to my not having enough time to really
dive into the issues involved.)

- Ted


Re: [PATCH] Check for immutable flag in fallocate path

2011-02-27 Thread Ted Ts'o
On Mon, Feb 21, 2011 at 05:50:21PM +0100, Marco Stornelli wrote:
> 2011/2/21 Christoph Hellwig :
> > On Mon, Feb 21, 2011 at 09:26:32AM +0100, Marco Stornelli wrote:
> >> From: Marco Stornelli 
> >>
> >> All fs must check for the immutable flag in their fallocate callback.
> >> It's possible to have a race condition in this scenario: an application
> >> opens a file in read/write and it does something, meanwhile root sets the
> >> immutable flag on the file, the application at that point can call
> >> fallocate with success. Only Ocfs2 checks for the immutable flag at the
> >> moment.
> >
> > Please add the check in fs/open.c:do_fallocate() so that it covers all
> > filesystems.
> >
> >
> 
> The check should be done after the fs got the inode mutex lock.

Why?  None of the other places which check the IMMUTABLE flag do so
under the inode mutex lock.  Yes, it's true that we're not doing
proper locking when updating i_flags from the ioctl (this is
true for all file systems), but this has been true for quite some
time, and using a mutex to protect bit set/clear/test operations would
be like using a sledgehammer to kill a fly.

A proper fix if we want to be completely correct about updates to
i_flags would involve using test_bit, set_bit, and clear_bit, which is
guaranteed to be atomic.  This is how we update the
ext4_inode_info->i_flags (which is different from inode->i_flags) (see
the definition and use of EXT4_INODE_BIT_FNS in fs/ext4/ext4.h).

At some point, it would be good to fix how we set/get i_flags values,
but that's independent of the change that's being discussed here.
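
For reference, the pattern being described is just the atomic bitops,
kernel-style; a sketch (assuming <linux/bitops.h>, in the spirit of
the EXT4_INODE_BIT_FNS helpers mentioned above):

    /* atomic single-bit accessors; no mutex needed around them */
    static inline int inode_flag_test(unsigned long *flags, int bit)
    {
            return test_bit(bit, flags);     /* atomic read */
    }

    static inline void inode_flag_set(unsigned long *flags, int bit)
    {
            set_bit(bit, flags);             /* atomic read-modify-write */
    }

    static inline void inode_flag_clear(unsigned long *flags, int bit)
    {
            clear_bit(bit, flags);           /* atomic read-modify-write */
    }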

  - Ted


Re: [PATCH 1/6] fs: add hole punching to fallocate

2011-01-11 Thread Ted Ts'o
On Tue, Jan 11, 2011 at 04:13:42PM -0500, Lawrence Greenfield wrote:
> > IOWs, all they want to do is avoid the unwritten extent conversion
> > overhead. Time has shown that a bad security/performance tradeoff
> > decision was made 13 years ago in XFS, so I see little reason to
> > repeat it for ext4 today

I suspect things may have changed somewhat, both in terms of the
requirements and nature of cluster file systems, and the performance of
various storage systems (including PCIe-attached flash devices).

> I'd make use of FALLOC_FL_EXPOSE_OLD_DATA. It's not the CPU overhead
> of extent conversion. It's that extent conversion causes more metadata
> operations than what you'd have otherwise, which means systems that
> want to use O_DIRECT and make sure the data doesn't go away either
> have to write O_DIRECT|O_DSYNC or need to call fdatasync().
> 
> cluster file system implementor,

One possibility might be to make it an optional feature which is only
enabled via a mount option.  That way someone would have to explicitly
ask for this feature in two ways: via a new flag to fallocate, and via
a mount option.

It might not make sense for XFS, but for people who are using ext4 as
the local storage file system back-end, and are doing all sorts of
things to get the best performance, including disabling the journal, I
suspect it really would make sense.  So it could always be an
optional-to-implement flag that not all file systems would feel
obliged to implement.
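
The call itself would presumably look like this (entirely
hypothetical: FALLOC_FL_EXPOSE_OLD_DATA was never merged, and its
value here is invented for illustration):

    #define _GNU_SOURCE
    #include <fcntl.h>

    #ifndef FALLOC_FL_EXPOSE_OLD_DATA
    #define FALLOC_FL_EXPOSE_OLD_DATA 0x80  /* invented value */
    #endif

    /* preallocate without unwritten-extent tracking; the caller
     * explicitly accepts that stale on-disk data may be exposed */
    int prealloc_unsafe(int fd, off_t len)
    {
            return fallocate(fd, FALLOC_FL_EXPOSE_OLD_DATA, 0, len);
    }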

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-12 Thread Ted Ts'o
On Sun, Dec 12, 2010 at 07:11:28AM -0600, Jon Nelson wrote:
> I'm glad you've been able to reproduce the problem! If you should need
> any further assistance, please do not hesitate to ask.

This patch seems to fix the problem for me.  (Unless the partition is
mounted with mblk_io_submit.)

Could you confirm that it fixes it for you as well?

Thanks!

- Ted

commit a8649d85bd0ab3be6014918bd9eae319a0ffc691
Author: Theodore Ts'o 
Date:   Sun Dec 12 20:57:19 2010 -0500

ext4: Turn off multiple page-io submission by default

Jon Nelson has found a test case which causes postgresql to fail with
the error:

psql:t.sql:4: ERROR: invalid page header in block 38269 of relation base/16384/16581

Under memory pressure, it looks like part of a file can end up getting
replaced by zeros.  Until we can figure out the cause, we'll roll
back the change and use block_write_full_page() instead of
ext4_bio_write_page().  The new, more efficient writing function can
be used via the mount option mblk_io_submit, so we can test and fix
the new page I/O code.

To reproduce the problem, install postgres 8.4 or 9.0, and pin enough
memory such that the system is just at the point of triggering writeback
before running the following sql script:

begin;
create temporary table foo as select x as a, ARRAY[x] as b FROM
generate_series(1, 1000 ) AS x;
create index foo_a_idx on foo (a);
create index foo_b_idx on foo USING GIN (b);
rollback;

If the temporary table is created on a hard drive partition which is
encrypted using dm_crypt, then under memory pressure, approximately
30-40% of the time, pgsql will issue the above failure.

This patch should fix this problem, and the problem will come back if
the file system is mounted with the mblk_io_submit mount option.

Reported-by: Jon Nelson 
Signed-off-by: "Theodore Ts'o" 

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6a5edea..6eb598b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -910,6 +910,7 @@ struct ext4_inode_info {
 #define EXT4_MOUNT_JOURNAL_CHECKSUM	0x800000 /* Journal checksums */
 #define EXT4_MOUNT_JOURNAL_ASYNC_COMMIT	0x1000000 /* Journal Async Commit */
 #define EXT4_MOUNT_I_VERSION		0x2000000 /* i_version support */
+#define EXT4_MOUNT_MBLK_IO_SUBMIT	0x4000000
 #define EXT4_MOUNT_DELALLOC		0x8000000 /* Delalloc support */
 #define EXT4_MOUNT_DATA_ERR_ABORT	0x10000000 /* Abort on file data write */
 #define EXT4_MOUNT_BLOCK_VALIDITY	0x20000000 /* Block validity checking */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bdbe699..e659597 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2125,9 +2125,12 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
 */
if (unlikely(journal_data && PageChecked(page)))
err = __ext4_journalled_writepage(page, len);
-   else
+   else if (test_opt(inode->i_sb, MBLK_IO_SUBMIT))
err = ext4_bio_write_page(&io_submit, page,
  len, mpd->wbc);
+   else
+   err = block_write_full_page(page,
+   noalloc_get_block_write, mpd->wbc);
 
if (!err)
mpd->pages_written++;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e32195d..fb15c9c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1026,6 +1026,8 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
!(def_mount_opts & EXT4_DEFM_NODELALLOC))
seq_puts(seq, ",nodelalloc");
 
+   if (test_opt(sb, MBLK_IO_SUBMIT))
+   seq_puts(seq, ",mblk_io_submit");
if (sbi->s_stripe)
seq_printf(seq, ",stripe=%lu", sbi->s_stripe);
/*
@@ -1239,8 +1241,8 @@ enum {
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_jqfmt_vfsv1, Opt_quota,
Opt_noquota, Opt_ignore, Opt_barrier, Opt_nobarrier, Opt_err,
Opt_resize, Opt_usrquota, Opt_grpquota, Opt_i_version,
-   Opt_stripe, Opt_delalloc, Opt_nodelalloc,
-   Opt_block_validity, Opt_noblock_validity,
+   Opt_stripe, Opt_delalloc, Opt_nodelalloc, Opt_mblk_io_submit,
+   Opt_nomblk_io_submit, Opt_block_validity, Opt_noblock_validity,
Opt_inode_readahead_blks, Opt_journal_ioprio,
Opt_dioread_nolock, Opt_dioread_lock,
Opt_discard, Opt_nodiscard,
@@ -1304,6 +1306,8 @@ static const match_table_t tokens = {
{Opt_resize, "resize"},
{Opt_delalloc, "delalloc"},
{Opt_nodelalloc, "nodelalloc"},
+   {Opt_mblk_io_submit, "mblk_io_submit"},
+   {Opt_nomblk_io_submit, "nomblk_io_submit"},
{Opt_block_validity, "block_v

Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-12 Thread Ted Ts'o
On Sun, Dec 12, 2010 at 04:18:29AM -0600, Jon Nelson wrote:
> > I have one CPU configured in the environment, 512MB of memory.
> > I have not done any memory-constriction tests whatsoever.

I've finally been able to reproduce it myself, on real hardware.  SMP
is not necessary to reproduce it, although having more than one CPU
doesn't hurt.  What I did need to do (on real hardware with 4 gigs of
memory) was to turn off swap and pin enough memory so that free memory
was around 200megs or so before the start of the test.  (This is the
natural amount of free memory that the system would try to reach,
since 200 megs is about 5% of 4 gigs.)

Then, during the test, free memory would drop to 50-70 megabytes,
forcing writeback to run, and then I could trigger it about 1-2 times
out of three.

I'm guessing that when you used 512mb of memory, that was in effect a
memory-constriction test, and if you were to push the memory down a
little further, it might reproduce even more quickly.  My next step is
to try to reproduce this in a VM, and then I can start probing to see
what might be going on.

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-11 Thread Ted Ts'o
One experiment --- can you try this with the file system mounted with
data=writeback, and see if the problem reproduces in that journalling
mode?

I want to rule out (if possible) journal_submit_inode_data_buffers()
racing with mpage_da_submit_io().  I don't think that's the issue, but
I'd prefer to do the experiment to make sure.  So if you can use a
kernel and system configuration which triggers the problem, and then
try changing the mount options to include data=writeback, and then
rerun the test, and let me know if the problem still reproduces, I'd
be really grateful.

Thanks,

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-11 Thread Ted Ts'o
On Fri, Dec 10, 2010 at 08:14:56PM -0600, Jon Nelson wrote:
> > Barring false negatives, bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc
> > appears to be the culprit (according to git bisect).
> > I will test bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc again, confirm
> > the behavior, and work backwards to try to reduce the possibility of
> > false negatives.
> 
> A few additional notes, in no particular order:
> 
> - For me, triggering the problem is fairly easy when encryption is involved.
> - I'm now 81 iterations into testing
> bd2d0210cf22f2bd0cef72eb97cf94fc7d31d8cc *without* encryption.  Out of
> 81 iterations, I have 4 failures: #16, 40, 62, and 64.
> 
> I will now try 1de3e3df917459422cb2aecac440febc8879d410 much more extensively.
> 
> Is this useful information?

Yes, indeed.  Is this in the virtualized environment or on real
hardware at this point?  And how many CPU's do you have configured in
your virtualized environment, and how much memory?  Is having a
certain number of CPU's critical for reproducing the problem?  Is
constricting the amount of memory important?

It'll be a lot easier if I can reproduce it locally, which is why I'm
asking all of these questions.

Thanks,

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-09 Thread Ted Ts'o
On Fri, Dec 10, 2010 at 02:53:30AM +0100, Matt wrote:
> 
> Try a kernel before 5a87b7a5da250c9be6d757758425dfeaf8ed3179
> 
> from the tests I've done that one showed the least or no corruption if
> you count the empty /etc/env.d/03opengl as an artefact

Yes, that's a good test.  Also try commit bd2d0210cf.  The patch
series most likely to be at fault, if there is a regression, is the
one between 5a87b7a5d and bd2d0210cf inclusive.

I did a lot of testing before submitting it, but that was a tricky
rewrite.  If you can reproduce the problem reliably, it might be good
to try commit 16828088f9 (the commit before 5a87b7a5d) and commit
bd2d0210cf.  If it reliably reproduces on bd2d0210cf, but is clean on
16828088f9, then it's my ext4 block i/o submission patches, and we'll
need to either figure out what's going on or back out that set of
changes.

If that's the case, a bisect of those changes (it's only 6 commits, so
it shouldn't take long) would be most appreciated.

Regards,

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-09 Thread Ted Ts'o
On Thu, Dec 09, 2010 at 12:10:58PM -0600, Jon Nelson wrote:
> 
> You should be OK, there. Are you using encryption or no?
> I had difficulty replicating the issue without encryption.

Yes, I'm using encryption.  LUKS with aes-xts-plain-sha256, and then
LVM on top of LUKS.

> > If you can point out how to query pgsql_tmp (I'm using a completely
> > default postgres install), that would be helpful, but I don't think it
> > would be going anywhere else.
> 
> Normally it's /var/lib/pgsql/data/pgsql_tmp (or
> /var/lib/postgres/data/pgsql_tmp in your case). By placing
> /var/lib/{postgresql,pgsql}/data on the LUKS + ext4 volume, on both
> openSUSE 11.3 and Kubuntu, I was able to replicate the problem easily,
> in VirtualBox. I can give qemu a try. In both cases I was using a
> 2.6.37x kernel.

Ah, I'm not using virtualization.  I'm running on an X410 laptop, on
raw hardware.  Perhaps virtualization slows things down enough that it
triggers?  Or maybe you're running with more constrained memory than
I am?  How much memory do you have configured in your VM?

  - Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-09 Thread Ted Ts'o
On Tue, Dec 07, 2010 at 09:37:20PM -0600, Jon Nelson wrote:
> One difference is the location of the transaction logs (pg_xlog). In
> my case, /var/lib/pgsql/data *is* mountpoint for the test volume
> (actually, it's a symlink to the mount point). In your case, that is
> not so. Perhaps that makes a difference?  pgsql_tmp might also be on
> two different volumes in your case (I can't be sure).

I just tried to run t.sql five times with /var/lib/postgres as a
mountpoint for a 5400 rpm disk, and I still haven't been able to
replicate it.

If you can point out how to query pgsql_tmp (I'm using a completely
default postgres install), that would be helpful, but I don't think it
would be going anywhere else.

Still trying

- Ted


Re: hunt for 2.6.37 dm-crypt+ext4 corruption? (was: Re: dm-crypt barrier support is effective)

2010-12-07 Thread Ted Ts'o
On Tue, Dec 07, 2010 at 01:22:43PM -0500, Mike Snitzer wrote:
> > 1. create a database (from bash):
> > 
> > createdb test
> > 
> > 2. place the following contents in a file (I used 't.sql'):
> > 
> > begin;
> > create temporary table foo as select x as a, ARRAY[x] as b FROM
> > generate_series(1, 1000 ) AS x;
> > create index foo_a_idx on foo (a);
> > create index foo_b_idx on foo USING GIN (b);
> > rollback;
> > 
> > 3. execute that sql:
> > 
> > psql -f t.sql --echo-all test
> > 
> > With 2.6.34.7 I can re-run [3] all day long, as many times as I want,
> > without issue.
> > 
> > With 2.6.37-rc4-13 (the currently-installed KOTD kernel) it fails
> > pretty frequently.

So I just tried to reproduce this on an Ubuntu 10.04 system running
2.6.37-rc5 (completely stock except for a few apparmor patches that I
needed to keep the apparmor userspace from complaining).  I'm using
Postgres 8.4.5-0ubuntu10.04.

Using the above procedure, I wasn't able to reproduce.  Then I
realized this might have been because I was using an SSD root file
system (which is secured using LUKS/dm-crypt, with LVM on top of
dm-crypt).  So I mounted a file system on a 5400 rpm laptop disk, which
is also protected using LUKS/dm-crypt with LVM on top.  I then
executed the PostgreSQL commands:

CREATE TABLESPACE test LOCATION '/kbuild/postgres';
SET default_tablespace = test;
COMMIT
\quit

I then re-ran the above procedure, and verified that all of the I/O
was going to the 5400rpm laptop disk.

I then ran the above procedure a half-dozen times, and I still haven't
been able to reproduce any PostgreSQL errors or kernel errors.

Jon, can you help me identify what might be different with your run
and mine?  What version of Postgres are you using?

Thanks,

- Ted


Re: [dm-devel] hunt for 2.6.37 dm-crypt+ext4 corruption?

2010-12-05 Thread Ted Ts'o
On Sun, Dec 05, 2010 at 02:44:14PM +0100, Matt wrote:
> gcc version 4.5.1 (Gentoo Hardened 4.5.1-r1 p1.4, pie-0.4.5)

This is probably just me being paranoid, but it might be worth trying
a gcc 4.4.x compiler to see if that makes any difference.
There have been some other gcc 4.5-caused problems with the kernel,
and it might be a good idea to rule those out.  (I'm using gcc 4.4.3,
myself, and I'm deliberately avoiding the use of gcc 4.5 for now.)

   - Ted


Re: [patch] fix up lock order reversal in writeback

2010-11-17 Thread Ted Ts'o
On Wed, Nov 17, 2010 at 05:10:57PM +1100, Nick Piggin wrote:
> On Tue, Nov 16, 2010 at 11:05:52PM -0600, Eric Sandeen wrote:
> > On 11/16/10 10:38 PM, Nick Piggin wrote:
> > >> as for the locking problems ... sorry about that!
> > > 
> > > That's no problem. So is that an ack? :)
> > > 
> > 
> > I'd like to test it with the original case it was supposed to solve; will
> > do that tomorrow.
> 
> OK, but it shouldn't make much difference, unless there is a lot of
> strange activity happening on the sb (like mount / umount / remount /
> freeze / etc).

This makes sense to me as well.

Acked-by: "Theodore Ts'o" 

So how do we want to send this patch to Linus?  It's a writeback
change, so through some mm tree?  Or it lives in fs/fs-writeback.c
(which I always thought was weird; why is it not in mm/?), so maybe
through the VFS tree, even though I don't think Al would really care
about this patch.

Or I can push it to Linus through the ext4 tree

   - Ted


Re: [PATCH 4/6] Ext4: fail if we try to use hole punch

2010-11-16 Thread Ted Ts'o
> > There is no simple way to test if a filesystem supports hole punching or not
> > so the check has to be done per fs.  Thanks,
> 
> Could put a flag word in superblock_operations.  Filesystems which
> support punching (or other features) can enable it there.

No, it couldn't be in super_operations.  It may vary on a per-inode
basis for some file systems, such as ext4 (depending on whether the
inode is extent-mapped or indirect-block mapped).

So at least for ext4 we'd need to call into the fallocate() function
anyway, once we add support.  I suppose if other file systems really
want it, we could add a flag to the super block ops structure, so they
don't have to do the "do we support the punch" operation.  I can go
either way on that; although if we think the majority of file systems
are going to support punch in the long term, then it might not be worth
it to add such a flag.
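
To make the per-inode point concrete, the gate would look something
like this (just a sketch, using the punch flag from this patch
series and ext4's existing inode-flag test):

/* Sketch only: a per-inode punch-hole gate inside an fallocate
 * method.  Extent-mapped ext4 inodes could support the punch;
 * indirect-block-mapped ones could not, which is why a single
 * superblock flag isn't sufficient. */
static long ext4_fallocate_sketch(struct inode *inode, int mode,
				  loff_t offset, loff_t len)
{
	if ((mode & FALLOC_FL_PUNCH_HOLE) &&
	    !ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
		return -EOPNOTSUPP;	/* this inode can't punch holes */

	/* ... the usual allocation / punch paths go here ... */
	return 0;
}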

- Ted


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-09 Thread Ted Ts'o
On Tue, Nov 09, 2010 at 03:42:42PM +1100, Dave Chinner wrote:
> Implementation is up to the filesystem. However, XFS does (b)
> because:
> 
>   1) it was extremely simple to implement (one of the
>  advantages of having an exceedingly complex allocation
>  interface to begin with :P)
>   2) conversion is atomic, fast and reliable
>   3) it is independent of the underlying storage; and
>   4) reads of unwritten extents operate at memory speed,
>  not disk speed.

Yeah, I was thinking that using a device-style TRIM might be better
since future attempts to write to it won't require a separate seek to
modify the extent tree.  But yeah, there are a bunch of advantages of
simply mutating the extent tree.

While we're on the subject of changes to fallocate, what do people
think of FALLOC_FL_EXPOSE_OLD_DATA, which requires either root
privileges or (if capabilities are in use) CAP_DAC_OVERRIDE &&
CAP_MAC_OVERRIDE && CAP_SYS_ADMIN.  This would allow a trusted process
to fallocate blocks with the extent already marked initialized.  I've
had two requests for such functionality for ext4 already.  

(Take for example a trusted cluster filesystem backend that checks the
object checksum before returning any data to the user; and if the
check fails the cluster file system will try to use some other replica
stored on some other server.)
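
A minimal sketch of the privilege check being proposed; the flag and
its value are hypothetical, since nothing like this is upstream:

#define FALLOC_FL_EXPOSE_OLD_DATA	0x10	/* hypothetical value */

/* Sketch: gate the flag hard, since extents marked initialized
 * without zeroing expose whatever stale data is on disk. */
static int check_expose_old_data(int mode)
{
	if (!(mode & FALLOC_FL_EXPOSE_OLD_DATA))
		return 0;
	if (capable(CAP_DAC_OVERRIDE) && capable(CAP_MAC_OVERRIDE) &&
	    capable(CAP_SYS_ADMIN))
		return 0;
	return -EPERM;
}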

 - Ted


Re: [PATCH 1/6] fs: add hole punching to fallocate

2010-11-08 Thread Ted Ts'o
On Tue, Nov 09, 2010 at 12:12:22PM +1100, Dave Chinner wrote:
> Hole punching was not included originally in fallocate() for a
> variety of reasons. IIRC, they were along the lines of:
> 
>   1 de-allocating of blocks in an allocation syscall is wrong.
> People wanted a new syscall for this functionality.
>   2 no glibc interface needs it
>   3 at the time, only XFS supported punching holes, so there
> is not need to support it in a generic interface
>   4 the use cases presented were not considered compelling
> enough to justify the additional complexity (!)
> 
> In the end, I gave up arguing for it to be included because just
> getting the FALLOC_FL_KEEP_SIZE functionality was a hard enough
> battle.
> 
> Anyway, #3 isn't the case any more, #4 was just an excuse not to
> support anything ext4 couldn't do and lots of apps are calling
> fallocate directly (because glibc can't use FALLOC_FL_KEEP_SIZE) so
> #2 isn't an issue, either.

I don't recall anyone arguing #4 because of ext4, but I get very tired
of the linux-fsdevel bike-shed painting parties, so I often will
concede whatever is necessary just to get the !@#! interface in,
assuming we could add more flags later.

glibc does support fallocate(), BTW; it's just posix_fallocate() that
doesn't use FALLOC_FL_KEEP_SIZE.

> I guess that leaves #1 to be debated;
> I don't think there is any problem with doing what you propose.

I don't have a problem either.

As a completely separate proposal, what do people think about an
FALLOCATE_FL_ZEROIZE after which time the blocks are allocated, but
reading from them returns zero?  This could be done either by (a)
sending a discard in the case of devices where discard_zeroes_data is
true and discard_granularity is less than the fs block size, or (b) by
setting the uninitialized flag in the extent tree.
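
As a sketch of how the (a)/(b) choice might be wired up; the queue
limits are the real ones, but mark_extents_uninitialized() is an
assumed helper standing in for the filesystem's extent-tree update:

static int zeroize_range(struct super_block *sb, sector_t block,
			 sector_t nr_blocks)
{
	struct request_queue *q = bdev_get_queue(sb->s_bdev);

	/* discard only if the device reliably zeroes discarded blocks
	 * at a granularity no coarser than the fs block size */
	if (blk_queue_discard(q) && q->limits.discard_zeroes_data &&
	    q->limits.discard_granularity <= sb->s_blocksize)
		return sb_issue_discard(sb, block, nr_blocks,
					GFP_NOFS, 0);

	/* otherwise fall back to marking the extents uninitialized */
	return mark_extents_uninitialized(sb, block, nr_blocks);
}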

- Ted


Re: BTRFS: Unbelievably slow with kvm/qemu

2010-09-02 Thread Ted Ts'o
On Tue, Aug 31, 2010 at 02:58:44PM -0700, K. Richard Pixley wrote:
>  On 20100831 14:46, Mike Fedyk wrote:
> >There is little reason not to use duplicate metadata.  Only small
> >files (less than 2kb) get stored in the tree, so there should be no
> >worries about images being duplicated without data duplication set at
> >mkfs time.
> My benchmarks show that for my kinds of data, btrfs is somewhat
> slower than ext4, (which is slightly slower than ext3 which is
> somewhat slower than ext2), when using the defaults, (ie, duplicate
> metadata).
> 
> It's a hair faster than ext2, (the fastest of the ext family), when
> using singleton metadata.  And ext2 isn't even crash resistant while
> btrfs has snapshots.

I'm really, really curious.  Can you describe your data and your
workload in detail?  You mentioned "continuous builders"; is this some
kind of tinderbox setup?

- Ted


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Ted Ts'o
On Wed, Aug 25, 2010 at 05:30:42PM -0700, David Rientjes wrote:
> 
> We certainly hope that nobody will reimplement the same function without 
> the __deprecated warning, especially for order < PAGE_ALLOC_COSTLY_ORDER 
> where there's no looping at a higher level.  So perhaps the best 
> alternative is to implement the same _nofail() functions but do a 
> WARN_ON(get_order(size) > PAGE_ALLOC_COSTLY_ORDER) instead?

Yeah, that sounds better.
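
For concreteness, the proposal amounts to something like this (a
sketch, not merged code):

/* Sketch of the proposed helper: warn on allocations big enough
 * that the page allocator itself won't loop, instead of
 * deprecating __GFP_NOFAIL users outright. */
static inline void *kmalloc_nofail(size_t size, gfp_t flags)
{
	WARN_ON(get_order(size) > PAGE_ALLOC_COSTLY_ORDER);
	return kmalloc(size, flags | __GFP_NOFAIL);
}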

> I think it's really sad that the caller can't know what the upper bounds 
> of its memory requirement are ahead of time or at least be able to 
> implement a memory freeing function when kmalloc() returns NULL.

Oh, we can determine an upper bound.  You might just not like it.
Actually ext3/ext4 shouldn't be as bad as XFS, which Dave estimated to
be around 400k for a transaction.  My guess is that the worst case for
ext3/ext4 is probably around 256k or so; like XFS, most of the time,
it would be a lot less.  (At least, if data != journalled; if we are
doing data journalling and every single data block begins with
0xc03b3998U, we'll need to allocate a 4k page for every single data
block written.)  We could dynamically calculate an upper bound if we
had to.  Of course, if ext3/ext4 is attached to a network block
device, then it could get a lot worse than 256k, of course.
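
(For context: the constant quoted above is the JBD header magic from
include/linux/jbd.h, and the escape test is roughly the following
sketch.)

#include <linux/jbd.h>

/* A journalled data block whose first word happens to match the
 * journal's own magic must be escaped, costing an extra buffer. */
static int block_needs_escaping(const void *data)
{
	return *(const __be32 *)data == cpu_to_be32(JBD_MAGIC_NUMBER);
}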

- Ted



Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Ted Ts'o
On Wed, Aug 25, 2010 at 04:11:38PM -0700, David Rientjes wrote:
> 
> I'll repropose the patchset with __deprecated as you suggested.  Thanks!

And what Dave and I are saying is that we'll either need to do our own
loop to avoid the deprecation warning, or the deprecated function will
probably be used forever...

 - Ted


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Ted Ts'o
On Wed, Aug 25, 2010 at 03:35:42PM +0200, Peter Zijlstra wrote:
> 
> While I appreciate that it might be somewhat (a lot) harder for a
> filesystem to provide that guarantee, I'd be deeply worried about your
> claim that its impossible.
> 
> It would render a system without swap very prone to deadlocks. Even with
> the very tight dirty page accounting we currently have you can fill all
> your memory with anonymous pages, at which point there's nothing free
> and you require writeout of dirty pages to succeed.

For file systems that do delayed allocation, the situation is very
similar to swapping over NFS.  Sometimes in order to make some free
memory, you need to spend some free memory...  which implies that for
these file systems, being more aggressive about triggering writeout,
and being more aggressive about throttling processes which are
creating too many dirty pages, especially dirty delayed allocation
pages (regardless of whether this is via write(2) or accessing mmapped
memory), is a really good idea.

A pool of free pages which is reserved for routines that are doing
page cleaning would probably also be a good idea.  Maybe that's just
retrying with GFP_ATOMIC if a normal allocation fails, or maybe we
need our own special pool, or maybe we need to dynamically resize the
GFP_ATOMIC pool based on how many subsystems might need to use it

Just brainstorming here; what do people think?
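
To illustrate the reserved-pool idea, the closest existing mechanism
is a mempool; roughly like this (names and sizes made up):

#include <linux/mempool.h>

static mempool_t *writeout_pool;	/* assumed name */

static int __init writeout_pool_init(void)
{
	/* keep a reserve of 64 pages strictly for page-cleaning paths */
	writeout_pool = mempool_create_page_pool(64, 0);
	return writeout_pool ? 0 : -ENOMEM;
}

/* Page-cleaning callers dip into the reserve when the normal
 * allocator is running dry; with a sleeping gfp mask mempool_alloc
 * waits for an element instead of failing. */
static struct page *writeout_alloc_page(void)
{
	return mempool_alloc(writeout_pool, GFP_NOFS);
}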

- Ted


Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Ted Ts'o
On Wed, Aug 25, 2010 at 01:35:32PM +0200, Peter Zijlstra wrote:
> On Wed, 2010-08-25 at 07:24 -0400, Ted Ts'o wrote:
> > Part of the problem is that we have a few places in the kernel where
> > failure is really not an option --- or rather, if we're going to fail
> > while we're in the middle of doing a commit, our choices really are
> > (a) retry the loop in the jbd layer (which Andrew really doesn't
> > like), (b) keep our own private cache of free memory so we don't fail
> > and/or loop, (c) fail the file system and mark it read-only, or (d)
> > panic. 
> 
> d) do the allocation before you're committed to going fwd and can still
> fail and back out.

Sure in some cases that can be done, but the commit has to happen at
some point, or we run out of journal space, at which point we're back
to (c) or (d).

- Ted




Re: [patch 1/5] mm: add nofail variants of kmalloc kcalloc and kzalloc

2010-08-25 Thread Ted Ts'o
On Tue, Aug 24, 2010 at 01:11:26PM -0700, David Rientjes wrote:
> On Tue, 24 Aug 2010, Jens Axboe wrote:
> 
> > Should be possible to warn at build time for anyone using __GFP_NOFAIL
> > without wrapping it in a function.
> > 
> 
> We could make this __deprecated functions as Peter suggested if you think 
> build time warnings for existing users would be helpful.

Let me take a few steps backwards and look at the problem from a
somewhat higher level.

Part of the problem is that we have a few places in the kernel where
failure is really not an option --- or rather, if we're going to fail
while we're in the middle of doing a commit, our choices really are
(a) retry the loop in the jbd layer (which Andrew really doesn't
like), (b) keep our own private cache of free memory so we don't fail
and/or loop, (c) fail the file system and mark it read-only, or (d)
panic.

There are other places where we can fail safely (for example, in jbd's
start_this_handle, although that just pushes the problem up the stack,
and ultimately to userspace, where most userspace programs don't
really expect ENOMEM to get returned by a random disk write --- how
much do _you_ trust a random GNOME or KDE developer to do correct
error checking and do something sane at the application level?)

So we can mark the retry loop helper function as deprecated, and that
will make some of these cases go away, but ultimately if we're going
to fail the memory allocation, something bad is going to happen, and
the only question is whether we want to have something bad happen by
looping in the memory allocator, or to force the file system to
panic/oops the system, or have random application die and/or lose user
data because they don't expect write() to return ENOMEM.

So at some level it would be nice if we had a few different levels of
"we *really* need this memory".  Level 0 might be, "if we can't get
it, no biggie, we'll figure out some other way around it.  I doubt
there is much at level 0, but in theory, we could have some good
citizens that fall in that camp and who simply will bypass some
optimization if they can't get the memory.  Level 1 might be, if you
can't get the memory, we will propagate a failure up to userspace, but
it's probably a relatively "safe" place to fail (i.e., when the user
is opening a file).  Level 2 might be, "if you can't get the memory,
we will propgate a failure up to userspace, but it's at a space where
most applications are lazy f*ckers, and this may lead to serious
application errors" (example: close(2), and this is a file system that
only pushes the file to the server at close time, e.g. AFS).  Level 3
might be, "if you can't get the memory, I'm going to fail the file
system, or some other global subsystem, that will probably result in
the system crashing or needing to be rebooted".
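
Expressed as code, the idea is roughly an extra criticality hint
alongside the gfp flags (entirely hypothetical):

/* Entirely hypothetical: the four "levels" expressed as an
 * allocation-criticality hint; nothing like this exists today. */
enum alloc_criticality {
	ALLOC_LEVEL0,	/* optional: just skip an optimization on failure */
	ALLOC_LEVEL1,	/* failure propagates safely, e.g. at open(2) */
	ALLOC_LEVEL2,	/* failure reaches careless callers, e.g. close(2) on AFS */
	ALLOC_LEVEL3,	/* failure takes down a filesystem or the machine */
};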

We can ignore this problem and pretend it doesn't exist at the memory
allocator level, but that means the callers are going to be doing
their own thing to try to avoid having really bad things happening at
really-low-memory occasions.  And this may mean looping, whether we
mark the function as deprecated or not.

This is becoming even more of an issue now given that with
containerization, we have jokers who are forcing tasks to run in very
small memory containers, which means that failures can happen far more
frequently --- and in some cases, just because the process running the
task happens to be in an extremely constrained memory cgroup, doesn't
mean that failing the entire system is really such a great idea.  Or
maybe that means memory cgroups are kinda busted.  :-)

   - Ted