from:"Mingming Cao"

Re: [patch 12/26] mount options: fix ext4

2008-01-25 Thread Mingming Cao

On Thu, 2008-01-24 at 20:33 +0100, Miklos Szeredi wrote:
 plain text document attachment (ext4_opts.patch)
 From: Miklos Szeredi [EMAIL PROTECTED]
 
 Add stripe= option to /proc/mounts for ext4 filesystems.
 
 Signed-off-by: Miklos Szeredi [EMAIL PROTECTED]
 ---
 
 Index: linux/fs/ext4/super.c
 ===
 --- linux.orig/fs/ext4/super.c2008-01-23 12:57:07.0 +0100
 +++ linux/fs/ext4/super.c 2008-01-23 21:43:51.0 +0100
 @@ -742,7 +742,8 @@ static int ext4_show_options(struct seq_
   seq_puts(seq, ,nomballoc);
   if (!test_opt(sb, DELALLOC))
   seq_puts(seq, ,nodelalloc);
 -
 + if (sbi-s_stripe)
 + seq_printf(seq, ,stripe=%lu, sbi-s_stripe);
 
   /*
* journal mode get enabled in different ways
 

Added to ext4 patch queue. Thanks!
http://repo.or.cz/w/ext4-patch-queue.git

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-11-30 Thread Mingming Cao

On Fri, 2007-11-30 at 17:08 -0600, Eric Sandeen wrote:
 Mingming Cao wrote:
  [PATCH] jbd2 stats through procfs
  
  The patch below updates the jbd stats patch to 2.6.20/jbd2.
  The initial patch was posted by Alex Tomas in December 2005
  (http://marc.info/?l=linux-ext4m=113538565128617w=2).
  It provides statistics via procfs such as transaction lifetime and size.
  
  [ This probably should be rewritten to use debugfs?   -- Ted]
  
  Signed-off-by: Johann Lombardi [EMAIL PROTECTED]
 
 I've started going through this one to clean it up to the point where it
 can go forward.  It's been sitting at the top of the unstable portion of
 the patch queue for long enough, I think :)
 
Thanks!

 I've converted the msecs to jiffies until the user boundary, changed the
 union #defines as suggested by Andrew, and various other little issues etc.
 
 Remaining to do is a generic time-difference calculator (instead of
 jbd2_time_diff), and looking into whether it should be made a config
 option; I tend to think it should, but it's fairly well sprinkled
 through the code, so I'll see how well that works.
 
 Also did we ever decided if this should go to debugfs?
 

I thought it was decided to keep it on procfs as debugfs is not always
on...
 Thanks,
 
 -Eric
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/4] 64k pagesize/blocksize fixes

2007-09-26 Thread Mingming Cao

On Tue, 2007-09-25 at 16:30 -0700, Christoph Lameter wrote:
 Attached the fixes necessary to support 64k pagesize/blocksize. I think these 
 are useful
 independent of the large blocksize patchset since there are architectures 
 that support
 64k page size and that could use these large buffer sizes without the large 
 buffersize
 patchset.
 
 Are these patches in the right shape to be merged? I rediffed these against 
 2.6.32-rc8-mm1.
 
 I had to fix some things in the second patch (ext2) that may need some review 
 since the
 way that commits work changed.
 

Thanks,  As Andreas mentioned in another email there is a better way to
deal with rec_len issue on 64k block size. I will get those patches
merged in ext4 patch queue. I think the updated patches could get
merged.


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Mingming Cao

Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
with the rest of kmalloc flag used in the JBD/JBD2 layer.

Signed-off-by: Mingming Cao [EMAIL PROTECTED]

---
 fs/jbd/journal.c  |6 +++---
 fs/jbd/revoke.c   |8 
 fs/jbd2/journal.c |6 +++---
 fs/jbd2/revoke.c  |8 
 4 files changed, 14 insertions(+), 14 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:51:10.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 11:51:57.0 -0700
@@ -653,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
journal-j_blocksize = blocksize;
n = journal-j_blocksize / sizeof(journal_block_tag_t);
journal-j_wbufsize = n;
-   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
@@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal-j_blocksize / sizeof(journal_block_tag_t);
journal-j_wbufsize = n;
-   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-19 11:51:30.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 -0700
@@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
while((tmp = 1UL) != 0UL)
shift++;
 
-   journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
+   journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
GFP_NOFS);
if (!journal-j_revoke_table[0])
return -ENOMEM;
journal-j_revoke = journal-j_revoke_table[0];
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal-j_revoke-hash_shift = shift;
 
journal-j_revoke-hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal-j_revoke-hash_table) {
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
journal-j_revoke = NULL;
@@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
for (tmp = 0; tmp  hash_size; tmp++)
INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]);
 
-   journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
+   journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_NOFS);
if (!journal-j_revoke_table[1]) {
kfree(journal-j_revoke_table[0]-hash_table);
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
@@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
journal-j_revoke-hash_shift = shift;
 
journal-j_revoke-hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
if (!journal-j_revoke-hash_table) {
kfree(journal-j_revoke_table[0]-hash_table);
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
Index: linux-2.6.23-rc6/fs/jbd2/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd2/journal.c 2007-09-19 11:52:48.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd2/journal.c  2007-09-19 11:53:12.0 -0700
@@ -654,7 +654,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_NOFS);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -724,7 +724,7 @@ journal_t * jbd2_journal_init_dev(struct
journal-j_blocksize = blocksize;
n = journal-j_blocksize / sizeof(journal_block_tag_t);
journal-j_wbufsize = n;
-   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL

Re: [PATCH] JBD slab cleanups

2007-09-19 Thread Mingming Cao

On Wed, 2007-09-19 at 19:28 +, Dave Kleikamp wrote:
 On Wed, 2007-09-19 at 14:26 -0500, Dave Kleikamp wrote:
  On Wed, 2007-09-19 at 12:15 -0700, Mingming Cao wrote:
  
   Here is the patch to clean up __GFP_NOFAIL flag in jbd/jbd2. In all
   cases except one handles memory allocation failure so I get rid of those
   GFP_NOFAIL flags.
   
   Also, shouldn't we use GFP_KERNEL instead of GFP_NOFS flag for kmalloc
   in jbd/jbd2? I will send a separate patch to cleanup that.
  
  No.  GFP_NOFS avoids deadlock.  It prevents the allocation from making
  recursive calls back into the file system that could end up blocking on
  jbd code.
 
 Oh, I see your patch now.  You mean use GFP_NOFS instead of
 GFP_KERNEL.  :-)  OK then.
 

oops, I did mean what you say here.:-)

  Shaggy

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] JBD: use GFP_NOFS in kmalloc

2007-09-19 Thread Mingming Cao

On Wed, 2007-09-19 at 14:34 -0700, Andrew Morton wrote:
 On Wed, 19 Sep 2007 12:22:09 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Convert the GFP_KERNEL flag used in JBD/JBD2 to GFP_NOFS, consistent
  with the rest of kmalloc flag used in the JBD/JBD2 layer.
  
  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  
  ---
   fs/jbd/journal.c  |6 +++---
   fs/jbd/revoke.c   |8 
   fs/jbd2/journal.c |6 +++---
   fs/jbd2/revoke.c  |8 
   4 files changed, 14 insertions(+), 14 deletions(-)
  
  Index: linux-2.6.23-rc6/fs/jbd/journal.c
  ===
  --- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-19 11:51:10.0 
  -0700
  +++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-19 11:51:57.0 
  -0700
  @@ -653,7 +653,7 @@ static journal_t * journal_init_common (
  journal_t *journal;
  int err;
   
  -   journal = kmalloc(sizeof(*journal), GFP_KERNEL);
  +   journal = kmalloc(sizeof(*journal), GFP_NOFS);
  if (!journal)
  goto fail;
  memset(journal, 0, sizeof(*journal));
  @@ -723,7 +723,7 @@ journal_t * journal_init_dev(struct bloc
  journal-j_blocksize = blocksize;
  n = journal-j_blocksize / sizeof(journal_block_tag_t);
  journal-j_wbufsize = n;
  -   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
  +   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
  if (!journal-j_wbuf) {
  printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
  __FUNCTION__);
  @@ -777,7 +777,7 @@ journal_t * journal_init_inode (struct i
  /* journal descriptor can store up to n blocks -bzzz */
  n = journal-j_blocksize / sizeof(journal_block_tag_t);
  journal-j_wbufsize = n;
  -   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
  +   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_NOFS);
  if (!journal-j_wbuf) {
  printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
  __FUNCTION__);
  Index: linux-2.6.23-rc6/fs/jbd/revoke.c
  ===
  --- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-19 11:51:30.0 
  -0700
  +++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-19 11:52:34.0 
  -0700
  @@ -206,7 +206,7 @@ int journal_init_revoke(journal_t *journ
  while((tmp = 1UL) != 0UL)
  shift++;
   
  -   journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
  GFP_KERNEL);
  +   journal-j_revoke_table[0] = kmem_cache_alloc(revoke_table_cache, 
  GFP_NOFS);
  if (!journal-j_revoke_table[0])
  return -ENOMEM;
  journal-j_revoke = journal-j_revoke_table[0];
  @@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
  journal-j_revoke-hash_shift = shift;
   
  journal-j_revoke-hash_table =
  -   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
  +   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
  if (!journal-j_revoke-hash_table) {
  kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
  journal-j_revoke = NULL;
  @@ -229,7 +229,7 @@ int journal_init_revoke(journal_t *journ
  for (tmp = 0; tmp  hash_size; tmp++)
  INIT_LIST_HEAD(journal-j_revoke-hash_table[tmp]);
   
  -   journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
  GFP_KERNEL);
  +   journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
  GFP_NOFS);
  if (!journal-j_revoke_table[1]) {
  kfree(journal-j_revoke_table[0]-hash_table);
  kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
  @@ -246,7 +246,7 @@ int journal_init_revoke(journal_t *journ
  journal-j_revoke-hash_shift = shift;
   
  journal-j_revoke-hash_table =
  -   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
  +   kmalloc(hash_size * sizeof(struct list_head), GFP_NOFS);
  if (!journal-j_revoke-hash_table) {
  kfree(journal-j_revoke_table[0]-hash_table);
  kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
 
 These were all OK using GFP_KERNEL.
 
 GFP_NOFS should only be used when the caller is holding some fs locks which
 might cause a deadlock if that caller reentered the fs in -writepage (and
 maybe put_inode and such).  That isn't the case in any of the above code,
 which is all mount time stuff (I think).
 

You are right they are all occur at initialization time.

 ext3/4 should be using GFP_NOFS when the caller has a transaction open, has
 a page locked, is holding i_mutex, etc.
 

Thanks for your feedback.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Mingming Cao

On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
 On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
  Here is the incremental small cleanup patch. 
  
  Remove kamlloc usages in jbd/jbd2 and consistently use 
  jbd_kmalloc/jbd2_malloc.
 
 Shouldn't we kill jbd_kmalloc instead?
 

It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
places to handle memory (de)allocation(page size) via kmalloc/kfree, so
in the future if we need to change memory allocation in jbd(e.g. not
using kmalloc or using different flag), we don't need to touch every
place in the jbd code calling jbd_kmalloc.

Regards,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] JBD slab cleanups

2007-09-18 Thread Mingming Cao

On Tue, 2007-09-18 at 13:04 -0500, Dave Kleikamp wrote:
 On Tue, 2007-09-18 at 09:35 -0700, Mingming Cao wrote:
  On Tue, 2007-09-18 at 10:04 +0100, Christoph Hellwig wrote:
   On Mon, Sep 17, 2007 at 03:57:31PM -0700, Mingming Cao wrote:
Here is the incremental small cleanup patch. 

Remove kamlloc usages in jbd/jbd2 and consistently use 
jbd_kmalloc/jbd2_malloc.
   
   Shouldn't we kill jbd_kmalloc instead?
   
  
  It seems useful to me to keep jbd_kmalloc/jbd_free. They are central
  places to handle memory (de)allocation(page size) via kmalloc/kfree, so
  in the future if we need to change memory allocation in jbd(e.g. not
  using kmalloc or using different flag), we don't need to touch every
  place in the jbd code calling jbd_kmalloc.
 
 I disagree.  Why would jbd need to globally change the way it allocates
 memory?  It currently uses kmalloc (and jbd_kmalloc) for allocating a
 variety of structures.  Having to change one particular instance won't
 necessarily mean we want to change all of them.  Adding unnecessary
 wrappers only obfuscates the code making it harder to understand.  You
 wouldn't want every subsystem to have it's own *_kmalloc() that took
 different arguments.  Besides, there aren't that many calls to kmalloc
 and kfree in the jbd code, so there wouldn't be much pain in changing
 GFP flags or whatever, if it ever needed to be done.
 
 Shaggy

Okay, Points taken, Here is the updated patch to get rid of slab
management and jbd_kmalloc from jbd totally. This patch is intend to
replace the patch in mm tree, Andrew, could you pick up this one
instead?

Thanks,

Mingming


jbd/jbd2: JBD memory allocation cleanups

From: Christoph Lameter [EMAIL PROTECTED]

JBD: Replace slab allocations with page cache allocations

JBD allocate memory for committed_data and frozen_data from slab. However
JBD should not pass slab pages down to the block layer. Use page allocator 
pages instead. This will also prepare JBD for the large blocksize patchset.


Also this patch cleans up jbd_kmalloc and replace it with kmalloc directly

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

---
 fs/jbd/commit.c   |6 +--
 fs/jbd/journal.c  |   99 ++
 fs/jbd/transaction.c  |   12 +++---
 fs/jbd2/commit.c  |6 +--
 fs/jbd2/journal.c |   99 ++
 fs/jbd2/transaction.c |   18 -
 include/linux/jbd.h   |   18 +
 include/linux/jbd2.h  |   21 +-
 8 files changed, 52 insertions(+), 227 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-18 17:19:01.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-18 17:51:21.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
 
jbd_unlock_bh_state(bh_in);
-   tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS);
+   tmp = jbd_alloc(bh_in-b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in-b_frozen_data) {
-   jbd_slab_free(tmp, bh_in-b_size);
+   jbd_free(tmp, bh_in-b_size);
goto repeat;
}
 
@@ -654,7 +653,7 @@ static journal_t * journal_init_common (
journal_t *journal;
int err;
 
-   journal = jbd_kmalloc(sizeof(*journal), GFP_KERNEL);
+   journal = kmalloc(sizeof(*journal), GFP_KERNEL|__GFP_NOFAIL);
if (!journal)
goto fail;
memset(journal, 0, sizeof(*journal));
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
 
-   /*
-* Create a slab for this blocksize
-*/
-   err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize));
-   if (err)
-   return err;
-
/* Let the recovery code check whether it needs to recover any
 * data from the journal. */
if (journal_recover(journal))
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-   return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could

Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Mingming Cao

On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
 jbd/jbd2: Replace slab allocations with page cache allocations
 
 From: Christoph Lameter [EMAIL PROTECTED]
 
 JBD should not pass slab pages down to the block layer.
 Use page allocator pages instead. This will also prepare
 JBD for the large blocksize patchset.
 

Currently memory allocation for committed_data(and frozen_buffer) for
bufferhead is done through jbd slab management, as Christoph Hellwig
pointed out that this is broken as jbd should not pass slab pages down
to IO layer. and suggested to use get_free_pages() directly.

The problem with this patch, as Andreas Dilger pointed today in ext4
interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
1/3-1/2 page space. 

What was the originally intention to set up slabs for committed_data(and
frozen_buffer) in JBD? Why not using kmalloc?

Mingming

 Tested on 2.6.23-rc6 with fsx runs fine.
 
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 ---
  fs/jbd/checkpoint.c   |2 
  fs/jbd/commit.c   |6 +-
  fs/jbd/journal.c  |  107 
 -
  fs/jbd/transaction.c  |   10 ++--
  fs/jbd2/checkpoint.c  |2 
  fs/jbd2/commit.c  |6 +-
  fs/jbd2/journal.c |  109 
 --
  fs/jbd2/transaction.c |   18 
  include/linux/jbd.h   |   23 +-
  include/linux/jbd2.h  |   28 ++--
  10 files changed, 83 insertions(+), 228 deletions(-)
 
 Index: linux-2.6.23-rc5/fs/jbd/journal.c
 ===
 --- linux-2.6.23-rc5.orig/fs/jbd/journal.c2007-09-13 13:37:57.0 
 -0700
 +++ linux-2.6.23-rc5/fs/jbd/journal.c 2007-09-13 13:45:39.0 -0700
 @@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
  static int journal_convert_superblock_v1(journal_t *, journal_superblock_t 
 *);
  static void __journal_abort_soft (journal_t *journal, int errno);
 -static int journal_create_jbd_slab(size_t slab_size);
 
  /*
   * Helper function used to manage commit timeouts
 @@ -334,10 +333,10 @@ repeat:
   char *tmp;
 
   jbd_unlock_bh_state(bh_in);
 - tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS);
 + tmp = jbd_alloc(bh_in-b_size, GFP_NOFS);
   jbd_lock_bh_state(bh_in);
   if (jh_in-b_frozen_data) {
 - jbd_slab_free(tmp, bh_in-b_size);
 + jbd_free(tmp, bh_in-b_size);
   goto repeat;
   }
 
 @@ -679,7 +678,7 @@ static journal_t * journal_init_common (
   /* Set up a default-sized revoke table for the new mount. */
   err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
   if (err) {
 - kfree(journal);
 + jbd_kfree(journal);
   goto fail;
   }
   return journal;
 @@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
   if (!journal-j_wbuf) {
   printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
   __FUNCTION__);
 - kfree(journal);
 + jbd_kfree(journal);
   journal = NULL;
   goto out;
   }
 @@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
   if (!journal-j_wbuf) {
   printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
   __FUNCTION__);
 - kfree(journal);
 + jbd_kfree(journal);
   return NULL;
   }
 
 @@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
   if (err) {
   printk(KERN_ERR %s: Cannnot locate journal superblock\n,
  __FUNCTION__);
 - kfree(journal);
 + jbd_kfree(journal);
   return NULL;
   }
 
 @@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
   }
   }
 
 - /*
 -  * Create a slab for this blocksize
 -  */
 - err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize));
 - if (err)
 - return err;
 -
   /* Let the recovery code check whether it needs to recover any
* data from the journal. */
   if (journal_recover(journal))
 @@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
   if (journal-j_revoke)
   journal_destroy_revoke(journal);
   kfree(journal-j_wbuf);
 - kfree(journal);
 + jbd_kfree(journal);
  }
 
 
 @@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
  }
 
  /*
 - * Simple support for retrying memory allocations.  Introduced to help to
 - * debug different VM deadlock avoidance strategies.
 - */
 -void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
 -{
 - return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
 -}
 -
 -/*
 - * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
 - * and allocate frozen

Re: [PATCH] JBD slab cleanups

2007-09-17 Thread Mingming Cao

On Mon, 2007-09-17 at 15:01 -0700, Badari Pulavarty wrote:
 On Mon, 2007-09-17 at 12:29 -0700, Mingming Cao wrote:
  On Fri, 2007-09-14 at 11:53 -0700, Mingming Cao wrote:
   jbd/jbd2: Replace slab allocations with page cache allocations
   
   From: Christoph Lameter [EMAIL PROTECTED]
   
   JBD should not pass slab pages down to the block layer.
   Use page allocator pages instead. This will also prepare
   JBD for the large blocksize patchset.
   
  
  Currently memory allocation for committed_data(and frozen_buffer) for
  bufferhead is done through jbd slab management, as Christoph Hellwig
  pointed out that this is broken as jbd should not pass slab pages down
  to IO layer. and suggested to use get_free_pages() directly.
  
  The problem with this patch, as Andreas Dilger pointed today in ext4
  interlock call, for 1k,2k block size ext2/3/4, get_free_pages() waste
  1/3-1/2 page space. 
  
  What was the originally intention to set up slabs for committed_data(and
  frozen_buffer) in JBD? Why not using kmalloc?
  
  Mingming
 
 Looks good. Small suggestion is to get rid of all kmalloc() usages and
 consistently use jbd_kmalloc() or jbd2_kmalloc().
 
 Thanks,
 Badari
 

Here is the incremental small cleanup patch. 

Remove kamlloc usages in jbd/jbd2 and consistently use jbd_kmalloc/jbd2_malloc.


Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd/journal.c  |8 +---
 fs/jbd/revoke.c   |   12 ++--
 fs/jbd2/journal.c |8 +---
 fs/jbd2/revoke.c  |   12 ++--
 4 files changed, 22 insertions(+), 18 deletions(-)

Index: linux-2.6.23-rc6/fs/jbd/journal.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/journal.c  2007-09-17 14:32:16.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/journal.c   2007-09-17 14:33:59.0 -0700
@@ -723,7 +723,8 @@ journal_t * journal_init_dev(struct bloc
journal-j_blocksize = blocksize;
n = journal-j_blocksize / sizeof(journal_block_tag_t);
journal-j_wbufsize = n;
-   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+   GFP_KERNEL);
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
@@ -777,7 +778,8 @@ journal_t * journal_init_inode (struct i
/* journal descriptor can store up to n blocks -bzzz */
n = journal-j_blocksize / sizeof(journal_block_tag_t);
journal-j_wbufsize = n;
-   journal-j_wbuf = kmalloc(n * sizeof(struct buffer_head*), GFP_KERNEL);
+   journal-j_wbuf = jbd_kmalloc(n * sizeof(struct buffer_head*),
+   GFP_KERNEL);
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
@@ -1157,7 +1159,7 @@ void journal_destroy(journal_t *journal)
iput(journal-j_inode);
if (journal-j_revoke)
journal_destroy_revoke(journal);
-   kfree(journal-j_wbuf);
+   jbd_kfree(journal-j_wbuf);
jbd_kfree(journal);
 }
 
Index: linux-2.6.23-rc6/fs/jbd/revoke.c
===
--- linux-2.6.23-rc6.orig/fs/jbd/revoke.c   2007-09-17 14:32:22.0 
-0700
+++ linux-2.6.23-rc6/fs/jbd/revoke.c2007-09-17 14:35:13.0 -0700
@@ -219,7 +219,7 @@ int journal_init_revoke(journal_t *journ
journal-j_revoke-hash_shift = shift;
 
journal-j_revoke-hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal-j_revoke-hash_table) {
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
journal-j_revoke = NULL;
@@ -231,7 +231,7 @@ int journal_init_revoke(journal_t *journ
 
journal-j_revoke_table[1] = kmem_cache_alloc(revoke_table_cache, 
GFP_KERNEL);
if (!journal-j_revoke_table[1]) {
-   kfree(journal-j_revoke_table[0]-hash_table);
+   jbd_kfree(journal-j_revoke_table[0]-hash_table);
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
return -ENOMEM;
}
@@ -246,9 +246,9 @@ int journal_init_revoke(journal_t *journ
journal-j_revoke-hash_shift = shift;
 
journal-j_revoke-hash_table =
-   kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
+   jbd_kmalloc(hash_size * sizeof(struct list_head), GFP_KERNEL);
if (!journal-j_revoke-hash_table) {
-   kfree(journal-j_revoke_table[0]-hash_table);
+   jbd_kfree(journal-j_revoke_table[0]-hash_table);
kmem_cache_free(revoke_table_cache, journal-j_revoke_table[0]);
kmem_cache_free

[PATCH] JBD slab cleanups

2007-09-14 Thread Mingming Cao

jbd/jbd2: Replace slab allocations with page cache allocations

From: Christoph Lameter [EMAIL PROTECTED]

JBD should not pass slab pages down to the block layer.
Use page allocator pages instead. This will also prepare
JBD for the large blocksize patchset.

Tested on 2.6.23-rc6 with fsx runs fine.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/jbd/checkpoint.c   |2 
 fs/jbd/commit.c   |6 +-
 fs/jbd/journal.c  |  107 -
 fs/jbd/transaction.c  |   10 ++--
 fs/jbd2/checkpoint.c  |2 
 fs/jbd2/commit.c  |6 +-
 fs/jbd2/journal.c |  109 --
 fs/jbd2/transaction.c |   18 
 include/linux/jbd.h   |   23 +-
 include/linux/jbd2.h  |   28 ++--
 10 files changed, 83 insertions(+), 228 deletions(-)

Index: linux-2.6.23-rc5/fs/jbd/journal.c
===
--- linux-2.6.23-rc5.orig/fs/jbd/journal.c  2007-09-13 13:37:57.0 
-0700
+++ linux-2.6.23-rc5/fs/jbd/journal.c   2007-09-13 13:45:39.0 -0700
@@ -83,7 +83,6 @@ EXPORT_SYMBOL(journal_force_commit);
 
 static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);
 static void __journal_abort_soft (journal_t *journal, int errno);
-static int journal_create_jbd_slab(size_t slab_size);
 
 /*
  * Helper function used to manage commit timeouts
@@ -334,10 +333,10 @@ repeat:
char *tmp;
 
jbd_unlock_bh_state(bh_in);
-   tmp = jbd_slab_alloc(bh_in-b_size, GFP_NOFS);
+   tmp = jbd_alloc(bh_in-b_size, GFP_NOFS);
jbd_lock_bh_state(bh_in);
if (jh_in-b_frozen_data) {
-   jbd_slab_free(tmp, bh_in-b_size);
+   jbd_free(tmp, bh_in-b_size);
goto repeat;
}
 
@@ -679,7 +678,7 @@ static journal_t * journal_init_common (
/* Set up a default-sized revoke table for the new mount. */
err = journal_init_revoke(journal, JOURNAL_REVOKE_DEFAULT_HASH);
if (err) {
-   kfree(journal);
+   jbd_kfree(journal);
goto fail;
}
return journal;
@@ -728,7 +727,7 @@ journal_t * journal_init_dev(struct bloc
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
journal = NULL;
goto out;
}
@@ -782,7 +781,7 @@ journal_t * journal_init_inode (struct i
if (!journal-j_wbuf) {
printk(KERN_ERR %s: Cant allocate bhs for commit thread\n,
__FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
return NULL;
}
 
@@ -791,7 +790,7 @@ journal_t * journal_init_inode (struct i
if (err) {
printk(KERN_ERR %s: Cannnot locate journal superblock\n,
   __FUNCTION__);
-   kfree(journal);
+   jbd_kfree(journal);
return NULL;
}
 
@@ -1095,13 +1094,6 @@ int journal_load(journal_t *journal)
}
}
 
-   /*
-* Create a slab for this blocksize
-*/
-   err = journal_create_jbd_slab(be32_to_cpu(sb-s_blocksize));
-   if (err)
-   return err;
-
/* Let the recovery code check whether it needs to recover any
 * data from the journal. */
if (journal_recover(journal))
@@ -1166,7 +1158,7 @@ void journal_destroy(journal_t *journal)
if (journal-j_revoke)
journal_destroy_revoke(journal);
kfree(journal-j_wbuf);
-   kfree(journal);
+   jbd_kfree(journal);
 }
 
 
@@ -1615,86 +1607,6 @@ int journal_blocks_per_page(struct inode
 }
 
 /*
- * Simple support for retrying memory allocations.  Introduced to help to
- * debug different VM deadlock avoidance strategies.
- */
-void * __jbd_kmalloc (const char *where, size_t size, gfp_t flags, int retry)
-{
-   return kmalloc(size, flags | (retry ? __GFP_NOFAIL : 0));
-}
-
-/*
- * jbd slab management: create 1k, 2k, 4k, 8k slabs as needed
- * and allocate frozen and commit buffers from these slabs.
- *
- * Reason for doing this is to avoid, SLAB_DEBUG - since it could
- * cause bh to cross page boundary.
- */
-
-#define JBD_MAX_SLABS 5
-#define JBD_SLAB_INDEX(size)  (size  11)
-
-static struct kmem_cache *jbd_slab[JBD_MAX_SLABS];
-static const char *jbd_slab_names[JBD_MAX_SLABS] = {
-   jbd_1k, jbd_2k, jbd_4k, NULL, jbd_8k
-};
-
-static void journal_destroy_jbd_slabs(void)
-{
-   int i;
-
-   for (i = 0; i  JBD_MAX_SLABS; i++) {
-   if (jbd_slab[i])
-   kmem_cache_destroy(jbd_slab[i]);
-   jbd_slab[i] = NULL

Re: [RFC 1/4] Large Blocksize support for Ext2/3/4

2007-08-31 Thread Mingming Cao

On Wed, 2007-08-29 at 17:47 -0700, Mingming Cao wrote:

 Just rebase to 2.6.23-rc4 and against the ext4 patch queue. Compile tested 
 only. 
 
 Next steps:
 Need a e2fsprogs changes to able test this feature. As mkfs needs to be
 educated not assuming rec_len to be blocksize all the time.
 Will try it with Christoph Lameter's large block patch next.
 

Two problems were found when testing largeblock on ext3.  Patches to
follow. 

Good news is, with your changes, plus all these extN changes, I am able
to run ext2/3/4 with 64k block size, tested on x86 and ppc64 with 4k
page size. fsx test runs fine for an hour on ext3 with 16k blocksize on
x86:-)

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC 2/2] JBD: blocks reservation fix for large block support

2007-08-31 Thread Mingming Cao

The blocks per page could be less or quals to 1 with the large block support in 
VM.
The patch fixed the way to calculate the number of blocks to reserve in journal 
in the
case blocksize  pagesize.



Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: my2.6/fs/jbd/journal.c
===
--- my2.6.orig/fs/jbd/journal.c 2007-08-31 13:27:16.0 -0700
+++ my2.6/fs/jbd/journal.c  2007-08-31 13:28:18.0 -0700
@@ -1611,7 +1611,12 @@ void journal_ack_err(journal_t *journal)
 
 int journal_blocks_per_page(struct inode *inode)
 {
-   return 1  (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits);
+   int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits;
+
+   if (bits  0)
+   return 1  bits;
+   else
+   return 1;
 }
 
 /*
Index: my2.6/fs/jbd2/journal.c
===
--- my2.6.orig/fs/jbd2/journal.c2007-08-31 13:32:21.0 -0700
+++ my2.6/fs/jbd2/journal.c 2007-08-31 13:32:30.0 -0700
@@ -1612,7 +1612,12 @@ void jbd2_journal_ack_err(journal_t *jou
 
 int jbd2_journal_blocks_per_page(struct inode *inode)
 {
-   return 1  (PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits);
+   int bits = PAGE_CACHE_SHIFT - inode-i_sb-s_blocksize_bits;
+
+   if (bits  0)
+   return 1  bits;
+   else
+   return 1;
 }
 
 /*


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 8][PATCH 1/1]Add journal checksums

2007-08-06 Thread Mingming Cao

On Wed, 2007-08-01 at 12:34 +0530, Girish Shilamkar wrote:
 On Wed, 2007-07-11 at 17:16 +0530, Girish Shilamkar wrote:
 
  I will make the changes and send an incremental patch.
  
 Hi,
   I have made the changes and attached the incremental patch as per the
 review.

Thanks,

I merged your changes to ext4-patch-queue.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] basic delayed allocation in VFS

2007-07-30 Thread Mingming Cao

On Sun, 2007-07-29 at 20:24 +0100, Christoph Hellwig wrote:
 On Sun, Jul 29, 2007 at 11:30:36AM -0600, Andreas Dilger wrote:
  Sigh, we HAVE a patch that was only adding delalloc to ext4, but it
  was rejected because that functionality should go into the VFS.
  Since the performance improvement of delalloc is quite large, we'd
  like to get this into the kernel one way or another.  Can we make a
  decision if the ext4-specific delalloc is acceptable?
 
 I'm a big proponent of having proper common delalloc code, but the
 one proposed here is not generic for the existing filesystem using
 delalloc.  

To be fair, what Alex have so far is probably good enough for ext2/3
delayed allocation.

 It's still on my todo list to revamp the xfs code to get
 rid of some of the existing mess and make it useable genericly.  If
 the ext4 users are fine with the end result we could move to generic
 code.
 

Are you okay with having a ext4 delayed allocation implementation (i.e.
moving the code proposed in this thread to fs/ext4) first?  Then later
when you come up with a generic delayed allocation for both ext4 and xfs
we could make use of that generic implementation. Is that a acceptable
approach? 

Andrew, what do you think?


Regards,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [BUG?] ext4_ext_put_in_cache uses __u32 to receive physical block number.

2007-07-27 Thread Mingming Cao

On Fri, 2007-07-27 at 13:16 +0800, Yan Zheng wrote:
 Hi, all
 
 I think I found a bug in ext4/extents.c, ext4_ext_put_in_cache uses
 __u32 to receive physical block number.  ext4_ext_put_in_cache is
 used in ext4_ext_get_blocks, it sets ext4 inode's extent cache
 according most recently tree lookup (higher 16 bits of saved physical
 block number are always zero). when serving a mapping request,
 ext4_ext_get_blocks first check whether the logical block is in
 inode's extent cache. if the logical block is in the cache and the
 cached region isn't a gap, ext4_ext_get_blocks gets physical block
 number by using cached region's physical block number and offset in
 the cached region.  as described above, ext4_ext_get_blocks may
 return wrong result when there are physical block numbers bigger than
 0x.
 
 Regards
 
 YZ

You are right.  Thanks for reporting this!

Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22/fs/ext4/extents.c
===
--- linux-2.6.22.orig/fs/ext4/extents.c 2007-07-27 08:31:02.0 -0700
+++ linux-2.6.22/fs/ext4/extents.c  2007-07-27 08:31:48.0 -0700
@@ -1544,7 +1544,7 @@ int ext4_ext_walk_space(struct inode *in
 
 static void
 ext4_ext_put_in_cache(struct inode *inode, __u32 block,
-   __u32 len, __u32 start, int type)
+   __u32 len, ext4_fsblk_t start, int type)
 {
struct ext4_ext_cache *cex;
BUG_ON(len == 0);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: new ext4 build warnings

2007-07-18 Thread Mingming Cao

On Wed, 2007-07-18 at 17:37 -0400, Jeff Garzik wrote:
 It seems jbd_debug() might need modification:
 
 fs/ext4/inode.c: In function ‘ext4_write_inode’:
 fs/ext4/inode.c:2906: warning: comparison is always true due to limited 
 range of data type
 
 fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’:
 fs/jbd2/recovery.c:254: warning: comparison is always true due to 
 limited range of data type
 fs/jbd2/recovery.c:257: warning: comparison is always true due to 
 limited range of data type
 
 fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’:
 fs/jbd2/recovery.c:301: warning: comparison is always true due to 
 limited range of data type
 
 I'm surprised this was not noticed in a test build before pushing upstream.
 

Hmm,  I am not sure what happened. I get the compile warning on linus
latest git tree, but could not get the same compile warning on Ted's
ext4 git tree. 

In both build CONFIG_JBD2_DEBUG and DEBUG_FS is enabled.

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] fix ext4/JBD2 build warnings

2007-07-18 Thread Mingming Cao

On Wed, 2007-07-18 at 17:37 -0400, Jeff Garzik wrote:
 It seems jbd_debug() might need modification:
 
Looking at the current linus-git tree jbd_debug() define in
include/linux/jbd2.h

extern u8 journal_enable_debug;

#define jbd_debug(n, f, a...)   \
do {\
if ((n) = journal_enable_debug) {  \
printk (KERN_DEBUG (%s, %d): %s: ,\
__FILE__, __LINE__, __FUNCTION__);  \
printk (f, ## a);   \
}   \
} while (0)
 fs/ext4/inode.c: In function ‘ext4_write_inode’:
 fs/ext4/inode.c:2906: warning: comparison is always true due to limited 
 range of data type
 
 fs/jbd2/recovery.c: In function ‘jbd2_journal_recover’:
 fs/jbd2/recovery.c:254: warning: comparison is always true due to 
 limited range of data type
 fs/jbd2/recovery.c:257: warning: comparison is always true due to 
 limited range of data type
 
 fs/jbd2/recovery.c: In function ‘jbd2_journal_skip_recovery’:
 fs/jbd2/recovery.c:301: warning: comparison is always true due to 
 limited range of data type
 
Noticed all warnings are occurs when the debug level is 0. Then found
the jbd2: Move jbd2-debug file to debugfs patch
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=0f49d5d019afa4e94253bfc92f0daca3badb990b

changed the jbd2_journal_enable_debug from int type to u8, makes the
jbd_debug comparision is always true when the debugging level is 0. Thus
the compile warning occurs. 

Thought about changing the jbd2_journal_enable_debug data type back to
int, but can't, because the jbd2-debug is moved to debug fs, where
calling debugfs_create_u8() to create the debugfs entry needs the value
to be u8 type.

Even if we changed the data type back to int, the code is still buggy,
kernel should not print jbd2 debug message if the
jbd2_journal_enable_debug is set to 0. But this is not the case.

The fix is change the level of debugging to 1. The same should fixed in
ext3/JBD, but currently ext3 jbd-debug via /proc fs is broken, so we
probably should fix it all together.


Signed-off-by: Mingming Cao [EMAIL PROTECTED]

---
 fs/ext4/inode.c|2 +-
 fs/jbd2/recovery.c |6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6.22/fs/ext4/inode.c
===
--- linux-2.6.22.orig/fs/ext4/inode.c   2007-07-18 19:23:58.0 -0700
+++ linux-2.6.22/fs/ext4/inode.c2007-07-18 19:25:08.0 -0700
@@ -2903,7 +2903,7 @@ int ext4_write_inode(struct inode *inode
return 0;
 
if (ext4_journal_current_handle()) {
-   jbd_debug(0, called recursively, non-PF_MEMALLOC!\n);
+   jbd_debug(1, called recursively, non-PF_MEMALLOC!\n);
dump_stack();
return -EIO;
}
Index: linux-2.6.22/fs/jbd2/recovery.c
===
--- linux-2.6.22.orig/fs/jbd2/recovery.c2007-07-18 19:25:51.0 
-0700
+++ linux-2.6.22/fs/jbd2/recovery.c 2007-07-18 19:26:13.0 -0700
@@ -251,10 +251,10 @@ int jbd2_journal_recover(journal_t *jour
if (!err)
err = do_one_pass(journal, info, PASS_REPLAY);
 
-   jbd_debug(0, JBD: recovery, exit status %d, 
+   jbd_debug(1, JBD: recovery, exit status %d, 
  recovered transactions %u to %u\n,
  err, info.start_transaction, info.end_transaction);
-   jbd_debug(0, JBD: Replayed %d and revoked %d/%d blocks\n,
+   jbd_debug(1, JBD: Replayed %d and revoked %d/%d blocks\n,
  info.nr_replays, info.nr_revoke_hits, info.nr_revokes);
 
/* Restart the log at the next transaction ID, thus invalidating
@@ -298,7 +298,7 @@ int jbd2_journal_skip_recovery(journal_t
 #ifdef CONFIG_JBD2_DEBUG
int dropped = info.end_transaction - 
be32_to_cpu(sb-s_sequence);
 #endif
-   jbd_debug(0,
+   jbd_debug(1,
  JBD: ignoring %d transaction%s from the journal.\n,
  dropped, (dropped == 1) ?  : s);
journal-j_transaction_sequence = ++info.end_transaction;



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-17 Thread Mingming Cao

On Tue, 2007-07-17 at 15:29 +0530, Kalpak Shah wrote:
 On Mon, 2007-07-16 at 17:49 -0700, Mingming Cao wrote:
  On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
   On Sun, 01 Jul 2007 03:36:56 -0400
   Mingming Cao [EMAIL PROTECTED] wrote:
+static inline __le32 ext4_encode_extra_time(struct timespec *time)
+{
+   return cpu_to_le32((sizeof(time-tv_sec)  4 ?
+  time-tv_sec  32 : 0) |
+  ((time-tv_nsec  2)  EXT4_NSEC_MASK));
+}
+
+static inline void ext4_decode_extra_time(struct timespec *time, 
__le32 extra)
+{
+   if (sizeof(time-tv_sec)  4)
+  time-tv_sec |= (__u64)(le32_to_cpu(extra)  
EXT4_EPOCH_MASK)
+   32;
+   time-tv_nsec = (le32_to_cpu(extra)  EXT4_NSEC_MASK)  2;
+}
   
   Consider uninlining these functions.
   
  I got compile warining after apply Kalpal's update nanosecond patch,
  which makes these two function inline. It complains these functions are
  defined but not used. It's being used only in the following
  micros(EXT4_INODE_SET_XTIME etc).  So if the .c file included the
  ext4_fs.h but not using the micros, the compile will think these two
  functions are not used.
 
 The compile warnings were introduced because the functions were
 uninlined. So we can either keep these functions inlined or consider
 adding a __used attribute to these two functions. 
 
okay for now I keep these functions inlined. 

 Thanks,
 Kalpak.
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 23:18 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:51 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Subject: [EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes
  Date:   Sun, 01 Jul 2007 03:38:51 -0400
  Sender: [EMAIL PROTECTED]
  Organization: IBM Linux Technology Center
  X-Mailer: Evolution 2.8.0 (2.8.0-33.el5) 
  
  From: Dmitry Monakhov [EMAIL PROTECTED]
  Subject: ext4: extent compilation fixes
 
 I hope this patch series will hit git with nice titles like ext4: extent
 compilation fixes and not funny ones like
 Morecleanups:ext4_extent_compilation_fixes.
 
Sure, the changelog keep the nice tile Subject: ext4: extent
compilation fixes

  Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.
 
 conversions.

Done.

Thanks,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 1/1] ext4: JBD-JBD2 naming cleanups

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:

  Index: linux-2.6.22-rc4/include/linux/jbd2.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/jbd2.h  2007-06-11 
  16:16:18.0 -0700
  +++ linux-2.6.22-rc4/include/linux/jbd2.h   2007-06-11 16:35:25.0 
  -0700
  @@ -57,7 +57,7 @@
* CONFIG_JBD2_DEBUG is on.
*/
   #define JBD_EXPENSIVE_CHECKING
 
 JBD2?
 

Some cleanups in ext4/JBD2 to follow the naming fules:change micros name
from JBD_XXX to JBD2_XXX.


Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/ext4/extents.c |2 +-
 fs/ext4/super.c   |2 +-
 fs/jbd2/commit.c  |2 +-
 fs/jbd2/journal.c |   26 +-
 fs/jbd2/recovery.c|2 +-
 fs/jbd2/revoke.c  |4 ++--
 include/linux/ext4_jbd2.h |6 +++---
 include/linux/jbd2.h  |   30 +++---
 8 files changed, 37 insertions(+), 37 deletions(-)

Index: linux-2.6.22/fs/ext4/super.c
===
--- linux-2.6.22.orig/fs/ext4/super.c   2007-07-13 17:17:29.0 -0700
+++ linux-2.6.22/fs/ext4/super.c2007-07-13 17:17:33.0 -0700
@@ -967,7 +967,7 @@ static int parse_options (char *options,
if (option  0)
return 0;
if (option == 0)
-   option = JBD_DEFAULT_MAX_COMMIT_AGE;
+   option = JBD2_DEFAULT_MAX_COMMIT_AGE;
sbi-s_commit_interval = HZ * option;
break;
case Opt_data_journal:
Index: linux-2.6.22/include/linux/ext4_jbd2.h
===
--- linux-2.6.22.orig/include/linux/ext4_jbd2.h 2007-07-13 17:02:08.0 
-0700
+++ linux-2.6.22/include/linux/ext4_jbd2.h  2007-07-13 17:39:41.0 
-0700
@@ -12,8 +12,8 @@
  * Ext4-specific journaling extensions.
  */
 
-#ifndef _LINUX_EXT4_JBD_H
-#define _LINUX_EXT4_JBD_H
+#ifndef _LINUX_EXT4_JBD2_H
+#define _LINUX_EXT4_JBD2_H
 
 #include linux/fs.h
 #include linux/jbd2.h
@@ -228,4 +228,4 @@ static inline int ext4_should_writeback_
return 0;
 }
 
-#endif /* _LINUX_EXT4_JBD_H */
+#endif /* _LINUX_EXT4_JBD2_H */
Index: linux-2.6.22/include/linux/jbd2.h
===
--- linux-2.6.22.orig/include/linux/jbd2.h  2007-07-13 17:17:28.0 
-0700
+++ linux-2.6.22/include/linux/jbd2.h   2007-07-13 17:31:33.0 -0700
@@ -13,8 +13,8 @@
  * filesystem journaling support.
  */
 
-#ifndef _LINUX_JBD_H
-#define _LINUX_JBD_H
+#ifndef _LINUX_JBD2_H
+#define _LINUX_JBD2_H
 
 /* Allow this file to be included directly into e2fsprogs */
 #ifndef __KERNEL__
@@ -37,26 +37,26 @@
 #define journal_oom_retry 1
 
 /*
- * Define JBD_PARANIOD_IOFAIL to cause a kernel BUG() if ext3 finds
+ * Define JBD2_PARANIOD_IOFAIL to cause a kernel BUG() if ext4 finds
  * certain classes of error which can occur due to failed IOs.  Under
- * normal use we want ext3 to continue after such errors, because
+ * normal use we want ext4 to continue after such errors, because
  * hardware _can_ fail, but for debugging purposes when running tests on
  * known-good hardware we may want to trap these errors.
  */
-#undef JBD_PARANOID_IOFAIL
+#undef JBD2_PARANOID_IOFAIL
 
 /*
  * The default maximum commit age, in seconds.
  */
-#define JBD_DEFAULT_MAX_COMMIT_AGE 5
+#define JBD2_DEFAULT_MAX_COMMIT_AGE 5
 
 #ifdef CONFIG_JBD2_DEBUG
 /*
- * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
+ * Define JBD2_EXPENSIVE_CHECKING to enable more expensive internal
  * consistency checks.  By default we don't do this unless
  * CONFIG_JBD2_DEBUG is on.
  */
-#define JBD_EXPENSIVE_CHECKING
+#define JBD2_EXPENSIVE_CHECKING
 extern u8 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...)  \
@@ -185,8 +185,8 @@ typedef struct journal_block_tag_s
__be32  t_blocknr_high; /* most-significant high 32bits. */
 } journal_block_tag_t;
 
-#define JBD_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
-#define JBD_TAG_SIZE64 (sizeof(journal_block_tag_t))
+#define JBD2_TAG_SIZE32 (offsetof(journal_block_tag_t, t_blocknr_high))
+#define JBD2_TAG_SIZE64 (sizeof(journal_block_tag_t))
 
 /*
  * The revoke descriptor: used on disk to describe a series of blocks to
@@ -282,8 +282,8 @@ typedef struct journal_superblock_s
 #include linux/fs.h
 #include linux/sched.h
 
-#define JBD_ASSERTIONS
-#ifdef JBD_ASSERTIONS
+#define JBD2_ASSERTIONS
+#ifdef JBD2_ASSERTIONS
 #define J_ASSERT(assert)   \
 do {   \
if (!(assert)) {\
@@ -310,9 +310,9 @@ void

Re: [EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:36:22 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  with the patch all headers are checked. the code should become
  more resistant to on-disk corruptions. needless BUG_ON() have
  been removed. please, review for inclusion.
  
  ...
 
  @@ -269,6 +239,70 @@
  return size;
   }
   
  +static inline int
  +ext4_ext_max_entries(struct inode *inode, int depth)
 
 Please remove the `inline'.
 
done
 `inline' is usually wrong.  It is wrong here.  If this function has a
 single callsite (it has) then the compiler will inline it.  If the function
 later gains more callsites then the compiler will know (correctly) not to
 inline it.
 
 We hope.  If we find the compiler still inlines a function as large as this
 one then we'd need to use noinline and complain at the gcc developers.
 
  +{
  +   int max;
  +
  +   if (depth == ext_depth(inode)) {
  +   if (depth == 0)
  +   max = ext4_ext_space_root(inode);
  +   else
  +   max = ext4_ext_space_root_idx(inode);
  +   } else {
  +   if (depth == 0)
  +   max = ext4_ext_space_block(inode);
  +   else
  +   max = ext4_ext_space_block_idx(inode);
  +   }
  +
  +   return max;
  +}
  +
  +static int __ext4_ext_check_header(const char *function, struct inode 
  *inode,
  +   struct ext4_extent_header *eh,
  +   int depth)
  +{
  +   const char *error_msg = NULL;
 
 Unneeded initialisation.
 
done

  +   int max = 0;
  +
  +   if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
  +   error_msg = invalid magic;
  +   goto corrupted;
  +   }
  +   if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) {
  +   error_msg = unexpected eh_depth;
  +   goto corrupted;
  +   }
  +   if (unlikely(eh-eh_max == 0)) {
  +   error_msg = invalid eh_max;
  +   goto corrupted;
  +   }
  +   max = ext4_ext_max_entries(inode, depth);
  +   if (unlikely(le16_to_cpu(eh-eh_max)  max)) {
  +   error_msg = too large eh_max;
  +   goto corrupted;
  +   }
  +   if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
  +   error_msg = invalid eh_entries;
  +   goto corrupted;
  +   }
  +   return 0;
  +
  +corrupted:
  +   ext4_error(inode-i_sb, function,
  +   bad header in inode #%lu: %s - magic %x, 
  +   entries %u, max %u(%u), depth %u(%u),
  +   inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
  +   le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
  +   max, le16_to_cpu(eh-eh_depth), depth);
  +
  +   return -EIO;
  +}
  +
 
  ...
 
  +   i = depth = ext_depth(inode);
 
 
 Mulitple assignments like this are somewhat unpopular from the coding-style
 POV.  
 
okay.

 We have a local variable called `i' which is not used as a simple counter
 and which has no explanatory comment.  This is plain bad.  Please find a
 better name for this variable.  Review the code for other such lack of
 clarity - I'm sure there's more.
 
i is is loop counter. I have moved the i= depth to before the while
loop.

 
  -   BUG_ON(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max));
  -   BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC);
 
 Yeah, this patch improves things a lot.
 
 How well-tested is it?  Have any of these errors actually been triggered?
 
  path[i].p_hdr = ext_block_hdr(path[i].p_bh);
  -   if (ext4_ext_check_header(__FUNCTION__, inode,
  -   path[i].p_hdr)) {
  -   err = -EIO;
  -   goto out;
  -   }
  }
   
  -   BUG_ON(le16_to_cpu(path[i].p_hdr-eh_entries)
  -   le16_to_cpu(path[i].p_hdr-eh_max));
  -   BUG_ON(path[i].p_hdr-eh_magic != EXT4_EXT_MAGIC);
  -
  if (!path[i].p_idx) {
  /* this level hasn't been touched yet */
  path[i].p_idx = EXT_LAST_INDEX(path[i].p_hdr);
  @@ -1873,17 +1890,24 @@
  i, EXT_FIRST_INDEX(path[i].p_hdr),
  path[i].p_idx);
  if (ext4_ext_more_to_rm(path + i)) {
  +   struct buffer_head *bh;
  /* go to the next level */
  ext_debug(move to level %d (block %llu)\n,
i + 1, idx_pblock(path[i].p_idx));
  memset(path + i + 1, 0, sizeof(*path));
  -   path[i+1].p_bh =
  -   sb_bread(sb, idx_pblock(path[i].p_idx));
  -   if (!path[i+1].p_bh) {
  +   bh = sb_bread(sb, idx_pblock(path[i].p_idx));
  +   if (!bh) {
  /* should we reset i_size

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 19:31 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:10 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  [PATCH] jbd2 stats through procfs
  
  The patch below updates the jbd stats patch to 2.6.20/jbd2.
  The initial patch was posted by Alex Tomas in December 2005
  (http://marc.info/?l=linux-ext4m=113538565128617w=2).
  It provides statistics via procfs such as transaction lifetime and size.
  
  [ This probably should be rewritten to use debugfs?   -- Ted]
  
 
 I suppose that given that we're creating a spot in debugfs for the jbd2
 debug code, yes, this also should be moved over.
 
 But the jbd2 debug debugfs entries were kernel-wide whereas this is
 per-superblock.  I think.
 

I take this comment as we still keep the stats info in proc fs...

 That email from Alex contains pretty important information.  I suggest that
 it be added to the changelog after accuracy-checking.  Addition to
 Documentation/filesystems/ext4.txt would be good.
 
Done.

I hope Alex can help address the rest of comments.

  --
  
  Index: linux-2.6.22-rc4/include/linux/jbd2.h
  ===
  --- linux-2.6.22-rc4.orig/include/linux/jbd2.h  2007-06-11 
  17:28:17.0 -0700
  +++ linux-2.6.22-rc4/include/linux/jbd2.h   2007-06-13 10:45:21.0 
  -0700
  @@ -408,6 +408,16 @@
   };
   
   
  +/*
  + * Some stats for checkpoint phase
  + */
  +struct transaction_chp_stats_s {
  +   unsigned long   cs_chp_time;
  +   unsigned long   cs_forced_to_close;
  +   unsigned long   cs_written;
  +   unsigned long   cs_dropped;
  +};
 
 It would be nice to document what units all these fields are in.  Jiffies,
 I assume.
 
   /* The transaction_t type is the guts of the journaling mechanism.  It
* tracks a compound transaction through its various states:
*
  @@ -543,6 +553,21 @@
  spinlock_t  t_handle_lock;
   
  /*
  +* Longest time some handle had to wait for running transaction
  +*/
  +   unsigned long   t_max_wait;
  +
  +   /*
  +* When transaction started
  +*/
  +   unsigned long   t_start;
  +
  +   /*
  +* Checkpointing stats [j_checkpoint_sem]
  +*/
  +   struct transaction_chp_stats_s t_chp_stats;
  +
  +   /*
   * Number of outstanding updates running on this transaction
   * [t_handle_lock]
   */
  @@ -573,6 +598,57 @@
   
   };
   
  +struct transaction_run_stats_s {
  +   unsigned long   rs_wait;
  +   unsigned long   rs_running;
  +   unsigned long   rs_locked;
  +   unsigned long   rs_flushing;
  +   unsigned long   rs_logging;
  +
  +   unsigned long   rs_handle_count;
  +   unsigned long   rs_blocks;
  +   unsigned long   rs_blocks_logged;
  +};
  +
  +struct transaction_stats_s
  +{
  +   int ts_type;
  +   unsigned long   ts_tid;
  +   union {
  +   struct transaction_run_stats_s run;
  +   struct transaction_chp_stats_s chp;
  +   } u;
  +};
  +
  +#define JBD2_STATS_RUN 1
  +#define JBD2_STATS_CHECKPOINT  2
  +
  +#define ts_waitu.run.rs_wait
  +#define ts_running u.run.rs_running
  +#define ts_locked  u.run.rs_locked
  +#define ts_flushingu.run.rs_flushing
  +#define ts_logging u.run.rs_logging
  +#define ts_handle_countu.run.rs_handle_count
  +#define ts_blocks  u.run.rs_blocks
  +#define ts_blocks_logged   u.run.rs_blocks_logged
  +
  +#define ts_chp_timeu.chp.cs_chp_time
  +#define ts_forced_to_close u.chp.cs_forced_to_close
  +#define ts_written u.chp.cs_written
  +#define ts_dropped u.chp.cs_dropped
 
 That's a bit sleazy.  We can drop the u from 'struct transaction_stats_s'
 and make it an anonymous union, then open-code foo.run.rs_wait everywhere.
 
 But that's a bit sleazy too, because the reader of the code would not know
 that a write to foo.run.rs_wait will stomp on the value of
 foo.chp.cs_chp_time.
 
 So to minimize reader confusion it would be best I think to just open-code
 the full u.run.rs_wait at all code-sites.
 
 The macros above are the worst possible choice: they hide information from
 the code-reader just to save the code-writer a bit of typing.  But we very
 much want to optimise for code-readers, not for code-writers.
 
  +#define CURRENT_MSECS  (jiffies_to_msecs(jiffies))
 
 hm, that isn't something which should be in an ext4 header file.  And it
 shouldn't be in UPPER CASE and it shouldn't pretend to be a constant (or a
 global scalar).
 
 IOW: yuk.
 
 How's about raising a separate, standalone patch which creates a new
 kernel-wide coded-in-C function such as
 
 unsigned long jiffies_in_msecs(void);
 
 ?  (That's assuming we don't already have one.  Most likely we have seven
 of them hiding in various dark corners).
 
  +static inline

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 21:42 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 23:21:49 -0400 Cédric Augonnet [EMAIL PROTECTED] 
 wrote:
 
  2007/7/10, Andrew Morton [EMAIL PROTECTED]:
  
  Hi all,
  
+ size = sizeof(struct transaction_stats_s);
+ s-stats = kmalloc(size, GFP_KERNEL);
+ if (s == NULL) {
+ kfree(s);
+ return -EIO;
  
   ENOMEM
  
  I'm sorry if i missed some point, but i just don't see the use of such
  a kfree here, not sure Andrew meant you should only return ENOMEM
  instead, but why issuing those kfree(NULL), instead of just a if (!s)
  return ENOMEM ?
  
 
 You found a bug.  It was meant to be
 
   if (s-stats == NULL)
 
 
The incremental fix patch is attached just FYI. It will be folded to the
parent patch once the parent patch is being updated.


---
 fs/jbd2/journal.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.22/fs/jbd2/journal.c
===
--- linux-2.6.22.orig/fs/jbd2/journal.c 2007-07-16 00:05:03.0 -0700
+++ linux-2.6.22/fs/jbd2/journal.c  2007-07-16 00:05:59.0 -0700
@@ -751,7 +751,7 @@ static int jbd2_seq_history_open(struct 
return -EIO;
size = sizeof(struct transaction_stats_s) * journal-j_history_max;
s-stats = kmalloc(size, GFP_KERNEL);
-   if (s == NULL) {
+   if (s-stats == NULL) {
kfree(s);
return -EIO;
}


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-16 Thread Mingming Cao

On Fri, 2007-07-13 at 02:05 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 16:32:47 -0700 Andrew Morton [EMAIL PROTECTED] wrote:
 
   + brelse(bh);
   + up_write(EXT4_I(inode)-xattr_sem);
   + return error;
   +}
   +
  
  We're doing GFP_KERNEL memory allocations while holding xattr_sem.  This
  can cause the VM to reenter the filesystem, perhaps taking i_mutex and/or
  i_truncate_sem and/or journal_start() (I forget whether this still
  happens).  Have we checked whether this can occur and if so, whether we are
  OK from a lock ranking POV?  Bear in mind that journalled-data mode is more
  complex in this regard.
 
 I notice that everyone carefully avoided addressing this ;)

I am not sure why we need GFP_KERNEL flag here. I think we should use
GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as
fixing memory leak issue introduced by the ext4 expand inode extra isize
patch.

Fixing memory allocation issue with expand inode extra isize patch.

- use GFP_NOFS instead of GFP_KERNEL flag for memory allocation
- use kzalloc instead of kmalloc
- fix memory leak in the success case, at the end of while loop.


Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/ext4/xattr.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6.22/fs/ext4/xattr.c
===
--- linux-2.6.22.orig/fs/ext4/xattr.c   2007-07-16 16:12:18.0 -0700
+++ linux-2.6.22/fs/ext4/xattr.c2007-07-16 16:35:59.0 -0700
@@ -1204,8 +1204,8 @@ retry:
unsigned int shift_bytes; /* No. of bytes to shift EAs by? */
unsigned int min_total_size = ~0U;
 
-   is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_KERNEL);
-   bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_KERNEL);
+   is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS);
+   bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS);
if (!is || !bs) {
error = -ENOMEM;
goto cleanup;
@@ -1251,8 +1251,8 @@ retry:
size = le32_to_cpu(entry-e_value_size);
entry_size = EXT4_XATTR_LEN(entry-e_name_len);
i.name_index = entry-e_name_index,
-   buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_KERNEL);
-   b_entry_name = kmalloc(entry-e_name_len + 1, GFP_KERNEL);
+   buffer = kzalloc(EXT4_XATTR_SIZE(size), GFP_NOFS);
+   b_entry_name = kzalloc(entry-e_name_len + 1, GFP_NOFS);
if (!buffer || !b_entry_name) {
error = -ENOMEM;
goto cleanup;
@@ -1302,7 +1302,15 @@ retry:
error = ext4_xattr_block_set(handle, inode, i, bs);
if (error)
goto cleanup;
+   kfree(b_entry_name);
+   kfree(buffer);
+   brelse(is-iloc.bh);
+   kfree(is);
+   kfree(bs);
+   brelse(bh);
}
+   up_write(EXT4_I(inode)-xattr_sem);
+return 0;
 
 cleanup:
kfree(b_entry_name);





-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-16 Thread Mingming Cao

On Mon, 2007-07-16 at 18:06 -0600, Andreas Dilger wrote:
 On Jul 16, 2007  16:52 -0700, Mingming Cao wrote:
  I am not sure why we need GFP_KERNEL flag here. I think we should use
  GFP_NOFS instead. The following patch use the GFP_NOFS flag, as well as
  fixing memory leak issue introduced by the ext4 expand inode extra isize
  patch.
  
  Fixing memory allocation issue with expand inode extra isize patch.
  
  - use GFP_NOFS instead of GFP_KERNEL flag for memory allocation
  - use kzalloc instead of kmalloc
 
 This doesn't need kzalloc() for buffer and b_entry_name, since they are
 immediately overwritten by memcpy().
 
ok.
  - fix memory leak in the success case, at the end of while loop.
  goto cleanup;
  @@ -1302,7 +1302,15 @@ retry:
  error = ext4_xattr_block_set(handle, inode, i, bs);
  if (error)
  goto cleanup;
  +   kfree(b_entry_name);
  +   kfree(buffer);
  +   brelse(is-iloc.bh);
  +   kfree(is);
  +   kfree(bs);
  +   brelse(bh);
  }
  +   up_write(EXT4_I(inode)-xattr_sem);
  +return 0;
   
   cleanup:
  kfree(b_entry_name);
 
 I don't think you should have brelse(bh) inside the loop, since it is
 allocated before the loop starts.
 
Ahh right. thanks.

Updated fix.

Fixing memory allocation issue with expand inode extra isize patch.

- use GFP_NOFS instead of GFP_KERNEL to prevent fs reenter
- fix memory leak in the success case, at the end of while loop.


Signed-off-by: Mingming Cao [EMAIL PROTECTED]
---
 fs/ext4/xattr.c |   16 
 1 file changed, 12 insertions(+), 4 deletions(-)

Index: linux-2.6.22/fs/ext4/xattr.c
===
--- linux-2.6.22.orig/fs/ext4/xattr.c   2007-07-16 17:21:14.0 -0700
+++ linux-2.6.22/fs/ext4/xattr.c2007-07-16 17:22:25.0 -0700
@@ -1204,8 +1204,8 @@ retry:
unsigned int shift_bytes; /* No. of bytes to shift EAs by? */
unsigned int min_total_size = ~0U;
 
-   is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_KERNEL);
-   bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_KERNEL);
+   is = kzalloc(sizeof(struct ext4_xattr_ibody_find), GFP_NOFS);
+   bs = kzalloc(sizeof(struct ext4_xattr_block_find), GFP_NOFS);
if (!is || !bs) {
error = -ENOMEM;
goto cleanup;
@@ -1251,8 +1251,8 @@ retry:
size = le32_to_cpu(entry-e_value_size);
entry_size = EXT4_XATTR_LEN(entry-e_name_len);
i.name_index = entry-e_name_index,
-   buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_KERNEL);
-   b_entry_name = kmalloc(entry-e_name_len + 1, GFP_KERNEL);
+   buffer = kmalloc(EXT4_XATTR_SIZE(size), GFP_NOFS);
+   b_entry_name = kmalloc(entry-e_name_len + 1, GFP_NOFS);
if (!buffer || !b_entry_name) {
error = -ENOMEM;
goto cleanup;
@@ -1302,7 +1302,15 @@ retry:
error = ext4_xattr_block_set(handle, inode, i, bs);
if (error)
goto cleanup;
+   kfree(b_entry_name);
+   kfree(buffer);
+   brelse(is-iloc.bh);
+   kfree(is);
+   kfree(bs);
}
+   brelse(bh);
+   up_write(EXT4_I(inode)-xattr_sem);
+   return 0;
 
 cleanup:
kfree(b_entry_name);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-16 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:36:56 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  This patch is a spinoff of the old nanosecond patches.
 
 I don't know what the old nanosecond patches are.  A link to a suitable
 changlog for those patches would do in a pinch.  Preferable would be to
 write a proper changelog for this patch.
 
  It includes some cleanups and addition of a creation timestamp. The
  EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
  s_{min, want}_extra_isize fields in struct ext3_super_block.
  
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
  Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
  Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  
  Index: linux-2.6.22-rc4/fs/ext4/ialloc.c
 
 Please include diffstat output when preparing patches.
 
  +
  +#define EXT4_FITS_IN_INODE(ext4_inode, einode, field)  \
  +   ((offsetof(typeof(*ext4_inode), field) +\
  + sizeof((ext4_inode)-field))  \
  +   = (EXT4_GOOD_OLD_INODE_SIZE +  \
  +   (einode)-i_extra_isize))   \
 
 Please add explanatory commentary to EXT4_FITS_IN_INODE(): tell readers
 under what circumstances something will not fit in an inode and what the
 consequences of this are.
 
  +static inline __le32 ext4_encode_extra_time(struct timespec *time)
  +{
  +   return cpu_to_le32((sizeof(time-tv_sec)  4 ?
  +  time-tv_sec  32 : 0) |
  +  ((time-tv_nsec  2)  EXT4_NSEC_MASK));
  +}
  +
  +static inline void ext4_decode_extra_time(struct timespec *time, __le32 
  extra)
  +{
  +   if (sizeof(time-tv_sec)  4)
  +  time-tv_sec |= (__u64)(le32_to_cpu(extra)  EXT4_EPOCH_MASK)
  +   32;
  +   time-tv_nsec = (le32_to_cpu(extra)  EXT4_NSEC_MASK)  2;
  +}
 
 Consider uninlining these functions.
 
I got compile warining after apply Kalpal's update nanosecond patch,
which makes these two function inline. It complains these functions are
defined but not used. It's being used only in the following
micros(EXT4_INODE_SET_XTIME etc).  So if the .c file included the
ext4_fs.h but not using the micros, the compile will think these two
functions are not used.

Mingming

  +#define EXT4_INODE_SET_XTIME(xtime, inode, raw_inode)  
 \
  +do {   
 \
  +   (raw_inode)-xtime = cpu_to_le32((inode)-xtime.tv_sec);   \
  +   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
  +   (raw_inode)-xtime ## _extra = \
  +   ext4_encode_extra_time((inode)-xtime);   \
  +} while (0)
  +
  +#define EXT4_EINODE_SET_XTIME(xtime, einode, raw_inode)
 \
  +do {   
 \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
  +   (raw_inode)-xtime = cpu_to_le32((einode)-xtime.tv_sec);  \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
  +   (raw_inode)-xtime ## _extra = \
  +   ext4_encode_extra_time((einode)-xtime);  \
  +} while (0)
  +
  +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)  
 \
  +do {   
 \
  +   (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
  +   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
  +   ext4_decode_extra_time((inode)-xtime,\
  +  raw_inode-xtime ## _extra);\
  +} while (0)
  +
  +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)
 \
  +do {   
 \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
  +   (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);  \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
  +   ext4_decode_extra_time((einode)-xtime,   \
  +  raw_inode-xtime ## _extra);\
  +} while (0)
 
 Ugly.  I expect these could be implemented as plain old C functions. 
 Caller could pass in the address of the ext4_inode field which the function
 is to operate upon.
 
   #if defined(__KERNEL__) || defined(__linux__)
 
 (What's the __linux__ for?)
 
   #define i_reserved1osd1.linux1.l_i_reserved1
   #define i_frag osd2.linux2.l_i_frag
  @@ -539,6 +603,13 @@
  return container_of(inode, struct

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-13 Thread Mingming Cao

On Fri, 2007-07-13 at 12:35 +0530, Kalpak Shah wrote:
 On Fri, 2007-07-13 at 09:59 +0530, Aneesh Kumar K.V wrote:
  
  Kalpak Shah wrote:
   On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
   On Sun, 01 Jul 2007 03:36:56 -0400
   Mingming Cao [EMAIL PROTECTED] wrote:
  
   This patch is a spinoff of the old nanosecond patches.
   I don't know what the old nanosecond patches are.  A link to a suitable
   changlog for those patches would do in a pinch.  Preferable would be to
   write a proper changelog for this patch.
   
   The incremental patch contains a proper changelog describing the patch.
   
  
  
  Instead of  putting incremental patches it would be nice if we can have 
  replacement patches.
  for the already existing patches with the comments addressed. For example 
  if we have a 
  review comment on the patch message ( commit log ) then adding an 
  incremental patch doesn't help.
 
 I think that it would be easier to review just the changes that have
 been made to the patches instead of having people go through the entire
 patch again. I was hoping that someone with write access to ext4-git
 would update the commit logs.
 
 If replacement patches are preferred, then I will send them again.
 

No need, I already fold your fix patch to the parent patches, so in the
updated ext4-patch-queue it saved the updated nanosecond patch.

 Thanks,
 Kalpak.
 
  
  
  -aneesh
  -
  To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-11 Thread Mingming Cao

On Wed, 2007-07-11 at 00:38 -0500, Jose R. Santos wrote:
 On Tue, 10 Jul 2007 16:30:25 -0700
 Andrew Morton [EMAIL PROTECTED] wrote:
 
  On Sun, 01 Jul 2007 03:36:48 -0400
  Mingming Cao [EMAIL PROTECTED] wrote:
  
On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
 The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
 create_proc_entry() does not do lookups on file names with more that 
 one
 directory deep.  This causes the entry creation to fail and hence, no 
 proc
 file is created.  This patch moves the file to /proc/jbd2-degug.
 
 The file could be move to /proc/fs/jbd2/jbd2-debug, but it would 
 require
 some minor alterations to the jbd-stats patch.

I don't think we really want to be adding top-level files in /proc.
What about using the debugfs filesystem (not to be confused with
the e2fsprogs 'debugfs' command)?
   
   How about this then?  Moved the file to use debugfs as well as having
   the nice effect of removing more lines than what it adds.
   
   Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
  
  Please clean up the changelog.
  
  The changelog should include information about the location and the content
  of these debugfs files.  it should provide any instructions which users
  will need to be able to create and use those files.
 
 Will fix.
 
  Alternatively (and preferably) do this via an update to
  Documentation/filesystems/ext4.txt.
 
 Seems like I also need to update the doc on Kconfig as well.  Do you
 prefer this in separate patches? (current patch, kconfig patch, ext4
 doc update patch?
 
fs/jbd2/journal.c|   6220 +42 -0 !
include/linux/jbd2.h |21 + 1 - 0 !
2 files changed, 21 insertions(+), 43 deletions(-)
  
  Again, this patch isn't in Ted's kernel.org directory and hasn't been in 
  -mm.
  
  Apart from the lack of testing and review which this causes, it means I
  can't just do `pushpatch name-of-this-patch' and look at it in tkdiff.  So
  I squint at the diff, but that's harder when the diff wasn't prepared with
  `diff -p'.  Oh well.
 
 Will fix.
 
  
   Index: linux-2.6.22-rc4/fs/jbd2/journal.c
   ===
   --- linux-2.6.22-rc4.orig/fs/jbd2/journal.c   2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/fs/jbd2/journal.c2007-06-11 16:36:10.0 
   -0700
   @@ -35,6 +35,7 @@
#include linux/kthread.h
#include linux/poison.h
#include linux/proc_fs.h
   +#include linux/debugfs.h

#include asm/uaccess.h
#include asm/page.h
   @@ -1954,60 +1955,37 @@
 * /proc tunables
 */
#if defined(CONFIG_JBD2_DEBUG)
   -int jbd2_journal_enable_debug;
   +u16 jbd2_journal_enable_debug;
EXPORT_SYMBOL(jbd2_journal_enable_debug);
#endif

   -#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
   +#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
  
  Has this been compile-tested with CONFIG_DEBUGFS=n?
 
 I think I did, but honestly don't remember.  Will check with the new
 patch. :) 
 

Yes, I remember I did, that discovered some inconsistency in ext4 code,
which has already been fixed.


   -#define create_jbd_proc_entry() do {} while (0)
   -#define jbd2_remove_jbd_proc_entry() do {} while (0)
   +#define jbd2_create_debugfs_entry() do {} while (0)
   +#define jbd2_remove_debugfs_entry() do {} while (0)
  
  I suggest that these be converted to (preferable) inline functions while
  you're there.
 
 OK.
 
#endif

   @@ -2067,7 +2045,7 @@
 ret = journal_init_caches();
 if (ret != 0)
 jbd2_journal_destroy_caches();
   - create_jbd_proc_entry();
   + jbd2_create_debugfs_entry();
 return ret;
}

   @@ -2078,7 +2056,7 @@
 if (n)
 printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
#endif
   - jbd2_remove_jbd_proc_entry();
   + jbd2_remove_debugfs_entry();
 jbd2_journal_destroy_caches();
}

   Index: linux-2.6.22-rc4/include/linux/jbd2.h
   ===
   --- linux-2.6.22-rc4.orig/include/linux/jbd2.h2007-06-11 
   16:16:18.0 -0700
   +++ linux-2.6.22-rc4/include/linux/jbd2.h 2007-06-11 16:35:25.0 
   -0700
   @@ -57,7 +57,7 @@
 * CONFIG_JBD2_DEBUG is on.
 */
#define JBD_EXPENSIVE_CHECKING
  
  JBD2?
 
   -extern int jbd2_journal_enable_debug;
   +extern u16 jbd2_journal_enable_debug;
  
  Why was this made 16-bit?  To save 2 bytes?  Could have saved 3 if we're
  going to do that.
 
 OK.
 
  Shoudln't all this debug info be a per-superblock thing rather than
  kernel-wide?
 
 I don't think it is worth pursuing this feature since this seems to
 have been broken for a while now (its been there since the first git
 revission in ext3) and nobody has noticed it until now.  It could be
 address on a later patch though, since the initial purpose of the patch
 was to fix

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 23:35 -0400, Dave Jones wrote:
 On Tue, Jul 10, 2007 at 05:35:13PM -0400, Mingming Cao wrote:
   
   Sorry about this. I was using version 0.04. The latest one I can find
   for now is 0.05(searching lkml), but it didn't catch this codling style
   bug either. I appreciate if anyone can point me the version 0.07, thanks
 
 It's now in-tree in scripts/checkpatch.pl
 (linus' tree is still at 0.06 though, I suspect Andrew has something
  newer in -mm)
 

Thanks, Andy has uploaded the 0.07 version at
http://www.shadowen.org/~apw/public/checkpatch/checkpatch.pl-0.07


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 22:17 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 22:09:08 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  David Chinneer pointed that we need to journal the version number
  updates together with the operations that causes the change of the inode
  version number, in order to survive server crashes so clients won't see
  the counter go backwards.
  
  So increment i_version in fs code is probably the place to ensure the
  inode version changes are stored to disk. It's seems update the ext4
  inode version in every ext4_mark_inode_dirty() is the easiest way.
 
 That still makes us dependent upon _something_ changing the inode.  For
 overwrites the only something is mtime.
 
 If we don't want to have a peculiar dependency upon s_time_gran=1e9 (and
 I don't think we do) then I guess we'll need new code in or around
 file_update_time() to do this.

do you mean mark inode dirty all the times in file_update_time()? Not
sure about the overhead for ext3/4.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-11 Thread Mingming Cao

On Tue, 2007-07-10 at 23:20 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:38:59 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  From: Dmitry Monakhov [EMAIL PROTECTED]
  Subject: ext4: extent macros cleanup
  
  - Replace math equation to it's macro equivalent
 
 s/it's/its/;)
Okay.

 
  - make ext4_ext_grow_indepth() indexes/leaf correct
 
 hm, what was wrong with it?
 
Looking at the code, ext4_ext_ext_grow_indepth() implements tree growing
procedure. It allocates a new index block, moves the top-level data of
the tree(root or leaf blocks in i_data) into the new block, initializes
the new root, creating index that points to the just created index block

The original top-level data (in i_data) could be extent tree root (index
block) or extents (leaf block). The current code (without the patch)
treats the top-level data always be the leaf block, which is incorrect.


assumes when the tree is growing the extent structure pass in is always
  @@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, 
  struct inode *inode,
  curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
  curp-p_hdr-eh_entries = cpu_to_le16(1);
  curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
  -   /* FIXME: it works, but actually path[0] can be index */
  -   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
  +   
  +   if (path[0].p_hdr-eh_depth)
  + curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
  +   else
  + curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
 
 whitespace bustage there.
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 1/2] Add noextents mount option

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:35:48 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Add a mount option to turn off extents.
 
 Please update the changelog to describe the reason for making this change.
 
 

Sure, I will update the changelog, mainly this is for wide testing of
extents feature in ext4dev.

  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  ---
  Index: linux-2.6.22-rc4/fs/ext4/super.c
  ===
  --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 17:02:18.0 
  -0700
  +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:02:22.0 
  -0700
  @@ -725,7 +725,7 @@
  Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
  Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
  Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
  -   Opt_grpquota, Opt_extents,
  +   Opt_grpquota, Opt_extents, Opt_noextents,
   };
   
   static match_table_t tokens = {
  @@ -776,6 +776,7 @@
  {Opt_usrquota, usrquota},
  {Opt_barrier, barrier=%u},
  {Opt_extents, extents},
  +   {Opt_noextents, noextents},
  {Opt_err, NULL},
  {Opt_resize, resize},
   };
  @@ -,6 +1112,9 @@
  case Opt_extents:
  set_opt (sbi-s_mount_opt, EXTENTS);
  break;
  +   case Opt_noextents:
  +   clear_opt (sbi-s_mount_opt, EXTENTS);
  +   break;
  default:
  printk (KERN_ERR
  EXT4-fs: Unrecognized mount option \%s\ 
  
  
  -
  To unsubscribe from this list: send the line unsubscribe linux-kernel in
  the body of a message to [EMAIL PROTECTED]
  More majordomo info at  http://vger.kernel.org/majordomo-info.html
  Please read the FAQ at  http://www.tux.org/lkml/

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:36:32 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
  than 32bit block sizes during mount time.  This ensure proper record
  lenth when writing to the journal.
 
 This patch isn't in Ted's kernel.org directory and hasn't been in -mm. 
 Where did it come from?  Is something amiss with ext4 patch management?
 
Jose Santo posted it to linux-ext4 mailing list.

I agree this bug fix should included in Ted's git tree or mm tree. There
are other ext4 cleanups in this series that should goes to mm tree also.

  Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
  Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
  Signed-off-by: Mingming Cao [EMAIL PROTECTED]
  Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
  ---
   fs/ext4/super.c |   11 +++
   1 file changed, 11 insertions(+)
  
  Index: linux-2.6.22-rc4/fs/ext4/super.c
  ===
  --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 16:15:54.0 
  -0700
  +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 16:16:10.0 
  -0700
  @@ -1804,6 +1804,13 @@
 
 Please prepare patches using `diff -p'
 

Will do.

  goto failed_mount3;
  }
   
  +   if (ext4_blocks_count(es)  0xULL 
  +   !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0,
  +  JBD2_FEATURE_INCOMPAT_64BIT)) {
  +   printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n);
  +   goto failed_mount4;
  +   }
 
 It is not appropriate for the text ext4 to appear in a JBD2 message.

This is part of ext4 code. Ext4 will set the 64-bit JBD2 flag if the fs
is larger than 32 bit blocks.

  /* We have now updated the journal if required, so we can
   * validate the data journaling mode. */
  switch (test_opt(sb, DATA_FLAGS)) {
 
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:36:01 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  Turn on extents feature by default in ext4 filesystem. User could use
  -o noextents to turn it off.
  
 
 Oh, there you go.
 
  
  Index: linux-2.6.22-rc4/fs/ext4/super.c
  ===
  --- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 17:02:22.0 
  -0700
  +++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:03:09.0 
  -0700
  @@ -1546,6 +1546,12 @@
   
  set_opt(sbi-s_mount_opt, RESERVATION);
   
  +   /*
  +* turn on extents feature by default in ext4 filesystem
  +* User -o noextents to turn it off
  +*/
  +   set_opt (sbi-s_mount_opt, EXTENTS);
  +
 
 Broken coding style.
 
 Please feed all the ext4 patches through scripts/checkpatch.pl (preferably
 version 0.07 - see Andy's patch on lkml) and then consider addressing the
 (quite large) number of mistakes which are detected.
 

Sorry about this. I was using version 0.04. The latest one I can find
for now is 0.05(searching lkml), but it didn't catch this codling style
bug either. I appreciate if anyone can point me the version 0.07, thanks

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:04 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  This patch converts the 32-bit i_version in the generic inode to a 64-bit
  i_version field.
  
 
 That's obvious from the patch.  But what was the reason for making this
 (unrelated to ext4) change?
 

The need is came from NFSv4

On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
 Hi,
 
 This is an update of the i_version patch.
 The i_version field is a 64bit counter that is set on every inode
 creation and that is incremented every time the inode data is modified
 (similarly to the ctime time-stamp).
 The aim is to fulfill a NFSv4 requirement for rfc3530:
 5.5.  Mandatory Attributes - Definitions
 Name  #   DataType   Access   Description
 ___
 change3   uint64   READ A value created by the
   server that the client can use to determine if file
   data, directory contents or attributes of the object
   have been modified.  The servermay return the object's
   time_metadata attribute for this attribute's value but
   only if the filesystem object can not be updated more
   frequently than the resolution of time_metadata.
 
 

 Please update the changelog for this.
 

Is above description clear to you?


  Index: linux-2.6.21/include/linux/fs.h
  ===
  --- linux-2.6.21.orig/include/linux/fs.h
  +++ linux-2.6.21/include/linux/fs.h
  @@ -549,7 +549,7 @@ struct inode {
  uid_t   i_uid;
  gid_t   i_gid;
  dev_t   i_rdev;
  -   unsigned long   i_version;
  +   u64 i_version;
  loff_t  i_size;
   #ifdef __NEED_I_SIZE_ORDERED
  seqcount_t  i_size_seqcount;
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
   On Sun, 01 Jul 2007 03:37:04 -0400
   Mingming Cao [EMAIL PROTECTED] wrote:
   
This patch converts the 32-bit i_version in the generic inode to a 
64-bit
i_version field.

   
   That's obvious from the patch.  But what was the reason for making this
   (unrelated to ext4) change?
   
  
  The need is came from NFSv4
  
  On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
   Hi,
   
   This is an update of the i_version patch.
   The i_version field is a 64bit counter that is set on every inode
   creation and that is incremented every time the inode data is modified
   (similarly to the ctime time-stamp).
   The aim is to fulfill a NFSv4 requirement for rfc3530:
   5.5.  Mandatory Attributes - Definitions
   Name  #   DataType   Access   Description
   ___
   change3   uint64   READ A value created by the
 server that the client can use to determine if file
 data, directory contents or attributes of the object
 have been modified.  The servermay return the object's
 time_metadata attribute for this attribute's value but
 only if the filesystem object can not be updated more
 frequently than the resolution of time_metadata.
   
   
  
   Please update the changelog for this.
   
  
  Is above description clear to you?
  
 
 Yes, thanks.  It doesn't actually tell us why we want to implement
 this attribute and it doesn't tell us what the implications of failing
 to do so are, but I guess we can take that on trust from the NFS guys.
 
 But I suspect the ext4 implementation doesn't actually do this.  afaict we
 won't update i_version for file overwrites (especially if s_time_gran can
 indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
 would be the implications of this?
 

In the case of overwrite (file date updated), I assume the ctime/mtime
is being updated and the inode is being dirtied, so the version number
is being updated.

 vfs_write()-..
-__generic_file_aio_write_nolock()
-file_update_time()
-mark_inode_dirty_sync()
-__mark_inode_dirty(I_DIRTY_SYNC)
-ext4_dirty_inode()
-ext4_mark_inode_dirty()

 And how does the NFS server know that the filesystem implements i_version? 
 Will a zero-value of i_version have special significance, telling the
 server to not send this attribute, perhaps?

Bruce raised up this question a few days back when he reviewed this
patch, I think the solution is add a superblock flag for fs support
inode versioning, probably at VFS layer?

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:36:56 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  This patch is a spinoff of the old nanosecond patches.
 
 I don't know what the old nanosecond patches are.  A link to a suitable
 changlog for those patches would do in a pinch.  Preferable would be to
 write a proper changelog for this patch.
 
I found the original patch
http://marc.info/?l=linux-ext4m=115091699809181w=2

Andreas or Kalpak, is changelog from the original patch is accurate to
apply here?

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Wed, 2007-07-11 at 13:21 +1000, Neil Brown wrote:
 On Tuesday July 10, [EMAIL PROTECTED] wrote:
  
  Yes, thanks.  It doesn't actually tell us why we want to implement
  this attribute and it doesn't tell us what the implications of failing
  to do so are, but I guess we can take that on trust from the NFS guys.
 
 You would like to think so, but remember NFSv4 was designed by a
 committee :-)
 
 The 'change' number is used for cache consistency, and as the spec
 makes very strong statements about the 'change' number, it is very
 hard (or impossible) to implement a server correctly without storing a
 change number in stable storage (just one of my grips about V4).
 
  
  But I suspect the ext4 implementation doesn't actually do this.  afaict we
  won't update i_version for file overwrites (especially if s_time_gran can
  indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
  would be the implications of this?
 
 The first part sounds like a bug - i_version should really be updated
 by every call to -commit_write (if that is still what it is called).
 
 The MAP_SHARED thing is less obvious.  I guess every time we notice
 that the page might have been changed, we need to increment i_version.
 
  
  And how does the NFS server know that the filesystem implements i_version? 
  Will a zero-value of i_version have special significance, telling the
  server to not send this attribute, perhaps?
 
 That is a very important question.  Zero probably makes sense, but
 what ever it is needs to be agreed and documented.
 And just by-the-way, the server doesn't really have the option of not
 sending the attribute.  If i_version isn't defined, it has to fake
 something using mtime, and hope that is good enough.
 
 Alternately we could mandate that i_version is always kept up-to-date
 and if a filesystem doesn't have anything to load from storage, it
 just sets it to the current time in nanoseconds.
 
 That would mean that a client would need to flush it's cache whenever
 the inode fell out of cache on the server, but I don't think we can
 reliably do better than that.
 
 I think I like that approach.
 
 So my vote is to increment i_version in common code every time any
 change is made to the file, 

David Chinneer pointed that we need to journal the version number
updates together with the operations that causes the change of the inode
version number, in order to survive server crashes so clients won't see
the counter go backwards.

So increment i_version in fs code is probably the place to ensure the
inode version changes are stored to disk. It's seems update the ext4
inode version in every ext4_mark_inode_dirty() is the easiest way.

 and alloc_inode should initialise it to
 current time, which might be changed by the filesystem before it calls
 unlock_new_inode. 
 ... but doesn't lustre want to control its i_version... so maybe not :-(
 
 NeilBrown

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 6][PATCH 1/1]Export jbd stats through procfs

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 21:42 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 23:21:49 -0400 Cédric Augonnet [EMAIL PROTECTED] 
 wrote:
 
  2007/7/10, Andrew Morton [EMAIL PROTECTED]:
  
  Hi all,
  
+ size = sizeof(struct transaction_stats_s);
+ s-stats = kmalloc(size, GFP_KERNEL);
+ if (s == NULL) {
+ kfree(s);
+ return -EIO;
  
   ENOMEM
  
  I'm sorry if i missed some point, but i just don't see the use of such
  a kfree here, not sure Andrew meant you should only return ENOMEM
  instead, but why issuing those kfree(NULL), instead of just a if (!s)
  return ENOMEM ?
  
 
 You found a bug.  It was meant to be
 
   if (s-stats == NULL)
 
 

Thanks, I will make sure this get fixed in ext4 patch queue.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-10 Thread Mingming Cao

On Tue, 2007-07-10 at 21:22 -0700, Andrew Morton wrote:
 On Tue, 10 Jul 2007 20:19:16 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
 
  On Tue, 2007-07-10 at 18:22 -0700, Andrew Morton wrote:
   On Tue, 10 Jul 2007 18:09:40 -0400 Mingming Cao [EMAIL PROTECTED] wrote:
   
On Tue, 2007-07-10 at 16:30 -0700, Andrew Morton wrote:
 On Sun, 01 Jul 2007 03:37:04 -0400
 Mingming Cao [EMAIL PROTECTED] wrote:
 
  This patch converts the 32-bit i_version in the generic inode to a 
  64-bit
  i_version field.
  
 
 That's obvious from the patch.  But what was the reason for making 
 this
 (unrelated to ext4) change?
 

The need is came from NFSv4

On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote: 
 Hi,
 
 This is an update of the i_version patch.
 The i_version field is a 64bit counter that is set on every inode
 creation and that is incremented every time the inode data is modified
 (similarly to the ctime time-stamp).
 The aim is to fulfill a NFSv4 requirement for rfc3530:
 5.5.  Mandatory Attributes - Definitions
 Name  #   DataType   Access   Description
 ___
 change3   uint64   READ A value created 
 by the
   server that the client can use to determine if file
   data, directory contents or attributes of the object
   have been modified.  The servermay return the object's
   time_metadata attribute for this attribute's value but
   only if the filesystem object can not be updated more
   frequently than the resolution of time_metadata.
 
 

 Please update the changelog for this.
 

Is above description clear to you?

   
   Yes, thanks.  It doesn't actually tell us why we want to implement
   this attribute and it doesn't tell us what the implications of failing
   to do so are, but I guess we can take that on trust from the NFS guys.
   
   But I suspect the ext4 implementation doesn't actually do this.  afaict we
   won't update i_version for file overwrites (especially if s_time_gran can
   indeed be 1,000,000,000) and of course for MAP_SHARED modifications.  What
   would be the implications of this?
   
  
  In the case of overwrite (file date updated), I assume the ctime/mtime
  is being updated and the inode is being dirtied, so the version number
  is being updated.
  
   vfs_write()-..
  -__generic_file_aio_write_nolock()
  -file_update_time()
  -mark_inode_dirty_sync()
  -__mark_inode_dirty(I_DIRTY_SYNC)
  -ext4_dirty_inode()
  -ext4_mark_inode_dirty()
 
 That assumes an mtime update for every write().  OK, so two writes in a
 single nanosecond won't be happening.  But in that case why is this code:
 
 static inline struct timespec ext4_current_time(struct inode *inode)
 {
   return (inode-i_sb-s_time_gran  NSEC_PER_SEC) ?
   current_fs_time(inode-i_sb) : CURRENT_TIME_SEC;
 }
 
 checking (s_time_gran  NSEC_PER_SEC) ??
 

Ext4 can still load/read ext3 fs (which by default with 128 bytes old
inode size, means doens't have support nanosecond timestamps), so it's
not always gurantee nanosecond timestamps granularity.(it depends on the
size of the inode (128 bytes), by default, a fresh ext4  increase inode
size to 256 bytes to have the room to store nanoseond timestamps, inode
versioning etc)

 Overall it is a bit unpleasing to rely upon mtime updates for a correct NFS
 server implementation: if we were to later decrease s_time_gran (as we
 might do, for performance reasons), the NFS server implementation starts
 reporting incorrect information.
 

:( that is a problem...

   And how does the NFS server know that the filesystem implements 
   i_version? 
   Will a zero-value of i_version have special significance, telling the
   server to not send this attribute, perhaps?
  
  Bruce raised up this question a few days back when he reviewed this
  patch, I think the solution is add a superblock flag for fs support
  inode versioning, probably at VFS layer?
 
 That would work.
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-09 Thread Mingming Cao

On Fri, 2007-07-06 at 16:53 -0600, Andreas Dilger wrote:
 On Jul 06, 2007  09:51 -0400, J. Bruce Fields wrote:
  The use of a mount option means the change attribute could be
  inconsistent across mounts.  If we really need this, wouldn't it make
  more sense for it to be a persistent feature of the filesystem, set at
  mkfs time?
 
 Yes, having it stored into the superblock in s_flags is probably a good
 idea.  Kalpak, do you think you could get a patch that adds e.g.
 EXT4_FLAGS_NO_INODE_VERSION (like EXT4_FLAGS_SIGNED_HASH in e2fsprogs).
 
Per our ext4 interlock discussion today, I have dropped the
ext4-no-inode version-mount-option patch from ext4 patch queue. 

Thanks,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-04 Thread Mingming Cao

On Tue, 2007-07-03 at 15:58 +0530, Kalpak Shah wrote:
 On Sun, 2007-07-01 at 03:36 -0400, Mingming Cao wrote:
  +
  +#define EXT4_INODE_GET_XTIME(xtime, inode, raw_inode)  
 \
  +do {   
 \
  +   (inode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);   \
  +   if (EXT4_FITS_IN_INODE(raw_inode, EXT4_I(inode), xtime ## _extra)) \
  +   ext4_decode_extra_time((inode)-xtime,\
  +  raw_inode-xtime ## _extra);\
  +} while (0)
  +
  +#define EXT4_EINODE_GET_XTIME(xtime, einode, raw_inode)
 \
  +do {   
 \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime))  \
  +   (einode)-xtime.tv_sec = le32_to_cpu((raw_inode)-xtime);  \
  +   if (EXT4_FITS_IN_INODE(raw_inode, einode, xtime ## _extra))\
  +   ext4_decode_extra_time((einode)-xtime,   \
  +  raw_inode-xtime ## _extra);\
  +} while (0)
  +
 
 This nanosecond patch seems to be missing the fix below which is required for 
 http://bugzilla.kernel.org/show_bug.cgi?id=5079
 
 If the timestamp is set to before epoch i.e. a negative timestamp then the 
 file may have its date set into the future on 64-bit systems. So when the 
 timestamp is read it must be cast as signed.

Missed this one.
Thanks. Will update ext4 patch queue tonight with this fix.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-02 Thread Mingming Cao

Trond or Bruce, can you please review these patch series and ack if you
agrees? Thanks.

As to performance concerns that raise before the inode version counter
(at least for ext4) is done inside ext4_mark_inode_dirty), so there is
no extra IO work to store this counter to disk.

Mingming
On Sun, 2007-07-01 at 03:37 -0400, Mingming Cao wrote:
 This patch converts the 32-bit i_version in the generic inode to a 64-bit
 i_version field.
 
 Signed-off-by: Mingming Cao [EMAIL PROTECTED]
 Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
 Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
 
 Index: linux-2.6.21/include/linux/fs.h
 ===
 --- linux-2.6.21.orig/include/linux/fs.h
 +++ linux-2.6.21/include/linux/fs.h
 @@ -549,7 +549,7 @@ struct inode {
   uid_t   i_uid;
   gid_t   i_gid;
   dev_t   i_rdev;
 - unsigned long   i_version;
 + u64 i_version;
   loff_t  i_size;
  #ifdef __NEED_I_SIZE_ORDERED
   seqcount_t  i_size_seqcount;
 
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Ext4 patches for 2.6.22-rc6

2007-07-01 Thread Mingming Cao

On Fri, 2007-06-29 at 13:57 -0700, Andrew Morton wrote:
 On Fri, 29 Jun 2007 11:50:04 -0400
 Mingming Caoc [EMAIL PROTECTED] wrote:
 
  I think the ext4 patch queue is in good shape now.
 
 Which ext4 patches are you intending to merge into 2.6.23?
 
 Please send all those out to lkml for review?

Hi Andrew, 

Here are the patches in ext4-patch-queue that I think can be considered
to be merged to upstream. Please review.

All of the patches have been posted on ext4 mailinglist before. Some are
bug fixes, some are features, to summaries:
- make extents on by default in ext4dev
- nanosecond timestamp
- 64 bit inode versioning support
- remove 32k subdir limits
- journal  checksumming
- journal stats via procfs
- delayed allocation for ext4 writeback mode
- fallocate()

All the patches can be found at http://repo.or.cz/w/ext4-patch-queue.git
and have been tested(with fsx ,dbench, FFSB, iozone) on
x86,x86_64,ppc64, with extents and delayed allocation enabled

And the full series can be found at
http://repo.or.cz/w/ext4-patch-queue.git?a=blob;f=series;h=2f43431db28778ce8d2149bce7a51566a2d2517c;hb=56e27e20cf228b32f5162a76b3bad154d1d3b730

I will post the patches-in-good-shape (in 9 set of patches) to lkml in
the following emails, except for the bottom two feature:

*the fallocate() patches, which Amit just posted a few days ago and are
under review (hopefully we can reach a agreement on the interface and
the modes before 2.6.23-rc1 window closed).

*Another one is the delayed allocation patches in ext4 patch queue. Alex
mentioned in another email that he is working on another version of
delalloc that can handle block size  page size, and move some work to
vfs. So it's probably not very useful to post this version for people to
review.


So, here is the series file.

# Rebased the patches to 2.6.22-rc6

# Add mount option to turn off extents
ext4_noextent_mount_opt.patch

# Mounted ext4dev fs with extents by default for testing purpose,
# for Ext4 product release, extents mount option
# will be turn on only if the fs has EXTENTS feature on
ext4_extents_on_by_default.patch

# Propagate inode flags
ext4-propagate_flags.patch

# Add extent sanity checks
ext4-extent-sanity-checks.patch

# Bug fix:set 64bit JBD2 feature on 32bit ext4 fs
ext4_set_jbd2_64bit_feature.patch

# Fix: Rename CONFIG_JBD_DEBUG to CONFIG_JBD2_DEBUG
jbd2_config_jbd2_debug_fix.patch

# Export jbd2-debug via debugfs
ext4_CONFIG_JBD2_DEBUG.patch
jbd2_move_jbd2_debug_to_debugfs.patch

# Nanosecond timestamp support
ext4-nanosecond-patch

# inode verion patch series
# inode versioning is needed for NFSv4

# vfs changes, 64 bit inode-i_version
64-bit-i_version.patch
# reserve hi 32 bit inode version on ext4 on-disk inode
i_version_hi.patch
# ext4 inode version read/store
ext4_i_version_hi_2.patch
# ext4 inode version update
i_version_update_ext4.patch
# add a noversion mount option to disable inode version updates
ext4_no_version.patch

# New patch to expand inode i_extra_isize to support features
# in high part of inode (128 bytes)
ext4_expand_inode_extra_isize.patch

# Export jbd stats through procfs
# Shall this move to debugfs?
jbd-stats-through-procfs

# Remove 32000 subdirs limit. 
ext4_remove_subdirs_limit.patch

# Add journal checksums
ext4-journal_chksum-2.6.20.patch

# Various Cleanups
ext4-zero_user_page.patch
is_power_of_2-ext4-superc.patch
ext4-remove-extra-is_rdonly-check.patch
ext4_extent_compilation_fixes.patch
ext4_extent_macros_cleanup.patch


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 1][PATCH 2/2] Enable extents by default for ext4dev

2007-07-01 Thread Mingming Cao

Turn on extents feature by default in ext4 filesystem. User could use
-o noextents to turn it off.

Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/fs/ext4/super.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 17:02:22.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 17:03:09.0 -0700
@@ -1546,6 +1546,12 @@
 
set_opt(sbi-s_mount_opt, RESERVATION);
 
+   /*
+* turn on extents feature by default in ext4 filesystem
+* User -o noextents to turn it off
+*/
+   set_opt (sbi-s_mount_opt, EXTENTS);
+
if (!parse_options ((char *) data, sb, journal_inum, journal_devnum,
NULL, 0))
goto failed_mount;


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 2][PATCH 2/5] cleanups: Add extent sanity checks

2007-07-01 Thread Mingming Cao

with the patch all headers are checked. the code should become
more resistant to on-disk corruptions. needless BUG_ON() have
been removed. please, review for inclusion.

Signed-off-by: Alex Tomas [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/fs/ext4/extents.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/extents.c 2007-06-11 17:22:15.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/extents.c  2007-06-11 17:27:57.0 -0700
@@ -91,36 +91,6 @@
ix-ei_leaf_hi = cpu_to_le16((unsigned long) ((pb  31)  1)  
0x);
 }
 
-static int ext4_ext_check_header(const char *function, struct inode *inode,
-   struct ext4_extent_header *eh)
-{
-   const char *error_msg = NULL;
-
-   if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
-   error_msg = invalid magic;
-   goto corrupted;
-   }
-   if (unlikely(eh-eh_max == 0)) {
-   error_msg = invalid eh_max;
-   goto corrupted;
-   }
-   if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
-   error_msg = invalid eh_entries;
-   goto corrupted;
-   }
-   return 0;
-
-corrupted:
-   ext4_error(inode-i_sb, function,
-   bad header in inode #%lu: %s - magic %x, 
-   entries %u, max %u, depth %u,
-   inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
-   le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
-   le16_to_cpu(eh-eh_depth));
-
-   return -EIO;
-}
-
 static handle_t *ext4_ext_journal_restart(handle_t *handle, int needed)
 {
int err;
@@ -269,6 +239,70 @@
return size;
 }
 
+static inline int
+ext4_ext_max_entries(struct inode *inode, int depth)
+{
+   int max;
+
+   if (depth == ext_depth(inode)) {
+   if (depth == 0)
+   max = ext4_ext_space_root(inode);
+   else
+   max = ext4_ext_space_root_idx(inode);
+   } else {
+   if (depth == 0)
+   max = ext4_ext_space_block(inode);
+   else
+   max = ext4_ext_space_block_idx(inode);
+   }
+
+   return max;
+}
+
+static int __ext4_ext_check_header(const char *function, struct inode *inode,
+   struct ext4_extent_header *eh,
+   int depth)
+{
+   const char *error_msg = NULL;
+   int max = 0;
+
+   if (unlikely(eh-eh_magic != EXT4_EXT_MAGIC)) {
+   error_msg = invalid magic;
+   goto corrupted;
+   }
+   if (unlikely(le16_to_cpu(eh-eh_depth) != depth)) {
+   error_msg = unexpected eh_depth;
+   goto corrupted;
+   }
+   if (unlikely(eh-eh_max == 0)) {
+   error_msg = invalid eh_max;
+   goto corrupted;
+   }
+   max = ext4_ext_max_entries(inode, depth);
+   if (unlikely(le16_to_cpu(eh-eh_max)  max)) {
+   error_msg = too large eh_max;
+   goto corrupted;
+   }
+   if (unlikely(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max))) {
+   error_msg = invalid eh_entries;
+   goto corrupted;
+   }
+   return 0;
+
+corrupted:
+   ext4_error(inode-i_sb, function,
+   bad header in inode #%lu: %s - magic %x, 
+   entries %u, max %u(%u), depth %u(%u),
+   inode-i_ino, error_msg, le16_to_cpu(eh-eh_magic),
+   le16_to_cpu(eh-eh_entries), le16_to_cpu(eh-eh_max),
+   max, le16_to_cpu(eh-eh_depth), depth);
+
+   return -EIO;
+}
+
+#define ext4_ext_check_header(inode, eh, depth)\
+   __ext4_ext_check_header(__FUNCTION__, inode, eh, depth)
+
 #ifdef EXT_DEBUG
 static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
 {
@@ -329,6 +363,7 @@
 /*
  * ext4_ext_binsearch_idx:
  * binary search for the closest index of the given block
+ * the header must be checked before calling this
  */
 static void
 ext4_ext_binsearch_idx(struct inode *inode, struct ext4_ext_path *path, int 
block)
@@ -336,9 +371,6 @@
struct ext4_extent_header *eh = path-p_hdr;
struct ext4_extent_idx *r, *l, *m;
 
-   BUG_ON(eh-eh_magic != EXT4_EXT_MAGIC);
-   BUG_ON(le16_to_cpu(eh-eh_entries)  le16_to_cpu(eh-eh_max));
-   BUG_ON(le16_to_cpu(eh-eh_entries) = 0);
 
ext_debug(binsearch for %d(idx):  , block);
 
@@ -388,6 +420,7 @@
 /*
  * ext4_ext_binsearch:
  * binary search for closest extent of the given block
+ * the header must be checked before calling this
  */
 static void
 ext4_ext_binsearch(struct inode *inode, struct ext4_ext_path *path, int block)
@@ -395,9 +428,6 @@
struct ext4_extent_header *eh = path-p_hdr;
struct

[EXT4 set 2][PATCH 3/5] cleanups: set_jbd2_64bit_feature for 16TB ext4 fs

2007-07-01 Thread Mingming Cao

Set the journals JBD2_FEATURE_INCOMPAT_64BIT on devices with more
than 32bit block sizes during mount time.  This ensure proper record
lenth when writing to the journal.

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]
Signed-off-by: Laurent Vivier [EMAIL PROTECTED]
---
 fs/ext4/super.c |   11 +++
 1 file changed, 11 insertions(+)

Index: linux-2.6.22-rc4/fs/ext4/super.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/super.c   2007-06-11 16:15:54.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/super.c2007-06-11 16:16:10.0 -0700
@@ -1804,6 +1804,13 @@
goto failed_mount3;
}
 
+   if (ext4_blocks_count(es)  0xULL 
+   !jbd2_journal_set_features(EXT4_SB(sb)-s_journal, 0, 0,
+  JBD2_FEATURE_INCOMPAT_64BIT)) {
+   printk(KERN_ERR ext4: Failed to set 64-bit journal feature\n);
+   goto failed_mount4;
+   }
+
/* We have now updated the journal if required, so we can
 * validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 2][PATCH 4/5] cleanups: Rename CONFIG_JBD_DEBUG to CONFIG_JBD2_DEBUG

2007-07-01 Thread Mingming Cao

When the JBD code was forked to create the new JBD2 code base, the
references to CONFIG_JBD_DEBUG where never changed to
CONFIG_JBD2_DEBUG.  This patch fixes that.

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
---
Index: linux-2.6.22-rc4/fs/jbd2/journal.c
===
--- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:15:49.0 
-0700
+++ linux-2.6.22-rc4/fs/jbd2/journal.c  2007-06-11 16:16:18.0 -0700
@@ -528,7 +528,7 @@
 {
int err = 0;
 
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
spin_lock(journal-j_state_lock);
if (!tid_geq(journal-j_commit_request, tid)) {
printk(KERN_EMERG
@@ -1709,7 +1709,7 @@
  * Journal_head storage management
  */
 static struct kmem_cache *jbd2_journal_head_cache;
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
 static atomic_t nr_journal_heads = ATOMIC_INIT(0);
 #endif
 
@@ -1747,7 +1747,7 @@
struct journal_head *ret;
static unsigned long last_warning;
 
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
atomic_inc(nr_journal_heads);
 #endif
ret = kmem_cache_alloc(jbd2_journal_head_cache, GFP_NOFS);
@@ -1768,7 +1768,7 @@
 
 static void journal_free_journal_head(struct journal_head *jh)
 {
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
atomic_dec(nr_journal_heads);
memset(jh, JBD_POISON_FREE, sizeof(*jh));
 #endif
@@ -1953,12 +1953,12 @@
 /*
  * /proc tunables
  */
-#if defined(CONFIG_JBD_DEBUG)
+#if defined(CONFIG_JBD2_DEBUG)
 int jbd2_journal_enable_debug;
 EXPORT_SYMBOL(jbd2_journal_enable_debug);
 #endif
 
-#if defined(CONFIG_JBD_DEBUG)  defined(CONFIG_PROC_FS)
+#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
 
 static struct proc_dir_entry *proc_jbd_debug;
 
@@ -2073,7 +2073,7 @@
 
 static void __exit journal_exit(void)
 {
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
int n = atomic_read(nr_journal_heads);
if (n)
printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
Index: linux-2.6.22-rc4/fs/jbd2/recovery.c
===
--- linux-2.6.22-rc4.orig/fs/jbd2/recovery.c2007-06-04 17:57:25.0 
-0700
+++ linux-2.6.22-rc4/fs/jbd2/recovery.c 2007-06-11 16:16:18.0 -0700
@@ -295,7 +295,7 @@
printk(KERN_ERR JBD: error %d scanning journal\n, err);
++journal-j_transaction_sequence;
} else {
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
int dropped = info.end_transaction - 
be32_to_cpu(sb-s_sequence);
 #endif
jbd_debug(0,
Index: linux-2.6.22-rc4/include/linux/ext4_fs.h
===
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs.h   2007-06-11 
16:15:59.0 -0700
+++ linux-2.6.22-rc4/include/linux/ext4_fs.h2007-06-11 16:16:18.0 
-0700
@@ -237,7 +237,7 @@
 #define EXT4_IOC_GROUP_ADD _IOW('f', 8,struct ext4_new_group_input)
 #defineEXT4_IOC_GETVERSION_OLD FS_IOC_GETVERSION
 #defineEXT4_IOC_SETVERSION_OLD FS_IOC_SETVERSION
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
 #define EXT4_IOC_WAIT_FOR_READONLY _IOR('f', 99, long)
 #endif
 #define EXT4_IOC_GETRSVSZ  _IOR('f', 5, long)
@@ -253,7 +253,7 @@
 #define EXT4_IOC32_GETRSVSZ_IOR('f', 5, int)
 #define EXT4_IOC32_SETRSVSZ_IOW('f', 6, int)
 #define EXT4_IOC32_GROUP_EXTEND_IOW('f', 7, unsigned int)
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
 #define EXT4_IOC32_WAIT_FOR_READONLY   _IOR('f', 99, int)
 #endif
 #define EXT4_IOC32_GETVERSION_OLD  FS_IOC32_GETVERSION
Index: linux-2.6.22-rc4/include/linux/ext4_fs_sb.h
===
--- linux-2.6.22-rc4.orig/include/linux/ext4_fs_sb.h2007-06-11 
16:15:55.0 -0700
+++ linux-2.6.22-rc4/include/linux/ext4_fs_sb.h 2007-06-11 16:16:18.0 
-0700
@@ -71,7 +71,7 @@
struct list_head s_orphan;
unsigned long s_commit_interval;
struct block_device *journal_bdev;
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
struct timer_list turn_ro_timer;/* For turning read-only (crash 
simulation) */
wait_queue_head_t ro_wait_queue;/* For people waiting for the 
fs to go read-only */
 #endif
Index: linux-2.6.22-rc4/include/linux/jbd2.h
===
--- linux-2.6.22-rc4.orig/include/linux/jbd2.h  2007-06-11 16:15:49.0 
-0700
+++ linux-2.6.22-rc4/include/linux/jbd2.h   2007-06-11 16:16:18.0 
-0700
@@ -50,11 +50,11 @@
  */
 #define JBD_DEFAULT_MAX_COMMIT_AGE 5
 
-#ifdef CONFIG_JBD_DEBUG
+#ifdef CONFIG_JBD2_DEBUG
 /*
  * Define JBD_EXPENSIVE_CHECKING to enable more expensive internal
  * consistency checks.  By default we don't do this unless
- * CONFIG_JBD_DEBUG

[EXT4 set 2][PATCH 5/5] cleanups: Export jbd2-debug via debugfs

2007-07-01 Thread Mingming Cao

 On Jun 07, 2007  23:45 -0500, Jose R. Santos wrote:
  The jbd2-debug file used to be located in /proc/sys/fs/jbd2-debug, but
  create_proc_entry() does not do lookups on file names with more that one
  directory deep.  This causes the entry creation to fail and hence, no proc
  file is created.  This patch moves the file to /proc/jbd2-degug.
  
  The file could be move to /proc/fs/jbd2/jbd2-debug, but it would require
  some minor alterations to the jbd-stats patch.
 
 I don't think we really want to be adding top-level files in /proc.
 What about using the debugfs filesystem (not to be confused with
 the e2fsprogs 'debugfs' command)?

How about this then?  Moved the file to use debugfs as well as having
the nice effect of removing more lines than what it adds.

Signed-off-by: Jose R. Santos [EMAIL PROTECTED]
---
 fs/jbd2/journal.c|   6220 +42 -0 !
 include/linux/jbd2.h |21 + 1 - 0 !
 2 files changed, 21 insertions(+), 43 deletions(-)

Index: linux-2.6.22-rc4/fs/jbd2/journal.c
===
--- linux-2.6.22-rc4.orig/fs/jbd2/journal.c 2007-06-11 16:16:18.0 
-0700
+++ linux-2.6.22-rc4/fs/jbd2/journal.c  2007-06-11 16:36:10.0 -0700
@@ -35,6 +35,7 @@
 #include linux/kthread.h
 #include linux/poison.h
 #include linux/proc_fs.h
+#include linux/debugfs.h
 
 #include asm/uaccess.h
 #include asm/page.h
@@ -1954,60 +1955,37 @@
  * /proc tunables
  */
 #if defined(CONFIG_JBD2_DEBUG)
-int jbd2_journal_enable_debug;
+u16 jbd2_journal_enable_debug;
 EXPORT_SYMBOL(jbd2_journal_enable_debug);
 #endif
 
-#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_PROC_FS)
+#if defined(CONFIG_JBD2_DEBUG)  defined(CONFIG_DEBUG_FS)
 
-static struct proc_dir_entry *proc_jbd_debug;
+#define JBD2_DEBUG_NAME jbd2-debug
 
-static int read_jbd_debug(char *page, char **start, off_t off,
- int count, int *eof, void *data)
-{
-   int ret;
-
-   ret = sprintf(page + off, %d\n, jbd2_journal_enable_debug);
-   *eof = 1;
-   return ret;
-}
+struct dentry *jbd2_debugfs_dir, *jbd2_debug;
 
-static int write_jbd_debug(struct file *file, const char __user *buffer,
-  unsigned long count, void *data)
+static void __init jbd2_create_debugfs_entry(void)
 {
-   char buf[32];
-
-   if (count  ARRAY_SIZE(buf) - 1)
-   count = ARRAY_SIZE(buf) - 1;
-   if (copy_from_user(buf, buffer, count))
-   return -EFAULT;
-   buf[ARRAY_SIZE(buf) - 1] = '\0';
-   jbd2_journal_enable_debug = simple_strtoul(buf, NULL, 10);
-   return count;
-}
-
-#define JBD_PROC_NAME sys/fs/jbd2-debug
-
-static void __init create_jbd_proc_entry(void)
-{
-   proc_jbd_debug = create_proc_entry(JBD_PROC_NAME, 0644, NULL);
-   if (proc_jbd_debug) {
-   /* Why is this so hard? */
-   proc_jbd_debug-read_proc = read_jbd_debug;
-   proc_jbd_debug-write_proc = write_jbd_debug;
-   }
+   jbd2_debugfs_dir = debugfs_create_dir(jbd2, NULL);
+   if (jbd2_debugfs_dir)
+   jbd2_debug = debugfs_create_u16(JBD2_DEBUG_NAME, S_IRUGO,
+   jbd2_debugfs_dir,
+   jbd2_journal_enable_debug);
 }
 
-static void __exit jbd2_remove_jbd_proc_entry(void)
+static void __exit jbd2_remove_debugfs_entry(void)
 {
-   if (proc_jbd_debug)
-   remove_proc_entry(JBD_PROC_NAME, NULL);
+   if (jbd2_debug)
+   debugfs_remove(jbd2_debug);
+   if (jbd2_debugfs_dir)
+   debugfs_remove(jbd2_debugfs_dir);
 }
 
 #else
 
-#define create_jbd_proc_entry() do {} while (0)
-#define jbd2_remove_jbd_proc_entry() do {} while (0)
+#define jbd2_create_debugfs_entry() do {} while (0)
+#define jbd2_remove_debugfs_entry() do {} while (0)
 
 #endif
 
@@ -2067,7 +2045,7 @@
ret = journal_init_caches();
if (ret != 0)
jbd2_journal_destroy_caches();
-   create_jbd_proc_entry();
+   jbd2_create_debugfs_entry();
return ret;
 }
 
@@ -2078,7 +2056,7 @@
if (n)
printk(KERN_EMERG JBD: leaked %d journal_heads!\n, n);
 #endif
-   jbd2_remove_jbd_proc_entry();
+   jbd2_remove_debugfs_entry();
jbd2_journal_destroy_caches();
 }
 
Index: linux-2.6.22-rc4/include/linux/jbd2.h
===
--- linux-2.6.22-rc4.orig/include/linux/jbd2.h  2007-06-11 16:16:18.0 
-0700
+++ linux-2.6.22-rc4/include/linux/jbd2.h   2007-06-11 16:35:25.0 
-0700
@@ -57,7 +57,7 @@
  * CONFIG_JBD2_DEBUG is on.
  */
 #define JBD_EXPENSIVE_CHECKING
-extern int jbd2_journal_enable_debug;
+extern u16 jbd2_journal_enable_debug;
 
 #define jbd_debug(n, f, a...)  \
do {\


-
To unsubscribe from this list: send the line

[EXT4 set 3][PATCH 1/1] ext4 nanosecond timestamp

2007-07-01 Thread Mingming Cao

This patch is a spinoff of the old nanosecond patches.

It includes some cleanups and addition of a creation timestamp. The
EXT3_FEATURE_RO_COMPAT_EXTRA_ISIZE flag has also been added along with
s_{min, want}_extra_isize fields in struct ext3_super_block.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Eric Sandeen [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/fs/ext4/ialloc.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/ialloc.c  2007-06-11 17:22:15.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/ialloc.c   2007-06-11 17:39:05.0 -0700
@@ -563,7 +563,8 @@
inode-i_ino = ino;
/* This is the optimal IO size (for stat), not the fs block size */
inode-i_blocks = 0;
-   inode-i_mtime = inode-i_atime = inode-i_ctime = CURRENT_TIME_SEC;
+   inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime =
+  ext4_current_time(inode);
 
memset(ei-i_data, 0, sizeof(ei-i_data));
ei-i_dir_start_lookup = 0;
@@ -595,9 +596,8 @@
spin_unlock(sbi-s_next_gen_lock);
 
ei-i_state = EXT4_STATE_NEW;
-   ei-i_extra_isize =
-   (EXT4_INODE_SIZE(inode-i_sb)  EXT4_GOOD_OLD_INODE_SIZE) ?
-   sizeof(struct ext4_inode) - EXT4_GOOD_OLD_INODE_SIZE : 0;
+
+   ei-i_extra_isize = EXT4_SB(sb)-s_want_extra_isize;
 
ret = inode;
if(DQUOT_ALLOC_INODE(inode)) {
Index: linux-2.6.22-rc4/fs/ext4/inode.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/inode.c   2007-06-11 17:24:28.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/inode.c2007-06-11 17:39:05.0 -0700
@@ -726,7 +726,7 @@
 
/* We are done with atomic stuff, now do the rest of housekeeping */
 
-   inode-i_ctime = CURRENT_TIME_SEC;
+   inode-i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
 
/* had we spliced it onto indirect block? */
@@ -2375,7 +2375,7 @@
ext4_discard_reservation(inode);
 
mutex_unlock(ei-truncate_mutex);
-   inode-i_mtime = inode-i_ctime = CURRENT_TIME_SEC;
+   inode-i_mtime = inode-i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
 
/*
@@ -2629,10 +2629,6 @@
}
inode-i_nlink = le16_to_cpu(raw_inode-i_links_count);
inode-i_size = le32_to_cpu(raw_inode-i_size);
-   inode-i_atime.tv_sec = (signed)le32_to_cpu(raw_inode-i_atime);
-   inode-i_ctime.tv_sec = (signed)le32_to_cpu(raw_inode-i_ctime);
-   inode-i_mtime.tv_sec = (signed)le32_to_cpu(raw_inode-i_mtime);
-   inode-i_atime.tv_nsec = inode-i_ctime.tv_nsec = 
inode-i_mtime.tv_nsec = 0;
 
ei-i_state = 0;
ei-i_dir_start_lookup = 0;
@@ -2708,6 +2704,11 @@
} else
ei-i_extra_isize = 0;
 
+   EXT4_INODE_GET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_GET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_GET_XTIME(i_atime, inode, raw_inode);
+   EXT4_EINODE_GET_XTIME(i_crtime, ei, raw_inode);
+
if (S_ISREG(inode-i_mode)) {
inode-i_op = ext4_file_inode_operations;
inode-i_fop = ext4_file_operations;
@@ -2789,9 +2790,12 @@
}
raw_inode-i_links_count = cpu_to_le16(inode-i_nlink);
raw_inode-i_size = cpu_to_le32(ei-i_disksize);
-   raw_inode-i_atime = cpu_to_le32(inode-i_atime.tv_sec);
-   raw_inode-i_ctime = cpu_to_le32(inode-i_ctime.tv_sec);
-   raw_inode-i_mtime = cpu_to_le32(inode-i_mtime.tv_sec);
+
+   EXT4_INODE_SET_XTIME(i_ctime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_mtime, inode, raw_inode);
+   EXT4_INODE_SET_XTIME(i_atime, inode, raw_inode);
+   EXT4_EINODE_SET_XTIME(i_crtime, ei, raw_inode);
+
raw_inode-i_blocks = cpu_to_le32(inode-i_blocks);
raw_inode-i_dtime = cpu_to_le32(ei-i_dtime);
raw_inode-i_flags = cpu_to_le32(ei-i_flags);
Index: linux-2.6.22-rc4/fs/ext4/ioctl.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/ioctl.c   2007-06-11 17:25:11.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/ioctl.c2007-06-11 17:39:05.0 -0700
@@ -97,7 +97,7 @@
ei-i_flags = flags;
 
ext4_set_inode_flags(inode);
-   inode-i_ctime = CURRENT_TIME_SEC;
+   inode-i_ctime = ext4_current_time(inode);
 
err = ext4_mark_iloc_dirty(handle, inode, iloc);
 flags_err:
@@ -134,7 +134,7 @@
return PTR_ERR(handle);
err = ext4_reserve_inode_write(handle, inode, iloc);
if (err == 0) {
-   inode-i_ctime = CURRENT_TIME_SEC;
+   inode-i_ctime

[EXT4 set 4][PATCH 1/5] i_version:64 bit inode version

2007-07-01 Thread Mingming Cao

This patch converts the 32-bit i_version in the generic inode to a 64-bit
i_version field.

Signed-off-by: Mingming Cao [EMAIL PROTECTED]
Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]

Index: linux-2.6.21/include/linux/fs.h
===
--- linux-2.6.21.orig/include/linux/fs.h
+++ linux-2.6.21/include/linux/fs.h
@@ -549,7 +549,7 @@ struct inode {
uid_t   i_uid;
gid_t   i_gid;
dev_t   i_rdev;
-   unsigned long   i_version;
+   u64 i_version;
loff_t  i_size;
 #ifdef __NEED_I_SIZE_ORDERED
seqcount_t  i_size_seqcount;


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 5][PATCH 1/1] expand inode i_extra_isize to support features in larger inode

2007-07-01 Thread Mingming Cao

This patch is on top of the nanosecond timestamp and i_version_hi
patches. 

We need to make sure that existing filesystems can also avail the new
fields that have been added to the inode. We use s_want_extra_isize and
s_min_extra_isize to decide by how much we should expand the inode. If
EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE feature is set then we expand by
max(s_want_extra_isize, s_min_extra_isize , sizeof(ext4_inode) -
EXT4_GOOD_OLD_INODE_SIZE) bytes. Actually it is still an open question
about whether users should be able to set s_*_extra_isize smaller than
the known fields or not. 

This patch also adds the functionality to expand inodes to include the
newly added fields. We start by trying to expand by s_want_extra_isize
bytes and if its fails we try to expand by s_min_extra_isize bytes. This
is done by changing the i_extra_isize if enough space is available in
the inode and no EAs are present. If EAs are present and there is enough
space in the inode then the EAs in the inode are shifted to make space.
If enough space is not available in the inode due to the EAs then 1 or
more EAs are shifted to the external EA block. In the worst case when
even the external EA block does not have enough space we inform the user
that some EA would need to be deleted or s_min_extra_isize would have to
be reduced.

This would be online expansion of inodes. I am also working on adding an
expand_inodes option to e2fsck which will expand all the used inodes.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Kalpak Shah [EMAIL PROTECTED]
Signed-off-by: Mingming Cao [EMAIL PROTECTED]

Index: linux-2.6.22-rc4/fs/ext4/inode.c
===
--- linux-2.6.22-rc4.orig/fs/ext4/inode.c   2007-06-14 17:32:27.0 
-0700
+++ linux-2.6.22-rc4/fs/ext4/inode.c2007-06-14 17:32:41.0 -0700
@@ -3120,6 +3120,40 @@ ext4_reserve_inode_write(handle_t *handl
 }
 
 /*
+ * Expand an inode by new_extra_isize bytes.
+ * Returns 0 on success or negative error number on failure.
+ */
+int ext4_expand_extra_isize(struct inode *inode, unsigned int new_extra_isize,
+   struct ext4_iloc iloc, handle_t *handle)
+{
+   struct ext4_inode *raw_inode;
+   struct ext4_xattr_ibody_header *header;
+   struct ext4_xattr_entry *entry;
+
+   if (EXT4_I(inode)-i_extra_isize = new_extra_isize) {
+   return 0;
+   }
+
+   raw_inode = ext4_raw_inode(iloc);
+
+   header = IHDR(inode, raw_inode);
+   entry = IFIRST(header);
+
+   /* No extended attributes present */
+   if (!(EXT4_I(inode)-i_state  EXT4_STATE_XATTR) ||
+   header-h_magic != cpu_to_le32(EXT4_XATTR_MAGIC)) {
+   memset((void *)raw_inode + EXT4_GOOD_OLD_INODE_SIZE, 0,
+   new_extra_isize);
+   EXT4_I(inode)-i_extra_isize = new_extra_isize;
+   return 0;
+   }
+
+   /* try to expand with EA present */
+   return ext4_expand_extra_isize_ea(inode, new_extra_isize,
+   raw_inode, handle);
+}
+
+/*
  * What we do here is to mark the in-core inode as clean with respect to inode
  * dirtiness (it may still be data-dirty).
  * This means that the in-core inode may be reaped by prune_icache
@@ -3143,10 +3177,33 @@ ext4_reserve_inode_write(handle_t *handl
 int ext4_mark_inode_dirty(handle_t *handle, struct inode *inode)
 {
struct ext4_iloc iloc;
-   int err;
+   int err, ret;
+   static int expand_message;
 
might_sleep();
err = ext4_reserve_inode_write(handle, inode, iloc);
+   if (EXT4_I(inode)-i_extra_isize 
+   EXT4_SB(inode-i_sb)-s_want_extra_isize 
+   !(EXT4_I(inode)-i_state  EXT4_STATE_NO_EXPAND)) {
+   /* We need extra buffer credits since we may write into EA block
+* with this same handle */
+   if ((jbd2_journal_extend(handle,
+EXT4_DATA_TRANS_BLOCKS(inode-i_sb))) == 0) {
+   ret = ext4_expand_extra_isize(inode,
+   EXT4_SB(inode-i_sb)-s_want_extra_isize,
+   iloc, handle);
+   if (ret) {
+   EXT4_I(inode)-i_state |= EXT4_STATE_NO_EXPAND;
+   if (!expand_message) {
+   ext4_warning(inode-i_sb, __FUNCTION__,
+   Unable to expand inode %lu. Delete
+some EAs or run e2fsck.,
+   inode-i_ino);
+   expand_message = 1;
+   }
+   }
+   }
+   }
if (!err)
err = ext4_mark_iloc_dirty(handle, inode, iloc);
return err;
Index: linux-2.6.22-rc4/include/linux/ext4_fs.h

[EXT4 set 8][PATCH 1/1]Add journal checksums

2007-07-01 Thread Mingming Cao

Journal checksum feature has been added to detect corruption of journal.

Signed-off-by: Andreas Dilger [EMAIL PROTECTED]
Signed-off-by: Girish Shilamkar [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]

diff -Nurp linux024/fs/ext4/super.c linux/fs/ext4/super.c
--- linux024/fs/ext4/super.c2007-06-25 16:19:24.0 -0500
+++ linux/fs/ext4/super.c   2007-06-26 08:35:16.0 -0500
@@ -721,6 +721,7 @@ enum {
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_bh,
Opt_commit, Opt_journal_update, Opt_journal_inum, Opt_journal_dev,
+   Opt_journal_checksum, Opt_journal_async_commit,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
Opt_jqfmt_vfsold, Opt_jqfmt_vfsv0, Opt_quota, Opt_noquota,
@@ -760,6 +761,8 @@ static match_table_t tokens = {
{Opt_journal_update, journal=update},
{Opt_journal_inum, journal=%u},
{Opt_journal_dev, journal_dev=%u},
+   {Opt_journal_checksum, journal_checksum},
+   {Opt_journal_async_commit, journal_async_commit},
{Opt_abort, abort},
{Opt_data_journal, data=journal},
{Opt_data_ordered, data=ordered},
@@ -948,6 +951,13 @@ static int parse_options (char *options,
return 0;
*journal_devnum = option;
break;
+   case Opt_journal_checksum:
+   set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
+   break;
+   case Opt_journal_async_commit:
+   set_opt (sbi-s_mount_opt, JOURNAL_ASYNC_COMMIT);
+   set_opt (sbi-s_mount_opt, JOURNAL_CHECKSUM);
+   break;
case Opt_noload:
set_opt (sbi-s_mount_opt, NOLOAD);
break;
@@ -1817,6 +1827,21 @@ static int ext4_fill_super (struct super
goto failed_mount4;
}
 
+   if (test_opt(sb, JOURNAL_ASYNC_COMMIT)) {
+   jbd2_journal_set_features(sbi-s_journal,
+   JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+   } else if (test_opt(sb, JOURNAL_CHECKSUM)) {
+   jbd2_journal_set_features(sbi-s_journal,
+   JBD2_FEATURE_COMPAT_CHECKSUM, 0, 0);
+   jbd2_journal_clear_features(sbi-s_journal, 0, 0,
+   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+   } else {
+   jbd2_journal_clear_features(sbi-s_journal,
+   JBD2_FEATURE_COMPAT_CHECKSUM, 0,
+   JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT);
+   }
+
/* We have now updated the journal if required, so we can
 * validate the data journaling mode. */
switch (test_opt(sb, DATA_FLAGS)) {
diff -Nurp linux024/fs/jbd2/commit.c linux/fs/jbd2/commit.c
--- linux024/fs/jbd2/commit.c   2007-06-25 16:19:25.0 -0500
+++ linux/fs/jbd2/commit.c  2007-06-26 08:40:03.0 -0500
@@ -21,6 +21,7 @@
 #include linux/mm.h
 #include linux/pagemap.h
 #include linux/jiffies.h
+#include linux/crc32.h
 
 /*
  * Default IO end handler for temporary BJ_IO buffer_heads.
@@ -93,15 +94,18 @@ static int inverted_lock(journal_t *jour
return 1;
 }
 
-/* Done it all: now write the commit record.  We should have
+/*
+ * Done it all: now submit the commit record.  We should have
  * cleaned up our previous buffers by now, so if we are in abort
  * mode we can now just skip the rest of the journal write
  * entirely.
  *
  * Returns 1 if the journal needs to be aborted or 0 on success
  */
-static int journal_write_commit_record(journal_t *journal,
-   transaction_t *commit_transaction)
+static int journal_submit_commit_record(journal_t *journal,
+   transaction_t *commit_transaction,
+   struct buffer_head **cbh,
+   __u32 crc32_sum)
 {
struct journal_head *descriptor;
struct buffer_head *bh;
@@ -117,21 +121,36 @@ static int journal_write_commit_record(j
 
bh = jh2bh(descriptor);
 
-   /* AKPM: buglet - add `i' to tmp! */
for (i = 0; i  bh-b_size; i += 512) {
-   journal_header_t *tmp = (journal_header_t*)bh-b_data;
+   struct commit_header *tmp =
+   (struct commit_header *)(bh-b_data + i);
tmp-h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
tmp-h_blocktype = cpu_to_be32(JBD2_COMMIT_BLOCK);
tmp-h_sequence = cpu_to_be32(commit_transaction-t_tid);
+
+   if (JBD2_HAS_COMPAT_FEATURE(journal,
+

[EXT4 set 9][PATCH 1/5]Morecleanups:ext4-zero_user_page

2007-07-01 Thread Mingming Cao

Use zero_user_page() in ext4 where possible.


Signed-off-by: Eric Sandeen [EMAIL PROTECTED]

Index: linux-2.6.22-rc4-mm2/fs/ext4/inode.c
===
--- linux-2.6.22-rc4-mm2.orig/fs/ext4/inode.c
+++ linux-2.6.22-rc4-mm2/fs/ext4/inode.c
@@ -1830,7 +1830,6 @@ int ext4_block_truncate_page(handle_t *h
struct inode *inode = mapping-host;
struct buffer_head *bh;
int err = 0;
-   void *kaddr;

if ((EXT4_I(inode)-i_flags  EXT4_EXTENTS_FL) 
test_opt(inode-i_sb, EXTENTS) 
@@ -1847,10 +1846,7 @@ int ext4_block_truncate_page(handle_t *h
 */
if (!page_has_buffers(page)  test_opt(inode-i_sb, NOBH) 
 ext4_should_writeback_data(inode)  PageUptodate(page)) {
-   kaddr = kmap_atomic(page, KM_USER0);
-   memset(kaddr + offset, 0, length);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
+   zero_user_page(page, offset, length, KM_USER0);
set_page_dirty(page);
goto unlock;
}
@@ -1903,10 +1899,7 @@ int ext4_block_truncate_page(handle_t *h
goto unlock;
}

-   kaddr = kmap_atomic(page, KM_USER0);
-   memset(kaddr + offset, 0, length);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
+   zero_user_page(page, offset, length, KM_USER0);

BUFFER_TRACE(bh, zeroed end of block);



-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 9][PATCH 3/5]Morecleanups:ext4-remove-extra-is_rdonly-check

2007-07-01 Thread Mingming Cao

Subject: ext4: remove extra IS_RDONLY() check
From: Dave Hansen [EMAIL PROTECTED]

ext4_change_inode_journal_flag() is only called from one location:
ext4_ioctl(EXT3_IOC_SETFLAGS).  That ioctl case already has a IS_RDONLY()
call in it so this one is superfluous.

Signed-off-by: Dave Hansen [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Signed-off-by: Andrew Morton [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---

 fs/ext4/inode.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN fs/ext4/inode.c~ext4-remove-extra-is_rdonly-check fs/ext4/inode.c
--- a/fs/ext4/inode.c~ext4-remove-extra-is_rdonly-check
+++ a/fs/ext4/inode.c
@@ -3352,7 +3352,7 @@ int ext4_change_inode_journal_flag(struc
 */

journal = EXT4_JOURNAL(inode);
-   if (is_journal_aborted(journal) || IS_RDONLY(inode))
+   if (is_journal_aborted(journal))
return -EROFS;

jbd2_journal_lock_updates(journal);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 9][PATCH 4/5]Morecleanups:ext4_extent_compilation_fixes

2007-07-01 Thread Mingming Cao

From: Dmitry Monakhov [EMAIL PROTECTED]
Subject: ext4: extent compilation fixes

Fix compilation with EXT_DEBUG, also fix leXX_to_cpu convertions.

Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
Acked-by: Alex Tomas [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---
 fs/ext4/extents.c |   18 ++
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6f72dcb..12fe3d7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -382,13 +382,14 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
ext4_ext_path *path, int bloc
r = m - 1;
else
l = m + 1;
-   ext_debug(%p(%u):%p(%u):%p(%u) , l, l-ei_block,
-   m, m-ei_block, r, r-ei_block);
+   ext_debug(%p(%u):%p(%u):%p(%u) , l, le32_to_cpu(l-ei_block),
+   m, le32_to_cpu(m-ei_block),
+   r, le32_to_cpu(r-ei_block));
}

path-p_idx = l - 1;
ext_debug(  - %d-%lld , le32_to_cpu(path-p_idx-ei_block),
- idx_block(path-p_idx));
+ idx_pblock(path-p_idx));

 #ifdef CHECK_BINSEARCH
{
@@ -447,8 +448,9 @@ ext4_ext_binsearch(struct inode *inode, struct 
ext4_ext_path *path, int block)
r = m - 1;
else
l = m + 1;
-   ext_debug(%p(%u):%p(%u):%p(%u) , l, l-ee_block,
-   m, m-ee_block, r, r-ee_block);
+   ext_debug(%p(%u):%p(%u):%p(%u) , l, le32_to_cpu(l-ee_block),
+   m, le32_to_cpu(m-ee_block),
+   r, le32_to_cpu(r-ee_block));
}

path-p_ext = l - 1;
@@ -580,7 +582,7 @@ static int ext4_ext_insert_index(handle_t *handle, struct 
inode *inode,
if (curp-p_idx != EXT_LAST_INDEX(curp-p_hdr)) {
len = (len - 1) * sizeof(struct ext4_extent_idx);
len = len  0 ? 0 : len;
-   ext_debug(insert new index %d after: %d. 
+   ext_debug(insert new index %d after: %llu. 
move %d from 0x%p to 0x%p\n,
logical, ptr, len,
(curp-p_idx + 1), (curp-p_idx + 2));
@@ -591,7 +593,7 @@ static int ext4_ext_insert_index(handle_t *handle, struct 
inode *inode,
/* insert before */
len = len * sizeof(struct ext4_extent_idx);
len = len  0 ? 0 : len;
-   ext_debug(insert new index %d before: %d. 
+   ext_debug(insert new index %d before: %llu. 
move %d from 0x%p to 0x%p\n,
logical, ptr, len,
curp-p_idx, (curp-p_idx + 1));
@@ -791,7 +793,7 @@ static int ext4_ext_split(handle_t *handle, struct inode 
*inode,
BUG_ON(EXT_MAX_INDEX(path[i].p_hdr) !=
EXT_LAST_INDEX(path[i].p_hdr));
while (path[i].p_idx = EXT_MAX_INDEX(path[i].p_hdr)) {
-   ext_debug(%d: move %d:%d in new index %llu\n, i,
+   ext_debug(%d: move %d:%llu in new index %llu\n, i,
le32_to_cpu(path[i].p_idx-ei_block),
idx_pblock(path[i].p_idx),
newblock);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[EXT4 set 9][PATCH 5/5]Extent micro cleanups

2007-07-01 Thread Mingming Cao

From: Dmitry Monakhov [EMAIL PROTECTED]
Subject: ext4: extent macros cleanup

- Replace math equation to it's macro equivalent
- make ext4_ext_grow_indepth() indexes/leaf correct

Signed-off-by: Dmitry Monakhov [EMAIL PROTECTED]
Acked-by: Alex Tomas [EMAIL PROTECTED]
Signed-off-by: Dave Kleikamp [EMAIL PROTECTED]
---
 fs/ext4/extents.c |   11 +++
 1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 12fe3d7..1fd00ac 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -375,7 +375,7 @@ ext4_ext_binsearch_idx(struct inode *inode, struct 
ext4_ext_path *path, int bloc
ext_debug(binsearch for %d(idx):  , block);

l = EXT_FIRST_INDEX(eh) + 1;
-   r = EXT_FIRST_INDEX(eh) + le16_to_cpu(eh-eh_entries) - 1;
+   r = EXT_LAST_INDEX(eh);
while (l = r) {
m = l + (r - l) / 2;
if (block  le32_to_cpu(m-ei_block))
@@ -440,7 +440,7 @@ ext4_ext_binsearch(struct inode *inode, struct 
ext4_ext_path *path, int block)
ext_debug(binsearch for %d:  , block);

l = EXT_FIRST_EXTENT(eh) + 1;
-   r = EXT_FIRST_EXTENT(eh) + le16_to_cpu(eh-eh_entries) - 1;
+   r = EXT_LAST_EXTENT(eh);

while (l = r) {
m = l + (r - l) / 2;
@@ -922,8 +922,11 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct 
inode *inode,
curp-p_hdr-eh_max = cpu_to_le16(ext4_ext_space_root_idx(inode));
curp-p_hdr-eh_entries = cpu_to_le16(1);
curp-p_idx = EXT_FIRST_INDEX(curp-p_hdr);
-   /* FIXME: it works, but actually path[0] can be index */
-   curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
+   
+   if (path[0].p_hdr-eh_depth)
+ curp-p_idx-ei_block = EXT_FIRST_INDEX(path[0].p_hdr)-ei_block;
+   else
+ curp-p_idx-ei_block = EXT_FIRST_EXTENT(path[0].p_hdr)-ee_block;
ext4_idx_store_pblock(curp-p_idx, newblock);

neh = ext_inode_hdr(inode);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/6][TAKE5] fallocate system call

2007-06-28 Thread Mingming Cao

On Thu, 2007-06-28 at 02:55 -0700, Andrew Morton wrote:

 Please drop the non-ext4 patches from the ext4 tree and send incremental
 patches against the (non-ext4) fallocate patches in -mm.
 
The ext4 fallocate() patches are dependent on the core fallocate()
patches, so ext4 patch-queue and git tree won't compile (it's not based
on mm tree) without the core changes.

We can send ext4 fallocate patches (incremental patches against mm tree)
and drop the full fallocate patches(ext4 and non ext4 part) from ext4
patch queue if you prefer this way.

 And try to get the code finished?  Time is pressing.
 
I looked at the mm tree, there are other ext4 features/changes that are
currently in ext4-patch-queue(not ext4 git tree) that not in part of
ext4 series yet. Ted, can you merge those patches to your git tree?
Thanks!


Thanks for your patience.

Mingming.

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/2] i_version update

2007-05-31 Thread Mingming Cao

On Thu, 2007-05-31 at 10:33 +1000, David Chinner wrote:
 On Wed, May 30, 2007 at 04:32:57PM -0700, Mingming Cao wrote:
  On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
   On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
Hi,

This is an update of the i_version patch.
The i_version field is a 64bit counter that is set on every inode
creation and that is incremented every time the inode data is modified
(similarly to the ctime time-stamp).
   
   My understanding (please correct me if I'm wrong) is that the
   requirements are much more rigourous than simply incrementing an in
   memory counter on every change.  i.e. the this counter has to
   survive server crashes intact so clients never see the counter go
   backwards. That means version number changes need to be journalled
   along with the operation that caused the change of the version
   number.
   
  Yeah, the i_version is the in memeory counter. From the patch it looks
  like the counter is being updated inside ext4_mark_iloc_dirty(), so it
  is being journalled and being flush to on-disk ext4 inode structure
  immediately (On-disk ext4 inode structure is being modified/expanded to
  store the counter in the first patch). 
 
 Ok, that catches most things (I missed that), but the version number still
 needs to change on file data changes, right? So if we are overwriting the
 file, we're calling __mark_inode_dirty(I_DIRTY_PAGES) which means you don't
 get the callout and so the version number doesn't change or get logged. In
 that case, the version number is not doing what it needs to do, right?
 

Hmm, maybe I missed something... but looking at the code again, in the
case of overwrite (file date updated),it seems the ctime/mtime is being
updated and the inode is being dirtied, so the version number is being
updated.

 vfs_write()-..
-__generic_file_aio_write_nolock()
-file_update_time()
-mark_inode_dirty_sync()
-__mark_inode_dirty(I_DIRTY_SYNC)
-ext4_dirty_inode()
-ext4_mark_inode_dirty()

Regards,
Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 0/2] i_version update

2007-05-30 Thread Mingming Cao

On Wed, 2007-05-30 at 10:21 +1000, David Chinner wrote:
 On Fri, May 25, 2007 at 06:25:31PM +0200, Jean noel Cordenner wrote:
  Hi,
  
  This is an update of the i_version patch.
  The i_version field is a 64bit counter that is set on every inode
  creation and that is incremented every time the inode data is modified
  (similarly to the ctime time-stamp).
 
 My understanding (please correct me if I'm wrong) is that the
 requirements are much more rigourous than simply incrementing an in
 memory counter on every change.  i.e. the this counter has to
 survive server crashes intact so clients never see the counter go
 backwards. That means version number changes need to be journalled
 along with the operation that caused the change of the version
 number.
 
Yeah, the i_version is the in memeory counter. From the patch it looks
like the counter is being updated inside ext4_mark_iloc_dirty(), so it
is being journalled and being flush to on-disk ext4 inode structure
immediately (On-disk ext4 inode structure is being modified/expanded to
store the counter in the first patch). 

This patch is on top of i_version_update_vfs.
 The i_version field of the inode is set on inode creation and incremented when
 the inode is being modified.
 
 Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
 
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c  2007-05-25 18:05:40.0 
 +0200
 @@ -565,6 +565,7 @@
   inode-i_blocks = 0;
   inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime =
  ext4_current_time(inode);
 + inode-i_version = 1;
 
   memset(ei-i_data, 0, sizeof(ei-i_data));
   ei-i_dir_start_lookup = 0;
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c  2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c   2007-05-25 18:05:40.0 
 +0200
 @@ -3082,6 +3082,7 @@
  {
   int err = 0;
 
 + inode-i_version++;
   /* the do_update_inode consumes one bh-b_count */
   get_bh(iloc-bh);
 
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c  2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c   2007-05-25 18:05:40.0 
 +0200
 @@ -2839,8 +2839,8 @@
   i_size_write(inode, off+len-towrite);
   EXT4_I(inode)-i_disksize = inode-i_size;
   }
 - inode-i_version++;
   inode-i_mtime = inode-i_ctime = CURRENT_TIME;
 + inode-i_version = 1;
   ext4_mark_inode_dirty(handle, inode);
   mutex_unlock(inode-i_mutex);
   return len - towrite;

I am not very clear about this part -- what is i_version being used for
with quota? And shall we reset the counter to 1 for ext4_quota_write()? 

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/2] i_version update - ext4 part

2007-05-30 Thread Mingming Cao

On Tue, 2007-05-29 at 16:58 -0600, Andreas Dilger wrote:
 On May 29, 2007  12:44 -0700, Mingming Cao wrote:
  I am a little bit confused about the two patches. 
  
  It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a
  new 64 bit i_fs_version field is added to ext4 inode structure for inode
  versioning support. read/store of this counter are properly handled, but
  missing the inode versioning counter update.
 
 For the Lustre use of the inode version we don't care about the VFS changes
 to i_version.  In fact - we want to be able to control the changes to
 inode version ourselves so that e.g. file defragmenting or atime updates
 don't change the inode version, and that recovery can restore the version
 to a known state along with the rest of the metadata.
 
It's a pity that VFS i_version can't serve Lustre need completely. :(

If the unnecessary inode version update is a concern, then, with current
implementation (i_version being updated in ext4_mark_inode_dirty()-
ext4_mark_iloc_dirty()), the i_version can be increased multiple times
during one write() operation (unlike ctime update). I know doing the
update in ext4_mark_inode_dirty() (rather than update inode version on
every mtime/ctime update) clearly simplified the code changes. So I am
not sure if the increased update is a concern or not.

In any case, does the VFS inode version get update when atime updates?

 That said, since Lustre isn't in the kernel and we patch our version of
 ext3 anyways it doesn't really matter what is done for NFS.  We will just
 patch in our own behaviour if the final ext4 code isn't suitable in all
 of the details.  Having 99% of the code the same at least makes this a
 lot less work.

  But later in the second patch by Jean Noel, he re-used the VFS inode-
  i_version for ext4 inode versioning, the counter is being updated every
  time the file is being changed. 
 
 I don't know what the NFS requirements for the version are.  There may
 also be some complaints from others if the i_version is 64 bits because
 this contributes to generic inode growth and isn't used for other
 filesystems.
 
That should benefit for other filesystems, as I thought this NFS
requirements apply to all filesystems. 

  To me, i_fs_version and inode_version are the same thing, right?
  Shouldn't we choose one(I assume inode i_version?), and combine these
  two patch together? How about split the inode versioning part from the
  ext4_expand_inode_extra_isize patch(it does multiple things, and
  i_versioning doesn't longs there) and put it together with the rest of
  i_version update patches?
 
 I don't have an objection to that, but I don't think it is required.
 
  BTW, how could NFS/user space to access the inode version counter?
 
 If the Bull patch uses i_version then knfsd can just access it directly.
 I don't think there is any API to access it from userspace.  One option
 is to add a virtual EA like user.inode_version and have the kernel fill
 this in from i_version.
 
 Lustre will manipulate the ei-i_fs_version directly.
 
 Cheers, Andreas
 --
 Andreas Dilger
 Principal Software Engineer
 Cluster File Systems, Inc.
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [patch 2/2] i_version update - ext4 part

2007-05-29 Thread Mingming Cao

On Fri, 2007-05-25 at 18:25 +0200, Jean noel Cordenner wrote:
 The patch is on top of the ext4 tree:
 http://repo.or.cz/w/ext4-patch-queue.git
 
 In this part, the i_version counter is stored into 2 32bit fields of
 the ext4_inode structure osd1.linux1.l_i_version and i_version_hi.
 
 I included the ext4_expand_inode_extra_isize patch, which does part of 
 the job, checking if there is enough room for extra fields in the inode 
 (i_version_hi). The other patch increments the counter on inode 
 modifications and set it on inode creation.
 plain text document attachment (i_version_update_ext4)
 This patch is on top of i_version_update_vfs.
 The i_version field of the inode is set on inode creation and incremented when
 the inode is being modified.
 

I am a little bit confused about the two patches. 

It appears in the ext4_expand_inode_extra_isize patch by Kalpak, there a
new 64 bit i_fs_version field is added to ext4 inode structure for inode
versioning support. read/store of this counter are properly handled, but
missing the inode versioning counter update.

But later in the second patch by Jean Noel, he re-used the VFS inode-
i_version for ext4 inode versioning, the counter is being updated every
time the file is being changed. 

To me, i_fs_version and inode_version are the same thing, right?
Shouldn't we choose one(I assume inode i_version?), and combine these
two patch together? How about split the inode versioning part from the
ext4_expand_inode_extra_isize patch(it does multiple things, and
i_versioning doesn't longs there) and put it together with the rest of
i_version update patches?


BTW, how could NFS/user space to access the inode version counter?

Thanks,
Mingming


 Signed-off-by: Jean Noel Cordenner [EMAIL PROTECTED]
 
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/ialloc.c 2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/ialloc.c  2007-05-25 18:05:40.0 
 +0200
 @@ -565,6 +565,7 @@
   inode-i_blocks = 0;
   inode-i_mtime = inode-i_atime = inode-i_ctime = ei-i_crtime =
  ext4_current_time(inode);
 + inode-i_version = 1;
 
   memset(ei-i_data, 0, sizeof(ei-i_data));
   ei-i_dir_start_lookup = 0;
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/inode.c  2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/inode.c   2007-05-25 18:05:40.0 
 +0200
 @@ -3082,6 +3082,7 @@
  {
   int err = 0;
 
 + inode-i_version++;
   /* the do_update_inode consumes one bh-b_count */
   get_bh(iloc-bh);
 
 Index: linux-2.6.22-rc2-ext4-1/fs/ext4/super.c
 ===
 --- linux-2.6.22-rc2-ext4-1.orig/fs/ext4/super.c  2007-05-25 
 18:05:28.0 +0200
 +++ linux-2.6.22-rc2-ext4-1/fs/ext4/super.c   2007-05-25 18:05:40.0 
 +0200
 @@ -2839,8 +2839,8 @@
   i_size_write(inode, off+len-towrite);
   EXT4_I(inode)-i_disksize = inode-i_size;
   }
 - inode-i_version++;
   inode-i_mtime = inode-i_ctime = CURRENT_TIME;
 + inode-i_version = 1;
   ext4_mark_inode_dirty(handle, inode);
   mutex_unlock(inode-i_mutex);
   return len - towrite;

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/6][TAKE4] fallocate system call

2007-05-20 Thread Mingming Cao

On Fri, 2007-05-18 at 23:44 -0700, Andrew Morton wrote:
 On Thu, 17 May 2007 19:41:15 +0530 Amit K. Arora [EMAIL PROTECTED] wrote:
 
  fallocate() is a new system call being proposed here which will allow
  applications to preallocate space to any file(s) in a file system.
 
 I merged the first three patches into -mm, thanks.
 
 All the system call numbers got changed due to recent additions.  They
 may change in the future, too - nothing is stable until the code lands
 in mainline.
 
In case you haven't realize it, the ia64 fallocate() patch comes with
Amit's takes 4 fallocate patch series (3/6) missing one line change,
thus fail to compile on ia64.

Here is the updated one. Patch tested on ia64. (compile and fsx)

fallocate() on ia64

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-18 
16:30:16.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-18 16:32:45.0 
-0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate
 
.org sys_call_table + 8*NR_syscalls // guard against failures to 
increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-18 
16:30:16.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-18 17:34:58.0 
-0700
@@ -296,11 +296,12 @@
 #define __NR_getcpu1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307
 
 #ifdef __KERNEL__
 
 
-#define NR_syscalls283 /* length of syscall table */
+#define NR_syscalls285 /* length of syscall table */
 
 #define __ARCH_WANT_SYS_RT_SIGACTION
 #define __ARCH_WANT_SYS_RT_SIGSUSPEND


 I didn't merge any of the ext4 changes as they appear to be in Ted's
 devel tree.  Although I didn't check that they are 100% the same in 
 that tree.
 
Since both Amit and Ted are traveling, I will jump in...

Most likely it's not the same one. What in Ted's devel tree is takes 2
patches.

I have incorporated takes 4 patches in the backing ext4 patch git tree
here:
http://repo.or.cz/w/ext4-patch-queue.git

I have tested these patch series on ia64,ppc64,x86 and x86_64. I am not
sure if Ted got a chance to update his ext4 git tree from this patch
queue git tree yet. 

 What's the plan to get some ext4 updates into mainline, btw?  Things
 seem to be rather gradual.


Last time Ted and I discussed we all agree fallocate patches should go
into mainline. Actually most patches marked before the unstable
patches can get into mainline, especially the following patches
(contains a few bug fixes patches)

# New patch to fix whitespace before applying new patches
whitespace.patch
 
#New patch to remove unnecessary exported symbols
ext4_remove_exported_symbles.patch
 
# New patch to add mount option to turn off extents
ext4_noextent_mount_opt.patch
# Now Turn on extents feature by default
ext4_extents_on_by_default.patch
 
#New patch to propagate inode flags
ext4-propagate_flags.patch
 
#New patch to add extent sanity checks
ext4-extent-sanity-checks.patch
 
#New patch to free blocks when failed to insert an extent
ext4-free-blocks-on-insert-extent-failure.patch

We already missed rc-1 window, but if possible, I would like to see ext4
fallocate patches and above patches in mainline 2.6.22. The nanosecond
timestamp patch is probably good to go also.

Regards,
Mingming
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/5][TAKE3] fallocate system call

2007-05-15 Thread Mingming Cao

On Wed, 2007-05-16 at 01:07 +0530, Amit K. Arora wrote:

 ToDos:
 -
 1 Implementation on other architectures (other than i386, x86_64,
 ppc64 and s390(x)). David Chinner has already posted a patch for ia64.

Here is the 2.6.22-rc1 version of David's patch: add fallocate() on ia64

From: David Chinner [EMAIL PROTECTED]
Subject: [PATCH] ia64 fallocate syscall
Cc: Amit K. Arora [EMAIL PROTECTED], 
[EMAIL PROTECTED], [EMAIL PROTECTED],
[EMAIL PROTECTED], [EMAIL PROTECTED]

ia64 fallocate syscall support.

Signed-Off-By: Dave Chinner [EMAIL PROTECTED]

---
 arch/ia64/kernel/entry.S  |1 +
 include/asm-ia64/unistd.h |3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.22-rc1/arch/ia64/kernel/entry.S
===
--- linux-2.6.22-rc1.orig/arch/ia64/kernel/entry.S  2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/arch/ia64/kernel/entry.S   2007-05-15 15:36:48.0 
-0700
@@ -1585,5 +1585,6 @@
data8 sys_getcpu
data8 sys_epoll_pwait   // 1305
data8 sys_utimensat
+   data8 sys_fallocate
 
.org sys_call_table + 8*NR_syscalls // guard against failures to 
increase NR_syscalls
Index: linux-2.6.22-rc1/include/asm-ia64/unistd.h
===
--- linux-2.6.22-rc1.orig/include/asm-ia64/unistd.h 2007-05-12 
18:45:56.0 -0700
+++ linux-2.6.22-rc1/include/asm-ia64/unistd.h  2007-05-15 15:37:51.0 
-0700
@@ -296,6 +296,7 @@
 #define __NR_getcpu1304
 #define __NR_epoll_pwait   1305
 #define __NR_utimensat 1306
+#define __NR_fallocate 1307
 
 #ifdef __KERNEL__
 


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/5] fallocate() implementation in i86, x86_64 and powerpc

2007-05-09 Thread Mingming Cao

On Wed, 2007-05-09 at 21:31 +0530, Amit K. Arora wrote:
 I have the updated patches ready which take care of Andrew's comments.
 Will run some tests and post them soon.
 
 But, before submitting these patches, I think it will be better to finalize
 on certain things which might be worth some discussion here:
 
 1) Should the file size change when preallocation is done beyond EOF ?
- Andreas and Chris Wedgwood are in favor of not changing the
  file size in this case. I also tend to agree with them. Does anyone
  has an argument in favor of changing the filesize ?
  If not, I will remove the code which changes the filesize, before I
  resubmit the concerned ext4 patch.
 

If we chose not to update the file size beyong EOF, then for filesystem
without fallocate() support (ext2,3 currently), posix_fallocate() will
follow the hard way(zero-out) to do preallocation. Then we will get
different behavior on filesystems w/o fallocate() support. It make sense
to be consistent, IMO.

My point of view, preallocation is just a efficient way to allocating
blocks for files without zero-out, other than this, the new behavior
should be consistent with the old way: file size update,mtime/ctime,
ENOSPC etc.

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-08 Thread Mingming Cao

On Mon, 2007-05-07 at 21:43 -0400, Theodore Tso wrote:
 On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
  We could check the total number of fs free blocks account before
  preallocation happens, if there isn't enough space left, there is no
  need to bother preallocating.
 
 Checking against the fs free blocks is a good idea, since it will
 prevent the obvious error case where someone tries to preallocate 10GB
 when there is only 2GB left.
Think it again, this check is useful when preallocate blocks at EOF.
It's not much useful is preallocating a range with holes. In that case
2GB space might be enough if the application tries to preallocate a
10GB.

   But it won't help if there are multiple
 processes trying to allocate blocks the same time.  On the other hand,
 that case is probably relatively rare, and in that case, the
 filesystem was probably going to be left completely full in any case.

 On Mon, May 07, 2007 at 05:15:41PM -0700, Andrew Morton wrote:
  Userspace could presumably repair the mess in most situations by truncating
  the file back again.  The kernel cannot do that because there might be live
  data in amongst there.
 
 Actually, the kernel could do it, in that could simply release all
 unitialized extents back to the system.  The problem is distinguishing
 between the unitialized extents that had just been newly added, versus
 the ones that had there from before.

True, the new uninitialized extents can be merged to the near old
uninitialized extents, there is no way to distinguish the just added
unintialized extents from the merged one.

   (On the other hand, if the
 filesystem was completely full, releasing unitialized blocks wouldn't
 be the worse thing in the world to do, although releasing previously
 fallocated blocks probably does violate the princple of least
 surprise, even if it's what the user would have wanted.)
 
 On Mon, May 07, 2007 at 05:41:39PM -0700, Mingming Cao wrote:
  If there is enough free space, we could make a reservation window that
  have at least N free blocks and mark it not stealable by other files. So
  later we will not run into the ENOSPC error.
 
 Could you really use a single reservation window?  When the filesystem
 is almost full, the free extents are likely going to be scattered all
 over the disk.  The general principle of grabbing all of the extents
 and keeping them in an in-memory data structure, and only adding them
 to the extent tree would work, though; I'm just not sure we could do
 it using the existing reservation window code, since it only supports
 a single reservation window per file, yes?
 
You are right.  One reservation window per file and there is limit to
the maximum window size). So yeah this way it's not going to prevent
ENOSPC for sure:(

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 13:58 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 05:37:54 -0600
 Andreas Dilger [EMAIL PROTECTED] wrote:
 
+   block = offset  blkbits;
+   max_blocks = (EXT4_BLOCK_ALIGN(len + offset, blkbits)  
blkbits)
+- block;
+   mutex_lock(EXT4_I(inode)-truncate_mutex);
+   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
+   mutex_unlock(EXT4_I(inode)-truncate_mutex);
   
   Now I'm mystified.  Given that we're allocating an arbitrary amount of 
   disk
   space, and that this disk space will require an arbitrary amount of
   metadata, how can we work out how much journal space we'll be needing
   without at least looking at `len'?
  
  Good question.
  
  The uninitialized extent can cover up to 128MB with a single entry.
  If @path isn't specified, then ext4_ext_calc_credits_for_insert()
  function returns the maximum number of extents needed to insert a leaf,
  including splitting all of the index blocks.  That would allow up to 43GB
  (340 extents/block * 128MB) to be preallocated, but it still needs to take
  the size of the preallocation into account (adding 3 blocks per 43GB - a
  leaf block, a bitmap block and a group descriptor).
 
 I think the use of ext4_journal_extend() (as Amit has proposed) will help
 here, but it is not sufficient.
 
 Because under some circumstances, a journal_extend() failure could mean
 that we fail to allocate all the required disk space.  If it is infrequent
 enough, that is acceptable when the caller is using fallocate() for
 performance reasons.
 
 But it is very much not acceptable if the caller is using fallocate() for
 space-reservation reasons.  If you used fallocate to reserve 1GB of disk
 and fallocate() succeeded and you later get ENOSPC then you'd have a
 right to get a bit upset.
 
 So I think the ext3/4 fallocate() implementation will need to be
 implemented as a loop: 
 
   while (len) {
   journal_start();
   len -= do_fallocate(len, ...);
   journal_stop();
   }
 
 

I agree.  There is already a loop in Amit's current's patch to call
ext4_ext_get_blocks() thoug. Question is how much credit should ext4 to
ask for in each journal_start()?
 +/*
 + * ext4_fallocate:
 + * preallocate space for a file
 + * mode is for future use, e.g. for unallocating preallocated blocks etc.
 + */
 +int ext4_fallocate(struct inode *inode, int mode, loff_t offset, loff_t len)
 +{


 +   mutex_lock(EXT4_I(inode)-truncate_mutex);
 +   credits = ext4_ext_calc_credits_for_insert(inode, NULL);
 +   mutex_unlock(EXT4_I(inode)-truncate_mutex);

I think the calculation is based on the assumption that there is only a
single extent to be inserted, which is the ideal case. But in some cases
we may end up allocating several chunk of blocks(extents) for this
single preallocation request when fs is fragmented (or part of
preallocation request is already fulfilled)

I think we should move this calculation inside the loop as well,and we
really do not need to grab the lock to calculate the credit if the @path
is always NULL, all the function does is mathmatics.

I can't think of any good way to estimate the total credits needed for
this whole preallocation request. Looked at ext4_get_block(), which is
used for DIO code to deal with large amount of block allocation. The
credit reservation is quite weak there too. The DIO_CREDIT is only
(EXT4_RESERVE_TRANS_BLOCKS + 32)

 +   handle=ext4_journal_start(inode, credits +
 +   
 EXT4_DATA_TRANS_BLOCKS(inode-i_sb)+1);
 +   if (IS_ERR(handle))
 +   return PTR_ERR(handle);
 +retry:
 +   ret = 0;
 +   while (ret = 0  ret  max_blocks) {
 +   block = block + ret;
 +   max_blocks = max_blocks - ret;
 +   ret = ext4_ext_get_blocks(handle, inode, block,
 + max_blocks, map_bh,
 + EXT4_CREATE_UNINITIALIZED_EXT, 0);
 +   BUG_ON(!ret);
 +   if (ret  0  test_bit(BH_New, map_bh.b_state)
 +((block + ret)  (i_size_read(inode)  blkbits)))
 +   nblocks = nblocks + ret;
 +   }
 +
 +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, retries))
 +   goto retry;
 +
 Now the interesting question is: what do we do if we get halfway through
 this loop and then run out of space?  We could leave the disk all filled up
 and then return failure to the caller, but that's pretty poor behaviour,
 IMO.
 
The current code handles earlier ENOSPC by three times retries. After
that if we still run out of space, then it's propably right to notify
the caller there isn't much space left.

We could extend the block reservation window size before the while loop
so we could get a lower chance to get more fragmented.

 
 Does the proposed implementation handle quotas

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 16:31 -0700, Andrew Morton wrote:
 On Mon, 7 May 2007 19:14:42 -0400
 Theodore Tso [EMAIL PROTECTED] wrote:
 
  On Mon, May 07, 2007 at 03:38:56PM -0700, Andrew Morton wrote:
Actually, this is a non-issue.  The reason that it is handled for 
extent-only
is that this is the only way to allocate space in the filesystem without
doing the explicit zeroing.  For other filesystems (including ext3 and
ext4 with block-mapped files) the filesystem should return an error 
(e.g.
-EOPNOTSUPP) and glibc will do manual zero-filling of the file in 
userspace.
   
   It can be a bit suboptimal from the layout POV.  The reservations code 
   will
   largely save us here, but kernel support might make it a bit better.
  
  Actually, the reservations code won't matter, since glibc will fall
  back to its current behavior, which is it will do the preallocation by
  explicitly writing zeros to the file.
 
 No!  Reservations code is *critical* here.  Without reservations, we get
 disastrously-bad layout if two processes were running a large fallocate()
 at the same time.  (This is an SMP-only problem, btw: on UP the timeslice
 lengths save us).
 
 My point is that even though reservations save us, we could do even-better
 in-kernel.
 

In this case, since the number of blocks to preallocate (eg. N=10GB) is
clear, we could improve the current reservation code, to allow callers
explicitly ask for a new window that have the minimum N free blocks for
the blocks-to-preallocated(rather than just have at least 1 free
blocks).

Before the ext4_fallocate() is called, the right reservation window size
is set with the flag to indicating please spend time if needed to find
a window covers at least N free blocks.

So for ex4 block mapped files, later when glibc is doing allocation and
zeroing, the ext4 block-mapped allocator will knows to reserve the right
amount of free blocks before allocating and zeroing 10GB space.

I am not sure whether this worth the effort though.

 But then, a smart application would bypass the glibc() fallocate()
 implementation and would tune the reservation window size and would use
 direct-IO or sync_file_range()+fadvise(FADV_DONTNEED).
 
  This wlil result in the same
  layout as if we had done the persistent preallocation, but of course
  it will mean the posix_fallocate() could potentially take a long time
  if you're a PVR and you're reserving a gig or two for a two hour movie
  at high quality.  That seems suboptimal, granted, and ideally the
  application should be warned about this before it calls
  posix_fallocate().  On the other hand, it's what happens today, all
  the time, so applications won't be too badly surprised.
 
 A PVR implementor would take all this over and would do it themselves, for
 sure.
 
  If we think applications programmers badly need to know in advance if
  posix_fallocate() will be fast or slow, probably the right thing is to
  define a new fpathconf() configuration option so they can query to see
  whether a particular file will support a fast posix_fallocate().  I'm
  not 100% convinced such complexity is really needed, but I'm willing
  to be convinced  what do folks think?
  
 
 An application could do sys_fallocate(one-byte) to work out whether it's
 supported in-kernel, I guess.
 

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] ext4: fallocate support in ext4

2007-05-07 Thread Mingming Cao

On Mon, 2007-05-07 at 17:15 -0700, Andrew Morton wrote:
 On Mon, 07 May 2007 17:00:24 -0700
 Mingming Cao [EMAIL PROTECTED] wrote:
 
   +   while (ret = 0  ret  max_blocks) {
   +   block = block + ret;
   +   max_blocks = max_blocks - ret;
   +   ret = ext4_ext_get_blocks(handle, inode, block,
   + max_blocks, map_bh,
   + EXT4_CREATE_UNINITIALIZED_EXT, 
   0);
   +   BUG_ON(!ret);
   +   if (ret  0  test_bit(BH_New, map_bh.b_state)
   +((block + ret)  (i_size_read(inode)  
   blkbits)))
   +   nblocks = nblocks + ret;
   +   }
   +
   +   if (ret == -ENOSPC  ext4_should_retry_alloc(inode-i_sb, 
   retries))
   +   goto retry;
   +
   Now the interesting question is: what do we do if we get halfway through
   this loop and then run out of space?  We could leave the disk all filled 
   up
   and then return failure to the caller, but that's pretty poor behaviour,
   IMO.
   
  The current code handles earlier ENOSPC by three times retries. After
  that if we still run out of space, then it's propably right to notify
  the caller there isn't much space left.
  
  We could extend the block reservation window size before the while loop
  so we could get a lower chance to get more fragmented.
 
 yes, but my point is that the proposed behaviour is really quite bad.
 
I agree your point, that's why I mention it only helped the
fragmentation issue but not the ENOSPC case.


 We will attempt to allocate the disk space and then we will return failure,
 having consumed all the disk space and having partially and uselessly
 populated an unknown amount of the file.
 

Not totally useless I think. If only half of the space is preallocated
because run out of space, the application can decide whether it's good
enough to start to use this preallocated space or wait for the fs to
have more free space.

 Userspace could presumably repair the mess in most situations by truncating
 the file back again.  The kernel cannot do that because there might be live
 data in amongst there.
 
 So we'd need to either keep track of which blocks were newly-allocated and
 then free them all again on the error path (doesn't work right across
 commit+crash+recovery) or we could later use the space-reservation scheme 
 which
 delayed allocation will need to introduce.
 
 Or we could decide to live with the above IMO-crappy behaviour.

In fact Amit and I had raised this issue before, whether it's okay to do
allow partial preallocation. At that moment the feedback is it's no much
different than the current zero-out-preallocation behavior: people might
preallocating half-way then later deal with ENOSPC.

We could check the total number of fs free blocks account before
preallocation happens, if there isn't enough space left, there is no
need to bother preallocating.

If there is enough free space, we could make a reservation window that
have at least N free blocks and mark it not stealable by other files. So
later we will not run into the ENOSPC error.

The fs free blocks account is just a estimate though.


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Mingming Cao


Jan Kara wrote:

On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott [EMAIL PROTECTED] wrote:




On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:



On Fri, 2 Mar 2007 00:04:45 +0530
Amit K. Arora [EMAIL PROTECTED] wrote:




This is to give a heads up on few patches that we will be soon coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);


...

I'd agree with Eric on the command flag extension.


Seems like a separate syscall would be better, command sounds
a bit ioctl like, especially if that command is passed into the
filesystems..




madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to mode? ;)



I am wondering if it is useful to add another mode to advise block 
allocation policy? Something like indicating which physical block/block 
group to allocate from (goal), and whether ask for strict contigous 
blocks. This will help preallocation or reservation to choose the right 
blocks for the file.


  Yes, I also think this would be useful so you can guide
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

Honza

Yep, I think it makes sense to use preallocation for defragmentation.
After all both preallocation and defragmentation shall call underlying 
filesystem multiple block allocator to try to allocate a chunk of 
contiguous blocks on disk. ext4 online defrag implementation by Takashi 
already support to choose a goal allocation block to guide the ext4 
block allocator to place the defraged file is a specific location.


Passing a little bit more hint to sys_fallocate() (i.e, goal block, 
and/or whether the goal block is important over the size of prealloc 
extent), might make it more useful for the orginial goal (get contigous 
and guranteed blocks) and for defragmentation.


Regards,
Mingming
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Heads up on sys_fallocate()

2007-03-02 Thread Mingming Cao


Dave Kleikamp wrote:

On Thu, 2007-03-01 at 14:59 -0800, Andrew Morton wrote:


On Thu, 01 Mar 2007 22:44:16 +
Dave Kleikamp [EMAIL PROTECTED] wrote:



On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:


On Fri, 2 Mar 2007 00:04:45 +0530
Amit K. Arora [EMAIL PROTECTED] wrote:



+asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len)
+{
+   struct file *file;
+   struct inode *inode;
+   long ret = -EINVAL;
+   file = fget(fd);
+   if (!file)
+   goto out;
+   inode = file-f_path.dentry-d_inode;
+   if (inode-i_op  inode-i_op-fallocate)
+   ret = inode-i_op-fallocate(inode, offset, len);
+   else
+   ret = -ENOTTY;
+   fput(file);
+out:
+return ret;
+}


ENOTTY is a bit unconventional - we often use EINVAL for this sort of
thing.  But EINVAL has other meanings for posix_fallocate() and isn't
really appropriate here anyway.  So I'm not sure what would be better...


Would EINVAL (or whatever) make it back to the caller of
posix_fallocate(), or would glibc fall back to its current
implementation?

Forgive me if I haven't put enough thought into it, but would it be
useful to create a generic_fallocate() that writes zeroed pages for any
non-existent pages in the range?  I don't know how glibc currently
implements posix_fallocate(), but maybe the kernel could do it more
efficiently, even in generic code.  Maybe we don't care, since the major
file systems can probably do something better in their own code.


Given that glibc already implements fallocate for all filesystems, it will
need to continue to do so for filesystems which don't implement this
syscall - otherwise applications would start breaking.



I didn't make it clear, but my point was to call generic_fallocate if
the file system did not define i_op-allocate().

if (inode-i_op  inode-i_op-fallocate)
ret = inode-i_op-fallocate(inode, offset, len);
else
ret = generic_fallocate(inode, offset, len);

I'm not sure it's worth the effort, but I thought I'd throw the idea out
there.


I think this is useful.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: a question regarding Ext3 file truncate

2007-02-07 Thread Mingming Cao

On Wed, 2007-02-07 at 12:16 -0500, Xin Zhao wrote:
 Hi,
 
 Please forgive me if the question is dumb.
 
 I am modifying ext3 to add some new features, but was confused by the
 implementation of ext3_truncate().
 
 In ext3_truncate():
 
 we first use
   n = ext3_block_to_path(inode, last_block, offsets, NULL);
 to get the path of the last block.
 
 If the number of blocks is smaller than 12,  all blocks are then
 direct blocks. We then need to clear them.
 
 But the interesting thing happens:
   if (n == 1) 
   /* direct blocks */
   {
   ext3_free_data(handle, inode, NULL, i_data+offsets[0],
   i_data + EXT3_NDIR_BLOCKS);
   goto do_indirects;
   }
 This code seems to free data blocks right after the blocks used by the
 file. I think it should be
 ext3_free_data(handle, inode, NULL, i_data, i_data+offsets[0]);

Last_block is the last logical block after the truncate, so
ext3_truncate() free data blocks after this point.

Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH]Add memory barrier before clear bit in unlock_buffer()

2007-02-05 Thread Mingming Cao

We are runnin SDET benchmark and saw double free issue for ext3 extended
attributes block, which complains the same xattr block already being
freed (in ext3_xattr_release_block()). The problem could also been
triggered by multiple threads loop untar/rm a kernel tree.

The race is caused by missing a memory barrier at unlock_buffer() before
the lock bit being cleared, resulting in possible concurrent
h_refcounter update. That causes a reference counter leak, then later
leads to the double free that we have seen.

Inside unlock_buffer(), there is a memory barrier is placed *after* the
lock bit is being cleared, however, there is no memory barrier *before*
the bit is cleared. On some arch the h_refcount update instruction and
the clear bit instruction could be reordered, thus leave the critical
section re-entered.

The race is like this: For example, if the h_refcount is initialized as
1,

cpu 0:   cpu1
--   ---
lock_buffer() /* test_and_set_bit */
clear_buffer_locked(bh); 
lock_buffer() /* test_and_set_bit */
h_refcount = h_refcount+1; /* = 2*/ h_refcount = h_refcount + 1; /*= 2 */
clear_buffer_locked(bh);
..

We lost a h_refcount here. We need a memory barrier before the buffer head lock
bit being cleared to force the order of the two writes.  Please apply.

Signed-Off-By: Mingming Cao [EMAIL PROTECTED]


--- linux/fs/buffer.c.orig  2007-02-04 11:37:50.0 -0600
+++ linux/fs/buffer.c   2007-02-04 11:38:14.0 -0600
@@ -77,6 +77,7 @@
 
 void fastcall unlock_buffer(struct buffer_head *bh)
 {
+   smp_mb__before_clear_bit();
clear_buffer_locked(bh);
smp_mb__after_clear_bit();
wake_up_bit(bh-b_state, BH_Lock);


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: a question about ext4_fsblk_t

2006-11-28 Thread Mingming Cao


On Sun, 2006-10-22 at 16:50 +0800, guomingyang wrote:
 
I also find many places where the block number type is not changed in 
 namei.c and dir.c. Why? 

They are all file logical blocks, ext4_fsblk_t is for on disk blocks.

Takashi Sato has a patch to define all file logical blocks as
ext3_fileblk_t, it did not make to mainline before ext4 was forked.
Probably we should bring the to ext3/4 to clarify the confusions.

Mingming
Thank you to anyone who offer help
 Hello everyone:
 
  I am a student interesting in linux filesystem, I have a 
  problem about the replace of block number type to ext4_fsblk_t. Is it 
  complete? Or has it passed all the test? Because I have found a place like 
  this in linux-2.6.19-rc2/fs/ext4/inode.c 
 
 
 
 +struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 +   int block, int create, int *err)
 
 
 the block's type is not changed. Although I think it will bring no mistake, 
 I doubt about why not change it. Thank you !
 


 
 guomingyang
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 guomingyang
 
 -
 To unsubscribe from this list: send the line unsubscribe linux-ext4 in
 the body of a message to [EMAIL PROTECTED]
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Ext2-devel] [RFC] [PATCH 2/4]delayed allocation for ext3

2005-07-18 Thread Mingming Cao

On Sun, 2005-07-17 at 19:47 -0600, Andreas Dilger wrote:
 On Jul 17, 2005  10:40 -0700, Mingming Cao wrote:
  @@ -373,6 +373,7 @@ struct ext3_inode {
   #define EXT3_MOUNT_BARRIER 0x2 /* Use block barriers */
   #define EXT3_MOUNT_NOBH0x4 /* No bufferheads */
   #define EXT3_MOUNT_QUOTA   0x8 /* Some quota option set */
  + #define EXT3_MOUNT_DELAYED_ALLOC  0xC /* Delayed Allocation */
 
 This doesn't make sense.  DELAYED_ALLOC == QUOTA | NOBH?
 


Ah...:-)  Badari used 0x8 for DELAYED_ALLOC in the previous patch
(2.6.11 based).When moving those patches forward to 2.6.13-rc3, we found
the conflict with QUOTA, and obviously picked up a wrong value.

  + {Opt_delayed_alloc, delalloc},
 
 Is this a replacement for Alex's delalloc code?  We also use delalloc for
 that code and if they are not interchangeable it will cause confusion
 about which one is in use.
 

Okey, will think a new name for this feature to avoid confusion.  Alex's
delalloc is bond to extent tree structure so it's hard to be adopted
directly to the current ext3 layout, so, I'd say this work done by
Badari(inspired by Alex's work) is a different implementation.

  + if (test_opt(sb, DELAYED_ALLOC)) {
  + if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) 
  {
  + printk(KERN_WARNING EXT3-fs: Ignoring delall option 
  - 
  + its supported only with writeback mode\n);
 
 Should be ignoring delalloc option.
  
Fixed. 


Thanks for looking at this.

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3

2005-07-17 Thread Mingming Cao

Hi All, 

Here are the updated patches to support multiple block allocation and
delayed allocation for ext3 done by me, Badari and Suparna.

[PATCH 1/4] -- multiple block allocation for current ext3.
(ext3_get_blocks()).

[PATCH 2/4] -- adding delayed allocation for writeback mode

[PATCH 3/4] -- generic support for cluster pages together in
mapge_writepages() to make use of getblocks() 

[PATCH 4/4] -- support multiple block allocation for ext3 writeback mode
through writepages(). 


Have done initial testing on dbench and tiobench on a 2.6.11 version of
this patch set. Dbench 8 thread throughput result is increased by 20%
with this patch set.

dbench comparison: (ext3-dm represents ext3+thispatchset)
http://www.sudhaa.com/~ram/ols2005presentation/dbench.jpg
tiobench comparison:
http://www.sudhaa.com/~ram/ols2005presentation/tio_seq_write.jpg


Todo:
- bmap() support for delayed allocation
- page reserve flag to indicate the delayed allocation
- ordered mode support for delayed allocation
- bh support to enable blocksize = 1k/2k filesystems



Cheers,

Mingming


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] [PATCH 4/4]add ext3 writeback writpages

2005-07-17 Thread Mingming Cao

support multiple block allocation for ext3 writeback mode through writepages().


---

 linux-2.6.12-ming/fs/ext3/inode.c   |  131 
 linux-2.6.12-ming/fs/mpage.c|8 +
 linux-2.6.12-ming/include/linux/mpage.h |   17 
 3 files changed, 153 insertions(+), 3 deletions(-)

diff -puN fs/ext3/inode.c~writepages fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~writepages 2005-07-17 17:11:43.239274864 
-0700
+++ linux-2.6.12-ming/fs/ext3/inode.c   2005-07-17 17:11:43.259271824 -0700
@@ -36,6 +36,7 @@
 #include linux/writeback.h
 #include linux/mpage.h
 #include linux/uio.h
+#include linux/pagevec.h
 #include xattr.h
 #include acl.h
 
@@ -1719,6 +1720,135 @@ out_fail:
return ret;
 }
 
+static int
+ext3_writeback_writepages(struct address_space *mapping,
+   struct writeback_control *wbc)
+{
+   struct inode *inode = mapping-host;
+   const unsigned blkbits = inode-i_blkbits;
+   int err = 0;
+   int ret = 0;
+   int done = 0;
+   unsigned int max_pages_to_cluster = 0;
+   struct pagevec pvec;
+   int nr_pages;
+   pgoff_t index;
+   pgoff_t end = -1;   /* Inclusive */
+   int scanned = 0;
+   int is_range = 0;
+   struct page *page;
+   struct mpageio mio = {
+   .bio = NULL
+   };
+
+   pagevec_init(pvec, 0);
+   if (wbc-sync_mode == WB_SYNC_NONE) {
+   index = mapping-writeback_index; /* Start from prev offset */
+   } else {
+   index = 0;/* whole-file sweep */
+   scanned = 1;
+   }
+   if (wbc-start || wbc-end) {
+   index = wbc-start  PAGE_CACHE_SHIFT;
+   end = wbc-end  PAGE_CACHE_SHIFT;
+   is_range = 1;
+   scanned = 1;
+   }
+   max_pages_to_cluster = min(EXT3_MAX_TRANS_DATA, (pgoff_t)PAGEVEC_SIZE);
+
+retry:
+   down_read(inode-i_alloc_sem);
+   while (!done  (index = end) 
+   (nr_pages = pagevec_contig_lookup_tag(pvec, mapping,
+   index, PAGECACHE_TAG_DIRTY,
+   min(end - index, max_pages_to_cluster-1) + 1))) {
+   unsigned i;
+
+   scanned = 1;
+   for (i = 0; i  nr_pages; i++) {
+   page = pvec.pages[i];
+
+   lock_page(page);
+
+   if (unlikely(page-mapping != mapping)) {
+   unlock_page(page);
+   break;
+   }
+
+   if (unlikely(is_range)  page-index  end) {
+   unlock_page(page);
+   break;
+   }
+
+   if (wbc-sync_mode != WB_SYNC_NONE)
+   wait_on_page_writeback(page);
+
+   if (PageWriteback(page) ||
+   !clear_page_dirty_for_io(page)) {
+   unlock_page(page);
+   break;
+   }
+   }
+
+   if (i) {
+   unsigned j;
+   handle_t *handle;
+
+   page = pvec.pages[i-1];
+   index = page-index + 1;
+   mio.final_block_in_request =
+   min(index, end)  (PAGE_CACHE_SHIFT - blkbits);
+
+   handle = ext3_journal_start(inode,
+   i + ext3_writepage_trans_blocks(inode));
+
+   if (IS_ERR(handle)) {
+   err = PTR_ERR(handle);
+   done = 1;
+   }
+   for (j = 0; j  i; j++) {
+   page = pvec.pages[j];
+   if (!done) {
+   ret = __mpage_writepage(mio, page,
+   ext3_writepages_get_blocks, wbc,
+   
ext3_writeback_writepage_helper);
+   if (ret || (--(wbc-nr_to_write) = 0))
+   done = 1;
+   } else {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   }
+   }
+   if (!err  mio.bio)
+   mio.bio = mpage_bio_submit(WRITE, mio.bio);
+   if (!err)
+   err = ext3_journal_stop(handle);
+   if (!ret) {
+   ret = err;
+   if (ret)
+   done = 1;
+

[RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages

2005-07-17 Thread Mingming Cao

Updated patch from Suparna for generic support for cluster pages
together in mapge_writepages() to make use of getblocks() 

---

 linux-2.6.12-ming/fs/buffer.c |   49 -
 linux-2.6.12-ming/fs/ext2/inode.c |   15 -
 linux-2.6.12-ming/fs/ext3/inode.c |   15 +
 linux-2.6.12-ming/fs/ext3/super.c |3 
 linux-2.6.12-ming/fs/hfs/inode.c  |2 
 linux-2.6.12-ming/fs/hfsplus/inode.c  |2 
 linux-2.6.12-ming/fs/jfs/inode.c  |   24 ++
 linux-2.6.12-ming/fs/mpage.c  |  214 ++
 linux-2.6.12-ming/include/linux/buffer_head.h |4 
 linux-2.6.12-ming/include/linux/fs.h  |2 
 linux-2.6.12-ming/include/linux/mpage.h   |   11 -
 linux-2.6.12-ming/include/linux/pagemap.h |3 
 linux-2.6.12-ming/include/linux/pagevec.h |3 
 linux-2.6.12-ming/include/linux/radix-tree.h  |   14 +
 linux-2.6.12-ming/lib/radix-tree.c|   25 ++-
 linux-2.6.12-ming/mm/filemap.c|9 -
 linux-2.6.12-ming/mm/swap.c   |   11 +
 17 files changed, 270 insertions(+), 136 deletions(-)

diff -puN fs/buffer.c~mpage_writepages_getblocks fs/buffer.c
--- linux-2.6.12/fs/buffer.c~mpage_writepages_getblocks 2005-07-15 
00:11:01.0 -0700
+++ linux-2.6.12-ming/fs/buffer.c   2005-07-15 00:11:01.0 -0700
@@ -2509,53 +2509,10 @@ EXPORT_SYMBOL(nobh_commit_write);
  * that it tries to operate without attaching bufferheads to
  * the page.
  */
-int nobh_writepage(struct page *page, get_block_t *get_block,
-   struct writeback_control *wbc)
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+   struct writeback_control *wbc, writepage_t bh_writepage_fn)
 {
-   struct inode * const inode = page-mapping-host;
-   loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size  PAGE_CACHE_SHIFT;
-   unsigned offset;
-   void *kaddr;
-   int ret;
-
-   /* Is the page fully inside i_size? */
-   if (page-index  end_index)
-   goto out;
-
-   /* Is the page fully outside i_size? (truncate in progress) */
-   offset = i_size  (PAGE_CACHE_SIZE-1);
-   if (page-index = end_index+1 || !offset) {
-   /*
-* The page may have dirty, unmapped buffers.  For example,
-* they may have been added in ext3_writepage().  Make them
-* freeable here, so the page does not leak.
-*/
-#if 0
-   /* Not really sure about this  - do we need this ? */
-   if (page-mapping-a_ops-invalidatepage)
-   page-mapping-a_ops-invalidatepage(page, offset);
-#endif
-   unlock_page(page);
-   return 0; /* don't care */
-   }
-
-   /*
-* The page straddles i_size.  It must be zeroed out on each and every
-* writepage invocation because it may be mmapped.  A file is mapped
-* in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
-* writes to that region are not written out to the file.
-*/
-   kaddr = kmap_atomic(page, KM_USER0);
-   memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
-out:
-   ret = mpage_writepage(page, get_block, wbc);
-   if (ret == -EAGAIN)
-   ret = __block_write_full_page(inode, page, get_block, wbc);
-   return ret;
+   return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn);
 }
 EXPORT_SYMBOL(nobh_writepage);
 
diff -puN fs/ext2/inode.c~mpage_writepages_getblocks fs/ext2/inode.c
--- linux-2.6.12/fs/ext2/inode.c~mpage_writepages_getblocks 2005-07-15 
00:11:01.0 -0700
+++ linux-2.6.12-ming/fs/ext2/inode.c   2005-07-15 00:11:01.0 -0700
@@ -650,12 +650,6 @@ ext2_nobh_prepare_write(struct file *fil
return nobh_prepare_write(page,from,to,ext2_get_block);
 }
 
-static int ext2_nobh_writepage(struct page *page,
-   struct writeback_control *wbc)
-{
-   return nobh_writepage(page, ext2_get_block, wbc);
-}
-
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,ext2_get_block);
@@ -673,6 +667,12 @@ ext2_get_blocks(struct inode *inode, sec
return ret;
 }
 
+static int ext2_nobh_writepage(struct page *page,
+   struct writeback_control *wbc)
+{
+   return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage);
+}
+
 static ssize_t
 ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
@@ -687,7 +687,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
 {
-

Re: [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3

2005-07-17 Thread Mingming Cao

On Sun, 2005-07-17 at 10:40 -0700, Mingming Cao wrote:
 Hi All, 
 
 Here are the updated patches to support multiple block allocation and
 delayed allocation for ext3 done by me, Badari and Suparna.

Patches are against 2.6.13-rc3.


Mingming

-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[RFC] [PATCH 1/4]Multiple block allocation for ext3

2005-07-17 Thread Mingming Cao

Here is the patch support multiple block allocation for ext3. Current
ext3 allocates one block at a time, not efficient for large sequential
write IO.

This patch implements a simply multiple block allocation with current
ext3.  The basic idea is allocating the 1st block in the existing way,
and attempting to allocate the next adjacent blocks on a  best effort
basis. If contiguous allocation is blocked by an already allocated
block, the current number of free blocks are allocated and no futhur
search is tried.

This implementation makes uses of block reservation. With the knowledge
of how many blocks to allocate, the reservation window size is being
enlargedaccordin before block allocation to increase the chance to get
contiguous blocks.

Previous post of this patch with more description could be found here:
http://marc.theaimsgroup.com/?l=ext2-develm=111471578328685w=2 




---

 linux-2.6.12-ming/fs/ext3/balloc.c|  121 +++--
 linux-2.6.12-ming/fs/ext3/inode.c |  380 --
 linux-2.6.12-ming/fs/ext3/xattr.c |3 
 linux-2.6.12-ming/include/linux/ext3_fs.h |2 
 4 files changed, 458 insertions(+), 48 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-get-blocks fs/ext3/balloc.c
--- linux-2.6.12/fs/ext3/balloc.c~ext3-get-blocks   2005-07-14 
21:55:55.110385896 -0700
+++ linux-2.6.12-ming/fs/ext3/balloc.c  2005-07-14 22:40:32.265396472 -0700
@@ -20,6 +20,7 @@
 #include linux/quotaops.h
 #include linux/buffer_head.h
 
+#defineNBS_DEBUG   0
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -652,9 +653,11 @@ claim_block(spinlock_t *lock, int block,
  */
 static int
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
-   struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window 
*my_rsv)
+   struct buffer_head *bitmap_bh, int goal, unsigned long *count,
+   struct ext3_reserve_window *my_rsv)
 {
int group_first_block, start, end;
+   unsigned long num = 0;
 
/* we do allocation within the reservation window if we have a window */
if (my_rsv) {
@@ -712,8 +715,22 @@ repeat:
goto fail_access;
goto repeat;
}
-   return goal;
+   num++;
+   goal++;
+   if (NBS_DEBUG)
+   printk(ext3_new_block: first block allocated:block %d,num 
%d\n, goal, num);
+   while (num  *count  goal  end
+ext3_test_allocatable(goal, bitmap_bh)
+claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, 
bitmap_bh)) {
+   num++;
+   goal++;
+   }
+   *count = num;
+   if (NBS_DEBUG)
+   printk(ext3_new_block: additional block allocated:block %d,num 
%d,goal-num %d\n, goal, num, goal-num);
+   return goal - num;
 fail_access:
+   *count = num;
return -1;
 }
 
@@ -998,6 +1015,28 @@ retry:
goto retry;
 }
 
+static void try_to_extend_reservation(struct ext3_reserve_window_node *my_rsv,
+   struct super_block *sb, int size)
+{
+   struct ext3_reserve_window_node *next_rsv;
+   struct rb_node *next;
+   spinlock_t *rsv_lock = EXT3_SB(sb)-s_rsv_window_lock;
+
+   spin_lock(rsv_lock);
+   next = rb_next(my_rsv-rsv_node);
+
+   if (!next)
+   my_rsv-rsv_end += size;
+   else {
+   next_rsv = list_entry(next, struct ext3_reserve_window_node, 
rsv_node);
+
+   if ((next_rsv-rsv_start - my_rsv-rsv_end)  size)
+   my_rsv-rsv_end += size;
+   else
+   my_rsv-rsv_end = next_rsv-rsv_start -1 ;
+   }
+   spin_unlock(rsv_lock);
+}
 /*
  * This is the main function used to allocate a new block and its reservation
  * window.
@@ -1023,11 +1062,12 @@ static int
 ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
unsigned int group, struct buffer_head *bitmap_bh,
int goal, struct ext3_reserve_window_node * my_rsv,
-   int *errp)
+   unsigned long *count, int *errp)
 {
unsigned long group_first_block;
int ret = 0;
int fatal;
+   unsigned long num = *count;
 
*errp = 0;
 
@@ -1050,7 +1090,8 @@ ext3_try_to_allocate_with_rsv(struct sup
 * or last attempt to allocate a block with reservation turned on failed
 */
if (my_rsv == NULL ) {
-   ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, 
NULL);
+   ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+   count, NULL);
goto out;
}
/*
@@ -1080,6 +1121,10 @@ ext3_try_to_allocate_with_rsv(struct sup
while (1) {
if (rsv_is_empty(my_rsv-rsv_window) || (ret  0) ||

Re: Lazy block allocation and block_prepare_write?

2005-04-19 Thread Mingming Cao

On Tue, 2005-04-19 at 19:55 +0400, Nikita Danilov wrote:
 Badari Pulavarty [EMAIL PROTECTED] writes:
 
  On Tue, 2005-04-19 at 04:22, Nikita Danilov wrote:
  Badari Pulavarty [EMAIL PROTECTED] writes:
  
  [...]
  
  
   Yes. Its possible to do what you want to. I am currently working on
   adding delayed allocation support to ext3. As part of that, We
  
  As you most likely already know, Alex Thomas already implemented delayed
  block allocation for ext3.
 
  Yep. I reviewed Alex Thomas patches for delayed allocation. He handled
  all the cases in his code and did NOT use any mpage* routines to do
  the work. I was hoping to change the mpage infrastructure to handle
  these, so that every filesystem doesn't have to do their thing.
 
 
 Just keep in mind that filesystem != ext3. :-) Generic support makes
 sense only when it is usable by multiple file systems. This is not
 always possible, e.g., there is no generic block allocator for
 precisely the same reason: disk space allocation policies are tightly
 intertwined with the rest of file system internals.
 

This generic support should be useful for ext2 and xfs. From delayed
allocation point of view, it should not aware any filesystem specific
block allocation policies, and it should not care.:)  It just simply
gathering all pages that need to map block on disk, and asking the
filesystem get_blocks() call back function, which will take care of the
filesystem-specific multiple blocks mapping for it.

Current get_blocks() function for ext3 is just simply loop calling
ext3_get_block().  I am trying to add a real ext3_get_blocks() to reduce
the cpu cost, reduce the number of metadata updates and increase the
possibility to get contiguous blocks on disk.


-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

82 matches

Mail list logo