[RFC] [PATCH 1/4]Multiple block allocation for ext3

2005-07-17 Thread Mingming Cao
Here is the patch to support multiple block allocation for ext3. Current
ext3 allocates one block at a time, which is inefficient for large
sequential write IO.

This patch implements a simple multiple block allocation on top of
current ext3. The basic idea is to allocate the first block in the
existing way, then attempt to allocate the adjacent blocks on a
best-effort basis. If contiguous allocation is blocked by an already
allocated block, the blocks claimed so far are returned and no further
search is tried.
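
As a rough illustration of the idea above (a userspace sketch only, not
the kernel code in this patch): the toy bitmap, TOY_BLOCKS, and the
toy_* helpers are all hypothetical names. The first free block is found
by a plain linear scan standing in for "the existing way", and adjacent
blocks are then claimed until one is found in use.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical toy bitmap: a set bit means "block in use". */
#define TOY_BLOCKS 64

struct toy_bitmap {
	unsigned char bits[TOY_BLOCKS / CHAR_BIT];
};

static int toy_test(const struct toy_bitmap *bm, int blk)
{
	return (bm->bits[blk / CHAR_BIT] >> (blk % CHAR_BIT)) & 1;
}

static void toy_set(struct toy_bitmap *bm, int blk)
{
	bm->bits[blk / CHAR_BIT] |= (unsigned char)(1u << (blk % CHAR_BIT));
}

/*
 * Allocate up to *count contiguous blocks in [goal, end).
 * Returns the first allocated block, or -1 if nothing was allocated.
 * On return *count holds the number of blocks actually allocated.
 */
static int toy_alloc_blocks(struct toy_bitmap *bm, int goal, int end,
			    unsigned long *count)
{
	unsigned long num = 0;
	int start;

	assert(goal >= 0 && end <= TOY_BLOCKS);

	/* Step 1: find the first free block (linear scan in this sketch). */
	while (goal < end && toy_test(bm, goal))
		goal++;
	if (goal >= end || *count == 0) {
		*count = 0;
		return -1;
	}
	start = goal;

	/* Step 2: claim it, then keep claiming neighbours best-effort. */
	do {
		toy_set(bm, goal);
		num++;
		goal++;
	} while (num < *count && goal < end && !toy_test(bm, goal));

	*count = num;
	return start;
}
```

A caller asks for N blocks and must cope with getting fewer: the count
is an in/out parameter, the same convention the patch uses for
ext3_try_to_allocate().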

This implementation makes use of block reservation. With the knowledge
of how many blocks to allocate, the reservation window size is enlarged
accordingly before block allocation, to increase the chance of getting
contiguous blocks.
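
The window enlargement can be pictured with a small userspace sketch.
This is an assumption-laden model, not the kernel structures: struct
toy_window and toy_extend_window() are made-up names, the window end is
inclusive, and locking and the red-black tree lookup of the neighbour
are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reservation window; both bounds are inclusive. */
struct toy_window {
	unsigned long start;
	unsigned long end;
};

/*
 * Grow win by up to size blocks without running into next, the
 * neighbouring window with the next-higher start (NULL if none).
 */
static void toy_extend_window(struct toy_window *win,
			      const struct toy_window *next,
			      unsigned long size)
{
	assert(win->start <= win->end);

	if (next == NULL) {
		/* Nothing above us: take the full extension. */
		win->end += size;
		return;
	}

	assert(next->start > win->end);
	if (next->start - win->end > size)
		win->end += size;		/* enough room before the neighbour */
	else
		win->end = next->start - 1;	/* stop just short of the neighbour */
}
```

The extension is best-effort in the same sense as the allocation: a
nearby neighbour simply limits how far the window can grow.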

A previous post of this patch, with more description, can be found here:
http://marc.theaimsgroup.com/?l=ext2-devel&m=111471578328685&w=2 




---

 linux-2.6.12-ming/fs/ext3/balloc.c|  121 +++--
 linux-2.6.12-ming/fs/ext3/inode.c |  380 --
 linux-2.6.12-ming/fs/ext3/xattr.c |3 
 linux-2.6.12-ming/include/linux/ext3_fs.h |2 
 4 files changed, 458 insertions(+), 48 deletions(-)

diff -puN fs/ext3/balloc.c~ext3-get-blocks fs/ext3/balloc.c
--- linux-2.6.12/fs/ext3/balloc.c~ext3-get-blocks   2005-07-14 21:55:55.110385896 -0700
+++ linux-2.6.12-ming/fs/ext3/balloc.c  2005-07-14 22:40:32.265396472 -0700
@@ -20,6 +20,7 @@
 #include 
 #include 
 
+#define NBS_DEBUG   0
 /*
  * balloc.c contains the blocks allocation and deallocation routines
  */
@@ -652,9 +653,11 @@ claim_block(spinlock_t *lock, int block,
  */
 static int
 ext3_try_to_allocate(struct super_block *sb, handle_t *handle, int group,
-   struct buffer_head *bitmap_bh, int goal, struct ext3_reserve_window *my_rsv)
+   struct buffer_head *bitmap_bh, int goal, unsigned long *count,
+   struct ext3_reserve_window *my_rsv)
 {
int group_first_block, start, end;
+   unsigned long num = 0;
 
/* we do allocation within the reservation window if we have a window */
if (my_rsv) {
@@ -712,8 +715,22 @@ repeat:
goto fail_access;
goto repeat;
}
-   return goal;
+   num++;
+   goal++;
+   if (NBS_DEBUG)
+   printk("ext3_new_block: first block allocated:block %d,num %lu\n", goal, num);
+   while (num < *count && goal < end
+   && ext3_test_allocatable(goal, bitmap_bh)
+   && claim_block(sb_bgl_lock(EXT3_SB(sb), group), goal, bitmap_bh)) {
+   num++;
+   goal++;
+   }
+   *count = num;
+   if (NBS_DEBUG)
+   printk("ext3_new_block: additional block allocated:block %d,num %lu,goal-num %lu\n", goal, num, goal-num);
+   return goal - num;
 fail_access:
+   *count = num;
return -1;
 }
 
@@ -998,6 +1015,28 @@ retry:
goto retry;
 }
 
+static void try_to_extend_reservation(struct ext3_reserve_window_node *my_rsv,
+   struct super_block *sb, int size)
+{
+   struct ext3_reserve_window_node *next_rsv;
+   struct rb_node *next;
+   spinlock_t *rsv_lock = &EXT3_SB(sb)->s_rsv_window_lock;
+
+   spin_lock(rsv_lock);
+   next = rb_next(&my_rsv->rsv_node);
+
+   if (!next)
+   my_rsv->rsv_end += size;
+   else {
+   next_rsv = list_entry(next, struct ext3_reserve_window_node, rsv_node);
+
+   if ((next_rsv->rsv_start - my_rsv->rsv_end) > size)
+   my_rsv->rsv_end += size;
+   else
+   my_rsv->rsv_end = next_rsv->rsv_start -1 ;
+   }
+   spin_unlock(rsv_lock);
+}
 /*
  * This is the main function used to allocate a new block and its reservation
  * window.
@@ -1023,11 +1062,12 @@ static int
 ext3_try_to_allocate_with_rsv(struct super_block *sb, handle_t *handle,
unsigned int group, struct buffer_head *bitmap_bh,
int goal, struct ext3_reserve_window_node * my_rsv,
-   int *errp)
+   unsigned long *count, int *errp)
 {
unsigned long group_first_block;
int ret = 0;
int fatal;
+   unsigned long num = *count;
 
*errp = 0;
 
@@ -1050,7 +1090,8 @@ ext3_try_to_allocate_with_rsv(struct sup
 * or last attempt to allocate a block with reservation turned on failed
 */
if (my_rsv == NULL ) {
-   ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal, NULL);
+   ret = ext3_try_to_allocate(sb, handle, group, bitmap_bh, goal,
+   count, NULL);
goto out;
}
/*
@@ -1080,6 +1121,10 @@ ext3_try_to_allocate_with_rsv(struct sup
while (1) {
if (rsv_is_empty(&my_rsv->rsv_window) || (ret < 0) ||
!goal_in_my_reservation(&my_rsv->r

Re: [RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3

2005-07-17 Thread Mingming Cao
On Sun, 2005-07-17 at 10:40 -0700, Mingming Cao wrote:
> Hi All, 
> 
> Here are the updated patches to support multiple block allocation and
> delayed allocation for ext3 done by me, Badari and Suparna.

Patches are against 2.6.13-rc3.


Mingming

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC] [PATCH 3/4]generic getblocks() support in mpage_writepages

2005-07-17 Thread Mingming Cao
Updated patch from Suparna adding generic support for clustering pages
together in mpage_writepages() to make use of getblocks().
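
To show why clustering helps, here is a small userspace sketch (not
part of the patch; toy_contig_run() is a hypothetical helper). Given a
sorted list of dirty page indices, it measures the run of consecutive
pages starting at a position, so that one multi-block mapping request
could cover the whole run instead of one request per page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length of the run of consecutive page indices beginning at
 * pages[pos], capped at max_run. pages[] must be sorted ascending.
 */
static size_t toy_contig_run(const unsigned long *pages, size_t n,
			     size_t pos, size_t max_run)
{
	size_t len;

	assert(pos < n && max_run > 0);

	len = 1;
	while (pos + len < n && len < max_run &&
	       pages[pos + len] == pages[pos] + len)
		len++;
	return len;
}
```

For dirty pages 3, 4, 5, 9, 10 this yields a run of three pages at
position 0 and a run of two at position 3, so two mapping calls would
be enough where a page-at-a-time scheme needs five.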

---

 linux-2.6.12-ming/fs/buffer.c |   49 -
 linux-2.6.12-ming/fs/ext2/inode.c |   15 -
 linux-2.6.12-ming/fs/ext3/inode.c |   15 +
 linux-2.6.12-ming/fs/ext3/super.c |3 
 linux-2.6.12-ming/fs/hfs/inode.c  |2 
 linux-2.6.12-ming/fs/hfsplus/inode.c  |2 
 linux-2.6.12-ming/fs/jfs/inode.c  |   24 ++
 linux-2.6.12-ming/fs/mpage.c  |  214 ++
 linux-2.6.12-ming/include/linux/buffer_head.h |4 
 linux-2.6.12-ming/include/linux/fs.h  |2 
 linux-2.6.12-ming/include/linux/mpage.h   |   11 -
 linux-2.6.12-ming/include/linux/pagemap.h |3 
 linux-2.6.12-ming/include/linux/pagevec.h |3 
 linux-2.6.12-ming/include/linux/radix-tree.h  |   14 +
 linux-2.6.12-ming/lib/radix-tree.c|   25 ++-
 linux-2.6.12-ming/mm/filemap.c|9 -
 linux-2.6.12-ming/mm/swap.c   |   11 +
 17 files changed, 270 insertions(+), 136 deletions(-)

diff -puN fs/buffer.c~mpage_writepages_getblocks fs/buffer.c
--- linux-2.6.12/fs/buffer.c~mpage_writepages_getblocks 2005-07-15 00:11:01.0 -0700
+++ linux-2.6.12-ming/fs/buffer.c   2005-07-15 00:11:01.0 -0700
@@ -2509,53 +2509,10 @@ EXPORT_SYMBOL(nobh_commit_write);
  * that it tries to operate without attaching bufferheads to
  * the page.
  */
-int nobh_writepage(struct page *page, get_block_t *get_block,
-   struct writeback_control *wbc)
+int nobh_writepage(struct page *page, get_blocks_t *get_blocks,
+   struct writeback_control *wbc, writepage_t bh_writepage_fn)
 {
-   struct inode * const inode = page->mapping->host;
-   loff_t i_size = i_size_read(inode);
-   const pgoff_t end_index = i_size >> PAGE_CACHE_SHIFT;
-   unsigned offset;
-   void *kaddr;
-   int ret;
-
-   /* Is the page fully inside i_size? */
-   if (page->index < end_index)
-   goto out;
-
-   /* Is the page fully outside i_size? (truncate in progress) */
-   offset = i_size & (PAGE_CACHE_SIZE-1);
-   if (page->index >= end_index+1 || !offset) {
-   /*
-* The page may have dirty, unmapped buffers.  For example,
-* they may have been added in ext3_writepage().  Make them
-* freeable here, so the page does not leak.
-*/
-#if 0
-   /* Not really sure about this  - do we need this ? */
-   if (page->mapping->a_ops->invalidatepage)
-   page->mapping->a_ops->invalidatepage(page, offset);
-#endif
-   unlock_page(page);
-   return 0; /* don't care */
-   }
-
-   /*
-* The page straddles i_size.  It must be zeroed out on each and every
-* writepage invocation because it may be mmapped.  "A file is mapped
-* in multiples of the page size.  For a file that is not a multiple of
-* the  page size, the remaining memory is zeroed when mapped, and
-* writes to that region are not written out to the file."
-*/
-   kaddr = kmap_atomic(page, KM_USER0);
-   memset(kaddr + offset, 0, PAGE_CACHE_SIZE - offset);
-   flush_dcache_page(page);
-   kunmap_atomic(kaddr, KM_USER0);
-out:
-   ret = mpage_writepage(page, get_block, wbc);
-   if (ret == -EAGAIN)
-   ret = __block_write_full_page(inode, page, get_block, wbc);
-   return ret;
+   return mpage_writepage(page, get_blocks, wbc, bh_writepage_fn);
 }
 EXPORT_SYMBOL(nobh_writepage);
 
diff -puN fs/ext2/inode.c~mpage_writepages_getblocks fs/ext2/inode.c
--- linux-2.6.12/fs/ext2/inode.c~mpage_writepages_getblocks 2005-07-15 00:11:01.0 -0700
+++ linux-2.6.12-ming/fs/ext2/inode.c   2005-07-15 00:11:01.0 -0700
@@ -650,12 +650,6 @@ ext2_nobh_prepare_write(struct file *fil
return nobh_prepare_write(page,from,to,ext2_get_block);
 }
 
-static int ext2_nobh_writepage(struct page *page,
-   struct writeback_control *wbc)
-{
-   return nobh_writepage(page, ext2_get_block, wbc);
-}
-
 static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
 {
return generic_block_bmap(mapping,block,ext2_get_block);
@@ -673,6 +667,12 @@ ext2_get_blocks(struct inode *inode, sec
return ret;
 }
 
+static int ext2_nobh_writepage(struct page *page,
+   struct writeback_control *wbc)
+{
+   return nobh_writepage(page, ext2_get_blocks, wbc, ext2_writepage);
+}
+
 static ssize_t
 ext2_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
loff_t offset, unsigned long nr_segs)
@@ -687,7 +687,8 @@ ext2_direct_IO(int rw, struct kiocb *ioc
 static int
 ext2_writepages(struct address_space *mapping, struct writeback_control

[RFC] [PATCH 4/4]add ext3 writeback writepages

2005-07-17 Thread Mingming Cao
Support multiple block allocation for ext3 writeback mode through writepages().


---

 linux-2.6.12-ming/fs/ext3/inode.c   |  131 
 linux-2.6.12-ming/fs/mpage.c|8 +
 linux-2.6.12-ming/include/linux/mpage.h |   17 
 3 files changed, 153 insertions(+), 3 deletions(-)

diff -puN fs/ext3/inode.c~writepages fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~writepages 2005-07-17 17:11:43.239274864 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c   2005-07-17 17:11:43.259271824 -0700
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "xattr.h"
 #include "acl.h"
 
@@ -1719,6 +1720,135 @@ out_fail:
return ret;
 }
 
+static int
+ext3_writeback_writepages(struct address_space *mapping,
+   struct writeback_control *wbc)
+{
+   struct inode *inode = mapping->host;
+   const unsigned blkbits = inode->i_blkbits;
+   int err = 0;
+   int ret = 0;
+   int done = 0;
+   unsigned int max_pages_to_cluster = 0;
+   struct pagevec pvec;
+   int nr_pages;
+   pgoff_t index;
+   pgoff_t end = -1;   /* Inclusive */
+   int scanned = 0;
+   int is_range = 0;
+   struct page *page;
+   struct mpageio mio = {
+   .bio = NULL
+   };
+
+   pagevec_init(&pvec, 0);
+   if (wbc->sync_mode == WB_SYNC_NONE) {
+   index = mapping->writeback_index; /* Start from prev offset */
+   } else {
+   index = 0;/* whole-file sweep */
+   scanned = 1;
+   }
+   if (wbc->start || wbc->end) {
+   index = wbc->start >> PAGE_CACHE_SHIFT;
+   end = wbc->end >> PAGE_CACHE_SHIFT;
+   is_range = 1;
+   scanned = 1;
+   }
+   max_pages_to_cluster = min(EXT3_MAX_TRANS_DATA, (pgoff_t)PAGEVEC_SIZE);
+
+retry:
+   down_read(&inode->i_alloc_sem);
+   while (!done && (index <= end) &&
+   (nr_pages = pagevec_contig_lookup_tag(&pvec, mapping,
+   &index, PAGECACHE_TAG_DIRTY,
+   min(end - index, max_pages_to_cluster-1) + 1))) {
+   unsigned i;
+
+   scanned = 1;
+   for (i = 0; i < nr_pages; i++) {
+   page = pvec.pages[i];
+
+   lock_page(page);
+
+   if (unlikely(page->mapping != mapping)) {
+   unlock_page(page);
+   break;
+   }
+
+   if (unlikely(is_range) && page->index > end) {
+   unlock_page(page);
+   break;
+   }
+
+   if (wbc->sync_mode != WB_SYNC_NONE)
+   wait_on_page_writeback(page);
+
+   if (PageWriteback(page) ||
+   !clear_page_dirty_for_io(page)) {
+   unlock_page(page);
+   break;
+   }
+   }
+
+   if (i) {
+   unsigned j;
+   handle_t *handle;
+
+   page = pvec.pages[i-1];
+   index = page->index + 1;
+   mio.final_block_in_request =
+   min(index, end) << (PAGE_CACHE_SHIFT - blkbits);
+
+   handle = ext3_journal_start(inode,
+   i + ext3_writepage_trans_blocks(inode));
+
+   if (IS_ERR(handle)) {
+   err = PTR_ERR(handle);
+   done = 1;
+   }
+   for (j = 0; j < i; j++) {
+   page = pvec.pages[j];
+   if (!done) {
+   ret = __mpage_writepage(&mio, page,
+   ext3_writepages_get_blocks, wbc,
+   ext3_writeback_writepage_helper);
+   if (ret || (--(wbc->nr_to_write) <= 0))
+   done = 1;
+   } else {
+   redirty_page_for_writepage(wbc, page);
+   unlock_page(page);
+   }
+   }
+   if (!err && mio.bio)
+   mio.bio = mpage_bio_submit(WRITE, mio.bio);
+   if (!err)
+   err = ext3_journal_stop(handle);
+   if (!ret) {
+   ret = err;
+   if (ret)
+   done = 1;
+   }
+  

[RFC] [PATCH 2/4]delayed allocation for ext3

2005-07-17 Thread Mingming Cao
Here is the updated patch from Badari for delayed allocation for ext3.
Delayed allocation defers block allocation from prepare-write time to
page writeout time. 
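
A toy model may help picture the change. This is a sketch under loose
assumptions, not the ext3 code path: struct toy_file, the bump
allocator in next_free, and the toy_* functions are all hypothetical.
Writing only marks a page dirty; blocks are assigned later, when all
dirty pages are written out together.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGES    8
#define TOY_UNMAPPED (-1L)

/* Hypothetical per-file state: one disk block per page. */
struct toy_file {
	long block[TOY_PAGES];	/* disk block, or TOY_UNMAPPED */
	int dirty[TOY_PAGES];
	long next_free;		/* trivial bump allocator */
};

static void toy_file_init(struct toy_file *f, long first_free)
{
	size_t i;

	for (i = 0; i < TOY_PAGES; i++) {
		f->block[i] = TOY_UNMAPPED;
		f->dirty[i] = 0;
	}
	f->next_free = first_free;
}

/* Write time: mark the page dirty, allocate nothing. */
static void toy_write(struct toy_file *f, size_t page)
{
	assert(page < TOY_PAGES);
	f->dirty[page] = 1;
}

/*
 * Writeout time: allocate blocks for every dirty, unmapped page in
 * one pass. Returns the number of blocks newly allocated.
 */
static size_t toy_writeout(struct toy_file *f)
{
	size_t i, allocated = 0;

	for (i = 0; i < TOY_PAGES; i++) {
		if (!f->dirty[i])
			continue;
		if (f->block[i] == TOY_UNMAPPED) {
			f->block[i] = f->next_free++;
			allocated++;
		}
		f->dirty[i] = 0;
	}
	return allocated;
}
```

Because the allocator sees all pending pages at once at writeout time,
it is in a position to hand out adjacent blocks, which is what makes
delayed allocation a natural companion to multiple block allocation.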


---

 linux-2.6.12-ming/fs/buffer.c |   13 +
 linux-2.6.12-ming/fs/ext3/inode.c |6 ++
 linux-2.6.12-ming/fs/ext3/super.c |   14 +-
 linux-2.6.12-ming/include/linux/ext3_fs.h |1 +
 4 files changed, 29 insertions(+), 5 deletions(-)

diff -puN include/linux/ext3_fs.h~ext3-delalloc include/linux/ext3_fs.h
--- linux-2.6.12/include/linux/ext3_fs.h~ext3-delalloc  2005-07-14 23:15:34.861753240 -0700
+++ linux-2.6.12-ming/include/linux/ext3_fs.h   2005-07-14 23:15:34.881750200 -0700
@@ -373,6 +373,7 @@ struct ext3_inode {
 #define EXT3_MOUNT_BARRIER 0x2 /* Use block barriers */
 #define EXT3_MOUNT_NOBH0x4 /* No bufferheads */
 #define EXT3_MOUNT_QUOTA   0x8 /* Some quota option set */
+ #define EXT3_MOUNT_DELAYED_ALLOC  0xC /* Delayed Allocation */
 
 /* Compatibility, for having both ext2_fs.h and ext3_fs.h included at once */
 #ifndef _LINUX_EXT2_FS_H
diff -puN fs/ext3/inode.c~ext3-delalloc fs/ext3/inode.c
--- linux-2.6.12/fs/ext3/inode.c~ext3-delalloc  2005-07-14 23:15:34.866752480 -0700
+++ linux-2.6.12-ming/fs/ext3/inode.c   2005-07-14 23:15:34.889748984 -0700
@@ -1340,6 +1340,9 @@ static int ext3_prepare_write(struct fil
handle_t *handle;
int retries = 0;
 
+
+   if (test_opt(inode->i_sb, DELAYED_ALLOC))
+   return __nobh_prepare_write(page, from, to, ext3_get_block, 0);
 retry:
handle = ext3_journal_start(inode, needed_blocks);
if (IS_ERR(handle)) {
@@ -1439,6 +1442,9 @@ static int ext3_writeback_commit_write(s
else
ret = generic_commit_write(file, page, from, to);
 
+   if (test_opt(inode->i_sb, DELAYED_ALLOC))
+   return ret;
+
ret2 = ext3_journal_stop(handle);
if (!ret)
ret = ret2;
diff -puN fs/ext3/super.c~ext3-delalloc fs/ext3/super.c
--- linux-2.6.12/fs/ext3/super.c~ext3-delalloc  2005-07-14 23:15:34.870751872 -0700
+++ linux-2.6.12-ming/fs/ext3/super.c   2005-07-14 23:15:34.896747920 -0700
@@ -585,7 +585,7 @@ enum {
Opt_resgid, Opt_resuid, Opt_sb, Opt_err_cont, Opt_err_panic, Opt_err_ro,
Opt_nouid32, Opt_check, Opt_nocheck, Opt_debug, Opt_oldalloc, Opt_orlov,
Opt_user_xattr, Opt_nouser_xattr, Opt_acl, Opt_noacl,
-   Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh,
+   Opt_reservation, Opt_noreservation, Opt_noload, Opt_nobh, Opt_delayed_alloc,
Opt_commit, Opt_journal_update, Opt_journal_inum,
Opt_abort, Opt_data_journal, Opt_data_ordered, Opt_data_writeback,
Opt_usrjquota, Opt_grpjquota, Opt_offusrjquota, Opt_offgrpjquota,
@@ -621,6 +621,7 @@ static match_table_t tokens = {
{Opt_noreservation, "noreservation"},
{Opt_noload, "noload"},
{Opt_nobh, "nobh"},
+   {Opt_delayed_alloc, "delalloc"},
{Opt_commit, "commit=%u"},
{Opt_journal_update, "journal=update"},
{Opt_journal_inum, "journal=%u"},
@@ -954,6 +955,10 @@ clear_qf_name:
case Opt_nobh:
set_opt(sbi->s_mount_opt, NOBH);
break;
+   case Opt_delayed_alloc:
+   set_opt(sbi->s_mount_opt, NOBH);
+   set_opt(sbi->s_mount_opt, DELAYED_ALLOC);
+   break;
default:
printk (KERN_ERR
"EXT3-fs: Unrecognized mount option \"%s\" "
@@ -1612,6 +1617,13 @@ static int ext3_fill_super (struct super
clear_opt(sbi->s_mount_opt, NOBH);
}
}
+   if (test_opt(sb, DELAYED_ALLOC)) {
+   if (!(test_opt(sb, DATA_FLAGS) == EXT3_MOUNT_WRITEBACK_DATA)) {
+   printk(KERN_WARNING "EXT3-fs: Ignoring delalloc option - "
+   "it is supported only with writeback mode\n");
+   clear_opt(sbi->s_mount_opt, DELAYED_ALLOC);
+   }
+   }
/*
 * The journal_load will have done any necessary log recovery,
 * so we can safely mount the rest of the filesystem now.
diff -puN fs/buffer.c~ext3-delalloc fs/buffer.c
--- linux-2.6.12/fs/buffer.c~ext3-delalloc  2005-07-14 23:15:34.875751112 -0700
+++ linux-2.6.12-ming/fs/buffer.c   2005-07-14 23:15:34.903746856 -0700
@@ -2337,8 +2337,8 @@ static void end_buffer_read_nobh(struct 
  * On entry, the page is fully not uptodate.
  * On exit the page is fully uptodate in the areas outside (from,to)
  */
-int nobh_prepare_write(struct page *page, unsigned from, unsigned to,
-   get_block_t *get_block)
+int __nobh_prepare_write(struct page *page, unsigned from, unsigned to,
+   get_block_t *get_block, int create)
 {
   

[RFC] [PATCH 0/4]Multiple block allocation and delayed allocation for ext3

2005-07-17 Thread Mingming Cao
Hi All, 

Here are the updated patches to support multiple block allocation and
delayed allocation for ext3 done by me, Badari and Suparna.

[PATCH 1/4] -- multiple block allocation for current ext3
(ext3_get_blocks()).

[PATCH 2/4] -- adding delayed allocation for writeback mode.

[PATCH 3/4] -- generic support for clustering pages together in
mpage_writepages() to make use of getblocks().

[PATCH 4/4] -- support multiple block allocation for ext3 writeback mode
through writepages().


I have done initial testing with dbench and tiobench on a 2.6.11 version
of this patch set. Dbench 8-thread throughput increased by 20% with this
patch set.

dbench comparison (ext3-dm represents ext3 + this patch set):
http://www.sudhaa.com/~ram/ols2005presentation/dbench.jpg
tiobench comparison:
http://www.sudhaa.com/~ram/ols2005presentation/tio_seq_write.jpg


Todo:
- bmap() support for delayed allocation
- page reserve flag to indicate the delayed allocation
- ordered mode support for delayed allocation
- "bh" support to enable blocksize = 1k/2k filesystems



Cheers,

Mingming

