Re: fishy ->put_inode usage in ntfs

2005-02-10 Thread Christoph Hellwig
On Thu, Oct 14, 2004 at 02:26:45PM +0100, Anton Altaparmakov wrote:
  I don't like filesystems doing things like this in ->put_inode at all,
  and indeed the plan is to get rid of ->put_inode completely.  Why do
  you need to hold an additional reference anyway?  What's so special
  about the relation of these two inodes?
 
 The bmp_ino is a virtual inode.  It doesn't exist on disk as an inode. 
 It is an NTFS attribute of the base inode.  It cannot exist without the
 base inode there.  You could neither read from nor write to this inode
 without its base inode being there and you couldn't even clear_inode()
 this inode without the base inode being there.  The reference is
 essential, I am afraid.
 
 If ->put_inode is removed then I will have to switch to using
 ntfs_attr_iget() each time or I will have to attach the inode in some
 other much hackier way that doesn't use the i_count and uses my ntfs
 private counter instead.

Coming back to this issue.  Why do you need to refcount bmp_ino at all?
Can someone ever grab a reference separate from its master inode?



Re: fishy ->put_inode usage in ntfs

2005-02-10 Thread Christoph Hellwig
On Thu, Feb 10, 2005 at 02:48:26PM +0000, Anton Altaparmakov wrote:
 If the igrab() were not done, it would be possible for clear_inode to be
 called on the 'parent' inode whilst at the same time one or more attr
 inodes (belonging to this 'parent') are in use and Bad Things(TM) would
 happen...

What bad thing specifically?  If there's shared information we should
probably refcount them separately.



Re: fishy ->put_inode usage in ntfs

2005-02-10 Thread Anton Altaparmakov
On Thu, 2005-02-10 at 14:48 +0000, Anton Altaparmakov wrote:
 On Thu, 2005-02-10 at 15:42 +0100, Christoph Hellwig wrote:
  On Thu, Feb 10, 2005 at 02:40:39PM +0000, Anton Altaparmakov wrote:
   I am not sure what you mean.  The VFS layer does reference counting on
   inodes.  I have no choice in the matter.
   
Can someone ever grab a reference separate from its master inode?
   
   Again, not sure what you mean.  Could you elaborate?
  
  ntfs_read_locked_attr_inode() does igrab on the 'parent' inode
  currently.  What do you need this for exactly - the attr inode
  goes away anyway when clear_inode is called on that 'parent' inode
  (in my scheme).
 
 If the igrab() were not done, it would be possible for clear_inode to be
 called on the 'parent' inode whilst at the same time one or more attr
 inodes (belonging to this 'parent') are in use and Bad Things(TM) would
 happen...

The igrab() effectively guarantees that iput() is called on all attr
inodes before clear_inode on the 'parent' can be invoked.

Best regards,

Anton
-- 
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/  http://www-stu.christs.cam.ac.uk/~aia21/



Re: fishy ->put_inode usage in ntfs

2005-02-10 Thread Christoph Hellwig
On Thu, Feb 10, 2005 at 02:50:02PM +0000, Anton Altaparmakov wrote:
  If the igrab() were not done, it would be possible for clear_inode to be
  called on the 'parent' inode whilst at the same time one or more attr
  inodes (belonging to this 'parent') are in use and Bad Things(TM) would
  happen...
 
 The igrab() effectively guarantees that iput() is called on all attr
 inodes before clear_inode on the 'parent' can be invoked.

Yes, but why exactly is this important?  It looks like you're abusing
the refcount on the 'parent' inode for some shared data?



Re: fishy ->put_inode usage in ntfs

2005-02-10 Thread Anton Altaparmakov
On Thu, 2005-02-10 at 15:50 +0100, Christoph Hellwig wrote:
 On Thu, Feb 10, 2005 at 02:48:26PM +0000, Anton Altaparmakov wrote:
  If the igrab() were not done, it would be possible for clear_inode to be
  called on the 'parent' inode whilst at the same time one or more attr
  inodes (belonging to this 'parent') are in use and Bad Things(TM) would
  happen...
 
 What bad thing specifically?  If there's shared information we should
 probably refcount them separately.

Each attr inode stores a pointer to its parent inode in NTFS_I(struct
inode *vi)->ext.base_ntfs_ino.  This pointer would point to random
memory if clear_inode is called on the parent whilst the attr inode is
still in use.
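
A minimal sketch of the lifetime rule being described, using the NTFS
helpers NTFS_I()/VFS_I() and the VFS igrab()/iput() pair; the helper names
below are hypothetical and this is illustrative, not a quote of
fs/ntfs/inode.c:

/* Pin the base inode so clear_inode() cannot run on it while the attr
 * inode still points at it through ext.base_ntfs_ino. */
static int attr_inode_attach_base(struct inode *base_vi,
		struct inode *attr_vi)
{
	/* igrab() returns NULL if the base inode is already being freed. */
	if (!igrab(base_vi))
		return -ENOENT;
	NTFS_I(attr_vi)->ext.base_ntfs_ino = NTFS_I(base_vi);
	return 0;
}

/* At attr inode teardown (clear_inode time), drop the pin; only now can
 * clear_inode run on the base inode. */
static void attr_inode_detach_base(struct inode *attr_vi)
{
	ntfs_inode *base_ni = NTFS_I(attr_vi)->ext.base_ntfs_ino;

	NTFS_I(attr_vi)->ext.base_ntfs_ino = NULL;
	iput(VFS_I(base_ni));
}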

Best regards,

Anton
-- 
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer / IRC: #ntfs on irc.freenode.net
WWW: http://linux-ntfs.sf.net/  http://www-stu.christs.cam.ac.uk/~aia21/



Re: ext3 writepages ?

2005-02-10 Thread Badari Pulavarty
On Wed, 2005-02-09 at 18:05, Bryan Henderson wrote:
 I see much larger IO chunks and better throughput.  So, I guess it's
 worth doing it.
 
 I hate to see something like this go ahead based on empirical results 
 without theory.  It might make things worse somewhere else.
 
 Do you have an explanation for why the IO chunks are larger?  Is the I/O 
 scheduler not building large I/Os out of small requests?  Is the queue 
 running dry while the device is actually busy?
 

Bryan,

I would like to find out what theory you are looking for.

Don't you think filesystems submitting the biggest chunks of IO
possible is better than submitting 1k-4k chunks and hoping that
the IO schedulers do a perfect job?

BTW, writepages() is already used by other filesystems like JFS.

We all learned through the 2.4 RAW code about the overhead of doing
512-byte IO and making the elevator merge all the pieces together.
That's one reason why the 2.6 DIO/RAW code was completely rewritten
from scratch to submit the biggest possible IO chunks.

Well, I agree that we should have a theory behind the results.
We are just playing with prototypes for now.
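
To make the contrast concrete, here is a minimal, hypothetical sketch in
the 2.6 bio style: pack many contiguous pages into a single bio and submit
it once, instead of one submit_bio() per 1k-4k chunk.  The helper name and
the end_io stub are illustrative, not taken from any patch in this thread:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* 2.6-era bi_end_io convention: return 1 until the whole bio completes. */
static int one_bio_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	if (bio->bi_size)
		return 1;
	bio_put(bio);
	return 0;
}

static int submit_pages_as_one_bio(struct block_device *bdev,
		sector_t sector, struct page **pages, int nr_pages)
{
	struct bio *bio = bio_alloc(GFP_NOIO, nr_pages);
	int i;

	if (!bio)
		return -ENOMEM;
	bio->bi_bdev = bdev;
	bio->bi_sector = sector;
	bio->bi_end_io = one_bio_end_io;
	for (i = 0; i < nr_pages; i++) {
		/* bio_add_page() returns 0 once the bio or device is full. */
		if (!bio_add_page(bio, pages[i], PAGE_SIZE, 0))
			break;
	}
	submit_bio(WRITE, bio);	/* one large request down to the elevator */
	return i;		/* number of pages actually queued */
}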

Thanks,
Badari



[PATCH] block new writers on frozen filesystems

2005-02-10 Thread Christoph Hellwig
When the lockfs patches went in, an important bit got lost: the call in
generic_file_write to put newly incoming writers to sleep while a
filesystem is frozen.  Nathan added the call back in the now-separate
XFS write patch, and the patch for the generic code is below:


Index: mm/filemap.c
===
RCS file: /cvs/linux-2.6-xfs/mm/filemap.c,v
retrieving revision 1.14
diff -u -p -r1.14 filemap.c
--- mm/filemap.c	5 Jan 2005 14:17:31 -0000	1.14
+++ mm/filemap.c	4 Feb 2005 21:35:53 -0000
@@ -2046,6 +2046,8 @@ __generic_file_aio_write_nolock(struct k
 	count = ocount;
 	pos = *ppos;
 
+	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);
+
 	/* We can write back this queue in page reclaim */
 	current->backing_dev_info = mapping->backing_dev_info;
 	written = 0;
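
For context, vfs_check_frozen in kernels of this era was approximately the
following macro from include/linux/fs.h (quoted from memory, so treat it
as a sketch): writers sleep on the superblock's wait queue until the
freeze level drops below the one they care about.

#define vfs_check_frozen(sb, level) \
	wait_event((sb)->s_wait_unfrozen, ((sb)->s_frozen < (level)))

freeze_bdev() raises sb->s_frozen (SB_FREEZE_WRITE, then SB_FREEZE_TRANS
once the filesystem is quiesced) and thaw_bdev() clears it and wakes
s_wait_unfrozen, letting the blocked writers proceed.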


Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
I am inferring this using iostat, which shows that average device
utilization fluctuates between 83 and 99 percent and the average
request size is around 650 sectors (going to the device) without
writepages. 

With writepages, device utilization never drops below 95 percent and
is usually about 98 percent utilized, and the average request size to
the device is around 1000 sectors.

Well, that blows away the only two ways I know that this effect can happen.
 The first has to do with certain code being more efficient than other 
code at assembling I/Os, but the fact that the CPU utilization is the same 
in both cases pretty much eliminates that.  The other is where the 
interactivity of the I/O generator doesn't match the buffering in the 
device, so that the device ends up 100% busy processing small I/Os that
were sent to it because it said all the while that it needed more work. 
But in the small-I/O case, we don't see a 100% busy device.

So why would the device be up to 17% idle, since the writepages case makes 
it apparent that the I/O generator is capable of generating much more 
work?  Is there some queue plugging (I/O scheduler delays sending I/O to 
the device even though the device is idle) going on?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems



Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
Don't you think filesystems submitting the biggest chunks of IO
possible is better than submitting 1k-4k chunks and hoping that
the IO schedulers do a perfect job?

No, I don't see why it would be better.  In fact, intuitively, I think the I/O
scheduler, being closer to the device, should do a better job of deciding 
in what packages I/O should go to the device.  After all, there exist 
block devices that don't process big chunks faster than small ones.  But 

So this starts to look like something where you withhold data from the I/O 
scheduler in order to prevent it from scheduling the I/O wrongly because 
you (the pager/filesystem driver) know better.  That shouldn't be the 
architecture.

So I'd still like to see a theory that explains why submitting the
I/O a little at a time (i.e. including the submit_bio() in the loop that
assembles the I/O) causes the device to be idle more.

We all learned through the 2.4 RAW code about the overhead of doing
512-byte IO and making the elevator merge all the pieces together.

That was CPU time, right?  In the present case, the numbers say it takes 
the same amount of CPU time to assemble the I/O above the I/O scheduler as 
inside it.

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: ext3 writepages ?

2005-02-10 Thread Badari Pulavarty
On Thu, 2005-02-10 at 10:00, Bryan Henderson wrote:
 Don't you think filesystems submitting the biggest chunks of IO
 possible is better than submitting 1k-4k chunks and hoping that
 the IO schedulers do a perfect job?
 
 No, I don't see why it would be better.  In fact, intuitively, I think the I/O
 scheduler, being closer to the device, should do a better job of deciding 
 in what packages I/O should go to the device.  After all, there exist 
 block devices that don't process big chunks faster than small ones.  But 
 
 So this starts to look like something where you withhold data from the I/O 
 scheduler in order to prevent it from scheduling the I/O wrongly because 
 you (the pager/filesystem driver) know better.  That shouldn't be the 
 architecture.
 
 So I'd still like to see a theory that explains why submitting the
 I/O a little at a time (i.e. including the submit_bio() in the loop that
 assembles the I/O) causes the device to be idle more.
 
 We all learned through the 2.4 RAW code about the overhead of doing
 512-byte IO and making the elevator merge all the pieces together.
 
 That was CPU time, right?  In the present case, the numbers say it takes 
 the same amount of CPU time to assemble the I/O above the I/O scheduler as 
 inside it.

One clear distinction between submitting smaller chunks vs larger
ones is the number of callbacks we get and the processing we need to
do.

I don't think we have enough numbers here to get to the bottom of this.
CPU utilization remaining the same in both cases doesn't mean that the
test took exactly the same amount of time.  I don't even think that we
are doing a fixed number of IOs.  It's possible that by doing larger
IOs we save CPU and use that CPU to push more data.



Thanks,
Badari



Re: [PATCH] Allow kernel-only mount interfaces...

2005-02-10 Thread Andreas Dilger
On Feb 10, 2005  13:41 -0500, Trond Myklebust wrote:
 +struct vfsmount *
 +do_kern_mount(const char *fstype, int flags, const char *name, void *data)
 +{
 + struct file_system_type *type = get_fs_type(fstype);
 + struct vfsmount *mnt = vfs_kern_mount(type, flags, name, data);
 + put_filesystem(type);
 + return mnt;
 +}

This will OOPS if fstype is bad, since you unconditionally put_filesystem()
on a possible PTR_ERR() type.  You need an extra

if (!IS_ERR(type))
put_filesystem(type);

Cheers, Andreas
--
Andreas Dilger
http://sourceforge.net/projects/ext2resize/
http://members.shaw.ca/adilger/ http://members.shaw.ca/golinux/





Re: [PATCH] Allow kernel-only mount interfaces...

2005-02-10 Thread Trond Myklebust
On Thu, 2005-02-10 at 12:01 -0700, Andreas Dilger wrote:

 This will OOPS if fstype is bad, since you unconditionally put_filesystem()
 on a possible PTR_ERR() type.  You need an extra
 
   if (!IS_ERR(type))
   put_filesystem(type);
 

Agreed. That was not a final patch, but just a first untested draft in
order to test the waters.
I mainly want to hear whether or not anyone has major objections
(Al?) against the new function itself.

Here's an update, though ;-)

Cheers,
  Trond

VFS: Add GPL-exported function vfs_kern_mount()

 do_kern_mount() does not allow the kernel to use private mount interfaces
 without exposing the same interfaces to userland. The problem is that the
 filesystem is referenced by name, thus meaning that it and its mount
 interface must be registered in the global filesystem list.

 vfs_kern_mount() passes the struct file_system_type as an explicit
 parameter in order to overcome this limitation.

 Signed-off-by: Trond Myklebust [EMAIL PROTECTED]
 super.c |   22 +++---
 1 files changed, 15 insertions(+), 7 deletions(-)

Index: linux-2.6.11-rc3/fs/super.c
===
--- linux-2.6.11-rc3.orig/fs/super.c
+++ linux-2.6.11-rc3/fs/super.c
@@ -794,17 +794,13 @@ struct super_block *get_sb_single(struct
 EXPORT_SYMBOL(get_sb_single);
 
 struct vfsmount *
-do_kern_mount(const char *fstype, int flags, const char *name, void *data)
+vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
 {
-   struct file_system_type *type = get_fs_type(fstype);
struct super_block *sb = ERR_PTR(-ENOMEM);
struct vfsmount *mnt;
int error;
char *secdata = NULL;
 
-   if (!type)
-   return ERR_PTR(-ENODEV);
-
mnt = alloc_vfsmnt(name);
if (!mnt)
goto out;
@@ -835,7 +831,6 @@ do_kern_mount(const char *fstype, int fl
mnt->mnt_parent = mnt;
mnt->mnt_namespace = current->namespace;
up_write(&sb->s_umount);
-   put_filesystem(type);
return mnt;
 out_sb:
up_write(&sb->s_umount);
@@ -846,10 +841,23 @@ out_free_secdata:
 out_mnt:
free_vfsmnt(mnt);
 out:
-   put_filesystem(type);
return (struct vfsmount *)sb;
 }
 
+EXPORT_SYMBOL_GPL(vfs_kern_mount);
+
+struct vfsmount *
+do_kern_mount(const char *fstype, int flags, const char *name, void *data)
+{
+   struct file_system_type *type = get_fs_type(fstype);
+   struct vfsmount *mnt;
+   if (!type)
+   return ERR_PTR(-ENODEV);
+   mnt = vfs_kern_mount(type, flags, name, data);
+   put_filesystem(type);
+   return mnt;
+}
+
 EXPORT_SYMBOL_GPL(do_kern_mount);
 
 struct vfsmount *kern_mount(struct file_system_type *type)


-- 
Trond Myklebust [EMAIL PROTECTED]



Re: journal start/stop in ext3_writeback_writepage()

2005-02-10 Thread Andrew Morton
Badari Pulavarty [EMAIL PROTECTED] wrote:

 But I still don't understand why this can't happen
 through the original code ..
 
   journal_destroy()
   iput(journal inode)
   do_writepages()
   generic_writepages()
   ext3_writeback_writepage()
   journal_start()
 
  what am i missing ?

presumably there are never any dirty pages or inodes when we run
journal_destroy().



Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
It's possible that by doing larger
IOs we save CPU and use that CPU to push more data.

This is absolutely right; my mistake -- the relevant number is CPU seconds 
per megabyte moved, not CPU seconds per elapsed second.
But I don't think we're close enough to 100% CPU utilization that this 
explains much.

In fact, the curious thing here is that neither the disk nor the CPU seems 
to be a bottleneck in the slow case.  Maybe there's some serialization I'm
not seeing that reduces the parallelism between I/O and execution.  Is this
a single thread doing writes and syncs to a single file?

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: ext3 writepages ?

2005-02-10 Thread Sonny Rao
On Thu, Feb 10, 2005 at 12:30:23PM -0800, Bryan Henderson wrote:
 It's possible that by doing larger
 IOs we save CPU and use that CPU to push more data.
 
 This is absolutely right; my mistake -- the relevant number is CPU seconds 
 per megabyte moved, not CPU seconds per elapsed second.
 But I don't think we're close enough to 100% CPU utilization that this 
 explains much.
 
 In fact, the curious thing here is that neither the disk nor the CPU seems 
 to be a bottleneck in the slow case.  Maybe there's some serialization I'm
 not seeing that reduces the parallelism between I/O and execution.  Is this
 a single thread doing writes and syncs to a single file?

From what I've seen, without writepages, the application thread itself
tends to do the writing by falling into balance_dirty_pages() during
its write call, while in the writepages case, a pdflush thread seems
to do more of the writeback.  This also depends somewhat on
processor speed (and number) and the amount of RAM.

To try and isolate this more, I've limited RAM (1GB) and the number of
CPUs (1) on my testing setup.

So yes, there could be better parallelism in the writepages case, but
again, this behavior could be a symptom and not a cause.  I'm not sure
how to figure that out -- any suggestions?

Sonny


Re: [PATCH] block new writers on frozen filesystems

2005-02-10 Thread Andrew Morton
Christoph Hellwig [EMAIL PROTECTED] wrote:

 When the lockfs patches went in, an important bit got lost: the call in
 generic_file_write to put newly incoming writers to sleep while a
 filesystem is frozen.  Nathan added the call back in the now-separate
 XFS write patch, and the patch for the generic code is below:
 
 
 Index: mm/filemap.c
 ===
 RCS file: /cvs/linux-2.6-xfs/mm/filemap.c,v
 retrieving revision 1.14
 diff -u -p -r1.14 filemap.c
 --- mm/filemap.c	5 Jan 2005 14:17:31 -0000	1.14
 +++ mm/filemap.c	4 Feb 2005 21:35:53 -0000
 @@ -2046,6 +2046,8 @@ __generic_file_aio_write_nolock(struct k
  	count = ocount;
  	pos = *ppos;
  
 +	vfs_check_frozen(inode->i_sb, SB_FREEZE_WRITE);

hm, I didn't pay much attention to this stuff.  Shouldn't the direct-io
code be waiting as well?  Are all paths which can write to the bdev supposed
to be blocked?  kjournald?


Re: journal start/stop in ext3_writeback_writepage()

2005-02-10 Thread Stephen C. Tweedie
Hi,

On Thu, 2005-02-10 at 20:21, Andrew Morton wrote:

  But I still don't understand why this can't happen
  through the original code ..

   what am i missing ?
 
 presumably there are never any dirty pages or inodes when we run
 journal_destroy().

I assume so, yes.  If there is no a_ops->writepages(), then we default
to generic_writepages(), which is a no-op if there are no dirty pages.  If
your new ext3-specific writepages code tries to do a journal_start() in
that case, then yes, it is likely to blow up spectacularly during
journal_destroy!

--Stephen



Re: [Ext2-devel] Re: journal start/stop in ext3_writeback_writepage()

2005-02-10 Thread Badari Pulavarty
On Thu, 2005-02-10 at 15:12, Stephen C. Tweedie wrote:
 Hi,
 
 On Thu, 2005-02-10 at 20:21, Andrew Morton wrote:
 
   But I still don't understand why this can't happen
   through the original code ..
 
what am i missing ?
  
  presumably there are never any dirty pages or inodes when we run
  journal_destroy().
 
 I assume so, yes.  If there is no a_ops->writepages(), then we default
 to generic_writepages(), which is a no-op if there are no dirty pages.  If
 your new ext3-specific writepages code tries to do a journal_start() in
 that case, then yes, it is likely to blow up spectacularly during
 journal_destroy!
 
 --Stephen

Yep. I found out the hard way that that's exactly what's happening.
generic_writepages() is clever enough to do nothing if there are no
dirty pages, but I am being stupid in my writepages().

I need to teach writepages() to do nothing when there are no dirty pages.
Is there an easier way, like checking a count somewhere, rather than doing
all the stuff mpage_writepages() does to figure this out, like ..


while (!done && (index <= end) &&
       (nr_pages = pagevec_lookup_tag(&pvec, mapping,
                       &index,
                       PAGECACHE_TAG_DIRTY,
                       min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1)))
...

Thanks,
Badari



Re: ext3 writepages ?

2005-02-10 Thread Bryan Henderson
I went back and looked more closely and see that you did more than add a
->writepages method.  You replaced the ->prepare_write with one that
doesn't involve the buffer cache, right?  And from your answer to Badari's
question about that, I believe you said this is not an integral part of
having ->writepages, but an additional enhancement.  Well, that could
explain a lot.  First of all, there's a significant amount of CPU time
involved in managing buffer heads.  In the profile you posted, it's one of
the differences in CPU time between the writepages and non-writepages
case.  But it also changes the whole way the file cache is managed,
doesn't it?  That might account for the fact that in one case you see
cache cleaning happening via balance_dirty_pages() (i.e. memory fills up),
but in the other it happens via pdflush.  I'm not really up on the buffer
cache; I haven't used it in my own studies for years.

I also saw that while you originally said CPU utilization was 73% in both 
cases, in one of the profiles I add up at least 77% for the writepages 
case, so I'm not sure we're really comparing straight across.

To investigate these effects further, I think you should monitor 
/proc/meminfo.  And/or make more isolated changes to the code.

So yes, there could be better parallelism in the writepages case, but
again this behavior could be a symptom and not a cause,

I'm not really suggesting that there's better parallelism in the 
writepages case.  I'm suggesting that there's poor parallelism (compared 
to what I expect) in both cases, which means that adding CPU time directly 
affects throughput.  If the CPU time were in parallel with the I/O time, 
adding an extra 1.8ms per megabyte to the CPU time (which is what one of 
my calculations from your data gave) wouldn't affect throughput.

But I believe we've at least established doubt that submitting an entire 
file cache in one bio is faster than submitting a bio for each page and 
that smaller I/Os (to the device) cause lower throughput in the 
non-writepages case (it seems more likely that the lower throughput causes 
the smaller I/Os).

--
Bryan Henderson  IBM Almaden Research Center
San Jose CA  Filesystems


Re: [Ext2-devel] Re: journal start/stop in ext3_writeback_writepage()

2005-02-10 Thread Andrew Morton
Badari Pulavarty [EMAIL PROTECTED] wrote:

 I need to teach writepages() to do nothing when there are no dirty pages.
 Is there an easier way, like checking a count somewhere, rather than doing
 all the stuff mpage_writepages() does to figure this out

if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
        return;



[RFC] ext3 writepages for writeback mode

2005-02-10 Thread Badari Pulavarty
Hi,

Here is my first cut at adding writepages() support for
ext3 writeback mode.

I have not done any performance analysis on the patch,
so try it at your own risk.

Please let me know if I am completely off or it's a
stupid idea.

Thanks,
Badari


--- linux-2.6.10.org/fs/ext3/inode.c	2004-12-06 11:45:49.000000000 -0800
+++ linux-2.6.10/fs/ext3/inode.c	2005-02-10 18:14:17.987263744 -0800
@@ -856,6 +856,12 @@
return ret;
 }
 
+static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
+   struct buffer_head *bh, int create)
+{
+   return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
+}
+
 /*
  * `handle' can be NULL if create is zero
  */
@@ -1321,6 +1327,37 @@
return ret;
 }
 
+static int
+ext3_writeback_writepages(struct address_space *mapping, 
+   struct writeback_control *wbc)
+{
+   struct inode *inode = mapping->host;
+   handle_t *handle = NULL;
+   int err, ret = 0;
+
+   if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+   return ret;
+
+   handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
+   if (IS_ERR(handle)) {
+   ret = PTR_ERR(handle);
+   return ret;
+   }
+
+ret = mpage_writepages(mapping, wbc, ext3_writepages_get_block);
+
+   /*
+    * Need to reacquire the handle since ext3_writepages_get_block()
+    * can restart the handle
+    */
+   handle = journal_current_handle();
+
+   err = ext3_journal_stop(handle);
+   if (!ret)
+   ret = err;
+   return ret;
+}
+
 static int ext3_writeback_writepage(struct page *page,
struct writeback_control *wbc)
 {
@@ -1552,6 +1589,7 @@
.readpage   = ext3_readpage,
.readpages  = ext3_readpages,
.writepage  = ext3_writeback_writepage,
+   .writepages = ext3_writeback_writepages,
.sync_page  = block_sync_page,
.prepare_write  = ext3_prepare_write,
.commit_write   = ext3_writeback_commit_write,


Re: [RFC] ext3 writepages for writeback mode

2005-02-10 Thread Andrew Morton
Badari Pulavarty [EMAIL PROTECTED] wrote:

  Here is my first cut at adding writepages() support for
  ext3 writeback mode.

Looks sane from a brief scan.

  I have not done any performance analysis on the patch, 

Please do ;)

  +static int ext3_writepages_get_block(struct inode *inode, sector_t iblock,
  +struct buffer_head *bh, int create)
  +{
  +return ext3_direct_io_get_blocks(inode, iblock, 1, bh, create);
  +}

yup.

  +
   /*
* `handle' can be NULL if create is zero
*/
  @@ -1321,6 +1327,37 @@
   return ret;
   }
   
  +static int
  +ext3_writeback_writepages(struct address_space *mapping, 
  +struct writeback_control *wbc)
  +{
  +struct inode *inode = mapping->host;
  +handle_t *handle = NULL;
  +int err, ret = 0;
  +
  +if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
  +return ret;
  +
  +handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
  +if (IS_ERR(handle)) {
  +ret = PTR_ERR(handle);
  +return ret;
  +}
  +
  +ret = mpage_writepages(mapping, wbc, ext3_writepages_get_block);
  +

Funny whitespace.  What is it with you IBM guys? ;)

  +/*
  + * Need to reacquire the handle since ext3_writepages_get_block()
  + * can restart the handle
  + */
  +handle = journal_current_handle();
  +
  +err = ext3_journal_stop(handle);
  +if (!ret)
  +ret = err;
  +return ret;
  +}


Re: [RFC] ext3 writepages for writeback mode

2005-02-10 Thread Badari Pulavarty
Andrew Morton wrote:
Badari Pulavarty [EMAIL PROTECTED] wrote:
Here is my first cut at adding writepages() support for
ext3 writeback mode.

Looks sane from a brief scan.
Well, not really..
mpage_writepages() could end up calling ext3_writeback_writepage()
in the confused case through
ret = page->mapping->a_ops->writepage(page, wbc);
which ends up doing nothing and leaves the page dirty, since there is
already a journal handle started :(

if (ext3_journal_current_handle())
goto out_fail;
Ideas?
Thanks,
Badari


about commit sector concept in journal

2005-02-10 Thread Somenath Ghosh

I am reading the scheme by which the journal works.

I came to know that after every transaction written to the log file, a 512-byte
sector is written back to the disk.  This is treated as a commit block, and there
is a sequence number in it that matches all the transactions written before it.
But how this is possible I can't understand.

Please tell me about the proper scheme of the commit block.
somenath
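
For reference, in JBD (the journaling layer ext3 uses) the commit block
begins with the same 12-byte header every journal metadata block carries;
roughly, from include/linux/jbd.h of that era (quoted from memory, treat
as a sketch):

/* Magic identifying a JBD journal block; fields are big-endian on disk. */
#define JFS_MAGIC_NUMBER	0xc03b3998U

#define JFS_DESCRIPTOR_BLOCK	1
#define JFS_COMMIT_BLOCK	2

typedef struct journal_header_s {
	__u32 h_magic;		/* JFS_MAGIC_NUMBER */
	__u32 h_blocktype;	/* JFS_COMMIT_BLOCK for a commit block */
	__u32 h_sequence;	/* ID of the transaction being committed */
} journal_header_t;

During recovery, replay applies a transaction's blocks only if a commit
block carrying that transaction's sequence number is found after them in
the log; a transaction with no commit block is treated as incomplete and
discarded.  That is the sense in which the commit block "matches" the
preceding transaction.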
