[patch] 2.6.23-rc3: fsblock

2007-08-24 Thread Nick Piggin
Hi,

I'm still plugging away at fsblock slowly. I haven't really got around to
finishing up any big new features, but there has been a lot of bug fixing
and a number of little API changes since the last release.

I still think fsblock has merit: even if a more extent-based approach
ends up working better for most things, a block-based one will still
have its place (either alongside or underneath), and I think fsblock
is just much better than buffer_heads.

The fsblock proper and minix port patches are on kernel.org, because they're
pretty big and not well enough split up to be posted here.

http://www.kernel.org/pub/linux/kernel/people/npiggin/patches/fsblock/2.6.23-rc3/

Some notes:

mbforget is restored (though with a very poor implementation for now), and
filesystems no longer have their underlying metadata unmapped or synced by
default for new blocks. They can do this in their get_block function if they
want, or otherwise ensure that metadata blocks are not simply abandoned with
possibly dirty data in flight. Basically my thinking is that this is quite an
extra expense, so it should be optional.

Made use of the lockless radix-tree. Pages in superpage blocks are tracked
with the pagecache radix-tree, and a fair number of lookups can be involved,
so this makes those operations faster and more scalable.
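
As a rough illustration of what such a lookup amounts to (this is not
fsblock's actual code, and the helper name is made up): an RCU-protected
radix-tree walk rather than a lookup under the mapping's tree_lock. A real
lockless-pagecache lookup also needs a speculative page reference and a slot
re-check, which this sketch omits.

#include <linux/fs.h>
#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

/*
 * Illustrative only: find the page backing one page of a superpage
 * block without taking mapping->tree_lock.  Real code must also take
 * a speculative reference on the page and re-check the slot before
 * trusting the result.
 */
static struct page *fsblock_lookup_page(struct address_space *mapping,
					pgoff_t index)
{
	struct page *page;

	rcu_read_lock();
	page = radix_tree_lookup(&mapping->page_tree, index);
	rcu_read_unlock();

	return page;
}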

I've got rid of RCU completely from the fsblock data structures. fsblock's
getblk equivalent must now tolerate blocking (if this turns out to be too
limiting, I might be convinced to add back RCU or some kind of spin locking).
The main motivation is to get rid of RCU freeing: a normal use case (the
nobh mode) is expected to have a lot of fsblock structures being allocated
and freed as IO happens, so this needs to be really efficient, and RCU
freeing can degrade CPU cache behaviour in these situations.

Added a real vmap cache. It is still pretty primitive, but it gets around a
99% hit rate in my basic tests, so I haven't bothered with anything fancier
yet.

Changed the block memory mapping API to only support metadata blocks at this
stage; it just makes a few implementation details easier at this point.

Added the /proc/sys/vm/fsblock_no_cache tunable, which disables caching of
fsblock structures and actively frees them once IOs etc. have finished.
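
For context, this is roughly how a /proc/sys/vm tunable is wired up in
kernels of this era; the variable, table and function names below are
assumptions for illustration only, not fsblock's actual code.

#include <linux/sysctl.h>
#include <linux/init.h>

/* Assumed flag consulted by the (hypothetical) fsblock alloc/free paths. */
int fsblock_no_cache;

static ctl_table fsblock_table[] = {
	{
		.ctl_name	= CTL_UNNUMBERED,
		.procname	= "fsblock_no_cache",
		.data		= &fsblock_no_cache,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};

static ctl_table fsblock_vm_dir[] = {
	{
		.ctl_name	= CTL_VM,
		.procname	= "vm",
		.mode		= 0555,
		.child		= fsblock_table,
	},
	{ .ctl_name = 0 }
};

static int __init fsblock_sysctl_init(void)
{
	register_sysctl_table(fsblock_vm_dir);
	return 0;
}
late_initcall(fsblock_sysctl_init);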

TODO
- introduce, and make use of, range-based aops APIs and remove all the
  silly locking code for multi-page blocks (as well as doing nice multi-block
  bios). In progress.

- write an extent-based block mapping module that filesystems could optionally
  use to efficiently cache block mappings. In progress.

- port some real filesystems. ext2 is in progress, and I like the idea of
  btrfs; it seems like it would be a lot more manageable for me than ext3.



Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-24 Thread Fengguang Wu
On Thu, Aug 23, 2007 at 08:13:41AM -0400, Chris Mason wrote:
 On Thu, 23 Aug 2007 12:47:23 +1000
 David Chinner [EMAIL PROTECTED] wrote:
 
  On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
   I think we should assume a full scan of s_dirty is impossible in the
   presence of concurrent writers.  We want to be able to pick a start
   time (right now) and find all the inodes older than that start time.
   New things will come in while we're scanning.  But perhaps that's
   what you're saying...
   
   At any rate, we've got two types of lists now.  One keeps track of
   age and the other two keep track of what is currently being
   written.  I would try two things:
   
   1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
   indexes by inode number (or some arbitrary field the FS can set in
   the inode).  Radix tree tags are used to indicate which things in
   s_io are already in progress or are pending (hand waving because
   I'm not sure exactly).
   
   inodes are pulled off s_dirty and the corresponding slot in s_io is
   tagged to indicate IO has started.  Any nearby inodes in s_io are
   also sent down.
  
  the problem with this approach is that it only looks at inode
  locality. Data locality is ignored completely here and the data for
  all the inodes that are close together could be splattered all over
  the drive. In that case, clustering by inode location is exactly the
  wrong thing to do.
 
 Usually it won't be any more wrong than clustering by time.
 
  
  For example, XFS changes allocation strategy at 1TB for 32-bit-inode
  filesystems, which makes the data get placed way away from the inodes,
  i.e. inodes in AGs below 1TB, all data in AGs above 1TB. Clustering
  by inode number for data writeback is mostly useless in the >1TB
  case.
 
 I agree we'll want a way to let the FS provide the clustering key.  But
 for the first cut on the patch, I would suggest keeping it simple.
 
  
  The inode32 (for <1TB) and inode64 allocators both try to keep data
  close to the inode (i.e. in the same AG), so clustering by inode number
  might work better here.
  
  Also, it might be worthwhile allowing the filesystem to supply a
  hint or mask for closeness for inode clustering. This would help
  the generic code only try to cluster inode writes to inodes that
  fall into the same cluster as the first inode.
 
 Yes, also a good idea after things are working.
 
  
    Notes:
    (1) I'm not sure inode number is correlated to disk location in
    filesystems other than ext2/3/4. Or parent dir?
   
   In general, it is a better assumption than sorting by time.  It may
   make sense to one day let the FS provide a clustering hint
   (corresponding to the first block in the file?), but for starters it
   makes sense to just go with the inode number.
  
  Perhaps multiple hints are needed - one for data locality and one
  for inode cluster locality.
 
 So, my feature creep idea would have been more data clustering.  I'm
 mainly trying to solve this graph:
 
 http://oss.oracle.com/~mason/compilebench/makej/compare-create-dirs-0.png
 
 Where background writing of the block device inode is making ext3 do
 seeky writes while creating directory trees.  My simple idea was to kick
 off an 'I've just written block X' callback to the FS, where it may
 decide to send down chunks of the block device inode that also happen
 to be dirty.
 
 But, maintaining the kupdate max dirty time and congestion limits in
 the face of all this clustering gets tricky.  So, I wasn't going to
 suggest it until the basic machinery was working.
 
 Fengguang, this isn't a small project ;)  But, lots of people will be
 interested in the results.

Exactly, the current writeback logic is unsatisfactory in many ways.
As for writeback clustering, inode and data localities can differ,
but I'll follow your suggestion to start simple first and give the
idea a spin on ext3.

-fengguang



Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-24 Thread Fengguang Wu
On Wed, Aug 22, 2007 at 08:42:01AM -0400, Chris Mason wrote:
  My vague idea is to
  - keep the s_io/s_more_io as a FIFO/cyclic writeback dispatching
    queue.
  - convert s_dirty to some radix-tree/rbtree based data structure.
    It would have dual functions: delayed-writeback and
    clustered-writeback.

  clustered-writeback:
  - Use inode number as a clue of locality, hence the key for the sorted
    tree.
  - Drain some more s_dirty inodes into s_io on every kupdate wakeup,
    but do it in ascending order of inode number instead of
    ->dirtied_when.

  delayed-writeback:
  - Make sure that a full scan of the s_dirty tree takes <=30s, i.e.
    dirty_expire_interval.
 
 I think we should assume a full scan of s_dirty is impossible in the
 presence of concurrent writers.  We want to be able to pick a start
 time (right now) and find all the inodes older than that start time.
 New things will come in while we're scanning.  But perhaps that's what
 you're saying...

Yeah, I was thinking about elevators :)
Or call it sweeping based on an address hint (the inode number).

 At any rate, we've got two types of lists now.  One keeps track of age
 and the other two keep track of what is currently being written.  I
 would try two things:
 
 1) s_dirty stays a list for FIFO.  s_io becomes a radix tree that
 indexes by inode number (or some arbitrary field the FS can set in the
 inode).  Radix tree tags are used to indicate which things in s_io are
 already in progress or are pending (hand waving because I'm not sure
 exactly).
 
 inodes are pulled off s_dirty and the corresponding slot in s_io is
 tagged to indicate IO has started.  Any nearby inodes in s_io are also
 sent down.
 
 2) s_dirty and s_io both become radix trees.  s_dirty is indexed by a
 sequence number that corresponds to age.  It is treated as a big
 circular indexed list that can wrap around over time.  Radix tree tags
 are used both on s_dirty and s_io to flag which inodes are in progress.
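
For readers unfamiliar with radix-tree tags, here is a minimal sketch of the
kind of structure described in point (1) above, using the kernel's generic
radix-tree API. The tree, helper names and tag numbers are invented for
illustration, locking (e.g. inode_lock) is omitted, and this is not code
from any posted patch.

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/radix-tree.h>

/* Made-up tag numbers; the generic radix tree provides a couple of tags. */
#define S_IO_TAG_PENDING	0
#define S_IO_TAG_WRITING	1

/* Hypothetical replacement for the s_io list, keyed by inode number. */
static RADIX_TREE(s_io_tree, GFP_ATOMIC);

/* Queue a dirty inode for writeback: insert keyed by i_ino, tagged pending. */
static void s_io_queue(struct inode *inode)
{
	if (radix_tree_insert(&s_io_tree, inode->i_ino, inode) == 0)
		radix_tree_tag_set(&s_io_tree, inode->i_ino, S_IO_TAG_PENDING);
}

/* IO has been started on @inode: flip its tag from pending to writing. */
static void s_io_start(struct inode *inode)
{
	radix_tree_tag_clear(&s_io_tree, inode->i_ino, S_IO_TAG_PENDING);
	radix_tree_tag_set(&s_io_tree, inode->i_ino, S_IO_TAG_WRITING);
}

/* Gather up to @nr still-pending inodes at or above @ino, for clustering. */
static unsigned int s_io_gather_nearby(unsigned long ino,
				       struct inode **results,
				       unsigned int nr)
{
	return radix_tree_gang_lookup_tag(&s_io_tree, (void **)results,
					  ino, nr, S_IO_TAG_PENDING);
}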

It's meaningless to convert s_io to a radix tree, because inodes on
s_io will normally be sent to the block-layer elevators at the same time.

Also, s_dirty holds 30 seconds' worth of inodes, while s_io holds only
5 seconds' worth. The more inodes, the more chances of good clustering.
That's the general rule.

s_dirty is the right place to do address clustering.
As for the dirty_expire_interval parameter on dirty age,
we can apply a simple rule: do one full scan/sweep over the
fs address space every 30s, syncing all inodes encountered
and sparing those dirtied less than 5s ago. With that rule,
any inode will get synced between 5 and 35 seconds after being dirtied.
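
A minimal sketch of that rule in plain userspace C, with made-up names and a
sorted array standing in for the inode-number-keyed tree; purely illustrative,
not from any posted patch.

#include <stdio.h>

#define EXPIRE_INTERVAL 30	/* full sweep period, seconds */
#define MIN_DIRTY_AGE	 5	/* don't sync anything younger than this */

struct dirty_inode {
	unsigned long ino;	/* sweep key: inode number */
	long dirtied_when;	/* time the inode was dirtied, seconds */
};

/* One sweep in ascending inode-number order; 'now' is the current time. */
static void sweep(struct dirty_inode *tree, int nr, long now)
{
	for (int i = 0; i < nr; i++) {
		if (now - tree[i].dirtied_when < MIN_DIRTY_AGE)
			continue;	/* newly dirtied: spare it this time */
		printf("sync ino %lu (dirtied %lds ago)\n",
		       tree[i].ino, now - tree[i].dirtied_when);
	}
}

int main(void)
{
	struct dirty_inode tree[] = {
		{ 12, 100 }, { 40, 127 }, { 41, 101 }, { 99, 95 },
	};

	/*
	 * With one sweep every EXPIRE_INTERVAL seconds, an inode is synced
	 * somewhere between MIN_DIRTY_AGE and EXPIRE_INTERVAL + MIN_DIRTY_AGE
	 * seconds after being dirtied (the 5-35s window above).
	 */
	sweep(tree, 4, 130);
	return 0;
}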

-fengguang



Re: [PATCH 0/6] writeback time order/delay fixes take 3

2007-08-24 Thread Fengguang Wu
On Thu, Aug 23, 2007 at 12:33:06PM +1000, David Chinner wrote:
 On Wed, Aug 22, 2007 at 09:18:41AM +0800, Fengguang Wu wrote:
  On Tue, Aug 21, 2007 at 08:23:14PM -0400, Chris Mason wrote:
  Notes:
  (1) I'm not sure inode number is correlated to disk location in
  filesystems other than ext2/3/4. Or parent dir?
 
 They correspond to the exact location on disk on XFS. But XFS has its
 own inode clustering (see xfs_iflush), and it can't be moved up
 into the generic layers because of locking and integration into
 the transaction subsystem.

  (2) It duplicates some function of elevators. Why is it necessary?
 
 The elevators have no clue as to how the filesystem might treat adjacent
 inodes. In XFS, inode clustering is a fundamental feature of inode
 reading and writing, and that is something no elevator can hope to
 achieve.
 
Thank you. That explains the linear write curve (perfect!) in Chris's graph.

I wonder whether XFS can benefit any further from generic writeback clustering.
How large would a typical XFS cluster be?

-fengguang



Re: [2/4] 2.6.23-rc3: known regressions v3

2007-08-24 Thread Michal Piotrowski
Hi all,

Here is a list of some known regressions in 2.6.23-rc3.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name             Regressions fixed since 21-Jun-2007
Adrian Bunk      9
Andi Kleen       5
Linus Torvalds   5
Andrew Morton    4
Al Viro          3
Alan Stern       3
Cornelia Huck    3
Jens Axboe       3
Tejun Heo        3



ACPI

Subject : MCFG bug on hp nx6310
References  : http://lkml.org/lkml/2007/8/8/252
Last known good : ?
Submitter   : Michael Sedkowski [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown

Subject : 2.6.23-rc1-git10 hangs on boot - needs acpi=off
References  : http://lkml.org/lkml/2007/8/2/24
Last known good : ?
Submitter   : Plamen Petrov [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : Len Brown [EMAIL PROTECTED]
Status  : problem is being debugged



CPUFREQ

Subject : ide problems: 2.6.22-git17 working, 2.6.23-rc1* is not
References  : http://lkml.org/lkml/2007/7/27/298
  http://lkml.org/lkml/2007/7/29/371
Last known good : ?
Submitter   : dth [EMAIL PROTECTED]
Caused-By   : Len Brown [EMAIL PROTECTED]
  commit f79e3185dd0f8650022518d7624c876d8929061b
Handled-By  : Len Brown [EMAIL PROTECTED]
Status  : problem is being debugged



FS

Subject : nfs4 hang/NFS woes again
  (was USB-related oops in sysfs with linux 
v2.6.23-rc3-50-g28e8351)
References  : http://lkml.org/lkml/2007/8/15/144
  http://lkml.org/lkml/2007/8/22/484
  http://lkml.org/lkml/2007/8/23/134
Last known good : ?
Submitter   : Florin Iucha [EMAIL PROTECTED]
  Bret Towe [EMAIL PROTECTED]
Caused-By   : Trond Myklebust [EMAIL PROTECTED]
  commit 3d39c691ff486142dd9aaeac12f553f4476b7a62
Handled-By  : ?
Status  : problem is being debugged

Subject : [NFSD OOPS] 2.6.23-rc1-git10
References  : http://lkml.org/lkml/2007/8/2/462
Last known good : ?
Submitter   : Andrew Clayton [EMAIL PROTECTED]
Caused-By   : ?
Handled-By  : ?
Status  : unknown



Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/