Re: Questions about the buffer+page cache in 2.4.0

2000-07-31 Thread Chris Mason



On 7/30/00, 7:14:16 AM, Daniel Phillips [EMAIL PROTECTED] wrote
regarding Re: Questions about the buffer+page cache in 2.4.0:


 Daniel Phillips wrote:
  There are two obvious ways to do filesystem-specific special handling
  of the tail block: (1) in the 'read actor' that does the actual
  copy-to-user (see? yet another problem solved by an extra level of
  indirection!) or (2) in the inode->i_mapping->a_ops->readpage function
  that retrieves the page: essentially we would do a copy-on-write to
  produce a new page that has the tail fragment in the correct place.

  I have a very distinct preference for (1), and I'll proceed on the
  assumption that that's what I'll be doing unless someone shows me why
  it's bad.

 After digging a little deeper I can see that using the read actor won't
 work because the read actor doesn't take the inode, or anything that
 can be dereferenced to find the inode, as a parameter.  So it's not
 possible to do the tail offset check and adjustment there.

 That's ok - it's the wrong place to do it anyway because the check then
 has to be performed each time around the loop.  A much better way is to
 replace generic_file_read in the Ext2 file_operations struct by a new
 ext2_file_read:

 proposed_ext2_file_read:
   - generic_file_read stopping before any tail with nonzero offset
   - If necessary, generic_file_read of the tail with source offset
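
(Roughly, the proposal amounts to something like the sketch below;
ext2_tail_start() and ext2_read_tail() are invented placeholder names
standing in for whatever the tailmerging code actually provides:)

    static ssize_t ext2_file_read(struct file *filp, char *buf,
                                  size_t count, loff_t *ppos)
    {
            struct inode *inode = filp->f_dentry->d_inode;
            loff_t tail = ext2_tail_start(inode); /* first merged byte */
            ssize_t done = 0;

            /* Body of the file: unmodified generic path */
            if (*ppos < tail) {
                    size_t body = count;
                    if (*ppos + body > tail)
                            body = tail - *ppos;
                    done = generic_file_read(filp, buf, body, ppos);
                    if (done < 0 || (size_t) done < body)
                            return done;
            }
            /* Tail: same idea, but with the source offset adjusted */
            if ((size_t) done < count)
                    done += ext2_read_tail(inode, buf + done,
                                           count - done, ppos);
            return done;
    }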

For reading the tail, take a look at how these functions interact:

get_block
generic_file_read
block_read_full_page (ext2's readpage func)


Putting the tail knowledge into ext2_file_read won't be enough; it won't
cover mmaps.  You have to make sure your readpage/writepage functions
keep the page and buffer caches in sync.  Reiserfs does most of this
from get_block...
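
For reference, the 2.4 interface in question looks roughly like this;
block_read_full_page() calls the filesystem's get_block() once per
block-sized chunk of the page, so a tail-aware get_block() can map the
last chunk onto the shared tail block and keep both caches coherent:

    int ext2_get_block(struct inode *inode, long iblock,
                       struct buffer_head *bh_result, int create);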

 This imposes just one extra check when the tail isn't merged or happens
 to be at the beginning of the tail block, so read overhead for
 tailmerging is negligible when the feature isn't used.

 Now I have to address the question of how tail blocks can be shared
 between files.  This does not seem to me to be an easy question at all.
 I'll summarize the previous discussion here...

You have two real choices.  Unpack the tail before any writes to it, then 
repack later (file close, whatever).  This allows you to use all the 
generic functions for writing data, and keeps the synchronization down 
(only write to shared data on pack/unpack). 

Or, change your prepare, commit, writepage, and get_block routines to 
write directly to the shared block.  This is somewhat more difficult, and 
I suspect slower since you'll have to touch all the inodes in the ring as 
you shift data around for each write.
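
A sketch of the first option, with invented helper names - the point is
just that the generic write path never sees a shared block:

    /* Give the file a private tail block before any write touches it;
     * a later repack pass can re-merge it.  All names here except
     * mark_buffer_dirty() and brelse() are hypothetical. */
    static int ext2_unpack_tail(struct inode *inode)
    {
            struct buffer_head *shared, *priv;

            shared = ext2_get_tail_block(inode);   /* the shared block */
            priv = ext2_alloc_tail_block(inode);   /* fresh block      */
            if (!priv)
                    return -ENOSPC;
            memcpy(priv->b_data,
                   shared->b_data + ext2_tail_offset(inode),
                   ext2_tail_size(inode));
            mark_buffer_dirty(priv);
            ext2_set_last_block(inode, priv);      /* remap last block */
            ext2_tail_ring_remove(inode);          /* leave the ring   */
            brelse(shared);
            brelse(priv);
            return 0;
    }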

-chris





Re: Questions about the buffer+page cache in 2.4.0

2000-07-31 Thread Daniel Phillips

Chris Mason wrote:
 
 On 7/30/00, 7:14:16 AM, Daniel Phillips
 [EMAIL PROTECTED] wrote regarding Re: Questions about
 the buffer+page cache in 2.4.0:
 
  After digging a little deeper I can see that using the read actor
  won't work because the read actor doesn't take the inode, or 
  anything that can be dereferenced to find the inode, as a parameter.  
  So it's not possible to do the tail offset check and adjustment 
  there.
 
  That's ok - it's the wrong place to do it anyway because the check 
  then has to be performed each time around the loop.  A much better 
  way is to replace generic_file_read in the Ext2 file_operations 
  struct by a new ext2_file_read:
 
  proposed_ext2_file_read:
- generic_file_read stopping before any tail with nonzero offset
- If necessary, generic_file_read of the tail with source offset
 
 For reading the tail, take a look at how these functions interact:
 
 get_block
 generic_file_read
 block_read_full_page (ext2's readpage func)
 
 Putting the tail knowledge into ext2_file_read won't be enough; it 
 won't cover mmaps.  You have to make sure your readpage/writepage 
 functions keep the page and buffer caches in sync.  Reiserfs does 
 most of this from get_block...

Yes, exactly.  To be able to reach that conclusion requires a detailed
understanding of the way the file mapping mechanism works, which I puzzled out
yesterday *after* I wrote the previous post.  Tomorrow I'll try to post the
relevant details and show how they lead to the design you just suggested.  Clearly
Stephen and Alex both had the same thing in mind, and were merely arguing about
the subtle details of the implementation.

  Now I have to address the question of how tail blocks can 
  be shared between files...
 
 You have two real choices.  Unpack the tail before any writes to it, then
 repack later (file close, whatever).  This allows you to use all the
 generic functions for writing data, and keeps the synchronization down
 (only write to shared data on pack/unpack).

Yes, that was the preferred design right from the beginning even before I
started thinking of it in terms of page cache operations vs buffer cache.

 Or, change your prepare, commit, writepage, and get_block routines to
 write directly to the shared block.  This is somewhat more difficult, and
 I suspect slower since you'll have to touch all the inodes in the ring as
 you shift data around for each write.

Ick - I think we agree on which is best.  Incidentally, since the ring is
now defined to be doubly linked, you will only touch two inodes besides the one
you're writing to.  Furthermore, if those two inodes happen to be on the same
inode page (8 inodes/page in a 1K/block fs; 32 with 4K) then you don't even have
to hit the disk an extra time.  The tailmerging algorithm could easily be
optimized to favor this arrangement if desired.
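
The two-neighbour point in code form (structure and field names invented
for illustration; imagine this embedded in the in-core inode info):

    struct tail_ring {
            struct tail_ring *next, *prev;
    };

    /* Unlinking touches only the two neighbours, wherever their
     * inodes happen to live on disk. */
    static void tail_ring_remove(struct tail_ring *i)
    {
            i->prev->next = i->next;
            i->next->prev = i->prev;
            i->next = i->prev = i;    /* now a ring of one */
    }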

I see we're now at the point that Alex warned me about - the change-of-state
when a merged tail on the page has to be unmerged due to a write further up the
file.

-- 
Daniel



Re: Questions about the buffer+page cache in 2.4.0

2000-07-30 Thread Steve Dodd

On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:

 So now it's time to start asking questions.  Just jumping in at a place I
 felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk,
 and I see it's changed somewhat.  Finding and removing a block from the free
 list is now bracketed by a spinlock pair.  First question: why do we use
 atomic_set to set the initial buffer use count if this is already protected
 by a spinlock?

Also, I'd /still/ like to know what stops another CPU from adding the buffer
head we were looking for between the call to get_hash_table returning NULL and
insert_into_queues. AFAICS, getblk() isn't always called with the BKL.
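
Schematically, the window I mean (whether it is actually reachable is
exactly the question):

    CPU0: get_hash_table(dev, block)   returns NULL
    CPU1: get_hash_table(dev, block)   returns NULL
    CPU0: take bh from free list;  insert_into_queues(bh)
    CPU1: take bh2 from free list; insert_into_queues(bh2)
          - two buffer heads now alias the same disk block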



Re: Questions about the buffer+page cache in 2.4.0

2000-07-30 Thread Kurt Garloff

On Sat, Jul 29, 2000 at 09:52:57AM -0700, Gary Funck wrote:
 Sounds good. Has anyone run performance tests (eg, Bonnie) on the 2.4 file I/O
 performance (ie, using ext2fs) vs. 2.2?  What were the results?

I did that on my SMP box with some fast IBM SCSI disks lately ...

Singlethreaded I/O (bonnie):
Write performance unchanged, read performance 10--20% lower with 2.4.

Multithreaded I/O (tiobench):
Write performance much better for multiple processes, read performance
10% slower with 2.4.

Regards,
-- 
Kurt Garloff  [EMAIL PROTECTED]  Eindhoven, NL
GPG key: See mail header, key servers Linux kernel development
SuSE GmbH, Nuernberg, FRG   SCSI, Security



Re: Questions about the buffer+page cache in 2.4.0

2000-07-29 Thread Daniel Phillips

Daniel Phillips wrote:
 So now it's time to start asking questions.

I'm partway through lxr'ing the 2.4.0 VFS now, and all I can say is wow!
This really is a big improvement over 2.2.

Please treat all of the rest of this post as speculation: I'm not sure how
accurate any of it is.  Please feel free to make corrections.

I'm focussing on three functions:

 - generic_file_read
 - generic_file_write
 - generic_file_mmap

Oddly enough, all three of these functions are defined in the mm tree, not the
fs tree.

An observation: in 2.2, generic_file_read was used by Ext2 and
generic_file_write wasn't.  Now I see that all three of the functions apply to
not only Ext2 but a majority of the filesystems.  With the notable exception of
NTFS - can somebody tell me why that is?

Each of these functions works its magic through a 'mapping' defined by an
address_space struct.  In OO terminology this would be a memory mapping class
with instances denoting the mapping between a section of memory and a particular
kind of underlying secondary storage.  So what we have looks a lot like a
generalization of a swap file - all (or the vast majority of) file I/O in Linux
is now done by memory-mapping, whether you ask for it or not.  (This is good)

Who was it that said "any problem, no matter how complex, can be solved by
adding another level of indirection"?  Anyway, we now have
inode->i_mapping->a_ops->readpage instead of 2.2's inode->i_op->readpage.
(breaking the comment above the function)

The three functions in question make use of a 'page cache', which appears to be
a set of memory page 'heads' (page head - my terminology, since I haven't seen
any other so far - an instance of struct page) indexed by offset expressed in
units of PAGE_SIZE (4096 on most architectures), munged together with the
address of the mapping struct.  So we use the page offset within a file plus a
unique characteristic of the file (its mapping struct), followed by a linear
search to find a piece of memory that has already been set up to map the
file/offset we're interested in.
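
In code terms the lookup amounts to something like this (a simplified
sketch of what 2.4 does in page_hash() plus __find_page_nolock(); the
real code also takes the pagecache lock):

    static struct page *find_page(struct address_space *mapping,
                                  unsigned long index)
    {
            /* hash on (mapping, index), then walk the short chain */
            struct page *p = *page_hash(mapping, index);

            while (p && !(p->mapping == mapping && p->index == index))
                    p = p->next_hash;
            return p;
    }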

I've only looked at generic_file_read in any detail so far.  In simple terms,
its algorithm is:

  while there is more to transfer:
look in the hash table for a page that maps the current offset
  if there isn't one, create one and add it to the page hash table
check that the page is up to date
  if it isn't, read it in via the above-mentioned mapping function
copy the page (or part) to user space via a 'read actor'
advance by PAGE_SIZE 
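
In conceptual C (a schematic fragment only - the real
do_generic_file_read() interleaves this with readahead and page
locking, and cached_page() is an invented stand-in for the
create-and-hash step):

    while (count) {
            struct page *page = find_page(mapping, index); /* see above */

            if (!page)
                    page = cached_page(mapping, index);  /* create+hash */
            if (!Page_Uptodate(page))
                    mapping->a_ops->readpage(file, page); /* fill it */
            nr = PAGE_SIZE - offset;
            if (nr > count)
                    nr = count;
            actor(desc, page, offset, nr);  /* copy-to-user callback */
            count -= nr; offset = 0; index++;
    }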

This is considerably more complicated in the actual implementation because of
the requirements of concurrency and readahead optimization.

As a next step, I hope to be able to state some of the conditions that have to
hold throughout the execution of the above algorithm.

Another thing I hope to be able to define pretty soon is the relationship
between the buffer cache and the page cache, including a few rules you have to
follow in order to keep the two consistent.  I suspect from the discussion I've
seen so far that this relationship is still evolving.  Nonetheless, in order to
be able to do filesystem work without breaking things, the rules have to be
stated and understood in a crystal-clear way.

OK, I'm within sight of being able to discuss what to do with that pesky final
block in a tail-merged file, or at least to understand the discussion that's
gone on so far.  The obvious place to handle the special requirements for the
last page would be in the above read-algorithm.  The fly in that ointment is:
doing so would require the VFS to be changed - IOW, work that should stay
within a particular filesystem would leak into the VFS.

There are two obvious ways to do filesystem-specific special handling of the
tail block: (1) in the 'read actor' that does the actual copy-to-user (see? yet
another problem solved by an extra level of indirection!) or (2) in the
inode->i_mapping->a_ops->readpage function that retrieves the page: essentially
we would do a copy-on-write to produce a new page that has the tail fragment in
the correct place.

I have a very distinct preference for (1), and I'll proceed on the assumption
that that's what I'll be doing unless someone shows me why it's bad.

Now that I know a little more about the page cache I'll have a good think about
what's going to happen when files start sharing tail blocks.

-- 
Daniel



Re: Questions about the buffer+page cache in 2.4.0

2000-07-29 Thread Andi Kleen

On Sat, Jul 29, 2000 at 06:58:34PM +0200, Gary Funck wrote:
 What entity is responsible for tearing down the file-page mapping, when
 the storage is needed?  Is that bdflush's job?

kswapd's and the allocators themselves.

 
 In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
 the part of the I/O request being handled has a user-level address that
 is page aligned, and the requested bytes to transfer are at least one
 full page?  Ie, does it handle the 'copy' by simply remapping the I/O
 page directly into the user's address space (avoiding a copy)?

No, because you cannot avoid the copy anyway: you need to maintain cache
coherency in the page cache.  If you want zero-copy IO, use mmap().
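
A userspace sketch of the zero-copy path (error handling omitted):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
            struct stat st;
            int fd = open("somefile", O_RDONLY);
            char *p;

            fstat(fd, &st);
            /* Pages come straight from the page cache; no copy into a
             * user buffer ever happens. */
            p = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            /* ... read p[0 .. st.st_size-1] in place ... */
            munmap(p, st.st_size);
            close(fd);
            return 0;
    }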


-Andi



Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Tigran Aivazian

On Thu, 27 Jul 2000, Daniel Phillips wrote:
 So now it's time to start asking questions.  Just jumping in at a place I felt I
 knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
 it's changed somewhat.  Finding and removing a block from the free list is now
 bracketed by a spinlock pair.  First question: why do we use atomic_set to set
 the initial buffer use count if this is already protected by a spinlock?

have a look at other users of bh->b_count. For example, __brelse() does
atomic_dec() and is called directly from brelse(), which can be called
by a filesystem without any other protection.
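
Paraphrased from memory (2.4 fs/buffer.c) - note there is no lock
anywhere on this path, so the atomic op is the only protection b_count
gets:

    void __brelse(struct buffer_head *buf)
    {
            if (atomic_read(&buf->b_count)) {
                    atomic_dec(&buf->b_count);
                    return;
            }
            printk("VFS: brelse: Trying to free free buffer\n");
    }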

regards,
Tigran




Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Andi Kleen

On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
 So now it's time to start asking questions.  Just jumping in at a place I felt I
 knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
 it's changed somewhat.  Finding and removing a block from the free list is now
 bracketed by a spinlock pair.  First question: why do we use atomic_set to set
 the initial buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because interrupts
may change it, even on UP.  atomic_t can only be modified by the atomic_*
functions, and atomic.h lacks an "atomic_set_nonatomic".  So even if you
need the atomic property in only one place, you have to change all uses
of the field.
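
Concretely: atomic_t is a struct, so a plain assignment will not even
compile, and the lock-protected initialization in getblk() has to go
through the same API as the lock-free paths:

    atomic_set(&bh->b_count, 1);   /* in getblk(), under the spinlock */
    /* ...and later, with no lock held at all: */
    atomic_dec(&bh->b_count);      /* in __brelse() */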


-Andi

-- 
This is like TV. I don't like TV.



Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Matthew Wilcox

On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:
 So now it's time to start asking questions.  Just jumping in at a place I felt I
 knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
 it's changed somewhat.  Finding and removing a block from the free list is now
 bracketed by a spinlock pair.  First question: why do we use atomic_set to set
 the initial buffer use count if this is already protected by a spinlock?

the usual answer to this type of question is that there is another
place which accesses the buffer use count without holding the spinlock.
I haven't audited this particular piece of code and I'm not familiar
with this area, but this may help you find the solution.

(the other answer is that this is a bug left over from a time when there
was no spinlock :-)

-- 
Revolutions do not require corporate support.