Re: Questions about the buffer+page cache in 2.4.0
On 7/30/00, 7:14:16 AM, Daniel Phillips [EMAIL PROTECTED] wrote regarding Re: Questions about the buffer+page cache in 2.4.0:

>> There are two obvious ways to do filesystem-specific special handling
>> of the tail block: (1) in the 'read actor' that does the actual
>> copy-to-user (see? yet another problem solved by an extra level of
>> indirection!) or (2) in the inode->i_mapping->a_ops->readpage function
>> that retrieves the page: essentially we would do a copy-on-write to
>> produce a new page that has the tail fragment in the correct place.
>> I have a very distinct preference for (1), and I'll proceed on the
>> assumption that that's what I'll be doing unless someone shows me why
>> it's bad.
>
> After digging a little deeper I can see that using the read actor won't
> work, because the read actor doesn't take the inode, or anything that
> can be dereferenced to find the inode, as a parameter. So it's not
> possible to do the tail offset check and adjustment there. That's ok -
> it's the wrong place to do it anyway, because the check would then have
> to be performed each time around the loop. A much better way is to
> replace generic_file_read in the Ext2 file_operations struct by a new
> ext2_file_read:
>
> proposed_ext2_file_read:
>   - generic_file_read stopping before any tail with nonzero offset
>   - If necessary, generic_file_read of the tail with source offset

For reading the tail, take a look at how these functions interact:

    get_block
    generic_file_read
    block_read_full_page (ext2's readpage func)

Putting the tail knowledge into ext2_file_read won't be enough; it won't
cover mmaps. You have to make sure your readpage/writepage functions keep
the page and buffer caches in sync. Reiserfs does most of this from
get_block...

> This imposes just one extra check when the tail isn't merged or happens
> to be at the beginning of the tail block, so read overhead for
> tailmerging is negligible when the feature isn't used.
>
> Now I have to address the question of how tail blocks can be shared
> between files.
> This does not seem to me to be an easy question at all.

I'll summarize the previous discussion here... You have two real choices.

Unpack the tail before any writes to it, then repack later (file close,
whatever). This allows you to use all the generic functions for writing
data, and keeps the synchronization down (only write to shared data on
pack/unpack).

Or, change your prepare, commit, writepage, and get_block routines to
write directly to the shared block. This is somewhat more difficult, and
I suspect slower, since you'll have to touch all the inodes in the ring
as you shift data around for each write.

-chris
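Chris's first option - unpack the tail into a private block before any write, and only touch shared data on pack/unpack - can be sketched in userspace C. This is purely illustrative; every name here (struct tail_file, unpack_tail, repack_tail) is invented for the example and is not real kernel or reiserfs code.

```c
#include <string.h>

#define BLOCK_SIZE 1024

/* Toy model of a file whose last partial block may be packed into a
 * block shared with other files' tails. */
struct tail_file {
    char block[BLOCK_SIZE]; /* the file's own last block, once unpacked */
    int  tail_len;          /* bytes of tail data */
    int  tail_offset;       /* where this tail sits in the shared block */
    int  packed;            /* still sharing the tail block? */
};

/* Copy the fragment out of the shared block so the generic write paths
 * can treat the last block like any other private block. */
static void unpack_tail(struct tail_file *f, const char *shared_block)
{
    if (!f->packed)
        return;
    memcpy(f->block, shared_block + f->tail_offset, f->tail_len);
    f->packed = 0;
}

/* On close (or whenever), merge the private tail back into the shared
 * block.  Only here do we write shared data, which is what keeps the
 * synchronization confined to pack/unpack time. */
static void repack_tail(struct tail_file *f, char *shared_block)
{
    if (f->packed)
        return;
    memcpy(shared_block + f->tail_offset, f->block, f->tail_len);
    f->packed = 1;
}
```

The attraction of this scheme, as Chris notes, is that between unpack and repack the file looks entirely ordinary to the generic read/write code.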
Re: Questions about the buffer+page cache in 2.4.0
Chris Mason wrote:
> On 7/30/00, 7:14:16 AM, Daniel Phillips [EMAIL PROTECTED] wrote
> regarding Re: Questions about the buffer+page cache in 2.4.0:
>
>> After digging a little deeper I can see that using the read actor
>> won't work, because the read actor doesn't take the inode, or anything
>> that can be dereferenced to find the inode, as a parameter. So it's
>> not possible to do the tail offset check and adjustment there. That's
>> ok - it's the wrong place to do it anyway, because the check would
>> then have to be performed each time around the loop. A much better way
>> is to replace generic_file_read in the Ext2 file_operations struct by
>> a new ext2_file_read:
>>
>> proposed_ext2_file_read:
>>   - generic_file_read stopping before any tail with nonzero offset
>>   - If necessary, generic_file_read of the tail with source offset
>
> For reading the tail, take a look at how these functions interact:
>
>     get_block
>     generic_file_read
>     block_read_full_page (ext2's readpage func)
>
> Putting the tail knowledge into ext2_file_read won't be enough; it
> won't cover mmaps. You have to make sure your readpage/writepage
> functions keep the page and buffer caches in sync. Reiserfs does most
> of this from get_block...

Yes, exactly. To be able to reach that conclusion requires a detailed
understanding of the way the file mapping mechanism works, which I
puzzled out yesterday *after* I wrote the previous post. Tomorrow I'll
try to post the relevant details and show how they lead to the design you
just suggested. Clearly Stephen and Alex both had the same thing in mind,
and were merely arguing about the subtle details of the implementation.

>> Now I have to address the question of how tail blocks can be shared
>> between files...
>
> You have two real choices. Unpack the tail before any writes to it,
> then repack later (file close, whatever). This allows you to use all
> the generic functions for writing data, and keeps the synchronization
> down (only write to shared data on pack/unpack).
Yes, that was the preferred design right from the beginning, even before
I started thinking of it in terms of page cache operations vs buffer
cache.

> Or, change your prepare, commit, writepage, and get_block routines to
> write directly to the shared block. This is somewhat more difficult,
> and I suspect slower, since you'll have to touch all the inodes in the
> ring as you shift data around for each write.

Ick - I think we agree on which is best. Incidentally, since the ring is
now defined to be double-linked, you will only touch two inodes besides
the one you're writing to. Furthermore, if those two inodes happen to be
on the same inode page (8 inodes/page in a 1K/block fs; 32 with 4K) then
you don't even have to hit the disk an extra time. The tailmerging
algorithm could easily be optimized to favor this arrangement if desired.

I see we're now at the point that Alex warned me about - the
change-of-state when a merged tail on the page has to be unmerged due to
a write further up the file.

-- Daniel
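Daniel's observation about the doubly-linked ring can be made concrete with a few lines of list code. This is a generic illustration, not the actual on-disk ring structure: unlinking one member of a doubly-linked ring modifies only the two neighbours, which is exactly why only two other inodes get dirtied.

```c
#include <stddef.h>

/* A minimal doubly-linked ring, standing in for a ring of inodes that
 * share a tail block.  Names are illustrative only. */
struct ring_inode {
    int ino;
    struct ring_inode *next, *prev;
};

/* A one-element ring points at itself. */
static void ring_init(struct ring_inode *n)
{
    n->next = n->prev = n;
}

/* Insert 'n' after 'pos'. */
static void ring_add(struct ring_inode *n, struct ring_inode *pos)
{
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}

/* Remove 'n' from the ring: only n->prev and n->next are written, so at
 * most two other inodes - possibly sitting on the same inode page - are
 * dirtied, no matter how large the ring is. */
static void ring_del(struct ring_inode *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    ring_init(n);
}
```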
Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place
> I felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0
> getblk, and I see it's changed somewhat. Finding and removing a block
> from the free list is now bracketed by a spinlock pair. First question:
> why do we use atomic_set to set the initial buffer use count if this is
> already protected by a spinlock?

Also, I'd /still/ like to know what stops another CPU from adding the
buffer head we were looking for between the call to get_hash_table
returning NULL and insert_into_queues. AFAICS, getblk() isn't always
called with the BKL.
Re: Questions about the buffer+page cache in 2.4.0
On Sat, Jul 29, 2000 at 09:52:57AM -0700, Gary Funck wrote:
> Sounds good. Has anyone run performance tests (eg, Bonnie) on the 2.4
> file I/O performance (ie, using ext2fs) vs. 2.2? What were the results?

I did that on my SMP box with some fast IBM SCSI disks lately ...

Singlethreaded I/O (bonnie): Write performance unchanged, read
performance 10--20% lower with 2.4.

Multithreaded I/O (tiobench): Write performance much better for multiple
processes, read performance 10% slower with 2.4.

Regards,
-- 
Kurt Garloff [EMAIL PROTECTED]          Eindhoven, NL
GPG key: See mail header, key servers   Linux kernel development
SuSE GmbH, Nuernberg, FRG               SCSI, Security
Re: Questions about the buffer+page cache in 2.4.0
Daniel Phillips wrote:
> So now it's time to start asking questions.

I've been partway through lxr'ing the 2.4.0 VFS now, and all I can say
is: wow! This really is a big improvement over 2.2.

Please treat all of the rest of this post as speculation: I'm not sure
how accurate any of it is. Please feel free to make corrections.

I'm focussing on three functions:

  - generic_file_read
  - generic_file_write
  - generic_file_mmap

Oddly enough, all three of these functions are defined in the mm tree,
not the fs tree. An observation: in 2.2, generic_file_read was used by
Ext2 and generic_file_write wasn't. Now I see that all three of the
functions apply to not only Ext2 but a majority of the filesystems, with
the notable exception of NTFS - can somebody tell me why that is?

Each of these functions works its magic through a 'mapping' defined by
an address_space struct. In OO terminology this would be a memory
mapping class, with instances denoting the mapping between a section of
memory and a particular kind of underlying secondary storage. So what we
have looks a lot like a generalization of a swap file - all (or the vast
majority of) file I/O in Linux is now done by memory-mapping, whether
you ask for it or not. (This is good)

Who was it that said "any problem, no matter how complex, can be solved
by adding another level of indirection"? Anyway, we now have
inode->i_mapping->a_ops->readpage instead of 2.2's inode->i_op->readpage.
(breaking the comment above the function)

The three functions in question make use of a 'page cache', which
appears to be a set of memory page 'heads' (page head - my terminology,
since I haven't seen any other so far - an instance of struct page)
indexed by offset expressed in units of PAGE_SIZE (4096 on most
architectures), munged together with the address of the mapping struct.
So we use the page offset within a file plus a unique characteristic of
the file (its mapping struct), followed by a linear search, to find a
piece of memory that has already been set up to map the file/offset
we're interested in.

I've only looked at generic_file_read in any detail so far. In simple
terms, its algorithm is:

  while there is more to transfer:
    look in the hash table for a page that maps the current offset
    if there isn't one, create one and add it to the page hash table
    check that the page is up to date
      if it isn't, read it in via the above-mentioned mapping function
    copy the page (or part of it) to user space via a 'read actor'
    advance by PAGE_SIZE

This is considerably more complicated in the actual implementation
because of the requirements of concurrency and readahead optimization.

As a next step, I hope to be able to state some of the conditions that
have to hold throughout the execution of the above algorithm. Another
thing I hope to be able to define pretty soon is the relationship
between the buffer cache and the page cache, including a few rules you
have to follow in order to keep the two consistent. I suspect from the
discussion I've seen so far that this relationship is still evolving.
Nonetheless, in order to be able to do filesystem work without breaking
things, the rules have to be stated and understood in a crystal-clear
way.

OK, I'm within sight of being able to discuss what to do with that pesky
final block in a tail-merged file, or at least to understand the
discussion that's gone on so far. The obvious place to handle the
special requirements for the last page would be in the above read
algorithm. The fly in that ointment is: doing so would require the VFS
to be changed; IOW, work that should stay within a particular filesystem
doesn't.

There are two obvious ways to do filesystem-specific special handling of
the tail block: (1) in the 'read actor' that does the actual
copy-to-user (see? yet another problem solved by an extra level of
indirection!) or (2) in the inode->i_mapping->a_ops->readpage function
that retrieves the page: essentially we would do a copy-on-write to
produce a new page that has the tail fragment in the correct place. I
have a very distinct preference for (1), and I'll proceed on the
assumption that that's what I'll be doing unless someone shows me why
it's bad.

Now that I know a little more about the page cache I'll have a good
think about what's going to happen when files start sharing tail blocks.

-- Daniel
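The read loop Daniel describes can be modelled in a few dozen lines of userspace C. This is a deliberately tiny sketch - a linear-search "cache", a toy page size, and invented names (toy_page, toy_readpage, toy_file_read) - not the real generic_file_read, but it exercises the same find-or-create, fill-if-stale, copy-via-actor structure.

```c
#include <string.h>

#define TOY_PAGE_SIZE 8
#define TOY_CACHE_PAGES 16

/* A 'page head': which mapping and which page-sized offset it covers. */
struct toy_page {
    const void *mapping;   /* stands in for the address_space pointer */
    long index;            /* offset in units of TOY_PAGE_SIZE */
    int  uptodate;
    int  in_use;
    char data[TOY_PAGE_SIZE];
};

static struct toy_page page_cache[TOY_CACHE_PAGES];

/* 'readpage': fill a page from backing store (here, a plain buffer). */
static void toy_readpage(struct toy_page *p, const char *store)
{
    memcpy(p->data, store + p->index * TOY_PAGE_SIZE, TOY_PAGE_SIZE);
    p->uptodate = 1;
}

/* Look up (mapping, index); create a page head on a miss.  A real cache
 * hashes and evicts; this sketch assumes free slots remain. */
static struct toy_page *find_or_create_page(const void *mapping, long index)
{
    struct toy_page *free_slot = 0;
    for (int i = 0; i < TOY_CACHE_PAGES; i++) {
        struct toy_page *p = &page_cache[i];
        if (p->in_use && p->mapping == mapping && p->index == index)
            return p;               /* cache hit */
        if (!p->in_use && !free_slot)
            free_slot = p;
    }
    free_slot->in_use = 1;          /* cache miss: add a new page head */
    free_slot->mapping = mapping;
    free_slot->index = index;
    free_slot->uptodate = 0;
    return free_slot;
}

/* The read loop: per page, find-or-create, read in if stale, then copy
 * the relevant part to the caller (the 'read actor' step). */
static void toy_file_read(const void *mapping, const char *store,
                          char *user_buf, long pos, long count)
{
    while (count > 0) {
        long index = pos / TOY_PAGE_SIZE;
        long offset = pos % TOY_PAGE_SIZE;
        long n = TOY_PAGE_SIZE - offset;
        if (n > count)
            n = count;
        struct toy_page *p = find_or_create_page(mapping, index);
        if (!p->uptodate)
            toy_readpage(p, store);
        memcpy(user_buf, p->data + offset, n);
        user_buf += n;
        pos += n;
        count -= n;
    }
}
```

A second read of the same range finds the pages already in the cache and uptodate, so toy_readpage is skipped - the same reason repeated reads of a cached file never touch the disk.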
Re: Questions about the buffer+page cache in 2.4.0
On Sat, Jul 29, 2000 at 06:58:34PM +0200, Gary Funck wrote:
> What entity is responsible for tearing down the file-page mapping, when
> the storage is needed? Is that bdflush's job?

kswapd's and the allocators themselves.

> In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
> the part of the I/O request being handled has a user-level address that
> is page aligned, and the requested bytes to transfer are at least one
> full page? Ie, does it handle the 'copy' by simply remapping the I/O
> page directly into the user's address space (avoiding a copy)?

No, because you cannot avoid the copy anyway: you need to maintain cache
coherency in the page cache. If you want zero copy IO, use mmap().

-Andi
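Andi's suggestion - use mmap() when you want to avoid the copy out of the page cache - looks like this from userspace. The helper name is invented for the example, and error handling is minimal; the point is that the mapped pages are the page cache's pages, so no read()-style copy happens on access.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read the first 'n' bytes of 'path' via mmap() instead of read().
 * The kernel serves the mapping straight from the page cache; read()
 * would have to copy out of those same pages into a user buffer. */
static int read_first_bytes_via_mmap(const char *path, char *out, size_t n)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    void *p = mmap(0, n, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;
    /* We copy here only to return a result; a real zero-copy user
     * would work on the mapped bytes in place. */
    memcpy(out, p, n);
    munmap(p, n);
    return 0;
}
```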
Re: Questions about the buffer+page cache in 2.4.0
On Thu, 27 Jul 2000, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place
> I felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0
> getblk, and I see it's changed somewhat. Finding and removing a block
> from the free list is now bracketed by a spinlock pair. First question:
> why do we use atomic_set to set the initial buffer use count if this is
> already protected by a spinlock?

Have a look at other users of bh->b_count. For example, __brelse() does
atomic_dec(), and it is called directly from brelse(), which can be
called by a filesystem without any other protection.

Regards,
Tigran
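Tigran's point - some paths drop the reference with no lock held, so every access to the count must be atomic - can be shown with C11 atomics standing in for the kernel's atomic_t. The struct and function names are made up for the sketch.

```c
#include <stdatomic.h>

/* Toy buffer head: the use count is atomic because, like the kernel's
 * bh->b_count, it can be touched by paths that hold no lock at all. */
struct toy_bh {
    atomic_int b_count;
};

/* Take a reference. */
static void toy_get(struct toy_bh *bh)
{
    atomic_fetch_add(&bh->b_count, 1);
}

/* Like __brelse(): drop a reference with no external locking - this is
 * exactly the caller that forces the count to be atomic everywhere. */
static void toy_brelse(struct toy_bh *bh)
{
    atomic_fetch_sub(&bh->b_count, 1);
}
```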
Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place
> I felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0
> getblk, and I see it's changed somewhat. Finding and removing a block
> from the free list is now bracketed by a spinlock pair. First question:
> why do we use atomic_set to set the initial buffer use count if this is
> already protected by a spinlock?

The buffer use count needs to be atomic in other places because
interrupts may change it, even on UP. An atomic_t can only be modified
by the atomic_* functions, and atomic.h lacks an "atomic_set_nonatomic".
So even when you need the atomic property in only one place, you have to
change all uses of the field.

-Andi
-- 
This is like TV. I don't like TV.
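Andi's answer has a direct C11 analogue: once the field's type is atomic, even an access that is already serialized by a lock must go through the atomic API, because the type offers no plain accessor. The names below (toy_buffer, toy_init_buffer, free_list_lock) are invented for illustration.

```c
#include <pthread.h>
#include <stdatomic.h>

struct toy_buffer {
    atomic_int use_count;  /* atomic because *other* paths need it so */
};

static pthread_mutex_t free_list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mimics getblk's situation: the spinlock (here, a mutex) already
 * excludes concurrent initializers, yet we still set the count through
 * the atomic API - there is no non-atomic way to store into an
 * atomic_int, just as atomic.h has no "atomic_set_nonatomic". */
static void toy_init_buffer(struct toy_buffer *b)
{
    pthread_mutex_lock(&free_list_lock);
    atomic_store(&b->use_count, 1);
    pthread_mutex_unlock(&free_list_lock);
}
```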
Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 07:49:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a place
> I felt I knew pretty well back in 2.2.13, I'm now looking at the 2.4.0
> getblk, and I see it's changed somewhat. Finding and removing a block
> from the free list is now bracketed by a spinlock pair. First question:
> why do we use atomic_set to set the initial buffer use count if this is
> already protected by a spinlock?

The usual answer to this type of question is that there is another place
which accesses the buffer use count without holding the spinlock. I
haven't audited this particular piece of code and I'm not familiar with
this area, but this may help you find the solution.

(The other answer is that this is a bug left over from a time when there
was no spinlock :-)

-- 
Revolutions do not require corporate support.
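Daniel's other question in this subthread - what stops a second CPU from inserting the block between the failed hash lookup and the insert - is the classic check-then-act race. A userspace sketch of the fix (one critical section covering both the lookup and the insert) follows; the table, lock, and function names are all invented, and a mutex stands in for the spinlock.

```c
#include <pthread.h>
#include <stddef.h>

#define TOY_TABLE_SIZE 64

struct toy_block {
    long blocknr;
    struct toy_block *next;
};

static struct toy_block *hash_table[TOY_TABLE_SIZE];
static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk one hash chain; caller must hold hash_lock. */
static struct toy_block *find_locked(long blocknr)
{
    struct toy_block *b = hash_table[blocknr % TOY_TABLE_SIZE];
    while (b && b->blocknr != blocknr)
        b = b->next;
    return b;
}

/* Find-or-insert under a single critical section.  Because the lock is
 * held across both the lookup and the insert, the window "between
 * get_hash_table returning NULL and insert_into_queues" cannot open:
 * two racing callers serialize, and the loser just finds the winner's
 * block instead of inserting a duplicate. */
static struct toy_block *toy_getblk(long blocknr, struct toy_block *fresh)
{
    pthread_mutex_lock(&hash_lock);
    struct toy_block *b = find_locked(blocknr);
    if (!b) {
        fresh->blocknr = blocknr;
        fresh->next = hash_table[blocknr % TOY_TABLE_SIZE];
        hash_table[blocknr % TOY_TABLE_SIZE] = fresh;
        b = fresh;
    }
    pthread_mutex_unlock(&hash_lock);
    return b;
}
```

Dropping and re-taking the lock between find_locked and the insert would reintroduce exactly the race Daniel is asking about.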