Re: Trimming VFS inodes?
Date: Tue, 13 Jun 2000 13:10:48 -0400 (EDT) From: Alexander Viro [EMAIL PROTECTED] Start from taking ext2, UFS and NFS out of ->u. in struct inode. Yup, separate allocation and inlined function (ext2_ino(inode)) that would return the pointer to private part of inode. I can send you my old (circa 2.2.early) patch that does it for ext2 tonight - hope that will help. Can we please save this for 2.5? If it's not absolutely necessary to fix a critical bug, I think we're much better off not making changes to core parts of the kernel at this point. - Ted
Re: Trimming VFS inodes?
Richard Gooch wrote: Hi, Al. I'd like to explore an idea Linus suggested a while back. He suggested using VFS inodes as the data store for devfs, rather than keeping stuff in devfs entries. So the idea would be that the VFS maintains the tree structure rather than devfs entries. This is a lot closer to being feasible with all the VFS changes you've been making, but there is one problem that really concerns me. VFS inodes are very heavyweight. The devfs entries are very lightweight, storing only that which is necessary. So you could save some code space in devfs, but at the expense of increased data size. Either way, it costs RAM. Have you given any consideration to coming up with a more lightweight inode structure? Either through modification of the VFS inode, or creation of some kind of "generic" lightweight inode structure that stores the bare essentials. Perhaps it could go like this: dentry->lightweight inode->heavyweight inode. Another idea (probably too radical, and also CPU inefficient), is a super lightweight inode that has just two pointers: one to FS-specific data, another to FS-specific methods. The methods are used to read/write inode members, so each FS can decide just how much is worth storing. There are desired applications of reiserfs where the VFS inode is just too heavyweight. I'd just like to say that this seems like a good concern you have here, and the ReiserFS team is completely willing to recode in 2.5.* to accommodate your radical proposal, or some as yet unproposed even better radical proposal if it comes along, because this is a real issue. Perhaps the ultimate lightweight inode would simply mean treating the dcache as optional, and the FS determining whether to look there for it or sidestep it.
For persons surprised that this is a real issue, let me just mention that there are persons desiring to put 30 million entry plus hypertext indexes with poor locality of reference into reiserfs as directories, and one issue is that the VFS inode costs too much RAM. For these indexes to be effective one needs to use stem compression and other such techniques on them just to be able to prevent being killed by random I/Os to disk when the index is too big for RAM. Yet another idea is to split the dcache and icache so that you can keep dentries around (thus maintaining your tree), with pointers to FS-specific data (to save "inode" state), but still free VFS inodes when there is memory pressure. This would require a new "refill" method, which is similar but not quite the same as the lookup() method. Also interesting. I have two basic questions: - do you see merit in some kind of cheaper inode structure - how would you like to see it done? Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED] This looks like the start of an interesting discussion.:)
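For readers following along, the "super lightweight inode" with just two pointers might be sketched roughly as below. This is only an illustration under stated assumptions, not code from any kernel tree: the names lw_inode, lw_inode_ops, tinyfs_node and the choice of accessor methods are all hypothetical, and it is plain userspace C.

```c
#include <assert.h>
#include <stddef.h>

struct lw_inode;

/* Hypothetical method table: each FS decides how (or whether) a member is stored. */
struct lw_inode_ops {
    unsigned long (*get_size)(const struct lw_inode *inode);
    void (*set_size)(struct lw_inode *inode, unsigned long size);
    unsigned int (*get_mode)(const struct lw_inode *inode);
};

/* The "super lightweight inode": one pointer to FS data, one to FS methods. */
struct lw_inode {
    void *fs_data;
    const struct lw_inode_ops *ops;
};

/* Hypothetical tiny FS that stores only a size per node. */
struct tinyfs_node {
    unsigned long size;
};

static unsigned long tinyfs_get_size(const struct lw_inode *inode)
{
    return ((const struct tinyfs_node *)inode->fs_data)->size;
}

static void tinyfs_set_size(struct lw_inode *inode, unsigned long size)
{
    assert(inode != NULL && inode->fs_data != NULL);
    ((struct tinyfs_node *)inode->fs_data)->size = size;
}

/* The mode is not stored at all: every node reports the same constant. */
static unsigned int tinyfs_get_mode(const struct lw_inode *inode)
{
    (void)inode;
    return 0100644;
}

static const struct lw_inode_ops tinyfs_ops = {
    .get_size = tinyfs_get_size,
    .set_size = tinyfs_set_size,
    .get_mode = tinyfs_get_mode,
};
```

Every member access goes through an indirect call, which is the CPU cost the proposal itself concedes; in exchange, a filesystem like the hypothetical tinyfs above pays RAM only for the members it actually keeps.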
Re: Trimming VFS inodes?
Alexander Viro writes: On Tue, 13 Jun 2000, Richard Gooch wrote: I'd like to see something more drastic. Indeed, that union crap is by far the worst offender and needs fixing. But there's a whole pile of other junk that's just not needed all the time. Richard, may I remind you that we are supposed to be in the freeze? There may be a chance to trim the union down _and_ get it into 2.4. ??? Didn't you read the other parts of my message. Quoting myself: Besides, there's also the problem of getting efficiency improvements into the mainline kernel. I don't expect Linus would let us fix these things so close to 2.4. And here you quote me: Yeah, but 2.4 is too close. Such a change is going to require a fair bit of surgery for all filesystems. So I don't really expect wholesale VFS changes right now (but, hey, that doesn't seem to stop you getting stuff in;-). But that shouldn't stop us talking about where to go from here. You don't need it on all filesystems. So you're thinking of attacking just the worst offenders? I still prefer my idea of splitting the dcache and icache so that you can maintain a populated dentry tree without keeping the inodes hanging around as well. This seems far less invasive and also brings even more space savings. Less invasive??? It requires a lot of changes in the internal VFS locking protocol. And that affects (in non-obvious ways) every friggin' code path in namei.c and dcache.c. It's going to happen, but that's _not_ a 2.4.early stuff. Sorry. Just too high potential of introducing a lot of new and interesting races. I will fork VFS-CURRENT after 2.4.0 release, then such stuff may go there without destabilising 2.4. Maybe some parts will be possible to fold back during 2.4, but complete thing will not be merged until 2.5.early. OK, so you're assuming that shrinking the union will be done by only attacking a small number of filesystems. In that case, it will probably be less invasive than splitting the dcache and icache.
However, ultimately I'd like to see the union thrown out entirely. And also have the dcache and icache split. BTW: for 2.4, my main focus is on ensuring there aren't any races in devfs. The recent changes should make things a lot better :-) Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
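As a rough illustration of the dcache/icache split discussed in this message, the sketch below keeps a dentry alive with a pointer to FS-specific state, lets the VFS inode be dropped, and rebuilds it on demand through a "refill" method. Everything here is an assumption made for illustration: the cut-down struct dentry and struct inode, the refill hook, and the helper names are hypothetical userspace stand-ins, and the sketch deliberately ignores locking, which is exactly the part Viro says touches every code path in namei.c and dcache.c.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-in for the VFS inode. */
struct inode {
    unsigned int i_mode;
    unsigned long i_size;
};

struct dentry;

/* Hypothetical "refill" hook: rebuild a VFS inode from saved FS state. */
struct dentry_ops {
    struct inode *(*refill)(struct dentry *dentry);
};

/* Cut-down stand-in for the dentry. */
struct dentry {
    const char *d_name;
    struct inode *d_inode;            /* may be NULL after memory pressure */
    void *d_fsdata;                   /* FS-specific saved "inode" state */
    const struct dentry_ops *d_op;
};

/* Free the inode but keep the dentry, so the tree stays populated. */
static void dentry_drop_inode(struct dentry *dentry)
{
    free(dentry->d_inode);
    dentry->d_inode = NULL;
}

/* Return the inode, refilling it from FS state if it was dropped. */
static struct inode *dentry_get_inode(struct dentry *dentry)
{
    if (dentry->d_inode == NULL && dentry->d_op && dentry->d_op->refill)
        dentry->d_inode = dentry->d_op->refill(dentry);
    return dentry->d_inode;
}

/* Hypothetical FS state: only what this FS needs to rebuild an inode. */
struct demo_state {
    unsigned int mode;
    unsigned long size;
};

static struct inode *demo_refill(struct dentry *dentry)
{
    const struct demo_state *state = dentry->d_fsdata;
    struct inode *inode;

    assert(state != NULL);
    inode = malloc(sizeof(*inode));
    if (inode == NULL)
        return NULL;
    inode->i_mode = state->mode;
    inode->i_size = state->size;
    return inode;
}

static const struct dentry_ops demo_dentry_ops = {
    .refill = demo_refill,
};
```

Unlike lookup(), the refill hook here never consults a parent directory or a name: it works purely from the state already hanging off the dentry, which is one reading of "similar but not quite the same as the lookup() method".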
Re: Trimming VFS inodes?
On Tue, 13 Jun 2000, Richard Gooch wrote: I'd like to see something more drastic. Indeed, that union crap is by far the worst offender and needs fixing. But there's a whole pile of other junk that's just not needed all the time. Richard, may I remind you that we are supposed to be in the freeze? There may be a chance to trim the union down _and_ get it into 2.4. [snip] Yeah, but 2.4 is too close. Such a change is going to require a fair bit of surgery for all filesystems. You don't need it on all filesystems. I still prefer my idea of splitting the dcache and icache so that you can maintain a populated dentry tree without keeping the inodes hanging around as well. This seems far less invasive and also brings even more space savings. Less invasive??? It requires a lot of changes in the internal VFS locking protocol. And that affects (in non-obvious ways) every friggin' code path in namei.c and dcache.c. It's going to happen, but that's _not_ a 2.4.early stuff. Sorry. Just too high potential of introducing a lot of new and interesting races. I will fork VFS-CURRENT after 2.4.0 release, then such stuff may go there without destabilising 2.4. Maybe some parts will be possible to fold back during 2.4, but complete thing will not be merged until 2.5.early.
Re: Trimming VFS inodes?
On Tue, 13 Jun 2000, Richard Gooch wrote: Yes. And all that time mounting the thing at several points was a huge, fscking hole. Sure. And hence RedHat wasn't going to compile it in. Fine with RedHat, but how in hell does it solve the problem? I don't _CARE_ for any "party line". I don't belong to any fucking parties, no matter where I'm employed. Excuse me, but I had seen enough of that shit in .su and .ru and that's a game I don't play. [snip] OK, but since you never liked devfs in the first place, I'm surprised you would care so much about devfs races. I'd just expect a "don't use devfs" response, rather than all this effort to help clean up devfs. If it is a part of ftp.kernel.org tree and I don't want to fork - too fscking bad, that's part of the things I'm dealing with. What we are paying now is the price of these years when devfs grew larger and larger and accumulated stuff from all layers of VFS. All these changes were not done - you were just sitting on the growing patch and refused to turn it into the set of small patches, each doing one thing and doing it right. Fine, so that work has to be done now. I think that I'm actually getting it quite fine - 3-4 months and most of the infrastructure is built, thank you very much. Try to imagine the shit I've been going through the last 2.5 years with devfs. Flamewar after bloody flamewar (*NOT* about minor things .procmailrc? like devfs races, the merits of devfs multi-mounting vs. VFS bindings, but basic arguments about the very concept). Between the flamewars, tracking constant kernel drift, [check] writing a thesis [check] and maintaining a relationship, [check] I'm surprised I got as much done on it as I did. On top of that, when I finally had more time available, Linus dropped the whole namespace change thing on me.
Besides, if I were to have tried to clean up the VFS first, I expect I would have encountered extreme opposition, as people would have used it as another reason to oppose devfs ("don't bloat the VFS"). And So don't bloat it ;-) If you will compare the size before and after you'll see that no bloat went in. people would oppose the VFS changes because they'd want another obstacle for devfs. No thanks. I wasn't going to get into that fight. Also, trying to maintain multiple dependent patches is a lot of work. [check] Roll on BK. Yes, I had other reasons. This kind of stuff actually has to be done right or not at all. So these changes started they had pulled in quite a bit of other stuff - handling of pseudoroots in binary emulation, for example. But doing all that stuff during the freeze and in effect postponing the release... Not if we had any choice. Unfortunately, we hadn't. There's always a choice. You could always have opted for the RedHat party line: don't use devfs because it's racy. I don't opt for party lines of any description. Besides, it's not a RH release I'm talking about. Linux != RH. And stuff already in the tree is not enough - aside of multiple mounts there is revalidate() problems. So it will take more... IIRC, your concerns here were that devfs "knew" about how revalidates work, and thus if you want to change the VFS, devfs will have to track that. Not only that, actually - order of invalidation was incorrect, IIRC. I'll agree that's not ideal, but given the amount of dependence other filesystems have on VFS subtleties, I don't see why it's the end of Most of them actually has very few dependencies - there are exceptions (HFS, UMSDOS, autofs), but majority is pretty clean in that respect. the world. I don't think there's any races in there, though. Famous last words. BTW: have you looked at my latest devfs patch? Looking at it.
Re: Trimming VFS inodes?
Alexander Viro writes: On Tue, 13 Jun 2000, Richard Gooch wrote: Yes. And all that time mounting the thing at several points was a huge, fscking hole. Sure. And hence RedHat wasn't going to compile it in. Fine with RedHat, but how in hell does it solve the problem? I don't _CARE_ for any "party line". I don't belong to any fucking parties, no matter where I'm employed. Excuse me, but I had seen enough of that shit in .su and .ru and that's a game I don't play. ;-) OK, but since you never liked devfs in the first place, I'm surprised you would care so much about devfs races. I'd just expect a "don't use devfs" response, rather than all this effort to help clean up devfs. If it is a part of ftp.kernel.org tree and I don't want to fork - too fscking bad, that's part of the things I'm dealing with. Well, I'm certainly happy to see the VFS binding stuff (even down to the file/device level) have gone in. Good job. What we are paying now is the price of these years when devfs grew larger and larger and accumulated stuff from all layers of VFS. All these changes were not done - you were just sitting on the growing patch and refused to turn it into the set of small patches, each doing one thing and doing it right. Fine, so that work has to be done now. I think that I'm actually getting it quite fine - 3-4 months and most of the infrastructure is built, thank you very much. Try to imagine the shit I've been going through the last 2.5 years with devfs. Flamewar after bloody flamewar (*NOT* about minor things .procmailrc? Unfortunately procmail doesn't support the following syntax: * CONTENT_just_another_rant_against_devfs so it makes it hard to distinguish between bug reports, feature requests, *new* technical criticisms or alternative suggestions, and just repeat flaming. Besides, if I were to have tried to clean up the VFS first, I expect I would have encountered extreme opposition, as people would have used it as another reason to oppose devfs ("don't bloat the VFS").
And So don't bloat it ;-) If you will compare the size before and after you'll see that no bloat went in. Actually, I've seen one or two gripes about recent VFS changes, so there's always someone who will complain. But it's been my experience that arguments about bloat don't always correlate strongly with reality. And I do know that if *I* had done those VFS changes (with the obvious intent of making devfs itself smaller), then the screams of bloat would have spurt forth. Guilt by association and all that. And stuff already in the tree is not enough - aside of multiple mounts there is revalidate() problems. So it will take more... IIRC, your concerns here were that devfs "knew" about how revalidates work, and thus if you want to change the VFS, devfs will have to track that. Not only that, actually - order of invalidation was incorrect, IIRC. Let me check I understand what you mean. You're concerned about the way I *invalidate*, rather than the way I *revalidate*? So, basically, the order in which I unregister devices and invalidate dentries is where you see a problem? You're not saying there is a problem with the way I do revalidates? BTW: have you looked at my latest devfs patch? Looking at it. Thanks. Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Trimming VFS inodes?
Alexander Viro writes: On Tue, 13 Jun 2000, Richard Gooch wrote: This is a lot closer to being feasible with all the VFS changes you've been making, but there is one problem that really concerns me. VFS inodes are very heavyweight. The devfs entries are very lightweight, storing only that which is necessary. So you could save some code space in devfs, but at the expense of increased data size. Either way, it costs RAM. nods That's a problem. Unfortunately, not the only one - there is revalidation stuff that can also make life painful. Have you given any consideration to coming up with a more lightweight inode structure? Either through modification of the VFS inode, or creation of some kind of "generic" lightweight inode structure that stores the bare essentials. Perhaps it could go like this: dentry->lightweight inode->heavyweight inode. Start from taking ext2, UFS and NFS out of ->u. in struct inode. Yup, separate allocation and inlined function (ext2_ino(inode)) that would return the pointer to private part of inode. I can send you my old (circa 2.2.early) patch that does it for ext2 tonight - hope that will help. I'd like to see something more drastic. Indeed, that union crap is by far the worst offender and needs fixing. But there's a whole pile of other junk that's just not needed all the time. Even with a patch to remove the union bloat, I'm still not keen on using the VFS for devfs storage, as it would be quite a bit more wasteful than the current devfs implementation. Besides, there's also the problem of getting efficiency improvements into the mainline kernel. I don't expect Linus would let us fix these things so close to 2.4. Notice that some filesystems are already keeping private stuff out of struct inode, so similar taking the worst offenders out will not be too complex. You'll need ->clear_inode() releasing the data + foo_new_inode() and foo_read_inode() allocating it.
That's more or less it - minimal patch mostly consists of replacements like inode->u.ext2_i.foo to ext2_ino(inode)->foo. Yeah, but 2.4 is too close. Such a change is going to require a fair bit of surgery for all filesystems. I still prefer my idea of splitting the dcache and icache so that you can maintain a populated dentry tree without keeping the inodes hanging around as well. This seems far less invasive and also brings even more space savings. Do you dislike this approach? If so, why? Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
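The separate-allocation scheme Viro outlines in this exchange (private data taken out of the union, reached through an inlined accessor, allocated at read/new time and released at clear time) could look something like the sketch below. It is a userspace approximation and not his actual patch: the fs_private field name, the cut-down structs, the two ext2_inode_info members shown, and the use of calloc()/free() in place of kernel allocators are all assumptions for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Cut-down stand-in for the ext2-private part of an inode. */
struct ext2_inode_info {
    unsigned int i_flags;
    unsigned int i_block_group;
};

/* Cut-down stand-in for struct inode, with a pointer instead of the union. */
struct inode {
    unsigned long i_ino;
    void *fs_private;                 /* hypothetical field replacing the union */
};

/* Inlined accessor returning the pointer to the private part of the inode. */
static inline struct ext2_inode_info *ext2_ino(struct inode *inode)
{
    return (struct ext2_inode_info *)inode->fs_private;
}

/* Allocation side, in the role of foo_read_inode(). Returns 0 on success. */
static int ext2_read_inode(struct inode *inode)
{
    struct ext2_inode_info *info = calloc(1, sizeof(*info));

    if (info == NULL)
        return -1;
    inode->fs_private = info;
    return 0;
}

/* Release side, in the role of the clear_inode() method. */
static void ext2_clear_inode(struct inode *inode)
{
    free(inode->fs_private);
    inode->fs_private = NULL;
}
```

With this layout, a use that previously read inode->u.ext2_i.i_flags becomes ext2_ino(inode)->i_flags, which is the mechanical replacement the message describes, and struct inode no longer has to be as large as the biggest filesystem's private data.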
Re: Trimming VFS inodes?
Alexander Viro writes: On Tue, 13 Jun 2000, Richard Gooch wrote: So I don't really expect wholesale VFS changes right now (but, hey, that doesn't seem to stop you getting stuff in;-). But that shouldn't They would not be there if not for your ability to get devfs there ;-/ And took three months of piece-wise feeding the fixes into tree. I don't quite see what the urgency was, considering that until this week, devfs has remained relatively unchanged (modulo minor VFS API tweaks) in the midst of this. Surely you had other reasons? Regards, Richard Permanent: [EMAIL PROTECTED] Current: [EMAIL PROTECTED]
Re: Trimming VFS inodes?
On Tue, 13 Jun 2000, Richard Gooch wrote: Alexander Viro writes: On Tue, 13 Jun 2000, Richard Gooch wrote: So I don't really expect wholesale VFS changes right now (but, hey, that doesn't seem to stop you getting stuff in;-). But that shouldn't They would not be there if not for your ability to get devfs there ;-/ And took three months of piece-wise feeding the fixes into tree. I don't quite see what the urgency was, considering that until this week, devfs has remained relatively unchanged (modulo minor VFS API tweaks) in the midst of this. Yes. And all that time mounting the thing at several points was a huge, fscking hole. Surely you had other reasons? DAMN. OK, see here: to fix the situation with devfs (and IMNSHO releasing the stable branch with that situation was impossible) we needed to add a _lot_ of changes in infrastructure. They made sense and had to be done at some point anyway. Not all of them are in the tree, BTW. So it was a choice between removing devfs, not releasing 2.4 at all and doing these changes (and doing them right - otherwise we would just prepare a huge PITA for ourselves) ASAP. There _really_ had been no other options. And changing devfs proper before these changes are done is not too promising. Yes, some of them already are in there, so some stuff in devfs can be done right now. Good. What we are paying now is the price of these years when devfs grew larger and larger and accumulated stuff from all layers of VFS. All these changes were not done - you were just sitting on the growing patch and refused to turn it into the set of small patches, each doing one thing and doing it right. Fine, so that work has to be done now. I think that I'm actually getting it quite fine - 3-4 months and most of the infrastructure is built, thank you very much. Yes, I had other reasons. This kind of stuff actually has to be done right or not at all.
So these changes started they had pulled in quite a bit of other stuff - handling of pseudoroots in binary emulation, for example. But doing all that stuff during the freeze and in effect postponing the release... Not if we had any choice. Unfortunately, we hadn't. By the way, you do realize now why I was less than happy about devfs in the form it had? Because I knew what kind of work did the inclusion mean. And was rather pissed seeing your point-blank refusal to make that work less messy. Grep l-k archives - it's all there. And stuff already in the tree is not enough - aside of multiple mounts there is revalidate() problems. So it will take more...