Re: Trimming VFS inodes?

2000-06-15 Thread Theodore Ts'o

   Date: Tue, 13 Jun 2000 13:10:48 -0400 (EDT)
   From: Alexander Viro [EMAIL PROTECTED]

   Start from taking ext2, UFS and NFS out of ->u. in struct inode. Yup,
   separate allocation and inlined function (ext2_ino(inode)) that would
   return the pointer to private part of inode. I can send you my old (circa
   2.2.early) patch that does it for ext2 tonight - hope that will help.

Can we please save this for 2.5?   If it's not absolutely necessary
to fix a critical bug, I think we're much better off not making changes
to core parts of the kernel at this point.

- Ted



Re: Trimming VFS inodes?

2000-06-14 Thread Hans Reiser

Richard Gooch wrote:
 
   Hi, Al. I'd like to explore an idea Linus suggested a while back. He
 suggested using VFS inodes as the data store for devfs, rather than
 keeping stuff in devfs entries. So the idea would be that the VFS
 maintains the tree structure rather than devfs entries.
 
 This is a lot closer to being feasible with all the VFS changes you've
 been making, but there is one problem that really concerns me. VFS
 inodes are very heavyweight. The devfs entries are very lightweight,
 storing only that which is necessary. So you could save some code
 space in devfs, but at the expense of increased data size. Either way,
 it costs RAM.
 
 Have you given any consideration to coming up with a more lightweight
 inode structure? Either through modification of the VFS inode, or
 creation of some kind of "generic" lightweight inode structure that
 stores the bare essentials. Perhaps it could go like this:
  dentry -> lightweight inode -> heavyweight inode.


 
 Another idea (probably too radical, and also CPU inefficient), is a
 super lightweight inode that has just two pointers: one to FS-specific
 data, another to FS-specific methods. The methods are used to
 read/write inode members, so each FS can decide just how much is worth
 storing.
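That two-pointer scheme could look roughly like the following userspace C sketch. Every name here (struct light_inode, the mock filesystem, its ops table) is invented for illustration; this is not actual VFS code, just the shape of the idea: each filesystem decides how much inode state is worth storing, and generic code goes through the method table.

```c
/* Hypothetical "super lightweight inode": just two pointers, one to
 * FS-specific data and one to FS-specific methods used to read what
 * would otherwise be inode members. */

struct light_inode_ops {
    unsigned long (*get_ino)(void *data);   /* read "i_ino" */
    unsigned long (*get_size)(void *data);  /* read "i_size" */
};

struct light_inode {
    void *data;                        /* FS-specific data */
    const struct light_inode_ops *ops; /* FS-specific methods */
};

/* A mock filesystem that chooses to store only two fields. */
struct mockfs_idata {
    unsigned long ino;
    unsigned long size;
};

static unsigned long mockfs_get_ino(void *data)
{
    return ((struct mockfs_idata *)data)->ino;
}

static unsigned long mockfs_get_size(void *data)
{
    return ((struct mockfs_idata *)data)->size;
}

static const struct light_inode_ops mockfs_iops = {
    .get_ino  = mockfs_get_ino,
    .get_size = mockfs_get_size,
};
```

The CPU inefficiency flagged above is visible in the sketch: every member access becomes an indirect call through the ops table.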

There are desired applications of reiserfs where the VFS inode is just too
heavyweight.  I'd just like to say that this seems like a good concern you have
here, and the ReiserFS team is completely willing to recode in 2.5.* to
accommodate your radical proposal, or some as yet unproposed even better radical
proposal if it comes along, because this is a real issue.  Perhaps the ultimate
lightweight inode would simply mean treating the dcache as optional, and the FS
determining whether to look there for it or sidestep it.

For persons surprised that this is a real issue, let me just mention that there
are persons desiring to put 30 million entry plus hypertext indexes with poor
locality of reference into reiserfs as directories, and one issue is that the
VFS inode costs too much RAM.  For these indexes to be effective one needs to
use stem compression and other such techniques on them just to be able to
prevent being killed by random I/Os to disk when the index is too big for RAM.

 
 Yet another idea is to split the dcache and icache so that you can
 keep dentries around (thus maintaining your tree), with pointers to
 FS-specific data (to save "inode" state), but still free VFS inodes
 when there is memory pressure. This would require a new "refill"
 method, which is similar but not quite the same as the lookup()
 method.

Also interesting.
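A userspace sketch of that split, with every structure and function name invented for illustration (these are not real VFS interfaces): the dentry keeps only a small FS-private blob plus a "refill" method, so the heavyweight inode can be freed under memory pressure and rebuilt on demand.

```c
#include <stdlib.h>

/* Mock of the proposed dcache/icache split: the dentry survives
 * memory reclaim, the inode does not. */

struct inode {
    unsigned long i_ino;
};

struct dentry {
    void *fs_data;                            /* cheap FS-private state */
    struct inode *d_inode;                    /* NULL after reclaim */
    struct inode *(*refill)(struct dentry *); /* like lookup(), but
                                               * re-attaches an inode */
};

/* Memory pressure: free the inode, keep the dentry tree intact. */
static void shrink_inode(struct dentry *d)
{
    free(d->d_inode);
    d->d_inode = NULL;
}

/* Accessor that refills the inode on demand. */
static struct inode *dentry_inode(struct dentry *d)
{
    if (!d->d_inode)
        d->d_inode = d->refill(d);
    return d->d_inode;
}

/* Example refill method: rebuild the inode from the FS-private blob. */
static struct inode *mock_refill(struct dentry *d)
{
    struct inode *ino = malloc(sizeof(*ino));
    if (ino)
        ino->i_ino = *(unsigned long *)d->fs_data;
    return ino;
}
```

The difference from lookup() is that refill() starts from an existing dentry rather than a name, which is why it is "similar but not quite the same".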

 
 I have two basic questions:
 
 - do you see merit in some kind of cheaper inode structure
 
 - how would you like to see it done?
 
 Regards,
 
 Richard
 Permanent: [EMAIL PROTECTED]
 Current:   [EMAIL PROTECTED]

This looks like the start of an interesting discussion. :)



Re: Trimming VFS inodes?

2000-06-14 Thread Richard Gooch

Alexander Viro writes:
 
 On Tue, 13 Jun 2000, Richard Gooch wrote:
 
  I'd like to see something more drastic. Indeed, that union crap is by
  far the worst offender and needs fixing. But there's a whole pile of
  other junk that's just not needed all the time.
 
 Richard, may I remind you that we are supposed to be in the freeze?
 There may be a chance to trim the union down _and_ get it into 2.4.

??? Didn't you read the other parts of my message? Quoting myself:

 Besides, there's also the problem of getting efficiency improvements
 into the mainline kernel. I don't expect Linus would let us fix these
 things so close to 2.4.

And here you quote me:
  Yeah, but 2.4 is too close. Such a change is going to require a fair
  bit of surgery for all filesystems.

So I don't really expect wholesale VFS changes right now (but, hey,
that doesn't seem to stop you getting stuff in;-). But that shouldn't
stop us talking about where to go from here.

 You don't need it on all filesystems.

So you're thinking of attacking just the worst offenders?

  I still prefer my idea of splitting the dcache and icache so that you
  can maintain a populated dentry tree without keeping the inodes
  hanging around as well. This seems far less invasive and also brings
  even more space savings.
 
 Less invasive??? It requires a lot of changes in the internal VFS
 locking protocol. And that affects (in non-obvious ways) every
 friggin' code path in namei.c and dcache.c. It's going to happen,
 but that's _not_ a 2.4.early stuff. Sorry. Just too high potential
 of introducing a lot of new and interesting races. I will fork
 VFS-CURRENT after 2.4.0 release, then such stuff may go there
 without destabilising 2.4. Maybe some parts will be possible to fold
 back during 2.4, but complete thing will not be merged until
 2.5.early.

OK, so you're assuming that shrinking the union will be done by only
attacking a small number of filesystems. In that case, it will
probably be less invasive than splitting the dcache and icache.
However, ultimately I'd like to see the union thrown out entirely.
And also have the dcache and icache split.

BTW: for 2.4, my main focus is on ensuring there aren't any races in
devfs. The recent changes should make things a lot better :-)

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: Trimming VFS inodes?

2000-06-14 Thread Alexander Viro



On Tue, 13 Jun 2000, Richard Gooch wrote:

 I'd like to see something more drastic. Indeed, that union crap is by
 far the worst offender and needs fixing. But there's a whole pile of
 other junk that's just not needed all the time.

Richard, may I remind you that we are supposed to be in the freeze? There
may be a chance to trim the union down _and_ get it into 2.4.

[snip]
 Yeah, but 2.4 is too close. Such a change is going to require a fair
 bit of surgery for all filesystems.

You don't need it on all filesystems.

 I still prefer my idea of splitting the dcache and icache so that you
 can maintain a populated dentry tree without keeping the inodes
 hanging around as well. This seems far less invasive and also brings
 even more space savings.

Less invasive??? It requires a lot of changes in the internal VFS locking
protocol. And that affects (in non-obvious ways) every friggin' code path
in namei.c and dcache.c. It's going to happen, but that's _not_ a
2.4.early stuff. Sorry. Just too high potential of introducing a lot of
new and interesting races. I will fork VFS-CURRENT after 2.4.0 release,
then such stuff may go there without destabilising 2.4. Maybe some parts
will be possible to fold back during 2.4, but complete thing will not be
merged until 2.5.early.




Re: Trimming VFS inodes?

2000-06-14 Thread Alexander Viro



On Tue, 13 Jun 2000, Richard Gooch wrote:

  Yes. And all that time mounting the thing at several points was a huge,
  fscking hole.
 
 Sure. And hence RedHat wasn't going to compile it in.

Fine with RedHat, but how in hell does it solve the problem? I don't
_CARE_ for any "party line". I don't belong to any fucking parties, no
matter where I'm employed. Excuse me, but I had seen enough of that shit
in .su and .ru and that's a game I don't play.

[snip]

 OK, but since you never liked devfs in the first place, I'm surprised
 you would care so much about devfs races. I'd just expect a "don't use
 devfs" response, rather than all this effort to help clean up devfs.

If it is a part of ftp.kernel.org tree and I don't want to fork - too
fscking bad, that's part of the things I'm dealing with.

  What we are paying now is the price of these years when devfs grew
  larger and larger and accumulated stuff from all layers of VFS. All
  these changes were not done - you were just sitting on the growing
  patch and refused to turn it into the set of small patches, each
  doing one thing and doing it right. Fine, so that work has to be
  done now. I think that I'm actually getting it quite fine - 3-4
  months and most of the infrastructure is built, thank you very much.
 
 Try to imagine the shit I've been going through the last 2.5 years
 with devfs. Flamewar after bloody flamewar (*NOT* about minor things
.procmailrc?
 like devfs races, the merits of devfs multi-mounting vs. VFS bindings,
 but basic arguments about the very concept). Between the flamewars,
 tracking constant kernel drift,
[check]
 writing a thesis
[check]
 and maintaining a
 relationship,
[check]
 I'm surprised I got as much done on it as I did. On top
 of that, when I finally had more time available, Linus dropped the
 whole namespace change thing on me.

 Besides, if I were to have tried to clean up the VFS first, I expect I
 would have encountered extreme opposition, as people would have used
 it as another reason to oppose devfs ("don't bloat the VFS"). And

So don't bloat it ;-) If you will compare the size before and after you'll
see that no bloat went in.

 people would oppose the VFS changes because they'd want another
 obstacle for devfs. No thanks. I wasn't going to get into that fight.

 Also, trying to maintain multiple dependent patches is a lot of work.
[check]
 Roll on BK.
 
  Yes, I had other reasons. This kind of stuff actually has to be done
  right or not at all. So when these changes started, they pulled in
  quite a bit of other stuff - handling of pseudoroots in binary
  emulation, for example. But doing all that stuff during the freeze
  and in effect postponing the release... Not if we had any
  choice. Unfortunately, we hadn't.
 
 There's always a choice. You could always have opted for the RedHat
 party line: don't use devfs because it's racy.

I don't opt for party lines of any description. Besides, it's not a RH
release I'm talking about. Linux != RH.

  And stuff already in the tree is not enough - aside of multiple
  mounts there are revalidate() problems. So it will take more...
 
 IIRC, your concerns here were that devfs "knew" about how revalidates
 work, and thus if you want to change the VFS, devfs will have to track
 that.

Not only that, actually - order of invalidation was incorrect, IIRC.

 I'll agree that's not ideal, but given the amount of dependence other
 filesystems have on VFS subtleties, I don't see why it's the end of

Most of them actually have very few dependencies - there are exceptions
(HFS, UMSDOS, autofs), but the majority is pretty clean in that respect.

 the world. I don't think there's any races in there, though.

Famous last words.

 BTW: have you looked at my latest devfs patch?

Looking at it.




Re: Trimming VFS inodes?

2000-06-14 Thread Richard Gooch

Alexander Viro writes:
 
 
 On Tue, 13 Jun 2000, Richard Gooch wrote:
 
   Yes. And all that time mounting the thing at several points was a huge,
   fscking hole.
  
  Sure. And hence RedHat wasn't going to compile it in.
 
 Fine with RedHat, but how in hell does it solve the problem? I don't
 _CARE_ for any "party line". I don't belong to any fucking parties, no
 matter where I'm employed. Excuse me, but I had seen enough of that shit
 in .su and .ru and that's a game I don't play.

;-)

  OK, but since you never liked devfs in the first place, I'm surprised
  you would care so much about devfs races. I'd just expect a "don't use
  devfs" response, rather than all this effort to help clean up devfs.
 
 If it is a part of ftp.kernel.org tree and I don't want to fork -
 too fscking bad, that's part of the things I'm dealing with.

Well, I'm certainly happy to see the VFS binding stuff (even down to
the file/device level) has gone in. Good job.

   What we are paying now is the price of these years when devfs grew
   larger and larger and accumulated stuff from all layers of VFS. All
   these changes were not done - you were just sitting on the growing
   patch and refused to turn it into the set of small patches, each
   doing one thing and doing it right. Fine, so that work has to be
   done now. I think that I'm actually getting it quite fine - 3-4
   months and most of the infrastructure is built, thank you very much.
  
  Try to imagine the shit I've been going through the last 2.5 years
  with devfs. Flamewar after bloody flamewar (*NOT* about minor things
 .procmailrc?

Unfortunately procmail doesn't support the following syntax:
* CONTENT_just_another_rant_against_devfs

so it makes it hard to distinguish between bug reports, feature
requests, *new* technical criticisms or alternative suggestions, and
just repeat flaming.

  Besides, if I were to have tried to clean up the VFS first, I expect I
  would have encountered extreme opposition, as people would have used
  it as another reason to oppose devfs ("don't bloat the VFS"). And
 
 So don't bloat it ;-) If you will compare the size before and after
 you'll see that no bloat went in.

Actually, I've seen one or two gripes about recent VFS changes, so
there's always someone who will complain. But it's been my experience
that arguments about bloat don't always correlate strongly with
reality.

And I do know that if *I* had done those VFS changes (with the obvious
intent of making devfs itself smaller), then the screams of bloat
would have spurted forth. Guilt by association and all that.

   And stuff already in the tree is not enough - aside of multiple
   mounts there are revalidate() problems. So it will take more...
  
  IIRC, your concerns here were that devfs "knew" about how revalidates
  work, and thus if you want to change the VFS, devfs will have to track
  that.
 
 Not only that, actually - order of invalidation was incorrect, IIRC.

Let me check I understand what you mean. You're concerned about the
way I *invalidate*, rather than the way I *revalidate*?

So, basically, the order in which I unregister devices and invalidate
dentries is where you see a problem?

You're not saying there is a problem with the way I do revalidates?

  BTW: have you looked at my latest devfs patch?
 
 Looking at it.

Thanks.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: Trimming VFS inodes?

2000-06-13 Thread Richard Gooch

Alexander Viro writes:
 On Tue, 13 Jun 2000, Richard Gooch wrote:
 
  This is a lot closer to being feasible with all the VFS changes you've
  been making, but there is one problem that really concerns me. VFS
  inodes are very heavyweight. The devfs entries are very lightweight,
  storing only that which is necessary. So you could save some code
  space in devfs, but at the expense of increased data size. Either way,
  it costs RAM.
 
 *nods* That's a problem. Unfortunately, not the only one - there is
 revalidation stuff that can also make life painful.
 
  Have you given any consideration to coming up with a more lightweight
  inode structure? Either through modification of the VFS inode, or
  creation of some kind of "generic" lightweight inode structure that
  stores the bare essentials. Perhaps it could go like this:
  dentry -> lightweight inode -> heavyweight inode.
 
 Start from taking ext2, UFS and NFS out of ->u. in struct
 inode. Yup, separate allocation and inlined function
 (ext2_ino(inode)) that would return the pointer to private part of
 inode. I can send you my old (circa 2.2.early) patch that does it
 for ext2 tonight - hope that will help.

I'd like to see something more drastic. Indeed, that union crap is by
far the worst offender and needs fixing. But there's a whole pile of
other junk that's just not needed all the time.

Even with a patch to remove the union bloat, I'm still not keen on
using the VFS for devfs storage, as it would be quite a bit more
wasteful than the current devfs implementation.

Besides, there's also the problem of getting efficiency improvements
into the mainline kernel. I don't expect Linus would let us fix these
things so close to 2.4.

 Notice that some filesystems are already keeping private stuff out
 of struct inode, so similar taking the worst offenders out will not
 be too complex. You'll need ->clear_inode() releasing the data +
 foo_new_inode() and foo_read_inode() allocating it. That's more or
 less it - minimal patch mostly consists of replacements like
 inode->u.ext2_i.foo to ext2_ino(inode)->foo.
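A minimal userspace mock of what that replacement looks like. The real patch operates on the kernel's struct inode; here the field names and the i_private hook are illustrative stand-ins:

```c
#include <stdlib.h>

/* Mock of moving the ext2 private data out of the union: allocate it
 * separately and reach it through an inline accessor instead of
 * inode->u.ext2_i. All names are illustrative, not kernel code. */

struct ext2_inode_info {           /* the former union member */
    unsigned long i_block_group;
    unsigned int  i_flags;
};

struct inode {                     /* trimmed generic inode (mock) */
    unsigned long i_ino;
    void *i_private;               /* hypothetical hook for FS data */
};

static inline struct ext2_inode_info *ext2_ino(struct inode *inode)
{
    return (struct ext2_inode_info *)inode->i_private;
}

/* foo_read_inode() allocates the private part... */
static int mock_ext2_read_inode(struct inode *inode)
{
    struct ext2_inode_info *ei = calloc(1, sizeof(*ei));
    if (!ei)
        return -1;
    inode->i_private = ei;
    return 0;
}

/* ...and ->clear_inode() releases it. */
static void mock_ext2_clear_inode(struct inode *inode)
{
    free(inode->i_private);
    inode->i_private = NULL;
}
```

Every access that read inode->u.ext2_i.foo becomes ext2_ino(inode)->foo, so the generic inode pays only for one pointer instead of the largest union member.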

Yeah, but 2.4 is too close. Such a change is going to require a fair
bit of surgery for all filesystems.

I still prefer my idea of splitting the dcache and icache so that you
can maintain a populated dentry tree without keeping the inodes
hanging around as well. This seems far less invasive and also brings
even more space savings.
Do you dislike this approach? If so, why?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: Trimming VFS inodes?

2000-06-13 Thread Richard Gooch

Alexander Viro writes:
 
 On Tue, 13 Jun 2000, Richard Gooch wrote:
  So I don't really expect wholesale VFS changes right now (but, hey,
  that doesn't seem to stop you getting stuff in;-). But that shouldn't
 
 They would not be there if not for your ability to get devfs there ;-/
 And it took three months of piece-wise feeding the fixes into the tree.

I don't quite see what the urgency was, considering that until this
week, devfs has remained relatively unchanged (modulo minor VFS API
tweaks) in the midst of this.

Surely you had other reasons?

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]



Re: Trimming VFS inodes?

2000-06-13 Thread Alexander Viro



On Tue, 13 Jun 2000, Richard Gooch wrote:

 Alexander Viro writes:
  
  On Tue, 13 Jun 2000, Richard Gooch wrote:
   So I don't really expect wholesale VFS changes right now (but, hey,
   that doesn't seem to stop you getting stuff in;-). But that shouldn't
  
  They would not be there if not for your ability to get devfs there ;-/
  And it took three months of piece-wise feeding the fixes into the tree.
 
 I don't quite see what the urgency was, considering that until this
 week, devfs has remained relatively unchanged (modulo minor VFS API
 tweaks) in the midst of this.

Yes. And all that time mounting the thing at several points was a huge,
fscking hole.

 Surely you had other reasons?

DAMN. OK, see here: to fix the situation with devfs (and IMNSHO releasing
the stable branch with that situation was impossible) we needed to add a
_lot_ of changes in infrastructure. They made sense and had to be done at
some point anyway. Not all of them are in the tree, BTW. So it was a
choice between removing devfs, not releasing 2.4 at all and doing these
changes (and doing them right - otherwise we would just prepare a huge
PITA for ourselves) ASAP. There _really_ had been no other options. And
changing devfs proper before these changes are done is not too promising.
Yes, some of them already are in there, so some stuff in devfs can be done
right now. Good.

What we are paying now is the price of these years when devfs grew larger
and larger and accumulated stuff from all layers of VFS. All these changes
were not done - you were just sitting on the growing patch and refused to 
turn it into the set of small patches, each doing one thing and doing it
right. Fine, so that work has to be done now. I think that I'm actually
getting it quite fine - 3-4 months and most of the infrastructure is 
built, thank you very much.

Yes, I had other reasons. This kind of stuff actually has to be done
right or not at all. So when these changes started, they pulled in quite a
bit of other stuff - handling of pseudoroots in binary emulation, for
example. But doing all that stuff during the freeze and in effect
postponing the release... Not if we had any choice. Unfortunately, we
hadn't.

By the way, you do realize now why I was less than happy about devfs in
the form it had? Because I knew what kind of work the inclusion would mean.
And was rather pissed seeing your point-blank refusal to make that work
less messy. Grep l-k archives - it's all there.

And stuff already in the tree is not enough - aside of multiple mounts
there are revalidate() problems. So it will take more...