Re: hfs support for blocksize != 512

2000-09-01 Thread Roman Zippel

Hi,

On Thu, 31 Aug 2000, Alexander Viro wrote:

> Go ahead, write it. IMNSHO it's going to be much more complicated and
> race-prone, but code talks. If you will manage to write it in clear and
> race-free way - fine. Frankly, I don't believe that it's doable.

It will be more complicated insofar as I want to use a richer state
machine than "locked <-> unlocked"; on the other hand, I can avoid such
funny constructions as triple_down() and obscure locking order rules.

At any time the object will be either locked or in a well-defined state,
and at any time a thread holds the lock on at most a single object. (I
hope some pseudocode will do for the beginning, too?) Most namespace
operations work simply like a semaphore:

restart:
lock(dentry);
if (dentry is busy) {
        unlock(dentry);
        sleep();
        goto restart;
}
dentry->state = busy;
unlock(dentry);
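
For concreteness, a minimal sketch of how this busy/idle protocol might
look with kernel wait queues - the d_state/d_wait/d_state_lock fields and
both helpers are hypothetical, not part of any existing patch:

enum dentry_state { D_IDLE, D_BUSY, D_MOVING, D_DELETED };

/* Mark a dentry busy, first sleeping until it becomes idle.
 * The (hypothetical) d_state_lock spinlock protects d_state;
 * d_wait is woken whenever a state is reset to idle. */
static void dentry_make_busy(struct dentry *dentry)
{
        spin_lock(&dentry->d_state_lock);
        while (dentry->d_state != D_IDLE) {
                spin_unlock(&dentry->d_state_lock);
                wait_event(dentry->d_wait, dentry->d_state == D_IDLE);
                spin_lock(&dentry->d_state_lock);
        }
        dentry->d_state = D_BUSY;
        spin_unlock(&dentry->d_state_lock);
}

/* Reset the state and wake up everyone sleeping on this dentry. */
static void dentry_make_idle(struct dentry *dentry)
{
        spin_lock(&dentry->d_state_lock);
        dentry->d_state = D_IDLE;
        spin_unlock(&dentry->d_state_lock);
        wake_up(&dentry->d_wait);
}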

If the operation is finished, the state is reset and everyone sleeping is
woken up. Ok, let's come to the most interesting operation - rename():

restart:
lock(olddentry);
if (olddentry is busy) {
        unlock(olddentry);
        sleep();
        goto restart;
}
olddentry->state = moving;
unlock(olddentry);

restart2:
lock(newdentry);
if (newdentry->state == moving) {
        lock(renamelock);
        if (olddentry->state == deleted) {
                unlock(renamelock);
                unlock(newdentry);
                sleep();
                goto restart;
        }
        newdentry->state = deleted;
        unlock(renamelock);
} else if (newdentry is busy) {
        unlock(newdentry);
        sleep();
        goto restart2;
} else
        newdentry->state = deleted;
unlock(newdentry);

if (!rename_valid(olddentry, newdentry)) {
        lock(newdentry);
        newdentry->state = idle;
        unlock(newdentry);
        lock(olddentry);
        olddentry->state = idle;
        unlock(olddentry);
        wakeup_sleepers();
        return;
}

if (newdentry exists)
        unlink(newdentry);
do_rename(olddentry, newdentry);

lock(newdentry);
newdentry->state = idle;
unlock(newdentry);
lock(olddentry);
olddentry->state = deleted;
unlock(olddentry);
wakeup_sleepers();
return;

Note that I don't touch any inode here; everything happens in the dcache.
That means I move the complete inode locking into the fs; all I do here is
make sure that while operation("foo") is busy, no other operation will
use "foo".
IMO this should work. I tried it with a rename("foo", "bar") and a
concurrent rename("bar", "foo"):
case 1: one rename gets both dentries busy; the other rename will wait
until it's finished.
case 2: both mark their old dentry as moving and find the new dentry also
moving. To make the rename atomic the global rename lock is needed: one
rename will find that its old dentry isn't moving anymore and has to
restart and wait; the other rename will complete.

Other operations will keep only one dentry busy, so I don't see a
problem here. If you don't find any major problem, I'm going to try
this, since if it works, it will have some other advantages:
- a user-space fs becomes possible that can't even deadlock the
system. The first restart loop can easily be made interruptible (see the
sketch after this list), so it can be safely killed. (I don't really want
to know what a triple_down_interruptible() would look like, not to mention
the other three locks (+ BKL) taken during a rename.)
- I can imagine better support for hfs. It could access the other fork
without excessive locking (I think currently it doesn't even try to).
The order in which the forks are created could then change, too.
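
Here is the sketch of the interruptible variant of that first restart
loop, in the same pseudocode style as above (signal_pending() is the real
kernel predicate; everything else stays pseudocode):

restart:
if (signal_pending(current))
        return -ERESTARTSYS;    /* nothing is locked or marked at this
                                   point, so the caller can be killed */
lock(dentry);
if (dentry is busy) {
        unlock(dentry);
        sleep_interruptible();
        goto restart;
}
dentry->state = busy;
unlock(dentry);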

> BTW, I really wonder what kind of locks are you going to have on _blocks_
> (you've mentioned that, unless I've misparsed what you've said). IMO that
> way lies the horror, but hey, code talks.

I thought about catching bread() calls, but while thinking about it,
there should be other ways too. But that's fs-specific; let's concentrate
on the generic part first.

> You claim that it's doable. I seriously doubt it. Nobody knows your ideas
> better than you do, so... come on, demonstrate the patch.

I think the above example should do basically the same as a do-nothing
patch within affs would.
I hope the example shows two important ideas (no idea if they will save
the world, but I'm willing to learn):
- I use the dcache instead of the inode to synchronize namespace
operations, which IMO makes quite a lot of sense, since it represents our
(cached) view of the fs.
- Using states instead of a semaphore makes it easy to detect
e.g. a rename loop.

bye, Roman




Re: hfs support for blocksize != 512

2000-08-31 Thread Alexander Viro


[snip the plans for AFFS]

You know what? Try it. If your scheme is doable at all (I _very_ seriously
doubt it, since I've seen similar attempts on FAT-derived filesystems and
I remember very well what horror it was) it is doable with private locks.
Just take your locks always after the VFS is done with getting its locks
and you can forget about the locking done in VFS - the only effect will be
that you will see (possibly) fewer simultaneous calls. Which should reduce
the pressure on your mechanisms, so if they can work by themselves - they
will work.

Go ahead, write it. IMNSHO it's going to be much more complicated and
race-prone, but code talks. If you will manage to write it in clear and
race-free way - fine. Frankly, I don't believe that it's doable.

Several things to watch for:
* opened unlinked files should remain available until the last
process closes the file.
* if foo and bar exist there should be no interval during the
rename(foo, bar) when open(bar,...) would fail.
* busy directories can be removed.
* ... and that includes rename() over them.
* large intervals when power-off would lead to unrecoverable fs
are bad. I'm not talking about full protection, but several seconds of
inactivity (i.e. no new requests being submitted) should be enough even on
floppies. You will get dirty fs, indeed, but it shouldn't be in
catastrophically bad state.

BTW, I really wonder what kind of locks are you going to have on _blocks_
(you've mentioned that, unless I've misparsed what you've said). IMO that
way lies the horror, but hey, code talks.

Right now the thing doesn't even work reliably. If you claim that your
design will reduce contention once the VFS gets out of the way - better
yet, but let's first see if it will work and will be readable.

Allocation problems are not going to enter the game - on AFFS you've got
no sparse files and thus all allocation is process-synchronous. Moreover,
you can count on the fact that truncate and allocation attempts on a file 
are not going to clash (that includes the lack of clashes between
allocations).

You claim that it's doable. I seriously doubt it. Nobody knows your ideas
better than you do, so... come on, demonstrate the patch.




Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel

Hi,

> > - get dentry foo
> > - get dentry baz
> 
> How? OK, you've found block of baz. You know the name, all right.

Links are chained together and all point back to the original, so if you
remove the original, you have quite something to do with lots of links.

> Now
> you've got to do the full tracing all the way back to root.

All file headers have a pointer to the dir header, so it's not that
difficult, but that's what makes links to directories so interesting. :)
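
For readers without the AFFS layout in their head, the relationships
being described look roughly like this (a simplified view, not the exact
on-disk format):

/* Every object header sits in a singly-linked directory hash
 * chain and points back at its directory; hard links form a
 * second chain rooted at the original header. */
struct affs_header {
        u32 hash_chain;         /* next header in the directory hash chain */
        u32 parent;             /* block number of the directory header */
        u32 link_chain;         /* next hard link referring to this object */
        u32 original;           /* in a link header: block of the real object */
};

Unlinking the original therefore means visiting every header on
link_chain, or promoting one of the links to be the new original.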

Anyway, I'd better try to describe the idea more generally:
The basic idea is to introduce transient states to the vfs and to move the
locking into the fs, which probably knows better what needs to be
protected. This would avoid the current locking overkill. Let's take a
rename: first we mark the object as to be moved; no need to keep it locked
after this. An open on this object would either fail or have to wait (on a
separate queue). Next we mark the destination dir as not removable. This
is basically the job of the vfs so far; the next steps happen in the fs.
(I use affs here as an example.) First we lock the source dir, remove the
object from the chain and unlock the dir. Now I can lock the destination,
insert the object there and unlock the dir. (Back to the vfs.) All we have
to do now is to restore the state of the destination dir and the object,
and to wake up anyone who's waiting.
Back to the original example of removing a file with links: I have to get
the dentry of baz, as I have to prevent a lookup of that link while I'm
modifying its block. But I think it's enough to lock that block and check
only the cached aliases. Then I can modify that block and unlock it again.

> > - update file header baz from file header foo
> 
> If it would be that simple... Extent blocks refer to foo, unfortunately.
> Yes, copying the thing would be easier. Too bad, data structure prohibits
> that.

Which data structure prohibits that?
Updating the extent blocks isn't that difficult, as the back links are not
needed for general operation; it's just wasted I/O. A bit more problematic
are concurrent readers of foo, so I can't simply trash the buffer of foo's
file header, but I can simply keep it allocated till the file is closed
(which also keeps the inode number constant and unique).

> Well, consider rename over the primary link and there you go... Keep in
> mind that extent blocks contain the reference to header block, so unless
> you want to update them all you've got to move the header into donor's
> chain ;-/

Oops, I just read rename(2) and noticed that I forgot about a small
detail. Ok, the above rename operation gets slightly more difficult.
Basically it's only a variation of the unlink problem: I first unlink the
old file and then insert the new file. As I do less locking, I shouldn't
have a locking problem - or what do I miss? I just might have to update
lots of back links, but that is not a critical part.

[I can skip the affs history part, I just see you already got a better
answer than I could give.]

bye, Roman




Re: hfs support for blocksize != 512

2000-08-31 Thread Alexander Viro



On Thu, 31 Aug 2000, J. Dow wrote:

> > being a jaded bastard I suspect that Commodore PHBs decided to save a
> > bit on floppy controller price and did it well after the initial design
> 
> Comododo PHBs had nothing to do with it. And the Commododo floppy
> disk format is quite literally unreadable with a PC style controller. It was
> not an economic decision. If you are going to carp please do so from a
> basis of real knowledge Alexander. (The REAL blame for the disk fiasco
> goes to the people at Metacrap^H^H^H^HComCo.)

Hey, I've clearly said that I don't know which idiot was responsible for
that fsckup. 

> > was done and so close to release that redesign was impossible for
> > schedule reasons, but it might be something else. We'll probably never
> > know unless somebody who had been in the original design team will leak
> > it. But whatever reasons were behind that decision, OFS was either blindly
> > copied without a single thought about very serious design factor _or_
> > had been crippled at some point before the release. If it's the latter - I
> > commiserate with their fs folks. If it's the former... well, I think that
> > it says quite a few things about their clue level.
> 
> Metacomco designed it based on their TripOS. OFS is very good for
> repairing the filesystem in the event of a problem, although the so called
> DiskDoctor they provided quickly earned the name DiskDestroyer.
> Metacomco and BSTRINGS and BPOINTERS and all that nonsense
> entered the picture when it was decided the originally planned OS was
> would take too long to develop. So what Metacomco had was grafted
> onto what the old Amiga Inc had done resulting in a hodgepodge
> mess.

Umm... Interesting. Could somebody familiar with TripOS tell me what
sector size it had? IOW, did it keep the metadata out-of-band or not?

[snip]

> old cruft is preserved for reading old disks. Later on DirCache was added
> principally for floppy disks. About that time Randall added both so-called
> soft links and hard links. For what it is worth it took a long long time and
> series of modifications before either of them worked adequately.

Egads... Please, pass him my compliments - one has to be _really_
perverted to do the hardlinks that way. Even the QNX way of handling that
(move them into magical place after the first rename()/link() and leave
the dud in the old place) is much saner.

> > And let's not go into the links to directories, implemented well
> > after it became painfully obvious that they were an invitation for
> > troubles (from looking into Amiga newsgroups it seems that miracle
> > didn't happen - I've seen quite a few complaints about fs breakage
> > answered with "don't use links to directories, they are broken").
> 
> They MAY be fixed in the OS3.5 BoingBag 2 (service pack 2 with a
> cutsiepie name.) Heinz has committed yet another rewrite.

Ouch... Why did he do them (links to directories, that is), in the
first place?

> > Anyway, it's all history. We can't unroll the kludge, no matter
> > what we do. We've got what we've got. And I'm not too interested in
> > distribution of the blame between the people in team that seems to be
> > dissolved years ago. I consider AFFS we have to deal with as a poor excuse
> > of design and I think that it gives more than enough reasons for that.
> > In alternative history it might be better. So might many other things.
> 
> Indeed, poor or not it exists and we live with it in the Amiga community.
> (Um, I wonder if I could talk Hendrix into a copy of the source for SFS so
> it could be ported to Linux. These days I prefer it to FFS. {^_-})

Hmm... What, format description is not available?

> If you want I can bend your ear on things Amiga for longer than your
> patience stretches, I suspect. (I've been following the threads discussions

alt.folklore.computers is -> that way ;-) Let's take it there...

> because there is a project I'd like to port from NT to Linux that just ain't
> gonna make it until some nice threads are added and latencies drop
> dramatically. RT_Linux may be overkill. But as it sits today Linux is
> underkill when you need 1/4 frame and less timing latencies on Show
> Control operations. )

ObWTF: WTF did these guys drop QNX when they clearly wanted RTOS? Do they
have somebody who
a) knew the difference between RT and TS and
b) knew that Linux is TS?




Re: hfs support for blocksize != 512

2000-08-31 Thread J. Dow

Quoth a misinformed Alexander Viro re AFFS,
> As for the silliness of the OFS... I apologize for repeating the
> story if you know it already, but anyway: OFS looks awfully similar to
> Alto filesystem. With one crucial difference: Alto kept the header/footer
> equivalents in the sector framing. No silly 400-odd byte sectors for them.
> That layout made a lot of sense - you could easily recover from many disk
> faults, yodda, yodda, _without_ sacrificing performance. The whole design
> relied on ability to put pieces of metadata in the sector framing. Take
> that away and you've lost _very_ large part of the benefits. So large that
> the whole design ought to be rethought - tradeoffs change big way.
>
> OFS took that away. Mechanically. It just stuffed the headers into
> the data part of sectors. I don't know the story behind that decision -
> being a jaded bastard I suspect that Commodore PHBs decided to save a
> bit on floppy controller price and did it well after the initial design

Comododo PHBs had nothing to do with it. And the Commododo floppy
disk format is quite literally unreadable with a PC style controller. It was
not an economic decision. If you are going to carp please do so from a
basis of real knowledge Alexander. (The REAL blame for the disk fiasco
goes to the people at Metacrap^H^H^H^HComCo.)

> was done and so close to release that redesign was impossible for
> schedule reasons, but it might be something else. We'll probably never
> know unless somebody who had been in the original design team will leak
> it. But whatever reasons were behind that decision, OFS was either blindly
> copied without a single thought about very serious design factor _or_
> had been crippled at some point before the release. If it's the latter - I
> commiserate with their fs folks. If it's the former... well, I think that
> it says quite a few things about their clue level.

Metacomco designed it based on their TripOS. OFS is very good for
repairing the filesystem in the event of a problem, although the so-called
DiskDoctor they provided quickly earned the name DiskDestroyer.
Metacomco and BSTRINGS and BPOINTERS and all that nonsense
entered the picture when it was decided the originally planned OS
would take too long to develop. So what Metacomco had was grafted
onto what the old Amiga Inc had done, resulting in a hodgepodge
mess.

> AFFS took the headers out of the data sectors. But that killed the
> whole reason behind having them anywhere - if you can't tell data blocks
> from the rest, what's the point of marking free and metadata ones?

> Now, links were total lossage - I think that even if you have some

Kemo Sabe, links never existed UNTIL the Amiga FFS was developed,
redeveloped, and redeveloped again.

> doubts about that now, you will lose them when you will write down the
> operations needed for rename(). And I mean pure set of on-disk changes -
> forget about dentries, inodes and other in-core data.
>
> Why did they do it that way? Beats me. AmigaOS is a microkernel,
> so replacing fs driver should be very easy. It ought to be easier than in
> Linux. And they've pulled out the change from OFS to AFFS, so the
> filesystem conversion was not an issue. Dunno how about UNIX-friendliness,
> but their implementation of links definitely was not friendly to their own
> OS.

As it turns out many of the recovery tools people built worked remarkably
well on FFS when it was introduced, with little modification. (Most of the
time tracing the actual data blocks was not necessary for rebuilding the
disk. Thus the data-block metadata loss was not crippling.) FFS appeared
in its first versions with AmigaDOS 1.3. (Er, if you want a copy of some of
the earliest versions sent to developers for testing I can arrange something
in that regard. I believe I still have most of that "stuff".) It underwent
several rewrites as successive developers and demands were placed on it.
One major change is evidenced in the hash algorithm used for the original
OFS and FFS: it fails to treat international characters correctly when
removing case. The international version corrected this deficiency. The
old cruft is preserved for reading old disks. Later on DirCache was added,
principally for floppy disks. About that time Randall added both so-called
soft links and hard links. For what it is worth it took a long, long time
and a series of modifications before either of them worked adequately.
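
For reference, the hash in question looks roughly like this: the OFS/FFS
original upcased only ASCII, and the "international" variant extended the
upcasing to latin-1 (a sketch from memory, not the shipped code):

/* AFFS directory name hash. With intl == 0 only a-z are upcased,
 * which breaks case-insensitivity for latin-1 characters; the
 * international variant (intl == 1) fixes the upcasing. */
static unsigned int affs_name_hash(const unsigned char *name, int len, int intl)
{
        unsigned int hash = len;
        int i;

        for (i = 0; i < len; i++) {
                unsigned int c = name[i];

                if (c >= 'a' && c <= 'z')
                        c -= 'a' - 'A';
                else if (intl && c >= 0xe0 && c <= 0xfe && c != 0xf7)
                        c -= 0x20;      /* latin-1 lowercase letters */
                hash = (hash * 13 + c) & 0x7ff;
        }
        return hash % 72;       /* 72 hash slots in a 512-byte dir block */
}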

> And let's not go into the links to directories, implemented well
> after it became painfully obvious that they were an invitation for
> troubles (from looking into Amiga newsgroups it seems that miracle
> didn't happen - I've seen quite a few complaints about fs breakage
> answered with "don't use links to directories, they are broken").

They MAY be fixed in the OS3.5 BoingBag 2 (service pack 2 with a
cutsiepie name.) Heinz has committed yet another rewrite.

> Anyway, it's all history. We can't unroll the kludge, no matter
> what we do. We've got what we've got. And I'm not too interested in
> distribution of the blame between the people in team that seems to be
> dissolved years ago. I consider AFFS we have to deal with as a poor excuse
> of design and I think that it gives more than enough reasons for that.
> In alternative history it might be better. So might many other things.

Indeed, poor or not it exists and we live with it in the Amiga community.
(Um, I wonder if I could talk Hendrix into a copy of the source for SFS so
it could be ported to Linux. These days I prefer it to FFS. {^_-})

Re: hfs support for blocksize != 512

2000-08-31 Thread Alexander Viro



On Thu, 31 Aug 2000, Roman Zippel wrote:

> Disclaimer: I know that the following doesn't match the current
> implementation, it's just how I would intuitively do it:
> 
> - get dentry foo
> - get dentry baz

How? OK, you've found block of baz. You know the name, all right. Now
you've got to do the full tracing all the way back to root. During that
tracing you've got to do interesting things - essentially that's what Neil
and Roman are trying to do with fh_to_dentry patches and it's _not_ 
simple. Moreover, it's even worse than the current code wrt amount of IO
_and_ seeks. OK, nevermind, let's say you've done that.

> - lock inode foo
> - mark dentry foo as deleted
> - getblk file header foo
> - mark file header foo as deleted

?

> - getblk file header baz

You'll have to do it way before - how else would you find out that it was
called baz, in the first place?

> - update file header baz from file header foo

If it would be that simple... Extent blocks refer to foo, unfortunately.
Yes, copying the thing would be easier. Too bad, data structure prohibits
that.

> > On that specific operation. When you are done with
> > that, I have a rename() for you, but I think that even simpler example
> > (unlink()) will be sufficient.
> 
> Please post it, I know there are some interesting examples, but I don't
> have them at hand. Although I wanted to keep that flamewar for later, but
> if we're already in it...

Well, consider rename over the primary link and there you go... Keep in
mind that extent blocks contain the reference to header block, so unless
you want to update them all you've got to move the header into donor's
chain ;-/

> > Again, we are talking about the data structure and operations it has to
> > deal with _according to its designers_. I claim that due to a bad data
> > structure design (single-linked lists in hash chains, requirement to have
> > all entries belonging to some chain) unlink() (one of the operations it
> > was designed to deal with) becomes very complicated  and requires rather
> hairy exclusion rules.  On Amiga. Linux has nothing to do with the problem.
> 
> To be fair it should be mentioned that links were added later to affs.

Well, but we've got to deal with the result, not with the
AFFS-without-links. I certainly agree that most of the blame for bad data
structure design falls on the folks who added that kludge for
pseudo-links, but that's purely historical question. Result is ugly.

As for the silliness of the OFS... I apologize for repeating the
story if you know it already, but anyway: OFS looks awfully similar to
Alto filesystem. With one crucial difference: Alto kept the header/footer
equivalents in the sector framing. No silly 400-odd byte sectors for them.
That layout made a lot of sense - you could easily recover from many disk
faults, yodda, yodda, _without_ sacrificing performance. The whole design
relied on ability to put pieces of metadata in the sector framing. Take
that away and you've lost _very_ large part of the benefits. So large that
the whole design ought to be rethought - tradeoffs change big way.

OFS took that away. Mechanically. It just stuffed the headers into
the data part of sectors. I don't know the story behind that decision -
being a jaded bastard I suspect that Commodore PHBs decided to save a
bit on floppy controller price and did it well after the initial design
was done and so close to release that redesign was impossible for
schedule reasons, but it might be something else. We'll probably never
know unless somebody who had been in the original design team will leak
it. But whatever reasons were behind that decision, OFS was either blindly
copied without a single thought about very serious design factor _or_
had been crippled at some point before the release. If it's the latter - I
commiserate with their fs folks. If it's the former... well, I think that
it says quite a few things about their clue level.

AFFS took the headers out of the data sectors. But that killed the
whole reason behind having them anywhere - if you can't tell data blocks
from the rest, what's the point of marking free and metadata ones?

Now, links were total lossage - I think that even if you have some
doubts about that now, you will lose them when you will write down the
operations needed for rename(). And I mean pure set of on-disk changes -
forget about dentries, inodes and other in-core data.

Why did they do it that way? Beats me. AmigaOS is a microkernel,
so replacing fs driver should be very easy. It ought to be easier than in
Linux. And they've pulled out the change from OFS to AFFS, so the
filesystem conversion was not an issue. Dunno how about UNIX-friendliness,
but their implementation of links definitely was not friendly to their own
OS.

And let's not go into the links to directories, implemented well
after it became painfully obvious that they were an invitation for
troubles (from looking into Amiga newsgroups it seems that miracle
didn't happen - I've seen quite a few complaints about fs breakage
answered with "don't use links to directories, they are broken").

Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel

Hi,

On Wed, 30 Aug 2000, Alexander Viro wrote:

>   c) ->i_sem on pageout? When?

For 2.2.16:

filemap_write_page() <- filemap_swapout() <- try_to_swap_out() <- ... <-
swap_out() <- do_try_to_free_pages() <- kswapd()

filemap_write_page() takes i_sem and calls do_write_page(). What did I
miss?

>   BKL matters only in the areas where you do not block. Moreover,
> fs code is still under the BKL, so it's totally moot.

Let me state it differently; what I'm trying to say is:
Past: lots of filesystem code wasn't designed/written with multiple
threads in mind. The result is lots of races.
Future: we want to experiment with a preemptible kernel. Maybe that
experiment will fail, but I'm certainly interested in it. But the result
here will be a wonderful world of new races, and I'm pretty sure your ext2
fixes will break here - one more reason I'm so keen to use semaphores.

All I wanted to say is that level of threading is changing. How that is
visible in the fs layer is a different problem.

> > > Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> > > bread() and friends on ext2.
> > 
> > Excuse my stupidity, but could you please outline me how?
> 
> Using kiovec, for one thing.

Huh? You said "trivially".

> One thing that became really obvious is that current documentation
> is either not enough or not read. Hell knows what to do about the latter,
> but the former can be helped.

Documentation is one (good) thing (I really tried to find as much as
possible), but my point is that I tried to discuss design issues. I didn't
want to know how it works now (for that I can and do read the source); I
want to discuss the possibility of alternative solutions - is that really
impossible?
Anyway, after I discussed that enough with myself, I think I can try to
code up something as soon as I find the time for it.

bye, Roman





Re: hfs support for blocksize != 512

2000-08-31 Thread Roman Zippel

Hi,

(Sorry for the previous empty mail, I was a bit too fast with sending and
couldn't stop it completely.)

On Wed, 30 Aug 2000, Alexander Viro wrote:

I concentrate on the most interesting part:

> As for AFFS directory format - fine, please describe the data
> manipulations required by unlink("foo"); done after the
> link("foo","bar/baz");. Both operations are supported on AmigaOS, so
> references to UNIX are utterly irrelevant. On the block level, please.
> Only for directory blocks. Now, tell me what kind of protection (pageout
> has nothing to directories, so all async problems are irrelevant) would
> you provide. Or what protection should VFS/core kernel/exec/whatever
> provide to filesystem.

Disclaimer: I know that the following doesn't match the current
implementation, it's just how I would intuitively do it:

- get dentry foo
- get dentry baz
- lock inode foo
- mark dentry foo as deleted
- getblk file header foo
- mark file header foo as deleted
- getblk file header baz
- update file header baz from file header foo
- brelse file header baz
- update inode foo
- unlock inode foo
- put dentry baz
- lock foo's parent
- getblk and update dir header parent
- getblk file headers from foo's chain until file header of predecessor of
  foo found
- update predecessor to point to successor of foo
- brelse everything
- unlock foo's parent
- put and invalidate dentry foo
- last user of foo frees file header foo in bitmap

I probably forgot something, but you will surely tell me. Two things I
want to mention anyway. First, I only lock something when needed; that of
course breaks with current conventions. Second (and most important), I use
the dentry to block a possible lookup of an inode, so no one can open or
create foo or do anything else with it. A rename would work similarly, only
that the new dentry would be marked as not complete yet.
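
The chain surgery in the middle of the list above is ordinary
singly-linked-list removal. Spelled out, with get_succ()/set_succ() as
hypothetical helpers wrapping getblk() plus an update of the header's
chain field:

/* Remove block `victim` from a directory hash chain starting at
 * *head. A header only knows its successor, so walk from the
 * head to find the predecessor and relink it. */
static int chain_remove(u32 *head, u32 victim)
{
        u32 cur;

        if (*head == victim) {          /* victim is first in the chain */
                *head = get_succ(victim);
                return 0;
        }
        for (cur = *head; cur != 0; cur = get_succ(cur)) {
                if (get_succ(cur) == victim) {
                        /* point the predecessor at victim's successor */
                        set_succ(cur, get_succ(victim));
                        return 0;
                }
        }
        return -ENOENT;                 /* not in this chain */
}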

> On that specific operation. When you are done with
> that, I have a rename() for you, but I think that even simpler example
> (unlink()) will be sufficient.

Please post it; I know there are some interesting examples, but I don't
have them at hand. I wanted to keep that flamewar for later, but if
we're already in it...

> Again, we are talking about the data structure and operations it has to
> deal with _according to its designers_. I claim that due to a bad data
> structure design (single-linked lists in hash chains, requirement to have
> all entries belonging to some chain) unlink() (one of the operations it
> was designed to deal with) becomes very complicated  and requires rather
> hairy exclusion rules.  On Amiga. Linux has nothing with the problem.

To be fair it should be mentioned that links were added later to affs.

bye, Roman




Re: hfs support for blocksize != 512

2000-08-31 Thread Daniel Phillips

Alexander Viro wrote:
> On Wed, 30 Aug 2000, Roman Zippel wrote:
> > > What? You've proposed locking on pageout. If _that_ isn't the fast path...
> >
> > No, I suggested a lock (not necessarily the inode lock) during allocation
> > of indirect blocks (and defer truncation of them).
> 
> Which means pageout when you are dealing with sparse files. You
> don't have them - fine, then you can take such lock right now.
>
[...]
> >
> > Sorry, but from time to time I prefer _first_ to think about a problem and
> > I try to understand it. One way to do this is to post questions and/or
> > suggestions to a mailing list (at least I thought so). If you have an
> > other suggestion please enlighten me.
> 
> No problem with _that_.
> 
> How about we all calm down and do something more useful than this pissing
> match? One thing that became really obvious is that current documentation
> is either not enough or not read. Hell knows what to do about the latter,
> but the former can be helped. We have several pieces of it - Richard's one
> in the tree, Daniel's postings on fsdevel

Funny you should mention that - I was just reading this thread and
thinking "now, how the heck am I going to make some sense of the
locking rules in the new VFS?".  I'm getting to the point where I have
to deal with some subtle issues in my own code and I thought I'd
approach this by writing down the locking rules.  Then I realized that
since I don't have a clue where to start, I'd better do some deep
breathing, relax and think about it.  Here's where I am now:

1) I want to think about what the absolute minimal level of locking
for FS ops could be.  This is the same as asking what the maximum
parallelism could be.  This is not necessarily going to resemble the
current arrangement very much, and it might give shivers to some fs
programmers that are used to being able to count on certain
traditional regions of mutual exclusion.

2) Then I have to go look at the current practice, and get it down in
some sort of notation that's easy to understand.

3) At this point I'd have the two endpoints of a migration path: where
we are (on the road away from BKL) and where we're going (towards the
tightest, most parallel fs you ever did see:-).  This should be useful
in assessing how long that road is, and hence, just how far we are
from having the locking rules settle down.

4) Then post the draft, hopefully attracting some of the usual
flamage.  In other words, trial by fire.

> and several parts written by
> various folks. This stuff needs to be merged (and corrected where needed).
> I volunteer to do that - I've spent quite a while dealing with the code,
> so I at least know what _is_ there. I would be really grateful if
> * folks who have writeups would post URLs to them (or texts
> themselves, if they are small enough). Preferably to fsdevel, but private
> email will also go.

Please cross-post to [EMAIL PROTECTED] and
[EMAIL PROTECTED] as well.  On the theory that having more
copies of documentation is always better than less.

> * people would comment after the result will be posted. Especially
> about the missing / hard-to-understand pieces of text.
> * somebody helped to turn the result into decent English text.

There are a number of native English speakers hanging on the linux-doc
list, just waiting to be asked.

--
Daniel



Re: hfs support for blocksize != 512

2000-08-30 Thread Alexander Viro



On Wed, 30 Aug 2000, Roman Zippel wrote:

> > Your repeated claims of VFS becoming more multi-threaded in ways
> > that are not transparent to fs drivers wrt locking are false.
> 
> For example the usage of inode lock changed pretty much and was partly
> replaced with the page lock? I can still remember times, where all of the
> fs stuff happened under the BKL, for me that means only a _single_ thread

a) fs methods _are_ called under the BKL.
b) the BKL does nothing about single-processor races. Proof: the definition
of lock_kernel() on non-SMP builds. You sleep - you lose the BKL. schedule()
drops it (and restores it when your process gets a timeslice again, but
whatever happens in the meanwhile - happens).
c) ->i_sem on pageout? When?
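
Point (b) in code form - a sketch of why the BKL gives no exclusion across
anything that can sleep (2.2/2.4-era buffer API; the function itself is
just an illustration):

static void bkl_is_no_lock_across_sleeps(kdev_t dev, int block, int size)
{
        struct buffer_head *bh;

        lock_kernel();
        bh = bread(dev, block, size);   /* may sleep: schedule() drops the
                                         * BKL and re-takes it later, so
                                         * another process can run through
                                         * this region in the meantime */
        /* anything checked before bread() must be re-checked here */
        brelse(bh);
        unlock_kernel();
}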

> of execution could be busy in the whole fs layer. IMHO that's not really a
> prime example of multi-threaded programming, if you have a different
> definition please let me now.

BKL matters only in the areas where you do not block. Moreover,
fs code is still under the BKL, so it's totally moot.

> > What? You've proposed locking on pageout. If _that_ isn't the fast path...
> 
> No, I suggested a lock (not necessarily the inode lock) during allocation
> of indirect blocks (and defer truncation of them).

Which means pageout when you are dealing with sparse files. You
don't have them - fine, then you can take such lock right now.

> > > The major problem right now is that writepage() is supposed to be
> > > asynchronous especially for kswapd, but the fs might have to
> > synchronize something _internal_. I think one problem here is that we
> > still have a synchronous buffer API, which makes it very hard to
> > implement an asynchronous interface. That's why I suggested an I/O
> > 
> > Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> > bread() and friends on ext2.
> 
> Excuse my stupidity, but could you please outline me how?

Using kiovec, for one thing.

> > _One_ thread? For the whole fs? So you would pass the dirty pages from
> > kswapd to that guy. Fine. It attempts to acquire the inode semaphore (in
> > your proposal, as far as I could parse it). It blocks. kswapd keeps
> > pumping dirty pages into the queue of that thread. Wonderful...
> 
> Sorry, but did you read my mail? The purpose of that thread is to sleep
> and to get waken up to continue the IO. Not very much changes, except that
> this thread can safely sleep, whereas kswapd can't.
> Excuse my ignorance, but who does currently stop kswapd to start lots of
> IO?

Filesystems, actually. The problem is not a burst of IO (it will not
happen - your thread is locked), but the completely unnecessary interlock
between the output on different files.

> > b) doesn't help AFFS directory problems
> 
> Why the hell do you always come up with this? I _never_ mentioned it.

Let me put it that way:
It will not help with anything except a very specific problem with
sparse files.

You've mentioned handling of HFS. Guess what, there your suggestion gives
zero. Why? Because pageout on HFS never has a chance to allocate anything,
so no matter what/how you lock on allocation, kswapd doesn't enter the
picture. At all.

> > Talk is cheap. If you can show the patch that would simplify ext2,
> > I'm sure that Ted will be glad to see it. Same for maintainers of other
> > filesystems. The only requirement is that it should work. Excuse me, but
> > the longer I read your postings the more it looks like you have no idea of
> > the things you are talking about. I would be glad to be proven wrong on
> > that one too ;-/
> 
> I'm very sorry to waste your precious time, but your fscking arrogance
> makes me sick. What's your problem? Shall I first worship you as our fs
> god who saved us from all races?

Huh???

> Sorry, but from time to time I prefer _first_ to think about a problem and
> I try to understand it. One way to do this is to post questions and/or
> suggestions to a mailing list (at least I thought so). If you have an 
> other suggestion please enlighten me.

No problem with _that_.

How about we all calm down and do something more useful than this pissing
match? One thing that became really obvious is that current documentation
is either not enough or not read. Hell knows what to do about the latter,
but the former can be helped. We have several pieces of it - Richard's one
in the tree, Daniel's postings on fsdevel and several parts written by
various folks. This stuff needs to be merged (and corrected where needed).
I volunteer to do that - I've spent quite a while dealing with the code,
so I at least know what _is_ there. I would be really grateful if
* folks who have writeups would post URLs to them (or texts
themselves, if they are small enough). Preferably to fsdevel, but private
email will also go.
* people would comment after the result will be posted. Especially
about the missing / hard-to-understand pieces of text.
* somebody helped to turn the result into decent English text.

Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel

Hi,

>   Show me these removed locks. The only polite explanation I see is
> that you have serious reading comprehension problems. Let me say it once
> more, hopefully that will sink in:
> 
>   Your repeated claims of VFS becoming more multi-threaded in ways
> that are not transparent to fs drivers wrt locking are false.

For example, the usage of the inode lock changed pretty much and was partly
replaced with the page lock? I can still remember times where all of the
fs stuff happened under the BKL; for me that means only a _single_ thread
of execution could be busy in the whole fs layer. IMHO that's not really a
prime example of multi-threaded programming; if you have a different
definition please let me know.

> What? You've proposed locking on pageout. If _that_ isn't the fast path...

No, I suggested a lock (not necessarily the inode lock) during the
allocation of indirect blocks (and deferring their truncation).
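To make that concrete, roughly something like this - a sketch only, all
names here (my_i(), do_alloc_indirect(), the deferred list) are made up
for illustration, not actual ext2 code:

struct my_inode_info {
        struct semaphore alloc_sem;     /* serializes indirect block creation */
        struct list_head deferred_free; /* truncate parks indirect blocks here */
};

/* Slow path of get_block(): only entered when the indirect block does
 * not exist yet, so the fast path never touches the semaphore. */
static int alloc_indirect(struct inode *inode, long block)
{
        struct my_inode_info *mi = my_i(inode);
        int err;

        down(&mi->alloc_sem);   /* may sleep - we are process-synchronous here */
        err = do_alloc_indirect(inode, block);
        up(&mi->alloc_sem);
        return err;
}

/* Truncate side: if an allocation is in progress, defer the free
 * instead of sleeping on the semaphore. */
static void free_indirect(struct inode *inode, struct buffer_head *bh)
{
        struct my_inode_info *mi = my_i(inode);

        if (down_trylock(&mi->alloc_sem)) {
                /* busy: park the block, free it when the semaphore drops */
                list_add(&to_deferred(bh)->list, &mi->deferred_free);
                return;
        }
        do_free_indirect(inode, bh);
        up(&mi->alloc_sem);
}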

> > The major problem right now is that writepage() is supposed to be
> > asynchronous, especially for kswapd, but the fs might have to
> > synchronize something _internal_. I think one problem here is that we
> > still have a synchronous buffer API, which makes it very hard to
> > implement an asynchronous interface. That's why I suggested an I/O
> 
> Wrong. As the matter of fact, we could trivially get rid of _any_ use of
> bread() and friends on ext2.

Excuse my stupidity, but could you please outline how?

> _One_ thread? For the whole fs? So you would pass the dirty pages from
> kswapd to that guy. Fine. It attempts to acquire the inode semaphore (in
> your proposal, as far as I could parse it). It blocks. kswapd keeps
> pumping dirty pages into the queue of that thread. Wonderful...

Sorry, but did you read my mail? The purpose of that thread is to sleep
and to get woken up to continue the IO. Not very much changes, except that
this thread can safely sleep, whereas kswapd can't.
Excuse my ignorance, but what currently stops kswapd from starting lots
of IO?

>   b) doesn't help AFFS directory problems

Why the hell do you always come up with this? I _never_ mentioned it.

>   Talk is cheap. If you can show the patch that would simplify ext2,
> I'm sure that Ted will be glad to see it. Same for maintainers of other
> filesystems. The only requirement is that it should work. Excuse me, but
> the longer I read your postings the more it looks like you have no idea of
> the things you are talking about. I would be glad to be proven wrong on
> that one too ;-/

I'm very sorry to waste your precious time, but your fscking arrogance
makes me sick. What's your problem? Shall I first worship you as our fs
god who saved us from all races?
Sorry, but from time to time I prefer _first_ to think about a problem and
I try to understand it. One way to do this is to post questions and/or
suggestions to a mailing list (at least I thought so). If you have
another suggestion please enlighten me.

bye, Roman




Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel

Hi,

> It sounds to me like different FSes have different needs.  Maybe the best
> approach is to have two or three fs APIs, according to the needs of the
> fs.

No, having several fs APIs is a maintenance nightmare; I think that's
something everyone agrees on. What is needed is to modify the API to
meet all the requirements of the vfs and the needs of the fs. (The
problem is we don't agree on what the fs needs...)

bye, Roman



Re: hfs support for blocksize != 512

2000-08-30 Thread David A. Gatwood

On Tue, 29 Aug 2000, Jeff V. Merkey wrote:

> I concur with this appraisal from Al Viro.  Single threading the VFS is
> going backwards -- not a good idea.

It sounds to me like different FSes have different needs.  Maybe the best
approach is to have two or three fs APIs, according to the needs of the
fs.  One could be a pure vnode interface, simple, serene, which puts the
locking in the driver by whatever means it chooses.  Lookup for NFS would
be on the vnode number, which would be kept in a kernel table until the
file was closed.  One could be the current multi-threaded arrangement. 
Finally, one might add a single-threaded-per-filesystem-instance method
for filesystems that don't thread well.

It just seems to me that this sort of thing need not be an either-or
situation.


Comments?
David




Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel

Hi,

Tony Mantler wrote:

> For those of you who would rather not have read through this entire email,
> here's the condensed version: VFS is inherently a wrong-level API, QNX does
> it much better. Flame on. :)

VFS isn't really wrong; the problem is that it moved from an almost
single-threaded API to a multithreaded API and that development isn't
complete yet. I don't really expect that fs programming becomes easier,
but it should stay sane. For example I want to protect certain state
changes properly, instead of the insane "check all possible states at all
possible times and before and after every change" that Al is currently
doing in ext2.

bye, Roman



Re: hfs support for blocksize != 512

2000-08-30 Thread Roman Zippel

Hi,

> Yes? And it will become simpler if you will put each and every locking
> scheme into the API?

No, I didn't say that. I want the API to be less restrictive and make
the job for the fs a bit easier. IMO the current API is inconsistent
and/or incomplete and I'm still trying to find out what exactly is
missing. The VFS is becoming more and more multithreaded, locks are
(re)moved, but nothing was added for the fs.

> We have ext2 with indirect blocks, inode bitmaps and block bitmaps, one
> per cylinder group + counters in each cylinder group. Should VFS know
> about the internal locking rules? Should it be aware of the fact that
> inodes foo and bar belong to the same cylinder group and if we remove them
> we will need to protect the bitmap for a while?

Ok, let's take ext2 as an example. Of course the vfs should only be the
abstraction layer, but it shouldn't enforce locking rules like the ones
you added in ext2. I know the races have existed for longer, so you don't
have to argue about that, but earlier I suggested a simpler solution; the
problem is that it requires holding an exclusive lock while it might
sleep. It wouldn't even be in the fast path and would only affect write
access to the indirect blocks of a single file; it doesn't affect reads
and it doesn't affect access to other files - that really shouldn't be a
problem even for a multithreaded environment. But currently this is not
possible, and all I'm trying now is to explore possibilities to make it
possible, as that would make life for ext2 and every other fs a lot
easier.

> We have AFFS with totally fscked directory structures.

Sorry? Why is that? Because it's not UNIX-friendly? It was designed for
a completely different OS and is very simple. The problems I know of are
mostly shared with every other fs that has a more dynamic directory
structure than ext2.

> It's insane - protection of purely internal data structures belongs to the
> module that knows about them.

I absolutely don't argue against that!

Anyway, somehow you skipped a lot of my mail, so it seems I have to
continue to discuss that with myself (hopefully without permanent
damage).
The major problem right now is that writepage() is supposed to be
asynchronous, especially for kswapd, but the fs might have to
synchronize something _internal_. I think one problem here is that we
still have a synchronous buffer API, which makes it very hard to
implement an asynchronous interface. That's why I suggested an I/O
thread, which can sleep for the caller. Another possibility is to make
the already existing asynchronous interface in buffer.c available to the
fs. Anyway, if we want an asynchronous fs interface, we need an
asynchronous buffer interface, so that e.g. writepage() in ext2 can lock
the indirect block, start the I/O and get called back later; another
writepage() call in the same area has to detect that lock (with a simple
down_trylock()) and schedule the complete I/O for later. With some help
from the buffer interface this should be possible pretty easily, and ext2
would actually become much simpler again. Something like this would also
be great for real AIO support in userspace with low latencies.
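
As a sketch of what I mean (all helper names are invented for
illustration; only down_trylock()/up() are real primitives):

/* writepage() must not sleep when called for pageout, so only _try_
 * to take the lock protecting the indirect block. */
static int my_writepage(struct page *page)
{
        struct inode *inode = page->mapping->host;
        struct my_inode_info *mi = my_i(inode);

        if (down_trylock(&mi->indirect_sem)) {
                /* somebody is working on the indirect block:
                 * queue the page and return without blocking */
                queue_deferred_page(mi, page);
                return 0;
        }
        start_async_page_io(inode, page);       /* completion runs below */
        return 0;
}

/* Called back by the buffer layer once the I/O has finished. */
static void my_io_completion(struct my_inode_info *mi)
{
        up(&mi->indirect_sem);
        retry_deferred_pages(mi);       /* resubmit whatever was queued */
}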

bye, Roman



Re: hfs support for blocksize != 512

2000-08-30 Thread Alexander Viro



On Wed, 30 Aug 2000, Albert D. Cahalan wrote:

> Ext2, XFS, Reiserfs, NWFS, and JFS need a multi-threaded VFS.
> Do we really need a screaming fast multi-threaded AFFS driver?

Erm... Roman seems to complain about VFS/VM not locking hard enough to
make protection of private fs data structures unnecessary.

> Tell me who is doing SPECweb99 benchmarks on AFFS.
> I'd trade away some NTFS performance for a bug reduction.
> Perhaps the trade would be OK for most single-user filesystems.
> Somebody was doing a Commodore 64 filesystem. That just isn't
> going to be mounted on /tmp except as a joke.
> 
> Yeah, I know about the Coda interface and all. People like the
> ease-of-use and reliability offered by in-kernel filesystems.

What? You are trying to say that debugging kernel code is
easier than doing it in userland? Could you pass me this reefer? Seems
to be some fairly strong stuff in there... There are reasons to write a
kernel fs, but reliability is _not_ one of them; debugging will be harder,
with all the usual consequences.

> Having a complex-to-simple VFS adapter would make this guy happy.
> You don't have to write it or use it.

Albert, care to look at the API someday? Areas of major suckage:
->revalidate()
->truncate()
->readdir()
It's too fscking close to 2.4 for further cleanups in these places.  The
rest is in a funny state - simple, but badly documented.

It's much simpler than it used to be and I really wonder what
simplification you would propose. Full lock on all operations? But one
can do it right now - it's not worth a special translator... The only
thing to watch for: ->writepage() (i.e. pageout) can happen anywhere below
the ->i_size and you can't use blocking exclusion against that. Rationale:
trivial deadlocks upon the memory pressure.
If fs has no holes (AFFS, HFS, etc. qualify) - no block allocation
on pageout, so your life is relatively simple. And the rest of
file-modifying stuff is
a) process-synchronous
b) called with ->i_sem held.
For files-with-holes you have to step carefully. Fixed that in ext2, but
expanding to the rest will take a while. Fortunately they are relatively
sane in other respects and happen to be very similar...
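
In get_block() terms it boils down to something like this (a sketch;
block_exists(), map_existing() and extend_file() are invented names,
not real code):

static int noholes_get_block(struct inode *inode, long iblock,
                             struct buffer_head *bh, int create)
{
        /* On a filesystem without holes every block below ->i_size
         * already exists, so pageout never allocates anything here. */
        if (block_exists(inode, iblock))
                return map_existing(inode, iblock, bh);
        if (!create)
                return -EIO;
        /* Only write()/truncate() get here - process-synchronous,
         * with ->i_sem held, so sleeping and allocating is fine. */
        return extend_file(inode, iblock, bh);
}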

[VFS comparison]
> > Plan 9 is nice and easy. Without mmap(),
> > without link(), without truncate(), without cross-directory rename() and
> 
> No link() and no cross-directory rename()... how in hell???
> They what, move via copy and delete?

So do we, if the target is on a different filesystem... So does AFS, for
that matter (no cross-directory rename). I can understand them very well -
full-blown rename() is _hell_ to get right. BT, DT in VFS, got the nausea.
Grep for "Hastur" in fs/namei.c and read the comments.
Fortunately, these days the fs side of ->rename() is mostly painless (compared
to what had been there; there _is_ some crap, but that's what you get from
the fscked semantics of the operation).

'sides, you had been one of the most vocal link(2)-haters, hadn't you?




Re: hfs support for blocksize != 512

2000-08-30 Thread Albert D. Cahalan

Alexander Viro writes:
> On Wed, 30 Aug 2000, Roman Zippel wrote:

>> The point is: the thing I like about Linux is its simple interfaces, it's
>> the basic idea of unix - keep it simple. That is true for most parts - the
>> basic idea is simple and the real complexity is hidden behind it. But
>> that's currently not true for the vfs interface; a fs maintainer has to fight
>> right now with a fscking complex vfs interface and with a possibly fscking
>
> Yes? And it will become simpler if you will put each and every locking
> scheme into the API?
> 
> Look: we have Hans with his trees-all-over-the-place + journal.

Mmmm, isn't it just _one_ big tree with different types of nodes?

> We have AFFS with totally fscked directory structures. Do you propose to
...
> Then check what's left after that locking - e.g. can two
> processes access the same fs at the same time or not?
...
> Making VFS single-threaded will not fly. If you can show a simpler MT one -

Ext2, XFS, Reiserfs, NWFS, and JFS need a multi-threaded VFS.
Do we really need a screaming fast multi-threaded AFFS driver?
Tell me who is doing SPECweb99 benchmarks on AFFS.
I'd trade away some NTFS performance for a bug reduction.
Perhaps the trade would be OK for most single-user filesystems.
Somebody was doing a Commodore 64 filesystem. That just isn't
going to be mounted on /tmp except as a joke.

Yeah, I know about the Coda interface and all. People like the
ease-of-use and reliability offered by in-kernel filesystems.
Having a complex-to-simple VFS adapter would make this guy happy.
You don't have to write it or use it.

> Plan 9 is nice and easy. Without mmap(),
> without link(), without truncate(), without cross-directory rename() and

No link() and no cross-directory rename()... how in hell???
They what, move via copy and delete?



Re: hfs support for blocksize != 512

2000-08-29 Thread Alexander Viro



On Tue, 29 Aug 2000, David A. Gatwood wrote:

> Indeed, that's what a VFS layer should do -- abstract away all physical
> structure, inodes, etc., leaving only the file abstraction.  I've read

It does. That leaves caring about the internal structures to the fs - you
don't want a fscked block bitmap on ext2, you've got to protect it
yourself. Sorry.

> that the BSD-derived OSes have vnode interfaces that are remarkably
> similar to what you're describing, i.e. the concept isn't restricted to
> RTOSes.

That's what had been done. BTW, a pure vnode interface leaves all
namespace-related race-prevention to the fs writer. And they tend to fsck
up. "They" includes Kirk, so... I wouldn't call it simple. Moreover, tons
of the code are duplicated (with slight variations in the set of present
bugs) in all filesystems.

> Note that I haven't touched the Linux VFS layer since 2.0.xx, so I'm not
> in a position to comment on the current state of the code.  :-)

It got much simpler.




Re: hfs support for blocksize != 512

2000-08-29 Thread David A. Gatwood

On Tue, 29 Aug 2000, Tony Mantler wrote:

> (Obligatory disclaimer: QNX is an embedded operating system; both its
> architecture and target market are considerably different from Linux's)
> 
> QNX's filesystem interfaces make it so painfully easy to write a filesystem
> that it puts everything else to shame. You can easily write a fully
> functioning, race-free, completely coherent filesystem in less than a week,
> it's that simple.

I'd interject that it's not a very fair comparison between the kernel
complexity of an RTOS and a full-fledged traditional OS, but go on
;-)


> Now, let's say you do an 'ls' on the FOO directory. The FS api would tap
> your filesystem on the shoulder and ask "Hey you, what's in the FOO
> directory?". Your filesystem would reply "BAR and BAZ".

It might also reply with a stat structure, depending on the
implementation, but otherwise, yeah, this is a good model to move towards.


> So what does it all mean? Basically, if you want hugely complex dentries,
> and inodes as big as your head, you can do that. If you don't, more power
> to you. It's all entirely contained inside your specific FS code, the FS
> api doesn't care one bit. It just asks you for files.

Indeed, that's what a VFS layer should do -- abstract away all physical
structure, inodes, etc., leaving only the file abstraction.  I've read
that the BSD-derived OSes have vnode interfaces that are remarkably
similar to what you're describing, i.e. the concept isn't restricted to
RTOSes.

Note that I haven't touched the Linux VFS layer since 2.0.xx, so I'm not
in a position to comment on the current state of the code.  :-)


Later,
David

-
A brief Haiku:

Microsoft is bad.
It seems secure at first glance.
Then you read your mail.




Re: hfs support for blocksize != 512

2000-08-29 Thread Jeff V. Merkey


I concur with this appraisal from Al Viro.  Single threading the VFS is
going backwards -- not a good idea.  

:-)

Jeff

Alexander Viro wrote:
> 
> On Wed, 30 Aug 2000, Roman Zippel wrote:
> 
> > > > hfs. For example reading from a file might require a read from a btree
> > > > file (extent file), with which another file write can be busy (e.g.
> > > > reordering the btree nodes).
> > >
> > > And?
> >
> > The point is: the thing I like about Linux is its simple interfaces, it's
> > the basic idea of unix - keep it simple. That is true for most parts - the
> > basic idea is simple and the real complexity is hidden behind it. But
> > that's currently not true for the vfs interface; a fs maintainer has to fight
> > right now with a fscking complex vfs interface and with a possibly fscking
> 
> Yes? And it will become simpler if you will put each and every locking
> scheme into the API?
> 
> Look: we have Hans with his trees-all-over-the-place + journal. He has a
> very legitimate need to protect the internal data structures of Reiserfs
> and do it without changing the VFS<->reiserfs interaction whenever he
> decides to change purely internal structures.
> 
> We have ext2 with indirect blocks, inode bitmaps and block bitmaps, one
> per cylinder group + counters in each cylinder group. Should VFS know
> about the internal locking rules? Should it be aware of the fact that
> inodes foo and bar belong to the same cylinder group and if we remove them
> we will need to protect the bitmap for a while?
> 
> We have FAT32 where we've got nasty allocation data with rather
> interesting locking rules. Should it be protected by VFS? If it should -
> well, I have bad news for you: write() on a file will lock the whole
> filesystem until write() completes. Don't like it for every fs? Tough, it
> will mean that VFS will not protect the thing and fs will have to do it
> itself.
> 
> We have AFFS with totally fscked directory structures. Do you propose to
> make unlink() block all directory operations on the whole fs? No? Too
> bad, because only AFFS knows enough to protect its data structures without
> _that_ locking. Sorry, the only rule that would not require the knowledge
> of layout and would be strong enough to protect is "no directory access
> while unlink() is in progress". Yup, on the whole fs. Hardly acceptable
> even for one filesystem, but try to impose that on everyone and see how
> long you will survive. JPEGs of the murder scene would be appreciated,
> BTW.
> 
> We have HFS with the data structures of its own. You want locking in VFS
> that would protect the things VFS doesn't know about and has no business
> to meddle with? Fine, post the locking rules.
> 
> It's insane - protection of purely internal data structures belongs to the
> module that knows about them. Generic stuff can, should be and _is_
> protected. Private one _can't_ be protected without either horribly
> crippled system (see above) or putting the knowledge of each data
> structure into the generic layer. And the latter will be on the author of
> filesystem anyway, because only he knows what rules he needs.
> 
> Please, propose your magical locking scheme that will protect everything
> on every fs. And let maintainers of filesystems tell you whether it is
> sufficient. Then check what's left after that locking - e.g. can two
> processes access the same fs at the same time or not?
> 
> If you are complaining about the fact that maintaining complex data
> structures in a multithreaded program (which the kernel is) may be, well,
> complex - welcome to reality. It had been that way since the very
> beginning on _all_ MT projects, Linux included. You have complex private
> data - you may be in for pain protecting yourself from races. Protection
> of the public structures is there, so life became easier than it used to
> be back in 2.0/2.1/2.2 days.
> 
> Making VFS single-threaded will not fly. If you can show a simpler MT one -
> do it and a lot of people will be extremely grateful. 4.4BSD and SunOS
> ones are more complex and make the life harder for filesystem writers.
> Check yourself. OSF/1 is _much_ more complex. Hell knows what NT has, but
> filesystem glue there looks absolutely horrible - compared to them we are
> angels in that respect. v7 was simpler, sure enough. Without mmap(),
> rename() and truncate() _and_ with only one fs type - why not? Too bad
> that it was racy as hell... Plan 9 is nice and easy. Without mmap(),
> without link(), without truncate(), without cross-directory rename() and
> without support of crazy abortions from hell a-la AFFS. 2.0 and 2.2 are
> _way_ more complex, just compare filesystem code size in 2.4 with them and
> you will see. And yes, races in question are not new. I can reproduce them
> on a 2.0.9 box. A single-processor one - nothing fancy or SMP-related.
> 
> If you have a way to simplify VFS and/or filesystems - by all means,
> post it on fsdevel/l-k. Just tell what locking warranties you provide.
> Current 

Re: hfs support for blocksize != 512

2000-08-29 Thread Alexander Viro



On Wed, 30 Aug 2000, Roman Zippel wrote:

> > > hfs. For example reading from a file might require a read from a btree
> > > file (extent file), with which another file write can be busy (e.g.
> > > reordering the btree nodes).
> > 
> > And?
> 
> The point is: the thing I like about Linux is its simple interfaces, it's
> the basic idea of unix - keep it simple. That is true for most parts - the
> basic idea is simple and the real complexity is hidden behind it. But
> that's currently not true for the vfs interface; a fs maintainer has to fight
> right now with a fscking complex vfs interface and with a possibly fscking

Yes? And it will become simpler if you will put each and every locking
scheme into the API?

Look: we have Hans with his trees-all-over-the-place + journal. He has a
very legitimate need to protect the internal data structures of Reiserfs
and do it without changing the VFS<->reiserfs interaction whenever he
decides to change purely internal structures.

We have ext2 with indirect blocks, inode bitmaps and block bitmaps, one
per cylinder group + counters in each cylinder group. Should VFS know
about the internal locking rules? Should it be aware of the fact that
inodes foo and bar belong to the same cylinder group and if we remove them
we will need to protect the bitmap for a while?

We have FAT32 where we've got nasty allocation data with rather
interesting locking rules. Should it be protected by VFS? If it should -
well, I have bad news for you: write() on a file will lock the whole
filesystem until write() completes. Don't like it for every fs? Tough, it
will mean that VFS will not protect the thing and fs will have to do it
itself.

We have AFFS with totally fscked directory structures. Do you propose to
make unlink() block all directory operations on the whole fs? No? Too
bad, because only AFFS knows enough to protect its data structures without
_that_ locking. Sorry, the only rule that would not require the knowledge
of layout and would be strong enough to protect is "no directory access
while unlink() is in progress". Yup, on the whole fs. Hardly acceptable
even for one filesystem, but try to impose that on everyone and see how
long you will survive. JPEGs of the murder scene would be appreciated,
BTW.

We have HFS with the data structures of its own. You want locking in VFS
that would protect the things VFS doesn't know about and has no business
to meddle with? Fine, post the locking rules.

It's insane - protection of purely internal data structures belongs to the
module that knows about them. Generic stuff can, should be and _is_
protected. Private one _can't_ be protected without either horribly
crippled system (see above) or putting the knowledge of each data
structure into the generic layer. And the latter will be on the author of
filesystem anyway, because only he knows what rules he needs.

Please, propose your magical locking scheme that will protect everything
on every fs. And let maintainers of filesystems tell you whether it is
sufficient. Then check what's left after that locking - e.g. can two
processes access the same fs at the same time or not?

If you are complaining about the fact that maintaining complex data
structures in a multithreaded program (which the kernel is) may be, well,
complex - welcome to reality. It had been that way since the very
beginning on _all_ MT projects, Linux included. You have complex private
data - you may be in for pain protecting yourself from races. Protection
of the public structures is there, so life became easier than it used to
be back in 2.0/2.1/2.2 days.

Making VFS single-threaded will not fly. If you can show a simpler MT one -
do it and a lot of people will be extremely grateful. 4.4BSD and SunOS
ones are more complex and make the life harder for filesystem writers.
Check yourself. OSF/1 is _much_ more complex. Hell knows what NT has, but
filesystem glue there looks absolutely horrible - compared to them we are
angels in that respect. v7 was simpler, sure enough. Without mmap(),
rename() and truncate() _and_ with only one fs type - why not? Too bad
that it was racy as hell... Plan 9 is nice and easy. Without mmap(),
without link(), without truncate(), without cross-directory rename() and
without support of crazy abortions from hell a-la AFFS. 2.0 and 2.2 are
_way_ more complex, just compare filesystem code size in 2.4 with them and
you will see. And yes, races in question are not new. I can reproduce them
on a 2.0.9 box. A single-processor one - nothing fancy or SMP-related.

If you have a way to simplify VFS and/or filesystems - by all means,
post it on fsdevel/l-k. Just tell what locking warranties you provide.
Current ones are documented in the tree, so it will be very easy to
compare. I'm not saying that they are ideal (check the documentation in
question - I'm saying the opposite in quite a few cases). They _can_ be
made better. But if you are saying that you know how to protect purely
internal data structures without losing MT 

Re: hfs support for blocksize != 512

2000-08-29 Thread Tony Mantler

At 8:09 PM -0500 8/29/2000, Roman Zippel wrote:
>So let's get back to the vfs interface

Yes, let's do that.

Every time I hear someone talking about implementing a filesystem, the
words "you are doomed" are usually to be heard somewhere along the lines.

Now, the bits on disk aren't usually the part that kills you - heck, I
repaired an HFS drive with a hex editor once (don't try that at home, kids)
- it's the evil and miserable FS driver APIs that get you. Big ugly
structs, coherency problems with layers upon layers of xyz-cache, locking
nightmares etc.

So, when my boss dropped a multiple-compressed-backed ramdisk filesystem in
my lap and said "make it use less memory", the words "I am doomed" floated
through my head.

Thankfully for the sake of both myself and my sanity, the platform of
choice was QNX 4.

(Obligatory disclaimer: QNX is an embedded operating system; both its
architecture and target market are considerably different from Linux's)

QNX's filesystem interfaces make it so painfully easy to write a filesystem
that it puts everything else to shame. You can easily write a fully
functioning, race-free, completely coherent filesystem in less than a week,
it's that simple.

When I wanted to make my compressed-backed ramdisk filesystem attach to
multiple points in the namespace with separate and multiple backings on
each point, in only a single instance of the driver, it was as easy as
changing 10 lines of code.

Now, for those of you who don't have convenient access to QNX4 or QNX
Neutrino (which has an even nicer interface, mostly cleaning up on the QNX4
stuff), here's the disneyfied version of how it all works:

When your filesystem starts up it tells the FS api "hey you, fs api. if
someone needs something under directory FOO, call me". Your filesystem then
wanders off and sleeps in the background 'till someone needs it.

Now, let's say you do an 'ls' on the FOO directory. The FS api would tap
your filesystem on the shoulder and ask "Hey you, what's in the FOO
directory?". Your filesystem would reply "BAR and BAZ".

Now you do 'cat FOO/BAZ >/dev/null', the FS api taps your filesystem on the
shoulder and says "someone wants to open FOO/BAZ". Your filesystem replies
"Yeah, got it open, here's an FD for you". The FS layer then comes back
again and says "I'll take block x y and z from the file on this FD", to
which your filesystem replies "Ok, here it is".

Etc etc, you get the point.
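
In made-up pseudo-C (emphatically _not_ the real QNX API - every name
below is invented, it's just the shape of the thing):

struct fs_msg {
        int type;       /* FS_READDIR, FS_OPEN, FS_READ, ... */
        /* ... request parameters ... */
};

int main(void)
{
        int id = fs_attach("/FOO");     /* "anything under FOO is mine" */
        struct fs_msg msg;

        for (;;) {
                fs_receive(id, &msg);   /* sleep until someone needs us */
                switch (msg.type) {
                case FS_READDIR:        /* "what's in FOO?" */
                        fs_reply_names(id, &msg, "BAR", "BAZ", NULL);
                        break;
                case FS_OPEN:           /* "someone wants FOO/BAZ" */
                        fs_reply_fd(id, &msg, my_open(&msg));
                        break;
                case FS_READ:           /* "blocks x, y and z, please" */
                        fs_reply_data(id, &msg, my_read(&msg));
                        break;
                }
        }
}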

So what does it all mean? Basically, if you want hugely complex dentries,
and inodes as big as your head, you can do that. If you don't, more power
to you. It's all entirely contained inside your specific FS code, the FS
api doesn't care one bit. It just asks you for files.

It also means that you can do cute things like use the exact same API for
block/char/random devices as you do for filesystems. No big fuss over
special files, procfs, devfs, or dead chickens, your device driver just
calls up the FS api and says "hey, I'm /dev/dsp" or "hey, I'll be taking
care of /proc/cpuinfo" and it all "just works".

Also, it means that if you want to represent your multiforked filesystem as
files-as-directories (can-o-worms: open), you can just do it. No changes to
the FS api, no other filesystems break, etc. Everything "just works".


If someone, ANYONE, could bring this kind of painfully simple FS api to
linux, and make it work, not only would I be eternally in their debt, I
would personally send them a box of genuine canadian maple-sugar candies as
a small token of my infinite thanks.

Even failing that, I urge anyone who would want to look at (re)designing
any filesystem API to look at how QNX does it. It's really a beautiful
thing. Further reading can be found in "Getting Started with QNX Neutrino
2: A Guide for Realtime Programmers", ISBN 0968250114.


I should apologise here for this email being particularly fluffy. It's
getting a bit late here, and I don't want to switch my brain on again
before I go to sleep.

For those of you who would rather not have read through this entire email,
here's the condensed version: VFS is inherently a wrong-level API, QNX does
it much better. Flame on. :)


Cheers - Tony 'Nicoya' Mantler :)


--
Tony "Nicoya" Mantler - Renaissance Nerd Extraordinaire - [EMAIL PROTECTED]
Winnipeg, Manitoba, Canada   --   http://nicoya.feline.pp.se/





Re: hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel

Hi,

> > hfs. For example reading from a file might require a read from a btree
> > file (extent file), with which another file write can be busy (e.g.
> > reordering the btree nodes).
> 
> And?

The point is: the thing I like about Linux is its simple interfaces, it's
the basic idea of unix - keep it simple. That is true for most parts - the
basic idea is simple and the real complexity is hidden behind it. But
that's currently not true for the vfs interface; a fs maintainer has to fight
right now with a fscking complex vfs interface and with a possibly fscking
complex fs implementation. E2fs or affs have a pretty simple structure and
I believe you that it's not that hard to fix; maybe there is also a simple
solution for hfs. But I'd like you to forget about that and think about
the big picture (as Linus nicely states it). What we should aim at with
the vfs interface is simplicity. I want to use a fscking simple semaphore
to protect something, like anywhere else; I don't want to juggle with lots
of blocks which have to be updated atomically. Maybe you get it right once,
but it will follow you like a nightmare: you add one feature (e.g. quota),
you add another feature (like btrees) - are you still so damned fscking
sure you get and keep it right?
So what? What I'd really like to see from you is to be a bit more
supportive of other people's problems. I really don't expect you to solve
these problems, but if someone approaches a different solution, you're
pretty quick to refuse it.
So let's get back to the vfs interface: fs currently have to do pretty
much all their changes atomically; they have to grab all the buffers they
need and do all changes at once. How can you be sure that this is possible
for every possible fs? How do you make sure you don't create other
problems like livelocks? We currently have the problem that things like
kswapd require an asynchronous interface, but fs prefer to synchronize it.
Currently you're pushing all the burden of an asynchronous interface onto
the fs, which would rather avoid that. Why don't you think for a moment in
the other direction? Currently I'm playing with the idea of a kernel
thread for asynchronous io (maybe one per fs); that thread takes the io
requests e.g. from kswapd, and the io thread can safely sleep on them
while kswapd continues its job. I don't know yet where to put it, whether
in the fs-specific part or whether it can be made generic enough to be put
into the generic part. Can we please think for a moment in that direction?
At some point you have to synchronize the io anyway (at the latest when it
hits the device), but I would pretty much prefer it if a fs would get some
help at some earlier point.
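
A rough sketch of the idea (struct io_request, perform_io() and friends
are invented; the list and wait-queue primitives are the usual ones):

static LIST_HEAD(io_queue);
static spinlock_t io_lock = SPIN_LOCK_UNLOCKED;
static DECLARE_WAIT_QUEUE_HEAD(io_wait);

/* Called e.g. from writepage() on behalf of kswapd: must not sleep. */
void queue_async_io(struct io_request *req)
{
        spin_lock(&io_lock);
        list_add_tail(&req->list, &io_queue);
        spin_unlock(&io_lock);
        wake_up(&io_wait);
}

/* The per-fs I/O thread: the one place that may sleep on fs locks. */
int fs_io_thread(void *unused)
{
        struct io_request *req;

        for (;;) {
                wait_event(io_wait, !list_empty(&io_queue));
                spin_lock(&io_lock);
                req = list_entry(io_queue.next, struct io_request, list);
                list_del(&req->list);
                spin_unlock(&io_lock);
                perform_io(req);        /* may sleep; kswapd is long gone */
        }
        return 0;
}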
(Anyway, I need some sleep now as well... :) )

bye, Roman




Re: hfs support for blocksize != 512

2000-08-29 Thread Alexander Viro



On Tue, 29 Aug 2000, Roman Zippel wrote:

> hfs. For example reading from a file might require a read from a btree
> file (extent file), with which another file write can be busy (e.g.
> reordering the btree nodes).

And?

> I really would prefer that a fs could sleep _and_ use semaphores;
> that would keep locking simple, otherwise it just becomes a fscking mess.

WTF? HFS does not allow holes. _ALL_ allocations there are
process-synchronous. What's the problem? Pageout on HFS cannot allocate
blocks, and that's the only process-async method. If you want to sleep
at completely arbitrary moments while you are modifying the btree (i.e.
at moments when it's in an inconsistent state and hfs_get_block() would
fail) - too bad, you are going to have problems. And not from me - a
power failure will take care of making your life _very_ painful.




Re: hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel

Hi,

> Darnit, documentation on filesystem locking is there for a purpose. First
> folks complain about its absence, then they don't bother to read the
> bloody thing once it is there. Furrfu...

It's great that it's there, but it still doesn't tell you everything.

> That said, handling of indirect blocks used to be badly b0rken on all
> normal filesystems, and it had been fixed only on ext2, so I wouldn't be
> amazed if regular files were bad on B-tree style filesystems. Directories
> are easy - all requests are process-synchronous (no pageout), no
> truncate() in sight, so life is better.

I don't think that files are that easy, at least from what I know now from
hfs. For example reading from a file might require a read from a btree
file (extent file), with which another file write can be busy (e.g.
reordering the btree nodes).
I really would prefer that a fs could sleep _and_ use semaphores;
that would keep locking simple, otherwise it just becomes a fscking mess.

bye, Roman




Re: hfs support for blocksize != 512

2000-08-29 Thread Alexander Viro



On Tue, 29 Aug 2000, Matthew Wilcox wrote:

> On Tue, Aug 29, 2000 at 06:08:04PM +0200, Roman Zippel wrote:
> > Anyway, I'm happy about any bug reports that you can't reproduce with
> > hfs on a drive with 512-byte sectors (for that I'm still trying to fully
> > understand hfs btrees :-) ). I don't think this patch should be included
> 
> last time i looked (somewhere around 2.3.4x), all the B-tree directory
> implementations in the kernel were broken.  That's HFS, HPFS and NTFS.
> None of them consider the race where an insert occurs into the tree
> while you're doing a readdir.  I thought about how to fix it for ext2
> btrees but I haven't come up with a satisfactory solution yet.

readdir() holds both ->i_sem and ->i_zombie, so I'm not sure what other
exclusion you need.

Darnit, documentation on filesystem locking is there for a purpose. First
folks complain about its absence, then they don't bother to read the
bloody thing once it is there. Furrfu...

That said, handling of indirect blocks used to be badly b0rken on all
normal filesystems, and it had been fixed only on ext2, so I wouldn't be
amazed if regular files were bad on B-tree style filesystems. Directories
are easy - all requests are process-synchronous (no pageout), no
truncate() in sight, so life is better.




Re: hfs support for blocksize != 512

2000-08-29 Thread Matthew Wilcox

On Tue, Aug 29, 2000 at 06:08:04PM +0200, Roman Zippel wrote:
> Anyway, I'm happy about any bug reports that you can't reproduce with
> hfs on a drive with 512-byte sectors (for that I'm still trying to fully
> understand hfs btrees :-) ). I don't think this patch should be included

Last time I looked (somewhere around 2.3.4x), all the B-tree directory
implementations in the kernel were broken.  That's HFS, HPFS and NTFS.
None of them consider the race where an insert occurs into the tree
while you're doing a readdir.  I thought about how to fix it for ext2
btrees but I haven't come up with a satisfactory solution yet.
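
The shape of a fix I keep circling around (a sketch with invented names,
not code from any of those filesystems): resume readdir from the last
key returned to userspace instead of from a position inside a node, so
an insert that splits a node can't shift entries under the cursor:

struct dir_cursor {
        struct btree_key last_key;      /* last entry handed to userspace */
};

static int btree_readdir(struct btree *tree, struct dir_cursor *cur,
                         int (*emit)(void *ctx, struct btree_entry *e),
                         void *ctx)
{
        struct btree_entry *e;

        /* Re-descend from the root for every entry: node addresses may
         * change across inserts, but the key ordering does not. */
        while ((e = btree_next_entry(tree, &cur->last_key)) != NULL) {
                if (emit(ctx, e) < 0)
                        break;
                cur->last_key = e->key;
        }
        return 0;
}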




hfs support for blocksize != 512

2000-08-29 Thread Roman Zippel

Hi,

Here is a patch for anyone who needs to access HFS on e.g. an MO drive.
It's only for 2.2.16; I was able to do it as part of my job, as we need
that functionality. Anyway, I've also read a bit through the HFS+ spec,
and IMO basically most of the current hfs needs to be rewritten for 2.4:
e.g. its special files should better go into the page cache, and hfs
basically assumes 512-byte blocks everywhere, which isn't true anymore
with hfs+. This 512-byte block problem is also the reason that the
performance of this patch will suck badly on MOs, since _every_ write (of
a 512-byte block) requires a read (of a 1024-byte sector).
Anyway, I'm happy about any bug reports that you can't reproduce with
hfs on a drive with 512-byte sectors (for that I'm still trying to fully
understand hfs btrees :-) ). I don't think this patch should be included
in standard 2.2, but on the other hand it also shouldn't make anything
worse than it already is.

bye, Roman
 hfs1024.diff.gz