Re: module counts in kern_mount()

2000-07-30 Thread Andi Kleen

On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
 
 
 On Sun, 30 Jul 2000, Andi Kleen wrote:
 
  
  kern_mount currently forgets to increase the module count of the file system,
  leading to negative module count after umount.
 
 That's because it is not supposed to be used more than once or to be
 undone by umount(). If it _would_ increment the counter you would be
 unable to get rid of the module once it's loaded. What are you actually
 trying to do?

It is not even done once. I was just writing a small module that registers
a private file system similar to sockfs.

IMHO kern_mount should increase the count so that it is symmetric with
kern_umount

-Andi
 



Re: module counts in kern_mount()

2000-07-30 Thread Andi Kleen

On Sun, Jul 30, 2000 at 06:28:11PM -0400, Alexander Viro wrote:
 
 
 On Mon, 31 Jul 2000, Andi Kleen wrote:
 
  On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
   
   
   On Sun, 30 Jul 2000, Andi Kleen wrote:
   

kern_mount currently forgets to increase the module count of the file system,
leading to negative module count after umount.
   
   That's because it is not supposed to be used more than once or to be
   undone by umount(). If it _would_ increment the counter you would be
   unable to get rid of the module once it's loaded. What are you actually
   trying to do?
  
  It is not even done once. I was just writing a small module that registers
  a private file system similar to sockfs.
 
 Great, so why locking it in-core? It should be done when you mount it, not
 when you register.

It is mounted in the module too (it is a fileless file system like sockfs
and does not have a file system mount point, so it can do that).

Anyways, the problem is that the mounting does not increase the module
count, but the umount does.

 
  IMHO kern_mount should increase the count so that it is symmetric with
  kern_umount
 
 <blinks> How TF did kern_umount() come to decrement it? Oh, I see -
 side effect of kill_super()... OK, _that_ must be fixed. By adding
 get_filesystem(sb->s_type); before the call of kill_super(sb, 0); in
 kern_umount().

I'm not sure I follow, but shouldn't mounting increase the fs module count? 
How else would you do module count management for file systems?
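
For reference, a rough sketch of the change Viro describes above -- not a
verbatim 2.4 patch, just the idea, with the surrounding function paraphrased
from memory and possibly differing from the real kern_umount():

void kern_umount(struct vfsmount *mnt)
{
	struct super_block *sb = mnt->mnt_sb;

	/* Take an extra reference on the fs type so that the decrement
	 * which happens as a side effect of kill_super() is balanced
	 * and the module count cannot go negative. */
	get_filesystem(sb->s_type);
	kill_super(sb, 0);
	/* ... release the vfsmount itself ... */
}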


-Andi



Re: Questions about the buffer+page cache in 2.4.0

2000-07-29 Thread Andi Kleen

On Sat, Jul 29, 2000 at 06:58:34PM +0200, Gary Funck wrote:
 What entity is responsible for tearing down the file-page mapping, when
 the storage is needed?  Is that bdflush's job?

kswapd's and the allocators themselves.

 
 In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
 the part of the I/O request being handled has a user-level address that
 is page aligned, and the requested bytes to transfer are at least one
 full page?  Ie, does it handle the 'copy' by simply remapping the I/O
 page directly into the user's address space (avoiding a copy)?

No, you cannot avoid the copy anyway, because you need to maintain
cache coherency in the page cache. If you want zero-copy IO, use mmap().
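
A minimal user-space sketch of the mmap() route (illustrative only, error
handling kept short):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read file data through the page cache without an extra copy into a
 * private buffer: the cached pages are mapped straight into the
 * process address space. */
int process_file(const char *path)
{
	struct stat st;
	char *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	/* ... work directly on p[0 .. st.st_size-1] ... */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}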


-Andi



Re: Questions about the buffer+page cache in 2.4.0

2000-07-27 Thread Andi Kleen

On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
 So now it's time to start asking questions.  Just jumping in at a place I felt I
 knew pretty well back in 2.2.13, I'm now looking at the 2.4.0 getblk, and I see
 it's changed somewhat.  Finding and removing a block from the free list is now
 bracketed by a spinlock pair.  First question: why do we use atomic_set to set
 the initial buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because interrupts
may change it even on UP. An atomic_t can only be modified by the atomic_*
functions, and atomic.h lacks an "atomic_set_nonatomic". So even when you only
need the atomic property in one place, you have to change all uses of the field.
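
A small hypothetical snippet of the pattern being discussed (this is not the
actual getblk()/buffer code; 2.4-era spinlock and atomic APIs are assumed and
the example_* names are invented):

#include <linux/spinlock.h>
#include <asm/atomic.h>

/* b_count must be an atomic_t because interrupt context also updates it,
 * and an atomic_t may only be touched through the atomic_* helpers.  So
 * even inside a section that is already serialised by a spinlock the
 * initialisation is still spelled atomic_set(). */
struct example_buffer {
	atomic_t b_count;
};

static spinlock_t example_lock = SPIN_LOCK_UNLOCKED;

static void example_grab(struct example_buffer *bh)
{
	spin_lock(&example_lock);
	atomic_set(&bh->b_count, 1);	/* no plain assignment exists for atomic_t */
	spin_unlock(&example_lock);
}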


-Andi

-- 
This is like TV. I don't like TV.



Re: Reiserfs and NFS

2000-05-04 Thread Andi Kleen

On Wed, May 03, 2000 at 11:57:52PM +0200, Steve Dodd wrote:
 On Wed, May 03, 2000 at 11:27:34AM +0200, Andi Kleen wrote:
  On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
 
   First off, could we call them "inode labels" or something less confusing?
   "file" outside of NFS has a different meaning (semantic namespace collision
   <g>) Also, I don't see how a "fh_to_dentry" (or ilbl_to_dentry) is going to
   work - (think hardlinks, etc.). You do need an iget_by_label or something
  
  NFS file handles are always a kind of hard link, in traditional Unix
  they refer directly to inodes. Linux knfsd uses dentries only because the
  VFS semantics require it, not because of any NFS requirements.
 
 What I meant was, you can't have an "ilabel_to_dentry" (or fh_to_dentry for
 that matter <g>) function because there may well be more than one dentry
 pointing to the inode.

It does not matter, as long as you get some dentry pointing to it.
The new nfs file handle code even uses anonymous dentries in some 
cases (dentries not connected to the dentry tree). 
The nfsfh conceptually just acts like another hard link.
The standard 2.2 code is really broken because it cannot handle renaming
of directories (because the file handles are path dependent). 2.3/
2.2+sourceforge code tries to fix this mostly with some evil tricks.

 As for NFS's use of dentries, I'm still not sure I understand all the
 details. Without having read the specs, I would expect it to be operating
 mostly on inodes, but I'm sure there are good reasons why it doesn't.

NFS wants to operate on inodes, but the Linux 2.2+ VFS does not allow it
(that is why the nfsfh.c code is so complex -- in early 2.1 knfsd,
before the dcache architecture was introduced, nfsfh.c was *much* simpler).

 
 [..]
  iget_by_label() is already implemented in 2.3 -- see iget4(). Unfortunately
  it is a bit inefficient to search inodes this way [because you cannot index
  on the additional information], but not too bad.
 
 iget4 isn't quite the same -- you need to supply a "find actor" to compare
 the other parts of the inode identifier, which are fs-specific. knfsd wouldn't
 be able to supply a find actor for the underlying filesystem it was serving.

``with some trivial extensions''
The 2.3 nfsfh code supports arbitrary file handle types, indexed with
an identifier. You would associate the find_actor with a specific
identifier. Some fs-specific code is needed anyways to write the
private parts into the fh (e.g. in the reiser case for writing
the true packing locality).
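
As an illustration of what such an extension could look like (a sketch only:
the example_* names and the per-inode locality accessor are invented, and the
interface assumed is roughly 2.3/2.4's iget4(sb, ino, find_actor, opaque)):

#include <linux/fs.h>
#include <linux/types.h>

struct example_fh_priv {
	__u32 packing_locality;		/* the fs-private part of the handle */
};

/* made-up accessor; a real fs would read its private inode info here */
extern __u32 example_inode_locality(struct inode *inode);

/* find_actor: the inode number already matched in the hash, now compare
 * the extra identity information that cannot be indexed. */
static int example_find_actor(struct inode *inode, unsigned long ino,
			      void *opaque)
{
	struct example_fh_priv *priv = opaque;

	return example_inode_locality(inode) == priv->packing_locality;
}

static struct inode *example_iget_by_label(struct super_block *sb,
					   unsigned long objectid,
					   __u32 packing_locality)
{
	struct example_fh_priv priv;

	priv.packing_locality = packing_locality;
	return iget4(sb, objectid, example_find_actor, &priv);
}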


 
   Also, what are the size constraints imposed by NFS? What about other network
   filesystems?
  
  NFSv2 has 2GB limits for files. 
 
 Sorry, I was thinking more of limits imposed on the size of the "file handle" /
 inode identifier..

NFSv2 has 32-byte file handles. NFSv3 has longer ones.
For all current versions of reiserfs the necessary information can
be squeezed into 32 bytes with some tricks.
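
Purely to illustrate the space budget, a made-up layout that fits the fixed
32 bytes of an NFSv2 handle (this is not the real knfsd or reiserfs handle
format):

#include <linux/types.h>

struct example_nfsv2_fh {
	__u32 dev;			/* exported device */
	__u32 export_ino;		/* export point, for subtree checks */
	__u32 objectid;			/* reiserfs object id ("inode number") */
	__u32 packing_locality;		/* needed to locate the object on disk */
	__u32 generation;		/* guards against reused object ids */
	__u32 unused[3];		/* pads the handle out to the fixed 32 bytes */
};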


-Andi

-- 
This is like TV. I don't like TV.



Re: fs changes in 2.3

2000-05-03 Thread Andi Kleen

On Wed, May 03, 2000 at 08:54:54AM +0200, Alexander Viro wrote:

[flame snipped -- hopefully everybody can go back to normal work now]

 
 ObNFS: weird as it may sound to you, I actually write stuff - not
 "subcontract" to somebody else. So I'm afraid that I have slightly less
 free time than you do. FWIC, in Reiserfs context nfsd is a non-issue.
 Current kludge is not too lovely, but it's well-isolated and can be
 replaced fast. So ->read_inode2() is ugly, but in my opinion it's not an
 obstacle. If other problems will be resolved and by that time 
 ->fh_to_dentry() interface will not be in place - count on my vote for
 temporary adding ->read_inode2().

In the long run some generic support for 64-bit inodes will be needed
anyways -- other file systems depend on that too (e.g. XFS). So 
fh_to_dentry only saves you temporarily. I think adding read_inode2
early is best.


-Andi



Re: Reiserfs and NFS

2000-05-03 Thread Andi Kleen

On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
 On Tue, May 02, 2000 at 01:50:16PM -0700, Chris Mason wrote:
  
  ReiserFS has unique inode numbers, but they aren't enough to actually find
  the inode on disk.  That requires the inode number, and another 32 bits of
  information we call the packing locality.  The packing locality starts as
  the parent directory inode number, but does not change across renames.
  
  So, we need to add a fh_to_dentry lookup operation for knfsd to use, and
  perhaps a dentry_to_fh operation as well (but _fh_update in pre6 looks ok
  for us).
 
 First off, could we call them "inode labels" or something less confusing?
 "file" outside of NFS has a different meaning (semantic namespace collision
 <g>) Also, I don't see how a "fh_to_dentry" (or ilbl_to_dentry) is going to
 work - (think hardlinks, etc.). You do need an iget_by_label or something

NFS file handles are always a kind of hard link, in traditional Unix
they refer directly to inodes. Linux knfsd uses dentries only because the
VFS semantics require it, not because of any NFS requirements.

 similar though. Details that need to be worked out would be max label size
 and how they're passed around (void * ptr and a length?)

iget_by_label() is already implemented in 2.3 -- see iget4(). Unfortunately
it is a bit inefficient to search inodes this way [because you cannot index
on the additional information], but not too bad.

 
 Also, what are the size constraints imposed by NFS? What about other network
 filesystems?

NFSv2 has 2GB limits for files. 
 
-Andi



Re: [Fwd: [linux-audio-dev] info point on linux hdr]

2000-04-14 Thread Andi Kleen

On Sat, Apr 15, 2000 at 12:24:16AM +0200, Andrew Clausen wrote:
 i mentioned in some remarks to benno how important i thought it was to
 preallocate the files used for hard disk recording under linux.

[...]

Unfortunately efficient preallocation is rather hard with the current
ext2. To do it efficiently you just want to allocate the blocks in the
bitmaps without writing into the actual allocated blocks (otherwise
it would be as slow as the manual write-every-block-from-userspace trick).
Now these blocks could still contain data from other files, and the new owner
of the block should not be allowed to see the old data, for security reasons.

You suddenly get a new special kind of block in the file system with
different semantics: ignore old data and only supply zeroes to the reader, 
unless the block has actually been written. 

This ``ignore old data until written'' information about the block would
need to be persistent on disk -- you cannot just hold it in memory, 
otherwise it would not be known anymore after a reboot/crash. Filling 
the unwritten blocks with zeroes on shutdown would be too slow[1]. 

The problem is that ext2 has no space to store this information. The blocks
are allocated using simple bitmaps, and you cannot express three states
(free, allocated, write-only) in only a single bit. If you were using an
extent-based file system there would probably be enough space (you usually
find space for a single bit somewhere in the extent tree), but with
bitmaps it is tricky and would require on-disk format changes. 
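
To make the three-state point concrete, here is a toy extent descriptor
(hypothetical, loosely in the spirit of what the extent-based file systems
keep on disk; the names are invented):

/* The "extra state" that a plain allocation bitmap has no room for: */
enum example_extent_state {
	EXT_FREE,		/* not allocated */
	EXT_UNWRITTEN,		/* preallocated; reads back as zeroes */
	EXT_WRITTEN		/* allocated and containing real data */
};

struct example_extent {
	unsigned long start_block;
	unsigned long length;
	enum example_extent_state state;
};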

Apparently extent-based ext2 is planned; maybe it would be useful to 
include that feature then, but before that it looks too hairy. 

JFS and XFS seem to support these things already. 

-Andi

[1] Imagine your system starting a 100MB write when the UPS tries to force 
a quick shutdown on power fail -- you really don't want that.




Re: Ext2 / VFS projects

2000-02-09 Thread Andi Kleen

On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
 
 On 09-Feb-00 Andi Kleen wrote:
  On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
  
  [...]
  
  How about secure deletion? 
  
  1.3 used to have some simple minded overwriting of deleted data when the 
  's' attribute was set.  That got lost with 2.0+. 
  
  Secure disk overwriting that is immune to 
  manual surface probing seems to take a lot more effort  (Colin Plumb's 
  sterilize does 25 overwrites with varying bit patterns). Such a complicated
  procedure is probably better kept in user space. What I would like is some
  way to have a sterilize daemon running, and when a file with the 's' attribute gets
  deleted the VFS would open a new file descriptor for it, pass it to 
  sterilized (sterild?) using a unix control message and let it do its job.
  
  What does the audience think? Should such a facility have kernel support
  or not?  I think secure deletion is an interesting topic and it would be
  nice if Linux supported it better.
 
 You have to be careful that you don't leak the file you're trying to eradicate
 into the swap via the sterilize daemon.  I guess simply never reading the file
 is a good start.

sterilize does that. You of course have to be careful that you didn't leak
its content to swap before (one way around that is encrypted swapping).
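
(As an aside, the ``unix control message'' hand-off sketched above is ordinary
SCM_RIGHTS descriptor passing. A minimal, purely illustrative receive-side
sketch of what such a daemon could do, error handling trimmed:)

#include <string.h>
#include <sys/socket.h>

/* Receive one file descriptor over an AF_UNIX socket.  The daemon never
 * needs to read the file's contents, only overwrite them. */
static int recv_fd(int sock)
{
	struct msghdr msg;
	struct iovec iov;
	struct cmsghdr *cmsg;
	char cbuf[CMSG_SPACE(sizeof(int))];
	char dummy;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = &dummy;
	iov.iov_len = 1;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = cbuf;
	msg.msg_controllen = sizeof(cbuf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_RIGHTS) {
			memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
			break;
		}
	}
	return fd;	/* descriptor of the file to be wiped */
}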

 
 The other question is whether you're talking about an ext2-specific thing, or
 whether it's a general service all filesystems provide.  Many filesystem

I was actually only thinking about ext2 (because only it has an 's' bit 
and the thread is about ext2's future).

 designs, including ext3 w/ journalling, reiserfs(?) and the NetApp Wafl
 filesystem, don't let a process overwrite an existing block on disk.  Well,
 ext3 does, but only via the journal; wafl never does.  There's also the
 question of what happens when you have a RAID device under the filesystem,
 especially with hot-swappable disks.

reiserfs lets you overwrite in place as long as you don't change the file size
(if you do, it is possible that the file is migrated from a formatted node to
an unformatted node). sterilize does not change file sizes.

ext3 only prevents it when you do data journaling (good point, I forgot
that).

RAID0/RAID1 are no problem I think, because you always have well-defined
block(s) to write to. The wipe data does not depend on the old data on
the disk, so e.g. on a simple mirrored configuration both blocks would
be sterilized in parallel.

RAID5 devices could be a problem, especially when they do data journaling
(I think most only journal some metadata). It is not clear how the sterilize
algorithms interact with the XORed blocks.

If you swap your disks in between, you lose.

 
 Perhaps a better approach, since we're talking about a privileged process, is
 to get a list of raw blocks and go directly to the disk.  You'd have to be very
 careful to synchronize with the filesystem...

Not too much. The file still exists, but there are no references to it
outside sterild. No other process can access it. Assuming the file system
does not have fragments, the raw I/O has block granularity and the file was
fdatasync'ed before, you could directly access it without worrying about
any file system interference. If the fs has fragments you need the
infrastructure needed for O_DIRECT (I think that is planned anyways).

With a "invalidate all dirty buffers for file X" call you could optimize
part of the fdatasync writes away, but a good sterilize needs so many 
writes anyways (25+) that it probably does not make much difference. 

The data would only really be deleted when the system is turned off,
because it could partly still exist in some not-yet-reused buffers.


-Andi



Re: file system size limits

2000-01-06 Thread Andi Kleen

On Thu, Jan 06, 2000 at 04:03:38PM +0100, Manfred Spraul wrote:
 What's the current limit for ext2 filesystems, and what happens if a users
 creates a larger disk? I think we should document the current limits, and
 refuse to mount larger disks. I guess the current limit is somewhere around
 1000 GB (512*INT_MAX)?

The limit is more like ~2TB with 1K blocks and ~8TB with 4K blocks. 
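
(For the arithmetic behind those figures, assuming the limiting factor is a
signed 32-bit block number, i.e. 2^31 addressable blocks: 2^31 * 1K = 2^41
bytes = 2TB, and 2^31 * 4K = 2^43 bytes = 8TB.)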

 
 I'm posting this question because I've already seen a message that someone
 uses a 500 GB ext2 fs, and because (IIRC) certain versions of the Norton
 Commander silently corrupted disks > 2GB on the Macintosh when Apple removed
 the 2 GB limit a few years ago.

Some version of fsck compiled with the wrong llseek() did that too.

 The Linux filesystems/utilities [kernel, fsck,...] should avoid similar
 problems.

I think it is not a problem currently, because both 2TB and 8TB would
take several days of fsck, which makes them impractical.

-Andi



Re: Ext2 defragmentation

1999-11-15 Thread Andi Kleen

On Mon, Nov 15, 1999 at 03:00:20PM +0100, Pavel Machek wrote:
 Hi!
 
 How necessary is it to defragment one's ext2 partitions? It just hit me
 that defragmentation is very important under the Wintendo filesystem.

It's not as important.  But... I had an idea for an ext2 defrag daemon,
e2defragd, which would take advantage of _disk_ idle time to reorganize
blocks, while the filesystem was mounted.  This daemon would be a good
candidate for disk optimizations like moving frequently-accessed files
to the middle of the disk in addition to background defragging.
   
   There's one useful thing that could be done with e2defrag: putting
   directories at the beginning of the disk exactly in the order find /
   would use. One line hack, but e2defrag just does not work for me.
 Pavel
  
  Isn't it better to simply use locate / updatedb instead? 
 
 No. There are other operations (such as du -s ., search from midnight)
 which have find-like access pattern. And you have no chance of getting
 out of date.

It just sounds silly to optimize the disk layout for such specific cases.
Maybe if you're only running du -s and find / all day, but somehow I doubt
that.


-Andi


-- 
This is like TV. I don't like TV.