Re: module counts in kern_mount()
On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
> On Sun, 30 Jul 2000, Andi Kleen wrote:
> > kern_mount currently forgets to increase the module count of the
> > file system, leading to negative module count after umount.
> That's because it is not supposed to be used more than once or to be
> undone by umount(). If it _would_ increment the counter you would be
> unable to get rid of the module once it's loaded. What are you
> actually trying to do?

It is not even done once. I was just writing a small module that
registers a private file system similar to sockfs.

IMHO kern_mount should increase the count so that it is symmetric
with kern_umount.

-Andi
Re: module counts in kern_mount()
On Sun, Jul 30, 2000 at 06:28:11PM -0400, Alexander Viro wrote:
> On Mon, 31 Jul 2000, Andi Kleen wrote:
> > On Sun, Jul 30, 2000 at 06:04:16PM -0400, Alexander Viro wrote:
> > > On Sun, 30 Jul 2000, Andi Kleen wrote:
> > > > kern_mount currently forgets to increase the module count of the
> > > > file system, leading to negative module count after umount.
> > > That's because it is not supposed to be used more than once or to
> > > be undone by umount(). If it _would_ increment the counter you
> > > would be unable to get rid of the module once it's loaded. What
> > > are you actually trying to do?
> > It is not even done once. I was just writing a small module that
> > registers a private file system similar to sockfs.
> Great, so why locking it in-core? It should be done when you mount
> it, not when you register.

It is mounted in the module too (it is a fileless file system like
sockfs and does not have a file system mount point, so it can do that).
Anyways, the problem is that the mounting does not increase the module
count, but the umount does.

> > IMHO kern_mount should increase the count so that it is symmetric
> > with kern_umount.
> *blinks* How TF did kern_umount() come to decrement it? Oh, I see -
> side effect of kill_super()... OK, _that_ must be fixed. By adding
> get_filesystem(sb->s_type); before the call of kill_super(sb, 0); in
> kern_umount().

I'm not sure I follow, but shouldn't mounting increase the fs module
count? How else would you do module count management for file systems?

-Andi
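The asymmetry Viro diagnoses -- kern_umount()'s kill_super() dropping a
filesystem reference that kern_mount() never took -- can be modeled in
userspace. The following is a hedged Python sketch of the refcount
bookkeeping only; the names mirror the kernel functions, but none of
this is actual kernel code:

```python
class FileSystemType:
    """Toy stand-in for struct file_system_type's module use count."""
    def __init__(self, name):
        self.name = name
        self.count = 0

def get_filesystem(fs_type):      # rough MOD_INC_USE_COUNT analogue
    fs_type.count += 1

def put_filesystem(fs_type):      # rough MOD_DEC_USE_COUNT analogue
    fs_type.count -= 1

def kern_mount(fs_type):
    # The behavior under discussion: no get_filesystem() here.
    return {"s_type": fs_type}

def kill_super(sb):
    # kill_super() unconditionally drops a filesystem reference ...
    put_filesystem(sb["s_type"])

def kern_umount_buggy(sb):
    kill_super(sb)                # ... so the count goes negative

def kern_umount_fixed(sb):
    # Viro's suggested fix: take a reference to balance kill_super()'s
    # implicit put.
    get_filesystem(sb["s_type"])
    kill_super(sb)

fs = FileSystemType("sockfs-like")
sb = kern_mount(fs)
kern_umount_buggy(sb)
print(fs.count)    # -1: the negative module count Andi observed

fs2 = FileSystemType("sockfs-like")
sb2 = kern_mount(fs2)
kern_umount_fixed(sb2)
print(fs2.count)   # 0: balanced again
```

The model also shows why Viro's fix leaves kern_mount() itself alone:
the mount deliberately takes no reference, so only the umount path
needs to be made self-balancing.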
Re: Questions about the buffer+page cache in 2.4.0
On Sat, Jul 29, 2000 at 06:58:34PM +0200, Gary Funck wrote:
> What entity is responsible for tearing down the file-page mapping,
> when the storage is needed? Is that bdflush's job?

kswapd's and the allocators themselves.

> In 2.4, does the 'read actor' (ie, for ext2) optimize the case where
> the part of the I/O request being handled has a user-level address
> that is page aligned, and the requested bytes to transfer are at
> least one full page? Ie, does it handle the 'copy' by simply
> remapping the I/O page directly into the user's address space
> (avoiding a copy)?

No, because you cannot avoid the copy anyways: you need to maintain
cache coherency in the page cache. If you want zero copy IO use
mmap().

-Andi
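The distinction can be illustrated from userspace: read() copies data
out of the page cache into a private buffer, while mmap() maps the
page cache pages directly, so writes through the mapping stay coherent
with reads. A minimal Python sketch (the temp file is invented for the
example; `mmap` here is the standard library module, not the raw
syscall):

```python
import mmap
import os
import tempfile

# Create a small file to act as the backing store.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"A" * 4096)

with open(path, "r+b") as f:
    # read() path: the kernel copies out of the page cache into our
    # private buffer.
    copied = f.read(4096)

    # mmap() path: the page cache page is mapped into our address
    # space; slicing a memoryview of it does not copy the data.
    m = mmap.mmap(f.fileno(), 4096)
    view = memoryview(m)
    assert view[:10] == b"A" * 10

    # Coherency: a write through the mapping is visible to read() too,
    # because both paths go through the same page cache pages -- this
    # is the coherency a remapping trick would break.
    m[0:1] = b"B"
    f.seek(0)
    assert f.read(1) == b"B"

    view.release()
    m.close()

os.unlink(path)
```

The coherency assertion is the crux of Andi's answer: if the kernel
handed the page itself to one process, later writers and readers of
the same file offset would still have to see a single shared copy.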
Re: Questions about the buffer+page cache in 2.4.0
On Thu, Jul 27, 2000 at 08:02:50PM +0200, Daniel Phillips wrote:
> So now it's time to start asking questions. Just jumping in at a
> place I felt I knew pretty well back in 2.2.13, I'm now looking at
> the 2.4.0 getblk, and I see it's changed somewhat. Finding and
> removing a block from the free list is now bracketed by a spinlock
> pair. First question: why do we use atomic_set to set the initial
> buffer use count if this is already protected by a spinlock?

The buffer use count needs to be atomic in other places because
interrupts may change it on UP. atomic_t can only be modified by
atomic_* functions and atomic.h is lacking an "atomic_set_nonatomic".
So even when you only need the atomic property once, you have to
change all uses of the field.

-Andi

--
This is like TV. I don't like TV.
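The point is about API consistency rather than the lock itself: once a
field is declared atomic_t, every access must go through the atomic_*
accessors, even at call sites that are already serialized by a
spinlock. A hedged Python analogy (the class and all names are
invented for illustration -- Python has no atomic_t):

```python
import threading

class AtomicInt:
    """Models atomic_t: the raw value is private, and all access goes
    through atomic_*-style methods, even when the caller already holds
    an outer lock."""
    def __init__(self, value=0):
        self._lock = threading.Lock()   # stands in for hardware atomicity
        self._value = value

    def atomic_set(self, value):
        with self._lock:
            self._value = value

    def atomic_inc(self):
        with self._lock:
            self._value += 1

    def atomic_read(self):
        with self._lock:
            return self._value

free_list_lock = threading.Lock()
b_count = AtomicInt()

# getblk()-like path: already under a spinlock, but the field is an
# AtomicInt, so there is no cheaper non-atomic setter available --
# exactly the missing "atomic_set_nonatomic" Andi mentions.
with free_list_lock:
    b_count.atomic_set(1)

# Interrupt-like path elsewhere: must be atomic, no outer lock held.
b_count.atomic_inc()
print(b_count.atomic_read())   # 2
```

The redundant cost inside the spinlocked section is the price of
keeping one consistent access discipline for the field everywhere.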
Re: Reiserfs and NFS
On Wed, May 03, 2000 at 11:57:52PM +0200, Steve Dodd wrote:
> On Wed, May 03, 2000 at 11:27:34AM +0200, Andi Kleen wrote:
> > On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
> > > First off, could we call them "inode labels" or something less
> > > confusing? "file" outside of NFS has a different meaning (semantic
> > > namespace collision <g>) Also, I don't see how a "fh_to_dentry"
> > > (or ilbl_to_dentry) is going to work - (think hardlinks, etc.).
> > > You do need an iget_by_label or something
> > NFS file handles are always a kind of hard link, in traditional
> > Unix they refer directly to inodes. Linux knfsd uses dentries only
> > because the VFS semantics require it, not because of any NFS
> > requirements.
> What I meant was, you can't have a "ilabel_to_dentry" (or
> fh_to_dentry for that matter <g>) function because there may well be
> more than one dentry pointing to the inode.

It does not matter, as long as you get some dentry pointing to it. The
new NFS file handle code even uses anonymous dentries in some cases
(dentries not connected to the dentry tree). The nfsfh conceptually
just acts like another hard link.

The standard 2.2 code is really broken because it cannot handle
renaming of directories (the file handles are path dependent). The
2.3 / 2.2+sourceforge code tries to fix this mostly, with some evil
tricks.

> As for NFS's use of dentries, I'm still not sure I understand all the
> details. Without having read the specs, I would expect it to be
> operating mostly on inodes, but I'm sure there are good reasons why
> it doesn't.

NFS wants to operate on inodes, but the Linux 2.2+ VFS does not allow
it (that is why the nfsfh.c code is so complex -- in early 2.1 knfsd,
before the dcache architecture was introduced, nfsfh.c was *much*
simpler).

[..]

> > iget_by_label() is already implemented in 2.3 -- see iget4().
> > Unfortunately it is a bit inefficient to search inodes this way
> > [because you cannot index on the additional information], but not
> > too bad.
> iget4 isn't quite the same -- you need to supply a "find actor" to
> compare the other parts of the inode identifier, which are
> fs-specific. knfsd wouldn't be able to supply a find actor for the
> underlying filesystem it was serving.

``with some trivial extensions'' The 2.3 nfsfh code supports arbitrary
file handle types, indexed with an identifier. You would associate the
find_actor with a specific identifier. Some fs specific code is needed
anyways to write the private parts into the fh (e.g. in the reiserfs
case, for writing the true packing locality).

> > > Also, what are the size constraints imposed by NFS? What about
> > > other network filesystems?
> > NFSv2 has 2GB limits for files.
> Sorry, I was thinking more of limits imposed on the size of the "file
> handle" / inode identifier..

NFSv2 has 32 byte file handles. NFSv3 has longer ones. For all current
versions of reiserfs the necessary information can be squeezed into 32
bytes with some tricks.

-Andi

--
This is like TV. I don't like TV.
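The iget4() lookup pattern under discussion can be sketched in a few
lines: the inode cache is indexed by inode number alone, so the extra,
fs-specific part of the identifier (something like reiserfs's packing
locality) is checked by a caller-supplied find_actor callback during a
linear scan. All names and structures here are illustrative, not the
real kernel code:

```python
# Toy inode cache indexed only by inode number. The additional key
# cannot be indexed on -- hence the "bit inefficient" linear scan
# with a callback, as in iget4()'s find_actor.
inode_cache = {}   # ino -> list of inode dicts

def insert_inode(ino, packing_locality, data):
    inode_cache.setdefault(ino, []).append(
        {"ino": ino, "packing_locality": packing_locality, "data": data})

def iget4(ino, find_actor, opaque):
    # Scan every cached inode filed under this number and let the
    # fs-specific callback disambiguate.
    for inode in inode_cache.get(ino, []):
        if find_actor(inode, opaque):
            return inode
    return None

def reiserfs_find_actor(inode, opaque):
    # fs-specific comparison of the non-indexed part of the identifier
    # (hypothetical: modeled on the packing-locality idea above).
    return inode["packing_locality"] == opaque["packing_locality"]

insert_inode(42, packing_locality=7, data="old location")
insert_inode(42, packing_locality=9, data="after rename")

hit = iget4(42, reiserfs_find_actor, {"packing_locality": 9})
print(hit["data"])
```

Associating the find_actor with a file handle type identifier, as Andi
suggests, would let knfsd dispatch to the right fs-specific callback
without knowing its internals.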
Re: fs changes in 2.3
On Wed, May 03, 2000 at 08:54:54AM +0200, Alexander Viro wrote:

[flame snipped -- hopefully everybody can go back to normal work now]

> ObNFS: weird as it may sound to you, I actually write stuff - not
> "subcontract" to somebody else. So I'm afraid that I have slightly
> less free time than you do. FWIC, in Reiserfs context nfsd is a
> non-issue. Current kludge is not too lovely, but it's well-isolated
> and can be replaced fast. So ->read_inode2() is ugly, but in my
> opinion it's not an obstacle. If other problems will be resolved and
> by that time the ->fh_to_dentry() interface will not be in place -
> count on my vote for temporarily adding ->read_inode2().

In the long run some generic support for 64bit inodes will be needed
anyways -- other file systems depend on that too (e.g. XFS). So
fh_to_dentry only saves you temporarily. I think adding read_inode2
early is the best.

-Andi
Re: Reiserfs and NFS
On Tue, May 02, 2000 at 11:20:10PM +0200, Steve Dodd wrote:
> On Tue, May 02, 2000 at 01:50:16PM -0700, Chris Mason wrote:
> > ReiserFS has unique inode numbers, but they aren't enough to
> > actually find the inode on disk. That requires the inode number,
> > and another 32 bits of information we call the packing locality.
> > The packing locality starts as the parent directory inode number,
> > but does not change across renames. So, we need to add a
> > fh_to_dentry lookup operation for knfsd to use, and perhaps a
> > dentry_to_fh operation as well (but _fh_update in pre6 looks ok for
> > us).
> First off, could we call them "inode labels" or something less
> confusing? "file" outside of NFS has a different meaning (semantic
> namespace collision <g>) Also, I don't see how a "fh_to_dentry" (or
> ilbl_to_dentry) is going to work - (think hardlinks, etc.). You do
> need an iget_by_label or something

NFS file handles are always a kind of hard link, in traditional Unix
they refer directly to inodes. Linux knfsd uses dentries only because
the VFS semantics require it, not because of any NFS requirements.

> similar though. Details that need to be worked out would be max label
> size and how they're passed around (void * ptr and a length?)

iget_by_label() is already implemented in 2.3 -- see iget4().
Unfortunately it is a bit inefficient to search inodes this way
[because you cannot index on the additional information], but not too
bad.

> Also, what are the size constraints imposed by NFS? What about other
> network filesystems?

NFSv2 has 2GB limits for files.

-Andi
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
On Sat, Apr 15, 2000 at 12:24:16AM +0200, Andrew Clausen wrote:
> i mentioned in some remarks to benno how important i thought it was
> to preallocate the files used for hard disk recording under linux.
> [...]

Unfortunately efficient preallocation is rather hard with the current
ext2. To do it efficiently you want to allocate the blocks in the
bitmaps without writing to the actual allocated blocks (otherwise it
would be as slow as the manual write-every-block-from-userspace
trick).

Now these blocks could contain data from other files, and the new
owner of the block must not be allowed to see the old data, for
security reasons. You suddenly get a new special kind of block in the
file system with different semantics: ignore the old data and only
supply zeroes to the reader, until the block has actually been
written. This ``ignore old data until written'' information about the
block would need to be persistent on disk -- you cannot just hold it
in memory, otherwise it would not be known anymore after a
reboot/crash. Filling the unwritten blocks with zeroes on shutdown
would be too slow [1].

The problem is that ext2 has no space to store this information. The
blocks are allocated using simple bitmaps, and you cannot express
three states (free, allocated, write-only) in only a single bit. If
you were using an extent based file system there would probably be
enough space (you usually find space for a single bit somewhere in the
extent tree), but with bitmaps it is tricky and would require on-disk
format changes. Apparently an extent based ext2 is planned; maybe it
would be useful to include that feature then, but before that it looks
too hairy. JFS and XFS seem to support these things already.

-Andi

[1] Imagine your system starting a 100MB write when the UPS tries to
force a quick shutdown on power fail -- you really don't want that.
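For comparison, the slow userspace fallback mentioned above -- writing
every block once so the filesystem is forced to allocate it -- looks
like this. A hedged sketch: the block size and file are arbitrary, and
on modern Linux os.posix_fallocate() exists for exactly this job, so
the loop is only the historical workaround:

```python
import os
import tempfile

BLOCK_SIZE = 4096
NBLOCKS = 256            # preallocate 1 MB

fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as f:
        zero_block = b"\0" * BLOCK_SIZE
        # The write-every-block-from-userspace trick: each write makes
        # the filesystem allocate a real block, at the cost of real
        # I/O for data nobody needs -- which is why in-filesystem
        # preallocation with "allocated but unwritten" blocks would be
        # so much faster.
        for _ in range(NBLOCKS):
            f.write(zero_block)
        f.flush()
        os.fsync(f.fileno())
    size = os.stat(path).st_size
finally:
    os.unlink(path)
```

Writing the zeroes also sidesteps the security problem from the text:
the old data in the blocks is physically overwritten rather than
merely hidden, which is exactly what the proposed write-only block
state would have provided for free.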
Re: Ext2 / VFS projects
On Thu, Feb 10, 2000 at 03:04:53AM +0100, Jeremy Fitzhardinge wrote:
> On 09-Feb-00 Andi Kleen wrote:
> > On Wed, Feb 09, 2000 at 05:35:00PM +0100, Matthew Wilcox wrote:
> > > [...]
> > How about secure deletion? 1.3 used to have some simple minded
> > overwriting of deleted data when the 's' attribute was set. That
> > got lost with 2.0+. Secure disk overwriting that is immune to
> > manual surface probing seems to take a lot more effort (Colin
> > Plumb's sterilize does 25 overwrites with varying bit patterns).
> > Such a complicated procedure is probably better kept in user space.
> > What I would like is some way to have a sterilize daemon running,
> > and when a 's' file gets deleted the VFS would open a new file
> > descriptor for it, pass it to sterilized (sterild?) using a unix
> > control message and let it do its job. What does the audience
> > think? Should such a facility have kernel support or not?
> I think secure deletion is an interesting topic and it would be nice
> if Linux supported it better. You have to be careful that you don't
> leak the file you're trying to eradicate into the swap via the
> sterilize daemon. I guess simply never reading the file is a good
> start.

sterilize does that. You have of course to be careful that you didn't
leak its contents to swap before (one way around that is encrypted
swap).

> The other question is whether you're talking about an ext2-specific
> thing, or whether it's a general service all filesystems provide.

I was actually only thinking about ext2 (because only it has a 's' bit
and the thread is about ext2's future).

> Many filesystem designs, including ext3 w/ journalling, reiserfs(?)
> and the NetApp WAFL filesystem, don't let a process overwrite an
> existing block on disk. Well, ext3 does, but only via the journal;
> WAFL never does. There's also the question of what happens when you
> have a RAID device under the filesystem, especially with
> hot-swappable disks.

reiserfs lets you when you don't change the file size (if you do, it
is possible that the file is migrated from a formatted node to an
unformatted node). sterilize does not change file sizes. ext3 only
doesn't let you when you do data journaling (good point, I forgot
that).

RAID0/RAID1 are no problem I think, because you always have well
defined block(s) to write to. The wipe data does not depend on the old
data on the disk, so e.g. on a simple mirrored configuration both
blocks would be sterilized in parallel. RAID5 devices could be a
problem, especially when they do data journaling (I think most only
journal some metadata). It is not clear how the sterilize algorithms
interact with the XORed blocks. If you swap your disks in between, you
lose.

> Perhaps a better approach, since we're talking about a privileged
> process, is to get a list of raw blocks and go directly to the disk.
> You'd have to be very careful to synchronize with the filesystem...

Not too much. The file still exists, but there are no references to it
outside sterild. No other process can access it. Assuming the file
system does not have fragments, the raw IO has block granularity, and
the file was fdatasync'ed before, you could directly access it without
worrying about any file system interference. If the fs has fragments
you need the infrastructure needed for O_DIRECT (I think that is
planned anyways).

With an "invalidate all dirty buffers for file X" call you could
optimize part of the fdatasync writes away, but a good sterilize needs
so many writes anyways (25+) that it probably does not make much
difference. The data would only be really deleted when the system is
turned off, because it could partly still exist in some not yet reused
buffers.

-Andi
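The in-place overwrite loop being discussed can be sketched as
follows. The pass patterns here are placeholder stand-ins (the real
sterilize uses 25+ carefully chosen patterns); the two properties the
thread cares about are that the file size never changes, and that each
pass is forced to disk with fsync before the next one starts:

```python
import os
import tempfile

# Simplified stand-ins for sterilize's 25+ overwrite patterns.
PASSES = [b"\x55", b"\xaa", b"\x00"]

def sterilize(path):
    size = os.stat(path).st_size
    # "r+b" overwrites in place and never changes the size -- the
    # property that lets reiserfs keep the file in its on-disk node.
    with open(path, "r+b") as f:
        for pattern in PASSES:
            f.seek(0)
            f.write(pattern * size)
            f.flush()
            os.fsync(f.fileno())   # force this pass to the disk
                                   # before starting the next one

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"secret data")

sterilize(path)
with open(path, "rb") as f:
    final = f.read()
os.unlink(path)
```

Note that, as Andi says at the end, fsync only guarantees the blocks
were written, not that stale copies no longer exist in not yet reused
buffers -- a userspace sketch like this cannot address that part.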
Re: file system size limits
On Thu, Jan 06, 2000 at 04:03:38PM +0100, Manfred Spraul wrote:
> What's the current limit for ext2 filesystems, and what happens if a
> user creates a larger disk? I think we should document the current
> limits, and refuse to mount larger disks. I guess the current limit
> is somewhere around 1000 GB (512*INT_MAX)?

The limit is more ~2TB with 1K blocks and ~8TB with 4K blocks.

> I'm posting this question because I've already seen a message that
> someone uses a 500 GB ext2 fs, and because (IIRC) certain versions of
> the Norton Commander silently corrupted disks > 2GB on the Macintosh
> when Apple removed the 2 GB limit a few years ago.

Some versions of fsck compiled with the wrong llseek() did that too.

> The Linux filesystems/utilities [kernel, fsck,...] should avoid
> similar problems.

I think it is not a problem currently, because both 2TB and 8TB would
take several days of fsck, which makes them impractical.

-Andi
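The two figures follow from 2^31 addressable blocks (a signed 32-bit
block number) times the block size -- a quick arithmetic check, on the
assumption that the block number width is the limiting factor here:

```python
# ~2TB with 1K blocks, ~8TB with 4K blocks: 2^31 blocks * block size.
blocks = 2 ** 31

limit_1k = blocks * 1024   # 1K block size
limit_4k = blocks * 4096   # 4K block size

TB = 1024 ** 4
print(limit_1k // TB)   # 2
print(limit_4k // TB)   # 8
```

Manfred's 512*INT_MAX guess is the same kind of calculation done with
512-byte sectors instead of filesystem blocks, which is why it comes
out a factor of two to eight lower.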
Re: Ext2 defragmentation
On Mon, Nov 15, 1999 at 03:00:20PM +0100, Pavel Machek wrote:
> > > Hi!
> > > > > How necessary is it to defragment one's ext2 partitions? It
> > > > > just hit me that defragmentation is very important under the
> > > > > Wintendo filesystem.
> > > > It's not as important. But... I had an idea for an ext2 defrag
> > > > daemon, e2defragd, which would take advantage of _disk_ idle
> > > > time to reorganize blocks, while the filesystem was mounted.
> > > > This daemon would be a good candidate for disk optimizations
> > > > like moving frequently-accessed files to the middle of the disk
> > > > in addition to background defragging.
> > > There's one useful thing that could be done with e2defrag:
> > > putting directories at the beginning of the disk exactly in the
> > > order "find /" would use. One line hack, but e2defrag just does
> > > not work for me.
> > > 						Pavel
> > Isn't it better to simply use locate / updatedb instead?
> No. There are other operations (such as "du -s .", search from
> midnight) which have a find-like access pattern. And you have no
> chance of getting out of date.

It just sounds silly to optimize the disk layout for such specific
cases. Maybe if you're only running "du -s" and "find /" all day, but
somehow I doubt that.

-Andi

--
This is like TV. I don't like TV.