Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> > No specific spec, just general quality of implementation.
>
> I completely agree. If one thread writes A and another writes B then the kernel should record either A or B, not ((A & 0x...) | (B & 0x...))

Agree entirely: the spec doesn't allow for random scribbling in the wrong place. It doesn't cover which goes first or who wins the race but provides pwrite/pread for that situation. Writing somewhere unrelated is definitely not to spec and not good.

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
On Mon, 28 Jan 2008 15:10:34 +0100 Andi Kleen [EMAIL PROTECTED] wrote:

> On Monday 28 January 2008 14:38:57 Alan Cox wrote:
> > > Also worse really fixing it would be a major change to the VFS because of the way ->read/write are defined :/
> >
> > I don't see a problem there. ->read and ->write update the passed pointer which is not the real f_pos anyway. Just the copies need fixing.
>
> They are effectually doing a decoupled read/modify/write cycle. e.g.:
>
> A               B
> read fpos       read fpos
> fpos += A       fpos += B
> write fpos      write fpos
>
> So you get overlapping reads. Probably not good.

No unix system I'm aware of cares about the read/write positioning during parallel simultaneous reads or writes, with the exception of O_APPEND which is strictly defined. The problem case is getting fpos != either valid value.
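The interleaving in the quoted table can be replayed deterministically in user space. The following is only a minimal sketch: `struct fake_file` and `replay_race()` are hypothetical names invented for illustration and are not kernel code. It shows that both callers start from the same offset (the overlapping reads) and that the second write-back discards the first update, while the final value is still one of the two valid results.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared file position (not the kernel's struct file). */
struct fake_file {
    int64_t f_pos;
};

/*
 * Replay the interleaving from the message: both callers read f_pos,
 * each adds its byte count to a private copy, then both write back.
 * Returns the final f_pos; the start offset each caller used is
 * stored through start_a and start_b.
 */
static int64_t replay_race(struct fake_file *f, int64_t len_a, int64_t len_b,
                           int64_t *start_a, int64_t *start_b)
{
    assert(len_a >= 0 && len_b >= 0);

    int64_t pos_a = f->f_pos;   /* A: read fpos */
    int64_t pos_b = f->f_pos;   /* B: read fpos */

    *start_a = pos_a;
    *start_b = pos_b;

    pos_a += len_a;             /* A: fpos += A */
    pos_b += len_b;             /* B: fpos += B */

    f->f_pos = pos_a;           /* A: write fpos */
    f->f_pos = pos_b;           /* B: write fpos, overwriting A's update */
    return f->f_pos;
}
```

Callers that need well-defined positions under concurrency can pass explicit offsets with pread()/pwrite() instead of sharing f_pos, as noted earlier in the thread.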
Re: [PATCH] [8/18] BKL-removal: Remove BKL from remote_llseek
> Also worse really fixing it would be a major change to the VFS because of the way ->read/write are defined :/

I don't see a problem there. ->read and ->write update the passed pointer which is not the real f_pos anyway. Just the copies need fixing.

Alan
Re: [RFC] Parallelize IO for e2fsck
> I'd tried to advocate SIGDANGER some years ago as well, but none of the kernel maintainers were interested. It definitely makes sense to have some sort of mechanism like this. At the time I first brought it up it was in conjunction with Netscape using too much cache on some system, but it would be just as useful for all kinds of other memory-hungry applications.

There is an early thread for a /proc file which you can add to your poll() set and it will wake people when memory is low. Very elegant and if async support is added it will also give you the signal variant for free.

Alan
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
> Writeback cache on disk in itself is not bad, it only gets bad if the disk is not engineered to save all its dirty cache on power loss, using the disk motor as a generator or alternatively a small battery. It would be awfully nice to know which brands fail here, if any, because writeback cache is a big performance booster.

AFAIK no drive saves the cache. The worst case cache flush for drives is several seconds with no retries and a couple of minutes if something really bad happens.

This is why the kernel has some knowledge of barriers and uses them to issue flushes when needed.
Re: [RFD] Incremental fsck
> What are ext3 expectations of disk (is there doc somewhere)? For example... if disk does not lie, but powerfail during write damages the sector -- is ext3 still going to work properly?

Nope. However the few disks that did this rapidly got firmware updates because there are other OS's that can't cope.

> If disk does not lie, but powerfail during write may cause random numbers to be returned on read -- can fsck handle that?

Most of the time. And fsck knows about writing sectors to remove read errors in metadata blocks.

> What about disk that kills 5 sectors around sector being written during powerfail; can ext3 survive that?

Generally. Note btw that for added fun there is nothing that guarantees the blocks around a block on the media are sequentially numbered. They usually are but you never know.

Alan
Re: [PATCH, RESEND] locks: fix possible infinite loop in posix deadlock detection
On Tue, 30 Oct 2007 11:20:02 -0400 J. Bruce Fields [EMAIL PROTECTED] wrote:

> From: J. Bruce Fields [EMAIL PROTECTED]
>
> It's currently possible to send posix_locks_deadlock() into an infinite loop (under the BKL).
>
> For now, fix this just by bailing out after a few iterations. We may want to fix this in a way that better clarifies the semantics of deadlock detection. But that will take more time, and this minimal fix is probably adequate for any realistic scenario, and is simple enough to be appropriate for applying to stable kernels now.
>
> Thanks to George Davis for reporting the problem.
>
> Cc: George G. Davis [EMAIL PROTECTED]
> Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]

Acked-by: Alan Cox [EMAIL PROTECTED]

It's a good fix for now and I doubt any real world user has that complex a locking pattern to break.
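The "bail out after a few iterations" idea can be sketched in user space as a bounded walk over a wait-for chain. This is only an illustration under stated assumptions: `would_deadlock()`, the `waits_on` array and the iteration cap of 10 are invented for this sketch and are not taken from the patch itself. The walk follows who each blocked owner is waiting on and reports a deadlock only if it gets back to the caller; a cycle that does not include the caller simply exhausts the cap instead of looping forever.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_WAIT (-1)
#define MAX_DEADLK_ITERATIONS 10   /* assumed cap, chosen for illustration */

/*
 * waits_on[t] is the owner that owner t is blocked on, or NO_WAIT.
 * Returns true if letting `caller` block on `blocker` would close a
 * cycle back to `caller`. Gives up (reports no deadlock) after
 * MAX_DEADLK_ITERATIONS steps, so a cycle that does not include
 * `caller` cannot spin forever.
 */
static bool would_deadlock(const int *waits_on, int n, int caller, int blocker)
{
    int cur = blocker;

    for (int i = 0; i < MAX_DEADLK_ITERATIONS; i++) {
        assert(cur >= 0 && cur < n);
        if (cur == caller)
            return true;            /* chain leads back to the caller */
        cur = waits_on[cur];
        if (cur == NO_WAIT)
            return false;           /* chain ends at a runnable owner */
    }
    return false;                   /* cap reached: treat as no deadlock */
}
```

The trade-off is that a very long genuine chain would be missed once the cap is hit, which matches the "adequate for any realistic scenario" framing in the patch description.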
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007 13:43:21 -0400 J. Bruce Fields [EMAIL PROTECTED] wrote:

> From: J. Bruce Fields [EMAIL PROTECTED]
>
> We currently attempt to return -EDEADLK to blocking fcntl() file locking requests that would create a cycle in the graph of tasks waiting on locks.
>
> This is inefficient: in the general case it requires us determining whether we're adding a cycle to an arbitrary directed acyclic graph. And this calculation has to be performed while holding a lock (currently the BKL) that prevents that graph from changing.
>
> It has historically been a source of bugs; most recently it was noticed that it could loop indefinitely while holding the BKL.
>
> It seems unlikely to be useful to applications:
>  - The difficulty of implementation has kept standards from requiring it. (E.g. SUSv3: "Since implementation of full deadlock detection is not always feasible, the [EDEADLK] error was made optional.") So portable applications may not be able to depend on it.
>  - It only detects deadlocks that involve nothing but local posix file locks; deadlocks involving network filesystems or other kinds of locks or resources are missed.
>
> It therefore seems best to remove deadlock detection.
>
> Signed-off-by: J. Bruce Fields [EMAIL PROTECTED]

NAK. This is an ABI change and one that was rejected before when this was last discussed in detail. Moving it out of BKL makes a ton of sense, even adding a "don't check" flag makes a lot of sense. Removing the checking does not.

I'd much rather see

    if (flags & FL_NODLCHECK)
        posix_deadlock_detect()

The failure case for removing this feature is obscure and hard to debug application hangs for the afflicted programs - not nice for users at all.

Alan
Re: [RFC, PATCH] locks: remove posix deadlock detection
> And if posix file locks are to be useful to threaded applications, then we have to preserve the same no-false-positives requirement for them as well.

It isn't useful to threaded applications. The specification requires this. Which is another reason for having an additional Linux (for now) flag to say "don't bother".

Alan
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007 12:27:32 -0600 Matthew Wilcox [EMAIL PROTECTED] wrote:

> On Sun, Oct 28, 2007 at 01:43:21PM -0400, J. Bruce Fields wrote:
> > We currently attempt to return -EDEADLK to blocking fcntl() file locking requests that would create a cycle in the graph of tasks waiting on locks.
> >
> > This is inefficient: in the general case it requires us determining whether we're adding a cycle to an arbitrary directed acyclic graph. And this calculation has to be performed while holding a lock (currently the BKL) that prevents that graph from changing.
> >
> > It has historically been a source of bugs; most recently it was noticed that it could loop indefinitely while holding the BKL.
>
> It can also return -EDEADLK spuriously. So yeah, just kill it.

NAK. This is an ABI change. It was also comprehensively rejected before because

- EDEADLK behaviour is ABI
- EDEADLK behaviour is required by SuSv3
- We have no idea what applications may rely on this behaviour.

and also SuSv3 is required by LSB

See the thread http://osdir.com/ml/file-systems/2004-06/msg00017.html

so we need to fix the bugs - the lock usage and the looping. At that point it merely becomes a performance concern to those who use it, which is the proper behaviour. If you want a faster non-checking one use flock(), or add another flag that is a Linux "don't check for deadlock".

Alan
Re: [RFC, PATCH] locks: remove posix deadlock detection
> > - EDEADLK behaviour is ABI
>
> Not in any meaningful way.

I've seen SYS5 software that relies on it so we should be careful. Again see the 2004 discussion where the conclusion was that EDEADLK should stay.

> > - EDEADLK behaviour is required by SuSv3
>
> What SuSv3 actually says is:
>
>   If the system detects that sleeping until a locked region is unlocked would cause a deadlock, fcntl() shall fail with an [EDEADLK] error.
>
> It doesn't require the system to detect it, only mandate what to return if it does detect it.

We should be detecting at least the obvious case.

> > - We have no idea what applications may rely on this behaviour.
>
> I've never heard of one that does.

Very scientific. I have on SYS5 though not afaik Linux.

> > so we need to fix the bugs - the lock usage and the looping. At that point it merely becomes a performance concern to those who use it, which is the proper behaviour. If you want a faster non-checking one use flock(), or add another flag that is a Linux "don't check for deadlock"
>
> You can't fix the false EDEADLK detection without solving the halting problem. Best of luck with that.

A good question to ask here would be what subset of deadlock loops on flock does SYS5 Unix error. I also don't see why you need to solve the halting problem.

If SYSV only spots simple AB - BA deadlocks or taking the same lock twice yourself then that ought to be sufficient for us too.
Re: [RFC, PATCH] locks: remove posix deadlock detection
> Bzzt. You get a false deadlock with multiple threads like so:
>
> Thread A of task B takes lock 1
> Thread C of task D takes lock 2
> Thread C of task D blocks on lock 1
> Thread E of task B blocks on lock 2

The spec and SYSV certainly ignore threading in this situation and you know that perfectly well (or did in 2004).
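The four-step scenario quoted above can be modelled with a wait-for graph keyed by process rather than by thread, which is how the standard describes these locks according to the reply. The sketch below is an assumption-laden illustration: `struct state`, `process_level_deadlock()` and the enum names are hypothetical and do not correspond to kernel structures. With ownership tracked per process, the last step closes a cycle B -> D -> B and is reported as a deadlock, even though thread A of task B is not blocked and could still release lock 1.

```c
#include <assert.h>
#include <stdbool.h>

enum { TASK_B, TASK_D, NTASKS };
enum { LOCK_1, LOCK_2, NLOCKS };
#define NONE (-1)

/* Hypothetical process-level view: threads are not represented at all. */
struct state {
    int lock_owner[NLOCKS];   /* owning process of each lock, or NONE */
    int waiting_for[NTASKS];  /* lock a process is blocked on, or NONE */
};

/* True if `task` blocking on `lock` closes a cycle in the process-level graph. */
static bool process_level_deadlock(const struct state *s, int task, int lock)
{
    assert(lock >= 0 && lock < NLOCKS);

    int steps = 0;
    int owner = s->lock_owner[lock];

    while (owner != NONE && steps++ < NTASKS) {
        if (owner == task)
            return true;                  /* chain returned to the requester */
        int next = s->waiting_for[owner];
        if (next == NONE)
            return false;                 /* owner is runnable, no cycle */
        owner = s->lock_owner[next];
    }
    return false;
}
```

Whether that report counts as a false positive depends on whether threads are considered at all, which is the point being argued in the thread.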
Re: [RFC, PATCH] locks: remove posix deadlock detection
> > The spec and SYSV certainly ignore threading in this situation and you know that perfectly well (or did in 2004)
>
> The discussion petered out (or that mailing list archive lost articles from the thread) without any kind of resolution, or indeed interest.

I think the resolution was that the EDEADLK stayed.

> What is your suggestion for handling this problem? As it is now, the kernel 'detects' deadlock where there is none, which doesn't seem allowed by SuSv3 either

Re-read the spec. The EDEADLK doesn't account for threads, only processes.

Alan
Re: [RFC, PATCH] locks: remove posix deadlock detection
On Sun, 28 Oct 2007 17:38:14 -0600 Matthew Wilcox [EMAIL PROTECTED] wrote:

> On Sun, Oct 28, 2007 at 09:38:55PM +, Alan Cox wrote:
> > > It doesn't require the system to detect it, only mandate what to return if it does detect it.
> >
> > We should be detecting at least the obvious case.
>
> What is the obvious case? A task that has never called clone()?

Simple AB BA I would have thought obvious. Clone as has been said several times now is irrelevant as the standard is about *processes* [in the SuS sense].
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
On Thu, 25 Oct 2007 16:06:40 -0700 Mike Waychison [EMAIL PROTECTED] wrote:

> Remove the need for having CAP_SYS_RAWIO when doing a FIBMAP call on an open file descriptor.
>
> It would be nice to allow users to have permission to see where their data is landing on disk, and there really isn't a good reason to keep them from getting at this information.

Historically this was done because people felt it was more secure. It also allows you to make some deductions about other activities on the disk but that's probably only a concern for very very security crazed compartmentalised boxes.

Also historically at least FIBMAP could be abused to crash the system. Now if you can verify that has been fixed I have no problem, but given that I can find no record of that being fixed it would be wise to audit it first and review Chris Evans and other reports about what occurs when FIBMAP is passed random block numbers.

FIBMAP has another problem for this general use as well - it takes an int but the block number can now be bigger for very large files on 32bit.

Alan
Re: [patch 1/1] Drop CAP_SYS_RAWIO requirement for FIBMAP
> I found Chris's comment about negative block numbers, I'll send a patch out for that. You mentioned back in 99 about racing with ftruncate. Is it sufficient to mutex_lock(i_mutex) and down_read(i_alloc_sem)?

One for the fs guys. That code has changed far beyond anything I understand any more 8)
Re: [patch 1/2] getattr - fill the size of pipes
> Cute feature, but it is (I assume) a Linux-specific extension and is something which we'll need to maintain for ever and it invites

Actually it used to work on the old old Linux pipe code.

> unportability to older Linuxes and other OSes and it introduces some risk of breakage of existing applications. And it slows down fstat on a pipe.

Most Sys5 based boxes happen to put the right value there but not everyone and it's not guaranteed in the slightest.

> Given that the info can already be obtained via ioctl(FIONREAD) anyway, I don't think that (gain > pain)?

Nor me - any application trying to reduce the syscall count would just do a very large read and get the data and size in one go.
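For reference, the ioctl(FIONREAD) route mentioned in the quoted text looks roughly like this from user space. This is a minimal sketch; `bytes_pending()` is a helper name made up here, while `ioctl()` with `FIONREAD` is the existing interface the message refers to.

```c
#include <assert.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns the number of unread bytes queued on fd, or -1 on error. */
static int bytes_pending(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

As the reply points out, a program that only wants fewer system calls can skip the query and issue one large read(), which returns both the data and its length.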
[PATCH] fs: Correct SuS compliance for open of large file without options
The early LFS work that Linux uses favours EFBIG in various places. SuSv3 specifically uses EOVERFLOW for this as noted by Michael (Bug 7253)

--
[EOVERFLOW]
    The named file is a regular file and the size of the file cannot be represented correctly in an object of type off_t.

We should therefore transition to the proper error return code

Signed-off-by: Alan Cox [EMAIL PROTECTED]

diff -u --new-file --exclude-from /usr/src/exclude --recursive linux.vanilla-2.6.23rc8-mm1/fs/gfs2/ops_file.c linux-2.6.23rc8-mm1/fs/gfs2/ops_file.c
--- linux.vanilla-2.6.23rc8-mm1/fs/gfs2/ops_file.c 2007-09-26 16:46:54.0 +0100
+++ linux-2.6.23rc8-mm1/fs/gfs2/ops_file.c 2007-09-27 13:45:48.0 +0100
@@ -406,7 +406,7 @@
         if (!(file->f_flags & O_LARGEFILE) &&
             ip->i_di.di_size > MAX_NON_LFS) {
-            error = -EFBIG;
+            error = -EOVERFLOW;
             goto fail_gunlock;
         }
diff -u --new-file --exclude-from /usr/src/exclude --recursive linux.vanilla-2.6.23rc8-mm1/fs/ntfs/file.c linux-2.6.23rc8-mm1/fs/ntfs/file.c
--- linux.vanilla-2.6.23rc8-mm1/fs/ntfs/file.c 2007-09-26 16:46:55.0 +0100
+++ linux-2.6.23rc8-mm1/fs/ntfs/file.c 2007-09-27 13:47:35.0 +0100
@@ -61,7 +61,7 @@
 {
     if (sizeof(unsigned long) < 8) {
         if (i_size_read(vi) > MAX_LFS_FILESIZE)
-            return -EFBIG;
+            return -EOVERFLOW;
     }
     return generic_file_open(vi, filp);
 }
diff -u --new-file --exclude-from /usr/src/exclude --recursive linux.vanilla-2.6.23rc8-mm1/fs/open.c linux-2.6.23rc8-mm1/fs/open.c
--- linux.vanilla-2.6.23rc8-mm1/fs/open.c 2007-09-26 16:46:55.0 +0100
+++ linux-2.6.23rc8-mm1/fs/open.c 2007-09-27 13:45:10.0 +0100
@@ -1210,7 +1210,7 @@
 int generic_file_open(struct inode * inode, struct file * filp)
 {
     if (!(filp->f_flags & O_LARGEFILE) && i_size_read(inode) > MAX_NON_LFS)
-        return -EFBIG;
+        return -EOVERFLOW;
     return 0;
 }
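The check being changed in generic_file_open() can be mirrored in a small user-space function to make the resulting behaviour concrete. This is a sketch only: `sketch_file_open()`, `SKETCH_O_LARGEFILE` and `SKETCH_MAX_NON_LFS` are hypothetical stand-ins defined here (the real flag value differs by architecture), with the limit taken as the largest size a signed 32-bit off_t can represent.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_O_LARGEFILE 0x1u                         /* hypothetical flag bit for this sketch */
#define SKETCH_MAX_NON_LFS ((INT64_C(1) << 31) - 1)     /* largest size a 32-bit off_t can hold */

/*
 * Mirrors the shape of the patched check: opening a file whose size
 * does not fit in a 32-bit off_t, without the large-file flag, fails
 * with -EOVERFLOW rather than -EFBIG.
 */
static int sketch_file_open(unsigned int flags, int64_t size)
{
    assert(size >= 0);
    if (!(flags & SKETCH_O_LARGEFILE) && size > SKETCH_MAX_NON_LFS)
        return -EOVERFLOW;
    return 0;
}
```

A file of exactly SKETCH_MAX_NON_LFS bytes still opens without the flag; only sizes beyond that limit are refused.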
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
On Thu, 27 Sep 2007 07:01:18 -0700 Arjan van de Ven [EMAIL PROTECTED] wrote:

> On Thu, 27 Sep 2007 14:29:19 +0100 Alan Cox [EMAIL PROTECTED] wrote:
> > The early LFS work that Linux uses favours EFBIG in various places. SuSv3 specifically uses EOVERFLOW for this as noted by Michael (Bug 7253)
>
> isn't this an ABI change?

It's a change of a specific error return from the wrong error to the right one, nothing more. Fixing the returned error gives us correct behaviour according to the standards and other systems.

Alan
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
> > It's a change of a specific error return from the wrong error to the right one, nothing more. Fixing the returned error gives us correct behaviour according to the standards and other systems.
>
> It may still break applications. Waving some standard at them if they complain is unlikely to impress them.

And our existing behaviour may well break correctly written portable applications, and is incorrect as well. Testing so far says it doesn't break anything, which is no surprise if you apply about ten braincells to the case under consideration.

Alan
Re: [PATCH] fs: Correct SuS compliance for open of large file without options
> Well it's not my call, just seems like a really bad idea to change the error value. You can't claim full coverage for such testing anyway, it's one of those things that people will complain about two releases later saying it broke app foo.

Strange since we've spent years changing error values and getting them right in the past.

There are real things to worry about - sysfs, sysfs, sysfs, ... and all the other crap which is continually breaking stuff, not spec compliance corrections that don't break things but move us into compliance with the standard.

Alan
Re: [patch 1/2] VFS: new fgetattr() file operation
> But it has various drawbacks, like rmdir doesn't work if there are open files within an otherwise empty directory.
>
> I'd happily accept suggestions on how to deal with this differently.

NFS has that problem because it really has to sillyrename into the same directory. I don't see that ssh/sftp needs to do that. Instead it can sillyrename anywhere in the filesystem.

Alan
Re: [AppArmor 00/44] AppArmor security module overview
Anyone can apply the apparmour patch to their tree, they get the choice that way. Nobody is currently prevented from using apparmour if they want to, any such suggestion is pure rubbish.

The exact same argument was made prior to SELinux going upstream.

It's made for everything before it goes upstream. It shouldn't be going upstream until it works, is reliable and does something useful. Then if it ever makes that grade it can go and sit in -mm for a bit to shake down.

Frankly I think AppArmour is a joke,

SELinux, AppArmor, and Hilary Clinton walk into a bar ...

SELinux orders a beer object
AppArmor orders a /beer
Hilary says "You are both under 21 you can't"
SELinux orders a shandy object
AppArmor orders a /shandy
SELinux is refused because the shandy mixer opened a beer object and shandy inherited beer typing
AppArmor gets drunk because /shandy and /beer are clearly different
Re: Versioning file system
> http://www.wipo.int/pctdb/en/fetch.jsp?LANG=ENGDBSELECT=PCTSERVER_TYPE=19SORT=1211506-KEYTYPE_FIELD=256IDB=0IDOC=1205953C=10ELEMENT_SET=IA,WO,TTL-ENRESULT=1TOTAL=3START=1DISP=25FORM=SEP-0/HITNUM,B-ENG,DP,MC,PA,ABSUM-ENGSEARCH_IA=US2005045566QUERY=%28IN%2fmerkey%29+
>
> The last one was filed with WIPO and has international protection, UK included.

Nope. EU and UK law does not recognize software as patentable. See the caselaw.
Re: Versioning file system
> (Vax/VMS System Software Handbook)
> (TOPS-20 User's Manual)

Also Files/11

Basic versioning goes back to at least ITS

Not sure how old doing file versioning and hiding it away with a tool to go rescue the stuff you blew away by mistake is, but Novell Netware 3 certainly did a good job on that one.
Re: [AppArmor 01/41] Pass struct vfsmount to the inode_create LSM hook
> As such, AA can detect whether you did exec(gzip) or exec(gunzip) and apply the policy relevant to the program. It could apply different

That's not actually useful for programs which link the same binary to multiple names because if you don't consider argv[0] as well I can run /usr/bin/gzip passing argv[0] of gunzip and get one set of policies and the other set of behaviour.

And then we have user added hardlinks of course.
Re: [RFC][PATCH] ChunkFS: fs fission for faster fsck
> Preventive measures are taken to limit only one continuation inode per file per chunk. This can be done easily in the chunk allocation algorithm for disk space. Although I'm not quite sure what you mean by

How are you handling the allocation in this situation, are you assuming that a chunk is out of bounds because part of a file already lives on it or simply keeping a single inode per chunk which has multiple sparse pieces of the file on it?

ie if I write 0-8MB to chunk A and then 8-16 to chunk B can I write 16-24MB to chunk A producing a single inode of 0-8 16-24, or does it have to find another chunk to use?
Re: [d_path 1/7] Fix __d_path() for lazy unmounts and make it unambiguous
On Fri, 20 Apr 2007 01:23:04 +0200 Andreas Gruenbacher [EMAIL PROTECTED] wrote:

> First, when __d_path() hits a lazily unmounted mount point, it tries to prepend the name of the lazily unmounted dentry to the path name. It gets this wrong, and also overwrites the slash that separates the name from the following pathname component. This patch fixes that; if a process was in directory /foo/bar and /foo got lazily unmounted, the old result was ``foobar'' (note the missing slash), while the new result with this patch is ``foo/bar''.

ACK the fix of ``foobar'' in the example described above.

> Subsequent patches propose to make getcwd() fail instead of reporting unreachable paths like this one and hide unreachable mount points from /proc/mounts.

NAK that change of behaviour on the following patches.
Re: [d_path 6/7] Filter out disconnected paths from /proc/mounts
> There is some disagreement what /proc/mounts should include. Currently it reports all mounts from the current namespace and doesn't include lazy unmounts. This leads to ambiguities with the rootfs (which is an internal mount irrelevant to user-space except in the initrd), and in chroots.
>
> With this and the next patch, /proc/mounts only reports the mounts reachable for the current process, which makes a lot more sense IMO. If the current process is rooted in the namespace root (which it usually is), it will see all mounts except for the rootfs.
>
> Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]

This change in behaviour appears to be fine for glibc (except when trying to find the name of a file from a namespace we are not in, which wouldn't have come out right before either)

Acked-by: Alan Cox [EMAIL PROTECTED]

(but still NAK on the getcwd change)
Re: [AppArmor 31/41] Fix __d_path() for lazy unmounts and make it unambiguous; exclude unreachable mount points from /proc/mounts
> > That is a fairly significant and sudden change to the existing kernel/user interface.
>
> Well, this is not meant for 2.6.21. I hope it is possible to change it in early 2.6.22; otherwise if we can't fix mistakes from the past we are pretty doomed.

I don't believe the existing behaviour _IS_ a mistake.

> > This is untrue. The process can get there (via fd passing with another task)
>
> Process can access file descriptors which are unreachable via path name just fine indeed, but those fds still don't have a valid path in the context of that process.

Which while problematic to your name based security is just fine to everything else.

> We are only talking about mount points unreachable by a particular process; this does not mean that the mount point isn't reachable by other processes. Human operators can choose the context from which they are looking at /proc/mounts. If they are looking from the real root, they will see all mounts that any process can reach (in that namespace).

Ok, providing the real root sees them all it isn't so bad, but to assume you can filter based upon what the task can see is dodgy as an assumption.
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
> don't actually have to care --- if loading an invalid profile can bring down the system, then that's no worse than an arbitrary module that crashes the machine. Not sure if there will ever be user loadable profiles; at least at that point we had to care.

CAP_SYS_RAWIO is needed to do arbitrary patching/loading in the capability model so if you are using lesser capabilities it is a (minor) capability rise but not a big problem, just ugly and wanting a fix.

> > > + /*
> > > + * Replacement needs to allocate a new aa_task_context for each
> > > + * task confined by old_profile. To do this the profile locks
> > > + * are only held when the actual switch is done per task. While
> > > + * looping to allocate a new aa_task_context the old_task list
> > > + * may get shorter if tasks exit/change their profile but will
> > > + * not get longer as new task will not use old_profile detecting
> > > + * that is stale.
> > > + */
> > > + do {
> > > + new_cxt = aa_alloc_task_context(GFP_KERNEL | __GFP_NOFAIL);
> >
> > NOFAIL is usually a bad sign. It should be only used if there is no alternative.
>
> At this point there is no secure alternative to allocating a task context --- except killing the task, maybe.

Can you count the number needed, preallocate them and then when you know for sure either succeed or fail the operation as a whole?
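The "count, preallocate, then succeed or fail as a whole" suggestion can be sketched in user space as follows. Everything here is hypothetical and for illustration only: `struct ctx` stands in for aa_task_context, `prealloc_contexts()` is an invented helper, and the injectable allocator with its `fail_after` hook exists just so the failure path can be exercised. It is not a claim about how AppArmor implements profile replacement.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct ctx { int unused; };   /* hypothetical stand-in for aa_task_context */

/*
 * Allocate `count` contexts up front. On any failure, free what was
 * allocated so far and return -ENOMEM, so the caller can fail the
 * whole replacement before touching any task.
 */
static int prealloc_contexts(struct ctx **out, size_t count,
                             void *(*alloc)(size_t))
{
    assert(out != NULL && alloc != NULL);

    for (size_t i = 0; i < count; i++) {
        out[i] = alloc(sizeof(struct ctx));
        if (!out[i]) {
            while (i > 0) {
                i--;
                free(out[i]);
                out[i] = NULL;
            }
            return -ENOMEM;
        }
    }
    return 0;
}

/* Test hook: allocation fails once this many calls have succeeded; -1 never fails. */
static int fail_after = -1;

static void *flaky_alloc(size_t sz)
{
    if (fail_after == 0)
        return NULL;
    if (fail_after > 0)
        fail_after--;
    return malloc(sz);
}
```

One caveat raised by the quoted code comment still applies to any such scheme: the number of tasks can shrink between counting and switching, so surplus contexts have to be freed afterwards.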
Re: [AppArmor 31/41] Fix __d_path() for lazy unmounts and make it unambiguous; exclude unreachable mount points from /proc/mounts
> Third, sys_getcwd() shouldn't return disconnected paths. The patch checks for that, and makes it fail with -ENOENT in that case

That is a fairly significant and sudden change to the existing kernel/user interface.

> Fourth, this now allows us to tell unreachable mount points from reachable ones when generating the /proc/mounts and /proc/$pid/mountstats files. Unreachable mount points are not interesting to processes (they can't get there, anyway), so we hide unreachable mounts. In particular, ordinary

This is untrue. The process can get there (via fd passing with another task) and the process can be producing output for the human operators, who most definitely need to know and see this stuff.

> Signed-off-by: Andreas Gruenbacher [EMAIL PROTECTED]
> Reviewed-by: NeilBrown [EMAIL PROTECTED]
> Signed-off-by: Andrew Morton [EMAIL PROTECTED]

I don't think this is fit to apply in current form. The hiding of mounts and mountstats is the wrong approach. The changes to getcwd behaviour bother me too as we are changing user space behaviour without warning.

The general idea of pushing some of the fail detect logic into __d_path() seems good though.

Alan
Re: [AppArmor 03/41] Remove redundant check from proc_sys_setattr()
On Thu, 12 Apr 2007 02:08:12 -0700 [EMAIL PROTECTED] wrote:

> notify_change() already calls security_inode_setattr() before calling iop->setattr.

This is a behaviour change on all of these and limits some behaviour of existing established security modules.

When inode_change_ok is called it has side effects. This includes clearing the SGID bit on attribute changes caused by chmod. If you make this change the results of some rulesets may be different before or after the change is made.

I'm not saying the change is wrong but it does change behaviour so that needs looking at closely (ditto all other attribute twiddles).

Alan
Re: [AppArmor 38/41] AppArmor: Module and LSM hooks
> +
> +    /**
> +     * parent can ptrace child when
> +     * - parent is unconfined
> +     * - parent is in complain mode
> +     * - parent and child are confined by the same profile
> +     */

Your profiles are name based. That means the same profile in a different namespace does different things. It would be a very odd case where it mattered, but surely the parent ptrace child rule should also require that the parent and child are in the same namespace when using apparmor name based security.

> +static int apparmor_capget(struct task_struct *task,
> +                           kernel_cap_t *effective,
> +                           kernel_cap_t *inheritable,
> +                           kernel_cap_t *permitted)
> +{
> +    return cap_capget(task, effective, inheritable, permitted);
> +}

Pointless function should go away.

> +static int apparmor_sysctl(struct ctl_table *table, int op)
> +{
> +    int error = 0;
> +
> +    if ((op & 002) && !capable(CAP_SYS_ADMIN))
> +        error = aa_reject_syscall(current, GFP_KERNEL,
> +                                  "sysctl (write)");
> +
> +    return error;

The usual file permission security override is DAC, not ADMIN. What is the logic of this choice?

> +}
> +
> +static int apparmor_syslog(int type)
> +{
> +    return cap_syslog(type);
> +}

More pointless functions to delete.
Re: [AppArmor 39/41] AppArmor: Profile loading and manipulation, pathname matching
> +    th.td_id = ntohs(*(u16 *) (blob));
> +    th.td_flags = ntohs(*(u16 *) (blob + 2));
> +    th.td_lolen = ntohl(*(u32 *) (blob + 8));

Use the cpu_to_* and *_to_cpu functions here so that the intended direction and endianness are clear.

> +
> +static inline int aa_inbounds(struct aa_ext *e, size_t size)
> +{
> +    return (e->pos + size <= e->end);
> +}

Can e->pos + size ever overflow? If so, this check isn't safe.

> + * aa_unpack_profile - unpack a serialized profile
> + * @e: serialized data extent information
> + * @error: error code returned if unpacking fails
> + */
> +static struct aa_profile *aa_unpack_profile(struct aa_ext *e)
> +{
> +    struct aa_profile *profile = NULL;
> +
> +    /* get optional subprofiles */
> +    if (aa_is_nameX(e, AA_LIST, "hats")) {
> +        while (!aa_is_nameX(e, AA_LISTEND, NULL)) {
> +            struct aa_profile *subprofile;
> +            subprofile = aa_unpack_profile(e);
> +            if (IS_ERR(subprofile)) {

What bounds the recursion here on invalid input?
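To make the overflow and byte-order points concrete, here is a minimal userspace sketch. It assumes a hypothetical `struct ext` that models only the `pos` and `end` fields of `struct aa_ext` (the real layout is not shown in the quoted patch), and the helper names are illustrative. The bounds check compares `size` against the bytes remaining rather than forming `pos + size`, so nothing can wrap. The readers assemble big-endian values byte by byte, which states the direction explicitly and avoids casting a possibly unaligned pointer; in kernel code the explicit conversions would be the be16_to_cpu()/be32_to_cpu() family.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct aa_ext; only pos and end are modelled. */
struct ext {
    const unsigned char *pos;   /* current read position */
    const unsigned char *end;   /* one past the last valid byte */
};

/*
 * Overflow-safe bounds check. Assumes pos <= end, so (end - pos) is the
 * number of bytes left and no pointer arithmetic on size can wrap.
 */
static bool ext_inbounds(const struct ext *e, size_t size)
{
    return size <= (size_t)(e->end - e->pos);
}

/* Read a 16-bit big-endian value from a byte buffer, any alignment. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Read a 32-bit big-endian value from a byte buffer, any alignment. */
static uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

A caller would check `ext_inbounds(e, 4)` before calling `read_be32(e->pos)`. A request for `SIZE_MAX` bytes is rejected here, whereas the `pos + size` form could wrap around and pass.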
Re: [AppArmor 37/41] AppArmor: Main Part
> + * aa_taskattr_access
> + * @name: name of the file to check
> + *
> + * Check if name matches /proc/self/attr/current, with self resolved
> + * to the current pid. This file is the usermode iterface for
> + * changing one's hat.
> + */
> +static inline int aa_taskattr_access(const char *name)
> +{
> +    unsigned long pid;
> +    char *end;
> +
> +    if (strncmp(name, "/proc/", 6) != 0)
> +        return 0;

The proc file system may not be mounted at /proc. There are environments where this is done for good reason (eg not wanting the /proc info exposed to a low trust environment). Another is when FUSE is providing an arbitrated proc, either by merging across clusters or by removing stuff.

> +static int aa_file_denied(struct aa_profile *profile, const char *name,
> +                          int mask)
> +{
> +    int perms;
> +
> +    /* Always allow write access to /proc/self/attr/current. */
> +    if (mask == MAY_WRITE && aa_taskattr_access(name))
> +        return 0;

Why can't this be done in the profile itself, to avoid kernel special case uglies and inflexibility?

> +    if (PTR_ERR(sa->name) == -ENOENT && (check & AA_CHECK_FD))
> +        denied_mask = 0;

Now there is an interesting question. Is PTR_ERR() safe for kernel pointers on all platforms or just for user ones?
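The PTR_ERR() question comes down to how error codes are packed into pointers. The sketch below is a userspace model of that encoding, not the kernel's own code: the `sketch_` names are hypothetical, and it assumes the convention that the top 4095 values of the address space are reserved for negative errno values. Under that assumption, decoding a pointer without first checking that it is an error is only harmless if no valid object can ever live in that top range, which is exactly the platform question being asked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095   /* largest errno value the encoding reserves */

/* Encode a negative errno as a pointer in the top 4095 addresses. */
static void *sketch_err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

/* Decode a pointer back to a long; only meaningful for error pointers. */
static long sketch_ptr_err(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

/* True if the pointer value falls in the reserved error range. */
static bool sketch_is_err(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}
```

With this model, a comparison such as `sketch_ptr_err(p) == -ENOENT` on a valid `p` simply fails to match, provided valid pointers never fall in the reserved range. Checking `sketch_is_err(p)` first removes the dependence on that assumption.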
Re: impact of 4k sector size on the IO & FS stack
> Now, if this disk was copied byte per byte (/bin/dd) to a 4096-based disk, and Linux would start using a sector size of 4096, then I would suddenly have

The ATA drives I'm aware of report a 512 byte sector size and do 512 byte I/Os, but use 4K physical sectors and, to get sane performance, expect the OS to issue sensibly sized I/O requests.
Re: impact of 4k sector size on the IO & FS stack
> First generation of 1K sector drives will continue to use the same 512-byte ATA sector size you are familiar with. A single 512-byte write will cause the drive to perform a read-modify-write cycle. This configuration is physical 1K sector, logical 512b sector.

The problem case is read-modify-screwup. At that point we've trashed the block we were writing (a well studied recovery case), and we've blasted some previously sane, totally unrelated sector of data out of existence.

That's why we ideally need to know whether they are doing the write to a different physical block when they do this, so that we don't lose the old data. My guess is they won't, as it'll be hard.

> A future configuration will change the logical ATA interface away from 512-byte sectors to 1K or 4K. Here, it is impossible to read a quantity smaller than 1K or 4K, whatever the sector size is.

That one I'm not worried about. Other than guessing how Redmond decides to make partition tables work, that one is mostly easy (it'll be fun to see how many controllers simply can't cope with the command formats).

Alan
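To show why a failed read-modify-write can damage unrelated data, here is a small sketch of the geometry involved. It assumes 512-byte logical sectors and a physical sector size that is a multiple of 512; `rmw_locate` and `struct rmw_span` are hypothetical names, and the sketch only does the arithmetic rather than modelling any real drive firmware.

```c
#include <stdint.h>

#define LOGICAL_SIZE 512u   /* logical sector size the drive reports */

struct rmw_span {
    uint64_t phys_index;    /* physical sector that must be rewritten */
    uint32_t offset;        /* byte offset of the logical sector inside it */
    uint32_t bystanders;    /* other logical sectors rewritten with it */
};

/*
 * Work out which physical sector a single logical-sector write lands in.
 * Assumes phys_size is a non-zero multiple of LOGICAL_SIZE.
 */
static struct rmw_span rmw_locate(uint64_t lba, uint32_t phys_size)
{
    uint32_t per_phys = phys_size / LOGICAL_SIZE;
    struct rmw_span s;

    s.phys_index = lba / per_phys;
    s.offset = (uint32_t)(lba % per_phys) * LOGICAL_SIZE;
    s.bystanders = per_phys - 1;
    return s;
}
```

For a 4K physical sector, every 512-byte write carries seven bystander sectors through the rewrite; for a 1K physical sector it carries one. If the write-back step fails in place, those bystanders are the "totally unrelated" data at risk.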
Re: impact of 4k sector size on the IO & FS stack
> For 1K/4K logical sector sizes, who knows. EFI? grins and runs Certainly seems incompatible with the current popular DOS partition format.

It's a bit messier than that. There are two interpretations of DOS partition formats found on 2K sector size magneto-opticals. One is that everything is the same as before (as if sectors were 512 bytes); the other is a different "everything is the same", which scales by the 2K sector size. The two are of course wonderfully incompatible.
Re: impact of 4k sector size on the IO & FS stack
> Are there other concerns in the IO or FS stack that we should bring up with vendors? I have been asked to summarize the impact of 4k sectors on linux for a disk vendor gathering and want to make sure that I put all of our linux specific items into that summary...

We need to make sure the physical sector size is correctly reported by the disk (eg in the ATA7 identify data), but I think for libata at least the right bits are already there, and we've got a fair amount of SCSI disk experience with other media sizes (eg 2K) already. 256 byte/sector media is still broken btw 8)

I would be interested to know what the disk vendors intend to use as their strategy when (with ATA) they have a 512 byte write from an older file system/setup into a 4K block. The case where errors magically appear in other parts of the fs when such an error occurs is not IMHO too well considered.

Alan
Re: GFS, what's remaining
On Tue, 2005-09-06 at 02:48 -0400, Daniel Phillips wrote:

> On Tuesday 06 September 2005 01:05, Dmitry Torokhov wrote:
> > do you think it is a bit premature to dismiss something even without ever seeing the code?
>
> You told me you are using a dlm for a single-node application, is there anything more I need to know?

That's standard practice for many non-Unix operating systems. It means your code supports failover without much additional work, and it provides all the functionality for locks on a single node too.
Re: [Linux-cluster] Re: GFS, what's remaining
On Sat, 2005-09-03 at 21:46 -0700, Andrew Morton wrote:

> Actually I think it's rather sick. Taking O_NONBLOCK and making it a lock-manager trylock because they're kinda-sorta-similar-sounding? Spare me. O_NONBLOCK means "open this file in nonblocking mode", not "attempt to acquire a clustered filesystem lock". Not even close.

The semantics of O_NONBLOCK on many other devices are trylock semantics. OSS audio has those semantics, for example, as do regular files in the presence of SYS5 mandatory locks. While the latter is "try lock, do operation, then drop lock", the drivers using O_NDELAY are very definitely providing trylock semantics.

I am curious, however, why a lock manager uses open to implement its locking semantics rather than using the locking API (POSIX locks etc).

Alan
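For comparison, this is roughly what a trylock looks like through the POSIX locking API mentioned above. It is a minimal sketch using fcntl() advisory record locks; `try_write_lock` is a hypothetical helper name, and it says nothing about how GFS or any cluster lock manager actually exposes its locks.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to take an exclusive POSIX advisory lock on the whole file without
 * blocking. The descriptor must be open for writing.
 * Returns 0 if the lock was acquired, 1 if another process holds a
 * conflicting lock, and -1 on any other error (errno is left set).
 */
static int try_write_lock(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;      /* exclusive lock */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "to end of file" */

    if (fcntl(fd, F_SETLK, &fl) == 0)
        return 0;
    if (errno == EACCES || errno == EAGAIN)
        return 1;             /* lock is busy; the caller may retry later */
    return -1;
}
```

F_SETLK returns immediately instead of waiting, which is the trylock behaviour; F_SETLKW is the blocking variant. These locks are per process, so a second attempt from the same process succeeds rather than reporting a conflict.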
Re: GFS, what's remaining
On Thu, 2005-09-01 at 03:59 -0700, Andrew Morton wrote:

> - Why the kernel needs two clustered fileystems

So delete reiserfs4, FAT, VFAT, ext2, and all the other junk.

> - Why GFS is better than OCFS2, or has functionality which OCFS2 cannot possibly gain (or vice versa)
>
> - Relative merits of the two offerings

You missed the important one: people actively use it and have been for some years. It's the same reason we have NTFS, HPFS, and all the others. On that alone it makes sense to include.

Alan
Re: GFS, what's remaining
> That's GFS. The submission is about a GFS2 that's on-disk incompatible to GFS.

Just like, say, reiserfs3 and reiserfs4, or ext and ext2, or ext2 and ext3 then. I think the main point still stands: we have always taken multiple file systems on board, and we have benefitted enormously from having the competition between them instead of a dictat from the kernel kremlin that 'foofs is the one true way'.

Competition will decide if OCFS or GFS is better, or indeed if someone comes along with another contender that is better still. And competition will probably get the answer right.

The only thing that is important is that we don't end up with each cluster fs wanting different core VFS interfaces added.

Alan
Re: [PATCH 1/2] New system call, unshare
On Mon, 2005-08-08 at 09:33 -0400, Janak Desai wrote:

> [PATCH 1/2] unshare system call: System Call handler function sys_unshare

Given the complexity of the kernel code involved and the obscurity of the functionality, why not just do another clone() in userspace to unshare the things you want to unshare, and then _exit the parent?
Re: quota deadlock in 2.4.5-pre4
> I think it's a misfit between Linus' kernel and the quota tools from http://sourceforge.net/projects/linuxquota/

Linus' quota code is way out of date and only handles 16bit uids.

> Linus' tree and Alan's are showing a 2000 line diff in dquot.c alone. `quotaon' seems to be passing arguments into sys_quotactl() which it doesn't understand, etc.

Yep.

> So. I'd prefer to not do further ext3 quota testing until Linus gets an update. Alan, is it possible to push this along a bit?

Quota is relatively low priority. I can only feed stuff to Linus at a certain rate, and 'oh whoops it crashed my machine' stuff is much higher priority. There are also other problems with both sets of quota code (lock inversions and the like) that need fixing first.

Alan
Re: ECN is on!
> Matti Aarnio writes:
> > I am contemplating to periodically turn off the ECN bit to let email out, but DaveM has veto there.
>
> I veto, the whole point of moving to ECN was to make a statement and get people to fix their kit. We will remove these people, that's all.

Since HTML email also has a spec, can we remove the people who moan about that too ;)

Alan
--
MIME, oh mime, how I hate thee. Let me stick pins in you to count the ways... -- Ben LaHaise
Re: Why side-effects on open(2) are evil. (was Re: [RFD
> Why are LVM and EVMS(competing LVM project) needed at all?

I prefer to think of it the other way around.

> Surely the same can be accomplished with
> * md
> * snapshot blkdev (attached in previous e-mail)
> * giving partitions and blkdevs the ability to grow and shrink
> * giving filesystems the ability to grow and shrink

How about 'partitions are an inferior legacy form of LVM'?
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
> How about sprintf(s + strlen(s), foo)?

Solar Designer said two years ago we should be using snprintf in the kernel. He was most decidedly right 8)
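The difference is easiest to see in the append idiom quoted above. This is a minimal userspace sketch of a bounded replacement built on the standard vsnprintf(); `append_fmt` is a hypothetical helper, not an existing kernel or libc function.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Append formatted text to the NUL-terminated string in buf without ever
 * writing past buf[size - 1].
 * Returns 0 on success, or -1 if buf was not terminated within size bytes,
 * formatting failed, or the output had to be truncated.
 */
static int append_fmt(char *buf, size_t size, const char *fmt, ...)
{
    const char *nul = memchr(buf, '\0', size);
    size_t used, room;
    va_list ap;
    int n;

    if (nul == NULL)
        return -1;            /* no terminator found: refuse to append */
    used = (size_t)(nul - buf);
    room = size - used;       /* at least 1, counts the terminator */

    va_start(ap, fmt);
    n = vsnprintf(buf + used, room, fmt, ap);
    va_end(ap);

    if (n < 0 || (size_t)n >= room)
        return -1;            /* error, or the output was truncated */
    return 0;
}
```

Unlike `sprintf(s + strlen(s), ...)`, the worst case here is a truncated but still terminated string plus an error return, instead of a write past the end of the buffer.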
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
> Linus, as much as I'd like to agree with you, you are hopeless optimist. 90% of drivers contain code written by stupid gits.

I think that's a very arrogant and very mistaken view of the problem. 90% of the drivers are written by people who are

- Copying from other drivers
- Using the existing APIs to make their job easy
- Working to timescales
- Just wanting it to work

So if you take ioctl away from them, they will implement ioctl emulation by writing ioctl structs to an fd.

If you want to make these things work well, you have to provide a good working infrastructure. You don't change anything (except the maintainer) by causing pain. Instead you provide the mechanisms

- the generic parsing code, so that people don't screw up on procfs parsing
- the generic ioctl alternatives

etc.

Ditto with the major numbers. You win that battle by getting enough people to believe it is the right answer that they write the nice code for managing resources and naming assignment, which is already beginning to occur. Then even if I'm still maintaining a major number list in 2 years, nobody can quite remember why, and people are heard murmuring 'You should have tried Linux two years ago, you had to actually make device files yourself sometimes'.

Alan
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
> On Sun, 20 May 2001, Ingo Oeser wrote:
> > PS: English is neither mine, nor Linus native language. Why do the English natives complain instead of us? ;-)
>
> Because we had some experience with, erm, localized systems and for Alan it's most likely pure theory? ;-)

I think it's important that it's considered. I do like the idea of a sensible ioctl encoding (potentially including ASCII) and of being able to ship ioctls over the network.
Re: [RFD w/info-PATCH] device arguments from lookup, partion code
> ioctls are evil, period. At least with these names you can use normal scripting and don't need any special tools. Every ioctl means a binary that has no business to exist.

That is not IMHO a rational argument. It isn't my fault that your shell does not support ioctls usefully. If you used perl as your login shell you would have no problem there.

Alan
Re: cramfs b0rken on HIGHMEM machines
> just look at fs/cramfs/inode.c:cramfs_read_page()
>
> It uses page_address instead of kmap(). I would have fixed it myself, but I don't know, how I should kunmap() it, once we have memory pressure.

Take a look at ramfs. kmap isn't really a 'pressure' thing. You want to kunmap the page as soon as you can. The kmap/kunmap operations are fairly fast, but there is a limited pool of maps.

Alan