Re: BTRFS partition usage...
> The Sun disk label only allows you to specify the start of a partition
> in cylinders, so if you want to use a filesystem like XFS you have to
> start the partition on cylinder 1 which can be many blocks into the
> disk. That entire first cylinder is completely wasted.

I don't believe a cylinder of wasted disk space is significant, and I
don't believe a cylinder of disk is worth the added complexity of sharing
a partition with a filesystem.  That complexity translates into
engineering time and mistakes.

Your other arguments for making a hole in the filesystem, based on
tradition, are more convincing.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
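[For scale, a back-of-envelope sketch of how much "a cylinder" actually is.  The geometry below is the classic fake CHS layout (255 heads, 63 sectors per track, 512-byte sectors) that partitioning tools have long assumed; real drives stopped exposing true cylinder counts ages ago, so treat the numbers as illustrative.]

```python
# Rough size of one "cylinder" under the conventional fake CHS geometry.
# These values are the common tool defaults, not any particular drive.
HEADS = 255
SECTORS_PER_TRACK = 63
SECTOR_BYTES = 512

cylinder_bytes = HEADS * SECTORS_PER_TRACK * SECTOR_BYTES
print(cylinder_bytes)           # bytes wasted by skipping cylinder 0
print(cylinder_bytes / 2**20)   # the same, in MiB (just under 8)
```

So the "wasted" first cylinder is on the order of 8 MB -- which is the point above: not significant on a modern disk.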
Re: [RFC] Parallelize IO for e2fsck
>> Incidentally, some context for the AIX approach to the OOM problem: a
>> process may exclude itself from OOM vulnerability altogether.  It places
>> itself in "early allocation" mode, which means at the time it creates
>> virtual memory, it reserves enough backing store for the worst case.  The
>> memory manager does not send such a process the SIGDANGER signal or
>> terminate it when it runs out of paging space.  Before c. 2000, this was
>> the only mode.  Now the default is late allocation mode, which is similar
>> to Linux.
>
> This is an interesting approach.  It feels like some programs might be
> interested in choosing this mode instead of risking OOM.

It's the way virtual memory always worked when it was first invented.  The
system not only reserved space to back every page of virtual memory; it
assigned the particular blocks for it.  Late allocation was a later
innovation, and I believe its main goal was to make it possible to use the
cheaper disk drives for paging instead of drums.  Late allocation gives
you better locality on disk, so the seeking doesn't eat you alive (drums
don't seek).  Even then, I assume (but am not sure) that the system at
least reserved the space in an account somewhere, so at pageout time there
was guaranteed to be a place to page out to.  Overcommitting page space to
save on disk space was a later idea.

I was surprised to see AIX do late allocation by default, because IBM's
traditional style is bulletproof systems.  A system where a process can be
killed at unpredictable times because of the resource demands of unrelated
processes doesn't really fit that style.

It's really a fairly unusual application that benefits from late
allocation: one that creates a lot more virtual memory than it ever
touches.  For example, a sparse array.  Or am I missing something?
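[The "creates a lot more virtual memory than it ever touches" case can be seen directly on a Linux system with overcommit enabled.  A minimal sketch, assuming default overcommit settings: an anonymous mapping of 1 GiB is granted instantly, and only the pages actually written get backing store -- exactly the behavior early allocation mode forgoes in exchange for its guarantee.]

```python
import mmap

GiB = 1 << 30
buf = mmap.mmap(-1, GiB)      # 1 GiB of anonymous virtual memory, granted
                              # immediately under late allocation/overcommit

buf[0] = 0x2A                 # touch the first page...
buf[GiB - 1] = 0x2A           # ...and the last; the ~262142 pages between
                              # them never get committed

first, last = buf[0], buf[GiB - 1]
buf.close()
print(first, last)
```

Under an early-allocation regime, creating `buf` would have to reserve a full gigabyte of paging space up front, even though only two pages are ever used.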
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] Parallelize IO for e2fsck
> I think there is a clear need for applications to be able to
> register a callback from the kernel to indicate that the machine as
> a whole is running out of memory and that the application should
> trim its caches to reduce memory utilisation.
>
> Perhaps instead of swapping immediately, a SIGLOWMEM could be sent ...

The problem with that approach is that the fsck process doesn't know how
its need for memory compares with other processes' need for memory.  How
much memory should it give up?  Maybe it should just quit altogether if
other processes are in danger of deadlocking.  Or maybe it's best for it
to keep all its memory and let some other frivolous process give up its
memory instead.  It's the OS's job to have a view of the entire system and
make resource allocation decisions.

If it's just a matter of the application choosing a better page frame to
vacate than the one the kernel would have taken (which is more a matter of
self-interest than resource allocation), then fsck can do that more
directly by just monitoring its own page fault rate.  If the rate is high,
fsck is using more real memory than the kernel thinks it's entitled to,
and it can reduce its memory footprint to improve its speed.  It can even
check whether an access to readahead data caused a page fault; if so, it
knows reading ahead is actually making things worse, and it can reduce
readahead until the page faults stop happening.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
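[The self-tuning loop suggested above might look like the following.  This is purely a hypothetical sketch: the `fault_occurred` signal, the window sizes, and the halve/grow policy are all illustrative, not from any real fsck.]

```python
# Hypothetical controller for an fsck that watches its own page faults:
# a fault on readahead data means reading ahead is evicting pages we
# still need, so shrink the window; fault-free progress lets it regrow.
def adjust_readahead(window, fault_occurred, min_window=0, max_window=256):
    """Halve the readahead window (in blocks) after a fault on readahead
    data; otherwise grow it gently, up to max_window."""
    if fault_occurred:
        window //= 2            # reading ahead is hurting us: back off hard
    else:
        window = min(window + 8, max_window)   # cautiously regrow
    return max(window, min_window)

w = 128
w = adjust_readahead(w, fault_occurred=True)    # backs off to 64
w = adjust_readahead(w, fault_occurred=False)   # regrows to 72
print(w)
```

The asymmetry (halve on trouble, creep back up otherwise) is the usual additive-increase/multiplicative-decrease shape, chosen here because over-reading is much more costly than under-reading.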
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
I just had a talk with a colleague, John Palmer, who worked on disk drive
design for about 5 years in the '90s, and he gave me a very confident,
credible explanation of some of the things we've been wondering about disk
drive power loss in this thread, complete with demonstrations of various
generations of disk drives, dismantled.

First of all, it is plain to see that there is no spring capable of
parking the head, and there is no capacitor that looks big enough to
supply the energy to park the head, in any of the models I looked at.
Since parking the heads is essential, we can only conclude that the story
of the kinetic energy of the disks being used for that (turned into
electricity by the drive motor) is true.  The energy required is not just
to move the heads to the parking zone, but to latch them there as well.
The myth is probably just that that energy is used for anything else: it's
really easy to build a dumb circuit to park the heads using that power;
keeping a computer running is something else.

The drive does drop a write in the middle of the sector if it is writing
at the time of power loss.  The designers were too conservative to keep
writing as power fails -- there's no telling what damage you might do.  So
the drive cuts the power to the heads at the first sign of power loss.  If
a write was in progress, this means there is one garbage sector on the
disk.  It can't be read.  Trying to finish writing the sector is something
I can imagine some drive model somewhere trying to do, but if even _some_
drives take the conservative approach, everyone has to design for it, so
it doesn't matter.

A drive might then reassign that sector the next time you try to write to
it (after failing to read it), thinking the medium must be bad.  But there
are various algorithms for deciding when to reassign a sector, so it might
not.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
"H. Peter Anvin" <[EMAIL PROTECTED]> wrote on 01/18/2008 07:08:30 AM:

> Bryan Henderson wrote:
> >
> > We weren't actually talking about writing out the cache.  While that
> > was part of an earlier thread which ultimately conceded that disk
> > drives most probably do not use the spinning disk energy to write out
> > the cache, the claim was then made that the drive at least survives
> > long enough to finish writing the sector it was writing, thereby
> > maintaining the integrity of the data at the drive level.  People
> > often say that a disk drive guarantees atomic writes at the sector
> > level even in the face of a power failure.
> >
> > But I heard some years ago from a disk drive engineer that that is a
> > myth, just like the rotational energy thing.  I added that to the
> > discussion, but admitted that I haven't actually seen a disk drive
> > write a partial sector.
>
> A disk drive whose power is cut needs to have enough residual power to
> park its heads (or *massive* data loss will occur), and at that point it
> might as well keep enough on hand to finish an in-progress sector write.
>
> There are two possible sources of onboard temporary power: a large
> enough capacitor, or the rotational energy of the platters (an
> electrical motor also being a generator).  I don't care which one they
> use, but they need to do something.

I believe the power for that comes from a third source: a spring.
Parking the heads is too important to leave to active circuits.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
Ric Wheeler <[EMAIL PROTECTED]> wrote on 01/17/2008 03:18:05 PM:

> Theodore Tso wrote:
> > On Wed, Jan 16, 2008 at 09:02:50PM -0500, Daniel Phillips wrote:
> >> Have you observed that in the wild?  A former engineer of a disk
> >> drive company suggests to me that the capacitors on the board
> >> provide enough power to complete the last sector, even to park the
> >> head.
> >
> > Even if true (which I doubt), this is not implemented.
>
> A modern drive can have 16-32 MB of write cache.  Worst case, those
> sectors are not sequential, which implies lots of head movement.

We weren't actually talking about writing out the cache.  While that was
part of an earlier thread which ultimately conceded that disk drives most
probably do not use the spinning disk energy to write out the cache, the
claim was then made that the drive at least survives long enough to finish
writing the sector it was writing, thereby maintaining the integrity of
the data at the drive level.  People often say that a disk drive
guarantees atomic writes at the sector level even in the face of a power
failure.

But I heard some years ago from a disk drive engineer that that is a myth,
just like the rotational energy thing.  I added that to the discussion,
but admitted that I haven't actually seen a disk drive write a partial
sector.

Ted brought up the separate issue of the host sending garbage to the disk
device because its own power is failing at the same time, which makes the
integrity at the disk level moot (or even undesirable, as you'd rather
write a bad sector than a good one with the wrong data).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
"Daniel Phillips" <[EMAIL PROTECTED]> wrote on 01/16/2008 06:02:50 PM:

> On Jan 16, 2008 2:06 PM, Bryan Henderson <[EMAIL PROTECTED]> wrote:
> > > The "disk motor as a generator" tale may not be purely folklore.
> > > When an IDE drive is not in writeback mode, something special needs
> > > to be done to ensure the last write to media is not a scribble.
> >
> > No it doesn't.  The last write _is_ a scribble.
>
> Have you observed that in the wild?  A former engineer of a disk drive
> company suggests to me that the capacitors on the board provide enough
> power to complete the last sector, even to park the head.

No, I haven't.  It's hearsay, and from about 3 years ago.

As for parking the head, that's hard to believe, since it's so easy and
more reliable to use a spring and an electromagnet.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [Patch] document ext3 requirements (was Re: [RFD] Incremental fsck)
> The "disk motor as a generator" tale may not be purely folklore.  When
> an IDE drive is not in writeback mode, something special needs to be
> done to ensure the last write to media is not a scribble.

No it doesn't.  The last write _is_ a scribble.  Systems that make atomic
updates to disk drives use a shadow update mechanism and write the master
sector twice.  If the power fails in the middle of writing one, it will
almost certainly be unreadable due to a CRC failure, and the other one
will have either the old or the new master block contents.  And I think
there's a problem with drives that, upon sensing the unreadable sector,
assign an alternate even though the sector is fine, so that you eventually
run out of spares.

Incidentally, while this primitive behavior applies to IDE (ATA et al.)
drives, those aren't the only thing people put filesystems on.  Many
important filesystems go on higher level storage subsystems that contain
IDE drives plus cache memory and batteries.  A device like this _does_
make sure that all data it says has been written is actually retrievable
even if there's a subsequent power outage, even while giving the
performance of writeback caching.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
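[The shadow-update scheme described above can be sketched in a few lines.  This is an illustrative model, not any real filesystem's on-disk format: the master record lives in two slots, each carrying a sequence number and a CRC, and recovery takes the newest slot whose CRC checks out -- so a torn write of one slot can only ever lose the update in flight, never the previous good copy.]

```python
import zlib

def make_slot(seq, payload):
    """Encode one master-record slot: 8-byte sequence number, payload,
    then a CRC32 over both (field sizes are illustrative)."""
    body = seq.to_bytes(8, "little") + payload
    return body + zlib.crc32(body).to_bytes(4, "little")

def read_slot(raw):
    """Decode a slot; return None if the CRC says it was torn/corrupt."""
    body, crc = raw[:-4], int.from_bytes(raw[-4:], "little")
    if zlib.crc32(body) != crc:
        return None
    return int.from_bytes(body[:8], "little"), body[8:]

def recover(slot_a, slot_b):
    """Pick the payload of the valid slot with the highest sequence."""
    candidates = [s for s in (read_slot(slot_a), read_slot(slot_b)) if s]
    return max(candidates)[1]

old = make_slot(1, b"old master")
new = make_slot(2, b"new master")
torn = new[:10] + b"\x00" * (len(new) - 10)   # power failed mid-write

print(recover(old, new))    # both valid: the newer one wins
print(recover(old, torn))   # torn slot fails its CRC: old copy survives
```

The key property, matching the text: a failure in the middle of writing one copy makes that copy unreadable (CRC mismatch), and the surviving copy holds either the old or the new contents, both of which are consistent states.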
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> the maintainers of util-linux have well versed autotool people at their
> disposal, so i really dont see this as being worrisome.

As long as that is true, I agree that the fact that so many autotool
packages don't work well is irrelevant.  However, I think the difficulty
of using autotools (I mean use by packagers), as evidenced by all the
people who get it wrong, justifies people being skeptical that util-linux
really has that expertise available.

Also, many open source projects are developed by a large, diverse group of
people, so even if there exist people who can do the autotools right, it
doesn't mean they'll be done right.  One reason I try to minimize the
number of tools/skills used in maintaining packages I distribute is to
enable a larger group of people to help me maintain them.
Re: [ANNOUNCE] util-linux-ng 2.13-rc1
> i dont see how blaming autotools for other people's misuse is relevant

Here's how other people's misuse of the tool can be relevant to the choice
of the tool: some tools are easier to use right than others.  Probably the
easiest thing to use right is the system you designed and built yourself.
I've considered distributing code with an Autotools-based build system
before and determined quickly that I am not up to that challenge.  (The
bigger part of the challenge isn't writing the original input files; it's
debugging when a user says his build doesn't work.)  But as far as I know,
my hand-rolled build system is used correctly by me.

>> checks the width of integers on i386 for projects not caring about
>> that and fails to find installed libraries without telling how it was
>> supposed to find them or how to make it find that library.
>
> no idea what this rant is about.

The second part sounds like my number 1 complaint as a user of
Autotools-based packages: 'configure' often can't find my libraries.  I
know exactly where they are, and even what compiler and linker options are
needed to use them, but it often takes a half hour of tracing 'configure'
or generated make files to figure out how to force the build to understand
the same thing.  And that's with lots of experience.  The first five times
it was much more frustrating.

>> Configuring the build of an autotools program is harder than
>> necessary; if it used a config file, you could easily save it
>> somewhere while adding comments on how and why you made *that* choice,
>> and you could possibly use a set of default configs which you'd just
>> include.
>
> history shows this is a pita to maintain.  every package has its own
> build system and configuration file ...

It's my understanding that autotools _does_ provide that ability (as
stated, though I think "config file" may have been meant here as
"config.make").
The config file is a shell script that contains a 'configure' command with
a pile of options on it, and as many comments as you want, to tailor the
build to your requirements.
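[Such a saved configuration might look like the following.  The prefix, flags, and options are all illustrative -- the point is only that the whole invocation, with its rationale, lives in one commented, re-runnable file.]

```shell
#!/bin/sh
# build-config.sh -- a saved, commented 'configure' invocation, so the
# build can be reproduced (and the reasons remembered) later.
# All paths and options below are illustrative.

PREFIX=/opt/local    # our libraries live under a non-default prefix

# Tell configure (and the compiler/linker) where to find those libraries,
# which is exactly the "configure can't find my libraries" fight above.
#   --disable-static : shared libraries only; halves the build time
#   --without-x      : this is a headless build machine
CPPFLAGS="-I$PREFIX/include" LDFLAGS="-L$PREFIX/lib" \
./configure --prefix="$PREFIX" --disable-static --without-x
```

Re-running the script reproduces the build choices exactly, and the comments record why each choice was made.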
Re: Patent or not patent a new idea
> If your only purpose is to try to generate a defensive patent, then just
> dumping the idea in the public domain serves the same purpose, probably
> better.
>
> I have a few patents, some of which are defensive. That has not prevented
> the USPTO issuing quite a few patents that are in clear violation of mine.

That's not what a defensive patent is. Indeed, patenting something just so someone else can't patent it is ridiculous, because publishing is so much easier. A defensive patent is one you file so that you can trade rights to it for rights to other patents that you need.
Re: Versioning file system
> The directory is quite visible with a standard 'ls -a'. Instead,
> they simply mark it as a separate volume/filesystem: i.e. the fsid
> differs when you call stat(). The whole thing ends up acting rather like
> our bind mounts.

Hmm. So it breaks user space quite a bit. By "break," I mean uses that work with more conventional filesystems stop working if you switch to NetApp. Most programs that operate on directory trees willingly cross filesystems, right? Even ones that give you the option not to, such as GNU cp, cross by default.

But if the implementation is, as described, wildly successful, that means users are willing to tolerate this level of breakage, so it could be used for versioning too. But I think I'd rather see a truly hidden directory for this (visible only when looked up explicitly).

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Versioning file system
> We don't need a new special character for every
> new feature. We've got one, and it's flexible enough to do what you want,
> as proven by NetApp's extremely successful implementation.

I don't know NetApp's implementation, but I assume it is more than just a choice of special character. If you merely start the directory name with a dot, you don't fool anyone but 'ls' and shell wildcard expansion. (And for some enlightened people like me, you don't even fool ls, because we use the --almost-all option to show the dot files by default, having been burned too many times by invisible files.) I assume NetApp flags the directory specially so that a POSIX directory read doesn't get it. I've seen that done elsewhere.

The same thing, by the way, is possible with Jack's filename:version idea, and I assumed that's what he had in mind. Not that that makes it all OK.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Versioning file system
> The question that remains is where to implement versioning: directly in
> individual filesystems or in the vfs code so all filesystems can use it?

Or not in the kernel at all. I've been doing versioning of the types I described for years with user-space code, and I don't remember feeling that I compromised in order not to involve the kernel.

Of course, if you want to do it with snapshots and COW, you'll have to ask where in the kernel to put that, but that's not a file versioning question; it's the larger snapshot question.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: Finding hardlinks
> On any decent filesystem st_ino should uniquely identify an object and
> reliably provide hardlink information. The UNIX world has relied upon this
> for decades. A filesystem with st_ino collisions without being hardlinked
> (or the other way around) needs a fix.

But for at least the last of those decades, filesystems that could not do that were not uncommon. They had to present 32-bit inode numbers and either allowed more than 4G files or just didn't have the means of assigning inode numbers with the proper uniqueness to files. And the sky did not fall. I don't have an explanation why, but it makes it look to me like there are worse things than not having a total one-to-one correspondence between inode numbers and files. Having a stat or mount fail because inodes are too big, having fewer than 4G files, and waiting for the filesystem to generate a suitable inode number might fall in that category.

I fully agree that much effort should be put into making inode numbers work the way POSIX demands, but I also know that that sometimes requires more than just writing some code.

--
Bryan Henderson                                   San Jose California
IBM Almaden Research Center                       Filesystems
Re: GFS, what's remaining
I have to correct an error in perspective, or at least in the wording of it, in the following, because it affects how people see the big picture in trying to decide how the filesystem types in question fit into the world:

> Shared storage can be more efficient than network file
> systems like NFS because the storage access is often more efficient
> than network access

The shared storage access _is_ network access. In most cases, it's a Fibre Channel/FCP network. Nowadays, it's more and more common for it to be a TCP/IP network just like the one folks use for NFS (but carrying iSCSI instead of NFS). It's also been done with a handful of other TCP/IP-based block storage protocols.

The reason the storage access is expected to be more efficient than the NFS access is that the block access network protocols are supposed to be more efficient than the file access network protocols. In reality, I'm not sure there really is such a difference in efficiency between the protocols. The demonstrated differences in efficiency, or at least in speed, are due to other things that differ between a given new shared block implementation and a given old shared file implementation.

But there's another advantage to shared block over shared file that hasn't been mentioned yet: some people find it easier to manage a pool of blocks than a pool of filesystems.

> it is more reliable because it doesn't have a
> single point of failure in form of the NFS server.

This advantage isn't because it's shared (block) storage, but because it's a distributed filesystem. There are shared storage filesystems (e.g. IBM SANFS, ADIC StorNext) that have a centralized metadata or locking server that makes them unreliable (or unscalable) in the same ways as an NFS server.
--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] atomic open(..., O_CREAT | ...)
> Have you looked at how we're dealing with this in NFSv4?

No.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] atomic open(..., O_CREAT | ...)
> Intents are meant as optimisations, not replacements for existing
> operations. I'm therefore not really comfortable about having them
> return errors at all.

That's true of normal intents, but not what are called intents here. A normal intent merely expresses an intent, and it can be totally ignored without harm to correctness. But these "intents" were designed to be responded to by actually performing the foreshadowed operation now - irreversibly.

Linux needs an atomic lookup/open/create in order to participate in a shared filesystem and provide a POSIX interface (where shared filesystem means a filesystem that is simultaneously accessed by something besides the Linux system in question). Some operating systems do this simply with a VFS lookup/open/create function. Linux does it with this intents interface. It's hard to merge the concepts in code or in one's mind, which is why we're here now. A filesystem driver that needs to do atomic lookup/open/create has to bend over backwards to split the operation across the three filesystem driver calls that Linux wants to make.

I've always preferred just to have a new inode operation for lookup/open/create (mirroring the POSIX open operation, used for all opens if available), but if enough arguments to lookup can do it, that's practically as good. But that means returning final status from lookup, and not under any circumstance proceeding to create or open when the filesystem driver has said the entire operation is complete.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: share/private/slave a subtree - define vs enum
> I don't see how the following is tortured:
>
> enum {
>     PNODE_MEMBER_VFS = 0x01,
>     PNODE_SLAVE_VFS = 0x02
> };

Only because it's using a facility that's supposed to be for enumerated types for something that isn't. If it were a true enumerated type, the codes for the enumerations (0x01, 0x02) would be quite arbitrary, whereas here they must fundamentally be integers whose binary representation has exactly one 1 bit (because, as I understand it, these are used as bitmasks somewhere).

I can see that this paradigm has practical advantages over using macros (or a middle ground - integer constants), but only as a byproduct of what the construct is really for.
Re: share/private/slave a subtree - define vs enum
> If it's really enumerated data types, that's fine, but this example was
> about bitfield masks.

Ah. In that case, enum is a pretty tortured way to declare it, though it does have the practical advantages over define that have been mentioned because the syntax is more rigorous. The proper way to do bitfield masks is usually C bit field declarations, but I understand that tradition works even more strongly against using those than against using enum to declare enumerated types.

> there is _nothing_ wrong with using defines for constants.

I disagree with that; I find practical and, more importantly, philosophical reasons not to use defines for constants. I'm sure you've heard the arguments; I just didn't want to let that statement go uncontested.
Re: share/private/slave a subtree - define vs enum
I wasn't aware anyone preferred defines to enums for declaring enumerated data types. The practical advantages of enums are slight, but as far as I know, the practical advantages of defines are zero. Isn't the only argument for defines "that's what I'm used to"?

Two advantages of the enum declaration that haven't been mentioned yet, that help me significantly:

- If you have a typo in a define, it can be really hard to interpret the compiler error messages. The same typo in an enum gets a pointed error message referring to the line that has the typo.

- GCC warns you if a switch statement doesn't handle every case. I often add an enumeration and GCC lets me know where I forgot to consider it.

The macro language is one of the most hated parts of the C language; it makes sense to try to avoid it as a general rule.
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago. Forget what happened to it. It always seemed the right
> thing to do..

This is unquestionably the right thing to do (at least compared to what we have now). The loop device driver has no business assuming that the underlying filesystem uses the generic routines. I always assumed it was a simple design error that it did. (Such errors are easy to make because prepare_write and commit_write are declared as address space operations, when they're really private to the buffer cache and generic writer.)

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
>>>> And for the vmscan->writepage() side of things I wonder if it would be
>>>> possible to overload the mapping's ->nopage handler. If the target page
>>>> lies in a hole, go off and allocate all the necessary pagecache pages,
>>>> zero them, mark them dirty?
>>>
>>> I guess it would be possible but ->nopage is used for the read case and
>>> why would we want to then cause writes/allocations?
>>
>> yup, we'd need to create a new handler for writes, or pass `write_access'
>> into ->nopage. I think others (dwdm2?) have seen a need for that.
>
> That would work as long as all writable mappings are actually written to
> everywhere. Otherwise you still get that reading the whole mmap()ped
> area but writing a small part of it would still instantiate all of it on
> disk. As far as I understand this there is no way to hook into the mmap
> system such that we have a hook whenever a mmap()ped page gets written
> to for the first time. (I may well be wrong on that one so please
> correct me if that is the case.)

I think the point is that we can't have a "handler for writes," because the writes are being done by simple CPU store instructions in a user program. The handler we're talking about is just for page faults.

Other operating systems approach this by actually _having_ a handler for a CPU store instruction, in the form of a page protection fault handler -- the nopage routine adds the page to the user's address space, but write-protects it. The first time the user tries to store into it, the filesystem driver gets a chance to do what's necessary to support a dirty cache page -- allocate a block, add additional dirty pages to the cache, etc. It would be wonderful to have that in Linux. I saw hints of such code in a Linux kernel once (a "write_protect" address space operation or something like that); I don't know what happened to it.

Short of that, I don't see any way to avoid sometimes filling in holes due to reads.
It's not a huge problem, though -- it requires someone to do a shared writable mmap and then read lots of holes and not write to them, which is a pretty rare situation for a normal file.

I didn't follow how the helper function solves this problem. If it's something involving adding the required extra pages to the cache at pageout time, then that's not going to work -- you can't make adding pages to the cache a prerequisite for cleaning a page; that would be Deadlock City.

My large-block filesystem driver does the nopage thing, and does in fact fill in files unnecessarily in this scenario. :-( The driver for the same filesystems on AIX does not, though. It has the write protection thing.

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - nopage alternative
And for the vmscan-writepage() side of things I wonder if it would be possible to overload the mapping's -nopage handler. If the target page lies in a hole, go off and allocate all the necessary pagecache pages, zero them, mark them dirty? I guess it would be possible but -nopage is used for the read case and why would we want to then cause writes/allocations? yup, we'd need to create a new handler for writes, or pass `write_access' into -nopage. I think others (dwdm2?) have seen a need for that. That would work as long as all writable mappings are actually written to everywhere. Otherwise you still get that reading the whole mmap()ped are but writing a small part of it would still instantiate all of it on disk. As far as I understand this there is no way to hook into the mmap system such that we have a hook whenever a mmap()ped page gets written to for the first time. (I may well be wrong on that one so please correct me if that is the case.) I think the point is that we can't have a handler for writes, because the writes are being done by simple CPU Store instructions in a user program. The handler we're talking about is just for page faults. Other operating systems approach this by actually _having_ a handler for a CPU store instruction, in the form of a page protection fault handler -- the nopage routine adds the page to the user's address space, but write protects it. The first time the user tries to store into it, the filesystem driver gets a chance to do what's necessary to support a dirty cache page -- allocate a block, add additional dirty pages to the cache, etc. It would be wonderful to have that in Linux. I saw hints of such code in a Linux kernel once (a write_protect address space operation or something like that); I don't know what happened to it. Short of that, I don't see any way to avoid sometimes filling in holes due to reads. 
Re: RFC: [PATCH-2.6] Add helper function to lock multiple page cache pages - loop device
> I did a patch which switched loop to use the file_operations.read/write
> about a year ago. Forget what happened to it. It always seemed the
> right thing to do..

This is unquestionably the right thing to do (at least compared to what we have now). The loop device driver has no business assuming that the underlying filesystem uses the generic routines. I always assumed it was a simple design error that it did. (Such errors are easy to make because prepare_write and commit_write are declared as address space operations, when they're really private to the buffer cache and generic writer.)

--
Bryan Henderson                          IBM Almaden Research Center
San Jose CA                              Filesystems
Re: [RFC] sane access to per-fs metadata (was Re: [PATCH] Documentation/ioctl-number.txt)
> How it can be used? Well, say you've mounted JFS on /usr/local
> % mount -t jfsmeta none /mnt -o jfsroot=/usr/local
> % ls /mnt
> stats control bootcode whatever_I_bloody_want
> % cat /mnt/stats
> master is on /usr/local
> fragmentation = 5%
> 696942 reads, yodda, yodda
> % echo "defrag 69 whatever 42 13" > /mnt/control
> % umount /mnt

There's a lot of cool simplicity in this, both in implementation and application, but it leaves something to be desired in functionality. This is partly because the price you pay for being able to use existing, well-worn Unix interfaces is the ancient limitations of those interfaces -- like the inability to return adequate error information.

Specifically, transactional stuff looks really hard in this method. If I want the user to know why his "defrag" command failed, how would I pass that information back to him? What if I want to warn him of a filesystem inconsistency I found along the way? Or inform him of how effective the defrag was? And bear in mind that multiple processes may be issuing commands to /mnt/control simultaneously.

With ioctl, I can easily match a response of any kind to a request. I can even return an English text message if I want to be friendly.
RE: Linux not adhering to BIOS Drive boot order?
> If we can truly go for label based mounting
> and lilo'ing this would solve the problem.

From a layering point of view, it makes a lot more sense to me for the label (or signature or whatever) for this purpose to be in the partition table than inside the filesystem. The parts of the system that assign devices their identities already know about that part of the disk.