Re: pagetable_ops: Hugetlb character device example
On Thu, 22 Mar 2007, Christoph Hellwig wrote:

> On Thu, Mar 22, 2007 at 03:42:27PM +, Mel Gorman wrote:
>> A year ago, I may have agreed with you. However, Linus not only veto'd
>> it but stamped on it repeatedly at VM Summit. He couldn't have made it
>> clearer if he wore a t-shirt, a hat and held up a neon sign. The
>> assertion at the time was that variable page support of any sort had
>> to be outside of the core VM because automatic support will get it
>> wrong in some cases and makes the core VM harder to understand
>> (because it's super-clear at the moment). Others attending agreed
>> with the position. That position rules out drivers or filesystems
>> giving hints about superpage sizes in the foreseeable future.
>
> Actually I think the only way to get it right is to do it in the core
> (or in the architecture code for the really nasty bits of course), but
> then again this isn't the point I want to make here..

Maybe ultimately it's the right thing to do, but more can be done "on
the side" until such time as it's really worth handling the complexity
in the core VM.

>> What they did not have any problem with was providing better
>> interfaces to program against as long as they were on the side of the
>> VM like hugetlbfs and not in the core. The character device for
>> private mappings is an example of an interface that is easier to
>> program against than hugetlbfs. It's far easier for an application to
>> mmap a file at a fixed location than trying to discover if hugetlbfs
>> is mounted or not. However, to support that sort of interface, there
>> needs to be a way of telling the VM to call an alternative pagetable
>> handler - hence Adam's patches.
>
> .. and this is where we get into problems. There should be no need to
> use all kinds of pseudo-OO obfuscation to get there. A VMA flag that
> means 'this is hugetlb backed anonymous memory' is much nicer to
> achieve this.

Except in the example he posted, the fault handlers for the char device
and the hugetlbfs case are doing slightly different things. The fault
handler for hugetlbfs assumes the existence of a file mapping whereas
the char device is inserting the page directly. Having the bit is not
enough for the core to determine that a slightly different fault
handler is needed.

With the current code, it is almost impossible for a driver with a
different pagetable layout to express different semantics to hugetlbfs
with respect to how its pagetables are set up. Altering hugetlbfs much
is very difficult because any alteration becomes global in nature. The
ops would allow experimental drivers and interfaces to be developed
without breaking existing users of hugetlbfs and let us figure out
things like "Is it worth supporting 1GiB pages in Opterons" without
breaking everything else in the process.

> Because it makes clear there is exactly one special case here and no
> carte blanche for drivers to do whatever they want.

Drivers can already cause all sorts of mayhem through the existing
hooks if they are perverse enough. It is never encouraged of course but
nothing prevents them.

> I would prefer to even get rid of that single special case as
> mentioned above, but I'm definitely dead set against making this
> special case totally open for random bits of the kernel to mess with.

As kernel memory is already backed by huge TLB entries in many cases,
random drivers should have little or no interest in doing anything mad
with pagetable ops. All it gets them is pain and entertaining posts
from the mailing list.

That said, your main objection seems to be opening the door to
arbitrary drivers to change the pagetable ops. I can see your point and
Hugh's on why this could lead to some hilarity down the road,
particularly if out-of-tree drivers enter into the mess, so how about
the following:

Instead of having a vma->pagetable_ops with a structure of pointers, it
would be a simple integer index into a fixed list of pagetable
operation handlers. Something like

#define PAGETABLE_OP_DEFAULT      0
#define PAGETABLE_OP_HUGETLB_FS   1
#define PAGETABLE_OP_HUGETLB_CHAR 2

struct pagetable_operations_struct pagetable_ops_lookup_table[] = {
    /* PAGETABLE_OP_DEFAULT assuming we always used the table */
    {
        .fault = handle_pte_fault
    },

    /* PAGETABLE_OP_HUGETLB_FS */
    {
        .fault = hugetlb_fault,
        .copy_vma = copy_hugetlb_page_range,
        ..
    },

    /* PAGETABLE_OP_HUGETLB_CHAR */
    {
        .fault = whatever
    }
};

Drivers would only be able to set an index in the VMA for this table.
The lookup would be about the same cost as what is currently there.
However, random drivers cannot mess with the pagetable ops - they would
have to be known by the core. Experimental drivers would have to update
the table but that shouldn't be an issue. Out-of-tree drivers would
have no ability to mess here at all, which is a good thing.
Re: pagetable_ops: Hugetlb character device example
On Thu, Mar 22, 2007 at 03:42:27PM +, Mel Gorman wrote:
> A year ago, I may have agreed with you. However, Linus not only veto'd
> it but stamped on it repeatedly at VM Summit. He couldn't have made it
> clearer if he wore a t-shirt, a hat and held up a neon sign. The
> assertion at the time was that variable page support of any sort had
> to be outside of the core VM because automatic support will get it
> wrong in some cases and makes the core VM harder to understand
> (because it's super-clear at the moment). Others attending agreed with
> the position. That position rules out drivers or filesystems giving
> hints about superpage sizes in the foreseeable future.

Actually I think the only way to get it right is to do it in the core
(or in the architecture code for the really nasty bits of course), but
then again this isn't the point I want to make here..

> What they did not have any problem with was providing better
> interfaces to program against as long as they were on the side of the
> VM like hugetlbfs and not in the core. The character device for
> private mappings is an example of an interface that is easier to
> program against than hugetlbfs. It's far easier for an application to
> mmap a file at a fixed location than trying to discover if hugetlbfs
> is mounted or not. However, to support that sort of interface, there
> needs to be a way of telling the VM to call an alternative pagetable
> handler - hence Adam's patches.

.. and this is where we get into problems. There should be no need to
use all kinds of pseudo-OO obfuscation to get there. A VMA flag that
means 'this is hugetlb backed anonymous memory' is much nicer to
achieve this. Because it makes clear there is exactly one special case
here and no carte blanche for drivers to do whatever they want. I would
prefer to even get rid of that single special case as mentioned above,
but I'm definitely dead set against making this special case totally
open for random bits of the kernel to mess with.

> Someone with sufficient energy could try implementing variable page
> support entirely as a device using Adam's interface.

Hopefully not, doing this in a driver would be utterly braindead and
certainly not mergeable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: pagetable_ops: Hugetlb character device example
On (22/03/07 10:38), Christoph Hellwig didst pronounce:
> On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface. During the hugetlb
> > BOFS at OLS last year, we talked about a character device that would
> > behave like /dev/zero. Many of the people were talking about how they
> > just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> > about the hugetlbfs filesystem. /dev/zero is a familiar interface for
> > getting anonymous memory so bringing that model to huge pages would make
> > programming for anonymous huge pages easier.
>
> That is a very laudable goal, but an utterly wrong way to get there.
> Despite Linus' veto a while ago what we really want is support for
> transparent super pages.

A year ago, I may have agreed with you. However, Linus not only veto'd
it but stamped on it repeatedly at VM Summit. He couldn't have made it
clearer if he wore a t-shirt, a hat and held up a neon sign. The
assertion at the time was that variable page support of any sort had to
be outside of the core VM because automatic support will get it wrong
in some cases and makes the core VM harder to understand (because it's
super-clear at the moment). Others attending agreed with the position.
That position rules out drivers or filesystems giving hints about
superpage sizes in the foreseeable future.

What they did not have any problem with was providing better interfaces
to program against as long as they were on the side of the VM like
hugetlbfs and not in the core. The character device for private
mappings is an example of an interface that is easier to program
against than hugetlbfs. It's far easier for an application to mmap a
file at a fixed location than trying to discover if hugetlbfs is
mounted or not. However, to support that sort of interface, there needs
to be a way of telling the VM to call an alternative pagetable handler
- hence Adam's patches.

Someone with sufficient energy could try implementing variable page
support entirely as a device using Adam's interface. If it turned out
to be a good idea, then another push could be made for transparent
support later.

As it is, transparent superpage support is also a bit of a bitch for
Power and IA64. Power because in many cases (not all), pages of two
different sizes cannot be in the same virtual address range. IA64 has
issues because with the *current* pagetable implementation, hugepages
are limited to fixed address ranges. These sorts of issues alone make
transparent support in the kernel a non-trivial problem.

> Adding random pointer indirections where we had the direct
> hugetlb calls before isn't helpful for that at all.

They aren't random, they are pretty specific. Also, even when paths
like fault are entered, the cost of an indirect call is insignificant
in comparison to the page allocation, clearing the page and updating
page tables.

In Adam's current patches, the indirect call only happens when a driver
is using the pagetable ops. In the tests I looked at, the cost of the
branch could only be detected on an instruction-level profile and even
the branch cost was pretty damn tiny. If it were the case that indirect
calls always took place, it *might* be a bit more noticeable but still
nothing in comparison to the cost of the remainder of the operation.

> As a start you might
> want to make a clear distinction between core hugetlb code and the
> filesystem interface to it without all the useless indirections.

The indirect calls are about supporting interfaces to userspace. In
practice, the hugetlbfs interface, the shared memory interface and the
character device interface would share a large amount of core code.
Admittedly that code could do with restructuring because it's all
mangled together at the moment.

The core hugetlb code as you call it is mainly dealing with page cache
and huge page pool management. The filesystem layer is relatively thin
on top of it. With Adam's pagetable abstraction, it would make more
sense to restructure the huge page code and separate out
core-support-for-superpages from hugetlbfs.

> That
> should get you as far as your char dev interface.

No, it wouldn't. Restructuring the current code would allow better
sharing between interfaces but that's it. At the end of the
restructuring, we'd still need a way of saying "this VMA should be
using some but not all the hugetlb code over there even though I'm not
hugetlbfs". At that point, we'd be back at the pagetable ops
abstraction.

> But over the long
> term the core VM needs to deal with multiple (and probably not just two)
> page sizes. Given that the code to deal with different sized pages is
> essentially the same just on different units on most architectures cries
> for a better method to implement this than adding random function
> indirection that point to mostly identical code.

Internally, a semi-sane way of supporting multiple page sizes would be
Re: pagetable_ops: Hugetlb character device example
On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface. During the hugetlb
> BOFS at OLS last year, we talked about a character device that would
> behave like /dev/zero. Many of the people were talking about how they
> just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> about the hugetlbfs filesystem. /dev/zero is a familiar interface for
> getting anonymous memory so bringing that model to huge pages would make
> programming for anonymous huge pages easier.

That is a very laudable goal, but an utterly wrong way to get there.
Despite Linus' veto a while ago what we really want is support for
transparent super pages. Adding random pointer indirections where we
had the direct hugetlb calls before isn't helpful for that at all.

As a start you might want to make a clear distinction between core
hugetlb code and the filesystem interface to it without all the useless
indirections. That should get you as far as your char dev interface.

But over the long term the core VM needs to deal with multiple (and
probably not just two) page sizes. Given that the code to deal with
different sized pages is essentially the same just on different units
on most architectures cries for a better method to implement this than
adding random function indirection that point to mostly identical code.

And your driver is the best example of why we utterly don't want a
page_table operations interface. The last thing we want is random
drivers taking over core VM functionality. The right way would be for a
filesystem/driver to tell (or maybe just give hints) which page size to
use for this mapping.
Re: pagetable_ops: Hugetlb character device example
On Wed, Mar 21, 2007 at 04:35:28PM -0700, William Lee Irwin III wrote:
> On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> >> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> >> acquired the necessary functionality until expand-on-mmap() was merged.
> >> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> >> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> >> necessary functionality is now the exit strategy.
>
> On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> > Can you describe what ramfs needs here in a bit more detail?
> > If it's non-trivial, I'd rather see any new functionality go into
> > shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> > far.
>
> I was referring to fully-general multiple pagesize support. ramfs
> would inherit the functionality by virtue of generic pagecache and TLB
> handling in such an arrangement. It doesn't make sense to modify ramfs
> as a special case; hugetlb is as it stands a ramfs special-cased for
> such purposes.

Ahh, I see. Good luck!

--
Mathematics is the supreme nostalgia of our time.
Re: pagetable_ops: Hugetlb character device example
On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
>> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
>> acquired the necessary functionality until expand-on-mmap() was merged.
>> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
>> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
>> necessary functionality is now the exit strategy.

On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> Can you describe what ramfs needs here in a bit more detail?
> If it's non-trivial, I'd rather see any new functionality go into
> shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> far.

I was referring to fully-general multiple pagesize support. ramfs would
inherit the functionality by virtue of generic pagecache and TLB
handling in such an arrangement. It doesn't make sense to modify ramfs
as a special case; hugetlb is as it stands a ramfs special-cased for
such purposes.

-- wli
Re: pagetable_ops: Hugetlb character device example
On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> >> The main reason I am advocating a set of pagetable_operations is to
> >> enable the development of a new hugetlb interface.
>
> On Wed, Mar 21, 2007 at 03:51:31PM -0400, [EMAIL PROTECTED] wrote:
> > Do you have an exit strategy for the *old* interface?
>
> Hello.
>
> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> acquired the necessary functionality until expand-on-mmap() was merged.
> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> necessary functionality is now the exit strategy.

Can you describe what ramfs needs here in a bit more detail?

If it's non-trivial, I'd rather see any new functionality go into
shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
far.

--
Mathematics is the supreme nostalgia of our time.
Re: pagetable_ops: Hugetlb character device example
On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
>> The main reason I am advocating a set of pagetable_operations is to
>> enable the development of a new hugetlb interface.

On Wed, Mar 21, 2007 at 03:51:31PM -0400, [EMAIL PROTECTED] wrote:
> Do you have an exit strategy for the *old* interface?

Hello.

My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
acquired the necessary functionality until expand-on-mmap() was merged.
That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
wrapper for expand-on-mmap() around ramfs once ramfs acquires the
necessary functionality is now the exit strategy. Given current opinions
on general multiple pagesize support, by means of which the ramfs
functionality is/was intended to be implemented, that time may well be
"never."

Character device analogues of /dev/zero are not replacements for the
filesystem. Few or no transitions of existing users to such are possible.
It primarily enables new users who really need anonymous hugetlb, such as
numerical applications. The need for a filesystem namespace and persisting
across process creation and destruction will not be eliminated by
character devices.

-- wli
Re: pagetable_ops: Hugetlb character device example
On Wed, 2007-03-21 at 15:51 -0400, [EMAIL PROTECTED] wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.
>
> Do you have an exit strategy for the *old* interface?

Not really. Hugetlbfs needs to be kept around for a number of reasons. It
was designed to support MAP_SHARED mappings and IPC shm segments. It is
probably still the best interface for those jobs. Of course hugetlbfs has
lots of users so we must preserve the interface for them. But... once
hugetlbfs is abstracted behind pagetable_operations, you would have the
option of configuring it out of the kernel without losing access to huge
pages by other means (such as the character device).

--
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center
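For readers less familiar with the /dev/zero model that the proposed character device is meant to mirror, here is a minimal userspace sketch of a private mapping. It is not part of the posted patch. The helper name map_private() is made up for illustration, and the node path for the new device (something like /dev/page-huge, going by the "page-huge" name in the patch) is an assumption, since the thread does not say where the node would appear.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of the device at `path` with
 * MAP_PRIVATE, the way applications already use /dev/zero to obtain
 * anonymous, zero-filled memory. Returns MAP_FAILED on any error.
 *
 * For a huge page device, `len` would presumably have to be a multiple
 * of the huge page size; that requirement is not enforced here.
 */
void *map_private(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* The mapping remains valid after the descriptor is closed. */
    close(fd);
    return p;
}
```

Called as map_private("/dev/zero", 4096), this yields zero-filled memory whose writes stay private to the process. The point of the proposal is that the same call against the huge page device would need no knowledge of whether or where hugetlbfs is mounted.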
Re: pagetable_ops: Hugetlb character device example
On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.

Do you have an exit strategy for the *old* interface?
pagetable_ops: Hugetlb character device example
The main reason I am advocating a set of pagetable_operations is to
enable the development of a new hugetlb interface. During the hugetlb
BOFS at OLS last year, we talked about a character device that would
behave like /dev/zero. Many of the people were talking about how they just
wanted to create MAP_PRIVATE hugetlb mappings without all the fuss about
the hugetlbfs filesystem. /dev/zero is a familiar interface for getting
anonymous memory so bringing that model to huge pages would make
programming for anonymous huge pages easier.

The pagetable_operations API opens up possibilities to do some additional
(and completely sane) things. For example, I have a patch that alters the
character device code below to make use of a hugetlb ZERO_PAGE. This
eliminates almost all the up-front fault time, allowing pages to be COW'ed
only when first written to. We cannot do things like this with hugetlbfs
anymore because we have a set of complex semantics to preserve.

The following patch is an example of what a simple pagetable_operations
consumer could look like. It does depend on some other cleanups I am
working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs.
mm/hugetlb.c separation, etc). So it is unlikely to apply to any trees you
may have. I do think it makes a useful illustration of what legitimate
things can be done with a pagetable_operations interface.
commit be72df1c616fb662693a8d4410ce3058f20c71f3
Author: Adam Litke <[EMAIL PROTECTED]>
Date:   Tue Feb 13 14:18:21 2007 -0800

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index fc11063..c5e755b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER) += ipmi/
 obj-$(CONFIG_HANGCHECK_TIMER) += hangcheck-timer.o
 obj-$(CONFIG_TCG_TPM) += tpm/
+obj-$(CONFIG_HUGETLB_PAGE) += page.o

 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/char/page.c b/drivers/char/page.c
new file mode 100644
index 000..e903028
--- /dev/null
+++ b/drivers/char/page.c
@@ -0,0 +1,133 @@
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+
+static const struct {
+	unsigned int minor;
+	char *name;
+	umode_t mode;
+} devlist[] = {
+	{1, "page-huge", S_IRUGO | S_IWUGO},
+};
+
+static struct page *page_nopage(struct vm_area_struct *vma,
+				unsigned long address, int *unused)
+{
+	BUG();
+	return NULL;
+}
+
+static struct vm_operations_struct page_vm_ops = {
+	.nopage = page_nopage,
+};
+
+static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+		      unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry, new_entry;
+	int ret;
+	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	mutex_lock(&hugetlb_instantiation_mutex);
+	entry = *ptep;
+	if (pte_none(entry)) {
+		struct page *page;
+
+		page = alloc_huge_page(vma, address);
+		if (!page)
+			return VM_FAULT_OOM;
+		clear_huge_page(page, address);
+
+		ret = VM_FAULT_MINOR;
+		spin_lock(&mm->page_table_lock);
+		if (!pte_none(*ptep))
+			goto out;
+		add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
+		new_entry = make_huge_pte(vma, page, 0);
+		set_huge_pte_at(mm, address, ptep, new_entry);
+		goto out;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+
+out:
+	spin_unlock(&mm->page_table_lock);
+	mutex_unlock(&hugetlb_instantiation_mutex);
+	return ret;
+}
+
+
+static struct pagetable_operations_struct page_pagetable_ops = {
+	.copy_vma = copy_hugetlb_page_range,
+	.pin_pages = follow_hugetlb_page,
+	.unmap_page_range = unmap_hugepage_range,
+	.change_protection = hugetlb_change_protection,
+	.free_pgtable_range = hugetlb_free_pgd_range,
+	.fault = page_fault,
+};
+
+static int page_mmap(struct file * file, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHARED)
+		return -EINVAL;
+
+	if (vma->vm_pgoff)
+		return -EINVAL;
+
+	if (vma->vm_start & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= (VM_HUGETLB |
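The checks visible in page_mmap() before the archived patch breaks off can be restated as a small standalone function, which may help when reasoning about which mmap() requests the device would reject. This is a userspace sketch, not kernel code. The name sketch_check_range() is invented here, and the 2 MiB huge page size is an assumption, since the real HPAGE_SIZE depends on architecture and configuration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MiB huge pages. The kernel's HPAGE_SIZE may differ. */
#define SKETCH_HPAGE_SIZE ((uintptr_t)2 << 20)
#define SKETCH_HPAGE_MASK (~(SKETCH_HPAGE_SIZE - 1))

/*
 * Hypothetical helper mirroring the order of the tests in page_mmap():
 * returns 0 if the range [start, end) would pass them, -EINVAL if not.
 * Like the kernel code, it assumes start <= end.
 */
int sketch_check_range(uintptr_t start, uintptr_t end,
                       unsigned long pgoff, bool shared)
{
    if (shared)                        /* only MAP_PRIVATE is accepted */
        return -EINVAL;
    if (pgoff)                         /* no file offset allowed */
        return -EINVAL;
    if (start & ~SKETCH_HPAGE_MASK)    /* start must be huge page aligned */
        return -EINVAL;
    if (end & ~SKETCH_HPAGE_MASK)      /* end must be huge page aligned */
        return -EINVAL;
    if (end - start < SKETCH_HPAGE_SIZE)   /* at least one huge page */
        return -EINVAL;
    return 0;
}
```

Under the 2 MiB assumption, a private mapping of 0x200000..0x400000 at offset 0 passes, while a shared mapping, a nonzero offset, or a range ending at 0x300000 is rejected with -EINVAL.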
Re: pagetable_ops: Hugetlb character device example
On Wed, Mar 21, 2007 at 04:35:28PM -0700, William Lee Irwin III wrote:
> On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> >> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> >> acquired the necessary functionality until expand-on-mmap() was merged.
> >> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> >> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> >> necessary functionality is now the exit strategy.
>
> On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> > Can you describe what ramfs needs here in a bit more detail?
> > If it's non-trivial, I'd rather see any new functionality go into
> > shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> > far.
>
> I was referring to fully-general multiple pagesize support. ramfs would
> inherit the functionality by virtue of generic pagecache and TLB handling
> in such an arrangement. It doesn't make sense to modify ramfs as a special
> case; hugetlb is as it stands a ramfs special-cased for such purposes.

Ahh, I see. Good luck!

--
Mathematics is the supreme nostalgia of our time.