Re: pagetable_ops: Hugetlb character device example

2007-03-23 Thread Mel Gorman


On Thu, 22 Mar 2007, Christoph Hellwig wrote:


> On Thu, Mar 22, 2007 at 03:42:27PM +, Mel Gorman wrote:
> > A year ago, I may have agreed with you. However, Linus not only veto'd it but
> > stamped on it repeatedly at VM Summit. He couldn't have made it clearer if
> > he wore a t-shirt a hat and held up a neon sign. The assertion at the time
> > was that variable page support of any sort had to be outside of the core VM
> > because automatic support will get it wrong in some cases and makes the core
> > VM harder to understand (because it's super-clear at the moment). Others
> > attending agreed with the position. That position rules out drivers or
> > filesystems giving hints about superpage sizes in the foreseeable future.
>
> Actually I think the only way to get it right is to do it in the core
> (or in the architecture code for the really nasty bits of course), but
> then again this isn't the point I want to make here..



Maybe ultimately it's the right thing to do, but more can be done "on the side"
until such time as it's really worth handling the complexity in the core VM.


> > What they did not have any problem with was providing better interfaces to
> > program against as long as they were on the side of the VM like hugetlbfs
> > and not in the core. The character device for private mappings is an
> > example of an interface that is easier to program against than hugetlbfs.
> > It's far easier for an application to mmap a file at a fixed location than
> > trying to discover if hugetlbfs is mounted or not. However, to support that
> > sort of interface, there needs to be a way of telling the VM to call an
> > alternative pagetable handler - hence Adam's patches.
>
> .. and this is where we get into problems.  There should be no need to
> use all kinds of pseudo-OO obfuscation to get there.  A VMA flag that
> means 'this is hugetlb backed anonymous memory' is much nicer to achieve
> this.


Except in the example he posted, the fault handler for the char device and
the hugetlbfs case are doing slightly different things. The fault handler
for hugetlbfs assumes the existence of a file mapping whereas the char
device is inserting the page directly. Having the bit is not enough for the
core to determine that a slightly different fault handler is needed.

With the current code, it is almost impossible for a driver with a different
pagetable layout to express different semantics to hugetlbfs with respect
to how its pagetables are set up. Altering hugetlbfs much is very difficult
because any alteration becomes global in nature. The ops would allow
experimental drivers and interfaces to be developed without breaking
existing users of hugetlbfs and let us figure out things like "Is it worth
supporting 1GiB pages in Opterons?" without breaking everything else in the
process.


> Because it makes clear there is exactly one special case here
> and no carte blanche for drivers to do whatever they want.


Drivers can already cause all sorts of mayhem through the existing hooks if
they are perverse enough. It is never encouraged of course but nothing
prevents them.


> I would prefer
> to even get rid of that single special case as mentioned above, but I'm
> definitely set dead against making this special case totally open for
> random bits of the kernel to mess with.



As kernel memory is already backed by huge tlb entries in many cases, random
drivers should have little or no interest in doing anything mad with
pagetable ops. All it gets them is pain and entertaining posts from the
mailing list.

That said, your main objection seems to be opening the door to arbitrary
drivers to change the pagetable ops. I can see your point and Hugh's on why
this could lead to some hilarity down the road, particularly if out-of-tree
drivers enter into the mess, so how about the following:

Instead of having a vma->pagetable_ops with a structure of pointers,
it would be a simple integer index into a fixed list of pagetable operation
handlers. Something like

    #define PAGETABLE_OP_DEFAULT      0
    #define PAGETABLE_OP_HUGETLB_FS   1
    #define PAGETABLE_OP_HUGETLB_CHAR 2

    struct pagetable_operations_struct pagetable_ops_lookup_table[] = {
        /* PAGETABLE_OP_DEFAULT assuming we always used the table */
        [PAGETABLE_OP_DEFAULT] = {
            .fault = handle_pte_fault,
        },

        /* PAGETABLE_OP_HUGETLB_FS */
        [PAGETABLE_OP_HUGETLB_FS] = {
            .fault    = hugetlb_fault,
            .copy_vma = copy_hugetlb_page_range,
            /* .. */
        },

        /* PAGETABLE_OP_HUGETLB_CHAR */
        [PAGETABLE_OP_HUGETLB_CHAR] = {
            .fault = whatever,
        },
    };

Drivers would only be able to set an index in the VMA for this table.
The lookup would be about the same cost as what is currently there. However,
random drivers cannot mess with the pagetable ops - they would have to be
known by the core. Experimental drivers would have to update the table but
that shouldn't be an issue. Out-of-tree drivers would have no ability to
mess here at all, which is a good thing.
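
For illustration only, the index-into-a-fixed-table idea can be mocked up as a small standalone C program. This is a sketch under stated assumptions, not kernel code and not part of the posted patches: `vma_sketch`, `pagetable_ops_sketch`, `vma_set_pagetable_op()` and `dispatch_fault()` are hypothetical names, and the three handlers are stubs that simply return their own table index so the dispatch can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: it stores only an index, never pointers. */
struct vma_sketch {
    unsigned int pagetable_op;
};

/* Hypothetical, cut-down operations structure with a single hook. */
struct pagetable_ops_sketch {
    int (*fault)(struct vma_sketch *vma, unsigned long address);
};

enum {
    PAGETABLE_OP_DEFAULT = 0,
    PAGETABLE_OP_HUGETLB_FS,
    PAGETABLE_OP_HUGETLB_CHAR,
    PAGETABLE_OP_COUNT
};

/* Stub handlers: each returns its own index so the caller can see who ran. */
static int default_fault_stub(struct vma_sketch *vma, unsigned long address)
{
    (void)vma; (void)address;
    return PAGETABLE_OP_DEFAULT;
}

static int hugetlb_fs_fault_stub(struct vma_sketch *vma, unsigned long address)
{
    (void)vma; (void)address;
    return PAGETABLE_OP_HUGETLB_FS;
}

static int hugetlb_char_fault_stub(struct vma_sketch *vma, unsigned long address)
{
    (void)vma; (void)address;
    return PAGETABLE_OP_HUGETLB_CHAR;
}

/* The fixed table lives in the "core"; drivers cannot add entries at runtime. */
static const struct pagetable_ops_sketch pagetable_ops_lookup_table[PAGETABLE_OP_COUNT] = {
    [PAGETABLE_OP_DEFAULT]      = { .fault = default_fault_stub },
    [PAGETABLE_OP_HUGETLB_FS]   = { .fault = hugetlb_fs_fault_stub },
    [PAGETABLE_OP_HUGETLB_CHAR] = { .fault = hugetlb_char_fault_stub },
};

/* A driver may only pick a known index; anything else is rejected with -1. */
int vma_set_pagetable_op(struct vma_sketch *vma, unsigned int op)
{
    if (op >= PAGETABLE_OP_COUNT)
        return -1;
    vma->pagetable_op = op;
    return 0;
}

/* One array lookup plus one indirect call per fault. */
int dispatch_fault(struct vma_sketch *vma, unsigned long address)
{
    return pagetable_ops_lookup_table[vma->pagetable_op].fault(vma, address);
}
```

The point of the design is visible in `vma_set_pagetable_op()`: the only thing a driver controls is a bounded integer, so the set of possible handlers is whatever the core compiled into the table.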

Re: pagetable_ops: Hugetlb character device example

2007-03-22 Thread Christoph Hellwig
On Thu, Mar 22, 2007 at 03:42:27PM +, Mel Gorman wrote:
> A year ago, I may have agreed with you. However, Linus not only veto'd it but
> stamped on it repeatedly at VM Summit. He couldn't have made it clearer if
> he wore a t-shirt a hat and held up a neon sign. The assertion at the time
> was that variable page support of any sort had to be outside of the core VM
> because automatic support will get it wrong in some cases and makes the core
> VM harder to understand (because it's super-clear at the moment). Others
> attending agreed with the position. That position rules out drivers or
> filesystems giving hints about superpage sizes in the foreseeable future.

Actually I think the only way to get it right is to do it in the core
(or in the architecture code for the really nasty bits of course), but
then again this isn't the point I want to make here..

> What they did not have any problem with was providing better interfaces to
> program against as long as they were on the side of the VM like hugetlbfs
> and not in the core. The character device for private mappings is an
> example of an interface that is easier to program against than hugetlbfs.
> It's far easier for an application to mmap a file at a fixed location than
> trying to discover if hugetlbfs is mounted or not. However, to support that
> sort of interface, there needs to be a way of telling the VM to call an
> alternative pagetable handler - hence Adam's patches.

.. and this is where we get into problems.  There should be no need to
use all kinds of pseudo-OO obfuscation to get there.  A VMA flag that
means 'this is hugetlb backed anonymous memory' is much nicer to achieve
this, because it makes clear there is exactly one special case here
and no carte blanche for drivers to do whatever they want.  I would prefer
to even get rid of that single special case as mentioned above, but I'm
definitely set dead against making this special case totally open for
random bits of the kernel to mess with.

> Someone with sufficient energy could try implementing variable page support
> entirely as a device using Adam's interface. 

Hopefully not, doing this in a driver would be utterly braindead and
certainly not mergeable.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: pagetable_ops: Hugetlb character device example

2007-03-22 Thread Mel Gorman
On (22/03/07 10:38), Christoph Hellwig didst pronounce:
> On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.  During the hugetlb
> > BOFS at OLS last year, we talked about a character device that would
> > behave like /dev/zero.  Many of the people were talking about how they
> > just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> > about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> > getting anonymous memory so bringing that model to huge pages would make
> > programming for anonymous huge pages easier.
> 
> That is a very laudable goal, but an utterly wrong way to get there.
> Despite Linus' veto a while ago what we really want is support for transparent
> super pages.

A year ago, I may have agreed with you. However, Linus not only veto'd it but
stamped on it repeatedly at VM Summit. He couldn't have made it clearer if
he wore a t-shirt a hat and held up a neon sign. The assertion at the time
was that variable page support of any sort had to be outside of the core VM
because automatic support will get it wrong in some cases and makes the core
VM harder to understand (because it's super-clear at the moment). Others
attending agreed with the position. That position rules out drivers or
filesystems giving hints about superpage sizes in the foreseeable future.

What they did not have any problem with was providing better interfaces to
program against as long as they were on the side of the VM like hugetlbfs
and not in the core. The character device for private mappings is an
example of an interface that is easier to program against than hugetlbfs.
It's far easier for an application to mmap a file at a fixed location than
trying to discover if hugetlbfs is mounted or not. However, to support that
sort of interface, there needs to be a way of telling the VM to call an
alternative pagetable handler - hence Adam's patches.
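
To make the programming model concrete, here is a minimal userspace sketch of the /dev/zero style of interface being discussed: open a device node at a known path and create a MAP_PRIVATE mapping from it. The helper name `map_private_from_device()` is hypothetical, and no path for a huge page character device is assumed here; with "/dev/zero" this yields ordinary zero-filled anonymous memory, which is the behaviour the proposed device would mirror for huge pages.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `len` bytes privately from the device node at `path`.
 * Returns the mapping, or NULL if the open or the mmap fails.
 */
void *map_private_from_device(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return NULL;

    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);

    return addr == MAP_FAILED ? NULL : addr;
}
```

Compared with hugetlbfs, the caller does not need to find a mount point or create and unlink a file; the only thing it needs to know is the device path.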

Someone with sufficient energy could try implementing variable page support
entirely as a device using Adam's interface. If it turned out to be a
good idea, then another push could be made for transparent support later.
As it is, transparent superpage support is also a bit of a bitch for Power
and IA64. Power because in many cases (not all), pages of two different
sizes cannot be in the same virtual address range. IA64 has issues because
with the *current* pagetable implementation, hugepages are limited to fixed
address ranges. These sorts of issues alone make transparent support in the
kernel a non-trivial problem.

> Adding random pointer indirections where we had the direct
> hugetlb calls before isn't helpful for that at all. 

They aren't random, they are pretty specific. Also, even when paths like fault
are entered, the cost of an indirect call is insignificant in comparison to
the page allocation, clearing the page and updating page tables.

In Adam's current patches, the indirect call only happens when a driver is
using the pagetable ops. In the tests I looked at, the cost of the branch
could only be detected on an instruction-level profile and even the branch
cost was pretty damn tiny. If it were the case that indirect calls always took
place, it *might* be a bit more noticeable but still nothing in comparison
to the cost of the remainder of the operation.
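
As a rough illustration of that point, the shape of the check might look like the standalone sketch below. It is not taken from Adam's patches: `vma_sketch`, `pagetable_ops_sketch`, `handle_fault_sketch()` and the two stub handlers are hypothetical names, and the only thing it shows is that ordinary mappings pay for one pointer test while mappings that set the ops pay for one indirect call.

```c
#include <stddef.h>

struct vma_sketch;

/* Hypothetical, cut-down operations structure with a single hook. */
struct pagetable_ops_sketch {
    int (*fault)(struct vma_sketch *vma, unsigned long address);
};

/* Hypothetical VMA: pagetable_ops is NULL for ordinary mappings. */
struct vma_sketch {
    const struct pagetable_ops_sketch *pagetable_ops;
};

/* Stub for the normal fault path; returns 0 so callers can tell it ran. */
static int default_fault_path(struct vma_sketch *vma, unsigned long address)
{
    (void)vma; (void)address;
    return 0;
}

/* Stub for a driver-supplied handler; returns 1 so callers can tell it ran. */
static int driver_fault_stub(struct vma_sketch *vma, unsigned long address)
{
    (void)vma; (void)address;
    return 1;
}

const struct pagetable_ops_sketch driver_ops_sketch = {
    .fault = driver_fault_stub,
};

int handle_fault_sketch(struct vma_sketch *vma, unsigned long address)
{
    /* Only VMAs that opted in take the indirect call. */
    if (vma->pagetable_ops && vma->pagetable_ops->fault)
        return vma->pagetable_ops->fault(vma, address);

    /* Everyone else takes the direct call, plus one predictable branch. */
    return default_fault_path(vma, address);
}
```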

> As a start you might
> want to make a clear distinction between core hugetlb code and the
> filesystem interface to it without all the useless indirections. 

The indirect calls are about supporting interfaces to userspace. In practice,
the hugetlbfs interface, the shared memory interface and the character device
interface would share a large amount of core code.  Admittedly that code
could do with restructuring because it's all mangled together at the moment.

The core hugetlb code as you call it is mainly dealing with page cache and
huge page pool management. The filesystem layer is relatively thin on top of
it. With Adam's pagetable abstraction, it would make more sense to restructure
the huge page code and separate out core-support-for-superpages from hugetlbfs.

> That
> should get you as far as your char dev interface. 

No, it wouldn't. Restructuring the current code would allow better sharing
between interfaces but that's it. At the end of the restructuring, we'd still
need a way of saying "this VMA should be using some but not all the hugetlb
code over there even though I'm not hugetlbfs". At that point, we'd be back
at the pagetable ops abstraction.

> But over the long
> term the core VM needs to deal with multiple (and probably not just two)
> page sizes.  That the code to deal with different sized pages is
> essentially the same, just on different units, on most architectures cries
> out for a better method to implement this than adding random function
> indirections that point to mostly identical code.
> 

Internally, a semi-sane way of supporting multiple page sizes would be

Re: pagetable_ops: Hugetlb character device example

2007-03-22 Thread Christoph Hellwig
On Wed, Mar 21, 2007 at 02:43:48PM -0500, Adam Litke wrote:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.  During the hugetlb
> BOFS at OLS last year, we talked about a character device that would
> behave like /dev/zero.  Many of the people were talking about how they
> just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
> about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
> getting anonymous memory so bringing that model to huge pages would make
> programming for anonymous huge pages easier.

That is a very laudable goal, but an utterly wrong way to get there.
Despite Linus' veto a while ago what we really want is support for transparent
super pages.  Adding random pointer indirections where we had the direct
hugetlb calls before isn't helpful for that at all.  As a start you might
want to make a clear distinction between core hugetlb code and the
filesystem interface to it without all the useless indirections.  That
should get you as far as your char dev interface.  But over the long
term the core VM needs to deal with multiple (and probably not just two)
page sizes.  That the code to deal with different sized pages is
essentially the same, just on different units, on most architectures cries
out for a better method to implement this than adding random function
indirections that point to mostly identical code.


And your driver is the best example of why we utterly don't want
a page_table operations interface.  The last thing we want is random
drivers taking over core VM functionality.  The right way would be for a
filesystem/driver to tell (or maybe just give hints) which page size
to use for this mapping.


Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread Matt Mackall
On Wed, Mar 21, 2007 at 04:35:28PM -0700, William Lee Irwin III wrote:
> On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> >> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> >> acquired the necessary functionality until expand-on-mmap() was merged.
> >> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> >> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> >> necessary functionality is now the exit strategy.
> 
> On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> > Can you describe what ramfs needs here in a bit more detail?
> > If it's non-trivial, I'd rather see any new functionality go into
> > shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> > far.
> 
> I was referring to fully-general multiple pagesize support. ramfs
> would inherit the functionality by virtue of generic pagecache and TLB
> handling in such an arrangement. It doesn't make sense to modify ramfs
> as a special case; hugetlb is as it stands a ramfs special-cased for
> such purposes.

Ahh, I see.

Good luck!

-- 
Mathematics is the supreme nostalgia of our time.


Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread William Lee Irwin III
On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
>> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
>> acquired the necessary functionality until expand-on-mmap() was merged.
>> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
>> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
>> necessary functionality is now the exit strategy.

On Wed, Mar 21, 2007 at 05:53:48PM -0500, Matt Mackall wrote:
> Can you describe what ramfs needs here in a bit more detail?
> If it's non-trivial, I'd rather see any new functionality go into
> shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
> far.

I was referring to fully-general multiple pagesize support. ramfs
would inherit the functionality by virtue of generic pagecache and TLB
handling in such an arrangement. It doesn't make sense to modify ramfs
as a special case; hugetlb is as it stands a ramfs special-cased for
such purposes.


-- wli


Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread Matt Mackall
On Wed, Mar 21, 2007 at 03:26:59PM -0700, William Lee Irwin III wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> >> The main reason I am advocating a set of pagetable_operations is to
> >> enable the development of a new hugetlb interface.
> 
> On Wed, Mar 21, 2007 at 03:51:31PM -0400, [EMAIL PROTECTED] wrote:
> > Do you have an exit strategy for the *old* interface?
> 
> Hello.
> 
> My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
> acquired the necessary functionality until expand-on-mmap() was merged.
> That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
> wrapper for expand-on-mmap() around ramfs once ramfs acquires the
> necessary functionality is now the exit strategy.

Can you describe what ramfs needs here in a bit more detail?

If it's non-trivial, I'd rather see any new functionality go into
shmfs/tmpfs, as ramfs has done a good job at staying a minimal fs thus
far.

-- 
Mathematics is the supreme nostalgia of our time.


Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread William Lee Irwin III
On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
>> The main reason I am advocating a set of pagetable_operations is to
>> enable the development of a new hugetlb interface.

On Wed, Mar 21, 2007 at 03:51:31PM -0400, [EMAIL PROTECTED] wrote:
> Do you have an exit strategy for the *old* interface?

Hello.

My exit strategy was to make hugetlbfs an alias for ramfs when ramfs
acquired the necessary functionality until expand-on-mmap() was merged.
That would've allowed rm -rf fs/hugetlbfs/ outright. A compatibility
wrapper for expand-on-mmap() around ramfs once ramfs acquires the
necessary functionality is now the exit strategy.

Given current opinions on general multiple pagesize support, by means of
which the ramfs functionality is/was intended to be implemented, that
time may well be "never."

Character device analogues of /dev/zero are not replacements for the
filesystem. Few or no transitions of existing users to such devices are
possible. They primarily enable new users who really need anonymous
hugetlb, such as numerical applications. The need for a filesystem
namespace and for persistence across process creation and destruction
will not be eliminated by character devices.


-- wli


Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread Adam Litke
On Wed, 2007-03-21 at 15:51 -0400, [EMAIL PROTECTED] wrote:
> On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> > The main reason I am advocating a set of pagetable_operations is to
> > enable the development of a new hugetlb interface.
> 
> Do you have an exit strategy for the *old* interface?

Not really.  Hugetlbfs needs to be kept around for a number of reasons.
It was designed to support MAP_SHARED mappings and IPC shm segments.  It
is probably still the best interface for those jobs.  Of course
hugetlbfs has lots of users so we must preserve the interface for them.

But... once hugetlbfs is abstracted behind pagetable_operations, you
would have the option of configuring it out of the kernel without losing
access to huge pages by other means (such as the character device).

-- 
Adam Litke - (agl at us.ibm.com)
IBM Linux Technology Center



Re: pagetable_ops: Hugetlb character device example

2007-03-21 Thread Valdis . Kletnieks
On Wed, 21 Mar 2007 14:43:48 CDT, Adam Litke said:
> The main reason I am advocating a set of pagetable_operations is to
> enable the development of a new hugetlb interface.

Do you have an exit strategy for the *old* interface?




pagetable_ops: Hugetlb character device example

2007-03-21 Thread Adam Litke
The main reason I am advocating a set of pagetable_operations is to
enable the development of a new hugetlb interface.  During the hugetlb
BOFS at OLS last year, we talked about a character device that would
behave like /dev/zero.  Many of the people were talking about how they
just wanted to create MAP_PRIVATE hugetlb mappings without all the fuss
about the hugetlbfs filesystem.  /dev/zero is a familiar interface for
getting anonymous memory so bringing that model to huge pages would make
programming for anonymous huge pages easier.
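
For reference, the /dev/zero model mentioned above looks roughly like this from userspace. This is only a sketch: map_private_from_device() is a made-up helper name, and passing the path of a hugetlb character device instead of /dev/zero is an assumption about how such a device would be used, not something the patch guarantees.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of private memory from a
 * /dev/zero-style character device. Returns NULL on failure.
 */
static void *map_private_from_device(const char *path, size_t len)
{
	void *p;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	/* MAP_PRIVATE: writes are copy-on-write and never reach the device. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

	/* The mapping stays valid after the descriptor is closed. */
	close(fd);

	return p == MAP_FAILED ? NULL : p;
}
```

With "/dev/zero" as the path this returns zero-filled memory that the caller can write to and later munmap(). For a huge page device, the length would presumably have to be a multiple of the huge page size.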

The pagetable_operations API opens up possibilities to do some
additional (and completely sane) things.  For example, I have a patch
that alters the character device code below to make use of a hugetlb
ZERO_PAGE.  This eliminates almost all the up-front fault time, allowing
pages to be COW'ed only when first written to.  We cannot do things like
this with hugetlbfs anymore because we have a set of complex semantics
to preserve.

The following patch is an example of what a simple pagetable_operations
consumer could look like.  It does depend on some other cleanups I am
working on (removal of is_file_hugepages(), ...hugetlbfs/inode.c vs.
mm/hugetlb.c separation, etc).  So it is unlikely to apply to any trees
you may have.  I do think it makes a useful illustration of what
legitimate things can be done with a pagetable_operations interface.
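
The dispatch idea behind an operations table can be shown with a small userspace sketch. All names here (struct region, struct region_ops, handle_fault() and the FAULT_* values) are invented for illustration and are not the kernel's definitions; the point is only that a region carrying an ops pointer gets its own fault handler, and everything else falls through to the default path.

```c
#include <stddef.h>

struct region;

/* Hypothetical ops table; a NULL member means "use the default". */
struct region_ops {
	int (*fault)(struct region *r, unsigned long address, int write_access);
};

/* Hypothetical stand-in for a VMA. */
struct region {
	unsigned long start;
	unsigned long end;
	const struct region_ops *ops;	/* NULL for ordinary regions */
};

enum { FAULT_DEFAULT = 1, FAULT_CUSTOM = 2 };

static int default_fault(struct region *r, unsigned long address,
			 int write_access)
{
	(void)r; (void)address; (void)write_access;
	return FAULT_DEFAULT;
}

static int custom_fault(struct region *r, unsigned long address,
			int write_access)
{
	(void)r; (void)address; (void)write_access;
	return FAULT_CUSTOM;
}

/* A consumer only fills in the operations it wants to override. */
static const struct region_ops custom_ops = {
	.fault = custom_fault,
};

/* Core code: call the override if one is present, else the default. */
static int handle_fault(struct region *r, unsigned long address,
			int write_access)
{
	if (r->ops && r->ops->fault)
		return r->ops->fault(r, address, write_access);
	return default_fault(r, address, write_access);
}
```

A region with ops == NULL, or with an ops table whose fault member is NULL, takes the default path, so existing users are unaffected by the extra indirection.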

commit be72df1c616fb662693a8d4410ce3058f20c71f3
Author: Adam Litke <[EMAIL PROTECTED]>
Date:   Tue Feb 13 14:18:21 2007 -0800

diff --git a/drivers/char/Makefile b/drivers/char/Makefile
index fc11063..c5e755b 100644
--- a/drivers/char/Makefile
+++ b/drivers/char/Makefile
@@ -100,6 +100,7 @@ obj-$(CONFIG_IPMI_HANDLER)  += ipmi/
 
 obj-$(CONFIG_HANGCHECK_TIMER)  += hangcheck-timer.o
 obj-$(CONFIG_TCG_TPM)  += tpm/
+obj-$(CONFIG_HUGETLB_PAGE) += page.o
 
 # Files generated that shall be removed upon make clean
 clean-files := consolemap_deftbl.c defkeymap.c
diff --git a/drivers/char/page.c b/drivers/char/page.c
new file mode 100644
index 0000000..e903028
--- /dev/null
+++ b/drivers/char/page.c
@@ -0,0 +1,133 @@
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/init.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/pagemap.h>
+#include <linux/hugetlb.h>
+
+static const struct {
+	unsigned int	minor;
+	char		*name;
+	umode_t		mode;
+} devlist[] = {
+	{1, "page-huge", S_IRUGO | S_IWUGO},
+};
+
+static struct page *page_nopage(struct vm_area_struct *vma,
+				unsigned long address, int *unused)
+{
+	BUG();
+	return NULL;
+}
+
+static struct vm_operations_struct page_vm_ops = {
+	.nopage = page_nopage,
+};
+
+static int page_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+			unsigned long address, int write_access)
+{
+	pte_t *ptep;
+	pte_t entry, new_entry;
+	int ret;
+	static DEFINE_MUTEX(hugetlb_instantiation_mutex);
+
+	ptep = huge_pte_alloc(mm, address);
+	if (!ptep)
+		return VM_FAULT_OOM;
+
+	mutex_lock(&hugetlb_instantiation_mutex);
+	entry = *ptep;
+	if (pte_none(entry)) {
+		struct page *page;
+
+		page = alloc_huge_page(vma, address);
+		if (!page)
+			return VM_FAULT_OOM;
+		clear_huge_page(page, address);
+
+		ret = VM_FAULT_MINOR;
+		spin_lock(&mm->page_table_lock);
+		if (!pte_none(*ptep))
+			goto out;
+		add_mm_counter(mm, file_rss, HPAGE_SIZE / PAGE_SIZE);
+		new_entry = make_huge_pte(vma, page, 0);
+		set_huge_pte_at(mm, address, ptep, new_entry);
+		goto out;
+	}
+
+	spin_lock(&mm->page_table_lock);
+	/* Check for a racing update before calling hugetlb_cow */
+	if (likely(pte_same(entry, *ptep)))
+		if (write_access && !pte_write(entry))
+			ret = hugetlb_cow(mm, vma, address, ptep, entry);
+
+out:
+	spin_unlock(&mm->page_table_lock);
+	mutex_unlock(&hugetlb_instantiation_mutex);
+	return ret;
+}
+
+
+static struct pagetable_operations_struct page_pagetable_ops = {
+	.copy_vma		= copy_hugetlb_page_range,
+	.pin_pages		= follow_hugetlb_page,
+	.unmap_page_range	= unmap_hugepage_range,
+	.change_protection	= hugetlb_change_protection,
+	.free_pgtable_range	= hugetlb_free_pgd_range,
+	.fault			= page_fault,
+};
+
+static int page_mmap(struct file * file, struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_SHARED)
+		return -EINVAL;
+
+	if (vma->vm_pgoff)
+		return -EINVAL;
+
+	if (vma->vm_start & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end & ~HPAGE_MASK)
+		return -EINVAL;
+
+	if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
+		return -EINVAL;
+
+	vma->vm_flags |= (VM_HUGETLB | 
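
The alignment tests at the top of page_mmap() can be tried out in userspace with a small sketch. The EX_ constants below assume a 2 MiB huge page purely for illustration (the real HPAGE_SIZE is architecture-dependent), and range_is_hugepage_mappable() is a made-up name, not a kernel function. The end < start guard is an addition for the standalone version and is not in the patch.

```c
#include <stdbool.h>

/* Assumed values for illustration only; HPAGE_SIZE varies by architecture. */
#define EX_HPAGE_SHIFT	21
#define EX_HPAGE_SIZE	(1UL << EX_HPAGE_SHIFT)
#define EX_HPAGE_MASK	(~(EX_HPAGE_SIZE - 1))

/*
 * Mirror of the range checks in page_mmap(): both ends must sit on a
 * huge page boundary and the range must cover at least one huge page.
 */
static bool range_is_hugepage_mappable(unsigned long start, unsigned long end)
{
	/* ~EX_HPAGE_MASK selects the offset bits within a huge page. */
	if (start & ~EX_HPAGE_MASK)
		return false;
	if (end & ~EX_HPAGE_MASK)
		return false;
	/* Guard against unsigned wrap-around (not part of the patch). */
	if (end < start)
		return false;
	if (end - start < EX_HPAGE_SIZE)
		return false;
	return true;
}
```

Under the 2 MiB assumption, the range [0x200000, 0x600000) passes, while a range that starts or ends on an ordinary 4 KiB page boundary is rejected.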
