Re: lockmeter: fix lock counter roll over issue

2005-08-15 Thread Ray Bryant
On Monday 15 August 2005 02:35, Xuekun Hu wrote:
> Does anyone have inputs?
>

Xuekun,

I was on vacation last week.   I just saw your patch yesterday.  It looks 
reasonable, but I will test it later today.

You should also cc John Hawkes ([EMAIL PROTECTED]).

Also, please note my email address change:  my current email address is
[EMAIL PROTECTED]

Andrew is not much interested in these changes, since the lockmeter patch is 
not in -mm.
-- 
Ray Bryant
AMD Performance Labs   Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [PATCH] VM: add vm.free_node_memory sysctl

2005-08-05 Thread Ray Bryant
On Wednesday 03 August 2005 15:08, Andi Kleen wrote:

> >
> > Hmmm What happens if there are already mapped pages (e. g. mapped in
> > the sense that pages are mapped into an address space) on the node and
> > you want to allocate some more, but can't because the node is full of
> > clean page cache pages?   Then one would have to set the memhog argument
> > to the right thing to
>

> If you have a bind policy in the memory grabbing program then the standard
> try_to_free_pages should DTRT. That is because we generated a custom zone
> list only containing nodes in that zone and the zone reclaim only looks
> into those.
>

It may depend on what your definition of DTRT is here.  :-)

As I understand things, if we have a node that has some mapped memory 
allocated, and if one starts up a "numactl --bind=node memhog nodesize-slop" so 
as to clear some clean page cache pages from that node, then unless the 
"slop" is sized in proportion to the amount of mapped memory used on the 
node, then the existing mapped memory will get swapped out in order to 
satisfy the new request.  In addition, clean page-cache pages will get 
discarded.  I think what Martin and I would prefer to see is an interface 
that allows one to just get rid of the clean page cache (or at least enough 
of it) so that additional mapped page allocations will occur locally to the 
node without causing swapping.

AFAIK, the number of mapped pages on the node is not exported to user space 
(by, for example, /sys).   So there is no good way to size the "slop" to 
allow for an existing allocation.  If there were, then using a bound memory 
hog would likely be a reasonable replacement for Martin's syscall to release 
all free page cache, at least for small to medium sized systems.

> With preferred or other policies it's different though; in those cases
> t_t_f_p will also look into other nodes because the policy is not binding.
>
> That said, it might be possible to even make non-bind policies more
> aggressive at freeing in the current node before looking into other nodes.
> I think the zone balancing has been mostly tuned on non NUMA systems, so
> some improvements might be possible here.
>
> Most people don't use BIND and changing the default policies like this
> might give NUMA systems a better "out of the box" experience.  However this
> memory balance is very subtle code and easy to break, so this would need
> some care.
>

Of course!

> I don't think sysctls or new syscalls are the way to go here though.
>

The reason we ended up with a sysctl/syscall (to control the aggressiveness 
with which __alloc_pages will try to free page cache before spilling) is that 
deciding whether or not to spend the effort to free up page cache pages on 
the local node before spilling is a workload-dependent optimization.  For an 
HPC application it is typically worth the effort to try to free local node 
page cache before spilling off node, because the program will run long enough 
that the improvement from getting local storage dominates the extra cost of 
doing the page allocation.  For file server workloads, by contrast, it is 
typically important to minimize the time to do the page allocation; if the 
page turns out to be on a remote node, it really doesn't matter that much.  
So it seems to me that we need some way for the application to tell the 
system which approach it prefers based on the type of workload it is -- 
hence the sysctl or syscall approach.

> -Andi

-- 
Ray Bryant
AMD Performance Labs   Austin, Tx
512-602-0038 (o) 512-507-7807 (c)




Re: [PATCH] VM: add vm.free_node_memory sysctl

2005-08-03 Thread Ray Bryant
On Wednesday 03 August 2005 09:38, Andi Kleen wrote:
> On Wed, Aug 03, 2005 at 10:24:40AM -0400, Martin Hicks wrote:
> > On Wed, Aug 03, 2005 at 04:15:29PM +0200, Andi Kleen wrote:
> > > On Wed, Aug 03, 2005 at 09:56:46AM -0400, Martin Hicks wrote:
> > > > Here's the promised sysctl to dump a node's pagecache.  Please
> > > > review!
> > > >
> > > > This patch depends on the zone reclaim atomic ops cleanup:
> > > > http://marc.theaimsgroup.com/?l=linux-mm&m=112307646306476&w=2
> > >
> > > Doesn't numactl --bind=node memhog nodesize-someslack do the same?
> > >
> > > It just might kick in the oom killer if someslack is too small
> > > or someone has unfreeable data there. But then there should be
> > > > already a sysctl to turn that one off.
> >
Hmmm... What happens if there are already mapped pages (e. g. mapped in the 
sense that pages are mapped into an address space) on the node and you want 
to allocate some more, but can't because the node is full of clean page cache 
pages?   Then one would have to set the memhog argument to the right thing to 
keep the existing mapped memory from being swapped out, right?  Is the data 
to set that argument readily available to user space?  Martin's patch has the 
advantage of targeting just the clean page cache pages.

The way I see this, the problem is that clean page cache pages >>should<< be 
easily available to be used to satisfy a request for mapped pages.   This 
works correctly in non-NUMA Linux systems.  But in NUMA Linux systems, we 
keep tripping over this problem all the time, particularly in the  HPC space, 
and patches like Martin's come about as an attempt to solve this in the VMM.
(We trip over this in the sense that we end up allocating off node storage 
because the current node is full of page cache pages.)

The best answer we have at the present time is to run a memory hog program 
that forces the clean page cache pages to be reclaimed by putting the node in 
question under memory pressure, but this seems like an indirect way to solve 
the problem at hand which is, really, to quickly release those page cache 
pages and make them available for user programs to allocate.  So the most 
direct way to fix this is to fix it in the VMM rather than depending on a 
memory hog based work-around of some kind.   Perhaps we haven't gotten the 
right set of patches together to do this, but my take is that is where the 
fix belongs. 

And, just for the record (  :-)  ), this is not just an Altix problem.  
Opterons are NUMA systems too, and we encounter exactly this same problem in 
the HPC space on 4-node systems.  
-- 
Ray Bryant
AMD Performance Labs   Austin, Tx
512-602-0038 (o) 512-507-7807 (c)




(no subject)

2005-03-09 Thread Ray Bryant
subscribe linux-kernel
end



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
OK, so what is the alternative?  Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form
  migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);
with old_node_list = 0 1 2 ... 31
new_node_list = 2 3 4 ... 33
for one of the pid's in the job.

I still don't like it. It would be bad to make migrate_pages another
ptrace() [and ptrace at least really enforces a stopped process]
But I can see your point that migration DEFAULT pages with first touch
aware applications pretty much needs the old_node, new_node lists.
I just don't think an external process should mess with other processes
VA. But I can see that it makes sense to do this on SHM that 
is mapped into a management process.

How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

There can be mapped files that can't be mapped into the migration task.
Here's an example (courtesy of Jack Steiner):
sprintf(fname, "/tmp/tmp.%d", getpid());
unlink(fname);
fd = open(fname, O_CREAT|O_RDWR, 0600);
p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
unlink(fname);
/* "p" remains valid until unmapped */
The file /tmp/tmp.<pid> is both mapped and deleted.  It can't be opened
by another process to mmap() it, so, as far as I know, it can't be mapped
into the migration task.  The file does show up in /proc/<pid>/maps as
shown below (pardon the line splitting):

2027-20278000 rw-p 0020 08:13 75498728  \ 
/lib/tls/libc.so.6.1
20278000-20284000 rw-p 20278000 00:00 0
2030-20c8c000 rw-s  08:13 100885287 \ 
/tmp/tmp.18259 (deleted)
4000-40008000 r-xp  00:2a 14688706  \ 
/home/tulip14/steiner/apps/bigmem/big

Jack says:
"This is a fairly common way to work with scratch map'ed files. Sites that
have very large disk farms but limited swap space frequently do this (or at 
least they use to...)"

So while I tend to agree with your concern about manipulating
one process's address space from another, I honestly think we
are stuck, and I don't see a good way around this.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.
Yes, I agree.  Let's make that so.
-Andi

--
Best Regards,
Ray
-----------
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [PATCH/RFC] A method for clearing out page cache

2005-02-22 Thread Ray Bryant
Andrew Morton wrote:
Paul Jackson <[EMAIL PROTECTED]> wrote:
As Martin wrote, when he submitted this patch:
> The motivation for this patch is for setting up High Performance
> Computing jobs, where initial memory placement is very important to
> overall performance.
Any left over cache is wrong, for this situation.

So...  Cannot the application remove all its pagecache with posix_fadvise()
prior to exiting?
Even if we modified all applications to do this, it still wouldn't help for
dirty page cache, which would eventually become cleaned, and hang around long
after the application has departed.
But the previous statement has a false hypothesis, namely, that we could
change all applications to do this.
--
Best Regards,
Ray
---
          Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

We'd have to use up a struct page flag (PG_MIGRATED?) to mark
the page as migrated to keep the call to migrate_pages() for
the anonymous pages from migrating the pages again.  Then we'd
have to have some way to clear PG_MIGRATED once all of the
migrate_pages() calls are complete (we can't have the anonymous
page migrate_pages() calls clear the flags, since the second
such call would find the flag clear and remigrate the pages
in the overlapping nodes case.)
How about ignoring the va_start and va_end values unless
either:
  pid == current->pid
  or  current->euid == 0 /* we're root */
I like the first check a bit better than checking for 0.  Are
there other system calls that follow that convention (e. g.
pid = 0 implies current?)
The second check lets a sufficiently responsible task manipulate
other tasks.  This task can choose to have the target tasks
suspended before it starts fussing with them.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.
Yes, that's been pointed out to me before.  Let's make it so.
--
Best Regards,
Ray
---
      Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [PATCH/RFC] A method for clearing out page cache

2005-02-22 Thread Ray Bryant
Ingo Molnar wrote:
* Andrew Morton <[EMAIL PROTECTED]> wrote:

... enable users to
specify an 'allocation priority' of some sort, which kicks out the
pagecache on the local node - or something like that.
Yes, that would be preferable - I don't know what the difficulty is
with that.  sys_set_mempolicy() should provide a sufficiently good
hint.

yes. I'm not against some flushing mechanism for debugging or test
purposes (it can be useful to start from a new, clean state - and as
such the sysctl for root only and depending on KERNEL_DEBUG is probably
better than an explicit syscall), but the idea to give a flushing API to
applications is bad i believe.
We're pretty agnostic about this.  I agree that if we were to make this
a system call, then it should be restricted to root.  Or make it a
sysctl.  Whichever way you guys want to go is fine with us.
It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
VM problems and i fear that it will destroy the evolution of VM
priority/placement/affinity APIs (NUMAlib, etc.).
I have two observations about this:
(1)  It is our intent to use the infrastructure provided by this patch
 as the basis for an automatic (i. e. included with the VM) approach
 that selectively removes unused page cache pages before spilling
 off node.  We just figured it would be easier to get the
 infrastructure in place first.
(2)  If a sufficiently well behaved application knows in advance how
 much free memory it needs per node, then it makes sense to provide
 a mechanism for the application to request this, rather than for
 the VM to try to puzzle this out later.  Automatic algorithms in
 the VM are never perfect; they should be reserved to work in those
 cases where the application(s) either cooperate in such a way to
 make memory demands impossible to predict, or the application
 programmer can't (or can't take the time to) predict how much
 memory the application will use.
At least making it sufficiently painful to use (via the originally
proposed root-only sysctl) could still preserve some of the incentive to
provide a clean solution for applications. 'Time to market' constraints
should not be considered when adding core mechanisms.
Ingo

--
Best Regards,
Ray
---
          Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [PATCH/RFC] A method for clearing out page cache

2005-02-22 Thread Ray Bryant
Ingo Molnar wrote:
* Andrew Morton [EMAIL PROTECTED] wrote:

. enable users to
specify an 'allocation priority' of some sort, which kicks out the
pagecache on the local node - or something like that.
Yes, that would be preferable - I don't know what the difficulty is
with that.  sys_set_mempolicy() should provide a sufficiently good
hint.

yes. I'm not against some flushing mechanism for debugging or test
purposes (it can be useful to start from a new, clean state - and as
such the sysctl for root only and depending on KERNEL_DEBUG is probably
better than an explicit syscall), but the idea to give a flushing API to
applications is bad i believe.
We're pretty agnostic about this.  I agree that if we were to make this
a system call, then it should be restricted to root.  Or make it a
sysctl.  Whichever way you guys want to go is fine with us.
It is the 'easy and incorrect path' to a number of NUMA (and non-NUMA)
VM problems and i fear that it will destroy the evolution of VM
priority/placement/affinity APIs (NUMAlib, etc.).
I have two observations about this:
(1)  It is our intent to use the infrastructure provided by this patch
 as the basis for an automatic (i. e. included with the VM) approach
 that selectively removes unused page cache pages before spilling
 off node.  We just figured it would be easier to get the
 infrastructure in place first.
(2)  If a sufficiently well behaved application knows in advance how
 much free memory it needs per node, then it makes sense to provide
 a mechanism for the application to request this, rather than for
 the VM to try to puzzle this out later.  Automatic algorithms in
 the VM are never perfect; they should be reserved to work in those
 cases where the application(s) either cooperate in such a way to
 make memory demands impossible to predict, or the application
 programmer can't (or can't take the time to) predict how much
 memory the application will use.
At least making it sufficiently painful to use (via the originally
proposed root-only sysctl) could still preserve some of the incentive to
provide a clean solution for applications. 'Time to market' constraints
should not be considered when adding core mechanisms.
Ingo
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
   so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

We'd have to use up a struct page flag (PG_MIGRATED?) to mark
the page as migrated to keep the call to migrate_pages() for
the anonymous pages from migrating the pages again.  Then we'd
have to have some way to clear PG_MIGRATED once all of the
migrate_pages() calls are complete (we can't have the anonymous
page migrate_pages() calls clear the flags, since the second
such call would find the flag clear and remigrate the pages
in the overlapping nodes case.)
How about ignoring the va_start and va_end values unless
either:
  pid == current-pid
  or  current-euid == 0 /* we're root */
I like the first check a bit better than checking for 0.  Are
there other system calls that follow that convention (e. g.
pid = 0 implies current?)
The second check lets a sufficiently responsible task manipulate
other tasks.  This task can choose to have the target tasks
suspended before it starts fussing with them.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock,madvise,mmap et.al.
Yes,.that's been pointed out to me before.  Let's make it so.
--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
   so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH/RFC] A method for clearing out page cache

2005-02-22 Thread Ray Bryant
Andrew Morton wrote:
Paul Jackson [EMAIL PROTECTED] wrote:
As Martin wrote, when he submitted this patch:
 The motivation for this patch is for setting up High Performance
 Computing jobs, where initial memory placement is very important to
 overall performance.
Any left over cache is wrong, for this situation.

So...  Cannot the applicaiton remove all its pagecache with posix_fadvise()
prior to exitting?
Even if we modified all applications to do this, it still wouldn't help for
dirty page cache, which would eventually become cleaned, and hang around long
after the application has departed.
But the previous statement has a false hypothesis, namely, that we could
change all applications to do this.
--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
   so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
OK, so what is the alternative?  Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form
  migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);
with old_node_list = 0 1 2 ... 31
new_node_list = 2 3 4 ... 33
for one of the pid's in the job.

I still don't like it. It would be bad to make migrate_pages another
ptrace() [and ptrace at least really enforces a stopped process]
But I can see your point that migrating DEFAULT pages with first touch
aware applications pretty much needs the old_node, new_node lists.
I just don't think an external process should mess with other processes'
VA. But I can see that it makes sense to do this on SHM that 
is mapped into a management process.

How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

There can be mapped files that can't be mapped into the migration task.
Here's an example (courtesy of Jack Steiner):
sprintf(fname, "/tmp/tmp.%d", getpid());
unlink(fname);
fd = open(fname, O_CREAT|O_RDWR, 0600);
p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
unlink(fname);
/* p remains valid until unmapped */
The file /tmp/tmp.pid is both mapped and deleted.  It can't be opened
by another process to mmap() it, so it can't be mapped into the
migration task, at least as far as I know how to do things.  The file does show up in
/proc/pid/maps as shown below (pardon the line splitting):

2027-20278000 rw-p 0020 08:13 75498728  \ 
/lib/tls/libc.so.6.1
20278000-20284000 rw-p 20278000 00:00 0
2030-20c8c000 rw-s  08:13 100885287 \ 
/tmp/tmp.18259 (deleted)
4000-40008000 r-xp  00:2a 14688706  \ 
/home/tulip14/steiner/apps/bigmem/big

Jack says:
This is a fairly common way to work with scratch map'ed files. Sites that
have very large disk farms but limited swap space frequently do this (or at 
least they used to...)

So while I tend to agree with your concern about manipulating
one process's address space from another, I honestly think we
are stuck, and I don't see a good way around this.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock,madvise,mmap et.al.
Yes, I agree.  Let's make that so.
-Andi



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi,
Oops.  It's late.  The paragraph below in my previous note confused
cpus and nodes.  It should have read as follows:
Let's suppose that nodes 0-1 of a 64 node [was: CPU] system have graphics
pipes.  To keep it simple, we will assume that there are 2 cpus
per node like an Altix [128 CPUS in this system]. Let's suppose that jobs
arrive as follows:
. . .
Sorry about that.


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi,
I went back and did some digging on one the issues that has dropped
off the list here: the case where the set of old nodes and new
nodes overlap in some way.  No one could provide me with a specific
example, but the thread was that "This did happen in certain scenarios".
Part of these scenarios involved situations where a particular job
had to have access to a certain node, because that certain node was
attached to a graphics device, for example.  Here is one such
scenario:
Let's suppose that nodes 0-1 of a 64 CPU system have graphics
pipes.  To keep it simple, we will assume that there are 2 cpus
per node like an Altix. Let's suppose that jobs arrive as follows:
(1)  32 processor, non-graphics job arrives and gets assigned
 cpus 96-127 (nodes 48-63)
(2)  A second 32 processor, non-graphics job arrives and is
 assigned cpus 64-95 (nodes 32-47)
(3)  A 64 processor non-graphics job arrives and gets assigned
 cpus 0-63.
(bear with me, please)
(4)  The job on cpus 64-95 (nodes 32-47) terminates.  A new 28 processor
 job arrives and is assigned cpus 68-95.
(5)  A 4 cpu graphics job comes in and we want to assign it to
 cpus 0-3 (nodes 0-1) and it has a very high priority, so
 we want to migrate the 64 CPU job.  The only place left
 to migrate it is from cpus 0-63 to cpus 4-67.
(Note that we can't just migrate nodes 0-1 to nodes 32-33, because
for all we know, the program depends on the fact that nodes 0-1
are physically close to [have low latency access to] nodes 2-3.
So moving 0-1 to 32-33 would be a non-topological preserving
migration.)
Now if we are using a system call of the form
migrate_pages(pid, count, old_node_list, new_node_list);
then we really can't have old_node_list and new_node_list overlap,
unless this is the only process that we are migrating or there is
no shared memory among the pid's.  (Neither is very likely for
our workload mix.  :-)  ).
The reason that this doesn't work is the following:  It works
fine for the first pid.  The shared segment gets moved to the
new_node_list.  But when we call migrate_pages() for the 2nd
pid, we will remigrate the pages that ended up on the nodes
that are in the intersection of the sets of members of the
two lists.  (The scanning code has no way to recognize that
the pages have been migrated.  It finds pages that are on one
of the old nodes, and migrates them again.)  This gets repeated
for each subsequent call.  Not pretty.  What happens in this
particular case if you do the trivial thing and try:
old_nodes=0 1 2 ... 31
new_nodes=2 3 4 ... 33
Then after 16 processes have been migrated, all of the shared memory
pages of the job are on nodes 32 and 33. (I've assumed the shared
memory is shared among all of the processes of the job.)
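That remigration effect is easy to demonstrate with a small user-space
model; one_pass() below is only a stand-in for what the in-kernel scanner
would do on each migrate_pages() call, not real kernel code:

```c
/* Model of one migrate_pages() scan: every page found on old_nodes[i]
 * is moved to new_nodes[i].  The scanner cannot tell pages that an
 * earlier call already migrated from pages it has not seen yet, so
 * overlapping old/new lists remigrate pages on every subsequent call. */
static void one_pass(int *page_node, int npages,
                     const int *old_nodes, const int *new_nodes, int count)
{
    for (int p = 0; p < npages; p++) {
        for (int i = 0; i < count; i++) {
            if (page_node[p] == old_nodes[i]) {
                page_node[p] = new_nodes[i];  /* moved once per scan */
                break;
            }
        }
    }
}
```

With old_nodes = 0 1 ... 31 and new_nodes = 2 3 ... 33, running one pass
per sharing process leaves every shared page on node 32 or 33 after the
16th pass, which is exactly the pile-up described above.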
Now you COULD do multiple migrations to make this work.
In this case, you could do 16 migrations:
step    old_nodes   new_nodes
  1     30 31       32 33
  2     28 29       30 31
  3     26 27       28 29
 ...
 16      0  1        2  3
During each step, you would have to call migrate_pages() 64 times,
since there are 64 processes involved.  (You can't migrate
any more nodes in each step without creating a situation where
pages will be physically migrated twice.)  Once again, we are
starting to veer close to O(N**2) behavior here, and we want
to stay away from that.
OK, so what is the alternative?  Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form
   migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);
with old_node_list = 0 1 2 ... 31
 new_node_list = 2 3 4 ... 33
for one of the pid's in the job.
(This is particularly important if the shared region is large.)
Next we could go and move the non-shared memory in each process
using similar calls, repeated one or more times in each process.
Yes, this is ugly, and yes this requires us to parse /proc/pid/maps.
Life is like that sometimes.
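For what it's worth, the parsing itself is mechanical.  A hedged
user-space sketch (field layout per proc(5): start-end, perms, offset,
dev major:minor, inode, optional path):

```c
#include <stdio.h>
#include <string.h>

struct map_entry {
    unsigned long start, end;   /* VA range of the mapping */
    char perms[8];              /* e.g. "rw-s" */
    char path[256];             /* backing file, may be empty */
};

/* Parse one line of /proc/<pid>/maps; returns 1 on success, 0 on a
 * line that doesn't match the expected layout. */
static int parse_maps_line(const char *line, struct map_entry *m)
{
    int pos = 0;

    m->path[0] = '\0';
    if (sscanf(line, "%lx-%lx %7s %*x %*x:%*x %*u %n",
               &m->start, &m->end, m->perms, &pos) < 3)
        return 0;
    if (line[pos]) {
        strncpy(m->path, line + pos, sizeof(m->path) - 1);
        m->path[sizeof(m->path) - 1] = '\0';
        m->path[strcspn(m->path, "\n")] = '\0';  /* drop trailing newline */
    }
    return 1;
}
```

Mappings like the deleted /tmp/tmp.18259 one above can then be spotted
with a strstr() for "(deleted)" on the path field.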
Now, I admit that this example is somewhat contrived, and it shows
worst case behavior.  But this is not an implausible scenario.  And
it shows the difficulties of trying to use a system call of the
form:
   migrate_pages(pid, count, old_node_list, new_node_list)
in those cases where the old_node_list and the new_node_list are not
disjoint.  Furthermore, it shows how we could end up in a situation
where the old_node_list and the new_node_lists overlap.
Jack Steiner pointed out this kind of example to me, and this kind
of example did arise in IRIX, so we believe that it will arise on
Altix and we don't know of a good way around these problems other
than the system call form that includes the va_start and va_end.

Re: [PATCH/RFC] A method for clearing out page cache

2005-02-21 Thread Ray Bryant
Andrew Morton wrote:
Ray Bryant <[EMAIL PROTECTED]> wrote:

We did it this way because it was easier to get it into SLES9 that way.
But there is no particular reason that we couldn't use a system call.
It's just that we figured adding system calls is hard.

aarggh.  This is why you should target kernel.org kernels first.  Now we
risk ending up with poor old suse carrying an obsolete interface and
application developers have to be able to cater for both interfaces.
I agree, but time-to-market decisions overrode that.  Anyway, everyone
uses a program called "bcfree" to actually do the buffer-cache freeing,
so changing the interface is not as bad as all that.
Let us put something together along these lines and we will get back to you.
Thanks,


Re: [PATCH/RFC] A method for clearing out page cache

2005-02-21 Thread Ray Bryant
Andrew Morton wrote:
Martin Hicks <[EMAIL PROTECTED]> wrote:
This patch introduces a new sysctl for NUMA systems that tries to drop
as much of the page cache as possible from a set of nodes.  The
motivation for this patch is for setting up High Performance Computing
jobs, where initial memory placement is very important to overall
performance.

- Using a write to /proc for this seems a bit hacky.  Why not simply add
  a new system call for it?
We did it this way because it was easier to get it into SLES9 that way.
But there is no particular reason that we couldn't use a system call.
It's just that we figured adding system calls is hard.
- Starting a kernel thread for each node might be overkill.  Yes, it
  would take longer if one process was to do all the work, but does this
  operation need to be very fast?
It is possible that this call might need to be executed at the start of
each batch job in the system.  The reason for using a kernel thread was
that there was no good way to start concurrency due to a write to /proc.
  If it does, then userspace could arrange for that concurrency by
  starting a number of processes to perform the toss, each with a different
  nodemask.
That works fine as well if we can get a system call number assigned and
avoids the hackiness of both /proc and the kernel threads.
- Dropping "as much pagecache as possible" might be a bit crude.  I
  wonder if we should pass in some additional parameter which specifies how
  much of the node's pagecache should be removed.
  Or, better, specify how much free memory we will actually require on
  this node.  The syscall terminates when it determines that enough
  pagecache has been removed.
Our thoughts exactly.  This is clearly a "big hammer" and we want to
make a lighter hammer to free up a certain number of pages.  Indeed,
we would like to have these calls occur automatically from __alloc_pages()
when we try to allocate local storage and find that there isn't any.
For our workloads, we want to free up unmapped, clean pagecache, if that
is what is keeping us from allocating a local page.  Not all workloads
want that, however, so we would probably use a sysctl() to enable/disable
this.
However, the first step is to do this manually from user space.
- To make the syscall more general, we should be able to reclaim mapped
  pagecache and anonymous memory as well.
So what it comes down to is
sys_free_node_memory(long node_id, long pages_to_make_free, long what_to_free)
where `what_to_free' consists of a bunch of bitflags (unmapped pagecache,
mapped pagecache, anonymous memory, slab, ...).
Do we have to implement all of those, or just allow for the possibility of
their being implemented in the future?  E.g., in our case we'd just implement
the bit that says "unmapped pagecache".
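One way to accept the full interface while implementing only the
unmapped-pagecache bit is stubbed below; the flag names and error values
are made up for illustration, not an agreed ABI:

```c
/* Hypothetical bitflags for the what_to_free argument discussed above. */
#define FREE_UNMAPPED_PAGECACHE  (1 << 0)
#define FREE_MAPPED_PAGECACHE    (1 << 1)
#define FREE_ANON_MEMORY         (1 << 2)
#define FREE_SLAB                (1 << 3)

#define FREE_VALID_MASK (FREE_UNMAPPED_PAGECACHE | FREE_MAPPED_PAGECACHE | \
                         FREE_ANON_MEMORY | FREE_SLAB)

/* A first implementation could validate every defined mode but only
 * handle unmapped pagecache, leaving the other bits to be wired up
 * later without changing the interface. */
static long free_node_memory(long node_id, long pages_to_make_free,
                             long what_to_free)
{
    if (what_to_free & ~FREE_VALID_MASK)
        return -1;      /* would be -EINVAL in a real syscall */
    if (what_to_free != FREE_UNMAPPED_PAGECACHE)
        return -2;      /* would be -ENOSYS: mode not implemented yet */
    (void)node_id;
    (void)pages_to_make_free;
    return 0;           /* reclaim of unmapped pagecache would go here */
}
```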



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi Kleen wrote:

I wouldn't bother fixing up VMA policies. 


How would these policies get changed so that they represent the
reality of the new node location(s) then?  Doesn't this have to
happen as part of migrate_pages()?


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
All,
Just an update on the idea of migrating a process without suspending
it.
The hard part of the problem here is to make sure that the page_migrate()
system call sees all of the pages to migrate.  If the process that is
being migrated can still allocate pages, then the page_migrate() call
may miss some of the pages.
One way to solve this problem is to force the process to start allocating
pages on the new nodes before calling page_migrate().  There are a couple
of subcases:
(1)  For memory mapped files with a non-DEFAULT associated memory policy,
 one can use mbind() to fixup the memory policy.  (This assumes the
 Steve Longerbeam patches are applied, as I understand things).
(2)  For anonymous pages and memory mapped files with DEFAULT policy,
 the allocation depends on which node the process is running.  So
 after doing the above, you need to migrate the task to a cpu
 associated with one of the nodes.
The problem with (1) is that it is racy; there is no guaranteed way to get the
list of mapped files for the process while it is still running.  A process
can do it for itself, so one way to do this would be to write the set of
new nodes to a /proc/pid file, then send the process a SIG_MIGRATE
signal.  Ugly  (For multithreaded programs, all of the threads have
to be signalled to keep them from mmap()ing new files during the migration.)
(1) could be handled as part of the page_migrate() system call --
make one pass through the address space searching for mempolicy()
data structures, and updating them as necessary.  Then make a second
pass through and do the migrations.  Any new allocations will then
be done under the new mempolicy, so they won't be missed.  But this
still gets us into trouble if the old and new node lists are not
disjoint.
This doesn't handle anonymous memory or mapped files associated with
the DEFAULT policy.  A way around that would be to add a target cpu_id
to the page_migrate() system call.  Then before doing the first pass
described above, one would do the equivalent of set_sched_affinity()
for the target pid, moving it to the indicated cpu.  Once it is known
the pid has moved (how to do that?), we now know anonymous memory and
DEFAULT mempolicy mapped files will be allocated on the nodes associated
with the new cpu.  Then we can proceed as discussed in the last paragraph.
Also ugly, due to the extra parameter.
Alternatively, we can just require, for correct execution, the invoking
code to do the set_sched_affinity() first, in those cases where
migrating a running task is important.
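The set_sched_affinity() step the invoking code would perform first can
be done today with the real sched_setaffinity(2); the cpu-to-node lookup
is left out of this sketch:

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin `pid` (0 = calling process) to `cpu`, so that subsequent
 * anonymous and DEFAULT-policy allocations by that task are satisfied
 * from the node that owns `cpu`.  Returns 0 on success, -1 on error. */
static int move_task_to_cpu(int pid, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

Once the task is known to have actually been rescheduled there (the
"how to do that?" question above), the migration passes can follow.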
Anyway, how important is this, really for acceptance of a page_migrate()
system call in the community?  (that is, how important is it to be
able to migrate a process without suspending it?)


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Paul Jackson wrote:
You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly

I presume if you knew that the job only had pages on certain nodes,
perhaps due to aggressive use of cpusets, that you would only have to
walk those nodes, right?
I don't think Andi was proposing you have to search all of the pages
on a node.  I think that the idea was that the (count, old_nodes, new_nodes)
parameters would have to be converted to a full node_map such as is done
in the patch (let's call it "sample code") that I sent out with the
overview that started this whole discussion.  node_map[] is MAX_NUMNODES
in length, and node_map[i] gives the node where pages on node i should be
migrated to, or is -1 if we are not migrating pages on this node.
Since we have extended the interface to support -1 as a possible value for
the old_nodes array [and it matches any old node], then in that case we
would make node_map[i]=new_node for all values of i.
--
Best Regards,
Ray
---
      Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Andi Kleen wrote:
Do you have any better way to suggest, Andi, for a batch manager to
relocate a job?  The typical scenario, as Ray explained it to me, is

- Give the shared libraries and any other files a suitable policy
(by mapping them and applying mbind) 

- Then execute migrate_pages() for the anonymous pages with a suitable
old node -> new node mapping.

How would you recommend that the batch manager move that job to the
nodes that can run it?  The layout of allocated memory pages and tasks
for that job must be preserved in order to keep the same performance.
The migration method needs to scale to hundreds, or more, of nodes.

You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly
(in the worst case you could create a small hash table for it
in the kernel, but I'm not sure it's worth it) 

-Andi
-
I'm going to assume that there have been some "crossed emails" here.
I don't think that this is the interface that you and I have been
converging on.  As I understood it, we were converging on the following:
(1)  extended attributes will be used to mark files as non-migratable
(2)  the page_migrate() system call will be defined as:
 page_migrate(pid, count, old_nodes, new_nodes);
 and it will migrate all pages that are either anonymous or part
 of mapped files that are not marked non-migratable.
(3)  The mbind() system call with MPOL_MF_STRICT will be hooked up
 to the migration code so that it actually causes a migration.
 Processes can use this interface to migrate a portion of their own
 address space containing a mapped file.
This is different than your reply above, which seems to imply that:
(A)  Step 1 is to migrate mapped files using mbind().  I don't understand
 how to do this in general, because:
 (a)  I don't know how to make a non-racy list of the mapped files to
  migrate without assuming that the process to be migrated is stopped
and  (b)  If the mapped file is associated with the DEFAULT memory policy,
  and page placement was done by first touch, then it is not clear
  how to use mbind() to cause the pages to be migrated, and still
  end up with the identical topological placement of pages after
  the migration.
(B)  Step 2 is to use page_migrate() to migrate just the anonymous pages.
 I don't like the restriction of this to just anonymous pages.
Fundamentally, I don't see why (A) is much different from allowing one
process to manipulate the physical storage for another process.  It's
just stated in terms of mmap'd objects instead of pid's.  So I don't
see why that is fundamentally different from a page_migration() call
with va_start and va_end arguments.
So I'm going to assume that the agreement was really (1)-(3) above.
The only problem I see with that is the following:  Suppose that a user
wants to migrate a portion of their own address space that is composed
of (at least partly) anonymous pages or pages mapped to a file associated
with the DEFAULT memory policy, and we want the pages to be topologically
allocated the same way after the migration as they were before the
migration?
The only way I know how to do the latter is with a system call of the form:
page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
where the permission model is that a pid can migrate any process that it
can send a signal to.  So a root pid can migrate any process, and a user
pid can migrate pages of any pid started by the user.
--
Best Regards,
Ray
---
          Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Andi Kleen wrote:
But we are least at the level of agreeing that the new system
call looks something like the following:
migrate_pages(pid, count, old_list, new_list);
right?

For the external case probably yes. For internal (process does this
on its own address space) it should be hooked into mbind() too.
-Andi
That makes sense.  I will agree to make that part work, too. as part
of this.  We will probably do the external case first, because we have
need for that.
--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
ey are done.
Indeed we can only expose what we want most users to see in
glibc and leave the underlying system call in its full form
for only those systems that need it.
-Andi
But we are least at the level of agreeing that the new system
call looks something like the following:
migrate_pages(pid, count, old_list, new_list);
right?
That's progress.  :-)
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
You and Robin mentioned some problems with "double migration"
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.

There is one other case where Robin and I have talked about double
migration.  That is the case where the set of old nodes and new
nodes overlap.  If the system call interface
is assumed to be something like:
page_migrate(pid, old_node, new_node);
then, depending on what the complete lists
of old_nodes and new_nodes are, a careless sequence like:
page_migrate(pid, 1, 2);
page_migrate(pid, 2, 3);
can end up actually moving pages from node 1 to node 2,
only to move them again from node 2 to node 3.  This is another
form of double migration that we have worried about avoiding.
--
-------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi, et al:
I see that  several messages have been sent in the interim.
I apologize for being "out of sync", but today is my last
day to go skiing and it is gorgeous outside.  I'll try
to catch up and digest everything later.
--
-------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Here's an interface proposal that may be a middle ground and
should satisfy both small and large system requirements:
The system call interface would be:
page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);
(e. g. same as before, but please keep reading):
The following restrictions of my original proposal would be
dropped:
(1)  va_start and va_end can span multiple vma's.  To migrate
 all pages in a process, va_start can be 0UL and va_end
 would be MAX_INT L.  (Equivalently, we could use va_start
 and length, in pages)  We would expect the normal usage
 of this call on small systems to be va_start=0, va_end=MAX_INT.
 va_start and va_end would be required to be page aligned.
(2)  There is no requirement that the pid be suspended before
 the system call is issued.  Further requirements below
 are proposed to handle the allocation of new pages while
 the migrate system call is in progress.
(3)  Mempolicy data structures will be updated to reflect the
 new node locations before any pages are migrated.  That
 way, if the process allocates new pages before the migration
 process is completed, they will be allocated on the new
 nodes.
 (An alternative:  we could require the user to update
 the NUMA API data structures to reflect the new reality
 before the page_migrate() call is issued.  This is consistent
 with item (4).  If the user doesn't do this, then
 there is no guarentee that the page migration call will
 actually be able to migrate all pages.)
 If any memory policy is DEFAULT, then the pid will need to
 be migrated to a cpu associated with  one of the new_node_list
 nodes before the page_migrate() call.  This is so new
 allocations will happen in the new_node_list and the
 migration call won't miss those pages.  The system call
 will work correctly without this, it just can't guarentee
 that it will migrate all pages from the old_nodes.
(4)  If cpusets are in use, the new_node_list must represent
 valid nodes to allocate pages from for the cpuset that
 pid is currently a member of.  This implies that the
 pid is moved from its old cpuset to a new cpuset before
 the page_migrate() call is issued.  Any nodes not part
 of the new cpuset will cause the system call to return
 with -EINVAL.
(5)  If, during the migration process, a page is to be moved to
 node N, but the alloc_pages_node() call for node N fails, then the
 page will fall over to allocation on the "nearest" node
 in the new_node_list; if this node is full then fall over
 to the next nearest node, etc.  If none of the nodes has
 space, then the migration system call will fail.  (Hmmm...
 would we unmigrate the pages that had been migrated
 this far??  sounds messy also, not sure what one
 would do about error reporting here so that the caller
 could take some corrective action.)
(6)  The system call is reserved to root or a pid with
 capability CAP_PAGE_MIGRATE.
(7)  Mapped files with the extended attribute MIGRATE
 set to NONE are not migrated by the system call.
 Mapped files with the extended attribute MIGRATE
 set to LIB will be handled as follows:  r/o
 mappings will not be migrated.  r/w mappings will
 be migrated.  If no MIGRATE extended attribute is available,
 then the assumption is that the MIGRATE extended
 attribute is not set.  (Files mapped from NFS
 would always be regarded as migrateable until
 NFS gets extended attributes.)
Note that nothing here requires parsing of /proc/pid/maps,
etc.  However, very large systems may use the system call
in special ways, e. g:
(1)  They may decide to suspend processes before migration.
(2)  They may decide to optimize the migration process by
 trying to migrate large shared objects only "once",
 in the sense that only one scan of a large shared
 object will be done.
Issues of complexity related to the above are reserved for
those systems who choose to use the system call in this way.
Please note, however that this is a performance optimization
that some systems MAY decide to do.  There is NO REQUIREMENT
that any user follow these steps from a correctness point of
view, the page_migrate() system call will still do the correct
thing.
Now, I know that is complicated and a lot of verbiage.  But this
would satisfy our requirements and I think it would satisfy
the concern that the page_migration() call was built just to
satisfy SGI requirements.
Comments, flames, suggestions, etc, as usual are all welcome.
--
-------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Sorry for the late answer.]
No problem, remember, I'm supposed to be on vacation, anyway.  :-)
Let's start off with at least one thing we can agree on.  If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable.   (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated.  So
the extended attribute may be a little more complicated than
just "don't migrate".)
The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though.  It is possible
we can live with the notion that files mapped in from NFS are always
migratable.  (I'll need to look into that some more).
On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

OK, so I've "lost the thread" a little bit here.  Specifically what
would you propose the API for page migration be?  As I read through your note,
I see a couple of different possibilities being considered:
(1)  Map each object to be migrated into a management process,
 update the object's memory policy to match the new node locations
 and then unmap the object.  Use the MPOL_MF_STRICT argument to mbind() and
 the result is that migration happens as part of the call.
(2)  Something along the lines of:
 page_migrate(pid, old_node, new_node);
 or perhaps
 page_migrate(pid, old_node_mask, new_node_mask);
or
(3)  mbind() with a pid argument?
I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly?)  :-)


The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel) 
Internally it's all the same.
Well, yes, I guess to be more precise I should have said that
very few programs use any NUMA policy other than the DEFAULT
policy.  And that they instead make page placement decisions implicitly
using first touch.
Hmm, I see perhaps my distinction of these cases with programs
already using NUMA API and not using it was not very useful
and led you to a tangent. Perhaps we can just drop it.
I think one problem that you have that you essentially
want to keep DEFAULT policy, but change the nodes.
Yes, that is correct.  This has been exactly my point from the
beginning.
We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate  those programs without
requiring them to create a non-DEFAULT policy of some kind.
NUMA API currently doesn't offer a way to do that, 
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERRED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a bit ugly,
but not too bad.

Very ugly, I think.  Particularly if you have to do a lot of
vma splitting to get the correct node placement.  (Worst case
is a VMA with nodes interleaved by first touch across a set of
nodes in a way that doesn't match the INTERLEAVE mempolicy.
Then you would have to create a separate VMA for each page
and use the BIND policy.  Then after migration you would
have to go through and set the policy back to DEFAULT,
resulting in a lot of vma merges.)

Sure, but NUMA API goes to great pains to handle such programs.
Yes, it does.  But, how do we handle legacy NUMA codes that people
use today on our Linux 2.4.21 based Altix kernels?  Such programs
don't have access to the NUMA API, so they aren't using it.  They
work fine on 2.6 with the DEFAULT memory policy.  It seems unreasonable
to go back and require these programs to use "numactl" to solve a problem that
they are already solving without it.  And it certainly seems difficult
to require them to use numactl to enable migration of those programs.
(I'm sorry to keep harping on this but I think this is the
heart of the issue we are discussing.  Are you of the opinion that
we should require every program that runs on ALTIX under Linux 2.6 to use
numactl?)

So lets go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting

That would make the node array infinite, wouldn't it?  What happens when
you want to migrate a 1TB process? @) I think you have to replace
that one with a single target node argument too.
I'm sorry, I don't follow that at all.  The node array has nothing to do with
the size of the address range to be migrated.  It is not the case that the
ith entry in the node array says what to do with the ith page.  Instead the
old and new node arrays define a mapping of pages:  for pages found on
old_node[i], move them to new_node[i].

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Sorry for the late answer.]
No problem, remember, I'm supposed to be on vacation, anyway.  :-)
Let's start off with at least one thing we can agree on.  If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable.   (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated.  So
the extended attribute may be a little more complicated than
just don't migrate.)
The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though.  It is possible
we can live with the notion that files mapped in from NFS are always
migratable.  (I'll need to look into that some more).
On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

OK, so I've lost the thread a little bit here.  Specifically what
would you propose the API for page migration be?  As I read through your note,
I see a couple of different possibilities being considered:
(1)  Map each object to be migrated into a management process,
 update the object's memory policy to match the new node locations
 and then unmap the object.  Use the MPOL_F_STRICT argument to mbind() and
 the result is that migration happens as part of the call.
(2)  Something along the lines of:
 page_migrate(pid, old_node, new_node);
 or perhaps
 page_migrate(pid, old_node_mask, new_node_mask);
or
(3)  mbind() with a pid argument?
I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly?)  :-)


The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel) 
Internally it's all the same.
Well, yes, I guess to be more precise I should have said that
very few programs use any NUMA policy other than the DEFAULT
policy.  And that they instead make page placement decisions implicitly
using first touch.
Hmm, I see perhaps my distinction of these cases with programs
already using NUMA API and not using it was not very useful
and lead you to a tangent. Perhaps we can just drop it.
I think one problem that you have that you essentially
want to keep DEFAULT policy, but change the nodes.
Yes, that is correct.  This has been exactly my point from the
beginning.
We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate  those programs without
requiring them to create a non-DEFAULT policy of some kind.
NUMA API currently doesn't offer a way to do that, 
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a big ugly,
but not too bad.

Very ugly, I think.  Particularly if you have to do a lot of
vma splitting to get the correct node placement.  (Worst case
is a VMA with nodes interleaved by first touch across a set of
nodes in a way that doesn't match the INTERLEAVE mempolicy.
Then you would have to create a separate VMA for each page
and use the BIND policy.  Then after migration you would
have to go through and set the policy back to DEFAULT,
resulting in a lot of vma merges.)

Sure, but NUMA API goes to great pains to handle such programs.
Yes, it does.  But, how do we handle legacy NUMA codes that people
use today on our Linux 2.4.21 based Altix kernels?  Such programs
don't have access to the NUMA API, so they aren't using it.  They
work fine on 2.6 with the DEFAULT memory policy.  It seems unreasonable
to go back and require these programs to use numactl to solve a problem that
they are already solving without it.  And it certainly seems difficult
to require them to use numactl to enable migration of those programs.
(I'm sorry to keep harping on this but I think this is the
heart of the issue we are discussing.  Are you of the opinion that
we sould require every program that runs on ALTIX under Linux 2.6 to use 
numactl?)

So lets go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting

That would make the node array infinite, won't it?  What happens when
you want to migrate a 1TB process? @) I think you have to replace
that one with a single target node argument too.
I'm sorry, I don't follow that at all.  The node array has nothing to do 
with
the size of the address range to be migrated.  It is not the case that the
ith entry in the node array says what to do with the ith page.  Instead the
old and new node arrays defining a mapping of pages:  for pages found on
old_node[i], move them

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Here's an interface proposal that may be a middle ground and
should satisfy both small and large system requirements:
The system call interface would be:
page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);
(e. g. same as before, but please keep reading):
The following restrictions of my original proposal would be
dropped:
(1)  va_start and va_end can span multiple vma's.  To migrate
 all pages in a process, va_start can be 0UL and va_end
 would be MAX_INT L.  (Equivalently, we could use va_start
 and length, in pages)  We would expect the normal usage
 of this call on small systems to be va_start=0, va_end=MAX_INT.
 va_start and va_end would be required to be page aligned.
(2)  There is no requirement that the pid be suspended before
 the system call is issued.  Further requirements below
 are proposed to handle the allocation of new pages while
 the migrate system call is in progress.
(3)  Mempolicy data structures will be updated to reflect the
 new node locations before any pages are migrated.  That
 way, if the process allocates new pages before the migration
 process is completed, they will be allocated on the new
 nodes.
 (An alternative:  we could require the user to update
 the NUMA API data structures to reflect the new reality
 before the page_migrate() call is issued.  This is consistent
 with item (4).  If the user doesn't do this, then
 there is no guarantee that the page migration call will
 actually be able to migrate all pages.)
 If any memory policy is DEFAULT, then the pid will need to
 be migrated to a cpu associated with  one of the new_node_list
 nodes before the page_migrate() call.  This is so new
 allocations will happen in the new_node_list and the
 migration call won't miss those pages.  The system call
 will work correctly without this, it just can't guarantee
 that it will migrate all pages from the old_nodes.
(4)  If cpusets are in use, the new_node_list must represent
 valid nodes to allocate pages from for the cpuset that
 pid is currently a member of.  This implies that the
 pid is moved from its old cpuset to a new cpuset before
 the page_migrate() call is issued.  Any nodes not part
 of the new cpu set will cause the system call to return
 with -EINVAL.
(5)  If, during the migration process, a page is to be moved to
 node N, but the alloc_pages_node() call for node N fails, then the
 page will fall over to allocation on the nearest node
 in the new_node_list; if this node is full then fall over
 to the next nearest node, etc.  If none of the nodes has
 space, then the migration system call will fail.  (Hmmm...
 would we unmigrate the pages that had been migrated
 this far??  sounds messy also, not sure what one
 would do about error reporting here so that the caller
 could take some corrective action.)
(6)  The system call is reserved to root or a pid with
 capability CAP_PAGE_MIGRATE.
(7)  Mapped files with the extended attribute MIGRATE
 set to NONE are not migrated by the system call.
 Mapped files with the extended attribute MIGRATE
 set to LIB will be handled as follows:  r/o
 mappings will not be migrated.  r/w mappings will
 be migrated.  If no MIGRATE extended attribute is available,
 then the assumption is that the MIGRATE extended
 attribute is not set.  (Files mapped from NFS
 would always be regarded as migratable until
 NFS gets extended attributes.)
Note that nothing here requires parsing of /proc/pid/maps,
etc.  However, very large systems may use the system call
in special ways, e. g:
(1)  They may decide to suspend processes before migration.
(2)  They may decide to optimize the migration process by
 trying to migrate large shared objects only once,
 in the sense that only one scan of a large shared
 object will be done.
Issues of complexity related to the above are reserved for
those systems who choose to use the system call in this way.
Please note, however that this is a performance optimization
that some systems MAY decide to do.  There is NO REQUIREMENT
that any user follow these steps from a correctness point of
view, the page_migrate() system call will still do the correct
thing.
Now, I know that is complicated and a lot of verbiage.  But this
would satisfy our requirements and I think it would satisfy
the concern that the page_migrate() call was built just to
satisfy SGI requirements.
Comments, flames, suggestions, etc, as usual are all welcome.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
You and Robin mentioned some problems with double migration
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.

There is one other case where Robin and I have talked about double
migration.  That is the case where the set of old nodes and new
nodes overlap.  If one is not careful, and the system call interface
is assumed to be something like:
page_migrate(pid, old_node, new_node);
then (depending on what the complete list
of old_nodes and new_nodes is), if one does something like:
page_migrate(pid, 1, 2);
page_migrate(pid, 2, 3);
then you can end up actually moving pages from node 1 to node 2,
only to move them again from node 2 to node 3.  This is another
form of double migration that we have worried about avoiding.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi, et al:
I see that  several messages have been sent in the interim.
I apologize for being out of sync, but today is my last
day to go skiing and it is gorgeous outside.  I'll try
to catch up and digest everything later.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
 in
glibc and leave the underlying system call in its full form
for only those systems that need it.
-Andi
But we are at least at the level of agreeing that the new system
call looks something like the following:
migrate_pages(pid, count, old_list, new_list);
right?
That's progress.  :-)
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
 is to have a call that just
migrates everything.  The main reason for that is that I don't think external
processes should mess with virtual addresses of another process.
It just feels unclean and has many drawbacks (parsing /proc/*/maps
needs complicated user code, racy, locking difficult).  

Yes, but remember, we are coming from an assumption that migrated processes
are suspended.  This may be myopic, but we CAN make this work with  the
constraints we have in place.  Now if you are arguing for a more general
migration facility that doesn't require the processes to be blocked, well
then I agree with you.  The /proc/*/maps approach doesn't work.
So let's go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and start
scanning the pte's and the page cache for anonymous memory and mapped files,
respectively.  For each VMA we have to make a migrate/don't migrate decision.
We also have to accept that the set of originating and destination nodes
have to be distinct.  Otherwise, there is no good way to tell whether or not
a particular page has been migrated.  So we have to make that restriction.
Without xattrs, how do we make the migrate/non-migrate decision?  Where
do we put the data?  Well, we can have some file in the file system that
has file names in it and read that file into the kernel and convert each
file to a device and inode pair.  We can then match that against each of
the VMAs and choose not to migrate any VMA that maps a file on the list.
For each anonymous VMA we just migrate the pages.
Sounds like it is doable, but I have this requirement from my IRIX
buddies that I support overlapping sets of nodes in the to and from
node sets.  I guess we have to go back and examine that in more detail.
In kernel space handling full VMs is much easier and safer due to better 
locking facilities.

In user space only the process itself really can handle its own virtual 
addresses well, and if it does that it can use NUMA API directly anyways.

You argued that it may be costly to walk everything, but I don't see this
as a big problem - first walking mempolicies is not very costly and then
fork() and exit() do exactly this already. 
I'm willing to accept that walking the page table (via follow_page()) or
the file (via find_get_page()) is not that expensive, at least until it
is shown otherwise.  We do tend to have big address spaces and lots of
processors associated with them, but I'm willing to accept that we won't
migrate a huge process around very often.  (Or at least not often enough
for it to be interesting.)  However, if this does turn out to be a performance
problem for us, we will have to come back and re-examine this stuff.
The main missing piece for this would be a way to make policies for
files persistent. One way would be to use xattrs like selinux, but
that may be costly (not sure we want to read xattrs all the time
when reading a file). 

I'm not sure I want to tie implementation of the page migration
API to getting xattrs into all of the file systems in Linux
(although I suppose we could live with it if we got them into XFS).
Is this really the way go to here?  This seems like this would
decrease the likelihood of getting the page migration code
accepted by a significant amount.  It introduces a new set of
people (the file system maintainers) whom I have to convince to
make changes.  I just don't see that as being a fruitful exercise.
Instead I would propose a magic file to be read at boot time as discussed
above -- that file would contain the names of all files not to be
migrated.  The kicker comes here in that what do we do if that set
needs to be changed during the course of a single boot?  (E.g., someone
adds a new shared library.)  I suppose we could have a
sysctl() that would cause that file to be re-read.  This would be
a short term solution until xattrs are accepted and/or until Steve
Longerbeam's patch is accepted.  Would that be an acceptable short
term kludge?
A hackish way to do this that already 
works would be to do a mlock on one page of the file to keep
the inode pinned. E.g. the batch manager could do this. That's 
not very clean, but would probably work. 

-Andi

--
-----------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Ray Bryant
Dave Hansen wrote:
On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
What is the fundamental opposition to an array of from-to node mappings?
They are not that difficult to follow.  They make the expensive traversal
of ptes the single pass operation.  The time to scan the list of from nodes
to locate the node this page belongs to is relatively quick when compared
to the time to scan ptes and will result in probably no cache thrashing
like the long traversal of all ptes in the system required for multiple
system calls.  I can not see the node array as anything but the right way
when compared to multiple system calls.  What am I missing?

I don't really have any fundamental opposition.  I'm just trying to make
sure that there's not a simpler (better) way of doing it.  You've
obviously thought about it a lot more than I have, and I'm trying to
understand your process.
As far as the execution speed with a simpler system call.  Yes, it will
likely be slower.  However, I'm not sure that the increase in scan time
is all that significant compared to the migration code (it's pretty
slow).
-- Dave

I'm worried about doing all of those find_get_page() things over and over
when the mapped file we are migrating is large.  I suppose one can argue
that that is never going to be the case (e. g. no one in their right mind
would migrate a job with a 300 GB mapped file).  So we are back to the
overlapping set of nodes issue.  Let me look into this some more.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
[Sorry, didn't answer to everything in your mail the first time. 
See previous mail for beginning]

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
migrating, and figure out from that what portions of which pid's
address spaces need to be migrated so that we satisfy the constraints
given above.  I admit that this may be viewed as ugly, but I really
can't figure out a better solution than this without shuffling a
ton of ugly code into the kernel.

I like the concept of marking stuff that shouldn't be migrated
externally (using NUMA policy) better. 

I really don't have an objection to that for the case of the shared
libraries in, for example, /lib and /usr/lib.  I just worry about making
sure that all of the libraries have so been marked.  I can do this
in a much simpler way by just adding a list of "do not migrate stuff"
to the migration library rather than requiring Steve Longerbeam's
API.

One issue that hasn't been addressed is the following:  given a
particular entry in /proc/pid/maps, how does one figure out whether
that entry is mapped into some other process in the system, one
that is not in the set of processes to be migrated?   One could

[...]
Marking things externally would take care of that.
So the default would be that if the file is not mapped as "not-migratable",
then the file would be migratable, is that the idea?

If we did this, we still have to have the page migration system call
to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
pages were placed by first touch and for which there used to not be
a memory policy.  As discussed in a previous note, we are not in a

You can handle those with mbind(..., MPOL_F_STRICT); 
(once it is hooked up to page migration) 
Making memory migration a subset of the NUMA API is not a general
solution.  It only works for programs that are using memory policy
to control placement.   As I've tried to point out multiple times
before, most programs that I am aware of use placement based on
first-touch.  When we migrate such programs, we have to respect
the placement decisions that the program has implicitly made in
this way.
Requiring memory migration to be a subset of the NUMA API is a
non-starter for this reason.   We have to follow the approach
of doing the correct migration, followed by fixing up the NUMA
policy to match the new reality.  (Perhaps we can do this as
part of memory migration.)
Until ALL programs use the NUMA mempolicy for placement
decisions, we cannot support page migration under the NUMA
API.
I don't understand why this is not clear to you.  Are you
assuming that you can manufacture a NUMA API for the new
location of the job that correctly represents the placement
information and toplogy of the job on the old set of nodes?
Just mmap the tmpfs/shm/hugetlb file in an external program and apply
the policy. That is what numactl supports today too for shm
files like this.
It should work later.
Wait.  As near as I can tell you
-Andi

--
-------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
(1)  You really don't want to migrate the code pages of shared libraries
that are mapped into the process address space.  This causes a
useless shuffling of pages which really doesn't help system
performance.  On the other hand, if a shared library is some
private thing that is only used by the processes being migrated,
then you should move that.

I think the better solution for this would be to finally integrate Steve L.'s 
file attribute code (and find some solution to make it persistent,
e.g. using xattrs with a new inode flag) and then "lock" the shared 
libraries to their policy using a new attribute flag.

I really don't see how that is relevant to the current discussion, which,
AFAIK, is whether the kernel interface should be "migrate an entire process"
versus what I have proposed.  What we are trying to avoid here for shared
libraries is two things:  (1) don't migrate them needlessly, and (2) don't
even make the migration request if we know that the pages shouldn't be
migrated.
Using Steve Longerbeam's approach avoids (1).  But you will still scan the
pte's of the processes to be migrated (if you go with a "migrate the
entire process" approach) and try to migrate them, only to find out that
they are pinned in place.  How is that a good thing?
A much simpler way to do this would be to add a list of libraries that
you don't want to be migrated to the migration library that I have
proposed to be the interface between the batch scheduler and the kernel.
Then when the library scans the /proc/pid/maps stuff, it can exclude
those libraries from migration.  Furthermore, no migration requests will
even be initiated for those parts of the address space.
Of course, this means maintaining a library list in the migration
library.  We may eventually decide to do that.  For now, we're following
up on the reference count approach I outlined before.

(2)  You really only want to migrate pages once.  If a file is mapped
into several of the pid's that are being migrated, then you want
to figure this out and issue one call to have it moved wrt one of
the pid's.
(The page migration code from the memory hotplug patch will handle
updating the pte's of the other processs (thank goodness for
rmap...))

I don't get this. Surely the migration code will check if a page
is already in the target node, and when that is the case do nothing.
How could this "double migration" happen? 
Not so much a double migration, but a double request for migration.
(This is not a correctness, but a performance issue, once again.)
Consider the case of a 300 GB file mapped into 256 pid's.  One doesn't
want each pid to try to migrate the file pages.  Granted, all after the
first one will find the data already migrated, but if you issue a
migration request for each address space, the others won't know that
the page has been migrated until they have found the page and looked
up its current node.  That means doing a find_get_page() for each page
in the mapped file in all 256 address spaces, and 255 of those address
spaces will find the page has already been migrated.  How is that
useful?  I'd much rather migrate it once from the perspective of
a single address space, and then skip the scanning for pages to
migrate in all of the other address spaces.

(3)  In the case where a particular file is mapped into different
processes at different file offsets (and we are migrating both
of the processes), one has to examine the file offsets to figure
out if the mappings overlap or not. If they overlap, then you've
got to issue two calls, each of which describes a non-overlapping
region; both calls taken together would cover the entire range
of pages mapped to the file.  Similarly if the ranges do not
overlap.

That sounds like a quite obscure corner case which I'm not sure
is worth all the complexity.
-Andi

So what is your solution when this happens?  Make the job non-migratable?
Yes, it may be an obscure case in your view, but we've got to handle all of
those cases to make a robust facility that can be used in a production 
environment.

--
-------
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Robin Holt wrote:
On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
which is what you are asking for, I think.  The library's job
(in addition to suspending all of the processes in the list for
the duration of the migration operation, plus do some other things
that are specific to sn2 hardware) would be to examine the

You probably want the batch scheduler to do the suspend/resume as it
may be parking part of the job on nodes that have memory but running
processes of a different job while moving a job out of the way for a
big-mem app that wants to run on one of this jobs nodes.
That works as well, and if we keep the majority of the work on
deciding who to migrate where and what to do when in a user space
library rather than in the kernel, then we have a lot more flexibility
in, for example who suspends/resumes the jobs to be migrated.

do memory placement by first touch, during initialization.  This is,
in part, because most of our codes originate on non-NUMA systems,
and we've typically done very just what is necessary to make them

Software Vendors tend to be very reluctant to do things for a single
architecture unless there are clear wins.
Thanks,
Robin

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Paul Jackson wrote:
Ray wrote:
[Thus the disclaimer in
the overview note that we haven't figured out all the interaction with
the memory policy stuff yet.]

Does the same disclaimer apply to cpusets?
Unless it causes some undue pain, I would think that page migration
should _not_ violate a task's cpuset.  I guess this means that a typical
batch manager would move a task to its new cpuset on the new nodes, or
move the cpuset containing some tasks to their new nodes, before asking
the page migrator to drag along the currently allocated pages from the
old location.
No, I think we understand the interaction between manual page migration
and cpusets.  We've tried to keep the discussion here disjoint from cpusets
for tactical reasons -- we didn't want to tie acceptance of the manual
page migration code to acceptance of cpusets.
The exact ordering of when a task is moved to a new cpuset and when the
migration occurs doesn't matter, AFAIK, if we accept the notion that
a migrated task is in suspended state until after everything associated
with it (including the new cpuset definition) is done.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Paul Jackson wrote:
Ray wrote:
[Thus the disclaimer in
the overview note that we have figured all the interaction with
memory policy stuff yet.]

Does the same disclaimer apply to cpusets?
Unless it causes some undo pain, I would think that page migration
should _not_ violate a tasks cpuset.  I guess this means that a typical
batch manager would move a task to its new cpuset on the new nodes, or
move the cpuset containing some tasks to their new nodes, before asking
the page migrator to drag along the currently allocated pages from the
old location.
No, I think we understand the interaction between manual page migration
and cpusets.  We've tried to keep the discussion here disjoint from cpusets
for tactical reasons -- we didn't want to tie acceptance of the manual
page migration code to acceptance of cpusets.
The exact ordering of when a task is moved to a new cpuset and when the
migration occurs doesn't matter, AFAIK, if we accept the notion that
a migrated task is in suspended state until after everything associated
with it (including the new cpuset definition) is done.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Robin Holt wrote:
On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
which is what you are asking for, I think.  The library's job
(in addition to suspending all of the processes in the list for
the duration of the migration operation, plus do some other things
that are specific to sn2 hardware) would be to examine the

You probably want the batch scheduler to do the suspend/resume as it
may be parking part of the job on nodes that have memory but running
processes of a different job while moving a job out of the way for a
big-mem app that wants to run on one of this jobs nodes.
That works as well, and if we keep the majority of the work on
deciding who to migrate where and what to do when in a user space
library rather than in the kernel, then we have a lot more flexibility
in, for example who suspends/resumes the jobs to be migrated.

do memory placement by first touch, during initialization.  This is,
in part, because most of our codes originate on non-NUMA systems,
and we've typically done very just what is necessary to make them

Software Vendors tend to be very reluctant to do things for a single
architecture unless there are clear wins.
Thanks,
Robin

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
(1)  You really don't want to migrate the code pages of shared libraries
that are mapped into the process address space.  This causes a
useless shuffling of pages which really doesn't help system
performance.  On the other hand, if a shared library is some
private thing that is only used by the processes being migrated,
then you should move that.

I think the better solution for this would be to finally integrate Steve L.'s 
file attribute code (and find some solution to make it persistent,
e.g. using xattrs with a new inode flag) and then lock the shared 
libraries to their policy using a new attribute flag.

I really don't see how that is relevant to the current discussion, which,
as AFAIK, is that the kernel interface should be migrate an entire process
versus what I have proposed.  What we are trying to avoid here for shared
libraries is two things:  (1) don't migrate them needlessly, and (2) don't
even make the migration request if we know that the pages shouldn't be
migrated.
Using Steve Longerbeam's approach avoids (1).  But you will still scan the
pte's of the proceeses to be migrated (if you go with a migrate the
entire process approach) and try to migrate them, only to find out that
they are pinned in place.  How is that a good thing?
A much simpler way to do this would be to add a list of libraries that
you don't want to be migrated to the migration library that I have
proposed to be the interface between the batch scheduler and the kernel.
Then when the library scans the /proc/pid/maps stuff, it can exlcude
those libraries from migration.  Furthermore, no migration requests will
even be initiated for those parts of the address space.
Of course, this means maintaining a library list in the migration
library.  We may eventually decide to do that.  For now, we're following
up on the reference count approach I outlined before.

(2)  You really only want to migrate pages once.  If a file is mapped
into several of the pid's that are being migrated, then you want
to figure this out and issue one call to have it moved wrt one of
the pid's.
(The page migration code from the memory hotplug patch will handle
updating the pte's of the other processs (thank goodness for
rmap...))

I don't get this. Surely the migration code will check if a page
is already in the target node, and when that is the case do nothing.
How could this double migration happen? 
Not so much a double migration, but a double request for migration.
(This is not a correctness, but a performance issue, once again.)
Consider the case of a 300 GB file mapped into 256 pid's.  One doesn't
want each pid to try to migrate the file pages.  Granted, all after the
first one will find the data already migrated, but if you issue a
migration request for each address space, the others won't know that
the page has been migrated until they have found the page and looked
up its current node.  That means doing a find_get_page() for each page
in the mapped file in all 256 address spaces, and 255 of those address
spaces will find the page has already been migrated.  How is that
useful?  I'd much rather migrate it once from the perspective of
a single address space, and then skip the scanning for pages to
migrate in all of the other address spaces.

(3)  In the case where a particular file is mapped into different
processes at different file offsets (and we are migrating both
of the processes), one has to examine the file offsets to figure
out if the mappings overlap or not. If they overlap, then you've
got to issue two calls, each of which describes a non-overlapping
region; both calls taken together would cover the entire range
of pages mapped to the file.  Similarly if the ranges do not
overlap.

That sounds like a quite obscure corner case which I'm not sure
is worth all the complexity.
-Andi

So what is your solution when this happens?  Make the job non-migratable?
Yes, it may be an obscure case in your view but we've got to handle all of
those cases to make a robust facility that can be used in a production 
environment.

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
[Sorry, didn't answer to everything in your mail the first time. 
See previous mail for beginning]

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
migrating, and figure out from that what portions of which pid's
address spaces need to be migrated so that we satisfy the constraints
given above.  I admit that this may be viewed as ugly, but I really
can't figure out a better solution than this without shuffling a
ton of ugly code into the kernel.

I like the concept of marking stuff that shouldn't be migrated
externally (using NUMA policy) better. 

I really don't have an objection to that for the case of the shared
libraries in, for example, /lib and /usr/lib.  I just worry about making
sure that all of the libraries have so been marked.  I can do this
in a much simpler way by just adding a list of "do not migrate" stuff
to the migration library rather than requiring Steve Longerbeam's
API.

One issue that hasn't been addressed is the following:  given a
particular entry in /proc/pid/maps, how does one figure out whether
that entry is mapped into some other process in the system, one
that is not in the set of processes to be migrated?   One could

[...]
Marking things externally would take care of that.
So the default would be that if the file is not mapped as not-migratable,
then the file would be migratable, is that the idea?

If we did this, we still have to have the page migration system call
to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
pages were placed by first touch and for which there used to not be
a memory policy.  As discussed in a previous note, we are not in a

You can handle those with mbind(..., MPOL_F_STRICT); 
(once it is hooked up to page migration) 
Making memory migration a subset of page migration is not a general
solution.  It only works for programs that are using memory policy
to control placement.   As I've tried to point out multiple times
before, most programs that I am aware of use placement based on
first-touch.  When we migrate such programs, we have to respect
the placement decisions that the program has implicitly made in
this way.
Requiring memory migration to be a subset of the NUMA API is a
non-starter for this reason.   We have to follow the approach
of doing the correct migration, followed by fixing up the NUMA
policy to match the new reality.  (Perhaps we can do this as
part of memory migration.)
Until ALL programs use the NUMA mempolicy for placement
decisions, we cannot support page migration under the NUMA
API.
I don't understand why this is not clear to you.  Are you
assuming that you can manufacture a NUMA API for the new
location of the job that correctly represents the placement
information and toplogy of the job on the old set of nodes?
Just mmap the tmpfs/shm/hugetlb file in an external program and apply
the policy. That is what numactl supports today too for shm
files like this.
It should work later.
Wait.  As near as I can tell you
-Andi



Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Ray Bryant
Dave Hansen wrote:
On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
What is the fundamental opposition to an array of from-to node mappings?
They are not that difficult to follow.  They make the expensive traversal
of ptes the single pass operation.  The time to scan the list of from nodes
to locate the node this page belongs to is relatively quick when compared
to the time to scan ptes and will result in probably no cache thrashing
like the long traversal of all ptes in the system required for multiple
system calls.  I can not see the node array as anything but the right way
when compared to multiple system calls.  What am I missing?

I don't really have any fundamental opposition.  I'm just trying to make
sure that there's not a simpler (better) way of doing it.  You've
obviously thought about it a lot more than I have, and I'm trying to
understand your process.
As far as the execution speed with a simpler system call.  Yes, it will
likely be slower.  However, I'm not sure that the increase in scan time
is all that significant compared to the migration code (it's pretty
slow).
-- Dave

I'm worried about doing all of those find_get_page() things over and over
when the mapped file we are migrating is large.  I suppose one can argue
that that is never going to be the case (e. g. no one in their right mind
would migrate a job with a 300 GB mapped file).  So we are back to the
overlapping set of nodes issue.  Let me look into this some more.
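To put a rough number on that worry (the page size is an assumption here; 16 KB is a common ia64 configuration):

```c
#include <assert.h>

/* Scanning a mapped file once per address space costs
 * (file size / page size) find_get_page() calls per mapper. */
unsigned long long redundant_lookups(unsigned long long file_bytes,
				     unsigned long page_size,
				     unsigned int mappers)
{
	return (file_bytes / page_size) * mappers;
}

/* 300 GB file, 16 KB pages, 256 mappers: about 5 * 10^9 lookups,
 * of which all but the first pass's ~19.7M find the page already
 * migrated. */
```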


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
 everything. The main reasons for that is that I don't think external
processes should mess with virtual addresses of another process.
It just feels unclean and has many drawbacks (parsing /proc/*/maps
needs complicated user code, racy, locking difficult).  

Yes, but remember, we are coming from an assumption that migrated processes
are suspended.  This may be myopic, but we CAN make this work with  the
constraints we have in place.  Now if you are arguing for a more general
migration facility that doesn't require the processes to be blocked, well
then I agree with you.  The /proc/*/maps approach doesn't work.
So lets go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting
scanning the pte's and the page cache for anonymous memory and mapped files,
respectively.  For each VMA we have to make a migrate/don't migrate decision.
We also have to accept that the set of originating and destination nodes
have to be distinct.  Otherwise, there is no good way to tell whether or not
a particular page has been migrated.  So we have to make that restriction.
Without xattrs, how do we make the migrate/non-migrate decision?  Where
do we put the data?  Well, we can have some file in the file system that
has file names in it and read that file into the kernel and convert each
file to a device and inode pair.  We can then match that against each of
the VMAs and choose not to migrate any VMA that maps a file on the list.
For each anonymous VMA we just migrate the pages.
Sounds like it is doable, but I have this requirement from my IRIX
buddies that I support overlapping sets of nodes in the to and from
node sets.  I guess we have to go back and examine that in more detail.
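A small sketch of why disjoint from/to sets matter, assuming the node_map[from] = to convention used in the patches below (the array size and helper names are illustrative only):

```c
#include <assert.h>

#define NODES 8

/* node_map[n] = target node for pages found on node n, or -1. */
void build_node_map(short *map, const short *from, const short *to, int count)
{
	int i;

	for (i = 0; i < NODES; i++)
		map[i] = -1;		/* pages on node i are left alone */
	for (i = 0; i < count; i++)
		map[from[i]] = to[i];
}

/* The only test available while scanning: is this page's node a "from"
 * node?  If the from and to sets overlap, a page that was already moved
 * sits on a node that is itself a source, and gets picked up again. */
int needs_migration(const short *map, int nid)
{
	return map[nid] >= 0;
}
```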
In kernel space handling full VMs is much easier and safer due to better 
locking facilities.

In user space only the process itself really can handle its own virtual 
addresses well, and if it does that it can use NUMA API directly anyways.

You argued that it may be costly to walk everything, but I don't see this
as a big problem - first walking mempolicies is not very costly and then
fork() and exit() do exactly this already. 
I'm willing to accept that walking the page table (via follow_page()) or
the file (via find_get_page()) is not that expensive, at least until it
is shown otherwise.  We do tend to have big address spaces and lots of
processors associated with them, but I'm willing to accept that we won't
migrate a huge process around very often.  (Or at least not often enough
for it to be interesting.)  However, if this does turn out to be a performance
problem for us, we will have to come back and re-examine this stuff.
The main missing piece for this would be a way to make policies for
files persistent. One way would be to use xattrs like selinux, but
that may be costly (not sure we want to read xattrs all the time
when reading a file). 

I'm not sure I want to tie implementation of the page migration
API to getting xattrs into all of the file systems in Linux
(although I suppose we could live with it if we got them into XFS).
Is this really the way to go here?  This seems like it would
decrease the likelihood of getting the page migration code
accepted by a significant amount.  It introduces a new set of
people (the file system maintainers) whom I have to convince to
make changes.  I just don't see that as being a fruitful exercise.
Instead I would propose a magic file to be read at boot time as discussed
above -- that file would contain the names of all files not to be
migrated.  The kicker comes here in that what do we do if that set
needs to be changed during the course of a single boot?  (i. e. someone
adds a new shared library, for example.)  I suppose we could have a
sysctl() that would cause that file to be re-read.  This would be
a short term solution until xattrs are accepted and/or until Steve
Longerbeam's patch is accepted.  Would that be an acceptable short
term kludge?
A hackish way to do this that already 
works would be to do a mlock on one page of the file to keep
the inode pinned. E.g. the batch manager could do this. That's 
not very clean, but would probably work. 

-Andi



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-14 Thread Ray Bryant
Andi Kleen wrote:
Ray Bryant <[EMAIL PROTECTED]> writes:
set of pages associated with a particular process need to be moved.
The kernel interface that we are proposing is the following:
page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

[Only commenting on the interface, haven't read your patches at all]
This is basically mbind() with MPOL_F_STRICT, except that it has a pid 
argument. I assume that's for the benefit of your batch scheduler.

But it's not clear to me how and why the batch scheduler should know about
virtual addresses of different processes anyways. Walking
/proc/pid/maps? That's all inherently racy when the process is doing
mmap in parallel. The only way I can think of to do this would be to
check for changes in maps after a full move and loop, but then you risk
livelock.
And you cannot also just specify va_start=0, va_end=~0UL because that
would make the node arrays grow infinitely.
Also is there a good use case why the batch scheduler should only
move individual areas in a process around, not the full process?
The batch scheduler interface will be to move entire jobs (groups of
processes) around from one set of nodes to another.  But that interface
doesn't work at the kernel level.  The problem is that one just can't
ask the kernel to move the entire address space of a process for a number
of reasons:
(1)  You really don't want to migrate the code pages of shared libraries
 that are mapped into the process address space.  This causes a
 useless shuffling of pages which really doesn't help system
 performance.  On the other hand, if a shared library is some
 private thing that is only used by the processes being migrated,
 then you should move that.
(2)  You really only want to migrate pages once.  If a file is mapped
 into several of the pid's that are being migrated, then you want
 to figure this out and issue one call to have it moved wrt one of
 the pid's.
 (The page migration code from the memory hotplug patch will handle
 updating the pte's of the other processs (thank goodness for
 rmap...))
(3)  In the case where a particular file is mapped into different
 processes at different file offsets (and we are migrating both
 of the processes), one has to examine the file offsets to figure
 out if the mappings overlap or not. If they overlap, then you've
 got to issue two calls, each of which describes a non-overlapping
 region; both calls taken together would cover the entire range
 of pages mapped to the file.  Similarly if the ranges do not
 overlap.
Figuring all of this out seems to me to be way too complicated to
want to stick into the kernel.  Hence we proposed the kernel interface
as discussed in the overview note.  This interface would be used by
a user space library, whose batch scheduler interface would look
something like this:
migrate_processes(pid_count, pid_list, node_count, old_nodes, new_nodes);
which is what you are asking for, I think.  The library's job
(in addition to suspending all of the processes in the list for
the duration of the migration operation, plus do some other things
that are specific to sn2 hardware) would be to examine the
/proc/pid/maps entries for each pid that we are
migrating, and figure out from that what portions of which pid's
address spaces need to be migrated so that we satisfy the constraints
given above.  I admit that this may be viewed as ugly, but I really
can't figure out a better solution than this without shuffling a
ton of ugly code into the kernel.
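The bookkeeping in point (3) is essentially interval merging over file page offsets. A sketch, with names invented for illustration (ranges must be pre-sorted by start offset):

```c
#include <assert.h>
#include <stddef.h>

struct range {
	unsigned long start, end;	/* [start, end) in file pages */
};

/*
 * Coalesce sorted, possibly-overlapping ranges in place and return the
 * number that remain.  Each surviving range becomes one migration call,
 * so every page of the file is named exactly once.
 */
size_t merge_ranges(struct range *r, size_t n)
{
	size_t out = 0, i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		if (r[i].start <= r[out].end) {	/* overlaps or abuts: extend */
			if (r[i].end > r[out].end)
				r[out].end = r[i].end;
		} else {
			r[++out] = r[i];	/* disjoint: new region */
		}
	}
	return out + 1;
}
```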
One issue that hasn't been addressed is the following:  given a
particular entry in /proc/pid/maps, how does one figure out whether
that entry is mapped into some other process in the system, one
that is not in the set of processes to be migrated?   One could
scan ALL of the /proc/pid/maps entries, I suppose, but that is
a pretty expensive task on a 512 processor NUMA box.  The approach
I would like to follow would be to add a reference count to
/proc/pid/maps.  The reference count would tell how many VMAs
point at this particular /proc/pid/map entry.  Using this, if
all of the processes in the set to be migrated account for all
of the references, then this map entry represents an address
range that should be migrated.  If there are other references
then you shouldn't migrate the address range.
Note also that the data so reported represents a performance
optimization, not a correctness issue.  If some of the /proc/pid/map
info changes after we have read it and made our decision as
to what address ranges in which PIDs to migrate, the result
may be suboptimal performance.  But in most cases that we have
been able to think of where this could happen, it is not that
big of a deal.  (The typical example is library private.so
is used by an instance of batch job J1.  We decide to migrate
J1.  We look at the /proc/pid/maps info and find out that
only processes in J1 reference private.so.  So we decide to migrate

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-14 Thread Ray Bryant
Andi Kleen wrote:
But how do you use mbind() to change the memory placement for an anonymous
private mapping used by a vendor provided executable with mbind()?

For that you use set_mempolicy.
-Andi
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [EMAIL PROTECTED]  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Andi,
If all processes are guaranteed to use the NUMA api for memory placement,
then AFAIK one could, in principle, embed the migration of pages into
the NUMA api as you propose.  The problem is that AFAIK most programs
that we run are not using the NUMA api.  Instead, they are using first-touch
with the knowledge that such pages will be allocated on the node where they
are first referenced.
Since we have to build a migration facility that will migrate jobs that
use both the NUMA API and the first-touch approach, it seems to me the
only plausible solution is to move the pages via a migration facility
and then if there are NUMA API control structures found associated with
the moved pages to update them to represent the new reality.  Whether
this happens as an automatic side effect of the migration call or it
happens by a issuing a new set_mempolicy() is not clear to me.  I would
prefer to just issue a new set_mempolicy(), but somehow the migration
code will have to figure out where this call needs to be executed (i. e.
which pages have an associated NUMA policy).  [Thus the disclaimer in
the overview note that we haven't figured out all of the interaction with
the memory policy stuff yet.]





[RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-11 Thread Ray Bryant
+   list_add(&page->lru, &page_list);
+   } else 
+   BUG();
+   spin_unlock_irq(&zone->lru_lock);
+   }
+   } 
+   }
+
+   ret = migrate_vma_common(&page_list, node_map, count);
+
+   return ret;
+
+}
+
+static int
+migrate_anon_private_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, size_t va_start,
+ size_t va_end, short *node_map)
+{
+   struct page *page;
+   struct zone *zone;
+   unsigned long vaddr;
+   int count = 0, nid, ret;
+   LIST_HEAD(page_list);
+
+   va_start = va_start & PAGE_MASK;
+   va_end   = va_end   & PAGE_MASK;
+
+   for (vaddr=va_start; vaddr<=va_end; vaddr += PAGE_SIZE) {
+   spin_lock(&mm->page_table_lock);
+   page = follow_page(mm, vaddr, 0);
+   spin_unlock(&mm->page_table_lock);
+   /* 
+* follow_page has been observed to return pages with zero 
+* mapcount and NULL mapping.  Skip those pages as well
+*/
+   if (page && page_mapcount(page) && page->mapping) {
+   nid = page_to_nid(page);
+   if (node_map[nid] > 0) {
+   zone = page_zone(page);
+   spin_lock_irq(&zone->lru_lock);
+   if (PageLRU(page) &&
+   __steal_page_from_lru(zone, page)) {
+   count++;
+   list_add(&page->lru, &page_list);
+   } else
+   BUG();
+   spin_unlock_irq(&zone->lru_lock);
+   }
+   }
+   }
+
+   ret = migrate_vma_common(&page_list, node_map, count);
+
+   return ret;
+}
+
+void lru_add_drain_per_cpu(void *info) {
+   lru_add_drain();
+}
+
+asmlinkage long
+sys_page_migrate(const pid_t pid, size_t va_start, size_t va_end,
+   const int count, caddr_t old_nodes, caddr_t new_nodes)
+{
+   int i, ret = 0;
+   short *tmp_old_nodes;
+   short *tmp_new_nodes;
+   short *node_map;
+   struct task_struct *task;
+   struct mm_struct *mm = 0;
+   size_t size = count*sizeof(short);
+   struct vm_area_struct *vma, *vma2;
+
+
+   tmp_old_nodes = (short *) kmalloc(size, GFP_KERNEL);
+   tmp_new_nodes = (short *) kmalloc(size, GFP_KERNEL);
+   node_map = (short *) kmalloc(MAX_NUMNODES*sizeof(short), GFP_KERNEL);
+
+   if (!tmp_old_nodes || !tmp_new_nodes || !node_map) {
+   ret = -ENOMEM;
+   goto out_nodec;
+   }
+
+   if (copy_from_user(tmp_old_nodes, old_nodes, size) || 
+   copy_from_user(tmp_new_nodes, new_nodes, size)) {
+   ret = -EFAULT;
+   goto out_nodec;
+   }
+
+   read_lock(&tasklist_lock);
+   task = find_task_by_pid(pid);
+   if (task) {
+   task_lock(task);
+   mm = task->mm;
+   if (mm)
+   atomic_inc(&mm->mm_users);
+   task_unlock(task);
+   } else {
+   ret = -ESRCH;
+   goto out_nodec;
+   }
+   read_unlock(&tasklist_lock);
+   if (!mm) {
+   ret = -EINVAL;
+   goto out_nodec;
+   }
+
+   /* 
+* for now, we require both the start and end addresses to
+* be mapped by the same vma.
+*/
+   vma = find_vma(mm, va_start);
+   vma2 = find_vma(mm, va_end);
+   if (!vma || !vma2 || (vma != vma2)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   /* set up the node_map array */
+   for (i = 0; i < MAX_NUMNODES; i++)
+   node_map[i] = -1;
+   for (i = 0; i < count; i++)
+   node_map[tmp_old_nodes[i]] = tmp_new_nodes[i];
+
+   if (vma->vm_ops)
+   ret = migrate_mapped_file_vma(task, mm, vma, va_start, va_end,
+   node_map);
+   else
+   ret = migrate_anon_private_vma(task, mm, vma, va_start, va_end,
+   node_map);
+
+out:
+   atomic_dec(&mm->mm_users);
+
+out_nodec:
+   if (tmp_old_nodes)
+   kfree(tmp_old_nodes);
+   if (tmp_new_nodes)
+   kfree(tmp_new_nodes);
+   if (node_map)
+   kfree(node_map);
+
+   return ret;
+
+}
+
 EXPORT_SYMBOL(generic_migrate_page);
 EXPORT_SYMBOL(migrate_page_common);
 EXPORT_SYMBOL(migrate_page_buffer);

-- 
Best Regards,
Ray
---
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


[RFC 2.6.11-rc2-mm2 4/7] mm: manual page migration -- cleanup 4

2005-02-11 Thread Ray Bryant
Add some extern declarations to include/linux/mmigrate.h to
eliminate some "implicitly declared" warnings.

Signed-off-by:Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h  2005-02-11 
11:23:46.0 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h   2005-02-11 
11:50:27.0 -0800
@@ -17,6 +17,9 @@ extern int page_migratable(struct page *
struct list_head *);
 extern struct page * migrate_onepage(struct page *, int nodeid);
 extern int try_to_migrate_pages(struct list_head *);
+extern int migration_duplicate(swp_entry_t);
+extern struct page * lookup_migration_cache(int);
+extern int migration_remove_reference(struct page *, int);
 
 #else
 static inline int generic_migrate_page(struct page *page, struct page *newpage,



[RFC 2.6.11-rc2-mm2 6/7] mm: manual page migration -- add node_map arg to try_to_migrate_pages()

2005-02-11 Thread Ray Bryant
To migrate pages from one node to another, we need to tell
try_to_migrate_pages() which nodes we want to migrate off
of and where to migrate the pages found on each such node.

We do this by adding the node_map array argument to 
try_to_migrate_pages(); node_map[N] gives the target
node to migrate pages to from node N.

This patch depends on a previous patch I submitted that
adds a node argument to migrate_onepage(); this patch
is currently part of the Memory HOTPLUG page migration
patch.

node_migrate_onepage() is introduced to handle the case
where node_map is NULL (i. e. caller doesn't care where
we migrate the page, just migrate it out of here) or
the system is not a NUMA system.

Signed-off-by:Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h  2005-02-11 
11:50:27.0 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h   2005-02-11 
11:52:50.0 -0800
@@ -16,11 +16,29 @@ extern int migrate_page_buffer(struct pa
 extern int page_migratable(struct page *, struct page *, int,
struct list_head *);
 extern struct page * migrate_onepage(struct page *, int nodeid);
-extern int try_to_migrate_pages(struct list_head *);
+extern int try_to_migrate_pages(struct list_head *, short *);
 extern int migration_duplicate(swp_entry_t);
 extern struct page * lookup_migration_cache(int);
 extern int migration_remove_reference(struct page *, int);
 
+extern int try_to_migrate_pages(struct list_head *, short *node_map);
+
+#ifdef CONFIG_NUMA
+static inline struct page *node_migrate_onepage(struct page *page, short *node_map)
+{
+   if (node_map)
+   return migrate_onepage(page, node_map[page_to_nid(page)]);
+   else
+   return migrate_onepage(page, MIGRATE_NODE_ANY); 
+   
+}
+#else
+static inline struct page *node_migrate_onepage(struct page *page, short *node_map)
+{
+   return migrate_onepage(page, MIGRATE_NODE_ANY); 
+}
+#endif
+
 #else
 static inline int generic_migrate_page(struct page *page, struct page *newpage,
int (*fn)(struct page *, struct page *))
Index: linux-2.6.11-rc2-mm2/mm/mmigrate.c
===
--- linux-2.6.11-rc2-mm2.orig/mm/mmigrate.c 2005-02-11 11:50:40.0 
-0800
+++ linux-2.6.11-rc2-mm2/mm/mmigrate.c  2005-02-11 11:51:04.0 -0800
@@ -502,9 +502,11 @@ out_unlock:
 /*
  * This is the main entry point to migrate pages in a specific region.
  * If a page is inactive, the page may be just released instead of
- * migration.
+ * migration.  node_map is supplied in those cases (on NUMA systems)
+ * where the caller wishes to specify to which nodes the pages are
+ * migrated.  If node_map is null, the target node is MIGRATE_NODE_ANY.
  */
-int try_to_migrate_pages(struct list_head *page_list)
+int try_to_migrate_pages(struct list_head *page_list, short *node_map)
 {
struct page *page, *page2, *newpage;
LIST_HEAD(pass1_list);
@@ -542,7 +544,7 @@ int try_to_migrate_pages(struct list_hea
	list_for_each_entry_safe(page, page2, &pass1_list, lru) {
		list_del(&page->lru);
if (PageLocked(page) || PageWriteback(page) ||
-   IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+   IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
if (page_count(page) == 1) {
/* the page is already unused */
putback_page_to_lru(page_zone(page), page);
@@ -560,7 +562,7 @@ int try_to_migrate_pages(struct list_hea
 */
	list_for_each_entry_safe(page, page2, &pass2_list, lru) {
		list_del(&page->lru);
-   if (IS_ERR(newpage = migrate_onepage(page, MIGRATE_NODE_ANY))) {
+   if (IS_ERR(newpage = node_migrate_onepage(page, node_map))) {
if (page_count(page) == 1) {
/* the page is already unused */
putback_page_to_lru(page_zone(page), page);

-- 
Best Regards,
Ray
-------
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 5/7] mm: manual page migration -- cleanup 5

2005-02-11 Thread Ray Bryant
Fix up a switch statement so gcc doesn't complain about it.

Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

Index: linux/mm/mmigrate.c
===
--- linux.orig/mm/mmigrate.c2005-01-30 11:13:58.0 -0800
+++ linux/mm/mmigrate.c 2005-01-30 11:19:33.0 -0800
@@ -319,17 +319,17 @@ generic_migrate_page(struct page *page, 
/* Wait for all operations against the page to finish. */
ret = migrate_fn(page, newpage, );
switch (ret) {
-   default:
-   /* The page is busy. Try it later. */
-   goto out_busy;
case -ENOENT:
/* The file the page belongs to has been truncated. */
page_cache_get(page);
page_cache_release(newpage);
newpage->mapping = NULL;
-   /* fall thru */
+   break;
case 0:
-   /* fall thru */
+   break;
+   default:
+   /* The page is busy. Try it later. */
+   goto out_busy;
}
 
arch_migrate_page(page, newpage);

-- 
Best Regards,
Ray
-------
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 1/7] mm: manual page migration -- cleanup 1

2005-02-11 Thread Ray Bryant
This patch removes some remaining Memory HOTPLUG specific code
from the page migration patch.  I have sent Dave Hansen the -R
version of this patch so that this code can be added back 
later at the start of the Memory HOTPLUG patches themselves.

In particular, this patch removes VM_IMMOVABLE and MAP_IMMOVABLE.

Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.10-mm1-page-migration/kernel/fork.c
===
--- linux-2.6.10-mm1-page-migration.orig/kernel/fork.c  2005-01-10 
08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/kernel/fork.c   2005-01-10 
09:14:03.0 -0800
@@ -211,7 +211,7 @@ static inline int dup_mmap(struct mm_str
if (IS_ERR(pol))
goto fail_nomem_policy;
vma_set_policy(tmp, pol);
-   tmp->vm_flags &= ~(VM_LOCKED|VM_IMMOVABLE);
+   tmp->vm_flags &= ~(VM_LOCKED);
tmp->vm_mm = mm;
tmp->vm_next = NULL;
anon_vma_link(tmp);
Index: linux-2.6.10-mm1-page-migration/include/linux/mm.h
===
--- linux-2.6.10-mm1-page-migration.orig/include/linux/mm.h 2005-01-10 
08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/include/linux/mm.h  2005-01-10 
09:14:04.0 -0800
@@ -164,7 +164,6 @@ extern unsigned int kobjsize(const void 
 #define VM_ACCOUNT 0x0010  /* Is a VM accounted object */
 #define VM_HUGETLB 0x0040  /* Huge TLB Page VM */
 #define VM_NONLINEAR   0x0080  /* Is non-linear (remap_file_pages) */
-#define VM_IMMOVABLE   0x0100  /* Don't place in hot removable area */
 
 #ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
 #define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
Index: linux-2.6.10-mm1-page-migration/include/linux/mman.h
===
--- linux-2.6.10-mm1-page-migration.orig/include/linux/mman.h   2005-01-10 
08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/include/linux/mman.h2005-01-10 
10:05:54.0 -0800
@@ -61,8 +61,7 @@ calc_vm_flag_bits(unsigned long flags)
return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
   _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
   _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
-  _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED) |
-  _calc_vm_trans(flags, MAP_IMMOVABLE,  VM_IMMOVABLE );
+  _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED);
 }
 
 #endif /* _LINUX_MMAN_H */
Index: linux-2.6.10-mm1-page-migration/arch/i386/kernel/sys_i386.c
===
--- linux-2.6.10-mm1-page-migration.orig/arch/i386/kernel/sys_i386.c
2005-01-10 08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/arch/i386/kernel/sys_i386.c 2005-01-10 
09:14:04.0 -0800
@@ -70,7 +70,7 @@ asmlinkage long sys_mmap2(unsigned long 
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
 {
-   return do_mmap2(addr, len, prot, flags & ~MAP_IMMOVABLE, fd, pgoff);
+   return do_mmap2(addr, len, prot, flags, fd, pgoff);
 }
 
 /*
@@ -101,7 +101,7 @@ asmlinkage int old_mmap(struct mmap_arg_
if (a.offset & ~PAGE_MASK)
goto out;
 
-   err = do_mmap2(a.addr, a.len, a.prot, a.flags & ~MAP_IMMOVABLE,
+   err = do_mmap2(a.addr, a.len, a.prot, a.flags,
a.fd, a.offset >> PAGE_SHIFT);
 out:
return err;
Index: linux-2.6.10-mm1-page-migration/include/asm-ppc64/mman.h
===
--- linux-2.6.10-mm1-page-migration.orig/include/asm-ppc64/mman.h   
2005-01-10 08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/include/asm-ppc64/mman.h2005-01-10 
09:14:04.0 -0800
@@ -38,7 +38,6 @@
 
 #define MAP_POPULATE   0x8000  /* populate (prefault) pagetables */
 #define MAP_NONBLOCK   0x1 /* do not block on IO */
-#define MAP_IMMOVABLE  0x2
 
 #define MADV_NORMAL0x0 /* default page-in behavior */
 #define MADV_RANDOM0x1 /* page-in minimum required */
Index: linux-2.6.10-mm1-page-migration/include/asm-i386/mman.h
===
--- linux-2.6.10-mm1-page-migration.orig/include/asm-i386/mman.h
2005-01-10 08:46:51.0 -0800
+++ linux-2.6.10-mm1-page-migration/include/asm-i386/mman.h 2005-01-10 
09:14:04.0 -0800
@@ -22,7 +22,6 @@
 #define MAP_NORESERVE  0x4000  /* don't check for reservations */
 #define MAP_POPULATE   0x8000  /* populate (prefault) pagetables */
 #define MAP_NONBLOCK   0x1 /* do not block on IO */
-#define MAP_IMMOVABLE  0x2

[RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-11 Thread Ray Bryant
  I welcome the
opportunity for others to examine this patch and provide suggestions,
point out possible improvements, help me to eliminate bugs, or to make
suggestions about improved coding style or algorithms.  I will, however,
be away from the office for the next week, so will likely not be able
to respond until the week of Feb 21st.

There are several things that this patch does not do, however, and
we hope to resolve some of these issues in subsequent versions of the
patchset:

(1)  There is no security or authentication checking.  Any process
 can migrate any pages of any other process.  This needs to
 be addressed.

(2)  We have not figured out yet what to do about the interaction
 between page migration and Andi Kleen's memory policy infrastructure.
 Presumably the memory policy data structures will have to be
 updated either as part of the system call above or through
 a new (or existing) system call.

(3)  As previously mentioned, we have omitted a glaring detail --
 how to determine what pages to migrate.  I have an algorithm
 and code to solve this problem, but it is still a little
 buggy and I wanted to get the ball rolling with what already
 existed and seems to work reasonably well.

(4)  It is likely that we will add a new operation to the vm_ops
 structure -- the "page_migration" routine.  The reason for
 this is to provide a way for each type of memory object to provide
 a way that its pages can be migrated.  We have not included
 code for this in the current patch.

(5)  There are still some small bugs relating to what happens to
 non-present pages.  These issues should not hinder evaluation
 or discussion of the overall approach, however.

Finally, it is my goal to include the migration cache patch in 
the final version of this code, however, at the moment there are
some issues with this patch that are still being worked out, so
it has not been included in this version of the patch.

So, with all of the disclaimers and other details out of the
way, we should go on, in subsequent notes, to discuss each of the
7 patches.  Remember that only the last 2 are really significant;
the others are mostly cleanup of warnings and the like.

-- 
Best Regards,
Ray
---
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 3/7] mm: manual page migration -- cleanup 3

2005-02-11 Thread Ray Bryant
Fix a trivial error in include/linux/mmigrate.h

Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h  2005-02-11 
10:08:10.0 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h   2005-02-11 
11:22:34.0 -0800
@@ -1,5 +1,5 @@
-#ifndef _LINUX_MEMHOTPLUG_H
-#define _LINUX_MEMHOTPLUG_H
+#ifndef _LINUX_MMIGRATE_H
+#define _LINUX_MMIGRATE_H
 
 #include <linux/config.h>
 #include <linux/mm.h>
@@ -36,4 +36,4 @@ extern void arch_migrate_page(struct pag
 static inline void arch_migrate_page(struct page *page, struct page *newpage) 
{}
 #endif
 
-#endif /* _LINUX_MEMHOTPLUG_H */
+#endif /* _LINUX_MMIGRATE_H */

-- 
Best Regards,
Ray
-------
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 2/7] mm: manual page migration -- cleanup 2

2005-02-11 Thread Ray Bryant
This patch removes some remaining Memory HOTPLUG specific code
from the page migration patch.  I have sent Dave Hansen the -R
version of this patch so that this code can be added back
later at the start of the Memory HOTPLUG patches themselves.

In particular, this patch removes some #ifdef CONFIG_MEMORY_HOTPLUG
code from the page migration patch.

Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2-mm2/mm/vmalloc.c
===
--- linux-2.6.11-rc2-mm2.orig/mm/vmalloc.c  2005-02-11 10:08:10.0 
-0800
+++ linux-2.6.11-rc2-mm2/mm/vmalloc.c   2005-02-11 10:35:47.0 -0800
@@ -523,16 +523,7 @@ EXPORT_SYMBOL(__vmalloc);
  */
 void *vmalloc(unsigned long size)
 {
-#ifdef CONFIG_MEMORY_HOTPLUG
-   /*
-* : This is temprary code, which should be replaced with proper one
-*   after the scheme to specify hot removable region has defined.
-*  25/Sep/2004 -- taka
-*/
-   return __vmalloc(size, GFP_KERNEL, PAGE_KERNEL);
-#else
return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL);
-#endif
 }
 
 EXPORT_SYMBOL(vmalloc);
Index: linux-2.6.11-rc2-mm2/mm/shmem.c
===
--- linux-2.6.11-rc2-mm2.orig/mm/shmem.c2005-02-11 10:08:10.0 
-0800
+++ linux-2.6.11-rc2-mm2/mm/shmem.c 2005-02-11 10:35:47.0 -0800
@@ -93,16 +93,7 @@ static inline struct page *shmem_dir_all
 * BLOCKS_PER_PAGE on indirect pages, assume PAGE_CACHE_SIZE:
 * might be reconsidered if it ever diverges from PAGE_SIZE.
 */
-#ifdef CONFIG_MEMORY_HOTPLUG
-   /*
-* : This is temprary code, which should be replaced with proper one
-*   after the scheme to specify hot removable region has defined.
-*  25/Sep/2004 -- taka
-*/
-   return alloc_pages(gfp_mask & ~__GFP_HIGHMEM, 
PAGE_CACHE_SHIFT-PAGE_SHIFT);
-#else
return alloc_pages(gfp_mask, PAGE_CACHE_SHIFT-PAGE_SHIFT);
-#endif
 }
 
 static inline void shmem_dir_free(struct page *page)

-- 
Best Regards,
Ray
-------
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 4/7] mm: manual page migration -- cleanup 4

2005-02-11 Thread Ray Bryant
Add some extern declarations to include/linux/mmigrate.h to
eliminate some implicitly declared warnings.

Signed-off-by: Ray Bryant <[EMAIL PROTECTED]>

Index: linux-2.6.11-rc2-mm2/include/linux/mmigrate.h
===
--- linux-2.6.11-rc2-mm2.orig/include/linux/mmigrate.h  2005-02-11 
11:23:46.0 -0800
+++ linux-2.6.11-rc2-mm2/include/linux/mmigrate.h   2005-02-11 
11:50:27.0 -0800
@@ -17,6 +17,9 @@ extern int page_migratable(struct page *
struct list_head *);
 extern struct page * migrate_onepage(struct page *, int nodeid);
 extern int try_to_migrate_pages(struct list_head *);
+extern int migration_duplicate(swp_entry_t);
+extern struct page * lookup_migration_cache(int);
+extern int migration_remove_reference(struct page *, int);
 
 #else
 static inline int generic_migrate_page(struct page *page, struct page *newpage,

-- 
Best Regards,
Ray
---
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-11 Thread Ray Bryant
-lru_lock);
+   }
+   } 
+   }
+
+   ret = migrate_vma_common(&page_list, node_map, count);
+
+   return ret;
+
+}
+
+static int
+migrate_anon_private_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, size_t va_start,
+ size_t va_end, short *node_map)
+{
+   struct page *page;
+   struct zone *zone;
+   unsigned long vaddr;
+   int count = 0, nid, ret;
+   LIST_HEAD(page_list);
+
+   va_start = va_start & PAGE_MASK;
+   va_end   = va_end   & PAGE_MASK;
+
+   for (vaddr=va_start; vaddr<=va_end; vaddr += PAGE_SIZE) {
+   spin_lock(&mm->page_table_lock);
+   page = follow_page(mm, vaddr, 0);
+   spin_unlock(&mm->page_table_lock);
+   /* 
+* follow_page has been observed to return pages with zero 
+* mapcount and NULL mapping.  Skip those pages as well
+*/
+   if (page && page_mapcount(page) && page->mapping) {
+   nid = page_to_nid(page);
+   if (node_map[nid] > 0) {
+   zone = page_zone(page);
+   spin_lock_irq(&zone->lru_lock);
+   if (PageLRU(page) &&
+   __steal_page_from_lru(zone, page)) {
+   count++;
+   list_add(&page->lru, &page_list);
+   } else
+   BUG();
+   spin_unlock_irq(&zone->lru_lock);
+   }
+   }
+   }
+
+   ret = migrate_vma_common(&page_list, node_map, count);
+
+   return ret;
+}
+
+void lru_add_drain_per_cpu(void *info) {
+   lru_add_drain();
+}
+
+asmlinkage long
+sys_page_migrate(const pid_t pid, size_t va_start, size_t va_end,
+   const int count, caddr_t old_nodes, caddr_t new_nodes)
+{
+   int i, ret = 0;
+   short *tmp_old_nodes;
+   short *tmp_new_nodes;
+   short *node_map;
+   struct task_struct *task;
+   struct mm_struct *mm = 0;
+   size_t size = count*sizeof(short);
+   struct vm_area_struct *vma, *vma2;
+
+
+   tmp_old_nodes = (short *) kmalloc(size, GFP_KERNEL);
+   tmp_new_nodes = (short *) kmalloc(size, GFP_KERNEL);
+   node_map = (short *) kmalloc(MAX_NUMNODES*sizeof(short), GFP_KERNEL);
+
+   if (!tmp_old_nodes || !tmp_new_nodes || !node_map) {
+   ret = -ENOMEM;
+   goto out_nodec;
+   }
+
+   if (copy_from_user(tmp_old_nodes, old_nodes, size) || 
+   copy_from_user(tmp_new_nodes, new_nodes, size)) {
+   ret = -EFAULT;
+   goto out_nodec;
+   }
+
+   read_lock(tasklist_lock);
+   task = find_task_by_pid(pid);
+   if (task) {
+   task_lock(task);
+   mm = task-mm;
+   if (mm)
+   atomic_inc(mm-mm_users);
+   task_unlock(task);
+   } else {
+   ret = -ESRCH;
+   goto out_nodec;
+   }
+   read_unlock(tasklist_lock);
+   if (!mm) {
+   ret = -EINVAL;
+   goto out_nodec;
+   }
+
+   /* 
+* for now, we require both the start and end addresses to
+* be mapped by the same vma.
+*/
+   vma = find_vma(mm, va_start);
+   vma2 = find_vma(mm, va_end);
+   if (!vma || !vma2 || (vma != vma2)) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   /* set up the node_map array */
+   for(i=0; iMAX_NUMNODES; i++)
+   node_map[i] = -1;
+   for(i=0; icount; i++)
+   node_map[tmp_old_nodes[i]] = tmp_new_nodes[i];
+
+   /* prepare for lru list manipulation */
+   smp_call_function(lru_add_drain_per_cpu, NULL, 0, 1);
+   lru_add_drain();
+
+   /* actually do the migration */
+   if (vma-vm_ops)
+   ret = migrate_mapped_file_vma(task, mm, vma, va_start, va_end,
+   node_map);
+   else
+   ret = migrate_anon_private_vma(task, mm, vma, va_start, va_end,
+   node_map);
+
+out:
+   atomic_dec(mm-mm_users);
+
+out_nodec:
+   if (tmp_old_nodes)
+   kfree(tmp_old_nodes);
+   if (tmp_new_nodes)
+   kfree(tmp_new_nodes);
+   if (node_map)
+   kfree(node_map);
+
+   return ret;
+
+}
+
 EXPORT_SYMBOL(generic_migrate_page);
 EXPORT_SYMBOL(migrate_page_common);
 EXPORT_SYMBOL(migrate_page_buffer);

-- 
Best Regards,
Ray
---
Ray Bryant   [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---

Re: high load & poor interactivity on fast thread creation

2000-11-30 Thread Ray Bryant

The IBM implementations of the Java language use native threads --
the result is that every time you do a Java thread creation, you
end up with a new cloned process.  Now this should be pretty fast,
so I am surprised that it stalls like that.  It is possible this
is a scheduler effect.  Do you have a program example you can
share with us?

Also, it is a little old now (by Internet standards) but you 
might take a look at this paper we did at the beginning of 
the year: 
 
http://www-4.ibm.com/software/developer/library/java2/index.html

Arnaud Installe wrote:
> 
> Hello,
> 
> When creating a lot of Java threads per second linux slows down to a
> crawl.  I don't think this happens on NT, probably because NT doesn't
> create new threads as fast as Linux does.
> 
> Is there a way (setting ?) to solve this problem ?  Rate-limit the number
> of threads created ?  The problem occurred on linux 2.2, IBM Java 1.1.8.
> 

-- 

Best Regards,

Ray Bryant
IBM Linux Technology Center
[EMAIL PROTECTED]
512-838-8538
http://oss.software.ibm.com/developerworks/opensource/linux

We are Linux. Resistance is an indication that you missed the point.

"...the Right Thing is more important than the amount of flamage you need
to go through to get there"
--Eric S. Raymond






Re: [BUG] threaded processes get stuck in rt_sigsuspend/fillonedir/exit_notify

2000-09-11 Thread Ray Bryant

Is there a succinct description of the thread group changes someplace?
I'd be willing to take a look at fixing linuxthreads, but haven't seen any
description (other than the kernel source) of what the changes are.  Or is
someone already working on this?

Ulrich Drepper wrote:

>
> The thread group changes broke the signal handling in linuxthreads.
> The CLONE_SIGHAND is now also used to enable thread groups but since
> linuxthreads already used CLONE_SIGHAND and is not prepared for thread
> groups all hell breaks loose.
>
> I've told Linus several times about this problems but he puts out one
> test release after the other without this fixed.
>
> --
> ---.  ,-.   1325 Chesapeake Terrace
> Ulrich Drepper  \,---'   \  Sunnyvale, CA 94089 USA
> Red Hat  `--' drepper at redhat.com   `
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/

--

Best Regards,

Ray Bryant
IBM Linux Technology Center
[EMAIL PROTECTED]
512-838-8538
http://oss.software.ibm.com/developerworks/opensource/linux

We are Linux. Resistance is an indication that you missed the point.

"...the Right Thing is more important than the amount of flamage you need
to go through to get there"
--Eric S. Raymond




