Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
OK, so what is the alternative?  Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form
  migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);
with old_node_list = 0 1 2 ... 31
new_node_list = 2 3 4 ... 33
for one of the pid's in the job.

I still don't like it. It would be bad to make migrate_pages another
ptrace() [and ptrace at least really enforces a stopped process]
But I can see your point that migrating DEFAULT pages with first-touch
aware applications pretty much requires the old_node/new_node lists.
I just don't think an external process should mess with another process's
VA. But I can see that it makes sense to do this on SHM that 
is mapped into a management process.

How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

There can be mapped files that can't be mapped into the migration task.
Here's an example (courtesy of Jack Steiner):
sprintf(fname, "/tmp/tmp.%d", getpid());
unlink(fname);
fd = open(fname, O_CREAT|O_RDWR, 0600);  /* O_CREAT requires a mode */
ftruncate(fd, bytes);                    /* size the file before mapping */
p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
close(fd);
unlink(fname);
/* "p" remains valid until unmapped */
The file /tmp/tmp.pid is both mapped and deleted.  It can't be opened
by another process in order to mmap() it, so as far as I know it can't
be mapped into the migration task.  The file does show up in
/proc/pid/maps as shown below (pardon the line splitting):

20270000-20278000 rw-p 00200000 08:13 75498728  \ 
/lib/tls/libc.so.6.1
20278000-20284000 rw-p 20278000 00:00 0
20300000-20c8c000 rw-s 00000000 08:13 100885287 \ 
/tmp/tmp.18259 (deleted)
40000000-40008000 r-xp 00000000 00:2a 14688706  \ 
/home/tulip14/steiner/apps/bigmem/big

Jack says:
"This is a fairly common way to work with scratch map'ed files. Sites that
have very large disk farms but limited swap space frequently do this (or at 
least they use to...)"

So while I tend to agree with your concern about manipulating
one process's address space from another, I honestly think we
are stuck, and I don't see a good way around this.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.
Yes, I agree.  Let's make that so.

--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Andi Kleen
On Tue, Feb 22, 2005 at 12:45:21PM -0600, Ray Bryant wrote:
> Andi Kleen wrote:
> 
> >
> >How about you add the va_start, va_end but only accept them 
> >when pid is 0 (= current process). Otherwise enforce with EINVAL
> >that they are both 0. This way you could map the
> >shared object into the batch manager, migrate it there, then
> >mark it somehow to not be migrated further, and then
> >migrate the anonymous pages using migrate_pages(pid, ...) 
> >
> 
> We'd have to use up a struct page flag (PG_MIGRATED?) to mark
> the page as migrated to keep the call to migrate_pages() for
> the anonymous pages from migrating the pages again.  Then we'd

I was thinking more of a new mempolicy, or a flag for one.
A flag would probably be better; no need to keep state in struct page.

> How about ignoring the va_start and va_end values unless
> either:
> 
>   pid == current->pid
>   or  current->euid == 0 /* we're root */
> 
> I like the first check a bit better than checking for 0.  Are
> there other system calls that follow that convention (e.g.,
> pid == 0 implies current)?
> 
> The second check lets a sufficiently responsible task manipulate
> other tasks.  This task can choose to have the target tasks
> suspended before it starts fussing with them.

I don't like that. The idea behind this restriction is to simplify
things by making sure processes change only their own VM. Letting
root override it doesn't make much sense.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Ray Bryant
Andi Kleen wrote:
How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 

We'd have to use up a struct page flag (PG_MIGRATED?) to mark
the page as migrated to keep the call to migrate_pages() for
the anonymous pages from migrating the pages again.  Then we'd
have to have some way to clear PG_MIGRATED once all of the
migrate_pages() calls are complete (we can't have the anonymous
page migrate_pages() calls clear the flags, since the second
such call would find the flag clear and remigrate the pages
in the overlapping nodes case.)
How about ignoring the va_start and va_end values unless
either:
  pid == current->pid
  or  current->euid == 0 /* we're root */
I like the first check a bit better than checking for 0.  Are
there other system calls that follow that convention (e.g.,
pid == 0 implies current)?
The second check lets a sufficiently responsible task manipulate
other tasks.  This task can choose to have the target tasks
suspended before it starts fussing with them.
BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.
Yes, that's been pointed out to me before.  Let's make it so.


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Andi Kleen
On Mon, Feb 21, 2005 at 11:12:14AM -0600, Ray Bryant wrote:
> Andi Kleen wrote:
> 
> 
> >
> >I wouldn't bother fixing up VMA policies. 
> >
> >
> 
> How would these policies get changed so that they represent the
> reality of the new node location(s) then?  Doesn't this have to
> happen as part of migrate_pages()?

You might want to change it, but it's a pure policy issue, and
that kind of policy should be in user space. However, I can see
it being ugly to grab the list of policies from user space
(it would need a /proc file). 

Perhaps you're right and it's better done in the kernel.
It just won't be very pretty code to convert all the masks.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-22 Thread Andi Kleen
> OK, so what is the alternative?  Well, if we had a va_start and
> va_end (or a va_start and length) we could move the shared object
> once using a call of the form
> 
>migrate_pages(pid, va_start, va_end, count, old_node_list,
>   new_node_list);
> 
> with old_node_list = 0 1 2 ... 31
>  new_node_list = 2 3 4 ... 33
> 
> for one of the pid's in the job.

I still don't like it. It would be bad to make migrate_pages another
ptrace() [and ptrace at least really enforces a stopped process].

But I can see your point that migrating DEFAULT pages with first-touch
aware applications pretty much requires the old_node/new_node lists.
I just don't think an external process should mess with another process's
VA. But I can see that it makes sense to do this on SHM that 
is mapped into a management process.

How about you add the va_start, va_end but only accept them 
when pid is 0 (= current process). Otherwise enforce with EINVAL
that they are both 0. This way you could map the
shared object into the batch manager, migrate it there, then
mark it somehow to not be migrated further, and then
migrate the anonymous pages using migrate_pages(pid, ...) 
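
For illustration, the check could look something like this inside the
extended call (a sketch only; the function and argument names are
invented, since no such call exists yet):

    /* Hypothetical argument check: honor a VA range only for the
     * calling process (pid == 0); otherwise both must be 0. */
    long check_migrate_args(pid_t pid, unsigned long va_start,
                            unsigned long va_end)
    {
        if (pid != 0 && (va_start != 0 || va_end != 0))
            return -EINVAL;
        return 0;
    }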

BTW it might be better to make va_end a size, just to be more
symmetric with mlock, madvise, mmap, et al.

-Andi



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi,
Oops.  It's late.  The paragraph below in my previous note confused
cpus and nodes.  It should have read as follows:
Let's suppose that nodes 0-1 of a 64 node [was: CPU] system have graphics
pipes.  To keep it simple, we will assume that there are 2 cpus
per node, like an Altix [128 CPUs in this system]. Let's suppose that jobs
arrive as follows:
. . .
Sorry about that.


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi,
I went back and did some digging on one of the issues that has dropped
off the list here: the case where the set of old nodes and new
nodes overlap in some way.  No one could provide me with a specific
example, but the common thread was that "this did happen in certain scenarios".
Part of these scenarios involved situations where a particular job
had to have access to a certain node, because that certain node was
attached to a graphics device, for example.  Here is one such
scenario:
Let's suppose that nodes 0-1 of a 64 CPU system have graphics
pipes.  To keep it simple, we will assume that there are 2 cpus
per node, like an Altix. Let's suppose that jobs arrive as follows:
(1)  32 processor, non-graphics job arrives and gets assigned
 cpus 96-127 (nodes 48-63)
(2)  A second 32 processor, non-graphics job arrives and is
 assigned cpus 64-95 (nodes 32-47)
(3)  A 64 processor non-graphics job arrives and gets assigned
 cpus 0-63.
(bear with me, please)
(4)  The job on cpus 64-95 terminates.  A new 28 processor
 job arrives and is assigned cpus 68-95.
(5)  A 4 cpu graphics job comes in and we want to assign it to
 cpus 0-3 (nodes 0-1) and it has a very high priority, so
 we want to migrate the 64 CPU job.  The only place left
 to migrate it is from cpus 0-63 to cpus 4-67.
(Note that we can't just migrate nodes 0-1 to nodes 32-33, because
for all we know, the program depends on the fact that nodes 0-1
are physically close to [have low latency access to] nodes 2-3.
So moving 0-1 to 32-33 would not be a topology-preserving
migration.)
Now if we are using a system call of the form
migrate_pages(pid, count, old_node_list, new_node_list);
then we really can't have old_node_list and new_node_list overlap,
unless this is the only process that we are migrating or there is
no shared memory among the pid's.  (Neither is very likely for
our workload mix.  :-)  ).
The reason that this doesn't work is the following:  It works
fine for the first pid.  The shared segment gets moved to the
new_node_list.  But when we call migrate_pages() for the 2nd
pid, we will remigrate the pages that ended up on the nodes
that are in the intersection of the sets of members of the
two lists.  (The scanning code has no way to recognize that
the pages have been migrated.  It finds pages that are on one
of the old nodes, and migrates them again.)  This gets repeated
for each subsequent call.  Not pretty.  What happens in this
particular case if you do the trivial thing and try:
old_nodes=0 1 2 ... 31
new_nodes=2 3 4 ... 33
Then after 16 processes have been migrated, all of the shared memory
pages of the job are on nodes 32 and 33. (I've assumed the shared
memory is shared among all of the processes of the job.)
Now you COULD do multiple migrations to make this work.
In this case, you could do 16 migrations:
step   old_nodes   new_nodes
  1    30 31       32 33
  2    28 29       30 31
  3    26 27       28 29
 ...
 16     0  1        2  3
During each step, you would have to call migrate_pages() 64 times,
since there are 64 processes involved.  (You can't migrate
any more nodes in each step without creating a situation where
pages will be physically migrated twice.)  Once again, we are
starting to veer close to O(N**2) behavior here, and we want
to stay away from that.
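
For concreteness, the stepping above would look roughly like this from
user space, assuming the proposed migrate_pages(pid, count, old_nodes,
new_nodes) form and an array pids[] holding the job's 64 processes
(sketch only, error handling omitted):

    /* Walk the job down two nodes per step, highest pair first, so
       that no page sits on both an old and a new node within a step. */
    int old_nodes[2], new_nodes[2];
    int step, p;

    for (step = 0; step < 16; step++) {
        old_nodes[0] = 30 - 2 * step;    /* 30, 28, ..., 0 */
        old_nodes[1] = 31 - 2 * step;    /* 31, 29, ..., 1 */
        new_nodes[0] = old_nodes[0] + 2;
        new_nodes[1] = old_nodes[1] + 2;
        for (p = 0; p < 64; p++)
            migrate_pages(pids[p], 2, old_nodes, new_nodes);
    }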
OK, so what is the alternative?  Well, if we had a va_start and
va_end (or a va_start and length) we could move the shared object
once using a call of the form
   migrate_pages(pid, va_start, va_end, count, old_node_list,
new_node_list);
with old_node_list = 0 1 2 ... 31
 new_node_list = 2 3 4 ... 33
for one of the pid's in the job.
(This is particularly important if the shared region is large.)
Next we could go and move the non-shared memory in each process
using similar calls, repeated one or more times in each process.
Yes, this is ugly, and yes this requires us to parse /proc/pid/maps.
Life is like that sometimes.
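
As a sketch of that parsing (using the six-argument form above;
is_shared_segment() is an invented stand-in for matching the segment's
inode from the maps line, and error handling is omitted):

    #include <stdio.h>

    FILE *f;
    char path[64], line[256];
    unsigned long va_start, va_end;

    snprintf(path, sizeof(path), "/proc/%d/maps", pid);
    f = fopen(path, "r");
    while (fgets(line, sizeof(line), f)) {
        sscanf(line, "%lx-%lx", &va_start, &va_end);
        if (is_shared_segment(line))     /* e.g. match "rw-s" + inode */
            migrate_pages(pid, va_start, va_end, count,
                          old_node_list, new_node_list);
    }
    fclose(f);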
Now, I admit that this example is somewhat contrived, and it shows
worst case behavior.  But this is not an implausible scenario.  And
it shows the difficulties of trying to use a system call of the
form:
   migrate_pages(pid, count, old_node_list, new_node_list)
in those cases where the old_node_list and the new_node_list are not
disjoint.  Furthermore, it shows how we could end up in a situation
where the old_node_list and the new_node_lists overlap.
Jack Steiner pointed out this kind of example to me, and this kind
of example did arise in IRIX, so we believe that it will arise on
Altix and we don't know of a good way around these problems other
than the system call form that includes the va_start and va_end.

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
Andi Kleen wrote:

I wouldn't bother fixing up VMA policies. 


How would these policies get changed so that they represent the
reality of the new node location(s) then?  Doesn't this have to
happen as part of migrate_pages()?


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Andi Kleen
On Mon, Feb 21, 2005 at 02:42:16AM -0600, Ray Bryant wrote:
> All,
> 
> Just an update on the idea of migrating a process without suspending
> it.
> 
> The hard part of the problem here is to make sure that the page_migrate()
> system call sees all of the pages to migrate.  If the process that is
> being migrated can still allocate pages, then the page_migrate() call
> may miss some of the pages.

I would do an easy 95% solution:

When the process has the default process policy, temporarily set a
preferred policy with the new node.

[This won't work with multiple nodes though, so you have to decide on one,
or stop the process if that is unacceptable.]
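
For a process adjusting itself, the single-node version of this is just
a temporary preferred policy around the migration; a minimal sketch,
assuming one target node new_node:

    #include <numaif.h>    /* set_mempolicy(), MPOL_* */

    unsigned long mask = 1UL << new_node;

    set_mempolicy(MPOL_PREFERRED, &mask, sizeof(mask) * 8);
    /* ... migrate; new allocations now land on new_node ... */
    set_mempolicy(MPOL_DEFAULT, NULL, 0);  /* restore default policy */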

> 
> One way to solve this problem is to force the process to start allocating
> pages on the new nodes before calling page_migrate().  There are a couple
> of subcases:
> 
> (1)  For memory mapped files with a non-DEFAULT associated memory policy,
>  one can use mbind() to fixup the memory policy.  (This assumes the
>  Steve Longerbeam patches are applied, as I understand things).

I would just ignore them.  If user space wants, it can handle it,
but it's probably not worth it.

> (1) could be handled as part of the page_migrate() system call --
> make one pass through the address space searching for mempolicy()
> data structures, and updating them as necessary.  Then make a second
> pass through and do the migrations.  Any new allocations will then
> be done under the new mempolicy, so they won't be missed.  But this
> still gets us into trouble if the old and new node lists are not
> disjoint.

I wouldn't bother fixing up VMA policies. 

> This doesn't handle anonymous memory or mapped files associated with
> the DEFAULT policy.  A way around that would be to add a target cpu_id

[...]

I would temporarily set a preferred policy, as mentioned above.

That only handles a single node, but your solution is not better.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Paul Jackson

Ray wrote:
> As I understood it, we were converging on the following:
>   (1) ...
>   (2) ...
>   (3) ...
> This is different than your reply above, which seems to imply that:
>   (A) ...
>   (B) ...

Andi reacted to various details of (A) and (B).

Any chance, Andi, of you directly stating whether you concur
with Ray that you two are converging on (1), (2) and (3)?

I'm afraid my mind reading skills aren't that good.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Andi Kleen
On Mon, Feb 21, 2005 at 01:29:41AM -0600, Ray Bryant wrote:
> This is different than your reply above, which seems to imply that:
> 
> (A)  Step 1 is to migrate mapped files using mbind().  I don't understand
>  how to do this in general, because:
>  (a)  I don't know how to make a non-racy list of the mapped files to
>   migrate without assuming that the process to be migrated is 
>   stopped

That was just a stopgap way to do the migration before you have
xattrs for shared libraries. If you have them it's not needed.

> and  (b)  If the mapped file is associated with the DEFAULT memory policy,
>   and page placement was done by first touch, then it is not clear
>   how to use mbind() to cause the pages to be migrated, and still
>   end up with the identical topological placement of pages after
>   the migration.

It can be done, but it's ugly. But it really was only intended for
the shared libraries.

> (B)  Step 2 is to use page_migrate() to migrate just the anonymous pages.
>  I don't like the restriction of this to just anonymous pages.

That would be only in this scenario; I agree it doesn't make sense
to add it as a general restriction to the syscall.

> 
> Fundamentally, I don't see why (A) is much different from allowing one
> process to manipulate the physical storage for another process.  It's
> just stated in terms of mmap'd objects instead of pid's.  So I don't
> see why that is fundamentally different from a page_migration() call
> with va_start and va_end arguments.

An mmapped object exists on its own. Its access is fully reference counted, etc.

> The only problem I see with that is the following:  Suppose that a user
> wants to migrate a portion of their own address space that is composed
> (at least partly) of anonymous pages or pages mapped to a file associated
> with the DEFAULT memory policy, and we want the pages to be topologically
> allocated the same way after the migration as they were before the
> migration?

It doesn't seem very realistic to me. When users want to change
their own address space they can use mbind() from the beginning,
and they should know what their memory layout is.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-21 Thread Ray Bryant
All,
Just an update on the idea of migrating a process without suspending
it.
The hard part of the problem here is to make sure that the page_migrate()
system call sees all of the pages to migrate.  If the process that is
being migrated can still allocate pages, then the page_migrate() call
may miss some of the pages.
One way to solve this problem is to force the process to start allocating
pages on the new nodes before calling page_migrate().  There are a couple
of subcases:
(1)  For memory mapped files with a non-DEFAULT associated memory policy,
 one can use mbind() to fixup the memory policy.  (This assumes the
 Steve Longerbeam patches are applied, as I understand things).
(2)  For anonymous pages and memory mapped files with DEFAULT policy,
 the allocation depends on which node the process is running.  So
 after doing the above, you need to migrate the task to a cpu
 associated with one of the nodes.
The problem with (1) is that it is racy; there is no guaranteed way to get the
list of mapped files for the process while it is still running.  A process
can do it for itself, so one way to do this would be to write the set of
new nodes to a /proc/pid file, then send the process a SIG_MIGRATE
signal.  Ugly.  (For multithreaded programs, all of the threads have
to be signalled to keep them from mmap()ing new files during the migration.)
(1) could be handled as part of the page_migrate() system call --
make one pass through the address space searching for mempolicy()
data structures, and updating them as necessary.  Then make a second
pass through and do the migrations.  Any new allocations will then
be done under the new mempolicy, so they won't be missed.  But this
still gets us into trouble if the old and new node lists are not
disjoint.
This doesn't handle anonymous memory or mapped files associated with
the DEFAULT policy.  A way around that would be to add a target cpu_id
to the page_migrate() system call.  Then, before doing the first pass
described above, one would do the equivalent of sched_setaffinity()
for the target pid, moving it to the indicated cpu.  Once it is known
the pid has moved (how to do that?), we know that anonymous memory and
DEFAULT mempolicy mapped files will be allocated on the nodes associated
with the new cpu.  Then we can proceed as discussed in the last paragraph.
Also ugly, due to the extra parameter.
Alternatively, we can just require, for correct execution, that the
invoking code do the sched_setaffinity() first, in those cases where
migrating a running task is important.
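
The affinity move itself is the standard call; a sketch, where
target_cpu is assumed to be some cpu on the destination node:

    #define _GNU_SOURCE
    #include <sched.h>

    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(target_cpu, &set);   /* a cpu on one of the new nodes */
    sched_setaffinity(pid, sizeof(set), &set);
    /* once the task has actually moved, call page_migrate() */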
Anyway, how important is this, really for acceptance of a page_migrate()
system call in the community?  (that is, how important is it to be
able to migrate a process without suspending it?)


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Paul Jackson wrote:
You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly

I presume if you knew that the job only had pages on certain nodes,
perhaps due to aggressive use of cpusets, that you would only have to
walk those nodes, right?
I don't think Andi was proposing you have to search all of the pages
on a node.  I think that the idea was that the (count, old_nodes, new_nodes)
parameters would have to be converted to a full node_map such as is done
in the patch (let's call it "sample code") that I sent out with the
overview that started this whole discussion.  node_map[] is MAX_NUMNODES
in length, and node_map[i] gives the node where pages on node i should be
migrated to, or is -1 if we are not migrating pages on this node.
Since we have extended the interface to support -1 as a possible value for
the old_nodes array [and it matches any old node], in that case we
would make node_map[i] = new_node for all values of i.
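
A sketch of that expansion, following the sample code's conventions
(node_map[i] == -1 means pages on node i are left where they are):

    int node_map[MAX_NUMNODES];
    int i, j;

    for (i = 0; i < MAX_NUMNODES; i++)
        node_map[i] = -1;
    for (i = 0; i < count; i++) {
        if (old_nodes[i] == -1) {        /* wildcard: any old node */
            for (j = 0; j < MAX_NUMNODES; j++)
                node_map[j] = new_nodes[i];
        } else {
            node_map[old_nodes[i]] = new_nodes[i];
        }
    }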


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Andi Kleen wrote:
Do you have any better way to suggest, Andi, for a batch manager to
relocate a job?  The typical scenario, as Ray explained it to me, is

- Give the shared libraries and any other files a suitable policy
(by mapping them and applying mbind) 

- Then execute migrate_pages() for the anonymous pages with a suitable
old node -> new node mapping.

How would you recommend that the batch manager move that job to the
nodes that can run it?  The layout of allocated memory pages and tasks
for that job must be preserved in order to keep the same performance.
The migration method needs to scale to hundreds, or more, of nodes.

You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly
(in the worst case you could create a small hash table for it
in the kernel, but I'm not sure it's worth it) 

-Andi
I'm going to assume that there have been some "crossed emails" here.
I don't think that this is the interface that you and I have been
converging on.  As I understood it, we were converging on the following:
(1)  extended attributes will be used to mark files as non-migratable
(2)  the page_migrate() system call will be defined as:
 page_migrate(pid, count, old_nodes, new_nodes);
 and it will migrate all pages that are either anonymous or part
 of mapped files that are not marked non-migratable.
(3)  The mbind() system call with MPOL_MF_STRICT will be hooked up
 to the migration code so that it actually causes a migration.
 Processes can use this interface to migrate a portion of their own
 address space containing a mapped file.
This is different than your reply above, which seems to imply that:
(A)  Step 1 is to migrate mapped files using mbind().  I don't understand
 how to do this in general, because:
 (a)  I don't know how to make a non-racy list of the mapped files to
  migrate without assuming that the process to be migrated is stopped
and  (b)  If the mapped file is associated with the DEFAULT memory policy,
  and page placement was done by first touch, then it is not clear
  how to use mbind() to cause the pages to be migrated, and still
  end up with the identical topological placement of pages after
  the migration.
(B)  Step 2 is to use page_migrate() to migrate just the anonymous pages.
 I don't like the restriction of this to just anonymous pages.
Fundamentally, I don't see why (A) is much different from allowing one
process to manipulate the physical storage for another process.  It's
just stated in terms of mmap'd objects instead of pid's.  So I don't
see why that is fundamentally different from a page_migration() call
with va_start and va_end arguments.
So I'm going to assume that the agreement was really (1)-(3) above.
The only problem I see with that is the following:  Suppose that a user
wants to migrate a portion of their own address space that is composed
(at least partly) of anonymous pages or pages mapped to a file associated
with the DEFAULT memory policy, and we want the pages to be topologically
allocated the same way after the migration as they were before the
migration?
The only way I know how to do the latter is with a system call of the form:
page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
where the permission model is that a pid can migrate any process that it
can send a signal to.  So a root pid can migrate any process, and a user
pid can migrate pages of any pid started by the user.
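
That is the same check the kernel makes for signal delivery, so user
space can probe it with the null signal; this is standard kill() usage,
shown here only to make the permission model concrete:

    #include <signal.h>
    #include <errno.h>

    /* Signal 0 performs only the permission check; nothing is sent. */
    if (kill(pid, 0) == -1 && errno == EPERM) {
        /* caller may not signal pid, so migration would be denied */
    }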


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Andi Kleen wrote:
But we are at least at the level of agreeing that the new system
call looks something like the following:
migrate_pages(pid, count, old_list, new_list);
right?

For the external case probably yes. For internal (process does this
on its own address space) it should be hooked into mbind() too.
-Andi
That makes sense.  I will agree to make that part work, too, as part
of this.  We will probably do the external case first, because we have
need for that.


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Paul Jackson
> - Give the shared libraries and any other files a suitable policy
> (by mapping them and applying mbind) 

Ah - I think you've said this before, and I'm being a bit retarded.

You're saying that one could horse around with the physical placement of
existing files mapped into another task's space by mapping them into one's
own space and using mbind (once mbind is hooked up to page migration,
to quote one of your earlier posts ;).  Ok.
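
As a sketch of that technique (using the libnuma mbind() wrapper, and
assuming mbind() actually migrates pages once it is hooked up to the
migration code as discussed here; in the kernels under discussion,
MPOL_MF_STRICT only makes mbind() fail on misplaced pages rather than
move them):

    #include <fcntl.h>
    #include <numaif.h>        /* mbind(), MPOL_BIND, MPOL_MF_STRICT */
    #include <sys/mman.h>
    #include <unistd.h>

    /* Map someone else's file into our own space and bind its pages
     * to new_node.  Illustrative only. */
    static int migrate_mapped_file(const char *path, size_t len,
                                   int new_node)
    {
            unsigned long nodemask = 1UL << new_node;
            int fd = open(path, O_RDWR);
            void *p;

            if (fd < 0)
                    return -1;
            p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
            close(fd);
            if (p == MAP_FAILED)
                    return -1;
            if (mbind(p, len, MPOL_BIND, &nodemask,
                      sizeof(nodemask) * 8, MPOL_MF_STRICT) < 0) {
                    munmap(p, len);
                    return -1;
            }
            return munmap(p, len);
    }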

How well does this work with a mapped file if the pages of that file
have been placed (allocated on nodes) using some intricate first-touch
pattern that is only encoded in some inscrutable initialization code of
the application, and that is heavily fragmented, with few contiguous
pages on the same node?

It seems to me that you can't migrate such regions efficiently using the
above explicit mbind'ing -- it could require a vma per page in the
limit.  And you can't migrate such regions using a migrate_pages() for
all anonymous pages in a task's space, because these aren't anon pages.
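
To make the blow-up concrete: reproducing such a layout with mbind()
alone degenerates into one call, and potentially one vma split, per
page, along these lines (placement[] is hypothetical -- it stands for
per-page node information that only the application's first-touch
logic ever knew):

    #include <stddef.h>
    #include <numaif.h>

    /* One BIND range per page: for a heavily fragmented region this
     * can split it into one vma per page. */
    static void rebind_per_page(char *base, size_t len, size_t pgsz,
                                const int *placement)
    {
            size_t off;

            for (off = 0; off < len; off += pgsz) {
                    unsigned long mask = 1UL << placement[off / pgsz];
                    mbind(base + off, pgsz, MPOL_BIND, &mask,
                          sizeof(mask) * 8, MPOL_MF_STRICT);
            }
    }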

Do you have in mind being able to tag such mapped files with an
attribute that causes their pages to be migrated along with the
anon pages on the migrate_pages() call?  That might work ...


> > How would you recommend that the batch manager move that job to the
> > nodes that can run it?   ...
> 
> You have to walk the full node mapping for each array, but
> even with hundreds of nodes that should not be that costly

I presume if you knew that the job only had pages on certain nodes,
perhaps due to aggressive use of cpusets, that you would only have to
walk those nodes, right?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Andi Kleen
> Do you have any better way to suggest, Andi, for a batch manager to
> relocate a job?  The typical scenario, as Ray explained it to me, is

- Give the shared libraries and any other files a suitable policy
(by mapping them and applying mbind) 

- Then execute migrate_pages() for the anonymous pages with a suitable
old node -> new node mapping.

> How would you recommend that the batch manager move that job to the
> nodes that can run it?  The layout of allocated memory pages and tasks
> for that job must be preserved in order to keep the same performance.
> The migration method needs to scale to hundreds, or more, of nodes.

You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly
(in the worst case you could create a small hash table for it
in the kernel, but I'm not sure it's worth it) 

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Paul Jackson
Andi wrote:
> I still think it's fundamentally unclean and racy. External processes
> shouldn't mess with virtual addresses of other processes.

It's not really messing with (changing) the virtual addresses of
another process.  It's messing with the physical placement.  It's
using the virtual addresses to help choose which pages to move.

Do you have any better way to suggest, Andi, for a batch manager to
relocate a job?  The typical scenario, as Ray explained it to me, is
thus.  A lower priority job, after running a while, is displaced by a
higher priority job that needs a large number of nodes.  Later on enough
nodes to run the lower priority job become available elsewhere.  The
lower priority job can either continue to wait for its original nodes to
come free (after the high priority job finishes) or it can be relocated
to the nodes available now.

How would you recommend that the batch manager move that job to the
nodes that can run it?  The layout of allocated memory pages and tasks
for that job must be preserved in order to keep the same performance.
The migration method needs to scale to hundreds, or more, of nodes.

(I'm starting to have visions of vma's having externally visible id's,
in a per-task namespace.)

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Andi Kleen
> >Perhaps node masks would be better and teaching the kernel to handle
> >relative distances inside the masks transparently while migrating?
> >Not sure how complicated this would be to implement though.
> >
> >Supporting interleaving on the new nodes may be also useful, that would
> >need a policy argument at least too and masks.
> >
> 
> The worry I have about using node masks is that it is not as general as
> old_node,new_node mappings (or preferably, the original proposal I made
> of old_node_list, new_node_list).  One can't differentiate between the

I agree that the node arrays are better for this case.

> >>and the majority of the memory is shared, then we only need to make
> >>one system call and one page table scan.  (We just "migrate" the
> >>shared object once.) So the time to do the page table scans disappears
> >
> >
> >I don't like this because it makes it much more complicated
> >to use for user space. And you can set separate policies for
> >shared objects anyways.
> 
> Yes, but only programs that care have to use the va_start and
> va_end.  Programs who want to move everything can specify
> 0 and MAX_INT there and they are done.

I still think it's fundamentally unclean and racy. External processes
shouldn't mess with virtual addresses of other processes.

> >-Andi
> 
> But we are least at the level of agreeing that the new system
> call looks something like the following:
> 
> migrate_pages(pid, count, old_list, new_list);
> 
> right?

For the external case probably yes. For internal (process does this
on its own address space) it should be hooked into mbind() too.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-20 Thread Ray Bryant
Paul Jackson wrote:
You have to walk the full node mapping for each array, but
even with hundreds of nodes that should not be that costly

I presume if you knew that the job only had pages on certain nodes,
perhaps due to aggressive use of cpusets, that you would only have to
walk those nodes, right?
I don't think Andi was proposing you have to search all of the pages
on a node.  I think that the idea was that the (count, old_nodes, new_nodes)
parameters would have to be converted to a full node_map such as is done
in the patch (let's call it sample code) that I sent out with the
overview that started this whole discussion.  node_map[] is MAX_NUMNODES
in length, and node_map[i] gives the node where pages on node i should be
migrated to, or is -1 if we are not migrating pages on this node.
Since we have extended the interface to support -1 as a possible value in
the old_nodes array [where it matches any old node], in that case we would
make node_map[i]=new_node for all values of i.
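
A sketch of that conversion, following the description above rather
than the actual sample-code patch (names and the -1 wildcard
convention are as described; MAX_NUMNODES is the kernel-side limit):

    /* Build node_map[] from the syscall's (count, old_nodes,
     * new_nodes) arguments.  node_map[i] is the destination for
     * pages currently on node i, or -1 to leave them alone. */
    void build_node_map(int count, const int *old_nodes,
                        const int *new_nodes,
                        int node_map[MAX_NUMNODES])
    {
            int i, j;

            for (i = 0; i < MAX_NUMNODES; i++)
                    node_map[i] = -1;       /* default: don't migrate */

            for (i = 0; i < count; i++) {
                    if (old_nodes[i] == -1) {
                            /* wildcard: pages on any node move */
                            for (j = 0; j < MAX_NUMNODES; j++)
                                    node_map[j] = new_nodes[i];
                    } else {
                            node_map[old_nodes[i]] = new_nodes[i];
                    }
            }
            /* The page scan then needs only one node_map[] lookup
             * per page, keyed by the node the page lives on now. */
    }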
--
Best Regards,
Ray
---
  Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
   so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Enjoy your vacation]
[I am thanks -- or I was -- I go home tomorrow]
I assume they would allow marking arbitary segments with specific
policies, so it should be possible.
An alternative way to handle shared libraries BTW would be to add the ELF
headers Steve did in his patch. And then handle them in user space
in ld.so and let it apply the necessary policy. 

This won't work for non ELF files though.
Would I then have to get sign-off from the ld.so maintainer to get that
patch in?  :-(
This sounds more general than the xattr attribute thing I was thinking
of (i. e. marking a file non-migratable or library)
Well, we can work the exact details of this part later.

(2)  Something along the lines of:
page_migrate(pid, old_node, new_node);
or perhaps
page_migrate(pid, old_node_mask, new_node_mask);

+ node mask length. 

I don't like old_node* very much because it's imho unreliable
(because you can usually never fully know on which nodes the old
process was and there can be good reasons to just migrate everything)
In our case, it turns out we do because the job is running inside of
a cpuset.  So it can't allocate memory outside of that cpuset.  In
more general scenarios, you are right, you don't know.  But this
can be handled with a MIGRATE_NODE_ANY (more below).
I assume the second way would be more flexible, although I found
having node masks for this has the problem that you tend to allocate
most memory on the lowest numbered node because it's not easy to
round-robin over all set nodes (that's an issue in PREFERRED policy
in NUMA API currently). So maybe the simple  new_node argument
is preferable.
page_migrate(pid, new_node)
(or putting it into a writable /proc file if you prefer that)   

or
(3)  mbind() with a pid argument?

That would bring it to 7 arguments, really too much for a system
call (and a function in general). Also it would mean needing
to know about other process private addresses again.
Maybe set_mempolicy, but a new call is probably better.
OK, let's assume we have a new call of some kind then.

But I think I now understand why you want this complicated
user space control. You want to preserve relative ordering
on a set of nodes, right? 

e.g. job runs threads on nodes 0,1,2,3  and you want it to move
to nodes 4,5,6,7 with all memory staying staying in the same
distance from the new CPUs as it were from the old CPUs, right? 
Yes, that's it:  we want the relative distances between the pages
on the new set of nodes to match the distances on the old set of
nodes as much as is possible, or we at least want a sufficiently
powerful system call to let us do this if the correct set of new
nodes is available.  This is so that the application has the same
level of performance before and after the migration call.
In actuality, what we intend to do is to use this API to migrate
jobs from one cpuset to another; we will require the administrator
to set up the cpusets so they are topologically equivalent for cpusets
of the same size.  If they don't do that, then performance can
change when a job is migrated.
It explains why you want old_node, you would do 
(assuming node mask arguments) 

page_migrate(pid, 0, 4)
page_migrate(pid, 1, 5)
...
page_migrate(pid, 3, 7) 

keeping the memory in the same relative order. Problem is what happens
when some memory is in some other node due to memory pressure fallbacks.
Your scheme would not migrate this memory at all. While you may
get away with this in your application I think it would make 
page migration much less useful in the general case than it could
be.  e.g. for a single threaded process it is very useful to just
force all its pages that have been allocated on multiple nodes
to a specific node. I would like to have this option at least, 
but with old node it would be rather inefficient. Ok, I guess you could
add a wildcard value for it; I guess that would work.

The patch that I sent out already defines MIGRATE_NODE_ANY to request
any other available node; this is needed for those cases where memory
hotplug just wants to move the page off of >>this<< node.  I don't
see why we couldn't allow this as a value for old node, and it
would mean "migrate all pages".  (i. e. MIGRATE_NODE_ANY matches
pages on all nodes.)
Problem is still that you would need to iterate through all nodes for your 
migration scenario (or how would you find out where the job  allocated
its old pages?), which is not very nice.
Agreed.  Which is why we really prefer an old_node_list, new_node_list;
we then iterate across the pages and make the indicated decision for each
page.
Perhaps node masks would be better and teaching the kernel to handle
relative distances inside the masks transparently while migrating?
Not sure how complicated this would be to implement though.
Supporting interleaving on the new nodes may be also useful, that would
need a policy argument at least too and masks.
The worry I have about using node masks is that it is not as general as
old_node,new_node mappings (or preferably, the original proposal I made
of old_node_list, new_node_list).  One can't differentiate between the

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
You and Robin mentioned some problems with "double migration"
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.

There is one other case where Robin and I have talked about double
migration.  That is the case where the set of old nodes and new
nodes overlap.  If the system call interface is assumed to be
something like:
page_migrate(pid, old_node, new_node);
then (depending on what the complete lists of old_nodes and new_nodes
are) doing something like:
page_migrate(pid, 1, 2);
page_migrate(pid, 2, 3);
can end up actually moving pages from node 1 to node 2,
only to move them again from node 2 to node 3.  This is another
form of double migration that we have worried about avoiding.
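
Expressed with the list form of the call discussed elsewhere in this
thread (hypothetical, as is the call itself), the difference is that
sequential pair-at-a-time calls compose the moves, while one call
applying the whole mapping in a single scan does not:

    /* Two sequential calls: node 1's pages land on node 2, and the
     * second call then moves those same pages on to node 3. */
    int n1[] = { 1 }, to2[] = { 2 };
    int n2[] = { 2 }, to3[] = { 3 };
    page_migrate(pid, 1, n1, to2);   /* 1 -> 2 */
    page_migrate(pid, 1, n2, to3);   /* 2 -> 3: double-moves the above */

    /* One call with the full mapping: each page is examined once and
     * moved at most once (1 -> 2 and 2 -> 3, nothing moved twice). */
    int old_nodes[] = { 1, 2 };
    int new_nodes[] = { 2, 3 };
    page_migrate(pid, 2, old_nodes, new_nodes);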
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi, et al:
I see that  several messages have been sent in the interim.
I apologize for being "out of sync", but today is my last
day to go skiing and it is gorgeous outside.  I'll try
to catch up and digest everything later.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Here's an interface proposal that may be a middle ground and
should satisfy both small and large system requirements:
The system call interface would be:
page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);
(e. g. same as before, but please keep reading):
The following restrictions of my original proposal would be
dropped:
(1)  va_start and va_end can span multiple vma's.  To migrate
 all pages in a process, va_start can be 0UL and va_end
 would be MAX_INT.  (Equivalently, we could use va_start
 and length, in pages.)  We would expect the normal usage
 of this call on small systems to be va_start=0, va_end=MAX_INT.
 va_start and va_end would be required to be page aligned.
(2)  There is no requirement that the pid be suspended before
 the system call is issued.  Further requirements below
 are proposed to handle the allocation of new pages while
 the migrate system call is in progress.
(3)  Mempolicy data structures will be updated to reflect the
 new node locations before any pages are migrated.  That
 way, if the process allocates new pages before the migration
 process is completed, they will be allocated on the new
 nodes.
 (An alternative:  we could require the user to update
 the NUMA API data structures to reflect the new reality
 before the page_migrate() call is issued.  This is consistent
 with item (4).  If the user doesn't do this, then
 there is no guarantee that the page migration call will
 actually be able to migrate all pages.)
 If any memory policy is DEFAULT, then the pid will need to
 be migrated to a cpu associated with  one of the new_node_list
 nodes before the page_migrate() call.  This is so new
 allocations will happen in the new_node_list and the
 migration call won't miss those pages.  The system call
 will work correctly without this, it just can't guarantee
 that it will migrate all pages from the old_nodes.
(4)  If cpusets are in use, the new_node_list must represent
 valid nodes to allocate pages from for the cpuset that
 pid is currently a member of.  This implies that the
 pid is moved from its old cpuset to a new cpuset before
 the page_migrate() call is issued.  Any nodes not part
 of the new cpuset will cause the system call to return
 with -EINVAL.
(5)  If, during the migration process, a page is to be moved to
 node N, but the alloc_pages_node() call for node N fails, then the
 page will fall back to allocation on the "nearest" node
 in the new_node_list; if this node is full then fall back
 to the next nearest node, etc.  If none of the nodes has
 space, then the migration system call will fail.  (Hmmm...
 would we unmigrate the pages that had been migrated
 this far??  sounds messy also, not sure what one
 would do about error reporting here so that the caller
 could take some corrective action.)
(6)  The system call is reserved to root or a pid with
 capability CAP_PAGE_MIGRATE.
(7)  Mapped files with the extended attribute MIGRATE
 set to NONE are not migrated by the system call.
 Mapped files with the extended attribute MIGRATE
 set to LIB will be handled as follows:  r/o
 mappings will not be migrated.  r/w mappings will
 be migrated.  If no MIGRATE extended attribute is available,
 then the assumption is that the MIGRATE extended
 attribute is not set.  (Files mapped from NFS
 would always be regarded as migrateable until
 NFS gets extended attributes.)
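
As a usage sketch against the proposal above (hypothetical throughout:
page_migrate(), CAP_PAGE_MIGRATE and the helper below don't exist, and
the cpuset move is shown only schematically via the /dev/cpuset
interface):

    /* A batch manager relocating a job from nodes {0..3} to {4..7}. */
    extern long page_migrate(int pid, unsigned long va_start,
                             unsigned long va_end, int count,
                             const int *old_nodes, const int *new_nodes);
    extern int move_to_cpuset(int pid, const char *path); /* hypothetical */

    int relocate_job(int pid)
    {
            int old_nodes[] = { 0, 1, 2, 3 };
            int new_nodes[] = { 4, 5, 6, 7 };

            /* Item (4): move the task into the destination cpuset
             * first, so the new nodes are legal allocation targets;
             * otherwise the call would return -EINVAL. */
            if (move_to_cpuset(pid, "/dev/cpuset/newjob") < 0)
                    return -1;

            /* Item (1): the whole address space in one call. */
            return page_migrate(pid, 0UL, ~0UL,
                                4, old_nodes, new_nodes);
    }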
Note that nothing here requires parsing of /proc/pid/maps,
etc.  However, very large systems may use the system call
in special ways, e. g:
(1)  They may decide to suspend processes before migration.
(2)  They may decide to optimize the migration process by
 trying to migrate large shared objects only "once",
 in the sense that only one scan of a large shared
 object will be done.
Issues of complexity related to the above are reserved for
those systems who choose to use the system call in this way.
Please note, however that this is a performance optimization
that some systems MAY decide to do.  There is NO REQUIREMENT
that any user follow these steps from a correctness point of
view, the page_migrate() system call will still do the correct
thing.
Now, I know that is complicated and a lot of verbiage.  But this
would satisfy our requirements and I think it would satisfy
the concern that the page_migrate() call was built just to
satisfy SGI requirements.
Comments, flames, suggestions, etc, as usual are all welcome.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
> Problem is what happens
> when some memory is in some other node due to memory pressure fallbacks.
> Your scheme would not migrate this memory at all. 

The arrays of old and new nodes handle this fine.
Include that 'other node' in the array of old nodes,
and the corresponding new node, where those pages
should migrate, in the array of new nodes.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
> e.g. job runs threads on nodes 0,1,2,3  and you want it to move
> to nodes 4,5,6,7 with all memory staying staying in the same
> distance from the new CPUs as it were from the old CPUs, right? 
> 
> It explains why you want old_node, you would do 
> (assuming node mask arguments) 

Yup - my immediately preceding post repeated this - sorry.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
> I don't like old_node* very much because it's imho unreliable
> (because you can usually never fully know on which nodes the old
> process was and there can be good reasons to just migrate everything)

That's one way that the arrays of old and new nodes pays off.
You can list any old node that might have a page, and state
which new node that page should go to.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi - what does this line mean:

  + node mask length. 

I guess it's the names of the parameters in a proposed
migration system call.  Length of what, mask of what,
what's the node mean, huh?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Andi Kleen
[Enjoy your vacation]

On Fri, Feb 18, 2005 at 02:38:42AM -0600, Ray Bryant wrote:
> 
> Let's start off with at least one thing we can agree on.  If xattrs
> are already part of XFS, then it seems reasonable to use an extended
> attribute to mark certain files as non-migratable.   (Some further
> thought is going to be required here -- r/o sections of a
> shared library need not be migrated, but r/w sections containing
> program or thread private data would need to be migrated.  So
> the extended attribute may be a little more complicated than
> just "don't migrate".)

I assume they would allow marking arbitrary segments with specific
policies, so it should be possible.

An alternative way to handle shared libraries BTW would be to add the ELF
headers Steve did in his patch. And then handle them in user space
in ld.so and let it apply the necessary policy. 

This won't work for non ELF files though.


> 
> The fact that NFS doesn't support this means that we will have to
> have some other way to handle files from NFS though.  It is possible
> we can live with the notion that files mapped in from NFS are always
> migratable.  (I'll need to look into that some more).

I don't know details, but I would assume selinux (and other "advanced security" 
people who generally need more security information per file) have plans in 
this area too.

> >
> >>>
> >>>Sorry, but the only real difference between your API and mbind is that
> >>>yours has a pid argument. 
> >>>
> 
> OK, so I've "lost the thread" a little bit here.  Specifically what
> would you propose the API for page migration be?  As I read through your 
> note,
> I see a couple of different possibilities being considered:
> 
> (1)  Map each object to be migrated into a management process,
>  update the object's memory policy to match the new node locations
>  and then unmap the object.  Use the MPOL_MF_STRICT argument to mbind() 
>  and
>  the result is that migration happens as part of the call.
> 
> (2)  Something along the lines of:
> 
>  page_migrate(pid, old_node, new_node);
> 
>  or perhaps
> 
>  page_migrate(pid, old_node_mask, new_node_mask);

+ node mask length. 

I don't like old_node* very much because it's imho unreliable
(because you can usually never fully know on which nodes the old
process was and there can be good reasons to just migrate everything)

I assume the second way would be more flexible, although I found
having node masks for this has the problem that you tend to allocate
most memory on the lowest numbered node because it's not easy to
round-robin over all set nodes (that's an issue in PREFERRED policy
in NUMA API currently). So maybe the simple  new_node argument
is preferable.

page_migrate(pid, new_node)

(or putting it into a writable /proc file if you prefer that)   

> 
> or
> 
> (3)  mbind() with a pid argument?

That would bring it to 7 arguments, really too much for a system
call (and a function in general). Also it would mean needing
to know about other process private addresses again.

Maybe set_mempolicy, but a new call is probably better.

> >NUMA API currently doesn't offer a way to do that, 
> >not even with Steve's patch that does simple page migration.
> >You only get a migration when you set a BIND or PREFERED
> >policy, and then it would stay. But I guess you could
> >force that and then set back DEFAULT. It's a bit ugly,
> >but not too bad.
> >
> 
> Very ugly, I think.  Particularly if you have to do a lot of

Well, I guess it could be made a new flag that says to
not change the future policy. 

> vma splitting to get the correct node placement.  (Worst case
> is a VMA with nodes interleaved by first touch across a set of
> nodes in a way that doesn't match the INTERLEAVE mempolicy.
> Then you would have to create a separate VMA for each page
> and use the BIND policy.  Then after migration you would
> have to go through and set the policy back to DEFAULT,
> resulting in a lot of vma merges.)

Umm - I hope you don't want to do such tricks from external
processes. If a program does it by itself it can just use interleave
policy.

But I think I now understand why you want this complicated
user space control. You want to preserve relative ordering
on a set of nodes, right? 

e.g. job runs threads on nodes 0,1,2,3  and you want it to move
to nodes 4,5,6,7 with all memory staying staying in the same
distance from the new CPUs as it were from the old CPUs, right? 

It explains why you want old_node, you would do 
(assuming node mask arguments) 

page_migrate(pid, 0, 4)
page_migrate(pid, 1, 5)
...
page_migrate(pid, 3, 7) 

keeping the memory in the same relative order. Problem is what happens
when some memory is in some other node due to memory pressure fallbacks.
Your scheme would not migrate this memory at all. While you may
get away with this in your application I think it would make 
page migration much less useful in the general case than it could
be.  e.g. for a single threaded process it is very useful to just
force all its pages that have been allocated on multiple nodes
to a specific node. I would like to have this option at least,
but with old node it would be rather inefficient. Ok, I guess you could
add a wildcard value for it; I guess that would work.

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Sorry for the late answer.]
No problem, remember, I'm supposed to be on vacation, anyway.  :-)
Let's start off with at least one thing we can agree on.  If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable.   (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated.  So
the extended attribute may be a little more complicated than
just "don't migrate".)
The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though.  It is possible
we can live with the notion that files mapped in from NFS are always
migratable.  (I'll need to look into that some more).
On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

OK, so I've "lost the thread" a little bit here.  Specifically what
would you propose the API for page migration be?  As I read through your note,
I see a couple of different possibilities being considered:
(1)  Map each object to be migrated into a management process,
 update the object's memory policy to match the new node locations
 and then unmap the object.  Use the MPOL_MF_STRICT argument to mbind() and
 the result is that migration happens as part of the call.
(2)  Something along the lines of:
 page_migrate(pid, old_node, new_node);
 or perhaps
 page_migrate(pid, old_node_mask, new_node_mask);
or
(3)  mbind() with a pid argument?
I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly?)  :-)


The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel) 
Internally it's all the same.
Well, yes, I guess to be more precise I should have said that
very few programs use any NUMA policy other than the DEFAULT
policy.  And that they instead make page placement decisions implicitly
using first touch.
Hmm, I see perhaps my distinction of these cases between programs
already using the NUMA API and those not using it was not very useful
and led you off on a tangent. Perhaps we can just drop it.
I think one problem you have is that you essentially
want to keep the DEFAULT policy, but change the nodes.
Yes, that is correct.  This has been exactly my point from the
beginning.
We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate  those programs without
requiring them to create a non-DEFAULT policy of some kind.
NUMA API currently doesn't offer a way to do that, 
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a bit ugly,
but not too bad.

Very ugly, I think.  Particularly if you have to do a lot of
vma splitting to get the correct node placement.  (Worst case
is a VMA with nodes interleaved by first touch across a set of
nodes in a way that doesn't match the INTERLEAVE mempolicy.
Then you would have to create a separate VMA for each page
and use the BIND policy.  Then after migration you would
have to go through and set the policy back to DEFAULT,
resulting in a lot of vma merges.)

Sure, but NUMA API goes to great pains to handle such programs.
Yes, it does.  But, how do we handle legacy NUMA codes that people
use today on our Linux 2.4.21 based Altix kernels?  Such programs
don't have access to the NUMA API, so they aren't using it.  They
work fine on 2.6 with the DEFAULT memory policy.  It seems unreasonable
to go back and require these programs to use "numactl" to solve a problem that
they are already solving without it.  And it certainly seems difficult
to require them to use numactl to enable migration of those programs.
(I'm sorry to keep harping on this but I think this is the
heart of the issue we are discussing.  Are you of the opinion that
we sould require every program that runs on ALTIX under Linux 2.6 to use 
numactl?)

So let's go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting

That would make the node array infinite, wouldn't it?  What happens when
you want to migrate a 1TB process? :-) I think you have to replace
that one with a single target node argument too.
I'm sorry, I don't follow that at all.  The node array has nothing to do
with the size of the address range to be migrated.  It is not the case
that the ith entry in the node array says what to do with the ith page.
Instead, the old and new node arrays define a mapping of pages:  for pages
found on old_node[i], move them to new_node[i].

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Sorry for the late answer.]
No problem, remember, I'm supposed to be on vacation, anyway.  :-)
Let's start off with at least one thing we can agree on.  If xattrs
are already part of XFS, then it seems reasonable to use an extended
attribute to mark certain files as non-migratable.   (Some further
thought is going to be required here -- r/o sections of a
shared library need not be migrated, but r/w sections containing
program or thread private data would need to be migrated.  So
the extended attribute may be a little more complicated than
just don't migrate.)
The fact that NFS doesn't support this means that we will have to
have some other way to handle files from NFS though.  It is possible
we can live with the notion that files mapped in from NFS are always
migratable.  (I'll need to look into that some more).
On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

OK, so I've lost the thread a little bit here.  Specifically what
would you propose the API for page migration be?  As I read through your note,
I see a couple of different possibilities being considered:
(1)  Map each object to be migrated into a management process,
 update the object's memory policy to match the new node locations
 and then unmap the object.  Use the MPOL_F_STRICT argument to mbind() and
 the result is that migration happens as part of the call.
(2)  Something along the lines of:
 page_migrate(pid, old_node, new_node);
 or perhaps
 page_migrate(pid, old_node_mask, new_node_mask);
or
(3)  mbind() with a pid argument?
I'm sorry to be so confused, but could you briefly describe what
your proposed API would be (or choose from the above list if I
have guessed correctly?)  :-)


The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel) 
Internally it's all the same.
Well, yes, I guess to be more precise I should have said that
very few programs use any NUMA policy other than the DEFAULT
policy.  And that they instead make page placement decisions implicitly
using first touch.
Hmm, I see perhaps my distinction of these cases with programs
already using NUMA API and not using it was not very useful
and lead you to a tangent. Perhaps we can just drop it.
I think one problem that you have that you essentially
want to keep DEFAULT policy, but change the nodes.
Yes, that is correct.  This has been exactly my point from the
beginning.
We have programs that use the DEFAULT policy and do placement
by first touch, and we want to migrate  those programs without
requiring them to create a non-DEFAULT policy of some kind.
NUMA API currently doesn't offer a way to do that, 
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a big ugly,
but not too bad.

Very ugly, I think.  Particularly if you have to do a lot of
vma splitting to get the correct node placement.  (Worst case
is a VMA with nodes interleaved by first touch across a set of
nodes in a way that doesn't match the INTERLEAVE mempolicy.
Then you would have to create a separate VMA for each page
and use the BIND policy.  Then after migration you would
have to go through and set the policy back to DEFAULT,
resulting in a lot of vma merges.)

Sure, but NUMA API goes to great pains to handle such programs.
Yes, it does.  But, how do we handle legacy NUMA codes that people
use today on our Linux 2.4.21 based Altix kernels?  Such programs
don't have access to the NUMA API, so they aren't using it.  They
work fine on 2.6 with the DEFAULT memory policy.  It seems unreasonable
to go back and require these programs to use numactl to solve a problem that
they are already solving without it.  And it certainly seems difficult
to require them to use numactl to enable migration of those programs.
(I'm sorry to keep harping on this but I think this is the
heart of the issue we are discussing.  Are you of the opinion that
we sould require every program that runs on ALTIX under Linux 2.6 to use 
numactl?)

So lets go with the idea of dropping the va_start and va_end arguments from
the system call I proposed.  Then, we get into the kernel and starting

That would make the node array infinite, won't it?  What happens when
you want to migrate a 1TB process? @) I think you have to replace
that one with a single target node argument too.
I'm sorry, I don't follow that at all.  The node array has nothing to do 
with
the size of the address range to be migrated.  It is not the case that the
ith entry in the node array says what to do with the ith page.  Instead the
old and new node arrays defining a mapping of pages:  for pages found on
old_node[i], move them to 

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Andi Kleen
[Enjoy your vacation]

On Fri, Feb 18, 2005 at 02:38:42AM -0600, Ray Bryant wrote:
 
 Let's start off with at least one thing we can agree on.  If xattrs
 are already part of XFS, then it seems reasonable to use an extended
 attribute to mark certain files as non-migratable.   (Some further
 thought is going to be required here -- r/o sections of a
 shared library need not be migrated, but r/w sections containing
 program or thread private data would need to be migrated.  So
 the extended attribute may be a little more complicated than
 just don't migrate.)

I assume they would allow marking arbitary segments with specific
policies, so it should be possible.

An alternative way to handle shared libraries BTW would be to add the ELF
headers Steve did in his patch. And then handle them in user space
in ld.so and let it apply the necessary policy. 

This won't work for non ELF files though.


 
 The fact that NFS doesn't support this means that we will have to
 have some other way to handle files from NFS though.  It is possible
 we can live with the notion that files mapped in from NFS are always
 migratable.  (I'll need to look into that some more).

I don't know details, but I would assume selinux (and other advanced security 
people who generally need more security information per file) have plans in 
this area too.

 
 
 Sorry, but the only real difference between your API and mbind is that
 yours has a pid argument. 
 
 
 OK, so I've lost the thread a little bit here.  Specifically what
 would you propose the API for page migration be?  As I read through your 
 note,
 I see a couple of different possibilities being considered:
 
 (1)  Map each object to be migrated into a management process,
  update the object's memory policy to match the new node locations
  and then unmap the object.  Use the MPOL_F_STRICT argument to mbind() 
  and
  the result is that migration happens as part of the call.
 
 (2)  Something along the lines of:
 
  page_migrate(pid, old_node, new_node);
 
  or perhaps
 
  page_migrate(pid, old_node_mask, new_node_mask);

+ node mask length. 

I don't like old_node* very much because it's imho unreliable
(because you can usually never fully know on which nodes the old
process was and there can be good reasons to just migrate everything)

I assume the second way would be more flexible, although I found
having node masks for this has the problem that you tend to allocate
most memory on the lowest numbered node because it's not easy to
round-robin over all set nodes (that's an issue in PREFERED policy
in NUMA API currently). So maybe the simple  new_node argument
is preferable.

page_migrate(pid, new_node)

(or putting it into a writable /proc file if you prefer that)   

 
 or
 
 (3)  mbind() with a pid argument?

That would bring it to 7 arguments, really too much for a system
call (and a function in general). Also it would mean needing
to know about other process private addresses again.

Maybe set_mempolicy, but a new call is probably better.

 NUMA API currently doesn't offer a way to do that, 
 not even with Steve's patch that does simple page migration.
 You only get a migration when you set a BIND or PREFERED
 policy, and then it would stay. But I guess you could
 force that and then set back DEFAULT. It's a big ugly,
 but not too bad.
 
 
 Very ugly, I think.  Particularly if you have to do a lot of

Well, I guess it could be made a new flag that says to
not change the future policy. 

 vma splitting to get the correct node placement.  (Worst case
 is a VMA with nodes interleaved by first touch across a set of
 nodes in a way that doesn't match the INTERLEAVE mempolicy.
 Then you would have to create a separate VMA for each page
 and use the BIND policy.  Then after migration you would
 have to go through and set the policy back to DEFAULT,
 resulting in a lot of vma merges.)

Umm - I hope you don't want to do such tricks from external
processes. If a program does it by itself it can just use interleave
policy.

But I think I now understand why you want this complicated
user space control. You want to preserve relative ordering
on a set of nodes, right? 

e.g. job runs threads on nodes 0,1,2,3  and you want it to move
to nodes 4,5,6,7 with all memory staying staying in the same
distance from the new CPUs as it were from the old CPUs, right? 

It explains why you want old_node, you would do 
(assuming node mask arguments) 

page_migrate(pid, 0, 4)
page_migrate(pid, 1, 5)
...
page_migrate(pid, 3, 7) 

keeping the memory in the same relative order. Problem is what happens
when some memory is in some other node due to memory pressure fallbacks.
Your scheme would not migrate this memory at all. While you may
get away with this in your application I think it would make 
page migration much less useful in the general case than it could
be.  e.g. for a single threaded process it is very useful to just
force all its pages that have been allocated on 

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi - what does this line mean:

  + node mask length. 

I guess its the names of the parameters in a proposed
migration system call.  Length of what, mask of what,
what's the node mean, huh?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
 I don't like old_node* very much because it's imho unreliable
 (because you can usually never fully know on which nodes the old
 process was and there can be good reasons to just migrate everything)

That's one way that the arrays of old and new nodes pays off.
You can list any old node that might have a page, and state
which new node that page should go to.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
 e.g. job runs threads on nodes 0,1,2,3  and you want it to move
 to nodes 4,5,6,7 with all memory staying at the same
 distance from the new CPUs as it was from the old CPUs, right? 
 
 It explains why you want old_node, you would do 
 (assuming node mask arguments) 

Yup - my immediately preceding post repeated this - sorry.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Paul Jackson
Andi wrote:
 Problem is what happens
 when some memory is in some other node due to memory pressure fallbacks.
 Your scheme would not migrate this memory at all. 

The arrays of old and new nodes handle this fine.
Include that 'other node' in the array of old nodes,
and the corresponding new node, where those pages
should migrate, in the array of new nodes.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson [EMAIL PROTECTED] 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Here's an interface proposal that may be a middle ground and
should satisfy both small and large system requirements:
The system call interface would be:
page_migrate(pid, va_start, va_end, count, old_node_list, new_node_list);
(e. g. same as before, but please keep reading):
The following restrictions of my original proposal would be
dropped:
(1)  va_start and va_end can span multiple vma's.  To migrate
 all pages in a process, va_start can be 0UL and va_end
 would be MAX_INT.  (Equivalently, we could use va_start
 and length, in pages)  We would expect the normal usage
 of this call on small systems to be va_start=0, va_end=MAX_INT.
 va_start and va_end would be required to be page aligned.
(2)  There is no requirement that the pid be suspended before
 the system call is issued.  Further requirements below
 are proposed to handle the allocation of new pages while
 the migrate system call is in progress.
(3)  Mempolicy data structures will be updated to reflect the
 new node locations before any pages are migrated.  That
 way, if the process allocates new pages before the migration
 process is completed, they will be allocated on the new
 nodes.
 (An alternative:  we could require the user to update
 the NUMA API data structures to reflect the new reality
 before the page_migrate() call is issued.  This is consistent
 with item (4).  If the user doesn't do this, then
 there is no guarantee that the page migration call will
 actually be able to migrate all pages.)
 If any memory policy is DEFAULT, then the pid will need to
 be migrated to a cpu associated with  one of the new_node_list
 nodes before the page_migrate() call.  This is so new
 allocations will happen in the new_node_list and the
 migration call won't miss those pages.  The system call
 will work correctly without this, it just can't guarantee
 that it will migrate all pages from the old_nodes.
(4)  If cpusets are in use, the new_node_list must represent
 valid nodes to allocate pages from for the cpuset that
 pid is currently a member of.  This implies that the
 pid is moved from its old cpuset to a new cpuset before
 the page_migrate() call is issued.  Any nodes not part
 of the new cpuset will cause the system call to return
 with -EINVAL.
(5)  If, during the migration process, a page is to be moved to
 node N, but the alloc_pages_node() call for node N fails, then the
 page will fall back to allocation on the nearest node
 in the new_node_list; if this node is full then fall back
 to the next nearest node, etc.  If none of the nodes has
 space, then the migration system call will fail.  (Hmmm...
 would we unmigrate the pages that had been migrated
 this far??  sounds messy also, not sure what one
 would do about error reporting here so that the caller
 could take some corrective action.)
(6)  The system call is reserved to root or a pid with
 capability CAP_PAGE_MIGRATE.
(7)  Mapped files with the extended attribute MIGRATE
 set to NONE are not migrated by the system call.
 Mapped files with the extended attribute MIGRATE
 set to LIB will be handled as follows:  r/o
 mappings will not be migrated.  r/w mappings will
 be migrated.  If no MIGRATE extended attribute is available,
 then the assumption is that the MIGRATE extended
 attribute is not set.  (Files mapped from NFS
 would always be regarded as migratable until
 NFS gets extended attributes.)
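
To make the restated call concrete, here is a hypothetical user-level
sketch of a whole-process migration (the prototype and header are
assumptions drawn from this proposal, not an existing kernel API):

/* Hypothetical sketch only: page_migrate() is the proposed call,
 * with the prototype assumed from the text above. */
#include <sys/types.h>

long page_migrate(pid_t pid, unsigned long va_start, unsigned long va_end,
                  int count, const int *old_node_list,
                  const int *new_node_list);

int migrate_whole_job(pid_t pid)
{
        int old_nodes[] = { 0, 1, 2, 3 };
        int new_nodes[] = { 4, 5, 6, 7 };

        /* va_start = 0, va_end = all-ones (the MAX_INT of the text):
         * span every vma of the pid */
        return page_migrate(pid, 0UL, ~0UL, 4, old_nodes, new_nodes);
}
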
Note that nothing here requires parsing of /proc/pid/maps,
etc.  However, very large systems may use the system call
in special ways, e. g.:
(1)  They may decide to suspend processes before migration.
(2)  They may decide to optimize the migration process by
 trying to migrate large shared objects only once,
 in the sense that only one scan of a large shared
 object will be done.
Issues of complexity related to the above are reserved for
those systems who choose to use the system call in this way.
Please note, however, that this is a performance optimization
that some systems MAY decide to do.  There is NO REQUIREMENT
that any user follow these steps from a correctness point of
view; the page_migrate() system call will still do the correct
thing.
Now, I know that is complicated and a lot of verbiage.  But this
would satisfy our requirements and I think it would address
the concern that the page_migrate() call was built just to
satisfy SGI requirements.
Comments, flames, suggestions, etc, as usual are all welcome.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
You and Robin mentioned some problems with double migration
with that, but it's still not completely clear to me what
problem you're solving here. Perhaps that needs to be reexamined.

There is one other case where Robin and I have talked about double
migration.  That is the case where the set of old nodes and new
nodes overlap.  If the system call interface
is assumed to be something like:
page_migrate(pid, old_node, new_node);
then if one is not careful (and depending on what the complete lists
of old_nodes and new_nodes are), a sequence like:
page_migrate(pid, 1, 2);
page_migrate(pid, 2, 3);
can end up actually moving pages from node 1 to node 2,
only to move them again from node 2 to node 3.  This is another
form of double migration that we have worried about avoiding.
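
One way to avoid that, assuming the single-pair call form above, is to
order the calls so that no destination node is still a pending source;
a minimal sketch (the prototype is the assumed pair form, not a real API):

#include <sys/types.h>

long page_migrate(pid_t pid, int old_node, int new_node);

/* For an upward shift old -> old+1 over [first, last], issue the
 * highest pair first (2->3 before 1->2 in the example above), so
 * migrated pages are never picked up again by a later call. */
void migrate_shift_up(pid_t pid, int first, int last)
{
        int n;

        for (n = last; n >= first; n--)
                page_migrate(pid, n, n + 1);
}
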
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi, et al:
I see that  several messages have been sent in the interim.
I apologize for being out of sync, but today is my last
day to go skiing and it is gorgeous outside.  I'll try
to catch up and digest everything later.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: Requires Windows 98 or better,
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-18 Thread Ray Bryant
Andi Kleen wrote:
[Enjoy your vacation]
[I am thanks -- or I was -- I go home tomorrow]
I assume they would allow marking arbitrary segments with specific
policies, so it should be possible.
An alternative way to handle shared libraries BTW would be to add the ELF
headers Steve did in his patch. And then handle them in user space
in ld.so and let it apply the necessary policy. 

This won't work for non ELF files though.
Would I then have to get a sign-off from the ld.so maintainer to get that patch
in?  :-(
This sounds more general than the xattr attribute thing I was thinking
of (i. e. marking a file non-migratable or library)
Well, we can work the exact details of this part later.

(2)  Something along the lines of:
page_migrate(pid, old_node, new_node);
or perhaps
page_migrate(pid, old_node_mask, new_node_mask);

+ node mask length. 

I don't like old_node* very much because it's imho unreliable
(because you can usually never fully know on which nodes the old
process was and there can be good reasons to just migrate everything)
In our case, it turns out we do because the job is running inside of
a cpuset.  So it can't allocate memory outside of that cpuset.  In
more general scenarios, you are right, you don't know.  But this
can be handled with a MIGRATE_NODE_ANY (more below).
I assume the second way would be more flexible, although I found
having node masks for this has the problem that you tend to allocate
most memory on the lowest numbered node because it's not easy to
round-robin over all set nodes (that's an issue in PREFERRED policy
in NUMA API currently). So maybe the simple  new_node argument
is preferable.
page_migrate(pid, new_node)
(or putting it into a writable /proc file if you prefer that)   

or
(3)  mbind() with a pid argument?

That would bring it to 7 arguments, really too much for a system
call (and a function in general). Also it would mean needing
to know about other process private addresses again.
Maybe set_mempolicy, but a new call is probably better.
OK, lets assume we have a new call of some kind then.

But I think I now understand why you want this complicated
user space control. You want to preserve relative ordering
on a set of nodes, right? 

e.g. job runs threads on nodes 0,1,2,3  and you want it to move
to nodes 4,5,6,7 with all memory staying at the same
distance from the new CPUs as it was from the old CPUs, right? 
Yes, that's it:  we want the relative distances between the pages
on the new set of nodes to match the distances on the old set of
nodes as much as is possible, or we at least want a sufficiently
powerful system call to let us do this if the correct set of new
nodes is available.  This is so the application has the same
level of performance before and after the migration call.
In actuality, what we intend to do is to use this API to migrate
jobs from one cpuset to another; we will require the administrator
to set up the cpusets so they are topologically equivalent for cpusets
of the same size.  If they don't do that, then performance can
change when a job is migrated.
It explains why you want old_node, you would do 
(assuming node mask arguments) 

page_migrate(pid, 0, 4)
page_migrate(pid, 1, 5)
...
page_migrate(pid, 3, 7) 

keeping the memory in the same relative order. Problem is what happens
when some memory is in some other node due to memory pressure fallbacks.
Your scheme would not migrate this memory at all. While you may
get away with this in your application I think it would make 
page migration much less useful in the general case than it could
be.  e.g. for a single threaded process it is very useful to just
force all its pages that have been allocated on multiple nodes
to a specific node. I would like to have this option at least, 
but with old node it would be rather inefficient. Ok, I guess you could
add a wildcard value for it; I guess that would work.

The patch that I sent out already defines MIGRATE_NODE_ANY to request
any other available node; this is needed for those cases where memory
hotplug just wants to move the page off of this node.  I don't
see why we couldn't allow this as a value for old node, and it
would mean migrate all pages.  (i. e. MIGRATE_NODE_ANY matches
pages on all nodes.)
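
A sketch of that wildcard match (the constant's value here is invented
for illustration; MIGRATE_NODE_ANY itself is from Ray's patch):

#define MIGRATE_NODE_ANY  (-1)          /* value is an assumption */

/* Does a page currently on page_node match this old_node_list entry? */
static int old_node_matches(int page_node, int old_node)
{
        return old_node == MIGRATE_NODE_ANY || old_node == page_node;
}
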
Problem is still that you would need to iterate through all nodes for your 
migration scenario (or how would you find out where the job  allocated
its old pages?), which is not very nice.
Agreed.  Which is why  we really prefer an old_node_list, new_node_list,
then we iterate across pages and make the indicated decision for each
page.
Perhaps node masks would be better and teaching the kernel to handle
relative distances inside the masks transparently while migrating?
Not sure how complicated this would be to implement though.
Supporting interleaving on the new nodes may be also useful, that would
need a policy argument at least too and masks.
The worry I have about using node masks is that it is not as general as

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-17 Thread Andi Kleen
[Sorry for the late answer.]

On Tue, Feb 15, 2005 at 09:44:41PM -0600, Ray Bryant wrote:
> >
> >
> >Sorry, but the only real difference between your API and mbind is that
> >yours has a pid argument. 
> >
> 
> That may be true, but the internals of the implementations have got
> to be pretty different as near as I can tell.  So just because the

Not necessarily. E.g. Steve's file attribute patch actually
implemented very simple page migration into NUMA API 
because he needed it to solve some problems with allocation.
It was even exposed as a new mbind() flag.

> >Main cases:
> >
> >- Program is NUMA API aware. Fine.  It takes care of its own.
> 
> Yes, we could migrate this program using a migration facility
> embedded in the NUMA API.
> 
> >- Program is not aware, but is started with a process policy from
> >numactl/cpusets/batch scheduler. Already covered too in NUMA API.
> 
> Hmmm... What about the case where no NUMA API is used and cpusets

First the NUMA API internally doesn't care that much about this 
case. It just considers no policy as "DEFAULT" policy which
just happens to be what you call first-touch.

But there is no fundamental reason you can't change the policy
of an existing program externally. It is already implemented for some
kinds of named objects (shmfs etc.), but it can be extended to
more.

> >- Program is not aware and hasn't been started with a policy
> >(or has and you change your mind) but you want to change it later.

> I'm having a little trouble parsing the "it" in that sentence.
> Does that sentence mean "you want to change the NUMA API later"?

The policy. In this case policy means including the page placement
(this would be MPOL_F_STRICT) 

> What if there never is a NUMA API structure associated with
> the program other than the default (local) policy?

If you have some generic facility to change policy externally
it doesn't matter if there was policy before or not. 

> The fundamental disconnect here is that I think that very few
> programs use the NUMA API, and you think that most programs do.

All programs use NUMA policy (assuming you have a CONFIG_NUMA kernel) 
Internally it's all the same.

Hmm, I see perhaps my distinction of these cases with programs
already using NUMA API and not using it was not very useful
and led you off on a tangent. Perhaps we can just drop it.

I think one problem you have is that you essentially
want to keep DEFAULT policy, but change the nodes.
NUMA API currently doesn't offer a way to do that, 
not even with Steve's patch that does simple page migration.
You only get a migration when you set a BIND or PREFERRED
policy, and then it would stay. But I guess you could
force that and then set back DEFAULT. It's a bit ugly,
but not too bad.
> 
> Let me expand on that a bit.  What most programs do on Altix is
> to do first-touch to get data allocated locally.  That is, let's
> say you have a big array that your parallel computation is going to
> work on.  The programmer would sit down and say, I want processor 1
> to work on this part of the array, processor 2 on that part, etc.
> Then the programmer writes code that causes each processor to touch
> the portions of the data array that should be allocated locally on
> that processor.  Bingo, storage is now allocated the way the user
> wants it, and no NUMA API call was ever issued.

Sure, but NUMA API goes to great pains to handle such programs.
> 
> Yes, it is clumsy, but that is because these programs were written
> before your NUMA API came into being.  Now we simply can't go back
> to these people (many of them ISV's) and say "Please rewrite your
> code to use the NUMA API."  So we are left with a pile of legacy
> programs that we have to be able to migrate that don't have any
> NUMA API data structures associated with them.  What are we
> supposed to do in this case?


> 
> We can't necessarily construct a NUMA API that will cause storage
> to be allocated as the programmer intended, because we can't fathom
> what the programmer was trying to accomplish based on the state
> of the program when we go to migrate it.  So how would we use
> a migration facility embedded into the NUMA API to migrate this
> program and maintain its old topology?

numactl went to great pains to handle such programs. Take
a look at all the command line options ;-)

If the program is using shm and you applied the patch
to do page migration in mbind() you could handle it right now:

- map the shm segment into the management process. 
- change policy with mbind(), triggering page migration
- set back default policy.
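
In code, that recipe might look roughly like the sketch below.  The
migrate-on-mbind() behaviour comes from Steve's unmerged patch, so
treating a strict bind as "move the pages now" is an assumption here;
mbind() itself and the MPOL_* modes are the existing NUMA API:

#include <sys/shm.h>
#include <numaif.h>

/* Sketch: migrate a SysV shm segment of len bytes to 'node' from a
 * management process, then restore DEFAULT so nothing stays bound. */
void migrate_shm(int shmid, size_t len, int node)
{
        unsigned long mask = 1UL << node;
        void *p = shmat(shmid, NULL, 0);        /* map into this process */

        /* BIND triggers the (patched) page migration to 'node' */
        mbind(p, len, MPOL_BIND, &mask, sizeof(mask) * 8, MPOL_MF_STRICT);
        mbind(p, len, MPOL_DEFAULT, NULL, 0, 0); /* set back DEFAULT */
        shmdt(p);
}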

For other objects (files etc.) there are patches in the pipeline.

The only hole that's still there is anonymous memory, but I think
we can fill that much more simply than what you're proposing, with
a "migrate whole process except when policy says otherwise" call.



> >That's the new case we discuss here. 
> >
> >Now how to change policy of objects in an already running process.
> >
> 
> If the running process has a non-trivial mempolicy defined for
> all of its address space, then I think I understand this.


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
Making memory migration a subset of page migration is not a general
solution.  It only works for programs that are using memory policy
to control placement.   As I've tried to point out multiple times
before, most programs that I am aware of use placement based on
first-touch.  When we migrate such programs, we have to respect
the placement decisions that the program has implicitly made in
this way.

Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

That may be true, but the internals of the implementations have got
to be pretty different as near as I can tell.  So just because the
API's are nearly the same doesn't imply that the internals are at
all the same.  And I'm convinced that using node masks is an
insufficiently general approach to specifying page migration.
But let's save that discussion for a later note, ok?
I think we are talking by each other, here's a more structured
overview of my thinking on the issue.
I'm sure that is what is going on and we face little other choice
than to keep our good humor about this and keep trying until we see
our way clear to a common understanding.  :-)
Main cases:
- Program is NUMA API aware. Fine.  It takes care of its own.
Yes, we could migrate this program using a migration facility
embedded in the NUMA API.
- Program is not aware, but is started with a process policy from
numactl/cpusets/batch scheduler. Already covered too in NUMA API.
Hmmm... What about the case where no NUMA API is used and cpusets
are used as containers, and page placement is done by first touch.
Then there is no NUMA API whatsoever.  I think this is the category
where most of the programs in a large Altix system would fall.
(See more on this below)
- Program is not aware and hasn't been started with a policy
(or has and you change your mind) but you want to change it later.
I'm having a little trouble parsing the "it" in that sentence.
Does that sentence mean "you want to change the NUMA API later"?
What if there never is a NUMA API structure associated with
the program other than the default (local) policy?
The fundamental disconnect here is that I think that very few
programs use the NUMA API, and you think that most programs do.
Eventually more programs will use the NUMA API, but I don't think
they do at the present time.
Let me expand on that a bit.  What most programs do on Altix is
to do first-touch to get data allocated locally.  That is, let's
say you have a big array that your parallel computation is going to
work on.  The programmer would sit down and say, I want processor 1
to work on this part of the array, processor 2 on that part, etc.
Then the programmer writes code that causes each processor to touch
the portions of the data array that should be allocated locally on
that processor.  Bingo, storage is now allocated the way the user
wants it, and no NUMA API call was ever issued.
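
As a minimal sketch of that idiom (assuming the runtime has already
pinned each thread to its own CPU/node, which is what makes first-touch
placement work):

#include <string.h>

#define NTHREADS 4

/* Thread 'self' zeroes only its own slice; the first write faults each
 * page in on (near) that thread's node under first-touch placement. */
void first_touch_init(double *a, long n, int self)
{
        long chunk = n / NTHREADS;

        memset(a + (long)self * chunk, 0, chunk * sizeof(double));
}
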
Yes, it is clumsy, but that is because these programs were written
before your NUMA API came into being.  Now we simply can't go back
to these people (many of them ISV's) and say "Please rewrite your
code to use the NUMA API."  So we are left with a pile of legacy
programs that we have to be able to migrate that don't have any
NUMA API data structures associated with them.  What are we
supposed to do in this case?
We can't necessarily construct a NUMA API that will cause storage
to be allocated as the programmer intended, because we can't fathom
what the programmer was trying to accomplish based on the state
of the program when we go to migrate it.  So how would we use
a migration facility embedded into the NUMA API to migrate this
program and maintain its old topology?
That's the fundamental question here.  Can you address that
question specifically for me, please?
That's the new case we discuss here. 

Now how to change policy of objects in an already running process.
If the running process has a non-trivial mempolicy defined for
all of its address space, then I think I understand this.  This
is not where our disconnect lies.  The disconnect is in the above, I
think.
First, there are some special cases already handled or
with existing patches:
- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
the object into a different process and changing the policy there.
- shared libraries and mmaped files in general: this is a generalization of
the previous point. SteveL's patch is the beginning of handling this, although
it needs some more work (xattrs) to make the policy persistent over
memory pressure.
The only case not covered is anonymous memory. 

You said it would need user space control, but the main reason for
wanting that seems to be to handle the non-anonymous cases which
are already covered above.
Yes, so long as the rest of the cases were handled in user space, then
the anonymous memory case has to be handled there as well.
My thinking is the simplest way to handle that is to have a call that just
migrates everything.

Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Paul Jackson
Thanks Andi for your effort to present your case more completely.
I agree that there is some 'talking by each other' going on.

Dave Hansen has publically (and Ray privately) sought to
move this discussion to linux-mm (or more specifically,
off lkml for now).

Any chance, Andi, that you could repost this, in response
to Ray's restarting this thread on linux-mm, once he gets
around to that?

I will reserve my response until I see if that works out.

Thanks.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Good explanation, Robin.  Thanks.

See y'all on linux-mm.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Robin Holt
On Wed, Feb 16, 2005 at 08:58:19AM +1100, Peter Chubb wrote:
> > "Robin" == Robin Holt <[EMAIL PROTECTED]> writes:
> 
> Robin> On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> >> What about the suggestion I had that you sort of skipped over,
> >> which amounted to changing the system call from a node array to
> >> just one node:
> >> 
> >> sys_page_migrate(pid, va_start, va_end, count, old_nodes,
> >> new_nodes);
> >> 
> >> to:
> >> 
> >> sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> >> 
> >> Doesn't that let you do all you need to?  Is it insane too?
> 
> Robin> Migration could be done in most cases and would only fall apart
> Robin> when there are overlapping node lists and no nodes available as
> Robin> temp space and we are not moving large chunks of data.
> 
> A possibly stupid suggestion: 
> 
> Can page migration be done lazily, instead of all at once?  Move the
> process, mark its pages as candidates for migration, and when 
> the page faults, decide whether to copy across or not...
> 
> That way you only copy the pages the process is using, and only copy
> each page once.  It makes copy for replication easier in some future
> incarnation, too, because the same basic infrastructure can be used.

I would agree that lazy might be possible, but then we need to keep track
of the desired destination and can not rely upon first touch as that
will likely result in scrambling the memory of the application.

I have been very lax in describing how a typical MPI application works.
This method has been in place for years and is commonly accepted practice.

In the MPI model, a set of large mappings are done by the first process.
It then forks x number of worker threads which touch their chunk of
memory and rendezvous with the other workers.  Once all workers have
rendezvoused, they are allowed to start their processing.  A typical
worker thread will reference its memory set 85-97% of the time and
reference other memory sets in a read-only fashion the other part
of the time.

It is important to performance that the worker thread's memory remains
as close to its cpu as possible.  Any time the memory is on a different
node, the performance of that thread degrades (memory is further away)
and performance of the other thread is hindered (its memory controller
is more busy) and the read portions of the neighbor threads to both
of the aforementioned worker threads are hindered as there is more
NUMA activity.  Given all that, there is a common concept in MPI called
a barrier where, when worker threads complete a work set, they awaken
threads waiting at the barrier associated with that work set.  As a
result of this wait, by slowing down a single thread you can have a
cascade effect which slows down the entire application significantly
as barriers are missed.

Because of all this discussion, memory placement needs to be thought of
as relative to the worker threads and maintained relatively consistent
before and after the migration.

Another issue with making it a lazy migrate is that the real impetus
for this is to free up memory on a node: a job is stopped on one set
of nodes and migrated to another so that the original nodes are freed
for a second job, one which would otherwise not fit, or would perform
too poorly, with the original job taking up a section of the machine.

Sorry for the long rambling explanation.  I guess I will try to
break this into smaller chunks on the upcoming discussion on the
linux-mm list.

Thanks,
Robin


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Dr Peter Chubb writes:
> Can page migration be done lazily, instead of all at once?

That might be a useful option.  Not my area to comment on.

We would also require, at least as an option, to be able to force the
migration on demand.  Some of our big honkin iron parallel jobs run with
a high degree of parallelism, and nearly saturate each node being used. 
For jobs like that, it can be better to get everything in place, before
resuming execution.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Andi Kleen
> Making memory migration a subset of page migration is not a general
> solution.  It only works for programs that are using memory policy
> to control placement.   As I've tried to point out multiple times
> before, most programs that I am aware of use placement based on
> first-touch.  When we migrate such programs, we have to respect
> the placement decisions that the program has implicitly made in
> this way.

Sorry, but the only real difference between your API and mbind is that
yours has a pid argument. 

I think we are talking by each other, here's a more structured
overview of my thinking on the issue.

Main cases:

- Program is NUMA API aware. Fine.  It takes care of its own.
- Program is not aware, but is started with a process policy from
numactl/cpusets/batch scheduler. Already covered too in NUMA API. 
- Program is not aware and hasn't been started with a policy
(or has and you change your mind) but you want to change it later. 
That's the new case we discuss here. 

Now how to change policy of objects in an already running process.

First, there are some special cases already handled or
with existing patches:
- tmpfs/hugetlbfs/sysv shm: numactl can handle this by just mapping
the object into a different process and changing the policy there.
- shared libraries and mmaped files in general: this is a generalization of
the previous point. SteveL's patch is the beginning of handling this, although
it needs some more work (xattrs) to make the policy persistent over
memory pressure.

The only case not covered is anonymous memory. 

You said it would need user space control, but the main reason for
wanting that seems to be to handle the non-anonymous cases which
are already covered above.

My thinking is the simplest way to handle that is to have a call that just
migrates everything. The main reason for that is that I don't think external
processes should mess with virtual addresses of another process.
It just feels unclean and has many drawbacks (parsing /proc/*/maps
needs complicated user code, racy, locking difficult).  

In kernel space handling full VMs is much easier and safer due to better 
locking facilities.

In user space only the process itself really can handle its own virtual 
addresses well, and if it does that it can use NUMA API directly anyways.

You argued that it may be costly to walk everything, but I don't see this
as a big problem - first walking mempolicies is not very costly and then
fork() and exit() do exactly this already. 

The main missing piece for this would be a way to make policies for
files persistent. One way would be to use xattrs like selinux, but
that may be costly (not sure we want to read xattrs all the time
when reading a file). 

A hackish way to do this that already 
works would be to do a mlock on one page of the file to keep
the inode pinned. E.g. the batch manager could do this. That's 
not very clean, but would probably work. 
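
A rough sketch of that hack (assuming the batch manager keeps the
returned mapping alive for the lifetime of the job):

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

/* mlock one page of the file so its inode (and any policy hung off
 * it) stays pinned while the mapping exists.  Sketch only, no error
 * handling. */
void *pin_inode(const char *path)
{
        int fd = open(path, O_RDONLY);
        void *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);

        mlock(p, 4096);
        close(fd);      /* the mapping keeps the file referenced */
        return p;       /* munlock()/munmap() when the job is done */
}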

-Andi


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Dave Hansen
In the interest of the size of everyone's inboxes, I mentioned to Ray
that we might move this discussion to a smaller forum while we resolve
some of the outstanding issues.  Ray's going to post a followup to to
linux-mm, and trim the cc list down.  So, if you're still interested,
keep your eyes on linux-mm and we'll continue there.  

-- Dave



Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Robin Holt
On Tue, Feb 15, 2005 at 08:35:29AM -0800, Paul Jackson wrote:
> What about the suggestion I had that you sort of skipped over, which
> amounted to changing the system call from a node array to just one
> node:
> 
> sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
> 
> to:
> 
> sys_page_migrate(pid, va_start, va_end, old_node, new_node);
> 
> Doesn't that let you do all you need to?  Is it insane too?

Migration could be done in most cases and would only fall apart when
there are overlapping node lists and no nodes available as temp space
and we are not moving large chunks of data.

What is the fundamental concern with passing in an array of integers?
That seems like a fairly easy to verify item with very little chance
of breaking.  I don't feel the concern that others seem to.

I do see the benefit to those arrays as being a single pass through the
page tables, the ability to migrate without using a temporary node, and
reducing the number of times data is copied when there are overlapping
nodes.  To me, those seem to be very compelling reasons when compared
to the potential for a possible problem with an array of integers.
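
For what it's worth, the per-page lookup described above is tiny; a
sketch (the helper name is invented for illustration):

/* Given the node a page currently lives on, return its destination in
 * one pass over the arrays; -1 means leave the page where it is. */
int target_node(int page_node, int count,
                const int *old_nodes, const int *new_nodes)
{
        int i;

        for (i = 0; i < count; i++)
                if (old_nodes[i] == page_node)
                        return new_nodes[i];
        return -1;
}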

Thanks,
Robin


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Ray Bryant
Dave Hansen wrote:
On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
What is the fundamental opposition to an array of from-to node mappings?
They are not that difficult to follow.  They make the expensive traversal
of ptes the single pass operation.  The time to scan the list of from nodes
to locate the node this page belongs to is relatively quick when compared
to the time to scan ptes and will result in probably no cache thrashing
like the long traversal of all ptes in the system required for multiple
system calls.  I can not see the node array as anything but the right way
when compared to multiple system calls.  What am I missing?

I don't really have any fundamental opposition.  I'm just trying to make
sure that there's not a simpler (better) way of doing it.  You've
obviously thought about it a lot more than I have, and I'm trying to
understand your process.
As far as the execution speed with a simpler system call goes: yes, it will
likely be slower.  However, I'm not sure that the increase in scan time
is all that significant compared to the migration code (it's pretty
slow).
-- Dave

I'm worried about doing all of those find_get_page() things over and over
when the mapped file we are migrating is large.  I suppose one can argue
that that is never going to be the case (e. g. no one in their right mind
would migrate a job with a 300 GB mapped file).  So we are back to the
overlapping set of nodes issue.  Let me look into this some more.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
[Sorry, didn't answer everything in your mail the first time. 
See previous mail for beginning]

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
migrating, and figure out from that what portions of which pid's
address spaces need to migrated so that we satisfy the constraints
given above.  I admit that this may be viewed as ugly, but I really
can't figure out a better solution than this without shuffling a
ton of ugly code into the kernel.

I like the concept of marking stuff that shouldn't be migrated
externally (using NUMA policy) better. 

I really don't have an objection to that for the case of the shared
libraries in, for example, /lib and /usr/lib.  I just worry about making
sure that all of the libraries have so been marked.  I can do this
in a much simpler way by just adding a list of "do not migrate stuff"
to the migration library rather than requiring Steve Longerbeam's
API.

One issue that hasn't been addressed is the following:  given a
particular entry in /proc/pid/maps, how does one figure out whether
that entry is mapped into some other process in the system, one
that is not in the set of processes to be migrated?   One could

[...]
Marking things externally would take care of that.
So the default would be that if the file is not marked as "not-migratable",
then the file would be migratable, is that the idea?

If we did this, we still have to have the page migration system call
to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
pages were placed by first touch and for which there never was
a memory policy.  As discussed in a previous note, we are not in a

You can handle those with mbind(..., MPOL_F_STRICT); 
(once it is hooked up to page migration) 
Making memory migration a subset of page migration is not a general
solution.  It only works for programs that are using memory policy
to control placement.   As I've tried to point out multiple times
before, most programs that I am aware of use placement based on
first-touch.  When we migrate such programs, we have to respect
the placement decisions that the program has implicitly made in
this way.
Requiring memory migration to be a subset of the NUMA API is a
non-starter for this reason.   We have to follow the approach
of doing the correct migration, followed by fixing up the NUMA
policy to match the new reality.  (Perhaps we can do this as
part of memory migration.)
Until ALL programs use the NUMA mempolicy for placement
decisions, we cannot support page migration under the NUMA
API.
I don't understand why this is not clear to you.  Are you
assuming that you can manufacture a NUMA API for the new
location of the job that correctly represents the placement
information and toplogy of the job on the old set of nodes?
Just mmap the tmpfs/shm/hugetlb file in an external program and apply
the policy. That is what numactl supports today too for shm
files like this.
It should work later.
Wait.  As near as I can tell you
-Andi

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Dave Hansen
On Tue, 2005-02-15 at 04:50 -0600, Robin Holt wrote:
> What is the fundamental opposition to an array of from-to node mappings?
> They are not that difficult to follow.  They make the expensive traversal
> of ptes the single pass operation.  The time to scan the list of from nodes
> to locate the node this page belongs to is relatively quick when compared
> to the time to scan ptes and will result in probably no cache thrashing
> like the long traversal of all ptes in the system required for multiple
> system calls.  I can not see the node array as anything but the right way
> when compared to multiple system calls.  What am I missing?

I don't really have any fundamental opposition.  I'm just trying to make
sure that there's not a simpler (better) way of doing it.  You've
obviously thought about it a lot more than I have, and I'm trying to
understand your process.

As far as the execution speed with a simpler system call goes: yes, it will
likely be slower.  However, I'm not sure that the increase in scan time
is all that significant compared to the migration code (it's pretty
slow).

-- Dave



Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Andi Kleen
> I really don't see how that is relevant to the current discussion, which,
> AFAIK, is that the kernel interface should be "migrate an entire process"
> versus what I have proposed.  What we are trying to avoid here for shared
> libraries is two things:  (1) don't migrate them needlessly, and (2) don't
> even make the migration request if we know that the pages shouldn't be
> migrated.
> 
> Using Steve Longerbeam's approach avoids (1).  But you will still scan the
> pte's of the processes to be migrated (if you go with a "migrate the
> entire process" approach) and try to migrate them, only to find out that
> they are pinned in place.  How is that a good thing?

You don't scan any PTEs, just the mempolicy tree. That is extremely
cheap. 

> >>(The page migration code from the memory hotplug patch will handle
> >>updating the pte's of the other processs (thank goodness for
> >>rmap...))
> >
> >
> >I don't get this. Surely the migration code will check if a page
> >is already in the target node, and when that is the case do nothing.
> >
> >How could this "double migration" happen? 
> 
> Not so much a double migration, but a double request for migration.
> (This is not a correctness, but a performance issue, once again.)
> Consider the case of a 300 GB file mapped into 256 pid's.  One doesn't
> want each pid to try to migrate the file pages.  Granted, all after the

Again file policy nicely takes care of this.

> first one will find the data already migrated, but if you issue a
> migration request for each address space, the others won't know that
> the page has been migrated until they have found the page and looked
> up its current node.  That means doing a find_get_page() for each page
> in the mapped file in all 256 address spaces, and 255 of those address

You just look at the mempolicy extent tree linked from the
address space.

> >
> >>(3)  In the case where a particular file is mapped into different
> >>processes at different file offsets (and we are migrating both
> >>of the processes), one has to examine the file offsets to figure
> >>out if the mappings overlap or not. If they overlap, then you've
> >>got to issue two calls, each of which describes a non-overlapping
> >>region; both calls taken together would cover the entire range
> >>of pages mapped to the file.  Similarly if the ranges do not
> >>overlap.
> >
> >
> >That sounds like a quite obscure corner case which I'm not sure
> >is worth all the complexity.
> >
> >-Andi
> >
> >
> 
> So what is your solution when this happens?  Make the job non-migratable?
> Yes, it may be an obscure case in your view but we've got to handle all of
> those cases to make a robust facility that can be used in a production 
> environment.

With per file policies you really don't care if there are overlaps or
not. You then care about offsets inside the object, not addresses
in some process virtual memory image. You just set the policy to migrate or
not migrate to the file and set a "lock bit" (that would
need to be added) and then no one else will touch the policy.

-Andi


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Andi Kleen wrote:
(1)  You really don't want to migrate the code pages of shared libraries
that are mapped into the process address space.  This causes a
useless shuffling of pages which really doesn't help system
performance.  On the other hand, if a shared library is some
private thing that is only used by the processes being migrated,
then you should move that.

I think the better solution for this would be to finally integrate Steve L.'s 
file attribute code (and find some solution to make it persistent,
e.g. using xattrs with a new inode flag) and then "lock" the shared 
libraries to their policy using a new attribute flag.

I really don't see how that is relevant to the current discussion, which,
AFAIK, is that the kernel interface should be "migrate an entire process"
versus what I have proposed.  What we are trying to avoid here for shared
libraries is two things:  (1) don't migrate them needlessly, and (2) don't
even make the migration request if we know that the pages shouldn't be
migrated.
Using Steve Longerbeam's approach avoids (1).  But you will still scan the
pte's of the processes to be migrated (if you go with a "migrate the
entire process" approach) and try to migrate them, only to find out that
they are pinned in place.  How is that a good thing?
A much simpler way to do this would be to add a list of libraries that
you don't want to be migrated to the migration library that I have
proposed to be the interface between the batch scheduler and the kernel.
Then when the library scans the /proc/pid/maps stuff, it can exclude
those libraries from migration.  Furthermore, no migration requests will
even be initiated for those parts of the address space.
Of course, this means maintaining a library list in the migration
library.  We may eventually decide to do that.  For now, we're following
up on the reference count approach I outlined before.

(2)  You really only want to migrate pages once.  If a file is mapped
into several of the pid's that are being migrated, then you want
to figure this out and issue one call to have it moved wrt one of
the pid's.
(The page migration code from the memory hotplug patch will handle
updating the pte's of the other processs (thank goodness for
rmap...))

I don't get this. Surely the migration code will check if a page
is already in the target node, and when that is the case do nothing.
How could this "double migration" happen? 
Not so much a double migration, but a double request for migration.
(This is not a correctness, but a performance issue, once again.)
Consider the case of a 300 GB file mapped into 256 pid's.  One doesn't
want each pid to try to migrate the file pages.  Granted, all after the
first one will find the data already migrated, but if you issue a
migration request for each address space, the others won't know that
the page has been migrated until they have found the page and looked
up its current node.  That means doing a find_get_page() for each page
in the mapped file in all 256 address spaces, and 255 of those address
spaces will find the page has already been migrated.  How is that
useful?  I'd much rather migrate it once from the perspective of
a single address space, and then skip the scanning for pages to
migrate in all of the other address spaces.

(3)  In the case where a particular file is mapped into different
processes at different file offsets (and we are migrating both
of the processes), one has to examine the file offsets to figure
out if the mappings overlap or not. If they overlap, then you've
got to issue two calls, each of which describes a non-overlapping
region; both calls taken together would cover the entire range
of pages mapped to the file.  Similarly if the ranges do not
overlap.

That sounds like a quite obscure corner case which I'm not sure
is worth all the complexity.
-Andi

So what is your solution when this happens?  Make the job non-migratable?
Yes, it may be an obscure case in your view but we've got to handle all of
those cases to make a robust facility that can be used in a production 
environment.

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Robin Holt wrote:
On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
which is what you are asking for, I think.  The library's job
(in addition to suspending all of the processes in the list for
the duration of the migration operation, plus doing some other things
that are specific to sn2 hardware) would be to examine the

You probably want the batch scheduler to do the suspend/resume as it
may be parking part of the job on nodes that have memory but running
processes of a different job while moving a job out of the way for a
big-mem app that wants to run on one of this jobs nodes.
That works as well, and if we keep the majority of the work on
deciding who to migrate where and what to do when in a user space
library rather than in the kernel, then we have a lot more flexibility
in, for example who suspends/resumes the jobs to be migrated.

do memory placement by first touch, during initialization.  This is,
in part, because most of our codes originate on non-NUMA systems,
and we've typically done just what is necessary to make them

Software Vendors tend to be very reluctant to do things for a single
architecture unless there are clear wins.
Thanks,
Robin

--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Robin wrote:
> That seems like it is insane!

Thank-you, thank-you.  

What about the suggestion I had that you sort of skipped over, which
amounted to changing the system call from a node array to just one
node:

sys_page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);

to:

sys_page_migrate(pid, va_start, va_end, old_node, new_node);

Doesn't that let you do all you need to?  Is it insane too?
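
To make the comparison concrete, here is a sketch (the wrapper name is
hypothetical; only the argument lists above come from the thread) of how an
N-way "shift everything down one node" relocation decomposes under the
one-node-pair form:

    /* Hypothetical wrapper for the one-node-pair form: move every
     * page of [va_start, va_end) in pid's address space that currently
     * resides on old_node over to new_node; leave other pages alone. */
    long page_migrate(pid_t pid, unsigned long va_start,
                      unsigned long va_end, int old_node, int new_node);

    /* Shift a job from nodes 2..N down to 1..N-1.  Issuing the moves
     * in ascending node order ensures no page is migrated twice. */
    for (int n = 2; n <= N; n++)
            page_migrate(pid, va_start, va_end, n, n - 1);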

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Robin Holt
On Tue, Feb 15, 2005 at 07:49:06AM -0800, Paul Jackson wrote:
> Robin wrote:
> > Then how do you handle overlapping nodes.  If I am doing a 5->4, 4->3,
> > 3->2, 2->1 shift ...
> 
> Then do the shifts in the other order, first 2->1, then 3->2, ...
> 
> So now you ask, what if you are doing a rotation?  Use a temporary
> node: 2->tmp, 3->2, ..., N->(N-1), tmp->N.

Consider the case where you are moving 248GB of data off of that node
onto a temporary.  You have just made that a double copy to save the
difficulty of passing in an array.  That seems like it is insane!

> 
> So now you ask, what if you are doing a rotation involving _all_
> nodes, and have nothing you can use as a temporary node?

Not necessarily all nodes for the rotation, but if you have no free nodes
in the system aside from the ones you are working with.  That will be the
typical case.  The batch scheduler will have control of all the nodes
except the nodes that are dedicated to I/O.  These will also likely
have less memory on them.  The batch scheduler may have any number
of jobs running in small cpusets.  At the time of the migration, the
system may only have the nodes from the old and new jobs to work with.
Then you are stuck with a need for the arrays.

> 
> Argh I say ... would anyone really do that?  Or perhaps it makes
> sense to have the system call take a virtual address range (and
> hence a pid).  In which case, you can do one page at a time, if
> need be, and get any foolhardy migration possible.
> 
> Or perhaps some integration with Andi's mbind/mempolicy make sense.
> I'm not quite following Andi's comments on this, so I can't say
> one way or the other if this is good.

I think this is more closely related to cpusets, but that was not in when
Ray started working on his stuff.  The mem policy stuff does not handle
the immediate need to migrate (at least not that I see) and it does not
preserve node locality for already touched pages.  Assume we have a job
which has 16 processes which are doing work on 16 blocks of memory.
The code is designed to first touch the pages it will work with on
startup, rendezvous with the other processes, and then start working.
During its run, it needs access to its block 97% of the time and needs
to read from the other blocks 3% of the time.

With a mem policy, after the "migration" it is a race to see who touches
the page first to determine which node the memory is migrated to.  We need
to have a way to migrate the memory which preserves the placement information
the process has already given us.
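
As a sketch of the placement that has to be preserved (the block size,
names, and barrier setup are illustrative only): each worker faults in its
own block once before the rendezvous, so first touch puts that block's
pages on the worker's node.

    #include <string.h>
    #include <pthread.h>

    #define BLOCK_SZ (1UL << 30)    /* per-worker block size, made up */

    /* Zeroing its block from the worker that owns it allocates the
     * pages on that worker's node under a first-touch policy. */
    void worker_init(char *base, int worker_id, pthread_barrier_t *bar)
    {
            memset(base + worker_id * BLOCK_SZ, 0, BLOCK_SZ);
            pthread_barrier_wait(bar);   /* rendezvous, then start work */
    }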

Thanks,
Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Robin wrote:
> Then how do you handle overlapping nodes.  If I am doing a 5->4, 4->3,
> 3->2, 2->1 shift ...

Then do the shifts in the other order, first 2->1, then 3->2, ...

So now you ask, what if you are doing a rotation?  Use a temporary
node: 2->tmp, 3->2, ..., N->(N-1), tmp->N.

So now you ask, what if you are doing a rotation involving _all_
nodes, and have nothing you can use as a temporary node?

Argh I say ... would anyone really do that?  Or perhaps it makes
sense to have the system call take a virtual address range (and
hence a pid).  In which case, you can do one page at a time, if
need be, and get any foolhardy migration possible.

Or perhaps some integration with Andi's mbind/mempolicy make sense.
I'm not quite following Andi's comments on this, so I can't say
one way or the other if this is good.
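
Spelled out as a sketch (reusing the hypothetical one-pair page_migrate()
call from earlier in the thread), the ordering rule and the temporary-node
rotation look like:

    /* Shift down, lowest destination first: 2->1 before 3->2, and so
     * on, so no page is ever moved twice. */
    for (int n = 2; n <= N; n++)
            page_migrate(pid, va_start, va_end, n, n - 1);

    /* Full rotation using a free scratch node "tmp":
     * 2->tmp, 3->2, ..., N->(N-1), tmp->N. */
    page_migrate(pid, va_start, va_end, 2, tmp);
    for (int n = 3; n <= N; n++)
            page_migrate(pid, va_start, va_end, n, n - 1);
    page_migrate(pid, va_start, va_end, tmp, N);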

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Robin wrote:
> Requiring that the process is stopped will somewhat limit the use of
> this API outside of the HPC space where so much control can be had over
> the processes. 

Good point.  Hopefully we can find a way to design this system
call so that it does not require suspension.  Some uses of it
may well choose to suspend, but that's a user space choice.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Paul Jackson
Robin wrote:
> Given that the first user of this may place it onto a 256 node system,
> the chances that they use the same node in the source and destination node
> array are very good.

Am I parsing this sentence correctly when I read it as stating that we
need to handle the case where the source and destination node sets
overlap (have non-empty intersection)?

> I can not see the node array as anything but the right way
> when compared to multiple system calls.

Variable length arrays across the system call boundary are a pain in the
butt.  Especially ones that add what are essentially "new types", in this
case, an array of MAX_NUMNODES node numbers.  Odds are well over 50% that
there will be a bug in this area, in our lifetime.
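
As an illustration of that pain (a sketch only, not code from any actual
patch; the function name, error handling, and exact types are guesses), a
node-array call has to bound, allocate, and copy in both user arrays before
it can do any real work:

    /* Marshal "count" old/new node numbers across the syscall
     * boundary.  kfree(NULL) is a no-op, so the error paths below
     * are safe. */
    static long copy_node_arrays(const int __user *uold,
                                 const int __user *unew,
                                 unsigned int count, int **old, int **new)
    {
            size_t sz = count * sizeof(int);

            if (count == 0 || count > MAX_NUMNODES)
                    return -EINVAL;
            *old = kmalloc(sz, GFP_KERNEL);
            *new = kmalloc(sz, GFP_KERNEL);
            if (!*old || !*new)
                    goto err_nomem;
            if (copy_from_user(*old, uold, sz) ||
                copy_from_user(*new, unew, sz))
                    goto err_fault;
            return 0;
    err_fault:
            kfree(*old);
            kfree(*new);
            return -EFAULT;
    err_nomem:
            kfree(*old);
            kfree(*new);
            return -ENOMEM;
    }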

And simplicity is measured more, in my mind, by whether each specific
system call does the essential minimum of work, with clear pre and post
conditions, than by whether the caller is able to make the fewest number
of such calls.  Such reduction to the smallest irreducible atoms of work
both ensures that the kernel is best able to maintain order, and that it
can be used in the most flexible, unforeseeable patterns possible,
without further kernel changes.

Such a node array call may well make good sense as a library API.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Paul Jackson
Ray wrote:
> The exact ordering of when a task is moved to a new cpuset and when the
> migration occurs doesn't matter, AFAIK, if we accept the notion that
> a migrated task is in suspended state until after everything associated
> with it (including the new cpuset definition) is done.

The existence of _some_ sequence of system calls such that user space
could, if it so chose, do the 'right' thing does not exonerate the
kernel from enforcing its rules, on each call.

The kernel certainly does not have a crystal ball that lets it say "ok -
let this violation of my rules pass - I know that the caller will
straighten things out before it lets anything untoward occur (before
it removes the suspension, in this case)."

In other words, more directly, the kernel must return from each system
call with everything in order, all its rules enforced.

I still think that migration should honor cpusets, unless you can show
me a good reason why that's too cumbersome.  At least a migration patch
for *-mm should honor cpusets.  When the migration patch goes into
Linus's main tree, then it should honor cpusets there too, if cpusets
are already there.  Or if migration goes into Linus's tree before
cpusets, the onus would be on cpusets to add the changes to the
migration code honoring cpusets, when and if cpusets followed along.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Paul Jackson
Would it work to have the migration system call take exactly two node
numbers, the old and the new?  Have it migrate all pages in the address
space specified that are on the old node to the new node.  Leave any
other pages alone.  For one thing, this avoids passing a long list of
nodes, for an N-way to N-way migration. And for another thing, it seems
to solve some of the double migration and such issues.

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Paul Jackson
Robin wrote:
> for the second process and then from node 8 to node 4 for the second.

"for the second ... for the second"

I couldn't make sense of this statement.  Should one of those
seconds be a first; what word(s) are elided after the second
"second"?

-- 
  I won't rest till it's the best ...
  Programmer, Linux Scalability
  Paul Jackson <[EMAIL PROTECTED]> 1.650.933.1373, 
1.925.600.0401
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Robin Holt
On Tue, Feb 15, 2005 at 12:53:03PM +0100, Andi Kleen wrote:
> > (2)  You really only want to migrate pages once.  If a file is mapped
> >  into several of the pid's that are being migrated, then you want
> >  to figure this out and issue one call to have it moved wrt one of
> >  the pid's.
> >  (The page migration code from the memory hotplug patch will handle
> >  updating the pte's of the other processes (thank goodness for
> >  rmap...))
> 
> I don't get this. Surely the migration code will check if a page
> is already in the target node, and when that is the case do nothing.
> 
> How could this "double migration" happen? 

A node is not always equidistant to a cpu.  We need to keep node-to-cpu
distance relatively constant between the original and final placement.
There may be a time where you are moving stuff from node 8 to node 4
and stuff from node 12 to node 8.  If you scan the vmas for both of the
processes in the wrong order, you will migrate memory from node 12 to 8
for the second process and then from node 8 to node 4 for the second.

> > (3)  In the case where a particular file is mapped into different
> >  processes at different file offsets (and we are migrating both
> >  of the processes), one has to examine the file offsets to figure
> >  out if the mappings overlap or not. If they overlap, then you've
> >  got to issue two calls, each of which describes a non-overlapping
> >  region; both calls taken together would cover the entire range
> >  of pages mapped to the file.  Similarly if the ranges do not
> >  overlap.
> 
> That sounds like a quite obscure corner case which I'm not sure
> is worth all the complexity.

So obscure that nearly every example batch job we looked at had exactly
this circumstance.  Turns out that quite a few batch jobs we looked at
have a parent that maps their working set initially.  After the workers
are forked, they map some part of the same data file to different parts
of their own address space.  They also commonly map over the top of the
large file mapping that was originally done, leaving us with a jumble of
address space.  This really showed the need for a user-space application
to figure the problem out and allow the flexibility to come up with more
advanced migration algorithms.

Thanks,
Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview II

2005-02-15 Thread Andi Kleen
[Sorry, didn't answer to everything in your mail the first time. 
See previous mail for beginning]

On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
> migrating, and figure out from that what portions of which pid's
> address spaces need to be migrated so that we satisfy the constraints
> given above.  I admit that this may be viewed as ugly, but I really
> can't figure out a better solution than this without shuffling a
> ton of ugly code into the kernel.

I like the concept of marking stuff that shouldn't be migrated
externally (using NUMA policy) better. 

> 
> One issue that hasn't been addressed is the following:  given a
> particular entry in /proc/pid/maps, how does one figure out whether
> that entry is mapped into some other process in the system, one
> that is not in the set of processes to be migrated?   One could

[...]

Marking things externally would take care of that.

> If we did this, we still have to have the page migration system call
> to handle those cases for the tmpfs/hugetlbfs/sysv shm segments whose
> pages were placed by first touch and for which there used to not be
> a memory policy.  As discussed in a previous note, we are not in a

You can handle those with mbind(..., MPOL_F_STRICT); 
(once it is hooked up to page migration) 

Just mmap the tmpfs/shm/hugetlb file in an external program and apply
the policy. That is what numactl supports today too for shm
files like this.

It should work later.
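
A hedged sketch of that recipe (the path, length, and node number are made
up; note the flag is spelled MPOL_MF_STRICT in the kernel headers, and the
migrate-on-bind behavior is the part that is not hooked up yet):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <numaif.h>

    size_t len = 1UL << 30;            /* segment size, made up */
    int fd = open("/dev/shm/jobseg", O_RDWR);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    unsigned long nodemask = 1UL << 4; /* bind the segment to node 4 */
    mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8,
          MPOL_MF_STRICT);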


-Andi
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Andi Kleen
> (1)  You really don't want to migrate the code pages of shared libraries
>  that are mapped into the process address space.  This causes a
>  useless shuffling of pages which really doesn't help system
>  performance.  On the other hand, if a shared library is some
>  private thing that is only used by the processes being migrated,
>  then you should move that.

I think the better solution for this would be to finally integrate Steve L.'s 
file attribute code (and find some solution to make it persistent,
e.g. using xattrs with a new inode flag) and then "lock" the shared 
libraries to their policy using a new attribute flag.

> 
> (2)  You really only want to migrate pages once.  If a file is mapped
>  into several of the pid's that are being migrated, then you want
>  to figure this out and issue one call to have it moved wrt one of
>  the pid's.
>  (The page migration code from the memory hotplug patch will handle
>  updating the pte's of the other processes (thank goodness for
>  rmap...))

I don't get this. Surely the migration code will check if a page
is already in the target node, and when that is the case do nothing.

How could this "double migration" happen? 

> 
> (3)  In the case where a particular file is mapped into different
>  processes at different file offsets (and we are migrating both
>  of the processes), one has to examine the file offsets to figure
>  out if the mappings overlap or not. If they overlap, then you've
>  got to issue two calls, each of which describes a non-overlapping
>  region; both calls taken together would cover the entire range
>  of pages mapped to the file.  Similarly if the ranges do not
>  overlap.

That sounds like a quite obscure corner case which I'm not sure
is worth all the complexity.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Robin Holt
On Mon, Feb 14, 2005 at 06:29:45PM -0600, Ray Bryant wrote:
> which is what you are asking for, I think.  The library's job
> (in addition to suspending all of the processes in the list for
> the duration of the migration operation, plus do some other things
> that are specific to sn2 hardware) would be to examine the

You probably want the batch scheduler to do the suspend/resume as it
may be parking part of the job on nodes that have memory but running
processes of a different job while moving a job out of the way for a
big-mem app that wants to run on one of this job's nodes.

> do memory placement by first touch, during initialization.  This is,
> in part, because most of our codes originate on non-NUMA systems,
> and we've typically done just what is necessary to make them

Software Vendors tend to be very reluctant to do things for a single
architecture unless there are clear wins.

Thanks,
Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 7/7] mm: manual page migration -- sys_page_migrate

2005-02-15 Thread Robin Holt
On Mon, Feb 14, 2005 at 02:22:54PM -0800, Dave Hansen wrote:
> On Mon, 2005-02-14 at 16:01 -0600, Robin Holt wrote:
> > On Mon, Feb 14, 2005 at 10:50:42AM -0800, Dave Hansen wrote:
> > > On Mon, 2005-02-14 at 07:52 -0600, Robin Holt wrote:
> > > > The node mask is a list of allowed.  This is intended to be as near
> > > > to a one-to-one migration path as possible.
> > > 
> > > If that's the case, it would make the kernel internals a bit simpler to
> > > only take a "from" and "to" node, instead of those maps.  You'll end up
> > > making multiple syscalls, but that shouldn't be a problem.  
> > 
> > Then how do you handle overlapping nodes.  If I am doing a 5->4, 4->3,
> > 3->2, 2->1 shift in the memory placement and had only a from and to node,
> > I would end up calling multiple times.  This would end up in memory shifting
> > from 5->4 on the first, 4->3 on the second, ... with the end result of
> > all memory shifting to a single node.
> 
> Can you give an example of when you'd actually want to do this?

Assume it is moving from a 4,5,6,7,8,9 to 2,3,4,5,6,7 because it
wants to move jobs from nodes 8 and 9, which are topologically closer
to 10-15, and the job that was running there did not care about node
distances as much, but nodes 2 and 3 were busy when the job was starting.
Batch schedulers will use machines in very interesting ways that you
would never have imagined.  Give them the freedom to move a job around,
and you will get some really interesting new behavior.

Given that the first user of this may place it onto a 256 node system,
the chances that they use the same node in the source and destination node
array are very good.  If I focus on the word "actually" from above, I
can not give you a precise example of when this was asked for by a
user because this is in the early design phase as opposed to the late
troubleshooting phase.  Given the size of the machine we are dealing
with, it is certainly plausible that they will, at some time, ask to
migrate from and to an overlapping set of nodes.  I see this as even more
likely given that the decision will be made by their batch scheduler.
This example may be a bit simplistic, but there are certainly many times
where a batch scheduler decides that because of topology, it wants to
move stuff around some.

What is the fundamental opposition to an array of from-to node mappings?
They are not that difficult to follow.  They make the expensive traversal
of ptes a single-pass operation.  The time to scan the list of from nodes
to locate the node this page belongs to is relatively quick when compared
to the time to scan ptes, and will probably avoid the cache thrashing
caused by the repeated traversal of all ptes in the system that multiple
system calls would require.  I can not see the node array as anything but
the right way when compared to multiple system calls.  What am I missing?
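
A user-space model of that single pass (illustrative only; the "nid"
argument stands in for the per-page node lookup the kernel would do while
walking the ptes):

    #include <stddef.h>

    /* Map a page's current node through the from/to arrays.  The
     * short scan over "count" entries is cheap next to the pte walk
     * that produced the node number in the first place. */
    static int map_node(int nid, const int *old_nodes,
                        const int *new_nodes, size_t count)
    {
            for (size_t i = 0; i < count; i++)
                    if (old_nodes[i] == nid)
                            return new_nodes[i];   /* migrate target */
            return nid;                            /* not listed: leave alone */
    }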


Thanks,
Robin
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

2005-02-15 Thread Ray Bryant
Paul Jackson wrote:
Ray wrote:
[Thus the disclaimer in
the overview note that we haven't figured out all the interaction with
the memory policy stuff yet.]

Does the same disclaimer apply to cpusets?
Unless it causes some undue pain, I would think that page migration
should _not_ violate a task's cpuset.  I guess this means that a typical
batch manager would move a task to its new cpuset on the new nodes, or
move the cpuset containing some tasks to their new nodes, before asking
the page migrator to drag along the currently allocated pages from the
old location.
No, I think we understand the interaction between manual page migration
and cpusets.  We've tried to keep the discussion here disjoint from cpusets
for tactical reasons -- we didn't want to tie acceptance of the manual
page migration code to acceptance of cpusets.
The exact ordering of when a task is moved to a new cpuset and when the
migration occurs doesn't matter, AFAIK, if we accept the notion that
a migrated task is in suspended state until after everything associated
with it (including the new cpuset definition) is done.
--
---
Ray Bryant
512-453-9679 (work) 512-507-7807 (cell)
[EMAIL PROTECTED] [EMAIL PROTECTED]
The box said: "Requires Windows 98 or better",
 so I installed Linux.
---
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

