Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-25 Thread Carsten Otte
Avi Kivity wrote:
> Well, dup_mm() can't work (and now that I think about it, for more
> reasons -- what if the process has threads?).
We lock out multithreaded users already, -EINVAL.


-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-25 Thread Avi Kivity
Carsten Otte wrote:
> Avi Kivity wrote:
>> Well, dup_mm() can't work (and now that I think about it, for more
>> reasons -- what if the process has threads?).
> We lock out multithreaded users already, -EINVAL.


It would be much better if this could be avoided.  It's surprising.

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-25 Thread Carsten Otte
On Friday, 21.03.2008, at 11:29 -0700, Dave Hansen wrote:
> What you've done with dup_mm() is probably the brute-force way that I
> would have done it had I just been trying to make a proof of concept or
> something.  I'm worried that there are a bunch of corner cases that
> haven't been considered.
>
> What if someone else is poking around with ptrace or something similar
> and they bump the mm_users:
>
> +	if (tsk->mm->context.pgstes)
> +		return 0;
> +	if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
> +	    tsk->mm != tsk->active_mm || tsk->mm->ioctx_list)
> +		return -EINVAL;
> HERE
> +	tsk->mm->context.pgstes = 1;	/* dirty little tricks .. */
> +	mm = dup_mm(tsk);
>
> It'll race, possibly fault in some other pages, and those faults will be
> lost during the dup_mm().  I think you need to be able to lock out all
> of the users of access_process_vm() before you go and do this.  You also
> need to make sure that anyone who has looked at task->mm doesn't go and
> get a reference to it and get confused later when it isn't the task->mm
> any more.

Good catch, Dave. We intend to get rid of that race via task_lock().
That should lock out ptrace and all others who modify mm_users via get_task_mm.
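
To illustrate the idea, here is a minimal userspace sketch of the locking described above; it is not the kernel patch. The names struct mm_model, struct task_model, model_get_task_mm and model_enable_sie are hypothetical stand-ins, a C11 atomic_flag spinlock plays the role of task_lock(), and the replacement mm is assumed to have been prepared before the lock is taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for mm_struct and task_struct. */
struct mm_model {
	int mm_users;	/* reference count, like mm->mm_users */
	int pgstes;	/* set once extended page tables are in use */
};

struct task_model {
	atomic_flag lock;	/* plays the role of task_lock(tsk) */
	struct mm_model *mm;
};

static void model_task_lock(struct task_model *tsk)
{
	while (atomic_flag_test_and_set_explicit(&tsk->lock, memory_order_acquire))
		;	/* spin until the holder releases the lock */
}

static void model_task_unlock(struct task_model *tsk)
{
	atomic_flag_clear_explicit(&tsk->lock, memory_order_release);
}

/* Like get_task_mm(): a new reference is only taken under the task lock. */
static struct mm_model *model_get_task_mm(struct task_model *tsk)
{
	struct mm_model *mm;

	model_task_lock(tsk);
	mm = tsk->mm;
	if (mm)
		mm->mm_users++;
	model_task_unlock(tsk);
	return mm;
}

/*
 * Check for other users and switch to the prepared copy inside one
 * critical section, so no reference can be taken between the two steps.
 */
static int model_enable_sie(struct task_model *tsk, struct mm_model *copy)
{
	int rc = 0;

	assert(copy != NULL);
	model_task_lock(tsk);
	if (!tsk->mm || tsk->mm->mm_users > 1) {
		rc = -EINVAL;	/* no mm, or somebody else holds a reference */
	} else if (!tsk->mm->pgstes) {
		copy->mm_users = 1;
		copy->pgstes = 1;
		tsk->mm = copy;
	}
	model_task_unlock(tsk);
	return rc;
}
```

In this model a caller that took a reference through model_get_task_mm() makes model_enable_sie() fail with -EINVAL until the reference is dropped again, which is the behaviour the race fix is after.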


See patch below:
---

 arch/s390/Kconfig              |    4 ++
 arch/s390/kernel/setup.c       |    4 ++
 arch/s390/mm/pgtable.c         |   65 +++--
 include/asm-s390/mmu.h         |    1
 include/asm-s390/mmu_context.h |    8 -
 include/asm-s390/pgtable.h     |    1
 include/linux/sched.h          |    2 +
 kernel/fork.c                  |    2 -
 8 files changed, 82 insertions(+), 5 deletions(-)

Index: linux-host/arch/s390/Kconfig
===================================================================
--- linux-host.orig/arch/s390/Kconfig
+++ linux-host/arch/s390/Kconfig
@@ -55,6 +55,10 @@ config GENERIC_LOCKBREAK
 	default y
 	depends on SMP && PREEMPT
 
+config PGSTE
+	bool
+	default y if KVM
+
 mainmenu "Linux Kernel Configuration"
 
 config S390
Index: linux-host/arch/s390/kernel/setup.c
===================================================================
--- linux-host.orig/arch/s390/kernel/setup.c
+++ linux-host/arch/s390/kernel/setup.c
@@ -315,7 +315,11 @@ static int __init early_parse_ipldelay(c
 early_param("ipldelay", early_parse_ipldelay);
 
 #ifdef CONFIG_S390_SWITCH_AMODE
+#ifdef CONFIG_PGSTE
+unsigned int switch_amode = 1;
+#else
 unsigned int switch_amode = 0;
+#endif
 EXPORT_SYMBOL_GPL(switch_amode);
 
 static void set_amode_and_uaccess(unsigned long user_amode,
Index: linux-host/arch/s390/mm/pgtable.c
===================================================================
--- linux-host.orig/arch/s390/mm/pgtable.c
+++ linux-host/arch/s390/mm/pgtable.c
@@ -30,11 +30,27 @@
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
+
+void clear_table_pgstes(unsigned long *table)
+{
+	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
+	memset(table + 256, 0, PAGE_SIZE/4);
+	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
+	memset(table + 768, 0, PAGE_SIZE/4);
+}
+
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
+
+void clear_table_pgstes(unsigned long *table)
+{
+	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
+	memset(table + 256, 0, PAGE_SIZE/2);
+}
+
 #endif
 
 unsigned long *crst_table_alloc(struct mm_struct *mm, int noexec)
@@ -153,7 +169,7 @@ unsigned long *page_table_alloc(struct m
 	unsigned long *table;
 	unsigned long bits;
 
-	bits = mm->context.noexec ? 3UL : 1UL;
+	bits = (mm->context.noexec || mm->context.pgstes) ? 3UL : 1UL;
 	spin_lock(&mm->page_table_lock);
 	page = NULL;
 	if (!list_empty(&mm->context.pgtable_list)) {
@@ -170,7 +186,10 @@ unsigned long *page_table_alloc(struct m
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
 		table = (unsigned long *) page_to_phys(page);
-		clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		if (mm->context.pgstes)
+			clear_table_pgstes(table);
+		else
+			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
 		spin_lock(&mm->page_table_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -191,7 +210,7 @@ void page_table_free(struct mm_struct *m
 	struct page *page;
 	unsigned long bits;
 
-	bits = mm->context.noexec ? 3UL : 1UL;
+	bits = (mm->context.noexec || mm->context.pgstes) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
 	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
@@ -228,3 +247,43 @@ void disable_noexec(struct mm_struct *mm
 	mm->context.noexec = 0;
 	update_mm(mm, tsk);
 }
+

Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-24 Thread Avi Kivity
Martin Schwidefsky wrote:
> On Sun, 2008-03-23 at 12:15 +0200, Avi Kivity wrote:
>>>> Can you convert the page tables at a later time without doing a
>>>> wholesale replacement of the mm?  It should be a bit easier to keep
>>>> people off the pagetables than keep their grubby mitts off the mm
>>>> itself.
>>>
>>> Yes, as far as I can see you're right. And whatever we do in arch code,
>>> after all it's just a work around to avoid a new clone flag.
>>> If something like clone() with CLONE_KVM would be useful for more
>>> architectures than just s390 then maybe we should try to get a flag.
>>>
>>> Oh... there are just two unused clone flag bits left. Looks like the
>>> namespace changes ate up a lot of them lately.
>>>
>>> Well, we could still play dirty tricks like setting a bit in current
>>> via whatever mechanism which indicates child-wants-extended-page-tables
>>> and then just fork and be happy.
>>
>> How about taking mmap_sem for write and converting all page tables
>> in-place?  I'd rather avoid the need to fork() when creating a VM.
>
> That was my initial approach as well. If all the page table allocations
> can be fulfilled the code is not too complicated. To handle allocation
> failures gets tricky. At this point I realized that dup_mmap already
> does what we want to do. It walks all the page tables, allocates new
> page tables and copies the ptes. In principle I would reinvent the wheel
> if we can not use dup_mmap

Well, dup_mm() can't work (and now that I think about it, for more 
reasons -- what if the process has threads?).

I don't think conversion is too bad.  You'd need a four-level loop to 
allocate and convert, and another loop to deallocate in case of error.  
If, as I don't doubt, s390 hardware can modify the ptes, you'd need 
cmpxchg to read and clear a pte in one operation.
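
A rough userspace sketch of that allocate-convert-or-roll-back shape, under several assumptions: the four-level walk is flattened into one array of tables, struct small_table, struct big_table, alloc_big_table and the fail_at test hook are hypothetical, an empty entry is modelled as 0, and C11 atomic_exchange stands in for the cmpxchg mentioned above as the "read and clear in one operation" step.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define PTRS_PER_TABLE 256	/* entries in one 2k table of 64-bit ptes */

/* Hypothetical model: a small table holds ptes only, a big one adds pgstes. */
struct small_table {
	_Atomic uint64_t pte[PTRS_PER_TABLE];
};

struct big_table {
	_Atomic uint64_t pte[PTRS_PER_TABLE];
	uint64_t pgste[PTRS_PER_TABLE];
};

/* Test hook (assumption): fail the allocation with this index, -1 = never. */
static int fail_at = -1;
static int alloc_count;

static struct big_table *alloc_big_table(void)
{
	if (alloc_count++ == fail_at)
		return NULL;
	return calloc(1, sizeof(struct big_table));
}

/*
 * Phase 1 allocates every replacement table. If one allocation fails,
 * everything allocated so far is freed and the old tables stay untouched.
 * Phase 2 moves the entries over, clearing each old entry as it is read.
 */
static int convert_tables(struct small_table **old, struct big_table **repl,
			  size_t n)
{
	size_t i, j;

	assert(old != NULL && repl != NULL);
	for (i = 0; i < n; i++) {
		repl[i] = alloc_big_table();
		if (!repl[i]) {
			while (i-- > 0) {	/* deallocate in case of error */
				free(repl[i]);
				repl[i] = NULL;
			}
			return -1;
		}
	}
	for (i = 0; i < n; i++) {
		for (j = 0; j < PTRS_PER_TABLE; j++) {
			/* Read and clear the old entry in one atomic step. */
			uint64_t pte = atomic_exchange(&old[i]->pte[j], 0);

			atomic_store(&repl[i]->pte[j], pte);
		}
	}
	return 0;
}
```

Doing all allocations before touching any entry keeps the error path simple: a failure never leaves a half-converted set of tables behind.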

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.




Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-23 Thread Avi Kivity
Heiko Carstens wrote:
>> What you've done with dup_mm() is probably the brute-force way that I
>> would have done it had I just been trying to make a proof of concept or
>> something.  I'm worried that there are a bunch of corner cases that
>> haven't been considered.
>>
>> What if someone else is poking around with ptrace or something similar
>> and they bump the mm_users:
>>
>> +	if (tsk->mm->context.pgstes)
>> +		return 0;
>> +	if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
>> +	    tsk->mm != tsk->active_mm || tsk->mm->ioctx_list)
>> +		return -EINVAL;
>> HERE
>> +	tsk->mm->context.pgstes = 1;	/* dirty little tricks .. */
>> +	mm = dup_mm(tsk);
>>
>> It'll race, possibly fault in some other pages, and those faults will be
>> lost during the dup_mm().  I think you need to be able to lock out all
>> of the users of access_process_vm() before you go and do this.  You also
>> need to make sure that anyone who has looked at task->mm doesn't go and
>> get a reference to it and get confused later when it isn't the task->mm
>> any more.
>>
>>> Therefore, we need to reallocate the page table after fork()
>>> once we know that task is going to be a hypervisor. That's what this
>>> code does: reallocate a bigger page table to accommodate the extra
>>> information. The task needs to be single-threaded when calling for
>>> extended page tables.
>>>
>>> Btw: at fork() time, we cannot tell whether or not the user's going to
>>> be a hypervisor. Therefore we cannot do this in fork.
>>
>> Can you convert the page tables at a later time without doing a
>> wholesale replacement of the mm?  It should be a bit easier to keep
>> people off the pagetables than keep their grubby mitts off the mm
>> itself.
>
> Yes, as far as I can see you're right. And whatever we do in arch code,
> after all it's just a work around to avoid a new clone flag.
> If something like clone() with CLONE_KVM would be useful for more
> architectures than just s390 then maybe we should try to get a flag.
>
> Oh... there are just two unused clone flag bits left. Looks like the
> namespace changes ate up a lot of them lately.
>
> Well, we could still play dirty tricks like setting a bit in current
> via whatever mechanism which indicates child-wants-extended-page-tables
> and then just fork and be happy.
   

How about taking mmap_sem for write and converting all page tables 
in-place?  I'd rather avoid the need to fork() when creating a VM.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-23 Thread Martin Schwidefsky
On Sun, 2008-03-23 at 12:15 +0200, Avi Kivity wrote:
>>> Can you convert the page tables at a later time without doing a
>>> wholesale replacement of the mm?  It should be a bit easier to keep
>>> people off the pagetables than keep their grubby mitts off the mm
>>> itself.
>>
>> Yes, as far as I can see you're right. And whatever we do in arch code,
>> after all it's just a work around to avoid a new clone flag.
>> If something like clone() with CLONE_KVM would be useful for more
>> architectures than just s390 then maybe we should try to get a flag.
>>
>> Oh... there are just two unused clone flag bits left. Looks like the
>> namespace changes ate up a lot of them lately.
>>
>> Well, we could still play dirty tricks like setting a bit in current
>> via whatever mechanism which indicates child-wants-extended-page-tables
>> and then just fork and be happy.
>
> How about taking mmap_sem for write and converting all page tables
> in-place?  I'd rather avoid the need to fork() when creating a VM.

That was my initial approach as well. If all the page table allocations
can be fulfilled, the code is not too complicated. Handling allocation
failures gets tricky. At this point I realized that dup_mmap already
does what we want to do. It walks all the page tables, allocates new
page tables and copies the ptes. In principle I would be reinventing the
wheel if we cannot use dup_mmap.

-- 
blue skies,
  Martin.

Reality continues to ruin my life. - Calvin.







Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-22 Thread Heiko Carstens
> What you've done with dup_mm() is probably the brute-force way that I
> would have done it had I just been trying to make a proof of concept or
> something.  I'm worried that there are a bunch of corner cases that
> haven't been considered.
>
> What if someone else is poking around with ptrace or something similar
> and they bump the mm_users:
>
> +	if (tsk->mm->context.pgstes)
> +		return 0;
> +	if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
> +	    tsk->mm != tsk->active_mm || tsk->mm->ioctx_list)
> +		return -EINVAL;
> HERE
> +	tsk->mm->context.pgstes = 1;	/* dirty little tricks .. */
> +	mm = dup_mm(tsk);
>
> It'll race, possibly fault in some other pages, and those faults will be
> lost during the dup_mm().  I think you need to be able to lock out all
> of the users of access_process_vm() before you go and do this.  You also
> need to make sure that anyone who has looked at task->mm doesn't go and
> get a reference to it and get confused later when it isn't the task->mm
> any more.
>
>> Therefore, we need to reallocate the page table after fork()
>> once we know that task is going to be a hypervisor. That's what this
>> code does: reallocate a bigger page table to accommodate the extra
>> information. The task needs to be single-threaded when calling for
>> extended page tables.
>>
>> Btw: at fork() time, we cannot tell whether or not the user's going to
>> be a hypervisor. Therefore we cannot do this in fork.
>
> Can you convert the page tables at a later time without doing a
> wholesale replacement of the mm?  It should be a bit easier to keep
> people off the pagetables than keep their grubby mitts off the mm
> itself.

Yes, as far as I can see you're right. And whatever we do in arch code,
it is after all just a workaround to avoid a new clone flag.
If something like clone() with CLONE_KVM would be useful for more
architectures than just s390 then maybe we should try to get a flag.

Oh... there are just two unused clone flag bits left. Looks like the
namespace changes ate up a lot of them lately.

Well, we could still play dirty tricks like setting a bit in current
via whatever mechanism which indicates child-wants-extended-page-tables
and then just fork and be happy.



Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-21 Thread Dave Hansen
On Thu, 2008-03-20 at 21:35 +0100, Carsten Otte wrote:
> Dave Hansen wrote:
>> Well, and more fundamentally: do we really want dup_mm() able to be
>> called from other code?
>>
>> Maybe we need a bit more detailed justification why fork() itself isn't
>> good enough.  It looks to me like they basically need an arch-specific
>> argument to fork, telling the new process's page tables to take the
>> fancy new bit.
>>
>> I'm really curious how this new stuff is going to get used.  Are you
>> basically replacing fork() when creating kvm guests?
> No. The trick is, that we do need bigger page tables when running
> guests: our page tables are usually 2k, but when running a guest
> they're 4k to track both guest and host dirty/reference information.
> This looks like this:
> *----------*
> *2k PTE's  *
> *----------*
> *2k PGSTE  *
> *----------*
> We don't want to waste precious memory for all page tables. We'd like
> to have one kernel image that runs regular server workload _and_
> guests.

That makes a lot of sense.

Is that layout (the shadow and regular stacked together) specified in
hardware somehow, or was it just chosen?

What you've done with dup_mm() is probably the brute-force way that I
would have done it had I just been trying to make a proof of concept or
something.  I'm worried that there are a bunch of corner cases that
haven't been considered.

What if someone else is poking around with ptrace or something similar
and they bump the mm_users:

+	if (tsk->mm->context.pgstes)
+		return 0;
+	if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
+	    tsk->mm != tsk->active_mm || tsk->mm->ioctx_list)
+		return -EINVAL;
HERE
+	tsk->mm->context.pgstes = 1;	/* dirty little tricks .. */
+	mm = dup_mm(tsk);

It'll race, possibly fault in some other pages, and those faults will be
lost during the dup_mm().  I think you need to be able to lock out all
of the users of access_process_vm() before you go and do this.  You also
need to make sure that anyone who has looked at task->mm doesn't go and
get a reference to it and get confused later when it isn't the task->mm
any more.

> Therefore, we need to reallocate the page table after fork()
> once we know that task is going to be a hypervisor. That's what this
> code does: reallocate a bigger page table to accommodate the extra
> information. The task needs to be single-threaded when calling for
> extended page tables.
>
> Btw: at fork() time, we cannot tell whether or not the user's going to
> be a hypervisor. Therefore we cannot do this in fork.

Can you convert the page tables at a later time without doing a
wholesale replacement of the mm?  It should be a bit easier to keep
people off the pagetables than keep their grubby mitts off the mm
itself.

-- Dave




Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-21 Thread Carsten Otte
Dave Hansen wrote:
> On Thu, 2008-03-20 at 21:35 +0100, Carsten Otte wrote:
>> Dave Hansen wrote:
>>> Well, and more fundamentally: do we really want dup_mm() able to be
>>> called from other code?
>>>
>>> Maybe we need a bit more detailed justification why fork() itself isn't
>>> good enough.  It looks to me like they basically need an arch-specific
>>> argument to fork, telling the new process's page tables to take the
>>> fancy new bit.
>>>
>>> I'm really curious how this new stuff is going to get used.  Are you
>>> basically replacing fork() when creating kvm guests?
>> No. The trick is, that we do need bigger page tables when running
>> guests: our page tables are usually 2k, but when running a guest
>> they're 4k to track both guest and host dirty/reference information.
>> This looks like this:
>> *----------*
>> *2k PTE's  *
>> *----------*
>> *2k PGSTE  *
>> *----------*
>> We don't want to waste precious memory for all page tables. We'd like
>> to have one kernel image that runs regular server workload _and_
>> guests.
>
> That makes a lot of sense.
>
> Is that layout (the shadow and regular stacked together) specified in
> hardware somehow, or was it just chosen?
It's defined by hardware. The chip just adds +2k to the ptep to get to
the corresponding pgste. Both pte and pgste are 64 bits per page. I know
Heiko and Martin have thought a lot about possible races. I'll have to
leave your question on the race against pfault open for them.

Btw: thanks a lot for reviewing our changes :-)

cheers,
Carsten
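
The "+2k" relationship can be modelled in a few lines of userspace C. This is only a sketch under assumptions: struct pgtable_with_pgstes and pgste_of() are hypothetical names, and the 256-entry count follows from a 2k table of 64-bit entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTRS_PER_PTE	256	/* 256 entries * 8 bytes = 2k */
#define PGSTE_OFFSET	2048	/* byte distance from a pte to its pgste */

/* One 4k page: 2k of ptes followed by 2k of pgstes. */
struct pgtable_with_pgstes {
	uint64_t pte[PTRS_PER_PTE];
	uint64_t pgste[PTRS_PER_PTE];
};

/* Hypothetical helper: step 2k past a pte to reach its pgste. */
static uint64_t *pgste_of(uint64_t *ptep)
{
	assert(ptep != NULL);
	return (uint64_t *)((char *)ptep + PGSTE_OFFSET);
}
```

With this layout, pgste_of(&t.pte[i]) lands on t.pgste[i] for every index i, and the whole structure fills exactly one 4k page.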



[kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-20 Thread Carsten Otte
From: Martin Schwidefsky [EMAIL PROTECTED]

The SIE instruction on s390 uses the 2nd half of the page table page to
virtualize the storage keys of a guest. This patch offers the s390_enable_sie
function, which reorganizes the page tables of a single-threaded process to
reserve space in the page table:
s390_enable_sie makes sure that the process is single-threaded and then uses
dup_mm to create a new mm with reorganized page tables. The old mm is freed
and the process now has a page status extended field after every page table.

Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.

This patch has a small common code hit, namely making dup_mm non-static.


Signed-off-by: Martin Schwidefsky [EMAIL PROTECTED]
Signed-off-by: Carsten Otte [EMAIL PROTECTED]
---

 arch/s390/Kconfig              |    4 ++
 arch/s390/kernel/setup.c       |    4 ++
 arch/s390/mm/pgtable.c         |   55 ++---
 include/asm-s390/mmu.h         |    1
 include/asm-s390/mmu_context.h |    8 +
 include/asm-s390/pgtable.h     |    1
 kernel/fork.c                  |    2 -
 7 files changed, 70 insertions(+), 5 deletions(-)

Index: kvm/arch/s390/Kconfig
===================================================================
--- kvm.orig/arch/s390/Kconfig
+++ kvm/arch/s390/Kconfig
@@ -55,6 +55,10 @@ config GENERIC_LOCKBREAK
 	default y
 	depends on SMP && PREEMPT
 
+config PGSTE
+	bool
+	default y if KVM
+
 mainmenu "Linux Kernel Configuration"
 
 config S390
Index: kvm/arch/s390/kernel/setup.c
===================================================================
--- kvm.orig/arch/s390/kernel/setup.c
+++ kvm/arch/s390/kernel/setup.c
@@ -315,7 +315,11 @@ static int __init early_parse_ipldelay(c
 early_param("ipldelay", early_parse_ipldelay);
 
 #ifdef CONFIG_S390_SWITCH_AMODE
+#ifdef CONFIG_PGSTE
+unsigned int switch_amode = 1;
+#else
 unsigned int switch_amode = 0;
+#endif
 EXPORT_SYMBOL_GPL(switch_amode);
 
 static void set_amode_and_uaccess(unsigned long user_amode,
Index: kvm/arch/s390/mm/pgtable.c
===================================================================
--- kvm.orig/arch/s390/mm/pgtable.c
+++ kvm/arch/s390/mm/pgtable.c
@@ -30,11 +30,27 @@
 #define TABLES_PER_PAGE	4
 #define FRAG_MASK	15UL
 #define SECOND_HALVES	10UL
+
+void clear_table_pgstes(unsigned long *table)
+{
+	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
+	memset(table + 256, 0, PAGE_SIZE/4);
+	clear_table(table + 512, _PAGE_TYPE_EMPTY, PAGE_SIZE/4);
+	memset(table + 768, 0, PAGE_SIZE/4);
+}
+
 #else
 #define ALLOC_ORDER	2
 #define TABLES_PER_PAGE	2
 #define FRAG_MASK	3UL
 #define SECOND_HALVES	2UL
+
+void clear_table_pgstes(unsigned long *table)
+{
+	clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE/2);
+	memset(table + 256, 0, PAGE_SIZE/2);
+}
+
 #endif
 
 unsigned long *crst_table_alloc(struct mm_struct *mm, int noexec)
@@ -153,7 +169,7 @@ unsigned long *page_table_alloc(struct m
 	unsigned long *table;
 	unsigned long bits;
 
-	bits = mm->context.noexec ? 3UL : 1UL;
+	bits = (mm->context.noexec || mm->context.pgstes) ? 3UL : 1UL;
 	spin_lock(&mm->page_table_lock);
 	page = NULL;
 	if (!list_empty(&mm->context.pgtable_list)) {
@@ -170,7 +186,10 @@ unsigned long *page_table_alloc(struct m
 		pgtable_page_ctor(page);
 		page->flags &= ~FRAG_MASK;
 		table = (unsigned long *) page_to_phys(page);
-		clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
+		if (mm->context.pgstes)
+			clear_table_pgstes(table);
+		else
+			clear_table(table, _PAGE_TYPE_EMPTY, PAGE_SIZE);
 		spin_lock(&mm->page_table_lock);
 		list_add(&page->lru, &mm->context.pgtable_list);
 	}
@@ -191,7 +210,7 @@ void page_table_free(struct mm_struct *m
 	struct page *page;
 	unsigned long bits;
 
-	bits = mm->context.noexec ? 3UL : 1UL;
+	bits = (mm->context.noexec || mm->context.pgstes) ? 3UL : 1UL;
 	bits <<= (__pa(table) & (PAGE_SIZE - 1)) / 256 / sizeof(unsigned long);
 	page = pfn_to_page(__pa(table) >> PAGE_SHIFT);
 	spin_lock(&mm->page_table_lock);
@@ -228,3 +247,33 @@ void disable_noexec(struct mm_struct *mm
 	mm->context.noexec = 0;
 	update_mm(mm, tsk);
 }
+
+struct mm_struct *dup_mm(struct task_struct *tsk);
+
+/*
+ * switch on pgstes for its userspace process (for kvm)
+ */
+int s390_enable_sie(void)
+{
+	struct task_struct *tsk = current;
+	struct mm_struct *mm;
+
+	if (tsk->mm->context.pgstes)
+		return 0;
+	if (!tsk->mm || atomic_read(&tsk->mm->mm_users) > 1 ||
+	    tsk->mm != tsk->active_mm || tsk->mm->ioctx_list)
+		return -EINVAL;
+	tsk->mm->context.pgstes = 1;	/* dirty little tricks .. */
+	mm = dup_mm(tsk);
+	tsk->mm->context.pgstes = 0;
+	if (!mm)
+		return 

Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-20 Thread Jeremy Fitzhardinge
Carsten Otte wrote:
> +struct mm_struct *dup_mm(struct task_struct *tsk);
>

No prototypes in .c files.  Put this in an appropriate header.

J



Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-20 Thread Dave Hansen
On Thu, 2008-03-20 at 10:28 -0700, Jeremy Fitzhardinge wrote:
> Carsten Otte wrote:
>> +struct mm_struct *dup_mm(struct task_struct *tsk);
>
> No prototypes in .c files.  Put this in an appropriate header.

Well, and more fundamentally: do we really want dup_mm() able to be
called from other code?

Maybe we need a bit more detailed justification why fork() itself isn't
good enough.  It looks to me like they basically need an arch-specific
argument to fork, telling the new process's page tables to take the
fancy new bit.

I'm really curious how this new stuff is going to get used.  Are you
basically replacing fork() when creating kvm guests?

-- Dave




Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-20 Thread Carsten Otte
Dave Hansen wrote:
> Well, and more fundamentally: do we really want dup_mm() able to be
> called from other code?
>
> Maybe we need a bit more detailed justification why fork() itself isn't
> good enough.  It looks to me like they basically need an arch-specific
> argument to fork, telling the new process's page tables to take the
> fancy new bit.
>
> I'm really curious how this new stuff is going to get used.  Are you
> basically replacing fork() when creating kvm guests?
No. The trick is that we need bigger page tables when running
guests: our page tables are usually 2k, but when running a guest
they're 4k to track both guest and host dirty/reference information.
It looks like this:
*----------*
*2k PTE's  *
*----------*
*2k PGSTE  *
*----------*
We don't want to waste precious memory on all page tables. We'd like
to have one kernel image that runs regular server workloads _and_
guests. Therefore, we need to reallocate the page table after fork()
once we know that the task is going to be a hypervisor. That's what this
code does: reallocate a bigger page table to accommodate the extra
information. The task needs to be single-threaded when asking for
extended page tables.

Btw: at fork() time, we cannot tell whether or not the user's going to 
be a hypervisor. Therefore we cannot do this in fork.
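
As a back-of-the-envelope illustration of the memory argument, the sketch below estimates the lowest-level page table cost with and without pgstes. It assumes 4k pages, 256 entries per table (a 2k table of 64-bit ptes) and fully populated tables; pte_table_bytes() is a hypothetical helper, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE	4096ULL	/* s390 page size */
#define PTES_PER_TABLE	256ULL	/* 2k table / 8-byte entries */

/* Bytes of lowest-level page tables needed to map mapped_bytes. */
static uint64_t pte_table_bytes(uint64_t mapped_bytes, int with_pgstes)
{
	uint64_t span = PTES_PER_TABLE * MODEL_PAGE_SIZE;	/* 1 MB per table */
	uint64_t tables;

	assert(mapped_bytes <= UINT64_MAX - span);	/* keep the round-up safe */
	tables = (mapped_bytes + span - 1) / span;
	return tables * (with_pgstes ? 4096ULL : 2048ULL);
}
```

Under these assumptions, 1 GB of mapped memory needs 1024 tables, which is 2 MB of page tables normally and 4 MB with pgstes. Paying the doubled cost only for processes that actually run guests is the point of switching after fork().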

