Re: best asked at one of the nvidia forums, not on lkml...

2008-02-04 Thread Zachary Amsden

On Tue, 2008-02-05 at 13:44 +0700, Igor M Podlesny wrote:
> On 2008-02-05 13:34, Arjan van de Ven wrote:
> [...]
> >>1) To have compiled it I had to replace global_flush_tlb()
> >> call with __flush_tlb_all() and still guessing was it(?) a correct
> >> replacement at all :-)
> > 
> > it is not; 
> 
>   I see, thanks. What would be the correct one? ;-)

global_flush_tlb() would be the correct one.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Commit f06e4ec breaks vmware

2008-02-04 Thread Zachary Amsden

On Mon, 2008-02-04 at 16:36 +0100, Ingo Molnar wrote:
> * Jeff Chua <[EMAIL PROTECTED]> wrote:
> 
> > On Feb 4, 2008 10:53 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > > > commit 8d947344c47a40626730bb80d136d8daac9f2060
> > > > Author: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> > > > Date:   Wed Jan 30 13:31:12 2008 +0100
> > > >
> > > > x86: change write_idt_entry signature
> > >
> > > does the patch below ontop of x86.git#mm fix this?
> > 
> > 
> > > 32-bit or 64-bit guest kernel?
> > 
> > 32-bit.
> > 
> > Yep, this fixed the problem.
> 
> great! I've added:
> 
>Tested-by: Jeff Chua <[EMAIL PROTECTED]>
> 
> to the commit message as well, if you don't mind. Full patch is below.

Acked-by: Zachary Amsden <[EMAIL PROTECTED]>

Thanks, Ingo!





Re: [PATCH 0/10] Tree fixes for PARAVIRT

2008-01-18 Thread Zachary Amsden
On Fri, 2008-01-18 at 22:37 +0100, Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
> 
> > > The first fix is not even specific for PARAVIRT, and it's actually 
> > > preventing the whole tree from booting.
> > 
> > on CONFIG_EFI, indeed :)
> 
> but in exchange you broke all of 32-bit with CONFIG_PARAVIRT=y. Which 
> means you did not even build-test it on 32-bit, let alone boot test 
> it...

Why are we rushing so much to do 64-bit paravirt that we are breaking
working configurations?  If the development is going to be this
chaotic, it should be done and tested out of tree until it can
stabilize.

I do not like having to continuously retest and review the x86 branch
because the paravirt-ops are constantly in flux and the 32-bit code
keeps breaking.

We won't be doing 64-bit paravirt-ops for exactly this reason - is there
a serious justification from the performance angle on modern 64-bit
hardware?  If not, why justify the complexity and hackery to Linux?

Zach





Re: [PATCH] serverworks: IRQ routing needs no _p

2008-01-11 Thread Zachary Amsden
On Fri, 2008-01-11 at 18:14 +0000, Alan Cox wrote:
> I can find no reason for the _p on the serverworks IRQ routing logic, and
> a review of the documentation contains no indication that any such delay
> is needed, so let's try this.
> 

Looks good to me; unfortunately my Serverworks boxes got upgraded so I
can't test, but the install base is really big.

Acked-by: Zachary Amsden <[EMAIL PROTECTED]>



Re: [PATCH] 8390: split up 8390 with ISA delay and 8390 without

2008-01-11 Thread Zachary Amsden
On Fri, 2008-01-11 at 18:08 +0000, Alan Cox wrote:
> This lets us remove port 0x80 use on the PCI systems. It also speeds up
> some of the later 8390 based cores where we know the device does not need
> delay loops, either because it has internal handling or, in most cases,
> because it is a faster device anyway.
> 
> We compile up a new module, 8390p, to go with 8390: one with delays and
> one without. Most users will never need 8390p, so it seems best to
> produce two modules.

> diff -u --new-file --recursive --exclude-from /usr/src/exclude linux.vanilla-2.6.24-rc6-mm1/drivers/net/hp.c linux-2.6.24-rc6-mm1/drivers/net/hp.c
> --- linux.vanilla-2.6.24-rc6-mm1/drivers/net/hp.c	2008-01-02 16:04:00.000000000 +0000
> +++ linux-2.6.24-rc6-mm1/drivers/net/hp.c	2008-01-11 14:47:44.000000000 +0000

> @@ -176,7 +176,7 @@
> 		outb_p(irqmap[irq] | HP_RUN, ioaddr + HP_CONFIGURE);
> 		outb_p( 0x00 | HP_RUN, ioaddr + HP_CONFIGURE);

... didn't you miss two here?  Otherwise, code looks great.

How will the user know when they need to use 8390p?  Should you perhaps default 
to producing and using 8390p if CONFIG_ISA is on?

Zach





Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.

2008-01-10 Thread Zachary Amsden
On Wed, 2008-01-09 at 17:22 -0500, David P. Reed wrote:
> Zachary Amsden wrote:
> >
> > According to Phoenix Technologies' book "System BIOS for IBM PCs,
> > Compatibles and EISA Computers, 2nd Edition", the I/O port list gives
> >
> > port 0080h   R/W  Extra page register (temporary storage)
> >
> > Despite looking, I've never seen it documented anywhere else, but I
> > believe it works on just about every PC platform.  Except, apparently,
> > my laptop.
> >
> >
> >   
> The port 80 problem was discovered by me, after months of "bisecting" 
> the running code around a problem with hanging when using hwclock in 
> 64-bit mode when ACPI is on.  So it kills my laptop, too, and many 
> current laptop motherboards designed by Quanta for HP and Compaq (dv6000, 
> dv9000, tx1000, apparently).

Thanks very much for that - I was debugging this for a while too, and
eventually just shut off hwclock.

> Your laptop isn't an aberration.  It's part of the new generation of 
> evolved machines that take advantage of the capabilities of ACPI and 
> SMBIOS and DMI standards that are becoming core parts of the market.

I beg to differ.  I managed to turn the thing into a brick by upgrading
the BIOS (with the correct image, no less) in an attempt to fix it.  I
just got it back from repair.  I'm not sure that is positive
evolutionary development, but it certainly does make my laptop an
aberration :)

FWIW, I fixed the problem locally by recompiling, changing port 80 to
port 84 in io.h; works great, and doesn't conflict with any occupied
ports.

Zach





Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.

2008-01-09 Thread Zachary Amsden
On Tue, 2008-01-08 at 21:19 -0800, H. Peter Anvin wrote:
> Zachary Amsden wrote:
> > 
> > BTW, it isn't ever safe to pass port 0x80 through to hardware from a
> > virtual machine; some OSes use port 0x80 as a hardware available scratch
> > register (I believe Darwin/x86 did/does this during boot).
> 
> That's funny, because there is definitely no guarantee that you get back 
> what you read (well, perhaps there is on Apple.)

According to Phoenix Technologies' book "System BIOS for IBM PCs,
Compatibles and EISA Computers, 2nd Edition", the I/O port list gives

port 0080h   R/W  Extra page register (temporary storage)

Despite looking, I've never seen it documented anywhere else, but I
believe it works on just about every PC platform.  Except, apparently,
my laptop.

Zach



Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.

2008-01-09 Thread Zachary Amsden
On Wed, 2008-01-09 at 16:27 +0100, Rene Herman wrote:
> On 09-01-08 06:30, Christer Weinigel wrote:
> I'd not expect very time critical. The current outb_p use gives a delay 
> somewhere between .5 and 2 microseconds as per earlier survey meaning a 
> udelay(1) or 2 would be enough -- again, at the point that udelay() is 
> sensible.
> 
> New machines don't use the legacy PIC anymore anyway.
> 
> > The floppy controller code uses outb_p.  Even though there might be
> > floppy controllers on modern systems, I'd rather leave the floppy code
> > alone since it's supposed to be very fragile.  If you still use
> > floppies you deserve what you get.
> 
> Floppies forever. In practice, leaving it alone isn't going to matter, but 
> in that same practice changing it to udelay() probably doesn't either. The 
> ones to leave alone are the ones that are clumsy/impossible to test and the 
> ones such as in NIC drivers that were specifically tuned.

I'm speaking specifically in terms of 64-bit platforms here.  Shouldn't
we unconditionally drop outb_p doing extra port I/O on 64-bit
architectures?  Especially considering they don't even have an ISA bus
where the decode timing could even matter?

> If simple outb_p() deprecation is considered enough instead, no need to 
> touch anything in drivers/, only changes to "outb(); udelay()" outside 
> drivers/.
> 
> I'd let Alan decide here.

Agree.

Zach





Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.

2008-01-08 Thread Zachary Amsden
On Tue, 2008-01-08 at 14:15 -0500, David P. Reed wrote:
> Alan Cox wrote:
> > The natsemi docs here say otherwise. I trust them not you.
> >   
> As well you should. I am honestly curious (for my own satisfaction) as 
> to what the natsemi docs say the delay code should do  (can't imagine 
> they say "use io port 80 because it is unused").  I don't have any 

What is the outcome of this thread?  Are we going to use timing based
port delays, or can we finally drop these things entirely on 64-bit
architectures?

I have a doubly vested interest in this, both as the owner of an
affected HP dv9210us laptop and as a maintainer of paravirt code - and
would like 64-bit Linux code to stop using I/O to port 0x80 in both
cases (as I suspect would every other person involved with
virtualization).

BTW, it isn't ever safe to pass port 0x80 through to hardware from a
virtual machine; some OSes use port 0x80 as a hardware available scratch
register (I believe Darwin/x86 did/does this during boot).  This means
simultaneous execution of two virtual machines can interleave port 0x80
values or share data with a hardware provided covert channel.  This
means KVM should be trapping port 0x80 access, which is really
expensive, or alternatively, Linux should not be using port 0x80 for
timing bus access on modern (64-bit) hardware.

I've tried to follow this thread, but with all the jabs, 1-ups, and
obscure legacy hardware pageantry going on, it isn't clear what we're
really doing.

Thanks,

Zach





Re: [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu

2007-11-26 Thread Zachary Amsden
On Mon, 2007-11-26 at 14:06 -0800, Yinghai Lu wrote:

> >> diff --git a/arch/x86/kernel/smpboot_64.c b/arch/x86/kernel/smpboot_64.c
> >> index 500670c..966d124 100644
> >> --- a/arch/x86/kernel/smpboot_64.c
> >> +++ b/arch/x86/kernel/smpboot_64.c
> >> @@ -912,7 +912,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >>  void __init smp_prepare_boot_cpu(void)
> >>  {
> >>int me = smp_processor_id();
> >> -  cpu_set(me, cpu_online_map);
> >> +  /* already set me in cpu_online_map in boot_cpu_init() */
> >>cpu_set(me, cpu_callout_map);
> >>per_cpu(cpu_state, me) = CPU_ONLINE;
> >>  }
> > 
> > 
> > This ordering can be tricky wrt CPU hotplug.  Are you sure you are not
> > breaking CPU hotplug?  AFAIK, x86_64 has that right and the 32-bit code
> > had it wrong.
> 
> 
> Will the CPU hotplug path call smp_prepare_boot_cpu?
> It is using __init instead of __cpuinit?

No, but moving the place where the CPU is added to the online map might
break hotplug.  In any case, I don't see anything wrong with your change
now that I look closer.

Zach



Re: [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu

2007-11-26 Thread Zachary Amsden
On Mon, 2007-11-26 at 00:38 -0800, Yinghai Lu wrote:
> [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu
> 
> in init/main.c boot_cpu_init() does that before
> 
> Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>
> 
> diff --git a/arch/x86/kernel/smpboot_64.c b/arch/x86/kernel/smpboot_64.c
> index 500670c..966d124 100644
> --- a/arch/x86/kernel/smpboot_64.c
> +++ b/arch/x86/kernel/smpboot_64.c
> @@ -912,7 +912,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  void __init smp_prepare_boot_cpu(void)
>  {
>   int me = smp_processor_id();
> - cpu_set(me, cpu_online_map);
> + /* already set me in cpu_online_map in boot_cpu_init() */
>   cpu_set(me, cpu_callout_map);
>   per_cpu(cpu_state, me) = CPU_ONLINE;
>  }


This ordering can be tricky wrt CPU hotplug.  Are you sure you are not
breaking CPU hotplug?  AFAIK, x86_64 has that right and the 32-bit code
had it wrong.

Zach





Re: [PATCH 5/5] x86: TLS cleanup

2007-11-21 Thread Zachary Amsden
On Wed, 2007-11-21 at 02:25 -0800, Roland McGrath wrote:
> This consolidates the four different places that implemented the same
> encoding magic for the GDT-slot 32-bit TLS support.  The old tls32.c is
> renamed and only slightly modified to be the shared implementation guts.

> -#define GET_BASE(desc) ( \
> - (((desc)->a >> 16) & 0x0000ffff) | \
> - (((desc)->b << 16) & 0x00ff0000) | \
> - ( (desc)->b        & 0xff000000)   )
> -
> -#define GET_LIMIT(desc) ( \
> - ((desc)->a & 0x0ffff) | \
> -  ((desc)->b & 0xf0000) )

That was the other redundant definition, thanks.

I had a bit of trouble verifying correctness here because of so much
Brownian motion.  Any possibility of a pure movement / fixup separation
to make it easier on the eyes?

Zach



Re: [PATCH 2/5] x86: use get_desc_base

2007-11-21 Thread Zachary Amsden
On Wed, 2007-11-21 at 02:20 -0800, Roland McGrath wrote:
> This changes a couple of places to use the get_desc_base function.
> They were duplicating the same calculation with different equivalent code.
> 
> Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
> 
> diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c
> index 1cc4340..5291596 100644
> --- a/arch/x86/ia32/tls32.c
> +++ b/arch/x86/ia32/tls32.c
> @@ -85,11 +85,6 @@ asmlinkage long sys32_set_thread_area(struct user_desc 
> __user *u_info)
>   * Get the current Thread-Local Storage area:
>   */
>  
> -#define GET_BASE(desc) ( \
> - (((desc)->a >> 16) & 0x0000ffff) | \
> - (((desc)->b << 16) & 0x00ff0000) | \
> - ( (desc)->b        & 0xff000000)   )

Are you sure there still aren't more of these things floating around?  I
recall ptrace and process.c even having one at one point; hopefully they
are all gone.

Nice cleanups,

Zach




Re: [kvm-devel] [PATCH 3/3] virtio PCI device

2007-11-21 Thread Zachary Amsden
On Wed, 2007-11-21 at 09:13 +0200, Avi Kivity wrote:

> Where the device is implemented is an implementation detail that should 
> be hidden from the guest, isn't that one of the strengths of 
> virtualization?  Two examples: a file-based block device implemented in 
> qemu gives you fancy file formats with encryption and compression, while 
> the same device implemented in the kernel gives you a low-overhead path 
> directly to a zillion-disk SAN volume.  Or a user-level network device 
> capable of running with the slirp stack and no permissions vs. the 
> kernel device running copyless most of the time and using a dma engine 
> for the rest but requiring you to be good friends with the admin.
> 
> The user should expect zero reconfigurations moving a VM from one model 
> to the other.

I think that is pretty insightful, and indeed, is probably the only
reason we would ever consider using a virtio based driver.

But is this really a virtualization problem, and is virtio the right
place to solve it?  Doesn't I/O hotplug with multipathing or NIC teaming
provide the same infrastructure in a way that is useful in more than
just a virtualization context?

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/5] x86: TLS cleanup

2007-11-21 Thread Zachary Amsden
On Wed, 2007-11-21 at 02:25 -0800, Roland McGrath wrote:
> This consolidates the four different places that implemented the same
> encoding magic for the GDT-slot 32-bit TLS support.  The old tls32.c is
> renamed and only slightly modified to be the shared implementation guts.

> -#define GET_BASE(desc) ( \
> - (((desc)->a >> 16) & 0x0000ffff) | \
> - (((desc)->b << 16) & 0x00ff0000) | \
> - ( (desc)->b        & 0xff000000)   )
> -
> -#define GET_LIMIT(desc) ( \
> - ((desc)->a & 0x0ffff) | \
> -  ((desc)->b & 0xf0000) )

That was the other redundant definition, thanks.

I had a bit of trouble verifying correctness here because of much
brownian motion.  Any possibility of a pure movement / fixup separation
to make it easier on the eyes?

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 15/18] x86 vDSO: consolidate vdso32

2007-11-20 Thread Zachary Amsden
On Mon, 2007-11-19 at 14:06 -0800, Roland McGrath wrote:
> This makes x86_64's ia32 emulation support share the sources used in the
> 32-bit kernel for the 32-bit vDSO and much of its setup code.
> 
> The 32-bit vDSO mapping now behaves the same on x86_64 as on native 32-bit.
> The abi.syscall32 sysctl on x86_64 now takes the same values that
> vm.vdso_enabled takes on the 32-bit kernel.  That is, 1 means a randomized
> vDSO location, 2 means the fixed old address.  The CONFIG_COMPAT_VDSO
> option is now available to make this the default setting, the same meaning
> it has for the 32-bit kernel.  (This does not affect the 64-bit vDSO.)

I think you should drop CONFIG_COMPAT_VDSO support for 32-bit VDSO on
64-bit kernel.  This was only to hack around a broken version of glibc
that shipped with SUSE PRO 9.0, which had broken assertions based on
misinterpretation of ELF fields.  64-bit machines will never see this
glibc and the hack can die.

Perhaps it is finally time to remove the hack from 32-bit as well, and
eliminate COMPAT_VDSO entirely?  Or does it really have to live forever?

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 13/18] x86 vDSO: ia32 sysenter_return

2007-11-20 Thread Zachary Amsden
On Mon, 2007-11-19 at 14:06 -0800, Roland McGrath wrote:
> This changes the 64-bit kernel's support for the 32-bit sysenter
> instruction to use stored fields rather than constants for the
> user-mode return address, as the 32-bit kernel does.  This adds a
> sysenter_return field to struct thread_info, as 32-bit has.  There
> is no observable effect from this yet.  It makes the assembly code
> independent of the 32-bit vDSO mapping address, paving the way for
> making the vDSO address vary as it does on the 32-bit kernel.

I hope you won't carry the dependency of COMPAT_VDSO over to 64-bit when
that happens.

> @@ -104,7 +103,7 @@ ENTRY(ia32_sysenter_target)
>   pushfq
>   CFI_ADJUST_CFA_OFFSET 8
>   /*CFI_REL_OFFSET rflags,0*/
> - movl$VSYSCALL32_SYSEXIT, %r10d
> + movl8*3-THREAD_SIZE+threadinfo_sysenter_return(%rsp), %r10d

8*3-THREAD_SIZE is not very intuitive.  Can you add a comment on the
math?

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86/paravirt: revert exports to restore old behaviour

2007-11-13 Thread Zachary Amsden
On Tue, 2007-11-13 at 22:22 +, Christoph Hellwig wrote:
> On Tue, Nov 13, 2007 at 12:21:16PM -0800, Jeremy Fitzhardinge wrote:
> > Subject: x86/paravirt: revert exports to restore old behaviour
> > 
> > Subdividing the paravirt_ops structure caused a regression in certain
> > non-GPL modules which try to use mmu_ops and cpu_ops.  This restores
> > the old behaviour, and makes it consistent with the
> > non-CONFIG_PARAVIRT case.
> 
> NACK, both of these are internal and graphics drivers should not be
> using them.

Some of them are internal, but some are not; they just happened to be
privileged CPU operations available to anyone.

Does anyone know what hooks they are actually using?  Something like
reading MSRs is not internal at all, it is CPU specific, and the
graphics drivers might have very good reasons to read them to figure out
how AGP is configured for example.

The graphics drivers most certainly don't need to be paravirtualized
however, so they could always work around this with raw asm constructs.

The mmu_ops hook is more debatable.  But until someone figures out what
hooks they want, we can't know whether they are internal or not.  In any
case, this is a regression.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Use of virtio device IDs

2007-11-13 Thread Zachary Amsden
On Tue, 2007-11-13 at 08:18 -0500, Gregory Haskins wrote:

> Since PCI was designed as a hardware solution it has all kinds of stuff
> specifically geared towards hardware constraints.   Those constraints
> are different in a virtualized platform, so some things do not translate
> well to an optimal solution.  Half of the stuff wouldn't be used, and
> the other half has some nasty baggage associated with it (like still
> requiring partial emulation in a PV environment).
> 
> The point of PV, of course, is high performance guest/host interfaces,
> and PCI baggage just gets in the way in many respects.  Once a

I would tend to disagree with that statement.  The point of PV is a
simpler to implement guest/host interface, which sometimes results in a
higher performance interface.  PV does not always mean high performance,
nor does high performance imply PV is necessary.

> particular guest's subsystem is HV aware we no longer strictly require
> legacy emulation.  It should know how to talk to the host using whatever
> is appropriate.  I aim to strip out all of those emulation points
> whenever possible.  Once you do that, all of the PCI features that we
> would use drops to zero.

Device discovery, bus enumeration and shared memory configuration space
mapping are all very useful features that require complex negotiation
with the hypervisor.  Those are provided implicitly in a standardized
way by PCI (or by any good hardware bus protocol).

> On the flip side, we have full emulation if you want broader
> compatibility with legacy, at the expense of performance.

There is no reason you need to sacrifice performance for broader
compatibility with well designed virtual devices on modern hardware.

> 
> > 
> >> You are
> >> probably better off designing something that is PV specific instead of
> >> shoehorning it in to fit a different model (at least for the things I
> >> have in mind).  
> > 
> > Well, if we design our pv devices to look like hardware, they will fit
> > quite well.  Both to the guest OS and to user's expectations.
> 
> Like what hardware?  Like PCI hardware?  What if the platform in
> question doesn't have PCI?  Also note that devices don't have to look
> like emulated PCI devices per se to look like hardware to the
> guest-OS/user.  That is just one way to do it.

What if the platform in question does have PCI.  How are you going to
write drivers for non-Linux guests?  Design a new bus protocol and
driver system for pv-only devices which can't run anywhere except for a
couple selected guests, or design along the lines of what the physical
hardware on your platform actually looks like?  It's not like you can
construct a full-featured virtualization of x86 without implementing PCI
at some level anyway.

Note this is not an argument for PCI.  It is an argument for devices
that look like hardware on whatever platform you are virtualizing.  On
s390, that might be a bit different than on x86.  But the key idea is
re-use the platform architecture as much as possible.  This gets you far
more code sharing and interoperability for devices.

Just because you can paravirtualize everything does not mean it is a
good idea or more efficient.  A good high performance "paravirtualized"
network driver only needs one efficient place to trap to the hypervisor,
that is to kick off a TX queue.  Nothing else needs to be
paravirtualized to make this efficient, and now you have a driver that
is fairly easily ported among different operating systems because it
reuses many architectural primitives.

So I think it is a good thing for virtio to allow coupling to a PCI
device backend for x86, but be generic enough to allow coupling to other
backends for non-PCI architectures.  Perhaps the top-level device code
and TX/RX queues can be re-used, although I'm not convinced sharing at
this layer gives so much benefit in terms of raw lines of code.

> 
> > 
> >> Its not a heck of a lot of code to write a pv-centric
> >> version of these facilities.
> >>
> >>   
> > 
> > It is.  
> 
> After having done it in the past, I disagree.   But it sounds like you

What about XenBus?  Device discovery and configuration are a huge amount
of work, especially when done without a standard to work from.

Just my $0.02.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-11-01 Thread Zachary Amsden
On Thu, 2007-11-01 at 10:41 -0700, Jeremy Fitzhardinge wrote:
> Keir Fraser wrote:
> > volatile prevents the asm from being 'moved significantly', according to the
> > gcc manual. I take that to mean that reordering is not allowed.
> >   

I understood it as reordering was permitted, but no re-ordering across
another volatile load, store, or asm was permitted.  And of course, as
long as input and output constraints are written properly, the
re-ordering should not be vulnerable to pathological movement causing
the code to malfunction.

It seems that CPU state side effects which can't be expressed in C need
special care - FPU is certainly one example.

Also, memory clobber on a volatile asm should stop invalid movement
across TLB flushes and other problem areas.  Even memory fences should
have memory clobber in order to stop movement of loads and stores across
the fence by the compiler.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] raise tsc clocksource rating

2007-10-30 Thread Zachary Amsden
On Tue, 2007-10-30 at 10:02 -0200, Glauber de Oliveira Costa wrote:

> > No, if no paravirt clocksource is detected, nothing can override the
> > perfect TSC hardware clocksource rating of 400.  And if a paravirt
> > clocksource is detected, it is always better than TSC.
> 
> Why always? tsc is the best possible alternative the _platform_ can
> provide. So the paravirt clocksource will be at best, as good as tsc.

What is the justification for this claim?  The TSC provides no
information about frequency.  The frequency must be measured from a
known fixed rate source.  This measurement is error prone, both on real
hardware, and even more so in a VM.

The TSC can only provide a single measure of elapsed periods.  It has no
concept to differentiate between elapsed periods of time spent in the
guest, which is relevant to the scheduler, and elapsed periods of time
spent in real time, which is relevant to timekeeping.

So a TSC can not drive both timekeeping and scheduling in a virtual
machine, or if it does, it does so inaccurately.


> And if it is the case: why bother with communication mechanisms of all
> kinds, calculations, etc, if we can just read the tsc?
> 
> Noting again: If the tsc does not arrive accurate to the guest, it will
> fail the tsc clocksource tests. So it will be rated zero anyway

You always need to provide a TSC, and it will always initially appear to
be accurate.  Long term, since time reported via the TSC is an
approximation somewhere between real time and elapsed running time,
measurements made using it in a VM will drift, and it is impossible to
remove the inaccuracy from one end of the spectrum (by providing real
time exactly) without compromising it at the other end (by skewing
relative time measurement).

A paravirt clocksource is always better than a TSC because it
compensates for these effects.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] raise tsc clocksource rating

2007-10-29 Thread Zachary Amsden
On Tue, 2007-10-30 at 00:02 +0100, Ingo Molnar wrote:
> * Zachary Amsden <[EMAIL PROTECTED]> wrote:

> > Not every guest support paravirt, but for correctness, all guests 
> > require TSC, which must be exposed all the way up to userspace, no 
> > matter what the efficiency or accuracy may be.
> 
> but if there's a perfect TSC available (there is such hardware) then the 
> TSC _is_ the best clocksource. Paravirt now turns it off unconditionally 
> in essence.

No, if no paravirt clocksource is detected, nothing can override the
perfect TSC hardware clocksource rating of 400.  And if a paravirt
clocksource is detected, it is always better than TSC.

> anyway, that's at most an optimization issue. No strong feelings here, 
> and we can certainly delay this patch until this gets all sorted out.

This patch should be nacked, since it is just wrong.  This is not an
optimization issue.  It is an accuracy issue for all virtualization
environments that affects long term kernel clock stability, which is
important to fix, and the best way to do that is to use a paravirt
clocksource.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] raise tsc clocksource rating

2007-10-29 Thread Zachary Amsden
On Mon, 2007-10-29 at 23:48 +0100, Ingo Molnar wrote:
> * Zachary Amsden <[EMAIL PROTECTED]> wrote:

> if it's inaccurate why are you exposing it to the guest then? Native 
> only uses the TSC if it's safe and accurate to do so.

Not every guest supports paravirt, but for correctness, all guests
require TSC, which must be exposed all the way up to userspace, no
matter what the efficiency or accuracy may be.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] raise tsc clocksource rating

2007-10-29 Thread Zachary Amsden
On Mon, 2007-10-29 at 20:10 -0300, Glauber de Oliveira Costa wrote:
> From: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> 
> tsc is very good time source (when it does not have drifts, does not
> change it's frequency, i.e. when it works), so it should have its rating
> raised to a value greater than, or equal 400.
> 
> Since it's being a tendency among paravirt clocksources to use values
> around 400, we should declare tsc as even better: So we use 500.

Why is the TSC better than a paravirt clocksource?  In our case this is
definitely inaccurate.  Paravirt clocksources should be preferred to
TSC, and both must be made available in hardware for platforms which do
not support paravirt.

Also, please cc all the paravirt developers on things related to
paravirt, especially things with such broad effect.  I think 400 is a
good value for a perfect native clocksource.  >400 should be reserved
for super-real (i.e. paravirt) sources that should always be chosen over
a hardware realistic implementation in a virtual environment.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] raise tsc clocksource rating

2007-10-29 Thread Zachary Amsden
On Tue, 2007-10-30 at 00:02 +0100, Ingo Molnar wrote:
> * Zachary Amsden <[EMAIL PROTECTED]> wrote:
> 
> > Not every guest supports paravirt, but for correctness, all guests 
> > require TSC, which must be exposed all the way up to userspace, no 
> > matter what the efficiency or accuracy may be.
> 
> but if there's a perfect TSC available (there is such hardware) then the 
> TSC _is_ the best clocksource. Paravirt now turns it off unconditionally 
> in essence.

No, if no paravirt clocksource is detected, nothing can override the
perfect TSC hardware clocksource rating of 400.  And if a paravirt
clocksource is detected, it is always better than TSC.

> anyway, that's at most an optimization issue. No strong feelings here, 
> and we can certainly delay this patch until this gets all sorted out.

This patch should be nacked, since it is just wrong.  This is not an
optimization issue.  It is an accuracy issue for all virtualization
environments that affects long term kernel clock stability, which is
important to fix, and the best way to do that is to use a paravirt
clocksource.

Zach

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Is gcc thread-unsafe?

2007-10-25 Thread Zachary Amsden
On Thu, 2007-10-25 at 16:57 -0700, Linus Torvalds wrote:
> 
> On Fri, 26 Oct 2007, Andi Kleen wrote:
> > 
> > The conditional add/sub using carry trick is not generally bogus. 
> > But for registers it's a fine optimization.
> 
> For registers it's fine. For memory, it's a disaster. It's more than just 
> dirty cachelines and introducing race conditions, it's also about 
> protection and dirty pages.
> 
> So even in user space, to even be correct in the first place, the compiler 

It's actually a fair bit worse for us.  We have paths where a false
optimization like this would hyperspace the machine.  In fact, this
frightens me so much I've just gone off to investigate whether gcc has
gone and done this to any of our code.

Clearly the right solution is to introduce threads and write protected
memory into gcc so that the developers are either motivated to ensure
they work or self-destruct.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Interaction between Xen and XFS: stray RW mappings

2007-10-22 Thread Zachary Amsden
On Tue, 2007-10-23 at 01:35 +0200, Andi Kleen wrote:
> On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote:
> > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote:
> > > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote:
> > > > Andi Kleen wrote:
> > > > > Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes:
> > > > >   
> > > > >> Yes, that's precisely the problem.  xfs does delay the unmap, leaving
> > > > >> stray mappings, which upsets Xen.
> > > > >> 
> > > > >
> > > > > Again it not just upsets Xen, keeping mappings to freed pages is 
> > > > > wrong generally 
> > > > > and violates the x86 (and likely others like PPC) architecture 
> > > > > because it can 
> > > > > cause illegal caching attribute aliases.

It is a serious offense to leave stray mappings for memory which can get
re-mapped to I/O devices... especially with PCI and other device
hotplug.  I have to back up Andi on this one unconditionally.

On architectures where you have multibyte, non-wordsize updates to
hardware page tables, you even have races here when setting, updating
and clearing PTEs that must be done in a sequence where no aliasing of
mappings to partially written PTEs can result in I/O memory getting
mapped in a cacheable state.  The window here is only one instruction,
and yes, it is possible for a window this small to create a problem.  A
large (or permanently lazy) window is extremely frightening.

These things do cause bugs.  The bugs take a very long time to show up
and are very difficult to track down, since they can basically cause
random behavior, such as hanging the machine or losing orbit and
crashing into the moon.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24

2007-10-15 Thread Zachary Amsden
On Tue, 2007-10-16 at 00:03 +0200, Andi Kleen wrote:
> > Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid
> > upsetting Xen
> 
> This should be probably done unconditionally because it's a undefined
> dangerous condition everywhere.

Should be done unconditionally.  One could remap the underlying physical
space to include an MMIO region, and speculative reads from the
cacheable virtual mapping of that region could move the robot arm,
destroying the world.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: x86_64 : kernel initial decompression hangs on vmware

2007-10-10 Thread Zachary Amsden
On Wed, 2007-10-10 at 15:37 +0800, Fengguang Wu wrote:
> Hi Zachary,
> 
> One of my friends' vmware "hangs" on booting Linux 2.6.23, and then get
> it to work by applying your patch at http://lkml.org/lkml/2007/8/4/72.
> 
> It would be nice to see your fix going into mainline :-)

I thought that patch already went into mainline in 2.6.23?  If not, it
would be nice to queue it for -stable (also for 2.6.22).

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] paravirt_ops: refactor struct paravirt_ops into smaller pv_*_ops

2007-09-28 Thread Zachary Amsden
On Fri, 2007-09-28 at 11:49 -0700, Jeremy Fitzhardinge wrote:
> > We shouldn't need to export pv_init_ops.  
> 
> No.  The only ones I export are:
> 
> EXPORT_SYMBOL_GPL(pv_time_ops);
> EXPORT_SYMBOL_GPL(pv_cpu_ops);
> EXPORT_SYMBOL_GPL(pv_mmu_ops);
> EXPORT_SYMBOL_GPL(pv_apic_ops);
> EXPORT_SYMBOL(pv_irq_ops);

Nicely done.  I know of some out of tree modules which use part of the
pv_cpu_ops and pv_mmu_ops, but we should not worry about such things,
and it turns out those modules don't need to be virtualized anyway.

> 
> 
> > It is debatable whether
> > CR2/CR3 should be part of CPU or MMU ops.
> 
> Yeah, I was in two minds.  CR3, at least, should be grouped with the
> other tlb operations, wherever they go.  And while they're privileged
> CPU instructions (cpu_ops), they're more logically related to the rest
> of the mmu state.  On the other hand, we could have an ops structure
> specifically dedicated to pagetable manipulations, and put the cr3/tlb
> ops elsewhere.

I'm not against either approach.  I think the way you did it is fine.
If it were up to me, I would probably have driven myself crazy splitting
hairs on it until I was bald.

> >   Also, can we drop write_cr2?
> > It isn't used anywhere, so the only reason to keep it is symmetry.
> > Which was a fine argument when it was an inline, but now it just adds
> > unused junk to the code.
> >   
> 
> I think its used in some cpu state save/restore code, but its not
> relevant to pv-ops.

Ah yes, it is used there.  We actually exercise some of those paths, but
they don't need to be strictly virtualized.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] paravirt_ops: refactor struct paravirt_ops into smaller pv_*_ops

2007-09-28 Thread Zachary Amsden
On Fri, 2007-09-28 at 11:10 -0700, Jeremy Fitzhardinge wrote:
> This patch refactors the paravirt_ops structure into groups of
> functionally related ops:
> 
> pv_info - random info, rather than function entrypoints
> pv_init_ops - functions used at boot time (some for module_init too)
> pv_misc_ops - lazy mode, which didn't fit well anywhere else
> pv_time_ops - time-related functions
> pv_cpu_ops - various privileged instruction ops
> pv_irq_ops - operations for managing interrupt state
> pv_apic_ops - APIC operations
> pv_mmu_ops - operations for managing pagetables
> 
> There are several motivations for this:
> 
> 1. Some of these ops will be general to all x86, and some will be
>i386/x86-64 specific.  This makes it easier to share common stuff
>while allowing separate implementations where needed.
> 
> 2. At the moment we must export all of paravirt_ops, but modules only
>need selected parts of it.  This allows us to export on a case by case
>basis (and also choose which export license we want to apply).

We shouldn't need to export pv_init_ops.  It is debatable whether
CR2/CR3 should be part of CPU or MMU ops.  Also, can we drop write_cr2?
It isn't used anywhere, so the only reason to keep it is symmetry.
Which was a fine argument when it was an inline, but now it just adds
unused junk to the code.

> This still needs to be reconciled with the x86 merge and glommer's
> paravirt_ops unification work.

Those issues aside, this looks great.  I'll give it a whirl today.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/3] Consolidate host virtualization support under Virtualization menu

2007-09-17 Thread Zachary Amsden
On Sun, 2007-09-16 at 07:56 -0700, Jeremy Fitzhardinge wrote:
> Rusty Russell wrote:
> > Well, containerization deserves its own menu, but the question is does
> > it belong under this menu?
> 
> No, I don't think so.  While there are some broad similarities in
> effect, the technology is completely different, as are the capabilities
> and tradeoffs.  They're pretty much orthogonal.

It seems to me containerization is much more similar a feature to SYSV
IPC, process accounting / auditing, and should be an option under
General setup, possibly with its own submenu.  Nearly all the features
under general setup provide optional extensions to kernel services
available from userspace, and so it seems containerization falls
naturally into that category.

Virtualization is completely different, and probably needs separate
server (kvm, lguest) and client (kvm, lguest, xen, vmware) sections in
its menu.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Zachary Amsden
On Fri, 2007-09-14 at 16:44 -0500, Anthony Liguori wrote:

> So then each module creates a hypercall page using this magic MSR and 
> the hypervisor has to keep track of it so that it can appropriately 
> change the page on migration.  The page can only contain a single 
> instruction or else it cannot be easily changed (or you have to be able 
> to prevent the guest from being migrated while in the hypercall page).
> 
> We're really talking about identical models.  Instead of an MSR, the #GP 
> is what tells the hypervisor to update the instruction.  The nice thing 
> about this is that you don't have to keep track of all the current 
> hypercall page locations in the hypervisor.

I agree, multiple hypercall pages is insane.  I was thinking more of a
single hypercall page, fixed in place by the hypervisor, not the kernel.

Then each module can read an MSR saying what VA the hypercall page is
at, and the hypervisor can simply flip one page to switch architectures.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Zachary Amsden
On Fri, 2007-09-14 at 16:02 -0500, Anthony Liguori wrote:
> Jeremy Fitzhardinge wrote:
> > Anthony Liguori wrote:
> >   
> >> This patch refactors the current hypercall infrastructure to better 
> >> support live
> >> migration and SMP.  It eliminates the hypercall page by trapping the UD
> >> exception that would occur if you used the wrong hypercall instruction for 
> >> the
> >> underlying architecture and replacing it with the right one lazily.
> >>   
> >> 
> >
> > I guess it would be pretty rude/unlikely for these opcodes to get reused
> > in other implementations...  But couldn't you make the page trap
> > instead, rather than relying on an instruction fault?
> >   
> 
> The whole point of using the instruction is to allow hypercalls to be 
> used in many locations.  This has the nice side effect of not requiring 
> a central hypercall initialization routine in the guest to fetch the 
> hypercall page.  A PV driver can be completely independent of any other 
> code provided that it restricts itself to it's hypercall namespace.

But if the instruction is architecture dependent, and you run on the
wrong architecture, now you have to patch many locations at fault time,
introducing some nasty runtime code / data cache overlap performance
problems.  Granted, they go away eventually.

I prefer the idea of a hypercall page, but not a central initialization.
Rather, a decentralized approach where PV drivers can detect using CPUID
which hypervisor is present, and a common MSR shared by all hypervisors
that provides the location of the hypercall page.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Fix preemptible lazy mode bug

2007-09-05 Thread Zachary Amsden
On Thu, 2007-09-06 at 06:37 +1000, Rusty Russell wrote:
> On Thu, 2007-08-23 at 22:46 -0700, Zachary Amsden wrote:
> > I recently sent off a fix for lazy vmalloc faults which can happen under 
> > paravirt when lazy mode is enabled.  Unfortunately, I jumped the gun a 
> > bit on fixing this.  I neglected to notice that since the new call to 
> > flush the MMU update queue is called from the page fault handler, it can 
> > be pre-empted.  Both VMI and Xen use per-cpu variables to track lazy 
> > mode state, as all previous calls to set, disable, or flush lazy mode 
> > happened from a non-preemptable state.
> 
> Hi Zach,
> 
>   I don't think this patch does anything.  The flush is because we want
> the just-completed "set_pte" to have immediate effect, so if preempt is
> enabled we're already screwed because we can be moved between set_pte
> and the arch_flush_lazy_mmu_mode() call.
> 
> Now, where's the problem caller?  By my reading or rc4, vmalloc faults
> are fixed up before enabling interrupts.

I agree.  The patch is a nop.  I just got overly paranoid.  The whole
thing is just very prone to bugs.

So let's go over it carefully:

1) Lazy mode can't be entered unless preempt is disabled
2) Flushes are needed in two known cases: kmap_atomic and vmalloc sync
3) kmap_atomic can only be used when preempt is disabled
4) vmalloc sync happens under protection of interrupts disabled

Good logically.  What can break this logic?

#1 is defined by us
#2 is our currently believed complete list of flush scenarios
#3 is impossible to change by design
#4 seems very unlikely to change anytime soon

Seeing #2 appears weak, let us elaborate:

A) Lazy mode is used in a couple of controlled paths for user page table
updates which requires no immediately updated mapping; further, they are
protected under spinlocks, thus never preempted
B) Thus only kernel mapping updates require explicit flushes
C) Any interrupt / fault during lazy mode can only use kmap_atomic or a
set_pmd to sync a vmalloc region, thus proving my point by circular
logic (for interrupt / fault cases).
D) Or better, other kernel mapping changes (kmap, the pageattr code,
boot_ioremap, vmap, vm86 mark_screen_rdonly) are not usable by
interrupt / fault handlers, thus proving C by exclusion.

So I'm fairly certain there is no further issues with interrupt handlers
or faults, where update semantics are heavily constrained.  What of the
actual lazy mode code itself doing kernel remapping?

Most of these are clearly bogus (pageattr, boot_ioremap, vm86 stuff) for
the mm code to use inside a spinlock protected lazy mode region; it does
seem perfectly acceptable though for the mm code to use kmap or vmap
(not kmap_atomic) internally somewhere in the pagetable code.

We can exclude the trivial lazy mode regions (zeromap, unmap, and
remap) easily by inspection.  The PTE copy routine gets deep enough
that not all paths are immediately obvious, though, so we should keep it
in mind for bug checking.

Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Fix preemptible lazy mode bug

2007-09-05 Thread Zachary Amsden
On Thu, 2007-09-06 at 02:33 +1000, Rusty Russell wrote:
> On Tue, 2007-09-04 at 14:42 +0100, Jeremy Fitzhardinge wrote:
> > Rusty Russell wrote:
> > >  static inline void arch_flush_lazy_mmu_mode(void)
> > >  {
> > > - PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_FLUSH);
> > > + if (unlikely(__get_cpu_var(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU))
> > > + arch_leave_lazy_mmu_mode();
> > >  }
> > >   
> > 
> > This changes the semantics a bit; previously "flush" would flush
> > anything pending but leave us in lazy mode.  This just drops lazymode
> > altogether?
> > 
> > I guess if we assume that flushing is a rare event then its OK, but I
> > think the name's a bit misleading.  How does it differ from plain
> > arch_leave_lazy_mmu_mode()?
> 
> Whether it's likely or unlikely to be in lazy mode, basically.  But
> you're right, this should be folded, since we don't want to "leave" lazy
> mode twice.
> 
> ===
> Hoist per-cpu lazy_mode variable up into common code.
> 
> VMI, Xen and lguest all track paravirt_lazy_mode individually, meaning
> we hand a "magic" value for set_lazy_mode() to say "flush if currently
> active".
> 
> We can simplify the logic by hoisting this variable into common code.
> 
> Signed-off-by: Rusty Russell <[EMAIL PROTECTED]>
> 
> ---
>  arch/i386/kernel/vmi.c  |   16 
>  arch/i386/xen/enlighten.c   |   15 +++
>  arch/i386/xen/multicalls.h  |2 +-
>  arch/i386/xen/xen-ops.h |7 ---
>  drivers/lguest/lguest.c |   25 ++---
>  include/asm-i386/paravirt.h |   29 -
>  6 files changed, 30 insertions(+), 64 deletions(-)
> 
> diff -r 072a0b3924fb arch/i386/kernel/vmi.c
> --- a/arch/i386/kernel/vmi.c  Thu Aug 30 04:47:43 2007 +1000
> +++ b/arch/i386/kernel/vmi.c  Wed Sep 05 04:06:20 2007 +1000
> @@ -554,22 +554,14 @@ vmi_startup_ipi_hook(int phys_apicid, un
>  
>  static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
>  {
> - static DEFINE_PER_CPU(enum paravirt_lazy_mode, lazy_mode);
> -
>   if (!vmi_ops.set_lazy_mode)
>   return;
>  
>   /* Modes should never nest or overlap */
> - BUG_ON(__get_cpu_var(lazy_mode) && !(mode == PARAVIRT_LAZY_NONE ||
> -  mode == PARAVIRT_LAZY_FLUSH));
> -
> - if (mode == PARAVIRT_LAZY_FLUSH) {
> - vmi_ops.set_lazy_mode(0);
> - vmi_ops.set_lazy_mode(__get_cpu_var(lazy_mode));
> - } else {
> - vmi_ops.set_lazy_mode(mode);
> - __get_cpu_var(lazy_mode) = mode;
> - }
> + BUG_ON(get_lazy_mode() && !(mode == PARAVIRT_LAZY_NONE |
> + | mode == PARAVIRT_LAZY_FLUSH));

That's a pretty strange line break.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-28 Thread Zachary Amsden

Benjamin Herrenschmidt wrote:

On Wed, 2007-08-22 at 16:25 +1000, Rusty Russell wrote:
  

On Wed, 2007-08-22 at 08:34 +0300, Avi Kivity wrote:


Zachary Amsden wrote:
  

This patch provides hypercalls for the i386 port I/O instructions,
which vastly helps guests which use native-style drivers.  For certain
VMI workloads, this provides a performance boost of up to 30%.  We
expect KVM and lguest to be able to achieve similar gains on I/O
intensive workloads.

Won't these workloads be better off using paravirtualized drivers? 
i.e., do the native drivers with paravirt I/O instructions get anywhere

near the performance of paravirt drivers?
  

This patch also means I can kill off the emulation code in
drivers/lguest/core.c, which is a real relief.



Hrm... how do you deal with X doing IOs ?

Ben.
  


We have an X driver that does minimal performance costing operations.  
As we should and will have for our other drivers.


Zach

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [4/4] 2.6.23-rc3: known regressions v3

2007-08-24 Thread Zachary Amsden

Parag Warudkar wrote:

On 8/24/07, Michal Piotrowski <[EMAIL PROTECTED]> wrote:
  


Virtualization

Subject : CONFIG_VMI broken
References  : http://lkml.org/lkml/2007/8/14/203
Last known good : ?
Submitter   : Parag Warudkar <[EMAIL PROTECTED]>
Caused-By   : ?
Handled-By      : Zachary Amsden <[EMAIL PROTECTED]>
Status  : problem is being debugged



Zach seemed to think that this is already fixed - I am not in a
position to test it immediately so if we know what fixed this - can be
closed. I'll report back once I get a chance to test latest git.
  


Parag, thanks.  I reproduced this bug with your kernel config on 
2.6.23-rc3 and verified it does not happen on latest git.  I inspected 
memory after the crash and determined the problem was patching of 
instructions went awry.  Chris in the meantime fixed a bug with patching 
instructions, and the change from 100% apocalyptic failure to 100% 
unequivocal success has convinced me that was the same bug.


Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Fix preemptible lazy mode bug

2007-08-24 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:

Hm.  Doing any kind of lazy-state operation with preemption enabled is
fundamentally meaningless.  How does it get into a preemptable state
  


Agree 100%.  It is the lazy mode flush that might happen when preempt is 
enabled, but lazy mode is disabled.  In that case, the code relies on 
per-cpu variables, which is a bad thing to do in preemptible code.  This 
can happen in the current code path.


Thinking slightly deeper about it, it might be the case that there is no 
bug, because the local lazy mode variables are only _modified_ in the 
preemptible state, and guaranteed to be zero in the non-preemptible 
state; but it was not clear to me that this is always the case, and I 
got very nervous about reading per-cpu variables with preempt enabled.  
It would, in any case, fire a BUG_ON in the Xen code, which I did fix.


Do you agree it is better to be safe than sorry in this case?  The kind 
of bugs introduced by getting this wrong are really hard to find, and I 
would rather err on the side of an extra increment and decrement of 
preempt_count than causing a regression.


Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




[PATCH] Fix preemptible lazy mode bug

2007-08-23 Thread Zachary Amsden
I recently sent off a fix for lazy vmalloc faults which can happen under 
paravirt when lazy mode is enabled.  Unfortunately, I jumped the gun a 
bit on fixing this.  I neglected to notice that since the new call to 
flush the MMU update queue is called from the page fault handler, it can 
be pre-empted.  Both VMI and Xen use per-cpu variables to track lazy 
mode state, as all previous calls to set, disable, or flush lazy mode 
happened from a non-preemptable state.


I have no idea how to convincingly produce the problem, as generating a 
kernel pre-emption at the required point is, um, difficult, but it is 
most certainly a real possibility, and potentially more likely than the 
bug I fixed originally.


Rusty, you may have to modify lguest code if you use lazy mode and rely 
on per-cpu variables during the callout for paravirt_ops.set_lazy_mode.


I have tested as best as I can, and am trying to write a suite destined 
for LTP which will help catch and debug these issues.


Zach
Since set_lazy_mode(LAZY_MODE_FLUSH) is now called from the page fault
handler, it can potentially happen in a preemptible state.  We therefore
must make all lazy mode paravirt-ops handlers non-preemptible.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED](none)>
---
 arch/i386/kernel/vmi.c|   14 ++
 arch/i386/xen/enlighten.c |4 +++-
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 18673e0..9e669cb 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -555,21 +555,27 @@ vmi_startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
 	static DEFINE_PER_CPU(enum paravirt_lazy_mode, lazy_mode);
+	int cpu;
+	enum paravirt_lazy_mode cur_mode;
 
 	if (!vmi_ops.set_lazy_mode)
 		return;
 
+	cpu = get_cpu();
+	cur_mode = per_cpu(lazy_mode, cpu);
+
 	/* Modes should never nest or overlap */
-	BUG_ON(__get_cpu_var(lazy_mode) && !(mode == PARAVIRT_LAZY_NONE ||
-	 mode == PARAVIRT_LAZY_FLUSH));
+	BUG_ON(cur_mode && !(mode == PARAVIRT_LAZY_NONE ||
+			 mode == PARAVIRT_LAZY_FLUSH));
 
 	if (mode == PARAVIRT_LAZY_FLUSH) {
 		vmi_ops.set_lazy_mode(0);
-		vmi_ops.set_lazy_mode(__get_cpu_var(lazy_mode));
+		vmi_ops.set_lazy_mode(cur_mode);
 	} else {
 		vmi_ops.set_lazy_mode(mode);
-		__get_cpu_var(lazy_mode) = mode;
+		per_cpu(lazy_mode, cpu) = mode;
 	}
+	put_cpu();
 }
 
 static inline int __init check_vmi_rom(struct vrom_header *rom)
diff --git a/arch/i386/xen/enlighten.c b/arch/i386/xen/enlighten.c
index f0c3751..2dafb8a 100644
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -251,7 +251,7 @@ static void xen_halt(void)
 
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
-	BUG_ON(preemptible());
+	get_cpu();
 
 	switch (mode) {
 	case PARAVIRT_LAZY_NONE:
@@ -267,11 +267,13 @@ static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 		/* flush if necessary, but don't change state */
 		if (x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE)
 			xen_mc_flush();
+		put_cpu();
 		return;
 	}
 
 	xen_mc_flush();
 	x86_write_percpu(xen_lazy_mode, mode);
+	put_cpu();
 }
 
 static unsigned long xen_store_tr(void)
-- 
1.4.4.4





Re: 2.6.23-rc3 - CONFIG_VMI broken

2007-08-22 Thread Zachary Amsden

Parag Warudkar wrote:

CONFIG_VMI seems to be broken, but I am not sure when - the last
kernel I was running was 2.6.22-rc4 which used to boot fine and use
VMI. Current git with same configuration causes the kernel to reboot
early. Logs below.

Deselecting CONFIG_VMI and rebuilding allows the kernel to boot normally.

(I am running it on VMWare workstation 6 latest release.)

Thanks

Parag

Linux version 2.6.23-rc3 ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu
4.1.2-0ubuntu4)) #4 Mon Aug 13 21:50:06 EDT 2007
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
 BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000ca000 - 00000000000cc000 (reserved)
 BIOS-e820: 00000000000dc000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000001fef0000 (usable)
 BIOS-e820: 000000001fef0000 - 000000001feff000 (ACPI data)
 BIOS-e820: 000000001feff000 - 000000001ff00000 (ACPI NVS)
 BIOS-e820: 000000001ff00000 - 0000000020000000 (usable)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)
console [earlyser0] enabled
0MB HIGHMEM available.
512MB LOWMEM available.
found SMP MP-table at 000f6c90
VMI: Found VMware, Inc. Hypervisor OPROM, API version 3.0, ROM version 1.0
Reserving virtual address space above 0xfc000000
Int 14: CR2 fc37e260  err   EIP fc37e260  CS 0062  flags 00010006
Stack: c0490ec7 c04913cb 0001  fc001340 c047fff4 c04a6080 c047aa00
  


I reproduced this, slightly different EIP, but I notice that all 
paravirt-ops function calls are to bogus addresses; the first byte 
appears correct (0xfc000000 is in the VMI ROM range), and the extracted 
function addresses in the paravirt_ops struct are correct.  However, it 
looks like the patching of the call instructions went wrong.  Does this 
sound familiar to anyone?


In any case, I think the bug is already fixed.

Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Andi Kleen wrote:


We might benefit from it, but would the 
BusLogic driver?  It sets a nasty precedent for maintenance as different 
hypervisors and emulators hack up different drivers for their own 
performance.



I still think it's preferable to change some drivers than everybody.

AFAIK BusLogic as real hardware is pretty much dead anyways,
so you're probably the only primary user of it anyways.
Go wild on it!
  


It is looking juicy.  Maybe another day.

Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Alan Cox wrote:

I still think it's preferable to change some drivers than everybody.

AFAIK BusLogic as real hardware is pretty much dead anyways,
so you're probably the only primary user of it anyways.
Go wild on it!



I don't believe anyone is materially maintaining the buslogic driver and
in time its going to break completely.
  


I think I was actually the last person to touch it ;)

  

Well that might be. I just think it would be a mistake
to design paravirt_ops based on someone's short term release engineering
considerations.



Agreed, especially as an interface where each in or out traps into the
hypervisor is broken even for the model of virtualising hardware. 
  


Well, it's not necessarily broken, it's just a different model.  At some 
point the cost of maintaining a whole suite of virtual drivers becomes 
greater than leveraging a bunch of legacy drivers.  If you can eliminate 
most of the performance cost of that by changing something at a layer 
below (port I/O), it is a win even if it is not a perfect solution.


But I think I've lost the argument anyways; it doesn't seem to be for 
the greater good of Linux, and there are alternatives we can take.  
Unfortunately for me, they require a lot more work.


Zach
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Andi Kleen wrote:


How is that measured? In a loop? In the same pipeline state?

It seems a little dubious to me.
  


I did the experiments in a controlled environment, with interrupts 
disabled and care to get the pipeline in the same state.  It was a 
perfectly repeatable experiment.  I don't have exact cycle time anymore, 
but they were the tightest measurements I've even seen on cycle counts 
because of the unique nature of serializing the processor for the fault 
/ privilege transition.  I tested a variety of different conditions, 
including different types of #GP (yes, the cost does vary), #NP, #PF, 
sysenter, int $0xxx.  Sysenter was the fastest, by far.  Int was about 
5x the cost.  #GP and friends were all about similar costs.  #PF was the 
most expensive.



  
to verify protection in the page tables mapping the page allows 
execution (P, !NX, and U/S check).  This is a lot more expensive than a 
   

When the page is not executable or not present you get #PF not #GP. 
So the hardware already checks that.


The only case where you would need to check yourself is if you emulate
NX on non NX capable hardware, but I can't see you doing that.
 
  
No, it doesn't.  Between the #GP and decode, you have an SMP race where 
another processor can rewrite the instruction.



That can be ignored imho. If the page goes away you'll notice
when you handle the page fault on read. If it becomes NX then the execution
just happened to be logically a little earlier.

  


No, you can't ignore it.  The page protections won't change between the 
GP and the decoder execution, but the instruction can, causing you to 
decode into the next page where the processor would not have.  !P 
becomes obvious, but failure to respect NX or U/S is an exploitable 
bug.  Put a 1 byte instruction at the end of a page crossing into a NX 
(or supervisor page).  Remotely, keep switching between the 
instruction and a segment override.


Result: user executes instruction on supervisor code page, learning data 
as a result of this; code on NX page gets executed.



Or easier to just write a backend for the lguest virtio drivers,
that will be likely faster in the end anyways than this gross
hack.
  


We already have drivers for all of our hardware in Linux.  Most of the 
hardware we emulate is physical hardware, and there are no virtual 
drivers for it.  Should we take the BusLogic driver and "paravirtualize" 
it by adding VMI hypercalls?  We might benefit from it, but would the 
BusLogic driver?  It sets a nasty precedent for maintenance as different 
hypervisors and emulators hack up different drivers for their own 
performance.


Our SCSI and IDE emulation and thus the drivers used by Linux are pretty 
much fixed in stone; we are not going to go about changing a tricky 
hardware interface to a virtual one, it is simply too risky for 
something as critical as storage.  We might be able to move our network 
driver over to virtio, but that is not a short-term prospect either.


There is great advantage in talking to our existing device layer faster, 
and this is something that is valuable today.



Really LinuxHAL^wparavirt ops is already so complicated that
any new hooks need an extremly good justification and that is
just not here for this.

We can add it if you find an equivalent number of hooks
to eliminate.
  


Interesting trade.  What if I sanitized the whole I/O messy macros into 
something fun and friendly:


native_port_in(int port, iosize_t opsize, int delay)
native_port_out(int port, iosize_t opsize, u32 output, int delay)
native_port_string_in(int port, void *ptr, iosize_t opsize, unsigned 
count, int delay)
native_port_string_out(int port, void *ptr, iosize_t opsize, unsigned 
count, int delay)


Then we can be rid of all the macro goo in io.h, which frightens my 
mother.  We might even be able to get rid of the umpteen different 
places where drivers wrap iospace access with their own byte / word / 
long functions so they can switch between port I/O and memory mapped I/O 
by moving it all into common infrastructure.
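A minimal userspace sketch of what the proposed interface could look like. Only the four function names come from the proposal above; the `iosize_t` enum, the mock `port_space` array, and the little-endian assumption are mine — a real `native_port_in` would of course issue `inb`/`inw`/`inl` instead of touching memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { IO_BYTE = 1, IO_WORD = 2, IO_LONG = 4 } iosize_t;

/* Illustrative stand-in for real port space; a native implementation
 * would execute in/out instructions here instead. */
static uint32_t port_space[0x10000];

static uint32_t size_mask(iosize_t opsize)
{
	return opsize == IO_BYTE ? 0xffu :
	       opsize == IO_WORD ? 0xffffu : 0xffffffffu;
}

static uint32_t native_port_in(int port, iosize_t opsize, int delay)
{
	(void)delay;	/* a real version would do the port-0x80 delay dance */
	return port_space[port & 0xffff] & size_mask(opsize);
}

static void native_port_out(int port, iosize_t opsize, uint32_t output, int delay)
{
	(void)delay;
	port_space[port & 0xffff] = output & size_mask(opsize);
}

static void native_port_string_out(int port, const void *ptr, iosize_t opsize,
				   unsigned count, int delay)
{
	/* Assumes a little-endian host, as on i386. */
	const uint8_t *p = ptr;
	while (count--) {
		uint32_t v = 0;
		memcpy(&v, p, opsize);
		native_port_out(port, opsize, v, delay);
		p += opsize;
	}
}
```

With one set of functions like this behind `paravirt_ops`, the byte/word/long macro triplets in io.h collapse into calls with an explicit `opsize`.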


We could make similar (unwelcome?) advances on the pte functions if it 
were not for the regrettable disconnect between pte_high / pte_low and 
the rest.  Perhaps if it was hidden in macros?


Zach


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Andi Kleen wrote:

On Wed, Aug 22, 2007 at 09:48:25AM -0700, Zachary Amsden wrote:
  

Andi Kleen wrote:


On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote:
 
  
In general, I/O in a virtual guest is subject to performance problems.  
The I/O can not be completed physically, but must be virtualized.  This 
means trapping and decoding port I/O instructions from the guest OS.  
Not only is the trap for a #GP heavyweight, both in the processor and 
the hypervisor (which usually has a complex #GP path), but this forces 
the hypervisor to decode the individual instruction which has faulted.  
   


Is that really that expensive? Hard to imagine.
 
  
You have an expensive (16x cost of hypercall on some processors) 



Where is the difference coming from? Are you using SYSENTER
for the hypercall?  I can't really see you using SYSENTER,
because how would you do system calls then? I bet system calls
are more frequent than in/out, so if you have to decide between the
two, using them for syscalls is likely faster.
  


We use sysenter for hypercalls and also for system calls.  :)


Also I fail to see the fundamental speed difference between

mov index,register
int 0x...
...
switch (register) 
case : do emulation
  


Int (on p4 == ~680 cycles).


versus

out ...
#gp
-> switch (*eip) {
case 0xee:  /* etc. */ 
	do emulation
  


GP = ~2000 cycles.

to verify protection in the page tables mapping the page allows 
execution (P, !NX, and U/S check).  This is a lot more expensive than a 



When the page is not executable or not present you get #PF not #GP. 
So the hardware already checks that.


The only case where you would need to check yourself is if you emulate
NX on non NX capable hardware, but I can't see you doing that.
  


No, it doesn't.  Between the #GP and decode, you have an SMP race where 
another processor can rewrite the instruction.


Zach


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Andi Kleen wrote:

On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote:
  
In general, I/O in a virtual guest is subject to performance problems.  
The I/O can not be completed physically, but must be virtualized.  This 
means trapping and decoding port I/O instructions from the guest OS.  
Not only is the trap for a #GP heavyweight, both in the processor and 
the hypervisor (which usually has a complex #GP path), but this forces 
the hypervisor to decode the individual instruction which has faulted.  



Is that really that expensive? Hard to imagine.
  


You have an expensive (16x cost of hypercall on some processors) 
privilege transition, you have to decode the instruction, then you have 
to verify protection in the page tables mapping the page allows 
execution (P, !NX, and U/S check).  This is a lot more expensive than a 
hypercall.



e.g. you could always have a fast check for inb/outb at the beginning
of the #GP handler. And is your initial #GP entry really more expensive
than a hypercall? 
  


The number of reasons for #GP is enormous, and there are too many paths 
to optimize with fast checks.  We do have a fast check for inb/outb; 
it's just not fast enough.


On P4, hypercall entry is 120 cycles.  #GP is about 2000.  Modern 
processors are better, but a hypercall is always faster than a fault.  
Many times, the hypercall can be handled and ready to return before a 
#GP would even complete.
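The shape of that gap is easy to reproduce in spirit from userspace: a raw system call (a ring transition, the analogue of a hypercall) is much cheaper than a fault path, yet still dwarfs a plain call. The sketch below is only an analogue of Zach's measurement, not his methodology — timing `syscall(SYS_getpid)` against an empty function is my choice of stand-ins.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

#define ITERS 200000

static long long ns_now(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Empty function; the asm barrier keeps the compiler from deleting the call. */
__attribute__((noinline)) static void nop_call(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Average ns per iteration for a plain call vs. a real kernel entry. */
static void measure(double *call_ns, double *syscall_ns)
{
	long long t0 = ns_now();
	for (int i = 0; i < ITERS; i++)
		nop_call();
	long long t1 = ns_now();
	for (int i = 0; i < ITERS; i++)
		syscall(SYS_getpid);	/* bypasses glibc's cached getpid */
	long long t2 = ns_now();
	*call_ns = (double)(t1 - t0) / ITERS;
	*syscall_ns = (double)(t2 - t1) / ITERS;
}
```

On the P4 numbers Zach quotes, the #GP fault path then multiplies the ring-transition cost by roughly another 16x.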


On workloads that sit there and hammer network cards, these costs become 
significant, and latency sensitive network benchmarks suffer.


Worse, even with hardware assist such as VT, the exit reason alone is 
not sufficient to determine the true nature of the faulting instruction, 
requiring a complex and costly instruction decode and simulation.



It's unclear to me why that should be that costly.

Worst case it's a switch()
  


There are 24 different possible I/O operations; sometimes with a port 
encoded in the instruction, sometimes with input in the DX register, 
sometimes with a rep prefix, and for 3 different operand sizes.


Combine that with the MMU checks required, and it's complex and branchy 
enough to justify short-circuiting the whole thing with a simple hypercall.
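For concreteness, here is a sketch (my own, not VMware's decoder) of just the classification step such a #GP handler needs before it can dispatch: prefix scanning, immediate vs. DX-relative ports, string forms, and the three operand sizes — which is exactly how you arrive at 24 variants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct io_insn {
	bool in;	/* direction: port -> register/memory */
	bool string;	/* ins/outs form */
	bool rep;	/* F3 (rep) prefix present */
	int  size;	/* operand size in bytes: 1, 2 or 4 */
	int  port;	/* immediate port, or -1 when taken from DX */
};

/* Classify one faulting instruction; returns false if not port I/O. */
static bool decode_port_io(const uint8_t *ip, struct io_insn *io)
{
	int size = 4;
	io->rep = false;
	for (;;) {	/* prefix scan */
		if (*ip == 0x66) { size = 2; ip++; }	  /* operand-size */
		else if (*ip == 0xf3) { io->rep = true; ip++; }
		else break;
	}
	uint8_t op = *ip++;
	io->string = false;
	io->port = -1;
	switch (op) {
	case 0xe4: case 0xe5:		/* in  al/ax/eax, imm8 */
	case 0xe6: case 0xe7:		/* out imm8, al/ax/eax */
		io->in = !(op & 2);
		io->size = (op & 1) ? size : 1;
		io->port = *ip;
		return true;
	case 0xec: case 0xed:		/* in  al/ax/eax, dx */
	case 0xee: case 0xef:		/* out dx, al/ax/eax */
		io->in = !(op & 2);
		io->size = (op & 1) ? size : 1;
		return true;
	case 0x6c: case 0x6d:		/* ins  */
	case 0x6e: case 0x6f:		/* outs */
		io->string = true;
		io->in = !(op & 2);
		io->size = (op & 1) ? size : 1;
		return true;
	default:
		return false;
	}
}
```

And this is only the classification — the real handler must also re-validate the page mappings, as the thread above discusses.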


Zach


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Avi Kivity wrote:


Since this is only for newer kernels, won't updating the driver to use 
a hypercall be more efficient?  Or is this for existing out-of-tree 
drivers?


Actually, it is for in-tree drivers that we emulate but don't want to 
pollute, and one out of tree driver (that will hopefully be in tree 
soon!) that has no way to determine if making hypercalls is acceptable.


Zach


Re: [PATCH] Fix lazy mode vmalloc synchronization for paravirt

2007-08-22 Thread Zachary Amsden

Jeremy Fitzhardinge wrote:

No, under Xen the kernel/hypervisor PMD is not shared between processes,
so this is still used when PAE is enabled.
  


Ahh, yes.  So this was a lucky catch for us.  Non-PAE kernels seem to be 
increasing in value at antique sales.


Zach



Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

H. Peter Anvin wrote:

Zachary Amsden wrote:
  
In general, I/O in a virtual guest is subject to performance problems. 
The I/O can not be completed physically, but must be virtualized.  This
means trapping and decoding port I/O instructions from the guest OS. 
Not only is the trap for a #GP heavyweight, both in the processor and

the hypervisor (which usually has a complex #GP path), but this forces
the hypervisor to decode the individual instruction which has faulted. 
Worse, even with hardware assist such as VT, the exit reason alone is

not sufficient to determine the true nature of the faulting instruction,
requiring a complex and costly instruction decode and simulation.

This patch provides hypercalls for the i386 port I/O instructions, which
vastly helps guests which use native-style drivers.  For certain VMI
workloads, this provides a performance boost of up to 30%.  We expect
KVM and lguest to be able to achieve similar gains on I/O intensive
workloads.




What about cost on hardware?
  


On modern hardware, port I/O is about the most expensive thing you can 
do.  The extra function call cost is totally masked by the stall.  We 
have measured with port I/O converted like this on real hardware, and 
have seen zero measurable impact on macro-benchmarks.  Micro-benchmarks 
that generate massively repeated port I/O might show some effect on 
ancient hardware, but I can't even imagine a workload which does such a 
thing, other than a polling port I/O loop perhaps - which would not be 
performance critical in any case I can reasonably imagine.


Zach






Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Alan Cox wrote:

I still think it's preferable to change some drivers than everybody.

AFAIK BusLogic as real hardware is pretty much dead anyways,
so you're probably the only primary user of it anyways.
Go wild on it!



I don't believe anyone is materially maintaining the buslogic driver and
in time its going to break completely.
  


I think I was actually the last person to touch it ;)

  

Well that might be. I just think it would be a mistake
to design paravirt_ops based on someone's short term release engineering
considerations.



Agreed, especially as an interface where each in or out traps into the
hypervisor is broken even for the model of virtualising hardware. 
  


Well, it's not necessarily broken, it's just a different model.  At some 
point the cost of maintaining a whole suite of virtual drivers becomes 
greater than leveraging a bunch of legacy drivers.  If you can eliminate 
most of the performance cost of that by changing something at a layer 
below (port I/O), it is a win even if it is not a perfect solution.


But I think I've lost the argument anyways; it doesn't seem to be for 
the greater good of Linux, and there are alternatives we can take.  
Unfortunately for me, they require a lot more work.


Zach


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-22 Thread Zachary Amsden

Andi Kleen wrote:


We might benefit from it, but would the 
BusLogic driver?  It sets a nasty precedent for maintenance as different 
hypervisors and emulators hack up different drivers for their own 
performance.



I still think it's preferable to change some drivers than everybody.

AFAIK BusLogic as real hardware is pretty much dead anyways,
so you're probably the only primary user of it anyways.
Go wild on it!
  


It is looking juicy.  Maybe another day.

Zach


Re: 2.6.23-rc3 - CONFIG_VMI broken

2007-08-22 Thread Zachary Amsden

Parag Warudkar wrote:

CONFIG_VMI seems to be broken, but I am not sure when - the last
kernel I was running was 2.6.22-rc4 which used to boot fine and use
VMI. Current git with same configuration causes the kernel to reboot
early. Logs below.

Deselecting CONFIG_VMI and rebuilding allows the kernel to boot normally.

(I am running it on VMWare workstation 6 latest release.)

Thanks

Parag

Linux version 2.6.23-rc3 ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu
4.1.2-0ubuntu4)) #4 Mon Aug 13 21:50:06 EDT 2007
BIOS-provided physical RAM map:
 BIOS-e820:  - 0009f800 (usable)
 BIOS-e820: 0009f800 - 000a (reserved)
 BIOS-e820: 000ca000 - 000cc000 (reserved)
 BIOS-e820: 000dc000 - 0010 (reserved)
 BIOS-e820: 0010 - 1fef (usable)
 BIOS-e820: 1fef - 1feff000 (ACPI data)
 BIOS-e820: 1feff000 - 1ff0 (ACPI NVS)
 BIOS-e820: 1ff0 - 2000 (usable)
 BIOS-e820: fec0 - fec1 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: fffe - 0001 (reserved)
console [earlyser0] enabled
0MB HIGHMEM available.
512MB LOWMEM available.
found SMP MP-table at 000f6c90
VMI: Found VMware, Inc. Hypervisor OPROM, API version 3.0, ROM version 1.0
Reserving virtual address space above 0xfc00
Int 14: CR2 fc37e260  err   EIP fc37e260  CS 0062  flags 00010006
Stack: c0490ec7 c04913cb 0001  fc001340 c047fff4 c04a6080 c047aa00
  


I reproduced this, slightly different EIP, but I notice that all 
paravirt-ops function calls are to bogus addresses; the first byte 
appears correct (0xfc00 is in the VMI ROM range), and the extracted 
function addresses in the paravirt_ops struct are correct.  However, it 
looks like the patching of the call instructions went wrong.  Does this 
sound familiar to anyone?


In any case, I think the bug is already fixed.

Zach


Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-21 Thread Zachary Amsden

Avi Kivity wrote:

Zachary Amsden wrote:
  

In general, I/O in a virtual guest is subject to performance
problems.  The I/O can not be completed physically, but must be
virtualized.  This means trapping and decoding port I/O instructions
from the guest OS.  Not only is the trap for a #GP heavyweight, both
in the processor and the hypervisor (which usually has a complex #GP
path), but this forces the hypervisor to decode the individual
instruction which has faulted.  Worse, even with hardware assist such
as VT, the exit reason alone is not sufficient to determine the true
nature of the faulting instruction, requiring a complex and costly
instruction decode and simulation.

This patch provides hypercalls for the i386 port I/O instructions,
which vastly helps guests which use native-style drivers.  For certain
VMI workloads, this provides a performance boost of up to 30%.  We
expect KVM and lguest to be able to achieve similar gains on I/O
intensive workloads.





Won't these workloads be better off using paravirtualized drivers? 
i.e., do the native drivers with paravirt I/O instructions get anywhere

near the performance of paravirt drivers?
  


Yes, in general, this is true (better off with paravirt drivers).  
However, we have "paravirt" drivers which run in both 
fully-paravirtualized and fully traditionally virtualized environments.  
As a result, they use native port I/O operations to interact with 
virtual hardware.


Since not all hypervisors have paravirtualized driver infrastructures 
and guest O/S support yet, these hypercalls can be advantageous in a wide 
range of scenarios.  Using I/O hypercalls as such gives exactly the same 
performance as paravirt drivers for us, by eliminating the costly decode 
path, and the simplicity of using the same driver code makes this a huge 
win in code complexity.


Zach



[PATCH] Add I/O hypercalls for i386 paravirt

2007-08-21 Thread Zachary Amsden
In general, I/O in a virtual guest is subject to performance problems.  
The I/O can not be completed physically, but must be virtualized.  This 
means trapping and decoding port I/O instructions from the guest OS.  
Not only is the trap for a #GP heavyweight, both in the processor and 
the hypervisor (which usually has a complex #GP path), but this forces 
the hypervisor to decode the individual instruction which has faulted.  
Worse, even with hardware assist such as VT, the exit reason alone is 
not sufficient to determine the true nature of the faulting instruction, 
requiring a complex and costly instruction decode and simulation.


This patch provides hypercalls for the i386 port I/O instructions, which 
vastly helps guests which use native-style drivers.  For certain VMI 
workloads, this provides a performance boost of up to 30%.  We expect 
KVM and lguest to be able to achieve similar gains on I/O intensive 
workloads.


This patch is against 2.6.23-rc2-mm2, and should be targeted for 2.6.24.

Zach
Virtualized guests in general benefit from having I/O hypercalls.  This
patch adds support for port I/O hypercalls to VMI and provides the
infrastructure for other backends to make use of this feature.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index ea962c0..4d0d150 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -329,6 +329,18 @@ struct paravirt_ops paravirt_ops = {
 
 	.set_iopl_mask = native_set_iopl_mask,
 	.io_delay = native_io_delay,
+	.outb = native_outb,
+	.outw = native_outw,
+	.outl = native_outl,
+	.inb = native_inb,
+	.inw = native_inw,
+	.inl = native_inl,
+	.outsb = native_outsb,
+	.outsw = native_outsw,
+	.outsl = native_outsl,
+	.insb = native_insb,
+	.insw = native_insw,
+	.insl = native_insl,
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	.apic_write = native_apic_write,
diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 44feb34..5ecd85b 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -56,6 +56,7 @@ static int disable_tsc;
 static int disable_mtrr;
 static int disable_noidle;
 static int disable_vmi_timer;
+static int disable_io_ops;
 
 /* Cached VMI operations */
 static struct {
@@ -72,6 +73,18 @@ static struct {
 	void (*set_initial_ap_state)(int, int);
 	void (*halt)(void);
   	void (*set_lazy_mode)(int mode);
+	void (*outb)(u8 value, u16 port);
+	void (*outw)(u16 value, u16 port);
+	void (*outl)(u32 value, u16 port);
+	u8 (*inb)(u16 port);
+	u16 (*inw)(u16 port);
+	u32 (*inl)(u16 port);
+	void (*outsb)(const void *addr, u16 port, u32 count);
+	void (*outsw)(const void *addr, u16 port, u32 count);
+	void (*outsl)(const void *addr, u16 port, u32 count);
+	void (*insb)(void *addr, u16 port, u32 count);
+	void (*insw)(void *addr, u16 port, u32 count);
+	void (*insl)(void *addr, u16 port, u32 count);
 } vmi_ops;
 
 /* Cached VMI operations */
@@ -565,6 +578,33 @@ static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
 	}
 }
 
+#define BUILDIO(bwl,type) \
+static void vmi_out##bwl(type value, int port) { \
+	__asm__ __volatile__("call *%0" : : \
+		"r"(vmi_ops.out##bwl), "a"(value), "d"(port)); \
+} \
+static type vmi_in##bwl(int port) { \
+	type value; \
+	__asm__ __volatile__("call *%1" : \
+		"=a"(value) : \
+		"r"(vmi_ops.in##bwl), "d"(port)); \
+	return value; \
+} \
+static void vmi_outs##bwl(int port, const void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+S"(addr), "+c"(count) : \
+		"r"(vmi_ops.outs##bwl), "d"(port)); \
+} \
+static void vmi_ins##bwl(int port, void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+D"(addr), "+c"(count) : \
+		"r"(vmi_ops.ins##bwl), "d"(port)); \
+} 
+
+BUILDIO(b,unsigned char)
+BUILDIO(w,unsigned short)
+BUILDIO(l,unsigned int)
+
 static inline int __init check_vmi_rom(struct vrom_header *rom)
 {
 	struct pci_header *pci;
@@ -791,6 +831,21 @@ static inline int __init activate_vmi(void)
 	para_wrap(load_esp0, vmi_load_esp0, set_kernel_stack, UpdateKernelStack);
 	para_fill(set_iopl_mask, SetIOPLMask);
 	para_fill(io_delay, IODelay);
+	if (!disable_io_ops) {
+		para_wrap(inb, vmi_inb, inb, INB);
+		para_wrap(inw, vmi_inw, inw, INW);
+		para_wrap(inl, vmi_inl, inl, INL);
+		para_wrap(outb, vmi_outb, outb, OUTB);
+		para_wrap(outw, vmi_outw, outw, OUTW);
+		para_wrap(outl, vmi_outl, outl, OUTL);
+		para_wrap(insb, vmi_insb, insb, INSB);
+		para_wrap(insw, vmi_insw, insw, INSW);
+		para_wrap(insl, vmi_insl, insl, INSL);
+		para_wrap(outsb, vmi_outsb, outsb, OUTSB);
+		para_wrap(outsw, vmi_outsw, outsw, OUTSW);
+		para_wrap(outsl, vmi_outsl, outsl, OUTSL);
+	}
+
 	para_wrap(set_lazy_mode, vmi_set_lazy_mode, set_lazy_mode, SetLazyMode);
 
 	/* user and kernel flush are just handled with different flags to FlushTLB */

[PATCH] Fix lazy mode vmalloc synchronization for paravirt

2007-08-21 Thread Zachary Amsden

Found this looping Ubuntu installs with VMI.

If unlucky enough to hit a vmalloc sync fault during a lazy mode 
operation (from an IRQ handler for a module which was not yet populated 
in current page directory, or from inside copy_one_pte, which touches 
swap_map, and hit in an unused 4M region), the required PDE update would 
never get flushed, causing an infinite page fault loop.


This bug affects any paravirt-ops backend which uses lazy updates; I 
believe that makes it a bug in Xen, VMI, and lguest.  It only happens on 
LOWMEM kernels.


Currently for 2.6.23, but we'll want to backport to -stable as well.

Zach
Touching vmalloc memory in the middle of a lazy mode update can generate
a kernel PDE update, which must be flushed immediately.  The fix is to
leave lazy mode when doing a vmalloc sync.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
index 01ffdd4..fcb38e7 100644
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -249,9 +249,10 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	pmd_k = pmd_offset(pud_k, address);
 	if (!pmd_present(*pmd_k))
 		return NULL;
-	if (!pmd_present(*pmd))
+	if (!pmd_present(*pmd)) {
 		set_pmd(pmd, *pmd_k);
-	else
+		arch_flush_lazy_mmu_mode();
+	} else
 		BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
 	return pmd_k;
 }


[PATCH] Add I/O hypercalls for i386 paravirt

2007-08-21 Thread Zachary Amsden
In general, I/O in a virtual guest is subject to performance problems.  
The I/O cannot be completed physically, but must be virtualized.  This 
means trapping and decoding port I/O instructions from the guest OS.  
Not only is the trap for a #GP heavyweight, both in the processor and 
the hypervisor (which usually has a complex #GP path), but this forces 
the hypervisor to decode the individual instruction which has faulted.  
Worse, even with hardware assist such as VT, the exit reason alone is 
not sufficient to determine the true nature of the faulting instruction, 
requiring a complex and costly instruction decode and simulation.


This patch provides hypercalls for the i386 port I/O instructions, which 
vastly helps guests which use native-style drivers.  For certain VMI 
workloads, this provides a performance boost of up to 30%.  We expect 
KVM and lguest to be able to achieve similar gains on I/O intensive 
workloads.


This patch is against 2.6.23-rc2-mm2, and should be targeted for 2.6.24.

Zach
Virtualized guests in general benefit from having I/O hypercalls.  This
patch adds support for port I/O hypercalls to VMI and provides the
infrastructure for other backends to make use of this feature.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index ea962c0..4d0d150 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -329,6 +329,18 @@ struct paravirt_ops paravirt_ops = {
 
 	.set_iopl_mask = native_set_iopl_mask,
 	.io_delay = native_io_delay,
+	.outb = native_outb,
+	.outw = native_outw,
+	.outl = native_outl,
+	.inb = native_inb,
+	.inw = native_inw,
+	.inl = native_inl,
+	.outsb = native_outsb,
+	.outsw = native_outsw,
+	.outsl = native_outsl,
+	.insb = native_insb,
+	.insw = native_insw,
+	.insl = native_insl,
 
 #ifdef CONFIG_X86_LOCAL_APIC
 	.apic_write = native_apic_write,
diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 44feb34..5ecd85b 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -56,6 +56,7 @@ static int disable_tsc;
 static int disable_mtrr;
 static int disable_noidle;
 static int disable_vmi_timer;
+static int disable_io_ops;
 
 /* Cached VMI operations */
 static struct {
@@ -72,6 +73,18 @@ static struct {
 	void (*set_initial_ap_state)(int, int);
 	void (*halt)(void);
   	void (*set_lazy_mode)(int mode);
+	void (*outb)(u8 value, u16 port);
+	void (*outw)(u16 value, u16 port);
+	void (*outl)(u32 value, u16 port);
+	u8 (*inb)(u16 port);
+	u16 (*inw)(u16 port);
+	u32 (*inl)(u16 port);
+	void (*outsb)(const void *addr, u16 port, u32 count);
+	void (*outsw)(const void *addr, u16 port, u32 count);
+	void (*outsl)(const void *addr, u16 port, u32 count);
+	void (*insb)(void *addr, u16 port, u32 count);
+	void (*insw)(void *addr, u16 port, u32 count);
+	void (*insl)(void *addr, u16 port, u32 count);
 } vmi_ops;
 
 /* Cached VMI operations */
@@ -565,6 +578,33 @@ static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
 	}
 }
 
+#define BUILDIO(bwl,type) \
+static void vmi_out##bwl(type value, int port) { \
+	__asm__ __volatile__("call *%0" : : \
+		"r"(vmi_ops.out##bwl), "a"(value), "d"(port)); \
+} \
+static type vmi_in##bwl(int port) { \
+	type value; \
+	__asm__ __volatile__("call *%1" : \
+		"=a"(value) : \
+		"r"(vmi_ops.in##bwl), "d"(port)); \
+	return value; \
+} \
+static void vmi_outs##bwl(int port, const void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+S"(addr), "+c"(count) : \
+		"r"(vmi_ops.outs##bwl), "d"(port)); \
+} \
+static void vmi_ins##bwl(int port, void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+D"(addr), "+c"(count) : \
+		"r"(vmi_ops.ins##bwl), "d"(port)); \
+}
+
+BUILDIO(b,unsigned char)
+BUILDIO(w,unsigned short)
+BUILDIO(l,unsigned int)
+
 static inline int __init check_vmi_rom(struct vrom_header *rom)
 {
 	struct pci_header *pci;
@@ -791,6 +831,21 @@ static inline int __init activate_vmi(void)
 	para_wrap(load_esp0, vmi_load_esp0, set_kernel_stack, UpdateKernelStack);
 	para_fill(set_iopl_mask, SetIOPLMask);
 	para_fill(io_delay, IODelay);
+	if (!disable_io_ops) {
+		para_wrap(inb, vmi_inb, inb, INB);
+		para_wrap(inw, vmi_inw, inw, INW);
+		para_wrap(inl, vmi_inl, inl, INL);
+		para_wrap(outb, vmi_outb, outb, OUTB);
+		para_wrap(outw, vmi_outw, outw, OUTW);
+		para_wrap(outl, vmi_outl, outl, OUTL);
+		para_wrap(insb, vmi_insb, insb, INSB);
+		para_wrap(insw, vmi_insw, insw, INSW);
+		para_wrap(insl, vmi_insl, insl, INSL);
+		para_wrap(outsb, vmi_outsb, outsb, OUTSB);
+		para_wrap(outsw, vmi_outsw, outsw, OUTSW);
+		para_wrap(outsl, vmi_outsl, outsl, OUTSL);
+	}
+
 	para_wrap(set_lazy_mode, vmi_set_lazy_mode, set_lazy_mode, SetLazyMode);
 
 	/* user and kernel flush are just handled with different flags to FlushTLB */
@@ -968,6 +1023,8 @@ static int __init parse_vmi(char *arg)
 		disable_noidle = 1;
 	} else if (!strcmp(arg, "disable_noidle"))
 		disable_noidle = 1;
+	else if (!strcmp(arg

Re: [PATCH] Add I/O hypercalls for i386 paravirt

2007-08-21 Thread Zachary Amsden

Avi Kivity wrote:

Zachary Amsden wrote:

In general, I/O in a virtual guest is subject to performance
problems.  The I/O can not be completed physically, but must be
virtualized.  This means trapping and decoding port I/O instructions
from the guest OS.  Not only is the trap for a #GP heavyweight, both
in the processor and the hypervisor (which usually has a complex #GP
path), but this forces the hypervisor to decode the individual
instruction which has faulted.  Worse, even with hardware assist such
as VT, the exit reason alone is not sufficient to determine the true
nature of the faulting instruction, requiring a complex and costly
instruction decode and simulation.

This patch provides hypercalls for the i386 port I/O instructions,
which vastly helps guests which use native-style drivers.  For certain
VMI workloads, this provides a performance boost of up to 30%.  We
expect KVM and lguest to be able to achieve similar gains on I/O
intensive workloads.





Won't these workloads be better off using paravirtualized drivers?
i.e., do the native drivers with paravirt I/O instructions get anywhere
near the performance of paravirt drivers?

Yes, in general, this is true (better off with paravirt drivers).  
However, we have paravirt drivers which run in both 
fully-paravirtualized and fully traditionally virtualized environments.  
As a result, they use native port I/O operations to interact with 
virtual hardware.


Since not all hypervisors have paravirtualized driver infrastructures 
and guest O/S support yet, these hypercalls can be advantageous in a wide 
range of scenarios.  Using I/O hypercalls as such gives exactly the same 
performance as paravirt drivers for us, by eliminating the costly decode 
path, and the simplicity of using the same driver code makes this a huge 
win in code complexity.


Zach

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

