Re: best asked at one of the nvidia forums, not on lkml...
On Tue, 2008-02-05 at 13:44 +0700, Igor M Podlesny wrote:
> On 2008-02-05 13:34, Arjan van de Ven wrote:
> [...]
> >> 1) To have compiled it I had to replace global_flush_tlb()
> >>    call with __flush_tlb_all() and still guessing was it(?) a correct
> >>    replacement at all :-)
> >
> > it is not;
>
> I see, thanks. What would be the correct one? ;-)

global_flush_tlb() would be the correct one.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Commit f06e4ec breaks vmware
On Mon, 2008-02-04 at 16:36 +0100, Ingo Molnar wrote:
> * Jeff Chua <[EMAIL PROTECTED]> wrote:
>
> > On Feb 4, 2008 10:53 PM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> > > > commit 8d947344c47a40626730bb80d136d8daac9f2060
> > > > Author: Glauber de Oliveira Costa <[EMAIL PROTECTED]>
> > > > Date: Wed Jan 30 13:31:12 2008 +0100
> > > >
> > > >     x86: change write_idt_entry signature
> > >
> > > does the patch below on top of x86.git#mm fix this?
> > >
> > > 32-bit or 64-bit guest kernel?
> >
> > 32-bit.
> >
> > Yep, this fixed the problem.
>
> great! I've added:
>
>    Tested-by: Jeff Chua <[EMAIL PROTECTED]>
>
> to the commit message as well, if you don't mind. Full patch is below.

Acked-by: Zachary Amsden <[EMAIL PROTECTED]>

Thanks, Ingo!
Re: [PATCH 0/10] Tree fixes for PARAVIRT
On Fri, 2008-01-18 at 22:37 +0100, Ingo Molnar wrote:
> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > > The first fix is not even specific for PARAVIRT, and it's actually
> > > preventing the whole tree from booting.
> >
> > on CONFIG_EFI, indeed :)
>
> but in exchange you broke all of 32-bit with CONFIG_PARAVIRT=y. Which
> means you did not even build-test it on 32-bit, let alone boot test
> it...

Why are we rushing so much to do 64-bit paravirt that we are breaking
working configurations? If the development is going to be this chaotic,
it should be done and tested out of tree until it can stabilize. I do
not like having to continuously retest and review the x86 branch because
the paravirt-ops are constantly in flux and the 32-bit code keeps
breaking.

We won't be doing 64-bit paravirt-ops for exactly this reason - is there
a serious justification from the performance angle on modern 64-bit
hardware? If not, why justify the complexity and hackery to Linux?

Zach
Re: [PATCH] serverworks: IRQ routing needs no _p
On Fri, 2008-01-11 at 18:14 +0000, Alan Cox wrote:
> I can find no reason for the _p on the serverworks IRQ routing logic, and
> a review of the documentation contains no indication that any such delay
> is needed, so let's try this

Looks good to me; unfortunately my Serverworks boxes got upgraded so I
can't test, but the install base is really big.

Acked-by: Zachary Amsden <[EMAIL PROTECTED]>
Re: [PATCH] 8390: split up 8390 with ISA delay and 8390 without
On Fri, 2008-01-11 at 18:08 +0000, Alan Cox wrote:
> This lets us remove port 0x80 use on the PCI systems. It also speeds up
> some of the later 8390 based cores where we know the device does not need
> delay loops either because it has internal handling or in most cases a
> faster device anyway.
>
> We compile up a new module 8390p to go with 8390, one with delays and one
> without. Most users will never need 8390p so it seems best to produce two
> modules.

> diff -u --new-file --recursive --exclude-from /usr/src/exclude linux.vanilla-2.6.24-rc6-mm1/drivers/net/hp.c linux-2.6.24-rc6-mm1/drivers/net/hp.c
> --- linux.vanilla-2.6.24-rc6-mm1/drivers/net/hp.c	2008-01-02 16:04:00.0 +0000
> +++ linux-2.6.24-rc6-mm1/drivers/net/hp.c	2008-01-11 14:47:44.0 +0000
> @@ -176,7 +176,7 @@
>  	outb_p(irqmap[irq] | HP_RUN, ioaddr + HP_CONFIGURE);
>  	outb_p( 0x00 | HP_RUN, ioaddr + HP_CONFIGURE);

... didn't you miss two here? Otherwise, the code looks great.

How will the user know when they need to use 8390p? Should you perhaps
default to producing and using 8390p if CONFIG_ISA is on?

Zach
Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.
On Wed, 2008-01-09 at 17:22 -0500, David P. Reed wrote:
> Zachary Amsden wrote:
> >
> > According to Phoenix Technologies' book "System BIOS for IBM PCs,
> > Compatibles and EISA Computers, 2nd Edition", the I/O port list gives
> >
> > port 0080h R/W Extra page register (temporary storage)
> >
> > Despite looking, I've never seen it documented anywhere else, but I
> > believe it works on just about every PC platform. Except, apparently,
> > my laptop.
>
> The port 80 problem was discovered by me, after months of "bisecting"
> the running code around a problem with hanging when using hwclock in
> 64-bit mode when ACPI is on. So it kills my laptop, too, and many
> current laptop motherboards designed by Quanta for HP and Compaq
> (dv6000, dv9000, tx1000, apparently)

Thanks very much for that - I was debugging this for a while too, and
eventually just shut off hwclock.

> Your laptop isn't an aberration. It's part of the new generation of
> evolved machines that take advantage of the capabilities of ACPI and
> SMBIOS and DMI standards that are becoming core parts of the market.

I beg to differ. I managed to turn the thing into a brick by upgrading
the BIOS (with the correct image, no less) in an attempt to fix it. I
just got it back from repair. I'm not sure that is positive evolutionary
development, but it certainly does make my laptop an aberration :)

FWIW, I fixed the problem locally by recompiling, changing port 80 to
port 84 in io.h; it works great and doesn't conflict with any occupied
ports.

Zach
Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.
On Tue, 2008-01-08 at 21:19 -0800, H. Peter Anvin wrote:
> Zachary Amsden wrote:
> >
> > BTW, it isn't ever safe to pass port 0x80 through to hardware from a
> > virtual machine; some OSes use port 0x80 as a hardware-available scratch
> > register (I believe Darwin/x86 did/does this during boot).
>
> That's funny, because there is definitely no guarantee that you get back
> what you read (well, perhaps there is on Apple.)

According to Phoenix Technologies' book "System BIOS for IBM PCs,
Compatibles and EISA Computers, 2nd Edition", the I/O port list gives

	port 0080h R/W Extra page register (temporary storage)

Despite looking, I've never seen it documented anywhere else, but I
believe it works on just about every PC platform. Except, apparently,
my laptop.

Zach
Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.
On Wed, 2008-01-09 at 16:27 +0100, Rene Herman wrote:
> On 09-01-08 06:30, Christer Weinigel wrote:
>
> I'd not expect very time critical. The current outb_p use gives a delay
> somewhere between .5 and 2 microseconds as per the earlier survey, meaning
> a udelay(1) or 2 would be enough -- again, at the point that udelay() is
> sensible.
>
> New machines don't use the legacy PIC anymore anyway.
>
> > The floppy controller code uses outb_p. Even though there might be
> > floppy controllers on modern systems, I'd rather leave the floppy code
> > alone since it's supposed to be very fragile. If you still use
> > floppies you deserve what you get.
>
> Floppies forever. In practice, leaving it alone isn't going to matter, but
> in that same practice changing it to udelay() probably doesn't either. The
> ones to leave alone are the ones that are clumsy/impossible to test and
> the ones such as in NIC drivers that were specifically tuned.

I'm speaking specifically in terms of 64-bit platforms here. Shouldn't
we unconditionally drop outb_p doing extra port I/O on 64-bit
architectures? Especially considering they don't even have an ISA bus
where the decode timing could even matter?

> If simple outb_p() deprecation is considered enough instead, no need to
> touch anything in drivers/, only changes to "outb(); udelay()" outside
> drivers/.
>
> I'd let Alan decide here.

Agree.

Zach
Re: [linux-kernel] Re: [PATCH] x86: provide a DMI based port 0x80 I/O delay override.
On Tue, 2008-01-08 at 14:15 -0500, David P. Reed wrote:
> Alan Cox wrote:
> > The natsemi docs here say otherwise. I trust them not you.
>
> As well you should. I am honestly curious (for my own satisfaction) as
> to what the natsemi docs say the delay code should do (can't imagine
> they say "use io port 80 because it is unused"). I don't have any

What is the outcome of this thread? Are we going to use timing-based
port delays, or can we finally drop these things entirely on 64-bit
architectures?

I have a doubly vested interest in this, both as the owner of an
affected HP dv9210us laptop and as a maintainer of paravirt code - and
would like 64-bit Linux code to stop using I/O to port 0x80 in both
cases (as I suspect would every other person involved with
virtualization).

BTW, it isn't ever safe to pass port 0x80 through to hardware from a
virtual machine; some OSes use port 0x80 as a hardware-available scratch
register (I believe Darwin/x86 did/does this during boot). This means
simultaneous execution of two virtual machines can interleave port 0x80
values or share data with a hardware-provided covert channel. This means
KVM should be trapping port 0x80 access, which is really expensive, or
alternatively, Linux should not be using port 0x80 for timing bus access
on modern (64-bit) hardware.

I've tried to follow this thread, but with all the jabs, one-ups, and
obscure legacy hardware pageantry going on, it isn't clear what we're
really doing.

Thanks,

Zach
Re: [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu
On Mon, 2007-11-26 at 14:06 -0800, Yinghai Lu wrote:
> >> diff --git a/arch/x86/kernel/smpboot_64.c b/arch/x86/kernel/smpboot_64.c
> >> index 500670c..966d124 100644
> >> --- a/arch/x86/kernel/smpboot_64.c
> >> +++ b/arch/x86/kernel/smpboot_64.c
> >> @@ -912,7 +912,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
> >>  void __init smp_prepare_boot_cpu(void)
> >>  {
> >>  	int me = smp_processor_id();
> >> -	cpu_set(me, cpu_online_map);
> >> +	/* already set me in cpu_online_map in boot_cpu_init() */
> >>  	cpu_set(me, cpu_callout_map);
> >>  	per_cpu(cpu_state, me) = CPU_ONLINE;
> >>  }
> >
> > This ordering can be tricky wrt CPU hotplug. Are you sure you are not
> > breaking CPU hotplug? AFAIK, x86_64 has that right and the 32-bit code
> > had it wrong.
>
> Will the CPU hotplug path call smp_prepare_boot_cpu? It is using __init
> instead of __cpuinit.

No, but moving the place where the cpu is put into the online map might
break hotplug. In any case, I don't see anything wrong with your change
now that I look closer.

Zach
Re: [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu
On Mon, 2007-11-26 at 00:38 -0800, Yinghai Lu wrote:
> [PATCH] x86_64: not set boot cpu in cpu_online_map at smp_prepare_boot_cpu
>
> in init/main.c boot_cpu_init() does that before
>
> Signed-off-by: Yinghai Lu <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86/kernel/smpboot_64.c b/arch/x86/kernel/smpboot_64.c
> index 500670c..966d124 100644
> --- a/arch/x86/kernel/smpboot_64.c
> +++ b/arch/x86/kernel/smpboot_64.c
> @@ -912,7 +912,7 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
>  void __init smp_prepare_boot_cpu(void)
>  {
>  	int me = smp_processor_id();
> -	cpu_set(me, cpu_online_map);
> +	/* already set me in cpu_online_map in boot_cpu_init() */
>  	cpu_set(me, cpu_callout_map);
>  	per_cpu(cpu_state, me) = CPU_ONLINE;
>  }

This ordering can be tricky wrt CPU hotplug. Are you sure you are not
breaking CPU hotplug? AFAIK, x86_64 has that right and the 32-bit code
had it wrong.

Zach
Re: [PATCH 5/5] x86: TLS cleanup
On Wed, 2007-11-21 at 02:25 -0800, Roland McGrath wrote:
> This consolidates the four different places that implemented the same
> encoding magic for the GDT-slot 32-bit TLS support. The old tls32.c is
> renamed and only slightly modified to be the shared implementation guts.

> -#define GET_BASE(desc) ( \
> -	(((desc)->a >> 16) & 0x0000ffff) | \
> -	(((desc)->b << 16) & 0x00ff0000) | \
> -	( (desc)->b        & 0xff000000)   )
> -
> -#define GET_LIMIT(desc) ( \
> -	((desc)->a & 0x0ffff) | \
> -	 ((desc)->b & 0xf0000) )

That was the other redundant definition, thanks.

I had a bit of trouble verifying correctness here because of much
brownian motion. Any possibility of a pure movement / fixup separation
to make it easier on the eyes?

Zach
Re: [PATCH 2/5] x86: use get_desc_base
On Wed, 2007-11-21 at 02:20 -0800, Roland McGrath wrote:
> This changes a couple of places to use the get_desc_base function.
> They were duplicating the same calculation with different equivalent code.
>
> Signed-off-by: Roland McGrath <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c
> index 1cc4340..5291596 100644
> --- a/arch/x86/ia32/tls32.c
> +++ b/arch/x86/ia32/tls32.c
> @@ -85,11 +85,6 @@ asmlinkage long sys32_set_thread_area(struct user_desc __user *u_info)
>   * Get the current Thread-Local Storage area:
>   */
>
> -#define GET_BASE(desc) ( \
> -	(((desc)->a >> 16) & 0x0000ffff) | \
> -	(((desc)->b << 16) & 0x00ff0000) | \
> -	( (desc)->b        & 0xff000000)   )

Are you sure there still aren't more of these things floating around? I
recall ptrace and process.c even having one at one point; hopefully they
are all gone.

Nice cleanups,

Zach
Re: [kvm-devel] [PATCH 3/3] virtio PCI device
On Wed, 2007-11-21 at 09:13 +0200, Avi Kivity wrote:
> Where the device is implemented is an implementation detail that should
> be hidden from the guest; isn't that one of the strengths of
> virtualization? Two examples: a file-based block device implemented in
> qemu gives you fancy file formats with encryption and compression, while
> the same device implemented in the kernel gives you a low-overhead path
> directly to a zillion-disk SAN volume. Or a user-level network device
> capable of running with the slirp stack and no permissions vs. the
> kernel device running copyless most of the time and using a dma engine
> for the rest, but requiring you to be good friends with the admin.
>
> The user should expect zero reconfigurations moving a VM from one model
> to the other.

I think that is pretty insightful, and indeed, it is probably the only
reason we would ever consider using a virtio-based driver. But is this
really a virtualization problem, and is virtio the right place to solve
it? Doesn't I/O hotplug with multipathing or NIC teaming provide the
same infrastructure in a way that is useful in more than just a
virtualization context?

Zach
Re: [PATCH 2/5] x86: use get_desc_base
On Wed, 2007-11-21 at 02:20 -0800, Roland McGrath wrote: > This changes a couple of places to use the get_desc_base function. They > were duplicating the same calculation with different equivalent code. > > Signed-off-by: Roland McGrath <[EMAIL PROTECTED]> > > diff --git a/arch/x86/ia32/tls32.c b/arch/x86/ia32/tls32.c > index 1cc4340..5291596 100644 > --- a/arch/x86/ia32/tls32.c > +++ b/arch/x86/ia32/tls32.c > @@ -85,11 +85,6 @@ asmlinkage long sys32_set_thread_area(struct user_desc __user *u_info) > * Get the current Thread-Local Storage area: > */ > -#define GET_BASE(desc) ( \ > - (((desc)->a >> 16) & 0x0000ffff) | \ > - (((desc)->b << 16) & 0x00ff0000) | \ > - ( (desc)->b & 0xff000000) ) Are you sure there still aren't more of these things floating around? I recall ptrace and process.c even having one at one point, hopefully they are all gone. Nice cleanups, Zach
Re: [PATCH 15/18] x86 vDSO: consolidate vdso32
On Mon, 2007-11-19 at 14:06 -0800, Roland McGrath wrote: > This makes x86_64's ia32 emulation support share the sources used in the > 32-bit kernel for the 32-bit vDSO and much of its setup code. > > The 32-bit vDSO mapping now behaves the same on x86_64 as on native 32-bit. > The abi.syscall32 sysctl on x86_64 now takes the same values that > vm.vdso_enabled takes on the 32-bit kernel. That is, 1 means a randomized > vDSO location, 2 means the fixed old address. The CONFIG_COMPAT_VDSO > option is now available to make this the default setting, the same meaning > it has for the 32-bit kernel. (This does not affect the 64-bit vDSO.) I think you should drop CONFIG_COMPAT_VDSO support for the 32-bit vDSO on a 64-bit kernel. This was only to hack around a broken version of glibc that shipped with SUSE PRO 9.0, which had broken assertions based on a misinterpretation of ELF fields. 64-bit machines will never see this glibc, and the hack can die. Perhaps it is finally time to remove the hack from 32-bit as well and eliminate COMPAT_VDSO entirely? Or does it really have to live forever? Zach
Re: [PATCH 13/18] x86 vDSO: ia32 sysenter_return
On Mon, 2007-11-19 at 14:06 -0800, Roland McGrath wrote: > This changes the 64-bit kernel's support for the 32-bit sysenter > instruction to use stored fields rather than constants for the > user-mode return address, as the 32-bit kernel does. This adds a > sysenter_return field to struct thread_info, as 32-bit has. There > is no observable effect from this yet. It makes the assembly code > independent of the 32-bit vDSO mapping address, paving the way for > making the vDSO address vary as it does on the 32-bit kernel. I hope you won't carry the dependency of COMPAT_VDSO over to 64-bit when that happens. > @@ -104,7 +103,7 @@ ENTRY(ia32_sysenter_target) > pushfq > CFI_ADJUST_CFA_OFFSET 8 > /*CFI_REL_OFFSET rflags,0*/ > - movl $VSYSCALL32_SYSEXIT, %r10d > + movl 8*3-THREAD_SIZE+threadinfo_sysenter_return(%rsp), %r10d 8*3-THREAD_SIZE is not very intuitive. Can you add a comment on the math? Zach
Re: [PATCH] x86/paravirt: revert exports to restore old behaviour
On Tue, 2007-11-13 at 22:22 +0000, Christoph Hellwig wrote: > On Tue, Nov 13, 2007 at 12:21:16PM -0800, Jeremy Fitzhardinge wrote: > > Subject: x86/paravirt: revert exports to restore old behaviour > > > > Subdividing the paravirt_ops structure caused a regression in certain > > non-GPL modules which try to use mmu_ops and cpu_ops. This restores > > the old behaviour, and makes it consistent with the > > non-CONFIG_PARAVIRT case. > > NACK, both of these are internal and graphics drivers should not be > using them. Some of them are internal, but some are not; they just happened to be privileged CPU operations available to anyone. Does anyone know which hooks they are actually using? Something like reading MSRs is not internal at all; it is CPU specific, and the graphics drivers might have very good reasons to read them, for example to figure out how AGP is configured. The graphics drivers most certainly don't need to be paravirtualized, however, so they could always work around this with raw asm constructs. The mmu_ops hook is more debatable. But until someone figures out which hooks they want, we can't know whether they are internal or not. In any case, this is a regression. Zach
Re: Use of virtio device IDs
On Tue, 2007-11-13 at 08:18 -0500, Gregory Haskins wrote: > Since PCI was designed as a hardware solution it has all kinds of stuff > specifically geared towards hardware constraints. Those constraints > are different in a virtualized platform, so some things do not translate > well to an optimal solution. Half of the stuff wouldn't be used, and > the other half has some nasty baggage associated with it (like still > requiring partial emulation in a PV environment). > > The point of PV, of course, is high performance guest/host interfaces, > and PCI baggage just gets in the way in many respects. Once a I would tend to disagree with that statement. The point of PV is a simpler to implement guest/host interface, which sometimes results in a higher performance interface. PV does not always mean high performance, nor does high performance imply PV is necessary. > particular guest's subsystem is HV aware we no longer strictly require > legacy emulation. It should know how to talk to the host using whatever > is appropriate. I aim to strip out all of those emulation points > whenever possible. Once you do that, all of the PCI features that we > would use drops to zero. Device discovery, bus enumeration and shared memory configuration space mapping are all very useful features that require complex negotiation with the hypervisor. Those are provided implicitly in a standardized way by PCI (or by any good hardware bus protocol). > On the flip side, we have full emulation if you want broader > compatibility with legacy, at the expense of performance. There is no reason you need to sacrifice performance for broader compatibility with well designed virtual devices on modern hardware. > > > > >> You are > >> probably better off designing something that is PV specific instead of > >> shoehorning it in to fit a different model (at least for the things I > >> have in mind). > > > > Well, if we design our pv devices to look like hardware, they will fit > > quite well. Both to the guest OS and to user's expectations. > > Like what hardware? Like PCI hardware? What if the platform in > question doesn't have PCI? Also note that devices don't have to look > like emulated PCI devices per se to look like hardware to the > guest-OS/user. That is just one way to do it. What if the platform in question does have PCI? How are you going to write drivers for non-Linux guests? Design a new bus protocol and driver system for pv-only devices which can't run anywhere except for a couple selected guests, or design along the lines of what the physical hardware on your platform actually looks like? It's not like you can construct a full-featured virtualization of x86 without implementing PCI at some level anyway. Note this is not an argument for PCI. It is an argument for devices that look like hardware on whatever platform you are virtualizing. On s390, that might be a bit different than on x86. But the key idea is to re-use the platform architecture as much as possible. This gets you far more code sharing and interoperability for devices. Just because you can paravirtualize everything does not mean it is a good idea or more efficient. A good high performance "paravirtualized" network driver only needs one efficient place to trap to the hypervisor, that is to kick off a TX queue. Nothing else needs to be paravirtualized to make this efficient, and now you have a driver that is fairly easily ported among different operating systems because it reuses many architectural primitives. So I think it is a good thing for virtio to allow coupling to a PCI device backend for x86, but be generic enough to allow coupling to other backends for non-PCI architectures. Perhaps the top-level device code and TX/RX queues can be re-used, although I'm not convinced sharing at this layer gives so much benefit in terms of raw lines of code. > > > > >> Its not a heck of a lot of code to write a pv-centric > >> version of these facilities. > >> > >> > > > > It is. > > After having done it in the past, I disagree. But it sounds like you What about XenBus? Device discovery and configuration are a huge amount of work, especially when done without a standard to work from. Just my $0.02. Zach
Re: [Lguest] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt
On Thu, 2007-11-01 at 10:41 -0700, Jeremy Fitzhardinge wrote: > Keir Fraser wrote: > > volatile prevents the asm from being 'moved significantly', according to the > > gcc manual. I take that to mean that reordering is not allowed. > > I understood it as reordering was permitted, but no re-ordering across another volatile load, store, or asm was permitted. And of course, as long as input and output constraints are written properly, the re-ordering should not be vulnerable to pathological movement causing the code to malfunction. It seems that CPU state side effects which can't be expressed in C need special care - the FPU is certainly one example. Also, a memory clobber on a volatile asm should stop invalid movement across TLB flushes and other problem areas. Even memory fences should have a memory clobber in order to stop movement of loads and stores across the fence by the compiler. Zach
Re: [PATCH] raise tsc clocksource rating
On Tue, 2007-10-30 at 10:02 -0200, Glauber de Oliveira Costa wrote: > > No, if no paravirt clocksource is detected, nothing can override the > > perfect TSC hardware clocksource rating of 400. And if a paravirt > > clocksource is detected, it is always better than TSC. > > Why always? tsc is the best possible alternative the _platform_ can > provide. So the paravirt clocksource will be at best, as good as tsc. What is the justification for this claim? The TSC provides no information about frequency. The frequency must be measured from a known fixed rate source. This measurement is error prone, both on real hardware, and even more so in a VM. The TSC can only provide a single measure of elapsed periods. It has no concept to differentiate between elapsed periods of time spent in the guest, which is relevant to the scheduler, and elapsed periods of real time, which is relevant to timekeeping. So a TSC can not drive both timekeeping and scheduling in a virtual machine, or if it does, it does so inaccurately. > And if it is the case: why bother with communication mechanisms of all > kinds, calculations, etc, if we can just read the tsc? > > Noting again: If the tsc does not arrive accurate to the guest, it will > fail the tsc clocksource tests. So it will be rated zero anyway You always need to provide a TSC, and it will always initially appear to be accurate. Long term, since time reported via the TSC is an approximation somewhere between real time and elapsed running time, measurements made using it in a VM will drift, and it is impossible to remove the inaccuracy at one end of the spectrum (by providing real time exactly) without compromising it at the other end (by skewing relative time measurement). A paravirt clocksource is always better than a TSC because it compensates for these effects. Zach
Re: [PATCH] raise tsc clocksource rating
On Tue, 2007-10-30 at 00:02 +0100, Ingo Molnar wrote: > * Zachary Amsden <[EMAIL PROTECTED]> wrote: > > Not every guest support paravirt, but for correctness, all guests > > require TSC, which must be exposed all the way up to userspace, no > > matter what the efficiency or accuracy may be. > > but if there's a perfect TSC available (there is such hardware) then the > TSC _is_ the best clocksource. Paravirt now turns it off unconditionally > in essence. No, if no paravirt clocksource is detected, nothing can override the perfect TSC hardware clocksource rating of 400. And if a paravirt clocksource is detected, it is always better than TSC. > anyway, that's at most an optimization issue. No strong feelings here, > and we can certainly delay this patch until this gets all sorted out. This patch should be nacked, since it is just wrong. This is not an optimization issue. It is an accuracy issue for all virtualization environments that affects long term kernel clock stability, which is important to fix, and the best way to do that is to use a paravirt clocksource. Zach
Re: [PATCH] raise tsc clocksource rating
On Mon, 2007-10-29 at 23:48 +0100, Ingo Molnar wrote: > * Zachary Amsden <[EMAIL PROTECTED]> wrote: > if it's inaccurate why are you exposing it to the guest then? Native > only uses the TSC if it's safe and accurate to do so. Not every guest supports paravirt, but for correctness, all guests require TSC, which must be exposed all the way up to userspace, no matter what the efficiency or accuracy may be. Zach
Re: [PATCH] raise tsc clocksource rating
On Mon, 2007-10-29 at 20:10 -0300, Glauber de Oliveira Costa wrote: > From: Glauber de Oliveira Costa <[EMAIL PROTECTED]> > > tsc is very good time source (when it does not have drifts, does not > change it's frequency, i.e. when it works), so it should have its rating > raised to a value greater than, or equal 400. > > Since it's being a tendency among paravirt clocksources to use values > around 400, we should declare tsc as even better: So we use 500. Why is the TSC better than a paravirt clocksource? In our case this is definitely inaccurate. Paravirt clocksources should be preferred to TSC, and both must be made available in hardware for platforms which do not support paravirt. Also, please cc all the paravirt developers on things related to paravirt, especially things with such broad effect. I think 400 is a good value for a perfect native clocksource. >400 should be reserved for super-real (i.e. paravirt) sources that should always be chosen over a hardware realistic implementation in a virtual environment. Zach
Re: Is gcc thread-unsafe?
On Thu, 2007-10-25 at 16:57 -0700, Linus Torvalds wrote: > > On Fri, 26 Oct 2007, Andi Kleen wrote: > > > > The conditional add/sub using carry trick is not generally bogus. > > But for registers it's a fine optimization. > > For registers it's fine. For memory, it's a disaster. It's more than just > dirty cachelines and introducing race conditions, it's also about > protection and dirty pages. > > So even in user space, to even be correct in the first place, the compiler It's actually a fair bit worse for us. We have paths where a false optimization like this would hyperspace the machine. In fact, this frightens me so much I've just gone off to investigate whether gcc has gone and done this to any of our code. Clearly the right solution is to introduce threads and write protected memory into gcc so that the developers are either motivated to ensure they work or self-destruct. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
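The "conditional add/sub using carry trick" under discussion can be sketched in plain C (names illustrative). Both versions compute the same value, but the branchless one performs a store even when the condition is false — fine for a register, a disaster if the target is shared memory or a write-protected page, which is exactly the hazard Linus and Zach describe:

```c
#include <assert.h>

/* Sketch of the branchless transform debated in the thread.
 * Equivalent results, different store behavior. */
static void cond_add_branch(long *x, long v, int cond)
{
    if (cond)
        *x += v;            /* store happens only when cond is true */
}

static void cond_add_branchless(long *x, long v, int cond)
{
    *x += v & -(long)cond;  /* unconditional store, even when cond == 0:
                             * -(long)1 is all-ones, -(long)0 is zero */
}
```

A compiler applying the second form to a memory location behind the programmer's back introduces a write where the source had none — the race condition (and potential protection fault) at issue.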
Re: Interaction between Xen and XFS: stray RW mappings
On Tue, 2007-10-23 at 01:35 +0200, Andi Kleen wrote: > On Tue, Oct 23, 2007 at 08:32:25AM +1000, David Chinner wrote: > > On Mon, Oct 22, 2007 at 09:07:40PM +0200, Andi Kleen wrote: > > > On Mon, Oct 22, 2007 at 11:40:52AM -0700, Jeremy Fitzhardinge wrote: > > > > Andi Kleen wrote: > > > > > Jeremy Fitzhardinge <[EMAIL PROTECTED]> writes: > > > > > > > > > >> Yes, that's precisely the problem. xfs does delay the unmap, leaving > > > > >> stray mappings, which upsets Xen. > > > > >> > > > > > > > > > > Again it not just upsets Xen, keeping mappings to freed pages is > > > > > wrong generally > > > > > and violates the x86 (and likely others like PPC) architecture > > > > > because it can > > > > > cause illegal caching attribute aliases. It is a serious offense to leave stray mappings for memory which can get re-mapped to I/O devices... especially with PCI and other device hotplug. I have to back up Andi on this one unconditionally. On architectures where you have multibyte, non-wordsize updates to hardware page tables, you even have races here when setting, updating and clearing PTEs that must be done in a sequence where no aliasing of mappings to partially written PTEs can result in I/O memory getting mapped in a cacheable state. The window here is only one instruction, and yes, it is possible for a window this small to create a problem. A large (or permanently lazy) window is extremely frightening. These things do cause bugs. The bugs take a very long time to show up and are very difficult to track down, since they can basically cause random behavior, such as hanging the machine or losing orbit and crashing into the moon. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
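The "multibyte, non-wordsize PTE update" race Zach mentions can be made concrete with a standalone sketch (illustrative names, not kernel code): on 32-bit x86 PAE a 64-bit PTE is written as two 32-bit stores, so another walker can observe an entry with the new low half and the old high half — the one-instruction window described above:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a half-written PTE being observable mid-update. */
static uint32_t observed_high_mid_update;

static void set_pte_nonatomic(volatile uint32_t pte[2], uint64_t val)
{
    pte[0] = (uint32_t)val;              /* low half written first...    */
    observed_high_mid_update = pte[1];   /* window: old high, new low    */
    pte[1] = (uint32_t)(val >> 32);      /* ...high half written second  */
}
```

If the stale combination maps I/O memory with cacheable attributes, even this single-instruction window is enough to create the aliasing bugs described — hence the alarm over windows that stay open lazily.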
Re: [stable] [PATCH 00/12] xen/paravirt_ops patches for 2.6.24
On Tue, 2007-10-16 at 00:03 +0200, Andi Kleen wrote: > > Subject: [PATCH 12/12] xfs: eagerly remove vmap mappings to avoid > > upsetting Xen > > This should be probably done unconditionally because it's a undefined > dangerous condition everywhere. Should be done unconditionally. One could remap the underlying physical space to include an MMIO region, and speculative reads from the cacheable virtual mapping of that region could move the robot arm, destroying the world. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: x86_64 : kernel initial decompression hangs on vmware
On Wed, 2007-10-10 at 15:37 +0800, Fengguang Wu wrote: > Hi Zachary, > > One of my friends' vmware "hangs" on booting Linux 2.6.23, and then get > it to work by applying your patch at http://lkml.org/lkml/2007/8/4/72. > > It would be nice to see your fix going into mainline :-) I thought that patch already went into mainline in 2.6.23? If not, it would be nice to queue it for -stable (also for 2.6.22). Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] paravirt_ops: refactor struct paravirt_ops into smaller pv_*_ops
On Fri, 2007-09-28 at 11:49 -0700, Jeremy Fitzhardinge wrote: > > We shouldn't need to export pv_init_ops. > > No. The only ones I export are: > > EXPORT_SYMBOL_GPL(pv_time_ops); > EXPORT_SYMBOL_GPL(pv_cpu_ops); > EXPORT_SYMBOL_GPL(pv_mmu_ops); > EXPORT_SYMBOL_GPL(pv_apic_ops); > EXPORT_SYMBOL(pv_irq_ops); Nicely done. I know of some out of tree modules which use part of the pv_cpu_ops and pv_mmu_ops, but we should not worry about such things, and it turns out those modules don't need to be virtualized anyway. > > > > It is debatable whether > > CR2/CR3 should be part of CPU or MMU ops. > > Yeah, I was in two minds. CR3, at least, should be grouped with the > other tlb operations, wherever they go. And while they're privileged > CPU instructions (cpu_ops), they're more logically related to the rest > of the mmu state. On the other hand, we could have an ops structure > specifically dedicated to pagetable manipulations, and put the cr3/tlb > ops elsewhere. I'm not against either approach. I think the way you did it is fine. If it were up to me, I would probably have driven myself crazy splitting hairs on it until I was bald. > > Also, can we drop write_cr2? > > It isn't used anywhere, so the only reason to keep it is symmetry. > > Which was a fine argument when it was an inline, but now it just adds > > unused junk to the code. > > > > I think its used in some cpu state save/restore code, but its not > relevant to pv-ops. Ah yes, it is used there. We actually exercise some of those paths, but they don't need to be strictly virtualized. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH RFC] paravirt_ops: refactor struct paravirt_ops into smaller pv_*_ops
On Fri, 2007-09-28 at 11:10 -0700, Jeremy Fitzhardinge wrote: > This patch refactors the paravirt_ops structure into groups of > functionally related ops: > > pv_info - random info, rather than function entrypoints > pv_init_ops - functions used at boot time (some for module_init too) > pv_misc_ops - lazy mode, which didn't fit well anywhere else > pv_time_ops - time-related functions > pv_cpu_ops - various privileged instruction ops > pv_irq_ops - operations for managing interrupt state > pv_apic_ops - APIC operations > pv_mmu_ops - operations for managing pagetables > > There are several motivations for this: > > 1. Some of these ops will be general to all x86, and some will be >i386/x86-64 specific. This makes it easier to share common stuff >while allowing separate implementations where needed. > > 2. At the moment we must export all of paravirt_ops, but modules only >need selected parts of it. This allows us to export on a case by case >basis (and also choose which export license we want to apply). We shouldn't need to export pv_init_ops. It is debatable whether CR2/CR3 should be part of CPU or MMU ops. Also, can we drop write_cr2? It isn't used anywhere, so the only reason to keep it is symmetry. Which was a fine argument when it was an inline, but now it just adds unused junk to the code. > This still needs to be reconciled with the x86 merge and glommer's > paravirt_ops unification work. Those issues aside, this looks great. I'll give it a whirl today. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
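The shape of the refactor — one monolithic ops table split into functionally grouped tables of function pointers, each exportable on its own terms — can be sketched as follows (simplified illustrative definitions, not the actual paravirt headers):

```c
#include <assert.h>

/* Sketch of grouped paravirt op tables. Splitting by function group
 * lets each table carry its own export (and export license) rather
 * than exporting one all-or-nothing structure. */
struct pv_cpu_ops_sketch {
    unsigned long (*read_cr0)(void);
};

struct pv_mmu_ops_sketch {
    void (*flush_tlb)(void);
};

static unsigned long native_read_cr0(void) { return 0x80000011UL; }

static int tlb_flush_count;
static void native_flush_tlb(void) { tlb_flush_count++; }

/* Each group is filled in with native or hypervisor implementations. */
static struct pv_cpu_ops_sketch pv_cpu_ops = { .read_cr0 = native_read_cr0 };
static struct pv_mmu_ops_sketch pv_mmu_ops = { .flush_tlb = native_flush_tlb };
```

A hypervisor backend then overrides only the groups it cares about, and modules link against only the groups actually exported.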
Re: [PATCH 2/3] Consolidate host virtualization support under Virtualization menu
On Sun, 2007-09-16 at 07:56 -0700, Jeremy Fitzhardinge wrote: > Rusty Russell wrote: > > Well, containerization deserves its own menu, but the question is does > > it belong under this menu? > > No, I don't think so. While there are some broad similarities in > effect, the technology is completely different, as are the capabilities > and tradeoffs. They're pretty much orthogonal. It seems to me containerization is much more similar a feature to SYSV IPC, process accounting / auditing, and should be an option under General setup, possibly with its own submenu. Nearly all the features under general setup provide optional extensions to kernel services available from userspace, and so it seems containerization falls naturally into that category. Virtualization is completely different, and probably needs separate server (kvm, lguest) and client (kvm, lguest, xen, vmware) sections in its menu. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
On Fri, 2007-09-14 at 16:44 -0500, Anthony Liguori wrote: > So then each module creates a hypercall page using this magic MSR and > the hypervisor has to keep track of it so that it can appropriately > change the page on migration. The page can only contain a single > instruction or else it cannot be easily changed (or you have to be able > to prevent the guest from being migrated while in the hypercall page). > > We're really talking about identical models. Instead of an MSR, the #GP > is what tells the hypervisor to update the instruction. The nice thing > about this is that you don't have to keep track of all the current > hypercall page locations in the hypervisor. I agree, multiple hypercall pages is insane. I was thinking more of a single hypercall page, fixed in place by the hypervisor, not the kernel. Then each module can read an MSR saying what VA the hypercall page is at, and the hypervisor can simply flip one page to switch architectures. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure
On Fri, 2007-09-14 at 16:02 -0500, Anthony Liguori wrote: > Jeremy Fitzhardinge wrote: > > Anthony Liguori wrote: > > > >> This patch refactors the current hypercall infrastructure to better > >> support live > >> migration and SMP. It eliminates the hypercall page by trapping the UD > >> exception that would occur if you used the wrong hypercall instruction for > >> the > >> underlying architecture and replacing it with the right one lazily. > >> > >> > > > > I guess it would be pretty rude/unlikely for these opcodes to get reused > > in other implementations... But couldn't you make the page trap > > instead, rather than relying on an instruction fault? > > > > The whole point of using the instruction is to allow hypercalls to be > used in many locations. This has the nice side effect of not requiring > a central hypercall initialization routine in the guest to fetch the > hypercall page. A PV driver can be completely independent of any other > code provided that it restricts itself to it's hypercall namespace. But if the instruction is architecture dependent, and you run on the wrong architecture, now you have to patch many locations at fault time, introducing some nasty runtime code / data cache overlap performance problems. Granted, they go away eventually. I prefer the idea of a hypercall page, but not a central initialization. Rather, a decentralized approach where PV drivers can detect using CPUID which hypervisor is present, and a common MSR shared by all hypervisors that provides the location of the hypercall page. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
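The single-hypercall-page argument can be sketched with one level of indirection (purely illustrative — a real hypercall page holds instructions such as VMCALL vs. VMMCALL, not C function pointers): every call site goes through one shared entry, so switching architectures on migration means the hypervisor flips that one page rather than patching many call sites at fault time:

```c
#include <assert.h>

/* Two hypothetical hypercall encodings for different architectures. */
static long vmcall_style(long nr)  { return nr + 1000; }
static long vmmcall_style(long nr) { return nr + 2000; }

/* Stand-in for the single hypercall page: one shared entry point
 * that the "hypervisor" can replace wholesale. */
static long (*hypercall_page)(long) = vmcall_style;

/* Every driver call site funnels through the same indirection,
 * so no per-call-site patching is ever needed. */
static long do_hypercall(long nr) { return hypercall_page(nr); }
```

This is the cache-friendly alternative to rewriting each instruction in place on first fault, at the cost of the hypervisor tracking where the page lives.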
Re: [PATCH] Fix preemptible lazy mode bug
On Thu, 2007-09-06 at 06:37 +1000, Rusty Russell wrote: > On Thu, 2007-08-23 at 22:46 -0700, Zachary Amsden wrote: > > I recently sent off a fix for lazy vmalloc faults which can happen under > > paravirt when lazy mode is enabled. Unfortunately, I jumped the gun a > > bit on fixing this. I neglected to notice that since the new call to > > flush the MMU update queue is called from the page fault handler, it can > > be pre-empted. Both VMI and Xen use per-cpu variables to track lazy > > mode state, as all previous calls to set, disable, or flush lazy mode > > happened from a non-preemptable state. > > Hi Zach, > > I don't think this patch does anything. The flush is because we want > the just-completed "set_pte" to have immediate effect, so if preempt is > enabled we're already screwed because we can be moved between set_pte > and the arch_flush_lazy_mmu_mode() call. > > Now, where's the problem caller? By my reading of rc4, vmalloc faults > are fixed up before enabling interrupts. I agree. The patch is a nop. I just got overly paranoid. The whole thing is just very prone to bugs. So let's go over it carefully: 1) Lazy mode can't be entered unless preempt is disabled 2) Flushes are needed in two known cases: kmap_atomic and vmalloc sync 3) kmap_atomic can only be used when preempt is disabled 4) vmalloc sync happens under protection of interrupts disabled Good logically. What can break this logic?
#1 is defined by us #2 is our currently believed complete list of flush scenarios #3 is impossible to change by design #4 seems very unlikely to change anytime soon Seeing #2 appears weak, let us elaborate: A) Lazy mode is used in a couple of controlled paths for user page table updates which require no immediately updated mapping; further, they are protected under spinlocks, thus never preempted B) Thus only kernel mapping updates require explicit flushes C) Any interrupt / fault during lazy mode can only use kmap_atomic or a set_pmd to sync a vmalloc region, thus proving my point by circular logic (for interrupt / fault cases). D) Or better, other kernel mapping changes (kmap, the pageattr code, boot_ioremap, vmap, vm86 mark_screen_rdonly) are not usable by interrupt / fault handlers, thus proving C by exclusion. So I'm fairly certain there are no further issues with interrupt handlers or faults, where update semantics are heavily constrained. What of the actual lazy mode code itself doing kernel remapping? Most of these are clearly bogus (pageattr, boot_ioremap, vm86 stuff) for the mm code to use inside a spinlock-protected lazy mode region; it does seem perfectly acceptable though for the mm code to use kmap or vmap (not kmap_atomic) internally somewhere in the pagetable code. We can exclude the trivial lazy mode regions (zeromap, unmap, and remap) easily by inspection. The PTE copy routine gets deep enough that not all paths are immediately obvious, though, so we should keep it in mind for bug checking. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
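The constraints Zach enumerates boil down to: in lazy mode, MMU updates queue instead of applying immediately, so any path that needs the mapping live at once (kmap_atomic, vmalloc fault fixup) must flush the queue. A standalone sketch of that batching model (illustrative names; the real implementations keep this state per-CPU with preemption disabled):

```c
#include <assert.h>

/* Sketch of lazy-MMU batching: deferred page table updates that
 * become visible only when the queue is flushed. */
static unsigned long pte_state[8];     /* stand-in for hardware PTEs */
static unsigned long pending_val[8];
static int pending_idx[8], npending, lazy_mode;

static void set_pte_sketch(int i, unsigned long v)
{
    if (lazy_mode) {
        pending_idx[npending] = i;     /* defer the update */
        pending_val[npending++] = v;
    } else {
        pte_state[i] = v;              /* apply immediately */
    }
}

static void flush_lazy_mmu(void)
{
    for (int k = 0; k < npending; k++)
        pte_state[pending_idx[k]] = pending_val[k];
    npending = 0;
}
```

A fault handler that touches a mapping whose update is still queued sees stale state until the flush runs — which is why where (and under what preemption rules) the flush is called matters so much in the thread above.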
Re: [PATCH] Fix preemptible lazy mode bug
On Thu, 2007-09-06 at 02:33 +1000, Rusty Russell wrote: > On Tue, 2007-09-04 at 14:42 +0100, Jeremy Fitzhardinge wrote: > > Rusty Russell wrote: > > > static inline void arch_flush_lazy_mmu_mode(void) > > > { > > > - PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_FLUSH); > > > + if (unlikely(__get_cpu_var(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU)) > > > + arch_leave_lazy_mmu_mode(); > > > } > > > > > > > This changes the semantics a bit; previously "flush" would flush > > anything pending but leave us in lazy mode. This just drops lazymode > > altogether? > > > > I guess if we assume that flushing is a rare event then its OK, but I > > think the name's a bit misleading. How does it differ from plain > > arch_leave_lazy_mmu_mode()? > > Whether it's likely or unlikely to be in lazy mode, basically. But > you're right, this should be folded, since we don't want to "leave" lazy > mode twice. > > === > Hoist per-cpu lazy_mode variable up into common code. > > VMI, Xen and lguest all track paravirt_lazy_mode individually, meaning > we hand a "magic" value for set_lazy_mode() to say "flush if currently > active". > > We can simplify the logic by hoisting this variable into common code. 
> > Signed-off-by: Rusty Russell <[EMAIL PROTECTED]> > > --- > arch/i386/kernel/vmi.c | 16 > arch/i386/xen/enlighten.c | 15 +++ > arch/i386/xen/multicalls.h |2 +- > arch/i386/xen/xen-ops.h |7 --- > drivers/lguest/lguest.c | 25 ++--- > include/asm-i386/paravirt.h | 29 - > 6 files changed, 30 insertions(+), 64 deletions(-) > > diff -r 072a0b3924fb arch/i386/kernel/vmi.c > --- a/arch/i386/kernel/vmi.c Thu Aug 30 04:47:43 2007 +1000 > +++ b/arch/i386/kernel/vmi.c Wed Sep 05 04:06:20 2007 +1000 > @@ -554,22 +554,14 @@ vmi_startup_ipi_hook(int phys_apicid, un > > static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode) > { > - static DEFINE_PER_CPU(enum paravirt_lazy_mode, lazy_mode); > - > if (!vmi_ops.set_lazy_mode) > return; > > /* Modes should never nest or overlap */ > - BUG_ON(__get_cpu_var(lazy_mode) && !(mode == PARAVIRT_LAZY_NONE || > - mode == PARAVIRT_LAZY_FLUSH)); > - > - if (mode == PARAVIRT_LAZY_FLUSH) { > - vmi_ops.set_lazy_mode(0); > - vmi_ops.set_lazy_mode(__get_cpu_var(lazy_mode)); > - } else { > - vmi_ops.set_lazy_mode(mode); > - __get_cpu_var(lazy_mode) = mode; > - } > + BUG_ON(get_lazy_mode() && !(mode == PARAVIRT_LAZY_NONE | > + | mode == PARAVIRT_LAZY_FLUSH)); That's a pretty strange line break. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix preemptible lazy mode bug
On Thu, 2007-09-06 at 02:33 +1000, Rusty Russell wrote: On Tue, 2007-09-04 at 14:42 +0100, Jeremy Fitzhardinge wrote: Rusty Russell wrote: static inline void arch_flush_lazy_mmu_mode(void) { - PVOP_VCALL1(set_lazy_mode, PARAVIRT_LAZY_FLUSH); + if (unlikely(__get_cpu_var(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU)) + arch_leave_lazy_mmu_mode(); } This changes the semantics a bit; previously flush would flush anything pending but leave us in lazy mode. This just drops lazymode altogether? I guess if we assume that flushing is a rare event then its OK, but I think the name's a bit misleading. How does it differ from plain arch_leave_lazy_mmu_mode()? Whether it's likely or unlikely to be in lazy mode, basically. But you're right, this should be folded, since we don't want to leave lazy mode twice. === Hoist per-cpu lazy_mode variable up into common code. VMI, Xen and lguest all track paravirt_lazy_mode individually, meaning we hand a magic value for set_lazy_mode() to say flush if currently active. We can simplify the logic by hoisting this variable into common code. 
Signed-off-by: Rusty Russell [EMAIL PROTECTED] --- arch/i386/kernel/vmi.c | 16 arch/i386/xen/enlighten.c | 15 +++ arch/i386/xen/multicalls.h |2 +- arch/i386/xen/xen-ops.h |7 --- drivers/lguest/lguest.c | 25 ++--- include/asm-i386/paravirt.h | 29 - 6 files changed, 30 insertions(+), 64 deletions(-) diff -r 072a0b3924fb arch/i386/kernel/vmi.c --- a/arch/i386/kernel/vmi.c Thu Aug 30 04:47:43 2007 +1000 +++ b/arch/i386/kernel/vmi.c Wed Sep 05 04:06:20 2007 +1000 @@ -554,22 +554,14 @@ vmi_startup_ipi_hook(int phys_apicid, un static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode) { - static DEFINE_PER_CPU(enum paravirt_lazy_mode, lazy_mode); - if (!vmi_ops.set_lazy_mode) return; /* Modes should never nest or overlap */ - BUG_ON(__get_cpu_var(lazy_mode) !(mode == PARAVIRT_LAZY_NONE || - mode == PARAVIRT_LAZY_FLUSH)); - - if (mode == PARAVIRT_LAZY_FLUSH) { - vmi_ops.set_lazy_mode(0); - vmi_ops.set_lazy_mode(__get_cpu_var(lazy_mode)); - } else { - vmi_ops.set_lazy_mode(mode); - __get_cpu_var(lazy_mode) = mode; - } + BUG_ON(get_lazy_mode() !(mode == PARAVIRT_LAZY_NONE | + | mode == PARAVIRT_LAZY_FLUSH)); That's a pretty strange line break. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix preemptible lazy mode bug
On Thu, 2007-09-06 at 06:37 +1000, Rusty Russell wrote:

On Thu, 2007-08-23 at 22:46 -0700, Zachary Amsden wrote: I recently sent off a fix for lazy vmalloc faults which can happen under paravirt when lazy mode is enabled. Unfortunately, I jumped the gun a bit on fixing this. I neglected to notice that since the new call to flush the MMU update queue is called from the page fault handler, it can be pre-empted. Both VMI and Xen use per-cpu variables to track lazy mode state, as all previous calls to set, disable, or flush lazy mode happened from a non-preemptable state.

Hi Zach,

I don't think this patch does anything. The flush is because we want the just-completed set_pte to have immediate effect, so if preempt is enabled we're already screwed because we can be moved between set_pte and the arch_flush_lazy_mmu_mode() call. Now, where's the problem caller? By my reading of rc4, vmalloc faults are fixed up before enabling interrupts.

I agree. The patch is a nop. I just got overly paranoid. The whole thing is just very prone to bugs. So let's go over it carefully:

1) Lazy mode can't be entered unless preempt is disabled
2) Flushes are needed in two known cases: kmap_atomic and vmalloc sync
3) kmap_atomic can only be used when preempt is disabled
4) vmalloc sync happens under protection of interrupts disabled

Good logically. What can break this logic?
#1 is defined by us. #2 is our currently believed complete list of flush scenarios. #3 is impossible to change by design. #4 seems very unlikely to change anytime soon.

Seeing #2 appears weak, let us elaborate:

A) Lazy mode is used in a couple of controlled paths for user page table updates which require no immediately updated mapping; further, they are protected under spinlocks, thus never preempted.
B) Thus only kernel mapping updates require explicit flushes.
C) Any interrupt / fault during lazy mode can only use kmap_atomic or a set_pmd to sync a vmalloc region, thus proving my point by circular logic (for interrupt / fault cases).
D) Or better: other kernel mapping changes (kmap, the pageattr code, boot_ioremap, vmap, vm86 mark_screen_rdonly) are not usable by interrupt / fault handlers, thus proving C by exclusion.

So I'm fairly certain there are no further issues with interrupt handlers or faults, where update semantics are heavily constrained. What of the actual lazy mode code itself doing kernel remapping? Most of these are clearly bogus (pageattr, boot_ioremap, vm86 stuff) for the mm code to use inside a spinlock-protected lazy mode region; it does seem perfectly acceptable, though, for the mm code to use kmap or vmap (not kmap_atomic) internally somewhere in the pagetable code. We can exclude the trivial lazy mode regions (zeromap, unmap, and remap) easily by inspection. The PTE copy routine gets deep enough that not all paths are immediately obvious, though, so we should keep it in mind for bug checking.

Zach
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
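The rules above (modes never nest; a flush only does work when lazy mode is actually active) can be modeled in plain userspace C. This is an illustrative sketch only: the names and the flush counter are stand-ins, not the kernel's code.

```c
#include <assert.h>

/* Userspace model of the lazy-mode rules discussed above. */
enum lazy_mode { LAZY_NONE, LAZY_MMU };

static enum lazy_mode lazy_state = LAZY_NONE; /* stands in for a per-cpu var */
static int flushes;                           /* update-queue flushes issued */

static void enter_lazy_mmu(void)
{
	assert(lazy_state == LAZY_NONE); /* modes should never nest or overlap */
	lazy_state = LAZY_MMU;
}

static void leave_lazy_mmu(void)
{
	assert(lazy_state == LAZY_MMU);
	flushes++;                       /* leaving always flushes the queue */
	lazy_state = LAZY_NONE;
}

/* What arch_flush_lazy_mmu_mode() reduces to once the magic FLUSH value
 * is dropped: leave lazy MMU mode iff we are currently in it. */
static void flush_lazy_mmu(void)
{
	if (lazy_state == LAZY_MMU)
		leave_lazy_mmu();
}

/* Simulate a vmalloc-sync fault fixup happening inside (or outside) a
 * lazy region; returns how many flushes the fixup caused. */
int flushes_during_fault(int in_lazy_region)
{
	lazy_state = LAZY_NONE;
	flushes = 0;
	if (in_lazy_region)
		enter_lazy_mmu();
	flush_lazy_mmu();                /* the set_pmd fixup path */
	return flushes;
}
```

A fixup inside a lazy region causes exactly one flush; outside one, it is a no-op, which is why the flush hook can be left in the fault path unconditionally.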
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Benjamin Herrenschmidt wrote: On Wed, 2007-08-22 at 16:25 +1000, Rusty Russell wrote: On Wed, 2007-08-22 at 08:34 +0300, Avi Kivity wrote: Zachary Amsden wrote: This patch provides hypercalls for the i386 port I/O instructions, which vastly helps guests which use native-style drivers. For certain VMI workloads, this provides a performance boost of up to 30%. We expect KVM and lguest to be able to achieve similar gains on I/O intensive workloads. Won't these workloads be better off using paravirtualized drivers? i.e., do the native drivers with paravirt I/O instructions get anywhere near the performance of paravirt drivers? This patch also means I can kill off the emulation code in drivers/lguest/core.c, which is a real relief. Hrm... how do you deal with X doing IOs ? Ben. We have an X driver that does minimal performance costing operations. As we should and will have for our other drivers. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [4/4] 2.6.23-rc3: known regressions v3
Parag Warudkar wrote: On 8/24/07, Michal Piotrowski <[EMAIL PROTECTED]> wrote: Virtualization Subject : CONFIG_VMI broken References : http://lkml.org/lkml/2007/8/14/203 Last known good : ? Submitter : Parag Warudkar <[EMAIL PROTECTED]> Caused-By : ? Handled-By : Zachary Amsden <[EMAIL PROTECTED]> Status : problem is being debugged Zach seemed to think that this is already fixed - I am not in a position to test it immediately so if we know what fixed this - can be closed. I'll report back once I get a chance to test latest git. Parag, thanks. I reproduced this bug with your kernel config on 2.6.23-rc3 and verified it does not happen on latest git. I inspected memory after the crash and determined the problem was patching of instructions went awry. Chris in the meantime fixed a bug with patching instructions, and the change from 100% apocalyptic failure to 100% unequivocal success has convinced me that was the same bug. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix preemptible lazy mode bug
Jeremy Fitzhardinge wrote:

Hm. Doing any kind of lazy-state operation with preemption enabled is fundamentally meaningless. How does it get into a preemptable state

Agree 100%. It is the lazy mode flush that might happen when preempt is enabled, but lazy mode is disabled. In that case, the code relies on per-cpu variables, which is a bad thing to do in preemptible code. This can happen in the current code path.

Thinking slightly deeper about it, it might be the case that there is no bug, because the local lazy mode variables are only _modified_ in the preemptible state, and guaranteed to be zero in the non-preemptible state; but it was not clear to me that this is always the case, and I got very nervous about reading per-cpu variables with preempt enabled. It would, in any case, fire a BUG_ON in the Xen code, which I did fix.

Do you agree it is better to be safe than sorry in this case? The kind of bugs introduced by getting this wrong are really hard to find, and I would rather err on the side of an extra increment and decrement of preempt_count than cause a regression.

Zach
- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH] Fix preemptible lazy mode bug
I recently sent off a fix for lazy vmalloc faults which can happen under paravirt when lazy mode is enabled. Unfortunately, I jumped the gun a bit on fixing this. I neglected to notice that since the new call to flush the MMU update queue is called from the page fault handler, it can be pre-empted. Both VMI and Xen use per-cpu variables to track lazy mode state, as all previous calls to set, disable, or flush lazy mode happened from a non-preemptable state. I have no idea how to convincingly produce the problem, as generating a kernel pre-emption at the required point is, um, difficult, but it is most certainly a real possibility, and potentially more likely than the bug I fixed originally. Rusty, you may have to modify lguest code if you use lazy mode and rely on per-cpu variables during the callout for paravirt_ops.set_lazy_mode. I have tested as best as I can, and am trying to write a suite destined for LTP which will help catch and debug these issues. Zach Since set_lazy_mode(LAZY_MODE_FLUSH) is now called from the page fault handler, it can potentially happen in a preemptible state. We therefore must make all lazy mode paravirt-ops handlers non-preemptible. 
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED](none)>
---
 arch/i386/kernel/vmi.c    |   14 ++
 arch/i386/xen/enlighten.c |    4 +++-
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 18673e0..9e669cb 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -555,21 +555,27 @@ vmi_startup_ipi_hook(int phys_apicid, unsigned long start_eip,
 static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
 	static DEFINE_PER_CPU(enum paravirt_lazy_mode, lazy_mode);
+	int cpu;
+	enum paravirt_lazy_mode cur_mode;
 
 	if (!vmi_ops.set_lazy_mode)
 		return;
 
+	cpu = get_cpu();
+	cur_mode = per_cpu(lazy_mode, cpu);
+
 	/* Modes should never nest or overlap */
-	BUG_ON(__get_cpu_var(lazy_mode) && !(mode == PARAVIRT_LAZY_NONE ||
-					     mode == PARAVIRT_LAZY_FLUSH));
+	BUG_ON(cur_mode && !(mode == PARAVIRT_LAZY_NONE ||
+			     mode == PARAVIRT_LAZY_FLUSH));
 
 	if (mode == PARAVIRT_LAZY_FLUSH) {
 		vmi_ops.set_lazy_mode(0);
-		vmi_ops.set_lazy_mode(__get_cpu_var(lazy_mode));
+		vmi_ops.set_lazy_mode(cur_mode);
 	} else {
 		vmi_ops.set_lazy_mode(mode);
-		__get_cpu_var(lazy_mode) = mode;
+		per_cpu(lazy_mode, cpu) = mode;
 	}
+	put_cpu();
 }
 
 static inline int __init check_vmi_rom(struct vrom_header *rom)
diff --git a/arch/i386/xen/enlighten.c b/arch/i386/xen/enlighten.c
index f0c3751..2dafb8a 100644
--- a/arch/i386/xen/enlighten.c
+++ b/arch/i386/xen/enlighten.c
@@ -251,7 +251,7 @@ static void xen_halt(void)
 static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 {
-	BUG_ON(preemptible());
+	get_cpu();
 
 	switch (mode) {
 	case PARAVIRT_LAZY_NONE:
@@ -267,11 +267,13 @@ static void xen_set_lazy_mode(enum paravirt_lazy_mode mode)
 		/* flush if necessary, but don't change state */
 		if (x86_read_percpu(xen_lazy_mode) != PARAVIRT_LAZY_NONE)
 			xen_mc_flush();
+		put_cpu();
 		return;
 	}
 
 	xen_mc_flush();
 	x86_write_percpu(xen_lazy_mode, mode);
+	put_cpu();
 }
 
 static unsigned long xen_store_tr(void)
-- 
1.4.4.4
Re: 2.6.23-rc3 - CONFIG_VMI broken
Parag Warudkar wrote: CONFIG_VMI seems to be broken, but I am not sure when - the last kernel I was running was 2.6.22-rc4 which used to boot fine and use VMI. Current git with same configuration causes the kernel to reboot early. Logs below. Deselecting CONFIG_VMI and rebuilding allows the kernel to boot normally. (I am running it on VMWare workstation 6 latest release.) Thanks Parag Linux version 2.6.23-rc3 ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #4 Mon Aug 13 21:50:06 EDT 2007 BIOS-provided physical RAM map: BIOS-e820: - 0009f800 (usable) BIOS-e820: 0009f800 - 000a (reserved) BIOS-e820: 000ca000 - 000cc000 (reserved) BIOS-e820: 000dc000 - 0010 (reserved) BIOS-e820: 0010 - 1fef (usable) BIOS-e820: 1fef - 1feff000 (ACPI data) BIOS-e820: 1feff000 - 1ff0 (ACPI NVS) BIOS-e820: 1ff0 - 2000 (usable) BIOS-e820: fec0 - fec1 (reserved) BIOS-e820: fee0 - fee01000 (reserved) BIOS-e820: fffe - 0001 (reserved) console [earlyser0] enabled 0MB HIGHMEM available. 512MB LOWMEM available. found SMP MP-table at 000f6c90 VMI: Found VMware, Inc. Hypervisor OPROM, API version 3.0, ROM version 1.0 Reserving virtual address space above 0xfc00 Int 14: CR2 fc37e260 err EIP fc37e260 CS 0062 flags 00010006 Stack: c0490ec7 c04913cb 0001 fc001340 c047fff4 c04a6080 c047aa00 I reproduced this, slightly different EIP, but I notice that all paravirt-ops function calls are to bogus addresses; the first byte appears correct (0xfc00 is in the VMI ROM range), and the extracted function addresses in the paravirt_ops struct are correct. However, it looks like the patching of the call instructions went wrong. Does this sound familiar to anyone? In any case, I think the bug is already fixed. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: We might benefit from it, but would the BusLogic driver? It sets a nasty precedent for maintenance as different hypervisors and emulators hack up different drivers for their own performance. I still think it's preferable to change some drivers than everybody. AFAIK BusLogic as real hardware is pretty much dead anyways, so you're probably the only primary user of it anyways. Go wild on it! It is looking juicy. Maybe another day. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Alan Cox wrote: I still think it's preferable to change some drivers than everybody. AFAIK BusLogic as real hardware is pretty much dead anyways, so you're probably the only primary user of it anyways. Go wild on it! I don't believe anyone is materially maintaining the buslogic driver and in time its going to break completely. I think I was actually the last person to touch it ;) Well that might be. I just think it would be a mistake to design paravirt_ops based on someone's short term release engineering considerations. Agreed, especially as an interface where each in or out traps into the hypervisor is broken even for the model of virtualising hardware. Well, it's not necessarily broken, it's just a different model. At some point the cost of maintaining a whole suite of virtual drivers becomes greater than leveraging a bunch of legacy drivers. If you can eliminate most of the performance cost of that by changing something at a layer below (port I/O), it is a win even if it is not a perfect solution. But I think I've lost the argument anyways; it doesn't seem to be for the greater good of Linux, and there are alternatives we can take. Unfortunately for me, they require a lot more work. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote:

How is that measured? In a loop? In the same pipeline state? It seems a little dubious to me.

I did the experiments in a controlled environment, with interrupts disabled and care to get the pipeline in the same state. It was a perfectly repeatable experiment. I don't have exact cycle times anymore, but they were the tightest measurements I've ever seen on cycle counts because of the unique nature of serializing the processor for the fault / privilege transition. I tested a variety of different conditions, including different types of #GP (yes, the cost does vary), #NP, #PF, sysenter, int $0xxx. Sysenter was the fastest, by far. Int was about 5x the cost. #GP and friends were all about similar costs. #PF was the most expensive.

to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a

When the page is not executable or not present you get #PF not #GP. So the hardware already checks that. The only case where you would need to check yourself is if you emulate NX on non-NX-capable hardware, but I can't see you doing that.

No, it doesn't. Between the #GP and decode, you have an SMP race where another processor can rewrite the instruction.

That can be ignored imho. If the page goes away you'll notice when you handle the page fault on read. If it becomes NX then the execution just happened to be logically a little earlier.

No, you can't ignore it. The page protections won't change between the #GP and the decoder execution, but the instruction can, causing you to decode into the next page where the processor would not have. !P becomes obvious, but failure to respect NX or U/S is an exploitable bug. Put a 1-byte instruction at the end of a page, crossing into an NX (or supervisor) page. Remotely, keep switching between the instruction and a segment override.
Result: user executes instruction on supervisor code page, learning data as a result of this; code on NX page gets executed.

Or easier to just write a backend for the lguest virtio drivers, that will likely be faster in the end anyways than this gross hack.

We already have drivers for all of our hardware in Linux. Most of the hardware we emulate is physical hardware, and there are no virtual drivers for it. Should we take the BusLogic driver and "paravirtualize" it by adding VMI hypercalls? We might benefit from it, but would the BusLogic driver? It sets a nasty precedent for maintenance as different hypervisors and emulators hack up different drivers for their own performance. Our SCSI and IDE emulation, and thus the drivers used by Linux, are pretty much fixed in stone; we are not going to go about changing a tricky hardware interface to a virtual one, it is simply too risky for something as critical as storage. We might be able to move our network driver over to virtio, but that is not a short-term prospect either. There is great advantage in talking to our existing device layer faster, and this is something that is valuable today.

Really LinuxHAL^wparavirt ops is already so complicated that any new hooks need an extremely good justification and that is just not here for this.

We can add it if you find an equivalent number of hooks to eliminate. Interesting trade. What if I sanitized the messy I/O macros into something fun and friendly:

	native_port_in(int port, iosize_t opsize, int delay)
	native_port_out(int port, iosize_t opsize, u32 output, int delay)
	native_port_string_in(int port, void *ptr, iosize_t opsize, unsigned count, int delay)
	native_port_string_out(int port, void *ptr, iosize_t opsize, unsigned count, int delay)

Then we can be rid of all the macro goo in io.h, which frightens my mother.
We might even be able to get rid of the umpteen different places where drivers wrap iospace access with their own byte / word / long functions so they can switch between port I/O and memory mapped I/O by moving it all into common infrastructure. We could make similar (unwelcome?) advances on the pte functions if it were not for the regrettable disconnect between pte_high / pte_low and the rest. Perhaps if it was hidden in macros? Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
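The proposed native_port_* accessors can be sketched in userspace against a mock port space. Everything below is an assumption-laden illustration (the array, the wrapper names, and the little-endian host are all part of the sketch); a real implementation would emit in/out instructions or a hypercall instead.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

typedef enum { IO_BYTE = 1, IO_WORD = 2, IO_LONG = 4 } iosize_t;

/* Mock 64K port space so the sketch runs in userspace. */
static uint8_t port_space[65536];

void native_port_out(int port, iosize_t opsize, uint32_t output, int delay)
{
	memcpy(&port_space[port], &output, opsize); /* little-endian host assumed */
	(void)delay; /* a real version would add the classic I/O delay here */
}

uint32_t native_port_in(int port, iosize_t opsize, int delay)
{
	uint32_t v = 0;
	memcpy(&v, &port_space[port], opsize);
	(void)delay;
	return v;
}

/* The byte/word/long macro zoo collapses into thin wrappers: */
static inline void outb(uint8_t v, int port)  { native_port_out(port, IO_BYTE, v, 0); }
static inline uint8_t inb(int port)           { return (uint8_t)native_port_in(port, IO_BYTE, 0); }
static inline void outl(uint32_t v, int port) { native_port_out(port, IO_LONG, v, 0); }
static inline uint32_t inl(int port)          { return native_port_in(port, IO_LONG, 0); }
```

One pair of size-parameterized accessors is the single point a hypervisor would need to hook, instead of patching six macros and every driver's private byte/word/long wrappers.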
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote:

On Wed, Aug 22, 2007 at 09:48:25AM -0700, Zachary Amsden wrote:

Andi Kleen wrote:

On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted.

Is that really that expensive? Hard to imagine.

You have an expensive (16x cost of hypercall on some processors)

Where is the difference coming from? Are you using SYSENTER for the hypercall? I can't really see you using SYSENTER, because how would you do system calls then? I bet system calls are more frequent than in/out, so if you have to decide between the two, using them for syscalls is likely faster.

We use sysenter for hypercalls and also for system calls. :)

Also I fail to see the fundamental speed difference between

	mov index, register
	int 0x...
	...
	switch (register)
	case ...: do emulation

Int (on P4) == ~680 cycles. Versus:

	out ...
	#gp -> switch (*eip) {
	case 0xee: /* etc. */
		do emulation

GP = ~2000 cycles.

to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a

When the page is not executable or not present you get #PF not #GP. So the hardware already checks that. The only case where you would need to check yourself is if you emulate NX on non-NX-capable hardware, but I can't see you doing that.

No, it doesn't. Between the #GP and decode, you have an SMP race where another processor can rewrite the instruction.
Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Is that really that expensive? Hard to imagine. You have an expensive (16x cost of hypercall on some processors) privilege transition, you have to decode the instruction, then you have to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a hypercall. e.g. you could always have a fast check for inb/outb at the beginning of the #GP handler. And is your initial #GP entry really more expensive than a hypercall? The number of reasons for #GP is enormous, and there are too many paths to optimize with fast checks. We do have a fast check for inb/outb; it's just not fast enough. On P4, hypercall entry is 120 cycles. #GP is about 2000. Modern processors are better, but a hypercall is always faster than a fault. Many times, the hypercall can be handled and ready to return before a #GP would even complete. On workloads that sit there and hammer network cards, these costs become significant, and latency sensitive network benchmarks suffer. Worse, even with hardware assist such as VT, the exit reason alone is not sufficient to determine the true nature of the faulting instruction, requiring a complex and costly instruction decode and simulation. It's unclear to me why that should be that costly. 
Worst case it's a switch() There are 24 different possible I/O operations; sometimes with a port encoded in the instruction, sometimes with input in the DX register, sometimes with a rep prefix, and for 3 different operand sizes. Combine that with the MMU checks required, and it's complex and branchy enough to justify short-circuiting the whole thing with a simple hypercall. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
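The decode burden described above can be sketched as follows. This is a hedged illustration of the opcode space involved (0xE4-0xE7 and 0xEC-0xEF for in/out, 0x6C-0x6F for the string forms, plus the 0x66 operand-size and 0xF3 rep prefixes), not any hypervisor's actual code; the struct and function names are invented for the sketch.

```c
#include <stdint.h>
#include <assert.h>

struct io_op {
	int in;       /* 1 = in/ins, 0 = out/outs */
	int string;   /* ins/outs (possibly rep-prefixed) */
	int size;     /* operand size in bytes: 1, 2 or 4 */
	int port_imm; /* imm8 port, or -1 when the port is in DX */
};

/* Classify a port I/O instruction (32-bit mode assumed) from its opcode
 * bytes.  Returns 1 and fills *op on success, 0 if the bytes are not a
 * port I/O instruction -- a sketch of what a #GP emulation path must do. */
int decode_port_io(const uint8_t *insn, struct io_op *op)
{
	int osize = 4; /* default operand size in 32-bit code */

	op->string = 0;
	for (;; insn++) {
		if (*insn == 0x66)
			osize = 2;        /* operand-size prefix */
		else if (*insn == 0xF3)
			op->string = 1;   /* rep prefix */
		else
			break;
	}

	/* In all three groups, bit 1 of the opcode distinguishes out from
	 * in, and bit 0 selects byte vs. word/long operand size. */
	switch (*insn) {
	case 0xE4: case 0xE5: case 0xE6: case 0xE7: /* port in imm8 */
		op->in = !(*insn & 2);
		op->size = (*insn & 1) ? osize : 1;
		op->port_imm = insn[1];
		return 1;
	case 0xEC: case 0xED: case 0xEE: case 0xEF: /* port in DX */
		op->in = !(*insn & 2);
		op->size = (*insn & 1) ? osize : 1;
		op->port_imm = -1;
		return 1;
	case 0x6C: case 0x6D: case 0x6E: case 0x6F: /* ins/outs, port in DX */
		op->in = !(*insn & 2);
		op->string = 1;
		op->size = (*insn & 1) ? osize : 1;
		op->port_imm = -1;
		return 1;
	}
	return 0;
}
```

Counting in/out against imm8 or DX at three sizes, plus the string forms, lands in the neighborhood of the two dozen variants mentioned above; a single hypercall carrying port, size, and direction sidesteps the whole decode.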
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Avi Kivity wrote: Since this is only for newer kernels, won't updating the driver to use a hypercall be more efficient? Or is this for existing out-of-tree drivers? Actually, it is for in-tree drivers that we emulate but don't want to pollute, and one out of tree driver (that will hopefully be in tree soon!) that has no way to determine if making hypercalls is acceptable. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix lazy mode vmalloc synchronization for paravirt
Jeremy Fitzhardinge wrote: No, under Xen the kernel/hypervisor PMD is not shared between processes, so this is still used when PAE is enabled. Ahh, yes. So this was a lucky catch for us. Non-PAE kernels seem to be increasing in value at antique sales. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Add I/O hypercalls for i386 paravirt
H. Peter Anvin wrote: Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Worse, even with hardware assist such as VT, the exit reason alone is not sufficient to determine the true nature of the faulting instruction, requiring a complex and costly instruction decode and simulation. This patch provides hypercalls for the i386 port I/O instructions, which vastly helps guests which use native-style drivers. For certain VMI workloads, this provides a performance boost of up to 30%. We expect KVM and lguest to be able to achieve similar gains on I/O intensive workloads. What about cost on hardware? On modern hardware, port I/O is about the most expensive thing you can do. The extra function call cost is totally masked by the stall. We have measured with port I/O converted like this on real hardware, and have seen zero measurable impact on macro-benchmarks. Micro-benchmarks that generate massively repeated port I/O might show some effect on ancient hardware, but I can't even imagine a workload which does such a thing, other than a polling port I/O loop perhaps - which would not be performance critical in any case I can reasonably imagine. Zach - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix lazy mode vmalloc synchronization for paravirt
Jeremy Fitzhardinge wrote: No, under Xen the kernel/hypervisor PMD is not shared between processes, so this is still used when PAE is enabled. Ahh, yes. So this was a lucky catch for us. Non-PAE kernels seem to be increasing in value at antique sales. Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Avi Kivity wrote: Since this is only for newer kernels, won't updating the driver to use a hypercall be more efficient? Or is this for existing out-of-tree drivers? Actually, it is for in-tree drivers that we emulate but don't want to pollute, and one out of tree driver (that will hopefully be in tree soon!) that has no way to determine if making hypercalls is acceptable. Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Is that really that expensive? Hard to imagine. You have an expensive (16x cost of hypercall on some processors) privilege transition, you have to decode the instruction, then you have to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a hypercall. e.g. you could always have a fast check for inb/outb at the beginning of the #GP handler. And is your initial #GP entry really more expensive than a hypercall? The number of reasons for #GP is enormous, and there are too many paths to optimize with fast checks. We do have a fast check for inb/outb; it's just not fast enough. On P4, hypercall entry is 120 cycles. #GP is about 2000. Modern processors are better, but a hypercall is always faster than a fault. Many times, the hypercall can be handled and ready to return before a #GP would even complete. On workloads that sit there and hammer network cards, these costs become significant, and latency sensitive network benchmarks suffer. Worse, even with hardware assist such as VT, the exit reason alone is not sufficient to determine the true nature of the faulting instruction, requiring a complex and costly instruction decode and simulation. It's unclear to me why that should be that costly. 
Worst case it's a switch() There are 24 different possible I/O operations; sometimes with a port encoded in the instruction, sometimes with input in the DX register, sometimes with a rep prefix, and for 3 different operand sizes. Combine that with the MMU checks required, and it's complex and branchy enough to justify short-circuiting the whole thing with a simple hypercall. Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: On Wed, Aug 22, 2007 at 09:48:25AM -0700, Zachary Amsden wrote: Andi Kleen wrote: On Tue, Aug 21, 2007 at 10:23:14PM -0700, Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Is that really that expensive? Hard to imagine. You have an expensive (16x cost of hypercall on some processors) Where is the difference coming from? Are you using SYSENTER for the hypercall? I can't really see you using SYSENTER, because how would you do system calls then? I bet system calls are more frequent than in/out, so if you have to decide between the two, using them for syscalls is likely faster. We use sysenter for hypercalls and also for system calls. :) Also I fail to see the fundamental speed difference between mov index,register int 0x... ... switch (register) case : do emulation Int (on p4 == ~680 cycles). versus out ... #gp - switch (*eip) { case 0xee: /* etc. */ do emulation GP = ~2000 cycles. to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a When the page is not executable or not present you get #PF not #GP. So the hardware already checks that. The only case where you would need to check yourself is if you emulate NX on non NX capable hardware, but I can't see you doing that. No, it doesn't. Between the #GP and decode, you have an SMP race where another processor can rewrite the instruction. 
Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: How is that measured? In a loop? In the same pipeline state? It seems a little dubious to me. I did the experiments in a controlled environment, with interrupts disabled and care to get the pipeline in the same state. It was a perfectly repeatable experiment. I don't have the exact cycle times anymore, but they were the tightest measurements I've ever seen on cycle counts because of the unique nature of serializing the processor for the fault / privilege transition. I tested a variety of different conditions, including different types of #GP (yes, the cost does vary), #NP, #PF, sysenter, int $0xxx. Sysenter was the fastest, by far. Int was about 5x the cost. #GP and friends were all about similar costs. #PF was the most expensive. to verify protection in the page tables mapping the page allows execution (P, !NX, and U/S check). This is a lot more expensive than a When the page is not executable or not present you get #PF not #GP. So the hardware already checks that. The only case where you would need to check yourself is if you emulate NX on non NX capable hardware, but I can't see you doing that. No, it doesn't. Between the #GP and decode, you have an SMP race where another processor can rewrite the instruction. That can be ignored imho. If the page goes away you'll notice when you handle the page fault on read. If it becomes NX then the execution just happened to be logically a little earlier. No, you can't ignore it. The page protections won't change between the GP and the decoder execution, but the instruction can, causing you to decode into the next page where the processor would not have. !P becomes obvious, but failure to respect NX or U/S is an exploitable bug. Put a 1-byte instruction at the end of a page crossing into an NX (or supervisor) page. Remotely, keep switching between the instruction and a segment override. 
Result: user executes instruction on supervisor code page, learning data as a result of this; code on NX page gets executed. Or easier to just write a backend for the lguest virtio drivers, that will be likely faster in the end anyways than this gross hack. We already have drivers for all of our hardware in Linux. Most of the hardware we emulate is physical hardware, and there are no virtual drivers for it. Should we take the BusLogic driver and paravirtualize it by adding VMI hypercalls? We might benefit from it, but would the BusLogic driver? It sets a nasty precedent for maintenance as different hypervisors and emulators hack up different drivers for their own performance. Our SCSI and IDE emulation and thus the drivers used by Linux are pretty much fixed in stone; we are not going to go about changing a tricky hardware interface to a virtual one, it is simply too risky for something as critical as storage. We might be able to move our network driver over to virtio, but that is not a short-term prospect either. There is great advantage in talking to our existing device layer faster, and this is something that is valuable today. Really LinuxHAL^wparavirt ops is already so complicated that any new hooks need an extremely good justification and that is just not here for this. We can add it if you find an equivalent number of hooks to eliminate. Interesting trade. What if I sanitized the whole mess of I/O macros into something fun and friendly: native_port_in(int port, iosize_t opsize, int delay) native_port_out(int port, iosize_t opsize, u32 output, int delay) native_port_string_in(int port, void *ptr, iosize_t opsize, unsigned count, int delay) native_port_string_out(int port, void *ptr, iosize_t opsize, unsigned count, int delay) Then we can be rid of all the macro goo in io.h, which frightens my mother. 
We might even be able to get rid of the umpteen different places where drivers wrap iospace access with their own byte / word / long functions so they can switch between port I/O and memory mapped I/O by moving it all into common infrastructure. We could make similar (unwelcome?) advances on the pte functions if it were not for the regrettable disconnect between pte_high / pte_low and the rest. Perhaps if it was hidden in macros? Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Alan Cox wrote: I still think it's preferable to change some drivers than everybody. AFAIK BusLogic as real hardware is pretty much dead anyways, so you're probably the only primary user of it anyways. Go wild on it! I don't believe anyone is materially maintaining the buslogic driver and in time it's going to break completely. I think I was actually the last person to touch it ;) Well that might be. I just think it would be a mistake to design paravirt_ops based on someone's short term release engineering considerations. Agreed, especially as an interface where each in or out traps into the hypervisor is broken even for the model of virtualising hardware. Well, it's not necessarily broken, it's just a different model. At some point the cost of maintaining a whole suite of virtual drivers becomes greater than leveraging a bunch of legacy drivers. If you can eliminate most of the performance cost of that by changing something at a layer below (port I/O), it is a win even if it is not a perfect solution. But I think I've lost the argument anyways; it doesn't seem to be for the greater good of Linux, and there are alternatives we can take. Unfortunately for me, they require a lot more work. Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Andi Kleen wrote: We might benefit from it, but would the BusLogic driver? It sets a nasty precedent for maintenance as different hypervisors and emulators hack up different drivers for their own performance. I still think it's preferable to change some drivers than everybody. AFAIK BusLogic as real hardware is pretty much dead anyways, so you're probably the only primary user of it anyways. Go wild on it! It is looking juicy. Maybe another day. Zach
Re: 2.6.23-rc3 - CONFIG_VMI broken
Parag Warudkar wrote: CONFIG_VMI seems to be broken, but I am not sure when - the last kernel I was running was 2.6.22-rc4 which used to boot fine and use VMI. Current git with same configuration causes the kernel to reboot early. Logs below. Deselecting CONFIG_VMI and rebuilding allows the kernel to boot normally. (I am running it on VMWare workstation 6 latest release.) Thanks Parag Linux version 2.6.23-rc3 ([EMAIL PROTECTED]) (gcc version 4.1.2 (Ubuntu 4.1.2-0ubuntu4)) #4 Mon Aug 13 21:50:06 EDT 2007 BIOS-provided physical RAM map: BIOS-e820: - 0009f800 (usable) BIOS-e820: 0009f800 - 000a (reserved) BIOS-e820: 000ca000 - 000cc000 (reserved) BIOS-e820: 000dc000 - 0010 (reserved) BIOS-e820: 0010 - 1fef (usable) BIOS-e820: 1fef - 1feff000 (ACPI data) BIOS-e820: 1feff000 - 1ff0 (ACPI NVS) BIOS-e820: 1ff0 - 2000 (usable) BIOS-e820: fec0 - fec1 (reserved) BIOS-e820: fee0 - fee01000 (reserved) BIOS-e820: fffe - 0001 (reserved) console [earlyser0] enabled 0MB HIGHMEM available. 512MB LOWMEM available. found SMP MP-table at 000f6c90 VMI: Found VMware, Inc. Hypervisor OPROM, API version 3.0, ROM version 1.0 Reserving virtual address space above 0xfc00 Int 14: CR2 fc37e260 err EIP fc37e260 CS 0062 flags 00010006 Stack: c0490ec7 c04913cb 0001 fc001340 c047fff4 c04a6080 c047aa00 I reproduced this, slightly different EIP, but I notice that all paravirt-ops function calls are to bogus addresses; the first byte appears correct (0xfc00 is in the VMI ROM range), and the extracted function addresses in the paravirt_ops struct are correct. However, it looks like the patching of the call instructions went wrong. Does this sound familiar to anyone? In any case, I think the bug is already fixed. Zach
Re: [PATCH] Add I/O hypercalls for i386 paravirt
Avi Kivity wrote: Zachary Amsden wrote: In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Worse, even with hardware assist such as VT, the exit reason alone is not sufficient to determine the true nature of the faulting instruction, requiring a complex and costly instruction decode and simulation. This patch provides hypercalls for the i386 port I/O instructions, which vastly helps guests which use native-style drivers. For certain VMI workloads, this provides a performance boost of up to 30%. We expect KVM and lguest to be able to achieve similar gains on I/O intensive workloads. Won't these workloads be better off using paravirtualized drivers? i.e., do the native drivers with paravirt I/O instructions get anywhere near the performance of paravirt drivers? Yes, in general, this is true (better off with paravirt drivers). However, we have "paravirt" drivers which run in both fully-paravirtualized and fully traditionally virtualized environments. As a result, they use native port I/O operations to interact with virtual hardware. Since not all hypervisors have paravirtualized driver infrastructures and guest O/S support yet, these hypercalls can be advantageous in a wide range of scenarios. Using I/O hypercalls as such gives exactly the same performance as paravirt drivers for us, by eliminating the costly decode path, and the simplicity of using the same driver code makes this a huge win in code complexity. 
Zach
[PATCH] Add I/O hypercalls for i386 paravirt
In general, I/O in a virtual guest is subject to performance problems. The I/O can not be completed physically, but must be virtualized. This means trapping and decoding port I/O instructions from the guest OS. Not only is the trap for a #GP heavyweight, both in the processor and the hypervisor (which usually has a complex #GP path), but this forces the hypervisor to decode the individual instruction which has faulted. Worse, even with hardware assist such as VT, the exit reason alone is not sufficient to determine the true nature of the faulting instruction, requiring a complex and costly instruction decode and simulation. This patch provides hypercalls for the i386 port I/O instructions, which vastly helps guests which use native-style drivers. For certain VMI workloads, this provides a performance boost of up to 30%. We expect KVM and lguest to be able to achieve similar gains on I/O intensive workloads. This patch is against 2.6.23-rc2-mm2, and should be targeted for 2.6.24. Zach Virtualized guests in general benefit from having I/O hypercalls. This patch adds support for port I/O hypercalls to VMI and provides the infrastructure for other backends to make use of this feature. 
Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff --git a/arch/i386/kernel/paravirt.c b/arch/i386/kernel/paravirt.c
index ea962c0..4d0d150 100644
--- a/arch/i386/kernel/paravirt.c
+++ b/arch/i386/kernel/paravirt.c
@@ -329,6 +329,18 @@ struct paravirt_ops paravirt_ops = {
 	.set_iopl_mask = native_set_iopl_mask,
 	.io_delay = native_io_delay,
+	.outb = native_outb,
+	.outw = native_outw,
+	.outl = native_outl,
+	.inb = native_inb,
+	.inw = native_inw,
+	.inl = native_inl,
+	.outsb = native_outsb,
+	.outsw = native_outsw,
+	.outsl = native_outsl,
+	.insb = native_insb,
+	.insw = native_insw,
+	.insl = native_insl,
 #ifdef CONFIG_X86_LOCAL_APIC
 	.apic_write = native_apic_write,

diff --git a/arch/i386/kernel/vmi.c b/arch/i386/kernel/vmi.c
index 44feb34..5ecd85b 100644
--- a/arch/i386/kernel/vmi.c
+++ b/arch/i386/kernel/vmi.c
@@ -56,6 +56,7 @@ static int disable_tsc;
 static int disable_mtrr;
 static int disable_noidle;
 static int disable_vmi_timer;
+static int disable_io_ops;
 
 /* Cached VMI operations */
 static struct {
@@ -72,6 +73,18 @@ static struct {
 	void (*set_initial_ap_state)(int, int);
 	void (*halt)(void);
 	void (*set_lazy_mode)(int mode);
+	void (*outb)(u8 value, u16 port);
+	void (*outw)(u16 value, u16 port);
+	void (*outl)(u32 value, u16 port);
+	u8 (*inb)(u16 port);
+	u16 (*inw)(u16 port);
+	u32 (*inl)(u16 port);
+	void (*outsb)(const void *addr, u16 port, u32 count);
+	void (*outsw)(const void *addr, u16 port, u32 count);
+	void (*outsl)(const void *addr, u16 port, u32 count);
+	void (*insb)(void *addr, u16 port, u32 count);
+	void (*insw)(void *addr, u16 port, u32 count);
+	void (*insl)(void *addr, u16 port, u32 count);
 } vmi_ops;
 
 /* Cached VMI operations */
@@ -565,6 +578,33 @@ static void vmi_set_lazy_mode(enum paravirt_lazy_mode mode)
 	}
 }
 
+#define BUILDIO(bwl,type) \
+static void vmi_out##bwl(type value, int port) { \
+	__asm__ __volatile__("call *%0" : : \
+		"r"(vmi_ops.out##bwl), "a"(value), "d"(port)); \
+} \
+static type vmi_in##bwl(int port) { \
+	type value; \
+	__asm__ __volatile__("call *%1" : \
+		"=a"(value) : \
+		"r"(vmi_ops.in##bwl), "d"(port)); \
+	return value; \
+} \
+static void vmi_outs##bwl(int port, const void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+S"(addr), "+c"(count) : \
+		"r"(vmi_ops.outs##bwl), "d"(port)); \
+} \
+static void vmi_ins##bwl(int port, void *addr, unsigned long count) { \
+	__asm__ __volatile__("call *%2" : \
+		"+D"(addr), "+c"(count) : \
+		"r"(vmi_ops.ins##bwl), "d"(port)); \
+}
+
+BUILDIO(b,unsigned char)
+BUILDIO(w,unsigned short)
+BUILDIO(l,unsigned int)
+
 static inline int __init check_vmi_rom(struct vrom_header *rom)
 {
 	struct pci_header *pci;
@@ -791,6 +831,21 @@ static inline int __init activate_vmi(void)
 	para_wrap(load_esp0, vmi_load_esp0, set_kernel_stack, UpdateKernelStack);
 	para_fill(set_iopl_mask, SetIOPLMask);
 	para_fill(io_delay, IODelay);
+	if (!disable_io_ops) {
+		para_wrap(inb, vmi_inb, inb, INB);
+		para_wrap(inw, vmi_inw, inw, INW);
+		para_wrap(inl, vmi_inl, inl, INL);
+		para_wrap(outb, vmi_outb, outb, OUTB);
+		para_wrap(outw, vmi_outw, outw, OUTW);
+		para_wrap(outl, vmi_outl, outl, OUTL);
+		para_wrap(insb, vmi_insb, insb, INSB);
+		para_wrap(insw, vmi_insw, insw, INSW);
+		para_wrap(insl, vmi_insl, insl, INSL);
+		para_wrap(outsb, vmi_outsb, outsb, OUTSB);
+		para_wrap(outsw, vmi_outsw, outsw, OUTSW);
+		para_wrap(outsl, vmi_outsl, outsl, OUTSL);
+	}
+
 	para_wrap(set_lazy_mode, vmi_set_lazy_mode, set_lazy_mode, SetLazyMode);
 	/* user and kernel flush are just han
[PATCH] Fix lazy mode vmalloc synchronization for paravirt
Found this looping Ubuntu installs with VMI. If unlucky enough to hit a vmalloc sync fault during a lazy mode operation (from an IRQ handler for a module which was not yet populated in current page directory, or from inside copy_one_pte, which touches swap_map, and hit in an unused 4M region), the required PDE update would never get flushed, causing an infinite page fault loop. This bug affects any paravirt-ops backend which uses lazy updates, I believe that makes it a bug in Xen, VMI and lguest. It only happens on LOWMEM kernels. Currently for 2.6.23, but we'll want to backport to -stable as well. Zach

Touching vmalloc memory in the middle of a lazy mode update can generate a kernel PDE update, which must be flushed immediately. The fix is to leave lazy mode when doing a vmalloc sync.

Signed-off-by: Zachary Amsden <[EMAIL PROTECTED]>

diff --git a/arch/i386/mm/fault.c b/arch/i386/mm/fault.c
index 01ffdd4..fcb38e7 100644
--- a/arch/i386/mm/fault.c
+++ b/arch/i386/mm/fault.c
@@ -249,9 +249,10 @@ static inline pmd_t *vmalloc_sync_one(pgd_t *pgd, unsigned long address)
 	pmd_k = pmd_offset(pud_k, address);
 	if (!pmd_present(*pmd_k))
 		return NULL;
-	if (!pmd_present(*pmd))
+	if (!pmd_present(*pmd)) {
 		set_pmd(pmd, *pmd_k);
-	else
+		arch_flush_lazy_mmu_mode();
+	} else
 		BUG_ON(pmd_page(*pmd) != pmd_page(*pmd_k));
 	return pmd_k;
 }