Re: [kvm-devel] [patch 3/2] hotadd: lsi_scsi_init can fail
Chris Wright wrote:

During hotadd of SCSI devices lsi_scsi_init() handles failed pci_device_register(), but qemu_system_hot_add_storage() will try and attach a drive anyway. Handle this error case rather than generating SEGV.

Cc: Marcelo Tosatti [EMAIL PROTECTED]
Signed-off-by: Chris Wright [EMAIL PROTECTED]
---
 qemu/hw/device-hotplug.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/qemu/hw/device-hotplug.c
+++ b/qemu/hw/device-hotplug.c
@@ -125,7 +125,7 @@ static PCIDevice *qemu_system_hot_add_st
     switch (type) {
     case IF_SCSI:
         opaque = lsi_scsi_init (pci_bus, -1);
-        if (drive_idx >= 0)
+        if (opaque && drive_idx >= 0)
             lsi_scsi_attach (opaque, drives_table[drive_idx].bdrv,
                              drives_table[drive_idx].unit);
         break;

It's not so opaque if you're testing it against NULL... long term we want better error reporting here.

--
Any sufficiently difficult bug is indistinguishable from a feature.

- This SF.net email is sponsored by the 2008 JavaOne(SM) Conference Don't miss this year's exciting event. There's still time to save $100. Use priority code J8TL2D2. http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
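The intent of the patch, attaching a drive only when initialization actually returned a device, can be sketched in isolation. This is a minimal illustration, not QEMU code: `fake_scsi_init()`, `fake_scsi_attach()`, `struct fake_hba` and the `free_slots` parameter are hypothetical stand-ins for lsi_scsi_init(), lsi_scsi_attach() and the PCI bus state.

```c
#include <stddef.h>

/* Hypothetical stand-in for the SCSI host adapter state. */
struct fake_hba {
    int attached;   /* number of drives attached so far */
};

static struct fake_hba the_hba;

/* Stand-in for lsi_scsi_init(): returns NULL when device registration
 * fails, modelled here as "no free slot left on the bus". */
static void *fake_scsi_init(int free_slots)
{
    if (free_slots <= 0)
        return NULL;
    the_hba.attached = 0;
    return &the_hba;
}

/* Stand-in for lsi_scsi_attach(): would crash on a NULL opaque pointer. */
static void fake_scsi_attach(void *opaque)
{
    ((struct fake_hba *)opaque)->attached++;
}

/* Mirrors the patched check: attach only if init succeeded AND a drive
 * index was supplied. Returns the device, or NULL on failure. */
static void *hot_add_storage(int free_slots, int drive_idx)
{
    void *opaque = fake_scsi_init(free_slots);

    if (opaque && drive_idx >= 0)
        fake_scsi_attach(opaque);
    return opaque;
}
```

With a full bus, `hot_add_storage(0, 3)` returns NULL instead of dereferencing a null pointer, which leaves the caller free to report the failure.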
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday, 22 April 2008, Rusty Russell wrote:

[Christian, Hollis, how much is this ABI breakage going to hurt you?]

It is OK for s390 at the moment. We are still working on making userspace ready, and I plan to change the guest-host for s390 anyway. I will try to make these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26. The main reason is that we are currently limited to around 80 devices. I am not sure whether I should change the allocation of the virtqueues and descriptors to guest memory as well.

Back to your patch: I still have some ideas about virtio between little endian and big endian systems, but it requires more and different marshalling anyway, even on driver level. No idea yet how to solve that properly.

Consider your change
Acked-by: Christian Bornraeger [EMAIL PROTECTED]
given that you fix the issue below:

[...]
--- a/drivers/virtio/virtio_balloon.c	Sun Apr 20 14:41:02 2008 +1000
+++ b/drivers/virtio/virtio_balloon.c	Sun Apr 20 15:07:45 2008 +1000
@@ -155,9 +155,9 @@ static inline s64 towards_target(struct
 static inline s64 towards_target(struct virtio_balloon *vb)
 {
 	u32 v;
-	__virtio_config_val(vb->vdev,
-			    offsetof(struct virtio_balloon_config, num_pages),
-			    &v);
+	vb->vdev->config->get(vb->vdev,
+			      offsetof(struct virtio_balloon_config, num_pages),
+			      &v);

this is missing a sizeof(v), no?

Christian
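The review comment about the missing length argument can be illustrated outside the kernel. The sketch below assumes a getter of the form (offset, buffer, length); `config_get()`, `struct balloon_config` and `fake_config_space` are hypothetical userspace stand-ins, not the kernel's virtio API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Modelled loosely on struct virtio_balloon_config; illustrative only. */
struct balloon_config {
    uint32_t num_pages;
    uint32_t actual;
};

/* Hypothetical backing store standing in for the device config space. */
static struct balloon_config fake_config_space = { 1024, 256 };

/* Stand-in for a config getter: copies len bytes starting at offset.
 * Without the len argument the callee cannot know how much to copy. */
static void config_get(unsigned offset, void *buf, unsigned len)
{
    memcpy(buf, (const char *)&fake_config_space + offset, len);
}

/* The call shape the review asks for: offset, buffer, and sizeof(v). */
static uint32_t read_num_pages(void)
{
    uint32_t v;

    config_get(offsetof(struct balloon_config, num_pages), &v, sizeof(v));
    return v;
}
```

Passing `sizeof(v)` rather than a literal keeps the length tied to the variable's type if that type ever changes.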
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:

Avi Kivity wrote:

At such a tiny difference, I'm wondering why Linux-AIO exists at all, as it complicates the kernel rather a lot. I can see the theoretical appeal, but if performance is so marginal, I'm surprised it's in there.

Linux aio exists, but that's all that can be said for it. It works mostly for raw disks, doesn't integrate with networking, and doesn't advance at the same pace as the rest of the kernel. I believe only databases use it (and a userspace filesystem I wrote some time ago).

And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)

Could use threads as well, no?

I'm also surprised the Glibc implementation of AIO using ordinary threads is so close to it.

Why are you surprised?

Because I've read that Glibc AIO (which uses a thread pool) is a relatively poor performer as AIO implementations go, and is only there for API compatibility, not suggested for performance. But I read that quite a while ago, perhaps it's changed.

It's me at fault here. I just assumed that because it's easy to do aio in a thread pool efficiently, that's what glibc does. Unfortunately the code does some ridiculous things like not service multiple requests on a single fd in parallel. I see absolutely no reason for it (the code says fight for resources). So my comments only apply to linux-aio vs a sane thread pool. Sorry for spreading confusion.

Actually the glibc implementation could be improved from what I've heard. My estimates are for a thread pool implementation, but there is no reason why glibc couldn't achieve exactly the same performance.

Erm... I thought you said it _does_ achieve nearly the same performance, not that it _could_. Do you mean it could achieve exactly the same performance by using Linux AIO when possible?

It could and should. It probably doesn't. A simple thread pool implementation could come within 10% of Linux aio for most workloads. It will never be exactly the same, but for small numbers of disks, close enough.

And then, I'm wondering why use AIO at all: it suggests QEMU would run about as fast doing synchronous I/O in a few dedicated I/O threads.

Posix aio is the unix API for this, why not use it?

Because far more host platforms have threads than have POSIX AIO. (I suspect both options will end up supported in the end, as dedicated I/O threads were already suggested for other things.)

Agree.

Also, I'd presume that those that need 10K IOPS and above will not place their high throughput images on a filesystem; rather on a separate SAN LUN.

Does the separate LUN make any difference? I thought O_DIRECT on a filesystem was meant to be pretty close to block device performance.

On a good extent-based filesystem like XFS you will get good performance (though more cpu overhead due to needing to go through additional mapping layers). Old clunkers like ext3 will require additional seeks or a ton of cache (1 GB per 1 TB).

Hmm. Thanks. I may consider switching to XFS now

I'm rooting for btrfs myself.

--
error compiling committee.c: too many arguments to function
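The idea discussed in this thread, doing ordinary blocking I/O in helper threads so the submitter does not wait, can be sketched as follows. This is a simplified illustration under stated assumptions: it starts one thread per request rather than a real thread pool, and `struct io_req`, `io_submit_read()` and `io_wait()` are hypothetical names, not QEMU or glibc interfaces. It uses only POSIX pthreads and pread().

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict -std= modes */
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

/* One outstanding read request. All names here are illustrative. */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t off;
    ssize_t ret;        /* result of pread(), valid after io_wait() */
    pthread_t thread;   /* helper thread servicing this request */
};

/* Worker: performs the blocking read on behalf of the submitter. */
static void *io_worker(void *arg)
{
    struct io_req *r = arg;

    r->ret = pread(r->fd, r->buf, r->len, r->off);
    return NULL;
}

/* Start the read and return immediately; 0 on success. */
static int io_submit_read(struct io_req *r)
{
    return pthread_create(&r->thread, NULL, io_worker, r);
}

/* Block until the request completes and return pread()'s result. */
static ssize_t io_wait(struct io_req *r)
{
    pthread_join(r->thread, NULL);
    return r->ret;
}
```

Because pread() takes an explicit offset, two requests on the same file descriptor can be serviced in parallel, which is the behaviour the thread says glibc's implementation lacks. A production version would reuse a fixed set of worker threads and signal completion instead of joining.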
[kvm-devel] [Deadline Extended] Call for Presentations: KVM Forum 2008
[Note: KVM Forum registration is now open at http://kforum.qumranet.com/KVMForum/about_kvmforum.php] [The deadline for submitting presentations has been extended by two weeks, until May 4th] This is the Call for Presentations for the second annual KVM Developer's Forum, to be held on June 10-13, 2008, in Napa, California, USA [1]. We are looking for presentations on KVM development, quality assurance, management, security, interoperability, architecture support, and interesting use cases. Presentations are 50 minutes in length; there are also 25-minute mini-presentation slots available. KVM Forum presentations are an excellent way to inform the KVM development community about your work, and to gather valuable feedback about your approach. Please send your presentation proposal to the KVM Forum 2008 Content Committee at [EMAIL PROTECTED] by May 4th. KVM Forum 2008 Content Committee: Dor Laor Anthony Liguori Avi Kivity [1] http://kforum.qumranet.com/KVMForum/about_kvmforum.php -- Any sufficiently difficult bug is indistinguishable from a feature.
[kvm-devel] KVM Test result, kernel 6cf5973.., userspace 5157358.. -- One Issue Fixed
Hi All,

This is today's KVM test result against kvm.git 6cf59734fc9bc89954d0157524eea156c2f9a5ab and kvm-userspace.git 5157358e1946770847271e3602f1adae85002871.

One Issue Fixed:
1. booting smp windows guests has 30% chance of hang
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1910923&group_id=180599

Two Old Issues:
1. Booting four guests likely fails
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1919354&group_id=180599
2. Cannot boot guests with hugetlbfs
   https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1941302&group_id=180599

Test environment
Platform      Woodcrest
CPU           4
Memory size   8G

Details

IA32-pae:
 1. boot guest with 256M memory                    PASS
 2. boot two windows xp guest                      PASS
 3. boot 4 same guest in parallel                  PASS
 4. boot linux and windows guest in parallel       PASS
 5. boot guest with 1500M memory                   PASS
 6. boot windows 2003 with ACPI enabled            PASS
 7. boot Windows xp with ACPI enabled              PASS
 8. boot Windows 2000 without ACPI                 PASS
 9. kernel build on SMP linux guest                PASS
10. LTP on linux guest                             PASS
11. boot base kernel linux                         PASS
12. save/restore 32-bit HVM guests                 PASS
13. live migration 32-bit HVM guests               PASS
14. boot SMP Windows xp with ACPI enabled          PASS
15. boot SMP Windows 2003 with ACPI enabled        PASS
16. boot SMP Windows 2000 with ACPI enabled        PASS

IA32e:
 1. boot four 32-bit guest in parallel             PASS
 2. boot four 64-bit guest in parallel             PASS
 3. boot 4G 64-bit guest                           PASS
 4. boot 4G pae guest                              PASS
 5. boot 32-bit linux and 32 bit windows guest in parallel   PASS
 6. boot 32-bit guest with 1500M memory            PASS
 7. boot 64-bit guest with 1500M memory            PASS
 8. boot 32-bit guest with 256M memory             PASS
 9. boot 64-bit guest with 256M memory             PASS
10. boot two 32-bit windows xp in parallel         PASS
11. boot four 32-bit different guest in para       PASS
12. save/restore 64-bit linux guests               PASS
13. save/restore 32-bit linux guests               PASS
14. boot 32-bit SMP windows 2003 with ACPI enabled PASS
15. boot 32-bit SMP Windows 2000 with ACPI enabled PASS
16. boot 32-bit SMP Windows xp with ACPI enabled   PASS
17. boot 32-bit Windows 2000 without ACPI          PASS
18. boot 64-bit Windows xp with ACPI enabled       PASS
19. boot 32-bit Windows xp without ACPI            PASS
20. boot 64-bit UP vista                           PASS
21. boot 64-bit SMP vista                          PASS
22. kernel build in 32-bit linux guest OS          PASS
23. kernel build in 64-bit linux guest OS          PASS
24. LTP on 32-bit linux guest OS                   PASS
25. LTP on 64-bit linux guest OS                   PASS
26. boot 64-bit guests with ACPI enabled           PASS
27. boot 32-bit x-server                           PASS
28. boot 64-bit SMP windows XP with ACPI enabled   PASS
29. boot 64-bit SMP windows 2003 with ACPI enabled PASS
30. live migration 64bit linux guests              PASS
31. live migration 32bit linux guests              PASS
32. reboot 32bit windows xp guest                  PASS
33. reboot 32bit windows xp guest                  PASS

Report Summary on IA32-pae

Summary Test Report of Last Session
=====================================================================
                Total   Pass    Fail    NoResult   Crash
=====================================================================
control_panel   7       5       2       0          0
Restart         2       2       0       0          0
gtest           15      15      0
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Hi,

I am thinking about combining this ROM with extboot. Both ROMs are about booting, so I think that is reasonable. We would then have only one ROM that supports both external boot and Linux boot. Is that desirable or not?

Thanks,
Quynh

On 4/21/08, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:

Hmm, the last patch includes a binary. So please take this patch instead.

Thanks,
Q

# diffstat linuxboot1.diff
 Makefile             |   13 -
 linuxboot/Makefile   |   40 +++
 linuxboot/boot.S     |   54 +
 linuxboot/farvar.h   |  130 +++
 linuxboot/rom.c      |  104
 linuxboot/signrom.c  |  128 ++
 linuxboot/util.h     |   69 +++
 qemu/Makefile        |    3 -
 qemu/Makefile.target |    2
 qemu/hw/linuxboot.c  |   39 +++
 qemu/hw/pc.c         |   22 +++-
 qemu/hw/pc.h         |    5 +
 12 files changed, 600 insertions(+), 9 deletions(-)

On Mon, Apr 21, 2008 at 12:33 PM, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:

Forgot to say that this patch is against kvm-66.

Thanks,
Q

On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:

Hi,

This should be submitted to upstream (but not to kvm-devel list), but this is only the test code that I want to quickly send out for comments. In case it looks OK, I will send it to upstream later.

Inspired by extboot and conversations with Anthony and HPA, this linuxboot option ROM is a simple option ROM that intercepts int19 in order to execute linux setup code. This approach eliminates the need to manipulate the boot sector for this purpose.

To test it, just load the linux kernel with your KVM/QEMU image using the -kernel option in the normal way.

I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest Ubuntu 8.04.

Thanks,
Quynh

# diffstat linuxboot1.diff
 Makefile             |   13 -
 linuxboot/Makefile   |   40 +++
 linuxboot/boot.S     |   54 +
 linuxboot/farvar.h   |  130 +++
 linuxboot/rom.c      |  104
 linuxboot/signrom    |binary
 linuxboot/signrom.c  |  128 ++
 linuxboot/util.h     |   69 +++
 qemu/Makefile        |    3 -
 qemu/Makefile.target |    2
 qemu/hw/linuxboot.c  |   39 +++
 qemu/hw/pc.c         |   22 +++-
 qemu/hw/pc.h         |    5 +
 13 files changed, 600 insertions(+), 9 deletions(-)
Re: [kvm-devel] [PATCH 6/6] kvm: qemu: Enable EPT support for real mode
Yang, Sheng wrote:

From 73c33765f3d879001818cd0719038c78a0c65561 Mon Sep 17 00:00:00 2001
From: Sheng Yang [EMAIL PROTECTED]
Date: Fri, 18 Apr 2008 17:15:39 +0800
Subject: [PATCH] kvm: qemu: Enable EPT support for real mode

This patch builds an identity page table on the last page of the VGA bios, and uses it as the guest page table in nonpaging mode for EPT.

Doing this in qemu means older versions of qemu can't work with an ept-enabled kernel. Also, placing the table in the vga bios might conflict with video card assignment to a guest.

Suggest placing this near the realmode tss (see vmx.c:init_rmode_tss()), which serves a similar function.

--
error compiling committee.c: too many arguments to function
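For readers unfamiliar with the term, an identity page table maps every virtual address to the same physical address. The sketch below shows one way such a table can fit in a single 4 KiB page, assuming 32-bit non-PAE x86 paging with 4 MiB pages (CR4.PSE enabled); the patch under review may build its table differently. `build_identity_map()` and `translate()` are hypothetical helper names.

```c
#include <stdint.h>

/* x86 page-directory entry flag bits (32-bit paging). */
#define PDE_PRESENT  0x001u
#define PDE_RW       0x002u
#define PDE_USER     0x004u
#define PDE_ACCESSED 0x020u
#define PDE_DIRTY    0x040u
#define PDE_PSE      0x080u   /* entry maps a 4 MiB page directly */

#define PDE_COUNT 1024        /* 1024 entries * 4 bytes = one 4 KiB page */

/* Fill a page directory so that entry i maps the 4 MiB region starting
 * at physical address i * 4 MiB: virtual == physical for all 4 GiB. */
static void build_identity_map(uint32_t pd[PDE_COUNT])
{
    const uint32_t flags = PDE_PRESENT | PDE_RW | PDE_USER |
                           PDE_ACCESSED | PDE_DIRTY | PDE_PSE;
    uint32_t i;

    for (i = 0; i < PDE_COUNT; i++)
        pd[i] = (i << 22) | flags;
}

/* Software walk of the table above, for checking the mapping only. */
static uint32_t translate(const uint32_t pd[PDE_COUNT], uint32_t va)
{
    uint32_t pde = pd[va >> 22];

    return (pde & 0xffc00000u) | (va & 0x003fffffu);
}
```

Because the whole table is one page, it can live in any spare guest-physical page, which is why its placement (VGA bios versus next to the realmode tss) is the point under discussion rather than its contents.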
Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting
I believe that's the way to go. If you have spare time on your hands, feel free to integrate my multiboot patches as well.

Alex

On Apr 22, 2008, at 11:07 AM, Nguyen Anh Quynh wrote:

Hi,

I am thinking about combining this ROM with extboot. Both ROMs are about booting, so I think that is reasonable. We would then have only one ROM that supports both external boot and Linux boot. Is that desirable or not?

Thanks,
Quynh

[...]
[kvm-devel] Repeat tenders
Placing state and municipal orders through tenders: disputed issues in current practice

25 April 2008, Moscow

Seminar program
- Analysis of specific arbitration cases that are the most common (typical) in the court practice of challenging the placement of state and municipal orders.
- Analytical commentary on disputed situations arising both while tenders are held and while the state (municipal) contracts awarded as a result are concluded and performed.
- The study of tenders is comprehensive, since it also covers the procedural specifics of hearing disputes over the invalidity of tenders.
- Answers to the seminar's key questions are argued with reference to applied research and to materials from foreign and international practice of holding auctions and competitive tenders.
- For each topic of the program, problem situations from the participants' own practice may be analysed; the author gives specific recommendations during the discussion.

Key questions of the program
∙ How many winners can a tender have? May participants combine their bids before or during the tender?
∙ Who is entitled to bring a claim to have a tender declared invalid? How is interest in challenging the results of a tender determined?
∙ What does invalidity of an auction or competitive tender mean: voidable or void?
∙ How is competition among participants assessed? Are there grounds to declare a tender not held if there were two participants?
∙ What secures a bid in an auction (competitive tender)? May the organiser accept bank guarantees, promissory notes or cash from participants on pledge terms?
∙ What are the fundamental differences between the legal status of a tender participant and that of a participant in order placement?
∙ How are the selection criteria for a competitive tender (auction) formulated, and can they be departed from when determining the winner?
∙ What should be done when identical bids are received from several participants?
∙ Must a repeat tender be held if the winner does not perform the contract concluded at the tender or evades concluding it?
∙ Can a court, having found violations of the law, uphold the results of a tender for placing state and municipal orders?

Core topics of the program
∙ Advantages and drawbacks of concluding contracts by tender. Main types of auctions and competitive tenders.
∙ Legal problems of participation in tenders by contractual associations of participants and by affiliated persons.
∙ Procedural specifics of hearing disputes over the invalidity of tenders.
∙ Grounds for declaring a tender not held.
∙ Legal significance of the security for an auction or tender bid.
∙ Legal requirements for the notice of a tender; characteristics of improper notices.
∙ Defining the selection criteria; the legal framework for the work of the tender (auction) commission.
∙ Cancelling a tender, declaring a tender invalid, declaring a tender not held: differences between the procedures and their legal consequences.
∙ The relationship between administrative and judicial means of protecting the rights and lawful interests of participants in order placement.

Duration: 10:00 to 17:00 (with a lunch break and a coffee break).
Location: Moscow, 5 minutes on foot from Akademicheskaya metro station.
Cost: 4900 rubles (including VAT). (The price includes handouts, a coffee break and lunch in a restaurant.)

If you cannot attend the seminar, we offer its video version on DVD/CD or videocassette (the author's handouts are included). The video course costs 3500 rubles, including VAT.

To register for the seminar, fax us: your organisation's details, the topic and date of the seminar, the participants' full names, and a contact telephone and fax number.

To order the video course, fax us: your organisation's details, the topic of the video course, the medium (DVD or CD), telephone, fax, contact person and exact delivery address.

Further information and registration by tel/fax: (495) 543 88 46
Re: [kvm-devel] Some FAQ questions
Damjan wrote:

I have some questions for the FAQ, about the configuration of Linux guests:
a) is swap needed in the guest (I'd say no, but..)
b) what filesystem is best for a guest
c) what io scheduler in the guest (noop? or cfq)
d) are there any runtime kernel tweaks for the guest (/proc/sys)?

For the first four questions, do whatever you'd do for a similarly configured host running a similar workload. It's fine to use cfq as the I/O scheduler.

e) suggested linux kernel source configuration (.config)

With newer kernels, be sure to enable virtio drivers, kvm clock, and kvm mmu paravirtualization.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver
Jerone Young wrote:

virtio net device does not appear to show itself in the guest. I'm curious of what options I may be missing. Here is my config

CONFIG_VIRTIO_NET=y
[..]
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_BOOKE_HOST=y
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
CONFIG_VIRTIO_PCI=y

That should be enough in .config, but be aware that you need the proper qemu command line, like

-net nic,model=virtio,macaddr=00:00:00:00:00:AA -net tap

as well as an /etc/qemu-ifup script (I sent one for our purpose to kvm-ppc-devel a while ago). You also need some tools installed, e.g. brctl, and you need to create /dev/net/tun in the host because we have no dynamic /dev.

If you have done all that already and it is still not working, you should continue with Anthony's suggestion and send what lspci shows you. If you want to be complete, use lspci -vvvx. It may also be worth adding debug to the kernel command line of the guest and attaching a full dmesg to the same response, just in case someone wants to look at driver messages.

--
Grüsse / regards,
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver
Jerone Young wrote:

What I am asking is do I have all the proper options in my kernel config set to use it?

I have:

[EMAIL PROTECTED] linux-2.6 (kvm-updates-2.6.26)]$ grep VIRTIO .config
CONFIG_VIRTIO_BLK=m
CONFIG_VIRTIO_NET=m
CONFIG_VIRTIO=m
CONFIG_VIRTIO_RING=m
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [patch] qemu/ia64 include prototype for qemu_mallocz
Jes Sorensen wrote: Hi, This one fixes a segfault problem I am seeing on ia64 due to the malloc'ed address being truncated to 32 bit. Applied, thanks. -- error compiling committee.c: too many arguments to function
Re: [kvm-devel] [patch 0/2] pci_register_device can fail
Chris Wright wrote:

The pci hotadd patches make it easy to trigger segfaults when adding more devices than a single PCI bus can handle. The following 2 patches fix the pci nic devices and virtio-blk device. Now the following:

OK bus 0, slot 31, function 0 (devfn 248)
(qemu) pci_add 0 nic model=virtio
Segmentation fault

OK bus 0, slot 31, function 0 (devfn 248)
(qemu) pci_add 0 storage file=/mnt/disk1,if=virtio
Segmentation fault

become:

OK bus 0, slot 31, function 0 (devfn 248)
(qemu) pci_add 0 nic model=virtio
qemu: Unable to initialze NIC: virtio
failed to add model=virtio

OK bus 0, slot 31, function 0 (devfn 248)
(qemu) pci_add 0 storage file=/mnt/disk1,if=virtio
failed to add file=/mnt/disk1,if=virtio

Applied all three, thanks.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
Rusty Russell wrote:

[Christian, Hollis, how much is this ABI breakage going to hurt you?]

A recent proposed feature addition to the virtio block driver revealed some flaws in the API, in particular how easy it is to break big endian machines.

The virtio config space was originally chosen to be little-endian, because we thought the config might be part of the PCI config space for virtio_pci. It's actually a separate mmio region, so that argument holds little water; as only x86 is currently using the virtio mechanism, we can change this (but must do so now, before the impending s390 and ppc merges).

This will probably annoy Hollis, who has guests that can go both ways.

--
error compiling committee.c: too many arguments to function
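The difference between the two conventions under discussion can be shown with a 32-bit field. This is a hedged userspace sketch: `read_le32()` and `read_native32()` are hypothetical helpers, not the virtio API, and "native" here means the byte order of the machine running the code, standing in for the guest.

```c
#include <stdint.h>
#include <string.h>

/* Little-endian config: the driver must decode byte by byte (or use a
 * byte-swap helper), and forgetting to do so breaks big-endian guests. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Guest-endian config: the field is already in the guest's own byte
 * order, so a plain copy is correct on every architecture. */
static uint32_t read_native32(const uint8_t *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

With a guest-endian config space the conversion burden moves to the host side, which is why a host serving guests of both byte orders is the awkward case.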
Re: [kvm-devel] Some FAQ questions
On Tue, Apr 22, 2008 at 1:10 PM, Avi Kivity [EMAIL PROTECTED] wrote: Damjan wrote: I have some questions for the FAQ, about the configuration of Linux guests: a) is swap needed in the guest (I'd say no, but..) b) what filesystem is best for a guest c) what io scheduler in the guest (noop? or cfq) d) are there any runtime kernel tweaks for the guest (/proc/sys)? For the first four questions, do whatever you'd do for a similarly configured host running a similar workload. It's fine to use cfq as the I/O scheduler. Is cfq still fair in the guest? The VM re-dispatches the requests (at least when using QEMU IDE) and the host can reschedule them at will. Luca
Re: [kvm-devel] Some FAQ questions
Luca Tettamanti wrote: Is cfq still fair in the guest? The VM re-dispatches the requests (at least when using QEMU IDE) and the host can reschedule them at will. The same problem occurs (to a lesser extent) in non-virtualized environments; disks (and esp. array controllers) also have their own I/O schedulers. -- error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Fix missing decleration for kvm_enabled() in qemu for target-ppc/helper.c
Jerone Young wrote: Recent change now requires target-ppc/helper.c to now include qemu-kvm.h to get the definition for kvm_enabled(). This fixes it so things now compile again. Applied, thanks. -- error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:

invalidate_range_start {
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count++;
	rmap-invalidate of sptes in range
	write_seqlock; write_sequnlock;
	spin_unlock(&kvm->mmu_lock)
}

invalidate_range_end {
	spin_lock(&kvm->mmu_lock);
	kvm->invalidate_range_count--;
	write_seqlock; write_sequnlock;
	spin_unlock(&kvm->mmu_lock)
}

Robin correctly pointed out by PM that there should be a seqlock in range_begin/end too, as corrected above.

I guess it's better to use an explicit sequence counter, so we avoid a useless spinlock of the write_seqlock (mmu_lock is enough already in all places) and so we can increase it with a single op with +=2 in the range_begin/end. The above is a lower-perf version of the final locking, but simpler for reading purposes.
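The explicit sequence counter proposed in this message can be sketched as a small state machine. This is a single-threaded illustration with hypothetical names (`struct mmu_state`, `fault_begin()`, `fault_must_retry()`); real code would hold mmu_lock wherever the comments say so, and the final KVM implementation may differ in detail.

```c
/* State that the real code would protect with kvm->mmu_lock. */
struct mmu_state {
    int invalidate_range_count;  /* > 0 while an invalidate is running */
    unsigned long seq;           /* bumped on every range start and end */
};

/* Called with mmu_lock held in the real code. The += 2 stands in for
 * the write_seqlock + write_sequnlock pair, as suggested in the mail. */
static void invalidate_range_start(struct mmu_state *s)
{
    s->invalidate_range_count++;
    s->seq += 2;
}

/* Called with mmu_lock held in the real code. */
static void invalidate_range_end(struct mmu_state *s)
{
    s->invalidate_range_count--;
    s->seq += 2;
}

/* Page-fault path, step 1: sample the counter before looking up the
 * page outside the lock. */
static unsigned long fault_begin(const struct mmu_state *s)
{
    return s->seq;
}

/* Page-fault path, step 2 (under mmu_lock): the mapping may only be
 * installed if no invalidate is in flight and none ran since step 1. */
static int fault_must_retry(const struct mmu_state *s, unsigned long seen)
{
    return s->invalidate_range_count > 0 || s->seq != seen;
}
```

A fault that sampled the counter before an invalidate began must retry even after that invalidate has finished, because the page it looked up may have been the one invalidated.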
Re: [kvm-devel] [PATCH 1/2]kvmtrace: add event mask support (kernel part)
Liu, Eric E wrote:
> From a1b062cfd4d1a91c447b680ac9a2250fe55119ec Mon Sep 17 00:00:00 2001
> From: Feng (Eric) Liu [EMAIL PROTECTED]
> Date: Wed, 16 Apr 2008 05:29:37 -0400
> Subject: [PATCH] KVM: trace: Add event mask support.
>
> Allow user space application to specify one or more filter masks to limit the events being captured via it.

Sorry about the late review.

> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -18,6 +18,8 @@ struct kvm_user_trace_setup {
>  	__u32 buf_size; /* sub_buffer size of each per-cpu */
>  	__u32 buf_nr; /* the number of sub_buffers of each per-cpu */
> +	__u16 cat_mask; /* the tracing categories are enabled */
> +	__u64 act_bitmap[16]; /* the actions are enabled for each category */
>  };

The structures will be laid out differently on 32-bit and 64-bit. This is important since we'd like 32-bit userspace to work correctly with a 64-bit kernel.

The usual solution is to insert a

    __u16 pad1[3];

between the two fields.

Otherwise, the patch seems fine.

-- 
error compiling committee.c: too many arguments to function
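Editorial sketch, not part of the original message: the layout problem can be checked with offsetof. The struct names below are hypothetical copies of the two variants, using the stdint types in place of __u16/__u32/__u64. On the i386 System V ABI a 64-bit integer inside a struct is 4-byte aligned, so the unpadded act_bitmap lands at offset 12, while on x86-64 it lands at 16. With the explicit pad it is at 16 on both.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout as posted: the compiler chooses the padding before act_bitmap. */
struct trace_setup_unpadded {
	uint32_t buf_size;
	uint32_t buf_nr;
	uint16_t cat_mask;
	uint64_t act_bitmap[16];
};

/* Layout with the explicit pad suggested in the review. */
struct trace_setup_padded {
	uint32_t buf_size;
	uint32_t buf_nr;
	uint16_t cat_mask;
	uint16_t pad1[3];	/* fills bytes 10..15 by hand */
	uint64_t act_bitmap[16];
};

/* These hold whether uint64_t is 4-byte or 8-byte aligned. */
_Static_assert(offsetof(struct trace_setup_padded, act_bitmap) == 16,
	       "act_bitmap must start at byte 16");
_Static_assert(sizeof(struct trace_setup_padded) == 16 + 16 * sizeof(uint64_t),
	       "no hidden padding expected");

/* ABI dependent: where the compiler put act_bitmap without the pad. */
static size_t unpadded_bitmap_offset(void)
{
	return offsetof(struct trace_setup_unpadded, act_bitmap);
}
```

Building the same file with and without -m32 on an x86 toolchain is one way to see the unpadded offset change while the padded one stays put.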
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > invalidate_range_start {
> > 	spin_lock(&kvm->mmu_lock);
> >
> > 	kvm->invalidate_range_count++;
> > 	rmap-invalidate of sptes in range
> >
>
> 	write_seqlock; write_sequnlock;

I don't think you need it here since invalidate_range_count is already elevated which will accomplish the same effect.

Thanks,
Robin
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > invalidate_range_start {
> > > 	spin_lock(&kvm->mmu_lock);
> > >
> > > 	kvm->invalidate_range_count++;
> > > 	rmap-invalidate of sptes in range
> > >
> >
> > 	write_seqlock; write_sequnlock;
>
> I don't think you need it here since invalidate_range_count is already elevated which will accomplish the same effect.

Agreed, seqlock only in range_end should be enough. BTW, the fact seqlock is needed regardless of invalidate_page existing or not, really makes invalidate_page a no brainer not just from the core VM point of view, but from the driver point of view too. The kvm_page_fault logic would be the same even if I remove invalidate_page from the mmu notifier patch but it'd run slower both when armed and disarmed.
Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver
Jerone Young wrote:
> What I am asking is do I have all the proper options in my kernel config set to use it?

Yes. You just need CONFIG_VIRTIO_NET and CONFIG_VIRTIO_PCI. The remaining options will be automatically selected.

Regards,

Anthony Liguori

> On Mon, 2008-04-21 at 17:13 -0500, Anthony Liguori wrote:
> > Jerone Young wrote:
> > > virtio net device does not appear to show itself in the guest. I'm curious of what options I may be missing. Here is my config
> >
> > You'll have to be more specific about what "does not appear to show itself" means. What's the output of lspci?
> >
> > Regards,
> >
> > Anthony Liguori
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
On Tue, Apr 22, 2008 at 03:21:43PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
> > On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
> > > On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
> > > > invalidate_range_start {
> > > > 	spin_lock(&kvm->mmu_lock);
> > > >
> > > > 	kvm->invalidate_range_count++;
> > > > 	rmap-invalidate of sptes in range
> > > >
> > >
> > > 	write_seqlock; write_sequnlock;
> >
> > I don't think you need it here since invalidate_range_count is already elevated which will accomplish the same effect.
>
> Agreed, seqlock only in range_end should be enough. BTW, the fact

I am a little confused about the value of the seq_lock versus a simple atomic, but I assumed there is a reason and left it at that.

> seqlock is needed regardless of invalidate_page existing or not, really makes invalidate_page a no brainer not just from the core VM point of view, but from the driver point of view too. The kvm_page_fault logic would be the same even if I remove invalidate_page from the mmu notifier patch but it'd run slower both when armed and disarmed.

I don't know what you mean by "it'd" run slower and what you mean by "armed" and "disarmed".

For the sake of this discussion, I will assume "it'd" means the kernel in general and not KVM. With the two call sites for range_begin/range_end, I would agree we have more call sites, but the second is extremely likely to be cache hot.

By disarmed, I will assume you mean no notifiers registered for a particular mm. In that case, the cache will make the second call effectively free. So, for the disarmed case, I see no measurable difference.

For the case where there is a notifier registered, I certainly can see a difference. I am not certain how to quantify the difference as it depends on the callee. In the case of xpmem, our callout is always very expensive for the _start case. Our _end case is very light, but it is essentially the exact same steps we would perform for the _page callout.

When I was discussing this difference with Jack, he reminded me that the GRU, due to its hardware, does not have any race issues with the invalidate_page callout simply doing the tlb shootdown and not modifying any of its internal structures. He then put a caveat on the discussion that _either_ method was acceptable as far as he was concerned. The real issue is getting a patch in that satisfies all needs and not whether there is a separate invalidate_page callout.

Thanks,
Robin
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Nguyen Anh Quynh wrote:
> Hi,
>
> I am thinking about combining this ROM with the extboot. Both ROMs are about booting, so I think that is reasonable. So we will have only 1 ROM that supports both external boot and Linux boot.
>
> Is that desirable or not?

I think so.

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
> I am a little confused about the value of the seq_lock versus a simple atomic, but I assumed there is a reason and left it at that.

There's no value for anything but get_user_pages (get_user_pages takes its own lock internally though). I preferred to explain it as a seqlock because it was simpler for reading, but I totally agree in the final implementation it shouldn't be a seqlock. My code was meant to be pseudo-code only. It doesn't even need to be atomic ;).

> I don't know what you mean by "it'd" run slower and what you mean by "armed" and "disarmed".

1) when armed, the time-window where the kvm-page-fault would be blocked would be a bit larger without invalidate_page, for no good reason

2) if you were to remove invalidate_page, when disarmed the VM would need two branches instead of one in various places

I don't want to waste cycles if not wasting them improves performance both when armed and disarmed.

> For the sake of this discussion, I will assume "it'd" means the kernel in general and not KVM. With the two call sites for range_begin/range_end,

I actually meant for both.

> By disarmed, I will assume you mean no notifiers registered for a particular mm. In that case, the cache will make the second call effectively free. So, for the disarmed case, I see no measurable difference.

For rmap it is surely effectively free; for do_wp_page it costs one branch for no good reason.

> For the case where there is a notifier registered, I certainly can see a difference. I am not certain how to quantify the difference as it

Agreed.

> When I was discussing this difference with Jack, he reminded me that the GRU, due to its hardware, does not have any race issues with the invalidate_page callout simply doing the tlb shootdown and not modifying any of its internal structures. He then put a caveat on the discussion that _either_ method was acceptable as far as he was concerned. The real issue is getting a patch in that satisfies all needs and not whether there is a separate invalidate_page callout.

Sure, we have that patch now, I'll send it out in a minute. I was just trying to explain why it makes sense to have an invalidate_page too (which remains the only difference by now); removing it would be a regression on all sides, even if a minor one.
[kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1208870142 -7200 # Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5 # Parent fb3bc9942fb78629d096bd07564f435d51d86e5f Core of mmu notifiers. Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] Signed-off-by: Nick Piggin [EMAIL PROTECTED] Signed-off-by: Christoph Lameter [EMAIL PROTECTED] diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1050,6 +1050,27 @@ unsigned long addr, unsigned long len, unsigned long flags, struct page **pages); +/* + * mm_lock will take mmap_sem writably (to prevent all modifications + * and scanning of vmas) and then also takes the mapping locks for + * each of the vma to lockout any scans of pagetables of this address + * space. This can be used to effectively holding off reclaim from the + * address space. + * + * mm_lock can fail if there is not enough memory to store a pointer + * array to all vmas. + * + * mm_lock and mm_unlock are expensive operations that may take a long time. 
+ */ +struct mm_lock_data { + spinlock_t **i_mmap_locks; + spinlock_t **anon_vma_locks; + size_t nr_i_mmap_locks; + size_t nr_anon_vma_locks; +}; +extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); +extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); + extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -225,6 +225,9 @@ #ifdef CONFIG_CGROUP_MEM_RES_CTLR struct mem_cgroup *mem_cgroup; #endif +#ifdef CONFIG_MMU_NOTIFIER + struct hlist_head mmu_notifier_list; +#endif }; #endif /* _LINUX_MM_TYPES_H */ diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h new file mode 100644 --- /dev/null +++ b/include/linux/mmu_notifier.h @@ -0,0 +1,229 @@ +#ifndef _LINUX_MMU_NOTIFIER_H +#define _LINUX_MMU_NOTIFIER_H + +#include linux/list.h +#include linux/spinlock.h +#include linux/mm_types.h + +struct mmu_notifier; +struct mmu_notifier_ops; + +#ifdef CONFIG_MMU_NOTIFIER + +struct mmu_notifier_ops { + /* +* Called after all other threads have terminated and the executing +* thread is the only remaining execution thread. There are no +* users of the mm_struct remaining. +*/ + void (*release)(struct mmu_notifier *mn, + struct mm_struct *mm); + + /* +* clear_flush_young is called after the VM is +* test-and-clearing the young/accessed bitflag in the +* pte. This way the VM will provide proper aging to the +* accesses to the page through the secondary MMUs and not +* only to the ones through the Linux pte. 
+*/ + int (*clear_flush_young)(struct mmu_notifier *mn, +struct mm_struct *mm, +unsigned long address); + + /* +* Before this is invoked any secondary MMU is still ok to +* read/write to the page previously pointed by the Linux pte +* because the old page hasn't been freed yet. If required +* set_page_dirty has to be called internally to this method. +*/ + void (*invalidate_page)(struct mmu_notifier *mn, + struct mm_struct *mm, + unsigned long address); + + /* +* invalidate_range_start() and invalidate_range_end() must be +* paired and are called only when the mmap_sem is held and/or +* the semaphores protecting the reverse maps. Both functions +* may sleep. The subsystem must guarantee that no additional +* references to the pages in the range established between +* the call to invalidate_range_start() and the matching call +* to invalidate_range_end(). +* +* Invalidation of multiple concurrent ranges may be permitted +* by the driver or the driver may exclude other invalidation +* from proceeding by blocking on new invalidate_range_start() +* callback that overlap invalidates that are already in +* progress. Either way the establishment of sptes to the +* range can only be allowed if all invalidate_range_stop() +* function have been called. +* +* invalidate_range_start() is called when all pages in the +* range are still mapped and have at least a refcount of one. +* +* invalidate_range_end() is called when all pages in the +* range have been unmapped and the pages have been freed by +* the VM. +* +* The VM will remove the page table entries and potentially +
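Editorial sketch, not part of the original patch (which is cut off above): a small user-space model of how a secondary-MMU driver might plug into a callback table shaped like mmu_notifier_ops. Every toy_ and drv_ name is hypothetical, the range callback signatures are an assumption since that part of the header is not visible here, and a plain singly linked list stands in for the hlist in mm_struct. An empty list models the "disarmed" case discussed in the thread.

```c
#include <stddef.h>

struct toy_mm;
struct toy_notifier;

/* Callback table modeled on a subset of the posted mmu_notifier_ops. */
struct toy_notifier_ops {
	void (*invalidate_page)(struct toy_notifier *mn, struct toy_mm *mm,
				unsigned long address);
	void (*invalidate_range_start)(struct toy_notifier *mn,
				       struct toy_mm *mm,
				       unsigned long start, unsigned long end);
	void (*invalidate_range_end)(struct toy_notifier *mn,
				     struct toy_mm *mm,
				     unsigned long start, unsigned long end);
};

struct toy_notifier {
	const struct toy_notifier_ops *ops;
	struct toy_notifier *next;
};

struct toy_mm {
	struct toy_notifier *notifiers;	/* NULL means disarmed */
};

static void toy_register(struct toy_mm *mm, struct toy_notifier *mn)
{
	mn->next = mm->notifiers;
	mm->notifiers = mn;
}

static void toy_invalidate_page(struct toy_mm *mm, unsigned long address)
{
	struct toy_notifier *mn;

	for (mn = mm->notifiers; mn; mn = mn->next)
		if (mn->ops->invalidate_page)
			mn->ops->invalidate_page(mn, mm, address);
}

static void toy_invalidate_range_start(struct toy_mm *mm,
				       unsigned long start, unsigned long end)
{
	struct toy_notifier *mn;

	for (mn = mm->notifiers; mn; mn = mn->next)
		if (mn->ops->invalidate_range_start)
			mn->ops->invalidate_range_start(mn, mm, start, end);
}

static void toy_invalidate_range_end(struct toy_mm *mm,
				     unsigned long start, unsigned long end)
{
	struct toy_notifier *mn;

	for (mn = mm->notifiers; mn; mn = mn->next)
		if (mn->ops->invalidate_range_end)
			mn->ops->invalidate_range_end(mn, mm, start, end);
}

/* Example driver: tracks pending range invalidates and page invalidates. */
struct toy_driver {
	struct toy_notifier mn;		/* must stay the first member */
	int ranges_in_flight;
	unsigned long pages_invalidated;
};

static void drv_invalidate_page(struct toy_notifier *mn, struct toy_mm *mm,
				unsigned long address)
{
	struct toy_driver *d = (struct toy_driver *)mn;

	(void)mm;
	(void)address;
	d->pages_invalidated++;
}

static void drv_range_start(struct toy_notifier *mn, struct toy_mm *mm,
			    unsigned long start, unsigned long end)
{
	(void)mm; (void)start; (void)end;
	((struct toy_driver *)mn)->ranges_in_flight++;
}

static void drv_range_end(struct toy_notifier *mn, struct toy_mm *mm,
			  unsigned long start, unsigned long end)
{
	(void)mm; (void)start; (void)end;
	((struct toy_driver *)mn)->ranges_in_flight--;
}

static const struct toy_notifier_ops drv_ops = {
	.invalidate_page = drv_invalidate_page,
	.invalidate_range_start = drv_range_start,
	.invalidate_range_end = drv_range_end,
};

/* New secondary mappings are allowed only with no range invalidate pending. */
static int drv_may_map(const struct toy_driver *d)
{
	return d->ranges_in_flight == 0;
}
```

The drv_may_map() check mirrors the rule in the header comment that sptes may only be established once every started range invalidate has been matched by its end call.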
[kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID 3c804dca25b15017b22008647783d6f5f3801fa9
# Parent ea87c15371b1bd49380c40c3f15f1c7ca4438af5
Fix ia64 compilation failure because of common code include bug.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
[kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1208872186 -7200 # Node ID ac9bb1fb3de2aa5d27210a28edf24f6577094076 # Parent a6672bdeead0d41b2ebd6846f731d43a611645b7 Moves all mmu notifier methods outside the PT lock (first and not last step to make them sleep capable). Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h --- a/include/linux/mmu_notifier.h +++ b/include/linux/mmu_notifier.h @@ -169,27 +169,6 @@ INIT_HLIST_HEAD(mm-mmu_notifier_list); } -#define ptep_clear_flush_notify(__vma, __address, __ptep) \ -({ \ - pte_t __pte;\ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __pte = ptep_clear_flush(___vma, ___address, __ptep); \ - mmu_notifier_invalidate_page(___vma-vm_mm, ___address);\ - __pte; \ -}) - -#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \ -({ \ - int __young;\ - struct vm_area_struct *___vma = __vma; \ - unsigned long ___address = __address; \ - __young = ptep_clear_flush_young(___vma, ___address, __ptep); \ - __young |= mmu_notifier_clear_flush_young(___vma-vm_mm,\ - ___address); \ - __young;\ -}) - #else /* CONFIG_MMU_NOTIFIER */ static inline void mmu_notifier_release(struct mm_struct *mm) @@ -221,9 +200,6 @@ { } -#define ptep_clear_flush_young_notify ptep_clear_flush_young -#define ptep_clear_flush_notify ptep_clear_flush - #endif /* CONFIG_MMU_NOTIFIER */ #endif /* _LINUX_MMU_NOTIFIER_H */ diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -194,11 +194,13 @@ if (pte) { /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); - pteval = ptep_clear_flush_notify(vma, address, pte); + pteval = ptep_clear_flush(vma, address, pte); page_remove_rmap(page, vma); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); + /* must invalidate_page _before_ freeing the page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(page); } } diff --git a/mm/memory.c b/mm/memory.c --- a/mm/memory.c +++ b/mm/memory.c @@ -1627,9 +1627,10 @@ */ page_table = pte_offset_map_lock(mm, pmd, address, ptl); - page_cache_release(old_page); + new_page = NULL; if (!pte_same(*page_table, orig_pte)) goto unlock; + page_cache_release(old_page); page_mkwrite = 1; } @@ -1645,6 +1646,7 @@ if (ptep_set_access_flags(vma, address, page_table, entry,1)) update_mmu_cache(vma, address, entry); ret |= VM_FAULT_WRITE; + old_page = new_page = NULL; goto unlock; } @@ -1689,7 +1691,7 @@ * seen in the presence of one thread doing SMC and another * thread doing COW. */ - ptep_clear_flush_notify(vma, address, page_table); + ptep_clear_flush(vma, address, page_table); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); lru_cache_add_active(new_page); @@ -1701,12 +1703,18 @@ } else mem_cgroup_uncharge_page(new_page); - if (new_page) +unlock: + pte_unmap_unlock(page_table, ptl); + + if (new_page) { + if (new_page == old_page) + /* cow happened, notify before releasing old_page */ + mmu_notifier_invalidate_page(mm, address); page_cache_release(new_page); + } if (old_page) page_cache_release(old_page); -unlock: - pte_unmap_unlock(page_table, ptl); + if (dirty_page) { if (vma-vm_file) file_update_time(vma-vm_file); diff --git
[kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
# HG changeset patch # User Andrea Arcangeli [EMAIL PROTECTED] # Date 1208872187 -7200 # Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93 # Parent bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2 Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock conversion. Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED] diff --git a/include/linux/mm.h b/include/linux/mm.h --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1062,10 +1062,10 @@ * mm_lock and mm_unlock are expensive operations that may take a long time. */ struct mm_lock_data { - spinlock_t **i_mmap_locks; - spinlock_t **anon_vma_locks; - size_t nr_i_mmap_locks; - size_t nr_anon_vma_locks; + struct rw_semaphore **i_mmap_sems; + struct rw_semaphore **anon_vma_sems; + size_t nr_i_mmap_sems; + size_t nr_anon_vma_sems; }; extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data); extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data); diff --git a/mm/mmap.c b/mm/mmap.c --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2243,8 +2243,8 @@ static int mm_lock_cmp(const void *a, const void *b) { cond_resched(); - if ((unsigned long)*(spinlock_t **)a - (unsigned long)*(spinlock_t **)b) + if ((unsigned long)*(struct rw_semaphore **)a + (unsigned long)*(struct rw_semaphore **)b) return -1; else if (a == b) return 0; @@ -2252,7 +2252,7 @@ return 1; } -static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks, +static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems, int anon) { struct vm_area_struct *vma; @@ -2261,59 +2261,59 @@ for (vma = mm-mmap; vma; vma = vma-vm_next) { if (anon) { if (vma-anon_vma) - locks[i++] = vma-anon_vma-lock; + sems[i++] = vma-anon_vma-sem; } else { if (vma-vm_file vma-vm_file-f_mapping) - locks[i++] = vma-vm_file-f_mapping-i_mmap_lock; + sems[i++] = vma-vm_file-f_mapping-i_mmap_sem; } } if (!i) goto out; - sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL); + sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL); 
out: return i; } static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 1); + return mm_lock_sort(mm, sems, 1); } static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm, - spinlock_t **locks) + struct rw_semaphore **sems) { - return mm_lock_sort(mm, locks, 0); + return mm_lock_sort(mm, sems, 0); } -static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock) +static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock) { - spinlock_t *last = NULL; + struct rw_semaphore *last = NULL; size_t i; for (i = 0; i nr; i++) /* Multiple vmas may use the same lock. */ - if (locks[i] != last) { - BUG_ON((unsigned long) last (unsigned long) locks[i]); - last = locks[i]; + if (sems[i] != last) { + BUG_ON((unsigned long) last (unsigned long) sems[i]); + last = sems[i]; if (lock) - spin_lock(last); + down_write(last); else - spin_unlock(last); + up_write(last); } } -static inline void __mm_lock(spinlock_t **locks, size_t nr) +static inline void __mm_lock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 1); + mm_lock_unlock(sems, nr, 1); } -static inline void __mm_unlock(spinlock_t **locks, size_t nr) +static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr) { - mm_lock_unlock(locks, nr, 0); + mm_lock_unlock(sems, nr, 0); } /* @@ -2325,57 +2325,57 @@ */ int mm_lock(struct mm_struct *mm, struct mm_lock_data *data) { - spinlock_t **anon_vma_locks, **i_mmap_locks; + struct rw_semaphore **anon_vma_sems, **i_mmap_sems; down_write(mm-mmap_sem); if (mm-map_count) { - anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm-map_count); - if (unlikely(!anon_vma_locks)) { + anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm-map_count); +
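Editorial sketch, not part of the original patch (which is cut off above): the sort-then-lock pattern used by mm_lock_sort() and mm_lock_unlock(), modeled in user space. pthread mutexes stand in for the kernel locks, qsort() for the kernel sort(), and the function names are hypothetical. Unlike the posted mm_lock_cmp(), this comparator also tests equality on the dereferenced pointers; that is a choice of the sketch, not a claim about the final kernel code.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Order array slots by the address of the lock each slot points to. */
static int lock_ptr_cmp(const void *a, const void *b)
{
	uintptr_t la = (uintptr_t)*(pthread_mutex_t *const *)a;
	uintptr_t lb = (uintptr_t)*(pthread_mutex_t *const *)b;

	return (la > lb) - (la < lb);
}

/*
 * Sort the array by lock address, then take each distinct lock once.
 * Duplicates end up adjacent after sorting, so comparing against the
 * previously taken lock is enough to skip them.
 * Returns the number of locks actually taken.
 */
static size_t lock_all_sorted(pthread_mutex_t **locks, size_t nr)
{
	pthread_mutex_t *last = NULL;
	size_t i, taken = 0;

	qsort(locks, nr, sizeof(*locks), lock_ptr_cmp);
	for (i = 0; i < nr; i++) {
		if (locks[i] == last)
			continue;	/* several entries share this lock */
		last = locks[i];
		pthread_mutex_lock(last);
		taken++;
	}
	return taken;
}

/* Release what lock_all_sorted() took; the array must still be sorted. */
static size_t unlock_all_sorted(pthread_mutex_t **locks, size_t nr)
{
	pthread_mutex_t *last = NULL;
	size_t i, released = 0;

	for (i = 0; i < nr; i++) {
		if (locks[i] == last)
			continue;
		last = locks[i];
		pthread_mutex_unlock(last);
		released++;
	}
	return released;
}
```

Because every caller acquires in ascending address order, two callers that need overlapping sets of locks cannot each hold a lock the other is waiting for.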
[kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks taken for reverse map scanning would require taking sleeping locks in free_pgtables() and we cannot sleep while gathering pages for a tlb flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done for each vma. This may add a number of tlb flushes depending on the number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
 		void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-		unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+		unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
 	} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-		unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+		unsigned long ceiling)
 {
+	struct mmu_gather *tlb;
+
 	while (vma) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
 		unlink_file_vma(vma);
 
 		if (is_vm_hugetlb_page(vma)) {
-			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		} else {
 			/*
@@ -299,9 +302,11 @@
 				anon_vma_unlink(vma);
 				unlink_file_vma(vma);
 			}
-			free_pgd_range(tlb, addr, vma->vm_end,
+			tlb = tlb_gather_mmu(vma->vm_mm, 0);
+			free_pgd_range(&tlb, addr, vma->vm_end,
 				floor, next? next->vm_start: ceiling);
 		}
+		tlb_finish_mmu(tlb, addr, vma->vm_end);
 		vma = next;
 	}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1752,9 +1752,9 @@
 	update_hiwater_rss(mm);
 	unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+	tlb_finish_mmu(tlb, start, end);
+	free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 				 next? next->vm_start: 0);
-	tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2050,8 +2050,8 @@
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
-	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
 	tlb_finish_mmu(tlb, 0, end);
+	free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
[kvm-devel] [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID e847039ee2e815088661933b7195584847dc7540
# Parent 128d705f38c8a774ac11559db445787ce6e91c77
This patch adds a lock ordering rule to avoid a potential deadlock when multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  ->i_mutex			(generic_file_buffered_write)
  *    ->mmap_sem		(fault_in_pages_readable->do_page_fault)
+ *
+ *    When taking multiple mmap_sems, one should lock the lowest-addressed
+ *    one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *    ->i_alloc_sem             (various)
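Editorial sketch, not part of the original patch: the ordering rule for the two-lock case, in user space. pthread mutexes stand in for mmap_sem and the helper names are hypothetical; the sketch only shows the ordering, not reader/writer semantics.

```c
#include <pthread.h>
#include <stdint.h>

/* Put the lower-addressed lock in *first and the higher one in *second. */
static void order_by_address(pthread_mutex_t *a, pthread_mutex_t *b,
			     pthread_mutex_t **first, pthread_mutex_t **second)
{
	if ((uintptr_t)a <= (uintptr_t)b) {
		*first = a;
		*second = b;
	} else {
		*first = b;
		*second = a;
	}
}

/* Take both locks, lowest address first; a == b is locked only once. */
static void lock_two(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_t *first, *second;

	order_by_address(a, b, &first, &second);
	pthread_mutex_lock(first);
	if (second != first)
		pthread_mutex_lock(second);
}

/* Release order does not matter for deadlock avoidance. */
static void unlock_two(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(a);
	if (b != a)
		pthread_mutex_unlock(b);
}
```

Two threads calling lock_two(x, y) and lock_two(y, x) both end up taking the same lock first, so neither can hold one lock while waiting forever on the other.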
[kvm-devel] [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID 128d705f38c8a774ac11559db445787ce6e91c77
# Parent f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
XPMEM would have used sys_madvise() except that madvise_dontneed() returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages XPMEM imports from other partitions and is also true for uncached pages allocated locally via the mspec allocator. XPMEM needs zap_page_range() functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -909,6 +909,7 @@
 
 	return unmap_vmas(vma, address, end, &nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.
[kvm-devel] [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
# Parent ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0

Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-		struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,6 @@
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -816,20 +815,13 @@
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts. This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns. So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-		struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *details)
 {
@@ -838,9 +830,14 @@
 	int tlb_start_valid = 0;
 	unsigned long start = start_addr;
 	spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-	int fullmm = (*tlbp)->fullmm;
+	int fullmm;
+	struct mmu_gather *tlb;
 	struct mm_struct *mm = vma->vm_mm;
 
+	lru_add_drain();
+	tlb = tlb_gather_mmu(mm, 0);
+	update_hiwater_rss(mm);
+	fullmm = tlb->fullmm;
 	mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
 	for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
 		unsigned long end;
@@ -867,7 +864,7 @@
 					(HPAGE_SIZE / PAGE_SIZE);
 				start = end;
 			} else
-				start = unmap_page_range(*tlbp, vma,
+				start = unmap_page_range(tlb, vma,
 						start, end, &zap_work, details);
 
 			if (zap_work > 0) {
@@ -875,22 +872,23 @@
 				break;
 			}
 
-			tlb_finish_mmu(*tlbp, tlb_start, start);
+			tlb_finish_mmu(tlb, tlb_start, start);
 
 			if (need_resched() ||
 				(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
 				if (i_mmap_lock) {
-					*tlbp = NULL;
+					tlb = NULL;
 					goto out;
 				}
 				cond_resched();
 			}
 
-			*tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+			tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
 			tlb_start_valid = 0;
 			zap_work = ZAP_BLOCK_SIZE;
 		}
 	}
+	tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
 	mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
 	return start;	/* which is now the end (or restart) address */
@@ -906,18 +904,10 @@
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
 		unsigned long size, struct zap_details *details)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	struct mmu_gather *tlb;
 	unsigned long end = address + size;
 	unsigned long nr_accounted = 0;
 
-	lru_add_drain();
-	tlb = tlb_gather_mmu(mm, 0);
-
[kvm-devel] [PATCH 00 of 12] mmu notifier #v13
Hello,

This is the latest and greatest version of the mmu notifier patch, #v13.

Changes are mainly in the mm_lock, which now uses sort() as suggested by
Christoph. This reduces the complexity from O(N**2) to O(N*log(N)).

I folded the mm_lock functionality together with the mmu-notifier-core 1/12
patch to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP. Lack of mmu notifiers is holding off KVM development. We are going to
rework the way the pages are mapped and unmapped to work with pure pfn for pci
passthrough without the use of page pinning, and we can't without mmu
notifiers. This is not just a performance matter.

KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch
that shall be shipped with 2.6.26. The risk of breakage by applying 1/12 is
zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed
further.

XPMEM support comes with the later patches 2-12. The risk for those patches is
>0, and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some
are simple and can go in immediately but not all are so simple. 2-12/12 are
posted as usual for review by the VM developers and so Robin can keep testing
them on XPMEM, and they can be merged later without any downside (they're
mostly orthogonal with 1/12).
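To make the O(N**2) versus O(N*log(N)) remark concrete, here is a rough sketch, assuming the job is to visit each distinct lock exactly once when the same lock may appear many times in the input. Once the addresses are sorted, duplicates sit next to each other, so a single linear pass replaces the pairwise comparison. The names (cmp_ptr, toy_sort_unique) are made up for illustration; this is not the mm_lock code from 1/12.

```c
#include <stdint.h>
#include <stdlib.h>

/* Order two pointers by address (hypothetical helper, not from the patch). */
static int cmp_ptr(const void *a, const void *b)
{
    uintptr_t x = (uintptr_t)*(void *const *)a;
    uintptr_t y = (uintptr_t)*(void *const *)b;
    return (x > y) - (x < y);
}

/*
 * Sort the lock addresses, then compact adjacent duplicates in place.
 * Returns the number of distinct addresses left at the front of the array.
 * Cost is dominated by the sort: O(N*log(N)) instead of an O(N**2)
 * "have I seen this one already?" scan.
 */
static size_t toy_sort_unique(void **locks, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    qsort(locks, n, sizeof(locks[0]), cmp_ptr);
    for (size_t i = 1; i < n; i++)
        if (locks[i] != locks[out])
            locks[++out] = locks[i];
    return out + 1;
}
```

Taking the surviving entries in ascending address order also gives every caller the same locking order, which is the usual reason to sort locks in the first place.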
[kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent 3c804dca25b15017b22008647783d6f5f3801fa9

get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
 		if (task->flags & PF_BORROWED_MM)
 			mm = NULL;
 		else
-			atomic_inc(&mm->mm_users);
+			if (!atomic_inc_not_zero(&mm->mm_users))
+				mm = NULL;
 	}
 	task_unlock(task);
 	return mm;
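For readers who have not met the "increment unless zero" idiom, a minimal userspace sketch follows. It uses C11 atomics rather than the kernel's atomic_t, and the type and function names (refcounted, ref_get_not_zero, ref_put) are invented for the example; it only illustrates the semantics the patch relies on.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with an mm_users-style counter. */
typedef struct {
    atomic_int users;
} refcounted;

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the last user has already dropped its reference,
 * i.e. teardown may be in progress and the object must not be used.
 */
static bool ref_get_not_zero(refcounted *r)
{
    int old = atomic_load(&r->users);

    while (old != 0) {
        /* On failure 'old' is refreshed with the current value; retry. */
        if (atomic_compare_exchange_weak(&r->users, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool ref_put(refcounted *r)
{
    return atomic_fetch_sub(&r->users, 1) == 1;
}
```

A plain increment would happily move the count from 0 back to 1 while the teardown path is running, which is exactly the situation the changelog describes.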
[kvm-devel] [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent 6e04df1f4284689b1c46e57a67559abe49ecf292

Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It also increases the possible concurrency.

Rcu is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an rcu critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in, for example, the Aim9
  brk performance test going down by 10-15%.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-	spinlock_t lock;	/* Serialize access to vma list */
+	atomic_t refcount;	/* vmas on the list */
+	struct rw_semaphore sem;/* Serialize access to vma list */
 	struct list_head head;	/* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@
 	kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+	atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+	if (atomic_dec_and_test(&anon_vma->refcount))
+		anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_lock(&anon_vma->lock);
+		down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
 	struct anon_vma *anon_vma = vma->anon_vma;
 	if (anon_vma)
-		spin_unlock(&anon_vma->lock);
+		up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@
 		return;
 
 	/*
-	 * We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+	 * We hold either the mmap_sem lock or a reference on the
+	 * anon_vma. So no need to call page_lock_anon_vma.
 	 */
 	anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-	spin_lock(&anon_vma->lock);
+	down_read(&anon_vma->sem);
 
 	list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
 		remove_migration_pte(vma, old, new);
 
-	spin_unlock(&anon_vma->lock);
+	up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@
 	int rc = 0;
 	int *result = NULL;
 	struct page *newpage = get_new_page(page, private, &result);
-	int rcu_locked = 0;
+	struct anon_vma *anon_vma = NULL;
 	int charge = 0;
 
 	if (!newpage)
@@ -647,16 +648,14 @@
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-	 * we cannot notice that anon_vma is freed while we migrates a page.
+	 * we cannot notice that anon_vma is freed while we migrate a page.
 	 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 	 * of migration. File cache pages are no problem because of page_lock()
 	 * File Caches may use write_page() or lock_page() in migration, then,
 	 * just care Anon page here.
 	 */
-	if (PageAnon(page)) {
-		rcu_read_lock();
-		rcu_locked = 1;
-	}
+	if (PageAnon(page))
+		anon_vma =
[kvm-devel] [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292
# Parent 8965539f4d174c79bd37e58e8b037d5db906e219

The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@
 expand_stack(), it is hard to come up with a destructive scenario without
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks. This is okay, since the kmem code asks for pages after
 dropping c_spinlock. The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
 	if (!vma_shareable(vma, addr))
 		return;
 
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
@@ -94,7 +94,7 @@
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
 	pgoff = offset >> PAGE_SHIFT;
 
 	i_size_write(inode, offset);
-	spin_lock(&mapping->i_mmap_lock);
+	down_read(&mapping->i_mmap_sem);
 	if (!prio_tree_empty(&mapping->i_mmap))
 		hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-	spin_unlock(&mapping->i_mmap_lock);
+	up_read(&mapping->i_mmap_sem);
 	truncate_hugepages(inode, offset);
 	return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
 	rwlock_init(&inode->i_data.tree_lock);
-	spin_lock_init(&inode->i_data.i_mmap_lock);
+	init_rwsem(&inode->i_data.i_mmap_sem);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -503,7 +503,7 @@
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
-	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
+	struct rw_semaphore	i_mmap_sem;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
 	unsigned long		nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@
 	struct address_space *check_mapping;	/* Check page->mapping if set */
 	pgoff_t	first_index;			/* Lowest page->index to unmap */
 	pgoff_t last_index;			/* Highest page->index to unmap */
-	spinlock_t *i_mmap_lock;		/* For unmap_mapping_range: */
+	struct rw_semaphore *i_mmap_sem;	/* For unmap_mapping_range: */
 	unsigned long truncate_count;		/* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,12 +274,12 @@
 				atomic_dec(&inode->i_writecount);
 
 			/* insert tmp into the share list, just after mpnt */
-			spin_lock(&file->f_mapping->i_mmap_lock);
+			down_write(&file->f_mapping->i_mmap_sem);
 			tmp->vm_truncate_count = mpnt->vm_truncate_count;
 			flush_dcache_mmap_lock(file->f_mapping);
 			vma_prio_tree_add(tmp, mpnt);
 			flush_dcache_mmap_unlock(file->f_mapping);
-			spin_unlock(&file->f_mapping->i_mmap_lock);
+
[kvm-devel] [PATCH] Make virtio devices multi-function
This patch changes virtio devices to be multi-function devices whenever
possible. This increases the number of virtio devices we can support now by
a factor of 8.

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

I haven't tested the Windows virtio drivers.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..df3a878 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID		0x00	/* 16 bits */
 #define PCI_DEVICE_ID		0x02	/* 16 bits */
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 9100bb1..9ea14d3 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
     PCIDevice *pci_dev;
     uint8_t *config;
     uint32_t size;
+    static int devfn = 7;
+
+    if ((devfn % 8) == 7)
+	devfn = -1;
+    else
+	devfn++;
 
     pci_dev = pci_register_device(bus, name, struct_size,
-				  -1, NULL, NULL);
+				  devfn, NULL, NULL);
+
+    devfn = pci_dev->devfn;
+
     vdev = to_virtio_device(pci_dev);
 
     vdev->status = 0;
@@ -435,6 +444,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
     config[0x3d] = 1;
 
+    /* Mark device as multi-function */
+    if ((devfn % 8) == 0)
+	config[0x0e] |= 0x80;
+
     vdev->name = name;
     vdev->config_len = config_size;
     if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
     uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index b645fb7..7992a77 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -151,7 +151,7 @@ typedef struct DriveInfo {
 #define MAX_IDE_DEVS	2
 #define MAX_SCSI_DEVS	7
 
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 7dd0094..e203a4d 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8754,7 +8754,7 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
-#define MAX_NET_CLIENTS 32
+#define MAX_NET_CLIENTS 512
 
 static int saved_argc;
 static char **saved_argv;
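The "factor of 8" comes from the PCI devfn encoding: the upper 5 bits select the slot and the lower 3 bits select the function, so each slot can carry functions 0..7. The following is only a toy model of the allocation order the patch produces. toy_bus and toy_register are hypothetical stand-ins for the real bus and pci_register_device(), under the assumption that a negative devfn means "give me function 0 of the next free slot".

```c
#include <stdint.h>

#define SLOT(devfn) (((devfn) >> 3) & 0x1f)   /* upper 5 bits */
#define FUNC(devfn) ((devfn) & 0x07)          /* lower 3 bits */

/* Hypothetical bus model: one "in use" flag per devfn. */
typedef struct {
    uint8_t used[256];
} toy_bus;

/*
 * Stand-in for pci_register_device(): a negative devfn requests function 0
 * of the first free slot; otherwise the requested devfn is taken as is.
 * Returns the devfn actually assigned, or -1 if no slot is free.
 */
static int toy_register(toy_bus *bus, int devfn)
{
    if (devfn < 0) {
        for (devfn = 0; devfn < 256; devfn += 8)
            if (!bus->used[devfn])
                break;
        if (devfn >= 256)
            return -1;
    }
    bus->used[devfn] = 1;
    return devfn;
}

/* Same progression as the patch: fill functions 0..7, then ask for a new slot. */
static int toy_virtio_next(toy_bus *bus, int *last_devfn)
{
    int request = ((*last_devfn % 8) == 7) ? -1 : *last_devfn + 1;
    int got = toy_register(bus, request);

    if (got >= 0)
        *last_devfn = got;
    return got;
}

/* Function 0 advertises the others via bit 7 of the header type (offset 0x0e). */
static void mark_multifunction(uint8_t *config, int devfn)
{
    if (FUNC(devfn) == 0)
        config[0x0e] |= 0x80;
}
```

With slot 0 already occupied, successive calls hand out devfn 8, 9, ... 15 (slot 1, functions 0 to 7) and then move on to devfn 16 (slot 2, function 0).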
[kvm-devel] KVM: PIT: make last_injected_time per-guest
Otherwise multiple guests use the same variable and boom.

Also use kvm_vcpu_kick() to make sure that if a timer triggers on
a different CPU the event won't be missed.

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]
Tested-and-Acked-by: Alex Davis [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 2852dd1..5697ad2 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)
 
 	atomic_inc(&pt->pending);
 	smp_mb__after_atomic_inc();
-	if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
-		vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		wake_up_interruptible(&vcpu0->wq);
-	}
+	if (vcpu0)
+		kvm_vcpu_kick(vcpu0);
 
 	pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
 	pt->scheduled = ktime_to_ns(pt->timer.expires);
@@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 	struct kvm_pit *pit = vcpu->kvm->arch.vpit;
 	struct kvm *kvm = vcpu->kvm;
 	struct kvm_kpit_state *ps;
-	static unsigned long last_injected_time;
 
 	if (vcpu && pit) {
 		ps = &pit->pit_state;
@@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 		 * 2. Last interrupt was accepted or waited for too long time*/
 		if (atomic_read(&ps->pit_timer.pending) &&
 		    (ps->inject_pending ||
-		    (jiffies - last_injected_time
+		    (jiffies - ps->last_injected_time
 				>= KVM_MAX_PIT_INTR_INTERVAL))) {
 			ps->inject_pending = 0;
 			__inject_pit_timer_intr(kvm);
-			last_injected_time = jiffies;
+			ps->last_injected_time = jiffies;
 		}
 	}
 }
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index e63ef38..db25c2a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -35,6 +35,7 @@ struct kvm_kpit_state {
 	struct mutex lock;
 	struct kvm_pit *pit;
 	bool inject_pending; /* if inject pending interrupts */
+	unsigned long last_injected_time;
 };
 
 struct kvm_pit {
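The "same variable and boom" problem is easy to reproduce outside the kernel. Below is a small sketch with made-up names (toy_pit_state, TOY_MAX_INTERVAL, should_inject_*) that contrasts a function-static timestamp, shared by every caller, with a timestamp stored in per-guest state as the patch does.

```c
/* Hypothetical miniature of the bug; none of these names are from KVM. */
struct toy_pit_state {
    unsigned long last_injected_time;   /* per-guest, as in the fix */
};

#define TOY_MAX_INTERVAL 10UL

/* Buggy variant: a single static timestamp is shared by all guests. */
static int should_inject_shared(unsigned long now)
{
    static unsigned long last_injected_time;

    if (now - last_injected_time >= TOY_MAX_INTERVAL) {
        last_injected_time = now;
        return 1;
    }
    return 0;
}

/* Fixed variant: each guest carries its own timestamp. */
static int should_inject_per_guest(struct toy_pit_state *ps, unsigned long now)
{
    if (now - ps->last_injected_time >= TOY_MAX_INTERVAL) {
        ps->last_injected_time = now;
        return 1;
    }
    return 0;
}
```

If guest A calls the shared variant at time 100 and guest B calls it at time 101, B is refused because A just updated the one and only timestamp. With per-guest state both calls succeed.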
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> This patch changes virtio devices to be multi-function devices whenever
> possible. This increases the number of virtio devices we can support now by
> a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks or 220
> network adapters.

Does this play well with hotplug? Perhaps we need to allocate a new
device on hotplug.

(certainly if we have a device with one function, which then gets
converted to a multifunction device)

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
>> The virtio config space was originally chosen to be little-endian,
>> because we thought the config might be part of the PCI config space
>> for virtio_pci. It's actually a separate mmio region, so that
>> argument holds little water; as only x86 is currently using the virtio
>> mechanism, we can change this (but must do so now, before the
>> impending s390 and ppc merges).
>
> This will probably annoy Hollis which has guests that can go both ways.

Yes, I discussed this with Hollis. But the virtio rings themselves already
have this issue: we don't do any endian conversion on them and assume they're
our endian in the guest.

We may still regret not doing *everything* little-endian, but this doesn't
make it worse.

Thanks,
Rusty.
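As a small illustration of the two conventions being discussed (not code from the virtio patch; read_le32 and read_native32 are invented names), this is the difference between decoding a config field that is defined as little-endian and reading it in whatever byte order the guest itself uses:

```c
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit field stored little-endian, whatever the reader's byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read the same field in the reader's native ("guest endian") byte order. */
static uint32_t read_native32(const uint8_t *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

On a little-endian guest the two return the same value. On a big-endian guest they differ, which is why the choice is an ABI decision that has to be made before the s390 and ppc ports depend on it.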
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 17:44:08 Christian Borntraeger wrote:
> On Tuesday, 22 April 2008, Rusty Russell wrote:
>> [Christian, Hollis, how much is this ABI breakage going to hurt you?]
>
> It is ok for s390 at the moment. We are still working on making userspace
> ready and I plan to change the guest-host for s390 anyway. I try to make
> these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26.
> The main reason is, that we are currently limited to around 80 devices.
> I am not sure, if I should change the allocation of the virtqueues and
> descriptors to guest memory as well.

Large rings require contiguous memory, which makes guest allocation
problematic. 512 elems at 4k pages == 5 pages.

> Back to your patch:
> I have still some ideas about virtio between little endian and big endian
> systems, but it requires more and different marshalling anyway - even on
> driver level. No idea yet how to solve that properly.

So far we've pushed such considerations onto the host. This does mean that
you can't virtio connect two guests directly without understanding the
contents of the buffers so you can endian correct (eg. direct inter-guest
networking).

inter-guest virtio is currently a party trick anyway, so I'm not sure it's a
real issue.

>> +	vb->vdev->config->get(vb->vdev,
>> +			      offsetof(struct virtio_balloon_config, num_pages),
>> +			      &v);
>
> this is missing a sizeof(v), no?

Ah... sure enough, I fixed that in a followon patch. Well-spotted, thanks!

Cheers,
Rusty.
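The "512 elems at 4k pages == 5 pages" figure can be checked with a little arithmetic. The sketch below assumes the usual split-ring layout (16-byte descriptors, an available ring of 16-bit entries with a two-field header, and a page-aligned used ring of 8-byte entries with a two-field header); the function names are made up and the sizes are assumptions of this example rather than something stated in the thread.

```c
#include <stddef.h>

/* Assumed element sizes, modelled on the split-ring layout. */
enum { DESC_BYTES = 16, U16_BYTES = 2, USED_ELEM_BYTES = 8 };

/* Total bytes needed for a ring of 'num' entries with the given page size. */
static size_t toy_ring_bytes(size_t num, size_t page)
{
    /* descriptor table + available ring (flags, idx, then num entries) */
    size_t first = DESC_BYTES * num + U16_BYTES * (2 + num);

    /* the used ring starts on the next page boundary */
    first = (first + page - 1) & ~(page - 1);

    /* used ring: flags, idx, then num (id, len) pairs */
    return first + U16_BYTES * 2 + USED_ELEM_BYTES * num;
}

/* Same, rounded up to whole pages. */
static size_t toy_ring_pages(size_t num, size_t page)
{
    return (toy_ring_bytes(num, page) + page - 1) / page;
}
```

For num = 512 and 4096-byte pages: 8192 + 1028 = 9220 bytes round up to 12288, plus 4 + 4096 for the used ring gives 16388 bytes, which is just over four pages and therefore needs five contiguous ones.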
Re: [kvm-devel] KVM: PIT: make last_injected_time per-guest
Marcelo Tosatti wrote:
> Otherwise multiple guests use the same variable and boom.
>
> Also use kvm_vcpu_kick() to make sure that if a timer triggers on
> a different CPU the event won't be missed.

Applied, thanks.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch changes virtio devices to be multi-function devices whenever
>> possible. This increases the number of virtio devices we can support now by
>> a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or 220
>> network adapters.
>
> Does this play well with hotplug? Perhaps we need to allocate a new
> device on hotplug.

Probably not. I imagine you can only hotplug devices, not individual functions?

Regards,

Anthony Liguori

> (certainly if we have a device with one function, which then gets
> converted to a multifunction device)
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> This patch changes virtio devices to be multi-function devices whenever
>>> possible. This increases the number of virtio devices we can support now by
>>> a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220 disks or 220
>>> network adapters.
>>
>> Does this play well with hotplug? Perhaps we need to allocate a new
>> device on hotplug.
>
> Probably not. I imagine you can only hotplug devices, not individual functions?

It sounds reasonable to expect so. ACPI has objects for devices, not
functions (IIRC).

Maybe require explicit device/function assignment on the command line?
It will be managed anyway.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
> Avi Kivity wrote:
>>> And video streaming on some embedded devices with no MMU! (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>
>> Could use threads as well, no?
>
> Perhaps. This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?

There's no guarantee that any sort of order will be preserved by AIO
requests. The same is true with writes. This is what fdsync is for, to
guarantee ordering.

Regards,

Anthony Liguori
Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting
On Tuesday 22 April 2008 at 08:50 -0500, Anthony Liguori wrote:
> Nguyen Anh Quynh wrote:
>> Hi,
>>
>> This should be submitted to upstream (but not to kvm-devel list), but
>> this is only the test code that I want to quickly send out for
>> comments. In case it looks OK, I will send it to upstream later.
>>
>> Inspired by extboot and conversations with Anthony and HPA, this
>> linuxboot option ROM is a simple option ROM that intercepts int19 in
>> order to execute linux setup code. This approach eliminates the need
>> to manipulate the boot sector for this purpose.
>>
>> To test it, just load linux kernel with your KVM/QEMU image using
>> -kernel option in normal way.
>>
>> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
>> Ubuntu 8.04.
>
> For the next rounds, could you actually rebase against upstream QEMU and
> submit to qemu-devel? One of Paul Brook's objections to extboot had
> historically been that it was not easily sharable with other
> architectures. With a C version, it seems more reasonable now to do that.

Moreover, add a binary version of the ROM to the pc-bios directory: it
avoids the need for a cross-compiler to build the ROM on non-x86
architectures.

Regards,
Laurent

> Make sure you remove all the old linux boot code too within QEMU along
> with the -hda checks.
>
> Regards,
>
> Anthony Liguori
>
>> Thanks,
>> Quynh
>>
>> # diffstat linuxboot1.diff
>>  Makefile             |   13 -
>>  linuxboot/Makefile   |   40 +++
>>  linuxboot/boot.S     |   54 +
>>  linuxboot/farvar.h   |  130 +++
>>  linuxboot/rom.c      |  104
>>  linuxboot/signrom    |binary
>>  linuxboot/signrom.c  |  128 ++
>>  linuxboot/util.h     |   69 +++
>>  qemu/Makefile        |    3 -
>>  qemu/Makefile.target |    2
>>  qemu/hw/linuxboot.c  |   39 +++
>>  qemu/hw/pc.c         |   22 +++-
>>  qemu/hw/pc.h         |    5 +
>>  13 files changed, 600 insertions(+), 9 deletions(-)

--
- [EMAIL PROTECTED] ---
The best way to predict the future is to invent it. - Alan Kay
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Jamie Lokier wrote:
> Avi Kivity wrote:
>>> And video streaming on some embedded devices with no MMU! (Due to the
>>> page cache heuristics working poorly with no MMU, sustained reliable
>>> streaming is managed with O_DIRECT and the app managing cache itself
>>> (like a database), and that needs AIO to keep the request queue busy.
>>> At least, that's the theory.)
>>
>> Could use threads as well, no?
>
> Perhaps. This raises another point about AIO vs. threads:
>
> If I submit sequential O_DIRECT reads with aio_read(), will they enter
> the device read queue in the same order, and reach the disk in that
> order (allowing for reordering when worthwhile by the elevator)?

Yes, unless the implementation in the kernel (or glibc) is threaded.

> With threads this isn't guaranteed and scheduling makes it quite
> likely to issue the parallel synchronous reads out of order, and for
> them to reach the disk out of order because the elevator doesn't see
> them simultaneously.

If the disk is busy, it doesn't matter. The requests will queue and the
elevator will sort them out. So it's just the first few requests that
may get to disk out of order.

> With AIO (non-Glibc! (and non-kthreads)) it might be better at
> keeping the intended issue order, I'm not sure.
>
> It is highly desirable: O_DIRECT streaming performance depends on
> avoiding seeks (no reordering) and on keeping the request queue
> non-empty (no gap).
>
> I read a man page for some other unix, describing AIO as better than
> threaded parallel reads for reading tape drives because of this (tape
> seeks are very expensive). But the rest of the man page didn't say
> anything more. Unfortunately I don't remember where I read it. I
> have no idea whether AIO submission order is nearly always preserved
> in general, or expected to be.

I haven't considered tape, but this is a good point indeed. I expect it
doesn't make much of a difference for a loaded disk.

>> It's me at fault here. I just assumed that because it's easy to do aio
>> in a thread pool efficiently, that's what glibc does.
>>
>> Unfortunately the code does some ridiculous things like not service
>> multiple requests on a single fd in parallel. I see absolutely no
>> reason for it (the code says fight for resources).
>
> Ouch. Perhaps that relates to my thought above, about multiple
> requests to the same file causing seek storms when thread scheduling
> is unlucky?

My first thought on seeing this is that it relates to a deficiency on
older kernels servicing multiple requests on a single fd (i.e. a
per-file lock).

I don't know if such a deficiency ever existed, though.

>> It could and should. It probably doesn't.
>>
>> A simple thread pool implementation could come within 10% of Linux aio
>> for most workloads. It will never be exactly, but for small numbers of
>> disks, close enough.
>
> I would wait for benchmark results for I/O patterns like sequential
> reading and writing, because of potential for seeks caused by request
> reordering, before being confident of that.

I did have measurements (and a test rig) at a previous job (where I did
a lot of I/O work); IIRC the performance of a tuned thread pool was not
far behind aio, both for seeks and sequential. It was a while back though.

--
error compiling committee.c: too many arguments to function
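To picture the "requests will queue and the elevator will sort them out" argument, here is a deliberately simplified model, assuming an elevator that just reorders whatever is currently queued by ascending offset. Real I/O schedulers do much more; struct toy_req and the helpers are hypothetical names used only for this sketch.

```c
#include <stdlib.h>

/* Hypothetical pending read request: only the fields the sketch needs. */
struct toy_req {
    long long offset;   /* byte offset on disk */
    int submit_seq;     /* order in which the application issued it */
};

static int by_offset(const void *a, const void *b)
{
    const struct toy_req *x = a, *y = b;

    return (x->offset > y->offset) - (x->offset < y->offset);
}

/* Reorder everything that is queued right now into ascending offset order. */
static void toy_elevator_sort(struct toy_req *queue, size_t n)
{
    qsort(queue, n, sizeof(queue[0]), by_offset);
}

/* Count backward seeks: adjacent requests whose offsets go down. */
static size_t toy_backward_seeks(const struct toy_req *queue, size_t n)
{
    size_t seeks = 0;

    for (size_t i = 1; i < n; i++)
        if (queue[i].offset < queue[i - 1].offset)
            seeks++;
    return seeks;
}
```

If threads deliver four sequential reads in the order 8192, 0, 12288, 4096, a busy disk that still has all four queued can service them as 0, 4096, 8192, 12288. Only requests that are dispatched before their neighbours arrive, which is the "first few requests" case mentioned above, escape the reordering.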
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Nguyen Anh Quynh wrote:
> Hi,
>
> This should be submitted to upstream (but not to kvm-devel list), but this is only the test code that I want to quickly send out for comments. In case it looks OK, I will send it to upstream later.
>
> Inspired by extboot and conversations with Anthony and HPA, this linuxboot option ROM is a simple option ROM that intercepts int19 in order to execute linux setup code. This approach eliminates the need to manipulate the boot sector for this purpose.
>
> To test it, just load linux kernel with your KVM/QEMU image using -kernel option in normal way.
>
> I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest Ubuntu 8.04.

For the next rounds, could you actually rebase against upstream QEMU and submit to qemu-devel? One of Paul Brook's objections to extboot had historically been that it wasn't easily sharable with other architectures. With a C version, it seems more reasonable now to do that. Make sure you remove all the old linux boot code too within QEMU along with the -hda checks.

Regards,
Anthony Liguori

> Thanks,
> Quynh
>
> # diffstat linuxboot1.diff
>  Makefile             |   13 -
>  linuxboot/Makefile   |   40 +++
>  linuxboot/boot.S     |   54 +
>  linuxboot/farvar.h   |  130 +++
>  linuxboot/rom.c      |  104
>  linuxboot/signrom    |binary
>  linuxboot/signrom.c  |  128 ++
>  linuxboot/util.h     |   69 +++
>  qemu/Makefile        |    3 -
>  qemu/Makefile.target |    2
>  qemu/hw/linuxboot.c  |   39 +++
>  qemu/hw/pc.c         |   22 +++-
>  qemu/hw/pc.h         |    5 +
>  13 files changed, 600 insertions(+), 9 deletions(-)
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:
>> If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?
> There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee. With a thread pool if the threads are scheduled out of order, so are your requests. If the elevator doesn't plug the queue, the first few requests may not be optimally sorted.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 4:15 PM, Anthony Liguori [EMAIL PROTECTED] wrote:
> This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8.
[...]
> diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
> index 9100bb1..9ea14d3 100644
> --- a/qemu/hw/virtio.c
> +++ b/qemu/hw/virtio.c
> @@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
>      PCIDevice *pci_dev;
>      uint8_t *config;
>      uint32_t size;
> +    static int devfn = 7;
> +
> +    if ((devfn % 8) == 7)
> +        devfn = -1;
> +    else
> +        devfn++;

This code looks strange... devfn should be passed to virtio_init_pci by the virtio-{net,blk} init functions, no?

Luca
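Editorial sketch, not part of the patch: the factor of 8 follows from how PCI packs a slot number and a function number into a single devfn byte, five bits of slot and three bits of function. The snippet below illustrates that encoding and, in the spirit of Luca's remark, keeps the allocation state in a structure owned by the caller instead of a function-local static. The macro names and the devfn_alloc helper are hypothetical and are not QEMU's API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* devfn layout: bits 7..3 = slot (0-31), bits 2..0 = function (0-7). */
#define DEVFN(slot, fn)   ((uint8_t)((((slot) & 0x1f) << 3) | ((fn) & 0x07)))
#define DEVFN_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn) ((devfn) & 0x07)

/* Hypothetical caller-owned allocator state: the next devfn to hand out. */
struct devfn_alloc {
    int next;
};

/* Hands out functions 0..7 of one slot before moving to the next slot.
 * Returns -1 once all 32 slots x 8 functions have been used. */
static int devfn_alloc_next(struct devfn_alloc *a)
{
    assert(a != NULL);
    if (a->next > 0xff)
        return -1;
    return a->next++;
}
```

A device init function would then receive the value returned by devfn_alloc_next() as an explicit argument, which makes the slot/function choice visible at the call site.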
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Anthony Liguori wrote:
>> Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?
> There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.

You misunderstand. I'm not talking about guarantees, I'm talking about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two things:

1. Overlap at least 2 requests, so the device is kept busy.
2. Requests be sent to the disk in a good order, which is usually (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead. It works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the kernel even _sees_ the first two calls in sequential offset order. You spawn two threads, and then both threads call pread() with non-deterministic scheduling. The problem starts before even entering the kernel.

Then, depending on I/O scheduling in the kernel, it might send the less good pread() to the disk immediately, then later a backward head seek and the other one. The elevator cannot fix this: it doesn't have enough information, unless it adds artificial delays. But artificial delays may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order provided there's no scheduling conflicts, but are easily disrupted by other tasks, especially on SMP (one reading thread per CPU, so when one of them is descheduled, the other continues and issues a request in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can be sure the kernel receives aio_read() calls in the exact order which is most likely to perform well. Application knowledge of its access pattern is passed along better.

As I've said, I saw a man page which described why this makes AIO superior to using threads for reading tapes on that OS. So it's not a completely spurious point.

This has nothing to do with guarantees.

-- Jamie
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
* Anthony Liguori [EMAIL PROTECTED] [2008-04-22 09:16]:
> This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8.
>
> With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.

Have you confirmed that the network devices show up? I was playing around with some of the limits last night and while it is easy to get QEMU to create the adapters, so far I've only had a guest see 29 pci nics (e1000).

--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253 T/L: 678-9253
[EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
> Andrea Arcangeli wrote:
>> +
>> +static int mm_lock_cmp(const void *a, const void *b)
>> +{
>> +    cond_resched();
>> +    if ((unsigned long)*(spinlock_t **)a <
>> +        (unsigned long)*(spinlock_t **)b)
>> +        return -1;
>> +    else if (a == b)
>> +        return 0;
>> +    else
>> +        return 1;
>> +}
>> +
> This compare function looks unusual... It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

> static int mm_lock_cmp(const void *a, const void *b)
> {
>     unsigned long la = (unsigned long)*(spinlock_t **)a;
>     unsigned long lb = (unsigned long)*(spinlock_t **)b;
>
>     cond_resched();
>     if (la < lb)
>         return -1;
>     if (la > lb)
>         return 1;
>     return 0;
> }

If your intent is to use the assumption that there are going to be few equal entries, you should have used likely(la > lb) to signal it's rarely going to return zero or gcc is likely free to do whatever it wants with the above. Overall that function is such a slow path that this is going to be lost in the noise. My suggestion would be to defer microoptimizations like this until after 1/12 has been applied to mainline.

Thanks!
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Ryan Harper wrote:
> * Anthony Liguori [EMAIL PROTECTED] [2008-04-22 09:16]:
>> This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.
> Have you confirmed that the network devices show up? I was playing around with some of the limits last night and while it is easy to get QEMU to create the adapters, so far I've only had a guest see 29 pci nics (e1000).

Yup, I had an eth219

Regards,
Anthony Liguori
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
> Anthony Liguori wrote:
>> This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8.
>>
>> With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.
> Does this play well with hotplug? Perhaps we need to allocate a new device on hotplug. (certainly if we have a device with one function, which then gets converted to a multifunction device)

Would have to change the hotplug code to handle functions...

It sounds less hacky to just extend the PCI slots instead of (ab)using multiple functions per-slot.
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
> Anthony Liguori wrote:
>>> If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?
>> There's no guarantee that any sort of order will be preserved by AIO requests. The same is true with writes. This is what fdsync is for, to guarantee ordering.
> I believe he'd like a hint to get good scheduling, not a guarantee. With a thread pool if the threads are scheduled out of order, so are your requests. If the elevator doesn't plug the queue, the first few requests may not be optimally sorted.

That's right. Then they tend to settle to a good order. But any delay in scheduling one of the threads, or a signal received by one of them, can make it lose order briefly, making the streaming stutter as the disk performs a few local seeks until it settles to good order again.

You can mitigate the disruption in various ways.

1. If all threads share an offset variable, and each reads and increments that atomically just prior to calling pread(), that helps especially at the start. (If threaded I/O is used for QEMU disk emulation, I would suggest doing that, in the more general form of popping a request from QEMU's internal shared queue at the last moment.)
2. Using more threads helps keep it sustained, at the cost of more wasted I/O when there's a cancellation (changed mind), and more memory.

However, AIO, in principle (if not implementations...) could be better at keeping the suggested I/O order than threads, without special tricks.

-- Jamie
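Editorial sketch of mitigation 1, assuming C11 atomics are available: every reader thread claims the next offset with an atomic fetch-and-add immediately before its pread(). The stream_cursor type and the function names are made up for illustration and are not QEMU code. A thread can still be descheduled between the claim and the pread(), so this narrows the reordering window rather than closing it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state shared by all reader threads of one stream. */
struct stream_cursor {
    atomic_uint_fast64_t next_offset; /* next unread byte offset */
    uint64_t chunk;                   /* bytes per request */
};

/* Atomically claim the next chunk and return its starting offset.
 * Two threads calling this concurrently never get the same offset. */
static uint64_t claim_next_offset(struct stream_cursor *c)
{
    assert(c->chunk > 0);
    return atomic_fetch_add(&c->next_offset, c->chunk);
}

/* Claim at the last moment, then issue the synchronous read.
 * buf must hold at least c->chunk bytes. */
static ssize_t read_next_chunk(int fd, struct stream_cursor *c, void *buf)
{
    uint64_t off = claim_next_offset(c);
    return pread(fd, buf, c->chunk, (off_t)off);
}
```

Each worker would call read_next_chunk() in a loop; the shared cursor plays the role of the shared request queue mentioned above.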
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>> Andrea Arcangeli wrote:
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +    cond_resched();
>>> +    if ((unsigned long)*(spinlock_t **)a <
>>> +        (unsigned long)*(spinlock_t **)b)
>>> +        return -1;
>>> +    else if (a == b)
>>> +        return 0;
>>> +    else
>>> +        return 1;
>>> +}
>>> +
>> This compare function looks unusual... It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

You need to compare *a to *b (at least, that's what you're doing for the < case).

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12
Andrew,

Could we get direction/guidance from you as regards the invalidate_page() callout of Andrea's patch set versus the invalidate_range_start/invalidate_range_end callout pairs of Christoph's patchset? This is only in the context of the __xip_unmap, do_wp_page, page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
>> I am a little confused about the value of the seq_lock versus a simple atomic, but I assumed there is a reason and left it at that.
> There's no value for anything but get_user_pages (get_user_pages takes its own lock internally though). I preferred to explain it as a seqlock because it was simpler for reading, but I totally agree in the final implementation it shouldn't be a seqlock. My code was meant to be pseudo-code only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does need to be atomic.

>> I don't know what you mean by it'd run slower and what you mean by armed and disarmed.
> 1) when armed the time-window where the kvm-page-fault would be blocked would be a bit larger without invalidate_page for no good reason

But that is a distinction without a difference. In the _start/_end case, kvm's fault handler will not have any _DIRECT_ blocking, but get_user_pages() had certainly better block waiting for some other lock to prevent the process's pages being refaulted. I am no VM expert, but that seems like it is critical to having a consistent virtual address space. Effectively, you have a delay on the kvm fault handler beginning when either invalidate_page() is entered or invalidate_range_start() is entered until when the _CALLER_ of the invalidate* method has unlocked. That time will remain essentially identical for either case. I would argue you would be hard pressed to even measure the difference.

> 2) if you were to remove invalidate_page when disarmed the VM code would need two branches instead of one in various places

Those branches are conditional upon there being list entries. That check should be extremely cheap. The vast majority of cases will have no registered notifiers. The second check for the _end callout will be from cpu cache.

> I don't want to waste cycles if not wasting them improves performance both when armed and disarmed.

In summary, I think we have narrowed down the case of no registered notifiers to being infinitesimal. The case of registered notifiers being a distinction without a difference.

>> When I was discussing this difference with Jack, he reminded me that the GRU, due to its hardware, does not have any race issues with the invalidate_page callout simply doing the tlb shootdown and not modifying any of its internal structures. He then put a caveat on the discussion that _either_ method was acceptable as far as he was concerned. The real issue is getting a patch in that satisfies all needs and not whether there is a separate invalidate_page callout.
> Sure, we have that patch now, I'll send it out in a minute, I was just trying to explain why it makes sense to have an invalidate_page too (which remains the only difference by now), removing it would be a regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the invalidate_page separate. In the case of the GRU, the hardware enforces a lifetime of the invalidate which covers all in-progress faults including ones where the hardware is informed after the flush of a PTE. In all cases, once the GRU invalidate instruction is issued, all active requests are invalidated. Future faults will be blocked in get_user_pages(). Without that special feature of the hardware, I don't think any code simplification exists. I, of course, reserve the right to be wrong.

I believe the argument against a separate invalidate_page() callout was Christoph's interpretation of Andrew's comments. I am not certain Andrew was aware of these special aspects of the GRU hardware and whether that had been factored into the discussion at that point in time.

Thanks,
Robin
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
>> Perhaps. This raises another point about AIO vs. threads: If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?
> Yes, unless the implementation in the kernel (or glibc) is threaded.
>> With threads this isn't guaranteed and scheduling makes it quite likely to issue the parallel synchronous reads out of order, and for them to reach the disk out of order because the elevator doesn't see them simultaneously.
> If the disk is busy, it doesn't matter. The requests will queue and the elevator will sort them out. So it's just the first few requests that may get to disk out of order.

There's two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, maximum streaming performance is desired.
2. Disk is busy with unrelated things, but you're using I/O priorities to give the streaming app near-absolute priority. Then you need to maintain overlapped streaming requests, otherwise disk is given to a lower priority I/O. If that happens often, you lose, priority is ineffective. Because one of the streaming requests is usually being serviced, elevator has similar limitations as for a disk which is not busy with anything else.

> I haven't considered tape, but this is a good point indeed. I expect it doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O priority, the elevator has time to sort requests and hide thread scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's elevator, then submits them to the host's elevator. If the guest and host elevators are both configured 'anticipatory', do the anticipatory delays add up?

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
Avi Kivity wrote:
>> And video streaming on some embedded devices with no MMU! (Due to the page cache heuristics working poorly with no MMU, sustained reliable streaming is managed with O_DIRECT and the app managing cache itself (like a database), and that needs AIO to keep the request queue busy. At least, that's the theory.)
> Could use threads as well, no?

Perhaps. This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter the device read queue in the same order, and reach the disk in that order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite likely to issue the parallel synchronous reads out of order, and for them to reach the disk out of order because the elevator doesn't see them simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at keeping the intended issue order, I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on avoiding seeks (no reordering) and on keeping the request queue non-empty (no gap).

I read a man page for some other unix, describing AIO as better than threaded parallel reads for reading tape drives because of this (tape seeks are very expensive). But the rest of the man page didn't say anything more. Unfortunately I don't remember where I read it. I have no idea whether AIO submission order is nearly always preserved in general, or expected to be.

> It's me at fault here. I just assumed that because it's easy to do aio in a thread pool efficiently, that's what glibc does. Unfortunately the code does some ridiculous things like not service multiple requests on a single fd in parallel. I see absolutely no reason for it (the code says fight for resources).

Ouch. Perhaps that relates to my thought above, about multiple requests to the same file causing seek storms when thread scheduling is unlucky?

> So my comments only apply to linux-aio vs a sane thread pool. Sorry for spreading confusion.

Thanks. I thought you'd measured it :-)

> It could and should. It probably doesn't. A simple thread pool implementation could come within 10% of Linux aio for most workloads. It will never be exactly, but for small numbers of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential reading and writing, because of potential for seeks caused by request reordering, before being confident of that.

>> Hmm. Thanks. I may consider switching to XFS now
> I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll be happy to give it a try! :-)

-- Jamie
Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity [EMAIL PROTECTED] wrote:
> I'm rooting for btrfs myself.

But could btrfs (when stable) work for migration?

I'm curious about OCFS2 performance on this kind of load... When I manage to sell the idea of a KVM cluster, I'd like to know if I should first try EVMS-HA (cluster LVs) or OCFS (cluster FS).

-- Javier
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Marcelo Tosatti wrote:
> Maybe require explicit device/function assignment on the command line? It will be managed anyway. ACPI does support hotplugging of individual functions inside slots, not sure how well does Linux (and other OSes) support that.. should be transparent though.

I think we need to decide what we want to target in terms of upper limits.

With a bridge or two, we can probably easily do 128. If we really want to push things, I think we should do a PCI based virtio controller. I doubt a large number of PCI devices is ever going to perform very well b/c of interrupt sharing and some of the assumptions in virtio_pci.

If we implement a controller, we can use a single interrupt, but multiplex multiple notifications on that single interrupt. We can also be more aggressive about using shared memory instead of PCI config space which would reduce the overall number of exits.

We could easily support a very large number of devices this way. But again, what do we want to target for now?

Regards,
Anthony Liguori
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 06:22:48 Avi Kivity wrote:
> Rusty Russell wrote:
>> [Christian, Hollis, how much is this ABI breakage going to hurt you?]
>>
>> A recent proposed feature addition to the virtio block driver revealed some flaws in the API, in particular how easy it is to break big endian machines.
>>
>> The virtio config space was originally chosen to be little-endian, because we thought the config might be part of the PCI config space for virtio_pci. It's actually a separate mmio region, so that argument holds little water; as only x86 is currently using the virtio mechanism, we can change this (but must do so now, before the impending s390 and ppc merges).
> This will probably annoy Hollis which has guests that can go both ways.

Rusty and I have discussed it. Ultimately, this just takes us from a cross-architecture endianness definition to a per-architecture definition. Anyways, we've already fallen into this situation with the virtio ring data itself, so we're really saying same endianness as the ring.

--
Hollis Blanchard
IBM Linux Technology Center
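Editorial sketch of the two conventions under discussion. With a little-endian config space, every guest assembles a field byte by byte in one fixed order. With a guest-endian config space, the guest copies the bytes as they are, and whoever fills the region has to write them in the guest's byte order. The function names are invented for illustration and are not the virtio API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed little-endian layout: same result on any guest architecture. */
static uint32_t config_read_le32(const uint8_t *cfg, size_t off)
{
    assert(cfg != NULL);
    return (uint32_t)cfg[off]
         | (uint32_t)cfg[off + 1] << 8
         | (uint32_t)cfg[off + 2] << 16
         | (uint32_t)cfg[off + 3] << 24;
}

/* Guest-endian layout: a plain copy, no byte swapping in the guest. */
static uint32_t config_read_native32(const uint8_t *cfg, size_t off)
{
    uint32_t v;

    assert(cfg != NULL);
    memcpy(&v, cfg + off, sizeof(v));
    return v;
}
```

On a little-endian guest the two functions return the same value; on a big-endian guest they differ, which is where a driver that forgets the conversion breaks.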
Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting
Nguyen Anh Quynh wrote:
> Hi,
>
> I am thinking about combining this ROM with the extboot. Both ROMs are about booting, so I think that is reasonable. So we will have only 1 ROM that supports both external boot and Linux boot.
>
> Is that desirable or not?

Does it make the code simpler and easier to understand? If not, then I would say no.

-hpa
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Anthony Liguori wrote:
> I think we need to decide what we want to target in terms of upper limits.
>
> With a bridge or two, we can probably easily do 128. If we really want to push things, I think we should do a PCI based virtio controller. I doubt a large number of PCI devices is ever going to perform very well b/c of interrupt sharing and some of the assumptions in virtio_pci.
>
> If we implement a controller, we can use a single interrupt, but multiplex multiple notifications on that single interrupt. We can also be more aggressive about using shared memory instead of PCI config space which would reduce the overall number of exits.
>
> We could easily support a very large number of devices this way. But again, what do we want to target for now?

I think that for networking we should keep things as is. I don't see anybody using 100 virtual NICs.

For mass storage, we should follow the SCSI model with a single device serving multiple disks, similar to what you suggest. Not sure if the device should have a single queue or one queue per disk.

--
error compiling committee.c: too many arguments to function
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> Anthony Liguori wrote:
>> I think we need to decide what we want to target in terms of upper limits.
>>
>> With a bridge or two, we can probably easily do 128. If we really want to push things, I think we should do a PCI based virtio controller. I doubt a large number of PCI devices is ever going to perform very well b/c of interrupt sharing and some of the assumptions in virtio_pci.
>>
>> If we implement a controller, we can use a single interrupt, but multiplex multiple notifications on that single interrupt. We can also be more aggressive about using shared memory instead of PCI config space which would reduce the overall number of exits.
>>
>> We could easily support a very large number of devices this way. But again, what do we want to target for now?
> I think that for networking we should keep things as is. I don't see anybody using 100 virtual NICs.
>
> For mass storage, we should follow the SCSI model with a single device serving multiple disks, similar to what you suggest. Not sure if the device should have a single queue or one queue per disk.

My latest thought is to do a virtio-based virtio controller. We could avoid creating one in QEMU unless we detect an abnormally large number of disks or something.

Regards,
Anthony Liguori
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
>> Andrea Arcangeli wrote:
>>> +
>>> +static int mm_lock_cmp(const void *a, const void *b)
>>> +{
>>> +	cond_resched();
>>> +	if ((unsigned long)*(spinlock_t **)a <
>>> +	    (unsigned long)*(spinlock_t **)b)
>>> +		return -1;
>>> +	else if (a == b)
>>> +		return 0;
>>> +	else
>>> +		return 1;
>>> +}
>>> +
>> This compare function looks unusual... It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...
>
> Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

I am saying your intent was probably to test

	else if ((unsigned long)*(spinlock_t **)a ==
		 (unsigned long)*(spinlock_t **)b)
		return 0;

Because a and b are pointers to the data you want to compare. You need to dereference them.

>> static int mm_lock_cmp(const void *a, const void *b)
>> {
>> 	unsigned long la = (unsigned long)*(spinlock_t **)a;
>> 	unsigned long lb = (unsigned long)*(spinlock_t **)b;
>>
>> 	cond_resched();
>> 	if (la < lb)
>> 		return -1;
>> 	if (la > lb)
>> 		return 1;
>> 	return 0;
>> }
>
> If your intent is to use the assumption that there are going to be few equal entries, you should have used likely(la > lb) to signal it's rarely going to return zero or gcc is likely free to do whatever it wants with the above. Overall that function is such a slow path that this is going to be lost in the noise. My suggestion would be to defer microoptimizations like this after 1/12 will be applied to mainline.
>
> Thanks!

Hum, it's not a micro-optimization, but a bug fix. :)

Sorry if it was not clear
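The point about dereferencing can be tried out in userspace. The following is a minimal sketch, assuming qsort() in place of the kernel's sort() and a made-up fake_lock_t in place of spinlock_t; the names lock_ptr_cmp and sort_locks are illustrative and do not come from the patch. The comparator receives pointers to array slots, so it has to load each slot before comparing, otherwise the equality branch can only trigger when a slot is compared with itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for spinlock_t (assumption); only its address matters here. */
typedef struct { int dummy; } fake_lock_t;

/*
 * qsort() passes pointers to array slots. Each slot holds a fake_lock_t *,
 * so a and b are dereferenced once to obtain the lock addresses that are
 * actually being ordered.
 */
static int lock_ptr_cmp(const void *a, const void *b)
{
    uintptr_t la = (uintptr_t)*(fake_lock_t *const *)a;
    uintptr_t lb = (uintptr_t)*(fake_lock_t *const *)b;

    if (la < lb)
        return -1;
    if (la > lb)
        return 1;
    return 0;   /* two different slots holding the same lock pointer */
}

/* Sort an array of lock pointers by address (hypothetical helper). */
static void sort_locks(fake_lock_t **locks, size_t n)
{
    qsort(locks, n, sizeof(*locks), lock_ptr_cmp);
}
```

With this form, duplicate lock pointers in the array compare as equal, which is the case the (a == b) test in the quoted patch cannot detect.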
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
> I am saying your intent was probably to test
>
> 	else if ((unsigned long)*(spinlock_t **)a ==
> 		 (unsigned long)*(spinlock_t **)b)
> 		return 0;

Indeed...

> Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug would lead to a system crash, it would still be zero risk for everybody that isn't using KVM/GRU actively with mmu notifiers. The important thing is that this patch has zero risk of introducing regressions into the kernel, both when enabled and disabled; it's like a new driver. I'll shortly resend 1/12 and likely 12/12 for theoretical correctness. For now you can go ahead testing with this patch as it'll work fine despite the bug (if it wasn't the case I would have noticed already ;).
[kvm-devel] Odd hang in the Ubuntu installer
Hi guys.

I'm trying to figure out what's going on with this bug:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217815

The short version of the problem is that if the console is left alone for an extended period of time, everything seems to stall until something (moving the mouse around, pressing a key, whatever) awakens it again.

It usually shows itself when you choose the Encrypted LVM option in our installer (this process wipes the drive, which is a rather lengthy process), since that's probably the only place where you'd leave the console alone for a while, while still getting some UI feedback (and then a sudden lack of feedback, obviously).

It started when I backported this to the kvm version in our archive:

commit d2668b3fd41f88c18a7f9c4f1d024f0e5d9f64cf
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date: Wed Apr 2 20:20:14 2008 -0300
Subject: kvm: qemu: separate thread for IO handling

While trying to solve this problem, I noticed that that commit was just one of a set of three patches. Applying the other two:

commit 1743ef816b6cd22d100ccb80e542b8ca19c75392
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date: Wed Apr 2 20:20:15 2008 -0300
Subject: kvm: qemu: add function to handle signals

commit d84f71afaafec49e0ab3aa7a33518df04c14f38a
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date: Wed Apr 2 20:20:16 2008 -0300
Subject: kvm: qemu: notify IO thread of pending bhs

...makes it take a bit longer before it happens, but it's still very much reproducible. Reverting those changes fixes it completely.

We've tried with kvm 66, which also exhibits this behaviour, so I'm fairly confident I didn't mess up the patch while backporting it. In case you're interested, the backported patch is here:

http://people.ubuntu.com/~soren/virtio_hang.patch

The latter two commits applied without changes (with a bit of fuzz, though).

I'm hoping one of you guys could give me a hint (or perhaps even a patch)?

--
Soren Hansen | Virtualisation specialist | Ubuntu Server Team
Canonical Ltd. | http://www.ubuntu.com/
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Andrea Arcangeli wrote:
> +
> +static int mm_lock_cmp(const void *a, const void *b)
> +{
> +	cond_resched();
> +	if ((unsigned long)*(spinlock_t **)a <
> +	    (unsigned long)*(spinlock_t **)b)
> +		return -1;
> +	else if (a == b)
> +		return 0;
> +	else
> +		return 1;
> +}
> +

This compare function looks unusual... It should work, but sort() could be faster if the if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
	unsigned long la = (unsigned long)*(spinlock_t **)a;
	unsigned long lb = (unsigned long)*(spinlock_t **)b;

	cond_resched();
	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> I think we need to decide what we want to target in terms of upper limits. With a bridge or two, we can probably easily do 128.
>>>
>>> If we really want to push things, I think we should do a PCI based virtio controller. I doubt a large number of PCI devices is ever going to perform very well b/c of interrupt sharing and some of the assumptions in virtio_pci.
>>>
>>> If we implement a controller, we can use a single interrupt, but multiplex multiple notifications on that single interrupt. We can also be more aggressive about using shared memory instead of PCI config space which would reduce the overall number of exits.

We should increase the number of interrupt lines, perhaps to 16.

Using shared memory to avoid exits sounds like a very good idea.

>>> We could easily support a very large number of devices this way. But again, what do we want to target for now?
>>
>> I think that for networking we should keep things as is. I don't see anybody using 100 virtual NICs.

The target was along the lines of 20 nics + 80 disks. Dan?

>> For mass storage, we should follow the SCSI model with a single device serving multiple disks, similar to what you suggest. Not sure if the device should have a single queue or one queue per disk.
>
> My latest thought is to do a virtio-based virtio controller.

Why do you dislike multiple disks per virtio-blk controller? As mentioned this seems a natural way forward.
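The idea of one interrupt carrying notifications for many devices, with the details kept in shared memory, can be sketched in plain C. Everything below is an assumption made for illustration: struct mux_ctrl, mux_post(), mux_irq() and the 64-device limit are invented here and are not part of virtio_pci or of any posted patch. The host side sets one bit per device in a shared word and raises the interrupt only when the word was previously empty; the guest side drains the word once per interrupt and dispatches to each device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define MAX_DEVS 64   /* assumption: one 64-bit pending word */

typedef void (*notify_fn)(unsigned dev);

/* Hypothetical controller state; "pending" would live in shared memory. */
struct mux_ctrl {
    _Atomic uint64_t pending;        /* one bit per device */
    notify_fn handler[MAX_DEVS];     /* per-device guest callback */
};

/* Example handler used to observe dispatch; counts hits per device. */
static unsigned hits[MAX_DEVS];
static void count_hit(unsigned dev) { hits[dev]++; }

/*
 * Host side: mark a device as having work. Returns 1 when the word was
 * empty before, i.e. when the single interrupt needs to be raised.
 */
static int mux_post(struct mux_ctrl *c, unsigned dev)
{
    uint64_t old = atomic_fetch_or(&c->pending, UINT64_C(1) << dev);
    return old == 0;
}

/*
 * Guest side: handler for the single interrupt. Takes all pending bits
 * in one atomic exchange, then calls each device's handler. Returns the
 * number of handlers invoked.
 */
static int mux_irq(struct mux_ctrl *c)
{
    uint64_t bits = atomic_exchange(&c->pending, 0);
    int n = 0;

    for (unsigned dev = 0; bits; dev++, bits >>= 1) {
        if ((bits & 1) && c->handler[dev]) {
            c->handler[dev](dev);
            n++;
        }
    }
    return n;
}
```

In this sketch a second notification that arrives before the guest has run does not raise another interrupt, which is one way the number of exits could be reduced.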
Re: [kvm-devel] [PATCH 2/2] KVM: Handle interrupts for PCI passthrough devices
* On Sunday 13 Apr 2008 14:06:27 Avi Kivity wrote:
Amit Shah wrote:

Passthrough devices are host machine PCI devices which have been handed off to the guest. Handle interrupts from these devices and route them to the appropriate guest irq lines. The userspace provides us with the necessary information via the ioctls.

The guest IRQ numbers can change dynamically, so we have an additional ioctl that keeps track of those changes in userspace and notifies us whenever that happens.

It is expected the kernel driver for the passthrough device is removed before passing it on to the guest.

+/*
+ * Used to find a registered host PCI device (a passthrough device)
+ * during interrupts or EOI
+ */
+static struct kvm_pci_pt_dev_list *
+find_pci_pt_dev(struct list_head *head,
+		struct kvm_pci_pt_info *pv_pci_info, int irq, int source)
+{
+	struct list_head *ptr;
+	struct kvm_pci_pt_dev_list *match;
+
+	list_for_each(ptr, head) {
+		match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
+
+		switch (source) {
+		case KVM_PT_SOURCE_IRQ:
+			/*
+			 * Used to find a registered host device
+			 * during interrupt context on host
+			 */
+			if (match->pt_dev.host.irq == irq)
+				return match;
+			break;
+		case KVM_PT_SOURCE_IRQ_ACK:
+			/*
+			 * Used to find a registered host device when
+			 * the guest acks an interrupt
+			 */
+			if (match->pt_dev.guest.irq == irq)
+				return match;
+			break;
+		}
+	}
+	return NULL;
+}

This would be better as two separate functions. Also, locking?

For pvdma, there will be two more cases. Very similar functions for essentially looking up an entry in the same list. Locking will be supported soon.

+static irqreturn_t
+kvm_pci_pt_dev_intr(int irq, void *dev_id)

Please don't split declarations unnecessarily.

Fixed.

+{
+	struct kvm_pci_pt_dev_list *match;
+	struct kvm *kvm = (struct kvm *) dev_id;
+
+	if (!test_bit(irq, pt_irq_handled))
+		return IRQ_NONE;
+
+	if (test_bit(irq, pt_irq_pending))
+		return IRQ_HANDLED;

Will the interrupt not fire immediately after this returns?

Hmm. This is just an optimisation so that we don't have to look up the list each time to find out which assigned device it is and (re)injecting the interrupt. Also we avoid the (TODO) getting/releasing locks which will be needed for the list lookup.

Disabling interrupts for PCI devices isn't a good idea even if we don't support shared interrupts. Any other ideas to avoid this from happening?

+	match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
+				irq, KVM_PT_SOURCE_IRQ);
+	if (!match)
+		return IRQ_NONE;
+
+	/* Not possible to detect if the guest uses the PIC or the
+	 * IOAPIC. So set the bit in both. The guest will ignore
+	 * writes to the unused one.
+	 */
+	kvm_ioapic_set_irq(kvm->arch.vioapic, match->pt_dev.guest.irq, 1);
+	kvm_pic_set_irq(pic_irqchip(kvm), match->pt_dev.guest.irq, 1);

A function that calls both the apic and the pic is better, as it will be easier to port.

Done.

+	set_bit(irq, pt_irq_pending);
+	return IRQ_HANDLED;
+}
+
+/* Ack the irq line for a passthrough device */
+void
+kvm_pci_pt_ack_irq(struct kvm *kvm, int vector)
+{
+	int irq;
+	struct kvm_pci_pt_dev_list *match;
+
+	irq = get_eoi_gsi(kvm->arch.vioapic, vector);
+	match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
+				irq, KVM_PT_SOURCE_IRQ_ACK);
+	if (!match)
+		return;
+	if (test_bit(match->pt_dev.host.irq, pt_irq_pending)) {
+		kvm_ioapic_set_irq(kvm->arch.vioapic, irq, 0);
+		kvm_pic_set_irq(pic_irqchip(kvm), irq, 0);

This is dangerous with smp guests, if we aren't careful with the ordering the interrupt may fire again and be forwarded to the other vcpu.

We need to call this before we redeliver interrupts. The 'pending' bitmap ensures we don't inject an interrupt that hasn't been ack'ed. Once the locking is in place, this shouldn't be a worry.

+		clear_bit(match->pt_dev.host.irq, pt_irq_pending);
+	}
+}

...

@@ -1671,6 +1836,30 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_ASSIGN_PCI_PT_DEV: {
+		struct kvm_pci_passthrough_dev pci_pt_dev;
+
+		r = -EFAULT;
+		if (copy_from_user(&pci_pt_dev, argp, sizeof pci_pt_dev))
+			goto out;
+
+		r = kvm_vm_ioctl_pci_pt_dev(kvm, &pci_pt_dev);
+		if (r)
+			goto
Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(
Gerd Hoffmann wrote:
> Jeremy Fitzhardinge wrote:
>> Xen could change the parameters in the instant after get_time_values(). That change could be as a result of suspend-resume, so the parameters and the tsc could be wildly different.
>
> Ah, ok, forgot the rdtsc in the picture. With that in mind I fully agree that the loop is needed. I think kvm guests can even hit that one with the vcpu migrating to a different physical cpu, so we better handle it correctly ;)

It's probably not needed for kvm, since we update everything every time we get scheduled on the host side, which would cover the case of migration between physical cpus. But it's probably okay to do it to get a common denominator with xen, if needed.
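The loop under discussion has the shape of a version-checked read: the counter read has to sit inside the same retry window as the parameter reads, otherwise the parameters and the counter can come from different epochs. The following is a minimal userspace sketch of that shape only. The struct layout, the names time_info, read_counter() and read_time_ns(), the convention that an odd version means an update is in progress, and the 1 ns per tick scaling are all assumptions for illustration; this is neither the Xen nor the KVM ABI.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared time record; field names are illustrative only. */
struct time_info {
    atomic_uint version;      /* assumed odd while the writer is mid-update */
    uint64_t tsc_timestamp;   /* counter value when system_time was sampled */
    uint64_t system_time;     /* nanoseconds at tsc_timestamp */
};

/* Stand-in for rdtsc so the sketch stays portable and testable. */
static uint64_t fake_tsc;
static uint64_t read_counter(void) { return fake_tsc; }

/*
 * Read the parameters and the counter inside one retry window. If the
 * writer changed the record at any point in between, the version check
 * fails and the whole read, including the counter, is repeated.
 */
static uint64_t read_time_ns(struct time_info *ti)
{
    unsigned int v;
    uint64_t base, stamp, now;

    do {
        v = atomic_load_explicit(&ti->version, memory_order_acquire);
        base = ti->system_time;
        stamp = ti->tsc_timestamp;
        now = read_counter();
        atomic_thread_fence(memory_order_acquire);
    } while ((v & 1) ||
             v != atomic_load_explicit(&ti->version, memory_order_relaxed));

    return base + (now - stamp);   /* assumes one counter tick per ns */
}
```

Moving read_counter() outside the loop would reintroduce the window described in the quoted text, where the record changes right after the parameters were fetched.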
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
I believe the differences between your patch set and Christoph's need to be understood and a compromise approach agreed upon.

Those differences, as I understand them, are:

1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction from Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpreted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed.

2) Range callout names: Your range callouts are invalidate_range_start and invalidate_range_end whereas Christoph's are start and end. I do not believe this has been discussed in great detail. I know I have expressed a preference for your names. I admit to having failed to follow up on this issue. I certainly believe we could come to an agreement quickly if pressed.

3) The structure of the patch set: Christoph's upcoming release orders the patches so the prerequisite patches are separately reviewable and each file is only touched by a single patch. Additionally, that allows mmu_notifiers to be introduced as a single patch with sleeping functionality from its inception and an API which remains unchanged. Your patch set, however, introduces one API, then turns around and changes that API. Again, the desire to make it an unchanging API was expressed by, IIRC, Andrew. This does represent a risk to XPMEM as the non-sleeping API may become entrenched and make acceptance of the sleeping version less acceptable.

Can we agree upon this list of issues?

Thank you,
Robin Holt
[kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)
Thanks to all those who work on KVM. It is a wonderful product and I have been very impressed with its features, performance, and the level of activity in this project.

Back in February a bug was filed. I've been hit by this bug as well, but there hasn't been much activity on it recently. I wanted to know if anyone had a fix for it, or a workaround (other than using IDE), or whether it was on someone's radar. Here is a link to the bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

Thanks in advance.

--
Alberto Treviño [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
> 1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction from Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpreted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed.

invalidate_page as demonstrated in KVM pseudocode doesn't change the locking requirements, and it has the benefit of reducing the window of time the secondary page fault has to be masked and at the same time _halves_ the number of _hooks_ in the VM every time the VM deals with single pages (example: do_wp_page hot path). As long as we can't fully converge because of point 3, I'd rather keep invalidate_page to be better. But that's by far not a priority to keep.

> 2) Range callout names: Your range callouts are invalidate_range_start and invalidate_range_end whereas Christoph's are start and end. I do not believe this has been discussed in great detail. I know I have expressed a preference for your names. I admit to having failed to follow up on this issue. I certainly believe we could come to an agreement quickly if pressed.

I think using ->start ->end is a mistake, think when we later add mprotect_range_start/end. Here too I keep the better names only because we can't converge on point 3 (the API will eventually change, like every other kernel internal API, even core things like __free_page have been mostly obsoleted).

> 3) The structure of the patch set: Christoph's upcoming release orders the patches so the prerequisite patches are separately reviewable and each file is only touched by a single patch. Additionally, that

Each file touched by a single patch? I doubt... The split is about the same, the main difference is the merge ordering, I always had the zero risk part at the head, he moved it at the tail when he incorporated #v12 into his patchset.

> allows mmu_notifiers to be introduced as a single patch with sleeping functionality from its inception and an API which remains unchanged. Your patch set, however, introduces one API, then turns around and changes that API. Again, the desire to make it an unchanging API was expressed by, IIRC, Andrew. This does represent a risk to XPMEM as the non-sleeping API may become entrenched and make acceptance of the sleeping version less acceptable.
>
> Can we agree upon this list of issues?

This is a kernel internal API, so it will definitely change over time. It's nothing close to a syscall.

Also note: the API is obviously defined in mmu_notifier.h and none of the 2-12 patches touches mmu_notifier.h. So the extension of the method semantics is 100% backwards compatible.

My patch order and API backward compatible extension over the patchset is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support XPMEM as well. KVM/GRU won't notice any difference once the support for XPMEM is added, but even if the API would completely change in 2.6.27, that's still better than no functionality at all in 2.6.26.
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
>>> For mass storage, we should follow the SCSI model with a single device serving multiple disks, similar to what you suggest. Not sure if the device should have a single queue or one queue per disk.
>>
>> My latest thought is to do a virtio-based virtio controller.
>
> Why do you dislike multiple disks per virtio-blk controller? As mentioned this seems a natural way forward.

Logically speaking, virtio is a bus. virtio supports all of the features of a bus (discover, hot add, hot remove).

Right now, we map virtio devices directly onto the PCI bus. The problem we're trying to address is limitations of the PCI bus. We have a couple of options:

1) add a virtio device that supports multiple disks. We need to reinvent hotplug within this device.

2) add a new PCI virtio transport that supports multiple virtio-blk devices within a single PCI slot

3) add a generic PCI virtio transport that supports multiple virtio devices within a single PCI slot

4) add a generic virtio bridge that supports multiple virtio devices within a single virtio device.

#4 may seem strange, but it's no different from a PCI-to-PCI bridge. I like #4 the most, but #2 is probably the most practical.

Regards,

Anthony Liguori
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Avi Kivity wrote:
> For mass storage, we should follow the SCSI model with a single device serving multiple disks, similar to what you suggest. Not sure if the device should have a single queue or one queue per disk.

Don't you just end up re-implementing SCSI then, at which point you might as well stick with a 'fake' SCSI device in the guest?
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
>> Anthony Liguori wrote:
>>> This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8.
>>>
>>> With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.
>>
>> Does this play well with hotplug? Perhaps we need to allocate a new device on hotplug.
>>
>> (certainly if we have a device with one function, which then gets converted to a multifunction device)
>
> Would have to change the hotplug code to handle functions...

BTW, I've never been that convinced that hotplugging devices is as useful as people make it out to be. I also think that's particularly true when it comes to hot adding/removing very large numbers of disks.

I think if you created all virtio devices as multifunction devices, but didn't add additional functions until you ran out of PCI slots, it would be a pretty acceptable solution. Hotplug works just as it does today until you get much higher than 32 devices. Even then, hotplug still works with most of your devices (until you hit the absolute maximum number of devices of course).

Regards,

Anthony Liguori

> It sounds less hacky to just extend the PCI slots instead of (ab)using multiple functions per-slot.
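The "fill the slots first, add functions only afterwards" policy can be written down as a small allocation function. This is a sketch under stated assumptions: it models a single PCI bus with 32 device numbers and 8 functions each, uses the standard devfn packing (5 bits of device number, 3 bits of function), and ignores slots already taken by other devices such as the host bridge, which is presumably why the quoted figure is 220 rather than 256. The names make_devfn(), devfn_slot(), devfn_func() and nth_devfn() are invented for this example and are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

#define SLOTS      32   /* PCI device numbers per bus */
#define FUNCTIONS   8   /* PCI functions per device */

/* devfn packing used by PCI: 5 bits of slot, 3 bits of function. */
static uint8_t make_devfn(unsigned slot, unsigned fn)
{
    return (uint8_t)((slot << 3) | fn);
}

static unsigned devfn_slot(uint8_t devfn) { return devfn >> 3; }
static unsigned devfn_func(uint8_t devfn) { return devfn & 7; }

/*
 * Return the devfn for the n-th allocated device (n counted from 0), or
 * -1 when the bus is full. The first 32 devices each get function 0 of
 * their own slot; only from the 33rd device on are extra functions used.
 */
static int nth_devfn(unsigned n)
{
    if (n >= SLOTS * FUNCTIONS)
        return -1;
    return make_devfn(n % SLOTS, n / SLOTS);
}
```

Under this policy the first 32 devices look exactly like today's single-function devices, so only guests that go beyond that number depend on function-level handling.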
Re: [kvm-devel] [PATCH] Make virtio devices multi-function
On Tue, Apr 22, 2008 at 02:26:45PM -0300, Marcelo Tosatti wrote:
> On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
>> Avi Kivity wrote:
>>> Anthony Liguori wrote:
>>>> I think we need to decide what we want to target in terms of upper limits. With a bridge or two, we can probably easily do 128.
>>>>
>>>> If we really want to push things, I think we should do a PCI based virtio controller. I doubt a large number of PCI devices is ever going to perform very well b/c of interrupt sharing and some of the assumptions in virtio_pci.
>>>>
>>>> If we implement a controller, we can use a single interrupt, but multiplex multiple notifications on that single interrupt. We can also be more aggressive about using shared memory instead of PCI config space which would reduce the overall number of exits.
>
> We should increase the number of interrupt lines, perhaps to 16.
>
> Using shared memory to avoid exits sounds like a very good idea.
>
>>>> We could easily support a very large number of devices this way. But again, what do we want to target for now?
>>>
>>> I think that for networking we should keep things as is. I don't see anybody using 100 virtual NICs.
>
> The target was along the lines of 20 nics + 80 disks. Dan?

I've already had people ask for the ability to have as many as 64 disks and 32 nics with Xen, so to my mind, the more we support the better. 100's if possible.

Dan.
--
|: Red Hat, Engineering, Boston -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
>> 1) invalidate_page: You retain an invalidate_page() callout. I believe we have progressed that discussion to the point that it requires some direction from Andrew, Linus, or somebody in authority. The basics of the difference distill down to no expected significant performance difference between the two. The invalidate_page() callout potentially can simplify GRU code. It does provide a more complex api for the users of mmu_notifier which, IIRC, Christoph had interpreted from one of Andrew's earlier comments as being undesirable. I vaguely recall that sentiment as having been expressed.
>
> invalidate_page as demonstrated in KVM pseudocode doesn't change the locking requirements, and it has the benefit of reducing the window of time the secondary page fault has to be masked and at the same time _halves_ the number of _hooks_ in the VM every time the VM deals with single pages (example: do_wp_page hot path). As long as we can't fully converge because of point 3, I'd rather keep invalidate_page to be better. But that's by far not a priority to keep.

Christoph, Jack and I just discussed invalidate_page(). I don't think the point Andrew was making is that compelling in this circumstance. The code has changed fairly remarkably. Would you have any objection to putting it back into your patch/agreeing to it remaining in Andrea's patch? If not, I think we can put this issue aside until Andrew gets out of the merge window and can decide it. Either way, the patches become much more similar with this in.

Thanks,
Robin
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
Thanks for adding most of my enhancements. But

1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there were multiple callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the lock conversion is done first and then mm_lock is introduced. The way the patches are structured means that reviewers cannot review the final version of mm_lock etc etc. The lock conversion needs to come first.

3. As noted by Eric and also contained in a private post from yesterday by me: The cmp function needs to retrieve the value before doing comparisons, which is not done for the == of a and b.
[kvm-devel] Problems with MAC address with e1000 on Windows 2003
I was wondering if anyone could reproduce my problem. If it is reproducible, then I'll file a bug. I am using e1000 ethernet adapters on Windows 2003 and Linux guests. The line to set it up is something like this:

-net nic,vlan=1,macaddr=00:ff:21:cf:91:01,model=e1000 \
-net tap,vlan=1,ifname=tap.br1.91.1

On Linux, this works just fine. However, on Windows 2003, the mac address for the device is reported as 00:ff:ff:ff:ff:ff and the packets carry this mac address as well. The corresponding tap device has the correct IP address, however.

This problem is definitely tied to using Windows 2003 with an e1000 device. If I use the rtl8139 device, Windows reports the correct mac address. When booting the same VM with a Linux bootable CD and the e1000 device, Linux reports the correct mac address as set in the qemu command. It's the combination of Windows 2003 and the e1000 device that causes the problem.

Has anyone else seen this problem? Thanks in advance.

-- Alberto Treviño [EMAIL PROTECTED]
Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug
Looks like this is not complete. There are numerous .h files missing, which means that various structs are undefined (fs.h and rmap.h are needed f.e.), which leads to surprises when dereferencing fields of these structs. It seems that mm_types.h is expected to be included only in certain contexts. Could you make sure to include all necessary .h files? Or add some docs to clarify the situation here.
Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5 seconds during one of the usual events (again this time it was due to kscand in the guest). I ran the formatted trace data through an awk script to summarize:

TSC cycles          tracer2  tracer3  tracer4
0 - 10,000:          295067   213251   115873
10,001 - 25,000:       7682     1004    98336
25,001 - 50,000:        201       15       36
50,001 - 100,000:    100655        0       10
> 100,000:              117        0       15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly 5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it took longer than 50,000 cycles. The page_fault function getting run is paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times, most of them relatively quickly.

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few host processes could interrupt it.

david

Avi Kivity wrote:
David S. Ahern wrote:

I added the traces and captured data over another apparent lockup of the guest. This seems to be representative of the sequence (pid/vcpu removed).

(+4776) VMEXIT [ exitcode = 0x, rip = 0x c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
(+3632) VMENTRY
(+4552) VMEXIT [ exitcode = 0x, rip = 0x c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x000b, virt = 0x fffb61c8 ]
(+ 54928) VMENTRY

Can you oprofile the host to see where the 54K cycles are spent?

(+4568) VMEXIT [ exitcode = 0x, rip = 0x c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x 9db4 gpte = 0x 41c5d363 ]
(+8432) VMENTRY
(+3936) VMEXIT [ exitcode = 0x, rip = 0x c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x 9db0 gpte = 0x ]
(+ 13832) VMENTRY
(+5768) VMEXIT [ exitcode = 0x, rip = 0x c016127c ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
(+3712) VMENTRY
(+4576) VMEXIT [ exitcode = 0x, rip = 0x c016104a ]
(+ 0) PAGE_FAULT [ errorcode = 0x000b, virt = 0x fffb61d0 ]
(+ 0) PTE_WRITE [ gpa = 0x 3d5981d0 gpte = 0x 3d55d047 ]

This indeed has the accessed bit clear.

(+ 65216) VMENTRY
(+4232) VMEXIT [ exitcode = 0x, rip = 0x c01610e7 ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db4 ]
(+ 0) PTE_WRITE [ gpa = 0x 9db4 gpte = 0x 3d598363 ]

This has the accessed bit set and the user bit clear, and the pte pointing at the previous pte_write gpa. Looks like a kmap_atomic().

(+8640) VMENTRY
(+3936) VMEXIT [ exitcode = 0x, rip = 0x c01610ee ]
(+ 0) PAGE_FAULT [ errorcode = 0x0003, virt = 0x c0009db0 ]
(+ 0) PTE_WRITE [ gpa = 0x 9db0 gpte = 0x ]
(+ 14160) VMENTRY

I can forward a more complete time snippet if you'd like. vcpu0 + corresponding vcpu1 files have 85000 total lines and compressed the files total ~500k. I did not see the FLOODED trace come out during this sample though I did bump the count from 3 to 4 as you suggested.

Bumping the count was supposed to remove the flooding...

Correlating rip addresses to the 2.4 kernel: c0160d00-c0161290 = page_referenced

It looks like the event is kscand running through the pages. I suspected this some time ago, and tried tweaking the kscand_work_percent sysctl variable. It appeared to lower the peak of the spikes, but maybe I imagined it. I believe lowering that value makes kscand wake up more often but do less work (page scanning) each time it is awakened.

What does 'top' in the guest show (perhaps sorted by total cpu time rather than instantaneous usage)? What host kernel are you running? How many host cpus?
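The awk script used for the summary was not posted, so the following is only a sketch of the kind of bucketing it would have to perform. The bucket boundaries come from the table in the message. The function names, and the choice to treat each upper bound as inclusive, are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bucket boundaries taken from the table in the message (TSC cycles). */
enum { NR_BUCKETS = 5 };
static const uint64_t bucket_upper[NR_BUCKETS - 1] = {
	10000, 25000, 50000, 100000
};

/* Return the bucket index for one cycle delta:
 * 0: 0 - 10,000        1: 10,001 - 25,000   2: 25,001 - 50,000
 * 3: 50,001 - 100,000  4: above 100,000 */
static size_t cycle_bucket(uint64_t cycles)
{
	size_t i;

	for (i = 0; i < NR_BUCKETS - 1; i++) {
		if (cycles <= bucket_upper[i])
			return i;
	}
	return NR_BUCKETS - 1;
}

/* Count how many of the n deltas fall into each bucket. */
static void summarize_cycles(const uint64_t *deltas, size_t n,
			     uint64_t counts[NR_BUCKETS])
{
	size_t i;

	assert(deltas != NULL || n == 0);
	for (i = 0; i < NR_BUCKETS; i++)
		counts[i] = 0;
	for (i = 0; i < n; i++)
		counts[cycle_bucket(deltas[i])]++;
}
```

One counts array per tracer would reproduce the three columns of the table. Two of the exit-to-entry deltas in the quoted trace, 54928 and 65216, would land in the 50,001 - 100,000 bucket.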
Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced
Missing signoff by you.
Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?
Re: [kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks
Why are the subjects all screwed up? They are the first line of the description instead of the subject line of my patches.
Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
Doing the right patch ordering would have avoided this patch and allowed better review.
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, 22 Apr 2008, Andrea Arcangeli wrote:

My patch order and API backward compatible extension over the patchset is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support XPMEM as well. KVM/GRU won't notice any difference once the support for XPMEM is added, but even if the API would completely change in 2.6.27, that's still better than no functionality at all in 2.6.26.

Please redo the patchset with the right order. To my knowledge there is no chance of this getting merged for 2.6.26.
Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13
On Tue, 22 Apr 2008, Robin Holt wrote:

putting it back into your patch/agreeing to it remaining in Andrea's patch? If not, I think we can put this issue aside until Andrew gets out of the merge window and can decide it. Either way, the patches become much more similar with this in.

One solution would be to separate the invalidate_page() callout into a patch at the very end that can be omitted. AFAICT there is no compelling reason to have this callback, and it complicates the API for the device driver writers. Not having this callback makes the way that mmu notifiers are called from the VM uniform, which is a desirable goal.
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:

Thanks for adding most of my enhancements. But 1. There is no real need for invalidate_page(). Can be done with invalidate_start/end. Needlessly complicates the API. One of the objections by Andrew was that there were multiple callbacks that perform similar functions.

While I agree with that reading of Andrew's email about invalidate_page, I think the GRU hardware makes a strong enough case to justify the two separate callouts. Due to the GRU hardware, we can assure that invalidate_page terminates all pending GRU faults (that includes faults that are just beginning) and can therefore be completed without needing any locking. The invalidate_page() callout gets turned into a GRU flush instruction and we return.

Because the invalidate_range_start() leaves the page table information available, we can not use a single page _start to mimic that functionality. Therefore, there is a documented case justifying the separate callouts. I agree the case is fairly weak, but it does exist.

Given Andrea's unwillingness to move and Jack's documented case, it is my opinion the most likely compromise is to leave in the invalidate_page() callout.

Thanks, Robin
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:

The virtio config space was originally chosen to be little-endian, because we thought the config might be part of the PCI config space for virtio_pci. It's actually a separate mmio region, so that argument holds little water; as only x86 is currently using the virtio mechanism, we can change this (but must do so now, before the impending s390 and ppc merges).

This will probably annoy Hollis, who has guests that can go both ways.

Yes, I discussed this with Hollis. But the virtio rings themselves already have this issue: we don't do any endian conversion on them and assume they're our endian in the guest. We may still regret not doing *everything* little-endian, but this doesn't make it worse.

Hmm, why *don't* we just do everything LE, including the ring?

-- Hollis Blanchard IBM Linux Technology Center
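The two conventions under discussion can be sketched in plain C. This is an illustration only, not code from the virtio patch: the helper names and the byte-array model of the config space are hypothetical. A little-endian field is decoded byte by byte and gives the same result on any CPU, while a guest-endian field is a plain copy whose meaning depends on the byte order of the machine that wrote it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode a 32-bit config field stored little-endian, regardless of the
 * byte order of the CPU running this code. */
static uint32_t config_read_le32(const uint8_t *cfg, unsigned int offset)
{
	const uint8_t *p;

	assert(cfg != NULL);
	p = cfg + offset;
	return (uint32_t)p[0] |
	       ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) |
	       ((uint32_t)p[3] << 24);
}

/* Read the same field in the byte order of the running CPU ("guest
 * endian" from the guest driver's point of view): a plain copy with no
 * conversion. */
static uint32_t config_read_native32(const uint8_t *cfg, unsigned int offset)
{
	uint32_t v;

	assert(cfg != NULL);
	memcpy(&v, cfg + offset, sizeof(v));
	return v;
}
```

The native variant is the simpler one inside a single guest, but a host serving guests of both byte orders has to know which order each guest uses, which is the cost being weighed in this thread.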
[kvm-devel] [PATCH] Make virtio devices multi-function (v2)
This patch changes virtio devices to be multi-function devices whenever possible. This increases the number of virtio devices we can support now by a factor of 8. With this patch, I've been able to launch a guest with either 220 disks or 220 network adapters.

Since v1, I've changed the way virtio devices are allocated to be as follows:

1) Always use a slot as long as they are available. We can extend this to use a PCI when we get that working more reliably.
2) When PCI slots are exhausted, fall back to adding the device as an additional function on an existing slot.

This way, hotplug continues to work just as well as it does now. Once you exceed the number of PCI slots, you need an OS that can do hotplug of individual PCI functions if you care about doing hotplug. I think this is a pretty reasonable trade-off.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index a23a466..5d5d1a5 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -146,6 +146,41 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
     return 0;
 }

+/* Search the bus for a multifunction device with a free function that
+ * matches vendor_id_filter and device_id_filter. -1 can be passed as
+ * a filter value to accept any id.
+ */
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+                                 int device_id_filter)
+{
+    int devfn;
+
+    for (devfn = bus->devfn_min; devfn < 256; devfn += 8) {
+        int vendor_id, device_id;
+        PCIDevice *pci_dev;
+
+        if (!bus->devices[devfn])
+            continue;
+
+        pci_dev = bus->devices[devfn];
+        vendor_id = pci_dev->config[0x01] << 8 | pci_dev->config[0x00];
+        device_id = pci_dev->config[0x03] << 8 | pci_dev->config[0x02];
+
+        if ((vendor_id_filter == -1 || vendor_id_filter == vendor_id) &&
+            (device_id_filter == -1 || device_id_filter == device_id) &&
+            ((pci_dev->config[0x0e] & 0x80) == 0x80)) {
+            int i;
+
+            for (i = 1; i < 8; i++) {
+                if (!bus->devices[devfn + i])
+                    return devfn + i;
+            }
+        }
+    }
+
+    return -1;
+}
+
 /* -1 for devfn means auto assign */
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
                                int instance_size, int devfn,
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..84d6a29 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7

-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256

 #define PCI_VENDOR_ID 0x00 /* 16 bits */
 #define PCI_DEVICE_ID 0x02 /* 16 bits */
@@ -105,6 +105,9 @@ void pci_info(void);
 PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint32_t id,
                         pci_map_irq_fn map_irq, const char *name);

+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+                                 int device_id_filter);
+
 /* lsi53c895a.c */
 #define LSI_MAX_DEVS 7
 void lsi_scsi_attach(void *opaque, BlockDriverState *bd, int id);
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 6a50001..361455d 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,12 +405,22 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
     PCIDevice *pci_dev;
     uint8_t *config;
     uint32_t size;
+    int devfn = -1;

-    pci_dev = pci_register_device(bus, name, struct_size,
-                                  -1, NULL, NULL);
-    if (!pci_dev)
+    pci_dev = pci_register_device(bus, name, struct_size, -1, NULL, NULL);
+
+    if (pci_dev == NULL) {
+        devfn = pci_bus_find_device_function(bus, vendor, -1);
+        if (devfn != -1)
+            pci_dev = pci_register_device(bus, name, struct_size,
+                                          devfn, NULL, NULL);
+    }
+
+    if (pci_dev == NULL)
         return NULL;

+    devfn = pci_dev->devfn;
+
     vdev = to_virtio_device(pci_dev);

     vdev->status = 0;
@@ -438,6 +448,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,

     config[0x3d] = 1;

+    /* Mark device as multi-function */
+    if ((devfn % 8) == 0)
+        config[0x0e] |= 0x80;
+
     vdev->name = name;
     vdev->config_len = config_size;
     if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);

 /* NIC info */

-#define MAX_NICS 8
+#define MAX_NICS 256

 struct NICInfo {
     uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index c60072d..4385802 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -149,7 +149,7 @@ typedef struct DriveInfo {

 #define MAX_IDE_DEVS 2
 #define MAX_SCSI_DEVS 7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256

 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 74be059..824e331 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8717,7
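The arithmetic the patch relies on can be shown in a standalone sketch. The helpers below are hypothetical and are not part of QEMU. They only restate standard PCI conventions: a devfn value packs a 5-bit slot number and a 3-bit function number, the vendor ID is stored little-endian at config offsets 0x00 and 0x01, and bit 7 of the header type register at offset 0x0e marks a multi-function device.

```c
#include <assert.h>
#include <stdint.h>

/* A PCI devfn packs the slot (device) number in the upper five bits and
 * the function number in the lower three, so one bus has 32 slots of 8
 * functions each, 256 devfn values in total. */
static unsigned int devfn_make(unsigned int slot, unsigned int func)
{
	assert(slot < 32 && func < 8);
	return (slot << 3) | func;
}

static unsigned int devfn_slot(unsigned int devfn)
{
	return (devfn >> 3) & 0x1f;
}

static unsigned int devfn_func(unsigned int devfn)
{
	return devfn & 0x07;
}

/* Config-space fields are little-endian: the 16-bit vendor ID sits at
 * offsets 0x00 (low byte) and 0x01 (high byte). */
static uint16_t config_vendor_id(const uint8_t *config)
{
	return (uint16_t)(config[0x01] << 8 | config[0x00]);
}

/* Bit 7 of the header type register (offset 0x0e) marks a
 * multi-function device. */
static int config_is_multifunction(const uint8_t *config)
{
	return (config[0x0e] & 0x80) == 0x80;
}
```

Stepping devfn by 8, as the search loop in the patch does, visits function 0 of each slot. The inner loop over 1 to 7 then probes the remaining functions of that slot, which is where the factor of 8 in the description comes from.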
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:

We may still regret not doing *everything* little-endian, but this doesn't make it worse.

Hmm, why *don't* we just do everything LE, including the ring?

Mainly because when requirements are in doubt, simplicity wins, I think.

Cheers, Rusty.
Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.
On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:

We may still regret not doing *everything* little-endian, but this doesn't make it worse.

Hmm, why *don't* we just do everything LE, including the ring?

Mainly because when requirements are in doubt, simplicity wins, I think.

Well, I think the definition of simplicity is up for debate in this case... LE everywhere is much simpler than "it depends", IMHO.

-- Hollis Blanchard IBM Linux Technology Center
Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo [EMAIL PROTECTED] wrote:

Hi, I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit host). Got the following in my logs today:

Apr 21 17:55:01 buffy kernel: WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P 2.6.24-1-686 #1
Apr 21 17:55:01 buffy kernel: [f8d07a36] kvm_mmu_get_page+0x42d/0x447 [kvm]
Apr 21 17:55:01 buffy kernel: [f8d08cca] kvm_mmu_load+0xdf/0x15c [kvm]
Apr 21 17:55:01 buffy kernel: [f8affe41] vmx_queue_exception+0x0/0x33 [kvm_intel]
Apr 21 17:55:01 buffy kernel: [f8d05521] kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
Apr 21 17:55:01 buffy kernel: [f8d013aa] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
Apr 21 17:55:01 buffy kernel: [c0159078] delayacct_end+0x70/0x77
Apr 21 17:55:01 buffy kernel: [c015aa19] sync_page+0x0/0x3b
Apr 21 17:55:01 buffy kernel: [c0159388] __delayacct_blkio_end+0x5b/0x5f
Apr 21 17:55:01 buffy kernel: [c02bcaab] io_schedule+0x64/0x80
Apr 21 17:55:01 buffy kernel: [c011e07d] enqueue_entity+0x2b/0x3d
Apr 21 17:55:01 buffy kernel: [c0115343] apic_wait_icr_idle+0xe/0x15
Apr 21 17:55:01 buffy kernel: [c011e0a5] enqueue_task_fair+0x16/0x24
Apr 21 17:55:01 buffy kernel: [c011d643] enqueue_task+0x52/0x5d
Apr 21 17:55:01 buffy kernel: [c011de9e] resched_task+0x52/0x54
Apr 21 17:55:01 buffy kernel: [c011f459] try_to_wake_up+0x2b8/0x2c2
Apr 21 17:55:01 buffy kernel: [c011d47e] __wake_up_common+0x32/0x5c
Apr 21 17:55:01 buffy kernel: [c011eecc] __wake_up+0x32/0x42
Apr 21 17:55:01 buffy kernel: [c013e25c] wake_futex+0x3b/0x45
Apr 21 17:55:01 buffy kernel: [c013e4de] futex_wake+0x81/0xb0
Apr 21 17:55:01 buffy kernel: [c013f097] do_futex+0x77/0x983
Apr 21 17:55:01 buffy kernel: [c011d9ca] update_curr+0x62/0xef
Apr 21 17:55:01 buffy kernel: [c0103044] __switch_to+0x9d/0x11d
Apr 21 17:55:01 buffy kernel: [f8d012c6] kvm_vcpu_ioctl+0x0/0x34c [kvm]
Apr 21 17:55:01 buffy kernel: [c018285b] do_ioctl+0x1f/0x62
Apr 21 17:55:01 buffy kernel: [c0182ad5] vfs_ioctl+0x237/0x249
Apr 21 17:55:01 buffy kernel: [c0182b2c] sys_ioctl+0x45/0x5d
Apr 21 17:55:01 buffy kernel: [c0103e5e] sysenter_past_esp+0x6b/0xa1

Regards, Thomas.

As I got no reply, I guess it is a bad setup on my part. If that might help, this happened while I was doing a make -j on the webkit svn tree (i.e. a heavy C++ compilation workload).