Re: [kvm-devel] [patch 3/2] hotadd: lsi_scsi_init can fail

2008-04-22 Thread Avi Kivity
Chris Wright wrote:
 During hotadd of SCSI devices lsi_scsi_init() handles failed
 pci_device_register(), but qemu_system_hot_add_storage() will try to
 attach a drive anyway.  Handle this error case rather than generating
 a SEGV.

 Cc: Marcelo Tosatti [EMAIL PROTECTED]
 Signed-off-by: Chris Wright [EMAIL PROTECTED]
 ---
  qemu/hw/device-hotplug.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 --- a/qemu/hw/device-hotplug.c
 +++ b/qemu/hw/device-hotplug.c
 @@ -125,7 +125,7 @@ static PCIDevice *qemu_system_hot_add_st
      switch (type) {
      case IF_SCSI:
          opaque = lsi_scsi_init (pci_bus, -1);
 -        if (drive_idx >= 0)
 +        if (opaque && drive_idx >= 0)
              lsi_scsi_attach (opaque, drives_table[drive_idx].bdrv,
                               drives_table[drive_idx].unit);
          break;
   

It's not so opaque if you're testing it against NULL...

long term we want better error reporting here.
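
A minimal sketch of the pattern under discussion: check the result of the controller init before attaching a drive, and report the failure to the caller. The types and functions below (FakeController, fake_controller_init, hot_add_storage) are hypothetical stand-ins, not the QEMU definitions, and the error message is only an example of what better reporting might look like.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the controller state; not a QEMU type. */
typedef struct {
    int attached_units;
} FakeController;

/* Stand-in for an init routine that can fail: returns NULL when the
 * bus has no free slot left. */
static FakeController *fake_controller_init(int free_slots)
{
    static FakeController ctrl;

    if (free_slots <= 0)
        return NULL;
    ctrl.attached_units = 0;
    return &ctrl;
}

static void fake_controller_attach(FakeController *c)
{
    assert(c != NULL);   /* callers must have checked the init result */
    c->attached_units++;
}

/* Returns 0 on success, -1 if the controller could not be created. */
static int hot_add_storage(int free_slots, int drive_idx)
{
    FakeController *c = fake_controller_init(free_slots);

    if (!c) {
        fprintf(stderr, "hot add failed: no free PCI slot\n");
        return -1;
    }
    if (drive_idx >= 0)
        fake_controller_attach(c);
    return 0;
}
```

Returning an error code instead of silently skipping the attach lets the monitor command print a failure line rather than crash.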

-- 
Any sufficiently difficult bug is indistinguishable from a feature.


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Christian Borntraeger
Am Dienstag, 22. April 2008 schrieb Rusty Russell:
 [Christian, Hollis, how much is this ABI breakage going to hurt you?]

It is ok for s390 at the moment. We are still working on making userspace 
ready and I plan to change the guest-host for s390 anyway. I will try to make 
these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26. The main 
reason is that we are currently limited to around 80 devices. I am not sure 
whether I should change the allocation of the virtqueues and descriptors to 
guest memory as well. 

Back to your patch:
I still have some ideas about virtio between little endian and big endian 
systems, but it requires more and different marshalling anyway - even on 
driver level. No idea yet how to solve that properly.

Consider your change
Acked-by: Christian Borntraeger [EMAIL PROTECTED]
given that you fix the issue below:

[...]
 --- a/drivers/virtio/virtio_balloon.c Sun Apr 20 14:41:02 2008 +1000
 +++ b/drivers/virtio/virtio_balloon.c Sun Apr 20 15:07:45 2008 +1000
 @@ -155,9 +155,9 @@ static inline s64 towards_target(struct 
  static inline s64 towards_target(struct virtio_balloon *vb)
  {
   u32 v;
 -	__virtio_config_val(vb->vdev,
 -			    offsetof(struct virtio_balloon_config, num_pages),
 -			    &v);
 +	vb->vdev->config->get(vb->vdev,
 +			      offsetof(struct virtio_balloon_config, num_pages),
 +			      &v);

this is missing a sizeof(v), no?
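
To illustrate the point about the missing length: a config accessor that takes an offset and a buffer also needs to know how many bytes to copy. The sketch below is an assumption-laden mock, not the kernel's virtio API; demo_balloon_config, demo_config_get and demo_read_num_pages are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical config layout, loosely modelled on a balloon device. */
struct demo_balloon_config {
    uint32_t num_pages;
    uint32_t actual;
};

/* Hypothetical accessor: copy len bytes starting at offset out of the
 * device's config space into buf. */
static void demo_config_get(const void *config, size_t config_size,
                            size_t offset, void *buf, size_t len)
{
    /* Without len the callee could not bound the copy at all. */
    assert(offset <= config_size && len <= config_size - offset);
    memcpy(buf, (const uint8_t *)config + offset, len);
}

static uint32_t demo_read_num_pages(const struct demo_balloon_config *cfg)
{
    uint32_t v;

    demo_config_get(cfg, sizeof(*cfg),
                    offsetof(struct demo_balloon_config, num_pages),
                    &v, sizeof(v));   /* the sizeof(v) asked about above */
    return v;
}
```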

Christian



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Avi Kivity
Jamie Lokier wrote:
 Avi Kivity wrote:
   
 At such a tiny difference, I'm wondering why Linux-AIO exists at all,
 as it complicates the kernel rather a lot.  I can see the theoretical
 appeal, but if performance is so marginal, I'm surprised it's in
 there.
   
 Linux aio exists, but that's all that can be said for it.  It works 
 mostly for raw disks, doesn't integrate with networking, and doesn't 
 advance at the same pace as the rest of the kernel.  I believe only 
 databases use it (and a userspace filesystem I wrote some time ago).
 

 And video streaming on some embedded devices with no MMU!  (Due to the
 page cache heuristics working poorly with no MMU, sustained reliable
 streaming is managed with O_DIRECT and the app managing cache itself
 (like a database), and that needs AIO to keep the request queue busy.
 At least, that's the theory.)

   

Could use threads as well, no?

 I'm also surprised the Glibc implementation of AIO using ordinary
 threads is so close to it.  
   
 Why are you surprised?
 

 Because I've read that Glibc AIO (which uses a thread pool) is a
 relatively poor performer as AIO implementations go, and is only there
 for API compatibility, not suggested for performance.

 But I read that quite a while ago, perhaps it's changed.

   

It's me at fault here.  I just assumed that because it's easy to do aio 
in a thread pool efficiently, that's what glibc does.

Unfortunately the code does some ridiculous things like not servicing 
multiple requests on a single fd in parallel.  I see absolutely no 
reason for it (the code says "fight for resources").

So my comments only apply to linux-aio vs a sane thread pool.  Sorry for 
spreading confusion.

 Actually the glibc implementation could be improved from what I've 
 heard.  My estimates are for a thread pool implementation, but there is 
 no reason why glibc couldn't achieve exactly the same performance.
 

 Erm...  I thought you said it _does_ achieve nearly the same
 performance, not that it _could_.

 Do you mean it could achieve exactly the same performance by using
 Linux AIO when possible?

   

It could and should.  It probably doesn't.

A simple thread pool implementation could come within 10% of Linux aio 
for most workloads.  It will never be exactly the same, but for small 
numbers of disks, close enough.
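
For readers unfamiliar with the idea, here is a rough sketch of what "aio in a thread pool" could look like: worker threads pull read requests off a queue and issue blocking pread() calls, with the lock dropped around the I/O so that several requests on the same fd run in parallel. This is not glibc's or QEMU's implementation; io_req, io_pool and the io_pool_* functions are hypothetical names, and the pool size is arbitrary.

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict -std= modes */
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define POOL_THREADS 4      /* arbitrary size for the sketch */

/* Hypothetical request: one positional read, completed by a worker. */
struct io_req {
    int fd;
    void *buf;
    size_t len;
    off_t offset;
    ssize_t result;         /* pread() return value */
    struct io_req *next;    /* queue link, owned by the pool */
};

struct io_pool {
    pthread_mutex_t lock;
    pthread_cond_t work_ready;  /* a request was queued, or shutdown */
    pthread_cond_t work_done;   /* a request completed */
    struct io_req *head, *tail;
    int pending;                /* queued plus in-flight requests */
    int shutting_down;
    int nthreads;
    pthread_t threads[POOL_THREADS];
};

static void *io_worker(void *arg)
{
    struct io_pool *p = arg;

    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (!p->head && !p->shutting_down)
            pthread_cond_wait(&p->work_ready, &p->lock);
        if (!p->head)
            break;              /* shutting down and nothing left */
        struct io_req *r = p->head;
        p->head = r->next;
        if (!p->head)
            p->tail = NULL;
        /* Drop the lock around the blocking call so that requests,
         * including several on the same fd, run in parallel. */
        pthread_mutex_unlock(&p->lock);
        r->result = pread(r->fd, r->buf, r->len, r->offset);
        pthread_mutex_lock(&p->lock);
        p->pending--;
        pthread_cond_broadcast(&p->work_done);
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

static int io_pool_init(struct io_pool *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->work_ready, NULL);
    pthread_cond_init(&p->work_done, NULL);
    p->head = p->tail = NULL;
    p->pending = p->shutting_down = p->nthreads = 0;
    for (int i = 0; i < POOL_THREADS; i++) {
        if (pthread_create(&p->threads[i], NULL, io_worker, p) != 0) {
            fprintf(stderr, "pthread_create failed\n");
            break;
        }
        p->nthreads++;
    }
    return p->nthreads > 0 ? 0 : -1;
}

static void io_pool_submit(struct io_pool *p, struct io_req *r)
{
    assert(p && r);
    r->next = NULL;
    pthread_mutex_lock(&p->lock);
    if (p->tail)
        p->tail->next = r;
    else
        p->head = r;
    p->tail = r;
    p->pending++;
    pthread_cond_signal(&p->work_ready);
    pthread_mutex_unlock(&p->lock);
}

/* Block until every submitted request has completed. */
static void io_pool_wait_all(struct io_pool *p)
{
    pthread_mutex_lock(&p->lock);
    while (p->pending > 0)
        pthread_cond_wait(&p->work_done, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

static void io_pool_destroy(struct io_pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->shutting_down = 1;
    pthread_cond_broadcast(&p->work_ready);
    pthread_mutex_unlock(&p->lock);
    for (int i = 0; i < p->nthreads; i++)
        pthread_join(p->threads[i], NULL);
}
```

A caller fills in fd, buf, len and offset for each io_req, submits them, and calls io_pool_wait_all() before reading result; a real implementation would add per-request completion notification instead of a global wait.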

 And then, I'm wondering why use AIO at
 all: it suggests QEMU would run about as fast doing synchronous I/O in
 a few dedicated I/O threads.
   
 Posix aio is the unix API for this, why not use it?
 

 Because far more host platforms have threads than have POSIX AIO.  (I
 suspect both options will end up supported in the end, as dedicated
 I/O threads were already suggested for other things.)
   

Agree.

   
 Also, I'd presume that those that need 10K IOPS and above will not place 
 their high throughput images on a filesystem; rather on a separate SAN 
 LUN.
 
 Does the separate LUN make any difference?  I thought O_DIRECT on a
 filesystem was meant to be pretty close to block device performance.
   
 On a good extent-based filesystem like XFS you will get good performance 
 (though more cpu overhead due to needing to go through additional 
 mapping layers).  Old clunkers like ext3 will require additional seeks or 
 a ton of cache (1 GB per 1 TB).
 

 Hmm.  Thanks.  I may consider switching to XFS now

   

I'm rooting for btrfs myself.

-- 
error compiling committee.c: too many arguments to function




[kvm-devel] [Deadline Extended] Call for Presentations: KVM Forum 2008

2008-04-22 Thread Avi Kivity
[Note: KVM Forum registration is now open at
http://kforum.qumranet.com/KVMForum/about_kvmforum.php]

[The deadline for submitting presentations has been extended by two 
weeks, until May 4th]

This is the Call for Presentations for the second annual KVM Developer's
Forum, to be held on June 10-13, 2008, in Napa, California, USA [1].  We
are looking for presentations on KVM development, quality assurance,
management, security, interoperability, architecture support, and
interesting use cases.  Presentations are 50 minutes in length; there
are also 25-minute mini-presentation slots available.

KVM Forum presentations are an excellent way to inform the KVM
development community about your work, and to gather valuable feedback
about your approach.

Please send your presentation proposal to the KVM Forum 2008 Content
Committee at [EMAIL PROTECTED] by May 4th.

KVM Forum 2008 Content Committee:
Dor Laor
Anthony Liguori
Avi Kivity

[1] http://kforum.qumranet.com/KVMForum/about_kvmforum.php

-- 
Any sufficiently difficult bug is indistinguishable from a feature.




[kvm-devel] KVM Test result, kernel 6cf5973.., userspace 5157358.. -- One Issue Fixed

2008-04-22 Thread Yunfeng Zhao
Hi All,

This is today's KVM test result against kvm.git 
6cf59734fc9bc89954d0157524eea156c2f9a5ab and kvm-userspace.git 
5157358e1946770847271e3602f1adae85002871.

One Issue Fixed:

1. booting smp windows guests has 30% chance of hang
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1910923&group_id=180599

Two Old Issues:

1. Booting four guests likely fails
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1919354&group_id=180599

2. Cannot boot guests with hugetlbfs
https://sourceforge.net/tracker/?func=detail&atid=893831&aid=1941302&group_id=180599


Test environment

Platform     Woodcrest
CPU          4
Memory size  8G

Details

IA32-pae:
1. boot guest with 256M memory                  PASS
2. boot two windows xp guest                    PASS
3. boot 4 same guest in parallel                PASS
4. boot linux and windows guest in parallel     PASS
5. boot guest with 1500M memory                 PASS
6. boot windows 2003 with ACPI enabled          PASS
7. boot Windows xp with ACPI enabled            PASS
8. boot Windows 2000 without ACPI               PASS
9. kernel build on SMP linux guest              PASS
10. LTP on linux guest                          PASS
11. boot base kernel linux                      PASS
12. save/restore 32-bit HVM guests              PASS
13. live migration 32-bit HVM guests            PASS
14. boot SMP Windows xp with ACPI enabled       PASS
15. boot SMP Windows 2003 with ACPI enabled     PASS
16. boot SMP Windows 2000 with ACPI enabled     PASS
 

IA32e:
1. boot four 32-bit guest in parallel                       PASS
2. boot four 64-bit guest in parallel                       PASS
3. boot 4G 64-bit guest                                     PASS
4. boot 4G pae guest                                        PASS
5. boot 32-bit linux and 32 bit windows guest in parallel   PASS
6. boot 32-bit guest with 1500M memory                      PASS
7. boot 64-bit guest with 1500M memory                      PASS
8. boot 32-bit guest with 256M memory                       PASS
9. boot 64-bit guest with 256M memory                       PASS
10. boot two 32-bit windows xp in parallel                  PASS
11. boot four 32-bit different guest in para                PASS
12. save/restore 64-bit linux guests                        PASS
13. save/restore 32-bit linux guests                        PASS
14. boot 32-bit SMP windows 2003 with ACPI enabled          PASS
15. boot 32-bit SMP Windows 2000 with ACPI enabled          PASS
16. boot 32-bit SMP Windows xp with ACPI enabled            PASS
17. boot 32-bit Windows 2000 without ACPI                   PASS
18. boot 64-bit Windows xp with ACPI enabled                PASS
19. boot 32-bit Windows xp without ACPI                     PASS
20. boot 64-bit UP vista                                    PASS
21. boot 64-bit SMP vista                                   PASS
22. kernel build in 32-bit linux guest OS                   PASS
23. kernel build in 64-bit linux guest OS                   PASS
24. LTP on 32-bit linux guest OS                            PASS
25. LTP on 64-bit linux guest OS                            PASS
26. boot 64-bit guests with ACPI enabled                    PASS
27. boot 32-bit x-server                                    PASS
28. boot 64-bit SMP windows XP with ACPI enabled            PASS
29. boot 64-bit SMP windows 2003 with ACPI enabled          PASS
30. live migration 64bit linux guests                       PASS
31. live migration 32bit linux guests                       PASS
32. reboot 32bit windows xp guest                           PASS
33. reboot 32bit windows xp guest                           PASS
 
 
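Report Summary on IA32-pae
Summary Test Report of Last Session
=====================================================================
                    Total   Pass    Fail    NoResult   Crash
=====================================================================
control_panel       7       5       2       0          0
Restart             2       2       0       0          0
gtest               15      15      0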

Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Nguyen Anh Quynh
Hi,

I am thinking about combining this ROM with extboot. Both ROMs
are about booting, so I think that is reasonable. We would then have
only 1 ROM that supports both external boot and Linux boot.

Is that desirable or not?

Thanks,
Quynh

On 4/21/08, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:
 Hmm, the last patch includes a binary. So please take this patch instead.

  Thanks,

 Q

  # diffstat linuxboot1.diff
   Makefile             |   13 -
   linuxboot/Makefile   |   40 +++
   linuxboot/boot.S     |   54 +
   linuxboot/farvar.h   |  130 +++
   linuxboot/rom.c      |  104
   linuxboot/signrom.c  |  128 ++
   linuxboot/util.h     |   69 +++
   qemu/Makefile        |    3 -
   qemu/Makefile.target |    2
   qemu/hw/linuxboot.c  |   39 +++
   qemu/hw/pc.c         |   22 +++-
   qemu/hw/pc.h         |    5 +
   12 files changed, 600 insertions(+), 9 deletions(-)






  On Mon, Apr 21, 2008 at 12:33 PM, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:
   Forgot to say that this patch is against kvm-66.
  
Thanks,
Q
  
  
  
On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh [EMAIL PROTECTED] 
 wrote:
 Hi,

  This should be submitted to upstream (but not to kvm-devel list), but
  this is only the test code that I want to quickly send out for
  comments. In case it looks OK, I will send it to upstream later.

  Inspired by extboot and conversations with Anthony and HPA, this
  linuxboot option ROM is a simple option ROM that intercepts int19 in
  order to execute linux setup code. This approach eliminates the need
  to manipulate the boot sector for this purpose.

  To test it, just load a linux kernel with your KVM/QEMU image using
  the -kernel option in the normal way.

  I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
  Ubuntu 8.04.

  Thanks,
  Quynh


  # diffstat linuxboot1.diff
   Makefile             |   13 -
   linuxboot/Makefile   |   40 +++
   linuxboot/boot.S     |   54 +
   linuxboot/farvar.h   |  130 +++
   linuxboot/rom.c      |  104
   linuxboot/signrom    |binary
   linuxboot/signrom.c  |  128 ++
   linuxboot/util.h     |   69 +++
   qemu/Makefile        |    3 -
   qemu/Makefile.target |    2
   qemu/hw/linuxboot.c  |   39 +++
   qemu/hw/pc.c         |   22 +++-
   qemu/hw/pc.h         |    5 +
   13 files changed, 600 insertions(+), 9 deletions(-)

  





Re: [kvm-devel] [PATCH 6/6] kvm: qemu: Enable EPT support for real mode

2008-04-22 Thread Avi Kivity
Yang, Sheng wrote:
 From 73c33765f3d879001818cd0719038c78a0c65561 Mon Sep 17 00:00:00 2001
 From: Sheng Yang [EMAIL PROTECTED]
 Date: Fri, 18 Apr 2008 17:15:39 +0800
 Subject: [PATCH] kvm: qemu: Enable EPT support for real mode

 This patch builds an identity page table on the last page of the VGA bios, and 
 uses it as the guest page table in nonpaging mode for EPT.

   


Doing this in qemu means older versions of qemu can't work with an 
ept-enabled kernel.  Also, placing the table in the vga bios might 
conflict with video card assignment to a guest.

Suggest placing this near the realmode tss (see vmx.c:init_rmode_tss()) 
which serves a similar function.
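
As a rough illustration of what such an identity page table could contain, the sketch below fills a single page with 1024 page-directory entries, each mapping a 4 MiB page onto itself. It assumes 32-bit non-PAE paging with PSE; the thread does not state which format the patch uses, and build_identity_map is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* x86 page-directory entry flag bits (32-bit, non-PAE paging). */
#define PDE_PRESENT  (1u << 0)
#define PDE_RW       (1u << 1)
#define PDE_USER     (1u << 2)
#define PDE_ACCESSED (1u << 5)
#define PDE_DIRTY    (1u << 6)
#define PDE_PSE      (1u << 7)   /* entry maps a 4 MiB page directly */

#define PDE_COUNT 1024u          /* 1024 * 4 MiB covers the 4 GiB space */

/* Fill one 4 KiB page so that guest linear address X maps to guest
 * physical address X for the whole 32-bit address space. */
static void build_identity_map(uint32_t pd[PDE_COUNT])
{
    const uint32_t flags = PDE_PRESENT | PDE_RW | PDE_USER |
                           PDE_ACCESSED | PDE_DIRTY | PDE_PSE;

    assert(pd);
    for (uint32_t i = 0; i < PDE_COUNT; i++)
        pd[i] = (i << 22) | flags;   /* bits 31..22 hold the frame base */
}
```

Because the whole table fits in one page, it can live anywhere the host reserves a spare guest page, which is what makes the placement question above a matter of policy rather than size.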

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Alexander Graf
I believe that's the way to go. If you have spare time on your hands,  
feel free to integrate my multiboot patches as well.

Alex

On Apr 22, 2008, at 11:07 AM, Nguyen Anh Quynh wrote:

 Hi,

 I am thinking about combining this ROM with extboot. Both ROMs
 are about booting, so I think that is reasonable. We would then have
 only 1 ROM that supports both external boot and Linux boot.

 Is that desirable or not?

 Thanks,
 Quynh

 On 4/21/08, Nguyen Anh Quynh [EMAIL PROTECTED] wrote:
 Hmm, the last patch includes a binary. So please take this patch  
 instead.

 Thanks,

 Q

 # diffstat linuxboot1.diff
  Makefile             |   13 -
  linuxboot/Makefile   |   40 +++
  linuxboot/boot.S     |   54 +
  linuxboot/farvar.h   |  130 +++
  linuxboot/rom.c      |  104
  linuxboot/signrom.c  |  128 ++
  linuxboot/util.h     |   69 +++
  qemu/Makefile        |    3 -
  qemu/Makefile.target |    2
  qemu/hw/linuxboot.c  |   39 +++
  qemu/hw/pc.c         |   22 +++-
  qemu/hw/pc.h         |    5 +
  12 files changed, 600 insertions(+), 9 deletions(-)






 On Mon, Apr 21, 2008 at 12:33 PM, Nguyen Anh Quynh  
 [EMAIL PROTECTED] wrote:
 Forgot to say that this patch is against kvm-66.

 Thanks,
 Q



 On Mon, Apr 21, 2008 at 12:32 PM, Nguyen Anh Quynh  
 [EMAIL PROTECTED] wrote:
 Hi,

 This should be submitted to upstream (but not to kvm-devel list),  
 but
 this is only the test code that I want to quickly send out for
 comments. In case it looks OK, I will send it to upstream later.

 Inspired by extboot and conversations with Anthony and HPA, this
 linuxboot option ROM is a simple option ROM that intercepts int19  
 in
 order to execute linux setup code. This approach eliminates the  
 need
 to manipulate the boot sector for this purpose.

 To test it, just load a linux kernel with your KVM/QEMU image using
 the -kernel option in the normal way.

 I successfully compiled and tested it with kvm-66 on Ubuntu 7.10,
 guest Ubuntu 8.04.

 Thanks,
 Quynh


 # diffstat linuxboot1.diff
  Makefile             |   13 -
  linuxboot/Makefile   |   40 +++
  linuxboot/boot.S     |   54 +
  linuxboot/farvar.h   |  130 +++
  linuxboot/rom.c      |  104 ++++
  linuxboot/signrom    |binary
  linuxboot/signrom.c  |  128 ++
  linuxboot/util.h     |   69 +++
  qemu/Makefile        |    3 -
  qemu/Makefile.target |    2
  qemu/hw/linuxboot.c  |   39 +++
  qemu/hw/pc.c         |   22 +++-
  qemu/hw/pc.h         |    5 +
  13 files changed, 600 insertions(+), 9 deletions(-)










[kvm-devel] Repeat tenders

2008-04-22 Thread Конкурс
Placing state and municipal orders by tender: contentious issues in 
current practice

April 25, 2008, Moscow
Seminar program

- Analysis of specific arbitration cases that are the most common 
(typical) in court and arbitration practice of challenging the placement of 
state and municipal orders. 
- Analytical commentary on disputed situations arising both during 
tenders and during the conclusion and performance of the state 
(municipal) contracts awarded as a result. 
- The study of tenders is comprehensive, since it also covers the 
procedural aspects of hearing disputes over the invalidity of tenders. 
- Answers to the seminar's key questions are supported by references to 
applied research and to foreign and international practice in 
conducting auctions and competitive tenders. 
- For each topic of the program, problem situations from the participants' own 
practice can be analyzed; the author gives specific recommendations during the 
discussion. 

Key questions of the program 
∙ How many winners can a tender have? May participants combine their 
bids before or during the tender? 
∙ Who is entitled to file a claim to have a tender declared invalid? How is 
the interest required to challenge tender results determined? 
∙ What does the invalidity of an auction or competitive tender mean: voidable 
or void? 
∙ How should competition among participants be assessed? Are there grounds to 
declare a tender failed if there were two participants? 
∙ How is a bid to take part in an auction (competitive tender) secured? May 
the tender organizer accept bank guarantees, promissory notes or cash on 
pledge terms from participants? 
∙ What are the fundamental differences between the legal status of a tender 
participant and that of a participant in order placement? 
∙ How are the criteria for tender (auction) selection formulated, and may 
they be departed from when determining the winner? 
∙ What should be done when identical bids are received from several 
participants? 
∙ Must a repeat tender be held if the winner fails to perform the contract 
concluded at the tender or avoids concluding it? 
∙ May a court, having found violations of the law, uphold the results of a 
tender for placing a state or municipal order? 

Core topics of the program 
∙ Advantages and disadvantages of concluding contracts by tender. 
Main types of auctions and competitive tenders. 
∙ Legal problems of participation in tenders by contractual associations of 
participants and by affiliated persons. 
∙ Procedural aspects of hearing disputes over the invalidity of tenders. 
∙ Grounds for declaring a tender failed. 
∙ Legal significance of securing an auction or tender bid. 
∙ Statutory requirements for the notice of a tender; characteristics of 
improper notices. 
∙ Defining the criteria for tender or auction selection; the legal framework 
for the work of the tender (auction) commission. 
∙ Cancelling a tender, declaring a tender invalid, declaring a tender 
failed: differences between the procedures and their legal consequences. 
∙ The relationship between administrative and judicial means of protecting the 
rights and lawful interests of participants in order placement.

Duration: 10:00 to 17:00 (with a lunch break and a coffee break).
Venue: Moscow, 5 minutes' walk from Akademicheskaya metro station.
Fee: 4900 rubles (including VAT). 
(The fee includes handouts, a coffee break, and lunch in a restaurant.)

If you are unable to attend the seminar, we offer its video version on 
DVD/CD or videocassette (the author's handouts are included). 
Price of the video course: 3500 rubles, including VAT.

To register for the seminar, fax us your organization's details, the 
seminar topic and date, the participants' full names, and a contact phone and 
fax number. 
To order the video course, fax us your organization's details, the topic of 
the video course, the medium (DVD or CD), phone, fax, contact person, and 
exact delivery address. 
 
For more information and to register:
tel/fax: (495) 543-88-46








Re: [kvm-devel] Some FAQ questions

2008-04-22 Thread Avi Kivity
Damjan wrote:
 I have some questions for the FAQ, about the configuration of Linux guests:
  a) is swap needed in the guest (I'd say no, but..)
  b) what filesystem is best for a guest
  c) what io scheduler in the guest (noop? or cfq)
  d) are there any runtime kernel tweaks for the guest (/proc/sys)?
   

For the first four questions, do whatever you'd do for a similarly 
configured host running a similar workload.  It's fine to use cfq as the 
I/O scheduler.

  e) suggested linux kernel source configuration (.config)

With newer kernels, be sure to enable virtio drivers, kvm clock, and kvm 
mmu paravirtualization.

-- 
error compiling committee.c: too many arguments to function


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver

2008-04-22 Thread Christian Ehrhardt
Jerone Young wrote:
 virtio net device does not appear to show itself in the guest. I'm
 curious of what options I may be missing. Here is my config
 
 CONFIG_VIRTIO_NET=y
[..]
 CONFIG_VIRTUALIZATION=y
 CONFIG_KVM=y
 CONFIG_KVM_BOOKE_HOST=y
 CONFIG_VIRTIO=y
 CONFIG_VIRTIO_RING=y
 CONFIG_VIRTIO_PCI=y 

That should be enough in .config, but be aware that you also need the proper 
qemu command line, like 
  -net nic,model=virtio,macaddr=00:00:00:00:00:AA -net tap
as well as an /etc/qemu-ifup script (I sent one for our purpose to 
kvm-ppc-devel a while ago).
You also need some tools installed, e.g. brctl,
and you need to create /dev/net/tun in the host because we have no dynamic /dev.

If you have done all that already and it is still not working, you should 
continue with Anthony's suggestion and send what lspci shows you. If you want 
to be complete, use lspci -vvvx.
It may also be worth adding "debug" to the kernel command line of the guest and 
attaching a full dmesg to the same response, just in case someone wants to look 
at driver messages.

-- 

Grüsse / regards, 
Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization



Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver

2008-04-22 Thread Avi Kivity
Jerone Young wrote:
 What I am asking is do I have all the proper options in my kernel config
 set to use it?

   

I have:

[EMAIL PROTECTED] linux-2.6 (kvm-updates-2.6.26)]$ grep VIRTIO .config
CONFIG_VIRTIO_BLK=m
CONFIG_VIRTIO_NET=m
CONFIG_VIRTIO=m
CONFIG_VIRTIO_RING=m
CONFIG_VIRTIO_PCI=m
CONFIG_VIRTIO_BALLOON=m

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [patch] qemu/ia64 include prototype for qemu_mallocz

2008-04-22 Thread Avi Kivity
Jes Sorensen wrote:
 Hi,

 This one fixes a segfault problem I am seeing on ia64 due to the
 malloc'ed address being truncated to 32 bit.
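
The failure mode described here can be modelled in a few lines: when a function returning a pointer has no prototype in scope, older C compilers assume it returns int, so on a 64-bit target the upper half of the address is lost. The sketch below only simulates that truncation with integers; through_implicit_int and demo_truncation are hypothetical names, and the high address is just an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Model a 64-bit address squeezed through a 32-bit return value, as
 * happens when the callee is assumed to return int. */
static uint64_t through_implicit_int(uint64_t addr)
{
    uint32_t low = (uint32_t)(addr & 0xFFFFFFFFu);  /* upper 32 bits lost */

    return (uint64_t)low;
}

static void demo_truncation(void)
{
    /* An address below 4 GiB happens to survive the round trip... */
    assert(through_implicit_int(0x0000000012345678ull) ==
           0x0000000012345678ull);
    /* ...but a high address does not, and dereferencing the result
     * would fault. */
    assert(through_implicit_int(0x6000000000004000ull) !=
           0x6000000000004000ull);
}
```

Including the header that declares the allocator gives the compiler the real return type, which is why adding a prototype is enough to fix the crash.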


Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [patch 0/2] pci_register_device can fail

2008-04-22 Thread Avi Kivity
Chris Wright wrote:
 The pci hotadd patches make it easy to trigger segfaults when adding more
 devices than a single PCI bus can handle.  The following 2 patches fix the
 pci nic devices and virtio-blk device.  Now the following:

   OK bus 0, slot 31, function 0 (devfn 248)
   (qemu) pci_add 0 nic model=virtio
   Segmentation fault

   OK bus 0, slot 31, function 0 (devfn 248)
   (qemu) pci_add 0 storage file=/mnt/disk1,if=virtio
   Segmentation fault

 become:

   OK bus 0, slot 31, function 0 (devfn 248)
   (qemu) pci_add 0 nic model=virtio
   qemu: Unable to initialze NIC: virtio
   failed to add model=virtio

   OK bus 0, slot 31, function 0 (devfn 248)
   (qemu) pci_add 0 storage file=/mnt/disk1,if=virtio
   failed to add file=/mnt/disk1,if=virtio
   

Applied all three, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Avi Kivity
Rusty Russell wrote:
 [Christian, Hollis, how much is this ABI breakage going to hurt you?]

 A recent proposed feature addition to the virtio block driver revealed
 some flaws in the API, in particular how easy it is to break big
 endian machines.

 The virtio config space was originally chosen to be little-endian,
 because we thought the config might be part of the PCI config space
 for virtio_pci.  It's actually a separate mmio region, so that
 argument holds little water; as only x86 is currently using the virtio
 mechanism, we can change this (but must do so now, before the
 impending s390 and ppc merges).

   

This will probably annoy Hollis, who has guests that can go both ways.
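
To make the trade-off concrete, here is a small sketch of the two ways a driver could read a 32-bit config field. With a fixed little-endian layout every guest decodes the bytes explicitly; with a guest-endian layout the bytes are already in native order and a plain copy suffices. The function names are hypothetical and this is not the virtio API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed little-endian layout: decode byte by byte, so the result is
 * the same on little- and big-endian guests. */
static uint32_t config_read_le32(const uint8_t *p)
{
    assert(p);
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Guest-endian layout: the host has already stored the value in the
 * guest's native byte order, so the guest just copies it. */
static uint32_t config_read_native32(const uint8_t *p)
{
    uint32_t v;

    assert(p);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

The guest-endian variant moves the conversion burden to the host, which has to know the guest's byte order; that is the difficulty for a guest that can run in either mode.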

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] Some FAQ questions

2008-04-22 Thread Luca Tettamanti
On Tue, Apr 22, 2008 at 1:10 PM, Avi Kivity [EMAIL PROTECTED] wrote:
 Damjan wrote:
   I have some questions for the FAQ, about the configuration of Linux guests:
a) is swap needed in the guest (I'd say no, but..)
b) what filesystem is best for a guest
c) what io scheduler in the guest (noop? or cfq)
d) are there any runtime kernel tweaks for the guest (/proc/sys)?

  For the first four questions, do whatever you'd do for a similarly
  configured host running a similar workload.  It's fine to use cfq as the
  I/O scheduler.

Is cfq still fair in the guest? The VM re-dispatches the requests (at
least when using QEMU IDE) and the host can reschedule them at will.

Luca



Re: [kvm-devel] Some FAQ questions

2008-04-22 Thread Avi Kivity
Luca Tettamanti wrote:
 Is cfq still fair in the guest? The VM re-dispatches the requests (at
 least when using QEMU IDE) and the host can reschedule them at will.
   

The same problem occurs (to a lesser extent) in non-virtualized 
environments; disks (and esp. array controllers) also have their own I/O 
schedulers.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Fix missing decleration for kvm_enabled() in qemu for target-ppc/helper.c

2008-04-22 Thread Avi Kivity
Jerone Young wrote:
 A recent change requires target-ppc/helper.c to include qemu-kvm.h to
 get the declaration of kvm_enabled(). This fixes it so things compile
 again.

   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
 invalidate_range_start {
   spin_lock(&kvm->mmu_lock);
 
   kvm->invalidate_range_count++;
   rmap-invalidate of sptes in range
 

write_seqlock; write_sequnlock;

   spin_unlock(&kvm->mmu_lock)
 }
 
 invalidate_range_end {
   spin_lock(&kvm->mmu_lock);
 
   kvm->invalidate_range_count--;


write_seqlock; write_sequnlock;

 
   spin_unlock(&kvm->mmu_lock)
 }

Robin correctly pointed out by PM that there should be a seqlock in
range_begin/end too, as corrected above.

I guess it's better to use an explicit sequence counter, so we avoid the
useless spinlock inside write_seqlock (mmu_lock is already enough in
all places) and can increase it with a single op, += 2, in
range_begin/end. The above is a lower-performance version of the final
locking, but it is simpler to read.



Re: [kvm-devel] [PATCH 1/2]kvmtrace: add event mask support (kernel part)

2008-04-22 Thread Avi Kivity
Liu, Eric E wrote:
 From a1b062cfd4d1a91c447b680ac9a2250fe55119ec Mon Sep 17 00:00:00 2001
 From: Feng (Eric) Liu [EMAIL PROTECTED]
 Date: Wed, 16 Apr 2008 05:29:37 -0400
 Subject: [PATCH] KVM: trace: Add event mask support.

 Allow user space application to specify one or more
 filter masks to limit the events being captured via it.

   

Sorry about the late review.

 --- a/include/linux/kvm.h
 +++ b/include/linux/kvm.h
 @@ -18,6 +18,8 @@
  struct kvm_user_trace_setup {
   __u32 buf_size; /* sub_buffer size of each per-cpu */
   __u32 buf_nr; /* the number of sub_buffers of each per-cpu */
 + __u16 cat_mask; /* the tracing categories are enabled */
 + __u64 act_bitmap[16]; /* the actions are enabled for each
 category */
  };
   

The structures will be laid out differently on 32-bit and 64-bit.  This 
is important since we'd like 32-bit userspace to work correctly with a 
64-bit kernel.  The usual solution is to insert a __u16 pad1[3]; 
between the two fields.
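
As a rough illustration of the padding suggestion, here is a userspace sketch of the padded layout. It is not the kernel header itself: the struct name carries a hypothetical _sketch suffix, and the stdint.h types stand in for the kernel's __u16/__u32/__u64. Without pad1, an i386 build aligns the 64-bit array to 4 bytes and places it at offset 12, while an x86-64 build places it at offset 16; with the explicit padding both agree on 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace stand-in for the patched struct kvm_user_trace_setup.
 * uint16_t/uint32_t/uint64_t replace the kernel's __u16/__u32/__u64,
 * and the _sketch suffix marks the name as hypothetical.
 */
struct kvm_user_trace_setup_sketch {
	uint32_t buf_size;		/* offset 0 */
	uint32_t buf_nr;		/* offset 4 */
	uint16_t cat_mask;		/* offset 8 */
	uint16_t pad1[3];		/* offsets 10..15: explicit padding */
	uint64_t act_bitmap[16];	/* offset 16 for 4- or 8-byte u64 alignment */
};

/* Returns 1 when the layout matches the ABI-independent expectation. */
static int trace_setup_layout_is_stable(void)
{
	return offsetof(struct kvm_user_trace_setup_sketch, cat_mask) == 8 &&
	       offsetof(struct kvm_user_trace_setup_sketch, pad1) == 10 &&
	       offsetof(struct kvm_user_trace_setup_sketch, act_bitmap) == 16 &&
	       sizeof(struct kvm_user_trace_setup_sketch) == 144;
}
```

Building the same check with -m32 and -m64 should give the same answer, which is the property a 32-bit userspace talking to a 64-bit kernel depends on.
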

Otherwise, the patch seems fine.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
  invalidate_range_start {
  spin_lock(&kvm->mmu_lock);
  
  kvm->invalidate_range_count++;
  rmap-invalidate of sptes in range
  
 
   write_seqlock; write_sequnlock;

I don't think you need it here since invalidate_range_count is already
elevated which will accomplish the same effect.

Thanks,
Robin



Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
 On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
  On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
   invalidate_range_start {
 spin_lock(&kvm->mmu_lock);
   
 kvm->invalidate_range_count++;
 rmap-invalidate of sptes in range
   
  
  write_seqlock; write_sequnlock;
 
 I don't think you need it here since invalidate_range_count is already
 elevated which will accomplish the same effect.

Agreed, seqlock only in range_end should be enough. BTW, the fact
seqlock is needed regardless of invalidate_page existing or not,
really makes invalidate_page a no brainer not just from the core VM
point of view, but from the driver point of view too. The
kvm_page_fault logic would be the same even if I remove
invalidate_page from the mmu notifier patch but it'd run slower both
when armed and disarmed.
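
To make the counter-plus-sequence idea concrete, here is a minimal single-threaded sketch. It is not the KVM implementation: every name below is hypothetical, and the mmu_lock that would protect these fields in the thread's pseudo-code is left out. It only shows why a page fault that sampled the sequence before an invalidate must retry afterwards, with the sequence bumped in range_end only.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical per-VM bookkeeping. In the thread's pseudo-code these
 * fields are protected by kvm->mmu_lock, which this sketch omits.
 */
struct invalidate_state {
	int range_count;	/* range invalidations currently in flight */
	unsigned long seq;	/* bumped each time a range invalidation ends */
};

static void sketch_range_start(struct invalidate_state *s)
{
	s->range_count++;	/* sptes in the range would be dropped here */
}

static void sketch_range_end(struct invalidate_state *s)
{
	s->range_count--;
	s->seq++;
}

/* Page fault, step 1: sample the sequence before looking up the page. */
static unsigned long sketch_fault_begin(const struct invalidate_state *s)
{
	return s->seq;
}

/* Page fault, step 2: map only if no invalidate ran or is still running. */
static bool sketch_fault_may_map(const struct invalidate_state *s,
				 unsigned long seq)
{
	return s->range_count == 0 && s->seq == seq;
}
```

A fault that sees sketch_fault_may_map() return false would drop its page reference and start over from sketch_fault_begin().
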



Re: [kvm-devel] What kernel options do I need to properly enable virtio net driver

2008-04-22 Thread Anthony Liguori
Jerone Young wrote:
 What I am asking is do I have all the proper options in my kernel config
 set to use it?
   

Yes.  You just need CONFIG_VIRTIO_NET and CONFIG_VIRTIO_PCI.  The 
remaining options will be automatically selected.

Regards,

Anthony Liguori

 On Mon, 2008-04-21 at 17:13 -0500, Anthony Liguori wrote:
   
 Jerone Young wrote:
 
 virtio net device does not appear to show itself in the guest. I'm
 curious of what options I may be missing. Here is my config
   
 You'll have to be more specific about what does not appear to show 
 itself means.  What's the output of lspci?

 Regards,

 Anthony Liguori

 
 

   

   




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 03:21:43PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 08:01:20AM -0500, Robin Holt wrote:
  On Tue, Apr 22, 2008 at 02:00:56PM +0200, Andrea Arcangeli wrote:
   On Tue, Apr 22, 2008 at 09:20:26AM +0200, Andrea Arcangeli wrote:
invalidate_range_start {
spin_lock(&kvm->mmu_lock);

kvm->invalidate_range_count++;
rmap-invalidate of sptes in range

   
 write_seqlock; write_sequnlock;
  
  I don't think you need it here since invalidate_range_count is already
  elevated which will accomplish the same effect.
 
 Agreed, seqlock only in range_end should be enough. BTW, the fact

I am a little confused about the value of the seq_lock versus a simple
atomic, but I assumed there is a reason and left it at that.

 seqlock is needed regardless of invalidate_page existing or not,
 really makes invalidate_page a no brainer not just from the core VM
 point of view, but from the driver point of view too. The
 kvm_page_fault logic would be the same even if I remove
 invalidate_page from the mmu notifier patch but it'd run slower both
 when armed and disarmed.

I don't know what you mean by "it'd run slower" or by "armed" and
"disarmed".

For the sake of this discussion, I will assume "it'd" means the kernel in
general and not KVM.  With the two call sites for range_begin/range_end,
I would agree we have more call sites, but the second is extremely likely
to be cache hot.

By disarmed, I will assume you mean no notifiers registered for a
particular mm.  In that case, the cache will make the second call
effectively free.  So, for the disarmed case, I see no measurable
difference.

For the case where there is a notifier registered, I certainly can see
a difference.  I am not certain how to quantify the difference as it
depends on the callee.  In the case of xpmem, our callout is always very
expensive for the _start case.  Our _end case is very light, but it is
essentially the exact same steps we would perform for the _page callout.

When I was discussing this difference with Jack, he reminded me that
the GRU, due to its hardware, does not have any race issues with the
invalidate_page callout simply doing the tlb shootdown and not modifying
any of its internal structures.  He then put a caveat on the discussion
that _either_ method was acceptable as far as he was concerned.  The real
issue is getting a patch in that satisfies all needs and not whether
there is a separate invalidate_page callout.

Thanks,
Robin



Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Anthony Liguori
Nguyen Anh Quynh wrote:
 Hi,

 I am thinking about combining this ROM with extboot. Both ROMs are
 about booting, so I think that is reasonable. We would then have
 only one ROM that supports both external boot and Linux boot.

 Is that desirable or not?
   

I think so.

Regards,

Anthony Liguori




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
 I am a little confused about the value of the seq_lock versus a simple
 atomic, but I assumed there is a reason and left it at that.

There's no value for anything but get_user_pages (get_user_pages takes
its own lock internally though). I preferred to explain it as a
seqlock because it was simpler for reading, but I totally agree in the
final implementation it shouldn't be a seqlock. My code was meant to
be pseudo-code only. It doesn't even need to be atomic ;).

 I don't know what you mean by it'd run slower and what you mean by
 armed and disarmed.

1) when armed, the time window in which the kvm page fault would be
blocked would be a bit larger without invalidate_page, for no good
reason

2) if you were to remove invalidate_page, when disarmed the VM
would need two branches instead of one in various places

I don't want to waste cycles if not wasting them improves performance
both when armed and disarmed.
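
A small sketch of what "armed" and "disarmed" cost at a call site, under assumed names. It is not the mmu notifier code from the patch series: the types and functions below are hypothetical userspace stand-ins. With no notifier registered, the call site pays one branch on an empty list; with one registered, each callback in the list is invoked.

```c
#include <assert.h>
#include <stddef.h>

struct notifier_sketch;

/* Hypothetical callback table, loosely modelled on a notifier ops struct. */
struct notifier_ops_sketch {
	void (*invalidate_page)(struct notifier_sketch *n,
				unsigned long address);
};

struct notifier_sketch {
	const struct notifier_ops_sketch *ops;
	struct notifier_sketch *next;
	unsigned long last_address;	/* recorded by the example callback */
	int calls;
};

struct mm_sketch {
	struct notifier_sketch *notifiers;	/* NULL means "disarmed" */
};

static void notify_invalidate_page(struct mm_sketch *mm, unsigned long address)
{
	struct notifier_sketch *n;

	assert(mm != NULL);
	if (!mm->notifiers)
		return;		/* disarmed: a single, well-predicted branch */
	for (n = mm->notifiers; n; n = n->next)
		if (n->ops->invalidate_page)
			n->ops->invalidate_page(n, address);
}

/* Example callback: just records that it was called and for which address. */
static void record_invalidate(struct notifier_sketch *n, unsigned long address)
{
	n->calls++;
	n->last_address = address;
}
```

Splitting one callout into a start/end pair means two such checks per call site, which is the extra branch being argued about above.
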

 For the sake of this discussion, I will assume it'd means the kernel in
 general and not KVM.  With the two call sites for range_begin/range_end,

I actually meant for both.

 By disarmed, I will assume you mean no notifiers registered for a
 particular mm.  In that case, the cache will make the second call
 effectively free.  So, for the disarmed case, I see no measurable
 difference.

For rmap it's surely effectively free; for do_wp_page it costs one
branch for no good reason.

 For the case where there is a notifier registered, I certainly can see
 a difference.  I am not certain how to quantify the difference as it

Agreed.

 When I was discussing this difference with Jack, he reminded me that
 the GRU, due to its hardware, does not have any race issues with the
 invalidate_page callout simply doing the tlb shootdown and not modifying
 any of its internal structures.  He then put a caveat on the discussion
 that _either_ method was acceptable as far as he was concerned.  The real
 issue is getting a patch in that satisfies all needs and not whether
 there is a separate invalidate_page callout.

Sure, we have that patch now, I'll send it out in a minute, I was just
trying to explain why it makes sense to have an invalidate_page too
(which remains the only difference by now), removing it would be a
regression on all sides, even if a minor one.



[kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208870142 -7200
# Node ID ea87c15371b1bd49380c40c3f15f1c7ca4438af5
# Parent  fb3bc9942fb78629d096bd07564f435d51d86e5f
Core of mmu notifiers.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Nick Piggin [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1050,6 +1050,27 @@
   unsigned long addr, unsigned long len,
   unsigned long flags, struct page **pages);
 
+/*
+ * mm_lock will take mmap_sem writably (to prevent all modifications
+ * and scanning of vmas) and then also takes the mapping locks for
+ * each of the vmas to lock out any scans of pagetables of this address
+ * space. This can be used to effectively hold off reclaim from the
+ * address space.
+ *
+ * mm_lock can fail if there is not enough memory to store a pointer
+ * array to all vmas.
+ *
+ * mm_lock and mm_unlock are expensive operations that may take a long time.
+ */
+struct mm_lock_data {
+   spinlock_t **i_mmap_locks;
+   spinlock_t **anon_vma_locks;
+   size_t nr_i_mmap_locks;
+   size_t nr_anon_vma_locks;
+};
+extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
+extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
+
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned long, unsigned long, unsigned long);
 
 extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -225,6 +225,9 @@
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
struct mem_cgroup *mem_cgroup;
 #endif
+#ifdef CONFIG_MMU_NOTIFIER
+   struct hlist_head mmu_notifier_list;
+#endif
 };
 
 #endif /* _LINUX_MM_TYPES_H */
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
new file mode 100644
--- /dev/null
+++ b/include/linux/mmu_notifier.h
@@ -0,0 +1,229 @@
+#ifndef _LINUX_MMU_NOTIFIER_H
+#define _LINUX_MMU_NOTIFIER_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/mm_types.h>
+
+struct mmu_notifier;
+struct mmu_notifier_ops;
+
+#ifdef CONFIG_MMU_NOTIFIER
+
+struct mmu_notifier_ops {
+   /*
+* Called after all other threads have terminated and the executing
+* thread is the only remaining execution thread. There are no
+* users of the mm_struct remaining.
+*/
+   void (*release)(struct mmu_notifier *mn,
+   struct mm_struct *mm);
+
+   /*
+* clear_flush_young is called after the VM is
+* test-and-clearing the young/accessed bitflag in the
+* pte. This way the VM will provide proper aging to the
+* accesses to the page through the secondary MMUs and not
+* only to the ones through the Linux pte.
+*/
+   int (*clear_flush_young)(struct mmu_notifier *mn,
+struct mm_struct *mm,
+unsigned long address);
+
+   /*
+* Before this is invoked any secondary MMU is still ok to
+* read/write to the page previously pointed by the Linux pte
+* because the old page hasn't been freed yet.  If required
+* set_page_dirty has to be called internally to this method.
+*/
+   void (*invalidate_page)(struct mmu_notifier *mn,
+   struct mm_struct *mm,
+   unsigned long address);
+
+   /*
+* invalidate_range_start() and invalidate_range_end() must be
+* paired and are called only when the mmap_sem is held and/or
+* the semaphores protecting the reverse maps. Both functions
+* may sleep. The subsystem must guarantee that no additional
+* references to the pages in the range are established between
+* the call to invalidate_range_start() and the matching call
+* to invalidate_range_end().
+*
+* Invalidation of multiple concurrent ranges may be permitted
+* by the driver or the driver may exclude other invalidation
+* from proceeding by blocking on new invalidate_range_start()
+* callbacks that overlap invalidates that are already in
+* progress. Either way the establishment of sptes to the
+* range can only be allowed if all invalidate_range_end()
+* functions have been called.
+*
+* invalidate_range_start() is called when all pages in the
+* range are still mapped and have at least a refcount of one.
+*
+* invalidate_range_end() is called when all pages in the
+* range have been unmapped and the pages have been freed by
+* the VM.
+*
+* The VM will remove the page table entries and potentially
+   

[kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID 3c804dca25b15017b22008647783d6f5f3801fa9
# Parent  ea87c15371b1bd49380c40c3f15f1c7ca4438af5
Fix ia64 compilation failure because of common code include bug.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -10,6 +10,7 @@
 #include <linux/rbtree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
+#include <linux/cpumask.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 



[kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID ac9bb1fb3de2aa5d27210a28edf24f6577094076
# Parent  a6672bdeead0d41b2ebd6846f731d43a611645b7
Moves all mmu notifier methods outside the PT lock (first and not last
step to make them sleep capable).

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -169,27 +169,6 @@
INIT_HLIST_HEAD(&mm->mmu_notifier_list);
 }
 
-#define ptep_clear_flush_notify(__vma, __address, __ptep)  \
-({ \
-   pte_t __pte;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __pte = ptep_clear_flush(___vma, ___address, __ptep);   \
-   mmu_notifier_invalidate_page(___vma->vm_mm, ___address);\
-   __pte;  \
-})
-
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
-({ \
-   int __young;\
-   struct vm_area_struct *___vma = __vma;  \
-   unsigned long ___address = __address;   \
-   __young = ptep_clear_flush_young(___vma, ___address, __ptep);   \
-   __young |= mmu_notifier_clear_flush_young(___vma->vm_mm,\
- ___address);  \
-   __young;\
-})
-
 #else /* CONFIG_MMU_NOTIFIER */
 
 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -221,9 +200,6 @@
 {
 }
 
-#define ptep_clear_flush_young_notify ptep_clear_flush_young
-#define ptep_clear_flush_notify ptep_clear_flush
-
 #endif /* CONFIG_MMU_NOTIFIER */
 
 #endif /* _LINUX_MMU_NOTIFIER_H */
diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c
--- a/mm/filemap_xip.c
+++ b/mm/filemap_xip.c
@@ -194,11 +194,13 @@
if (pte) {
/* Nuke the page table entry. */
flush_cache_page(vma, address, pte_pfn(*pte));
-   pteval = ptep_clear_flush_notify(vma, address, pte);
+   pteval = ptep_clear_flush(vma, address, pte);
page_remove_rmap(page, vma);
dec_mm_counter(mm, file_rss);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
+   /* must invalidate_page _before_ freeing the page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(page);
}
}
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1627,9 +1627,10 @@
 */
page_table = pte_offset_map_lock(mm, pmd, address,
 ptl);
-   page_cache_release(old_page);
+   new_page = NULL;
if (!pte_same(*page_table, orig_pte))
goto unlock;
+   page_cache_release(old_page);
 
page_mkwrite = 1;
}
@@ -1645,6 +1646,7 @@
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, entry);
ret |= VM_FAULT_WRITE;
+   old_page = new_page = NULL;
goto unlock;
}
 
@@ -1689,7 +1691,7 @@
 * seen in the presence of one thread doing SMC and another
 * thread doing COW.
 */
-   ptep_clear_flush_notify(vma, address, page_table);
+   ptep_clear_flush(vma, address, page_table);
set_pte_at(mm, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lru_cache_add_active(new_page);
@@ -1701,12 +1703,18 @@
} else
mem_cgroup_uncharge_page(new_page);
 
-   if (new_page)
+unlock:
+   pte_unmap_unlock(page_table, ptl);
+
+   if (new_page) {
+   if (new_page == old_page)
+   /* cow happened, notify before releasing old_page */
+   mmu_notifier_invalidate_page(mm, address);
page_cache_release(new_page);
+   }
if (old_page)
page_cache_release(old_page);
-unlock:
-   pte_unmap_unlock(page_table, ptl);
+
if (dirty_page) {
if (vma->vm_file)
file_update_time(vma->vm_file);
diff --git 

[kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
# Parent  bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock
conversion.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1062,10 +1062,10 @@
  * mm_lock and mm_unlock are expensive operations that may take a long time.
  */
 struct mm_lock_data {
-   spinlock_t **i_mmap_locks;
-   spinlock_t **anon_vma_locks;
-   size_t nr_i_mmap_locks;
-   size_t nr_anon_vma_locks;
+   struct rw_semaphore **i_mmap_sems;
+   struct rw_semaphore **anon_vma_sems;
+   size_t nr_i_mmap_sems;
+   size_t nr_anon_vma_sems;
 };
 extern int mm_lock(struct mm_struct *mm, struct mm_lock_data *data);
 extern void mm_unlock(struct mm_struct *mm, struct mm_lock_data *data);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,8 +2243,8 @@
 static int mm_lock_cmp(const void *a, const void *b)
 {
cond_resched();
-   if ((unsigned long)*(spinlock_t **)a <
-   (unsigned long)*(spinlock_t **)b)
+   if ((unsigned long)*(struct rw_semaphore **)a <
+   (unsigned long)*(struct rw_semaphore **)b)
return -1;
else if (a == b)
return 0;
@@ -2252,7 +2252,7 @@
return 1;
 }
 
-static unsigned long mm_lock_sort(struct mm_struct *mm, spinlock_t **locks,
+static unsigned long mm_lock_sort(struct mm_struct *mm, struct rw_semaphore **sems,
  int anon)
 {
struct vm_area_struct *vma;
@@ -2261,59 +2261,59 @@
for (vma = mm->mmap; vma; vma = vma->vm_next) {
if (anon) {
if (vma->anon_vma)
-   locks[i++] = &vma->anon_vma->lock;
+   sems[i++] = &vma->anon_vma->sem;
} else {
if (vma->vm_file && vma->vm_file->f_mapping)
-   locks[i++] = &vma->vm_file->f_mapping->i_mmap_lock;
+   sems[i++] = &vma->vm_file->f_mapping->i_mmap_sem;
}
}
 
if (!i)
goto out;
 
-   sort(locks, i, sizeof(spinlock_t *), mm_lock_cmp, NULL);
+   sort(sems, i, sizeof(struct rw_semaphore *), mm_lock_cmp, NULL);
 
 out:
return i;
 }
 
 static inline unsigned long mm_lock_sort_anon_vma(struct mm_struct *mm,
- spinlock_t **locks)
+ struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 1);
+   return mm_lock_sort(mm, sems, 1);
 }
 
 static inline unsigned long mm_lock_sort_i_mmap(struct mm_struct *mm,
-   spinlock_t **locks)
+   struct rw_semaphore **sems)
 {
-   return mm_lock_sort(mm, locks, 0);
+   return mm_lock_sort(mm, sems, 0);
 }
 
-static void mm_lock_unlock(spinlock_t **locks, size_t nr, int lock)
+static void mm_lock_unlock(struct rw_semaphore **sems, size_t nr, int lock)
 {
-   spinlock_t *last = NULL;
+   struct rw_semaphore *last = NULL;
size_t i;
 
for (i = 0; i < nr; i++)
/*  Multiple vmas may use the same lock. */
-   if (locks[i] != last) {
-   BUG_ON((unsigned long) last > (unsigned long) locks[i]);
-   last = locks[i];
+   if (sems[i] != last) {
+   BUG_ON((unsigned long) last > (unsigned long) sems[i]);
+   last = sems[i];
if (lock)
-   spin_lock(last);
+   down_write(last);
else
-   spin_unlock(last);
+   up_write(last);
}
 }
 
-static inline void __mm_lock(spinlock_t **locks, size_t nr)
+static inline void __mm_lock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 1);
+   mm_lock_unlock(sems, nr, 1);
 }
 
-static inline void __mm_unlock(spinlock_t **locks, size_t nr)
+static inline void __mm_unlock(struct rw_semaphore **sems, size_t nr)
 {
-   mm_lock_unlock(locks, nr, 0);
+   mm_lock_unlock(sems, nr, 0);
 }
 
 /*
@@ -2325,57 +2325,57 @@
  */
 int mm_lock(struct mm_struct *mm, struct mm_lock_data *data)
 {
-   spinlock_t **anon_vma_locks, **i_mmap_locks;
+   struct rw_semaphore **anon_vma_sems, **i_mmap_sems;
 
down_write(&mm->mmap_sem);
if (mm->map_count) {
-   anon_vma_locks = vmalloc(sizeof(spinlock_t *) * mm->map_count);
-   if (unlikely(!anon_vma_locks)) {
+   anon_vma_sems = vmalloc(sizeof(struct rw_semaphore *) * mm->map_count);
+ 

[kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
# Parent  ac9bb1fb3de2aa5d27210a28edf24f6577094076
Move the tlb flushing into free_pgtables. The conversion of the locks
taken for reverse map scanning would require taking sleeping locks
in free_pgtables() and we cannot sleep while gathering pages for a tlb
flush.

Move the tlb_gather/tlb_finish call to free_pgtables() to be done
for each vma. This may add a number of tlb flushes depending on the
number of vmas that cannot be coalesced into one.

The first pointer argument to free_pgtables() can then be dropped.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -751,8 +751,8 @@
void *private);
 void free_pgd_range(struct mmu_gather **tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *start_vma,
-   unsigned long floor, unsigned long ceiling);
+void free_pgtables(struct vm_area_struct *start_vma, unsigned long floor,
+   unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
 void unmap_mapping_range(struct address_space *mapping,
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -272,9 +272,11 @@
} while (pgd++, addr = next, addr != end);
 }
 
-void free_pgtables(struct mmu_gather **tlb, struct vm_area_struct *vma,
-   unsigned long floor, unsigned long ceiling)
+void free_pgtables(struct vm_area_struct *vma, unsigned long floor,
+   unsigned long ceiling)
 {
+   struct mmu_gather *tlb;
+
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
@@ -286,7 +288,8 @@
unlink_file_vma(vma);
 
if (is_vm_hugetlb_page(vma)) {
-   hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   hugetlb_free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
/*
@@ -299,9 +302,11 @@
anon_vma_unlink(vma);
unlink_file_vma(vma);
}
-   free_pgd_range(tlb, addr, vma->vm_end,
+   tlb = tlb_gather_mmu(vma->vm_mm, 0);
+   free_pgd_range(&tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
}
+   tlb_finish_mmu(tlb, addr, vma->vm_end);
vma = next;
}
 }
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1752,9 +1752,9 @@
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
+   tlb_finish_mmu(tlb, start, end);
+   free_pgtables(vma, prev? prev->vm_end: FIRST_USER_ADDRESS,
 next? next->vm_start: 0);
-   tlb_finish_mmu(tlb, start, end);
 }
 
 /*
@@ -2050,8 +2050,8 @@
/* Use -1 here to ensure all VMAs in the mm are unmapped */
end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL);
vm_unacct_memory(nr_accounted);
-   free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, 0);
tlb_finish_mmu(tlb, 0, end);
+   free_pgtables(vma, FIRST_USER_ADDRESS, 0);
 
/*
 * Walk the list again, actually closing and freeing it,



[kvm-devel] [PATCH 12 of 12] This patch adds a lock ordering rule to avoid a potential deadlock when

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID e847039ee2e815088661933b7195584847dc7540
# Parent  128d705f38c8a774ac11559db445787ce6e91c77
This patch adds a lock ordering rule to avoid a potential deadlock when
multiple mmap_sems need to be locked.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -79,6 +79,9 @@
  *
  *  ->i_mutex  (generic_file_buffered_write)
  *    ->mmap_sem   (fault_in_pages_readable->do_page_fault)
+ *
+ *When taking multiple mmap_sems, one should lock the lowest-addressed
+ *one first proceeding on up to the highest-addressed one.
  *
  *  ->i_mutex
  *    ->i_alloc_sem (various)
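
The ordering rule added by this patch can be sketched in userspace as follows. This is only an illustration under stated assumptions: struct fake_sem and its helpers are hypothetical stand-ins for struct rw_semaphore and down_write(), and the C library's qsort() replaces the kernel's sort(). The pointers are sorted by address, duplicates are skipped, and each distinct lock is taken from the lowest address up, so two tasks locking the same set can never take them in opposite orders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a sleeping lock; records acquisition order. */
struct fake_sem {
	int held;
	int order;
};

static int acquisitions;

static void fake_down_write(struct fake_sem *sem)
{
	assert(!sem->held);	/* taking the same lock twice would deadlock */
	sem->held = 1;
	sem->order = ++acquisitions;
}

/* Compare the lock pointers stored in the array, not the array slots. */
static int cmp_by_address(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(struct fake_sem *const *)a;
	uintptr_t y = (uintptr_t)*(struct fake_sem *const *)b;

	return (x > y) - (x < y);
}

static void lock_all_in_address_order(struct fake_sem **sems, size_t nr)
{
	struct fake_sem *last = NULL;
	size_t i;

	qsort(sems, nr, sizeof(*sems), cmp_by_address);
	for (i = 0; i < nr; i++) {
		if (sems[i] == last)
			continue;	/* several users may share one lock */
		last = sems[i];
		fake_down_write(last);
	}
}
```

Unlocking can walk the same sorted array in either direction, since releasing locks cannot deadlock.
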



[kvm-devel] [PATCH 11 of 12] XPMEM would have used sys_madvise() except that madvise_dontneed()

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID 128d705f38c8a774ac11559db445787ce6e91c77
# Parent  f8210c45f1c6f8b38d15e5dfebbc5f7c1f890c93
XPMEM would have used sys_madvise() except that madvise_dontneed()
returns an -EINVAL if VM_PFNMAP is set, which is always true for the pages
XPMEM imports from other partitions and is also true for uncached pages
allocated locally via the mspec allocator.  XPMEM needs zap_page_range()
functionality for these types of pages as well as 'normal' pages.

Signed-off-by: Dean Nelson [EMAIL PROTECTED]

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -909,6 +909,7 @@
 
return unmap_vmas(vma, address, end, nr_accounted, details);
 }
+EXPORT_SYMBOL_GPL(zap_page_range);
 
 /*
  * Do a quick page-table lookup for a single page.



[kvm-devel] [PATCH 06 of 12] Move the tlb flushing inside of unmap vmas. This saves us from passing

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID fbce3fecb033eb3fba1d9c2398ac74401ce0ecb5
# Parent  ee8c0644d5f67c1ef59142cce91b0bb6f34a53e0
Move the tlb flushing inside of unmap vmas. This saves us from passing
a pointer to the TLB structure around and simplifies the callers.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -723,8 +723,7 @@
 struct page *vm_normal_page(struct vm_area_struct *, unsigned long, pte_t);
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *);
-unsigned long unmap_vmas(struct mmu_gather **tlb,
-   struct vm_area_struct *start_vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *start_vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *);
 
diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,6 @@
 
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
- * @tlbp: address of the caller's struct mmu_gather
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
@@ -816,20 +815,13 @@
  * Unmap all pages in the vma list.
  *
  * We aim to not hold locks for too long (for scheduling latency reasons).
- * So zap pages in ZAP_BLOCK_SIZE bytecounts.  This means we need to
- * return the ending mmu_gather to the caller.
+ * So zap pages in ZAP_BLOCK_SIZE bytecounts.
  *
  * Only addresses between `start' and `end' will be unmapped.
  *
  * The VMA list must be sorted in ascending virtual address order.
- *
- * unmap_vmas() assumes that the caller will flush the whole unmapped address
- * range after unmap_vmas() returns.  So the only responsibility here is to
- * ensure that any thus-far unmapped pages are flushed before unmap_vmas()
- * drops the lock and schedules.
  */
-unsigned long unmap_vmas(struct mmu_gather **tlbp,
-   struct vm_area_struct *vma, unsigned long start_addr,
+unsigned long unmap_vmas(struct vm_area_struct *vma, unsigned long start_addr,
unsigned long end_addr, unsigned long *nr_accounted,
struct zap_details *details)
 {
@@ -838,9 +830,14 @@
int tlb_start_valid = 0;
unsigned long start = start_addr;
spinlock_t *i_mmap_lock = details? details->i_mmap_lock: NULL;
-   int fullmm = (*tlbp)->fullmm;
+   int fullmm;
+   struct mmu_gather *tlb;
struct mm_struct *mm = vma->vm_mm;
 
+   lru_add_drain();
+   tlb = tlb_gather_mmu(mm, 0);
+   update_hiwater_rss(mm);
+   fullmm = tlb->fullmm;
mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next) {
unsigned long end;
@@ -867,7 +864,7 @@
(HPAGE_SIZE / PAGE_SIZE);
start = end;
} else
-   start = unmap_page_range(*tlbp, vma,
+   start = unmap_page_range(tlb, vma,
start, end, zap_work, details);
 
if (zap_work > 0) {
@@ -875,22 +872,23 @@
break;
}
 
-   tlb_finish_mmu(*tlbp, tlb_start, start);
+   tlb_finish_mmu(tlb, tlb_start, start);
 
if (need_resched() ||
(i_mmap_lock && spin_needbreak(i_mmap_lock))) {
if (i_mmap_lock) {
-   *tlbp = NULL;
+   tlb = NULL;
goto out;
}
cond_resched();
}
 
-   *tlbp = tlb_gather_mmu(vma->vm_mm, fullmm);
+   tlb = tlb_gather_mmu(vma->vm_mm, fullmm);
tlb_start_valid = 0;
zap_work = ZAP_BLOCK_SIZE;
}
}
+   tlb_finish_mmu(tlb, start_addr, end_addr);
 out:
mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
return start;   /* which is now the end (or restart) address */
@@ -906,18 +904,10 @@
 unsigned long zap_page_range(struct vm_area_struct *vma, unsigned long address,
unsigned long size, struct zap_details *details)
 {
-   struct mm_struct *mm = vma->vm_mm;
-   struct mmu_gather *tlb;
unsigned long end = address + size;
unsigned long nr_accounted = 0;
 
-   lru_add_drain();
-   tlb = tlb_gather_mmu(mm, 0);
-   

[kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Andrea Arcangeli
Hello,

This is the latest and greatest version of the mmu notifier patch #v13.

Changes are mainly in the mm_lock that uses sort() suggested by Christoph.
This reduces the complexity from O(N**2) to O(N*log(N)).
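
One way sorting helps here: once the locks are sorted by address, a single linear pass can take them in a consistent order and skip entries that share a lock, so no pairwise comparison is needed. The following is a minimal userspace sketch, not the mm_lock code from patch 1/12. lock_all_sorted() and unlock_all_sorted() are hypothetical names, pthread mutexes stand in for the kernel locks, and the duplicate skipping is an assumption of this sketch.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator: order lock pointers by ascending address. */
static int cmp_lock_addr(const void *pa, const void *pb)
{
    uintptr_t a = (uintptr_t)*(pthread_mutex_t *const *)pa;
    uintptr_t b = (uintptr_t)*(pthread_mutex_t *const *)pb;

    return (a > b) - (a < b);
}

/*
 * Sort the n lock pointers, then take each distinct lock once in
 * ascending address order. Returns the number of locks taken.
 */
static size_t lock_all_sorted(pthread_mutex_t **locks, size_t n)
{
    size_t taken = 0;

    qsort(locks, n, sizeof(*locks), cmp_lock_addr);
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && locks[i] == locks[i - 1])
            continue;   /* same lock shared by two entries */
        pthread_mutex_lock(locks[i]);
        taken++;
    }
    return taken;
}

/* Release every distinct lock in an array sorted by lock_all_sorted(). */
static void unlock_all_sorted(pthread_mutex_t **locks, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && locks[i] == locks[i - 1])
            continue;
        pthread_mutex_unlock(locks[i]);
    }
}
```

The sort costs O(N*log(N)) and the locking pass is O(N), which matches the complexity quoted above.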

I folded the mm_lock functionality together with the mmu-notifier-core 1/12
patch to make it self-contained. I recommend merging 1/12 into -mm/mainline
ASAP. Lack of mmu notifiers is holding off KVM development. We are going to
rework the way the pages are mapped and unmapped to work with pure pfn for pci
passthrough without the use of page pinning, and we can't without mmu
notifiers. This is not just a performance matter.

KVM/GRU and AFAICT Quadrics are all covered by applying the single 1/12 patch
that shall be shipped with 2.6.26. The risk of breakage by applying 1/12 is
zero, both when MMU_NOTIFIER=y and when it's =n, so it shouldn't be delayed
further.

XPMEM support comes with the later patches 2-12; the risk for those patches is
>0, and this is why the mmu-notifier-core is numbered 1/12 and not 12/12. Some
are simple and can go in immediately, but not all are so simple.

2-12/12 are posted as usual for review by the VM developers and so Robin can
keep testing them on XPMEM and they can be merged later without any downside
(they're mostly orthogonal with 1/12).



[kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872186 -7200
# Node ID a6672bdeead0d41b2ebd6846f731d43a611645b7
# Parent  3c804dca25b15017b22008647783d6f5f3801fa9
get_task_mm should not succeed if mmput() is running and has reduced
the mm_users count to zero. This can occur if a processor follows
a task's pointer to an mm struct because that pointer is only cleared
after the mmput().

If get_task_mm() succeeds after mmput() reduced the mm_users to zero then
we have the lovely situation that one portion of the kernel is doing
all the teardown work for an mm while another portion is happily using
it.

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -442,7 +442,8 @@
if (task->flags & PF_BORROWED_MM)
mm = NULL;
else
-   atomic_inc(&mm->mm_users);
+   if (!atomic_inc_not_zero(&mm->mm_users))
+   mm = NULL;
}
task_unlock(task);
return mm;
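
The semantics this patch relies on (increment unless the count is already zero) can be sketched with C11 atomics. inc_not_zero() is a hypothetical userspace stand-in and not the kernel's atomic_inc_not_zero() implementation.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *count unless it is already zero.
 * Returns true if a reference was taken, false if the object is
 * already being torn down.
 */
static bool inc_not_zero(atomic_int *count)
{
    int old = atomic_load(count);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(count, &old, old + 1))
            return true;
    }
    return false;
}
```

A plain increment would turn a count of zero back into one and hand out an object whose teardown has already started; the compare-and-swap loop refuses in that case.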



[kvm-devel] [PATCH 09 of 12] Convert the anon_vma spinlock to a rw semaphore. This allows concurrent

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID bdb3d928a0ba91cdce2b61bd40a2f80bddbe4ff2
# Parent  6e04df1f4284689b1c46e57a67559abe49ecf292
Convert the anon_vma spinlock to a rw semaphore. This allows concurrent
traversal of reverse maps for try_to_unmap() and page_mkclean(). It also
allows the calling of sleeping functions from reverse map traversal as
needed for the notifier callbacks. It increases possible concurrency.

RCU is used in some contexts to guarantee the presence of the anon_vma
(try_to_unmap) while we acquire the anon_vma lock. We cannot take a
semaphore within an RCU critical section. Add a refcount to the anon_vma
structure which allows us to give an existence guarantee for the anon_vma
structure independent of the spinlock or the list contents.

The refcount can then be taken within the RCU section. If it has been
taken successfully then the refcount guarantees the existence of the
anon_vma. The refcount in anon_vma also allows us to fix a nasty
issue in page migration where we fudged by using rcu for a long code
path to guarantee the existence of the anon_vma. I think this is a bug
because the anon_vma may become empty and get scheduled to be freed
but then we increase the refcount again when the migration entries are
removed.

The refcount in general allows a shortening of RCU critical sections since
we can do a rcu_unlock after taking the refcount. This is particularly
relevant if the anon_vma chains contain hundreds of entries.

However:
- Atomic overhead increases in situations where a new reference
  to the anon_vma has to be established or removed. Overhead also increases
  when a speculative reference is used (try_to_unmap,
  page_mkclean, page migration).
- There is the potential for more frequent processor change due to up_xxx
  letting waiting tasks run first. This results in, e.g., the Aim9 brk
  performance test going down by 10-15%.
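
The existence guarantee described above follows the usual get/put pattern: whoever drops the last reference frees the object. A minimal userspace sketch with C11 atomics follows. struct obj, obj_alloc(), obj_get() and obj_put() are hypothetical names, and the objs_freed counter exists only so the behaviour can be observed.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical refcounted object standing in for struct anon_vma. */
struct obj {
    atomic_int refcount;
};

static int objs_freed;  /* counts frees, for illustration only */

/* Allocate an object holding one initial reference. */
static struct obj *obj_alloc(void)
{
    struct obj *o = malloc(sizeof(*o));

    if (o)
        atomic_init(&o->refcount, 1);
    return o;
}

/* Take an extra reference; the caller must already hold one. */
static void obj_get(struct obj *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

/* Drop one reference; the caller that drops the last one frees. */
static void obj_put(struct obj *o)
{
    if (atomic_fetch_sub(&o->refcount, 1) == 1) {
        free(o);
        objs_freed++;
    }
}
```

The atomic operations on every get and put are the extra overhead the first bullet above refers to.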

Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -25,7 +25,8 @@
  * pointing to this anon_vma once its vma list is empty.
  */
 struct anon_vma {
-   spinlock_t lock;/* Serialize access to vma list */
+   atomic_t refcount;  /* vmas on the list */
+   struct rw_semaphore sem;/* Serialize access to vma list */
struct list_head head;  /* List of private related vmas */
 };
 
@@ -43,18 +44,31 @@
kmem_cache_free(anon_vma_cachep, anon_vma);
 }
 
+struct anon_vma *grab_anon_vma(struct page *page);
+
+static inline void get_anon_vma(struct anon_vma *anon_vma)
+{
+   atomic_inc(&anon_vma->refcount);
+}
+
+static inline void put_anon_vma(struct anon_vma *anon_vma)
+{
+   if (atomic_dec_and_test(&anon_vma->refcount))
+   anon_vma_free(anon_vma);
+}
+
 static inline void anon_vma_lock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_lock(&anon_vma->lock);
+   down_write(&anon_vma->sem);
 }
 
 static inline void anon_vma_unlock(struct vm_area_struct *vma)
 {
struct anon_vma *anon_vma = vma->anon_vma;
if (anon_vma)
-   spin_unlock(&anon_vma->lock);
+   up_write(&anon_vma->sem);
 }
 
 /*
diff --git a/mm/migrate.c b/mm/migrate.c
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -235,15 +235,16 @@
return;
 
/*
-* We hold the mmap_sem lock. So no need to call page_lock_anon_vma.
+* We hold either the mmap_sem lock or a reference on the
+* anon_vma. So no need to call page_lock_anon_vma.
 */
anon_vma = (struct anon_vma *) (mapping - PAGE_MAPPING_ANON);
-   spin_lock(&anon_vma->lock);
+   down_read(&anon_vma->sem);
 
list_for_each_entry(vma, &anon_vma->head, anon_vma_node)
remove_migration_pte(vma, old, new);
 
-   spin_unlock(&anon_vma->lock);
+   up_read(&anon_vma->sem);
 }
 
 /*
@@ -623,7 +624,7 @@
int rc = 0;
int *result = NULL;
struct page *newpage = get_new_page(page, private, result);
-   int rcu_locked = 0;
+   struct anon_vma *anon_vma = NULL;
int charge = 0;
 
if (!newpage)
@@ -647,16 +648,14 @@
}
/*
 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
-* we cannot notice that anon_vma is freed while we migrates a page.
+* we cannot notice that anon_vma is freed while we migrate a page.
 * This rcu_read_lock() delays freeing anon_vma pointer until the end
 * of migration. File cache pages are no problem because of page_lock()
 * File Caches may use write_page() or lock_page() in migration, then,
 * just care Anon page here.
 */
-   if (PageAnon(page)) {
-   rcu_read_lock();
-   rcu_locked = 1;
-   }
+   if (PageAnon(page))
+   anon_vma = 

[kvm-devel] [PATCH 08 of 12] The conversion to a rwsem allows notifier callbacks during rmap traversal

2008-04-22 Thread Andrea Arcangeli
# HG changeset patch
# User Andrea Arcangeli [EMAIL PROTECTED]
# Date 1208872187 -7200
# Node ID 6e04df1f4284689b1c46e57a67559abe49ecf292
# Parent  8965539f4d174c79bd37e58e8b037d5db906e219
The conversion to a rwsem allows notifier callbacks during rmap traversal
for files. A rw style lock also allows concurrent walking of the
reverse map so that multiple processors can expire pages in the same memory
area of the same process. So it increases the potential concurrency.

Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

diff --git a/Documentation/vm/locking b/Documentation/vm/locking
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -66,7 +66,7 @@
 expand_stack(), it is hard to come up with a destructive scenario without 
 having the vmlist protection in this case.
 
-The page_table_lock nests with the inode i_mmap_lock and the kmem cache
+The page_table_lock nests with the inode i_mmap_sem and the kmem cache
 c_spinlock spinlocks.  This is okay, since the kmem code asks for pages after
 dropping c_spinlock.  The page_table_lock also nests with pagecache_lock and
 pagemap_lru_lock spinlocks, and no code asks for memory with these locks
diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -69,7 +69,7 @@
if (!vma_shareable(vma, addr))
return;
 
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
vma_prio_tree_foreach(svma, &iter, &mapping->i_mmap, idx, idx) {
if (svma == vma)
continue;
@@ -94,7 +94,7 @@
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
 out:
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
 }
 
 /*
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -454,10 +454,10 @@
pgoff = offset >> PAGE_SHIFT;
 
i_size_write(inode, offset);
-   spin_lock(&mapping->i_mmap_lock);
+   down_read(&mapping->i_mmap_sem);
if (!prio_tree_empty(&mapping->i_mmap))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
-   spin_unlock(&mapping->i_mmap_lock);
+   up_read(&mapping->i_mmap_sem);
truncate_hugepages(inode, offset);
return 0;
 }
diff --git a/fs/inode.c b/fs/inode.c
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -210,7 +210,7 @@
INIT_LIST_HEAD(&inode->i_devices);
INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
rwlock_init(&inode->i_data.tree_lock);
-   spin_lock_init(&inode->i_data.i_mmap_lock);
+   init_rwsem(&inode->i_data.i_mmap_sem);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
INIT_RAW_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
diff --git a/include/linux/fs.h b/include/linux/fs.h
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -503,7 +503,7 @@
unsigned int i_mmap_writable;   /* count VM_SHARED mappings */
struct prio_tree_root   i_mmap; /* tree of private and shared mappings */
struct list_head i_mmap_nonlinear;  /* list VM_NONLINEAR mappings */
-   spinlock_t  i_mmap_lock;/* protect tree, count, list */
+   struct rw_semaphore i_mmap_sem; /* protect tree, count, list */
unsigned int truncate_count;    /* Cover race condition with truncate */
unsigned long   nrpages;/* number of total pages */
pgoff_t writeback_index;/* writeback starts here */
diff --git a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -716,7 +716,7 @@
struct address_space *check_mapping;/* Check page->mapping if set */
pgoff_t first_index;/* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
-   spinlock_t *i_mmap_lock;/* For unmap_mapping_range: */
+   struct rw_semaphore *i_mmap_sem;/* For unmap_mapping_range: */
unsigned long truncate_count;   /* Compare vm_truncate_count */
 };
 
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -274,12 +274,12 @@
atomic_dec(&inode->i_writecount);
 
/* insert tmp into the share list, just after mpnt */
-   spin_lock(&file->f_mapping->i_mmap_lock);
+   down_write(&file->f_mapping->i_mmap_sem);
tmp->vm_truncate_count = mpnt->vm_truncate_count;
flush_dcache_mmap_lock(file->f_mapping);
vma_prio_tree_add(tmp, mpnt);
flush_dcache_mmap_unlock(file->f_mapping);
-   spin_unlock(&file->f_mapping->i_mmap_lock);
+   

[kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
This patch changes virtio devices to be multi-function devices whenever
possible.  This increases the number of virtio devices we can support now by
a factor of 8.
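
The factor of 8 comes from the PCI devfn encoding: a 5-bit slot number plus a 3-bit function number, so each slot can carry up to eight functions. The sketch below mirrors the allocation logic of this patch in standalone form. DEVFN(), DEVFN_SLOT(), DEVFN_FUNC(), next_devfn() and mark_multifunction() are hypothetical names introduced for illustration; only the arithmetic and the 0x0e/0x80 header-type bit are taken from the patch.

```c
#include <stdint.h>

/* devfn packs a 5-bit slot number and a 3-bit function number. */
#define DEVFN(slot, fn)   ((((slot) & 0x1f) << 3) | ((fn) & 0x07))
#define DEVFN_SLOT(devfn) (((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn) ((devfn) & 0x07)

/*
 * Given the devfn of the previously registered device, return the
 * devfn to request next, or -1 to ask for a fresh slot once all
 * eight functions of the current slot are in use.
 */
static int next_devfn(int prev_devfn)
{
    if ((prev_devfn % 8) == 7)
        return -1;
    return prev_devfn + 1;
}

/*
 * Function 0 advertises a multi-function device through bit 7 of
 * the header type byte at config space offset 0x0e.
 */
static void mark_multifunction(uint8_t *config, int devfn)
{
    if (DEVFN_FUNC(devfn) == 0)
        config[0x0e] |= 0x80;
}
```

Starting from the patch's initial value of 7, the first request is -1 (any free slot), and the following seven devices land on functions 1 to 7 of that slot.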

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

I haven't tested the Windows virtio drivers.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..df3a878 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 9100bb1..9ea14d3 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 PCIDevice *pci_dev;
 uint8_t *config;
 uint32_t size;
+static int devfn = 7;
+
+if ((devfn % 8) == 7)
+   devfn = -1;
+else
+   devfn++;
 
 pci_dev = pci_register_device(bus, name, struct_size,
- -1, NULL, NULL);
+ devfn, NULL, NULL);
+
+    devfn = pci_dev->devfn;
+
 vdev = to_virtio_device(pci_dev);
 
 vdev->status = 0;
@@ -435,6 +444,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
 config[0x3d] = 1;
 
+/* Mark device as multi-function */
+if ((devfn % 8) == 0)
+   config[0x0e] |= 0x80;
+
 vdev->name = name;
 vdev->config_len = config_size;
 if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
 uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index b645fb7..7992a77 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -151,7 +151,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS   2
 #define MAX_SCSI_DEVS  7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 7dd0094..e203a4d 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8754,7 +8754,7 @@ static BOOL WINAPI qemu_ctrl_handler(DWORD type)
 }
 #endif
 
-#define MAX_NET_CLIENTS 32
+#define MAX_NET_CLIENTS 512
 
 static int saved_argc;
 static char **saved_argv;



[kvm-devel] KVM: PIT: make last_injected_time per-guest

2008-04-22 Thread Marcelo Tosatti

Otherwise multiple guests use the same variable and boom.

Also use kvm_vcpu_kick() to make sure that if a timer triggers on 
a different CPU the event won't be missed.
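
The first problem is easy to reproduce outside the kernel: a function-local static is a single variable no matter how many guests call the function. The sketch below is a simplified illustration. struct guest_state, should_inject_shared() and should_inject() are hypothetical names, and the real condition in the patch also checks inject_pending.

```c
/* Hypothetical per-guest state; the field mirrors the one the patch adds. */
struct guest_state {
    unsigned long last_injected_time;
};

/* Buggy shape: the static timestamp is shared by every caller. */
static int should_inject_shared(unsigned long now, unsigned long interval)
{
    static unsigned long last_injected_time;

    if (now - last_injected_time >= interval) {
        last_injected_time = now;
        return 1;
    }
    return 0;
}

/* Fixed shape: each guest tracks its own timestamp. */
static int should_inject(struct guest_state *gs, unsigned long now,
                         unsigned long interval)
{
    if (now - gs->last_injected_time >= interval) {
        gs->last_injected_time = now;
        return 1;
    }
    return 0;
}
```

With the shared version, an injection for one guest resets the clock for all others, so a second guest asking at the same moment is wrongly refused.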

Signed-off-by: Marcelo Tosatti [EMAIL PROTECTED]
Tested-and-Acked-by: Alex Davis [EMAIL PROTECTED]

diff --git a/arch/x86/kvm/i8254.c b/arch/x86/kvm/i8254.c
index 2852dd1..5697ad2 100644
--- a/arch/x86/kvm/i8254.c
+++ b/arch/x86/kvm/i8254.c
@@ -200,10 +200,8 @@ int __pit_timer_fn(struct kvm_kpit_state *ps)
 
atomic_inc(&pt->pending);
smp_mb__after_atomic_inc();
-   if (vcpu0 && waitqueue_active(&vcpu0->wq)) {
-   vcpu0->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-   wake_up_interruptible(&vcpu0->wq);
-   }
+   if (vcpu0)
+   kvm_vcpu_kick(vcpu0);
 
pt->timer.expires = ktime_add_ns(pt->timer.expires, pt->period);
pt->scheduled = ktime_to_ns(pt->timer.expires);
@@ -572,7 +570,6 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
struct kvm_pit *pit = vcpu->kvm->arch.vpit;
struct kvm *kvm = vcpu->kvm;
struct kvm_kpit_state *ps;
-   static unsigned long last_injected_time;
 
if (vcpu && pit) {
ps = &pit->pit_state;
@@ -582,11 +579,11 @@ void kvm_inject_pit_timer_irqs(struct kvm_vcpu *vcpu)
 * 2. Last interrupt was accepted or waited for too long time*/
if (atomic_read(&ps->pit_timer.pending) &&
(ps->inject_pending ||
-   (jiffies - last_injected_time
+   (jiffies - ps->last_injected_time
>= KVM_MAX_PIT_INTR_INTERVAL))) {
ps->inject_pending = 0;
__inject_pit_timer_intr(kvm);
-   last_injected_time = jiffies;
+   ps->last_injected_time = jiffies;
}
}
 }
diff --git a/arch/x86/kvm/i8254.h b/arch/x86/kvm/i8254.h
index e63ef38..db25c2a 100644
--- a/arch/x86/kvm/i8254.h
+++ b/arch/x86/kvm/i8254.h
@@ -35,6 +35,7 @@ struct kvm_kpit_state {
struct mutex lock;
struct kvm_pit *pit;
bool inject_pending; /* if inject pending interrupts */
+   unsigned long last_injected_time;
 };
 
 struct kvm_pit {



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.

 With this patch, I've been able to launch a guest with either 220 disks or 220
 network adapters.

   

Does this play well with hotplug?  Perhaps we need to allocate a new 
device on hotplug.

(certainly if we have a device with one function, which then gets 
converted to a multifunction device)

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
  The virtio config space was originally chosen to be little-endian,
  because we thought the config might be part of the PCI config space
  for virtio_pci.  It's actually a separate mmio region, so that
  argument holds little water; as only x86 is currently using the virtio
  mechanism, we can change this (but must do so now, before the
  impending s390 and ppc merges).

 This will probably annoy Hollis, who has guests that can go both ways.

Yes, I discussed this with Hollis.  But the virtio rings themselves already 
have this issue: we don't do any endian conversion on them and assume 
they're our endian in the guest.

We may still regret not doing *everything* little-endian, but this doesn't 
make it worse.
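
To make the two conventions concrete, here is a small sketch of how a 32-bit config field would be read under each. read_le32() and read_native32() are hypothetical helpers, not virtio API: the first decodes a fixed little-endian layout on any CPU, the second reads the bytes in the byte order of the CPU running the code, which is what "guest endian" amounts to inside the guest.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit field stored little-endian, whatever the CPU's byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Read a 32-bit field in the byte order of the CPU running this code. */
static uint32_t read_native32(const uint8_t *p)
{
    uint32_t v;

    memcpy(&v, p, sizeof(v));
    return v;
}
```

On a little-endian machine such as x86 the two functions return the same value; on a big-endian guest they differ, and with guest-endian config space it is the host that has to know which order to write.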

Thanks,
Rusty.




Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Tuesday 22 April 2008 17:44:08 Christian Borntraeger wrote:
 On Tuesday, 22 April 2008, Rusty Russell wrote:
  [Christian, Hollis, how much is this ABI breakage going to hurt you?]

 It is ok for s390 at the moment. We are still working on making userspace
 ready and I plan to change the guest-host for s390 anyway. I try to make
 these changes for drivers/s390/kvm/kvm_virtio.c before 2.6.26. The main
 reason is, that we are currently limited to around 80 devices. I am not
 sure, if I should change the allocation of the virtqueues and descriptors
 to guest memory as well.

Large rings require contiguous memory, which makes guest allocation 
problematic.  512 elems at 4k pages == 5 pages.
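
The 5-page figure can be checked with a short calculation. The sketch below assumes the ring layout of that time: 16-byte descriptors, an available ring of 16-bit entries with a 4-byte header, and a used ring of 8-byte elements with a 4-byte header that starts on a page boundary. ring_bytes() and ring_pages() are hypothetical helpers written for this note, and later ring features that append extra fields are not accounted for.

```c
#include <stddef.h>
#include <stdint.h>

/* Element sizes assumed for the ring layout described above. */
#define VRING_DESC_SIZE      16u  /* u64 addr, u32 len, u16 flags, u16 next */
#define VRING_USED_ELEM_SIZE  8u  /* u32 id, u32 len */

/*
 * Bytes needed for a ring of num entries. The descriptor table and
 * available ring are padded so the used ring starts on an align
 * boundary (align must be a power of two).
 */
static size_t ring_bytes(unsigned int num, size_t align)
{
    size_t desc_and_avail = (size_t)VRING_DESC_SIZE * num  /* descriptors */
        + sizeof(uint16_t) * (2 + (size_t)num);            /* avail ring */
    size_t used = sizeof(uint16_t) * 2                     /* used header */
        + (size_t)VRING_USED_ELEM_SIZE * num;              /* used ring */

    desc_and_avail = (desc_and_avail + align - 1) & ~(align - 1);
    return desc_and_avail + used;
}

/* Whole pages needed to hold the ring. */
static size_t ring_pages(unsigned int num, size_t page_size)
{
    return (ring_bytes(num, page_size) + page_size - 1) / page_size;
}
```

For 512 entries and 4096-byte pages this gives 8192 + 1028 bytes, padded to 12288, plus 4100 bytes for the used ring: 16388 bytes, which rounds up to 5 contiguous pages.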

 Back to your patch:
 I have still some ideas about virtio between little endian and big endian
 systems, but it requires more and different marshalling anyway - even on
 driver level. No idea yet how to solve that properly.

So far we've pushed such considerations onto the host.  This does mean that 
you can't virtio connect two guests directly without understanding the 
contents of the buffers so you can endian correct (eg. direct inter-guest 
networking).  inter-guest virtio is currently a party trick anyway, so I'm 
not sure it's a real issue.

  +   vb->vdev->config->get(vb->vdev,
  + offsetof(struct virtio_balloon_config, num_pages),
  + &v);

 this is missing a sizeof(v), no?

Ah... sure enough, I fixed that in a follow-on patch.  Well-spotted, thanks!

Cheers,
Rusty.



Re: [kvm-devel] KVM: PIT: make last_injected_time per-guest

2008-04-22 Thread Avi Kivity
Marcelo Tosatti wrote:
 Otherwise multiple guests use the same variable and boom.

 Also use kvm_vcpu_kick() to make sure that if a timer triggers on 
 a different CPU the event won't be missed.

   

Applied, thanks.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:
   
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.

 With this patch, I've been able to launch a guest with either 220 disks or 220
 network adapters.

   
 

 Does this play well with hotplug?  Perhaps we need to allocate a new 
 device on hotplug.
   

Probably not.  I imagine you can only hotplug devices, not individual 
functions?

Regards,

Anthony Liguori

 (certainly if we have a device with one function, which then gets 
 converted to a multifunction device)

   




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:
 Avi Kivity wrote:
 Anthony Liguori wrote:
  
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.

 With this patch, I've been able to launch a guest with either 220 disks or 220
 network adapters.

   

 Does this play well with hotplug?  Perhaps we need to allocate a new 
 device on hotplug.
   

 Probably not.  I imagine you can only hotplug devices, not individual 
 functions?


It sounds reasonable to expect so.  ACPI has objects for devices, not 
functions (IIRC).

Maybe require explicit device/function assignment on the command line?  
It will be managed anyway.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Anthony Liguori
Jamie Lokier wrote:
 Avi Kivity wrote:
   
 And video streaming on some embedded devices with no MMU!  (Due to the
 page cache heuristics working poorly with no MMU, sustained reliable
 streaming is managed with O_DIRECT and the app managing cache itself
 (like a database), and that needs AIO to keep the request queue busy.
 At least, that's the theory.)
   
 Could use threads as well, no?
 

 Perhaps.  This raises another point about AIO vs. threads:

 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
   

There's no guarantee that any sort of order will be preserved by AIO 
requests.  The same is true with writes.  This is what fdsync is for, to 
guarantee ordering.
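
The idea of using a data sync as an ordering point can be sketched as follows. To keep the example short it uses synchronous pwrite() and fdatasync() rather than an AIO interface, so it only illustrates the barrier, not asynchronous submission. write_all_at() and write_in_order() are hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <unistd.h>

/* Write len bytes at off, looping over short writes. Returns 0 or -1. */
static int write_all_at(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);

        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Issue first, force it to stable storage, and only then issue second. */
static int write_in_order(int fd,
                          const void *first, size_t first_len, off_t first_off,
                          const void *second, size_t second_len, off_t second_off)
{
    if (write_all_at(fd, first, first_len, first_off) != 0)
        return -1;
    if (fdatasync(fd) != 0)     /* ordering point between the two writes */
        return -1;
    return write_all_at(fd, second, second_len, second_off);
}
```

Without the sync in the middle, nothing obliges the two writes to reach the disk in submission order.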

Regards,

Anthony Liguori



Re: [kvm-devel] [Qemu-devel] Re: [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Laurent Vivier

On Tuesday, 22 April 2008 at 08:50 -0500, Anthony Liguori wrote:
 Nguyen Anh Quynh wrote:
  Hi,
 
  This should be submitted to upstream (but not to kvm-devel list), but
  this is only the test code that I want to quickly send out for
  comments. In case it looks OK, I will send it to upstream later.
 
  Inspired by extboot and conversations with Anthony and HPA, this
  linuxboot option ROM is a simple option ROM that intercepts int19 in
  order to execute linux setup code. This approach eliminates the need
  to manipulate the boot sector for this purpose.
 
  To test it, just load linux kernel with your KVM/QEMU image using
  -kernel option in normal way.
 
  I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
  Ubuntu 8.04.

 
 For the next rounds, could you actually rebase against upstream QEMU and 
 submit to qemu-devel?  One of Paul Brook's objections to extboot had 
 historically been that it wasn't easily sharable with other 
 architectures.  With a C version, it seems more reasonable now to do that.

Moreover, add a binary version of the ROM to the pc-bios directory: that
avoids needing a cross-compiler to build the ROM on non-x86 architectures.

Regards,
Laurent

 Make sure you remove all the old linux boot code too within QEMU along 
 with the -hda checks.
 
 Regards,
 
 Anthony Liguori
 
  Thanks,
  Quynh
 
 
  # diffstat linuxboot1.diff
   Makefile             |   13 -
   linuxboot/Makefile   |   40 +++
   linuxboot/boot.S     |   54 +
   linuxboot/farvar.h   |  130 +++
   linuxboot/rom.c      |  104
   linuxboot/signrom    |binary
   linuxboot/signrom.c  |  128 ++
   linuxboot/util.h     |   69 +++
   qemu/Makefile        |    3 -
   qemu/Makefile.target |    2
   qemu/hw/linuxboot.c  |   39 +++
   qemu/hw/pc.c         |   22 +++-
   qemu/hw/pc.h         |    5 +
   13 files changed, 600 insertions(+), 9 deletions(-)

 
 
 
 
-- 
- [EMAIL PROTECTED] ---
The best way to predict the future is to invent it.
- Alan Kay




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Avi Kivity
Jamie Lokier wrote:
 Avi Kivity wrote:
   
 And video streaming on some embedded devices with no MMU!  (Due to the
 page cache heuristics working poorly with no MMU, sustained reliable
 streaming is managed with O_DIRECT and the app managing cache itself
 (like a database), and that needs AIO to keep the request queue busy.
 At least, that's the theory.)
   
 Could use threads as well, no?
 

 Perhaps.  This raises another point about AIO vs. threads:

 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
   

Yes, unless the implementation in the kernel (or glibc) is threaded.

 With threads this isn't guaranteed and scheduling makes it quite
 likely to issue the parallel synchronous reads out of order, and for
 them to reach the disk out of order because the elevator doesn't see
 them simultaneously.
   

If the disk is busy, it doesn't matter.  The requests will queue and the 
elevator will sort them out.  So it's just the first few requests that 
may get to disk out of order.

 With AIO (non-Glibc! (and non-kthreads)) it might be better at
 keeping the intended issue order, I'm not sure.

 It is highly desirable: O_DIRECT streaming performance depends on
 avoiding seeks (no reordering) and on keeping the request queue
 non-empty (no gap).

 I read a man page for some other unix, describing AIO as better than
 threaded parallel reads for reading tape drives because of this (tape
 seeks are very expensive).  But the rest of the man page didn't say
 anything more.  Unfortunately I don't remember where I read it.  I
 have no idea whether AIO submission order is nearly always preserved
 in general, or expected to be.
   

I haven't considered tape, but this is a good point indeed.  I expect it 
doesn't make much of a difference for a loaded disk.

   
 It's me at fault here.  I just assumed that because it's easy to do aio 
 in a thread pool efficiently, that's what glibc does.

 Unfortunately the code does some ridiculous things like not service 
 multiple requests on a single fd in parallel.  I see absolutely no 
 reason for it (the code says fight for resources).
 

 Ouch.  Perhaps that relates to my thought above, about multiple
 requests to the same file causing seek storms when thread scheduling
 is unlucky?
   

My first thought on seeing this is that it relates to a deficiency on 
older kernels servicing multiple requests on a single fd (i.e. a 
per-file lock).  I don't know if such a deficiency ever existed, though.

   
 It could and should.  It probably doesn't.

 A simple thread pool implementation could come within 10% of Linux aio 
 for most workloads.  It will never be exactly, but for small numbers 
 of disks, close enough.
 

 I would wait for benchmark results for I/O patterns like sequential
 reading and writing, because of potential for seeks caused by request
 reordering, before being confident of that.

   

I did have measurements (and a test rig) at a previous job (where I did 
a lot of I/O work); IIRC the performance of a tuned thread pool was not 
far behind aio, both for seeks and sequential.  It was a while back though.


-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread Anthony Liguori
Nguyen Anh Quynh wrote:
 Hi,

 This should be submitted to upstream (but not to kvm-devel list), but
 this is only the test code that I want to quickly send out for
 comments. In case it looks OK, I will send it to upstream later.

 Inspired by extboot and conversations with Anthony and HPA, this
 linuxboot option ROM is a simple option ROM that intercepts int19 in
 order to execute linux setup code. This approach eliminates the need
 to manipulate the boot sector for this purpose.

 To test it, just load linux kernel with your KVM/QEMU image using
 -kernel option in normal way.

 I successfully compiled and tested it with kvm-66 on Ubuntu 7.10, guest
 Ubuntu 8.04.
   

For the next rounds, could you actually rebase against upstream QEMU and 
submit to qemu-devel?  One of Paul Brook's objections to extboot had 
historically been that it wasn't easily sharable with other 
architectures.  With a C version, it seems more reasonable now to do that.

Make sure you remove all the old linux boot code too within QEMU along 
with the -hda checks.

Regards,

Anthony Liguori

 Thanks,
 Quynh


 # diffstat linuxboot1.diff
  Makefile             |   13 -
  linuxboot/Makefile   |   40 +++
  linuxboot/boot.S     |   54 +
  linuxboot/farvar.h   |  130 +++
  linuxboot/rom.c      |  104
  linuxboot/signrom    |binary
  linuxboot/signrom.c  |  128 ++
  linuxboot/util.h     |   69 +++
  qemu/Makefile        |    3 -
  qemu/Makefile.target |    2
  qemu/hw/linuxboot.c  |   39 +++
  qemu/hw/pc.c         |   22 +++-
  qemu/hw/pc.h         |    5 +
  13 files changed, 600 insertions(+), 9 deletions(-)
   




Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:

 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
   

 There's no guarantee that any sort of order will be preserved by AIO 
 requests.  The same is true with writes.  This is what fdsync is for, 
 to guarantee ordering.

I believe he'd like a hint to get good scheduling, not a guarantee.  
With a thread pool if the threads are scheduled out of order, so are 
your requests.  If the elevator doesn't plug the queue, the first few 
requests may not be optimally sorted.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Luca Tettamanti
On Tue, Apr 22, 2008 at 4:15 PM, Anthony Liguori [EMAIL PROTECTED] wrote:
 This patch changes virtio devices to be multi-function devices whenever
  possible.  This increases the number of virtio devices we can support now by
  a factor of 8.
[...]
  diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
  index 9100bb1..9ea14d3 100644
  --- a/qemu/hw/virtio.c
  +++ b/qemu/hw/virtio.c
  @@ -405,9 +405,18 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
  PCIDevice *pci_dev;
  uint8_t *config;
  uint32_t size;
  +static int devfn = 7;
  +
  +if ((devfn % 8) == 7)
  +   devfn = -1;
  +else
  +   devfn++;

This code looks strange... devfn should be passed to virtio_init_pci by
virtio-{net,blk} init functions, no?
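
For context, a PCI devfn packs a 5-bit slot number and a 3-bit function number into one byte, which is where the factor of 8 comes from. The following is only a sketch of what handing out devfn values explicitly might look like. `virtio_next_devfn()` and its `base_slot` argument are hypothetical names that do not appear in the patch; the three macros use the same encoding as the Linux kernel macros of the same names.

```c
#include <assert.h>

/* Same encoding as the Linux kernel macros of these names:
 * bits 7..3 hold the slot (device) number, bits 2..0 the function. */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define PCI_SLOT(devfn)       (((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)       ((devfn) & 0x07)

/* Hypothetical helper: devfn for the n-th virtio device, filling all
 * eight functions of a slot before moving on to the next slot.
 * base_slot is the first slot set aside for virtio (an assumption).
 * Returns -1 when the bus has no slots left. */
static int virtio_next_devfn(int base_slot, int n)
{
    int slot;

    assert(base_slot >= 0 && n >= 0);
    slot = base_slot + n / 8;
    if (slot > 31)
        return -1;
    return PCI_DEVFN(slot, n % 8);
}
```

With something along these lines, each virtio-{net,blk} init function could compute its own devfn and pass it down, rather than virtio_init_pci keeping a static counter.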

Luca



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Anthony Liguori wrote:
 Perhaps.  This raises another point about AIO vs. threads:
 
 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
 
 There's no guarantee that any sort of order will be preserved by AIO 
 requests.  The same is true with writes.  This is what fdsync is for, to 
 guarantee ordering.

You misunderstand.  I'm not talking about guarantees, I'm talking
about expectations for the performance effect.

Basically, to do performant streaming read with O_DIRECT you need two
things:

   1. Overlap at least 2 requests, so the device is kept busy.

   2. Requests be sent to the disk in a good order, which is usually
  (but not always) sequential offset order.

The kernel does this itself with buffered reads, doing readahead.
It works very well, unless you have other problems caused by readahead.

With O_DIRECT, an application has to do the equivalent of readahead
itself to get performant streaming.

If the app uses two threads calling pread(), it's hard to ensure the
kernel even _sees_ the first two calls in sequential offset order.
You spawn two threads, and then both threads call pread() with
non-deterministic scheduling.  The problem starts before even entering
the kernel.

Then, depending on I/O scheduling in the kernel, it might send the
less good pread() to the disk immediately, then later a backward head
seek and the other one.  The elevator cannot fix this: it doesn't have
enough information, unless it adds artificial delays.  But artificial
delays may harm too; it's not optimal.

After that, the two threads tend to call pread() in the best order
provided there's no scheduling conflicts, but are easily disrupted by
other tasks, especially on SMP (one reading thread per CPU, so when
one of them is descheduled, the other continues and issues a request
in the 'wrong' order.)

With AIO, even though you can't be sure what the kernel does, you can
be sure the kernel receives aio_read() calls in the exact order which
is most likely to perform well.  Application knowledge of its access
pattern is passed along better.
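
A minimal sketch of that submission pattern with POSIX AIO, under stated assumptions: the caller has already opened `fd` (with O_DIRECT and suitably aligned buffers if that is wanted), and the helper names `submit_sequential_reads()` and `wait_for_read()` are made up for illustration. It only shows that the application issues aio_read() calls in ascending offset order from a single thread; it says nothing about what a given implementation does with them afterwards.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/* Queue nreq reads of blk bytes each, in ascending offset order.
 * cbs[] and bufs[] belong to the caller and must stay valid until
 * every request has completed. Returns 0, or -1 if a submit fails. */
static int submit_sequential_reads(int fd, struct aiocb *cbs, char **bufs,
                                   int nreq, size_t blk, off_t start)
{
    assert(nreq > 0);
    for (int i = 0; i < nreq; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = blk;
        cbs[i].aio_offset = start + (off_t)i * (off_t)blk;
        if (aio_read(&cbs[i]) != 0) {
            perror("aio_read");
            return -1;
        }
    }
    return 0;
}

/* Block until one request finishes; return its byte count or -1. */
static ssize_t wait_for_read(struct aiocb *cb)
{
    const struct aiocb *list[1] = { cb };

    while (aio_error(cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    return aio_return(cb);
}
```

A streaming reader would keep at least two such requests outstanding and resubmit each control block at the next offset as soon as it completes.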

As I've said, I saw a man page which described why this makes AIO
superior to using threads for reading tapes on that OS.  So it's not a
completely spurious point.

This has nothing to do with guarantees.

-- Jamie



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Ryan Harper
* Anthony Liguori [EMAIL PROTECTED] [2008-04-22 09:16]:
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.
 
 With this patch, I've been able to launch a guest with either 220 disks or 220
 network adapters.

Have you confirmed that the network devices show up?  I was playing
around with some of the limits last night and while it is easy to get
QEMU to create the adapters, so far I've only had a guest see 29 pci
nics (e1000).


-- 
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
(512) 838-9253   T/L: 678-9253
[EMAIL PROTECTED]



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
 Andrea Arcangeli wrote:
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +cond_resched();
 +if ((unsigned long)*(spinlock_t **)a <
 +(unsigned long)*(spinlock_t **)b)
 +return -1;
 +else if (a == b)
 +return 0;
 +else
 +return 1;
 +}
 +
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...

Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

 static int mm_lock_cmp(const void *a, const void *b)
 {
   unsigned long la = (unsigned long)*(spinlock_t **)a;
   unsigned long lb = (unsigned long)*(spinlock_t **)b;

   cond_resched();
   if (la < lb)
   return -1;
   if (la > lb)
   return 1;
   return 0;
 }

If your intent is to use the assumption that there are going to be few
equal entries, you should have used likely(la > lb) to signal it's
rarely going to return zero or gcc is likely free to do whatever it
wants with the above. Overall that function is such a slow path that
this is going to be lost in the noise. My suggestion would be to defer
microoptimizations like this until after 1/12 has been applied to mainline.

Thanks!




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Ryan Harper wrote:
 * Anthony Liguori [EMAIL PROTECTED] [2008-04-22 09:16]:
   
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.

 With this patch, I've been able to launch a guest with either 220 disks or 220
 network adapters.
 

 Have you confirmed that the network devices show up?  I was playing
 around with some of the limits last night and while it is easy to get
 QEMU to create the adapters, so far I've only had a guest see 29 pci
 nics (e1000).
   

Yup, I had an eth219

Regards,

Anthony Liguori





Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
 Anthony Liguori wrote:
  This patch changes virtio devices to be multi-function devices whenever
  possible.  This increases the number of virtio devices we can support now by
  a factor of 8.
 
  With this patch, I've been able to launch a guest with either 220 disks or 220
  network adapters.
 

 
 Does this play well with hotplug?  Perhaps we need to allocate a new 
 device on hotplug.
 
 (certainly if we have a device with one function, which then gets 
 converted to a multifunction device)

Would have to change the hotplug code to handle functions...

It sounds less hacky to just extend the PCI slots instead of (ab)using
multiple functions per-slot.



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
 Anthony Liguori wrote:
 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
   
 There's no guarantee that any sort of order will be preserved by AIO 
 requests.  The same is true with writes.  This is what fdsync is for, 
 to guarantee ordering.
 
 I believe he'd like a hint to get good scheduling, not a guarantee.  
 With a thread pool if the threads are scheduled out of order, so are 
 your requests.

 If the elevator doesn't plug the queue, the first few requests may
 not be optimally sorted.

That's right.  Then they tend to settle to a good order.  But any
delay in scheduling one of the threads, or a signal received by one of
them, can make it lose order briefly, making the streaming stutter as
the disk performs a few local seeks until it settles to good order
again.

You can mitigate the disruption in various ways.

  1. If all threads share an offset variable, and reads and
 increments that atomically just prior to calling pread(), that helps
 especially at the start.  (If threaded I/O is used for QEMU disk
 emulation, I would suggest doing that, in the more general form
 of popping a request from QEMU's internal shared queue at the last
 moment.)

  2. Using more threads helps keep it sustained, at the cost of more
 wasted I/O when there's a cancellation (changed mind), and more
 memory.
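
A rough sketch of mitigation 1, assuming a fixed block size and a plain file descriptor. `struct stream`, `reader()` and `stream_read()` are hypothetical names, not QEMU code. Each worker claims the next block index with an atomic fetch-and-add immediately before calling pread(), so the order in which offsets are claimed is always sequential even when the threads themselves are scheduled unevenly.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLK 4096

struct stream {
    int fd;
    atomic_long next;     /* index of the next block to claim */
    atomic_int err;       /* set to 1 if any read fails */
    long nblocks;         /* total number of blocks to read */
    unsigned char *dst;   /* nblocks * BLK bytes of output */
};

/* Worker: claim a block index at the last moment, then read it. */
static void *reader(void *arg)
{
    struct stream *s = arg;

    for (;;) {
        long i = atomic_fetch_add(&s->next, 1);

        if (i >= s->nblocks)
            break;
        if (pread(s->fd, s->dst + i * BLK, BLK, (off_t)i * BLK) != BLK) {
            perror("pread");            /* error or short read */
            atomic_store(&s->err, 1);
            break;
        }
    }
    return NULL;
}

/* Run up to 16 workers over the stream; 0 on success, -1 on failure. */
static int stream_read(struct stream *s, int nthreads)
{
    pthread_t tid[16];
    int started = 0, rc = 0;

    assert(nthreads > 0 && nthreads <= 16);
    atomic_store(&s->next, 0);
    atomic_store(&s->err, 0);
    while (started < nthreads) {
        if (pthread_create(&tid[started], NULL, reader, s) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return (rc == 0 && atomic_load(&s->err) == 0) ? 0 : -1;
}
```

This does not stop a descheduled thread from delaying the block it has already claimed; it only narrows the window between choosing an offset and issuing the read.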

However, AIO, in principle (if not in its implementations...), could be
better at keeping the suggested I/O order than threads, without special tricks.

-- Jamie



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Avi Kivity
Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
   
 Andrea Arcangeli wrote:
 
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +   cond_resched();
 +   if ((unsigned long)*(spinlock_t **)a <
 +   (unsigned long)*(spinlock_t **)b)
 +   return -1;
 +   else if (a == b)
 +   return 0;
 +   else
 +   return 1;
 +}
 +
   
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...
 

 Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?

   

You need to compare *a to *b (at least, that's what you're doing for the
< case).
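
To illustrate the point in userspace, here is a sketch of a comparator that dereferences both array slots for every test, so two slots holding the same lock pointer compare equal. The `spinlock_t` typedef is only a stand-in for the kernel type and `lock_ptr_cmp()` is a hypothetical name; cond_resched() has no userspace equivalent and is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's spinlock_t (assumption). */
typedef struct { int dummy; } spinlock_t;

/* Order two array slots by the address of the lock each one points to.
 * a and b are pointers to slots, so both must be dereferenced before
 * comparing; testing a == b would only compare the slot addresses,
 * which are never equal for two distinct array elements. */
static int lock_ptr_cmp(const void *a, const void *b)
{
    uintptr_t la = (uintptr_t)*(spinlock_t *const *)a;
    uintptr_t lb = (uintptr_t)*(spinlock_t *const *)b;

    if (la < lb)
        return -1;
    if (la > lb)
        return 1;
    return 0;
}
```

It plugs into qsort() as `qsort(v, n, sizeof(v[0]), lock_ptr_cmp)`, after which duplicate lock pointers sit next to each other.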

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH 0 of 9] mmu notifier #v12

2008-04-22 Thread Robin Holt
Andrew, Could we get direction/guidance from you as regards
the invalidate_page() callout of Andrea's patch set versus the
invalidate_range_start/invalidate_range_end callout pairs of Christoph's
patchset?  This is only in the context of the __xip_unmap, do_wp_page,
page_mkclean_one, and try_to_unmap_one call sites.

On Tue, Apr 22, 2008 at 03:48:47PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 08:36:04AM -0500, Robin Holt wrote:
  I am a little confused about the value of the seq_lock versus a simple
  atomic, but I assumed there is a reason and left it at that.
 
 There's no value for anything but get_user_pages (get_user_pages takes
 its own lock internally though). I preferred to explain it as a
 seqlock because it was simpler for reading, but I totally agree in the
 final implementation it shouldn't be a seqlock. My code was meant to
 be pseudo-code only. It doesn't even need to be atomic ;).

Unless there is additional locking in your fault path, I think it does
need to be atomic.

  I don't know what you mean by it'd run slower and what you mean by
  armed and disarmed.
 
 1) when armed the time-window where the kvm-page-fault would be
 blocked would be a bit larger without invalidate_page for no good
 reason

But that is a distinction without a difference.  In the _start/_end
case, kvm's fault handler will not have any _DIRECT_ blocking, but
get_user_pages() had certainly better block waiting for some other lock
to prevent the process's pages being refaulted.

I am no VM expert, but that seems like it is critical to having a
consistent virtual address space.  Effectively, you have a delay on the
kvm fault handler beginning when either invalidate_page() is entered
or invalidate_range_start() is entered until when the _CALLER_ of the
invalidate* method has unlocked.  That time will remain essentially
identical for either case.  I would argue you would be hard pressed to
even measure the difference.

 2) if you were to remove invalidate_page when disarmed the VM code
 would need two branches instead of one in various places

Those branches are conditional upon there being list entries.  That check
should be extremely cheap.  The vast majority of cases will have no
registered notifiers.  The second check for the _end callout will be
from cpu cache.

 I don't want to waste cycles if not wasting them improves performance
 both when armed and disarmed.

In summary, I think we have narrowed down the cost in the case of no
registered notifiers to being infinitesimal, and the case of registered
notifiers to being a distinction without a difference.

  When I was discussing this difference with Jack, he reminded me that
  the GRU, due to its hardware, does not have any race issues with the
  invalidate_page callout simply doing the tlb shootdown and not modifying
  any of its internal structures.  He then put a caveat on the discussion
  that _either_ method was acceptable as far as he was concerned.  The real
  issue is getting a patch in that satisfies all needs and not whether
  there is a separate invalidate_page callout.
 
 Sure, we have that patch now, I'll send it out in a minute, I was just
 trying to explain why it makes sense to have an invalidate_page too
 (which remains the only difference by now), removing it would be a
 regression on all sides, even if a minor one.

I think GRU is the only compelling case I have heard for having the
invalidate_page separate.  In the case of the GRU, the hardware enforces a
lifetime of the invalidate which covers all in-progress faults including
ones where the hardware is informed after the flush of a PTE.  In all
cases, once the GRU invalidate instruction is issued, all active requests
are invalidated.  Future faults will be blocked in get_user_pages().
Without that special feature of the hardware, I don't think any code
simplification exists.  I, of course, reserve the right to be wrong.

I believe the argument against a separate invalidate_page() callout was
Christoph's interpretation of Andrew's comments.  I am not certain Andrew
was aware of these special aspects of the GRU hardware and whether that
had been factored into the discussion at that point in time.


Thanks,
Robin



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
 Perhaps.  This raises another point about AIO vs. threads:
 
 If I submit sequential O_DIRECT reads with aio_read(), will they enter
 the device read queue in the same order, and reach the disk in that
 order (allowing for reordering when worthwhile by the elevator)?
 
 Yes, unless the implementation in the kernel (or glibc) is threaded.

 
 With threads this isn't guaranteed and scheduling makes it quite
 likely to issue the parallel synchronous reads out of order, and for
 them to reach the disk out of order because the elevator doesn't see
 them simultaneously.
 
 If the disk is busy, it doesn't matter.  The requests will queue and the 
 elevator will sort them out.  So it's just the first few requests that 
 may get to disk out of order.

There are two cases where it matters to a read-streaming app:

1. Disk isn't busy with anything else, maximum streaming
   performance is desired.

2. Disk is busy with unrelated things, but you're using I/O
   priorities to give the streaming app near-absolute priority.
   Then you need to maintain overlapped streaming requests,
   otherwise disk is given to a lower priority I/O.  If that
   happens often, you lose, priority is ineffective.  Because one
   of the streaming requests is usually being serviced, elevator
   has similar limitations as for a disk which is not busy with
   anything else.

 I haven't considered tape, but this is a good point indeed.  I expect it 
 doesn't make much of a difference for a loaded disk.

Yes, as long as it's loaded with unrelated requests at the same I/O
priority, the elevator has time to sort requests and hide thread
scheduling artifacts.

Btw, regarding QEMU: QEMU gets requests _after_ sorting by the guest's
elevator, then submits them to the host's elevator.  If the guest and
host elevators are both configured 'anticipatory', do the anticipatory
delays add up?

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Jamie Lokier
Avi Kivity wrote:
 And video streaming on some embedded devices with no MMU!  (Due to the
 page cache heuristics working poorly with no MMU, sustained reliable
 streaming is managed with O_DIRECT and the app managing cache itself
 (like a database), and that needs AIO to keep the request queue busy.
 At least, that's the theory.)
 
 Could use threads as well, no?

Perhaps.  This raises another point about AIO vs. threads:

If I submit sequential O_DIRECT reads with aio_read(), will they enter
the device read queue in the same order, and reach the disk in that
order (allowing for reordering when worthwhile by the elevator)?

With threads this isn't guaranteed and scheduling makes it quite
likely to issue the parallel synchronous reads out of order, and for
them to reach the disk out of order because the elevator doesn't see
them simultaneously.

With AIO (non-Glibc! (and non-kthreads)) it might be better at
keeping the intended issue order, I'm not sure.

It is highly desirable: O_DIRECT streaming performance depends on
avoiding seeks (no reordering) and on keeping the request queue
non-empty (no gap).

I read a man page for some other unix, describing AIO as better than
threaded parallel reads for reading tape drives because of this (tape
seeks are very expensive).  But the rest of the man page didn't say
anything more.  Unfortunately I don't remember where I read it.  I
have no idea whether AIO submission order is nearly always preserved
in general, or expected to be.

 It's me at fault here.  I just assumed that because it's easy to do aio 
 in a thread pool efficiently, that's what glibc does.
 
 Unfortunately the code does some ridiculous things like not service 
 multiple requests on a single fd in parallel.  I see absolutely no 
 reason for it (the code says fight for resources).

Ouch.  Perhaps that relates to my thought above, about multiple
requests to the same file causing seek storms when thread scheduling
is unlucky?

 So my comments only apply to linux-aio vs a sane thread pool.  Sorry for 
 spreading confusion.

Thanks.  I thought you'd measured it :-)

 It could and should.  It probably doesn't.
 
 A simple thread pool implementation could come within 10% of Linux aio 
 for most workloads.  It will never be exactly, but for small numbers 
 of disks, close enough.

I would wait for benchmark results for I/O patterns like sequential
reading and writing, because of potential for seeks caused by request
reordering, before being confident of that.

 Hmm.  Thanks.  I may consider switching to XFS now
 
 I'm rooting for btrfs myself.

In the unlikely event they backport btrfs to kernel 2.4.26-uc0, I'll
be happy to give it a try! :-)

-- Jamie



Re: [kvm-devel] [Qemu-devel] Re: [PATCH 1/3] Refactor AIO interface to allow other AIO implementations

2008-04-22 Thread Javier Guerra
On Tue, Apr 22, 2008 at 3:10 AM, Avi Kivity [EMAIL PROTECTED] wrote:
  I'm rooting for btrfs myself.

but could btrfs (when stable) work for migration?  i'm curious about
OCFS2 performance on this kind of load...

when i manage to sell the idea of a KVM cluster i'd like to know if i
should try first EVMS-HA (cluster LV's) or OCFS (cluster FS)

-- 
Javier



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Marcelo Tosatti wrote:
 Maybe require explicit device/function assignment on the command line?  
 It will be managed anyway.
 

 ACPI does support hotplugging of individual functions inside slots;
 I'm not sure how well Linux (and other OSes) support that, but it
 should be transparent.
   

I think we need to decide what we want to target in terms of upper limits.

With a bridge or two, we can probably easily do 128.

If we really want to push things, I think we should do a PCI based 
virtio controller.  I doubt a large number of PCI devices is ever going 
to perform very well b/c of interrupt sharing and some of the 
assumptions in virtio_pci.

If we implement a controller, we can use a single interrupt, but 
multiplex multiple notifications on that single interrupt.  We can also 
be more aggressive about using shared memory instead of PCI config space 
which would reduce the overall number of exits.
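As a rough illustration of the multiplexing idea, here is a minimal C sketch. It assumes a pending word that the host sets bits in and that the guest drains from one interrupt handler. `struct mux_ctrl`, `mux_irq`, `count_hit` and `MAX_VDEVS` are invented for this example and are not part of virtio_pci.

```c
#include <stdint.h>

#define MAX_VDEVS 32   /* hypothetical limit for this sketch */

typedef void (*vdev_notify_fn)(unsigned int dev);

/* Stand-in for state the host sets bits in; in a real design the
 * pending word would live in memory shared between host and guest. */
struct mux_ctrl {
	uint32_t pending;                   /* bit n set => device n needs service */
	vdev_notify_fn handler[MAX_VDEVS];  /* per-device callbacks */
};

/* Example callback: just count notifications per device. */
static unsigned int hits[MAX_VDEVS];

static void count_hit(unsigned int dev)
{
	hits[dev]++;
}

/* Single interrupt handler: drain the pending word and fan out.
 * Returns the number of devices serviced. */
static unsigned int mux_irq(struct mux_ctrl *c)
{
	uint32_t bits = c->pending;
	unsigned int dev, handled = 0;

	c->pending = 0;   /* a real implementation would need an atomic exchange */
	for (dev = 0; dev < MAX_VDEVS; dev++) {
		if ((bits & (UINT32_C(1) << dev)) && c->handler[dev]) {
			c->handler[dev](dev);
			handled++;
		}
	}
	return handled;
}
```

One interrupt then services every device whose bit is set, and reading the pending word from ordinary memory needs no exit, unlike a PCI config space access.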

We could easily support a very large number of devices this way.  But 
again, what do we want to target for now?

Regards,

Anthony Liguori





Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 06:22:48 Avi Kivity wrote:
 Rusty Russell wrote:
  [Christian, Hollis, how much is this ABI breakage going to hurt you?]
 
  A recent proposed feature addition to the virtio block driver revealed
  some flaws in the API, in particular how easy it is to break big
  endian machines.
 
  The virtio config space was originally chosen to be little-endian,
  because we thought the config might be part of the PCI config space
  for virtio_pci.  It's actually a separate mmio region, so that
  argument holds little water; as only x86 is currently using the virtio
  mechanism, we can change this (but must do so now, before the
  impending s390 and ppc merges).
 
 This will probably annoy Hollis, who has guests that can go both ways.

Rusty and I have discussed it. Ultimately, this just takes us from a 
cross-architecture endianness definition to a per-architecture definition. 
Anyways, we've already fallen into this situation with the virtio ring data 
itself, so we're really saying same endianness as the ring.
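To make the two definitions concrete, here is a small C sketch. It is an illustration only: `cfg_read_le32` and `cfg_read_native32` are hypothetical helpers, not functions from the virtio code.

```c
#include <stdint.h>
#include <string.h>

/* Cross-architecture definition: the field is little-endian on every
 * machine, so big-endian guests must assemble (swap) the bytes. */
static uint32_t cfg_read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Per-architecture definition: the field has the guest's own byte
 * order, so a plain copy is enough, as with the ring data. */
static uint32_t cfg_read_native32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));
	return v;
}
```

With the per-architecture definition, a driver can forget the swap without breaking big-endian machines, because there is no swap to forget.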

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] [RFC] linuxboot Option ROM for Linux kernel booting

2008-04-22 Thread H. Peter Anvin
Nguyen Anh Quynh wrote:
 Hi,
 
 I am thinking about combining this ROM with the extboot. Both ROMs
 are about booting, so I think that is reasonable. So we will have
 only 1 ROM that supports both external boot and Linux boot.
 
 Is that desirable or not?
 

Does it make the code simpler and easier to understand?  If not, then I 
would say no.

-hpa



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Avi Kivity
Anthony Liguori wrote:

 I think we need to decide what we want to target in terms of upper 
 limits.

 With a bridge or two, we can probably easily do 128.

 If we really want to push things, I think we should do a PCI based 
 virtio controller.  I doubt a large number of PCI devices is ever 
 going to perform very well b/c of interrupt sharing and some of the 
 assumptions in virtio_pci.

 If we implement a controller, we can use a single interrupt, but 
 multiplex multiple notifications on that single interrupt.  We can 
 also be more aggressive about using shared memory instead of PCI 
 config space which would reduce the overall number of exits.

 We could easily support a very large number of devices this way.  But 
 again, what do we want to target for now? 

I think that for networking we should keep things as is.  I don't see 
anybody using 100 virtual NICs.

For mass storage, we should follow the SCSI model with a single device 
serving multiple disks, similar to what you suggest.  Not sure if the 
device should have a single queue or one queue per disk.

-- 
error compiling committee.c: too many arguments to function




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Avi Kivity wrote:
 Anthony Liguori wrote:

 I think we need to decide what we want to target in terms of upper 
 limits.

 With a bridge or two, we can probably easily do 128.

 If we really want to push things, I think we should do a PCI based 
 virtio controller.  I doubt a large number of PCI devices is ever 
 going to perform very well b/c of interrupt sharing and some of the 
 assumptions in virtio_pci.

 If we implement a controller, we can use a single interrupt, but 
 multiplex multiple notifications on that single interrupt.  We can 
 also be more aggressive about using shared memory instead of PCI 
 config space which would reduce the overall number of exits.

 We could easily support a very large number of devices this way.  But 
 again, what do we want to target for now? 

 I think that for networking we should keep things as is.  I don't see 
 anybody using 100 virtual NICs.

 For mass storage, we should follow the SCSI model with a single device 
 serving multiple disks, similar to what you suggest.  Not sure if the 
 device should have a single queue or one queue per disk.

My latest thought is to do a virtio-based virtio controller.

We could avoid creating one in QEMU unless we detect an abnormally large 
number of disks or something.

Regards,

Anthony Liguori




Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 04:56:10PM +0200, Eric Dumazet wrote:
   
 Andrea Arcangeli wrote:
 
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +	cond_resched();
 +	if ((unsigned long)*(spinlock_t **)a <
 +	    (unsigned long)*(spinlock_t **)b)
 +		return -1;
 +	else if (a == b)
 +		return 0;
 +	else
 +		return 1;
 +}
 +
   
 This compare function looks unusual...
 It should work, but sort() could be faster if the
 if (a == b) test had a chance to be true eventually...
 

 Hmm, are you saying my mm_lock_cmp won't return 0 if a==b?
   
I am saying your intent was probably to test

else if ((unsigned long)*(spinlock_t **)a ==
(unsigned long)*(spinlock_t **)b)
return 0;


Because a and b are pointers to the data you want to compare. You need 
to dereference them.


 static int mm_lock_cmp(const void *a, const void *b)
 {
 	unsigned long la = (unsigned long)*(spinlock_t **)a;
 	unsigned long lb = (unsigned long)*(spinlock_t **)b;

 	cond_resched();
 	if (la < lb)
 		return -1;
 	if (la > lb)
 		return 1;
 	return 0;
 }
 

 If your intent is to use the assumption that there are going to be few
 equal entries, you should have used likely(la > lb) to signal it's
 rarely going to return zero, or gcc is likely free to do whatever it
 wants with the above. Overall that function is such a slow path that
 this is going to be lost in the noise. My suggestion would be to defer
 microoptimizations like this until after 1/12 has been applied to mainline.

 Thanks!

   
Hum, it's not a micro-optimization, but a bug fix. :)

Sorry if it was not clear
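To make the difference concrete, here is a minimal userspace sketch (an illustration, not the kernel code). It sorts plain `int *` elements with `qsort()` instead of `spinlock_t *` with the kernel's `sort()`. The names `cmp_buggy`, `cmp_fixed` and `sort_ptrs` are made up for this example. Both comparators receive pointers to array slots, so `a == b` can only be true when a slot is compared with itself.

```c
#include <stdint.h>
#include <stdlib.h>

/* Buggy variant: tests the slot addresses for equality, so two
 * different slots holding the same pointer are reported as "greater". */
static int cmp_buggy(const void *a, const void *b)
{
	uintptr_t la = (uintptr_t)*(int *const *)a;
	uintptr_t lb = (uintptr_t)*(int *const *)b;

	if (la < lb)
		return -1;
	else if (a == b)
		return 0;
	else
		return 1;
}

/* Fixed variant: dereference both slots and compare the stored pointers. */
static int cmp_fixed(const void *a, const void *b)
{
	uintptr_t la = (uintptr_t)*(int *const *)a;
	uintptr_t lb = (uintptr_t)*(int *const *)b;

	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}

/* Sort an array of pointers by address, as the patch does for lock pointers. */
static void sort_ptrs(int **v, size_t n)
{
	qsort(v, n, sizeof(*v), cmp_fixed);
}
```

For two distinct slots that hold the same pointer, `cmp_buggy` returns 1 in both argument orders, which is not a consistent ordering; `cmp_fixed` returns 0.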







Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 05:37:38PM +0200, Eric Dumazet wrote:
 I am saying your intent was probably to test

 else if ((unsigned long)*(spinlock_t **)a ==
   (unsigned long)*(spinlock_t **)b)
   return 0;

Indeed...

 Hum, it's not a micro-optimization, but a bug fix. :)

The good thing is that even if this bug would lead to a system crash,
it would be still zero risk for everybody that isn't using KVM/GRU
actively with mmu notifiers. The important thing is that this patch
has zero risk to introduce regressions into the kernel, both when
enabled and disabled, it's like a new driver. I'll shortly resend 1/12
and likely 12/12 for theoretical correctness. For now you can go ahead
testing with this patch as it'll work fine despite the bug (if it
wasn't the case I would have noticed already ;).



[kvm-devel] Odd hang in the Ubuntu installer

2008-04-22 Thread Soren Hansen
Hi guys.

I'm trying to figure out what's going on with this bug:

   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/217815

The short version of the problem is that it seems that if the console is
left alone for an extended period of time, everything seems to stall
until something (moving the mouse around, pressing a key, whatever)
awakens it again.  It usually shows itself when you choose the
"Encrypted LVM" option in our installer (this process wipes the drive,
which is a rather lengthy process), since that's probably the only place
where you'd leave the console alone for a while, while still getting
some UI feedback (and a sudden lack of feedback, obviously).

It started when I backported this to the kvm version in our archive:

commit d2668b3fd41f88c18a7f9c4f1d024f0e5d9f64cf
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date:   Wed Apr 2 20:20:14 2008 -0300
Subject: kvm: qemu: separate thread for IO handling


While trying to solve this problem, I noticed that that commit was just
one of a set of three patches. Applying the other two:

commit 1743ef816b6cd22d100ccb80e542b8ca19c75392
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date:   Wed Apr 2 20:20:15 2008 -0300
Subject: kvm: qemu: add function to handle signals

commit d84f71afaafec49e0ab3aa7a33518df04c14f38a
Author: Marcelo Tosatti [EMAIL PROTECTED]
Date:   Wed Apr 2 20:20:16 2008 -0300
Subject: kvm: qemu: notify IO thread of pending bhs

...makes it take a bit longer before it happens, but it's still very
much reproducible. Reverting those changes fixes it completely.

We've tried with kvm 66, which also exhibits this behaviour, so I'm
fairly confident I didn't mess up the patch while backporting it. In
case you're interested, the backported patch is here:

   http://people.ubuntu.com/~soren/virtio_hang.patch

The latter two commits applied without changes (with a bit of fuzz,
though).

I'm hoping one of you guys could give me a hint (or perhaps even a
patch)?

-- 
Soren Hansen   | 
Virtualisation specialist  | Ubuntu Server Team
Canonical Ltd. | http://www.ubuntu.com/




Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Eric Dumazet
Andrea Arcangeli wrote:
 +
 +static int mm_lock_cmp(const void *a, const void *b)
 +{
 +	cond_resched();
 +	if ((unsigned long)*(spinlock_t **)a <
 +	    (unsigned long)*(spinlock_t **)b)
 +		return -1;
 +	else if (a == b)
 +		return 0;
 +	else
 +		return 1;
 +}
 +
This compare function looks unusual...
It should work, but sort() could be faster if the
if (a == b) test had a chance to be true eventually...

static int mm_lock_cmp(const void *a, const void *b)
{
	unsigned long la = (unsigned long)*(spinlock_t **)a;
	unsigned long lb = (unsigned long)*(spinlock_t **)b;

	cond_resched();
	if (la < lb)
		return -1;
	if (la > lb)
		return 1;
	return 0;
}








Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Marcelo Tosatti
On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
 Avi Kivity wrote:
 Anthony Liguori wrote:
 
 I think we need to decide what we want to target in terms of upper 
 limits.
 
 With a bridge or two, we can probably easily do 128.
 
 If we really want to push things, I think we should do a PCI based 
 virtio controller.  I doubt a large number of PCI devices is ever 
 going to perform very well b/c of interrupt sharing and some of the 
 assumptions in virtio_pci.

 If we implement a controller, we can use a single interrupt, but 
 multiplex multiple notifications on that single interrupt.  We can 
 also be more aggressive about using shared memory instead of PCI 
 config space which would reduce the overall number of exits.

We should increase the number of interrupt lines, perhaps to 16.

Using shared memory to avoid exits sounds like a very good idea.

 We could easily support a very large number of devices this way.  But 
 again, what do we want to target for now? 
 
 I think that for networking we should keep things as is.  I don't see 
 anybody using 100 virtual NICs.

The target was along the lines of 20 nics + 80 disks. Dan?

 For mass storage, we should follow the SCSI model with a single device 
 serving multiple disks, similar to what you suggest.  Not sure if the 
 device should have a single queue or one queue per disk.
 
 My latest thought is to do a virtio-based virtio controller.

Why do you dislike multiple disks per virtio-blk controller? As
mentioned this seems a natural way forward.



Re: [kvm-devel] [PATCH 2/2] KVM: Handle interrupts for PCI passthrough devices

2008-04-22 Thread Amit Shah
* On Sunday 13 Apr 2008 14:06:27 Avi Kivity wrote:
 Amit Shah wrote:
  Passthrough devices are host machine PCI devices which have
  been handed off to the guest. Handle interrupts from these
  devices and route them to the appropriate guest irq lines.
  The userspace provides us with the necessary information
  via the ioctls.
 
  The guest IRQ numbers can change dynamically, so we have an
  additional ioctl that keeps track of those changes in userspace
  and notifies us whenever that happens.
 
  It is expected the kernel driver for the passthrough device
  is removed before passing it on to the guest.
 
 
  +/*
  + * Used to find a registered host PCI device (a passthrough device)
  + * during interrupts or EOI
  + */
  +static struct kvm_pci_pt_dev_list *
  +find_pci_pt_dev(struct list_head *head,
  +   struct kvm_pci_pt_info *pv_pci_info, int irq, int source)
  +{
  +   struct list_head *ptr;
  +   struct kvm_pci_pt_dev_list *match;
  +
  +   list_for_each(ptr, head) {
  +   match = list_entry(ptr, struct kvm_pci_pt_dev_list, list);
  +
  +   switch (source) {
  +   case KVM_PT_SOURCE_IRQ:
  +   /*
  +* Used to find a registered host device
  +* during interrupt context on host
  +*/
  +   if (match->pt_dev.host.irq == irq)
  +   return match;
  +   break;
  +   case KVM_PT_SOURCE_IRQ_ACK:
  +   /*
  +* Used to find a registered host device when
  +* the guest acks an interrupt
  +*/
  +   if (match->pt_dev.guest.irq == irq)
  +   return match;
  +   break;
  +   }
  +   }
  +   return NULL;
  +}

 This would be better as two separate functions.  Also, locking?

For pvdma, there will be two more cases. Very similar functions for 
essentially looking up an entry in the same list.

Locking will be supported soon.
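For reference, the two-function split suggested above could look roughly like this. This is a userspace sketch with a plain singly linked list; `struct pt_dev`, `find_by_host_irq` and `find_by_guest_irq` are hypothetical names, and locking is left out just as in the patch.

```c
#include <stddef.h>

/* Stripped-down stand-in for the patch's device list entry. */
struct pt_dev {
	int host_irq;
	int guest_irq;
	struct pt_dev *next;
};

/* Lookup used when a host interrupt fires. */
static struct pt_dev *find_by_host_irq(struct pt_dev *head, int irq)
{
	for (; head; head = head->next)
		if (head->host_irq == irq)
			return head;
	return NULL;
}

/* Lookup used when the guest acks an interrupt. */
static struct pt_dev *find_by_guest_irq(struct pt_dev *head, int irq)
{
	for (; head; head = head->next)
		if (head->guest_irq == irq)
			return head;
	return NULL;
}
```

Each caller then names the lookup it wants, and the `source` switch inside the loop goes away. The cost is one near-identical function per lookup key, which is the duplication the reply above is concerned about.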

  +static irqreturn_t
  +kvm_pci_pt_dev_intr(int irq, void *dev_id)

 Please don't split declarations unnecessarily.

Fixed.

  +{
  +   struct kvm_pci_pt_dev_list *match;
  +   struct kvm *kvm = (struct kvm *) dev_id;
  +
  +   if (!test_bit(irq, pt_irq_handled))
  +   return IRQ_NONE;
  +
  +   if (test_bit(irq, pt_irq_pending))
  +   return IRQ_HANDLED;

 Will the interrupt not fire immediately after this returns?

Hmm. This is just an optimisation so that we don't have to look up the list 
each time to find out which assigned device it is and (re)inject the 
interrupt. It also avoids taking and releasing the locks (TODO) that the 
list lookup will need.

Disabling interrupts for PCI devices isn't a good idea even if we don't 
support shared interrupts. Any other ideas to prevent this from happening?

  +   match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
  +   irq, KVM_PT_SOURCE_IRQ);
  +   if (!match)
  +   return IRQ_NONE;
  +
  +   /* Not possible to detect if the guest uses the PIC or the
  +* IOAPIC.  So set the bit in both. The guest will ignore
  +* writes to the unused one.
  +*/
  +   kvm_ioapic_set_irq(kvm->arch.vioapic, match->pt_dev.guest.irq, 1);
  +   kvm_pic_set_irq(pic_irqchip(kvm), match->pt_dev.guest.irq, 1);

 A function that calls both the apic and the pic is better, as it will be
 easier to port.

Done.

  +   set_bit(irq, pt_irq_pending);
  +   return IRQ_HANDLED;
  +}
  +
  +/* Ack the irq line for a passthrough device */
  +void
  +kvm_pci_pt_ack_irq(struct kvm *kvm, int vector)
  +{
  +   int irq;
  +   struct kvm_pci_pt_dev_list *match;
  +
  +   irq = get_eoi_gsi(kvm->arch.vioapic, vector);
  +   match = find_pci_pt_dev(&kvm->arch.pci_pt_dev_head, NULL,
  +   irq, KVM_PT_SOURCE_IRQ_ACK);
  +   if (!match)
  +   return;
  +   if (test_bit(match->pt_dev.host.irq, pt_irq_pending)) {
  +   kvm_ioapic_set_irq(kvm->arch.vioapic, irq, 0);
  +   kvm_pic_set_irq(pic_irqchip(kvm), irq, 0);

 This is dangerous with smp guests, if we aren't careful with the
 ordering the interrupt may fire again and be forwarded to the other
 vcpu.  We need to call this before we redeliver interrupts.

The 'pending' bitmap ensures we don't inject an interrupt that hasn't been 
ack'ed. Once the locking is in place, this shouldn't be a worry.

  +   clear_bit(match->pt_dev.host.irq, pt_irq_pending);
  +   }
  +}

...

  @@ -1671,6 +1836,30 @@ long kvm_arch_vm_ioctl(struct file *filp,
  r = 0;
  break;
  }
  +   case KVM_ASSIGN_PCI_PT_DEV: {
  +   struct kvm_pci_passthrough_dev pci_pt_dev;
  +
  +   r = -EFAULT;
  +   if (copy_from_user(&pci_pt_dev, argp, sizeof pci_pt_dev))
  +   goto out;
  +
  +   r = kvm_vm_ioctl_pci_pt_dev(kvm, &pci_pt_dev);
  +   if (r)
  +   goto 

Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-22 Thread Glauber Costa
Gerd Hoffmann wrote:
 Jeremy Fitzhardinge wrote:
 Xen could change the parameters in the instant after get_time_values(). 
 That change could be as a result of suspend-resume, so the parameters
 and the tsc could be wildly different.
 
 Ah, ok, forgot the rdtsc in the picture.  With that in mind I fully
 agree that the loop is needed.  I think kvm guests can even hit that one
 with the vcpu migrating to a different physical cpu, so we better handle
 it correctly ;)

It's probably not needed for kvm, since we update everything every time 
we get scheduled in the host side, which would cover the case for 
migration between physical cpus. But it's probably okay to do it to get 
a common denominator with xen, if needed.
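The loop being discussed is a version-checked read. A minimal sketch of the pattern, assuming a writer that makes a version counter odd while it updates the parameters and even again afterwards, could look like this. `struct pv_time`, `struct pv_snapshot` and `pv_read` are simplified, hypothetical names, not the Xen or KVM definitions.

```c
#include <stdint.h>

/* Simplified, hypothetical layout: the writer makes 'version' odd
 * before updating the fields and even again afterwards. */
struct pv_time {
	volatile uint32_t version;
	volatile uint64_t tsc_timestamp;
	volatile uint64_t system_time;
};

struct pv_snapshot {
	uint64_t tsc_timestamp;
	uint64_t system_time;
};

/* Retry until a complete, unchanged copy has been read.  Real code also
 * needs read barriers between the version reads and the data reads, and
 * would sample the TSC inside the same loop. */
static void pv_read(const struct pv_time *src, struct pv_snapshot *dst)
{
	uint32_t v;

	do {
		v = src->version;
		dst->tsc_timestamp = src->tsc_timestamp;
		dst->system_time = src->system_time;
	} while ((v & 1) || v != src->version);
}
```

If the parameters change between the copy and the TSC read, the version check fails and the reader tries again, so a stale parameter set is never combined with a new TSC value.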





Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Robin Holt

I believe the differences between your patch set and Christoph's need
to be understood and a compromise approach agreed upon.

Those differences, as I understand them, are:

1) invalidate_page:  You retain an invalidate_page() callout.  I believe
we have progressed that discussion to the point that it requires some
direction from Andrew, Linus, or somebody in authority.  The basics
of the difference distill down to no expected significant performance
difference between the two.  The invalidate_page() callout potentially
can simplify GRU code.  It does provide a more complex API for the
users of mmu_notifier which, IIRC, Christoph had interpreted from one
of Andrew's earlier comments as being undesirable.  I vaguely recall
that sentiment as having been expressed.

2) Range callout names: Your range callouts are invalidate_range_start
and invalidate_range_end whereas Christoph's are start and end.  I do not
believe this has been discussed in great detail.  I know I have expressed
a preference for your names.  I admit to having failed to follow up on
this issue.  I certainly believe we could come to an agreement quickly
if pressed.

3) The structure of the patch set:  Christoph's upcoming release orders
the patches so the prerequisite patches are separately reviewable
and each file is only touched by a single patch.  Additionally, that
allows mmu_notifiers to be introduced as a single patch with sleeping
functionality from its inception and an API which remains unchanged.
Your patch set, however, introduces one API, then turns around and
changes that API.  Again, the desire to make it an unchanging API was
expressed by, IIRC, Andrew.  This does represent a risk to XPMEM as
the non-sleeping API may become entrenched and make acceptance of the
sleeping version less acceptable.

Can we agree upon this list of issues?

Thank you,
Robin Holt



[kvm-devel] Bug 1895893 inquiry (KVM-60+ halts, when using SCSI)

2008-04-22 Thread Alberto Treviño
Thanks for all those who work on KVM.  It is a wonderful product and I 
have been very impressed with its features, performance, and the level 
of activity in this project.

Back in February a bug was filed.  I've been hit by this bug as well, 
but there hasn't been much activity with it in the last little bit.  I 
wanted to know if anyone had a fix for it, or a workaround (other than 
using IDE), or whether it was on someone's radar.  Here is a link to 
the bug:

http://sourceforge.net/tracker/index.php?func=detail&aid=1895893&group_id=180599&atid=893831

Thanks in advance.

-- 
Alberto Treviño
[EMAIL PROTECTED]



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Andrea Arcangeli
On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
 1) invalidate_page:  You retain an invalidate_page() callout.  I believe
 we have progressed that discussion to the point that it requires some
 direction from Andrew, Linus, or somebody in authority.  The basics
 of the difference distill down to no expected significant performance
 difference between the two.  The invalidate_page() callout potentially
 can simplify GRU code.  It does provide a more complex API for the
 users of mmu_notifier which, IIRC, Christoph had interpreted from one
 of Andrew's earlier comments as being undesirable.  I vaguely recall
 that sentiment as having been expressed.

invalidate_page as demonstrated in KVM pseudocode doesn't change the
locking requirements, and it has the benefit of reducing the window of
time the secondary page fault has to be masked and at the same time
_halves_ the number of _hooks_ in the VM every time the VM deals with
single pages (example: do_wp_page hot path). As long as we can't fully
converge because of point 3, I'd rather keep invalidate_page because it
is better. But keeping it is by far not a priority.

 2) Range callout names: Your range callouts are invalidate_range_start
 and invalidate_range_end whereas Christoph's are start and end.  I do not
 believe this has been discussed in great detail.  I know I have expressed
 a preference for your names.  I admit to having failed to follow up on
 this issue.  I certainly believe we could come to an agreement quickly
 if pressed.

I think using ->start ->end is a mistake; think of when we later add
mprotect_range_start/end. Here too I keep the better names only
because we can't converge on point 3 (the API will eventually change,
like every other kernel internal API; even core things like __free_page
have been mostly obsoleted).

 3) The structure of the patch set:  Christoph's upcoming release orders
 the patches so the prerequisite patches are separately reviewable
 and each file is only touched by a single patch.  Additionally, that

Each file touched by a single patch? I doubt it... The split is about the
same; the main difference is the merge ordering. I always had the zero
risk part at the head, he moved it to the tail when he incorporated
#v12 into his patchset.

 allows mmu_notifiers to be introduced as a single patch with sleeping
 functionality from its inception and an API which remains unchanged.
 Your patch set, however, introduces one API, then turns around and
 changes that API.  Again, the desire to make it an unchanging API was
 expressed by, IIRC, Andrew.  This does represent a risk to XPMEM as
 the non-sleeping API may become entrenched and make acceptance of the
 sleeping version less acceptable.
 
 Can we agree upon this list of issues?

This is a kernel internal API, so it will definitely change over
time. It's nothing close to a syscall.

Also note: the API is obviously defined in mmu_notifier.h and none of
the 2-12 patches touches mmu_notifier.h. So the extension of the
method semantics is 100% backwards compatible.

My patch order and API backward compatible extension over the patchset
is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
XPMEM as well. KVM/GRU won't notice any difference once the support
for XPMEM is added, but even if the API would completely change in
2.6.27, that's still better than no functionality at all in 2.6.26.



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori

 For mass storage, we should follow the SCSI model with a single device 
 serving multiple disks, similar to what you suggest.  Not sure if the 
 device should have a single queue or one queue per disk.
   
 My latest thought is to do a virtio-based virtio controller.
 

 Why do you dislike multiple disks per virtio-blk controller? As
 mentioned this seems a natural way forward.
   

Logically speaking, virtio is a bus.  virtio supports all of the 
features of a bus (discover, hot add, hot remove).

Right now, we map virtio devices directly onto the PCI bus.

The problem we're trying to address is limitations of the PCI bus.  We 
have a couple options:

1) add a virtio device that supports multiple disks.  we need to 
reinvent hotplug within this device.

2) add a new PCI virtio transport that supports multiple virtio-blk 
devices within a single PCI slot

3) add a generic PCI virtio transport that supports multiple virtio 
devices within a single PCI slot

4) add a generic virtio bridge that supports multiple virtio devices 
within a single virtio device.

#4 may seem strange, but it's no different from a PCI-to-PCI bridge.

I like #4 the most, but #2 is probably the most practical.


Regards,

Anthony Liguori



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Ian Kirk
Avi Kivity wrote:

 For mass storage, we should follow the SCSI model with a single device
 serving multiple disks, similar to what you suggest.  Not sure if the
 device should have a single queue or one queue per disk.

Don't you just end up re-implementing SCSI then, at which point you might
as well stick with a 'fake' SCSI device in the guest?



Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Anthony Liguori
Marcelo Tosatti wrote:
 On Tue, Apr 22, 2008 at 05:32:45PM +0300, Avi Kivity wrote:
   
 Anthony Liguori wrote:
 
 This patch changes virtio devices to be multi-function devices whenever
 possible.  This increases the number of virtio devices we can support now by
 a factor of 8.

 With this patch, I've been able to launch a guest with either 220 disks or 
 220
 network adapters.

   
   
 Does this play well with hotplug?  Perhaps we need to allocate a new 
 device on hotplug.

 (certainly if we have a device with one function, which then gets 
 converted to a multifunction device)
 

 Would have to change the hotplug code to handle functions...
   

BTW, I've never been that convinced that hotplugging devices is as 
useful as people make it out to be.  I also think that's particularly 
true when it comes to hot adding/removing very large numbers of disks.

I think if you created all virtio devices as multifunction devices, but 
didn't add additional functions until you ran out of PCI slots, it would 
be a pretty acceptable solution.  Hotplug works just as it does today 
until you get much higher than 32 devices.  Even then, hotplug still 
works with most of your devices (until you hit the absolute maximum 
number of devices of course).
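
The allocation policy described above can be sketched in a few lines of C. This is only an illustration under stated assumptions: `toy_bus`, `toy_alloc_devfn` and the `DEVFN` macro are hypothetical names, not QEMU code, and the sketch ignores slots that a real machine reserves for the host bridge and other built-in devices. It only relies on the standard PCI numbering in which a devfn is the slot number shifted left by three, ORed with the function number.

```c
#include <assert.h>
#include <string.h>

#define TOY_SLOTS     32   /* device slots on one PCI bus */
#define TOY_FUNCTIONS 8    /* functions per slot */
#define DEVFN(slot, fn) (((slot) << 3) | (fn))

/* Hypothetical bus model: used[devfn] is nonzero when that function is taken. */
struct toy_bus {
    unsigned char used[TOY_SLOTS * TOY_FUNCTIONS];
};

/* Prefer a completely free slot (function 0).  Only when every slot is
 * occupied, fall back to a free function 1..7 on an existing slot.
 * Returns the allocated devfn, or -1 when the bus is full. */
static int toy_alloc_devfn(struct toy_bus *bus)
{
    int slot, fn;

    assert(bus != NULL);

    for (slot = 0; slot < TOY_SLOTS; slot++) {
        if (!bus->used[DEVFN(slot, 0)]) {
            bus->used[DEVFN(slot, 0)] = 1;
            return DEVFN(slot, 0);
        }
    }

    for (slot = 0; slot < TOY_SLOTS; slot++) {
        for (fn = 1; fn < TOY_FUNCTIONS; fn++) {
            if (!bus->used[DEVFN(slot, fn)]) {
                bus->used[DEVFN(slot, fn)] = 1;
                return DEVFN(slot, fn);
            }
        }
    }

    return -1;
}
```

On an empty toy bus the first 32 calls each return function 0 of a fresh slot, so slot-level hotplug behaves as before; the 33rd call is the first one that lands on a secondary function (slot 0, function 1).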

Regards,

Anthony Liguori

 It sounds less hacky to just extend the PCI slots instead of (ab)using
 multiple functions per-slot.
   




Re: [kvm-devel] [PATCH] Make virtio devices multi-function

2008-04-22 Thread Daniel P. Berrange
On Tue, Apr 22, 2008 at 02:26:45PM -0300, Marcelo Tosatti wrote:
 On Tue, Apr 22, 2008 at 11:31:11AM -0500, Anthony Liguori wrote:
  Avi Kivity wrote:
  Anthony Liguori wrote:
  
  I think we need to decide what we want to target in terms of upper 
  limits.
  
  With a bridge or two, we can probably easily do 128.
  
  If we really want to push things, I think we should do a PCI based 
  virtio controller.  I doubt a large number of PCI devices is ever 
  going to perform very well b/c of interrupt sharing and some of the 
  assumptions in virtio_pci.
 
  If we implement a controller, we can use a single interrupt, but 
  multiplex multiple notifications on that single interrupt.  We can 
  also be more aggressive about using shared memory instead of PCI 
  config space which would reduce the overall number of exits.
 
 We should increase the number of interrupt lines, perhaps to 16.
 
 Using shared memory to avoid exits sounds like a very good idea.
 
  We could easily support a very large number of devices this way.  But 
  again, what do we want to target for now? 
  
  I think that for networking we should keep things as is.  I don't see 
  anybody using 100 virtual NICs.
 
 The target was along the lines of 20 nics + 80 disks. Dan?

I've already had people ask for the ability to use as many as 64 disks and
32 NICs with Xen, so to my mind, the more we support the better. Hundreds,
if possible.

Dan.
-- 
|: Red Hat, Engineering, Boston   -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org  -o-  http://virt-manager.org  -o-  http://ovirt.org :|
|: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505  -o-  F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|



Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 08:43:35PM +0200, Andrea Arcangeli wrote:
 On Tue, Apr 22, 2008 at 01:22:13PM -0500, Robin Holt wrote:
  1) invalidate_page:  You retain an invalidate_page() callout.  I believe
  we have progressed that discussion to the point that it requires some
  direction for Andrew, Linus, or somebody in authority.  The basics
  of the difference distill down to no expected significant performance
  difference between the two.  The invalidate_page() callout potentially
  can simplify GRU code.  It does provide a more complex api for the
  users of mmu_notifier which, IIRC, Christoph had interpreted from one
  of Andrew's earlier comments as being undesirable.  I vaguely recall
  that sentiment as having been expressed.
 
 invalidate_page as demonstrated in KVM pseudocode doesn't change the
 locking requirements, and it has the benefit of reducing the window of
 time the secondary page fault has to be masked and at the same time
 _halves_ the number of _hooks_ in the VM every time the VM deals with
 single pages (example: do_wp_page hot path). As long as we can't fully
 converge because of point 3, I'd rather keep invalidate_page, since it
 is better. But keeping it is by far not a priority.

Christoph, Jack and I just discussed invalidate_page().  I don't think
the point Andrew was making is that compelling in this circumstance.
The code has changed fairly remarkably.  Would you have any objection to
putting it back into your patch/agreeing to it remaining in Andrea's
patch?  If not, I think we can put this issue aside until Andrew gets
out of the merge window and can decide it.  Either way, the patches
become much more similar with this in.

Thanks,
Robin



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Christoph Lameter
Thanks for adding most of my enhancements. But

1. There is no real need for invalidate_page(). Can be done with 
invalidate_start/end. Needlessly complicates the API. One
of the objections by Andrew was that there were multiple
callbacks that perform similar functions.

2. The locks that are used are later changed to semaphores. This is
   f.e. true for mm_lock / mm_unlock. The diffs will be smaller if the
   lock conversion is done first and then mm_lock is introduced. The
   way the patches are structured means that reviewers cannot review the
   final version of mm_lock etc etc. The lock conversion needs to come 
   first.

3. As noted by Eric and also contained in a private post from me 
   yesterday: the cmp function needs to retrieve the value before
   doing comparisons, which is not done for the == of a and b.



[kvm-devel] Problems with MAC address with e1000 on Windows 2003

2008-04-22 Thread Alberto Treviño
I was wondering if anyone could reproduce my problem.  If it is 
reproducible, then I'll file a bug.

I am using e1000 ethernet adapters on Windows 2003 and Linux guests.  
The line to set it up is something like this:

  -net nic,vlan=1,macaddr=00:ff:21:cf:91:01,model=e1000 \
-net tap,vlan=1,ifname=tap.br1.91.1

On Linux, this works just fine.  However, on Windows 2003, the mac 
address for the device is reported as 00:ff:ff:ff:ff:ff and the packets 
carry this mac address as well.  The corresponding tap device has the 
correct IP address, however.  This problem is definitely tied to using 
Windows 2003 with a e1000 device.  If I use the rtl8139 device, Windows 
reports the correct mac address.  When booting the same VM with a Linux 
bootable CD and the e1000 device, Linux reports the correct mac address 
as set in the qemu command.  It's the combination of Windows 2003 and 
the e1000 device that causes the problem.

Has anyone else seen this problem?  Thanks in advance.

-- 
Alberto Treviño
[EMAIL PROTECTED]



Re: [kvm-devel] [PATCH 02 of 12] Fix ia64 compilation failure because of common code include bug

2008-04-22 Thread Christoph Lameter
Looks like this is not complete. There are numerous .h files missing which 
means that various structs are undefined (fs.h and rmap.h are needed 
f.e.) which leads to surprises when dereferencing fields of these structs.

It seems that mm_types.h is expected to be included only in certain 
contexts. Could you make sure to include all necessary .h files? Or add
some docs to clarify the situation here.



Re: [kvm-devel] performance with guests running 2.4 kernels (specifically RHEL3)

2008-04-22 Thread David S. Ahern
I added tracers to kvm_mmu_page_fault() that include collecting tsc cycles:

1. before vcpu->arch.mmu.page_fault()
2. after vcpu->arch.mmu.page_fault()
3. after mmu_topup_memory_caches()
4. after emulate_instruction()

So the deltas in the trace reports show:
- cycles required for arch.mmu.page_fault() (tracer 2)
- cycles required for mmu_topup_memory_caches() (tracer 3)
- cycles required for emulate_instruction() (tracer 4)

I captured trace data for ~5-seconds during one of the usual events (again this
time it was due to kscand in the guest). I ran the formatted trace data through
an awk script to summarize:

TSC cycles           tracer2   tracer3   tracer4
      0 -  10,000:    295067    213251    115873
 10,001 -  25,000:      7682      1004     98336
 25,001 -  50,000:       201        15        36
 50,001 - 100,000:    100655         0        10
        > 100,000:       117         0        15

This means vcpu->arch.mmu.page_fault() was called 403,722 times in the roughly
5-second interval: 295,067 times it took < 10,000 cycles, but 100,772 times it
took longer than 50,000 cycles. The page_fault function getting run is
paging64_page_fault.

mmu_topup_memory_caches() and emulate_instruction() were both run 214,270 times,
most of them relatively quickly.
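
For reference, the bucketing behind a summary like the one above is simple. The original summary came from an awk script that is not shown in this thread; the following is a C sketch of the same idea, assuming the per-call cycle deltas have already been extracted from the trace. The names `bucket_index` and `summarize` are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 5

/* Inclusive upper bounds of the first four buckets, in TSC cycles.
 * Anything above the last bound is counted in the final bucket. */
static const uint64_t bucket_limit[NBUCKETS - 1] = {
    10000, 25000, 50000, 100000
};

/* Map one cycle delta to its bucket number (0 .. NBUCKETS-1). */
static int bucket_index(uint64_t cycles)
{
    int i;

    for (i = 0; i < NBUCKETS - 1; i++) {
        if (cycles <= bucket_limit[i])
            return i;
    }
    return NBUCKETS - 1;
}

/* Count how many of the n deltas fall into each bucket. */
static void summarize(const uint64_t *delta, size_t n,
                      unsigned long hist[NBUCKETS])
{
    size_t k;
    int i;

    assert(delta != NULL || n == 0);

    for (i = 0; i < NBUCKETS; i++)
        hist[i] = 0;
    for (k = 0; k < n; k++)
        hist[bucket_index(delta[k])]++;
}
```

Fed the five VMENTRY deltas 3632, 54928, 8432, 13832 and 65216 from the earlier trace excerpt, this puts two samples in the first bucket, one in the second and two in the 50,001 - 100,000 bucket.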

Note: I bumped the scheduling priority of the qemu threads to RR 1 so that few
host processes could interrupt it.

david


Avi Kivity wrote:
 David S. Ahern wrote:
 I added the traces and captured data over another apparent lockup of
 the guest.
 This seems to be representative of the sequence (pid/vcpu removed).

 (+4776)  VMEXIT [ exitcode = 0x, rip = 0x
 c016127c ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db4 ]
 (+3632)  VMENTRY
 (+4552)  VMEXIT [ exitcode = 0x, rip = 0x
 c016104a ]
 (+   0)  PAGE_FAULT [ errorcode = 0x000b, virt = 0x
 fffb61c8 ]
 (+   54928)  VMENTRY
   
 
 Can you oprofile the host to see where the 54K cycles are spent?
 
 (+4568)  VMEXIT [ exitcode = 0x, rip = 0x
 c01610e7 ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db4 ]
 (+   0)  PTE_WRITE  [ gpa = 0x 9db4 gpte = 0x
 41c5d363 ]
 (+8432)  VMENTRY
 (+3936)  VMEXIT [ exitcode = 0x, rip = 0x
 c01610ee ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db0 ]
 (+   0)  PTE_WRITE  [ gpa = 0x 9db0 gpte = 0x
  ]
 (+   13832)  VMENTRY


 (+5768)  VMEXIT [ exitcode = 0x, rip = 0x
 c016127c ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db4 ]
 (+3712)  VMENTRY
 (+4576)  VMEXIT [ exitcode = 0x, rip = 0x
 c016104a ]
 (+   0)  PAGE_FAULT [ errorcode = 0x000b, virt = 0x
 fffb61d0 ]
 (+   0)  PTE_WRITE  [ gpa = 0x 3d5981d0 gpte = 0x
 3d55d047 ]
   
 
 This indeed has the accessed bit clear.
 
 (+   65216)  VMENTRY
 (+4232)  VMEXIT [ exitcode = 0x, rip = 0x
 c01610e7 ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db4 ]
 (+   0)  PTE_WRITE  [ gpa = 0x 9db4 gpte = 0x
 3d598363 ]
   
 
 This has the accessed bit set and the user bit clear, and the pte
 pointing at the previous pte_write gpa.  Looks like a kmap_atomic().
 
 (+8640)  VMENTRY
 (+3936)  VMEXIT [ exitcode = 0x, rip = 0x
 c01610ee ]
 (+   0)  PAGE_FAULT [ errorcode = 0x0003, virt = 0x
 c0009db0 ]
 (+   0)  PTE_WRITE  [ gpa = 0x 9db0 gpte = 0x
  ]
 (+   14160)  VMENTRY

 I can forward a more complete time snippet if you'd like. vcpu0 +
 corresponding
 vcpu1 files have 85000 total lines and compressed the files total ~500k.

 I did not see the FLOODED trace come out during this sample though I
 did bump
 the count from 3 to 4 as you suggested.


   
 
 Bumping the count was supposed to remove the flooding...
 
 Correlating rip addresses to the 2.4 kernel:

 c0160d00-c0161290 = page_referenced

 It looks like the event is kscand running through the pages. I
 suspected this
 some time ago, and tried tweaking the kscand_work_percent sysctl
 variable. It
 appeared to lower the peak of the spikes, but maybe I imagined it. I
 believe
 lowering that value makes kscand wake up more often but do less work
 (page
 scanning) each time it is awakened.

   
 
 What does 'top' in the guest show (perhaps sorted by total cpu time
 rather than instantaneous usage)?
 
 What host kernel are you running?  How many host cpus?
 


Re: [kvm-devel] [PATCH 03 of 12] get_task_mm should not succeed if mmput() is running and has reduced

2008-04-22 Thread Christoph Lameter
Missing signoff by you.





Re: [kvm-devel] [PATCH 04 of 12] Moves all mmu notifier methods outside the PT lock (first and not last

2008-04-22 Thread Christoph Lameter
Reverts a part of an earlier patch. Why isn't this merged into 1 of 12?





Re: [kvm-devel] [PATCH 05 of 12] Move the tlb flushing into free_pgtables. The conversion of the locks

2008-04-22 Thread Christoph Lameter
Why are the subjects all screwed up? They are the first line of the 
description instead of the subject line of my patches.





Re: [kvm-devel] [PATCH 10 of 12] Convert mm_lock to use semaphores after i_mmap_lock and anon_vma_lock

2008-04-22 Thread Christoph Lameter
Doing the right patch ordering would have avoided this patch and allowed 
better review.





Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Christoph Lameter
On Tue, 22 Apr 2008, Andrea Arcangeli wrote:

 My patch order and API backward compatible extension over the patchset
 is done to allow 2.6.26 to fully support KVM/GRU and 2.6.27 to support
 XPMEM as well. KVM/GRU won't notice any difference once the support
 for XPMEM is added, but even if the API would completely change in
 2.6.27, that's still better than no functionality at all in 2.6.26.

Please redo the patchset with the right order. To my knowledge there is no 
chance of this getting merged for 2.6.26.




Re: [kvm-devel] [PATCH 00 of 12] mmu notifier #v13

2008-04-22 Thread Christoph Lameter
On Tue, 22 Apr 2008, Robin Holt wrote:

 putting it back into your patch/agreeing to it remaining in Andrea's
 patch?  If not, I think we can put this issue aside until Andrew gets
 out of the merge window and can decide it.  Either way, the patches
 become much more similar with this in.

One solution would be to separate the invalidate_page() callout into a
patch at the very end that can be omitted. AFAICT there is no compelling 
reason to have this callback and it complicates the API for the device 
driver writers. Not having this callback makes the way that mmu notifiers 
are called from the VM uniform which is a desirable goal.
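
To make the API trade-off concrete, here is a small user-space mock of the "uniform" variant, in which a single-page invalidation is expressed as a one-page range. It is a sketch only: `toy_notifier_ops`, `toy_counts` and the helper functions are hypothetical and do not reproduce the kernel structures from the patchset. It shows the cost side of the argument (two callouts per page instead of one) and says nothing about the GRU-specific semantics raised elsewhere in the thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL

/* Hypothetical stand-in for a notifier ops table with range hooks only. */
struct toy_notifier_ops {
    void (*invalidate_range_start)(void *ctx, unsigned long start,
                                   unsigned long end);
    void (*invalidate_range_end)(void *ctx, unsigned long start,
                                 unsigned long end);
};

/* Bookkeeping for the mock driver: how often each hook ran, and on what. */
struct toy_counts {
    int starts;
    int ends;
    unsigned long last_start;
    unsigned long last_end;
};

static void toy_start(void *ctx, unsigned long start, unsigned long end)
{
    struct toy_counts *c = ctx;

    c->starts++;
    c->last_start = start;
    c->last_end = end;
}

static void toy_end(void *ctx, unsigned long start, unsigned long end)
{
    struct toy_counts *c = ctx;

    (void)start;
    (void)end;
    c->ends++;
}

/* Invalidate the page containing addr using only the range hooks. */
static void toy_invalidate_one_page(const struct toy_notifier_ops *ops,
                                    void *ctx, unsigned long addr)
{
    unsigned long start = addr & ~(TOY_PAGE_SIZE - 1);

    assert(ops && ops->invalidate_range_start && ops->invalidate_range_end);

    ops->invalidate_range_start(ctx, start, start + TOY_PAGE_SIZE);
    /* ... the VM would modify the page table entry here ... */
    ops->invalidate_range_end(ctx, start, start + TOY_PAGE_SIZE);
}
```

With the ops table filled in as `{ toy_start, toy_end }`, one call for address 0xc0009db4 runs each hook once over the range 0xc0009000 to 0xc000a000. A dedicated invalidate_page() hook would be a third function pointer in the table and a single callout per page.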





Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-22 Thread Robin Holt
On Tue, Apr 22, 2008 at 01:19:29PM -0700, Christoph Lameter wrote:
 Thanks for adding most of my enhancements. But
 
 1. There is no real need for invalidate_page(). Can be done with 
   invalidate_start/end. Needlessly complicates the API. One
   of the objections by Andrew was that there were multiple
   callbacks that perform similar functions.

While I agree with that reading of Andrew's email about invalidate_page,
I think the GRU hardware makes a strong enough case to justify the two
separate callouts.

Due to the GRU hardware, we can assure that invalidate_page terminates all
pending GRU faults (that includes faults that are just beginning) and can
therefore be completed without needing any locking.  The invalidate_page()
callout gets turned into a GRU flush instruction and we return.

Because the invalidate_range_start() leaves the page table information
available, we cannot use a single page _start to mimic that
functionality.  Therefore, there is a documented case justifying the
separate callouts.

I agree the case is fairly weak, but it does exist.  Given Andrea's
unwillingness to move and Jack's documented case, it is my opinion the
most likely compromise is to leave in the invalidate_page() callout.

Thanks,
Robin



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
 On Tuesday 22 April 2008 21:22:48 Avi Kivity wrote:
   The virtio config space was originally chosen to be little-endian,
   because we thought the config might be part of the PCI config space
   for virtio_pci.  It's actually a separate mmio region, so that
   argument holds little water; as only x86 is currently using the virtio
   mechanism, we can change this (but must do so now, before the
   impending s390 and ppc merges).
 
  This will probably annoy Hollis, who has guests that can go both ways.
 
 Yes, I discussed this with Hollis.  But the virtio rings themselves already 
 have this issue: we don't do any endian conversion on them and assume 
 they're our endian in the guest.
 
 We may still regret not doing *everything* little-endian, but this doesn't 
 make it worse.

Hmm, why *don't* we just do everything LE, including the ring?

-- 
Hollis Blanchard
IBM Linux Technology Center



[kvm-devel] [PATCH] Make virtio devices multi-function (v2)

2008-04-22 Thread Anthony Liguori
This patch changes virtio devices to be multi-function devices whenever
possible.  This increases the number of virtio devices we can support now by
a factor of 8.

With this patch, I've been able to launch a guest with either 220 disks or 220
network adapters.

Since v1, I've changed the way virtio devices are allocated to be as follows:

 1) Always use a slot as long as they are available.  We can extend this to
use a PCI when we get that working more reliably.

 2) When PCI slots are exhausted, fall back to adding the device as an
additional function on an existing slot

This way, hotplug continues to work just as well as it does now.  Once you
exceed the number of PCI slots, you need an OS that can do hotplug of
individual PCI functions if you care about doing hotplug.  I think this is a
pretty reasonable trade-off.

Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

diff --git a/qemu/hw/pci.c b/qemu/hw/pci.c
index a23a466..5d5d1a5 100644
--- a/qemu/hw/pci.c
+++ b/qemu/hw/pci.c
@@ -146,6 +146,41 @@ int pci_device_load(PCIDevice *s, QEMUFile *f)
 return 0;
 }
 
+/* Search the bus for a multifunction device with a free function that
+ * matches vendor_id_filter and device_id_filter.  -1 can be passed as
+ * a filter value to accept any id.
+ */
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+                                 int device_id_filter)
+{
+    int devfn;
+
+    for (devfn = bus->devfn_min; devfn < 256; devfn += 8) {
+        int vendor_id, device_id;
+        PCIDevice *pci_dev;
+
+        if (!bus->devices[devfn])
+            continue;
+
+        pci_dev = bus->devices[devfn];
+        vendor_id = pci_dev->config[0x01] << 8 | pci_dev->config[0x00];
+        device_id = pci_dev->config[0x03] << 8 | pci_dev->config[0x02];
+
+        if ((vendor_id_filter == -1 || vendor_id_filter == vendor_id) &&
+            (device_id_filter == -1 || device_id_filter == device_id) &&
+            ((pci_dev->config[0x0e] & 0x80) == 0x80)) {
+            int i;
+
+            for (i = 1; i < 8; i++) {
+                if (!bus->devices[devfn + i])
+                    return devfn + i;
+            }
+        }
+    }
+
+    return -1;
+}
+
 /* -1 for devfn means auto assign */
 PCIDevice *pci_register_device(PCIBus *bus, const char *name,
int instance_size, int devfn,
diff --git a/qemu/hw/pci.h b/qemu/hw/pci.h
index 60e4094..84d6a29 100644
--- a/qemu/hw/pci.h
+++ b/qemu/hw/pci.h
@@ -33,7 +33,7 @@ typedef struct PCIIORegion {
 #define PCI_ROM_SLOT 6
 #define PCI_NUM_REGIONS 7
 
-#define PCI_DEVICES_MAX 64
+#define PCI_DEVICES_MAX 256
 
 #define PCI_VENDOR_ID  0x00/* 16 bits */
 #define PCI_DEVICE_ID  0x02/* 16 bits */
@@ -105,6 +105,9 @@ void pci_info(void);
 PCIBus *pci_bridge_init(PCIBus *bus, int devfn, uint32_t id,
 pci_map_irq_fn map_irq, const char *name);
 
+int pci_bus_find_device_function(PCIBus *bus, int vendor_id_filter,
+int device_id_filter);
+
 /* lsi53c895a.c */
 #define LSI_MAX_DEVS 7
 void lsi_scsi_attach(void *opaque, BlockDriverState *bd, int id);
diff --git a/qemu/hw/virtio.c b/qemu/hw/virtio.c
index 6a50001..361455d 100644
--- a/qemu/hw/virtio.c
+++ b/qemu/hw/virtio.c
@@ -405,12 +405,22 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
     PCIDevice *pci_dev;
     uint8_t *config;
     uint32_t size;
+    int devfn = -1;
 
-    pci_dev = pci_register_device(bus, name, struct_size,
-                                  -1, NULL, NULL);
-    if (!pci_dev)
+    pci_dev = pci_register_device(bus, name, struct_size, -1, NULL, NULL);
+
+    if (pci_dev == NULL) {
+        devfn = pci_bus_find_device_function(bus, vendor, -1);
+        if (devfn != -1)
+            pci_dev = pci_register_device(bus, name, struct_size,
+                                          devfn, NULL, NULL);
+    }
+
+    if (pci_dev == NULL)
         return NULL;
 
+    devfn = pci_dev->devfn;
+
     vdev = to_virtio_device(pci_dev);
 
     vdev->status = 0;
@@ -438,6 +448,10 @@ VirtIODevice *virtio_init_pci(PCIBus *bus, const char *name,
 
     config[0x3d] = 1;
 
+    /* Mark device as multi-function */
+    if ((devfn % 8) == 0)
+        config[0x0e] |= 0x80;
+
     vdev->name = name;
     vdev->config_len = config_size;
     if (vdev->config_len)
diff --git a/qemu/net.h b/qemu/net.h
index 13daa27..3bada75 100644
--- a/qemu/net.h
+++ b/qemu/net.h
@@ -42,7 +42,7 @@ void net_client_uninit(NICInfo *nd);
 
 /* NIC info */
 
-#define MAX_NICS 8
+#define MAX_NICS 256
 
 struct NICInfo {
 uint8_t macaddr[6];
diff --git a/qemu/sysemu.h b/qemu/sysemu.h
index c60072d..4385802 100644
--- a/qemu/sysemu.h
+++ b/qemu/sysemu.h
@@ -149,7 +149,7 @@ typedef struct DriveInfo {
 
 #define MAX_IDE_DEVS   2
 #define MAX_SCSI_DEVS  7
-#define MAX_DRIVES 32
+#define MAX_DRIVES 256
 
 int nb_drives;
 DriveInfo drives_table[MAX_DRIVES+1];
diff --git a/qemu/vl.c b/qemu/vl.c
index 74be059..824e331 100644
--- a/qemu/vl.c
+++ b/qemu/vl.c
@@ -8717,7 

Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Rusty Russell
On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
 On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
  We may still regret not doing *everything* little-endian, but this
  doesn't make it worse.

 Hmm, why *don't* we just do everything LE, including the ring?

Mainly because when requirements are in doubt, simplicity wins, I think.

Cheers,
Rusty.



Re: [kvm-devel] [RFC PATCH] virtio: change config to guest endian.

2008-04-22 Thread Hollis Blanchard
On Tuesday 22 April 2008 16:05:38 Rusty Russell wrote:
 On Wednesday 23 April 2008 06:29:14 Hollis Blanchard wrote:
  On Tuesday 22 April 2008 09:31:35 Rusty Russell wrote:
   We may still regret not doing *everything* little-endian, but this
   doesn't make it worse.
 
  Hmm, why *don't* we just do everything LE, including the ring?
 
 Mainly because when requirements are in doubt, simplicity wins, I think.

Well, I think the definition of simplicity is up for debate in this 
case... LE everywhere is much simpler than "it depends", IMHO.
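
As an illustration of the two options, here is a minimal C sketch of reading one 32-bit field either as fixed little-endian or in the native byte order of whoever runs the code. The helper names `read_le32` and `read_native32` are made up for this example and are not virtio API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fixed little-endian layout: the result is the same on every host,
 * at the cost of a byte swap on big-endian machines. */
static uint32_t read_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Native ("guest endian") layout: no swap, but the meaning of the bytes
 * depends on the byte order of the machine running this code. */
static uint32_t read_native32(const uint8_t *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof(v));
    return v;
}
```

For the bytes 78 56 34 12, read_le32() always yields 0x12345678, while read_native32() yields 0x12345678 on a little-endian machine and 0x78563412 on a big-endian one. That second case is the "it depends" a host has to handle when its guests can run in either byte order.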

-- 
Hollis Blanchard
IBM Linux Technology Center



Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()

2008-04-22 Thread Thomas Cataldo
On Mon, Apr 21, 2008 at 9:57 PM, Thomas Cataldo
[EMAIL PROTECTED] wrote:
 Hi,

  I am running kvm-66 on top of a debian sid host with 2.6.24 (intel 32bit 
 host).

  Got the following in my logs today :

  Apr 21 17:55:01 buffy kernel: WARNING: at
  /usr/src/modules/kvm/mmu.c:390 account_shadowed()
  Apr 21 17:55:01 buffy kernel: Pid: 21416, comm: kvm Tainted: P
  2.6.24-1-686 #1
  Apr 21 17:55:01 buffy kernel:  [f8d07a36] kvm_mmu_get_page+0x42d/0x447 
 [kvm]
  Apr 21 17:55:01 buffy kernel:  [f8d08cca] kvm_mmu_load+0xdf/0x15c [kvm]
  Apr 21 17:55:01 buffy kernel:  [f8affe41]
  vmx_queue_exception+0x0/0x33 [kvm_intel]
  Apr 21 17:55:01 buffy kernel:  [f8d05521]
  kvm_arch_vcpu_ioctl_run+0x233/0x5a9 [kvm]
  Apr 21 17:55:01 buffy kernel:  [f8d013aa] kvm_vcpu_ioctl+0xe4/0x34c [kvm]
  Apr 21 17:55:01 buffy kernel:  [c0159078] delayacct_end+0x70/0x77
  Apr 21 17:55:01 buffy kernel:  [c015aa19] sync_page+0x0/0x3b
  Apr 21 17:55:01 buffy kernel:  [c0159388] __delayacct_blkio_end+0x5b/0x5f
  Apr 21 17:55:01 buffy kernel:  [c02bcaab] io_schedule+0x64/0x80
  Apr 21 17:55:01 buffy kernel:  [c011e07d] enqueue_entity+0x2b/0x3d
  Apr 21 17:55:01 buffy kernel:  [c0115343] apic_wait_icr_idle+0xe/0x15
  Apr 21 17:55:01 buffy kernel:  [c011e0a5] enqueue_task_fair+0x16/0x24
  Apr 21 17:55:01 buffy kernel:  [c011d643] enqueue_task+0x52/0x5d
  Apr 21 17:55:01 buffy kernel:  [c011de9e] resched_task+0x52/0x54
  Apr 21 17:55:01 buffy kernel:  [c011f459] try_to_wake_up+0x2b8/0x2c2
  Apr 21 17:55:01 buffy kernel:  [c011d47e] __wake_up_common+0x32/0x5c
  Apr 21 17:55:01 buffy kernel:  [c011eecc] __wake_up+0x32/0x42
  Apr 21 17:55:01 buffy kernel:  [c013e25c] wake_futex+0x3b/0x45
  Apr 21 17:55:01 buffy kernel:  [c013e4de] futex_wake+0x81/0xb0
  Apr 21 17:55:01 buffy kernel:  [c013f097] do_futex+0x77/0x983
  Apr 21 17:55:01 buffy kernel:  [c011d9ca] update_curr+0x62/0xef
  Apr 21 17:55:01 buffy kernel:  [c0103044] __switch_to+0x9d/0x11d
  Apr 21 17:55:01 buffy kernel:  [f8d012c6] kvm_vcpu_ioctl+0x0/0x34c [kvm]
  Apr 21 17:55:01 buffy kernel:  [c018285b] do_ioctl+0x1f/0x62
  Apr 21 17:55:01 buffy kernel:  [c0182ad5] vfs_ioctl+0x237/0x249
  Apr 21 17:55:01 buffy kernel:  [c0182b2c] sys_ioctl+0x45/0x5d
  Apr 21 17:55:01 buffy kernel:  [c0103e5e] sysenter_past_esp+0x6b/0xa1


  Regards,
  Thomas.


As I got no reply, I guess it is a bad setup on my part. In case it
helps, this happened while I was running make -j on the WebKit svn tree
(i.e. a heavy C++ compilation workload).


