Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Gerd Hoffmann

  Hi,


For (2), you cannot use bus=X,addr=Y because it makes assumptions about
the PCI topology which may change in newer -M pc's.


Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I expect 
we'll see a new machine type 'q35', so '-M q35' will pick the ich9 
chipset (which will have a different pci topology of course) and '-M pc' 
will pick the existing piix chipset (which will continue to look like it 
does today).


cheers,
  Gerd


Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool
 
 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.
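As a minimal sketch, assuming the QEMU block API of this period (the
helper name is made up; qemu_aio_flush() is global because a per-device
wait does not exist yet):

/* assumes block.h / qemu-aio.h for bdrv_flush() and qemu_aio_flush() */
static void flush_for_checkpoint(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for *all* in-flight AIO to complete */
    bdrv_flush(bs);     /* then fsync the completed writes to disk */
}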

Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.
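As an illustrative sketch of that rule (BlockReq is a hypothetical
stand-in for the event-tap request record, not code from this series):

typedef struct BlockReq {
    BlockDriverCompletionFunc *cb;  /* the guest's original callback */
    void *opaque;
} BlockReq;

static void event_tap_write_cb(void *opaque, int ret)
{
    BlockReq *req = opaque;
    req->cb(req->opaque, ret);      /* guest sees completion only now,
                                       from the real completion path */
}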
 
 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

In theory I think you could just re-submit the requests on the secondary
if they had not completed yet.

But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?
 
 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

Right. So when is a transaction completed? This is the time that a
single request will take.

 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.
 
 Yes, and that's what I'm trying to do at this point.  

Oh, I must have missed that code. Which patch/function should I look at?

 I know that
 performance matters a lot, but sacrificing reliability for
 performance now isn't a good idea.  I first want to lay the
 groundwork, and then focus on optimization.  Note that without dirty
 bitmap optimization, Kemari suffers a lot when sending RAM.
 Anthony and I discussed taking this approach at KVM Forum.

I agree, starting simple makes sense.

Kevin


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?


 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically for
 creating PCI devices that doesn't require a parent bus argument but provides
 a way to specify stable addressing (for instance, using a linear index).
 
 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.
 
 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.
 
 Also, for example, irqfd could be considered a similar object to the
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?
 
 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.

There are currently no VCPU-specific bits remaining in kvm_state. It may
be a good idea to introduce an arch-specific kvm_state and move related
bits over. It may also become feasible at some point to carve out memory
management related fields if we have proper abstractions for that, but
I'm not completely sure here.

Anyway, all these things are secondary. The primary topic here is how to
deal with kvm_state and its fields that have VM-global scope.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Daniel P. Berrange
On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote:
   Hi,
 
 For (2), you cannot use bus=X,addr=Y because it makes assumptions about
 the PCI topology which may change in newer -M pc's.
 
 Why should the PCI topology for 'pc' ever change?
 
  We'll probably get q35 support some day, but when this lands I
  expect we'll see a new machine type 'q35', so '-M q35' will pick the
  ich9 chipset (which will have a different pci topology of course)
  and '-M pc' will pick the existing piix chipset (which will continue
  to look like it does today).

If the topology does ever change (e.g. in the kind of way Anthony
suggests, where the first bus only has the graphics card), I think
libvirt is going to need a little work to adapt to the new topology,
regardless of whether we currently specify a bus= arg to -device
or not. I'm not sure there's anything we could do to future-proof
ourselves against that kind of change.

Regards,
Daniel


Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +    bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.

The transaction is completed when the vm state is sent to the
secondary, and the primary receives the ack to it.  Please let me
know if the answer is too vague.  What I can tell is that it
can't be super fast.

 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?

Maybe I mis-answered your question.  The device may receive
timeouts.  If timeouts don't happen, the requests are flushed
one-by-one in writethrough mode because we're calling qemu_aio_flush
and bdrv_flush together.

Yoshi

 I know that
 performance matters a lot, but sacrificing reliability for
 performance now isn't a good idea.  I first want to lay the
 groundwork, and then focus on optimization.  Note that without dirty
 bitmap optimization, Kemari suffers a lot when sending RAM.
 Anthony and I discussed taking this approach at KVM Forum.

 I agree, starting simple makes sense.

 Kevin


Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Srivatsa Vaddagiri
On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote:
  The reason for wanting this should be clear I guess, it allows PI.
 
 Well, if we can expand the spinlock to include an owner, then all this
 becomes moot...

How so? Having an owner will not eliminate the need for pv-ticketlocks
afaict. We still need a mechanism to reduce latency in scheduling the next
thread-in-waiting, which is achieved by your patches. 

- vatsa


Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.
 
 The transaction is completed when the vm state is sent to the
 secondary, and the primary receives the ack to it.  Please let me
 know if the answer is too vague.  What I can tell is that it
 can't be super fast.
 
 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?
 
 Maybe I mis-answered your question.  The device may receive
 timeouts.  

We should pay attention that the guest does not see timeouts. I'm not
expecting that I/O will be super fast, and as long as it is only a
performance problem we can live with it.

However, as soon as the guest gets timeouts it reports I/O errors and
eventually offlines the block device. At this point it's not a
performance problem any more, but also a correctness problem.

This is why I suggested that we flush the event-tap queue (i.e. complete
the transaction) immediately after an I/O request has been issued
instead of waiting for other events that would complete the transaction.

 If timeouts don't happen, the requests are flushed
 one-by-one in writethrough mode because we're calling qemu_aio_flush
 and bdrv_flush together.

I think this is what we must do.

Kevin


EPT: Misconfiguration

2011-01-20 Thread Ruben Kerkhof
I'm suddenly getting lots of the following errors on a server running
2.36.7, but I have no idea what it means:

2011-01-20T12:41:18.358603+01:00 phy005 kernel: EPT: Misconfiguration.
2011-01-20T12:41:18.358621+01:00 phy005 kernel: EPT: GPA: 0x3dbff6b0
2011-01-20T12:41:18.358624+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x50743e007 level 4
2011-01-20T12:41:18.358627+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x523de2007 level 3
2011-01-20T12:41:18.358629+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x62336f007 level 2
2011-01-20T12:41:18.360109+01:00 phy005 kernel: ept_misconfig_inspect_spte: spte 0x1603a0730500d277 level 1
2011-01-20T12:41:18.360137+01:00 phy005 kernel: ept_misconfig_inspect_spte: rsvd_bits = 0x3a000
2011-01-20T12:41:18.360151+01:00 phy005 kernel: [ cut here ]
2011-01-20T12:41:18.360155+01:00 phy005 kernel: WARNING: at arch/x86/kvm/vmx.c:3425 handle_ept_misconfig+0x152/0x1d8 [kvm_intel]()
2011-01-20T12:41:18.360160+01:00 phy005 kernel: Hardware name: X8DTU
2011-01-20T12:41:18.363296+01:00 phy005 kernel: Modules linked in: tun ipmi_devintf ipmi_si ipmi_msghandler bridge 8021q garp stp llc bonding xt_comment xt_recent ip6t_REJECT nf_conntrack_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm igb i2c_i801 iTCO_wdt i2c_core ioatdma joydev iTCO_vendor_support serio_raw dca 3w_9xxx [last unloaded: scsi_wait_scan]
2011-01-20T12:41:18.363312+01:00 phy005 kernel: Pid: 3595, comm: qemu-kvm Tainted: G  D W  2.6.34.7-66.tilaa.fc13.x86_64 #1
2011-01-20T12:41:18.363314+01:00 phy005 kernel: Call Trace:
2011-01-20T12:41:18.364385+01:00 phy005 kernel: [8104d11f] warn_slowpath_common+0x7c/0x94
2011-01-20T12:41:18.364455+01:00 phy005 kernel: [8104d14b] warn_slowpath_null+0x14/0x16
2011-01-20T12:41:18.364462+01:00 phy005 kernel: [a00ba7fb] handle_ept_misconfig+0x152/0x1d8 [kvm_intel]
2011-01-20T12:41:18.364466+01:00 phy005 kernel: [a00bb401] vmx_handle_exit+0x204/0x23a [kvm_intel]
2011-01-20T12:41:18.370619+01:00 phy005 kernel: [a0075998] kvm_arch_vcpu_ioctl_run+0x7cd/0xa74 [kvm]
2011-01-20T12:41:18.370731+01:00 phy005 kernel: [a00645ba] kvm_vcpu_ioctl+0xfd/0x56e [kvm]
2011-01-20T12:41:18.370737+01:00 phy005 kernel: [8100a60e] ? apic_timer_interrupt+0xe/0x20
2011-01-20T12:41:18.370741+01:00 phy005 kernel: [8111aa2f] vfs_ioctl+0x32/0xa6
2011-01-20T12:41:18.371562+01:00 phy005 kernel: [8111afa2] do_vfs_ioctl+0x483/0x4c9
2011-01-20T12:41:18.371577+01:00 phy005 kernel: [8111b03e] sys_ioctl+0x56/0x79
2011-01-20T12:41:18.371581+01:00 phy005 kernel: [81009c72] system_call_fastpath+0x16/0x1b
2011-01-20T12:41:18.372244+01:00 phy005 kernel: ---[ end trace 7d57b311d4a5b22c ]---
2011-01-20T12:41:57.568322+01:00 phy005 kernel: general protection fault:  [#2] SMP
2011-01-20T12:41:57.568335+01:00 phy005 kernel: last sysfs file: /sys/devices/system/cpu/cpu15/topology/thread_siblings
2011-01-20T12:41:57.568339+01:00 phy005 kernel: CPU 0

Kind regards,

Ruben


Re: EPT: Misconfiguration

2011-01-20 Thread Ruben Kerkhof
On Thu, Jan 20, 2011 at 12:48, Ruben Kerkhof ru...@rubenkerkhof.com wrote:
 I'm suddenly getting lots of the following errors on a server running
 2.36.7, but I have no idea what it means:

Sorry, that should be 2.34.7.

Kind regards,

Ruben


Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Srivatsa Vaddagiri
On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote:
  I didn't really read the patch, and I totally forgot everything from
  when I looked at the Xen series, but does the Xen/KVM hypercall
  interface for this include the vcpu to await the kick from?
 
  My guess is not, since the ticket locks used don't know who the owner
  is, which is of course, sad. There are FIFO spinlock implementations
  that can do this though.. although I think they all have a bigger memory
  footprint.
 
 At least in the Xen code, a current owner isn't very useful, because we
 need the current owner to kick the *next* owner to life at release time,
 which we can't do without some structure recording which ticket belongs
 to which cpu.

If we had a yield-to [1] sort of interface _and_ information on which vcpu
owns a lock, then lock-spinners can yield-to the owning vcpu, while the
unlocking vcpu can yield-to the next-vcpu-in-waiting. The key here is not to
sleep when waiting for locks (as implemented by current patch-series, which can 
put other VMs at an advantage by giving them more time than they are entitled 
to) and also to ensure that lock-owner as well as the next-in-line lock-owner 
are not unduly made to wait for cpu. 

Is there a way we can dynamically expand the size of the lock only upon
contention to include additional information like the owning vcpu? Have the
lock point to a per-cpu area upon contention where additional details can be
stored, perhaps?
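For illustration only (not from any posted patch), a ticket lock widened
with an owner field might look like this:

struct pv_ticketlock {
	union {
		struct {
			u8 head;	/* ticket currently being served */
			u8 tail;	/* next ticket to be handed out */
		};
		u16 tickets;
	};
	int owner_vcpu;		/* written by the holder after acquiring;
				   -1 while the lock is free */
};

With the owner recorded, spinners could boost or yield to the holder, and
the unlocker could kick the next ticket holder directly.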

1. https://lkml.org/lkml/2011/1/14/44

- vatsa


[REPOST][PATCH 0/3] Unmapped page cache control (v3)

2011-01-20 Thread Balbir Singh

The following series implements page cache control;
this is a split-out version of patch 1 of version 3 of the
page cache optimization patches posted earlier at
http://lwn.net/Articles/419564/

The previous few revisions received a lot of comments, and I've tried to
address as many of those as possible in this revision. The
last series was reviewed by Christoph Lameter.

There were comments on overlap with Nick's changes. I don't
feel these changes impact Nick's work, and integration can/will
be considered as the patches evolve, if need be.

Detailed Description

This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested for unmapped page control.
This is useful in the following scenario
- In a virtualized environment with cache=writethrough, we see
  double caching - (one in the host and one in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.
- The option is controlled via a boot option and the administrator
  can selectively turn it on, on an as-needed basis.

A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs, and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available, that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code, UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control argument
is supplied. These numbers were chosen to avoid reaping page cache too
aggressively or too frequently, while still providing control.

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 419431 1883047  140888 2443366  254866 mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o 
   text    data     bss     dec     hex filename
 418908 1883023  140888 2442819  254643 mm/built-in.o


---

Balbir Singh (3):
  Move zone_reclaim() outside of CONFIG_NUMA
  Refactor zone_reclaim code
  Provide control over unmapped pages


 Documentation/kernel-parameters.txt |8 ++
 include/linux/mmzone.h  |4 +
 include/linux/swap.h|   21 +-
 init/Kconfig|   12 +++
 kernel/sysctl.c |   20 +++--
 mm/page_alloc.c |9 ++
 mm/vmscan.c |  132 +++
 7 files changed, 175 insertions(+), 31 deletions(-)

-- 
Balbir Singh


[REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Balbir Singh
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 include/linux/mmzone.h |4 ++--
 include/linux/swap.h   |4 ++--
 kernel/sysctl.c|   18 +-
 mm/vmscan.c|2 --
 4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
 */
unsigned long   lowmem_reserve[MAX_NR_ZONES];
 
-#ifdef CONFIG_NUMA
-   int node;
/*
 * zone reclaim becomes active if more unmapped pages exist.
 */
unsigned long   min_unmapped_pages;
+#ifdef CONFIG_NUMA
+   int node;
unsigned long   min_slab_pages;
 #endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #ifdef CONFIG_NUMA
 extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
 #else
 #define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
 #endif
-#ifdef CONFIG_NUMA
-   {
-   .procname   = "zone_reclaim_mode",
-   .data   = &zone_reclaim_mode,
-   .maxlen = sizeof(zone_reclaim_mode),
-   .mode   = 0644,
-   .proc_handler   = proc_dointvec,
-   .extra1 = &zero,
-   },
{
.procname   = "min_unmapped_ratio",
.data   = &sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+   {
+   .procname   = "zone_reclaim_mode",
+   .data   = &zone_reclaim_mode,
+   .maxlen = sizeof(zone_reclaim_mode),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec,
+   .extra1 = &zero,
+   },
{
.procname   = "min_slab_ratio",
.data   = &sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)
 
 module_init(kswapd_init)
 
-#ifdef CONFIG_NUMA
 /*
  * Zone reclaim mode
  *
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 
return ret;
 }
-#endif
 
 /*
  * page_evictable - test whether a page is evictable



[REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Balbir Singh
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_pages modular.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 mm/vmscan.c |   35 +++
 1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..3b25423 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
 }
 
 /*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+   unsigned long nr_pages)
+{
+   int priority;
+   /*
+* Free memory by calling shrink zone with increasing
+* priorities until we have enough memory freed.
+*/
+   priority = ZONE_RECLAIM_PRIORITY;
+   do {
+   shrink_zone(priority, zone, sc);
+   priority--;
+   } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
  * Try to free up some pages from this zone through reclaim.
  */
 static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1  order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
-   int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
 
-   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
-   /*
-* Free memory by calling shrink zone with increasing
-* priorities until we have enough memory freed.
-*/
-   priority = ZONE_RECLAIM_PRIORITY;
-   do {
-   shrink_zone(priority, zone, sc);
-   priority--;
-   } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
-   }
+   if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+       zone_reclaim_pages(zone, &sc, nr_pages);
 
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {



[REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Balbir Singh
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at-least attempt
   to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.

Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
---
 Documentation/kernel-parameters.txt |8 +++
 include/linux/swap.h|   21 ++--
 init/Kconfig|   12 
 kernel/sysctl.c |2 +
 mm/page_alloc.c |9 +++
 mm/vmscan.c |   97 +++
 6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt 
b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file
[X86]
Set unknown_nmi_panic=1 early on boot.
 
+   unmapped_page_control
+   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+   is enabled. It controls the amount of unmapped memory
+   that is present in the system. This boot option plus
+   vm.min_unmapped_ratio (sysctl) provide granular control
+   over how much unmapped page cache can exist in the system
+   before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2).  This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
 extern int remove_mapping(struct address_space *mapping, struct page *page);
 extern long vm_total_pages;
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif
 
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+   return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
 extern int page_evictable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_unevictable_pages(struct address_space *);
 
diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
 config MM_OWNER
bool
 
+config UNMAPPED_PAGECACHE_CONTROL
+   bool "Provide control over unmapped page cache"
+   default n
+   help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter
+ together with the sysctl (vm.min_unmapped_ratio) controls the total number
+ of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
 config SYSFS_DEPRECATED
bool "enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
 #endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
.procname   = "min_unmapped_ratio",
.data   = &sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#endif
 #ifdef CONFIG_NUMA
{
.procname   = "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
unsigned long mark;
  

Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Peter Zijlstra
On Thu, 2011-01-20 at 17:29 +0530, Srivatsa Vaddagiri wrote:
 
 If we had a yield-to [1] sort of interface _and_ information on which vcpu
 owns a lock, then lock-spinners can yield-to the owning vcpu, 

and then I'd nak it for being stupid ;-)

really, yield*() is retarded, never even consider using it. If you've
got the actual owner you can do full blown PI, which is tons better than
a 'do-something-random' call.

The only reason the whole non-virt pause loop filtering muck uses it is
because it really doesn't know anything, and do-something is pretty much
all it can do. It's a broken interface.


Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +    bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.

 The transaction is completed when the vm state is sent to the
 secondary, and the primary receives the ack to it.  Please let me
 know if the answer is too vague.  What I can tell is that it
 can't be super fast.

 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?

 Maybe I mis-answered your question.  The device may receive
 timeouts.

 We should pay attention that the guest does not see timeouts. I'm not
 expecting that I/O will be super fast, and as long as it is only a
 performance problem we can live with it.

 However, as soon as the guest gets timeouts it reports I/O errors and
 eventually offlines the block device. At this point it's not a
 performance problem any more, but also a correctness problem.

 This is why I suggested that we flush the event-tap queue (i.e. complete
 the transaction) immediately after an I/O request has been issued
 instead of waiting for other events that would complete the transaction.

Right.  event-tap doesn't queue at a specific interval.  It'll
schedule the transaction as a bh once events are tapped.  The
purpose of the queue is to store requests initiated during the
transaction.  So I believe the current implementation should be
doing what you're expecting.  However, if the guest dirtied a huge
amount of RAM and initiated block requests, we may get timeouts even
if we start the transaction right away.
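A hedged sketch of that flow, with made-up names standing in for the
event-tap internals (only QEMUBH and qemu_bh_schedule() are real QEMU
API):

typedef struct BlockReq BlockReq;               /* hypothetical request record */
extern void event_tap_queue_add(BlockReq *req); /* hypothetical queue op */

static QEMUBH *event_tap_bh;                /* runs the vm-state transaction */

static void event_tap_block_write(BlockReq *req)
{
    event_tap_queue_add(req);               /* held until the secondary acks */
    qemu_bh_schedule(event_tap_bh);         /* kick the transaction right away */
}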

Yoshi

 If timeouts don't happen, the requests are flushed
 one-by-one in writethrough mode because we're calling qemu_aio_flush
 and bdrv_flush 

[PATCH 1/1]: Log more information on IO_PAGE_FAULT event.

2011-01-20 Thread Prasad Joshi
Prints more information when IO_PAGE_FAULT event occurs.

Signed-off-by: Prasad Joshi prasadjoshi...@gmail.com

---
diff --git a/arch/x86/include/asm/amd_iommu_types.h
b/arch/x86/include/asm/amd_iommu_types.h
index e3509fc..add56b3 100644
--- a/arch/x86/include/asm/amd_iommu_types.h
+++ b/arch/x86/include/asm/amd_iommu_types.h
@@ -96,6 +96,16 @@
 #define EVENT_FLAGS_MASK   0xfff
 #define EVENT_FLAGS_SHIFT  0x10

+#define EVENT_FLAGS_RESERVED1 0x0
+#define EVENT_FLAGS_RESERVED_SIZE 0x3
+#define EVENT_FLAGS_INT (EVENT_FLAGS_RESERVED1 + EVENT_FLAGS_RESERVED_SIZE)
+#define EVENT_FLAGS_PR  (EVENT_FLAGS_INT + 0x1)
+#define EVENT_FLAGS_RW  (EVENT_FLAGS_PR + 0x1)
+#define EVENT_FLAGS_PE  (EVENT_FLAGS_RW + 0x1)
+#define EVENT_FLAGS_RZ  (EVENT_FLAGS_PE + 0x1)
+#define EVENT_FLAGS_TR  (EVENT_FLAGS_RZ + 0x1)
+#define EVENT_FLAGS_RESERVED2 (EVENT_FLAGS_TR + 0x1)
+
 /* feature control bits */
 #define CONTROL_IOMMU_EN0x00ULL
 #define CONTROL_HT_TUN_EN   0x01ULL
diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
index 57ca777..a158852 100644
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -283,6 +283,44 @@ static void dump_command(unsigned long phys_addr)
pr_err("AMD-Vi: CMD[%d]: %08x\n", i, cmd->data[i]);
 }

+static void inline dump_page_fault(int flags)
+{
+   int i = (flags >> EVENT_FLAGS_INT) & 0x1;
+   int pr = (flags >> EVENT_FLAGS_PR) & 0x1;
+   int rw = (flags >> EVENT_FLAGS_RW) & 0x1;
+   int pe = (flags >> EVENT_FLAGS_PE) & 0x1;
+   int rz = (flags >> EVENT_FLAGS_RZ) & 0x1;
+   int tr = (flags >> EVENT_FLAGS_TR) & 0x1;
+
+   const char *i_pr[] = {
+       "to page marked not present",
+       "to page marked present",
+       "to interrupt marked blocked",
+       "to interrupt marked remapped"
+   };
+
+   const char *tr_s[2] = {
+       "transaction",
+       "translation"
+   };
+
+   const char *type[2] = {
+       "read",
+       "write"
+   };
+
+   const char *permission[2] = {
+       "peripheral had permission",
+       "peripheral had no permission",
+   };
+
+   pr_err("AMD-Vi: \t%s %s", tr_s[tr], i_pr[(i << 1) | pr]);
+   pr_err("AMD-Vi: \t%s type: %s", tr_s[tr], type[rw]);
+   if (pr) {
+       pr_err("AMD-Vi: \t%s", permission[pe]);
+   }
+}
+
 static void iommu_print_event(struct amd_iommu *iommu, void *__evt)
 {
u32 *event = __evt;
@@ -307,6 +345,7 @@ static void iommu_print_event(struct amd_iommu
*iommu, void *__evt)
   "domain=0x%04x address=0x%016llx flags=0x%04x]\n",
   PCI_BUS(devid), PCI_SLOT(devid), PCI_FUNC(devid),
   domid, address, flags);
+   dump_page_fault(flags);
break;
case EVENT_TYPE_DEV_TAB_ERR:
printk("DEV_TAB_HARDWARE_ERROR device=%02x:%02x.%x "


Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Kevin Wolf
Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to the 
 disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.

 The transaction is completed when the vm state is sent to the
 secondary, and the primary receives the ack to it.  Please let me
 know if the answer is too vague.  What I can tell is that it
 can't be super fast.

 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?

 Maybe I mis-answered your question.  The device may receive
 timeouts.

 We should pay attention that the guest does not see timeouts. I'm not
 expecting that I/O will be super fast, and as long as it is only a
 performance problem we can live with it.

 However, as soon as the guest gets timeouts it reports I/O errors and
 eventually offlines the block device. At this point it's not a
 performance problem any more, but also a correctness problem.

 This is why I suggested that we flush the event-tap queue (i.e. complete
 the transaction) immediately after an I/O request has been issued
 instead of waiting for other events that would complete the transaction.
 
 Right.  event-tap doesn't queue at a specific interval.  It'll
 schedule the transaction as a bh once events are tapped.  The
 purpose of the queue is to store requests initiated during the
 transaction.  

Ok, now I got it. :-)

So the patches are already doing the best we can do.

 So I believe the current implementation should be doing
 what you're expecting.  However, if the guest dirtied a huge amount
 of RAM and initiated block requests, we may get timeouts even if we
 start the transaction right away.

Right. 

Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Srivatsa Vaddagiri
On Thu, Jan 20, 2011 at 02:41:46PM +0100, Peter Zijlstra wrote:
 On Thu, 2011-01-20 at 17:29 +0530, Srivatsa Vaddagiri wrote:
  
  If we had a yield-to [1] sort of interface _and_ information on which vcpu
  owns a lock, then lock-spinners can yield-to the owning vcpu, 
 
 and then I'd nak it for being stupid ;-)
 
 really, yield*() is retarded, never even consider using it. If you've
 got the actual owner you can do full blown PI, which is tons better than
 a 'do-something-random' call.

Yes definitely that would be much better than yield-to.

 The only reason the whole non-virt pause loop filtering muck uses it is
 because it really doesn't know anything, and do-something is pretty much
 all it can do. It's a broken interface.

- vatsa


Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Christoph Lameter
On Thu, 20 Jan 2011, Balbir Singh wrote:

 --- a/include/linux/swap.h
 +++ b/include/linux/swap.h
 @@ -253,11 +253,11 @@ extern int vm_swappiness;
  extern int remove_mapping(struct address_space *mapping, struct page *page);
  extern long vm_total_pages;

 +extern int sysctl_min_unmapped_ratio;
 +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
  #ifdef CONFIG_NUMA
  extern int zone_reclaim_mode;
 -extern int sysctl_min_unmapped_ratio;
  extern int sysctl_min_slab_ratio;
 -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
  #else
  #define zone_reclaim_mode 0

So the end result of this patch is that zone reclaim is compiled
into vmscan.o even on !NUMA configurations, but since zone_reclaim_mode ==
0 no one can ever call that code?



Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Christoph Lameter

Reviewed-by: Christoph Lameter c...@linux.com



[PATCH] KVM test: virtio_console: Add kernel crash logging

2011-01-20 Thread Lucas Meneghel Rodrigues
Change the cleanup function:
   1) Try the faster cleanup path.
   2) Try restarting the ssh session.
   3) Try restarting the machine.
   *) If the test ends with a kernel crash, directly try restarting the machine.

Signed-off-by: Jiri Zupka jzu...@redhat.com
---
 client/tests/kvm/scripts/virtio_guest.py |   79 -
 client/tests/kvm/tests/virtio_console.py |  204 +++---
 2 files changed, 179 insertions(+), 104 deletions(-)

diff --git a/client/tests/kvm/scripts/virtio_guest.py 
b/client/tests/kvm/scripts/virtio_guest.py
index 0038f48..498bf6a 100755
--- a/client/tests/kvm/scripts/virtio_guest.py
+++ b/client/tests/kvm/scripts/virtio_guest.py
@@ -10,11 +10,12 @@ Auxiliary script used to send data between ports on guests.
 import threading
 from threading import Thread
 import os, time, select, re, random, sys, array
-import fcntl, array, subprocess, traceback, signal
+import fcntl, subprocess, traceback, signal
 
DEBUGPATH = "/sys/kernel/debug"
SYSFSPATH = "/sys/class/virtio-ports/"
 
+exiting = False
 
 class VirtioGuest:
 
@@ -70,32 +71,42 @@ class VirtioGuest:
         else:
             viop_names = os.listdir('%s/virtio-ports' % (DEBUGPATH))
             for name in viop_names:
-                f = open("%s/virtio-ports/%s" % (DEBUGPATH, name), 'r')
+                open_db_file = "%s/virtio-ports/%s" % (DEBUGPATH, name)
+                f = open(open_db_file, 'r')
                 port = {}
+                file = []
                 for line in iter(f):
-                    m = re.match("(\S+): (\S+)", line)
-                    port[m.group(1)] = m.group(2)
-
-                if (port['is_console'] == "yes"):
-                    port["path"] = "/dev/hvc%s" % (port["console_vtermno"])
-                    # Console works like a serialport
-                else:
-                    port["path"] = "/dev/%s" % name
-
-                if (not os.path.exists(port['path'])):
-                    print "FAIL: %s not exist" % port['path']
-
-                sysfspath = SYSFSPATH + name
-                if (not os.path.isdir(sysfspath)):
-                    print "FAIL: %s not exist" % (sysfspath)
-
-                info_name = sysfspath + "/name"
-                port_name = self._readfile(info_name).strip()
-                if (port_name != port["name"]):
-                    print ("FAIL: Port info not match \n%s - %s\n%s - %s" %
-                           (info_name , port_name,
-                            "%s/virtio-ports/%s" % (DEBUGPATH, name),
-                            port["name"]))
+                    file.append(line)
+                try:
+                    for line in file:
+                        m = re.match("(\S+): (\S+)", line)
+                        port[m.group(1)] = m.group(2)
+
+                    if (port['is_console'] == "yes"):
+                        port["path"] = "/dev/hvc%s" % (port["console_vtermno"])
+                        # Console works like a serialport
+                    else:
+                        port["path"] = "/dev/%s" % name
+
+                    if (not os.path.exists(port['path'])):
+                        print "FAIL: %s not exist" % port['path']
+
+                    sysfspath = SYSFSPATH + name
+                    if (not os.path.isdir(sysfspath)):
+                        print "FAIL: %s not exist" % (sysfspath)
+
+                    info_name = sysfspath + "/name"
+                    port_name = self._readfile(info_name).strip()
+                    if (port_name != port["name"]):
+                        print ("FAIL: Port info not match \n%s - %s\n%s - %s" %
+                               (info_name , port_name,
+                                "%s/virtio-ports/%s" % (DEBUGPATH, name),
+                                port["name"]))
+                except AttributeError:
+                    print ("In file " + open_db_file +
+                           " are bad data\n" + "".join(file).strip())
+                    print ("FAIL: Fail file data.")
+                    return
 
             ports[port['name']] = port
             f.close()
@@ -109,6 +120,8 @@ class VirtioGuest:
 
         self.ports = self._get_port_status()
 
+        if self.ports == None:
+            return
         for item in in_files:
             if (item[1] != self.ports[item[0]]["is_console"]):
                 print self.ports
@@ -619,7 +632,7 @@ class VirtioGuest:
             buf = ""
             if ret[0]:
                 buf = os.read(in_f[0], buffer)
-            print ("PASS: Rest in socket: " + buf)
+            print ("PASS: Rest in socket: ") + str(buf[10])
 
 
 def is_alive():
@@ -645,13 +658,20 @@ def compile():
     sys.exit()
 
 
+def guest_exit():
+    global exiting
+    exiting = True
+    os.kill(os.getpid(), signal.SIGUSR1)
+
+
 def worker(virt):
     """
     Worker thread (infinite) loop of virtio_guest.
     """
+    global exiting
     print "PASS: Start"
 
-    while True:
+    while not exiting:
         str = raw_input()
         try:
             exec str
@@ -661,7 +681,6 @@ def worker(virt):
   

Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Christoph Lameter
On Thu, 20 Jan 2011, Balbir Singh wrote:

 + unmapped_page_control
 + [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
 + is enabled. It controls the amount of unmapped memory
 + that is present in the system. This boot option plus
 + vm.min_unmapped_ratio (sysctl) provide granular control

min_unmapped_ratio is there to guarantee that zone reclaim does not
reclaim all unmapped pages.

What you want here is a max_unmapped_ratio.
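To make the distinction concrete, a hedged sketch (zone->min_unmapped_pages and zone_unmapped_file_pages() are existing kernel names of this era; max_unmapped_pages is a hypothetical field standing in for the proposed max_unmapped_ratio):

static bool should_reclaim_unmapped(struct zone *zone)
{
	unsigned long unmapped = zone_unmapped_file_pages(zone);

	/* min_unmapped_ratio: a floor -- zone reclaim must leave this much */
	if (unmapped <= zone->min_unmapped_pages)
		return false;

	/* proposed max_unmapped_ratio: a ceiling that triggers reclaim */
	return unmapped > zone->max_unmapped_pages;
}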


  {
 @@ -2297,6 +2320,12 @@ loop_again:
 			shrink_active_list(SWAP_CLUSTER_MAX, zone,
 					   &sc, priority, 0);
 
 +		/*
 +		 * We do unmapped page reclaim once here and once
 +		 * below, so that we don't lose out
 +		 */
 +		reclaim_unmapped_pages(priority, zone, &sc);
 +
 		if (!zone_watermark_ok_safe(zone, order,

Hmm. Okay, that means background reclaim does it. If so then we also want
zone reclaim to be able to work in the background, I think.
max_unmapped_ratio could also be useful to the zone reclaim logic.



[PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---

I need to report some error from virtio-pci
that would be handled specially (disable but don't
report an error) so I wanted one that's never likely to be used by a
userspace ioctl. I selected ERANGE but it'd
be easy to switch to something else. Comments?

 hw/vhost.c  |4 +++-
 hw/virtio-net.c |6 --
 hw/virtio-pci.c |3 +++
 hw/virtio.h |2 ++
 4 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/hw/vhost.c b/hw/vhost.c
index 1d09ed0..c79765a 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 
     r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
     if (r < 0) {
-        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+	if (r != -EVIRTIO_DISABLED) {
+	    fprintf(stderr, "Error binding guest notifier: %d\n", -r);
+	}
         goto fail_notifiers;
     }
 
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index ccb3e63..5de3fee 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
     if (!n->vhost_started) {
         int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
         if (r < 0) {
-            error_report("unable to start vhost net: %d: "
-                         "falling back on userspace virtio", -r);
+            if (r != -EVIRTIO_DISABLED) {
+                error_report("unable to start vhost net: %d: "
+                             "falling back on userspace virtio", -r);
+            }
         } else {
             n->vhost_started = 1;
         }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index dd8887a..dbf4be0 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
     EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
 
     if (assign) {
+        if (!msix_enabled(&proxy->pci_dev)) {
+            return -EVIRTIO_DISABLED;
+        }
         int r = event_notifier_init(notifier, 0);
         if (r < 0) {
             return r;
diff --git a/hw/virtio.h b/hw/virtio.h
index d8546d5..53bbdba 100644
--- a/hw/virtio.h
+++ b/hw/virtio.h
@@ -98,6 +98,8 @@ typedef struct {
     void (*vmstate_change)(void * opaque, bool running);
 } VirtIOBindings;
 
+#define EVIRTIO_DISABLED ERANGE
+
 #define VIRTIO_PCI_QUEUE_MAX 64
 
 #define VIRTIO_NO_VECTOR 0xffff
-- 
1.7.3.2.91.g446ac


Re: [PATCH v3 1/3] KVM: fix rcu usage warning in kvm_arch_vcpu_ioctl_set_sregs()

2011-01-20 Thread Marcelo Tosatti
On Thu, Jan 20, 2011 at 11:03:31AM +0800, Xiao Guangrong wrote:
 On 01/20/2011 02:13 AM, Marcelo Tosatti wrote:
  On Wed, Jan 19, 2011 at 01:16:23PM +0800, Xiao Guangrong wrote:
  On 01/12/2011 03:39 PM, Xiao Guangrong wrote:
  Fix:
 
  [ 1001.499596] ===
  [ 1001.499599] [ INFO: suspicious rcu_dereference_check() usage. ]
  [ 1001.499601] ---
  [ 1001.499604] include/linux/kvm_host.h:301 invoked 
  rcu_dereference_check() without protection!
..
  [ 1001.499636] Pid: 6035, comm: qemu-system-x86 Not tainted 2.6.37-rc6+ 
  #62
  [ 1001.499638] Call Trace:
  [ 1001.499644]  [] lockdep_rcu_dereference+0x9d/0xa5
  [ 1001.499653]  [] gfn_to_memslot+0x8d/0xc8 [kvm]
  [ 1001.499661]  [] gfn_to_hva+0x16/0x3f [kvm]
  [ 1001.499669]  [] kvm_read_guest_page+0x1e/0x5e [kvm]
  [ 1001.499681]  [] kvm_read_guest_page_mmu+0x53/0x5e [kvm]
  [ 1001.499699]  [] load_pdptrs+0x3f/0x9c [kvm]
  [ 1001.499705]  [] ? vmx_set_cr0+0x507/0x517 [kvm_intel]
  [ 1001.499717]  [] kvm_arch_vcpu_ioctl_set_sregs+0x1f3/0x3c0 [kvm]
  [ 1001.499727]  [] kvm_vcpu_ioctl+0x6a5/0xbc5 [kvm]
 
  Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
 
  Ping ...?
  
  Applied this fix. For the make_all_cpus_request optimization, can you
  show numbers with this new version? Because now there is LOCK# similarly
  to the spinlock.
 
 Marcelo,
 
 Sure :-), here is a simple kernbench test result:
 
 Before patch:
 
 real    5m6.493s
 user    3m57.847s
 sys     9m7.115s
 
 real    5m1.750s
 user    4m0.109s
 sys     9m10.192s
 
 
 After patch:
 real    5m0.140s
 user    3m57.956s
 sys     8m58.339s
 
 real    4m56.314s
 user    4m0.303s
 sys     8m55.774s

Nice. One disadvantageous side effect for the kvm_vcpu_kick path 
is that it can race with make_all_cpus_request, which is possibly
doing unrelated, slower work (IPI'ing other vcpus, waiting for
response).

Looks ok, but lets wait for more careful reviews before applying.




Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Anthony Liguori

On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:

When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkinm...@redhat.com
   


I actually think this should be a terminal error.  The user asks for
vhost-net; if we cannot enable it, we should exit.


Or we should warn the user that they should expect bad performance.  
Silently doing something that the user has explicitly asked us not to do 
is not a good behavior.


Regards,

Anthony Liguori




Re: [Qemu-devel] [PATCH 09/19] Introduce event-tap.

2011-01-20 Thread Yoshiaki Tamura
2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 14:50, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 11:39, schrieb Yoshiaki Tamura:
 2011/1/20 Kevin Wolf kw...@redhat.com:
 Am 20.01.2011 06:19, schrieb Yoshiaki Tamura:
 +        return;
 +    }
 +
 +    bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
 +                    blk_req->reqs[0].nb_sectors, blk_req->reqs[0].cb,
 +                    blk_req->reqs[0].opaque);

 Same here.

 +    bdrv_flush(bs);

 This looks really strange. What is this supposed to do?

 One point is that you write it immediately after bdrv_aio_write, so 
 you
 get an fsync for which you don't know if it includes the current write
 request or if it doesn't. Which data do you want to get flushed to 
 the disk?

 I was expecting to flush the aio request that was just initiated.
 Am I misunderstanding the function?

 Seems so. The function names don't use really clear terminology either,
 so you're not the first one to fall in this trap. Basically we have:

 * qemu_aio_flush() waits for all AIO requests to complete. I think you
 wanted to have exactly this, but only for a single block device. Such a
 function doesn't exist yet.

 * bdrv_flush() makes sure that all successfully completed requests are
 written to disk (by calling fsync)

 * bdrv_aio_flush() is the asynchronous version of bdrv_flush, i.e. run
 the fsync in the thread pool

 Then what I wanted to do is, call qemu_aio_flush first, then
 bdrv_flush.  It should be like live migration.

 Okay, that makes sense. :-)

 The other thing is that you introduce a bdrv_flush for each request,
 basically forcing everyone to something very similar to writethrough
 mode. I'm sure this will have a big impact on performance.

 The reason is to avoid inversion of queued requests.  Although
 processing one-by-one is heavy, wouldn't having requests flushed
 to disk out of order break the disk image?

 No, that's fine. If a guest issues two requests at the same time, they
 may complete in any order. You just need to make sure that you don't
 call the completion callback before the request really has completed.

 We need to flush requests, meaning aio and fsync, before sending
 the final state of the guests, to make sure we can switch to the
 secondary safely.

 In theory I think you could just re-submit the requests on the secondary
 if they had not completed yet.

 But you're right, let's keep things simple for the start.

 I'm just starting to wonder if the guest won't timeout the requests if
 they are queued for too long. Even more, with IDE, it can only handle
 one request at a time, so not completing requests doesn't sound like a
 good idea at all. In what intervals is the event-tap queue flushed?

 The requests are flushed once each transaction completes.  So
 it's not with specific intervals.

 Right. So when is a transaction completed? This is the time that a
 single request will take.

 The transaction is completed when the vm state is sent to the
 secondary, and the primary receives the ack to it.  Please let me
 know if the answer is too vague.  What I can tell is that it
 can't be super fast.

 On the other hand, if you complete before actually writing out, you
 don't get timeouts, but you signal success to the guest when the request
 could still fail. What would you do in this case? With a writeback cache
 mode we're fine, we can just fail the next flush (until then nothing is
 guaranteed to be on disk and order doesn't matter either), but with
 cache=writethrough we're in serious trouble.

 Have you thought about this problem? Maybe we end up having to flush the
 event-tap queue for each single write in writethrough mode.

 Yes, and that's what I'm trying to do at this point.

 Oh, I must have missed that code. Which patch/function should I look at?

 Maybe I mis-answered your question.  The device may receive
 timeouts.

 We should pay attention that the guest does not see timeouts. I'm not
 expecting that I/O will be super fast, and as long as it is only a
 performance problem we can live with it.

 However, as soon as the guest gets timeouts it reports I/O errors and
 eventually offlines the block device. At this point it's not a
 performance problem any more, but also a correctness problem.

 This is why I suggested that we flush the event-tap queue (i.e. complete
 the transaction) immediately after an I/O request has been issued
 instead of waiting for other events that would complete the transaction.

 Right.  event-tap doesn't queue at a specific interval.  It'll
 schedule the transaction as a bh once events are tapped.  The
 purpose of the queue is to store requests initiated while the
 transaction is in flight.

 Ok, now I got it. :-)

 So the patches are already doing the best we can do.

 So I believe current implementation should be doing
 what you're expecting.  However, if the guest dirtied huge amount
 of ram and initiated block requests, we may get timeouts even we
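
To illustrate the ordering agreed on above, a minimal sketch: qemu_aio_flush() and bdrv_flush() are the existing QEMU calls exactly as Kevin describes them, while the wrapper name event_tap_flush_bs() is hypothetical, not from the series.

static void event_tap_flush_bs(BlockDriverState *bs)
{
    qemu_aio_flush();   /* wait for all outstanding AIO requests first */
    bdrv_flush(bs);     /* then fsync the successfully completed ones */
}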
 

Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
 On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
 When MSI is off, each interrupt needs to be bounced through the io
 thread when it's set/cleared, so vhost-net causes more context switches and
 higher CPU utilization than userspace virtio which handles networking in
 the same thread.
 
 We'll need to fix this by adding level irq support in kvm irqfd,
 for now disable vhost-net in these configurations.
 
 Signed-off-by: Michael S. Tsirkinm...@redhat.com
 
 I actually think this should be a terminal error.  The user asks for
 vhost-net, if we cannot enable it, we should exit.
 
 Or we should warn the user that they should expect bad performance.
 Silently doing something that the user has explicitly asked us not
 to do is not a good behavior.
 
 Regards,
 
 Anthony Liguori

The issue is that user has no control of the guest, and can not know
whether the guest enables MSI. So what you ask for will just make
some guests fail, and others fail sometimes.
The user also has no way to know that version X of kvm does not expose a
way to inject level interrupts with irqfd.

We could have *another* flag that says use vhost where it helps but
then I think this is what everyone wants to do, anyway, and libvirt
already sets vhost=on so I prefer redefining the meaning of an existing
flag.

Maybe this is best handled by a documentation update?

We always said:
    "use vhost=on to enable experimental in kernel accelerator\n"

note 'enable' not 'require'. This is similar to how we specify
nvectors : you can not make guest use the feature.

How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF("net", HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "(note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
     "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"



Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
 When MSI is off, each interrupt needs to be bounced through the io
 thread when it's set/cleared, so vhost-net causes more context switches and
 higher CPU utilization than userspace virtio which handles networking in
 the same thread.
 
 We'll need to fix this by adding level irq support in kvm irqfd,
 for now disable vhost-net in these configurations.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 I need to report some error from virtio-pci
 that would be handled specially (disable but don't
 report an error) so I wanted one that's never likely to be used by a
 userspace ioctl. I selected ERANGE but it'd
 be easy to switch to something else. Comments?

Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?

-Sridhar
 


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-01-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052


Marcelo Tosatti mtosa...@redhat.com changed:

 What    | Removed | Added
 --------+---------+--------------------
 CC      |         | mtosa...@redhat.com




--- Comment #4 from Marcelo Tosatti mtosa...@redhat.com  2011-01-20 17:28:40 
---
Nicolas,

This should be fixed by the attached patch, queued for 2.6.36-stable.



[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-01-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #5 from Marcelo Tosatti mtosa...@redhat.com  2011-01-20 17:30:38 
---
Created an attachment (id=44522)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=44522)
KVM: MMU: fix rmap_remove on non present sptes

KVM: MMU: fix rmap_remove on non present sptes



Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Michael S. Tsirkin
On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote:
 On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
  When MSI is off, each interrupt needs to be bounced through the io
  thread when it's set/cleared, so vhost-net causes more context switches and
  higher CPU utilization than userspace virtio which handles networking in
  the same thread.
  
  We'll need to fix this by adding level irq support in kvm irqfd,
  for now disable vhost-net in these configurations.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
  
  I need to report some error from virtio-pci
  that would be handled specially (disable but don't
  report an error) so I wanted one that's never likely to be used by a
  userspace ioctl. I selected ERANGE but it'd
  be easy to switch to something else. Comments?
 
 Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?
 
 -Sridhar

The error is reported by virtio-pci, which does not know about vhost.
I started with EVIRTIO_MSIX_DISABLED and made it shorter.
Would EVIRTIO_MSIX_DISABLED be better?

  


Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Jeremy Fitzhardinge
On 01/20/2011 03:42 AM, Srivatsa Vaddagiri wrote:
 On Wed, Jan 19, 2011 at 10:53:52AM -0800, Jeremy Fitzhardinge wrote:
 The reason for wanting this should be clear I guess, it allows PI.
 Well, if we can expand the spinlock to include an owner, then all this
 becomes moot...
 How so? Having an owner will not eliminate the need for pv-ticketlocks
 afaict. We still need a mechanism to reduce latency in scheduling the next
 thread-in-waiting, which is achieved by your patches. 

No, sorry, I should have been clearer.  I meant the effort of not
increasing the lock size while still recording in-slowpath state.

J


Re: [PATCH 2/3] kvm hypervisor : Add hypercalls to support pv-ticketlock

2011-01-20 Thread Jeremy Fitzhardinge
On 01/20/2011 03:59 AM, Srivatsa Vaddagiri wrote:
 At least in the Xen code, a current owner isn't very useful, because we
 need the current owner to kick the *next* owner to life at release time,
 which we can't do without some structure recording which ticket belongs
 to which cpu.
 If we had a yield-to [1] sort of interface _and_ information on which vcpu
 owns a lock, then lock-spinners can yield-to the owning vcpu, while the
 unlocking vcpu can yield-to the next-vcpu-in-waiting.

Perhaps, but the core problem is how to find next-vcpu-in-waiting
efficiently.  Once you have that info, there's a number of things you
can usefully do with it.

  The key here is not to
 sleep when waiting for locks (as implemented by current patch-series, which 
 can 
 put other VMs at an advantage by giving them more time than they are entitled 
 to)

Why?  If a VCPU can't make progress because its waiting for some
resource, then why not schedule something else instead?  Presumably when
the VCPU does become runnable, the scheduler will credit its previous
blocked state and let it run in preference to something else.

  and also to ensure that lock-owner as well as the next-in-line lock-owner 
 are not unduly made to wait for cpu. 

 Is there a way we can dynamically expand the size of the lock only upon
 contention, to include additional information like the owning vcpu? Have the
 lock point to a per-cpu area upon contention where additional details can be
 stored, perhaps?

As soon as you add a pointer to the lock, you're increasing its size. 
If we had a pointer in there already, then all of this would be moot.

If auxiliary per-lock data is uncommon, then using a hash keyed on the lock's
address would be one way to do it.

J
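
A hedged sketch of the hash-keyed-on-lock-address idea; all names here are hypothetical illustrations, not from any posted patch, and synchronization/eviction of table entries is deliberately omitted.

#define LOCK_AUX_SLOTS 1024

struct lock_aux {
	void *lock;        /* address of the contended lock */
	int   next_cpu;    /* e.g. the next ticket holder to kick */
};

static struct lock_aux lock_aux_table[LOCK_AUX_SLOTS];

static struct lock_aux *lock_aux_lookup(void *lock)
{
	unsigned long i = ((unsigned long)lock >> 4) % LOCK_AUX_SLOTS;

	/* linear probe; entries exist only for locks that saw contention */
	while (lock_aux_table[i].lock && lock_aux_table[i].lock != lock)
		i = (i + 1) % LOCK_AUX_SLOTS;
	lock_aux_table[i].lock = lock;
	return &lock_aux_table[i];
}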


Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Alex Williamson
On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
 When MSI is off, each interrupt needs to be bounced through the io
 thread when it's set/cleared, so vhost-net causes more context switches and
 higher CPU utilization than userspace virtio which handles networking in
 the same thread.
 
 We'll need to fix this by adding level irq support in kvm irqfd,
 for now disable vhost-net in these configurations.

We need this if we want avoid bouncing vfio INtx through qemu too.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 I need to report some error from virtio-pci
 that would be handled specially (disable but don't
 report an error) so I wanted one that's never likely to be used by a
 userspace ioctl. I selected ERANGE but it'd
 be easy to switch to something else. Comments?
 
  hw/vhost.c  |4 +++-
  hw/virtio-net.c |6 --
  hw/virtio-pci.c |3 +++
  hw/virtio.h |2 ++
  4 files changed, 12 insertions(+), 3 deletions(-)
 
 diff --git a/hw/vhost.c b/hw/vhost.c
 index 1d09ed0..c79765a 100644
 --- a/hw/vhost.c
 +++ b/hw/vhost.c
 @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
  
      r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
      if (r < 0) {
 -        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
 +	if (r != -EVIRTIO_DISABLED) {
 +	    fprintf(stderr, "Error binding guest notifier: %d\n", -r);
 +	}

style - the above is tab indented.

      goto fail_notifiers;
      }
  
 diff --git a/hw/virtio-net.c b/hw/virtio-net.c
 index ccb3e63..5de3fee 100644
 --- a/hw/virtio-net.c
 +++ b/hw/virtio-net.c
 @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
      if (!n->vhost_started) {
          int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
          if (r < 0) {
 -            error_report("unable to start vhost net: %d: "
 -                         "falling back on userspace virtio", -r);
 +            if (r != -EVIRTIO_DISABLED) {
 +                error_report("unable to start vhost net: %d: "
 +                             "falling back on userspace virtio", -r);
 +            }
          } else {
              n->vhost_started = 1;
          }
 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index dd8887a..dbf4be0 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
      EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
  
      if (assign) {
 +        if (!msix_enabled(&proxy->pci_dev)) {
 +            return -EVIRTIO_DISABLED;
 +        }
          int r = event_notifier_init(notifier, 0);
          if (r < 0) {
              return r;
 diff --git a/hw/virtio.h b/hw/virtio.h
 index d8546d5..53bbdba 100644
 --- a/hw/virtio.h
 +++ b/hw/virtio.h
 @@ -98,6 +98,8 @@ typedef struct {
      void (*vmstate_change)(void * opaque, bool running);
  } VirtIOBindings;
  
 +#define EVIRTIO_DISABLED ERANGE
 +
  #define VIRTIO_PCI_QUEUE_MAX 64
  
  #define VIRTIO_NO_VECTOR 0xffff

I'm not a fan of having this special return value.  Why doesn't
virtio-pci only setup the set_guest_notifiers function pointer when msix
is enabled?  If not that, it could at least expose some
virtio_foo_enabled() type feature that vhost could check.  Thanks,

Alex
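
A hedged sketch of the query-style alternative Alex suggests: msix_enabled() is the existing helper the patch already uses, while virtio_pci_guest_notifier_enabled() is a hypothetical name for illustration only.

static bool virtio_pci_guest_notifier_enabled(VirtIOPCIProxy *proxy)
{
    /* guest notifiers can only bypass the io thread when MSI-X is on */
    return msix_enabled(&proxy->pci_dev);
}

/* vhost_dev_start() could then check this up front and fall back
 * silently, with no special errno threaded through the bindings. */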



Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?


 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically for
 creating PCI devices that doesn't require a parent bus argument but provides
 a way to specify stable addressing (for instance, using a linear index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.

 There are currently no VCPU-specific bits remaining in kvm_state.

I think fields vcpu_events, robust_singlestep, debugregs,
kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
same for all VCPUs but still they are sort of CPU properties. I'm not
sure about fd field.

 It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over.

This should probably contain only irqchip_in_kernel, pit_in_kernel and
many_ioeventfds, maybe fd.

 It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.

I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
migration_log into the memory object.

 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.

If it is an opaque blob which contains various unrelated stuff, no
clear place will be found.

By the way, we don't have a QEMUState but instead use globals. Perhaps
this should be reorganized as well. For fd field, maybe even using a
global variable could be justified, since it is used for direct access
to kernel, not unlike a system call.
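
Sketched out, the split proposed above might look as follows; the struct and field names are illustrative only, taken from the lists in this email rather than from any posted patch.

typedef struct KVMArchState {          /* arch/VM-global feature bits */
    int irqchip_in_kernel;
    int pit_in_kernel;
    int many_ioeventfds;
} KVMArchState;

typedef struct KVMMemoryState {        /* carved out into the memory object */
    KVMSlot slots[32];
    int vmfd;
    int coalesced_mmio;
    int broken_set_mem_region;
    int migration_log;
} KVMMemoryState;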


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 03:33 AM, Jan Kiszka wrote:

On 2011-01-19 20:32, Blue Swirl wrote:
   

On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
aligu...@linux.vnet.ibm.com  wrote:
 

On 01/19/2011 07:15 AM, Markus Armbruster wrote:
   

So they interact with KVM (need kvm_state), and they interact with the
emulated PCI bus.  Could you elaborate on the fundamental difference
between the two interactions that makes you choose the (hypothetical)
KVM bus over the PCI bus as device parent?

 

It's almost arbitrary, but I would say it's the direction that I/Os flow.

But if the underlying observation is that the device tree is not really a
tree, you're 100% correct.  This is part of why a factory interface that
just takes a parent bus is too simplistic.

I think we ought to introduce a -pci-device option that is specifically for
creating PCI devices that doesn't require a parent bus argument but provides
a way to specify stable addressing (for instancing, using a linear index).
   

I think kvm_state should not be a property of any device or bus. It
should be split to more logical pieces.

Some parts of it could remain in CPUState, because they are associated
with a VCPU.

Also, for example irqfd could be considered to be similar object to
char or block devices provided by QEMU to devices. Would it make sense
to introduce new host types for passing parts of kvm_state to devices?

I'd also make coalesced MMIO stuff part of memory object. We are not
passing any state references when using cpu_physical_memory_rw(), but
that could be changed.
 

There are currently no VCPU-specific bits remaining in kvm_state. It may
be a good idea to introduce an arch-specific kvm_state and move related
bits over. It may also once be feasible to carve out memory management
related fields if we have proper abstractions for that, but I'm not
completely sure here.

Anyway, all these things are secondary. The primary topic here is how to
deal with kvm_state and its fields that have VM-global scope.
   


The debate is really:

1) should we remove all passing of kvm_state and just assume it's static

2) deal with a couple places in the code where we need to figure out how 
to get at kvm_state


I think we've only identified 1 real instance of (2) and it's resulted 
in some good discussions about how to model KVM devices vs. emulated 
devices.  Honestly, (1) just stinks.  I see absolutely no advantage to 
it at all.   In the very worst case scenario, the thing we need to do is 
just reference an extern variable in a few places.  That completely 
avoids all of the modelling discussions for now (while leaving for 
placeholder FIXMEs so the problem can be tackled later).


I don't understand the resistance here.

Regards,

Anthony Liguori


Jan

   




Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 02:44 AM, Gerd Hoffmann wrote:

  Hi,


For (2), you cannot use bus=X,addr=Y because it makes assumptions about
the PCI topology which may change in newer -M pc's.


Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I expect 
we'll see a new machine type 'q35', so '-m q35' will pick the ich9 
chipset (which will have a different pci topology of course) and '-m 
pc' will pick the existing piix chipset (which will continue to look 
like it looks today).


But then what's the default machine type?  When I say -M pc, I really 
mean the default machine.


At some point, qemu-system-x86_64 -device virtio-net-pci,addr=2.0

is not going to be a reliable way to invoke qemu because there's no way 
we can guarantee that slot 2 isn't occupied by a chipset device or some 
other default device.


Regards,

Anthony Liguori


cheers,
  Gerd




Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Anthony Liguori

On 01/20/2011 04:33 AM, Daniel P. Berrange wrote:

On Thu, Jan 20, 2011 at 09:44:05AM +0100, Gerd Hoffmann wrote:
   

   Hi,

 

For (2), you cannot use bus=X,addr=Y because it makes assumptions about
the PCI topology which may change in newer -M pc's.
   

Why should the PCI topology for 'pc' ever change?

We'll probably get q35 support some day, but when this lands I
expect we'll see a new machine type 'q35', so '-m q35' will pick the
ich9 chipset (which will have a different pci topology of course)
and '-m pc' will pick the existing piix chipset (which will continue
to look like it looks today).
 

If the topology does ever change (eg in the kind of way anthony
suggests, first bus only has the graphics card), I think libvirt
is going to need a little work to adapt to the new topology,
regardless of whether we currently specify a bus= arg to -device
or not. I'm not sure there's anything we could do to future proof
us to that kind of change.
   


I assume that libvirt today assumes that it can use a set of PCI slots 
in bus 0, correct?  Probably in the range 3-31?  Such assumptions are 
very likely to break.


Regards,

Anthony Liguori


Regards,
Daniel
   




[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-01-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #6 from prochazka prochazka.nico...@gmail.com  2011-01-20 
19:45:49 ---
Hello,
I do not understand; the patch seems to be already applied on the 2.6.37
kernel tree, and my tests are based on this release.

NP.



Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 7:37 PM, Anthony Liguori
aligu...@linux.vnet.ibm.com wrote:
 On 01/20/2011 03:33 AM, Jan Kiszka wrote:

 On 2011-01-19 20:32, Blue Swirl wrote:


 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com  wrote:


 On 01/19/2011 07:15 AM, Markus Armbruster wrote:


 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?



 It's almost arbitrary, but I would say it's the direction that I/Os
 flow.

 But if the underlying observation is that the device tree is not really
 a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically
 for
 creating PCI devices that doesn't require a parent bus argument but
 provides
 a way to specify stable addressing (for instancing, using a linear
 index).


 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.


 There are currently no VCPU-specific bits remaining in kvm_state. It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over. It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.

 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.


 The debate is really:

 1) should we remove all passing of kvm_state and just assume it's static

 2) deal with a couple places in the code where we need to figure out how to
 get at kvm_state

 I think we've only identified 1 real instance of (2) and it's resulted in
 some good discussions about how to model KVM devices vs. emulated devices.
  Honestly, (1) just stinks.  I see absolutely no advantage to it at all.

Fully agree.

 In the very worst case scenario, the thing we need to do is just reference
 an extern variable in a few places.  That completely avoids all of the
 modelling discussions for now (while leaving for placeholder FIXMEs so the
 problem can be tackled later).

I think KVMState was designed to match the KVM ioctl interface: all the stuff
that is needed for talking to KVM or received from KVM is there. But
I think this shouldn't be a design driver.

If the only pieces of kvm_state that are needed by the devices are
irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of
passing kvm_state to devices becomes very different. Each of these are
just single bits, affecting only a few devices. Perhaps they could be
device properties which the board level sets when KVM is used?
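
A hedged sketch of that board-level property idea, in the qdev property style of the period; the property name, the APICState field, and the wiring are illustrative, not an existing interface.

static Property kvm_apic_properties[] = {
    /* board code would set this from its KVM configuration at init time */
    DEFINE_PROP_UINT8("kvm-irqchip", APICState, kvm_irqchip, 0),
    DEFINE_PROP_END_OF_LIST(),
};

/* machine init, sketched:
 *     qdev_prop_set_uint8(dev, "kvm-irqchip", irqchip_in_kernel);
 * so the device itself never needs to see kvm_state. */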


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 20:27, Blue Swirl wrote:
 On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?


 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically for
 creating PCI devices that doesn't require a parent bus argument but 
 provides
 a way to specify stable addressing (for instancing, using a linear index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.

 There are currently no VCPU-specific bits remaining in kvm_state.
 
 I think fields vcpu_events, robust_singlestep, debugregs,
 kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
 same for all VCPUs but still they are sort of CPU properties. I'm not
 sure about fd field.

They are all properties of the currently loaded KVM subsystem in the
host kernel. They can't change while KVM's root fd is opened.
Replicating this static information into each and every VCPU state would
be crazy.

In fact, services like kvm_has_vcpu_events() already encode this: they
are static functions without any kvm_state reference that simply return
the content of those fields. Totally inconsistent to this, we force the
caller of kvm_check_extension to pass a handle. This is part of my
problem with the current situation and any halfhearted steps in this
context. Either we work towards eliminating static KVMState *kvm_state
in kvm-all.c or eliminating KVMState.
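
The inconsistency, sketched from the description above (following the kvm-all.c style of the time; the body of kvm_has_vcpu_events() is an assumption based on this email, not quoted code):

int kvm_has_vcpu_events(void)            /* no handle required ...        */
{
    return kvm_state->vcpu_events;       /* ... reads the static kvm_state */
}

int kvm_check_extension(KVMState *s, unsigned int ext);  /* handle forced */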

 
 It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over.
 
 This should probably contain only irqchip_in_kernel, pit_in_kernel and
 many_ioeventfds, maybe fd.

fd is that root file descriptor you need for a few KVM services that are
not bound to a specific VM - e.g. feature queries. It's not arch
specific. Arch specific are e.g. robust_singlestep or xsave feature states.

 
 It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.
 
 I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
 migration_log into the memory object.

vmfd is the VM-scope file descriptor you need at machine-level. The rest
logically belongs to a memory object, but I haven't looked at technical
details yet.

 
 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.
 
 If it is an opaque blob which contains various unrelated stuff, no
 clear place will be found.

We aren't moving fields yet (and we shouldn't). We first of all need to
establish the handle distribution (which apparently requires quite some
work in areas beyond KVM).

 
 By the way, we don't have a QEMUState but instead use globals. Perhaps
 this should be reorganized as well. For fd field, maybe even using a
 global variable could be justified, since it is used for direct access
 to kernel, not unlike a system call.

The fd field is part of this discussion. Making it global (but local to
the kvm subsystem) was an implicit part of my original suggestion.

I've no problem with something like a QEMUState, or better a
MachineState that would also include a few KVM-specific fields like the
vmfd - just like we already do for CPUstate (or should we better
introduce a KVM CPU bus... ;) ).

Jan





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 20:37, Anthony Liguori wrote:
 On 01/20/2011 03:33 AM, Jan Kiszka wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
   
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com  wrote:
 
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:
   
 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?

  
 It's almost arbitrary, but I would say it's the direction that I/Os
 flow.

 But if the underlying observation is that the device tree is not
 really a
 tree, you're 100% correct.  This is part of why a factory interface
 that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is
 specifically for
 creating PCI devices that doesn't require a parent bus argument but
 provides
 a way to specify stable addressing (for instancing, using a linear
 index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.
  
 There are currently no VCPU-specific bits remaining in kvm_state. It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over. It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.

 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.

 
 The debate is really:
 
 1) should we remove all passing of kvm_state and just assume it's static
 
 2) deal with a couple places in the code where we need to figure out how
 to get at kvm_state
 
 I think we've only identified 1 real instance of (2) and it's resulted
 in some good discussions about how to model KVM devices vs. emulated
 devices.  Honestly, (1) just stinks.  I see absolutely no advantage to
 it at all.   In the very worst case scenario, the thing we need to do is
 just reference an extern variable in a few places.  That completely
 avoids all of the modelling discussions for now (while leaving for
 placeholder FIXMEs so the problem can be tackled later).

The PCI bus discussion is surely an interesting outcome, but now almost
completely off-topic to the original, way less critical issue (as we
were discussing internals).

 
 I don't understand the resistance here.

IMHO, most suggestions on the table are still over-designed (like a
KVMBus that only passes a kvm_state - or do you have more features for
it in mind?). The idea I love most so far is establishing a machine
state that also carries those few KVM bits which correspond to the KVM
extension of CPUState.

But in the end I want an implementable consensus that helps moving
forward with main topic: the overdue KVM upstream merge. I just do not
have a clear picture yet.

Jan





[RFC -v6 PATCH 1/8] sched: check the right -nr_running in yield_task_fair

2011-01-20 Thread Rik van Riel
With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
Yielding to a task from another cfs_rq may be worthwhile, since
a process calling yield typically cannot use the CPU right now.

Therefore, we want to check the per-cpu nr_running, not the
cgroup local one.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index c62ebae..7b338ac 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1304,7 +1304,7 @@ static void yield_task_fair(struct rq *rq)
 	/*
 	 * Are we the only task in the tree?
 	 */
-	if (unlikely(cfs_rq->nr_running == 1))
+	if (unlikely(rq->nr_running == 1))
 		return;
 
 	clear_buddies(cfs_rq, se);



[RFC -v6 PATCH 2/8] sched: limit the scope of clear_buddies

2011-01-20 Thread Rik van Riel
The clear_buddies function does not seem to play well with the concept
of hierarchical runqueues.  In the following tree, task groups are
represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

            (nl)
           /    \
      G(nl)      G
      /  \        \
    T(l)  T(n)     T

This situation can arise when a task is woken up T(n), and the previously
running task T(l) is marked last.

When clear_buddies is called from either T(l) or T(n), the next and last
buddies of the group G(nl) will be cleared.  This is not the desired
result, since we would like to be able to find the other type of buddy
in many cases.

This is especially a worry when implementing yield_task_fair through the
buddy system.

The fix is simple: only clear the buddy type that the task itself
is indicated to be.  As an added bonus, we stop walking up the tree
when the buddy has already been cleared or pointed elsewhere.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched_fair.c |   30 +++---
 1 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f4ee445..0321473 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -784,19 +784,35 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	__enqueue_entity(cfs_rq, se);
 }
 
-static void __clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static void __clear_buddies_last(struct sched_entity *se)
 {
-	if (!se || cfs_rq->last == se)
-		cfs_rq->last = NULL;
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->last == se)
+			cfs_rq->last = NULL;
+		else
+			break;
+	}
+}
 
-	if (!se || cfs_rq->next == se)
-		cfs_rq->next = NULL;
+static void __clear_buddies_next(struct sched_entity *se)
+{
+	for_each_sched_entity(se) {
+		struct cfs_rq *cfs_rq = cfs_rq_of(se);
+		if (cfs_rq->next == se)
+			cfs_rq->next = NULL;
+		else
+			break;
+	}
 }
 
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	for_each_sched_entity(se)
-		__clear_buddies(cfs_rq_of(se), se);
+	if (cfs_rq->last == se)
+		__clear_buddies_last(se);
+
+	if (cfs_rq->next == se)
+		__clear_buddies_next(se);
 }
 
 static void



[RFC -v6 PATCH 3/8] sched: use a buddy to implement yield_task_fair

2011-01-20 Thread Rik van Riel
Use the buddy mechanism to implement yield_task_fair.  This
allows us to skip onto the next highest priority se at every
level in the CFS tree, unless doing so would introduce gross
unfairness in CPU time distribution.

We order the buddy selection in pick_next_entity to check
yield first, then last, then next.  We need next to be able
to override yield, because it is possible for the next and
yield task to be different processes in the same sub-tree
of the CFS tree.  When they are, we need to go into that
sub-tree regardless of the yield hint, and pick the correct
entity once we get to the right level.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..e4e57ff 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -327,7 +327,7 @@ struct cfs_rq {
 * 'curr' points to currently running entity on this cfs_rq.
 * It is set to NULL otherwise (i.e when none are currently running).
 */
-   struct sched_entity *curr, *next, *last;
+   struct sched_entity *curr, *next, *last, *yield;
 
unsigned int nr_spread_over;
 
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index ad946fd..f701a51 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -384,6 +384,22 @@ static struct sched_entity *__pick_next_entity(struct cfs_rq *cfs_rq)
     return rb_entry(left, struct sched_entity, run_node);
 }
 
+static struct sched_entity *__pick_second_entity(struct cfs_rq *cfs_rq)
+{
+    struct rb_node *left = cfs_rq->rb_leftmost;
+    struct rb_node *second;
+
+    if (!left)
+        return NULL;
+
+    second = rb_next(left);
+
+    if (!second)
+        second = left;
+
+    return rb_entry(second, struct sched_entity, run_node);
+}
+
 static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 {
    struct rb_node *last = rb_last(&cfs_rq->tasks_timeline);
@@ -806,6 +822,17 @@ static void __clear_buddies_next(struct sched_entity *se)
}
 }
 
+static void __clear_buddies_yield(struct sched_entity *se)
+{
+   for_each_sched_entity(se) {
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+        if (cfs_rq->yield == se)
+            cfs_rq->yield = NULL;
+   else
+   break;
+   }
+}
+
 static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
if (cfs_rq-last == se)
@@ -813,6 +840,9 @@ static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
     if (cfs_rq->next == se)
         __clear_buddies_next(se);
+
+    if (cfs_rq->yield == se)
+        __clear_buddies_yield(se);
 }
 
 static void
@@ -926,13 +956,27 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 static int
 wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
 
+/*
+ * Pick the next process, keeping these things in mind, in this order:
+ * 1) keep things fair between processes/task groups
+ * 2) pick the next process, since someone really wants that to run
+ * 3) pick the last process, for cache locality
+ * 4) do not run the yield process, if something else is available
+ */
 static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
 {
struct sched_entity *se = __pick_next_entity(cfs_rq);
struct sched_entity *left = se;
 
-    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
-        se = cfs_rq->next;
+    /*
+     * Avoid running the yield buddy, if running something else can
+     * be done without getting too unfair.
+     */
+    if (cfs_rq->yield == se) {
+        struct sched_entity *second = __pick_second_entity(cfs_rq);
+        if (wakeup_preempt_entity(second, left) < 1)
+            se = second;
+    }
 
     /*
      * Prefer last buddy, try to return the CPU to a preempted task.
@@ -940,6 +984,12 @@ static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
     if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
         se = cfs_rq->last;
 
+    /*
+     * Someone really wants this to run. If it's not unfair, run it.
+     */
+    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
+        se = cfs_rq->next;
+
     clear_buddies(cfs_rq, se);
 
     return se;
@@ -1096,52 +1146,6 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
hrtick_update(rq);
 }
 
-/*
- * sched_yield() support is very simple - we dequeue and enqueue.
- *
- * If compat_yield is turned on then we requeue to the end of the tree.
- */
-static void yield_task_fair(struct rq *rq)
-{
-    struct task_struct *curr = rq->curr;
-    struct cfs_rq *cfs_rq = task_cfs_rq(curr);
-    struct sched_entity *rightmost, *se = &curr->se;
-
-    /*
-     * Are we the only task in the tree?
-     */
-    if (unlikely(rq->nr_running == 1))
-   return;
-
-   

[RFC -v6 PATCH 4/8] sched: Add yield_to(task, preempt) functionality

2011-01-20 Thread Rik van Riel
From: Mike Galbraith efa...@gmx.de

Currently only implemented for fair class tasks.

Add a yield_to_task() method to the fair scheduling class, allowing the
caller of yield_to() to accelerate another thread in its thread group or
task group.

Implemented via a scheduler hint, using cfs_rq->next to encourage the
target to be selected.  We can rely on pick_next_entity to keep things
fair, so no one can accelerate a thread that has already used its fair
share of CPU time.

This also means callers should only call yield_to when they really
mean it.  Calling it too often can result in the scheduler just
ignoring the hint.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com
Signed-off-by: Mike Galbraith efa...@gmx.de

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..6c43fc4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1047,6 +1047,7 @@ struct sched_class {
void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
void (*yield_task) (struct rq *rq);
+    bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);
 
     void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);
 
@@ -1943,6 +1944,7 @@ static inline int rt_mutex_getprio(struct task_struct *p)
 # define rt_mutex_adjust_pi(p) do { } while (0)
 #endif
 
+extern bool yield_to(struct task_struct *p, bool preempt);
 extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
diff --git a/kernel/sched.c b/kernel/sched.c
index e4e57ff..1f38ed2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5270,6 +5270,64 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/**
+ * yield_to - yield the current processor to another thread in
+ * your thread group, or accelerate that thread toward the
+ * processor it's on.
+ *
+ * It's the caller's job to ensure that the target task struct
+ * can't go away on us before we can do any checks.
+ *
+ * Returns true if we indeed boosted the target task.
+ */
+bool __sched yield_to(struct task_struct *p, bool preempt)
+{
+   struct task_struct *curr = current;
+   struct rq *rq, *p_rq;
+   unsigned long flags;
+   bool yielded = 0;
+
+   local_irq_save(flags);
+   rq = this_rq();
+
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   while (task_rq(p) != p_rq) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+    if (!curr->sched_class->yield_to_task)
+        goto out;
+
+    if (curr->sched_class != p->sched_class)
+        goto out;
+
+    if (task_running(p_rq, p) || p->state)
+        goto out;
+
+    if (!same_thread_group(p, curr))
+        goto out;
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+    if (task_group(p) != task_group(curr))
+        goto out;
+#endif
+
+    yielded = curr->sched_class->yield_to_task(rq, p, preempt);
+
+out:
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+
+   if (yielded)
+   yield();
+
+   return yielded;
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /*
  * This task is about to go to sleep on IO. Increment rq->nr_iowait so
  * that process accounting knows that this is a task in IO wait state.
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index f701a51..097e936 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1800,6 +1800,23 @@ static void yield_task_fair(struct rq *rq)
set_yield_buddy(se);
 }
 
+static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preempt)
+{
+    struct sched_entity *se = &p->se;
+
+    if (!se->on_rq)
+        return false;
+
+    /* Tell the scheduler that we'd really like pse to run next. */
+    set_next_buddy(se);
+
+    /* Make p's CPU reschedule; pick_next_entity takes care of fairness. */
+    if (preempt)
+        resched_task(rq->curr);
+
+   return true;
+}
+
 #ifdef CONFIG_SMP
 /**
  * Fair scheduling class load-balancing methods:
@@ -3993,6 +4010,7 @@ static const struct sched_class fair_sched_class = {
.enqueue_task   = enqueue_task_fair,
.dequeue_task   = dequeue_task_fair,
.yield_task = yield_task_fair,
+   .yield_to_task  = yield_to_task_fair,
 
.check_preempt_curr = check_preempt_wakeup,
 



[RFC -v6 PATCH 8/8] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2011-01-20 Thread Rik van Riel
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to get
another VCPU in the same KVM guest to run sooner.

This seems to give a 10-15% speedup in certain workloads, versus
not having PLE at all.

Signed-off-by: Rik van Riel r...@redhat.com
Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d56ed5..fab2250 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -187,6 +187,7 @@ struct kvm {
 #endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
+   int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 86c4905..8b761ba 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1292,18 +1292,55 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-    ktime_t expires;
-    DEFINE_WAIT(wait);
-
-    prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
-
-    /* Sleep for 100 us, and hope lock-holder got scheduled */
-    expires = ktime_add_ns(ktime_get(), 100000UL);
-    schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+    struct kvm *kvm = me->kvm;
+    struct kvm_vcpu *vcpu;
+    int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+    int yielded = 0;
+    int pass;
+    int i;
 
-    finish_wait(&vcpu->wq, &wait);
+    /*
+     * We boost the priority of a VCPU that is runnable but not
+     * currently running, because it got preempted by something
+     * else and called schedule in __vcpu_run.  Hopefully that
+     * VCPU is holding the lock that we need and will release it.
+     * We approximate round-robin by starting at the last boosted VCPU.
+     */
+    for (pass = 0; pass < 2 && !yielded; pass++) {
+        kvm_for_each_vcpu(i, vcpu, kvm) {
+            struct task_struct *task = NULL;
+            struct pid *pid;
+            if (!pass && i < last_boosted_vcpu) {
+                i = last_boosted_vcpu;
+                continue;
+            } else if (pass && i > last_boosted_vcpu)
+                break;
+            if (vcpu == me)
+                continue;
+            if (waitqueue_active(&vcpu->wq))
+                continue;
+            rcu_read_lock();
+            pid = rcu_dereference(vcpu->pid);
+            if (pid)
+                task = get_pid_task(vcpu->pid, PIDTYPE_PID);
+            rcu_read_unlock();
+            if (!task)
+                continue;
+            if (task->flags & PF_VCPU) {
+                put_task_struct(task);
+                continue;
+            }
+            if (yield_to(task, 1)) {
+                put_task_struct(task);
+                kvm->last_boosted_vcpu = i;
+                yielded = 1;
+                break;
+            }
+            put_task_struct(task);
+        }
+    }
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
 



[RFC -v6 PATCH 7/8] kvm: keep track of which task is running a KVM vcpu

2011-01-20 Thread Rik van Riel
Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..9d56ed5 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
int vcpu_id;
struct mutex mutex;
+   struct pid *pid;
int   cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..86c4905 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -185,6 +185,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
     vcpu->cpu = -1;
     vcpu->kvm = kvm;
     vcpu->vcpu_id = id;
+    vcpu->pid = NULL;
     init_waitqueue_head(&vcpu->wq);
 
page = alloc_page(GFP_KERNEL | __GFP_ZERO);
@@ -208,6 +209,8 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_init);
 
 void kvm_vcpu_uninit(struct kvm_vcpu *vcpu)
 {
+    if (vcpu->pid)
+        put_pid(vcpu->pid);
     kvm_arch_vcpu_uninit(vcpu);
     free_page((unsigned long)vcpu->run);
 }
@@ -1456,6 +1459,14 @@ static long kvm_vcpu_ioctl(struct file *filp,
r = -EINVAL;
if (arg)
goto out;
+    if (unlikely(vcpu->pid != current->pids[PIDTYPE_PID].pid)) {
+        /* The thread running this VCPU changed. */
+        struct pid *oldpid = vcpu->pid;
+        struct pid *newpid = get_task_pid(current, PIDTYPE_PID);
+        rcu_assign_pointer(vcpu->pid, newpid);
+        synchronize_rcu();
+        put_pid(oldpid);
+    }
     r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
break;
case KVM_GET_REGS: {



[RFC -v6 PATCH 0/8] directed yield for Pause Loop Exiting

2011-01-20 Thread Rik van Riel
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in e.g. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v6:
- implement yield_task_fair in a way that works with task groups,
  this allows me to actually get a performance improvement!
- fix another race Avi pointed out, the code should be good now
v5:
- fix the race condition Avi pointed out, by tracking vcpu-pid
- also allows us to yield to vcpu tasks that got preempted while in qemu
  userspace
v4:
- change to newer version of Mike Galbraith's yield_to implementation
- chainsaw out some code from Mike that looked like a great idea, but
  turned out to give weird interactions in practice
v3:
- more cleanups
- change to Mike Galbraith's yield_to implementation
- yield to spinning VCPUs, this seems to work better in some
  situations and has little downside potential
v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)


Benchmark results:

Two 4-CPU KVM guests are pinned to the same 4 physical CPUs.

One guest runs the AMQP performance test, the other guest runs
0, 2 or 4 infinite loops, for CPU overcommit factors of 1, 1.5
and 2 (4, 6 or 8 runnable VCPU threads sharing the 4 physical CPUs).

The AMQP perftest is run 30 times, with 8 and 16 threads.

8thr     no overcommit   1.5x overcommit   2x overcommit

no PLE   223801          135137            104951
PLE      224135          141105            118744

16thr    no overcommit   1.5x overcommit   2x overcommit

no PLE   222424          126175            105299
PLE      222534          138082            132945

Note: this is with the KVM guests NOT running inside cgroups.  There
seems to be a CPU load balancing issue with cgroup fair group scheduling,
which often results in one guest getting only 80% CPU time and the other
guest 320%.  That will have to be fixed to get meaningful results with
cgroups.

CPU time division between the AMQP guest and the infinite loop guest
were not exactly fair, but the guests got close to the same amount
of CPU time in each test run.

There is a substantial amount of randomness in CPU time division between
guests, but the performance improvement is consistent between multiple
runs.

-- 
All rights reversed.




[RFC -v6 PATCH 6/8] export pid symbols needed for kvm_vcpu_on_spin

2011-01-20 Thread Rik van Riel
Export the symbols required for a race-free kvm_vcpu_on_spin.

Signed-off-by: Rik van Riel r...@redhat.com

diff --git a/kernel/fork.c b/kernel/fork.c
index 3b159c5..adc8f47 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -191,6 +191,7 @@ void __put_task_struct(struct task_struct *tsk)
if (!profile_handoff_task(tsk))
free_task(tsk);
 }
+EXPORT_SYMBOL_GPL(__put_task_struct);
 
 /*
  * macro override instead of weak attribute alias, to workaround
diff --git a/kernel/pid.c b/kernel/pid.c
index 39b65b6..02f2212 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -435,6 +435,7 @@ struct pid *get_task_pid(struct task_struct *task, enum pid_type type)
rcu_read_unlock();
return pid;
 }
+EXPORT_SYMBOL_GPL(get_task_pid);
 
 struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
 {
@@ -446,6 +447,7 @@ struct task_struct *get_pid_task(struct pid *pid, enum pid_type type)
rcu_read_unlock();
return result;
 }
+EXPORT_SYMBOL_GPL(get_pid_task);
 
 struct pid *find_get_pid(pid_t nr)
 {



[RFC -v6 PATCH 5/8] sched: drop superfluous tests from yield_to

2011-01-20 Thread Rik van Riel
Fairness is enforced by pick_next_entity, so we can drop some
superfluous tests from yield_to.

Signed-off-by: Rik van Riel r...@redhat.com
---
 kernel/sched.c |8 
 1 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 1f38ed2..398eedf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5307,14 +5307,6 @@ again:
if (task_running(p_rq, p) || p->state)
goto out;
 
-   if (!same_thread_group(p, curr))
-   goto out;
-
-#ifdef CONFIG_FAIR_GROUP_SCHED
-   if (task_group(p) != task_group(curr))
-   goto out;
-#endif
-
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 
 out:


Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Blue Swirl
On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2011-01-20 20:27, Blue Swirl wrote:
 On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?


 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically 
 for
 creating PCI devices that doesn't require a parent bus argument but 
 provides
 a way to specify stable addressing (for instancing, using a linear index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.

 There are currently no VCPU-specific bits remaining in kvm_state.

 I think fields vcpu_events, robust_singlestep, debugregs,
 kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
 same for all VCPUs but still they are sort of CPU properties. I'm not
 sure about fd field.

 They are all properties of the currently loaded KVM subsystem in the
 host kernel. They can't change while KVM's root fd is opened.
 Replicating this static information into each and every VCPU state would
 be crazy.

Then each CPUX86State could have a pointer to a common structure.

 In fact, services like kvm_has_vcpu_events() already encode this: they
 are static functions without any kvm_state reference that simply return
 the content of those fields. Totally inconsistent to this, we force the
 caller of kvm_check_extension to pass a handle. This is part of my
 problem with the current situation and any halfhearted steps in this
 context. Either we work towards eliminating static KVMState *kvm_state
 in kvm-all.c or eliminating KVMState.

If the CPU related fields are accessible through CPUState, the handle
should be available.

 It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over.

 This should probably contain only irqchip_in_kernel, pit_in_kernel and
 many_ioeventfds, maybe fd.

 fd is that root file descriptor you need for a few KVM services that are
 not bound to a specific VM - e.g. feature queries. It's not arch
 specific. Arch specific are e.g. robust_singlestep or xsave feature states.

By arch you mean guest CPU architecture? They are not machine features.


 It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.

 I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
 migration_log into the memory object.

 vmfd is the VM-scope file descriptor you need at machine-level. The rest
 logically belongs to a memory object, but I haven't looked at technical
 details yet.


 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.

 If it is an opaque blob which contains various unrelated stuff, no
 clear place will be found.

 We aren't moving fields yet (and we shouldn't). We first of all need to
 establish the handle distribution (which apparently requires quite some
 work in areas beyond KVM).

But I think this is exactly the problem. If the handle is for the
current KVMState, you'll indeed need it in various places and passing
it around will be cumbersome. By moving the fields around, the
information should be available more naturally.

 By the way, we don't have a QEMUState but instead use globals. Perhaps
 this should be reorganized as well. For fd field, maybe even using a
 global variable could be justified, since it is used for direct access
 to kernel, not unlike a system call.

 The fd field is part of this discussion. Making it global (but local to
 the kvm subsystem) was an implicit part of my original suggestion.

 I've no problem with something like a QEMUState, or better a
 MachineState that would also include a few KVM-specific fields like the
 vmfd - just like we already do for CPUstate (or should we 

Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 21:02, Blue Swirl wrote:
 I think KVMState was designed to match KVM ioctl interface: all stuff
 that is needed for talking to KVM or received from KVM are there. But
 I think this shouldn't be a design driver.

Agreed. The nice cleanup would probably be the complete assimilation of
KVMState by something bigger of comparable scope.

If a machine was brought up with KVM support, every device that refers
to this machine (as it is supposed to become part of it) should be able
to use KVM services in order to accelerate its model.
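
As an illustrative sketch only (machine_of(), MachineState and its
fields are hypothetical names, not an existing QEMU API), that could
look like:

static void mydev_init(DeviceState *dev)
{
    MachineState *m = machine_of(dev);       /* hypothetical accessor */

    if (m->kvm_enabled)                      /* machine runs under KVM */
        mydev_setup_in_kernel(m->kvm_vm_fd); /* hypothetical KVM service */
}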

 
 If the only pieces of kvm_state that are needed by the devices are
 irqchip_in_kernel, pit_in_kernel and many_ioeventfds, the problem of
 passing kvm_state to devices becomes very different. Each of these are
 just single bits, affecting only a few devices. Perhaps they could be
 device properties which the board level sets when KVM is used?

Forget about the static capabilities for now. The core of kvm_state are
handles that enable you to use KVM services and maybe state fields that
have machine scope (unless we find more local homes like a memory
object). Those need to be accessible by the kvm layer when servicing
requests of components that are related to that very same machine.

Jan





Re: [Qemu-devel] [PATCH 28/35] kvm: x86: Introduce kvmclock device to save/restore its state

2011-01-20 Thread Jan Kiszka
On 2011-01-20 22:40, Blue Swirl wrote:
 On Thu, Jan 20, 2011 at 9:22 PM, Jan Kiszka jan.kis...@web.de wrote:
 On 2011-01-20 20:27, Blue Swirl wrote:
 On Thu, Jan 20, 2011 at 9:33 AM, Jan Kiszka jan.kis...@siemens.com wrote:
 On 2011-01-19 20:32, Blue Swirl wrote:
 On Wed, Jan 19, 2011 at 4:57 PM, Anthony Liguori
 aligu...@linux.vnet.ibm.com wrote:
 On 01/19/2011 07:15 AM, Markus Armbruster wrote:

 So they interact with KVM (need kvm_state), and they interact with the
 emulated PCI bus.  Could you elaborate on the fundamental difference
 between the two interactions that makes you choose the (hypothetical)
 KVM bus over the PCI bus as device parent?


 It's almost arbitrary, but I would say it's the direction that I/Os flow.

 But if the underlying observation is that the device tree is not really a
 tree, you're 100% correct.  This is part of why a factory interface that
 just takes a parent bus is too simplistic.

 I think we ought to introduce a -pci-device option that is specifically 
 for
 creating PCI devices that doesn't require a parent bus argument but 
 provides
 a way to specify stable addressing (for instancing, using a linear 
 index).

 I think kvm_state should not be a property of any device or bus. It
 should be split to more logical pieces.

 Some parts of it could remain in CPUState, because they are associated
 with a VCPU.

 Also, for example irqfd could be considered to be similar object to
 char or block devices provided by QEMU to devices. Would it make sense
 to introduce new host types for passing parts of kvm_state to devices?

 I'd also make coalesced MMIO stuff part of memory object. We are not
 passing any state references when using cpu_physical_memory_rw(), but
 that could be changed.

 There are currently no VCPU-specific bits remaining in kvm_state.

 I think fields vcpu_events, robust_singlestep, debugregs,
 kvm_sw_breakpoints, xsave, xcrs belong to CPUX86State. They may be the
 same for all VCPUs but still they are sort of CPU properties. I'm not
 sure about fd field.

 They are all properties of the currently loaded KVM subsystem in the
 host kernel. They can't change while KVM's root fd is opened.
 Replicating this static information into each and every VCPU state would
 be crazy.
 
 Then each CPUX86State could have a pointer to a common structure.

That already exists.

 
 In fact, services like kvm_has_vcpu_events() already encode this: they
 are static functions without any kvm_state reference that simply return
 the content of those fields. Totally inconsistent to this, we force the
 caller of kvm_check_extension to pass a handle. This is part of my
 problem with the current situation and any halfhearted steps in this
 context. Either we work towards eliminating static KVMState *kvm_state
 in kvm-all.c or eliminating KVMState.
 
 If the CPU related fields are accessible through CPUState, the handle
 should be available.
 
 It may
 be a good idea to introduce an arch-specific kvm_state and move related
 bits over.

 This should probably contain only irqchip_in_kernel, pit_in_kernel and
 many_ioeventfds, maybe fd.

 fd is that root file descriptor you need for a few KVM services that are
 not bound to a specific VM - e.g. feature queries. It's not arch
 specific. Arch specific are e.g. robust_singlestep or xsave feature states.
 
 By arch you mean guest CPU architecture? They are not machine features.

No, they are practically static host features.

 

 It may also once be feasible to carve out memory management
 related fields if we have proper abstractions for that, but I'm not
 completely sure here.

 I'd put slots, vmfd, coalesced_mmio, broken_set_mem_region,
 migration_log into the memory object.

 vmfd is the VM-scope file descriptor you need at machine-level. The rest
 logically belongs to a memory object, but I haven't looked at technical
 details yet.


 Anyway, all these things are secondary. The primary topic here is how to
 deal with kvm_state and its fields that have VM-global scope.

 If it is an opaque blob which contains various unrelated stuff, no
 clear place will be found.

 We aren't moving fields yet (and we shouldn't). We first of all need to
 establish the handle distribution (which apparently requires quite some
 work in areas beyond KVM).
 
 But I think this is exactly the problem. If the handle is for the
 current KVMState, you'll indeed need it in various places and passing
 it around will be cumbersome. By moving the fields around, the
 information should be available more naturally.

Yeah, if we had a MachineState or if we could agree on introducing it,
I'm with you again. Improving the currently cumbersome KVM API
interaction was the main motivation for my original patch.

Jan





Re: suspending in kvm and resuming in qemu

2011-01-20 Thread Marcelo Tosatti
On Tue, Jan 18, 2011 at 04:08:33AM -0800, Mehul Chadha wrote:
 Hi,
 
 I have been trying to get suspending and resuming done across kvm and qemu.
 While resuming a suspended state in kvm, an error was generated saying 
 "unknown section kvmclock". I modified loadvm in qemu to ignore the 
 kvmclock section.
 
 Now, when I suspend and resume, qemu screen just stalls at the suspended 
 state and it seems nothing is executing. 
 
 Please give any inputs or help: where should I start looking?
 
 Thanks,
 Mehul

The guest relies on kvmclock support, which is not implemented by qemu
alone. So you'd have to disable kvmclock support (with the no-kvmclock
kernel boot option) in the guest.
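
For illustration (assuming a grub-style boot loader in the guest; the
kernel version and root device below are placeholders), the option is
simply appended to the guest kernel command line:

    kernel /vmlinuz-2.6.37 ro root=/dev/vda1 no-kvmclock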



Re: [PATCH] KVM: x86: Remove user space triggerable MCE error message

2011-01-20 Thread Marcelo Tosatti
On Sat, Jan 15, 2011 at 10:00:53AM +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 This case is a pure user space error we do not need to record. Moreover,
 it can be misused to flood the kernel log. Remove it.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com

Applied, thanks.



Re: [PATCH v2] device-assignment: Properly terminate vmsd.fields

2011-01-20 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 10:17:49AM -0700, Alex Williamson wrote:
 The vmsd code expects the fields structure to be properly terminated,
 not NULL.  An assigned device should never be saved or restored, and
 recent qemu fixes to the no_migrate flag should ensure this, but let's
 avoid setting the wrong precedent.
 
 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---
 
 v2:
  - Change to commit log only, was device-assignment: Add fields to
VMStateDescription.  Recent qemu.git no_migrate changes avoid
potential segfault, but this should still be applied for correctness.
 
  hw/device-assignment.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)

Applied, thanks.



Re: [PATCH] virtio-pci: mask notifier error handling fixups

2011-01-20 Thread Marcelo Tosatti
On Tue, Jan 18, 2011 at 03:42:57PM +0200, Michael S. Tsirkin wrote:
 Fix virtio-pci error handling in the mask notifiers: be careful to undo
 exactly what we did so far.
 
 Reported-by: Alex Williamson alex.william...@redhat.com
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
 
 This is for qemu-kvm only.
 
  hw/virtio-pci.c |   12 
  1 files changed, 12 insertions(+), 0 deletions(-)

Applied, thanks.



Re: [PATCH] KVM: fix build warning within __kvm_set_memory_region() on s390

2011-01-20 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 09:21:08PM +0100, Heiko Carstens wrote:
 From: Heiko Carstens heiko.carst...@de.ibm.com
 
 Get rid of this warning:
 
   CC  arch/s390/kvm/../../../virt/kvm/kvm_main.o
 arch/s390/kvm/../../../virt/kvm/kvm_main.c:596:12: warning: 
 'kvm_create_dirty_bitmap' defined but not used
 
 The only caller of the function is within a !CONFIG_S390 section, so add the
 same ifdef around kvm_create_dirty_bitmap() as well.
 
 Signed-off-by: Heiko Carstens heiko.carst...@de.ibm.com

Applied, thanks.



Re: [PATCH 1/2] mm, Make __get_user_pages return -EHWPOISON for HWPOISON page optionally

2011-01-20 Thread Marcelo Tosatti
On Mon, Jan 17, 2011 at 08:47:39AM +0800, Huang Ying wrote:
 Hi, Andrew,
 
 On Sun, 2011-01-16 at 23:35 +0800, Avi Kivity wrote:
  On 01/14/2011 03:37 AM, Huang Ying wrote:
   On Thu, 2011-01-13 at 18:43 +0800, Avi Kivity wrote:
 On 01/13/2011 10:42 AM, Huang Ying wrote:
   Make __get_user_pages return -EHWPOISON for HWPOISON page only if
   FOLL_HWPOISON is specified.  With this patch, the interested callers
   can distinguish HWPOISON page from general FAULT page, while other
   callers will still get -EFAULT for pages, so the user space 
interface
   need not to be changed.
 
   get_user_pages_hwpoison is added as a variant of get_user_pages that
   can return -EHWPOISON for HWPOISON page.
 
   This feature is needed by KVM, where UCR MCE should be relayed to
   guest for HWPOISON page, while instruction emulation and MMIO will 
be
   tried for general FAULT page.
 
   The idea comes from Andrew Morton.
 
   Signed-off-by: Huang Ying ying.hu...@intel.com
   Cc: Andrew Morton a...@linux-foundation.org
 
   +#ifdef CONFIG_MEMORY_FAILURE
   +int get_user_pages_hwpoison(struct task_struct *tsk, struct 
mm_struct *mm,
   +   unsigned long start, int nr_pages, int 
write,
   +   int force, struct page **pages,
   +   struct vm_area_struct **vmas);
   +#else
   
 Since we'd also like to add get_user_pages_noio(), perhaps adding a
 flags field to get_user_pages() is better.
  
   Or export __get_user_pages()?
  
  That's better, yes.
 
 Do you think it is a good idea to export __get_user_pages() instead of
 adding get_user_pages_noio() and get_user_pages_hwpoison()?

Better to let Andrew and/or Linus decide it.



Re: [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Sridhar Samudrala
On Thu, 2011-01-20 at 19:47 +0200, Michael S. Tsirkin wrote:
 On Thu, Jan 20, 2011 at 08:31:53AM -0800, Sridhar Samudrala wrote:
  On Thu, 2011-01-20 at 17:35 +0200, Michael S. Tsirkin wrote:
   When MSI is off, each interrupt needs to be bounced through the io
   thread when it's set/cleared, so vhost-net causes more context switches 
   and
   higher CPU utilization than userspace virtio which handles networking in
   the same thread.
   
   We'll need to fix this by adding level irq support in kvm irqfd,
   for now disable vhost-net in these configurations.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   ---
   
   I need to report some error from virtio-pci
   that would be handled specially (disable but don't
   report an error) so I wanted one that's never likely to be used by a
   userspace ioctl. I selected ERANGE but it'd
   be easy to switch to something else. Comments?
  
  Should this error be EVHOST_DISABLED rather than EVIRTIO_DISABLED?
  
  -Sridhar
 
 The error is reported by virtio-pci which does not know about vhost.
 I started with EVIRTIO_MSIX_DISABLED and made it shorter.
 Would EVIRTIO_MSIX_DISABLED be better?

I think so. This makes it more clear.
-Sridhar
 
   
hw/vhost.c  |4 +++-
hw/virtio-net.c |6 --
hw/virtio-pci.c |3 +++
hw/virtio.h |2 ++
4 files changed, 12 insertions(+), 3 deletions(-)
   
   diff --git a/hw/vhost.c b/hw/vhost.c
   index 1d09ed0..c79765a 100644
   --- a/hw/vhost.c
   +++ b/hw/vhost.c
   @@ -649,7 +649,9 @@ int vhost_dev_start(struct vhost_dev *hdev, VirtIODevice *vdev)
 
        r = vdev->binding->set_guest_notifiers(vdev->binding_opaque, true);
        if (r < 0) {
   -        fprintf(stderr, "Error binding guest notifier: %d\n", -r);
   +        if (r != -EVIRTIO_DISABLED) {
   +            fprintf(stderr, "Error binding guest notifier: %d\n", -r);
   +        }
            goto fail_notifiers;
        }

   diff --git a/hw/virtio-net.c b/hw/virtio-net.c
   index ccb3e63..5de3fee 100644
   --- a/hw/virtio-net.c
   +++ b/hw/virtio-net.c
   @@ -121,8 +121,10 @@ static void virtio_net_vhost_status(VirtIONet *n, uint8_t status)
        if (!n->vhost_started) {
            int r = vhost_net_start(tap_get_vhost_net(n->nic->nc.peer), &n->vdev);
            if (r < 0) {
   -            error_report("unable to start vhost net: %d: "
   -                         " falling back on userspace virtio", -r);
   +            if (r != -EVIRTIO_DISABLED) {
   +                error_report("unable to start vhost net: %d: "
   +                             " falling back on userspace virtio", -r);
   +            }
            } else {
                n->vhost_started = 1;
            }
   diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
   index dd8887a..dbf4be0 100644
   --- a/hw/virtio-pci.c
   +++ b/hw/virtio-pci.c
   @@ -628,6 +628,9 @@ static int virtio_pci_set_guest_notifier(void *opaque, int n, bool assign)
        EventNotifier *notifier = virtio_queue_get_guest_notifier(vq);
 
        if (assign) {
   +        if (!msix_enabled(&proxy->pci_dev)) {
   +            return -EVIRTIO_DISABLED;
   +        }
            int r = event_notifier_init(notifier, 0);
            if (r < 0) {
                return r;
   diff --git a/hw/virtio.h b/hw/virtio.h
   index d8546d5..53bbdba 100644
   --- a/hw/virtio.h
   +++ b/hw/virtio.h
   @@ -98,6 +98,8 @@ typedef struct {
void (*vmstate_change)(void * opaque, bool running);
} VirtIOBindings;

   +#define EVIRTIO_DISABLED ERANGE
   +
#define VIRTIO_PCI_QUEUE_MAX 64

#define VIRTIO_NO_VECTOR 0xffff



Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Anthony Liguori

On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote:

On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
   

On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
 

When MSI is off, each interrupt needs to be bounced through the io
thread when it's set/cleared, so vhost-net causes more context switches and
higher CPU utilization than userspace virtio which handles networking in
the same thread.

We'll need to fix this by adding level irq support in kvm irqfd,
for now disable vhost-net in these configurations.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
   

I actually think this should be a terminal error.  The user asks for
vhost-net, if we cannot enable it, we should exit.

Or we should warn the user that they should expect bad performance.
Silently doing something that the user has explicitly asked us not
to do is not a good behavior.

Regards,

Anthony Liguori
 

The issue is that user has no control of the guest, and can not know
whether the guest enables MSI. So what you ask for will just make
some guests fail, and others fail sometimes.
The user also has no way to know that version X of kvm does not expose a
way to inject level interrupts with irqfd.

We could have *another* flag that says use vhost where it helps but
then I think this is what everyone wants to do, anyway, and libvirt
already sets vhost=on so I prefer redefining the meaning of an existing
flag.
   


In the very least, there needs to be a vhost=force.

Having some sort of friendly default policy is fine but we need to 
provide a mechanism for a user to have the final say.  If you want to 
redefine vhost=on to really mean "use the friendly default", that's fine 
by me, but only if the vhost=force option exists.


I actually would think libvirt would want to use vhost=force.  Debugging 
with vhost=on is going to be a royal pain in the ass if a user reports 
bad performance.  Given the libvirt XML, you can't actually tell from 
the guest and the XML whether or not vhost was actually in use or not.
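
For illustration, the interface fragment in question looks roughly like
the following (attribute names approximate and version-dependent; note
it records only what was requested, not whether vhost actually engaged):

  <interface type='network'>
    <source network='default'/>
    <model type='virtio'/>
    <driver name='vhost'/>
  </interface>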


Regards,

Anthony Liguori


Maybe this is best handled by a documentation update?

We always said:
use vhost=on to enable experimental in kernel 
accelerator\n

note 'enable' not 'require'. This is similar to how we specify
nvectors : you can not make guest use the feature.

How about this:

diff --git a/qemu-options.hx b/qemu-options.hx
index 898561d..3c937c1 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -1061,6 +1061,7 @@ DEF(net, HAS_ARG, QEMU_OPTION_net,
     "use vnet_hdr=off to avoid enabling the IFF_VNET_HDR tap flag\n"
     "use vnet_hdr=on to make the lack of IFF_VNET_HDR support an error condition\n"
     "use vhost=on to enable experimental in kernel accelerator\n"
+    "(note: vhost=on has no effect unless guest uses MSI-X)\n"
     "use 'vhostfd=h' to connect to an already opened vhost net device\n"
 #endif
 "-net socket[,vlan=n][,name=str][,fd=h][,listen=[host]:port][,connect=host:port]\n"
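
For reference, a typical invocation that requests vhost with the current
command-line syntax (memory size and disk image are placeholders) is:

    qemu-kvm -m 1024 -drive file=guest.img \
             -netdev tap,id=hostnet0,vhost=on \
             -device virtio-net-pci,netdev=hostnet0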


   




Re: [Qemu-devel] [PATCH] vhost: force vhost off for non-MSI guests

2011-01-20 Thread Alex Williamson
On Thu, 2011-01-20 at 18:23 -0600, Anthony Liguori wrote:
 On 01/20/2011 10:07 AM, Michael S. Tsirkin wrote:
  On Thu, Jan 20, 2011 at 09:43:57AM -0600, Anthony Liguori wrote:
 
  On 01/20/2011 09:35 AM, Michael S. Tsirkin wrote:
   
  When MSI is off, each interrupt needs to be bounced through the io
  thread when it's set/cleared, so vhost-net causes more context switches 
  and
  higher CPU utilization than userspace virtio which handles networking in
  the same thread.
 
  We'll need to fix this by adding level irq support in kvm irqfd,
  for now disable vhost-net in these configurations.
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
  I actually think this should be a terminal error.  The user asks for
  vhost-net, if we cannot enable it, we should exit.
 
  Or we should warn the user that they should expect bad performance.
  Silently doing something that the user has explicitly asked us not
  to do is not a good behavior.
 
  Regards,
 
  Anthony Liguori
   
  The issue is that user has no control of the guest, and can not know
  whether the guest enables MSI. So what you ask for will just make
  some guests fail, and others fail sometimes.
  The user also has no way to know that version X of kvm does not expose a
  way to inject level interrupts with irqfd.
 
  We could have *another* flag that says use vhost where it helps but
  then I think this is what everyone wants to do, anyway, and libvirt
  already sets vhost=on so I prefer redefining the meaning of an existing
  flag.
 
 
 In the very least, there needs to be a vhost=force.
 
 Having some sort of friendly default policy is fine but we need to 
 provide a mechanism for a user to have the final say.  If you want to 
 redefine vhost=on to really mean, use the friendly default, that's fine 
 by me, but only if the vhost=force option exists.
 
 I actually would think libvirt would want to use vhost=force.  Debugging 
 with vhost=on is going to be a royal pain in the ass if a user reports 
 bad performance.  Given the libvirt XML, you can't actually tell from 
 the guest and the XML whether or not vhost was actually in use or not.

If we add a force option, let's please distinguish hotplug from VM
creation time.  The latter can abort.  Hotplug should print an error and
fail the initfn.  Thanks,

Alex



Re: Flow Control and Port Mirroring Revisited

2011-01-20 Thread Simon Horman
[ Trimmed Eric from CC list as vger was complaining that it is too long ]

On Tue, Jan 18, 2011 at 11:41:22AM -0800, Rick Jones wrote:
 So it won't be all that simple to implement well, and before we try,
 I'd like to know whether there are applications that are helped
 by it. For example, we could try to measure latency at various
 pps and see whether the backpressure helps. netperf has -b, -w
 flags which might help these measurements.
 
 Those options are enabled when one adds --enable-burst to the
 pre-compilation ./configure  of netperf (one doesn't have to
 recompile netserver).  However, if one is also looking at latency
 statistics via the -j option in the top-of-trunk, or simply at the
 histogram with --enable-histogram on the ./configure and a verbosity
 level of 2 (global -v 2) then one wants the very top of trunk
 netperf from:

Hi,

I have constructed a test where I run an un-paced UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly, I get the following results from the omni test:

1. Omni test only:  MEAN_LATENCY=272.00
2. Omni and stream test:MEAN_LATENCY=3423.00
3. cpu and net_cls group:   MEAN_LATENCY=493.00
   As per 2 plus cgroups are created for each guest
   and guest tasks are added to the groups
4. 100Mbit/s class: MEAN_LATENCY=273.00
   As per 3 plus the net_cls groups each have a 100MBit/s HTB class
5. cpu.shares=128:  MEAN_LATENCY=652.00
   As per 4 plus the cpu groups have cpu.shares set to 128
6. Busy CPUS:   MEAN_LATENCY=15126.00
   As per 5 but the CPUs are made busy using a simple shell while loop
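   (For example, one spinner per CPU, assuming a POSIX shell:
    while :; do :; done & )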

There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment.

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr 
-k foo -b 1 -w 200 -m 200

foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL



Re: Flow Control and Port Mirroring Revisited

2011-01-20 Thread Rick Jones

Simon Horman wrote:

[ Trimmed Eric from CC list as vger was complaining that it is too long ]
...
I have constructed a test where I run an un-paced UDP_STREAM test in
one guest and a paced omni rr test in another guest at the same time.
Briefly, I get the following results from the omni test:

...



There is a bit of noise in the results as the two netperf invocations
aren't started at exactly the same moment

For reference, my netperf invocations are:
netperf -c -C -t UDP_STREAM -H 172.17.60.216 -l 12
netperf.omni -p 12866 -D -c -C -H 172.17.60.216 -t omni -j -v 2 -- -r 1 -d rr 
-k foo -b 1 -w 200 -m 200


Since the -b and -w are in the test-specific portion, this test was not actually 
 paced. The -w will have been ignored entirely (IIRC) and the -b will have 
attempted to set the burst size of a --enable-burst ./configured netperf.  If 
netperf was ./configured that way, it will have had two rr transactions in 
flight at one time - the regular one and then the one additional from the -b 
option.  If netperf was not ./configured with --enable-burst then a warning 
message should have been emitted.


Also, I am guessing you wanted TCP_NODELAY set, and that is -D but not a global 
-D.  I'm reasonably confident the -m 200 will have been ignored, but it would be 
best to drop it. So, I think your second line needs to be:


netperf.omni -p 12866 -c -C -H  172.17.60.216 -t omni -j -v 2 -b 1 -w 200 -- -r 
1 -d rr -k foo -D


If you want the request and response sizes to be 200 bytes, use -r 200 
(test-specific).


Also, if you ./configure with --enable-omni first, that netserver will 
understand both omni and non-omni tests at the same time and you don't have to 
have a second netserver on a different control port.  You can also go-in to 
config.h after the ./configure and unset WANT_MIGRATION and then UDP_STREAM in 
netperf will be the true classic UDP_STREAM code rather than the migrated to 
omni path.



foo contains
PROTOCOL
THROUGHPUT,THROUGHPUT_UNITS
LOCAL_SEND_THROUGHPUT
LOCAL_RECV_THROUGHPUT
REMOTE_SEND_THROUGHPUT
REMOTE_RECV_THROUGHPUT
RT_LATENCY,MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY
P50_LATENCY,P90_LATENCY,P99_LATENCY,STDDEV_LATENCY
LOCAL_CPU_UTIL,REMOTE_CPU_UTIL


As the -k file parsing option didn't care until recently (within the hour or 
so), I think it didn't matter that you had more than four lines (assuming that 
is a verbatim cat of foo).  However, if you pull the *current* top of trunk, it 
will probably start to care - I'm in the midst of adding support for direct 
output selection in the -k, -o and -O options and also cleaning-up the omni 
printing code to the point where there is only the one routine parsing the 
output selection file.  Currently that is the one for human output, which has 
a four line restriction.  I will try to make it smarter as I go.


happy benchmarking,

rick jones


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-01-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #7 from Marcelo Tosatti mtosa...@redhat.com  2011-01-21 03:27:36 
---
Nicolas,

My bad. Can you please try the following patch.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


[Bug 27052] Module KVM : unable to handle kernel NULL pointer dereference at

2011-01-20 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=27052





--- Comment #8 from Marcelo Tosatti mtosa...@redhat.com  2011-01-21 03:29:36 
---
Created an attachment (id=44552)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=44552)
update sp-gfns on pte update path

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


[PATCH] V2 Handle guest access to BBL_CR_CTL3 MSR

2011-01-20 Thread john cooper
[Resubmit of prior version which contained a wayward
patch hunk.  Thanks Marcelo]

A correction to Intel cpu model CPUID data (patch queued)
caused winxp to BSOD when booted with a Penryn model.
This was traced to the CPUID model field correction from
6 -> 23 (as is proper for a Penryn class of cpu).  Only in
this case does the problem surface.

The cause for this failure is winxp accessing the BBL_CR_CTL3
MSR which is unsupported by current kvm, appears to be a
legacy MSR not fully characterized yet existing in current
silicon, and is apparently carried forward in MSR space to
accommodate vintage code as here.  It is not yet conclusive
whether this MSR implements any of its legacy functionality
or is just an ornamental dud for compatibility.  While I
found no silicon version specific documentation link to
this MSR, a general description exists in Intel's developer's
reference which agrees with the functional behavior of
other bootloader/kernel code I've examined accessing
BBL_CR_CTL3.  Regrettably winxp appears to be setting bit #19
called out as reserved in the above document.

So to minimally accommodate this MSR, kvm msr get will provide
the equivalent mock data and kvm msr write will simply toss the
guest passed data without interpretation.  While this treatment
of BBL_CR_CTL3 addresses the immediate problem, the approach may
be modified pending clarification from Intel.

Signed-off-by: john cooper john.coo...@redhat.com
---

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 4d0dfa0..5bfafb6 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -38,6 +38,7 @@
 
 #define MSR_MTRRcap0x00fe
 #define MSR_IA32_BBL_CR_CTL0x0119
+#define MSR_IA32_BBL_CR_CTL3   0x011e
 
 #define MSR_IA32_SYSENTER_CS   0x0174
 #define MSR_IA32_SYSENTER_ESP  0x0175
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bcc0efc..04d6c55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1592,6 +1592,12 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
         } else
             return set_msr_hyperv(vcpu, msr, data);
         break;
+    case MSR_IA32_BBL_CR_CTL3:
+        /* Drop writes to this legacy MSR -- see rdmsr
+         * counterpart for further detail.
+         */
+        pr_unimpl(vcpu, "ignored wrmsr: 0x%x data %llx\n", msr, data);
+        break;
     default:
         if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
return xen_hvm_config(vcpu, data);
@@ -1846,6 +1852,19 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
} else
return get_msr_hyperv(vcpu, msr, pdata);
break;
+   case MSR_IA32_BBL_CR_CTL3:
+   /* This legacy MSR exists but isn't fully documented in current
+* silicon.  It is however accessed by winxp in very narrow
+* scenarios where it sets bit #19, itself documented as
+* a reserved bit.  Best effort attempt to source coherent
+* read data here should the balance of the register be
+* interpreted by the guest:
+*
+* L2 cache control register 3: 64GB range, 256KB size,
+* enabled, latency 0x1, configured
+*/ 
+   data = 0xbe702111;
+   break;
default:
if (!ignore_msrs) {
pr_unimpl(vcpu, "unhandled rdmsr: 0x%x\n", msr);


-- 
john.coo...@redhat.com


Re: [REPOST] [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 08:49:27]:

 On Thu, 20 Jan 2011, Balbir Singh wrote:
 
  --- a/include/linux/swap.h
  +++ b/include/linux/swap.h
  @@ -253,11 +253,11 @@ extern int vm_swappiness;
   extern int remove_mapping(struct address_space *mapping, struct page *page);
   extern long vm_total_pages;
 
  +extern int sysctl_min_unmapped_ratio;
  +extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   #ifdef CONFIG_NUMA
   extern int zone_reclaim_mode;
  -extern int sysctl_min_unmapped_ratio;
   extern int sysctl_min_slab_ratio;
  -extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
   #else
   #define zone_reclaim_mode 0
 
 So the end result of this patch is that zone reclaim is compiled
 into vmscan.o even on !NUMA configurations, but since zone_reclaim_mode ==
 0 no one can ever call that code?


The third patch fixes this with the introduction of a config option
(cut-copy-pasted below). If someone were to bisect to this point, what
you say is correct.

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
 extern int sysctl_min_unmapped_ratio;
 extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
 #else
-#define zone_reclaim_mode 0
 static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
 {
return 0;
 }
 #endif

Thanks for the review! 

-- 
Three Cheers,
Balbir


Re: [REPOST] [PATCH 2/3] Refactor zone_reclaim code (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 08:50:40]:

 
 Reviewed-by: Christoph Lameter c...@linux.com


Thanks for the review! 

-- 
Three Cheers,
Balbir


Re: [REPOST] [PATCH 3/3] Provide control over unmapped pages (v3)

2011-01-20 Thread Balbir Singh
* Christoph Lameter c...@linux.com [2011-01-20 09:00:09]:

 On Thu, 20 Jan 2011, Balbir Singh wrote:
 
  +   unmapped_page_control
  +   [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
  +   is enabled. It controls the amount of unmapped memory
  +   that is present in the system. This boot option plus
  +   vm.min_unmapped_ratio (sysctl) provide granular control
 
 min_unmapped_ratio is there to guarantee that zone reclaim does not
 reclaim all unmapped pages.
 
 What you want here is a max_unmapped_ratio.


I thought about that; the logic for reusing min_unmapped_ratio was to
keep a limit beyond which unmapped page cache shrinking should stop.
I think you are suggesting max_unmapped_ratio as the point at which
shrinking should begin, right?
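
A rough sketch of that two-threshold reading, just to pin down the
semantics (the struct and helper names here are made up for
illustration and are not part of the posted patches):

#include <stdbool.h>

/* Toy model of a two-watermark unmapped page cache policy. */
struct zone_sketch {
	unsigned long unmapped_pages;     /* current unmapped page cache */
	unsigned long max_unmapped_pages; /* reclaim starts above this   */
	unsigned long min_unmapped_pages; /* reclaim stops at this       */
};

/* Begin shrinking once unmapped page cache exceeds the max mark. */
static bool should_reclaim_unmapped(const struct zone_sketch *z)
{
	return z->unmapped_pages > z->max_unmapped_pages;
}

/* Reclaim down to, but not past, the min mark. */
static unsigned long unmapped_reclaim_target(const struct zone_sketch *z)
{
	if (z->unmapped_pages <= z->min_unmapped_pages)
		return 0;
	return z->unmapped_pages - z->min_unmapped_pages;
}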
 
 
   {
  @@ -2297,6 +2320,12 @@ loop_again:
   shrink_active_list(SWAP_CLUSTER_MAX, zone,
   &sc, priority, 0);
 
  +   /*
  +* We do unmapped page reclaim once here and once
  +* below, so that we don't lose out
  +*/
+   reclaim_unmapped_pages(priority, zone, &sc);
  +
  if (!zone_watermark_ok_safe(zone, order,
 
 Hmm. Okay, that means background reclaim does it. If so then we also want
 zone reclaim to be able to work in the background, I think.

Anything specific you had in mind? It works for me in testing, but is
there anything specific that stands out in your mind that needs to be
done?

Thanks for the review!
 

-- 
Three Cheers,
Balbir