[Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-12 Thread Avi Kivity
Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since
it doesn't have the required capabilities.

Fix by adding them.

Note: this may break installing as non-root.  This is actually the right
thing to do, since not setting up the capability would result in a broken
setup.  Perhaps we need a configure flag to disable helpers.

Signed-off-by: Avi Kivity a...@cloudius-systems.com
---
 Makefile  | 3 +++
 rules.mak | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/Makefile b/Makefile
index b15003f..af7748c 100644
--- a/Makefile
+++ b/Makefile
@@ -188,6 +188,7 @@ qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
 qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
 
+qemu-bridge-helper$(EXESUF).capabilities = cap_net_admin
 qemu-bridge-helper$(EXESUF): qemu-bridge-helper.o
 
 fsdev/virtfs-proxy-helper$(EXESUF): fsdev/virtfs-proxy-helper.o fsdev/virtio-9p-marshal.o libqemuutil.a libqemustub.a
@@ -345,6 +346,8 @@ endif
 ifneq ($(HELPERS-y),)
$(INSTALL_DIR) $(DESTDIR)$(libexecdir)
$(INSTALL_PROG) $(STRIP_OPT) $(HELPERS-y) $(DESTDIR)$(libexecdir)
+   $(SETCAP) $(foreach helper, $(HELPERS-y), \
+   $($(helper).capabilities)+pe $(DESTDIR)$(libexecdir)/$(helper))
 endif
 ifneq ($(BLOBS),)
set -e; for x in $(BLOBS); do \
diff --git a/rules.mak b/rules.mak
index 49edb9b..9194c79 100644
--- a/rules.mak
+++ b/rules.mak
@@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir $($(var))))))
 $(foreach var,$(nested-vars), $(eval \
   -include $(addsuffix *.d, $(sort $(dir $($(var)))))))
 endef
+
+SETCAP = setcap
\ No newline at end of file
-- 
1.8.3.1
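
For reference, with default paths the new install hook amounts to running
something like

    setcap cap_net_admin+pe /usr/local/libexec/qemu-bridge-helper

for each helper (illustrative path; the actual location follows
$(libexecdir)).  This is also why a non-root 'make install' can now fail at
this step.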




Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-12 Thread Avi Kivity
On Nov 12, 2013 9:23 PM, Eric Blake ebl...@redhat.com wrote:

 On 11/12/2013 04:10 AM, Avi Kivity wrote:
  Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper
since
  it doesn't have the required capabilities.
 
  Fix by adding them.
 
  Note: this may break installing as non-root.  This is actually the right
  thing to do, since not setting up the capability would result in a
broken
  setup.  Perhaps we need a configure flag to disable helpers.
 

  +++ b/rules.mak
  @@ -177,3 +177,5 @@ $(shell mkdir -p $(sort $(foreach var,$(nested-vars),$(dir $($(var))))))
   $(foreach var,$(nested-vars), $(eval \
 -include $(addsuffix *.d, $(sort $(dir $($(var)))))))
   endef
  +
  +SETCAP = setcap
  \ No newline at end of file

 Is this intentional?

No.


 --
 Eric Blake   eblake redhat com+1-919-301-3266
 Libvirt virtualization library http://libvirt.org



Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression

2013-11-14 Thread Avi Kivity
Michael S. Tsirkin mst at redhat.com writes:

 
 At the moment, memory radix tree is already variable width, but it can
 only skip the low bits of address.
 
 This is efficient if we have huge memory regions but inefficient if we
 are only using a tiny portion of the address space.
 
 After we have built up the map, detect
 configurations where a single L2 entry is valid.
 
 We then speed up the lookup by skipping one or more levels.
 In case any levels were skipped, we might end up in a valid section
 instead of erroring out. We handle this by checking that
 the address is in range of the resulting section.
 


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
  ...

This exploits the fact that the lower portion of the address space is always
filled, at least in the cases that matter to us.
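
Spelled out a bit more, that would look roughly like this (a sketch only --
the types and helpers are illustrative, not the actual patch):

struct RadixTree {
    PhysPageEntry root;
    hwaddr invalid_index_mask;  /* bits set above the populated range */
    int nb_levels;
};

static MemoryRegionSection *phys_page_find(struct RadixTree *rt, hwaddr index)
{
    PhysPageEntry lp = rt->root;
    int i;

    /* anything above the populated low portion is an immediate miss */
    if (index & rt->invalid_index_mask) {
        return NULL;  /* caller falls back to the unassigned section */
    }
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
        /* descend one level, indexed by the address bits for this level */
        lp = next_level_entry(lp, index, i);  /* hypothetical helper */
    }
    return section_of(lp);  /* hypothetical: map leaf to its section */
}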








Re: [Qemu-devel] [PATCH] build: set up capabilities on qemu-bridge-helper

2013-11-14 Thread Avi Kivity

On 11/14/2013 03:29 PM, Stefan Hajnoczi wrote:

On Tue, Nov 12, 2013 at 01:10:24PM +0200, Avi Kivity wrote:

Out-of-the-box, 'make install' sets up an unusable qemu-bridge-helper since
it doesn't have the required capabilities.

Fix by adding them.

Up until now, downstreams had to make the bridge helper executable
setuid, add the cap_net_admin capability, or they did nothing and it was
broken ;-).  CCing downstream package maintainers in case they have any
comments on this patch.


And it was, indeed, broken.




Note: this may break installing as non-root.  This is actually the right
thing to do, since not setting up the capability would result in a broken
setup.  Perhaps we need a configure flag to disable helpers.

Users who have been successfully installing QEMU would be upset if it
suddenly starts failing after this patch.  The bridge helper is a niche
feature that shouldn't cause a regression for the majority of users who
don't care about it.

If we're installing non-root then the bridge helper simply shouldn't be
installed.


Or maybe installed without the capabilities?  This way if the user 
invokes qemu with sudo, it still works as before.




Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression

2013-11-14 Thread Avi Kivity

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.



I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
  ...

This exploits the fact that the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.


Not much value in combining them -- the variable level tree check will 
be dominated by the level skip logic.


IMO however skipping intermediate levels will be too rare to justify the 
complexity and the doubling of the page table size -- it can only happen 
in iommu setups that place memory in very high addresses.  These ought 
to be rare.






Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression

2013-11-14 Thread Avi Kivity

On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
  ...

This exploits the fact that the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.

Not much value in combining them -- the variable level tree check
will be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify
the complexity and the doubling of the page table size -- it can
only happen in iommu setups that place memory in very high
addresses.  These ought to be rare.


Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).

I agree the simplicity is attractive.

However I really would like some logic that can handle > 1 leaf
somehow.

Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.


That can be achieved by directing a missed lookup to a catch-all region.

It would be best to do this completely internally to the memory code, 
without API changes.




Re: [Qemu-devel] [PATCH v2 5/7] exec: memory radix tree page level compression

2013-11-17 Thread Avi Kivity

On 11/17/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 05:42:26PM +0200, Avi Kivity wrote:

On 11/14/2013 05:37 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 04:56:43PM +0200, Avi Kivity wrote:

On 11/14/2013 04:40 PM, Michael S. Tsirkin wrote:

On Thu, Nov 14, 2013 at 08:54:11AM +, Avi Kivity wrote:

Michael S. Tsirkin mst at redhat.com writes:


At the moment, memory radix tree is already variable width, but it can
only skip the low bits of address.

This is efficient if we have huge memory regions but inefficient if we
are only using a tiny portion of the address space.

After we have built up the map, detect
configurations where a single L2 entry is valid.

We then speed up the lookup by skipping one or more levels.
In case any levels were skipped, we might end up in a valid section
instead of erroring out. We handle this by checking that
the address is in range of the resulting section.


I think this is overkill.  It can be done in a simpler way as follows:


phys_page_find(RadixTree *rt, hwaddr index, ...)
{
    if (index & rt->invalid_index_mask) {
        // not found
    }
    lp = rt->root;
    for (i = rt->nb_levels - 1; i >= 0 && !lp.is_leaf; --i) {
  ...

This exploits the fact that the lower portion of the address space is always
filled, at least in the cases that matter to us.






Basically skip unused high bits?
Sure.
In fact I think both optimizations can be combined.

Not much value in combining them -- the variable level tree check
will be dominated by the level skip logic.

IMO however skipping intermediate levels will be too rare to justify
the complexity and the doubling of the page table size -- it can
only happen in iommu setups that place memory in very high
addresses.  These ought to be rare.


Well maybe not very high address, but you can have a device with
a 64 bit bar and this will add back levels (though it would not
be slower than it is currently).

I agree the simplicity is attractive.

However I really would like some logic that can handle > 1 leaf
somehow.

Specifically both tricks break if we add a full 64 bit io region
in order to return -1 for reads on master abort.

That can be achieved by directing a missed lookup to a catch-all region.

It would be best to do this completely internally to the memory
code, without API changes.

You mean e.g. add a catch-all region to the address space?



The API already specifies a catch-all region - the container region 
catches everything if an access doesn't hit a sub-region (though I 
wanted to discourage the practice of using the same region for both I/O 
and a container), or you can have an I/O region as a subregion with the 
lowest priority, that spans the entire space, so that any access which 
misses the other sub-regions will hit the default subregion.
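
For illustration, the lowest-priority variant looks roughly like this with
the memory API (a sketch; the region name and the ops table are made up, and
'background_ops' would implement the return-(-1)-on-read semantics):

    MemoryRegion *sysmem = get_system_memory();
    static MemoryRegion background;

    memory_region_init_io(&background, NULL, &background_ops, NULL,
                          "background", UINT64_MAX);
    /* priority -1: hit only when no other subregion matches */
    memory_region_add_subregion_overlap(sysmem, 0, &background, -1);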


So we don't need API changes, just to change the internals.

I think this can be done by looking at the address space, and finding 
out what region is mapped to the very last byte in the address space, if 
any, and using that as the default if an access misses the tree.  That 
is, this region will be the target of null pointers in the tree, or if 
(addr & addr_invalid_mask) is true.


This will ensure the tree is of the smallest possible size.

Another option is to simply give up on the tree and use 
std::binary_search() on the FlatView.  It will be more cpu intensive, 
but on the other hand more cache-friendly.  tcg already caches stuff in 
the iotlb, and maybe it makes sense to add a cache for kvm lookups too 
(the advantage of the cache is that it only needs to contain addresses 
that are actually touched, so it can be very small and fast, compared to 
the radix tree which needs to cover all addresses).
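
The binary-search alternative would be along these lines (a sketch over a
hypothetical sorted, non-overlapping array of ranges, not the actual
FlatView types):

typedef struct Range {
    hwaddr start, size;
    MemoryRegionSection *section;
} Range;

/* ranges[] is sorted by start */
static MemoryRegionSection *flatview_lookup(Range *ranges, int n, hwaddr addr)
{
    int lo = 0, hi = n;   /* candidate interval: [lo, hi) */

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (addr < ranges[mid].start) {
            hi = mid;
        } else if (addr - ranges[mid].start >= ranges[mid].size) {
            lo = mid + 1;
        } else {
            return ranges[mid].section;   /* addr falls inside this range */
        }
    }
    return NULL;   /* miss: fall back to the catch-all section */
}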





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity
On Tue, Nov 26, 2013 at 2:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:
 On 26/11/2013 13:40, Zhanghaoyu (A) wrote:
 When the guest sets irq smp_affinity, a VMEXIT occurs, the vcpu thread
returns from the hypervisor to QEMU via ioctl, and it then asks the
hypervisor to update the irq routing table.
 In kvm_set_irq_routing, synchronize_rcu is called; the current vcpu thread
is blocked for a long time waiting for the RCU grace period, and during this
period the vcpu cannot provide service to the VM,
 so interrupts delivered to this vcpu cannot be handled in time, and the
apps running on this vcpu cannot be serviced either.
 That's unacceptable in some real-time scenarios, e.g. telecom.

 So, I want to create a single workqueue for each VM, to asynchronously
perform the RCU synchronization for the irq routing table,
 and let the vcpu thread return and VMENTRY to service the VM immediately,
with no more need to block waiting for the RCU grace period.
 And I have implemented a rough patch and tested it in our telecom
environment; the above problem disappeared.

 I don't think a workqueue is even needed.  You just need to use call_rcu
 to free the old table after releasing kvm->irq_lock.

 What do you think?

Can this cause an interrupt to be delivered to the wrong (old) vcpu?

The way Linux sets interrupt affinity, it cannot: changing the
affinity is (IIRC) done in the interrupt handler, so the next interrupt
cannot be in flight and thus pick up the old interrupt routing table.

However it may be vulnerable in other ways.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity
On Tue, Nov 26, 2013 at 3:47 PM, Paolo Bonzini pbonz...@redhat.com wrote:

 On 26/11/2013 14:18, Avi Kivity wrote:
 
  I don't think a workqueue is even needed.  You just need to use call_rcu
  to free the old table after releasing kvm->irq_lock.
 
  What do you think?
 
  Can this cause an interrupt to be delivered to the wrong (old) vcpu?

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.



I understood the proposal was also to eliminate the synchronize_rcu(), so
while new interrupts would see the new routing table, interrupts already in
flight could pick up the old one.


Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).


With synchronize_rcu(), you have the additional guarantee that any 
parallel accesses to the old routing table have completed.  Since we 
also trigger the irq from rcu context, you know that after 
synchronize_rcu() you won't get any interrupts to the old destination 
(see kvm_set_irq_inatomic()).


It's another question whether the hardware provides the same guarantee.


If you eliminate the synchronize_rcu, new interrupts would see the new
routing table, while interrupts already in flight will get a dangling
pointer.


Sure, if you drop the synchronize_rcu(), you have to add call_rcu().
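
Concretely, that replaces the synchronize_rcu()/kfree() pair with something
like this (a sketch; it assumes an rcu_head field is added to the routing
table struct to hook the callback on):

static void free_irq_routing_table(struct rcu_head *head)
{
        struct kvm_irq_routing_table *rt =
                container_of(head, struct kvm_irq_routing_table, rcu);
        kfree(rt);
}

        mutex_lock(&kvm->irq_lock);
        old = kvm->irq_routing;
        kvm_irq_routing_update(kvm, new);
        mutex_unlock(&kvm->irq_lock);

        /* instead of synchronize_rcu(); kfree(old); */
        call_rcu(&old->rcu, free_irq_routing_table);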



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:03 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:54:44PM +0200, Avi Kivity wrote:

On 11/26/2013 04:46 PM, Paolo Bonzini wrote:

On 26/11/2013 15:36, Avi Kivity wrote:

 No, this would be exactly the same code that is running now:

 mutex_lock(&kvm->irq_lock);
 old = kvm->irq_routing;
 kvm_irq_routing_update(kvm, new);
 mutex_unlock(&kvm->irq_lock);

 synchronize_rcu();
 kfree(old);
 return 0;

 Except that the kfree would run in the call_rcu kernel thread instead of
 the vcpu thread.  But the vcpus already see the new routing table after
 the rcu_assign_pointer that is in kvm_irq_routing_update.

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().



Consider this guest code:

  write msi entry, directing the interrupt away from this vcpu
  nop
  memset(idt, 0, sizeof(idt));

Currently, this code will never trigger a triple fault.  With the change 
to call_rcu(), it may.


Now it may be that the guest does not expect this to work (PCI writes 
are posted; and interrupts can be delayed indefinitely by the pci 
fabric), but we don't know if there's a path that guarantees the guest 
something that we're taking away with this change.







Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:20 PM, Paolo Bonzini wrote:

On 26/11/2013 16:03, Gleb Natapov wrote:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
                                   e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
                                   kvm_set_msi_irq(e, &irq);
                                   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

If we want to ensure this, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.


An alternative path is to convince ourselves that the hardware does not 
provide the guarantees that the current code provides, and so we can 
relax them.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:28 PM, Paolo Bonzini wrote:

On 26/11/2013 16:25, Avi Kivity wrote:

If we want to ensure this, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.



Why?



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

On 26/11/2013 16:35, Avi Kivity wrote:

If we want to ensure this, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.


It's indeed safe, but I think there's a nice win to be had if we drop 
the assumption.



(BTW, PCI memory writes are posted, but configuration writes are not).


MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush the 
write via a read.  But I think the fact that the interrupt messages 
themselves are posted proves that it is safe.  The fact that Linux does 
interrupt migration from within the interrupt handler also shows that 
someone else believes that it is the only safe place to do it.






Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:11 PM, Michael S. Tsirkin wrote:

On Tue, Nov 26, 2013 at 06:06:26PM +0200, Avi Kivity wrote:

On 11/26/2013 05:58 PM, Paolo Bonzini wrote:

On 26/11/2013 16:35, Avi Kivity wrote:

If we want to ensure this, we need to use a different mechanism for
synchronization than the global RCU.  QRCU would work; readers are not
wait-free but only if there is a concurrent synchronize_qrcu, which
should be rare.

An alternative path is to convince ourselves that the hardware does not
provide the guarantees that the current code provides, and so we can
relax them.

No, I think it's a reasonable guarantee to provide.

Why?

Because IIUC the semantics may depend not just on the interrupt
controller, but also on the specific PCI device.  It seems safer to
assume that at least one device/driver pair wants this to work.

It's indeed safe, but I think there's a nice win to be had if we
drop the assumption.

I'm not arguing with that, but a minor commoent below:


(BTW, PCI memory writes are posted, but configuration writes are not).

MSIs are configured via PCI memory writes.

By itself, that doesn't buy us anything, since the guest could flush
the write via a read.  But I think the fact that the interrupt
messages themselves are posted proves that it is safe.

FYI, PCI read flushes the interrupt itself in, too.



I guess that kills the optimization then.  Maybe you can do qrcu, 
whatever that is.





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:24 PM, Gleb Natapov wrote:

On Tue, Nov 26, 2013 at 04:20:27PM +0100, Paolo Bonzini wrote:

On 26/11/2013 16:03, Gleb Natapov wrote:

I understood the proposal was also to eliminate the synchronize_rcu(),
so while new interrupts would see the new routing table, interrupts
already in flight could pick up the old one.

Isn't that always the case with RCU?  (See my answer above: the vcpus
already see the new routing table after the rcu_assign_pointer that is
in kvm_irq_routing_update).

With synchronize_rcu(), you have the additional guarantee that any
parallel accesses to the old routing table have completed.  Since we
also trigger the irq from rcu context, you know that after
synchronize_rcu() you won't get any interrupts to the old
destination (see kvm_set_irq_inatomic()).

We do not have this guarantee for other vcpus that do not call
synchronize_rcu(). They may still use an outdated routing table while a vcpu
or iothread that performed the table update sits in synchronize_rcu().

Avi's point is that, after the VCPU resumes execution, you know that no
interrupt will be sent to the old destination because
kvm_set_msi_inatomic (and ultimately kvm_irq_delivery_to_apic_fast) is
also called within the RCU read-side critical section.

Without synchronize_rcu you could have

 VCPU writes to routing table
                                   e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
                                   kvm_set_msi_irq(e, &irq);
                                   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
                                  e = entry from IRQ routing table
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

   



Suppose the guest did not disable_irq() and enable_irq(), but instead 
had a pci read where you have the enable_irq().  After the read you 
cannot have a stale irq (assuming the read flushes the irq all the way 
to the APIC).





Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-26 Thread Avi Kivity

On 11/26/2013 06:28 PM, Paolo Bonzini wrote:

On 26/11/2013 17:24, Gleb Natapov wrote:

 VCPU writes to routing table
                                   e = entry from IRQ routing table
 kvm_irq_routing_update(kvm, new);
 VCPU resumes execution
                                   kvm_set_msi_irq(e, &irq);
                                   kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.

So how is it different from what we have now:

disable_irq()
VCPU writes to routing table
                                  e = entry from IRQ routing table
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();
kvm_irq_routing_update(kvm, new);
synchronize_rcu()
VCPU resumes execution
enable_irq()
receive stale irq

Adding disable/enable IRQs looks like a relatively big change.  But
perhaps it's not, for some reason I'm missing.



Those are guest operations, which may not be there at all.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
                                  e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).

  
Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose the guest does:

1: change the msi interrupt by writing to a pci register
2: read the pci register to flush the write
3: zero the idt

I am pretty certain that this code can get an interrupt after step 2 on real
HW, but I cannot tell if the guest can rely on it being delivered exactly
after the read instruction or whether it can be delayed by a couple of
instructions. It seems to me it would be fragile for an OS to depend on this
behaviour. AFAIK Linux does not.



Linux is safe, it does interrupt migration from within the interrupt 
handler.  If you do that before the device-specific EOI, you won't get 
another interrupt until programming the MSI is complete.


Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes, 
consistently faster.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
                                  e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose the guest does:

1: change the msi interrupt by writing to a pci register
2: read the pci register to flush the write
3: zero the idt

I am pretty certain that this code can get an interrupt after step 2 on real
HW, but I cannot tell if the guest can rely on it being delivered exactly
after the read instruction or whether it can be delayed by a couple of
instructions. It seems to me it would be fragile for an OS to depend on this
behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?




Shouldn't it work with old userspace too? Maybe I misunderstood your intent.



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 11:53 AM, Paolo Bonzini wrote:

On 28/11/2013 10:49, Avi Kivity wrote:

Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't get
another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code changes,
consistently faster.

call_rcu() has the problem of rate limiting, too.  It wasn't such a
great idea, I think.

The QRCU I linked would work great latency-wise (it has roughly the same
latency as an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.



I guess srcu would work.  Do you know what's the typical write-side 
latency there? (in terms of what it waits for, not nanoseconds).




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 12:40 PM, Paolo Bonzini wrote:

On 28/11/2013 11:16, Avi Kivity wrote:

The QRCU I linked would work great latency-wise (it has roughly the same
latency as an rwsem but readers are lock-free).  However, the locked
operations in the read path would hurt because of cache misses, so it's
not good either.

I guess srcu would work.  Do you know what's the typical write-side
latency there? (in terms of what it waits for, not nanoseconds).

If there's no concurrent reader, it's zero if I read the code right.
Otherwise it depends:

- if there are many callbacks, only 10 of them are processed per
millisecond.  But unless there are concurrent synchronize_srcu calls
there should not be any callback at all.  If all VCPUs were to furiously
change the MSIs, the latency could go up to #vcpu/10 milliseconds.

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

So, given the very restricted usage this SRCU would have, we probably
can expect synchronize_srcu_expedited to finish its job in the
busy-looping phase, and 120 us should be the expected maximum
latency---more likely to be an order of magnitude smaller, and in very
rare cases higher.



Looks like it's a good fit then.
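
For reference, the SRCU variant under discussion would look roughly like
this (a sketch; kvm->irq_srcu is a hypothetical srcu_struct added for this
purpose):

        /* writer: routing table update */
        mutex_lock(&kvm->irq_lock);
        old = kvm->irq_routing;
        rcu_assign_pointer(kvm->irq_routing, new);
        mutex_unlock(&kvm->irq_lock);
        synchronize_srcu_expedited(&kvm->irq_srcu);
        kfree(old);

        /* reader: interrupt injection path */
        idx = srcu_read_lock(&kvm->irq_srcu);
        irq_rt = srcu_dereference(kvm->irq_routing, &kvm->irq_srcu);
        /* ... deliver the interrupt using irq_rt ... */
        srcu_read_unlock(&kvm->irq_srcu, idx);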



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:10 PM, Paolo Bonzini wrote:

On 28/11/2013 12:09, Gleb Natapov wrote:

- if there are no callbacks, but there are readers, synchronize_srcu
busy-loops for some time checking if the readers complete.  After a
while (20 us for synchronize_srcu, 120 us for
synchronize_srcu_expedited) it gives up and starts using a workqueue to
poll every millisecond.  This should never happen unless

Unless what ? :) Unless reader is scheduled out?

Yes.  Or unless my brain is scheduled out in the middle of a sentence.



You need a grace period.  Or just sleep().



Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
                                  e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose the guest does:

1: change the msi interrupt by writing to a pci register
2: read the pci register to flush the write
3: zero the idt

I am pretty certain that this code can get an interrupt after step 2 on real
HW, but I cannot tell if the guest can rely on it being delivered exactly
after the read instruction or whether it can be delayed by a couple of
instructions. It seems to me it would be fragile for an OS to depend on this
behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I proposed how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but a kernel
fix will also require a kernel update, and usually updating userspace
is easier.




Isn't the latency due to interrupt migration causing long 
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


The problem occurs with assigned devices too AFAICT.




Re: [Qemu-devel] [RFC] create a single workqueue for each vm to update vm irq routing table

2013-11-28 Thread Avi Kivity

On 11/28/2013 01:22 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 01:18:54PM +0200, Avi Kivity wrote:

On 11/28/2013 01:02 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 12:12:55PM +0200, Avi Kivity wrote:

On 11/28/2013 12:11 PM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 11:49:00AM +0200, Avi Kivity wrote:

On 11/28/2013 11:19 AM, Gleb Natapov wrote:

On Thu, Nov 28, 2013 at 09:55:42AM +0100, Paolo Bonzini wrote:

On 28/11/2013 07:27, Zhanghaoyu (A) wrote:

Without synchronize_rcu you could have

VCPU writes to routing table
                                  e = entry from IRQ routing table
kvm_irq_routing_update(kvm, new);
VCPU resumes execution
                                  kvm_set_msi_irq(e, &irq);
                                  kvm_irq_delivery_to_apic_fast();

where the entry is stale but the VCPU has already resumed execution.


If we use call_rcu() (setting aside for the moment the problem that Gleb
pointed out) instead of synchronize_rcu(), should we still ensure this?

The problem is that we should ensure this, so using call_rcu is not
possible (even not considering the memory allocation problem).


Not changing current behaviour is certainly safer, but I am still not 100%
convinced we have to ensure this.

Suppose the guest does:

1: change the msi interrupt by writing to a pci register
2: read the pci register to flush the write
3: zero the idt

I am pretty certain that this code can get an interrupt after step 2 on real
HW, but I cannot tell if the guest can rely on it being delivered exactly
after the read instruction or whether it can be delayed by a couple of
instructions. It seems to me it would be fragile for an OS to depend on this
behaviour. AFAIK Linux does not.


Linux is safe, it does interrupt migration from within the interrupt
handler.  If you do that before the device-specific EOI, you won't
get another interrupt until programming the MSI is complete.

Is virtio safe? IIRC it can post multiple interrupts without guest acks.

Using call_rcu() is a better solution than srcu IMO.  Less code
changes, consistently faster.

Why not fix userspace to use KVM_SIGNAL_MSI instead?



Shouldn't it work with old userspace too? Maybe I misunderstood your intent.

Zhanghaoyu said that the problem mostly hurts in a real-time telecom
environment, so I proposed how he can fix the problem in his specific
environment.  It will not fix older userspace obviously, but a kernel
fix will also require a kernel update, and usually updating userspace
is easier.



Isn't the latency due to interrupt migration causing long
synchronize_rcu()s?  How does KVM_SIGNAL_MSI help?


If MSI is delivered using KVM_SIGNAL_MSI, as opposed to via an entry in
the irq routing table, changing the MSI configuration should not cause an
update to the irq routing table (not saying this is what happens with
current QEMU, but theoretically there is no reason to update the routing
table in this case).


I see.  That pushes the problem to userspace, which uses traditional 
locking, so the problem disappears until qemu starts using rcu too to 
manage this.


There is also irqfd, however.  We could also do a KVM_UPDATE_IRQFD to 
change the payload it delivers, but that has exactly the same problems.
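
For context, direct injection via KVM_SIGNAL_MSI bypasses the routing table
entirely; from userspace it is roughly the following (per the KVM API of
that era; addr and data stand for the guest-programmed MSI address and
payload):

        struct kvm_msi msi = {
                .address_lo = addr & 0xffffffff,
                .address_hi = addr >> 32,
                .data = data,
        };

        ioctl(vm_fd, KVM_SIGNAL_MSI, &msi);  /* one-shot MSI, no routing entry */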





Re: [Qemu-devel] outlined TLB lookup on x86

2013-12-08 Thread Avi Kivity

On 11/28/2013 04:12 AM, Richard Henderson wrote:

2. why not use a TLB of bigger size?  Currently the TLB has 1 << 8 entries. The
TLB lookup is 10 x86 instructions, but every miss needs ~450 instructions; I
measured this using Intel PIN. So even if the miss rate is low (say 3%) the
overall time spent in cpu_x86_handle_mmu_fault is still significant.

I'd be interested to experiment with different TLB sizes, to see what effect
that has on performance.  But I suspect that lack of TLB contexts mean that we
wind up flushing the TLB more often than real hardware does, and therefore a
larger TLB merely takes longer to flush.




You could use a generation counter to flush the TLB in O(1) by 
incrementing the counter.  That slows down the fast path though. Maybe 
you can do that for the larger second level TLB only.
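
A sketch of the generation-counter idea (illustrative only, not QEMU's
actual TLB structures; the generation is folded into the comparison tag, so
incrementing it invalidates every existing entry at once):

typedef struct TLB2Entry {
    target_ulong tag;       /* page address | generation (low bits) */
    uintptr_t addend;
} TLB2Entry;

#define TLB2_GEN_BITS 12    /* fits in the page-offset bits of the tag */
#define TLB2_GEN_MASK ((1u << TLB2_GEN_BITS) - 1)

static uint32_t tlb2_gen;

static inline void tlb2_flush_all(void)
{
    /* O(1) flush; a real implementation must do a full sweep when
       the counter wraps around */
    ++tlb2_gen;
}

static inline bool tlb2_hit(const TLB2Entry *e, target_ulong page)
{
    /* 'page' is page-aligned, so the low bits are free for the
       generation; entries written before the last flush compare
       unequal */
    return e->tag == (page | (tlb2_gen & TLB2_GEN_MASK));
}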




[Qemu-devel] [PATCH] enhance gdbstub to support x86-64

2006-07-25 Thread Avi Kivity

The following patch adds x86-64 support to gdbstub.  Please consider for
inclusion.

Index: gdbstub.c
===
--- gdbstub.c   (revision 2399)
+++ gdbstub.c   (revision 2400)
@@ -175,10 +175,144 @@
return 0;
}

-#if defined(TARGET_I386)
+#if defined(TARGET_X86_64)

static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
{
+uint8_t *p = mem_buf;
+int i, fpus;
+
+#define PUTREG(x) do { \
+   target_ulong reg = tswapl(x); \
+   memcpy(p, &reg, sizeof reg); \
+   p += sizeof reg; \
+} while (0)
+#define PUTREG32(x) do { \
+   uint32_t reg = tswap32(x);  \
+   memcpy(p, &reg, sizeof reg); \
+   p += sizeof reg; \
+} while (0)
+#define PUTREGF(x) do { \
+   memcpy(p, &(x), 10);\
+   p += sizeof (x);\
+} while (0)
+
+PUTREG(env->regs[R_EAX]);
+PUTREG(env->regs[R_EBX]);
+PUTREG(env->regs[R_ECX]);
+PUTREG(env->regs[R_EDX]);
+PUTREG(env->regs[R_ESI]);
+PUTREG(env->regs[R_EDI]);
+PUTREG(env->regs[R_EBP]);
+PUTREG(env->regs[R_ESP]);
+PUTREG(env->regs[8]);
+PUTREG(env->regs[9]);
+PUTREG(env->regs[10]);
+PUTREG(env->regs[11]);
+PUTREG(env->regs[12]);
+PUTREG(env->regs[13]);
+PUTREG(env->regs[14]);
+PUTREG(env->regs[15]);
+
+PUTREG(env->eip);
+PUTREG32(env->eflags);
+PUTREG32(env->segs[R_CS].selector);
+PUTREG32(env->segs[R_SS].selector);
+PUTREG32(env->segs[R_DS].selector);
+PUTREG32(env->segs[R_ES].selector);
+PUTREG32(env->segs[R_FS].selector);
+PUTREG32(env->segs[R_GS].selector);
+/* XXX: convert floats */
+for(i = 0; i < 8; i++) {
+PUTREGF(env->fpregs[i]);
+}
+PUTREG32(env->fpuc);
+fpus = (env->fpus & ~0x3800) | (env->fpstt & 0x7) << 11;
+PUTREG32(fpus);
+PUTREG32(0); /* XXX: convert tags */
+PUTREG32(0); /* fiseg */
+PUTREG32(0); /* fioff */
+PUTREG32(0); /* foseg */
+PUTREG32(0); /* fooff */
+PUTREG32(0); /* fop */
+
+#undef PUTREG
+#undef PUTREG32
+#undef PUTREGF
+
+return p - mem_buf;
+}
+
+static void cpu_gdb_write_registers(CPUState *env, uint8_t *mem_buf, int size)
+{
+uint8_t *p = mem_buf;
+uint32_t junk;
+int i, fpus;
+
+#define GETREG(x) do { \
+   target_ulong reg; \
+   memcpy(&reg, p, sizeof reg); \
+   x = tswapl(reg); \
+   p += sizeof reg; \
+} while (0)
+#define GETREG32(x) do { \
+   uint32_t reg; \
+   memcpy(&reg, p, sizeof reg); \
+   x = tswap32(reg);\
+   p += sizeof reg; \
+} while (0)
+#define GETREGF(x) do { \
+   memcpy(&(x), p, 10); \
+   p += 10; \
+} while (0)
+
+GETREG(env->regs[R_EAX]);
+GETREG(env->regs[R_EBX]);
+GETREG(env->regs[R_ECX]);
+GETREG(env->regs[R_EDX]);
+GETREG(env->regs[R_ESI]);
+GETREG(env->regs[R_EDI]);
+GETREG(env->regs[R_EBP]);
+GETREG(env->regs[R_ESP]);
+GETREG(env->regs[8]);
+GETREG(env->regs[9]);
+GETREG(env->regs[10]);
+GETREG(env->regs[11]);
+GETREG(env->regs[12]);
+GETREG(env->regs[13]);
+GETREG(env->regs[14]);
+GETREG(env->regs[15]);
+
+GETREG(env->eip);
+GETREG32(env->eflags);
+GETREG32(env->segs[R_CS].selector);
+GETREG32(env->segs[R_SS].selector);
+GETREG32(env->segs[R_DS].selector);
+GETREG32(env->segs[R_ES].selector);
+GETREG32(env->segs[R_FS].selector);
+GETREG32(env->segs[R_GS].selector);
+/* XXX: convert floats */
+for(i = 0; i < 8; i++) {
+GETREGF(env->fpregs[i]);
+}
+GETREG32(env->fpuc);
+GETREG32(fpus); /* XXX: convert fpus */
+GETREG32(junk); /* XXX: convert tags */
+GETREG32(junk); /* fiseg */
+GETREG32(junk); /* fioff */
+GETREG32(junk); /* foseg */
+GETREG32(junk); /* fooff */
+GETREG32(junk); /* fop */
+
+#undef GETREG
+#undef GETREG32
+#undef GETREGF
+}
+
+#elif defined(TARGET_I386)
+
+static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
+{
uint32_t *registers = (uint32_t *)mem_buf;
int i, fpus;

@@ -545,7 +679,8 @@
char buf[4096];
uint8_t mem_buf[2000];
uint32_t *registers;
-uint32_t addr, len;
+target_ulong addr;
+uint32_t len;

#ifdef DEBUG_GDB
printf("command='%s'\n", line_buf);


--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Wipe patch

2006-07-26 Thread Avi Kivity

Brad Campbell wrote:

ZIGLIO, Frediano, VF-IT wrote:

Hi,
  well, this is not a definitive patch but it works. The aim is to be
able to wipe the disk without allocating the entire space. When you wipe a
disk, the program fills it with zero bytes, so the disk image grows to
allocate all the space. This patch just detects null-byte writes and does
not write all-zero clusters.


This _looks_ like it would severely impact cpu load during a write.
Have you done any testing to determine if this is likely to impact a 
normal usage scenario?


Why would it? In most cases, the zero test would terminate quickly, 
without accessing the entire cluster.
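
A zero test of that shape is essentially the following (a sketch; the names
are illustrative). It returns at the first nonzero byte, so typical data
costs almost nothing:

static bool buffer_is_zero(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    size_t i;

    for (i = 0; i < len; i++) {
        if (p[i]) {
            return false;  /* bail out at the first nonzero byte */
        }
    }
    return true;  /* whole cluster is zero: the write can be skipped */
}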


--
error compiling committee.c: too many arguments to function





[Qemu-devel] [PATCH] enhance gdbstub to support x86-64

2006-08-20 Thread Avi Kivity
The following patch adds x86-64 support to gdbstub.  Please consider for 
inclusion.


[not subscribed, please cc:]

Index: gdbstub.c
===
--- gdbstub.c   (revision 2399)
+++ gdbstub.c   (revision 2400)
@@ -175,10 +175,144 @@
return 0;
}

-#if defined(TARGET_I386)
+#if defined(TARGET_X86_64)

static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
{
+uint8_t *p = mem_buf;
+int i, fpus;
+
+#define PUTREG(x) do { \
+   target_ulong reg = tswapl(x); \
+   memcpy(p, &reg, sizeof reg); \
+   p += sizeof reg; \
+} while (0)
+#define PUTREG32(x) do { \
+   uint32_t reg = tswap32(x);  \
+   memcpy(p, &reg, sizeof reg); \
+   p += sizeof reg; \
+} while (0)
+#define PUTREGF(x) do { \
+   memcpy(p, &(x), 10);\
+   p += sizeof (x);\
+} while (0)
+
+PUTREG(env->regs[R_EAX]);
+PUTREG(env->regs[R_EBX]);
+PUTREG(env->regs[R_ECX]);
+PUTREG(env->regs[R_EDX]);
+PUTREG(env->regs[R_ESI]);
+PUTREG(env->regs[R_EDI]);
+PUTREG(env->regs[R_EBP]);
+PUTREG(env->regs[R_ESP]);
+PUTREG(env->regs[8]);
+PUTREG(env->regs[9]);
+PUTREG(env->regs[10]);
+PUTREG(env->regs[11]);
+PUTREG(env->regs[12]);
+PUTREG(env->regs[13]);
+PUTREG(env->regs[14]);
+PUTREG(env->regs[15]);
+
+PUTREG(env->eip);
+PUTREG32(env->eflags);
+PUTREG32(env->segs[R_CS].selector);
+PUTREG32(env->segs[R_SS].selector);
+PUTREG32(env->segs[R_DS].selector);
+PUTREG32(env->segs[R_ES].selector);
+PUTREG32(env->segs[R_FS].selector);
+PUTREG32(env->segs[R_GS].selector);
+/* XXX: convert floats */
+for(i = 0; i < 8; i++) {
+PUTREGF(env->fpregs[i]);
+}
+PUTREG32(env->fpuc);
+fpus = (env->fpus & ~0x3800) | (env->fpstt & 0x7) << 11;
+PUTREG32(fpus);
+PUTREG32(0); /* XXX: convert tags */
+PUTREG32(0); /* fiseg */
+PUTREG32(0); /* fioff */
+PUTREG32(0); /* foseg */
+PUTREG32(0); /* fooff */
+PUTREG32(0); /* fop */
+
+#undef PUTREG
+#undef PUTREG32
+#undef PUTREGF
+
+return p - mem_buf;
+}
+
+static void cpu_gdb_write_registers(CPUState *env, uint8_t *mem_buf, int size)
+{
+uint8_t *p = mem_buf;
+uint32_t junk;
+int i, fpus;
+
+#define GETREG(x) do { \
+   target_ulong reg; \
+   memcpy(&reg, p, sizeof reg); \
+   x = tswapl(reg); \
+   p += sizeof reg; \
+} while (0)
+#define GETREG32(x) do { \
+   uint32_t reg; \
+   memcpy(&reg, p, sizeof reg); \
+   x = tswap32(reg);\
+   p += sizeof reg; \
+} while (0)
+#define GETREGF(x) do { \
+   memcpy(&(x), p, 10); \
+   p += 10; \
+} while (0)
+
+GETREG(env->regs[R_EAX]);
+GETREG(env->regs[R_EBX]);
+GETREG(env->regs[R_ECX]);
+GETREG(env->regs[R_EDX]);
+GETREG(env->regs[R_ESI]);
+GETREG(env->regs[R_EDI]);
+GETREG(env->regs[R_EBP]);
+GETREG(env->regs[R_ESP]);
+GETREG(env->regs[8]);
+GETREG(env->regs[9]);
+GETREG(env->regs[10]);
+GETREG(env->regs[11]);
+GETREG(env->regs[12]);
+GETREG(env->regs[13]);
+GETREG(env->regs[14]);
+GETREG(env->regs[15]);
+
+GETREG(env->eip);
+GETREG32(env->eflags);
+GETREG32(env->segs[R_CS].selector);
+GETREG32(env->segs[R_SS].selector);
+GETREG32(env->segs[R_DS].selector);
+GETREG32(env->segs[R_ES].selector);
+GETREG32(env->segs[R_FS].selector);
+GETREG32(env->segs[R_GS].selector);
+/* XXX: convert floats */
+for(i = 0; i < 8; i++) {
+GETREGF(env->fpregs[i]);
+}
+GETREG32(env->fpuc);
+GETREG32(fpus); /* XXX: convert fpus */
+GETREG32(junk); /* XXX: convert tags */
+GETREG32(junk); /* fiseg */
+GETREG32(junk); /* fioff */
+GETREG32(junk); /* foseg */
+GETREG32(junk); /* fooff */
+GETREG32(junk); /* fop */
+
+#undef GETREG
+#undef GETREG32
+#undef GETREGF
+}
+
+#elif defined(TARGET_I386)
+
+static int cpu_gdb_read_registers(CPUState *env, uint8_t *mem_buf)
+{
uint32_t *registers = (uint32_t *)mem_buf;
int i, fpus;

@@ -545,7 +679,8 @@
char buf[4096];
uint8_t mem_buf[2000];
uint32_t *registers;
-uint32_t addr, len;
+target_ulong addr;
+uint32_t len;

#ifdef DEBUG_GDB

printf("command='%s'\n", line_buf);


--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] hosting the forum

2006-10-03 Thread Avi Kivity

Sorry, this was meant to be off-list.  My apologies.

Avi Kivity wrote:

Hetz Ben Hamo wrote:

Due to some personal problems (related to financial situations) I'm no
longer being able to keep paying the server which hosts the QEMU
forum.

Therefore, I'm looking for someone who can host the QEMU forum (and,
if possible, the nightly snapshots). If anyone would want to host it,
please contact me offline.



Hi,

We may be interested in hosting the forum and the nightlies on our 
servers here.  Can you tell us about the bandwidth required?


Also, if you have qemu coding experience we might be able to help with 
the financial situation as well.  Please send your CV to me if 
interested (we are located in the Poleg area).





--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] qemu vs gcc4

2006-10-23 Thread Avi Kivity

Paul Brook wrote:

On Monday 23 October 2006 09:16, Martin Guy wrote:
  

Now, gcc4 can produce code with several return instructions (with no
option to turn that off, as far as I understand). You cannot cut them out,
and therefore you cannot chain the simple functions.
  

...unless you also map return instructions within the generated
functions into branches to the soon-to-be-dropped final return? Not
that I know anything about qemu internals mind u...



That's exactly what my gcc4 hacks do.

It gets complicated because x86 uses variable-length insn encodings, so you
don't know where insn boundaries are, and a jmp instruction is larger than a
ret instruction, so it's not always possible to do a straight replacement.
  


how about

void some_generated_instruction(u32 a1, u32 s2)
{
  // code
  asm volatile ("");
}


that will force the code to fall through to the null asm code, avoiding 
premature returns.


if the code uses 'return' explicitly, turn it to a goto just before the 
'asm volatile'.



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] qemu vs gcc4

2006-10-23 Thread Avi Kivity

Paul Brook wrote:

That's exactly what my gcc4 hacks do.

It gets complicated because x86 uses variable-length insn encodings, so
you don't know where insn boundaries are, and a jmp instruction is larger
than a ret instruction, so it's not always possible to do a straight
replacement.
  

how about

void some_generated_instruction(u32 a1, u32 s2)
{
   // code
   asm volatile ("");
}


that will force the code to fall through to the null asm code, avoiding
premature returns.

if the code uses 'return' explicitly, turn it to a goto just before the
'asm volatile'.



We already do that. It doesn't stop gcc putting the return in the middle of 
the function.


Paul
  

void f1();
void f2();

void f(int *z, int x, int y)
{
    if (x) {
        *z = x;
        f1();
    } else {
        *z = y;
        f2();
    }
    asm volatile ("");
}

works, with gcc -O2 -fno-reorder-blocks. Removing either the asm or the
-f flag doesn't.  No idea if it's consistent across architectures.


(the function calls are there to prevent cmov optimizations)




--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] qemu vs gcc4

2006-10-23 Thread Avi Kivity

Paul Brook wrote:

We already do that. It doesn't stop gcc putting the return in the middle
of the function.

Paul
  

void f1();
void f2();

void f(int *z, int x, int y)
{
    if (x) {
        *z = x;
        f1();
    } else {
        *z = y;
        f2();
    }
    asm volatile ("");
}

works, with gcc -O2 -fno-reorder-blocks. Removing either the asm or the
-f flag doesn't.  No idea if it's consistent across architectures.



It doesn't work reliably though. We already do everything you mention above.
  


Okay.  Sorry for pestering :)

--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] [PATCH] Experimental initial patch providing accelerated OpenGL for Linux i386 (2nd attempt to post)

2006-11-19 Thread Avi Kivity

Even Rouault wrote:
Apart from portability to other architectures, what would be the other 
advantages of a solution based on a PCI device ?
  


I can think of two advantages:

- the device can deliver interrupts to the guest when something is 
completed (I don't know how useful this is to a 3D device)
- the guest can autodetect the presence of the device by means of a PCI 
ID.  This means that if the guest driver is eventually included in the 
OS (which is possible in Linux), then the driver will be loaded 
automatically by the OS.



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





Re: [Qemu-devel] Re: NBD server for QEMU images

2006-12-13 Thread Avi Kivity

Anthony Liguori wrote:


Mounting a partition being served on the same host as read-write can
cause deadlocks. From nbd-2.9.0 README file:


This text is pretty old.  Is this still valid?  This would imply that 
things like loop can result in deadlocks.  I don't see why flushing 
one device would depend on the completion of another device.  
Otherwise, if you had two disk adapters, they would always be 
operating in lock step.


As I've said, I've never seen a problem doing read-write with nbd on
localhost.


A deadlock can happen under heavy I/O load:

- write tons of data to nbd device, data ends up in pagecache
- memory gets low, kswapd wakes up, calls nbd device to actually write 
the data

- nbd issues a request, which ends up on the nbd server on the same machine
- the nbd server allocates memory
- memory allocation hangs waiting for kswapd

I submitted a patch/workaround for this some years ago (for a similar 
problem with a local mount to a userspace nfs server).



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: NBD server for QEMU images

2006-12-13 Thread Avi Kivity

Martin Guy wrote:

- write tons of data to nbd device, data ends up in pagecache
- memory gets low, kswapd wakes up, calls nbd device to actually write
the data
- nbd issues a request, which ends up on the nbd server on the same 
machine

- the nbd server allocates memory
- memory allocation hangs waiting for kswapd


In other words, it can deadlock only if you are swapping to an nbd
device that is served by nbd-server running on the same machine and
kernel. 


No.  It is possible if you issue non-O_SYNC writes.


In the case of a qemu system swapping over nbd to a server on
the host machine, it is the guest kernel that waits on the host kernel
paging the nbd server in from the host's separate swap space, so no
deadlock is possible.

Practice bears this out; if you wanna stress-test it, here's a program
that creates a low memory condition by saturating the VM.


It isn't enough to thrash the guest, you need to exhaust host memory as 
well.  You also need to do it faster than kswapd can write it out.




Of course, this has nothing to do with the original patch, which just
lets nbd-server interpret qemu image files ;)



Agreed.  But mounting nbd from localhost is dangerous.

--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Re: NBD server for QEMU images

2006-12-14 Thread Avi Kivity

Salvador Fandino wrote:

I have run some tests and found that it's easy to cause a deadlock just
untaring a file over an nbd device being served from localhost (using
the standard nbd-server or my own, it doesn't matter).

Another interesting finding is that when the deadlock happens, qemu-nbds
is inside a read() call, waiting for new nbd requests to arrive over the
socket, and so, not trying to allocate memory or writing to disk.

  


If you use sysrq-t and a serial console, you will find exactly how it's 
waiting for memory.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





Re: [Qemu-devel] QEMU/pc and scsi disks

2007-03-03 Thread Avi Kivity

Anthony Liguori wrote:


I think most people agree that we need a config file.  I haven't seen 
any comments on my config file patch though.


So, any comments on that patch?  Any requirements on a format?



1.  Any option should be settable either in the config file or command 
line.  In other words, the user should not be forced to use a config 
file.  This is useful for management programs who keep all options in an 
internal database, and for users who can experiment via a ^P edit edit 
edit enter.


2. The %(blah)s syntax is a bit ugly.  It's one of the few things in 
python I dislike.  I'd prefer ${blah} and ${blah:modifier}.


3. It would be nice to be able to embed a config file in a qcow3 format 
so that one can ship an image with some default options in one file.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





Re: [Qemu-devel] QEMU/pc and scsi disks

2007-03-04 Thread Avi Kivity

Anthony Liguori wrote:


1.  Any option should be settable either in the config file or 
command line.  In other words, the user should not be forced to use a 
config file.  This is useful for management programs who keep all 
options in an internal database, and for users who can experiment via 
a ^P edit edit edit enter.


I think we should still provide the ability to set the most common 
options via the command line.  I'm also fine with specifying single 
options on the command line.  I suspect though that being able to do 
-config - is more useful for management tools than building large 
strings of command line options.


Out of curiosity, why?  If the options are store in some database, as is 
likely, surely it is easier to generate a longish command line than to 
generate a unique name for a file, remove it if it already exists, write 
out the data, launch qemu, and clean up the file later?  And for 
migration, you have to regenerate it, since some options may have 
changed (cdrom media).


Of course the config file is very useful for a user maintaining a few 
virtual machines on their own, but when qemu is part of a longer food 
chain, I think it's just a complication.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





Re: [Qemu-devel] QEMU/pc and scsi disks

2007-03-04 Thread Avi Kivity

Paul Brook wrote:

Out of curiosity, why?  If the options are store in some database, as is
likely, surely it is easier to generate a longish command line than to
generate a unique name for a file, remove it if it already exists, write
out the data, launch qemu, and clean up the file later?  And for
migration, you have to regenerate it, since some options may have
changed (cdrom media).



You're going to start hitting commandline length limits fairly rapidly 
(Windows in particular has a fairly low limit).


  


Well, at least on linux, the limit is sane (1024 or 10240?).


--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] QEMU/pc and scsi disks

2007-03-04 Thread Avi Kivity

Chris Wilson wrote:

Hi Avi,

On Sun, 4 Mar 2007, Avi Kivity wrote:

Anthony Liguori wrote:
I think we should still provide the ability to set the most common
options via the command line.  I'm also fine with specifying single
options on the command line.  I suspect though that being able to
do -config - is more useful for management tools than building
large strings of command line options.


Out of curiosity, why?  If the options are store in some database, as 
is likely, surely it is easier to generate a longish command line 
than to generate a unique name for a file, remove it if it already 
exists, write out the data, launch qemu, and clean up the file 
later?  And for migration, you have to regenerate it, since some 
options may have changed (cdrom media).


I think Anthony was suggesting that -config - would read the config 
from standard input, avoiding the need for any temporary file, and 
also avoiding any command-line length limits on any platforms (Win32 
is unlikely to accept as much as Linux).




Ah, I missed the '-'.  Thanks.  Indeed it's quite easy to feed a config 
file from memory into qemu.



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] please review this scsi patch

2007-03-19 Thread Avi Kivity

Wang Cheng Yeh wrote:

thanks


If you include a description of what the patch does and why it is 
necessary, it will probably be reviewed a lot quicker.



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] [RFC/experimental patch] qemu (x86_64 on x86_64 -no-kqemu) compiles with gcc4 and works

2007-03-25 Thread Avi Kivity

Axel Zeuner wrote:
A full featured converter (cvtasm) has a lot of dependencies: it has to 
support all hosts (M) (with all assembler dialects M') and all targets N, 
i.e. in the worst case one would end with M'x N variants of it, or M x N if 
one supports only one assembler dialect per host.  It is clear, that the 
number of variants is one of the biggest disadvantages of such an approach.
  


Perhaps a mixed approach can be made for gradual conversion: for 
combinations where cvtasm has been written, use that.  Where that's 
still to be done, have dyngen generate call instructions to the ops 
instead of pasting the ops text directly.



--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Exception running under KVM

2007-04-04 Thread Avi Kivity

GSMD wrote:

When trying to launch with
--
kvm -no-acpi -nographic -m 512 -cdrom 
/media/ubuntu-7.04-beta-alternate-i386.iso -boot d /opt/ubuntu.img
--
I get
Code:
--
(qemu) exception 12 (0)
  



Please report kvm problems to [EMAIL PROTECTED]

This looks like the real-mode emulation problem kvm has on Intel 
processors.  It is a well known problem, unfortunately difficult to fix.



--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





Re: [kvm-devel] [Qemu-devel] Fixing ACPI for Vista -- Source code for bios.bin vs. bochs CVS?

2007-05-01 Thread Avi Kivity

Paul Brook wrote:

On Monday 30 April 2007, Nakajima, Jun wrote:
  

We found a way to get 32-bit Vista up on KVM by fixing bios.bin. And I
think it should be helpful to qemu as well. However, I'm not sure if the
binary file has been created simply by applying bios.diff to the latest
bochs CVS tree (actually I doubt it).



What makes you think this won't work? Have you tried it?

  


Regardless of whether it works or not, it would be nice to have a file 
in qemu/pc-bios detailing the revision numbers of the sources taken from 
bochs, synced on each update.  Something like the output of


    cvs status | awk '/^File:/ { file = $2 } /^   Working revision:/ { printf("%-20s %s\n", file, $3) }'



--
error compiling committee.c: too many arguments to function





[Qemu-devel] Re: [kvm-devel] qemu/kvm seems to take ALSA all for itself?

2007-05-21 Thread Avi Kivity

David Abrahams wrote:
When I have windows XP running under kvm, I get 

  [EMAIL PROTECTED]:/tmp$ aplay /usr/share/sounds/gaim/receive.wav 
  ALSA lib pcm_dmix.c:864:(snd_pcm_dmix_open) unable to open slave

  aplay: main:550: audio open error: Device or resource busy

As soon as I shut down my VM, though, it works perfectly.  Is this
expected behavior?
  


I think you have to set up a mixer or something.  But I'd rather be deaf 
than have to wade through all the pseudo-documentation in order to find 
out how.


--
error compiling committee.c: too many arguments to function





[Qemu-devel] Re: [kvm-devel] PIIX/IDE: ports disabled in PCI config space?

2007-06-05 Thread Avi Kivity

Luca Tettamanti wrote:

Hello,
I'm testing the new Fedora7 under KVM. As you may know Fedora has
migrated to the new libata drivers.

ata_piix is unhappy with the PIIX IDE controller provided by QEmu/KVM:


  


[...]


The following patch fixes the problem (i.e. ata_piix finds both the HD
and the cdrom):

--- a/hw/ide.c  2007-06-04 19:34:25.0 +0200
+++ b/hw/ide.c  2007-06-04 21:45:28.0 +0200
@@ -2586,6 +2586,8 @@ static void piix3_reset(PCIIDEState *d)
 pci_conf[0x06] = 0x80; /* FBC */
 pci_conf[0x07] = 0x02; // PCI_status_devsel_medium
 pci_conf[0x20] = 0x01; /* BMIBA: 20-23h */
+pci_conf[0x41] = 0x80; // enable port 0
+pci_conf[0x43] = 0x80; // enable port 1
 }
 
 void pci_piix_ide_init(PCIBus *bus, BlockDriverState **hd_table, int devfn)
  


I imagine the reset state in the spec is disabled?  If so, then the 
long-term fix is to enable these bits in the bios.


In any case, I applied this to the kvm repo.  Thanks.

--
error compiling committee.c: too many arguments to function





[Qemu-devel] Re: [kvm-devel] PIIX/IDE: ports disabled in PCI config space?

2007-06-05 Thread Avi Kivity

Luca wrote:


Good point. I'm looking at bochs right now: the BIOS doesn't touch
that bits. The 2 ports are enabled/disabled - using values from the
config file - by the reset function of the emulated controller
(iodev/pci_ide.cc), so they're doing the same thing I've done for
QEMU.
I'd rather not touch the BIOS, it's a bit obscure ;-)


No kidding :-)

Maybe if you post your findings to bochs-devel they'd make the change.

--
error compiling committee.c: too many arguments to function





Re: [Qemu-devel] Detecting Client OS BSOF/Kernel Oops

2007-06-06 Thread Avi Kivity
Clemens Kolbitsch wrote:
 Hi!
 I'd like to detect if the client OS crashes... right now, only for
 linux, but windows systems will become interesting for me as well in
 the future...

 Is there an easy way of detecting if a BSOD or a kernel oops
 happened?? Maybe that'd be possible by checking if the IP is inside a
 certain range (I could find that location, I think... it does not have
 to be working generically... for specific client OS would be
 sufficient)!!

You could try checking whether the hlt instruction (on x86) was executed
with interrupts disabled.
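
A minimal sketch of that check, with invented names; the real hook would
live in the emulator's hlt handling, and IF_MASK matches the x86 EFLAGS
interrupt-enable bit:

#include <stdio.h>

#define IF_MASK 0x0200UL  /* EFLAGS.IF: maskable interrupts enabled */

struct cpu_state {
    unsigned long eflags;
    unsigned long eip;
};

/* Called whenever the guest executes hlt. */
static void on_hlt(const struct cpu_state *env)
{
    if (!(env->eflags & IF_MASK)) {
        /* hlt with IF clear cannot be woken by a maskable interrupt;
         * the guest kernel has most likely panicked or blue-screened. */
        fprintf(stderr, "guest halted with IF=0 at eip=0x%lx\n", env->eip);
    }
}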


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.





[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-10 Thread Avi Kivity

 On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:

Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
  docs/specs/qed_spec.txt |   94 +++
  1 files changed, 94 insertions(+), 0 deletions(-)

+Feature bits:
+* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
+* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.


Great that QED_F_NEED_CHECK is now non-optional.  However, suppose we 
add a new structure (e.g. persistent freelist); it's now impossible to 
tell whether the structure is updated or not:


1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 shuts down
5 old qemu opens image
6 doesn't update persistent freelist
7 clears need_check
8 shuts down

The image is now inconsistent, but has need_check cleared.

We can address this by having a third feature bitmask that is 
autocleared by guests that don't recognize various bits; so the sequence 
becomes:


1 new qemu opens image
2 writes persistent freelist
3 clears need_check
4 sets persistent_freelist
5 shuts down
6 old qemu opens image
7 clears persistent_freelist (and any other bits it doesn't recognize)
8 doesn't update persistent freelist
9 clears need_check
10 shuts down

The image is now consistent, since the persistent freelist has disappeared.
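
A sketch of how an implementation might apply such an autoclear mask at
open time (the header fields and mask constants here are illustrative,
not a final on-disk layout):

#include <stdint.h>

#define QED_FEATURES_KNOWN   0x0003ULL  /* incompatible bits we implement */
#define QED_AUTOCLEAR_KNOWN  0x0001ULL  /* autoclear bits we implement */

struct qed_header {
    uint64_t features;            /* must be understood to open the image */
    uint64_t compat_features;     /* may be ignored */
    uint64_t autoclear_features;  /* cleared when not understood */
};

static int qed_open_header(struct qed_header *h)
{
    if (h->features & ~QED_FEATURES_KNOWN) {
        return -1;  /* unknown incompatible feature: refuse the image */
    }
    if (h->autoclear_features & ~QED_AUTOCLEAR_KNOWN) {
        /* Drop autoclear bits we don't recognize; any structure guarded
         * by such a bit (e.g. a persistent freelist) is thereby marked
         * stale for newer implementations that see the bit cleared. */
        h->autoclear_features &= QED_AUTOCLEAR_KNOWN;
        /* the header must be rewritten to disk before proceeding */
    }
    return 0;
}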


+* QED_CF_BACKING_FORMAT = 0x01.  The image has a specific backing file format 
stored.
+


It was suggested to have just a bit saying whether the backing format is 
raw or not.  This way you don't need to store the format.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-10 Thread Avi Kivity

 On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:

This patch implements the read/write state machine.  Operations are
fully asynchronous and multiple operations may be active at any time.

Allocating writes lock tables to ensure metadata updates do not
interfere with each other.  If two allocating writes need to update the
same L2 table they will run sequentially.  If two allocating writes need
to update different L2 tables they will run in parallel.



Shouldn't there be a flush between an allocating write and an L2 
update?  Otherwise a reuse of a cluster can move logical sectors from 
one place to another, causing a data disclosure.


Can be skipped if the new cluster is beyond the physical image size.


+
+/*
+ * Table locking works as follows:
+ *
+ * Reads and non-allocating writes do not acquire locks because they do not
+ * modify tables and only see committed L2 cache entries.


What about a non-allocating write that follows an allocating write?

1 Guest writes to sector 0
2 Host reads backing image (or supplies zeros), sectors 1-127
3 Host writes sectors 0-127
4 Guest writes sector 1
5 Host writes sector 1

There needs to be a barrier that prevents the host and the disk from 
reordering operations 3 and 5, or guest operation 4 is lost.  As far as 
the guest is concerned no overlapping writes were issued, so it isn't 
required to provide any barriers.


(based on the comment only, haven't read the code)

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:

On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
  Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
  ---
 docs/specs/qed_spec.txt |   94 
+++
 1 files changed, 94 insertions(+), 0 deletions(-)
  
  +Feature bits:
  +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
  +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check before use.

  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
  we add a new structure (e.g. persistent freelist); it's now
  impossible to tell whether the structure is updated or not:

  1 new qemu opens image
  2 writes persistent freelist
  3 clears need_check
  4 shuts down
  5 old qemu opens image
  6 doesn't update persistent freelist
  7 clears need_check
  8 shuts down

  The image is now inconsistent, but has need_check cleared.

  We can address this by having a third feature bitmask that is
  autocleared by guests that don't recognize various bits; so the
  sequence becomes:

  1 new qemu opens image
  2 writes persistent freelist
  3 clears need_check
  4 sets persistent_freelist
  5 shuts down
  6 old qemu opens image
  7 clears persistent_freelist (and any other bits it doesn't recognize)
  8 doesn't update persistent freelist
  9 clears need_check
  10 shuts down

  The image is now consistent, since the persistent freelist has disappeared.

It is more complicated than just the feature bit.  The freelist requires
space in the image file.  Clearing the persistent_freelist bit leaks the
freelist.

We can solve this by using a compat feature bit and an autoclear feature
bit.  The autoclear bit specifies whether or not the freelist is valid
and the compat feature bit specifies whether or not the freelist
exists.

When the new qemu opens the image again it notices that the autoclear
bit is unset but the compat bit is set.  This means the freelist is
out-of-date and its space can be reclaimed.

I don't like the idea of doing these feature bit acrobatics because they
add complexity.  On the other hand your proposal ensures backward
compatibility in the case of an optional data structure that needs to
stay in sync with the image.  I'm just not 100% convinced it's worth it.


My scenario ends up with data corruption if we move to an old qemu and 
then back again, without any aborts.


A leak is acceptable (it won't grow; it's just an unused, incorrect 
freelist), but data corruption is not.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-11 Thread Avi Kivity

 On 10/11/2010 12:37 PM, Stefan Hajnoczi wrote:

On Sun, Oct 10, 2010 at 11:10:15AM +0200, Avi Kivity wrote:
   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
  This patch implements the read/write state machine.  Operations are
  fully asynchronous and multiple operations may be active at any time.
  
  Allocating writes lock tables to ensure metadata updates do not
  interfere with each other.  If two allocating writes need to update the
  same L2 table they will run sequentially.  If two allocating writes need
  to update different L2 tables they will run in parallel.
  

  Shouldn't there be a flush between an allocating write and an L2
  update?  Otherwise a reuse of a cluster can move logical sectors
  from one place to another, causing a data disclosure.

  Can be skipped if the new cluster is beyond the physical image size.

Currently clusters are never reused and new clusters are always beyond
physical image size.  The only exception I can think of is when the
image file size is not a multiple of the cluster size and we round down
to the start of cluster.

In an implementation that supports TRIM or otherwise reuses clusters
this is a cost.

  +
  +/*
  + * Table locking works as follows:
  + *
  + * Reads and non-allocating writes do not acquire locks because they do not
  + * modify tables and only see committed L2 cache entries.

  What about a non-allocating write that follows an allocating write?

  1 Guest writes to sector 0
  2 Host reads backing image (or supplies zeros), sectors 1-127
  3 Host writes sectors 0-127
  4 Guest writes sector 1
  5 Host writes sector 1

  There needs to be a barrier that prevents the host and the disk from
  reordering operations 3 and 5, or guest operation 4 is lost.  As far
  as the guest is concerned no overlapping writes were issued, so it
  isn't required to provide any barriers.

  (based on the comment only, haven't read the code)

There is no barrier between operations 3 and 5.  However, operation 5
only starts after operation 3 has completed because of table locking.
It is my understanding that *independent* requests may be reordered but
two writes to the *same* sector will not be reordered if write A
completes before write B is issued.

Imagine a test program that uses pwrite() to rewrite a counter many
times on disk.  When the program finishes it prints the counter
variable's last value.  This scenario is like operations 3 and 5 above.
If we read the counter back from disk it will be the final value, not
some intermediate value.  The writes will not be reordered.


Yes, all that is needed is a barrier in program order.  So, operation 5 
is an allocating write as long as 3 hasn't returned? (at which point it 
becomes a non-allocating write)?
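
A minimal sketch of the counter test Stefan describes, using plain POSIX
I/O (the file name is made up):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("counter.bin", O_CREAT | O_WRONLY, 0644);
    uint64_t i;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    for (i = 0; i < 1000000; i++) {
        /* Every write targets the same offset, and write i completes
         * before write i+1 is issued; a reader must therefore observe
         * the final value, never an older one. */
        if (pwrite(fd, &i, sizeof(i), 0) != sizeof(i)) {
            perror("pwrite");
            return 1;
        }
    }
    close(fd);
    printf("final counter on disk: 999999\n");
    return 0;
}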


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 05:02 PM, Anthony Liguori wrote:

On 10/11/2010 08:44 AM, Avi Kivity wrote:

 On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:


  A leak is acceptable (it won't grow; it's just an unused, incorrect
  freelist), but data corruption is not.

The alternative is for the freelist to be a non-compat feature bit.
That means older QEMU binaries cannot use a QED image that has enabled
the freelist.


For this one feature.  What about others?


A compat feature is one where the feature can be completely ignored 
(meaning that the QEMU does not have to understand the data format).


An example of a compat feature is copy-on-read.  It's merely a 
suggestion and there is no additional metadata.  If a QEMU doesn't 
understand it, it doesn't affect it's ability to read the image.


An example of a non-compat feature would be zero cluster entries.  
Zero cluster entries are a special L2 table entry that indicates that 
a cluster's on-disk data is all zeros.  As long as there is at least 1 
ZCE in the L2 tables, this feature bit must be set.  As soon as all of 
the ZCE bits are cleared, the feature bit can be unset.


An older QEMU will gracefully fail when presented with an image using 
ZCE bits.  An image with no ZCEs will work on older QEMUs.




What's the motivation behind ZCE?

There is yet a third type of feature, one which is not strictly needed 
in order to use the image, but if used, must be kept synchronized.  An 
example is the freelist.  Another example is a directory index for a 
filesystem.  I can't think of another example which would be relevant to 
QED -- metadata checksums perhaps? -- we can always declare it a 
non-compatible feature, but of course, it reduces compatibility.




--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 04:54 PM, Anthony Liguori wrote:

On 10/11/2010 08:04 AM, Avi Kivity wrote:

 On 10/11/2010 12:09 PM, Stefan Hajnoczi wrote:

On Sun, Oct 10, 2010 at 11:20:09AM +0200, Avi Kivity wrote:
   On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:
Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
 ---
docs/specs/qed_spec.txt |   94 
+++

1 files changed, 94 insertions(+), 0 deletions(-)
 
 +Feature bits:
 +* QED_F_BACKING_FILE = 0x01.  The image uses a backing file.
 +* QED_F_NEED_CHECK = 0x02.  The image needs a consistency check 
before use.


  Great that QED_F_NEED_CHECK is now non-optional.  However, suppose
  we add a new structure (e.g. persistent freelist); it's now
  impossible to tell whether the structure is updated or not:

  1 new qemu opens image
  2 writes persistent freelist
  3 clears need_check
  4 shuts down
  5 old qemu opens image
  6 doesn't update persistent freelist
  7 clears need_check
  8 shuts down

  The image is now inconsistent, but has need_check cleared.

  We can address this by having a third feature bitmask that is
  autocleared by guests that don't recognize various bits; so the
  sequence becomes:

  1 new qemu opens image
  2 writes persistent freelist
  3 clears need_check
  4 sets persistent_freelist
  5 shuts down
  6 old qemu opens image
  7 clears persistent_freelist (and any other bits it doesn't 
recognize)

  8 doesn't update persistent freelist
  9 clears need_check
  10 shuts down

  The image is now consistent, since the persistent freelist has 
disappeared.


It is more complicated than just the feature bit.  The freelist 
requires
space in the image file.  Clearing the persistent_freelist bit leaks 
the

freelist.

We can solve this by using a compat feature bit and an autoclear 
feature

bit.  The autoclear bit specifies whether or not the freelist is valid
and the compat feature bit specifies whether or not the freelist
exists.

When the new qemu opens the image again it notices that the autoclear
bit is unset but the compat bit is set.  This means the freelist is
out-of-date and its space can be reclaimed.

I don't like the idea of doing these feature bit acrobatics because 
they

add complexity.  On the other hand your proposal ensures backward
compatibility in the case of an optional data structure that needs to
stay in sync with the image.  I'm just not 100% convinced it's worth 
it.


My scenario ends up with data corruption if we move to an old qemu 
and then back again, without any aborts.


A leak is acceptable (it won't grow; it's just an unused, incorrect 
freelist), but data corruption is not.


A leak is unacceptable.  It means an image can grow to an unbounded 
size.  If you are a server provider offering multitenancy, then a 
malicious guest can potentially grow the image beyond its allotted 
size causing a Denial of Service attack against another tenant.


This particular leak cannot grow, and is not controlled by the guest.

A freelist has to be a non-optional feature.  When the freelist bit is 
set, an older QEMU cannot read the image.  If the freelist is 
completely used, the freelist bit can be cleared and the image is then 
usable by older QEMUs.


Once we support TRIM (or detect zeros) we'll never have a clean freelist.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 05:30 PM, Stefan Hajnoczi wrote:


  It was discussed before, but I don't think we came to a conclusion. Are
  there any circumstances under which you don't want to set the
  QED_CF_BACKING_FORMAT flag?

I suggest the following:

QED_CF_BACKING_FORMAT_RAW = 0x1

When set, the backing file is a raw image and should not be probed for
its file format.  The default (unset) means that the backing image file
format may be probed.

Now the backing_fmt_{offset,size} are no longer necessary.


Should it not be an incompatible option?  If the backing disk starts 
with a format magic, it will be probed by an older qemu, incorrectly.
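
To make the point concrete, here is a sketch of the open-time decision,
assuming a compat bit named QED_CF_BACKING_FORMAT_RAW as in Stefan's
proposal (the helper itself is invented):

#include <stdint.h>

#define QED_CF_BACKING_FORMAT_RAW 0x1  /* proposed compat bit */

/* Returns "raw" when probing must be avoided, NULL when the caller may
 * probe the backing file.  An old binary that ignores the compat bit
 * would always probe, and could mis-detect a raw disk whose first
 * sector happens to contain another format's magic. */
static const char *backing_format(uint64_t compat_features)
{
    if (compat_features & QED_CF_BACKING_FORMAT_RAW) {
        return "raw";  /* never probe: the guest controls these bytes */
    }
    return NULL;
}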


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 05:49 PM, Anthony Liguori wrote:

On 10/11/2010 09:58 AM, Avi Kivity wrote:
A leak is unacceptable.  It means an image can grow to an unbounded 
size.  If you are a server provider offering multitenancy, then a 
malicious guest can potentially grow the image beyond its allotted 
size causing a Denial of Service attack against another tenant.



This particular leak cannot grow, and is not controlled by the guest.


As the image gets moved from hypervisor to hypervisor, it can keep 
growing if given a chance to fill up the disk, then trim it all away.


In a mixed hypervisor environment, it just becomes a numbers game.


I don't see how it can grow.  Both the freelist and the clusters it 
points to consume space, which becomes a leak once you move it to a 
hypervisor that doesn't understand the freelist.  The older hypervisor 
then allocates new blocks.  As soon as it performs a metadata scan (if 
ever), the freelist is reclaimed.


You could only get a growing leak if you moved it to a hypervisor that 
doesn't perform metadata scans, but then that is independent of the 
freelist.




A freelist has to be a non-optional feature.  When the freelist bit 
is set, an older QEMU cannot read the image.  If the freelist is 
completed used, the freelist bit can be cleared and the image is 
then usable by older QEMUs.


Once we support TRIM (or detect zeros) we'll never have a clean 
freelist.


Zero detection doesn't add to the free list.


Why not?  If a cluster is zero filled, you may drop it (assuming no 
backing image).




A potential solution here is to treat TRIM a little differently than 
we've been discussing.


When TRIM happens, don't immediately write an unallocated cluster 
entry for the L2.  Leave the L2 entry in-tact.  Don't actually write a 
UCE to the L2 until you actually allocate the block.


This implies a cost because you'll need to do metadata syncs to make 
this work.  However, that eliminates leakage.


The information is lost on shutdown; and you can have a large number of 
unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a user 
expecting a visit from RIAA).


A slight twist on your proposal is to have an allocated-but-may-drop bit 
in a L2 entry.  TRIM or zero detection sets the bit (leaving the cluster 
number intact).  A following write to the cluster needs to clear the 
bit; if we reallocate the cluster we need to replace it with a ZCE.


This makes the freelist all L2 entries with the bit set; it may be less 
efficient than a custom data structure though.
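
A sketch of one possible L2 encoding for that bit (the bit position and
helper names are invented):

#include <stdint.h>

#define L2_MAY_DROP    (1ULL << 62)        /* hypothetical flag bit */
#define L2_OFFSET_MASK ((1ULL << 62) - 1)  /* cluster offset bits */

/* TRIM or zero detection: keep the cluster offset, mark it droppable. */
static inline uint64_t l2_mark_trimmed(uint64_t entry)
{
    return entry | L2_MAY_DROP;
}

/* A guest write to the cluster makes the data live again. */
static inline uint64_t l2_on_guest_write(uint64_t entry)
{
    return entry & ~L2_MAY_DROP;
}

/* When the allocator reclaims such a cluster it must first replace the
 * entry with a zero cluster entry (ZCE) so reads keep returning zeros
 * rather than the new owner's data. */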


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 05:41 PM, Anthony Liguori wrote:

On 10/11/2010 10:24 AM, Avi Kivity wrote:

 On 10/11/2010 05:02 PM, Anthony Liguori wrote:

On 10/11/2010 08:44 AM, Avi Kivity wrote:

 On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:


  A leak is acceptable (it won't grow; it's just an unused, 
incorrect

  freelist), but data corruption is not.

The alternative is for the freelist to be a non-compat feature bit.
That means older QEMU binaries cannot use a QED image that has 
enabled

the freelist.


For this one feature.  What about others?


A compat feature is one where the feature can be completely ignored 
(meaning that the QEMU does not have to understand the data format).


An example of a compat feature is copy-on-read.  It's merely a 
suggestion and there is no additional metadata.  If a QEMU doesn't 
understand it, it doesn't affect it's ability to read the image.


An example of a non-compat feature would be zero cluster entries.  
Zero cluster entries are a special L2 table entry that indicates 
that a cluster's on-disk data is all zeros.  As long as there is at 
least 1 ZCE in the L2 tables, this feature bit must be set.  As soon 
as all of the ZCE bits are cleared, the feature bit can be unset.


An older QEMU will gracefully fail when presented with an image 
using ZCE bits.  An image with no ZCEs will work on older QEMUs.




What's the motivation behind ZCE?


It's very useful for Copy-on-Read.  If the cluster in the backing file 
is unallocated, then when you do a copy-on-read, you don't want to 
write out a zero cluster since you'd expand the image to its maximum 
size.


It's also useful for operations like compaction in the absence of 
TRIM.  The common implementation on platforms like VMware is to open a 
file and write zeros to it until it fills up the filesystem.  You then 
delete the file.  The result is that any unallocated data on the disk 
is written as zero and combined with zero-detection in the image 
format, you can compact the image size by marking unallocated blocks 
as ZCE.


Both make sense.  The latter is also useful with TRIM: if you have a 
backing image it's better to implement TRIM with ZCE rather than 
exposing the cluster from the backing file; it saves you a COW when you 
later reallocate the cluster.




There is yet a third type of feature, one which is not strictly 
needed in order to use the image, but if used, must be kept 
synchronized.  An example is the freelist.  Another example is a 
directory index for a filesystem.  I can't think of another example 
which would be relevant to QED -- metadata checksums perhaps? -- we 
can always declare it a non-compatible feature, but of course, it 
reduces compatibility.


You're suggesting a feature that is not strictly needed, but that 
needs to be kept up to date.  If it can't be kept up to date, 
something needs to happen to remove it.  Let's call this a transient 
feature.


Most of the transient features can be removed given some bit of code.  
For instance, ZCE can be removed by writing out zero clusters or 
writing an unallocated L2 entry if there is no backing file.


I think we could add a qemu-img demote command or something like that 
that attempted to remove features when possible.  That doesn't give 
you instant compatibility but I'm doubtful that you can come up with a 
generic way to remove a feature from an image without knowing anything 
about the image.




That should work, and in the worst case there is qemu-img convert (which 
should be taught about format options).


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 04:06 PM, Stefan Hajnoczi wrote:

On Mon, Oct 11, 2010 at 03:44:38PM +0200, Avi Kivity wrote:
   On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:
  
 A leak is acceptable (it won't grow; it's just an unused, incorrect
 freelist), but data corruption is not.
  
  The alternative is for the freelist to be a non-compat feature bit.
  That means older QEMU binaries cannot use a QED image that has enabled
  the freelist.

  For this one feature.  What about others?

Compat features that need to be in sync with the image state will either
require specific checks (e.g. checksum or shadow of the state) or they
need to be non-compat features and are not backwards compatible.

I'm not opposing autoclear feature bits themselves, they are a neat
idea.  However, they will initially have no users so is this something
we really want to carry?


Hard to tell in advance.  It seems like a simple feature with potential 
to make our lives easier in the future.


Anything we develop in the next few months can be rolled back into the 
baseline, assuming we declare the format unstable while 0.14 is in 
development, so this is really about post 0.14 features.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [SeaBIOS] [RFC] Passing boot order from qemu to seabios

2010-10-11 Thread Avi Kivity

 On 10/11/2010 05:39 PM, Gleb Natapov wrote:

On Mon, Oct 11, 2010 at 05:09:22PM +0200, Avi Kivity wrote:
   On 10/11/2010 12:18 PM, Gleb Natapov wrote:
  Currently if VM is started with multiple disks it is almost impossible to
  guess which one of them will be used as boot device especially if there
  is a mix of ATA/virtio/SCSI devices. Essentially BIOS decides the order
  and without looking into the code you can't tell what the order will
  be (and in qemu-kvm if boot=on is used it brings even more havoc). We
  should allow fine-grained control of boot order from qemu command line,
  or as a minimum control what device will be used for booting.
  
  To do that along with inventing syntax to specify boot order on qemu
  command line we need to communicate boot order to seabios via fw_cfg
  interface. For that we need to have a way to unambiguously specify a
  disk from qemu to seabios.  PCI bus address is not enough since not all
  devices are PCI (do we care about them?) and since one PCI device may
  control more then one disk (ATA slave/master, SCSI LUNs). We can do what
  EDD specification does. Describe disk as:
bus type (isa/pci),
address on a bus (16 bit base address for isa, b/s/f for pci)
device type (ATA/SCSI/VIRTIO)
device path (slave/master for ATA, LUN for SCSI, nothing for virtio)
  
  Will it cover all use cases? Any other ideas? Any ideas about qemu
  command line syntax? Maybe somebody wants to implement it? :)

  Instead of fwcfg, we should store the boot order in the bios.  This
  allows seabios to implement persistent boot selection and control
  boot order from within the guest.

It is not instead of, it is at best in addition to. First of
all seabios does not have persistent storage currently, and second I much
prefer specifying the boot device from the command line instead of navigating
bios menus. That's what we have to do on real HW because there is no
other way to do it, but in virtualization we can do better.


Ok.  So fwcfg will have a "do your default thing" option, which the bios 
can take as a hint to look in cmos memory.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [SeaBIOS] [RFC] Passing boot order from qemu to seabios

2010-10-11 Thread Avi Kivity

 On 10/11/2010 12:18 PM, Gleb Natapov wrote:

Currently if VM is started with multiple disks it is almost impossible to
guess which one of them will be used as boot device especially if there
is a mix of ATA/virtio/SCSI devices. Essentially BIOS decides the order
and without looking into the code you can't tell what the order will
be (and in qemu-kvm if boot=on is used it brings even more havoc). We
should allow fine-grained control of boot order from qemu command line,
or as a minimum control what device will be used for booting.

To do that along with inventing syntax to specify boot order on qemu
command line we need to communicate boot order to seabios via fw_cfg
interface. For that we need to have a way to unambiguously specify a
disk from qemu to seabios.  PCI bus address is not enough since not all
devices are PCI (do we care about them?) and since one PCI device may
control more then one disk (ATA slave/master, SCSI LUNs). We can do what
EDD specification does. Describe disk as:
 bus type (isa/pci),
 address on a bus (16 bit base address for isa, b/s/f for pci)
 device type (ATA/SCSI/VIRTIO)
 device path (slave/master for ATA, LUN for SCSI, nothing for virtio)

Will it cover all use cases? Any other ideas? Any ideas about qemu
command line syntax? Maybe somebody wants to implement it? :)


Instead of fwcfg, we should store the boot order in the bios.  This 
allows seabios to implement persistent boot selection and control boot 
order from within the guest.
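
For concreteness, one way such a device path could be encoded (a sketch
only; every name and the layout are invented, not a definition of the
fw_cfg format):

#include <stdint.h>

enum bus_type { BUS_ISA, BUS_PCI };
enum dev_type { DEV_ATA, DEV_SCSI, DEV_VIRTIO };

struct boot_dev_path {
    enum bus_type bus;
    union {
        uint16_t isa_io_base;                   /* ISA: 16-bit base address */
        struct { uint8_t bus, slot, fn; } pci;  /* PCI: bus/slot/function */
    } addr;
    enum dev_type dev;
    union {
        uint8_t ata_unit;  /* 0 = master, 1 = slave */
        uint8_t scsi_lun;  /* SCSI LUN */
    } path;                /* unused for virtio */
};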


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-11 Thread Avi Kivity

 On 10/11/2010 03:42 PM, Stefan Hajnoczi wrote:


  A leak is acceptable (it won't grow; it's just an unused, incorrect
  freelist), but data corruption is not.

The alternative is for the freelist to be a non-compat feature bit.
That means older QEMU binaries cannot use a QED image that has enabled
the freelist.


For this one feature.  What about others?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: [PATCH v2 3/7] docs: Add QED image format specification

2010-10-12 Thread Avi Kivity

 On 10/11/2010 06:10 PM, Anthony Liguori wrote:

On 10/11/2010 11:02 AM, Avi Kivity wrote:

 On 10/11/2010 05:49 PM, Anthony Liguori wrote:

On 10/11/2010 09:58 AM, Avi Kivity wrote:
A leak is unacceptable.  It means an image can grow to an 
unbounded size.  If you are a server provider offering 
multitenancy, then a malicious guest can potentially grow the 
image beyond its allotted size causing a Denial of Service attack 
against another tenant.



This particular leak cannot grow, and is not controlled by the guest.


As the image gets moved from hypervisor to hypervisor, it can keep 
growing if given a chance to fill up the disk, then trim it all away.


In a mixed hypervisor environment, it just becomes a numbers game.


I don't see how it can grow.  Both the freelist and the clusters it 
points to consume space, which becomes a leak once you move it to a 
hypervisor that doesn't understand the freelist.  The older 
hypervisor then allocates new blocks.  As soon as it performs a 
metadata scan (if ever), the freelist is reclaimed.


Assume you don't ever do a metadata scan (which is really our design 
point).


What about crashes?



If you move to a hypervisor that doesn't support it, then move to a 
hypervisor that does, you create a brand new freelist and start 
leaking more space.  This isn't a contrived scenario if you have a 
cloud environment with a mix of hosts.


It's only a leak if you don't do a metadata scan.



You might not be able to get a ping-pong every time you provision, but 
with enough effort, you could create serious problems.


It's really an issue of correctness.  Making correctness trade-offs 
for the purpose of compatibility is a policy decision and not 
something we should bake into an image format.  If a tool feels 
strongly that it's a reasonable trade off to make, it can always fudge 
the feature bits itself.


I think the effort here is reasonable, clearing a bit on startup is not 
that complicated.




A potential solution here is to treat TRIM a little differently than 
we've been discussing.


When TRIM happens, don't immediately write an unallocated cluster 
entry for the L2.  Leave the L2 entry in-tact.  Don't actually write 
a UCE to the L2 until you actually allocate the block.


This implies a cost because you'll need to do metadata syncs to make 
this work.  However, that eliminates leakage.


The information is lost on shutdown; and you can have a large number 
of unallocated-in-waiting clusters (like a TRIM issued by mkfs, or a 
user expecting a visit from RIAA).


A slight twist on your proposal is to have an allocated-but-may-drop 
bit in a L2 entry.  TRIM or zero detection sets the bit (leaving the 
cluster number intact).  A following write to the cluster needs to 
clear the bit; if we reallocate the cluster we need to replace it 
with a ZCE.


Yeah, this is sort of what I was thinking.  You would still want a 
free list but it becomes totally optional because if it's lost, no 
data is leaked (assuming that the older version understands the bit).


I was suggesting that we store that bit in the free list though 
because that lets us support having older QEMUs with absolutely no 
knowledge still work.


It doesn't - on rewrite an old qemu won't clear the bit, so a newer qemu 
would think it's still free.


The autoclear bit solves it nicely - the old qemu automatically drops 
the allocated-but-may-drop bits, undoing any TRIMs (which is 
unfortunate) but preserving consistency.




--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-12 Thread Avi Kivity

 On 10/12/2010 06:16 PM, Anthony Liguori wrote:


It's fairly simple to add a sync to this path.  It's probably worth 
checking the prefill/postfill for zeros and avoiding the write/sync if 
that's the case.  That should optimize the common cases of allocating 
new space within a file.


My intuition is that we can avoid the sync entirely but we'll need to 
think about it further.




I don't think so.  This isn't a guest initiated write so we can't shift 
responsibility to the guest.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH v2 2/7] cutils: Add bytes_to_str() to format byte values

2010-10-13 Thread Avi Kivity

 On 10/08/2010 05:48 PM, Stefan Hajnoczi wrote:

From: Anthony Liguori aligu...@us.ibm.com

This common function converts byte counts to human-readable strings with
proper units.

Signed-off-by: Anthony Liguori aligu...@us.ibm.com
Signed-off-by: Stefan Hajnoczi stefa...@linux.vnet.ibm.com
---
  cutils.c  |   15 +++
  qemu-common.h |1 +
  2 files changed, 16 insertions(+), 0 deletions(-)

diff --git a/cutils.c b/cutils.c
index 6c32198..5041203 100644
--- a/cutils.c
+++ b/cutils.c
@@ -301,3 +301,18 @@ int get_bits_from_size(size_t size)
  return __builtin_ctzl(size);
  #endif
  }
+
+void bytes_to_str(char *buffer, size_t buffer_len, uint64_t size)
+{
+    if (size < (1ULL << 10)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " byte(s)", size);
+    } else if (size < (1ULL << 20)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " KB(s)", size >> 10);
+    } else if (size < (1ULL << 30)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " MB(s)", size >> 20);
+    } else if (size < (1ULL << 40)) {
+        snprintf(buffer, buffer_len, "%" PRIu64 " GB(s)", size >> 30);
+    } else {
+        snprintf(buffer, buffer_len, "%" PRIu64 " TB(s)", size >> 40);
+    }
+}


This will show 1.5GB as 1GB.  Need either floating point with a couple 
of digits of precision, or show 1.5GB as 1500MB.


It also misuses SI prefixes. 1 GB means 10^9 bytes, not 2^30 bytes (with 
a common exception for RAM which is usually packaged in 1.125 times some 
power of two).
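
For illustration, a sketch of one way to keep a fractional digit; the
helper name, the rounding policy, and the switch to binary IEC suffixes
are assumptions, not part of the patch:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

void bytes_to_str_frac(char *buffer, size_t buffer_len, uint64_t size)
{
    static const char *suffix[] = { "bytes", "KiB", "MiB", "GiB", "TiB" };
    double v = (double)size;
    int i = 0;

    while (v >= 1024.0 && i < 4) {
        v /= 1024.0;
        i++;
    }
    if (i == 0) {
        snprintf(buffer, buffer_len, "%" PRIu64 " bytes", size);
    } else {
        snprintf(buffer, buffer_len, "%.1f %s", v, suffix[i]); /* e.g. 1.5 GiB */
    }
}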


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-13 Thread Avi Kivity

 On 10/13/2010 03:24 PM, Anthony Liguori wrote:

On 10/13/2010 08:07 AM, Kevin Wolf wrote:

On 13.10.2010 14:13, Stefan Hajnoczi wrote:

We can avoid it when a backing image is not used.  Your idea to check
for zeroes in the backing image is neat too, it may well reduce the
common case even for backing images.

The additional requirement is that we're extending the file and not
reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
doesn't work on host_devices anyway)


Yes, that's a good point.

BTW, I think we've decided that making it work on host_devices is not 
that bad.


We can add an additional feature called QED_F_PHYSICAL_SIZE.

This feature will add another field to the header that contains an 
offset immediately following the last cluster allocation.


During a metadata scan, we can accurately recreate this field so we 
only need to update this field whenever we clear the header dirty bit 
(which means during an fsync()).


If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the 
header dirty bit.





That means we can maintain the physical size without introducing 
additional fsync()s in the allocation path.  Since we're already 
writing out the header anyway, the write operation is basically free too.


I don't see how it is free.  It's an extra write.  The good news is that 
it's very easy to amortize.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-13 Thread Avi Kivity

 On 10/13/2010 04:11 PM, Anthony Liguori wrote:
Why would you ever update the header, apart from relocating L1 for 
some reason?



To update the L1/L2 tables clean bit.  That's what prevents a check in 
the normal case where you have a clean shutdown.


I see - so you wouldn't update it every allocation, only when the disk 
has been quiet for a while.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-13 Thread Avi Kivity

 On 10/13/2010 04:07 PM, Stefan Hajnoczi wrote:

On Wed, Oct 13, 2010 at 03:50:00PM +0200, Avi Kivity wrote:
   On 10/13/2010 03:24 PM, Anthony Liguori wrote:
  On 10/13/2010 08:07 AM, Kevin Wolf wrote:
  On 13.10.2010 14:13, Stefan Hajnoczi wrote:
  We can avoid it when a backing image is not used.  Your idea to check
  for zeroes in the backing image is neat too, it may well reduce the
  common case even for backing images.
  The additional requirement is that we're extending the file and not
  reusing an old cluster. (And bdrv_has_zero_init() == true, but QED
  doesn't work on host_devices anyway)
  
  Yes, that's a good point.
  
  BTW, I think we've decided that making it work on host_devices is
  not that bad.
  
  We can add an additional feature called QED_F_PHYSICAL_SIZE.
  
  This feature will add another field to the header that contains an
  offset immediately following the last cluster allocation.
  
  During a metadata scan, we can accurately recreate this field so
  we only need to update this field whenever we clear the header
  dirty bit (which means during an fsync()).

  If you make QED_F_PHYSICAL_SIZE an autoclear bit, you don't need the
  header dirty bit.

Do you mean we just need to check the physical size header field against
the actual file size?  If the two are different, then a consistency
check is forced.


I thought you'd only use a header size field when you don't have a real 
file size.  Why do you need both?



  
  That means we can maintain the physical size without introducing
  additional fsync()s in the allocation path.  Since we're already
  writing out the header anyway, the write operation is basically
  free too.

  I don't see how it is free.  It's an extra write.  The good news is
  that it's very easy to amortize.

We only need to update the header field on disk when we're already
updating the header, so it's not even an extra write operation.


Why would you ever update the header, apart from relocating L1 for some 
reason?


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




Re: [Qemu-devel] Re: [PATCH v2 6/7] qed: Read/write support

2010-10-13 Thread Avi Kivity

 On 10/13/2010 04:53 PM, Anthony Liguori wrote:

On 10/13/2010 09:16 AM, Avi Kivity wrote:

 On 10/13/2010 04:11 PM, Anthony Liguori wrote:
Why would you ever update the header, apart from relocating L1 for 
some reason?



To update the L1/L2 tables clean bit.  That's what prevents a check 
in the normal case where you have a clean shutdown.


I see - so you wouldn't update it every allocation, only when the 
disk has been quiet for a while.


Right, the current plan is to flush the header dirty bit on shutdown 
or whenever there is an explicit flush of the device.  Currently that is 
caused by either a guest-initiated flush or an L1 update.


That does add an extra write (and a new write+flush later to mark the 
header dirty again when you start allocating).  I'd drop it and only use 
the timer.


In fact, it adds an extra flush too.  The sequence

1 L1 update
2 mark clean
3 flush

is unsafe since you can crash between 2 and 3, and only 2 makes it.  So 
I'd do something like


1 opportunistic flush (for whatever reason)
2 set timer
3 no intervening metadata changes
4 mark clean
5 no intervening metadata changes
6 mark dirty
7 flush
8 metadata changes
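
A sketch of that sequence as event-loop code; all names are invented and
the timer plumbing is elided:

#include <stdbool.h>
#include <stdint.h>

struct qed_state {
    uint64_t metadata_gen;  /* bumped on every metadata change */
    uint64_t gen_at_flush;  /* snapshot taken when the timer is armed */
    bool header_clean;
};

/* Steps 1-2: after an opportunistic flush, snapshot the generation and
 * arm a timer for the quiet period. */
void on_flush_done(struct qed_state *s)
{
    s->gen_at_flush = s->metadata_gen;
    /* arm_timer(mark_clean_cb, s); */
}

/* Steps 3-4: mark clean only if nothing changed while we waited. */
void mark_clean_cb(struct qed_state *s)
{
    if (s->metadata_gen == s->gen_at_flush) {
        s->header_clean = true;  /* write header with the clean bit set */
    }
}

/* Steps 6-8: before the next metadata change, mark dirty and flush. */
void before_metadata_change(struct qed_state *s)
{
    if (s->header_clean) {
        s->header_clean = false;  /* write header, then flush, first */
    }
    s->metadata_gen++;
}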


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.




[Qemu-devel] Re: [patch 0/8] port qemu-kvm's MCE support (v3)

2010-10-14 Thread Avi Kivity

 On 10/11/2010 08:31 PM, Marcelo Tosatti wrote:

Port qemu-kvm's KVM MCE (Machine Check Exception) handling to qemu. It
allows qemu to propagate MCEs to the guest.

v2:
- rename do_qemu_ram_addr_from_host.
- fix kvm_on_sigbus/kvm_on_sigbus_vcpu naming.
- fix bank register restoration (Dean Nelson).

v3:
- condition MCE generation on MCE_SEG_P bit (Huang Ying).



I only see patches 1 and 4 from v2, and this cover letter from v3.  
Please repost.


Also, if the patchset ends up with qemu-kvm master being different from 
uq/master in this area, please post the corresponding qemu-kvm master 
patches.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Hitting 29 NIC limit

2010-10-14 Thread Avi Kivity

 On 10/14/2010 12:54 AM, Anthony Liguori wrote:

On 10/13/2010 05:32 PM, Anjali Kulkarni wrote:


Hi,

Using the legacy way of starting up NICs, I am hitting a limitation
after 29 NICs, i.e. no more than 29 are detected (that's because of the
32 PCI slot limit on a single bus; 3 are already taken up).
I had initially increased MAX_NICS to 48, just on my tree, to get
more, but of course that won't work.
Is there any way to go beyond 29 NICs the legacy way?  What is the
maximum that can be supported by the qdev method?


I got up to 104 without trying very hard using the following script:

args=""
for slot in 5 6 7 8 9 10 11 12 13 14 15 16 17; do
    for fn in 0 1 2 3 4 5 6 7; do
        args="$args -netdev user,id=eth${slot}_${fn}"
        args="$args -device virtio-net-pci,addr=${slot}.${fn},netdev=eth${slot}_${fn},multifunction=on,romfile="
    done
done

x86_64-softmmu/qemu-system-x86_64 -hda ~/images/linux.img ${args} -enable-kvm


The key is to make the virtio-net devices multifunction and to fill 
out all 8 functions for each slot.


This is unlikely to work right wrt pci hotplug.  If we want to support a 
large number of interfaces, we need true multiport cards.


What's the motivation for such a huge number of interfaces?

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Hitting 29 NIC limit

2010-10-14 Thread Avi Kivity

 On 10/14/2010 02:54 PM, Anthony Liguori wrote:
The key is to make the virtio-net devices multifunction and to fill 
out all 8 functions for each slot.


This is unlikely to work right wrt pci hotplug.



Yes.  Our hotplug design is based on devices.  This is wrong, it 
should be based on bus-level concepts (like PCI slots).


If we want to support a large number of interfaces, we need true 
multiport cards.


This magic here creates a multiport virtio-net card so I'm not really 
sure what you're suggesting.  It would certainly be nice to make this 
all more user friendly (and make hotplug work).




The big issue is to fix hotplug.

I don't see how we can make it user friendly, without making the 
ordinary case even more unfriendly.  Looks like we need yet another 
level of indirection here.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Hitting 29 NIC limit

2010-10-14 Thread Avi Kivity

 On 10/14/2010 04:11 PM, Anthony Liguori wrote:

On 10/14/2010 08:23 AM, Avi Kivity wrote:

 On 10/14/2010 02:54 PM, Anthony Liguori wrote:
The key is to make the virtio-net devices multifunction and to 
fill out all 8 functions for each slot.


This is unlikely to work right wrt pci hotplug.



Yes.  Our hotplug design is based on devices.  This is wrong, it 
should be based on bus-level concepts (like PCI slots).


If we want to support a large number of interfaces, we need true 
multiport cards.


This magic here creates a multiport virtio-net card so I'm not 
really sure what you're suggesting.  It would certainly be nice to 
make this all more user friendly (and make hotplug work).




The big issue is to fix hotplug.


Yes, but this is entirely independent of multifunction devices.

Today we shoe-horn hot remove into device_del.  Instead, we should 
have explicit bus-level interfaces for hot remove.


I'm not saying multiplug is not the right way to approach this (it is).  
The only concern is to get hotplug right.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets

2010-10-14 Thread Avi Kivity

 On 10/14/2010 11:15 AM, Stefan Hajnoczi wrote:

I forgot to add that the semantics of cancellation make it difficult
to write correct user code.  Every cancellation user needs to add
extra synchronization after the cancel call to handle the case where
the work is currently executing.

This seems tricky to me and I suspect code using this interface will
be buggy.  How about the following?
1. Add a return value indicating that the work is currently executing
(this still requires the caller to add extra synchronization but is at
least explicit) versus work is no longer on the list.
2. Add a flag to block until the work has been cancelled or completed.
  This is useful to callers who are allowed to block.


Blocking is somewhat against the spirit of the thing, no?  While I agree 
that the current cancel API is hard to use correctly, blocking defeats 
the purpose of the API.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets

2010-10-17 Thread Avi Kivity

 On 10/14/2010 11:32 PM, Venkateswararao Jujjuri (JV) wrote:


  Blocking is somewhat against the spirit of the thing, no?  While I agree that
  the current cancel API is hard to use correctly, blocking defeats the 
purpose of
  the API.

Are you proposing to add additional state in the return
(canceled/running/not-canceled)
and leave the synchronization part to the user?
i.e. not to provide any additional interface for the user to wait
for the scheduled work to finish? Just trying to understand.


I wasn't proposing anything since I don't have a good proposal.  Adding 
a callback makes the whole thing an asynchronous design which threads 
are trying to avoid.  Blocking is bad.  Leaving it to the caller is hard 
to use correctly.


Perhaps we can have a threadlet with barrier semantics.  You queue a 
piece of work which is guaranteed to execute after all previously 
submitted work (against the same queue) and before any subsequently 
submitted work.
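
To make the idea concrete, here is a minimal sketch of such a barrier,
assuming the queue is drained in FIFO order by a single worker thread;
the ThreadletWork/threadlet_submit names are hypothetical, not the API
from this series:

#include <pthread.h>

typedef struct ThreadletWork {
    void (*func)(struct ThreadletWork *work);
} ThreadletWork;

/* Provided by the (hypothetical) queue implementation. */
void threadlet_submit(ThreadletWork *work);

typedef struct {
    ThreadletWork work;           /* first member, so a cast suffices */
    pthread_mutex_t lock;
    pthread_cond_t reached;
    int done;
} BarrierWork;

static void barrier_func(ThreadletWork *work)
{
    BarrierWork *b = (BarrierWork *)work;

    /* FIFO + single worker: all earlier work has already completed. */
    pthread_mutex_lock(&b->lock);
    b->done = 1;
    pthread_cond_signal(&b->reached);
    pthread_mutex_unlock(&b->lock);
}

static void threadlet_barrier_wait(void)
{
    BarrierWork b = {
        .work = { .func = barrier_func },
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .reached = PTHREAD_COND_INITIALIZER,
        .done = 0,
    };

    threadlet_submit(&b.work);
    pthread_mutex_lock(&b.lock);
    while (!b.done)
        pthread_cond_wait(&b.reached, &b.lock);
    pthread_mutex_unlock(&b.lock);
}

A caller that must cancel-and-wait can submit the barrier after the
cancel; once the barrier runs, the cancelled work is guaranteed to no
longer be executing.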


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [patch 0/8] port qemu-kvm's MCE support (v3 resend)

2010-10-17 Thread Avi Kivity

 On 10/11/2010 08:31 PM, Marcelo Tosatti wrote:

Port qemu-kvm's KVM MCE (Machine Check Exception) handling to qemu. It
allows qemu to propagate MCEs to the guest.

v2:
- rename do_qemu_ram_addr_from_host.
- fix kvm_on_sigbus/kvm_on_sigbus_vcpu naming.
- fix bank register restoration (Dean Nelson).

v3:
- condition MCE generation on MCE_SEG_P bit (Huang Ying).




Thanks, applied to uq/master.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/3] Introduce threadlets

2010-10-18 Thread Avi Kivity

 On 10/18/2010 12:47 PM, Arun R Bharadwaj wrote:

* Avi Kivity a...@redhat.com [2010-10-17 10:57:23]:

   On 10/14/2010 11:32 PM, Venkateswararao Jujjuri (JV) wrote:
  
 Blocking is somewhat against the spirit of the thing, no?  While I 
agree that
 the current cancel API is hard to use correctly, blocking defeats the 
purpose of
 the API.
  
  Are you proposing to add additional state in the return
  (canceled/running/not-canceled)
  and leave the synchronization part to the user?
  i.e. not to provide any additional interface for the user to wait
  for the scheduled work to finish? Just trying to understand.

  I wasn't proposing anything since I don't have a good proposal.
  Adding a callback makes the whole thing an asynchronous design which
  threads are trying to avoid.  Blocking is bad.  Leaving it to the
  caller is hard to use correctly.

  Perhaps we can have a threadlet with barrier semantics.  You queue a
  piece of work which is guaranteed to execute after all previously
  submitted work (against the same queue) and before any subsequently
  submitted work.

  -- 
  error compiling committee.c: too many arguments to function




I would suggest that we have 2 APIs - cancel_threadletwork (current
cancel implementation) and cancel_threadletwork_sync (waits for work
to complete). As of now there is no known user for
cancel_threadletwork_sync. So we can keep this as a TODO for later. I
can provide the APIs for both these so that when we have a user for
cancel_threadletwork_sync, we can go ahead and implement it.


I agree it's best not to implement c_t_s() now.  Using it implies a 
stall so we should discourage it.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH] Add support for async page fault to qemu

2010-10-18 Thread Avi Kivity

 On 10/18/2010 05:48 PM, Juan Quintela wrote:

Gleb Natapov g...@redhat.com wrote:
  Add save/restore of MSR for migration and cpuid bit.

Is there a way to test if async page faults are in use?


Yes, msr != 0 -> need a subsection.  Good idea.


if so, we can add a subsection instead of changing the cpu version.

I think that at some point we are going to need a bitmap that indicates
what MSRs have been used or something like that.

What do you think?


We just need to check if an msr is different from its default value 
(which we can get by reading msrs immediately after the initial reset).


Currently the reset code assumes msr reset value is zero, that's wrong.
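
For illustration, such a subsection would only be sent when the MSR is
live; a sketch against the vmstate API of the period (the
async_pf_en_msr field name is an assumption):

static bool async_pf_msr_needed(void *opaque)
{
    CPUState *env = opaque;

    /* Skip the subsection when the guest never enabled the MSR. */
    return env->async_pf_en_msr != 0;
}

static const VMStateDescription vmstate_async_pf_msr = {
    .name = "cpu/async_pf_msr",
    .version_id = 1,
    .minimum_version_id = 1,
    .fields = (VMStateField[]) {
        VMSTATE_UINT64(async_pf_en_msr, CPUState),
        VMSTATE_END_OF_LIST()
    }
};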

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH] Implement a virtio GPU transport

2010-10-19 Thread Avi Kivity

 On 10/19/2010 12:31 PM, Ian Molton wrote:



and virtualization@, many virtio developers live there.


you mean virtualizat...@lists.osdl.org ?


Yes.




2. should start with a patch to the virtio-pci spec to document what
you're doing


Where can I find that spec?


http://ozlabs.org/~rusty/virtio-spec/




+ /* Transfer data */
+ if (virtqueue_add_buf(vq, sg_list, o_page, i_page, (void *)1) >= 0) {
+ virtqueue_kick(vq);
+ /* Chill out until it's done with the buffer. */
+ while (!virtqueue_get_buf(vq, &count))
+ cpu_relax();
+ }
+


This is pretty gross, and will burn lots of cpu if the hypervisor
processes the queue asynchronously.


It doesn't, at present... It could be changed fairly easily without 
breaking anything if that happens though.


The hypervisor and the guest can be changed independently.  The driver 
should be coded so that it doesn't depend on hypervisor implementation 
details.
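
For example, the guest could sleep on a completion signalled from the
virtqueue callback instead of spinning; a sketch against the Linux
virtio API of that era, with hypothetical vgpu_* names:

#include <linux/virtio.h>
#include <linux/scatterlist.h>
#include <linux/completion.h>
#include <linux/errno.h>

/* Signalled by the virtqueue callback registered at find_vqs() time;
 * a real driver would want one completion per in-flight request. */
static DECLARE_COMPLETION(vgpu_done);

static void vgpu_vq_cb(struct virtqueue *vq)
{
    complete(&vgpu_done);
}

static int vgpu_transfer(struct virtqueue *vq, struct scatterlist *sg_list,
                         unsigned int o_page, unsigned int i_page)
{
    unsigned int count;

    if (virtqueue_add_buf(vq, sg_list, o_page, i_page, (void *)1) < 0)
        return -ENOSPC;
    virtqueue_kick(vq);
    wait_for_completion(&vgpu_done);   /* sleep instead of cpu_relax() */
    virtqueue_get_buf(vq, &count);
    return 0;
}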


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: KVM call agenda for Oct 19

2010-10-19 Thread Avi Kivity

 On 10/19/2010 02:48 PM, Dor Laor wrote:

On 10/19/2010 04:11 AM, Chris Wright wrote:

* Juan Quintela (quint...@redhat.com) wrote:


Please send in any agenda items you are interested in covering.


- 0.13.X -stable handoff
- 0.14 planning
- threadlet work
- virtfs proposals



- Live snapshots
  - We were asked to add this feature for external qcow2
images. Will the simple approach of fsync + tracking each requested
backing file (it can be per vDisk) and re-opening the new image
be accepted?
  - Integration with FS freeze for consistent guest app snapshot
Many apps do not sync their ram state to disk correctly or frequently
enough. Physical world backup software calls fs freeze on xfs and
VSS for windows to make the backup consistent.
In order to integrate this with live snapshots we need a guest
agent to trigger the guest fs freeze.
We can either have qemu communicate with the agent directly through
virtio-serial or have a mgmt daemon use virtio-serial to
communicate with the guest in addition to QMP messages about the
live snapshot state.
Preferences? The first solution complicates qemu while the second
complicates mgmt.


Third option, make the freeze path management -> qemu -> virtio-blk -> 
guest kernel -> file systems.  The advantage is that it's easy to 
associate file systems with a block device this way.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: KVM call agenda for Oct 19

2010-10-19 Thread Avi Kivity

 On 10/19/2010 03:22 PM, Anthony Liguori wrote:


I had assumed that this would involve:

qemu -hda windows.img

(qemu) snapshot ide0-disk0 snap0.img

1) create snap0.img internally by doing the equivalent of `qemu-img 
create -f qcow2 -b windows.img snap0.img'

2) bdrv_flush('ide0-disk0')
3) bdrv_open(snap0.img)
4) bdrv_close(windows.img)
5) rename('windows.img', 'windows.img.tmp')
6) rename('snap0.img', 'windows.img')
7) rename('windows.img.tmp', 'snap0.img')



Looks reasonable.

Would be interesting to look at this as a use case for the threading 
work.  We should eventually be able to create a snapshot without 
stalling vcpus (stalling I/O of course allowed).
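
For reference, steps 1 and 5-7 above correspond to these shell
equivalents (illustrative only; steps 2-4 happen inside qemu):

qemu-img create -f qcow2 -b windows.img snap0.img
mv windows.img windows.img.tmp
mv snap0.img windows.img
mv windows.img.tmp snap0.img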


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: KVM call agenda for Oct 19

2010-10-19 Thread Avi Kivity

 On 10/19/2010 02:58 PM, Dor Laor wrote:

On 10/19/2010 02:55 PM, Avi Kivity wrote:

On 10/19/2010 02:48 PM, Dor Laor wrote:

On 10/19/2010 04:11 AM, Chris Wright wrote:

* Juan Quintela (quint...@redhat.com) wrote:


Please send in any agenda items you are interested in covering.


- 0.13.X -stable handoff
- 0.14 planning
- threadlet work
- virtfs proposals



- Live snapshots
- We were asked to add this feature for external qcow2
images. Will the simple approach of fsync + tracking each requested
backing file (it can be per vDisk) and re-opening the new image
be accepted?
- Integration with FS freeze for consistent guest app snapshot
Many apps do not sync their ram state to disk correctly or frequently
enough. Physical world backup software calls fs freeze on xfs and
VSS for windows to make the backup consistent.
In order to integrate this with live snapshots we need a guest
agent to trigger the guest fs freeze.
We can either have qemu communicate with the agent directly through
virtio-serial or have a mgmt daemon use virtio-serial to
communicate with the guest in addition to QMP messages about the
live snapshot state.
Preferences? The first solution complicates qemu while the second
complicates mgmt.


Third option, make the freeze path management -> qemu -> virtio-blk ->
guest kernel -> file systems. The advantage is that it's easy to
associate file systems with a block device this way.


OTOH the userspace freeze path already exists and now you create another 
path. 


I guess we would still have a userspace daemon; instead of talking to 
virtio-serial it talks to virtio-blk.  So:


  management -> qemu -> virtio-blk -> guest driver -> kernel fs 
resolver -> daemon -> apps


Yuck.


What about FS that span over LVM with multiple drives? IDE/SCSI?


Good points.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] Re: KVM call agenda for Oct 19

2010-10-19 Thread Avi Kivity

 On 10/19/2010 03:38 PM, Stefan Hajnoczi wrote:

bdrv_aio_freeze() or any mechanism to deal with pending requests in
the generic block code would be a good step for future live support
of other operations like truncate.


+ logical disk grow, etc.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] KVM call minutes for Oct 19

2010-10-20 Thread Avi Kivity

 On 10/20/2010 10:21 AM, Alexander Graf wrote:

On 19.10.2010, at 17:14, Chris Wright wrote:

  0.13.X -stable
  - Anthony will send note to qemu-devel on this
  - move 0.13.X -stable to a separate tree
  - driven independently of main qemu tree
  - challenge is always in the porting and testing of backported fixes
  - looking for volunteers

  0.14
  - would like to do this before end of the year
  - 0.13 forked off a while back (~July),
  - 0.14 features
   - QMP stabilized
 - 0.13.0 -> 0.14 QMP
 - hard attempt not to break compatibility
 - new commands, rework, async, human monitor passthrough
 - goal getting to libvirt not needing human monitor at all
 - QMP KVM autotest test suite submitted
  - in-kernel apic, tpr patching still outstanding
  - QED coroutine concurrency

Would it be realistic to declare deprecating the qemu-kvm fork for 0.14 as a goal?


For general use perhaps, device assignment might need another cycle.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [SeaBIOS] [PATCH] mark irq9 active high in DSDT

2010-10-21 Thread Avi Kivity

 On 10/21/2010 04:00 AM, Kevin O'Connor wrote:

On Wed, Oct 20, 2010 at 11:34:41AM +0200, Gleb Natapov wrote:
  In PIIX4 SCI (irq9) is active high. Seabios marks it so in interrupt
  override table, but some OSes (FreeBSD) require the same information to
  be present in DSDT too. Make it so.

  Signed-off-by: Gleb Natapov g...@redhat.com

Thanks.


How do we manage the stable series wrt this issue?

qemu-kvm-0.12.5 has a regression within the stable series that this 
patch fixes.  qemu 0.12.5 does not, but only because it does not emulate 
polarity in the I/O APIC correctly.


There are several paths we could take:

- do nothing, bug is fixed in mainline
- release a seabios 0.x.1 for qemu 0.13.1 with this patch
- same, plus seabios 0.y.1 for qemu 0.12.6 with this patch
- skip qemu (which is not truly affected), patch qemu-kvm's copy of 
seabios for both 0.12.z and 0.13.z


The third option is the most correct from a release engineering point 
of view, but involves more work for everyone.  The fourth is quick pain 
relief but is a little forky.


--
error compiling committee.c: too many arguments to function




Re: [SeaBIOS] [Qemu-devel] [PATCH 1/2] pci: Automatically patch PCI device id in PCI ROM

2010-10-21 Thread Avi Kivity

 On 10/18/2010 08:39 PM, Anthony Liguori wrote:
SeaBIOS rejects them when loaded from the rom bar and doesn't reject 
them when loaded via fw_cfg.



What I meant was, rombar=0 in qdev.  Sometimes my fingers don't work 
the same way my brain does :-)



Are you using qmp or the human monitor protocol?

--
error compiling committee.c: too many arguments to function




Re: [SeaBIOS] [Qemu-devel] [PATCH 1/2] pci: Automatically patch PCI device id in PCI ROM

2010-10-21 Thread Avi Kivity

 On 10/21/2010 03:14 PM, Anthony Liguori wrote:

On 10/21/2010 05:09 AM, Avi Kivity wrote:

 On 10/18/2010 08:39 PM, Anthony Liguori wrote:
SeaBIOS rejects them when loaded from the rom bar and doesn't 
reject them when loaded via fw_cfg.



What I meant was, rombar=0 in qdev.  Sometimes my fingers don't work 
the same way my brain does :-)



Are you using qmp or the human monitor protocol?


I'm still running on a DCE/RPC implementation from the early 80s.



Well, I hope you keep it up to date.  I wouldn't want a vulnerability 
inserted into qemu by an attacker controlling a maintainer's hands.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH 0/2] Type-safe ioport callbacks

2010-10-24 Thread Avi Kivity
A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler
complaining because of the type-unsafeness of the ioport callbacks.  This
patchset adds a type-safe variant of ioport callbacks and converts a sample
ioport.  Converting the other 300-odd registrations is left as an exercise
to the community.

Avi Kivity (2):
  Type-safe ioport callbacks
  piix4 acpi: convert io BAR to type-safe ioport callbacks

 hw/acpi_piix4.c |   30 +++--
 ioport.c|   64 +++
 ioport.h|   16 +
 3 files changed, 98 insertions(+), 12 deletions(-)

-- 
1.7.3.1




[Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks

2010-10-24 Thread Avi Kivity
The current ioport callbacks are not type-safe, in that they accept an opaque
pointer as an argument whose type must match the argument to the registration
function; this is not checked by the compiler.

This patch adds an alternative that is type-safe.  Instead of an opaque
argument, both registration and the callback use a new IOPort type.  The
callback then uses container_of() to access its main structures.

Currently the old and new methods exist side by side; once the old way is gone,
we can also save a bunch of memory since the new method requires one pointer
per ioport instead of 6.

Signed-off-by: Avi Kivity a...@redhat.com
---
 ioport.c |   64 ++
 ioport.h |   16 +++
 2 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index ec3dc65..47eafc3 100644
--- a/ioport.c
+++ b/ioport.c
@@ -174,6 +174,70 @@ int register_ioport_write(pio_addr_t start, int length, int size,
 return 0;
 }
 
+static uint32_t ioport_readb_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readb(ioport, addr);
+}
+
+static uint32_t ioport_readw_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readw(ioport, addr);
+}
+
+static uint32_t ioport_readl_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readl(ioport, addr);
+}
+
+static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->writeb(ioport, addr, data);
+}
+
+static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->writew(ioport, addr, data);
+}
+
+static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->writel(ioport, addr, data);
+}
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length)
+{
+if (ioport->ops->readb) {
+register_ioport_read(start, length, 1, ioport_readb_thunk, ioport);
+}
+if (ioport->ops->readw) {
+register_ioport_read(start, length, 2, ioport_readw_thunk, ioport);
+}
+if (ioport->ops->readl) {
+register_ioport_read(start, length, 4, ioport_readl_thunk, ioport);
+}
+if (ioport->ops->writeb) {
+register_ioport_write(start, length, 1, ioport_writeb_thunk, ioport);
+}
+if (ioport->ops->writew) {
+register_ioport_write(start, length, 2, ioport_writew_thunk, ioport);
+}
+if (ioport->ops->writel) {
+register_ioport_write(start, length, 4, ioport_writel_thunk, ioport);
+}
+}
+
 void isa_unassign_ioport(pio_addr_t start, int length)
 {
 int i;
diff --git a/ioport.h b/ioport.h
index 3d3c8a3..8036e59 100644
--- a/ioport.h
+++ b/ioport.h
@@ -36,6 +36,22 @@ typedef uint32_t pio_addr_t;
 typedef void (IOPortWriteFunc)(void *opaque, uint32_t address, uint32_t data);
 typedef uint32_t (IOPortReadFunc)(void *opaque, uint32_t address);
 
+struct IOPort;
+
+typedef struct IOPortOps {
+uint32_t (*readb)(struct IOPort *ioport, pio_addr_t addr);
+uint32_t (*readw)(struct IOPort *ioport, pio_addr_t addr);
+uint32_t (*readl)(struct IOPort *ioport, pio_addr_t addr);
+void (*writeb)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+void (*writew)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+void (*writel)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+} IOPortOps;
+
+typedef struct IOPort {
+IOPortOps *ops;
+} IOPort;
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length);
 int register_ioport_read(pio_addr_t start, int length, int size,
  IOPortReadFunc *func, void *opaque);
 int register_ioport_write(pio_addr_t start, int length, int size,
-- 
1.7.3.1




[Qemu-devel] [PATCH 2/2] piix4 acpi: convert io BAR to type-safe ioport callbacks

2010-10-24 Thread Avi Kivity
Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/acpi_piix4.c |   30 ++
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f74f34c..5772667 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@ struct pci_status {
 
 typedef struct PIIX4PMState {
 PCIDevice dev;
+IOPort ioport;
 uint16_t pmsts;
 uint16_t pmen;
 uint16_t pmcntrl;
@@ -128,9 +129,9 @@ static void pm_tmr_timer(void *opaque)
 pm_update_sci(s);
 }
 
-static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writew(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 addr &= 0x3f;
 switch(addr) {
 case 0x00:
@@ -184,9 +185,9 @@ static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
 PIIX4_DPRINTF("PM writew port=0x%04x val=0x%04x\n", addr, val);
 }
 
-static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readw(IOPort *ioport, pio_addr_t addr)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 uint32_t val;
 
 addr &= 0x3f;
@@ -208,15 +209,15 @@ static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
 return val;
 }
 
-static void pm_ioport_writel(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writel(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-//PIIX4PMState *s = opaque;
+//PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 PIIX4_DPRINTF("PM writel port=0x%04x val=0x%08x\n", addr & 0x3f, val);
 }
 
-static uint32_t pm_ioport_readl(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readl(IOPort *ioport, uint32_t addr)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 uint32_t val;
 
 addr &= 0x3f;
@@ -265,10 +266,7 @@ static void pm_io_space_update(PIIX4PMState *s)
 
 /* XXX: need to improve memory and ioport allocation */
 PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
-register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
-register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
-register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
-register_ioport_read(pm_io_base, 64, 4, pm_ioport_readl, s);
+ioport_register(&s->ioport, pm_io_base, 64);
 }
 }
 
@@ -413,6 +411,13 @@ static int piix4_pm_initfn(PCIDevice *dev)
 return 0;
 }
 
+static IOPortOps pm_ioport_ops = {
+.readw = pm_ioport_readw,
+.readl = pm_ioport_readl,
+.writew = pm_ioport_writew,
+.writel = pm_ioport_writel,
+};
+
 i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
qemu_irq sci_irq, qemu_irq cmos_s3, qemu_irq smi_irq,
int kvm_enabled)
@@ -424,6 +429,7 @@ i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
 qdev_prop_set_uint32(&dev->qdev, "smb_io_base", smb_io_base);
 
 s = DO_UPCAST(PIIX4PMState, dev, dev);
+s->ioport.ops = &pm_ioport_ops;
 s->irq = sci_irq;
 s->cmos_s3 = cmos_s3;
 s->smi_irq = smi_irq;
-- 
1.7.3.1




[Qemu-devel] Re: [PATCH 1/2] Type-safe ioport callbacks

2010-10-24 Thread Avi Kivity

 On 10/24/2010 05:34 PM, Avi Kivity wrote:

Currently the old and new methods exist side by side; once the old way is gone,
we can also save a bunch of memory since the new method requires one pointer
per ioport instead of 6.


Actually, 1:7, we replace 3 read callbacks, 3 write callbacks, and 1 
opaque pointer with a single IOPort pointer.


--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH 0/2] Type-safe ioport callbacks

2010-10-24 Thread Avi Kivity

 On 10/24/2010 07:35 PM, Paolo Bonzini wrote:

On 10/24/2010 05:34 PM, Avi Kivity wrote:

A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler
complaining because of the type-unsafeness of the ioport callbacks.  
This
patchset adds a type-safe variant of ioport callbacks and converts a 
sample
ioport.  Converting the other 300-odd registrations is left as an 
exercise

to the community.


Should we create a Documentation/ file with incomplete transitions and 
the commit(s) that introduced them, for volunteers who wish to do some 
dirty work or to learn Coccinelle?


If we have a TODO, we could add a janitor section there.

--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks

2010-10-25 Thread Avi Kivity

 On 10/24/2010 08:14 PM, Blue Swirl wrote:

On Sun, Oct 24, 2010 at 3:34 PM, Avi Kivity a...@redhat.com wrote:
  The current ioport callbacks are not type-safe, in that they accept an 
opaque
  pointer as an argument whose type must match the argument to the registration
  function; this is not checked by the compiler.

  This patch adds an alternative that is type-safe.  Instead of an opaque
  argument, both registration and the callback use a new IOPort type.  The
  callback then uses container_of() to access its main structures.

  Currently the old and new methods exist side by side; once the old way is 
gone,
  we can also save a bunch of memory since the new method requires one pointer
  per ioport instead of 6.

  Signed-off-by: Avi Kivitya...@redhat.com

If we are going to change the interface, let's do it so that it's
useful for other uses too:
http://article.gmane.org/gmane.comp.emulators.qemu/76984


I don't really see why we need registration; cpu_register_io() takes 
function pointers, a size, and an opaque, and gives an integer handle in 
return.  With the IOPort object approach, you set up the IOPort with 
function pointers, size is implied, and the opaque is derived using 
container_of(); the handle is simply the address of the object.
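
For readers unfamiliar with the pattern: container_of() recovers the
enclosing structure from a pointer to an embedded member.  A generic
sketch (not qemu's exact macro; MyDevice is a made-up example using the
IOPort and pio_addr_t types from the patch):

#include <stddef.h>

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct MyDevice {
    int state;
    IOPort ioport;             /* embedded, no separate allocation */
};

static uint32_t my_readb(IOPort *ioport, pio_addr_t addr)
{
    struct MyDevice *d = container_of(ioport, struct MyDevice, ioport);

    return d->state;
}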



  +typedef struct IOPort {
  +IOPortOps *ops;

const


Yup, will fix.  Will repost once we agree on the approach.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [SeaBIOS] [PATCH] mark irq9 active high in DSDT

2010-10-25 Thread Avi Kivity

 On 10/23/2010 04:12 PM, Kevin O'Connor wrote:

On Thu, Oct 21, 2010 at 12:07:17PM +0200, Avi Kivity wrote:
  How do we manage the stable series wrt this issue?

  qemu-kvm-0.12.5 has a regression within the stable series that this
  patch fixes.  qemu 0.12.5 does not, but only because it does not
  emulate polarity in the I/O APIC correctly.

  There are several paths we could take:

  - do nothing, bug is fixed in mainline
  - release a seabios 0.x.1 for qemu 0.13.1 with this patch
  - same, plus seabios 0.y.1 for qemu 0.12.6 with this patch
  - skip qemu (which is not truly affected), patch qemu-kvm's copy of
  seabios for both 0.12.z and 0.13.z

  The third option is the most correct from a release engineering
  point of view, but involves more work for everyone.

I'm okay with making tags and branches of seabios for bug fixes.  So
far qemu/kvm has just grabbed various builds of seabios - is it
worthwhile to branch off of the seabios-0.6.1 version - which would
mean qemu/kvm would pull in additional changes beyond the bug fix
above?


qemu 0.12 is based on 0.5.1-stable, appears to be an untagged commit
qemu 0.13 is based on 17d3e46511, doesn't appear to be a part of a 
branch or a tag?


git-wise, tags are more important than branches.  You can always 
retrofit a branch to a tag (and you can always retrofit a tag to a 
commit hash).  For the qemu git repositories, neither matters so much 
since the commit is recorded in git; but the distro people really like 
nice stable tags with lots of digits and dots in them.
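
For example (plain git, names illustrative):

# retrofit a branch starting at an existing tag
git branch 0.5.1-stable 0.5.1
# retrofit an annotated tag at a bare commit hash
git tag -a qemu-0.13-base -m "seabios snapshot for qemu 0.13" 17d3e46511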



--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: [PATCH 1/2] Type-safe ioport callbacks

2010-10-25 Thread Avi Kivity

 On 10/25/2010 02:54 PM, Juan Quintela wrote:

Avi Kivity a...@redhat.com wrote:

  +static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
  +{
  +IOPort *ioport = opaque;
  +
  +return ioport->ops->writeb(ioport, addr, data);
  +}
  +
  +static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
  +{
  +IOPort *ioport = opaque;
  +
  +return ioport->ops->writew(ioport, addr, data);
  +}
  +
  +static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
  +{
  +IOPort *ioport = opaque;
  +
  +return ioport->ops->writel(ioport, addr, data);
  +}

return is not needed in these three functions.



Fixed.

--
error compiling committee.c: too many arguments to function




[Qemu-devel] [PATCH v2 0/2] Type-safe ioport callbacks

2010-10-25 Thread Avi Kivity
A recent qemu -> qemu-kvm merge broke cpu hotplug without the compiler
complaining because of the type-unsafeness of the ioport callbacks.  This
patchset adds a type-safe variant of ioport callbacks and converts a sample
ioport.  Converting the other 300-odd registrations is left as an exercise
to the community.

v2:
 - const correctness
 - avoid return void

Avi Kivity (2):
  Type-safe ioport callbacks
  piix4 acpi: convert io BAR to type-safe ioport callbacks

 hw/acpi_piix4.c |   30 +++--
 ioport.c|   64 +++
 ioport.h|   16 +
 3 files changed, 98 insertions(+), 12 deletions(-)

-- 
1.7.3.1




[Qemu-devel] [PATCH v2 1/2] Type-safe ioport callbacks

2010-10-25 Thread Avi Kivity
The current ioport callbacks are not type-safe, in that they accept an opaque
pointer as an argument whose type must match the argument to the registration
function; this is not checked by the compiler.

This patch adds an alternative that is type-safe.  Instead of an opaque
argument, both registration and the callback use a new IOPort type.  The
callback then uses container_of() to access its main structures.

Currently the old and new methods exist side by side; once the old way is gone,
we can also save a bunch of memory since the new method requires one pointer
per ioport instead of 6.

Acked-by: Anthony Liguori aligu...@us.ibm.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 ioport.c |   64 ++
 ioport.h |   16 +++
 2 files changed, 80 insertions(+), 0 deletions(-)

diff --git a/ioport.c b/ioport.c
index ec3dc65..b9e74b8 100644
--- a/ioport.c
+++ b/ioport.c
@@ -174,6 +174,70 @@ int register_ioport_write(pio_addr_t start, int length, int size,
 return 0;
 }
 
+static uint32_t ioport_readb_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readb(ioport, addr);
+}
+
+static uint32_t ioport_readw_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readw(ioport, addr);
+}
+
+static uint32_t ioport_readl_thunk(void *opaque, uint32_t addr)
+{
+IOPort *ioport = opaque;
+
+return ioport->ops->readl(ioport, addr);
+}
+
+static void ioport_writeb_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+ioport->ops->writeb(ioport, addr, data);
+}
+
+static void ioport_writew_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+ioport->ops->writew(ioport, addr, data);
+}
+
+static void ioport_writel_thunk(void *opaque, uint32_t addr, uint32_t data)
+{
+IOPort *ioport = opaque;
+
+ioport->ops->writel(ioport, addr, data);
+}
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length)
+{
+if (ioport->ops->readb) {
+register_ioport_read(start, length, 1, ioport_readb_thunk, ioport);
+}
+if (ioport->ops->readw) {
+register_ioport_read(start, length, 2, ioport_readw_thunk, ioport);
+}
+if (ioport->ops->readl) {
+register_ioport_read(start, length, 4, ioport_readl_thunk, ioport);
+}
+if (ioport->ops->writeb) {
+register_ioport_write(start, length, 1, ioport_writeb_thunk, ioport);
+}
+if (ioport->ops->writew) {
+register_ioport_write(start, length, 2, ioport_writew_thunk, ioport);
+}
+if (ioport->ops->writel) {
+register_ioport_write(start, length, 4, ioport_writel_thunk, ioport);
+}
+}
+
 void isa_unassign_ioport(pio_addr_t start, int length)
 {
 int i;
diff --git a/ioport.h b/ioport.h
index 3d3c8a3..6a7742b 100644
--- a/ioport.h
+++ b/ioport.h
@@ -36,6 +36,22 @@ typedef uint32_t pio_addr_t;
 typedef void (IOPortWriteFunc)(void *opaque, uint32_t address, uint32_t data);
 typedef uint32_t (IOPortReadFunc)(void *opaque, uint32_t address);
 
+struct IOPort;
+
+typedef struct IOPortOps {
+uint32_t (*readb)(struct IOPort *ioport, pio_addr_t addr);
+uint32_t (*readw)(struct IOPort *ioport, pio_addr_t addr);
+uint32_t (*readl)(struct IOPort *ioport, pio_addr_t addr);
+void (*writeb)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+void (*writew)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+void (*writel)(struct IOPort *ioport, pio_addr_t addr, uint32_t data);
+} IOPortOps;
+
+typedef struct IOPort {
+const IOPortOps *ops;
+} IOPort;
+
+void ioport_register(IOPort *ioport, pio_addr_t start, pio_addr_t length);
 int register_ioport_read(pio_addr_t start, int length, int size,
  IOPortReadFunc *func, void *opaque);
 int register_ioport_write(pio_addr_t start, int length, int size,
-- 
1.7.3.1




[Qemu-devel] [PATCH v2 2/2] piix4 acpi: convert io BAR to type-safe ioport callbacks

2010-10-25 Thread Avi Kivity
Acked-by: Anthony Liguori aligu...@us.ibm.com
Signed-off-by: Avi Kivity a...@redhat.com
---
 hw/acpi_piix4.c |   30 ++
 1 files changed, 18 insertions(+), 12 deletions(-)

diff --git a/hw/acpi_piix4.c b/hw/acpi_piix4.c
index f74f34c..c01c2ee 100644
--- a/hw/acpi_piix4.c
+++ b/hw/acpi_piix4.c
@@ -52,6 +52,7 @@ struct pci_status {
 
 typedef struct PIIX4PMState {
 PCIDevice dev;
+IOPort ioport;
 uint16_t pmsts;
 uint16_t pmen;
 uint16_t pmcntrl;
@@ -128,9 +129,9 @@ static void pm_tmr_timer(void *opaque)
 pm_update_sci(s);
 }
 
-static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writew(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 addr &= 0x3f;
 switch(addr) {
 case 0x00:
@@ -184,9 +185,9 @@ static void pm_ioport_writew(void *opaque, uint32_t addr, uint32_t val)
 PIIX4_DPRINTF("PM writew port=0x%04x val=0x%04x\n", addr, val);
 }
 
-static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readw(IOPort *ioport, pio_addr_t addr)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 uint32_t val;
 
 addr &= 0x3f;
@@ -208,15 +209,15 @@ static uint32_t pm_ioport_readw(void *opaque, uint32_t addr)
 return val;
 }
 
-static void pm_ioport_writel(void *opaque, uint32_t addr, uint32_t val)
+static void pm_ioport_writel(IOPort *ioport, pio_addr_t addr, uint32_t val)
 {
-//PIIX4PMState *s = opaque;
+//PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 PIIX4_DPRINTF("PM writel port=0x%04x val=0x%08x\n", addr & 0x3f, val);
 }
 
-static uint32_t pm_ioport_readl(void *opaque, uint32_t addr)
+static uint32_t pm_ioport_readl(IOPort *ioport, uint32_t addr)
 {
-PIIX4PMState *s = opaque;
+PIIX4PMState *s = container_of(ioport, PIIX4PMState, ioport);
 uint32_t val;
 
 addr &= 0x3f;
@@ -265,10 +266,7 @@ static void pm_io_space_update(PIIX4PMState *s)
 
 /* XXX: need to improve memory and ioport allocation */
 PIIX4_DPRINTF("PM: mapping to 0x%x\n", pm_io_base);
-register_ioport_write(pm_io_base, 64, 2, pm_ioport_writew, s);
-register_ioport_read(pm_io_base, 64, 2, pm_ioport_readw, s);
-register_ioport_write(pm_io_base, 64, 4, pm_ioport_writel, s);
-register_ioport_read(pm_io_base, 64, 4, pm_ioport_readl, s);
+ioport_register(&s->ioport, pm_io_base, 64);
 }
 }
 
@@ -413,6 +411,13 @@ static int piix4_pm_initfn(PCIDevice *dev)
 return 0;
 }
 
+static const IOPortOps pm_ioport_ops = {
+.readw = pm_ioport_readw,
+.readl = pm_ioport_readl,
+.writew = pm_ioport_writew,
+.writel = pm_ioport_writel,
+};
+
 i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
qemu_irq sci_irq, qemu_irq cmos_s3, qemu_irq smi_irq,
int kvm_enabled)
@@ -424,6 +429,7 @@ i2c_bus *piix4_pm_init(PCIBus *bus, int devfn, uint32_t smb_io_base,
 qdev_prop_set_uint32(&dev->qdev, "smb_io_base", smb_io_base);
 
 s = DO_UPCAST(PIIX4PMState, dev, dev);
+s->ioport.ops = &pm_ioport_ops;
 s->irq = sci_irq;
 s->cmos_s3 = cmos_s3;
 s->smi_irq = smi_irq;
-- 
1.7.3.1




Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks

2010-10-26 Thread Avi Kivity

 On 10/25/2010 08:38 PM, Blue Swirl wrote:


  I don't really see why we need registration; cpu_register_io() takes
  function pointers, a size, and an opaque, and gives an integer handle in
  return.  With the IOPort object approach, you set up the IOPort with
  function pointers, size is implied, and the opaque is derived using
  container_of(); the handle is simply the address of the object.

With the handle, we can separate setting up the structures at device
level, and mapping the object using only the handle at bus or other
higher level. Can this be done with the object approach?


I believe so.  The handle is simply an indirect pointer, no?


The purpose of that patch series was to perform the separation for PCI
BARs. I wasn't so happy with the series, so I never pushed.


In fact I think an IOPort is even more suitable; if we need additional 
attributes we can use a derived object:


struct PCIIOPort {
IOPort ioport;
/* additional fields */
};



--
error compiling committee.c: too many arguments to function




[Qemu-devel] Re: KVM call agenda for Oct 26

2010-10-26 Thread Avi Kivity

 On 10/26/2010 02:49 PM, Anthony Liguori wrote:

On 10/26/2010 03:20 AM, Avi Kivity wrote:

 On 10/25/2010 09:51 PM, Juan Quintela wrote:

Please send in any agenda items you are interested in covering.



0.13.1

At least two fatal issues:
- vmmouse is broken
- assertion failure in block layer


There's actually a fair bit.  I'm hoping to also get the hotplug fixes 
because those are pretty important IMHO.  I don't want to wait too long 
though.


Also, a new seabios for the SCI level triggered fix.



What's the assertion in the block layer?


https://bugzilla.redhat.com/show_bug.cgi?id=646538

This is qemu-img, so maybe not so critical.  I confused this with a 
runtime assertion failure in the block layer, for which a fix was merged 
post rc3.


--
error compiling committee.c: too many arguments to function




Re: [Qemu-devel] [PATCH 1/2] Type-safe ioport callbacks

2010-10-26 Thread Avi Kivity

 On 10/26/2010 05:09 PM, Blue Swirl wrote:

On Tue, Oct 26, 2010 at 8:05 AM, Avi Kivity a...@redhat.com wrote:
On 10/25/2010 08:38 PM, Blue Swirl wrote:

  
  I don't really see why we need registration; cpu_register_io() takes
  function pointers, a size, and an opaque, and gives an integer handle
in
  return.  With the IOPort object approach, you set up the IOPort with
  function pointers, size is implied, and the opaque is derived using
  container_of(); the handle is simply the address of the object.

  With the handle, we can separate setting up the structures at device
  level, and mapping the object using only the handle at bus or other
  higher level. Can this be done with the object approach?

  I believe so.  The handle is simply an indirect pointer, no?

Yes, but then the object should also contain size information. That
should not be needed for mapping at higher level.


Sorry, I don't follow your meaning.

When I said "size is implied" I meant that the IOPort object has 
separate function pointers for sizes 1, 2, and 4, so ioport_register() 
doesn't need a size parameter.  But I don't see how that relates to your 
comment.



  The purpose of that patch series was to perform the separation for PCI
  BARs. I wasn't so happy with the series, so I never pushed.

  In fact I think an IOPort is even more suitable; if we need additional
  attributes we can use a derived object:

  struct PCIIOPort {
  IOPort ioport;
  /* additional fields */
  };

One issue with my series was that it would be great if the devices
just had some BAR structures (used by PCI layer to map the devices)
inside PCI/qdev structures, but I invented that too late. Maybe this
can be addressed in your design?


It looks to be orthogonal.  It would be great to have a BAR object; that 
object can then use your API, my API, or the existing API.


--
error compiling committee.c: too many arguments to function



