[COMMIT master] device-assignment: Fix off-by-one in header check

2010-12-13 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

Include the first byte at 40h or else access might go to the
hardware instead of the emulated config space, resulting in
capability loops, since the ordering is different.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
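
The boundary condition can be sketched in isolation. This is only an
illustration, not the QEMU code: is_emulated_cap(), is_emulated_cap_old()
and HEADER_SIZE are hypothetical names, and the value 0x40 is taken from
the "40h" in the message above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the standard PCI header occupies bytes 0x00-0x3f, so the
 * first byte that can belong to a capability is 0x40. */
#define HEADER_SIZE 0x40

/* Hypothetical helper: true when an access at `address` should be served
 * from the emulated config space instead of going to hardware. */
static bool is_emulated_cap(const uint8_t *config_map, uint32_t address)
{
    /* '>=' includes the byte at 0x40 itself. */
    return address >= HEADER_SIZE && config_map[address] != 0;
}

/* The pre-fix comparison, kept only to show the difference: '>' skips
 * the first capability byte at 0x40. */
static bool is_emulated_cap_old(const uint8_t *config_map, uint32_t address)
{
    return address > HEADER_SIZE && config_map[address] != 0;
}
```

With a capability mapped at 0x40, the old check lets an access to 0x40
fall through to the hardware path while the new one does not.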

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 832c236..6d6e657 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -410,7 +410,7 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
           (uint16_t) address, val, len);
 
-    if (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
         return assigned_device_pci_cap_write_config(d, address, val, len);
     }
 
@@ -456,7 +456,7 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
         (address >= 0x10 && address <= 0x24) || address == 0x30 ||
         address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address > PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
--
To unsubscribe from this list: send the line unsubscribe kvm-commits in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[COMMIT master] pci: Remove PCI_CAPABILITY_CONFIG_*

2010-12-13 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

Half of these aren't used anywhere, the other half are wrong.  Now that
device assignment is trying to match physical hardware offsets for PCI
capabilities, we can't round up the MSI and MSI-X length.  MSI-X is
always 12 bytes.  MSI is variable length depending on features, but for
the current device assignment implementation, it's always the minimum
length of 10 bytes.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
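
The lengths quoted above follow from the capability layouts in the PCI
specification. A small sketch of the arithmetic, with hypothetical names
(msi_cap_length, MSI_FLAGS_*, MSIX_CAP_LENGTH are not QEMU identifiers):

```c
#include <stdint.h>

/* Feature bits in the MSI Message Control register. */
#define MSI_FLAGS_64BIT   0x0080  /* 64-bit message address */
#define MSI_FLAGS_MASKBIT 0x0100  /* per-vector masking */

/* MSI-X capability: ID, next, control, table offset, PBA offset. */
#define MSIX_CAP_LENGTH 12

/* Hypothetical helper: size in bytes of an MSI capability structure. */
static unsigned msi_cap_length(uint16_t flags)
{
    /* ID (1) + next (1) + control (2) + address (4) + data (2) */
    unsigned len = 10;

    if (flags & MSI_FLAGS_64BIT) {
        len += 4;   /* upper 32 bits of the message address */
    }
    if (flags & MSI_FLAGS_MASKBIT) {
        len += 10;  /* reserved (2) + mask bits (4) + pending bits (4) */
    }
    return len;
}
```

For the 32-bit/no-mask case used by device assignment this gives 10, so
rounding up to 0x10 would claim 6 bytes that may belong to the next
capability on the physical device.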

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 6d6e657..1a90a89 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1302,10 +1302,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
      * MSI capability is the 1st capability in capability config */
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) {
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos,
-                           PCI_CAPABILITY_CONFIG_MSI_LENGTH);
-
         /* Only 32-bit/no-mask currently supported */
+        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10);
+
         pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) &
                      PCI_MSI_FLAGS_QMASK);
@@ -1326,8 +1325,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         uint32_t msix_table_entry;
 
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos,
-                           PCI_CAPABILITY_CONFIG_MSIX_LENGTH);
+        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12);
 
         pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &
diff --git a/hw/pci.h b/hw/pci.h
index 34955d8..d579738 100644
--- a/hw/pci.h
+++ b/hw/pci.h
@@ -122,11 +122,6 @@ enum {
     QEMU_PCI_CAP_MULTIFUNCTION = (1 << QEMU_PCI_CAP_MULTIFUNCTION_BITNR),
 };
 
-#define PCI_CAPABILITY_CONFIG_MAX_LENGTH 0x60
-#define PCI_CAPABILITY_CONFIG_DEFAULT_START_ADDR 0x40
-#define PCI_CAPABILITY_CONFIG_MSI_LENGTH 0x10
-#define PCI_CAPABILITY_CONFIG_MSIX_LENGTH 0x10
-
 typedef int (*msix_mask_notifier_func)(PCIDevice *, unsigned vector,
   int masked);
 


[COMMIT master] pci: Error on PCI capability collisions

2010-12-13 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

Nothing good can happen when we overlap capabilities, so fail with an
error when a new capability would overlap an existing one.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com
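
The check amounts to scanning a per-byte ownership map before claiming a
range. A minimal sketch under stated assumptions: struct cap_map,
claim_capability() and the 256-byte size are hypothetical stand-ins, not
the QEMU data structures.

```c
#include <stdint.h>

#define CONFIG_SPACE_SIZE 256

/* Hypothetical per-device map: config_map[i] holds the ID of the
 * capability that owns byte i, or 0 when the byte is free. */
struct cap_map {
    uint8_t config_map[CONFIG_SPACE_SIZE];
};

/* Claim [offset, offset + size) for cap_id.
 * Returns 0 on success, -1 if the range is invalid or already in use. */
static int claim_capability(struct cap_map *m, uint8_t cap_id,
                            unsigned offset, unsigned size)
{
    unsigned i;

    if (size == 0 || offset >= CONFIG_SPACE_SIZE ||
        size > CONFIG_SPACE_SIZE - offset) {
        return -1;
    }

    /* Refuse the whole request if any byte is already owned. */
    for (i = offset; i < offset + size; i++) {
        if (m->config_map[i]) {
            return -1;
        }
    }

    for (i = offset; i < offset + size; i++) {
        m->config_map[i] = cap_id;
    }
    return 0;
}
```

Checking the full range before writing anything keeps the map unchanged
when a collision is reported.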

diff --git a/hw/pci.c b/hw/pci.c
index b08113d..288d6fd 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1845,6 +1845,20 @@ int pci_add_capability(PCIDevice *pdev, uint8_t cap_id,
         if (!offset) {
             return -ENOSPC;
         }
+    } else {
+        int i;
+
+        for (i = offset; i < offset + size; i++) {
+            if (pdev->config_map[i]) {
+                fprintf(stderr, "ERROR: %04x:%02x:%02x.%x "
+                        "Attempt to add PCI capability %x at offset "
+                        "%x overlaps existing capability %x at offset %x\n",
+                        pci_find_domain(pdev->bus), pci_bus_num(pdev->bus),
+                        PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+                        cap_id, offset, pdev->config_map[i], i);
+                return -EFAULT;
+            }
+        }
     }
 
     config = pdev->config + offset;


[COMMIT master] device-assignment: Error checking when adding capabilities

2010-12-13 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 1a90a89..0ae04de 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -1288,7 +1288,7 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
 {
     AssignedDevice *dev = container_of(pci_dev, AssignedDevice, dev);
     PCIRegion *pci_region = dev->real_device.regions;
-    int pos;
+    int ret, pos;
 
     /* Clear initial capabilities pointer and status copied from hw */
     pci_set_byte(pci_dev->config + PCI_CAPABILITY_LIST, 0);
@@ -1303,7 +1303,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
     if ((pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_MSI))) {
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSI;
         /* Only 32-bit/no-mask currently supported */
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSI, pos, 10)) < 0) {
+            return ret;
+        }
 
         pci_set_word(pci_dev->config + pos + PCI_MSI_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSI_FLAGS) &
@@ -1325,7 +1327,9 @@ static int assigned_device_pci_cap_init(PCIDevice *pci_dev)
         uint32_t msix_table_entry;
 
         dev->cap.available |= ASSIGNED_DEVICE_CAP_MSIX;
-        pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12);
+        if ((ret = pci_add_capability(pci_dev, PCI_CAP_ID_MSIX, pos, 12)) < 0) {
+            return ret;
+        }
 
         pci_set_word(pci_dev->config + pos + PCI_MSIX_FLAGS,
                      pci_get_word(pci_dev->config + pos + PCI_MSIX_FLAGS) &


[COMMIT master] device-assignment: pass through and stub more PCI caps

2010-12-13 Thread Avi Kivity
From: Alex Williamson alex.william...@redhat.com

Some drivers depend on finding capabilities like power management,
PCI express/X, vital product data, or vendor specific fields.  Now
that we have better capability support, we can pass more of these
tables through to the guest.  Note that VPD and VNDR are direct pass
through capabilities, the rest are mostly empty shells with a few
writable bits where necessary.

It may be possible to consolidate dummy capabilities into common files
for other drivers to use, but I prefer to leave them here for now as
we figure out what bits to handle directly with hardware and what bits
are purely emulated.

Signed-off-by: Alex Williamson alex.william...@redhat.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 0ae04de..50c6408 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -67,6 +67,9 @@ static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev,
                                                  uint32_t address,
                                                  uint32_t val, int len);
 
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len);
+
 static uint32_t assigned_dev_ioport_rw(AssignedDevRegion *dev_region,
                                        uint32_t addr, int len, uint32_t *val)
 {
@@ -370,11 +373,32 @@ static uint8_t assigned_dev_pci_read_byte(PCIDevice *d, int pos)
     return (uint8_t)assigned_dev_pci_read(d, pos, 1);
 }
 
-static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap)
+static void assigned_dev_pci_write(PCIDevice *d, int pos, uint32_t val, int len)
+{
+    AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+    ssize_t ret;
+    int fd = pci_dev->real_device.config_fd;
+
+again:
+    ret = pwrite(fd, &val, len, pos);
+    if (ret != len) {
+        if ((ret < 0) && (errno == EINTR || errno == EAGAIN))
+            goto again;
+
+        fprintf(stderr, "%s: pwrite failed, ret = %zd errno = %d\n",
+                __func__, ret, errno);
+
+        exit(1);
+    }
+
+    return;
+}
+
+static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
 {
     int id;
     int max_cap = 48;
-    int pos = PCI_CAPABILITY_LIST;
+    int pos = start ? start : PCI_CAPABILITY_LIST;
     int status;
 
     status = assigned_dev_pci_read_byte(d, PCI_STATUS);
@@ -453,10 +477,16 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     ssize_t ret;
     AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
 
+    if (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address]) {
+        val = assigned_device_pci_cap_read_config(d, address, len);
+        DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
+              (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
+        return val;
+    }
+
     if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
         (address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d ||
-        (address >= PCI_CONFIG_HEADER_SIZE && d->config_map[address])) {
+        address == 0x34 || address == 0x3c || address == 0x3d) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -1251,7 +1281,70 @@ static void assigned_dev_update_msix(PCIDevice *pci_dev, unsigned int ctrl_pos)
 #endif
 #endif
 
-static void assigned_device_pci_cap_write_config(PCIDevice *pci_dev, uint32_t address,
+/* There can be multiple VNDR capabilities per device, we need to find the
+ * one that starts closest to the given address without going over. */
+static uint8_t find_vndr_start(PCIDevice *pci_dev, uint32_t address)
+{
+    uint8_t cap, pos;
+
+    for (cap = pos = 0;
+         (pos = pci_find_cap_offset(pci_dev, PCI_CAP_ID_VNDR, pos));
+         pos += PCI_CAP_LIST_NEXT) {
+        if (pos <= address) {
+            cap = MAX(pos, cap);
+        }
+    }
+    return cap;
+}
+
+/* Merge the bits set in mask from mval into val.  Both val and mval are
+ * at the same addr offset, pos is the starting offset of the mask. */
+static uint32_t merge_bits(uint32_t val, uint32_t mval, uint8_t addr,
+                           int len, uint8_t pos, uint32_t mask)
+{
+    if (!ranges_overlap(addr, len, pos, 4)) {
+        return val;
+    }
+
+    if (addr >= pos) {
+        mask >>= (addr - pos) * 8;
+    } else {
+        mask <<= (pos - addr) * 8;
+    }
+    mask &= 0xffffffffU >> (4 - len) * 8;
+
+    val &= ~mask;
+    val |= (mval & mask);
+
+    return val;
+}
+
+static uint32_t assigned_device_pci_cap_read_config(PCIDevice *pci_dev,
+                                                    uint32_t address, int len)
+{
+    uint8_t cap, cap_id = pci_dev->config_map[address];
+    uint32_t val;
+
+    switch (cap_id) {
+

[COMMIT master] KVM: SVM: Add clean-bit for intercepts, tsc-offset and pause filter count

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch adds the clean-bit for the intercept vectors, the
TSC offset and the pause-filter count, and marks it dirty in
the appropriate places. The IO and MSR permission bitmaps are
not subject to this bit.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 0904c11..609f661 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -186,6 +186,8 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
 				      bool has_error_code, u32 error_code);
 
 enum {
+	VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
+			    pause filter count */
 	VMCB_DIRTY_MAX,
 };
 
@@ -217,6 +219,8 @@ static void recalc_intercepts(struct vcpu_svm *svm)
 	struct vmcb_control_area *c, *h;
 	struct nested_state *g;
 
+	mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
+
 	if (!is_guest_mode(&svm->vcpu))
 		return;
 
@@ -854,6 +858,8 @@ static void svm_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 	}
 
 	svm->vmcb->control.tsc_offset = offset + g_tsc_offset;
+
+	mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
@@ -863,6 +869,7 @@ static void svm_adjust_tsc_offset(struct kvm_vcpu *vcpu, s64 adjustment)
 	svm->vmcb->control.tsc_offset += adjustment;
 	if (is_guest_mode(vcpu))
 		svm->nested.hsave->control.tsc_offset += adjustment;
+	mark_dirty(svm->vmcb, VMCB_INTERCEPTS);
 }
 
 static void init_vmcb(struct vcpu_svm *svm)


[COMMIT master] KVM: SVM: Add clean-bit for IOPM_BASE and MSRPM_BASE

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch adds the clean bit for the physical addresses of
the MSRPM and the IOPM. It does not need to be set in the
code because the only place where these values are changed
is the nested-svm vmrun and vmexit path. These functions
already mark the complete VMCB as dirty.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 609f661..1802f7c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -188,6 +188,7 @@ static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
 enum {
VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
pause filter count */
+   VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
VMCB_DIRTY_MAX,
 };
 


[COMMIT master] KVM: SVM: Add clean-bits infrastructure code

2010-12-13 Thread Avi Kivity
From: Roedel, Joerg joerg.roe...@amd.com

This patch adds the infrastructure for the implementation of
the individual clean-bits.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com
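
The three helpers are plain bit operations on a 32-bit field in which a
set bit means "unchanged since the last run". A stand-alone sketch with
hypothetical demo_* names and an illustrative subset of bits; the
always-dirty entry mirrors the interrupt-state patch of this series
rather than this patch, where the mask is still 0:

```c
#include <stdint.h>

/* Illustrative subset of clean-bit indices (hypothetical names). */
enum {
    DEMO_INTERCEPTS,
    DEMO_ASID,
    DEMO_INTR,
    DEMO_DIRTY_MAX,
};

/* State that is rewritten before every run never gets its bit set. */
#define DEMO_ALWAYS_DIRTY_MASK (1U << DEMO_INTR)

struct demo_vmcb {
    uint32_t clean;   /* 1 = cached copy may be reused, 0 = reload */
};

/* Force a reload of everything, e.g. after moving to another CPU. */
static void demo_mark_all_dirty(struct demo_vmcb *v)
{
    v->clean = 0;
}

/* After a run: everything is clean except the always-dirty state. */
static void demo_mark_all_clean(struct demo_vmcb *v)
{
    v->clean = ((1U << DEMO_DIRTY_MAX) - 1) & ~DEMO_ALWAYS_DIRTY_MASK;
}

/* A single piece of state was modified. */
static void demo_mark_dirty(struct demo_vmcb *v, int bit)
{
    v->clean &= ~(1U << bit);
}
```

Starting from the all-clean state (0x3 here), marking the ASID dirty
leaves only the intercepts bit set (0x1).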

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..235dd73 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -79,7 +79,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
u32 event_inj_err;
u64 nested_cr3;
u64 lbr_ctl;
-   u64 reserved_5;
+   u32 clean;
+   u32 reserved_5;
u64 next_rip;
u8 reserved_6[816];
 };
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index ae943bb..0904c11 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -185,6 +185,28 @@ static int nested_svm_vmexit(struct vcpu_svm *svm);
 static int nested_svm_check_exception(struct vcpu_svm *svm, unsigned nr,
 				      bool has_error_code, u32 error_code);
 
+enum {
+	VMCB_DIRTY_MAX,
+};
+
+#define VMCB_ALWAYS_DIRTY_MASK 0U
+
+static inline void mark_all_dirty(struct vmcb *vmcb)
+{
+	vmcb->control.clean = 0;
+}
+
+static inline void mark_all_clean(struct vmcb *vmcb)
+{
+	vmcb->control.clean = ((1 << VMCB_DIRTY_MAX) - 1)
+			       & ~VMCB_ALWAYS_DIRTY_MASK;
+}
+
+static inline void mark_dirty(struct vmcb *vmcb, int bit)
+{
+	vmcb->control.clean &= ~(1 << bit);
+}
+
 static inline struct vcpu_svm *to_svm(struct kvm_vcpu *vcpu)
 {
 	return container_of(vcpu, struct vcpu_svm, vcpu);
@@ -973,6 +995,8 @@ static void init_vmcb(struct vcpu_svm *svm)
 		set_intercept(svm, INTERCEPT_PAUSE);
 	}
 
+	mark_all_dirty(svm->vmcb);
+
 	enable_gif(svm);
 }
 
@@ -1089,6 +1113,7 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 	if (unlikely(cpu != vcpu->cpu)) {
 		svm->asid_generation = 0;
+		mark_all_dirty(svm->vmcb);
 	}
 
 #ifdef CONFIG_X86_64
@@ -2140,6 +2165,8 @@ static int nested_svm_vmexit(struct vcpu_svm *svm)
 	svm->vmcb->save.cpl = 0;
 	svm->vmcb->control.exit_int_info = 0;
 
+	mark_all_dirty(svm->vmcb);
+
 	nested_svm_unmap(page);
 
 	nested_svm_uninit_mmu_context(&svm->vcpu);
@@ -2351,6 +2378,8 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
 
 	enable_gif(svm);
 
+	mark_all_dirty(svm->vmcb);
+
 	return true;
 }
 
@@ -3488,6 +3517,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 	if (unlikely(svm->vmcb->control.exit_code ==
 		     SVM_EXIT_EXCP_BASE + MC_VECTOR))
 		svm_handle_mce(svm);
+
+	mark_all_clean(svm->vmcb);
 }
 
 #undef R


[COMMIT master] KVM: SVM: Add clean-bit for the ASID

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for the asid in the
vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 1802f7c..a3fd9ba 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -189,6 +189,7 @@ enum {
VMCB_INTERCEPTS, /* Intercept vectors, TSC offset,
pause filter count */
VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
+   VMCB_ASID,   /* ASID */
VMCB_DIRTY_MAX,
 };
 
@@ -1488,6 +1489,8 @@ static void new_asid(struct vcpu_svm *svm, struct svm_cpu_data *sd)
 
 	svm->asid_generation = sd->asid_generation;
 	svm->vmcb->control.asid = sd->next_asid++;
+
+	mark_dirty(svm->vmcb, VMCB_ASID);
 }
 
 static void svm_set_dr7(struct kvm_vcpu *vcpu, unsigned long value)


[COMMIT master] KVM: SVM: Add clean-bit for interrupt state

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for all interrupt
related state in the vmcb. This corresponds to vmcb offset
0x60-0x67.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index a3fd9ba..b98092d 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -190,10 +190,12 @@ enum {
pause filter count */
VMCB_PERM_MAP,   /* IOPM Base and MSRPM Base */
VMCB_ASID,   /* ASID */
+   VMCB_INTR,   /* int_ctl, int_vector */
VMCB_DIRTY_MAX,
 };
 
-#define VMCB_ALWAYS_DIRTY_MASK 0U
+/* TPR is always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK (1U << VMCB_INTR)
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {
@@ -2508,6 +2510,8 @@ static int clgi_interception(struct vcpu_svm *svm)
 	svm_clear_vintr(svm);
 	svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
 
+	mark_dirty(svm->vmcb, VMCB_INTR);
+
 	return 1;
 }
 
@@ -2878,6 +2882,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
 	kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
 	svm_clear_vintr(svm);
 	svm->vmcb->control.int_ctl &= ~V_IRQ_MASK;
+	mark_dirty(svm->vmcb, VMCB_INTR);
 	/*
 	 * If the user space waits to inject interrupts, exit as soon as
 	 * possible
@@ -3169,6 +3174,7 @@ static inline void svm_inject_irq(struct vcpu_svm *svm, int irq)
 	control->int_ctl &= ~V_INTR_PRIO_MASK;
 	control->int_ctl |= V_IRQ_MASK |
 		((/*control->int_vector >> 4*/ 0xf) << V_INTR_PRIO_SHIFT);
+	mark_dirty(svm->vmcb, VMCB_INTR);
 }
 
 static void svm_set_irq(struct kvm_vcpu *vcpu)


[COMMIT master] KVM: SVM: Add clean-bit for control registers

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the CRx clean-bit for the vmcb. This
bit covers cr0, cr3, cr4, and efer.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 2a63dfa..135727c 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -192,6 +192,7 @@ enum {
VMCB_ASID,   /* ASID */
VMCB_INTR,   /* int_ctl, int_vector */
VMCB_NPT,/* npt_en, nCR3, gPAT */
+   VMCB_CR, /* CR0, CR3, CR4, EFER */
VMCB_DIRTY_MAX,
 };
 
@@ -441,6 +442,7 @@ static void svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
 		efer &= ~EFER_LME;
 
 	to_svm(vcpu)->vmcb->save.efer = efer | EFER_SVME;
+	mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static int is_external_interrupt(u32 info)
@@ -1338,6 +1340,7 @@ static void update_cr0_intercept(struct vcpu_svm *svm)
 	*hcr0 = (*hcr0 & ~SVM_CR0_SELECTIVE_MASK)
 		| (gcr0 & SVM_CR0_SELECTIVE_MASK);
 
+	mark_dirty(svm->vmcb, VMCB_CR);
 
 	if (gcr0 == *hcr0 && svm->vcpu.fpu_active) {
 		clr_cr_intercept(svm, INTERCEPT_CR0_READ);
@@ -1404,6 +1407,7 @@ static void svm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	 */
 	cr0 &= ~(X86_CR0_CD | X86_CR0_NW);
 	svm->vmcb->save.cr0 = cr0;
+	mark_dirty(svm->vmcb, VMCB_CR);
 	update_cr0_intercept(svm);
 }
 
@@ -1420,6 +1424,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 		cr4 |= X86_CR4_PAE;
 	cr4 |= host_cr4_mce;
 	to_svm(vcpu)->vmcb->save.cr4 = cr4;
+	mark_dirty(to_svm(vcpu)->vmcb, VMCB_CR);
 }
 
 static void svm_set_segment(struct kvm_vcpu *vcpu,
@@ -3547,6 +3552,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	svm->vmcb->save.cr3 = root;
+	mark_dirty(svm->vmcb, VMCB_CR);
 	force_new_asid(vcpu);
 }
 
@@ -3559,6 +3565,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 
 	/* Also sync guest cr3 here in case we live migrate */
 	svm->vmcb->save.cr3 = vcpu->arch.cr3;
+	mark_dirty(svm->vmcb, VMCB_CR);
 
 	force_new_asid(vcpu);
 }


[COMMIT master] KVM: SVM: Add clean-bit for Segments and CPL

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit defined for the cs, ds,
ss, and es segments and the current cpl saved in the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index bb640ae..85d3350 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -195,6 +195,7 @@ enum {
VMCB_CR, /* CR0, CR3, CR4, EFER */
VMCB_DR, /* DR6, DR7 */
VMCB_DT, /* GDT, IDT */
+   VMCB_SEG,/* CS, DS, SS, ES, CPL */
VMCB_DIRTY_MAX,
 };
 
@@ -1457,6 +1458,7 @@ static void svm_set_segment(struct kvm_vcpu *vcpu,
 			= (svm->vmcb->save.cs.attrib
 			   >> SVM_SELECTOR_DPL_SHIFT) & 3;
 
+	mark_dirty(svm->vmcb, VMCB_SEG);
 }
 
 static void update_db_intercept(struct kvm_vcpu *vcpu)


[COMMIT master] KVM: SVM: Add clean-bit for LBR state

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for all LBR related
state. This includes the debugctl, br_from, br_to,
last_excp_from, and last_excp_to msrs.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index e5db339..05ae90a 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -197,6 +197,7 @@ enum {
VMCB_DT, /* GDT, IDT */
VMCB_SEG,/* CS, DS, SS, ES, CPL */
VMCB_CR2,/* CR2 only */
+   VMCB_LBR,/* DBGCTL, BR_FROM, BR_TO, LAST_EX_FROM, LAST_EX_TO */
VMCB_DIRTY_MAX,
 };
 
@@ -2847,6 +2848,7 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, unsigned ecx, u64 data)
 			return 1;
 
 		svm->vmcb->save.dbgctl = data;
+		mark_dirty(svm->vmcb, VMCB_LBR);
 		if (data & (1ULL<<0))
 			svm_enable_lbrv(svm);
 		else


[COMMIT master] KVM: SVM: Add clean-bit for CR2 register

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch implements the clean-bit for the cr2 register in
the vmcb.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 85d3350..e5db339 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -196,11 +196,12 @@ enum {
VMCB_DR, /* DR6, DR7 */
VMCB_DT, /* GDT, IDT */
VMCB_SEG,/* CS, DS, SS, ES, CPL */
+   VMCB_CR2,/* CR2 only */
VMCB_DIRTY_MAX,
 };
 
-/* TPR is always written before VMRUN */
-#define VMCB_ALWAYS_DIRTY_MASK (1U << VMCB_INTR)
+/* TPR and CR2 are always written before VMRUN */
+#define VMCB_ALWAYS_DIRTY_MASK ((1U << VMCB_INTR) | (1U << VMCB_CR2))
 
 static inline void mark_all_dirty(struct vmcb *vmcb)
 {


[COMMIT master] KVM: MMU: rename 'no_apf' to 'prefault'

2010-12-13 Thread Avi Kivity
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

When 'no_apf' is 1 the fault is handled on the speculative path, and a
later patch will handle that path specially, so 'prefault' is the more
fitting name.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfbcbfa..f7e5066 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -241,7 +241,8 @@ struct kvm_mmu {
void (*new_cr3)(struct kvm_vcpu *vcpu);
void (*set_cr3)(struct kvm_vcpu *vcpu, unsigned long root);
unsigned long (*get_cr3)(struct kvm_vcpu *vcpu);
-   int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err, bool no_apf);
+   int (*page_fault)(struct kvm_vcpu *vcpu, gva_t gva, u32 err,
+ bool prefault);
void (*inject_page_fault)(struct kvm_vcpu *vcpu,
  struct x86_exception *fault);
void (*free)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d75ba1e..4954de9 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2284,11 +2284,11 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
 	return 1;
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable);
 
 static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
-			 bool no_apf)
+			 bool prefault)
 {
 	int r;
 	int level;
@@ -2310,7 +2310,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
 		return 0;
 
 	/* mmio */
@@ -2583,7 +2583,7 @@ static gpa_t nonpaging_gva_to_gpa_nested(struct kvm_vcpu *vcpu, gva_t vaddr,
 }
 
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
-				u32 error_code, bool no_apf)
+				u32 error_code, bool prefault)
 {
 	gfn_t gfn;
 	int r;
@@ -2599,7 +2599,7 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 	gfn = gva >> PAGE_SHIFT;
 
 	return nonpaging_map(vcpu, gva & PAGE_MASK,
-			     error_code & PFERR_WRITE_MASK, gfn, no_apf);
+			     error_code & PFERR_WRITE_MASK, gfn, prefault);
 }
 
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
@@ -2621,7 +2621,7 @@ static bool can_do_async_pf(struct kvm_vcpu *vcpu)
 	return kvm_x86_ops->interrupt_allowed(vcpu);
 }
 
-static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
+static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable)
 {
 	bool async;
@@ -2633,7 +2633,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 
 	put_page(pfn_to_page(*pfn));
 
-	if (!no_apf && can_do_async_pf(vcpu)) {
+	if (!prefault && can_do_async_pf(vcpu)) {
 		trace_kvm_try_async_get_page(gva, gfn);
 		if (kvm_find_async_pf_gfn(vcpu, gfn)) {
 			trace_kvm_async_pf_doublefault(gva, gfn);
@@ -2649,7 +2649,7 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 }
 
 static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
-			  bool no_apf)
+			  bool prefault)
 {
 	pfn_t pfn;
 	int r;
@@ -2673,7 +2673,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable))
+	if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
 		return 0;
 
 	/* mmio */
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index d5a0a11..52b3e91 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -539,7 +539,7 @@ out_gpte_changed:
  *   a negative value on error.
  */
 static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
-			     bool no_apf)
+			     bool prefault)
 {
 	int write_fault = error_code & PFERR_WRITE_MASK;
 	int user_fault = error_code & PFERR_USER_MASK;
@@ -581,7 +581,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	mmu_seq = vcpu->kvm->mmu_notifier_seq;
 	smp_rmb();
 
-   if 

[COMMIT master] KVM: SVM: Remove flush_guest_tlb function

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This function is unused, and svm_flush_tlb already does the
same thing, so it can be removed.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 05ae90a..16334bb 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -426,11 +426,6 @@ static inline void force_new_asid(struct kvm_vcpu *vcpu)
 	to_svm(vcpu)->asid_generation--;
 }
 
-static inline void flush_guest_tlb(struct kvm_vcpu *vcpu)
-{
-   force_new_asid(vcpu);
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64


[COMMIT master] KVM: MMU: fix accessed bit set on prefault path

2010-12-13 Thread Avi Kivity
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

Retrying the #PF is the speculative path, so don't set the accessed bit.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 4954de9..04f9033 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2214,7 +2214,8 @@ static void direct_pte_prefetch(struct kvm_vcpu *vcpu, 
u64 *sptep)
 }
 
 static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
-   int map_writable, int level, gfn_t gfn, pfn_t pfn)
+   int map_writable, int level, gfn_t gfn, pfn_t pfn,
+   bool prefault)
 {
struct kvm_shadow_walk_iterator iterator;
struct kvm_mmu_page *sp;
@@ -2229,7 +2230,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
    pte_access &= ~ACC_WRITE_MASK;
    mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
         0, write, 1, &pt_write,
-        level, gfn, pfn, false, map_writable);
+        level, gfn, pfn, prefault, map_writable);
    direct_pte_prefetch(vcpu, iterator.sptep);
    ++vcpu->stat.pf_fixed;
break;
@@ -2321,7 +2322,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,
    if (mmu_notifier_retry(vcpu, mmu_seq))
    goto out_unlock;
    kvm_mmu_free_some_pages(vcpu);
-   r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn);
+   r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
+        prefault);
    spin_unlock(&vcpu->kvm->mmu_lock);
 
 
@@ -2684,7 +2686,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,
    goto out_unlock;
    kvm_mmu_free_some_pages(vcpu);
    r = __direct_map(vcpu, gpa, write, map_writable,
-        level, gfn, pfn);
+        level, gfn, pfn, prefault);
    spin_unlock(&vcpu->kvm->mmu_lock);
 
return r;
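The effect of threading `prefault` down to mmu_set_spte can be pictured with a minimal sketch: a mapping installed speculatively leaves the accessed bit clear, so it does not look recently used. The helper name and the bit positions below are assumptions made for illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen for illustration only. */
#define SKETCH_SPTE_PRESENT  (1ull << 0)
#define SKETCH_SPTE_ACCESSED (1ull << 5)

/*
 * Build a shadow PTE for the given page frame. A speculative (prefault)
 * mapping is installed without the accessed bit.
 */
static uint64_t sketch_make_spte(uint64_t pfn, bool speculative)
{
	uint64_t spte;

	assert(pfn < (1ull << 40));	/* keep the shifted pfn inside 64 bits */

	spte = SKETCH_SPTE_PRESENT | (pfn << 12);
	if (!speculative)
		spte |= SKETCH_SPTE_ACCESSED;
	return spte;
}
```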


[COMMIT master] KVM: MMU: retry #PF for softmmu

2010-12-13 Thread Avi Kivity
From: Xiao Guangrong xiaoguangr...@cn.fujitsu.com

Retry #PF for softmmu only when the current vcpu has the same cr3 as the time
when #PF occurs

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f7e5066..b55d789 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -593,6 +593,7 @@ struct kvm_x86_ops {
 struct kvm_arch_async_pf {
u32 token;
gfn_t gfn;
+   unsigned long cr3;
bool direct_map;
 };
 
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 04f9033..1a953ac 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2607,9 +2607,11 @@ static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
 static int kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn)
 {
    struct kvm_arch_async_pf arch;
+
    arch.token = (vcpu->arch.apf.id++ << 12) | vcpu->vcpu_id;
    arch.gfn = gfn;
    arch.direct_map = vcpu->arch.mmu.direct_map;
+   arch.cr3 = vcpu->arch.mmu.get_cr3(vcpu);
 
    return kvm_setup_async_pf(vcpu, gva, gfn, &arch);
 }
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 52b3e91..146b681 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -438,7 +438,8 @@ static void FNAME(pte_prefetch)(struct kvm_vcpu *vcpu, struct guest_walker *gw,
 static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 struct guest_walker *gw,
 int user_fault, int write_fault, int hlevel,
-int *ptwrite, pfn_t pfn, bool map_writable)
+int *ptwrite, pfn_t pfn, bool map_writable,
+bool prefault)
 {
    unsigned access = gw->pt_access;
struct kvm_mmu_page *sp = NULL;
@@ -512,7 +513,7 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 
    mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
         user_fault, write_fault, dirty, ptwrite, it.level,
-        gw->gfn, pfn, false, map_writable);
+        gw->gfn, pfn, prefault, map_writable);
FNAME(pte_prefetch)(vcpu, gw, it.sptep);
 
return it.sptep;
@@ -568,8 +569,11 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
     */
    if (!r) {
    pgprintk("%s: guest page fault\n", __func__);
-   inject_page_fault(vcpu, &walker.fault);
-   vcpu->arch.last_pt_write_count = 0; /* reset fork detector */
+   if (!prefault) {
+   inject_page_fault(vcpu, &walker.fault);
+   /* reset fork detector */
+   vcpu->arch.last_pt_write_count = 0;
+   }
return 0;
}
 
@@ -599,7 +603,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
    trace_kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
    kvm_mmu_free_some_pages(vcpu);
    sptep = FNAME(fetch)(vcpu, addr, &walker, user_fault, write_fault,
-        level, &write_pt, pfn, map_writable);
+        level, &write_pt, pfn, map_writable, prefault);
    (void)sptep;
    pgprintk("%s: shadow pte %p %llx ptwrite %d\n", __func__,
         sptep, *sptep, write_pt);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..018bb70 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6183,7 +6183,7 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
 {
    int r;
 
-   if (!vcpu->arch.mmu.direct_map || !work->arch.direct_map ||
+   if ((vcpu->arch.mmu.direct_map != work->arch.direct_map) ||
          is_error_page(work->page))
return;
 
@@ -6191,6 +6191,10 @@ void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
    if (unlikely(r))
    return;
 
+   if (!vcpu->arch.mmu.direct_map &&
+         work->arch.cr3 != vcpu->arch.mmu.get_cr3(vcpu))
+   return;
+
    vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
 }
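As a rough illustration of the retry rule in this patch, the decision made in kvm_arch_async_page_ready can be reduced to a pure predicate. This is a simplified model, not the kernel code: the struct and function names are hypothetical, and the MMU reload step that sits between the two checks in the real function is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, reduced view of the current vcpu MMU state. */
struct sketch_mmu {
	bool direct_map;
	unsigned long cr3;
};

/* Hypothetical, reduced view of what was recorded when the #PF occurred. */
struct sketch_work {
	bool direct_map;
	unsigned long cr3;
	bool error_page;
};

/* Return true if the completed async page fault should be retried. */
static bool sketch_should_retry(const struct sketch_mmu *mmu,
				const struct sketch_work *work)
{
	assert(mmu && work);

	/* The paging mode changed, or the page could not be fetched. */
	if (mmu->direct_map != work->direct_map || work->error_page)
		return false;

	/* softmmu: only retry in the address space that took the fault. */
	if (!mmu->direct_map && work->cr3 != mmu->cr3)
		return false;

	return true;
}
```

With a direct map the recorded cr3 is ignored; with shadow paging a cr3 mismatch means the guest has switched address spaces, so the prefault is skipped.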
 


[COMMIT master] KVM: SVM: Use svm_flush_tlb instead of force_new_asid

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch replaces all calls to force_new_asid which are
intended to flush the guest-tlb by the more appropriate
function svm_flush_tlb. As a side-effect the force_new_asid
function is removed.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 16334bb..b4aad21 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -421,11 +421,6 @@ static inline void invlpga(unsigned long addr, u32 asid)
    asm volatile (__ex(SVM_INVLPGA) : : "a"(addr), "c"(asid));
 }
 
-static inline void force_new_asid(struct kvm_vcpu *vcpu)
-{
-   to_svm(vcpu)->asid_generation--;
-}
-
 static int get_npt_level(void)
 {
 #ifdef CONFIG_X86_64
@@ -999,7 +994,7 @@ static void init_vmcb(struct vcpu_svm *svm)
    save->cr3 = 0;
    save->cr4 = 0;
    }
-   force_new_asid(&svm->vcpu);
+   svm->asid_generation = 0;
 
    svm->nested.vmcb = 0;
    svm->vcpu.arch.hflags = 0;
@@ -1419,7 +1414,7 @@ static void svm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
    unsigned long old_cr4 = to_svm(vcpu)->vmcb->save.cr4;
 
    if (npt_enabled && ((old_cr4 ^ cr4) & X86_CR4_PGE))
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 
    vcpu->arch.cr4 = cr4;
if (!npt_enabled)
@@ -1762,7 +1757,7 @@ static void nested_svm_set_tdp_cr3(struct kvm_vcpu *vcpu,
 
    svm->vmcb->control.nested_cr3 = root;
    mark_dirty(svm->vmcb, VMCB_NPT);
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
@@ -2366,7 +2361,7 @@ static bool nested_svm_vmrun(struct vcpu_svm *svm)
    svm->nested.intercept_exceptions = nested_vmcb->control.intercept_exceptions;
    svm->nested.intercept            = nested_vmcb->control.intercept;
 
-   force_new_asid(&svm->vcpu);
+   svm_flush_tlb(&svm->vcpu);
    svm->vmcb->control.int_ctl = nested_vmcb->control.int_ctl | V_INTR_MASKING_MASK;
    if (nested_vmcb->control.int_ctl & V_INTR_MASKING_MASK)
    svm->vcpu.arch.hflags |= HF_VINTR_MASK;
@@ -3308,7 +3303,7 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-   force_new_asid(vcpu);
+   to_svm(vcpu)->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3560,7 +3555,7 @@ static void svm_set_cr3(struct kvm_vcpu *vcpu, unsigned long root)
 
    svm->vmcb->save.cr3 = root;
    mark_dirty(svm->vmcb, VMCB_CR);
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
@@ -3574,7 +3569,7 @@ static void set_tdp_cr3(struct kvm_vcpu *vcpu, unsigned long root)
    svm->vmcb->save.cr3 = vcpu->arch.cr3;
    mark_dirty(svm->vmcb, VMCB_CR);
 
-   force_new_asid(vcpu);
+   svm_flush_tlb(vcpu);
 }
 
 static int is_disabled(void)
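Why decrementing asid_generation amounts to a guest TLB flush may not be obvious from the diff alone. The following is a simplified model of the generation scheme, under the assumption that each physical CPU hands out ASIDs from a per-CPU counter and that a vcpu whose generation no longer matches is given a fresh ASID before it runs. All struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU ASID allocator state. */
struct sketch_cpu {
	unsigned int asid_generation;
	unsigned int next_asid;
	unsigned int max_asid;
};

/* Hypothetical per-vcpu state. */
struct sketch_vcpu {
	unsigned int asid_generation;
	unsigned int asid;
	bool flush_all;		/* models a "flush all ASIDs" request */
};

/* Hand out a fresh ASID; start a new generation when they run out. */
static void sketch_new_asid(struct sketch_vcpu *v, struct sketch_cpu *c)
{
	v->flush_all = false;
	if (c->next_asid > c->max_asid) {
		++c->asid_generation;
		c->next_asid = 1;
		v->flush_all = true;
	}
	v->asid_generation = c->asid_generation;
	v->asid = c->next_asid++;
}

/* "Flush" by making the vcpu's generation stale. */
static void sketch_flush_tlb(struct sketch_vcpu *v)
{
	v->asid_generation--;
}

/* Before entering the guest, a stale generation forces a new ASID. */
static void sketch_pre_run(struct sketch_vcpu *v, struct sketch_cpu *c)
{
	assert(c->max_asid >= 1);
	if (v->asid_generation != c->asid_generation)
		sketch_new_asid(v, c);
}
```

Translations tagged with the old ASID are never hit again once the vcpu runs under a new one, which is why no explicit invalidation is needed in this model.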


[COMMIT master] KVM: SVM: Implement Flush-By-Asid feature

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

This patch adds the new flush-by-asid of upcoming AMD
processors to the KVM-AMD module.

Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 235dd73..82ecaa3 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -88,6 +88,8 @@ struct __attribute__ ((__packed__)) vmcb_control_area {
 
 #define TLB_CONTROL_DO_NOTHING 0
 #define TLB_CONTROL_FLUSH_ALL_ASID 1
+#define TLB_CONTROL_FLUSH_ASID 3
+#define TLB_CONTROL_FLUSH_ASID_LOCAL 7
 
 #define V_TPR_MASK 0x0f
 
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index b4aad21..740884b 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3158,7 +3158,6 @@ static void pre_svm_run(struct vcpu_svm *svm)
 
struct svm_cpu_data *sd = per_cpu(svm_data, cpu);
 
-   svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
    /* FIXME: handle wraparound of asid_generation */
    if (svm->asid_generation != sd->asid_generation)
new_asid(svm, sd);
@@ -3303,7 +3302,12 @@ static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
 
 static void svm_flush_tlb(struct kvm_vcpu *vcpu)
 {
-   to_svm(vcpu)->asid_generation--;
+   struct vcpu_svm *svm = to_svm(vcpu);
+
+   if (static_cpu_has(X86_FEATURE_FLUSHBYASID))
+   svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
+   else
+   svm->asid_generation--;
 }
 
 static void svm_prepare_guest_switch(struct kvm_vcpu *vcpu)
@@ -3527,6 +3531,8 @@ static void svm_vcpu_run(struct kvm_vcpu *vcpu)
 
    svm->next_rip = 0;
 
+   svm->vmcb->control.tlb_ctl = TLB_CONTROL_DO_NOTHING;
+
    /* if exit due to PF check for async PF */
    if (svm->vmcb->control.exit_code == SVM_EXIT_EXCP_BASE + PF_VECTOR)
    svm->apf_reason = kvm_read_and_reset_pf_reason();
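The control flow this patch introduces can be sketched in isolation: a flush request either sets a one-shot TLB control value or, on CPUs without the feature, falls back to invalidating the ASID generation, and the control value is reset after every guest run. The constants are the ones added in the diff; the struct, the function names and the boolean standing in for the CPU feature test are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TLB_CONTROL_DO_NOTHING 0
#define TLB_CONTROL_FLUSH_ASID 3

/* Hypothetical reduced vcpu state. */
struct sketch_svm {
	uint8_t tlb_ctl;
	uint32_t asid_generation;
};

/* Request a guest TLB flush, preferring flush-by-ASID when available. */
static void sketch_request_flush(struct sketch_svm *svm, bool has_flushbyasid)
{
	assert(svm);
	if (has_flushbyasid)
		svm->tlb_ctl = TLB_CONTROL_FLUSH_ASID;
	else
		svm->asid_generation--;	/* forces a new ASID before the next run */
}

/* After the guest has run once, the flush request has been consumed. */
static void sketch_after_run(struct sketch_svm *svm)
{
	svm->tlb_ctl = TLB_CONTROL_DO_NOTHING;
}
```

Resetting the control field after the run, rather than before it, is what lets a request made at any point between two runs survive until the next guest entry.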


[COMMIT master] KVM: VMX: add module parameter to avoid trapping HLT instructions (v5)

2010-12-13 Thread Avi Kivity
From: Anthony Liguori aligu...@us.ibm.com

In certain use-cases, we want to allocate guests fixed time slices where idle
guest cycles leave the machine idling.  There are many approaches to achieve
this but the most direct is to simply avoid trapping the HLT instruction which
lets the guest directly execute the instruction putting the processor to sleep.

Introduce this as a module-level option for kvm-vmx.ko since if you do this
for one guest, you probably want to do it for all.

Signed-off-by: Anthony Liguori aligu...@us.ibm.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 42d9590..9642c22 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -297,6 +297,12 @@ enum vmcs_field {
 #define GUEST_INTR_STATE_SMI   0x0004
 #define GUEST_INTR_STATE_NMI   0x0008
 
+/* GUEST_ACTIVITY_STATE flags */
+#define GUEST_ACTIVITY_ACTIVE          0
+#define GUEST_ACTIVITY_HLT             1
+#define GUEST_ACTIVITY_SHUTDOWN        2
+#define GUEST_ACTIVITY_WAIT_SIPI       3
+
 /*
  * Exit Qualifications for MOV for Control Register Access
  */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 72cfdb7..5c62ef2 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -69,6 +69,9 @@ module_param(emulate_invalid_guest_state, bool, S_IRUGO);
 static int __read_mostly vmm_exclusive = 1;
 module_param(vmm_exclusive, bool, S_IRUGO);
 
+static int __read_mostly yield_on_hlt = 1;
+module_param(yield_on_hlt, bool, S_IRUGO);
+
 #define KVM_GUEST_CR0_MASK_UNRESTRICTED_GUEST  \
(X86_CR0_WP | X86_CR0_NE | X86_CR0_NW | X86_CR0_CD)
 #define KVM_GUEST_CR0_MASK \
@@ -1009,6 +1012,17 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
vmx_set_interrupt_shadow(vcpu, 0);
 }
 
+static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
+{
+   /* Ensure that we clear the HLT state in the VMCS.  We don't need to
+* explicitly skip the instruction because if the HLT state is set, then
+* the instruction is already executing and RIP has already been
+* advanced. */
+   if (!yield_on_hlt &&
+   vmcs_read32(GUEST_ACTIVITY_STATE) == GUEST_ACTIVITY_HLT)
+   vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
+}
+
 static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
bool has_error_code, u32 error_code,
bool reinject)
@@ -1035,6 +1049,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
intr_info |= INTR_TYPE_HARD_EXCEPTION;
 
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+   vmx_clear_hlt(vcpu);
 }
 
 static bool vmx_rdtscp_supported(void)
@@ -1419,7 +1434,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
    &_pin_based_exec_control) < 0)
return -EIO;
 
-   min = CPU_BASED_HLT_EXITING |
+   min =
 #ifdef CONFIG_X86_64
  CPU_BASED_CR8_LOAD_EXITING |
  CPU_BASED_CR8_STORE_EXITING |
@@ -1432,6 +1447,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
  CPU_BASED_MWAIT_EXITING |
  CPU_BASED_MONITOR_EXITING |
  CPU_BASED_INVLPG_EXITING;
+
+   if (yield_on_hlt)
+   min |= CPU_BASED_HLT_EXITING;
+
opt = CPU_BASED_TPR_SHADOW |
  CPU_BASED_USE_MSR_BITMAPS |
  CPU_BASED_ACTIVATE_SECONDARY_CONTROLS;
@@ -2728,7 +2747,7 @@ static int vmx_vcpu_reset(struct kvm_vcpu *vcpu)
vmcs_writel(GUEST_IDTR_BASE, 0);
    vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
-   vmcs_write32(GUEST_ACTIVITY_STATE, 0);
+   vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
vmcs_write32(GUEST_PENDING_DBG_EXCEPTIONS, 0);
 
@@ -2821,6 +2840,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
} else
intr |= INTR_TYPE_EXT_INTR;
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
+   vmx_clear_hlt(vcpu);
 }
 
 static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
@@ -2848,6 +2868,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
}
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
+   vmx_clear_hlt(vcpu);
 }
 
 static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
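The two halves of this change can be sketched side by side: HLT exiting is only requested when the module parameter asks for it, and when it is not, a vcpu that sits in the HLT activity state has to be put back into the active state before an event is injected. The activity-state values are the ones from the diff; the struct, the function names and the bit chosen for the HLT-exiting control are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUEST_ACTIVITY_ACTIVE 0
#define GUEST_ACTIVITY_HLT    1

/* Assumed bit for the "HLT exiting" execution control in this sketch. */
#define SKETCH_CPU_BASED_HLT_EXITING (1u << 7)

/* Hypothetical stand-in for the relevant VMCS field. */
struct sketch_vmcs {
	uint32_t activity_state;
};

/* Only trap HLT when the host wants to reschedule on guest idle. */
static uint32_t sketch_min_exec_controls(uint32_t base, bool yield_on_hlt)
{
	return yield_on_hlt ? (base | SKETCH_CPU_BASED_HLT_EXITING) : base;
}

/* Wake a halted guest before injecting an exception, IRQ or NMI. */
static void sketch_clear_hlt(struct sketch_vmcs *vmcs, bool yield_on_hlt)
{
	assert(vmcs);
	if (!yield_on_hlt && vmcs->activity_state == GUEST_ACTIVITY_HLT)
		vmcs->activity_state = GUEST_ACTIVITY_ACTIVE;
}
```

When HLT is trapped the guest never enters the HLT activity state on its own, which is why the wake-up is guarded by the parameter.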


[COMMIT master] KVM: Fix OSXSAVE after migration

2010-12-13 Thread Avi Kivity
From: Sheng Yang sh...@linux.intel.com

CPUID's OSXSAVE is a mirror of CR4.OSXSAVE bit. We need to update the CPUID
after migration.

KVM-Stable-Tag.
Signed-off-by: Sheng Yang sh...@linux.intel.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 018bb70..bb04957 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5585,6 +5585,8 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
 
    mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
    kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
+   if (sregs->cr4 & X86_CR4_OSXSAVE)
+   update_cpuid(vcpu);
    if (!is_long_mode(vcpu) && is_pae(vcpu)) {
    load_pdptrs(vcpu, vcpu->arch.walk_mmu, vcpu->arch.cr3);
mmu_reset_needed = 1;
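The mirroring that the commit message refers to can be shown in a few lines. This is a sketch under stated assumptions: CR4.OSXSAVE is taken to be bit 18 and the CPUID leaf 1 ECX OSXSAVE flag bit 27, and the helper name is hypothetical. The real update_cpuid also checks that the guest's CPUID advertises XSAVE at all, which is omitted here.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CR4_OSXSAVE         (1ull << 18)
#define SKETCH_CPUID_1_ECX_OSXSAVE (1u << 27)

/* Recompute the guest-visible CPUID.1:ECX so that OSXSAVE follows CR4. */
static uint32_t sketch_update_cpuid_ecx(uint32_t ecx, uint64_t cr4)
{
	ecx &= ~SKETCH_CPUID_1_ECX_OSXSAVE;
	if (cr4 & SKETCH_CR4_OSXSAVE)
		ecx |= SKETCH_CPUID_1_ECX_OSXSAVE;

	/* The flag must now agree with the control register bit. */
	assert(!!(ecx & SKETCH_CPUID_1_ECX_OSXSAVE) == !!(cr4 & SKETCH_CR4_OSXSAVE));
	return ecx;
}
```

After migration the destination restores CR4 through the set-sregs path, so without this refresh the guest would read a stale flag.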


[COMMIT master] KVM: MMU: Fix incorrect direct page write protection due to ro host page

2010-12-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

If KVM sees a read-only host page, it will map it as read-only to prevent
breaking a COW.  However, if the page was part of a large guest page, KVM
incorrectly extends the write protection to the entire large page frame
instead of limiting it to the normal host page.

This results in the instantiation of a new shadow page with read-only access.

If this happens for a MOVS instruction that moves memory between two normal
pages, within a single large page frame, and mapped within the guest as a
large page, and if, in addition, the source operand is not writeable in the
host (perhaps due to KSM), then KVM will instantiate a read-only direct
shadow page, instantiate an spte for the source operand, then instantiate
a new read/write direct shadow page and instantiate an spte for the
destination operand.  Since these two sptes are in different shadow pages,
MOVS will never see them at the same time and the guest will not make
progress.

Fix by mapping the direct shadow page read/write, and only marking the
host page read-only.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 146b681..5ca9426 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -511,6 +511,9 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
link_shadow_page(it.sptep, sp);
}
 
+   if (!map_writable)
+   access &= ~ACC_WRITE_MASK;
+
    mmu_set_spte(vcpu, it.sptep, access, gw->pte_access & access,
         user_fault, write_fault, dirty, ptwrite, it.level,
         gw->gfn, pfn, prefault, map_writable);
@@ -593,9 +596,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
    if (is_error_pfn(pfn))
    return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
 
-   if (!map_writable)
-   walker.pte_access &= ~ACC_WRITE_MASK;
-
    spin_lock(&vcpu->kvm->mmu_lock);
if (mmu_notifier_retry(vcpu, mmu_seq))
goto out_unlock;
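The difference between the old and the fixed behaviour can be modelled with plain access masks. In this sketch the access of the direct shadow page is derived from the guest's view, and host read-only-ness is either folded into that derivation (old) or applied only to the individual spte (fixed). The mask values and all names are assumptions for illustration, and the model ignores everything else the fault path does.

```c
#include <assert.h>
#include <stdbool.h>

#define ACC_EXEC_MASK  1u
#define ACC_WRITE_MASK 2u
#define ACC_USER_MASK  4u
#define ACC_ALL        (ACC_EXEC_MASK | ACC_WRITE_MASK | ACC_USER_MASK)

/* Hypothetical result: which shadow page is used, and the spte's rights. */
struct sketch_mapping {
	unsigned sp_access;	/* part of the key that selects the shadow page */
	unsigned spte_access;	/* rights of the single 4K mapping */
};

/* Old scheme: a read-only host page also changes the shadow page key. */
static struct sketch_mapping sketch_old_scheme(unsigned pt_access,
						unsigned pte_access,
						bool host_writable)
{
	struct sketch_mapping m;

	if (!host_writable)
		pte_access &= ~ACC_WRITE_MASK;
	m.sp_access = pt_access & pte_access;
	m.spte_access = m.sp_access;
	return m;
}

/* Fixed scheme: the shadow page key ignores host writability. */
static struct sketch_mapping sketch_new_scheme(unsigned pt_access,
						unsigned pte_access,
						bool host_writable)
{
	struct sketch_mapping m;

	assert((pt_access & ~ACC_ALL) == 0 && (pte_access & ~ACC_ALL) == 0);
	m.sp_access = pt_access & pte_access;
	m.spte_access = m.sp_access;
	if (!host_writable)
		m.spte_access &= ~ACC_WRITE_MASK;
	return m;
}
```

In the old scheme a writable and a read-only host page inside the same guest large page select different shadow pages, which is the situation the MOVS example in the commit message describes; in the fixed scheme both land in the same one.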


[COMMIT master] KVM: Fix build error on s390 due to missing tlbs_dirty

2010-12-13 Thread Avi Kivity
From: Avi Kivity a...@redhat.com

Make it available for all archs.

Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index bd0da8f..b5021db 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,8 +256,8 @@ struct kvm {
struct mmu_notifier mmu_notifier;
unsigned long mmu_notifier_seq;
long mmu_notifier_count;
-   long tlbs_dirty;
 #endif
+   long tlbs_dirty;
 };
 
 /* The guest did something we don't support. */


[COMMIT master] KVM: SVM: Do not report xsave in supported cpuid

2010-12-13 Thread Avi Kivity
From: Joerg Roedel joerg.roe...@amd.com

To support xsave properly for the guest, the SVM module needs
software support for it. As long as this is not present, do
not report xsave as a supported feature in cpuid.
As a side-effect this patch moves the bit() helper function
into the x86.h file so that it can be used in svm.c too.

KVM-Stable-Tag.
Signed-off-by: Joerg Roedel joerg.roe...@amd.com
Signed-off-by: Avi Kivity a...@redhat.com

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 740884b..9b3d166 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3622,6 +3622,10 @@ static void svm_cpuid_update(struct kvm_vcpu *vcpu)
 static void svm_set_supported_cpuid(u32 func, struct kvm_cpuid_entry2 *entry)
 {
switch (func) {
+   case 0x00000001:
+   /* Mask out xsave bit as long as it is not supported by SVM */
+   entry->ecx &= ~(bit(X86_FEATURE_XSAVE));
+   break;
    case 0x80000001:
    if (nested)
    entry->ecx |= (1 << 2); /* Set SVM bit */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5c62ef2..c195260 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4268,11 +4268,6 @@ static int vmx_get_lpage_level(void)
return PT_PDPE_LEVEL;
 }
 
-static inline u32 bit(int bitno)
-{
-   return 1 << (bitno & 31);
-}
-
 static void vmx_cpuid_update(struct kvm_vcpu *vcpu)
 {
struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index bb04957..8d76150 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -163,11 +163,6 @@ static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
    vcpu->arch.apf.gfns[i] = ~0;
 }
 
-static inline u32 bit(int bitno)
-{
-   return 1 << (bitno & 31);
-}
-
 static void kvm_on_user_return(struct user_return_notifier *urn)
 {
unsigned slot;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2cea414..c600da8 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -70,6 +70,11 @@ static inline int is_paging(struct kvm_vcpu *vcpu)
return kvm_read_cr0_bits(vcpu, X86_CR0_PG);
 }
 
+static inline u32 bit(int bitno)
+{
+   return 1 << (bitno & 31);
+}
+
 void kvm_before_handle_nmi(struct kvm_vcpu *vcpu);
 void kvm_after_handle_nmi(struct kvm_vcpu *vcpu);
 int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq);
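For reference, here is how the bit() helper turns a feature number into a mask within one 32-bit CPUID register, and how that mask is used to hide XSAVE. The sketch assumes the feature is encoded as word 4, bit 26; the helper sketch_mask_xsave and the use of uint32_t in place of the kernel's u32 are illustrative choices.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: word index * 32 + bit index within CPUID.1:ECX. */
#define X86_FEATURE_XSAVE (4 * 32 + 26)

/* Reduce a feature number to its bit position inside a 32-bit register. */
static inline uint32_t bit(int bitno)
{
	return 1u << (bitno & 31);
}

/* Clear the XSAVE flag from a CPUID.1:ECX value reported to user space. */
static uint32_t sketch_mask_xsave(uint32_t ecx)
{
	uint32_t masked = ecx & ~bit(X86_FEATURE_XSAVE);

	assert((masked & bit(X86_FEATURE_XSAVE)) == 0);
	return masked;
}
```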


Re: [PATCH v2 1/4] genirq: Introduce driver-readable IRQ status word

2010-12-13 Thread Thomas Gleixner
On Sun, 12 Dec 2010, Jan Kiszka wrote:
> Am 12.12.2010 18:29, Thomas Gleixner wrote:
> > Also we should name it different than status, drv_status perhaps, to
> > avoid confusion with the irq_desc status.
> 
> OK, will address both in a succeeding round (just waiting for potential
> further comments).

No further comments from my side ATM.

Thanks,

tglx


Re: USB Passthrough 1.1 performance problem...

2010-12-13 Thread Alexander Graf

On 12.12.2010, at 23:31, Erik Brakkee wrote:

 Jan Kiszka wrote:
 
 Are there some tuning parameters I can use or perhaps even kernel
 configuration parameters on the host to solve this?
 
 Cheers
   Erik
 
 Host:Motherboard Supermicro X8DTi-F, Intel Xeon L5630, 12MB
  OS: Opensuse 11.3 64 bit
 
 Guest:   OS: Opensuse 11.3 64 bit
   
 I can say now that I am giving up on getting this to work. One
 alternative was to use PCI passthrough the USB hardware,  but that
 didn't work for the USB that was on the motherboard. So I bought a USB
 PCI card and tried to use PCI passthrough for that. Unfortunately other
 problems occurred there.
 
 For one, the problem with 4K alignment. But I could fix that by using
 the pci=resource_alignment=... kernel parameter. In my grub/menu.lst it
 says:
 
kernel /vmlinuz-2.6.34.7-0.5-default root=/dev/hsystem/root quiet
showopts intel_iommu=on
pci=resource_alignment=01:04.0;01:04.1;01:04.2 noirqdebug vga=0x31a
 
 
 The noirqdebug flag was needed to avoid the host from disabling the IRQ
 (it was a shared IRQ).
 
 Using this, I could configure PCI passthrough and start the VM. Also the
 USB device showed up there. Only it did not work at all.
 
 Here is a summary of my journey up until now:
 
 The original approach I wanted to use was to pass my old PCI card (WinTV
 PVR-500) to a VM. This card is a well supported card and has been doing
 fine for me. Because of the PCI passthrough problems with the wintv
 card, I decided to try a USB card instead. This gave me a 'ctrl buffer
 too small' issue that I could solve by taking the source RPM for kvm and
 applying a known patch from red hat (increasing buffer size from 2048 to
 8192). But then I got jerky video, probably due to USB 1.1 issues. To
 bypass these I could use PCI passthrough for USB. But with the PCI
 passthrough of this card I am again running into issues probably related
 to Shared IRQs. So, after all this I am back to square one.
 
 I have now modified my approach so instead of running a separate minimal
 host with my old server as a guest, I am now running the old server
 (same install) on the new hardware, using it as a host. I would
 definitely be interested in trying this out further in the future. I
 even tried Xen for a brief moment, only to realize that my host and
 guest felt slower (slower startup and execution) and much more difficult
 to handle.
 
 From the experience of the last two days fulltime trying to get things
 working I can only conclude that the following two features would be
 really important to have:
 
* Extended PCI passthrough support
  o shared IRQ support
 
 Addressed by the series I sent out today.
   
 Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500 
 might work now?
 What version is this and where can I get this for opensuse?
 
 I still have the setup I used for testing with the host OS still installed 
 but not running so it would be really easy to try out new releases of KVM (it 
 is not a serious production server after all but mainly used to run some 
 websites and mailing lists).
 
   
  o supporting cases where memory is not aligned on a 4K boundary
 
 Hmm, I'm seeing warnings here when passing through one of my EHCIs, but
 no fatal errors.
   
 In my case, the domain just didn't start.
 Btw. I was using 0.12.5 on opensuse 11.3 but could only find the sources for 
 0.12.3 on download.opensuse.org (perhaps I looked wrong) and I patched those 
 for the 4K issue. PCI passthrough also did not work with my wintv pci card 
 with KVM 0.12.5.

The source rpm for the 11.3 update channel is here:

  http://download.opensuse.org/update/11.3/rpm/src/kvm-0.12.5-1.2.1.src.rpm

   
* USB passthrough
  o support USB 2.0
  o support USB 3.0 (but taking one step at a time, 2.0 would
also be great).
 
 Note that this will not solve any real-time issue (if that is part of
 your problem). E.g.: While my EHCIs work nicely in PCI-passthrough
 scenarios, I'm unable to use certain webcams that sooner or later run
 out of sync.
 
 Jan
 
   
 Is your point in this case that USB in a VM based on PCI passthrough will 
 always have problems when it comes to more real-time issues or does this only 
 apply to USB passthrough? I can imagine that PCI passthrough is better since 
 it uses hardware support. By the way, I have seen issues in the past whereby 
 the tv card stopped working because of high load on the server running 
 natively so real-time issues also exist apart from virtualization.

IIRC the reason that PCI passthrough with EHCI performs as badly as it does is 
that BARs < 4k get passed through using the slow path (trap to qemu, issue MMIO 
in user space). Unfortunately, EHCI seems to have a 256 byte BAR region usually 
that is used for some handshaking:

00:12.2 USB Controller: ATI Technologies Inc SB700/SB800 USB EHCI Controller 
(prog-if 20 [EHCI])
Subsystem: ATI 
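The slow-path remark above can be sketched as a simple predicate. It rests on the assumption that a BAR can only be mapped straight into the guest when it covers whole, page-aligned 4K pages, and that anything smaller has to be trapped and emulated in user space. The function name and the exact conditions are illustrative and not taken from qemu.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/*
 * Return true if a BAR could be mapped directly into the guest,
 * false if accesses would have to take the trapping slow path.
 */
static bool sketch_bar_direct_mappable(uint64_t base, uint64_t size)
{
	assert(size > 0);
	return size >= SKETCH_PAGE_SIZE &&
	       (base % SKETCH_PAGE_SIZE) == 0 &&
	       (size % SKETCH_PAGE_SIZE) == 0;
}
```

A 256 byte EHCI register BAR fails the size test in this model, so every register access would go through the emulation path.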

Re: [PATCH V2] qemu,kvm: Enable user space NMI injection for kvm guest

2010-12-13 Thread Lai Jiangshan
On 12/10/2010 04:41 PM, Jan Kiszka wrote:
 Am 10.12.2010 08:42, Lai Jiangshan wrote:

 Make use of the new KVM_NMI IOCTL to send NMIs into the KVM guest if the
 user space raised them. (example: qemu monitor's nmi command)

 Signed-off-by: Lai Jiangshan la...@cn.fujitsu.com
 ---
 diff --git a/configure b/configure
 index 2917874..f6f9362 100755
 --- a/configure
 +++ b/configure
 @@ -1646,6 +1646,9 @@ if test $kvm != no ; then
  #if !defined(KVM_CAP_DESTROY_MEMORY_REGION_WORKS)
  #error Missing KVM capability KVM_CAP_DESTROY_MEMORY_REGION_WORKS
  #endif
 +#if !defined(KVM_CAP_USER_NMI)
 +#error Missing KVM capability KVM_CAP_USER_NMI
 +#endif
  int main(void) { return 0; }
  EOF
    if test "$kerneldir" != "" ; then
 
 That's what I meant.
 
 We also have a runtime check for KVM_CAP_DESTROY_MEMORY_REGION_WORKS on
 kvm init, but IMHO adding the same for KVM_CAP_USER_NMI would be
 overkill. So...
 
 diff --git a/target-i386/kvm.c b/target-i386/kvm.c
 index 7dfc357..755f8c9 100644
 --- a/target-i386/kvm.c
 +++ b/target-i386/kvm.c
 @@ -1417,6 +1417,13 @@ int kvm_arch_get_registers(CPUState *env)
  
  int kvm_arch_pre_run(CPUState *env, struct kvm_run *run)
  {
 +/* Inject NMI */
 +if (env->interrupt_request & CPU_INTERRUPT_NMI) {
 +env->interrupt_request &= ~CPU_INTERRUPT_NMI;
 +DPRINTF("injected NMI\n");
 +kvm_vcpu_ioctl(env, KVM_NMI);
 +}
 +
  /* Try to inject an interrupt if the guest can accept it */
  if (run->ready_for_interrupt_injection &&
  (env->interrupt_request & CPU_INTERRUPT_HARD) &&
 
 Acked-by: Jan Kiszka jan.kis...@siemens.com
 

Hi, Avi

Could you apply this patch or give me any comments or suggestions?

Thanks,
Lai
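The pre-run hook in the patch follows a test-and-clear pattern: if an NMI request is pending, clear the request bit and hand the NMI to the kernel exactly once. Below is a self-contained sketch of that pattern. The flag value, the struct and the counter that stands in for the KVM_NMI ioctl are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed flag value for this sketch only. */
#define SKETCH_CPU_INTERRUPT_NMI 0x200u

/* Hypothetical reduced CPU state. */
struct sketch_env {
	uint32_t interrupt_request;
	int nmis_injected;	/* stands in for issuing the KVM_NMI ioctl */
};

/* Called before each guest entry: consume a pending NMI request. */
static void sketch_pre_run(struct sketch_env *env)
{
	assert(env);
	if (env->interrupt_request & SKETCH_CPU_INTERRUPT_NMI) {
		env->interrupt_request &= ~SKETCH_CPU_INTERRUPT_NMI;
		env->nmis_injected++;
	}
}
```

Other pending request bits are left untouched, so ordinary interrupt injection still sees them afterwards.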


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-12-11 09:31:24]:

 On 12/10/2010 07:03 AM, Balbir Singh wrote:
 
   Scheduler people, please flame me with anything I may have done
   wrong, so I can do it right for a next version :)
 
 
 This is a good problem statement, there are other things to consider
 as well
 
 1. If a hard limit feature is enabled underneath, donating the
 timeslice would probably not make too much sense in that case
 
 What's the alternative?
 
 Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
 involves ping-ponging within the guest.  If the scheduler decides to
 schedule the vcpus without any overlap, then the throughput will be
 dictated by the time slice.  If we allow donation, throughput is
 limited by context switch latency.


If the vcpu holding the lock runs more and is capped, the timeslice
transfer is a heuristic that will not help.

-- 
Three Cheers,
Balbir


Re: [PATCH v2 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-13 Thread Michael S. Tsirkin
On Sun, Dec 12, 2010 at 12:22:40PM +0100, Jan Kiszka wrote:
 The result may look simpler on first glance than v1, but it comes with
 more subtle race scenarios IMO. I thought them through, hopefully
 catching all, but I would appreciate any skeptical review.

Thought about the races till my head hurt, and yes, they
all seem to be handled correctly. FWIW

Reviewed-by: Michael S. Tsirkin m...@redhat.com



Re: [PATCH v2 4/4] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-12-13 Thread Avi Kivity

On 12/12/2010 01:22 PM, Jan Kiszka wrote:

From: Jan Kiszka jan.kis...@siemens.com

PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices on the host side when passing
them to a guest.

However, IRQ disabling via the PCI config space is more costly than
masking the line via disable_irq. Therefore we register the IRQ in adaptive
mode and switch between line and device level disabling on demand.

This feature is optional, user space has to request it explicitly as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
masked it via the PCI config space.



Looks fine.


+   ret =IRQ_NONE;
+


Danger, whitespace error detected.  Initiating self-destruct sequence.

--
error compiling committee.c: too many arguments to function



[PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable

2010-12-13 Thread Xiao Guangrong
Currently, if the host does not allow writes to a page, ACC_WRITE_MASK is
dropped from pte_access. Since the direct sp's access is computed as
gw->pt_access & gw->pte_access
the write access is removed from the direct sp as well.

This is a problem: if pages that are mapped through the same guest mapping
have different access rights in the host, the host ends up switching between
direct sps very frequently.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/kvm/mmu.c         |    4 ++--
 arch/x86/kvm/paging_tmpl.h |   11 ++---------
 2 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 1a953ac..0c5cad0 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1987,6 +1987,8 @@ static int set_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 
 	if (host_writable)
 		spte |= SPTE_HOST_WRITEABLE;
+	else
+		pte_access &= ~ACC_WRITE_MASK;
 
 	spte |= (u64)pfn << PAGE_SHIFT;
 
@@ -2226,8 +2228,6 @@ static int __direct_map(struct kvm_vcpu *vcpu, gpa_t v, int write,
 		if (iterator.level == level) {
 			unsigned pte_access = ACC_ALL;
 
-			if (!map_writable)
-				pte_access &= ~ACC_WRITE_MASK;
 			mmu_set_spte(vcpu, iterator.sptep, ACC_ALL, pte_access,
 				     0, write, 1, &pt_write,
 				     level, gfn, pfn, prefault, map_writable);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 146b681..6ed2c5e 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -593,9 +593,6 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
 
-	if (!map_writable)
-		walker.pte_access &= ~ACC_WRITE_MASK;
-
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
@@ -809,12 +806,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 
 		nr_present++;
 		pte_access = sp->role.access & FNAME(gpte_access)(vcpu, gpte);
-		if (!(sp->spt[i] & SPTE_HOST_WRITEABLE)) {
-			pte_access &= ~ACC_WRITE_MASK;
-			host_writable = 0;
-		} else {
-			host_writable = 1;
-		}
+		host_writable = !!(sp->spt[i] & SPTE_HOST_WRITEABLE);
+
 		set_spte(vcpu, &sp->spt[i], pte_access, 0, 0,
 			 is_dirty_gpte(gpte), PT_PAGE_TABLE_LEVEL, gfn,
 			 spte_to_pfn(sp->spt[i]), true, false,
-- 
1.7.0.4


[PATCH 2/2] KVM: MMU: audit: allow audit more guests at the same time

2010-12-13 Thread Xiao Guangrong
Currently only one guest in the system can be audited, since:
- 'audit_point' is a global variable
- mmu_audit_disable() is called in kvm_mmu_destroy(), so audit is disabled
  after a guest exits

This patch fixes those issues, allowing more guests to be audited at the same time.

Signed-off-by: Xiao Guangrong xiaoguangr...@cn.fujitsu.com
---
 arch/x86/include/asm/kvm_host.h |4 
 arch/x86/kvm/mmu.c  |   27 ++-
 arch/x86/kvm/mmu_audit.c|   39 ++-
 3 files changed, 40 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b55d789..6244958 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -460,6 +460,10 @@ struct kvm_arch {
/* fields used by HYPER-V emulation */
u64 hv_guest_os_id;
u64 hv_hypercall;
+
+   #ifdef CONFIG_KVM_MMU_AUDIT
+   int audit_point;
+   #endif
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 0c5cad0..daa36ba 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3532,13 +3532,6 @@ static void mmu_destroy_caches(void)
kmem_cache_destroy(mmu_page_header_cache);
 }
 
-void kvm_mmu_module_exit(void)
-{
-	mmu_destroy_caches();
-	percpu_counter_destroy(&kvm_total_used_mmu_pages);
-	unregister_shrinker(&mmu_shrinker);
-}
-
 int kvm_mmu_module_init(void)
 {
 	pte_chain_cache = kmem_cache_create("kvm_pte_chain",
@@ -3731,12 +3724,6 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_get_spte_hierarchy);
 
-#ifdef CONFIG_KVM_MMU_AUDIT
-#include "mmu_audit.c"
-#else
-static void mmu_audit_disable(void) { }
-#endif
-
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
ASSERT(vcpu);
@@ -3744,5 +3731,19 @@ void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
destroy_kvm_mmu(vcpu);
free_mmu_pages(vcpu);
mmu_free_memory_caches(vcpu);
+}
+
+#ifdef CONFIG_KVM_MMU_AUDIT
+#include "mmu_audit.c"
+#else
+static void mmu_audit_disable(void) { }
+#endif
+
+void kvm_mmu_module_exit(void)
+{
+	mmu_destroy_caches();
+	percpu_counter_destroy(&kvm_total_used_mmu_pages);
+	unregister_shrinker(&mmu_shrinker);
 	mmu_audit_disable();
 }
+
diff --git a/arch/x86/kvm/mmu_audit.c b/arch/x86/kvm/mmu_audit.c
index ba2bcdd..5f6223b 100644
--- a/arch/x86/kvm/mmu_audit.c
+++ b/arch/x86/kvm/mmu_audit.c
@@ -19,11 +19,9 @@
 
 #include <linux/ratelimit.h>
 
-static int audit_point;
-
-#define audit_printk(fmt, args...)		\
+#define audit_printk(kvm, fmt, args...)		\
 	printk(KERN_ERR "audit: (%s) error: "	\
-	       fmt, audit_point_name[audit_point], ##args)
+	       fmt, audit_point_name[kvm->arch.audit_point], ##args)
 
 typedef void (*inspect_spte_fn) (struct kvm_vcpu *vcpu, u64 *sptep, int level);
 
@@ -97,18 +95,21 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 
 	if (sp->unsync) {
 		if (level != PT_PAGE_TABLE_LEVEL) {
-			audit_printk("unsync sp: %p level = %d\n", sp, level);
+			audit_printk(vcpu->kvm, "unsync sp: %p "
+				     "level = %d\n", sp, level);
 			return;
 		}
 
 		if (*sptep == shadow_notrap_nonpresent_pte) {
-			audit_printk("notrap spte in unsync sp: %p\n", sp);
+			audit_printk(vcpu->kvm, "notrap spte in unsync "
+				     "sp: %p\n", sp);
 			return;
 		}
 	}
 
 	if (sp->role.direct && *sptep == shadow_notrap_nonpresent_pte) {
-		audit_printk("notrap spte in direct sp: %p\n", sp);
+		audit_printk(vcpu->kvm, "notrap spte in direct sp: %p\n",
+			     sp);
 		return;
 	}
 
@@ -125,8 +126,9 @@ static void audit_mappings(struct kvm_vcpu *vcpu, u64 *sptep, int level)
 
 	hpa =  pfn << PAGE_SHIFT;
 	if ((*sptep & PT64_BASE_ADDR_MASK) != hpa)
-		audit_printk("levels %d pfn %llx hpa %llx ent %llxn",
-				   vcpu->arch.mmu.root_level, pfn, hpa, *sptep);
+		audit_printk(vcpu->kvm, "levels %d pfn %llx hpa %llx "
+			     "ent %llxn", vcpu->arch.mmu.root_level, pfn,
+			     hpa, *sptep);
 }
 
 static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
@@ -142,8 +144,8 @@ static void inspect_spte_has_rmap(struct kvm *kvm, u64 *sptep)
 	if (!gfn_to_memslot(kvm, gfn)) {
 		if (!printk_ratelimit())
 			return;
-		audit_printk("no memslot for gfn %llx\n", gfn);
-		audit_printk("index %ld of sp (gfn=%llx)\n",
+		audit_printk(kvm, "no memslot for gfn %llx\n", gfn);
+		audit_printk(kvm, "index %ld of sp (gfn=%llx)\n",
 

Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable

2010-12-13 Thread Avi Kivity

On 12/13/2010 12:31 PM, Xiao Guangrong wrote:

Currently, if the page is not allowed to write, then it can drop
ACC_WRITE_MASK in pte_access, and the direct sp's access is:
gw->pt_access & gw->pte_access
so, it also removes the write access in the direct sp.

There is a problem: if pages mapped through the same guest mapping have
different access on the host, the host switches direct sp very
frequently.


I just sent a patch to fix this in a different way, please review it.

--
error compiling committee.c: too many arguments to function



Re: USB Passthrough 1.1 performance problem...

2010-12-13 Thread Gerd Hoffmann

  Hi,


I am using a tv card in a VM and get jerky video.As I understand it, the
VM is using USB 1.1. However, when I set the USB controller in the BIOS
of my server to Fullspeed (12 Mbit/s) which is the USB 1.1 speed I am
able to get perfect results on the host but still on the guest the video
is jerky.


There is a patch series from Hans de Goede on qemu-devel which adds 
buffering for isochronous usb transfers to the usb passthrough code. 
Certainly worth a try.


cheers,
  Gerd



[GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations

2010-12-13 Thread Michael S. Tsirkin
Please merge the following tree for 2.6.38.
Thanks!

The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:

  net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)

are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next

Jason Wang (1):
  vhost: fix typos in comment

Julia Lawall (1):
  drivers/vhost/vhost.c: delete double assignment

Michael S. Tsirkin (9):
  vhost: put mm after thread stop
  vhost-net: batch use/unuse mm
  vhost: copy_to_user - __copy_to_user
  vhost: get/put_user - __get/__put_user
  vhost: remove unused include
  vhost: correctly set bits of dirty pages
  vhost: better variable name in logging
  vhost test module
  tools/virtio: virtio_test tool

 drivers/vhost/net.c  |9 +-
 drivers/vhost/test.c |  320 ++
 drivers/vhost/test.h |7 +
 drivers/vhost/vhost.c|   44 +++---
 drivers/vhost/vhost.h|2 +-
 tools/virtio/Makefile|   12 ++
 tools/virtio/linux/device.h  |2 +
 tools/virtio/linux/slab.h|2 +
 tools/virtio/linux/virtio.h  |  223 +++
 tools/virtio/vhost_test/Makefile |2 +
 tools/virtio/vhost_test/vhost_test.c |1 +
 tools/virtio/virtio_test.c   |  248 ++
 12 files changed, 842 insertions(+), 30 deletions(-)
 create mode 100644 drivers/vhost/test.c
 create mode 100644 drivers/vhost/test.h
 create mode 100644 tools/virtio/Makefile
 create mode 100644 tools/virtio/linux/device.h
 create mode 100644 tools/virtio/linux/slab.h
 create mode 100644 tools/virtio/linux/virtio.h
 create mode 100644 tools/virtio/vhost_test/Makefile
 create mode 100644 tools/virtio/vhost_test/vhost_test.c
 create mode 100644 tools/virtio/virtio_test.c


Re: [Qemu-devel] SCSI Command support over VirtIO Block device

2010-12-13 Thread अनुज
Hi

2010/12/13 Stefan Hajnoczi stefa...@gmail.com:

 On Dec 13, 2010 5:14 AM, अनुज anu...@gmail.com wrote:

 Hi

 I am trying to implement VirtIO support for a proprietary OS. And It
 would be great if I am able to process SCSI commands over VirtIO Block
 device.

 I tried to execute INQUIRY command but the status returned is UNSUPPORTED.
 If anyone provide example VirtIO SCSI Command request structure for
 INQUIRY command as per VirtIO spec Appendix D would be a great help.

 And also, the paragraph from VirtIO spec - 0.8.9 is confusing for me :

 Historically, devices assumed that the fields type, ioprio and
 sector reside in a single, separate read-only buffer; the fields
 errors, data_len, sense_len and residual reside in a single, separate
 write-only buffer; the sense field in a separate write-only buffer of
 size 96 bytes, by itself; the fields errors, data_len, sense_len and
 residual in a single write-only buffer; and the status field is a
 separate readonly buffer of size 1 byte, by itself.

 Here 'status field of buffer size 1 byte' is whether readonly or
 writeonly?

 Writeonly


 I want to know from which version of Qemu-kvm supports processing of
 scsi commands over VirtIO block device as a backend.
 Although I checked the Host Feature fields in which VIRTIO_BLK_F_SCSI
 bit is set. I am using qemu-kvm version 0.12.3.

 Make sure you have a scsi-generic block device in qemu-kvm, not just a
 regular file or physical block device. Open /dev/sg.

Yes, I had given a file name instead of /dev/sg0. Now it's working like a charm.

That means I can use a physical disk as a VirtIO disk in the guest OS, right?
So it's a kind of passthrough for a physical disk. But how can I
distinguish among the different physical disks attached to the host?

Is /dev/sg different for each physical disk?

However, I thought VirtIO SCSI device operations also worked for a virtual
disk (a regular file).


 Look at hw/virtio-blk.c in qemu-kvm for host implementation details.


 --

 Anuj Aggarwal

  .''`.
 : :Ⓐ :   # apt-get install hakuna-matata
 `. `'`
    `-



Thanks for your help.


Regards
-- 
Anuj Aggarwal

 .''`.
: :Ⓐ :   # apt-get install hakuna-matata
`. `'`
   `-


Re: trace_printk() support in trace-cmd

2010-12-13 Thread Masami Hiramatsu
(2010/12/13 2:47), Avi Kivity wrote:
 On 12/12/2010 07:43 PM, Arnaldo Carvalho de Melo wrote:
 Em Sun, Dec 12, 2010 at 07:42:06PM +0200, Avi Kivity escreveu:
   On 12/12/2010 07:36 PM, Arnaldo Carvalho de Melo wrote:
   Em Sun, Dec 12, 2010 at 06:35:24PM +0200, Avi Kivity escreveu:
  On 11/23/2010 05:45 PM, Steven Rostedt wrote:
  Again, the work around is to replace your trace_printks() with
  __trace_printk(_THIS_IP_, ...) or just modify the trace_printk() 
  macro
  in include/linux/kernel.h to always use the __trace_printk() 
  version.
   
  This works; I'm using it for now (I tried to use 'perf probe', but I
  get unpredictable results, like null pointer derefs).
   
   Can you tell us which functions, environment, etc?
 
   Something around 2.6.27-rc4; example functions are FNAME(fetch) in
   arch/x86/kvm/paging_tmpl.h; compiled modular (which was Steven's
   guess as to why it fails).
 
   (note, the failure is with trace-cmd, not /sys/kernel/debug/tracing).

 I mean the I tried to use 'perf probe' part.
 
 Well, same, more or less.
 
   perf probe -m kvm --add 'fetch_access=paging64_fetch pt_access=gw->pt_access pte_access=gw->pte_access dirty'
 
 would return garbage for gw->*, and the log would show the exception handler
 called.  gw is most certainly valid.
 

Thank you for reporting.
Hmm, actually, page faults could happen while fetching variables, but
the argument-fetching routines should handle that...
I'd like to check it; could you give me the details? For example, the
exception log, the kprobe-tracer event definition (you can see it via
debugfs/tracing/kprobe_events), and the result of
`perf probe -L paging64_fetch:0-10`.

Best regards,

-- 
Masami HIRAMATSU
2nd Dept. Linux Technology Center
Hitachi, Ltd., Systems Development Laboratory
E-mail: masami.hiramatsu...@hitachi.com


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Avi Kivity

On 12/11/2010 03:57 PM, Balbir Singh wrote:

* Avi Kivity a...@redhat.com  [2010-12-11 09:31:24]:

  On 12/10/2010 07:03 AM, Balbir Singh wrote:
  
 Scheduler people, please flame me with anything I may have done
 wrong, so I can do it right for a next version :)
  
  
  This is a good problem statement, there are other things to consider
  as well
  
  1. If a hard limit feature is enabled underneath, donating the
  timeslice would probably not make too much sense in that case

  What's the alternative?

  Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
  involves ping-ponging within the guest.  If the scheduler decides to
  schedule the vcpus without any overlap, then the throughput will be
  dictated by the time slice.  If we allow donation, throughput is
  limited by context switch latency.


If the vcpu holding the lock runs more and is capped, the timeslice
transfer is a heuristic that will not help.


Why not?  as long as we shift the cap as well.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] pci: Fix PCI capabilities collision error value

2010-12-13 Thread Avi Kivity

On 12/09/2010 06:16 PM, Alex Williamson wrote:

Signed-off-by: Alex Williamson alex.william...@redhat.com


Applied, thanks.

--
error compiling committee.c: too many arguments to function



Re: [PATCH] [qemu-kvm-next-tree] fix compile error of hw/device-assignment.c

2010-12-13 Thread Avi Kivity
On 12/09/2010 08:43 AM, Wei Yongjun wrote:
 Fix the following compile error in next tree:
   CC    x86_64-softmmu/device-assignment.o
 hw/device-assignment.c: In function ‘assigned_device_pci_cap_init’:
 hw/device-assignment.c:1463: error: ‘PCI_PM_CTRL_NO_SOFT_RST’ undeclared (first use in this function)
 hw/device-assignment.c:1463: error: (Each undeclared identifier is reported only once
 hw/device-assignment.c:1463: error: for each function it appears in.)


Applied, thanks.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] kvm: cleanup CR8 handling

2010-12-13 Thread Avi Kivity

On 12/08/2010 01:27 PM, Andre Przywara wrote:

The handling of CR8 writes in KVM is currently somewhat cumbersome.
This patch makes it look like the other CR register handlers
and fixes a possible issue in VMX, where the RIP would be incremented
despite an injected #GP.

  unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
@@ -4104,7 +4098,7 @@ static int emulator_set_cr(int cr, unsigned long val, struct kvm_vcpu *vcpu)
 		res = kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val));
 		break;
 	case 8:
-		res = __kvm_set_cr8(vcpu, val & 0xfUL);
+		res = kvm_set_cr8(vcpu, val);
 		break;
 	default:
 		vcpu_printf(vcpu, "%s: unexpected cr %u\n", __func__, cr);


Why drop the mask?
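
To make the question concrete, here is a small sketch of the two behaviours at stake: masking the value to the TPR bits before storing it, versus passing the full value to a setter that rejects reserved bits. The struct and both function names are hypothetical stand-ins. Whether the patched kvm_set_cr8() performs such a check itself is not visible in the quoted hunk, so treat variant B as an assumption.

```c
#define CR8_TPR_MASK 0xfUL /* CR8 bits 3:0 hold the task priority */

/* Hypothetical stand-in for the vcpu's CR8 state. */
struct cr8_state {
	unsigned long tpr;
};

/* Variant A: mask first, so reserved bits are silently ignored. */
static int set_cr8_masked(struct cr8_state *s, unsigned long val)
{
	s->tpr = val & CR8_TPR_MASK;
	return 0;
}

/* Variant B: reject reserved bits; a caller would inject #GP on nonzero. */
static int set_cr8_checked(struct cr8_state *s, unsigned long val)
{
	if (val & ~CR8_TPR_MASK)
		return 1;
	s->tpr = val;
	return 0;
}
```

With a value such as 0x1f, variant A stores 0xf and reports success, while variant B leaves the state untouched and reports an error. Dropping the mask is therefore only safe if the callee does the checking.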

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/5] kvm/svm: enhance MOV CR intercept handler

2010-12-13 Thread Avi Kivity

On 12/10/2010 03:51 PM, Andre Przywara wrote:

Newer SVM implementations provide the GPR number in the VMCB, so
that the emulation path is no longer necessary to handle CR
register access intercepts. Implement the handling in svm.c and
use it when the info is provided.

Signed-off-by: Andre Przywara andre.przyw...@amd.com
---
  arch/x86/include/asm/svm.h |2 +
  arch/x86/kvm/svm.c |   91 ++-
  2 files changed, 82 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/svm.h b/arch/x86/include/asm/svm.h
index 11dbca7..589fc25 100644
--- a/arch/x86/include/asm/svm.h
+++ b/arch/x86/include/asm/svm.h
@@ -256,6 +256,8 @@ struct __attribute__ ((__packed__)) vmcb {
  #define SVM_EXITINFOSHIFT_TS_REASON_JMP 38
  #define SVM_EXITINFOSHIFT_TS_HAS_ERROR_CODE 44

+#define SVM_EXITINFO_REG_MASK 0x0F
+
  #define   SVM_EXIT_READ_CR0   0x000
  #define   SVM_EXIT_READ_CR3   0x003
  #define   SVM_EXIT_READ_CR4   0x004
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 298ff79..ee5f100 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -2594,12 +2594,81 @@ static int emulate_on_interception(struct vcpu_svm *svm)
	return emulate_instruction(&svm->vcpu, 0, 0, 0) == EMULATE_DONE;
  }

+static int cr_interception(struct vcpu_svm *svm)
+{
+   int reg, cr;
+   unsigned long val;
+   int err;
+
+   if (!static_cpu_has(X86_FEATURE_DECODEASSISTS))
+   return emulate_on_interception(svm);
+
+   /* bit 63 is the valid bit, as not all instructions (like lmsw)
+  provide the information */


Please use kernel style comments:

 /*
  * text
  * text
  */

Even better, use a name for the bit, which will obviate the need for a 
comment.
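
A minimal sketch of what "a name for the bit" could look like. SVM_EXITINFO_REG_MASK is from the patch; CR_VALID, the function name, and the exit_info_1 parameter name are assumptions chosen here for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SVM_EXITINFO_REG_MASK 0x0F          /* from the patch */
#define CR_VALID              (1ULL << 63)  /* hypothetical name */

/* Returns true and stores the GPR number if decode info is present. */
static bool cr_exit_gpr(uint64_t exit_info_1, int *reg)
{
	if (!(exit_info_1 & CR_VALID))
		return false;	/* e.g. lmsw: fall back to emulation */
	*reg = (int)(exit_info_1 & SVM_EXITINFO_REG_MASK);
	return true;
}
```

The test against CR_VALID reads on its own, so the "bit 63 is the valid bit" comment is no longer needed at the call site.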



--
error compiling committee.c: too many arguments to function



Re: [PATCH 3/5] kvm/svm: enhance mov DR intercept handler

2010-12-13 Thread Avi Kivity

On 12/10/2010 03:51 PM, Andre Przywara wrote:

Newer SVM implementations provide the GPR number in the VMCB, so
that the emulation path is no longer necessary to handle debug
register access intercepts. Implement the handling in svm.c and
use it when the info is provided.

+
+	if (!err)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		kvm_inject_gp(&svm->vcpu, 0);
+


This repeats, how about using complete_insn_gp()?
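
The shape of such a helper, sketched against a mock vcpu so that it compiles on its own. The struct, the mock_* functions, and the _sketch suffix are hypothetical; only the "skip on success, inject #GP on error" pattern is taken from the quoted code.

```c
/* Hypothetical mock of the vcpu state touched here. */
struct mock_vcpu {
	unsigned long rip;
	int insn_len;
	int gp_pending;
};

/* Stand-in for skip_emulated_instruction(): advance past the insn. */
static void mock_skip_emulated_instruction(struct mock_vcpu *v)
{
	v->rip += (unsigned long)v->insn_len;
}

/* Stand-in for kvm_inject_gp(vcpu, 0): queue a #GP. */
static void mock_inject_gp(struct mock_vcpu *v)
{
	v->gp_pending = 1;
}

/* One place for the repeated tail of every CR/DR intercept handler. */
static void complete_insn_gp_sketch(struct mock_vcpu *v, int err)
{
	if (err)
		mock_inject_gp(v);
	else
		mock_skip_emulated_instruction(v);
}
```

Each handler then ends with a single call instead of repeating the if/else, and RIP is never advanced when a #GP is injected.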

--
error compiling committee.c: too many arguments to function



Re: [PATCH 5/5] kvm/svm: copy instruction bytes from VMCB

2010-12-13 Thread Avi Kivity

On 12/10/2010 03:51 PM, Andre Przywara wrote:

In case of a nested page fault or an intercepted #PF newer SVM
implementations provide a copy of the faulting instruction bytes
in the VMCB.
Use these bytes to feed the instruction emulator and avoid the costly
guest instruction fetch in this case.



+static int svm_prefetch_instruction(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	uint8_t len;
+	struct fetch_cache *fetch;
+
+	len = svm->vmcb->control.insn_len & 0x0F;
+	if (len == 0)
+		return 1;
+
+	fetch = &svm->vcpu.arch.emulate_ctxt.decode.fetch;
+	fetch->start = kvm_rip_read(&svm->vcpu);
+	fetch->end = fetch->start + len;
+	memcpy(fetch->data, svm->vmcb->control.insn_bytes, len);
+
+	return 0;
+}


Reaching into the emulator internals from svm code like this is not very
good.  It also assumes ->prefetch_instruction() is called immediately
after an exit; this isn't true in vmx and was at least considered for
svm (emulating multiple instructions during the nsvm vmexit sequence).


Alternatives are:
- add the insn data to emulate_instruction() and friends (my first 
suggestion)
- adding x86_decode_insn_init(), which initializes the decode cache, and 
x86_decode_insn_prefill_cache(), called only if we have the insn data


Another one: teach kvm_fetch_guest_virt() to check if addr/bytes 
intersects with csbase+rip/len; if so, use that instead of doing the 
page table dance.
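
A rough sketch of that last alternative, assuming the instruction bytes captured at exit time are kept in a small per-vcpu cache. The struct and function names are hypothetical. For simplicity the sketch only serves requests that lie fully inside the cached range, which is narrower than the "intersects" wording above; anything else falls back to the normal path.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define INSN_CACHE_MAX 15 /* maximum x86 instruction length */

/* Hypothetical cache of instruction bytes captured at exit time. */
struct insn_cache {
	uint64_t start;              /* linear address: cs base + rip */
	uint8_t len;                 /* 0 means nothing cached */
	uint8_t data[INSN_CACHE_MAX];
};

/*
 * Copy [addr, addr + bytes) out of the cache if it lies fully inside
 * the cached range. Returns false so that the caller can fall back to
 * the normal guest page table walk.
 */
static bool insn_cache_read(const struct insn_cache *c, uint64_t addr,
			    unsigned int bytes, void *dst)
{
	uint64_t off;

	if (c->len == 0 || addr < c->start)
		return false;
	off = addr - c->start;
	if (off > c->len || bytes > c->len - off)
		return false;
	memcpy(dst, c->data + off, bytes);
	return true;
}
```

Because the check is keyed on the address rather than on call order, it does not depend on the fetch happening immediately after the exit. The cache would still have to be invalidated (len = 0) once the captured bytes are stale.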


--
error compiling committee.c: too many arguments to function



Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Balbir Singh
* Avi Kivity a...@redhat.com [2010-12-13 13:57:37]:

 On 12/11/2010 03:57 PM, Balbir Singh wrote:
 * Avi Kivity a...@redhat.com  [2010-12-11 09:31:24]:
 
   On 12/10/2010 07:03 AM, Balbir Singh wrote:
   
  Scheduler people, please flame me with anything I may have done
  wrong, so I can do it right for a next version :)
   
   
   This is a good problem statement, there are other things to consider
   as well
   
   1. If a hard limit feature is enabled underneath, donating the
   timeslice would probably not make too much sense in that case
 
   What's the alternative?
 
   Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
   involves ping-ponging within the guest.  If the scheduler decides to
   schedule the vcpus without any overlap, then the throughput will be
   dictated by the time slice.  If we allow donation, throughput is
   limited by context switch latency.
 
 
 If the vcpu holding the lock runs more and is capped, the timeslice
 transfer is a heuristic that will not help.
 
 Why not?  as long as we shift the cap as well.


Shifting the cap would break it, no? Anyway, that is something for us
to keep track of as we add additional heuristics, not a show stopper. 

-- 
Three Cheers,
Balbir


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Avi Kivity

On 12/13/2010 02:39 PM, Balbir Singh wrote:

* Avi Kivity a...@redhat.com  [2010-12-13 13:57:37]:

  On 12/11/2010 03:57 PM, Balbir Singh wrote:
  * Avi Kivity a...@redhat.com   [2010-12-11 09:31:24]:
  
 On 12/10/2010 07:03 AM, Balbir Singh wrote:
 
 Scheduler people, please flame me with anything I may have done
 wrong, so I can do it right for a next version :)
 
 
 This is a good problem statement, there are other things to consider
 as well
 
 1. If a hard limit feature is enabled underneath, donating the
 timeslice would probably not make too much sense in that case
  
 What's the alternative?
  
 Consider a two vcpu guest with a 50% hard cap.  Suppose the workload
 involves ping-ponging within the guest.  If the scheduler decides to
 schedule the vcpus without any overlap, then the throughput will be
 dictated by the time slice.  If we allow donation, throughput is
 limited by context switch latency.
  
  
  If the vcpu holding the lock runs more and is capped, the timeslice
  transfer is a heuristic that will not help.

  Why not?  as long as we shift the cap as well.


Shifting the cap would break it, no?


The total cap for the guest would remain.


Anyway, that is something for us
to keep track of as we add additional heuristics, not a show stopper.


Sure, as long as we see a way to fix it eventually.

--
error compiling committee.c: too many arguments to function



In what order are the CPUs discovered?

2010-12-13 Thread Henry Pepper
Hi

Where can I find out the order in which the CPUs are discovered?

When having:
  - Multiple sockets.
  - Multiple cores.
  - Hyper-threading (HTT).

E.g. a single Socket with two cores and HTT enabled on both cores.
This would be 4 CPUs.
Would cpu0 and cpu1 be the first core, and cpu2 and 3 the second core?

What happens if HTT is disabled on core0 and enabled on core1?
 Would I see cpu0, cpu2,cpu3?
  or would it be cpu0, cpu1, cpu2 ?

Any suggestions on where I could find information on this would be appreciated.

  Thanks

 Henry


Re: [PATCH v2 1/2] Do not register kvmclock savevm section if kvmclock is disabled.

2010-12-13 Thread Glauber Costa
On Wed, 2010-12-08 at 17:31 -0200, Marcelo Tosatti wrote:
 On Tue, Dec 07, 2010 at 03:12:36PM -0200, Glauber Costa wrote:
  On Mon, 2010-12-06 at 19:04 -0200, Marcelo Tosatti wrote:
   On Mon, Dec 06, 2010 at 09:03:46AM -0500, Glauber Costa wrote:
Usually nobody usually thinks about that scenario (me included and 
specially),
but kvmclock can be actually disabled in the host.

It happens in two scenarios:
 1. host too old.
 2. we passed -kvmclock to our -cpu parameter.

In both cases, we should not register kvmclock savevm section. This 
patch
achives that by registering this section only if kvmclock is actually
currently enabled in cpuid.

The only caveat is that we have to register the savevm section a little 
bit
later, since we won't know the final kvmclock state before cpuid gets 
parsed.
   
   What is the problem of registering the section? Restoring the value if
   the host does not support it returns an error?
   
   Can't you ignore the error if kvmclock is not reported in cpuid, in the
   restore handler?
  
  We can change the restore handler, but not the restore handler of
  binaries that are already out there. The motivation here is precisely to
  address migration to hosts without kvmclock, so it's better to have
  a way to disable, than to count on the fact that the other side will be
  able to ignore it.
 
 OK. Can't you register conditionally on kvmclock cpuid bit at the end of
 kvm_arch_init_vcpu, in target-i386/kvm.c?

Haven't looked at it, but will today. Actually, tsc has (obviously) the
same problem and I plan to respin the patch today including a fix for it
as well.

Thanks!




[PATCH] KVM: Correct kvm_pio tracepoint count field

2010-12-13 Thread Avi Kivity
Currently, we record '1' for count regardless of the real count.  Fix.

Signed-off-by: Avi Kivity a...@redhat.com
---
 arch/x86/kvm/x86.c |4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8d76150..cf5fab1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3948,7 +3948,7 @@ static int emulator_pio_in_emulated(int size, unsigned short port, void *val,
 	if (vcpu->arch.pio.count)
 		goto data_avail;
 
-	trace_kvm_pio(0, port, size, 1);
+	trace_kvm_pio(0, port, size, count);
 
 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = 1;
@@ -3976,7 +3976,7 @@ static int emulator_pio_out_emulated(int size, unsigned short port,
 			      const void *val, unsigned int count,
 			      struct kvm_vcpu *vcpu)
 {
-	trace_kvm_pio(1, port, size, 1);
+	trace_kvm_pio(1, port, size, count);
 
 	vcpu->arch.pio.port = port;
 	vcpu->arch.pio.in = 0;
-- 
1.7.1



Re: trace_printk() support in trace-cmd

2010-12-13 Thread Steven Rostedt
On Sun, 2010-12-12 at 18:10 +0200, Avi Kivity wrote:
 On 11/23/2010 12:52 PM, Avi Kivity wrote:

  I see a trace_printk() commit in trace-cmd.git.  Is that related?  If 
  not, I'll work on getting a small sample of the problem.
 
 
 Sample: http://people.redhat.com/akivity/trace.dat.bz2
 

You said previously that /debug/tracing/printk_formats was empty? This
is the problem. It uses this file to map what the format of the printk
is to what is being printed. But if we don't have this mapping,
trace-cmd (nor perf) can not figure this out.
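
As an illustration of why an empty printk_formats is fatal for the reader side, here is a sketch of the lookup a tool has to do, assuming each trace_printk() record carries only the address of its format string and printk_formats supplies the address-to-text pairs. The struct and function names are hypothetical, not trace-cmd code.

```c
#include <stddef.h>
#include <stdint.h>

/* One printk_formats entry: a format string's address and its text. */
struct fmt_entry {
	uint64_t addr;
	const char *fmt;
};

/* Returns the format text for addr, or NULL if the table has no entry. */
static const char *lookup_fmt(const struct fmt_entry *tbl, size_t n,
			      uint64_t addr)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].addr == addr)
			return tbl[i].fmt;
	return NULL;
}
```

With an empty table every lookup returns NULL, so the arguments stored in the record cannot be formatted even though the record itself was captured.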

You are using the latest kernel for this? What's your work flow? Do you
load kvm modules after you start the trace, or are they always loaded?

Are the trace_printk's in the core kernel too, and not being printed?

Thanks,

-- Steve




Re: trace_printk() support in trace-cmd

2010-12-13 Thread Avi Kivity

On 12/13/2010 05:26 PM, Steven Rostedt wrote:

On Sun, 2010-12-12 at 18:10 +0200, Avi Kivity wrote:
  On 11/23/2010 12:52 PM, Avi Kivity wrote:

I see a trace_printk() commit in trace-cmd.git.  Is that related?  If
not, I'll work on getting a small sample of the problem.
  

  Sample: http://people.redhat.com/akivity/trace.dat.bz2


You said previously that /debug/tracing/printk_formats was empty?


Still the case.


This
is the problem. It uses this file to map what the format of the printk
is to what is being printed. But if we don't have this mapping,
trace-cmd (nor perf) can not figure this out.

You are using the latest kernel for this?


2.6.37-rc5 plus a bunch of kvm patches.


  What's your work flow? Do you
load kvm modules after you start the trace, or are they always loaded?


Loaded on boot.


Are the trace_printk's in the core kernel too, and not being printed?


I don't have any trace_printk()s in the core kernel, only in modules.  
Perhaps module initialization does not communicate trace_printk formats?


--
error compiling committee.c: too many arguments to function



Re: trace_printk() support in trace-cmd

2010-12-13 Thread Steven Rostedt
On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote:

What's your work flow? Do you
  load kvm modules after you start the trace, or are they always loaded?
 
 Loaded on boot.

Via initramfs?

 
  Are the trace_printk's in the core kernel too, and not being printed?
 
 I don't have any trace_printk()s in the core kernel, only in modules.  
 Perhaps module initialization does not communicate trace_printk formats?

They should.

Could you send me a patch that has the trace_printk()s you are using?

Thanks,

-- Steve





Re: [PATCH 2/2] tools/virtio: virtio_test tool

2010-12-13 Thread Michael S. Tsirkin
On Mon, Dec 06, 2010 at 02:37:05PM -0200, Thiago Farina wrote:
 On Mon, Nov 29, 2010 at 3:16 PM, Michael S. Tsirkin m...@redhat.com wrote:
  +#define container_of(ptr, type, member) ({                     \
  +       const typeof( ((type *)0)-member ) *__mptr = (ptr);    \
  +       (type *)( (char *)__mptr - offsetof(type,member) );})
  +
  +#define uninitialized_var(x) x = x
  +
  +# ifndef likely
  +#  define likely(x)    (__builtin_expect(!!(x), 1))
  +# endif
  +# ifndef unlikely
  +#  define unlikely(x)  (__builtin_expect(!!(x), 0))
  +# endif
 
 It seems you are not using these macros. Do you really need them here?

They are used by virtio that I'm compiling in userspace here.
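
For readers unfamiliar with these helpers, a small userspace sketch of how they are typically used follows. The macros mirror the quoted ones, except that __typeof__ is spelled out so the snippet also builds in strict ISO modes; they still rely on GCC/Clang extensions. The structures and dev_from_node() are hypothetical and are not taken from the virtio code.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as the quoted macro; needs GCC/Clang statement expressions. */
#define container_of(ptr, type, member) ({                          \
        const __typeof__(((type *)0)->member) *__mptr = (ptr);      \
        (type *)((char *)__mptr - offsetof(type, member)); })

#define likely(x)   (__builtin_expect(!!(x), 1))
#define unlikely(x) (__builtin_expect(!!(x), 0))

/* Hypothetical structures for illustration only. */
struct list_node {
    struct list_node *next;
};

struct demo_dev {
    int id;
    struct list_node node;   /* embedded member we hold a pointer to */
};

/* Recover the enclosing demo_dev from a pointer to its embedded node. */
static struct demo_dev *dev_from_node(struct list_node *n)
{
    if (unlikely(n == NULL))
        return NULL;
    return container_of(n, struct demo_dev, node);
}
```

Code that embeds a generic member in a larger structure and later needs the enclosing object back needs container_of, which is presumably why the ring code pulls it in.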

 Can't you include the right linux header files for these macros
 instead?

Far from trivial, as Linux headers aren't intended to
be built in userspace; if you try, you get all kinds of
conflicts with libc headers etc.

If you see a way to do this, pls send me a patch.

-- 
MST


Re: [RFC PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Rik van Riel

On 12/11/2010 08:57 AM, Balbir Singh wrote:


If the VCPU holding the lock runs more and is capped, the timeslice
transfer is a heuristic that will not help.


That indicates you really need the cap to be per guest, and
not per VCPU.

Having one VCPU spin on a lock (and achieve nothing), because
the other one cannot give up the lock due to hitting its CPU
cap could lead to showstoppingly bad performance.

--
All rights reversed


Re: trace_printk() support in trace-cmd

2010-12-13 Thread Avi Kivity

On 12/13/2010 06:28 PM, Steven Rostedt wrote:

On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote:

  What's your work flow? Do you
load kvm modules after you start the trace, or are they always loaded?

  Loaded on boot.

Via initramfs?


No, regular printks.



Are the trace_printk's in the core kernel too, and not being printed?

  I don't have any trace_printk()s in the core kernel, only in modules.
  Perhaps module initialization does not communicate trace_printk formats?

They should.

Could you send me a patch that has the trace_printk()s you are using.



Attached (with __trace_printk()s, which is what I used).


--
error compiling committee.c: too many arguments to function

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index d75ba1e..df86917 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1449,6 +1449,10 @@ static struct kvm_mmu_page *kvm_mmu_get_page(struct kvm_vcpu *vcpu,
 	if (role.direct)
 		role.cr4_pae = 0;
 	role.access = access;
+	__trace_printk(_THIS_IP_,
+		   "base_role %x access %x role.access %x role %x\n",
+		   vcpu->arch.mmu.base_role, access, role.access,
+		   role.word);
 	if (!vcpu->arch.mmu.direct_map
 	    && vcpu->arch.mmu.root_level <= PT32_ROOT_LEVEL) {
 		quadrant = gaddr >> (PAGE_SHIFT + (PT64_PT_BITS * level));
@@ -1576,6 +1580,11 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		if (child->role.access == direct_access)
 			return;
 
+		__trace_printk(_THIS_IP_,
+			   "child->role %x child->role.access %x direct_access %x\n",
+			   child->role.word, child->role.access,
+			   direct_access);
+
 		mmu_page_remove_parent_pte(child, sptep);
 		__set_spte(sptep, shadow_trap_nonpresent_pte);
 		kvm_flush_remote_tlbs(vcpu->kvm);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 4f61fbb..1049729 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -450,6 +450,8 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 	if (!is_present_gpte(gw->ptes[gw->level - 1]))
 		return NULL;
 
+	__trace_printk(_THIS_IP_, "pt_access %x pte_access %x dirty %d\n",
+		   gw->pt_access, gw->pte_access, dirty);
 	direct_access = gw->pt_access & gw->pte_access;
 	if (!dirty)
 		direct_access &= ~ACC_WRITE_MASK;
@@ -592,6 +594,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (is_error_pfn(pfn))
 		return kvm_handle_bad_page(vcpu->kvm, walker.gfn, pfn);
 
+	__trace_printk(_THIS_IP_, "page_fault: map_writeable %x\n",
+		   map_writable);
+
 	spin_lock(&vcpu->kvm->mmu_lock);
 	if (mmu_notifier_retry(vcpu, mmu_seq))
 		goto out_unlock;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 83f5bf6..05481a3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1015,6 +1015,8 @@ static pfn_t hva_to_pfn(struct kvm *kvm, unsigned long addr, bool atomic,
 	if (unlikely(npages != 1) && !atomic) {
 		might_sleep();
 
+		__trace_printk(_THIS_IP_, "%s: addr %lx not writeable\n",
+			   __func__, addr);
 		if (writable)
 			*writable = write_fault;
 


Re: trace_printk() support in trace-cmd

2010-12-13 Thread Avi Kivity

On 12/13/2010 07:05 PM, Avi Kivity wrote:

On 12/13/2010 06:28 PM, Steven Rostedt wrote:

On Mon, 2010-12-13 at 17:43 +0200, Avi Kivity wrote:

 What's your work flow? Do you
   load kvm modules after you start the trace, or are they always 
loaded?


  Loaded on boot.

Via initramfs?


No, regular printks.


Regular modprobe.

--
error compiling committee.c: too many arguments to function



Re: [PATCH 2/2] tools/virtio: virtio_test tool

2010-12-13 Thread Michael S. Tsirkin
On Mon, Dec 06, 2010 at 03:23:02PM +1030, Rusty Russell wrote:
 On Tue, 30 Nov 2010 03:46:37 am Michael S. Tsirkin wrote:
  This is the userspace part of the tool: it includes a bunch of stubs for
  linux APIs, somewhat similar to linuxsched. This makes it possible to
  recompile the ring code in userspace.
  
  A small test example is implemented combining this with vhost_test
  module.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 Hi Michael,
 
   I'm not sure what the point is of this work?  You'll still need to
 benchmark on real systems, but it's not low-level enough to measure
 things like cache misses.

The point is to be able to create easy to test workloads:
(just running the single test included here produces a
 result that seems repeatable to a high degree)
while still staying as close as possible to what we might expect in real
life.

I also want to be able to measure just the overhead of the ring,
without involving block or network core in guest and host.

In other words, it's a synthetic benchmark.

 I'm assuming you're thinking of playing with layout to measure cache
 behaviour.

In one example, using this test I saw that different publish
used index layouts don't seem to behave at all differently.

But I also saw that the extra pointer chasing
added by my publish used index patches did add
measurable overhead.

Plan to look into that.

  I was thinking of a complete userspace implementation

The disadvantage is that any work done there needs to be
redone in real life, though. And implementation details often matter.
What I did let me actually use the virtio/vhost code that we have and see how
it performs.

 where
 either it was run under cachegrind, or each access was wrapped to allow
 tracking of cachelines to give an exact measure of cache movement

perf stat not good enough?

 under
 various scenarios (esp. ring mostly empty, ring in steady state, ring
 mostly full).

Yes, I do want to add tests to stress various scenarios.

 
 Cheers,
 Rusty.


-- 
MST


Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations

2010-12-13 Thread Michael S. Tsirkin
On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
 Please merge the following tree for 2.6.38.
 Thanks!

Um, I sent this out before I noticed the mail from Rusty
with some questions on the test code. I missed that and
assumed no comments - no issues, perhaps wrongly.

Rusty - I tried answering the questions there - any issues
with merging this? It's just a test so won't be hard to remove
later if it's not helpful ...

 The following changes since commit ad1184c6cf067a13e8cb2a4e7ccc407f947027d0:
 
   net: au1000_eth: remove unused global variable. (2010-12-11 12:01:48 -0800)
 
 are available in the git repository at:
   git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git vhost-net-next
 
 Jason Wang (1):
   vhost: fix typos in comment
 
 Julia Lawall (1):
   drivers/vhost/vhost.c: delete double assignment
 
 Michael S. Tsirkin (9):
   vhost: put mm after thread stop
   vhost-net: batch use/unuse mm
   vhost: copy_to_user - __copy_to_user
   vhost: get/put_user - __get/__put_user
   vhost: remove unused include
   vhost: correctly set bits of dirty pages
   vhost: better variable name in logging
   vhost test module
   tools/virtio: virtio_test tool
 
  drivers/vhost/net.c  |9 +-
  drivers/vhost/test.c |  320 
 ++
  drivers/vhost/test.h |7 +
  drivers/vhost/vhost.c|   44 +++---
  drivers/vhost/vhost.h|2 +-
  tools/virtio/Makefile|   12 ++
  tools/virtio/linux/device.h  |2 +
  tools/virtio/linux/slab.h|2 +
  tools/virtio/linux/virtio.h  |  223 +++
  tools/virtio/vhost_test/Makefile |2 +
  tools/virtio/vhost_test/vhost_test.c |1 +
  tools/virtio/virtio_test.c   |  248 ++
  12 files changed, 842 insertions(+), 30 deletions(-)
  create mode 100644 drivers/vhost/test.c
  create mode 100644 drivers/vhost/test.h
  create mode 100644 tools/virtio/Makefile
  create mode 100644 tools/virtio/linux/device.h
  create mode 100644 tools/virtio/linux/slab.h
  create mode 100644 tools/virtio/linux/virtio.h
  create mode 100644 tools/virtio/vhost_test/Makefile
  create mode 100644 tools/virtio/vhost_test/vhost_test.c
  create mode 100644 tools/virtio/virtio_test.c


Re: Freezing Windows 2008 x64bit guest

2010-12-13 Thread Manfred Heubach


Gleb Natapov gleb at redhat.com writes:

 
 On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote:
  Gleb Natapov wrote:
  On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote:
  Gleb Natapov kirjoitti:
  On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote:
  Gleb Natapov kirjoitti:
  On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote:
  But one Windows 2008 64 Bit Server Standard is freezing regularly.
  This happens sometimes 3 times a day, sometimes it takes 2 days
  until freeze. The Windows Machine is a clean fresh install.
  I think I have seen same problem occur on my Windows 2008 SBS SP2
  64bit system, but a bit less often, only like once a week.
  Now I haven't seen crashes but only freezes with qemu on 100% and
  virtual system unresponsive.
  Does sendkey from monitor works? qemu-kvm-0.11.1 is very old and this is
  not total freeze which even harder to debug. I don't see anything
  extraordinary in your logs. 4643 interrupt per second for 4 cpus is
  normal if windows runs multimedia or other app that need hi-res timers.
  Does your host swapping? Is there any chance that you can try upstream
qemu-kvm?
  
  I tried running qemu-kvm from git but it exhibited the same problem
  as 12.x that I tried before, BSODing once in a while, running kernel
  2.6.34.1.
  
 That should be pretty stable config, although it would be nice if you
 could try running in qemu-kvm.git head.
 
  sample BSOD failure details:
  These two with Realtek nic and qemu cpu
  0x0019 (0x0020, 0xf88007e65970,
  0xf88007e65990, 0x0502040f)
  0x0019 (0x0020, 0xf88007a414c0,
  0xf88007a414e0, 0x0502044c)
  
  These are with e1000 and -cpu host
  0x003b (0xc005, 0xf80001c5d842,
  0xfa60093ddb70, 0x)
  0x003b (0xc005, 0xf80001cb8842,
  0xfa600c94ab70, 0x)
  0x000a (0x0080, 0x000c,
  0x0001, 0xf80001cadefd)
  
 Can you attach screenshots of BSODs? Have you reinstalled your guests or
 are you running the same images you ran in 11.x?
 
  I'll see if I can analyze minidumps later.
  
  In addition to these there have been as many reboots that have been
  only logged as 'disruptive shutdown'.
  
  Right now I'm running the problematic guest under Xen
  3.2.1-something from Debian to see if it works better.
  
  -- 
  Harri.

 Hello,

Is there a solution for that problem? I have been experiencing the same problems ever
since I installed SBS 2008 on KVM.

I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly because
of performance problems which were solved by the upgrade.

After the upgrade the system became extremely unstable. It was crashing as soon
as disk I/O and network I/O load grew. 100% reproducible with Windows
Server Backup to an iSCSI volume.

I had virtio drivers for storage and network installed (redhat/fedora 1.1.11).
At each BSOD I had the following line in the log of the guest:

 virtio_ioport_write: unexpected address 0x13 value 0x1

I changed the network interface back to e1000. What I experience now (and I had
that at the very beginning, before I switched to virtio network) are freezes. The
guest doesn't respond anymore (doesn't answer pings and doesn't react to
mouse/keyboard input). Host CPU usage of the kvm process is 100% on as many
cores as there are virtual CPUs (in this case 4).

I'm a bit frustrated about this. I have 2 Windows 2003 32-bit, 1 Windows XP and 3
Linux guests (2x 32-bit, 1x 64-bit). They are all running without any problems
(except that the Windows XP guest cannot boot without an ntldr CD image). Only
the SBS 2008 guest regularly freezes.

The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 5805, 24
GB DDR3 RAM.

I know there is a lack of detailed information right now. I first need to know
if anybody is working on this or has similar problems. I can deliver minidumps,
and any debugging information you need.

I don't want to give up now. We will switch to Hyper-V if we cannot solve this,
because we need a stable virtualization platform for Windows guests. I would
like to use KVM; it is so much more flexible.

Best regards
Manfred






Re: Freezing Windows 2008 x64bit guest

2010-12-13 Thread Dor Laor

On 12/13/2010 09:42 PM, Manfred Heubach wrote:



Gleb Natapov gleb at redhat.com writes:



On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote:

Gleb Natapov wrote:

On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote:

Gleb Natapov kirjoitti:

On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote:

Gleb Natapov kirjoitti:

On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote:

But one Windows 2008 64 Bit Server Standard is freezing regularly.
This happens sometimes 3 times a day, sometimes it takes 2 days
until freeze. The Windows Machine is a clean fresh install.

I think I have seen same problem occur on my Windows 2008 SBS SP2
64bit system, but a bit less often, only like once a week.
Now I haven't seen crashes but only freezes with qemu on 100% and
virtual system unresponsive.

Does sendkey from monitor works? qemu-kvm-0.11.1 is very old and this is
not total freeze which even harder to debug. I don't see anything
extraordinary in your logs. 4643 interrupt per second for 4 cpus is
normal if windows runs multimedia or other app that need hi-res timers.
Does your host swapping? Is there any chance that you can try upstream

qemu-kvm?


I tried running qemu-kvm from git but it exhibited the same problem
as 12.x that I tried before, BSODing once in a while, running kernel
2.6.34.1.


That should be pretty stable config, although it would be nice if you
could try running in qemu-kvm.git head.


sample BSOD failure details:
These two with Realtek nic and qemu cpu
0x0019 (0x0020, 0xf88007e65970,
0xf88007e65990, 0x0502040f)
0x0019 (0x0020, 0xf88007a414c0,
0xf88007a414e0, 0x0502044c)

These are with e1000 and -cpu host
0x003b (0xc005, 0xf80001c5d842,
0xfa60093ddb70, 0x)
0x003b (0xc005, 0xf80001cb8842,
0xfa600c94ab70, 0x)
0x000a (0x0080, 0x000c,
0x0001, 0xf80001cadefd)


Can you attach screenshots of BSODs? Have you reinstalled your guests or
are you running the same images you ran in 11.x?


I'll see if I can analyze minidumps later.

In addition to these there have been as many reboots that have been
only logged as 'disruptive shutdown'.

Right now I'm running the problematic guest under Xen
3.2.1-something from Debian to see if it works better.

--
Harri.



  Hello,

is there a solution for that problem? I'm experiencing the same problems ever
since I installed SBS 2008 on KVM.

I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly because
of performance problems which were solved by the upgrade.

After the upgrade the system became extremely unstable. It was crashing as soon
as disk I/O and network I/O load grew. 100% reproducible with Windows
Server Backup to an iSCSI volume.

I had virtio drivers for storage and network installed (redhat/fedora 1.1.11).


Which fedora/rhel release is that?
What's the windows virtio driver version?

Have you tried using virt-manager/virsh instead of the raw cmdline?
About e1000: some Windows versions come with a buggy driver, and an updated
e1000 driver from Intel fixes some issues.




At each BSOD I had the following line in the log of the guest:

  virtio_ioport_write: unexpected address 0x13 value 0x1

I changed the network interface back to e1000. What I experience now (and I had
that at the very beginning, before I switched to virtio network) are freezes. The
guest doesn't respond anymore (doesn't answer pings and doesn't react to
mouse/keyboard input). Host CPU usage of the kvm process is 100% on as many
cores as there are virtual CPUs (in this case 4).

I'm a bit frustrated about this. I have 2 Windows 2003 32-bit, 1 Windows XP and 3
Linux guests (2x 32-bit, 1x 64-bit). They are all running without any problems
(except that the Windows XP guest cannot boot without an ntldr CD image). Only
the SBS 2008 guest regularly freezes.

The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 5805, 24
GB DDR3 RAM.

I know there is a lack of detailed information right now. I first need to know
if anybody is working on this or has similar problems. I can deliver minidumps,
and any debugging information you need.

I don't want to give up now. We will switch to Hyper-V if we cannot solve this,
because we need a stable virtualization platform for Windows guests. I would
like to use KVM; it is so much more flexible.

Best regards
Manfred






Re: Freezing Windows 2008 x64bit guest

2010-12-13 Thread Vadim Rozenfeld
On Mon, 2010-12-13 at 22:12 +0200, Dor Laor wrote:
 On 12/13/2010 09:42 PM, Manfred Heubach wrote:
 
 
  Gleb Natapov gleb at redhat.com writes:
 
 
  On Wed, Jul 28, 2010 at 12:53:02AM +0300, Harri Olin wrote:
  Gleb Natapov wrote:
  On Wed, Jul 21, 2010 at 09:25:31AM +0300, Harri Olin wrote:
  Gleb Natapov kirjoitti:
  On Mon, Jul 19, 2010 at 10:17:02AM +0300, Harri Olin wrote:
  Gleb Natapov kirjoitti:
  On Thu, Jul 15, 2010 at 03:19:44PM +0200, Christoph Adomeit wrote:
  But one Windows 2008 64 Bit Server Standard is freezing regularly.
  This happens sometimes 3 times a day, sometimes it takes 2 days
  until freeze. The Windows Machine is a clean fresh install.
  I think I have seen same problem occur on my Windows 2008 SBS SP2
  64bit system, but a bit less often, only like once a week.
  Now I haven't seen crashes but only freezes with qemu on 100% and
  virtual system unresponsive.
  Does sendkey from monitor works? qemu-kvm-0.11.1 is very old and this is
  not total freeze which even harder to debug. I don't see anything
  extraordinary in your logs. 4643 interrupt per second for 4 cpus is
  normal if windows runs multimedia or other app that need hi-res timers.
  Does your host swapping? Is there any chance that you can try upstream
  qemu-kvm?
 
  I tried running qemu-kvm from git but it exhibited the same problem
  as 12.x that I tried before, BSODing once in a while, running kernel
  2.6.34.1.
 
  That should be pretty stable config, although it would be nice if you
  could try running in qemu-kvm.git head.
 
  sample BSOD failure details:
  These two with Realtek nic and qemu cpu
  0x0019 (0x0020, 0xf88007e65970,
  0xf88007e65990, 0x0502040f)
  0x0019 (0x0020, 0xf88007a414c0,
  0xf88007a414e0, 0x0502044c)
 
  These are with e1000 and -cpu host
  0x003b (0xc005, 0xf80001c5d842,
  0xfa60093ddb70, 0x)
  0x003b (0xc005, 0xf80001cb8842,
  0xfa600c94ab70, 0x)
  0x000a (0x0080, 0x000c,
  0x0001, 0xf80001cadefd)
 
  Can you attach screenshots of BSODs? Have you reinstalled your guests or
  are you running the same images you ran in 11.x?
 
  I'll see if I can analyze minidumps later.
 
  In addition to these there have been as many reboots that have been
  only logged as 'disruptive shutdown'.
 
  Right now I'm running the problematic guest under Xen
  3.2.1-something from Debian to see if it works better.
 
  --
  Harri.
 
Hello,
 
  is there a solution for that problem? I'm experiencing the same problems 
  ever
  since I installed SBS 2008 on KVM.
 
  I was running the host with Ubuntu 10.04 but upgraded to 10.10 - mainly 
  because
  of performance problems which were solved by the upgrade.
 
  After the upgrade the system became extremely unstable. It was crashing as 
  soon
  as disk io and network io load was growing. 100% reproducible with windows
  server backup to an iscsi volume.
 
  i had virtio drivers for storage and network installed (redhat/fedora 
  1.1.11).
 
 Which fedora/rhel release is that?
 What's the windows virtio driver version?
 
 Have you tried using virt-manager/virhs instead of raw cmdline?
 About e1000, some windows comes with buggy driver and an update e1000 
 from Intel fixes some issues.
 
 
  At each BSOD I had the following line in the log of the guest:
 
virtio_ioport_write: unexpected address 0x13 value 0x1
 
  I changed the network interface back to e1000. What I experience now (and I 
  had
  that at the very beginning before I switched to virtio network) are freezes. 
  The
  guest doesn't respond anymore (doesn't answer to pings and doesn't interact 
  via
  mouse/keyboard anymore). Host CPU usage of the kvm process is 100% on as 
  many
  cores as there are virtual cpus (in this case 4).
 
Sounds like an interrupt storm to me. Can you try to ping your VM?
Anyway, the best way to start debugging a stalled system is to crash
it with a BSOD. To do that, you will need to:
- enable NMICrashDump (please see http://support.microsoft.com/kb/927069
  for more information)
- enable Kernel Memory Dump (Complete is actually much better, but it
  can be too big): http://support.microsoft.com/kb/969028
- type nmi 0 in the qemu monitor to crash the system the next time it
  hangs.

Best regards,
Vadim.
  I'm a bit frustrated about this. I have 2 windows 2003 32bit, 1 windows xp 
  and 3
  linux guests (2x 32bit, 1x64 bit). They are all running without any problems
  (except that the windows xp guest cannot boot without an ntldr cd image). 
  Only
  the SBS2008 guest regularly freezes.
 
  The host system has 2 Intel Xeon 5504, Intel Chipset 5500, Adaptec Raid 
  5805, 24
  GB DDR3 RAM.
 
  I know there is a lack of detailed information right now. I first need to 
  know
  if anybody is working on this or has similar problems. I can deliver 
  minidumps,
  

[RESEND PATCH v3 0/2] Minimal RAM API support

2010-12-13 Thread Alex Williamson
No comments since v3, please apply.  Thanks,

Alex

v3:

 - Address review comments
 - pc registers all memory below 4G in one chunk

Let me know if there are any further issues.

v2:

 - Move to Makefile.objs
 - Move structures to memory.c and create a callback function
 - Fix memory leak

I haven't moved to the state parameter because there should only
be a single instance of this per VM.  The state parameter seems
like it would add complications in setup and function calling, but
maybe point me to an example if I'm off base.

v1:

For VFIO based device assignment, we need to know what guest memory
areas are actual RAM.  RAMBlocks have long since become a grab bag
of misc allocations, so aren't effective for this.  Anthony has had
a RAM API in mind for a while now that addresses this problem.  This
implements just enough of it so that we have an interface to get
actual guest memory physical addresses to setup the host IOMMU.  We
can continue building a full RAM API on top of this stub.
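
As a rough illustration of how a consumer such as VFIO might walk the registered RAM, here is a self-contained sketch. It reuses the callback typedef from the patch, but everything prefixed demo_ is a hypothetical stand-in: the typedefs are simplified, a fixed array replaces the QLIST, and the callback merely sums region sizes where a real user would presumably set up IOMMU mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in typedefs for illustration; QEMU defines the real ones. */
typedef uint64_t target_phys_addr_t;
typedef unsigned long ram_addr_t;

typedef int (*qemu_ram_for_each_slot_fn)(void *opaque,
                                         target_phys_addr_t start_addr,
                                         ram_addr_t size,
                                         ram_addr_t phys_offset);

/* Hypothetical fixed-size slot table; the patch itself uses a QLIST. */
#define DEMO_MAX_SLOTS 8

struct demo_slot {
    target_phys_addr_t start_addr;
    ram_addr_t size;
    ram_addr_t offset;
};

static struct demo_slot demo_slots[DEMO_MAX_SLOTS];
static size_t demo_nslots;

/* Record one RAM region; returns 0, or -1 on zero size or a full table. */
static int demo_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
                             ram_addr_t phys_offset)
{
    if (!size || demo_nslots == DEMO_MAX_SLOTS) {
        return -1;
    }
    demo_slots[demo_nslots].start_addr = start_addr;
    demo_slots[demo_nslots].size = size;
    demo_slots[demo_nslots].offset = phys_offset;
    demo_nslots++;
    return 0;
}

/* Call fn() on each region, stopping at the first non-zero return. */
static int demo_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn)
{
    for (size_t i = 0; i < demo_nslots; i++) {
        int ret = fn(opaque, demo_slots[i].start_addr,
                     demo_slots[i].size, demo_slots[i].offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Hypothetical consumer: add up region sizes into *opaque. */
static int demo_count_ram(void *opaque, target_phys_addr_t start_addr,
                          ram_addr_t size, ram_addr_t phys_offset)
{
    (void)start_addr;
    (void)phys_offset;
    *(uint64_t *)opaque += size;
    return 0;
}
```

The opaque pointer lets the caller thread its own state through the walk, so one iterator can serve accounting, mapping, or teardown without global variables.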

Anthony, feel free to add copyright to memory.c as it's based on
your initial implementation.  I had to add something since the file
in your branch just copies a header with Fabrice's copyright.

---

Alex Williamson (2):
  RAM API: Make use of it for x86 PC
  Minimal RAM API support


 Makefile.objs |1 +
 cpu-common.h  |2 +
 hw/pc.c   |9 ++---
 memory.c  |   97 +
 memory.h  |   44 ++
 5 files changed, 147 insertions(+), 6 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h


[RESEND PATCH v3 1/2] Minimal RAM API support

2010-12-13 Thread Alex Williamson
This adds a minimum chunk of Anthony's RAM API support so that we
can identify actual VM RAM versus all the other things that make
use of qemu_ram_alloc.
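
Registration in this patch refuses a range that overlaps an existing slot. The check can be sketched as below. This is an illustrative version written for this note, not a copy of QEMU's ranges_overlap(). It assumes non-zero lengths and ranges that do not wrap around the end of the address space, and the demo_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Last byte covered by [first, first + len); len must be non-zero. */
static uint64_t demo_range_last(uint64_t first, uint64_t len)
{
    return first + len - 1;
}

/* Return non-zero if the two ranges share at least one byte. */
static int demo_ranges_overlap(uint64_t first1, uint64_t len1,
                               uint64_t first2, uint64_t len2)
{
    uint64_t last1 = demo_range_last(first1, len1);
    uint64_t last2 = demo_range_last(first2, len2);

    /* Disjoint only if one range ends before the other begins. */
    return !(last2 < first1 || last1 < first2);
}
```

Comparing inclusive last addresses rather than exclusive end addresses avoids an overflow when a range ends exactly at the top of the 64-bit space.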

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Makefile.objs |1 +
 cpu-common.h  |2 +
 memory.c  |   97 +
 memory.h  |   44 ++
 4 files changed, 144 insertions(+), 0 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h

diff --git a/Makefile.objs b/Makefile.objs
index cebb945..47f3c3a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o
 hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
 hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
 hw-obj-y += watchdog.o
+hw-obj-y += memory.o
 hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
 hw-obj-$(CONFIG_ECC) += ecc.o
 hw-obj-$(CONFIG_NAND) += nand.o
diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..f08f93b 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -29,6 +29,8 @@ enum device_endian {
 /* address in the RAM (different from a physical address) */
 typedef unsigned long ram_addr_t;
 
+#include "memory.h"
+
 /* memory API */
 
 typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, uint32_t value);
diff --git a/memory.c b/memory.c
new file mode 100644
index 0000000..742776f
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,97 @@
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson alex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+#include "memory.h"
+#include "range.h"
+
+typedef struct ram_slot {
+    target_phys_addr_t start_addr;
+    ram_addr_t size;
+    ram_addr_t offset;
+    QLIST_ENTRY(ram_slot) next;
+} ram_slot;
+
+static QLIST_HEAD(ram_slots, ram_slot) ram_slots =
+    QLIST_HEAD_INITIALIZER(ram_slots);
+
+static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr,
+                                    ram_addr_t size)
+{
+    ram_slot *slot;
+
+    QLIST_FOREACH(slot, &ram_slots, next) {
+        if (slot->start_addr == start_addr && slot->size == size) {
+            return slot;
+        }
+
+        if (ranges_overlap(start_addr, size, slot->start_addr, slot->size)) {
+            hw_error("Ram range overlaps existing slot\n");
+        }
+    }
+
+    return NULL;
+}
+
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+                      ram_addr_t phys_offset)
+{
+    ram_slot *slot;
+
+    if (!size) {
+        return -EINVAL;
+    }
+
+    assert(!qemu_ram_find_slot(start_addr, size));
+
+    slot = qemu_mallocz(sizeof(ram_slot));
+
+    slot->start_addr = start_addr;
+    slot->size = size;
+    slot->offset = phys_offset;
+
+    QLIST_INSERT_HEAD(&ram_slots, slot, next);
+
+    cpu_register_physical_memory(slot->start_addr, slot->size, slot->offset);
+
+    return 0;
+}
+
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
+{
+    ram_slot *slot;
+
+    if (!size) {
+        return;
+    }
+
+    slot = qemu_ram_find_slot(start_addr, size);
+    assert(slot != NULL);
+
+    QLIST_REMOVE(slot, next);
+    qemu_free(slot);
+    cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
+
+    return;
+}
+
+int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn)
+{
+    ram_slot *slot;
+
+    QLIST_FOREACH(slot, &ram_slots, next) {
+        int ret = fn(opaque, slot->start_addr, slot->size, slot->offset);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
diff --git a/memory.h b/memory.h
new file mode 100644
index 0000000..e7aa5cb
--- /dev/null
+++ b/memory.h
@@ -0,0 +1,44 @@
+#ifndef QEMU_MEMORY_H
+#define QEMU_MEMORY_H
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson alex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "cpu-common.h"
+
+typedef int (*qemu_ram_for_each_slot_fn)(void *opaque,
+                                         target_phys_addr_t start_addr,
+                                         ram_addr_t size,
+                                         ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_register() : Register a region of guest physical memory
+ *
+ * The new region must not overlap an existing region.
+ */
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+  ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_unregister() : Unregister a region of guest physical memory
+ */
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size);
+
+/**
+ * qemu_ram_for_each_slot() : Call fn() on each registered region
+ *
+ * Stop on non-zero return from fn().
+ */
+int qemu_ram_for_each_slot(void *opaque, 

[RESEND PATCH v3 2/2] RAM API: Make use of it for x86 PC

2010-12-13 Thread Alex Williamson
Register the actual VM RAM using the new API

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pc.c |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index e1b2667..1554164 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
     /* allocate RAM */
     ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                               below_4g_mem_size + above_4g_mem_size);
-    cpu_register_physical_memory(0, 0xa0000, ram_addr);
-    cpu_register_physical_memory(0x100000,
-                 below_4g_mem_size - 0x100000,
-                 ram_addr + 0x100000);
+    qemu_ram_register(0, below_4g_mem_size, ram_addr);
 #if TARGET_PHYS_ADDR_BITS > 32
     if (above_4g_mem_size > 0) {
-        cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size,
-                                     ram_addr + below_4g_mem_size);
+        qemu_ram_register(0x100000000ULL, above_4g_mem_size,
+                          ram_addr + below_4g_mem_size);
     }
 #endif
 



Re: [RESEND PATCH v3 1/2] Minimal RAM API support

2010-12-13 Thread Anthony Liguori

On 12/13/2010 02:47 PM, Alex Williamson wrote:

This adds a minimum chunk of Anthony's RAM API support so that we
can identify actual VM RAM versus all the other things that make
use of qemu_ram_alloc.

Signed-off-by: Alex Williamsonalex.william...@redhat.com
---

  Makefile.objs |1 +
  cpu-common.h  |2 +
  memory.c  |   97 +
  memory.h  |   44 ++
  4 files changed, 144 insertions(+), 0 deletions(-)
  create mode 100644 memory.c
  create mode 100644 memory.h

diff --git a/Makefile.objs b/Makefile.objs
index cebb945..47f3c3a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o
  hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
  hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
  hw-obj-y += watchdog.o
+hw-obj-y += memory.o
  hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
  hw-obj-$(CONFIG_ECC) += ecc.o
  hw-obj-$(CONFIG_NAND) += nand.o
diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..f08f93b 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -29,6 +29,8 @@ enum device_endian {
  /* address in the RAM (different from a physical address) */
  typedef unsigned long ram_addr_t;

+#include memory.h
+
  /* memory API */

  typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, 
uint32_t value);
diff --git a/memory.c b/memory.c
new file mode 100644
index 000..742776f
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,97 @@
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamsonalex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+#include memory.h
+#include range.h
+
+typedef struct ram_slot {
+target_phys_addr_t start_addr;
+ram_addr_t size;
+ram_addr_t offset;
+QLIST_ENTRY(ram_slot) next;
+} ram_slot;
+
+static QLIST_HEAD(ram_slots, ram_slot) ram_slots =
+QLIST_HEAD_INITIALIZER(ram_slots);
+
+static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr,
+   ram_addr_t size)
+{
+ram_slot *slot;
+
+QLIST_FOREACH(slot,ram_slots, next) {
+if (slot-start_addr == start_addr  slot-size == size) {
+return slot;
+}
+
+if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) {
+hw_error(Ram range overlaps existing slot\n);
+}
+}
+
+return NULL;
+}

   


See CODING_STYLE: the structure should be named RamSlot, and please drop the
qemu_ prefix.


+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+  ram_addr_t phys_offset)
+{
+ram_slot *slot;
+
+if (!size) {
+return -EINVAL;
+}
+
+assert(!qemu_ram_find_slot(start_addr, size));
+
+slot = qemu_mallocz(sizeof(ram_slot));
+
+slot-start_addr = start_addr;
+slot-size = size;
+slot-offset = phys_offset;
+
+QLIST_INSERT_HEAD(ram_slots, slot, next);
+
+cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset);
+
+return 0;
+}
+
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
+{
+ram_slot *slot;
+
+if (!size) {
+return;
+}
+
+slot = qemu_ram_find_slot(start_addr, size);
+assert(slot != NULL);
+
+QLIST_REMOVE(slot, next);
+qemu_free(slot);
+cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
+
+return;
+}
+
+int qemu_ram_for_each_slot(void *opaque, qemu_ram_for_each_slot_fn fn)
+{
+ram_slot *slot;
+
+QLIST_FOREACH(slot,ram_slots, next) {
+int ret = fn(opaque, slot-start_addr, slot-size, slot-offset);
+if (ret) {
+return ret;
+}
+}
+return 0;
+}
diff --git a/memory.h b/memory.h
new file mode 100644
index 000..e7aa5cb
--- /dev/null
+++ b/memory.h
@@ -0,0 +1,44 @@
+#ifndef QEMU_MEMORY_H
+#define QEMU_MEMORY_H
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamsonalex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include qemu-common.h
+#include cpu-common.h
+
+typedef int (*qemu_ram_for_each_slot_fn)(void *opaque,
+ target_phys_addr_t start_addr,
+ ram_addr_t size,
+ ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_register() : Register a region of guest physical memory
+ *
+ * The new region must not overlap an existing region.
+ */
+int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+  ram_addr_t phys_offset);
+
+/**
+ * qemu_ram_unregister() : Unregister a region of guest physical memory
+ */
+void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size);
+
+/**
+ * qemu_ram_for_each_slot() : Call fn() on each 

Re: [Qemu-devel] [RESEND PATCH v3 1/2] Minimal RAM API support

2010-12-13 Thread Blue Swirl
On Mon, Dec 13, 2010 at 8:47 PM, Alex Williamson
alex.william...@redhat.com wrote:
 This adds a minimum chunk of Anthony's RAM API support so that we
 can identify actual VM RAM versus all the other things that make
 use of qemu_ram_alloc.

 Signed-off-by: Alex Williamson alex.william...@redhat.com
 ---

  Makefile.objs |    1 +
  cpu-common.h  |    2 +
  memory.c      |   97 
 +
  memory.h      |   44 ++
  4 files changed, 144 insertions(+), 0 deletions(-)
  create mode 100644 memory.c
  create mode 100644 memory.h

 diff --git a/Makefile.objs b/Makefile.objs
 index cebb945..47f3c3a 100644
 --- a/Makefile.objs
 +++ b/Makefile.objs
 @@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o
  hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
  hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
  hw-obj-y += watchdog.o
 +hw-obj-y += memory.o
  hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
  hw-obj-$(CONFIG_ECC) += ecc.o
  hw-obj-$(CONFIG_NAND) += nand.o
 diff --git a/cpu-common.h b/cpu-common.h
 index 6d4a898..f08f93b 100644
 --- a/cpu-common.h
 +++ b/cpu-common.h
 @@ -29,6 +29,8 @@ enum device_endian {
  /* address in the RAM (different from a physical address) */
  typedef unsigned long ram_addr_t;

 +#include memory.h
 +
  /* memory API */

  typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, 
 uint32_t value);
 diff --git a/memory.c b/memory.c
 new file mode 100644
 index 000..742776f
 --- /dev/null
 +++ b/memory.c
 @@ -0,0 +1,97 @@
 +/*
 + * RAM API
 + *
 + *  Copyright Red Hat, Inc. 2010
 + *
 + * Authors:
 + *  Alex Williamson alex.william...@redhat.com
 + *
 + * This work is licensed under the terms of the GNU GPL, version 2.  See
 + * the COPYING file in the top-level directory.
 + *
 + */
 +#include memory.h
 +#include range.h
 +
 +typedef struct ram_slot {
 +    target_phys_addr_t start_addr;
 +    ram_addr_t size;
 +    ram_addr_t offset;
 +    QLIST_ENTRY(ram_slot) next;
 +} ram_slot;

Please see CODING_STYLE for structure naming.

 +
 +static QLIST_HEAD(ram_slots, ram_slot) ram_slots =
 +    QLIST_HEAD_INITIALIZER(ram_slots);
 +
 +static ram_slot *qemu_ram_find_slot(target_phys_addr_t start_addr,
 +                                   ram_addr_t size)
 +{
 +    ram_slot *slot;
 +
 +    QLIST_FOREACH(slot, ram_slots, next) {
 +        if (slot-start_addr == start_addr  slot-size == size) {
 +            return slot;
 +        }
 +
 +        if (ranges_overlap(start_addr, size, slot-start_addr, slot-size)) {
 +            hw_error(Ram range overlaps existing slot\n);
 +        }
 +    }
 +
 +    return NULL;
 +}
 +
 +int qemu_ram_register(target_phys_addr_t start_addr, ram_addr_t size,
 +                      ram_addr_t phys_offset)
 +{
 +    ram_slot *slot;
 +
 +    if (!size) {
 +        return -EINVAL;
 +    }
 +
 +    assert(!qemu_ram_find_slot(start_addr, size));
 +
 +    slot = qemu_mallocz(sizeof(ram_slot));

Since you initialize every field by hand later, this could be qemu_malloc().

 +
 +    slot-start_addr = start_addr;
 +    slot-size = size;
 +    slot-offset = phys_offset;
 +
 +    QLIST_INSERT_HEAD(ram_slots, slot, next);
 +
 +    cpu_register_physical_memory(slot-start_addr, slot-size, slot-offset);
 +
 +    return 0;
 +}
 +
 +void qemu_ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
 +{
 +    ram_slot *slot;
 +
 +    if (!size) {
 +        return;
 +    }
 +
 +    slot = qemu_ram_find_slot(start_addr, size);
 +    assert(slot != NULL);
 +
 +    QLIST_REMOVE(slot, next);
 +    qemu_free(slot);
 +    cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
 +
 +    return;

This trailing return in a void function is useless.


[RESEND PATCH] exec: Implement qemu_ram_free_from_ptr()

2010-12-13 Thread Alex Williamson
Required for regions mapped via qemu_ram_alloc_from_ptr().  VFIO
and ivshmem will make use of this to remove mappings when devices
are hot unplugged.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

No comments on original patch.  Obvious missing function.  Cam has since
requested the same function for ivshmem.

 cpu-common.h |1 +
 exec.c   |   13 +
 2 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..9b763d0 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -49,6 +49,7 @@ ram_addr_t cpu_get_physical_page_desc(target_phys_addr_t 
addr);
 ram_addr_t qemu_ram_alloc_from_ptr(DeviceState *dev, const char *name,
 ram_addr_t size, void *host);
 ram_addr_t qemu_ram_alloc(DeviceState *dev, const char *name, ram_addr_t size);
+void qemu_ram_free_from_ptr(ram_addr_t addr);
 void qemu_ram_free(ram_addr_t addr);
 /* This should only be used for ram local to a device.  */
 void *qemu_get_ram_ptr(ram_addr_t addr);
diff --git a/exec.c b/exec.c
index a338495..eea7ea7 100644
--- a/exec.c
+++ b/exec.c
@@ -2875,6 +2875,19 @@ ram_addr_t qemu_ram_alloc(DeviceState *dev, const char 
*name, ram_addr_t size)
 return qemu_ram_alloc_from_ptr(dev, name, size, NULL);
 }
 
+void qemu_ram_free_from_ptr(ram_addr_t addr)
+{
+    RAMBlock *block;
+
+    QLIST_FOREACH(block, &ram_list.blocks, next) {
+        if (addr == block->offset) {
+            QLIST_REMOVE(block, next);
+            qemu_free(block);
+            return;
+        }
+    }
+}
+
 void qemu_ram_free(ram_addr_t addr)
 {
 RAMBlock *block;



[PATCH v4 1/2] Minimal RAM API support

2010-12-13 Thread Alex Williamson
This adds a minimum chunk of Anthony's RAM API support so that we
can identify actual VM RAM versus all the other things that make
use of qemu_ram_alloc.

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 Makefile.objs |1 +
 cpu-common.h  |2 +
 memory.c  |   94 +
 memory.h  |   44 +++
 4 files changed, 141 insertions(+), 0 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h

diff --git a/Makefile.objs b/Makefile.objs
index cebb945..47f3c3a 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -172,6 +172,7 @@ hw-obj-y += pci.o pci_bridge.o msix.o msi.o
 hw-obj-$(CONFIG_PCI) += pci_host.o pcie_host.o
 hw-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
 hw-obj-y += watchdog.o
+hw-obj-y += memory.o
 hw-obj-$(CONFIG_ISA_MMIO) += isa_mmio.o
 hw-obj-$(CONFIG_ECC) += ecc.o
 hw-obj-$(CONFIG_NAND) += nand.o
diff --git a/cpu-common.h b/cpu-common.h
index 6d4a898..f08f93b 100644
--- a/cpu-common.h
+++ b/cpu-common.h
@@ -29,6 +29,8 @@ enum device_endian {
 /* address in the RAM (different from a physical address) */
 typedef unsigned long ram_addr_t;
 
+#include "memory.h"
+
 /* memory API */
 
 typedef void CPUWriteMemoryFunc(void *opaque, target_phys_addr_t addr, 
uint32_t value);
diff --git a/memory.c b/memory.c
new file mode 100644
index 0000000..07cb020
--- /dev/null
+++ b/memory.c
@@ -0,0 +1,94 @@
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson alex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+#include "memory.h"
+#include "range.h"
+
+typedef struct RamSlot {
+    target_phys_addr_t start_addr;
+    ram_addr_t size;
+    ram_addr_t offset;
+    QLIST_ENTRY(RamSlot) next;
+} RamSlot;
+
+static QLIST_HEAD(ram_slot_list, RamSlot) ram_slot_list =
+    QLIST_HEAD_INITIALIZER(ram_slot_list);
+
+static RamSlot *ram_find_slot(target_phys_addr_t start_addr, ram_addr_t size)
+{
+    RamSlot *slot;
+
+    QLIST_FOREACH(slot, &ram_slot_list, next) {
+        if (slot->start_addr == start_addr && slot->size == size) {
+            return slot;
+        }
+
+        if (ranges_overlap(start_addr, size, slot->start_addr, slot->size)) {
+            hw_error("Ram range overlaps existing slot\n");
+        }
+    }
+
+    return NULL;
+}
+
+int ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+                 ram_addr_t phys_offset)
+{
+    RamSlot *slot;
+
+    if (!size) {
+        return -EINVAL;
+    }
+
+    assert(!ram_find_slot(start_addr, size));
+
+    slot = qemu_malloc(sizeof(RamSlot));
+
+    slot->start_addr = start_addr;
+    slot->size = size;
+    slot->offset = phys_offset;
+
+    QLIST_INSERT_HEAD(&ram_slot_list, slot, next);
+
+    cpu_register_physical_memory(slot->start_addr, slot->size, slot->offset);
+
+    return 0;
+}
+
+void ram_unregister(target_phys_addr_t start_addr, ram_addr_t size)
+{
+    RamSlot *slot;
+
+    if (!size) {
+        return;
+    }
+
+    slot = ram_find_slot(start_addr, size);
+    assert(slot != NULL);
+
+    QLIST_REMOVE(slot, next);
+    qemu_free(slot);
+    cpu_register_physical_memory(start_addr, size, IO_MEM_UNASSIGNED);
+}
+
+int ram_for_each_slot(void *opaque, ram_for_each_slot_fn fn)
+{
+    RamSlot *slot;
+
+    QLIST_FOREACH(slot, &ram_slot_list, next) {
+        int ret = fn(opaque, slot->start_addr, slot->size, slot->offset);
+        if (ret) {
+            return ret;
+        }
+    }
+    return 0;
+}
diff --git a/memory.h b/memory.h
new file mode 100644
index 0000000..98c85ea
--- /dev/null
+++ b/memory.h
@@ -0,0 +1,44 @@
+#ifndef QEMU_MEMORY_H
+#define QEMU_MEMORY_H
+/*
+ * RAM API
+ *
+ *  Copyright Red Hat, Inc. 2010
+ *
+ * Authors:
+ *  Alex Williamson alex.william...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2.  See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu-common.h"
+#include "cpu-common.h"
+
+typedef int (*ram_for_each_slot_fn)(void *opaque,
+                                    target_phys_addr_t start_addr,
+                                    ram_addr_t size,
+                                    ram_addr_t phys_offset);
+
+/**
+ * ram_register() : Register a region of guest physical memory
+ *
+ * The new region must not overlap an existing region.
+ */
+int ram_register(target_phys_addr_t start_addr, ram_addr_t size,
+ ram_addr_t phys_offset);
+
+/**
+ * ram_unregister() : Unregister a region of guest physical memory
+ */
+void ram_unregister(target_phys_addr_t start_addr, ram_addr_t size);
+
+/**
+ * ram_for_each_slot() : Call fn() on each registered region
+ *
+ * Stop on non-zero return from fn().
+ */
+int ram_for_each_slot(void *opaque, ram_for_each_slot_fn fn);
+
+#endif /* QEMU_MEMORY_H */
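
For readers following the thread, here is a minimal standalone sketch of the
register/unregister semantics described in memory.h: a zero size is rejected,
a new region must not overlap an existing one, and unregistering requires an
exact (start, size) match. It is not the patch's code. The demo_* names, the
fixed-size table, and the error returns are assumptions made for illustration;
the patch itself keeps a QLIST and calls hw_error() or assert() where this
sketch returns an errno value.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_MAX_SLOTS 8 /* arbitrary limit for this sketch */

struct demo_slot {
    uint64_t start;
    uint64_t size;
    uint64_t offset;
    int used;
};

static struct demo_slot demo_slots[DEMO_MAX_SLOTS];

/* Inclusive-end overlap test for two non-empty ranges. */
static int demo_ranges_overlap(uint64_t s1, uint64_t l1,
                               uint64_t s2, uint64_t l2)
{
    uint64_t e1 = s1 + l1 - 1;
    uint64_t e2 = s2 + l2 - 1;

    return !(e2 < s1 || e1 < s2);
}

/* Register a region; fails on zero size, overlap, or a full table. */
int demo_ram_register(uint64_t start, uint64_t size, uint64_t offset)
{
    int i, free_idx = -1;

    if (!size) {
        return -EINVAL;
    }
    for (i = 0; i < DEMO_MAX_SLOTS; i++) {
        if (!demo_slots[i].used) {
            if (free_idx < 0) {
                free_idx = i;
            }
            continue;
        }
        if (demo_ranges_overlap(start, size,
                                demo_slots[i].start, demo_slots[i].size)) {
            return -EEXIST; /* the patch calls hw_error() here */
        }
    }
    if (free_idx < 0) {
        return -ENOSPC;
    }
    assert(free_idx < DEMO_MAX_SLOTS);
    demo_slots[free_idx] = (struct demo_slot){ start, size, offset, 1 };
    return 0;
}

/* Unregister a region; only an exact (start, size) match is accepted. */
int demo_ram_unregister(uint64_t start, uint64_t size)
{
    int i;

    for (i = 0; i < DEMO_MAX_SLOTS; i++) {
        if (demo_slots[i].used && demo_slots[i].start == start &&
            demo_slots[i].size == size) {
            demo_slots[i].used = 0;
            return 0; /* the patch asserts that a match exists */
        }
    }
    return -ENOENT;
}
```

Returning errors instead of aborting is only a convenience for trying the
logic outside of QEMU; it is not a suggestion to change the patch.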


[PATCH v4 0/2] Minimal RAM API support

2010-12-13 Thread Alex Williamson
Updated per comments. Thanks,

Alex

v4:

 - ram_slot -> RamSlot (per CODING_STYLE)
 - drop qemu_ prefix from functions (per CODING_STYLE)
 - mallocz -> malloc
 - drop extraneous return from void function

v3:

 - Address review comments
 - pc registers all memory below 4G in one chunk

Let me know if there are any further issues.

v2:

 - Move to Makefile.objs
 - Move structures to memory.c and create a callback function
 - Fix memory leak

I haven't moved to the state parameter because there should only
be a single instance of this per VM.  The state parameter seems
like it would add complications in setup and function calling, but
maybe point me to an example if I'm off base.

v1:

For VFIO based device assignment, we need to know what guest memory
areas are actual RAM.  RAMBlocks have long since become a grab bag
of misc allocations, so aren't effective for this.  Anthony has had
a RAM API in mind for a while now that addresses this problem.  This
implements just enough of it so that we have an interface to get
actual guest memory physical addresses to setup the host IOMMU.  We
can continue building a full RAM API on top of this stub.
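
As an illustration of how a consumer such as an IOMMU setup path might walk
the registered RAM, here is a small standalone sketch of the callback pattern.
The callback signature mirrors the one in memory.h, but everything else is an
assumption: the demo_* names, the uint64_t stand-ins for target_phys_addr_t
and ram_addr_t, and the two-entry table (1 GiB below 4G, 1 GiB above) are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for target_phys_addr_t and ram_addr_t (assumed 64-bit here). */
typedef uint64_t demo_phys_addr_t;
typedef uint64_t demo_ram_addr_t;

typedef int (*demo_for_each_slot_fn)(void *opaque,
                                     demo_phys_addr_t start_addr,
                                     demo_ram_addr_t size,
                                     demo_ram_addr_t phys_offset);

struct demo_slot {
    demo_phys_addr_t start_addr;
    demo_ram_addr_t size;
    demo_ram_addr_t offset;
};

/* Hypothetical layout: 1 GiB below 4G and 1 GiB above 4G. */
static const struct demo_slot demo_slots[] = {
    { 0x0ULL,         0x40000000ULL, 0x0ULL },
    { 0x100000000ULL, 0x40000000ULL, 0x40000000ULL },
};

/* Call fn() on each slot; stop on the first non-zero return. */
int demo_for_each_slot(void *opaque, demo_for_each_slot_fn fn)
{
    size_t i;

    assert(fn != NULL);
    for (i = 0; i < sizeof(demo_slots) / sizeof(demo_slots[0]); i++) {
        int ret = fn(opaque, demo_slots[i].start_addr,
                     demo_slots[i].size, demo_slots[i].offset);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

/* Callback that accumulates the total size of all slots in *opaque. */
int demo_count_ram(void *opaque, demo_phys_addr_t start_addr,
                   demo_ram_addr_t size, demo_ram_addr_t phys_offset)
{
    uint64_t *total = opaque;

    (void)start_addr;
    (void)phys_offset;
    *total += size;
    return 0;
}

/* Callback that aborts the walk at the first slot at or above 4G. */
int demo_stop_above_4g(void *opaque, demo_phys_addr_t start_addr,
                       demo_ram_addr_t size, demo_ram_addr_t phys_offset)
{
    (void)opaque;
    (void)size;
    (void)phys_offset;
    return start_addr >= 0x100000000ULL;
}

/* Convenience wrapper: total bytes of registered RAM in the demo table. */
uint64_t demo_total_ram(void)
{
    uint64_t total = 0;

    demo_for_each_slot(&total, demo_count_ram);
    return total;
}
```

A real user would replace demo_count_ram() with a callback that maps each
range into the IOMMU and returns non-zero on failure to stop the walk.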

Anthony, feel free to add copyright to memory.c as it's based on
your initial implementation.  I had to add something since the file
in your branch just copies a header with Fabrice's copyright.

---

Alex Williamson (2):
  RAM API: Make use of it for x86 PC
  Minimal RAM API support


 Makefile.objs |1 +
 cpu-common.h  |2 +
 hw/pc.c   |9 ++---
 memory.c  |   94 +
 memory.h  |   44 +++
 5 files changed, 144 insertions(+), 6 deletions(-)
 create mode 100644 memory.c
 create mode 100644 memory.h


[PATCH v4 2/2] RAM API: Make use of it for x86 PC

2010-12-13 Thread Alex Williamson
Register the actual VM RAM using the new API

Signed-off-by: Alex Williamson alex.william...@redhat.com
---

 hw/pc.c |9 +++--
 1 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/hw/pc.c b/hw/pc.c
index e1b2667..87adca2 100644
--- a/hw/pc.c
+++ b/hw/pc.c
@@ -913,14 +913,11 @@ void pc_memory_init(ram_addr_t ram_size,
     /* allocate RAM */
     ram_addr = qemu_ram_alloc(NULL, "pc.ram",
                               below_4g_mem_size + above_4g_mem_size);
-    cpu_register_physical_memory(0, 0xa0000, ram_addr);
-    cpu_register_physical_memory(0x100000,
-                                 below_4g_mem_size - 0x100000,
-                                 ram_addr + 0x100000);
+    ram_register(0, below_4g_mem_size, ram_addr);
 #if TARGET_PHYS_ADDR_BITS > 32
     if (above_4g_mem_size > 0) {
-        cpu_register_physical_memory(0x100000000ULL, above_4g_mem_size,
-                                     ram_addr + below_4g_mem_size);
+        ram_register(0x100000000ULL, above_4g_mem_size,
+                     ram_addr + below_4g_mem_size);
     }
 #endif
 



[PATCH] RFC: delay pci_update_mappings for 64-bit BARs

2010-12-13 Thread Cam Macdonell
Do not call pci_update_mappings on the lower 32 bits of a 64-bit BAR.  Wait for
the upper 32 bits, or else QEMU will try to map using just the lower 32, which
will probably corrupt memory.

I was encountering crashes when mapping certain PCI region sizes.  The problem
turns out to be that pci_update_mappings is called before all 64 bits of the
BAR have been written.  For example, when mapping to 0x180000000, the remapping
happened as soon as the lower 32 bits were written (mapping to 0x80000000),
which would overwrite something.

I'm not certain this is completely correct; I'm simply testing that the lower
4 bits equal the MEM_TYPE_64 flag.  The upper 32-bit address parts can be
values like 0xff, which is tricky to test against.
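
To make the failure mode concrete, here is a small standalone sketch of a
64-bit BAR being programmed with two 32-bit writes. It only models the
register pair; the demo_* names and the example address are assumptions for
illustration, and DEMO_BAR_MEM_TYPE_64 uses the same value (0x04) as
PCI_BASE_ADDRESS_MEM_TYPE_64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BAR_MEM_TYPE_64 0x04u /* same value as PCI_BASE_ADDRESS_MEM_TYPE_64 */
#define DEMO_BAR_FLAG_MASK   0x0fu /* low four bits of a memory BAR are flags */

/* The two consecutive 32-bit config registers that form one 64-bit BAR. */
struct demo_bar64 {
    uint32_t lo;
    uint32_t hi;
};

/* Address currently encoded in the register pair, with flag bits stripped. */
uint64_t demo_bar64_addr(const struct demo_bar64 *bar)
{
    assert(bar != NULL);
    return ((uint64_t)bar->hi << 32) | (bar->lo & ~(uint32_t)DEMO_BAR_FLAG_MASK);
}

/* Guest writes the lower half; the upper half still holds its old value. */
void demo_bar64_write_lo(struct demo_bar64 *bar, uint32_t val)
{
    assert(bar != NULL);
    bar->lo = val;
}

/* Guest writes the upper half, completing the 64-bit address. */
void demo_bar64_write_hi(struct demo_bar64 *bar, uint32_t val)
{
    assert(bar != NULL);
    bar->hi = val;
}

/* The check used in the patch: low nibble equals the 64-bit type flag. */
int demo_write_is_64(uint32_t val)
{
    return (val & DEMO_BAR_FLAG_MASK) == DEMO_BAR_MEM_TYPE_64;
}
```

Starting from an unprogrammed BAR, writing the lower half 0x80000004 leaves
demo_bar64_addr() at 0x80000000 until the upper half is written with 1, at
which point it becomes 0x180000000. Any remapping triggered between the two
writes would use the intermediate address.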

Cam
---
 hw/pci.c |5 -
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/hw/pci.c b/hw/pci.c
index 438c0d1..3b81792 100644
--- a/hw/pci.c
+++ b/hw/pci.c
@@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
 {
     int i, was_irq_disabled = pci_irq_disabled(d);
     uint32_t config_size = pci_config_size(d);
+    int is_64 = 0;
+
+    is_64 = ((val & 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64);
 
     for (i = 0; i < l && addr + i < config_size; val >>= 8, ++i) {
         uint8_t wmask = d->wmask[addr + i];
@@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
         d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
         d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
     }
-    if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
+    if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) && (!is_64)) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
         ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
         range_covers_byte(addr, l, PCI_COMMAND))
-- 
1.7.0.4



Re: [PATCH v2 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-13 Thread Jan Kiszka
Am 13.12.2010 11:10, Michael S. Tsirkin wrote:
 On Sun, Dec 12, 2010 at 12:22:40PM +0100, Jan Kiszka wrote:
 The result may look simpler on first glance than v1, but it comes with
 more subtle race scenarios IMO. I thought them through, hopefully
 catching all, but I would appreciate any skeptical review.
 
 Thought about the races till my head hurt, and yes, they
 all seem to be handled correctly. FWIW

Ouch, I'm endlessly sorry for causing this pain.

 
 Reviewed-by: Michael S. Tsirkin m...@redhat.com
 

Thanks!

Jan





[PATCH v3 0/4] KVM genirq: Enable adaptive IRQ sharing for passed-through devices

2010-12-13 Thread Jan Kiszka
This addresses the review comments of the previous round:
 - renamed irq_data::status to drv_status
 - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED
 - fixed signature of get_irq_status (irq is now unsigned int)
 - converted register_lock into a global one
 - fixed critical white space breakage (that I just left in to check if
   anyone is actually reading the code, of course...)

Note: The KVM patch still depends on
http://thread.gmane.org/gmane.comp.emulators.kvm.devel/64515

Thanks for all comments!

Final but critical question: Who will pick up which bits?

Jan Kiszka (4):
  genirq: Introduce driver-readable IRQ status word
  genirq: Inform handler about line sharing state
  genirq: Add support for IRQF_COND_ONESHOT
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 Documentation/kvm/api.txt |   27 
 arch/x86/kvm/x86.c|1 +
 include/linux/interrupt.h |   15 ++
 include/linux/irq.h   |2 +
 include/linux/kvm.h   |6 +
 include/linux/kvm_host.h  |   10 ++-
 kernel/irq/manage.c   |   77 ++-
 virt/kvm/assigned-dev.c   |  336 -
 8 files changed, 436 insertions(+), 38 deletions(-)



[PATCH v3 1/4] genirq: Introduce driver-readable IRQ status word

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This associates a status word with every IRQ descriptor. Drivers can obtain
its content via get_irq_status(irq). The first use case will be propagating
the interrupt sharing state.

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/interrupt.h |2 ++
 include/linux/irq.h   |2 ++
 kernel/irq/manage.c   |   15 +++
 3 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..4c1aa72 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -126,6 +126,8 @@ struct irqaction {
 
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
+extern unsigned long get_irq_status(unsigned int irq);
+
 #ifdef CONFIG_GENERIC_HARDIRQS
 extern int __must_check
 request_threaded_irq(unsigned int irq, irq_handler_t handler,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index abde252..8bdb421 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -96,6 +96,7 @@ struct msi_desc;
  * methods, to allow shared chip implementations
  * @msi_desc:  MSI descriptor
  * @affinity:  IRQ affinity on SMP
+ * @drv_status:driver-readable status flags (IRQS_*)
  *
  * The fields here need to overlay the ones in irq_desc until we
  * cleaned up the direct references and switched everything over to
@@ -111,6 +112,7 @@ struct irq_data {
 #ifdef CONFIG_SMP
cpumask_var_t   affinity;
 #endif
+   unsigned long   drv_status;
 };
 
 /**
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5f92acc..2ea0d30 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1157,3 +1157,18 @@ int request_any_context_irq(unsigned int irq, 
irq_handler_t handler,
return !ret ? IRQC_IS_HARDIRQ : ret;
 }
 EXPORT_SYMBOL_GPL(request_any_context_irq);
+
+/**
+ * get_irq_status - read interrupt line status word
+ * @irq: Interrupt line of the status word
+ *
+ * This returns the current content of the status word associated with
+ * the given interrupt line. See IRQS_* flags for details.
+ */
+unsigned long get_irq_status(unsigned int irq)
+{
+	struct irq_desc *desc = irq_to_desc(irq);
+
+	return desc ? desc->irq_data.drv_status : 0;
+}
+EXPORT_SYMBOL_GPL(get_irq_status);
-- 
1.7.1



[PATCH v3 4/4] KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

PCI 2.3 allows IRQ sources to be disabled generically at device level. This
enables us to share IRQs of such devices on the host side when passing
them to a guest.

However, IRQ disabling via the PCI config space is more costly than
masking the line via disable_irq. Therefore we register the IRQ in adaptive
mode and switch between line and device level disabling on demand.

This feature is optional. User space has to request it explicitly, as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
masked it via the PCI config space.
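
The switching described above can be pictured with the following standalone
sketch. It is inferred from this description only and is not the code in
virt/kvm/assigned-dev.c. The three states mirror enum kvm_intx_state from the
patch, while the demo_* names, the struct, and both helper functions are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the three states of enum kvm_intx_state in the patch. */
enum demo_intx_state {
    DEMO_INTX_ENABLED,
    DEMO_INTX_LINE_DISABLED,   /* masked via disable_irq on the host line */
    DEMO_INTX_DEVICE_DISABLED, /* masked via the PCI 2.3 INTx disable bit */
};

struct demo_assigned_dev {
    enum demo_intx_state state;
    bool guest_masked; /* guest's view of PCI_COMMAND_INTX_DISABLE */
};

/*
 * Host interrupt fired: mask the source. Line-level masking is cheaper,
 * but it would also silence other devices if the line is shared, so the
 * device-level mask is used in that case.
 */
void demo_intx_mask(struct demo_assigned_dev *dev, bool line_shared)
{
    assert(dev != NULL);
    dev->state = line_shared ? DEMO_INTX_DEVICE_DISABLED
                             : DEMO_INTX_LINE_DISABLED;
}

/*
 * Guest acknowledged the interrupt: unmask again, unless the guest itself
 * has INTx masked. Returns true if the source was re-enabled.
 */
bool demo_intx_unmask(struct demo_assigned_dev *dev)
{
    assert(dev != NULL);
    if (dev->guest_masked) {
        return false;
    }
    dev->state = DEMO_INTX_ENABLED;
    return true;
}
```

The sketch leaves out locking and the transition between exclusive and shared
mode, which is where the race scenarios discussed in this thread come from.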

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 Documentation/kvm/api.txt |   27 
 arch/x86/kvm/x86.c|1 +
 include/linux/kvm.h   |6 +
 include/linux/kvm_host.h  |   10 ++-
 virt/kvm/assigned-dev.c   |  336 -
 5 files changed, 346 insertions(+), 34 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..1c34e25 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
 
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
 
 4.48 KVM_DEASSIGN_PCI_DEVICE
 
@@ -1263,6 +1271,25 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
 };
 
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct kvm_assigned_pci_dev (in)
+Returns: 0 on success, -1 on error
+
+Informs the kernel about the guest's view on the INTx mask. As long as the
+guest masks the legacy INTx, the kernel will refrain from unmasking it at
+hardware level and will not assert the guest's IRQ line. User space is still
+responsible for applying this state to the assigned device's real config space.
+To avoid that the kernel overwrites the state user space wants to set,
+KVM_ASSIGN_SET_INTX_MASK has to be called prior to updating the config space.
+
+See KVM_ASSIGN_DEV_IRQ for the data structure. The target device is specified
+by assigned_dev_id. In the flags field, only KVM_DEV_ASSIGN_MASK_INTX is
+evaluated.
+
 5. The kvm_run structure
 
 Application code obtains a pointer to the kvm_run structure by
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed373ba..8775a54 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1965,6 +1965,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_X86_ROBUST_SINGLESTEP:
case KVM_CAP_XSAVE:
case KVM_CAP_ASYNC_PF:
+   case KVM_CAP_PCI_2_3:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index ea2dc1a..3cadb42 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -541,6 +541,7 @@ struct kvm_ppc_pvinfo {
 #define KVM_CAP_PPC_GET_PVINFO 57
 #define KVM_CAP_PPC_IRQ_LEVEL 58
 #define KVM_CAP_ASYNC_PF 59
+#define KVM_CAP_PCI_2_3 60
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -677,6 +678,9 @@ struct kvm_clock_data {
 #define KVM_SET_PIT2  _IOW(KVMIO,  0xa0, struct kvm_pit_state2)
 /* Available with KVM_CAP_PPC_GET_PVINFO */
 #define KVM_PPC_GET_PVINFO   _IOW(KVMIO,  0xa1, struct kvm_ppc_pvinfo)
+/* Available with KVM_CAP_PCI_2_3 */
+#define KVM_ASSIGN_SET_INTX_MASK  _IOW(KVMIO,  0xa2, \
+  struct kvm_assigned_pci_dev)
 
 /*
  * ioctls for vcpu fds
@@ -742,6 +746,8 @@ struct kvm_clock_data {
 #define KVM_SET_XCRS _IOW(KVMIO,  0xa7, struct kvm_xcrs)
 
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
 
 struct kvm_assigned_pci_dev {
__u32 assigned_dev_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ac4e83a..4f95070 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -477,6 +477,12 @@ struct kvm_irq_ack_notifier {
void (*irq_acked)(struct kvm_irq_ack_notifier *kian);
 };
 
+enum kvm_intx_state {
+   KVM_INTX_ENABLED,
+   KVM_INTX_LINE_DISABLED,
+   KVM_INTX_DEVICE_DISABLED,
+};
+
 struct kvm_assigned_dev_kernel {
struct kvm_irq_ack_notifier ack_notifier;
struct list_head list;
@@ -486,7 +492,7 @@ struct kvm_assigned_dev_kernel {
int host_devfn;
unsigned int entries_nr;
int host_irq;
-   bool host_irq_disabled;
+   unsigned long 

[PATCH v3 2/4] genirq: Inform handler about line sharing state

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

This enables interrupt handlers to retrieve the current line sharing state via
the new interrupt status word so that they can adapt to it.

The switch from shared to exclusive is generally uncritical and can thus be
performed on demand. However, preparing a line for shared mode may require
preparatory steps by the currently registered handler. It can therefore
request an ahead-of-time notification via IRQF_ADAPTIVE. The notification
consists of an exceptional handler invocation with IRQS_MAKE_SHAREABLE set in
the status word.
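
The bookkeeping behind the shared flag can be modelled in user space as
follows. This is a simplified sketch under stated assumptions: the demo_*
names are hypothetical, the handler list is reduced to a counter, and the
IRQS_MAKE_SHAREABLE notification and all locking are left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_IRQS_SHARED 0x1UL /* placeholder bit for this sketch */

/* Reduced model of an IRQ descriptor: handler count plus status word. */
struct demo_irq_line {
    unsigned int nr_handlers;
    unsigned long drv_status;
};

/* A second or later handler on the same line marks it as shared. */
void demo_add_handler(struct demo_irq_line *line)
{
    assert(line != NULL);
    line->nr_handlers++;
    if (line->nr_handlers > 1) {
        line->drv_status |= DEMO_IRQS_SHARED;
    }
}

/* Dropping back to a single handler makes the line exclusive again. */
void demo_remove_handler(struct demo_irq_line *line)
{
    assert(line != NULL);
    if (!line->nr_handlers) {
        return;
    }
    line->nr_handlers--;
    if (line->nr_handlers <= 1) {
        line->drv_status &= ~DEMO_IRQS_SHARED;
    }
}

/* Like get_irq_status(): an unknown line reads as 0. */
unsigned long demo_get_irq_status(const struct demo_irq_line *line)
{
    return line ? line->drv_status : 0;
}
```

A handler would test the shared bit on each invocation and pick its masking
strategy accordingly.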

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/interrupt.h |   10 +
 kernel/irq/manage.c   |   47 ++--
 2 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 4c1aa72..12e5fc0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -55,6 +55,7 @@
  *Used by threaded interrupts which need to keep the
  *irq line disabled until the threaded handler has been run.
  * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
+ * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
  *
  */
 #define IRQF_DISABLED		0x00000020
@@ -67,6 +68,7 @@
 #define IRQF_IRQPOLL		0x00001000
 #define IRQF_ONESHOT		0x00002000
 #define IRQF_NO_SUSPEND		0x00004000
+#define IRQF_ADAPTIVE		0x00008000
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
 
@@ -126,6 +128,14 @@ struct irqaction {
 
 extern irqreturn_t no_action(int cpl, void *dev_id);
 
+/*
+ * Driver-readable IRQ line status flags:
+ * IRQS_SHARED - line is shared between multiple handlers
+ * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable
+ */
+#define IRQS_SHARED		0x00000001
+#define IRQS_MAKE_SHAREABLE	0x00000002
+
 extern unsigned long get_irq_status(unsigned int irq);
 
 #ifdef CONFIG_GENERIC_HARDIRQS
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2ea0d30..2dd4eef 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -14,9 +14,12 @@
 #include <linux/interrupt.h>
 #include <linux/slab.h>
 #include <linux/sched.h>
+#include <linux/mutex.h>
 
 #include "internals.h"
 
+static DEFINE_MUTEX(register_lock);
+
 /**
  * synchronize_irq - wait for pending IRQ handlers (on other CPUs)
  * @irq: interrupt number to wait for
@@ -754,6 +757,8 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 			old = *old_ptr;
 		} while (old);
 		shared = 1;
+
+		desc->irq_data.drv_status |= IRQS_SHARED;
 	}
 
if (!shared) {
@@ -883,6 +888,7 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 {
 	struct irq_desc *desc = irq_to_desc(irq);
 	struct irqaction *action, **action_ptr;
+	bool single_handler = false;
 	unsigned long flags;
 
 	WARN(in_interrupt(), "Trying to free IRQ %d from IRQ context!\n", irq);
@@ -928,7 +934,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->irq_data.chip->irq_shutdown(&desc->irq_data);
 		else
 			desc->irq_data.chip->irq_disable(&desc->irq_data);
-	}
+	} else if (!desc->action->next)
+		single_handler = true;
 
 #ifdef CONFIG_SMP
/* make sure affinity_hint is cleaned up */
@@ -943,6 +950,9 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 	/* Make sure it's not being used on another CPU: */
 	synchronize_irq(irq);
 
+	if (single_handler)
+		desc->irq_data.drv_status &= ~IRQS_SHARED;
+
 #ifdef CONFIG_DEBUG_SHIRQ
/*
 * It's a shared IRQ -- the driver ought to be prepared for an IRQ
@@ -1002,9 +1012,13 @@ void free_irq(unsigned int irq, void *dev_id)
if (!desc)
return;
 
+	mutex_lock(&register_lock);
+
 	chip_bus_lock(desc);
 	kfree(__free_irq(irq, dev_id));
 	chip_bus_sync_unlock(desc);
+
+	mutex_unlock(&register_lock);
 }
 EXPORT_SYMBOL(free_irq);
 
@@ -1055,7 +1069,7 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 			 irq_handler_t thread_fn, unsigned long irqflags,
 			 const char *devname, void *dev_id)
 {
-	struct irqaction *action;
+	struct irqaction *action, *old_action;
 	struct irq_desc *desc;
 	int retval;
 
@@ -1091,12 +1105,39 @@ int request_threaded_irq(unsigned int irq, irq_handler_t handler,
 	action->name = devname;
 	action->dev_id = dev_id;
 
+	mutex_lock(&register_lock);
+
+	old_action = desc->action;
+	if (old_action && (old_action->flags & IRQF_ADAPTIVE) &&
+	    !(desc->irq_data.drv_status & IRQS_SHARED)) {
+		/*
+* Signal the old 

[PATCH v3 3/4] genirq: Add support for IRQF_COND_ONESHOT

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Provide an adaptive version of IRQF_ONESHOT: If the line is exclusively used,
IRQF_COND_ONESHOT provides the same semantics as IRQF_ONESHOT. If it is
shared, the line will be unmasked directly after the hardirq handler, just as
if IRQF_COND_ONESHOT was not provided.
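
As a rough illustration, the rule described above can be written as a small
decision function. This is a sketch under stated assumptions, not kernel
code: the DEMO_* flag values and keep_masked_after_hardirq() are names
invented for this example.

```c
#include <stdbool.h>

/* Illustrative flag values, not the kernel's. */
#define DEMO_IRQF_ONESHOT       0x1u
#define DEMO_IRQF_COND_ONESHOT  0x2u

/*
 * Should the line stay masked after the hardirq handler, until the
 * threaded handler has run?
 */
bool keep_masked_after_hardirq(unsigned int flags, bool line_shared)
{
    if (flags & DEMO_IRQF_ONESHOT)
        return true;            /* unconditional oneshot */
    if (flags & DEMO_IRQF_COND_ONESHOT)
        return !line_shared;    /* oneshot only while exclusive */
    return false;               /* unmask right after the hardirq handler */
}
```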

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 include/linux/interrupt.h |3 +++
 kernel/irq/manage.c   |   19 ---
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 12e5fc0..bbb16f4 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -56,6 +56,8 @@
  *irq line disabled until the threaded handler has been run.
  * IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
  * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
+ * IRQF_COND_ONESHOT - If line is not shared, keep interrupt disabled after
+ * hardirq handler finished.
  *
  */
 #define IRQF_DISABLED  0x0020
@@ -69,6 +71,7 @@
 #define IRQF_ONESHOT   0x2000
 #define IRQF_NO_SUSPEND0x4000
 #define IRQF_ADAPTIVE  0x8000
+#define IRQF_COND_ONESHOT  0x00010000
 
 #define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
 
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2dd4eef..9a73633 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -583,7 +583,7 @@ static int irq_thread(void *data)
 	struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
 	struct irqaction *action = data;
 	struct irq_desc *desc = irq_to_desc(action->irq);
-	int wake, oneshot = desc->status & IRQ_ONESHOT;
+	int wake, oneshot;
 
 	sched_setscheduler(current, SCHED_FIFO, &param);
 	current->irqaction = action;
@@ -606,6 +606,7 @@ static int irq_thread(void *data)
 			desc->status |= IRQ_PENDING;
 			raw_spin_unlock_irq(&desc->lock);
 		} else {
+			oneshot = desc->status & IRQ_ONESHOT;
 			raw_spin_unlock_irq(&desc->lock);
 
 			action->thread_fn(action->irq, action->dev_id);
@@ -759,6 +760,15 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		shared = 1;
 
 		desc->irq_data.drv_status |= IRQS_SHARED;
+		desc->status &= ~IRQ_ONESHOT;
+
+		/* Unmask if the interrupt was masked due to oneshot mode. */
+		if ((desc->status &
+		     (IRQ_INPROGRESS | IRQ_DISABLED | IRQ_MASKED)) ==
+		    IRQ_MASKED) {
+			desc->irq_data.chip->irq_unmask(&desc->irq_data);
+			desc->status &= ~IRQ_MASKED;
+		}
 	}
 
if (!shared) {
@@ -783,7 +793,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
 		desc->status &= ~(IRQ_AUTODETECT | IRQ_WAITING | IRQ_ONESHOT |
 				  IRQ_INPROGRESS | IRQ_SPURIOUS_DISABLED);
 
-		if (new->flags & IRQF_ONESHOT)
+		if (new->flags & (IRQF_ONESHOT | IRQF_COND_ONESHOT))
 			desc->status |= IRQ_ONESHOT;
 
 		if (!(desc->status & IRQ_NOAUTOEN)) {
@@ -934,8 +944,11 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)
 			desc->irq_data.chip->irq_shutdown(&desc->irq_data);
 		else
 			desc->irq_data.chip->irq_disable(&desc->irq_data);
-	} else if (!desc->action->next)
+	} else if (!desc->action->next) {
 		single_handler = true;
+		if (desc->action->flags & IRQF_COND_ONESHOT)
+			desc->status |= IRQ_ONESHOT;
+	}
 
 #ifdef CONFIG_SMP
/* make sure affinity_hint is cleaned up */
-- 
1.7.1

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/5] pci-assign: Host IRQ sharing support + some fixes and cleanups

2010-12-13 Thread Jan Kiszka
This series includes cleanups of the PCI config access of assigned
devices, fixes a corner case in this area, removes that suspicious VGA
hunk from assigned_dev_pci_read_config, and finally enables support for
the latest host IRQ sharing support via PCI-2.3 interrupt masking. See
the patches for details.

Jan Kiszka (5):
  pci-assign: Clean up assigned_dev_pci_read/write_config
  pci-assign: Fix dword read at PCI_COMMAND
  pci-assign: Remove suspicious hunk from assigned_dev_pci_read_config
  pci-assign: Convert need_emulate_cmd into a bitmask
  pci-assign: Use PCI-2.3-based shared legacy interrupts

 hw/device-assignment.c |  100 ---
 hw/device-assignment.h |2 +-
 qemu-kvm.c |8 
 qemu-kvm.h |3 +
 4 files changed, 88 insertions(+), 25 deletions(-)



[PATCH 1/5] pci-assign: Clean up assigned_dev_pci_read/write_config

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Use ranges_overlap and proper constants to match the access range against
regions that need special handling. This also fixes the so far uncaught
high-byte write access to the command register. Moreover, use more
constants instead of magic numbers.
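
To make the high-byte case concrete, here is a minimal stand-alone sketch.
demo_ranges_overlap() is an illustrative re-implementation written for this
example (it is not copied from the QEMU helper and assumes non-zero
lengths), and demo_old_check()/demo_new_check() are hypothetical names for
the two conditions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PCI_COMMAND 0x04   /* offset of the 16-bit command register */

/* True if [first1, first1+len1) and [first2, first2+len2) intersect. */
bool demo_ranges_overlap(uint64_t first1, uint64_t len1,
                         uint64_t first2, uint64_t len2)
{
    uint64_t last1 = first1 + len1 - 1;
    uint64_t last2 = first2 + len2 - 1;

    return !(last2 < first1 || last1 < first2);
}

/* Old condition: only matches accesses that start exactly at offset 4. */
bool demo_old_check(uint32_t address)
{
    return address == 0x4;
}

/* New condition: matches any access touching one of the two command bytes. */
bool demo_new_check(uint32_t address, int len)
{
    return demo_ranges_overlap(address, (uint64_t)len, DEMO_PCI_COMMAND, 2);
}
```

A one-byte write to offset 5 (the high byte of the command register) is
missed by the old comparison but caught by the range test.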

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |   39 +--
 1 files changed, 29 insertions(+), 10 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 50c6408..bc3a57b 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -438,13 +438,20 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
         return assigned_device_pci_cap_write_config(d, address, val, len);
     }
 
-    if (address == 0x4) {
+    if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
         pci_default_write_config(d, address, val, len);
         /* Continue to program the card */
     }
 
-    if ((address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d) {
+    /*
+     * Catch access to
+     *  - base address registers
+     *  - ROM base address & capability pointer
+     *  - interrupt line & pin
+     */
+    if (ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
+        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
+        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
         /* used for update-mappings (BAR emulation) */
         pci_default_write_config(d, address, val, len);
         return;
@@ -484,9 +491,20 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
         return val;
     }
 
-    if (address < 0x4 || (pci_dev->need_emulate_cmd && address == 0x4) ||
-        (address >= 0x10 && address <= 0x24) || address == 0x30 ||
-        address == 0x34 || address == 0x3c || address == 0x3d) {
+    /*
+     * Catch access to
+     *  - vendor & device ID
+     *  - command register (if emulation needed)
+     *  - base address registers
+     *  - ROM base address & capability pointer
+     *  - interrupt line & pin
+     */
+    if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
+        (pci_dev->need_emulate_cmd &&
+         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
+        ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
+        ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
+        ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
         val = pci_default_read_config(d, address, len);
         DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
               (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
@@ -517,10 +535,11 @@ do_log:
 
     if (!pci_dev->cap.available) {
         /* kill the special capabilities */
-        if (address == 4 && len == 4)
-            val &= ~0x100000;
-        else if (address == 6)
-            val &= ~0x10;
+        if (address == PCI_COMMAND && len == 4) {
+            val &= ~(PCI_STATUS_CAP_LIST << 16);
+        } else if (address == PCI_STATUS) {
+            val &= ~PCI_STATUS_CAP_LIST;
+        }
     }
 
     return val;
-- 
1.7.1



[PATCH 2/5] pci-assign: Fix dword read at PCI_COMMAND

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

If we emulate the command register, we must only read its content from
the shadow config space. For a dword read covering both PCI_COMMAND and
PCI_STATUS, at least the latter must be read from the device.

For simplicity, and as the code path is not considered performance
critical for the affected SRIOV devices, the fix performs device access
to the command word unconditionally, even if emulation is enabled and
only that word is read.
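
The dword case can be pictured with a small stand-alone helper. This is a
sketch, not the patch code: demo_merge_command_dword() is a hypothetical
name, and it assumes the usual PCI header layout with the command word in
the low 16 bits of the dword at offset 4 and the status word in the high
16 bits.

```c
#include <stdint.h>

/*
 * Combine the result of a dword read at PCI_COMMAND:
 *  - status (upper 16 bits) is taken from the value read from the device
 *  - command (lower 16 bits) is taken from the shadow config space
 */
uint32_t demo_merge_command_dword(uint32_t dev_val, uint16_t shadow_cmd)
{
    return (dev_val & 0xffff0000u) | shadow_cmd;
}
```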

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |   14 +++---
 1 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index bc3a57b..6ff1456 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -494,14 +494,11 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice *d, uint32_t address,
     /*
      * Catch access to
      *  - vendor & device ID
-     *  - command register (if emulation needed)
      *  - base address registers
      *  - ROM base address & capability pointer
      *  - interrupt line & pin
      */
     if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
-        (pci_dev->need_emulate_cmd &&
-         ranges_overlap(address, len, PCI_COMMAND, 2)) ||
         ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
         ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
         ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
@@ -533,6 +530,17 @@ do_log:
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
 
+    if (pci_dev->need_emulate_cmd &&
+        ranges_overlap(address, len, PCI_COMMAND, 2)) {
+        if (address == PCI_COMMAND) {
+            val &= 0xffff0000;
+            val |= pci_default_read_config(d, PCI_COMMAND, 2);
+        } else {
+            /* high-byte access */
+            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
+        }
+    }
+
     if (!pci_dev->cap.available) {
         /* kill the special capabilities */
         if (address == PCI_COMMAND && len == 4) {
-- 
1.7.1



[PATCH 4/5] pci-assign: Convert need_emulate_cmd into a bitmask

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Define a mask of PCI command register bits that need to be emulated,
i.e. read back from their shadow state. We will need this for
selectively emulating the INTx mask bit.

Note: No initialization of emulate_cmd_mask to zero needed, the device
state is already zero-initialized.
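
As a sketch of what "read back from their shadow state" means for a
bitmask, the following stand-alone helpers merge device and shadow values
bit by bit. The function names are hypothetical and this is not the patch
code; only the masking idea is taken from the description above.

```c
#include <stdint.h>

/*
 * Word access: bits set in emulate_mask come from the shadow command
 * register, all other bits from the value read from the device.
 */
uint16_t demo_read_command(uint16_t dev_cmd, uint16_t shadow_cmd,
                           uint16_t emulate_mask)
{
    return (uint16_t)((dev_cmd & ~emulate_mask) | (shadow_cmd & emulate_mask));
}

/* High-byte access: same merge, using the upper 8 bits of the mask. */
uint8_t demo_read_command_high(uint8_t dev_hi, uint8_t shadow_hi,
                               uint16_t emulate_mask)
{
    uint8_t mask_hi = (uint8_t)(emulate_mask >> 8);

    return (uint8_t)((dev_hi & ~mask_hi) | (shadow_hi & mask_hi));
}
```

With a mask of zero the device value passes through unchanged, which is why
a zero-initialized mask needs no further setup.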

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |   18 ++
 hw/device-assignment.h |2 +-
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index ef045f4..26d3bd7 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -525,14 +525,17 @@ again:
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           (d->devfn >> 3) & 0x1F, (d->devfn & 0x7), address, val, len);
 
-    if (pci_dev->need_emulate_cmd &&
+    if (pci_dev->emulate_cmd_mask &&
         ranges_overlap(address, len, PCI_COMMAND, 2)) {
         if (address == PCI_COMMAND) {
-            val &= 0xffff0000;
-            val |= pci_default_read_config(d, PCI_COMMAND, 2);
+            val &= ~pci_dev->emulate_cmd_mask;
+            val |= pci_default_read_config(d, PCI_COMMAND, 2) &
+                pci_dev->emulate_cmd_mask;
         } else {
             /* high-byte access */
-            val = pci_default_read_config(d, PCI_COMMAND+1, 1);
+            val &= ~(pci_dev->emulate_cmd_mask >> 8);
+            val |= pci_default_read_config(d, PCI_COMMAND+1, 1) &
+                (pci_dev->emulate_cmd_mask >> 8);
         }
     }
 
@@ -800,10 +803,9 @@ again:
 
     /* dealing with virtual function device */
     snprintf(name, sizeof(name), "%sphysfn/", dir);
-    if (!stat(name, &statbuf))
-       pci_dev->need_emulate_cmd = 1;
-    else
-       pci_dev->need_emulate_cmd = 0;
+    if (!stat(name, &statbuf)) {
+        pci_dev->emulate_cmd_mask = 0xffff;
+    }
 
 dev-region_number = r;
 return 0;
diff --git a/hw/device-assignment.h b/hw/device-assignment.h
index c94a730..9ead022 100644
--- a/hw/device-assignment.h
+++ b/hw/device-assignment.h
@@ -109,7 +109,7 @@ typedef struct AssignedDevice {
 void *msix_table_page;
 target_phys_addr_t msix_table_addr;
 int mmio_index;
-int need_emulate_cmd;
+uint32_t emulate_cmd_mask;
 char *configfd_name;
 QLIST_ENTRY(AssignedDevice) next;
 } AssignedDevice;
-- 
1.7.1



[PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts

2010-12-13 Thread Jan Kiszka
From: Jan Kiszka jan.kis...@siemens.com

Enable the new KVM feature that allows legacy interrupt sharing for
PCI-2.3-compliant devices. This requires synchronizing any guest
change of the INTx mask bit to the kernel.
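
The part that decides whether a guest write changes the INTx mask bit can
be sketched on its own. demo_intx_mask_from_write() and the DEMO_* names are
hypothetical; the sketch assumes the standard layout, with the INTx disable
bit at bit 10 of the command register at offset 4, and leaves out the ioctl
that forwards the new state to the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PCI_COMMAND               0x04
#define DEMO_PCI_COMMAND_INTX_DISABLE  0x400  /* bit 10 of the command word */

/*
 * Returns true if the write covers the byte holding the INTx disable bit
 * and stores the new mask state in *masked; returns false otherwise.
 */
bool demo_intx_mask_from_write(uint32_t address, uint32_t val, int len,
                               bool *masked)
{
    if (address == DEMO_PCI_COMMAND + 1) {
        /* Byte write to the high byte: the bit sits at position 2 there. */
        *masked = (val & (DEMO_PCI_COMMAND_INTX_DISABLE >> 8)) != 0;
        return true;
    }
    if (address == DEMO_PCI_COMMAND && len >= 2) {
        /* Word or dword write starting at the command register. */
        *masked = (val & DEMO_PCI_COMMAND_INTX_DISABLE) != 0;
        return true;
    }
    /* Byte write to the low byte: INTx disable is not touched. */
    return false;
}
```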

Signed-off-by: Jan Kiszka jan.kis...@siemens.com
---
 hw/device-assignment.c |   38 +-
 qemu-kvm.c |8 
 qemu-kvm.h |3 +++
 3 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/hw/device-assignment.c b/hw/device-assignment.c
index 26d3bd7..cf75c52 100644
--- a/hw/device-assignment.c
+++ b/hw/device-assignment.c
@@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
     return 0;
 }
 
+static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
+{
+    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
+}
+
 static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
                                           uint32_t val, int len)
 {
     int fd;
     ssize_t ret;
     AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
+    struct kvm_assigned_pci_dev assigned_dev_data;
+#ifdef KVM_CAP_PCI_2_3
+    bool intx_masked, update_intx_mask;
+#endif /* KVM_CAP_PCI_2_3 */
 
     DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
           ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
@@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
     }
 
     if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
+#ifdef KVM_CAP_PCI_2_3
+        update_intx_mask = false;
+        if (address == PCI_COMMAND+1) {
+            intx_masked = val & (PCI_COMMAND_INTX_DISABLE >> 8);
+            update_intx_mask = true;
+        } else if (len >= 2) {
+            intx_masked = val & PCI_COMMAND_INTX_DISABLE;
+            update_intx_mask = true;
+        }
+        if (update_intx_mask) {
+            memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
+            assigned_dev_data.assigned_dev_id  =
+                calc_assigned_dev_id(pci_dev->h_segnr, pci_dev->h_busnr,
+                                     pci_dev->h_devfn);
+            if (intx_masked) {
+                assigned_dev_data.flags = KVM_DEV_ASSIGN_MASK_INTX;
+            }
+            kvm_assign_set_intx_mask(kvm_context, &assigned_dev_data);
+        }
+#endif /* KVM_CAP_PCI_2_3 */
         pci_default_write_config(d, address, val, len);
         /* Continue to program the card */
     }
@@ -876,11 +905,6 @@ static void free_assigned_device(AssignedDevice *dev)
 }
 }
 
-static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
-{
-    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
-}
-
 static void assign_failed_examine(AssignedDevice *dev)
 {
 char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns;
@@ -971,6 +995,10 @@ static int assign_device(AssignedDevice *dev)
 cause host memory corruption if the device issues DMA write 
 requests!\n);
 }
+#ifdef KVM_CAP_PCI_2_3
+    assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
+    dev->emulate_cmd_mask |= PCI_COMMAND_INTX_DISABLE;
+#endif /* KVM_CAP_PCI_2_3 */
 
     r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
     if (r < 0) {
diff --git a/qemu-kvm.c b/qemu-kvm.c
index 471306b..8157b4f 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -740,6 +740,14 @@ int kvm_deassign_pci_device(kvm_context_t kvm,
 }
 #endif
 
+#ifdef KVM_CAP_PCI_2_3
+int kvm_assign_set_intx_mask(kvm_context_t kvm,
+ struct kvm_assigned_pci_dev *assigned_dev)
+{
+return kvm_vm_ioctl(kvm_state, KVM_ASSIGN_SET_INTX_MASK, assigned_dev);
+}
+#endif
+
 int kvm_reinject_control(kvm_context_t kvm, int pit_reinject)
 {
 #ifdef KVM_CAP_REINJECT_CONTROL
diff --git a/qemu-kvm.h b/qemu-kvm.h
index 7e6edfb..522b1b2 100644
--- a/qemu-kvm.h
+++ b/qemu-kvm.h
@@ -602,6 +602,9 @@ int kvm_assign_set_msix_entry(kvm_context_t kvm,
   struct kvm_assigned_msix_entry *entry);
 #endif
 
+int kvm_assign_set_intx_mask(kvm_context_t kvm,
+ struct kvm_assigned_pci_dev *assigned_dev);
+
 #else   /* !CONFIG_KVM */
 
 typedef struct kvm_context *kvm_context_t;
-- 
1.7.1



Re: USB Passthrough 1.1 performance problem...

2010-12-13 Thread Jan Kiszka
Am 12.12.2010 23:31, Erik Brakkee wrote:
 Jan Kiszka wrote:

 Are there some tuning parameters I can use or perhaps even kernel
 configuration parameters on the host to solve this?

 Cheers
Erik

 Host:Motherboard Supermicro X8DTi-F, Intel Xeon L5630, 12MB
   OS: Opensuse 11.3 64 bit

 Guest:   OS: Opensuse 11.3 64 bit

 I can say now that I am giving up on getting this to work. One
 alternative was to use PCI passthrough of the USB hardware, but that
 didn't work for the USB that was on the motherboard. So I bought a USB
 PCI card and tried to use PCI passthrough for that. Unfortunately other
 problems occurred there.

 For one, the problem with 4K alignment. But I could fix that by using
 the pci=resource_alignment=... kernel parameter. In my grub/menu.lst it
 says:

 kernel /vmlinuz-2.6.34.7-0.5-default root=/dev/hsystem/root quiet
 showopts intel_iommu=on
 pci=resource_alignment=01:04.0;01:04.1;01:04.2 noirqdebug vga=0x31a


 The noirqdebug flag was needed to prevent the host from disabling the IRQ
 (it was a shared IRQ).

 Using this, I could configure PCI passthrough and start the VM. Also the
 USB device showed up there. Only it did not work at all.

 Here is a summary of my journey up until now:

 The original approach I wanted to use was to pass my old PCI card (WinTV
 PVR-500) to a VM. This card is a well supported card and has been doing
 fine for me. Because of the PCI passthrough problems with the wintv
 card, I decided to try a USB card instead. This gave me a 'ctrl buffer
 too small' issue that I could solve by taking the source RPM for kvm and
 applying a known patch from red hat (increasing buffer size from 2048 to
 8192). But then I got jerky video, probably due to USB 1.1 issues. To
 bypass these I could use PCI passthrough for USB. But with the PCI
 passthrough of this card I am again running into issues probably related
 to Shared IRQs. So, after all this I am back to square one.

 I have now modified my approach so instead of running a separate minimal
 host with my old server as a guest, I am now running the old server
 (same install) on the new hardware, using it as a host. I would
 definitely be interested in trying this out further in the future. I
 even tried Xen for a brief moment, only to realize that my host and
 guest felt slower (slower startup and execution) and much more difficult
 to handle.

  From the experience of the last two days fulltime trying to get things
 working I can only conclude that the following two features would be
 really important to have:

 * Extended PCI passthrough support
   o shared IRQ support
  
 Addressed by the series I sent out today.

 Does this mean I have a chance now that PCI passthrough of my WinTV
 PVR-500 might work now?
 What version is this and where can I get this for opensuse?

Currently you have to clone my git trees [1, 2], then build and install
those to have the feature. It will take a while before it shows up in
releases, and after that in openSUSE packages.

Jan

[1] git://git.kiszka.org/linux-kvm.git queues/dev-assign
[2] git://git.kiszka.org/qemu-kvm.git queues/dev-assign





Re: USB Passthrough 1.1 performance problem...

2010-12-13 Thread Kenni Lund
2010/12/12 Erik Brakkee e...@brakkee.org:
 Does this mean I have a chance now that PCI passthrough of my WinTV PVR-500
 might work now?

Passthrough of a PVR-500 has been working for a long time. I've been
running with passthrough of a PVR-500 in my HTPC, since
November/December 2009...so it should work with any recent kernel and
any recent version of qemu-kvm you can find today - No patching
needed. The only issue I had with the PVR-500 card, was when *I*
didn't free up the shared interrupts...once I fixed that, it just
worked.

On the other hand, I've never had success with passthrough of USB.
I've spent a bunch of time trying to get various USB cards to work
with passthrough, I even purchased 3 USB cards, just to test USB
passthrough with different brands, interfaces and versions (PCI, PCIe,
USB 2.0, USB 3.0, etc). I gave up on that 5 months ago -
http://article.gmane.org/gmane.comp.emulators.kvm.devel/56719

 What version is this and where can I get this for opensuse?

I can't remember if I started out with the PVR-500 card with 0.11 or
0.12 ...I think it was 0.11...but anyway, you'll hopefully not run
with such an old version today, so any version should work.

Best regards
Kenni


Re: [PATCH 2/5] pci-assign: Fix dword read at PCI_COMMAND

2010-12-13 Thread Alex Williamson
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 If we emulate the command register, we must only read its content from
 the shadow config space. For dword read of both PCI_COMMAND and
 PCI_STATUS, at least the latter must be read from the device.
 
 For simplicity reasons and as the code path is not considered
 performance critical for the affected SRIOV devices, the fix performs
 device access to the command word unconditionally, even if emulation is
 enabled and only that word is read.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/device-assignment.c |   14 +++---
  1 files changed, 11 insertions(+), 3 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index bc3a57b..6ff1456 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -494,14 +494,11 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice 
 *d, uint32_t address,
  /*
   * Catch access to
   *  - vendor  device ID
 - *  - command register (if emulation needed)
   *  - base address registers
   *  - ROM base address  capability pointer
   *  - interrupt line  pin
   */
  if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
 -(pci_dev-need_emulate_cmd 
 - ranges_overlap(address, len, PCI_COMMAND, 2)) ||
  ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
  ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
  ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
 @@ -533,6 +530,17 @@ do_log:
  DEBUG((%x.%x): address=%04x val=0x%08x len=%d\n,
(d-devfn  3)  0x1F, (d-devfn  0x7), address, val, len);
  
 +if (pci_dev-need_emulate_cmd 
 +ranges_overlap(address, len, PCI_COMMAND, 2)) {
 +if (address == PCI_COMMAND) {
 +val = 0x;
 +val |= pci_default_read_config(d, PCI_COMMAND, 2);
 +} else {
 +/* high-byte access */
 +val = pci_default_read_config(d, PCI_COMMAND+1, 1);
 +}
 +}
 +
  if (!pci_dev-cap.available) {
  /* kill the special capabilities */
  if (address == PCI_COMMAND  len == 4) {

We might be able to use the merge_bits function that I just added for
capability support, perhaps something like:

if (pci_dev->need_emulate_cmd) {
    val = merge_bits(val, pci_default_read_config(d, address, len),
                     PCI_COMMAND, 0xffff);
}



Re: [PATCH 1/5] pci-assign: Clean up assigned_dev_pci_read/write_config

2010-12-13 Thread Alex Williamson
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Use ranges_overlap and proper constants to match the access range against
 regions that need special handling. This also fixes yet uncaught
 high-byte write access to the command register. Moreover, use more
 constants instead of magic numbers.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/device-assignment.c |   39 +--
  1 files changed, 29 insertions(+), 10 deletions(-)

A long overdue cleanup, looks good.

Acked-by: Alex Williamson alex.william...@redhat.com

 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 50c6408..bc3a57b 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -438,13 +438,20 @@ static void assigned_dev_pci_write_config(PCIDevice *d, 
 uint32_t address,
  return assigned_device_pci_cap_write_config(d, address, val, len);
  }
  
 -if (address == 0x4) {
 +if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
  pci_default_write_config(d, address, val, len);
  /* Continue to program the card */
  }
  
 -if ((address = 0x10  address = 0x24) || address == 0x30 ||
 -address == 0x34 || address == 0x3c || address == 0x3d) {
 +/*
 + * Catch access to
 + *  - base address registers
 + *  - ROM base address  capability pointer
 + *  - interrupt line  pin
 + */
 +if (ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
 +ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
 +ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
  /* used for update-mappings (BAR emulation) */
  pci_default_write_config(d, address, val, len);
  return;
 @@ -484,9 +491,20 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice 
 *d, uint32_t address,
  return val;
  }
  
 -if (address  0x4 || (pci_dev-need_emulate_cmd  address == 0x4) ||
 - (address = 0x10  address = 0x24) || address == 0x30 ||
 -address == 0x34 || address == 0x3c || address == 0x3d) {
 +/*
 + * Catch access to
 + *  - vendor  device ID
 + *  - command register (if emulation needed)
 + *  - base address registers
 + *  - ROM base address  capability pointer
 + *  - interrupt line  pin
 + */
 +if (ranges_overlap(address, len, PCI_VENDOR_ID, 4) ||
 +(pci_dev-need_emulate_cmd 
 + ranges_overlap(address, len, PCI_COMMAND, 2)) ||
 +ranges_overlap(address, len, PCI_BASE_ADDRESS_0, 24) ||
 +ranges_overlap(address, len, PCI_ROM_ADDRESS, 8) ||
 +ranges_overlap(address, len, PCI_INTERRUPT_LINE, 2)) {
  val = pci_default_read_config(d, address, len);
  DEBUG((%x.%x): address=%04x val=0x%08x len=%d\n,
(d-devfn  3)  0x1F, (d-devfn  0x7), address, val, len);
 @@ -517,10 +535,11 @@ do_log:
  
  if (!pci_dev-cap.available) {
  /* kill the special capabilities */
 -if (address == 4  len == 4)
 -val = ~0x10;
 -else if (address == 6)
 -val = ~0x10;
 +if (address == PCI_COMMAND  len == 4) {
 +val = ~(PCI_STATUS_CAP_LIST  16);
 +} else if (address == PCI_STATUS) {
 +val = ~PCI_STATUS_CAP_LIST;
 +}
  }
  
  return val;





Re: [PATCH 3/5] pci-assign: Remove suspicious hunk from assigned_dev_pci_read_config

2010-12-13 Thread Alex Williamson
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 No one can remember where this came from, and it looks very hacky
 anyway (we return 0 for config space address 0xfc of _every_ assigned
 device, not only vga as the comment claims). So better remove it and
 wait for the underlying issue to reappear.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/device-assignment.c |5 -
  1 files changed, 0 insertions(+), 5 deletions(-)

Yay!

Acked-by: Alex Williamson alex.william...@redhat.com

 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index 6ff1456..ef045f4 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -508,10 +508,6 @@ static uint32_t assigned_dev_pci_read_config(PCIDevice 
 *d, uint32_t address,
  return val;
  }
  
 -/* vga specific, remove later */
 -if (address == 0xFC)
 -goto do_log;
 -
  fd = pci_dev-real_device.config_fd;
  
  again:
 @@ -526,7 +522,6 @@ again:
   exit(1);
  }
  
 -do_log:
  DEBUG((%x.%x): address=%04x val=0x%08x len=%d\n,
(d-devfn  3)  0x1F, (d-devfn  0x7), address, val, len);
  





Re: [PATCH 4/5] pci-assign: Convert need_emulate_cmd into a bitmask

2010-12-13 Thread Alex Williamson
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
 From: Jan Kiszka jan.kis...@siemens.com
 
 Define a mask of PCI command register bits that need to be emulated,
 i.e. read back from their shadow state. We will need this for
 selectively emulating the INTx mask bit.
 
 Note: No initialization of emulate_cmd_mask to zero needed, the device
 state is already zero-initialized.
 
 Signed-off-by: Jan Kiszka jan.kis...@siemens.com
 ---
  hw/device-assignment.c |   18 ++
  hw/device-assignment.h |2 +-
  2 files changed, 11 insertions(+), 9 deletions(-)
 
 diff --git a/hw/device-assignment.c b/hw/device-assignment.c
 index ef045f4..26d3bd7 100644
 --- a/hw/device-assignment.c
 +++ b/hw/device-assignment.c
 @@ -525,14 +525,17 @@ again:
  DEBUG((%x.%x): address=%04x val=0x%08x len=%d\n,
(d-devfn  3)  0x1F, (d-devfn  0x7), address, val, len);
  
 -if (pci_dev-need_emulate_cmd 
 +if (pci_dev-emulate_cmd_mask 
  ranges_overlap(address, len, PCI_COMMAND, 2)) {
  if (address == PCI_COMMAND) {
 -val = 0x;
 -val |= pci_default_read_config(d, PCI_COMMAND, 2);
 +val = ~pci_dev-emulate_cmd_mask;
 +val |= pci_default_read_config(d, PCI_COMMAND, 2) 
 +pci_dev-emulate_cmd_mask;
  } else {
  /* high-byte access */
 -val = pci_default_read_config(d, PCI_COMMAND+1, 1);
 +val = ~(pci_dev-emulate_cmd_mask  8);
 +val |= pci_default_read_config(d, PCI_COMMAND+1, 1) 
 +(pci_dev-emulate_cmd_mask  8);
  }
  }

We should definitely be using merge_bits here, this is the sort of thing
I had in mind for it:

val = merge_bits(val, pci_default_read_config(d, address, len), PCI_COMMAND,
                 pci_dev->emulate_cmd_mask);

> @@ -800,10 +803,9 @@ again:
>
>      /* dealing with virtual function device */
>      snprintf(name, sizeof(name), "%sphysfn/", dir);
> -    if (!stat(name, &statbuf))
> -        pci_dev->need_emulate_cmd = 1;
> -    else
> -        pci_dev->need_emulate_cmd = 0;
> +    if (!stat(name, &statbuf)) {
> +        pci_dev->emulate_cmd_mask = 0xffff;
> +    }
>
>      dev->region_number = r;
>      return 0;
> diff --git a/hw/device-assignment.h b/hw/device-assignment.h
> index c94a730..9ead022 100644
> --- a/hw/device-assignment.h
> +++ b/hw/device-assignment.h
> @@ -109,7 +109,7 @@ typedef struct AssignedDevice {
>      void *msix_table_page;
>      target_phys_addr_t msix_table_addr;
>      int mmio_index;
> -    int need_emulate_cmd;
> +    uint32_t emulate_cmd_mask;
>      char *configfd_name;
>      QLIST_ENTRY(AssignedDevice) next;
>  } AssignedDevice;





KVM call agenda for Dec 14

2010-12-13 Thread Chris Wright
Please send in any agenda items you are interested in covering.

thanks,
-chris


Re: [PATCH 5/5] pci-assign: Use PCI-2.3-based shared legacy interrupts

2010-12-13 Thread Alex Williamson
On Tue, 2010-12-14 at 00:25 +0100, Jan Kiszka wrote:
> From: Jan Kiszka <jan.kis...@siemens.com>
>
> Enable the new KVM feature that allows legacy interrupt sharing for
> PCI-2.3-compliant devices. This requires to synchronize any guest
> change of the INTx mask bit to the kernel.
>
> Signed-off-by: Jan Kiszka <jan.kis...@siemens.com>
> ---
>  hw/device-assignment.c |   38 +++++++++++++++++++++++++++++++++-----
>  qemu-kvm.c             |    8 ++++++++
>  qemu-kvm.h             |    3 +++
>  3 files changed, 44 insertions(+), 5 deletions(-)
>
> diff --git a/hw/device-assignment.c b/hw/device-assignment.c
> index 26d3bd7..cf75c52 100644
> --- a/hw/device-assignment.c
> +++ b/hw/device-assignment.c
> @@ -423,12 +423,21 @@ static uint8_t pci_find_cap_offset(PCIDevice *d, uint8_t cap, uint8_t start)
>      return 0;
>  }
>
> +static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
> +{
> +    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
> +}
> +
>  static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>                                            uint32_t val, int len)
>  {
>      int fd;
>      ssize_t ret;
>      AssignedDevice *pci_dev = container_of(d, AssignedDevice, dev);
> +    struct kvm_assigned_pci_dev assigned_dev_data;
> +#ifdef KVM_CAP_PCI_2_3
> +    bool intx_masked, update_intx_mask;
> +#endif /* KVM_CAP_PCI_2_3 */
>
>      DEBUG("(%x.%x): address=%04x val=0x%08x len=%d\n",
>            ((d->devfn >> 3) & 0x1F), (d->devfn & 0x7),
> @@ -439,6 +448,26 @@ static void assigned_dev_pci_write_config(PCIDevice *d, uint32_t address,
>      }
>
>      if (ranges_overlap(address, len, PCI_COMMAND, 2)) {
> +#ifdef KVM_CAP_PCI_2_3
> +        update_intx_mask = false;
> +        if (address == PCI_COMMAND+1) {
> +            intx_masked = val & (PCI_COMMAND_INTX_DISABLE >> 8);
> +            update_intx_mask = true;
> +        } else if (len >= 2) {
> +            intx_masked = val & PCI_COMMAND_INTX_DISABLE;
> +            update_intx_mask = true;
> +        }

I wonder if this might be a little cleaner as something like this.

    if (ranges_overlap(address, len, PCI_COMMAND + 1, 1)) {
        update_intx_mask = true;
        intx_masked = (len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE;
    }
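
To make the suggestion easier to try out in isolation, here is a minimal standalone sketch. The names intx_mask_from_write and the local ranges_overlap stand-in are hypothetical (the real ranges_overlap comes from qemu); the two register constants use the standard PCI values. The sketch assumes naturally aligned accesses, i.e. a 1-byte write at PCI_COMMAND + 1 or a 2/4-byte write at PCI_COMMAND.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND              0x04   /* offset of the command register */
#define PCI_COMMAND_INTX_DISABLE 0x400  /* bit 10, lives in the high byte */

/* Local stand-in: do byte ranges [a, a+alen) and [b, b+blen) intersect? */
static bool ranges_overlap(uint32_t a, uint32_t alen, uint32_t b, uint32_t blen)
{
    return a < b + blen && b < a + alen;
}

/*
 * Returns true if the write touches the high byte of the command register
 * and then stores the new state of the INTx disable bit in *masked.
 */
static bool intx_mask_from_write(uint32_t address, int len, uint32_t val,
                                 bool *masked)
{
    assert(len == 1 || len == 2 || len == 4);

    if (!ranges_overlap(address, (uint32_t)len, PCI_COMMAND + 1, 1)) {
        return false;
    }
    /* A 1-byte write carries the high byte in bits 0-7, so shift it up. */
    *masked = ((len == 1 ? val << 8 : val) & PCI_COMMAND_INTX_DISABLE) != 0;
    return true;
}
```

A 1-byte write to PCI_COMMAND itself leaves the mask state untouched, which matches the behaviour of the two-branch version in the patch.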

> +        if (update_intx_mask) {
> +            memset(&assigned_dev_data, 0, sizeof(assigned_dev_data));
> +            assigned_dev_data.assigned_dev_id  =
> +                calc_assigned_dev_id(pci_dev->h_segnr, pci_dev->h_busnr,
> +                                     pci_dev->h_devfn);
> +            if (intx_masked) {
> +                assigned_dev_data.flags = KVM_DEV_ASSIGN_MASK_INTX;
> +            }
> +            kvm_assign_set_intx_mask(kvm_context, &assigned_dev_data);
> +        }
> +#endif /* KVM_CAP_PCI_2_3 */
>          pci_default_write_config(d, address, val, len);
>          /* Continue to program the card */
>      }
> @@ -876,11 +905,6 @@ static void free_assigned_device(AssignedDevice *dev)
>      }
>  }
>
> -static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
> -{
> -    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
> -}
> -
>  static void assign_failed_examine(AssignedDevice *dev)
>  {
>      char name[PATH_MAX], dir[PATH_MAX], driver[PATH_MAX] = {}, *ns;
> @@ -971,6 +995,10 @@ static int assign_device(AssignedDevice *dev)
>                  "cause host memory corruption if the device issues DMA write "
>                  "requests!\n");
>      }
> +#ifdef KVM_CAP_PCI_2_3
> +    assigned_dev_data.flags |= KVM_DEV_ASSIGN_PCI_2_3;
> +    dev->emulate_cmd_mask |= PCI_COMMAND_INTX_DISABLE;
> +#endif /* KVM_CAP_PCI_2_3 */
>
>      r = kvm_assign_pci_device(kvm_context, &assigned_dev_data);
>      if (r < 0) {
> diff --git a/qemu-kvm.c b/qemu-kvm.c
> index 471306b..8157b4f 100644
> --- a/qemu-kvm.c
> +++ b/qemu-kvm.c
> @@ -740,6 +740,14 @@ int kvm_deassign_pci_device(kvm_context_t kvm,
>  }
>  #endif
>
> +#ifdef KVM_CAP_PCI_2_3
> +int kvm_assign_set_intx_mask(kvm_context_t kvm,
> +                             struct kvm_assigned_pci_dev *assigned_dev)
> +{
> +    return kvm_vm_ioctl(kvm_state, KVM_ASSIGN_SET_INTX_MASK, assigned_dev);
> +}
> +#endif
> +
>  int kvm_reinject_control(kvm_context_t kvm, int pit_reinject)
>  {
>  #ifdef KVM_CAP_REINJECT_CONTROL
> diff --git a/qemu-kvm.h b/qemu-kvm.h
> index 7e6edfb..522b1b2 100644
> --- a/qemu-kvm.h
> +++ b/qemu-kvm.h
> @@ -602,6 +602,9 @@ int kvm_assign_set_msix_entry(kvm_context_t kvm,
>                                struct kvm_assigned_msix_entry *entry);
>  #endif
>
> +int kvm_assign_set_intx_mask(kvm_context_t kvm,
> +                             struct kvm_assigned_pci_dev *assigned_dev);
> +
>  #else   /* !CONFIG_KVM */
>
>  typedef struct kvm_context *kvm_context_t;
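
As a side note on the device id that the patch moves around: the packing done by calc_assigned_dev_id and the slot/function split used in the DEBUG output can be checked with a small standalone sketch. calc_assigned_dev_id is copied from the patch; devfn_slot, devfn_func and make_devfn are hypothetical helper names added only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pack PCI segment, bus and devfn into one 32-bit id, as in the patch. */
static uint32_t calc_assigned_dev_id(uint16_t seg, uint8_t bus, uint8_t devfn)
{
    return (uint32_t)seg << 16 | (uint32_t)bus << 8 | (uint32_t)devfn;
}

/* devfn holds the 5-bit slot number above the 3-bit function number. */
static uint8_t devfn_slot(uint8_t devfn) { return (devfn >> 3) & 0x1F; }
static uint8_t devfn_func(uint8_t devfn) { return devfn & 0x7; }

/* Hypothetical helper: build a devfn from slot and function. */
static uint8_t make_devfn(uint8_t slot, uint8_t func)
{
    assert(slot < 32 && func < 8);
    return (uint8_t)(slot << 3 | func);
}
```

Since << binds tighter than |, the unparenthesized expression in calc_assigned_dev_id evaluates as intended.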





Re: USB Passthrough 1.1 performance problem...

2010-12-13 Thread Kenni Lund
2010/12/14 Erik Brakkee <e...@brakkee.org>:
> From: Kenni Lund <ke...@kelu.dk>
>>> Does this mean I have a chance now that PCI passthrough of my WinTV
>>> PVR-500 might work now?
>>
>> Passthrough of a PVR-500 has been working for a long time. I've been
>> running with passthrough of a PVR-500 in my HTPC, since
>> November/December 2009...so it should work with any recent kernel and
>> any recent version of qemu-kvm you can find today - No patching
>> needed. The only issue I had with the PVR-500 card, was when *I*
>> didn't free up the shared interrupts...once I fixed that, it just
>> worked.
>
> How did you free up those shared interrupts then? I tried different slots
> but always get conflicts with the USB irqs.

I did an unbind of the conflicting device (eg. disabled it). I moved
the PVR-500 card around in the different slots and once I got a
conflict with the integrated sound card, I left the PVR-500 card in
that slot (it's a headless machine, so no need for sound) and
configured unbind of the sound card at boot time. On my old system I
think it was conflicting with one of the USB controllers as well, but
it didn't really matter, as I only lost a few of the ports on the back
of the computer for that particular USB controller - I still had
plenty of USB ports left and if I really needed more ports, I could
just plug in an extra USB PCI card.

My /etc/rc.local boot script looks like the following today:
--
#Remove HDA conflicting with ivtv1
echo :00:1b.0 > /sys/bus/pci/drivers/HDA\ Intel/unbind

# ivtv0
echo  0016 > /sys/bus/pci/drivers/pci-stub/new_id
echo :04:08.0 > /sys/bus/pci/drivers/ivtv/unbind
echo :04:08.0 > /sys/bus/pci/drivers/pci-stub/bind
echo  0016 > /sys/bus/pci/drivers/pci-stub/remove_id

# ivtv1
echo  0016 > /sys/bus/pci/drivers/pci-stub/new_id
echo :04:09.0 > /sys/bus/pci/drivers/ivtv/unbind
echo :04:09.0 > /sys/bus/pci/drivers/pci-stub/bind
echo  0016 > /sys/bus/pci/drivers/pci-stub/remove_id
--

Best regards
Kenni


Re: [PATCH 1/2] KVM: MMU: don't make direct sp read-only if !map_writable

2010-12-13 Thread Xiao Guangrong
On 12/13/2010 06:32 PM, Avi Kivity wrote:
> On 12/13/2010 12:31 PM, Xiao Guangrong wrote:
>> Currently, if the page is not allowed to write, then it can drop
>> ACC_WRITE_MASK in pte_access, and the direct sp's access is:
>> gw->pt_access & gw->pte_access
>> so, it also removes the write access in the direct sp.
>>
>> There is a problem: if the access of those pages which map through the
>> same mapping in guest is different in host, it causes host switch direct
>> sp very frequently.
>
> I just sent a patch to fix this in a different way, please review it.
>

Your patch looks good to me, please ignore this one :-)

Umm, should we move access &= ~ACC_WRITE_MASK into set_spte(), so that
the same code can be removed from the callers?


Re: [Qemu-devel] [PATCH] RFC: delay pci_update_mappings for 64-bit BARs

2010-12-13 Thread Isaku Yamahata
On Mon, Dec 13, 2010 at 03:43:44PM -0700, Cam Macdonell wrote:
> Do not call pci_update_mappings on the lower 32-bits of a 64-bit bar.  Wait
> for the upper 32 or else Qemu will try to map on just the lower 32 which is
> probably going to corrupt memory.
>
> I was encountering crashes when mapping certain PCI region sizes.  The
> problem turns out that pci_update_mappings is being called without all
> 64-bits in the BAR.  For example when mapping to 0x18000, once the lower
> 32-bits were written the remapping happened (mapping to 0x800) which
> would overwrite something.
>
> I'm not certain if this is completely correct, I'm simply testing the lower
> 4-bits to only be MEM_TYPE_64 flag.  Upper 32-bit address parts can be values
> like 0xff which is tricky to test against.

You're assuming that the guest OS always writes the lower 32 bits first
and then the upper 32 bits. Is that assumption correct?
I found that Linux does, but I don't know about other OSes.
And I couldn't find any statement in the specs about how to update a
(64-bit) BAR. (Please correct me if I missed it.)

Some workaround would be necessary regardless of 32-bit or 64-bit,
because qemu doesn't emulate the bus accurately at the moment.
How about the following?
- If a BAR overlaps with RAM, don't map the BAR.
- If a BAR overlaps with other BARs, record the overlap and,
  when updating one of the BARs, update all the overlapping BARs.
  Which BAR wins depends on the order of updating; that doesn't matter
  because it's an anomalous case.

This way, the 32-bit BAR case is also covered.
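
The first rule above can be sketched in standalone C as follows. Everything here is hypothetical illustration rather than qemu code: struct range, range_overlaps, bar64_address and bar_may_be_mapped are made-up names, and the addresses in the note below are example values chosen for the sketch, not taken from the report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Half-open address range [start, start + size); no wraparound handling. */
struct range {
    uint64_t start;
    uint64_t size;
};

static bool range_overlaps(struct range a, struct range b)
{
    return a.start < b.start + b.size && b.start < a.start + a.size;
}

/* Combine both halves of a 64-bit memory BAR; the low 4 bits are flags. */
static uint64_t bar64_address(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (lo & ~UINT32_C(0xf));
}

/* Policy sketch: refuse to map a BAR that would land on top of guest RAM. */
static bool bar_may_be_mapped(struct range bar, struct range ram)
{
    assert(bar.size != 0);
    return !range_overlaps(bar, ram);
}
```

For example, with 4 GiB of RAM at address 0, a BAR whose low half has been written but whose high half is still 0 decodes to an address inside RAM and is refused; once the high half is written as 1, the BAR sits above RAM and may be mapped, independent of the order in which the guest wrote the two halves.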

thanks,

 
> Cam
> ---
>  hw/pci.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/hw/pci.c b/hw/pci.c
> index 438c0d1..3b81792 100644
> --- a/hw/pci.c
> +++ b/hw/pci.c
> @@ -1000,6 +1000,9 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
>  {
>      int i, was_irq_disabled = pci_irq_disabled(d);
>      uint32_t config_size = pci_config_size(d);
> +    int is_64 = 0;
> +
> +    is_64 = ((val & 0xf) == PCI_BASE_ADDRESS_MEM_TYPE_64);
>
>      for (i = 0; i < l && addr + i < config_size; val >>= 8, ++i) {
>          uint8_t wmask = d->wmask[addr + i];
> @@ -1008,7 +1011,7 @@ void pci_default_write_config(PCIDevice *d, uint32_t addr, uint32_t val, int l)
>          d->config[addr + i] = (d->config[addr + i] & ~wmask) | (val & wmask);
>          d->config[addr + i] &= ~(val & w1cmask); /* W1C: Write 1 to Clear */
>      }
> -    if (ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) ||
> +    if ((ranges_overlap(addr, l, PCI_BASE_ADDRESS_0, 24) && (!is_64)) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS, 4) ||
>          ranges_overlap(addr, l, PCI_ROM_ADDRESS1, 4) ||
>          range_covers_byte(addr, l, PCI_COMMAND))
> --
> 1.7.0.4
>
>

-- 
yamahata


[RFC -v2 PATCH 1/3] kvm: keep track of which task is running a KVM vcpu

2010-12-13 Thread Rik van Riel
Keep track of which task is running a KVM vcpu.  This helps us
figure out later what task to wake up if we want to boost a
vcpu that got preempted.

Unfortunately there are no guarantees that the same task
always keeps the same vcpu, so we can only track the task
across a single run of the vcpu.

Signed-off-by: Rik van Riel <r...@redhat.com>
---
- move vcpu->task manipulation as suggested by Chris Wright

 include/linux/kvm_host.h |    1 +
 virt/kvm/kvm_main.c      |    2 ++
 2 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a055742..180085b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -81,6 +81,7 @@ struct kvm_vcpu {
 #endif
int vcpu_id;
struct mutex mutex;
+   struct task_struct *task;
int   cpu;
atomic_t guest_mode;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5225052..c95bad1 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2248,6 +2248,7 @@ static void kvm_sched_in(struct preempt_notifier *pn, int cpu)
 {
    struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+   vcpu->task = NULL;
    kvm_arch_vcpu_load(vcpu, cpu);
 }
 
@@ -2256,6 +2257,7 @@ static void kvm_sched_out(struct preempt_notifier *pn,
 {
    struct kvm_vcpu *vcpu = preempt_notifier_to_vcpu(pn);
 
+   vcpu->task = current;
    kvm_arch_vcpu_put(vcpu);
 }
 


[RFC -v2 PATCH 2/3] sched: add yield_to function

2010-12-13 Thread Rik van Riel
Add a yield_to function to the scheduler code, allowing us to
give the remainder of our timeslice to another thread.

We may want to use this to provide a sys_yield_to system call
one day.

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>
---
- move to a per sched class yield_to
- fix the locking

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c79e92..408326f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1086,6 +1086,8 @@ struct sched_class {
 #ifdef CONFIG_FAIR_GROUP_SCHED
void (*task_move_group) (struct task_struct *p, int on_rq);
 #endif
+
+   void (*yield_to) (struct rq *rq, struct task_struct *p);
 };
 
 struct load_weight {
@@ -1947,6 +1949,7 @@ extern void set_user_nice(struct task_struct *p, long nice);
 extern int task_prio(const struct task_struct *p);
 extern int task_nice(const struct task_struct *p);
 extern int can_nice(const struct task_struct *p, const int nice);
+extern void requeue_task(struct rq *rq, struct task_struct *p);
 extern int task_curr(const struct task_struct *p);
 extern int idle_cpu(int cpu);
 extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
@@ -2020,6 +2023,10 @@ extern int wake_up_state(struct task_struct *tsk, unsigned int state);
 extern int wake_up_process(struct task_struct *tsk);
 extern void wake_up_new_task(struct task_struct *tsk,
unsigned long clone_flags);
+
+extern u64 slice_remain(struct task_struct *);
+extern void yield_to(struct task_struct *);
+
 #ifdef CONFIG_SMP
  extern void kick_process(struct task_struct *tsk);
 #else
diff --git a/kernel/sched.c b/kernel/sched.c
index dc91a4d..6399641 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5166,6 +5166,46 @@ SYSCALL_DEFINE3(sched_getaffinity, pid_t, pid, unsigned int, len,
return ret;
 }
 
+/*
+ * Yield the CPU, giving the remainder of our time slice to task p.
+ * Typically used to hand CPU time to another thread inside the same
+ * process, eg. when p holds a resource other threads are waiting for.
+ * Giving priority to p may help get that resource released sooner.
+ */
+void yield_to(struct task_struct *p)
+{
+   unsigned long flags;
+   struct rq *rq, *p_rq;
+
+   local_irq_save(flags);
+   rq = this_rq();
+again:
+   p_rq = task_rq(p);
+   double_rq_lock(rq, p_rq);
+   if (p_rq != task_rq(p)) {
+   double_rq_unlock(rq, p_rq);
+   goto again;
+   }
+
+   /* We can't yield to a process that doesn't want to run. */
+   if (!p->se.on_rq)
+           goto out;
+
+   /*
+* We can only yield to a runnable task, in the same schedule class
+* as the current task, if the schedule class implements yield_to_task.
+*/
+   if (!task_running(rq, p) && current->sched_class == p->sched_class &&
+                   current->sched_class->yield_to)
+           current->sched_class->yield_to(rq, p);
+
+out:
+   double_rq_unlock(rq, p_rq);
+   local_irq_restore(flags);
+   yield();
+}
+EXPORT_SYMBOL_GPL(yield_to);
+
 /**
  * sys_sched_yield - yield the current processor to other threads.
  *
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 00ebd76..d8c4116 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -980,6 +980,25 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
  * CFS operations on tasks:
  */
 
+u64 slice_remain(struct task_struct *p)
+{
+   unsigned long flags;
+   struct sched_entity *se = &p->se;
+   struct cfs_rq *cfs_rq;
+   struct rq *rq;
+   u64 slice, ran;
+   s64 delta;
+
+   rq = task_rq_lock(p, &flags);
+   cfs_rq = cfs_rq_of(se);
+   slice = sched_slice(cfs_rq, se);
+   ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
+   delta = slice - ran;
+   task_rq_unlock(rq, &flags);
+
+   return max(delta, 0LL);
+}
+
 #ifdef CONFIG_SCHED_HRTICK
 static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
 {
@@ -1126,6 +1145,20 @@ static void yield_task_fair(struct rq *rq)
    se->vruntime = rightmost->vruntime + 1;
 }
 
+static void yield_to_fair(struct rq *rq, struct task_struct *p)
+{
+   struct sched_entity *se = &p->se;
+   struct cfs_rq *cfs_rq = cfs_rq_of(se);
+   u64 remain = slice_remain(current);
+
+   dequeue_task(rq, p, 0);
+   se->vruntime -= remain;
+   if (se->vruntime < cfs_rq->min_vruntime)
+           se->vruntime = cfs_rq->min_vruntime;
+   enqueue_task(rq, p, 0);
+   check_preempt_curr(rq, p, 0);
+}
+
 #ifdef CONFIG_SMP
 
 static void task_waking_fair(struct rq *rq, struct task_struct *p)
@@ -3962,6 +3995,8 @@ static const struct sched_class fair_sched_class = {
 #ifdef CONFIG_FAIR_GROUP_SCHED
.task_move_group= task_move_group_fair,
 #endif
+
+   .yield_to   = yield_to_fair,
 };
 
 #ifdef CONFIG_SCHED_DEBUG
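
For anyone who wants to play with the arithmetic outside the kernel, the two calculations in slice_remain and yield_to_fair can be sketched in userspace C. This is a simplified model under stated assumptions: slice_remaining and boosted_vruntime are hypothetical names, plain integers replace the scheduler structures, and vruntime wraparound is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Time left in the current slice, clamped at zero (models slice_remain). */
static uint64_t slice_remaining(uint64_t slice, uint64_t sum_exec,
                                uint64_t prev_sum_exec)
{
    assert(sum_exec >= prev_sum_exec);
    int64_t delta = (int64_t)slice - (int64_t)(sum_exec - prev_sum_exec);
    return delta > 0 ? (uint64_t)delta : 0;
}

/*
 * Credit the target with the yielder's remaining time by lowering its
 * vruntime, but never below the queue's min_vruntime (models the clamp
 * in yield_to_fair).
 */
static uint64_t boosted_vruntime(uint64_t vruntime, uint64_t remain,
                                 uint64_t min_vruntime)
{
    if (vruntime <= min_vruntime || remain > vruntime - min_vruntime) {
        return min_vruntime;
    }
    return vruntime - remain;
}
```

The clamp keeps a boosted task from ending up with a smaller vruntime than anything legitimately on the queue, which bounds how much of an advantage a single yield_to can hand over.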

[RFC -v2 PATCH 0/3] directed yield for Pause Loop Exiting

2010-12-13 Thread Rik van Riel
When running SMP virtual machines, it is possible for one VCPU to be
spinning on a spinlock, while the VCPU that holds the spinlock is not
currently running, because the host scheduler preempted it to run
something else.

Both Intel and AMD CPUs have a feature that detects when a virtual
CPU is spinning on a lock and will trap to the host.

The current KVM code sleeps for a bit whenever that happens, which
results in eg. a 64 VCPU Windows guest taking forever and a bit to
boot up.  This is because the VCPU holding the lock is actually
running and not sleeping, so the pause is counter-productive.

In other workloads a pause can also be counter-productive, with
spinlock detection resulting in one guest giving up its CPU time
to the others.  Instead of spinning, it ends up simply not running
much at all.

This patch series aims to fix that, by having a VCPU that spins
give the remainder of its timeslice to another VCPU in the same
guest before yielding the CPU - one that is runnable but got 
preempted, hopefully the lock holder.

v2:
- make lots of cleanups and improvements suggested
- do not implement timeslice scheduling or fairness stuff
  yet, since it is not entirely clear how to do that right
  (suggestions welcome)

-- 
All rights reversed.



[RFC -v2 PATCH 3/3] kvm: use yield_to instead of sleep in kvm_vcpu_on_spin

2010-12-13 Thread Rik van Riel
Instead of sleeping in kvm_vcpu_on_spin, which can cause gigantic
slowdowns of certain workloads, we instead use yield_to to hand
the rest of our timeslice to another vcpu in the same KVM guest.

Signed-off-by: Rik van Riel <r...@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosa...@redhat.com>

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 180085b..af11701 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -92,6 +92,7 @@ struct kvm_vcpu {
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
wait_queue_head_t wq;
+   int spinning;
int sigset_active;
sigset_t sigset;
struct kvm_vcpu_stat stat;
@@ -187,6 +188,7 @@ struct kvm {
 #endif
struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
atomic_t online_vcpus;
+   int last_boosted_vcpu;
struct list_head vm_list;
struct mutex lock;
struct kvm_io_bus *buses[KVM_NR_BUSES];
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index c95bad1..17c6c25 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1289,18 +1289,50 @@ void kvm_resched(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_resched);
 
-void kvm_vcpu_on_spin(struct kvm_vcpu *vcpu)
+void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 {
-   ktime_t expires;
-   DEFINE_WAIT(wait);
+   struct kvm *kvm = me->kvm;
+   struct kvm_vcpu *vcpu;
+   int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+   int yielded = 0;
+   int pass;
+   int i;
 
-   prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);
+   me->spinning = 1;
 
-   /* Sleep for 100 us, and hope lock-holder got scheduled */
-   expires = ktime_add_ns(ktime_get(), 100000UL);
-   schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
+   /*
+    * We boost the priority of a VCPU that is runnable but not
+    * currently running, because it got preempted by something
+    * else and called schedule in __vcpu_run.  Hopefully that
+    * VCPU is holding the lock that we need and will release it.
+    * We approximate round-robin by starting at the last boosted VCPU.
+    */
+   for (pass = 0; pass < 2 && !yielded; pass++) {
+           kvm_for_each_vcpu(i, vcpu, kvm) {
+                   struct task_struct *task = vcpu->task;
+                   if (!pass && i < last_boosted_vcpu) {
+                           i = last_boosted_vcpu;
+                           continue;
+                   } else if (pass && i > last_boosted_vcpu)
+                           break;
+                   if (vcpu == me)
+                           continue;
+                   if (vcpu->spinning)
+                           continue;
+                   if (!task)
+                           continue;
+                   if (waitqueue_active(&vcpu->wq))
+                           continue;
+                   if (task->flags & PF_VCPU)
+                           continue;
+                   kvm->last_boosted_vcpu = i;
+                   yielded = 1;
+                   yield_to(task);
+                   break;
+           }
+   }
 
-   finish_wait(&vcpu->wq, &wait);
+   me->spinning = 0;
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_on_spin);
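
The candidate selection in kvm_vcpu_on_spin can be modelled in userspace to experiment with the round-robin order. This is a simplified sketch, not the kernel code: struct vcpu_state and pick_vcpu_to_boost are hypothetical names, the boolean fields stand in for the checks on vcpu->spinning, vcpu->task, the wait queue and PF_VCPU, and the two passes are folded into one modular scan that starts just after the last boosted VCPU.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the per-VCPU state the patch inspects. */
struct vcpu_state {
    bool spinning;   /* VCPU is itself inside kvm_vcpu_on_spin */
    bool preempted;  /* a task was recorded when it was scheduled out */
    bool blocked;    /* sleeping on its wait queue */
    bool in_guest;   /* task currently runs guest code (PF_VCPU) */
};

/*
 * Scan last_boosted + 1 .. n - 1, then 0 .. last_boosted, and return the
 * index of the first VCPU worth boosting, or -1 if there is none.
 */
static int pick_vcpu_to_boost(const struct vcpu_state *v, int n, int me,
                              int last_boosted)
{
    assert(n > 0 && me >= 0 && me < n);
    assert(last_boosted >= 0 && last_boosted < n);

    for (int step = 1; step <= n; step++) {
        int i = (last_boosted + step) % n;

        if (i == me || v[i].spinning || !v[i].preempted ||
            v[i].blocked || v[i].in_guest) {
            continue;
        }
        return i;
    }
    return -1;
}
```

A caller would store the returned index as the new last boosted VCPU before yielding, so that repeated spins rotate through the eligible VCPUs instead of always boosting the same one.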
 


Re: [GIT PULL net-next-2.6] vhost-net: tools, cleanups, optimizations

2010-12-13 Thread Rusty Russell
On Tue, 14 Dec 2010 03:54:47 am Michael S. Tsirkin wrote:
> On Mon, Dec 13, 2010 at 12:44:13PM +0200, Michael S. Tsirkin wrote:
> > Please merge the following tree for 2.6.38.
> > Thanks!
>
> Um, I sent this out before I noticed the mail from Rusty
> with some questions on the test code. I missed that and
> assumed no comments - no issues, perhaps wrongly.
>
> Rusty - I tried answering the questions there - any issues
> with merging this? It's just a test so won't be hard to remove
> later if it's not helpful ...

Traditionally this stuff has not gone in tree.  However, I think it
should be...

Rusty.


SMBIOS support in Qemu?

2010-12-13 Thread Anjali Kulkarni

Hi,

Which version of QEMU contains the SMBIOS code? If I have to get the code into
my repo, is there any place I can get the complete set of patches?

Thanks
Anjali